<a href="https://colab.research.google.com/github/muhammetsnts/SPARK/blob/main/projects/6.Prediction_of_Chemical_Causing_Spoil_with_RF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Info

We have a dataset from a dog food company about its products. Some batches of the dog food are spoiling much quicker than intended. For each batch, the company first mixes a preservative blend that contains 4 different preservative chemicals (A, B, C, D) and then tops it up with a "filler" chemical. The food scientists believe one of the preservatives A, B, C, or D is causing the problem, but they don't know how to figure out which one!

* Pres_A : Percentage of preservative A in the mix
* Pres_B : Percentage of preservative B in the mix
* Pres_C : Percentage of preservative C in the mix
* Pres_D : Percentage of preservative D in the mix
* Spoiled : Label indicating whether or not the dog food batch was spoiled.

In the CSV file itself, the four preservative columns are named A, B, C, and D.


# Setup Environment

In [1]:
# install Java8
!apt-get -q install openjdk-8-jdk-headless -qq > /dev/null

# download spark3.1.1
!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz

# extract the archive
!tar xf spark-3.1.1-bin-hadoop2.7.tgz

# install findspark 
!pip install -q findspark


import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"


import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
#spark = SparkSession.builder.appName('ops').getOrCreate()

# Download and Read the Data

In [2]:
!wget -q https://raw.githubusercontent.com/muhammetsnts/SPARK/main/data/dog_food.csv

In [3]:
data = spark.read.csv("dog_food.csv", header=True, inferSchema=True)

In [4]:
data.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



In [5]:
data.show()

+---+---+----+---+-------+
|  A|  B|   C|  D|Spoiled|
+---+---+----+---+-------+
|  4|  2|12.0|  3|    1.0|
|  5|  6|12.0|  7|    1.0|
|  6|  2|13.0|  6|    1.0|
|  4|  2|12.0|  1|    1.0|
|  4|  2|12.0|  3|    1.0|
| 10|  3|13.0|  9|    1.0|
|  8|  5|14.0|  5|    1.0|
|  5|  8|12.0|  8|    1.0|
|  6|  5|12.0|  9|    1.0|
|  3|  3|12.0|  1|    1.0|
|  9|  8|11.0|  3|    1.0|
|  1| 10|12.0|  3|    1.0|
|  1|  5|13.0| 10|    1.0|
|  2| 10|12.0|  6|    1.0|
|  1| 10|11.0|  4|    1.0|
|  5|  3|12.0|  2|    1.0|
|  4|  9|11.0|  8|    1.0|
|  5|  1|11.0|  1|    1.0|
|  4|  9|12.0| 10|    1.0|
|  5|  8|10.0|  9|    1.0|
+---+---+----+---+-------+
only showing top 20 rows



In [6]:
# Import VectorAssembler (used below to build the feature vector) and Vectors (imported but not used in this notebook)

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [7]:
data.columns

['A', 'B', 'C', 'D', 'Spoiled']

In [21]:
assembler = VectorAssembler(inputCols=[ 'A', 'B', 'C', 'D'], 
                            outputCol='features')

In [22]:
output = assembler.transform(data)

In [23]:
output.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)
 |-- features: vector (nullable = true)



In [24]:
final_data = output.select(['features', 'Spoiled'])

# Modelling

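We will create a random forest (RF) model, fit it on all of the data, then check the feature importances to decide which chemical is causing the problem. There is no train/test split here because the goal is to inspect the importances, not to make predictions on new batches.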
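
To give some intuition for why feature importances can point at the problem chemical, here is a minimal sketch of the impurity decrease that a single tree split produces. It uses Gini impurity, which is the default impurity for Spark's RandomForestClassifier. Spark aggregates gains like this over all splits and trees and then normalizes them, so this is only an illustration of one split, not Spark's implementation. The helper names (`gini`, `impurity_decrease`) and the toy values are made up for this example and are not taken from dog_food.csv.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of spoiled batches
    return 2.0 * p * (1.0 - p)


def impurity_decrease(values, labels, threshold):
    """Gini impurity before the split minus the weighted impurity after it."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# Hypothetical toy batches (NOT rows from dog_food.csv).
toy_c = [12.0, 13.0, 11.0, 12.0, 7.0, 8.0, 6.0, 9.0]
toy_a = [4, 9, 6, 2, 5, 6, 4, 8]
toy_spoiled = [1, 1, 1, 1, 0, 0, 0, 0]

# A split on C separates the two classes; a split on A does not.
gain_c = impurity_decrease(toy_c, toy_spoiled, threshold=10.0)
gain_a = impurity_decrease(toy_a, toy_spoiled, threshold=5)
print("gain from splitting on C:", gain_c)
print("gain from splitting on A:", gain_a)
```

In this toy setup the split on C removes all of the impurity while the split on A removes none, which is the kind of pattern that leads to one feature dominating the importance vector.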

In [14]:
from pyspark.ml.classification import RandomForestClassifier

In [25]:
# 120 trees of depth at most 7; featuresCol is left at its default, 'features'
rfc = RandomForestClassifier(numTrees=120, labelCol='Spoiled', maxDepth=7)

In [26]:
rfc_model = rfc.fit(final_data)

## Feature Importances


In [27]:
rfc_model.featureImportances

SparseVector(4, {0: 0.0288, 1: 0.0286, 2: 0.9104, 3: 0.0322})

We needed to know which feature contributes the most to spoilage, and now we have found it! The indices in this vector refer to the following columns:

- 0: A
- 1: B
- 2: C
- 3: D

So chemical C is the culprit :) Its importance is about 0.91, while A, B, and D are each around 0.03.
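
The index-to-column mapping can also be done in code instead of by eye. Below is a small sketch, assuming the importances are available as a plain sequence of floats. The helper `rank_importances` is a hypothetical name introduced for this example, and the numbers are copied from the SparseVector printed above so the snippet runs without Spark.

```python
def rank_importances(feature_names, importances):
    """Pair each feature name with its importance, largest first."""
    if len(feature_names) != len(importances):
        raise ValueError("need exactly one importance per feature name")
    pairs = zip(feature_names, (float(v) for v in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Values copied from the featureImportances output above.
ranked = rank_importances(['A', 'B', 'C', 'D'],
                          [0.0288, 0.0286, 0.9104, 0.0322])

for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

Inside the notebook, the same helper could be called as `rank_importances(assembler.getInputCols(), rfc_model.featureImportances.toArray())`, so the names always follow the order given to the VectorAssembler.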