### Using PySpark and a RandomForestClassifier's Feature Importances to Understand Spoilage in Dog Food

By: Matt Purvis

The scenario: Purina Dog Food wants to investigate four preservatives (A, B, C, D) that may be causing its dog food to spoil. This project trains a random forest classifier and uses its feature importances to determine which preservative(s) are most responsible for the spoiled batches of food.

In [0]:
# Import SparkSession and create the session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('tree').getOrCreate()

In [0]:
# Import the data using SparkSQL
data = spark.sql('select * from dog_food_csv')

In [0]:
# Look at the schema to get column names and types
data.printSchema()

In [0]:
# Preview the data
data.show()

In [0]:
# Import VectorAssembler to combine the feature columns into the single vector column Spark ML expects
from pyspark.ml.feature import VectorAssembler

In [0]:
# Look at column names to use for the vector assembler
data.columns

In [0]:
# Create the vector assembler object, using the 4 preservatives as the features
assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'], outputCol='features')

In [0]:
# Transform the data using the assembler object
output = assembler.transform(data)
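
To make the transform step concrete, here is a minimal plain-Python sketch of what the assembler does conceptually: it keeps every original column and adds one `features` entry holding the chosen columns in order. The rows and the `assemble` helper are hypothetical and for illustration only; they are not Spark's implementation or values from the real table.

```python
# Hypothetical rows for illustration; the real values come from dog_food_csv.
rows = [
    {'A': 4, 'B': 2, 'C': 12.0, 'D': 3, 'Spoiled': 1.0},
    {'A': 5, 'B': 6, 'C': 8.0, 'D': 1, 'Spoiled': 0.0},
]
input_cols = ['A', 'B', 'C', 'D']


def assemble(row, cols):
    """Copy the row and add a 'features' list built from cols, in order."""
    assembled = dict(row)
    assembled['features'] = [float(row[c]) for c in cols]
    return assembled


assembled_rows = [assemble(row, input_cols) for row in rows]
for row in assembled_rows:
    print(row['features'], row['Spoiled'])
```

The position of each value in `features` follows the order of `inputCols`, which matters later when reading the feature importances.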

In [0]:
# Show the resulting transformed dataframe
output.show()

In [0]:
# Get only the features and label columns
final_data = output.select('features', 'Spoiled')

In [0]:
# Preview the data
final_data.show()

In [0]:
# Import RandomForestClassifier and create the estimator that will be trained on the data
from pyspark.ml.classification import RandomForestClassifier
rfc = RandomForestClassifier(numTrees=150, labelCol='Spoiled', featuresCol='features')
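
Since the rest of the notebook relies on feature importances, it may help to see what a tree rewards when it picks a split. The sketch below is plain Python, not Spark code: it assumes binary 0/1 labels and Gini impurity (the default `impurity` setting for `RandomForestClassifier`), and the helper names and toy labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Toy labels: three good batches (0) and three spoiled batches (1).
parent = [0, 0, 0, 1, 1, 1]

# A split that separates the classes perfectly versus one that barely helps.
perfect = impurity_decrease(parent, [0, 0, 0], [1, 1, 1])
weak = impurity_decrease(parent, [0, 1, 0], [1, 0, 1])
print(perfect, weak)
```

A feature that repeatedly produces splits like the first one accumulates a large share of the total impurity decrease across the forest, and that share is what the importance scores summarize.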

#### Fit the model on all the data

We do not need a train/test split because we do not need predictions in this scenario and will not deploy the model. We are only interested in which preservative(s) are most responsible for the spoiled food.

In [0]:
# Fit the model
rfc_model = rfc.fit(final_data)

In [0]:
# Get the feature importances of the preservatives (ordered A, B, C, D to match inputCols; the scores sum to 1)
rfc_model.featureImportances
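
The importance vector is indexed by position, so pairing each score with its column name makes it easier to read. The helper below is a minimal sketch: `rank_importances` is a hypothetical name, and the example scores are made up for illustration rather than taken from this dataset. With the fitted model, `rfc_model.featureImportances.toArray()` could be passed in place of the example list.

```python
def rank_importances(feature_names, importances):
    """Pair each feature name with its importance score, highest first."""
    pairs = zip(feature_names, (float(v) for v in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for illustration only, not results from this dataset.
# With the fitted model, use rfc_model.featureImportances.toArray() instead.
example_scores = [0.02, 0.03, 0.90, 0.05]

ranked = rank_importances(['A', 'B', 'C', 'D'], example_scores)
for name, score in ranked:
    print(f'{name}: {score:.2f}')
```

The names must be listed in the same order as the assembler's `inputCols`, otherwise the scores will be attributed to the wrong preservatives.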

Preservative C is the most important predictor of spoiled dog food. We recommend reducing it or removing it from the formula entirely to prevent future spoilage.