#### Using PySpark to Train Tree Methods (Decision Trees, Random Forest, Gradient Boosted Trees) to Predict Whether a College Is Private or Public

By: Matt Purvis

In [0]:
# Import SparkSession and create session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('tree').getOrCreate()

In [0]:
# Import the data
data = spark.sql('select * from College_csv')

In [0]:
# Preview the data
data.show()

In [0]:
# Display column names and dtypes
data.printSchema()

In [0]:
# Import VectorAssembler in order to transform data into proper format
from pyspark.ml.feature import VectorAssembler

In [0]:
# View columns in order to choose features for assembler object
data.columns

In [0]:
# Create the assembler object that will be used to transform the data
assembler = VectorAssembler(inputCols=['Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate'],outputCol='features')

In [0]:
# Transform the data and store in output variable
output = assembler.transform(data)

In [0]:
# Import StringIndexer to encode the Private column as a numeric label (0.0 or 1.0)
from pyspark.ml.feature import StringIndexer

In [0]:
# Create indexer object that will be used to code Private column
indexer = StringIndexer(inputCol='Private', outputCol = 'PrivateIndex')

In [0]:
# Code the Private column and store in output_fixed variable
output_fixed = indexer.fit(output).transform(output)

In [0]:
# Display the resulting df
output_fixed.show()
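
With its default settings, StringIndexer orders labels by descending frequency, so the most common value of Private receives index 0.0 and the less common one receives 1.0. Which value that is depends on the data. The following is a minimal plain-Python sketch of that ordering rule on a made-up toy column; the function name `frequency_desc_index` and the toy values are assumptions for illustration, not part of PySpark or the College data.

```python
from collections import Counter


def frequency_desc_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical toy column, not the real College data.
toy_private = ['Yes', 'No', 'Yes', 'Yes', 'No']
mapping = frequency_desc_index(toy_private)
print(mapping)
```

In the notebook itself, storing the fitted model returned by `indexer.fit(output)` in a variable and reading its `labels` attribute shows the labels in index order, which tells you what a PrivateIndex of 1.0 actually means.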

In [0]:
# Get only the features and the label column and store in new variable
final_data = output_fixed.select('features', 'PrivateIndex')

In [0]:
# Create train and test split (the seed value is an arbitrary choice that makes the split reproducible)
train_data, test_data = final_data.randomSplit([0.7, 0.3], seed=42)

In [0]:
# Import the classifiers we will use
from pyspark.ml.classification import DecisionTreeClassifier, GBTClassifier, RandomForestClassifier

In [0]:
# Create the decision tree, random forest and gradient boosted tree classifiers that will be fit to the training data
dtc = DecisionTreeClassifier(labelCol = 'PrivateIndex', featuresCol = 'features')
rfc = RandomForestClassifier(numTrees = 150, labelCol = 'PrivateIndex', featuresCol = 'features')
gbt = GBTClassifier(labelCol = 'PrivateIndex', featuresCol = 'features')

In [0]:
# Train all three models on the training data
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)

In [0]:
# Get the predictions on the test data
dtc_preds = dtc_model.transform(test_data)
rfc_preds = rfc_model.transform(test_data)
gbt_preds = gbt_model.transform(test_data)

In [0]:
dtc_preds.show()

In [0]:
rfc_preds.show()

In [0]:
gbt_preds.show()

In [0]:
# Import evaluator and create evaluator object (its default metric is area under the ROC curve)
from pyspark.ml.evaluation import BinaryClassificationEvaluator
my_binary_eval = BinaryClassificationEvaluator(labelCol='PrivateIndex')

In [0]:
# Print ROC/AUC for Decision tree
print('DTC')
print(my_binary_eval.evaluate(dtc_preds))

In [0]:
# Print ROC/AUC for Random Forest
print('RFC')
print(my_binary_eval.evaluate(rfc_preds))

In [0]:
# Print ROC/AUC for Gradient Boosted Trees
print('GBT')
print(my_binary_eval.evaluate(gbt_preds))
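
One way to read the area under the ROC curve printed above: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes that quantity directly in plain Python on made-up labels and scores. It illustrates the interpretation only and is not how Spark computes the metric internally; the function name `roc_auc` and the toy data are assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError('need at least one positive and one negative example')
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 3 positives and 3 negatives.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(toy_labels, toy_scores)
# 8 of the 9 positive/negative pairs are ordered correctly.
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric is useful for comparing the three classifiers independently of any single decision threshold.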

Measured by area under the ROC curve, the Random Forest Classifier performed best, followed by the Gradient Boosted Trees Classifier. The Decision Tree was the worst model, which is plausible since it builds only a single tree while the other two combine many trees. Tuning each model's hyperparameters would probably improve performance further, and a grid search would be a natural next step.
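
A grid search trains and scores one model for every combination of candidate parameter values. The sketch below only enumerates such a grid in plain Python to show how quickly the number of candidates grows. The values listed for `numTrees` and `maxDepth` are assumptions chosen for illustration, not tuned recommendations, and `expand_grid` is a hypothetical helper. In PySpark, `ParamGridBuilder` and `CrossValidator` from `pyspark.ml.tuning` fill this role.

```python
from itertools import product

# Hypothetical candidate values; the ranges are assumptions, not recommendations.
param_grid = {
    'numTrees': [50, 150, 300],
    'maxDepth': [3, 5, 10],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


candidates = expand_grid(param_grid)
print(len(candidates))          # number of models a full grid search would train per fold
for candidate in candidates[:3]:
    print(candidate)
```

With cross-validation, each of these candidates is trained once per fold, so the total cost is the grid size multiplied by the number of folds.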

#### Looking at Accuracy

In [0]:
# Import multiclass evaluator and create the object, configured to report accuracy
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
acc_eval = MulticlassClassificationEvaluator(labelCol = 'PrivateIndex', metricName = 'accuracy')

In [0]:
# Compute accuracy on the predictions from each model
dtc_acc = acc_eval.evaluate(dtc_preds)
rfc_acc = acc_eval.evaluate(rfc_preds)
gbt_acc = acc_eval.evaluate(gbt_preds)

In [0]:
print('DTC')
dtc_acc

In [0]:
print('RFC')
rfc_acc

In [0]:
print('GBT')
gbt_acc

The random forest had the highest accuracy, followed by the decision tree and then the GBT classifier. Accuracy should not be read in a vacuum: on an unbalanced dataset, a model that always predicts the majority class can score high accuracy while missing the minority class entirely. Precision, recall, and F1 score should be evaluated alongside accuracy to get a full picture.
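
To make the point about unbalanced data concrete, here is a minimal plain-Python sketch that computes accuracy, precision, recall, and F1 for a made-up case: nine negatives, one positive, and a model that always predicts the majority class. The function `binary_metrics` and the toy data are assumptions for illustration and are not taken from the College results.

```python
def binary_metrics(labels, predictions, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# Hypothetical unbalanced data: 9 negatives, 1 positive.
toy_labels = [0] * 9 + [1]
always_majority = [0] * 10   # a "model" that never predicts the positive class
metrics = binary_metrics(toy_labels, always_majority)
print(metrics)
```

Accuracy comes out high here even though recall is zero. In PySpark, `MulticlassClassificationEvaluator` accepts other `metricName` values such as `'f1'`, `'weightedPrecision'`, and `'weightedRecall'`, so the same predictions can be scored on those metrics as well.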