<a href="https://colab.research.google.com/github/muhammetsnts/SPARK/blob/main/projects/5.Collage_Prediction_with_Tree_Methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Info
We have a dataset of college features and will try to predict whether each college is public or private.

We will build three tree-based models and compare their results:
* A single decision tree
* A random forest
* A gradient boosted tree classifier
    
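The dataset contains the columns listed below. Private is the label we want to predict and the remaining columns are the features. In the CSV file loaded later, the dots in these names are replaced by underscores (for example, F_Undergrad), and there is an extra School column holding the college name.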

    Private A factor with levels No and Yes indicating whether the university is private
    Apps Number of applications received
    Accept Number of applications accepted
    Enroll Number of new students enrolled
    Top10perc Pct. new students from top 10% of H.S. class
    Top25perc Pct. new students from top 25% of H.S. class
    F.Undergrad Number of fulltime undergraduates
    P.Undergrad Number of parttime undergraduates
    Outstate Out-of-state tuition
    Room.Board Room and board costs
    Books Estimated book costs
    Personal Estimated personal spending
    PhD Pct. of faculty with Ph.D.’s
    Terminal Pct. of faculty with terminal degree
    S.F.Ratio Student/faculty ratio
    perc.alumni Pct. alumni who donate
    Expend Instructional expenditure per student
    Grad.Rate Graduation rate

# Setup Environment

In [2]:
# install Java8
!apt-get -q install openjdk-8-jdk-headless -qq > /dev/null

# download spark3.1.1
!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz

# unzip it
!tar xf spark-3.1.1-bin-hadoop2.7.tgz

# install findspark 
!pip install -q findspark


import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"


import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
#spark = SparkSession.builder.appName('ops').getOrCreate()

# Download and Read the Data

In [3]:
!wget -q https://raw.githubusercontent.com/muhammetsnts/SPARK/main/data/College.csv

In [4]:
data = spark.read.csv("College.csv", header=True, inferSchema=True)

In [6]:
data.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



In [5]:
data.show()

+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|              School|Private|Apps|Accept|Enroll|Top10perc|Top25perc|F_Undergrad|P_Undergrad|Outstate|Room_Board|Books|Personal|PhD|Terminal|S_F_Ratio|perc_alumni|Expend|Grad_Rate|
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|Abilene Christian...|    Yes|1660|  1232|   721|       23|       52|       2885|        537|    7440|      3300|  450|    2200| 70|      78|     18.1|         12|  7041|       60|
|  Adelphi University|    Yes|2186|  1924|   512|       16|       29|       2683|       1227|   12280|      6450|  750|    1500| 29|      30|     12.2|         16| 10527|       56|
|      Adrian College|    Yes|1428|  1097|   336|       22|       50|       1036|         99|  

In [7]:
data.head(1)

[Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60)]

In [8]:
# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [9]:
data.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate']

In [10]:
assembler = VectorAssembler(inputCols=[ 'Apps',
                                        'Accept',
                                        'Enroll',
                                        'Top10perc',
                                        'Top25perc',
                                        'F_Undergrad',
                                        'P_Undergrad',
                                        'Outstate',
                                        'Room_Board',
                                        'Books',
                                        'Personal',
                                        'PhD',
                                        'Terminal',
                                        'S_F_Ratio',
                                        'perc_alumni',
                                        'Expend',
                                        'Grad_Rate'], 
                            outputCol='features')
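The input column list above can also be derived from the column names instead of being typed out by hand. The following is a minimal sketch in plain Python, with the names copied from the data.columns output; the variable names all_columns, non_feature_cols and feature_cols are illustrative and not part of the original notebook.

```python
# Column names as printed by data.columns above.
all_columns = ['School', 'Private', 'Apps', 'Accept', 'Enroll', 'Top10perc',
               'Top25perc', 'F_Undergrad', 'P_Undergrad', 'Outstate',
               'Room_Board', 'Books', 'Personal', 'PhD', 'Terminal',
               'S_F_Ratio', 'perc_alumni', 'Expend', 'Grad_Rate']

# 'School' is an identifier and 'Private' is the label, so neither is a feature.
non_feature_cols = {'School', 'Private'}

# Keep the original column order so vector positions stay predictable.
feature_cols = [c for c in all_columns if c not in non_feature_cols]

print(len(feature_cols))
print(feature_cols[:3])
```

In the notebook, all_columns would be data.columns, and feature_cols could be passed as inputCols to VectorAssembler.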

In [11]:
output = assembler.transform(data)

In [12]:
from pyspark.ml.feature import StringIndexer

In [13]:
indexer = StringIndexer(inputCol='Private', outputCol='PrivateIndex')

In [14]:
output_fixed = indexer.fit(output).transform(output)
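By default, StringIndexer orders labels by descending frequency, so the most common value of Private is mapped to 0.0. The sketch below imitates that ordering in plain Python on a made-up sample. It is not Spark's implementation: the sample values and the helper name index_labels are hypothetical, and the alphabetical tie-break is an assumption about the default behaviour. In the notebook, the actual mapping can be read from the labels attribute of the fitted StringIndexerModel.

```python
from collections import Counter


def index_labels(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample, not taken from College.csv.
sample = ['Yes', 'Yes', 'No', 'Yes', 'No']
mapping = index_labels(sample)
print(mapping)  # {'Yes': 0.0, 'No': 1.0}
```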

In [15]:
output_fixed.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- PrivateIndex: double (nullable = false)



In [16]:
final_data = output_fixed.select(['features', 'PrivateIndex'])

# Train-Test Split

In [17]:
train_data, test_data = final_data.randomSplit([0.7,0.3])
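Without a seed, randomSplit draws a different split on every run, so the metrics reported further down will vary between runs. A minimal sketch of a seeded split follows. The helper name split_with_seed and the seed value 42 are illustrative choices, not part of the original notebook, and the outputs shown in this notebook were not produced with this seed.

```python
def split_with_seed(df, train_fraction=0.7, seed=42):
    """Return (train, test) from a Spark DataFrame using a fixed seed.

    `df` is expected to provide DataFrame.randomSplit(weights, seed).
    """
    weights = [train_fraction, 1.0 - train_fraction]
    train, test = df.randomSplit(weights, seed=seed)
    return train, test
```

In the notebook this would be called as train_data, test_data = split_with_seed(final_data).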

# The Classifiers

In [18]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier, DecisionTreeClassifier

In [19]:
# All three classifiers use their default hyperparameters here; options such as maxDepth or numTrees can be tuned

dtc = DecisionTreeClassifier(labelCol='PrivateIndex', featuresCol='features') 
rfc = RandomForestClassifier(labelCol='PrivateIndex', featuresCol='features')
gbt = GBTClassifier(labelCol='PrivateIndex', featuresCol='features')

In [20]:
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)

# Model Comparison

In [21]:
dtc_preds = dtc_model.transform(test_data)
rfc_preds = rfc_model.transform(test_data)
gbt_preds = gbt_model.transform(test_data)

Let's show the predictions.

In [22]:
dtc_preds.show()

+--------------------+------------+-------------+--------------------+----------+
|            features|PrivateIndex|rawPrediction|         probability|prediction|
+--------------------+------------+-------------+--------------------+----------+
|[150.0,130.0,88.0...|         0.0|  [278.0,1.0]|[0.99641577060931...|       0.0|
|[167.0,130.0,46.0...|         0.0|  [278.0,1.0]|[0.99641577060931...|       0.0|
|[174.0,146.0,88.0...|         0.0|   [14.0,0.0]|           [1.0,0.0]|       0.0|
|[191.0,165.0,63.0...|         0.0|  [278.0,1.0]|[0.99641577060931...|       0.0|
|[202.0,184.0,122....|         0.0|  [278.0,1.0]|[0.99641577060931...|       0.0|
|[222.0,185.0,91.0...|         0.0|  [278.0,1.0]|[0.99641577060931...|       0.0|
|[232.0,216.0,106....|         0.0|   [16.0,0.0]|           [1.0,0.0]|       0.0|
|[233.0,233.0,153....|         1.0|   [16.0,0.0]|           [1.0,0.0]|       0.0|
|[313.0,228.0,137....|         0.0|  [278.0,1.0]|[0.99641577060931...|       0.0|
|[321.0,318.0,17

In [23]:
rfc_preds.show()

+--------------------+------------+--------------------+--------------------+----------+
|            features|PrivateIndex|       rawPrediction|         probability|prediction|
+--------------------+------------+--------------------+--------------------+----------+
|[150.0,130.0,88.0...|         0.0|[19.8608961603520...|[0.99304480801760...|       0.0|
|[167.0,130.0,46.0...|         0.0|[19.8608961603520...|[0.99304480801760...|       0.0|
|[174.0,146.0,88.0...|         0.0|[18.4376683537422...|[0.92188341768711...|       0.0|
|[191.0,165.0,63.0...|         0.0|[18.3727517561357...|[0.91863758780678...|       0.0|
|[202.0,184.0,122....|         0.0|[19.7827711603520...|[0.98913855801760...|       0.0|
|[222.0,185.0,91.0...|         0.0|[19.8608961603520...|[0.99304480801760...|       0.0|
|[232.0,216.0,106....|         0.0|[17.2104056840445...|[0.86052028420222...|       0.0|
|[233.0,233.0,153....|         1.0|[15.2854998521147...|[0.76427499260573...|       0.0|
|[313.0,228.0,137....

In [24]:
gbt_preds.show()

+--------------------+------------+--------------------+--------------------+----------+
|            features|PrivateIndex|       rawPrediction|         probability|prediction|
+--------------------+------------+--------------------+--------------------+----------+
|[150.0,130.0,88.0...|         0.0|[1.53750860011155...|[0.95585038530407...|       0.0|
|[167.0,130.0,46.0...|         0.0|[1.53957909345878...|[0.95602480720723...|       0.0|
|[174.0,146.0,88.0...|         0.0|[1.46520215450664...|[0.94932912721439...|       0.0|
|[191.0,165.0,63.0...|         0.0|[1.50399412639891...|[0.95293370750518...|       0.0|
|[202.0,184.0,122....|         0.0|[1.53750860011155...|[0.95585038530407...|       0.0|
|[222.0,185.0,91.0...|         0.0|[1.53957909345878...|[0.95602480720723...|       0.0|
|[232.0,216.0,106....|         0.0|[1.36278594159611...|[0.93851882495134...|       0.0|
|[233.0,233.0,153....|         1.0|[0.60127513976984...|[0.76897815333182...|       0.0|
|[313.0,228.0,137....
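The rawPrediction and probability columns mean different things for each model. The sketch below reproduces the first probability value of the first row of each table from its rawPrediction value. The relationships are inferred from the printed numbers together with Spark's usual behaviour: leaf class counts for the decision tree, a sum of per-tree probabilities for the random forest (assuming the default of 20 trees), and a logistic transform of twice the margin for GBT. Treat it as an illustration rather than a specification.

```python
import math

# Decision tree: rawPrediction holds class counts at the leaf, [278.0, 1.0].
dtc_raw = [278.0, 1.0]
dtc_prob = [count / sum(dtc_raw) for count in dtc_raw]

# Random forest: rawPrediction sums each tree's class probabilities, so
# dividing by the number of trees (assumed default numTrees=20) normalizes it.
num_trees = 20
rfc_raw_first = 19.8608961603520
rfc_prob_first = rfc_raw_first / num_trees

# GBT: the probability is a logistic function of twice the raw margin.
gbt_raw_first = 1.53750860011155
gbt_prob_first = 1.0 / (1.0 + math.exp(-2.0 * gbt_raw_first))

print(dtc_prob[0])
print(rfc_prob_first)
print(gbt_prob_first)
```

This is why the raw values are on such different scales across the three tables even though all three models predict 0.0 for these rows.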

# Evaluation

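This is a binary classification task, so we can evaluate the models with the area under the ROC curve using BinaryClassificationEvaluator. Metrics such as accuracy, precision and recall are provided by MulticlassClassificationEvaluator, which also works for binary problems.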

## BinaryClassificationEvaluator

In [25]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [26]:
my_binary_eval = BinaryClassificationEvaluator(labelCol='PrivateIndex')

### ROC Results (AUC)

In [29]:
print('DTC: ', my_binary_eval.evaluate(dtc_preds))  

DTC:  0.9003429878048781


In [30]:
print('RFC :', my_binary_eval.evaluate(rfc_preds))

RFC : 0.9950457317073171


In [32]:
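# Note: this evaluator scores the hard 0/1 values in the 'prediction' column rather than
# the rawPrediction vector used for DTC and RFC above, so the GBT result below is not
# directly comparable to the other two AUC values.
my_binary_eval2 = BinaryClassificationEvaluator(labelCol='PrivateIndex', rawPredictionCol='prediction')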

In [33]:
print('GBT :', my_binary_eval2.evaluate(gbt_preds))

GBT : 0.9300685975609756
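The GBT evaluator above reads the prediction column, while the DTC and RFC evaluator reads rawPrediction. The sketch below shows why that matters, using a plain-Python AUC computed as the fraction of (positive, negative) pairs that are ranked correctly, with ties counted as half. The labels and scores are made up for illustration, the helper name roc_auc is hypothetical, and Spark's own computation may differ in detail.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, not taken from the notebook.
labels = [0, 0, 1, 1, 1, 0]
scores = [0.10, 0.40, 0.35, 0.80, 0.90, 0.20]

# Hard 0/1 predictions obtained by thresholding the scores at 0.5.
hard = [1 if s >= 0.5 else 0 for s in scores]

auc_scores = roc_auc(labels, scores)
auc_hard = roc_auc(labels, hard)
print(auc_scores, auc_hard)
```

In this toy example the AUC from hard predictions is lower than the AUC from scores, because thresholding throws away the ranking information within each predicted class.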


## MulticlassClassificationEvaluator

In [34]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [35]:
acc_evaluator = MulticlassClassificationEvaluator(labelCol="PrivateIndex", predictionCol="prediction", metricName="accuracy")

In [37]:
dtc_acc = acc_evaluator.evaluate(dtc_preds)
rfc_acc = acc_evaluator.evaluate(rfc_preds)
gbt_acc = acc_evaluator.evaluate(gbt_preds)

In [38]:
print("Here are the results!")
print('-'*80)
print('A single decision tree had an accuracy of: {0:2.2f}%'.format(dtc_acc*100))
print('-'*80)
print('A random forest ensemble had an accuracy of: {0:2.2f}%'.format(rfc_acc*100))
print('-'*80)
print('An ensemble using GBT had an accuracy of: {0:2.2f}%'.format(gbt_acc*100))

Here are the results!
--------------------------------------------------------------------------------
A single decision tree had an accuracy of: 95.18%
--------------------------------------------------------------------------------
A random forest ensemble had an accuracy of: 96.05%
--------------------------------------------------------------------------------
An ensemble using GBT had an accuracy of: 94.74%
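Accuracy is the fraction of rows where prediction equals PrivateIndex. The sketch below shows how accuracy, and the precision and recall of a single class, follow from (label, prediction) pairs. The data is made up for illustration and the helper name binary_metrics is hypothetical. Note that it reports per-class values, which is not necessarily how the evaluator's own precision and recall metrics are averaged.

```python
def binary_metrics(labels, preds, positive=1):
    """Return (accuracy, precision, recall) for the given positive class."""
    pairs = list(zip(labels, preds))
    correct = sum(1 for y, p in pairs if y == p)
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical labels and predictions, not taken from the notebook.
labels = [0, 0, 0, 1, 1, 0, 1, 0]
preds = [0, 0, 0, 0, 1, 0, 1, 1]

accuracy, precision, recall = binary_metrics(labels, preds)
print(accuracy, precision, recall)
```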


# Feature Importances

In [39]:
rfc_model.featureImportances

SparseVector(17, {0: 0.0209, 1: 0.049, 2: 0.1378, 3: 0.016, 4: 0.0051, 5: 0.1606, 6: 0.144, 7: 0.1932, 8: 0.0562, 9: 0.0106, 10: 0.0258, 11: 0.0182, 12: 0.0165, 13: 0.0487, 14: 0.0388, 15: 0.0407, 16: 0.0179})

In [40]:
dtc_model.featureImportances

SparseVector(17, {0: 0.0187, 1: 0.0076, 2: 0.0077, 3: 0.0173, 4: 0.0121, 5: 0.4675, 6: 0.0422, 7: 0.3286, 8: 0.0293, 10: 0.0025, 11: 0.0422, 14: 0.0147, 15: 0.0096})

In [41]:
gbt_model.featureImportances

SparseVector(17, {0: 0.0566, 1: 0.0093, 2: 0.0175, 3: 0.0549, 4: 0.0107, 5: 0.2921, 6: 0.0743, 7: 0.227, 8: 0.0416, 9: 0.005, 10: 0.022, 11: 0.056, 12: 0.0093, 13: 0.0796, 14: 0.0191, 15: 0.0189, 16: 0.006})
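The indices in these vectors follow the order of inputCols given to the VectorAssembler. The sketch below maps the random forest importances printed above back to feature names and ranks them. The values are copied from the output above; in the notebook they could instead be obtained with rfc_model.featureImportances.toArray() and assembler.getInputCols(). The variable names are illustrative.

```python
# Feature names in the same order as the VectorAssembler inputCols.
feature_names = ['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc',
                 'F_Undergrad', 'P_Undergrad', 'Outstate', 'Room_Board',
                 'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio',
                 'perc_alumni', 'Expend', 'Grad_Rate']

# Random forest importances copied from the SparseVector printed above.
rfc_importances = {0: 0.0209, 1: 0.049, 2: 0.1378, 3: 0.016, 4: 0.0051,
                   5: 0.1606, 6: 0.144, 7: 0.1932, 8: 0.0562, 9: 0.0106,
                   10: 0.0258, 11: 0.0182, 12: 0.0165, 13: 0.0487,
                   14: 0.0388, 15: 0.0407, 16: 0.0179}

# Pair each index with its name and sort by importance, largest first.
ranked = sorted(((feature_names[i], w) for i, w in rfc_importances.items()),
                key=lambda pair: pair[1], reverse=True)

for name, weight in ranked[:5]:
    print(f"{name:12s} {weight:.4f}")
```

For the random forest, the top entries are Outstate, F_Undergrad, P_Undergrad and Enroll. The decision tree and GBT vectors above also give their largest weights to indices 5 and 7, which are F_Undergrad and Outstate.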