# College Classification using Tree Methods

We will use a college dataset to classify colleges as private or public. The dataset has the following columns, where Private is the label and the remaining columns are the features:

    Private A factor with levels No and Yes indicating whether the university is private (Yes) or public (No)
    Apps Number of applications received
    Accept Number of applications accepted
    Enroll Number of new students enrolled
    Top10perc Pct. new students from top 10% of H.S. class
    Top25perc Pct. new students from top 25% of H.S. class
    F.Undergrad Number of fulltime undergraduates
    P.Undergrad Number of parttime undergraduates
    Outstate Out-of-state tuition
    Room.Board Room and board costs
    Books Estimated book costs
    Personal Estimated personal spending
    PhD Pct. of faculty with Ph.D.’s
    Terminal Pct. of faculty with terminal degree
    S.F.Ratio Student/faculty ratio
    perc.alumni Pct. alumni who donate
    Expend Instructional expenditure per student
    Grad.Rate Graduation rate

We will compare the following models:

* A single decision tree
* A random forest
* A gradient boosted tree classifier

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('College Classification').getOrCreate()

In [2]:
from pyspark.ml.feature import VectorAssembler, StringIndexer

In [3]:
data = spark.read.csv('College.csv', inferSchema=True, header=True)

In [4]:
data.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



In [5]:
data.show()

+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|              School|Private|Apps|Accept|Enroll|Top10perc|Top25perc|F_Undergrad|P_Undergrad|Outstate|Room_Board|Books|Personal|PhD|Terminal|S_F_Ratio|perc_alumni|Expend|Grad_Rate|
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|Abilene Christian...|    Yes|1660|  1232|   721|       23|       52|       2885|        537|    7440|      3300|  450|    2200| 70|      78|     18.1|         12|  7041|       60|
|  Adelphi University|    Yes|2186|  1924|   512|       16|       29|       2683|       1227|   12280|      6450|  750|    1500| 29|      30|     12.2|         16| 10527|       56|
|      Adrian College|    Yes|1428|  1097|   336|       22|       50|       1036|         99|  

In [6]:
data.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate']

Transforming the data into the features/label format that Spark ML expects

In [7]:
assembler = VectorAssembler(
    inputCols = ['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'F_Undergrad', 'P_Undergrad', 'Outstate',
                 'Room_Board', 'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio', 'perc_alumni', 'Expend',
                 'Grad_Rate'],
    outputCol = 'features')

trans_data = assembler.transform(data)
trans_data.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
 |-- features: vector (nullable = true)



In [8]:
index = StringIndexer(inputCol="Private", outputCol="label")
output = index.fit(trans_data).transform(trans_data)

In [9]:
ml_data = output.select('features','label')
ml_data.show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[1660.0,1232.0,72...|  0.0|
|[2186.0,1924.0,51...|  0.0|
|[1428.0,1097.0,33...|  0.0|
|[417.0,349.0,137....|  0.0|
|[193.0,146.0,55.0...|  0.0|
|[587.0,479.0,158....|  0.0|
|[353.0,340.0,103....|  0.0|
|[1899.0,1720.0,48...|  0.0|
|[1038.0,839.0,227...|  0.0|
|[582.0,498.0,172....|  0.0|
|[1732.0,1425.0,47...|  0.0|
|[2652.0,1900.0,48...|  0.0|
|[1179.0,780.0,290...|  0.0|
|[1267.0,1080.0,38...|  0.0|
|[494.0,313.0,157....|  0.0|
|[1420.0,1093.0,22...|  0.0|
|[4302.0,992.0,418...|  0.0|
|[1216.0,908.0,423...|  0.0|
|[1130.0,704.0,322...|  0.0|
|[3540.0,2001.0,10...|  1.0|
+--------------------+-----+
only showing top 20 rows
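
The label column is numeric because StringIndexer, by default, orders the distinct strings by descending frequency, so the most common value gets index 0.0. In the output above the first rows are private schools (Private = Yes) and carry label 0.0, which suggests that Yes is the majority class here. A minimal plain-Python sketch of that ordering rule follows. It uses made-up values rather than the real dataset, the helper name frequency_index is hypothetical, and the alphabetical tie-break is a simplifying assumption.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct string to a float index, most frequent first.

    Hypothetical helper that mimics a descending-frequency ordering.
    Ties are broken alphabetically here as a simplifying assumption.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Made-up sample with more "Yes" than "No" values
sample = ["Yes", "Yes", "No", "Yes", "No", "Yes"]
mapping = frequency_index(sample)
labels = [mapping[v] for v in sample]

print(mapping)
print(labels)
```

If the mapping in this notebook is indeed Yes to 0.0, then label 1.0 corresponds to public schools. The fitted StringIndexerModel exposes the actual ordering through its labels attribute, which is worth checking before interpreting any per-class metric.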



In [10]:
train, test = ml_data.randomSplit([0.7,0.3])
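
randomSplit assigns rows at random, so the 70/30 proportions are approximate and each run produces a different split unless a seed is passed (the method accepts an optional seed argument, for example randomSplit([0.7, 0.3], seed=42)). The plain-Python sketch below is an analogy under that assumption, not Spark's implementation: each row gets an independent random draw, and fixing the seed makes the split repeatable. The function name random_split and the row values are hypothetical.

```python
import random


def random_split(rows, train_fraction=0.7, seed=None):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # a private generator, so the seed fully fixes the split
    train_rows, test_rows = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train_rows.append(row)
        else:
            test_rows.append(row)
    return train_rows, test_rows


rows = list(range(1000))  # stand-in for the rows of a DataFrame
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

print(len(train_a), len(test_a))  # roughly 700 and 300, not exactly
print(train_a == train_b)         # the same seed gives the same split
```

Without a fixed seed, the metrics reported further down will vary from run to run, which matters when the models being compared are only a few points apart.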

In [11]:
from pyspark.ml.classification import (RandomForestClassifier, GBTClassifier, DecisionTreeClassifier)

Creating the three models

In [12]:
rfc = RandomForestClassifier(numTrees=100)
gbt = GBTClassifier()
dtc = DecisionTreeClassifier()

In [13]:
rfc_model = rfc.fit(train)
gbt_model = gbt.fit(train)
dtc_model = dtc.fit(train)

In [14]:
# Apply each fitted model to the test data
rfc_pred = rfc_model.transform(test)
gbt_pred = gbt_model.transform(test)
dtc_pred = dtc_model.transform(test)

Evaluating with BinaryClassificationEvaluator (its default metric is area under the ROC curve)

In [15]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
my_eval = BinaryClassificationEvaluator()

In [16]:
rfc_pred.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



In [17]:
print("RFC--Area Under ROC")
my_eval.evaluate(rfc_pred)

RFC--Area Under ROC


0.986994760479042

In [18]:
gbt_pred.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



In [19]:
print("GBT--Area Under ROC")
my_eval.evaluate(gbt_pred)

GBT--Area Under ROC


0.971744011976048

In [26]:
dtc_pred.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



In [20]:
print("DTC--Area Under ROC")
my_eval.evaluate(dtc_pred)

DTC--Area Under ROC


0.9350205838323353
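
Area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. A minimal plain-Python sketch of that pairwise definition follows. The labels and scores are made up, the function name auc_pairwise is hypothetical, and the code is meant to illustrate the metric rather than to reproduce how Spark computes it.

```python
def auc_pairwise(labels, scores):
    """Area under ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = positive class) and model scores
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = auc_pairwise(labels, scores)
print(auc)
```

In this toy example 8 of the 9 positive/negative pairs are ordered correctly, so the value is 8/9. The metric uses the ranking of the scores only and does not depend on any particular decision threshold.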

Let's switch to the multiclass evaluator and check the accuracy of these tree-based models

In [21]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
my_acc = MulticlassClassificationEvaluator(metricName='accuracy')

In [22]:
gbt_acc = my_acc.evaluate(gbt_pred)
rfc_acc = my_acc.evaluate(rfc_pred)
dtc_acc = my_acc.evaluate(dtc_pred)

In [23]:
print("-"*50)
print(' M O D E L - A C C U R A C Y ')
print("-"*50)
print("The accuracy for GBT: {0:2.2f}".format(gbt_acc*100))
print("-"*50)
print("The accuracy for DTC: {0:2.2f}".format(dtc_acc*100))
print("-"*50)
print("The accuracy for RFC: {0:2.2f}".format(rfc_acc*100))

--------------------------------------------------
 M O D E L - A C C U R A C Y 
--------------------------------------------------
The accuracy for GBT: 91.77
--------------------------------------------------
The accuracy for DTC: 91.77
--------------------------------------------------
The accuracy for RFC: 96.10
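
Accuracy is the fraction of test rows whose prediction equals the label, so unlike area under ROC it depends only on the final thresholded prediction. That is one reason GBT and DTC can tie on accuracy here while differing in AUC. A minimal plain-Python sketch with made-up labels and predictions follows; accuracy and confusion_counts are hypothetical helper names, not part of PySpark.

```python
from collections import Counter


def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and of equal length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def confusion_counts(labels, predictions):
    """Count occurrences of each (label, prediction) pair."""
    return Counter(zip(labels, predictions))


# Made-up labels and predictions, encoded as 0.0 / 1.0 like the label column
labels = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
predictions = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]

acc = accuracy(labels, predictions)
cm = confusion_counts(labels, predictions)

print("accuracy: {0:2.2f}".format(acc * 100))
for pair, count in sorted(cm.items()):
    print(pair, count)
```

Looking at the (label, prediction) counts alongside accuracy shows which class the errors fall on, which a single accuracy figure hides when the classes are imbalanced.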
