# Classification template

This notebook demonstrates the classification task in data analytics.

In all examples, we will use the `heart_disease` dataset. The target is to predict whether a patient is having heart disease (`1`) or not (`0`) based on their other information.

We will first load the data in and process using pipeline like in the `pipeline_template` notebook

In [1]:
%spark2.pyspark

#path to data
hdfs_path = '/tmp/data/'
data_file = 'heart_disease.csv'
split_ratio = [0.7, 0.3]
drop_cols = ['PatientID']
integer_cols = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'HeartDisease']
string_cols = ['ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
numeric_cols = ['Age','RestingBP','Cholesterol','FastingBS','MaxHR','Oldpeak']
target = 'HeartDisease'


from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

#read data
data = spark.read.options(header='True',inferSchema='True',delimiter=',').csv(hdfs_path+data_file)

#drop columns
data = data.drop(*drop_cols)

#cast integer columns to double
for c in integer_cols:
    data = data.withColumn(c, col(c).cast(DoubleType()))
    
#train-test split
data_train, data_test = data.randomSplit(split_ratio)

from pyspark.ml.feature import StringIndexer, OneHotEncoder, Imputer, StandardScaler, VectorAssembler
from pyspark.ml import Pipeline

###one hot encode the categorical columns
encoders = []
for c in string_cols:
    encoders.append(StringIndexer(inputCol=c, outputCol=c+'Index', handleInvalid='keep'))
    encoders.append(OneHotEncoder(inputCol=c+'Index', outputCol=c+'Codes'))

###impute the numeric columns
imputer = Imputer(inputCols = numeric_cols, outputCols = [c+'Imp' for c in numeric_cols], strategy = 'median')

###standardization
num_assembler = VectorAssembler(inputCols=[c+'Imp' for c in numeric_cols], outputCol='imputed')
scaler = StandardScaler(inputCol = 'imputed', outputCol = 'scaled')

###combine results
assembler = VectorAssembler(inputCols=[c+'Codes' for c in string_cols]+['scaled'], outputCol='features')



###build pipeline
pipeline = Pipeline(stages = encoders + [imputer, num_assembler, scaler, assembler])

###train pipeline
pipeline_trained = pipeline.fit(data_train)

###process training data annd testing data
train_prc = pipeline_trained.transform(data_train).select(target,'features')
test_prc = pipeline_trained.transform(data_test).select(target,'features')

## Modeling

We will tune and test some common classification models:
- Logistic regression
- Decision tree
- Random forest
- Gradient boosting model
- Multilayer Perceptron (Neural networks)

We can automate the search for the best hyperparamters with Cross Validation Grid Search. In pyspark, we use a combination of `ParamGridBuilder` and `CrossValidator`. The steps are as follows
1. Create an empty model
2. Create the parameter grid with `ParamGridBuilder`. Each hyperparameter requires a different `addGrid()` call; multiple `addGrid()` can be chained.
3. Create the `CrossValidator` object
    - `estimator`: the empty model
    - `estimatorParamMaps`: the parameter grid
    - `evaluator`: the evaluator object (`MulticlassClassificationEvaluator` for classification)
    - `numFolds`: number of folds for cross validation
4. Train the CrossValidator with fit()

First, we import general libaries and create an evaluator. `metricName` are commonly `f1` or `accuracy`

In [3]:
%spark2.pyspark
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol=target, metricName='f1')

### Logistic regression

Logistic regression has one hyperparameters, `regParam` and `elasticNetParam`, to tune


In [5]:
%spark2.pyspark

from pyspark.ml.classification import LogisticRegression

#create empty model
logistic = LogisticRegression(featuresCol='features', labelCol=target)

#parameter grid for decision tree
paramGridLogistic = ParamGridBuilder().addGrid(logistic.regParam, [0.001, 0.01, 0.1, 1., 10.])\
                                      .addGrid(logistic.elasticNetParam, [0.25, 0.5, 0.75]).build()

#cross validator
crossval = CrossValidator(estimator=logistic,
                          estimatorParamMaps=paramGridLogistic,
                          evaluator=evaluator,
                          numFolds=3) 

#perform the search
cvLogistic = crossval.fit(train_prc)

#test the tuned model
train_pred_cvLogistic = cvLogistic.transform(train_prc)
test_pred_cvLogistic = cvLogistic.transform(test_prc)
print('cross-validation logistic regression')
print('training F1: ', evaluator.evaluate(train_pred_cvLogistic))
print('testing F1: ', evaluator.evaluate(test_pred_cvLogistic))

cross-validation logistic regression
('training F1: ', 0.8529777439466435)
('testing F1: ', 0.872577764672501)


### Decision tree

The two important hyperparameters to tune for decision tree are `maxDepth` and `minInstancesPerNode`

In [7]:
%spark2.pyspark

from pyspark.ml.classification import DecisionTreeClassifier

#create empty model
dt = DecisionTreeClassifier(featuresCol='features', labelCol=target)

#parameter grid for decision tree
paramGridTree = ParamGridBuilder().addGrid(dt.maxDepth, [3, 5, 7])\
                                  .addGrid(dt.minInstancesPerNode, [10, 20, 30]).build()

#cross validator
crossval = CrossValidator(estimator=dt,
                          estimatorParamMaps=paramGridTree,
                          evaluator=evaluator,
                          numFolds=3) 

#perform the search
cvTree = crossval.fit(train_prc)

#test the tuned model
train_pred_cvTree = cvTree.transform(train_prc)
test_pred_cvTree = cvTree.transform(test_prc)
print('cross-validation decision tree')
print('training F1: ', evaluator.evaluate(train_pred_cvTree))
print('testing F1: ', evaluator.evaluate(test_pred_cvTree))

cross-validation decision tree
('training F1: ', 0.8580293817175548)
('testing F1: ', 0.8310770136876168)


### Random Forest

Random Forest is an ensemble of decision trees and usually yields better performances. 

Similar to a tree, we need to tune maxDepth and `minInstancesPerNode`. We also need to tune `numTrees` - the number of trees in a forest model.


In [9]:
%spark2.pyspark

from pyspark.ml.classification import RandomForestClassifier

#initialize model
rf = RandomForestClassifier(featuresCol='features', labelCol=target)

#paramter grid
paramGridForest = ParamGridBuilder().addGrid(rf.numTrees, [10, 30, 50])\
                                    .addGrid(dt.maxDepth, [3, 5, 7])\
                                    .addGrid(dt.minInstancesPerNode, [10, 20, 30])\
                                    .build()
#cross validator
crossval = CrossValidator(estimator = rf,
                          estimatorParamMaps = paramGridForest,
                          evaluator = evaluator,
                          numFolds = 3) 

#perform tuning
cvForest = crossval.fit(train_prc)

#test the tuned model
train_pred_cvForest = cvForest.transform(train_prc)
test_pred_cvForest = cvForest.transform(test_prc)

print('cross-validation random forest')
print('training F1: ', evaluator.evaluate(train_pred_cvForest))
print('testing F1: ', evaluator.evaluate(test_pred_cvForest))

cross-validation random forest
('training F1: ', 0.9038743620786964)
('testing F1: ', 0.8409396703765732)


<h4>Gradient Boosting Model</h4>

Gradient boosting model (GBT) is similar to random forest, however, each tree is added to the ensemble to minimize the current training error instead of randomly.

GBT models still have `maxDepth` and `minInstancesPerNode` to tune, however, we do not tune the numTrees anymore.

In [11]:
%spark2.pyspark

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(featuresCol='features', labelCol=target)

paramGridGBT = ParamGridBuilder().addGrid(gbt.maxDepth, [3, 5, 7])\
                                 .addGrid(gbt.minInstancesPerNode, [10, 20, 30])\
                                 .build()

crossval = CrossValidator(estimator = gbt,
                          estimatorParamMaps = paramGridGBT,
                          evaluator = evaluator,
                          numFolds = 3) 

cvGBT = crossval.fit(train_prc)

train_pred_cvGBT = cvGBT.transform(train_prc)
test_pred_cvGBT = cvGBT.transform(test_prc)

print('cross-validation GBT')
print('training F1: ', evaluator.evaluate(train_pred_cvGBT))
print('testing F1: ', evaluator.evaluate(test_pred_cvGBT))

cross-validation GBT
('training F1: ', 0.8751703579201073)
('testing F1: ', 0.8442918343712799)


### Multilayer Perceptron

Pyspark's version of neural networks. Only has sigmoid activations. We need to tune the `layers` hyperparameter which is a list of neurons per layer and must include the sizes of the input and output layers.

The size of the input layer can be observed with `train_prc.head()` and the size of the output layer is the number of unique classes in the target


In [13]:
%spark2.pyspark

train_prc.head(1)

[Row(HeartDisease=0.0, features=SparseVector(18, {2: 1.0, 5: 1.0, 7: 1.0, 10: 1.0, 12: 3.1433, 13: 6.9639, 14: 1.8085, 16: 7.979}))]


In [14]:
%spark2.pyspark

from pyspark.sql.functions import countDistinct
train_prc.select(countDistinct(target)).show()

+----------------------------+
|count(DISTINCT HeartDisease)|
+----------------------------+
|                           2|
+----------------------------+



In [15]:
%spark2.pyspark

from pyspark.ml.classification import MultilayerPerceptronClassifier

mlp = MultilayerPerceptronClassifier(featuresCol='features', labelCol=target)

paramGridMLP = ParamGridBuilder().addGrid(mlp.layers, [
        [18, 20, 2],
        [18, 20, 20, 2],
        [18, 30, 2],
        [18, 30, 30, 2]])\
        .addGrid(mlp.maxIter, [100,200,300]).build()

crossval = CrossValidator(estimator = mlp,
                          estimatorParamMaps = paramGridMLP,
                          evaluator = evaluator,
                          numFolds = 3) 

cvMLP = crossval.fit(train_prc)

train_pred_cvMLP = cvMLP.transform(train_prc)
test_pred_cvMLP = cvMLP.transform(test_prc)

print('cross-validation MLP')
print('training F1: ', evaluator.evaluate(train_pred_cvMLP))
print('testing F1: ', evaluator.evaluate(test_pred_cvMLP))

cross-validation MLP
('training F1: ', 0.8654006062912944)
('testing F1: ', 0.8309315136522067)
