# DS/CMPSC 410 Spring 2022
# Lab 7 Decision Tree Learning Using ML Pipeline, Visualization, and Hyperparameter Tuning

# Instructor: Professor John Yen
# TA: Rupesh Prajapati 
# LAs: Lily Jakielaszek and Cayla Shan Pun

## The goals of this lab are for you to be able to
- Understand the function of the different steps/stages involved in a Spark ML pipeline
- Construct a decision tree using the Spark ML machine learning module
- Generate a visualization of decision trees
- Use a Spark ML pipeline to perform automated hyperparameter tuning for decision trees
- Apply .persist() to suitable DataFrames and evaluate its impact on computation time

## The data set used in this lab is a Breast Cancer diagnosis dataset.

## Submit the following items for Lab 7 (DT)
- Completed Jupyter Notebook of Lab 7 (in HTML format)
- A .py file (e.g., Lab7DT.py) and log file for running ONLY Part 5 of this notebook in cluster mode
- A .py file (e.g., Lab7DT_P.py) and log file for running ONLY Part 5 of this notebook in cluster mode, with persist() on two DataFrames
- The output file that contains the best hyperparameters

## Total Number of Exercises: 8
- Exercise 1: 5 points
- Exercise 2: 5 points
- Exercise 3: 10 points  
- Exercise 4: 10 points 
- Exercise 5: 15 points
- Exercise 6: 15 points
- Exercise 7: 20 points
- Exercise 8: 20 points
## Total Points: 100 points

# Due: midnight, Feb 27, 2022

# Load and set up the Python files for this Lab
1. Create a "Lab7DT" directory in the work directory of your ICDS-ROAR home directory.
2. If you have not done so, copy or upload this file to the "Lab7DT" directory.
3. Create a subdirectory under "Lab7DT" called "decision_tree_plot" (name the directory EXACTLY this way).
4. Upload the following three files in Module 8 from Canvas to the decision_tree_plot directory
- decision_tree_parser.py
- decision_tree_plot.py
- tree_template.jinjia2

# Follow the instructions below and execute the PySpark code cell by cell below. Make modifications as required.

In [1]:
import pyspark
import pandas as pd
import csv

## Notice that we use PySpark SQL module to import SparkSession because ML works with SparkSession
## Notice also the different methods imported from ML and three submodules of ML: classification, feature, and evaluation.

In [2]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, LongType, IntegerType, FloatType
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

## The following two lines import relevant functions from the two python files you uploaded into the decision_tree_plot subdirectory.

In [3]:
from decision_tree_plot.decision_tree_parser import decision_tree_parse
from decision_tree_plot.decision_tree_plot import plot_trees

## This lab runs Spark in the local mode.
## Notice we are creating a SparkSession, not a SparkContext, when we use ML pipeline.
## The "getOrCreate()" method means we can re-evaluate this without a need to "stop the current SparkSession" first (unlike SparkContext).

In [4]:
ss=SparkSession.builder.master("local").appName("lab 7 DT").getOrCreate()

## Exercise 1: (5 points) Enter your name below:
- My Name: Haichen Wei

## As we have seen in Lab 4, SparkSession offers a way to read a CSV/text file with the capability to interpret the first row as being the header and infer the type of different columns based on their values.

## Exercise 2: (5 points) Complete the following path with the path for your home directory.  

In [6]:
data = ss.read.csv("/storage/home/hxw5245/Lab7DT/breast-cancer-wisconsin.data.txt", header=True, inferSchema=True)

# Part 1 Feature Transformation Using DataFrame

In [7]:
data.printSchema()

root
 |-- id: integer (nullable = true)
 |-- clump_thickness: integer (nullable = true)
 |-- unif_cell_size: integer (nullable = true)
 |-- unif_cell_shape: integer (nullable = true)
 |-- marg_adhesion: integer (nullable = true)
 |-- single_epith_cell_size: integer (nullable = true)
 |-- bare_nuclei: string (nullable = true)
 |-- bland_chrom: integer (nullable = true)
 |-- norm_nucleoli: integer (nullable = true)
 |-- mitoses: integer (nullable = true)
 |-- class: integer (nullable = true)
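
Note that `bare_nuclei` is inferred as a string while every other column is an integer. A column can only be typed as integer if every one of its values parses as an integer; in the public UCI version of this dataset, missing `bare_nuclei` values are recorded as `?`, which would explain the string type. The following is a minimal pure-Python sketch of that idea. It is not Spark's actual inference code, the helper name `infer_column_types` is hypothetical, and the three-row sample is made up for illustration.

```python
import csv
import io

# Hypothetical three-row sample; the "?" mimics a missing value.
sample = io.StringIO(
    "id,clump_thickness,bare_nuclei\n"
    "1,5,1\n"
    "2,3,?\n"
    "3,6,10\n"
)

def infer_column_types(handle):
    """Return {column: 'integer' | 'string'} for a CSV with a header row."""
    reader = csv.DictReader(handle)
    types = {name: "integer" for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            try:
                int(value)
            except ValueError:
                # One non-numeric value is enough to demote the column.
                types[name] = "string"
    return types

inferred = infer_column_types(sample)
print(inferred)
```

This is why the notebook runs `bare_nuclei` through a `StringIndexer` before assembling the feature vector.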



In [8]:
data.show(5)

+-------+---------------+--------------+---------------+-------------+----------------------+-----------+-----------+-------------+-------+-----+
|     id|clump_thickness|unif_cell_size|unif_cell_shape|marg_adhesion|single_epith_cell_size|bare_nuclei|bland_chrom|norm_nucleoli|mitoses|class|
+-------+---------------+--------------+---------------+-------------+----------------------+-----------+-----------+-------------+-------+-----+
|1000025|              5|             1|              1|            1|                     2|          1|          3|            1|      1|    2|
|1002945|              5|             4|              4|            5|                     7|         10|          3|            2|      1|    2|
|1015425|              3|             1|              1|            1|                     2|          2|          3|            1|      1|    2|
|1016277|              6|             8|              8|            1|                     3|          4|          3|       

In [9]:
from pyspark.sql.functions import col
class_count = data.groupBy(col("class")).count()
class_count.show()

+-----+-----+
|class|count|
+-----+-----+
|    4|  241|
|    2|  458|
+-----+-----+



In [10]:
bnIndexer = StringIndexer(inputCol="bare_nuclei", outputCol="bare_nuclei_index").fit(data)

In [11]:
bnIndexer

StringIndexerModel: uid=StringIndexer_933a948fc9b1, handleInvalid=error

In [12]:
transformed_data = bnIndexer.transform(data)

In [13]:
transformed_data.show(4)

+-------+---------------+--------------+---------------+-------------+----------------------+-----------+-----------+-------------+-------+-----+-----------------+
|     id|clump_thickness|unif_cell_size|unif_cell_shape|marg_adhesion|single_epith_cell_size|bare_nuclei|bland_chrom|norm_nucleoli|mitoses|class|bare_nuclei_index|
+-------+---------------+--------------+---------------+-------------+----------------------+-----------+-----------+-------------+-------+-----+-----------------+
|1000025|              5|             1|              1|            1|                     2|          1|          3|            1|      1|    2|              0.0|
|1002945|              5|             4|              4|            5|                     7|         10|          3|            2|      1|    2|              1.0|
|1015425|              3|             1|              1|            1|                     2|          2|          3|            1|      1|    2|              2.0|
|1016277|       

In [14]:
labelIndexer= StringIndexer(inputCol="class", outputCol="indexedLabel").fit(data)

In [15]:
labelIndexer

StringIndexerModel: uid=StringIndexer_afb7331903a6, handleInvalid=error

In [16]:
transformed2_data = labelIndexer.transform(transformed_data)

In [17]:
transformed2_data.show(4)

+-------+---------------+--------------+---------------+-------------+----------------------+-----------+-----------+-------------+-------+-----+-----------------+------------+
|     id|clump_thickness|unif_cell_size|unif_cell_shape|marg_adhesion|single_epith_cell_size|bare_nuclei|bland_chrom|norm_nucleoli|mitoses|class|bare_nuclei_index|indexedLabel|
+-------+---------------+--------------+---------------+-------------+----------------------+-----------+-----------+-------------+-------+-----+-----------------+------------+
|1000025|              5|             1|              1|            1|                     2|          1|          3|            1|      1|    2|              0.0|         0.0|
|1002945|              5|             4|              4|            5|                     7|         10|          3|            2|      1|    2|              1.0|         0.0|
|1015425|              3|             1|              1|            1|                     2|          2|          

In [18]:
input_features = ['clump_thickness', 'unif_cell_size', 'unif_cell_shape', 'marg_adhesion', \
                  'single_epith_cell_size', 'bare_nuclei_index', 'bland_chrom', 'norm_nucleoli', 'mitoses']

In [19]:
assembler = VectorAssembler(inputCols=input_features, outputCol="features")

In [20]:
assembler

VectorAssembler_033c3ec3996c

In [21]:
transformed3_data = assembler.transform(transformed2_data)

In [22]:
selected_transformed3_data = transformed3_data.select("features",'indexedLabel')
selected_transformed3_data.show(5)

+--------------------+------------+
|            features|indexedLabel|
+--------------------+------------+
|[5.0,1.0,1.0,1.0,...|         0.0|
|[5.0,4.0,4.0,5.0,...|         0.0|
|[3.0,1.0,1.0,1.0,...|         0.0|
|[6.0,8.0,8.0,1.0,...|         0.0|
|[4.0,1.0,1.0,3.0,...|         0.0|
+--------------------+------------+
only showing top 5 rows



# Part 2 Decision Tree Learning and Evaluation

## randomSplit is a DataFrame method that splits the rows into subsets (here, one for training and one for testing), using a number as the seed for the random number generator.
## If you want to generate a different split, you can use a different seed.
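
To make the role of the weights and the seed concrete, here is a minimal pure-Python sketch of a seeded random split. It only approximates the idea: each row gets an independent random draw, so the split sizes are close to, but not exactly, 75/25, and the same seed reproduces the same split. The function `random_split` is a hypothetical stand-in, not Spark's implementation, and it will not reproduce the rows Spark selects.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by an independent seeded random draw."""
    total = float(sum(weights))
    # Cumulative boundaries, e.g. [0.75, 1.0] for weights [0.75, 0.25].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point remainder.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(699))  # 699 = 458 + 241 rows, per the class counts above
train, test = random_split(rows, [0.75, 0.25], seed=1234)
train_again, _ = random_split(rows, [0.75, 0.25], seed=1234)
print(len(train), len(test), train == train_again)
```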

In [23]:
trainingData3, testData3= transformed3_data.randomSplit([0.75, 0.25], seed=1234)

In [24]:
dt=DecisionTreeClassifier(featuresCol="features", labelCol="indexedLabel", maxDepth=6, minInstancesPerNode=2)

In [25]:
dt

DecisionTreeClassifier_e6a043c1f024

In [26]:
dt_model = dt.fit(trainingData3)

In [27]:
dt_model

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_e6a043c1f024, depth=6, numNodes=33, numClasses=2, numFeatures=9

In [28]:
test_prediction = dt_model.transform(testData3)

In [29]:
test_prediction.persist().show(3)

+------+---------------+--------------+---------------+-------------+----------------------+-----------+-----------+-------------+-------+-----+-----------------+------------+--------------------+-------------+--------------------+----------+
|    id|clump_thickness|unif_cell_size|unif_cell_shape|marg_adhesion|single_epith_cell_size|bare_nuclei|bland_chrom|norm_nucleoli|mitoses|class|bare_nuclei_index|indexedLabel|            features|rawPrediction|         probability|prediction|
+------+---------------+--------------+---------------+-------------+----------------------+-----------+-----------+-------------+-------+-----+-----------------+------------+--------------------+-------------+--------------------+----------+
| 63375|              9|             1|              2|            6|                     4|         10|          7|            7|      2|    4|              1.0|         1.0|[9.0,1.0,2.0,6.0,...|    [0.0,4.0]|           [0.0,1.0]|       1.0|
|128059|              1|    

In [30]:
test_prediction.select("features","class","indexedLabel", "rawPrediction", "probability", "prediction").show(5)

+--------------------+-----+------------+-------------+--------------------+----------+
|            features|class|indexedLabel|rawPrediction|         probability|prediction|
+--------------------+-----+------------+-------------+--------------------+----------+
|[9.0,1.0,2.0,6.0,...|    4|         1.0|    [0.0,4.0]|           [0.0,1.0]|       1.0|
|[1.0,1.0,1.0,1.0,...|    2|         0.0|  [295.0,1.0]|[0.99662162162162...|       0.0|
|[3.0,3.0,5.0,2.0,...|    4|         1.0|   [4.0,15.0]|[0.21052631578947...|       1.0|
|[10.0,8.0,8.0,2.0...|    4|         1.0|  [1.0,126.0]|[0.00787401574803...|       1.0|
|[1.0,1.0,1.0,1.0,...|    2|         0.0|  [295.0,1.0]|[0.99662162162162...|       0.0|
+--------------------+-----+------------+-------------+--------------------+----------+
only showing top 5 rows
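
For a decision tree, `rawPrediction` holds the per-class counts of training rows that reached the leaf, and `probability` is that vector divided by its sum. The short sketch below checks this against the values shown in the table above; the helper name `normalize` is introduced only for illustration.

```python
def normalize(raw_prediction):
    """Turn per-class counts at a leaf into class probabilities."""
    total = sum(raw_prediction)
    return [count / total for count in raw_prediction]

# rawPrediction values copied from the table above.
for raw in ([0.0, 4.0], [295.0, 1.0], [4.0, 15.0], [1.0, 126.0]):
    print(raw, "->", normalize(raw))
```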



In [31]:
labelIndexer.labels

['2', '4']
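
The order of these labels is consistent with `StringIndexer`'s default ordering, which assigns index 0.0 to the most frequent value (class 2 has 458 rows, class 4 has 241). The pure-Python sketch below mimics that ordering and the inverse lookup that `IndexToString` performs in the next cell. The helper `fit_labels` is hypothetical, and the alphabetical tie-break is an assumption that does not matter for this dataset.

```python
from collections import Counter

def fit_labels(values):
    """Order distinct values by descending frequency, then alphabetically."""
    counts = Counter(str(v) for v in values)
    return sorted(counts, key=lambda label: (-counts[label], label))

# Class counts taken from the groupBy output earlier in this notebook.
classes = [2] * 458 + [4] * 241
labels = fit_labels(classes)

# Forward mapping (StringIndexer-style): label -> index.
to_index = {label: float(i) for i, label in enumerate(labels)}
# Inverse mapping (IndexToString-style): index -> label.
predicted_class = labels[int(1.0)]

print(labels, to_index, predicted_class)
```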

In [32]:
labelConverter=IndexToString(inputCol="prediction", outputCol="predictedClass", labels=labelIndexer.labels)

In [33]:
test2_prediction = labelConverter.transform(test_prediction)

In [34]:
test2_prediction.select("features","class","indexedLabel","prediction","predictedClass").show(5)

+--------------------+-----+------------+----------+--------------+
|            features|class|indexedLabel|prediction|predictedClass|
+--------------------+-----+------------+----------+--------------+
|[9.0,1.0,2.0,6.0,...|    4|         1.0|       1.0|             4|
|[1.0,1.0,1.0,1.0,...|    2|         0.0|       0.0|             2|
|[3.0,3.0,5.0,2.0,...|    4|         1.0|       1.0|             4|
|[10.0,8.0,8.0,2.0...|    4|         1.0|       1.0|             4|
|[1.0,1.0,1.0,1.0,...|    2|         0.0|       0.0|             2|
+--------------------+-----+------------+----------+--------------+
only showing top 5 rows



In [35]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="f1")

In [36]:
f1 = evaluator.evaluate(test_prediction)
print("f1 score:", f1)

f1 score: 0.9780273904005113
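
With `metricName="f1"`, `MulticlassClassificationEvaluator` reports a weighted F-measure: the F1 of each class, averaged with weights equal to each class's share of the true labels. The sketch below computes that quantity in plain Python on a small made-up example; the function name and the sample labels are hypothetical and are not taken from the lab data.

```python
def weighted_f1(labels, predictions):
    """Per-class F1, weighted by each class's share of the true labels."""
    pairs = list(zip(labels, predictions))
    total = len(labels)
    score = 0.0
    for c in sorted(set(labels)):
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # tp + fn is the number of true rows of class c.
        score += f1 * (tp + fn) / total
    return score

# Hypothetical labels and predictions for illustration only.
y_true = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
y_pred = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
print(weighted_f1(y_true, y_pred))
```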


# Part 3 DT Learning Using ML Pipeline

## Exercise 3: (10 points) In the code cell below, fill in a value for maxDepth and a value of minInstancesPerNode. Run the entire sequence of code below to generate a decision tree (using pipeline) and compute f1 measure of the testing data.
- Record the f1 measure for the max_depth below  
- Recommended value for maxDepth: 2 to 10
- Recommended value for minInstancesPerNode: 1 to 7

## Answer for Exercise 3: 
- The f1 measure of testing data for max_depth = 5 and minInstancesPerNode = 2 is 0.9726090922212769 (from the evaluator output in this part).

In [37]:
trainingData, testData= data.randomSplit([0.75, 0.25], seed=1234)

In [38]:
assembler = VectorAssembler( inputCols=input_features, outputCol="features")
dt=DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features", maxDepth=5, minInstancesPerNode=2)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedClass", labels=labelIndexer.labels)
pipeline = Pipeline(stages=[labelIndexer, bnIndexer, assembler, dt, labelConverter])
model = pipeline.fit(trainingData)
test_predictions = model.transform(testData)

In [39]:
pipeline

Pipeline_5b18e8951d82

In [40]:
model

PipelineModel_aba42a53db60

In [41]:
test_predictions.select("class","indexedLabel","prediction","predictedClass").show(10)

+-----+------------+----------+--------------+
|class|indexedLabel|prediction|predictedClass|
+-----+------------+----------+--------------+
|    4|         1.0|       1.0|             4|
|    2|         0.0|       0.0|             2|
|    4|         1.0|       1.0|             4|
|    4|         1.0|       1.0|             4|
|    2|         0.0|       0.0|             2|
|    2|         0.0|       0.0|             2|
|    4|         1.0|       1.0|             4|
|    2|         0.0|       0.0|             2|
|    2|         0.0|       0.0|             2|
|    2|         0.0|       0.0|             2|
+-----+------------+----------+--------------+
only showing top 10 rows



In [42]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="f1")

In [43]:
f1 = evaluator.evaluate(test_predictions)
print("f1 score of testing data:", f1)

f1 score of testing data: 0.9726090922212769


# Part 4 Decision Tree Visualization

## stages[3] of the pipeline is "dt" (DecisionTreeClassifier). 
## model is a PipelineModel representing a trained (fitted) pipeline.
## model.stages[3] gives us the Decision Tree model learned.

## Exercise 4: (10 points) 
- Complete the code below to generate a visualization of the decision tree.
- Download the HTML file of the tree and submit it as a part of Lab7 assignment.

In [44]:
DTmodel = model.stages[3]
print(DTmodel)

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_c84b5ce2cbe6, depth=5, numNodes=23, numClasses=2, numFeatures=9


In [45]:
model_path="./DTmodel_vis"

In [46]:
tree=decision_tree_parse(DTmodel, ss, model_path)
column = dict([(str(idx), i) for idx, i in enumerate(input_features)])
plot_trees(tree, column = column, output_path = '/storage/home/hxw5245/Lab7DT/DTtree2.html')
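
The `column` dictionary built above maps each feature's position in the assembled vector (as a string) to its name. Presumably `plot_trees` uses it to label split nodes with feature names instead of indices; that is an assumption about the provided helper, not something shown in this notebook. The sketch below only reproduces the mapping itself, using the same `input_features` list as Part 1.

```python
# Same feature list as defined in Part 1 of this notebook.
input_features = ['clump_thickness', 'unif_cell_size', 'unif_cell_shape',
                  'marg_adhesion', 'single_epith_cell_size',
                  'bare_nuclei_index', 'bland_chrom', 'norm_nucleoli',
                  'mitoses']

# Equivalent to dict([(str(idx), i) for idx, i in enumerate(input_features)]).
column = {str(idx): name for idx, name in enumerate(input_features)}

for idx, name in column.items():
    print(idx, "->", name)
```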

# Part 5 Automated Hyperparameter Tuning for Decision Tree

## Exercise 5: (15 points)  
- Complete the code below to perform hyperparameter tuning of the Decision Tree (for two parameters: maxDepth and minInstancesPerNode)

In [None]:
trainingData, testingData= data.randomSplit([0.75, 0.25], seed=1234)
model_path="./DTmodel_vis"

In [None]:
## Initialize a Pandas DataFrame to store evaluation results of all combination of hyper-parameter settings
hyperparams_eval_df = pd.DataFrame( columns = ['max_depth', 'minInstancesPerNode', 'training f1', 'testing f1', 'Best Model'] )
# Initialize the row index into hyperparams_eval_df
index =0 
# Track the highest testing f1 seen so far
highest_testing_f1 = 0
# Set up the possible hyperparameter values to be evaluated
max_depth_list = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
minInstancesPerNode_list = [2, 3, 4, 5, 6]
assembler = VectorAssembler( inputCols=input_features, outputCol="features")
labelConverter = IndexToString(inputCol = "prediction", outputCol="predictedClass", labels=labelIndexer.labels)
for max_depth in max_depth_list:
    for minInsPN in minInstancesPerNode_list:
        seed = 37
        # Construct a DT model using a set of hyper-parameter values and training data
        dt= DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features", maxDepth= max_depth, minInstancesPerNode= minInsPN)
        pipeline = Pipeline(stages=[labelIndexer, bnIndexer, assembler, dt, labelConverter])
        model = pipeline.fit(trainingData)
        training_predictions = model.transform(trainingData)
        testing_predictions = model.transform(testingData)
        evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="f1")
        training_f1 = evaluator.evaluate(training_predictions)
        testing_f1 = evaluator.evaluate(testing_predictions)
        # We use 0 as default value of the 'Best Model' column in the Pandas DataFrame.
        # The best model will have a value 1000
        hyperparams_eval_df.loc[index] = [ max_depth, minInsPN, training_f1, testing_f1, 0]  
        index = index +1
        if testing_f1 > highest_testing_f1 :
            best_max_depth = max_depth
            best_minInsPN = minInsPN
            best_index = index -1
            best_parameters_training_f1 = training_f1
            best_DTmodel= model.stages[3]
            best_tree = decision_tree_parse(best_DTmodel, ss, model_path)
            column = dict( [ (str(idx), i) for idx, i in enumerate(input_features) ])           
            highest_testing_f1 = testing_f1
print('The best max_depth is ', best_max_depth, ', best minInstancesPerNode = ', \
      best_minInsPN, ', testing f1 = ', highest_testing_f1) 
# column = dict([(str(idx), i) for idx, i in enumerate(input_features)])
plot_trees(best_tree, column = column, output_path = '/storage/home/hxw5245/Lab7DT/bestDTtree2.html')
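
The nested loops above amount to an exhaustive grid search: every (maxDepth, minInstancesPerNode) pair is scored, and the pair with the highest testing f1 is kept. The sketch below isolates that control flow in plain Python. `toy_score` is a hypothetical stand-in for fitting the pipeline and evaluating testing f1, so the winning pair it produces says nothing about the real data.

```python
from itertools import product

def grid_search(depths, min_instances, score):
    """Return (best_depth, best_min_instances, best_score) over the grid."""
    best = None
    # product() iterates depth in the outer loop and min_ins in the inner one.
    for depth, min_ins in product(depths, min_instances):
        value = score(depth, min_ins)
        # Strict ">" keeps the first combination found when scores tie.
        if best is None or value > best[2]:
            best = (depth, min_ins, value)
    return best

def toy_score(depth, min_ins):
    """Hypothetical stand-in for 'fit the pipeline, return testing f1'."""
    return 1.0 - 0.01 * abs(depth - 5) - 0.005 * abs(min_ins - 3)

# Same grid as above: 10 depths x 5 minimum-instance values = 50 fits.
best = grid_search(range(2, 12), range(2, 7), toy_score)
print(best)
```

Because the comparison is strict, ties are resolved in favor of the combination evaluated first, which in this grid means the smaller maxDepth.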

In [None]:
# Flag the best model's row by setting its 'Best Model' column to 1000
hyperparams_eval_df.loc[best_index]=[best_max_depth, best_minInsPN, best_parameters_training_f1, highest_testing_f1, 1000]

In [None]:
schema3= StructType([ StructField("Max Depth", FloatType(), True), \
                      StructField("MinInstancesPerNode", FloatType(), True ), \
                      StructField("Training f1", FloatType(), True), \
                      StructField("Testing f1", FloatType(), True), \
                      StructField("Best Model", FloatType(), True) \
                    ])

## Convert the pandas DataFrame that stores the training and testing f1 of every hyperparameter combination (with the best model flagged) to a Spark DataFrame


In [None]:
HyperParams_Tuning_DF = ss.createDataFrame(hyperparams_eval_df, schema3)

## Exercise 6 (15 points)
### Complete the path below to save the result of your hyperparameter tuning in a directory.

## Notice: Modify the output path before you export this to a .py file for running in cluster mode.  
## Notice: Remember to change the output_path directory after each spark-submit in cluster mode. Otherwise, the action (saveAsTextFile) will fail because it cannot write into an existing directory.
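
One way to avoid editing the path by hand before each run is to append a timestamp to it. The sketch below is only a suggestion: the helper `unique_output_path` and the `<user>` placeholder are hypothetical, and two runs started within the same second would still collide.

```python
from datetime import datetime

def unique_output_path(base, now=None):
    """Append a timestamp so each run targets a directory that is new."""
    now = now or datetime.now()
    return f"{base}_{now.strftime('%Y%m%d_%H%M%S')}"

# Hypothetical base path; substitute your own home directory.
example = unique_output_path("/storage/home/<user>/Lab7DT/Lab7_B",
                             now=datetime(2022, 2, 27, 23, 59, 0))
print(example)
```

The returned string could then be assigned to `output_path` before the call to `saveAsTextFile`.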

In [None]:
output_path = "/storage/home/hxw5245/Lab7DT/Lab7_B"
HyperParams_Tuning_DF.rdd.saveAsTextFile(output_path)

In [None]:
ss.stop()

# Exercise 7 (20 points) 
## Modify this Notebook to comment out (or remove) Parts 1, 2, 3, and 4, and modify it for running spark-submit in cluster mode. Export it as a .py file (e.g., Lab7DT.py). Record the computation time below:

## Answer to Exercise 7: 
real: 3m59.904s

user: 1m45.409s

sys: 0m8.081s

# Exercise 8 (20 points)
## Modify the .py file used in Exercise 7 to add `persist()` to two DataFrames for enhanced scalability of the code. (a) Record the computation time below. (b) Compare the computation time with and without persist.
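
As background for this exercise: Spark DataFrames are evaluated lazily, so a DataFrame that is reused by many actions may be recomputed each time unless it has been persisted. The toy class below is a pure-Python analogy of that behavior, not a Spark API; `LazyTable` and its methods are hypothetical names chosen for illustration.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated DataFrame (not a Spark API)."""

    def __init__(self, compute):
        self._compute = compute
        self._cache = None
        self._persisted = False
        self.compute_calls = 0

    def persist(self):
        """Mark the table so its first computed result is kept."""
        self._persisted = True
        return self

    def collect(self):
        """An 'action': recompute unless a persisted result is available."""
        if self._persisted and self._cache is not None:
            return self._cache
        self.compute_calls += 1
        result = self._compute()
        if self._persisted:
            self._cache = result
        return result

plain = LazyTable(lambda: [x * x for x in range(5)])
cached = LazyTable(lambda: [x * x for x in range(5)]).persist()
for _ in range(50):  # 50 = 10 max_depth values x 5 minInstancesPerNode values
    plain.collect()
    cached.collect()
print(plain.compute_calls, cached.compute_calls)
```

In the analogy, the table without `persist()` is recomputed once per loop iteration, while the persisted one is computed only once.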

## Answer to Exercise 8: 

(a) real: 2m38.377s

user: 1m41.350s

sys: 0m7.553s

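(b) The computation time with persist is shorter than without it: real (wall-clock) time dropped from 3m59.904s to 2m38.377s, a saving of about 81.5 seconds (roughly 34%). User time changed only slightly (1m45.409s vs. 1m41.350s).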
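
The comparison can be quantified by converting the `real` times reported above into seconds. The sketch below does that arithmetic; the helper name `to_seconds` is introduced here for illustration, and it assumes the `NmS.SSSs` format shown in the answers.

```python
import re

def to_seconds(stamp):
    """Convert a `time` string such as '3m59.904s' into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

without_persist = to_seconds("3m59.904s")  # Exercise 7, real time
with_persist = to_seconds("2m38.377s")     # Exercise 8, real time
saved = without_persist - with_persist
fraction_saved = saved / without_persist
print(f"saved {saved:.3f} s ({fraction_saved:.1%} of the original run)")
```

A single pair of runs on a shared cluster can vary for reasons unrelated to caching, so repeating each run a few times would give a more reliable comparison.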