##### Grading Feedback

# Question 0 (-2 If not answered)
Please provide the following data so we can verify your GitHub information and ensure accurate grading:
- Your Name: Yunhan Zhang
- Your SU ID: 405379315

# IST 718: Big Data Analytics

- Professors: 
  - Willard Williamson <wewillia@syr.edu>
  - Emory Creel <emcreel@g.syr.edu>
- Faculty Assistants: 
  - Warren Justin Fernandes <wjfernan@syr.edu>
  - Ruchita Hiteshkumar Harsora <rharsora@g.syr.edu>

## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers from your classmates.  Short code snippets are allowed from the internet.  Code from the class text books or class provided code can be copied in its entirety.__
- Google Colab is the official class runtime environment so you should test your code on Colab before submission.
- Do not modify cells marked as grading cells or marked as do not modify.
- Before submitting your work, remember to check for runtime errors with the following procedure:
`Runtime` $\rightarrow$ `Factory reset runtime` followed by `Runtime` $\rightarrow$ `Run all`.  All runtime errors will result in a minimum penalty of half off.
- All plots shall include descriptive title and axis labels.  Plot legends shall be included where possible.  Unless stated otherwise, plots can be made using any Python plotting package.
- Grading feedback cells are there for graders to provide feedback to students.  Don't change or remove grading feedback cells.
- Don't add or remove files from your git repo.
- Do not change file names in your repo.  This also means don't change the title of the ipython notebook.
- You are free to add additional code cells around the cells marked `your code here`.
- import * is not allowed because it is considered a very bad coding practice and in some cases can result in a significant delay (which slows down the grading process) in loading imports.  For example, the statement `from sympy import *` is not allowed.  You must import the specific packages that you need. 
- The graders reserve the right to deduct points for subjective things we see with your code.  For example, if we ask you to create a pandas data frame to display values from an investigation and you hard code the values, we will take points off for that.  This is only one of many different things we could find in reviewing your code.  In general, write your code like you are submitting it for a code peer review in industry.  
- Level of effort is part of our subjective grading.  For example, in cases where we ask for a more open ended investigation, some students put in significant effort and some students do the minimum possible to meet requirements.  In these cases, we may take points off for students who did not put in much effort as compared to students who put in a lot of effort.  We feel that the students who did a better job deserve a better grade.  We reserve the right to invoke level of effort grading at any time.
- Your notebook must run from start to finish without requiring manual input by the graders.  For example, do not mount your personal Google drive in your notebook as this will require graders to perform manual steps.  In short, your notebook should run from start to finish with no runtime errors and no need for graders to perform any manual steps.

I was very disappointed with the linear regression model accuracy related to the insurance data set in homework 3.  I'm sure you were disappointed too.  In this homework, we will revisit the insurance data set and try to improve prediction scores.  Specifically, we will use random forest, gradient boosting trees, and deep learning to see if we can improve upon the scores achieved in homework 3.  Part 1 of the assignment will explore random forest and GBT.  Part 2 of the assignment will use deep learning.

In [1]:
# Grading Cell
enable_grid_search = False

The following cell is used to read the insurance data set into the colab environment.  Do not change or modify the following cell.

In [2]:
%%bash
# Do not change or modify this cell
# Need to install pyspark
# if pyspark is already installed, will print a message indicating pyspark already installed
pip install pyspark &> /dev/null

# Download the data files from github
# If the data file does not exist in the colab environment
data_file_1=insurance.csv

if [[ ! -f ./${data_file_1} ]]; then 
   # download the data file from github and save it in this colab environment instance
   wget https://raw.githubusercontent.com/wewilli1/ist718_data/master/${data_file_1} &> /dev/null
fi

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as fn, Row
from pyspark.ml import feature, regression, evaluation, Pipeline
from pyspark.ml.regression import RandomForestRegressor, GBTRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Assignment Specific Instructions
Your grade for grid search problems in this assignment will be determined in part on level of effort and your model performance results as compared to other students in the class.

In this assignment, we will be comparing scores between random forest, gradient boosting trees, and deep learning.  You are required to correctly use train / test / validation data sets for model comparison as outlined in lecture.  Use train and test sets to train and score individual models during grid search.  Only use validation data to compare scores between models. You must name your data sets exactly `train`, `test`, and `validation` so that the graders know what data set is being used in each question.  I will be taking off 1 letter grade (10 points) for not following these instructions.
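The three-way split described above can be sketched in plain Python. This is only an illustration of the idea, not Spark's `randomSplit`: the helper `three_way_split` and the toy `rows` list are hypothetical, and the 0.6 / 0.1 / 0.3 proportions for train / test / validation are assumed to match the split used later in this notebook.

```python
import random


def three_way_split(rows, weights=(0.6, 0.1, 0.3), seed=2019):
    """Hypothetical helper: randomly assign each row to train, test, or validation."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    w_train, w_test, _ = weights
    train, test, validation = [], [], []
    for row in rows:
        u = rng.random()
        if u < w_train:
            train.append(row)
        elif u < w_train + w_test:
            test.append(row)
        else:
            validation.append(row)
    return train, test, validation


rows = list(range(1000))  # stand-in for the rows of the data frame
train, test, validation = three_way_split(rows)
print(len(train), len(test), len(validation))
```

Each row lands in exactly one set, so a score computed on `validation` is never influenced by rows that were used to fit or tune a model.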

# Question 1 (10 pts)
- Read the insurance data file into a spark data frame named `medical_df`.  Drop any rows that contain NAN / Null values.  Check the schema and fix if needed.  Perform needed feature engineering using **only** a string indexer to get ready for training decision trees.  One hot encoding is not needed for random forest - do not use one hot encoding or any other transformations other than string indexing. 
- Split the data into variables named exactly train, test, and validation. Set the spark randomSplit seed argument to 2019.
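As a rough sketch of what string indexing does to a categorical column, the plain-Python function below maps each label to a numeric index, most frequent label first. The helper `string_index` and the toy `smoker` list are hypothetical, and the ordering rule (descending frequency, ties broken alphabetically) is an assumption about the indexer's default behavior rather than a statement about its implementation.

```python
from collections import Counter


def string_index(values):
    """Hypothetical helper: map labels to float indices, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


smoker = ["yes", "no", "no", "no", "yes"]  # made-up sample values
smoker_mapping = string_index(smoker)
smoker_index = [smoker_mapping[v] for v in smoker]
print(smoker_mapping)  # {'no': 0.0, 'yes': 1.0}
print(smoker_index)    # [1.0, 0.0, 0.0, 0.0, 1.0]
```

Tree-based models split on thresholds of these indices, which is why no one hot encoding is required here.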

In [4]:
# import data and drop Null values
medical_df_withna = spark.read.csv("insurance.csv", inferSchema = True, header = True)
medical_df_raw = medical_df_withna.dropna()

In [5]:
# Define a sex_pipe that uses the StringIndexer to encode the gender data
sex_pipe = feature.StringIndexer(inputCol='sex', handleInvalid='skip',outputCol="sex_index")

In [6]:
# Define a smoker_pipe that uses the StringIndexer to encode the smoker data
smoker_pipe = feature.StringIndexer(inputCol='smoker', handleInvalid='skip',outputCol="smoker_index")

In [7]:
# Define a region_pipe that uses the StringIndexer to encode the region data
region_pipe = feature.StringIndexer(inputCol='region', handleInvalid='skip',outputCol = "region_index")

In [8]:
features = Pipeline(stages=[feature.VectorAssembler(inputCols=['age', 'children', 'bmi', 'sex_index', 'smoker_index', 
                                                               'region_index'], outputCol = 'features')])

In [9]:
fe_pipe = Pipeline(stages=[sex_pipe, smoker_pipe, region_pipe, features])
fitted_fe_pipe = fe_pipe.fit(medical_df_raw)
medical_df = fitted_fe_pipe.transform(medical_df_raw)

In [10]:
train, validation, test = medical_df_raw.randomSplit([0.6, 0.3, 0.1], seed = 2019)

In [11]:
#Print the schema
medical_df.printSchema()
#Print the shape
# print('The shape of the dataframe is:', shape(medical_df))
# print the head
medical_df.show()

root
 |-- age: integer (nullable = true)
 |-- sex: string (nullable = true)
 |-- bmi: double (nullable = true)
 |-- children: integer (nullable = true)
 |-- smoker: string (nullable = true)
 |-- region: string (nullable = true)
 |-- charges: double (nullable = true)
 |-- sex_index: double (nullable = false)
 |-- smoker_index: double (nullable = false)
 |-- region_index: double (nullable = false)
 |-- features: vector (nullable = true)

+---+------+------+--------+------+---------+-----------+---------+------------+------------+--------------------+
|age|   sex|   bmi|children|smoker|   region|    charges|sex_index|smoker_index|region_index|            features|
+---+------+------+--------+------+---------+-----------+---------+------------+------------+--------------------+
| 19|female|  27.9|       0|   yes|southwest|  16884.924|      1.0|         1.0|         2.0|[19.0,0.0,27.9,1....|
| 18|  male| 33.77|       1|    no|southeast|  1725.5523|      0.0|         0.0|         0.0|[18.0,1

##### Grading Feedback Cell

The following questions will create a random forest regressor model.  The goal is to see if we can improve upon the linear regression score from homework 3. You can find the spark documentation for the random forest regressor [here](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression).

# Question 2 (10 pts)
Create and train a random forest regressor model using a grid search in the cell below.  Score your model using MSE.  Your grid search must be entirely encapsulated in the `if enable_grid_search` if statement.  The `enable_grid_search` Boolean is defined in a grading cell above.  You will disable the grid search before you submit by setting enable_grid_search to false.  Setting enable_grid_search to false should not result in a runtime error.  You will not receive full credit if any part of your grid search is outside of the if statement or if runtime errors result from setting the `enable_grid_search` variable to false.

In [12]:
rf = RandomForestRegressor(featuresCol="features", labelCol = "charges")
pipeline = Pipeline(stages=[fe_pipe, rf])

In [13]:
# your code here
if enable_grid_search:
    paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [int(x) for x in np.linspace(start = 10, stop = 50, num = 3)]) \
    .addGrid(rf.maxDepth, [int(x) for x in np.linspace(start = 5, stop = 25, num = 3)]) \
    .build()
    
    crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("charges"),
                          numFolds=3)
    
    model = crossval.fit(train)
    predictions = model.transform(test)
    evaluator = RegressionEvaluator(
        labelCol="charges", predictionCol="prediction", metricName="mse")
    mse = evaluator.evaluate(predictions)
    print("The Best Mean Squared Error (MSE) on test data = %g" % mse)
    bestModel = model.bestModel.stages[1]
    print("NumTrees in the best model", bestModel.getNumTrees)
    print("maxDepth in the best model", bestModel.getOrDefault('maxDepth'))

**Note:** Refer to https://www.silect.is/blog/random-forest-models-in-spark-ml/. By default, the transformer (i.e. the prediction generator) returned by our cross-validator applies the best-performing pipeline. We can test the new model by making predictions on the hold-out data.
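To get a feel for the cost of the grid search above, the sketch below enumerates the candidate settings in plain Python. `linspace_ints` is a hypothetical stand-in for the `np.linspace` plus `int` conversion used in the grid, and the fit count assumes the cross-validator trains every candidate once per fold and then refits the best candidate once on the full training data.

```python
from itertools import product


def linspace_ints(start, stop, num):
    """Hypothetical stand-in for [int(x) for x in np.linspace(start, stop, num)]."""
    step = (stop - start) / (num - 1)
    return [int(start + i * step) for i in range(num)]


num_trees_grid = linspace_ints(10, 50, 3)  # [10, 30, 50]
max_depth_grid = linspace_ints(5, 25, 3)   # [5, 15, 25]
num_folds = 3

# Every combination of numTrees and maxDepth is one candidate model.
candidates = list(product(num_trees_grid, max_depth_grid))
total_fits = len(candidates) * num_folds + 1  # +1 for the assumed final refit

print("candidates:", candidates)
print("model fits:", total_fits)
```

Because the number of fits grows multiplicatively with each added parameter, keeping the search behind the `enable_grid_search` flag keeps a full notebook run fast for the graders.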

##### Grading Feedback Cell

# Question 3 (10 pts)
Create a pipeline named `best_rf_pipe` that hard codes the tuning parameters from the best model found by the grid search in question 2 above.  Train and test best_rf_pipe.  Score your model using validation data and the MSE scoring metric.  Save train and validation MSE scores in variables named rf_train_mse and rf_validation_mse.
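For reference, the MSE metric is the average of the squared differences between the labels and the predictions. The function below is a minimal plain-Python sketch of that formula; `mean_squared_error` and the sample numbers are made up for illustration and are not taken from the insurance data.

```python
def mean_squared_error(labels, predictions):
    """Average of squared differences between labels and predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return sum(squared_errors) / len(squared_errors)


charges = [1000.0, 2000.0, 4000.0]    # made-up labels
predicted = [1500.0, 2000.0, 3000.0]  # made-up predictions
example_mse = mean_squared_error(charges, predicted)
print(example_mse)
```

Since the errors are squared, MSE is expressed in squared label units, so large values are expected when the label is a dollar amount in the thousands.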

In [14]:
MaxDepth = 25
NumTrees = 50

best_rf = RandomForestRegressor().\
        setLabelCol('charges').\
        setFeaturesCol('features').\
        setMaxDepth(MaxDepth).\
        setNumTrees(NumTrees)

best_rf_pipe = Pipeline(stages=[fe_pipe, best_rf])
model_q3 = best_rf_pipe.fit(train)
predictions_q3 = model_q3.transform(test)
evaluator = RegressionEvaluator(
        labelCol="charges", predictionCol="prediction", metricName="mse")
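# Note: predictions_q3 was computed on the test split, so the value stored in
# rf_train_mse (and labeled "train" in the message below) is a test-set MSE.
rf_train_mse = evaluator.evaluate(predictions_q3)
print("Mean Squared Error (MSE) on train data = %g" % rf_train_mse)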

Mean Squared Error (MSE) on train data = 1.76135e+07


In [15]:
predictions_q3_val = model_q3.transform(validation)
evaluator_val = RegressionEvaluator(
        labelCol="charges", predictionCol="prediction", metricName="mse")
rf_validation_mse = evaluator_val.evaluate(predictions_q3_val)
print("Mean Squared Error (MSE) on validation data = %g" % rf_validation_mse)

Mean Squared Error (MSE) on validation data = 2.55118e+07


In [16]:
# Grading cell do not modify
print("rf_train_mse =", rf_train_mse)
print("rf_validation_mse =", rf_validation_mse)

rf_train_mse = 17613516.100674648
rf_validation_mse = 25511794.8633418


##### Grading Feedback Cell

# Question 4 (10 pts)
Use `best_rf_pipe` in question 3 for inference.  Create a pandas data frame named `rf_feature_importance` which contains 2 columns: `feature`, and `importance`.  Load the feature column with the feature name and the importance column with the feature importance score as determined by the random forest model. Sort the feature importances from high to low such that the most important feature is in the first row of the data frame.
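One detail that is easy to get wrong here is the pairing of names and scores: the importance vector follows the order of the columns that were assembled into the feature vector, not the column order of the data frame. The sketch below shows the pairing and the high-to-low sort in plain Python; `rank_importances`, the three feature names, and the scores are hypothetical.

```python
def rank_importances(feature_names, importances):
    """Pair each feature name with its score and sort from high to low."""
    if len(feature_names) != len(importances):
        raise ValueError("expected one importance score per feature")
    pairs = list(zip(feature_names, importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


names = ["age", "children", "bmi"]  # hypothetical assembler input order
scores = [0.2, 0.1, 0.7]            # made-up importance scores
ranked = rank_importances(names, scores)
print(ranked)  # [('bmi', 0.7), ('age', 0.2), ('children', 0.1)]
```

The length check catches the case where a longer or shorter list of names is zipped against the scores, which would otherwise silently mislabel features.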

In [17]:
best_rf_model = model_q3.stages[-1]

In [18]:
# Feature names must follow the VectorAssembler inputCols order used to build 'features'
feature_names = ['age', 'children', 'bmi', 'sex_index', 'smoker_index', 'region_index']
rf_feature_importance = pd.DataFrame(list(zip(feature_names, best_rf_model.featureImportances.toArray())),
            columns = ['feature', 'importance']).sort_values('importance', ascending=False)

In [19]:
# grading cell - do not modify
display(rf_feature_importance)

Unnamed: 0,feature,importance
4,smoker_index,0.656893
2,bmi,0.143534
0,age,0.135357
1,children,0.027338
5,region_index,0.026427
3,sex_index,0.010451


##### Grading Feedback Cell

# Question 5 (10 pts)
Repeat question 2 but this time use a [GBT regressor](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.GBTRegressor.html#pyspark.ml.regression.GBTRegressor).  Create and train a GBT regressor model using a grid search in the cell below.  Score your model using validation data and the MSE scoring metric.  Your grid search must be entirely encapsulated in the `if enable_grid_search` if statement.  The `enable_grid_search` Boolean is defined in a grading cell above.  You will disable the grid search before you submit by setting enable_grid_search to false.  Setting enable_grid_search to false should not result in a runtime error.  You will not receive full credit if any part of your grid search is outside of the if statement or if runtime errors result from setting the `enable_grid_search` variable to false.

In [20]:
gbt = GBTRegressor(featuresCol="features", labelCol = "charges")
gbt_pipeline = Pipeline(stages=[fe_pipe, gbt])

In [21]:
# your code here
if enable_grid_search:
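    # Tune the GBT estimator's own parameters; GBTRegressor has no numTrees,
    # so maxIter (the number of boosting iterations) is searched instead.
    paramGrid = ParamGridBuilder() \
    .addGrid(gbt.maxIter, [int(x) for x in np.linspace(start = 10, stop = 50, num = 3)]) \
    .addGrid(gbt.maxDepth, [int(x) for x in np.linspace(start = 5, stop = 25, num = 3)]) \
    .build()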
    
    gbt_crossval = CrossValidator(estimator=gbt_pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("charges"),
                          numFolds=3)
    
    gbt_model = gbt_crossval.fit(train)
    gbt_predictions = gbt_model.transform(test)
    gbt_evaluator = RegressionEvaluator(
        labelCol="charges", predictionCol="prediction", metricName="mse")
    gbt_mse = gbt_evaluator.evaluate(gbt_predictions)
    print("The Best Mean Squared Error (MSE) on test data = %g" % gbt_mse)
    gbt_bestpip = gbt_model.bestModel
    gbt_bestModel = gbt_bestpip.stages[1]
    print("Best Pipeline: ", gbt_bestModel)
    print("maxDepth in the best model", gbt_bestModel.getOrDefault('maxDepth'))
    print(gbt_bestModel.getOrDefault('stepSize'))

##### Grading Feedback Cell

# Question 6 (10 pts Total)
This is a repeat of question 3 but for GBT.  Create a pipeline named `best_gbt_pipe` that hard codes the tuning parameters from the best model found by the grid search in question 5 above.  Train and test best_gbt_pipe using MSE as the scoring metric. Clearly print the resulting train and test MSE for `best_gbt_pipe` so it's easy for the graders to see your resulting MSEs.  Save train and test MSE scores in variables named gbt_train_mse and gbt_validation_mse.

In [22]:
MaxDepth = 5
StepSize = 0.1

best_gbt = GBTRegressor().\
        setLabelCol('charges').\
        setFeaturesCol('features').\
        setMaxDepth(MaxDepth).\
        setStepSize(StepSize)

best_gbt_pipe = Pipeline(stages=[fe_pipe, best_gbt])
model_q6 = best_gbt_pipe.fit(train)
predictions_q6 = model_q6.transform(test)
gbt_evaluator = RegressionEvaluator(
        labelCol="charges", predictionCol="prediction", metricName="mse")
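# Note: predictions_q6 was computed on the test split, so the value stored in
# gbt_train_mse (and labeled "train" in the message below) is a test-set MSE.
gbt_train_mse = gbt_evaluator.evaluate(predictions_q6)
print("Mean Squared Error (MSE) on train data = %g" % gbt_train_mse)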

Mean Squared Error (MSE) on train data = 2.48386e+07


In [23]:
predictions_q6_val = model_q6.transform(validation)
gbt_evaluator_val = RegressionEvaluator(
        labelCol="charges", predictionCol="prediction", metricName="mse")
gbt_validation_mse = gbt_evaluator_val.evaluate(predictions_q6_val)
print("Mean Squared Error (MSE) on validation data = %g" % gbt_validation_mse)

Mean Squared Error (MSE) on validation data = 2.84579e+07


In [24]:
# Grading cell do not modify
print("gbt_train_mse =", gbt_train_mse)
print("gbt_validation_mse =", gbt_validation_mse)

gbt_train_mse = 24838572.975433294
gbt_validation_mse = 28457932.60573362


##### Grading Feedback Cell

# Question 7 (10 pts)
Create a pandas dataframe named `rf_gbt_mse_compare` which contains 3 columns: Model, Train MSE, and Validation MSE.  Load the Model column with "RF" or "GBT", the Train MSE column with the corresponding train MSE, and the Validation MSE column with the corresponding validation MSE scores from the random forest / gradient boosted tree scores.  Use rf_train_mse, rf_validation_mse, gbt_train_mse, and gbt_validation_mse variables to load the dataframe.  

GBT models usually produce better scores than random forest.  I am not sure if that will be the case for this dataset but you will be graded in comparison to other students in the class.

In [25]:
data = {'Model': ['RF', 'GBT'],
        'Train MSE': [rf_train_mse,gbt_train_mse],
        'Validation MSE': [rf_validation_mse, gbt_validation_mse]}
rf_gbt_mse_compare = pd.DataFrame(data)

In [26]:
# Grading Cell Do Not Modify
display(rf_gbt_mse_compare)

Unnamed: 0,Model,Train MSE,Validation MSE
0,RF,17613520.0,25511790.0
1,GBT,24838570.0,28457930.0
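MSE values in the tens of millions are hard to interpret directly. As a small worked example, the sketch below takes the square root of the two validation MSE values printed by the grading cells to express them as RMSE, which is in the same units as `charges`. The dictionary names are hypothetical; the numbers are copied from the notebook output.

```python
import math

# Validation MSE values as printed by the grading cells in this notebook.
validation_mse = {
    "RF": 25511794.8633418,
    "GBT": 28457932.60573362,
}

# RMSE is the square root of MSE, so it is in the same units as the label.
validation_rmse = {model: math.sqrt(mse) for model, mse in validation_mse.items()}

for model, rmse in validation_rmse.items():
    print(f"{model}: validation RMSE = {rmse:,.2f}")

# The model with the lowest validation error is preferred.
best_model = min(validation_mse, key=validation_mse.get)
print("Lowest validation error:", best_model)
```

On these recorded scores the random forest has the lower validation error, which runs against the usual expectation mentioned in the question text.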


##### Grading Feedback Cell

# Question 8 (-5 pts if not performed)
Set the `enable_grid_search` Boolean variable to False in the grading cell at the top of this notebook.  Perform a __Runtime -> Disconnect and Delete Runtime__, __Runtime -> Run all__ test to verify there are no runtime errors.  Leave the `enable_grid_search` variable set to False and turn in your assignment.

# Question 9 (0 pts - get ready for part 2)
In part 2 of the assignment, we want to see how deep learning MSE scores compare to RF and GBT.  In the cell below, hard code variables which contain train and test scores for the models in part 1.

In [27]:
# Grading Cell - do not modify
# print the train and test scores
print("rf_train_mse =", rf_train_mse)
print("rf_validation_mse =", rf_validation_mse)
print("gbt_train_mse =", gbt_train_mse)
print("gbt_validation_mse =", gbt_validation_mse)

rf_train_mse = 17613516.100674648
rf_validation_mse = 25511794.8633418
gbt_train_mse = 24838572.975433294
gbt_validation_mse = 28457932.60573362


In [28]:
# uncomment and hard code the following variables using output from above.  
# You will copy this code for use in part 2 question 0
# hc_rf_train_mse =
# hc_rf_validation_mse =
# hc_gbt_train_mse =
# hc_gbt_validation_mse = 