# IST 718: Big Data Analytics

- Professor: Daniel Acuna <deacuna@syr.edu>

## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers either from your classmates or from the internet__
- You can put the homework files anywhere you want in your http://jupyterhub.ischool.syr.edu/ workspace but _do not change_ the file names. The TAs and the professor use these names to grade your homework.
- Remove or comment out code that contains `raise NotImplementedError`. This is mainly to make the `assert` statement fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` and `np.testing.` statements) are used to grade your answers. **However, the professor and TAs will use __additional__ tests for your answer. Think about cases where your code should still run, even if it passes all the tests you see.**
- Before downloading and submitting your work through Blackboard, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).
- Good luck!

In [1]:
# load these packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyspark
from pyspark.ml import Pipeline, classification, evaluation, feature, regression
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as fn

# create (or reuse) the Spark session and its context
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Part 1: Random Forest and gradient boosted trees

In these questions, we will examine the famous [Auto dataset](https://vincentarelbundock.github.io/Rdatasets/doc/ISLR/Auto.html). With this dataset, the goal is to predict the miles per gallon (`mpg`) performance based on characteristics of the car such as number of cylinders (`cylinders`), engine displacement (`displacement`), horsepower of the engine (`horsepower`), weight of the car (`weight`), time to accelerate from 0 to 60 mph (`acceleration`), year of the model (`year`), and origin (`origin`).

In [2]:
# read-only
mpg_df = spark.read.csv('Auto.csv', header=True, inferSchema=True).\
    drop('_c0').\
    withColumn('horsepower2', fn.col('horsepower').cast('int')).\
    drop('horsepower').\
    withColumnRenamed('horsepower2', 'horsepower').\
    dropna()
training_df, validation_df, testing_df = mpg_df.randomSplit([0.6, 0.3, 0.1], seed=0)
mpg_df.printSchema()

root
 |-- mpg: double (nullable = true)
 |-- cylinders: integer (nullable = true)
 |-- displacement: double (nullable = true)
 |-- weight: integer (nullable = true)
 |-- acceleration: double (nullable = true)
 |-- year: integer (nullable = true)
 |-- origin: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- horsepower: integer (nullable = true)



# Question 1: (10 pts)

Create three pipelines that contain three different random forests that take in all numeric features from `mpg_df` (`cylinders`, `displacement`, `horsepower`, `weight`, `acceleration`, `year`, and `origin`) to predict (`mpg`). **Set the `seed` parameter of the random forest to 0.** Fit these pipelines to the training data (`training_df`):

- `pipe_rf1`: Random forest with `maxDepth=1` and `numTrees=60`
- `pipe_rf2`: Random forest with `maxDepth=3` and `numTrees=40`
- `pipe_rf3`: Random forest with `maxDepth=6`, `numTrees=20`
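
The three pipelines differ only in `maxDepth` and `numTrees`, so one way to avoid repeating the same code three times is to keep the settings in a dictionary and fit the pipelines in a loop. The sketch below is only an illustration: `RF_SPECS`, `FEATURE_COLS`, and `fit_forests` are hypothetical names that do not appear in the assignment, and the graded variables `pipe_rf1`, `pipe_rf2`, and `pipe_rf3` would still have to be assigned from the result.

```python
# Hypothetical helper names; only the parameter values come from the question.
RF_SPECS = {
    "pipe_rf1": {"maxDepth": 1, "numTrees": 60},
    "pipe_rf2": {"maxDepth": 3, "numTrees": 40},
    "pipe_rf3": {"maxDepth": 6, "numTrees": 20},
}

FEATURE_COLS = ["cylinders", "displacement", "horsepower", "weight",
                "acceleration", "year", "origin"]


def fit_forests(train_df, specs=RF_SPECS):
    """Fit one VectorAssembler + RandomForestRegressor pipeline per spec."""
    # Imported here so the specs above can be inspected without a Spark install.
    from pyspark.ml import Pipeline, feature, regression

    assembler = feature.VectorAssembler(inputCols=FEATURE_COLS,
                                        outputCol="features")
    fitted = {}
    for name, params in specs.items():
        forest = regression.RandomForestRegressor(
            labelCol="mpg", featuresCol="features", seed=0, **params)
        fitted[name] = Pipeline(stages=[assembler, forest]).fit(train_df)
    return fitted
```

With a Spark session available, `fitted = fit_forests(training_df)` would return a dictionary of fitted pipelines, for example `pipe_rf1 = fitted["pipe_rf1"]`.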

In [3]:
# create the fitted pipelines `pipe_rf1`, `pipe_rf2`, and `pipe_rf3` here
va = feature.VectorAssembler(inputCols=['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin'], outputCol='features') # Created Vector Assembler stage for converting input features into a single feature vector as output
rf1 = regression.RandomForestRegressor(labelCol='mpg', featuresCol='features', maxDepth=1, numTrees=60, seed=0) # Created RandomForestRegressor stage with maxDepth=1 and numTrees=60
rf2 = regression.RandomForestRegressor(labelCol='mpg', featuresCol='features', maxDepth=3, numTrees=40, seed=0) # Created RandomForestRegressor stage with maxDepth=3 and numTrees=40
rf3 = regression.RandomForestRegressor(labelCol='mpg', featuresCol='features', maxDepth=6, numTrees=20, seed=0) # Created RandomForestRegressor stage with maxDepth=6 and numTrees=20
pipe1 = Pipeline(stages=[va, rf1]) # Created two stage pipeline with the vector assembler and first random forest
pipe2 = Pipeline(stages=[va, rf2]) # Created two stage pipeline with the vector assembler and second random forest
pipe3 = Pipeline(stages=[va, rf3]) # Created two stage pipeline with the vector assembler and third random forest
pipe_rf1 = pipe1.fit(training_df) # Fit first pipeline to the training dataset
pipe_rf2 = pipe2.fit(training_df) # Fit second pipeline to the training dataset
pipe_rf3 = pipe3.fit(training_df) # Fit third pipeline to the training dataset

In [4]:
# tests for 10 pts
np.testing.assert_equal(type(pipe_rf1.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(pipe_rf2.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(pipe_rf3.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(pipe_rf1.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf2.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf3.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf1.transform(training_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_equal(type(pipe_rf2.transform(training_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_equal(type(pipe_rf3.transform(training_df)), pyspark.sql.dataframe.DataFrame)

# Question 2 (10 pts)

Use the following evaluator to compute the $R^2$ of the models on validation data. Assign the $R^2$ of the three models to `R2_1`, `R2_2`, and `R2_3`, respectively, and print the performance. Assign the best pipeline based on validation performance to a variable `best_model`.
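
As a reminder of what the evaluator reports, the sketch below computes $R^2$ by hand, assuming the usual definition $R^2 = 1 - SS_{res}/SS_{tot}$. The function `r_squared` and the toy numbers are made up for illustration and are not part of the assignment.

```python
def r_squared(labels, predictions):
    """Return 1 - SS_res / SS_tot for two equally long lists of numbers."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


# Toy mpg values (made up) and three sets of predictions.
labels = [18.0, 24.0, 30.0]
r2_perfect = r_squared(labels, labels)               # every prediction exact
r2_mean = r_squared(labels, [24.0, 24.0, 24.0])      # always predict the mean
r2_close = r_squared(labels, [20.0, 24.0, 28.0])     # each end off by 2 mpg

print(r2_perfect, r2_mean, r2_close)
```

A model that always predicts the mean label scores 0, and a model that does worse than that on held-out data gets a negative $R^2$, so the value is bounded above by 1 but not below by 0.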

In [5]:
evaluator = evaluation.RegressionEvaluator(labelCol='mpg', metricName='r2')
# use it as follows:
#   evaluator.evaluate(fitted_pipeline.transform(df)) -> R2

In [6]:
R2_1 = evaluator.evaluate(pipe_rf1.transform(validation_df)) # Used the defined evaluator to compute the R^2 of the first model on validation data
R2_2 = evaluator.evaluate(pipe_rf2.transform(validation_df)) # Used the defined evaluator to compute the R^2 of the second model on validation data
R2_3 = evaluator.evaluate(pipe_rf3.transform(validation_df)) # Used the defined evaluator to compute the R^2 of the third model on validation data
print(f'Pipe 1: {R2_1}, Pipe 2: {R2_2}, Pipe 3: {R2_3}') # Returned R^2 value for all three models
best_model = pipe_rf3 # Assigned the third model as the best model because it had the highest R^2 value

Pipe 1: 0.6356640531609501, Pipe 2: 0.8222168008753123, Pipe 3: 0.8833324964226545
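
Instead of hard-coding the winner, the selection could also be made programmatically. This is a minimal sketch using the validation scores printed above; the dictionary `validation_r2` and the name `best_name` are hypothetical.

```python
# Validation R^2 values copied from the output above.
validation_r2 = {
    "pipe_rf1": 0.6356640531609501,
    "pipe_rf2": 0.8222168008753123,
    "pipe_rf3": 0.8833324964226545,
}

# Pick the key whose score is largest.
best_name = max(validation_r2, key=validation_r2.get)
print(best_name)
```

Keying the dictionary by the fitted pipelines' names keeps the choice in sync with the scores if the models are retrained.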


In [7]:
# tests for 10 pts
np.testing.assert_equal(type(best_model.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(best_model.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(best_model.transform(validation_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_array_less(R2_1, 1.)
np.testing.assert_array_less(0.5, R2_1)
np.testing.assert_array_less(R2_2, 1.)
np.testing.assert_array_less(0.5, R2_2)
np.testing.assert_array_less(R2_3, 1.)
np.testing.assert_array_less(0.5, R2_3)

# Question 3: 5 pts

Compute the $R^2$ of the model on testing data, print it, and assign it to variable `R2_best`

In [8]:
# create R2_best below
R2_best = evaluator.evaluate(best_model.transform(testing_df)) # Used the defined evaluator to compute the R^2 of the best model on test data
print(f'R2_best: {R2_best}') # Returned R^2 value

R2_best: 0.8116746291631238


In [9]:
# tests for 5 pts
np.testing.assert_array_less(R2_best, 1.)
np.testing.assert_array_less(0.5, R2_best)

# Question 4: 5 pts

Using the parameters of the best model, create a new pipeline called `final_model` and fit it to the entire data (`mpg_df`)

In [10]:
# create the fitted pipeline `final_model` here
final_model = pipe3.fit(mpg_df) # Used the parameters of the best model to create a new pipeline called final_model fitted to the entire data

In [11]:
# tests for 10 pts
np.testing.assert_equal(type(final_model.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(final_model.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(final_model.transform(mpg_df)), pyspark.sql.dataframe.DataFrame)

# Question 5: 10 pts

Create a pandas dataframe `feature_importance` with the columns `feature` and `importance` which contains the names of the features (`cylinders`, `displacement`, etc.) and their feature importances as determined by the random forest of the final model. Sort the dataframe by `importance` in descending order.

In [12]:
# create feature_importance below
mpg_df_columns = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin'] # Defined column names for the features that were used in the model
feature_importance = pd.DataFrame(list(zip(mpg_df_columns, final_model.stages[-1].featureImportances.toArray())), columns = ['feature', 'importance']).sort_values('importance', ascending=False) # Created pandas dataframe displaying feature importance for every independent feature that was used in the model

In [13]:
# display it here
feature_importance # Returned pandas dataframe

| | feature | importance |
|---|---|---|
| 1 | displacement | 0.379758 |
| 2 | horsepower | 0.172883 |
| 3 | weight | 0.143796 |
| 0 | cylinders | 0.134298 |
| 5 | year | 0.133876 |
| 4 | acceleration | 0.024922 |
| 6 | origin | 0.010467 |
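
The pairing and sorting behind this table can be sketched without Spark or pandas. The importances below are the rounded values from the output above, listed in the order the features were passed to the `VectorAssembler`; the names `ranked` and `total` are only for illustration.

```python
# Feature names in VectorAssembler order, with the importances shown above.
feature_names = ["cylinders", "displacement", "horsepower", "weight",
                 "acceleration", "year", "origin"]
importances = [0.134298, 0.379758, 0.172883, 0.143796,
               0.024922, 0.133876, 0.010467]

# Pair each name with its importance and sort from most to least important.
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)

# The rounded importances add up to (approximately) 1.
total = sum(importances)

for name, importance in ranked:
    print(f"{name:>12}: {importance:.6f}")
```

The importance vector is positional, so the list of names must follow the assembler's `inputCols` order rather than the column order of `mpg_df`.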


In [14]:
# tests for 10 pts
assert type(feature_importance) == pd.core.frame.DataFrame
np.testing.assert_array_equal(list(feature_importance.columns), ['feature', 'importance'])

**(5 pts)** Comment below on the importance that random forest has given to each feature. Are they reasonable? Do they tell you anything valuable about the mpg dataset? Answer in the cell below

I would comment that the importances that the random forest has given to each feature are reasonable. From the model, I can tell that displacement is by far the most important feature for predicting mpg, while the acceleration and the origin of the vehicle contribute very little. Horsepower, weight, cylinders, and year fall in between at similar levels.

# Question 6:  5 pts.

Pick any of the trees from the final model and assign its `toDebugString` property to a variable `example_tree`. Print this variable and add comments to the cell describing how you think this particular tree is fitting the data
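
The debug string refers to features by position (`feature 0`, `feature 1`, ...), which makes the branches hard to read. The sketch below swaps each index for the matching column name, assuming the features were assembled in the order listed in Question 1. `name_features` is a hypothetical helper, and `sample` is a shortened stand-in for a real debug string.

```python
import re

# Assumed to match the inputCols order of the VectorAssembler.
FEATURE_COLS = ["cylinders", "displacement", "horsepower", "weight",
                "acceleration", "year", "origin"]


def name_features(debug_string, feature_names=FEATURE_COLS):
    """Replace every 'feature N' in a tree debug string with its column name."""
    return re.sub(r"feature (\d+)",
                  lambda match: feature_names[int(match.group(1))],
                  debug_string)


# Shortened stand-in for the output of toDebugString.
sample = ("If (feature 5 <= 78.5)\n"
          " If (feature 3 <= 2715.5)\n"
          "  Predict: 30.0")
readable = name_features(sample)
print(readable)
```

Note that the graded tests look for the literal text `feature 0` in `example_tree`, so a renamed copy like this is better kept in a separate variable.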

In [15]:
# create a variable example_tree with the toDebugString property of a tree from final_model.
# print this string and comment in this same cell about the branches that this tree fit
example_tree = final_model.stages[-1].trees[4].toDebugString # Created example tree using the toDebugString property
print(example_tree) # Printed Tree
# The root node of this tree splits on the model year of the car.
# Along the first branch it then splits on weight, weight again, year again, weight again,
# and finally displacement, after which a prediction is made at the leaf node.
# Displacement making the last split on this branch seems consistent with the
# feature importances that were returned.

DecisionTreeRegressionModel: uid=dtr_3e6afd3183a5, depth=6, numNodes=103, numFeatures=7
  If (feature 5 <= 78.5)
   If (feature 3 <= 2715.5)
    If (feature 3 <= 2188.5)
     If (feature 5 <= 73.5)
      If (feature 3 <= 2115.0)
       If (feature 1 <= 89.5)
        Predict: 30.0
       Else (feature 1 > 89.5)
        Predict: 26.142857142857142
      Else (feature 3 > 2115.0)
       If (feature 4 <= 15.25)
        Predict: 27.5
       Else (feature 4 > 15.25)
        Predict: 24.333333333333332
     Else (feature 5 > 73.5)
      If (feature 3 <= 2047.5)
       If (feature 4 <= 16.45)
        Predict: 28.8
       Else (feature 4 > 16.45)
        Predict: 32.78333333333333
      Else (feature 3 > 2047.5)
       If (feature 3 <= 2115.0)
        Predict: 24.0
       Else (feature 3 > 2115.0)
        Predict: 29.1875
    Else (feature 3 > 2188.5)
     If (feature 5 <= 75.5)
      If (feature 0 <= 4.5)
       If (feature 4 <= 18.25)
        Predict: 24.24
       Else (feature 4 > 18.25)
   

In [16]:
# tests for 5 points
assert type(example_tree) == str
assert 'DecisionTreeRegressionModel' in example_tree
assert 'feature 0' in example_tree
assert 'If' in example_tree
assert 'Else' in example_tree
assert 'Predict' in example_tree

# **Question 7 (5 pts)**

Gradient boosted trees are becoming increasingly popular for competitions. There is a high-performance implementation, [xgboost](https://en.wikipedia.org/wiki/XGBoost), that is particularly popular. Compare gradient boosted regression to the best model found with random forest in Question 3. Use the validation set. For GBR, use all the default parameters except make `seed=0`. Assign the pipeline and the $R^2$ of the model to `gbr_pipe` and `R2_gbr`, respectively. Does it have an amazing or disappointing $R^2$? Comment.
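
To give some intuition for how boosting differs from a random forest, here is a toy sketch of gradient boosting for squared error: each round fits a one-split tree (a stump) to the current residuals and adds a shrunken copy of it to the ensemble. This is not Spark's or xgboost's implementation. The function names, the single-feature data, and the settings `n_rounds=20` and `step=0.1` are all made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that best fits the residuals."""
    best = None
    # Skip the largest value so the right side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=20, step=0.1):
    """Start from the mean label, then repeatedly fit stumps to residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + step * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + step * sum(stump(x) for stump in stumps)


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Made-up data: one feature (e.g. a weight class) and an mpg-like label.
xs = [1, 2, 3, 4, 5, 6]
ys = [30.0, 28.0, 25.0, 20.0, 16.0, 15.0]

model = fit_boosted_stumps(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [model(x) for x in xs])
print(baseline_mse, boosted_mse)
```

Because each stump corrects the errors left by the previous ones, the training error shrinks round by round, which is also why boosted models tend to be more sensitive to the number of rounds and the step size than a random forest is to its number of trees.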

In [17]:
gb = regression.GBTRegressor(labelCol='mpg', featuresCol='features', seed=0) # Created GBTRegressor stage using 'mpg' as the label column and 'features' as the features column
pipe = Pipeline(stages=[va, gb]) # Created two stage pipeline with the vector assembler and gradient booster
gbr_pipe = pipe.fit(training_df) # Fit the defined pipeline to the training data
R2_gbr = evaluator.evaluate(gbr_pipe.transform(validation_df)) # Used the defined evaluator to compute the R^2 of the gradient boosted model on validation data

In [18]:
# test your models here
print("Performance of best RF: ", evaluator.evaluate(best_model.transform(validation_df))) # Returned the best random forest model performance
print("Performance of GBR: ", R2_gbr) # Returned the gradient boosted model performance
# The performance of GBR is relatively disappointing. It has a lower R^2 than the best RF on the validation set (0.840 vs. 0.883), but I did not adjust any of the model parameters. Perhaps if I did, it would be another story.

Performance of best RF:  0.8833324964226545
Performance of GBR:  0.8404903251080562


In [19]:
# tests for 5 pts
np.testing.assert_equal(type(gbr_pipe.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(gbr_pipe.stages[1]), regression.GBTRegressionModel)
np.testing.assert_equal(type(gbr_pipe.transform(validation_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_array_less(R2_gbr, 1.)
np.testing.assert_array_less(0.5, R2_gbr)