# IST 718: Big Data Analytics

- Professor: Daniel Acuna <deacuna@syr.edu>

## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers either from your classmates or from the internet__
- You can put the homework files anywhere you want in your http://notebook.acuna.io workspace but _do not change_ the file names. The TAs and the professor use these names to grade your homework.
- Remove or comment out code that contains `raise NotImplementedError`. This is mainly to make the `assert` statement fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` and `np.testing.` statements) are used to grade your answers. **However, the professor and TAs will use __additional__ test for your answer. Think about cases where your code should run even if it passess all the tests you see.**
- Before downloading and submitting your work through Blackboard, remember to save and press `Validate` (or go to 
`Kernel`$\rightarrow$`Restart and Run All`). 
- Good luck!

In [9]:
# load these packages
import pyspark
from pyspark.ml import feature, classification
from pyspark.ml import Pipeline, pipeline
from pyspark.sql import functions as fn
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml import feature, regression, evaluation, Pipeline
from pyspark.sql import functions as fn, Row
import matplotlib.pyplot as plt
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
import pandas as pd

We will analyze the Mid-atlantic wage dataset (https://rdrr.io/cran/ISLR/man/Wage.html). 

In [10]:
# read-only
drop_cols = ['_c0', 'logwage', 'sex', 'region']
wage_df = spark.read.csv('/datasets/ISLR/Wage.csv', header=True, inferSchema=True).drop(*drop_cols)
training_df, validation_df, testing_df = wage_df.randomSplit([0.6, 0.3, 0.1], seed=0)
wage_df.printSchema()

root
 |-- year: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- maritl: string (nullable = true)
 |-- race: string (nullable = true)
 |-- education: string (nullable = true)
 |-- jobclass: string (nullable = true)
 |-- health: string (nullable = true)
 |-- health_ins: string (nullable = true)
 |-- wage: double (nullable = true)



In [4]:
# explore the data
wage_df.limit(10).toPandas()

Unnamed: 0,year,age,maritl,race,education,jobclass,health,health_ins,wage
0,2006,18,1. Never Married,1. White,1. < HS Grad,1. Industrial,1. <=Good,2. No,75.043154
1,2004,24,1. Never Married,1. White,4. College Grad,2. Information,2. >=Very Good,2. No,70.47602
2,2003,45,2. Married,1. White,3. Some College,1. Industrial,1. <=Good,1. Yes,130.982177
3,2003,43,2. Married,3. Asian,4. College Grad,2. Information,2. >=Very Good,1. Yes,154.685293
4,2005,50,4. Divorced,1. White,2. HS Grad,2. Information,1. <=Good,1. Yes,75.043154
5,2008,54,2. Married,1. White,4. College Grad,2. Information,2. >=Very Good,1. Yes,127.115744
6,2009,44,2. Married,4. Other,3. Some College,1. Industrial,2. >=Very Good,1. Yes,169.528538
7,2008,30,1. Never Married,3. Asian,3. Some College,2. Information,1. <=Good,1. Yes,111.720849
8,2006,41,1. Never Married,2. Black,3. Some College,2. Information,2. >=Very Good,1. Yes,118.884359
9,2004,52,2. Married,1. White,2. HS Grad,2. Information,2. >=Very Good,1. Yes,128.680488


# Question 1: Codify the data using transformers (20 pts)

Create a fitted pipeline to the entire data `wage_df` and call it `pipe_feat`. This pipeline should codify the columns `maritl`, `race`, `education`, `jobclass`, `health`, and `health_ins`. The codification should be a combination of a `StringIndexer` and a `OneHotEncoder`. For example, for `maritl`, `StringIndexer` should create a column `maritl_index` and `OneHotEncoder` should create a column `maritl_feat`. Investigate the parameters of `StringIndexer` so that the labels are indexed alphabetically in ascending order so that, for example, the 1st index for `maritl_index` corresponds to `1. Never Married`, the 2nd index corresponds to `2. Married`, and so forth. Also, investigate the parameters of  `OneHotEncoder` so that there are no columns dropped as it is usually done for dummy variables. This is, marital status should have one column for each of the classes.

The pipeline should create a column `features` that combines `year`, `age`, and all codified columns.

In [11]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

cols = ['maritl','race','education','jobclass','health','health_ins']

indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_index".format(c),stringOrderType="alphabetAsc",handleInvalid="error")
    for c in cols
]

encoders = [
    OneHotEncoder(
        inputCol=c + '_index',
        outputCol="{0}_feat".format(c)) 
    for c in cols
]
numericCols = ['year','age']
assembler = VectorAssembler(
    inputCols=numericCols+["{0}_index".format(c) for c in cols]+["{0}_feat".format(c) for c in cols],
    outputCol="features"
)


pipe_feat = Pipeline(stages=indexers + encoders + [assembler]).fit(wage_df)


In [24]:
# create `pipe_feat` below
# YOUR CODE HERE
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator,VectorAssembler,OneHotEncoder
st1=StringIndexer(inputCol='maritl',outputCol='maritl_index',stringOrderType="alphabetAsc",handleInvalid="error")
st2=StringIndexer(inputCol='race',outputCol='race_index',stringOrderType="alphabetAsc",handleInvalid="error")
st3=StringIndexer(inputCol='education',outputCol='education_index',stringOrderType="alphabetAsc",handleInvalid="error")
st4=StringIndexer(inputCol='jobclass',outputCol='jobclass_index',stringOrderType="alphabetAsc",handleInvalid="error")
st5=StringIndexer(inputCol='health',outputCol='health_index',stringOrderType="alphabetAsc",handleInvalid="error")
st6=StringIndexer(inputCol='health_ins',outputCol='health_ins_index',stringOrderType="alphabetAsc",handleInvalid="error")

en1=OneHotEncoder(inputCol='maritl_index',outputCol='maritl_feat',dropLast=False)
en2=OneHotEncoder(inputCol='race_index',outputCol='race_feat',dropLast=False)
en3=OneHotEncoder(inputCol='education_index',outputCol='education_feat',dropLast=False)
en4=OneHotEncoder(inputCol='jobclass_index',outputCol='jobclass_feat',dropLast=False)
en5=OneHotEncoder(inputCol='health_index',outputCol='health_feat',dropLast=False)
en6=OneHotEncoder(inputCol='health_ins_index',outputCol='health_ins_feat',dropLast=False)

va=VectorAssembler(inputCols=['year','age','maritl_feat','race_feat','education_feat','jobclass_feat','health_feat'
                             ,'health_ins_feat'], outputCol='features')

pipe_feat=Pipeline(stages=[st1,st2,st3,st4,st5,st6,en1,en2,en3,en4,en5,en6,va]).fit(wage_df)
#raise NotImplementedError()

In [None]:
# create `pipe_feat` below
# YOUR CODE HERE
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator,VectorAssembler,OneHotEncoder
st1=StringIndexer(inputCol='maritl',outputCol='maritl_index',stringOrderType="alphabetAsc",handleInvalid="error")
st2=StringIndexer(inputCol='race',outputCol='race_index',stringOrderType="alphabetAsc",handleInvalid="error")
st3=StringIndexer(inputCol='education',outputCol='education_index',stringOrderType="alphabetAsc",handleInvalid="error")
st4=StringIndexer(inputCol='jobclass',outputCol='jobclass_index',stringOrderType="alphabetAsc",handleInvalid="error")
st5=StringIndexer(inputCol='health',outputCol='health_index',stringOrderType="alphabetAsc",handleInvalid="error")
st6=StringIndexer(inputCol='health_ins',outputCol='health_ins_index',stringOrderType="alphabetAsc",handleInvalid="error")

en1=OneHotEncoder(inputCol='maritl_index',outputCol='maritl_feat',dropLast=False)
en2=OneHotEncoder(inputCol='race_index',outputCol='race_feat',dropLast=False)
en3=OneHotEncoder(inputCol='education_index',outputCol='education_feat',dropLast=False)
en4=OneHotEncoder(inputCol='jobclass_index',outputCol='jobclass_feat',dropLast=False)
en5=OneHotEncoder(inputCol='health_index',outputCol='health_feat',dropLast=False)
en6=OneHotEncoder(inputCol='health_ins_index',outputCol='health_ins_feat',dropLast=False)

va=VectorAssembler(inputCols=['year','age','maritl_feat','race_feat','education_feat','jobclass_feat','health_feat'
                             ,'health_ins_feat'], outputCol='features')

pipe_feat=Pipeline(stages=[st1,st2,st3,st4,st5,st6,en1,en2,en3,en4,en5,en6,va]).fit(wage_df)



In [37]:
column = ['maritl','race','education','jobclass','health','health_ins']
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(wage_df) for column in list(set(wage_df.columns)-set(['age','year'])) ]

In [41]:
from pyspark.ml.feature import OneHotEncoderEstimator, OneHotEncoderModel

encoder = OneHotEncoderEstimator(
    inputCols=[indexers.getOutputCol() for indexer in indexers],
    outputCols=[
        "{0}_encoded".format(indexer.getOutputCol()) for indexer in indexers]
)
#col=[]
assembler = VectorAssembler(
    inputCols='year'+'age'+indexers.getOutputCols()+encoder.getOutputCols(),
    outputCol="features"
)

pipeline = Pipeline(stages=indexers + [encoder, assembler])
pipeline.fit(wage_df).transform(wage_df)

AttributeError: 'list' object has no attribute 'getOutputCol'

In [56]:
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
cat_cols=['maritl','race','education','jobclass','health','health_ins']
num_cols=['year','age']
stages = []
for categoricalCol in cat_cols:
    stringIndexer = StringIndexer(inputCol = categoricalCol, outputCol = categoricalCol + '_index')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "_feat"])
stages += [stringIndexer, encoder]
assemblerInputs = num_cols+[c + "_index" for c in cat_cols]+[c + "_feat" for c in cat_cols] 
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
pipe_feat = Pipeline(stages = stages).fit(wage_df)

In [57]:
set(type(pm) for pm in pipe_feat.stages)

{pyspark.ml.feature.OneHotEncoderModel,
 pyspark.ml.feature.StringIndexerModel,
 pyspark.ml.feature.VectorAssembler}

In [197]:
encoders[0].

OneHotEncoder_8cddda4eaecb

In [6]:
pipe_feat.transform(wage_df).toPandas().T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
year,2006,2004,2003,2003,2005,2008,2009,2008,2006,2004,...,2009,2003,2007,2006,2009,2008,2007,2005,2005,2009
age,18,24,45,43,50,54,44,30,41,52,...,50,26,35,31,31,44,30,27,27,55
maritl,1. Never Married,1. Never Married,2. Married,2. Married,4. Divorced,2. Married,2. Married,1. Never Married,1. Never Married,2. Married,...,2. Married,1. Never Married,2. Married,2. Married,2. Married,2. Married,2. Married,2. Married,1. Never Married,5. Separated
race,1. White,1. White,1. White,3. Asian,1. White,1. White,4. Other,3. Asian,2. Black,1. White,...,2. Black,2. Black,1. White,1. White,1. White,1. White,1. White,2. Black,1. White,1. White
education,1. < HS Grad,4. College Grad,3. Some College,4. College Grad,2. HS Grad,4. College Grad,3. Some College,3. Some College,3. Some College,2. HS Grad,...,2. HS Grad,3. Some College,1. < HS Grad,2. HS Grad,4. College Grad,3. Some College,2. HS Grad,1. < HS Grad,3. Some College,2. HS Grad
jobclass,1. Industrial,2. Information,1. Industrial,2. Information,2. Information,2. Information,1. Industrial,2. Information,2. Information,2. Information,...,1. Industrial,1. Industrial,1. Industrial,2. Information,2. Information,1. Industrial,1. Industrial,1. Industrial,1. Industrial,1. Industrial
health,1. <=Good,2. >=Very Good,1. <=Good,2. >=Very Good,1. <=Good,2. >=Very Good,2. >=Very Good,1. <=Good,2. >=Very Good,2. >=Very Good,...,2. >=Very Good,2. >=Very Good,1. <=Good,2. >=Very Good,2. >=Very Good,2. >=Very Good,2. >=Very Good,1. <=Good,2. >=Very Good,1. <=Good
health_ins,2. No,2. No,1. Yes,1. Yes,1. Yes,1. Yes,1. Yes,1. Yes,1. Yes,1. Yes,...,2. No,2. No,2. No,1. Yes,1. Yes,1. Yes,2. No,2. No,1. Yes,1. Yes
wage,75.0432,70.476,130.982,154.685,75.0432,127.116,169.529,111.721,118.884,128.68,...,132.488,118.884,109.834,102.87,133.381,154.685,99.6895,66.2294,87.981,90.4819
maritl_index,0,0,1,1,3,1,1,0,0,1,...,1,0,1,1,1,1,1,1,0,4


In [52]:
len(pipe_feat.transform(wage_df).first().features)

IllegalArgumentException: 'Field "maritl_index" does not exist.\nAvailable fields: year, age, maritl, race, education, jobclass, health, health_ins, wage, health_ins_index, health_ins_feat'

In [54]:
assert set(type(pm) for pm in pipe_feat.stages) == {feature.OneHotEncoderModel, feature.StringIndexerModel, feature.VectorAssembler}
#assert len(pipe_feat.transform(wage_df).first().features) == 22

In [48]:
# (20 pts)
assert set(type(pm) for pm in pipe_feat.stages) == {feature.OneHotEncoder, feature.StringIndexerModel, feature.VectorAssembler}
assert len(pipe_feat.transform(wage_df).first().features) == 22


AssertionError: 

# Question 2: (15 pts)

Create three pipelines that contain three different random forest regressions that take in all features from the `wage_df` to predict `wage`. These pipelines should have as first stage the pipeline created in question 1 and should be fitted to the training data.

- `pipe_rf1`: Random forest with `maxDepth=1` and `numTrees=60`
- `pipe_rf2`: Random forest with `maxDepth=3` and `numTrees=40`
- `pipe_rf3`: Random forest with `maxDepth=6`, `numTrees=20`

In [49]:
# create the fitted pipelines `pipe_rf1`, `pipe_rf2`, and `pipe_rf3` here
# YOUR CODE HERE
from pyspark.ml.regression import RandomForestRegressor
rf_1 = RandomForestRegressor(featuresCol="features", labelCol="wage",maxDepth=1, numTrees=60, seed=1)
rf_2 = RandomForestRegressor(featuresCol="features", labelCol="wage",maxDepth=3, numTrees=40, seed=1)
rf_3 = RandomForestRegressor(featuresCol="features", labelCol="wage",maxDepth=6, numTrees=20, seed=1)
pipe_rf1=Pipeline(stages=[pipe_feat,rf_1]).fit(training_df)
pipe_rf2=Pipeline(stages=[pipe_feat,rf_2]).fit(training_df)
pipe_rf3=Pipeline(stages=[pipe_feat,rf_3]).fit(training_df)

#raise NotImplementedError()

IllegalArgumentException: 'Field "maritl_index" does not exist.\nAvailable fields: year, age, maritl, race, education, jobclass, health, health_ins, wage, health_ins_index, health_ins_feat'

In [27]:
# tests for 15 pts
np.testing.assert_equal(type(pipe_rf1.stages[0]), pipeline.PipelineModel)
np.testing.assert_equal(type(pipe_rf2.stages[0]), pipeline.PipelineModel)
np.testing.assert_equal(type(pipe_rf3.stages[0]), pipeline.PipelineModel)
np.testing.assert_equal(type(pipe_rf1.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf2.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf3.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf1.transform(training_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_equal(type(pipe_rf2.transform(training_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_equal(type(pipe_rf3.transform(training_df)), pyspark.sql.dataframe.DataFrame)

# Question 3 (10 pts)

Use the following evaluator to compute the RMSE of the models on validation data. Print the RMSE of the three models and assign the best one (i.e., the best pipeline) to a variable `best_model`

In [28]:
evaluator = evaluation.RegressionEvaluator(labelCol='wage', metricName='rmse')
# use it as follows:
#   evaluator.evaluate(fitted_pipeline.transform(df)) -> RMSE

In [29]:
# print MSE of each model and define `best_model`
# YOUR CODE HERE
rmse1=evaluator.evaluate(pipe_rf1.transform(validation_df))
print('RMSE For Model1:',rmse1)
rmse2=evaluator.evaluate(pipe_rf2.transform(validation_df))
print('RMSE For Model2:',rmse2)
rmse3=evaluator.evaluate(pipe_rf3.transform(validation_df))
print('RMSE For Model3:',rmse3)
best_model=pipe_rf3
#raise NotImplementedError()

RMSE For Model1: 36.88620595960246
RMSE For Model2: 34.00518719700478
RMSE For Model3: 33.217025866555396


In [30]:
# tests for 10 pts
np.testing.assert_equal(type(best_model.stages[0]), pipeline.PipelineModel)
np.testing.assert_equal(type(best_model.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(best_model.transform(training_df)), pyspark.sql.dataframe.DataFrame)

# Question 4: 5 pts

Compute the RMSE of the model on testing data, print it, and assign it to variable `RMSE_best`

In [31]:
# create RMSE_best below
# YOUR CODE HERE
RMSE_best=evaluator.evaluate(pipe_rf3.transform(testing_df))
print('RMSE For best model(model3-pipe_rf3 ) on testing set:',RMSE_best)
#raise NotImplementedError()

RMSE For best model(model3-pipe_rf3 ) on testing set: 34.615015843089935


In [32]:
# tests for 5 pts
np.testing.assert_array_less(RMSE_best, 40)
np.testing.assert_array_less(30, RMSE_best)

# Question 5: 5 pts

Using the parameters of the best model, create a new pipeline called `final_model` and fit it to the entire data (`wage_df`)

In [33]:
# create final_model pipeline below
# YOUR CODE HERE
final_model=Pipeline(stages=[pipe_feat,rf_3]).fit(wage_df)
#raise NotImplementedError()

In [34]:
# tests for 5 pts
np.testing.assert_equal(type(final_model.stages[0]), pipeline.PipelineModel)
np.testing.assert_equal(type(final_model.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(final_model.transform(wage_df)), pyspark.sql.dataframe.DataFrame)

# Question 6: 30 pts

Create a pandas dataframe `feature_importance` with the columns `feature` and `importance` which contains the names of the features. Give appropriate column names such as `maritl_1._Never_Married`. You can build these feature names by using the labels from the fitted `StringIndexer` used in Question 1. Use as feature importance as determined by the random forest of the final model (`final_model`). Sort the pandas dataframe by `importance` in descending order and display.

In [210]:
print(f1)

health_ins_1._Yes


In [211]:
health_ins

'health_ins'

In [35]:
# create feature_importance below
# YOUR CODE HERE
maritl = pipe_feat.stages[0].getInputCol()
race = pipe_feat.stages[1].getInputCol()
education = pipe_feat.stages[2].getInputCol()
jobclass= pipe_feat.stages[3].getInputCol()
health = pipe_feat.stages[4].getInputCol()
health_ins = pipe_feat.stages[5].getInputCol()
#a=a.replace('index','1.Never_Married')
a = [i.replace(' ', '_') if isinstance(i, str) else i for i in pipe_feat.stages[0].labels]
b = [i.replace(' ', '_') if isinstance(i, str) else i for i in pipe_feat.stages[1].labels]
c = [i.replace(' ', '_') if isinstance(i, str) else i for i in pipe_feat.stages[2].labels]
d = [i.replace(' ', '_') if isinstance(i, str) else i for i in pipe_feat.stages[3].labels]
e = [i.replace(' ', '_') if isinstance(i, str) else i for i in pipe_feat.stages[4].labels]
f = [i.replace(' ', '_') if isinstance(i, str) else i for i in pipe_feat.stages[5].labels]
a1=maritl+'_'+a[0]
a2=maritl+'_'+a[1]
a3=maritl+'_'+a[2]
a4=maritl+'_'+a[3]
a5=maritl+'_'+a[4]
b1=race+'_'+b[0]
b2=race+'_'+b[1]
b3=race+'_'+b[2]
b4=race+'_'+b[3]
c1=education+'_'+c[0]
c2=education+'_'+c[1]
c3=education+'_'+c[2]
c4=education+'_'+c[3]
c5=education+'_'+c[4]
d1=jobclass+'_'+d[0]
d2=jobclass+'_'+d[1]
e1=health+'_'+e[0]
e2=health+'_'+e[1]
f1=health_ins+'_'+f[0]
f2=health_ins+'_'+f[1]
feature_importance=pd.DataFrame(list(zip(['year','age',a1,a2,a3,a4,a5,b1,b2,b3,b4,c1,c2,c3,c4,c5,d1,d2,e1,e2,f1,f2
                                         ], final_model.stages[-1].featureImportances.toArray())),
            columns = ['feature', 'importance']).sort_values('importance',ascending=False)
#raise NotImplementedError()

In [36]:
# display your feature importances here
feature_importance

Unnamed: 0,feature,importance
15,education_5._Advanced_Degree,0.294401
1,age,0.122606
20,health_ins_1._Yes,0.10489
14,education_4._College_Grad,0.099223
21,health_ins_2._No,0.083961
3,maritl_2._Married,0.075818
12,education_2._HS_Grad,0.052246
0,year,0.025875
13,education_3._Some_College,0.025003
11,education_1._<_HS_Grad,0.023609


In [138]:
# tests for 25 pts
assert type(feature_importance) == pd.core.frame.DataFrame
np.testing.assert_array_equal(list(feature_importance.columns), ['feature', 'importance'])
np.testing.assert_array_equal(list(feature_importance.columns), ['feature', 'importance'])

**(5 pts)** Comment below on the importance that random forest has given to each feature. Are they reasonable? Do they tell you anything valuable about the titanic dataset? Answer in the cell below

YOUR ANSWER HERE

# Question 7:  15 pts.

Pick any of the trees from the final model and assign its `toDebugString` property to a variable `example_tree`. Print this variable and add comments to the cell describing how you think this particular tree is fitting the data

In [214]:
# create a variable example_tree with the toDebugString property of a tree from final_model.
# print this string and comment in this same cell about the branches that this tree fit
# YOUR CODE HERE
rf_model=final_model.stages[-1]
len(rf_model.trees)
example_tree=rf_model.trees[0].toDebugString


#raise NotImplementedError()

In [215]:
# display the tree here
print(example_tree)

DecisionTreeRegressionModel (uid=dtr_887c784b0f7f) of depth 6 with 113 nodes
  If (feature 9 in {1.0})
   If (feature 15 in {0.0})
    If (feature 3 in {0.0})
     If (feature 1 <= 40.5)
      If (feature 1 <= 28.5)
       If (feature 7 in {0.0})
        Predict: 69.68204037827482
       Else (feature 7 not in {0.0})
        Predict: 76.54594325047503
      Else (feature 1 > 28.5)
       If (feature 14 in {1.0})
        Predict: 50.5421230453749
       Else (feature 14 not in {1.0})
        Predict: 62.63466312420231
     Else (feature 1 > 40.5)
      If (feature 1 <= 43.5)
       If (feature 2 in {1.0})
        Predict: 72.051883525888
       Else (feature 2 not in {1.0})
        Predict: 100.099662492042
      Else (feature 1 > 43.5)
       If (feature 13 in {1.0})
        Predict: 68.41786147486921
       Else (feature 13 not in {1.0})
        Predict: 81.55613629422528
    Else (feature 3 not in {0.0})
     If (feature 8 in {1.0})
      If (feature 0 <= 2004.5)
       If (feature 0

In [40]:
# tests for 10 points
assert type(example_tree) == str
assert 'DecisionTreeRegressionModel' in example_tree
assert 'feature 0' in example_tree
assert 'If' in example_tree
assert 'Else' in example_tree
assert 'Predict' in example_tree

**(5 pts)** Comment on the feature that is at the top of the tree. Does it make sense that that is the feature there?