# Decision Trees

In the previous notebook, you worked with a parametric model, linear regression. We could do more hyperparameter tuning on that model, but instead we're going to try tree-based methods and see whether our performance improves.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Identify the differences between single node and distributed decision tree implementations
 - Get the feature importance
 - Examine common pitfalls of decision trees

#### Importing modules and disabling MLflow

In [0]:
import os
import mlflow
mlflow.autolog(disable=True)

### Setting the default database and user name  
##### Replace "renato" with your name in the `username` variable.

In [0]:
## Put your name here
username = "renato"

dbutils.widgets.text("username", username)
spark.sql(f"CREATE DATABASE IF NOT EXISTS dsacademy_embedded_wave3_{username}")
spark.sql(f"USE dsacademy_embedded_wave3_{username}")
spark.conf.set("spark.sql.shuffle.partitions", 40)

spark.sql("SET spark.databricks.delta.formatCheck.enabled = false")
spark.sql("SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true")

Out[23]: DataFrame[key: string, value: string]

### Reading Dataset

In [0]:
deltaPath = os.path.join("/", "tmp", username)    #If we were writing to the root folder and not to the DBFS
if not os.path.exists(deltaPath):
    os.mkdir(deltaPath)
    
print(deltaPath)

airbnbDF = spark.read.format("delta").load(deltaPath)

/tmp/renato


In [0]:
airbnbDF.limit(10).display()

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
f,f,6.0,Donaustadt,48.24262,16.42767,Room in bed and breakfast,Hotel room,3.0,1.0,2.0,1.0,14.0,4.71,4.86,4.93,4.93,4.86,4.71,4.5,110.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,t,3.0,Leopoldstadt,48.21924,16.37831,Entire rental unit,Entire home/apt,5.0,1.0,3.0,5.0,350.0,4.75,4.8,4.65,4.91,4.93,4.75,4.69,69.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,t,19.0,Rudolfsheim-Fnfhaus,48.18434,16.32701,Entire rental unit,Entire home/apt,6.0,2.0,4.0,1.0,181.0,4.83,4.9,4.88,4.89,4.93,4.59,4.7,145.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21496,16.37161,Entire rental unit,Entire home/apt,2.0,1.0,1.0,2.0,100.0,4.64,4.73,4.55,4.8,4.91,4.89,4.59,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,3.0,Leopoldstadt,48.21778,16.37847,Entire rental unit,Entire home/apt,3.0,1.0,2.0,5.0,347.0,4.65,4.77,4.51,4.93,4.95,4.86,4.58,68.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21351,16.37282,Entire rental unit,Entire home/apt,2.0,1.0,1.0,3.0,52.0,4.63,4.67,4.35,4.69,4.75,4.88,4.56,99.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,4.0,Leopoldstadt,48.2176,16.38018,Private room in rental unit,Private room,2.0,1.0,2.0,2.0,117.0,4.77,4.74,4.68,4.8,4.75,4.81,4.71,50.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21318,16.37486,Entire rental unit,Entire home/apt,4.0,2.0,1.0,3.0,69.0,4.58,4.8,4.76,4.83,4.92,4.85,4.73,140.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,t,1.0,Ottakring,48.22207,16.31594,Entire rental unit,Entire home/apt,4.0,2.0,2.0,3.0,50.0,4.87,4.94,4.71,4.94,4.96,4.4,4.73,77.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,2.0,Favoriten,48.17437,16.39339,Entire condo,Entire home/apt,4.0,1.0,2.0,5.0,178.0,4.77,4.87,4.67,4.88,4.87,3.98,4.66,87.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Checking Null Values in Spark Dataframe

In [0]:
for col in airbnbDF.columns:
  print(col, ":", airbnbDF.filter(f"{col} is NULL").count())

host_is_superhost : 0
instant_bookable : 0
host_total_listings_count : 2
neighbourhood_cleansed : 0
latitude : 0
longitude : 0
property_type : 0
room_type : 0
accommodates : 0
bedrooms : 0
beds : 0
minimum_nights : 0
number_of_reviews : 0
review_scores_rating : 0
review_scores_accuracy : 0
review_scores_cleanliness : 0
review_scores_checkin : 0
review_scores_communication : 0
review_scores_location : 0
review_scores_value : 0
price : 0
bedrooms_na : 0
beds_na : 0
review_scores_rating_na : 0
review_scores_accuracy_na : 0
review_scores_cleanliness_na : 0
review_scores_checkin_na : 0
review_scores_communication_na : 0
review_scores_location_na : 0
review_scores_value_na : 0


#### Imputing Null Values  
[Python](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.fillna.html)

In [0]:
airbnbDF.filter("host_total_listings_count is NULL").display()

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
f,f,,Mariahilf,48.19336,16.34596,Entire rental unit,Entire home/apt,3.0,1.0,2.0,1.0,1.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,62.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,,Leopoldstadt,48.22447,16.38696,Entire rental unit,Entire home/apt,2.0,1.0,1.0,10.0,0.0,4.83,4.89,4.83,4.93,4.93,4.81,4.76,80.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [0]:
airbnbDF = airbnbDF.fillna(0)

In [0]:
airbnbDF.filter("host_total_listings_count is NULL").display()

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na


## Train/Test Split

Let's use the same 80/20 split with the same seed as the previous notebook so we can compare our results apples to apples (unless you changed the cluster config, which can change how the data is partitioned and therefore which rows end up in each split).

In [0]:
(trainDF, testDF) = airbnbDF.randomSplit([.8, .2], seed=42)

## How to Handle Categorical Features?

We saw in the previous notebook that we can use StringIndexer/OneHotEncoder/VectorAssembler or RFormula.

**However, for decision trees, and in particular, random forests, we should not OHE our variables.**

There is an excellent [blog post](https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769#:~:text=One%2Dhot%20encoding%20categorical%20variables,importance%20resulting%20in%20poorer%20performance) on this, and the essence is:
>>> "One-hot encoding categorical variables with high cardinality can cause inefficiency in tree-based methods. Continuous variables will be given more importance than the dummy variables by the algorithm, which will obscure the order of feature importance and can result in poorer performance."

In [0]:
from pyspark.ml.feature import StringIndexer

categoricalCols = [field for (field, dataType) in trainDF.dtypes if dataType == "string"]
indexOutputCols = [x + "Index" for x in categoricalCols]

stringIndexer = StringIndexer(inputCols=categoricalCols, outputCols=indexOutputCols, handleInvalid="skip")
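
To make the contrast with one-hot encoding concrete, here is a minimal pure-Python sketch (not Spark code) of the two encodings. The helpers `index_labels` and `one_hot` are hypothetical names made up for this illustration. Ordering labels by descending frequency mimics the default behaviour of `StringIndexer`; the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

def index_labels(values):
    """Map each distinct label to an integer index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(values, mapping):
    """Expand each label into a 0/1 row with one column per distinct label."""
    width = len(mapping)
    rows = []
    for v in values:
        row = [0] * width
        row[mapping[v]] = 1
        rows.append(row)
    return rows

# Toy neighbourhood column (names taken from the sample rows shown earlier).
neighbourhoods = ["Leopoldstadt", "Innere Stadt", "Leopoldstadt", "Ottakring",
                  "Innere Stadt", "Leopoldstadt", "Favoriten"]

mapping = index_labels(neighbourhoods)
indexed = [mapping[v] for v in neighbourhoods]   # a single column, whatever the cardinality
encoded = one_hot(neighbourhoods, mapping)       # one column per category

print(mapping)
print("index encoding columns:", 1)
print("one-hot encoding columns:", len(encoded[0]))
```

With the index encoding the tree sees one categorical feature and can put several categories on the same side of a split. With one-hot encoding, each split can only separate one category from all the rest, and the number of columns grows with the cardinality.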

## VectorAssembler

Let's use the VectorAssembler to combine all of our categorical and numeric inputs  
[Python](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html)

In [0]:
from pyspark.ml.feature import VectorAssembler

# Filter for just numeric columns (and exclude price, our label)
numericCols = [field for (field, dataType) in trainDF.dtypes if dataType == "double" and field != "price"]
# Combine output of StringIndexer defined above and numeric columns
assemblerInputs = indexOutputCols + numericCols
vecAssembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")

## Decision Tree

Now let's build a `DecisionTreeRegressor` with the default hyperparameters  
[Python](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.DecisionTreeRegressor.html)

In [0]:
from pyspark.ml.regression import DecisionTreeRegressor

dt = DecisionTreeRegressor(labelCol="price")

## Fit Pipeline

Fitting the pipeline as defined below is expected to raise an error, so the `fit` call is left commented out. We fix the cause in the next section.

In [0]:
from pyspark.ml import Pipeline

# Combine stages into pipeline
stages = [stringIndexer, vecAssembler, dt]
pipeline = Pipeline(stages=stages)

# Uncomment to perform fit
#pipelineModel = pipeline.fit(trainDF)

## maxBins

What is this parameter [maxBins](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.DecisionTreeRegressor.maxBins)? Let's take a look at the PLANET implementation of distributed decision trees to help explain the `maxBins` parameter.

<img src="https://files.training.databricks.com/images/DistDecisionTrees.png" height=500px>

In Spark, data is partitioned by row. So when it needs to make a split, each worker has to compute summary statistics for every feature at each candidate split point. These summary statistics then have to be aggregated (via tree reduce) before a split can be made.
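
The aggregation step can be sketched in plain Python. This is a simplified illustration of the idea, not Spark's implementation: the two "workers", the bin edges, and the helper names (`local_stats`, `merge_stats`, `best_split`) are all made up for the example. Each worker summarises its rows per bin as count, sum of labels, and sum of squared labels. Because these statistics are additive, they can be merged across workers, and the squared error of any left/right split can be recovered from the merged table.

```python
from bisect import bisect_left

def local_stats(rows, thresholds):
    """Per-bin [count, sum(y), sum(y^2)] for one partition of (x, y) rows."""
    stats = [[0, 0.0, 0.0] for _ in range(len(thresholds) + 1)]
    for x, y in rows:
        b = bisect_left(thresholds, x)   # bin i holds rows with x <= thresholds[i]
        stats[b][0] += 1
        stats[b][1] += y
        stats[b][2] += y * y
    return stats

def merge_stats(a, b):
    """Element-wise sum of two per-bin statistics tables."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def sse(n, s, ss):
    """Sum of squared errors around the mean, computed from additive statistics."""
    return ss - s * s / n if n else 0.0

def best_split(stats, thresholds):
    """Return (threshold, cost) minimising left SSE + right SSE."""
    total = [sum(col) for col in zip(*stats)]
    best = None
    left = [0, 0.0, 0.0]
    for i, t in enumerate(thresholds):
        left = [a + b for a, b in zip(left, stats[i])]
        right = [a - b for a, b in zip(total, left)]
        if left[0] == 0 or right[0] == 0:
            continue  # a split must leave rows on both sides
        cost = sse(*left) + sse(*right)
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

thresholds = [2.5, 4.5, 6.5]                      # candidate split points (bin edges)
worker_1 = [(1, 50.0), (2, 55.0), (6, 140.0)]     # made-up (accommodates, price) rows
worker_2 = [(3, 70.0), (4, 75.0), (8, 160.0)]

merged = merge_stats(local_stats(worker_1, thresholds),
                     local_stats(worker_2, thresholds))
threshold, cost = best_split(merged, thresholds)
print(threshold, cost)
```

Note that no worker ever ships raw rows: only the small per-bin tables travel over the network, which is why the number of bins matters.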

Think about it: what if worker 1 had the value `32` but none of the other workers had it? How could you communicate how good a split that would be? To avoid this, Spark has a `maxBins` parameter for discretizing continuous variables into buckets. The catch is that the number of buckets has to be at least as large as the number of categories in the categorical variable with the highest cardinality.
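
As a rough illustration of the bucketing idea, the sketch below picks at most `max_bins - 1` thresholds from the quantiles of a continuous feature and checks the cardinality constraint for categorical features. It is a simplification written for this notebook: Spark's own split selection works on a sample and differs in detail, and `quantile_thresholds` and `check_max_bins` are hypothetical helpers, as are the cardinalities used in the example.

```python
def quantile_thresholds(values, max_bins):
    """Return at most max_bins - 1 split thresholds taken at evenly spaced quantiles."""
    distinct = sorted(set(values))
    if len(distinct) <= max_bins:
        # Few distinct values: use the midpoint between each neighbouring pair.
        return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    ordered = sorted(values)
    n = len(ordered)
    cuts = {ordered[(i * n) // max_bins] for i in range(1, max_bins)}
    return sorted(cuts)

def check_max_bins(cardinalities, max_bins):
    """Names of categorical features that have more categories than max_bins."""
    return [name for name, k in cardinalities.items() if k > max_bins]

# Made-up review counts standing in for a continuous feature.
reviews = [0, 1, 1, 2, 4, 14, 16, 29, 50, 52, 56, 69, 81, 100, 117, 178, 181, 347, 350]
print(quantile_thresholds(reviews, max_bins=4))

# Hypothetical cardinalities for two indexed columns.
cardinalities = {"room_typeIndex": 4, "property_typeIndex": 49}
print(check_max_bins(cardinalities, max_bins=32))   # too small for property_typeIndex
print(check_max_bins(cardinalities, max_bins=60))   # large enough for both
```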

Let's go ahead and increase `maxBins` from its default of 32 to `60`.

In [0]:
dt.setMaxBins(60)

Out[35]: DecisionTreeRegressor_7f1e1378837c

Take two.

In [0]:
pipelineModel = pipeline.fit(trainDF)

## Feature Importance

Let's go ahead and get the fitted decision tree model, and look at the feature importance scores.

In [0]:
dtModel = pipelineModel.stages[-1]
display(dtModel)

treeNode
"{""index"":31,""featureType"":""continuous"",""prediction"":null,""threshold"":5.5,""categories"":null,""feature"":8,""overflow"":false}"
"{""index"":15,""featureType"":""categorical"",""prediction"":null,""threshold"":null,""categories"":[1.0,7.0,11.0,16.0,18.0,23.0,24.0,26.0,28.0,30.0,31.0,32.0,34.0,35.0,36.0,39.0,40.0,41.0,42.0,43.0,44.0,45.0,48.0],""feature"":3,""overflow"":false}"
"{""index"":7,""featureType"":""categorical"",""prediction"":null,""threshold"":null,""categories"":[0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,22.0],""feature"":2,""overflow"":false}"
"{""index"":3,""featureType"":""categorical"",""prediction"":null,""threshold"":null,""categories"":[0.0,2.0,3.0,4.0,6.0,8.0,10.0,11.0,13.0,15.0,16.0,17.0,18.0],""feature"":2,""overflow"":false}"
"{""index"":1,""featureType"":""continuous"",""prediction"":null,""threshold"":3.5,""categories"":null,""feature"":8,""overflow"":false}"
"{""index"":0,""featureType"":null,""prediction"":35.56032171581769,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":2,""featureType"":null,""prediction"":61.333333333333336,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":5,""featureType"":""continuous"",""prediction"":null,""threshold"":102.5,""categories"":null,""feature"":12,""overflow"":false}"
"{""index"":4,""featureType"":null,""prediction"":47.125,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":6,""featureType"":null,""prediction"":111.7741935483871,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"


In [0]:
dtModel.featureImportances

Out[38]: SparseVector(29, {0: 0.0052, 1: 0.0011, 2: 0.0768, 3: 0.2472, 5: 0.2773, 8: 0.0644, 9: 0.0464, 12: 0.0182, 18: 0.2635, 19: 0.0})

## Interpreting Feature Importance

Hmmm... it's a little hard to know what feature 4 vs 11 is. Given that the feature importance scores are "small data", let's use Pandas to help us recover the original column names.

In [0]:
import pandas as pd

featuresDF = pd.DataFrame(list(zip(vecAssembler.getInputCols(), dtModel.featureImportances)), columns=["feature", "importance"])
featuresDF

Unnamed: 0,feature,importance
0,host_is_superhostIndex,0.00521
1,instant_bookableIndex,0.001051
2,neighbourhood_cleansedIndex,0.076796
3,property_typeIndex,0.247164
4,room_typeIndex,0.0
5,host_total_listings_count,0.277265
6,latitude,0.0
7,longitude,0.0
8,accommodates,0.06441
9,bedrooms,0.046424


## Why are so few features non-zero?

With SparkML, the default `maxDepth` is 5, so the tree can only make a limited number of splits and therefore only use a few features (the same feature can also be split on several times at different split points).
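
A quick back-of-the-envelope check of that limit: a binary tree of depth `d` has at most `2**d - 1` internal (split) nodes and `2**d` leaves. The helper below is a hypothetical illustration, not part of the Spark API.

```python
def max_tree_size(max_depth):
    """Upper bounds on split nodes and leaves for a binary tree of the given depth."""
    return {"splits": 2 ** max_depth - 1, "leaves": 2 ** max_depth}

print(max_tree_size(5))   # {'splits': 31, 'leaves': 32}
```

With 29 input features and at most 31 splits, some of which reuse the same strong feature, many features are never chosen and keep an importance of zero.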

Let's use a Databricks widget to get the top-K features.

In [0]:
dbutils.widgets.text("topK", "5")
topK = int(dbutils.widgets.get("topK"))

topFeatures = featuresDF.sort_values(["importance"], ascending=False)[:topK]["feature"].values
print(topFeatures)

['host_total_listings_count' 'review_scores_location' 'property_typeIndex'
 'neighbourhood_cleansedIndex' 'accommodates']


## Scale Invariant

With decision trees, the scale of the features does not matter. For example, a split sends the same 1/3 of the data to one side whether the split point is 100 on the original scale or .33 after normalization. The only thing that matters is how many data points fall to the left and right of that split point, not the absolute value of the split point.

This is not true for linear regression, and the default in Spark is to standardize first. Think about it: if you measure shoe sizes in American vs. European sizing, the corresponding weights of those features will be very different even though those measures represent the same thing: the size of a person's foot!
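
Here is a small pure-Python check of the scale-invariance claim. It uses made-up data and a hypothetical `best_partition` helper that exhaustively searches for the single split with the lowest squared error. Rescaling the feature changes the threshold value but not which rows end up on each side.

```python
def best_partition(xs, ys):
    """Return (threshold, left_row_ids) for the split with the lowest squared error."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best = None
    candidates = sorted(set(xs))
    # Try the midpoint between each pair of neighbouring distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [i for i, x in enumerate(xs) if x <= t]
        right = [i for i, x in enumerate(xs) if x > t]
        cost = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
        if best is None or cost < best[0]:
            best = (cost, t, left)
    return best[1], best[2]

size = [300, 450, 500, 900, 1200, 1500]       # made-up feature values
price = [50, 60, 65, 140, 150, 160]           # made-up labels
scaled = [x / max(size) for x in size]        # same feature rescaled to (0, 1]

t_raw, left_raw = best_partition(size, price)
t_scaled, left_scaled = best_partition(scaled, price)
print(t_raw, left_raw)
print(t_scaled, left_scaled)
```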

## Apply model to test set

In [0]:
predDF = pipelineModel.transform(testDF)

display(predDF.select("features", "price", "prediction").orderBy("price", ascending=False))

features,price,prediction
"Map(vectorType -> dense, length -> 29, values -> List(0.0, 1.0, 3.0, 0.0, 0.0, 2.0, 48.14061, 16.38195, 3.0, 1.0, 1.0, 90.0, 0.0, 4.83, 4.89, 4.83, 4.93, 4.93, 4.81, 4.76, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",1500.0,70.49599287622439
"Map(vectorType -> dense, length -> 29, values -> List(0.0, 0.0, 1.0, 2.0, 0.0, 10.0, 48.19271, 16.40434, 6.0, 2.0, 3.0, 1.0, 0.0, 4.83, 4.89, 4.83, 4.93, 4.93, 4.81, 4.76, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",1200.0,127.1993006993007
"Map(vectorType -> sparse, length -> 29, indices -> List(2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19), values -> List(9.0, 242.0, 48.196587, 16.345758, 16.0, 9.0, 11.0, 1.0, 4.0, 4.75, 4.75, 5.0, 4.25, 4.75, 5.0, 4.75))",1104.0,1019.0
"Map(vectorType -> sparse, length -> 29, indices -> List(2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19), values -> List(7.0, 74.0, 48.2129, 16.3574, 5.0, 2.0, 3.0, 1.0, 16.0, 4.75, 4.69, 4.81, 4.5, 4.63, 4.69, 4.88))",1057.0,163.0596590909091
"Map(vectorType -> sparse, length -> 29, indices -> List(1, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19), values -> List(1.0, 8.0, 1.0, 48.21411, 16.33616, 6.0, 3.0, 3.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0))",1000.0,196.5808383233533
"Map(vectorType -> sparse, length -> 29, indices -> List(2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19), values -> List(7.0, 34.0, 48.20905, 16.37605, 6.0, 2.0, 2.0, 1.0, 29.0, 4.62, 4.66, 4.52, 4.48, 4.69, 4.93, 4.79))",1000.0,191.0919540229885
"Map(vectorType -> dense, length -> 29, values -> List(0.0, 0.0, 7.0, 0.0, 0.0, 3.0, 48.20714, 16.36839, 4.0, 1.0, 2.0, 1.0, 0.0, 4.83, 4.89, 4.83, 4.93, 4.93, 4.81, 4.76, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",987.0,163.0596590909091
"Map(vectorType -> sparse, length -> 29, indices -> List(1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19), values -> List(1.0, 7.0, 3.0, 4.0, 48.2175, 16.36946, 12.0, 4.0, 8.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.8))",855.0,328.73214285714283
"Map(vectorType -> sparse, length -> 29, indices -> List(2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19), values -> List(5.0, 6.0, 48.22706, 16.36272, 8.0, 2.0, 2.0, 1.0, 81.0, 4.64, 4.66, 4.56, 4.79, 4.61, 4.76, 4.64))",840.0,127.1993006993007
"Map(vectorType -> sparse, length -> 29, indices -> List(0, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19), values -> List(1.0, 7.0, 3.0, 9.0, 48.20453, 16.37125, 10.0, 4.0, 5.0, 2.0, 56.0, 5.0, 5.0, 4.98, 5.0, 5.0, 5.0, 4.91))",808.0,328.73214285714283


## Pitfall

What if we get a massive Airbnb rental, say one with 20 bedrooms and 20 bathrooms? What will a decision tree predict?

It turns out decision trees cannot predict any value larger than the largest label they were trained on. The max value in our training set was $10,000, so we can't predict any values larger than that.
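
The reason is that a regression tree predicts the average training label of the leaf a row falls into, and an average can never exceed the largest label it was computed from. A minimal sketch with a hand-built, one-split "tree" (made-up data and hypothetical helper names, not the fitted pipeline):

```python
def fit_stump(xs, ys, threshold):
    """Fit a one-split regression tree: each leaf stores the mean label of its rows."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return {"threshold": threshold,
            "left": sum(left) / len(left),
            "right": sum(right) / len(right)}

def predict(stump, x):
    """Route x to a leaf and return that leaf's stored mean."""
    return stump["left"] if x <= stump["threshold"] else stump["right"]

bedrooms = [1, 1, 2, 2, 3, 4]                         # made-up training feature
price = [60.0, 70.0, 100.0, 110.0, 180.0, 220.0]      # made-up training labels

stump = fit_stump(bedrooms, price, threshold=2.5)
print(predict(stump, 4))     # a listing similar to the training data
print(predict(stump, 20))    # a 20-bedroom listing lands in the same leaf
```

Both calls return the same leaf mean: the tree has no way to extrapolate beyond what its leaves have seen.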

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

regressionEvaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")

rmse = regressionEvaluator.evaluate(predDF)
r2 = regressionEvaluator.setMetricName("r2").evaluate(predDF)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

RMSE is 99.98277839296213
R2 is -0.1593161089903854


## Uh oh!

This model is way worse than the linear regression model. The negative R2 means it's even worse than just predicting the average value.
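
For reference, R2 compares the model's squared error with that of a baseline that always predicts the mean label: `R2 = 1 - SS_res / SS_tot`. The sketch below uses made-up numbers and a hypothetical `r2_score` helper to show how the value drops below zero as soon as the model does worse than that baseline.

```python
def r2_score(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

labels = [50.0, 100.0, 150.0]
mean_baseline = [100.0, 100.0, 100.0]     # always predict the mean label
bad_model = [150.0, 100.0, 50.0]          # systematically wrong predictions

print(r2_score(labels, labels))           # perfect predictions
print(r2_score(labels, mean_baseline))    # no better than the mean
print(r2_score(labels, bad_model))        # worse than the mean
```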

In the next few notebooks, let's look at hyperparameter tuning and ensemble models to improve upon the performance of our single decision tree.

Code modified and enhanced from 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.