
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Decision Trees

- In the previous notebook, you worked with a **parametric model**, **Linear Regression**. Parametric models assume a fixed functional form and learn a fixed number of parameters (such as coefficients) from the data. Non-parametric models, such as decision trees, make no such assumption, so their structure can grow with the data. This makes them more flexible, but not automatically more accurate.
- We could do some more hyperparameter tuning with the linear regression model, but we're going to try tree-based methods and see if our performance improves.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - **Identify the differences between single node and distributed decision tree implementations**
 - **Get the feature importance**
 - **Examine common pitfalls of decision trees**

In [0]:
%pip install mlflow

In [0]:
%run "./Includes/Classroom-Setup"

In [0]:
file_path = f"{datasets_dir}/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnb_df = spark.read.format("delta").load(file_path)
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)

## How to Handle Categorical Features?

We saw in the previous notebook that we can use StringIndexer/OneHotEncoder/VectorAssembler or RFormula.

**However, for decision trees, and in particular, random forests, we should not OHE our variables.**

There is an excellent <a href="https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769#:~:text=One%2Dhot%20encoding%20categorical%20variables,importance%20resulting%20in%20poorer%20performance" target="_blank">blog</a> on this, and the essence is:
>>> "One-hot encoding categorical variables with high cardinality can cause inefficiency in tree-based methods. Continuous variables will be given more importance than the dummy variables by the algorithm, which will obscure the order of feature importance and can result in poorer performance."

In [0]:
from pyspark.ml.feature import StringIndexer

categorical_cols = [field for (field, dataType) in train_df.dtypes if dataType == "string"]
index_output_cols = [x + "Index" for x in categorical_cols]

string_indexer = StringIndexer(inputCols=categorical_cols, outputCols=index_output_cols, handleInvalid="skip")
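
To make the contrast between the two encodings concrete, here is a minimal pure-Python sketch (not Spark code). The helpers `index_by_frequency` and `one_hot` and the toy `room_types` list are illustrative assumptions. The most-frequent-first ordering mirrors StringIndexer's default behavior, and the one-hot helper is simplified to emit one slot per category.

```python
from collections import Counter

def index_by_frequency(values):
    """Map each category to an integer index, most frequent category first."""
    counts = Counter(values)
    # Sort by descending frequency, breaking ties alphabetically
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: idx for idx, category in enumerate(ordered)}

def one_hot(values, mapping):
    """Expand each value into a 0/1 vector with one slot per category."""
    width = len(mapping)
    return [[1 if mapping[v] == i else 0 for i in range(width)] for v in values]

# Hypothetical toy column, not taken from the Airbnb dataset
room_types = ["Entire home", "Private room", "Entire home",
              "Shared room", "Entire home", "Private room"]

mapping = index_by_frequency(room_types)
indexed = [mapping[v] for v in room_types]   # a single numeric column
encoded = one_hot(room_types, mapping)       # one column per category

print(mapping)
print(indexed)
print(len(encoded[0]))
```

The indexed form stays one column wide no matter how many categories exist, while the one-hot form grows with cardinality, which is the inefficiency the quoted blog post describes.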

## VectorAssembler

Let's use the <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html?highlight=vectorassembler#pyspark.ml.feature.VectorAssembler" target="_blank">VectorAssembler</a> to combine all of our categorical and numeric inputs.

In [0]:
from pyspark.ml.feature import VectorAssembler

# Filter for just numeric columns (and exclude price, our label)
numeric_cols = [field for (field, dataType) in train_df.dtypes if dataType == "double" and field != "price"]
# Combine output of StringIndexer defined above and numeric columns
assembler_inputs = index_output_cols + numeric_cols
vec_assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")

## Decision Tree

Now let's build a <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.DecisionTreeRegressor.html?highlight=decisiontreeregressor#pyspark.ml.regression.DecisionTreeRegressor" target="_blank">DecisionTreeRegressor</a> with the default hyperparameters.

In [0]:
from pyspark.ml.regression import DecisionTreeRegressor

dt = DecisionTreeRegressor(labelCol="price")

## Fit Pipeline

The following cell is expected to error, but we subsequently fix this.

In [0]:
from pyspark.ml import Pipeline

# Combine stages into pipeline
stages = [string_indexer, vec_assembler, dt]
pipeline = Pipeline(stages=stages)

# This fit is expected to fail with the default maxBins (explained in the next section)
pipeline_model = pipeline.fit(train_df)

## maxBins

What is this parameter <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.DecisionTreeRegressor.html?highlight=decisiontreeregressor#pyspark.ml.regression.DecisionTreeRegressor.maxBins" target="_blank">maxBins</a>? Let's take a look at the PLANET implementation of **distributed decision trees** to help explain the **`maxBins`** parameter.

<img src="https://files.training.databricks.com/images/DistDecisionTrees.png" height=500px>

In Spark, data is partitioned by row. So when it needs to make a split, each worker has to compute summary statistics for every feature at each candidate split point. These summary statistics are then aggregated (via tree reduce) so that the best split can be chosen.

Think about it: what if worker 1 had the value **`32`** but none of the other workers had it? How could you communicate how good a split that would be? To avoid this, Spark uses the **`maxBins`** parameter to discretize continuous variables into buckets. The number of buckets must be at least as large as the number of distinct values in the categorical feature with the highest cardinality. The default value is 32.
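
The idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Spark's actual implementation: the function names, the two toy "worker" partitions, and the hand-picked `thresholds` are all hypothetical. Each worker only reports a per-bin count and label sum, and those small tables are merged before a split is chosen.

```python
def bin_index(value, thresholds):
    """Return the index of the first bin whose upper threshold is >= value."""
    for i, t in enumerate(thresholds):
        if value <= t:
            return i
    return len(thresholds)  # overflow bin for values above every threshold

def local_stats(rows, thresholds):
    """Per-bin [count, sum of labels] for one worker's partition."""
    stats = [[0, 0.0] for _ in range(len(thresholds) + 1)]
    for feature_value, label in rows:
        b = bin_index(feature_value, thresholds)
        stats[b][0] += 1
        stats[b][1] += label
    return stats

def merge(a, b):
    """Element-wise sum of two per-bin statistics tables."""
    return [[ca + cb, sa + sb] for (ca, sa), (cb, sb) in zip(a, b)]

def best_split(stats, thresholds):
    """Pick the threshold minimizing squared error, using only counts and sums."""
    total_n = sum(c for c, _ in stats)
    total_s = sum(s for _, s in stats)
    best_t, best_score = None, float("-inf")
    left_n, left_s = 0, 0.0
    for i, t in enumerate(thresholds):
        left_n += stats[i][0]
        left_s += stats[i][1]
        right_n, right_s = total_n - left_n, total_s - left_s
        if left_n == 0 or right_n == 0:
            continue
        # Maximizing this quantity is equivalent to minimizing the summed squared error
        score = left_s ** 2 / left_n + right_s ** 2 / right_n
        if score > best_score:
            best_t, best_score = t, score
    return best_t

thresholds = [2, 4, 10]                          # assumed bin boundaries
worker_1 = [(1, 100.0), (2, 110.0), (32, 500.0)]  # (feature value, label)
worker_2 = [(3, 120.0), (4, 130.0), (10, 300.0)]

merged = merge(local_stats(worker_1, thresholds), local_stats(worker_2, thresholds))
chosen = best_split(merged, thresholds)
print(merged)
print(chosen)
```

Note that the value `32`, seen only by the first worker, never has to be communicated on its own: it simply falls into the overflow bin, and only the fixed-size table of bin statistics travels between workers.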

Let's go ahead and increase maxBins to **`40`**.

In [0]:
dt.setMaxBins(40)

Take two.

In [0]:
pipeline_model = pipeline.fit(train_df)

## Feature Importance

Let's go ahead and get the fitted decision tree model, and look at the feature importance scores.

In [0]:
dt_model = pipeline_model.stages[-1]
display(dt_model)

treeNode
"{""index"":31,""featureType"":""continuous"",""prediction"":null,""threshold"":2.5,""categories"":null,""feature"":12,""overflow"":false}"
"{""index"":15,""featureType"":""continuous"",""prediction"":null,""threshold"":1.5,""categories"":null,""feature"":12,""overflow"":false}"
"{""index"":7,""featureType"":""categorical"",""prediction"":null,""threshold"":null,""categories"":[1.0,2.0],""feature"":5,""overflow"":false}"
"{""index"":3,""featureType"":""categorical"",""prediction"":null,""threshold"":null,""categories"":[5.0,6.0,7.0,8.0,9.0,10.0,12.0,13.0,14.0,17.0,18.0,19.0,23.0,24.0,26.0,27.0,28.0,29.0,30.0,31.0,33.0,35.0],""feature"":3,""overflow"":false}"
"{""index"":1,""featureType"":""continuous"",""prediction"":null,""threshold"":8.5,""categories"":null,""feature"":10,""overflow"":false}"
"{""index"":0,""featureType"":null,""prediction"":96.27024390243902,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":2,""featureType"":null,""prediction"":984.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":5,""featureType"":""continuous"",""prediction"":null,""threshold"":37.744265,""categories"":null,""feature"":8,""overflow"":false}"
"{""index"":4,""featureType"":null,""prediction"":343.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":6,""featureType"":null,""prediction"":136.60329067641683,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"


In [0]:
dt_model.featureImportances

## Interpreting Feature Importance

Hmmm... it's a little hard to know what feature 4 vs 11 is. Given that the feature importance scores are "small data", let's use Pandas to help us recover the original column names.

In [0]:
import pandas as pd

features_df = pd.DataFrame(list(zip(vec_assembler.getInputCols(), dt_model.featureImportances)), columns=["feature", "importance"])
features_df.sort_values(by="importance", ascending=False)

Unnamed: 0,feature,importance
12,bedrooms,0.211139
1,cancellation_policyIndex,0.155079
14,minimum_nights,0.143334
2,instant_bookableIndex,0.129501
15,number_of_reviews,0.113373
3,neighbourhood_cleansedIndex,0.101304
10,accommodates,0.060407
22,review_scores_value,0.036494
0,host_is_superhostIndex,0.017857
13,beds,0.017393


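## Why are so few features non-zero?

With SparkML, the default **`maxDepth`** is 5, so the tree contains only a limited number of splits and can use only a handful of features (it can also split on the same feature several times at different split points). Features that never appear in a split get an importance of zero.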
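
As a rough check on that limit, the sketch below computes the upper bound on the number of splits for a binary tree of a given depth. The `tree_capacity` helper is hypothetical, and it assumes the convention that a tree of depth 0 is a single leaf.

```python
def tree_capacity(max_depth):
    """Upper bounds on split (internal) nodes and leaves for a binary tree."""
    max_internal_nodes = 2 ** max_depth - 1
    max_leaves = 2 ** max_depth
    return max_internal_nodes, max_leaves

internal, leaves = tree_capacity(5)
print(f"maxDepth=5 allows at most {internal} splits and {leaves} leaves")
```

These are upper bounds only: a fitted tree usually stops splitting earlier on some branches, so it typically uses fewer splits than this.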

Let's use a Databricks widget to get the top-K features.

In [0]:
dbutils.widgets.text("top_k", "5")
top_k = int(dbutils.widgets.get("top_k"))

top_features = features_df.sort_values(["importance"], ascending=False)[:top_k]["feature"].values
print(top_features)

## Scale Invariant

- With decision trees, the scale of the features does not matter. For example, a split sends the same 1/3 of the data to one side whether the split point is 100 or, after normalization, 0.33. The only thing that matters is how many data points fall left and right of that split point, not the absolute value of the split point.

- This is not true for linear regression, which is why Spark standardizes the features by default before fitting one. Think about it: if you measure shoe sizes in American vs. European sizing, the corresponding weights of those features will be very different even though those measures represent the same thing: the size of a person's foot!
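
The scale-invariance claim for trees can be checked with a small pure-Python sketch. The `best_partition` helper and the toy `square_feet` and `price` lists are illustrative assumptions, and the search is a brute-force single split rather than Spark's binned algorithm.

```python
def sum_sq_err(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_partition(xs, ys):
    """Return the row indices sent left by the error-minimizing threshold split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_left, best_sse = None, float("inf")
    for k in range(1, len(order)):
        # No threshold can separate two equal feature values
        if xs[order[k - 1]] == xs[order[k]]:
            continue
        left, right = order[:k], order[k:]
        sse = sum_sq_err([ys[i] for i in left]) + sum_sq_err([ys[i] for i in right])
        if sse < best_sse:
            best_left, best_sse = frozenset(left), sse
    return best_left

# Hypothetical toy data
square_feet = [300, 450, 800, 1200, 2000, 3500]
price = [90.0, 110.0, 150.0, 400.0, 450.0, 900.0]

# Rescale the feature to the range (0, 1]
scaled = [x / max(square_feet) for x in square_feet]

raw_left = best_partition(square_feet, price)
scaled_left = best_partition(scaled, price)
print(sorted(raw_left), sorted(scaled_left))
```

The threshold value itself changes with the scaling, but the rows on each side of the split do not, so the tree's predictions are unaffected.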

## Apply model to test set

In [0]:
pred_df = pipeline_model.transform(test_df)

display(pred_df.select("features", "price", "prediction").orderBy("price", ascending=False))

features,price,prediction
"Map(vectorType -> sparse, length -> 33, indices -> List(2, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22), values -> List(1.0, 10.0, 2.0, 1.0, 37.79707, -122.41051, 5.0, 2.0, 3.0, 4.0, 30.0, 2.0, 100.0, 10.0, 10.0, 10.0, 10.0, 10.0, 8.0))",9000.0,2618.75
"Map(vectorType -> sparse, length -> 33, indices -> List(0, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22), values -> List(1.0, 11.0, 1.0, 3.0, 37.80658, -122.43649, 12.0, 3.5, 5.0, 7.0, 2.0, 3.0, 100.0, 10.0, 10.0, 9.0, 10.0, 10.0, 9.0))",1600.0,740.3870967741935
"Map(vectorType -> dense, length -> 33, values -> List(0.0, 0.0, 1.0, 21.0, 0.0, 0.0, 0.0, 165.0, 37.7879, -122.39384, 4.0, 2.0, 2.0, 2.0, 30.0, 0.0, 98.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",1599.0,408.10526315789474
"Map(vectorType -> sparse, length -> 33, indices -> List(0, 2, 3, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22), values -> List(1.0, 1.0, 2.0, 1.0, 37.77037, -122.43385, 6.0, 1.0, 2.0, 3.0, 2.0, 32.0, 95.0, 10.0, 9.0, 10.0, 10.0, 10.0, 9.0))",1500.0,252.7627856365615
"Map(vectorType -> sparse, length -> 33, indices -> List(1, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22), values -> List(1.0, 24.0, 1.0, 1.0, 2.0, 37.72082, -122.46345, 2.0, 1.0, 1.0, 1.0, 30.0, 2.0, 100.0, 9.0, 10.0, 10.0, 10.0, 10.0, 10.0))",1500.0,96.27024390243902
"Map(vectorType -> sparse, length -> 33, indices -> List(1, 2, 3, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22), values -> List(2.0, 1.0, 7.0, 1.0, 37.75124, -122.42616, 8.0, 1.5, 3.0, 3.0, 1.0, 12.0, 97.0, 10.0, 10.0, 10.0, 10.0, 10.0, 9.0))",1500.0,486.4035087719298
"Map(vectorType -> sparse, length -> 33, indices -> List(1, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22), values -> List(1.0, 15.0, 2.0, 1.0, 37.80658, -122.41986, 4.0, 2.0, 2.0, 2.0, 2.0, 38.0, 100.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0))",1500.0,408.10526315789474
"Map(vectorType -> sparse, length -> 33, indices -> List(0, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22), values -> List(1.0, 2.0, 2.0, 1.0, 4.0, 37.77703, -122.44541, 2.0, 1.0, 1.0, 1.0, 3.0, 14.0, 96.0, 10.0, 9.0, 10.0, 10.0, 10.0, 9.0))",1500.0,136.60329067641683
"Map(vectorType -> dense, length -> 33, values -> List(1.0, 2.0, 0.0, 33.0, 1.0, 1.0, 0.0, 3.0, 37.78753, -122.49261, 3.0, 1.0, 2.0, 2.0, 1.0, 6.0, 100.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))",1451.0,408.10526315789474
"Map(vectorType -> sparse, length -> 33, indices -> List(0, 1, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22), values -> List(1.0, 1.0, 10.0, 2.0, 1.0, 37.79148, -122.41386, 14.0, 2.0, 3.0, 4.0, 2.0, 37.0, 98.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0))",1250.0,702.0


## Pitfall

What if we get a massive Airbnb rental? Say it has 20 bedrooms and 20 bathrooms. What will a decision tree predict?

It turns out decision trees cannot predict any values larger than they were trained on, because each leaf predicts the average label of the training rows that landed in it. The max value in our training set was $10,000, so we can't predict any values larger than that.
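
A one-split tree (a "stump") is enough to see this. The sketch below is a minimal illustration with hypothetical helpers `fit_stump` and `predict` and made-up toy data. It assumes each leaf predicts the mean training label of its rows.

```python
def fit_stump(xs, ys, threshold):
    """Fit a one-split regression tree: each leaf stores the mean label of its rows."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return threshold, sum(left) / len(left), sum(right) / len(right)

def predict(stump, x):
    """Route a value to a leaf and return that leaf's stored mean."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Hypothetical toy training data
bedrooms = [1, 1, 2, 3, 4, 5]
prices = [100.0, 120.0, 200.0, 350.0, 500.0, 700.0]

stump = fit_stump(bedrooms, prices, threshold=2)

# A 20-bedroom listing lands in the same leaf as a 5-bedroom one
print(predict(stump, 5), predict(stump, 20))
```

However extreme the input, the prediction is an average of training labels, so it can never exceed the largest label seen during training.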

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")

rmse = regression_evaluator.evaluate(pred_df)
r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

## Uh oh!

**This model is way worse than the linear regression model, and it's even worse than just predicting the average value.**
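
To see what "worse than predicting the average" means for the metrics, here is a small pure-Python sketch of RMSE and R2 on made-up numbers. The helper functions and toy values are assumptions for illustration. They use the standard definitions and are not a substitute for `RegressionEvaluator`.

```python
import math

def rmse(labels, preds):
    """Root mean squared error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))

def r2(labels, preds):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, preds))
    ss_tot = sum((y - mean) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

labels = [100.0, 200.0, 300.0, 400.0]
mean_baseline = [250.0] * 4                 # always predict the average label
bad_model = [400.0, 300.0, 200.0, 100.0]    # predictions worse than the baseline

print(rmse(labels, mean_baseline), r2(labels, mean_baseline))
print(rmse(labels, bad_model), r2(labels, bad_model))
```

Always predicting the mean gives an R2 of exactly 0, so a negative R2 signals a model that does worse than that baseline.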

In the next few notebooks, let's look at hyperparameter tuning and ensemble models to improve upon the performance of our single decision tree.

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>