<h1>Decision trees</h1>

Notebook Goals

* Learn how to create a regression tree model using scikit-learn
* Show that a lot of data science is not modeling but carefully looking at the data
* Learn how to tune the model

<h2> What are regression trees?</h2> 

A regression tree is a decision tree used for regression, meaning it predicts a continuous value rather than a class. In this notebook, we will use a regression tree to predict home prices.

![image](images/regressionTree.png)
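
To make the idea concrete, here is a minimal sketch on made-up numbers (the toy `sqft` and `price` arrays are assumptions for illustration, not the King County data). It fits a tree with a single split and shows that each prediction is simply the mean price of the training homes in the same leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (made up for illustration): living area in sqft and sale price
sqft = np.array([[800], [900], [1000], [2000], [2200], [2400]])
price = np.array([200_000, 220_000, 240_000, 500_000, 520_000, 540_000])

# A depth-1 tree makes a single split, so it has exactly two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(sqft, price)

# Each prediction is the mean training price of the leaf the sample falls into
small_home, large_home = stump.predict([[850], [2100]])
print(small_home, large_home)
```

The three small homes average 220,000 and the three large homes average 520,000, so those are the only two values this tree can predict. Deeper trees work the same way, just with more leaves.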

<h2> Import Libraries</h2>

For this particular notebook, you need to install folium if you want to run all the cells.

Option 1:
`pip install folium`

Option 2 (Anaconda):
`conda install -c conda-forge folium`

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree

## Load the Dataset
Kaggle hosts a dataset that contains house sale prices for King County, which includes Seattle.

You can download the dataset from [Kaggle](https://www.kaggle.com/harlfoxem/housesalesprediction) or from my [GitHub](https://raw.githubusercontent.com/mGalarnyk/Tutorial_Data/master/King_County/kingCountyHouseData.csv).

In [None]:
url = 'https://raw.githubusercontent.com/mGalarnyk/Tutorial_Data/master/King_County/kingCountyHouseData.csv'
df = pd.read_csv(url)

In [None]:
df.head()

<h2>  Remove or Impute Missing Values </h2>
If you want to build models with your data, null values are (almost) never allowed. It is important to always see how many samples have missing values and for which columns.

In [None]:
# Look at the shape of the dataframe
df.shape

In [None]:
# Count the missing values in each column (features and target)
df.isnull().sum()
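
If the check above does report missing values, the two usual options are dropping the affected rows or imputing a value. The sketch below illustrates both on a hypothetical four-row table (the `toy` frame and its values are made up). Which option is appropriate depends on how much data is missing and why.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table with one missing value (not from the King County data)
toy = pd.DataFrame({'bedrooms': [3, 2, np.nan, 4],
                    'price': [300_000, 250_000, 410_000, 520_000]})

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: impute the missing value with the column median
imputed = toy.fillna({'bedrooms': toy['bedrooms'].median()})

print(dropped.shape)
print(imputed['bedrooms'].tolist())
```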

<h2>Exploratory data analysis</h2>

When we looked for missing values, that was part of exploratory data analysis. Some other things you need to look out for are duplicated rows, outliers, and lack of documentation. Additionally, sometimes datasets have inaccuracies and you may need to consult subject matter experts to get their opinion on some oddities in the data. The reason why we do this before machine learning is that a common critical mistake in machine learning is simply to assume your data is good to work with and doesn't have any surprises.
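
As a rough sketch of two of the checks mentioned above, the code below counts duplicated rows and flags outliers with the 1.5 × IQR rule that boxplots use. The `toy` frame is made up for illustration, and the IQR rule is only one of several reasonable ways to define an outlier.

```python
import pandas as pd

# Made-up rows for illustration; one exact duplicate and one extreme bedroom count
toy = pd.DataFrame({'bedrooms': [3, 3, 2, 4, 3, 33],
                    'price': [300_000, 300_000, 250_000, 520_000, 310_000, 640_000]})

# Number of rows that exactly repeat an earlier row
n_duplicates = toy.duplicated().sum()

# Flag values more than 1.5 * IQR outside the quartiles (the boxplot rule)
q1, q3 = toy['bedrooms'].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy['bedrooms'] < q1 - 1.5 * iqr) | (toy['bedrooms'] > q3 + 1.5 * iqr)

print(n_duplicates)
print(toy.loc[is_outlier])
```

A flagged row is not automatically wrong. It is a prompt to look closer or to ask someone who knows the data.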

In [None]:
continous_columns = ['price',
                      'bedrooms',
                      'bathrooms',
                      'sqft_living',
                      'sqft_lot',
                      'floors',
                      'waterfront',
                      'view',
                      'condition',
                      'grade',
                      'sqft_above',
                      'sqft_basement',
                      'yr_built',
                      'yr_renovated',
                      'zipcode',
                      'lat',
                      'long',
                      'sqft_living15',
                      'sqft_lot15']
df.loc[:,continous_columns].hist(bins=25,figsize=(16,16),xlabelsize='10',ylabelsize='10',xrot=-15);

To look at bedrooms, floors, bathrooms, and other variables vs price, I prefer boxplots because we have numerical data that is mostly not continuous. If you are curious what a boxplot is, I have an article on it [here](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51).

From the charts below, it can be seen that there are some outliers, like a house with 33 bedrooms and a price around 7,000,000.

In [None]:
fig, axes = plt.subplots(nrows = 3,
                         ncols = 1,
                         dpi=1000)

sns.boxplot(x=df['bedrooms'],y=df['price'], ax=axes[0], showfliers = False)
#axes[0].set_ylabel('')
sns.boxplot(x=df['floors'],y=df['price'], ax=axes[1], showfliers = False)
#axes[1].set_ylabel('')
sns.boxplot(x=df['bathrooms'],y=df['price'], ax=axes[2], showfliers = False)
axes[2].tick_params(axis = 'x', rotation = 90)
fig.tight_layout()

In this dataset, we have latitude and longitude information for the houses. Using the lat and long columns, I created the map below with the folium library, which is a wrapper around a JavaScript library called [leaflet](https://leafletjs.com/reference-1.6.0.html#circlemarker-option). Notice that I didn't use equal-sized bins: the most expensive house costs 7,700,000 and the least expensive 75,000, so splitting that range into 6 equal-sized bins would mean each bin covers more than a million. I didn't add a legend to the map, though I probably should have ([link](https://stackoverflow.com/questions/37466683/create-a-legend-on-a-folium-map) to learn how to do it if curious).

In [None]:
# Quick sidenote: look at the min and max home price as well as the largest number of bedrooms
df['price'].max()

In [None]:
df['price'].min()

In [None]:
df['price'].mean()

In [None]:
# Histogram for price
(n, bins, patches) = plt.hist(df['price'].values,
                              bins=6,
                              edgecolor='black',
                              linewidth=.9)
plt.tick_params(axis = 'x', rotation = 90, labelsize = 10)

In [None]:
# The edges of the bins.
bins

In [None]:
# Histogram for bedrooms
(n, bins, patches) = plt.hist(df['bedrooms'].values,
                              bins=6,
                              edgecolor='black',
                              linewidth=.9)
plt.tick_params(axis = 'x', rotation = 90, labelsize = 10)

In [None]:
# The edges of the bins. Having bins this size is ridiculous. Should I have 33 bins?
bins

<h3>Bin the histogram into quantiles so we can have some more balanced bins and reasonable colors</h3>

In [None]:
quantiles = df['price'].quantile([0,0.01, 0.25, 0.5, 0.75, 0.99, 1])
df['price_bin'] = pd.cut(df['price'], bins = quantiles.values)

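# pd.cut excludes the left edge of the first bin, so the cheapest house gets NaN; remove it
# .copy() avoids a SettingWithCopyWarning when adding the new column
df = df.loc[~df['price_bin'].isna(), :].copy()
df['price_left'] = df['price_bin'].apply(lambda x: x.left)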

# Making color based on quantiles rather than equal size bins. 
hex_dict = {}
for index,left in enumerate(df['price_left'].value_counts().sort_index().index.values):
    hex_dict[left] = sns.color_palette("RdBu", 6).as_hex()[index]

In [None]:
# Storing the colors as hex strings because folium doesn't take tuples as inputs (for an unknown reason)
hex_dict
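
To see why quantile-based bins are more balanced than equal-width ones, here is a small sketch on synthetic, right-skewed prices (log-normal draws standing in for real sale prices, which is an assumption). It uses `pd.qcut` for the quantile bins. That is a shortcut and not what the cell above does, since the cell above passes custom quantile edges to `pd.cut`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal), standing in for real sale prices
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13, sigma=0.5, size=1000))

# Equal-width bins: most homes pile into the lowest bins
equal_width_counts = pd.cut(prices, bins=4).value_counts().sort_index()

# Quantile bins: each bin holds the same share of homes
quantile_counts = pd.qcut(prices, q=4).value_counts().sort_index()

print(equal_width_counts.tolist())
print(quantile_counts.tolist())
```

With equal-width bins, a handful of expensive homes stretch the range so that nearly all markers on a map would get the same color. Quantile bins spread the colors evenly across homes.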

In [None]:
"""
# You need to have installed folium to make this work
import folium

# Creating Map
startingmap = folium.Map(location=[47.5112, -122.257], control_scale=True, zoom_start=9.4)

for index, row in df.iterrows():
    
    
    price = int(row['price'])
    bedrooms = row['bedrooms']
    floors = row['floors']
    bathrooms = row['bathrooms']
    living = row['sqft_living']
    waterfront = row['waterfront']
    
    popupinformation = ('Price: ' + "{:,}".format(price) + '<br>'
                        'Bedrooms: ' + str(bedrooms) + '<br>'
                        'Floors: ' + str(floors) + '<br>'
                        'Bathrooms: ' + str(bathrooms) + '<br>'
                        'Sqft_Living: ' + str(living) + '<br>'
                        'Waterfront: ' + str(waterfront) + '<br>'
                       )
    
    folium.CircleMarker([row['lat'], row['long']],
                        color = hex_dict[row['price_left']],
                        weight = .5,
                        fill = True,
                        fillColor = hex_dict[row['price_left']],
                        popup = popupinformation,
                        opacity = .3,
                        fillOpacity = .3).add_to(startingmap)
    
startingmap.save('seattleMap.html')

"""

There is clearly a relationship between location and price in this dataset, but can a model that we build capture it? If we really want to make a good prediction, we could include additional information, such as schools in the area (like Zillow does), distance to companies, or more details about the homes.

<h2> Arrange Data into Features Matrix and Target Vector </h2>
Target is price

In [None]:
# Picked some features for now
# See what happens if you input more or fewer features
feature_names = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors']

X = df.loc[:, feature_names].values

y = df.loc[:, 'price'].values

<h2> Train Test Split </h2>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
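
As a quick illustration of what this call does, the sketch below runs `train_test_split` on a made-up array (`X_demo` and `y_demo` are hypothetical names). When `test_size` is not given, scikit-learn holds out 25% of the rows, and fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# With no test_size given, 25% of the rows are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces exactly the same split
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, random_state=2)
print(np.array_equal(X_te, X_te_again))
```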

<h2>Decision tree regressor</h2>

In [None]:
# Make an instance of the Model.
reg = DecisionTreeRegressor()

# Training the model on the data, storing the information learned from the data
reg.fit(X_train, y_train)

# Measure model performance (R^2 on the test set)
score = reg.score(X_test, y_test)
print(score)

In [None]:
reg.get_depth()
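
The number returned by `reg.score` is the coefficient of determination, R^2. As a small worked example on made-up values (`y_true` and `y_pred` are hypothetical), the sketch below computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and checks the result against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# Residual sum of squares: how far the predictions are from the truth
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: how far the truth is from its own mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

An R^2 of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.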

<h2>Finding the Optimal max_depth</h2>
Finding the optimal value for max_depth is one way to tune your model. The code below outputs the R^2 for regression trees with different values for max_depth.

In [None]:
# List of values to try for max_depth:
max_depth_range = list(range(1,80))

# List to store the R2 for each value of max_depth
score_list = []

for depth in max_depth_range:
    reg = DecisionTreeRegressor(max_depth = depth)
    reg.fit(X_train, y_train)
    score = reg.score(X_test, y_test)
    score_list.append(score)

The graph below shows that the model achieves its best test R^2 when the tree is prepruned, that is, when max_depth is limited instead of letting the tree grow fully.

In [None]:
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (10,7));
ax.plot(max_depth_range,
        score_list,
        lw=2,
        color='k')
ax.set_xlim([1, max(max_depth_range)])
ax.grid(True,
        axis = 'both',
        zorder = 0,
        linestyle = ':',
        color = 'k')
ax.tick_params(labelsize = 18)
ax.set_xlabel('max_depth', fontsize = 24)
ax.set_ylabel('R^2', fontsize = 24)
fig.tight_layout()
#fig.savefig('images/max_depth_vs_R2.png', dpi = 300)
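
One caveat with the loop above is that it picks max_depth using the test set, so the reported test score is somewhat optimistic. A common alternative, sketched here on synthetic data, is to choose max_depth with cross-validation on the training set only. Note that this swaps in `GridSearchCV`, which the notebook itself does not use, and that the data-generating formula is made up.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)

# 5-fold cross-validation over max_depth, using only the training set
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={'max_depth': list(range(1, 21))},
                      cv=5)
search.fit(X_tr, y_tr)

best_depth = search.best_params_['max_depth']

# The test set is touched once, after the depth has been chosen
test_r2 = search.score(X_te, y_te)
print(best_depth, test_r2)
```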

## Common questions

<h3>How do you create a visualization based on a decision tree? </h3>
You can visualize decision trees using matplotlib, graphviz, or an online converter. You can read how to do it <a href="https://towardsdatascience.com/visualizing-decision-trees-with-python-scikit-learn-graphviz-matplotlib-1c50b4aa68dc">here</a> (or use the notebook VisualizeDecisionTrees.ipynb to learn how to do it using matplotlib). In the example covered there, notice that petal width is the topmost split. It also happens to be the "most important" feature.
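
To check the topmost split and the "most important" feature without drawing the tree, you can read them off a fitted model. The sketch below does this on synthetic data where only one feature matters (the feature names `signal` and `noise` and the data are made up for illustration).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on the first feature only
rng = np.random.default_rng(0)
names = ['signal', 'noise']
X_demo = rng.uniform(0, 1, size=(300, 2))
y_demo = 10 * X_demo[:, 0] + rng.normal(0, 0.1, size=300)

model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across features
importances = dict(zip(names, model.feature_importances_))

# Node 0 is the root, so this is the feature used by the topmost split
root_feature = names[model.tree_.feature[0]]

print(importances)
print(root_feature)
```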

In [None]:
# Get depth
reg.get_depth()

In [None]:
# Please don't run this code
# A tree with a depth of 39 is probably not going to plot the way you want it to.
"""
fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (4,4), dpi=300)
tree.plot_tree(reg,
               feature_names = feature_names,
               filled = True);
"""

<h3>How can we make a better model?</h3>
A lot of machine learning is about data. It could be as simple as adding in more features. You could also use APIs to get more data (like distance to water, distance to stores, schools, etc). 

<h3>Which max_depth was best and why was the test R^2 so low?</h3>
The model overfit on the training set when it was allowed to grow without prepruning. Here is a graph to show the relationship between the train R^2 and the test R^2.

In [None]:
model_list = []
max_depth_list = []
actual_depth_list = []
train_score_list = []
test_score_list = []

for max_depth in range(1,80):
    reg = DecisionTreeRegressor(max_depth = max_depth)
    reg.fit(X_train, y_train)
    
    model_list.append(reg)
    max_depth_list.append(max_depth)
    actual_depth_list.append(reg.get_depth())
    train_score_list.append(reg.score(X_train, y_train))
    test_score_list.append(reg.score(X_test, y_test))

In [None]:
data_dict = {'model_list': model_list,
             'max_depth': max_depth_list,
             'actual_depth': actual_depth_list,
             'train_score': train_score_list,
             'test_score': test_score_list}

In [None]:
temp_df = pd.DataFrame(data_dict)

In [None]:
temp_df

In [None]:
# finding the best test score
test_max = temp_df.loc[:, 'test_score'].max()
temp_df.loc[temp_df.loc[:,'test_score'] == test_max, :]

In [None]:
# visualize the tree
# kinda hard to read!
fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (4,4), dpi=300)
tree.plot_tree(temp_df.loc[5, 'model_list'],
               feature_names = feature_names,
               filled = True);

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 1)

axes.plot(temp_df['max_depth'].values,
          temp_df['test_score'].values,
          label = 'Test Score')
axes.plot(temp_df['max_depth'].values,
          temp_df['train_score'].values,
          label = 'Train Score',
          color = 'r')
axes.set_xlabel('max_depth', fontsize = 13)
axes.set_ylabel('R^2', fontsize = 13)
axes.grid()
axes.legend(loc = 'center right',
            fontsize = 10)
axes.set_title('Test vs Train R^2', fontsize = 13)
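
The same overfitting pattern can be reproduced on a small synthetic problem, which makes the cause easier to see. In this sketch (a noisy sine curve, made up for illustration), a fully grown tree memorizes the training noise and scores close to 1 on the training set, while a shallow tree has a much smaller gap between train and test R^2.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)

scores = {}
for depth in (3, None):  # None lets the tree grow until every leaf is pure
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for depth, (train_r2, test_r2) in scores.items():
    print(depth, round(train_r2, 3), round(test_r2, 3))
```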

Dual axis graph below (no idea why I made it, but kinda like to show it).

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 1)

axes.plot(temp_df['max_depth'].values,
          temp_df['test_score'].values )
axes.set_xlabel('max_depth')

axes2=axes.twinx()
axes2.plot(temp_df['max_depth'].values,temp_df['train_score'].values, color = 'r')
axes2.tick_params('y', colors='r');

trainLimits = temp_df['train_score'].min(), temp_df['train_score'].max() 
testLimits = temp_df['test_score'].min(), temp_df['test_score'].max() 


axes.set_ylim(testLimits)
axes.set_yticks(np.linspace(testLimits[0],testLimits[1], 14))


axes.set_ylabel('Test R^2')
axes.set_title('Test vs Train R^2')

axes2.set_ylim(trainLimits)
axes2.set_yticks(np.linspace(trainLimits[0],trainLimits[1], 14))
axes2.set_ylabel('Train R^2', color='r'); 

axes.grid();