# Decision Trees

In this notebook we'll use the famous [Iris Dataset](https://archive.ics.uci.edu/ml/datasets/iris) to check out some real decision trees!  

<img src="./data/iris.png">

This data set has:
1. 150 instances with 4 attributes (same units, all numeric)
2. Balanced class distribution
3. No missing data

In [None]:
# Standard imports, notebook magic, and startup initializers.

# Load Numpy
import numpy as np
# Load MatPlotLib
import matplotlib
import matplotlib.pyplot as plt
# Load Pandas
import pandas as pd
# Load SQLITE
import sqlite3
# Load Stats
from scipy import stats

# This lets us show plots inline and also save PDF plots if we want them
%matplotlib inline
from matplotlib.backends.backend_pdf import PdfPages

# Widen the notebook so that wide Pandas tables display without wrapping.
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

In [None]:
# Import the data and check it out...
df_iris = pd.read_csv("./data/iris.csv")
df_iris.head()

In [None]:
df_iris.describe()

In [None]:
df_iris.groupby("species").size()

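Next, split the data into training and test sets.  Note that we are using a *stratified sample* here, so each species appears in the same proportion in both sets; an unlucky unstratified split could under-represent a class and skew our classifier! [More info in the docs!](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)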
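
To see what stratification does, here is a minimal sketch on a deliberately imbalanced toy frame. The names `toy`, `train_toy`, and `test_toy` and the 80/20 label mix are made up for illustration and are not part of the Iris data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 rows labelled "a" and 20 rows labelled "b".
toy = pd.DataFrame({"value": range(100), "label": ["a"] * 80 + ["b"] * 20})

# Same call shape as in this notebook; random_state is added only for repeatability.
train_toy, test_toy = train_test_split(toy,
                                       test_size=0.4,
                                       stratify=toy["label"],
                                       random_state=0)

# Share of each label in the two splits.
train_share = train_toy["label"].value_counts(normalize=True)
test_share = test_toy["label"].value_counts(normalize=True)
print(train_share)
print(test_share)
```

Both splits should keep the original 80/20 mix. Without `stratify`, the shares would vary from one random split to the next.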


In [None]:
# Split into train (60%) and test (40%) sets, stratified by species.
import sklearn
from sklearn.model_selection import train_test_split

train, test = train_test_split(df_iris, 
                               test_size=0.4, 
                               stratify=df_iris["species"])

In [None]:
# Check that both splits kept the classes balanced...
train.groupby("species").size()

In [None]:
test.groupby("species").size()

In [None]:
# Just for fun..
import seaborn as sns
sns.pairplot(train, hue="species", height=2, palette='colorblind')

In [None]:
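# Correlation is only defined for the numeric columns, so drop the species label first.
corrmat = train.drop(columns="species").corr()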
sns.heatmap(corrmat, annot = True, square = True);

Now let's build a decision tree!

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import metrics

In [None]:
features = ['sepal_length','sepal_width','petal_length','petal_width']
X_train = train[features]
y_train = train.species
X_test = test[features]
y_test = test.species


In [None]:
mod_dt = DecisionTreeClassifier(max_depth = 3, random_state = 1)
mod_dt.fit(X_train,y_train)
prediction=mod_dt.predict(X_test)

In [None]:
# Check some measures... (scikit-learn metrics take y_true first, then y_pred)
print(f"The accuracy of the Decision Tree is {metrics.accuracy_score(y_test,prediction):.3f}")
print(f"The Precision of the Decision Tree is {metrics.precision_score(y_test,prediction,average='weighted'):.3f}")
print(f"The Recall of the Decision Tree is {metrics.recall_score(y_test,prediction,average='weighted'):.3f}")
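
The scikit-learn metric functions expect the true labels first and the predictions second. Accuracy is symmetric in its two arguments, but precision and recall trade places if the order is reversed. A minimal sketch with made-up binary labels (`y_true` and `y_pred` below are illustrative, not notebook output):

```python
from sklearn import metrics

# Hypothetical labels: three real positives, of which the model finds only one.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

# Accuracy does not care about argument order.
acc = metrics.accuracy_score(y_true, y_pred)
acc_swapped = metrics.accuracy_score(y_pred, y_true)

# Precision and recall do: swapping the arguments swaps their roles.
prec = metrics.precision_score(y_true, y_pred)
rec = metrics.recall_score(y_true, y_pred)
prec_swapped = metrics.precision_score(y_pred, y_true)

print(f"accuracy={acc:.3f} (swapped: {acc_swapped:.3f})")
print(f"precision={prec:.3f}, recall={rec:.3f}, precision with swapped args={prec_swapped:.3f}")
```

Here the single positive prediction is correct, so precision is 1, while recall is only 1/3. With the arguments swapped, `precision_score` reports the recall value instead.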

In [None]:
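# Plot the confusion matrix, normalized over all test samples.
# ConfusionMatrixDisplay.from_estimator requires scikit-learn >= 1.0.
metrics.ConfusionMatrixDisplay.from_estimator(mod_dt, X_test, y_test,
                                              display_labels=mod_dt.classes_,
                                              cmap=plt.cm.Blues, normalize='all')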

In [None]:
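# Cooler: how much each feature contributed to the splits, labelled by name.
pd.Series(mod_dt.feature_importances_, index=features).sort_values(ascending=False)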


In [None]:
plt.figure(figsize = (10,8))
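# Pass class names as a plain list of strings; some scikit-learn versions reject a NumPy array here.
plot_tree(mod_dt, feature_names = features, class_names = list(mod_dt.classes_), filled = True);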

The tree above uses only petal_width and petal_length, so we can plot the decision boundary in two dimensions.
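
A minimal sketch of such a boundary plot follows. It is self-contained, so it makes some assumptions that differ from the cells above: it loads Iris from `sklearn.datasets.load_iris` rather than `./data/iris.csv`, fits a fresh tree (`tree_2d`, an illustrative name) on the two petal columns using all 150 rows, and picks the grid ranges by hand.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: use scikit-learn's bundled copy of Iris (columns 2 and 3 are the petal measurements).
iris = load_iris()
X_petal = iris.data[:, 2:4]   # petal length, petal width (cm)
y_iris = iris.target          # integer class codes 0, 1, 2

# Fit a depth-3 tree on just the two petal features.
tree_2d = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_petal, y_iris)

# Predict the class at every point of a dense grid covering the feature range.
xx, yy = np.meshgrid(np.linspace(0, 7.5, 300), np.linspace(0, 3, 300))
zz = tree_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Shade the predicted regions and overlay the actual samples.
fig, ax = plt.subplots(figsize=(8, 6))
ax.contourf(xx, yy, zz, alpha=0.3, cmap="viridis")
ax.scatter(X_petal[:, 0], X_petal[:, 1], c=y_iris, cmap="viridis", edgecolor="k")
ax.set_xlabel("petal_length")
ax.set_ylabel("petal_width")
```

Because a decision tree splits on one feature at a time, the shaded regions are axis-aligned rectangles.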

What happens with the Titanic dataset?

In [None]:
df_titanic = pd.read_csv("./data/titanic.csv")
df_titanic = pd.get_dummies(df_titanic, columns=['sex'])
# Be cheeky with our NaNs: just drop any row that is missing age or fare.
df_titanic = df_titanic[(df_titanic["age"].notna()) & (df_titanic["fare"].notna())]
df_titanic.head()

In [None]:
train, test = train_test_split(df_titanic, 
                               test_size=0.4, 
                               stratify=df_titanic["survived"])

In [None]:
features = ["pclass", "fare", "sex_female", "age"]
X_train = train[features]
y_train = train.survived
X_test = test[features]
y_test = test.survived

In [None]:
mod_dt = DecisionTreeClassifier(max_depth = 3, random_state = 1)
mod_dt.fit(X_train,y_train)
prediction=mod_dt.predict(X_test)
# Check some measures... (scikit-learn metrics take y_true first, then y_pred)
print(f"The accuracy of the Decision Tree is {metrics.accuracy_score(y_test,prediction):.3f}")
print(f"The Precision of the Decision Tree is {metrics.precision_score(y_test,prediction,average='weighted'):.3f}")
print(f"The Recall of the Decision Tree is {metrics.recall_score(y_test,prediction,average='weighted'):.3f}")

In [None]:
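# Plot the confusion matrix, normalized over all test samples.
# ConfusionMatrixDisplay.from_estimator requires scikit-learn >= 1.0.
metrics.ConfusionMatrixDisplay.from_estimator(mod_dt, X_test, y_test,
                                              display_labels=["died","survived"],
                                              cmap=plt.cm.Blues, normalize='all')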

In [None]:
# Plot the precision-recall curve (PrecisionRecallDisplay requires scikit-learn >= 1.0).
metrics.PrecisionRecallDisplay.from_estimator(mod_dt, X_test, y_test)

In [None]:
plt.figure(figsize = (15,8))
# class_names must be listed in ascending class order: 0 -> died, 1 -> survived.
plot_tree(mod_dt, feature_names = features, class_names=["died", "survived"], filled = True);

We can also show the decision boundary (pre-rendered image, no plotting code):
<img src="./data/boundry.png">

# A Quick Note: Feature Engineering

Sometimes we can't just use the features we have; we have to create new features from them.  This process is called feature engineering.

To demonstrate, let's generate some random points and try to predict $r$, the radius of the origin-centered circle each point lies on, from just their x and y coordinates.

In [None]:
from sklearn.model_selection import train_test_split

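n_points = 100
# Random integer coordinates in [1, 99] (randint's upper bound is exclusive).
data = {"x": np.random.randint(1,100,n_points), "y": np.random.randint(1,100,n_points)}
# r is the distance of each point from the origin.
data['r'] = np.sqrt(data['x']**2 + data['y']**2)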
df_circles = pd.DataFrame(data)
df_circles.head()

In [None]:
df_circles.plot.scatter(x='x', y='y')

In [None]:
from sklearn.linear_model import LinearRegression
# Let's use a Linear Regression to try to predict r given x and y.
X_train, X_test, y_train, y_test = train_test_split(df_circles[['x','y']], 
                                                    df_circles["r"], 
                                                    test_size=0.3)

In [None]:
reg = LinearRegression().fit(X_train, y_train)
print(f'''Model Score: {reg.score(X_train, y_train):.3f} and Validation Score: {reg.score(X_test, y_test):.3f}.''')

In [None]:
reg.coef_

That's pretty good, but we know the equation for a circle is $x^2 + y^2 = r^2$, so what happens if we add $x^2$ and $y^2$ to our data frame as new features?
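
One way to see why squared features are a natural thing to try: $r^2$ (rather than $r$ itself) is exactly linear in $x^2$ and $y^2$. The sketch below is a side check under its own assumptions: it generates fresh points with a seeded generator, uses $r^2$ as the target instead of the $r$ used in this notebook, and the names `pts` and `reg_sq` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Fresh synthetic points (seeded so the sketch is repeatable).
rng = np.random.default_rng(0)
pts = pd.DataFrame({"x": rng.integers(1, 100, size=100),
                    "y": rng.integers(1, 100, size=100)})

# Engineered features, plus the squared radius as the target.
pts["x2"] = pts["x"] ** 2
pts["y2"] = pts["y"] ** 2
pts["r2"] = pts["x2"] + pts["y2"]

# r^2 = 1*x^2 + 1*y^2 + 0, so a linear model can represent it exactly.
reg_sq = LinearRegression().fit(pts[["x2", "y2"]], pts["r2"])
print(reg_sq.coef_, reg_sq.intercept_)
print(f"R^2 on r^2: {reg_sq.score(pts[['x2', 'y2']], pts['r2']):.6f}")
```

The fitted coefficients should come out at essentially 1 and 1 with an intercept near 0. Predicting $r$ itself still involves a square root, so the squared features can help there without making the relationship exactly linear.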

In [None]:
df_circles['x2'] = df_circles['x']**2
df_circles['y2'] = df_circles['y']**2
df_circles.head()

In [None]:
# Train and run again...
X_train, X_test, y_train, y_test = train_test_split(df_circles[['x', 'y','x2','y2']], 
                                                    df_circles["r"], 
                                                    test_size=0.3)
reg = LinearRegression().fit(X_train, y_train)
print(f'''Model Score: {reg.score(X_train, y_train):.3f} and Validation Score: {reg.score(X_test, y_test):.3f}.''')

In [None]:
reg.coef_