# Building Decision Trees using scikit-learn
NJabbari; Aug 28th, 2019

Fit a decision tree classifier with scikit-learn on the data in tennis.csv.  
I suggest two changes to the solution workflow provided by Learn.co:

1. `y` is a vector, not a matrix, so it should be lowercase.
2. Split the data into train and test sets before one-hot encoding rather than after. This avoids leaking information from the test set into the training set.
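
To make the second point concrete, here is a minimal sketch on a made-up `outlook` column (the toy values are assumptions, not rows from tennis.csv). The encoder is fitted on the training rows only, so a category that appears only in the held-out rows is never learned; with `handle_unknown='ignore'` it is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 'rainy' appears only in the held-out rows
train = pd.DataFrame({'outlook': ['sunny', 'overcast', 'sunny']})
test = pd.DataFrame({'outlook': ['rainy', 'sunny']})

# Fit on the training rows only, so the encoder never sees test categories
toy_ohe = OneHotEncoder(handle_unknown='ignore')
train_encoded = toy_ohe.fit_transform(train).toarray()
test_encoded = toy_ohe.transform(test).toarray()

learned_categories = list(toy_ohe.categories_[0])
print(learned_categories)  # categories learned from the training rows only
print(test_encoded)        # the unseen 'rainy' row has no active column
```

Fitting the encoder on the full dataset instead would add a `rainy` column, which is information the model should not have at training time.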

In [None]:
import pandas as pd 
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree 
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import classification_report, confusion_matrix
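from io import StringIO  # sklearn.externals.six is not available in recent scikit-learn releases
from sklearn.tree import export_graphviz
from IPython.display import Image  # used below to display the rendered tree
import pydotplus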

## Create dataframe and preprocess

In [None]:
# Load the dataset
df = pd.read_csv('tennis.csv') 

In [None]:
# Split features and target variable
X = df[['outlook', 'temp', 'humidity', 'windy']] 
y = df['play']

## Create Test and Training sets


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# Instantiate a one hot encoder
ohe = OneHotEncoder(handle_unknown='ignore')

# Fit on X_train only, then transform both X_train and X_test
X_train_ohe = ohe.fit_transform(X_train)
X_test_ohe = ohe.transform(X_test)

# Convert the encoded X_train and X_test to pandas DataFrames
# (get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names())
X_train_ohe_df = pd.DataFrame(X_train_ohe.toarray(), columns=ohe.get_feature_names_out())
X_test_ohe_df = pd.DataFrame(X_test_ohe.toarray(), columns=ohe.get_feature_names_out())

In [None]:
ohe.categories_

In [None]:
ohetarget = OneHotEncoder()
y_train_ohe = ohetarget.fit_transform(y_train.values.reshape(-1, 1))
y_test_ohe = ohetarget.transform(y_test.values.reshape(-1, 1))
# Convert to pandas DataFrames; the input is an unnamed array, so columns are named x0_<category>
# (get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names())
y_train_ohe_df = pd.DataFrame(y_train_ohe.toarray(), columns=ohetarget.get_feature_names_out())
y_test_ohe_df = pd.DataFrame(y_test_ohe.toarray(), columns=ohetarget.get_feature_names_out())

In [None]:
ohetarget.categories_

Alternatively, instead of one-hot encoding the target, convert it to binary values (1 for 'yes', 0 otherwise).

In [None]:
y_train_binary = y_train.apply(lambda x: 1 if x == 'yes' else 0)
y_test_binary = y_test.apply(lambda x: 1 if x == 'yes' else 0)
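
The same mapping can be written without a lambda. The sketch below uses a made-up label series (an assumption, not the tennis.csv column) and shows two vectorized variants: a boolean comparison, which behaves like the lambda above, and `Series.map`, which leaves any unexpected label as `NaN` so that typos in the data become visible.

```python
import pandas as pd

# Hypothetical labels, standing in for the 'play' column
labels = pd.Series(['yes', 'no', 'yes', 'no'], name='play')

# Boolean comparison: anything that is not 'yes' becomes 0
binary_from_compare = (labels == 'yes').astype(int)

# Explicit mapping: labels missing from the dict would become NaN
binary_from_map = labels.map({'yes': 1, 'no': 0})

print(binary_from_compare.tolist())
print(binary_from_map.tolist())
```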

## Train the Decision Tree 

In [None]:
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train_ohe_df, y_train_ohe_df['x0_yes'])
# Generate predictions on the test set
y_test_pred = clf.predict(X_test_ohe_df)
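
With `criterion='entropy'`, splits are scored by how much they reduce the entropy of the labels (the information gain). The following is a rough sketch of that calculation for a single binary split, not scikit-learn's internal implementation; the helper names `entropy` and `information_gain` and the toy labels are assumptions made for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of its two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node with two 'yes' and two 'no' labels, split perfectly in two
parent = ['yes', 'yes', 'no', 'no']
left, right = ['yes', 'yes'], ['no', 'no']

gain = information_gain(parent, left, right)
print(entropy(parent), gain)
```

A perfectly mixed binary node has an entropy of 1 bit, and a split that separates the classes completely recovers all of it.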

In [None]:
dot_file = StringIO()

export_graphviz(clf, out_file=dot_file, filled=True, rounded=True)

# Render the DOT source to a PNG and display it inline
image = pydotplus.graph_from_dot_data(dot_file.getvalue())
Image(image.create_png())
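
If Graphviz or pydotplus is not installed, a plain-text view of the tree is one possible fallback. The sketch below uses `sklearn.tree.export_text` on a tiny made-up dataset; the feature names `is_sunny` and `is_windy` and the data are assumptions for illustration, and `demo_clf` is a separate classifier, not the one trained above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical one-hot style features [is_sunny, is_windy] and binary labels
X_demo = [[1, 0], [1, 1], [0, 0], [0, 1]]
y_demo = [0, 0, 1, 1]

demo_clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
demo_clf.fit(X_demo, y_demo)

# Text rendering of the fitted tree: one line per split and per leaf
tree_rules = export_text(demo_clf, feature_names=['is_sunny', 'is_windy'])
print(tree_rules)
```

The same call should work on the notebook's `clf` by passing `feature_names=list(X_train_ohe_df.columns)`.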

In [None]:
y_test_pred == y_test_ohe_df['x0_yes']  # 3 out of 5 times, "play" is correctly predicted to be "yes".

In [None]:
# score() returns the mean accuracy on the given test data
clf.score(X_test_ohe_df, y_test_ohe_df['x0_yes'])

# Use the following instead if you continue with y_test_binary
# clf.score(X_test_ohe_df, y_test_binary)

In [None]:
print(confusion_matrix(y_test_ohe_df['x0_yes'], y_test_pred))
print(classification_report(y_test_ohe_df['x0_yes'], y_test_pred))
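
To relate the confusion matrix to the accuracy reported by `clf.score`, here is a small sketch on made-up labels (assumptions for illustration, not the predictions from this notebook). For a binary problem, `confusion_matrix(...).ravel()` yields the counts in the order true negatives, false positives, false negatives, true positives.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions
y_true_demo = [1, 0, 1, 1, 0]
y_pred_demo = [1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Accuracy is the share of predictions on the diagonal of the matrix
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, accuracy)
```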