# Decision Trees in scikit-learn
Using the `DecisionTreeClassifier` in scikit-learn.  

In [1]:
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt


In [2]:
apears = pd.read_csv('data/ApplesPears.csv')
apears.head()

   Greeness  Height  Width  Taste  Weight   H/W  Class
0       210      60     62  Sweet     186  0.97  Apple
1       220      70     53  Sweet     180  1.32   Pear
2       215      55     50   Tart     152  1.10  Apple
3       180      76     40  Sweet     152  1.90   Pear
4       220      68     45  Sweet     153  1.51   Pear


scikit-learn accepts a categorical class label, but `DecisionTreeClassifier` cannot handle categorical features directly.  
So we drop the `Taste` feature. 

Handling categorical features is covered later in the notebook.

In [3]:
y = apears.pop('Class').values
apears.pop('Taste')    # Drop the categorical feature; the tree needs numeric inputs
ap_features = apears.columns
X = apears.values
X[0]

array([210.  ,  60.  ,  62.  , 186.  ,   0.97])

In [4]:
ap_features

Index(['Greeness', 'Height', 'Width', 'Weight', 'H/W'], dtype='object')

In [5]:
y

array(['Apple', 'Pear', 'Apple', 'Pear', 'Pear', 'Apple', 'Pear', 'Apple',
       'Apple', 'Apple'], dtype=object)

In [None]:
apears

Two key methods:
1. `fit` method will train the tree from the data.
2. `predict` method will produce class predictions for an array of test data. 
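
As a minimal sketch of the `fit`/`predict` pattern, the block below trains a tree on four made-up fruit measurements and classifies two new rows. The names `X_toy`, `y_toy` and `toy_tree`, and the numbers themselves, are illustrative assumptions and are not taken from `ApplesPears.csv`.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: each row is [height, width]
X_toy = [[60, 62], [55, 50], [70, 53], [76, 40]]
y_toy = ['Apple', 'Apple', 'Pear', 'Pear']

# fit() learns the tree from the training rows and their labels
toy_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
toy_tree.fit(X_toy, y_toy)

# predict() expects a 2D array-like: one row per example to classify
toy_pred = toy_tree.predict([[58, 60], [75, 42]])
print(toy_tree.classes_)   # class labels in sorted order
print(toy_pred)
```

Note that `predict` takes a list of rows even for a single example, which is why the notebook wraps a row as `[X[2]]`.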

In [None]:
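dtree = DecisionTreeClassifier(criterion='entropy')
# Fit on the NumPy array X (same data as apears) so that predict() on
# array rows does not trigger a feature-name warning
ap_tree = dtree.fit(X, y)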

In [None]:
ap_tree.predict([X[2]])

### Plot the tree

Note that the left-hand branch is always the "yes" branch, taken when the condition in the node is true. Leaf nodes make no decision, so they show no condition on line 1.
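
The branch convention can also be checked without a plot. The sketch below, on a hypothetical one-feature dataset (not the notebook's data), reads the fitted `tree_` structure and prints a text rendering with `export_text`. Rows that satisfy `feature <= threshold` go to the left child.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical single feature (height) with two classes
X_toy = [[60], [55], [70], [76]]
y_toy = ['Apple', 'Apple', 'Pear', 'Pear']
toy_tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X_toy, y_toy)

t = toy_tree.tree_
root_threshold = t.threshold[0]                      # split value at the root node
left, right = t.children_left[0], t.children_right[0]

# The majority class of each child tells us which side the "true" branch is on
left_class = toy_tree.classes_[t.value[left][0].argmax()]
right_class = toy_tree.classes_[t.value[right][0].argmax()]
print('root test: Height <=', root_threshold)
print('left (test true):', left_class, '| right (test false):', right_class)

# Text version of the same tree
print(export_text(toy_tree, feature_names=['Height']))
```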

In [None]:
# list(...) because some scikit-learn versions reject a pandas Index for feature_names
tree.plot_tree(ap_tree, feature_names=list(ap_features), fontsize = 12,
                      class_names=['Apple','Pear'],  
                      filled=True, rounded=True) 
None # suppressing the verbose return from plot_tree

In [None]:
apears.pop('H/W')    # Delete this feature to make it harder
X = apears.values
ap_features = apears.columns

In [None]:
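# Use a fresh classifier: dtree.fit returns dtree itself, so refitting it
# would also overwrite ap_tree
ap2_tree = DecisionTreeClassifier(criterion='entropy').fit(X, y)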

In [None]:
fig, ax = plt.subplots(figsize=(6,6))
tree.plot_tree(ap2_tree, feature_names=list(ap_features), fontsize = 12,
                      class_names=['Apple','Pear'],  
                      filled=True, rounded=True) 
None # suppressing the verbose return from plot_tree

***
## Athlete Data

In [None]:
import pandas as pd
athlete = pd.read_csv('data/AthleteSelection.csv',index_col = 'Athlete')
athlete.head()

In [None]:
y = athlete.pop('Selected').values
X = athlete.values

In [None]:
atree = DecisionTreeClassifier(criterion='gini')
atree = atree.fit(X,y)
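
The notebook uses both `criterion='entropy'` and `criterion='gini'`. As a rough illustration of what these impurity measures compute, the sketch below evaluates both by hand on the ten apple/pear labels printed earlier (six apples, four pears). The variable names are illustrative; the formulas are the standard definitions of Shannon entropy (base 2) and Gini impurity.

```python
import numpy as np

# The ten class labels shown earlier in the notebook
labels = np.array(['Apple', 'Pear', 'Apple', 'Pear', 'Pear',
                   'Apple', 'Pear', 'Apple', 'Apple', 'Apple'])

# Class proportions p_k
_, counts = np.unique(labels, return_counts=True)
p = counts / counts.sum()

entropy = -np.sum(p * np.log2(p))   # -sum_k p_k * log2(p_k)
gini = 1.0 - np.sum(p ** 2)         # 1 - sum_k p_k^2

print('entropy:', round(entropy, 3))
print('gini:', round(gini, 3))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so either can guide the choice of split.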

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
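# plot_tree expects class names in ascending order of the labels,
# so take them from the fitted tree rather than hard-coding the order
tree.plot_tree(atree, feature_names=['Speed','Agility'],  
                      class_names=[str(c) for c in atree.classes_],  
                      filled=True, rounded=True)
None # suppressing the verbose return from plot_tree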

Run a test example; here we reuse one of the training examples.

In [None]:
y_pred = atree.predict([X[5]])
print('Predicted class label:',y_pred[0])

## Restaurant Data
Predictive features are categories (rather than numeric).

In [None]:
import pandas as pd
restaurant = pd.read_csv('data/restaurant.csv',index_col = 'No')
restaurant.head()

## Dealing with category data
Convert the categories to numeric features. Two options:  
1. the `get_dummies` function in pandas.
2. the `OneHotEncoder` class in scikit-learn. 

In [None]:
df = pd.DataFrame({'Pet': ['cat', 'dog', 'cat','ferret'], 
                   'Transport': ['bike', 'car', 'car','bike'],
                   'Area': ['urban','urban','rural','urban']})
df

### Pandas `get_dummies`
The pandas `get_dummies` function is the easiest way to do one-hot encoding.  
But it does not remember the categories it has seen, so applying the same encoding to a test file later gets awkward.
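
A small sketch of the awkwardness, using a hypothetical train/test pair in which the test data contains no `ferret`: the two calls to `get_dummies` produce different columns, and the test frame has to be realigned by hand. The `reindex` step is one possible workaround, shown here as an assumption about how you might patch it.

```python
import pandas as pd

# Hypothetical train and test frames; 'ferret' never appears in the test data
train = pd.DataFrame({'Pet': ['cat', 'dog', 'ferret']})
test = pd.DataFrame({'Pet': ['dog', 'cat']})

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)
print(list(train_d.columns))   # three dummy columns
print(list(test_d.columns))    # only two: the encodings no longer match

# Manual fix: force the test frame onto the training columns,
# filling the categories it never saw with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned)
```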

In [None]:
pd.get_dummies(df)

In [None]:
pd.get_dummies(df,drop_first=True)

### Using `OneHotEncoder` to convert category features to numbers

In [None]:
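from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# `sparse` was renamed `sparse_output` in scikit-learn 1.2; use sparse=False on older versions
onehot_encoder = OneHotEncoder(sparse_output=False)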
dfOH = onehot_encoder.fit_transform(df)
dfOH

In [None]:
type(dfOH)

In [None]:
onehot_encoder.get_feature_names_out()

In [None]:
onehot_encoder.categories_

### `LabelEncoder` also converts category features to numbers
This is more compact: one integer column per feature.  
But it is not exactly what we want, because the integer codes imply an ordering that does not exist.  
Ferrets are not more like dogs than cats. (Well, maybe they are!)
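
To make the ordering issue concrete, this short sketch encodes the `Pet` values from the example dataframe. `LabelEncoder` numbers the categories in sorted order, so a tree that splits on a threshold treats `dog` as lying between `cat` and `ferret`.

```python
from sklearn.preprocessing import LabelEncoder

pets = ['cat', 'dog', 'cat', 'ferret']

le = LabelEncoder()
codes = le.fit_transform(pets)

# Categories are numbered in sorted (alphabetical) order
print(list(le.classes_))
print(list(codes))

# A threshold test such as "Pet <= 0.5" can separate cat from {dog, ferret},
# but no single threshold can isolate dog from both of the others
print(list(le.inverse_transform([0, 1, 2])))
```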

In [None]:
# LabelEncoder only works on single columns so we must 'apply' it to the dataframe. 
label_encoder = LabelEncoder()
labelE = df.apply(label_encoder.fit_transform)
labelE

---
# Restaurant Data 
## Using OneHotEncoding
The `OneHotEncoder` class has two key methods:   
1. `fit` to 'learn' the transform from the data,
2. `transform` to apply the one-hot transform to the data. Once fitted, the same transform can be applied to other (e.g. test) datasets.


In [None]:
restaurant = pd.read_csv('data/restaurant.csv', index_col = 'No')
restaurant.head()

In [None]:
y = restaurant.pop('WillWait?').values
X = restaurant.values
X[:3,]

In [None]:
# sparse_output needs scikit-learn 1.2+ (older versions use sparse=False)
onehot_encoder = OneHotEncoder(sparse_output=False)  # We can add drop='first' 
restOH = onehot_encoder.fit(restaurant)
restOH_data = restOH.transform(restaurant)

In [None]:
restaurant.columns

In [None]:
restOH.get_feature_names_out(restaurant.columns)

In [None]:
# this is the number of features now in the dataset

restOH.get_feature_names_out(restaurant.columns).size

# Add the drop='first' parameter to the encoding and see how many features you end up with 

In [None]:
rtree = DecisionTreeClassifier(criterion='entropy')
rtreeOH = rtree.fit(restOH_data,y)
fig, ax = plt.subplots(figsize=(9, 9))
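# plot_tree expects class names in ascending order of the labels,
# so take them from the fitted tree rather than hard-coding the order
tree.plot_tree(rtreeOH, feature_names=list(restOH.get_feature_names_out(restaurant.columns)),
                      class_names=[str(c) for c in rtreeOH.classes_], fontsize = 10, 
                      filled=True, rounded=True)
None # suppressing the verbose return from plot_tree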

---
## Penguins Data
For more information on the Penguins dataset see:
https://allisonhorst.github.io/palmerpenguins/ 


In [None]:
penguins = pd.read_csv('data/penguins_train.csv')
penguins.head()

In [None]:
# keep only the numeric features
f_names = ['bill_length_mm', 'bill_depth_mm','flipper_length_mm', 'body_mass_g']
X = penguins[f_names].values
y = penguins['species']
species_names = np.unique(y)
species_names

In [None]:
X.shape

### Build the tree and visualise

Changing the `min_samples_leaf` parameter changes the *bushiness* of the tree: larger values force each leaf to hold more training samples, which gives fewer leaves.     
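
The effect can be sketched on a dataset bundled with scikit-learn. The block below uses iris as a stand-in (an assumption, since the penguins CSV is a local file) and counts the leaves for several values of `min_samples_leaf`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Iris (150 rows) stands in for the penguins data here
X_iris, y_iris = load_iris(return_X_y=True)

leaf_counts = {}
for m in (1, 5, 20, 50):
    clf = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=m, random_state=0)
    clf.fit(X_iris, y_iris)
    leaf_counts[m] = clf.get_n_leaves()

# Each leaf must hold at least m samples, so a tree on n rows has at most n // m leaves
for m, n_leaves in leaf_counts.items():
    print(f'min_samples_leaf={m:>2}: {n_leaves} leaves')
```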


In [None]:
ptree = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=20)
ptree.fit(X,y)

In [None]:
fig, ax = plt.subplots(figsize=(9, 9))
# list(...) because some scikit-learn versions reject a NumPy array for class_names
tree.plot_tree(ptree, feature_names=f_names,  
                      class_names=list(species_names), fontsize = 9,
                      filled=True, rounded=True) 
None # suppressing the verbose return from plot_tree

In [None]:
ptree.get_n_leaves()   # number of leaves