# Challenge
Now that you've learned about random forests and decision trees, let's do an exercise in accuracy. You know that random forests are basically a collection of decision trees. But how do the accuracies of the two models compare?

So here's what you should do. Pick a dataset. It could be one you've worked with before or it could be a new one. Then build the best decision tree you can.

Now try to match that with the simplest random forest you can. For our purposes measure simplicity with [runtime](https://stackoverflow.com/questions/1557571/how-do-i-get-time-of-a-python-programs-execution). Compare that to the runtime of the decision tree. This is imperfect but just go with it.
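
One way to take such a runtime measurement outside a notebook is a small wall-clock helper. This is only a minimal sketch, not part of the original challenge: `timed` is a hypothetical helper name, and the `sum` call is a stand-in for whatever you actually want to time (a model's `fit`, a cross validation run, and so on).

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; replace with e.g. timed(model.fit, X_train, y_train).
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, wall time: {seconds:.4f} s")
```

Inside a notebook, the `%%time` cell magic does the same job with less typing.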

Hopefully out of this you'll see the power of random forests, but also their potential costs. Remember, in the real world you won't necessarily be dealing with thousands of rows. It could be millions, billions, or even more.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

%alias_magic t time

Created `%t` as an alias for `%time`.
Created `%%t` as an alias for `%%time`.


In [2]:
df = pd.read_csv('preprocessed_data.csv')

This dataset has been preprocessed and should be relatively clean. What we want to do is classify whether the red fighter or the blue fighter won. For simplicity, the `Winner` column will be encoded as a new column, `R_win`, which is 1 if red won and 0 otherwise.

In [3]:
df['R_win'] = (df.Winner == 'Red').astype('int')

Right off the bat, let's split the data into X (features) and y (target).

In [4]:
y = df.R_win
X = df.drop(columns=['Winner', 'R_win'])

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1)

Our holdout set for the entire project will be `X_test` and `y_test`. Any transformations made to the training data will ultimately have to be applied to these as well, but they will not be touched until the end, once we are satisfied with the process on the training set (using cross validation).

# Evaluation Baseline Metric

The baseline evaluation metric uses the two fighters' win percentages at the time of the fight, $wins/(wins + losses)$, and asks how often the fighter with the higher win percentage won the fight. Fights where the two win percentages are equal are not included.

In [6]:
R_win_percent = X.R_wins / (X.R_wins + X.R_losses)
B_win_percent = X.B_wins / (X.B_wins + X.B_losses)

In [7]:
percents = pd.DataFrame({'R_win_percent': R_win_percent, 'B_win_percent': B_win_percent, 'R_win': y})

In [8]:
unequal_per = percents[percents.R_win_percent != percents.B_win_percent].copy()

In [9]:
unequal_per['R>B'] = unequal_per.R_win_percent > unequal_per.B_win_percent

In [10]:
baseline = (unequal_per.R_win == unequal_per["R>B"]).sum() / unequal_per.shape[0]
print(f'Baseline Rate: {baseline}')

Baseline Rate: 0.5589878409464344


So, as a baseline, simply picking the fighter with the better win percentage predicts the winner almost 56% of the time. At the very least, we want our model's accuracy to be higher than that.
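
One detail worth checking in a baseline like this: if a fighter has no recorded wins or losses, `wins / (wins + losses)` is `NaN`, and `NaN != NaN` evaluates to `True` in pandas, so such a fight would be kept as "unequal" and counted as a pick for blue. Whether this dataset contains such rows is not verified here. The sketch below shows one way to exclude them explicitly; `win_pct_baseline` is a hypothetical helper and the toy numbers are made up purely for illustration.

```python
import pandas as pd


def win_pct_baseline(r_wins, r_losses, b_wins, b_losses, r_win):
    """Share of comparable fights won by the fighter with the higher win percentage."""
    r_pct = r_wins / (r_wins + r_losses)  # NaN when a fighter has no prior fights
    b_pct = b_wins / (b_wins + b_losses)
    # Keep only fights where both percentages exist and differ.
    comparable = r_pct.notna() & b_pct.notna() & (r_pct != b_pct)
    pick_red = r_pct[comparable] > b_pct[comparable]
    red_won = r_win[comparable] == 1
    return float((pick_red == red_won).mean())


# Toy data (illustrative only): row 1 has a debuting red fighter, row 3 is a tie.
toy = pd.DataFrame({
    "R_wins": [10, 0, 3, 5], "R_losses": [2, 0, 3, 5],
    "B_wins": [5, 4, 1, 2], "B_losses": [5, 1, 3, 2],
    "R_win": [1, 0, 0, 1],
})
toy_baseline = win_pct_baseline(toy.R_wins, toy.R_losses, toy.B_wins, toy.B_losses, toy.R_win)
print(f"Toy baseline rate: {toy_baseline}")
```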

# Decision Tree Method

As the challenge states, the first goal is to create the best decision tree classifier we can. Let's do that for this dataset, starting with the data as loaded (already preprocessed in this case). Let's time this method as well.

In [11]:
dtc = DecisionTreeClassifier()

Let's use GridSearchCV to help tune for the best hyperparameters!

In [12]:
param_grid = dict(criterion=['gini', 'entropy'], max_depth=np.arange(5, 26, 5), class_weight=[None, 'balanced'])
grid = GridSearchCV(dtc, param_grid, scoring='accuracy', n_jobs=-1, cv=10, iid=False)

In [13]:
%%t
grid.fit(X_train, y_train)

Wall time: 47 s


GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=None,
                                              splitter='best'),
             iid=False, n_jobs=-1,
             param_grid={'class_weight': [None, 'balanced'],
                         'criterion': ['gini', 'entropy'],
                  

In [14]:
print(f'Best Score: {grid.best_score_}\nBest Estimator: {grid.best_estimator_}')

Best Score: 0.6537744142491305
Best Estimator: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')


For a final evaluation of simplicity, we will time the execution of one 10-fold cross validation for our decision tree.

In [15]:
dtc = grid.best_estimator_
dtc.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [16]:
%%t
cross_val_score(dtc, X_train, y_train, cv=10, scoring='accuracy', n_jobs=-1).mean(), baseline

Wall time: 1.61 s


(0.6522273821809426, 0.5589878409464344)

In [17]:
confusion_matrix(y_train, dtc.predict(X_train))

array([[ 153,  929],
       [  61, 2089]], dtype=int64)

There is a clear class imbalance here (2,150 red wins vs. 1,082 blue wins in the training set), and the tree leans on it heavily! It identifies 2,089 of the 2,150 red wins but only 153 of the 1,082 blue wins. Let's see how it performs on our holdout group (the test group from earlier).

In [18]:
dtc.score(X_test, y_test)

0.6444444444444445

In [19]:
confusion_matrix(y_test, dtc.predict(X_test))

array([[  5, 125],
       [  3, 227]], dtype=int64)

We can see that the accuracy looks as good as it does mostly because the majority of fights are red wins, which the model predicts almost every time (352 of the 360 holdout predictions are red). Its performance when red loses is significantly worse: only 5 of the 130 blue wins are identified. It only scores well here because red wins much more often than blue in this dataset. It is easy to see that this is a very biased model (toward red wins).
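
To put numbers on that bias, the per-class recall can be read straight off the holdout confusion matrix printed above. This is a minimal sketch using plain Python: `class_recalls` is a hypothetical helper, and the matrix is copied from the output of `confusion_matrix(y_test, dtc.predict(X_test))`, where rows are the true class (0 = blue win, 1 = red win) and columns are the predicted class.

```python
def class_recalls(cm):
    """Return (recall for class 0, recall for class 1) from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    return tn / (tn + fp), tp / (tp + fn)


# Holdout confusion matrix of the tuned decision tree (copied from the output above).
holdout_tree = [[5, 125], [3, 227]]

blue_recall, red_recall = class_recalls(holdout_tree)
n_fights = sum(sum(row) for row in holdout_tree)
accuracy = (holdout_tree[0][0] + holdout_tree[1][1]) / n_fights
always_red = sum(holdout_tree[1]) / n_fights  # accuracy of predicting red every time

print(f"blue recall: {blue_recall:.3f}, red recall: {red_recall:.3f}")
print(f"accuracy: {accuracy:.4f}, always-red accuracy: {always_red:.4f}")
```

Comparing `accuracy` with `always_red` shows how much of the tree's score comes from simply favoring the majority class.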

# Using a Forest

In [20]:
rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1)

In [21]:
%%t
cross_val_score(rfc, X_train, y_train, cv=10, n_jobs=-1, scoring='accuracy').mean(), baseline

Wall time: 10.7 s


(0.6791394335511982, 0.5589878409464344)

A basic Random Forest Classifier (100 trees, default settings) gets a better cross-validated score on the training set than even the best decision tree we could create above (0.679 vs. 0.652). Notice, however, that it took nearly 7x as long to compute the cross validation scores (10.7 s vs. 1.61 s)! Let's do a similar exercise as above, using GridSearchCV to try to tune our hyperparameters to be the best they can be.

In [22]:
param_grid2 = dict(criterion=['gini', 'entropy'], max_depth=np.arange(5, 26, 5), class_weight=[None, 'balanced'])

In [23]:
grid2 = GridSearchCV(rfc, param_grid2, scoring='accuracy', n_jobs=-1, iid=False, cv=10)

In [24]:
%%t
grid2.fit(X_train, y_train)

Wall time: 3min 25s


GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=-1,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid=Fal

In [25]:
print(f'Best Score: {grid2.best_score_}\nBest Estimator: {grid2.best_estimator_}')

Best Score: 0.6834823988074763
Best Estimator: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=20, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)


In [32]:
rfc = grid2.best_estimator_
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=20, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [27]:
%%t
cross_val_score(rfc, X_train, y_train, cv=10, scoring='accuracy', n_jobs=-1).mean(), baseline

Wall time: 15.3 s


(0.6776029125100334, 0.5589878409464344)

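So, it looks like the tuned forest scores about the same as the default one: the grid search reported 0.683, this re-run gives 0.678, and the default forest scored 0.679. Since no `random_state` is set, differences this small are within run-to-run noise, and the tuning took almost three and a half minutes. Worth it? Let's compare our holdout group results.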

In [28]:
rfc.score(X_test, y_test), baseline

(0.6527777777777778, 0.5589878409464344)

In [34]:
confusion_matrix(y_test, rfc.predict(X_test))

array([[ 20, 110],
       [ 10, 220]], dtype=int64)

In [35]:
rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [36]:
rfc.score(X_test, y_test), baseline

(0.6472222222222223, 0.5589878409464344)

In [37]:
confusion_matrix(y_test, rfc.predict(X_test))

array([[ 18, 112],
       [ 15, 215]], dtype=int64)

The tuned and the default forests score about the same on the holdout set (0.653 vs. 0.647), so the extra tuning time bought very little. In both cases the class imbalance issue is still highly visible: most blue wins are predicted as red wins. Let's see if we can use SMOTE to improve the results.

### Trying out SMOTE for class imbalance

In [38]:
X_train_smote, y_train_smote = SMOTE().fit_resample(X_train, y_train)

In [39]:
rfc = grid2.best_estimator_
rfc.fit(X_train_smote, y_train_smote)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=20, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [42]:
rfc.score(X_test, y_test), baseline

(0.6555555555555556, 0.5589878409464344)

In [43]:
confusion_matrix(y_test, rfc.predict(X_test))

array([[ 41,  89],
       [ 35, 195]], dtype=int64)

Using SMOTE evens out the results a little (41 of the 130 blue wins are now identified, up from 20), but it barely changes our overall accuracy (0.656 vs. 0.653). In this case this model would likely be preferred, since it does not favor the red class as much, although the bias is still clearly present.

In [44]:
dtc.fit(X_train_smote, y_train_smote)
dtc.score(X_test, y_test), baseline

(0.5666666666666667, 0.5589878409464344)

In [45]:
confusion_matrix(y_test, dtc.predict(X_test))

array([[ 63,  67],
       [ 89, 141]], dtype=int64)

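It turns out that the decision tree trained on the SMOTE-resampled data is the least biased of all the models, but it also has the lowest holdout accuracy (0.567, barely above the baseline). This suggests that much of the tree's earlier accuracy came from favoring the majority class rather than from telling winners apart.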
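
Since plain accuracy rewards favoring red, it can help to line the models up by balanced accuracy (the mean of the two per-class recalls) as well. The sketch below only re-uses the holdout confusion matrices printed above; the model labels and the `balanced_accuracy` helper are hypothetical names. Note that the tuned forest's matrix implies an accuracy of about 0.667 rather than the 0.653 reported by `score`, presumably because the forest was refit between those two cells.

```python
# Holdout confusion matrices copied from the outputs above
# (rows: true blue win / red win, columns: predicted blue win / red win).
holdout = {
    "tree": [[5, 125], [3, 227]],
    "tuned forest": [[20, 110], [10, 220]],
    "default forest": [[18, 112], [15, 215]],
    "tuned forest + SMOTE": [[41, 89], [35, 195]],
    "tree + SMOTE": [[63, 67], [89, 141]],
}


def accuracy(cm):
    """Fraction of all holdout fights predicted correctly."""
    (tn, fp), (fn, tp) = cm
    return (tn + tp) / (tn + fp + fn + tp)


def balanced_accuracy(cm):
    """Mean of the recall on blue wins and the recall on red wins."""
    (tn, fp), (fn, tp) = cm
    return 0.5 * (tn / (tn + fp) + tp / (tp + fn))


ranked = sorted(holdout, key=lambda name: balanced_accuracy(holdout[name]), reverse=True)
for name in ranked:
    cm = holdout[name]
    print(f"{name:22s} accuracy={accuracy(cm):.3f}  balanced={balanced_accuracy(cm):.3f}")
```

On this metric, a model that always predicts red would score exactly 0.5, which makes the gap between the models easier to interpret.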

# Summary

Since this data has two fighters for each bout, it would probably make sense to balance out winners and losers so that red wins half the time and blue wins half the time. This would hopefully take care of the class imbalance observed for every method above. Due to the nature of the data, the red and blue corners could simply be swapped for a number of the bouts.
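
A rough sketch of that corner swap is below. It rests on assumptions that were not checked against the real file: every fighter-specific column comes as an `R_`/`B_` pair with the same suffix (as `R_wins`/`B_wins` do), unpaired columns are left untouched, and there are no draws, since flipping `R_win` would turn a draw into a red win. `swap_corners` and the toy `bouts` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def swap_corners(df, frac=0.5, seed=0):
    """Swap the R_*/B_* columns and flip R_win for a random fraction of the bouts."""
    rng = np.random.default_rng(seed)
    swap = rng.random(len(df)) < frac  # boolean mask of bouts to swap
    out = df.copy()
    for r_col in df.columns:
        if not r_col.startswith("R_") or r_col == "R_win":
            continue
        b_col = "B_" + r_col[2:]
        if b_col not in df.columns:
            continue  # no blue counterpart, leave the column as it is
        out.loc[swap, r_col] = df.loc[swap, b_col].to_numpy()
        out.loc[swap, b_col] = df.loc[swap, r_col].to_numpy()
    out.loc[swap, "R_win"] = 1 - df.loc[swap, "R_win"].to_numpy()
    return out


# Toy bouts, made up for illustration only.
bouts = pd.DataFrame({
    "R_wins": [10, 3, 7, 2], "B_wins": [4, 8, 1, 6],
    "R_losses": [1, 2, 3, 4], "B_losses": [5, 6, 7, 8],
    "R_win": [1, 1, 1, 0],
})
print(swap_corners(bouts, frac=0.5, seed=0))
```

To avoid leaking information, the swap would be applied to the training data only after the train/test split, and the holdout set would be left as it is (or swapped with its own mask).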

As for which model is the best? Well, it depends. If you focus on tuning all your hyperparameters, Random Forest can take a long time to tune (almost three and a half minutes here, versus 47 seconds for the decision tree over the same grid), and with larger datasets a dimensionality reduction technique should probably be used on the data to ensure it will not take an eternity (at least on my system). However, a simple Random Forest model with default settings still scores higher in cross validation than the best-tuned Decision Tree! It is important to keep in mind, though, that the training set and the test set were both imbalanced, so it is likely that these models wouldn't actually perform nearly that well out in the wild. Unless red really does win a lot!