<a href="https://colab.research.google.com/github/hkaragah/google_colab_repo/blob/main/hands_on_ml_exercises/06_ensamble_decision_trees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ensemble Decision Trees

__Disclaimer:__ This exercise is adapted from the book _Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Third Edition)_ by Aurélien Géron, published by O'Reilly. I broke the exercises down into smaller, digestible snippets, made some modifications, and added explanations so that I can understand them better. The purpose of this notebook is only to help me understand the concepts and get hands-on practice while reading the book.

## Objective
Use an ensemble of decision trees for classification to improve prediction accuracy.

## Load Dataset

In [2]:
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.preprocessing import StandardScaler
from graphviz import Source
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
import pandas as pd
import seaborn as sns
from copy import deepcopy
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report


In [3]:
X_moons, y_moons = make_moons(n_samples=10_000, noise=0.4, random_state=42)

## Generate train set

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X_moons, y_moons, test_size=0.2, random_state=42)

n_sets = 1000
set_size = 100
mini_batches = []

indices = ShuffleSplit(n_splits=n_sets, test_size=len(X_train) - set_size, random_state=42) # return randomly selected train and test indices

for train_index, test_index in indices.split(X_train):
    X_batch, y_batch = X_train[train_index], y_train[train_index]
    mini_batches.append((X_batch, y_batch))
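
As a quick sanity check of the subsetting step above, the sketch below uses small, hypothetical values (`X_demo`, 1,000 samples, 5 subsets) rather than the notebook's data. It assumes only the documented behavior of `ShuffleSplit`: setting `test_size` to everything except `set_size` leaves `set_size` indices on the train side of each split. Within a subset no sample is repeated, but separate subsets are drawn independently and can share samples.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Small hypothetical values, not the notebook's 8,000 samples / 1,000 subsets
n_samples, set_size, n_sets = 1_000, 100, 5
X_demo = np.arange(n_samples).reshape(-1, 1)

splitter = ShuffleSplit(n_splits=n_sets, test_size=n_samples - set_size, random_state=42)

# Keep only the "train" side of each split; the "test" side is discarded
subsets = [train_idx for train_idx, _ in splitter.split(X_demo)]

for idx in subsets:
    assert len(idx) == set_size             # each subset has exactly set_size samples
    assert len(np.unique(idx)) == set_size  # no sample is repeated within a subset

# Different subsets are drawn independently, so they may overlap
shared = np.intersect1d(subsets[0], subsets[1])
print(f"Samples shared by the first two subsets: {len(shared)}")
```

Passing `train_size=set_size` instead of `test_size` should give subsets of the same size and may read more directly.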

## Define and Train Ensemble Trees (Forest)

In [17]:
# Best hyperparameters obtained from GridSearch (see notebook: "06-Overfitted_decision_tree.ipynb")
hyperparams = {'max_depth':8, 'max_leaf_nodes':19}
clf = DecisionTreeClassifier(random_state=42, **hyperparams)

# Create one model per training subset. I deepcopy the original model so that each subset gets its own independent tree and fitting one does not overwrite another
forest = [deepcopy(clf) for _ in range(n_sets)]

# Train each model
for i in range(n_sets):
    forest[i].fit(mini_batches[i][0], mini_batches[i][1])
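
The copies above are made with `deepcopy`. A possible alternative, shown here only as a sketch, is `sklearn.base.clone`, which builds a new unfitted estimator with the same hyperparameters. The names `base_tree`, `demo_forest`, and `n_models` are hypothetical and chosen for illustration.

```python
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

# Same hyperparameters as in the notebook
base_tree = DecisionTreeClassifier(random_state=42, max_depth=8, max_leaf_nodes=19)

n_models = 3  # hypothetical small number, just for the demonstration
demo_forest = [clone(base_tree) for _ in range(n_models)]

# Each clone is a separate, unfitted object that shares the hyperparameters
for tree in demo_forest:
    assert tree is not base_tree
    assert tree.get_params() == base_tree.get_params()
```

Because `clone` copies only the hyperparameters, it never carries over fitted state, which is the property the `deepcopy` call is protecting here.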

## Evaluating Average of Accuracy Scores

In [25]:
scores = []
accuracy_scores = []

for tree in forest:
    # One way to compute accuracy scores
    scores.append(tree.score(X_test, y_test))

    # Another way to compute accuracy scores
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

print(f"Average score: {np.mean(scores)}")
print(f"Average accuracy score: {np.mean(accuracy_scores)}")

Average score: 0.802034
Average accuracy score: 0.802034


Using the tuned hyperparameters in `06-Overfitted_decision_tree.ipynb`, I obtained about 96% accuracy, but here each tree reaches a lower accuracy score on average. This is because each tree here is trained on only 100 samples, whereas previously I used 10_000 samples. <br>
Let's use a _majority vote_ for the prediction to see if it can improve the accuracy score.
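
For a rough intuition of why voting can help, the sketch below simulates hypothetical binary classifiers that are each correct 80% of the time and, as a strong simplifying assumption, make their errors independently of one another. The values `n_voters`, `n_samples`, and `p_correct` are made up for illustration and are not taken from the trees in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 101 independent voters, each correct with probability 0.8
n_voters, n_samples, p_correct = 101, 2_000, 0.8

# correct[i, j] is True when voter i classifies sample j correctly
correct = rng.random((n_voters, n_samples)) < p_correct

# Average accuracy of a single voter
individual_acc = correct.mean()

# In a binary problem the majority vote is right whenever
# more than half of the voters are right on that sample
majority_acc = (correct.sum(axis=0) > n_voters / 2).mean()

print(f"Mean individual accuracy: {individual_acc:.3f}")
print(f"Majority-vote accuracy:   {majority_acc:.3f}")
```

Real trees trained on subsets of the same dataset tend to make correlated errors, so the gain from voting in practice is expected to be smaller than in this idealized simulation.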

In [26]:
y_pred = np.empty(shape=[n_sets, len(X_test)], dtype=np.uint8) # "dtype" doesn't need to be float, there are only two classes 0 and 1
print(y_pred.shape)

for i in range(n_sets):
    y_pred[i] = forest[i].predict(X_test)



(1000, 2000)


In [32]:
from scipy.stats import mode

mode(y_pred, axis=0)

ModeResult(mode=array([1, 1, 0, ..., 0, 0, 0]), count=array([948, 909, 962, ..., 910, 992, 598]))

`mode` returns two arrays, mode and count. The "_mode_" array contains the majority vote and the "_count_" array contains the number of trees that cast that vote, for each sample in the test set `X_test`, because I ran it along `axis=0`.<br>
Now, let's compute the accuracy score.

In [36]:

y_pred_majority_votes = mode(y_pred, axis=0)[0] # shape (2000,)

print(f"Accuracy score: {accuracy_score(y_test, y_pred_majority_votes)}")

Accuracy score: 0.872


The accuracy score increased from 0.802 to 0.872, a gain of 7 percentage points (about 8.7% in relative terms).
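
Since there are only two classes (0 and 1), the same majority vote can also be computed with NumPy alone by counting the 1-votes per column. The sketch below uses a small hypothetical `votes` array (5 trees, 4 samples) in place of the real `y_pred`. It sends ties to class 0, which I believe matches how `mode` breaks ties, but that is an assumption worth checking against the SciPy documentation.

```python
import numpy as np

# Hypothetical predictions: rows are trees, columns are test samples
votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
], dtype=np.uint8)

n_trees = votes.shape[0]

# Number of trees voting for class 1 on each sample
ones = votes.sum(axis=0, dtype=np.int64)

# Class 1 wins only with a strict majority; ties go to class 0
majority = (ones > n_trees / 2).astype(np.uint8)

# Number of trees that voted for the winning class
winner_counts = np.where(majority == 1, ones, n_trees - ones)

print(majority)       # [1 0 1 0]
print(winner_counts)  # [4 4 3 3]
```

The `winner_counts` array plays the same role as the "_count_" array returned by `mode`: values close to the number of trees mean the forest is nearly unanimous on that sample.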