# Ensemble Learning and Random Forests

In this module, we will learn how to combine models to create even stronger models. This process is called <b>ensemble learning</b>. We will also take a closer look at combining decision trees into ensemble models called random forests.
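To build some intuition for why combining models can help, here is a minimal sketch under an idealized assumption that is not part of the lecture material: every model is correct with the same probability `p`, and the models make their mistakes independently of each other. The helper name `majority_accuracy` is hypothetical. Real models trained on the same data are correlated, so the gain in practice is usually smaller.

```python
from math import comb


def majority_accuracy(n_models, p):
    """Probability that a strict majority of n_models is correct.

    Assumes (idealized) that each model is independently correct with
    probability p. Use an odd n_models so that ties cannot occur.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Compare a single model with increasingly large (idealized) ensembles
for n in (1, 3, 11, 101):
    print(n, round(majority_accuracy(n, 0.6), 3))
```

Under these assumptions, the majority vote becomes more reliable as more models are added, as long as each model is better than random guessing.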

<b>Functions and attributes in this lecture: </b>
- `sklearn.ensemble` - Submodule for dealing with ensemble algorithms
 - `VotingClassifier` - Majority vote ensemble for classification
 - `RandomForestClassifier` - Classification ensemble for bagging with trees

In [1]:
# Non-sklearn packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sklearn packages
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
# Importing the breast cancer dataset
from sklearn.datasets import load_breast_cancer

# Load the dataset once and reuse the returned object
breast_cancer = load_breast_cancer()

# Getting the data and targets
X = breast_cancer['data']
y = breast_cancer['target']

# Divide into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Printing out description of the dataset
print(breast_cancer['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

## Creating many machine learning models

In [3]:
# Manually creating decision trees
first_decision_tree = DecisionTreeClassifier(random_state=42, max_leaf_nodes=6, max_depth=3)
second_decision_tree = DecisionTreeClassifier(random_state=42, max_leaf_nodes=4, max_depth=7)

In [4]:
# Creating ten decision trees
ten_decision_trees = []
for i in range(2, 12):
    ten_decision_trees.append(DecisionTreeClassifier(random_state=42, max_leaf_nodes=i, max_depth=i))
    
print(ten_decision_trees)

[DecisionTreeClassifier(max_depth=2, max_leaf_nodes=2, random_state=42), DecisionTreeClassifier(max_depth=3, max_leaf_nodes=3, random_state=42), DecisionTreeClassifier(max_depth=4, max_leaf_nodes=4, random_state=42), DecisionTreeClassifier(max_depth=5, max_leaf_nodes=5, random_state=42), DecisionTreeClassifier(max_depth=6, max_leaf_nodes=6, random_state=42), DecisionTreeClassifier(max_depth=7, max_leaf_nodes=7, random_state=42), DecisionTreeClassifier(max_depth=8, max_leaf_nodes=8, random_state=42), DecisionTreeClassifier(max_depth=9, max_leaf_nodes=9, random_state=42), DecisionTreeClassifier(max_depth=10, max_leaf_nodes=10, random_state=42), DecisionTreeClassifier(max_depth=11, max_leaf_nodes=11, random_state=42)]


In [5]:
# Using list comprehensions
ten_decision_trees = [DecisionTreeClassifier(random_state=42, max_leaf_nodes=i, max_depth=i) for i in range(2, 12)]
print(ten_decision_trees)

[DecisionTreeClassifier(max_depth=2, max_leaf_nodes=2, random_state=42), DecisionTreeClassifier(max_depth=3, max_leaf_nodes=3, random_state=42), DecisionTreeClassifier(max_depth=4, max_leaf_nodes=4, random_state=42), DecisionTreeClassifier(max_depth=5, max_leaf_nodes=5, random_state=42), DecisionTreeClassifier(max_depth=6, max_leaf_nodes=6, random_state=42), DecisionTreeClassifier(max_depth=7, max_leaf_nodes=7, random_state=42), DecisionTreeClassifier(max_depth=8, max_leaf_nodes=8, random_state=42), DecisionTreeClassifier(max_depth=9, max_leaf_nodes=9, random_state=42), DecisionTreeClassifier(max_depth=10, max_leaf_nodes=10, random_state=42), DecisionTreeClassifier(max_depth=11, max_leaf_nodes=11, random_state=42)]


## Creating an ensemble majority vote

In [6]:
# Creating the models
decision_tree_models = [DecisionTreeClassifier(random_state=i, max_leaf_nodes=i, max_depth=i) for i in range(2, 102)]

In [7]:
# Fitting the models
for model in decision_tree_models:
    model.fit(X_train, y_train)

In [8]:
# Evaluate the models individually
for model in decision_tree_models:
    y_pred = model.predict(X_test)
    # accuracy_score expects the true labels first, then the predictions
    accuracy = round(accuracy_score(y_test, y_pred), 3)
    print("Accuracy score: ", accuracy)

Accuracy score:  0.888
Accuracy score:  0.899
Accuracy score:  0.926
Accuracy score:  0.952
Accuracy score:  0.957
Accuracy score:  0.952
Accuracy score:  0.947
Accuracy score:  0.931
Accuracy score:  0.941
Accuracy score:  0.931
Accuracy score:  0.947
Accuracy score:  0.899
Accuracy score:  0.931
Accuracy score:  0.92
Accuracy score:  0.936
Accuracy score:  0.92
Accuracy score:  0.92
Accuracy score:  0.936
Accuracy score:  0.904
Accuracy score:  0.888
Accuracy score:  0.926
Accuracy score:  0.92
Accuracy score:  0.91
Accuracy score:  0.883
Accuracy score:  0.91
Accuracy score:  0.92
Accuracy score:  0.904
Accuracy score:  0.904
Accuracy score:  0.926
Accuracy score:  0.899
Accuracy score:  0.92
Accuracy score:  0.926
Accuracy score:  0.904
Accuracy score:  0.91
Accuracy score:  0.941
Accuracy score:  0.926
Accuracy score:  0.91
Accuracy score:  0.915
Accuracy score:  0.915
Accuracy score:  0.904
Accuracy score:  0.926
Accuracy score:  0.888
Accuracy score:  0.91
Accuracy score:  0.915

In [9]:
# Create a Voting Classifier
from sklearn.ensemble import VotingClassifier
voting = VotingClassifier(
    estimators=[(f"model_{i}", model) for i, model in enumerate(decision_tree_models)]
)

# Get the accuracy score
voting.fit(X_train, y_train)
voting_pred = voting.predict(X_test)
print("Accuracy of majority vote: ", round(accuracy_score(y_test, voting_pred), 3))

Accuracy of majority vote:  0.941
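By default, `VotingClassifier` uses hard voting, meaning each model casts one vote per sample and the most frequent label wins. The following is a rough NumPy sketch of that idea on made-up predictions, not scikit-learn's actual implementation; the function name `majority_vote` and the toy array are illustrative assumptions.

```python
import numpy as np


def majority_vote(predictions):
    """Return the most frequent label for every sample.

    predictions: array of shape (n_models, n_samples) holding
    non-negative integer class labels. Ties go to the smallest label,
    because argmax returns the first maximum.
    """
    predictions = np.asarray(predictions)
    n_classes = int(predictions.max()) + 1
    # Count the votes per class for each sample (one column per sample)
    votes = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    # votes has shape (n_classes, n_samples); pick the winning class per sample
    return votes.argmax(axis=0)


# Three hypothetical models voting on four samples
toy_predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])
combined = majority_vote(toy_predictions)
print(combined)
```

In the third sample, one model votes for class 1 and two vote for class 0, so the ensemble predicts class 0 even though one of its members disagrees.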


## Random Forests

In [10]:
# Import random forest classifier
from sklearn.ensemble import RandomForestClassifier

In [11]:
# Training a random forest
forest = RandomForestClassifier(n_estimators=5000, max_leaf_nodes=16, n_jobs=-1)
forest.fit(X_train, y_train)

RandomForestClassifier(max_leaf_nodes=16, n_estimators=5000, n_jobs=-1)

In [12]:
# Predicting with a random forest
y_pred_forest = forest.predict(X_test)
accuracy_forest = round(accuracy_score(y_test, y_pred_forest), 3)
print("The accuracy of the random forest is: ", accuracy_forest)

The accuracy of the random forest is:  0.957
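Random forests are built on bagging: each tree is trained on a bootstrap sample, that is, rows drawn from the training set with replacement. Below is a minimal NumPy sketch of drawing one such sample for a hypothetical toy training set of ten rows. The variable names are illustrative and not part of scikit-learn, and the sketch leaves out the random feature selection that a random forest also applies at each split.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10  # hypothetical size of a toy training set

# Draw row indices with replacement: some rows repeat, others are left out
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Rows that were never drawn are "out-of-bag" for this particular tree
oob_idx = np.setdiff1d(np.arange(n_rows), bootstrap_idx)

print("Bootstrap rows: ", np.sort(bootstrap_idx))
print("Out-of-bag rows:", oob_idx)
```

Because every tree sees a slightly different sample, the trees make different mistakes, which is what makes averaging their votes worthwhile.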
