# Ensemble Learning and Random Forest

- Aggregates the predictions of a group of predictors to reach a better result than any single one, following the wisdom-of-the-crowd concept

### Voting Classifiers
- Each classifier is trained on the same training set, usually with a different algorithm
- Hard voting is a majority-vote classifier: it predicts the class that receives the most votes from the individual classifiers
- The ensemble can achieve higher accuracy than its best member, even when every member is a weak learner, provided there are enough of them and they are sufficiently diverse
- By the law of large numbers, as the number of independent classifiers that are each slightly better than random grows, the probability that the majority vote is correct also grows
- Independence of the classifiers matters: the votes should not rely on one another, and each classifier should offer a unique 'view' so that they make different kinds of errors
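
The law-of-large-numbers argument above can be checked numerically. The sketch below assumes perfectly independent classifiers that are each right 51% of the time (an idealization that real ensembles never fully meet) and computes the binomial probability that a strict majority is correct. The function name `majority_correct_probability` is made up for this illustration.

```python
import math


def majority_correct_probability(n_classifiers: int, accuracy: float) -> float:
    """Probability that more than half of n independent classifiers are right."""
    needed = n_classifiers // 2 + 1  # votes required for a strict majority
    total = 0.0
    for k in range(needed, n_classifiers + 1):
        # Binomial probability of exactly k correct votes, computed in log
        # space so that large n does not overflow.
        log_pmf = (
            math.lgamma(n_classifiers + 1)
            - math.lgamma(k + 1)
            - math.lgamma(n_classifiers - k + 1)
            + k * math.log(accuracy)
            + (n_classifiers - k) * math.log(1 - accuracy)
        )
        total += math.exp(log_pmf)
    return total


# Odd ensemble sizes avoid ties.
for n in (1, 11, 101, 1001):
    print(n, round(majority_correct_probability(n, 0.51), 3))
```

The probability rises with the ensemble size only because the votes are assumed independent; classifiers trained on the same data tend to make correlated errors, so the real gain is smaller.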

In [99]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)

voting_clf.fit(X_train, y_train)

In [100]:
for name, clf in voting_clf.named_estimators_.items():
    print(f"{name} = {clf.score(X_test, y_test)}")

lr = 0.864
rf = 0.896
svc = 0.896


In [101]:
voting_clf.predict(X_test[:1])

array([1])

In [102]:
[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]

[array([1]), array([1]), array([0])]

In [103]:
voting_clf.score(X_test, y_test)

0.912

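- Soft voting is also possible: it predicts the class with the highest probability averaged over all classifiers (each classifier must be able to estimate class probabilities)
- Often achieves higher accuracy than hard voting because it gives more weight to highly confident votes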

In [104]:
voting_clf.voting = 'soft'
voting_clf.named_estimators['svc'].probability = True
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)

0.92
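
To see how the two voting modes can disagree, here is a small sketch with made-up class probabilities for a single instance (the numbers are purely illustrative, not taken from the classifiers above). Two classifiers lean slightly towards class 1 while one is very confident in class 0.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one instance.
# Rows are classifiers, columns are classes 0 and 1.
probas = np.array([
    [0.90, 0.10],  # very confident in class 0
    [0.40, 0.60],  # leans towards class 1
    [0.45, 0.55],  # leans towards class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_prediction = int(np.bincount(votes, minlength=probas.shape[1]).argmax())

# Soft voting: average the probabilities, then pick the most likely class.
mean_probas = probas.mean(axis=0)
soft_prediction = int(mean_probas.argmax())

print("hard:", hard_prediction, "soft:", soft_prediction)  # hard: 1 soft: 0
```

Hard voting follows the two lukewarm votes, while soft voting lets the single confident classifier tip the average.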

### Bagging and Pasting

- Getting a diverse set of classifiers is important to reduce the dependence that they might have on each other
- Bagging is sampling performed with replacement and pasting is sampling performed without replacement
- Only bagging allows an instance to be sampled several times for the same classifier
- Once the predictors are trained, an aggregation function combines their predictions: the statistical mode for classification (the average for regression)
- The method scales well with a lot of data because the predictors can be trained in parallel
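
The difference between the two sampling schemes can be sketched with NumPy alone. The sizes below are arbitrary toy values; in `BaggingClassifier` the same switch is made with the `bootstrap` parameter (`bootstrap=False` gives pasting).

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances = 10  # size of a toy training set
sample_size = 6   # instances drawn for each predictor

# Bagging: draw with replacement, so an index may repeat within one sample.
bagging_indices = rng.choice(n_instances, size=sample_size, replace=True)

# Pasting: draw without replacement, so every index in a sample is unique.
pasting_indices = rng.choice(n_instances, size=sample_size, replace=False)

print("bagging:", np.sort(bagging_indices))
print("pasting:", np.sort(pasting_indices))
```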

In [105]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    n_jobs=-1,
    random_state=42
)

bag_clf.fit(X_train, y_train)

- Bagging usually yields better models than pasting: the extra diversity in the subsets gives slightly higher bias but lower variance, so generalization tends to be better

### Out of Bag Evaluation

- With bootstrap sampling, each predictor sees only about 63% of the training instances on average; the remaining 37% are its out-of-bag (OOB) instances and can be used for evaluation instead of a separate validation set
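
The 63% figure comes from drawing m instances with replacement from a training set of size m: a given instance is missed by every draw with probability (1 - 1/m)^m, which tends to 1/e as m grows. A quick check (the helper name `in_bag_fraction` is made up for this note):

```python
import math


def in_bag_fraction(m: int) -> float:
    """Expected fraction of distinct instances in a bootstrap sample of size m."""
    return 1.0 - (1.0 - 1.0 / m) ** m


for m in (10, 100, 10_000):
    print(m, round(in_bag_fraction(m), 4))

# Limit for large m: 1 - 1/e
print("limit:", round(1.0 - math.exp(-1.0), 4))
```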

In [106]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    oob_score=True,
    n_jobs=-1,
    random_state=42
)

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_  # Acts as a validation score; still evaluate on the actual test set afterwards

0.896

In [107]:
from sklearn.metrics import accuracy_score

y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.912

In [108]:
bag_clf.oob_decision_function_[:3]  # OOB class probabilities for the first three training instances

array([[0.32352941, 0.67647059],
       [0.3375    , 0.6625    ],
       [1.        , 0.        ]])

### Random Patches and Random Subspaces

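- Each predictor is trained on a random subset of the input features
- Random patches samples both the training instances and the features; random subspaces keeps all training instances and samples only the features
- The extra randomness trades slightly more bias for lower variance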
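
A minimal sketch of both variants with `BaggingClassifier`, assuming a synthetic 10-feature dataset from `make_classification` (the moons data has only two features, so feature sampling would not show much there). The ensemble sizes and sampling ratios are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10 features (an assumption for this example).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

# Random subspaces: keep every training instance, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=5,
    random_state=42,
)
subspaces_clf.fit(X_demo, y_demo)

# Random patches: sample both the instances and the features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=5,
    random_state=42,
)
patches_clf.fit(X_demo, y_demo)

# Feature indices drawn for the first tree of the subspaces ensemble.
print(subspaces_clf.estimators_features_[0])
```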

### Random Forest

- An ensemble of decision trees
- Can be built with a BaggingClassifier wrapping a DecisionTreeClassifier, or with the ready-made RandomForestClassifier, which is more convenient and optimized for decision trees

In [109]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(
    n_estimators=500,
    max_leaf_nodes=16,
    n_jobs=-1,
    random_state=42
)

rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
y_pred_rf

array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])

- The following bagging classifier is roughly equivalent to the rnd_clf above

In [110]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500,
    n_jobs=-1,
    random_state=42
)

bag_clf.fit(X_train, y_train)

y_pred_bag = bag_clf.predict(X_test)
y_pred_bag

array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])

### Extra Trees

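- Uses random thresholds for each feature instead of searching for the best one, which introduces more randomness
- Extra-trees stands for extremely randomized trees
- Faster to train than a random forest; which of the two performs better is hard to tell in advance, so compare them with cross-validation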

In [111]:
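from sklearn.ensemble import ExtraTreesClassifier

# ExtraTreesClassifier (plural) is the ensemble; sklearn.tree.ExtraTreeClassifier is a single randomized tree
extra_clf = ExtraTreesClassifier(
    n_estimators=500,
    max_leaf_nodes=16,
    n_jobs=-1,
    random_state=42
)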

extra_clf.fit(X_train, y_train)

y_pred_extra = extra_clf.predict(X_test)

### Feature Importance

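- Measures how much the tree nodes that use a specific feature reduce impurity on average, across all trees in the forest
- It is a weighted average: each node's weight is equal to the number of training samples associated with it
- Scikit-Learn computes this automatically after training and exposes it as feature_importances_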

In [112]:
from sklearn.datasets import load_iris

iris=load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(f"Feature importance of {round(score, 2)}", name)

Feature importance of 0.11 sepal length (cm)
Feature importance of 0.02 sepal width (cm)
Feature importance of 0.44 petal length (cm)
Feature importance of 0.42 petal width (cm)
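
To make the weighted impurity decrease concrete, the sketch below recomputes the importances of a single decision tree by hand from its `tree_` arrays and compares them with `feature_importances_`. It uses one shallow tree on iris rather than the forest above; a forest averages this quantity over its trees.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_iris, y_iris)
t = tree_clf.tree_

manual_importances = np.zeros(X_iris.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, so no impurity reduction
        continue
    # Impurity decrease of the split, weighted by the samples reaching each node.
    decrease = (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )
    manual_importances[t.feature[node]] += decrease

# Normalize so that the importances sum to 1.
manual_importances /= manual_importances.sum()

print(np.round(manual_importances, 3))
print(np.round(tree_clf.feature_importances_, 3))
```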


### Boosting
- A way to train weak learners and combine them together to create a much stronger learner with boosted performance

### Adaboost
- Trains learners sequentially so that each one can correct its predecessor's mistakes
- After each round, increases the weights of the training instances that were misclassified so that the next model focuses on them
- Because training is sequential it cannot be parallelized, so it takes longer to train
- More accurate predictors get higher weights in the final vote
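
One round of the weight update can be sketched as follows. This follows the classic AdaBoost formulation (predictor weight alpha = learning_rate * log((1 - r) / r) for weighted error rate r); Scikit-Learn's `AdaBoostClassifier` implements a multiclass variant, so treat this as an illustration of the idea rather than the library's exact code. The helper name `adaboost_update` is made up, and it assumes 0 < r < 1.

```python
import numpy as np


def adaboost_update(weights, misclassified, learning_rate=1.0):
    """Return the predictor weight and the renormalized instance weights."""
    weights = np.asarray(weights, dtype=float)
    misclassified = np.asarray(misclassified, dtype=bool)

    # Weighted error rate of the current predictor.
    error_rate = weights[misclassified].sum() / weights.sum()

    # Predictor weight: large when the predictor is accurate.
    alpha = learning_rate * np.log((1.0 - error_rate) / error_rate)

    # Boost only the misclassified instances, then renormalize.
    new_weights = weights * np.where(misclassified, np.exp(alpha), 1.0)
    return alpha, new_weights / new_weights.sum()


# Five instances with equal weights; the third one is misclassified.
weights = np.full(5, 0.2)
misclassified = [False, False, True, False, False]
alpha, new_weights = adaboost_update(weights, misclassified)
print(alpha, new_weights)
```

With one of five equally weighted instances misclassified, the error rate is 0.2, so the misclassified instance ends up with half of the total weight for the next round.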

In [113]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=30,
    learning_rate=0.5,
    random_state=42
)

ada_clf.fit(X_train, y_train)

### Gradient Boosting

- Sequential method as well
- Each new model is fit to the residual errors of the previous one, so that the sum of all the models' predictions gets progressively closer to the targets
- The cells below show how to do it manually and then with GradientBoostingRegressor

In [114]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:,0] ** 2 + 0.05 * np.random.randn(100)


In [115]:
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

In [116]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)
tree_reg2.fit(X, y2)

In [117]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)
tree_reg3.fit(X, y3)

In [118]:
X_new = np.array([[-0.4], [0.], [0.5]])
sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

array([0.49484029, 0.04021166, 0.75026781])

In [119]:
# Automatically

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)

In [120]:
gbrt.fit(X, y)

- Can implement early stopping so that the optimal number of trees is selected, as shown below

In [121]:
gbrt_best = GradientBoostingRegressor(
    max_depth=2, learning_rate=0.05, n_estimators=500, n_iter_no_change=10, random_state=42
)

gbrt_best.fit(X, y)

In [122]:
gbrt_best.n_estimators_  # Training stopped early: 92 trees were kept out of the 500 allowed

92
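
An alternative to `n_iter_no_change` is to train a fixed number of trees and then pick the best ensemble size on a held-out validation set with `staged_predict`. The sketch below is self-contained and assumes its own noisy quadratic data and split, so its result is not comparable to the 92 above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Noisy quadratic data, similar in spirit to the dataset used above.
rng = np.random.default_rng(42)
X_quad = rng.random((200, 1)) - 0.5
y_quad = 3 * X_quad[:, 0] ** 2 + 0.05 * rng.standard_normal(200)
X_tr, X_val, y_tr, y_val = train_test_split(X_quad, y_quad, random_state=42)

gbrt_full = GradientBoostingRegressor(
    max_depth=2, n_estimators=120, learning_rate=0.1, random_state=42
)
gbrt_full.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 120 trees.
val_errors = [
    mean_squared_error(y_val, y_pred) for y_pred in gbrt_full.staged_predict(X_val)
]
best_n_estimators = int(np.argmin(val_errors)) + 1
print(best_n_estimators)
```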

### Histogram Based Gradient Boosting

- Great for larger datasets, since training is much faster
- Bins the input features and replaces them with integers
- Binning reduces precision but can act as a regularizer, which may help generalization
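
The binning step can be pictured with a few lines of NumPy. This is a simplified sketch using quantile edges and 8 bins; it is not Scikit-Learn's internal code, which by default uses up to 255 bins (`max_bins=255`). The helper name `bin_feature` is made up.

```python
import numpy as np


def bin_feature(values, n_bins=8):
    """Map a continuous feature to integer bin indices in [0, n_bins)."""
    # Interior quantile edges give bins with roughly equal numbers of samples.
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    edges = np.quantile(values, quantiles)
    binned = np.searchsorted(edges, values, side="right").astype(np.uint8)
    return binned, edges


rng = np.random.default_rng(42)
feature = rng.normal(size=1000)
binned, edges = bin_feature(feature)

print(binned[:10])          # small integers instead of floats
print(np.bincount(binned))  # roughly equal counts per bin
```

Once every feature is a small integer, the tree only has to consider a handful of candidate thresholds per feature instead of every distinct value.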

In [123]:
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import fetch_california_housing

cali = fetch_california_housing(as_frame=True)
housing = cali.data
housing_labels = cali.target

In [None]:

# fetch_california_housing contains only numeric features (there is no
# "ocean_proximity" column), so no OrdinalEncoder or categorical_features is needed
hgb_reg = HistGradientBoostingRegressor(random_state=42)

hgb_reg.fit(housing, housing_labels)

### Stacking

- Instead of a fixed rule such as voting, trains a final model (the blender) to aggregate the predictions of multiple models

In [125]:
from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=43),  # The final model, also called the blender
    cv=5  # Number of cross-validation folds
)

stacking_clf.fit(X_train, y_train)
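
What the `cv` argument buys can be sketched by hand: the blender is trained on out-of-fold predictions, so it never sees a prediction made by a model that was trained on that same instance. The example below is a simplified, self-contained version with two base models and a logistic regression blender (choices made for this illustration, not the estimators used above).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=500, noise=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

base_models = [
    LogisticRegression(random_state=42),
    DecisionTreeClassifier(max_depth=4, random_state=42),
]

# Out-of-fold probabilities: each row is predicted by a model that never saw it.
meta_train = np.column_stack([
    cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# The blender learns how to combine the base models' outputs.
blender = LogisticRegression(random_state=42).fit(meta_train, y_tr)

# For prediction, the base models are refit on the full training set.
meta_test = np.column_stack([
    model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for model in base_models
])
stacked_accuracy = blender.score(meta_test, y_te)
print(stacked_accuracy)
```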

# Exercises

1. Yes, five weak learners can be combined into a stronger learner (for example with a voting ensemble), as long as their errors are largely independent of each other.
2. A hard voting classifier takes the most frequent prediction as the final prediction, whereas a soft voting classifier averages the class probabilities from all the models and picks the class with the highest average probability.
3. Bagging scales well because the predictors can be trained in parallel across all CPU cores (or across servers). Pasting ensembles and random forests follow the same logic, but boosting ensembles cannot do this since each model depends on the previous one and they are trained sequentially.
4. Out-of-bag evaluation scores each predictor on the instances it was not trained on, so the OOB instances serve as a sort of validation set without having to hold out any data.
5. An extra-trees ensemble uses random thresholds for each feature, which introduces another element of randomness. It is therefore faster to train than a random forest, since it does not spend time searching for the optimal threshold values. Since it is still an ensemble of decision trees, the prediction time is about the same.
6. If an AdaBoost ensemble underfits the data, increase the number of estimators or reduce the regularization of the base estimator; slightly increasing the learning rate can also help.
7. If a gradient boosting ensemble overfits the data, decreasing the learning rate can help, as can early stopping to find the right number of estimators.
8. Voting Classifier

In [126]:
from sklearn.datasets import fetch_openml

X_mnist, y_mnist = fetch_openml('mnist_784', return_X_y=True, as_frame=False,
                                parser='auto')

train_size = 50000
validation_size = 10000
test_size = 10000

X_train = X_mnist[:train_size]
y_train = y_mnist[:train_size]
X_val = X_mnist[train_size:train_size+validation_size]
y_val = y_mnist[train_size:train_size+validation_size]
X_test = X_mnist[train_size+validation_size:]
y_test = y_mnist[train_size+validation_size:]

In [127]:
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rnd_clf.fit(X_train, y_train)

In [128]:
rnd_clf.score(X_val, y_val)

0.9718

In [129]:
from sklearn.ensemble import ExtraTreesClassifier

extra_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
extra_clf.fit(X_train, y_train)

In [130]:
extra_clf.score(X_val, y_val)

0.9757

In [131]:
svc_clf = SVC(max_iter=100, random_state=42)
svc_clf.fit(X_train, y_train)



In [132]:
svc_clf.score(X_val, y_val)

0.9397

In [133]:
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[
        ("rnd", rnd_clf),
        ("extra", extra_clf),
        ("svc", svc_clf)
    ]
)

voting_clf.fit(X_train, y_train)



In [134]:
voting_clf.score(X_val, y_val)

0.976

In [135]:
y_pred = voting_clf.predict(X_test)

In [136]:
accuracy_score(y_test, y_pred)

0.9725

9. Making a Stacking Classifier

In [137]:
y_pred_val1 = rnd_clf.predict(X_val)
y_pred_val2 = extra_clf.predict(X_val)
y_pred_val3 = svc_clf.predict(X_val)

In [138]:
X_train_new = [[y_pred_val1[i], y_pred_val2[i], y_pred_val3[i]] for i in range(len(X_val))]

In [140]:
blending_clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)

blending_clf.fit(X_train_new, y_val)
blending_clf.oob_score_  # With OOB scoring, the blender does not need a separate validation set

0.9739