# Ensemble Learning and Random Forests

## Voting Classifiers
These classifiers consist of multiple models. Each model makes its own prediction, and the final prediction is the class that receives the most votes (<i>hard voting classifier</i>).

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [2]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("rf", rnd_clf), ("svc", svm_clf)],
    voting="hard"
)

voting_clf.fit(X_train, y_train)

In [3]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.888
SVC 0.896
VotingClassifier 0.904
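
To make the majority-vote rule concrete, here is a minimal sketch of hard voting done by hand with NumPy. The `predictions` array and the `hard_vote` helper are made up for illustration and are not part of scikit-learn.

```python
import numpy as np

# Hypothetical predictions of three classifiers for five samples
# (rows = classifiers, columns = samples); the values are illustrative only.
predictions = np.array([
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0],
])


def hard_vote(preds):
    """Return the most frequent class label for every sample (column)."""
    n_classes = preds.max() + 1
    # Count the votes each class received per sample, then pick the winner.
    counts = np.array([(preds == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)


majority = hard_vote(predictions)
print(majority)  # [0 1 1 0 1]
```

Each sample gets the label chosen by at least two of the three classifiers, even when one classifier disagrees.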


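If every classifier implements the predict_proba() method, we can use <i>soft voting</i>. The ensemble then averages the class probabilities over all classifiers and predicts the class with the highest average, so confident predictions carry more weight. SVC does not estimate probabilities by default, which is why probability=True is set below.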

In [4]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)

voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("rf", rnd_clf), ("svc", svm_clf)],
    voting="soft"
)

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92
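
The difference between hard and soft voting can be sketched on a single sample. The probabilities below are invented for illustration; they stand in for the predict_proba() outputs of three classifiers.

```python
import numpy as np

# Hypothetical predict_proba() outputs of three classifiers for ONE sample
# (rows = classifiers, columns = classes 0 and 1); values are made up.
probas = np.array([
    [0.45, 0.55],  # weakly prefers class 1
    [0.40, 0.60],  # weakly prefers class 1
    [0.90, 0.10],  # strongly prefers class 0
])

# Hard voting: every classifier casts one vote for its most likely class.
hard_votes = probas.argmax(axis=1)
hard_winner = np.bincount(hard_votes).argmax()

# Soft voting: average the probabilities, then take the most likely class.
mean_proba = probas.mean(axis=0)
soft_winner = mean_proba.argmax()

print("hard voting:", hard_winner)  # 1
print("soft voting:", soft_winner)  # 0
```

Two lukewarm votes for class 1 win under hard voting, while the single highly confident vote for class 0 tips the averaged probabilities under soft voting.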


## Bagging and Pasting
A different approach is to train every predictor with the same algorithm but on a different random subset of the training set. If the sampling is done with replacement, the method is called bagging (short for bootstrap aggregating). If the sampling is done without replacement, it is called pasting. In other words: both methods allow the same training instance to be sampled for several predictors, but only bagging allows the same instance to be sampled several times for the same predictor.

In [5]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    n_jobs=-1
)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

accuracy_score(y_test, y_pred)

0.912
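
The only difference between bagging and pasting is how the training subset of each predictor is drawn. A minimal sketch with NumPy, using made-up sizes (`n_train`, `n_subset`) rather than the values from the cell above:

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_subset = 10, 6  # illustrative sizes, not taken from the notebook

# Bagging: indices are drawn WITH replacement, so the same training
# instance may appear several times in one predictor's subset.
bagging_idx = rng.choice(n_train, size=n_subset, replace=True)

# Pasting: indices are drawn WITHOUT replacement, so every instance
# appears at most once in one predictor's subset.
pasting_idx = rng.choice(n_train, size=n_subset, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
```

In BaggingClassifier the same switch is made with the bootstrap parameter: bootstrap=True gives bagging and bootstrap=False gives pasting.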

### OOB score
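With bagging, each predictor sees only part of the training set: when a bootstrap sample is as large as the training set, it contains about 63% of the distinct training instances on average. The remaining instances (about 37%), which a given predictor never sees, are called out-of-bag (OOB) instances. Because a predictor does not use them during training, they can be used to evaluate it without a separate validation set. The whole ensemble can be evaluated the same way, by scoring each training instance using only the predictors for which it was out-of-bag. Setting oob_score=True makes this estimate available as oob_score_.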
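
The 63% / 37% split can be checked with a quick simulation. This is a sketch under the assumption that a bootstrap sample has the same size m as the training set (the cells in this notebook use max_samples=100, so their OOB share is larger); `m` and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000  # assumed training set size, chosen only for the simulation

# Draw one bootstrap sample: m indices with replacement.
sample = rng.integers(0, m, size=m)

# Fraction of distinct training instances that made it into the sample.
seen_fraction = np.unique(sample).size / m
oob_fraction = 1 - seen_fraction

# Theoretical value: 1 - (1 - 1/m)^m, which approaches 1 - 1/e for large m.
expected = 1 - (1 - 1 / m) ** m

print(f"seen: {seen_fraction:.3f}, out-of-bag: {oob_fraction:.3f}")
print(f"expected seen fraction: {expected:.3f}")
```

The simulated fraction of seen instances lands close to 1 - 1/e, leaving roughly 37% of the instances out-of-bag for that predictor.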

In [6]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    n_jobs=-1,
    oob_score=True
)

bag_clf.fit(X_train, y_train)

bag_clf.oob_score_


0.9226666666666666

In [7]:
y_pred = bag_clf.predict(X_test)

accuracy_score(y_test, y_pred)

0.912

In [8]:
bag_clf.oob_decision_function_

array([[0.34351145, 0.65648855],
       [0.43243243, 0.56756757],
       [0.99746193, 0.00253807],
       [0.01044386, 0.98955614],
       [0.0230179 , 0.9769821 ],
       [0.09408602, 0.90591398],
       [0.35433071, 0.64566929],
       [0.06185567, 0.93814433],
       [0.95478723, 0.04521277],
       [0.81481481, 0.18518519],
       [0.56410256, 0.43589744],
       [0.05319149, 0.94680851],
       [0.74666667, 0.25333333],
       [0.87598945, 0.12401055],
       [0.91644909, 0.08355091],
       [0.08521303, 0.91478697],
       [0.02072539, 0.97927461],
       [0.92875318, 0.07124682],
       [0.6167979 , 0.3832021 ],
       [0.9673913 , 0.0326087 ],
       [0.04255319, 0.95744681],
       [0.22727273, 0.77272727],
       [0.86885246, 0.13114754],
       [0.99726027, 0.00273973],
       [0.94897959, 0.05102041],
       [0.00512821, 0.99487179],
       [0.96062992, 0.03937008],
       [1.        , 0.        ],
       [0.02088773, 0.97911227],
       [0.73643411, 0.26356589],
       [0.

## Random Patches and Random Subspaces
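Sampling both training instances and features is called the Random Patches method.

Sampling only features while keeping all training instances is called the Random Subspaces method. With BaggingClassifier this is done by setting bootstrap=False and max_samples=1.0, together with bootstrap_features=True and/or max_features lower than 1.0.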
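
Both methods can be sketched with BaggingClassifier. The dataset below is a synthetic one with 10 features (the moons data has only two, which leaves little to sample), and the names `X_demo`, `y_demo`, `patches_clf` and `subspaces_clf` as well as the chosen fractions are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with enough features to make feature sampling meaningful.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

# Random Patches: sample both training instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=42,
)

# Random Subspaces: keep all training instances, sample only features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=True,
    random_state=42,
)

patches_clf.fit(X_demo, y_demo)
subspaces_clf.fit(X_demo, y_demo)

# Each predictor is trained on 5 of the 10 features (indices may repeat,
# because bootstrap_features=True samples them with replacement).
print(subspaces_clf.estimators_features_[0])
```

The estimators_features_ attribute lists which feature indices each predictor received, which is a convenient way to confirm that feature sampling took place.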

## Random Forests

In [9]:
from sklearn.ensemble import RandomForestRegressor

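# RandomForestRegressor treats the 0/1 class labels as numeric targets,
# so predict() returns values between 0 and 1 rather than class labels.
rnd_clf = RandomForestRegressor(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)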
rnd_clf.fit(X_train, y_train)

rnd_clf.predict(X_test)

array([0.53237933, 0.23699324, 0.21249745, 0.98607904, 0.98180838,
       0.88764525, 0.01156969, 0.03342178, 0.05328147, 0.06843063,
       0.98607904, 0.01156969, 0.92726734, 0.86190559, 0.98503549,
       0.01156969, 0.01156969, 0.95253994, 0.85542725, 0.08510785,
       0.01156969, 0.99713512, 0.55335608, 0.00171878, 0.10089583,
       0.1554341 , 0.95638781, 0.01156969, 0.93080494, 0.0115991 ,
       0.94215743, 0.98607904, 0.23812511, 0.01183423, 0.98572691,
       0.07914256, 0.02175827, 0.98572691, 0.98578486, 0.96884642,
       0.30665759, 0.93073188, 0.29628786, 0.23807645, 0.0115991 ,
       0.1373378 , 0.73928203, 0.55085862, 0.98577135, 0.55347355,
       0.93105818, 0.98607904, 0.01156969, 0.0115991 , 0.82938221,
       0.01183423, 0.90346889, 0.98584456, 0.0115991 , 0.97483181,
       0.01156969, 0.83398729, 0.98121195, 0.01156969, 0.97704034,
       0.01183423, 0.07406762, 0.16383095, 0.01156969, 0.99913512,
       0.42512969, 0.01907925, 0.84472355, 0.63447917, 0.01156

### Extra Trees
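In Extremely Randomized Trees (Extra Trees), the split at each node uses random thresholds for the candidate features instead of searching for the best possible threshold. This makes the trees more random and faster to train than regular Decision Trees.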
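
A minimal sketch of an Extra Trees ensemble, using scikit-learn's ExtraTreesClassifier on the same moons data as the first cell. The hyperparameters (`n_estimators=100`, `max_leaf_nodes=16`) and the name `ext_clf` are choices made for this example, and no particular accuracy is implied.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same dataset and split as in the first cell of this notebook.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# ExtraTreesClassifier has the same interface as RandomForestClassifier;
# only the way the split thresholds are chosen differs.
ext_clf = ExtraTreesClassifier(
    n_estimators=100, max_leaf_nodes=16, random_state=42
)
ext_clf.fit(X_train, y_train)

ext_accuracy = accuracy_score(y_test, ext_clf.predict(X_test))
print(ext_accuracy)
```

Because the interface matches RandomForestClassifier, the two can be swapped and compared with cross-validation to see which works better on a given dataset.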

### Feature Importances

In [11]:
from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])

for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.09238128154723367
sepal width (cm) 0.022376154674010943
petal length (cm) 0.44732226253629553
petal width (cm) 0.4379203012424599
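
The importance scores are normalized so that they sum to 1, which makes them easy to rank. A small sketch that sorts them from most to least important; the names `forest`, `importances` and `order`, the smaller forest size and the fixed seed are assumptions for this example, so the exact scores will differ from the output above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris["data"], iris["target"])

importances = forest.feature_importances_
# Feature indices sorted from most to least important.
order = np.argsort(importances)[::-1]

for idx in order:
    print(f"{iris['feature_names'][idx]}: {importances[idx]:.3f}")

# The scores are normalized, so they add up to 1.
total = importances.sum()
print("sum of importances:", round(total, 6))
```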
