# Ensemble learning and random forest
## Ensemble learning - voting classifiers
* Hard voting: each classifier predicts a class, and the class with the most votes becomes the ensemble's prediction.
* Soft voting: predict the class with the highest class probability, averaged over all the individual classifiers.  
This is only possible if every classifier can estimate class probabilities (i.e. has a predict_proba method).


In [32]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_moons
X, y = make_moons(n_samples = 200, noise = 0.15)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting="hard")
voting_clf.fit(X_train, y_train)

from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.95
RandomForestClassifier 0.975
SVC 0.975
VotingClassifier 0.975
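
For comparison, a minimal soft-voting sketch on the same kind of data. The random_state values are assumptions added for reproducibility, and SVC is created with probability=True because it does not estimate class probabilities by default.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# random_state=0 is an assumption; the cell above draws a fresh random dataset each run
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),  # probability=True enables predict_proba
    ],
    voting="soft")
soft_voting_clf.fit(X_train, y_train)

# Averaged class probabilities, one row per test instance
proba = soft_voting_clf.predict_proba(X_test)
soft_acc = accuracy_score(y_test, soft_voting_clf.predict(X_test))
print("soft voting accuracy:", soft_acc)
```

Soft voting gives more weight to confident votes, so it can differ from hard voting when the classifiers disagree.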


## Bagging and Pasting
* Instead of training many different algorithms on the full training set, train the same algorithm on many different random subsets of the training set.
* Bagging: the subsets are drawn randomly with replacement (= bootstrapping).
* Pasting: the subsets are drawn randomly without replacement.

In [33]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=len(X_test),  # each tree is trained on len(X_test) = 40 randomly drawn training instances
    bootstrap=True,           # sample with replacement (bagging)
    n_jobs=-1)                # use all available CPU cores
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.975
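
The cell above uses bagging. A pasting variant only needs bootstrap=False. The sketch below is one possible setup: n_estimators=100, max_samples=0.5 and the random_state values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

paste_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=0.5,   # each tree sees half of the training set
    bootstrap=False,   # draw without replacement (pasting)
    random_state=0)
paste_clf.fit(X_train, y_train)
paste_acc = accuracy_score(y_test, paste_clf.predict(X_test))
print("pasting accuracy:", paste_acc)

# estimators_samples_ lists the training indices drawn for each tree;
# with pasting, no index appears twice within one subset
first_subset = paste_clf.estimators_samples_[0]
print(len(first_subset), len(set(first_subset)))
```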

## Out-of-bag evaluation
* With bagging, some instances may be sampled several times while others may not be sampled at all.
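* For example, when each predictor draws as many instances as the training set contains (with replacement), only about 63% of the training instances are sampled on average. The remaining 37% or so are called 'out-of-bag' (oob) instances, and they differ from predictor to predictor.
* Since a predictor never sees its oob instances during training, it can be evaluated on these instances,  
without the need for a separate validation set.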
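
The 63% figure can be checked numerically. Drawing m instances with replacement from a set of size m leaves each instance out with probability (1 - 1/m)^m, which approaches 1/e ≈ 0.37 as m grows. A small simulation, with m = 10,000 and the seed chosen arbitrarily:

```python
import numpy as np

m = 10_000                            # training set size (arbitrary choice)
rng = np.random.default_rng(0)        # fixed seed for reproducibility
sample = rng.integers(0, m, size=m)   # bootstrap: m draws with replacement

in_bag_fraction = np.unique(sample).size / m   # share of instances drawn at least once
expected = 1 - (1 - 1 / m) ** m                # analytical value, close to 1 - 1/e
print(in_bag_fraction, expected)
```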

In [34]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators = 500,
    bootstrap=True, n_jobs = -1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.975

In [35]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred) # The accuracy from the oob_score is very close to the accuracy on the test set

0.975

## Random patches and random subspace
* max_features and bootstrap_features control feature sampling, just as max_samples and bootstrap control instance sampling. Sampling both training instances and features is called the  
'random patches method'.
* Keeping all training instances (bootstrap=False, max_samples=1.0) but sampling features  
(bootstrap_features=True and/or max_features < 1.0) is called the 'random subspace method'.
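
A minimal sketch of both settings with BaggingClassifier. The synthetic 10-feature dataset and the sampling ratios are assumptions for illustration (the moons data has only two features, so feature sampling would not be meaningful there). Note that max_samples=1.0 is a float meaning 100% of the instances, whereas the integer 1 would mean a single instance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset with enough features to sample from
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)

# Random patches: sample both instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5, random_state=0)

# Random subspaces: keep every instance, sample only features
subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5, random_state=0)

patches_clf.fit(X, y)
subspace_clf.fit(X, y)

# estimators_features_ lists the feature indices used by each tree
print(patches_clf.estimators_features_[0])
print(len(set(subspace_clf.estimators_samples_[0])))  # number of distinct instances seen by the first tree
```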


# Random forest
* Ensemble of decision trees, generally trained via the bagging method.
* They make it easy to measure the relative importance of each feature.

In [36]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

In [37]:
# Feature importance
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.09689703198732647
sepal width (cm) 0.02463023253024131
petal length (cm) 0.4461566110942537
petal width (cm) 0.4323161243881785
