# Chapter 07 - Ensemble Learning 

## Voting Classifiers
Several algorithms (ideally ones that make different kinds of errors) predict on the same instance. The output of the Voting Classifier is the class that receives the most votes.
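To make the voting idea concrete, here is a minimal sketch of hard (majority) voting in plain Python. The helper `hard_vote` and the prediction lists are hypothetical and only for illustration; this is not how `VotingClassifier` is implemented internally, and its tie-breaking may differ.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the most common label among the individual predictions."""
    # most_common(1) returns [(label, count)] for the top label; in this
    # sketch, ties go to the label that was seen first.
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions of three classifiers for four instances.
per_classifier = [
    [1, 0, 1, 1],  # classifier A
    [1, 1, 0, 1],  # classifier B
    [0, 0, 1, 1],  # classifier C
]

# zip(*...) regroups the predictions per instance instead of per classifier.
ensemble = [hard_vote(votes) for votes in zip(*per_classifier)]
print(ensemble)  # [1, 0, 1, 1]
```

Each classifier is wrong on a different instance, so the majority still recovers `[1, 0, 1, 1]`. That is why classifiers with different error patterns are preferred.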

In [10]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=1000, noise=0.2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

In [12]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.88
RandomForestClassifier 0.955
SVC 0.965
VotingClassifier 0.96


#### Well... the gain is not obvious here (in this run SVC alone scores slightly higher). In the book, though, the VotingClassifier performs better than each of the individual classifiers.

## Bagging Classifier - _Bootstrap Aggregating Classifiers_
Basically, these ensembles train the same algorithm on different random subsets of the training set: bagging = sampling with replacement; pasting = sampling without replacement.
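The difference between the two sampling schemes can be sketched with the standard library alone. The tiny index list, the subset size, and the seed below are arbitrary assumptions for illustration. In `BaggingClassifier` the choice is controlled by `bootstrap=True` (bagging) or `bootstrap=False` (pasting).

```python
import random

rng = random.Random(42)  # fixed seed, chosen arbitrarily for reproducibility
training_indices = list(range(10))  # stand-in for a 10-instance training set
subset_size = 6

# Bagging: sample WITH replacement, so an index may appear more than once.
bagging_sample = rng.choices(training_indices, k=subset_size)

# Pasting: sample WITHOUT replacement, so every index is distinct.
pasting_sample = rng.sample(training_indices, k=subset_size)

print("bagging:", sorted(bagging_sample))
print("pasting:", sorted(pasting_sample))
```

Both samples have the same size, but only the pasting sample is guaranteed to contain no repeated instances.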

In [13]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [14]:
accuracy_score(y_test, y_pred)

0.96

Ok... 96%... not bad

In [15]:
# Out-of-bag (oob) evaluation
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1,
    oob_score=True  # Enable out-of-bag evaluation
)

In [16]:
bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=100,
                  n_estimators=500, n_jobs=-1, oob_score=True)

In [17]:
bag_clf.oob_score_

0.95625
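As a rough illustration of where the out-of-bag instances come from, the sketch below draws one bootstrap sample and lists the training indices that were never picked. Those are the instances a given predictor can be evaluated on without a separate validation set. The set size and seed are hypothetical, and the sample here is as large as the training set, unlike the `max_samples=100` used above.

```python
import random

rng = random.Random(0)  # arbitrary seed for reproducibility
n_instances = 20  # hypothetical tiny training set
all_indices = set(range(n_instances))

# One bootstrap sample: draw n_instances indices with replacement.
in_bag = set(rng.choices(range(n_instances), k=n_instances))

# Whatever was never drawn is "out of bag" for this predictor.
out_of_bag = all_indices - in_bag

print(f"in bag: {len(in_bag)}, out of bag: {len(out_of_bag)}")
```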

## Random Forests

In [18]:
from sklearn.ensemble import RandomForestClassifier

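A Random Forest is roughly equivalent to a BaggingClassifier of DecisionTreeClassifiers, with the extra randomness that each split only considers a random subset of the features.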
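A minimal sketch of that near-equivalence, assuming the same moons dataset as earlier in the chapter. The `random_state` values, `n_estimators=100`, and the variable names are choices made here for illustration. The two ensembles are built differently, so their scores should be similar but need not be identical.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Bagging of trees that each consider a random feature subset per split.
# The base tree is passed positionally because its keyword name differs
# between scikit-learn versions.
bag_of_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=100, bootstrap=True, random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42)

bag_score = bag_of_trees.fit(X_train, y_train).score(X_test, y_test)
forest_score = forest.fit(X_train, y_train).score(X_test, y_test)
print("bagging of trees:", bag_score)
print("random forest:   ", forest_score)
```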

In [19]:
# Also, they show feature importance
from sklearn.datasets import load_iris
iris = load_iris()

rf_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rf_clf.fit(iris["data"], iris["target"])

for name, score in zip(iris["feature_names"], rf_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.09566920207017698
sepal width (cm) 0.024677179297487993
petal length (cm) 0.42546989227725096
petal width (cm) 0.454183726355084


OK. It seems that petal width is the most important feature in this dataset, closely followed by petal length.