# 7. Ensemble Learning and Random Forests

### 7-0. Ensemble
- ensemble: aggregating the predictions of a group of predictors  
**Ensemble methods work best when the predictors are as independent from one another as possible.**  
***-> Training with very different algorithms is one way to get there***

### 7-1. Voting classifiers - Using different algs
##### 1.hard voting classifiers -> possible because of *the law of large numbers*  
  : majority-vote classifier  
  : aggregate the predictions of each classifier and predict the class that gets the most votes  
    **the law-of-large-numbers argument holds only if all classifiers are perfectly independent and make uncorrelated errors, which is not the case when they are trained on the same data**
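
The law-of-large-numbers argument can be checked with a small simulation. This is only a sketch under idealized assumptions: the numbers `n_classifiers`, `n_trials` and `p_correct` are made up for illustration, and the classifiers are perfectly independent by construction, which real classifiers trained on the same data are not.

```python
import numpy as np

rng = np.random.default_rng(42)

n_classifiers = 1000  # hypothetical number of weak classifiers
n_trials = 2000       # number of simulated predictions
p_correct = 0.51      # each classifier alone is right 51% of the time

# correct[i, j] is True when classifier j is right on trial i.
# Every draw is independent, so the errors are uncorrelated by construction.
correct = rng.random((n_trials, n_classifiers)) < p_correct

# A hard vote is right when more than half of the classifiers are right.
majority_correct = correct.sum(axis=1) > n_classifiers / 2

single_accuracy = correct[:, 0].mean()
ensemble_accuracy = majority_correct.mean()
print(single_accuracy, ensemble_accuracy)
```

The majority vote should come out noticeably more accurate than any single 51% classifier. With correlated errors the gain shrinks.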

##### 2.soft voting classifiers  
  : predict the class with the highest class probability, averaged over all the individual classifiers (requires every classifier to have a ***predict_proba()*** method)


In [9]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=1000, shuffle=True, random_state=42)
X_train, y_train, X_test, y_test = X[:850], y[:850], X[850:], y[850:]

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
                             voting='hard') #hard voting

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9066666666666666
RandomForestClassifier 1.0
SVC 1.0
VotingClassifier 1.0
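
Soft voting uses the same `VotingClassifier` with `voting="soft"`. A minimal sketch, assuming a noisy moons dataset (the `noise=0.3`, `train_test_split` and `n_estimators=50` choices are illustrative, not from these notes) and `SVC(probability=True)` so that every member has `predict_proba()`:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative data: noise is added so the classes overlap a little
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

soft_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        # SVC only exposes predict_proba() when probability=True
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",  # average class probabilities instead of counting votes
)
soft_clf.fit(X_tr, y_tr)

proba = soft_clf.predict_proba(X_te)         # averaged class probabilities
soft_accuracy = soft_clf.score(X_te, y_te)   # accuracy on the held-out part
print(soft_accuracy)
```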


### 7-2. Bagging and Pasting - Using the same alg, but training each predictor on a different random subset of the training data
- scales well (can be trained in parallel)


1. Bagging (bootstrap aggregating): sampling with replacement
2. Pasting: sampling without replacement

`Bias`: bagging > pasting (slightly, because bootstrapping adds more diversity to the subsets that each predictor is trained on)  
`Variance`: bagging < pasting (the predictors end up being less correlated)


- Feature Sampling: using parameters *max_features, bootstrap_features*
    - Random Patches method: sampling both training instances & features
    - Random Subspaces method: keeping all training instances (bootstrap=False, max_samples=1.0) but sampling features (bootstrap_features=True and/or max_features < 1.0)
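
The two feature-sampling setups can be sketched with `BaggingClassifier`. The dataset from `make_classification` and the values `0.5` and `n_estimators=50` are assumptions picked for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset with 10 features
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=42)

# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.5,
    bootstrap_features=True, max_features=0.5,
    random_state=42)

# Random Subspaces: keep every training instance, sample only features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42)

patches_clf.fit(X, y)
subspaces_clf.fit(X, y)

# estimators_features_ / estimators_samples_ record what each predictor saw
print(len(patches_clf.estimators_features_[0]),
      len(set(subspaces_clf.estimators_samples_[0])))
```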



In [10]:
from sklearn.ensemble import BaggingClassifier  # uses soft voting automatically if the base classifier has a predict_proba() method
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), 
                           n_estimators=500,
                           max_samples=100,
                           bootstrap=True, #False if pasting wanted
                           n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

3. OOB (Out-Of-Bag) Evaluation  
  - with bagging, a predictor never sees its own oob (not sampled) instances during training, so it can be evaluated on them (no need for a separate validation set or cross-validation)
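
How many instances end up out-of-bag can be estimated directly. A small sketch, assuming a bootstrap sample as large as the training set (`m` is an arbitrary illustrative size): the oob fraction should be close to `(1 - 1/m)**m`, which approaches `exp(-1)`, i.e. roughly 37%.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000  # hypothetical training set size

# Bootstrap: draw m indices with replacement
sample_idx = rng.integers(0, m, size=m)

# An instance is out-of-bag if it was never drawn
oob_mask = np.ones(m, dtype=bool)
oob_mask[sample_idx] = False

oob_fraction = oob_mask.mean()
theoretical = (1 - 1 / m) ** m  # chance that one instance is never picked
print(oob_fraction, theoretical)
```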

In [11]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                           n_estimators=500,
                           bootstrap=True,
                           n_jobs=-1,
                           oob_score=True) 

bag_clf.fit(X_train, y_train)

#validation score
print(bag_clf.oob_score_ ) 

#test score
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.9952941176470588
0.9933333333333333


In [12]:
bag_clf.oob_decision_function_

array([[0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       ...,
       [1.        , 0.        ],
       [0.35602094, 0.64397906],
       [1.        , 0.        ]])

### 7-3. Random Forest: ensemble of Decision Trees
- easy to measure the relative importance of each feature

In [13]:
# random forest using its dedicated class
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=10, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)


# roughly equivalent code using Bagging
# bag_clf = BaggingClassifier(DecisionTreeClassifier(splitter="random", max_leaf_nodes=10),
#                             n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

In [14]:
# feature importance map

from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.09372069734401645
sepal width (cm) 0.0249753496708033
petal length (cm) 0.44357764674501343
petal width (cm) 0.4377263062401666


### 7-4. AdaBoost - cannot be parallelized :(
- Boosting (hypothesis boosting): combining several weak learners into a strong learner, training them sequentially
- AdaBoost: each new predictor corrects its predecessor by paying a bit more attention to the training instances that the predecessor underfitted
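
One round of this "pay more attention" step can be sketched by hand. This follows the common textbook weight update, not necessarily what `AdaBoostClassifier` does internally. The noisy moons data and all variable names are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Illustrative data (noise added so a single stump makes mistakes)
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
m = len(X)
weights = np.full(m, 1 / m)  # every instance starts with the same weight
learning_rate = 0.5

# First weak learner: a decision stump
stump1 = DecisionTreeClassifier(max_depth=1, random_state=42)
stump1.fit(X, y, sample_weight=weights)
missed = stump1.predict(X) != y

# Weighted error rate and the resulting predictor weight
error_rate = weights[missed].sum() / weights.sum()
alpha = learning_rate * np.log((1 - error_rate) / error_rate)

# Boost the weights of the misclassified instances, then normalize
new_weights = weights.copy()
new_weights[missed] *= np.exp(alpha)
new_weights /= new_weights.sum()

# The next stump is trained with the updated weights
stump2 = DecisionTreeClassifier(max_depth=1, random_state=42)
stump2.fit(X, y, sample_weight=new_weights)
print(error_rate, alpha)
```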

In [15]:
from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                            n_estimators=200,
                            algorithm="SAMME.R",  # newer scikit-learn releases deprecate SAMME.R; omit this argument there
                            learning_rate=0.5)
ada_clf.fit(X_train, y_train)


### 7-5. Gradient Boosting - sequentially add predictors to an ensemble.
- similar to AdaBoost, but instead of tweaking instance weights it fits each new predictor to the residual errors made by the previous predictor

In [24]:
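from sklearn.tree import DecisionTreeRegressor

# first tree fits the targets directly
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X_train, y_train)

# second tree fits the residual errors of the first tree
y2 = y_train - tree_reg1.predict(X_train)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X_train, y2)

# third tree fits the residual errors of the second tree
y3 = y2 - tree_reg2.predict(X_train)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X_train, y3)

# the ensemble prediction is the sum of the predictions of all trees
y_pred = sum(tree.predict(X_test) for tree in (tree_reg1, tree_reg2, tree_reg3))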
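
scikit-learn's `GradientBoostingRegressor` automates the same residual-fitting loop. A minimal sketch, assuming a noisy quadratic toy dataset (`X_demo`, `y_demo` are made up for illustration) and the same three shallow trees. `staged_predict` lets you watch the training error as each tree is added:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Illustrative regression data: noisy quadratic
rng = np.random.default_rng(42)
X_demo = rng.random((100, 1)) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

# Three depth-2 trees, each fitted to the residuals of the ensemble so far
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                 learning_rate=1.0, random_state=42)
gbrt.fit(X_demo, y_demo)

# Training MSE after 1, 2 and 3 trees
stage_mse = [mean_squared_error(y_demo, y_stage)
             for y_stage in gbrt.staged_predict(X_demo)]
print(stage_mse)
```

The training error should not go up from one stage to the next, since each tree only removes part of the remaining residual.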