# Boosting (Hypothesis Boosting)
 - Combine several weak learners into a strong learner.
 - Train predictors sequentially.
 
## AdaBoost / Adaptive Boosting

AdaBoost follows the general boosting recipe above:
 - Similar to human learning, the algorithm learns from past mistakes by focusing on the difficult cases it got wrong in earlier rounds.
 - It pays more attention to the training instances that the previous predictors underfit, i.e. got wrong.
 
 
 - Fit a sequence of weak learners (models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data.
 - The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction.
 - The data modifications at each boosting iteration consist of applying weights $w_1, w_2, \dots, w_N$ to each of the $N$ training samples.
 - Initially, those weights are all set to $w_i=1/N$, so that the first step simply trains a weak learner on the original data.
 - For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data.
 - At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly.
 - As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence.
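
The reweighting loop described above can be sketched in a few lines. This is a minimal illustration of the classic discrete AdaBoost update, not scikit-learn's implementation (which uses the SAMME family of algorithms). The synthetic data set, the helper names `adaboost_fit` and `adaboost_predict`, and the choice of depth-1 trees are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption for illustration); labels mapped to {-1, +1}
X_demo, y01 = make_classification(n_samples=200, n_features=5, random_state=0)
y_demo = np.where(y01 == 1, 1, -1)

def adaboost_fit(X, y, n_rounds=10):
    """Hypothetical helper: discrete AdaBoost with decision stumps."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # initial weights w_i = 1/N
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)         # weak learner on reweighted data
        miss = stump.predict(X) != y
        # Weighted error rate, clipped to keep the logarithm finite
        err = np.clip(w[miss].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # this learner's say in the vote
        # Increase weights of misclassified samples, decrease the rest
        w = w * np.exp(np.where(miss, alpha, -alpha))
        w = w / w.sum()                          # renormalise to sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas), w

def adaboost_predict(stumps, alphas, X):
    """Weighted majority vote over all weak learners."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(votes >= 0, 1, -1)

stumps, alphas, final_w = adaboost_fit(X_demo, y_demo)
train_acc = np.mean(adaboost_predict(stumps, alphas, X_demo) == y_demo)
print("training accuracy:", train_acc)
```

Samples that keep being misclassified end up with the largest entries in `final_w`, which is the "ever-increasing influence" mentioned in the last bullet.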

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

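df = sns.load_dataset('titanic')
df.dropna(inplace=True)
# Copy the selected columns so the assignment below does not write to a view of df
X = df[['pclass', 'sex', 'age']].copy()
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
# Encode 'sex' as 0/1; ravel() flattens the (n, 1) array returned by fit_transform
X['sex'] = lb.fit_transform(X['sex']).ravel()
y = df['survived']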

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [3]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    '''
    print the accuracy score, classification report, and confusion matrix of classifier
    '''
    if train:
        '''
        Training Performance
        '''
        print("Train Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, clf.predict(X_train))))
        print("Classification Report:\n {} \n".format(classification_report(y_train, clf.predict(X_train))))
        print("Confusion Matrix:  \n {} \n".format(confusion_matrix(y_train, clf.predict(X_train))))
        
        res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD \t\t {0:.4f}".format(np.std(res)))
        
    elif train==False:
        '''
        test performance
        '''
        print("Test Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, clf.predict(X_test))))
        print("Classification Report:\n {} \n".format(classification_report(y_test, clf.predict(X_test))))
        print("Confusion Matrix:  \n {} \n".format(confusion_matrix(y_test, clf.predict(X_test))))

In [4]:
from sklearn.ensemble import AdaBoostClassifier

In [5]:
ada_clf = AdaBoostClassifier()

In [6]:
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)

In [8]:
print_score(ada_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.8819

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.86      0.84        44
           1       0.93      0.89      0.91        83

   micro avg       0.88      0.88      0.88       127
   macro avg       0.87      0.88      0.87       127
weighted avg       0.88      0.88      0.88       127
 

Confusion Matrix:  
 [[38  6]
 [ 9 74]] 

Average Accuracy: 	 0.7174
Accuracy SD 		 0.1236


In [9]:
print_score(ada_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.7273

Classification Report:
               precision    recall  f1-score   support

           0       0.50      0.60      0.55        15
           1       0.84      0.78      0.81        40

   micro avg       0.73      0.73      0.73        55
   macro avg       0.67      0.69      0.68        55
weighted avg       0.75      0.73      0.73        55
 

Confusion Matrix:  
 [[ 9  6]
 [ 9 31]] 



#### AdaBoost with Random Forest

In [10]:
from sklearn.ensemble import RandomForestClassifier

In [16]:
ada_clf = AdaBoostClassifier(RandomForestClassifier())

In [17]:
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          learning_rate=1.0, n_estimators=50, random_state=None)

In [18]:
print_score(ada_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.9528

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93        44
           1       0.96      0.96      0.96        83

   micro avg       0.95      0.95      0.95       127
   macro avg       0.95      0.95      0.95       127
weighted avg       0.95      0.95      0.95       127
 

Confusion Matrix:  
 [[41  3]
 [ 3 80]] 

Average Accuracy: 	 0.7650
Accuracy SD 		 0.1295


In [19]:
print_score(ada_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.7273

Classification Report:
               precision    recall  f1-score   support

           0       0.50      0.60      0.55        15
           1       0.84      0.78      0.81        40

   micro avg       0.73      0.73      0.73        55
   macro avg       0.67      0.69      0.68        55
weighted avg       0.75      0.73      0.73        55
 

Confusion Matrix:  
 [[ 9  6]
 [ 9 31]] 



**Exercise:** Try grid search and a larger `n_estimators`.

In [34]:
from sklearn import pipeline
from sklearn.model_selection import GridSearchCV

In [35]:
rf_clf = RandomForestClassifier(random_state=42)

In [52]:
params_grid = {'max_depth': [3, 4, 5, None],
               'min_samples_split': range(2, 10),
               'min_samples_leaf': range(2, 10),
               'criterion': ['gini', 'entropy'],
               'n_estimators': [500]}

In [53]:
grid_search = GridSearchCV(rf_clf, params_grid,
                          n_jobs=-1, cv=5,
                          verbose=1, scoring='accuracy')

In [54]:
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 512 candidates, totalling 2560 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   22.2s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   60.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  5.2min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:  7.4min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 10.5min
[Parallel(n_jobs=-1)]: Done 2560 out of 2560 | elapsed: 11.0min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'max_depth': [3, 4, 5, None], 'min_samples_split': range(2, 10), 'min_samples_leaf': range(2, 10), 'criterion': ['gini', 'entropy'], 'n_estimators': [500]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [55]:
grid_search.best_estimator_.get_params()

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 500,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [62]:
rf_clf = RandomForestClassifier(max_depth=None, min_samples_leaf=2,
                                min_samples_split=2, criterion='gini',
                                n_estimators=500)

In [69]:
ada_clf = AdaBoostClassifier(rf_clf, n_estimators=100)

In [70]:
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          learning_rate=1.0, n_estimators=100, random_state=None)

In [71]:
print_score(ada_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.9449

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.91      0.92        44
           1       0.95      0.96      0.96        83

   micro avg       0.94      0.94      0.94       127
   macro avg       0.94      0.94      0.94       127
weighted avg       0.94      0.94      0.94       127
 

Confusion Matrix:  
 [[40  4]
 [ 3 80]] 

Average Accuracy: 	 0.7900
Accuracy SD 		 0.1213


In [72]:
print_score(ada_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.7273

Classification Report:
               precision    recall  f1-score   support

           0       0.50      0.60      0.55        15
           1       0.84      0.78      0.81        40

   micro avg       0.73      0.73      0.73        55
   macro avg       0.67      0.69      0.68        55
weighted avg       0.75      0.73      0.73        55
 

Confusion Matrix:  
 [[ 9  6]
 [ 9 31]] 



***

## Gradient Boosting / Gradient Boosting Machine (GBM)
Works for both regression and classification.

 - Predictors are added sequentially.
 - Each one corrects its predecessor.
 - Each new predictor is fit to the residual errors of the previous one.
 
 **Step 1.**
 $$Y=F(x_1)+\epsilon$$
 **Step 2.**
 $$\epsilon=G(x_2)+\epsilon_2$$
 Substituting Step 2 into Step 1, we get:
 $$Y=F(x_1)+G(x_2)+\epsilon_2$$
 **Step 3.**
 $$\epsilon_2=H(x_3)+\epsilon_3$$
 Now:
 $$Y=F(x_1)+G(x_2)+H(x_3)+\epsilon_3$$
 Finally, by adding weighting:
 $$Y=\alpha F(x_1)+\beta G(x_2)+\gamma H(x_3)+\epsilon_4$$
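
 The three steps above can be tried out directly by fitting each new model to the residuals of the previous ones. The sketch below is a minimal illustration, assuming a synthetic 1-D regression problem, shallow `DecisionTreeRegressor` models for $F$, $G$ and $H$, the same inputs for every stage, and no weighting ($\alpha=\beta=\gamma=1$). The names `X_demo`, `y_demo` and `boosted_predict` are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

# Step 1: F is fit to Y; eps1 is what F leaves unexplained
F = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)
eps1 = y_demo - F.predict(X_demo)

# Step 2: G is fit to the residual eps1
G = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, eps1)
eps2 = eps1 - G.predict(X_demo)

# Step 3: H is fit to the residual eps2
H = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, eps2)
eps3 = eps2 - H.predict(X_demo)

def boosted_predict(X):
    """Additive model Y ~ F(x) + G(x) + H(x)."""
    return F.predict(X) + G.predict(X) + H.predict(X)

# Mean squared residual after each stage
mse = [float(np.mean(e ** 2)) for e in (eps1, eps2, eps3)]
print(mse)
```

 Each stage can only reduce the training residual, so the values in `mse` do not increase from one stage to the next.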
 
 
 A Gradient Boosting Machine involves three elements:
  - **Loss function to be optimized:** The loss function depends on the type of problem being solved. For regression problems, mean squared error is used; for classification problems, logarithmic loss is used. In boosting, at each stage, the unexplained loss from prior iterations is optimized rather than starting from scratch.
  - **Weak learner to make predictions:** Decision trees are used as the weak learner in gradient boosting.
  - **Additive model to add weak learners to minimize the loss function:** Trees are added one at a time and existing trees in the model are not changed. A gradient descent procedure is used to minimize the loss when adding trees.
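
 One way to see why fitting residuals amounts to gradient descent is to take the squared-error loss for a single example, writing $F(x)$ for the current model's prediction (the factor $\frac{1}{2}$ is a convenience assumed here):
 $$L(y, F(x))=\frac{1}{2}\,(y-F(x))^2 \quad\Rightarrow\quad -\frac{\partial L}{\partial F(x)}=y-F(x)$$
 The negative gradient is exactly the residual, so a tree fit to the residuals is a step in the direction that reduces the loss. For other losses, the new tree is fit to the negative gradient (often called the pseudo-residual) instead. With $h_m$ denoting the tree added at stage $m$ and $\nu$ the learning rate (notation introduced here for illustration; $\nu$ plays the role of `learning_rate` in scikit-learn), the additive update reads:
 $$F_m(x)=F_{m-1}(x)+\nu\, h_m(x)$$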

In [74]:
from sklearn.ensemble import GradientBoostingClassifier

**Base Model**

In [77]:
gbc_clf = GradientBoostingClassifier()
gbc_clf.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)

In [78]:
print_score(gbc_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.9528

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93        44
           1       0.96      0.96      0.96        83

   micro avg       0.95      0.95      0.95       127
   macro avg       0.95      0.95      0.95       127
weighted avg       0.95      0.95      0.95       127
 

Confusion Matrix:  
 [[41  3]
 [ 3 80]] 

Average Accuracy: 	 0.7918
Accuracy SD 		 0.1144


In [79]:
print_score(gbc_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.7636

Classification Report:
               precision    recall  f1-score   support

           0       0.56      0.60      0.58        15
           1       0.85      0.82      0.84        40

   micro avg       0.76      0.76      0.76        55
   macro avg       0.70      0.71      0.71        55
weighted avg       0.77      0.76      0.77        55
 

Confusion Matrix:  
 [[ 9  6]
 [ 7 33]] 



## XGBoost (Extreme Gradient Boosting)

### Objective Function: Training Loss + Regularization
$$Obj(\Theta)=L(\Theta)+\Omega(\Theta)$$
 - $L$ is the training loss function, and
 - $\Omega$ is the regularization term.
 
**Training Loss**

The training loss measures how predictive our model is on training data.

Example 1, Mean Squared Error for Linear Regression:
$$L(\Theta)=\sum_i(y_i-\hat y_i)^2$$

Example 2, Logistic Loss for Logistic Regression:
$$L(\Theta)=\sum_i[y_i\ln(1+e^{-\hat y_i})+(1-y_i)\ln(1+e^{\hat y_i})]$$
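
Both losses are easy to evaluate by hand. The sketch below is a small illustration, assuming labels $y_i\in\{0,1\}$ and taking $\hat y_i$ in the logistic loss to be the raw (pre-sigmoid) score, so that the loss is $\sum_i[y_i\ln(1+e^{-\hat y_i})+(1-y_i)\ln(1+e^{\hat y_i})]$. The function names and the toy arrays are made up for the example. It also checks that this form agrees with the more familiar cross-entropy written in terms of probabilities.

```python
import numpy as np

def squared_error(y, y_hat):
    """Sum of squared errors (Example 1)."""
    return np.sum((y - y_hat) ** 2)

def logistic_loss(y, raw_score):
    """Logistic loss on raw scores (Example 2); y must be 0 or 1."""
    return np.sum(y * np.log1p(np.exp(-raw_score))
                  + (1 - y) * np.log1p(np.exp(raw_score)))

def cross_entropy(y, raw_score):
    """Same loss, written with probabilities p = sigmoid(raw_score)."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy values (assumptions for illustration)
y_true = np.array([1, 0, 1, 1])
scores = np.array([2.0, -1.0, 0.5, -0.3])

print("logistic loss:", logistic_loss(y_true, scores))
print("cross-entropy:", cross_entropy(y_true, scores))
print("squared error:", squared_error(np.array([1.0, 2.0]), np.array([0.0, 4.0])))
```

The two logistic forms agree because $-\ln\sigma(\hat y)=\ln(1+e^{-\hat y})$ and $-\ln(1-\sigma(\hat y))=\ln(1+e^{\hat y})$.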


**Regularization Term**

The regularization term controls the complexity of the model, which helps us to avoid overfitting.
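
As a concrete instance, the XGBoost documentation defines the complexity of a single tree $f$ as:
$$\Omega(f)=\gamma T+\frac{1}{2}\lambda\sum_{j=1}^{T}w_j^2$$
Here $T$ is the number of leaves in the tree and $w_j$ is the score assigned to leaf $j$. The coefficients $\gamma$ and $\lambda$ correspond to the `gamma` and `reg_lambda` parameters of `XGBClassifier`: a larger $\gamma$ penalizes trees with many leaves, and a larger $\lambda$ shrinks the leaf scores toward zero.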

**Using Titanic data again from start to finish**

In [80]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

In [82]:
df = sns.load_dataset('titanic')

In [83]:
df.dropna(inplace=True)

**Data Pre-Processing**

In [84]:
X = df[['pclass', 'sex', 'age']].copy()  # copy so the later column assignment does not write to a view of df

In [85]:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()

In [86]:
X['sex'] = lb.fit_transform(X['sex']).ravel()  # flatten the (n, 1) array to one value per row

In [87]:
y = df['survived']

***

In [88]:
from sklearn.model_selection import train_test_split

In [89]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [90]:
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [91]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    '''
    print the accuracy score, classification report, and confusion matrix of classifier
    '''
    if train:
        '''
        Training Performance
        '''
        print("Train Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, clf.predict(X_train))))
        print("Classification Report:\n {} \n".format(classification_report(y_train, clf.predict(X_train))))
        print("Confusion Matrix:  \n {} \n".format(confusion_matrix(y_train, clf.predict(X_train))))
        
        res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD \t\t {0:.4f}".format(np.std(res)))
        
    elif train==False:
        '''
        test performance
        '''
        print("Test Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, clf.predict(X_test))))
        print("Classification Report:\n {} \n".format(classification_report(y_test, clf.predict(X_test))))
        print("Confusion Matrix:  \n {} \n".format(confusion_matrix(y_test, clf.predict(X_test))))

#### XGBoost

If `!pip install xgboost` doesn't work:
 - Download the xgboost whl file from [here](http://www.lfd.uci.edu/~gohlke/pythonlibs/) (make sure it matches your Python version and system architecture, e.g. "xgboost-0.6-cp35-cp35m-win_amd64.whl" for Python 3.5 on a 64-bit machine).
 - Open a command prompt.
 - `cd` to your Downloads folder (or wherever you saved the whl file).
 - Run `pip install xgboost-0.6-cp35-cp35m-win_amd64.whl` (or whatever your whl file is named).

In [96]:
import xgboost as xgb

In [97]:
xgb_clf = xgb.XGBClassifier(max_depth=3, n_estimators=5000, learning_rate=0.2)

In [98]:
xgb_clf.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.2, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=5000,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [99]:
print_score(xgb_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.9449

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.94      0.93        47
           1       0.96      0.95      0.96        80

   micro avg       0.94      0.94      0.94       127
   macro avg       0.94      0.94      0.94       127
weighted avg       0.95      0.94      0.94       127
 

Confusion Matrix:  
 [[44  3]
 [ 4 76]] 

Average Accuracy: 	 0.7936
Accuracy SD 		 0.0991


In [100]:
print_score(xgb_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.7455

Classification Report:
               precision    recall  f1-score   support

           0       0.46      0.92      0.61        12
           1       0.97      0.70      0.81        43

   micro avg       0.75      0.75      0.75        55
   macro avg       0.71      0.81      0.71        55
weighted avg       0.86      0.75      0.77        55
 

Confusion Matrix:  
 [[11  1]
 [13 30]] 



**Try with grid search**

In [109]:
from sklearn import pipeline
from sklearn.model_selection import GridSearchCV

xgb_clf = xgb.XGBClassifier()

In [110]:
params_grid = {'max_depth': range(1,10),
               'learning_rate': [.001, .01, .1, .2, .3],
               'n_estimators': [10000],
               'n_jobs': [-1]}

In [111]:
grid_search = GridSearchCV(xgb_clf, params_grid,
                          n_jobs=-1, cv=5,
                          verbose=1, scoring='accuracy')

In [112]:
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 45 candidates, totalling 225 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   51.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 225 out of 225 | elapsed:  2.5min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'max_depth': range(1, 10), 'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3], 'n_estimators': [10000], 'n_jobs': [-1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [113]:
grid_search.best_estimator_.get_params()

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'learning_rate': 0.001,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 1,
 'missing': None,
 'n_estimators': 10000,
 'n_jobs': -1,
 'nthread': None,
 'objective': 'binary:logistic',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': True,
 'subsample': 1}

In [115]:
xgb_clf = xgb.XGBClassifier(max_depth=3, n_estimators=10000, learning_rate=0.001)
xgb_clf.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.001, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=10000,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [117]:
print_score(xgb_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.9134

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.85      0.88        47
           1       0.92      0.95      0.93        80

   micro avg       0.91      0.91      0.91       127
   macro avg       0.91      0.90      0.91       127
weighted avg       0.91      0.91      0.91       127
 

Confusion Matrix:  
 [[40  7]
 [ 4 76]] 

Average Accuracy: 	 0.7942
Accuracy SD 		 0.0849


In [118]:
print_score(xgb_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.8182

Classification Report:
               precision    recall  f1-score   support

           0       0.55      0.92      0.69        12
           1       0.97      0.79      0.87        43

   micro avg       0.82      0.82      0.82        55
   macro avg       0.76      0.85      0.78        55
weighted avg       0.88      0.82      0.83        55
 

Confusion Matrix:  
 [[11  1]
 [ 9 34]] 



The tuned model did a lot better on the test set: accuracy rose from 0.7455 to 0.8182.