### Boosting

This notebook applies the following boosting techniques. The best parameters for each model are chosen by hyperparameter tuning with `GridSearchCV` (or `RandomizedSearchCV` for XGBoost).

1. Adaptive Boosting
2. Gradient Boosting
3. XGBoost
4. LightGBM
5. CatBoost
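
As a rough illustration of what an exhaustive grid search does, the sketch below enumerates every combination of a small parameter grid and keeps the one with the lowest score. The `grid_search` helper, the `toy_score` function, and the grid values are made up for illustration; in the notebook the score comes from cross-validating a real regressor.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the one with the lowest score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in for a cross-validated error (lower is better)
def toy_score(learning_rate, max_depth):
    return (learning_rate - 0.1) ** 2 + (max_depth - 3) ** 2


toy_grid = {"learning_rate": [0.05, 0.1, 0.2], "max_depth": [2, 3, 5]}
best_params, best_score = grid_search(toy_grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the list lengths, which is why the larger search spaces later in the notebook are expensive to search exhaustively.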

In [2]:
!pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/15/90/b2b8c7f2ed46071741cefc8f522104abc81068b0231ce7171c78059b6682/catboost-0.21-cp37-none-win_amd64.whl (63.4MB)
Collecting graphviz (from catboost)
  Downloading https://files.pythonhosted.org/packages/f5/74/dbed754c0abd63768d3a7a7b472da35b08ac442cf87d73d5850a6f32391e/graphviz-0.13.2-py2.py3-none-any.whl
Installing collected packages: graphviz, catboost
Successfully installed catboost-0.21 graphviz-0.13.2


In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt
import seaborn as sns
import utils
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import datetime
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
import warnings
warnings.filterwarnings('ignore')

In [4]:
df_train = pd.read_csv('dataset/df_train.csv')
df_test = pd.read_csv('dataset/df_test.csv')
target = df_train['SalePrice']
df_train = df_train.drop(['SalePrice'], axis = 1)

In [5]:
X_train, X_test, y_train, y_test = train_test_split(df_train, target, test_size = 0.25, random_state = 42)

#### 2. Gradient Boosting Regressor

In [10]:
print('##############################################\n{}\tGradient Boost'.format(datetime.datetime.now().strftime('%H:%M')))

parameters= {
    'n_estimators':[8000],
    'learning_rate':[0.01],
    'max_depth':[2],
    'max_features':['sqrt'],
    'min_samples_leaf':[10],
    'min_samples_split':[5],
    'loss':['huber'],
    'random_state':[42]
}

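gbr = GradientBoostingRegressor()
# Every parameter has a single candidate value, so this search only cross-validates one configuration.
# Note: `iid` was removed in scikit-learn 0.24; drop the argument on newer versions.
clf = GridSearchCV(gbr, parameters, verbose=0, iid=False)
clf.fit(X_train, y_train)
# Build a new (unfitted) regressor from the selected parameters
gbr = GradientBoostingRegressor(**clf.best_params_)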

print('\nRegressor: \n', gbr, '\n')
print('{}\tDone!\n####################################################'
      .format(datetime.datetime.now().strftime('%H:%M')))

##############################################
22:14	Gradient Boost

Regressor: 
 GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=0.01, loss='huber', max_depth=2,
                          max_features='sqrt', max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=10, min_samples_split=5,
                          min_weight_fraction_leaf=0.0, n_estimators=8000,
                          n_iter_no_change=None, presort='auto',
                          random_state=42, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False) 

22:15	Done!
####################################################
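
One detail worth noting: `GradientBoostingRegressor(**clf.best_params_)` builds a new, unfitted estimator. With the default `refit=True`, `GridSearchCV` already exposes a model refit on the full training data as `best_estimator_`. The sketch below shows the difference on synthetic data from `make_regression` (not the housing dataset); the grid values and variable names are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the real features and target
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    {"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3,
)
search.fit(X_tr, y_tr)

# A regressor rebuilt from best_params_ has the right settings but is not fitted
fresh = GradientBoostingRegressor(**search.best_params_)
try:
    fresh.predict(X_te)
    fresh_is_fitted = True
except NotFittedError:
    fresh_is_fitted = False

# best_estimator_ was refit on all of X_tr and can be scored on the held-out split
test_r2 = search.best_estimator_.score(X_te, y_te)
print(search.best_params_, fresh_is_fitted, round(test_r2, 3))
```

In the notebook this means either calling `gbr.fit(X_train, y_train)` before predicting, or using `clf.best_estimator_` directly.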


#### 3. XGBoost

In [8]:
print('####################################################\n{}\tXGBoost'
      .format(datetime.datetime.now().strftime('%H:%M')))
# hyper parameter optimization
learning_rate =[0.05,0.1,0.15,0.20]
max_depth = [2,3,5,10,15]
min_child_weight = [1,2,3,4]
n_estimators = [100,500,900,1100,1500]
booster = ['gbtree','gblinear']
base_score = [0.25,0.5,0.75,1]

#define the grid of hyperparameters to search
hyperparameter_grid = {
    'colsample_bytree':[0.4],
    'gamma':[0.0],
    'learning_rate': learning_rate,
    'max_depth':max_depth,
    'min_child_weight':min_child_weight,
    'n_estimators':n_estimators,
    'seed':[36],
    'subsample':[0.2],
    'objective':['reg:squarederror'],
    'reg_alpha':[0.00006],
    'scale_pos_weight':[1],
    'booster': booster,
    'base_score': base_score
}
regressor = XGBRegressor()
# Sample 50 random combinations from the grid and score each with 5-fold CV
random_cv = RandomizedSearchCV(estimator=regressor,
            param_distributions=hyperparameter_grid,
            cv = 5, n_iter=50, n_jobs=4,
            scoring='neg_mean_absolute_error',
            verbose = 5, return_train_score= True,
            random_state=42)

random_cv.fit(X_train, y_train)

####################################################
22:01	XGBoost
Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:   27.1s
[Parallel(n_jobs=4)]: Done  64 tasks      | elapsed:  2.7min
[Parallel(n_jobs=4)]: Done 154 tasks      | elapsed:  5.3min
[Parallel(n_jobs=4)]: Done 250 out of 250 | elapsed:  8.7min finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=XGBRegressor(base_score=0.5, booster='gbtree',
                                          colsample_bylevel=1,
                                          colsample_bynode=1,
                                          colsample_bytree=1, gamma=0,
                                          importance_type='gain',
                                          learning_rate=0.1, max_delta_step=0,
                                          max_depth=3, min_child_weight=1,
                                          missing=None, n_estimators=100,
                                          n_jobs=1, nthread=None,
                                          objective='reg:linear',
                                          random_st...
                                        'colsample_bytree': [0.4],
                                        'gamma': [0.0],
                                        'learning_rate': [0.05, 0.

In [9]:
XGBoost = XGBRegressor(**random_cv.best_params_)
print('{}\tDone!\n####################################################'
      .format(datetime.datetime.now().strftime('%H:%M')))

22:10	Done!
####################################################
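
The XGBoost search log above reports 250 fits: 50 sampled candidates times 5 folds. For comparison, the sketch below gives a rough count of what an exhaustive grid over the same lists would cost. It uses only the number of entries in each list; parameters with a single value are left out because they multiply the total by 1. The dictionary and variable names are illustrative.

```python
from math import prod

# Number of entries per multi-valued parameter in the XGBoost search space
n_values = {
    "learning_rate": 4,
    "max_depth": 5,
    "min_child_weight": 4,
    "n_estimators": 5,
    "booster": 2,
    "base_score": 4,
}
cv_folds = 5   # cv=5 in RandomizedSearchCV
n_iter = 50    # n_iter=50 in RandomizedSearchCV

full_grid = prod(n_values.values())  # every combination in the grid
grid_fits = full_grid * cv_folds     # fits an exhaustive search would need
random_fits = n_iter * cv_folds      # fits the randomized search actually ran

print(full_grid, grid_fits, random_fits)  # 3200 16000 250
```

The randomized search therefore covers under 2% of the grid, trading completeness for a runtime of minutes.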


#### 4. LightGBM

In [6]:
print('#############################\n{}\tLightGBM'.format(datetime.datetime.now().strftime('%H:%M')))

parameters = {
    'objective':['regression'],
    'num_leaves':[5],
    'learning_rate':[0.05],
    'n_estimators':[720],
    'max_bin':[55],
    'max_depth':[2,3],
    'bagging_fraction':[.5,.8],
    'bagging_freq':[5],
    'bagging_seed':[9],
    'feature_fraction':[0.2319]
}

light = LGBMRegressor()
clf = GridSearchCV(light, parameters, verbose=0, iid=False)
clf.fit(X_train, y_train)

#############################
22:00	LightGBM


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=LGBMRegressor(boosting_type='gbdt', class_weight=None,
                                     colsample_bytree=1.0,
                                     importance_type='split', learning_rate=0.1,
                                     max_depth=-1, min_child_samples=20,
                                     min_child_weight=0.001, min_split_gain=0.0,
                                     n_estimators=100, n_jobs=-1, num_leaves=31,
                                     objective=None, random_state=None,
                                     reg_alpha=0.0, reg_l...
                                     subsample_freq=0),
             iid=False, n_jobs=None,
             param_grid={'bagging_fraction': [0.5, 0.8], 'bagging_freq': [5],
                         'bagging_seed': [9], 'feature_fraction': [0.2319],
                         'learning_rate': [0.05], 'max_bin': [55],
                         'max_depth': [2, 3

In [7]:
lightgbm = LGBMRegressor(**clf.best_params_)
print('\nRegressor: \n', lightgbm, '\n')
print('{}\tDone!\n####################################################'
      .format(datetime.datetime.now().strftime('%H:%M')))


Regressor: 
 LGBMRegressor(bagging_fraction=0.8, bagging_freq=5, bagging_seed=9,
              boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              feature_fraction=0.2319, importance_type='split',
              learning_rate=0.05, max_bin=55, max_depth=2, min_child_samples=20,
              min_child_weight=0.001, min_split_gain=0.0, n_estimators=720,
              n_jobs=-1, num_leaves=5, objective='regression',
              random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0) 

22:00	Done!
####################################################


#### 5. CatBoost Regressor

In [16]:
parameters= {
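    'n_estimators':[100,500,900,1100,1500],
    'learning_rate':[0.05,0.1,0.15,0.20],
    'max_depth':[2,3,5,10,15],
    # max_leaves only applies with grow_policy='Lossguide'; with the default policy it triggers the CatBoostError shown in the output
    'max_leaves':[31, 35, 45],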
    'random_state':[42]
}

catBoost = CatBoostRegressor()
clf = GridSearchCV(catBoost, parameters, verbose=0, iid=False)
clf.fit(X_train, y_train)
cbr = CatBoostRegressor(**clf.best_params_)

print('\nRegressor: \n', cbr, '\n')
print('{}\tDone!\n####################################################'
      .format(datetime.datetime.now().strftime('%H:%M')))

0:	learn: 0.3910730	total: 55.3ms	remaining: 5.47s
1:	learn: 0.3813627	total: 56.2ms	remaining: 2.75s
2:	learn: 0.3716953	total: 57.2ms	remaining: 1.85s
3:	learn: 0.3627357	total: 58.1ms	remaining: 1.4s
4:	learn: 0.3544733	total: 59.1ms	remaining: 1.12s
5:	learn: 0.3462906	total: 60ms	remaining: 940ms
6:	learn: 0.3388498	total: 61ms	remaining: 810ms
7:	learn: 0.3317088	total: 62.1ms	remaining: 715ms
8:	learn: 0.3242022	total: 63.1ms	remaining: 638ms
9:	learn: 0.3170955	total: 64.2ms	remaining: 578ms
10:	learn: 0.3109155	total: 65.3ms	remaining: 528ms
11:	learn: 0.3047988	total: 66.7ms	remaining: 489ms
12:	learn: 0.2987876	total: 68.1ms	remaining: 456ms
13:	learn: 0.2935657	total: 69.1ms	remaining: 425ms
14:	learn: 0.2884143	total: 70.1ms	remaining: 398ms
15:	learn: 0.2833026	total: 71.1ms	remaining: 373ms
16:	learn: 0.2785780	total: 72ms	remaining: 351ms
17:	learn: 0.2737556	total: 72.9ms	remaining: 332ms
18:	learn: 0.2698399	total: 73.8ms	remaining: 315ms
19:	learn: 0.2647293	total: 7

0:	learn: 0.3938641	total: 1.45ms	remaining: 143ms
1:	learn: 0.3851218	total: 2.81ms	remaining: 138ms
2:	learn: 0.3760005	total: 3.96ms	remaining: 128ms
3:	learn: 0.3679100	total: 4.94ms	remaining: 119ms
4:	learn: 0.3609629	total: 5.91ms	remaining: 112ms
5:	learn: 0.3532167	total: 6.82ms	remaining: 107ms
6:	learn: 0.3448368	total: 7.92ms	remaining: 105ms
7:	learn: 0.3378007	total: 8.94ms	remaining: 103ms
8:	learn: 0.3308083	total: 9.91ms	remaining: 100ms
9:	learn: 0.3236372	total: 10.9ms	remaining: 97.9ms
10:	learn: 0.3168056	total: 11.8ms	remaining: 95.4ms
11:	learn: 0.3097722	total: 12.7ms	remaining: 93.3ms
12:	learn: 0.3034859	total: 13.7ms	remaining: 91.4ms
13:	learn: 0.2978474	total: 14.6ms	remaining: 89.6ms
14:	learn: 0.2927478	total: 15.5ms	remaining: 87.8ms
15:	learn: 0.2879757	total: 16.6ms	remaining: 87.2ms
16:	learn: 0.2832011	total: 17.7ms	remaining: 86.5ms
17:	learn: 0.2779539	total: 18.8ms	remaining: 85.6ms
18:	learn: 0.2731375	total: 20ms	remaining: 85.2ms
19:	learn: 0.2

CatBoostError: c:/goagent/pipelines/buildmaster/catboost.git/catboost/private/libs/options/json_helper.h:223: Error: change of option max_leaves is unimplemented for task type CPU and was not default in previous run
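
The error comes from the `max_leaves` entry in the grid. According to the CatBoost documentation, `max_leaves` can only be used with the `Lossguide` growing policy, while the default policy is `SymmetricTree`. One possible fix, sketched here as plain dictionaries, is to give `GridSearchCV` a list of two grids: one without `max_leaves`, and one that pairs it with `grow_policy='Lossguide'`. The values are copied from the cell above, with the repeated `100` read as `1100` (an assumption). Whether `Lossguide` is available on CPU depends on the installed CatBoost version, so the second grid should be verified before use.

```python
# Grid for the default tree-growing policy: max_leaves is left out
symmetric_grid = {
    "n_estimators": [100, 500, 900, 1100, 1500],
    "learning_rate": [0.05, 0.1, 0.15, 0.20],
    "max_depth": [2, 3, 5, 10, 15],
    "random_state": [42],
}

# Grid that keeps max_leaves, paired with the policy that supports it
lossguide_grid = dict(
    symmetric_grid,
    grow_policy=["Lossguide"],
    max_leaves=[31, 35, 45],
)

# GridSearchCV accepts a list of grids and searches each one separately
param_grid = [symmetric_grid, lossguide_grid]


def n_candidates(grid):
    """Count the combinations in a single grid."""
    total = 1
    for values in grid.values():
        total *= len(values)
    return total


print([n_candidates(g) for g in param_grid])  # [100, 300]
```

The list can then be passed in place of `parameters`, for example `GridSearchCV(CatBoostRegressor(verbose=0), param_grid)`, where `verbose=0` also suppresses the per-iteration training log.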