## Trees: Ensemble Methods - Boosting

Boosting is another ensemble technique for building a collection of predictors. Learners are trained sequentially: early learners fit simple models to the data, and each later learner focuses on the errors left by the ones before it. In other words, we fit consecutive trees (optionally on a random sample of the data), and at every step the goal is to reduce the net error of the prior trees.

When a hypothesis misclassifies an input, that input's weight is increased so that the next hypothesis is more likely to classify it correctly. Combining the whole set at the end converts weak learners into a better-performing model.

An ensemble of trees is built one tree at a time, and the individual trees' predictions are summed sequentially. The next tree tries to recover the loss (the difference between actual and predicted values) left by the previous trees.

 - boosting = low variance, high bias base learners
 
 ![Boosting Example](./images/boosting.png)

#### AdaBoost = Adaptive Boosting
AdaBoost learns from the mistakes by increasing the weight of misclassified data points.

It is called Adaptive Boosting as the weights are re-assigned to each instance, with higher weights to incorrectly classified instances.

*AdaBoost's base learner is usually a tree with just one node and two leaves. (A tree with one node and two leaves is called a stump.)*
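
To make the stump idea concrete, here is a minimal sketch that fits scikit-learn's `AdaBoostClassifier` and inspects its first base learner. The synthetic data from `make_classification` and the variable names are assumptions made for illustration; the sketch relies on scikit-learn's default base learner being a depth-1 decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic binary classification data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# With no base estimator specified, scikit-learn boosts depth-1 decision trees.
ada = AdaBoostClassifier(n_estimators=20, random_state=0)
ada.fit(X_demo, y_demo)

# Inspect the first base learner: one split node and two leaves.
first_tree = ada.estimators_[0]
stump_depth = first_tree.get_depth()
stump_leaves = first_tree.get_n_leaves()
print(stump_depth, stump_leaves)
```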

Steps:
<li> 0: Initialize the weights of the data points. (e.g. if the data has 1000 points, each point starts with weight 1/1000 = 0.001) </li>
<li> 1: Train a decision tree on the whole (weighted) dataset. </li>
<li> 2: Calculate the weighted error rate (e) of the decision tree. </li>
<li> 3: Calculate this decision tree's weight in the ensemble. The weight of this tree = learning rate * log( (1 - e) / e) </li> 
<br> ** The higher the weighted error of the tree, the less decision power the tree is given during the later voting. </br>
<br> ** The lower the weighted error of the tree, the more decision power the tree is given during the later voting. </br>

<li> 4: Update the weights of the wrongly classified points. </li> 
<br> The weight of a data point stays the same if the model classified it correctly.</br>
<br> The <strong><em>new weight of a data point = old weight * exp(weight of the tree)</em></strong> if the model classified it incorrectly. The weights are then renormalized so that they sum to 1. </br> 

<li> 5: Repeat from step 1, using the dataset with the new weights. </li>
<li> 6: Make the final prediction as a vote of all trees, each weighted by its tree weight. </li>
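
The steps above can be sketched from scratch with NumPy. This is a simplified illustration, not a reference implementation: labels are assumed to be -1/+1, the base learner is a hand-written one-feature threshold stump, and the helper names (`fit_stump`, `adaboost_fit`, `adaboost_predict`) and the toy data are hypothetical.

```python
import numpy as np

def fit_stump(X, y, w):
    """Return (feature, threshold, polarity, weighted_error) of the best stump."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(X[:, j] <= t, -polarity, polarity)
                err = w[pred != y].sum()          # weighted error of this stump
                if err < best[3]:
                    best = (j, t, polarity, err)
    return best

def stump_predict(X, stump):
    j, t, polarity, _ = stump
    return np.where(X[:, j] <= t, -polarity, polarity)

def adaboost_fit(X, y, n_rounds=10, learning_rate=1.0):
    n = len(y)
    w = np.full(n, 1.0 / n)                           # step 0: uniform weights
    ensemble = []
    for _ in range(n_rounds):                         # step 5: repeat
        stump = fit_stump(X, y, w)                    # step 1: train on weighted data
        e = np.clip(stump[3], 1e-10, 1 - 1e-10)       # step 2: weighted error rate
        alpha = learning_rate * np.log((1 - e) / e)   # step 3: weight of this tree
        wrong = stump_predict(X, stump) != y
        w[wrong] *= np.exp(alpha)                     # step 4: boost misclassified points
        w /= w.sum()                                  # renormalize to sum to 1
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(X, ensemble):
    # step 6: weighted vote of all stumps
    score = sum(alpha * stump_predict(X, stump) for alpha, stump in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy 1-D data (an assumption for illustration): +1 inside (0.3, 0.7), -1 outside.
X_toy = np.linspace(0, 1, 100).reshape(-1, 1)
y_toy = np.where((X_toy[:, 0] > 0.3) & (X_toy[:, 0] < 0.7), 1, -1)

acc_one_round = np.mean(adaboost_predict(X_toy, adaboost_fit(X_toy, y_toy, n_rounds=1)) == y_toy)
acc_three_rounds = np.mean(adaboost_predict(X_toy, adaboost_fit(X_toy, y_toy, n_rounds=3)) == y_toy)
print(acc_one_round, acc_three_rounds)
```

No single stump can separate an interval from its surroundings, but on this toy data the weighted vote of three stumps can, which is the sense in which boosting turns weak learners into a stronger model.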

Further reading: https://www.mygreatlearning.com/blog/adaboost-algorithm/

#### Gradient Boosting = Gradient Descent + Boosting.
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. If x(n+1) = x(n) - learning_rate * dF/dx(n) for a small enough learning_rate, then F(x(n)) >= F(x(n+1)). (The idea is to move against the gradient.)
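
As a small illustration of the update rule, the sketch below minimizes F(x) = (x - 3)^2, whose derivative is 2(x - 3). The function name `gradient_descent`, the starting point, and the step count are assumptions chosen for the example.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly apply x <- x - learning_rate * dF/dx."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)   # move against the gradient
    return x

# F(x) = (x - 3)^2 has derivative 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))
```

With this learning rate, each step shrinks the distance to the minimum by a constant factor of 0.8, so the iterate converges to 3.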

Steps:
<li> 1: Calculate the average of the target label. </li> 
<li> 2: Calculate the residuals (actual value - predicted value). </li> 
<li> 3: Construct a decision tree that predicts the residuals. </li> 
<li> 4: Predict the target label using all of the trees within the ensemble. </li> 
<br> ** Predicted value = average value + learning rate * (sum of the residuals predicted by each decision tree) </br>
<li> 5: Compute the new residuals. </li> 
<li> 6: Repeat steps 3 to 5 until the number of iterations matches the number specified by the hyperparameter (i.e. number of estimators). </li>
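
The steps above can be sketched for regression with squared-error loss, where the residuals are exactly the negative gradient of the loss. This is a minimal illustration using scikit-learn's `DecisionTreeRegressor` as the base learner; the function names `gradient_boost_fit` and `gradient_boost_predict` and the sine-wave toy data are assumptions, not part of any library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    base = y.mean()                                   # step 1: average of the target
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_estimators):                     # step 6: repeat n_estimators times
        residuals = y - prediction                    # steps 2 and 5: current residuals
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)                        # step 3: tree that predicts residuals
        prediction = prediction + learning_rate * tree.predict(X)  # step 4
        trees.append(tree)
    return base, trees

def gradient_boost_predict(X, base, trees, learning_rate=0.1):
    # average value + learning rate * sum of the residuals predicted by each tree
    return base + learning_rate * sum(tree.predict(X) for tree in trees)

# Toy data (an assumption for illustration): a noiseless sine wave.
X_toy = np.linspace(0, 6, 200).reshape(-1, 1)
y_toy = np.sin(X_toy[:, 0])

base, trees = gradient_boost_fit(X_toy, y_toy)
pred = gradient_boost_predict(X_toy, base, trees)

mse_baseline = np.mean((y_toy - y_toy.mean()) ** 2)   # predicting the average only
mse_boosted = np.mean((y_toy - pred) ** 2)
print(mse_baseline, mse_boosted)
```

The same `learning_rate` must be used in fitting and prediction; a smaller value makes each tree's correction more cautious and usually calls for more estimators.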

![Bias and Variance](./images/final_prediction.png)

<strong>Note:</strong>

<li> Gradient boosting is prone to overfitting.</li>
<li> Requires careful tuning of different hyper-parameters.</li>

Example: https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4

In [None]:
#import libraries
import xgboost as xgb
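from sklearn.datasets import load_diabetes  #load_boston was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import time
import catboost as cb
import lightgbm as lgb

In [None]:
#import dataset
#(load_diabetes is a bundled regression dataset substituted for the removed load_boston)
X,y = load_diabetes(return_X_y=True)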

#train,test split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

#xgboost
xgbr = xgb.XGBRegressor(max_depth=5,learning_rate=0.1,n_estimators=100,n_jobs=1)
start_time = time.time()

xgbr.fit(X_train,y_train)

end_time = time.time()

y_predict = xgbr.predict(X_test)

print("--- %s seconds ---" % (end_time - start_time)) 

mean_squared_error(y_test,y_predict) #error

In [None]:
#let's try lightgbm
#it grows trees leaf-wise (best-first), whereas most other boosting libraries grow trees level-wise (depth-wise) by default.

#verbose=-1 in the constructor silences training logs (fit() no longer accepts verbose in lightgbm >= 4.0)
lgbr = lgb.LGBMRegressor(learning_rate=0.1,n_estimators=100,max_depth=5,num_leaves=50,verbose=-1)

start_time = time.time()

lgbr.fit(X_train,y_train)

end_time = time.time()

y_predict = lgbr.predict(X_test)

print("--- %s seconds ---" % (end_time - start_time))

mean_squared_error(y_test,y_predict)    #error

In [None]:
#catboost helps you save time by handling the preprocessing of categorical columns for you.
#it uses a weighted-sampling version of stochastic gradient boosting.

#let's try catboost
cbr = cb.CatBoostRegressor(learning_rate=0.1,n_estimators=100,max_depth=5)

start_time = time.time()

cbr.fit(X_train,y_train,verbose=0)

end_time = time.time()

y_predict = cbr.predict(X_test)

print("--- %s seconds ---" % (end_time - start_time))

mean_squared_error(y_test,y_predict)    #error

Exercise: Load the promotion dataset from the data folder, train a model on the dataset and compare results using both random forests and gradient boosting.

<strong>Note: Also make sure to do some data cleaning, upsampling/downsampling, parameter tuning.</strong>

`n_estimators`
- increasing the number of trees increases model complexity

`max_features`
- how many features to consider at each split
- rule of thumb = sqrt(num_features)
- depends on the ratio of noisy to important variables in the dataset
- a small number of features = reduced variance, increased bias
- with lots of noisy variables, a small m (max_features) decreases the probability of choosing an important variable at a split

`min_samples_leaf`
- increase a bit (default is 1) to get smaller trees with less overfitting

`max_depth`
- controls variance: deeper trees fit the training data more closely

`subsample`
- The fraction of observations selected for each tree. Selection is done by random sampling.
- Values slightly less than 1 make the model more robust by reducing the variance.
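
As a rough illustration of where these hyperparameters go, the sketch below sets them on scikit-learn's `RandomForestClassifier` and `GradientBoostingClassifier` and compares F1 scores. The imbalanced synthetic data from `make_classification` and the specific values are assumptions for demonstration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (an assumption for illustration).
X_syn, y_syn = make_classification(n_samples=1000, n_features=20, n_informative=5,
                                   weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3,
                                          stratify=y_syn, random_state=42)

models = {
    "random_forest": RandomForestClassifier(
        n_estimators=200,        # number of trees
        max_features="sqrt",     # features considered at each split
        min_samples_leaf=3,      # slightly above the default of 1
        random_state=42),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=200,
        learning_rate=0.05,
        max_features="sqrt",
        min_samples_leaf=3,
        max_depth=3,             # shallow trees keep variance down
        subsample=0.8,           # fraction of rows sampled for each tree
        random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te))   # F1 of the positive class
    print(f"{name}: F1 = {scores[name]:.3f}")
```

`subsample` appears only on the boosting model because random forests draw bootstrap samples by default rather than exposing that parameter under the same name.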



## Starting point hyperparameters

*Heard from a Kaggle Grandmaster:*

Learning rate = 0.05, 1000 rounds, max depth = 3-5, subsample = 0.8-1.0, colsample_bytree = 0.3 - 0.8, lambda = 0 to 5

Add capacity to combat bias - add rounds

Reduce capacity to combat variance - depth / regularization

In [120]:
import pandas as pd
df = pd.read_csv('./data/promotion/train.csv')
df.head()

Unnamed: 0,EmployeeNo,Division,Qualification,Gender,Channel_of_Recruitment,Trainings_Attended,Year_of_birth,Last_performance_score,Year_of_recruitment,Targets_met,Previous_Award,Training_score_average,State_Of_Origin,Foreign_schooled,Marital_Status,Past_Disciplinary_Action,Previous_IntraDepartmental_Movement,No_of_previous_employers,Promoted_or_Not
0,YAK/S/00001,Commercial Sales and Marketing,"MSc, MBA and PhD",Female,Direct Internal process,2,1986,12.5,2011,1,0,41,ANAMBRA,No,Married,No,No,0,0
1,YAK/S/00002,Customer Support and Field Operations,First Degree or HND,Male,Agency and others,2,1991,12.5,2015,0,0,52,ANAMBRA,Yes,Married,No,No,0,0
2,YAK/S/00003,Commercial Sales and Marketing,First Degree or HND,Male,Direct Internal process,2,1987,7.5,2012,0,0,42,KATSINA,Yes,Married,No,No,0,0
3,YAK/S/00004,Commercial Sales and Marketing,First Degree or HND,Male,Agency and others,3,1982,2.5,2009,0,0,42,NIGER,Yes,Single,No,No,1,0
4,YAK/S/00006,Information and Strategy,First Degree or HND,Male,Direct Internal process,3,1990,7.5,2012,0,0,77,AKWA IBOM,Yes,Married,No,No,1,0


In [121]:
df.shape

(38312, 19)

In [122]:
# import sys
# !{sys.executable} -m pip install -U pandas-profiling[notebook]
# !jupyter nbextension enable --py widgetsnbextension

In [123]:
# from pandas_profiling import ProfileReport
# profile = ProfileReport(df)
# profile.to_file(output_file="your_report.html")

In [124]:
df.isnull().sum()

EmployeeNo                                0
Division                                  0
Qualification                          1679
Gender                                    0
Channel_of_Recruitment                    0
Trainings_Attended                        0
Year_of_birth                             0
Last_performance_score                    0
Year_of_recruitment                       0
Targets_met                               0
Previous_Award                            0
Training_score_average                    0
State_Of_Origin                           0
Foreign_schooled                          0
Marital_Status                            0
Past_Disciplinary_Action                  0
Previous_IntraDepartmental_Movement       0
No_of_previous_employers                  0
Promoted_or_Not                           0
dtype: int64

In [125]:
df['Qualification'].value_counts()

First Degree or HND         25578
MSc, MBA and PhD            10469
Non-University Education      586
Name: Qualification, dtype: int64

In [126]:
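# note: dropna returns a new DataFrame; the result is only displayed here, not assigned, so df itself is unchanged
df.dropna(axis='rows')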

Unnamed: 0,EmployeeNo,Division,Qualification,Gender,Channel_of_Recruitment,Trainings_Attended,Year_of_birth,Last_performance_score,Year_of_recruitment,Targets_met,Previous_Award,Training_score_average,State_Of_Origin,Foreign_schooled,Marital_Status,Past_Disciplinary_Action,Previous_IntraDepartmental_Movement,No_of_previous_employers,Promoted_or_Not
0,YAK/S/00001,Commercial Sales and Marketing,"MSc, MBA and PhD",Female,Direct Internal process,2,1986,12.5,2011,1,0,41,ANAMBRA,No,Married,No,No,0,0
1,YAK/S/00002,Customer Support and Field Operations,First Degree or HND,Male,Agency and others,2,1991,12.5,2015,0,0,52,ANAMBRA,Yes,Married,No,No,0,0
2,YAK/S/00003,Commercial Sales and Marketing,First Degree or HND,Male,Direct Internal process,2,1987,7.5,2012,0,0,42,KATSINA,Yes,Married,No,No,0,0
3,YAK/S/00004,Commercial Sales and Marketing,First Degree or HND,Male,Agency and others,3,1982,2.5,2009,0,0,42,NIGER,Yes,Single,No,No,1,0
4,YAK/S/00006,Information and Strategy,First Degree or HND,Male,Direct Internal process,3,1990,7.5,2012,0,0,77,AKWA IBOM,Yes,Married,No,No,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38306,YAK/S/54801,People/HR Management,First Degree or HND,Male,Agency and others,3,1987,12.5,2016,0,0,44,LAGOS,Yes,Married,No,No,1,0
38307,YAK/S/54802,Information Technology and Solution Support,First Degree or HND,Female,Direct Internal process,2,1990,0.0,2018,0,0,70,LAGOS,Yes,Married,No,No,0,0
38308,YAK/S/54805,Customer Support and Field Operations,"MSc, MBA and PhD",Female,Agency and others,2,1984,5.0,2013,0,0,48,IMO,Yes,Married,No,No,1,0
38309,YAK/S/54806,Information and Strategy,First Degree or HND,Male,Agency and others,2,1994,12.5,2016,1,0,71,ANAMBRA,No,Married,No,No,3,0


In [140]:
df.isnull().sum()

Trainings_Attended                                        0
Last_performance_score                                    0
Targets_met                                               0
Previous_Award                                            0
Training_score_average                                    0
Promoted_or_Not                                           0
Division_Business Finance Operations                      0
Division_Commercial Sales and Marketing                   0
Division_Customer Support and Field Operations            0
Division_Information Technology and Solution Support      0
Division_Information and Strategy                         0
Division_People/HR Management                             0
Division_Regulatory and Legal services                    0
Division_Research and Innovation                          0
Division_Sourcing and Purchasing                          0
Qualification_First Degree or HND                         0
Qualification_MSc, MBA and PhD          

In [127]:
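# drop columns I don't need
# (Year_of_birth and Year_of_recruitment are included so the result matches the 11 columns in the output)
columns = ['Previous_IntraDepartmental_Movement', 'EmployeeNo', 'State_Of_Origin','Foreign_schooled', 'Marital_Status', 'Past_Disciplinary_Action', 'Year_of_birth', 'Year_of_recruitment']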
df = df.drop(columns = columns)
df.columns

Index(['Division', 'Qualification', 'Gender', 'Channel_of_Recruitment',
       'Trainings_Attended', 'Last_performance_score', 'Targets_met',
       'Previous_Award', 'Training_score_average', 'No_of_previous_employers',
       'Promoted_or_Not'],
      dtype='object')

In [128]:
df.shape

(38312, 11)

In [129]:
df = pd.get_dummies(df)

In [130]:
df

Unnamed: 0,Trainings_Attended,Last_performance_score,Targets_met,Previous_Award,Training_score_average,Promoted_or_Not,Division_Business Finance Operations,Division_Commercial Sales and Marketing,Division_Customer Support and Field Operations,Division_Information Technology and Solution Support,...,Channel_of_Recruitment_Agency and others,Channel_of_Recruitment_Direct Internal process,Channel_of_Recruitment_Referral and Special candidates,No_of_previous_employers_0,No_of_previous_employers_1,No_of_previous_employers_2,No_of_previous_employers_3,No_of_previous_employers_4,No_of_previous_employers_5,No_of_previous_employers_More than 5
0,2,12.5,1,0,41,0,0,1,0,0,...,0,1,0,1,0,0,0,0,0,0
1,2,12.5,0,0,52,0,0,0,1,0,...,1,0,0,1,0,0,0,0,0,0
2,2,7.5,0,0,42,0,0,1,0,0,...,0,1,0,1,0,0,0,0,0,0
3,3,2.5,0,0,42,0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,0
4,3,7.5,0,0,77,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38307,2,0.0,0,0,70,0,0,0,0,1,...,0,1,0,1,0,0,0,0,0,0
38308,2,5.0,0,0,48,0,0,0,1,0,...,1,0,0,0,1,0,0,0,0,0
38309,2,12.5,1,0,71,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
38310,2,2.5,0,0,37,0,0,1,0,0,...,0,1,0,0,1,0,0,0,0,0


In [141]:
### Correlations
corr_matrix = df.corr()
corr_matrix

Unnamed: 0,Trainings_Attended,Last_performance_score,Targets_met,Previous_Award,Training_score_average,Promoted_or_Not,Division_Business Finance Operations,Division_Commercial Sales and Marketing,Division_Customer Support and Field Operations,Division_Information Technology and Solution Support,...,Channel_of_Recruitment_Agency and others,Channel_of_Recruitment_Direct Internal process,Channel_of_Recruitment_Referral and Special candidates,No_of_previous_employers_0,No_of_previous_employers_1,No_of_previous_employers_2,No_of_previous_employers_3,No_of_previous_employers_4,No_of_previous_employers_5,No_of_previous_employers_More than 5
Trainings_Attended,1.0,-0.062042,-0.044789,-0.007409,0.041065,-0.024345,0.018066,0.029222,-0.075497,-0.00105,...,0.007765,-0.003303,-0.015492,-0.007007,0.008636,-0.003841,0.002667,-0.000674,-0.003654,0.000115
Last_performance_score,-0.062042,1.0,0.27635,0.026587,0.057836,0.11969,0.004866,-0.109401,0.125168,-0.042397,...,-0.013889,0.000431,0.046541,0.009148,-0.007861,0.001148,-0.000726,-0.004997,0.009668,-0.010945
Targets_met,-0.044789,0.27635,1.0,0.092934,0.077201,0.224518,0.030972,-0.122904,0.086584,-0.00896,...,-0.007734,-0.005354,0.04515,0.003332,-0.001091,0.005749,-0.014309,0.001385,0.002159,-0.000296
Previous_Award,-0.007409,0.026587,0.092934,1.0,0.07236,0.201434,0.006299,-0.01084,0.001457,0.002768,...,0.003364,-0.003487,0.000354,-0.00703,0.003538,0.007636,-0.000647,0.000329,0.000188,-0.000484
Training_score_average,0.041065,0.057836,0.077201,0.07236,1.0,0.178448,-0.052531,-0.651769,-0.121668,0.477898,...,-0.002915,-0.00436,0.025067,-0.010171,0.004884,0.005017,0.00321,0.001217,-0.000222,0.004684
Promoted_or_Not,-0.024345,0.11969,0.224518,0.201434,0.178448,1.0,-0.002263,-0.030213,0.006822,0.031617,...,-0.001268,-0.004354,0.019354,-0.005863,0.003367,0.005913,0.000352,0.00462,-0.005311,-0.002694
Division_Business Finance Operations,0.018066,0.004866,0.030972,0.006299,-0.052531,-0.002263,1.0,-0.146575,-0.113357,-0.085196,...,0.008121,0.000271,-0.02901,0.008403,-0.007313,0.003173,0.001254,-0.002522,-0.001566,-0.00571
Division_Commercial Sales and Marketing,0.029222,-0.109401,-0.122904,-0.01084,-0.651769,-0.030213,-0.146575,1.0,-0.339806,-0.255386,...,-0.003197,0.010748,-0.025896,0.002219,-0.001846,-0.001945,-0.000979,0.007398,-0.005069,-0.000784
Division_Customer Support and Field Operations,-0.075497,0.125168,0.086584,0.001457,-0.121668,0.006822,-0.113357,-0.339806,1.0,-0.197509,...,-0.001263,-0.000141,0.004855,0.005946,-0.007248,0.005853,-0.00428,0.000516,0.006953,-0.007866
Division_Information Technology and Solution Support,-0.00105,-0.042397,-0.00896,0.002768,0.477898,0.031617,-0.085196,-0.255386,-0.197509,1.0,...,-0.004136,-0.016561,0.071239,-0.012176,0.013439,-0.004963,0.001512,-5.7e-05,5.7e-05,-0.0014


In [131]:
X = df.drop('Promoted_or_Not', axis = 1)
y = df['Promoted_or_Not']

In [136]:
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn import tree
import matplotlib.pyplot as plt

In [137]:
#train_test_split
X_train,X_test, y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

#initialize the decisiontreeclassifier
dtc = tree.DecisionTreeClassifier(max_depth=5,random_state=42,criterion='entropy')
#criterion is the function to measure the quality of a split.

In [138]:
#fit and return f1_score
dtc.fit(X_train,y_train)

#average=None returns one F1 score per class: [not promoted (0), promoted (1)]
f1_score(y_test,dtc.predict(X_test),average=None)

array([0.96170252, 0.29535865])

In [150]:
#show decision tree
plt.rcParams["figure.figsize"] = (60,20)
tree.plot_tree(dtc,filled = True);
plt.savefig('./images/tree_promoted.png')
plt.show()



In [144]:
df.Promoted_or_Not.value_counts()

0    35071
1     3241
Name: Promoted_or_Not, dtype: int64

In [145]:
df_promoted = df[df.Promoted_or_Not == 1]  #set the dataframes
df_not_promoted = df[df.Promoted_or_Not == 0]

In [146]:
df_promoted.shape

(3241, 30)

In [147]:
from sklearn.utils import resample

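# upsample the minority class (promoted) by sampling with replacement
# until it matches the size of the majority class (not promoted)
df_upsampled = resample(df_promoted,
                      replace = True,
                      n_samples = df_not_promoted.shape[0], 
                      random_state = 42)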

In [148]:
df_upsampled.shape

(35071, 30)

In [151]:
df = pd.concat([df_not_promoted,df_upsampled])

In [152]:
df.shape

(70142, 30)
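
One caveat with upsampling a whole dataset: if the train/test split happens afterwards, copies of the same minority row can land in both splits and inflate test scores. A minimal sketch of the alternative, splitting first and upsampling only the training portion, is shown below. The synthetic arrays and variable names are assumptions for illustration; the same pattern should carry over to the promotion DataFrame.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (an assumption for illustration): about 10% positives.
rng = np.random.default_rng(42)
X_all = rng.normal(size=(1000, 4))
y_all = (rng.random(1000) < 0.1).astype(int)

# Split first, so the test set contains only original, unduplicated rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=42)

# Upsample the minority class within the training split only.
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.concatenate([np.zeros(len(majority), dtype=int),
                           np.ones(len(minority_up), dtype=int)])

print(X_tr_bal.shape, np.bincount(y_tr_bal))
```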

In [142]:
#import libraries
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import time

In [143]:
#train,test split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

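#xgboost
#note: Promoted_or_Not is a binary label, so xgb.XGBClassifier scored with f1_score would be the more natural choice;
#the regressor and MSE are kept here so that the recorded output still applies
xgbr = xgb.XGBRegressor(max_depth=5,learning_rate=0.1,n_estimators=100,n_jobs=1)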
start_time = time.time()

xgbr.fit(X_train,y_train)

end_time = time.time()

y_predict = xgbr.predict(X_test)

print("--- %s seconds ---" % (end_time - start_time)) 

mean_squared_error(y_test,y_predict) #error



--- 4.040838956832886 seconds ---


0.052074915904206216