Boosting is an ensemble method in which multiple models are trained sequentially.
Each predictor tries to correct the prediction errors of its predecessor.
In this way, many weak learners are combined into a strong learner.

<strong><span style="color:red"> Two boosting methods: </span></strong>
1. AdaBoost
2. Gradient Boosting

<strong><span style="color:red"> AdaBoost / Adaptive Boosting </span></strong>

    1. Each predictor pays more attention to the instances its predecessor predicted wrongly; this is achieved by adjusting the weights of the training instances.
    2. Each predictor is assigned a coefficient alpha.
    3. Alpha depends on the predictor's training error.
    4. Alpha is used to compute the instance weights for training the next predictor.
    5. The second predictor therefore pays more attention to the instances whose weights were increased.
    6. This process is repeated sequentially until the last predictor has been trained.
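
The steps above can be sketched for a single boosting round. This is a minimal illustration that assumes binary labels encoded as -1/+1 and the classic formula alpha = 0.5 * ln((1 - error) / error); the function name `adaboost_round` is hypothetical, and library implementations such as scikit-learn's may differ in the details.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round for labels in {-1, +1}.

    Returns the predictor's coefficient alpha and the normalised
    instance weights for training the next predictor.
    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted training error of the current predictor
    wrong = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    error = wrong / sum(weights)
    # Alpha grows as the training error shrinks
    alpha = 0.5 * math.log((1 - error) / error)
    # t * p is +1 for a correct prediction and -1 for a wrong one,
    # so wrong instances are scaled up and correct ones scaled down
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5                 # start with uniform weights
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]         # only the last instance is wrong
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After this round the single misclassified instance carries the largest weight, so the next predictor is pushed to get it right.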

Learning rate: a value between 0 and 1. There is a trade-off between the learning rate and the number of estimators: a smaller learning rate has to be compensated for by a larger number of estimators.

<strong><span style="color:green"> Once all predictors are trained, the label of a new instance is predicted according to the type of problem </span></strong>

    1. For classification: weighted majority voting.
    2. For regression: weighted average.
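
A small sketch of the two combination rules, assuming each predictor's alpha is used as its voting weight. The function names are hypothetical, and a given library may combine predictions differently.

```python
from collections import defaultdict


def weighted_majority_vote(labels, alphas):
    """Return the label whose predictors have the largest total alpha."""
    totals = defaultdict(float)
    for label, alpha in zip(labels, alphas):
        totals[label] += alpha
    return max(totals, key=totals.get)


def weighted_average(values, alphas):
    """Return the alpha-weighted mean of the predictors' outputs."""
    return sum(a * v for v, a in zip(values, alphas)) / sum(alphas)


# Classification: one strong predictor outvotes two weaker ones
print(weighted_majority_vote(["M", "B", "B"], [0.9, 0.3, 0.4]))
# Regression: the third predictor counts twice as much as the others
print(weighted_average([10.0, 20.0, 30.0], [1.0, 1.0, 2.0]))
```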

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeClassifier

In [2]:
cancer_dataframe = pd.read_csv("cancer_dataset.csv")
cancer_dataframe.drop('Unnamed: 32', axis=1, inplace=True)
cancer_dataframe

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,1.0950,0.9053,8.589,153.40,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.01860,0.01340,0.01389,0.003532,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.006150,0.04006,0.03832,0.02058,0.02250,0.004571,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,0.4956,1.1560,3.445,27.23,0.009110,0.07458,0.05661,0.01867,0.05963,0.009208,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.011490,0.02461,0.05688,0.01885,0.01756,0.005115,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,1.1760,1.2560,7.673,158.70,0.010300,0.02891,0.05198,0.02454,0.01114,0.004239,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,0.7655,2.4630,5.203,99.04,0.005769,0.02423,0.03950,0.01678,0.01898,0.002498,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,0.4564,1.0750,3.425,48.55,0.005903,0.03731,0.04730,0.01557,0.01318,0.003892,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,0.7260,1.5950,5.772,86.22,0.006522,0.06158,0.07117,0.01664,0.02324,0.006185,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [3]:
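# Target: the diagnosis label; features: all remaining columns
y = cancer_dataframe['diagnosis']
X = cancer_dataframe.drop(["diagnosis"], axis=1)

SEED = 1
# Stratified 70/30 split keeps the class proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y,
                                                    random_state=SEED)
# Weak learner: a decision stump (a tree of depth 1)
dt = DecisionTreeClassifier(max_depth=1, random_state=SEED)
adb_clf = AdaBoostClassifier(estimator=dt, n_estimators=100)
adb_clf.fit(X_train, y_train)
# Predicted probability of the second class in adb_clf.classes_
y_pred_proba = adb_clf.predict_proba(X_test)[:,1]
adb_clf_roc_auc_score = roc_auc_score(y_test, y_pred_proba)
adb_clf_roc_auc_score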

0.9880257009345794

<strong><span style="color:red"> Gradient Boosting </span></strong>

In AdaBoost, each predictor corrects the prediction errors of its predecessors by adjusting the weights of the training instances. In gradient boosting, the weights are left unchanged; instead, each subsequent predictor is trained using its predecessor's residual errors as labels.

<strong><span style="color:green">Schematic of the training process: </span></strong>

X,y -----------> TREE1 ----> PREDICT -----> Residual error (r1) = y - y_pred -----><br>

X, eta * r1 ---> TREE2 ----> PREDICT -----> r2 = r1 - r_pred ---------------------><br>

X, eta * r2 ---> TREE3 ----> PREDICT -----> r3 = r2 - r_pred ---------------------><br>

Shrinkage: the contribution of each tree is scaled down by a learning rate (eta) whose value lies between 0 and 1, shown in the schematic as eta multiplying the residual passed on to the next tree. As with AdaBoost, there is a trade-off between eta and the number of estimators: a smaller eta needs more estimators to reach a comparable fit, while a larger eta needs fewer.
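
The residual-fitting loop can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not scikit-learn's implementation: squared-error loss, a single input feature, depth-1 trees, and shrinkage applied to each tree's prediction (a common formulation). The names `fit_stump`, `gradient_boost` and `rmse` are hypothetical.

```python
from statistics import mean


def fit_stump(xs, targets):
    """Fit a depth-1 regression tree to 1-D inputs by minimising squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def gradient_boost(xs, ys, n_estimators, eta):
    """Return a predict(x) function built from n_estimators stumps."""
    base = mean(ys)                      # initial prediction: the mean target
    residuals = [y - base for y in ys]   # r1 = y - y_pred
    trees = []
    for _ in range(n_estimators):
        tree = fit_stump(xs, residuals)  # the residuals act as the labels
        trees.append(tree)
        # Shrink the tree's prediction by eta before updating the residuals
        residuals = [r - eta * tree(x) for x, r in zip(xs, residuals)]
    return lambda x: base + eta * sum(tree(x) for tree in trees)


def rmse(predict, xs, ys):
    """Root mean squared error of predict over the given points."""
    return mean((y - predict(x)) ** 2 for x, y in zip(xs, ys)) ** 0.5


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x * x for x in xs]
for eta in (0.1, 0.5):
    model = gradient_boost(xs, ys, n_estimators=5, eta=eta)
    print(f"eta={eta}: training RMSE = {rmse(model, xs, ys):.3f}")
```

With the same small number of estimators, the smaller eta leaves a larger training error on this toy data, which is the trade-off described above: a lower learning rate needs more trees to reach a comparable fit.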


In [4]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.ensemble import GradientBoostingRegressor

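# Whitespace-separated file without a header row; a raw string avoids
# an invalid-escape warning for the regex separator
car_df = pd.read_csv("auto_mpg/test-file.txt",
                     sep=r'\s+',
                     header=None,
                     names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name'])
car_df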

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.00,2790.0,15.6,82,1,ford mustang gl
394,44.0,4,97.0,52.00,2130.0,24.6,82,2,vw pickup
395,32.0,4,135.0,84.00,2295.0,11.6,82,1,dodge rampage
396,28.0,4,120.0,79.00,2625.0,18.6,82,1,ford ranger


In [5]:
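# Replace any '?' placeholder in the horsepower column with '0.0'
special_char_pattern = '[?]'
car_df['horsepower'] = car_df['horsepower'].str.replace(special_char_pattern, '0.0', regex=True)
# Check run after the replacement: no '?' should remain, so every value is False
car_df['has_special_char'] = car_df['horsepower'].str.contains(special_char_pattern, regex=True)
car_df['has_special_char']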

0      False
1      False
2      False
3      False
4      False
       ...  
393    False
394    False
395    False
396    False
397    False
Name: has_special_char, Length: 398, dtype: bool

In [6]:
y = car_df['mpg']
X = car_df.drop(["mpg", "car name"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    random_state=3)
print ("Training data - independent variable")
X_train

Training data - independent variable


Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model year,origin,has_special_char
65,8,351.0,153.0,4129.0,13.0,72,1,False
251,8,302.0,139.0,3570.0,12.8,78,1,False
238,4,98.0,83.00,2075.0,15.9,77,1,False
321,4,108.0,75.00,2265.0,15.2,80,3,False
70,8,400.0,190.0,4422.0,12.5,72,1,False
...,...,...,...,...,...,...,...,...
256,6,225.0,100.0,3430.0,17.2,78,1,False
131,4,71.0,65.00,1836.0,21.0,74,3,False
249,8,260.0,110.0,3365.0,15.5,78,1,False
152,6,225.0,95.00,3264.0,16.0,75,1,False


In [7]:
print ("Training data - dependent variable")
y_train

Training data - dependent variable


65     14.0
251    20.2
238    33.5
321    32.2
70     13.0
       ... 
256    20.5
131    32.0
249    19.9
152    19.0
362    24.2
Name: mpg, Length: 278, dtype: float64

In [8]:
print ("Test data - independent variable")
X_test

Test data - independent variable


Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model year,origin,has_special_char
358,4,120.0,74.00,2635.0,18.3,81,3,False
16,6,199.0,97.00,2774.0,15.5,70,1,False
292,8,360.0,150.0,3940.0,13.0,79,1,False
81,4,97.0,92.00,2288.0,17.0,72,3,False
112,4,122.0,85.00,2310.0,18.5,73,1,False
...,...,...,...,...,...,...,...,...
345,4,81.0,60.00,1760.0,16.1,81,3,False
67,8,429.0,208.0,4633.0,11.0,72,1,False
228,6,250.0,98.00,3525.0,19.0,77,1,False
237,4,98.0,63.00,2051.0,17.0,77,1,False


In [9]:
print ("Test data - dependent variable")
y_test

Test data - dependent variable


358    31.6
16     18.0
292    18.5
81     28.0
112    19.0
       ... 
345    35.1
67     11.0
228    18.5
237    30.5
343    39.1
Name: mpg, Length: 120, dtype: float64

In [10]:
gbt = GradientBoostingRegressor(n_estimators=300, max_depth=1, random_state=SEED)
gbt.fit(X_train, y_train)

y_pred = gbt.predict(X_test)
rmse_test = MSE(y_test, y_pred)**(1/2)
rmse_test

3.1076925704902263

<strong><span style="color:red"> Stochastic Gradient Boosting </span></strong>

Stochastic Gradient Boosting is a machine learning technique in which multiple weak decision trees are trained sequentially to build a powerful predictive model.

Rather than using the complete training data, each tree is trained on a random subset of the training instances, sampled without replacement. Note that not all features are considered either: only a randomly sampled subset of them is used when choosing split points.

Once a tree is trained and its predictions are made, the residual errors are computed. These residual errors are multiplied by the learning rate (eta) and fed to the next tree in the ensemble. This procedure continues sequentially until all the trees in the ensemble are trained.

The whole process is the same as gradient boosting; the only difference is that each tree is trained on a random sample of the training data (with a random sample of the features).
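
The sampling step can be illustrated with the standard library alone. This sketch assumes `subsample` and `max_features` are fractions, as in the cell below, and it draws one feature subset per tree for simplicity, whereas scikit-learn re-draws the candidate features at each split. The function name `draw_subsets` is hypothetical.

```python
import random


def draw_subsets(n_rows, n_features, subsample, max_features, rng):
    """Pick the row and feature indices used to train one tree.

    Both subsets are drawn without replacement, so no index repeats.
    """
    n_sampled_rows = max(1, int(subsample * n_rows))
    n_sampled_features = max(1, int(max_features * n_features))
    row_idx = rng.sample(range(n_rows), n_sampled_rows)
    feature_idx = rng.sample(range(n_features), n_sampled_features)
    return sorted(row_idx), sorted(feature_idx)


rng = random.Random(1)   # fixed seed so the draws are reproducible
for tree_number in range(3):
    rows, features = draw_subsets(n_rows=10, n_features=5,
                                  subsample=0.8, max_features=0.2, rng=rng)
    print(tree_number, rows, features)
```

Each tree sees a different 80% of the rows, which adds variety to the ensemble.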

In [13]:
y = car_df['mpg']
X = car_df.drop(["mpg", "car name"], axis=1)
SEED = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    random_state=SEED)
sgbt = GradientBoostingRegressor(max_depth=1,
                                 subsample=0.8,
                                 max_features=0.2,
                                 n_estimators=300,
                                 random_state=SEED)
sgbt.fit(X_train, y_train)
y_pred = sgbt.predict(X_test)
rmse_test = MSE(y_test, y_pred)**(1/2)
rmse_test

2.783627946304547