# Gradient Boosting

In gradient boosting, small weak learners (usually shallow decision trees) are fitted sequentially. The initial prediction is the average value of the target variable. At training time, each learner is trained to predict the residual, i.e. the difference between the ground truth and the current prediction of the ensemble. Tree after tree, the prediction is corrected so that it moves closer and closer to the ground truth of the training set.<br><br>
At inference time, the prediction of each learner is scaled by the learning rate and added to the initial prediction step by step. In other words, each weak learner gradually corrects the initial prediction so that the final prediction ends up as close as possible to the training ground truth. The same scaling is applied during training when the residuals are updated.<br><br>
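The training and inference steps described above can be sketched from scratch for a regression problem with squared-error loss. This is a minimal illustration, not how XGBoost is implemented: the function names `fit_gradient_boosting` and `predict_gradient_boosting`, the toy sine data, and the hyperparameter values are all assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Fit a toy gradient-boosting regressor for squared-error loss."""
    init = float(np.mean(y))          # initial prediction: mean of the target
    pred = np.full(len(y), init)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred           # what the current ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)         # weak learner predicts the residual
        pred = pred + learning_rate * tree.predict(X)  # small corrective step
        trees.append(tree)
    return init, trees


def predict_gradient_boosting(X, init, trees, learning_rate=0.1):
    """Start from the initial prediction and add each scaled tree output."""
    pred = np.full(X.shape[0], init)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred


# hypothetical toy data: a noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 6, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=200)

init, trees = fit_gradient_boosting(X_toy, y_toy)
mse_init = np.mean((y_toy - init) ** 2)
mse_boost = np.mean((y_toy - predict_gradient_boosting(X_toy, init, trees)) ** 2)

print(f"training MSE of the initial prediction: {mse_init:.4f}")
print(f"training MSE after boosting:            {mse_boost:.4f}")
```

The learning rate passed to `predict_gradient_boosting` has to match the one used in `fit_gradient_boosting`, because each tree was trained on residuals that already assume that step size.<br><br>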
Why don't we just fit one decision tree to correct the initial prediction in one step? Decision trees have high variance and are prone to overfitting. A single tree that is large enough to predict the residual in one step would memorize the noise of the training set. By using only weak learners, each step corrects the residual left by the previous step just a little, and by scaling the current model's predicted residual by the learning rate, we are saying that we trust its prediction only up to a small fraction.<br>
It also distributes the pattern-finding responsibility across multiple trees instead of forcing one big tree to find all the patterns, which can lead to overfitting.<br><br>
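The overfitting argument above can be checked on a small synthetic problem. The sketch below compares one fully grown tree with an ensemble of shallow boosted trees. It uses scikit-learn's `GradientBoostingRegressor` as a stand-in for the boosting procedure, and the noisy sine data, sample sizes, and hyperparameters are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)


def make_data(n):
    """Hypothetical data: a sine curve with fairly strong Gaussian noise."""
    X = rng.uniform(0, 6, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=n)
    return X, y


X_tr, y_tr = make_data(500)
X_te, y_te = make_data(500)

# one big tree: no depth limit, so it can isolate every training point
big_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# many weak learners, each contributing a small scaled correction
boosted = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=0
).fit(X_tr, y_tr)

tree_train_mse = mean_squared_error(y_tr, big_tree.predict(X_tr))
tree_test_mse = mean_squared_error(y_te, big_tree.predict(X_te))
boost_train_mse = mean_squared_error(y_tr, boosted.predict(X_tr))
boost_test_mse = mean_squared_error(y_te, boosted.predict(X_te))

print(f"big tree: train MSE {tree_train_mse:.3f}, test MSE {tree_test_mse:.3f}")
print(f"boosted:  train MSE {boost_train_mse:.3f}, test MSE {boost_test_mse:.3f}")
```

On data like this, the single tree typically drives its training error to zero by fitting the noise, while the boosted ensemble keeps a higher training error and a lower test error.<br><br>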
From another perspective, each learner in a gradient boosting ensemble predicts the negative gradient of the loss with respect to the current prediction (for squared-error loss, this is exactly the residual). The predicted negative gradient is scaled by the learning rate and used to update the current prediction.
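The link between residuals and negative gradients can be verified numerically. The sketch below uses a central finite difference to approximate the negative gradient of two losses with respect to the raw prediction `f`: squared error, and the log loss that a logistic objective is based on. The helper names and the sample values are assumptions made for this example.

```python
import numpy as np


def squared_error(y, f):
    # the factor 0.5 makes the gradient exactly -(y - f)
    return 0.5 * (y - f) ** 2


def log_loss(y, f):
    # f is the raw score (log-odds); the predicted probability is sigmoid(f)
    p = 1.0 / (1.0 + np.exp(-f))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


def numeric_neg_gradient(loss, y, f, eps=1e-6):
    """Central-difference estimate of -d(loss)/df."""
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)


# regression: the negative gradient equals the plain residual y - f
y_reg, f_reg = 3.0, 2.2
neg_grad_reg = numeric_neg_gradient(squared_error, y_reg, f_reg)
residual_reg = y_reg - f_reg

# binary classification: the negative gradient equals y - sigmoid(f)
y_clf, f_clf = 1.0, 0.4
neg_grad_clf = numeric_neg_gradient(log_loss, y_clf, f_clf)
residual_clf = y_clf - 1.0 / (1.0 + np.exp(-f_clf))

print(np.isclose(neg_grad_reg, residual_reg))
print(np.isclose(neg_grad_clf, residual_clf))
```

For classification the "residual" is therefore the gap between the label and the predicted probability, which is why the same boosting recipe carries over to a loss other than squared error.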

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import accuracy_score, confusion_matrix

from xgboost import XGBClassifier

In [2]:
# data processing

df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")

# drop unnecessary columns
df_train = df_train.drop(columns=['Unnamed: 0', 'id'])
df_test = df_test.drop(columns=['Unnamed: 0', 'id'])

# drop missing values
df_train = df_train.dropna()
df_test = df_test.dropna()

# convert Arrival Delay in Minutes to integer
df_train['Arrival Delay in Minutes'] = df_train['Arrival Delay in Minutes'].astype(np.int64)
df_test['Arrival Delay in Minutes'] = df_test['Arrival Delay in Minutes'].astype(np.int64)

# divide categorical and numerical features
cat_feats = []
num_feats = []

for col, dtype in df_train.dtypes.items():
    if dtype == object:  # object dtype: categorical (string) columns
        cat_feats.append(col)
    elif dtype == np.int64:
        num_feats.append(col)

target = 'satisfaction'
cat_feats.remove(target)

# prepare train and test sets
train_features = cat_feats + num_feats

# fit each encoder on the training set only and reuse it on the test set,
# so both sets share the same category-to-integer mapping
for cf in cat_feats + [target]:
    cats = sorted(df_train[cf].unique())
    encoder = OrdinalEncoder(categories=[cats])
    df_train[cf] = encoder.fit_transform(df_train[[cf]]).astype(int)
    # raises an error if the test set contains a category unseen in training
    df_test[cf] = encoder.transform(df_test[[cf]]).astype(int)

X_train = df_train[train_features]
y_train = df_train[target]

X_test = df_test[train_features]
y_test = df_test[target]

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(103594, 22) (103594,)
(25893, 22) (25893,)


In [20]:
# Baseline

# learning_rate is the scikit-learn wrapper's name for XGBoost's eta
bst = XGBClassifier(learning_rate=0.2, objective='binary:logistic')

bst.fit(X_train, y_train)

y_pred = bst.predict(X_train)
accuracy = accuracy_score(y_train, y_pred)

print(f"train accuracy: {accuracy}\n")

y_pred = bst.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"test accuracy: {accuracy}\n")

print("confusion matrix:")
print(cm)

train accuracy: 0.9703554259899222

test accuracy: 0.9640057158305333

confusion matrix:
[[14227   301]
 [  631 10734]]
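
Accuracy alone hides how the errors split between the two classes. As a small follow-up, the sketch below derives precision, recall, and F1 for class 1 directly from the confusion matrix printed above. It assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are chosen for this example.

```python
import numpy as np

# test-set confusion matrix copied from the output above
cm_test = np.array([[14227, 301],
                    [631, 10734]])

# for a binary problem, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = cm_test.ravel()

acc = (tp + tn) / cm_test.sum()
precision = tp / (tp + fp)   # of all predicted class-1 rows, how many are correct
recall = tp / (tp + fn)      # of all true class-1 rows, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {acc:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
```

The matrix shows more false negatives (631) than false positives (301), so recall for class 1 comes out lower than precision.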
