## Gradient boosting with LightGBM

Scope:
- I'm going to train a gradient boosting model on the same diabetes dataset I used in the linear regression example, then compare its performance with that baseline model.

In [3]:
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [22]:
diabetes_df = pd.read_csv('../data/diabetes.csv')

# I want the column names to be a bit more descriptive
diabetes_df.rename(columns={'S1':'t_cells', 'S2':'ld_lipo', 'S3':'hd_lipo',
                            'S4':'thyroid_sh', 'S5':'lamotrigine', 'S6':'blood_sugar'}, inplace=True)

diabetes_df.columns = [col.lower() for col in diabetes_df]

Let's see what this dataset looks like.

In [20]:
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          442 non-null    int64  
 1   sex          442 non-null    int64  
 2   bmi          442 non-null    float64
 3   bp           442 non-null    float64
 4   t_cells      442 non-null    int64  
 5   ld_lipo      442 non-null    float64
 6   hd_lipo      442 non-null    float64
 7   thyroid-sh   442 non-null    float64
 8   lamotrigine  442 non-null    float64
 9   blood_sugar  442 non-null    int64  
 10  y            442 non-null    int64  
dtypes: float64(6), int64(5)
memory usage: 38.1 KB


We have no null values, which is great. We have 10 features (predictive variables) and one target variable, y, which is a quantitative measure of disease progression one year after baseline.

---

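Let's look at whether any of our features are strongly correlated with each other.  
Below, I can see that ld_lipo and t_cells are very highly correlated: 0.897. hd_lipo and thyroid_sh are strongly negatively correlated at -0.738, ld_lipo and thyroid_sh come in at 0.660, and lamotrigine and thyroid_sh at 0.618. With this much overlap between features, the dataset has a multicollinearity problem.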

In [21]:
diabetes_df.corr()

,age,sex,bmi,bp,t_cells,ld_lipo,hd_lipo,thyroid-sh,lamotrigine,blood_sugar,y
age,1.0,0.173737,0.185085,0.335428,0.260061,0.219243,-0.075181,0.203841,0.270774,0.301731,0.187889
sex,0.173737,1.0,0.088161,0.24101,0.035277,0.142637,-0.37909,0.332115,0.149916,0.208133,0.043062
bmi,0.185085,0.088161,1.0,0.395411,0.249777,0.26117,-0.366811,0.413807,0.446157,0.38868,0.58645
bp,0.335428,0.24101,0.395411,1.0,0.242464,0.185548,-0.178762,0.25765,0.39348,0.39043,0.441482
t_cells,0.260061,0.035277,0.249777,0.242464,1.0,0.896663,0.051519,0.542207,0.515503,0.325717,0.212022
ld_lipo,0.219243,0.142637,0.26117,0.185548,0.896663,1.0,-0.196455,0.659817,0.318357,0.2906,0.174054
hd_lipo,-0.075181,-0.37909,-0.366811,-0.178762,0.051519,-0.196455,1.0,-0.738493,-0.398577,-0.273697,-0.394789
thyroid-sh,0.203841,0.332115,0.413807,0.25765,0.542207,0.659817,-0.738493,1.0,0.617859,0.417212,0.430453
lamotrigine,0.270774,0.149916,0.446157,0.39348,0.515503,0.318357,-0.398577,0.617859,1.0,0.464669,0.565883
blood_sugar,0.301731,0.208133,0.38868,0.39043,0.325717,0.2906,-0.273697,0.417212,0.464669,1.0,0.382483
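
Reading pairs off the matrix by eye is error-prone, so here is a minimal sketch of listing the strongly correlated pairs programmatically. The helper name `strong_pairs` and the 0.6 cut-off on the absolute correlation are my own assumptions for illustration, and the toy data stands in for the real DataFrame.

```python
import numpy as np
import pandas as pd


def strong_pairs(df, threshold=0.6):
    """Return column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr()
    # keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (masked cells) fail this comparison and are dropped as well
    pairs = pairs[pairs.abs() > threshold]
    # strongest relationships first, regardless of sign
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# toy data (hypothetical): `b` is almost a copy of `a`, `c` is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
toy_pairs = strong_pairs(toy)
print(toy_pairs)
```

Called as `strong_pairs(diabetes_df.drop(columns='y'))`, it should return the pairs above 0.6 in the matrix, positive or negative.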


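Testing a hypothesis: our models do poorly because some features are strongly correlated with each other. To test this, I drop thyroid_sh, which appears in three of the four strongest pairs, and retrain.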

In [28]:
# diabetes_df = diabetes_df.drop('ld_lipo', axis=1)
diabetes_df = diabetes_df.drop('thyroid_sh', axis=1)

In [29]:
# separate the target column from the features, then split the data
X = diabetes_df.drop('y', axis=1)
y = diabetes_df['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle=True)

In [30]:
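from lightgbm import early_stopping

def run_booster(learning_rate):
    # initialise via the scikit-learn API; verbose=-1 silences LightGBM's training log
    bst = LGBMRegressor(n_estimators=500, learning_rate=learning_rate, verbose=-1)
    # LightGBM 4.0 removed the early_stopping_rounds/verbose arguments of fit();
    # the early_stopping callback does the same job: stop after 10 rounds without improvement
    bst.fit(X_train, y_train,
            eval_set=[(X_test, y_test)],
            callbacks=[early_stopping(stopping_rounds=10, verbose=False)])

    # predict on the test data
    return bst.predict(X_test)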
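
To make the early-stopping setting concrete, here is a simplified sketch of the patience rule: training stops once the validation error has not improved for a given number of consecutive rounds, and the best round seen so far is kept. This is my own illustration, not LightGBM's implementation, and the function name and scores are hypothetical.

```python
def best_iteration(val_scores, patience=10):
    """Simplified patience rule for early stopping.

    `val_scores` holds one validation error per boosting round (lower is better).
    Returns the 0-based index of the best round seen before stopping.
    """
    best_idx = 0
    for i, score in enumerate(val_scores):
        if score < val_scores[best_idx]:
            best_idx = i  # new best round: the patience counter restarts here
        elif i - best_idx >= patience:
            break  # `patience` rounds in a row without improvement
    return best_idx


# hypothetical validation errors: improve for four rounds, then drift upwards
scores = [60.0, 55.0, 52.0, 51.0, 51.5, 52.0, 53.0, 54.0]
stop_at = best_iteration(scores, patience=3)
print(stop_at)
```

With `n_estimators=500` and a patience of 10, most runs stop long before 500 trees are built, which is why a high cap on the number of estimators is cheap.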

In [31]:
# testing 20 values for the learning rate, equally spaced between 0.001 and 1

results = []
alpha_range = np.linspace(0.001, 1, num=20)
for alpha in alpha_range:
    predictions_bst = run_booster(alpha)
    results.append([alpha, 
                    mean_absolute_error(y_test, predictions_bst), 
                    round(r2_score(y_test, predictions_bst),2)])
    
column_names = ['learning_rate', 'mean_absolute_error', 'r2_score']
res_df = pd.DataFrame(results, columns=column_names).set_index('learning_rate')
res_df.sort_values(by='r2_score', ascending=False)

learning_rate,mean_absolute_error,r2_score
0.158737,41.367401,0.6
0.369053,40.221245,0.6
0.263895,42.324769,0.59
0.053579,42.084717,0.59
0.579368,40.749605,0.59
0.106158,42.618516,0.58
0.211316,42.679527,0.58
0.316474,42.539185,0.58
0.684526,42.828967,0.56
0.737105,43.818958,0.56


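LightGBM does better than XGBoost in this case, with a best R² of 0.60 against 0.53, but that is still only slightly better than a regular linear regressor. Keep in mind that both early stopping and the choice of learning rate used the test set, so 0.60 is an optimistic estimate.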
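
To make the two metrics in the table concrete, here is a small sketch that computes them from their definitions with NumPy. The function names and the four data points are made up for illustration; scikit-learn's `mean_absolute_error` and `r2_score` should agree with these on the same inputs.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# made-up targets and predictions, each prediction off by 10
y_true_demo = [100.0, 150.0, 200.0, 250.0]
y_pred_demo = [110.0, 140.0, 190.0, 260.0]

print(mae(y_true_demo, y_pred_demo))  # 10.0
print(r2(y_true_demo, y_pred_demo))   # 1 - 400/12500 = 0.968
```

Read this way, an R² of 0.6 means the model removes about 60% of the squared error that a constant prediction of the test-set mean would make.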

In [None]:
# just playing around with different visualisations
# what would be useful to visualise?
def plot_altair(column):
    return alt.Chart(diabetes_df).mark_point(filled=True).encode(
        x = alt.X(column, scale=alt.Scale(zero=False)),
        y = alt.Y('y:Q', scale=alt.Scale(zero=False)))
        # color = alt.Color('sex:N'),
        # size = alt.Size('blood_sugar:Q', title='Blood sugar'),
        # opacity = alt.OpacityValue(0.5))

# a regression line for each variable against the target variable
# but this is the /actual/ target variable, not the model's prediction of it
charts = []
for col in list(X.columns):
    chart = plot_altair(col + ':Q')
    # the target column is lowercase 'y' after the renaming above
    charts.append(chart + chart.transform_regression(col, 'y').mark_line())

alt.vconcat(*charts[2:])
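
As far as I know, `transform_regression` fits a straight least-squares line by default. Here is a small sketch for reading off the slope and intercept behind such a line, using NumPy instead of Altair. The helper name `fit_line` and the sample points are hypothetical.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares straight line through the points; returns (slope, intercept)."""
    slope, intercept = np.polyfit(np.asarray(x, dtype=float),
                                  np.asarray(y, dtype=float),
                                  deg=1)
    return slope, intercept


# made-up points lying exactly on y = 2x + 1
demo_slope, demo_intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(demo_slope, demo_intercept)
```

For the notebook's data, `fit_line(diabetes_df['bmi'], diabetes_df['y'])` would give the numbers behind the bmi chart.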