## Gradient boosting with XGBoost

Scope:
- I'm going to train a gradient boosting model on the same diabetes dataset I used in the linear regression example, then compare its performance with that baseline model.

In [5]:
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [6]:
diabetes_df = pd.read_csv('../data/diabetes.csv')

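# I want the column names to be a bit more descriptive.
# Note: scikit-learn documents S1 as total serum cholesterol, S4 as the
# total cholesterol / HDL ratio and S5 as (possibly) log serum triglycerides,
# so 't_cells', 'thyroid_sh' and 'lamotrigine' should be read as labels only.
diabetes_df.rename(columns={'S1':'t_cells', 'S2':'ld_lipo', 'S3':'hd_lipo',
                            'S4':'thyroid_sh', 'S5':'lamotrigine', 'S6':'blood_sugar'}, inplace=True)

diabetes_df.columns = [col.lower() for col in diabetes_df.columns]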

Let's see what this dataset looks like.

In [5]:
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          442 non-null    int64  
 1   sex          442 non-null    int64  
 2   bmi          442 non-null    float64
 3   bp           442 non-null    float64
 4   t_cells      442 non-null    int64  
 5   ld_lipo      442 non-null    float64
 6   hd_lipo      442 non-null    float64
 7   thyroid_sh   442 non-null    float64
 8   lamotrigine  442 non-null    float64
 9   blood_sugar  442 non-null    int64  
 10  y            442 non-null    int64  
dtypes: float64(6), int64(5)
memory usage: 38.1 KB


We have no null values, which is great. We have 10 features (predictor variables) and one target variable, y, which is a quantitative measure of disease progression one year after baseline. But what does y actually look like?

In [6]:
diabetes_df.describe()['y']

count    442.000000
mean     152.133484
std       77.093005
min       25.000000
25%       87.000000
50%      140.500000
75%      211.500000
max      346.000000
Name: y, dtype: float64

In [7]:
# drop two of the features in a single call
diabetes_df = diabetes_df.drop(columns=['ld_lipo', 'thyroid_sh'])

In [8]:
# separate the target column from the features, then split the data
X = diabetes_df.drop('y', axis=1)
y = diabetes_df['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle=True)

In [9]:
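def run_booster(learning_rate):
    # initialising using the scikit-learn API; XGBoost >= 1.6 takes
    # early_stopping_rounds in the constructor (passing it to fit() was removed in 2.0)
    bst = XGBRegressor(n_estimators=1000,
                       learning_rate=learning_rate,
                       early_stopping_rounds=5)
    # note: the test set doubles as the early-stopping set here,
    # so the scores computed on it are likely to be optimistic
    bst.fit(X_train, y_train,
            eval_set=[(X_test, y_test)],
            verbose=False)

    # predicting the test data
    return bst.predict(X_test)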
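
To make the roles of `learning_rate` and `early_stopping_rounds` more concrete, here is a minimal sketch of gradient boosting for squared error in pure Python. It is an illustration under simplifying assumptions (a single feature, one-split "stumps", hypothetical toy data, and helper names such as `fit_stump` and `boost` that I made up), not a description of how XGBoost works internally. Unlike the cell above, it stops early on a separate validation split rather than on the data used for the final score.

```python
# Toy gradient boosting for squared error with one-split "stumps" on one feature.
# Illustrative only; this is NOT how XGBoost is implemented internally.

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_stump(xs, residuals):
    """Find the split on x that best fits the residuals (least squares)."""
    mean_all = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], mean_all, mean_all)  # fallback: no useful split
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(train, valid, learning_rate, n_estimators=200, early_stopping_rounds=5):
    (xs, ys), (vxs, vys) = train, valid
    base = sum(ys) / len(ys)  # round 0: predict the training mean
    preds, valid_preds = [base] * len(ys), [base] * len(vys)
    stumps, best_score, best_round = [], mse(vys, valid_preds), 0
    for round_no in range(1, n_estimators + 1):
        # residuals are the negative gradient of squared error (up to a constant)
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # each new stump is shrunk by the learning rate before being added
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
        valid_preds = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(valid_preds, vxs)]
        score = mse(vys, valid_preds)
        if score < best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= early_stopping_rounds:
            break  # validation error has not improved for several rounds
    return base, stumps[:best_round], best_score


# Hypothetical toy data: a deterministic, slightly bumpy upward trend
all_x = list(range(20))
all_y = [3 * x + (x % 3) for x in all_x]
train = (all_x[0::2], all_y[0::2])  # even positions: used for fitting
valid = (all_x[1::2], all_y[1::2])  # odd positions: used for early stopping only

base, stumps, best_valid_mse = boost(train, valid, learning_rate=0.3)
baseline_mse = mse(valid[1], [base] * len(valid[1]))
print(f"kept {len(stumps)} stumps; "
      f"validation MSE {best_valid_mse:.2f} vs mean-only baseline {baseline_mse:.2f}")
```

A small learning rate means each stump contributes little, so more rounds are needed before early stopping kicks in; a large one converges in fewer rounds but can overshoot. That trade-off is what the learning-rate sweep in this notebook explores.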

In [13]:
# testing 50 values for the learning rate, equally spaced between 0.001 and 1

results = []
learning_rates = np.linspace(0.001, 1, num=50)
for lr in learning_rates:
    predictions_bst = run_booster(lr)
    results.append([lr,
                    mean_absolute_error(y_test, predictions_bst),
                    round(r2_score(y_test, predictions_bst), 2)])
    
column_names = ['learning_rate', 'mean_absolute_error', 'r2_score']
res_df = pd.DataFrame(results, columns=column_names).set_index('learning_rate')
res_df.sort_values(by='r2_score', ascending=False)

               mean_absolute_error  r2_score
learning_rate
0.367980                 42.290773      0.57
0.551469                 43.561084      0.56
0.510694                 43.330521      0.53
0.347592                 44.946175      0.53
0.143714                 44.028810      0.53
0.164102                 44.223291      0.53
0.184490                 43.655887      0.53
0.082551                 44.674120      0.52
0.204878                 43.369208      0.52
0.286429                 45.543297      0.51


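It doesn't look like gradient boosting is working very well here. Our top R² score is 0.57, at a learning rate of about 0.37, and most of the other learning rates land around 0.51 to 0.53.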
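
To help read those numbers, here is a small sketch of how the two metrics are defined, written in pure Python. The function names `mae_by_hand` and `r2_by_hand` and the toy values are hypothetical; in the notebook itself the scores come from scikit-learn's `mean_absolute_error` and `r2_score`.

```python
def mae_by_hand(y_true, y_pred):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """R²: 1 minus (model squared error / squared error of predicting the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy targets, not taken from the diabetes data
y_true = [100, 150, 200, 250]
mean_only = [sum(y_true) / len(y_true)] * len(y_true)  # always predict the mean

print(mae_by_hand(y_true, mean_only), r2_by_hand(y_true, mean_only))
print(mae_by_hand(y_true, y_true), r2_by_hand(y_true, y_true))
```

Always predicting the mean gives an R² of 0 and a perfect model gives 1, so an R² of 0.57 means the model removes a bit more than half of the squared error of a mean-only predictor on the test set. The mean absolute error of roughly 42 can be read against the standard deviation of y reported by `describe()`, which is about 77.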

In [None]:
# just playing around with different visualisations
# what would be useful to visualise?
def plot_altair(column):
    return alt.Chart(diabetes_df).mark_point(filled=True).encode(
        x = alt.X(column, scale=alt.Scale(zero=False)),
        y = alt.Y('y:Q', scale=alt.Scale(zero=False)))
        # color = alt.Color('sex:N'),
        # size = alt.Size('blood_sugar:Q', title='Blood sugar'),
        # opacity = alt.OpacityValue(0.5))

# a regression line for each variable against the target variable
# but this is the /actual/ target variable, not the model's prediction of it
charts = []
for col in X.columns:
    chart = plot_altair(col + ':Q')
    # the target column was lower-cased to 'y' earlier in the notebook
    charts.append(chart + chart.transform_regression(col, 'y').mark_line())

# skip the first two features (age and sex) and stack the remaining charts
alt.vconcat(*charts[2:])