## Linear regression

Scope:
- I'm going to train a simple linear regressor as a baseline model on the scikit-learn diabetes dataset. The copy bundled with scikit-learn has pre-normalised features, so I loaded the raw .csv from the source instead.
- I'm also going to visualise the dataset with Altair and Seaborn, side by side.
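
For context on the normalisation: the scikit-learn documentation describes each feature column of its bundled copy as mean-centred and scaled by the standard deviation times the square root of the number of samples, so that the squared values of every column sum to 1. Below is a minimal pure-Python sketch of that transform. It runs on made-up ages rather than the real data, and the helper name `centre_and_scale` is hypothetical.

```python
import math


def centre_and_scale(values):
    """Mean-centre a column, then divide by the L2 norm of the centred values.

    Dividing by the L2 norm is equivalent to dividing by the population
    standard deviation times sqrt(n).
    """
    mean = sum(values) / len(values)
    centred = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centred))
    return [c / norm for c in centred]


# made-up ages, purely for illustration (not taken from the dataset)
ages = [59, 48, 72, 24, 50]
scaled = centre_and_scale(ages)

# the scaled column sums to (numerically) zero ...
print(round(sum(scaled), 10))
# ... and its squared values sum to one
print(round(sum(s * s for s in scaled), 10))
```

After this transform an age of 59 becomes a small signed number, which is why the raw file is easier to read when exploring the data.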

In [2]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
diabetes_df = pd.read_csv('../data/diabetes.csv')

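# I want the column names to be a bit more descriptive
# NB: the current scikit-learn docs describe S1 as total serum cholesterol,
# S4 as the total cholesterol / HDL ratio and S5 as (possibly) the log of
# serum triglycerides, so 't_cells', 'thyroid_sh' and 'lamotrigine' are loose labels
diabetes_df.rename(columns={'S1':'t_cells', 'S2':'ld_lipo', 'S3':'hd_lipo',
                            'S4':'thyroid_sh', 'S5':'lamotrigine', 'S6':'blood_sugar'}, inplace=True)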

diabetes_df.columns = [col.lower() for col in diabetes_df]

Let's see what this dataset looks like.

In [5]:
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          442 non-null    int64  
 1   sex          442 non-null    int64  
 2   bmi          442 non-null    float64
 3   bp           442 non-null    float64
 4   t_cells      442 non-null    int64  
 5   ld_lipo      442 non-null    float64
 6   hd_lipo      442 non-null    float64
 7   thyroid_sh   442 non-null    float64
 8   lamotrigine  442 non-null    float64
 9   blood_sugar  442 non-null    int64  
 10  y            442 non-null    int64  
dtypes: float64(6), int64(5)
memory usage: 38.1 KB


There are no null values, which is great. We have 10 features (predictive variables) and one target variable, y, which is a quantitative measure of disease progression one year after baseline. But what does y actually look like?

In [6]:
diabetes_df.describe()['y']

count    442.000000
mean     152.133484
std       77.093005
min       25.000000
25%       87.000000
50%      140.500000
75%      211.500000
max      346.000000
Name: y, dtype: float64

In [4]:
# drop two of the features before modelling
diabetes_df = diabetes_df.drop(columns=['ld_lipo', 'thyroid_sh'])

In [16]:
# separate the target column from the features, then split the data
X = diabetes_df.drop('y', axis=1)
y = diabetes_df['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle=True)

# training the linear regressor and three regularised variants
# (all with default hyperparameters) on the training data
models = {}
predictions = {}
models['linear'] = linear_model.LinearRegression()
models['ridge'] = linear_model.Ridge(random_state=0)
models['lasso'] = linear_model.Lasso(random_state=0)
models['elasticnet'] = linear_model.ElasticNet(random_state=0)

# print(list(models.items()))
for (model_name, model) in models.items():
    model.fit(X_train, y_train)
    predictions[model_name] = model.predict(X_test)
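
The four models above differ only in the penalty added to the least-squares loss. As I understand the scikit-learn documentation, with $n$ training samples, coefficient vector $w$, and the defaults used here (`alpha=1.0` for the three penalised models and `l1_ratio=0.5` for ElasticNet, written $\alpha$ and $\rho$ below), the objectives are as follows. The intercept is left out for brevity.

```latex
\begin{aligned}
\text{LinearRegression:}\quad & \min_w \; \lVert y - Xw \rVert_2^2 \\
\text{Ridge:}\quad & \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:}\quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{ElasticNet:}\quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
```

Since none of the penalties is tuned here, any comparison between the models reflects these defaults rather than the best each model could do.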

In [18]:
# evaluating each model on the held-out test set
scores = []
for (model_name, prediction) in predictions.items():
    scores.append([model_name,
                   mean_absolute_error(y_test, prediction),
                   r2_score(y_test, prediction)])

column_names = ['model_type', 'mean_absolute_error', 'r2_score']
res_df = pd.DataFrame(scores, columns=column_names).set_index('model_type')
res_df.sort_values(by='r2_score', ascending=False)

| model_type | mean_absolute_error | r2_score |
|------------|---------------------|----------|
| linear     | 42.849647           | 0.571991 |
| ridge      | 42.912183           | 0.570969 |
| lasso      | 43.510871           | 0.560735 |
| elasticnet | 46.852633           | 0.491271 |
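
To make the two scores easier to read: the mean absolute error is in the units of y (so an error of about 43 can be compared with the standard deviation of about 77 reported by `describe()`), and an r2_score of about 0.57 means the model accounts for roughly 57% of the variance of y on the test set. Here is a minimal sketch of both metrics computed by hand, on made-up numbers. The function names are hypothetical stand-ins, not the scikit-learn implementations.

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# made-up targets and predictions, purely for illustration
y_true_toy = [100, 150, 200, 250]
y_pred_toy = [110, 140, 190, 260]

mae_toy = mean_absolute_error_by_hand(y_true_toy, y_pred_toy)
r2_toy = r2_by_hand(y_true_toy, y_pred_toy)
print(mae_toy, r2_toy)
```

A model that always predicted the mean of y would score an R² of 0 under this definition, which is the baseline the values in the table should be read against.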


In [None]:
# just playing around with different visualisations
# what would be useful to visualise?
def plot_altair(column):
    return alt.Chart(diabetes_df).mark_point(filled=True).encode(
        x = alt.X(column, scale=alt.Scale(zero=False)),
        y = alt.Y('y:Q', scale=alt.Scale(zero=False)))
        # color = alt.Color('sex:N'),
        # size = alt.Size('blood_sugar:Q', title='Blood sugar'),
        # opacity = alt.OpacityValue(0.5))

# a regression line for each variable against the target variable
# but this is the /actual/ target variable, not the model's prediction of it
charts = []
for col in list(X.columns):
    chart = plot_altair(col + ':Q')
    # the columns were lower-cased earlier, so the target field is 'y', not 'Y'
    charts.append(chart + chart.transform_regression(col, 'y').mark_line())

# skip the first two charts (age and sex) and stack the rest vertically
alt.vconcat(*charts[2:])
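
Each line drawn by `transform_regression` is, with Altair's default `method='linear'`, a least-squares straight line through one feature and the target. It is fitted independently of the multi-feature models trained earlier. Below is a minimal sketch of that single-feature fit in pure Python. The values are made up, and `fit_line` is a hypothetical helper, not part of Altair.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance and variance terms (un-normalised; the 1/n factors cancel)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# made-up BMI-like values and targets, purely for illustration
bmi_toy = [20.0, 25.0, 30.0, 35.0]
y_toy = [90.0, 140.0, 170.0, 240.0]

slope, intercept = fit_line(bmi_toy, y_toy)
print(slope, intercept)
```

Because every chart gets its own one-variable fit, a steep line says the feature is strongly associated with y on its own, not that it carries a large coefficient in the multi-feature regression.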