### What will we cover today?

We are going to test how a linear regression model performs with and without regularization on a real dataset (***diabetes severity prediction***). We will compare three models:

1.   **Linear Regression (LR)**
2.   **Lasso Regression (LR with L1 Regularization)**
3.   **Ridge Regression (LR with L2 Regularization)**
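
For reference, the three models differ only in the penalty added to the least-squares objective. The forms below follow the objectives that scikit-learn documents for its `LinearRegression`, `Lasso`, and `Ridge` estimators. The notation is an assumption made here for illustration: $w$ is the weight vector, $n$ the number of training samples, $\alpha \ge 0$ the regularization strength, and the intercept is omitted for brevity.

$$
\begin{aligned}
\text{LR:} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 \\
\text{Lasso (L1):} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Ridge (L2):} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\end{aligned}
$$

Because of the $\frac{1}{2n}$ factor in the Lasso objective, the same numeric value of $\alpha$ does not imply the same amount of regularization in Lasso and Ridge.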








[Diabetes Dataset](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)

**Summary of the dataset:**

*   **Features:** *age, sex, bmi, bp, s1, s2, s3, s4, s5, s6*
*   **Target (y)**: A quantitative measure of diabetes progression one year after baseline.



**Note:** Some features are directly interpretable (e.g., age, sex, bmi), while others (s1 to s6) are derived from blood serum measurements (for more details, read the data description provided through the link above).



---







## Loading necessary Python packages

In [1]:
# dataset-related package(s)
from sklearn import datasets

# diabetes data comes as a part of the sklearn package
from sklearn.datasets import load_diabetes

# data processing packages
import pandas as pd

# Regression modeling package(s) (sklearn)
from sklearn.linear_model import LinearRegression

# model evaluation related packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# visualization
import plotly.express as px
import plotly.graph_objects as go

## Loading data and some preprocessing

In [2]:
# Load the dataset (a dict-like Bunch object that holds the DataFrames, not a DataFrame itself)
df = datasets.load_diabetes(as_frame=True)
df.keys()

dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename', 'data_module'])

In [3]:
# Separating features and labels dataframes
features_df, labels_df = df.data, df.target

In [4]:
# Looking at a sample of features
features_df.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|-----|-----|-----|----|----|----|----|----|----|----|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |
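
The feature values above are not raw measurements (for example, `sex` shows up as 0.05068 or -0.044642). According to the scikit-learn dataset description, each column has been mean-centered and scaled so that its sum of squares equals 1. The following is a minimal sketch to check this; the names `X_check`, `col_means`, and `col_sq_sums` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Load only the feature table (scaled version, which is the default)
X_check = load_diabetes(as_frame=True).data

# Per-column mean: expected to be (numerically) zero
col_means = X_check.mean(axis=0)
# Per-column sum of squares: expected to be (numerically) one
col_sq_sums = (X_check ** 2).sum(axis=0)

print(col_means.round(6))
print(col_sq_sums.round(6))
```

Because all columns share the same scale, the magnitudes of the fitted coefficients can be compared across features, and the L1/L2 penalties treat every feature on an equal footing.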


In [5]:
# plotting the (bmi, target) pair values
fig = px.scatter(x=features_df['bmi'], y=labels_df.values)
fig.show()

The scatter plot suggests a positive, roughly linear relationship between the `bmi` feature and the disease progression measure.
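
A visual impression can be backed up with a number. The sketch below computes the Pearson correlation coefficient between `bmi` and the target using `numpy.corrcoef`; it reloads the data so that it runs on its own, and the variable names (`bmi`, `target`, `r`) are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes(as_frame=True)
bmi = data.data["bmi"].to_numpy()
target = data.target.to_numpy()

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables
r = np.corrcoef(bmi, target)[0, 1]
print(f"Pearson r (bmi vs. target): {r:.3f}")
```

A value of `r` close to 1 would indicate a strong positive linear relationship, while a value near 0 would indicate almost none.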

## Modeling

Extracting features and labels as NumPy arrays

In [6]:
X, y = features_df.values, labels_df.values

Splitting the data into training and test sets

In [7]:
# test data amount (in terms of proportion)
TEST_PROP = 0.5
# Random number seed; important for experiment reproducibility
RANDOM_SEED = 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_PROP, random_state=RANDOM_SEED)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((221, 10), (221,), (221, 10), (221,))

Model Instantiation

In [20]:
# Create a linear regression object
regr = LinearRegression()

Model Training

In [21]:
# Train the model using the training set
regr.fit(X_train, y_train)

Making predictions and evaluating (on the training data)

- Just checking how well the model fits the training data.

In [22]:
# Making predictions (training dataset)
y_pred = regr.predict(X_train)
# Estimating the Mean Squared Error (MSE)
mse_train = mean_squared_error(y_train, y_pred)
print("Mean squared error (training data): %.2f" % mse_train)

Mean squared error (training data): 2725.99


Making predictions and evaluating (on the test data)

- This is the more interesting metric, as it is computed on data the model has not seen.

In [23]:
# Making predictions (test dataset)
y_pred = regr.predict(X_test)
# Estimating the Mean Squared Error (MSE)
mse_test = mean_squared_error(y_test, y_pred)
print("MSE (test data): %.2f" % mse_test)

MSE (test data): 3075.33
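
MSE is expressed in squared target units, which makes it hard to interpret on its own. As a minimal sketch, the test-set fit can also be summarized by the root mean squared error (RMSE, in the target's own units) and the coefficient of determination R². The block reloads and re-splits the data with the same settings as above so that it is self-contained; the names `X_tr`, `X_te`, `y_tr`, `y_te`, `model`, and `y_hat` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X_all, y_all = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

mse = mean_squared_error(y_te, y_hat)
rmse = np.sqrt(mse)          # same units as the target
r2 = r2_score(y_te, y_hat)   # 1.0 would be a perfect fit

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
```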


In [24]:
lreg_results = pd.DataFrame({
  'model': ['lr(without regularization)'],
  'train_err': [round(mse_train, 2)],
  'test_err': [round(mse_test, 2)]
})

Note(s):


*   The test error (3075.33) is about 13% higher than the training error (2725.99).
*   Can you explain why the test error is higher than the training error?
*   How can we ensure that our model performs similarly at test time?
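
A single train/test split gives only one estimate of the test error, and that estimate depends on the random seed. One common way to get a more stable picture is k-fold cross-validation. The sketch below is an illustration rather than part of the original experiment: it uses `cross_val_score` with 5 folds on the full dataset, and the names `X_all`, `y_all`, and `cv_mse` are introduced here.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X_all, y_all = load_diabetes(return_X_y=True)

# scikit-learn scorers follow "higher is better", so MSE is reported negated
neg_mse = cross_val_score(
    LinearRegression(), X_all, y_all,
    cv=5, scoring="neg_mean_squared_error")
cv_mse = -neg_mse

print("Per-fold MSE:", cv_mse.round(2))
print("Mean MSE: %.2f (std: %.2f)" % (cv_mse.mean(), cv_mse.std()))
```

The spread across folds indicates how much of the train/test gap seen above could be due to the particular split rather than to the model itself.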




## [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso): Linear Regression with L1 Regularizer

In [25]:
from sklearn.linear_model import Lasso
# Create a Lasso regression object
regr_lasso = Lasso(alpha=0.2)
# Train the model using the training set
regr_lasso.fit(X_train, y_train)

# Prediction and error estimation (training data)
y_pred = regr_lasso.predict(X_train)
mse_train = mean_squared_error(y_train, y_pred)
print("MSE (training data): %.2f" % mse_train)

# Prediction and error estimation (test data)
y_pred = regr_lasso.predict(X_test)
mse_test = mean_squared_error(y_test, y_pred)
print("MSE (test data): %.2f" % mse_test)

# Storing results in a dataframe
lasso_results = pd.DataFrame({
  'model': ['lasso'],
  'train_err': [round(mse_train, 2)],
  'test_err': [round(mse_test, 2)]
})

MSE (training data): 2859.02
MSE (test data): 3058.01
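
A characteristic property of the L1 penalty is that it can drive some coefficients exactly to zero, which acts as a form of feature selection. The sketch below counts the non-zero coefficients for a few values of `alpha`. The alpha grid and the helper `count_nonzero_coefs` are assumptions made for illustration, and the block re-creates the same split as above so that it runs standalone.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X_all, y_all = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.5, random_state=0)

def count_nonzero_coefs(alpha):
    """Fit a Lasso model and return how many coefficients are non-zero."""
    model = Lasso(alpha=alpha).fit(X_tr, y_tr)
    return int(np.count_nonzero(model.coef_))

# Illustrative grid (hypothetical choice); 0.2 is the value used above
nonzero_by_alpha = {a: count_nonzero_coefs(a) for a in (0.05, 0.2, 1.0, 10.0)}

for a, n_nonzero in nonzero_by_alpha.items():
    print(f"alpha={a:<5} non-zero coefficients: {n_nonzero} of {X_tr.shape[1]}")
```

With a large enough `alpha`, every coefficient is zeroed out and the model predicts only the training mean, so the choice of `alpha` trades off sparsity against fit.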


## [Ridge regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge): Linear Regression with L2 Regularizer

In [26]:
from sklearn.linear_model import Ridge

# Create a Ridge regression object
regr_ridge = Ridge(alpha=0.2)
# Train the model using the training set
regr_ridge.fit(X_train, y_train)

# Prediction and error estimation (training data)
y_pred = regr_ridge.predict(X_train)
mse_train = mean_squared_error(y_train, y_pred)
print("MSE (training data): %.2f" % mse_train)

# Prediction and error estimation (test data)
y_pred = regr_ridge.predict(X_test)
mse_test = mean_squared_error(y_test, y_pred)
print("MSE (test data): %.2f" % mse_test)

# Storing results in a dataframe
ridge_results = pd.DataFrame({
  'model': ['ridge'],
  'train_err': [round(mse_train, 2)],
  'test_err': [round(mse_test, 2)]
})

MSE (training data): 2900.54
MSE (test data): 2984.52
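
The value `alpha=0.2` used above is a fixed choice. In practice the regularization strength is usually tuned on the training data only. The following is a minimal sketch using `RidgeCV`, which by default selects `alpha` through an efficient leave-one-out cross-validation. The candidate grid `candidate_alphas` is a hypothetical choice, and the split is re-created so that the block is self-contained.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_all, y_all = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.5, random_state=0)

# Hypothetical grid: 20 values between 1e-3 and 10 on a log scale
candidate_alphas = np.logspace(-3, 1, 20)

# alpha is selected using the training data only; the test set stays untouched
ridge_cv = RidgeCV(alphas=candidate_alphas).fit(X_tr, y_tr)
best_alpha = ridge_cv.alpha_

test_mse = mean_squared_error(y_te, ridge_cv.predict(X_te))
print(f"Selected alpha: {best_alpha:.4f}")
print(f"MSE (test data): {test_mse:.2f}")
```

`LassoCV` offers a similar interface for the L1 case. Tuning on the training data keeps the test error an honest estimate of performance on unseen data.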


## Comparing model performances

In [27]:
results = pd.concat([lreg_results, lasso_results, ridge_results], axis=0)

fig = go.Figure([
    go.Bar(x=results.model, y=results.train_err, name='Training error'),
    go.Bar(x=results.model, y=results.test_err, name='Test error')
])
fig.update_layout(
    title="Model comparison", yaxis_title="MSE")
fig.update_layout(
    legend=dict(
        x=0.05,
        y=0.999
    )
)
fig.show()

## Questions for you



*   What differences did you notice in the performance of these three models?
*   Which is your preferred model and why?



