<a href="https://colab.research.google.com/github/mdkamrulhasan/data_mining_kdd/blob/main/notebooks/Linear_Regression_Regularizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

What we will cover today (using the scikit-learn package):


1.   Linear Regression:

 *   With and Without Regularization
 *   Controlling Overfitting

2.   Parametric models:

 *   Linear Regression (LR)


3.   Non-parametric models:

 *   k-NN



In [53]:
import numpy as np
import pandas as pd
# Models (Sklearn)
from sklearn.linear_model import LinearRegression
# from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
# from sklearn.svm import SVR
# Data and Evaluation packages
from sklearn import datasets
from sklearn.metrics import mean_squared_error
# visualization
import plotly.express as px

[Data description](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)

In [54]:
# Load the diabetes dataset as a Bunch; with as_frame=True, df.data is a pandas DataFrame
df = datasets.load_diabetes(as_frame=True)
df.keys()

dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename', 'data_module'])

In [55]:
df.data.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |


In [56]:
# Load the same dataset again, this time as NumPy arrays (features X, target y)
X, y = datasets.load_diabetes(return_X_y=True)
X.shape, y.shape

((442, 10), (442,))

In [57]:
# Scatter plot of the bmi feature against the target
fig = px.scatter(x=df.data.bmi, y=y)
fig.show()

Random train/test split (80% training, 20% test)

In [77]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((353, 10), (353,), (89, 10), (89,))

Training and testing an LR model

In [78]:
# Create linear regression object
regr = LinearRegression()
# Train the model using the training sets
regr.fit(X_train, y_train)

Regression model parameters

In [79]:
regr.coef_, regr.intercept_

(array([ -35.55025079, -243.16508959,  562.76234744,  305.46348218,
        -662.70290089,  324.20738537,   24.74879489,  170.3249615 ,
         731.63743545,   43.0309307 ]),
 152.5380470138517)
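The raw coefficient array is easier to read when each value is paired with its feature name. The following is a minimal sketch, assuming the same split as above (`test_size=0.2`, `random_state=0`); the names `feature_names` and `coef_table` are introduced here only for illustration.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Reload the data so this cell runs on its own
data = datasets.load_diabetes(as_frame=True)
feature_names = list(data.data.columns)
X, y = data.data.to_numpy(), data.target.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

regr = LinearRegression().fit(X_train, y_train)

# Label each coefficient with its feature name, largest magnitude first
coef_table = pd.Series(regr.coef_, index=feature_names)
order = coef_table.abs().sort_values(ascending=False).index
coef_table = coef_table.reindex(order)
print(coef_table)
```

Because the features in this dataset are already scaled to a common range, comparing coefficient magnitudes across features is reasonable here; with unscaled features it would not be.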

In [80]:
# Training error
y_pred = regr.predict(X_train)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_train, y_pred))

Mean squared error: 2734.75


In [81]:
# Test error
y_pred = regr.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Mean squared error: 3424.26
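The same fit, predict, and score steps are repeated for every model in this notebook. One way to avoid the repetition is a small helper; `train_test_mse` below is a hypothetical name, not part of scikit-learn, and the sketch assumes the same split as above.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_test_mse(model, X_train, y_train, X_test, y_test):
    """Fit `model` on the training set and return (training MSE, test MSE)."""
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    return train_mse, test_mse


X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

train_mse, test_mse = train_test_mse(
    LinearRegression(), X_train, y_train, X_test, y_test)
print("Training MSE: %.2f" % train_mse)
print("Test MSE:     %.2f" % test_mse)
```

Any scikit-learn regressor with `fit` and `predict` can be passed in place of `LinearRegression()`.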


Note: `LinearRegression` fits ordinary least squares and applies no regularizer.

[Read for explanation](https://www.blog.dailydoseofds.com/p/why-sklearns-linear-regression-implementation)
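For reference, the objective minimized by `LinearRegression` and the penalized objectives of the two regularized models used below can be written side by side. These follow the forms stated in the scikit-learn documentation, with $w$ the coefficient vector, $b$ the intercept, $n$ the number of training samples, and $\alpha$ the regularization strength (the `alpha` argument):

$$
\begin{aligned}
\text{LinearRegression:}\quad & \min_{w,\,b}\ \lVert y - Xw - b \rVert_2^2 \\
\text{Lasso (L1):}\quad & \min_{w,\,b}\ \frac{1}{2n}\lVert y - Xw - b \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Ridge (L2):}\quad & \min_{w,\,b}\ \lVert y - Xw - b \rVert_2^2 + \alpha \lVert w \rVert_2^2
\end{aligned}
$$

The intercept is not penalized in either case. Because Lasso scales the squared error by $1/(2n)$ and Ridge does not, the same numeric `alpha` does not correspond to the same penalty strength in the two models.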

[Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso): LR with an L1 regularizer

In [82]:
from sklearn.linear_model import Lasso

In [83]:
# Create a Lasso (L1-regularized linear regression) object
regr = Lasso(alpha=0.2)
# Train the model using the training sets
regr.fit(X_train, y_train)
print(regr.coef_)
print(regr.intercept_)

[  -0.          -90.56905001  546.29516411  196.85989775   -0.
  -19.04365168 -198.35291936    0.          469.7505873     0.        ]
152.20509031885527
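In the output above, four of the ten coefficients are exactly zero: the L1 penalty performs feature selection. The sketch below shows how the number of surviving coefficients changes with `alpha`. The grid of `alpha` values is an illustrative assumption, not a tuned choice, and the split is assumed to match the one above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Illustrative grid of penalty strengths (an assumption, not tuned values)
alphas = [0.01, 0.2, 1.0, 10.0]

nonzero_counts = {}
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    # Count the coefficients that were not shrunk exactly to zero
    nonzero_counts[alpha] = int(np.count_nonzero(model.coef_))
    print("alpha=%-5s non-zero coefficients: %d" % (alpha, nonzero_counts[alpha]))
```

Once `alpha` is large enough, every coefficient is driven to zero and the model predicts only the intercept.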


**Test its performance**

In [84]:
# Training error
y_pred = regr.predict(X_train)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_train, y_pred))

Mean squared error: 2874.69


In [85]:
# Test error
y_pred = regr.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Mean squared error: 3367.32


[Ridge regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge): LR with an L2 regularizer

In [86]:
from sklearn.linear_model import Ridge

In [87]:
# Create a Ridge (L2-regularized linear regression) object
regr = Ridge(alpha=0.2)
# Train the model using the training sets
regr.fit(X_train, y_train)
print(regr.coef_)
print(regr.intercept_)

[  -9.46415049 -178.12094763  478.82799055  260.12391918  -52.62391563
 -104.83705975 -205.44507946  123.05827106  410.17917771   82.93942307]
152.22696067209364


In [88]:
# Training error
y_pred = regr.predict(X_train)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_train, y_pred))

Mean squared error: 2812.94


In [89]:
# Test error
y_pred = regr.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Mean squared error: 3325.26
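Unlike Lasso, Ridge shrinks coefficients toward zero without setting them exactly to zero, as the coefficient output above shows. A minimal sketch of that shrinkage, assuming the same split and an illustrative grid of `alpha` values (an assumption, not tuned), tracks the L2 norm of the coefficient vector:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Illustrative grid of penalty strengths, in increasing order
alphas = [0.01, 0.2, 1.0, 10.0]

coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    # L2 norm of the fitted coefficient vector (intercept excluded)
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print("alpha=%-5s ||coef||_2 = %.2f" % (alpha, coef_norms[-1]))
```

The norm does not increase as `alpha` grows, which is the sense in which the penalty controls model complexity.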


k-NN Regressor

In [90]:
from sklearn.neighbors import KNeighborsRegressor
regr = KNeighborsRegressor(n_neighbors=2)

In [91]:
regr.fit(X_train, y_train)

In [92]:
# Training error
y_pred = regr.predict(X_train)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_train, y_pred))

Mean squared error: 1397.56


In [93]:
# Test error
y_pred = regr.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Mean squared error: 5627.81
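With `n_neighbors=2`, the training MSE (1397.56) is far below the test MSE (5627.81), which is a sign of overfitting. In k-NN, `n_neighbors` plays a role similar to `alpha` in Lasso and Ridge: larger values average over more neighbors and give a smoother predictor. The sketch below compares several values of k on the same split; the grid of k values is an illustrative assumption.

```python
from sklearn import datasets
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Illustrative grid of neighborhood sizes (an assumption, not tuned values)
ks = [1, 2, 5, 10, 20]

results = {}
for k in ks:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[k] = (train_mse, test_mse)
    print("k=%-2d  train MSE: %8.2f  test MSE: %8.2f" % (k, train_mse, test_mse))
```

Choosing k by looking at the test set leaks information into model selection; in practice a separate validation set or cross-validation would be used for that choice.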
