<a href="https://colab.research.google.com/github/mdkamrulhasan/data_mining_kdd/blob/main/notebooks/Linear_Regression_ensembles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

What will we cover today (using the scikit-learn package)?


1.   Linear Regression:

 *   With and Without Regularization
 *   Controlling Overfitting

2.   Parametric models:

 *   Linear Regression (LR)


3.   Non-parametric models:

 *   k-NN
 *   Decision Tree

4.   Ensemble Models

 *   Bagging (RandomForestRegressor)
 *   Boosting (GradientBoostingRegressor)


In [20]:
import numpy as np
import pandas as pd
# Models (Sklearn)
from sklearn.linear_model import LinearRegression
# from sklearn.svm import SVR
# Data and Evaluation packages
from sklearn import datasets
from sklearn.metrics import mean_squared_error
# visualization
import plotly.express as px

[Data description](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)

In [3]:
# Load the diabetes dataset (as_frame=True returns a Bunch whose fields are pandas objects)
df = datasets.load_diabetes(as_frame=True)
df.keys()

dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename', 'data_module'])

In [4]:
df.data.head(2)

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |


In [5]:
# Load the same dataset as NumPy arrays: features X and target y
X, y = datasets.load_diabetes(return_X_y=True)
X.shape, y.shape

((442, 10), (442,))

In [6]:
fig = px.scatter(x=df.data.bmi, y=y)
fig.show()

Random splitting

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((353, 10), (353,), (89, 10), (89,))
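
A single random split gives only one estimate of the test error, and that estimate can shift with `random_state`. As a rough sketch (the 5-fold setting and the `cv_mse` name are illustrative assumptions, not part of the original notebook), k-fold cross-validation averages the error over several splits:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = datasets.load_diabetes(return_X_y=True)

# 5 shuffled folds; the fold count is an illustrative choice
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn scorers follow "higher is better", so MSE comes back negated
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error")
cv_mse = -neg_mse

print("Per-fold MSE:", np.round(cv_mse, 2))
print("Mean MSE: %.2f (std %.2f)" % (cv_mse.mean(), cv_mse.std()))
```

The spread across folds gives a sense of how much a single train/test number should be trusted.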

# Linear Regression

In [8]:
# Create linear regression object
regr = LinearRegression()
# Train the model using the training sets
regr.fit(X_train, y_train)

Fitted model parameters (coefficients and intercept)

In [9]:
regr.coef_, regr.intercept_

(array([ -35.55025079, -243.16508959,  562.76234744,  305.46348218,
        -662.70290089,  324.20738537,   24.74879489,  170.3249615 ,
         731.63743545,   43.0309307 ]),
 152.5380470138517)

In [10]:
# Training error
y_pred = regr.predict(X_train)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_train, y_pred))

Mean squared error: 2734.75


In [11]:
# Test error
y_pred = regr.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Mean squared error: 3424.26
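
The outline mentions linear regression with and without regularization, but only the unregularized model is fitted above. The sketch below adds `Ridge` (L2 penalty) and `Lasso` (L1 penalty) for comparison. The `alpha=0.1` values are untuned assumptions chosen only for illustration, and `reg_results` is a hypothetical helper name.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# alpha controls the penalty strength; 0.1 is an untuned, illustrative value
models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=0.1),
    "lasso": Lasso(alpha=0.1),
}

reg_results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    reg_results[name] = {
        "train_mse": mean_squared_error(y_train, model.predict(X_train)),
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
        # size of the coefficient vector, which the penalty shrinks
        "coef_l2_norm": float(np.linalg.norm(model.coef_)),
    }
    print("%-5s train MSE: %.2f  test MSE: %.2f  ||coef||: %.1f" % (
        name,
        reg_results[name]["train_mse"],
        reg_results[name]["test_mse"],
        reg_results[name]["coef_l2_norm"]))
```

Regularization trades a slightly higher training error for smaller coefficients. Whether the test error improves depends on `alpha`, which would normally be selected by cross-validation.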


# k-NN Regressor

In [13]:
from sklearn.neighbors import KNeighborsRegressor

In [14]:
regr = KNeighborsRegressor(n_neighbors=20)
regr.fit(X_train, y_train)

In [15]:
# Training error
y_pred = regr.predict(X_train)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_train, y_pred))

Mean squared error: 2836.52


In [16]:
# Test error
y_pred = regr.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Mean squared error: 3465.10
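
The value `n_neighbors=20` above is one point on a bias/variance trade-off: small `k` follows the training data closely, large `k` averages over many neighbors. A minimal sketch of that effect, assuming an arbitrary grid of `k` values (the grid and the `knn_results` name are illustrative, not from the original notebook):

```python
from sklearn import datasets
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

knn_results = {}
for k in (1, 5, 20, 50):  # illustrative grid of neighborhood sizes
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, knn.predict(X_train))
    test_mse = mean_squared_error(y_test, knn.predict(X_test))
    knn_results[k] = (train_mse, test_mse)
    print("k=%2d  train MSE: %8.2f  test MSE: %8.2f" % (k, train_mse, test_mse))
```

With `k=1` each training point is its own nearest neighbor, so the training error is a poor guide to test performance.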


# Decision Tree Regressor

In [17]:
from sklearn.tree import DecisionTreeRegressor

In [24]:
regr = DecisionTreeRegressor()
regr.fit(X_train, y_train)

In [25]:
# Training error
y_pred = regr.predict(X_train)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_train, y_pred))

Mean squared error: 0.00


In [26]:
# Test error
y_pred = regr.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Mean squared error: 8054.88
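
A training MSE of 0.00 next to a test MSE above 8000 is a typical sign of overfitting: an unconstrained tree keeps splitting until it memorizes the training set. One common way to control this is to cap the tree depth. The sketch below assumes a few illustrative `max_depth` values and fixes `random_state=0`, so its numbers may differ from the unseeded run above.

```python
from sklearn import datasets
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

tree_results = {}
for depth in (2, 3, 5, None):  # None means "grow until leaves are pure"
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    tree_results[depth] = (train_mse, test_mse)
    print("max_depth=%-4s  train MSE: %8.2f  test MSE: %8.2f" % (
        depth, train_mse, test_mse))
```

Other parameters such as `min_samples_leaf` limit tree growth in a similar way.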


# Ensemble methods

*   Bagging (Random Forest)
*   Boosting (Gradient Boosting)



In [21]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

**Random Forest**

In [27]:
regr = RandomForestRegressor()
regr.fit(X_train, y_train)

In [28]:
# Training error
y_pred = regr.predict(X_train)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_train, y_pred))

Mean squared error: 457.65


In [29]:
# Test error
y_pred = regr.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Mean squared error: 3707.44
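
To make the bagging idea concrete, the sketch below rebuilds the forest's prediction by averaging the predictions of its individual trees, which are exposed through `estimators_`. The `n_estimators=100` and `random_state=0` settings are assumptions made for reproducibility, so the numbers may differ from the unseeded run above.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# One row of test predictions per fitted tree
per_tree = np.stack([tree.predict(X_test) for tree in rf.estimators_])
manual_avg = per_tree.mean(axis=0)

single_tree_mse = [mean_squared_error(y_test, p) for p in per_tree]
forest_mse = mean_squared_error(y_test, rf.predict(X_test))

print("Mean MSE of individual trees: %.2f" % np.mean(single_tree_mse))
print("MSE of the averaged forest:   %.2f" % forest_mse)
print("Manual average matches rf.predict:",
      np.allclose(manual_avg, rf.predict(X_test)))
```

Because squared error is convex, the error of the averaged prediction cannot exceed the average error of the individual trees. This is the variance reduction that bagging relies on.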


**Gradient Boosting**

In [30]:
regr = GradientBoostingRegressor()
regr.fit(X_train, y_train)

In [31]:
# Training error
y_pred = regr.predict(X_train)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_train, y_pred))

Mean squared error: 871.46


In [32]:
# Test error
y_pred = regr.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Mean squared error: 4047.62
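
Boosting adds trees one after another, so the error can be tracked as the ensemble grows. The sketch below uses `staged_predict` to compute the train and test MSE after each boosting stage. The `n_estimators=100` and `random_state=0` settings and the `best_stage` name are illustrative assumptions, and the numbers may differ from the unseeded run above.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)
gbr.fit(X_train, y_train)

# staged_predict yields the ensemble prediction after 1, 2, ..., n trees
train_curve = [mean_squared_error(y_train, p)
               for p in gbr.staged_predict(X_train)]
test_curve = [mean_squared_error(y_test, p)
              for p in gbr.staged_predict(X_test)]

best_stage = int(np.argmin(test_curve)) + 1  # 1-based number of trees
print("Train MSE after 1 / 100 trees: %.2f / %.2f" % (
    train_curve[0], train_curve[-1]))
print("Lowest test MSE: %.2f at %d trees" % (min(test_curve), best_stage))
```

If the lowest test error occurs well before the last stage, the later trees are mostly fitting noise. Choosing the stage from the test set is optimistic, though, so a separate validation split would be the safer choice in practice.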
