# Regression on the Diabetes dataset

More information on the Diabetes dataset is available in the scikit-learn documentation:

URL: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes

**Features**:
- *age*: age in years
- *sex*: gender
- *bmi*: body mass index
- *bp*: average blood pressure
- *s1*: tc, total serum cholesterol
- *s2*: ldl, low-density lipoproteins
- *s3*: hdl, high-density lipoproteins
- *s4*: tch, total cholesterol / HDL
- *s5*: ltg, possibly log of serum triglycerides level
- *s6*: glu, blood sugar level

**Target**: The quantitative measure of disease progression one year after baseline

**Aim**: Train a linear regression model to predict the targets

In [1]:
# Imports
import numpy as np 
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [2]:
# Load data
data = load_diabetes(scaled=True)

print(data)

{'data': array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990749, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06833155, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286131, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04688253,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452873, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00422151,  0.00306441]]), 'target': array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
  

In [3]:
# Extract feature names, features and targets
feature_names = data["feature_names"]
features = data["data"]
targets = data["target"]

print("Feature names:\n", feature_names, "\n")
print("Features:\n", features, "\n")
print("Targets:\n", targets, "\n")
print("Description:\n", data["DESCR"])

Feature names:
 ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'] 

Features:
 [[ 0.03807591  0.05068012  0.06169621 ... -0.00259226  0.01990749
  -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 ... -0.03949338 -0.06833155
  -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 ... -0.00259226  0.00286131
  -0.02593034]
 ...
 [ 0.04170844  0.05068012 -0.01590626 ... -0.01107952 -0.04688253
   0.01549073]
 [-0.04547248 -0.04464164  0.03906215 ...  0.02655962  0.04452873
  -0.02593034]
 [-0.04547248 -0.04464164 -0.0730303  ... -0.03949338 -0.00422151
   0.00306441]] 

Targets:
 [151.  75. 141. 206. 135.  97. 138.  63. 110. 310. 101.  69. 179. 185.
 118. 171. 166. 144.  97. 168.  68.  49.  68. 245. 184. 202. 137.  85.
 131. 283. 129.  59. 341.  87.  65. 102. 265. 276. 252.  90. 100.  55.
  61.  92. 259.  53. 190. 142.  75. 142. 155. 225.  59. 104. 182. 128.
  52.  37. 170. 170.  61. 144.  52. 128.  71. 163. 150.  97. 160. 178.
  48. 270. 202. 111.  85.  42. 170. 200. 252. 11

In [4]:
# Check dimensions of features and targets
print("Features shape:", features.shape)
print("Targets shape:", targets.shape)

Features shape: (442, 10)
Targets shape: (442,)


In [5]:
# Split into train and test
train_features, test_features, train_targets, test_targets = train_test_split(features, targets, test_size=0.1)

print("Train features shape:", train_features.shape)
print("Test features shape:", test_features.shape)
print("Train targets shape:", train_targets.shape)
print("Test targets shape:", test_targets.shape)

Train features shape: (397, 10)
Test features shape: (45, 10)
Train targets shape: (397,)
Test targets shape: (45,)
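
The split above is drawn at random, so the scores in the following cells change from run to run. A minimal sketch of how the split could be made reproducible, assuming a fixed `random_state` (the value `0` is an arbitrary choice for illustration, and the random arrays are only stand-ins with the same shapes as the Diabetes features and targets):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shapes as the Diabetes features and targets
rng = np.random.default_rng(0)
features = rng.normal(size=(442, 10))
targets = rng.normal(size=442)

# random_state=0 is an arbitrary seed chosen for this illustration
split_a = train_test_split(features, targets, test_size=0.1, random_state=0)
split_b = train_test_split(features, targets, test_size=0.1, random_state=0)

# With the same seed, both calls return exactly the same partition
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same_split)
print("Test features shape:", split_a[1].shape)
```

Fixing the seed makes comparisons between models easier, since every model is then evaluated on the same held-out rows.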


In [6]:
# Train the regression model!
reg = LinearRegression().fit(train_features, train_targets)

# Compute the R^2 score (coefficient of determination) on the train and test sets
train_score = reg.score(train_features, train_targets)
test_score = reg.score(test_features, test_targets)

print("Training score:", train_score)
print("Test score:", test_score)

Training score: 0.5222484421648363
Test score: 0.4615424621693871
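
The value returned by `score` is the coefficient of determination, `R^2 = 1 - SS_res / SS_tot`, where `SS_res` is the sum of squared residuals and `SS_tot` is the sum of squared deviations of the targets from their mean. The sketch below recomputes it by hand on a tiny made-up example; `r2_score_manual` is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared residuals
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2_score_manual(y_true, y_true)                       # exact predictions
baseline = r2_score_manual(y_true, np.full(4, y_true.mean()))   # always predict the mean
worse = r2_score_manual(y_true, y_true[::-1])                   # worse than the mean

print(perfect, baseline, worse)
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean target, and negative values mean the model does worse than that constant baseline.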


In [7]:
# Compute the mean squared error (MSE)
train_predictions = reg.predict(train_features)
test_predictions = reg.predict(test_features)

train_error = np.mean((train_predictions - train_targets)**2)
test_error = np.mean((test_predictions - test_targets)**2)

print("Train error:", train_error)
print("Test error:", test_error)

Train error: 2859.9224931723434
Test error: 2922.3311081819775
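
The quantity computed above is the mean of the squared residuals, so it is expressed in squared target units. Taking the square root gives the root mean squared error (RMSE), which is on the same scale as the targets. A small sketch with made-up predictions, followed by the square root of the test error reported above:

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([151.0, 75.0, 141.0])
y_pred = np.array([150.0, 80.0, 131.0])

mse = np.mean((y_pred - y_true) ** 2)  # mean squared error, squared units
rmse = np.sqrt(mse)                    # root mean squared error, target units
print("MSE:", mse)
print("RMSE:", rmse)

# Applying the same conversion to the test MSE printed above
reported_test_mse = 2922.3311081819775
reported_test_rmse = np.sqrt(reported_test_mse)
print("Reported test RMSE:", reported_test_rmse)
```

On this reading, the linear model's test predictions are off by roughly 54 units of the disease progression measure.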


**By regression standards, these errors are not very encouraging.** Let's try a different approach.

## Second order polynomial expansion of data

In [8]:
from sklearn.preprocessing import PolynomialFeatures

# Compute second order polynomial features.
# Fit the expansion on the training set, then apply the same transform to the test set.
poly = PolynomialFeatures(degree=2, interaction_only=False)
train_features = poly.fit_transform(train_features)
test_features = poly.transform(test_features)

print("Train features shape:", train_features.shape)
print("Test features shape:", test_features.shape)

Train features shape: (397, 66)
Test features shape: (45, 66)
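
The 66 columns follow from counting all monomials of degree at most 2 in 10 variables: 1 bias term, 10 linear terms, 10 squares and 45 pairwise interactions. The sketch below checks that count in two ways using only the standard library; the variable names are illustrative.

```python
from itertools import combinations_with_replacement
from math import comb

n_features = 10
degree = 2

# Closed form: number of monomials of degree <= d in n variables
closed_form = comb(n_features + degree, degree)

# Explicit enumeration: one multiset of feature indices per monomial
enumerated = sum(
    1
    for d in range(degree + 1)
    for _ in combinations_with_replacement(range(n_features), d)
)

# Breakdown by term type
bias, linear, squares = 1, n_features, n_features
interactions = comb(n_features, 2)
breakdown = bias + linear + squares + interactions

print(closed_form, enumerated, breakdown)  # prints "66 66 66"
```

With 397 training rows and 66 columns, the expanded model has far more free parameters per sample than the original 10-feature model.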


In [9]:
# Fit regression model again
reg = LinearRegression().fit(train_features, train_targets)

train_score = reg.score(train_features, train_targets)
test_score = reg.score(test_features, test_targets)

print(train_score, test_score)

0.054528104155363466 -1.0027990559532762
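
The expansion and the regression can also be chained, so that the polynomial transform is fitted on the training data only and the original feature arrays are not overwritten. This is a minimal sketch assuming scikit-learn's `make_pipeline`. Both `random_state=0` and `include_bias=False` are choices made for this illustration rather than settings from the cells above, so its scores will differ from the ones printed there.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0  # arbitrary seed for illustration
)

# include_bias=False: LinearRegression already fits an intercept,
# so the constant column is left out here (an assumption of this sketch)
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X_train, y_train)

n_expanded = model.named_steps["polynomialfeatures"].n_output_features_
print("Expanded features:", n_expanded)
print("Training score:", model.score(X_train, y_train))
print("Test score:", model.score(X_test, y_test))
```

Because the pipeline applies the transform internally, `model.score` and `model.predict` can be called directly on the raw 10-column features.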


In [10]:
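# Root mean squared error (RMSE) of the polynomial model.
# Recompute predictions with the refitted model before measuring the error.
train_predictions = reg.predict(train_features)
test_predictions = reg.predict(test_features)

train_error = np.sqrt(np.mean((train_predictions - train_targets)**2))
test_error = np.sqrt(np.mean((test_predictions - test_targets)**2))

print("Train error:", train_error)
print("Test error:", test_error)

Train error: 53.478243175821916
Test error: 54.05858958742799


**No better results than last time.** Note that the errors printed above are exactly the square roots of the errors from cell 7: that run reused the first model's predictions, so they describe the plain linear model and should be regenerated by re-running the cell. The scores from cell 9 (about 0.05 on the training set and -1.00 on the test set, against 0.52 and 0.46 for the first model) already show that the second order expansion did not help. Perhaps the data is not linear in nature...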