# An exploration into sklearn’s Linnerud dataset

## Data description
The description of the Linnerud dataset reads as follows:

“The Linnerud dataset is a multi-output regression dataset. It consists of three exercise (data) and three physiological (target) variables collected from twenty middle-aged men in a fitness club:

physiological — CSV containing 20 observations on 3 physiological variables: Weight, Waist and Pulse.
exercise — CSV containing 20 observations on 3 exercise variables: Chins, Situps and Jumps.”
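
The quoted text ships with the dataset itself. As a minimal sketch, assuming scikit-learn is installed, it can be printed from the object returned by `load_linnerud()`, together with the names sklearn assigns to the feature and target columns. The variable name `linnerud` is only illustrative.

```python
from sklearn.datasets import load_linnerud

# Load the full Bunch object rather than just the (X, y) arrays
linnerud = load_linnerud()

# Dataset description, as quoted above
print(linnerud.DESCR)

# Column names for the data (exercise) and target (physiological) arrays
print(linnerud.feature_names)
print(linnerud.target_names)
print(linnerud.data.shape, linnerud.target.shape)
```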

## Importing the necessary libraries
We first import the libraries we need at the outset: pandas, numpy, matplotlib and seaborn.

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

We then load the dataset from sklearn. It arrives as an X array with three feature columns and a y array with three target columns:

In [2]:
from sklearn.datasets import load_linnerud

X, y = load_linnerud(return_X_y=True)
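
As an alternative to working with bare arrays, `load_linnerud()` accepts `as_frame=True`, which returns pandas DataFrames that already carry column names. The sketch below is one possible way to use it; the names `data` and `target` are illustrative and not part of the walkthrough that follows.

```python
from sklearn.datasets import load_linnerud

# as_frame=True returns DataFrames with named columns instead of ndarrays
data, target = load_linnerud(return_X_y=True, as_frame=True)

print(list(data.columns))    # names of the feature (exercise) columns
print(list(target.columns))  # names of the target (physiological) columns
print(data.shape, target.shape)
```

Checking these names is a quick way to confirm which variables are features and which are targets before labelling anything by hand.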

In [3]:
X

array([[  5., 162.,  60.],
       [  2., 110.,  60.],
       [ 12., 101., 101.],
       [ 12., 105.,  37.],
       [ 13., 155.,  58.],
       [  4., 101.,  42.],
       [  8., 101.,  38.],
       [  6., 125.,  40.],
       [ 15., 200.,  40.],
       [ 17., 251., 250.],
       [ 17., 120.,  38.],
       [ 13., 210., 115.],
       [ 14., 215., 105.],
       [  1.,  50.,  50.],
       [  6.,  70.,  31.],
       [ 12., 210., 120.],
       [  4.,  60.,  25.],
       [ 11., 230.,  80.],
       [ 15., 225.,  73.],
       [  2., 110.,  43.]])

In [4]:
y

array([[191.,  36.,  50.],
       [189.,  37.,  52.],
       [193.,  38.,  58.],
       [162.,  35.,  62.],
       [189.,  35.,  46.],
       [182.,  36.,  56.],
       [211.,  38.,  56.],
       [167.,  34.,  60.],
       [176.,  31.,  74.],
       [154.,  33.,  56.],
       [169.,  34.,  50.],
       [166.,  33.,  52.],
       [154.,  34.,  64.],
       [247.,  46.,  50.],
       [193.,  36.,  46.],
       [202.,  37.,  62.],
       [176.,  37.,  54.],
       [157.,  32.,  52.],
       [156.,  33.,  54.],
       [138.,  33.,  68.]])

With the dataset loaded, we create dataframes because the X and y arrays do not carry column names. In sklearn's Linnerud dataset, X holds the exercise variables (chins, sit-ups, jumps) and y holds the physiological variables (weight, waist, pulse):

In [5]:
df1 = pd.DataFrame(X, columns=["chins", "sit_ups", "jumps"])
print(df1)

    chins  sit_ups  jumps
0     5.0    162.0   60.0
1     2.0    110.0   60.0
2    12.0    101.0  101.0
3    12.0    105.0   37.0
4    13.0    155.0   58.0
5     4.0    101.0   42.0
6     8.0    101.0   38.0
7     6.0    125.0   40.0
8    15.0    200.0   40.0
9    17.0    251.0  250.0
10   17.0    120.0   38.0
11   13.0    210.0  115.0
12   14.0    215.0  105.0
13    1.0     50.0   50.0
14    6.0     70.0   31.0
15   12.0    210.0  120.0
16    4.0     60.0   25.0
17   11.0    230.0   80.0
18   15.0    225.0   73.0
19    2.0    110.0   43.0

In [6]:
df2 = pd.DataFrame(y, columns=["weight", "waist", "pulse"])
print(df2)

    weight  waist  pulse
0    191.0   36.0   50.0
1    189.0   37.0   52.0
2    193.0   38.0   58.0
3    162.0   35.0   62.0
4    189.0   35.0   46.0
5    182.0   36.0   56.0
6    211.0   38.0   56.0
7    167.0   34.0   60.0
8    176.0   31.0   74.0
9    154.0   33.0   56.0
10   169.0   34.0   50.0
11   166.0   33.0   52.0
12   154.0   34.0   64.0
13   247.0   46.0   50.0
14   193.0   36.0   46.0
15   202.0   37.0   62.0
16   176.0   37.0   54.0
17   157.0   32.0   52.0
18   156.0   33.0   54.0
19   138.0   33.0   68.0

With the X and y dataframes created, we merge them on their index to form a single dataframe:

In [7]:
df = pd.merge(df1, df2, left_index=True, right_index=True)
df

   chins  sit_ups  jumps  weight  waist  pulse
0    5.0    162.0   60.0   191.0   36.0   50.0
1    2.0    110.0   60.0   189.0   37.0   52.0
2   12.0    101.0  101.0   193.0   38.0   58.0
3   12.0    105.0   37.0   162.0   35.0   62.0
4   13.0    155.0   58.0   189.0   35.0   46.0
5    4.0    101.0   42.0   182.0   36.0   56.0
6    8.0    101.0   38.0   211.0   38.0   56.0
7    6.0    125.0   40.0   167.0   34.0   60.0
8   15.0    200.0   40.0   176.0   31.0   74.0
9   17.0    251.0  250.0   154.0   33.0   56.0

We then redefine the X and y variables as named dataframes taken from the merged dataframe:

In [8]:
y = df[['weight', 'waist', 'pulse']]
X = df[['chins', 'sit_ups', 'jumps']]

In [9]:
X

   chins  sit_ups  jumps
0    5.0    162.0   60.0
1    2.0    110.0   60.0
2   12.0    101.0  101.0
3   12.0    105.0   37.0
4   13.0    155.0   58.0
5    4.0    101.0   42.0
6    8.0    101.0   38.0
7    6.0    125.0   40.0
8   15.0    200.0   40.0
9   17.0    251.0  250.0

In [10]:
y

   weight  waist  pulse
0   191.0   36.0   50.0
1   189.0   37.0   52.0
2   193.0   38.0   58.0
3   162.0   35.0   62.0
4   189.0   35.0   46.0
5   182.0   36.0   56.0
6   211.0   38.0   56.0
7   167.0   34.0   60.0
8   176.0   31.0   74.0
9   154.0   33.0   56.0

We can then normalize X with min-max scaling, which maps each column to the range [0, 1]:

In [12]:
X = (X - X.min()) / (X.max() - X.min())
X

    chins   sit_ups     jumps
0    0.25  0.557214  0.155556
1  0.0625  0.298507  0.155556
2  0.6875  0.253731  0.337778
3  0.6875  0.273632  0.053333
4    0.75  0.522388  0.146667
5  0.1875  0.253731  0.075556
6  0.4375  0.253731  0.057778
7  0.3125  0.373134  0.066667
8   0.875  0.746269  0.066667
9     1.0       1.0       1.0
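
The min-max formula used above is the same transformation that sklearn's `MinMaxScaler` applies with its default settings. The sketch below works on the raw arrays so that it is self-contained; the names `X_raw`, `manual` and `scaled` are illustrative. It computes the scaling both ways and checks that the results agree.

```python
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.preprocessing import MinMaxScaler

X_raw, _ = load_linnerud(return_X_y=True)

# Manual min-max scaling, column by column
col_min = X_raw.min(axis=0)
col_max = X_raw.max(axis=0)
manual = (X_raw - col_min) / (col_max - col_min)

# The same transformation via sklearn (default feature_range is (0, 1))
scaled = MinMaxScaler().fit_transform(X_raw)

print(np.allclose(manual, scaled))
```

Using the scaler object has the practical advantage that the fitted minimum and maximum can be reused later on new data.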


With X pre-processed, we split X and y into training and validation sets using sklearn’s train_test_split():

In [13]:
from sklearn.model_selection import train_test_split

# Split into validation and training data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1)
X_train.shape, X_val.shape, y_train.shape,y_val.shape

((16, 3), (4, 3), (16, 3), (4, 3))
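
Because the normalization above was computed on all 20 rows, the validation rows influenced the minimum and maximum used for scaling. A common alternative, sketched here as one option rather than as the method this walkthrough follows, is to split first and fit the scaler on the training rows only. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_raw, y_raw = load_linnerud(return_X_y=True)

# Split the unscaled data first, with the same settings as above
X_tr, X_va, y_tr, y_va = train_test_split(
    X_raw, y_raw, test_size=0.20, random_state=1
)

# Learn min and max from the training rows only
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)

# Apply the training-set scaling to the validation rows;
# values here may fall outside [0, 1]
X_va_scaled = scaler.transform(X_va)

print(X_tr_scaled.shape, X_va_scaled.shape)
print(X_va_scaled.min(axis=0), X_va_scaled.max(axis=0))
```

With only 20 observations the difference may be small, but the pattern keeps the validation set untouched until evaluation.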

We then define the model. Because the target has multiple columns, we wrap the estimator in sklearn’s MultiOutputRegressor(), which fits one regressor per target column.

As the estimator we use sklearn’s Ridge, a linear regression with L2 regularization. The score printed below is the R² on the training data (about 0.25); the mean squared error on the validation set, calculated further down, is 131.64:

In [14]:
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge

model = MultiOutputRegressor(Ridge(random_state=1)).fit(X_train, y_train)
print(model.score(X_train, y_train))

0.2508382384061478
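
The value printed by `model.score()` is the coefficient of determination (R²), averaged over the three targets, and here it is measured on the training rows. To see how the model generalizes, the same method can be called on the validation split. The sketch below rebuilds the steps above from the raw arrays so that it runs on its own; the variable names are illustrative, and no particular output is claimed.

```python
from sklearn.datasets import load_linnerud
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

X_raw, y_raw = load_linnerud(return_X_y=True)

# Min-max scale the features, as in the walkthrough
X_scaled = (X_raw - X_raw.min(axis=0)) / (X_raw.max(axis=0) - X_raw.min(axis=0))

X_tr, X_va, y_tr, y_va = train_test_split(
    X_scaled, y_raw, test_size=0.20, random_state=1
)

model = MultiOutputRegressor(Ridge(random_state=1)).fit(X_tr, y_tr)

# R^2 on the rows the model was fitted on
train_r2 = model.score(X_tr, y_tr)
# R^2 on held-out rows; this can be negative if the model does worse
# than predicting the mean of each target
val_r2 = model.score(X_va, y_va)

print(train_r2, val_r2)
```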


We then predict on the validation set:

In [15]:
y_pred = model.predict(X_val)
y_pred

array([[183.70305459,  35.65171046,  56.09856434],
       [192.71099243,  37.45868631,  54.52505193],
       [185.94731813,  36.25224697,  55.53900758],
       [179.68255327,  34.73016568,  56.93174121]])
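
Each column of the prediction array comes from a separate Ridge model fitted by the wrapper. Ridge also accepts a two-dimensional target directly, and with the same regularization strength for every target the two approaches should agree up to floating-point error. The following sketch, with illustrative variable names, compares them.

```python
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

X_raw, y_raw = load_linnerud(return_X_y=True)
X_scaled = (X_raw - X_raw.min(axis=0)) / (X_raw.max(axis=0) - X_raw.min(axis=0))
X_tr, X_va, y_tr, y_va = train_test_split(
    X_scaled, y_raw, test_size=0.20, random_state=1
)

# One Ridge model per target column, via the wrapper
wrapped = MultiOutputRegressor(Ridge(random_state=1)).fit(X_tr, y_tr)
# A single Ridge model fitted on the 2-D target
native = Ridge(random_state=1).fit(X_tr, y_tr)

pred_wrapped = wrapped.predict(X_va)
pred_native = native.predict(X_va)

print(pred_wrapped.shape, pred_native.shape)
print(np.allclose(pred_wrapped, pred_native))
```

The wrapper becomes necessary for estimators that do not support multiple targets on their own.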

Let's calculate the mean squared error on the validation set:

In [16]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_val, y_pred)
mse

131.6350612207492
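
The single figure above is an average over three target columns that are measured on different scales, so the column with the largest values can dominate it. One way to look closer, sketched here with illustrative variable names, is to request the error per target with `multioutput="raw_values"` and take the square root to get an error in the original units.

```python
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

X_raw, y_raw = load_linnerud(return_X_y=True)
X_scaled = (X_raw - X_raw.min(axis=0)) / (X_raw.max(axis=0) - X_raw.min(axis=0))
X_tr, X_va, y_tr, y_va = train_test_split(
    X_scaled, y_raw, test_size=0.20, random_state=1
)
model = MultiOutputRegressor(Ridge(random_state=1)).fit(X_tr, y_tr)
y_hat = model.predict(X_va)

# One MSE value per target column
mse_per_target = mean_squared_error(y_va, y_hat, multioutput="raw_values")
# Root mean squared error, in the same units as each target
rmse_per_target = np.sqrt(mse_per_target)
# Default behaviour: uniform average of the per-target values
mse_overall = mean_squared_error(y_va, y_hat)

print(mse_per_target)
print(rmse_per_target)
print(mse_overall)
```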

Now we plot the predicted values against the actual values for the first target column, together with a diagonal reference line:

In [None]:
plt.figure(figsize=(10,10))
# Actual vs. predicted values for the first target column
plt.scatter(y_val.iloc[:,0], y_pred[:,0], c='crimson')
plt.yscale('log')
plt.xscale('log')

# Diagonal reference line: points on it are perfect predictions
p1 = max(max(y_pred[:,0]), max(y_val.iloc[:,0]))
p2 = min(min(y_pred[:,0]), min(y_val.iloc[:,0]))
plt.plot([p1, p2], [p1, p2], 'b-')
plt.xlabel('Actual Values', fontsize=15)
plt.ylabel('Predictions', fontsize=15)
plt.axis('equal')
plt.show()