# Linear Regression with Scikit-learn

This wholly uninteresting notebook is here simply to demonstrate how to perform a linear regression analysis with [Scikit-learn](http://scikit-learn.org/stable/index.html) and create associated plots. That is its only goal in life. I will use the built-in Linnerud dataset, which consists of exercise variables (the predictors) and physiological variables (to be predicted) for 20 people. I will predict the `Weight` variable from the three predictors. Plots are made using [Bokeh](http://bokeh.pydata.org/en/latest/).

## Setup

In [1]:
import bokeh.io as bkio
import bokeh.layouts as bklay
import bokeh.models as bkmod
import bokeh.plotting as bkplt

import bkcharts
import numpy as np
import pandas as pd
import sklearn.linear_model
import sklearn.datasets

bkio.output_notebook()

Here I read in the data and display the first few rows so we can get a feel for what it looks like. Scikit-learn ships its datasets pre-split into feature and target arrays, but having everything in one dataframe will make life easier when we go to create plots. That said, the designated predictor variables are:
- Chins
- Situps
- Jumps

The response variables are:
- Weight
- Waist
- Pulse

I will only predict `Weight` in this example.

In [2]:
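linnerud = sklearn.datasets.load_linnerud()
# np.int was removed in NumPy 1.24; the builtin int is the equivalent dtype
features = pd.DataFrame(linnerud['data'], columns=linnerud['feature_names'], dtype=int)
targets = pd.DataFrame(linnerud['target'], columns=linnerud['target_names'], dtype=int)
# Keep predictors and responses in one frame to simplify plotting
dat = pd.concat([features, targets], axis=1)
del linnerud, features, targets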

dat.head()

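|   | Chins | Situps | Jumps | Weight | Waist | Pulse |
|---|-------|--------|-------|--------|-------|-------|
| 0 | 5     | 162    | 60    | 191    | 36    | 50    |
| 1 | 2     | 110    | 60    | 189    | 37    | 52    |
| 2 | 12    | 101    | 101   | 193    | 38    | 58    |
| 3 | 12    | 105    | 37    | 162    | 35    | 62    |
| 4 | 13    | 155    | 58    | 189    | 35    | 46    |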
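
As an alternative to assembling the dataframe by hand, scikit-learn releases from 0.23 onward (as far as I know) accept `as_frame=True` in `load_linnerud`, in which case the returned object's `frame` attribute holds predictors and responses together. A minimal sketch; the name `dat_alt` is mine, and the values come back as floats rather than ints:

```python
import sklearn.datasets

# as_frame=True asks scikit-learn to return pandas objects instead of arrays
linnerud = sklearn.datasets.load_linnerud(as_frame=True)

# .frame combines the three exercise columns with the three physiological columns
dat_alt = linnerud.frame  # hypothetical name, used only in this sketch

print(dat_alt.columns.tolist())
print(dat_alt.shape)
```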


## Preliminary Plots

The relationship between `Weight` and `Chins` appears to be relatively linear, as does its relationship with `Situps`. Its relationship with `Jumps` is less clear.

In [3]:
p1 = bkcharts.Scatter(dat, x='Chins', y='Weight', width=300, height=300)
p2 = bkcharts.Scatter(dat, x='Situps', y='Weight', width=300, height=300)
p3 = bkcharts.Scatter(dat, x='Jumps', y='Weight', width=300, height=300)

grid = bklay.row([p1, p2, p3])
bkplt.show(grid)
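
The `bkcharts` package is no longer maintained, so the cell above may not run against a current Bokeh install. A rough equivalent using only `bokeh.plotting` could look like the sketch below. It assumes a recent Bokeh (3.x, where `figure` takes `width` and `height`), reloads the data so that it runs on its own, and uses a helper `scatter_vs_weight` that is my own invention, not part of either library:

```python
import bokeh.layouts as bklay
import bokeh.models as bkmod
import bokeh.plotting as bkplt
import pandas as pd
import sklearn.datasets

# Rebuild the combined dataframe so this sketch is self-contained
linnerud = sklearn.datasets.load_linnerud()
dat = pd.concat(
    [
        pd.DataFrame(linnerud['data'], columns=linnerud['feature_names']),
        pd.DataFrame(linnerud['target'], columns=linnerud['target_names']),
    ],
    axis=1,
)
source = bkmod.ColumnDataSource(dat)


def scatter_vs_weight(x_name):
    """Hypothetical helper: scatter one predictor against Weight."""
    p = bkplt.figure(width=300, height=300,
                     x_axis_label=x_name, y_axis_label='Weight')
    p.scatter(x=x_name, y='Weight', source=source)
    return p


plots = [scatter_vs_weight(name) for name in ['Chins', 'Situps', 'Jumps']]
grid = bklay.row(plots)
# In a notebook, call bkplt.show(grid) after bkio.output_notebook()
```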

From the plots, we can see that there is one particularly light person, one particularly heavy person, and one particularly jumpy person. I now remove these three potential outliers and re-plot the data.

In [4]:
dat2 = dat.loc[(dat['Jumps'] < 240) & (dat['Weight'] > 140) & (dat['Weight'] < 240)]
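
To check which rows the filter above discards, the same condition can be stored as a mask and negated. This sketch reloads the data so that it runs on its own; `keep` and `dropped` are names introduced here for illustration:

```python
import pandas as pd
import sklearn.datasets

# Rebuild the combined dataframe so this sketch is self-contained
linnerud = sklearn.datasets.load_linnerud()
dat = pd.concat(
    [
        pd.DataFrame(linnerud['data'], columns=linnerud['feature_names']),
        pd.DataFrame(linnerud['target'], columns=linnerud['target_names']),
    ],
    axis=1,
)

# Same condition as the filter above, stored as a boolean mask
keep = (dat['Jumps'] < 240) & (dat['Weight'] > 140) & (dat['Weight'] < 240)
dropped = dat.loc[~keep]

print(dropped)
print(len(dat), 'rows before,', int(keep.sum()), 'rows after')
```

With the three outliers removed, 17 of the 20 rows should remain, which matters later when choosing the number of cross-validation folds.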

In [5]:
p1 = bkcharts.Scatter(dat2, x='Chins', y='Weight', width=300, height=300)
p2 = bkcharts.Scatter(dat2, x='Situps', y='Weight', width=300, height=300)
p3 = bkcharts.Scatter(dat2, x='Jumps', y='Weight', width=300, height=300)

grid = bklay.row([p1, p2, p3])
bkplt.show(grid)

The relationships are still not clearly linear, but whatever. Let's do some regression anyway and see how it goes.

## Multiple Regression

In [6]:
reg = sklearn.linear_model.LinearRegression()
reg.fit(dat2[['Chins', 'Situps', 'Jumps']], dat2['Weight'])

print('Intercept: ' + str(reg.intercept_))
print('Coefficients: ' + str(reg.coef_))

Intercept: 197.884401948
Coefficients: [-0.80153307 -0.16154532  0.19594315]
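
Reading the printed values back as an equation may make them easier to interpret. With coefficients rounded to three decimals, and taken in the order `Chins`, `Situps`, `Jumps` used in the `fit` call, the fitted model is:

$$\widehat{\text{Weight}} = 197.884 - 0.802\,\text{Chins} - 0.162\,\text{Situps} + 0.196\,\text{Jumps}$$

As a worked example, the first row of the data (5 chins, 162 situps, 60 jumps) gives

$$197.884 - 0.802 \cdot 5 - 0.162 \cdot 162 + 0.196 \cdot 60 \approx 179,$$

which is about 12 below that person's observed `Weight` of 191.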


## LASSO

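I'm not really done with the multiple regression, but I'll try finding an optimal LASSO regularization parameter for this problem as well, because hey, why not? Since 17 rows remain after removing the outliers, `cv=17` below amounts to leave-one-out cross-validation.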

In [7]:
lreg = sklearn.linear_model.LassoCV(eps=0.01, n_alphas=1000, alphas=None, cv=17)
lreg.fit(dat2[['Chins', 'Situps', 'Jumps']], dat2['Weight'])

print('Regularization Parameter: ' + str(lreg.alpha_))
print('Intercept: ' + str(lreg.intercept_))
print('Coefficients: ' + str(lreg.coef_))

Regularization Parameter: 94.2420158942
Intercept: 192.830359015
Coefficients: [-0.         -0.09804644  0.        ]
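
For context, scikit-learn's `Lasso` documentation gives the objective being minimized as below, where $n$ is the number of samples, $X$ the predictor matrix, $y$ the response, $w$ the coefficient vector, and $\alpha$ the regularization parameter reported above:

$$\min_{w} \; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1$$

With $\alpha \approx 94.2$, the L1 penalty is heavy enough that the `Chins` and `Jumps` coefficients are set exactly to zero, leaving `Situps` as the only active predictor.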
