### Predictive statistics: class example 1, simple linear regression

There are many ways to perform a linear regression in Python. Here we will use scikit-learn: in addition to linear regression, this library provides many other machine learning tools, some of which you will use in other parts of this course. The scikit-learn webpage can be found here: https://scikit-learn.org/stable/

Linear regression in scikit-learn is performed using the `LinearRegression` class. Let's try it out:

In [None]:
# import numpy to generate toy data, and matplotlib to plot it
import numpy as np
import matplotlib.pyplot as plt
# set up matplotlib to show the figures inline
plt.ion()
%matplotlib inline
# import the LinearRegression class from scikit-learn
from sklearn.linear_model import LinearRegression

We will create some toy data, and use it as a simple example of applying linear regression:

In [None]:
# create an x variable with 50 points: evenly-spaced integers plus random jitter
len_x=50
x=np.arange(len_x)+np.random.randn(len_x)
# create toy y data as y = ax + b + random errors
a=0.5; b=-3
y=(a*x)+b+(3*np.random.randn(len_x))
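Because the toy data are built from random numbers, every run of the cell above gives slightly different values. If repeatable results are wanted, one option is a seeded NumPy random generator. The following is a minimal sketch under that assumption: the seed value 42 and the names `rng`, `x_demo` and `y_demo` are illustrative choices and are not used elsewhere in this notebook.

```python
import numpy as np

# seeded random generator; the seed value 42 is arbitrary (an assumption for this sketch)
rng = np.random.default_rng(42)

len_x = 50
a, b = 0.5, -3

# same recipe as the toy data above, but drawing the noise from the seeded generator
x_demo = np.arange(len_x) + rng.standard_normal(len_x)
y_demo = (a * x_demo) + b + (3 * rng.standard_normal(len_x))

# re-creating the generator with the same seed reproduces exactly the same x values
rng_again = np.random.default_rng(42)
x_again = np.arange(len_x) + rng_again.standard_normal(len_x)
print(np.array_equal(x_demo, x_again))
```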

In [None]:
# plot the data
plt.rcParams.update({'font.size': 15})
fig=plt.figure()
ax0=plt.subplot(111)
p0=plt.plot(x,y,'b.')
plt.grid()
ax0.set_xlabel('x')
ax0.set_ylabel('y')

To apply the linear regression method, we first initialise a generic model, and then fit it to our x,y data:

In [None]:
# create a generic instance of the model
model = LinearRegression()
# NB: optional: if you want to read more about the options for setting up the model, 
# you can uncomment the line below. The default options are fine for our example,
# so we do not need to change any of the parameters
#model?

# fit the model
# (we add [:,None] because fit expects the x data as a 2D array of shape (n_samples, n_features);
# y is passed as a 2D column too, so coef_ and intercept_ are returned as arrays)
model.fit(x[:,None], y[:,None])
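To make the role of `[:,None]` more concrete, the sketch below shows how it changes the shape of an array, and how the shape of `y` affects the shape of the fitted parameters. The arrays `x_demo` and `y_demo` are small made-up values chosen for illustration, not the toy data from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# small illustrative arrays (hypothetical values, exactly linear: y = 2x + 1)
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 * x_demo + 1.0

# three equivalent ways to turn a 1D array into a single-column 2D array
col_none = x_demo[:, None]
col_newaxis = x_demo[:, np.newaxis]
col_reshape = x_demo.reshape(-1, 1)
print(x_demo.shape, col_none.shape)  # (4,) (4, 1)

# x must be 2D; y may be 1D or 2D, which changes the shape of coef_
model_1d = LinearRegression().fit(col_none, y_demo)           # y is 1D
model_2d = LinearRegression().fit(col_none, y_demo[:, None])  # y is a 2D column
print(model_1d.coef_.shape, model_2d.coef_.shape)  # (1,) (1, 1)
```

With a 2D `y`, as used in this notebook, the slope is read with two indices (`coef_[0][0]`) and the intercept with one (`intercept_[0]`).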

The model that we get back is a Python object. We can use it to predict values of y (i.e. ŷ=ax+b) by supplying values of x:

In [None]:
# define the x values used to predict y
x_predict = np.arange(0,len_x+1,5)
# get the predicted values of ŷ using the model:
y_hat = model.predict(x_predict[:,None])
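For a model with a single x variable, the prediction step should amount to evaluating ŷ=ax+b with the fitted slope and intercept. The sketch below checks this on a separate small data set; the names `demo_model`, `x_demo`, `y_demo` and `x_new`, and the seed value, are illustrative assumptions rather than part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# small illustrative data set (seed chosen arbitrarily)
rng = np.random.default_rng(0)
x_demo = np.arange(20) + rng.standard_normal(20)
y_demo = 0.5 * x_demo - 3 + rng.standard_normal(20)

# fit in the same way as in the notebook (x and y both passed as 2D columns)
demo_model = LinearRegression().fit(x_demo[:, None], y_demo[:, None])

# predictions from the model object
x_new = np.arange(0, 21, 5)
y_from_predict = demo_model.predict(x_new[:, None])

# the same predictions computed by hand from the fitted parameters
a_hat = demo_model.coef_[0][0]
b_hat = demo_model.intercept_[0]
y_by_hand = a_hat * x_new + b_hat

# compare the two (predict returns a 2D column here, so take its first column)
print(np.allclose(y_from_predict[:, 0], y_by_hand))
```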

In [None]:
# plot the data and the fitted line
fig=plt.figure()
ax0=plt.subplot(111)
p0=plt.plot(x,y,'b.')
p1=plt.plot(x_predict,y_hat,'r.-')
plt.grid()
ax0.set_xlabel('x')
ax0.set_ylabel('y')

Remember that we created our toy y data by using:
```
a=0.5; b=-3
y=(a*x)+b+(3*np.random.randn(len_x))
```

So we expect our model to have values close to 0.5 for the slope (a) and -3 for the intercept (b). Note that the fitted values will not match these exactly, because we added random errors to our data. We can check the values for a and b found by the model by looking at the values stored in ```model.coef_``` and ```model.intercept_```:

In [None]:
print('Our model has parameters a={:.2f}, b={:.2f}'.format(model.coef_[0][0],model.intercept_[0]))
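As a cross-check, the same slope and intercept can be obtained from an independent least-squares fit. The sketch below uses `np.polyfit` with degree 1 for this, on data generated with the same recipe as above; the seed and the names `x_demo`, `y_demo` and `demo_model` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data generated with the same recipe as in the notebook (seed chosen arbitrarily)
rng = np.random.default_rng(42)
n = 50
x_demo = np.arange(n) + rng.standard_normal(n)
y_demo = 0.5 * x_demo - 3 + 3 * rng.standard_normal(n)

# fit with scikit-learn (y passed as 1D here, so coef_ is 1D and intercept_ is a scalar)
demo_model = LinearRegression().fit(x_demo[:, None], y_demo)

# fit a degree-1 polynomial with NumPy; coefficients are returned highest power first
slope_np, intercept_np = np.polyfit(x_demo, y_demo, deg=1)

print('scikit-learn: a={:.2f}, b={:.2f}'.format(demo_model.coef_[0], demo_model.intercept_))
print('np.polyfit:   a={:.2f}, b={:.2f}'.format(slope_np, intercept_np))
```

Both approaches minimise the sum of squared residuals, so the two sets of parameters should agree to within numerical precision.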