# Linear Regression 

# Using libraries

So now we've seen how linear regression works analytically. Although a regression algorithm does not seem too hard to program, there can be a lot of complications. Things work differently when, for instance, the data set is *big*, there are many features, or we cannot use a least squares cost function. In these cases we can use Gradient Descent. However, if we code gradient descent from scratch, we have to be very careful not to make mistakes, and it sometimes converges very slowly.
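To make concrete what "coding gradient descent from scratch" involves, here is a minimal sketch of batch gradient descent for a least squares cost. The function name `gradient_descent`, the synthetic data and the chosen learning rate are assumptions made for illustration; they are not part of this course's code.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, epochs=1000):
    """Batch gradient descent for least-squares linear regression.

    X is an (n_samples, n_features) array without a bias column;
    the intercept is kept as a separate parameter.
    """
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    intercept = 0.0
    for _ in range(epochs):
        residuals = X @ weights + intercept - y
        # Gradients of the cost (1 / 2n) * sum(residuals ** 2)
        weights -= learning_rate * (X.T @ residuals) / n_samples
        intercept -= learning_rate * residuals.mean()
    return weights, intercept

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 1))
y_demo = 3 * X_demo[:, 0] + 2 + rng.normal(0, 0.1, size=200)

gd_weights, gd_intercept = gradient_descent(X_demo, y_demo, learning_rate=0.5, epochs=5000)
print(gd_weights, gd_intercept)
```

Even in this small sketch the learning rate and the number of epochs have to be picked by hand, and a poor choice either diverges or converges very slowly.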

Alternatively, we can use an existing implementation. For Python, many popular algorithms have already been written and added to an open source library. These algorithms are often rigorously tested, well documented and very efficient (in fact, most algorithms were implemented in a fast language like Fortran or C++ and made accessible from Python). Choosing an existing implementation makes coding *a lot* easier, and you are far less likely to make mistakes. Here we will use **SKLearn** (pronounced Sci-Kit Learn), but there are many more interesting libraries.

So why is Python the most popular language for scripting Data Science experiments, and not one of these fast languages? This is probably partly because Python is comprehensive, compact, highly readable, interactive and very easy to learn and use. Although Python itself is far less efficient, we will not notice much of a drop in performance when these libraries do the heavy lifting.

# Data

When loading and preparing the data for use with a library, it is important to read up on the format in which we should provide it. Luckily, for Python, NumPy arrays are the standard format everyone uses, because their speed, stability and versatility are unrivaled. However, for some choices, such as whether to provide the data as row vectors or column vectors, or whether we should add a bias or the library handles that, you should check an example to see how it works.

In our case, SKLearn automatically adds a bias, so we should not add one ourselves. The data loaders in the `ml` package will only add a bias if we pass `bias=True`.
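As a small illustration of these format questions, the sketch below reshapes a one-dimensional array into the `(n_samples, n_features)` layout that scikit-learn estimators expect, and fits a model without adding a bias column by hand. The numbers are made up for the example, and `LinearRegression` is used here only because it makes the fitted intercept easy to inspect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical values, for illustration only
tv_budget = np.array([10.0, 20.0, 30.0, 40.0])
sales = np.array([6.0, 7.0, 8.0, 9.0])  # exactly 5 + 0.1 * tv_budget

# scikit-learn expects X with shape (n_samples, n_features):
# one row per observation, one column per feature.
X = tv_budget.reshape(-1, 1)
print(X.shape)

# fit_intercept=True is the default, so no column of ones is needed.
reg = LinearRegression().fit(X, sales)
print(reg.coef_, reg.intercept_)
```

The intercept ends up in `reg.intercept_` and the slope in `reg.coef_`, which is the same split that `SGDRegressor` uses.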

In [1]:
from ml import *
data = advertising_sales_tv()

# Model

The **SKLearn** library contains excellent implementations of many Machine Learning algorithms. To use this library it is crucial that you read the documentation or a tutorial on how to use it properly. To estimate a linear regression function with Gradient Descent, we use the `partial_fit()` method of the `SGDRegressor`; each call performs one pass over the data it is given. We set the `loss` function to `squared_loss` (renamed to `squared_error` in scikit-learn 1.0 and later), and `eta0` is the initial learning rate.

In [2]:
data.train_X

array([[ 93.9],
       [ 75.1],
       [  4.1],
       [195.4],
       [261.3],
       [276.9],
       [141.3],
       [  0.7],
       [228.3],
       [171.3],
       [112.9],
       [187.9],
       [109.8],
       [  8.4],
       [255.4],
       [  7.8],
       [281.4],
       [292.9],
       [276.7],
       [188.4],
       [120.5],
       [129.4],
       [109.8],
       [  5.4],
       [293.6],
       [219.8],
       [ 17.2],
       [ 97.5],
       [240.1],
       [213.4],
       [  8.7],
       [ 78.2],
       [280.2],
       [218.5],
       [ 18.8],
       [215.4],
       [164.5],
       [ 62.3],
       [ 96.2],
       [217.7],
       [  8.6],
       [182.6],
       [240.1],
       [137.9],
       [125.7],
       [163.5],
       [206.9],
       [136.2],
       [234.5],
       [ 13.2],
       [156.6],
       [191.1],
       [172.5],
       [110.7],
       [ 36.9],
       [102.7],
       [ 73.4],
       [166.8],
       [ 48.3],
       [175.1],
       [290.7],
       [ 69. ],
       ...

In [3]:
from sklearn.linear_model import SGDRegressor
model = SGDRegressor(loss='squared_error', eta0=1e-4)  # use loss='squared_loss' on scikit-learn < 1.0

We observe that the solution converges to a value close to the analytical solution, but not exactly the same. We also see that the SGDRegressor converges more quickly than our own implementation. This is because the SKLearn implementation has a lot of built-in extra functionality to speed up training and prevent overfitting. In practice, these solutions may generalize better to unseen data.

# Training

In [4]:
for epoch in range(100000):
    # each partial_fit call is one pass (epoch) over the training data
    model.partial_fit(data.train_X, data.train_y)
    if epoch % 10000 == 0:
        print(model.coef_, model.intercept_)

[0.10269291] [0.00954835]
[0.06011554] [5.47417704]
[0.05478194] [6.43329338]
[0.05490904] [6.74178412]
[0.05477529] [6.85684681]
[0.05476347] [6.9038953]
[0.05576471] [6.92446553]
[0.05927584] [6.93431959]
[0.05598238] [6.93833489]
[0.05378712] [6.94025056]


Using the SKLearn library we observe that:
- it takes fewer epochs (passes over the dataset) to converge
- the model does not neatly converge: $\theta_0$ (the coefficient, printed first) oscillates while $\theta_1$ (the intercept) converges. We will see fixes for that.
- around $\theta_1 \approx 6.94$ the model still does not truly converge
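To judge how far estimates like these are from the least squares optimum, one could compute the closed-form solution on the same data. The sketch below does this with `np.linalg.lstsq` on a synthetic stand-in for the advertising data; the generated values and the variable names are assumptions, and in the notebook one would use `data.train_X` and `data.train_y` instead.

```python
import numpy as np

# Synthetic stand-in for the advertising data (an assumption, not the real set)
rng = np.random.default_rng(42)
X_syn = rng.uniform(0, 300, size=(150, 1))
y_syn = 7.0 + 0.055 * X_syn[:, 0] + rng.normal(0, 1.0, size=150)

# Append a column of ones so that the last parameter is the intercept
design = np.hstack([X_syn, np.ones((len(X_syn), 1))])

# Least-squares solution of design @ [slope, intercept] = y
solution, *_ = np.linalg.lstsq(design, y_syn, rcond=None)
slope, intercept = solution
print(slope, intercept)
```

Comparing `slope` and `intercept` with `model.coef_` and `model.intercept_` gives a direct measure of how close the gradient descent run has come to the analytical answer.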