<h1>Introduction to Linear Regression</h1>

<h2>Key Concepts</h2>

- Loss function
- Gradient descent
- Linear regression
- Statsmodels
- In/out-sampling

In [None]:
# Read in the data
import warnings
import pandas as pd
import numpy as np
import statsmodels.api as sm
from pylab import *

warnings.filterwarnings('ignore')

data = pd.read_csv('../datasets/world_happiness_v2.csv')

# We also add a constant
data['Explained constant'] = 1

response =  'Happiness score'
predictors = [col for col in data.columns if 'Explained' in col]

print(data.columns.tolist())

print(predictors)

In [None]:
# Plots of the data
figure()
data[predictors[0]].hist(bins=10)
title(predictors[0])

figure()
plot(data[predictors[0]], data[response], 'x', alpha=0.5)
grid()
xlabel(predictors[0])
ylabel(response)

<div style="border: 3px solid green; padding: 10px">
  <b>Implementation 1:</b> Plot the different predictors. Can you visually tell which predictors are "important"?
</div>

<h3>Fitting a model</h3>

We want to explain the happiness score using the predictor variables.
We start with a linear model of the following form:

$$ y_i = \sum_{j=1}^k \beta_j x_{ij} + \epsilon_i $$

Here $x_{ij}$ is the value of predictor $j$ for observation $i$, and $\epsilon_i$ is the error term.


In [None]:
# We initialize a random beta vector 
np.random.seed(0)

beta = np.random.rand(len(predictors))

# We get an estimate for the happiness score conditional on our beta
# (X @ beta gives a 1-D array with one prediction per row)
data['y_hat'] = data[predictors].values @ beta

# We can plot the estimate against the data
figure()
plot(data[response], data['y_hat'], 'x', alpha=0.5)
grid()
xlabel(response)
ylabel('y_hat')


<h3>Loss</h3>

How good is our model? We can measure it with a loss function. Here we use the mean squared error:

$$ \mathrm{L} = \frac{1}{N}  \sum_{i=1}^N \left(y_i - \sum_{j=1}^k \beta_j x_{ij} \right)^2 $$

In matrix notation we can write the same as:

$$ \mathrm{L} = \frac{1}{N} \left( y - X \beta \right)^T \left(y - X \beta \right) $$
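
As a quick sanity check, the sketch below evaluates both forms of the loss on synthetic data and confirms that they agree. The data and all variable names are made up for illustration and are not part of the happiness dataset.

```python
import numpy as np

# Synthetic data, purely for illustration (assumption: not the notebook's dataset)
rng = np.random.default_rng(0)
N, k = 50, 3
X = rng.normal(size=(N, k))
y = rng.normal(size=N)
beta = rng.normal(size=k)

# Element-wise form: average of the squared residuals
loss_sum = sum(
    (y[i] - sum(beta[j] * X[i, j] for j in range(k))) ** 2
    for i in range(N)
) / N

# Matrix form: (y - X beta)^T (y - X beta) / N
residual = y - X @ beta
loss_matrix = residual @ residual / N

print(np.isclose(loss_sum, loss_matrix))
```

The matrix form is the one worth implementing in practice, since it avoids Python-level loops.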


<div style="border: 3px solid green; padding: 10px">
  <b>Implementation 2:</b> Write a function that computes the loss, and compute the loss for the random betas.
</div>

In [None]:
def get_loss(X, y, beta):
    #TODO
    return 


<h3>Gradient descent</h3>

We can probably find a beta with a smaller loss. One way is to move beta in the direction opposite to the gradient of the loss, which is the direction of steepest descent.

The loss can be rewritten as:

$$ \mathrm{L} = \frac{1}{N} \left( y^T y -2 \beta^T X^T y + \beta^T X^T X \beta \right) $$

And the gradient as:

$$ \frac{\partial \mathrm{L}}{\partial \beta} =  \frac{2}{N} \left( X^T X \beta - X^T y  \right) $$

We can then take steps against the gradient with a "learning rate" $\lambda$:

$$ \beta_{n+1} = \beta_n - \lambda \frac{2}{N} \left( X^T X \beta_n - X^T y  \right) $$
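
The following is a minimal sketch of these steps on synthetic data, not a reference solution for the happiness dataset. The helper names `mse_loss` and `mse_gradient`, the learning rate and the number of steps are all illustrative assumptions. The step uses the full gradient, including the $2/N$ factor, and the analytic gradient is first compared against a finite-difference approximation.

```python
import numpy as np

def mse_loss(X, y, beta):
    """Mean squared error of the linear model X @ beta (hypothetical helper)."""
    residual = y - X @ beta
    return residual @ residual / len(y)

def mse_gradient(X, y, beta):
    """Gradient of the mean squared error with respect to beta."""
    return 2.0 / len(y) * (X.T @ X @ beta - X.T @ y)

# Synthetic data with a known coefficient vector (assumption, for illustration)
rng = np.random.default_rng(0)
N, k = 200, 3
X = rng.normal(size=(N, k))
true_beta = np.array([1.0, -2.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=N)

# Check the analytic gradient against central finite differences
beta = np.zeros(k)
eps = 1e-6
numeric_gradient = np.array([
    (mse_loss(X, y, beta + eps * e) - mse_loss(X, y, beta - eps * e)) / (2 * eps)
    for e in np.eye(k)
])
gradient_ok = np.allclose(numeric_gradient, mse_gradient(X, y, beta), atol=1e-5)

# Gradient descent: repeatedly step against the gradient
learning_rate = 0.1
losses = []
for _ in range(200):
    beta = beta - learning_rate * mse_gradient(X, y, beta)
    losses.append(mse_loss(X, y, beta))

print(gradient_ok, losses[0], losses[-1], beta)
```

If the learning rate is too large the loss grows instead of shrinking; if it is too small, many more steps are needed to get close to the minimum.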

<div style="border: 3px solid green; padding: 10px">
  <b>Implementation 3:</b> Implement the gradient descent algorithm. Plot the loss. Observe what happens if you change the learning rate and n_steps. 
</div>

In [None]:
# Implementation 3

def get_gradient(X, y, beta):
    #TODO
    return 

def update_beta(X, y, beta, learning_rate):
    #TODO
    return 

X = data[predictors].values
y = data[response].values
beta_update = beta.copy()  # copy so the random starting beta is not modified
n_steps = 1000
learning_rate = 0.001

losses = []
for i in range(n_steps):
    beta_update = update_beta(X, y, beta_update, learning_rate)
    losses.append(get_loss(X, y, beta_update))
    
plot(losses)
grid()
xlabel("Iteration")
ylabel("Loss")

print("Final loss {}".format(losses[-1]))

<h3>Direct solution</h3>

We can compute the solution for beta directly by setting the gradient equal to 0 and solving for $\beta$:

$$ \beta_e =  \left( X^T X \right)^{-1} X^T y$$
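
A minimal sketch of the direct solution on synthetic data follows; the data and variable names are assumptions for illustration. It solves the linear system $X^T X \beta = X^T y$ rather than forming the inverse explicitly, which is usually the numerically safer choice, and compares the result with NumPy's least-squares solver.

```python
import numpy as np

# Synthetic data with a known coefficient vector (assumption, for illustration)
rng = np.random.default_rng(0)
N, k = 200, 3
X = rng.normal(size=(N, k))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

# Solve (X^T X) beta = X^T y without computing the inverse
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same problem via NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the optimum the gradient of the loss should vanish
gradient = 2.0 / N * (X.T @ X @ beta_normal - X.T @ y)

print(np.allclose(beta_normal, beta_lstsq), np.allclose(gradient, 0.0))
```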


<div style="border: 3px solid green; padding: 10px">
  <b>Implementation 4:</b> Implement the direct solution and compare the loss
</div>

In [None]:
beta_optimal = None  #TODO
print("loss optimal = {}".format(get_loss(X, y, beta_optimal)))

<h3>Statsmodels</h3>

Packages already exist that do this for you. They also provide a lot of useful information about the model. What do you learn about the relationship between happiness and its predictors?

In [None]:
m = sm.OLS(data[response], data[predictors]).fit()
m.summary()

<h3>Exercises</h3>

In [None]:
# Create the in- and out-sample
# First 78 rows are the in-sample, the remaining rows the out-sample
in_sample = data.iloc[:78]
out_sample = data.iloc[78:]

in_sample.head()
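
The idea behind the split can be sketched on synthetic data: fit the betas on the in-sample rows only, then compute the loss separately on both parts. The data, the `mse` helper and the variable names below are assumptions for illustration; only the cut point of 78 rows mirrors the cell above.

```python
import numpy as np

def mse(X, y, beta):
    """Mean squared error of the linear model X @ beta (hypothetical helper)."""
    residual = y - X @ beta
    return residual @ residual / len(y)

# Synthetic data (assumption, for illustration)
rng = np.random.default_rng(0)
N, k = 150, 3
X = rng.normal(size=(N, k))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=N)

# Split by row position: first 78 rows in-sample, the rest out-sample
split = 78
X_in, y_in = X[:split], y[:split]
X_out, y_out = X[split:], y[split:]

# Fit on the in-sample only
beta_in, *_ = np.linalg.lstsq(X_in, y_in, rcond=None)

# Evaluate the same betas on both parts
loss_in = mse(X_in, y_in, beta_in)
loss_out = mse(X_out, y_out, beta_in)
print(loss_in, loss_out)
```

The out-sample loss is the fairer measure of the model, because those rows played no part in choosing the betas.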

<div style="border: 3px solid green; padding: 10px">
  <b>Exercise 1:</b> On the in-sample data, fit one beta per predictor at a time 
        (univariate regression). Compute 
        the loss on the out-sample using those betas. Compare this out-sample loss to that of the
        multivariate estimated betas. 
</div>

<div style="border: 3px solid green; padding: 10px">
  <b>Exercise 2:</b> On the in-sample data, fit one beta per predictor at a time 
        (univariate regression), each time fitting on the residual left by the previous predictor. Compute 
        the loss on the out-sample using those betas. Compare this out-sample loss to that of the
        multivariate estimated betas. 
</div>

<div style="border: 3px solid green; padding: 10px">
  <b>Exercise 5:</b> Change your in-sample model in order to decrease the out-sample loss. Concepts you can try are 
    <b>manual feature selection</b> and <b>regularization (lasso)</b>. You can also split your in-sample data in half to tune the hyperparameter (alpha). If you are not familiar with these concepts, read up on them first. 
</div>

In [None]:
# Lasso fit: L1_wt defaults to 1.0 in statsmodels, so alpha scales a pure L1 penalty
m = sm.OLS(data[response], data[predictors]).fit_regularized(alpha=0.01)
m.params