# Introduction to Simple Linear Regression


## Learning Objectives and Outcomes

- Introduction of linear regression in a simple setting.

- Basic assumptions of the model.

- Terminology - 'intercept', 'coefficient'.

- The Least Squares method.
    
- Implement linear regression in sklearn and statsmodels

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
!head data/Advertising.csv

In [None]:
# read data to a dataframe
data = pd.read_csv('data/Advertising.csv',
                   index_col=0)[['TV', 'Sales']]

print(data.shape)
data.head()

Note that TV is the advertising spend in thousands of dollars, and Sales is in thousands of units.

In [None]:
# plot data
data.plot(kind='scatter', x='TV', y='Sales')
plt.show()

In [None]:
# Let's discuss very briefly the notation we will be using

# We usually represent independent variables (input variables) with X:

X = data.TV

# similarly the dependent variable with y:

y = data.Sales

In [None]:
display(data.head(3))

print(X[3], y[3])

Recall that a line equation on the plane can be written as: 

$$ y = m\cdot x + b$$

In [None]:
# this function predicts y (sales)
# given x (TV) and fixed m (slope) and b (intercept)

def predict_y(m=1, x=2, b=1):
    return m*x + b

__Your Turn__

Suppose m = 0.04 and b = 7:

- Find y if x = 230.1

- Find y if x = 44.5

- Find y if x = 17.2

In [None]:
X = data.head(3).TV

In [None]:
predict_y(m=0.04, x=X, b=7)

In [None]:
def draw_line(X, y, intercept=7, slope=0.04, xlabel='TV Advertisements',
              ylabel='Sales', title='A prediction for Sales'):
    """
    Draws a line with the given intercept and slope together with the given data.
    parameters:
    X: array
    y: array
    intercept: float, preferably between 5 and 9 in this case
    slope: float, preferably between 0.02 and 0.08
    xlabel: str, label of the x-axis in the figure.
    ylabel: str, label of the y-axis in the figure.
    title: str, title of the figure.
    return: None; shows a figure with the data and a line with the given intercept and slope.
    """
    # find the predicted values. These points lie on the line with
    # the given slope and intercept
    y_pred = intercept + slope * X

    # create a new figure and set the figure size
    plt.figure(figsize=(10, 8))

    # plot data points as scatter
    plt.scatter(x=X, y=y)

    # plot the prediction line
    plt.plot(X, y_pred, c='r', label='Regression Line')

    plt.ylim(bottom=1)

    # set labels
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)

    # set the title of the figure
    plt.title(title)

    plt.legend()
    plt.show()
    return

In [None]:
tv = data.TV.values
sales = data.Sales.values
draw_line(tv, sales, intercept=7.03, slope=0.06)

## Assumptions

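* Assume that Sales depends on TV approximately linearly: $\mbox{Sales} \approx \beta_{0} + \beta_{1} \cdot \mbox{TV}$.

* There is an initial (baseline) value of Sales, which is the value the line predicts when TV is zero.

* The data might not lie exactly on a line, but the errors around the line are random.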

In [None]:
# Sometimes the linearity assumption is too strict

expr_x = np.linspace(-5, 5, 100)
y = expr_x**2 + 2
draw_line(expr_x, y, intercept=10, slope=3,
          xlabel='',
          ylabel='',
          title='Regression line for quadratic data')

In [None]:
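# 100 evenly spaced input values between -1 and 1
X = np.linspace(-1, 1, 100)

In [None]:
# simulate random errors: 100 draws from a normal distribution
# with mean 0 and standard deviation 1
errors = np.random.normal(loc=0, scale=1, size=100)

In [None]:
# the histogram of the errors is roughly bell-shaped and centered at zero
plt.hist(errors)
plt.show()

In [None]:
# plotted against X, the errors scatter around zero with no visible pattern
plt.scatter(X, errors)
plt.hlines(y=0, xmin=-1, xmax=1)
plt.show()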

## Model  - Single Variable Case

- In simple linear regression we assume that, if we could observe the whole population of both X and y, we would see the following relation:

$$ Y = \beta_{0} + \beta_{1} X + \epsilon$$

- $\beta_{0}$ and $\beta_{1}$ are the parameters of the model and are called the intercept and the coefficient of the linear model, respectively.

- $\epsilon$ is the irreducible error term. Depending on the problem at hand, we might assume that these errors come from measurement mistakes, personal beliefs, recording errors, etc.

- Our goal is, given samples from X and y, to find estimates $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ for the population parameters $\beta_{0}$ and $\beta_{1}$.

- Once we find such estimates $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$, we can use them for future predictions:

$$ \hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{i} $$

<img src="visuals/best_fit.png" cap="Transformed dataset"  width='300'/>
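
To make the model above concrete, here is a minimal sketch that simulates data from $Y = \beta_{0} + \beta_{1} X + \epsilon$ and then makes predictions from a pair of estimates. The parameter values, the estimates, and the names `x_sim`, `y_sim` and `predict` are assumptions chosen only for illustration; they are not taken from the advertising data.

```python
import numpy as np

# reproducible random numbers for the simulation
rng = np.random.default_rng(seed=0)

# hypothetical population parameters (illustration only)
beta_0, beta_1 = 7.0, 0.05

# simulated inputs and irreducible errors
x_sim = rng.uniform(0, 300, size=200)
epsilon = rng.normal(loc=0, scale=1.0, size=200)

# the population relation: Y = beta_0 + beta_1 * X + epsilon
y_sim = beta_0 + beta_1 * x_sim + epsilon


def predict(x_new, b0_hat, b1_hat):
    """Prediction y_hat = b0_hat + b1_hat * x for given estimates."""
    return b0_hat + b1_hat * np.asarray(x_new, dtype=float)


# made-up estimates standing in for the result of a fit
b0_hat, b1_hat = 6.9, 0.051
print(predict([100.0, 200.0], b0_hat, b1_hat))
```

In practice we never see `beta_0`, `beta_1` or `epsilon`; we only see samples such as `x_sim` and `y_sim`, and the estimates have to be learned from them.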

## Using Sklearn for simple linear regression

- Fitting a regression model is very easy with Python.

- All we have to do is import the LinearRegression class from the sklearn.linear_model module.

- For more details and examples of implementation you can check:

[Sklearn Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)
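
As a rough sketch of the workflow described above, the example below fits `LinearRegression` to a small made-up dataset (not the advertising data). The arrays `x_toy` and `y_toy` are assumptions for illustration; the toy target is exactly linear, so the fitted intercept and slope should come out very close to 1 and 2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up, exactly linear toy data: y = 2 * x + 1
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = 2.0 * x_toy + 1.0

# sklearn expects a 2-D feature array of shape (n_samples, n_features)
X_toy = x_toy.reshape(-1, 1)

# instantiate the class and fit it to the data
model = LinearRegression()
model.fit(X_toy, y_toy)

# the learned parameters are stored on the fitted object
print(model.intercept_, model.coef_)

# predict for a new input value (also passed as a 2-D array)
print(model.predict(np.array([[6.0]])))
```

The same three steps (reshape, instantiate, fit) apply to the advertising data in the exercise that follows.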

__Your Turn!__

- Now, using 'TV' and 'Sales', try to fit a linear model with sklearn. Find the corresponding intercept and slope values.

In [None]:
X = data.TV.values
y = data.Sales.values

In [None]:
X.shape

In [None]:
X = X.reshape(-1, 1)

X.shape

In [None]:
# import LinearRegression class


In [None]:
# instantiate the class
# check parameters


In [None]:
# now fit the data.  
print('Shape of X before reshape:', X.shape)


In [None]:
# a reshape may be necessary
print('Shape of X after reshape:', X.shape)


In [None]:
# when you fit, the model learns b0_hat and b1_hat


In [None]:
# now we can use the fitted object to get the model parameters 

## How do we find an estimator? Least Squares method.
<a name="least_squares">
</a>

__Q: How do we find the 'best' line?__


<img src="visuals/errors.png" cap="Transformed dataset"  width='500'/>



* Recall that we know the actual values $y$ for the sales, and for given coefficients $\beta_{0}, \beta_{1}$ we can make a prediction $\hat{y}$. 

* The error for each prediction is $e_{i} = y_{i} - \hat{y}_{i}$.

### Residual sum of squares (RSS)


$$RSS = e_{1}^{2} + e_{2}^{2} + \cdots + e_{n}^{2}$$
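
The RSS can be computed directly from its definition. The sketch below does this for two candidate lines on a tiny made-up dataset; the data and the helper name `rss_for_line` are assumptions for illustration. For comparison it also prints the plain (unsquared) sum of the errors.

```python
import numpy as np

# made-up toy data that lies exactly on y = 2 * x + 1
x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])


def rss_for_line(x, y, intercept, slope):
    """Residual sum of squares of the line intercept + slope * x."""
    errors = y - (intercept + slope * x)
    return np.sum(errors ** 2)


# candidate A: the line the data was generated from
print(rss_for_line(x_toy, y_toy, intercept=1.0, slope=2.0))

# candidate B: a flat line at the mean of y
print(rss_for_line(x_toy, y_toy, intercept=6.0, slope=0.0))

# plain sum of errors for candidate B: positive and negative errors cancel
print(np.sum(y_toy - (6.0 + 0.0 * x_toy)))
```

Candidate B misses every point, yet its unsquared errors add up to zero, while its RSS is clearly larger than that of candidate A.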

* Wait a minute! Why do we square the errors?


The least squares method chooses the estimates $\hat{b}_{0}$ and $\hat{b}_{1}$ (the same quantities written as $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ earlier) so that the RSS is as small as possible:

$$ \mbox{RSS} = (y_1 - \hat{b}_{0} -\hat{b}_{1}x_{1} )^{2} + (y_2 - \hat{b}_{0} -\hat{b}_{1}x_{2} )^{2} + \cdots + (y_n -\hat{b}_{0} -\hat{b}_{1}x_{n} )^{2}$$

* Good news: The least squares estimates can be calculated exactly because the minimizer of the RSS has a closed form.

* Bad news: Deriving the closed form requires derivatives and is complicated. But no worries, Python will take care of this step for us. The result is:

$$ \hat{b}_{1} = \dfrac{\sum^{n}_{i=1} (x_i - \bar{x})(y_{i}-\bar{y})}{\sum^{n}_{i=1} (x_i - \bar{x})^{2}}$$

and 

$$ \hat{b}_{0} = \bar{y} - \hat{b}_{1}\bar{x} $$

- __Note:__ In the literature you might see some variants of RSS, such as:

$$ \mbox{Mean Squared Error (MSE)} = \frac{1}{n} \mbox{RSS}$$

$$ \mbox{Root Mean Squared Error (RMSE)} = \sqrt{\frac{1}{n} \mbox{RSS}} $$

[Least Squares Visualized](https://phet.colorado.edu/sims/html/least-squares-regression/latest/least-squares-regression_en.html)
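
Returning to the RSS variants noted above, the following sketch computes the RSS, MSE and RMSE for one set of predictions. The toy numbers and the function name `mse_and_rmse` are assumptions for illustration only.

```python
import numpy as np


def mse_and_rmse(y_true, y_pred):
    """Return RSS, MSE = RSS / n and RMSE = sqrt(MSE)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # residual sum of squares
    rss = np.sum((y_true - y_pred) ** 2)

    # divide by the number of observations n
    mse = rss / y_true.size

    # the square root brings the error back to the units of y
    rmse = np.sqrt(mse)
    return rss, mse, rmse


# made-up actual values and predictions
rss, mse, rmse = mse_and_rmse([3.0, 5.0, 7.0, 9.0], [6.0, 6.0, 6.0, 6.0])
print(rss, mse, rmse)
```

Because MSE and RMSE only rescale the RSS, the line that minimizes one of them also minimizes the others.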

__Your Turn!__ (Together)

Write a function that, given X and y data, returns $\hat{b}_{0}$, $\hat{b}_{1}$ and the RSS for this data. Compare the results with sklearn's results.

Reminder:

$$ \hat{b}_{1} = \dfrac{\sum^{n}_{i=1} (x_i - \bar{x})(y_{i}-\bar{y})}{\sum^{n}_{i=1} (x_i - \bar{x})^{2}}$$

In [None]:
def least_squares(X, y):
    pass

In [None]:
least_squares(X, y)

# returns b0, b1 and RSS
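
One possible way to fill in the exercise is sketched below. It applies the closed-form formulas for $\hat{b}_{1}$ and $\hat{b}_{0}$ directly and is only one of several reasonable solutions. The name `least_squares_sketch` and the toy data are assumptions for illustration; `np.polyfit` is used as an independent check.

```python
import numpy as np


def least_squares_sketch(X, y):
    """Closed-form simple linear regression: returns (b0, b1, rss)."""
    # flatten so that both (n,) and (n, 1) inputs work
    x = np.asarray(X, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()

    x_bar, y_bar = x.mean(), y.mean()

    # slope: sum of cross-deviations over sum of squared x-deviations
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

    # intercept: the fitted line passes through (x_bar, y_bar)
    b0 = y_bar - b1 * x_bar

    # residual sum of squares for the fitted line
    rss = np.sum((y - (b0 + b1 * x)) ** 2)
    return b0, b1, rss


# made-up toy data
x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([2.0, 4.0, 5.0, 7.0])

b0, b1, rss = least_squares_sketch(x_toy, y_toy)
print(b0, b1, rss)

# independent check: a degree-1 polynomial fit returns (slope, intercept)
slope_check, intercept_check = np.polyfit(x_toy, y_toy, deg=1)
print(intercept_check, slope_check)
```

Calling the function on the advertising `X` and `y` should give values that agree with sklearn's `intercept_` and `coef_` up to floating-point error.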

Now use statsmodels or sklearn to compare results.

In [None]:
X = X.reshape(-1,1)

In [None]:
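# import the class here so that this cell runs on its own
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(X,y)
print(lr.intercept_, lr.coef_)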

In [None]:
# let's use draw_line again for the given estimates.

In [None]:
# use the estimates learned by the fitted sklearn model
draw_line(X.ravel(), y, intercept=lr.intercept_, slope=lr.coef_[0])

## Linear Regression with Statsmodels

There is another library that we can use for linear models: statsmodels.

- [check the documentation](http://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html#statsmodels.regression.linear_model.OLS)

- Now let's use statsmodels to fit a linear model to our data.

In [None]:
import statsmodels.api as sm

__Your Turn__

- Try to use the statsmodels library to fit a line to the advertising dataset.

In [None]:
data