# Linear Regression

![Mad men](assets/linear-regression/Ratings-for-Mad-Men.png)

(image: [flowingdata.com](https://flowingdata.com/2014/03/24/graph-tv-shows-ratings-by-episode/))

## Where are we?

![one of many cheatsheets](assets/linear-regression/machine-learning-cheet-sheet.png)

(image: [sas.com](https://www.sas.com/en_us/insights/analytics/machine-learning.html))

## Linear Equation

From $x$ (a feature), predict $y$ (outcome or result) assuming a "linear relationship".

$$y=Wx+b$$

![linear equation](assets/linear-regression/linear-equation.png)
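A minimal sketch of the equation above in NumPy. The values of `W` and `b` and the helper name `predict` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration
W = 2.0   # slope (weight)
b = 1.0   # intercept (bias)

def predict(x, W, b):
    """Return y = W*x + b for a scalar or array x."""
    return W * np.asarray(x, dtype=float) + b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = predict(x, W, b)
print(y)  # values: 1, 3, 5, 7
```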

## Polynomial Equations

Features can also be powers of $x$: $x^m$ for degree $m$. The model is still linear in its weights.

![polynomial](assets/linear-regression/polynomial-equations.png)

(image: quora)
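As a rough illustration, polynomial features can be built by raising $x$ to each degree and stacking the results as columns. The helper `polynomial_features` is a hypothetical name used only for this sketch.

```python
import numpy as np

def polynomial_features(x, degree):
    """Stack x^1 ... x^degree as columns (no constant column)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** m for m in range(1, degree + 1)])

x = np.array([1.0, 2.0, 3.0])
X_poly = polynomial_features(x, degree=3)
print(X_poly)
# rows: [1, 1, 1], [2, 4, 8], [3, 9, 27]
```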

## Objective

Given $y = Wx+b$

Find $W$ and $b$ so that the predicted $y$ is as close as possible to the observed values

Loss function: measures how far the predictions are from the observed values (lower is better)

## Loss Functions

Also known as a cost function or objective function

Example: Mean Squared Error
$$L(W, b) = MSE(W, b) = \frac{1}{N}\sum_{i=1}^N{\big(y_i - (Wx_i + b)\big)^2}$$

Many more examples: http://scikit-learn.org/stable/modules/model_evaluation.html
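The MSE formula above can be sketched directly in NumPy. The function name `mse` and the toy data are assumptions made for this example.

```python
import numpy as np

def mse(W, b, x, y):
    """Mean squared error of the line y = W*x + b on samples (x, y)."""
    residuals = y - (W * x + b)
    return np.mean(residuals ** 2)

# Toy data generated exactly by y = 2x + 1
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])

print(mse(2.0, 1.0, x, y))  # 0.0: the true parameters fit perfectly
print(mse(1.0, 1.0, x, y))  # residuals 0, 1, 2, so (0 + 1 + 4) / 3
```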

## Objective (of Training)

Find $W^*$ and $b^*$ that minimize the loss function:

$$W^*, b^* = \underset{W, b}{\arg \min}\; L(W, b)$$

$$W^*, b^* = \underset{W, b}{\arg \min}\; \frac{1}{N}\sum_{i=1}^N{\big(y_i - (Wx_i + b)\big)^2}$$

$N$: number of samples

## Gradient Descent

1. Initialize parameters ($W$ and $b$) to random values
2. Compute the gradient of the loss function: $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial b}$
3. Update rule ($\epsilon$ = learning rate)
    $$W := W -\epsilon \frac{\partial L}{\partial W}$$
    $$b := b -\epsilon \frac{\partial L}{\partial b}$$
4. Repeat steps 2 and 3 until $W$ and $b$ converge to $W^*$ and $b^*$
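The steps above can be sketched for the MSE loss, whose partial derivatives are $\frac{\partial L}{\partial W} = \frac{2}{N}\sum_i (Wx_i + b - y_i)\,x_i$ and $\frac{\partial L}{\partial b} = \frac{2}{N}\sum_i (Wx_i + b - y_i)$. The function name, the learning rate, the fixed number of steps (used here in place of a convergence test), and the toy data are all assumptions for this sketch.

```python
import numpy as np

def gradient_descent(x, y, lr=0.1, n_steps=2000, seed=0):
    """Fit y = W*x + b by gradient descent on the MSE loss."""
    rng = np.random.default_rng(seed)
    W, b = rng.normal(size=2)              # step 1: random initialization
    for _ in range(n_steps):               # step 4: fixed step count (assumption)
        error = (W * x + b) - y            # prediction minus target
        grad_W = 2.0 * np.mean(error * x)  # step 2: dL/dW
        grad_b = 2.0 * np.mean(error)      # step 2: dL/db
        W -= lr * grad_W                   # step 3: update W
        b -= lr * grad_b                   # step 3: update b
    return W, b

# Toy data generated by y = 3x + 0.5, with no noise
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 0.5

W_hat, b_hat = gradient_descent(x, y)
print(W_hat, b_hat)  # should approach 3.0 and 0.5
```

Both gradients are computed before either parameter is updated, so each step uses the same $(W, b)$ pair.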

## Linear Equation as Dot Product

$y = Wx + b$

Let $x_0 = 1$, then:

$y = Wx + bx_0$

## Linear Equation as Dot Product

$y = Wx + bx_0 = bx_0 + Wx$

$y = \left[ \begin{array}{cc}
b & W \end{array} \right]
\left[ \begin{array}{c}
x_0 \\
x \end{array} \right] = \left[ \begin{array}{cc}
b & W \end{array} \right]
\left[ \begin{array}{c}
1 \\
x \end{array} \right] = \left[ \begin{array}{c}
b \\
W \end{array} \right]^T \left[ \begin{array}{c}
1 \\
x \end{array} \right] = \theta^TX$

where $\theta = \left[ \begin{array}{c}
b \\
W \end{array} \right]$ and $X = \left[ \begin{array}{c}
1 \\
x \end{array} \right]$
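A small sketch of the same idea in NumPy, using hypothetical values for `b` and `W`. Stacking one row $[1, x_i]$ per sample extends the dot product to a whole dataset.

```python
import numpy as np

b, W = 1.0, 2.0                 # hypothetical parameter values
theta = np.array([b, W])        # theta = [b, W]^T

# Single sample: X = [1, x]^T
x = 3.0
X = np.array([1.0, x])
y = theta @ X                   # theta^T X
print(y)                        # same value as W*x + b

# Many samples: one row [1, x_i] per sample
xs = np.array([0.0, 1.0, 2.0])
X_all = np.column_stack([np.ones_like(xs), xs])  # shape (3, 2)
ys = X_all @ theta
print(ys)                       # same values as W*xs + b
```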

## Polynomial Equation as Dot Product

$y = W_2x^2 + W_1x + b = b+ W_1x + W_2x^2$

$y = \left[ \begin{array}{ccc}
b & W_1 & W_2 \end{array} \right]
\left[ \begin{array}{c}
1 \\
x \\
x^2 \end{array} \right] = \left[ \begin{array}{c}
b \\
W_1 \\
W_2 \end{array} \right]^T
\left[ \begin{array}{c}
1 \\
x \\
x^2\end{array} \right] = \theta^TX$
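The polynomial case can be sketched the same way. Here `np.vander` with `increasing=True` builds the columns $1, x, x^2$; the parameter values are hypothetical.

```python
import numpy as np

b, W1, W2 = 1.0, 2.0, 3.0            # hypothetical parameter values
theta = np.array([b, W1, W2])        # theta = [b, W1, W2]^T

xs = np.array([0.0, 1.0, 2.0])
# Columns are x^0, x^1, x^2, matching the order of theta
X = np.vander(xs, N=3, increasing=True)

ys = X @ theta
print(ys)  # values: 1, 6, 17
```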

## Loss Function

For the $i^{th}$ sample, the prediction is $\hat{y}_i = \theta^TX_i$

The loss function is computed over all $N$ samples, comparing each observed $y_i$ with its prediction:
$$L(\theta) = MSE(\theta) = \frac{1}{N}\sum_{i=1}^N{\big(y_i - \theta^TX_i\big)^2}$$
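In this form the loss, and its gradient $\nabla_\theta L = -\frac{2}{N}X^T(y - X\theta)$, each reduce to a line of NumPy. The function names and toy data below are assumptions for this sketch; `X` holds one sample per row with a leading column of ones.

```python
import numpy as np

def mse_loss(theta, X, y):
    """L(theta) = mean((y_i - theta^T X_i)^2); X has one sample per row."""
    return np.mean((y - X @ theta) ** 2)

def mse_gradient(theta, X, y):
    """Gradient of the MSE with respect to theta: -(2/N) X^T (y - X theta)."""
    return -2.0 / len(y) * X.T @ (y - X @ theta)

xs = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(xs), xs])   # rows are [1, x_i]
theta_true = np.array([1.0, 2.0])             # b = 1, W = 2
y = X @ theta_true                            # noiseless targets

print(mse_loss(theta_true, X, y))             # 0.0 at the true parameters
print(mse_loss(np.zeros(2), X, y))            # mean of 1, 9, 25, 49 = 21.0
print(mse_gradient(theta_true, X, y))         # zero vector at the minimum
```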

## Why Dot Product?

In [None]:
import numpy as np

# 25 features, 10000 samples
X = np.random.rand(10000, 25)
W = np.random.rand(1, 25)

y1 = np.zeros((10000, 1))

In [None]:
%%time
# naive double loop: accumulate the weighted sum one feature at a time
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        y1[i] += W[0][j] * X[i][j]

In [None]:
%%time
# vectorized: a single matrix product replaces both loops
y2 = np.dot(X, W.T)

In [None]:
# ensure the two operations are the same
np.testing.assert_allclose(y1, y2)

## Libraries

http://scikit-learn.org/stable/modules/linear_model.html#linear-model

- LinearRegression (Least Squares)
- PolynomialFeatures
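A minimal usage sketch combining the two classes, assuming scikit-learn is installed. `PolynomialFeatures` is imported from `sklearn.preprocessing`; the noiseless quadratic data is made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(100, 1))        # scikit-learn expects a 2-D X
y = 1.0 + 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2     # made-up noiseless quadratic

# Expand x into the columns [x, x^2]; the intercept is fitted by the model
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)

model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)  # should be close to 1.0 and [2.0, 3.0]
```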

## Evaluation Metrics

http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

- Mean squared error (MSE)
- Mean squared log error (MSLE)
- Mean absolute error (MAE)
- $R^2$ score
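Each metric above has a corresponding function in `sklearn.metrics`. A short sketch on made-up values (MSLE requires non-negative targets and predictions):

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_squared_log_error,
    r2_score,
)

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)        # (0.25 + 0 + 2.25 + 1) / 4
msle = mean_squared_log_error(y_true, y_pred)   # MSE on log(1 + y)
mae = mean_absolute_error(y_true, y_pred)       # (0.5 + 0 + 1.5 + 1) / 4
r2 = r2_score(y_true, y_pred)                   # 1 - SS_res / SS_tot

print(mse, msle, mae, r2)
```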