# Linear Regression
Motto of this and all subsequent courses: **"If you can't implement it, you don't understand it."**

Linear regression is usually the very first machine learning model taught, and it serves as an introduction to the field.

## Statistics vs Machine learning
- Machine learning approach - interested in models with high predictive accuracy, loss functions, minimizing loss functions, train and test sets, hyperparameters, generalization, overfitting, regularization


- Statistics approach - interested in models with strong explanatory power, significance tests (ANOVA, t-tests etc.), model diagnostics, model building and selection (forward, backward stepwise), evaluation (AIC, BIC), standard transformations (Box-Cox), residual analysis

## Supervised learning vs unsupervised learning
- Supervised learning has a target variable, which we want to learn how to predict. Linear regression is in this category.
    - Regression - trying to predict a real-valued number, e.g. temperature, price, age
    - Classification - trying to predict a category, e.g. day of month, blood type, presence of cancer
- Unsupervised learning does not have a target variable, and the goal is to reveal underlying patterns within the data

# 1D Linear regression
- Problem definition: Given some data, we want to find a line that best fits that data, so we can make predictions from input variables alone

<img src="assets/1dregression.png"
     alt="1D linear regression: a line fitted to data points"
     style="float: left; margin-right: 10px"
     width="300"/>

## Preliminaries

- given a set of datapoints $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $(x_i, y_i)$ denotes the pair of input variable $x$ and target variable $y$ for datapoint $i$, and $n$ denotes the number of datapoints
- a 1D line is given by the equation $y = ax + b$, where $a$ is the slope and $b$ is the intercept
- for a fitted line $\hat{y}$ and point $(x_i, y_i)$, the predicted value is given by $\hat{y}_i = ax_i + b$
- define a cost function $J$ to quantify how well the data is being fit
- $J = \sum_{i = 1}^{n} (\hat{y}_i - y_i)^2$, also known as the sum of squared errors, i.e. for every datapoint find the difference between the predicted value and the actual value, square it, and add up the results
- defining $J$ this way is a good choice because each squared difference is non-negative, so errors of opposite sign cannot cancel out, as they would if we summed the raw differences. Furthermore, larger errors are penalized more heavily because of the square.
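
To make the cost function concrete, here is a minimal sketch in plain Python. The function names `predict` and `sum_squared_errors` and the toy data are assumptions made for illustration; the data was chosen to lie exactly on the line $y = 2x + 1$.

```python
def predict(a, b, x):
    """Prediction of the line y = a*x + b for a single input x."""
    return a * x + b


def sum_squared_errors(a, b, xs, ys):
    """Cost J: sum over all datapoints of (prediction - target)^2."""
    return sum((predict(a, b, x) - y) ** 2 for x, y in zip(xs, ys))


# Toy data (assumed for illustration) lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

perfect = sum_squared_errors(2.0, 1.0, xs, ys)  # the true line: no error
worse = sum_squared_errors(1.0, 1.0, xs, ys)    # wrong slope: errors -1, -2, -3, -4

print(perfect)  # 0.0
print(worse)    # 30.0
```

The second line misses the four points by 1, 2, 3 and 4, so its cost is $1 + 4 + 9 + 16 = 30$; note how the largest miss contributes more than half of the total.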

## Solution
What follows is an approach to finding the parameters $a, b$ such that the data is best fit, i.e. such that the cost function $J$ is minimized.

### Calculus essentials
- from calculus we know that the gradient is the vector of partial derivatives of a function with respect to its inputs
- for a function $f$ of $n$ input variables $x_1, \dots, x_n$, the gradient at a point $p$ is $\nabla f(p) = \begin{bmatrix} \displaystyle \frac{\partial f(p)}{\partial x_1} \\ \vdots \\ \displaystyle \frac{\partial f(p)}{\partial x_n} \end{bmatrix}$
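
One way to check a gradient in code is to approximate each partial derivative with a central finite difference. The sketch below is an illustration only: the helper `numerical_gradient`, the step `h`, and the example function $f(x_1, x_2) = x_1^2 + 3x_2$ are all assumptions, not part of the course material.

```python
def numerical_gradient(f, p, h=1e-6):
    """Approximate the gradient of f at point p (a list of floats).

    Each partial derivative is estimated by nudging one coordinate
    by +h and -h while holding the others fixed.
    """
    grad = []
    for i in range(len(p)):
        forward = list(p)
        backward = list(p)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def f(p):
    # Example function (assumed): f(x1, x2) = x1^2 + 3*x2
    x1, x2 = p
    return x1 ** 2 + 3 * x2


g = numerical_gradient(f, [2.0, 5.0])
print(g)  # close to [4.0, 3.0], since df/dx1 = 2*x1 and df/dx2 = 3
```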

**Statement:** *The gradient points in the direction of steepest ascent*

<img src="assets/gradient.png"
     alt="Cost function J(w) with regions of positive and negative gradient"
     style="float: left; margin-right: 10px"
     width="500"/>

- for a given input value $w_0$, consider the gradient of $J$ at $w_0$
- for the subset of $w$ values yielding a positive gradient, subtracting a small multiple of the gradient from $w_0$ moves us to the left, where the value of $J(w)$ is lower
- likewise, for the subset of $w$ values yielding a negative gradient, subtracting a small multiple of the gradient from $w_0$ moves us to the right (subtracting a negative is equivalent to adding), where the value of $J(w)$ is again lower
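
The sign argument above can be checked with a single update step. This sketch assumes a made-up cost $J(w) = (w - 1)^2$ with its minimum at $w = 1$ and an arbitrary step size of 0.1; both are illustrative choices.

```python
def J(w):
    # Assumed example cost with its minimum at w = 1
    return (w - 1.0) ** 2


def dJ(w):
    # Derivative of the example cost
    return 2.0 * (w - 1.0)


def one_step(w0, step=0.1):
    """Move w0 against the gradient by a small multiple of it."""
    return w0 - step * dJ(w0)


right_of_min = one_step(3.0)   # gradient is positive here, so we move left
left_of_min = one_step(-2.0)   # gradient is negative here, so we move right

print(right_of_min, J(3.0), J(right_of_min))
print(left_of_min, J(-2.0), J(left_of_min))
```

In both cases the new point is closer to the minimum and the cost is lower than before the step.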

#### Gradient descent
Thus we arrive at the crucial algorithm used to minimize cost functions in machine learning, **the gradient descent algorithm**

- (general case for multiple input variables) for every input parameter $x_i$ of the cost function, we update its value by subtracting a small multiple $\epsilon$ (epsilon) of the partial derivative with respect to $x_i$
- $x_i := x_i - \epsilon \frac{\partial J}{\partial x_i}$
- for a more compact notation we place all input variables into a vector $\theta$
- $\theta = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$
- then we can express the simultaneous update of all input variables in a single equation
- $\theta := \theta - \epsilon \nabla J(\theta)$
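
Applied to 1D linear regression, the parameter vector is $\theta = (a, b)$. The sketch below is one possible implementation, not a reference one: the partial derivatives $\frac{\partial J}{\partial a} = \sum 2(\hat{y}_i - y_i)x_i$ and $\frac{\partial J}{\partial b} = \sum 2(\hat{y}_i - y_i)$ follow from the chain rule applied to the sum of squared errors, while the learning rate, the number of steps, the starting point and the toy data are assumptions chosen so that this small example converges.

```python
def gradient(a, b, xs, ys):
    """Partial derivatives of J = sum((a*x + b - y)^2) w.r.t. a and b."""
    dJ_da = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys))
    dJ_db = sum(2 * (a * x + b - y) for x, y in zip(xs, ys))
    return dJ_da, dJ_db


def gradient_descent(xs, ys, epsilon=0.01, steps=5000):
    """Repeat the update theta := theta - epsilon * grad J(theta)."""
    a, b = 0.0, 0.0  # arbitrary starting point (assumed)
    for _ in range(steps):
        dJ_da, dJ_db = gradient(a, b, xs, ys)
        # Simultaneous update: both gradients were computed before either change
        a, b = a - epsilon * dJ_da, b - epsilon * dJ_db
    return a, b


# Toy data (assumed for illustration) lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

a, b = gradient_descent(xs, ys)
print(a, b)  # close to 2 and 1
```

If $\epsilon$ is too large the updates overshoot and diverge; if it is too small, many more steps are needed. The value 0.01 is only suitable for this particular dataset.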

#### Closed form solution
- though we could certainly find the optimal parameters for linear regression using gradient descent, linear regression is a special case among machine learning algorithms, because a closed form solution exists
- a closed form solution means that the optimal parameters can be obtained by direct calculation: set the gradient to zero and solve for the parameters
- gradient descent reaches a minimum by repeatedly going "downhill" on the cost function until it arrives at a point where the gradient is equal to zero
- the gradient can, however, also be zero at a maximum, so with a closed form solution it might not be clear whether a minimum has been found, **EXCEPT** that the squared error function is *convex*, so any point where the gradient is zero is a global minimum
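
For the 1D case, setting $\frac{\partial J}{\partial a} = 0$ and $\frac{\partial J}{\partial b} = 0$ and solving gives the standard least-squares result $a = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $b = \bar{y} - a\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the means of the inputs and targets. The notes above do not derive these formulas, so treat the sketch below as an illustration; the function name `fit_line` and the toy data are assumptions.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a*x + b for 1D data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    if s_xx == 0:
        # All x values are identical, so the slope is not defined
        raise ValueError("cannot fit a line when all x values are equal")
    a = s_xy / s_xx
    b = y_mean - a * x_mean
    return a, b


# Toy data (assumed for illustration); the points do not lie on one line
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 4.0, 6.0]

a, b = fit_line(xs, ys)
print(a, b)  # approximately 1.6 and 1.1
```

Unlike gradient descent, there is no learning rate or number of steps to choose here: the answer comes from a single pass over the data.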