# Linear regression with one variable

An outline of linear regression with a single input variable: the model, the cost function, and gradient descent.

## Notation

| Value | Meaning |
| :-: | :-- |
| $m$ | number of training examples |
| $x$'s | "input" variables / features (e.g. size of house) |
| $y$'s | "output" variables / "target" variables (e.g. price of house) |
| $(x, y)$ | a single training example (row in table) |
| $(x^{(i)}, y^{(i)})$ | $i$th training example (*1-indexed*) |
| $(x^{(i)}, y^{(i)});\ i=1,\ldots,m$ | training set |
| $h$ | hypothesis - function that maps from $x$ to $y$ |
| $:=$ | assignment operator |

## Contents

- [linear regression model](#Linear-regression-model) - model to solve supervised problems
  - [training set](#Training-set) - the input data
  - [linear hypothesis](#Linear-hypothesis) - standard hypothesis for linear models
- [cost function](#Cost-function) - function to estimate error cost
- [gradient descent](#Gradient-descent) - finds parameters for function

## Linear regression model

The linear regression model solves [regression problems](1_introduction.ipynb#Regression-problem).

![](./static/model_representation_overview.png)

Given a [training set](#Training-set), we need a function $h$ such that $h(x)$ is a good predictor of the corresponding value of $y$. That function is the hypothesis. For linear models we use the [linear hypothesis](#Linear-hypothesis).

For example, we could use:

- [linear hypothesis](#Linear-hypothesis) to fit a straight line to roughly linear data
- [cost function](#Cost-function) to measure the difference between the linear hypothesis and the data for a given set of parameters
- [gradient descent](#Gradient-descent) to find the parameters that minimize the cost function

### Training set

The set of examples the model learns from. One example would be the size and price of houses in Vancouver:

| Size in ft$^{2}$ ($x$) | Price in 1000s ($y$) |
| :-- | :-- |
| 2104 | 1024 |
| 1416 | 860 |
| 1534 | 945 |
| 852 | 560 |

$$
m = 4 \text{ (4 rows)}
$$

### Linear hypothesis

The linear hypothesis fits a straight line to data that is roughly linear.

The equation:

$$
h_\theta(x) = \theta_0 + \theta_1x
$$

Sometimes we just use $h(x)$ for short.

![](./static/linear_regression.png)
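As a minimal sketch, the hypothesis is a one-line function. The function name `hypothesis` and the parameter values in `main` are assumptions chosen for illustration; they are not fitted to the housing data.

```go
package main

import "fmt"

// hypothesis evaluates h_theta(x) = theta0 + theta1*x.
func hypothesis(theta0, theta1, x float64) float64 {
	return theta0 + theta1*x
}

func main() {
	// Illustrative parameters only, not the result of any fitting.
	theta0, theta1 := 50.0, 0.5
	for _, size := range []float64{852, 1416, 1534, 2104} {
		fmt.Printf("size=%v predicted price=%v\n", size, hypothesis(theta0, theta1, size))
	}
}
```

Changing `theta0` shifts the line up or down, and changing `theta1` changes its slope.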

## Cost function

The cost function described here is the most commonly used cost function for [regression problems](1_introduction.ipynb#Regression-problem). It measures the error between the hypothesis and the provided data.

> It is sometimes called the "squared error function" or "mean squared error".

**Hypothesis:** We could use the [linear hypothesis](#Linear-hypothesis).

**Parameters:**

$\theta_0$ - y-intercept (the value of $h_\theta(x)$ at $x = 0$)  
$\theta_1$ - slope

The cost function can be defined as:

$$
J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2
$$

In the cost function, $h_{\theta}(x^{(i)})$ stands for $\theta_0 + \theta_1x^{(i)}$. This is the hypothesis itself, so its value is affected by changing the $\theta$'s.

$J$ is equal to $\frac{1}{2}\bar{e}$, where $\bar{e}$ is the mean of the squares of $h_\theta(x^{(i)}) - y^{(i)}$, and each $h_\theta(x^{(i)}) - y^{(i)}$ is the difference between the predicted and the actual value. The mean is halved for convenience when computing [gradient descent](#Gradient-descent): the factor of 2 from differentiating the square cancels the $\frac{1}{2}$.
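The formula for $J$ translates almost directly into code. The following is a sketch under assumptions: the function name `cost` and the three-point data set in `main` are made up for illustration.

```go
package main

import "fmt"

// cost computes J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2).
// xs and ys are assumed to have the same non-zero length m.
func cost(theta0, theta1 float64, xs, ys []float64) float64 {
	sum := 0.0
	for i := range xs {
		diff := (theta0 + theta1*xs[i]) - ys[i] // predicted minus actual
		sum += diff * diff
	}
	return sum / (2 * float64(len(xs)))
}

func main() {
	// Made-up data lying exactly on the line y = x.
	xs := []float64{1, 2, 3}
	ys := []float64{1, 2, 3}
	fmt.Println(cost(0, 1, xs, ys))   // the line y = x fits perfectly, so the cost is 0
	fmt.Println(cost(0, 0.5, xs, ys)) // a flatter line misses every point, so the cost is larger
}
```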

We want to choose $\theta_0$ and $\theta_1$ such that $h_{\theta}(x)$ is close to $y$ for our training examples $(x, y)$ (minimizing error).

$\underset{\theta_0, \theta_1}{\text{minimize}}\ J(\theta_0, \theta_1)$

$\underset{\theta_0, \theta_1}{\text{minimize}}$ means "find the values of $\theta_0$ and $\theta_1$ that minimize $J(\theta_0, \theta_1)$".
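To make "minimize" concrete, the sketch below compares $J$ for a handful of candidate slopes on the housing table, with $\theta_0$ fixed at 0. This brute-force comparison is only an illustration of what minimizing means; the candidate values and the names `cost` and `bestSlope` are assumptions.

```go
package main

import "fmt"

// cost computes J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2).
func cost(theta0, theta1 float64, xs, ys []float64) float64 {
	sum := 0.0
	for i := range xs {
		diff := (theta0 + theta1*xs[i]) - ys[i]
		sum += diff * diff
	}
	return sum / (2 * float64(len(xs)))
}

// bestSlope returns the candidate theta1 (theta0 fixed at 0) with the lowest cost.
// candidates is assumed to be non-empty.
func bestSlope(candidates, xs, ys []float64) float64 {
	best := candidates[0]
	for _, c := range candidates[1:] {
		if cost(0, c, xs, ys) < cost(0, best, xs, ys) {
			best = c
		}
	}
	return best
}

func main() {
	// Housing data from the training set table.
	sizes := []float64{2104, 1416, 1534, 852}
	prices := []float64{1024, 860, 945, 560}
	fmt.Println(bestSlope([]float64{0.4, 0.5, 0.6, 0.7}, sizes, prices)) // prints 0.6
}
```

Trying every candidate does not scale to two or more parameters with fine resolution, which is the motivation for a guided search such as gradient descent.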

## Gradient descent

> Sometimes called "batch gradient descent" because every step looks at the entire training set (some other versions do not).

Gradient descent is a way of estimating the best parameters for a function. Depending on the function, gradient descent may converge on a local optimum (a local "low point") rather than the global optimum. The steps are:

1. choose an arbitrary starting value for each parameter
2. "step downhill" (towards the lowest nearby function value)
3. repeat until you converge on a minimum of your function (ideally the global minimum)

> There is also a normal equation method that solves the same problem directly, without iterating, but it doesn't scale as well to large data sets.

There can be an arbitrary number of $\theta$'s, but the following graph shows just two: $\theta_0$ and $\theta_1$.

![](./static/gradient_descent_graph.png)

Repeat until convergence (for $j = 0$ and $j = 1$, i.e. for both thetas):
$$
\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)
$$

$\alpha$ = **learning rate** - controls how big a step we take "downhill". If it is too small, gradient descent can take too long to reach the minimum. If it is too large, it can overshoot the minimum and even fail to converge.

$\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)$ = **derivative term** - the slope of $J$ in the $\theta_j$ direction at the current $\theta_0$ and $\theta_1$. Because the slope shrinks as we approach a minimum (where it is 0), the "steps" get smaller automatically, even with a fixed $\alpha$.
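The effect of the learning rate can be seen on a toy problem. The sketch below assumes the cost $J(\theta) = \theta^2$ (not the regression cost), whose derivative is $2\theta$ and whose minimum is at $\theta = 0$; the name `descend` and the three values of $\alpha$ are illustrative.

```go
package main

import "fmt"

// descend runs `steps` gradient descent updates on the toy cost
// J(theta) = theta^2, whose derivative is 2*theta, and returns the final theta.
func descend(theta, alpha float64, steps int) float64 {
	for i := 0; i < steps; i++ {
		theta = theta - alpha*2*theta
	}
	return theta
}

func main() {
	// Same start (theta = 1) and same number of steps, different learning rates.
	for _, alpha := range []float64{0.001, 0.1, 1.1} {
		fmt.Printf("alpha=%v theta after 20 steps: %v\n", alpha, descend(1, alpha, 20))
	}
}
```

With $\alpha = 0.001$ the parameter has barely moved after 20 steps, with $\alpha = 0.1$ it is close to 0, and with $\alpha = 1.1$ every step overshoots the minimum by more than the previous one, so the value grows instead of shrinking.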


All thetas must be updated simultaneously, i.e. every new value is computed from the old thetas before any of them is overwritten:

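```go
// dJ0 and dJ1 are hypothetical helpers returning the partial derivatives of J.
// Compute both new values from the old thetas before overwriting either one.
temp0 := theta0 - alpha*dJ0(theta0, theta1)
temp1 := theta1 - alpha*dJ1(theta0, theta1)
theta0, theta1 = temp0, temp1
```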
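To see why the order matters, here is a small runnable sketch that contrasts a simultaneous step with a sequential one. It assumes a toy cost $J(\theta_0, \theta_1) = \theta_0^2 + \theta_0\theta_1 + \theta_1^2$, picked only because its partial derivatives depend on both thetas; all function names are illustrative.

```go
package main

import "fmt"

// Partial derivatives of the assumed toy cost J(t0, t1) = t0^2 + t0*t1 + t1^2.
func dJ0(t0, t1 float64) float64 { return 2*t0 + t1 }
func dJ1(t0, t1 float64) float64 { return t0 + 2*t1 }

// simultaneousStep evaluates both derivatives at the old thetas, then assigns.
func simultaneousStep(t0, t1, alpha float64) (float64, float64) {
	temp0 := t0 - alpha*dJ0(t0, t1)
	temp1 := t1 - alpha*dJ1(t0, t1)
	return temp0, temp1
}

// sequentialStep is the incorrect variant: the update of t1 already sees the new t0.
func sequentialStep(t0, t1, alpha float64) (float64, float64) {
	t0 = t0 - alpha*dJ0(t0, t1)
	t1 = t1 - alpha*dJ1(t0, t1)
	return t0, t1
}

func main() {
	a0, a1 := simultaneousStep(1, 1, 0.1)
	b0, b1 := sequentialStep(1, 1, 0.1)
	fmt.Println("simultaneous:", a0, a1)
	fmt.Println("sequential:  ", b0, b1)
}
```

Starting from $(1, 1)$, the two variants agree on $\theta_0$ but disagree on $\theta_1$, because the sequential version computes the second derivative at a point that mixes old and new values.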

### Example with zero-value for $\theta_0$

Simplify by setting $\theta_0 = 0$ (same as removing $\theta_0$ from the equations).

Gradient descent:

$$
\theta_1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_1)
$$

![](./static/gradient_descent_example.png)

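When the derivative term is positive (positive slope), $\theta_1 := \theta_1 - \alpha \cdot (\text{positive number})$ means $\theta_1$ will decrease, moving towards the minimum. When the slope is negative, the same update increases $\theta_1$, again moving towards the minimum.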
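The one-parameter case is small enough to trace step by step. This sketch assumes the squared error cost with $\theta_0 = 0$, so the derivative is $\frac{1}{m}\sum_{i=1}^m (\theta_1x^{(i)} - y^{(i)})x^{(i)}$; the data, the starting value, and the name `slope` are made up for illustration.

```go
package main

import "fmt"

// slope returns dJ/dtheta1 for h(x) = theta1*x (theta0 fixed at 0):
// (1/m) * sum((theta1*x_i - y_i) * x_i).
func slope(theta1 float64, xs, ys []float64) float64 {
	sum := 0.0
	for i := range xs {
		sum += (theta1*xs[i] - ys[i]) * xs[i]
	}
	return sum / float64(len(xs))
}

func main() {
	// Made-up data on the line y = x, so J is minimized at theta1 = 1.
	xs := []float64{1, 2, 3}
	ys := []float64{1, 2, 3}
	theta1, alpha := 2.0, 0.1 // start to the right of the minimum
	for i := 0; i < 5; i++ {
		g := slope(theta1, xs, ys)
		fmt.Printf("theta1=%.4f slope=%.4f\n", theta1, g)
		theta1 = theta1 - alpha*g
	}
}
```

Starting at $\theta_1 = 2$ the slope is positive, so each update lowers $\theta_1$ towards 1, and the printed slope shrinks as it gets closer.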

### Example using the cost function

Plug in the [cost function](#Cost-function):

$$
\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) = \frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2
$$

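For $j = 0, 1$ the partial derivatives work out to:

$$
\begin{align}
j = 0:\quad \frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1) &= \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \\
j = 1:\quad \frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1) &= \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x^{(i)}
\end{align}
$$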
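The two results above follow from the chain rule. As a short worked derivation, substituting $h_\theta(x^{(i)}) = \theta_0 + \theta_1x^{(i)}$ gives:

$$
\begin{align}
\frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2
&= \frac{1}{2m}\sum_{i=1}^m 2\,(h_\theta(x^{(i)}) - y^{(i)})\,\frac{\partial}{\partial\theta_j}(\theta_0 + \theta_1x^{(i)} - y^{(i)}) \\
&= \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\,\frac{\partial}{\partial\theta_j}(\theta_0 + \theta_1x^{(i)})
\end{align}
$$

The remaining inner derivative is $1$ for $j = 0$ and $x^{(i)}$ for $j = 1$, which is why only the $\theta_1$ expression carries the extra factor $x^{(i)}$.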

Repeat until convergence (updating $\theta_0$ and $\theta_1$ simultaneously):

$$
\begin{align}
\theta_0 &:= \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \\
\theta_1 &:= \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x^{(i)}
\end{align}
$$
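Putting the two update rules together gives a complete, if minimal, fitting routine. The sketch below makes several assumptions: it runs a fixed number of iterations instead of testing for convergence, it starts from $\theta_0 = \theta_1 = 0$, and it uses made-up data generated from $y = 1 + 2x$ rather than the housing table. The name `gradientDescent` and its signature are illustrative.

```go
package main

import "fmt"

// gradientDescent fits h(x) = theta0 + theta1*x by batch gradient descent.
// It starts from theta0 = theta1 = 0 and runs a fixed number of iterations,
// a simplification of "repeat until convergence".
func gradientDescent(xs, ys []float64, alpha float64, iterations int) (theta0, theta1 float64) {
	m := float64(len(xs))
	for iter := 0; iter < iterations; iter++ {
		var grad0, grad1 float64
		for i := range xs {
			diff := (theta0 + theta1*xs[i]) - ys[i] // h(x_i) - y_i
			grad0 += diff
			grad1 += diff * xs[i]
		}
		// Simultaneous update: both sums were computed from the old thetas.
		theta0, theta1 = theta0-alpha*grad0/m, theta1-alpha*grad1/m
	}
	return theta0, theta1
}

func main() {
	// Made-up data generated from y = 1 + 2x, so the fit should approach (1, 2).
	xs := []float64{1, 2, 3, 4}
	ys := []float64{3, 5, 7, 9}
	theta0, theta1 := gradientDescent(xs, ys, 0.1, 2000)
	fmt.Printf("theta0=%.4f theta1=%.4f\n", theta0, theta1)
}
```

The learning rate of 0.1 suits this small-valued data. Inputs as large as the house sizes in the training set would need a much smaller $\alpha$ to avoid diverging.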