# Week 2 - Linear Regression with Multiple Variables
This week, we expand upon the univariate linear regression covered last week to linear regressions on multiple variables.

Topics for this week include
* Multivariate Linear Regression
* Computing Parameters Analytically

## Multivariate Linear Regression
We can use a more powerful version of linear regression, one that works on multiple variables/features. In our previous examples, we had a single feature (the size of the house) and a single result (the price of the house). We can start to include more variables, such as the age of the house, the number of bedrooms, etc. These additional variables will be denoted

$$
x_1, x_2, x_3, ... , x_n
$$

where
* $n$ will mean the number of variables/features
* $m$ will mean the number of training samples

Furthermore, we will use the following notation

$$
x_j^{(i)}
$$

to identify the $j$th feature of the $i$th training sample. Take the following example:

| Size (sq ft) | Num. bedrooms | Num. floors | Age of home | Price ($1000) |
|---|---|---|---|---|
| 2104 | 5 | 1 | 45 | 460 |
| 1416 | 3 | 2 | 40 | 232 |
| 1534 | 3 | 2 | 30 | 315 |
| 852 | 2 | 1 | 36 | 178 |

Here, 
* $m$ = 4
* $n$ = 4
* $x^{(2)}$, the second training sample, would be
$$ 
\begin{bmatrix} 
1416 \\ 
3 \\ 
2 \\ 
40
\end{bmatrix}
$$
* $x_4^{(2)}$ = 40
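
As a minimal sketch of this notation in code, assuming NumPy and variable names (`X`, `y`, `x_2`, `x_2_4`) chosen here purely for illustration, the table can be stored as a matrix with one row per training sample. The only subtlety is that the notes count from 1 while NumPy counts from 0.

```python
import numpy as np

# Training data from the table above: one row per sample, one column per feature.
X = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 3, 2, 30],
    [852, 2, 1, 36],
])
y = np.array([460, 232, 315, 178])  # price in $1000s

m, n = X.shape  # m training samples, n features

# The notes use 1-based indices; NumPy uses 0-based indices.
x_2 = X[2 - 1]           # x^(2): the second training sample
x_2_4 = X[2 - 1, 4 - 1]  # x_4^(2): the fourth feature of the second sample

print(m, n)          # 4 4
print(x_2.tolist())  # [1416, 3, 2, 40]
print(x_2_4)         # 40
```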

With 4 variables, we now have a hypothesis that looks like

$$
h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4
$$

For convenience, we define

$$
x_0 = 1
$$

Now our feature variable becomes

$$
x =
\begin{bmatrix}
x_0 \\
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix}
\in \mathbb{R}^{n+1}
$$

and our parameter variable becomes

$$
\theta =
\begin{bmatrix}
\theta_0 \\
\theta_1 \\
\theta_2 \\
\vdots \\
\theta_n
\end{bmatrix}
\in \mathbb{R}^{n+1}
$$

making our hypothesis easily written as

$$
h_\theta (x) = \theta^T x = \theta \cdot x
$$

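With all $m$ training samples stacked as the rows of a matrix $X$ and the corresponding results collected in a vector $y$, a vectorized form of our hypothesis and theta update is:

$$
\begin{align}
h_\theta (X) & = X \theta \\
\theta & := \theta - \frac{\alpha}{m} X^T (X \theta - y)
\end{align}
$$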
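
The vectorized hypothesis and update can be sketched in a few lines of NumPy. The function names (`add_intercept`, `predict`, `gradient_step`) and the three-sample data set are assumptions made for this illustration; they do not come from the course material.

```python
import numpy as np

def add_intercept(features):
    """Prepend the x_0 = 1 column, giving an m x (n+1) matrix."""
    m = features.shape[0]
    return np.hstack([np.ones((m, 1)), features])

def predict(X, theta):
    """h_theta(X) = X theta: one prediction per row of X."""
    return X @ theta

def gradient_step(X, y, theta, alpha):
    """One update: theta := theta - (alpha / m) * X^T (X theta - y)."""
    m = X.shape[0]
    return theta - (alpha / m) * (X.T @ (predict(X, theta) - y))

# Made-up data for illustration: y = 1 + 2 * x exactly.
features = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
X = add_intercept(features)

theta = np.zeros(X.shape[1])
theta = gradient_step(X, y, theta, alpha=0.1)
print(theta)  # one step from zero moves theta toward [1, 2]
```

Because the whole update is a single matrix expression, every component of $\theta$ is updated from the same old value of $\theta$.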

## Gradient Descent for Multivariate Linear Regressions
Our parameter space is now defined as the vector

$$ \theta \in \mathbb{R}^{n+1} $$

so our cost function can be written as 

$$ 
J (\theta) = \frac{1}{2 m} \sum_{i=1}^m \Big( h_\theta (x^{(i)}) - y^{(i)} \Big)^2
$$

which can also be written

$$ 
\begin{align}
J (\theta) & = \frac{1}{2 m} \sum_{i=1}^m \Big( \theta^T x^{(i)} - y^{(i)} \Big)^2 \\
& = \frac{1}{2 m} \sum_{i=1}^m \Bigg( \Big( \sum_{j=0}^n \theta_j x_j^{(i)} \Big) - y^{(i)} \Bigg)^2
\end{align}
$$
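
To make the two ways of writing the cost concrete, here is a small sketch that computes $J(\theta)$ once as an explicit double sum and once in matrix form. It assumes the $x_0 = 1$ column is already part of `X`, so the inner sum runs over every column including the intercept. The function names and the data are illustrative assumptions.

```python
import numpy as np

def cost_sum(X, y, theta):
    """J(theta) as an explicit sum over samples i and parameters j."""
    m, num_params = X.shape
    total = 0.0
    for i in range(m):
        prediction = sum(theta[j] * X[i, j] for j in range(num_params))
        total += (prediction - y[i]) ** 2
    return total / (2 * m)

def cost_vectorized(X, y, theta):
    """The same quantity written with matrix operations."""
    errors = X @ theta - y
    return (errors @ errors) / (2 * X.shape[0])

# Made-up data: y = 1 + 2 * x, with x_0 = 1 in the first column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

print(cost_sum(X, y, np.zeros(2)))                  # 35 / 6, about 5.83
print(cost_vectorized(X, y, np.array([1.0, 2.0])))  # 0.0, a perfect fit
```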

The process for gradient descent is therefore (where parameters are simultaneously updated)

$$
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
$$

By plugging in $J(\theta)$ and taking the partial derivative, our gradient descent update becomes, for each $j = 0, 1, \dots, n$,

$$
\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m \Big( h_\theta (x^{(i)}) - y^{(i)} \Big) x_j^{(i)}
$$

As before, this update is repeated until convergence.
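
The per-parameter update, including the simultaneous update, might look like the sketch below. It deliberately uses an explicit loop over $j$ to mirror the formula rather than the faster matrix form. Running a fixed number of iterations instead of testing for convergence is a simplification made for this example, and the function name and data are assumptions.

```python
import numpy as np

def gradient_descent(X, y, alpha, num_iters):
    """Gradient descent with an explicit per-parameter update.

    X must already contain the x_0 = 1 column.
    """
    m, num_params = X.shape
    theta = np.zeros(num_params)
    for _ in range(num_iters):
        errors = X @ theta - y  # h_theta(x^(i)) - y^(i) for every sample i
        new_theta = np.empty_like(theta)
        for j in range(num_params):
            new_theta[j] = theta[j] - (alpha / m) * np.sum(errors * X[:, j])
        # Simultaneous update: every new_theta[j] was computed from the old theta.
        theta = new_theta
    return theta

# Made-up data: y = 1 + 2 * x, with x_0 = 1 in the first column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta = gradient_descent(X, y, alpha=0.1, num_iters=5000)
print(theta)  # close to [1, 2]
```

Computing `errors` once per iteration, before any component of `theta` changes, is what makes the update simultaneous.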

### Feature Scaling
A tip for gradient descent is to ensure that all of the features are on a similar scale. This lets gradient descent converge more quickly, since the step taken in the simultaneous update is then roughly the same size for each parameter.

For example, going back to our house price example, say we have 
* $x_1$ : size (0-2000 sq ft)
* $x_2$ : number of bedrooms (1-5)

Our two features here are of very different scales, so a tip is to scale the features so that they range from approximately -1 to 1. In this example, we can rescale as such:
* $x_1$ : $\frac{\text{size}}{2000}$
* $x_2$ : $\frac{\text{number of bedrooms}}{5}$

With the features on similar scales, a gradient descent step is of similar size for each parameter, so the algorithm converges more quickly. The exact range matters less than having all of the features on the same order of magnitude.

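Another idea is mean normalization: subtract the mean from each feature so that it is centered around zero, then divide by a measure of its spread. To do so, remap as follows: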

$$ x_i := \frac{x_i - \mu_i}{\sigma_i} $$

where 
* $\mu_i$ is the mean value of feature $i$ over the training set
* $\sigma_i$ is the range (maximum minus minimum) or the standard deviation of feature $i$

In our example, if our mean house size is 1000 sq ft and our mean number of bedrooms is 2,
* $x_1$ : $\frac{\text{size} - 1000}{2000}$
* $x_2$ : $\frac{\text{number of bedrooms} - 2}{5}$

This maps each feature to a range of roughly -0.5 to 0.5.

This doesn't have to be exact. We just need the features to be of the same order of magnitude.
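
A minimal sketch of mean normalization, assuming NumPy and using the standard deviation as the measure of spread (the range would work the same way). The function name `mean_normalize` and the 1650 sq ft, 3 bedroom house are made up for this example.

```python
import numpy as np

def mean_normalize(features):
    """Return (scaled, mu, sigma) with scaled = (features - mu) / sigma per column."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / sigma, mu, sigma

# Size (sq ft) and number of bedrooms from the table earlier in these notes.
features = np.array([
    [2104.0, 5.0],
    [1416.0, 3.0],
    [1534.0, 3.0],
    [852.0, 2.0],
])

scaled, mu, sigma = mean_normalize(features)
print(scaled)  # every column now has mean 0 and standard deviation 1

# A new input must be scaled with the training-set mu and sigma, not its own.
new_house = np.array([1650.0, 3.0])
print((new_house - mu) / sigma)
```

The scaling is applied before the $x_0 = 1$ column is added, since that column has zero spread and cannot be normalized this way.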

### Debugging & The Learning Rate
It's useful to plot the value of the cost function $J(\theta)$ against the iteration number to check that the cost decreases after every iteration of gradient descent.

The number of iterations gradient descent needs to converge varies greatly from problem to problem, so this plot is also a good way to judge when the algorithm has converged.

A useful rule is to declare convergence when the cost function decreases by less than 0.001 in one iteration. Choosing such a threshold can be difficult, though, depending on the problem, so again, looking at the plot and finding where the curve flattens is a good tip.

If the curve is increasing or oscillating instead of decreasing, it's a good indicator that your learning rate is too large. If the learning rate is too small, the algorithm will converge very slowly. It's suggested to try a *range* of values, including one that's too small and one that's too large, and pick the rate that converges best.
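
The debugging routine described above can be sketched as follows: record the cost after every iteration, stop when it rises (learning rate too large) or when it drops by less than the threshold, and repeat for a range of learning rates. The function names, the status strings, the candidate rates, and the data are all assumptions made for this illustration. The recorded `history` list is what one would plot against the iteration number.

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = 1 / (2m) * sum of squared errors."""
    errors = X @ theta - y
    return (errors @ errors) / (2 * len(y))

def run_gradient_descent(X, y, alpha, max_iters=1000, tol=1e-3):
    """Run gradient descent, recording the cost after every iteration.

    Returns (theta, history, status), where status is "converged",
    "diverged" (the cost went up), or "max_iters".
    """
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = [cost(X, y, theta)]
    for _ in range(max_iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
        history.append(cost(X, y, theta))
        if history[-1] > history[-2]:
            return theta, history, "diverged"
        if history[-2] - history[-1] < tol:
            return theta, history, "converged"
    return theta, history, "max_iters"

# Made-up data: y = 1 + 2 * x, with x_0 = 1 in the first column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

results = {}
for alpha in [0.001, 0.01, 0.1, 1.0]:
    theta, history, status = run_gradient_descent(X, y, alpha)
    results[alpha] = status
    print(f"alpha={alpha}: {status} after {len(history) - 1} iterations, "
          f"final cost {history[-1]:.4f}")
```

With a very small learning rate the threshold test can fire while the cost is still far from its minimum, which is one reason the plot is a more reliable guide than the threshold alone.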

### Features and Polynomial Regression
Back to the housing example, suppose our hypothesis uses two features: the "frontage", the width of the plot of land on which the house sits, and the "depth", the length of that plot. We can form a new model that instead uses a single feature, the land area of the plot (frontage times depth), which reduces the number of features from two to one. However, a higher-degree polynomial in that feature may now be a better fit than a simple line. Let's say we've chosen a cubic function as our hypothesis, writing the area as "size". Our hypothesis function now looks as such:

$$
\begin{align}
h_\theta (x) & = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 \\
& = \theta_0 + \theta_1 (\text{size}) + \theta_2 (\text{size})^2 + \theta_3 (\text{size})^3 
\end{align}
$$

So we have now set
* $x_1 = (\text{size})$
* $x_2 = (\text{size})^2$
* $x_3 = (\text{size})^3$

**Note**: since we've chosen our features like this, feature scaling is much more important now. Squaring and cubing the size spreads the features across very different scales (if the size ranges up to $10^3$, its square ranges up to $10^6$ and its cube up to $10^9$), and this could slow down gradient descent much more so than in the linear case.
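
As a rough sketch of how such features could be built, the snippet below expands a single size column into its powers and then scales each column. The helper name `polynomial_features` and the size values are invented for this example.

```python
import numpy as np

def polynomial_features(size, degree):
    """Columns size, size^2, ..., size^degree for a 1-D array of sizes."""
    return np.column_stack([size ** d for d in range(1, degree + 1)])

# Made-up plot sizes, for illustration only.
size = np.array([10.0, 20.0, 30.0, 40.0])

features = polynomial_features(size, degree=3)
print(features.max(axis=0).tolist())  # [40.0, 1600.0, 64000.0]

# After mean normalization all three columns are on a comparable scale again.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)
print(np.abs(scaled).max(axis=0))
```

The unscaled columns already differ by three orders of magnitude on this tiny example, which is why the scaling step matters more here than in the linear case.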

We have a lot of freedom in choosing our features. We could have set it to a quadratic polynomial:

$$
\begin{align}
h_\theta (x) & = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \\
& = \theta_0 + \theta_1 (\text{size}) + \theta_2 (\text{size})^2  
\end{align}
$$

But we have to think about what this implies. A quadratic is symmetric about its vertex, so while it may fit the available training data reasonably well, it may not be a great model. We'd expect the price to increase with the size of the plot of land, and minimizing the cost function will make the quadratic increase with size over the range of the training data. But since it's a parabola, it has to turn around somewhere, so outside that range its predictions will be unrealistic. 

A better choice might be

$$
\begin{align}
h_\theta (x) & = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \\
& = \theta_0 + \theta_1 (\text{size}) + \theta_2 (\sqrt{\text{size}})  
\end{align}
$$

The main point is that you need to think about the implications of the model you want to use, but you have a lot of freedom. Ultimately, we'll look at an algorithm that will find the best type of function for us.

## Computing Parameters Analytically - The Normal Equation
For some linear regression problems, the normal equation gives us a much better way to solve for the optimal values of the parameters in our hypothesis. So far we have used gradient descent, an iterative algorithm that takes many steps to converge to the global minimum of the cost function. In contrast, the normal equation lets us solve for $\theta$ analytically: rather than running an iterative algorithm, we solve for the optimal values of the parameters all at once, in essentially one step.

Let's look back at our example:

| Size (sq ft) | Num. bedrooms | Num. floors | Age of home | Price ($1000) |
|---|---|---|---|---|
| 2104 | 5 | 1 | 45 | 460 |
| 1416 | 3 | 2 | 40 | 232 |
| 1534 | 3 | 2 | 30 | 315 |
| 852 | 2 | 1 | 36 | 178 |

So our matrix system is such that

$$
X = 
\begin{bmatrix}
1 & 2104 & 5 & 1 & 45 \\
1 & 1416 & 3 & 2 & 40 \\
1 & 1534 & 3 & 2 & 30 \\
1 & 852 & 2 & 1 & 36
\end{bmatrix},
y = 
\begin{bmatrix}
460 \\
232 \\
315 \\
178
\end{bmatrix}
$$

Here,
* $X$ is an $m \times (n+1)$ matrix
* $y$ is an $m$-dimensional vector

And, through some linear algebra steps that are a bit more involved, we find:

$$
\theta = (X^T X)^{-1} X^T y
$$
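
One way to fill in those steps, as a sketch that assumes $X^T X$ is invertible, is to write the cost function in matrix form, set its gradient with respect to $\theta$ to zero, and solve:

$$
\begin{align}
J (\theta) & = \frac{1}{2 m} (X \theta - y)^T (X \theta - y) \\
\nabla_\theta J (\theta) & = \frac{1}{m} X^T (X \theta - y) = 0 \\
X^T X \theta & = X^T y \\
\theta & = (X^T X)^{-1} X^T y
\end{align}
$$

The gradient in the second line is the same quantity that, multiplied by $\alpha$, is subtracted from $\theta$ in each vectorized gradient descent step.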

In this method, we don't need to do feature scaling, since we're not taking incremental steps via gradient descent to get to an optimal value for our parameters.
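
A minimal NumPy sketch of the normal equation follows. Two choices here are assumptions made for the example rather than part of the notes: it solves the linear system $(X^T X) \theta = X^T y$ with `np.linalg.solve` instead of forming the inverse explicitly, and it keeps only the size and bedroom columns, because the table has just four samples and using all four features plus the intercept would leave $X^T X$ singular.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta.

    X must already contain the x_0 = 1 column, and X^T X must be invertible.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Columns: x_0 = 1, size (sq ft), number of bedrooms. No feature scaling applied.
X = np.array([
    [1.0, 2104.0, 5.0],
    [1.0, 1416.0, 3.0],
    [1.0, 1534.0, 3.0],
    [1.0, 852.0, 2.0],
])
y = np.array([460.0, 232.0, 315.0, 178.0])  # price in $1000s

theta = normal_equation(X, y)
print(theta)      # fitted parameters
print(X @ theta)  # fitted prices, to compare against y
```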

Here are some pros and cons of the two methods:

Gradient Descent | Normal Equation
--- | ---
Need to choose $\alpha$ | Don't need to choose $\alpha$
Needs many iterations | No need to iterate
Works well even for large $n$ | Slow if $n$ is very large
Don't have to compute matrix inverse | Have to compute $(X^T X)^{-1}$

### Normal Equation and Noninvertibility
In the normal equation method, we have to compute

$$
(X^T X)^{-1},
$$

but what if $X^T X$ is non-invertible? This is rare, but when it occurs, one of two causes is usually responsible:
1. The features are redundant
   * For example:
     * $x_1$ = size in sq. ft.
     * $x_2$ = size in sq. m
2. Too many features ($m \leq n$)
   * Delete some features, or use regularization
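
Both causes can be detected by checking the rank of $X$, since $X^T X$ is invertible only when $X$ has as many independent columns as it has columns. The sketch below does this for a redundant-feature matrix and for the full four-sample table from these notes. Falling back to the pseudo-inverse (`np.linalg.pinv`) is a common workaround that these notes do not cover, shown here only as an assumption about one reasonable approach. The conversion factor 0.092903 (sq ft to sq m) and the variable names are part of the illustration.

```python
import numpy as np

# Cause 1: redundant features. Size in sq m is a constant multiple of size in sq ft.
size_sqft = np.array([2104.0, 1416.0, 1534.0, 852.0])
X_redundant = np.column_stack([np.ones(4), size_sqft, size_sqft * 0.092903])
rank_redundant = np.linalg.matrix_rank(X_redundant)
print(rank_redundant, "independent columns out of", X_redundant.shape[1])

# Cause 2: too many features. Four samples, but five parameters to fit.
X_full = np.array([
    [1.0, 2104.0, 5.0, 1.0, 45.0],
    [1.0, 1416.0, 3.0, 2.0, 40.0],
    [1.0, 1534.0, 3.0, 2.0, 30.0],
    [1.0, 852.0, 2.0, 1.0, 36.0],
])
y = np.array([460.0, 232.0, 315.0, 178.0])
rank_full = np.linalg.matrix_rank(X_full)
print(rank_full, "independent columns out of", X_full.shape[1])

# The pseudo-inverse still returns a least-squares solution (the one with the
# smallest norm) even though X^T X cannot be inverted.
theta = np.linalg.pinv(X_full) @ y
print(X_full @ theta)  # fitted prices, to compare against y
```

With more parameters than samples the fit reproduces the training prices exactly, which says nothing about how well it would predict new houses. That is why deleting features or using regularization is the suggested remedy.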