# 2. Multivariate Linear Regression

In this section we will learn about a more powerful form of linear regression, one that can deal with multiple variables (or features) $x_1, \ldots, x_n$.

### 2.1 Notation

$n$ = number of variables / features  
$x^{(i)}$ = input / features of $i^{th}$ training sample  
$x^{(i)}_j$ = value of feature $j$ in $i^{th}$ training sample  

Therefore in this notation, $x^{(i)}$ is an $n$-dimensional **vector** (as above, with $n$ being the number of features / variables). 

In this case our hypothesis function will look like:
    
$h_{\theta}(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \cdots + \theta_nx_n $

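For convenience, we will define $x_0 = 1$ (that is, $x^{(i)}_0 = 1$ for every training sample), so that $x$ and $\theta$ both become $(n+1)$-dimensional vectors.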

Our vectors will therefore look like:
    
$x = \left[
\begin{array}{c}
x_0 \\
x_1 \\
\vdots \\
x_n
\end{array}
\right]$

$\theta = \left[
\begin{array}{c}
\theta_0 \\
\theta_1 \\
\vdots \\
\theta_n
\end{array}
\right]$

Putting them together, we can now write our hypothesis function as:

$h_{\theta}(x) = \theta^T x$, with $\theta^T x$ being the matrix product of the row vector $\theta^T$ and the column vector $x$ as defined above (i.e. their inner product). This is a **vectorization** of our hypothesis function for one training example.
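
As a minimal sketch of this inner product, the hypothesis for a single sample can be written in plain Python with no third-party libraries. The function name `hypothesis` and the numbers are illustrative assumptions, not part of the original notes.

```python
def hypothesis(theta, x):
    """Return h_theta(x) = theta^T x for one training sample.

    theta and x are equal-length lists of n + 1 numbers; x[0] is
    expected to be the constant 1 (the x_0 convention defined above).
    """
    if len(theta) != len(x):
        raise ValueError("theta and x must both have n + 1 entries")
    # Inner product: theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n
    return sum(t * xj for t, xj in zip(theta, x))


# Hypothetical example with n = 2 features.
theta = [1.0, 2.0, 3.0]  # theta_0, theta_1, theta_2
x = [1.0, 4.0, 5.0]      # x_0 = 1, x_1 = 4, x_2 = 5
prediction = hypothesis(theta, x)  # 1*1 + 2*4 + 3*5 = 24.0
```

With a linear-algebra library the same computation would be a single dot product; the explicit sum is used here only to keep the sketch dependency-free.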

### 2.2 Gradient Descent for multiple variables 

_Repeat until convergence:_  
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \ldots, \theta_n)$  

Simultaneously update $\theta_j$ for every $j = 0, \ldots, n$.

---

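Substituting the partial derivatives of the cost function $J$, the update rule becomes:

_Repeat until convergence:_

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}$

$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)}$

$\theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)}$

and in general, for every $j = 0, \ldots, n$ (updated simultaneously):

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$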
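
The update rule can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the names `gradient_step` and `gradient_descent`, the toy data, and the values of `alpha` and `iterations` are all hypothetical. The gradient term is $\frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$.

```python
def predict(theta, x):
    """h_theta(x) = theta^T x for one sample (x[0] must be 1)."""
    return sum(t * xj for t, xj in zip(theta, x))


def gradient_step(theta, X, y, alpha):
    """Return the new theta after one simultaneous update."""
    m = len(X)
    # Errors are computed once from the OLD theta, so every theta_j
    # is updated simultaneously rather than one after another.
    errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
    gradient = [
        sum(e * xi[j] for e, xi in zip(errors, X)) / m
        for j in range(len(theta))
    ]
    return [t - alpha * g for t, g in zip(theta, gradient)]


def gradient_descent(X, y, alpha, iterations):
    """Run a fixed number of updates starting from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = gradient_step(theta, X, y, alpha)
    return theta


# Toy data generated from y = 1 + 2*x_1 + 3*x_2 (first column is x_0 = 1).
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]
theta = gradient_descent(X, y, alpha=0.1, iterations=5000)
# theta ends up very close to [1.0, 2.0, 3.0]
```

Building the whole `gradient` list before touching `theta` is what makes the update simultaneous; updating `theta[0]` in place and then using it to compute the gradient for `theta[1]` would be a different (and incorrect) algorithm.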

### 2.3 Feature Scaling 

Idea: if the features are on a similar scale (within the same range), gradient descent will converge more quickly.

Practically, we want to get every feature _approximately_ into a $-1 \le x_j \le 1$ range.

#### Mean normalization

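We can replace $x_j$ with $\frac{x_j - \mu_j}{s_j}$ to normalize features, where $\mu_j$ is the mean of feature $j$ over the training set and $s_j$ is either its range ($\max - \min$) or its standard deviation. This does not apply to $x_0$, which stays equal to 1.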
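
A minimal sketch of mean normalization for one feature column, in plain Python. The helper name `mean_normalize`, the choice of the range as the scale, and the house sizes are assumptions made for illustration.

```python
def mean_normalize(column):
    """Return (scaled_values, mu, s) for one feature column.

    mu is the mean of the column and s is its range (max - min);
    the standard deviation is another common choice for s.
    """
    mu = sum(column) / len(column)
    s = max(column) - min(column)
    if s == 0:
        raise ValueError("a constant feature cannot be scaled")
    return [(value - mu) / s for value in column], mu, s


# Hypothetical house sizes (one feature over five training samples).
sizes = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
scaled, mu, s = mean_normalize(sizes)
# mu = 2000.0, s = 2000.0, scaled = [-0.5, -0.25, 0.0, 0.25, 0.5]
```

The returned `mu` and `s` should be kept: any new input has to be scaled with the same values before it is passed to the hypothesis function.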

### 2.4 Learning Rate

"**Debugging**" gradient descent, or better yet, _ensuring it is working properly_, can be done in different ways:

1. By plotting $J(\theta)$ as a function of the number of iterations. Ideally, we get a smooth, monotonically decreasing curve that flattens out as gradient descent converges.

2. By creating an automatic convergence test (e.g. declare convergence if $J(\theta)$ decreases by less than $10^{-3}$ in one iteration).

**Note**: It can be shown that for a sufficiently small $\alpha$, $J(\theta)$ decreases on every iteration. If $\alpha$ is too small, however, convergence is slow; if it is too large, $J(\theta)$ may fail to decrease on every iteration, or even diverge.
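
Both checks can be sketched together in plain Python. The sketch assumes the squared-error cost $J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2$; the names `cost`, `step`, and `run_until_converged`, the toy data, and the default `epsilon` are hypothetical.

```python
def cost(theta, X, y):
    """Squared-error cost J(theta) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(X)
    return sum(
        (sum(t * xj for t, xj in zip(theta, xi)) - yi) ** 2
        for xi, yi in zip(X, y)
    ) / (2 * m)


def step(theta, X, y, alpha):
    """One simultaneous gradient descent update of every theta_j."""
    m = len(X)
    errors = [sum(t * xj for t, xj in zip(theta, xi)) - yi
              for xi, yi in zip(X, y)]
    return [t - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j, t in enumerate(theta)]


def run_until_converged(X, y, alpha, epsilon=1e-3, max_iterations=10000):
    """Iterate until J decreases by less than epsilon in one iteration."""
    theta = [0.0] * len(X[0])
    history = [cost(theta, X, y)]  # J after 0, 1, 2, ... iterations
    for _ in range(max_iterations):
        theta = step(theta, X, y, alpha)
        history.append(cost(theta, X, y))
        decrease = history[-2] - history[-1]
        if decrease < 0:
            raise RuntimeError("J increased; try a smaller alpha")
        if decrease < epsilon:
            break  # automatic convergence test
    return theta, history


# Toy data generated from y = 1 + 2*x_1 (first column is x_0 = 1).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
theta, history = run_until_converged(X, y, alpha=0.1)
```

`history` holds the values one would plot against the iteration number. A threshold such as $10^{-3}$ is problem-dependent, which is why inspecting the curve is often more informative than the automatic test alone.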

### 2.5 Features and Polynomial Regression

There are several options when it comes to choosing features in order to **modify the form of our hypothesis function**. We can:

1. (Of course) use the features we already have
2. Create our own (e.g. for house prices: $Area = Frontage \times Depth$)
3. Manipulate features to obtain different models (e.g. a _quadratic_ or _cubic_ model)

**Note**: bear in mind how the range of values changes after such a manipulation, so that the features can be normalized properly (e.g. if $x$ ranges from 1 to 1000, then $x^2$ ranges from 1 to $10^6$).
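
For instance, a cubic model $h_{\theta}(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$ is still linear regression once we set $x_1 = x$, $x_2 = x^2$, $x_3 = x^3$. The plain-Python sketch below builds such features and shows how quickly their ranges drift apart; the helper names and the sample values are illustrative assumptions.

```python
def polynomial_features(x, degree):
    """Map a scalar x to the list [x, x^2, ..., x^degree]."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    features = []
    power = 1.0
    for _ in range(degree):
        power *= x  # x, then x^2, then x^3, ...
        features.append(power)
    return features


def feature_ranges(rows):
    """Return (min, max) for each feature column of a list of rows."""
    return [(min(column), max(column)) for column in zip(*rows)]


# Hypothetical values of a single original feature.
values = [10.0, 20.0, 30.0]
rows = [polynomial_features(v, 3) for v in values]
ranges = feature_ranges(rows)
# ranges == [(10.0, 30.0), (100.0, 900.0), (1000.0, 27000.0)]
```

The three derived features span very different ranges, which is why feature scaling matters even more for polynomial models than for the original features.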

### 2.6 Computing Parameters Analytically - Normal Equation

The normal equation allows us to solve for $\theta$ analytically, in one step, instead of iterating as gradient descent does.

**Intuition**

* Find the minimum of the cost function: set $\frac{\partial}{\partial \theta_j} J(\theta) = 0$ for every $j$
* Solve for $\theta$
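
These two steps can be sketched in matrix form. The sketch assumes the squared-error cost $J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2$, and writes $X$ for the $m \times (n+1)$ matrix whose $i^{th}$ row is the transposed sample $x^{(i)}$ and $y$ for the vector of target values. The cost can then be written as:

$J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)$

Setting all partial derivatives to zero at once gives:

$\nabla_{\theta} J(\theta) = \frac{1}{m} X^T (X\theta - y) = 0$

which rearranges to a linear system in $\theta$:

$X^T X \theta = X^T y$

Solving this system is possible whenever $X^T X$ is invertible.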

**Equation**

Assuming $X$ to be the feature matrix of dimensions $m \times (n+1)$ and $y$ to be an $m$-dimensional vector of the actual values, our $\theta$ will be:

$\theta = (X^T X)^{-1}X^Ty$

Note: $m$ = number of training examples | $n$ = number of features

In [None]:
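# Normal equation in Octave.
# Assumes X (m x (n+1), first column all ones) and y (m x 1) are already defined.
# pinv (pseudo-inverse) still returns a solution when X' * X is singular.
theta = pinv(X' * X) * X' * y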
