# Multivariate Linear Regression

Now we are going to work through some examples of **multivariate linear regression**. However, instead of writing our implementation in plain Python without any help, we are going to use a Python package called **scikit-learn**.

But first, let's define some notation:

* $x_{j}^{(i)}=$ the value of feature $j$ in the $i^{th}$ training example
* $x^{(i)}=$ the input (features) of the $i^{th}$ training example
* $m=$ the number of training examples
* $n=$ the number of features

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

$h_{\theta}(x)=\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+\theta_{3}x_{3}+ \cdots +\theta_{n}x_{n}$

In order to develop intuition about this function, we can think about $\theta_{0}$ as the basic price of a house, $\theta_{1}$ as the price per square meter, $\theta_{2}$ as the price per floor, etc. $x_{1}$ will be the number of square meters in the house, $x_{2}$ the number of floors, etc.

Using the definition of matrix multiplication, our multivariate hypothesis function can be concisely represented as:

$h_{\theta}(x)=\begin{bmatrix}
\theta_{0} & \theta_{1} & \ldots & \theta_{n}
\end{bmatrix}
\begin{bmatrix}
x_{0} \\ x_{1} \\ \vdots \\ x_{n}
\end{bmatrix}=\theta^{T}x$

This is a vectorization of our hypothesis function for one training example. Remember that in this case we are assuming that $x_{0}^{(i)} = 1$ for $i \in \{1,\ldots,m\}$, so that $\theta$ and $x$ both have $n+1$ entries.
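As a minimal sketch of the vectorized hypothesis, the function below computes $\theta^{T}x$ for one training example using plain Python lists. The function name `hypothesis` and the parameter values are made up for illustration; they follow the house-price intuition (base price, price per square meter, price per floor).

```python
def hypothesis(theta, features):
    """Return h_theta(x) = theta^T x for one training example.

    `features` holds x_1..x_n; the constant input x_0 = 1 is prepended here.
    """
    x = [1.0] + list(features)  # x_0 = 1 pairs with theta_0
    if len(theta) != len(x):
        raise ValueError("theta must have exactly one more entry than features")
    return sum(t * xj for t, xj in zip(theta, x))


# Hypothetical parameters: base price, price per square meter, price per floor.
theta = [50000.0, 1200.0, 8000.0]
house = [100.0, 2.0]  # 100 square meters, 2 floors

print(hypothesis(theta, house))  # 50000 + 1200*100 + 8000*2
```

Prepending $x_{0}=1$ inside the function keeps the caller from having to remember the convention; storing it in the data instead would work just as well.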

## Gradient Descent for Multiple Variables

The gradient descent equation keeps the same general form; we just have to repeat it for each of our $n$ features:

Repeat until convergence:{
$$\begin{equation}
    \begin{split}
        \theta_{0} := \theta_{0} - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)} \\
        \theta_{1} := \theta_{1} - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{1}^{(i)} \\ 
        \theta_{2} := \theta_{2} - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{2}^{(i)} \\ 
        \ldots       
    \end{split}
\end{equation}$$
}

The equations above can be written more compactly as:

$$\begin{equation}
    \begin{split}
        \theta_{j} := \theta_{j} - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)} \qquad \text{for } j := 0, \ldots, n
    \end{split}
\end{equation}$$
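The update rule above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the toy data are assumptions, each row of `X` is assumed to already start with $x_{0}=1$, and all $\theta_{j}$ are updated simultaneously (every error is computed with the old parameters before any parameter changes).

```python
def predict(theta, x):
    """h_theta(x) for one example whose first entry is x_0 = 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def gradient_descent_step(theta, X, y, alpha):
    """Apply one simultaneous update of every theta_j."""
    m = len(X)
    # Compute all errors with the current theta before changing anything.
    errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
    return [
        theta_j - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
        for j, theta_j in enumerate(theta)
    ]


# Toy data generated from y = 1 + 2*x1 + 3*x2 (made up for illustration).
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]

theta = [0.0, 0.0, 0.0]
for _ in range(5000):
    theta = gradient_descent_step(theta, X, y, alpha=0.1)

print([round(t, 3) for t in theta])
```

On this noise-free toy data the parameters should approach the values used to generate `y`.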

The image below summarizes both gradient descent algorithms (one variable and multiple variables):

![GDMultiOne](./images/GDOneMulti.png "Gradient Descent Comparison")

### Scaling our algorithm

We would always like our gradient descent algorithm to converge quickly to an optimal value, but this can be difficult. If you want to make your algorithm converge faster, you could try these options:

* The first option is called **feature scaling**
* The second one is called **mean normalization**

In both cases the idea is to have all of our features close to the same range of values.

![Scaling Algorithm](./images/scaling.png "Scaling our algorithm")

#### Feature Scaling
This technique involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.

#### Mean Normalization
In this case we subtract the average value of an input variable from each of its values, resulting in a new average value of zero for that input variable.

To implement both of these techniques, adjust your input values as shown in this formula:

$$x_{i}:=\frac{x_{i} - \mu_{i}}{s_{i}}$$

In the equation above, $\mu_{i}$ is the average of all values for feature $i$, and $s_{i}$ is either the range of values $(max-min)$ or the standard deviation.

Remember that dividing by the standard deviation or by the range gives you different results.

For example, if $x_{i}$ represents housing prices with a range of 100 to 2000 and a mean value of 1000, then $x_{i}:=\frac{\text{price} - 1000}{1900}$.
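The formula can be tried out with a small helper like the one below. It is a sketch, not a reference implementation: the name `normalize` and the sample prices are invented, and the standard deviation used here is the population standard deviation (`statistics.pstdev`), which is one of several reasonable choices.

```python
from statistics import mean, pstdev


def normalize(values, use_std=False):
    """Return (x - mu) / s for every value of one feature.

    s is the range (max - min) by default, or the population standard
    deviation when use_std is True.
    """
    mu = mean(values)
    s = pstdev(values) if use_std else max(values) - min(values)
    if s == 0:
        raise ValueError("feature is constant; there is nothing to scale")
    return [(v - mu) / s for v in values]


# Made-up prices chosen so that min = 100, max = 2000 and mean = 1000.
prices = [100.0, 900.0, 1000.0, 2000.0]

by_range = normalize(prices)               # (price - 1000) / 1900
by_std = normalize(prices, use_std=True)   # (price - 1000) / std

print(by_range)
print(by_std)
```

Both results have mean zero, but the values differ: dividing by the range gives a spread of exactly 1, while dividing by the standard deviation gives a standard deviation of 1.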

### Learning Rate and Debugging

As you know, our gradient descent algorithm needs to be tuned to obtain good results. It is important to choose a good **learning rate**. One way to do this is to plot the **cost function** against the number of iterations and readjust the **learning rate** based on what the curve shows.

![Learning Rate Debugging](./images/convergeGD.png "Debugging")

You could also create an automatic convergence test: declare convergence, and stop the algorithm, if $J(\theta)$ decreases by less than $10^{-3}$ in one iteration.
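A convergence test of this kind could look like the sketch below. The cost function is not defined in this section, so the code assumes the usual squared-error cost $J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}$, which is consistent with the update rule used earlier. All function names and the toy data are hypothetical.

```python
def predict(theta, x):
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared errors (assumed definition)."""
    m = len(X)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


def step(theta, X, y, alpha):
    """One simultaneous gradient descent update."""
    m = len(X)
    errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
    return [t - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j, t in enumerate(theta)]


def run_until_converged(X, y, alpha, tol=1e-3, max_iters=10000):
    """Stop as soon as J decreases by less than tol in one iteration."""
    theta = [0.0] * len(X[0])
    history = [cost(theta, X, y)]  # J after 0, 1, 2, ... iterations
    for _ in range(max_iters):
        theta = step(theta, X, y, alpha)
        history.append(cost(theta, X, y))
        decrease = history[-2] - history[-1]
        # A negative decrease means J went up, which is not convergence.
        if 0 <= decrease < tol:
            break
    return theta, history


# Toy data (rows start with x_0 = 1), made up for illustration.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]

theta, history = run_until_converged(X, y, alpha=0.1)
print(len(history) - 1, "iterations, final J =", history[-1])
```

The returned `history` list is exactly what you would plot against the iteration number to inspect convergence by eye.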

#### How to know if Gradient Descent is working correctly?

![Gradient Descent Debugging](./images/debugGD.png "Debugging")

For sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration.
But if $\alpha$ is too small, gradient descent can be slow to converge.

To choose $\alpha$, try $\ldots,0.001,0.003,0.01,0.03,0.1,0.3,1\ldots$
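To see how these candidates behave, the sketch below runs a fixed number of iterations for each $\alpha$ and checks whether $J(\theta)$ went down on every iteration. It assumes the squared-error cost $\frac{1}{2m}\sum(h_{\theta}(x^{(i)})-y^{(i)})^{2}$ and uses made-up toy data; the extra value 3.0 is added only to show what a too-large learning rate looks like on this particular data set.

```python
def cost_history(X, y, alpha, iters=50):
    """Run batch gradient descent from theta = 0 and record J at each step.

    history[0] is the initial cost, history[k] the cost after k updates.
    """
    m, theta, history = len(X), [0.0] * len(X[0]), []
    for _ in range(iters + 1):
        errors = [sum(t * xj for t, xj in zip(theta, xi)) - yi
                  for xi, yi in zip(X, y)]
        history.append(sum(e * e for e in errors) / (2 * m))
        # Simultaneous update of every theta_j using the errors above.
        theta = [t - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
                 for j, t in enumerate(theta)]
    return history


# Toy data (rows start with x_0 = 1), made up for illustration.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]

candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0]
results = {alpha: cost_history(X, y, alpha) for alpha in candidates}

for alpha, history in results.items():
    decreasing = all(b <= a for a, b in zip(history, history[1:]))
    print(f"alpha={alpha:<6} final J={history[-1]:.3e} decreasing={decreasing}")
```

On this data the smallest values decrease $J(\theta)$ steadily but slowly, while the largest one makes the cost grow instead of shrink. Which values are "too small" or "too large" depends on the data, so the thresholds seen here do not carry over to other problems.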


### Polynomial Regression

We can improve our features and the form of our hypothesis function in a couple of different ways.

It is possible to combine multiple features into one. For example, we can combine $x_{1}$ and $x_{2}$ into a new feature $x_{3}$ by taking $x_{1} \cdot x_{2}$.

Our hypothesis function need not be a straight line if a straight line does not fit the data well.

We could change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).

For example, consider the hypothesis function $h_{\theta }(x)=\theta_{0}+\theta_{1}x_{1}$. We can create additional features based on $x_{1}$ to get the quadratic function:

$$h_{\theta}(x)=\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{1}^{2}$$

or the cubic function:

$$h_{\theta}(x)=\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{1}^{2}+\theta_{3}x_{1}^{3}$$

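In the cubic version, we have created new features $x_{2}=x_{1}^{2}$ and $x_{3}=x_{1}^{3}$.

Remember that feature scaling becomes very important when you create features this way: if $x_{1}$ ranges from 1 to 1000, then $x_{1}^{2}$ ranges from 1 to $10^{6}$ and $x_{1}^{3}$ from 1 to $10^{9}$.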
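The sketch below builds the cubic features from a single input and then scales each new column. The helper names and the sample values of $x_{1}$ are assumptions chosen for illustration; scaling here means mean normalization followed by division by the range.

```python
def polynomial_features(x1, degree=3):
    """Return [x1, x1**2, ..., x1**degree] for one input value."""
    return [x1 ** p for p in range(1, degree + 1)]


def scale_columns(rows):
    """Mean-normalize each column and divide it by its range (max - min)."""
    columns = list(zip(*rows))
    mus = [sum(c) / len(c) for c in columns]
    ranges = [max(c) - min(c) for c in columns]
    return [[(v - mu) / r for v, mu, r in zip(row, mus, ranges)]
            for row in rows]


sizes = [1.0, 10.0, 100.0, 1000.0]  # hypothetical raw values of x1
raw = [polynomial_features(s) for s in sizes]
scaled = scale_columns(raw)

print(raw[-1])     # x1, x1^2 and x1^3 differ by orders of magnitude
print(scaled[-1])  # after scaling, every column spans a range of 1
```

Without the scaling step the cubic column would dominate the others by several orders of magnitude, which is the situation that tends to slow gradient descent down.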