# Multiple Linear Regression

In univariate linear regression we had only one feature. Once you have more than one feature, the model is no longer univariate. We write $x_{j}$ for the $j^{th}$ feature, where $j = 1, \dots, n$ and $n$ is the number of features. For example, if you have 4 features, $j$ runs from 1 to 4.

A subscript that can be confusing at first is $x_{j}^{(i)}$. It simply means the value of the $j^{th}$ feature (column) in the $i^{th}$ training example (row). For example, $x_{3}^{(2)}$ is the value in the 2nd row and the 3rd column. As a reminder, $\vec{x}^{(i)}$ denotes the list (vector) that stores all the features of the $i^{th}$ training example.

Our model also changes now that we have multiple features. It is defined as $$ f_{w, b}(x) = w_{1}x_{1} + \cdots + w_{n}x_{n} + b$$ There is an easier way to think about this equation:

* $\vec{w} = [w_{1} w_{2} w_{3} \dots w_{n}]$ = parameters of the model
* $\vec{x} = [x_{1} x_{2} x_{3} \dots x_{n}]$

Thus we can rewrite the model as $f_{\vec{w}, b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$. This is known as **multiple linear regression**.
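
To make the notation concrete, here is a small NumPy sketch. The numbers in `X`, `w`, and `b` are made up for illustration and are not from these notes. Note that NumPy indexes from 0, so the 2nd row and 3rd column are index `1` and index `2`.

```python
import numpy as np

# Hypothetical training set: m = 3 examples (rows), n = 4 features (columns).
X = np.array([
    [2104.0, 5.0, 1.0, 45.0],
    [1416.0, 3.0, 2.0, 40.0],
    [852.0, 2.0, 1.0, 35.0],
])

m, n = X.shape  # m = number of training examples, n = number of features

# x^(2): all features of the 2nd training example (index 1 in NumPy).
x_2 = X[1]

# x_3^(2): the 3rd feature of the 2nd training example.
x_3_2 = X[1, 2]

# Illustrative (made-up) parameters: one weight per feature plus a scalar b.
w = np.array([0.1, 4.0, 10.0, -2.0])
b = 80.0

# The model written out term by term: w_1*x_1 + ... + w_n*x_n + b.
f_wb = sum(w[j] * x_2[j] for j in range(n)) + b
print(f_wb)
```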

## Vectorization

Vectorization makes your code shorter and much faster than looping over the features one at a time. We use the **dot product**, which was shown in the equation above. It turns the equation $$f_{\vec{w}, b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$$ into a one-liner in code instead of having to run a for loop. Here is the one line: `f_wb = np.dot(w, x) + b`. Isn't that neat? Vectorization allows our algorithms to be efficient and to scale well to large data sets.
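
As a rough sketch of the difference, the two functions below compute the same prediction, once with an explicit loop and once with `np.dot`. The function names and the sample values are placeholders chosen for this example.

```python
import numpy as np

def predict_loop(x, w, b):
    """Prediction with an explicit loop over the n features."""
    f_wb = 0.0
    for j in range(x.shape[0]):
        f_wb += w[j] * x[j]
    return f_wb + b

def predict_vectorized(x, w, b):
    """Same prediction, using the dot product of w and x."""
    return np.dot(w, x) + b

# Made-up feature vector and parameters.
x = np.array([10.0, 20.0, 30.0])
w = np.array([1.0, 2.5, -3.3])
b = 4.0

loop_result = predict_loop(x, w, b)
vec_result = predict_vectorized(x, w, b)
print(loop_result, vec_result)
```

Both versions agree up to floating-point rounding. The vectorized one tends to pull ahead as the number of features grows, because the work happens inside NumPy's compiled routines rather than in the Python loop.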

## Gradient Descent for MLR

Our model is now given by $$f_{\vec{w}, b}(\vec{x}) = \vec{w}\cdot \vec{x} + b$$ where
* $\vec{w} = [ w_{1} \cdots w_{n}]$
* $b$ is still a number/scalar
* $\cdot$ is the dot product

Our cost function is given by $J(\vec{w}, b)$. When we have multiple features, gradient descent becomes a little different. Before, with a single feature, we only had one $w$ to update together with $b$, using the general update rule from the previous notes. When we have $n$ features, where $n \geq 2$, each $w_{j}$ gets its own update. For $j = 1, \dots, n$ the new way to update $w, b$ is given by $$w_{j} = w_{j} - \alpha \frac{1}{m} \sum_{i = 1}^{m} (f_{\vec{w}, b} (\vec{x}^{(i)}) - y^{(i)})x_{j}^{(i)} $$

$$b = b - \alpha \frac{1}{m} \sum_{i = 1}^{m} (f_{\vec{w}, b} (\vec{x}^{(i)}) - y^{(i)})$$

Here $w_{j}$ $(j = 1, \dots, n)$ and $b$ are updated simultaneously: every new value is computed from the old parameters before any of them is overwritten.

There is also an alternative way to find $w, b$ for linear regression. This is known as the **normal equation**. It does not need any iterations, as it solves for $w, b$ directly. However, it only works for **linear regression**, and it can become slow when the number of features is large. It is best not to worry about it too much since it is so specific; gradient descent is the more general-purpose method. Multiple linear regression is probably one of the most widely used learning algorithms today.
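
The update rules above can be sketched in NumPy as follows. This is a minimal illustration, not a definitive implementation: the function names, the synthetic data, and the choice of `alpha` and `num_iters` are all assumptions made for the example. The gradients are the ones in the update rules above, which correspond to a squared-error cost with a $\frac{1}{2m}$ factor. For comparison, the last lines solve the same problem with the normal equation in its standard form $(A^{T}A)\theta = A^{T}y$, where $A$ is the feature matrix with a column of ones appended for $b$.

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Gradients of the cost with respect to w (vector) and b (scalar)."""
    m = X.shape[0]
    err = X @ w + b - y       # f(x^(i)) - y^(i) for every example, shape (m,)
    dj_dw = X.T @ err / m     # one partial derivative per feature, shape (n,)
    dj_db = err.sum() / m
    return dj_dw, dj_db

def gradient_descent(X, y, w_init, b_init, alpha, num_iters):
    """Batch gradient descent for multiple linear regression."""
    w = w_init.copy()
    b = b_init
    for _ in range(num_iters):
        # Both gradients are computed from the old (w, b) before either
        # parameter changes, so the update is simultaneous.
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Synthetic, noise-free data from a known relation: y = 2*x1 - 3*x2 + 5.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 2))
y = X @ np.array([2.0, -3.0]) + 5.0

w_gd, b_gd = gradient_descent(X, y, np.zeros(2), 0.0, alpha=0.5, num_iters=2000)

# Normal equation for comparison: solve (A^T A) theta = A^T y directly.
A = np.hstack([X, np.ones((X.shape[0], 1))])
theta = np.linalg.solve(A.T @ A, A.T @ y)   # theta = [w_1, w_2, b]

print(w_gd, b_gd)
print(theta)
```

On this toy data both approaches should recover parameters close to the ones used to generate `y`.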

# Feature Scaling

**Feature scaling** is a method that enables gradient descent to run much faster. We rescale our features with a transformation so that they all take on comparable ranges of values. After scaling, the contour plot of the cost function looks a lot more like a circle than an elongated oval, which lets gradient descent take a more direct path to the minimum. When different features take on very different ranges of values, gradient descent can run slowly, but if we rescale the features so that they have comparable ranges, the algorithm runs faster.

## Implementation Methods

### Mean Normalization

You start with the original features and rescale them so that they are centered around zero. This puts the numeric columns of the dataset on a common scale, which we need because our features have different ranges. The formula is given by: $$ x' = \frac{x - \mu}{\max(x) - \min(x)}$$ where $x'$ is the rescaled feature and $\mu$ is the mean of $x$.
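
A minimal NumPy sketch of mean normalization, applied to each feature (column) separately. The function name and the sample values are made up for illustration, and the sketch assumes no column is constant (otherwise the range in the denominator would be zero).

```python
import numpy as np

def mean_normalize(X):
    """Rescale each column to (x - mean) / (max - min)."""
    mu = X.mean(axis=0)                          # per-feature mean
    value_range = X.max(axis=0) - X.min(axis=0)  # per-feature range (assumed nonzero)
    return (X - mu) / value_range

# Made-up data: two features with very different ranges.
X = np.array([
    [2104.0, 5.0],
    [1416.0, 3.0],
    [852.0, 2.0],
])

X_scaled = mean_normalize(X)
print(X_scaled)
```

After scaling, every column has mean 0 and its values span a range of exactly 1.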

### Z-score Normalization

Z-score normalization rescales every value of a feature so that the mean of all of the values is 0 and the standard deviation is 1. The formula is given by: $$x' = \frac{x - \mu}{\sigma}$$ where $\sigma$ is the standard deviation of $x$ and $\mu$ is the mean, like always.
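
A minimal NumPy sketch of z-score normalization, again per column. The function name and data are placeholders. Returning `mu` and `sigma` is a design choice assumed here so that a new example can later be scaled with the same statistics as the training set. The sketch uses the population standard deviation (NumPy's default, `ddof=0`).

```python
import numpy as np

def zscore_normalize(X):
    """Rescale each column to (x - mean) / standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)   # population standard deviation, assumed nonzero
    return (X - mu) / sigma, mu, sigma

# Made-up data: two features with very different ranges.
X = np.array([
    [2104.0, 5.0],
    [1416.0, 3.0],
    [852.0, 2.0],
])

X_norm, mu, sigma = zscore_normalize(X)

# A new (hypothetical) example must be scaled with the training mu and sigma.
x_new = (np.array([1200.0, 3.0]) - mu) / sigma
print(X_norm)
print(x_new)
```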

## Convergence of Gradient Descent

How can you tell when gradient descent has converged? You plot the cost function $J$, calculated on the training set, against the number of iterations. This plot is called a **learning curve**: $J$ is your y-axis and the number of iterations is your x-axis. What you should take from this graph is that the cost should decrease after each iteration. If it does not, your learning rate is usually not ideal. Eventually the curve flattens out, which means gradient descent has converged because the cost is no longer decreasing. Learning curves are a great way to decide when you should stop training your model. You can also use an automatic convergence test, for example declaring convergence once the cost decreases by less than some small threshold $\epsilon$ in one iteration, although reading the learning curve is probably the easiest way to tell.
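
One way an automatic convergence test could look is sketched below. The function names, the threshold `epsilon = 1e-9`, the synthetic data, and the squared-error cost with a $\frac{1}{2m}$ factor are all assumptions made for this example. The loop records the cost after every iteration and stops once the decrease falls below `epsilon`.

```python
import numpy as np

def cost(X, y, w, b):
    """Squared-error cost with a 1/(2m) factor (assumed form of J)."""
    err = X @ w + b - y
    return (err @ err) / (2 * X.shape[0])

def run_until_converged(X, y, alpha, epsilon=1e-9, max_iters=10_000):
    """Gradient descent that stops when the cost decreases by less than epsilon."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    history = [cost(X, y, w, b)]   # cost before any update
    for _ in range(max_iters):
        err = X @ w + b - y        # computed once so the update is simultaneous
        w = w - alpha * (X.T @ err) / m
        b = b - alpha * err.mean()
        history.append(cost(X, y, w, b))
        # This also triggers if the cost went up, which usually signals
        # a learning rate that is too large.
        if history[-2] - history[-1] < epsilon:
            break
    return w, b, history

# Synthetic, noise-free data for illustration.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(40, 2))
y = X @ np.array([1.5, -0.5]) + 2.0

w, b, history = run_until_converged(X, y, alpha=0.3)
print(len(history) - 1, history[-1])
```

Plotting `history` against the iteration index gives the learning curve described above.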

## Learning Rate

We have talked a lot about the **learning rate**. How do you choose a good learning rate, $\alpha$? You can look at your learning curve and identify what could possibly be going wrong. A lot of the time you will see that the update step is overshooting the minimum: the cost goes up if the learning rate is too big. On the other hand, if it is taking too long to get to the minimum, your learning rate might be too small. Just know that your learning curve should always be decreasing. If it is not, there could be a bug in the code, but assuming your code is correct, the usual cause is a learning rate that is too big (the cost goes up) or too small (the cost decreases very slowly). A good range of values to try is 0.001, 0.01, 0.1, 1, ...
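
The effect of $\alpha$ can be explored with a small sweep like the one below. This is only a sketch on synthetic data: the function name, the data, the fixed 100 iterations, and the extra value `3.0` (added here to provoke overshooting) are assumptions for the example.

```python
import numpy as np

def final_cost(X, y, alpha, num_iters=100):
    """Run gradient descent from zeros and return the cost at the end."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        err = X @ w + b - y        # computed once so the update is simultaneous
        w = w - alpha * (X.T @ err) / m
        b = b - alpha * err.mean()
    err = X @ w + b - y
    return (err @ err) / (2 * m)

# Synthetic, noise-free data for illustration.
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(40, 2))
y = X @ np.array([1.5, -0.5]) + 2.0

# Cost before any training, for reference.
initial_cost = final_cost(X, y, alpha=0.0, num_iters=0)

# Try learning rates spaced roughly 10x apart, plus one deliberately large value.
costs = {alpha: final_cost(X, y, alpha) for alpha in (0.001, 0.01, 0.1, 1.0, 3.0)}

for alpha, c in costs.items():
    print(f"alpha={alpha}: cost after 100 iterations = {c:.3e}")
```

On this data the smallest rates barely move the cost in 100 iterations, while the deliberately large rate ends with a cost higher than where it started, which is the overshooting described above.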

## Feature Engineering

When engineering your features, you want to end up with the features that work best for your model. We can define new features that make it easier for the model to pick up on what matters most for the prediction. In **feature engineering** you use intuition to design new features by transforming or combining the original features.
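
As a hypothetical illustration of combining original features (the feature names and values below are invented and not from these notes), suppose each house has a lot frontage and a lot depth. Their product, the lot area, may be more informative for a price model than either one alone.

```python
import numpy as np

# Hypothetical raw features for 4 houses.
frontage = np.array([20.0, 30.0, 25.0, 40.0])
depth = np.array([50.0, 40.0, 60.0, 45.0])

# Engineered feature: area combines the two original features.
area = frontage * depth

# Feature matrix with the original columns plus the engineered one.
X = np.column_stack([frontage, depth, area])
print(X)
```

The model then learns one weight per column, so it can use the engineered feature alongside the originals.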
