# Multiple Linear Regression

In univariate linear regression we had only one feature. With more than one feature the model is no longer univariate, so we need some new notation. We write $x_{j}$ for the $j^{th}$ feature, where $j = 1, \dots, n$ and $n$ is the number of features. If you have 4 features, $j$ runs from 1 to 4. A subscript that may look confusing at first is $x_{j}^{(i)}$, which simply means the value of the $j^{th}$ feature (column) in the $i^{th}$ training example (row). For example, $x_{3}^{(2)}$ is the value in the 2nd row and the 3rd column. As a reminder, $\vec{x}^{(i)}$ denotes the vector that stores all the features of the $i^{th}$ training example.

Our model also changes now that we have multiple features. It is defined as $$ f_{\vec{w}, b}(\vec{x}) = w_{1}x_{1} + \cdots + w_{n}x_{n} + b$$ There is an easier way to think about this equation:

* $\vec{w} = [w_{1} \ w_{2} \ w_{3} \ \dots \ w_{n}]$ = parameters of the model
* $\vec{x} = [x_{1} \ x_{2} \ x_{3} \ \dots \ x_{n}]$ = features of one example

Thus we can rewrite the model as $f_{\vec{w}, b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$. This is known as **multiple linear regression**.
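To make the notation concrete, here is a minimal sketch in NumPy. The data values, the parameter values, and the names `X`, `x_3_2`, and `x_vec_2` are made up for illustration and are not from the notes. The one thing to watch is that the notes count rows and columns from 1, while NumPy counts from 0.

```python
import numpy as np

# Hypothetical training set: m = 3 examples (rows), n = 4 features (columns).
X = np.array([
    [2104.0, 5.0, 1.0, 45.0],
    [1416.0, 3.0, 2.0, 40.0],
    [852.0, 2.0, 1.0, 35.0],
])
m, n = X.shape

# The notes index from 1 and NumPy indexes from 0,
# so x_j^(i) is stored at X[i - 1, j - 1].
x_3_2 = X[2 - 1, 3 - 1]  # x_3^(2): 2nd example, 3rd feature

# \vec{x}^(2): all n features of the 2nd training example.
x_vec_2 = X[2 - 1]

# Illustrative (untrained) parameters: one weight per feature plus a bias.
w = np.array([0.1, 4.0, -2.0, 0.5])
b = 10.0

# The model written out term by term: w_1 x_1 + ... + w_n x_n + b.
f_wb = sum(w[j] * x_vec_2[j] for j in range(n)) + b
print(m, n, x_3_2, f_wb)
```
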

## Vectorization

Vectorization makes your code both shorter and much faster than writing explicit loops. We use the **dot product**, which was shown in the equation above. It turns the equation $$f_{\vec{w}, b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$$ into a one-liner in code instead of a for loop over the features. Here is the one line: `f_wb = np.dot(w, x) + b`. Isn't that neat? Vectorization allows our algorithms to be efficient and to scale well to large data sets.
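As a rough sketch of the difference, the two functions below compute the same prediction, once with a loop over the features and once with the dot product. The function names and the sample values are assumptions chosen for illustration.

```python
import numpy as np

def predict_loop(x, w, b):
    """Compute w . x + b one feature at a time (the slow way)."""
    f_wb = 0.0
    for j in range(x.shape[0]):
        f_wb += w[j] * x[j]
    return f_wb + b

def predict_vectorized(x, w, b):
    """Compute w . x + b with a single dot product."""
    return np.dot(w, x) + b

# Illustrative values only.
x = np.array([10.0, 20.0, 30.0])
w = np.array([1.0, 2.5, -3.3])
b = 4.0

loop_result = predict_loop(x, w, b)
vec_result = predict_vectorized(x, w, b)
print(loop_result, vec_result)
```

Both versions return the same number. With only three features the speed difference is negligible, and the gap should only become noticeable as the number of features grows.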

## Gradient Descent for MLR

Our model is now given by $$f_{\vec{w}, b}(\vec{x}) = \vec{w}\cdot \vec{x} + b$$ where
* $\vec{w} = [ w_{1} \cdots w_{n}]$
* $b$ is still a number/scalar
* $\cdot$ is the dot product

Our cost function is given by $J(\vec{w}, b)$. When we have multiple features, gradient descent becomes a little different. Before, there was a single weight $w$ to update alongside $b$, and the general update rule for that case is in the previous notes. When we have $n$ features, where $n \geq 2$, there is one weight per feature, and the new way to update each $w_{j}$ and $b$ is given by $$w_{j} = w_{j} - \alpha \frac{1}{m} \sum_{i = 1}^{m} (f_{\vec{w}, b} (\vec{x}^{(i)}) - y^{(i)})x_{j}^{(i)} $$

$$b = b - \alpha \frac{1}{m} \sum_{i = 1}^{m} (f_{\vec{w}, b} (\vec{x}^{(i)}) - y^{(i)})$$

in which $w_{j}$ (for $j = 1, \dots, n$) and $b$ are updated simultaneously. There is also an alternative way to find $w, b$ for linear regression, known as the **normal equation**. It does not need any iterations, since it solves for $w, b$ directly. However, it works only for **linear regression** and does not carry over to other learning algorithms, and it can become slow when the number of features is large. It is best not to worry about the details, as the method is so specific. Gradient descent is the more general-purpose choice. Multiple linear regression is probably one of the most widely used learning algorithms today.
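The update rules above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names, the synthetic data, the learning rate `alpha=0.1`, and the iteration count are all chosen for the example. The data is generated from known parameters so the result is easy to check.

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Return dJ/dw (one entry per feature) and dJ/db for the current w, b."""
    m = X.shape[0]
    err = X @ w + b - y        # f(x^(i)) - y^(i) for every example, shape (m,)
    dj_dw = X.T @ err / m      # (1/m) * sum_i err_i * x_j^(i), shape (n,)
    dj_db = err.sum() / m      # (1/m) * sum_i err_i
    return dj_dw, dj_db

def gradient_descent(X, y, w_init, b_init, alpha, num_iters):
    """Repeat the simultaneous update of all w_j and b num_iters times."""
    w = w_init.copy()
    b = b_init
    for _ in range(num_iters):
        # Both gradients are computed from the old w and b before either changes.
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Synthetic data generated from known parameters: y = 2*x1 - 1*x2 + 3.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, -1.0]) + 3.0

w, b = gradient_descent(X, y, np.zeros(2), 0.0, alpha=0.1, num_iters=5000)

# For comparison, a direct least-squares solve (no iterations). On this data it
# should reach the same answer that the normal equation would give.
A = np.column_stack([X, np.ones(X.shape[0])])  # extra column of ones for b
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b, theta)
```

Both approaches should recover values close to $w = [2, -1]$ and $b = 3$ on this small data set.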

# Feature Scaling

**Feature scaling** is a method to enable gradient descent to run much faster. 