## Linear Regression  

* A <b>supervised</b> learning model (it learns from labeled data) for predicting a <b>continuous-valued output</b>, i.e. a real number, such as the price of a house.
* Types of Linear regression: 
* * Univariate (simple) linear regression uses one independent variable to predict the output $\hat{y}$. 
* * Multiple linear regression uses two or more independent variables to predict $\hat{y}$. Using more variables lets the model account for more of the factors that influence $y$, which often (though not always) improves predictive accuracy.


## Simple (Univariate) Linear Regression 

A simple linear regression model fits a straight line to the data: $f_{w,b}(x) = wx + b$, where $w$ is the slope and $b$ is the intercept.
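
As a minimal sketch of the formula above (the function name `predict_simple` and the parameter values are made up for illustration, not taken from these notes), the model can be evaluated with NumPy like this:

```python
import numpy as np

def predict_simple(x, w, b):
    """Evaluate f_{w,b}(x) = w*x + b for a scalar or an array of inputs."""
    return w * np.asarray(x, dtype=float) + b

# Hypothetical parameters: slope w = 2.0, intercept b = 1.0
x = np.array([0.0, 1.0, 2.0])
y_hat = predict_simple(x, w=2.0, b=1.0)
print(y_hat)  # [1. 3. 5.]
```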

## Multiple Linear Regression 

Multiple linear regression is a model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a linear function (a hyperplane rather than a single straight line): $f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$, where $\cdot$ denotes the dot product.

* $w_{1} ... w_{n}, b = $ weights or coefficients or parameters of the model (adjusted as the model learns from data)
* $\vec{w} = [w_{1} ... w_{n}] = n$-length vector
* $b = $ scalar 
* $\vec{X} = $ feature matrix with $m$ rows and $n$ columns
* $n = $ number of features (the length of each feature vector)
* $m = $ number of training samples
* $x^{(i)} = (x^{(i)}_{1}, ... , x^{(i)}_{n}) = $ feature vector $i$
* $x^{(i)}_{j} = $ element $j$ in sample $i$
* $y^{(i)}$ = actual (observed) target value for training example $i$; the model's prediction for that example is written $\hat{y}^{(i)}$
* $(x^{(i)}, y^{(i)})$ = a single training example (the i-th training example) = a single row in a data table
* $f_{\vec{w},b}(x^{(i)}) = w_{1}x^{(i)}_{1} + \dots + w_{n}x^{(i)}_{n} + b$
* $J(\vec{w},b) = J(w_{1}, \dots, w_{n},b) = \frac{1}{2m} \sum_{i=1}^{m} ( f_{\vec{w},b}(x^{(i)}) - y^{(i)} )^{2}$ = cost function, where $f_{\vec{w},b}(x^{(i)}) = \vec{w} \cdot x^{(i)} + b = \hat{y}^{(i)}$
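
The prediction and cost definitions above can be sketched in vectorized NumPy as follows. The function names and the toy data are assumptions made for illustration; `X` is taken to be an array of shape `(m, n)` with one training example per row.

```python
import numpy as np

def predict(X, w, b):
    """Return w . x^(i) + b for every row x^(i) of X (shape (m, n))."""
    return X @ w + b

def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) = 1/(2m) * sum_i (f(x^(i)) - y^(i))^2."""
    m = X.shape[0]
    errors = predict(X, w, b) - y
    return float(np.sum(errors ** 2) / (2 * m))

# Made-up toy data: m = 3 examples, n = 2 features
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([5.0, 2.0, 5.0])

print(compute_cost(X, y, np.array([1.0, 2.0]), 0.0))  # 0.0 (these weights fit the toy data exactly)
print(compute_cost(X, y, np.zeros(2), 0.0))           # 9.0
```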


<br/>

### Gradient Descent (update rules)

Repeat until convergence, updating all parameters simultaneously: {
    <br>&nbsp;&nbsp; $w_{j} = w_{j} - \alpha \frac{\partial}{\partial w_{j}}J(\vec{w},b)$ &nbsp; for $j = 1, \dots, n$
    <br>&nbsp;&nbsp;  $b = b - \alpha \frac{\partial}{\partial b}J(\vec{w},b)$ <br> 
}
<br><br>
where $\alpha$ is the learning rate and:
<br><br>$\frac{\partial}{\partial w_{j}}J(\vec{w},b) = \frac{1}{m} \sum_{i=1}^{m} ( f_{\vec{w},b}(x^{(i)}) - y^{(i)} )x^{(i)}_{j}$ 
<br><br>$\frac{\partial}{\partial b}J(\vec{w},b) = \frac{1}{m} \sum_{i=1}^{m} ( f_{\vec{w},b}(x^{(i)}) - y^{(i)} )$
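
A rough sketch of these update rules in NumPy is shown below. The function names, the fixed iteration count (used in place of a real convergence check), the learning rate, and the toy data are all assumptions for illustration.

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Return (dJ/dw, dJ/db) for the squared-error cost."""
    m = X.shape[0]
    errors = X @ w + b - y        # f(x^(i)) - y^(i) for every example
    dj_dw = X.T @ errors / m      # entry j is (1/m) * sum_i error_i * x_j^(i)
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, num_iters):
    """Run a fixed number of simultaneous updates of w and b."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        # Both gradients are computed before either parameter changes,
        # so the update is simultaneous.
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Made-up toy data generated from y = 2x + 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
w_fit, b_fit = gradient_descent(X_toy, y_toy, np.zeros(1), 0.0, alpha=0.1, num_iters=5000)
print(w_fit, b_fit)  # should approach w = [2.] and b = 1. for this data
```

In practice the loop would usually stop when the cost or the gradient falls below a tolerance rather than after a fixed number of steps.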

### Cost function with regularization:

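$J(\vec{w},b)= \frac{1}{2m} \sum_{i=1}^{m}[f_{\vec{w},b}(x^{(i)}) - y^{(i)} ]^{2} + \frac{\lambda}{2m}\sum_{j=1}^{n}w_{j}^2$ 

where $\lambda \ge 0$ is the regularization parameter. The penalty applies to the weights $w_{j}$ only, not to $b$.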
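
As an illustrative sketch (the function name and the trailing underscore in `lambda_`, which avoids Python's `lambda` keyword, are choices made here rather than part of these notes), the regularized cost can be computed as:

```python
import numpy as np

def compute_cost_regularized(X, y, w, b, lambda_):
    """Squared-error cost plus the L2 penalty (lambda / 2m) * sum_j w_j^2."""
    m = X.shape[0]
    errors = X @ w + b - y
    squared_error_term = np.sum(errors ** 2) / (2 * m)
    penalty_term = lambda_ / (2 * m) * np.sum(w ** 2)  # b is not penalized
    return float(squared_error_term + penalty_term)
```

With `lambda_ = 0` this reduces to the unregularized cost.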

###  Gradient descent + regularization (update rules):

$w_{j} = w_{j} - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} ( f_{\vec{w},b}(x^{(i)}) - y^{(i)} ) x^{(i)}_{j} + \frac{\lambda}{m}w_{j} \right]$
<br><br> 
$b = b - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} ( f_{\vec{w},b}(x^{(i)}) - y^{(i)} ) \right]$
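
A single regularized update step might look like the sketch below. The function name and argument order are hypothetical; the only change from the unregularized gradient is the extra $\frac{\lambda}{m}w_{j}$ term added to each weight's gradient.

```python
import numpy as np

def gradient_step_regularized(X, y, w, b, alpha, lambda_):
    """Apply one simultaneous update of w and b with L2 regularization."""
    m = X.shape[0]
    errors = X @ w + b - y
    dj_dw = X.T @ errors / m + (lambda_ / m) * w  # penalty gradient on weights only
    dj_db = np.sum(errors) / m                    # b is not regularized
    return w - alpha * dj_dw, b - alpha * dj_db
```

Calling this function repeatedly, feeding each result back in, gives regularized gradient descent.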


## Alternative to Gradient Descent:
 
> For models like linear regression, we can use two techniques to fit the parameters: the normal equation and gradient descent. 

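* The normal equation solves for the parameters analytically in one step, with no learning rate and no iterations. It applies only to linear regression, i.e. it does not generalize to other learning algorithms.
* It becomes slow when the number of features is large (roughly > 10,000), because it requires solving an $n \times n$ linear system.
* It may be used behind the scenes by ML libraries.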
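
For reference, the normal equation in its usual form is $\theta = (X^{T}X)^{-1}X^{T}y$, where $\theta$ stacks the weights and the bias. The sketch below is one way to apply it with NumPy; the function name and toy data are made up, and it assumes $X^{T}X$ is invertible (no duplicated or perfectly correlated features). It solves the linear system directly instead of forming the inverse, which gives the same solution.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X_aug^T X_aug) theta = X_aug^T y and return (w, b)."""
    m = X.shape[0]
    # Append a column of ones so the bias b is learned as the last parameter.
    X_aug = np.hstack([X, np.ones((m, 1))])
    theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
    return theta[:-1], theta[-1]

# Made-up toy data generated from y = 2x + 1
X_ne = np.array([[0.0], [1.0], [2.0], [3.0]])
y_ne = np.array([1.0, 3.0, 5.0, 7.0])
w_ne, b_ne = fit_normal_equation(X_ne, y_ne)
print(w_ne, b_ne)  # close to w = [2.] and b = 1.
```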