# The goal of this notebook is to walk through, step by step, the entire linear regression process and its related functions and algorithms

### Outline

- Hypothesis Function
- Cost Function
- Gradient Descent
- Normal Equation

### Hypothesis Function

The hypothesis function predicts an estimate of the outcome $y$ from the input $x$, using the parameters $Θ_{0}$ and $Θ_{1}$:

$h_{Θ}(x) = Θ_{0} + Θ_{1}x_{1} $
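
As a minimal sketch, the hypothesis can be written as a small Python function. The name `hypothesis` and the parameter values below are assumptions chosen for illustration only.

```python
def hypothesis(theta0, theta1, x):
    """Predict y for a single input x using h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Illustrative parameters: intercept 1.0 and slope 2.0.
prediction = hypothesis(1.0, 2.0, 3.0)
print(prediction)  # 7.0
```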

### Cost Function

The Cost Function measures how well a given choice of Θ fits the data. The goal is to choose the Θ parameters such that $h_{Θ}(x)$ is as close as possible to $y$ for our training examples.

This is a minimization problem.
We need to minimize the sum of squared differences between the h(x) predictions and the y training outcomes.

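Cost Function Formula:

$ J(Θ_{0},Θ_{1}) = \frac{1}{2m} \sum_{i=1}^{m} (h_{Θ}(x^{(i)}) - y^{(i)})^{2} $

where $ h_{Θ}(x^{(i)}) $ is the hypothesis function evaluated on the $i$-th training example, $ y^{(i)} $ is the corresponding outcome, and $ m $ is the number of training examples.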


The above formula is the squared error cost function, one of the most commonly used cost functions for a linear regression model. Minimizing the sum of squared errors gives the best linear fit to the data in the least-squares sense: it minimizes the residuals, i.e. the distances between the hypothesis predictions and the observed outcomes. In short, the cost function represents the difference between the predictions and the outcomes.
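
The squared error cost can be sketched in plain Python as follows. The function names and the toy data (points lying exactly on $y = 1 + 2x$) are assumptions made up for this illustration.

```python
def hypothesis(theta0, theta1, x):
    """h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared error cost J = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    squared_errors = [(hypothesis(theta0, theta1, x) - y) ** 2
                      for x, y in zip(xs, ys)]
    return sum(squared_errors) / (2 * m)


# Toy data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(cost(1.0, 2.0, xs, ys))  # 0.0, a perfect fit
print(cost(0.0, 0.0, xs, ys))  # 10.5, every prediction is 0
```

The better the parameters match the data, the closer the cost gets to zero.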

### Gradient Descent

Gradient Descent is an iterative method for minimizing the cost function $ J(Θ_{0}, Θ_{1}) $

Gradient Descent Formula:
    
$ Θ_{j} := Θ_{j} - \alpha \frac{\partial}{\partial Θ_{j}} J(Θ_{0}, Θ_{1}) $


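where $ Θ_{j} $ is updated simultaneously for all values of $j$ (here $j = 0$ and $j = 1$): every partial derivative is computed from the current parameter values before any parameter is overwritten.

$ \alpha $ is the learning rate, which sets the size of the step the algorithm takes toward the minimum of $ J(Θ_{0}, Θ_{1}) $ at each iteration.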
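
For the squared error cost, differentiating $J$ gives $\frac{\partial J}{\partial Θ_{0}} = \frac{1}{m} \sum (h_{Θ}(x^{(i)}) - y^{(i)})$ and $\frac{\partial J}{\partial Θ_{1}} = \frac{1}{m} \sum (h_{Θ}(x^{(i)}) - y^{(i)}) x^{(i)}$. The sketch below applies the update rule with those derivatives. The function name, the starting point of zero, the default `alpha` and `iterations`, and the toy data are all assumptions picked for illustration.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit theta0 and theta1 by batch gradient descent on the squared error cost."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 3), round(theta1, 3))  # close to 1.0 and 2.0
```

The tuple assignment is what makes the update simultaneous. Updating `theta0` first and then using the new value to compute `grad1` would be a different (and incorrect) algorithm.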

### Linear Regression with Multiple Features

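When using Linear Regression with Multiple Features, each feature becomes an $ x_{j} $ and the value of that feature for the $i$-th observation becomes $ x^{(i)}_{j} $

Collected over all observations, each feature $ x_{j} $ is an $ m \times 1 $ matrix (one entry per observation, where $ m $ is the number of observations)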


The Linear Regression with Multiple Features equation is the following:

$h_{Θ}(x) = Θ_{0} + Θ_{1}x_{1} + Θ_{2}x_{2} + Θ_{3}x_{3} + \dots + Θ_{n}x_{n} $

For convenience of notation, we can define $ x_{0} = 1 $

(This constant feature multiplies $ Θ_{0} $, the y-intercept of the function)

which allows us to convert the Hypothesis Function to the following:

$h_{Θ}(x) = Θ_{0}x_{0} + Θ_{1}x_{1} + Θ_{2}x_{2} + Θ_{3}x_{3} + \dots + Θ_{n}x_{n} $  where $ x_{0} = 1 $

We rewrite the function this way because it can then be simplified even further with Linear Algebra matrix multiplication. Since $ Θ $ and $ x $ are both $ (n+1) \times 1 $ matrices (one entry for each of the $n$ features, plus the $ x_{0} $ term), we can transpose one of them to make the matrix multiplication possible.

$ Θ^{T} = [Θ_{0} \ Θ_{1} \ Θ_{2} \ Θ_{3} \ \dots \ Θ_{n}] $

is a $ 1 \times (n+1) $ matrix and $x$ is still an $ (n+1) \times 1 $ matrix, so the product $ Θ^{T}x $ is defined. (Without transposing this wouldn't work, because two $ (n+1) \times 1 $ matrices cannot be multiplied together.)

Finally, the hypothesis function can be quickly summarized by the following formula:
    
$h_{Θ}(x) = Θ^{T}x  $
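
A short NumPy sketch of this product, assuming $n = 2$ features and made-up values for $Θ$ and $x$. It checks that $Θ^{T}x$ matches the term-by-term sum.

```python
import numpy as np

# Illustrative values with n = 2 features; theta and x are (n + 1) x 1 column vectors.
theta = np.array([[1.0], [2.0], [3.0]])  # theta_0, theta_1, theta_2
x = np.array([[1.0], [4.0], [5.0]])      # x_0 = 1, x_1, x_2

# theta.T is 1 x (n + 1), so theta.T @ x is a 1 x 1 matrix.
h = (theta.T @ x).item()
print(h)  # 24.0

# The same value written out term by term: 1*1 + 2*4 + 3*5.
expanded = sum(t * xi for t, xi in zip(theta.ravel(), x.ravel()))
```

Trying `theta @ x` without the transpose raises a `ValueError`, which is the shape mismatch described above.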

### Gradient Descent for Multiple Variables

A method for minimizing the cost function $ J(Θ_{0}, Θ_{1},..., Θ_{n}) $

By simplifying the Hypothesis Function to $ h_{Θ}(x) = Θ^{T}x  $, we can also collapse the parameters $  Θ_{0}, Θ_{1},..., Θ_{n} $ into a single $ Θ $, which is an $ (n+1) \times 1 $ matrix (where $n$ = number of features; the number of observations is denoted $m$).

So the Cost Function can be represented as:

$ J(Θ_{0},Θ_{1},...,Θ_{n}) = \frac{1}{2m} \sum_{i=1}^{m} (h_{Θ}(x^{(i)}) - y^{(i)})^{2} $

or

$ J(Θ) = \frac{1}{2m} \sum_{i=1}^{m} (Θ^{T}x^{(i)} - y^{(i)})^{2} $
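
One way to sketch this cost in NumPy is to stack the observations as rows of a matrix `X` (with a leading column of ones for $x_{0}$), so that `X @ theta` computes $Θ^{T}x^{(i)}$ for every observation at once. The matrix `X`, the vector `y`, and the function name are assumptions for this example.

```python
import numpy as np


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((X @ theta - y)^2), with X of shape (m, n + 1)."""
    m = X.shape[0]
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * m)


# Three observations, two features, plus the leading column x_0 = 1.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.0],
              [1.0, 3.0, 1.0]])
y = np.array([6.0, 3.0, 6.0])

print(cost(np.array([1.0, 1.0, 2.0]), X, y))  # 0.0, these parameters fit exactly
print(cost(np.zeros(3), X, y))                # 13.5
```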

The Gradient Descent algorithm with multiple variables then becomes:
    
Repeat {
    
$ Θ_{j} := Θ_{j} - \alpha \frac{\partial}{\partial Θ_{j}} J(Θ) $


where $ Θ_{j} $ is updated simultaneously for all values $ j = 0, 1, ..., n $
    
}
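
The loop above can be sketched in vectorized form. For the squared error cost, the vector of all partial derivatives is $\frac{1}{m} X^{T}(XΘ - y)$, where `X` holds one observation per row with a leading column of ones. The function name, defaults, and toy data below are assumptions for illustration.

```python
import numpy as np


def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent for J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    m, n_params = X.shape
    theta = np.zeros(n_params)
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m  # all partial derivatives at once
        theta = theta - alpha * gradient      # simultaneous update of every theta_j
    return theta


# Toy data generated from y = 1 + 2*x_1 + 3*x_2 (first column is x_0 = 1).
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([1.0, 3.0, 4.0, 6.0])

theta = gradient_descent(X, y)
print(np.round(theta, 3))  # close to [1. 2. 3.]
```

Because the whole gradient is computed before `theta` is reassigned, the simultaneous update comes for free in the vectorized version.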

### Learning Rate $ \alpha $ 

During Gradient Descent, the Cost Function is minimized over many iterations (sometimes hundreds, thousands, or millions). To check that Gradient Descent is working properly, plot the Cost Function against the number of iterations: the curve should slope downward throughout. If the cost decreases with every iteration, the algorithm is properly minimizing the Cost Function (i.e. the model error). When the curve flattens out (its slope approaches zero), Gradient Descent has converged.

**Plotting the Cost Function over the number of descent iterations is a great way to visualize and confirm that the algorithm is working properly**
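
A minimal sketch of recording the cost at every iteration, so it can be plotted or checked. The function name `cost_history`, the toy data, and the choice of `alpha=0.1` with 200 iterations are assumptions for this example.

```python
def cost_history(xs, ys, alpha, iterations):
    """Run gradient descent and return the cost J after every iteration."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    history = []
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        theta0, theta1 = (theta0 - alpha * sum(errors) / m,
                          theta1 - alpha * sum(e * x for e, x in zip(errors, xs)) / m)
        new_errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        history.append(sum(e ** 2 for e in new_errors) / (2 * m))
    return history


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
history = cost_history(xs, ys, alpha=0.1, iterations=200)

# With a suitable learning rate, the cost never increases between iterations.
is_decreasing = all(later <= earlier for earlier, later in zip(history, history[1:]))
print(is_decreasing, history[0], history[-1])
```

The `history` list can be handed to a plotting library (for example `matplotlib.pyplot.plot(history)`) to produce the cost-versus-iterations curve.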

What if Gradient Descent is not working correctly (i.e. the cost begins to increase at any point)?

This is a signal that the Learning Rate $ \alpha $ is too large. The Learning Rate controls the size of the step the algorithm takes toward the minimum. If the Learning Rate is too large, the algorithm can overshoot the minimum, which can cause the cost to increase and Gradient Descent to fail to converge. If you make the Learning Rate small enough (and the algorithm is implemented correctly), then the Cost Function will decrease at every iteration. The downside of a small Learning Rate is that the algorithm may take a long time to converge.

When choosing a Learning Rate for a model, you want the largest Learning Rate that still reduces the Cost Function at every iteration. A larger value is preferred purely for computational efficiency, and the right value usually has to be found by trial and error.
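
The trial-and-error search can be sketched as a loop over candidate values that keeps those for which the cost never increases. The candidate list, the 50-iteration budget, and the toy data are hypothetical choices for illustration. On real data the usable range will differ.

```python
def cost_decreases_every_step(xs, ys, alpha, iterations=50):
    """Return True if J never increases during gradient descent with this alpha."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    previous_cost = float("inf")
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        current_cost = sum(e ** 2 for e in errors) / (2 * m)
        if current_cost > previous_cost:
            return False  # overshoot: the cost went up
        previous_cost = current_cost
        theta0, theta1 = (theta0 - alpha * sum(errors) / m,
                          theta1 - alpha * sum(e * x for e, x in zip(errors, xs)) / m)
    return True


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

candidates = [0.001, 0.01, 0.1, 0.4, 0.6]  # hypothetical values to try
usable = [a for a in candidates if cost_decreases_every_step(xs, ys, a)]
best_alpha = max(usable)
print(usable, best_alpha)
```

On this toy data the largest candidate overshoots and is rejected, while the smaller ones all pass. Among those, the largest one converges in the fewest iterations.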

### Feature Scaling

Gradient Descent works most effectively when all the features are on a similar scale. If the features vary widely in scale, Gradient Descent can need many more iterations to converge. By normalizing the features and bringing them to a similar scale, we can make the algorithm much more efficient. One way to achieve this is to rescale every feature to roughly the range -1 to 1, for example with Mean Normalization, which subtracts the feature's mean from each value and divides by the feature's range (or its standard deviation).
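
A minimal sketch of Mean Normalization in NumPy, using the range (max minus min) as the divisor. The function name and the feature values (a house size and a bedroom count) are made up for illustration. The sketch assumes no feature column is constant, so it should be applied before the $x_{0} = 1$ column is added.

```python
import numpy as np


def mean_normalize(X):
    """Scale each column to (x - mean) / (max - min), giving values in [-1, 1]."""
    mu = X.mean(axis=0)
    value_range = X.max(axis=0) - X.min(axis=0)  # assumes no constant column
    return (X - mu) / value_range, mu, value_range


# Hypothetical features on very different scales: house size and bedroom count.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

X_scaled, mu, value_range = mean_normalize(X)
print(X_scaled)
```

The returned `mu` and `value_range` should be kept so that new inputs can be scaled the same way before making predictions.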


### Normal Equation