## Chapter 4. Training Models

Two different ways to train a Linear Regression model:
 - Use a direct "closed-form" equation that directly computes the model parameters that best fit the model to the training set.
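 - Use an iterative optimization approach, Gradient Descent (GD), that gradually tweaks the model parameters to minimize the cost function over the training set. Variants: Batch GD, Mini-batch GD, and Stochastic GD.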

Polynomial Regression can fit nonlinear data, but it has more parameters than Linear Regression, so it is more prone to overfitting. Detect overfitting using learning curves; reduce it using regularization.

Logistic Regression

Softmax Regression

### Linear Regression

A linear model makes a prediction by simply computing a weighted sum of the input features plus a constant called the bias term (intercept term).

$$\hat{y} = h_\theta(\mathbf{x}) = \theta^T \cdot \mathbf{x}$$

Cost function

$$MSE(\mathbf{X}, h_\theta) = \frac{1}{m} \sum^m_{i=1}(\theta^T \cdot \mathbf{x}^{(i)} - y^{(i)})^2$$
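As a minimal sketch of the prediction and cost formulas above, the following NumPy snippet evaluates both on a tiny made-up dataset. The names `predict`, `mse`, `X_b`, and the sample values are illustrative assumptions, not part of these notes. A column of ones is prepended to the features so that $\theta_0$ acts as the bias term.

```python
import numpy as np

def predict(X_b, theta):
    """Return theta^T . x for every row x of X_b (bias column included)."""
    return X_b @ theta

def mse(X_b, y, theta):
    """Mean squared error of the linear model on (X_b, y)."""
    errors = predict(X_b, theta) - y
    return np.mean(errors ** 2)

# Hypothetical toy data that follows y = 1 + 2 * x exactly
X = np.array([[0.0], [1.0], [2.0]])
X_b = np.c_[np.ones((3, 1)), X]      # add x0 = 1 for the bias term
y = np.array([1.0, 3.0, 5.0])

theta_true = np.array([1.0, 2.0])    # matches the data
theta_bad = np.array([0.0, 2.0])     # bias is wrong, so every prediction is off by 1

mse_true = mse(X_b, y, theta_true)
mse_bad = mse(X_b, y, theta_bad)
```

Training the model means finding the `theta` that makes this cost as small as possible.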

#### The Normal Equation

Closed-form solution
$$\hat{\theta} = (\mathbf{X}^T \cdot \mathbf{X})^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}$$
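The Normal Equation translates almost directly into NumPy. The sketch below assumes synthetic data generated as $y = 4 + 3x$ plus Gaussian noise. The data, the seed, and the variable names are made up for illustration. It also shows `np.linalg.lstsq`, which solves the same least-squares problem without forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical noisy linear data: y = 4 + 3 * x + Gaussian noise
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0.0, 0.5, size=m)

X_b = np.c_[np.ones((m, 1)), X]      # add x0 = 1 to each instance

# Normal Equation: theta = (X^T X)^-1 X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Same least-squares problem solved without an explicit inverse
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)
```

Both estimates should land close to the generating parameters (4 and 3), with the noise preventing an exact recovery.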

#### Computational Complexity

<font color=red>_WARNING_</font>

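The Normal Equation gets very slow when the number of features grows large (e.g., 100,000), because it inverts $\mathbf{X}^T \cdot \mathbf{X}$, an $(n+1) \times (n+1)$ matrix (where $n$ is the number of features). The computational complexity of inverting such a matrix is typically about $O(n^{2.4})$ to $O(n^3)$, depending on the implementation.

On the positive side, this equation is linear with regards to the number of instances in the training set (it is $O(m)$), so it handles large training sets efficiently, provided they can fit in memory.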

### Gradient Descent

Concretely, you start by filling $\theta$ with random values (_random initialization_), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function, until the algorithm _converges_ to a minimum.

_Learning rate_ hyperparameter: if it is too small, the algorithm needs many iterations to converge; if it is too large, the algorithm may overshoot the minimum and diverge.

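Two main challenges with Gradient Descent when the cost function is not a nice regular bowl:
 - Depending on the random initialization, it could converge to a local minimum instead of the global minimum;
 - It may take a very long time to cross a plateau, and if you stop too early you will never reach the global minimum.

The MSE cost function for a Linear Regression model is convex, so it has no local minima, just one global minimum.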

<font color=red>_WARNING_</font>

When using Gradient Descent, you should ensure that all features have a similar scale, or else it will take much longer to converge.
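One common way to bring features to a similar scale is standardization (zero mean, unit variance per feature). The sketch below is an assumption about how this could be done by hand in NumPy. The helper name `standardize` and the sample values are hypothetical.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance.

    Assumes no column is constant (a zero standard deviation
    would cause a division by zero).
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std

# Hypothetical features on very different scales
X = np.array([[1.0, 100_000.0],
              [2.0, 250_000.0],
              [3.0, 400_000.0],
              [4.0, 550_000.0]])

X_scaled, mean, std = standardize(X)
```

The returned `mean` and `std` should be reused to transform validation or test data, so that all sets are scaled with the statistics of the training set.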

#### Batch Gradient Descent

Batch Gradient Descent computes the gradient over the full training set $\mathbf{X}$ at each Gradient Descent step, so it is slow on very large training sets.

How to set the number of iterations: set a very large number of iterations, but interrupt the algorithm when the norm of the gradient vector becomes smaller than a tiny number $\epsilon$ (the tolerance).
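The stopping rule can be sketched as follows. This is a minimal illustration, not a reference implementation. It uses the MSE gradient $\frac{2}{m}\mathbf{X}^T \cdot (\mathbf{X} \cdot \theta - \mathbf{y})$, and the function name, default values (`eta`, `max_iter`, `tol`), and synthetic data are all assumptions.

```python
import numpy as np

def batch_gradient_descent(X_b, y, eta=0.1, max_iter=100_000, tol=1e-8, seed=0):
    """Minimize the MSE using the full training set at every step.

    Returns the parameters and the number of steps actually taken.
    """
    rng = np.random.default_rng(seed)
    m, n = X_b.shape
    theta = rng.standard_normal(n)            # random initialization
    for step in range(max_iter):
        gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
        if np.linalg.norm(gradients) < tol:   # converged within tolerance
            return theta, step
        theta = theta - eta * gradients
    return theta, max_iter

# Hypothetical noisy linear data: y = 4 + 3 * x + noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0.0, 0.5, size=m)
X_b = np.c_[np.ones((m, 1)), X]

theta_gd, n_steps = batch_gradient_descent(X_b, y)
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)   # closed-form check
```

With this setup the loop should stop long before `max_iter`, and `theta_gd` should agree with the least-squares solution to within the tolerance.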

#### Stochastic Gradient Descent

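Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance. This makes each step very fast and makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration.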

Much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down. So once the algorithm stops, the final parameter values are good, but not optimal.

Randomness is good to escape from local optima, but bad because it means that the algorithm can never settle at the minimum. One solution: gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum. This process is akin to _simulated annealing_. The function that determines the learning rate at each iteration is called the _learning schedule_.

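_Epoch_: by convention, SGD iterates in rounds of $m$ iterations (where $m$ is the number of training instances); each round is called an epoch.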
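A minimal sketch of SGD with a decaying learning rate follows. The schedule $\eta(t) = t_0 / (t + t_1)$, its constants, the function names, and the synthetic data are illustrative assumptions. Other schedules work as well.

```python
import numpy as np

def learning_schedule(t, t0=5, t1=50):
    """Learning rate that shrinks as the iteration counter t grows."""
    return t0 / (t + t1)

def sgd(X_b, y, n_epochs=50, seed=0):
    """Stochastic Gradient Descent on the MSE, one random instance per step."""
    rng = np.random.default_rng(seed)
    m, n = X_b.shape
    theta = rng.standard_normal(n)            # random initialization
    for epoch in range(n_epochs):
        for i in range(m):                    # one epoch = m iterations
            idx = rng.integers(m)             # pick a single random instance
            xi, yi = X_b[idx], y[idx]
            gradients = 2 * xi * (xi @ theta - yi)
            eta = learning_schedule(epoch * m + i)
            theta = theta - eta * gradients
    return theta

# Hypothetical noisy linear data: y = 4 + 3 * x + noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0.0, 0.5, size=m)
X_b = np.c_[np.ones((m, 1)), X]

theta_sgd = sgd(X_b, y)
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)   # closed-form check
```

Because instances are drawn at random, some may be picked several times per epoch and others not at all. The result should be near, but not exactly equal to, the closed-form solution.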

#### Mini-batch Gradient Descent

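At each step, Mini-batch GD computes the gradients on small random sets of instances called _mini-batches_, rather than on the full training set (Batch GD) or on a single instance (SGD).

Advantage over SGD: performance boost from hardware optimization of matrix operations, especially with GPUs.

Its progress in parameter space is less erratic than with SGD, so it ends up closer to the minimum, but it may be harder for it to escape from local minima.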
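The sketch below illustrates the idea of computing each gradient on a small random subset of instances. The batch size, the constant learning rate, the function name, and the synthetic data are assumptions chosen for the example.

```python
import numpy as np

def minibatch_gd(X_b, y, batch_size=20, n_epochs=100, eta=0.05, seed=0):
    """Mini-batch Gradient Descent on the MSE with a constant learning rate."""
    rng = np.random.default_rng(seed)
    m, n = X_b.shape
    theta = rng.standard_normal(n)            # random initialization
    for epoch in range(n_epochs):
        order = rng.permutation(m)            # reshuffle instances every epoch
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            X_batch, y_batch = X_b[batch], y[batch]
            gradients = (2 / len(batch)) * X_batch.T @ (X_batch @ theta - y_batch)
            theta = theta - eta * gradients
    return theta

# Hypothetical noisy linear data: y = 4 + 3 * x + noise
rng = np.random.default_rng(42)
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0.0, 0.5, size=m)
X_b = np.c_[np.ones((m, 1)), X]

theta_mb = minibatch_gd(X_b, y)
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)   # closed-form check
```

Each gradient here is a single matrix product over the whole mini-batch, which is the kind of operation that optimized linear algebra libraries and GPUs accelerate.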

<div style="width:600 px; font-size:100%; text-align:center;"> <center><img src="img/tab4-1.png" width=600px alt="tab4-1" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Table 4-1. Comparison of algorithms for Linear Regression_</div>

### Polynomial Regression

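Polynomial Regression fits nonlinear data with a linear model: add powers of each feature as new features, then train a linear model on this extended set of features.

When there are multiple features, Polynomial Regression is capable of finding relationships between features, because the extended set also includes the combinations of features up to the given degree (e.g., $ab$ for features $a$ and $b$ at degree 2).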
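A minimal single-feature sketch: extend the feature $x$ with $x^2$, then fit an ordinary linear model on the extended features. The quadratic data-generating function, the seed, and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical quadratic data: y = 0.5 * x^2 + x + 2 + noise
m = 200
x = 6 * rng.random(m) - 3
y = 0.5 * x**2 + x + 2 + rng.normal(0.0, 0.5, size=m)

# Extended feature matrix with columns 1, x, x^2
X_poly = np.c_[np.ones(m), x, x**2]

# The model is still linear in theta, so least squares applies unchanged
theta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
```

The fitted `theta` should come out close to the generating coefficients (2, 1, 0.5), in that column order.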

### Learning Curves

One way to estimate a model's generalization performance is using cross-validation.

Another way is to look at the _learning curves_: plots of the model's performance on the training set and the validation set as a function of the training set size.
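Learning curves can be produced by training the model on growing subsets of the training set and recording both errors each time. The sketch below assumes a plain linear model fitted to quadratic data (so it underfits), uses RMSE as the performance measure, and leaves out the plotting. All names and data are hypothetical.

```python
import numpy as np

def rmse(X_b, y, theta):
    """Root mean squared error of a linear model."""
    return np.sqrt(np.mean((X_b @ theta - y) ** 2))

def learning_curves(X_train, y_train, X_val, y_val, sizes):
    """Fit on the first k training instances for each k in sizes."""
    train_errors, val_errors = [], []
    for k in sizes:
        theta, *_ = np.linalg.lstsq(X_train[:k], y_train[:k], rcond=None)
        train_errors.append(rmse(X_train[:k], y_train[:k], theta))
        val_errors.append(rmse(X_val, y_val, theta))
    return np.array(train_errors), np.array(val_errors)

# Hypothetical quadratic data, deliberately fitted with a straight line
rng = np.random.default_rng(2)
m = 400
x = 6 * rng.random(m) - 3
y = 0.5 * x**2 + x + 2 + rng.normal(0.0, 0.5, size=m)
X_b = np.c_[np.ones(m), x]                    # linear features only

X_train, y_train = X_b[:300], y[:300]
X_val, y_val = X_b[300:], y[300:]

sizes = range(2, 301)
train_errors, val_errors = learning_curves(X_train, y_train, X_val, y_val, sizes)
```

Plotting `train_errors` and `val_errors` against `sizes` gives the two curves. With two instances the line fits the training data perfectly, and as more instances are added the training error rises until it levels off.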

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig4-15.png" width=400px alt="fig4-15" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 4-15. Underfitting model. Both curves have reached a plateau; they are close and fairly high_</div>

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig4-16.png" width=400px alt="fig4-16" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 4-16. Overfitting model._</div>

The error on the training data is much lower than with the underfitting model.

There is a gap between the curves. This means that the model performs significantly better on the
training data than on the validation data, which is the hallmark of an overfitting model. However, if you used a much larger training set, the two curves would continue to get closer.

Bias/variance tradeoff

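A model's generalization error can be expressed as the sum of three very different errors:
 - Bias. Due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.
 - Variance. Due to the model's excessive sensitivity to small variations in the training data. A high-variance model is likely to overfit the training data.
 - Irreducible error. Due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data.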
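
For squared-error loss, the three-way split can be written out explicitly. The following is a sketch under assumptions that do not appear in these notes: the targets are generated as $y = f(\mathbf{x}) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the model trained on a randomly drawn training set, and the expectation is taken over training sets and noise. Note that the bias enters squared.

$$E\left[(y - \hat{f}(\mathbf{x}))^2\right] = \underbrace{\left(E[\hat{f}(\mathbf{x})] - f(\mathbf{x})\right)^2}_{\text{Bias}^2} + \underbrace{E\left[\left(\hat{f}(\mathbf{x}) - E[\hat{f}(\mathbf{x})]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$

Increasing a model's complexity typically increases the variance term and reduces the bias term, and reducing complexity does the opposite, which is why this is a tradeoff.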
 
### Regularized Linear Models







