# Chapter 4 - Training Models


## Linear Regression
$\theta$: parameter vector

Linear Regression model prediction: $\hat{y} = \theta_{0} + \theta_{1}x_{1} + \cdots + \theta_{n}x_{n} = h_{\theta}(x) = \theta \cdot x$, where $x$ is the feature vector with $x_{0} = 1$


*MSE cost function for a Linear Regression model*: $MSE(X, h_{\theta}) = \frac{1}{m} \sum \limits _{i=1}^{m}(\theta^{T}x^{(i)}-y^{(i)})^{2}$
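
As a minimal sketch (assuming NumPy and a design matrix `X_b` whose first column is all ones so that $\theta_{0}$ acts as the bias term; the toy data and the helper name `mse` are made up for illustration), the MSE could be computed like this:

```python
import numpy as np

def mse(X_b, y, theta):
    """Mean squared error of the linear model X_b @ theta against targets y."""
    residuals = X_b @ theta - y        # theta^T x^(i) - y^(i) for every instance i
    return np.mean(residuals ** 2)     # (1/m) * sum of squared residuals

# Toy data (illustrative only): three instances, bias column plus one feature.
X_b = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])          # generated by y = 1 + 2x

perfect = mse(X_b, y, np.array([1.0, 2.0]))  # exact parameters, no error
off = mse(X_b, y, np.array([0.0, 2.0]))      # bias off by 1 for every instance
```

Here `perfect` is zero because the parameters reproduce the targets exactly, while `off` averages a squared residual of 1 over all instances.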

*Normal Equation*:  $\hat{\theta} = (X^{T}X)^{-1}X^{T}y$

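Inverting $X^{T}X$, an $(n+1) \times (n+1)$ matrix, is expensive: typically about $O(n^{2.4})$ to $O(n^{3})$ in the number of features $n$. \
The Singular Value Decomposition (SVD) approach, which computes the pseudoinverse instead, is about $O(n^{2})$.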
 
Predictions are very fast with a Linear Regression model.
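
The Normal Equation and the SVD-based alternative can be sketched in NumPy as follows. This is an illustration on made-up, noise-free data (the line $y = 4 + 3x$ is an assumption chosen so that the exact answer is known), not a recommended production routine:

```python
import numpy as np

# Toy data (illustrative only): noise-free targets from y = 4 + 3x.
x = np.linspace(0.0, 2.0, 20)
y = 4.0 + 3.0 * x
X_b = np.column_stack([np.ones_like(x), x])   # prepend x0 = 1 to every instance

# Normal Equation: theta = (X^T X)^-1 X^T y
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# SVD-based alternative: np.linalg.pinv computes the pseudoinverse via SVD.
theta_svd = np.linalg.pinv(X_b) @ y

# Prediction is a single matrix-vector product.
X_new_b = np.array([[1.0, 0.0],
                    [1.0, 2.0]])
y_pred = X_new_b @ theta_svd
```

Both routes recover the same parameters on this data. The pseudoinverse route also works when $X^{T}X$ is not invertible, which the explicit inverse cannot handle.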


## Gradient Descent
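Gradient Descent measures the local gradient of the error function with regard to the parameter vector $\theta$ and steps in the direction of descending gradient. \
Once the gradient is zero, it has reached a minimum of the cost function. \
Learning rate: the size of the steps. Too low => many iterations => training takes a long time. Too high => the algorithm may jump across the valley and diverge. \
Pitfalls for general cost functions: the algorithm can converge to a local minimum instead of the global one, or take a very long time to cross a plateau. \
The MSE cost function for a Linear Regression model is a *convex function*. => No local minima, just one global minimum. \
It is also a continuous function whose slope never changes abruptly. \
=> Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and the learning rate is not too high).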

### Batch Gradient Descent
- Partial Derivative of the cost function: how much the cost function will change if you change $\theta_{j}$
- Gradient vector of the cost function: contains all the partial derivatives of the cost function. $\nabla_{\theta}MSE(\theta)$
- Gradient Descent step: $\theta^{(\text{next step})} = \theta - \eta\nabla_{\theta}MSE(\theta)$ ($\eta$: learning rate)
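
The step above can be sketched in NumPy. For the MSE cost the gradient vector works out to $\nabla_{\theta}MSE(\theta) = \frac{2}{m}X^{T}(X\theta - y)$. The toy data, the learning rate `eta = 0.1` and the iteration count are assumptions picked for illustration:

```python
import numpy as np

# Toy data (illustrative only): noise-free targets from y = 4 + 3x.
x = np.linspace(0.0, 2.0, 50)
y = 4.0 + 3.0 * x
X_b = np.column_stack([np.ones_like(x), x])   # bias column plus one feature
m = len(y)

eta = 0.1             # learning rate (assumed value)
n_iterations = 1000   # assumed value
theta = np.zeros(2)   # arbitrary starting point

for _ in range(n_iterations):
    # Gradient of the MSE over the full training set.
    gradients = (2.0 / m) * X_b.T @ (X_b @ theta - y)
    # Gradient Descent step: move against the gradient, scaled by eta.
    theta = theta - eta * gradients
```

Because every step uses the whole training set, each iteration costs more as the number of instances grows.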





### Stochastic Gradient Descent
Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance. \
Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down. \
Randomness therefore helps the algorithm escape from local optima, but it also means the algorithm can never settle at the minimum. \
One solution to this dilemma is to gradually reduce the learning rate. \
Learning schedule: the function that determines the learning rate at each iteration.
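
A minimal sketch of Stochastic GD with a simple learning schedule follows. The schedule `t0 / (t + t1)`, its hyperparameter values, the epoch count and the noise-free toy data are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data (illustrative only): noise-free targets from y = 4 + 3x.
x = np.linspace(0.0, 2.0, 100)
y = 4.0 + 3.0 * x
X_b = np.column_stack([np.ones_like(x), x])
m = len(y)

t0, t1 = 5, 50   # learning schedule hyperparameters (assumed values)

def learning_schedule(t):
    """Learning rate that shrinks as the iteration count t grows."""
    return t0 / (t + t1)

n_epochs = 50
theta = rng.standard_normal(2)   # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)                      # pick one random instance
        xi, yi = X_b[idx], y[idx]
        gradients = 2.0 * xi * (xi @ theta - yi)   # gradient from that single instance
        eta = learning_schedule(epoch * m + i)     # gradually reduced learning rate
        theta = theta - eta * gradients
```

The steps start large, which helps the algorithm move quickly, and shrink over time so that it can settle near the minimum.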

### Mini-batch Gradient Descent
At each step, Mini-batch GD computes the gradients on small random sets of instances called mini-batches. \
Its main advantage over Stochastic GD is the performance boost you can get from hardware optimization of matrix operations, especially when using GPUs. \
Mini-batch GD will end up walking around a bit closer to the minimum than Stochastic GD. On the other hand, it may be harder for it to escape from local minima.
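
A rough sketch of Mini-batch GD, assuming a constant learning rate, a batch size of 20 and noise-free toy data (all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative only): noise-free targets from y = 4 + 3x.
x = np.linspace(0.0, 2.0, 100)
y = 4.0 + 3.0 * x
X_b = np.column_stack([np.ones_like(x), x])
m = len(y)

eta = 0.05        # constant learning rate (assumed value)
batch_size = 20   # assumed value
n_epochs = 200
theta = rng.standard_normal(2)

for epoch in range(n_epochs):
    shuffled = rng.permutation(m)                    # new random order each epoch
    for start in range(0, m, batch_size):
        batch = shuffled[start:start + batch_size]   # indices of one mini-batch
        X_batch, y_batch = X_b[batch], y[batch]
        # MSE gradient computed over the mini-batch only, as one matrix operation.
        gradients = (2.0 / len(batch)) * X_batch.T @ (X_batch @ theta - y_batch)
        theta = theta - eta * gradients
```

Each update is a small matrix product rather than a per-instance loop, which is where the hardware speed-up comes from.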


## Polynomial Regression
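Add powers of each feature as new features, then train a linear model on this extended set of features. \
Scikit-Learn's `PolynomialFeatures` also adds all combinations of features up to the given degree (e.g. for features $a$ and $b$ with degree 2, it adds $ab$ as well as $a^{2}$ and $b^{2}$).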
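
To make the feature expansion concrete, here is a hand-rolled sketch of the idea. The helper `polynomial_features` is hypothetical and only mimics the behaviour described above (without a bias column); it is not Scikit-Learn's implementation:

```python
from itertools import combinations_with_replacement

import numpy as np

def polynomial_features(X, degree):
    """Return X extended with every product of features up to `degree` (no bias column)."""
    n_features = X.shape[1]
    columns = []
    for d in range(1, degree + 1):
        # Each index tuple picks d features (with repetition) to multiply together.
        for idx in combinations_with_replacement(range(n_features), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)

X = np.array([[2.0, 3.0]])                  # one instance with features a = 2, b = 3
X_poly = polynomial_features(X, degree=2)   # columns: a, b, a^2, a*b, b^2
```

A linear model trained on `X_poly` can then fit a quadratic relationship in the original features.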


## Learning Curves
Learning curve: a plot of the model’s performance on the training set and the validation set as a function of the training set size. \
A model’s generalization error can be expressed as the sum of three errors:
1. Bias: due to wrong assumptions, such as assuming the data is linear when it is actually quadratic. A high-bias model is likely to underfit.
2. Variance: due to the model’s excessive sensitivity to small variations in the training data. A high-variance model is likely to overfit.
3. Irreducible error: due to the noisiness of the data itself.
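
The data behind a learning curve can be computed with a simple loop. This sketch assumes noisy linear toy data, a 40/20 train/validation split and RMSE as the performance measure, all of which are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (illustrative only): noisy linear targets.
x = rng.uniform(0.0, 2.0, 60)
y = 4.0 + 3.0 * x + rng.normal(0.0, 1.0, 60)
X_b = np.column_stack([np.ones_like(x), x])

X_train, y_train = X_b[:40], y[:40]
X_val, y_val = X_b[40:], y[40:]

def rmse(X, y, theta):
    """Root mean squared error of the linear model X @ theta."""
    return np.sqrt(np.mean((X @ theta - y) ** 2))

train_sizes = range(2, len(y_train) + 1)
train_errors, val_errors = [], []
for size in train_sizes:
    # Fit a linear model on the first `size` training instances only.
    theta, *_ = np.linalg.lstsq(X_train[:size], y_train[:size], rcond=None)
    train_errors.append(rmse(X_train[:size], y_train[:size], theta))
    val_errors.append(rmse(X_val, y_val, theta))
```

Plotting `train_errors` and `val_errors` against `train_sizes` gives the learning curve. With only two training instances the line fits them perfectly, so the training error starts at essentially zero.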

## Regularized Linear Models
A good way to reduce overfitting is to regularize the model. \
For a linear model, regularization is typically achieved by constraining the weights of the model.
### Ridge Regression
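A regularization term proportional to $\sum \limits _{i=1}^{n}\theta_{i}^{2}$, scaled by the hyperparameter $\alpha$, is added to the cost function. \
Ridge Regression cost function: $J(\theta) = MSE(\theta) + \alpha \frac{1}{2} \sum \limits _{i=1}^{n}\theta_{i}^{2}$ \
Ridge Regression closed-form solution: $\hat{\theta} = (X^{T}X + \alpha A)^{-1}X^{T}y$, where $A$ is the $(n+1) \times (n+1)$ identity matrix with a 0 in the top-left cell, so the bias term $\theta_{0}$ is not regularized.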
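
The closed-form solution can be sketched directly in NumPy. The helper name `ridge_closed_form`, the toy data and the values of `alpha` are assumptions for illustration. The sketch solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, which gives the same result:

```python
import numpy as np

def ridge_closed_form(X_b, y, alpha):
    """Solve (X^T X + alpha * A) theta = X^T y, with A = identity except A[0, 0] = 0."""
    A = np.eye(X_b.shape[1])
    A[0, 0] = 0.0                     # do not regularize the bias term
    return np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y)

# Toy data (illustrative only): noise-free targets from y = 4 + 3x.
x = np.linspace(0.0, 2.0, 20)
y = 4.0 + 3.0 * x
X_b = np.column_stack([np.ones_like(x), x])

theta_plain = ridge_closed_form(X_b, y, alpha=0.0)    # reduces to the Normal Equation
theta_ridge = ridge_closed_form(X_b, y, alpha=10.0)   # slope is shrunk toward zero
```

With `alpha=0.0` the true parameters are recovered, while a larger `alpha` pulls the slope toward zero and leaves the bias term free to compensate.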
### Lasso Regression
Lasso Regression cost function: $J(\theta) = MSE(\theta) + \alpha \sum \limits _{i=1}^{n}|\theta_{i}|$ \
Lasso Regression tends to completely eliminate the weights of the least important features.
### Elastic Net
The regularization term is a simple mix of both Ridge and Lasso’s regularization terms, and you can control the mix ratio r. \
When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression. \
Elastic Net cost function: $J(\theta) = MSE(\theta) + r\alpha \sum \limits _{i=1}^{n}|\theta_{i}| + \frac{1 - r}{2} \alpha \sum \limits _{i=1}^{n}\theta_{i}^{2}$
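
The way the mix ratio interpolates between the two penalties can be checked numerically. In this sketch the helper `elastic_net_penalty` and the parameter values are made up for illustration, and only the regularization term (not the MSE part) is computed:

```python
import numpy as np

def elastic_net_penalty(theta, alpha, r):
    """Regularization term of the Elastic Net cost; theta[0] (bias) is excluded."""
    weights = theta[1:]
    l1 = np.sum(np.abs(weights))   # Lasso part: sum of |theta_i|
    l2 = np.sum(weights ** 2)      # Ridge part: sum of theta_i^2
    return r * alpha * l1 + (1 - r) / 2 * alpha * l2

theta = np.array([4.0, 3.0, -2.0])   # illustrative parameters: bias, then two weights
alpha = 0.1

ridge_only = elastic_net_penalty(theta, alpha, r=0.0)   # alpha / 2 * (9 + 4)
lasso_only = elastic_net_penalty(theta, alpha, r=1.0)   # alpha * (3 + 2)
mixed = elastic_net_penalty(theta, alpha, r=0.5)        # halfway between the two terms
```

Setting `r=0.0` leaves only the Ridge term and `r=1.0` leaves only the Lasso term, matching the two limiting cases.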
### Early Stopping
Stop training as soon as the validation error reaches a minimum.
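
A minimal sketch of early stopping on top of Batch GD is shown below. Since the validation error can fluctuate, this version keeps the best parameters seen so far and stops after `patience` epochs without improvement. The `patience` mechanism, the hyperparameter values and the toy data are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data (illustrative only): noisy linear targets, split into train and validation.
x = rng.uniform(-1.0, 1.0, 40)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, 40)
X_b = np.column_stack([np.ones_like(x), x])
X_train, y_train = X_b[:30], y[:30]
X_val, y_val = X_b[30:], y[30:]

eta, max_epochs, patience = 0.05, 500, 10   # assumed values
theta = np.zeros(2)
best_val_error, best_theta, best_epoch = np.inf, theta.copy(), -1
val_errors = []

for epoch in range(max_epochs):
    gradients = (2.0 / len(y_train)) * X_train.T @ (X_train @ theta - y_train)
    theta = theta - eta * gradients
    val_error = np.mean((X_val @ theta - y_val) ** 2)   # validation MSE after this epoch
    val_errors.append(val_error)
    if val_error < best_val_error:
        # New minimum: remember the parameters that produced it.
        best_val_error, best_theta, best_epoch = val_error, theta.copy(), epoch
    elif epoch - best_epoch >= patience:
        break   # no improvement for `patience` epochs: stop and keep best_theta
```

After the loop, `best_theta` holds the parameters from the epoch with the lowest validation error, which is the model you would keep.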


## Logistic Regression
Logistic Regression is commonly used to estimate the probability that an instance belongs to a particular class.
### Estimating Probabilities
Logistic Regression model estimated probability: $\hat{p} = h_{\theta}(x) = \sigma(x^{T}\theta)$ \
Logistic Function: $\sigma(t) = \frac{1}{1+\exp(-t)}$ \
Logistic Regression model prediction: $\hat{y} = 0$, if $\hat{p} < 0.5$ and $\hat{y} = 1$, if $\hat{p} \geq 0.5$
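
The three formulas above translate almost line for line into NumPy. In this sketch the parameter vector and the instances are made-up values, and the helper names `sigmoid`, `predict_proba` and `predict` are illustrative:

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real score t to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(X_b, theta):
    """Estimated probability p_hat = sigma(x^T theta) for every row of X_b."""
    return sigmoid(X_b @ theta)

def predict(X_b, theta, threshold=0.5):
    """Predict class 1 when p_hat >= threshold, else class 0."""
    return (predict_proba(X_b, theta) >= threshold).astype(int)

theta = np.array([-1.0, 2.0])   # illustrative parameters: bias, then one weight
X_b = np.array([[1.0, 0.0],     # score -1 -> p_hat < 0.5
                [1.0, 0.5],     # score  0 -> p_hat = 0.5
                [1.0, 2.0]])    # score  3 -> p_hat > 0.5
p_hat = predict_proba(X_b, theta)
y_hat = predict(X_b, theta)
```

The middle instance sits exactly on the threshold, so it is assigned to class 1 by the $\hat{p} \geq 0.5$ rule.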
### Training and Cost Function
### Decision Boundaries
### Softmax Regression
The Logistic Regression model can be generalized to support multiple classes directly. \
When given an instance $x$, the Softmax Regression model first computes a score $s_{k}(x)$ for each class $k$, then estimates the probability of each class by applying the softmax function to the scores. \
Softmax score for class k: $s_{k}(x)=x^{T}\theta^{(k)}$ \
Softmax function: $\hat{p}_{k} = \sigma(s(x))_{k}=\frac{\exp(s_{k}(x))}{\sum \limits_{j=1}^{K}\exp(s_{j}(x))}$
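
The score and softmax steps can be sketched in NumPy as follows. The parameter matrix `Theta` (one column per class) and the instances are made-up values. Subtracting the row maximum before exponentiating is a common numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

# Illustrative parameter matrix: one column theta^(k) per class (3 inputs, 3 classes).
Theta = np.array([[ 0.0, 0.5, -0.5],
                  [ 1.0, 0.0, -1.0],
                  [-1.0, 0.0,  1.0]])
X_b = np.array([[1.0, 2.0, 0.0],
                [1.0, 0.0, 2.0]])

scores = X_b @ Theta            # s_k(x) = x^T theta^(k) for every class k
proba = softmax(scores)         # estimated probability of each class
y_pred = proba.argmax(axis=1)   # pick the class with the highest probability
```

Each row of `proba` sums to 1, and the predicted class is simply the one with the highest score.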


