## Underfitting, generalization, overfitting

#### Underfitting
* If the algorithm does not fit the training data very well, it is said that the algorithm is underfitting the data.  
* Another term for underfitting is that the algorithm has <b>high bias</b>. 

#### Generalization
* If the learning algorithm does well even on examples that are not in the training set, that is called generalization.
* Technically, we want the learning algorithm to generalize well, which means making reasonable predictions even on brand-new examples it has never seen before.
* Binary classification example: a decision boundary that encloses most samples of one class is considered a pretty good fit to the data, even though it does not perfectly classify every training example and some samples of the other class end up on the wrong side. The model still looks good overall (i.e. "just right"), because it is likely to generalize well to new samples. If instead the model tries very hard and contorts itself to find a decision boundary that fits the training data perfectly, it will rely on many higher-order polynomial features and produce a very complex decision boundary.

#### Overfitting
* If the algorithm does an extremely good job of fitting the training data but makes poor predictions on examples that are not in the training set, the algorithm is overfitting the training dataset.
* Choosing parameters that make the cost function exactly zero, because the error is zero on every training example, may lead to overfitting.
* When the model fits the training data almost too well but does not generalize to new examples it has never seen before, the algorithm is also said to have <b>high variance</b>.
* Simpler models are less likely to overfit.
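
The three regimes above can be illustrated with a small, hypothetical experiment. The sketch below assumes noisy samples of the line $y = 2x + 1$ (the data, the noise pattern and all function names are made up for illustration) and compares a constant predictor, a least-squares line, and a polynomial that passes through every training point.

```python
# Hypothetical toy data: y = 2x + 1 plus alternating noise of +/-0.5.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [2 * x + 1 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(x_train)]
# Unseen points, taken from the noise-free line.
x_test = [0.5, 1.5, 2.5, 3.5, 4.5]
y_test = [2 * x + 1 for x in x_test]


def mse(model, xs, ys):
    """Mean squared error of `model` on the points (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_constant(xs, ys):
    """Too simple: always predicts the mean target (tends to underfit)."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares straight line (closed form for a single feature)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return lambda x: slope * (x - mean_x) + mean_y


def fit_interpolant(xs, ys):
    """Too flexible: Lagrange polynomial through every training point."""
    def model(x):
        total = 0.0
        for k, (xk, yk) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != k:
                    basis *= (x - xj) / (xk - xj)
            total += yk * basis
        return total
    return model


models = {
    "constant": fit_constant(x_train, y_train),
    "line": fit_line(x_train, y_train),
    "interpolant": fit_interpolant(x_train, y_train),
}
# For each model: (training error, error on unseen points).
results = {
    name: (mse(f, x_train, y_train), mse(f, x_test, y_test))
    for name, f in models.items()
}
for name, (train_err, test_err) in results.items():
    print(f"{name:12s} train MSE = {train_err:.4f}   test MSE = {test_err:.4f}")
```

On this toy data the constant model has a large error on both sets (underfitting), the interpolating polynomial reaches zero training error but a larger test error than the line (overfitting), and the line does best on the unseen points.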

## How to solve overfitting?

* Collect more training data.
* If there are not enough training samples &rarr; reduce the number of features used, i.e. "feature selection".
* * Simpler models are less likely to overfit.
* * The side effect of feature selection is that we throw away some useful information.
* * There are even algorithms that automatically choose the most appropriate features.
* Use regularization - this is a way to reduce the impact of some of the features more gently, without eliminating them altogether.
* * Regularization encourages the algorithm to shrink the parameters, without necessarily requiring them to be exactly 0.
* * Usually $w$ params are regularized, not $b$.

![piai5.png](attachment:piai5.png)

## Cost function and Gradient descent with regularization

* More generally, regularization tends to be implemented by penalizing all parameters $(w_{1},...,w_{n})$, since it is not known in advance which features are the most important and which should be penalized.
* The way to make certain parameters of the model really small ($\approx$ 0) is to penalize them in the cost function, by adding a term in which those parameters $w$ (squared) are multiplied by a very large coefficient.
* Gradient descent must then find really small values for the penalized $w$ in order to minimize the modified cost; otherwise they would weigh too much in the final sum.
* Regularization usually results in fitting a smoother and simpler model, which is less likely to overfit.
* The regularization parameter is named $\lambda$
* * If $\lambda$ = 0 &rarr; nothing is penalized &rarr; possible overfitting
* * If $\lambda$ >>> 0 (very large) &rarr; regularization is too heavy &rarr; $w_{1}...w_{n}$ will be very small &rarr; possible underfitting
* * Increasing $\lambda$ shrinks $w_{1}...w_{n}$ &rarr; $\lambda$ has to be tuned to balance the two extremes
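
The effect of $\lambda$ on the weights can be seen in a deliberately tiny case. The sketch below assumes a single feature and no bias term ($b = 0$), so that setting the derivative of the regularized squared-error cost to zero gives $w = \frac{\sum_i x^{(i)} y^{(i)}}{\sum_i (x^{(i)})^2 + \lambda}$. The function name `ridge_weight` and the data are hypothetical.

```python
def ridge_weight(xs, ys, lam):
    """Minimizer of the regularized squared-error cost for one feature, no bias."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # hypothetical data lying exactly on y = 2x

# The fitted weight moves from the unregularized value towards 0 as lambda grows.
for lam in [0.0, 1.0, 14.0, 1000.0]:
    print(f"lambda = {lam:7.1f}  ->  w = {ridge_weight(xs, ys, lam):.4f}")
```

With $\lambda = 0$ the weight recovers the true slope, and as $\lambda$ grows it is pulled towards zero, which is the route from a good fit to underfitting described above.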
  
#### Linear regression + regularization (cost function):

$J(\vec{w},b)= \frac{1}{2m} \sum_{i=1}^{m}[f_{\vec{w},b}(x^{(i)}) - y^{(i)} ]^{2} + \frac{\lambda}{2m}\sum_{j=1}^{n}w_{j}^2$ 
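
As a minimal sketch, the cost above can be computed term by term in plain Python. The function name `regularized_linear_cost`, the list-of-lists layout of `X` and the sample values are assumptions made for illustration.

```python
def regularized_linear_cost(X, y, w, b, lam):
    """Squared-error cost plus the lambda/(2m) * sum(w_j^2) penalty."""
    m = len(X)
    squared_error = 0.0
    for x_i, y_i in zip(X, y):
        # Prediction f(x_i) = w . x_i + b for one training example.
        f_i = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b
        squared_error += (f_i - y_i) ** 2
    penalty = sum(w_j ** 2 for w_j in w)  # b is not penalized
    return squared_error / (2 * m) + lam * penalty / (2 * m)


# Hypothetical data: m = 2 examples, n = 2 features.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [5.0, 6.0]
cost = regularized_linear_cost(X, y, w=[1.0, 1.0], b=0.0, lam=1.0)
print(cost)  # 1.25 from the fit term + 0.5 from the penalty
```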

####  Logistic regression + regularization (cost function): 

$J(\vec{w},b)= -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log{(f_{\vec{w},b}(x^{(i)}))} + (1 - y^{(i)}) \log{(1 - f_{\vec{w},b}(x^{(i)}))} ] + \frac{\lambda}{2m}\sum_{j=1}^{n}w_{j}^2$ , where $f_{\vec{w},b}(x^{(i)}) = g(z)$, $g(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function and $z = \vec{w} \cdot x^{(i)} + b$
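
The logistic version differs only in the loss term. The sketch below assumes $g$ is the sigmoid function and uses hypothetical names (`sigmoid`, `regularized_logistic_cost`) and made-up data. It does not guard against `log(0)` for extreme values of $z$.

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def regularized_logistic_cost(X, y, w, b, lam):
    """Cross-entropy cost plus the lambda/(2m) * sum(w_j^2) penalty."""
    m = len(X)
    log_likelihood = 0.0
    for x_i, y_i in zip(X, y):
        z_i = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b
        f_i = sigmoid(z_i)
        # No clipping here: f_i equal to 0 or 1 would make log() fail.
        log_likelihood += y_i * math.log(f_i) + (1 - y_i) * math.log(1 - f_i)
    penalty = sum(w_j ** 2 for w_j in w)  # b is not penalized
    return -log_likelihood / m + lam * penalty / (2 * m)


# Hypothetical data: m = 2 examples, n = 2 features, labels in {0, 1}.
X = [[1.0, 2.0], [3.0, 1.0]]
y = [0, 1]
print(regularized_logistic_cost(X, y, w=[0.0, 0.0], b=0.0, lam=1.0))
print(regularized_logistic_cost(X, y, w=[1.0, -1.0], b=0.0, lam=2.0))
```

With all weights at zero every prediction is 0.5, so the first call returns $\log 2$ regardless of $\lambda$. In the second call the penalty contributes $\frac{\lambda}{2m}(1^2 + (-1)^2) = 1$ on top of the unregularized cost.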

####  Gradient descent (GD)  + regularization (update rules):

$ w_{j} = w_{j} - \alpha [ \frac{1}{m} \sum_{i=1}^{m} ( f_{\vec{w},b}(x^{(i)}) - y^{(i)} ) x^{(i)}_{j} + \frac{\lambda}{m} w_{j} ]$, for $j = 1, ..., n$
<br><br> 
$ b = b - \alpha [\frac{1}{m} \sum_{i=1}^{m} ( f_{\vec{w},b}(x^{(i)}) - y^{(i)} ) ]$
<br><br>
* The GD equations for both types of regression have the same form; however, in logistic regression $f_{\vec{w},b}(x^{(i)}) = g(z)$
* $n = $ length of feature vector 
* $m = $ number of training samples
* $x^{(i)} = (x^{(i)}_{1}, ... , x^{(i)}_{n}) = $ feature vector $i$
* $x^{(i)}_{j} = $ element $j$ in sample $i$
* $f_{\vec{w},b}(x^{(i)}) = w_{1}x^{(i)}_{1}$ + ... + $w_{n}x^{(i)}_{n}$ + $b$ (linear regression)
* $y^{(i)}$ = output (target) value of sample $i$, i.e. the value to be predicted
* $(x^{(i)}, y^{(i)})$ = a single training example (the i-th training example) = a single row in a data table 
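
Using the notation above, a regularized gradient descent loop for linear regression might look like the following sketch. The function names, the zero initialization, the learning rate and the toy data are all assumptions made for illustration. Both gradients are computed before any parameter is changed, so $w$ and $b$ are updated simultaneously.

```python
def predict(x, w, b):
    """f(x) = w_1 * x_1 + ... + w_n * x_n + b for one feature vector."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def gradient_descent(X, y, alpha, lam, num_iters):
    """Regularized GD for linear regression; only w is penalized, not b."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(num_iters):
        errors = [predict(X[i], w, b) - y[i] for i in range(m)]
        # dJ/dw_j = (1/m) * sum_i error_i * x_j^(i) + (lambda/m) * w_j
        grad_w = [
            sum(errors[i] * X[i][j] for i in range(m)) / m + (lam / m) * w[j]
            for j in range(n)
        ]
        # dJ/db = (1/m) * sum_i error_i  (no regularization term)
        grad_b = sum(errors) / m
        w = [w[j] - alpha * grad_w[j] for j in range(n)]
        b = b - alpha * grad_b
    return w, b


# Hypothetical data lying exactly on y = 2x + 1 (m = 5, n = 1).
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

w_plain, b_plain = gradient_descent(X, y, alpha=0.05, lam=0.0, num_iters=5000)
w_reg, b_reg = gradient_descent(X, y, alpha=0.05, lam=10.0, num_iters=5000)
print("lambda = 0 :", w_plain, b_plain)
print("lambda = 10:", w_reg, b_reg)
```

Without regularization the loop recovers the slope and intercept of the underlying line. With $\lambda = 10$ the slope is shrunk, and the unpenalized $b$ grows to compensate. For logistic regression the same loop would apply with `predict` wrapped in the sigmoid function.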