In [None]:
## Generalization, Overfitting and Underfitting

In [None]:
########### OVERFITTING ###########

## An overfitting model uses its ability to capture complex patterns to become very good at 
## predicting many specific data samples, or areas of local variation, in the training set.

## But it often misses the global patterns in the training set that would help it generalize well 
## to the unseen test set. It can't see these more global patterns because, intuitively, there's not 
## enough data to constrain the model to respect these global trends. 

## As a result, the training set accuracy is a hopelessly optimistic indicator for the likely test set 
## accuracy if the model is overfitting.

<img src="pics/o1.png" alt="Drawing" style="width: 700px;"/>

In [None]:
### UNDERFITTING EXAMPLE

## In this case the model has underfit the data. The model is too simple for the actual trends that 
## are present in the data, so it doesn't even do well on the training points. 

## The blue points are the training points, the input to the training process for the regression. 
## A model that is too SIMPLE to fit even the training data is not at all likely to generalize 
## well to test data. 

<img src="pics/o3.png" alt="Drawing" style="width: 700px;"/>

In [None]:
## A better fit with a quadratic function

<img src="pics/o4.png" alt="Drawing" style="width: 700px;"/>
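
## A minimal sketch of the comparison between the underfit line and the quadratic fit, using 
## synthetic data rather than the points in the plots. The quadratic trend, the noise level and 
## the names below are assumptions made for illustration. It fits both models with np.polyfit 
## and compares their error on the training points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (assumed): a quadratic trend plus a little noise.
x_train = np.linspace(-3, 3, 30)
y_train = 0.5 * x_train**2 - x_train + 1 + rng.normal(scale=0.3, size=x_train.size)

def training_mse(degree):
    """Fit a polynomial of the given degree and return its MSE on the training points."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    residuals = y_train - np.polyval(coeffs, x_train)
    return float(np.mean(residuals**2))

mse_linear = training_mse(1)     # straight line: cannot bend to follow the curve
mse_quadratic = training_mse(2)  # quadratic: same shape as the assumed trend

print(f"linear training MSE:    {mse_linear:.3f}")
print(f"quadratic training MSE: {mse_quadratic:.3f}")
```

## The straight line leaves a large error even on the data it was trained on, which is the 
## signature of underfitting. The quadratic removes most of that error.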

In [None]:
############## OVERFITTING EXAMPLE ###############

## As a third example, we might hypothesize that the relationship between the input variable and the
## target variable is a function of several different parameters, say a polynomial that can be 
## very bumpy. If we fit this more complex model to the same set of training data, we might end up with 
## something that looks like this. 

## This more complex model has more parameters, so it's able to capture more subtle behavior. 
## But it also has much higher variance, as you can see: it focuses on capturing the local 
## variations in the training data rather than the more global trend that we can see 
## as humans in the data. This is an example of overfitting. 

## In this case there's not enough data to constrain the parameters of the model enough for it 
## to recognize the global trend.

<img src="pics/o5.png" alt="Drawing" style="width: 700px;"/>
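
## A hedged sketch of this effect with synthetic data (not the data in the plot). The quadratic 
## trend, the noise level, the 10-point training set and the names below are all assumptions. 
## A degree-9 polynomial has 10 coefficients, so 10 training points cannot constrain it: it passes 
## through every noisy point. Errors are averaged over many noisy draws so that the comparison 
## does not depend on one lucky or unlucky sample.

```python
import numpy as np

rng = np.random.default_rng(1)
NOISE = 0.5

def true_trend(x):
    # Assumed global trend for this sketch: a smooth quadratic.
    return 0.5 * x**2 - x + 1

# Only 10 training inputs: too few to pin down a 10-coefficient polynomial.
x_train = np.linspace(-3, 3, 10)
# Unseen inputs from the same range, used only for evaluation.
x_test = np.linspace(-3, 3, 200)

def average_errors(degree, n_trials=200):
    """Average (training MSE, test MSE) over many noisy draws of the data."""
    train_errors, test_errors = [], []
    for _ in range(n_trials):
        y_train = true_trend(x_train) + rng.normal(scale=NOISE, size=x_train.size)
        y_test = true_trend(x_test) + rng.normal(scale=NOISE, size=x_test.size)
        coeffs = np.polyfit(x_train, y_train, deg=degree)
        train_errors.append(np.mean((y_train - np.polyval(coeffs, x_train)) ** 2))
        test_errors.append(np.mean((y_test - np.polyval(coeffs, x_test)) ** 2))
    return float(np.mean(train_errors)), float(np.mean(test_errors))

results = {degree: average_errors(degree) for degree in (2, 9)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

## The degree-9 model should show a training error near zero but a clearly worse test error than 
## the quadratic, because it chases the noise in each training point instead of the global trend.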

In [None]:
################################ OVERFITTING IN CLASSIFICATION #####################

In [None]:
## Finding the decision boundary is the main task of classification. 
## The normal fit case, with a good boundary, is shown below. 

<img src="pics/o6.png" alt="Drawing" style="width: 700px;"/>

In [None]:
## UNDERFITTING IN CLASSIFICATION 

<img src="pics/o7.png" alt="Drawing" style="width: 700px;"/>

In [None]:
#### NORMAL FIT

## A reasonably good model that fits well would be a linear model that finds the 
## general difference between the positive class and the negative class. 

## You can see that it's AWARE of the GLOBAL pattern of having most of the blue negative points in 
## the upper left and most of the red positive points more toward the lower right.

## And it's robust, in the sense that it ignores the occasional blue point in 
## the red region or red point in the blue region. 
## Instead, it has found the more global separation between these two classes.

<img src="pics/o8.png" alt="Drawing" style="width: 700px;"/>

In [None]:
## OVERFITTING IN CLASSIFICATION 

## An overfitting model, on the other hand, would typically be a model that has lots of parameters so it can 
## capture complex behavior. 
## It would try to find something very clever, completely separating the red points 
## from the blue points in a way that results in a highly variable decision boundary. 

## This has the questionable advantage that it captures the training data classes 
## very well. 
## It predicts the training data classes almost perfectly. 
## But as you can see, if the actual division between the classes is the one captured by the linear model, 
## the overfit model is going to make lots of mistakes in the regions where it's trying to be TOO PERFECT 
## in some sense. 

## You'll see this typically with overfitting. The overfit model is highly variable. 
## Again, it's trying to capture too many of the local fluctuations and does not have enough data to see 
## the global trend that would result in better overall generalization performance on new unseen data.

<img src="pics/o9.png" alt="Drawing" style="width: 700px;"/>
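
## A small sketch of the same contrast in code, under stated assumptions. The data is synthetic, 
## with a linear true boundary and 20% of the labels flipped at random. A 1-nearest-neighbor 
## classifier stands in for the overfit model, since these notes do not name the model behind the 
## figure. scikit-learn's LogisticRegression plays the role of the linear model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def make_noisy_classes(n_samples):
    """Positive class toward the lower right (x0 > x1), with 20% of labels flipped."""
    X = rng.uniform(-1, 1, size=(n_samples, 2))
    y = (X[:, 0] > X[:, 1]).astype(int)
    flipped = rng.random(n_samples) < 0.2
    y[flipped] = 1 - y[flipped]
    return X, y

X_train, y_train = make_noisy_classes(300)
X_test, y_test = make_noisy_classes(2000)

models = {
    "linear boundary": LogisticRegression(),
    "1-nearest-neighbor": KNeighborsClassifier(n_neighbors=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    scores[name] = (train_acc, test_acc)
    print(f"{name}: train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
```

## The 1-nearest-neighbor model memorizes every training label, including the flipped ones, so its 
## training accuracy is perfect while its test accuracy falls below that of the linear model.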

In [None]:
############################ A VERY BRIEF INTRO TO K NEAREST NEIGHBORS ####################

In [None]:
## This third example shows the effect of modifying the K parameter in the K nearest neighbors classifier. 

## The three plots shown here show the decision boundaries for K=10, K=5, and K=1. 
## And here we're using the fruit dataset again with the height of a piece of fruit on the x axis and the 
## width on the y axis. 

## As you recall, K=10 means that for each query point we want a prediction for, we look at the 
## 10 nearest training points to the query point. We won't go through all 10 here, but we average or 
## combine the votes of those nearby points to predict the label for the query point. 
## So in the K=10 case, we need the votes from 10 different data instances in the training set to make our 
## prediction. 

## As we reduce K to 5, we only need five neighbors to make a prediction. For example, for a query 
## point with four training points close by, we'd use those four plus whichever remaining point 
## is closest. 

## Finally, the K=1 case is the most unstable, in the sense that for any query point 
## we only look at the single nearest neighbor to that point. 

## So the effect of reducing K in the k-nearest neighbors classifier is to increase the variance 
## of the decision boundaries, because the decision boundary can be affected by outliers. 

## If there's an outlying point far away, it has a much greater effect on the decision boundary 
## in the K=1 case than it would in the K=10 case, when the votes of nine other neighbors are also needed.

<img src="pics/o10.png" alt="Drawing" style="width: 700px;"/>
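
## The voting described above can be sketched from scratch in a few lines. This is an illustration, 
## not scikit-learn's implementation, and the toy points below are hypothetical, not the fruit 
## dataset. One class-1 outlier is placed inside class-0 territory to show how a small K lets a 
## single point decide the prediction.

```python
import numpy as np

# Hypothetical toy data: class 0 sits near the origin, class 1 near (5, 5),
# plus one class-1 outlier placed inside class-0 territory.
X_train = np.array([
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.2, 0.8], [0.8, 0.2],  # class 0
    [5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0], [5.5, 5.5],              # class 1
    [0.5, 0.5],                                                              # class-1 outlier
])
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

def knn_predict(query, k):
    """Predict a label by majority vote among the k training points nearest to query."""
    distances = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = np.bincount(y_train[nearest], minlength=2)
    return int(np.argmax(votes))

query = np.array([0.6, 0.6])  # in class-0 territory, right next to the outlier
predictions = {k: knn_predict(query, k) for k in (10, 5, 1)}
for k, label in predictions.items():
    print(f"K={k}: predicted class {label}")
```

## With K=10 and K=5 the surrounding class-0 points outvote the outlier, so the prediction is 
## class 0. With K=1 the outlier is the only voter, and the prediction flips to class 1.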
