## Outline

1. Bias/Variance Tradeoff
2. Model Complexity and Bias/Variance
3. Tackling High Variance (Overfitting) in a model
    * Feature Subset Selection
    * Cross Validation
    * Regularization

<img style= "float:left;" src= "./resources/underfit.png" width = 400><img style="float:right;" src= "./resources/overfit.png" width = 400>





#### What's wrong with these two models? 

In [3]:
## the left one is underfit (it has a high bias)
## the right one is overfit (it has a high variance)









Underfitting: When our model is unable to detect the "signal" present in our data. This is when we have High Bias.

Overfitting: When our model is fit to the "noise" of our data rather than the "signal." This is when we have High Variance.

<img src="./resources/overfit_underfit.png">

image source : https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76

#### Bias: the difference between an estimator's expected value and the true value of the parameter being estimated
$E[\hat{f}(x)] - f(x)$ 

*Where $f(x)$ is the **true** function representing the data and $E[\hat{f}(x)]$ is the expected predicted value of the estimated model*

*In other words, the average of $\hat{y} - y$ over models fit to many different training samples*
#### Variance: the variability of a model prediction for a given data point.
$E[(\hat{f}(x)-E[\hat{f}(x)])^2]$



### Let's look at the total error of a model. In this example, we'll use Mean Squared Error

$MSE(x) = E[(Y - \hat{f}(x))^2]$

We can break down the components of the error further. Assuming $Y = f(x) + e$, where the noise $e$ has mean zero and variance $\sigma_{e}^2$:
#### $MSE(x) = (E[\hat{f}(x)] - f(x))^2 + E[(\hat{f}(x)-E[\hat{f}(x)])^2] + \sigma_{e}^2$

$MSE(x) = Bias^2 + Variance + Irreducible\ Error$


proof: http://www.cs.cmu.edu/~wcohen/10-601/bias-variance.pdf slides 9-13


video proof explanation: https://www.youtube.com/watch?v=jiQamxz2ZcQ&t=520s 
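
The decomposition can also be checked numerically. The sketch below is only an illustration under stated assumptions: the "true" function (a sine wave), the noise level, the evaluation point `x0` and the straight-line model are all made up for the example and do not come from the lecture data. It refits the model on many fresh training sets, estimates each term at `x0`, and compares their sum with the simulated MSE.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed "true" function f(x), for this illustration only
    return np.sin(2 * np.pi * x)

sigma_e = 0.3            # standard deviation of the noise e (assumed)
x0 = 0.25                # the point x at which the error is evaluated
n_train, n_sims = 30, 5000
degree = 1               # a straight line: too simple for a sine wave

preds = np.empty(n_sims)
sq_errors = np.empty(n_sims)
for i in range(n_sims):
    # Draw a fresh training set and fit the model f_hat to it
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + rng.normal(0, sigma_e, n_train)
    coefs = np.polyfit(x, y, degree)
    preds[i] = np.polyval(coefs, x0)
    # Draw a fresh noisy observation Y at x0 and record the squared error
    y0 = true_f(x0) + rng.normal(0, sigma_e)
    sq_errors[i] = (y0 - preds[i]) ** 2

bias_sq = (preds.mean() - true_f(x0)) ** 2   # (E[f_hat(x)] - f(x))^2
variance = preds.var()                       # E[(f_hat(x) - E[f_hat(x)])^2]
mse = sq_errors.mean()                       # E[(Y - f_hat(x))^2]

print('Bias^2:           ', bias_sq)
print('Variance:         ', variance)
print('Irreducible error:', sigma_e ** 2)
print('Sum of the three: ', bias_sq + variance + sigma_e ** 2)
print('Simulated MSE:    ', mse)
```

Because the straight line cannot follow the sine wave, most of the error here should come from the bias term. The last two printed numbers should agree closely, up to simulation noise.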

### Model Complexity and Bias/Variance
As we increase model complexity, the error on the training set keeps decreasing. Past a certain point, however, the model starts to overfit and the error on the test set goes back up.
<img src = "./resources/bias-variance-train-test.png" width = 500>

image from https://www.learnopencv.com/bias-variance-tradeoff-in-machine-learning/
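
A rough way to see this behaviour is to fit polynomials of increasing degree to the same small training set. The sketch below uses synthetic data (a noisy sine wave, an assumption made purely for illustration) and polynomial degree as a stand-in for model complexity.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    # Synthetic data (assumed for illustration): sine wave plus Gaussian noise
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)

def mean_sq_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

train_errors, test_errors = {}, {}
for degree in [1, 3, 9]:
    # np.polyfit returns the least-squares polynomial of the given degree
    coefs = np.polyfit(x_train, y_train, degree)
    train_errors[degree] = mean_sq_error(y_train, np.polyval(coefs, x_train))
    test_errors[degree] = mean_sq_error(y_test, np.polyval(coefs, x_test))
    print('degree {}  train MSE: {:.3f}  test MSE: {:.3f}'.format(
        degree, train_errors[degree], test_errors[degree]))
```

The training error can only go down as the degree increases, since each polynomial contains the lower-degree ones as special cases. The test error typically improves from degree 1 to degree 3 and then stops improving, or gets worse, at degree 9.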

#### Given this plot of the errors, what is the optimal number of parameters in this model?
<img src = "./resources/num_parameters.png" width = 500>


In [4]:
## around 5 parameters. At that point, your test error starts to increase, even though your train error
## continues to decrease. You are overfitting if you use more parameters than that.








### How to deal with a model that is overfit to data:

* Train with more data
* Cross Validation
* Feature Subset Selection
* Regularization

### Train with more data

Models should be trained on as large a sample as possible. The larger the sample, the less likely you are to fit "noise" that is specific to one particular sample.

Usually, however, gathering more data is no trivial matter.
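
As a small illustration of the effect of sample size, the sketch below fits the same deliberately flexible model (a degree-9 polynomial) to training sets of different sizes. The data is synthetic (a noisy sine wave) and the sizes are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    # Synthetic data (assumed for illustration): sine wave plus Gaussian noise
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_test, y_test = make_data(2000)
degree = 9  # a deliberately flexible model

results = {}
for n in [15, 50, 500]:
    x_train, y_train = make_data(n)
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    results[n] = (train_mse, test_mse)
    print('n = {:3d}  train MSE: {:.3f}  test MSE: {:.3f}'.format(
        n, train_mse, test_mse))
```

With only 15 points the model has almost as many coefficients as observations, so the training error is very low while the test error is much higher. As the sample grows, the two errors should move toward each other.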

### Feature Subset Selection

Wrapper methods: train a machine learning model on different subsets of the features and use its performance to determine which features are optimal.

* Forward Search: Determine the most predictive single variable to regress on, then repeatedly add the variable that helps most until your model stops improving
<img src="./resources/forward_search.png" width =400>
* Backward Elimination: Start with all variables and eliminate the least promising ones (the ones with the highest p-values)
<img src= "./resources/backward_search.png" width = 400>

These particular strategies are sometimes referred to as "greedy" because after you have made a decision at a particular node, you do not revisit that decision.

http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf page 1166

http://www.biostat.jhsph.edu/~iruczins/teaching/jf/ch10.pdf
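
A minimal sketch of forward search is shown below. The helper `forward_search` is hypothetical (it is not a scikit-learn function), the stopping tolerance `tol` is an arbitrary choice, and the data is synthetic. It scores each candidate subset by cross-validated R^2 rather than by p-values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_search(X, y, cv=5, tol=1e-3):
    """Greedy forward selection scored by cross-validated R^2.

    Illustrative helper, not part of scikit-learn: at each step add the
    feature that improves the score most; stop when the gain is below tol.
    """
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # Score every remaining feature when added to the current subset
        scores = {j: cross_val_score(LinearRegression(), X[:, selected + [j]],
                                     y, cv=cv).mean()
                  for j in remaining}
        best_j = max(scores, key=scores.get)
        if scores[best_j] - best_score < tol:
            break  # no candidate improves the model enough
        # Greedy step: once a feature is added it is never reconsidered
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = scores[best_j]
    return selected, best_score

# Synthetic data (assumed): only columns 0 and 3 actually influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=200)

selected, score = forward_search(X, y)
print('Selected features (in order added):', selected)
print('Cross-validated R^2:', score)
```

Column 0 has the strongest effect on `y`, so it should be picked first, followed by column 3. Backward elimination would follow the same pattern in reverse, starting from all columns and dropping one at a time.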

### Cross Validation
To ensure that we do not overfit to the training set, we should split our data into training, validation, and test sets.

#### Steps to cross validation
1. Split your data into training and validation sets.
2. Train models of varying complexity on the training set.
3. Make predictions with the validation set.
4. Choose the model that performs best on the validation set.
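
The steps above can be sketched as follows. The example makes several assumptions for illustration: synthetic data with a cubic relationship, a 60/20/20 split, and polynomial degree as the measure of model complexity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumed for illustration): cubic relationship plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 3 - 2 * X[:, 0] + rng.normal(scale=3.0, size=300)

# 1. Split: 60% training, 20% validation, 20% test
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# 2-3. Train models of varying complexity, score each on the validation set
models, val_mse = {}, {}
for degree in range(1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    models[degree] = model
    val_mse[degree] = mean_squared_error(y_val, model.predict(X_val))

# 4. Choose the model with the lowest validation error
best_degree = min(val_mse, key=val_mse.get)

# The test set is touched only once, to estimate generalization error
test_mse = mean_squared_error(y_test, models[best_degree].predict(X_test))
print('Best degree:', best_degree)
print('Test MSE of chosen model:', test_mse)
```

Keeping the test set out of the model selection loop is what makes the final number an honest estimate: the validation error of the winning model is optimistically biased because it was used to pick the winner.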

<img src="./resources/train_test_split.png" width = 500>


##### Cross Validation in Action

In [5]:
import pandas as pd
import seaborn as sns
data = pd.read_csv('./resources/auto-mpg.csv')
data.head()


| Unnamed: 0 | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |


In [15]:
series = data['horsepower']

In [17]:
series.str.isnumeric().sum()  # number of rows whose horsepower is a numeric string

392

In [6]:
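from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
# Predict displacement from weight with a simple linear regression
X = data['weight']
y = data['displacement']
# Hold out 20% of the rows as a test set (the split is random on every run)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
lr = LinearRegression()
# scikit-learn expects a 2-D feature array, hence the reshape
lr.fit(X_train.values.reshape(-1,1),y_train)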

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [7]:
# RMSE on the training set (mean_squared_error takes y_true first, then y_pred)
train_predictions = lr.predict(X_train.values.reshape(-1,1))
train_rmse = mean_squared_error(y_train,train_predictions)**0.5
# RMSE on the held-out test set
test_predictions = lr.predict(X_test.values.reshape(-1,1))
test_rmse = mean_squared_error(y_test,test_predictions)**0.5
print(' Training Error: {}\n Test Error: {}'.format(train_rmse,test_rmse))

 Training Error: 38.56682662624701
 Test Error: 33.07073319635047


In [8]:
## In this run the test error is lower than the training error. That is an artifact of this
## particular random split; we would usually expect the test error to be the higher of the two.
## What are some potential problems with this version of cross validation?

# We still might be overfitting to something that is present only in the training set. If we run the
# previous cells multiple times, we get different results for the training and test error each time.






#### K-fold cross validation

1. Split the data into k folds.
2. Train your model on k-1 of the folds.
3. Test on the remaining fold.
4. Repeat k times, so that each fold is used as the test fold once, and average the scores for whichever metric you are using.
5. Once you've determined the best model, train it on the entire training set.

In [9]:
import numpy as np
from sklearn.model_selection import KFold
folds = KFold(n_splits=5,shuffle=True)
mse = []
for train_idx, test_idx in folds.split(X,y):
    # split() yields positional indices, so select rows with .iloc
    lr = LinearRegression()
    lr.fit(X.iloc[train_idx].values.reshape(-1,1),y.iloc[train_idx])
    predictions = lr.predict(X.iloc[test_idx].values.reshape(-1,1))
    mse.append(mean_squared_error(y.iloc[test_idx],predictions))

# Average the per-fold MSE to get the cross-validated estimate
avg_mse = np.mean(mse)
print(avg_mse)


1411.9549900336665


In [11]:
data.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model year', 'origin', 'car name'],
      dtype='object')

In [25]:
# .copy() gives an independent DataFrame, so the assignment below does not
# trigger pandas' SettingWithCopyWarning
cleaned_data = data[data['horsepower'].str.isnumeric()].copy()
cleaned_data['horsepower'] = cleaned_data['horsepower'].astype('int')


In [88]:
## cross validation can be done to check for the significance of features or models.
X_values = cleaned_data[['mpg', 'cylinders', 'displacement', 'weight',
       'acceleration', 'model year', 'origin']]
y_values = cleaned_data['horsepower']

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso, Ridge
lr = LinearRegression()
la_model = Lasso()
ri_model = Ridge()
models = [lr,la_model,ri_model]

for model in models:
    scores = cross_val_score(model,X_values,y_values,cv=5)
    print(str(model))
    print(scores)
    print('Average R^2 is:', np.mean(scores))
    print('---------------------\n')

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
[0.84837778 0.89975864 0.90112909 0.86748389 0.61147899]
Average R^2 is: 0.8256456774554687
---------------------

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
[0.83697118 0.89507352 0.89334634 0.8686426  0.66277496]
Average R^2 is: 0.8313617179841744
---------------------

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
[0.8482818  0.89972975 0.90106848 0.86763043 0.61262674]
Average R^2 is: 0.8258674385039168
---------------------



#### We can now see how different models perform. Lasso has the highest average R^2 (about 0.831, versus about 0.826 for the other two), although the margin is small.

Another variant: leave-one-out cross validation

This is k-fold cross validation with k equal to the number of datapoints n: each model trains on n-1 datapoints and is tested on the single remaining datapoint. It is typically used with small sample sizes.

More on that here: http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics
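
A minimal sketch of leave-one-out cross validation with scikit-learn's `LeaveOneOut` splitter is shown below. The 25-point dataset is synthetic and exists only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic sample (assumed): 25 points from a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(25, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=1.0, size=25)

# With a single test point per fold, R^2 is not defined, so score with MSE.
# scikit-learn reports it negated, under the name 'neg_mean_squared_error'.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring='neg_mean_squared_error')

print('Number of folds:', len(scores))       # one fold per datapoint
print('Leave-one-out MSE:', -scores.mean())
```

The model is fit once per datapoint, which is why this variant gets expensive on large datasets.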

### Regularization


Regularization reduces the effect that each parameter has on a predictive model. It is implemented by adding a penalty on the size of the coefficients to the model's cost function. We will cover this in more detail soon.
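
As a quick preview, the sketch below compares the coefficients of an unregularized linear regression with those of Ridge and Lasso. The data is synthetic and the `alpha` values are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data (assumed): only the first two of six features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=1.0, size=100)

ols = LinearRegression().fit(X, y)    # no penalty
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty: can set coefficients exactly to zero

for name, model in [('OLS', ols), ('Ridge', ridge), ('Lasso', lasso)]:
    print('{:5s}'.format(name), np.round(model.coef_, 2))
```

Both penalized models should show smaller coefficients than the unpenalized fit, and Lasso will usually zero out some of the four irrelevant features entirely.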

<img src = "./resources/lasso_ridge.png">

### Resources

http://scott.fortmann-roe.com/docs/BiasVariance.html

https://www.youtube.com/watch?v=jiQamxz2ZcQ