## Overfitting and Underfitting

Overfitting occurs when a statistical model or machine learning algorithm captures the noise in the data rather than the underlying trend. Intuitively, overfitting occurs when the model or the algorithm fits the training data too well.

Overfitting is often the result of an excessively complicated model.


Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.  Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough. 

Underfitting is often a result of an excessively simple model.

Both overfitting and underfitting lead to poor predictions on new data sets.
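To make the effect on new data concrete, here is a minimal sketch on synthetic data. The sine-shaped `true_fn`, the noise level, and the degrees 1, 3 and 9 are assumptions chosen for illustration; they are not taken from the salary dataset used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def true_fn(x):
    # Hypothetical "underlying trend", used only for this illustration
    return np.sin(2 * np.pi * x)

# Small noisy training set and a larger noisy test set from the same trend
x_train = rng.uniform(0, 1, 15)
y_train = true_fn(x_train) + rng.normal(0, 0.2, 15)
x_test = rng.uniform(0, 1, 200)
y_test = true_fn(x_test) + rng.normal(0, 0.2, 200)

def fit_and_score(degree):
    # Least-squares polynomial fit on the training data only
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# Degree 1: too simple, degree 3: reasonable, degree 9: very flexible
results = {d: fit_and_score(d) for d in (1, 3, 9)}
for d, (tr, te) in results.items():
    print(f"degree {d}: train MSE = {tr:.3f}, test MSE = {te:.3f}")
```

The training error can only go down as the degree grows, so it is the gap between training and test error that signals overfitting, while a high error on both signals underfitting.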

<img src="images/overfit.png"/>

### Bias/Variance

<img src="images/bias.PNG"/>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv('dataset/Position_Salaries.csv')

FileNotFoundError: File b'dataset/Position_Salaries.csv' does not exist

In [None]:
data.head()

In [None]:
X = data.iloc[:,1:2].values
X.shape

In [None]:
y = data.iloc[:,2].values

In [None]:
plt.scatter(X,y,color='red')
#plt.plot(X,lin_reg.predict(X),color='blue')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

In [None]:
from sklearn.linear_model import LinearRegression
#from sklearn.metrics import r2_score

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X,y)

In [None]:
train_pred = lin_reg.predict(X)

In [None]:
plt.scatter(X,y,color='red')
plt.plot(X,train_pred,color='blue')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

### We can see that a linear function (polynomial with degree 1) is not sufficient to fit the training samples. This is called underfitting.

### Fitting with Polynomial Features

In [3]:
from sklearn.preprocessing import PolynomialFeatures

In [4]:
poly = PolynomialFeatures(degree=3)

In [5]:
poly.fit(X)

NameError: name 'X' is not defined

In [None]:
X_train_poly = poly.transform(X)

In [None]:
lin_poly = LinearRegression()
lin_poly.fit(X_train_poly,y)

In [None]:
train_pred_poly = lin_poly.predict(X_train_poly) 

In [None]:
plt.scatter(X,y,color='red')
plt.plot(X,train_pred_poly,color='blue')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.title('Overfitting training data')
plt.show()

A polynomial of degree 6 approximates the true function almost perfectly. However, for higher degrees the model will overfit the training data.
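One common way to choose the degree is to compare cross-validated error across candidate degrees. The sketch below is an illustration under assumptions, not the procedure used in this notebook: the cubic data, the names `X_demo` and `y_demo`, the 5-fold split, and the degree range 1 to 8 are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data with a cubic trend plus noise (assumed for illustration)
X_demo = rng.uniform(-2, 2, size=(60, 1))
y_demo = X_demo[:, 0] ** 3 - 2 * X_demo[:, 0] + rng.normal(0, 0.3, 60)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mse = {}
for degree in range(1, 9):
    # The pipeline refits the polynomial expansion inside every fold
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X_demo, y_demo, cv=cv,
                             scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()  # flip the sign back to a plain MSE

best_degree = min(cv_mse, key=cv_mse.get)
for degree, err in cv_mse.items():
    print(f"degree {degree}: cross-validated MSE = {err:.3f}")
print("degree with the lowest cross-validated MSE:", best_degree)
```

Degrees that are too low show a large cross-validated error because they underfit. Very high degrees tend to lose ground again because they start fitting the noise in each training fold.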

In [None]:
poly = PolynomialFeatures(degree=2)

In [None]:
X_poly = poly.fit_transform(X)

In [None]:
lin_poly = LinearRegression()
lin_poly.fit(X_poly,y)

In [None]:
X_pred_poly = lin_poly.predict(X_poly) 

In [None]:
plt.scatter(X,y,color='red')
plt.plot(X,X_pred_poly,color='blue')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

## Reducing Overfitting: Regularization

### Regularization is a technique that shrinks the coefficient estimates towards zero by adding a penalty term to the loss function.

### It is a method for automatically penalizing extra features.

### The L1 and L2 regularization terms both penalize model complexity.

<img src="images/reg.png"/>

### L1 regularization:

- Adds a penalty equal to the sum of the absolute values of the weights.

- L1 can yield sparse weight vectors (i.e. many feature weights are exactly zero).

- This can be useful in practice if we have a high-dimensional dataset with many irrelevant features.

- Because some coefficients can become exactly zero, L1 also acts as a form of feature selection. Lasso regression uses this method.

### L2 regularization:

- Adds a penalty equal to the sum of the squares of the weights.

- L2 produces non-sparse coefficients, so it does not have a built-in feature selection property.

- Ridge regression and SVMs use this method.
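The difference in sparsity between the two penalties can be checked directly. This is a small sketch on synthetic data; the data (20 features, only the first 3 relevant), the names `X_syn` and `y_syn`, and the `alpha` values are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# 20 features, but only the first 3 influence the target (assumed setup)
X_syn = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]
y_syn = X_syn @ w_true + rng.normal(0, 0.1, 200)

lasso = Lasso(alpha=0.1).fit(X_syn, y_syn)  # L1 penalty
ridge = Ridge(alpha=1.0).fit(X_syn, y_syn)  # L2 penalty

# Count coefficients that are exactly zero under each penalty
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print("zero coefficients with Lasso (L1):", lasso_zeros)
print("zero coefficients with Ridge (L2):", ridge_zeros)
```

Lasso drops irrelevant features by setting their weights to exactly zero, while Ridge only shrinks them towards zero.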

<img src="images/ridge.PNG"/>

###  Now, the coefficients are estimated by minimizing this function. Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of our model.

### This is how the Ridge regression technique prevents coefficients from rising too high.
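The figure is not reproduced as text here. Assuming it shows the usual ridge objective, it can be written as follows, with $n$ samples, $p$ features, intercept $\beta_0$, and tuning parameter $\lambda$ (the `alpha` argument of scikit-learn's `Ridge`):

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

The first term is the ordinary residual sum of squares. The second term grows with the size of the coefficients, so large coefficients are only kept when they reduce the residuals enough to pay for the penalty.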

Here α (`alpha` in scikit-learn's `Ridge`) plays the role of the tuning parameter λ: it is a hyperparameter that controls the strength of the penalty. Hyperparameters are parameters of the model that are not learned automatically and must be set manually.

As the value of alpha increases, the model complexity decreases. Higher values of alpha reduce overfitting, but very high values can cause underfitting as well.
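The effect of the penalty strength can be seen with a small NumPy sketch. It uses the closed-form ridge solution without an intercept rather than scikit-learn's implementation. The synthetic data, the helper `ridge_fit`, and the grid of `alphas` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem (assumed for illustration)
X_syn = rng.normal(size=(50, 10))
w_true = rng.normal(size=10)
y_syn = X_syn @ w_true + rng.normal(0, 0.5, 50)

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution without an intercept:
    # w = (X^T X + alpha * I)^(-1) X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
train_mses = []
for alpha in alphas:
    w = ridge_fit(X_syn, y_syn, alpha)
    coef_norms.append(np.linalg.norm(w))  # overall size of the weights
    train_mses.append(np.mean((X_syn @ w - y_syn) ** 2))
    print(f"alpha={alpha:>8}: ||w|| = {coef_norms[-1]:.3f}, "
          f"train MSE = {train_mses[-1]:.3f}")
```

The weight norm shrinks as alpha grows, while the training error rises. In practice alpha is therefore chosen on held-out data rather than on the training set.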

### Regression model for predicting Boston housing prices

In [8]:
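# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs scikit-learn < 1.2.
from sklearn.datasets import load_boston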
#from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

In [9]:
X = load_boston()
X.data.shape

(506, 13)

In [10]:
X_train,X_test,y_train,y_test = train_test_split(X.data,X.target,test_size=.20,random_state=1)

In [11]:
poly = PolynomialFeatures()
X_poly = poly.fit_transform(X_train)

In [12]:
ridge_reg = Ridge()

In [13]:
ridge_reg.fit(X_poly,y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [14]:
X_poly_test = poly.transform(X_test)

In [15]:
pred = ridge_reg.predict(X_poly_test)

In [16]:
from sklearn.metrics import r2_score

In [17]:
r2_score(y_test,pred)

0.9202111333707729

In [18]:
r2_score(y_train,ridge_reg.predict(X_poly))

0.9188912314000841

In [19]:
#ridge_reg.coef_
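The same workflow can also be written as a single pipeline that scales the features, since the ridge penalty depends on the scale of each feature. The sketch below is one possible variant, not a rerun of the cells above. It swaps in `load_diabetes` because `load_boston` is no longer available in recent scikit-learn versions, and the variable names with the `_d` suffix are hypothetical.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Swapped-in dataset (an assumption): bundled with scikit-learn, no download
diabetes = load_diabetes()
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    diabetes.data, diabetes.target, test_size=0.20, random_state=1)

# Expand to degree-2 features, standardize them, then fit ridge regression
ridge_pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)
ridge_pipe.fit(X_train_d, y_train_d)

# Compare the fit on training data with the fit on held-out data
train_r2 = r2_score(y_train_d, ridge_pipe.predict(X_train_d))
test_r2 = r2_score(y_test_d, ridge_pipe.predict(X_test_d))
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Because the pipeline fits the scaler and the polynomial expansion on the training split only, the test split is transformed with the training statistics, which mirrors the `fit_transform` / `transform` pattern used above.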