# The Bias-Variance Tradeoff

## Generalization Error

**Overfitting**

![image-4](image-4.png)

The model has clearly memorized the noise present in the training set. Such a model achieves a low training-set error but a high test-set error.

**Underfitting**

![image-5](image-5.png)

For such a model, the training-set error is roughly equal to the test-set error, but both errors are relatively high.
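
The two failure modes can be reproduced with a minimal sketch. The data here is synthetic (a noisy sine curve, an assumption made purely for illustration), and the names `X_demo`, `y_demo` and `train_test_mse` are hypothetical helpers, not part of the notebook's datasets.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data (assumption): a sine curve plus Gaussian noise
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 6, size=(300, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=300)

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

def train_test_mse(max_depth):
    """Fit a tree of the given depth and return (train MSE, test MSE)."""
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X_demo_train, y_demo_train)
    return (mean_squared_error(y_demo_train, tree.predict(X_demo_train)),
            mean_squared_error(y_demo_test, tree.predict(X_demo_test)))

# A depth-1 tree is too simple for a sine curve (underfitting candidate)
underfit_train, underfit_test = train_test_mse(max_depth=1)
# An unconstrained tree can memorize every training point (overfitting candidate)
overfit_train, overfit_test = train_test_mse(max_depth=None)

print('depth=1    train MSE: {:.3f}  test MSE: {:.3f}'.format(underfit_train, underfit_test))
print('depth=None train MSE: {:.3f}  test MSE: {:.3f}'.format(overfit_train, overfit_test))
```

The unconstrained tree should show a near-zero training error with a clearly larger test error, while the depth-1 tree should show two errors that are close to each other but both high.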

**Generalization Error**

![image-6](image-6.png)

- The higher the bias, the more the model underfits
- The higher the variance, the more the model overfits
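
For squared-error loss, the standard decomposition behind these two bullets can be written as follows. The symbols are assumptions introduced here rather than taken from the figures: the data is assumed to follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the fitted model, and the expectation is taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$

Increasing model complexity tends to lower the bias term and raise the variance term, which is why the total error is usually smallest at an intermediate complexity.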

![image-7](image-7.png)


## Diagnose bias and variance problems

**K-Fold CV**

![image-8](image-8.png)

CV error = (E1 + E2 + ... + E10) / 10, where Ei is the error measured on the i-th held-out fold

- **High variance**: CV error > training set error
- **High bias**: CV error ≈ training set error, but both are much greater than the desired error
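
The averaging formula can be made concrete with a small sketch that computes each fold error by hand and compares the mean with `cross_val_score`. The data is synthetic (an assumption for illustration), and the names `X_cv`, `y_cv` and `fold_errors` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data (assumption): noisy sine curve
rng = np.random.RandomState(0)
X_cv = rng.uniform(0, 6, size=(200, 1))
y_cv = np.sin(X_cv).ravel() + rng.normal(scale=0.3, size=200)

model = DecisionTreeRegressor(max_depth=4, random_state=0)

# Each of the 10 folds is held out once while the other 9 are used for fitting
fold_errors = []
for fit_idx, held_out_idx in KFold(n_splits=10).split(X_cv):
    model.fit(X_cv[fit_idx], y_cv[fit_idx])
    held_out_pred = model.predict(X_cv[held_out_idx])
    fold_errors.append(mean_squared_error(y_cv[held_out_idx], held_out_pred))

cv_error = np.mean(fold_errors)  # (E1 + E2 + ... + E10) / 10

# The same quantity via cross_val_score, which returns negated MSE values
cv_error_sklearn = -cross_val_score(model, X_cv, y_cv, cv=10,
                                    scoring='neg_mean_squared_error').mean()

print('manual: {:.4f}  cross_val_score: {:.4f}'.format(cv_error, cv_error_sklearn))
```

For a regressor, passing `cv=10` uses unshuffled `KFold` splits, so the two numbers should agree.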

In [1]:
import pandas as pd
auto = pd.read_csv('datasets/auto.csv')
auto.head()

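| Unnamed: 0 | mpg | displ | hp | weight | accel | origin | size |
|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 250.0 | 88 | 3139 | 14.5 | US | 15.0 |
| 1 | 9.0 | 304.0 | 193 | 4732 | 18.5 | US | 20.0 |
| 2 | 36.1 | 91.0 | 60 | 1800 | 16.4 | Asia | 10.0 |
| 3 | 18.5 | 250.0 | 98 | 3525 | 19.0 | US | 15.0 |
| 4 | 34.3 | 97.0 | 78 | 2188 | 15.8 | Europe | 10.0 |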


In [2]:
# Separate the data into explanatory variables (X) and the response (y)
X = auto.drop(['mpg', 'hp', 'weight', 'accel', 'origin', 'size'], axis=1)
y = auto['mpg']

In [3]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score

# Set seed for reproducibility
SEED = 123

# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=SEED)

# Instantiate decision tree regressor 
dt = DecisionTreeRegressor(max_depth=4,
                          min_samples_leaf=0.14,
                          random_state=SEED)


In [4]:
# Evaluate the list of MSE obtained by 10-fold CV
# Set n_jobs to -1 in order to exploit all CPU cores in computation
MSE_CV = - cross_val_score(dt, X_train, y_train, cv=10,
                          scoring = 'neg_mean_squared_error',
                          n_jobs = -1)
MSE_CV

array([16.33824557, 19.71569647, 20.47204676, 19.32346373, 17.7182428 ,
       20.90288889, 25.60983091, 15.84346118, 22.69564507, 20.80819754])

In [5]:
# Fit 'dt' to the training set
dt.fit(X_train, y_train)
# Predict the target values for the training set
y_predict_train = dt.predict(X_train)
# Predict the target values for the test set
y_predict_test = dt.predict(X_test)

In [6]:
# CV MSE 
print('CV MSE: {:.2F}'.format(MSE_CV.mean()))

# Training set MSE 
print('Train MSE: {:.2F}'.format(MSE(y_train, y_predict_train)))

# Test set MSE 
print('Test MSE: {:.2F}'.format(MSE(y_test, y_predict_test)))

CV MSE: 19.94
Train MSE: 17.89
Test MSE: 20.41


Because the training-set error (17.89) is lower than the CV error (19.94), we can conclude that `dt` overfits the training set, i.e. it suffers from high variance.
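
The same train-versus-CV comparison can be gathered in a single call with `cross_validate(..., return_train_score=True)`. The sketch below is an illustration on synthetic data (an assumption, not the `auto` dataset), and the helper `train_and_cv_mse` is hypothetical. Note that the training error here is averaged over the ten 9-fold fits rather than measured on one fit of the full training set, so it is a slightly different quantity from the one printed above.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption): noisy sine curve
rng = np.random.RandomState(0)
X_syn = rng.uniform(0, 6, size=(200, 1))
y_syn = np.sin(X_syn).ravel() + rng.normal(scale=0.3, size=200)

def train_and_cv_mse(max_depth):
    """Return (mean training MSE, mean CV MSE) over 10 folds."""
    scores = cross_validate(
        DecisionTreeRegressor(max_depth=max_depth, random_state=0),
        X_syn, y_syn, cv=10,
        scoring='neg_mean_squared_error',
        return_train_score=True)
    # Scores are negated MSE values, so flip the sign
    return -scores['train_score'].mean(), -scores['test_score'].mean()

# Compare a very shallow tree, a moderate tree and an unconstrained tree
results = {depth: train_and_cv_mse(depth) for depth in (1, 4, None)}
for depth, (train_mse, cv_mse) in results.items():
    print('max_depth={}: train MSE {:.3f}, CV MSE {:.3f}'.format(depth, train_mse, cv_mse))
```

A large gap between the two columns points towards high variance; two similar but high values point towards high bias.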

## Ensemble Learning
- Limitation of CARTs: high variance; unconstrained CARTs may overfit the training set
- Solution: ensemble learning

![image-9](image-9.png)


### Ensemble Learning: Voting Classifier

![image-10](image-10.png)
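
Hard voting, which is the default mode of scikit-learn's `VotingClassifier`, amounts to a majority vote over the individual predictions. A minimal sketch with hypothetical binary predictions (the array below is made up for illustration):

```python
import numpy as np

# Hypothetical binary predictions from three classifiers for four samples;
# rows are classifiers, columns are samples
predictions = np.array([
    [1, 0, 1, 1],   # classifier A
    [1, 1, 0, 1],   # classifier B
    [0, 0, 1, 1],   # classifier C
])

n_classifiers = predictions.shape[0]
votes_for_1 = predictions.sum(axis=0)            # how many classifiers predicted 1
majority = (votes_for_1 > n_classifiers / 2).astype(int)

print(majority)  # prints [1 0 1 1]
```

With an odd number of classifiers and two classes there are no ties, which keeps this sketch simple.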


In [7]:
breast_cancer = pd.read_csv('datasets/wbc.csv')
breast_cancer.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [8]:
# Import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Import models, including VotingClassifier meta-model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier

# Set seed for reproducibility
SEED = 1

In [9]:
# Separate features (X) and target (y); note that the 'id' column remains in X
X = breast_cancer.drop(['diagnosis', 'Unnamed: 32'], axis=1)
y = breast_cancer['diagnosis']

In [10]:
# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.3,
                                                   random_state=SEED)

# Instantiate individual classifiers
lr = LogisticRegression(random_state=SEED)
knn = KNN()
dt = DecisionTreeClassifier(random_state=SEED)

# Define a list called classifiers that contains the tuples (classifier_name, classifier)
classifiers = [('Logistic Regression',lr),
              ('K Nearest Neighbours',knn),
              ('Classification Tree',dt)]

In [11]:
# Iterate over the defined list of tuples containing the classifiers
for clf_name, clf in classifiers:
    # Fit clf to the training set
    clf.fit(X_train, y_train)

    # Predict the labels of the test set
    y_pred = clf.predict(X_test)

    # Evaluate the accuracy of clf on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy_score(y_test, y_pred)))

Logistic Regression : 0.632
K Nearest Neighbours : 0.766
Classification Tree : 0.930


In [12]:
# VOTING CLASSIFIER

# Instantiate a VotingClassifier 'vc'
vc = VotingClassifier(estimators = classifiers)

# Fit 'vc' to the training set and predict test set labels
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)

# Evaluate the test-set accuracy of 'vc'
print('Voting Classifier: {:.3f}'.format(accuracy_score(y_test,y_pred)))

Voting Classifier: 0.772
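
The ensemble scores 0.772 here, below the classification tree's 0.930, because a majority vote is pulled down when two of the three voters are weak. One possible contributor (an assumption, not verified against `wbc.csv`) is that `X` still contains the `id` column and the features are unscaled, which tends to hurt logistic regression and KNN far more than a tree. The sketch below shows how scaling could be wired into the ensemble with pipelines. It uses scikit-learn's bundled breast cancer dataset as a stand-in, which is not the same file as `wbc.csv`, so its score is not comparable to the numbers above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): 30 numeric features, no identifier column
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=1, stratify=y_bc)

# Scale features for the two scale-sensitive models; the tree needs no scaling
scaled_classifiers = [
    ('lr', make_pipeline(StandardScaler(), LogisticRegression(random_state=1))),
    ('knn', make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ('dt', DecisionTreeClassifier(random_state=1)),
]

# Hard (majority) voting is the default
vc_scaled = VotingClassifier(estimators=scaled_classifiers)
vc_scaled.fit(X_bc_train, y_bc_train)
vc_accuracy = accuracy_score(y_bc_test, vc_scaled.predict(X_bc_test))

print('Voting Classifier (scaled pipelines): {:.3f}'.format(vc_accuracy))
```

Putting the scaler inside each pipeline means it is fitted on the training split only, so no information from the test split leaks into preprocessing.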
