# Generalization Error

## Supervised Learning - Under the Hood
Supervised Learning: $y = f(x)$, where $f$ is unknown.

## Goals of Supervised Learning
* Find a model $\hat{f}$ that best approximates $f :\hat{f} \approx f$
* $\hat{f}$ can be Logistic Regression, Decision Tree, Neural Network ...
* Discard noise as much as possible.
* **End goal:** $\hat{f}$ should achieve a low predictive error on unseen datasets.

## Difficulties in Approximating f
* Overfitting: $\hat{f}$ fits the training set noise.
* Underfitting: $\hat{f}$ is not flexible enough to approximate $f$.

## Overfitting
When a model overfits the training set, its predictive power on unseen datasets is low. The model memorizes the noise present in the training set. Such a model achieves a low training set error and a high test set error.

## Underfitting
When a model underfits the data, the training set error is roughly equal to the test set error. However, both errors are relatively high. The trained model isn't flexible enough to capture the complex dependency between features and labels. By analogy, it's like teaching calculus to a 3-year-old: the child does not yet have the level of mental abstraction required to understand calculus.
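The two failure modes can be reproduced on synthetic data. The following is a minimal sketch in plain Python, not part of the original notebook: the ground-truth function `true_f`, the noise level and the two toy models (a 1-nearest-neighbour regressor and a constant mean predictor) are all assumptions chosen for illustration.

```python
import math
import random

random.seed(0)

def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration
    return math.sin(2 * math.pi * x)

def make_data(n, noise=0.3):
    # Draw n points y = f(x) + Gaussian noise
    xs = [random.random() for _ in range(n)]
    ys = [true_f(x) + random.gauss(0, noise) for x in xs]
    return xs, ys

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

x_train, y_train = make_data(50)
x_test, y_test = make_data(50)

def predict_1nn(x):
    # Very flexible model: copy the label of the closest training point
    i = min(range(len(x_train)), key=lambda j: abs(x_train[j] - x))
    return y_train[i]

mean_y = sum(y_train) / len(y_train)

def predict_mean(x):
    # Very rigid model: always predict the training mean
    return mean_y

errors = {}
for name, model in [("1-NN (flexible)", predict_1nn), ("mean (rigid)", predict_mean)]:
    train_err = mse(y_train, [model(x) for x in x_train])
    test_err = mse(y_test, [model(x) for x in x_test])
    errors[name] = (train_err, test_err)
    print("{:s}: train MSE {:.2f}, test MSE {:.2f}".format(name, train_err, test_err))
```

The flexible model reproduces every training label, noise included, so its training error is zero while its test error is not. The rigid model cannot follow `true_f` at all, so its training error is already high.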

## Generalization Error

The generalization error of a model tells you how well it generalizes to unseen data. It can be decomposed into three terms: bias, variance and irreducible error, where the irreducible error is the error contribution of noise.

* **Generalization Error of $\hat{f}$:** Does $\hat{f}$ generalize well on unseen data?
* It can be decomposed as follows:  
$$\text{Generalization error of } \hat{f} = \text{bias}^2 + \text{variance} + \text{irreducible error}$$
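
For squared error at a fixed input $x$, the three terms can be written out explicitly. This is a sketch under assumptions the notes do not state: labels are generated as $y = f(x) + \varepsilon$, the noise $\varepsilon$ has zero mean and variance $\sigma^2$ and is independent of the training set, and the expectation $E$ is taken over training sets and noise.

$$E\left[\left(y - \hat{f}(x)\right)^2\right] = \underbrace{\left(E[\hat{f}(x)] - f(x)\right)^2}_{bias^2} + \underbrace{E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right]}_{variance} + \underbrace{\sigma^2}_{irreducible\ error}$$

The first term measures how far the average fitted model is from $f$, the second how much the fitted model fluctuates from one training set to another, and the third does not depend on $\hat{f}$ at all.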

## Bias

The bias term tells you how much, on average, $\hat{f}$ and $f$ differ. A model with high bias is not flexible enough to approximate the true function $f$. High bias models lead to underfitting.

## Variance

The variance term tells you how inconsistent $\hat{f}$ is over different training sets. A model with high variance follows the training data points so closely that it misses the true function $f$. High variance models lead to overfitting.

## Model Complexity
* Model Complexity: sets the flexibility of $\hat{f}$.
* Example: Maximum tree depth, Minimum samples per leaf, ...

## Complexity, bias and variance

As the complexity of $\hat{f}$ increases, the bias term decreases while the variance term increases.
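
This trade-off can be checked numerically by refitting a model on many simulated training sets and looking at its predictions at one point. The sketch below is an illustration under assumptions, not part of the original notebook: the function `true_f`, the evaluation point `x0`, the noise level and the k-nearest-neighbour averaging model are all hypothetical choices. Here a smaller `k` means a more flexible model.

```python
import math
import random

random.seed(0)

def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration
    return math.sin(2 * math.pi * x)

def knn_predict(xs, ys, x0, k):
    # Average the labels of the k training points closest to x0
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def bias2_and_variance(k, x0=0.25, n_points=30, n_datasets=200, noise=0.5):
    # Refit on many independent training sets and collect the predictions at x0
    preds = []
    for _ in range(n_datasets):
        xs = [random.random() for _ in range(n_points)]
        ys = [true_f(x) + random.gauss(0, noise) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    mean_pred = sum(preds) / len(preds)
    bias2 = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
    return bias2, variance

# k = 30 averages every point (least flexible); k = 1 is the most flexible
results = {k: bias2_and_variance(k) for k in (30, 10, 1)}
for k, (b2, var) in results.items():
    print("k={:2d}: bias^2 {:.3f}, variance {:.3f}".format(k, b2, var))
```

With these settings the least flexible model (`k=30`) shows the largest squared bias and the smallest variance, and the most flexible one (`k=1`) shows the opposite.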

# Diagnosing Bias and Variance Problems

## Estimating the Generalization Error
* How do we estimate the generalization error of a model?
* Cannot be done directly because:
    * $f$ is unknown,
    * usually you only have one dataset,
    * noise is unpredictable.

**Solution:**
* split the data into training and test sets
* fit $\hat{f}$ to the training set
* evaluate the error of $\hat{f}$ on the **unseen** test set
* generalization error of $\hat{f} \approx$ test set error of $\hat{f}$

## Better Model Evaluation with Cross-Validation
* Test set should not be touched until we are confident about $\hat{f}$'s performance
* Evaluating $\hat{f}$ on training set: biased estimate, $\hat{f}$ has already seen all training points
* Solution -> Cross-Validation (CV):
    * K-Fold CV
    * Hold-Out CV
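
The mechanics of K-Fold CV can be sketched in plain Python: the data is cut into `k` folds, each fold serves once as the validation set while the model is fit on the other `k-1` folds, and the `k` errors are averaged. The helper names `k_fold_indices`, `cross_val_error`, `fit_mean` and `mse` are hypothetical and exist only for this illustration. The folds here are contiguous and unshuffled, which is an assumption of the sketch.

```python
def k_fold_indices(n_samples, k):
    # Yield (train_indices, validation_indices) for each of the k folds
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

def cross_val_error(fit, error, X, y, k=10):
    # Fit on k-1 folds, evaluate on the held-out fold, then average the k errors
    fold_errors = []
    for train, val in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        fold_errors.append(error(model, [X[i] for i in val], [y[i] for i in val]))
    return sum(fold_errors) / k, fold_errors

def fit_mean(X_train, y_train):
    # Toy "model": the mean of the training labels
    return sum(y_train) / len(y_train)

def mse(model, X_val, y_val):
    # Mean squared error of the constant prediction on the validation fold
    return sum((yi - model) ** 2 for yi in y_val) / len(y_val)

X = list(range(20))
y = [2.0 * x for x in X]
cv_mse, fold_mses = cross_val_error(fit_mean, mse, X, y, k=5)
print("CV MSE: {:.2f}".format(cv_mse))
```

Every sample is used for validation exactly once, so the averaged error is computed only on points the corresponding model has not seen.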

## Diagnose Variance Problems

* If $\hat{f}$ suffers from **high variance**: CV Error of $\hat{f}$ > training set error of $\hat{f}$
* $\hat{f}$ is said to overfit the training set. To remedy overfitting:
    * decrease model complexity
        * ex. decrease max depth, increase min samples per leaf, ...
    * gather more data, ...

## Diagnose Bias Problems
* If $\hat{f}$ suffers from **high bias**: CV error of $\hat{f} \approx$ training set error of $\hat{f}$ >> desired error
* $\hat{f}$ is said to underfit the training set. To remedy underfitting:
    * increase model complexity
        * ex. increase max depth, decrease min samples per leaf, ...
    * gather more relevant features
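
The two diagnostic rules can be combined into a small helper. This is only a sketch: the function name `diagnose`, the `gap_tolerance` threshold and the `desired_error` of 20.0 are hypothetical values picked for illustration, since the notes do not say how large a gap counts as a variance problem. The error values passed in are the ones reported for the two decision trees in the Auto example of this document.

```python
def diagnose(train_error, cv_error, desired_error, gap_tolerance=0.2):
    # gap_tolerance is a hypothetical threshold: the relative gap between CV
    # and training error above which the model is flagged as high variance
    if cv_error > train_error * (1 + gap_tolerance):
        return "high variance: decrease complexity or gather more data"
    # CV error is close to training error, so compare against the target
    if train_error > desired_error:
        return "high bias: increase complexity or gather more relevant features"
    return "no obvious bias or variance problem"

# min_samples_leaf=0.14 tree: CV MSE 20.51, train MSE 15.30
print(diagnose(train_error=15.30, cv_error=20.51, desired_error=20.0))
# min_samples_leaf=0.26 tree: CV MSE 27.41, train MSE 25.32
print(diagnose(train_error=25.32, cv_error=27.41, desired_error=20.0))
```

With these assumed thresholds the first tree is flagged for variance and the second for bias. A different tolerance or target error would change the verdicts.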

## K-Fold CV in sklearn on the Auto Dataset

In [1]:
import pandas as pd
auto = pd.read_csv('auto.zip')
auto

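| | mpg | displ | hp | weight | accel | origin | size |
|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 250.0 | 88 | 3139 | 14.5 | US | 15.0 |
| 1 | 9.0 | 304.0 | 193 | 4732 | 18.5 | US | 20.0 |
| 2 | 36.1 | 91.0 | 60 | 1800 | 16.4 | Asia | 10.0 |
| 3 | 18.5 | 250.0 | 98 | 3525 | 19.0 | US | 15.0 |
| 4 | 34.3 | 97.0 | 78 | 2188 | 15.8 | Europe | 10.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 387 | 18.0 | 250.0 | 88 | 3021 | 16.5 | US | 15.0 |
| 388 | 27.0 | 151.0 | 90 | 2950 | 17.3 | US | 10.0 |
| 389 | 29.5 | 98.0 | 68 | 2135 | 16.6 | Asia | 10.0 |
| 390 | 17.5 | 250.0 | 110 | 3520 | 16.4 | US | 15.0 |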


In [2]:
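# Copy the feature columns so that encoding 'origin' does not write into a slice of auto
X = auto.iloc[:, 1:].copy()
# Encode the categorical 'origin' column as integer codes
X['origin'] = pd.Categorical(X['origin']).codes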
y = auto['mpg']

### First attempt

In [3]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score
# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
# Instantiate decision tree regressor and assign it to 'dt'
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.14, random_state=123)
dt

All scorer objects follow the convention that higher return values are better than lower return values. Thus metrics that measure the distance between the model and the data, like metrics.mean_squared_error, are available as neg_mean_squared_error, which returns the negated value of the metric. This is why the output of cross_val_score is multiplied by -1 in the next cell.

In [4]:
# Evaluate the array of MSEs obtained by 10-fold CV
# Set n_jobs to -1 in order to exploit all CPU cores in computation
MSE_CV = - cross_val_score(dt, X_train, y_train, cv=10,
                           scoring='neg_mean_squared_error',
                           n_jobs=-1)
# Fit 'dt' to the training set
dt.fit(X_train, y_train)
# Predict the labels of training set
y_predict_train = dt.predict(X_train)
# Predict the labels of test set
y_predict_test = dt.predict(X_test)

In [5]:
# CV MSE
print('CV MSE: {:.2f}'.format(MSE_CV.mean()))
# Training set MSE
print('Train MSE: {:.2f}'.format(MSE(y_train, y_predict_train)))
# Test set MSE
print('Test MSE: {:.2f}'.format(MSE(y_test, y_predict_test)))

CV MSE: 20.51
Train MSE: 15.30
Test MSE: 20.92


### Reducing variance

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=123)
# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=123)

In [7]:
# Compute the array containing the 10-fold CV MSEs
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10,
                                  scoring='neg_mean_squared_error',
                                  n_jobs=-1)
# Compute the mean 10-fold CV MSE
MSE_CV = MSE_CV_scores.mean()
# Print MSE_CV
print('CV MSE: {:.2f}'.format(MSE_CV))

CV MSE: 27.41


### Evaluating training error

In [8]:
# Import mean_squared_error from sklearn.metrics
from sklearn.metrics import mean_squared_error
# Fit dt to the training set
dt.fit(X_train, y_train)
# Predict the labels of the training set
y_pred_train = dt.predict(X_train)
# Evaluate the training set MSE of dt
MSE_train = (mean_squared_error(y_train, y_pred_train))
# Print MSE_train
print('Train MSE: {:.2f}'.format(MSE_train))

Train MSE: 25.32


# Ensemble learning

## Advantages of CARTs
* Simple to understand.
* Simple to interpret.
* Easy to use.
* Flexibility: ability to describe non-linear dependencies.
* Preprocessing: no need to standardize or normalize features, ...

## Limitations of CARTs
* Classification: can only produce orthogonal decision boundaries.
* Sensitive to small variations in the training set. Sometimes, when a single point is removed from the training set, a CART's learned parameters may change drastically.
* High variance: unconstrained CARTs may overfit the training set.
* A solution that takes advantage of the flexibility of CARTs while reducing their tendency to memorize noise (overfitting) is **ensemble learning**.

## Ensemble Learning
* Train different models on the same dataset.
* Let each model make its predictions.
* Meta-model: aggregates predictions of individual models.
* Final prediction: more robust and less prone to errors.
* Best results: models are skillful in different ways, meaning that if some models make predictions that are way off, the other models should compensate for these errors. In such a case, the meta-model's predictions are more robust.

## Ensemble Learning in Practice: Voting Classifier
* Binary classification task
* $N$ classifiers make predictions: $P_{1}, P_{2},\ ...,\ P_{N}$ with $P_{i}=0$ or 1
* Meta-model prediction: **hard voting**
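
Hard voting simply returns the label predicted by the majority of the $N$ classifiers. The sketch below shows the idea in plain Python. The function name `hard_vote`, the prediction matrix `P` and the tie-breaking rule (the smaller label wins) are assumptions of this illustration.

```python
from collections import Counter

def hard_vote(predictions):
    # predictions: one label per classifier for a single sample
    counts = Counter(predictions)
    top = max(counts.values())
    # Tie-break on the smaller label (an assumption of this sketch)
    return min(label for label, c in counts.items() if c == top)

# Three hypothetical classifiers, four samples: each row is one classifier
P = [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 1, 1, 0]]

# zip(*P) regroups the predictions sample by sample
ensemble = [hard_vote(column) for column in zip(*P)]
print(ensemble)  # → [1, 1, 1, 0]
```

Each of the first three samples gets label 1 from two of the three classifiers, so the ensemble predicts 1 even though a different classifier disagrees each time.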

## Voting Classifier in sklearn (Breast-Cancer dataset)

In [9]:
import pandas as pd
wbc = pd.read_csv('wbc.zip')
X = wbc.iloc[:, 2:-1]
y = pd.Categorical(wbc['diagnosis']).codes
X.shape

(569, 30)

In [10]:
# Import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Import models, including VotingClassifier meta-model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import StandardScaler
# Set seed for reproducibility
SEED = 1

In [11]:
X = StandardScaler().fit_transform(X)

In [12]:
# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state= SEED)
# Instantiate individual classifiers
lr = LogisticRegression(random_state=SEED)
knn = KNN()
dt = DecisionTreeClassifier(random_state=SEED)
# Define a list called classifiers that contains the tuples (classifier_name, classifier)
classifiers = [('Logistic Regression', lr),
               ('K Nearest Neighbours', knn),
               ('Classification Tree', dt)]

In [13]:
# Iterate over the defined list of tuples containing the classifiers
for clf_name, clf in classifiers:
    # Fit clf to the training set
    clf.fit(X_train, y_train)
    # Predict the labels of the test set
    y_pred = clf.predict(X_test)
    # Evaluate the accuracy of clf on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy_score(y_test, y_pred)))

Logistic Regression : 0.971
K Nearest Neighbours : 0.953
Classification Tree : 0.930


In [14]:
# Instantiate a VotingClassifier 'vc'
vc = VotingClassifier(estimators=classifiers)
# Fit 'vc' to the training set and predict test set labels
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)
# Evaluate the test-set accuracy of 'vc'
print('Voting Classifier: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Voting Classifier: 0.971


## Voting Classifier in sklearn (Indian Liver Patient Dataset)

In [15]:
indian = pd.read_csv('indian_liver_patient_preprocessed.zip')

In [16]:
X = indian.iloc[:, 1:-1]
y = indian['Liver_disease']

In [17]:
# Set seed for reproducibility
SEED=1
# Instantiate lr
lr = LogisticRegression(random_state=SEED)
# Instantiate knn
knn = KNN(n_neighbors=27)
# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)
# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

In [18]:
# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state= SEED)

In [19]:
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:
    # Fit clf to the training set
    clf.fit(X_train, y_train)
    # Predict the labels of the test set
    y_pred = clf.predict(X_test)
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    # Print clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.759
K Nearest Neighbours : 0.701
Classification Tree : 0.730


In [20]:
# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)
# Fit vc to the training set
vc.fit(X_train, y_train)
# Predict the test set labels
y_pred = vc.predict(X_test)
# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Voting Classifier: 0.770
