In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Generalization Error

### Goals of Supervised Learning
- Find a model f_hat that best approximates f
- f_hat can be logistic regression, decision tree, neural network, ...
- Discard noise as much as possible
- End goal: f_hat should achieve a low predictive error on unseen datasets

### Difficulties in approximating f
- Overfitting: f_hat(x) fits the training set noise.
- Underfitting: f_hat is not flexible enough to approximate f.

### Generalization Error
- Does f_hat generalize well on unseen data?
- It can be decomposed as follows:
 - Generalization error of f_hat = bias_squared + variance + irreducible error
- Bias: error term that tells you, on average, how much f_hat is different from f
- Variance: tells you how much f_hat is inconsistent over different training sets
- Model complexity: sets the flexibility of f_hat
 - Example: maximum tree depth, minimum samples per leaf, ...
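
The decomposition above can be checked empirically when f is known. The following is a minimal sketch under stated assumptions: the "true" function `f`, the noise level, and the polynomial degrees are all made up for illustration, and polynomial fits stand in for f_hat at low and high complexity.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" function, chosen only for illustration
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)   # fixed inputs; only the noise changes per dataset
x_eval = np.linspace(0, 1, 50)    # points where f_hat is compared with f
noise_sd = 0.3                    # assumed noise level
n_datasets = 200                  # number of simulated training sets

results = {}
for degree in (1, 8):             # low vs. high model complexity
    preds = np.empty((n_datasets, x_eval.size))
    for i in range(n_datasets):
        y_noisy = f(x_train) + rng.normal(0, noise_sd, x_train.size)
        coefs = np.polyfit(x_train, y_noisy, degree)
        preds[i] = np.polyval(coefs, x_eval)

    # Bias^2: how far the average f_hat is from f
    bias_sq = np.mean((preds.mean(axis=0) - f(x_eval)) ** 2)
    # Variance: how much f_hat changes across training sets
    variance = np.mean(preds.var(axis=0))
    results[degree] = (bias_sq, variance)
    print('degree={}: bias^2={:.4f}, variance={:.4f}'.format(degree, bias_sq, variance))
```

In this setup the degree-1 fit should show the larger bias and the degree-8 fit the larger variance, which is the trade-off that model complexity controls.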

## Diagnosing Bias and Variance Problems

### Estimating the Generalization Error
- How do we estimate the generalization error of a model?
- Cannot be done directly because:
 - f is unknown
 - usually you only have one dataset
 - noise is unpredictable
- Solution:
 - split the data to training and test sets
 - fit f_hat to the training set
 - evaluate the error of f_hat on the unseen test set
 - generalization error of f_hat is similar to test set error of f_hat

### Better Model Evaluation with Cross-Validation
- Test set should not be touched until we are confident about f_hat's performance
- Evaluating f_hat on training set: biased estimate, f_hat has already seen all training points.
- Solution --> Cross-Validation (CV):
 - K-fold CV
 - Hold-out CV
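
To make the K-fold idea concrete, here is a small sketch that runs the folds by hand instead of calling `cross_val_score`. The data are synthetic and the names `X_demo`, `y_demo`, and `times_validated` are assumptions introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))                 # synthetic feature
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, 100)    # synthetic target

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mse = []
times_validated = np.zeros(len(X_demo), dtype=int)

for train_idx, val_idx in kf.split(X_demo):
    # Fit on K-1 folds, evaluate on the held-out fold
    model = DecisionTreeRegressor(max_depth=3, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    y_val_pred = model.predict(X_demo[val_idx])
    fold_mse.append(mean_squared_error(y_demo[val_idx], y_val_pred))
    times_validated[val_idx] += 1

cv_mse = np.mean(fold_mse)
print('Per-fold MSE:', np.round(fold_mse, 3))
print('CV MSE: {:.3f}'.format(cv_mse))
```

Every sample lands in the validation fold exactly once, so the CV error averages over the whole training set without touching the test set.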

### Diagnose Variance Problems
- If f_hat suffers from high variance: CV error of f_hat > training set error of f_hat
- f_hat is said to overfit the training set. To remedy overfitting:
 - decrease model complexity
 - for ex: decrease max depth, increase min samples per leaf,
 - gather more data, ...

### Diagnose Bias Problems
- If f_hat suffers from high bias: CV error of f_hat ≈ training set error of f_hat >> desired error.
- f_hat is said to underfit the training set. To remedy underfitting:
 - increase model complexity
 - for ex: increase max depth, decrease min samples per leaf,
 - gather more relevant features, ...
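
The two diagnostic rules can be written as a small helper. This is a rough sketch, not a standard API: the function name `diagnose`, the `tolerance` threshold, and the example numbers are all hypothetical.

```python
def diagnose(train_error, cv_error, desired_error, tolerance=0.05):
    """Return a rough diagnosis from training, CV, and desired errors.

    `tolerance` is a hypothetical relative gap above which the CV error
    is considered clearly larger than the training error.
    Assumes cv_error > 0.
    """
    relative_gap = (cv_error - train_error) / cv_error
    if relative_gap > tolerance:
        # CV error clearly exceeds training error: overfitting
        return 'high variance'
    if cv_error > desired_error:
        # CV and training errors are similar but both too high: underfitting
        return 'high bias'
    return 'ok'

# Made-up error values, for illustration only
print(diagnose(train_error=15.3, cv_error=20.5, desired_error=18.0))  # high variance
print(diagnose(train_error=5.15, cv_error=5.14, desired_error=5.1))   # high bias
print(diagnose(train_error=4.0, cv_error=4.1, desired_error=5.0))     # ok
```

The helper checks for a variance problem first, so a model that both overfits and misses the desired error is reported as high variance. In practice the choice of tolerance and desired error depends on the task.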

### K-Fold CV in sklearn on the Auto Dataset

In [2]:
auto = pd.read_csv('auto.csv')
auto['origin'] = auto['origin'].astype('category')
dummies = pd.get_dummies(auto['origin'], prefix='origin')
auto = pd.concat([auto, dummies], axis=1)

X = auto.drop(['mpg', 'origin'], axis=1)
y = auto['mpg']

In [3]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score


# Set seed for reproducibility
SEED = 123

# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Instantiate decision tree regressor and assign it to 'dt'
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.14, random_state=SEED)

# Compute the array of MSEs obtained by 10-fold CV
# Set n_jobs to -1 in order to exploit all CPU cores in computation
MSE_CV = - cross_val_score(dt, X_train, y_train, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit 'dt' to the training set
dt.fit(X_train, y_train)

# Predict the labels of training set
y_predict_train = dt.predict(X_train)

# Predict the labels of test set
y_predict_test = dt.predict(X_test)

In [4]:
# CV MSE
MSE_CV.mean()

20.505691068058148

In [5]:
# Training set MSE
MSE(y_train, y_predict_train)

15.299344592866507

In [6]:
# Test set MSE
MSE(y_test, y_predict_test)

20.923283625005098

- Given that the training set error is smaller than the CV-error, we can deduce that dt overfits the training set and that it suffers from high variance.
- Notice how the CV and test set errors are roughly equal.

### Exercise: Instantiate the model
In the following set of exercises, you'll diagnose the bias and variance problems of a regression tree. The regression tree you'll define in this exercise will be used to predict the mpg consumption of cars from the auto dataset using all available features.

We have already processed the data and loaded the features matrix X and the array y in your workspace. In addition, the DecisionTreeRegressor class was imported from sklearn.tree.

In [7]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED)

Great work! In the next exercise, you'll evaluate dt's CV error.

### Exercise: Evaluate the 10-fold CV error
In this exercise, you'll evaluate the 10-fold CV Root Mean Squared Error (RMSE) achieved by the regression tree dt that you instantiated in the previous exercise.

In addition to dt, the training data including X_train and y_train are available in your workspace. We also imported cross_val_score from sklearn.model_selection.

Note that scikit-learn scorers follow a "greater is better" convention, so with scoring='neg_mean_squared_error' cross_val_score returns negative MSEs. Its output should be multiplied by negative one to obtain the MSEs. The CV RMSE can then be obtained by computing the square root of the average MSE.

In [8]:
# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10, 
                                  scoring='neg_mean_squared_error',
                                  n_jobs=-1)

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 5.14


Great work! A very good practice is to keep the test set untouched until you are confident about your model's performance. CV is a great technique to get an estimate of a model's performance without affecting the test set.
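
One detail worth seeing in code: the square root of the average MSE, as used in this exercise, is not the same number as the average of the per-fold RMSEs. The sketch below uses synthetic data from `make_regression`. The names `X_demo`, `y_demo`, and `tree` are assumptions for this illustration, not the exercise's workspace.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=1)
tree = DecisionTreeRegressor(max_depth=4, random_state=1)

# cross_val_score returns negative MSEs, so flip the sign
neg_mse = cross_val_score(tree, X_demo, y_demo, cv=10,
                          scoring='neg_mean_squared_error')
mse_per_fold = -neg_mse

# Convention used in this notebook: square root of the average MSE
rmse_from_mean_mse = np.sqrt(mse_per_fold.mean())

# Alternative convention: average of the per-fold RMSEs
mean_of_fold_rmse = np.sqrt(mse_per_fold).mean()

print('sqrt(mean MSE): {:.3f}'.format(rmse_from_mean_mse))
print('mean(fold RMSE): {:.3f}'.format(mean_of_fold_rmse))
```

The second value can never exceed the first, because the square root is concave. Recent scikit-learn releases also accept scoring='neg_root_mean_squared_error', which corresponds to the per-fold convention, so it helps to state which one a reported CV RMSE uses.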

### Exercise: Evaluate the training error
You'll now evaluate the training set RMSE achieved by the regression tree dt that you instantiated in a previous exercise.

In addition to dt, X_train and y_train are available in your workspace.

Note that in scikit-learn, the MSE of a model can be computed as follows:

MSE_model = mean_squared_error(y_true, y_predicted)
where we use the function mean_squared_error from the metrics module and pass it the true labels y_true as a first argument, and the predicted labels from the model y_predicted as a second argument.
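
For reference, the quantity being computed is the average squared difference between true and predicted labels. A minimal NumPy sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])        # made-up true labels
y_predicted = np.array([2.5, 5.0, 4.0, 8.0])   # made-up predictions

# MSE: mean of squared residuals
mse_manual = np.mean((y_true - y_predicted) ** 2)

# RMSE: square root of the MSE, in the same units as the labels
rmse_manual = mse_manual ** 0.5

print('MSE: {:.3f}'.format(mse_manual))    # MSE: 0.875
print('RMSE: {:.3f}'.format(rmse_manual))  # RMSE: 0.935
```

With default arguments, mean_squared_error(y_true, y_predicted) computes this same unweighted average.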

In [9]:
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))

Train RMSE: 5.15


Awesome! Notice how the training error is roughly equal to the 10-folds CV error you obtained in the previous exercise.

### Exercise: High bias or high variance?
In this exercise you'll diagnose whether the regression tree dt you trained in the previous exercise suffers from a bias or a variance problem.

The training set RMSE (RMSE_train) and the CV RMSE (RMSE_CV) achieved by dt are available in your workspace. In addition, we have also loaded a variable called baseline_RMSE which corresponds to the root mean-squared error achieved by the regression-tree trained with the disp feature only (it is the RMSE achieved by the regression tree trained in chapter 1, lesson 3). Here baseline_RMSE serves as the baseline RMSE above which a model is considered to be underfitting and below which the model is considered 'good enough'.

Does dt suffer from a high bias or a high variance problem?

In [10]:
baseline_RMSE = 5.1
# the root mean-squared error achieved by the regression-tree trained with the disp feature only 
# (it is the RMSE achieved by the regression tree trained in chapter 1, lesson 3)

dt suffers from high bias because RMSE_CV ≈ RMSE_train and both scores are greater than baseline_RMSE.

Correct! dt is indeed underfitting the training set as the model is too constrained to capture the nonlinear dependencies between features and labels.

## Ensemble Learning

### Advantages of CARTs
- simple to understand
- simple to interpret
- easy to use
- flexibility: ability to describe non-linear dependencies
- preprocessing: no need to standardize or normalize features, ...

### Limitations of CARTs
- classification: can only produce orthogonal (right-angled) decision boundaries
- sensitive to small variations in the training set
- high variance: unconstrained CARTs may overfit the training set
- solution: ENSEMBLE!
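
The high-variance limitation is easy to reproduce on synthetic data. In the sketch below (all data and names such as `X_demo` and `errors` are assumptions for illustration), an unconstrained regression tree memorizes the training set, while a depth-limited tree cannot.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))                 # synthetic feature
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.5, 300)    # noisy synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

trees = [('unconstrained', DecisionTreeRegressor(random_state=0)),
         ('max_depth=3', DecisionTreeRegressor(max_depth=3, random_state=0))]

errors = {}
for name, tree in trees:
    tree.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, tree.predict(X_tr))
    test_mse = mean_squared_error(y_te, tree.predict(X_te))
    errors[name] = (train_mse, test_mse)
    print('{:s}: train MSE={:.3f}, test MSE={:.3f}'.format(name, train_mse, test_mse))
```

The unconstrained tree reaches a training MSE of zero because it grows until every leaf holds a single sample, so it fits the noise. Its test MSE stays well above zero, which is the gap that signals overfitting.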

### Ensemble Learning
- Train different models on the same dataset.
- Let each model make its predictions.
- Meta-model: aggregates predictions of individual models.
- Final prediction: more robust and less prone to errors.
- Best results: models are skillful in different ways.

### Ensemble Learning in Practice: Voting Classifier
- Binary classification task
- N classifiers make predictions: P1, P2, ..., PN with Pi = 0 or 1.
- Meta-model prediction: hard voting
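
Hard voting amounts to taking the majority label across classifiers for each sample. A minimal NumPy sketch, using a made-up prediction matrix for N = 3 classifiers:

```python
import numpy as np

# Rows = classifiers, columns = samples; entries are predicted labels (0 or 1).
# These predictions are made up for illustration.
predictions = np.array([
    [1, 0, 1, 1, 0],   # P1
    [1, 1, 0, 1, 0],   # P2
    [0, 0, 1, 1, 1],   # P3
])

n_classifiers = predictions.shape[0]
votes_for_one = predictions.sum(axis=0)

# Predict 1 when more than half of the classifiers vote 1
hard_vote = (votes_for_one > n_classifiers / 2).astype(int)
print(hard_vote)  # [1 0 1 1 0]
```

Using an odd number of classifiers avoids ties in the binary case.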

### Voting Classifier in sklearn (Breast Cancer dataset)

In [11]:
wbc = pd.read_csv('wbc.csv').drop('Unnamed: 32', axis=1)
y = wbc['diagnosis']
X = wbc.drop(['id', 'diagnosis'], axis=1)

In [12]:
# Import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Import models, including VotingClassifier meta-model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier

# Set seed for reproducibility
SEED = 1

In [13]:
# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Instantiate individual classifiers
lr = LogisticRegression(random_state=SEED, max_iter=10000)
knn = KNN()
dt = DecisionTreeClassifier(random_state=SEED)

# Define a list called classifiers that contains the tuples (classifier_name, classifier)
classifiers = [('Logistic Regression', lr), 
               ('K Nearest Neighbors', knn), 
               ('Classification Tree', dt)]

# Iterate over the defined list of tuples containing the classifiers
for clf_name, clf in classifiers:

    # Fit clf to the training set
    clf.fit(X_train, y_train)

    # Predict the labels of the test set
    y_pred = clf.predict(X_test)

    # Evaluate the accuracy of clf on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy_score(y_test, y_pred)))

Logistic Regression : 0.947
K Nearest Neighbors : 0.930
Classification Tree : 0.930


In [14]:
# Instantiate a VotingClassifier 'vc'
vc = VotingClassifier(estimators=classifiers)

# Fit 'vc' to the training set and predict test set labels
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)

# Evaluate the test-set accuracy of 'vc'
print('Voting Classifier: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Voting Classifier: 0.953


The voting classifier's accuracy is higher than that achieved by any of the individual models in the ensemble.

### Exercise: Define the ensemble
In the following set of exercises, you'll work with the Indian Liver Patient Dataset from the UCI Machine learning repository.

In this exercise, you'll instantiate three classifiers to predict whether a patient suffers from a liver disease using all the features present in the dataset.

The classes LogisticRegression, DecisionTreeClassifier, and KNeighborsClassifier under the alias KNN are available in your workspace.

In [15]:
# Set seed for reproducibility
SEED=1

# Instantiate lr
lr = LogisticRegression(random_state=SEED, max_iter=10000)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

Great! In the next exercise, you will train these classifiers and evaluate their test set accuracy.

### Exercise: Evaluate individual classifiers
In this exercise you'll evaluate the performance of the models in the list classifiers that we defined in the previous exercise. You'll do so by fitting each classifier on the training set and evaluating its test set accuracy.

The dataset is already loaded and preprocessed for you (numerical features are standardized) and it is split into 70% train and 30% test. The features matrices X_train and X_test, as well as the arrays of labels y_train and y_test are available in your workspace. In addition, we have loaded the list classifiers from the previous exercise, as well as the function accuracy_score() from sklearn.metrics.

In [16]:
liver = pd.read_csv('indian_liver_patient_preprocessed.csv')
X = liver.drop(['Liver_disease', 'Unnamed: 0'] , axis=1)
y = liver['Liver_disease']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

In [17]:
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:    
 
    # Fit clf to the training set
    clf.fit(X_train, y_train)    
   
    # Predict y_pred
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred) 
   
    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.759
K Nearest Neighbours : 0.701
Classification Tree : 0.730


Great work! Notice how Logistic Regression achieved the highest accuracy of 75.9%.

### Exercise: Better performance with a Voting Classifier
Finally, you'll evaluate the performance of a voting classifier that takes the outputs of the models defined in the list classifiers and assigns labels by majority voting.

X_train, X_test, y_train, y_test, the list classifiers defined in a previous exercise, as well as the function accuracy_score from sklearn.metrics are available in your workspace.

In [18]:
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Voting Classifier: 0.770


Great work! Notice how the voting classifier achieves a test set accuracy of 77.0%. This value is greater than that achieved by LogisticRegression.