The bias-variance tradeoff is one of the fundamental concepts in supervised machine learning. In this chapter, you'll learn how to diagnose overfitting and underfitting. You'll also be introduced to ensembling, where the predictions of several models are aggregated to produce more robust predictions.

In [1]:
# Importing course packages; you can add more too!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# Importing course datasets as DataFrames
auto = pd.read_csv('../datasets/auto.csv')
bikes = pd.read_csv('../datasets/bikes.csv')
liver_disease = pd.read_csv('../datasets/indian_liver_patient_preprocessed.csv', index_col=0)
wbc = pd.read_csv('../datasets/wbc.csv') # Wisconsin Breast Cancer Dataset

# Diagnose bias and variance problems

### Instantiate the model

In the following set of exercises, you'll diagnose the bias and variance problems of a regression tree. The regression tree you'll define in this exercise will be used to predict the fuel efficiency (mpg) of cars in the auto dataset using all available features.

We have already processed the data and loaded the features matrix ```X``` and the array ```y``` in your workspace. In addition, the ```DecisionTreeRegressor``` class was imported from ```sklearn.tree```.

In [6]:
# prep data for training.

# imports
from sklearn.tree import DecisionTreeRegressor
import pandas as pd

# Prepare the data by creating dummy variables.
# One-hot encode the categorical columns with drop_first=True; store the result in auto_origin
auto_origin = pd.get_dummies(auto, drop_first=True)

# Create arrays for features and target variable
y = auto_origin['mpg'] # target
X = auto_origin.drop('mpg', axis='columns') # features


In [9]:
# Import train_test_split and cross_val_score from sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=SEED)

# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED)

### Evaluate the 10-fold CV error

In this exercise, you'll evaluate the 10-fold CV Root Mean Squared Error (RMSE) achieved by the regression tree dt that you instantiated in the previous exercise.

In addition to ```dt```, the training data including ```X_train``` and ```y_train``` are available in your workspace. We also imported ```cross_val_score``` from ```sklearn.model_selection```.

Note that because ```cross_val_score``` follows scikit-learn's convention that higher scores are better, the ```'neg_mean_squared_error'``` scorer returns negative MSEs, so its output must be multiplied by negative one to obtain the MSEs. The CV RMSE can then be obtained by computing the square root of the average MSE.
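
The sign flip and the square root can be checked on a tiny example. The sketch below uses made-up scores (not results from the auto dataset) shaped like the array that ```cross_val_score``` returns, and it also shows that the square root of the average MSE is generally not the same number as the average of the per-fold RMSEs.

```python
import math

# Hypothetical scores, shaped like the output of
# cross_val_score(..., scoring='neg_mean_squared_error'); not real results.
neg_mse_scores = [-16.0, -25.0, -36.0]

# Flip the sign to recover the per-fold MSEs
mse_scores = [-s for s in neg_mse_scores]

# CV RMSE as defined above: square root of the average MSE
rmse_cv = math.sqrt(sum(mse_scores) / len(mse_scores))

# A different quantity: the average of the per-fold RMSEs
mean_fold_rmse = sum(math.sqrt(m) for m in mse_scores) / len(mse_scores)

print('CV RMSE: {:.2f}'.format(rmse_cv))
print('Mean of per-fold RMSEs: {:.2f}'.format(mean_fold_rmse))
```

The exercise uses the first definition, so that is the one to compare against the training RMSE later on.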

In [10]:
# Compute the array containing the 10-fold CV MSEs
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10, 
                                  scoring='neg_mean_squared_error', 
                                  n_jobs=-1) 

# Compute the 10-fold CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 5.14


### Evaluate the training error

You'll now evaluate the training set RMSE achieved by the regression tree dt that you instantiated in a previous exercise.

In addition to ```dt```, ```X_train``` and ```y_train``` are available in your workspace.

Note that in scikit-learn, the MSE of a model can be computed as follows:

```MSE_model = mean_squared_error(y_true, y_predicted)```

where we use the function ```mean_squared_error``` from the metrics module and pass it the true labels ```y_true``` as a first argument, and the predicted labels from the model ```y_predicted``` as a second argument.

[sklearn.metrics.mean_squared_error Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)
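
To make the arithmetic behind the metric concrete, here is a minimal hand-rolled version. The function name ```mean_squared_error_by_hand``` and the sample values are assumptions made for illustration; this is not scikit-learn's implementation, only the same average-of-squared-differences calculation.

```python
import math

def mean_squared_error_by_hand(y_true, y_predicted):
    """Average of squared differences; a hand-rolled stand-in, not sklearn's code."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_predicted)]
    return sum(squared_errors) / len(squared_errors)

# Made-up values for illustration only
y_true = [3.0, 5.0, 8.0, 10.0]
y_predicted = [2.0, 7.0, 8.0, 13.0]

mse = mean_squared_error_by_hand(y_true, y_predicted)  # (1 + 4 + 0 + 9) / 4
rmse = math.sqrt(mse)  # same units as the target
print('MSE: {:.2f}, RMSE: {:.2f}'.format(mse, rmse))
```

Taking the square root brings the error back to the units of the target, which is why the exercises report RMSE rather than MSE.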

In [11]:
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))

Train RMSE: 5.15
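
One common way to read the training and CV errors together is sketched below. The function ```diagnose```, the ```target_rmse``` argument, and the ```tolerance``` threshold are hypothetical names and values chosen for illustration; what counts as an acceptable error depends on the problem, so treat this as a rule of thumb rather than a definitive test.

```python
def diagnose(train_rmse, cv_rmse, target_rmse, tolerance=0.10):
    """Rough rule of thumb; the thresholds are illustrative assumptions."""
    # CV error well above training error: the model does not generalize
    if cv_rmse > train_rmse * (1 + tolerance):
        return 'high variance (overfitting)'
    # Errors are close to each other but both above the target
    if cv_rmse > target_rmse and train_rmse > target_rmse:
        return 'high bias (underfitting)'
    return 'no obvious bias or variance problem'

# RMSE values printed above; target_rmse=4.0 is a made-up goal
print(diagnose(train_rmse=5.15, cv_rmse=5.14, target_rmse=4.0))
```

Because the two RMSEs here are almost equal, the outcome hinges entirely on the chosen target: if the desired error is lower than roughly 5.1, the tree looks too simple for the task.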


# Ensemble Learning

### Define the ensemble

In the following set of exercises, you'll work with the Indian Liver Patient Dataset from the UCI Machine Learning Repository.

In this exercise, you'll instantiate three classifiers to predict whether a patient suffers from a liver disease using all the features present in the dataset.

The classes ```LogisticRegression```, ```DecisionTreeClassifier```, and ```KNeighborsClassifier``` under the alias ```KNN``` are available in your workspace.

In [32]:
# get imports needed for the module
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import train_test_split

y = liver_disease['Liver_disease'] # target
X = liver_disease.drop('Liver_disease', axis='columns') # features


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

In [33]:
# Set seed for reproducibility
SEED=1

# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

### Evaluate individual classifiers

In this exercise you'll evaluate the performance of the models in the list ```classifiers``` that we defined in the previous exercise. You'll do so by fitting each classifier on the training set and evaluating its test set accuracy.

The dataset is already loaded and preprocessed for you (numerical features are standardized) and it is split into 70% train and 30% test. The features matrices ```X_train``` and ```X_test```, as well as the arrays of labels ```y_train``` and ```y_test``` are available in your workspace. In addition, we have loaded the list ```classifiers``` from the previous exercise, as well as the function ```accuracy_score()``` from ```sklearn.metrics```.

[sklearn.metrics.accuracy_score Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

In [34]:
from sklearn.metrics import accuracy_score

In [35]:
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:    
 
    # Fit clf to the training set
    clf.fit(X_train, y_train)    
   
    # Predict y_pred
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred) 
   
    # Print clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.759
K Nearest Neighbours : 0.701
Classification Tree : 0.730


### Better performance with a Voting Classifier

Finally, you'll evaluate the performance of a voting classifier that takes the outputs of the models defined in the list ```classifiers``` and assigns labels by majority voting.

```X_train```, ```X_test```, ```y_train```, ```y_test```, the list ```classifiers``` defined in a previous exercise, as well as the function ```accuracy_score``` from ```sklearn.metrics``` are available in your workspace.

[sklearn.ensemble.VotingClassifier Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html)
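
Majority (hard) voting itself is simple enough to write by hand. The sketch below is an illustration only: ```majority_vote``` is a hypothetical helper, the predictions are made up, and this is not how ```VotingClassifier``` is implemented internally. It only shows the idea of picking, for each sample, the label that most models agree on.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Return, for each sample, the label predicted by the most models."""
    voted = []
    # zip(*...) regroups the predictions sample by sample
    for sample_votes in zip(*predictions_per_model):
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted

# Made-up predictions from three hypothetical classifiers on five samples
lr_pred = [1, 0, 1, 1, 0]
knn_pred = [1, 1, 0, 1, 0]
dt_pred = [0, 0, 1, 1, 1]

print(majority_vote([lr_pred, knn_pred, dt_pred]))
```

With three models and two classes a tie cannot occur, so tie-breaking is left out of this sketch.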

In [36]:
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Voting Classifier: 0.770
