## The Bias-Variance Tradeoff
The bias-variance tradeoff is one of the fundamental concepts in supervised machine learning. In this chapter, you'll learn how to diagnose overfitting and underfitting. You'll also be introduced to ensembling, where the predictions of several models are aggregated to produce more robust predictions.

### Instantiate the model
In the following set of exercises, you'll diagnose the bias and variance problems of a regression tree. The regression tree you'll define in this exercise will be used to predict the mpg consumption of cars from the auto dataset using all available features.

We have already processed the data and loaded the features matrix X and the array y in your workspace. In addition, the DecisionTreeRegressor class was imported from sklearn.tree.

In [5]:
# Diagnose bias and variance problems
# Import pandas to read csv
import pandas as pd
# Import train_test_split and cross_val_score
from sklearn.model_selection import train_test_split, cross_val_score
# Import DecisionTreeRegressor from sklearn.tree
from sklearn.tree import DecisionTreeRegressor
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Load data
data = pd.read_csv('Auto-mpg.csv')
# separate the target from the features
y = data['mpg']
X = data.iloc[:,1:]
# split the data
X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                                    test_size = 0.3, 
                                                    random_state = 123)
# Instantiate the decision tree regressor
dt = DecisionTreeRegressor(max_depth = 4,
                           min_samples_leaf = 0.14,
                           random_state = 123)

### Evaluate the 10-fold CV error
In this exercise, you'll evaluate the 10-fold CV Root Mean Squared Error (RMSE) achieved by the regression tree dt that you instantiated in the previous exercise.

In addition to dt, the training data including X_train and y_train are available in your workspace. We also imported cross_val_score from sklearn.model_selection.

Note that for the MSE, cross_val_score only offers the negated scorer 'neg_mean_squared_error', because scikit-learn scorers follow a higher-is-better convention. Its output should therefore be multiplied by negative one to obtain the MSEs. The CV RMSE can then be obtained by computing the square root of the average MSE.
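As a minimal sketch of this arithmetic, the snippet below uses made-up fold scores (hypothetical values, not results from the auto dataset) to show the sign flip followed by the square root of the average MSE.

```python
import numpy as np

# Hypothetical negative-MSE scores, one per fold, in the form that
# cross_val_score returns for scoring='neg_mean_squared_error'.
# These values are invented for illustration only.
neg_mse_scores = np.array([-16.0, -25.0, -19.0])

# Flip the sign to recover the per-fold MSEs
mse_scores = -neg_mse_scores

# CV RMSE: square root of the average MSE across folds
rmse_cv = np.sqrt(mse_scores.mean())

print('CV RMSE: {:.2f}'.format(rmse_cv))
```

The average of the three invented MSEs is 20, so the printed value is the square root of 20 rounded to two decimals.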

In [4]:
# Evaluate the array of MSEs obtained by 10-fold CV
# Set n_jobs = -1 to use all CPU cores in the computation
MSE_CV = - cross_val_score(dt,X_train, y_train, cv = 10,
                          scoring = 'neg_mean_squared_error',
                          n_jobs = -1)

# Compute the 10-fold CV RMSE
RMSE_CV = (MSE_CV.mean())**(1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 4.53


### Evaluate the training error
You'll now evaluate the training set RMSE achieved by the regression tree dt that you instantiated in a previous exercise.

In addition to dt, X_train and y_train are available in your workspace.

Note that in scikit-learn, the MSE of a model can be computed as follows:

MSE_model = mean_squared_error(y_true, y_predicted)
where we use the function mean_squared_error from the metrics module and pass it the true labels y_true as a first argument, and the predicted labels from the model y_predicted as a second argument.
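
To make the call concrete, here is a small sketch with made-up labels (hypothetical values, not taken from the auto dataset) that compares mean_squared_error with the same quantity computed by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented true and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_predicted = np.array([2.0, 5.0, 4.0, 8.0])

# MSE via scikit-learn: true labels first, predictions second
mse_model = mean_squared_error(y_true, y_predicted)

# The same quantity by hand: mean of the squared residuals
mse_by_hand = np.mean((y_true - y_predicted) ** 2)

print('MSE (sklearn): {:.2f}'.format(mse_model))
print('MSE (by hand): {:.2f}'.format(mse_by_hand))
```

The squared residuals here are 1, 0, 4 and 1, so both lines report 1.50.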

In [7]:
# fit dt to the training set
dt.fit(X_train, y_train)
# predict labels of training set
y_predict_train = dt.predict(X_train)
# predict labels of test set
y_predict_test = dt.predict(X_test)
# print the CV, training, and test MSEs
print('CV MSE: {:.2f}'.format(MSE_CV.mean()))
print('train MSE: {:.2f}'.format(MSE(y_train, y_predict_train)))
print('test MSE: {:.2f}'.format(MSE(y_test, y_predict_test)))
# Since the training MSE is smaller than the CV MSE, we can deduce that
# dt overfits the training set and suffers from high variance

CV MSE: 20.51
train MSE: 15.30
test MSE: 20.92
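
The comparison above follows a common rule of thumb: a CV error well above the training error points to high variance, while training and CV errors that are both large and close to each other point to high bias. The sketch below illustrates the two regimes on synthetic data. The dataset from make_regression, the helper train_and_cv_rmse, and the two tree settings are assumptions chosen for illustration, not part of the original exercise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the auto dataset (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0,
                       random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)


def train_and_cv_rmse(model, X, y, cv=10):
    """Hypothetical helper: return (training RMSE, CV RMSE) for a model."""
    neg_mse = cross_val_score(model, X, y, cv=cv,
                              scoring='neg_mean_squared_error')
    cv_rmse = np.sqrt(-neg_mse.mean())
    model.fit(X, y)
    train_rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
    return train_rmse, cv_rmse


# Unconstrained tree: free to memorize the training set
deep_train_rmse, deep_cv_rmse = train_and_cv_rmse(
    DecisionTreeRegressor(random_state=123), X_train, y_train)

# Depth-1 tree (a stump): too simple to capture the signal
stump_train_rmse, stump_cv_rmse = train_and_cv_rmse(
    DecisionTreeRegressor(max_depth=1, random_state=123), X_train, y_train)

print('deep  tree  train RMSE: {:.2f}  CV RMSE: {:.2f}'.format(
    deep_train_rmse, deep_cv_rmse))
print('stump tree  train RMSE: {:.2f}  CV RMSE: {:.2f}'.format(
    stump_train_rmse, stump_cv_rmse))
```

The unconstrained tree should show a large gap between training and CV error, whereas the stump should show a high error on both. The exact numbers depend on the synthetic data and are not comparable to the results above.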


### Define the ensemble
The original version of the following exercises works with the Indian Liver Patient Dataset from the UCI Machine Learning Repository. The code below instead loads a Wisconsin Breast Cancer CSV file and uses its diagnosis column as the target.

In this exercise, you'll instantiate three classifiers to predict the diagnosis from the feature columns selected in the code.

The classes LogisticRegression, DecisionTreeClassifier, and KNeighborsClassifier under the alias KNN are available in your workspace.

In [None]:
# Ensemble Learning
# Voting classifier

# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


SEED = 1

# Load data
data = pd.read_csv('Wisconsin Breast Cancer.csv')
# separate the target from the features
y = data['diagnosis']
X = data.iloc[:,3:]

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.3,
                                                    stratify = y,
                                                    random_state = SEED)

# individual classifiers

lr = LogisticRegression(random_state = SEED)
knn = KNN()
dt = DecisionTreeClassifier(random_state = SEED)

classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), 
               ('Classification Tree', dt)]

### Evaluate individual classifiers
In this exercise you'll evaluate the performance of the models in the list classifiers that we defined in the previous exercise. You'll do so by fitting each classifier on the training set and evaluating its test set accuracy.

The dataset is already loaded and preprocessed for you (numerical features are standardized) and it is split into 70% train and 30% test. The features matrices X_train and X_test, as well as the arrays of labels y_train and y_test are available in your workspace. In addition, we have loaded the list classifiers from the previous exercise, as well as the function accuracy_score() from sklearn.metrics.

In [None]:
# iterate over the classifiers and calculate the accuracy score
for clf_name, clf in classifiers:
    # fit
    clf.fit(X_train, y_train)
    # predict
    y_pred = clf.predict(X_test)
    # Accuracy
    print('{:s} : {:.3f}'.format(clf_name, accuracy_score(y_test, y_pred)))

### Better performance with a Voting Classifier
Finally, you'll evaluate the performance of a voting classifier that takes the outputs of the models defined in the list classifiers and assigns labels by majority voting.
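
As a rough sketch of what majority voting means, the snippet below combines invented predictions from three classifiers by hand. The prediction values are hypothetical. This mirrors the idea behind VotingClassifier's default voting='hard' for a binary problem, not its actual implementation.

```python
import numpy as np

# Invented binary predictions for four samples, one row per classifier
preds = np.array([
    [1, 0, 1, 1],  # classifier A
    [1, 1, 0, 1],  # classifier B
    [0, 0, 1, 1],  # classifier C
])

# Count the votes for class 1 on each sample
votes_for_1 = preds.sum(axis=0)

# A sample is labeled 1 when more than half of the classifiers vote for it
majority = (votes_for_1 > preds.shape[0] / 2).astype(int)

print(majority)  # [1 0 1 1]
```

With an odd number of classifiers and two classes, no sample can end in a tie.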

X_train, X_test, y_train, and y_test, the list classifiers defined in a previous exercise, and the function accuracy_score from sklearn.metrics are available in your workspace.

In [22]:
# VotingClassifier
vc = VotingClassifier(estimators = classifiers)
# fit, predict, accuracy
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Logistic Regression : 0.947
K Nearest Neighbours : 0.924
Classification Tree : 0.942
Voting Classifier: 0.947


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
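
The warning above suggests scaling the data or raising max_iter. One way to do both is sketched below, with the caveat that it uses scikit-learn's bundled breast cancer dataset as a stand-in for the CSV file (an assumption, since the file's columns may differ) and a max_iter value picked arbitrarily for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 1

# Assumption: the bundled dataset stands in for 'Wisconsin Breast Cancer.csv'
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED)

# Standardize the features inside a pipeline so that the scaler is fit
# on the training data only, then fit logistic regression on the result
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=SEED))
scaled_lr.fit(X_train, y_train)

scaled_lr_accuracy = scaled_lr.score(X_test, y_test)
print('Scaled Logistic Regression: {:.3f}'.format(scaled_lr_accuracy))
```

A pipeline like this could take the place of lr in the classifiers list, so that the voting classifier also receives a logistic regression trained on standardized features.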
