# Machine Learning with Tree-Based Models in Python - Part 2
- Bias-Variance Tradeoff
    - Bias measures how far, on average, the predicted values are from the true values
    - Variance measures how much the predictions vary across different training sets
    - As model complexity increases, the bias term decreases while the variance term increases
    - As model complexity decreases, the bias term increases while the variance term decreases
    - A high-bias model underfits
    - A high-variance model overfits
    - Increasing the maximum tree depth or decreasing the minimum samples per leaf increases model complexity
- If the training set error is much smaller than the CV error, the model overfits the training set and suffers from high variance
- If the training set error and the CV error are roughly equal but both higher than the desired error, the model underfits the training set and suffers from high bias
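
The rules of thumb above can be observed on a small experiment. The sketch below is only an illustration under stated assumptions: the data is a synthetic noisy sine curve (not one of the course datasets), and the names `X_demo`, `y_demo` and `results` are made up for this example. It fits regression trees of increasing depth and reports the training RMSE next to the 10-fold CV RMSE.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve
rng = np.random.RandomState(1)
X_demo = rng.uniform(0, 6, size=(300, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=300)

results = {}
for depth in (1, 4, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1)

    # cross_val_score returns negative MSEs, so flip the sign
    cv_mse = -cross_val_score(tree, X_demo, y_demo, cv=10,
                              scoring='neg_mean_squared_error')
    cv_rmse = cv_mse.mean() ** 0.5

    # Training error of the same tree fitted on all of the data
    tree.fit(X_demo, y_demo)
    train_rmse = np.mean((y_demo - tree.predict(X_demo)) ** 2) ** 0.5

    results[depth] = (train_rmse, cv_rmse)
    print('max_depth={}: train RMSE {:.2f}, CV RMSE {:.2f}'.format(
        depth, train_rmse, cv_rmse))
```

The unrestricted tree memorises the training set, so its training error drops to zero while its CV error does not, which is the high-variance pattern. The depth-1 tree is the opposite case, where the training error itself stays high.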

In [280]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN

from sklearn.ensemble import VotingClassifier

SEED = 1

## Datasets

### Auto-mpg

In [3]:
df_auto_mpg = pd.read_csv('../datasets/auto-mpg.csv')
df_auto_mpg = pd.get_dummies(df_auto_mpg)

### Indian Liver Patient

In [268]:
df_indian_liver = pd.read_csv('../datasets/indian_liver_patient.csv')
df_indian_liver = df_indian_liver.dropna()

X, y = df_indian_liver.iloc[:,:-1], df_indian_liver.iloc[:,-1]

In [269]:
X_number = X.select_dtypes(include=[np.number])
X_number_std = StandardScaler().fit_transform(X_number)

In [270]:
df_indian_liver_std = pd.DataFrame(X_number_std, columns=X_number.columns)
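# Assign by position: X keeps the index labels from before dropna(), so a plain Series would be misaligned
df_indian_liver_std['Gender'] = X['Gender'].values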

df_indian_liver_std = pd.get_dummies(df_indian_liver_std, drop_first=True)

In [271]:
df_indian_liver_std[y.name] = y.values

## The Bias-Variance Tradeoff
The bias-variance tradeoff is one of the fundamental concepts in supervised machine learning. In this chapter, you'll understand how to diagnose the problems of overfitting and underfitting. You'll also be introduced to the concept of ensembling where the predictions of several models are aggregated to produce predictions that are more robust.

### Instantiate the model
In the following set of exercises, you'll diagnose the bias and variance problems of a regression tree. The regression tree you'll define in this exercise will be used to predict the mpg consumption of cars from the auto dataset using all available features.

We have already processed the data and loaded the features matrix X and the array y in your workspace. In addition, the DecisionTreeRegressor class was imported from sklearn.tree.

In [7]:
X, y = df_auto_mpg.iloc[:,:-1], df_auto_mpg.iloc[:,-1]

In [8]:
# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

In [10]:
# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED)
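
A float passed to `min_samples_leaf`, such as the 0.26 above, is read by scikit-learn as a fraction of the training samples rather than an absolute count. The sketch below uses synthetic data (an assumption, as are the names `X_demo`, `y_demo` and `leaf_counts`) to show how strongly that setting limits the size of the tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(1)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + rng.normal(size=200)

leaf_counts = {}
for frac in (0.26, 0.05):
    # A float min_samples_leaf means each leaf must hold at least
    # ceil(frac * n_samples) training samples
    tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=frac,
                                 random_state=1)
    tree.fit(X_demo, y_demo)
    leaf_counts[frac] = tree.get_n_leaves()
    print('min_samples_leaf={}: {} leaves'.format(frac, leaf_counts[frac]))
```

Because every leaf must contain at least 26% of the samples, the tree can never have more than three leaves regardless of `max_depth`, so a model configured this way is deliberately simple.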

### Evaluate the 10-fold CV error
In this exercise, you'll evaluate the 10-fold CV Root Mean Squared Error (RMSE) achieved by the regression tree dt that you instantiated in the previous exercise.

In addition to dt, the training data including X_train and y_train are available in your workspace. We also imported cross_val_score from sklearn.model_selection.

Note that scikit-learn scorers follow a "higher is better" convention, so cross_val_score exposes the MSE only as the negative MSE ('neg_mean_squared_error'). Its output should therefore be multiplied by negative one to obtain the MSEs.
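
As a minimal sketch of that sign flip, the example below runs on synthetic regression data (an assumption, generated with `make_regression`; the variable names are made up for this example). It also shows that taking the square root of the mean MSE, as done in this exercise, is not the same as averaging the per-fold RMSEs.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0,
                                 random_state=1)
tree = DecisionTreeRegressor(max_depth=4, random_state=1)

# Scores are negative MSEs: one value per fold, all less than or equal to zero
neg_mse = cross_val_score(tree, X_demo, y_demo, cv=10,
                          scoring='neg_mean_squared_error')
mse = -neg_mse

# Two ways of summarising the folds as a single RMSE
rmse_from_mean_mse = np.sqrt(mse.mean())   # used in this exercise
mean_fold_rmse = np.sqrt(mse).mean()       # average of per-fold RMSEs

print('sqrt(mean MSE): {:.2f}'.format(rmse_from_mean_mse))
print('mean fold RMSE: {:.2f}'.format(mean_fold_rmse))
```

The second figure can never exceed the first, because the square root is a concave function, so it helps to state which of the two is being reported.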

In [15]:
# Compute the array containing the 10-fold CV MSEs
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)

In [16]:
# Compute the 10-fold CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)

In [17]:
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 0.33


### Evaluate the training error
You'll now evaluate the training set RMSE achieved by the regression tree dt that you instantiated in a previous exercise.

In addition to dt, X_train and y_train are available in your workspace.

In [19]:
# Fit dt to the training set
dt.fit(X_train, y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=4,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=0.26, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=1, splitter='best')

In [20]:
# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

In [21]:
# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)

In [22]:
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))

Train RMSE: 0.31


### High bias or high variance?
In this exercise you'll diagnose whether the regression tree dt you trained in the previous exercise suffers from a bias or a variance problem.

The training set RMSE (RMSE_train) and the CV RMSE (RMSE_CV) achieved by dt are available in your workspace. In addition, we have also loaded a variable called baseline_RMSE which corresponds to the root mean-squared error achieved by the regression-tree trained with the disp feature only (it is the RMSE achieved by the regression tree trained in chapter 1, lesson 3). Here baseline_RMSE serves as the baseline RMSE above which a model is considered to be underfitting and below which the model is considered 'good enough'.

Does dt suffer from a high bias or a high variance problem?

In [23]:
# Predict the labels of the test set
y_pred_test = dt.predict(X_test)

In [25]:
# Evaluate the test set RMSE of dt
RMSE_test = (MSE(y_test, y_pred_test))**(1/2)

In [26]:
# Print RMSE_test
print('Test RMSE: {:.2f}'.format(RMSE_test))

Test RMSE: 0.36


In [29]:
# RMSE_train ≈ RMSE_CV ≈ RMSE_test
# dt suffers from high bias because RMSE_CV ≈ RMSE_train and both scores are greater than baseline_RMSE.
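
The reasoning in the comment above can be written as a small helper. This is a sketch only: the function `diagnose`, its `rel_tol` threshold for "roughly equal" and the baseline value of 0.25 are all hypothetical, since the notebook never prints `baseline_RMSE`.

```python
def diagnose(train_err, cv_err, baseline_err, rel_tol=0.1):
    """Apply the rule of thumb from this section.

    rel_tol is a hypothetical threshold: the two errors count as
    'roughly equal' when their gap is at most 10% of the CV error.
    """
    gap = cv_err - train_err
    if gap > rel_tol * cv_err:
        # Training error is clearly lower than CV error: overfitting
        return 'high variance'
    if cv_err > baseline_err:
        # Errors agree with each other but are worse than the baseline
        return 'high bias'
    return 'ok'


# RMSE values printed earlier in this notebook, with a made-up baseline
print(diagnose(train_err=0.31, cv_err=0.33, baseline_err=0.25))
```

With these inputs the helper returns 'high bias', matching the conclusion reached for dt. A different tolerance or baseline could change the verdict, so both should be chosen for the problem at hand.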

### Define the ensemble
In the following set of exercises, you'll work with the Indian Liver Patient Dataset from the UCI Machine Learning Repository.

In this exercise, you'll instantiate three classifiers to predict whether a patient suffers from a liver disease using all the features present in the dataset.

The classes LogisticRegression, DecisionTreeClassifier, and KNeighborsClassifier under the alias KNN are available in your workspace.

In [101]:
# Instantiate lr
lr = LogisticRegression(random_state=SEED)

In [105]:
# Instantiate knn
knn = KNN(n_neighbors=27)

In [106]:
# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

In [107]:
# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

### Evaluate individual classifiers
In this exercise you'll evaluate the performance of the models in the list classifiers that we defined in the previous exercise. You'll do so by fitting each classifier on the training set and evaluating its test set accuracy.

The dataset is already loaded and preprocessed for you (numerical features are standardized) and it is split into 70% train and 30% test. The features matrices X_train and X_test, as well as the arrays of labels y_train and y_test are available in your workspace. In addition, we have loaded the list classifiers from the previous exercise, as well as the function accuracy_score() from sklearn.metrics.

In [273]:
X, y = df_indian_liver_std.iloc[:,:-1], df_indian_liver_std.iloc[:,-1]

In [274]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

In [279]:
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:

    # Fit clf to the training set
    clf.fit(X_train, y_train)

    # Predict the test set labels
    y_pred = clf.predict(X_test)

    # Calculate the test set accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Print clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.736
K Nearest Neighbours : 0.684
Classification Tree : 0.730


### Better performance with a Voting Classifier
Finally, you'll evaluate the performance of a voting classifier that takes the outputs of the models defined in the list classifiers and assigns labels by majority voting.

X_train, X_test, y_train, y_test, the list classifiers defined in a previous exercise, as well as the function accuracy_score from sklearn.metrics are available in your workspace.

In [281]:
# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)

In [282]:
# Fit vc to the training set
vc.fit(X_train, y_train)  

VotingClassifier(estimators=[('Logistic Regression',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=1, solver='lbfgs',
                                                 tol=0.0001, verbose=0,
                                                 warm_start=False)),
                             ('K Nearest Neighbours',
                              KNeighborsClassifier(algorithm='auto',
                                                   leaf_...
                              DecisionTreeClassifier(ccp_alpha=0.0,
                         

In [283]:
# Predict the test set labels
y_pred = vc.predict(X_test)

In [284]:
# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Voting Classifier: 0.736
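
To make the majority-voting rule concrete, the sketch below reproduces it by hand and compares the result with `VotingClassifier(voting='hard')`. It runs on synthetic binary data from `make_classification` rather than the liver dataset (an assumption), and the short estimator names and `_demo` variables are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=1)

estimators = [
    ('lr', LogisticRegression(random_state=1)),
    ('knn', KNeighborsClassifier(n_neighbors=27)),
    ('dt', DecisionTreeClassifier(min_samples_leaf=0.13, random_state=1)),
]

# One row of 0/1 predictions per classifier
votes = np.array([clf.fit(X_tr, y_tr).predict(X_te) for _, clf in estimators])

# With three voters and two classes, label 1 wins when at least two vote for it
manual_pred = (votes.sum(axis=0) >= 2).astype(int)

# The same rule as implemented by scikit-learn (hard voting)
vc_demo = VotingClassifier(estimators=estimators, voting='hard')
vc_pred = vc_demo.fit(X_tr, y_tr).predict(X_te)

print('Predictions match:', np.array_equal(manual_pred, vc_pred))
```

With an odd number of voters and two classes there are no ties, which is why the simple "at least two votes" rule is enough here. An ensemble with an even number of voters or more classes would need an explicit tie-breaking rule.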
