# **Workshop 5**

In this workshop, you'll be looking at evaluation metrics and hyperparameter tuning.

# 0) Loading Data and Libraries

In [1]:
import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics
# the datasets used later (breast cancer, wine) come from sklearn.datasets
from sklearn import datasets
# Remember you have to run this cell block before continuing!

# set a seed for reproducibility
random_seed = 25
np.random.seed(random_seed)

# 1) Evaluation Metrics

## 1.1) Meet the Metrics (Follow)

In [2]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# This is a dummy dataset that contains 500 positive and 500 negative samples
X,Y = make_classification(n_samples=1000,n_features=4,flip_y=0,random_state=random_seed)

test_data_fraction = 0.2
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_data_fraction,  random_state=random_seed)

In [3]:
from sklearn.tree import DecisionTreeClassifier
Y_test_predicted = DecisionTreeClassifier(criterion = "gini", random_state=random_seed).fit(X=X_train, y=Y_train).predict(X_test)

In [7]:
print(f'Accuracy: {sklearn.metrics.accuracy_score(Y_test, Y_test_predicted)}')
print(f'Precision Macro: {sklearn.metrics.precision_score(Y_test, Y_test_predicted, average="macro")}')
print(f'Recall Macro: {sklearn.metrics.recall_score(Y_test, Y_test_predicted, average="macro")}')
print(f'F1 Macro: { sklearn.metrics.f1_score(Y_test, Y_test_predicted, average="macro") }')

Accuracy: 0.92
Precision Macro: 0.9184393588063313
Recall Macro: 0.9201336167628302
F1 Macro: 0.9191919191919192


In [5]:
# For single-label classification like this, micro-averaged precision, recall and F1 all equal the accuracy (regardless of class balance)
print(f'Precision Micro: {sklearn.metrics.precision_score(Y_test, Y_test_predicted, average="micro")}')
print(f'Recall Micro: {sklearn.metrics.recall_score(Y_test, Y_test_predicted, average="micro")}')
print(f'F1 Micro: { sklearn.metrics.f1_score(Y_test, Y_test_predicted, average="micro") }')

Precision Micro: 0.92
Recall Micro: 0.92
F1 Micro: 0.92
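
As a sanity check on the averaging modes, here is a minimal sketch using made-up, deliberately imbalanced toy labels (`y_true` and `y_pred` are illustrative and not taken from the workshop data). It suggests that, for single-label problems, the micro average coincides with accuracy, while the macro average is the unweighted mean of the per-class scores.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Made-up, imbalanced toy labels (8 negatives, 2 positives) for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)
micro = precision_score(y_true, y_pred, average="micro")
macro = precision_score(y_true, y_pred, average="macro")
# average=None returns one precision value per class
per_class = precision_score(y_true, y_pred, average=None)

print(f"Accuracy:        {acc:.4f}")
print(f"Precision micro: {micro:.4f}")
print(f"Precision macro: {macro:.4f}")
print(f"Per-class:       {per_class}")
```

Macro averaging treats every class equally, so a poorly predicted minority class pulls the macro score down even when accuracy looks fine.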


Sklearn also has a [built-in function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) that gives a handy summary of the popular classification metrics. You can use this for the later questions.

Precision, Recall and F1 are reported for **each class separately**. For the "0" row, a 0 is treated as the positive class. For the "1" row, the 1 is treated as the positive class. This is helpful because Precision and Recall are both sensitive to which class is considered positive. **Support** is the number of true instances of that row's class in the test set.
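
To make the per-class rows concrete, the following sketch recomputes them by hand on made-up toy labels (not the workshop data), using the `pos_label` argument to switch which class counts as positive. The dictionary `rows` is just an illustrative container.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up toy labels for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 0])

rows = {}
for positive_class in (0, 1):
    # Treat `positive_class` as the positive label for this row
    p = precision_score(y_true, y_pred, pos_label=positive_class)
    r = recall_score(y_true, y_pred, pos_label=positive_class)
    # Support: how many true instances of this class appear in y_true
    support = int(np.sum(y_true == positive_class))
    rows[positive_class] = (p, r, support)
    print(f"class {positive_class}: precision={p:.4f} recall={r:.4f} support={support}")
```

The two rows generally differ, which is why the report lists each class on its own line.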

In [10]:
from sklearn.metrics import classification_report

print(classification_report(Y_test,Y_test_predicted,digits=4))

              precision    recall  f1-score   support

           0     0.9358    0.9189    0.9273       111
           1     0.9011    0.9213    0.9111        89

    accuracy                         0.9200       200
   macro avg     0.9184    0.9201    0.9192       200
weighted avg     0.9203    0.9200    0.9201       200



Next, let's compare some classifiers. Soon you'll learn about the K-nearest-neighbors and AdaBoost classifiers. For now, all you need to know is that they take very different approaches from decision trees, and you should expect them to perform differently.

In [13]:
# K-Nearest Neighbor Classifier
from sklearn.neighbors import KNeighborsClassifier

Y_test_predicted = KNeighborsClassifier(n_neighbors=3).fit(X=X_train, y=Y_train).predict(X_test)
print("KNN Classifier")
print(classification_report(Y_test,Y_test_predicted,digits=4))

KNN Classifier
              precision    recall  f1-score   support

           0     0.9907    0.9550    0.9725       111
           1     0.9462    0.9888    0.9670        89

    accuracy                         0.9700       200
   macro avg     0.9684    0.9719    0.9698       200
weighted avg     0.9709    0.9700    0.9701       200



In [14]:
# AdaBoost Classifier
from sklearn.ensemble import AdaBoostClassifier

Y_test_predicted = AdaBoostClassifier(n_estimators=100, random_state=random_seed).fit(X=X_train, y=Y_train).predict(X_test)
print("Adaboost Classifier")
print(classification_report(Y_test,Y_test_predicted,digits=4))

Adaboost Classifier
              precision    recall  f1-score   support

           0     0.9537    0.9279    0.9406       111
           1     0.9130    0.9438    0.9282        89

    accuracy                         0.9350       200
   macro avg     0.9334    0.9359    0.9344       200
weighted avg     0.9356    0.9350    0.9351       200



A *dummy classifier* always picks the majority class. We use it to make sure a classifier is doing better than a naive approach that wouldn't require any real training (classifiers don't always do better!).

What do the precision, recall and accuracy represent in this case?

In [15]:
import warnings
warnings.filterwarnings('ignore')

# Dummy Classifier (Picks the majority class. Every time.)
from sklearn.dummy import DummyClassifier

Y_test_predicted = DummyClassifier(strategy="most_frequent", random_state=random_seed).fit(X=X_train, y=Y_train).predict(X_test)
print("Dummy Classifier")
print(classification_report(Y_test,Y_test_predicted,digits=4))

Dummy Classifier
              precision    recall  f1-score   support

           0     0.0000    0.0000    0.0000       111
           1     0.4450    1.0000    0.6159        89

    accuracy                         0.4450       200
   macro avg     0.2225    0.5000    0.3080       200
weighted avg     0.1980    0.4450    0.2741       200



Now, let's compare the classifiers. Which is the best? What metric are you using to compare them?

## 1.2) Imbalanced data (Group)
In this problem, you'll be trying to predict the presence of breast cancer from features derived from medical readings. This can help doctors make better diagnoses and save lives.

Breast cancer is a common cancer, but relatively rare overall. However, this dataset includes a far larger share of positive instances (people with breast cancer) than you would see in the general population. Why might that be the case?

In this problem, we'll learn how to deal with these "imbalanced" datasets.

In [16]:
# Load the data
# Read the breast cancer prediction dataset
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler

bc_sk = datasets.load_breast_cancer()

# Make sure data is in the same range
bc_sk.data = MinMaxScaler().fit_transform(bc_sk.data)

# Note that the "target" attribute is the diagnosis, represented as an integer (0 = malignant, 1 = benign)
bc_data = pd.DataFrame(data= np.c_[bc_sk['data'], bc_sk['target']],columns= list(bc_sk['feature_names'])+['target'])
bc_data.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,0.521037,0.022658,0.545989,0.363733,0.593753,0.792037,0.70314,0.731113,0.686364,0.605518,...,0.141525,0.66831,0.450698,0.601136,0.619292,0.56861,0.912027,0.598462,0.418864,0.0
1,0.643144,0.272574,0.615783,0.501591,0.28988,0.181768,0.203608,0.348757,0.379798,0.141323,...,0.303571,0.539818,0.435214,0.347553,0.154563,0.192971,0.639175,0.23359,0.222878,0.0
2,0.601496,0.39026,0.595743,0.449417,0.514309,0.431017,0.462512,0.635686,0.509596,0.211247,...,0.360075,0.508442,0.374508,0.48359,0.385375,0.359744,0.835052,0.403706,0.213433,0.0
3,0.21009,0.360839,0.233501,0.102906,0.811321,0.811361,0.565604,0.522863,0.776263,1.0,...,0.385928,0.241347,0.094008,0.915472,0.814012,0.548642,0.88488,1.0,0.773711,0.0
4,0.629893,0.156578,0.630986,0.48929,0.430351,0.347893,0.463918,0.51839,0.378283,0.186816,...,0.123934,0.506948,0.341575,0.437364,0.172415,0.319489,0.558419,0.1575,0.142595,0.0


In [17]:
test_data_fraction = 0.2
bc_features = bc_data.iloc[:,0:-1]
bc_labels = bc_data["target"]
X_train, X_test, Y_train, Y_test = train_test_split(bc_features, bc_labels, test_size=test_data_fraction,  random_state=random_seed)

Let's take a look at the ratio of class values in the dataset.

In [18]:
bc_data["target"].value_counts()

1.0    357
0.0    212
Name: target, dtype: int64

As we can see, it's around a 60/40 split. What effect do you think this will have on the various evaluation metrics? For example, how could a classifier easily get 100% recall, 60% accuracy and 60% precision?

**Answer Here**

Now run the evaluation metrics as above for Decision Trees, KNN, AdaBoost, and the Dummy Classifier.

In [None]:
# Decision Tree

In [None]:
# K-Nearest Neighbor Classifier

In [None]:
# AdaBoost Classifier

In [None]:
# Dummy Classifier

Based on these metrics, answer the following questions:

1. Which model would you select and why? 
2. What metric(s) are most important for the breast cancer classification problem?
3. How would you recommend a doctor actually use the model **in practice**? Is it good enough to make decisions on its own?

**Answer here**

## 1.3) Multiclass Data (Group)

Now, we'll be looking at the wine dataset.

In [None]:
# Read the wine dataset and translate to pandas dataframe
wine_sk = datasets.load_wine()
# Note that the "target" attribute is the wine class (cultivar), represented as an integer
wine_data = pd.DataFrame(data= np.c_[wine_sk['data'], wine_sk['target']],columns= wine_sk['feature_names'] + ['target'])

In [None]:
from sklearn.model_selection import train_test_split
# The fraction of data that will be test data
test_data_fraction = 0.90

wine_features = wine_data.iloc[:,0:-1]
wine_labels = wine_data["target"]
X_train, X_test, Y_train, Y_test = train_test_split(wine_features, wine_labels, test_size=test_data_fraction,  random_state=random_seed)

Let's check the distribution of the dataset

In [None]:
wine_data["target"].value_counts()

The classes are represented as numbers. This is just shorthand to make it easier to classify.

The [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) is useful for getting a broad overview of how your classifier handled certain classes. Below, create a confusion matrix using the **test set**.
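
As a quick orientation before building your own, here is a small sketch on made-up three-class labels (not the wine data) showing how sklearn lays the matrix out: rows correspond to the true class and columns to the predicted class.

```python
from sklearn.metrics import confusion_matrix

# Made-up three-class toy labels for illustration only
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2, 2]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# cm[i, j] counts samples whose true class is i and predicted class is j
print("True class 0, predicted class 1:", cm[0, 1])
```

The diagonal holds the correctly classified samples, so off-diagonal entries show which classes get confused with which.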

In [None]:
from sklearn.metrics import confusion_matrix

# Now create a confusion matrix

Read the documentation for `confusion_matrix`. How many instances were predicted to be class 0 but were actually class 1?

**Answer here**

Now run the evaluation metrics as above for Decision Trees, KNN, AdaBoost, and the Dummy Classifier.

In [None]:
# Decision Tree

In [None]:
# K-Nearest Neighbor Classifier

In [None]:
# AdaBoost Classifier

In [None]:
# Dummy Classifier

Answer the following questions below:

1. Which model would you select if you cared equally about each class being correct? 
2. What if you cared most about accurately detecting Class 0? 
3. Would you ever choose the Decision Tree model over Adaboost? If so, when? If not, why not?

**Answer here**

# 2) Cross Validation and Hyperparameter Tuning

## 2.1) Basic Cross Validation (Follow)

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [None]:
# Initialize a k-fold splitter
kf = KFold(n_splits=3)

In [None]:
# kf.split() allows you to iterate through the different folds
# "train_index" are the indices of the training data in that fold
# "test_index" are the indices of the testing data in that fold
for train_index, test_index in kf.split(X_train):
    print("Train: ", train_index)
    print("Test: ", test_index)
    print("----")
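
With real data the index arrays are long and hard to read, so here is the same idea sketched on six made-up samples (`toy_X` is illustrative only). Each sample lands in the test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six made-up samples so the folds are easy to read
toy_X = np.arange(12).reshape(6, 2)

folds = list(KFold(n_splits=3).split(toy_X))
for train_index, test_index in folds:
    print("Train:", train_index, "Test:", test_index)
```

By default `KFold` does not shuffle, so the folds are contiguous blocks of rows. If the rows are stored in some meaningful order, passing `shuffle=True` together with a `random_state` is one option.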

## 2.2) Hyperparameter Tuning with CV (Group)

We did some very basic HP tuning last workshop. However, one of the main issues is that we did HP tuning by testing our HPs against the test dataset. It's good practice not to touch your test dataset at all until you've finished selecting your model completely. Therefore, in this exercise we'll be trying out different HPs by constructing validation sets from our training data.

The dataset we'll be using for this exercise is the breast cancer dataset, which is used to tell whether a certain individual might have breast cancer or not.

In [None]:
# Load the data
# Read the breast cancer dataset and translate to pandas dataframe
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler

bc_sk = datasets.load_breast_cancer()

# Make sure data is in the same range
bc_sk.data = MinMaxScaler().fit_transform(bc_sk.data)

# Note that the "target" attribute is the diagnosis, represented as an integer (0 = malignant, 1 = benign)
bc_data = pd.DataFrame(data= np.c_[bc_sk['data'], bc_sk['target']],columns= list(bc_sk['feature_names'])+['target'])
bc_data.head()

In [None]:
# Formatting our data
test_data_fraction = 0.2
bc_features = bc_data.iloc[:,0:-1]
bc_labels = bc_data["target"]
X_train, X_test, Y_train, Y_test = train_test_split(bc_features, bc_labels, test_size=test_data_fraction,  random_state=random_seed)

In [None]:
def k_fold_accuracy(k, model, X_data, Y_data):
    
    # Init k-fold splitter
    kf = KFold(n_splits=k)
    scores = []
    
    #use kf.split to split the train data into train and validation data
    #iterate through all possible folds and fit the folded training data to the model
    #use the validation data to predict on the model
    #compute the accuracy score and append it to scores

    return scores

In [None]:
# Testing K-fold
k = 3
model = DecisionTreeClassifier(criterion = "gini", random_state=random_seed)
per_fold_acc = k_fold_accuracy(k, model, X_train, Y_train)
print(per_fold_acc)
np.mean(per_fold_acc)

There is also a built-in sklearn function for this. However, it is important to know how to perform your own k-fold cross-validation split if you want to implement a custom evaluation metric.

In [None]:
from sklearn import metrics
# We're using the training dataset here, but remember that CV will
# split that data into training and validation sets for each fold
# so we get an "unbiased" estimate of our test performance.
per_fold_acc = cross_val_score(model, X_train, Y_train, cv=KFold(n_splits=k), scoring='accuracy')
print(per_fold_acc)
np.mean(per_fold_acc)

*Why would we ever do this by hand, if there's already a built-in method?* 

Sometimes our model training process is more complex than just fitting the model. For example, we may want to do:
* Feature selection
* Normalization / scaling
* More complex models not in the sklearn library

In these cases, we can still *only use our training data*! You can't use test data to select features - that would be "cheating." So everything that you use your training data for has to occur within the loop we wrote for CV, above, based on the training data for the particular fold we're evaluating.
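
As one possible illustration of keeping preprocessing inside the loop, the sketch below fits a `MinMaxScaler` on each fold's training rows only and then applies it to that fold's validation rows. It runs on synthetic stand-in data (`X_demo`, `y_demo`), and the names `fold_model` and `fold_scores` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

fold_scores = []
for train_index, valid_index in KFold(n_splits=3).split(X_demo):
    X_tr, X_va = X_demo[train_index], X_demo[valid_index]
    y_tr, y_va = y_demo[train_index], y_demo[valid_index]

    # Fit the scaler on this fold's training rows only, then apply it to both parts
    scaler = MinMaxScaler().fit(X_tr)
    fold_model = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_tr), y_tr)

    # Score on the held-out validation rows, scaled with the training-fold scaler
    fold_scores.append(accuracy_score(y_va, fold_model.predict(scaler.transform(X_va))))

print(fold_scores, np.mean(fold_scores))
```

The validation rows never influence the scaler's minimum and maximum, which is the property the paragraph above is asking for.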

## 2.3) Tuning (Group)

In this problem you are going to select the best hyperparameter value, using *only the training dataset*. No peeking at the test dataset. To estimate how well a given hyperparameter value will do on *unseen* data, we can use cross-validation (within the training dataset) to evaluate our model.

Let's use this approach to select the best `ccp_alpha` hyperparameter for a Decision Tree model.
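
For intuition about what this hyperparameter does: `ccp_alpha` is the complexity parameter for minimal cost-complexity pruning, and larger values prune the tree more aggressively. The sketch below, on synthetic stand-in data (`X_demo`, `y_demo`) with three arbitrarily chosen alpha values, counts the leaves that remain after pruning.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

leaf_counts = {}
for alpha in (0.0, 0.01, 0.03):
    # Same data and seed each time, so only the pruning strength changes
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_demo, y_demo)
    leaf_counts[alpha] = tree.get_n_leaves()

print(leaf_counts)
```

A smaller tree is less likely to overfit but may underfit, which is the trade-off the tuning exercise explores.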

You should:
1. Iterate over all ccp_alpha values
2. Calculate the k-fold validation accuracy using the `k_fold_accuracy` function above
3. Calculate the training accuracy and the validation accuracy
4. Plot both accuracies vs. the ccp_alpha value

In [None]:
from sklearn.metrics import accuracy_score

# np.arange generates an array that starts at the minimum, stops before the maximum, and increments by step
alpha_values = np.arange(0, 0.035, 0.002)

# two lists to hold our accuracy
k = 5
valid_accs = []
train_accs = []

# Put your solution here!



plt.plot(alpha_values, valid_accs, color='red', label='Validation')
plt.plot(alpha_values, train_accs, color='blue', label='Training')
plt.xlabel("Post Pruning Alpha")
plt.ylabel(f'Average Accuracy of {k}-fold validation')
plt.legend()

The following code selects the alpha value for the best model. Your job is then to train a new model (using all of the training data) with your best hyperparameter value, and evaluate it on the test dataset. What are the accuracy, precision, recall and F1 score?

In [None]:
# Take the alpha for the model with the best accuracy on the *validation* set!
best_alpha = alpha_values[np.argmax(valid_accs)]
best_alpha

In [None]:
# Train your model here. You may want to print the tree using plot_tree

In [None]:
# Now evaluate your model on the test dataset.
# How does it perform compared to your model from the last workshop that didn't use CV?

# 3) ROC Curves

Sklearn has some built-in methods for [plotting ROC curves](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html).
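
Before plotting, it may help to see what an ROC curve is computed from. The sketch below uses four made-up labels and scores (not the workshop data): `roc_curve` sweeps a threshold over the scores and returns the false positive rate and true positive rate at each step, and `roc_auc_score` summarizes the curve as the area under it.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up labels and continuous scores for illustration only
toy_true = np.array([0, 0, 1, 1])
toy_score = np.array([0.1, 0.4, 0.35, 0.8])

toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_true, toy_score)
toy_auc = roc_auc_score(toy_true, toy_score)

print("FPR:", toy_fpr)
print("TPR:", toy_tpr)
print("AUC:", toy_auc)
```

An AUC of 1.0 would mean every positive is scored above every negative, while 0.5 corresponds to random ranking.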

## 3.1) Plotting ROC Curves (Group)

In [None]:
# First make an ROC curve for the model you selected with HP tuning
gini_tree = DecisionTreeClassifier(criterion = "gini", random_state=random_seed, ccp_alpha=best_alpha).fit(X=X_train, y=Y_train)
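# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator (scikit-learn >= 1.0) replaces it
metrics.RocCurveDisplay.from_estimator(gini_tree, X_test, Y_test)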

In [None]:
# Now, make an ROC curve with an AdaBoostClassifier with n_estimators=100
# Store the fitted model in a variable named `ada`; the next cell uses it


In [None]:
plt.figure(0).clf()

# When predicting, we have to ask for *continuous* values, not 0/1, so we use predict_proba
# We use [:,1] to get the predictions for the positive class
tree_predictions = gini_tree.predict_proba(X_test)[:,1]
fpr, tpr, thresh = metrics.roc_curve(Y_test, tree_predictions)
auc = metrics.roc_auc_score(Y_test, tree_predictions)
plt.plot(fpr,tpr,label="Decision Tree, auc="+str(auc))

adaboost_predictions = ada.predict_proba(X_test)[:,1]
fpr, tpr, thresh = metrics.roc_curve(Y_test, adaboost_predictions)
auc = metrics.roc_auc_score(Y_test, adaboost_predictions)
plt.plot(fpr,tpr,label="Adaboost, auc="+str(auc))

plt.legend(loc=0)

## 3.2) Interpreting ROC Curves (Group)

Take a look at the above ROC curves. How are they similar? How do they differ? Is one strictly better than the other? In what situations is one better than the other? Discuss with your group.

**Take notes of your discussion here.**