<a href="https://colab.research.google.com/github/michaelwnau/ai_academy_notebooks/blob/main/WKS_5_nau_tues_Student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Workshop 5 (Student)**

In this workshop, you'll be looking at evaluation metrics and hyperparameter tuning.

# 0) Loading Data and Libraries

In [None]:
import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics
# the datasets used in this workshop come from sklearn.datasets
from sklearn import datasets
# Remember you have to run this cell block before continuing!

# set a seed for reproducibility
random_seed = 25
np.random.seed(random_seed)

# 1) Evaluation Metrics

## 1.1) Meet the Metrics (Follow)

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# This is a dummy dataset that contains 500 positive and 500 negative samples
X,Y = make_classification(n_samples=1000,n_features=4,flip_y=0,random_state=random_seed)

test_data_fraction = 0.2
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_data_fraction,  random_state=random_seed)

In [None]:
from sklearn.tree import DecisionTreeClassifier
Y_test_predicted = DecisionTreeClassifier(criterion = "gini", random_state=random_seed).fit(X=X_train, y=Y_train).predict(X_test)
#sum(Y)/len(Y)

In [None]:
# BEGIN SOLUTION
print(f'Accuracy: {sklearn.metrics.accuracy_score(Y_test, Y_test_predicted)}')
print(f'Precision Macro: {sklearn.metrics.precision_score(Y_test, Y_test_predicted, average="macro")}')
print(f'Recall Macro: {sklearn.metrics.recall_score(Y_test, Y_test_predicted, average="macro")}')
print(f'F1 Macro: { sklearn.metrics.f1_score(Y_test, Y_test_predicted, average="macro") }')

In [None]:
# In single-label classification, micro-averaged precision, recall, and F1 always equal the accuracy,
# regardless of how balanced the classes are
print(f'Precision Micro: {sklearn.metrics.precision_score(Y_test, Y_test_predicted, average="micro")}')
print(f'Recall Micro: {sklearn.metrics.recall_score(Y_test, Y_test_predicted, average="micro")}')
print(f'F1 Micro: { sklearn.metrics.f1_score(Y_test, Y_test_predicted, average="micro") }')

Sklearn also has a [built-in function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) that gives a handy summary of the most popular classification metrics. You can use this for the later questions.

The first few values in the first column (before accuracy, macro avg, etc.) are the class values.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(Y_test,Y_test_predicted,digits=4))
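
To make the averaged rows of such a report more concrete, here is a minimal sketch on hand-made labels (the `y_true` and `y_pred` lists are invented for illustration and are not workshop data). It shows that micro-averaged precision matches accuracy even when the classes are imbalanced, while the macro average weights each class equally.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hand-made, deliberately imbalanced labels: 8 samples of class 0, 2 of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)

# Micro averaging pools every prediction before computing the metric.
micro_p = precision_score(y_true, y_pred, average="micro")

# Macro averaging computes the metric per class, then takes the unweighted mean.
macro_p = precision_score(y_true, y_pred, average="macro")
macro_r = recall_score(y_true, y_pred, average="macro")

print(f"accuracy={acc}, micro precision={micro_p}")
print(f"macro precision={macro_p}, macro recall={macro_r}")
```

Here class 0 has precision 7/8 and class 1 has precision 1/2, so the macro average is lower than the accuracy of 0.8.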

In [None]:
# K-Nearest Neighbor Classifier
from sklearn.neighbors import KNeighborsClassifier
Y_test_predicted = KNeighborsClassifier(n_neighbors=3).fit(X=X_train, y=Y_train).predict(X_test)
print("KNN Classifier")
print(classification_report(Y_test,Y_test_predicted,digits=4))

In [None]:
# AdaBoost Classifier
from sklearn.ensemble import AdaBoostClassifier
Y_test_predicted = AdaBoostClassifier(n_estimators=100, random_state=random_seed).fit(X=X_train, y=Y_train).predict(X_test)
print("Adaboost Classifier")
print(classification_report(Y_test,Y_test_predicted,digits=4))

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Dummy Classifier (Picks the majority class. Every time.)
from sklearn.dummy import DummyClassifier
Y_test_predicted = DummyClassifier(strategy="most_frequent", random_state=random_seed).fit(X=X_train, y=Y_train).predict(X_test)
print("Dummy Classifier")
print(classification_report(Y_test,Y_test_predicted,digits=4))

## 1.2) Imbalanced data (Group)

In [None]:
# Load the data
# Read the breast cancer dataset and translate to pandas dataframe
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler

bc_sk = datasets.load_breast_cancer()

# Make sure data is in the same range
bc_sk.data = MinMaxScaler().fit_transform(bc_sk.data)

# Note that the "target" attribute is the diagnosis, represented as an integer (0 = malignant, 1 = benign)
bc_data = pd.DataFrame(data= np.c_[bc_sk['data'], bc_sk['target']],columns= list(bc_sk['feature_names'])+['target'])
bc_data.head()

In [None]:
test_data_fraction = 0.2
bc_features = bc_data.iloc[:,0:-1]
bc_labels = bc_data["target"]
X_train, X_test, Y_train, Y_test = train_test_split(bc_features, bc_labels, test_size=test_data_fraction,  random_state=random_seed)

Let's take a look at the ratio of class values in the dataset.

In [None]:
bc_data["target"].value_counts()

As we can see, it's around a 60/40 split. What effect do you think this will have on the various evaluation metrics?

**Discuss Here**

Now run the evaluation metrics as above for the Decision Tree, KNN, AdaBoost, and Dummy classifiers.
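
One possible way to organize a comparison like this is a loop over a dictionary of models. The sketch below is only an illustration: it uses a synthetic dataset with roughly a 60/40 class split as a stand-in (an assumption, not the breast cancer data), and the names `demo_models` and `reports` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly a 60/40 class split (not the workshop data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, weights=[0.6, 0.4], flip_y=0, random_state=25
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=25)

demo_models = {
    "Decision Tree": DecisionTreeClassifier(random_state=25),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=25),
    "Dummy": DummyClassifier(strategy="most_frequent"),
}

reports = {}
for name, clf in demo_models.items():
    y_hat = clf.fit(Xtr, ytr).predict(Xte)
    # output_dict=True returns the report as a nested dict instead of a string;
    # zero_division=0 silences the warning for classes that are never predicted.
    reports[name] = classification_report(yte, y_hat, zero_division=0, output_dict=True)
    print(name,
          "accuracy:", round(reports[name]["accuracy"], 4),
          "macro F1:", round(reports[name]["macro avg"]["f1-score"], 4))
```

Because the dummy model always predicts the majority class, its recall is 1 for that class and 0 for the other, so its macro recall is 0.5 no matter how high its accuracy looks.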

In [None]:
# Decision Tree


In [None]:
# K-Nearest Neighbor Classifier


In [None]:
# AdaBoost Classifier


In [None]:
# Dummy Classifier


In terms of evaluation metrics, how did each model perform? Discuss

**Discuss**

## 1.3) Multiclass Data (Group)

Now, we'll be looking at the wine dataset.

In [None]:
# Read the wine dataset and translate to pandas dataframe
wine_sk = datasets.load_wine()
# Note that the "target" attribute is the wine class (cultivar), represented as an integer
wine_data = pd.DataFrame(data= np.c_[wine_sk['data'], wine_sk['target']],columns= wine_sk['feature_names'] + ['target'])

In [None]:
from sklearn.model_selection import train_test_split
# The fraction of data that will be test data
test_data_fraction = 0.1

wine_features = wine_data.iloc[:,0:-1]
wine_labels = wine_data["target"]
X_train, X_test, Y_train, Y_test = train_test_split(wine_features, wine_labels, test_size=test_data_fraction,  random_state=random_seed)

Let's check the distribution of the dataset

In [None]:
wine_data["target"].value_counts()

The [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) is useful for getting a broad overview of how your classifier handled certain classes.
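
As a small illustration of how to read one, the sketch below builds a confusion matrix from hand-made three-class labels (invented for this example, not taken from the wine data). In sklearn's convention, rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for three classes (0, 1, 2).
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(toy_true, toy_pred)
print(cm)

# The diagonal holds the correctly classified samples for each class.
print("Correct per class:", cm.diagonal())
```

Entry `cm[i, j]` counts the samples of true class `i` that were predicted as class `j`, so off-diagonal entries show which classes get confused with each other.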

In [None]:
from sklearn.metrics import confusion_matrix

# Now create a confusion matrix

Now run the evaluation metrics as above for the Decision Tree, KNN, AdaBoost, and Dummy classifiers.

In [None]:
# Decision Tree


In [None]:
# K-Nearest Neighbor Classifier


In [None]:
# AdaBoost Classifier


In [None]:
# Dummy Classifier


In terms of evaluation metrics, how did each model perform? Discuss

Discuss here

# 2) Cross Validation and Hyperparameter Tuning

## 2.1) Basic Cross Validation (Follow)

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [None]:
# Initialize a k-fold splitter
kf = KFold(n_splits=3)

In [None]:
# kf.split() allows you to iterate through the different folds
# "train_index" holds the indices of the training data in that fold
# "test_index" holds the indices of the testing data in that fold
print(len(X_train))
for train_index, test_index in kf.split(X_train):
    print("Train: ", train_index)
    print("Test: ", test_index)
    print("----")
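
To see the fold structure on something small enough to check by hand, here is a minimal sketch with ten toy samples (the array `toy` is made up for illustration). Without shuffling, `KFold` assigns consecutive blocks of indices to each test fold, and the first folds absorb the remainder when the sample count is not divisible by the number of splits.

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10)  # ten samples, indexed 0..9

# Collect the test indices of each fold (no shuffling by default).
fold_tests = [test_idx.tolist() for _, test_idx in KFold(n_splits=3).split(toy)]
print(fold_tests)

# With shuffle=True the folds are drawn at random, but still partition the data.
shuffled = KFold(n_splits=3, shuffle=True, random_state=25)
shuffled_tests = [test_idx.tolist() for _, test_idx in shuffled.split(toy)]
print(shuffled_tests)
```

Every sample appears in exactly one test fold, which is why each fold's score is computed on data that fold's model never saw.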

## 2.2) Hyperparameter Tuning with CV (Group)

We did some very basic HP tuning last workshop. However, one of the main issues is that we tuned our HPs by testing them against the test dataset. It's good practice not to touch your test dataset at all until you've finished selecting your model completely. Therefore, in this exercise we'll try out different HPs by constructing validation sets from our training data.

The dataset we'll be using for this exercise is the breast cancer dataset, which is used to tell whether a certain individual might have breast cancer or not.

In [None]:
# Load the data
# Read the breast cancer dataset and translate to pandas dataframe
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler

bc_sk = datasets.load_breast_cancer()

# Make sure data is in the same range
bc_sk.data = MinMaxScaler().fit_transform(bc_sk.data)

# Note that the "target" attribute is the diagnosis, represented as an integer (0 = malignant, 1 = benign)
bc_data = pd.DataFrame(data= np.c_[bc_sk['data'], bc_sk['target']],columns= list(bc_sk['feature_names'])+['target'])
bc_data.head()

In [None]:
# Formatting our data
test_data_fraction = 0.2
bc_features = bc_data.iloc[:,0:-1]
bc_labels = bc_data["target"]
X_train, X_test, Y_train, Y_test = train_test_split(bc_features, bc_labels, test_size=test_data_fraction,  random_state=random_seed)

In [None]:
def k_fold_accuracy(k, model, X_data, Y_data):
    
    # Init k-fold splitter
    kf = KFold(n_splits=k)
    scores = []
    
    #use kf.split to split the train data into train and validation data
    #iterate through all possible folds and fit the folded training data to the model
    #use the validation data to predict on the model
    #compute the accuracy score and append it to scores

    return scores

In [None]:
# Testing K-fold
k = 3
model = DecisionTreeClassifier(criterion = "gini", random_state=random_seed)
per_fold_acc = k_fold_accuracy(k, model, X_train, Y_train)
print(per_fold_acc)
np.mean(per_fold_acc)

There is also a built-in sklearn function for this. However, it is important to know how to perform your own k-fold cross-validation split if you want to implement a custom evaluation metric.

In [None]:
from sklearn import metrics
# We're using the training dataset here, but remember that CV will
# split that data into training and validation sets for each fold
# so we get an "unbiased" estimate of our test performance.
per_fold_acc = cross_val_score(model, X_train.values, Y_train.values, cv=KFold(n_splits=k), scoring='accuracy')
print(per_fold_acc)
np.mean(per_fold_acc)
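
For the custom-metric case, a manual loop might look like the following sketch. It is one possible shape rather than the workshop's solution: the helper name `k_fold_scores`, its `score_fn` argument, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier


def k_fold_scores(k, model, X, y, score_fn):
    """Hypothetical helper: per-fold validation scores for any score_fn(y_true, y_pred)."""
    scores = []
    for train_idx, valid_idx in KFold(n_splits=k).split(X):
        fold_model = clone(model)  # fresh, unfitted copy so folds do not share state
        fold_model.fit(X[train_idx], y[train_idx])
        scores.append(score_fn(y[valid_idx], fold_model.predict(X[valid_idx])))
    return scores


def macro_f1(y_true, y_pred):
    # A "custom" metric for this demo: macro-averaged F1.
    return f1_score(y_true, y_pred, average="macro")


# Synthetic NumPy data (not the workshop data).
X_cv, y_cv = make_classification(n_samples=300, n_features=4, flip_y=0, random_state=25)
fold_f1 = k_fold_scores(3, DecisionTreeClassifier(random_state=25), X_cv, y_cv, macro_f1)
print(fold_f1, np.mean(fold_f1))
```

The sketch indexes NumPy arrays directly; with pandas objects such as `X_train` and `Y_train`, positional indexing would go through `.iloc` instead.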

## 2.3) Tuning (Group)

In this problem you are going to select the best hyperparameter, using *only the training dataset*. No peeking at the test dataset. To estimate how well a given hyperparameter value will do on *unseen* data, we can use cross-validation (within the training dataset) to evaluate our model.

You should:
1. Iterate over all ccp_alpha values
2. Calculate the k-fold validation accuracy using the `k_fold_accuracy` function above
3. Calculate the training accuracy and the validation accuracy
4. Plot both accuracies vs. the ccp_alpha value
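
The steps above can be sketched as follows. Note the assumptions: this version swaps in sklearn's `cross_validate` (with `return_train_score=True`) instead of the notebook's own k-fold function, runs on synthetic data, uses a coarser alpha grid, and skips the plot. All `demo_*` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy stand-in data (not the workshop data).
X_tune, y_tune = make_classification(n_samples=300, n_features=4, flip_y=0.1, random_state=25)

demo_alphas = np.arange(0, 0.035, 0.01)  # 0.0, 0.01, 0.02, 0.03
demo_train_accs, demo_valid_accs = [], []

for alpha in demo_alphas:
    tree = DecisionTreeClassifier(random_state=25, ccp_alpha=alpha)
    # cross_validate refits the model on each fold and reports both scores.
    cv = cross_validate(tree, X_tune, y_tune, cv=KFold(n_splits=5),
                        scoring="accuracy", return_train_score=True)
    demo_train_accs.append(cv["train_score"].mean())
    demo_valid_accs.append(cv["test_score"].mean())

for alpha, tr, va in zip(demo_alphas, demo_train_accs, demo_valid_accs):
    print(f"alpha={alpha:.2f}  train={tr:.3f}  valid={va:.3f}")
```

A large gap between training and validation accuracy at small alpha values is the usual sign of overfitting that pruning is meant to reduce.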

In [None]:
from sklearn.metrics import accuracy_score

# np.arange generates an array that starts at the minimum, stops before the maximum, and increments by step
alpha_values = np.arange(0, 0.035, 0.002)

# two lists to hold our accuracy
k = 5
valid_accs = []
train_accs = []

# Put your solution here!



plt.plot(alpha_values, valid_accs, color='red')
plt.plot(alpha_values, train_accs, color='blue')
plt.xlabel("Post Pruning Alpha")
plt.ylabel(f'Average Accuracy of {k}-fold validation')

The following code selects the alpha value for the best model. Your job is then to train a new model (using all of the training data) with your best hyperparameter value, and evaluate it on the test dataset. What are the accuracy, precision, recall, and F1 score?

In [None]:
# Take the alpha for the model with the best accuracy on the *validation* set!
best_alpha = alpha_values[np.argmax(valid_accs)]
best_alpha
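
Once a value has been selected, the retrain-and-evaluate step could look like this sketch. It runs on synthetic data, and `demo_alpha` is a placeholder standing in for whatever `best_alpha` turns out to be; the macro averaging is also just one reasonable choice.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and split (not the workshop data).
X_all, y_all = make_classification(n_samples=400, n_features=4, flip_y=0.05, random_state=25)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_all, y_all, test_size=0.2, random_state=25)

demo_alpha = 0.01  # placeholder for the selected hyperparameter value

# Refit on ALL of the training data, then touch the held-out data exactly once.
final_tree = DecisionTreeClassifier(random_state=25, ccp_alpha=demo_alpha).fit(X_fit, y_fit)
y_hold_pred = final_tree.predict(X_hold)

final_acc = accuracy_score(y_hold, y_hold_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_hold, y_hold_pred, average="macro", zero_division=0
)
print(f"accuracy={final_acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```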

In [None]:
from sklearn.tree import plot_tree

# Train your model here. You may want to print the tree using plot_tree
# SOLUTION


In [None]:
# Now evaluate your model on the test dataset - what are the evaluation metrics?
# SOLUTION


# 3) ROC Curves

Sklearn has some built-in methods for [plotting ROC curves](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html), such as `RocCurveDisplay`.

## 3.1) Plotting ROC Curves (Group)

In [None]:
# First make an ROC curve for the model you selected with HP tuning
from sklearn.metrics import RocCurveDisplay

gini_tree = DecisionTreeClassifier(criterion = "gini", random_state=random_seed, ccp_alpha=best_alpha).fit(X=X_train.values, y=Y_train.values)
RocCurveDisplay.from_estimator(gini_tree,X_test.values,Y_test.values)

In [None]:
# Now, make an ROC curve with an AdaBoostClassifier with n_estimators=100
# Store the fitted model in a variable named `ada`; the combined plot in the next cell uses it

#SOLUTION


In [None]:
plt.figure(0).clf()

# When predicting, we have to ask for *continuous* values, not 0/1, so we use predict_proba
# We use [:,1] to get the predictions for the positive class
tree_predictions = gini_tree.predict_proba(X_test.values)[:,1]
fpr, tpr, thresh = metrics.roc_curve(Y_test.values, tree_predictions)
auc = metrics.roc_auc_score(Y_test.values, tree_predictions)
plt.plot(fpr,tpr,label="Decision Tree, auc="+str(auc))

adaboost_predictions = ada.predict_proba(X_test.values)[:,1]
fpr, tpr, thresh = metrics.roc_curve(Y_test.values, adaboost_predictions)
auc = metrics.roc_auc_score(Y_test.values, adaboost_predictions)
plt.plot(fpr,tpr,label="Adaboost, auc="+str(auc))

plt.legend(loc=0)
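
To see what `roc_curve` and `roc_auc_score` compute, it can help to run them on four hand-made scores (the `toy_*` values below are invented for illustration). Each threshold turns the continuous scores into 0/1 predictions, which yields one (false positive rate, true positive rate) point on the curve.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

toy_labels = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of class 1

# One (fpr, tpr) point per threshold, from strictest to most lenient.
toy_fpr, toy_tpr, toy_thresh = roc_curve(toy_labels, toy_scores)
toy_auc = roc_auc_score(toy_labels, toy_scores)

for f, t in zip(toy_fpr, toy_tpr):
    print(f"fpr={f:.2f}  tpr={t:.2f}")
print("AUC:", toy_auc)
```

The AUC of 0.75 reflects that three of the four positive/negative pairs are ranked correctly: only the negative scored 0.4 outranks the positive scored 0.35.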

## 3.2) Interpreting ROC Curves (Group)

Take a look at the above ROC curves. How are they similar? How do they differ? Is one strictly better than the other? In what situations is one better than the other? Discuss with your group.

**Take notes of your discussion here.**