In [1]:
# preamble to be able to run notebooks in Jupyter and Colab
try:
    from google.colab import drive
    import sys
    
    drive.mount('/content/drive')
    notes_home = "/content/drive/Shared drives/CSC310/notes/"
    user_home = "/content/drive/My Drive/"
    
    sys.path.insert(1,notes_home) # let the notebook access the notes folder

except ModuleNotFoundError:
    notes_home = "" # running native Jupyter environment -- notes home is the same as the notebook
    user_home = ""  # under Jupyter we assume the user directory is the same as the notebook

# Classification Confidence Intervals

**Observation:** No matter how careful we are with our model evaluation techniques, there remains a fundamental uncertainty about how well our training data represents our (possibly infinite) data universe. This uncertainty can be observed during cross-validation: simply partitioning the training data in different ways gives rise to markedly different model accuracies. Here is the Iris example again from the previous notebook.

In [2]:
# cross-validation Iris
import pandas as pd
import numpy as np
np.set_printoptions(formatter={'float_kind':"{:3.2f}".format})
from sklearn import tree
# grab cross validation code
from sklearn.model_selection import cross_val_score

# get data
df = pd.read_csv(notes_home+"assets/iris.csv")
X  = df.drop(['id','Species'],axis=1)
y = df['Species']

# set up the model
model = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2)

# do the 5-fold cross validation
scores = cross_val_score(model, X, y, cv=5)
print("Fold Accuracies: {}".format(scores))

Fold Accuracies: [0.93 0.97 0.90 0.87 1.00]


This uncertainty carries over into our model evaluation. If our training data is a poor representation of the data universe, then the models we construct from it will generalize poorly to the rest of the data universe. If our training data is a good representation of the data universe, then we can expect our model to generalize well.
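As a rough illustration of how much the fold accuracies vary, the following sketch summarizes the scores printed above. The values are copied by hand from that output, so they are rounded to two decimals; the variable names are chosen here for illustration only.

```python
import numpy as np

# fold accuracies copied (rounded) from the cross-validation output above
scores = np.array([0.93, 0.97, 0.90, 0.87, 1.00])

mean = scores.mean()                    # average accuracy over the folds
std = scores.std(ddof=1)                # sample standard deviation of the folds
spread = scores.max() - scores.min()    # gap between best and worst fold

print("Mean: {:3.2f}  Std: {:3.2f}  Range: {:3.2f}".format(mean, std, spread))
```

The gap between the best and the worst fold comes purely from how the same data was partitioned.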

Here we will deal with this uncertainty using *confidence intervals*.
First, let us define confidence intervals formally. Given a model accuracy *acc*, a confidence interval is a range between some lower bound *lb* and some upper bound *ub* such that the model accuracy lies within that range with probability *p*,

> $Pr(lb \le acc \le ub) = p.$

Paraphrasing this equation with *p = 95%*:

> We are 95% sure that our model accuracy is not worse than *lb* and not better than *ub*.


Ultimately we are interested in the lower and upper bounds of the 95% confidence interval. We can use the following formulas to compute the bounds:

> $ub = acc + 1.96 \sqrt{\frac{acc (1 - acc)}{n}}$

> $lb = acc - 1.96 \sqrt{\frac{acc (1 - acc)}{n}}$

Here, *n* is the number of observations in the testing dataset used to estimate *acc*. The constant 1.96 is called the *z-score* and expresses the fact that we are computing the 95% confidence interval.
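To make the formula concrete, here is a minimal sketch that implements it directly. The function name `confint_sketch` and its `level` parameter are hypothetical and are not part of the course's `assets.confint` module. The sketch also derives the z-score from the standard normal distribution instead of hard-coding 1.96.

```python
import math
from statistics import NormalDist

def confint_sketch(acc, n, level=0.95):
    """Normal-approximation confidence interval for an accuracy (hypothetical helper)."""
    # two-sided z-score for the requested level; about 1.96 for 95%
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

lb, ub = confint_sketch(0.88, 100)
print("Accuracy: 0.88 ({:3.2f},{:3.2f})".format(lb, ub))
```

Note that the interval shrinks as *n* grows: the more test observations we have, the more certain we are about the accuracy estimate.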

## Classification Confidence Intervals in Python

Let's do a simple example using the function `classification_confint`.

In [3]:
from assets.confint import classification_confint

observations = 100
acc = .88
lb,ub = classification_confint(acc,observations)
print('Accuracy: {} ({:3.2f},{:3.2f})'.format(acc,lb, ub))

Accuracy: 0.88 (0.82,0.94)


Now, let's do an actual example using the Wisconsin breast cancer dataset. We want to print out the testing accuracy together with its 95% confidence interval.

In [4]:
import pandas as pd
from assets.treeviz import tree_print
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from assets.confint import classification_confint

# read the data
df = pd.read_csv(notes_home+"assets/wdbc.csv")

# set up the feature matrix and target vector
X  = df.drop(['ID','Diagnosis'],axis=1)
y = df['Diagnosis']

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=2)

# set up the tree model object - limit the complexity to put us somewhere in the middle of the graph.
model = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=1)

# fit the model on the training set of data
model.fit(X_train, y_train)

# Test results: evaluate the model on the testing set of data
y_test_model = model.predict(X_test)
acc = accuracy_score(y_test, y_test_model)
observations = X_test.shape[0]
lb,ub = classification_confint(acc, observations)
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

Accuracy: 0.91 (0.86,0.96)


# Regression Confidence Intervals

When performing regression we use the $R^2$ score to examine the quality of our models. Given that we only use a small training dataset for fitting the model compared to the rest of the data universe, it is only natural to ask what the 95% confidence interval for this score might be. We have a formula for that, although it is not as straightforward as the confidence interval for classification,

> $lb = R^2 - 2\sqrt{\frac{4R^{2}(1-R^{2})^{2}(n-k-1)^{2}}{(n^2 - 1)(n+3)}}$

> $ub = R^2 + 2\sqrt{\frac{4R^{2}(1-R^{2})^{2}(n-k-1)^{2}}{(n^2 - 1)(n+3)}}$

Here, *n* is the number of observations in the validation/testing dataset and *k* is the number of independent variables.
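The formula can be sketched directly in Python as follows. The function name `regression_confint_sketch` is hypothetical and stands apart from the course's `regression_confint`; it simply evaluates the two expressions above.

```python
import math

def regression_confint_sketch(rs_score, n, k):
    """Approximate 95% interval for R^2 using the formula above (hypothetical helper)."""
    numerator = 4 * rs_score * (1 - rs_score) ** 2 * (n - k - 1) ** 2
    denominator = (n ** 2 - 1) * (n + 3)
    half_width = 2 * math.sqrt(numerator / denominator)
    return rs_score - half_width, rs_score + half_width

# R^2 of 0.75 on 100 observations with 4 independent variables
lb, ub = regression_confint_sketch(0.75, 100, 4)
print("R^2 Score: 0.75 ({:3.2f}, {:3.2f})".format(lb, ub))
```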

In [5]:
from assets.confint import regression_confint

rs_score = .75
observations = 100
variables = 4 # independent variables

lb,ub = regression_confint(rs_score, observations, variables)
print("R^2 Score: {:3.2f} ({:3.2f}, {:3.2f})".format(rs_score,lb,ub))

R^2 Score: 0.75 (0.67, 0.83)


Let's look at an actual regression problem and compute the $R^2$ score and its 95% confidence interval. We will use the cars problem from before.

In [6]:
import numpy as np
import pandas
from sklearn.tree import DecisionTreeRegressor
from assets.confint import regression_confint

# get our dataset
cars_df = pandas.read_csv(notes_home+"assets/cars.csv")

# build model object
model = DecisionTreeRegressor(max_depth=None)

# fit model
# We have to reshape the values array to make 'fit' happy because
# the array only has a single feature
model.fit(cars_df['speed'].values.reshape(-1,1),cars_df['dist'])

# R^2 score
rs_score = model.score(cars_df['speed'].values.reshape(-1,1),cars_df['dist'])
observations = cars_df.shape[0]
variables = 1
lb,ub = regression_confint(rs_score, observations, variables)

# print out R^2 score with its 95% confidence interval
print("R^2 Score: {:3.2f} ({:3.2f}, {:3.2f})".format(rs_score,lb,ub))

R^2 Score: 0.79 (0.69, 0.89)


# Statistical Significance

Besides giving us an idea of the uncertainty of our model, the 95% confidence intervals also have something to say about the significance of the scores of different models. That is, if the confidence intervals of two different models evaluated on the same dataset overlap, then the difference in their performance is not statistically significant.

Consider the following,

In [7]:
from assets.confint import classification_confint

observations = 100

# first classifier
acc1 = .88
lb1,ub1 = classification_confint(acc1,observations)
print('Accuracy: {} ({:3.2f},{:3.2f})'.format(acc1,lb1, ub1))

# second classifier
acc2 = .92
lb2,ub2 = classification_confint(acc2,observations)
print('Accuracy: {} ({:3.2f},{:3.2f})'.format(acc2,lb2, ub2))

Accuracy: 0.88 (0.82,0.94)
Accuracy: 0.92 (0.87,0.97)


Even though the second classifier has a better raw accuracy, the confidence intervals of the two classifiers overlap. The first classifier could potentially have an accuracy of .94, which is better than the raw accuracy of the second classifier. Conversely, the confidence interval of the second classifier tells us that it could potentially have an accuracy of .87, which is worse than the raw accuracy of the first classifier. For this reason we say that the difference in accuracy of two classifiers is not statistically significant if their confidence intervals overlap.
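The overlap test itself is a one-line comparison. The helper below is a hypothetical sketch, not part of the course assets; it treats both intervals as closed ranges.

```python
def intervals_overlap(lb1, ub1, lb2, ub2):
    """Return True if the closed intervals [lb1, ub1] and [lb2, ub2] share a point."""
    return lb1 <= ub2 and lb2 <= ub1

# intervals of the two classifiers from the example above
print(intervals_overlap(0.82, 0.94, 0.87, 0.97))

# a made-up pair of intervals that do not overlap
print(intervals_overlap(0.82, 0.86, 0.90, 0.97))
```

The first call reports an overlap, so the difference between the two classifiers is not statistically significant; the second, made-up pair would indicate a significant difference.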

# Team Exercise

Using the `abalone.csv` dataset from the `assets` folder do the following:

* Find the best decision tree for the dataset using grid search with 5-fold cross-validation. Recall that the decision tree has two free parameters: criterion and tree depth.

* Apply the best decision tree you have found to the whole training dataset and compute its accuracy on that dataset.

* Compute the 95% confidence interval of the accuracy computed in the previous step.

* Build a decision tree with `max_depth=1` on the whole training dataset. Compute its accuracy and 95% confidence interval on that training data.

* Build a decision tree with `max_depth=None` on the whole training dataset. Compute its accuracy and 95% confidence interval on that training data.

* Do the confidence intervals of all three decision trees overlap? If so, what are the conclusions you can draw?
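As a starting point, the mechanics of the first three steps might look like the sketch below. It deliberately uses scikit-learn's bundled Iris data as a stand-in rather than `abalone.csv`, the depth range of 1 to 10 is an arbitrary assumption, and the interval is computed inline from the classification formula given earlier instead of calling `classification_confint`.

```python
import math
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# stand-in data: scikit-learn's bundled Iris set, not abalone.csv
X, y = load_iris(return_X_y=True)

# search over the two free parameters with 5-fold cross-validation
param_grid = {"criterion": ["gini", "entropy"], "max_depth": list(range(1, 11))}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)

# by default GridSearchCV refits the best model on all of X, y
best_model = grid.best_estimator_
acc = best_model.score(X, y)

# 95% interval from the classification formula given earlier
half_width = 1.96 * math.sqrt(acc * (1 - acc) / len(y))
print("Best parameters: {}".format(grid.best_params_))
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc, acc - half_width, acc + half_width))
```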



## Teams

```
team 1:  Kenney A, Stephanie, Phidias
team 2:  Korakot, Julio, Camren Joseph
team 3:  Michael Russell, Sofia R, Jared P
team 4:  Emmely, C.J., Jaeke R
team 5:  Luca G, Evan Jonathan, Shannon Patrice
team 6:  Yeury, Timothy Terence, Cody Rithysan
team 7:  Joshua Patrick, Patrick M, John Francis
team 8:  Samantha N, Cole, Andrew Michael
team 9:  Jake Adam, Timothy, Hennjer
team 10: Zachary T, Giulia, Tony Levada
team 11: William Jordan, Dan Steven, Joshua D
team 12: Joey, Ryan Richard
```