In [1]:
###### Config #####
import sys, os, platform
if os.path.isdir("ds-assets"):
  !cd ds-assets && git pull
else:
  !git clone https://github.com/lutzhamel/ds-assets.git
colab = 'google.colab' in sys.modules
system = platform.system() # "Windows", "Linux", "Darwin"
home = "ds-assets/assets/"
sys.path.append(home)  

Already up to date.


In [2]:
# notebook level imports
import pandas as pd
import numpy as np
import dsutils
np.set_printoptions(formatter={'float_kind':"{:3.2f}".format})
from sklearn import tree
from sklearn import model_selection
from sklearn import metrics

# Model Building and Uncertainty

Building models carries with it a **certain amount of uncertainty**.
Recall that machine learning is an inductive activity: we learn from examples and try to generalize by creating patterns/hypotheses/theories. In order to learn, we use datasets that represent **samples** from much
larger domains.  Recall the "black swan problem", where the overall domain of swans contains both white 
and black swans.  Because the white swans outnumber the black swans by a substantial margin, most samples "D" drawn from
the overall population "X" will, if we are not careful, contain only white swans, as can be seen in the figure below.

<center>
<img 
  src="https://raw.githubusercontent.com/lutzhamel/ds-assets/main/assets/black-swans.png"  
  height="200" 
  width="240">
</center>

This means that if we learn from those samples, we will come to the incorrect conclusion that "all swans are white".

What this example illustrates is that the quality of our model is very much dependent on the quality 
of the data samples.  Unfortunately, in most cases the machine learning practitioner has no control over
the construction of the data samples. 
How well the sample represents the domain is therefore a constant source of uncertainty when building models.  We can actually observe this uncertainty even in our simple iris dataset.

In [3]:
df = pd.read_csv(home+"iris.csv")
X  = df.drop(['id','Species'],axis=1)
y = df['Species']

We use train-test splits to build models and report the testing accuracy.  We do this five times, each time randomly splitting the iris data into a training set and a test set.

In [4]:
for i in range(5):
   model = tree.DecisionTreeClassifier(max_depth=3)
   (X_train, X_test, y_train, y_test) = \
      model_selection.train_test_split(X, 
                                       y, 
                                       train_size=0.7, 
                                       test_size=0.3)
   model.fit(X_train, y_train)
   y_test_model = model.predict(X_test)
   print("Accuracy {}: {:3.2f}"
         .format(i,metrics.accuracy_score(y_test, y_test_model)))

Accuracy 0: 0.98
Accuracy 1: 0.93
Accuracy 2: 0.93
Accuracy 3: 0.96
Accuracy 4: 0.89


* Notice the impact the random splits have on the testing accuracy.  
* Some splits give rise to good models and some splits not so much.  
    
> **Each split can be seen as randomly sampling a train and a test set from the original domain of all iris flowers**. 

* Here we are directly observing the effects of the uncertainty due to the data samples.

* This uncertainty is reflected in our models. 
   * If our data is a poor representation of the domain, then the models we construct from it will generalize poorly. 
   * If our data is a good representation of the domain, then we can expect our model to generalize well.
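
The spread across the five runs can also be summarized numerically. The following is a minimal sketch that simply reuses the five accuracies printed above; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# testing accuracies copied from the five random splits printed above
accuracies = np.array([0.98, 0.93, 0.93, 0.96, 0.89])

mean_acc = accuracies.mean()                  # average over the five splits
spread = accuracies.max() - accuracies.min()  # best split minus worst split

print(f"mean accuracy: {mean_acc:.3f}")
print(f"range (max - min): {spread:.2f}")
```

The gap between the best and the worst split comes from nothing but the random partitioning of the same 150 flowers.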


# Classification Confidence Intervals

We use **confidence intervals** in order to quantify the uncertainty discussed above in our model evaluations.

Let us define confidence intervals formally. 
Given a model accuracy **acc**, a confidence interval is a pair of bounds, a lower bound **lb** and an upper bound **ub**, such that **acc** lies between them with probability **p**,

$$
Pr(lb \le acc \le ub) = p.
$$

Paraphrasing this equation with *p = 95%*:

> We are 95% sure that our model accuracy **acc** is not worse than **lb** and not better than **ub**.


Ultimately we are interested in the lower and upper bounds of the 95% confidence interval.  We can use the following formulas to compute the bounds:

$$ub = acc + 1.96 \sqrt{\frac{acc (1 - acc)}{n}}$$

$$lb = acc - 1.96 \sqrt{\frac{acc (1 - acc)}{n}}$$

Here, *n* is the number of observations in the testing dataset used to estimate *acc*. The constant 1.96 is called the *z-score* and expresses the fact that we are computing the 95% confidence interval.

Notice that as we let $n \rightarrow \infty$ both the upper bound and the lower bound tend towards the accuracy.  That is, as we test the model on more and more testing points, we become more and more confident that the given accuracy is the correct accuracy.
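
The two bounds are easy to compute by hand. Below is a minimal sketch of the formulas above. The function name `accuracy_confidence_interval` is hypothetical, and clipping the bounds to the range [0, 1] is an assumption made here because an accuracy cannot leave that range; it is not part of the formulas themselves.

```python
import math

def accuracy_confidence_interval(acc, n, z=1.96):
    """Normal-approximation interval for an accuracy measured on n test points."""
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    # assumption: clip to [0, 1] because an accuracy cannot leave that range
    lb = max(0.0, acc - half_width)
    ub = min(1.0, acc + half_width)
    return lb, ub

# the same accuracy becomes more certain as the test set grows
for n in (45, 150, 10000):
    lb, ub = accuracy_confidence_interval(0.96, n)
    print(f"n = {n:5d}: ({lb:.2f}, {ub:.2f})")
```

The accuracy of 0.96 and the three test set sizes are illustrative values. The interval narrows around the accuracy as *n* grows, as described above.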

## Example

Let's do an actual example using our iris dataset.  We want to print out the accuracy together with its 95% confidence interval. 

We construct a best model and test it.

In [5]:
depth_ceiling = tree.DecisionTreeClassifier(max_depth=None)\
   .fit(X,y).get_depth()
model = tree.DecisionTreeClassifier(random_state=2)
param_grid = {
    'max_depth': list(range(1,depth_ceiling+1)),               
    }
best_model = model_selection.GridSearchCV(model,param_grid).fit(X,y)

In [6]:
dsutils.acc_score(best_model, X, y, as_string=True)

'Accuracy: 0.97 (0.95, 1.00)'

# Regression Confidence Intervals

When performing regression we use the $R^2$ score to examine the quality of our models.  Given that we fit the model on only a small training dataset compared to the rest of the data universe, it is only natural to ask what the 95% confidence interval for this score might be.  There is a formula for that, although it is not as straightforward as the confidence interval for classification:

$$lb = R^2 - 2\sqrt{\frac{4R^{2}(1-R^{2})^{2}(n-k-1)^{2}}{(n^2 - 1)(n+3)}}$$

$$ub = R^2 + 2\sqrt{\frac{4R^{2}(1-R^{2})^{2}(n-k-1)^{2}}{(n^2 - 1)(n+3)}}$$

Here, *n* is the number of observations in the validation/testing dataset and *k* is the number of independent variables.
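
The two formulas translate directly into code. The following is a minimal sketch; the function name `r2_confidence_interval` and the example values passed to it are assumptions chosen for illustration, not values taken from a fitted model.

```python
import math

def r2_confidence_interval(r2, n, k):
    """Approximate 95% interval for R^2: n observations, k independent variables."""
    numerator = 4.0 * r2 * (1.0 - r2) ** 2 * (n - k - 1) ** 2
    denominator = (n ** 2 - 1) * (n + 3)
    std_err = math.sqrt(numerator / denominator)
    # two standard errors on each side, as in the formulas above
    return r2 - 2.0 * std_err, r2 + 2.0 * std_err

# illustrative values (assumed): R^2 = 0.79, n = 50 observations, k = 1 variable
lb, ub = r2_confidence_interval(0.79, n=50, k=1)
print(f"R^2 Score: 0.79 ({lb:.2f}, {ub:.2f})")
```

Unlike the classification sketch, the bounds here are not clipped, so for scores very close to 1 on small datasets the upper bound can exceed 1.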

Let's look at an actual regression problem and compute the $R^2$ score and its 95% confidence interval. We will use the cars problem from before.

In [7]:
# get our dataset
cars_df = pd.read_csv(home+"cars.csv")
X = cars_df[['speed']]
y = cars_df['dist']

In [8]:
# build a regression model 
model = tree.DecisionTreeRegressor().fit(X,y)

In [9]:
dsutils.rs_score(model, X, y, as_string=True)

'R^2 Score: 0.79 (0.69, 0.89)'

# Statistical Significance

Besides giving us an idea of the uncertainty of our model, the 95% confidence intervals also tell us something about the significance of score differences between models:

> If the confidence intervals overlap then the difference in model performance of two different models on the same dataset is **not statistically significant**.
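
The overlap test itself takes only a few lines. This is a minimal sketch; the helper name `intervals_overlap` is hypothetical, and the intervals passed to it are illustrative `(lb, ub)` pairs rather than the result of fitting a model here.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two (lb, ub) intervals share at least one point."""
    lb_a, ub_a = ci_a
    lb_b, ub_b = ci_b
    return lb_a <= ub_b and lb_b <= ub_a

# illustrative intervals, not computed from a fitted model
print(intervals_overlap((0.90, 1.00), (0.98, 1.00)))  # overlapping pair
print(intervals_overlap((0.57, 0.60), (0.52, 0.55)))  # disjoint pair
```

Under the rule above, the first pair indicates a difference that is not statistically significant, while the second pair indicates a significant one.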



## A Worked Example

Here we use a real-world dataset in which we try to predict the sex of abalone given a set of parameters.
First we construct the optimal model, then we construct a tree with minimal complexity for the same dataset, and finally we compare their performance using statistical significance.

The optimal tree first.

In [10]:
# get the abalone data
df = pd.read_csv(home+"abalone.csv")

Some basic descriptive statistics:

In [11]:
print(df.shape)
df.head()

(4177, 9)


Unnamed: 0,sex,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [12]:
df[['sex']].value_counts()

sex
M      1528
I      1342
F      1307
Name: count, dtype: int64

Build our models.

In [13]:
# construct our data matrices
X  = df.drop(columns=['sex'])
y = df['sex']

We construct our optimal tree first.

In [14]:
# optimal tree
depth_ceiling = tree.DecisionTreeClassifier(max_depth=None)\
    .fit(X,y)\
    .get_depth()
model = tree.DecisionTreeClassifier()
param_grid = {'max_depth': list(range(1,depth_ceiling+1))}
best_model = model_selection\
    .GridSearchCV(model, param_grid)\
    .fit(X,y)

In [15]:
dsutils.acc_score(best_model, X, y, as_string=True)

'Accuracy: 0.59 (0.57, 0.60)'

Now we construct the minimal tree with max depth of 2.  We chose two because at minimum we need two nested if-then-else statements in order to distinguish three different labels.
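
The depth argument can be checked with a little arithmetic: a binary tree of depth *d* has at most $2^d$ leaves, and each label needs at least one leaf of its own. The sketch below uses illustrative variable names and only restates that reasoning.

```python
import math

n_labels = 3  # the three values of sex: M, F and I

# a binary tree of depth d has at most 2**d leaves,
# so we need the smallest d with 2**d >= n_labels
min_depth = math.ceil(math.log2(n_labels))
print(min_depth)
```

A tree of depth 1 has only two leaves and could never predict more than two of the three labels.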

In [16]:
# minimal complexity tree: depth 2
# create our model object
model = tree\
   .DecisionTreeClassifier(max_depth=2)\
   .fit(X,y)


In [17]:
dsutils.acc_score(model, X, y, as_string=True)

'Accuracy: 0.54 (0.52, 0.55)'

**Observation**: The confidence intervals are not overlapping, therefore **the performance difference is statistically significant**! That means the optimal model indeed performs better than the minimal tree.

## Train-Test vs. Refit Scores

Let us show that there is no statistically significant difference between the testing score computed with train-test partitions and the refit score.  We'll use the iris dataset to do this.

In [18]:
df = pd.read_csv(home+"iris.csv")
X  = df.drop(['id','Species'],axis=1)
y = df['Species']

In [19]:
# max depth of tree
depth_ceiling = tree.DecisionTreeClassifier(max_depth=None)\
    .fit(X,y)\
    .get_depth()

# prototype model
model = tree.DecisionTreeClassifier()

# parameter grid for our searches
param_grid = {'max_depth': list(range(1,depth_ceiling+1))}

First, find the best model using train-test partitions.

In [20]:
(X_train, X_test, y_train, y_test) = model_selection\
   .train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=1)

best_model = model_selection\
    .GridSearchCV(model, param_grid)\
    .fit(X_train,y_train)

acc,lb,ub = dsutils.acc_score(best_model,X_test,y_test)
print(f"Train-Test Accuracy: {acc:.2f} ({lb:.2f}, {ub:.2f})")


Train-Test Accuracy: 0.96 (0.90, 1.00)


Now, find the best model using the whole dataset and evaluate using the refit score.

In [22]:
best_model = model_selection\
    .GridSearchCV(model, param_grid)\
    .fit(X,y)
acc,lb,ub = dsutils.acc_score(best_model,X,y)
print(f"Full Data Accuracy: {acc:.2f} ({lb:.2f}, {ub:.2f})")

Full Data Accuracy: 0.99 (0.98, 1.00)


**Observation**: The two confidence intervals overlap, so the score difference between the two evaluation methods is **not** statistically significant.  Therefore we can use either one to find and evaluate our best models.

# Project

Please see BrightSpace for project #3

# Midterm

The midterm will cover everything up to and including the material in project #3