# Statistical Significance Testing for Different Classifiers
When comparing two or more machine learning classifiers, performance is usually reported with standard measures such as accuracy, precision, recall and F1-score, either per class or over all classes. An argument that one classifier is better than another is stronger if the difference can also be shown to be *statistically significant*.

Statistical significance testing in ML applications usually means determining whether there is sufficient confidence to *reject the null hypothesis that there is no difference in the error distributions* of two models. The null hypothesis is rejected when the p-value of the test is at or below a chosen alpha threshold (e.g. p<=0.05, p<=0.01 or p<=0.001).
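
As a minimal sketch of that decision rule, the snippet below compares a p-value against the three conventional thresholds. The name `significance_flags` is hypothetical and is not part of any library or of the helper module used later in this notebook.

```python
def significance_flags(p_value, alphas=(0.05, 0.01, 0.001)):
    """Return one boolean per alpha: True where H0 would be rejected.

    Hypothetical helper for illustration only.
    """
    return tuple(p_value <= alpha for alpha in alphas)


print(significance_flags(0.03))    # rejected at 0.05 only
print(significance_flags(0.0002))  # rejected at all three thresholds
```

Reporting all three flags at once makes it easy to state the strictest threshold at which a difference still holds.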

Below are classification predictions from two different models, followed by a significance test on their accuracy to check whether there is a statistically significant difference between the two models.

# Example 1: using sklearn's NLP tutorial 'Working with Text Data' to determine whether one classifier is statistically significantly better than the other

In [1]:
import sys
sys.path.append("../../")

In [2]:
from machine_learning_utils.evaluation.significance_testing import calculate_mcnemar_test

In [3]:
import numpy as np
import sklearn
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn import metrics

In [4]:
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

In [5]:
#Loading the 20 Newsgroups dataset
from sklearn.datasets import fetch_20newsgroups
from matplotlib import pyplot as plt

In [6]:
# with a random seed, always keep it the same number each time
# for reproducibility (here 42 (=the meaning of life...))
twenty_train = fetch_20newsgroups(subset='train',categories=categories, 
                                  shuffle=True, random_state=42)

In [7]:
#fetch_20newsgroups puts the data in the .data attribute
len(twenty_train.data)

2257

In [8]:
# Extracting features from text data
# Make sure you read the part of the tutorial/lecture about the bags of words
# representation
# A vectorizer is used to extract features from each item in the dataset


# create a count vectorizer, which by default does some pre-processing
# tokenize (into single words/unigrams) + lower-casing
# to change these default settings look at the sklearn documentation
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

In [9]:
# First system: a Naive Bayes classifier
# Training a multinomial NB classifier (it models word counts with a
# multinomial distribution and handles more than 2 classes)

clf = MultinomialNB().fit(X_train_counts, twenty_train.target)

In [10]:
# Testing on a toy dataset
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
predicted = clf.predict(X_new_counts)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


In [11]:
# A Pipeline is an object that can chain feature extraction, (optional)
# weighting and classification in one go; make sure you know what each part does

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('clf', MultinomialNB()),
                    ])

In [12]:
# Proper testing on the 20newsgroups test set (restricted to the 4 categories above)
twenty_test = fetch_20newsgroups(subset='test', categories=categories, 
                                 shuffle=True, random_state=42)
docs_test = twenty_test.data
text_clf.fit(twenty_train.data, twenty_train.target)
predicted_nb = text_clf.predict(docs_test)

In [13]:
# first 10 predictions from the NB:
predicted_nb[:10]

array([2, 2, 2, 0, 3, 0, 1, 3, 2, 2])

In [14]:
# Using the metrics package
# Get a classification report to see overall and per-class performance 
print(metrics.classification_report(twenty_test.target, predicted_nb,
                                    target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.92      0.90      0.91       319
         comp.graphics       0.95      0.95      0.95       389
               sci.med       0.96      0.91      0.93       396
soc.religion.christian       0.91      0.97      0.94       398

              accuracy                           0.93      1502
             macro avg       0.93      0.93      0.93      1502
          weighted avg       0.93      0.93      0.93      1502



In [15]:
# System 2: logistic regression

In [16]:
text_clf_2 = Pipeline([('vect', CountVectorizer()),
                     ('clf', (LogisticRegression())),
                    ])

In [17]:
text_clf_2.fit(twenty_train.data, twenty_train.target)
# NOTE: despite its name, predicted_svm holds the logistic regression predictions
predicted_svm = text_clf_2.predict(docs_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [18]:
print(metrics.classification_report(twenty_test.target, predicted_svm,
                                    target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.92      0.81      0.86       319
         comp.graphics       0.86      0.94      0.90       389
               sci.med       0.92      0.82      0.87       396
soc.religion.christian       0.88      0.98      0.93       398

              accuracy                           0.89      1502
             macro avg       0.90      0.89      0.89      1502
          weighted avg       0.89      0.89      0.89      1502



## Comparing the two sets of predictions against the ground truth and calculating significance using McNemar's test
While Naive Bayes appears to outperform Logistic Regression, with higher overall accuracy and F-score, are the two models' error distributions significantly different? For this we use McNemar's test applied to machine learning, as per:

    Dietterich, Thomas G. "Approximate statistical tests for comparing supervised classification learning algorithms." Neural computation 10, no. 7 (1998): 1895-1923.
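
To make the test concrete, here is a minimal sketch of an exact (binomial) McNemar test using only the standard library. The implementation of `calculate_mcnemar_test` is not shown in this notebook, so this is not that helper's code; `mcnemar_exact` and the toy data are hypothetical. Dietterich (1998) describes the chi-squared approximation with a continuity correction; the exact variant is used here instead because it needs no extra dependencies.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on two models' predictions.

    Returns (statistic, p_value), where statistic = min(b, c).
    Hypothetical helper for illustration only.
    """
    rows = list(zip(y_true, pred_a, pred_b))
    # b: items model A gets right and model B gets wrong
    b = sum(1 for t, pa, pb in rows if pa == t and pb != t)
    # c: items model A gets wrong and model B gets right
    c = sum(1 for t, pa, pb in rows if pa != t and pb == t)
    n = b + c
    k = min(b, c)
    if n == 0:
        return 0, 1.0  # the models never disagree on correctness
    # Under H0 the discordant items split 50/50 between b and c,
    # so the p-value is twice the lower binomial tail, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return k, min(1.0, 2 * tail)


# Toy data: model A is always right, model B is wrong on the first 6 items
y_true = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
pred_a = list(y_true)
pred_b = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]

statistic, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print("statistic=%d, p-value=%.5f" % (statistic, p_value))
```

Only the items on which exactly one model is correct enter the test; items that both models get right or both get wrong carry no information about which model is better.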

In [19]:
print(twenty_test.target_names)
calculate_mcnemar_test(twenty_test.target, predicted_svm, predicted_nb)

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
statistic=51.000, p-value=0.000
Different proportions of errors (reject H0)


(51.0, 0.00025724063479780124, True, True, True)

Overall, there is a statistically significant difference in error distributions: the null hypothesis that the two error distributions are the same is rejected at p<=0.05, p<=0.01 and p<=0.001. Now let's check whether that's the case for every individual class.

In [20]:
# check the significance for each class in a one-vs-rest fashion:
for i, target in enumerate(twenty_test.target_names):
    print(target)
    result = calculate_mcnemar_test(twenty_test.target, predicted_svm, predicted_nb, target_class=i)
    print(result)
    print("*" * 30)

alt.atheism
statistic=37.000, p-value=0.278
Same proportions of errors (fail to reject H0)
(37.0, 0.2779992890805517, False, False, False)
******************************
comp.graphics
statistic=28.000, p-value=0.007
Different proportions of errors (reject H0)
(28.0, 0.007275502036308492, True, True, False)
******************************
sci.med
statistic=27.000, p-value=0.000
Different proportions of errors (reject H0)
(27.0, 1.4766951457590834e-05, True, True, True)
******************************
soc.religion.christian
statistic=26.000, p-value=0.207
Same proportions of errors (fail to reject H0)
(26.0, 0.20736789985685422, False, False, False)
******************************


As can be seen, while there is an overall difference in performance between the two classifiers, with Naive Bayes outperforming Logistic Regression, the difference is not statistically significant for the alt.atheism and soc.religion.christian classes when taking p<=0.05 as the significance threshold. For comp.graphics the difference is significant at p<=0.01 but not at p<=0.001, and for sci.med it is significant at all three thresholds.
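
One way a one-vs-rest comparison could be set up is to reduce the gold labels and both sets of predictions to binary "is this class / is not this class" decisions before counting disagreements. How `target_class` is handled inside `calculate_mcnemar_test` is not shown here, so the sketch below is an assumption; `binarize` and the toy data are hypothetical.

```python
def binarize(labels, target_class):
    """Map multi-class labels to True (target class) or False (rest)."""
    return [label == target_class for label in labels]


# Toy data with three classes (0, 1, 2)
y_true = [0, 1, 2, 1, 0, 2]
pred_a = [0, 1, 1, 1, 0, 2]
pred_b = [0, 2, 2, 1, 1, 2]

target = 1
gold = binarize(y_true, target)
a = binarize(pred_a, target)
b = binarize(pred_b, target)

# Discordant counts for the one-vs-rest decision on the target class:
# items where exactly one of the two models makes the right binary call
a_only = sum(1 for g, x, y in zip(gold, a, b) if x == g and y != g)
b_only = sum(1 for g, x, y in zip(gold, a, b) if x != g and y == g)
print(a_only, b_only)
```

The two discordant counts are what a McNemar test would then be run on for that class. Note that under this binarization a confusion between two non-target classes counts as a correct "rest" decision.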