# STATISTICS

# Patterns and relationships
# Multiple Hypothesis Testing

Libraries:
- pandas
- numpy
- scipy
- statsmodels

## Four Models Comparison

In this example we compare four classifiers that were evaluated on 14 datasets. For each dataset, the AUC of each classifier was computed.

The classifiers are _C4.5_ and three modifications of it: with the hyperparameter _m_ optimized, with the hyperparameter _cf_ optimized, and with both _m_ and _cf_ optimized simultaneously.

In [1]:
import pandas as pd
data = pd.read_csv('AUCs.txt', sep='\t')

In [2]:
data.head(10)

| | Unnamed: 0 | C4.5 | C4.5+m | C4.5+cf | C4.5+m+cf |
|---|---|---|---|---|---|
| 0 | adult (sample) | 0.763 | 0.768 | 0.771 | 0.798 |
| 1 | breast cancer | 0.599 | 0.591 | 0.59 | 0.569 |
| 2 | breast cancer wisconsin | 0.954 | 0.971 | 0.968 | 0.967 |
| 3 | cmc | 0.628 | 0.661 | 0.654 | 0.657 |
| 4 | ionosphere | 0.882 | 0.888 | 0.886 | 0.898 |
| 5 | iris | 0.936 | 0.931 | 0.916 | 0.931 |
| 6 | liver disorders | 0.661 | 0.668 | 0.609 | 0.685 |
| 7 | lung cancer | 0.583 | 0.583 | 0.563 | 0.625 |
| 8 | lymphography | 0.775 | 0.838 | 0.866 | 0.875 |
| 9 | mushroom | 1.0 | 1.0 | 1.0 | 1.0 |


In [3]:
data.describe()

| | C4.5 | C4.5+m | C4.5+cf | C4.5+m+cf |
|---|---|---|---|---|
| count | 14.0 | 14.0 | 14.0 | 14.0 |
| mean | 0.804929 | 0.820429 | 0.808786 | 0.827214 |
| std | 0.160187 | 0.158583 | 0.167566 | 0.154548 |
| min | 0.583 | 0.583 | 0.563 | 0.569 |
| 25% | 0.63625 | 0.6665 | 0.624 | 0.673 |
| 50% | 0.8285 | 0.863 | 0.876 | 0.8865 |
| 75% | 0.9505 | 0.96875 | 0.96025 | 0.96575 |
| max | 1.0 | 1.0 | 1.0 | 1.0 |


First, using the signed-rank test, let's compare every classifier with every other one pairwise and choose the two classifiers whose difference is the most statistically significant.

### 1. Mann-Whitney rank test

In [4]:
from scipy import stats

In [5]:
#H0: C4.5 = C4.5+m
#H1: C4.5 != C4.5+m

T12, p12 = stats.mannwhitneyu(data['C4.5'], data['C4.5+m'])
print('p-value 1-2: ', p12)

p-value 1-2:  0.3228781552767225


In [6]:
#H0: C4.5 = C4.5+cf
#H1: C4.5 != C4.5+cf

T13, p13 = stats.mannwhitneyu(data['C4.5'], data['C4.5+cf'])
print('p-value 1-3: ', p13)

p-value 1-3:  0.5


In [7]:
#H0: C4.5 = C4.5+m+cf
#H1: C4.5 != C4.5+m+cf

T14, p14 = stats.mannwhitneyu(data['C4.5'], data['C4.5+m+cf'])
print('p-value 1-4: ', p14)

p-value 1-4:  0.30660607679445023


In [8]:
#H0: C4.5+m = C4.5+cf
#H1: C4.5+m != C4.5+cf

T23, p23 = stats.mannwhitneyu(data['C4.5+m'], data['C4.5+cf'])
print('p-value 2-3: ', p23)

p-value 2-3:  0.33958876725007925


In [9]:
#H0: C4.5+m = C4.5+m+cf
#H1: C4.5+m != C4.5+m+cf

T24, p24 = stats.mannwhitneyu(data['C4.5+m'], data['C4.5+m+cf'])
print('p-value 2-4: ', p24)

p-value 2-4:  0.5


In [10]:
#H0: C4.5+cf = C4.5+m+cf
#H1: C4.5+cf != C4.5+m+cf

T34, p34 = stats.mannwhitneyu(data['C4.5+cf'], data['C4.5+m+cf'])
print('p-value 3-4: ', p34)

p-value 3-4:  0.3146960224969628
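
The scores are paired, because every classifier is evaluated on the same 14 datasets. The paired counterpart of the rank test above is the Wilcoxon signed-rank test, available as `scipy.stats.wilcoxon`. The following is a minimal sketch of how all six pairwise comparisons could be run in one loop. Since `AUCs.txt` is not reproduced here, the table `toy` is a randomly generated, hypothetical stand-in, so the printed p-values are not results for the real data.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the AUC table: 14 datasets x 4 classifiers.
# Replace `toy` with the real `data` frame to run the actual comparison.
rng = np.random.default_rng(0)
columns = ['C4.5', 'C4.5+m', 'C4.5+cf', 'C4.5+m+cf']
toy = pd.DataFrame(rng.uniform(0.55, 1.0, size=(14, 4)), columns=columns)

paired_p = {}
for a, b in combinations(columns, 2):
    # Wilcoxon signed-rank test on the per-dataset differences
    # (two-sided alternative by default).
    stat, p = stats.wilcoxon(toy[a], toy[b])
    paired_p[(a, b)] = p

for pair, p in paired_p.items():
    print(pair, round(p, 4))
```

With four classifiers, `combinations` yields exactly the six pairs tested cell by cell above.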


Comparing the 4 classifiers pairwise, we tested 6 null hypotheses H0 and could not reject any of them. Now let's apply a correction for multiple testing, starting with the Holm method.

### 2. Holm method

In [11]:
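# Collect the six raw p-values next to the names of the compared pairs.
# The variable is not called `dict`, which would shadow the Python built-in.
p_values_by_pair = {'pares': ['C4.5_C4.5m', 'C4.5_C4.5cf', 'C4.5_C4.5mcf', 'C4.5m_C4.5cf', 'C4.5m_C4.5mcf', 'C4.5cf_C4.5mcf'],
                    'p-value': [p12, p13, p14, p23, p24, p34]}
df = pd.DataFrame(data=p_values_by_pair)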

In [12]:
df

| | pares | p-value |
|---|---|---|
| 0 | C4.5_C4.5m | 0.322878 |
| 1 | C4.5_C4.5cf | 0.5 |
| 2 | C4.5_C4.5mcf | 0.306606 |
| 3 | C4.5m_C4.5cf | 0.339589 |
| 4 | C4.5m_C4.5mcf | 0.5 |
| 5 | C4.5cf_C4.5mcf | 0.314696 |


In [13]:
# multipletests is exposed in the public statsmodels.stats.multitest module
from statsmodels.stats.multitest import multipletests

In [14]:
reject, p_correct, a1, a2 = multipletests(df['p-value'], alpha=0.05, method='holm')

In [15]:
df['p_correct'] = p_correct
df['reject'] = reject

In [16]:
df

| | pares | p-value | p_correct | reject |
|---|---|---|---|---|
| 0 | C4.5_C4.5m | 0.322878 | 1.0 | False |
| 1 | C4.5_C4.5cf | 0.5 | 1.0 | False |
| 2 | C4.5_C4.5mcf | 0.306606 | 1.0 | False |
| 3 | C4.5m_C4.5cf | 0.339589 | 1.0 | False |
| 4 | C4.5m_C4.5mcf | 0.5 | 1.0 | False |
| 5 | C4.5cf_C4.5mcf | 0.314696 | 1.0 | False |


We still cannot reject any H0. Moreover, the adjusted p-values are larger than the raw ones in every case: all six are capped at 1.0.
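
To see why every adjusted value ends up at 1.0, the Holm step-down adjustment can be sketched by hand: sort the p-values in ascending order, multiply the i-th smallest by (m - i + 1), enforce monotonicity with a running maximum, and cap at 1. The function `holm_adjust` below is a hypothetical helper written for illustration, not part of statsmodels; the input values are the rounded p-values from the table above.

```python
import numpy as np

def holm_adjust(p_values):
    """Holm step-down adjusted p-values (illustrative helper)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of p-values, ascending
    scaled = (m - np.arange(m)) * p[order]     # multipliers m, m-1, ..., 1
    # Running maximum keeps the adjusted values non-decreasing; cap at 1.
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted          # restore the original order
    return adjusted

# Rounded raw p-values as shown in the table above
p_raw = [0.322878, 0.5, 0.306606, 0.339589, 0.5, 0.314696]
p_holm = holm_adjust(p_raw)
print(p_holm)
```

The smallest raw p-value, 0.306606, is multiplied by 6 and already exceeds 1, so the running maximum pushes every adjusted value to the cap.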

### 3. Benjamini–Hochberg method

In [17]:
reject, p_correct, a1, a2 = multipletests(df['p-value'], alpha=0.05, method='fdr_bh')

In [18]:
df['p_correct'] = p_correct
df['reject'] = reject

In [19]:
df

| | pares | p-value | p_correct | reject |
|---|---|---|---|---|
| 0 | C4.5_C4.5m | 0.322878 | 0.5 | False |
| 1 | C4.5_C4.5cf | 0.5 | 0.5 | False |
| 2 | C4.5_C4.5mcf | 0.306606 | 0.5 | False |
| 3 | C4.5m_C4.5cf | 0.339589 | 0.5 | False |
| 4 | C4.5m_C4.5mcf | 0.5 | 0.5 | False |
| 5 | C4.5cf_C4.5mcf | 0.314696 | 0.5 | False |


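As with the Holm method, we cannot reject any H0. However, the Benjamini–Hochberg method applies a milder correction for multiple testing than the Holm method: it controls the false discovery rate rather than the family-wise error rate, so the adjusted p-values are smaller (0.5 instead of 1.0).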
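
The Benjamini–Hochberg step-up adjustment can be sketched by hand in the same way: sort the p-values in ascending order, multiply the i-th smallest by m / i, enforce monotonicity with a running minimum taken from the largest p-value downwards, and cap at 1. The function `bh_adjust` is a hypothetical helper for illustration only; the input values are the rounded p-values from the tables above.

```python
import numpy as np

def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (illustrative helper)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, ascending
    scaled = p[order] * m / np.arange(1, m + 1)    # multipliers m/1, m/2, ..., m/m
    # Running minimum from the largest p-value down keeps values non-decreasing.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted_sorted = np.minimum(adjusted_sorted, 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted              # restore the original order
    return adjusted

# Rounded raw p-values as shown in the tables above
p_raw = [0.322878, 0.5, 0.306606, 0.339589, 0.5, 0.314696]
p_bh = bh_adjust(p_raw)
print(p_bh)
```

The largest raw p-value, 0.5, is multiplied by 6/6 = 1, and the running minimum propagates that 0.5 down to all smaller p-values, which matches the `p_correct` column.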