# Analysis of categorical data
- Analysis of one proportion
- Chi-square test
- Fisher exact test
- Cochran's Q test
- McNemar test

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Analysis of one proportion
- Calculate the confidence interval for a population proportion, based on a given data sample.

- Suppose a general practitioner chooses a random sample of 215 women from the patient register for her general practice, and finds that 39 of them have a history of suffering from asthma. What is the confidence interval for the prevalence of asthma? (The data are taken from Altman, chapter 10.2.1:)

![image.png](attachment:image.png)

In [7]:
# Get the data
numTotal = 215
numPositive = 39

# Calculate the confidence intervals
p = float(numPositive)/numTotal
se = np.sqrt(p*(1-p)/numTotal)    # standard error of the proportion
td = stats.t(numTotal-1)          # t-distribution with n-1 degrees of freedom
ci = p + np.array([-1,1])*td.isf(0.025)*se
print(ci)

# Print them
print('ONE PROPORTION')
print('The confidence interval for the given sample is {0:5.3f} to {1:5.3f}'.format(
    ci[0], ci[1]))

[0.12959388 0.23319682]
ONE PROPORTION
The confidence interval for the given sample is 0.130 to 0.233
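
The interval above is based on the t-distribution. As a cross-check, the same data can be run through the normal-approximation (Wald) interval and the Wilson score interval using only the standard library. This is a minimal sketch; the helper names `normal_ci` and `wilson_ci` are illustrative and not part of the original notebook.

```python
from math import sqrt
from statistics import NormalDist

def normal_ci(k, n, alpha=0.05):
    """Normal-approximation (Wald) interval for the proportion k/n."""
    p = k / n
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    se = sqrt(p * (1 - p) / n)                # standard error of the proportion
    return p - z * se, p + z * se

def wilson_ci(k, n, alpha=0.05):
    """Wilson score interval for the proportion k/n."""
    p = k / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

wald = normal_ci(39, 215)
wilson = wilson_ci(39, 215)
print('Wald:   {0:5.3f} to {1:5.3f}'.format(*wald))
print('Wilson: {0:5.3f} to {1:5.3f}'.format(*wilson))
```

With n = 215 the Wald interval agrees with the t-based interval to three decimals, while the Wilson interval is shifted slightly towards 0.5.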


# Chi-square test for a 2x2 table
- Data are taken from Altman, Table 10.10:

Comparison of number of hours swimming by swimmers with or without erosion of dental enamel:

- \>= 6h: 32 yes, 118 no
- <  6h: 17 yes, 127 no

The calculations are done with and without Yates' continuity correction.

In [8]:
# Enter the data
obs = np.array([[32, 118], [17, 127]])

# Calculate the chi-square test
chi2_corrected = stats.chi2_contingency(obs, correction=True)
chi2_uncorrected = stats.chi2_contingency(obs, correction=False)

# Print the result
print('CHI SQUARE')
print('The corrected chi2 value is {0:5.3f}, with p={1:5.3f}'.format(chi2_corrected[0], chi2_corrected[1]))
print('The uncorrected chi2 value is {0:5.3f}, with p={1:5.3f}'.format(chi2_uncorrected[0], chi2_uncorrected[1]))

CHI SQUARE
The corrected chi2 value is 4.141, with p=0.042
The uncorrected chi2 value is 4.802, with p=0.028
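
To make the two statistics less of a black box, the following sketch recomputes them by hand from the expected cell counts. It assumes a 2x2 table (one degree of freedom), for which the chi-square tail probability can be written with `math.erfc`; the function name `chi2_2x2` is illustrative.

```python
from math import erfc, sqrt

def chi2_2x2(table, yates=False):
    """Chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns
            expected = row_sums[i] * col_sums[j] / total
            diff = abs(table[i][j] - expected)
            if yates:
                # Yates' correction: shrink each deviation by 0.5 (not below 0)
                diff = max(diff - 0.5, 0.0)
            chi2 += diff**2 / expected
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x/2))
    p = erfc(sqrt(chi2 / 2))
    return chi2, p

swim = [[32, 118], [17, 127]]
chi2_c, p_c = chi2_2x2(swim, yates=True)
chi2_u, p_u = chi2_2x2(swim, yates=False)
print('corrected:   chi2={0:5.3f}, p={1:5.3f}'.format(chi2_c, p_c))
print('uncorrected: chi2={0:5.3f}, p={1:5.3f}'.format(chi2_u, p_u))
```

For this table every expected count differs from the observed count by exactly 7, so the correction simply replaces 7 by 6.5 in each cell.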


# Fisher's Exact Test
- Spectacle wearing among juvenile delinquents and non-delinquents who failed a vision test

- Spectacle wearers: 1 delinquent, 5 non-delinquents
- Non-spectacle wearers: 8 delinquents, 2 non-delinquents
- (Data are taken from Altman, Table 10.14)

In [9]:
# Enter the data
obs = np.array([[1,5], [8,2]])

# Calculate the Fisher Exact Test
fisher_result = stats.fisher_exact(obs)

# Print the result
print('The probability of obtaining a distribution at least as extreme '
+ 'as the one that was actually observed, assuming that the null ' +
    'hypothesis is true, is: {0:5.3f}.'.format(fisher_result[1]))

The probability of obtaining a distribution at least as extreme as the one that was actually observed, assuming that the null hypothesis is true, is: 0.035.
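
The p-value above can be reproduced from the hypergeometric distribution: with the row and column totals fixed, every possible table is determined by its top-left cell. The sketch below assumes the two-sided convention "sum the probabilities of all tables that are no more likely than the observed one"; the function name `fisher_two_sided` is illustrative.

```python
from math import comb

def fisher_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table, by direct enumeration."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalised hypergeometric probability of a table with x top-left
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    weights = [weight(x) for x in range(lowest, highest + 1)]
    observed = weight(a)
    # Integer weights avoid floating-point ties when comparing probabilities
    extreme = sum(w for w in weights if w <= observed)
    return extreme / sum(weights)

p_fisher = fisher_two_sided([[1, 5], [8, 2]])
print('p = {0:5.3f}'.format(p_fisher))
```

Here three of the seven possible tables (0, 1 or 6 delinquent spectacle wearers) are at least as extreme as the observed one.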


# Cochran's Q test
- 12 subjects are asked to perform 3 tasks. The outcome of each task is "success" or "failure". The results are coded 0 for failure and 1 for success. In the example, subject 1 was successful in task 2, but failed tasks 1 and 3. Is there a difference between the performance on the three tasks?

In [10]:
from statsmodels.stats.contingency_tables import cochrans_q
import pandas as pd

tasks = np.array([[0,1,1,0,1,0,0,1,0,0,0,0],
                  [1,1,1,0,0,1,0,1,1,1,1,1],
                  [0,0,1,0,0,1,0,0,0,0,0,0]])

# I prefer a DataFrame here, as it indicates directly what the values mean
df = pd.DataFrame(tasks.T, columns = ['Task1', 'Task2', 'Task3'])

# --- >>> START stats <<< ---
# This version returns a results object instead of a tuple
result = cochrans_q(df)
(Q, pVal) = (result.statistic, result.pvalue)
# --- >>> STOP stats <<< ---

print('Q = {0:5.3f}, p = {1:5.3f}'.format(Q, pVal))
if pVal < 0.05:
    print("There is a significant difference between the three tasks.")

Q = 8.667, p = 0.013
There is a significant difference between the three tasks.
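
The Q statistic itself only needs the per-task and per-subject success counts. The sketch below recomputes it with the standard formula and, as a simplifying assumption that holds only for three tasks (two degrees of freedom), uses the closed-form chi-square tail `exp(-Q/2)` for the p-value.

```python
from math import exp

tasks = [[0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0],
         [1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1],
         [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]]

k = len(tasks)                                   # number of tasks (treatments)
task_totals = [sum(t) for t in tasks]            # successes per task
subject_totals = [sum(s) for s in zip(*tasks)]   # successes per subject
n_success = sum(task_totals)                     # total number of successes

# Cochran's Q statistic
Q = ((k - 1) * (k * sum(c**2 for c in task_totals) - n_success**2)
     / (k * n_success - sum(r**2 for r in subject_totals)))

# Chi-square survival function for k - 1 = 2 degrees of freedom
p = exp(-Q / 2)
print('Q = {0:5.3f}, p = {1:5.3f}'.format(Q, p))
```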


# McNemar test
- McNemar's test should be run in its "exact" version, even though approximate formulas are typically given in lecture notes. Ignore the test statistic that is returned, because it differs between the two versions.

- In the following example, a researcher attempts to determine if a drug has an effect on a particular disease. Counts of individuals are given in the table, with the diagnosis (disease: present or absent) before treatment given in the rows, and the diagnosis after treatment in the columns. The test requires the same subjects to be included in the before-and-after measurements (matched pairs).

In [19]:
from statsmodels.stats.contingency_tables import mcnemar

f_obs = np.array([[101, 121],[59, 33]])
# exact=True selects the exact binomial version of the test
pVal = mcnemar(f_obs, exact=True).pvalue
#f_obs = np.array([[20, 21],[43, 16]])
#pVal = mcnemar(f_obs, exact=True).pvalue
print('p = {0:5.3e}'.format(pVal))
if pVal < 0.05:
    print("There was a significant change in the disease by the treatment.")

p = 4.434e-06
There was a significant change in the disease by the treatment.
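
In the exact version, only the discordant pairs (the two off-diagonal cells) enter the test: under the null hypothesis each discordant pair is equally likely to fall in either cell, so the smaller count follows a binomial distribution with probability 0.5. A minimal standard-library sketch, with the illustrative name `mcnemar_exact`:

```python
from math import comb

def mcnemar_exact(table):
    """Two-sided exact McNemar p-value from the discordant cells of a 2x2 table."""
    b = table[0][1]          # changed in one direction
    c = table[1][0]          # changed in the other direction
    n = b + c                # number of discordant pairs
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), kept as an exact integer ratio
    tail = sum(comb(n, i) for i in range(k + 1))
    return min(1.0, 2 * tail / 2**n)

p_exact = mcnemar_exact([[101, 121], [59, 33]])
print('p = {0:5.3e}'.format(p_exact))
```

The concordant cells (101 and 33) do not influence the result.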


# 9.1 Fisher's Exact Test: The Tea Experiment
- At a party, a lady claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher proposed to give her eight cups, four of each variety, in random order. One could then ask how probable it was that she got her number of correct answers just by chance.
- The experiment provided the Lady with eight randomly ordered cups of tea: four prepared by first adding milk, four prepared by first adding the tea. She was to select the four cups prepared by one method. (This offered the Lady the advantage of judging cups by comparison.)
- The null hypothesis was that the Lady had no such ability. (In the real, historical experiment, the lady got all eight cups correct.)
- Calculate whether the Lady's claim is supported if she selects three of the four cups correctly. (Correct answer: No. If she gets three correct, the chance that a selection of “three or more” was random is 0.243. She needs to get all four correct if we set the rejection threshold at 0.05.)


In [17]:
'''Solution for Exercise "Categorical Data"
"A Lady Tasting Tea"
'''

# author: Thomas Haslwanter, date: Sept-2015

from scipy import stats
obs = [[3,1], [1,3]]
_, p = stats.fisher_exact(obs, alternative='greater')

obs2 = [[4,0], [0,4]]
_, p2 = stats.fisher_exact(obs2, alternative='greater')

print('\n--- A Lady Tasting Tea (Fisher Exact Test) ---')
print('The chance that the lady selects 3 or more cups correctly by chance is {0:5.3f}'.format(p))
print('The chance that the lady selects all 4 cups correctly by chance is {0:5.3f}'.format(p2))


--- A Lady Tasting Tea (Fisher Exact Test) ---
The chance that the lady selects 3 or more cups correctly by chance is 0.243
The chance that the lady selects all 4 cups correctly by chance is 0.014
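
Because the experiment is so small, the one-sided probabilities can also be checked by brute force: list every way of picking four cups out of eight and count how many picks contain at least three of the milk-first cups. The cup numbering below is an arbitrary labelling chosen for illustration.

```python
from itertools import combinations

cups = range(8)
milk_first = {0, 1, 2, 3}                  # arbitrary labels for the milk-first cups

selections = list(combinations(cups, 4))   # every possible guess of four cups
hits = [len(milk_first & set(s)) for s in selections]

p_three_or_more = sum(h >= 3 for h in hits) / len(hits)
p_all_four = sum(h == 4 for h in hits) / len(hits)

print('Possible selections: {0:d}'.format(len(selections)))
print('P(3 or more correct) = {0:5.3f}'.format(p_three_or_more))
print('P(all 4 correct)     = {0:5.3f}'.format(p_all_four))
```

Of the 70 possible selections, 16 contain exactly three correct cups and only one contains all four, which is why only a perfect selection falls below the 0.05 threshold.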


# 9.2 Chi2 Contingency Test (1 DOF)
![image.png](attachment:image.png)

In [14]:
'''Solution for Exercise "Categorical Data":
Chi2-test with frequency tables
'''

# author: Thomas Haslwanter, date: Sept-2015

from scipy import stats

obs = [[36,14], [30,25]]
chi2, p, dof, expected = stats.chi2_contingency(obs)

print('--- Contingency Test ---')
if p < 0.05:
    print('p={0:6.4f}: the drug affects the heart rate.'.format(p))
else:
    print('p={0:6.4f}: the drug does NOT affect the heart rate.'.format(p))
    
obs2 = [[36,14], [29,26]]
chi2, p, dof, expected = stats.chi2_contingency(obs2)
chi2, p2, dof, expected = stats.chi2_contingency(obs2, correction=False)

print('If the response in 1 non-treated person were different, \n we would get p={0:6.4f} with Yates correction, and p={1:6.4f} without.'.format(p, p2))


--- Contingency Test ---
p=0.0997: the drug does NOT affect the heart rate.
If the response in 1 non-treated person were different, 
 we would get p=0.0673 with Yates correction, and p=0.0423 without.


![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [15]:

'''Solution for Exercise "Categorical Data" '''

# author: Thomas Haslwanter, date: Sept-2015

from scipy import stats

# Chi2-oneway-test
obs = [4,6,14,10,16]
_, p = stats.chisquare(obs)

print('\n--- Chi2-oneway ---')
if p < 0.05:
    print('The difference in opinion between the different age groups is significant (p={0:6.4f})'.format(p))
else:
    print('The difference in opinion between the different age groups is NOT significant (p={0:6.4f})'.format(p))

print('DOF={0:3d}'.format(len(obs)-1))


--- Chi2-oneway ---
The difference in opinion between the different age groups is significant (p=0.0342)
DOF=  4
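
When no expected frequencies are passed, `stats.chisquare` compares the observed counts with a uniform distribution. The sketch below spells that calculation out by hand; it assumes four degrees of freedom, for which the chi-square tail probability has the closed form `exp(-x/2) * (1 + x/2)`.

```python
from math import exp

obs = [4, 6, 14, 10, 16]

# Under the null hypothesis every category has the same expected count
expected = sum(obs) / len(obs)
chi2 = sum((o - expected)**2 / expected for o in obs)
dof = len(obs) - 1

# Chi-square survival function, valid for dof = 4 only
p = exp(-chi2 / 2) * (1 + chi2 / 2)
print('chi2 = {0:5.2f}, DOF = {1:d}, p = {2:6.4f}'.format(chi2, dof, p))
```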


![image.png](attachment:image.png)

In [16]:
'''Solution for Exercise "Categorical Data"
McNemar's Test
'''

# author: Thomas Haslwanter, date: Sept-2015

from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

obs = [[19,1], [6, 14]]
obs2 = [[20,0], [6, 14]]

# exact=True selects the exact binomial version of the test
p = mcnemar(obs, exact=True).pvalue
p2 = mcnemar(obs2, exact=True).pvalue

print('\n--- McNemar Test ---')
if p < 0.05:
    print('The results from the neurologist are significantly different from the questionnaire (p={0:5.3f}).'.format(p))
else:
    print('The results from the neurologist are NOT significantly different from the questionnaire (p={0:5.3f}).'.format(p))
    
# Parentheses are required: without them Python chains the comparisons
if (p < 0.05) == (p2 < 0.05):
    print('The results would NOT change if the expert had diagnosed all "sane" people correctly.')
else:
    print('The results would change if the expert had diagnosed all "sane" people correctly.')


--- McNemar Test ---
The results from the neurologist are NOT significantly different from the questionnaire (p=0.125).
The results would change if the expert had diagnosed all "sane" people correctly.
