# $\chi^2$ test in Python, including post-hoc tests and multiple comparisons correction

Author: Moran Neuhof, Feb 2018

This notebook accompanies my blog post, [Chi-square and post-hoc test in python - the easy way](https://neuhofmo.github.io/chi-square-and-post-hoc-in-python/), which covers chi-square post-hoc tests.
The functions are also available as a module in the [chisq_test_wrapper repository](https://github.com/neuhofmo/chisq_test_wrapper/blob/master/chisq_test_wrapper.py).

In [1]:
# imports
import pandas as pd

from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests  # for multiple comparisons correction
from itertools import combinations  # for post-hoc tests

In [2]:
# assists in displaying significance
def get_asterisks_for_pval(p_val, alpha=0.05):
    """Receives the p-value and returns asterisks string."""
    if p_val > alpha:  # bigger than alpha
        p_text = "ns"
    # following the standards in biological publications
    elif p_val < 1e-4:  
        p_text = '****'
    elif p_val < 1e-3:
        p_text = '***'
    elif p_val < 1e-2:
        p_text = '**'
    else:
        p_text = '*'
    
    return p_text  # string of asterisks

In [3]:
def run_chisq_on_combination(df, combinations_tuple):
    """Receives a dataframe and a combinations tuple and returns p-value after performing chisq test."""
    assert len(combinations_tuple) == 2, "Combinations tuple must contain exactly 2 items."
    # keep only the two rows (groups) being compared
    new_df = df[df.index.isin(combinations_tuple)]
    # Yates' continuity correction is only applied when the sub-table has 1 degree of freedom (2x2)
    _, p, _, _ = chi2_contingency(new_df, correction=True)
    return p


def chisq_and_posthoc_corrected(df, correction_method='fdr_bh', alpha=0.05):
    """Receives a dataframe and performs chi2 test and then post hoc.
    Prints the p-values and corrected p-values (after multiple comparisons correction).
    alpha: optional threshold for rejection (default: 0.05)
    correction_method: method used for multiple comparisons correction (default: 'fdr_bh', i.e. Benjamini-Hochberg FDR).
    See statsmodels.stats.multitest.multipletests for elaboration."""

    # start by running chi2 test on the matrix
    chi2, p, dof, ex = chi2_contingency(df, correction=True)
    print("Chi2 result of the contingency table: {}, p-value: {}\n".format(chi2, p))
    
    # post-hoc test
    all_combinations = list(combinations(df.index, 2))  # gathering all combinations for post-hoc chi2
    print("Post-hoc chi2 tests results:")
    p_vals = [run_chisq_on_combination(df, comb) for comb in all_combinations]  # a list of all p-values
    # the list is in the same order as all_combinations

    # correction for multiple testing
    reject_list, corrected_p_vals = multipletests(p_vals, method=correction_method, alpha=alpha)[:2]
    for p_val, corr_p_val, reject, comb in zip(p_vals, corrected_p_vals, reject_list, all_combinations):
        # note: the asterisks reflect the uncorrected p-value
        print("{}: p_value: {:.6f}; corrected: {:.6f} ({}) reject: {}".format(
            comb, p_val, corr_p_val, get_asterisks_for_pval(p_val, alpha), reject))
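
The `correction_method` argument is passed straight to `multipletests`, so any method name that function accepts can be used. As a rough illustration of how the choice matters, the sketch below compares Bonferroni with Benjamini-Hochberg. The four p-values are made up for the example and are not taken from the demo data.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# hypothetical raw p-values, invented for illustration only
raw_p_vals = [0.01, 0.04, 0.03, 0.005]

results = {}
for method in ("bonferroni", "fdr_bh"):
    # the first two returned items are the reject flags and the corrected p-values
    reject, corrected = multipletests(raw_p_vals, alpha=0.05, method=method)[:2]
    results[method] = (reject, corrected)
    print(method, np.round(corrected, 4), reject)
```

Bonferroni multiplies every p-value by the number of tests, so it is the more conservative of the two: on these invented values it rejects two of the four hypotheses, while `fdr_bh` rejects all four.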


### Demonstrating on sample data:

In [4]:
# loading the file from Excel
df = pd.read_excel("groups_sum_demo.xlsx", index_col='Cell_line')
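
The Excel file itself is not reproduced here, so its layout is an assumption: one row per group (the `Cell_line` index) and one column of counts per outcome category. To try the approach without the file, a small table can be built by hand. The counts, the column names and the `demo_df` name below are all invented for illustration.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

# hypothetical counts, invented for illustration; not the contents of groups_sum_demo.xlsx
demo_df = pd.DataFrame(
    {"Category_A": [120, 90, 60], "Category_B": [80, 110, 140]},
    index=pd.Index(["Control", "Patient1", "Patient2"], name="Cell_line"),
)

# overall test on the full 3x2 table
chi2, p, dof, expected = chi2_contingency(demo_df, correction=True)
print("overall: chi2={:.3f}, p={:.3g}, dof={}".format(chi2, p, dof))

# every unordered pair of rows becomes one 2x2 post-hoc test
pairs = list(combinations(demo_df.index, 2))
pair_p_vals = {pair: chi2_contingency(demo_df.loc[list(pair)])[1] for pair in pairs}
for pair, pair_p in pair_p_vals.items():
    print(pair, "{:.6f}".format(pair_p))
```

With three groups there are three pairwise tests; with the five groups in the demo data there are ten, which is why a multiple comparisons correction is applied to the post-hoc p-values.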

In [5]:
chisq_and_posthoc_corrected(df)

Chi2 result of the contingency table: 1095.406615116616, p-value: 3.761331610902334e-231

Post-hoc chi2 tests results:
('Control', 'Patient1'): p_value: 0.000101; corrected: 0.000168 (***) reject: True
('Control', 'Patient2'): p_value: 0.003231; corrected: 0.004615 (**) reject: True
('Control', 'Patient3'): p_value: 0.000084; corrected: 0.000168 (****) reject: True
('Control', 'Patient4'): p_value: 0.000000; corrected: 0.000000 (****) reject: True
('Patient1', 'Patient2'): p_value: 0.955635; corrected: 0.955635 (ns) reject: False
('Patient1', 'Patient3'): p_value: 0.034235; corrected: 0.042793 (*) reject: True
('Patient1', 'Patient4'): p_value: 0.000000; corrected: 0.000000 (****) reject: True
('Patient2', 'Patient3'): p_value: 0.158924; corrected: 0.176582 (ns) reject: False
('Patient2', 'Patient4'): p_value: 0.000000; corrected: 0.000000 (****) reject: True
('Patient3', 'Patient4'): p_value: 0.000000; corrected: 0.000000 (****) reject: True
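
With the default `fdr_bh` method, the corrected column above follows the Benjamini-Hochberg rule: sort the p-values, multiply the one at rank i by m / i (m being the number of tests), then make sure the adjusted values never decrease with rank. For instance, 0.158924 is the ninth smallest of the ten p-values, and 0.158924 × 10 / 9 ≈ 0.176582. A minimal NumPy sketch of that arithmetic follows; `bh_adjust` is a hypothetical helper written for this example, not part of statsmodels, and the input values are made up.

```python
import numpy as np


def bh_adjust(p_vals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_vals, dtype=float)
    m = p.size
    order = np.argsort(p)  # indices that sort the p-values in ascending order
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # running minimum from the largest rank down keeps the values monotone
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)  # put back in input order, cap at 1
    return adjusted


# hypothetical p-values, invented for illustration only
example_adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
print(example_adjusted)
```

In practice `multipletests` does this for you; the sketch is only meant to make the corrected column easier to read.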


Please feel free to share, comment or distribute.

-- Moran