# Coding Applications in Medicine: Data Science - Categorical Hypothesis Tests

There are two basic hypothesis tests for comparing categorical variables: Fisher's exact test and the chi-squared test. Recall that proportions are commonly used to compare categorical variables.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from scipy.stats import chi2_contingency
from scipy.stats import fisher_exact

import sklearn.datasets

In [None]:
# Load the insurance data.
insuranceDF = pd.read_csv("data/insurance.csv")
insuranceDF

## Fisher's Exact Test

Fisher's exact test is used to check for differences in proportions between categories. It is mainly used when the sample size is small.

In the following example, we examine whether the proportion of smokers differs by sex (male or female). A small sample is simulated by using only the first 30 data points.

In [None]:
# Take first 30 data points.
insuranceDFSmallSampleDF = insuranceDF.head(30)

# Graph of the data.
sns.countplot(data=insuranceDFSmallSampleDF, y="smoker", hue="sex");
plt.show();

In [None]:
# Create the contingency table.
insuranceDFSmallSampleContingency = pd.crosstab(insuranceDFSmallSampleDF["smoker"], insuranceDFSmallSampleDF["sex"])
insuranceDFSmallSampleContingency
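
Because the test compares proportions, it can help to view the contingency table as proportions as well as counts. The sketch below assumes a small made-up data frame (`toyDF` and its values are hypothetical, not taken from the insurance data) and uses the `normalize` argument of `pd.crosstab` to show the share of smokers within each sex.

```python
import pandas as pd

# Hypothetical toy data, not from insurance.csv.
toyDF = pd.DataFrame({
    "sex": ["male"] * 5 + ["female"] * 5,
    "smoker": ["yes", "yes", "no", "no", "no",
               "yes", "no", "no", "no", "no"],
})

# Counts for each smoker/sex combination.
toyCounts = pd.crosstab(toyDF["smoker"], toyDF["sex"])

# normalize="columns" divides each column by its total, giving the
# proportion of smokers and non-smokers within each sex.
toyProportions = pd.crosstab(toyDF["smoker"], toyDF["sex"], normalize="columns")

print(toyCounts)
print(toyProportions)
```

In this toy table, 2 of 5 males and 1 of 5 females smoke, so the column proportions for `yes` are 0.4 and 0.2.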

In [None]:
# Fisher's exact test. Returns the odds ratio and the two-sided p-value.
fisher_exact(insuranceDFSmallSampleContingency)
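
`fisher_exact` returns two values: the sample odds ratio and the p-value (two-sided by default). The following is a minimal sketch of unpacking and reading them. It uses a hypothetical 2x2 table rather than the insurance data, and the 0.05 threshold is an assumed convention, not something fixed by the test.

```python
import numpy as np
from scipy.stats import fisher_exact

# Hypothetical 2x2 table of counts (rows: smoker no/yes, columns: female/male).
toyTable = np.array([[8, 2],
                     [1, 5]])

# Unpack the two returned values; the test is two-sided by default.
oddsRatio, pValue = fisher_exact(toyTable)
print(f"Odds ratio: {oddsRatio:.2f}")
print(f"p-value: {pValue:.4f}")

# Assumed significance threshold; 0.05 is a common convention.
alpha = 0.05
if pValue < alpha:
    print("Reject the null hypothesis of equal proportions.")
else:
    print("No significant difference in proportions detected.")
```

The odds ratio here is the sample odds ratio, (8 × 5) / (2 × 1) = 20 for this toy table.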

## Chi-Squared Test 

The chi-squared test is also used to check for differences in proportions between categories. It is mainly used when the sample size is large.
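
One way to make "smaller" and "larger" concrete is to look at the expected cell counts. A common rule of thumb (an assumption here, not a rule stated in this module) is that the chi-squared approximation is reasonable when every expected count is at least 5; otherwise Fisher's exact test is the safer choice. The hypothetical helper `chooseTest` below sketches that check on two made-up tables.

```python
import numpy as np
from scipy.stats.contingency import expected_freq


def chooseTest(table, minExpected=5):
    """Suggest a test based on the smallest expected cell count.

    The threshold of 5 is a widely used rule of thumb, not a strict rule.
    """
    # Expected counts under independence of rows and columns.
    expected = expected_freq(np.asarray(table))
    return "chi-squared" if expected.min() >= minExpected else "fisher"


smallTable = [[8, 2], [1, 5]]          # hypothetical small sample
largeTable = [[200, 300], [300, 200]]  # hypothetical large sample

print(chooseTest(smallTable))
print(chooseTest(largeTable))
```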

In the following example, we examine whether the proportion of smokers differs by parent status (0 children or at least 1 child). The full dataset serves as the large sample.

In [None]:
# Create column to determine whether the individual is a parent.
insuranceDF["is parent"] = (insuranceDF["children"] > 0)

# Graph of the data.
sns.countplot(data=insuranceDF, y="smoker", hue="is parent");
plt.show();

In [None]:
# Contingency table by count.
smokerParentContingency = pd.crosstab(insuranceDF["smoker"], insuranceDF["is parent"])
smokerParentContingency

In [None]:
# Chi-squared analysis.
chi2_contingency(smokerParentContingency, correction=False)
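
`chi2_contingency` returns four values: the test statistic, the p-value, the degrees of freedom, and the table of expected counts under independence. The sketch below unpacks them for a hypothetical 2x2 table (not the insurance data). Passing `correction=False` turns off Yates' continuity correction, which SciPy otherwise applies when there is one degree of freedom.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts (rows: smoker no/yes, columns: parent no/yes).
toyTable = np.array([[20, 30],
                     [30, 20]])

statistic, pValue, dof, expected = chi2_contingency(toyTable, correction=False)

print(f"Chi-squared statistic: {statistic:.3f}")
print(f"p-value: {pValue:.4f}")
print(f"Degrees of freedom: {dof}")
print("Expected counts under independence:")
print(expected)
```

For this toy table every expected count is 25, so each cell contributes (5² / 25) = 1 and the statistic is 4 with 1 degree of freedom.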

For more information, check the following:
- User Guide: https://docs.scipy.org/doc/scipy/tutorial/stats.html
- API Reference: https://docs.scipy.org/doc/scipy/reference/stats.html

**Source:**


Module adapted from Kaggle: https://www.kaggle.com/code/mariapushkareva/medical-insurance-cost-with-linear-regression/notebook

Dataset source: https://github.com/stedy/Machine-Learning-with-R-datasets