# Coding Applications in Medicine: Data Science - Numerical Hypothesis Tests

There are two basic hypothesis tests for comparing a numerical variable across groups: the t-test and ANOVA. Recall that means are commonly used to compare numerical variables.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from scipy.stats import ttest_ind
from scipy.stats import f_oneway

In [None]:
# Load the insurance data.
insuranceDF = pd.read_csv("data/insurance.csv")
insuranceDF

## T-Test

The t-test is used to check for differences in population mean between two groups. 

In the following example, we will examine the differences in medical insurance charges based on whether an individual is a smoker or nonsmoker.

In [None]:
# Graph of the data.
sns.histplot(data=insuranceDF, x="charges", hue="smoker", binwidth=1000, kde=True)
plt.show();

In [None]:
# Another graph of the data.
sns.boxplot(data=insuranceDF, x="smoker", y="charges")
plt.show();

In [None]:
# Stratification by smoker/non-smoker.
smokerGroup = insuranceDF[insuranceDF["smoker"]=="yes"]
nonsmokerGroup = insuranceDF[insuranceDF["smoker"]=="no"]

# T-test analysis.
ttest_ind(smokerGroup["charges"], nonsmokerGroup["charges"])
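
`ttest_ind` returns a test statistic and a p-value. By default it assumes the two groups have equal variances; passing `equal_var=False` runs Welch's t-test, which does not make that assumption. The sketch below shows one way to read both results. It uses synthetic numbers rather than the insurance data, and the variable names, group sizes, and the 0.05 significance level are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for the two groups (NOT the insurance data).
rng = np.random.default_rng(0)
smoker_charges = rng.normal(loc=32000, scale=11000, size=200)
nonsmoker_charges = rng.normal(loc=8500, scale=6000, size=800)

# Student's t-test (the default) assumes equal variances in both groups.
student = ttest_ind(smoker_charges, nonsmoker_charges)

# Welch's t-test drops the equal-variance assumption.
welch = ttest_ind(smoker_charges, nonsmoker_charges, equal_var=False)

alpha = 0.05  # assumed significance level
for name, result in [("Student", student), ("Welch", welch)]:
    decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
    print(f"{name}: t = {result.statistic:.2f}, p = {result.pvalue:.3g} -> {decision}")
```

A p-value below the chosen significance level is evidence against the null hypothesis that the two population means are equal.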

## ANOVA

ANOVA (analysis of variance) is used to check for differences in population mean between three or more groups.

In the following example, we will examine the differences in medical insurance charges based on whether an individual is young, middle-aged, or elderly.

In [None]:
# Create the age grouping column.
insuranceByAgeDF = insuranceDF[["age", "charges"]].copy()

def ageCategory(age):
    if age < 40:
        return "Young"
    elif age >= 60:
        return "Old"
    return "Middle"

insuranceByAgeDF["Age Grouping"] = insuranceDF["age"].apply(ageCategory)
insuranceByAgeDF
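
As an alternative to a hand-written function with `apply`, `pd.cut` can assign the same three labels from a list of bin edges. This is a minimal sketch on a small set of hypothetical ages, not the insurance data; `right=False` makes each bin include its left edge, so the cut points match the thresholds used above (under 40, 40 to 59, 60 and over).

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on both sides of each threshold.
ages = pd.Series([18, 39, 40, 59, 60, 64], name="age")

# Bins are [-inf, 40), [40, 60), [60, inf) because right=False.
age_group = pd.cut(
    ages,
    bins=[-np.inf, 40, 60, np.inf],
    labels=["Young", "Middle", "Old"],
    right=False,
)

print(age_group.tolist())
```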

In [None]:
# Graph of the data.
sns.histplot(data=insuranceByAgeDF, x="charges", hue="Age Grouping", binwidth=1000, kde=True)
plt.show();

In [None]:
# Another graph of the data.
sns.boxplot(data=insuranceByAgeDF, x="Age Grouping", y="charges")
plt.show();

In [None]:
# Create separate dataframes for each grouping.
insuranceByAgeDF = insuranceDF[["age", "charges"]].copy()

insuranceYoungAgeDF = insuranceByAgeDF[insuranceByAgeDF["age"] < 40]
insuranceMiddleAgeDF = insuranceByAgeDF[(insuranceByAgeDF["age"] >= 40) & 
                                        (insuranceByAgeDF["age"] < 60)]
insuranceOldAgeDF = insuranceByAgeDF[insuranceByAgeDF["age"] >= 60]

# ANOVA Test.
f_oneway(insuranceYoungAgeDF["charges"], 
         insuranceMiddleAgeDF["charges"], 
         insuranceOldAgeDF["charges"])
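
`f_oneway` returns an F statistic and a p-value. A small p-value suggests that at least one group mean differs from the others, but it does not say which one. The sketch below builds the per-group samples with `groupby` rather than separate filters and then reads the result. The data are synthetic, and the `demo` frame, its group means, and the 0.05 significance level are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic stand-in with three groups of 100 values each (NOT the insurance data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age Grouping": np.repeat(["Young", "Middle", "Old"], 100),
    "charges": np.concatenate([
        rng.normal(10000, 3000, 100),
        rng.normal(14000, 3000, 100),
        rng.normal(20000, 3000, 100),
    ]),
})

# One array of charges per group, collected with groupby.
samples = [group["charges"].to_numpy() for _, group in demo.groupby("Age Grouping")]

result = f_oneway(*samples)

alpha = 0.05  # assumed significance level
decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
print(f"F = {result.statistic:.2f}, p = {result.pvalue:.3g} -> {decision}")
```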

For more information, check the following:
- User Guide: https://docs.scipy.org/doc/scipy/tutorial/stats.html
- API Reference: https://docs.scipy.org/doc/scipy/reference/stats.html

**Source:**

Module adapted from Kaggle: https://www.kaggle.com/code/mariapushkareva/medical-insurance-cost-with-linear-regression/notebook

Dataset source: https://github.com/stedy/Machine-Learning-with-R-datasets