# Coding Applications in Medicine: Data Science - Other Statistical Analysis

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from scipy.stats import pearsonr
from scipy.stats.contingency import odds_ratio
from scipy.stats.contingency import relative_risk

In [None]:
# Load the insurance data.
insuranceDF = pd.read_csv("data/insurance.csv")
insuranceDF

## Odds Ratio

Odds ratios are typically used in case-control studies.

Suppose we are examining a group of individuals to see whether smokers are more likely to be male. (We start with smokers and nonsmokers and look back to compare the frequency of males and females.)
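As a minimal sketch of what the odds ratio measures, the sample odds ratio of a 2x2 table `[[a, b], [c, d]]` is `(a*d)/(b*c)`. The counts below are hypothetical and are not taken from the insurance data, and the helper name `sample_odds_ratio` is only illustrative. Note that `odds_ratio` from SciPy defaults to `kind="conditional"`, which generally gives a slightly different estimate; `kind="sample"` corresponds to the formula shown here.

```python
def sample_odds_ratio(table):
    """Return (a*d)/(b*c) for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Hypothetical counts (not from the insurance data).
# Rows: female, male. Columns: non-smoker, smoker.
hypothetical_table = [[90, 10],
                      [80, 20]]

# Odds of smoking among males (20/80) divided by odds among females (10/90).
example_odds_ratio = sample_odds_ratio(hypothetical_table)
print(example_odds_ratio)
```

An odds ratio above 1 would suggest higher odds of smoking in the second row (males here); a value near 1 would suggest no association.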

In [None]:
# Create the contingency table.
smokerBySexDF = insuranceDF[["sex", "smoker"]].copy()
smokerBySexDF["is male"] = (smokerBySexDF["sex"] == "male").astype(int)
smokerBySexContingency = pd.crosstab(smokerBySexDF["is male"], smokerBySexDF["smoker"])

smokerBySexContingency

In [None]:
# Odds Ratio.
smokerBySexOddsRatio = odds_ratio(smokerBySexContingency)
smokerBySexOddsRatio

In [None]:
# Confidence Interval (95%).
smokerBySexOddsRatio.confidence_interval(0.95)

## Relative Risk

Relative risks are typically used in cohort studies.

Suppose we are following a group of individuals to see whether smoking increases the risk of a high medical insurance charge. (We start with smokers and nonsmokers and look forward to compare their insurance charges.)
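The relative risk itself is just a ratio of two proportions: the risk of the outcome in the exposed group divided by the risk in the control group. The sketch below uses hypothetical counts and an illustrative helper name (`risk_ratio`); it is not the SciPy implementation.

```python
def risk_ratio(exposed_cases, exposed_total, control_cases, control_total):
    """Return the risk in the exposed group divided by the risk in the control group."""
    exposed_risk = exposed_cases / exposed_total
    control_risk = control_cases / control_total
    return exposed_risk / control_risk


# Hypothetical cohort (not from the insurance data):
# 50 of 100 exposed and 25 of 100 unexposed individuals have the outcome.
example_relative_risk = risk_ratio(50, 100, 25, 100)
print(example_relative_risk)
```

A relative risk above 1 would suggest the outcome is more common in the exposed group.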

In [None]:
# Create the contingency table (rows: smoker status, columns: high charge).
chargeBySmokerDF = insuranceDF[["smoker", "charges"]].copy()
chargeBySmokerDF["is smoker"] = (chargeBySmokerDF["smoker"] == "yes").astype(int)
chargeBySmokerDF["high charge"] = (chargeBySmokerDF["charges"] > 10000)

chargeBySmokerContingency = pd.crosstab(chargeBySmokerDF["is smoker"], chargeBySmokerDF["high charge"])

chargeBySmokerContingency

In [None]:
# Unfortunately, the relative risk function does not take a contingency table as an input,
# so we extract the four counts it needs.
# Rows: 1 = exposed group, 0 = control group. Column True = high charge.
exposed_cases = chargeBySmokerContingency[True][1]
exposed_total = chargeBySmokerContingency.sum(axis=1)[1]
control_cases = chargeBySmokerContingency[True][0]
control_total = chargeBySmokerContingency.sum(axis=1)[0]

In [None]:
# Relative Risk.
chargeBySmokerRelRisk = relative_risk(exposed_cases, exposed_total, control_cases, control_total)
chargeBySmokerRelRisk

In [None]:
# Confidence Interval (95%).
chargeBySmokerRelRisk.confidence_interval(0.95)
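SciPy's documentation describes the interval above as a Katz-type log interval. The following is a sketch of that kind of calculation under a normal approximation on log(RR), written with the standard library only; the function name and counts are hypothetical, and this is not SciPy's actual code. It assumes both case counts are nonzero.

```python
import math
from statistics import NormalDist


def katz_log_interval(exposed_cases, exposed_total, control_cases, control_total,
                      confidence_level=0.95):
    """Approximate confidence interval for a relative risk on the log scale."""
    rr = (exposed_cases / exposed_total) / (control_cases / control_total)
    # Approximate standard error of log(RR).
    se = math.sqrt(1 / exposed_cases - 1 / exposed_total
                   + 1 / control_cases - 1 / control_total)
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return rr * math.exp(-z * se), rr * math.exp(z * se)


# Hypothetical counts (not from the insurance data).
rr_low, rr_high = katz_log_interval(50, 100, 25, 100)
print(rr_low, rr_high)
```

If the interval excludes 1, the data are inconsistent with equal risk in the two groups at the chosen confidence level.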

## Pearson Correlation Coefficient

Recall that the Pearson correlation coefficient measures the strength and direction of the linear relationship between two variables.

In the following example, we will examine the correlation between BMI and medical insurance charges.
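To make the "linear" part concrete, here is a small sketch that computes the coefficient directly from its definition (covariance divided by the product of the standard deviations). The data and the helper name `pearson_r` are hypothetical; the second example shows a perfect but non-linear relationship that the coefficient does not detect.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical, perfectly linear data: r is (numerically) 1.
linear_r = pearson_r([20.0, 25.0, 30.0, 35.0], [3000.0, 5000.0, 7000.0, 9000.0])

# Hypothetical quadratic data (y = x**2): strongly related, yet r is 0.
quadratic_r = pearson_r([-2.0, -1.0, 0.0, 1.0, 2.0], [4.0, 1.0, 0.0, 1.0, 4.0])

print(linear_r, quadratic_r)
```

This is why it helps to plot the data before interpreting the coefficient.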

In [None]:
# Graph of the data.
sns.scatterplot(data=insuranceDF, x="bmi", y="charges")
plt.show()

In [None]:
# Pearson correlation coefficient.
insuranceChargeBmiPearsonR = pearsonr(insuranceDF["bmi"],
                                      insuranceDF["charges"])
insuranceChargeBmiPearsonR

In [None]:
# Confidence Interval (95%).
insuranceChargeBmiPearsonR.confidence_interval(0.95)
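According to the SciPy documentation, the default interval for `pearsonr` is based on the Fisher transformation. A minimal sketch of that approach, using hypothetical values for `r` and the sample size `n` and an illustrative function name, might look like this:

```python
import math
from statistics import NormalDist


def fisher_interval(r, n, confidence_level=0.95):
    """Approximate confidence interval for a correlation via the Fisher z-transformation."""
    z = math.atanh(r)            # transform r to the z scale
    se = 1 / math.sqrt(n - 3)    # approximate standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    # Transform the interval endpoints back to the correlation scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Hypothetical values (not the result for the insurance data).
r_low, r_high = fisher_interval(r=0.2, n=1000)
print(r_low, r_high)
```

An interval that excludes 0 would suggest a nonzero linear correlation at the chosen confidence level.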

For more information, check the following:
- User Guide: https://docs.scipy.org/doc/scipy/tutorial/stats.html
- API Reference: https://docs.scipy.org/doc/scipy/reference/stats.html

**Source:**


Module adapted from Kaggle: https://www.kaggle.com/code/mariapushkareva/medical-insurance-cost-with-linear-regression/notebook

Dataset source: https://github.com/stedy/Machine-Learning-with-R-datasets