# Data Analysis

This notebook analyzes a dataset collected to capture the distribution of type-2 diabetes among female Pima Indian individuals living near Phoenix, Arizona.

## Setup

In [None]:
import os
import subprocess
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import date


In [None]:
# Path to root directory of the repo.
root_dir_ = subprocess.check_output(
    ["git", "rev-parse", "--show-toplevel"],
    text=True,
)
ROOT_DIR = root_dir_.strip()
# Path to data directory.
DATA_DIR = os.path.join(ROOT_DIR, "data")
# Path from which the raw dataframe is loaded.
DF_PKL_PATH_SRC = os.path.join(DATA_DIR, "df_raw.pkl")
# Path to which the processed dataframe is saved.
DF_PKL_PATH_TAR = os.path.join(DATA_DIR, "df_processed.pkl")

In [None]:
df = pd.read_pickle(DF_PKL_PATH_SRC)

## First Inspection

In [None]:
df.info()

We observe that the dataset contains no missing values, at least formally.

Most of the column data types also look plausible at first sight.

First, a minor column renaming for convenience.

In [None]:
df = df.rename(columns={
    "Age": "age",
    "diabetespedigreefunction": "dpfunction",
    # We argue below why "outcome" encodes the presence of diabetes.
    "outcome": "has_diabetes",
    "measurement_date": "date",
})

We renamed "outcome" to "has_diabetes" on the assumption that it encodes the presence, rather than the absence, of diabetes. To exclude this possible pitfall, let us check the assumption. It is well known that obesity (BMI above 30) raises the risk of a variety of diseases, including type-2 diabetes. If the encoding is as assumed, the BMI distribution of the group with "has_diabetes" equal to 1 should be shifted toward higher values.
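
Alongside the density plots, the same check can be made numerically by comparing a location statistic of BMI between the two outcome groups. The following is a minimal sketch on a toy dataframe with invented values; the helper `median_by_outcome` is hypothetical and not part of this notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; all values are invented.
toy = pd.DataFrame({
    "bmi": [22.0, 24.5, 27.0, 31.0, 35.5, 38.0],
    "has_diabetes": [0, 0, 0, 1, 1, 1],
})


def median_by_outcome(frame, feature, outcome="has_diabetes"):
    """Return the median of `feature` for each value of `outcome`."""
    return frame.groupby(outcome)[feature].median()


bmi_medians = median_by_outcome(toy, "bmi")
print(bmi_medians)
```

On the real data, a clearly higher median BMI in the group labelled 1 would support reading the label as presence of diabetes.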

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))  # 1 row, 2 columns

# First subplot: Age
sns.kdeplot(data=df, x="age", hue="has_diabetes", common_norm=False, ax=axes[0])
axes[0].set_title("Age Distribution by Diabetes Status")
axes[0].set_xlabel("Age")
axes[0].set_ylabel("Density")

# Second subplot: BMI
sns.kdeplot(data=df, x="bmi", hue="has_diabetes", common_norm=False, ax=axes[1])
axes[1].set_title("BMI Distribution by Diabetes Status")
axes[1].set_xlabel("BMI")
axes[1].set_ylabel("Density")

plt.tight_layout()
plt.show()


We also convert the date column to a proper datetime type.

In [None]:
df["date"] = pd.to_datetime(df["date"])

With the aid of the [accompanying paper](../docs/adap-diabetes.pdf), we can complete the explanatory table of the columns:

| Name | Description |
| --- | --- |
| id | patient id |
| age | age in years |
| pregnancies | number of pregnancies |
| bmi | body-mass index in kg / m^2 |
| insulin | 2 h serum insulin in μU/ml |
| glucose | plasma glucose concentration at 2 h in an oral glucose tolerance test (GTIT) |
| bloodpressure | diastolic blood pressure in mm Hg |
| date | date of measurement |
| dpfunction | diabetes pedigree function (further explanation below) |
| has_diabetes | whether the patient developed type-2 diabetes within 5 years |
| skinthickness | triceps skin fold thickness in mm |

An additional remark on the diabetes pedigree function (DPF): It is a formula that calculates a score for the likelihood of having diabetes, based on the diabetes history of relatives.

Let us outline on which variables the DPF depends and whether the dependence is isotone ("increase leads to increase") or antitone ("increase leads to decrease"):

- Number of relatives with diabetes (isotone)
  - Age at which these relatives developed diabetes (antitone)
  - Percentage of shared genes (isotone)
- Number of relatives without diabetes (antitone)
  - Age at their last examination (antitone)
  - Percentage of shared genes (antitone)
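
To make these dependencies concrete, here is a minimal sketch of a pedigree score of this kind. The functional form and the constants (88, 14, 20, 50) are assumptions based on our reading of the accompanying paper and should be verified against it. The function name `dpf_sketch` and its argument layout are hypothetical.

```python
def dpf_sketch(diabetic_relatives, healthy_relatives):
    """Illustrative pedigree score (assumed form, not a verified formula).

    diabetic_relatives: list of (shared_genes, age_at_diagnosis) pairs
    healthy_relatives:  list of (shared_genes, age_at_last_exam) pairs
    shared_genes is a fraction, e.g. 0.5 for a parent or a full sibling.
    """
    # ASSUMPTION: the constants 88, 14, 20 and 50 are taken from our
    # reading of the paper and must be checked there.
    numerator = sum(k * (88 - adm) for k, adm in diabetic_relatives) + 20
    denominator = sum(k * (acl - 14) for k, acl in healthy_relatives) + 50
    return numerator / denominator


# No known relatives at all.
baseline = dpf_sketch([], [])
# One parent diagnosed at age 40 raises the score.
with_diabetic_parent = dpf_sketch([(0.5, 40)], [])
# One parent still free of diabetes at age 50 lowers the score.
with_healthy_parent = dpf_sketch([], [(0.5, 50)])

print(baseline, with_diabetic_parent, with_healthy_parent)
```

In this sketch, relatives with diabetes only enter the numerator and relatives without diabetes only enter the denominator, which is what produces the opposite directions of dependence.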

## Feature Distributions

In [None]:
cols = [col for col in df.columns if col not in ["id", "has_diabetes"]]

fig, axes = plt.subplots(5, 2, figsize=(12, 20))
axes = axes.flatten()

for ax, col in zip(axes, cols):
    sns.histplot(data=df, x=col, ax=ax, color="black", linestyle="--")
    ax.set_title(col)

# Hide axes that remain unused when there are fewer columns than subplots.
for ax in axes[len(cols):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()

Salient observations that we want to address in the following sections:

- (Very likely) zero-encoded NULL values for BMI, insulin and skinthickness
- Implausible blood pressure values
- Two measurement dates
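
The suspected zero-encoded values can be counted before any cleaning. The following is a minimal sketch on a toy dataframe with invented values; the helper `count_zeros` is hypothetical and not part of this notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; all values are invented.
toy = pd.DataFrame({
    "bmi": [33.6, 0.0, 23.3, 28.1],
    "insulin": [0, 94, 0, 168],
    "skinthickness": [35, 0, 0, 23],
})


def count_zeros(frame, columns):
    """Count the entries equal to zero in each of the given columns."""
    return (frame[columns] == 0).sum()


zero_counts = count_zeros(toy, ["bmi", "insulin", "skinthickness"])
print(zero_counts)
```

Applied to `df`, such a count gives a first impression of how many values per column might be affected.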

## Missing Values

In this section, we replace implausible values, which are very likely zero-encoded NULL values, with NaN.

### BMI

In [None]:
BMI_LOWER = 10
df.loc[df["bmi"] <= BMI_LOWER, "bmi"] = np.nan

df["bmi"].isna().sum()

### Skin Thickness

In [None]:
SKINTHICKNESS_LOWER = 2.5
df.loc[df["skinthickness"] <= SKINTHICKNESS_LOWER, "skinthickness"] = np.nan

df["skinthickness"].isna().sum()  # count of NaNs

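By the way, we already have BMI, to which skin thickness is expected to be closely related. We therefore check the correlation between the two and then drop skin thickness.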

In [None]:
df["bmi"].corr(df["skinthickness"])

In [None]:
if "skinthickness" in df.columns:
    df = df.drop(columns="skinthickness")

### Insulin

In [None]:
INSULIN_LOWER = 10

df_tmp = df[df["insulin"] <= INSULIN_LOWER]

fig, ax = plt.subplots(figsize=(6, 4))
sns.histplot(data=df_tmp, x="insulin", ax=ax, color="black", linestyle="--")
ax.set_title("Insulin")

plt.tight_layout()
plt.show()

Insulin is stored as an integer, although it is a continuous quantity. Very likely the values are rounded. Values slightly above zero are not unusual. Therefore, we cannot say which of the zero values are rounded down and which are falsely encoded NULL values. We leave them as they stand.

### Blood Pressure

In [None]:
BLOODPRESSURE_LOWER = 30
df.loc[df["bloodpressure"] <= BLOODPRESSURE_LOWER, "bloodpressure"] = np.nan

df["bloodpressure"].isna().sum()

## Date of Measurement

In [None]:
df.nunique()

In [None]:
df["date"].unique()

We see that we have only 768 different ids but 1536 (2 * 768) rows. The paper also mentions only 768 examinations. However, we have two dates of measurement. Therefore, it is possible that each patient was examined at two different times.

In [None]:
# Series with number of dates per id.
s = df.groupby("id")["date"].nunique()

# Unique values in this series.
s.unique()

We can conclude that for each patient exactly two measurements are recorded, one on 2022-12-01 and the other on 2022-12-13.
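
This conclusion combines two facts: only two distinct dates exist overall, and every id has two distinct dates. A minimal sketch of an explicit check, which also rules out duplicated rows, is shown below on a toy dataframe with invented ids. The helper `has_one_row_per_date` is hypothetical and not part of this notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; the ids are invented.
toy = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2022-12-01", "2022-12-13", "2022-12-01", "2022-12-13"]
    ),
})


def has_one_row_per_date(frame, id_col="id", date_col="date"):
    """Check that every id has exactly one row for each date in the frame."""
    n_dates = frame[date_col].nunique()
    rows_per_id = frame.groupby(id_col).size()
    dates_per_id = frame.groupby(id_col)[date_col].nunique()
    return bool(
        (rows_per_id == n_dates).all() and (dates_per_id == n_dates).all()
    )


is_complete = has_one_row_per_date(toy)
# Dropping one row leaves id 2 with a single measurement.
is_complete_after_drop = has_one_row_per_date(toy.iloc[:3])
print(is_complete, is_complete_after_drop)
```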



In [None]:
df_wide = df.sort_values(["id", "date"])  # ensure correct order
df_wide["rank"] = df_wide.groupby("id").cumcount()

df_wide = df_wide.pivot(index="id", columns="rank")
df_wide.columns = [f"{order}_{col}" for col, order in df_wide.columns]
df_wide = df_wide.reset_index()

df_wide.nunique()

In [None]:
# Extract column suffixes shared by 0_ and 1_ columns
suffixes = [col[2:] for col in df_wide.columns if col.startswith("0_")]
suffixes = [s for s in suffixes if f"1_{s}" in df_wide.columns]

# Subtract matching columns
df_diff = df_wide[[f"1_{s}" for s in suffixes]].values - df_wide[[f"0_{s}" for s in suffixes]].values

# Construct new DataFrame with diff_ column names
df_diff = pd.DataFrame(df_diff, columns=[f"diff_{s}" for s in suffixes])
df_diff.insert(0, "id", df_wide["id"])

In [None]:
df_diff.sample(10)

In [None]:
if "date" in df.columns:
    df = df.drop(columns="date")

## Correlation Matrix

In [None]:
df_ = df.drop(columns=["id"])
corr = df_.corr()

sns.heatmap(corr, annot=True, fmt=".2f", square=True)
plt.title("Correlation Matrix")
plt.show()

In [None]:
df.to_pickle(DF_PKL_PATH_TAR)