# Data Analysis

## Setup

For the purposes of our analysis, the following modules shall be required:

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import date

from sklearn.model_selection import train_test_split

from ipynb_utils import CFG

In [None]:
DATA_DIR = CFG["DATA_DIR"]

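# Path from which the raw data frame is loaded
DF_PKL_PATH_SRC = os.path.join(DATA_DIR, "df_raw.pkl")
# Path to which the fully processed data frame is written
DF_PKL_PATH_TAR = os.path.join(DATA_DIR, "df_processed.pkl")
# Path to which the data frame still containing both dates is written
DF_PKL_PATH_TAR_ = os.path.join(DATA_DIR, "df_duplicate.pkl")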

Let us now load the data frame containing the diabetes dataset.

In [None]:
df = pd.read_pickle(DF_PKL_PATH_SRC)

## First Inspection

As a preliminary inspection, let us invoke the info and sample methods of the data frame.

In [None]:
df.info()

In [None]:
df.sample(8)

We observe that the dataset contains no missing values, at least formally; the majority of columns appear to possess the appropriate data type.

Let us proceed with the renaming and reordering of columns for the sake of convenience.

In [None]:
df = df.rename(columns={
    "Age": "age",
    "diabetespedigreefunction": "dpf",
    # We will provide arguments below why "outcome" is
    # "has_diabetes" and not "has_no_diabetes".
    "outcome": "has_diabetes",
    "measurement_date": "date",
})

# Sort columns as follows:
# id, features lexicographically, target.
features_sorted = sorted([col for col in df.columns if col not in ["id", "has_diabetes"]])
cols = ["id"] + features_sorted + ["has_diabetes"]
df = df[cols]
 
df.info()

As announced in the preceding comment, we now argue that the column "has_diabetes" (formerly "outcome") indeed encodes the presence, rather than the absence, of type 1 diabetes.

First, let us observe that the column is indeed logically Boolean.

In [None]:
df["has_diabetes"].unique()
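
A check of this kind can also be made explicit. The following is a minimal sketch on made-up values (the series `toy` is a stand-in, not the notebook's data): it tests whether every entry is 0 or 1 and, only if so, casts the column to a proper Boolean type.

```python
import pandas as pd

# Made-up stand-in for df["has_diabetes"]; not the actual data.
toy = pd.Series([0, 1, 1, 0, 1], name="has_diabetes")

# True if every entry is either 0 or 1.
is_binary = bool(toy.isin([0, 1]).all())

# Cast only when the check passes, so that unexpected codes
# are not silently mapped to True.
as_bool = toy.astype(bool) if is_binary else toy

print(is_binary)
print(as_bool.dtype)
```

Guarding the cast matters because `astype(bool)` would turn any non-zero code into `True` without complaint.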

It is well established that increasing age and body mass index are associated with a greater likelihood of various diseases, including type 1 diabetes. Let us now examine the distributions of these features conditioned on diagnostic outcome.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# First subplot: Age
sns.kdeplot(data=df, x="age", hue="has_diabetes", common_norm=False, ax=axes[0])
axes[0].set_title("Age Distribution by Diabetes Status")
axes[0].set_xlabel("Age")
axes[0].set_ylabel("Density")

# Second subplot: BMI
sns.kdeplot(data=df, x="bmi", hue="has_diabetes", common_norm=False, ax=axes[1])
axes[1].set_title("BMI Distribution by Diabetes Status")
axes[1].set_xlabel("BMI")
axes[1].set_ylabel("Density")

plt.tight_layout()
plt.show()


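In both plots, the densities for class 1 are shifted towards higher values, which is what we expect of the diabetic group. Our claim that "outcome" signifies "has_diabetes" may therefore be deemed confirmed.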

Subsequently, let us assign the correct data type to the "date" feature.

In [None]:
df["date"] = pd.to_datetime(df["date"])

With the aid of the [accompanying paper](../archive/adap-diabetes.pdf), we may now complete a table explicating the column names.

| Name | Description |
| --- | --- |
| id | Patient ID |
| age | Age in years |
| bloodpressure | Diastolic blood pressure in $\mathrm{mm\,Hg}$ |
| bmi | Body mass index in $\mathrm{kg}/\mathrm{m}^2$ |
| date | Date of measurement |
| dpf | Diabetes pedigree function (further explained below) |
| glucose | Plasma glucose concentration at 2 hours in an oral glucose tolerance test (OGTT) |
| has_diabetes | Whether type 1 diabetes developed within 5 years |
| insulin | 2 hour serum insulin in $\mu\mathrm{U}/\mathrm{ml}$ |
| pregnancies | Number of pregnancies |
| skinthickness | Triceps skinfold thickness in millimetres |

The following plots may provide a visual impression of the individual feature distributions.

In [None]:
cols_blacklist = [
    "id", 
    "has_diabetes",
]

cols = [col for col in df.columns if col not in cols_blacklist]

fig, axes = plt.subplots(5, 2, figsize=(12, 20))
axes = axes.flatten()

for i, col in enumerate(cols):
    ax = axes[i]
    sns.histplot(data=df, x=col, ax=ax, color="black", linestyle="--")
    ax.set_title(col)
plt.tight_layout()
plt.show()

Let us now enumerate the salient features observed in these distributions:

- The "date" feature takes only two distinct values.
- The "bloodpressure" feature shows an implausible range of values.
- Implausible zero values appear in "bloodpressure", "bmi", "glucose", "insulin" and "skinthickness".

We will examine these anomalies in the following section.
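
The extent of the zero values can be quantified before treating them. The following is a minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions, not the notebook's data) that reports the number and the share of exact zeros per column.

```python
import pandas as pd

# Made-up stand-in for the notebook's data frame.
toy = pd.DataFrame({
    "bloodpressure": [72, 0, 64, 80],
    "bmi": [33.6, 26.6, 0.0, 28.1],
    "insulin": [0, 0, 94, 168],
})

# Boolean frame marking the exact zeros.
is_zero = toy == 0

# Number and share of zeros per column.
zero_profile = pd.DataFrame({
    "count": is_zero.sum(),
    "share": is_zero.mean(),
})

print(zero_profile)
```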

## The Insidious Data Duplication

The plot for the date column exhibits only two values. Indeed:

In [None]:
df["date"].unique()

During the discussion of the exercise, it was revealed that rows dated 2022-12-01 were to be removed as they were claimed to be synthetic.

However, it transpires that matters are not so straightforward. The dataset was not simply duplicated. We do not share the view of the instructors that this fact becomes evident after reading the accompanying paper.

A more careful examination demonstrates that the "difference data frame" is not identically zero: The difference columns for "bloodpressure" and "glucose" appear to follow bell-shaped distributions; the difference column for "insulin" exhibits a skewed distribution.
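
The "difference data frame" is not constructed in this notebook. One way to obtain such a frame is sketched below on made-up data, under the assumption that every patient id occurs exactly once per date; the frame `toy`, its values and the earlier date are hypothetical.

```python
import pandas as pd

# Made-up stand-in: three patients, each measured on two dates.
toy = pd.DataFrame({
    "id": [1, 2, 3, 1, 2, 3],
    "date": pd.to_datetime(["2022-11-01"] * 3 + ["2022-12-01"] * 3),
    "glucose": [148, 85, 183, 150, 85, 180],
    "insulin": [0, 94, 168, 0, 99, 168],
})

first_date, second_date = toy["date"].min(), toy["date"].max()

# One frame per date, indexed by patient id.
first = toy[toy["date"] == first_date].set_index("id").drop(columns="date")
second = toy[toy["date"] == second_date].set_index("id").drop(columns="date")

# Subtraction aligns on the id index, so rows are compared patient by patient.
diff = second - first

print(diff)
# Columns in which at least one patient differs between the two dates.
print((diff != 0).any())
```

If the later rows were exact copies, every entry of `diff` would be zero; histograms of its columns then reveal the shape of any perturbation.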

In our opinion, these manipulations are not *eo ipso* detectable with absolute certainty, particularly as the analyst must place some minimal trust in the integrity of the provided data.

Consequently, an honest data analysis should address this "date anomaly" more carefully. We outline this analysis in an [external notebook](./archive/1--analysis_hint-ignorant.ipynb); therefore, we store the processed data frame in its current state to disk.

In [None]:
df.to_pickle(DF_PKL_PATH_TAR_)

However, for the remaining part of our analysis (and modelling), we follow the hint and remove all affected rows.

In [None]:
df = df[df["date"] != "2022-12-01"]

As the column "date" has become constant, it no longer possesses any explanatory power. Accordingly, we may drop it entirely from the data frame.

In [None]:
if "date" in df.columns:
    df = df.drop(columns = ["date"])

## Implausible Values: Part I

In this section, we scrutinise the implausible zero values in the columns "bloodpressure", "bmi", "glucose", "insulin" and "skinthickness" and, further, the implausible range of values for "bloodpressure".

It is very likely that the implausible zero values are actually missing values. Hence, let us convert them accordingly:

In [None]:
cols = [
    "bloodpressure",
    "bmi",
    "glucose",
    "insulin",
    "skinthickness",
]

df[cols] = df[cols].replace(0, np.nan)

The profile of missing values is structured as follows:

In [None]:
df_tmp = df.isnull().agg(["sum", "mean"])
df_tmp.loc["mean"] = df_tmp.loc["mean"].round(2)
df_tmp.T

In [None]:
# Rows in which at least four of the measurements are missing
mask = df.isnull().sum(axis=1) >= 4

# Count how many rows satisfy this
count = mask.sum()

# Remove these rows
df = df[~mask]

In [None]:
df.isnull().sum()

In [None]:
cols = [
    "bloodpressure",
    "bmi",
    "glucose",
]

df[cols].isnull().any(axis=1).sum()

In [None]:
df = df.dropna(subset=cols)

In [None]:
df.isnull().sum()

A considerable number of rows are affected for "insulin" and "skinthickness". Simply dropping the corresponding rows would halve the dataset. Therefore, we need to impute these values.

From now on, we remove no further rows but impute instead. However, in order to avoid data leakage, we must first settle the train-test split.
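
The leakage concern can be illustrated with a minimal sketch on made-up numbers (the frames `train` and `test` are hypothetical stand-ins): the fill value is computed from the train rows only and then applied to both parts, whereas a statistic computed on all rows would let the test rows influence the imputation.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for a train and a test part.
train = pd.DataFrame({"insulin": [100.0, np.nan, 200.0, 300.0]})
test = pd.DataFrame({"insulin": [np.nan, 1000.0]})

# Leakage-free: the statistic is estimated on the train rows only.
train_median = train["insulin"].median()

# Leaky variant, shown for comparison only: the test rows enter the statistic.
leaky_median = pd.concat([train, test])["insulin"].median()

# The train statistic is applied to both parts.
train_imputed = train["insulin"].fillna(train_median)
test_imputed = test["insulin"].fillna(train_median)

print(train_median, leaky_median)
```

In this toy example, the extreme test value shifts the leaky median away from the train median, which is exactly the kind of information flow the split is meant to prevent.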


## Parenthesis: Train-Test Split

We now examine the target "has_diabetes". We have already seen that this column is zero-one valued. However, the classes are imbalanced:

In [None]:
df["has_diabetes"].mean()

Therefore, it is wise to pass the stratify argument to the splitting method.

As we remove no rows any more, it is most practical to implement the train-test split as a pseudo-feature "is_test", which indicates whether a row belongs to the test set. In this way, the split stays attached to the data for the subsequent modelling.

In [None]:
idx_0, idx_1 = train_test_split(
    df.index,
    # test_size=CFG["TEST_SIZE"],
    test_size=0.2,
    random_state=CFG["RSEED"],
    stratify=df["has_diabetes"]
)

df["is_test"] = 0
df.loc[idx_1, "is_test"] = 1

df_0 = df.loc[idx_0]
df_1 = df.loc[idx_1]

In [None]:
for name, df_split in [("Train", df_0), ("Test", df_1)]:
    total = len(df_split)
    positives = df_split["has_diabetes"].sum()
    percent = positives / total
    print(f"{name} Set:") 
    print(f"  Total Individuals   : {total:>4}")
    print(f"  With Diabetes (abs) : {positives:>4}")
    print(f"  With Diabetes (rel) : {percent:.2f}")


## Implausible Values: Part II

We resume our treatment of implausible values. We shall impute values for "insulin" and "skinthickness". 

In [None]:
cols = [
    "insulin", 
    "skinthickness",
]

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes = axes.flatten()

for i, col in enumerate(cols):
    ax = axes[i]
    sns.histplot(data=df_0, x=col, ax=ax, color="black", linestyle="--", bins=64)
    ax.set_title(col)
plt.tight_layout()
plt.show()

Insulin appears to be unimodal and right-skewed, i.e. possessing positive skewness. Indeed:

In [None]:
df_0["insulin"].skew()

We impute the missing values in the "insulin" column with the median of its train values.

In [None]:
median = df_0["insulin"].median()
df["insulin"] = df["insulin"].fillna(median)

Skin thickness appears to be unimodal and nearly symmetric. Indeed, the absolute value of its skewness does not exceed $1$:

In [None]:
df_0["skinthickness"].skew()

Therefore, it is reasonable to impute the missing values with the mean of the train values.

In [None]:
mean = df_0["skinthickness"].mean()
df["skinthickness"] = df["skinthickness"].fillna(mean)
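
The rule applied to both columns (median for a clearly skewed feature, mean for a nearly symmetric one) can be condensed into a small helper. This is a minimal sketch: the function `choose_fill_value`, its threshold parameter and the toy series are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def choose_fill_value(train_values: pd.Series, max_abs_skew: float = 1.0) -> float:
    """Return the mean if |skewness| <= max_abs_skew, otherwise the median."""
    skewness = train_values.skew()  # missing values are skipped by default
    if abs(skewness) <= max_abs_skew:
        return float(train_values.mean())
    return float(train_values.median())


# Made-up train values; NaN marks a missing measurement.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
right_skewed = pd.Series([1.0, 1.0, 1.0, 2.0, 50.0, np.nan])

fill_symmetric = choose_fill_value(symmetric)
fill_skewed = choose_fill_value(right_skewed)

print(fill_symmetric, fill_skewed)
```

The helper should receive train values only, so that the chosen statistic does not depend on the test rows.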

This step concludes our treatment of missing values, as all of them have now been addressed:

In [None]:
df.isnull().sum().sum()

## Final Depiction of the Polished Data

Before we render a pair plot of the polished data, we exclude the auxiliary columns "id" and "is_test".

In [None]:
cols_blacklist = [
    "id",
    "is_test",
    # "has_diabetes",
]
cols = [col for col in df.columns if col not in cols_blacklist]
df_subset = df[cols]

sns.pairplot(df_subset, hue="has_diabetes", diag_kind="kde")
plt.suptitle("Pairplot")
plt.show()

We complement the pair plot with the correlation matrix of the same columns:

In [None]:
corr = df_subset.corr(numeric_only=True)

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f")
plt.title("Correlation Matrix")
plt.tight_layout()
plt.show()


The features most strongly correlated with the target "has_diabetes" are "glucose" (0.49) and "bmi" (0.25).

The greatest correlation overall is between "age" and "pregnancies". This is to be expected, as each pregnancy takes a certain amount of time (about 9 months).
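
Such a ranking need not be read off the heatmap by eye. The following is a minimal sketch on made-up data (the frame `toy` and its values are illustrative assumptions) that sorts the features by the absolute value of their correlation with the target.

```python
import pandas as pd

# Made-up stand-in for the polished data frame.
toy = pd.DataFrame({
    "glucose": [90, 150, 100, 170, 95, 160],
    "bmi": [22.0, 31.0, 35.0, 28.0, 30.0, 25.0],
    "has_diabetes": [0, 1, 0, 1, 0, 1],
})

corr = toy.corr(numeric_only=True)

# Correlation of each feature with the target, strongest first.
target_corr = (
    corr["has_diabetes"]
    .drop("has_diabetes")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(target_corr)
```

Sorting by absolute value keeps strongly negative correlations near the top as well.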


---

As a final step in this notebook, we store the data frame, which now incorporates all our cleaning steps, to disk.

In [None]:
df.to_pickle(DF_PKL_PATH_TAR)