# Data Analysis

## Setup

For our analysis, the following modules are required:

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import date

# from sklearn.model_selection import train_test_split

from ipynb_utils import CFG

In [None]:
DATA_DIR = CFG["DATA_DIR"]

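# Path from which the raw data frame is loaded
DF_PKL_PATH_SRC = os.path.join(DATA_DIR, "df_raw.pkl")
# Path to which the fully processed data frame is written
DF_PKL_PATH_TAR = os.path.join(DATA_DIR, "df_processed.pkl")
# Path to which the intermediate data frame (both dates retained) is written
DF_PKL_PATH_TAR_ = os.path.join(DATA_DIR, "df_processed_.pkl")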

Let us now load the data frame containing the diabetes dataset.

In [None]:
df = pd.read_pickle(DF_PKL_PATH_SRC)

## First Inspection

As a preliminary inspection, let us invoke the info and sample methods of the data frame.

In [None]:
df.info()

In [None]:
df.sample(8)

We observe that the dataset contains no missing values, at least formally; the majority of columns appear to possess the appropriate data type.

Let us proceed with the renaming and reordering of columns for the sake of convenience.

In [None]:
df = df.rename(columns={
    "Age": "age",
    "diabetespedigreefunction": "dpf",
    # We will provide arguments below why "outcome" is
    # "has_diabetes" and not "has_no_diabetes".
    "outcome": "has_diabetes",
    "measurement_date": "date",
})

# Sort columns as follows:
# id, features lexicographically, target.
features_sorted = sorted([col for col in df.columns if col not in ["id", "has_diabetes"]])
cols = ["id"] + features_sorted + ["has_diabetes"]
df = df[cols]
 
df.info()

As indicated in the preceding comment, we now confirm that the column "has_diabetes" (formerly "outcome") indeed encodes the presence, rather than the absence, of type 2 diabetes.

It is well established that increasing age and body mass index are associated with a greater likelihood of various diseases, including type 2 diabetes. Let us now examine the distributions of these features conditioned on the diagnostic outcome.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# First subplot: Age
sns.kdeplot(data=df, x="age", hue="has_diabetes", common_norm=False, ax=axes[0])
axes[0].set_title("Age Distribution by Diabetes Status")
axes[0].set_xlabel("Age")
axes[0].set_ylabel("Density")

# Second subplot: BMI
sns.kdeplot(data=df, x="bmi", hue="has_diabetes", common_norm=False, ax=axes[1])
axes[1].set_title("BMI Distribution by Diabetes Status")
axes[1].set_xlabel("BMI")
axes[1].set_ylabel("Density")

plt.tight_layout()
plt.show()


Our claim that "outcome" signifies "has_diabetes" may be deemed confirmed by these two plots.
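A numeric cross-check of the same argument can be sketched as follows. The small data frame `toy` is invented purely for illustration and is not part of the dataset; only the column names are taken from the renaming step above. If the group labelled 1 shows the higher median age and BMI, the reading of "outcome" as "has_diabetes" is supported.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the real dataset).
toy = pd.DataFrame({
    "age": [22, 25, 31, 45, 52, 60],
    "bmi": [21.0, 24.5, 26.0, 33.1, 35.4, 31.2],
    "has_diabetes": [0, 0, 0, 1, 1, 1],
})

# Median of each feature within each outcome group.
group_medians = toy.groupby("has_diabetes")[["age", "bmi"]].median()
print(group_medians)
```

On the real data, `toy` would simply be replaced by `df`.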

Subsequently, let us assign the correct data type to the "date" feature.

In [None]:
df["date"] = pd.to_datetime(df["date"])

With the aid of the [accompanying paper](../archive/adap-diabetes.pdf), we may now complete a table explicating the column names.

| Name | Description |
| --- | --- |
| id | Patient ID |
| age | Age in years |
| bloodpressure | Diastolic blood pressure in $\mathrm{mmHg}$ |
| bmi | Body mass index in $\mathrm{kg}/\mathrm{m}^2$ |
| date | Date of measurement |
| dpf | Diabetes pedigree function (further explained below) |
| glucose | Plasma glucose concentration at 2 hours in an oral glucose tolerance test (OGTT) |
| has_diabetes | Whether diabetes developed within 5 years |
| insulin | 2-hour serum insulin in $\mu\mathrm{U}/\mathrm{ml}$ |
| pregnancies | Number of pregnancies |
| skinthickness | Triceps skinfold thickness in millimetres |

The following plots may provide a visual impression of the individual feature distributions.

In [None]:
cols = [col for col in df.columns if col not in ["id", "has_diabetes"]]

fig, axes = plt.subplots(5, 2, figsize=(12, 20))
axes = axes.flatten()

# One histogram per feature
for ax, col in zip(axes, cols):
    sns.histplot(data=df, x=col, ax=ax, color="black", linestyle="--")
    ax.set_title(col)

# Hide any axes of the grid that were not used
for ax in axes[len(cols):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()

Let us now enumerate the salient features observed in these distributions:

- The "date" feature takes only two distinct values.
- The "bloodpressure" feature shows an implausible range of values.
- Implausible zero values appear in "bloodpressure", "bmi", "glucose", "insulin" and "skinthickness".

We will examine these anomalies in the following section.
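The observations listed above can also be quantified rather than read off the plots. The following is a minimal sketch on an invented data frame `toy`; its values, and in particular the date 2022-11-01, are hypothetical and serve only to make the example self-contained.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the real dataset).
toy = pd.DataFrame({
    "bloodpressure": [72, 0, 64, 0],
    "bmi": [33.6, 26.6, 0.0, 28.1],
    "glucose": [148, 85, 183, 89],
    "date": pd.to_datetime(
        ["2022-11-01", "2022-11-01", "2022-12-01", "2022-12-01"]
    ),
})

# Number of exact zeros per measurement column.
zero_counts = (toy[["bloodpressure", "bmi", "glucose"]] == 0).sum()

# Number of distinct measurement dates.
n_dates = toy["date"].nunique()

print(zero_counts)
print(n_dates)
```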

## The Insidious Data Duplication

The plot for the date column exhibits only two values. Indeed:

In [None]:
df["date"].unique()

During the discussion of the exercise, it was revealed that rows dated 2022-12-01 were to be removed as they were claimed to be synthetic.

However, it transpires that matters are not so straightforward. The dataset was not simply duplicated. We do not share the view of the instructors that this fact becomes evident after reading the accompanying paper.

A more careful examination demonstrates that the "difference data frame", that is, the element-wise difference between the rows belonging to the two dates, is not identically zero: the difference columns for "bloodpressure" and "glucose" appear to follow bell-shaped distributions, whereas the difference column for "insulin" exhibits a skewed distribution.
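One way such a difference data frame could be constructed is sketched below. This is not necessarily the construction used in the external notebook: it assumes that every patient "id" occurs exactly once per date, and the data frame `toy` with its values and its first date is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the real dataset).
toy = pd.DataFrame({
    "id": [1, 2, 1, 2],
    "date": pd.to_datetime(
        ["2022-11-01", "2022-11-01", "2022-12-01", "2022-12-01"]
    ),
    "glucose": [148, 85, 151, 85],
    "insulin": [0, 94, 0, 99],
})

first_date, second_date = toy["date"].min(), toy["date"].max()

# Split by date and align both halves on the patient id (assumed unique per date).
first = toy[toy["date"] == first_date].set_index("id").drop(columns="date").sort_index()
second = toy[toy["date"] == second_date].set_index("id").drop(columns="date").sort_index()

# Element-wise difference; a pure duplication would yield zeros everywhere.
diff = second - first
print(diff)
```

Histograms of the columns of `diff` would then reveal the shape of the perturbation applied to each feature.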

In our opinion, these manipulations are not *eo ipso* detectable with absolute certainty, particularly as the analyst must place some minimal trust in the integrity of the provided data.

Consequently, an honest data analysis should address this "date anomaly" more carefully. We outline this analysis in an [external notebook](./archive/1--analysis_hint-ignorant.ipynb); therefore, we store the processed data frame in its current state to disk.

In [None]:
df.to_pickle(DF_PKL_PATH_TAR_)

However, for the remaining part of our analysis (and modelling), we follow the hint and remove all affected rows.

In [None]:
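# Keep only the rows not flagged as synthetic; copy to avoid chained-assignment warnings later
df = df[df["date"] != pd.Timestamp("2022-12-01")].copy()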

As the column "date" has become constant, it no longer possesses any explanatory power. Accordingly, we may drop it entirely from the data frame.
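That a column has become constant can be verified programmatically. The following minimal sketch uses an invented data frame `toy` (hypothetical values, including the date) and lists all columns with at most one distinct value.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the real dataset).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-11-01"] * 3),
    "age": [21, 35, 50],
})

# A column is constant if it holds at most one distinct value (NaN counted as a value).
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
print(constant_cols)
```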

In [None]:
if "date" in df.columns:
    df = df.drop(columns=["date"])

## Implausible Values

In this section, we scrutinise the implausible zero values in the columns "bloodpressure", "bmi", "glucose", "insulin" and "skinthickness" as well as the implausible range of values for "bloodpressure".

It is very likely that the implausible zero values are actually missing values. Hence, let us convert them accordingly:

In [None]:
cols = [
    "bloodpressure",
    "bmi",
    "glucose",
    "insulin",
    "skinthickness",
]

df[cols] = df[cols].replace(0, np.nan)

The profile of missing values is as follows:

In [None]:
df.isnull().sum()

How to proceed:

- A considerable number of rows is affected for "insulin" and "skinthickness".
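To judge how considerable the affected share is, the absolute counts can be complemented by the fraction of missing values per column. A minimal sketch, using an invented data frame `toy` whose values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only (not the real dataset).
toy = pd.DataFrame({
    "insulin": [np.nan, 94.0, np.nan, 168.0],
    "glucose": [148.0, 85.0, 183.0, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A high share speaks for dropping or imputing the column, a low share for dropping the few affected rows.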


### BMI

### Skin Thickness

There is a significant correlation between the body mass index and the triceps skinfold thickness:

In [None]:
df["bmi"].corr(df["skinthickness"])

Because of that, we decide to drop the column "skinthickness" entirely:

In [None]:
if "skinthickness" in df.columns:
    df = df.drop(columns="skinthickness")

### Insulin

In [None]:
INSULIN_LOWER = 10

df_tmp = df[df["insulin"] <= INSULIN_LOWER]

fig, ax = plt.subplots(figsize=(6, 4))
sns.histplot(data=df_tmp, x="insulin", ax=ax, color="black", linestyle="--")
ax.set_title("Insulin")

plt.tight_layout()
plt.show()

Insulin is stored as an integer, although it is a continuous quantity; very likely the values are rounded. Values slightly above zero are not unusual. Therefore, we cannot say which of the zero values were rounded down and which are falsely encoded null values. We leave them as they stand.
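The claim that the recorded values are integer-valued can be checked directly, even if the column has become a float column because of missing values. A minimal sketch on an invented series (hypothetical values):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only (not the real dataset).
insulin = pd.Series([np.nan, 94.0, 168.0, 15.0, np.nan])

# Ignore missing entries and test whether every remaining value has no fractional part.
observed = insulin.dropna()
all_integer_valued = bool((observed % 1 == 0).all())
print(all_integer_valued)
```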

### Blood Pressure

In [None]:
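# Diastolic readings at or below this threshold are treated as implausible
BLOODPRESSURE_LOWER = 30
df.loc[df["bloodpressure"] <= BLOODPRESSURE_LOWER, "bloodpressure"] = np.nan

# Number of missing blood pressure values after the conversion
df["bloodpressure"].isna().sum()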

### Glucose

Only 5 rows are affected. The code below drops the column:

In [None]:
if "glucose" in df.columns:
    df = df.drop(columns="glucose")

## Final Depiction of the Polished Data



Before we render a plot depicting the 

In [None]:
def get_summary(description, data):
    summary = {
        "description": description,
        "mean": np.mean(data),
        "median": np.median(data),
        "std": np.std(data),
        "count": len(data)
    }
    return summary

stats = [
    get_summary("All", df["dpf"]),
    get_summary("No Diabetes", df[df["has_diabetes"] == 0]["dpf"]),
    get_summary("Diabetes", df[df["has_diabetes"] == 1]["dpf"]),
]

summary_df = pd.DataFrame(stats)
print(summary_df)

In [None]:
df.info()

The dimensions are the same as those reported in the accompanying paper (and as those of the dataset provided by Kaggle).

As a final step in this notebook, we store the data frame, which now incorporates all our cleaning steps, to disk.

In [None]:
df.to_pickle(DF_PKL_PATH_TAR)