# Data Analysis

This notebook analyzes a dataset that was collected to capture the distribution of type-2 diabetes among female Pima Indian individuals living near Phoenix, Arizona.

## Setup

In [None]:
import os
import subprocess
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import date


In [None]:
# Path to root directory of the repo.
root_dir_ = subprocess.check_output(
    ["git", "rev-parse", "--show-toplevel"],
    text=True,
)
ROOT_DIR = root_dir_.strip()
# Path to data directory.
DATA_DIR = os.path.join(ROOT_DIR, "data")
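# Path from which the raw data frame is loaded.
DF_PKL_PATH_SRC = os.path.join(DATA_DIR, "df_raw.pkl")
# Path to which the processed data frame is written.
DF_PKL_PATH_TAR = os.path.join(DATA_DIR, "df_processed.pkl")
# Path for the intermediate dump (before rows are removed by date).
DF_PKL_PATH_TAR_ = os.path.join(DATA_DIR, "df_processed_.pkl")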


plt.style.use('tableau-colorblind10')

## First Inspection

In [None]:
# Load the raw data frame before inspecting it.
df = pd.read_pickle(DF_PKL_PATH_SRC)
df.info()

We observe that the dataset contains no missing values, at least formally. 

Most of the column data types also look plausible at first sight.

First, a minor column renaming for convenience.

In [None]:
df = df.rename(columns={
    "Age": "age",
    "diabetespedigreefunction": "dpf",
    # We argue below why "outcome" encodes the presence of diabetes.
    "outcome": "has_diabetes",
    "measurement_date": "date",
})

# Sort columns as follows:
# id, features lexicographically, target.
features_sorted = sorted([col for col in df.columns if col not in ["id", "has_diabetes"]])
cols = ["id"] + features_sorted + ["has_diabetes"]
df = df[cols]

cols

To exclude a further possible pitfall, let us examine whether "outcome" encodes the presence or the absence of diabetes. It is well known that obesity (BMI larger than 30) increases the risk of a variety of diseases, including type-2 diabetes. If "outcome" encodes "has_diabetes" rather than "has_not_diabetes", the BMI distribution of the positive group should therefore be shifted toward higher values.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))  # 1 row, 2 columns

# First subplot: Age
sns.kdeplot(data=df, x="age", hue="has_diabetes", common_norm=False, ax=axes[0])
axes[0].set_title("Age Distribution by Diabetes Status")
axes[0].set_xlabel("Age")
axes[0].set_ylabel("Density")

# Second subplot: BMI
sns.kdeplot(data=df, x="bmi", hue="has_diabetes", common_norm=False, ax=axes[1])
axes[1].set_title("BMI Distribution by Diabetes Status")
axes[1].set_xlabel("BMI")
axes[1].set_ylabel("Density")

plt.tight_layout()
plt.show()
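
As a numeric complement to the density plots, the following minimal sketch compares the share of obese individuals (BMI above 30) between the two label groups. The data frame `toy` and its values are made up for illustration only; with the real data one would use `df` instead. A clearly higher share in group 1 would support reading the label as "has diabetes".

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the real dataset.
toy = pd.DataFrame({
    "bmi": [22.0, 24.5, 31.0, 27.0, 35.5, 33.0, 29.0, 38.0],
    "has_diabetes": [0, 0, 0, 0, 1, 1, 1, 1],
})

OBESITY_THRESHOLD = 30  # BMI above this value counts as obese

# Share of obese individuals per label value.
obese_share = (
    toy.assign(is_obese=toy["bmi"] > OBESITY_THRESHOLD)
    .groupby("has_diabetes")["is_obese"]
    .mean()
)
print(obese_share)
```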


We also convert the date column to a proper datetime type.

In [None]:
df["date"] = pd.to_datetime(df["date"])

With the aid of the [accompanying paper](../docs/adap-diabetes.pdf), we can complete the explanatory table:

| Name | Description |
| --- | --- |
| id | patient id |
| age | age in y |
| bloodpressure | diastolic blood pressure in mm Hg |
| bmi | body-mass index in kg / m^2 |
| date | date of measurement |
| dpf | diabetes pedigree function (further explanation below) |
| glucose | plasma glucose concentration at 2 h in an oral glucose tolerance test (GTIT) |
| has_diabetes | whether the patient developed type-2 diabetes within 5 years |
| insulin | 2 h serum insulin in µU/ml |
| pregnancies | number of pregnancies |
| skinthickness | triceps skin fold thickness in mm | 

In [None]:
def get_summary(description, data):
    summary = {
        "description": description,
        "mean": np.mean(data),
        "median": np.median(data),
        "std": np.std(data),
        "count": len(data)
    }
    return summary

stats = [
    get_summary("All", df["dpf"]),
    get_summary("No Diabetes", df[df["has_diabetes"] == 0]["dpf"]),
    get_summary("Diabetes", df[df["has_diabetes"] == 1]["dpf"]),
]

summary_df = pd.DataFrame(stats)
print(summary_df)
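
The per-group part of this table can also be obtained with a single `groupby` call. The sketch below uses made-up values in `toy`. Note that pandas computes the sample standard deviation (`ddof=1`) by default, whereas `np.std` computes the population standard deviation (`ddof=0`), so the `std` column differs from `get_summary` unless `ddof` is aligned.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT taken from the real dataset.
toy = pd.DataFrame({
    "dpf": [0.2, 0.4, 0.6, 0.3, 0.9, 0.6],
    "has_diabetes": [0, 0, 0, 1, 1, 1],
})

# Per-group statistics of dpf in one call.
grouped = toy.groupby("has_diabetes")["dpf"].agg(["mean", "median", "std", "count"])
print(grouped)

# Compare the two standard deviation conventions on one group.
no_diabetes = toy.loc[toy["has_diabetes"] == 0, "dpf"]
std_sample = no_diabetes.std()                  # pandas default, ddof=1
std_population = np.std(no_diabetes.to_numpy())  # NumPy default, ddof=0
print(std_sample, std_population)
```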

## Feature Distributions

In [None]:
cols = [col for col in df.columns if col not in ["id", "has_diabetes"]]

fig, axes = plt.subplots(5, 2, figsize=(12, 20))
axes = axes.flatten()

for i, col in enumerate(cols):
    ax = axes[i]

    sns.histplot(data=df, x=col, ax=ax, color="black", linestyle="--")

    ax.set_title(col)
    # ax.legend()

plt.tight_layout()
plt.show()

Salient observations that we address in the following sections:

- (Very likely) zero-encoded NULL values for BMI, insulin and skinthickness
- Implausible blood pressure values
- Two measurement dates
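
The zero-encoded values and the measurement dates can be quantified by counting exact zeros per column and rows per date. This is only a sketch on a made-up frame `toy`; the column names follow the table above, and the first date is a placeholder.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the real dataset.
toy = pd.DataFrame({
    "bmi": [0.0, 25.1, 31.4, 0.0],
    "insulin": [0, 0, 94, 168],
    "skinthickness": [0, 23, 0, 35],
    "date": pd.to_datetime(
        ["2022-11-01", "2022-11-01", "2022-12-01", "2022-12-01"]
    ),
})

# Number of exact zeros per column (candidates for encoded NULL values).
zero_counts = (toy[["bmi", "insulin", "skinthickness"]] == 0).sum()

# Number of rows per measurement date.
rows_per_date = toy["date"].value_counts().sort_index()

print(zero_counts)
print(rows_per_date)
```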

## The Insidious Data Duplication



In [None]:
df.nunique()

In [None]:
df["date"].unique()

During the discussion of the exercise, it was revealed that we should remove all rows dated 2022-12-01, because they were claimed to be synthetic.

But as it turns out, it is not that easy. It is not the case that the dataset was simply duplicated. We do not share the opinion of the instructors that this fact is obvious after a perusal of the accompanying paper.

A more careful inspection shows that the "difference data frame" is not identically zero: the bloodpressure and glucose differences each appear to follow a bell-shaped distribution, whereas the insulin difference is skewed.
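
The computation of the "difference data frame" is not shown here, so the following is a minimal sketch of one way it could be built. It assumes that each patient appears once per date and that rows can be matched by `id`; both are assumptions, and the toy values and the first date are made up.

```python
import pandas as pd

# Hypothetical toy data: two measurement dates, three patients each.
toy = pd.DataFrame({
    "id": [1, 2, 3, 1, 2, 3],
    "date": pd.to_datetime(["2022-11-01"] * 3 + ["2022-12-01"] * 3),
    "glucose": [148, 85, 183, 150, 85, 180],
    "bloodpressure": [72, 66, 64, 70, 66, 67],
    "age": [50, 31, 32, 50, 31, 32],
})

features = ["glucose", "bloodpressure", "age"]
first_date, second_date = toy["date"].min(), toy["date"].max()

# Index both halves by patient id so that the subtraction aligns rows.
first = toy[toy["date"] == first_date].set_index("id")[features]
second = toy[toy["date"] == second_date].set_index("id")[features]

diff = second - first

# Columns that were copied verbatim have an all-zero difference.
unchanged = (diff == 0).all()

print(diff)
print(unchanged)
```

Histograms of the columns of `diff` would then reveal the shape of each difference distribution.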

In our opinion, these manipulations cannot be detected with absolute certainty from the data alone, particularly as the analyst must put some minimal amount of trust in the integrity of the provided data.

We moved the corresponding sketch out of this notebook. Let us dump the data frame in its current state:

In [None]:
df.to_pickle(DF_PKL_PATH_TAR_)

In this notebook, we continue the analysis and address the problem of missing values.

In [None]:
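# Drop the rows claimed to be synthetic; copy to avoid
# SettingWithCopyWarning in the assignments below.
df = df[df["date"] != "2022-12-01"].copy()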

In [None]:
df.info()

## Missing Values

In this section, we replace implausible values, which very likely encode missing measurements, with NaN.

### BMI

In [None]:
BMI_LOWER = 10
df.loc[df["bmi"] <= BMI_LOWER, "bmi"] = np.nan

df["bmi"].isna().sum()

### Skin Thickness

In [None]:
SKINTHICKNESS_LOWER = 2.5
df.loc[df["skinthickness"] <= SKINTHICKNESS_LOWER, "skinthickness"] = np.nan

df["skinthickness"].isna().sum()  # count of NaNs

By the way, we already have BMI as a related measure of body fat. Let us check how strongly the two are correlated.

In [None]:
df["bmi"].corr(df["skinthickness"])
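
Note that `Series.corr` only uses rows in which both values are present, so the entries set to NaN above do not enter the coefficient. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, NOT taken from the real dataset.
bmi = pd.Series([20.0, 25.0, np.nan, 35.0, 40.0])
skin = pd.Series([10.0, 15.0, 30.0, np.nan, 30.0])

# Pearson correlation; incomplete pairs are dropped automatically.
r_auto = bmi.corr(skin)

# Equivalent explicit computation on complete pairs only.
complete = pd.concat({"bmi": bmi, "skin": skin}, axis=1).dropna()
r_manual = complete["bmi"].corr(complete["skin"])

print(r_auto, r_manual)
```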

In [None]:
if "skinthickness" in df.columns:
    df = df.drop(columns="skinthickness")

### Insulin

In [None]:
INSULIN_LOWER = 10

df_tmp = df[df["insulin"] <= INSULIN_LOWER]

fig, ax = plt.subplots(figsize=(6, 4))
sns.histplot(data=df_tmp, x="insulin", ax=ax, color="black", linestyle="--")
ax.set_title("Insulin")

plt.tight_layout()
plt.show()

Insulin is stored as an integer, although it is a continuous quantity, so the values are very likely rounded. Values slightly above zero are not unusual. Therefore, we cannot tell which of the zero values are rounded down and which are falsely encoded NULL values. We leave them as they stand.
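
The two observations behind this decision, whole-number entries and a populated range just above zero, can be checked as follows. The series `insulin` holds made-up values for illustration only.

```python
import pandas as pd

INSULIN_LOWER = 10

# Hypothetical toy values, NOT taken from the real dataset.
insulin = pd.Series([0, 0, 2, 5, 8, 14, 94, 168, 0, 7])

# True if every value is a whole number, which hints at rounding.
all_integer = bool((insulin % 1 == 0).all())

# Frequency of each small value; populated small values make a
# genuine (rounded) zero plausible.
small_counts = insulin[insulin <= INSULIN_LOWER].value_counts().sort_index()

print(all_integer)
print(small_counts)
```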

### Blood Pressure

In [None]:
BLOODPRESSURE_LOWER = 30
df.loc[df["bloodpressure"] <= BLOODPRESSURE_LOWER, "bloodpressure"] = np.nan

df["bloodpressure"].isna().sum()
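
The thresholding steps in this section all follow the same pattern, which could be captured in a small helper. The function `mask_at_or_below` is a hypothetical name introduced for this sketch, and the toy values are made up; the bounds are the ones used above.

```python
import pandas as pd


def mask_at_or_below(frame, lower_bounds):
    """Return a copy of frame where values <= the column's bound are NaN."""
    out = frame.copy()
    for col, bound in lower_bounds.items():
        # Series.where keeps values where the condition holds, else NaN.
        out[col] = out[col].where(out[col] > bound)
    return out


# Hypothetical toy data, NOT taken from the real dataset.
toy = pd.DataFrame({
    "bmi": [0.0, 25.1, 31.4],
    "bloodpressure": [72, 0, 24],
})

cleaned = mask_at_or_below(toy, {"bmi": 10, "bloodpressure": 30})

# Overview of how many values were marked as missing per column.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Returning a copy keeps the input frame untouched, which makes it easy to compare the data before and after masking.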

## Final Polished Data



In [None]:
df.to_pickle(DF_PKL_PATH_TAR)