# Data Analysis

This notebook presents an analysis of the data under the hypothetical assumption that we had not been informed of the presence of a (moderately noisy) duplication within the original data frame.

## Setup

The following modules shall be employed:

In [None]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from ipynb_utils import CFG

We specify the location at which the data are stored on disk:

In [None]:
DATA_DIR = CFG["DATA_DIR"]

DF_PKL_PATH_SRC = os.path.join(DATA_DIR, "df_duplicate.pkl")

Then, the content of the file is loaded into a data frame.

In [None]:
df = pd.read_pickle(DF_PKL_PATH_SRC)

## Unstacking Paired Measurement Data

Let us take a glance at the number of unique values within the individual features, as well as at the values of the feature `"date"`.

In [None]:
df.nunique()

In [None]:
df["date"].unique()

We observe that there are only $768$ distinct IDs but $1536$ ($= 2 \times 768$) rows. Likewise, the attached paper mentions $768$ examinations. However, there are two dates of measurement, suggesting that each patient was examined on two separate occasions.

In [None]:
# Series with number of dates per "id".
s = df.groupby("id")["date"].nunique()

# Unique values in this series.
s.unique()
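
A slightly stricter check also counts the rows per `"id"`, so that repeated dates would not go unnoticed. The following is a minimal sketch on a toy frame; the names `df_toy`, `per_id` and `is_paired`, as well as the values, are illustrative assumptions and not part of the data set.

```python
import pandas as pd

# Toy long-format frame (hypothetical values): two rows per "id".
df_toy = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "date": pd.to_datetime(
            ["2022-12-01", "2022-12-13", "2022-12-01", "2022-12-13"]
        ),
        "glucose": [148, 150, 85, 83],
    }
)

# Number of rows and number of distinct dates per "id".
per_id = df_toy.groupby("id")["date"].agg(["size", "nunique"])

# True only if every "id" has exactly two rows on two distinct dates.
is_paired = bool(
    (per_id["size"] == 2).all() and (per_id["nunique"] == 2).all()
)
```

Applied to `df` instead of `df_toy`, the same expression should evaluate to `True` if the pairing holds as described.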

We may conclude that for each patient there are exactly two measurements recorded, one on 2022-12-01 and the other on 2022-12-13.

Following this observation, we create a data frame containing the two measurements for each patient. Columns suffixed with `"_0"` or `"_1"` indicate the first or second measurement in chronological order, respectively.

In [None]:
# Sort rows first by "id", then by "date".
df_wide = df.sort_values(["id", "date"])

# Index the measurement per "id".
df_wide["rank"] = df_wide.groupby("id").cumcount()

# Reshape data frame. Rows are indexed by values in
# the "id" column, columns are created for each unique
# value in the "rank" column.
df_wide = df_wide.pivot(index="id", columns="rank")

# Flatten column names from tuples to plain strings.
df_wide.columns = [f"{col}_{order}" for col, order in df_wide.columns]

# Convert "id" to a regular column.
df_wide = df_wide.reset_index()
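
To illustrate the reshaping, the same steps may be traced on a toy frame with two patients. This is only a sketch; `df_toy`, `wide` and the values are made up for illustration.

```python
import pandas as pd

# Toy long-format frame in arbitrary row order (hypothetical values).
df_toy = pd.DataFrame(
    {
        "id": [2, 1, 1, 2],
        "date": ["2022-12-13", "2022-12-01", "2022-12-13", "2022-12-01"],
        "glucose": [83, 148, 150, 85],
    }
)

# Sort rows, number the measurements per "id" (0 = first, 1 = second).
wide = df_toy.sort_values(["id", "date"])
wide["rank"] = wide.groupby("id").cumcount()

# One row per "id"; columns become (feature, rank) tuples.
wide = wide.pivot(index="id", columns="rank")

# Flatten the tuples, e.g. ("glucose", 0) -> "glucose_0".
wide.columns = [f"{col}_{order}" for col, order in wide.columns]
wide = wide.reset_index()
```

Each patient now occupies a single row with the columns `"date_0"`, `"date_1"`, `"glucose_0"` and `"glucose_1"`.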

From these pairs of measurements, we construct the corresponding difference columns.

In [None]:
# Columns for which a difference column shall be created.
cols = [col for col in df.columns if col not in ["id", "date"]]

# Columns associated with the first measurement.
cols_0 = [f"{col}_0" for col in cols]
# Columns associated with the second measurement.
cols_1 = [f"{col}_1" for col in cols]
# Columns associated with the difference of both measurements.
cols_delta = [f"{col}_delta" for col in cols]

df_wide[cols_delta] = df_wide[cols_1].values - df_wide[cols_0].values
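
The subtraction operates on the underlying arrays because pandas aligns data frames by column label. Since the `"_0"` and `"_1"` labels never coincide, a label-aligned difference would contain only missing values. A small sketch with made-up values (the names `toy`, `aligned` and `positional` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"glucose_0": [148, 85], "glucose_1": [150, 83]})

# Label-aligned subtraction: the column names differ, so the result
# carries the union of both labels and every cell is NaN.
aligned = toy[["glucose_1"]] - toy[["glucose_0"]]

# Positional subtraction on the underlying NumPy arrays.
positional = toy[["glucose_1"]].to_numpy() - toy[["glucose_0"]].to_numpy()
```

The positional variant relies on `cols_0` and `cols_1` listing the features in the same order, which holds here because both are derived from the same list.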

However, the new data frame still contains redundant information. On the one hand, the date-related columns have become obsolete. On the other hand, the values for the second measurement are entirely determined by those of the first measurement and the difference; consequently, they ought to be removed.

In [None]:
cols = []

# Date columns.
cols.extend([col for col in df_wide.columns if col.startswith("date_")])

# Columns associated with second measurement.
cols.extend([col for col in df_wide.columns if col in cols_1])

df_wide = df_wide.drop(columns=cols)
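
That the second measurement carries no additional information can be verified directly: it equals the first measurement plus the difference. A brief sketch with made-up values (`toy` and `reconstructed` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({"glucose_0": [148, 85], "glucose_1": [150, 83]})
toy["glucose_delta"] = toy["glucose_1"] - toy["glucose_0"]

# Second measurement recovered from first measurement and difference.
reconstructed = toy["glucose_0"] + toy["glucose_delta"]
```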

Consequently, a suffix for columns related to the first measurement is no longer necessary.

In [None]:
# Strip only the trailing "_0"; other occurrences within a name are kept.
df_wide = df_wide.rename(
    columns={
        col: col[: -len("_0")] for col in df_wide.columns if col.endswith("_0")
    }
)
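
Note that a plain substring replacement such as `name.replace("_0", "")` removes every occurrence of `"_0"`, not only a trailing one. The following sketch contrasts both approaches; the helper `strip_suffix` and the column name are hypothetical and serve only as an illustration.

```python
def strip_suffix(name: str, suffix: str = "_0") -> str:
    """Remove `suffix` only if `name` ends with it."""
    if suffix and name.endswith(suffix):
        return name[: -len(suffix)]
    return name


# Hypothetical column name that contains "_0" twice.
name = "marker_01_0"

replaced = name.replace("_0", "")  # removes both occurrences
stripped = strip_suffix(name)      # removes only the trailing one
```

For the column names used in this notebook both approaches presumably agree; the distinction matters only if a feature name itself contains `"_0"`.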

## Difference Columns

Let us examine the difference columns in greater detail. A glance at the number of unique values reveals which of these features are not trivial.

In [None]:
df_wide[cols_delta].nunique()

Naturally, constant columns do not contribute to explanatory power; therefore, it is safe to remove them.

In [None]:
# Delta columns containing at most one distinct non-NULL value.
df_tmp = df_wide[cols_delta].nunique()
cols = df_tmp[df_tmp <= 1].index.tolist()

df_wide = df_wide.drop(columns=cols)
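
It is worth recalling that `nunique()` ignores missing values by default. Hence the threshold `<= 1` also captures columns that are entirely missing, as well as columns that are constant apart from missing cells. A sketch on made-up columns (all names and values are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "a_delta": [0.0, 0.0, 0.0],           # constant
        "b_delta": [np.nan, np.nan, np.nan],  # entirely missing
        "c_delta": [1.0, np.nan, 1.0],        # constant apart from NaN
        "d_delta": [1.0, -2.0, 0.0],          # informative
    }
)

# Distinct non-missing values per column.
counts = toy.nunique()

# Columns with at most one distinct non-missing value.
constant_cols = counts[counts <= 1].index.tolist()
```

Whether a column like `"c_delta"` should be dropped depends on whether the pattern of missing cells is itself considered informative.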

Let us proceed with the reordering of columns for the sake of convenience.

In [None]:
# Sort columns as follows:
# id, features lexicographically, target.
features_sorted = sorted(
    [col for col in df_wide.columns if col not in ["id", "has_diabetes"]]
)

cols = ["id"] + features_sorted + ["has_diabetes"]
df_wide = df_wide[cols]

The following plots may provide a visual impression of the individual feature distributions.

In [None]:
cols = [col for col in df_wide.columns if col not in ["id", "has_diabetes"]]

fig, axes = plt.subplots(6, 2, figsize=(12, 20))
axes = axes.flatten()

for i, col in enumerate(cols):
    ax = axes[i]
    sns.histplot(data=df_wide, x=col, ax=ax, color="black", linestyle="--", bins=25)
    ax.set_title(col)

plt.tight_layout()
plt.show()
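
The grid above provides room for at most twelve histograms. Should the number of features differ, the grid shape may be derived from the column count. The helper `grid_shape` below is a hypothetical sketch and not part of the original analysis.

```python
import math


def grid_shape(n_plots: int, n_cols: int = 2) -> tuple:
    """Return (rows, columns) of the smallest grid holding `n_plots` axes."""
    n_rows = max(1, math.ceil(n_plots / n_cols))
    return n_rows, n_cols


# Example: eleven features in two columns require six rows.
shape = grid_shape(11)
```

The result could be passed on as `plt.subplots(*grid_shape(len(cols)), ...)`; surplus axes may then be hidden via `ax.set_visible(False)`.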

Furthermore, we shall examine the correlation matrix.

In [None]:
df_tmp = df_wide.drop(columns=["id"])
corr = df_tmp.corr()

sns.heatmap(corr, annot=True, fmt=".2f", square=True)
plt.title("Correlation Matrix")
plt.show()
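
In addition to reading the heatmap, the strongly correlated pairs may be listed programmatically. The sketch below uses a hypothetical helper `strong_pairs` and an arbitrarily chosen threshold of $0.8$; both are assumptions for illustration, and the toy values are made up.

```python
import pandas as pd


def strong_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> list:
    """List column pairs whose absolute Pearson correlation is >= threshold."""
    corr = frame.corr()
    names = corr.columns
    pairs = []
    # Visit each unordered pair once (upper triangle of the matrix).
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], float(r)))
    return pairs


toy = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [2.0, 4.0, 6.0, 8.0],    # linear in x
        "z": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with x and y
    }
)

pairs = strong_pairs(toy)
```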

The correlation between `"glucose"` and `"glucose_delta"` is strong; therefore, the removal of the difference column may be justified. *Mutatis mutandis*, the same applies to the columns related to blood pressure.

In [None]:
cols_blacklist = [
    "bloodpressure_delta",
    "glucose_delta",
]

# Drop the difference columns deemed redundant.
df_wide = df_wide.drop(columns=cols_blacklist)

However, the correlation between `"insulin"` and `"insulin_delta"` is too weak to warrant the removal of the difference column. A more sophisticated analysis would be required at this point.


---

Thereafter, we may resume the treatment of missing values in the same manner as in the [main analysis notebook](../2--analysis.ipynb). When converting implausible zero values to `NULL` values, care must also be taken to mark the corresponding cells in the columns matching `"*_delta"`.
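
A minimal sketch of this step follows, assuming a hypothetical list `zero_is_missing` of features in which a zero denotes a missing value; the toy frame and its values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "insulin": [0.0, 94.0, 168.0],
        "insulin_delta": [3.0, -2.0, 0.0],
    }
)

# Hypothetical choice of features in which zero means "not measured".
zero_is_missing = ["insulin"]

for col in zero_is_missing:
    # Rows in which the first measurement is an implausible zero.
    mask = toy[col] == 0
    toy.loc[mask, col] = np.nan

    # The difference is meaningless if the measurement is missing.
    delta_col = f"{col}_delta"
    if delta_col in toy.columns:
        toy.loc[mask, delta_col] = np.nan
```

Note that a zero in a `"*_delta"` column is a legitimate value (no change between the measurements) and is therefore left untouched. An implausible zero in the second measurement, i.e. where the feature and its difference sum to zero, might be treated analogously.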