# Main Cleaning Process and Exploratory Analysis
**This notebook is the main data cleaning and processing module.**

In this notebook, I will:
1. Combine all the diagnosis records from the different sections.
2. Apply chronological filtering to determine the earliest date of pre-diabetes and diabetes for each patient.
3. Merge with demographic information and calculate the age at each time point (pre, diab, death).
4. Exclude invalid records and label the patients.
5. Run the exploratory analysis: report basic statistics and plot charts that outline the demographics of the cohort.

For clarity: names relating to pre-diabetes patients carry the prefix pre, names relating to diabetes patients carry the prefix diab, and names relating to patients who progressed from pre-diabetes to diabetes carry the prefix pre2Diab.

We label the patients as follows:
- Pre-diabetes patient (pre-diabetes only): 0
- Pre-diabetes to diabetes patient (signs of pre-diabetes BEFORE the diabetes diagnosis): 1
- Diabetes patient (diabetes only, or diabetes confirmed earlier than pre-diabetes): 2

In [None]:
# hyper-parameter: time spectrum of 2, 5 or 10 years
TIME_SPEC = 10
CUT_OFF = '{year}-12-31'.format(year=2019-TIME_SPEC)
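To make the cut-off arithmetic concrete, the sketch below evaluates the same expression for each of the three time spectra named in the comment. The helper name `cut_off_for` is hypothetical and not part of the notebook; the end year 2019 is taken from the cell above.

```python
def cut_off_for(time_spec: int, end_year: int = 2019) -> str:
    """Return the cut-off date string for a given time spectrum (in years)."""
    return '{year}-12-31'.format(year=end_year - time_spec)

# one cut-off date per candidate time spectrum
cut_offs = {spec: cut_off_for(spec) for spec in (2, 5, 10)}
print(cut_offs)
```

A longer time spectrum moves the cut-off earlier, so fewer patients pass the enrolment filter.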

## 1. Combine All the Records from Different Sections

We read the tables we need from disk:
- Lab Test result
- Diagnosis result (Diabetes Only)
- Family Medicine
- DMCS
- Medication(drug)

Import the modules

In [None]:
import os
import time
from typing import *

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.patches import Polygon

# reload the local helper module automatically whenever it changes
%load_ext autoreload
%autoreload 2
import cleaning_tools as tools

In [None]:
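file_path = r'../tables/output'
# OUT_PATH is used by the plotting cells below but is never defined in this notebook;
# it is ASSUMED here to be the same output directory
OUT_PATH = file_path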

# read the first-diagnosis records from each table
lab_diag = pd.read_csv(r'../tables/output/first_diag_lab.csv', index_col=0)
dx_diag = pd.read_csv(r'../tables/output/first_diag_dx.csv', index_col=0)
fm_diag = pd.read_csv(r'../tables/output/first_diag_fm.csv', index_col=0)
dmcs_diag = pd.read_csv(r'../tables/output/first_diag_dmcs.csv', index_col=0)
drug_diag = pd.read_csv(r'../tables/output/first_diag_drug.csv', index_col=0)

In [None]:
# combine the diagnosis together
combine = pd.concat([lab_diag, dx_diag, fm_diag, dmcs_diag, drug_diag]).reset_index(drop=True)
combine

In [None]:
# convert the data type
combine["diff_hour"] = combine["diff_hour"].astype('int')

## 2. Implement the Chronological Filtering

Now we have the pre-diabetes and diabetes diagnosis records for every patient. We need to separate the patients into three groups:
1. group 0: pre-diabetes only
2. group 1: pre-diabetes to diabetes (the pre-diabetes record is earlier than the diabetes record)
3. group 2: diabetes only, or the diabetes record is not later than the pre-diabetes record
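As a cross-check on these rules, here is a minimal vectorized sketch on made-up data. It assumes one earliest record per patient and diagnosis type and that the type values are `"pre"` and `"diab"` (only `"pre"` is confirmed by the code below); it is an illustration, not the notebook's implementation.

```python
import numpy as np
import pandas as pd

# Toy earliest-record table: one row per patient and diagnosis type (made-up values).
toy = pd.DataFrame({
    "pseudo_patient_key": [1, 2, 3, 3, 4, 4],
    "diab_type": ["pre", "diab", "pre", "diab", "pre", "diab"],
    "diff_hour": [100, 200, 50, 500, 900, 300],
})

# One column per diagnosis type, holding the earliest hour offset (NaN if absent).
wide = toy.pivot(index="pseudo_patient_key", columns="diab_type", values="diff_hour")

has_pre = wide["pre"].notna()
has_diab = wide["diab"].notna()
pre_first = has_pre & has_diab & (wide["pre"] < wide["diab"])

# 0: pre-diabetes only, 1: pre-diabetes before diabetes, 2: everything else
wide["label"] = np.select([has_pre & ~has_diab, pre_first], [0, 1], default=2)
print(wide)
```

Patient 4 has both records, but the diabetes record comes first, so the patient falls into group 2 rather than group 1.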

In [None]:
start = time.time()

def chronoFilter(df: pd.DataFrame) -> pd.DataFrame:
    '''
    Apply the splitting rules to the earliest record(s) of one patient and return that patient's label.
    The splitting rules are defined as follows:
    - Pre-diabetes patient (pre-diabetes only): 0
    - Pre-diabetes to diabetes patient (signs of pre-diabetes BEFORE the diabetes diagnosis): 1
    - Diabetes patient (diabetes only, or diabetes confirmed earlier than pre-diabetes): 2
    
    Args:
        df: the rows of one patient group (at most one row per diab_type), waiting to be aggregated
    Return:
        one-row data frame with the patient's dates, hour offsets, sources and label
    
    '''
    
    dim = df.shape[0]
    assert dim <= 2 # make sure there are only earliest record(s) in this dataframe
    
    # one record only situation
    if dim == 1: 
        if df.iloc[0]["diab_type"] == "pre":
            return pd.DataFrame({
                "pseudo_patient_key":[df.iloc[0,0]], 
                "pre_dtm": [df.iloc[0,1]],
                "diab_dtm": [np.nan],
                "pre_diff_hour": [df.iloc[0,2]],
                "diab_diff_hour": [np.nan],
                "pre_src": [df.iloc[0,4]],
                "diab_src":[np.nan],
                "label": [0]
            })
        else: #diabetes
            return pd.DataFrame({
            "pseudo_patient_key":[df.iloc[0,0]], 
            "pre_dtm": [np.nan], 
            "diab_dtm": [df.iloc[0,1]],
            "pre_diff_hour": [np.nan],
            "diab_diff_hour": [df.iloc[0,2]],
            "pre_src": [np.nan],
            "diab_src":[df.iloc[0,4]],
            "label": [2]
        })
        
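    # two records situation: the rows arrive sorted by diab_type, so row 0 is taken to be
    # the diabetes record and row 1 the pre-diabetes record
    else: 
        if df.iloc[0]["diff_hour"] > df.iloc[1]["diff_hour"]: # pre to diabetes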
            return pd.DataFrame({
            "pseudo_patient_key":[df.iloc[0,0]], 
            "pre_dtm": [df.iloc[1,1]], 
            "diab_dtm": [df.iloc[0,1]],
            "pre_diff_hour": [df.iloc[1,2]],
            "diab_diff_hour": [df.iloc[0,2]],
            "pre_src": [df.iloc[1,4]],
            "diab_src":[df.iloc[0,4]],
            "label": [1]
        })
        else: # diabetes to pre
            return pd.DataFrame({
            "pseudo_patient_key":[df.iloc[0,0]], 
            "pre_dtm": [np.nan], 
            "diab_dtm": [df.iloc[0,1]],
            "pre_diff_hour": [np.nan],
            "diab_diff_hour": [df.iloc[0,2]],
            "pre_src": [np.nan],
            "diab_src":[df.iloc[0,4]],
            "label": [2]
        })

        
#######################################################################################
# get the row number 
rnk = tools.row_number(combine, "pseudo_patient_key", "diab_type", sort_key="diff_hour")
# get the earliest records for each patient each diab type
el_rec = combine[rnk == 1].sort_values(["pseudo_patient_key", "diab_type"])
# apply the rules on the dataframe
group_patient = el_rec.groupby(by="pseudo_patient_key").apply(chronoFilter)
# write the data to disk
group_patient.to_csv("../tables/output/group_patient.csv")

# avoid the name `min`, which would shadow the built-in function
elapsed_min = (time.time() - start) / 60
print("runtime: {:.4f} minutes".format(elapsed_min))
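`tools.row_number` comes from the project's own `cleaning_tools` module, whose source is not shown here. Assuming it behaves like SQL's `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)`, a rough pandas equivalent could look like the sketch below. The function name `row_number_sketch` and the toy data are hypothetical.

```python
import pandas as pd

def row_number_sketch(df, *partition_keys, sort_key):
    """Rank rows within each partition by `sort_key`, starting at 1."""
    return (
        df.groupby(list(partition_keys))[sort_key]
          .rank(method="first")   # ties are broken by order of appearance
          .astype(int)
    )

toy = pd.DataFrame({
    "pseudo_patient_key": [1, 1, 1, 2],
    "diab_type": ["pre", "pre", "diab", "pre"],
    "diff_hour": [300, 100, 200, 50],
})
rnk = row_number_sketch(toy, "pseudo_patient_key", "diab_type", sort_key="diff_hour")
earliest = toy[rnk == 1]  # earliest record per patient and diagnosis type
print(earliest)
```

Keeping only rank 1 leaves at most two rows per patient, which is what the `assert dim <= 2` in `chronoFilter` relies on.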

## 3. Merge with Demographic Information and Test Results

### 3.1 Demographic Information

In [None]:
# read the file of patient demographic information
patient_info = tools.fileReader(r"../DATAFILE", 'patient_data')
# read the grouped patients
group_patient = pd.read_csv(r"../tables/output/group_patient.csv", index_col=0)

In [None]:
left = group_patient
right = patient_info[["pseudo_patient_key", "dob_Y", "sex", "death_date_Y", "diff_in_hour_death_date"]]
diab_patients_info = pd.merge(left=left, right=right, how='left', on='pseudo_patient_key')

Compute the patient's age at each date field.

In [None]:
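def map_age(df, fields, dob="dob_Y"):
    '''
    map the date time fields into ages, in place
    Args:
        df: data frame holding the date fields
        fields: date fields to convert; "<prefix>_..." becomes "<prefix>_age"
        dob: date of birth field
    '''
        
    age_fields = [f.split('_')[0] + "_age" for f in fields]
    # 365.2425 days is the mean Gregorian year, the same length as np.timedelta64(1, "Y");
    # dividing by an explicit Timedelta avoids the "Y" unit, which recent pandas versions may reject
    year = pd.Timedelta(days=365.2425)
    for af, f in zip(age_fields, fields):
        df[af] = (pd.to_datetime(df[f]) - pd.to_datetime(df[dob])) / year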
        
######################################################################################################################  

# replace the '""' placeholder used for missing values with np.nan
diab_patient_age = diab_patients_info.replace(r'""', np.nan)
# map the date time fields into age inplace
map_age(diab_patient_age, ["pre_dtm", "diab_dtm", "death_date_Y"])
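To illustrate the field-name mapping and the age arithmetic, here is a small self-contained sketch on made-up dates. The constant `DAYS_PER_YEAR` is an assumption of this sketch (the mean Gregorian year), not a value taken from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "dob_Y": ["1950-01-01", "1980-01-01"],
    "pre_dtm": ["2010-01-01", "2000-01-01"],
    "death_date_Y": ["2015-01-01", None],   # second patient has no death date
})

DAYS_PER_YEAR = 365.2425  # assumed: mean Gregorian year

for field in ["pre_dtm", "death_date_Y"]:
    age_field = field.split("_")[0] + "_age"   # pre_dtm -> pre_age, death_date_Y -> death_age
    delta = pd.to_datetime(toy[field]) - pd.to_datetime(toy["dob_Y"])
    toy[age_field] = delta.dt.days / DAYS_PER_YEAR

print(toy[["pre_age", "death_age"]])
```

A missing date stays missing after the conversion, so patients without a death record end up with a null `death_age`.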

## 4. Exclusion and Labeling

### 4.1 Exclusion
Exclusion criteria:
 - Enrolled in the last {{TIME_SPEC}} years (pre_dtm > {{CUT_OFF}}).
 - Follow-up time shorter than {{TIME_SPEC}} years, i.e. patients who left the study because they died within {{TIME_SPEC}} years of the pre-diabetes date.
 - Diabetes only.
 - Patients younger than 18 (pre_age < 18).
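The four criteria can be expressed as boolean masks. The sketch below applies them to a made-up five-patient frame in a single step, whereas the notebook applies them one by one and prints the counts in between; the column names follow the notebook, and all values are invented.

```python
import numpy as np
import pandas as pd

TIME_SPEC = 10
CUT_OFF = pd.Timestamp("2009-12-31")
MIN_FOLLOW_UP = TIME_SPEC * 365.25 * 24  # required follow-up, in hours

toy = pd.DataFrame({
    "pseudo_patient_key": [1, 2, 3, 4, 5],
    "label": [0, 1, 0, 2, 0],
    "pre_dtm": pd.to_datetime(["2005-03-01", "2012-06-01", "2006-01-01", None, "2004-01-01"]),
    "pre_age": [45.0, 50.0, 60.0, np.nan, 16.0],
    "pre_diff_hour": [1000.0, 60000.0, 8000.0, np.nan, 500.0],
    "diff_in_hour_death_date": [np.nan, np.nan, 20000.0, np.nan, np.nan],
})

enrolled_early = toy["pre_dtm"] <= CUT_OFF
alive_long_enough = toy["diff_in_hour_death_date"].isna() | (
    toy["diff_in_hour_death_date"] - toy["pre_diff_hour"] > MIN_FOLLOW_UP
)
not_diab_only = toy["label"] < 2
adult = toy["pre_age"] >= 18

kept = toy[enrolled_early & alive_long_enough & not_diab_only & adult]
print(kept["pseudo_patient_key"].tolist())
```

Each of patients 2 to 5 fails exactly one criterion (late enrolment, early death, diabetes only, under 18), so only patient 1 remains.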

In [None]:
# convert to year of birth
diab_patient_age["dob_Y"] = diab_patient_age["dob_Y"].apply(lambda x : x[:4]).astype("int")
# compute the progression period in hours
diab_patient_age["prog_pd"] = diab_patient_age["diab_diff_hour"] - diab_patient_age["pre_diff_hour"] 
diab_patient_age["diff_in_hour_death_date"] = diab_patient_age["diff_in_hour_death_date"].astype("float")

In [None]:
MIN_FT = 30.5 * 24
# exclusion

def show_num():
    print(diab_patient_excluded.shape[0], diab_patient_excluded.query("label == 0").shape[0], diab_patient_excluded.query("label == 1").shape[0])
    
diab_patient_excluded = diab_patient_age # create a new reference

show_num()
# Enrolment in the last TIME_SPEC years: keep pre_dtm <= CUT_OFF (label 2 is kept here and dropped in the diabetes-only step)
diab_patient_excluded = diab_patient_excluded.query(f"pre_dtm <= '{CUT_OFF}' or label == 2")
show_num()

# Follow-up time is less than TIME_SPEC years: keep patients with no death date or who died more than TIME_SPEC years after pre_dtm
death_diff_hour = TIME_SPEC * 365.25 * 24
diab_patient_excluded = diab_patient_excluded.query(f"diff_in_hour_death_date - pre_diff_hour > {death_diff_hour} \
| diff_in_hour_death_date.isnull()", engine='python')
show_num()

# Diabetes only.
diab_patient_excluded = diab_patient_excluded.query("label < 2")
show_num()

# Patients younger than 18(pre_age < 18)
diab_patient_excluded = diab_patient_excluded.query("pre_age >= 18")
show_num()

# # Follow-up time less than 1 month i.e. 30.5*24 hours
# diab_patient_age = diab_patient_age.query(f"prog_pd > {MIN_FT} | prog_pd.isnull()", engine='python')


### 4.2 Labeling

In [None]:
# diab_patient_age = diab_patient_age.assign(prog_pd = diab_patient_age["diab_diff_hour"] - diab_patient_age["pre_diff_hour"])
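PERIOD_LONG = TIME_SPEC * 365.25 * 24
def cls_mapper(prog_pd: float) -> int:
    # 1: progressed to diabetes within TIME_SPEC years; 0: otherwise
    # (a missing prog_pd, i.e. no diabetes record, compares False and maps to 0)
    if prog_pd < PERIOD_LONG:
        return 1
    else:
        return 0
    
# assign returns a new frame, which avoids pandas' SettingWithCopyWarning on the filtered data
diab_patient_excluded = diab_patient_excluded.assign(cls=diab_patient_excluded["prog_pd"].apply(cls_mapper))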
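The class rule can also be checked on a few made-up progression periods. This sketch uses a vectorized comparison instead of `apply`, which is a substitution made here for brevity; it gives the same result because a missing period compares as False.

```python
import numpy as np
import pandas as pd

TIME_SPEC = 10
PERIOD_LONG = TIME_SPEC * 365.25 * 24  # 87,660 hours

toy = pd.DataFrame({
    "pre_diff_hour": [1000.0, 2000.0, 3000.0],
    "diab_diff_hour": [50000.0, np.nan, 100000.0],  # NaN: never progressed
})
toy["prog_pd"] = toy["diab_diff_hour"] - toy["pre_diff_hour"]

# 1: progressed within TIME_SPEC years, 0: progressed later or not at all
toy["cls"] = (toy["prog_pd"] < PERIOD_LONG).astype(int)
print(toy)
```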

In [None]:
# diab_patient_excluded.to_csv(f"../tables/output/diab_patient-{TIME_SPEC}year.csv")
diab_patient_excluded = pd.read_csv(f"../tables/output/diab_patient-{TIME_SPEC}year.csv", index_col=0)

## 5. Descriptive Summary

In [None]:
# define helper functions
def plot_barchart(s:pd.Series, n_bin:int, bin_width:int=1, sort=False, title=""):
    edges = [bin_width * i for i in range(n_bin)]
    binned = pd.cut(s, bins=edges)
    out = binned.value_counts(sort=sort)
    ax = out.plot.bar(rot=0, color='b', figsize=(6,4))
    # n_bin edges give n_bin - 1 bars, so label each bar with its left edge
    ax.set_xticklabels(edges[:-1])
    ax.set_title(title)

def plot_boxplot(df, cat, title):
    # one boxplot of death age per label
    fig, axs = plt.subplots(1,len(cat))
    fig.suptitle(title)
    for idx, name in enumerate(cat):
        data = df[df.label == idx]["death_age"].dropna() # drop patients without a death age
        axs[idx].boxplot(data, 0, '')
        axs[idx].set_title(name)
    

def show_number(df):
    tools.getNum(df, False)

In [None]:
diab_patient_excluded["dob_Y"]

Plot a bar chart of the number of patients in each group.

In [None]:
fig, ax = plt.subplots(figsize=(6,4))

count = diab_patient_excluded.groupby("label")["label"].count()

group = ["pre-diabetes only", "pre-diabetes to diabetes"]
bar_color = ['tab:red', 'tab:blue']
ax.bar(group, count, label = group, color = bar_color)

ax.set_ylabel("patients number")
ax.set_title("patients number of each group")

for i, v in enumerate(count.to_list()):
    plt.text(i, v, "{:,}".format(v), ha = 'center')
# plt.savefig(os.path.join(OUT_PATH, "chart", "patients_number.png"))

In [None]:
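# build every mask from diab_patient_excluded itself so that the mask and the frame share one index
death = diab_patient_excluded[diab_patient_excluded.death_age.notnull()]
pre = diab_patient_excluded[diab_patient_excluded.label == 0]
pre2diab = diab_patient_excluded[diab_patient_excluded.label == 1]
diab = diab_patient_excluded[diab_patient_excluded.label == 2] # expected to be empty after the exclusion in 4.1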

In [None]:
fig, ax = plt.subplots(figsize=(6,4))
count = death.groupby("label")["label"].count()

group = ["pre-diabetes only", "pre-diabetes to diabetes"]
bar_color = ['tab:green', 'tab:purple', 'tab:pink']
ax.bar(group, count, label = group, color = bar_color)

ax.set_ylabel("patients number")
ax.set_title("death number of each class")

for i, v in enumerate(count.to_list()):
    plt.text(i, v, "{:,}".format(v), ha = 'center')
plt.savefig(os.path.join(OUT_PATH,"charts", "death_class.png"))

### Prediabetes Distribution Against Age

In [None]:
# pre diabetes distribution against age
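age_bin = pd.cut(pre.pre_age, bins=[5 * (i) for i in range(3,23)])
out = age_bin.value_counts(sort=False)
ax = out.plot.bar(rot=30, color='b', figsize=(8,4))
# 20 edges (15 to 110) give 19 bars, so 19 labels are needed; the first bin only holds ages 18-20 after exclusion
ax.set_xticklabels(["18-20"] + [f"{i*5}-{i*5+5}" for i in range(4, 22)])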
ax.set_xlabel("Age")
ax.set_ylabel("Number")
_ = ax.set_title("Distribution of prediabetes against age")
# plt.savefig(os.path.join(OUT_PATH,"charts", "distribution_pre_age.png"))

In [None]:
temp = diab_patient_excluded.assign(pre_year = diab_patient_excluded["pre_dtm"].apply(lambda s : str(s)[:4]))
bin = pd.cut(temp["pre_age"], bins=[5 * (i) for i in range(3,23)])
temp["age_bin"] = bin
idx_to_dx = {0: "class0", 1: "class1"}
temp["cls"] = temp["cls"].apply(lambda x : idx_to_dx[x])
temp_group = temp.groupby(["age_bin", "cls"]).size().unstack()
percent = (temp_group["class1"] / (temp_group["class1"] + temp_group["class0"])) * 100
ax = temp_group.plot(kind="bar", stacked=True, colormap="Set2", figsize=(12, 6))
mid = temp_group["class1"] / 2 + temp_group["class0"]

for con in ax.containers:
    plt.setp(con, width=0.5)

x0, x1 = ax.get_xlim()
ax.set_xlim(x0-1, x1+1)

for i,per in enumerate(percent):
    # mid is indexed by age bin, so use positional access
    plt.text(i, mid.iloc[i], str(np.round(per,1)) + '%', va='center', ha='center', fontsize="small")

ax.set_ylabel("Number of patients")
ax.set_xlabel("Age of confirming prediabetes")
# 19 age bins need 19 labels
ax.set_xticklabels(["18-20"] + [f"{i*5}-{i*5+5}" for i in range(4, 22)], rotation=40)
_ = ax.set_title("Share of each class by age at prediabetes confirmation")

### Prediabetes Distribution Against Calendar Year

Proportion of pre-diabetes-only and pre-to-diabetes patients in each calendar year.

In [None]:
temp = diab_patient_excluded.assign(pre_year = diab_patient_excluded["pre_dtm"].apply(lambda s : str(s)[:4]))
idx_to_dx = {0: "class0", 1: "class1"}
temp["cls"] = temp["cls"].apply(lambda x : idx_to_dx[x])
temp_group = temp.groupby(["pre_year", "cls"]).size().unstack()
percent = (temp_group["class1"] / (temp_group["class1"] + temp_group["class0"])) * 100
ax = temp_group.plot(kind="bar", stacked=True, colormap="Set2", figsize=(12, 6))
mid = temp_group["class1"] / 2 + temp_group["class0"]

for con in ax.containers:
    plt.setp(con, width=0.5)

x0, x1 = ax.get_xlim()
ax.set_xlim(x0-1, x1+1)

for i,per in enumerate(percent):
    # mid is indexed by year string, so use positional access
    plt.text(i, mid.iloc[i], str(np.round(per,1)) + '%', va='center', ha='center')

ax.set_ylabel("Number of patients")
ax.set_xlabel("Year of confirming prediabetes")
ax.set_xticklabels(temp_group.index, rotation=40)
_ = ax.set_title("Share of each class by year of prediabetes confirmation")
# plt.savefig(os.path.join(OUT_PATH,"charts", "patient_portion_each_class.png"))
# plt.savefig(os.path.join(OUT_PATH,"charts", "survival_curve.png"))
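The percentage written on each bar is the share of class1 patients among all patients confirmed in that year. A minimal sketch of that computation on made-up records is shown below; `fill_value=0` is an addition of this sketch so that a year without class1 patients yields 0% instead of a missing value.

```python
import pandas as pd

toy = pd.DataFrame({
    "pre_year": ["2004", "2004", "2004", "2005", "2005"],
    "cls": ["class0", "class1", "class1", "class0", "class0"],
})

# rows: year of confirmation, columns: class, values: number of patients
counts = toy.groupby(["pre_year", "cls"]).size().unstack(fill_value=0)
percent = counts["class1"] / counts.sum(axis=1) * 100
print(percent.round(1))
```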

In [None]:
temp_group

In [None]:
years = [f'{x}-01-01' for x in range(2004,2018)]
cnt = []
INTV = 1
age = [INTV*i for i in range(20//INTV,90//INTV + 1)]
for a in age:
    cnt.append(diab_patient_age.query(f"diab_age < {a} and not (death_age < {a})")["pseudo_patient_key"].count())
    
surv = [diab_patient_age.shape[0] - c for c in cnt]

###################### plot ##########################
fig, ax = plt.subplots(figsize=(10,5))
# give the colour once; 'k-' together with color='r' defines it twice and triggers a matplotlib warning
ax.step(age, surv, 'r-')
ax.set_xticks([5*i for i in range(4,19)])
ax.tick_params(axis='x', labelrotation=45)
ax.set_xlabel("Prediabetes age")
ax.set_ylabel("Number of patients")
_ = ax.set_title("T2DM progression survival curve")
plt.savefig(os.path.join(OUT_PATH,"charts", "survival_curve.png"))

In [None]:
# investigate the progression-free period against the age at prediabetes
data = pre2diab.assign(period = pre2diab.diab_age - pre2diab.pre_age)[["pre_age", "diab_age", "death_age", "period"]]
bin = pd.cut(data.pre_age, bins = [5 * i for i in range(3,23)])
data = data.assign(bin = bin)
out = data.groupby("bin").agg({"period":["count","mean"]})

###################### plot ##########################
ax = out["period"]["mean"].plot.bar(rot=30, color='b', figsize=(10,4))
ax.plot(["18-20"] + [f"{i*5}-{i*5+5}" for i in range(4, 22)],out["period"]["mean"].tolist())
ax.set_xticklabels(["18-20"] + [f"{i*5}-{i*5+5}" for i in range(4, 22)])
ax.set_xlabel("Age")
ax.set_ylabel("Year")
ax.set_ylim([0,6])
_ = ax.set_title("Mean progression period with respect to prediabetes age")
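The binning and aggregation behind this chart can be followed on a handful of made-up patients. The sketch below uses only four 5-year bins to keep the output short; the column names follow the notebook, the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "pre_age": [19.0, 22.0, 24.0, 31.0],
    "diab_age": [21.0, 25.0, 30.0, 33.0],
})
toy["period"] = toy["diab_age"] - toy["pre_age"]   # progression period in years

edges = [5 * i for i in range(3, 8)]               # 15, 20, 25, 30, 35
toy["age_bin"] = pd.cut(toy["pre_age"], bins=edges)

# observed=False keeps empty bins, so every age range appears on the x-axis
summary = toy.groupby("age_bin", observed=False)["period"].agg(["count", "mean"])
print(summary)
```

An empty bin has a count of 0 and a missing mean, which shows up as a gap in the bar chart.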