# Data preparation
This notebook prepares the raw data for further analyses by correctly defining missing values and column types.

In [236]:
import numpy as np
import pandas as pd

We are now ready to import the dataset from the [mymandible](https://github.com/philipp-lampert/mymandible) GitHub repository. This is the unprocessed CSV file exported directly from the associated [REDCap](https://www.project-redcap.org/) project.

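We disable automatic detection of missing values by setting `na_filter=False`. Otherwise pandas would replace missing values with NumPy's `np.nan`, which forces affected integer and boolean columns to `float64` or `object`. In contrast, pandas' newer `pd.NA` supports nullable boolean and integer columns, so we replace the missing-value markers with `pd.NA` ourselves.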



In [237]:
df = pd.read_csv('/Users/philipp.lampert/repositories/mymandible/data/preprocessing/01_raw_data.csv', na_filter = False)
df = df.replace(["NaN", ""], pd.NA)
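
To illustrate why the distinction matters, here is a minimal sketch on toy data (the values are made up and not taken from the dataset). A gap stored as `np.nan` upcasts an integer column to `float64`, whereas `pd.NA` combined with a nullable dtype keeps integers and booleans intact.

```python
import numpy as np
import pandas as pd

# np.nan is a float, so a single missing value upcasts the whole column
with_nan = pd.Series([54, 61, np.nan])
print(with_nan.dtype)  # float64

# Nullable extension dtypes keep a separate missing-value mask instead
ages = pd.Series([54, 61, pd.NA], dtype="UInt8")
flags = pd.Series([True, False, pd.NA], dtype="boolean")
print(ages.dtype, flags.dtype)  # UInt8 boolean
```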

For multiple-choice variables, REDCap exports each choice as a binary column named `variable___option`. Importantly, missing values are not stored inside each of these columns but in an additional binary column named `variable___nan`. Therefore, we have to set every `variable___option` column to `pd.NA` in each row where `variable___nan == 1`, after which the `variable___nan` column is no longer needed.

In [238]:
nan_columns = df.filter(like = "___nan").columns
multiple_choice_variables = [name.split("___nan")[0] for name in nan_columns]

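for variable in multiple_choice_variables:
    # Rows in which the whole multiple-choice question was left unanswered
    row_with_nan = df[f"{variable}___nan"] == 1
    # Match the full "variable___" prefix so that e.g. "complication"
    # does not also select the "complication_plate___..." columns
    columns = df.columns[df.columns.str.startswith(f"{variable}___")]
    df.loc[row_with_nan, columns] = pd.NA
    df = df.drop(f"{variable}___nan", axis=1)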
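
The effect of this step is easiest to see on a small example. The sketch below uses a hypothetical `color` variable (not part of the dataset) plus a similarly named `color_shade` variable, and selects the option columns by the full `color___` prefix so that the second variable is left alone.

```python
import pandas as pd

# Hypothetical export: one multiple-choice variable "color" with its
# "___nan" indicator, plus an unrelated variable "color_shade"
toy = pd.DataFrame(
    {
        "color___red": [1, 0, 0],
        "color___blue": [0, 1, 0],
        "color___nan": [0, 0, 1],
        "color_shade___dark": [1, 1, 0],
    }
).astype("Int64")

# Rows where the question was not answered at all
unanswered = (toy["color___nan"] == 1).to_numpy(dtype=bool)

# Only columns that belong to "color" itself
option_columns = [c for c in toy.columns if c.startswith("color___")]

toy.loc[unanswered, option_columns] = pd.NA
toy = toy.drop(columns="color___nan")
print(toy)
```

In the third row both `color___red` and `color___blue` end up missing, while `color_shade___dark` keeps its recorded value.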

With missing values now correctly represented in our dataframe, let's remove the auto-generated REDCap columns that are only relevant during data collection.

In [239]:
df = df.drop(["id", "predictors_complete", "outcomes_complete", "imaging_complete"], axis = 1)

Now, we will convert each column to its appropriate datatype (boolean, integer, categorical etc.).

In [240]:
data_types = {
        "boolean": {
            "sex_female",
            "skin_transplanted",
            "flap_loss",
            "wound_infection",
            "nonunion_6_12",
            "nonunion_12_24",
        },
        "category": {
            "indication",
            "prior_flap",
            "flap_revision",
            "flap_donor_site",
            "plate_type",
            "long_plate_thickness",
            "mini_plate_thickness",
            "tmj_replacement_type",
            "flap_segment_count",
            "flap_loss_type",
            "imaging_6_12",
            "imaging_12_24_months"
        },
        "string": {
            "which_autoimmune_disease",
            "which_bleeding_disorder",
        },
        "UInt8": {"age_surgery_years", "height_cm", "weight_kg"},
        "UInt16": {"surgery_duration_min"},
        "Float32": {"bmi"},
    }

for column in df.columns:
    # All multiple-choice columns have three underscores in their name
    if "___" in column:
        df[column] = df[column].astype("boolean")
    elif column in data_types["boolean"]:
        df[column] = np.where(
            df[column] == "True",
            True,
            np.where(df[column] == "False", False, df[column]),
        )
        df[column] = df[column].astype("boolean")
    elif column.startswith("days_to_"):
        df[column] = df[column].astype("UInt16")
    else:
        for data_type in ["category", "string", "UInt8", "UInt16", "Float32"]:
            if column in data_types[data_type]:
                df[column] = df[column].astype(data_type)
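
As an aside, the string-to-boolean step can also be written with `Series.map`. This is only an alternative sketch on made-up values, not the code used in this notebook.

```python
import pandas as pd

# Toy column as it looks after import: strings plus a missing value
raw = pd.Series(["True", "False", pd.NA, "True"], dtype="object")

# Translate the two known strings; anything not in the mapping becomes
# missing, which the nullable "boolean" dtype then stores as <NA>
as_bool = raw.map({"True": True, "False": False}).astype("boolean")
print(as_bool)
```

One caveat of this variant is that unexpected strings (for example a typo) silently turn into missing values, so it assumes the column contains only `"True"`, `"False"` and missing entries.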

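Complications whose recorded `days_to_...` value exceeds 365 are set to `False`, so that only events within the first year count as outcomes. The value counts are printed before and after each update to show how many rows changed.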

In [241]:
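def set_max_outcome_time(outcome, days_to_outcome):
    # Show the distribution before the update
    print(df[outcome].value_counts())
    # Rows with a missing day count compare as <NA>; treat them as "not late"
    late = (df[days_to_outcome] > 365).fillna(False)
    # Only assign when something changes, so untouched columns keep their dtype
    if late.any():
        df.loc[late, outcome] = False
    # Show the distribution after the update
    print(df[outcome].value_counts())
    print()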
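
To check the cutoff rule in isolation, the following sketch applies it to two toy Series. The helper `censor_late_outcomes` and its `max_days` parameter are hypothetical names introduced for illustration, and the values are made up.

```python
import pandas as pd

def censor_late_outcomes(outcome, days, max_days=365):
    """Return a copy of `outcome` with events after `max_days` set to False."""
    # A missing day count compares as <NA>; count it as "not late"
    late = (days > max_days).fillna(False).to_numpy(dtype=bool)
    result = outcome.copy()
    result[late] = False
    return result

outcome = pd.Series([True, True, False, pd.NA], dtype="boolean")
days = pd.Series([30, 400, pd.NA, 500], dtype="UInt16")

print(censor_late_outcomes(outcome, days).tolist())
# [True, False, False, False]
```

Note that the last row, where the outcome itself is missing but the day count exceeds the cutoff, also becomes `False`, which mirrors the behaviour of the function in this notebook.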

In [242]:
set_max_outcome_time('flap_revision', 'days_to_flap_revision')
set_max_outcome_time('flap_loss', 'days_to_flap_loss')
set_max_outcome_time('complication___whd_recipient_site', 'days_to_whd_recipient_site')
set_max_outcome_time('complication___partial_necrosis', 'days_to_partial_necrosis')
set_max_outcome_time('complication___whd_donor_site', 'days_to_whd_donor_site')
set_max_outcome_time('complication___salivary_fistula', 'days_to_salivary_fistula')
set_max_outcome_time('complication___osteoradionecrosis', 'days_to_osteoradionecrosis')
set_max_outcome_time('wound_infection', 'days_to_wound_infection')
set_max_outcome_time('complication___bone_exposure', 'days_to_bone_exposure')
set_max_outcome_time('complication_plate___exposure', 'days_to_plate_exposure')
set_max_outcome_time('complication_plate___removal', 'days_to_plate_removal')
set_max_outcome_time('complication_plate___fracture', 'days_to_plate_fracture')
set_max_outcome_time('complication_plate___loosening', 'days_to_plate_loosening')
set_max_outcome_time('complication_bony___fracture', 'days_to_fracture')
set_max_outcome_time('complication_bony___dislocation', 'days_to_dislocation')

flap_revision
none        330
venous       13
arterial     12
Name: count, dtype: int64
flap_revision
none        330
venous       13
arterial     12
Name: count, dtype: int64

flap_loss
False    322
True      33
Name: count, dtype: Int64
flap_loss
False    332
True      23
Name: count, dtype: Int64

complication___whd_recipient_site
False    226
True     125
Name: count, dtype: Int64
complication___whd_recipient_site
False    236
True     115
Name: count, dtype: Int64

complication___partial_necrosis
False    323
True      28
Name: count, dtype: Int64
complication___partial_necrosis
False    327
True      24
Name: count, dtype: Int64

complication___whd_donor_site
False    258
True      93
Name: count, dtype: Int64
complication___whd_donor_site
False    259
True      92
Name: count, dtype: Int64

complication___salivary_fistula
False    342
True       9
Name: count, dtype: Int64
complication___salivary_fistula
False    342
True       9
Name: count, dtype: Int64

complication___osteora

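Next, we derive new boolean outcome variables. The first one, `soft_tissue_complication`, indicates whether any of the most relevant soft-tissue complications occurred (see the selection in the code). It is set to missing whenever one of its components is missing.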

In [243]:
soft_tissue_complication = [
    'complication___whd_recipient_site',
    'complication___partial_necrosis',
    'complication___bone_exposure',
    'complication_plate___exposure',
    'wound_infection'
    ]

df['soft_tissue_complication'] = np.where(df[soft_tissue_complication].isna().any(axis=1), pd.NA, df[soft_tissue_complication].any(axis=1))
df['soft_tissue_complication'] = df['soft_tissue_complication'].astype('boolean')

In [244]:
df['nonunion'] = np.where(
    (df['nonunion_6_12'].isna()) & (df['nonunion_12_24'].isna()), pd.NA,
    np.where(df[['nonunion_6_12', 'nonunion_12_24']].any(axis=1), True, False)
)
df['nonunion'] = df['nonunion'].astype('boolean')

In [245]:
# ORN: recorded directly, or bone exposure combined with post-surgical radiotherapy
orn_mask = ((df['complication___bone_exposure'] & df['radiotherapy___post_surgery']) | df['complication___osteoradionecrosis'])
df['orn'] = orn_mask
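
The `orn` mask combines nullable boolean columns with `&` and `|`. pandas evaluates these operators with three-valued (Kleene) logic, so a missing operand only yields a missing result when the other operand cannot decide the outcome. A small sketch on made-up values:

```python
import pandas as pd

a = pd.Series([True, False, False, pd.NA], dtype="boolean")
b = pd.Series([pd.NA, pd.NA, False, pd.NA], dtype="boolean")

# True | NA is True, because the result is True whatever NA stands for
either = a | b
print(either.tolist())  # [True, <NA>, False, <NA>]

# False & NA is False for the same reason
both = a & b
print(both.tolist())  # [<NA>, False, False, <NA>]
```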

In [246]:
plate_failure = [
    'complication_plate___fracture',
    'complication_plate___loosening',
    'complication_bony___dislocation'
]
df['plate_failure'] = np.where(df[plate_failure].any(axis=1), True, False)
df['plate_failure'] = df['plate_failure'].astype('boolean')

In [247]:
any_complication = [
    'soft_tissue_complication',
    'nonunion',
    'flap_loss',
    'orn',
    'plate_failure'
    ]

df['any_complication'] = np.where(df[any_complication].any(axis=1), True, False)
df['any_complication'] = df['any_complication'].astype('boolean')
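
The derived outcomes above treat missing components in three different ways: `soft_tissue_complication` is missing as soon as one component is missing, `nonunion` is missing only when both components are missing, and `plate_failure` and `any_complication` ignore missing components because `DataFrame.any` skips them by default. The sketch below compares the three policies on a toy frame with made-up values; the variable names `strict`, `relaxed` and `lenient` are just labels for this illustration.

```python
import pandas as pd

parts = pd.DataFrame(
    {
        "a": pd.array([True, False, False, pd.NA], dtype="boolean"),
        "b": pd.array([pd.NA, False, pd.NA, pd.NA], dtype="boolean"),
    }
)

# any() skips missing values by default, so this never contains <NA>
any_true = parts.any(axis=1).astype("boolean")
some_missing = parts.isna().any(axis=1).to_numpy()
all_missing = parts.isna().all(axis=1).to_numpy()

# Policy of soft_tissue_complication: missing if any component is missing
strict = any_true.copy()
strict[some_missing] = pd.NA

# Policy of nonunion: missing only if every component is missing
relaxed = any_true.copy()
relaxed[all_missing] = pd.NA

# Policy of plate_failure and any_complication: missing counts as False
lenient = any_true

print(pd.DataFrame({"strict": strict, "relaxed": relaxed, "lenient": lenient}))
```

The first row (`True` and missing) shows the difference most clearly: it is missing under the strict policy but `True` under the other two.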

We can now save the dataframe in the Parquet format to preserve the data types, something that would not be possible in the CSV format.

In [248]:
df.to_parquet('/Users/philipp.lampert/repositories/mymandible/data/preprocessing/02_preprocessed.parquet')
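
To make the point about CSV concrete, the following sketch round-trips a tiny typed frame through an in-memory CSV. The values are made up, and only the CSV half is shown so that the example runs without a Parquet engine installed.

```python
import io
import pandas as pd

typed = pd.DataFrame(
    {
        "plate_type": pd.Categorical(["cad_long", "cad_mini"]),
        "age_surgery_years": pd.array([54, pd.NA], dtype="UInt8"),
    }
)

# Write to CSV and read it straight back
buffer = io.StringIO()
typed.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(typed.dtypes)     # category and UInt8
print(restored.dtypes)  # the category is now plain text, the integers are float64
```

CSV stores only text, so the reader has to guess the types again, and the missing age forces the column to `float64`.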

In [249]:
df['plate_type'].value_counts()

plate_type
cad_long    222
cad_mix     103
cad_mini     30
Name: count, dtype: int64