<a href="https://colab.research.google.com/github/philipp-lampert/mymandible/blob/main/data_science/data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to the data preparation notebook
Note: The project is still under active development.

First, let's import the necessary libraries and set the option to display all rows of each output.

In [None]:
import numpy as np
import pandas as pd
pd.set_option("display.max_rows", None)

We are now ready to import the dataset from the [mymandible](https://github.com/philipp-lampert/mymandible) GitHub repository. This is the unprocessed CSV file exported directly from the associated [REDCap](https://www.project-redcap.org/) project.

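We disable the automatic detection of missing values by setting `na_filter = False`. Otherwise, pandas would represent missing values as NumPy's `np.nan`, which, in contrast to pandas' newer `pd.NA`, does not allow for nullable boolean and integer columns: integer columns would be upcast to float and boolean columns to object. Instead, we replace `"NaN"` and empty strings with `pd.NA` ourselves right after reading the file.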
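
To illustrate the difference, here is a minimal sketch on made-up values (the series names are hypothetical and not part of the dataset). It only shows how the resulting dtypes differ between `np.nan` and `pd.NA`.

```python
import numpy as np
import pandas as pd

# With np.nan, an integer column is silently upcast to float64
int_with_nan = pd.Series([1, 2, np.nan])
print(int_with_nan.dtype)

# With pd.NA and a nullable dtype, the values stay integers
int_with_na = pd.Series([1, 2, pd.NA], dtype="UInt8")
print(int_with_na.dtype)

# Booleans behave similarly: object dtype with np.nan,
# nullable "boolean" dtype with pd.NA
bool_with_nan = pd.Series([True, False, np.nan])
bool_with_na = pd.Series([True, False, pd.NA], dtype="boolean")
print(bool_with_nan.dtype, bool_with_na.dtype)
```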



In [23]:
df = pd.read_csv("https://raw.githubusercontent.com/philipp-lampert/mymandible/main/data_science/BFlapsRevised_DATA_2023-10-24_1441.csv", na_filter = False)
df = df.replace(["NaN", ""], pd.NA)

For multiple-choice variables, REDCap exports each choice as a binary column with a naming convention of `variable___option`. Importantly, missing values are not stored directly inside each column but in an additional binary column named `variable___nan`. Therefore, we have to set every `variable___option` column to `pd.NA` in each row where `variable___nan == 1`. Afterwards, the `variable___nan` column is no longer needed and can be dropped.
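
As a small illustration of this convention, consider the following sketch. The toy dataframe and its values are made up; only the `variable___option` / `variable___nan` naming follows the export described above. The columns are converted to the nullable `boolean` dtype first so that they can hold `pd.NA`.

```python
import pandas as pd

# Hypothetical toy export with one multiple-choice variable "comorbidity"
toy = pd.DataFrame(
    {
        "comorbidity___smoking": [1, 0, 0],
        "comorbidity___alcohol": [0, 1, 0],
        "comorbidity___nan": [0, 0, 1],  # third row: answer is missing
    }
)

option_columns = ["comorbidity___smoking", "comorbidity___alcohol"]
is_missing = toy["comorbidity___nan"] == 1

# Nullable booleans can store pd.NA next to True and False
toy = toy.astype({column: "boolean" for column in option_columns})
toy.loc[is_missing, option_columns] = pd.NA
toy = toy.drop(columns="comorbidity___nan")
print(toy)
```

In the third row, both option columns now show `<NA>` instead of `False`, so "not selected" and "unknown" are no longer conflated.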

In [13]:
nan_columns = df.filter(like = "___nan").columns
multiple_choice_variables = [name.split("___nan")[0] for name in nan_columns]

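for variable in multiple_choice_variables:
    # Rows in which the whole multiple-choice variable is missing
    row_with_nan = df[f"{variable}___nan"] == 1
    # Match the full "variable___" prefix so that other columns sharing the same name stem are not affected
    columns = df.columns[df.columns.str.startswith(f"{variable}___")]
    df.loc[row_with_nan, columns] = pd.NA
    df = df.drop(f"{variable}___nan", axis=1)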

With missing values now correctly represented in our dataframe, let's remove the auto-generated REDCap columns that are only relevant during data collection.

In [None]:
df = df.drop(["id", "predictors_complete", "outcomes_complete", "imaging_complete"], axis = 1)

Now, we will define the type of each column (boolean, integer, categorical etc.).

In [15]:
# Multiple-choice fields can immediately be converted to boolean
for column in df.columns:
    if "___" in column:
        df[column] = df[column].astype('boolean')

# All other boolean columns have to be transformed using np.where
boolean_columns = ['sex_female', 'skin_transplanted', 'flap_loss', 'wound_infection', 'nonunion', 'tmj_luxation']

for column in boolean_columns:
    df[column] = np.where(df[column] == 'True', True, np.where(df[column] == 'False', False, df[column]))
    df[column] = df[column].astype('boolean')

# The remaining column types are defined individually
df = df.astype(
    {
       'indication' : 'category',
       'which_autoimmune_disease' : 'string',
       'which_bleeding_disorder' : 'string',
       'prior_flap' : 'category',
       'age_surgery_years' : 'UInt8',
       'flap_donor_site' : 'category',
       'flap_revision' : 'category',
       'days_to_flap_revision' : 'UInt16',
       'plate_type' : 'category',
       'long_plate_thickness' : 'category',
       'tmj_replacement_type' : 'category',
       'flap_segment_count' : 'category',
       'surgery_duration_min' : 'UInt16',
       'height_cm' : 'UInt8',
       'weight_kg' : 'UInt8',
       'bmi' : 'Float32',
       'flap_loss_type' : 'category',
       'days_to_flap_loss' : 'Int16',
       'days_to_whd_recipient_site' : 'Int16',
       'days_to_whd_donor_site' : 'Int16',
       'days_to_abscess' : 'Int16',
       'days_to_fistula' : 'Int16',
       'days_to_vestibuloplasty' : 'Int16',
       'days_to_osteoradionecrosis' : 'Int16',
       'days_to_bone_exposure' : 'Int16',
       'days_to_plate_exposure' : 'Int16',
       'days_to_plate_removal' : 'Int16',
       'days_to_plate_fracture' : 'Int16',
       'days_to_plate_loosening' : 'Int16',
       'days_to_implant_received' : 'Int16',
       'days_to_implant_planned' : 'Int16',
       'days_to_implant_plate_removal' : 'Int16',
       'days_to_iliac_crest_augmentation' : 'Int16',
       'days_to_follow_up' : 'Int16',
       'imaging' : 'category',
       'days_to_imaging' : 'Int16',
       'days_to_nonunion' : 'Int16',
       'days_to_fracture' : 'Int16',
       'days_to_dislocation' : 'Int16',
       'days_to_tmj_luxation' : 'Int16'
    }
)
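
As an aside, the string-to-boolean step can also be sketched with `Series.map`. This is an alternative shown on a made-up series, not the approach used in this notebook. Note that, unlike the `np.where` version, any value other than `"True"` or `"False"` becomes missing.

```python
import pandas as pd

# Hypothetical column as it looks after import: strings plus pd.NA
raw = pd.Series(["True", "False", pd.NA], dtype="object")

# Values without a mapping (here: pd.NA) become missing,
# which the nullable "boolean" dtype then stores as <NA>
converted = raw.map({"True": True, "False": False}).astype("boolean")
print(converted)
```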

To make the data easier to work with, let's divide our dataframe into predictor and outcome variables.

In [20]:
predictor_variables = ['sex_female', 'indication', 'comorbidity___none',
       'comorbidity___smoking', 'comorbidity___alcohol',
       'comorbidity___copd', 'comorbidity___hypertension',
       'comorbidity___diabetes', 'comorbidity___atherosclerosis',
       'comorbidity___hyperlipidemia', 'comorbidity___osteoporosis',
       'comorbidity___hypothyroidism',
       'comorbidity___chronic_kidney_disease',
       'comorbidity___bleeding_disorder',
       'comorbidity___autoimmune_disease', 'which_autoimmune_disease',
       'which_bleeding_disorder', 'prior_flap', 'age_surgery_years',
       'flap_donor_site', 'flap_revision', 'days_to_flap_revision',
       'radiotherapy___none', 'radiotherapy___pre_surgery',
       'radiotherapy___post_surgery', 'chemotherapy___none',
       'chemotherapy___pre_surgery', 'chemotherapy___post_surgery',
       'plate_type', 'long_plate_thickness', 'urkens_classification___c',
       'urkens_classification___r', 'urkens_classification___b',
       'urkens_classification___s', 'tmj_replacement_type',
       'flap_segment_count', 'surgery_duration_min', 'height_cm',
       'weight_kg', 'bmi', 'skin_transplanted',
       'venous_anastomosis_type___end_end',
       'venous_anastomosis_type___end_side',
       'venous_anastomosis_tool___coupler',
       'venous_anastomosis_tool___suture']

predictors_df = df[predictor_variables]
outcomes_df = df.drop(predictor_variables, axis = 1)
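
A quick sanity check for such a split could look like the following sketch. The miniature dataframe and the `toy_` names are hypothetical stand-ins for `df` and `predictor_variables`.

```python
import pandas as pd

# Hypothetical miniature dataframe standing in for `df`
toy_df = pd.DataFrame(
    {
        "sex_female": [True, False],
        "bmi": [27.2, 19.9],
        "flap_loss": [False, True],
    }
)
toy_predictor_variables = ["sex_female", "bmi"]

# Names listed as predictors that do not exist in the dataframe (should be empty)
unknown = [name for name in toy_predictor_variables if name not in toy_df.columns]

toy_predictors_df = toy_df[toy_predictor_variables]
toy_outcomes_df = toy_df.drop(toy_predictor_variables, axis=1)

# The two frames should share no columns and together cover all of them
shared = set(toy_predictors_df.columns) & set(toy_outcomes_df.columns)
covered = set(toy_predictors_df.columns) | set(toy_outcomes_df.columns)
print(unknown, shared, covered == set(toy_df.columns))
```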

Let's now take a look at the `predictors_df`.

In [21]:
predictors_df.head()

| | sex_female | indication | comorbidity___none | comorbidity___smoking | comorbidity___alcohol | comorbidity___copd | comorbidity___hypertension | comorbidity___diabetes | comorbidity___atherosclerosis | comorbidity___hyperlipidemia | ... | flap_segment_count | surgery_duration_min | height_cm | weight_kg | bmi | skin_transplanted | venous_anastomosis_type___end_end | venous_anastomosis_type___end_side | venous_anastomosis_tool___coupler | venous_anastomosis_tool___suture |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | flap_loss | True | False | False | False | False | False | False | False | ... | three | 441 | 184 | 92 | 27.173914 | | False | True | False | True |
| 1 | True | malignant_tumor | False | False | False | False | False | False | False | False | ... | one | 430 | 160 | 51 | 19.921875 | False | False | True | False | True |
| 2 | False | osteoradionecrosis | False | False | False | False | True | False | False | False | ... | two | 478 | 188 | 77 | 21.785875 | True | True | False | False | True |
| 3 | True | malignant_tumor | False | True | False | False | False | False | False | False | ... | three | 474 | 175 | 61 | 19.918367 | True | False | True | False | True |
| 4 | False | malignant_tumor | False | True | True | False | False | False | False | False | ... | three | 536 | 174 | 70 | 23.120625 | True | False | True | False | True |
