# Example Cleaning Pipeline

This notebook demonstrates how to run the NCCID cleaning pipeline using a dummy dataset.

In [None]:
import pandas as pd
from nccid_cleaning import clean_data_df, patient_df_pipeline

The example data contains 10 synthesized rows of NCCID clinical data and a subset of the possible columns. The columns were chosen to represent the different types of information available (dates, categories, integers, floats) whilst also covering the known data quality issues, e.g., typos in headings, incorrect formats, values embedded in strings, and values outside of category ranges.

Broadly speaking the NCCID clinical data can be split into 5 groups:
- general: PatientID, SubmittingCentre, swab status, demographics
- date: swab dates, scan dates, dates of admission, intubation, death, etc.
- medical history: usually categorical, e.g., presence of pre-existing lung conditions
- admission metrics: usually numerical, e.g., heart rate on admission
- outcomes: usually categorical, e.g., test results, x-ray severity, death

The data is broken down into these 5 groups in the subsequent analysis of the cleaning pipeline. 

## Run the cleaning pipeline

The full cleaning pipeline can be called using the imported Collection 'patient_df_pipeline'. Rather than replacing the original data, the pipeline creates new columns with lowercase, underscored column names. For example, the clean version of 'Date of admission' becomes 'date_of_admission'.
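
The naming convention can be sketched as below. `to_clean_name` is a hypothetical helper written for this notebook, not part of `nccid_cleaning`; it simply mirrors the lowercase-and-underscore rule described above and may not match the library's internal renaming in every case.

```python
def to_clean_name(column: str) -> str:
    """Hypothetical helper: lowercase a heading and swap spaces for underscores."""
    return column.lower().replace(" ", "_")


print(to_clean_name("Date of admission"))
```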

In [None]:
# load example df
df = pd.read_csv("data/example.csv")
df.head()

In [None]:
# run clinical cleaning pipeline
clean_df = clean_data_df(df, patient_df_pipeline)
clean_df.head()

In [None]:
# just the new cleaned columns
clean_df[[col for col in clean_df.columns if col.islower()]].head()

In [None]:
def compare_dfs(columns, df, clean_df):
    """Creates dataframe with equivalent columns side by side.

    Parameters
    ----------
    columns: list
        list of original column names
    df: pd.DataFrame
        original data
    clean_df: pd.DataFrame
        cleaned data

    Returns
    -------
    comp_df: pd.DataFrame
            comparison dataframe with equivalent columns side by side.
    """

    comp_df = pd.concat(
        [
            pd.concat(
                (df[col], clean_df[col.lower().replace(" ", "_")]), axis=1
            )
            for col in columns
        ],
        axis=1,
    )
    return comp_df

### General Columns

Columns like Pseudonym and Submitting centre have already been preprocessed by the NCCID ingestion pipeline. As such, subsequent cleaning is not applied to these columns.

Mixed formats have been used in the demographic columns ethnicity and sex, where categories are submitted in multiple ways. For sex the mapping is simple, e.g., ```0``` to ```F```, ```1``` to ```M```. For ethnicity, various subgroups are aggregated into a broader set of ethnicity groupings: white, black, asian, multiple, other, unknown.
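
As a rough illustration of this kind of remapping, the sketch below normalises a mixed sex column with a lookup table. The `SEX_MAP` dictionary and `remap_sex` function are assumptions made for this example; only the `0` to `F` and `1` to `M` pairs come from the description above, and the library's own mapping may cover more cases.

```python
import pandas as pd

# Hypothetical lookup: only the 0 -> F and 1 -> M pairs are taken from the text.
SEX_MAP = {"0": "F", "1": "M", "F": "F", "M": "M"}


def remap_sex(series: pd.Series) -> pd.Series:
    """Map mixed sex encodings to F/M; unrecognised values become missing."""
    # Normalise to stripped, upper-case strings so 0, "0" and " f " all match.
    keys = series.astype(str).str.strip().str.upper()
    return keys.map(SEX_MAP)


raw = pd.Series([0, "1", "F", "m", "unknown"])
print(remap_sex(raw))
```

Anything outside the lookup is left as missing rather than guessed, so unexpected encodings stay visible for review.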


In [None]:
gen_df = compare_dfs(["Ethnicity", "Age", "Sex"], df, clean_df)
pd.concat((df[["Pseudonym", "SubmittingCentre"]], gen_df), axis=1)

### Date cleaning
The majority of date columns, including ```Date of admission```, are submitted in US date format MM/DD/YYYY or some variant (e.g., M/D/YY). As such, the cleaning pipeline assumes ```month_first=True``` for most date columns when coercing them to pandas datetimes. The exception is ```SwabDate```, which is submitted in UK format (DD/MM/YYYY) and is therefore treated separately.
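
The effect of the month-first assumption can be reproduced with plain pandas. This is a minimal sketch, not the pipeline's own code: it uses the `dayfirst` argument of `pd.to_datetime`, and the sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample values in a single, consistent format per column.
us_style = pd.Series(["03/25/2020", "04/01/2020"])  # MM/DD/YYYY
uk_style = pd.Series(["25/03/2020", "01/04/2020"])  # DD/MM/YYYY

# Month-first parsing, as assumed for most date columns.
admission = pd.to_datetime(us_style, dayfirst=False, errors="coerce")
# Day-first parsing, as needed for a UK-format column such as SwabDate.
swab = pd.to_datetime(uk_style, dayfirst=True, errors="coerce")

print(pd.concat({"admission": admission, "swab": swab}, axis=1))
```

Both columns resolve to the same two calendar dates, which is the outcome the separate treatment of ```SwabDate``` is meant to achieve.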

Known errors in date entry include submissions of the string format ```[Text] - YYYY-MM-DD```, for which the date is extracted using regex. Other errors such as non-date entries (row 7) are cast to ```NaT```. 
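
One way to handle both cases is sketched below. The `ISO_DATE` pattern and `parse_date_entry` function are assumptions for illustration and are not taken from `nccid_cleaning`; the sketch pulls out an embedded `YYYY-MM-DD` date when one is present and otherwise falls back to ordinary parsing, returning ```NaT``` for anything unreadable.

```python
import re

import pandas as pd

# Assumed pattern for entries shaped like "[Text] - YYYY-MM-DD".
ISO_DATE = re.compile(r"(\d{4}-\d{2}-\d{2})")


def parse_date_entry(value, dayfirst=False):
    """Return a Timestamp for one raw entry, or NaT if no date can be read."""
    text = str(value).strip()
    match = ISO_DATE.search(text)
    if match:
        # Embedded ISO date: an explicit format keeps the parse unambiguous.
        return pd.to_datetime(match.group(1), format="%Y-%m-%d", errors="coerce")
    # Otherwise parse the whole entry; non-date text is coerced to NaT.
    return pd.to_datetime(text, dayfirst=dayfirst, errors="coerce")


raw = pd.Series(["Admitted - 2020-04-02", "03/25/2020", "not recorded"])
print(raw.apply(parse_date_entry))
```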

In some cases, dates have been submitted in the wrong format (day first instead of month first or vice versa). The cleaning pipeline cannot correct for ambiguous cases, such as 05/06, and users should look to additional sources such as DICOM header dates to corroborate where possible. 
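
Although ambiguous entries cannot be corrected automatically, they can at least be flagged for manual review. The check below is a hypothetical helper, not part of the cleaning pipeline, and it assumes slash-separated numeric dates: an entry is ambiguous when its first two fields are both valid month numbers and differ from each other.

```python
import pandas as pd


def is_ambiguous(value) -> bool:
    """True if a slash-separated date reads differently day-first and month-first."""
    parts = str(value).strip().split("/")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        return False  # not in the assumed A/B/YYYY shape
    first, second = int(parts[0]), int(parts[1])
    # Both fields could be a month, and swapping them changes the date.
    return 1 <= first <= 12 and 1 <= second <= 12 and first != second


raw = pd.Series(["05/06/2020", "03/25/2020", "07/07/2020", "not recorded"])
print(raw[raw.map(is_ambiguous)])
```

Rows flagged this way are the ones worth corroborating against other sources such as DICOM header dates.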

In [None]:
# original and cleaned date columns
compare_dfs(["SwabDate", "Date of admission"], df, clean_df)