# Data cleaning

This notebook cleans the exported raw data, applying decisions from EDA and feedback from the data owner.

Inputs|Outputs
---|---
`interim/major-data.parquet`|`interim/clean-data.parquet`
&nbsp;|`interim/cols.csv`
&nbsp;|`interim/describe.csv`

In [None]:
import pandas as pd
import seaborn as sns
from pandas_profiling import ProfileReport

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 100)

%matplotlib inline

In [None]:
# Load data
major_data_df = pd.read_parquet("../../data/interim/major-data.parquet")
major_data_df.shape

## Convert datetimes

In [None]:
datetimes_df = major_data_df.copy()
datetime_cols = [
    "DISCHARGE_DATE_HOSPITAL_PROVIDER_SPELL",
    "EXPECTED_DISCHARGE_DATE",
    "FIRST_START_DATE_TIME_WARD_STAY",
    "START_DATE_TIME_HOSPITAL_PROVIDER_SPELL",
]

for col in datetime_cols:
    datetimes_df[col] = pd.to_datetime(datetimes_df[col], format="%Y-%m-%d %H:%M:%S.%f")
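
As a minimal sketch of the conversion above on toy data (the `START` column and its values are hypothetical, not taken from the dataset), passing `errors="coerce"` is one way to surface strings that do not match the expected format as `NaT` instead of raising:

```python
import pandas as pd

# Toy frame standing in for the real data; column name and values are hypothetical
toy_df = pd.DataFrame(
    {"START": ["2021-01-05 13:45:00.000", "2021-02-10 08:00:00.123", "not a date"]}
)

# errors="coerce" turns strings that do not match the format into NaT
parsed = pd.to_datetime(
    toy_df["START"], format="%Y-%m-%d %H:%M:%S.%f", errors="coerce"
)

# Values that were present in the input but failed to parse
n_failed = int((parsed.isna() & toy_df["START"].notna()).sum())
print(n_failed)
```

A non-zero count would be worth raising with the data owner before any rows are dropped or imputed.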

## Order rows

The original data is largely unordered, so we order rows by `START_DATE_TIME_HOSPITAL_PROVIDER_SPELL`.

In [None]:
datetimes_df.sort_values(by="START_DATE_TIME_HOSPITAL_PROVIDER_SPELL", inplace=True)
datetimes_df.reset_index(drop=True, inplace=True)
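
The sort-and-reindex step can also be written without `inplace`. The sketch below uses toy data with hypothetical column names and relies on the `ignore_index` parameter of `sort_values`:

```python
import pandas as pd

# Toy frame; column names and values are hypothetical
toy_df = pd.DataFrame(
    {
        "START": pd.to_datetime(["2021-03-01", "2021-01-01", "2021-02-01"]),
        "value": ["c", "a", "b"],
    }
)

# ignore_index=True relabels the sorted rows 0..n-1, like reset_index(drop=True)
sorted_df = toy_df.sort_values(by="START", ignore_index=True)
print(sorted_df)
```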

## Drop empty/redundant/agreed columns

The columns below are dropped as agreed with the data SME.

In [None]:
cleaned_cols_df = (
    datetimes_df.drop(
        # Drop empty columns
        columns=[
            "DISCHARGE_READY_DATE",
            "cds_unique_identifier",
            "healthcare_resource_group_code",
            "presenting_complaint_code",
            "ae_patient_group_code",
            "ae_patient_group",
        ]
    )
    .drop(
        # Drop redundant columns
        columns=[
            "Frailty Proxy",  # encoded in IS_frailty_proxy
            "all_breach_reason_codes",  # unknown data column
            "ae_attendance_category_code",  # low cardinality
            "all_diagnosis_codes",  # not available on admission
            "all_investigation_codes",  # not available on admission
            "all_local_investigation_codes",  # not available on admission
            "all_local_treatment_codes",  # not available on admission
            "all_treatment_codes",  # not available on admission
            "PATIENT_CLASSIFICATION",  # low cardinality
            "PATIENT_GENDER_CURRENT",  # encoded in PATIENT_GENDER_CURRENT_DESCRIPTION
            "SOURCE_OF_ADMISSION_HOSPITAL_PROVIDER_SPELL",  # encoded in SOURCE_OF_ADMISSION_HOSPITAL_PROVIDER_SPELL_DESCRIPTION
            "TREATMENT_FUNCTION_CODE_AT_ADMISSION",  # encoded in TREATMENT_FUNCTION_CODE_AT_ADMISSION_DESCRIPTION
            "MAIN_SPECIALTY_CODE_AT_ADMISSION",  # encoded in MAIN_SPECIALTY_CODE_AT_ADMISSION_DESCRIPTION
            "ae_initial_assessment_triage_category_code",  # redundant given focus on major cases
            "ae_initial_assessment_triage_category",  # redundant given focus on major cases
            "major_minor",  # redundant given focus on major cases
            "manchester_triage_category",  # redundant given focus on major cases
            "FIRST_START_DATE_TIME_WARD_STAY",  # proxy for START_DATE_TIME_HOSPITAL_PROVIDER_SPELL
            "FIRST_REGULAR_DAY_OR_NIGHT_ADMISSION_DESCRIPTION",  # 99.99% empty
            "wait",  # not available at admission
            "attendance_type",  # single value (all "E")
            "initial_wait",  # not available at admission
            "arrival_day_of_week",  # will be recalculated
            "arrival_month_name",  # will be recalculated
            "wait_minutes",  # not available at admission
            "initial_wait_minutes",  # not available at admission
            "FIRST_WARD_STAY_IDENTIFIER",  # low cardinality
            "LENGTH_OF_STAY_IN_MINUTES",  # low cardinality
            "START_DATE_HOSPITAL_PROVIDER_SPELL",  # low cardinality
            "EXPECTED_DISCHARGE_DATE_TIME",  # low cardinality
        ]
    )
    .drop(
        # Drop identifier columns
        columns=[
            "LOCAL_PATIENT_IDENTIFIER",
            "previous_30_day_hospital_provider_spell_number",
            "ED_attendance_episode_number",
            "unique_internal_ED_admission_number",
            "unique_internal_IP_admission_number",
        ]
    )
)
cleaned_cols_df.shape

This reduces the dataset from ~100 columns to ~50 columns (a ~50% reduction).
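
The labels used in the drop lists above (empty, low cardinality) can be checked with a small per-column audit. This is a sketch on toy data with hypothetical column names, not the procedure used to produce the agreed list:

```python
import pandas as pd

# Toy frame; column names and values are hypothetical
toy_df = pd.DataFrame(
    {
        "empty_col": [None, None, None, None],
        "constant_col": ["E", "E", "E", "E"],
        "useful_col": [1, 2, 3, 4],
    }
)

# Fraction of nulls and number of distinct non-null values per column
audit_df = pd.DataFrame(
    {
        "null_fraction": toy_df.isna().mean(),
        "n_unique": toy_df.nunique(dropna=True),
    }
)

empty_cols = audit_df.index[audit_df["null_fraction"] == 1.0].tolist()
single_value_cols = audit_df.index[audit_df["n_unique"] == 1].tolist()
print(audit_df)
```

Columns flagged this way are only candidates; whether to drop them remains a decision for the data SME.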

## Assign NaN values

* The SME agrees that NaN means `N` for `stroke_ward_stay`.

In [None]:
cleaned_cols_df.stroke_ward_stay.value_counts(dropna=False)

In [None]:
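# fill stroke_ward_stay
imputed_df = cleaned_cols_df.copy()
# assign the result back; an inplace fillna on a column selection may not update the frame
imputed_df["stroke_ward_stay"] = imputed_df["stroke_ward_stay"].fillna("N")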
imputed_df.stroke_ward_stay.value_counts(dropna=False)

`MAIN_SPECIALTY_CODE_AT_ADMISSION_DESCRIPTION` is a feature that is used in modelling, and some models (CatBoost) are unable to handle null values in categorical columns.

There are a small number (<1%) of null values for this field, which we will encode with the string "Not specified".

In [None]:
imputed_df.loc[
    imputed_df.MAIN_SPECIALTY_CODE_AT_ADMISSION_DESCRIPTION.isna(),
    "MAIN_SPECIALTY_CODE_AT_ADMISSION_DESCRIPTION",
] = "Not specified"
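
Both fills above follow the same pattern: an agreed default per column. As a sketch on toy data (column names and values are hypothetical), `DataFrame.fillna` accepts a dict so the defaults sit in one place:

```python
import pandas as pd

# Toy frame; column names and values are hypothetical
toy_df = pd.DataFrame(
    {
        "flag": ["Y", None, "N"],
        "specialty": ["Cardiology", None, "Geriatrics"],
    }
)

# Explicit per-column fill values; columns not listed are left untouched
fill_values = {"flag": "N", "specialty": "Not specified"}
filled_df = toy_df.fillna(value=fill_values)

# Confirm no nulls remain in the filled columns
remaining_nulls = filled_df[list(fill_values)].isna().sum()
print(remaining_nulls)
```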

## Remove duplicates

In [None]:
no_duplicate_rows_df = imputed_df.drop_duplicates()
no_duplicate_rows_df.shape

This removes ~300 rows (a ~0.2% reduction).
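
The row count and percentage can be computed directly instead of being read off the shapes. A minimal sketch on toy data with hypothetical columns:

```python
import pandas as pd

# Toy frame with one fully duplicated row; values are hypothetical
toy_df = pd.DataFrame({"a": [1, 1, 2, 3], "b": ["x", "x", "y", "z"]})

# duplicated() flags every repeat of an earlier row (the first occurrence is kept)
n_duplicates = int(toy_df.duplicated().sum())
deduped_df = toy_df.drop_duplicates()

# Share of rows removed, as a percentage
pct_removed = 100 * n_duplicates / len(toy_df)
print(n_duplicates, pct_removed)
```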

## Homogenise binary fields

Many fields are encoded as Y/N or similar. We convert these into binary fields using an explicit mapping.

Once mapped, check for NaN values and refer back to the data SME to make sure we are not infilling values without correct clinical understanding.

In [None]:
no_duplicate_rows_df.IS_care_home_on_admission.unique()

In [None]:
binary_fields_df = no_duplicate_rows_df.copy()
# map Y/N to 1/0 for non-null columns
binary_cols = [
    "stroke_ward_stay",
    "IS_care_home_on_admission",
    "IS_care_home_on_discharge",
]
for col in binary_cols:
    binary_fields_df[col] = binary_fields_df[col].map({"Y": 1, "N": 0})

The `Illness Injury Flag` field is a binary field that uses the strings `Illness` and `Injury` to define its state. We will create a true binary field, `IS_illness_not_injury` (`0` or `1`), to encode this information:

In [None]:
# create new fields
binary_fields_df["IS_illness_not_injury"] = binary_fields_df["Illness Injury Flag"].map(
    {"Illness": 1, "Injury": 0}
)
# drop old fields
binary_fields_df.drop(columns=["Illness Injury Flag"], inplace=True)

In [None]:
# check new binary fields
# if there are any NaN values, check with the data SME how to fill them, i.e. default Y or N?
for field in [
    "stroke_ward_stay",
    "IS_care_home_on_admission",
    "IS_care_home_on_discharge",
    "IS_illness_not_injury",
]:
    print(binary_fields_df[field].value_counts(dropna=False))
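
The reason for checking NaN values after mapping is that `Series.map` with a dict returns NaN for any value that is not a key. The sketch below uses a toy series with a hypothetical unexpected value (`"y"`) to show how such values can be listed before asking the data SME:

```python
import pandas as pd

# Toy series; values are hypothetical, including an unexpected lowercase "y"
raw = pd.Series(["Y", "N", "y", None, "N"])

# Values missing from the mapping become NaN
mapped = raw.map({"Y": 1, "N": 0})

# Values that were present in the input but lost by the mapping
unmapped = raw[mapped.isna() & raw.notna()].unique().tolist()
print(unmapped)
```

Separating genuinely missing values from unmapped ones makes the question for the SME more specific.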

## Check genders

In [None]:
# What is the distribution of gender in the dataset?
binary_fields_df.PATIENT_GENDER_CURRENT_DESCRIPTION.value_counts(dropna=False)

In this dataset, there are three possible values for gender: Male, Female and "Not specified".

Only 9 rows of data correspond to "Not specified", which represents less than 0.01% of the data.

We choose to remove these rows as the sample size is too small to be reliably modelled.

In [None]:
# drop "not specified" values
genders_df = binary_fields_df.drop(
    labels=binary_fields_df[
        binary_fields_df.PATIENT_GENDER_CURRENT_DESCRIPTION == "Not specified"
    ].index
)
genders_df.shape
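
An equivalent way to express the row removal is a boolean mask, which also makes the removed share easy to report. This is a sketch on toy data with a hypothetical `gender` column:

```python
import pandas as pd

# Toy frame; column name and values are hypothetical
toy_df = pd.DataFrame(
    {"gender": ["Male", "Female", "Not specified", "Female", None]}
)

# Rows to keep; a null gender compares as not equal, so it is kept
keep_mask = toy_df["gender"] != "Not specified"
filtered_df = toy_df[keep_mask]

# Share of rows removed, as a percentage
pct_removed = 100 * int((~keep_mask).sum()) / len(toy_df)
print(len(filtered_df), pct_removed)
```

As with the index-based drop, rows with a null gender are not removed by this filter.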

## Cap length of stay

What is the distribution of length of stay, and should we cap high length of stay outliers?

In [None]:
# Check distribution of length of stay
genders_df.groupby(by="LENGTH_OF_STAY").count().AGE_ON_ADMISSION.plot();

The longest length of stay is ~250 days, and the number of patients drops off sharply beyond 30 days.

The original work capped length of stay to 30 days, and we will do the same here.

In [None]:
# Cap maximum length of stay to 30 days
capped_df = genders_df.copy()
capped_df["LENGTH_OF_STAY"] = capped_df["LENGTH_OF_STAY"].clip(upper=30)
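
To see how many rows a cap affects before applying it, one option is to count them first and then use `Series.clip`. The values below are hypothetical and the constant name `MAX_LOS_DAYS` is introduced only for this sketch:

```python
import pandas as pd

# Toy lengths of stay in days; values are hypothetical
los = pd.Series([1, 5, 30, 45, 250])

MAX_LOS_DAYS = 30  # the 30-day cap described above

# Count the rows that the cap will change, then apply it
n_capped = int((los > MAX_LOS_DAYS).sum())
capped = los.clip(upper=MAX_LOS_DAYS)
print(n_capped, capped.tolist())
```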

## Final data checks

In [None]:
# check null values
# there are still some columns with null values; these can be encoded during modelling using e.g. dummy_na=True
capped_df.isnull().sum()

In [None]:
# plot null values
sns.set(rc={"figure.figsize": (15, 8)})
sns.heatmap(capped_df.isnull(), cbar=False);
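
The `dummy_na=True` option mentioned in the null check refers to `pd.get_dummies`. As a sketch of what that encoding looks like on toy data (the `ward` and `age` columns are hypothetical), missing categories get their own indicator column:

```python
import pandas as pd

# Toy frame; column names and values are hypothetical
toy_df = pd.DataFrame({"ward": ["A", None, "B"], "age": [70, 82, 65]})

# dummy_na=True adds an indicator column for missing values in the encoded column
encoded_df = pd.get_dummies(toy_df, columns=["ward"], dummy_na=True)
print(encoded_df.columns.tolist())
```

This keeps the rows with nulls in the dataset and leaves it to the model to learn whether missingness is informative.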

## Export cleaned data

In [None]:
# Export data (outside git tree)
capped_df.to_parquet("../../data/interim/clean-data.parquet")

In [None]:
# Export cols/descriptions for Data Dictionary Excel/Google Sheets import (outside git tree)
capped_df.dtypes.to_csv("../../data/interim/cols.csv")
capped_df.describe().to_csv("../../data/interim/describe.csv")