# Feature engineering

This notebook processes cleaned data into the feature set used for modelling.

The decisions around feature engineering are the culmination of a number of explorations of the data, including modelling of the full dataset, which is not included in this repository.

In [None]:
import numpy as np
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

In [None]:
# Load data
clean_data_df = pd.read_parquet("../../data/clean-data.parquet")

## Add derived fields

These fields were removed during cleaning because of missing data, and can be recalculated from the spell start timestamp:

In [None]:
derived_df = clean_data_df.copy()
derived_df[
    "arrival_day_of_week"
] = derived_df.START_DATE_TIME_HOSPITAL_PROVIDER_SPELL.dt.day_name().str[:3]
derived_df[
    "arrival_month_name"
] = derived_df.START_DATE_TIME_HOSPITAL_PROVIDER_SPELL.dt.month_name().str[:3]
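
As a minimal sketch of the derivation above, the same accessor calls can be tried on a small made-up frame. The timestamps and the name `toy_df` are illustrative assumptions and do not come from the dataset. `dt.day_name()` and `dt.month_name()` return full English names, which are then truncated to three letters:

```python
import pandas as pd

# Hypothetical timestamps, for illustration only (not taken from the real data)
toy_df = pd.DataFrame(
    {
        "START_DATE_TIME_HOSPITAL_PROVIDER_SPELL": pd.to_datetime(
            ["2021-01-04 09:30", "2021-06-13 23:15"]
        )
    }
)

start = toy_df["START_DATE_TIME_HOSPITAL_PROVIDER_SPELL"]
# Full names ("Monday", "January") are shortened to their first three letters
toy_df["arrival_day_of_week"] = start.dt.day_name().str[:3]
toy_df["arrival_month_name"] = start.dt.month_name().str[:3]

print(toy_df[["arrival_day_of_week", "arrival_month_name"]])
```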

## Select agreed columns

The following columns were selected as agreed with the data SME.

In [None]:
columns = [
    "ae_arrival_mode",
    "IS_major",
    "AGE_ON_ADMISSION",
    "EL CountLast12m",
    "IS_elective",
    "EMCountLast12m",
    "IS_illness_not_injury",
    "IS_cancer",
    "IS_care_home_on_admission",
    "IS_chronic_kidney_disease",
    "IS_COPD",
    "IS_coronary_heart_disease",
    "IS_dementia",
    "IS_diabetes",
    "IS_frailty_proxy",
    "IS_hypertension",
    "IS_mental_health",
    "MAIN_SPECIALTY_CODE_AT_ADMISSION_DESCRIPTION",
    "OP First CountLast12m",
    "OP FU CountLast12m",
    "SOURCE_OF_ADMISSION_HOSPITAL_PROVIDER_SPELL_DESCRIPTION",
    "stroke_ward_stay",
    "LENGTH_OF_STAY",
    "arrival_day_of_week",
    "arrival_month_name",
]

# define sensitive columns for fairness testing later
sensitive_columns = [
    "ETHNIC_CATEGORY_CODE_DESCRIPTION",
    "IMD county decile",
    "OAC Group Name",
    "OAC Subgroup Name",
    "OAC Supergroup Name",
    "PATIENT_GENDER_CURRENT_DESCRIPTION",
    "POST_CODE_AT_ADMISSION_DATE_DISTRICT",
    "Rural urban classification",
]

subset_df = derived_df[columns + sensitive_columns]

## Focus on MAJOR, non-elective cases only

The SME requested a model built for MAJOR, non-elective cases only, as these are expected to require longer stays.

In [None]:
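# Keep only MAJOR, non-elective cases
major_df = subset_df[(subset_df.IS_major == 1) & (subset_df.IS_elective == 0)]
# These columns no longer contain additional information:
columns.remove("IS_major")
columns.remove("IS_elective")
# Reassign instead of dropping in place, so we never modify a slice of subset_df
# (in-place edits on a slice can raise pandas' SettingWithCopyWarning)
major_df = major_df.drop(columns=["IS_major", "IS_elective"])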
major_df.shape
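
The reason the two flag columns can be dropped is that, once the rows are filtered, each flag holds a single value. A small sketch on made-up data (the values and the name `toy_df` are assumptions for illustration) shows one way to confirm this before dropping:

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy_df = pd.DataFrame(
    {
        "IS_major": [1, 1, 0, 1],
        "IS_elective": [0, 1, 0, 0],
        "LENGTH_OF_STAY": [5, 2, 1, 9],
    }
)

# Keep MAJOR, non-elective rows
mask = (toy_df["IS_major"] == 1) & (toy_df["IS_elective"] == 0)
filtered = toy_df[mask]

# A column with one unique value cannot help a model distinguish rows
constant_columns = [c for c in filtered.columns if filtered[c].nunique() == 1]
print(constant_columns)

# drop() returns a new frame, so the filtered slice is not modified in place
filtered = filtered.drop(columns=constant_columns)
print(filtered.shape)
```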

## One-hot encode categorical data

The categorical features are one-hot encoded once, with the sensitive features excluded. A second copy of the result then has the sensitive columns added back unencoded. This is so that, when testing for fairness later, we can compare across sensitive groups the performance of models trained without sensitive features.

In [None]:
# To avoid the "dummy variable trap", we could drop the first category of these features to reduce duplication.
# However, we may lose interpretability if e.g. Monday is dropped and is an important feature?
encoded_df = pd.get_dummies(major_df.drop(columns=sensitive_columns), drop_first=False)
print(encoded_df.shape)
# Add back in the sensitive columns, without encoding
encoded_sensitive_df = encoded_df.copy()
encoded_sensitive_df[sensitive_columns] = major_df[sensitive_columns]
print(encoded_sensitive_df.shape)
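
To make the `drop_first` trade-off concrete, here is a small sketch on made-up data (the values and the name `toy_df` are illustrative assumptions). With `drop_first=False` the dummy columns for a feature always sum to 1 per row, which is the redundancy behind the "dummy variable trap". With `drop_first=True` the first category (here `Mon`) has no column of its own and is represented implicitly by all remaining dummies being 0:

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy_df = pd.DataFrame(
    {
        "arrival_day_of_week": ["Mon", "Tue", "Wed", "Mon"],
        "LENGTH_OF_STAY": [3, 1, 4, 2],
    }
)

full = pd.get_dummies(toy_df, drop_first=False)
reduced = pd.get_dummies(toy_df, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# With every category kept, exactly one dummy is set in each row
row_sums = full.filter(like="arrival_day_of_week_").sum(axis=1)
print(row_sums.tolist())
```

Whether the redundancy matters depends on the model: it is mainly a concern for unregularised linear models, which is one reason keeping every category for interpretability can be a reasonable choice.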

## Check correlation

In [None]:
corr = encoded_df.corr()
# Rank features by the absolute value of their correlation with LENGTH_OF_STAY,
# keeping the sign of each correlation in the output
corr.LENGTH_OF_STAY[corr.LENGTH_OF_STAY.abs().sort_values(ascending=False).index]
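
The ranking idea can be sketched on a tiny made-up frame (the values and the name `toy_df` are assumptions for illustration). `DataFrame.corr()` computes Pearson correlation by default, so it only captures linear relationships. Sorting by absolute value puts strong negative correlations alongside strong positive ones, while the reported values keep their sign:

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy_df = pd.DataFrame(
    {
        "LENGTH_OF_STAY": [1, 2, 3, 4, 5],
        "AGE_ON_ADMISSION": [40, 50, 60, 70, 80],  # rises exactly with the target
        "IS_COPD": [1, 1, 0, 0, 0],  # tends to fall as the target rises
    }
)

# Correlation of every feature with the target, excluding the target itself
corr_with_target = toy_df.corr()["LENGTH_OF_STAY"].drop("LENGTH_OF_STAY")

# Order by magnitude, but report the signed values
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```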

## Export to parquet

In [None]:
encoded_df.to_parquet("../../data/features.parquet")
encoded_sensitive_df.to_parquet("../../data/features-sensitive.parquet")

# Some machine learning algorithms, e.g. CatBoost, handle categorical features natively and
# should not be given one-hot encoded data, so export unencoded versions for these
major_df.drop(columns=sensitive_columns).to_parquet(
    "../../data/features-catboost.parquet"
)
major_df.to_parquet("../../data/features-sensitive-catboost.parquet")
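
When the unencoded export is loaded for training, the model needs to be told which columns are categorical. The sketch below is one hedged way to build that list on made-up data. The category values and the name `toy_df` are assumptions, and the step of passing the list to CatBoost's `cat_features` argument is described only in a comment, not taken from this repository:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical rows, for illustration only
toy_df = pd.DataFrame(
    {
        "ae_arrival_mode": ["ambulance", "walk-in"],
        "arrival_day_of_week": ["Mon", "Tue"],
        "AGE_ON_ADMISSION": [70, 82],
        "LENGTH_OF_STAY": [5, 2],
    }
)

# Separate the target from the features
features = toy_df.drop(columns=["LENGTH_OF_STAY"])

# Treat every non-numeric column as categorical
cat_features = [c for c in features.columns if not is_numeric_dtype(features[c])]
print(cat_features)

# This list could then be supplied as `cat_features` when fitting a CatBoost model
```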