# 02 Exploratory Data Analysis

Notebook goal: initial Exploratory Data Analysis to understand columns (features), data cleanliness and basic (inter)correlations:

1. Check data types of columns and re-cast if necessary
2. Check validity of LENGTH_OF_STAY column
3. Check ordering of dataset
4. Check levels of missing data
5. Check levels of duplication of dataset
6. Basic correlation to look at relationships between columns
7. Generate a pandas profile and explore distributions of data

Note that EDA is a cyclical process, and many explorations of the data take place after cleaning.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from pandas_profiling import ProfileReport

%matplotlib inline

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

## Load data

Load the data processed in the previous notebook.

In [None]:
# Load into pandas
original_data_df = pd.read_parquet("../../data/raw/original-data.parquet")
original_data_df.shape

## Remove minor and elective cases

The modelling in this project is focussed on major, non-elective cases, as these cases tend to have a longer length of stay and therefore a higher risk of becoming long stayers.

We will remove minor and elective cases before proceeding with further exploration.

The data subject matter expert (SME) has clarified that null values for `IS_major` should be treated as "N", so we can select only the "Y" values.

In [None]:
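# keep only major, non-elective cases
major_df = original_data_df[original_data_df.IS_major == "Y"]
major_df = major_df[major_df.elective_or_non_elective == "Non-elective admission"]
# drop now-redundant columns; reassigning (rather than inplace=True) avoids
# a SettingWithCopyWarning on the filtered slice
major_df = major_df.drop(columns=["IS_major", "elective_or_non_elective"])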
major_df.shape
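
The SME's rule for nulls can also be written into the filter itself, so the intent is visible in code. The following is a minimal sketch on toy data: only the column names come from this notebook, while the values (including "Elective admission") and the `toy_` names are assumptions for illustration.

```python
import pandas as pd

# Toy data with hypothetical values; only the column names mirror the notebook.
toy_df = pd.DataFrame(
    {
        "IS_major": ["Y", None, "N", "Y"],
        "elective_or_non_elective": [
            "Non-elective admission",
            "Non-elective admission",
            "Non-elective admission",
            "Elective admission",  # hypothetical label for elective cases
        ],
    }
)

# Make the SME rule explicit: a null IS_major is treated as "N".
is_major = toy_df["IS_major"].fillna("N") == "Y"
is_non_elective = toy_df["elective_or_non_elective"] == "Non-elective admission"

# Combine both conditions in one boolean mask and take an independent copy.
toy_major_df = toy_df.loc[is_major & is_non_elective].copy()
print(len(toy_major_df))
```

Only the first toy row is both major and non-elective, so a single row survives the filter.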

## Explore data types

Check the range of data types in the dataset manually:

In [None]:
major_df.dtypes

There are two derived fields, `arrival_day_of_week` and `arrival_month_name` - how have they been derived?

In [None]:
major_df.arrival_day_of_week.unique()

In [None]:
major_df.arrival_month_name.unique()
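
One way to test how these fields were derived is to recompute them from a timestamp and compare. The sketch below assumes they come from the spell start date-time, which the source data does not confirm, and uses hypothetical timestamps.

```python
import pandas as pd

# Hypothetical timestamps standing in for START_DATE_TIME_HOSPITAL_PROVIDER_SPELL.
toy_start = pd.Series(pd.to_datetime(["2021-03-01 08:30:00", "2021-12-25 23:15:00"]))

# Derive day and month names straight from the timestamp.
derived_day = toy_start.dt.day_name()
derived_month = toy_start.dt.month_name()

print(derived_day.tolist())
print(derived_month.tolist())
```

On the real data, comparing the recomputed series with `arrival_day_of_week` and `arrival_month_name` (for example with `.equals()`) would confirm or rule out this derivation. The comparison only makes sense if the existing columns store names rather than integer codes.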

## Data consistency checks

Does `LENGTH_OF_STAY` match the start and end dates?

In [None]:
# check data types before conducting maths
major_df[
    [
        "DISCHARGE_DATE_HOSPITAL_PROVIDER_SPELL",
        "START_DATE_TIME_HOSPITAL_PROVIDER_SPELL",
    ]
].dtypes

In [None]:
# cast dates to datetime
datetime_df = major_df.copy()
datetime_df.DISCHARGE_DATE_HOSPITAL_PROVIDER_SPELL = pd.to_datetime(
    datetime_df.DISCHARGE_DATE_HOSPITAL_PROVIDER_SPELL, format="%Y-%m-%d %H:%M:%S.%f"
)
datetime_df.START_DATE_TIME_HOSPITAL_PROVIDER_SPELL = pd.to_datetime(
    datetime_df.START_DATE_TIME_HOSPITAL_PROVIDER_SPELL, format="%Y-%m-%d %H:%M:%S.%f"
)

In [None]:
# Discharge is a whole date, admission is a full datetime
datetime_df[
    [
        "DISCHARGE_DATE_HOSPITAL_PROVIDER_SPELL",
        "START_DATE_TIME_HOSPITAL_PROVIDER_SPELL",
    ]
].sample(10)

In [None]:
# calculate derived LoS
# round up to whole days
datetime_df["DER_los"] = (
    datetime_df["DISCHARGE_DATE_HOSPITAL_PROVIDER_SPELL"]
    - datetime_df["START_DATE_TIME_HOSPITAL_PROVIDER_SPELL"]
).dt.days + 1
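
Because discharge is a whole date and admission is a timestamp, it helps to see what `.dt.days + 1` does in edge cases. The sketch below uses hypothetical spells to compare it with a plain count of calendar days. The `start` and `discharge` column names are stand-ins, and the calendar-day alternative is an illustration rather than the definition used for `LENGTH_OF_STAY`.

```python
import pandas as pd

# Hypothetical spells: discharge is a whole date, admission is a timestamp.
toy_spells = pd.DataFrame(
    {
        "start": pd.to_datetime(
            ["2021-01-01 14:00:00", "2021-01-01 14:00:00", "2021-01-01 00:00:00"]
        ),
        "discharge": pd.to_datetime(["2021-01-03", "2021-01-01", "2021-01-03"]),
    }
)

# Approach used above: floor the timedelta to whole days, then add one.
floor_plus_one = (toy_spells["discharge"] - toy_spells["start"]).dt.days + 1

# Alternative: count calendar-day boundaries between admission and discharge.
calendar_days = (
    toy_spells["discharge"].dt.normalize() - toy_spells["start"].dt.normalize()
).dt.days

print(floor_plus_one.tolist())
print(calendar_days.tolist())
```

The two approaches agree for an afternoon admission, including a same-day discharge where the raw timedelta is negative. They differ by one day when the admission timestamp is exactly midnight.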

In [None]:
# quick visual inspection - do they match?
datetime_df[["DER_los", "LENGTH_OF_STAY"]].head(10)

In [None]:
# check that mean difference (recorded minus derived LoS) is ~ 0 days
(datetime_df["LENGTH_OF_STAY"] - datetime_df["DER_los"]).mean()
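
A mean difference close to zero does not on its own show that the columns match row by row, because positive and negative errors cancel out. The sketch below uses hypothetical values chosen to show this, alongside two complementary checks.

```python
import pandas as pd

# Hypothetical values chosen so that the errors cancel out on average.
toy_los = pd.DataFrame({"DER_los": [2, 5, 7, 1], "LENGTH_OF_STAY": [2, 7, 5, 1]})

difference = toy_los["LENGTH_OF_STAY"] - toy_los["DER_los"]

mean_difference = difference.mean()  # signed errors cancel
mean_abs_difference = difference.abs().mean()  # average size of the disagreement
n_mismatched = int((difference != 0).sum())  # number of rows that disagree

print(mean_difference, mean_abs_difference, n_mismatched)
```

Here the mean difference is zero even though half of the rows disagree, which the mean absolute difference and the mismatch count both reveal.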

## Data ordering

How is data ordered?

In [None]:
# not ordered by local patient id
datetime_df.LOCAL_PATIENT_IDENTIFIER.head(10)

# nb. value_counts() shows repeat visits - could this be a feature?

In [None]:
# not ordered by start-date
datetime_df.START_DATE_TIME_HOSPITAL_PROVIDER_SPELL.head(10)

In [None]:
# not by end-date
datetime_df.DISCHARGE_DATE_HOSPITAL_PROVIDER_SPELL.head(10)

In [None]:
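# not ordered by cds_unique_identifier
# (use head() rather than sample(): a random sample cannot reveal row order)
datetime_df.cds_unique_identifier.head(10)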
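
Inspecting the first ten rows only gives a partial view of the ordering. As a complementary check, pandas exposes `is_monotonic_increasing` on every Series, which covers the whole column. This is a minimal sketch on a toy frame whose column names are illustrative stand-ins.

```python
import pandas as pd

# Hypothetical frame; column names are illustrative stand-ins.
toy_df = pd.DataFrame(
    {
        "patient_id": [3, 1, 2],
        "start": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    }
)

# is_monotonic_increasing is True only if every value is >= the one before it.
ordering = {col: toy_df[col].is_monotonic_increasing for col in toy_df.columns}
print(ordering)
```

A column that reports `True` is a candidate for the key the dataset was sorted by.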

Re-order data by `START_DATE_TIME_HOSPITAL_PROVIDER_SPELL`:

In [None]:
datetime_df.sort_values(by="START_DATE_TIME_HOSPITAL_PROVIDER_SPELL", inplace=True)
datetime_df.reset_index(drop=True, inplace=True)
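
Once rows are in admission order, the repeat visits noted earlier could be turned into a candidate feature. The sketch below counts each patient's earlier spells with `groupby().cumcount()`. The `prior_spells` column name and the toy values are hypothetical, and whether such a feature is useful is left open.

```python
import pandas as pd

# Hypothetical visits for two patients, deliberately out of date order.
toy_visits = pd.DataFrame(
    {
        "LOCAL_PATIENT_IDENTIFIER": ["a", "b", "a", "a", "b"],
        "START_DATE_TIME_HOSPITAL_PROVIDER_SPELL": pd.to_datetime(
            ["2021-03-01", "2021-01-15", "2021-01-01", "2021-02-01", "2021-04-01"]
        ),
    }
)

toy_visits = toy_visits.sort_values(
    by="START_DATE_TIME_HOSPITAL_PROVIDER_SPELL"
).reset_index(drop=True)

# cumcount() numbers each patient's rows 0, 1, 2, ... in the current row order,
# so after sorting by start date it equals the number of earlier spells.
toy_visits["prior_spells"] = toy_visits.groupby("LOCAL_PATIENT_IDENTIFIER").cumcount()
print(toy_visits["prior_spells"].tolist())
```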

## Explore missing data

We can generate a heatmap of missing data to quickly visualise the totality of missing data in one image. This will only capture significant areas of missing data, but can be useful to identify sparse columns, rows and blocks of missing data.

The heatmap shows missing data in black and present data in white, so any patches of black indicate missing data:

In [None]:
sns.set(rc={"figure.figsize": (15, 8)})
# the "binary" colormap draws False (present) as white and True (missing) as black
sns.heatmap(datetime_df.isnull(), cbar=False, cmap="binary");

We can also summarise the number of missing values per column in tabular format, which can help identify columns with smaller amounts of missing data:

In [None]:
datetime_df.isnull().sum()
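
Raw counts are hard to compare without the total number of rows in mind. A small variation, sketched here on toy data with hypothetical column names, reports the share of missing values per column and sorts the sparsest columns to the top.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns and different levels of missingness.
toy_df = pd.DataFrame(
    {
        "complete": [1, 2, 3, 4],
        "half_missing": [1.0, np.nan, 3.0, np.nan],
        "mostly_missing": [None, None, None, "x"],
    }
)

# The mean of a boolean mask is the fraction of True values, i.e. the share missing.
missing_pct = (toy_df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```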

## Correlation plot

In [None]:
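# Pearson correlation by default; select numeric/boolean columns explicitly,
# as newer pandas versions raise on non-numeric columns instead of dropping them
corr = datetime_df.select_dtypes(include=["number", "bool"]).corr()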

In [None]:
sns.set_theme(style="white")

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 8))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)


# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(
    corr,
    mask=mask,
    cmap=cmap,
    vmax=0.3,
    center=0,
    square=True,
    linewidths=0.5,
    cbar_kws={"shrink": 0.5},
)

In [None]:
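# check for correlation between features
# keep |r| > 0.1 and exclude values of exactly 1.0 (the diagonal); each pair appears twice
corr.where((corr != 1.00) & (corr.abs() > 0.1)).abs().unstack().dropna().sort_values(
    ascending=False
)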
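
Filtering on `corr != 1.00` removes the diagonal, but it also hides any pair of distinct features that is perfectly correlated, and every remaining pair is listed twice. An alternative sketch, shown here on toy data with hypothetical columns, keeps only the strict upper triangle of the matrix so that each pair appears once.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical columns.
toy_df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10],  # perfectly correlated with a
        "c": [5, 3, 4, 1, 2],
    }
)
toy_corr = toy_df.corr()

# Keep only the strict upper triangle (k=1 excludes the diagonal),
# so each pair appears once and self-correlations are dropped.
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = toy_corr.where(upper).unstack().dropna()

# Rank pairs by absolute correlation, strongest first.
ranked = pairs.loc[pairs.abs().sort_values(ascending=False).index]
print(ranked)
```

The perfectly correlated pair (`a`, `b`) is retained and ranked first, which is useful when looking for redundant columns.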

In [None]:
# check for correlation with LENGTH OF STAY
corr.LENGTH_OF_STAY[corr.LENGTH_OF_STAY.abs().sort_values(ascending=False).index]

## Check duplicate rows

In [None]:
datetime_df.duplicated().sum()
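
`duplicated()` with no arguments only flags rows that are identical in every column. If an identifier such as `cds_unique_identifier` is expected to be unique per record (an assumption to confirm with the SME), duplicates can also be checked on that key alone. This is a minimal sketch on toy data with hypothetical values.

```python
import pandas as pd

# Toy data: two rows share an identifier but differ in another column.
toy_df = pd.DataFrame(
    {
        "cds_unique_identifier": ["x1", "x2", "x2", "x3"],
        "LENGTH_OF_STAY": [3, 5, 6, 2],
    }
)

# Fully identical rows: none in this toy example.
full_row_dupes = int(toy_df.duplicated().sum())

# Rows sharing an identifier; keep=False returns every member of each group.
key_dupes = toy_df[toy_df.duplicated(subset="cds_unique_identifier", keep=False)]

print(full_row_dupes, len(key_dupes))
```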

## Check duplicate columns

In [None]:
# How is FIRST_START_DATE_TIME_WARD_STAY different to START_DATE_TIME_HOSPITAL_PROVIDER_SPELL?
# Cast FIRST_START_DATE_TIME_WARD_STAY to datetime
datetime_df.FIRST_START_DATE_TIME_WARD_STAY = pd.to_datetime(
    datetime_df.FIRST_START_DATE_TIME_WARD_STAY, format="%Y-%m-%d %H:%M:%S.%f"
)

In [None]:
# check if FIRST_START_DATE_TIME_WARD_STAY is the same as START_DATE_TIME_HOSPITAL_PROVIDER_SPELL
datetime_df.FIRST_START_DATE_TIME_WARD_STAY.equals(
    datetime_df.START_DATE_TIME_HOSPITAL_PROVIDER_SPELL
)

In [None]:
# they are different, so work out what the difference is between the columns
datetime_df[
    ["FIRST_START_DATE_TIME_WARD_STAY", "START_DATE_TIME_HOSPITAL_PROVIDER_SPELL"]
].sample(10).diff(axis=1)

In [None]:
# rows where FIRST_START_DATE_TIME_WARD_STAY is NaT give a NaT difference, so drop them
# then count the rows where the two timestamps actually differ (in either direction)
ward_vs_spell_diff = (
    datetime_df["START_DATE_TIME_HOSPITAL_PROVIDER_SPELL"]
    - datetime_df["FIRST_START_DATE_TIME_WARD_STAY"]
).dropna()
(ward_vs_spell_diff != pd.Timedelta(0)).sum()

## Pandas profiling

Pandas profiling will (after some time) generate an overall profile of the dataset, including histograms, frequent values and checks such as sparsity and ordinality that can help generate further questions for the data subject matter expert (SME).

In [None]:
pd.__version__
# note: pandas-profiling has a bug with pandas 1.4.1: https://github.com/ydataai/pandas-profiling/issues/911
# use an earlier pandas version (e.g. 1.3.5)

In [None]:
# The dataset is large and profiling crashes without minimal=True
profile = ProfileReport(datetime_df, title="Pandas Profiling Report", minimal=True)

In [None]:
profile

## Export data

Export the data containing our target population (major, non-elective cases).

In [None]:
# nb. this is outside the git tree
major_df.to_parquet("../../data/interim/major-data.parquet")