# 01 Data Loading

Notebook goal: load in data and perform basic data quality checks:

* check that the row and column counts align with the data warehouse version
* check column names match previous work

Inputs|Outputs
---|---
`raw/original-data.csv`|`raw/original-data.parquet`

## 1. Load Azure ML dataset into pandas dataframe

The directory structure of this project includes **data stored outside of the git tree**. This ensures that, when coding in the open, no data can accidentally be committed to the repository, whether by using `git add -f` to override a `.gitignore` file or by skipping the `pre-commit` hooks.

A `project-directory` must first be created, inside of which this repository can be cloned (into e.g. `repo-directory`).

The `data` and `models` folders are stored at the top level, outside the git tree, and must be created manually beforehand, including the subdirectories `data/interim`, `data/processed` and `data/raw`:

```
project-directory
├── repo-directory
│   ├── .git
│   ├── .github
│   ├── config
│   ├── docs
│   ├── notebooks
│   └── src
├── data
│   ├── interim
│   ├── processed
│   └── raw
└── models
```
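
The folders shown above could also be created with a short script rather than by hand. This is a minimal sketch, not part of the original project: the helper name `create_data_dirs` is hypothetical, and it assumes it is given the path to `project-directory`.

```python
from pathlib import Path


def create_data_dirs(project_dir):
    """Create the data and models folders that sit outside the git tree."""
    project_dir = Path(project_dir)
    created = []
    for rel in ("data/interim", "data/processed", "data/raw", "models"):
        path = project_dir / rel
        # parents=True creates "data" when needed; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Run from the `notebooks` folder, `create_data_dirs("../..")` would target `project-directory`, matching the relative paths used later in this notebook.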

In [None]:
import os

import pandas as pd

# azureml-core of version 1.0.72 or higher is required
# azureml-dataprep[pandas] of version 1.1.34 or higher is required
from azureml.core import Dataset, Workspace
from dotenv import load_dotenv

# import azure ml environment variables
load_dotenv("../config/.env")

for var in [
    "subscription_id",
    "resource_group",
    "workspace_name",
    "dataset_name",
    "datafile_name",
]:
    if var not in os.environ:
        raise NameError(f"Please define {var} in ../config/.env")

subscription_id = os.environ.get("subscription_id")
resource_group = os.environ.get("resource_group")
workspace_name = os.environ.get("workspace_name")
dataset_name = os.environ.get("dataset_name")
datafile_name = os.environ.get("datafile_name")

workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name=dataset_name)

In [None]:
# Convert tabular dataset to a pandas dataframe
# df = dataset.to_pandas_dataframe()

# Download file dataset locally, outside of git tree
dataset.download(target_path="../../data/raw/", overwrite=False)

In [None]:
# Load into pandas
df = pd.read_csv(f"../../data/raw/{datafile_name}", low_memory=False)

## 2. Data Quality checks

In [None]:
df.shape

In [None]:
df.describe()
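
The row and column count check against the data warehouse can be made explicit instead of read off `df.shape` by eye. The sketch below is an illustration only: `check_shape` is a hypothetical helper, and the expected counts must come from the data warehouse (none are given in this notebook).

```python
def check_shape(actual_shape, expected_rows, expected_cols):
    """Return a list of mismatch messages; an empty list means the counts agree."""
    rows, cols = actual_shape
    problems = []
    if rows != expected_rows:
        problems.append(f"row count {rows} != expected {expected_rows}")
    if cols != expected_cols:
        problems.append(f"column count {cols} != expected {expected_cols}")
    return problems
```

Passing `df.shape` together with the warehouse counts returns an empty list when both agree, which makes the check easy to assert on in later runs.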

### Check columns are the same as GitHub columns

[Previous work](https://github.com/nhsx/skunkworks-long-stayer-risk-stratification) implemented a convolutional neural network to predict length of stay using the same data.

Here we check that we have the same data, since it was delivered under a separate DPA.

In [None]:
# From https://github.com/nhsx/skunkworks-long-stayer-risk-stratification/blob/main/training/README.md
original_header = "LENGTH_OF_STAY,LENGTH_OF_STAY_IN_MINUTES,ADMISSION_METHOD_HOSPITAL_PROVIDER_SPELL_DESCRIPTION,AGE_ON_ADMISSION,DISCHARGE_DATE_HOSPITAL_PROVIDER_SPELL,ETHNIC_CATEGORY_CODE_DESCRIPTION,DISCHARGE_READY_DATE,DIVISION_NAME_AT_ADMISSION,EXPECTED_DISCHARGE_DATE,EXPECTED_DISCHARGE_DATE_TIME,FIRST_REGULAR_DAY_OR_NIGHT_ADMISSION_DESCRIPTION,FIRST_START_DATE_TIME_WARD_STAY,FIRST_WARD_STAY_IDENTIFIER,IS_PATIENT_DEATH_DURING_SPELL,MAIN_SPECIALTY_CODE_AT_ADMISSION,MAIN_SPECIALTY_CODE_AT_ADMISSION_DESCRIPTION,PATIENT_CLASSIFICATION,PATIENT_CLASSIFICATION_DESCRIPTION,POST_CODE_AT_ADMISSION_DATE_DISTRICT,SOURCE_OF_ADMISSION_HOSPITAL_PROVIDER_SPELL,SOURCE_OF_ADMISSION_HOSPITAL_PROVIDER_SPELL_DESCRIPTION,START_DATE_HOSPITAL_PROVIDER_SPELL,START_DATE_TIME_HOSPITAL_PROVIDER_SPELL,TREATMENT_FUNCTION_CODE_AT_ADMISSION,TREATMENT_FUNCTION_CODE_AT_ADMISSION_DESCRIPTION,elective_or_non_elective,stroke_ward_stay,PATIENT_GENDER_CURRENT,PATIENT_GENDER_CURRENT_DESCRIPTION,LOCAL_PATIENT_IDENTIFIER,SpellDominantProcedure,all_diagnoses,cds_unique_identifier,previous_30_day_hospital_provider_spell_number,ED_attendance_episode_number,unique_internal_ED_admission_number,unique_internal_IP_admission_number,reason_for_admission,IS_care_home_on_admission,IS_care_home_on_discharge,ae_attendance_category,ae_arrival_mode,ae_attendance_disposal,ae_attendance_category_code,healthcare_resource_group_code,presenting_complaint_code,presenting_complaint,wait,wait_minutes,all_investigation_codes,all_diagnosis_codes,all_treatment_codes,all_breach_reason_codes,all_location_codes,all_investigations,all_diagnosis,all_treatments,all_local_investigation_codes,all_local_investigations,all_local_treatment_codes,all_local_treatments,attendance_type,initial_wait,initial_wait_minutes,major_minor,IS_major,ae_patient_group_code,ae_patient_group,ae_initial_assessment_triage_category_code,ae_initial_assessment_triage_category,manchester_triage_category,arrival_day_of_week,arrival_month_name,Illness Injury Flag,Mental Health Flag,Frailty Proxy,Presenting Complaint Group,IS_cancer,cancer_type,IS_chronic_kidney_disease,IS_COPD,IS_coronary_heart_disease,IS_dementia,IS_diabetes,diabetes_type,IS_frailty_proxy,IS_hypertension,IS_mental_health,IMD county decile,District,Rural urban classification,OAC Group Name,OAC Subgroup Name,OAC Supergroup Name,EMCountLast12m,EL CountLast12m,ED CountLast12m,OP First CountLast12m,OP FU CountLast12m"

In [None]:
original_headers = original_header.split(",")

for col in df.columns:
    if col not in original_headers:
        print(f"{col} additional to original columns")

for col in original_headers:
    if col not in df.columns:
        print(f"{col} not found")

print(df.columns.size)
print(len(original_headers))
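
The same comparison can be expressed with set differences, which returns the mismatches as data rather than printing them. This is a sketch only: `compare_columns` is a hypothetical helper, not part of the original notebook. Like the loops above, it ignores column order, and because it uses sets it also ignores duplicate names.

```python
def compare_columns(actual, expected):
    """Return (extra, missing) column names, each sorted alphabetically."""
    actual_set, expected_set = set(actual), set(expected)
    # in the data but not in the original header
    extra = sorted(actual_set - expected_set)
    # in the original header but not in the data
    missing = sorted(expected_set - actual_set)
    return extra, missing
```

In this notebook it could be called as `extra, missing = compare_columns(df.columns, original_headers)`. Two empty lists indicate that the column names match.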

## 3. Save to parquet

Parquet is a more performant data format than CSV: it is a compressed, columnar format that preserves column types.

In [None]:
# nb. this is outside the git tree
df.to_parquet("../../data/raw/original-data.parquet")