In [None]:
import pandas as pd

### Check data quality of raw artificial APC data

In [34]:
path = "../data/raw/artificial_hes_apc_202302_v1_full/artificial_hes_apc_202302_v1_full/artificial_hes_apc_2122.csv"

NEEDED_COLS = [
    "PSEUDO_HESID", "EPIKEY", "EPIORDER",
    "ADMIDATE", "DISDATE", "ADMIMETH", "ADMISORC",
    "DIAG_4_01", "ETHNOS", "SEX", "STARTAGE", "LSOA11",
    "SPELBGIN", "SPELEND", "SPELDUR", "SPELDUR_CALC", "FYEAR",
]

raw_apc = pd.read_csv(
    path,
    usecols=NEEDED_COLS
)

In [39]:
raw_apc.head()

Unnamed: 0,FYEAR,EPIKEY,PSEUDO_HESID,ADMIDATE,ADMIMETH,ADMISORC,DIAG_4_01,DISDATE,EPIORDER,ETHNOS,SEX,SPELBGIN,SPELDUR,SPELDUR_CALC,SPELEND,STARTAGE,LSOA11
0,2122,811282731008,TESToBeA1ZCAwu96JtChcj174KEKSgD9,2021-03-31,11,19,K315,2021-06-22,1,A,2,0.0,0.0,2.0,Y,84.0,E01025529
1,2122,945889480421,TESToBeA1ZCAwu96JtChcj174KEKSgD9,2021-05-18,2D,19,K573,2021-06-16,1,A,2,2.0,0.0,0.0,Y,84.0,E01025463
2,2122,528111587051,TESTumS1fdRDx6j7yODr6lFffI5BZBDF,2021-05-15,24,19,I269,2021-05-27,1,A,2,0.0,4.0,2.0,Y,90.0,E01003625
3,2122,912335548721,TESTZhkl8wktpxboD3w4h9gznMl8EhMV,2021-04-28,11,19,I839,2021-06-11,1,A,2,2.0,0.0,0.0,Y,25.0,E01018984
4,2122,171243392039,TESTfJm3gvXDWqIhFLnKGYC07r5XTNNo,2021-04-27,25,19,N185,2021-04-27,1,A,1,2.0,0.0,1.0,Y,79.0,E01013736


From head above:
- The top rows show incorrect spell durations and concurrent spells, neither of which is expected.
- E.g. for patient 'TESToBeA1ZCAwu96JtChcj174KEKSgD9':
    - Admitted on 2021-03-31 and discharged on 2021-06-22 (83 days later), but SPELDUR is shown as 0 days.
    - The same patient also has an admission on 2021-05-18 with discharge on 2021-06-16, which falls entirely within the spell listed above.
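
The duration mismatch noted above can be quantified across the whole table rather than read off a few rows. What follows is a minimal sketch, assuming SPELDUR is meant to equal the number of days between ADMIDATE and DISDATE. The function name `spell_duration_mismatch` and the demo rows (dates taken from the head output, patient IDs shortened to P1/P2) are illustrative and not part of the notebook.

```python
import pandas as pd

def spell_duration_mismatch(df: pd.DataFrame) -> pd.DataFrame:
    """Compare recorded SPELDUR with the day difference DISDATE - ADMIDATE."""
    adm = pd.to_datetime(df["ADMIDATE"], errors="coerce")
    dis = pd.to_datetime(df["DISDATE"], errors="coerce")
    out = df[["PSEUDO_HESID", "ADMIDATE", "DISDATE", "SPELDUR"]].copy()
    # Assumption: SPELDUR should be the plain (non-inclusive) day difference.
    out["recomputed_days"] = (dis - adm).dt.days
    # Only flag rows where both values are present and they disagree.
    out["mismatch"] = (
        out["SPELDUR"].notna()
        & out["recomputed_days"].notna()
        & (out["SPELDUR"] != out["recomputed_days"])
    )
    return out

# Illustrative rows only (hypothetical patient IDs).
demo = pd.DataFrame({
    "PSEUDO_HESID": ["P1", "P1", "P2"],
    "ADMIDATE": ["2021-03-31", "2021-05-18", "2021-04-27"],
    "DISDATE": ["2021-06-22", "2021-06-16", "2021-04-27"],
    "SPELDUR": [0.0, 0.0, 0.0],
})
result = spell_duration_mismatch(demo)
print(result)
```

On the real table, `spell_duration_mismatch(raw_apc)["mismatch"].mean()` would give the share of rows affected.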

Next steps:
- Compare episode and spell counts for each patient (expect the episode count to be greater than or equal to the spell count).
- Check whether more records have concurrent spells; if so, consider a fix.
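
One way to look for concurrent spells is to sort each patient's spells by admission date and flag any spell that starts before the latest discharge seen so far. This is a sketch only: it assumes a spell can be identified by (PSEUDO_HESID, ADMIDATE, DISDATE), and the function name `flag_overlapping_spells` and the demo rows are hypothetical.

```python
import pandas as pd

def flag_overlapping_spells(df: pd.DataFrame) -> pd.DataFrame:
    """Flag spells admitted before an earlier spell of the same patient has ended."""
    spells = (
        df[["PSEUDO_HESID", "ADMIDATE", "DISDATE"]]
        .drop_duplicates()
        .assign(
            ADMIDATE=lambda d: pd.to_datetime(d["ADMIDATE"], errors="coerce"),
            DISDATE=lambda d: pd.to_datetime(d["DISDATE"], errors="coerce"),
        )
        .dropna(subset=["ADMIDATE", "DISDATE"])
        .sort_values(["PSEUDO_HESID", "ADMIDATE", "DISDATE"])
        .reset_index(drop=True)
    )
    # Latest discharge among the patient's earlier spells (running max, shifted by one row).
    prev_max_dis = spells.groupby("PSEUDO_HESID")["DISDATE"].transform(
        lambda s: s.cummax().shift()
    )
    # Strict "<": an admission on the day of a previous discharge is not counted as overlap.
    spells["overlaps_earlier"] = spells["ADMIDATE"] < prev_max_dis
    return spells

# Illustrative rows only (hypothetical patient IDs).
demo = pd.DataFrame({
    "PSEUDO_HESID": ["P1", "P1", "P2"],
    "ADMIDATE": ["2021-03-31", "2021-05-18", "2021-05-15"],
    "DISDATE": ["2021-06-22", "2021-06-16", "2021-05-27"],
})
overlaps = flag_overlapping_spells(demo)
print(overlaps)
```

Whether a same-day readmission should count as an overlap is a judgment call; switching the comparison to `<=` would include those cases.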

In [None]:
# Patients whose episode count differs from their number of distinct admission dates
dq = raw_apc.groupby("PSEUDO_HESID").agg(
    n_episodes=("EPIKEY", "count"),
    n_unique_admissions=("ADMIDATE", "nunique"),
).query("n_episodes != n_unique_admissions")

dq

In [54]:
raw_apc[raw_apc["PSEUDO_HESID"] == 'TESTzzlueBoitsgb1W8cSRPaBr5QqT0g'].sort_values(by=["ADMIDATE", "EPIORDER"])

The checks above show that some patients have more episodes than unique admission dates, which is expected because a single spell can contain several episodes. However, inspecting individual patient records reveals concurrent spells and inconsistent spell-duration values.

This is because the data is synthetic: as the synthetic HES data documentation states, the data "does not preserve relationships between fields".

As a result, to build a usable analysis table, I will:
- Collapse to one row per spell, where a spell is defined by the unique combination of (patient, admission date, discharge date).
- Recalculate length of stay (LOS) from these dates.
- Take the modal primary diagnosis, ethnicity, and sex per spell, with a deterministic tie-break.
- Flag any emergency admission within the spell.
- Join deprivation (IMD) data and use IMD quintiles.
- Create a stable spell ID (patient|adm|dis) for downstream analysis.
- Treat implausible dates (e.g. < 1900) as missing before calculations.
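
The steps above can be sketched roughly as follows. This is not the contents of the cleaning script, only an illustration under stated assumptions: ADMIMETH codes beginning with "2" are treated as emergency admissions, ties in the mode are broken by taking the smallest value, and LOS is counted inclusively (discharge minus admission plus one day), which is inferred from the processed output in this notebook, where a same-day spell has `los_days` of 1.0. The IMD join is omitted because the lookup table is not shown here. The names `stable_mode`, `collapse_to_spells`, and `MIN_DATE` are hypothetical.

```python
import pandas as pd

MIN_DATE = pd.Timestamp("1900-01-01")  # assumed plausibility cutoff

def stable_mode(s: pd.Series):
    """Most frequent non-null value; ties broken by taking the smallest value."""
    modes = s.dropna().mode()
    return modes.min() if len(modes) else None

def collapse_to_spells(episodes: pd.DataFrame) -> pd.DataFrame:
    df = episodes.copy()
    for col in ("ADMIDATE", "DISDATE"):
        df[col] = pd.to_datetime(df[col], errors="coerce")
        df[col] = df[col].where(df[col] >= MIN_DATE)  # implausible dates become NaT
    # Assumption: ADMIMETH codes starting with "2" denote emergency admissions.
    df["is_emerg"] = df["ADMIMETH"].astype(str).str.startswith("2")
    df = df.dropna(subset=["ADMIDATE", "DISDATE"])
    spells = (
        df.groupby(["PSEUDO_HESID", "ADMIDATE", "DISDATE"], as_index=False)
        .agg(
            n_episodes=("EPIKEY", "count"),
            any_emerg=("is_emerg", "any"),
            primary_diag=("DIAG_4_01", stable_mode),
            ethnicity=("ETHNOS", stable_mode),
            sex=("SEX", stable_mode),
        )
        .rename(columns={"ADMIDATE": "adm", "DISDATE": "dis"})
    )
    # Assumption: inclusive day count, so a same-day spell has LOS 1.
    spells["los_days"] = (spells["dis"] - spells["adm"]).dt.days + 1
    spells["spell_id"] = (
        spells["PSEUDO_HESID"]
        + "|" + spells["adm"].dt.strftime("%Y-%m-%d")
        + "|" + spells["dis"].dt.strftime("%Y-%m-%d")
    )
    return spells

# Illustrative rows only: three episodes of one spell, plus one row with an implausible date.
demo = pd.DataFrame({
    "PSEUDO_HESID": ["P1", "P1", "P1", "P2"],
    "EPIKEY": [1, 2, 3, 4],
    "ADMIDATE": ["2021-04-27", "2021-04-27", "2021-04-27", "1800-01-01"],
    "DISDATE": ["2021-04-29", "2021-04-29", "2021-04-29", "2021-05-01"],
    "ADMIMETH": ["11", "2D", "11", "21"],
    "DIAG_4_01": ["N185", "I269", "N185", "K315"],
    "ETHNOS": ["C", "A", "B", "A"],
    "SEX": [1, 1, 2, 2],
})
spells = collapse_to_spells(demo)
print(spells)
```

In this sketch, rows with a missing or implausible admission or discharge date are dropped, since they cannot be assigned to a spell; keeping them with a missing LOS would be an equally defensible choice.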

/tools/repair_synthetic_hes.py runs this cleaning process. See output below.


In [None]:
df = pd.read_parquet("../data/processed/apc_clean.parquet")
df.head()

Unnamed: 0,PSEUDO_HESID,adm,dis,n_episodes,los_days,any_emerg,imd_quintile,primary_diag,ethnicity,sex,spell_id
0,TEST2tY19ZzGmlISgbIeRHDw51lPrGZ6,2019-05-04,2019-10-08,1,158.0,True,1,M866,C,1,TEST2tY19ZzGmlISgbIeRHDw51lPrGZ6|2019-05-04|20...
1,TESTZX0ThypwRU3229MUU9vUFowEkf8Z,2019-08-03,2019-08-03,1,1.0,True,2,I489,A,2,TESTZX0ThypwRU3229MUU9vUFowEkf8Z|2019-08-03|20...
2,TEST7jQcnbSY1MSkpH0cC1YXkmCkazfQ,2019-03-21,2019-05-02,1,43.0,False,3,D869,A,1,TEST7jQcnbSY1MSkpH0cC1YXkmCkazfQ|2019-03-21|20...
3,TESTGTPdNWx8qykR1iz2aHgS4bcPFphg,2019-07-09,2019-11-23,1,138.0,False,4,O680,A,1,TESTGTPdNWx8qykR1iz2aHgS4bcPFphg|2019-07-09|20...
4,TESTkyWBrKAUFvrpO0YDLH1UNRGs7YRp,2019-05-17,2019-10-09,1,146.0,False,1,J181,A,2,TESTkyWBrKAUFvrpO0YDLH1UNRGs7YRp|2019-05-17|20...
