# MIMIC-IV ECG metadata quick look

Use this notebook inside the `707Project` micromamba environment to explore the precomputed stratified metadata.



**Instructions / Usage**

1. In a terminal, run `micromamba activate 707Project` to activate the environment.
2. Go to the `707FinalProject/notebooks` folder and start `jupyter lab` or `jupyter notebook`.
3. Run the cells below one by one to load and explore the metadata.


In [1]:
from pathlib import Path
import pandas as pd
from IPython.display import display

DATA_ROOT = Path("../preprocess/data")
META_PATH = DATA_ROOT / "records_w_diag_icd10_folds.pkl"

print(f"Loading metadata from: {META_PATH.resolve()}")
df = pd.read_pickle(META_PATH)
print(f"Rows: {len(df):,}  Columns: {len(df.columns)}")
df.head(0)  # zero rows: display the column names only


Loading metadata from: C:\Users\23180\Duke\707\Project\707FinalProject\preprocess\data\records_w_diag_icd10_folds.pkl
Rows: 800,035  Columns: 22


file_name,study_id,subject_id,ecg_time,ed_stay_id,ed_hadm_id,hosp_hadm_id,ed_diag_ed,ed_diag_hosp,hosp_diag_hosp,...,gender,age,anchor_year,anchor_age,dod,ecg_no_within_stay,ecg_taken_in_ed,ecg_taken_in_hosp,ecg_taken_in_ed_or_hosp,fold
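
Because `DATA_ROOT` is a relative path, the load above only succeeds when Jupyter is started from the `notebooks` folder. The sketch below wraps the load in a small guard that fails with a clearer message. `load_metadata` is a hypothetical helper, not part of the project, and the demo writes a toy pickle with made-up values to a temporary directory instead of touching the real file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_metadata(path):
    """Hypothetical helper: read the pickle, or fail with the resolved path."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Metadata not found at {path.resolve()}; "
            "check the working directory Jupyter was started from."
        )
    return pd.read_pickle(path)


# Demo on a toy frame (made-up values), not the real MIMIC-IV metadata.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_meta.pkl"
    pd.DataFrame({"study_id": [1, 2], "fold": [0, 1]}).to_pickle(toy_path)
    toy_df = load_metadata(toy_path)

    # A path that does not exist should raise the clearer error.
    try:
        load_metadata(Path(tmp) / "does_not_exist.pkl")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

print(f"Rows: {len(toy_df):,}  Columns: {len(toy_df.columns)}")
print(f"Missing file raised FileNotFoundError: {missing_raised}")
```

In the notebook itself, the same helper could be called as `load_metadata(META_PATH)` in place of `pd.read_pickle(META_PATH)`.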


In [2]:
column_docs = {
    "file_name": "Path inside the official ECG zip where the WFDB header/data lives.",
    "study_id": "Unique ECG recording identifier.",
    "subject_id": "MIMIC-IV patient identifier (subject_id).",
    "ecg_time": "Timestamp recorded in the WFDB header for this ECG.",
    "ed_stay_id": "Emergency department stay identifier that contains the ECG time.",
    "ed_hadm_id": "Hospital admission associated with the ED stay (if any).",
    "hosp_hadm_id": "Inpatient hospital admission whose window contains the ECG time.",
    "ed_diag_ed": "Diagnoses coded during the ED stay (ICD mapped to ICD-10).",
    "ed_diag_hosp": "Hospital discharge diagnoses tied to the ED stay.",
    "hosp_diag_hosp": "Diagnoses from the inpatient hospital admission.",
    "all_diag_hosp": "Union of hospital + ED discharge diagnoses (if present).",
    "all_diag_all": "Fallback diagnoses: uses hospital union when available, else ED diagnoses.",
    "gender": "Binary sex reported in MIMIC-IV patients table.",
    "age": "Age at ECG time computed from MIMIC anchor_year/anchor_age.",
    "anchor_year": "MIMIC-IV anchor year used to de-identify timelines.",
    "anchor_age": "Patient's age in anchor_year, as given in the MIMIC-IV patients table.",
    "dod": "Date of death (NaT if patient alive/unknown).",
    "ecg_no_within_stay": "Zero-based index of the ECG within its stay/admission (-1 appears to mark ECGs outside any stay).",
    "ecg_taken_in_ed": "Boolean flag: ECG within an ED stay.",
    "ecg_taken_in_hosp": "Boolean flag: ECG within an inpatient admission.",
    "ecg_taken_in_ed_or_hosp": "Boolean flag: ECG overlaps ED or inpatient care.",
    "fold": "Precomputed fold index for reproducible training/validation splits.",
}
dtype_lookup = df.dtypes.astype(str).to_dict()
column_doc_df = pd.DataFrame(
    [
        {"column": col, "dtype": dtype_lookup.get(col, "n/a"), "description": desc}
        for col, desc in column_docs.items()
    ]
)
column_doc_df


,column,dtype,description
0,file_name,object,Path inside the official ECG zip where the WFD...
1,study_id,int64,Unique ECG recording identifier.
2,subject_id,int64,MIMIC-IV patient identifier (subject_id).
3,ecg_time,datetime64[ns],Timestamp recorded in the WFDB header for this...
4,ed_stay_id,float64,Emergency department stay identifier that cont...
5,ed_hadm_id,float64,Hospital admission associated with the ED stay...
6,hosp_hadm_id,float64,Inpatient hospital admission whose window cont...
7,ed_diag_ed,object,Diagnoses coded during the ED stay (ICD mapped...
8,ed_diag_hosp,object,Hospital discharge diagnoses tied to the ED stay.
9,hosp_diag_hosp,object,Diagnoses from the inpatient hospital admission.
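
The cell above uses `dtype_lookup.get(col, "n/a")`, so a documented name that is missing from the frame shows up as `n/a`, while a frame column with no description is silently skipped. A quick comparison in both directions can confirm the dictionary and the frame stay in sync. The sketch below uses a toy frame and a toy dictionary (`toy_df`, `toy_docs`, and `extra_col` are made up for illustration); in the notebook the same two comprehensions would run on `df` and `column_docs`.

```python
import pandas as pd

# Toy stand-ins (made-up values) for df and column_docs.
toy_df = pd.DataFrame(
    {"study_id": [1, 2], "fold": [0, 1], "extra_col": ["a", "b"]}
)
toy_docs = {
    "study_id": "Unique ECG recording identifier.",
    "fold": "Precomputed fold index.",
    "gender": "Documented here but absent from the toy frame.",
}

# Columns present in the frame that have no description.
undocumented = [col for col in toy_df.columns if col not in toy_docs]

# Descriptions whose column is absent from the frame.
missing_from_df = [col for col in toy_docs if col not in toy_df.columns]

print("Undocumented columns:", undocumented)
print("Documented but absent:", missing_from_df)
```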


In [3]:
preview_cols = [
    "study_id",
    "subject_id",
    "ecg_time",
    "gender",
    "age",
    "fold",
    "ecg_taken_in_ed",
    "ecg_taken_in_hosp",
    "ecg_taken_in_ed_or_hosp",
    "all_diag_all",
]
df[preview_cols].head()


,study_id,subject_id,ecg_time,gender,age,fold,ecg_taken_in_ed,ecg_taken_in_hosp,ecg_taken_in_ed_or_hosp,all_diag_all
0,40689238,10000032,2180-07-23 08:44:00,F,52.0,1,True,False,True,"[F17200, Z21, Z66, R64, E875, F319, I9581, J44..."
1,44458630,10000032,2180-07-23 09:54:00,F,52.0,1,True,False,True,"[F17200, Z21, Z66, R64, E875, F319, I9581, J44..."
2,49036311,10000032,2180-08-06 09:07:00,F,52.0,1,False,True,True,"[Z21, R197, F17200, E875, J449, B182, E871, V4..."
3,45090959,10000117,2181-03-04 17:14:00,F,55.0,0,False,False,False,[]
4,48446569,10000117,2183-09-18 13:52:00,F,57.0,0,False,False,False,[]
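
The preview shows `all_diag_all` holding a list of ICD-10 codes per ECG, with `[]` for records that have no linked diagnoses. Assuming each cell really is a Python list (the output suggests this but does not prove it), the number of codes per record can be computed with `Series.str.len()`. The sketch below runs on a toy frame with made-up rows; `n_diag` and `has_diag` are hypothetical column names.

```python
import pandas as pd

# Toy frame (made-up rows) mimicking the list-valued diagnosis column.
toy = pd.DataFrame(
    {
        "study_id": [1, 2, 3],
        "all_diag_all": [["F17200", "Z21"], [], ["R197"]],
    }
)

# .str.len() returns the length of each list, so empty lists give 0.
toy["n_diag"] = toy["all_diag_all"].str.len()
toy["has_diag"] = toy["n_diag"] > 0

print(toy[["study_id", "n_diag", "has_diag"]])
print(f"Records with at least one code: {toy['has_diag'].mean():.1%}")
```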


In [4]:
summary_targets = {
    "fold": df["fold"],
    "ecg_taken_in_ed": df["ecg_taken_in_ed"],
    "ecg_taken_in_hosp": df["ecg_taken_in_hosp"],
    "ecg_taken_in_ed_or_hosp": df["ecg_taken_in_ed_or_hosp"],
}
for name, series in summary_targets.items():
    counts = series.value_counts(dropna=False).sort_index()
    summary = pd.DataFrame(
        {
            "count": counts,
            "pct": (counts / len(df) * 100).round(2),
        }
    )
    print(f"\nDistribution for {name}:")
    display(summary)



Distribution for fold:


fold,count,pct
0,590234,73.78
1,104004,13.0
2,105797,13.22



Distribution for ecg_taken_in_ed:


ecg_taken_in_ed,count,pct
False,799898,99.98
True,137,0.02



Distribution for ecg_taken_in_hosp:


ecg_taken_in_hosp,count,pct
False,501777,62.72
True,298258,37.28



Distribution for ecg_taken_in_ed_or_hosp:


ecg_taken_in_ed_or_hosp,count,pct
False,501645,62.7
True,298390,37.3
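
The counts above show 137 ED-flagged and 298,258 inpatient-flagged ECGs, which sum to 298,395, while the combined flag is true for 298,390. That gap of 5 would be consistent with 5 ECGs carrying both flags, if the combined column is a logical OR of the other two (an assumption; the notebook does not show how it was built). A cross-tabulation makes this kind of overlap visible directly. The sketch below uses a toy frame with made-up rows; in the notebook the same calls would run on `df`.

```python
import pandas as pd

# Toy frame (made-up rows) with the three boolean flags.
toy = pd.DataFrame(
    {
        "ecg_taken_in_ed": [True, False, False, True],
        "ecg_taken_in_hosp": [False, True, False, True],
    }
)
toy["ecg_taken_in_ed_or_hosp"] = toy["ecg_taken_in_ed"] | toy["ecg_taken_in_hosp"]

# Joint counts of the two flags: rows are ED, columns are inpatient.
overlap = pd.crosstab(toy["ecg_taken_in_ed"], toy["ecg_taken_in_hosp"])
print(overlap)

# Number of records flagged as both ED and inpatient.
both = int((toy["ecg_taken_in_ed"] & toy["ecg_taken_in_hosp"]).sum())

# Check whether the combined flag equals the logical OR of the other two.
or_matches = bool(
    (toy["ecg_taken_in_ed"] | toy["ecg_taken_in_hosp"]).equals(
        toy["ecg_taken_in_ed_or_hosp"]
    )
)
print(f"Flagged both: {both}  Combined flag equals OR: {or_matches}")
```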


In [5]:
sample_cols = [
    "study_id",
    "subject_id",
    "ecg_time",
    "ecg_no_within_stay",
    "gender",
    "age",
    "ed_diag_ed",
    "ed_diag_hosp",
    "hosp_diag_hosp",
    "all_diag_hosp",
    "all_diag_all",
    "fold",
]
df.sample(5, random_state=2025)[sample_cols].sort_values("subject_id")


,study_id,subject_id,ecg_time,ecg_no_within_stay,gender,age,ed_diag_ed,ed_diag_hosp,hosp_diag_hosp,all_diag_hosp,all_diag_all,fold
35547,45601591,10459488,2143-04-26 16:50:00,-1,F,66.0,[],[],[],[],[],0
294315,49300851,13690559,2179-10-08 13:33:00,-1,F,77.0,[],[],[],[],[],1
481046,41264204,15999095,2185-08-13 13:27:00,-1,M,79.0,[],[],[],[],[],1
742277,45441947,19271243,2178-06-08 17:07:00,-1,F,80.0,[],[],[],[],[],0
756236,49567564,19451054,2156-02-12 08:36:00,0,M,56.0,[],[],"[R197, K219, R55, R109, R110, F341, V173, R51,...","[R197, K219, R55, R110, R109, F341, V173, R51,...","[R197, K219, R55, R110, R109, F341, V173, R51,...",0
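
Since each diagnosis column stores a list of codes per ECG, counting how often a code occurs requires flattening the lists first. One way, assuming the cells are Python lists as the sample suggests, is `Series.explode()` followed by `value_counts()`. The sketch below runs on a toy frame with made-up rows; `code_counts` is a hypothetical name.

```python
import pandas as pd

# Toy frame (made-up rows) mimicking the list-valued diagnosis column.
toy = pd.DataFrame(
    {
        "study_id": [1, 2, 3],
        "all_diag_all": [["R197", "K219"], [], ["R197"]],
    }
)

# explode() emits one row per code; empty lists become NaN and are dropped.
codes = toy["all_diag_all"].explode().dropna()

# Frequency of each ICD-10 code across all records.
code_counts = codes.value_counts()
print(code_counts)
```

On the full metadata, `code_counts.head(20)` would list the most frequent codes, though exploding 800,035 rows of lists may take noticeably longer than on this toy example.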
