# Example ETL via MEDS-Extract

In this example, we'll extract raw data into the MEDS format with MEDS-Extract. First, let's inspect the (synthetic) raw input data:

In [6]:
from pathlib import Path

from pretty_print_directory import print_directory

DATA_ROOT = Path("raw_data")

In [7]:
print_directory(DATA_ROOT)

├── diagnoses.csv
├── labs_vitals.csv
├── medications.csv
└── patients.csv


We can see there are four `csv` files; let's look at what they contain, using [`polars`](https://docs.pola.rs/) as our dataframe engine.

In [8]:
import polars as pl
from IPython.display import display  # This is used for nice displays later

dfs = {}
for fn in ("diagnoses", "labs_vitals", "medications", "patients"):
    fp = DATA_ROOT / f"{fn}.csv"
    dfs[fn] = pl.read_csv(fp)
    print(f"{fn}: ")
    display(dfs[fn].head(2))

diagnoses: 


| patient_id | icd10_code |
| --- | --- |
| str | str |
| "P0001" | "E11.9" |
| "P0001" | "I10" |


labs_vitals: 


| test_name | patient_id | timestamp | result |
| --- | --- | --- | --- |
| str | str | str | f64 |
| "Glucose" | "P0001" | "2024-11-28T01:14:00" | 236.8 |
| "Blood Pressure Systolic" | "P0001" | "2025-02-14T08:04:00" | 153.0 |


medications: 


| medication_name | dose | patient_id | timestamp |
| --- | --- | --- | --- |
| str | str | str | str |
| "Metformin" | "500 mg" | "P0001" | "2024-01-17T13:26:00" |
| "Insulin glargine" | "10 units" | "P0001" | "2024-03-25T17:29:00" |


patients: 


| patient_id | eye_color | hair_color | birth_datetime | death_datetime |
| --- | --- | --- | --- | --- |
| str | str | str | str | str |
| "P0001" | "gray" | "black" | "1954-01-24T08:15:00" | "2011-12-15T06:38:07" |
| "P0002" | "gray" | "brown" | "1994-01-01T02:13:00" | "2020-12-05T03:18:10" |


Following the conventions of MEDS-Extract, we'll use the event conversion configuration file stored in `event_cfg.yaml` in this directory to parse these data:

In [9]:
print(Path("event_cfg.yaml").read_text())

subject_id_col: patient_id

patients:
  eye_color:
    code:
      - EYE_COLOR
      - col(eye_color)
    time: null
  hair_color:
    code:
      - HAIR_COLOR
      - col(hair_color)
    time: null
  dob:
    code: MEDS_BIRTH
    time: col(birth_datetime)
    time_format: "%Y-%m-%dT%H:%M:%S"
  dod:
    code: MEDS_DEATH
    time: col(death_datetime)
    time_format: "%Y-%m-%dT%H:%M:%S"

labs_vitals:
  lab:
    code: col(test_name)
    time: col(timestamp)
    time_format: "%Y-%m-%dT%H:%M:%S"
    numeric_value: col(result)

medications:
  med:
    code:
      - col(medication_name)
      - col(dose)
    time: col(timestamp)
    time_format: "%Y-%m-%dT%H:%M:%S"
    numeric_value: col(dose)

diagnoses:
  dx:
    code: col(icd10_code)
    time: col(timestamp)
    time_format: "%Y-%m-%dT%H:%M:%S"



Now we can run the pipeline with the standard MEDS-Transforms command-line syntax:

```bash
MEDS_transform-pipeline pkg://MEDS_extract.configs._extract.yaml --overrides input_dir=raw_data output_dir=$OUTPUT_DIR
```
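
If you would rather launch the pipeline from Python than from a shell cell, the same invocation can be assembled as an argument list. This is a minimal sketch: `build_pipeline_cmd` and `run_pipeline` are hypothetical helpers, and the only command-line syntax used is what appears in the command above.

```python
import subprocess

PIPELINE_CFG = "pkg://MEDS_extract.configs._extract.yaml"


def build_pipeline_cmd(input_dir, output_dir, pipeline_cfg=PIPELINE_CFG):
    """Assemble the command shown above as an argv list (hypothetical helper)."""
    return [
        "MEDS_transform-pipeline",
        pipeline_cfg,
        "--overrides",
        f"input_dir={input_dir}",
        f"output_dir={output_dir}",
    ]


def run_pipeline(input_dir, output_dir):
    """Run the pipeline and print stderr on failure instead of raising."""
    result = subprocess.run(
        build_pipeline_cmd(input_dir, output_dir),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(result.stderr)
    return result.returncode


cmd = build_pipeline_cmd("raw_data", "MEDS_output")
print(" ".join(cmd))
```

Calling `run_pipeline("raw_data", "MEDS_output")` requires the `MEDS_transform-pipeline` executable to be on your `PATH`; the snippet itself only builds and prints the command.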

In [10]:
%%bash
MEDS_transform-pipeline pkg://MEDS_extract.configs._extract.yaml --overrides input_dir=raw_data output_dir=MEDS_output

Traceback (most recent call last):
  File "/home/mmd/mambaforge/envs/MEDS_extract/bin/MEDS_transform-pipeline", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mmd/mambaforge/envs/MEDS_extract/lib/python3.12/site-packages/MEDS_transforms/runner.py", line 345, in main
    run_stage(
  File "/home/mmd/mambaforge/envs/MEDS_extract/lib/python3.12/site-packages/MEDS_transforms/runner.py", line 273, in run_stage
    raise ValueError(
ValueError: Stage shard_events failed via MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml shard_events stage=shard_events input_dir=raw_data output_dir=MEDS_output with return code 1.


CalledProcessError: Command 'b'MEDS_transform-pipeline pkg://MEDS_extract.configs._extract.yaml --overrides input_dir=raw_data output_dir=MEDS_output\n'' returned non-zero exit status 1.
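
The traceback reports only that the `shard_events` stage exited with return code 1; it does not show the underlying cause. One check that may be worth making is whether every column referenced with `col(...)` in `event_cfg.yaml` exists in the corresponding CSV file. The sketch below does this using the configuration and the column headers transcribed from the outputs above. The helpers `referenced_columns` and `missing_columns` are hypothetical, and whether the mismatch they report actually caused this failure is not confirmed by the output.

```python
import re

COL_REF = re.compile(r"col\(([^)]+)\)")
FMT = "%Y-%m-%dT%H:%M:%S"

# Transcribed from event_cfg.yaml (the top-level subject_id_col key is passed separately).
EVENT_CFG = {
    "patients": {
        "eye_color": {"code": ["EYE_COLOR", "col(eye_color)"], "time": None},
        "hair_color": {"code": ["HAIR_COLOR", "col(hair_color)"], "time": None},
        "dob": {"code": "MEDS_BIRTH", "time": "col(birth_datetime)", "time_format": FMT},
        "dod": {"code": "MEDS_DEATH", "time": "col(death_datetime)", "time_format": FMT},
    },
    "labs_vitals": {
        "lab": {
            "code": "col(test_name)",
            "time": "col(timestamp)",
            "time_format": FMT,
            "numeric_value": "col(result)",
        },
    },
    "medications": {
        "med": {
            "code": ["col(medication_name)", "col(dose)"],
            "time": "col(timestamp)",
            "time_format": FMT,
            "numeric_value": "col(dose)",
        },
    },
    "diagnoses": {
        "dx": {"code": "col(icd10_code)", "time": "col(timestamp)", "time_format": FMT},
    },
}

# Column headers as shown in the CSV previews above.
HEADERS = {
    "diagnoses": ["patient_id", "icd10_code"],
    "labs_vitals": ["test_name", "patient_id", "timestamp", "result"],
    "medications": ["medication_name", "dose", "patient_id", "timestamp"],
    "patients": ["patient_id", "eye_color", "hair_color", "birth_datetime", "death_datetime"],
}


def referenced_columns(value):
    """Recursively collect every column name that appears as col(name)."""
    if isinstance(value, str):
        return set(COL_REF.findall(value))
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, list):
        cols = set()
        for item in value:
            cols |= referenced_columns(item)
        return cols
    return set()  # None and other scalars reference no columns


def missing_columns(event_cfg, headers, subject_id_col):
    """Map each table to the referenced columns that its CSV header lacks."""
    missing = {}
    for table, events in event_cfg.items():
        needed = referenced_columns(events) | {subject_id_col}
        absent = sorted(needed - set(headers[table]))
        if absent:
            missing[table] = absent
    return missing


result = missing_columns(EVENT_CFG, HEADERS, "patient_id")
print(result)
```

In the notebook itself, the configuration could instead be loaded with `yaml.safe_load` (assuming PyYAML is installed) and the headers taken from `dfs[fn].columns`. On the data shown here, the check flags `timestamp` for `diagnoses`: the `dx` block uses `time: col(timestamp)`, but the `diagnoses.csv` preview contains only `patient_id` and `icd10_code`.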