# Loading MIMIC-IV data for downstream tasks and analysis in biomedical informatics, clinical research, and health AI

Steps:

1. Download https://physionet.org/content/mimiciii/1.4/ and https://physionet.org/content/mimiciv/2.2/. This requires creating a PhysioNet account, completing the required training, and signing the data use agreement (which includes provisions such as `The LICENSEE agrees to contribute code associated with publications arising from this data to a repository that is open to the research community.`)

2. Parse the header of each file and use the header information as context for a large language model to write the SQL scripts in DuckDB dialect. 
3. Make another LLM call and use the SQL query prototyped in this notebook as the basis for a dbt model, one per file.  
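
Step 3 could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `dbt_model_sql`, the `materialized='table'` setting, and the use of `read_csv_auto` in the model body are all assumptions. It only builds the text of one model per CSV file and does not write anything to disk.

```python
from pathlib import PurePosixPath


def dbt_model_sql(csv_path: str) -> tuple[str, str]:
    """Return (model_file_name, model_sql) for one compressed CSV file.

    Hypothetical helper: one dbt model per source file, named after the file.
    """
    # 'hosp/admissions.csv.gz' -> 'admissions'
    table = PurePosixPath(csv_path).name.split(".")[0]
    sql = (
        "{{ config(materialized='table') }}\n\n"  # assumed materialization
        "select *\n"
        f"from read_csv_auto('{csv_path}')\n"
    )
    return f"{table}.sql", sql


name, sql = dbt_model_sql("hosp/admissions.csv.gz")
print(name)
print(sql)
```

In the folder layout shown later in this notebook, such files would presumably live under `data_processing/models/`.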

## Extract downloaded archives

In [5]:
!ls ~/data/mimic/mimic-iv-2.2/

CHANGELOG.txt   LICENSE.txt     SHA256SUMS.txt  hosp/           icu/
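
Since the download ships a `SHA256SUMS.txt`, the archives can be checked before any loading. The sketch below assumes the file uses the usual `sha256sum` layout (one `<hex digest>  <relative path>` pair per line); that layout and the helper names are assumptions, not something confirmed by this notebook.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte archives are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(root: Path, sums_file: str = "SHA256SUMS.txt") -> dict[str, bool]:
    """Map each listed relative path to True if the file exists and its digest matches."""
    results = {}
    for line in (root / sums_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*")  # sha256sum marks binary-mode entries with '*'
        target = root / rel_path
        results[rel_path] = target.exists() and sha256_of(target) == expected
    return results
```

A call such as `verify_checksums(Path("~/data/mimic/mimic-iv-2.2").expanduser())` would then return one boolean per listed file.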


In [6]:
%%capture folder_structure
!tree ~/data/mimic/mimic-iv-2.2/

## First load MIMIC-IV data

In [5]:
%%capture mimic_iv_structure
import seedir as sd
mimic_iv_path = '~/data/mimic/mimic-iv-2.2/'
sd.seedir(mimic_iv_path, style='lines', depthlimit=2, exclude_folders=['.git', '.ipynb_checkpoints'])

In [6]:
print(mimic_iv_structure.stdout)

mimic-iv-2.2/
├─diagnoses_icd.parquet
├─chartevents.parquet
├─ingredientevents.parquet
├─d_labitems.parquet
├─admissions.parquet
├─provider.parquet
├─transfers.parquet
├─icu_icustays.parquet
├─datetimeevents.parquet
├─patients.parquet
├─procedures_icd.parquet
├─omr.parquet
├─emar.parquet
├─services.parquet
├─d_hcpcs.parquet
├─microbiologyevents.parquet
├─emar_detail.parquet
├─hcpcsevents.parquet
├─d_items.parquet
├─labevents.parquet
├─hosp_poe.parquet
├─pharmacy.parquet
├─hosp/
│ ├─poe.csv.gz
│ ├─d_hcpcs.csv.gz
│ ├─poe_detail.csv.gz
│ ├─patients.csv.gz
│ ├─diagnoses_icd.csv.gz
│ ├─emar_detail.csv.gz
│ ├─provider.csv.gz
│ ├─prescriptions.csv.gz
│ ├─drgcodes.csv.gz
│ ├─d_icd_diagnoses.csv.gz
│ ├─d_labitems.csv.gz
│ ├─transfers.csv.gz
│ ├─admissions.csv.gz
│ ├─labevents.csv.gz
│ ├─pharmacy.csv.gz
│ ├─procedures_icd.csv.gz
│ ├─hcpcsevents.csv.gz
│ ├─services.csv.gz
│ ├─d_icd_procedures.csv.gz
│ ├─omr.csv.gz
│ ├─emar.csv.gz
│ └─microbiologyevents.csv.gz
├─inputevents.parquet
├─drgcodes.parq

In [10]:
%%capture code_folder_structure
sd.seedir('~/projects/electronic-health-records-analysis', style='lines', itemlimit=10, depthlimit=2, exclude_folders=['.git', '.ipynb_checkpoints'])

In [11]:
print(code_folder_structure.stdout)

electronic-health-records-analysis/
├─.DS_Store
├─LICENSE
├─requirements.txt
├─data_processing/
│ ├─snapshots/
│ ├─tests/
│ ├─models/
│ ├─README.md
│ ├─macros/
│ ├─.gitignore
│ ├─seeds/
│ ├─analyses/
│ └─dbt_project.yml
├─README.md
├─logs/
│ └─dbt.log
├─.gitignore
├─.venv/
│ ├─bin/
│ ├─include/
│ ├─etc/
│ ├─pyvenv.cfg
│ ├─lib/
│ └─share/
├─notebooks/
│ ├─loading_physionet_mimic_iv_data.ipynb
│ └─loading_physionet_mimic_iii_data.ipynb
└─requirements.in



## Get header of each file and use it as context for a large language model to write the SQL scripts in DuckDB dialect to load the MIMIC data into a database that can be saved to a parquet file

You may need to work through this getting-started notebook first: https://colab.research.google.com/github/jaanli/language-model-notebooks/blob/main/notebooks/getting-started.ipynb

In [1]:
# Load duckdb, which lets us efficiently load large files
import duckdb

# Load pandas, which lets us manipulate dataframes
import pandas as pd

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql

# Configure jupysql to return results directly as Pandas dataframes and to simplify the output printed to the notebook.
%config SqlMagic.autopandas = True

%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

# Allow named parameters (python variables) in SQL cells
%config SqlMagic.named_parameters="enabled"

# Connect jupysql to DuckDB using a SQLAlchemy-style connection string. Either connect to an in memory DuckDB, or a file backed db.
%sql duckdb:///:memory:


In [2]:
%load_ext jupyter_ai

In [3]:
%load_ext dotenv

In [4]:
%dotenv

In [5]:
url = "https://raw.githubusercontent.com/MIT-LCP/mimic-code/main/mimic-iv/buildmimic/mysql/load.sql"

In [6]:
%%capture path_to_mimic

!readlink -f ~/data/mimic/mimic-iv-2.2/

In [7]:
path_to_mimic.stdout

'/Users/me/data/mimic/mimic-iv-2.2\r\n'
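
The captured value above ends in `\r\n`, which would become part of any path built from it. A small sketch of two ways around that; the variable names are illustrative only.

```python
from pathlib import Path

# Trailing whitespace from captured shell output would corrupt any path built from it
captured = "/Users/me/data/mimic/mimic-iv-2.2\r\n"
clean = captured.strip()
print(repr(clean))

# Alternative without a shell: expand '~' and resolve symlinks in Python
mimic_root = Path("~/data/mimic/mimic-iv-2.2/").expanduser().resolve()
print(mimic_root)
```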

In [21]:
%%capture example_contents

!curl {url}

In [22]:
%%capture example_name

!basename {url}

In [23]:
print(example_name)
print(example_contents.stdout[:100])

load.sql

-- csv2mysql with arguments:
--   -o
--   1-load-no-keys.sql
--   -e
--   
--   -u
--   -z
--


In [24]:
!pwd

/Users/me/projects/electronic-health-records-analysis/notebooks


In [7]:
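# `!cd` runs in a subshell, so it does not change the notebook's working directory; the next cell uses os.chdir instead.
!cd {mimic_iv_path}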

In [8]:
import os
os.chdir(os.path.expanduser(mimic_iv_path))

In [9]:
!pwd

/Users/me/data/mimic/mimic-iv-2.2


In [43]:
%%capture row_level_protected_health_information
%%bash

for file in /Users/me/data/mimic/mimic-iv-2.2/hosp/*.csv.gz /Users/me/data/mimic/mimic-iv-2.2/icu/*.csv.gz; do
    echo "$file"
    echo "----------"
    duckdb -markdown -c "SELECT * FROM read_csv_auto('$file') LIMIT 10;"
    echo "=========="
done

In [44]:
headers = row_level_protected_health_information.stdout.split('\n')
for i, row in enumerate(headers):
    if i > 0 and '.csv.gz' in headers[i - 1]:
        print(headers[i-1])
        print(headers[i+1])

/Users/me/data/mimic/mimic-iv-2.2/hosp/admissions.csv.gz
| subject_id | hadm_id  |      admittime      |      dischtime      | deathtime |  admission_type   | admit_provider_id |   admission_location   | discharge_location | insurance | language | marital_status | race  |      edregtime      |      edouttime      | hospital_expire_flag |
/Users/me/data/mimic/mimic-iv-2.2/hosp/d_hcpcs.csv.gz
| code  | category | long_description | short_description  |
/Users/me/data/mimic/mimic-iv-2.2/hosp/d_icd_diagnoses.csv.gz
| icd_code | icd_version |              long_title               |
/Users/me/data/mimic/mimic-iv-2.2/hosp/d_icd_procedures.csv.gz
| icd_code | icd_version |                         long_title                          |
/Users/me/data/mimic/mimic-iv-2.2/hosp/d_labitems.csv.gz
| itemid |                label                | fluid | category  |
/Users/me/data/mimic/mimic-iv-2.2/hosp/diagnoses_icd.csv.gz
| subject_id | hadm_id  | seq_num | icd_code | icd_version |
/Users/me/data/mi
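
The index arithmetic in the cell above can also be wrapped in a function that returns only column names, which keeps row-level values out of anything built from the result. This is a sketch under the assumption that each file name is followed by a `----------` rule and then the markdown header row, as in the loop that produced the capture; `parse_headers` is a hypothetical helper.

```python
def parse_headers(captured: str) -> dict[str, list[str]]:
    """Map each file path to its column names, from `duckdb -markdown` output."""
    lines = captured.splitlines()
    headers = {}
    for i, line in enumerate(lines):
        # A file-name line is followed by a '----------' rule, then the header row
        if line.strip().endswith(".csv.gz") and i + 2 < len(lines):
            header_row = lines[i + 2]
            headers[line.strip()] = [
                cell.strip() for cell in header_row.strip().strip("|").split("|")
            ]
    return headers


sample = (
    "hosp/d_labitems.csv.gz\n"
    "----------\n"
    "| itemid |                label                | fluid | category  |\n"
    "=========="
)
print(parse_headers(sample))
```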

In [45]:
%%capture duckdb_docs_raw

!curl -s "https://duckdb.org/docs/data/csv/overview.html" | sed -e 's/<[^>]*>//g; /^$/d' | tr -s '\n'
!curl -s "https://duckdb.org/docs/sql/functions/dateformat.html" | sed -e 's/<[^>]*>//g; /^$/d' | tr -s '\n'

In [46]:
import re
duckdb_docs = re.escape(duckdb_docs_raw.stdout.replace('\t', '').replace('\n', '').replace(':', '').replace('{', '').replace('}', ''))
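
As an alternative to stripping tags with `sed` and then escaping the result, the page text could be extracted with the standard library's HTML parser. The sketch below is one possible approach, not what this notebook ran; `TextExtractor` and `html_to_text` are hypothetical names, and the sample page is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = "<html><style>p {color: red}</style><p>read_csv &amp; <b>auto_detect</b></p></html>"
print(html_to_text(page))
```

Character references such as `&amp;` are decoded by the parser, so no separate escaping step is needed before placing the text in a prompt.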

In [48]:
parent_prompt = f"""
Please find the context for this task here: 

```
{mimic_iv_structure.stdout}
```

This lists the directory contents using the seedir python package, of the directory at this path: `{mimic_iv_path}`.

Here are the headers of every file, derived from the command `for file in *.csv.gz; do echo "$file"; echo "----------"; duckdb -markdown -c "SELECT * FROM read_csv_auto('$file') LIMIT 10;"; echo "=========="; done`: 

```
{row_level_protected_health_information.stdout}
```

Additionally, here is an example file, {example_name} - in MYSQL, not DuckDB dialect - which contains a SQL transform to load the raw data from the files above into a database, using the following script:

```
{example_contents.stdout}
```

Remember:

We are running this notebook from the same directory as that shown above, from within the `notebooks` directory in Jupyter Lab. All of your code should be prefixed with the `%%sql` magic (the duckdb and jupysql libraries are already loaded!) so that it can be run after this first execution and code writing stage :) 

Also remember that this is the code folder structure: 

{code_folder_structure.stdout}

Also remember that these are the latest DuckDB documentation for the read_csv function that you must use: 

```
{duckdb_docs}
```

For each set of MIMIC files (and versions), get header of each file and use it as context for a large language model to write the SQL scripts in DuckDB dialect to load the MIMIC data into a database that can be saved to a parquet file.

Proceed step-by-step, focused on not changing the underlying names and original data dictionary at all. Make sure to cross-link data appropriately using the previously given file as reference.

Remember to always use the `read_csv` functions wrapped around the file name, like so: `SELECT * FROM read_csv('filename.csv.gz')` as but one example. 

Copy the resulting database into a parquet file compressed with ZSTD compression.

Please always remember to prefix the output with `%%sql` for the JupySQL cell magic :)
"""

In [49]:
print(parent_prompt[:1000])


Please find the context for this task here: 

```
mimic-iv-2.2/
├─hosp/
│ ├─poe.csv.gz
│ ├─d_hcpcs.csv.gz
│ ├─poe_detail.csv.gz
│ ├─patients.csv.gz
│ ├─diagnoses_icd.csv.gz
│ ├─emar_detail.csv.gz
│ ├─provider.csv.gz
│ ├─prescriptions.csv.gz
│ ├─drgcodes.csv.gz
│ ├─d_icd_diagnoses.csv.gz
│ ├─d_labitems.csv.gz
│ ├─transfers.csv.gz
│ ├─admissions.csv.gz
│ ├─labevents.csv.gz
│ ├─pharmacy.csv.gz
│ ├─procedures_icd.csv.gz
│ ├─hcpcsevents.csv.gz
│ ├─services.csv.gz
│ ├─d_icd_procedures.csv.gz
│ ├─omr.csv.gz
│ ├─emar.csv.gz
│ └─microbiologyevents.csv.gz
├─CHANGELOG.txt
├─LICENSE.txt
├─SHA256SUMS.txt
└─icu/
  ├─datetimeevents.csv.gz
  ├─caregiver.csv.gz
  ├─ingredientevents.csv.gz
  ├─inputevents.csv.gz
  ├─procedureevents.csv.gz
  ├─d_items.csv.gz
  ├─chartevents.csv.gz
  ├─icustays.csv.gz
  └─outputevents.csv.gz

```

This lists the directory contents using the seedir python package, of the directory at this path: `~/data/mimic/mimic-iv-2.2/`.

Here are the headers of every file, derived fro

In [52]:
%%capture directory
!ls  -lh hosp/* icu/*

In [71]:
print(directory)

-rw-r--r--@ 1 me  staff    15M Jan  5  2023 hosp/admissions.csv.gz
-rw-r--r--@ 1 me  staff   417K Jan  5  2023 hosp/d_hcpcs.csv.gz
-rw-r--r--@ 1 me  staff   839K Jan  5  2023 hosp/d_icd_diagnoses.csv.gz
-rw-r--r--@ 1 me  staff   565K Jan  5  2023 hosp/d_icd_procedures.csv.gz
-rw-r--r--@ 1 me  staff    13K Jan  5  2023 hosp/d_labitems.csv.gz
-rw-r--r--@ 1 me  staff    24M Jan  5  2023 hosp/diagnoses_icd.csv.gz
-rw-r--r--@ 1 me  staff   7.1M Jan  5  2023 hosp/drgcodes.csv.gz
-rw-r--r--@ 1 me  staff   485M Jan  5  2023 hosp/emar.csv.gz
-rw-r--r--@ 1 me  staff   449M Jan  5  2023 hosp/emar_detail.csv.gz
-rw-r--r--@ 1 me  staff   1.7M Jan  5  2023 hosp/hcpcsevents.csv.gz
-rw-r--r--@ 1 me  staff   1.8G Jan  5  2023 hosp/labevents.csv.gz
-rw-r--r--@ 1 me  staff    92M Jan  5  2023 hosp/microbiologyevents.csv.gz
-rw-r--r--@ 1 me  staff    34M Jan  5  2023 hosp/omr.csv.gz
-rw-r--r--@ 1 me  staff   2.2M Jan  5  2023 hosp/patients.csv.gz
-rw-r--r--@ 1 me  staff   380M Jan  5  2023 hosp/pharmacy.c

## Blocker: need to copy and paste file names into the `%%ai` cell magic enabled by the `jupyter-ai` python package in order to execute the prompt separately for every file in the current working directory
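
One way around the copy-and-paste step is to build the per-file prompts in plain Python from the captured `ls -lh` listing. The sketch below only constructs the prompt strings; how each one is sent to a model is left out because it depends on the client in use. `per_file_prompts` is a hypothetical helper, and it assumes file paths contain no spaces.

```python
FENCE = "`" * 3  # built programmatically so this example nests cleanly in markdown


def per_file_prompts(parent_prompt: str, listing: str) -> dict[str, str]:
    """Return {file path: prompt}, one entry per `ls -lh` line naming a .csv.gz file."""
    prompts = {}
    for line in listing.splitlines():
        line = line.strip()
        if not line.endswith(".csv.gz"):
            continue
        path = line.split()[-1]  # the path is the last whitespace-separated field
        prompts[path] = (
            f"{parent_prompt}\n\nOnly do this for this file:\n\n{FENCE}\n{line}\n{FENCE}\n"
        )
    return prompts


listing = (
    "-rw-r--r--@ 1 me  staff    15M Jan  5  2023 hosp/admissions.csv.gz\n"
    "-rw-r--r--@ 1 me  staff   417K Jan  5  2023 hosp/d_hcpcs.csv.gz\n"
)
prompts = per_file_prompts("<parent prompt goes here>", listing)
print(list(prompts))
```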

In [54]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff    15M Jan  5  2023 hosp/admissions.csv.gz
```

In [55]:
%%sql

CREATE TABLE admissions AS 
SELECT * FROM read_csv('hosp/admissions.csv.gz', header=True, compression='gzip');

COPY admissions TO 'admissions.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [56]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   417K Jan  5  2023 hosp/d_hcpcs.csv.gz
```

In [57]:
%%sql
CREATE TABLE d_hcpcs AS 
SELECT * FROM read_csv('hosp/d_hcpcs.csv.gz');

COPY d_hcpcs TO 'hosp/d_hcpcs.parquet' (FORMAT 'parquet', CODEC 'ZSTD');

Unnamed: 0,Success


In [58]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   839K Jan  5  2023 hosp/d_icd_diagnoses.csv.gz
```

In [59]:
%%sql
COPY (
  SELECT
    icd_code,
    icd_version,
    long_title
  FROM read_csv_auto('hosp/d_icd_diagnoses.csv.gz', header=True, all_varchar=True)
) TO 'hosp/d_icd_diagnoses.parquet' (FORMAT 'parquet', CODEC 'ZSTD');

Unnamed: 0,Success


In [60]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   565K Jan  5  2023 hosp/d_icd_procedures.csv.gz
```

In [61]:
%%sql

CREATE OR REPLACE TABLE d_icd_procedures AS 
SELECT * 
FROM read_csv_auto('hosp/d_icd_procedures.csv.gz', header=True, sep=',');

COPY d_icd_procedures TO 'hosp/d_icd_procedures.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [62]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff    13K Jan  5  2023 hosp/d_labitems.csv.gz
```

In [67]:
%%sql

CREATE TABLE d_labitems AS 
SELECT * 
FROM read_csv('hosp/d_labitems.csv.gz', 
    columns={itemid: 'INTEGER', label: 'VARCHAR', fluid: 'VARCHAR', category: 'VARCHAR'});

COPY d_labitems TO 'hosp/d_labitems.parquet' (FORMAT 'parquet', CODEC 'ZSTD');

Unnamed: 0,Success


In [68]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff    24M Jan  5  2023 hosp/diagnoses_icd.csv.gz
```

In [69]:
%%sql

CREATE TABLE diagnoses_icd AS 
SELECT * FROM read_csv_auto('hosp/diagnoses_icd.csv.gz', header=True, sep=',', columns={'subject_id': 'INT', 'hadm_id': 'INT', 'seq_num': 'INT', 'icd_code': 'VARCHAR', 'icd_version': 'INT'});

COPY diagnoses_icd TO 'diagnoses_icd.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [70]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   7.1M Jan  5  2023 hosp/drgcodes.csv.gz
```

In [72]:
%%sql

CREATE TABLE drgcodes AS
SELECT * FROM read_csv('hosp/drgcodes.csv.gz', header=True, compression='gzip');

COPY drgcodes TO 'drgcodes.parquet' (FORMAT 'parquet', CODEC 'ZSTD');

Unnamed: 0,Success


In [74]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   485M Jan  5  2023 hosp/emar.csv.gz
```

In [75]:
%%sql

CREATE TABLE emar AS 
SELECT *
FROM read_csv_auto('hosp/emar.csv.gz', header=True, all_varchar=True);

COPY (
    SELECT * 
    FROM emar
) TO 'emar.parquet' (FORMAT 'parquet', CODEC 'ZSTD');

Unnamed: 0,Success


In [76]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   449M Jan  5  2023 hosp/emar_detail.csv.gz
```

In [78]:
%%sql

CREATE TABLE emar_detail AS 
SELECT * 
FROM read_csv_auto('hosp/emar_detail.csv.gz', all_varchar=True);

COPY emar_detail TO 'emar_detail.parquet' (FORMAT 'parquet', CODEC 'ZSTD');

Unnamed: 0,Success


In [80]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   1.7M Jan  5  2023 hosp/hcpcsevents.csv.gz
```

In [81]:
%%sql

CREATE TABLE hcpcsevents AS 
SELECT * FROM read_csv('hosp/hcpcsevents.csv.gz', header=True, compression='gzip');

COPY hcpcsevents TO 'hcpcsevents.parquet' (FORMAT 'parquet', CODEC 'ZSTD');

Unnamed: 0,Success


In [82]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   1.8G Jan  5  2023 hosp/labevents.csv.gz
```

In [83]:
%%sql

CREATE TABLE labevents AS 
SELECT *
FROM read_csv_auto('hosp/labevents.csv.gz', ALL_VARCHAR=0, HEADER=1, SEP=',', FILENAME=1);

COPY labevents TO 'labevents.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [85]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff    92M Jan  5  2023 hosp/microbiologyevents.csv.gz
```

In [87]:
%%sql

CREATE TABLE microbiologyevents AS 
SELECT * 
FROM read_csv_auto('hosp/microbiologyevents.csv.gz');

COPY microbiologyevents TO 'microbiologyevents.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [88]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff    34M Jan  5  2023 hosp/omr.csv.gz
```

In [90]:
%%sql

CREATE TABLE omr AS
SELECT * 
FROM read_csv_auto('hosp/omr.csv.gz');

COPY omr TO 'omr.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [91]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   2.2M Jan  5  2023 hosp/patients.csv.gz
```

In [92]:
%%sql
CREATE TABLE patients AS
SELECT *
FROM read_csv_auto('hosp/patients.csv.gz');

COPY patients TO 'patients.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [93]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   380M Jan  5  2023 hosp/pharmacy.csv.gz
```

In [95]:
%%sql

CREATE TABLE pharmacy AS 
SELECT *
FROM read_csv_auto('hosp/pharmacy.csv.gz', all_varchar=True);

COPY pharmacy TO 'pharmacy.parquet' (FORMAT PARQUET, CODEC 'ZSTD');

Unnamed: 0,Success


In [96]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   475M Jan  5  2023 hosp/poe.csv.gz
```

In [97]:
%%sql
CREATE TABLE hosp_poe AS 
SELECT * FROM read_csv_auto('hosp/poe.csv.gz');

COPY hosp_poe TO 'hosp_poe.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [98]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff    24M Jan  5  2023 hosp/poe_detail.csv.gz
```

In [101]:
%%sql

CREATE TABLE poe_detail AS 
SELECT * FROM read_csv(
  'hosp/poe_detail.csv.gz',
  delim=',',
  quote='"',
  escape='"',
  header=True,
  all_varchar=False,
  parallel=True,
  compression='gzip'
);

COPY poe_detail TO 'poe_detail.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [103]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   438M Jan  5  2023 hosp/prescriptions.csv.gz
```

In [107]:
%%sql

CREATE TABLE prescriptions AS 
SELECT *
FROM read_csv_auto('hosp/prescriptions.csv.gz');

COPY prescriptions TO 'prescriptions.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [113]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   5.7M Jan  5  2023 hosp/procedures_icd.csv.gz
```

In [111]:
%%sql

CREATE TABLE procedures_icd AS 
SELECT *
FROM read_csv_auto('hosp/procedures_icd.csv.gz');

COPY procedures_icd TO 'procedures_icd.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [112]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   120K Jan  5  2023 hosp/provider.csv.gz
```

In [114]:
%%sql

CREATE TABLE provider AS
SELECT * 
FROM read_csv_auto('hosp/provider.csv.gz', header=True, compression='gzip');

COPY (SELECT * FROM provider) TO 'provider.parquet' (FORMAT 'parquet', CODEC 'ZSTD');

Unnamed: 0,Success


In [115]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   6.5M Jan  5  2023 hosp/services.csv.gz
```

In [116]:
%%sql

CREATE TABLE services AS 
SELECT *
FROM read_csv_auto('hosp/services.csv.gz');

COPY services TO 'services.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [117]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff    34M Jan  5  2023 hosp/transfers.csv.gz
```

In [119]:
%%sql

CREATE TABLE transfers AS
SELECT * 
FROM read_csv_auto('hosp/transfers.csv.gz');

COPY (
    SELECT *
    FROM transfers
) TO 'transfers.parquet' (FORMAT 'parquet', CODEC 'ZSTD');

Unnamed: 0,Success


In [120]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff    35K Jan  5  2023 icu/caregiver.csv.gz
```

In [121]:
%%sql

CREATE TABLE caregiver AS 
SELECT * 
FROM read_csv('icu/caregiver.csv.gz', header=True, compression='gzip');

COPY caregiver TO 'caregiver.parquet' (FORMAT 'parquet', CODEC 'ZSTD');

Unnamed: 0,Success


In [122]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   2.3G Jan  5  2023 icu/chartevents.csv.gz
```

In [127]:
%%sql
CREATE TABLE chartevents AS 
SELECT *
FROM read_csv_auto('icu/chartevents.csv.gz', 
    header=True,
    compression='gzip',
    dateformat='%Y-%m-%d %H:%M:%S',
    timestampformat='%Y-%m-%d %H:%M:%S'
);

COPY chartevents TO 'chartevents.parquet' (FORMAT 'parquet', CODEC 'ZSTD');

Unnamed: 0,Success


In [128]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff    56K Jan  5  2023 icu/d_items.csv.gz
```

In [129]:
%%sql

CREATE TABLE d_items AS 
SELECT * 
FROM read_csv_auto('icu/d_items.csv.gz');

COPY d_items TO 'icu/d_items.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [130]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff    44M Jan  5  2023 icu/datetimeevents.csv.gz
```

In [131]:
%%sql

CREATE OR REPLACE TABLE datetimeevents AS 
SELECT * 
FROM read_csv('icu/datetimeevents.csv.gz', header=True, compression='gzip');

COPY (
    SELECT *
    FROM datetimeevents
) TO 'datetimeevents.parquet' (FORMAT 'parquet', CODEC 'ZSTD');

Unnamed: 0,Success


In [132]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   2.5M Jan  5  2023 icu/icustays.csv.gz
```

In [134]:
%%sql

CREATE TABLE icu_icustays AS 
SELECT * 
FROM read_csv('icu/icustays.csv.gz', header=True, compression='gzip');

COPY icu_icustays TO 'icu_icustays.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [135]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   240M Jan  5  2023 icu/ingredientevents.csv.gz
```

In [137]:
%%sql

CREATE TABLE ingredientevents AS 
SELECT * 
FROM read_csv('icu/ingredientevents.csv.gz', 
    header=True,
    columns={
        'subject_id': 'INT',
        'hadm_id': 'INT',
        'stay_id': 'INT',
        'caregiver_id': 'INT',
        'starttime': 'TIMESTAMP',
        'endtime': 'TIMESTAMP',
        'storetime': 'TIMESTAMP',
        'itemid': 'INT',
        'amount': 'DOUBLE',
        'amountuom': 'VARCHAR',
        'rate': 'DOUBLE',
        'rateuom': 'VARCHAR',
        'orderid': 'INT',
        'linkorderid': 'INT',
        'statusdescription': 'VARCHAR',
        'originalamount': 'DOUBLE',
        'originalrate': 'DOUBLE'
    },
    auto_detect=True,
    dateformat = '%Y-%m-%d %H:%M:%S'
);

COPY (
    SELECT *
    FROM ingredientevents
)
TO 'ingredientevents.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');

Unnamed: 0,Success


In [139]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff   309M Jan  5  2023 icu/inputevents.csv.gz
```

In [140]:
%%sql

CREATE TABLE inputevents AS 
SELECT * FROM read_csv_auto('icu/inputevents.csv.gz', header=True, compression='gzip');

COPY inputevents TO 'inputevents.parquet' (FORMAT PARQUET, CODEC 'ZSTD');

Unnamed: 0,Success


In [141]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff    37M Jan  5  2023 icu/outputevents.csv.gz
```

In [142]:
%%sql

CREATE TABLE outputevents AS 
SELECT * FROM read_csv_auto('icu/outputevents.csv.gz');

COPY outputevents TO 'outputevents.parquet' (FORMAT PARQUET, CODEC ZSTD);

Unnamed: 0,Success


In [143]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Only do this for this file:

```
-rw-r--r--@ 1 me  staff    20M Jan  5  2023 icu/procedureevents.csv.gz
```

In [144]:
%%sql

CREATE TABLE procedureevents AS
SELECT *
FROM read_csv_auto('icu/procedureevents.csv.gz');

COPY procedureevents TO 'procedureevents.parquet' (FORMAT 'parquet', CODEC 'ZSTD');

Unnamed: 0,Success
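
The load cells above all follow the same two-statement pattern: create a table from the CSV, then copy it to a ZSTD-compressed parquet file. A sketch of generating those statements per file is shown below. `load_statements` is a hypothetical helper; it names every table after its file, so it would produce `poe` and `icustays` rather than the `hosp_poe` and `icu_icustays` names used above, and it always relies on `read_csv_auto` type detection rather than the explicit column types some cells specify.

```python
from pathlib import PurePosixPath


def load_statements(csv_path: str, out_dir: str = ".") -> str:
    """Return the CREATE TABLE and COPY statements for one .csv.gz file."""
    table = PurePosixPath(csv_path).name.split(".")[0]
    parquet = str(PurePosixPath(out_dir) / f"{table}.parquet")
    return (
        f"CREATE OR REPLACE TABLE {table} AS\n"
        f"SELECT * FROM read_csv_auto('{csv_path}');\n\n"
        f"COPY {table} TO '{parquet}' (FORMAT PARQUET, CODEC ZSTD);"
    )


for path in ["hosp/services.csv.gz", "icu/outputevents.csv.gz"]:
    print(load_statements(path))
    print()
```

Each returned string could then be executed with `duckdb.sql(...)` or pasted into a `%%sql` cell.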


## Generate the schema based on the DuckDB database files

In [145]:
%%capture filenames
!ls -lh *.parquet

In [150]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

{parent_prompt}

Generate the schema using bash one liners in duckdb for every file here:
```
{filenames}
```

Next we will figure out how to link these files with join statements.

Update: Instead of `%%sql`, the prefix for this one cell containing the bash one liner will simply be `!` to indicate shell command.

In [167]:
%%capture mimic_iv_schema
%%bash
for file in *.parquet; do 
    echo "$file"
    echo "----------"
    duckdb -c "DESCRIBE SELECT * FROM parquet_scan('$file')"
    echo "=========="
done

In [168]:
print(mimic_iv_schema.stdout)

admissions.parquet
----------
┌──────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│     column_name      │ column_type │  null   │   key   │ default │  extra  │
│       varchar        │   varchar   │ varchar │ varchar │ varchar │ varchar │
├──────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ subject_id           │ BIGINT      │ YES     │         │         │         │
│ hadm_id              │ BIGINT      │ YES     │         │         │         │
│ admittime            │ TIMESTAMP   │ YES     │         │         │         │
│ dischtime            │ TIMESTAMP   │ YES     │         │         │         │
│ deathtime            │ TIMESTAMP   │ YES     │         │         │         │
│ admission_type       │ VARCHAR     │ YES     │         │         │         │
│ admit_provider_id    │ VARCHAR     │ YES     │         │         │         │
│ admission_location   │ VARCHAR     │ YES     │         │         │         │
│ discharge_location  
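
Once the column names per table are known, candidate join keys can be found by listing the columns that occur in more than one table. The sketch below uses a small hand-written subset of the headers shown earlier; `shared_columns` is a hypothetical helper, and a shared name is only a hint that two tables link, not proof.

```python
from collections import defaultdict


def shared_columns(schemas: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return columns that appear in more than one table, with the tables that contain them."""
    tables_by_column = defaultdict(list)
    for table, columns in schemas.items():
        for column in columns:
            tables_by_column[column].append(table)
    return {c: sorted(t) for c, t in tables_by_column.items() if len(t) > 1}


schemas = {
    "diagnoses_icd": ["subject_id", "hadm_id", "seq_num", "icd_code", "icd_version"],
    "d_icd_diagnoses": ["icd_code", "icd_version", "long_title"],
    "d_labitems": ["itemid", "label", "fluid", "category"],
}
print(shared_columns(schemas))
```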

## Generate visualizations of the MIMIC-IV database / clinical data repository using Python and Altair

In [169]:
%%ai anthropic-chat:claude-3-opus-20240229 --format code

Given the schema in these files: 

```
{mimic_iv_schema.stdout}
```

Proceed to create simple interactive figures using the vega-altair library in python. Use best practices for the visual display of information, such as those from Tufte.

Don't forget to load the parquet file first to ensure the python variables are available for altair :) and proceed step-by-step.

Your work output will be used for downstream analytics such as cohort selection and clinical trial patient population matching and a *variety* of other use cases in biomedical informatics and life sciences.

Also remember to avoid this error: 

```
MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000).

Try enabling the VegaFusion data transformer which raises this limit by pre-evaluating data
transformations in Python.
    >> import altair as alt
    >> alt.data_transformers.enable("vegafusion")
```

Only use the following variables for the d_icd_procedures.parquet and d_icd_diagnoses.parquet tables:

```
long_title
icd_code
```

Link the `long_title` using the `icd_code` column in these: procedures_icd.parquet and diagnoses_icd.parquet:

```
hadm_id
icd_code
```

For the icu_icustays.parquet, only use the following columns:

```
hadm_id
los
```

(For the hadm_id, or hospital admission identifier!)

Then create a histogram, with human-readable descriptions in the axis labels.

Bucket the length of stay into whole days; a histogram or another standard business-intelligence chart will suffice.

Also use any other tables you might need to link the data in MIMIC-IV :)

In [10]:
import altair as alt
import duckdb

# Lift Altair's 5,000-row limit (requires the vegafusion package)
alt.data_transformers.enable("vegafusion")

# Expose each parquet file as a DuckDB view named after the file
parquet_files = [
    'procedures_icd.parquet',
    'diagnoses_icd.parquet',
    'd_icd_procedures.parquet',
    'd_icd_diagnoses.parquet',
    'icu_icustays.parquet'
]

for pfile in parquet_files:
    view_name = pfile.split('/')[-1].split('.')[0]
    duckdb.sql(f"CREATE OR REPLACE VIEW {view_name} AS SELECT * FROM parquet_scan('{pfile}')")

# Link each ICU stay's admission to its procedure and diagnosis descriptions.
# ICD codes are only unique within an ICD version, so match on icd_version too
# (cast to INTEGER because some tables were loaded with all_varchar=True).
# The result has one row per (stay, procedure, diagnosis) combination, so it
# must not be used to count stays. It stays a lazy relation; call .df() to materialize it.
codes = duckdb.sql("""
SELECT 
    i.hadm_id,
    dp.icd_code AS procedure_icd_code,
    dp.long_title AS procedure_description,
    dd.icd_code AS diagnosis_icd_code,  
    dd.long_title AS diagnosis_description
FROM icu_icustays i
LEFT JOIN procedures_icd p ON i.hadm_id = p.hadm_id
LEFT JOIN d_icd_procedures dp
    ON p.icd_code = dp.icd_code
    AND CAST(p.icd_version AS INTEGER) = CAST(dp.icd_version AS INTEGER)
LEFT JOIN diagnoses_icd d ON i.hadm_id = d.hadm_id
LEFT JOIN d_icd_diagnoses dd
    ON d.icd_code = dd.icd_code
    AND CAST(d.icd_version AS INTEGER) = CAST(dd.icd_version AS INTEGER)
""")

# One row per ICU stay, so the histogram counts each stay exactly once
df = duckdb.sql("SELECT hadm_id, ROUND(los) AS los_days FROM icu_icustays").df()

# Create the histogram
histogram = alt.Chart(df).mark_bar().encode(
    alt.X("los_days:Q", bin=True, title="Length of Stay (days)"),
    alt.Y("count()", title="Number of ICU stays"),
).properties(
    title='Distribution of ICU Length of Stay',
    width=600,
    height=400
)

histogram
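
The query rounds `los` to the nearest day, whereas the prompt asked for entire days. The two choices put stays into different buckets, as the small illustration below shows. The values are made up for the example and are not MIMIC data.

```python
import math
from collections import Counter

# Hypothetical fractional lengths of stay in days (illustrative values, not MIMIC data)
los_values = [0.4, 0.6, 1.2, 1.7, 1.9, 2.6, 3.1]

# Completed whole days: every stay shorter than one day lands in bucket 0
whole_days = Counter(math.floor(v) for v in los_values)

# Nearest day: stays of 0.5 days or more move up into the next bucket
rounded_days = Counter(round(v) for v in los_values)

print(sorted(whole_days.items()))
print(sorted(rounded_days.items()))
```

In SQL, the whole-day variant would correspond to using `FLOOR` in place of `ROUND`.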