# Perception Agent: Extracts and normalizes patient data (structured + unstructured)
## Purpose
This notebook initializes all external and custom utilities required for the **Perception Agent**, responsible for extracting, normalizing, and preparing patient datasets. The imported modules support data reading (including compressed `.gz` files), progress tracking, serialization, and structured data handling. Together, these tools enable efficient loading, transformation, and saving of patient data into a clean training and testing dataset in `.csv` format.
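
As a rough illustration of reading a compressed file, the sketch below previews the first rows of a `.gz` CSV without writing a decompressed copy to disk. It uses only the standard library; the helper name `peek_gz_csv` is hypothetical and is not part of `utils.py`. (`pandas.read_csv` can also read `.gz` files directly, since it infers compression from the file extension.)

```python
import csv
import gzip
from itertools import islice


def peek_gz_csv(path, n=5):
    """Return the first `n` rows of a gzip-compressed CSV as a list of dicts."""
    # "rt" opens the archive in text mode; newline="" lets the csv module handle line endings.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        return list(islice(csv.DictReader(fh), n))
```

Assuming the demo archive has been downloaded to the path used later in this notebook, a call such as `peek_gz_csv("./data/mimic-iv-clinical-database-demo-2.2/hosp/patients.csv.gz")` would return the header-keyed rows for a quick sanity check.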

## Inputs
- Modules enable access to:
  - Raw patient data files (`.csv`, `.gz`).
  - The main dataset for training and validation is the [MIMIC-IV Clinical Database Demo](https://physionet.org/content/mimic-iv-demo/2.2/), an open-access dataset. MIMIC-IV comprises deidentified electronic health records for patients admitted to the Beth Israel Deaconess Medical Center. Access to the full MIMIC-IV database is limited to credentialed users; the openly available demo used here contains a subset of 100 patients.
  - Utility functions defined in `utils.py` for data extraction, normalization, or cleaning.
- Assumes that:
  - The working directory contains a `utils.py` file.
  - Required dependencies (`tqdm`, `dill`, `jsonlines`, etc.) are installed.
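
The assumptions above can be checked before running the import cell. The following is a minimal sketch and not part of the original notebook: the helper name `check_environment` is hypothetical, and the module list should be adjusted to whatever `utils.py` actually imports.

```python
import importlib.util
import os


def check_environment(modules, workdir="."):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # utils.py is expected to sit in the working directory (assumption stated above).
    if not os.path.isfile(os.path.join(workdir, "utils.py")):
        problems.append("utils.py not found in " + os.path.abspath(workdir))
    # find_spec returns None when a top-level module cannot be located.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append("missing dependency: " + name)
    return problems


for problem in check_environment(["tqdm", "dill", "jsonlines", "numpy", "pandas"]):
    print(problem)
```

If nothing is printed, the working directory and dependencies match the assumptions listed above.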

In [1]:
import gzip
import os
from tqdm import tqdm
# import dill
import random
import jsonlines
import numpy as np
import pandas as pd
from utils import *

In [2]:
# Delete previously extracted .csv files from both dataset folders
for data_path in ('./data/mimic-iv-clinical-database-demo-2.2/hosp',
                  './data/mimic-iv-clinical-database-demo-2.2/icu'):
    csv_list = [f for f in os.listdir(data_path) if f.endswith('.csv')]
    for csv_file in csv_list:
        csv_path = os.path.join(data_path, csv_file)
        try:
            os.remove(csv_path)
            print(f"Deleted file: {csv_file}")
        except Exception as e:
            print(f"Error deleting {csv_file}: {e}")

Deleted file: admissions.csv
Deleted file: diagnoses_icd.csv
Deleted file: drgcodes.csv
Deleted file: d_hcpcs.csv
Deleted file: d_icd_diagnoses.csv
Deleted file: d_icd_procedures.csv
Deleted file: d_labitems.csv
Deleted file: emar.csv
Deleted file: emar_detail.csv
Deleted file: hcpcsevents.csv
Deleted file: labevents.csv
Deleted file: microbiologyevents.csv
Deleted file: omr.csv
Deleted file: patients.csv
Deleted file: pharmacy.csv
Deleted file: poe.csv
Deleted file: poe_detail.csv
Deleted file: prescriptions.csv
Deleted file: procedures_icd.csv
Deleted file: provider.csv
Deleted file: services.csv
Deleted file: transfers.csv
Deleted file: caregiver.csv
Deleted file: chartevents.csv
Deleted file: datetimeevents.csv
Deleted file: d_items.csv
Deleted file: icustays.csv
Deleted file: ingredientevents.csv
Deleted file: inputevents.csv
Deleted file: outputevents.csv
Deleted file: procedureevents.csv


In [3]:
data_path = "./data/mimic-iv-clinical-database-demo-2.2/hosp"

In [4]:
# Keep only the .gz archives (build a new list instead of removing items while iterating)
zip_list = [fn for fn in os.listdir(data_path) if fn.endswith('.gz')]
len(zip_list)

22

In [5]:
for zip_file in tqdm(zip_list):
    try:
        with gzip.open(os.path.join(data_path, zip_file), 'rb') as zf:
            data = zf.read()
        # e.g. admissions.csv.gz -> admissions.csv
        with open(os.path.join(data_path, zip_file.split('.')[0] + '.csv'), 'wb') as f:
            f.write(data)
    except Exception as e:
        print(f'File error: {zip_file} - {e}')

100%|██████████| 22/22 [00:00<00:00, 89.55it/s]


In [6]:
data_path = "./data/mimic-iv-clinical-database-demo-2.2/icu"

In [7]:
zip_list = [f for f in os.listdir(data_path) if f.endswith('.gz')]

for zip_file in tqdm(zip_list):
    try:
        with gzip.open(os.path.join(data_path, zip_file), 'rb') as zf:
            data = zf.read()
        with open(os.path.join(data_path, zip_file.split('.')[0] + '.csv'), 'wb') as f:
            f.write(data)
    except Exception as e:
        print(f'File error: {zip_file} - {e}')

100%|██████████| 9/9 [00:00<00:00, 31.68it/s]
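
The two decompression cells above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that, not the notebook's own code: `gunzip_folder` is a hypothetical name, and it streams each archive with `shutil.copyfileobj` instead of loading the whole file into memory with `read()`, which may matter for the larger tables in the full MIMIC-IV release.

```python
import gzip
import os
import shutil


def gunzip_folder(folder):
    """Decompress every *.gz file in `folder` next to the original; return the paths written."""
    written = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".gz"):
            continue
        src = os.path.join(folder, name)
        # Strip only the trailing ".gz", so "patients.csv.gz" becomes "patients.csv".
        dst = os.path.join(folder, name[:-3])
        with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
            # copyfileobj moves the data in fixed-size chunks rather than one big read().
            shutil.copyfileobj(fin, fout)
        written.append(dst)
    return written
```

Under these assumptions, calling `gunzip_folder` once with the `hosp` path and once with the `icu` path would produce the same `.csv` files as the cells above.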
