<a href="https://colab.research.google.com/github/mc-doxey/Awesome-Adaptation-of-Agentic-AI/blob/main/safety_nest_01_patient_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 00 - Intro


In [18]:
## 00. SETUP PERSISTENT STORAGE
from google.colab import drive
import os

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [19]:
## 01. Define Persistent Paths
## We will save everything here so it survives a runtime restart
BASE_DRIVE_PATH = "/content/drive/MyDrive/SafetyNest_Data"
CLEAN_DATA_DIR = os.path.join(BASE_DRIVE_PATH, "fhir_clean")
DIRTY_DATA_DIR = os.path.join(BASE_DRIVE_PATH, "fhir_dirty")

# 3. Create Directories if they don't exist
os.makedirs(CLEAN_DATA_DIR, exist_ok=True)
os.makedirs(DIRTY_DATA_DIR, exist_ok=True)

print(f"‚úÖ Storage Configured.")
print(f"üìÇ Clean Data: {CLEAN_DATA_DIR}")
print(f"üìÇ Dirty Data: {DIRTY_DATA_DIR}")

‚úÖ Storage Configured.
üìÇ Clean Data: /content/drive/MyDrive/SafetyNest_Data/fhir_clean
üìÇ Dirty Data: /content/drive/MyDrive/SafetyNest_Data/fhir_dirty


In [20]:

print(f"Checking contents of local Synthea output directory: {LOCAL_SYNTHEA_OUTPUT_DIR}")
if os.path.exists(LOCAL_SYNTHEA_OUTPUT_DIR) and os.listdir(LOCAL_SYNTHEA_OUTPUT_DIR):
    print(f"Found files in local directory. First 5 files: {os.listdir(LOCAL_SYNTHEA_OUTPUT_DIR)[:5]}...")
    # If files are found locally, copy them to the FHIR_CLEAN_DIR on Drive
    print(f"Copying files from {LOCAL_SYNTHEA_OUTPUT_DIR} to {FHIR_CLEAN_DIR} for persistence...")
    !cp -r "{LOCAL_SYNTHEA_OUTPUT_DIR}"/* "{FHIR_CLEAN_DIR}"
    print("‚úÖ Files copied to Google Drive.")
    # Optionally, remove local files to free up space (be careful with this)
    # !rm -r "{LOCAL_SYNTHEA_OUTPUT_DIR}"
else:
    print(f"No files found in local Synthea output directory: {LOCAL_SYNTHEA_OUTPUT_DIR}")
    print("This indicates Synthea might not have generated files, or they are in an unexpected location.")
    print(f"Listing contents of {FHIR_CLEAN_DIR} to confirm it's still empty: {os.listdir(FHIR_CLEAN_DIR)}")

Checking contents of local Synthea output directory: /content/output/fhir
Found files in local directory. First 5 files: ['Laverne101_Torphy630_7b7c1667-e930-2f94-1e0e-1e74e2f7b41f.json', 'Colin861_Mayert710_608c8dce-a577-4b05-f962-6fef90986a3a.json', 'Eldridge510_Hand679_52c4ccb1-a216-cced-ae1d-a5120c628011.json', 'Mark765_Schuster709_46eab0c9-da7b-1910-9245-4f2f8da7965c.json', 'Irvin970_Parker433_d7ecdd8c-150c-369e-a929-1b7939a9b3ab.json']...
Copying files from /content/output/fhir to /content/drive/MyDrive/Synthea_Data/fhir_output/clean for persistence...
‚úÖ Files copied to Google Drive.


# 01 - Synthetic Data

## 01a - Get Synthea Data

In [21]:
import glob
import shutil

# Check if we already have data
existing_files = glob.glob(f"{CLEAN_DATA_DIR}/*.json")

if len(existing_files) >= 1000:
    print(f"‚úÖ Found {len(existing_files)} existing patient records in Drive.")
    print("‚è≠Ô∏è Skipping Synthea Generation.")

else:
    print("‚ö†Ô∏è Data not found (or incomplete). Starting Fresh Generation...")

    # 1. Install Java & Download Synthea (Local VM)
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    !wget -q https://github.com/synthetichealth/synthea/releases/download/v3.0.0/synthea-with-dependencies.jar -O synthea.jar

    # 2. Run Synthea (Generate locally first for speed)
    print("üè• Generating 1000 Patients locally...")
    !java -jar synthea.jar -p 1000 -a 0-17 --exporter.baseDirectory "./temp_output"

    # 3. Move to Drive (Persistence)
    print(f"üíæ Copying files to Google Drive: {CLEAN_DATA_DIR}...")

    # Synthea creates subfolders, we want the files in 'fhir'
    local_fhir_path = "./temp_output/fhir"
    for file_name in os.listdir(local_fhir_path):
        full_file_name = os.path.join(local_fhir_path, file_name)
        if os.path.isfile(full_file_name):
            shutil.copy(full_file_name, CLEAN_DATA_DIR)

    print("‚úÖ Data successfully saved to Drive.")

‚ö†Ô∏è Data not found (or incomplete). Starting Fresh Generation...
E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/universe/o/openjdk-8/openjdk-8-jre-headless_8u472-ga-1%7e22.04_amd64.deb  404  Not Found [IP: 91.189.91.82 80]
E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/universe/o/openjdk-8/openjdk-8-jdk-headless_8u472-ga-1%7e22.04_amd64.deb  404  Not Found [IP: 91.189.91.82 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
üè• Generating 1000 Patients locally...
Scanned 67 modules and 75 submodules.
Loading submodule modules/breast_cancer/tnm_diagnosis.json
Loading submodule modules/encounter/anxiety_screening.json
Loading submodule modules/covid19/nonsurvivor_lab_values.json
Loading submodule modules/covid19/outcomes.json
Loading submodule modules/covid19/survivor_lab_values.json
Loading submodule modules/dermatitis/moderate_cd_obs.json
Loading submodule modules/covid19/diagnose_blood_clot.json
Loading submodule modules

## 01b - Create "dirty bird" UDFs

These UDFs are used to inject erroneous and "dirty" data into the base Synthea files to create a more realistic patient population

In [23]:
import json
import os
import random
import glob

if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

# --- HELPER: FIND RESOURCES IN FHIR BUNDLE ---
def find_resource(bundle, resource_type):
    return [e['resource'] for e in bundle.get('entry', []) if e['resource']['resourceType'] == resource_type]

def find_weight(bundle):
    # Synthea uses LOINC 29463-7 for Body Weight
    obs = find_resource(bundle, "Observation")
    for o in obs:
        if o.get('code', {}).get('coding', [{}])[0].get('code') == '29463-7':
            return o
    return None

# --- INJECTION LOGIC ---

def inject_weight_error(bundle):
    """Scenario: Unit Mismatch (KG vs LBS)."""
    weight_res = find_weight(bundle)
    if weight_res:
        original_kg = weight_res['valueQuantity']['value']
        fake_val = round(original_kg * 2.2, 1) # Convert to LBS number

        # Corrupt the record
        weight_res['valueQuantity']['value'] = fake_val
        # (Note: We deliberately leave the code as 'kg' to cause the error)

        return bundle, f"WEIGHT_ERROR: Patient is {original_kg}kg, but EHR records {fake_val}kg (lbs)."
    return bundle, None

def inject_zombie_order(bundle):
    """Scenario: Intent Mismatch (Note says Stop, Order is Active)."""
    meds = find_resource(bundle, "MedicationRequest")
    active_meds = [m for m in meds if m['status'] == 'active']

    if active_meds:
        target_med = active_meds[0]
        drug_name = target_med['medicationCodeableConcept']['text']

        # Create a "Discharge Summary" note with the intent to STOP
        note_resource = {
            "resourceType": "DocumentReference",
            "status": "current",
            "docStatus": "final",
            "type": {"text": "Discharge Summary"},
            "content": [{
                "attachment": {
                    "contentType": "text/plain",
                    "data": f"PLAN: Patient developed rash. Discontinue {drug_name} immediately. Switch to supportive care.".encode('utf-8').hex()
                }
            }]
        }
        bundle['entry'].append({"resource": note_resource})

        return bundle, f"RECON_ERROR: Note says 'Stop {drug_name}', but Order is Active."
    return bundle, None

def inject_concentration_error(bundle):
    """Scenario: Ambiguous Liquid (Missing Concentration)."""
    meds = find_resource(bundle, "MedicationRequest")
    if meds:
        target = meds[0]
        if 'dosageInstruction' in target:
            target['dosageInstruction'][0]['text'] = "Give 5mL PO every 4 hours" # Vague!

        return bundle, "CONC_ERROR: Order says '5mL' without specifying concentration."
    return bundle, None


## 01c - Inject errors into data

In [26]:
## 01c. INJECT ERRORS (READ/WRITE TO DRIVE)

## Configuration
INPUT_DIR = CLEAN_DATA_DIR  # Read from Drive
OUTPUT_DIR = DIRTY_DATA_DIR # Save to Drive
ERROR_RATE = 0.4

## MAIN EXECUTION ---
print("üíâ Checking/Injecting Chaos...")

## Check if we already did this
existing_dirty = glob.glob(f"{OUTPUT_DIR}/*.json")
if len(existing_dirty) >= 1000:
    print(f"‚úÖ Found {len(existing_dirty)} processed records in Drive. Skipping Injection.")

else:
    print(f"‚ö†Ô∏è Processed data missing. Reading from {INPUT_DIR}...")
    files = glob.glob(f"{INPUT_DIR}/*.json")
    processed_log = []

    print("Dirtying up Synthea Data...")
    for i, file_path in enumerate(files):
        with open(file_path, 'r') as f:
            bundle = json.load(f)

        ## Extract Patient Name for logging
        try:
            patient = find_resource(bundle, "Patient")[0]
            name = f"{patient['name'][0]['given'][0]} {patient['name'][0]['family']}"
        except:
            name = "Unknown"

        error_desc = "SAFE"
        has_error = False

        ## Roll dice
        if random.random() < ERROR_RATE:
            error_type = random.choice(["weight", "zombie", "conc"])

            if error_type == "weight":
                bundle, msg = inject_weight_error(bundle)
            elif error_type == "zombie":
                bundle, msg = inject_zombie_order(bundle)
            elif error_type == "conc":
                bundle, msg = inject_concentration_error(bundle)

            if msg:
                has_error = True
                error_desc = msg

        ## Save to Ground Truth Log
        processed_log.append({
            "file": os.path.basename(file_path),
            "name": name,
            "has_error": has_error,
            "description": error_desc
        })

        ## Save the (possibly corrupted) FHIR file to Drive
        out_path = os.path.join(OUTPUT_DIR, os.path.basename(file_path))
        with open(out_path, 'w') as f:
            json.dump(bundle, f, indent=2)

        # Logging (Optional: Print every 100th file to show progress)
        if i % 100 == 0:
            print(f"   Processed {i}/{len(files)}")

    ## Save the Answer Key to Drive
    with open(f"{OUTPUT_DIR}/ground_truth_key.json", 'w') as f:
        json.dump(processed_log, f, indent=2)

    print(f"‚úÖ Processed {len(files)} patients.")
    print(f"üìÇ Corrupted data saved to {OUTPUT_DIR}")
    print(f"üîë Answer Key saved to {OUTPUT_DIR}/ground_truth_key.json")
    print("‚úÖ Injection Complete. Saved to Drive.")


üíâ Checking/Injecting Chaos...
‚ö†Ô∏è Processed data missing. Reading from /content/drive/MyDrive/SafetyNest_Data/fhir_clean...
Dirtying up Synthea Data...
   Processed 0/1006
   Processed 100/1006
   Processed 200/1006
   Processed 300/1006
   Processed 400/1006
   Processed 500/1006
   Processed 600/1006
   Processed 700/1006
   Processed 800/1006
   Processed 900/1006
   Processed 1000/1006
‚úÖ Processed 1006 patients.
üìÇ Corrupted data saved to /content/drive/MyDrive/SafetyNest_Data/fhir_dirty
üîë Answer Key saved to /content/drive/MyDrive/SafetyNest_Data/fhir_dirty/ground_truth_key.json
‚úÖ Injection Complete. Saved to Drive.


## 01d - Convert FHIR to structured data

In [27]:
## STEP 3: CONVERT FHIR to structured csv
import pandas as pd
import json
import glob
import os

# --- CONFIGURATION ---
# Use previously defined Google Drive paths
CLEAN_DIR = CLEAN_DATA_DIR # Input from persistent storage
ADJUSTED_DIR = DIRTY_DATA_DIR # Input from persistent storage
GROUND_TRUTH_FILE = os.path.join(DIRTY_DATA_DIR, "ground_truth_key.json") # Ground truth key path on Drive

# Ensure CSV output path exists
os.makedirs(CSV_OUTPUT_PATH, exist_ok=True)

# --- PARSING HELPER ---
def parse_fhir_bundle(file_path):
    """
    Reads a FHIR Bundle and extracts the key 'Safety Nest' features
    into a flat dictionary. Handles non-patient files gracefully.
    """
    try:
        with open(file_path, 'r') as f:
            bundle = json.load(f)
    except Exception as e:
        # Catch any exception during JSON loading (e.g., malformed JSON)
        print(f"Error loading {file_path}: {e}")
        return None

    # Ensure the loaded JSON is a dictionary, as expected for a FHIR Bundle
    if not isinstance(bundle, dict):
        # If it's not a dict, it's not a valid FHIR Bundle for our parsing logic
        # Also skip the ground_truth_key.json file itself if encountered here
        if os.path.basename(file_path) == os.path.basename(GROUND_TRUTH_FILE):
            return None
        print(f"Skipping {file_path}: Not a valid FHIR Bundle (expected dict, got {type(bundle)})")
        return None

    # 1. Basic Patient Info (SAFETY CHECK)
    # Synthea creates metadata files (hospitalInformation.json) that we must skip.
    patient_resources = [e['resource'] for e in bundle.get('entry', []) if e.get('resource', {}).get('resourceType') == 'Patient']

    if not patient_resources:
        return None # Skip file if no patient found

    pat_resource = patient_resources[0]
    pid = pat_resource['id']
    name = f"{pat_resource['name'][0]['given'][0]} {pat_resource['name'][0]['family']}"

    # 2. Extract Weight (LOINC 29463-7)
    weight = "N/A"
    weight_unit = "N/A"
    obs = [e['resource'] for e in bundle.get('entry', []) if e.get('resource', {}).get('resourceType') == 'Observation']
    for o in obs:
        if o.get('code', {}).get('coding', [{}])[0].get('code') == '29463-7':
            # Safe access in case valueQuantity is missing
            if 'valueQuantity' in o:
                weight = o['valueQuantity']['value']
                weight_unit = o['valueQuantity']['unit']
            break

    # 3. Extract Active Medication
    meds = [
        e['resource'] for e in bundle.get('entry', [])
        if e.get('resource', {}).get('resourceType') == 'MedicationRequest'
        and e.get('resource', {}).get('status') == 'active'
    ]

    active_med_name = "None"
    dosage_instruction = "None"

    if meds:
        m = meds[0]
        # Safe access for medication text
        if 'medicationCodeableConcept' in m:
            active_med_name = m['medicationCodeableConcept'].get('text', 'Unknown Drug')
        elif 'medicationReference' in m:
             active_med_name = m['medicationReference'].get('display', 'Unknown Drug')

        if 'dosageInstruction' in m and len(m['dosageInstruction']) > 0:
            dosage_instruction = m['dosageInstruction'][0].get('text', 'Complex Dosage')

    # 4. Extract Clinical Notes (DocumentReference)
    notes = [e['resource'] for e in bundle.get('entry', []) if e.get('resource', {}).get('resourceType') == 'DocumentReference']
    latest_note = "None"
    if notes:
        try:
            # Our injector used .hex() encoding for the chaos note
            # If it's a standard Synthea note, it might be base64, but we focus on our injected ones for the CSV
            content = notes[-1]['content'][0]['attachment']
            if 'data' in content:
                try:
                    # Try decoding our hex injection
                    latest_note = bytes.fromhex(content['data']).decode('utf-8')
                except:
                    latest_note = "Standard Synthea Note (Binary)"
        except:
            latest_note = "Error Reading Note"

    return {
        "filename": os.path.basename(file_path),
        "patient_id": pid,
        "name": name,
        "weight_val": weight,
        "weight_unit": weight_unit,
        "active_med": active_med_name,
        "dosage_instruction": dosage_instruction,
        "latest_note": latest_note
    }

# --- PROCESS CLEAN DATA ---
print("üìä Processing Clean Data...")
clean_files = glob.glob(f"{CLEAN_DIR}/*.json")
clean_data = [d for d in [parse_fhir_bundle(f) for f in clean_files] if d]

df_clean = pd.DataFrame(clean_data)
if not df_clean.empty:
    df_clean.to_csv(f"{CSV_OUTPUT_PATH}/safety_nest_clean.csv", index=False) # Save to CSV_OUTPUT_PATH
    print(f"‚úÖ Saved {len(df_clean)} records to '{CSV_OUTPUT_PATH}/safety_nest_clean.csv'")
else:
    print("‚ö†Ô∏è No valid clean patient records found.")

# --- PROCESS ADJUSTED (DIRTY) DATA ---
print("üìä Processing Adjusted Data...")
adjusted_files = glob.glob(f"{ADJUSTED_DIR}/*.json")
adjusted_data = [d for d in [parse_fhir_bundle(f) for f in adjusted_files] if d]

df_adjusted = pd.DataFrame(adjusted_data)

# Merge with Ground Truth
if not df_adjusted.empty:
    try:
        # Load ground truth from the specified Drive path
        with open(GROUND_TRUTH_FILE, 'r') as f:
            ground_truth = pd.DataFrame(json.load(f))

        # Merge on filename
        df_final = df_adjusted.merge(ground_truth[['file', 'has_error', 'description']], left_on='filename', right_on='file', how='left')
        df_final.drop(columns=['file'], inplace=True)

        df_final.to_csv(f"{CSV_OUTPUT_PATH}/safety_nest_adjusted.csv", index=False) # Save to CSV_OUTPUT_PATH
        print(f"‚úÖ Saved {len(df_final)} records to '{CSV_OUTPUT_PATH}/safety_nest_adjusted.csv'")

        # Display a preview of the Errors
        print("\n--- PREVIEW: INJECTED ERRORS ---")
        print(df_final[df_final['has_error'] == True][['name', 'active_med', 'description']].head())

    except Exception as e:
        print(f"‚ö†Ô∏è Could not merge Ground Truth: {e}")
        df_adjusted.to_csv(f"{CSV_OUTPUT_PATH}/safety_nest_adjusted.csv", index=False) # Save to CSV_OUTPUT_PATH
else:
    print("‚ö†Ô∏è No valid adjusted patient records found.")

üìä Processing Clean Data...
‚úÖ Saved 1004 records to '/content/drive/MyDrive/Synthea_Data/processed_csv/safety_nest_clean.csv'
üìä Processing Adjusted Data...
‚úÖ Saved 1004 records to '/content/drive/MyDrive/Synthea_Data/processed_csv/safety_nest_adjusted.csv'

--- PREVIEW: INJECTED ERRORS ---
                     name active_med  \
2        Ana972 Garica503       None   
3     Delmy607 Johnson679       None   
7   Prince887 Prosacco716       None   
9   Darrell400 Gaylord332       None   
11  Herbert830 Parisian75       None   

                                          description  
2   CONC_ERROR: Order says '5mL' without specifyin...  
3   CONC_ERROR: Order says '5mL' without specifyin...  
7   WEIGHT_ERROR: Patient is 22.3kg, but EHR recor...  
9   CONC_ERROR: Order says '5mL' without specifyin...  
11  CONC_ERROR: Order says '5mL' without specifyin...  


In [29]:
## CREATE pandas dataframes for clean and adjusted data
import pandas as pd
import os

## Load the safety_nest_clean.csv file from Google Drive
df_clean = pd.read_csv(os.path.join(CSV_OUTPUT_PATH, 'safety_nest_clean.csv'))

# Display information about the clean dataset
df_clean.info()

## Load the safety_nest_adjusted.csv file from Google Drive
df_adjusted = pd.read_csv(os.path.join(CSV_OUTPUT_PATH, 'safety_nest_adjusted.csv'))

## Display information about the adjusted dataset
df_adjusted.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   filename            1004 non-null   object 
 1   patient_id          1004 non-null   object 
 2   name                1004 non-null   object 
 3   weight_val          1002 non-null   float64
 4   weight_unit         1002 non-null   object 
 5   active_med          260 non-null    object 
 6   dosage_instruction  208 non-null    object 
 7   latest_note         1004 non-null   object 
dtypes: float64(1), object(7)
memory usage: 62.9+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   filename            1004 non-null   object 
 1   patient_id          1004 non-null   object 
 2   name                1004 non-null   object 

In [None]:
display(df_clean.head())


Unnamed: 0,filename,patient_id,name,weight_val,weight_unit,active_med,dosage_instruction,latest_note
0,Dudley365_Schuster709_2d50fbf3-6f4a-ed3d-2791-...,2d50fbf3-6f4a-ed3d-2791-3354efc6f2a0,Dudley365 Schuster709,16.9,kg,Methylphenidate Hydrochloride 20 MG Oral Tablet,,Standard Synthea Note (Binary)
1,Jeff859_Durgan499_81e3f04d-962f-935d-5002-aafc...,81e3f04d-962f-935d-5002-aafcb3c3f876,Jeff859 Durgan499,13.3,kg,,,Standard Synthea Note (Binary)
2,Mica978_Metz686_96d97004-07cb-77f9-3d55-798222...,96d97004-07cb-77f9-3d55-798222ef3ab7,Mica978 Metz686,20.3,kg,1 ML medroxyprogesterone acetate 150 MG/ML Inj...,,Standard Synthea Note (Binary)
3,Ava220_Metz686_8622ba5b-d72d-9283-074f-b91cd39...,8622ba5b-d72d-9283-074f-b91cd3950b11,Ava220 Metz686,14.8,kg,cetirizine hydrochloride 5 MG Oral Tablet,Take as needed.,Standard Synthea Note (Binary)
4,Arden380_Spinka232_ba827a58-238a-f3b0-7520-724...,ba827a58-238a-f3b0-7520-7241d88a5ae9,Arden380 Spinka232,17.8,kg,,,Standard Synthea Note (Binary)


In [None]:
display(df_adjusted.head(10))

Unnamed: 0,filename,patient_id,name,weight_val,weight_unit,active_med,dosage_instruction,latest_note,has_error,description
0,Dudley365_Schuster709_2d50fbf3-6f4a-ed3d-2791-...,2d50fbf3-6f4a-ed3d-2791-3354efc6f2a0,Dudley365 Schuster709,16.9,kg,Methylphenidate Hydrochloride 20 MG Oral Tablet,,Standard Synthea Note (Binary),False,SAFE
1,Jeff859_Durgan499_81e3f04d-962f-935d-5002-aafc...,81e3f04d-962f-935d-5002-aafcb3c3f876,Jeff859 Durgan499,13.3,kg,,,Standard Synthea Note (Binary),False,SAFE
2,Mica978_Metz686_96d97004-07cb-77f9-3d55-798222...,96d97004-07cb-77f9-3d55-798222ef3ab7,Mica978 Metz686,20.3,kg,1 ML medroxyprogesterone acetate 150 MG/ML Inj...,,Standard Synthea Note (Binary),False,SAFE
3,Ava220_Metz686_8622ba5b-d72d-9283-074f-b91cd39...,8622ba5b-d72d-9283-074f-b91cd3950b11,Ava220 Metz686,14.8,kg,cetirizine hydrochloride 5 MG Oral Tablet,Take as needed.,Standard Synthea Note (Binary),False,SAFE
4,Arden380_Spinka232_ba827a58-238a-f3b0-7520-724...,ba827a58-238a-f3b0-7520-7241d88a5ae9,Arden380 Spinka232,17.8,kg,,,Standard Synthea Note (Binary),False,SAFE
5,Len277_Schulist381_98ad79c0-a346-cc55-9c30-070...,98ad79c0-a346-cc55-9c30-070724b83b8d,Len277 Schulist381,4.4,kg,,,Standard Synthea Note (Binary),False,SAFE
6,Mirna233_Hammes673_7cc86b28-997a-b9db-6610-c47...,7cc86b28-997a-b9db-6610-c47932677995,Mirna233 Hammes673,3.4,kg,,,Standard Synthea Note (Binary),False,SAFE
7,Sid118_Lubowitz58_3bfda4ff-2c80-6e4c-f9cd-dc56...,3bfda4ff-2c80-6e4c-f9cd-dc569bd5e984,Sid118 Lubowitz58,14.9,kg,,,Standard Synthea Note (Binary),False,SAFE
8,Horacio317_Anderson154_71fea222-a5ad-0268-cc47...,71fea222-a5ad-0268-cc47-ef337ec697c2,Horacio317 Anderson154,4.0,kg,Fexofenadine hydrochloride 30 MG Oral Tablet,Take as needed.,Standard Synthea Note (Binary),False,SAFE
9,Eryn994_Anderson154_d5ad922f-6376-752d-5468-bc...,d5ad922f-6376-752d-5468-bc8770ab0fc9,Eryn994 Anderson154,3.7,kg,,,Standard Synthea Note (Binary),False,SAFE


### Summary of Errors in `df_adjusted`

First, let's look at the distribution of the injected errors in the `df_adjusted` dataframe.

In [30]:
print("Error distribution in df_adjusted:")
display(df_adjusted['description'].value_counts())

num_errors = df_adjusted['has_error'].sum()
print(f"\nTotal patients with injected errors: {num_errors}")
print(f"Total patients without errors: {len(df_adjusted) - num_errors}")

Error distribution in df_adjusted:


Unnamed: 0_level_0,count
description,Unnamed: 1_level_1
SAFE,705
CONC_ERROR: Order says '5mL' without specifying concentration.,131
"WEIGHT_ERROR: Patient is 3.9kg, but EHR records 8.6kg (lbs).",11
"RECON_ERROR: Note says 'Stop Fexofenadine hydrochloride 30 MG Oral Tablet', but Order is Active.",9
"RECON_ERROR: Note says 'Stop cetirizine hydrochloride 5 MG Oral Tablet', but Order is Active.",9
...,...
"RECON_ERROR: Note says 'Stop Meperidine Hydrochloride 50 MG Oral Tablet', but Order is Active.",1
"WEIGHT_ERROR: Patient is 22.9kg, but EHR records 50.4kg (lbs).",1
"WEIGHT_ERROR: Patient is 7.5kg, but EHR records 16.5kg (lbs).",1
"WEIGHT_ERROR: Patient is 2.8kg, but EHR records 6.2kg (lbs).",1



Total patients with injected errors: 299
Total patients without errors: 705


### Detailed Comparison of a Corrupted Patient

Now, let's select one patient that has been flagged with an error in `df_adjusted` and compare their record with the corresponding clean record from `df_clean`.

In [31]:
## 1. Set a Seed for Reproducibility
## Change this number to pick a different "random" patient (e.g., 42, 123, 999)
SEED = 42

## 2. Select 1 RANDOM patient with an error
## random_state=SEED ensures we get the same patient every time we run this cell
errored_patient = df_adjusted[df_adjusted['has_error'] == True].sample(n=1, random_state=SEED).iloc[0]

## 3. Save the ID to a global variable for the Agent to use later
TARGET_PATIENT_ID = errored_patient['patient_id']

print(f"Random Seed: {SEED}")
print(f"Target Patient ID: {TARGET_PATIENT_ID}")

## 4. Fetch and Display the Records
clean_record = df_clean[df_clean['patient_id'] == TARGET_PATIENT_ID]
adjusted_record = df_adjusted[df_adjusted['patient_id'] == TARGET_PATIENT_ID]

print("\n--- üü¢ Clean Record (Original Synthea) ---")
display(clean_record[['name', 'weight_val', 'weight_unit', 'active_med', 'dosage_instruction', 'latest_note']])

print("\n--- üî¥ Adjusted Record (With Injected Error) ---")
display(adjusted_record[['name', 'weight_val', 'weight_unit', 'active_med', 'dosage_instruction', 'latest_note', 'description']])

Random Seed: 42
Target Patient ID: 7613cea2-6b8e-b25b-9bd5-a530028449bc

--- üü¢ Clean Record (Original Synthea) ---


Unnamed: 0,name,weight_val,weight_unit,active_med,dosage_instruction,latest_note
944,Kiersten731 Schuppe920,6.7,kg,,,Standard Synthea Note (Binary)



--- üî¥ Adjusted Record (With Injected Error) ---


Unnamed: 0,name,weight_val,weight_unit,active_med,dosage_instruction,latest_note,description
944,Kiersten731 Schuppe920,6.7,kg,,,Standard Synthea Note (Binary),CONC_ERROR: Order says '5mL' without specifyin...


### 01e - Save DataFrames for Persistence

To ensure our processed DataFrames (`df_clean` and `df_adjusted`) are also persistent across Colab sessions, we will save them to Google Drive in the Parquet format. Parquet is a columnar storage format that is highly efficient for data analytics and is faster to read than CSVs.

In [32]:
import os

# Define a directory for saving processed DataFrames in Google Drive
DATAFRAMES_OUTPUT_PATH = os.path.join(GDRIVE_PATH, "processed_dataframes")
os.makedirs(DATAFRAMES_OUTPUT_PATH, exist_ok=True)

# Save df_clean to Parquet
clean_df_path = os.path.join(DATAFRAMES_OUTPUT_PATH, "safety_nest_clean.parquet")
df_clean.to_parquet(clean_df_path, index=False)
print(f"‚úÖ df_clean saved to {clean_df_path}")

# Save df_adjusted to Parquet
adjusted_df_path = os.path.join(DATAFRAMES_OUTPUT_PATH, "safety_nest_adjusted.parquet")
df_adjusted.to_parquet(adjusted_df_path, index=False)
print(f"‚úÖ df_adjusted saved to {adjusted_df_path}")

print("You can now load these DataFrames directly from these paths in future sessions.")

‚úÖ df_clean saved to /content/drive/MyDrive/Synthea_Data/processed_dataframes/safety_nest_clean.parquet
‚úÖ df_adjusted saved to /content/drive/MyDrive/Synthea_Data/processed_dataframes/safety_nest_adjusted.parquet
You can now load these DataFrames directly from these paths in future sessions.


# Appendix

In [None]:
# print("Dirtying up Synthea Data...")
# files = glob.glob(f"{INPUT_DIR}/*.json")
# processed_log = []

# for file_path in files:
#     with open(file_path, 'r') as f:
#         bundle = json.load(f)

#     ## Extract Patient Name for logging
#     try:
#         patient = find_resource(bundle, "Patient")[0]
#         name = f"{patient['name'][0]['given'][0]} {patient['name'][0]['family']}"
#     except:
#         name = "Unknown"

#     error_desc = "SAFE"
#     has_error = False

#     ## Roll dice
#     if random.random() < ERROR_RATE:
#         error_type = random.choice(["weight", "zombie", "conc"])

#         if error_type == "weight":
#             bundle, msg = inject_weight_error(bundle)
#         elif error_type == "zombie":
#             bundle, msg = inject_zombie_order(bundle)
#         elif error_type == "conc":
#             bundle, msg = inject_concentration_error(bundle)

#         if msg:
#             has_error = True
#             error_desc = msg

#     ## Save to Ground Truth Log
#     processed_log.append({
#         "file": os.path.basename(file_path),
#         "name": name,
#         "has_error": has_error,
#         "description": error_desc
#     })

#     ## Save the (possibly corrupted) FHIR file
#     with open(f"{OUTPUT_DIR}/{os.path.basename(file_path)}", 'w') as f:
#         json.dump(bundle, f, indent=2)

# ## Save the Answer Key
# with open(f"{OUTPUT_DIR}/ground_truth_key.json", 'w') as f:
#     json.dump(processed_log, f, indent=2)

# print(f"‚úÖ Processed {len(files)} patients.")
# print(f"üìÇ Corrupted data saved to {OUTPUT_DIR}")
# print(f"üîë Answer Key saved to {OUTPUT_DIR}/ground_truth_key.json")

üíâ Dirtying up Synthea Data...
‚úÖ Processed 1013 patients.
üìÇ Corrupted data saved to ./safety_nest_data
üîë Answer Key saved to ./safety_nest_data/ground_truth_key.json


# Task
To establish a centralized configuration for all persistent Google Drive paths, I will create a `config.py` file. This file will define all necessary directory paths used throughout the project and ensure their existence on Google Drive.

```python
import os

# Define the base Google Drive paths for different project components.
# SAFETY_NEST_BASE_DRIVE_PATH is for SafetyNest-specific raw/processed FHIR data.
# SYNTHEA_OUTPUT_BASE_DRIVE_PATH is for Synthea-related outputs (CSVs, Parquet, JARs).
# The variable name GDRIVE_PATH is used here to match its previous usage in the notebook
# for processed outputs, adhering to the user's plan.
BASE_DRIVE_PATH = "/content/drive/MyDrive/SafetyNest_Data"
GDRIVE_PATH = "/content/drive/MyDrive/Synthea_Data"

# --- Data Directory Definitions ---

# 1. Raw/Dirty FHIR Data (stored under BASE_DRIVE_PATH)
CLEAN_DATA_DIR = os.path.join(BASE_DRIVE_PATH, "fhir_clean")
DIRTY_DATA_DIR = os.path.join(BASE_DRIVE_PATH, "fhir_dirty")

# 2. Processed CSV and DataFrame Outputs (stored under GDRIVE_PATH)
CSV_OUTPUT_PATH = os.path.join(GDRIVE_PATH, "processed_csv")
DATAFRAMES_OUTPUT_PATH = os.path.join(GDRIVE_PATH, "processed_dataframes")

# 3. Synthea JAR Path (stored under GDRIVE_PATH)
SYNTHEA_JAR_PATH = os.path.join(GDRIVE_PATH, "synthea_jar/synthea.jar")

# --- Function to Ensure Directories Exist ---
def ensure_directories_exist():
    """
    Creates all necessary directories if they do not already exist.
    This function is called automatically when config.py is imported.
    """
    directories_to_create = [
        CLEAN_DATA_DIR,
        DIRTY_DATA_DIR,
        CSV_OUTPUT_PATH,
        DATAFRAMES_OUTPUT_PATH,
        os.path.dirname(SYNTHEA_JAR_PATH) # Ensure the synthea_jar directory exists
    ]

    print("Ensuring project directories exist on Google Drive...")
    for path in directories_to_create:
        os.makedirs(path, exist_ok=True)
        print(f"- Created/Ensured: {path}")
    print("All necessary directories are in place.")

# Call the function to create directories when config.py is imported
ensure_directories_exist()

print("‚úÖ Configuration loaded.")
```

I will save the above content as `config.py` in the `BASE_DRIVE_PATH`, which is `/content/drive/MyDrive/SafetyNest_Data`.

```python
config_content = """
import os

# Define the base Google Drive paths for different project components.
# SAFETY_NEST_BASE_DRIVE_PATH is for SafetyNest-specific raw/processed FHIR data.
# SYNTHEA_OUTPUT_BASE_DRIVE_PATH is for Synthea-related outputs (CSVs, Parquet, JARs).
# The variable name GDRIVE_PATH is used here to match its previous usage in the notebook
# for processed outputs, adhering to the user's plan.
BASE_DRIVE_PATH = "/content/drive/MyDrive/SafetyNest_Data"
GDRIVE_PATH = "/content/drive/MyDrive/Synthea_Data"

# --- Data Directory Definitions ---

# 1. Raw/Dirty FHIR Data (stored under BASE_DRIVE_PATH)
CLEAN_DATA_DIR = os.path.join(BASE_DRIVE_PATH, "fhir_clean")
DIRTY_DATA_DIR = os.path.join(BASE_DRIVE_PATH, "fhir_dirty")

# 2. Processed CSV and DataFrame Outputs (stored under GDRIVE_PATH)
CSV_OUTPUT_PATH = os.path.join(GDRIVE_PATH, "processed_csv")
DATAFRAMES_OUTPUT_PATH = os.path.join(GDRIVE_PATH, "processed_dataframes")

# 3. Synthea JAR Path (stored under GDRIVE_PATH)
SYNTHEA_JAR_PATH = os.path.join(GDRIVE_PATH, "synthea_jar/synthea.jar")

# --- Function to Ensure Directories Exist ---
def ensure_directories_exist():
    \"\"\"
    Creates all necessary directories if they do not already exist.
    This function is called automatically when config.py is imported.
    \"\"\"
    directories_to_create = [
        CLEAN_DATA_DIR,
        DIRTY_DATA_DIR,
        CSV_OUTPUT_PATH,
        DATAFRAMES_OUTPUT_PATH,
        os.path.dirname(SYNTHEA_JAR_PATH) # Ensure the synthea_jar directory exists
    ]

    print("Ensuring project directories exist on Google Drive...")
    for path in directories_to_create:
        os.makedirs(path, exist_ok=True)
        print(f"- Created/Ensured: {path}")
    print("All necessary directories are in place.")

# Call the function to create directories when config.py is imported
ensure_directories_exist()

print("‚úÖ Configuration loaded.")
"""

# Define the path where config.py will be saved
config_file_path = os.path.join(BASE_DRIVE_PATH, "config.py")

# Write the content to the config.py file
with open(config_file_path, "w") as f:
    f.write(config_content)

print(f"‚úÖ Created central configuration file: {config_file_path}")
```

## Create a Central Configuration File

### Subtask:
Generate a Python file (`config.py`) that contains all the Google Drive path definitions (e.g., `GDRIVE_PATH`, `CLEAN_DATA_DIR`, `DIRTY_DATA_DIR`, `CSV_OUTPUT_PATH`, `DATAFRAMES_OUTPUT_PATH`) and the necessary `os.makedirs` calls. This file will be saved directly into your base Google Drive path.


**Reasoning**:
The subtask is to create a `config.py` file with all the Google Drive path definitions and directory creation logic. This step generates a Python code block to define the content of the `config.py` file, specifies the file path, and writes the content to that file in the Google Drive.



In [34]:
import os

# Define the content of the config.py file
config_file_content = """
import os

# Base Google Drive path
GDRIVE_PATH = "/content/drive/MyDrive/Synthea_Data"
BASE_DRIVE_PATH = "/content/drive/MyDrive/SafetyNest_Data"

# Directory for clean FHIR data
CLEAN_DATA_DIR = os.path.join(BASE_DRIVE_PATH, "fhir_clean")

# Directory for dirty FHIR data
DIRTY_DATA_DIR = os.path.join(BASE_DRIVE_PATH, "fhir_dirty")

# Directory for processed CSV files
CSV_OUTPUT_PATH = os.path.join(GDRIVE_PATH, "processed_csv")

# Directory for processed Parquet DataFrames
DATAFRAMES_OUTPUT_PATH = os.path.join(GDRIVE_PATH, "processed_dataframes")

# Path to the Synthea JAR file (if downloaded and stored in Drive)
SYNTHEA_JAR_PATH = os.path.join(GDRIVE_PATH, "synthea_jar", "synthea.jar")

# Path to the ground truth key
GROUND_TRUTH_KEY_PATH = os.path.join(DIRTY_DATA_DIR, "ground_truth_key.json")


def ensure_directories_exist():
    '''Ensures all defined directories exist.'''
    os.makedirs(CLEAN_DATA_DIR, exist_ok=True)
    os.makedirs(DIRTY_DATA_DIR, exist_ok=True)
    os.makedirs(CSV_OUTPUT_PATH, exist_ok=True)
    os.makedirs(DATAFRAMES_OUTPUT_PATH, exist_ok=True)
    os.makedirs(os.path.dirname(SYNTHEA_JAR_PATH), exist_ok=True) # Ensure synthea_jar directory exists
    print("‚úÖ All necessary directories ensured.")

# Call the function to create directories when config.py is imported or run
ensure_directories_exist()
"""

# Specify the full file path for config.py
config_file_path = os.path.join(BASE_DRIVE_PATH, "config.py")

# Ensure the BASE_DRIVE_PATH exists before writing the config file
os.makedirs(BASE_DRIVE_PATH, exist_ok=True)

# Write the content to config.py
with open(config_file_path, 'w') as f:
    f.write(config_file_content)

print(f"‚úÖ config.py successfully created at {config_file_path}")

‚úÖ config.py successfully created at /content/drive/MyDrive/SafetyNest_Data/config.py


## Demonstrate Importing Configuration in a New Notebook

### Subtask:
Provide instructions and example code for how you would set up a new notebook to import and use the `config.py` file. This will involve mounting Google Drive and dynamically adding the configuration file's directory to Python's system path.


## Demonstrate Importing Configuration in a New Notebook

To effectively use the `config.py` file in a new or existing Colab notebook, follow these steps:

1.  **Mount Google Drive**: The first step in any Colab notebook that needs to access files stored on Google Drive is to mount your Drive. This makes your Google Drive content accessible within the Colab environment.
    ```python
    from google.colab import drive
    drive.mount('/content/drive')
    ```

2.  **Add Configuration File's Directory to System Path**: Python needs to know where to find the `config.py` module. Since it's stored on Google Drive, we must explicitly add its parent directory to Python's `sys.path`. This allows Python's `import` statement to locate the file.
    ```python
    import sys
    import os

    # Assuming config.py is in BASE_DRIVE_PATH, which is '/content/drive/MyDrive/SafetyNest_Data'
    config_dir = '/content/drive/MyDrive/SafetyNest_Data'
    if config_dir not in sys.path:
        sys.path.append(config_dir)
    ```

3.  **Import the `config` Module**: Once the directory is in `sys.path`, you can import `config` just like any other Python module.
    ```python
    import config
    ```

4.  **Verify Configuration**: After importing, you can access any variable defined within `config.py` using dot notation (e.g., `config.BASE_DRIVE_PATH`). Print a few to confirm they are loaded correctly.
    ```python
    print(f"Base Drive Path: {config.BASE_DRIVE_PATH}")
    print(f"Clean Data Directory: {config.CLEAN_DATA_DIR}")
    print(f"CSV Output Path: {config.CSV_OUTPUT_PATH}")
    ```

Following these steps ensures that your notebook consistently uses the centralized path definitions, making your project more organized and less prone to errors related to hardcoded paths.

**Reasoning**:
Now I will provide the first code block to mount Google Drive, as instructed in step 1 of the subtask.



In [35]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Reasoning**:
Following the instructions, I will now add the directory containing `config.py` to Python's system path to allow its import.



In [36]:
import sys
import os

# Define the directory where config.py was saved
config_dir = '/content/drive/MyDrive/SafetyNest_Data'

# Add the directory to sys.path if it's not already there
if config_dir not in sys.path:
    sys.path.append(config_dir)
    print(f"‚úÖ Added '{config_dir}' to sys.path.")
else:
    print(f"‚ÑπÔ∏è '{config_dir}' is already in sys.path.")

‚úÖ Added '/content/drive/MyDrive/SafetyNest_Data' to sys.path.


**Reasoning**:
Now that the directory containing `config.py` is in `sys.path`, I will import the `config` module and print a few of its defined paths to verify that it's successfully loaded and accessible, as per the instructions.



In [37]:
import config

print(f"Base Drive Path: {config.BASE_DRIVE_PATH}")
print(f"Clean Data Directory: {config.CLEAN_DATA_DIR}")
print(f"CSV Output Path: {config.CSV_OUTPUT_PATH}")
print(f"Ground Truth Key Path: {config.GROUND_TRUTH_KEY_PATH}")

‚úÖ All necessary directories ensured.
Base Drive Path: /content/drive/MyDrive/SafetyNest_Data
Clean Data Directory: /content/drive/MyDrive/SafetyNest_Data/fhir_clean
CSV Output Path: /content/drive/MyDrive/Synthea_Data/processed_csv
Ground Truth Key Path: /content/drive/MyDrive/SafetyNest_Data/fhir_dirty/ground_truth_key.json


## Final Task

### Subtask:
Confirm that the configuration file is created and explain how other notebooks can now easily access all the defined persistent paths, supporting a modular workflow.


## Summary:

### Q&A
*   **Is the configuration file created?**
    Yes, the `config.py` file was successfully created and saved at `/content/drive/MyDrive/SafetyNest_Data/config.py`.
*   **How can other notebooks easily access all the defined persistent paths?**
    Other notebooks can access the defined persistent paths by following a three-step process:
    1.  Mounting Google Drive.
    2.  Appending the directory where `config.py` is located (`/content/drive/MyDrive/SafetyNest_Data`) to Python's `sys.path`.
    3.  Importing the `config` module (e.g., `import config`), which makes all defined paths accessible as attributes (e.g., `config.CLEAN_DATA_DIR`).

### Data Analysis Key Findings
*   A `config.py` file was successfully generated and saved to Google Drive at `/content/drive/MyDrive/SafetyNest_Data/config.py`. This file centralizes all essential project paths, including `BASE_DRIVE_PATH`, `GDRIVE_PATH`, `CLEAN_DATA_DIR`, `DIRTY_DATA_DIR`, `CSV_OUTPUT_PATH`, `DATAFRAMES_OUTPUT_PATH`, `SYNTHEA_JAR_PATH`, and `GROUND_TRUTH_KEY_PATH`.
*   The `config.py` file incorporates an `ensure_directories_exist()` function, which automatically creates all specified directories if they do not already exist, executing upon the module's import.
*   During the initial attempt to create `config.py`, a `SyntaxError` occurred due to conflicting triple-quote usage within the string content. This was resolved by changing the inner docstring from triple double-quotes (`"""`) to triple single-quotes (`'''`).
*   Demonstration confirmed that after mounting Google Drive and adding the `config.py`'s directory to Python's `sys.path`, the `config` module could be successfully imported, and its defined variables (e.g., `config.BASE_DRIVE_PATH`, `config.CLEAN_DATA_DIR`) were readily accessible.

### Insights or Next Steps
*   This centralized `config.py` establishes a robust and modular workflow, preventing hardcoding of paths and ensuring consistency across all project notebooks.
*   Future development can now rely on importing `config` at the start of any notebook to automatically configure directory structures and access predefined persistent storage locations, simplifying data management and collaboration.
