<img src="../static/imo_health.png" alt="IMO Health Logo" width="300"/>

---

# Cohort Identification in RWE Solution Accelerator

This notebook provides an end-to-end solution for identifying patient cohorts from real-world data (RWD) and matching them against cohort eligibility criteria. The workflow includes:

1. **HL7 Data Extraction**: Parse HL7 messages to extract medical codes (ICD-10-CM, CPT, LOINC, SNOMED CT)
2. **Code Normalization**: Use IMO's normalization API to enhance incomplete or missing codes
3. **Cohort Dictionary Search**: Search for and download cohort criteria valuesets from IMO's FHIR API
4. **Cohort Matching**: Match patient codes against cohort eligibility criteria to identify eligible cohorts
5. **Results Analysis**: Generate match reports and statistics

## Prerequisites

- IMO API credentials configured in `config.json`
- HL7 data files in the `../uploads_hl7/hl7_data/` directory (the path the loading cell reads from)
- Required Python packages: `boto3`, `requests`, `hl7apy`, `pandas`
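
As a rough illustration of what the credentials file needs to provide, the sketch below checks a loaded configuration for the four `auth0` keys that the token request in Step 5 reads (`domain`, `client_id`, `client_secret`, `audience`). The helper name `missing_auth0_keys` and the placeholder values are assumptions made for this example; they are not part of the notebook or of the IMO API.

```python
# Keys read from config["auth0"] by the token request in Step 5.
REQUIRED_AUTH0_KEYS = ("domain", "client_id", "client_secret", "audience")

def missing_auth0_keys(config):
    """Return the required auth0 keys that are absent or empty (hypothetical helper)."""
    auth0 = config.get("auth0") or {}
    return [key for key in REQUIRED_AUTH0_KEYS if not auth0.get(key)]

# Placeholder values for illustration only; real credentials come from IMO.
example_config = {
    "auth0": {
        "domain": "example-tenant.example.com",
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "audience": "",
    }
}

print(missing_auth0_keys(example_config))
```

Running a check like this before any API call makes a misconfigured file fail early with a readable message instead of a `KeyError` inside the authentication step.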

---

## Step 1: Environment Setup and Package Installation

### Install Required Python Packages

This cell installs the necessary Python packages for the solution:

- **`boto3`**: AWS SDK for Python (if cloud integration is needed)
- **`requests`**: HTTP library for API calls to IMO services
- **`hl7apy`**: Python library for parsing and processing HL7 messages
- **`pandas`**: Data manipulation and analysis library for handling datasets

> **Note**: Run the `pip install` command below when setting up a new environment or when packages are missing; comment it out once the packages are installed.
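
One way to decide whether the install step is needed is to ask Python which of the listed packages it can already locate. The sketch below uses the standard-library `importlib.util.find_spec`; the helper name `missing_packages` is made up for this example.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

required = ["boto3", "requests", "hl7apy", "pandas"]
print("Missing packages:", missing_packages(required))
```

An empty list suggests the `pip install` line can be skipped for the current environment.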

In [1]:
# Install required packages (comment out once they are installed)
!pip install boto3 requests hl7apy pandas 






---

## Step 2: Import Core Libraries

### Import Essential Libraries for Data Processing

This cell imports the fundamental libraries needed for the solution:

- **`boto3`**: Amazon Web Services SDK (for potential cloud storage integration)
- **`hl7apy`**: Specialized library for parsing HL7 healthcare data messages
- **`requests`**: HTTP client for making API calls to IMO's FHIR and normalization services
- **`json`**: Built-in library for handling JSON data from API responses

These imports establish the foundation for HL7 message parsing, API communication, and data processing throughout the notebook.

In [2]:
import boto3
import hl7apy
from hl7apy.parser import parse_message
import requests
import json

---

## Step 3: HL7 Data Loading and Processing

### Load HL7 Messages from Local Upload Directory

This cell performs the following operations:

1. **Directory Setup**: Defines the path to the HL7 data upload folder (`../uploads_hl7/hl7_data`)
2. **File Discovery**: Recursively searches for all HL7 files in the upload directory
3. **Data Loading**: Reads each HL7 file and stores the content in memory for processing
4. **Preview Display**: Shows the first 200 characters of each file for verification

**Expected Input**: HL7 message files (.hl7, .txt, or similar formats) containing patient data

**Output**: `hl7_data_dict`, a list (despite its name) holding the raw HL7 message text of every file

> **Important**: Place your HL7 data files in the `../uploads_hl7/hl7_data/` directory (the path assigned to `uploads_folder`) before running this cell.
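
The discover-then-read pattern described above can be tried in isolation. The sketch below is an alternative written with `pathlib` rather than the `os.walk` approach used in the cell, and it runs against a temporary directory holding one made-up message, so it does not touch the real upload folder. The function name `load_hl7_files` is an assumption for this example.

```python
from pathlib import Path
import tempfile

def load_hl7_files(folder):
    """Read every file under `folder` (recursively) and return {path: text}."""
    messages = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            messages[str(path)] = path.read_text(encoding="utf-8")
    return messages

# Demonstration against a throwaway directory with one made-up message.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "hl7_data1"
    sample.write_text("MSH|^~\\&|LAB_APP|LAB_FACILITY\nPID|1||PAT1", encoding="utf-8")
    loaded = load_hl7_files(tmp)
    for name, text in loaded.items():
        # Same 200-character preview the notebook cell prints.
        print(name, "->", text[:200])
```

Keying the result by path, instead of appending to a list, keeps each message traceable to its source file, which can help when a later parsing step fails.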

In [4]:
# Load HL7 feed from uploads folder
import os

uploads_folder = '../uploads_hl7/hl7_data'

def list_files_in_uploads_folder(folder_path):
    """
    List all files in the uploads folder.
    Returns a list of file paths.
    """
    file_paths = []
    if os.path.exists(folder_path):
        for root, dirs, files in os.walk(folder_path):
            for file in files:
                file_paths.append(os.path.join(root, file))
    return file_paths

file_paths = list_files_in_uploads_folder(uploads_folder)
print(file_paths)
hl7_data_dict = []
for file_path in file_paths:
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            hl7_data = file.read()
            hl7_data_dict.append(hl7_data)
            print(f"Loaded file: {file_path}")
            print(hl7_data[:200] + "..." if len(hl7_data) > 200 else hl7_data)
            print("-" * 50)
    except Exception as e:
        print(f"Error reading file {file_path}: {e}")

['../uploads_hl7/hl7_data\\hl7_data1', '../uploads_hl7/hl7_data\\hl7_data10', '../uploads_hl7/hl7_data\\hl7_data11', '../uploads_hl7/hl7_data\\hl7_data12', '../uploads_hl7/hl7_data\\hl7_data2', '../uploads_hl7/hl7_data\\hl7_data3', '../uploads_hl7/hl7_data\\hl7_data4', '../uploads_hl7/hl7_data\\hl7_data5', '../uploads_hl7/hl7_data\\hl7_data6', '../uploads_hl7/hl7_data\\hl7_data7', '../uploads_hl7/hl7_data\\hl7_data8', '../uploads_hl7/hl7_data\\hl7_data9']
Loaded file: ../uploads_hl7/hl7_data\hl7_data1
MSH|^~\&|LAB_APP|LAB_FACILITY|EMR_APP|EMR_FACILITY|202301151200||ORU^R01|MSG0001|P|2.5.1
PID|1||PAT1^^^HOSPIT~PAT456^^^CLINIC||DOE1^JOHN||19900510|M|||123 MAIN ST^^ANYTOWN^CA^12345||555-123-4567|||M
O...
--------------------------------------------------
Loaded file: ../uploads_hl7/hl7_data\hl7_data10
MSH|^~\&|LAB_APP|LAB_FACILITY|EMR_APP|EMR_FACILITY|202301151202||ORU^R01|MSG0003|P|2.5.1
PID|1||PAT003^^^HOSPIT~PAT003A^^^CLINIC||WILLIAMS^DAVID||19751130|M|||303 MAPLE AVE^^CHICAGO^IL^60601

---

## Step 4: HL7 Message Parsing and Medical Code Extraction

### Extract Medical Codes from HL7 Messages

This cell defines one extraction function per medical coding system, plus a helper for patient identifiers:

#### Functions Included:

1. **`extract_patient_id()`**: Extracts patient identifiers from PID segments
2. **`extract_loinc_or_labtest_details()`**: Extracts LOINC codes from laboratory test results (OBX segments)
3. **`extract_cpt_codes()`**: Extracts CPT procedure codes from PR1 and OBR segments
4. **`extract_icd10cm_codes()`**: Extracts ICD-10-CM diagnosis codes from DG1 and OBX segments
5. **`extract_snomedct_codes()`**: Extracts SNOMED CT codes from OBX, OBR, and PR1 segments

#### Processing Workflow:

- **Input**: Raw HL7 messages from `hl7_data_dict`
- **Processing**: Each HL7 message is parsed and analyzed for medical codes
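- **Priority**: Extractors run in the order LOINC → CPT → ICD-10-CM → SNOMED CT; the first one that returns any codes is kept for that message and the remaining extractors are skipped
- **Output**: `trial_patient_dict`, a list with one entry per message that yielded codes, each entry being a list of extracted code records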
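
The priority rule can be sketched without any HL7 parsing. In the example below, `first_non_empty` and the two stand-in extractors are hypothetical; they only mimic the "first extractor with results wins" behaviour of the loop at the end of this cell.

```python
def first_non_empty(message, extractors):
    """Return the result of the first extractor that yields any codes."""
    for extract in extractors:
        results = extract(message)
        if results:
            return results
    return []

# Hypothetical stand-in extractors; the real ones parse HL7 segments.
def fake_loinc(message):
    return [{"code": "625-4", "code_system": "loinc"}] if "LN" in message else []

def fake_icd10cm(message):
    return [{"code": "I10", "code_system": "icd10cm"}] if "DG1" in message else []

extractors = [fake_loinc, fake_icd10cm]
both = first_non_empty("OBX|LN ... DG1|I10", extractors)
only_dx = first_non_empty("DG1|I10", extractors)
print(both)
print(only_dx)
```

A consequence of this design is that a message carrying both a LOINC result and a diagnosis contributes only its LOINC codes; collecting from every extractor would require accumulating results instead of returning early.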

#### Data Structure:
Each extracted code includes:
- `code`: The medical code value
- `code_system`: The coding system (loinc, cpt, icd10cm, snomedct)
- `description`: Human-readable description
- `patient_id`: Associated patient identifier
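
To make the record shape concrete, the following sketch builds the same four-key dictionaries from an OBX-3 field using plain string splitting. It is a simplified stand-in for the `hl7apy`-based functions in this cell, not a replacement: it ignores escape sequences and repetitions, and the sample message and the helper names are made up for illustration.

```python
# Maps the HL7 coding-system tag to the label used in the records.
SYSTEM_LABELS = {"LN": "loinc"}

def parse_coded_element(field):
    """Split a CE-style field 'code^text^system' into its first three components."""
    parts = field.split("^") + ["", "", ""]
    return {"code": parts[0], "description": parts[1], "coding_system": parts[2]}

def extract_obx_codes(hl7_text, patient_id):
    """Collect OBX-3 coded elements from a message using plain string splitting."""
    records = []
    for line in hl7_text.replace("\r", "\n").split("\n"):
        fields = line.split("|")
        if fields[0] == "OBX" and len(fields) > 3:
            element = parse_coded_element(fields[3])
            system = element["coding_system"]
            records.append({
                "code": element["code"],
                "code_system": SYSTEM_LABELS.get(system, system.lower()),
                "description": element["description"],
                "patient_id": patient_id,
            })
    return records

# Made-up message for illustration only.
sample = "PID|1||PAT1^^^HOSPIT\rOBX|1|NM|625-4^Bacteria identified in Stool by Culture^LN||Positive"
records = extract_obx_codes(sample, "PAT1")
print(records)
```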

In [5]:
# Parse HL7 message and extract codes from segments


def extract_patient_id(hl7_text):
    """Extracts the patient ID from the PID segment (PID-3.1) of the HL7 message."""
    try:
        message = parse_message(hl7_text.replace('\n', '\r'), find_groups=False)
        for segment in message.children:
            if segment.name == 'PID':
                # PID-3 may be a repeating field, take the first occurrence
                if hasattr(segment, 'pid_3'):
                    pid_3 = segment.pid_3
                    # If pid_3 is a list (repeating field), take the first
                    if isinstance(pid_3, list) and len(pid_3) > 0:
                        pid_3 = pid_3[0]
                    if hasattr(pid_3, 'cx_1'):
                        return pid_3.cx_1.value
                    elif hasattr(pid_3, 'value'):
                        return pid_3.value
    except Exception:
        pass
    return None

def extract_loinc_or_labtest_details(hl7_text):
    """Extract LOINC-coded observations (OBX-3 with coding system 'LN')."""
    message = parse_message(hl7_text.replace('\n', '\r'), find_groups=False)
    patient_id = extract_patient_id(hl7_text)
    loinc_results = []
    for obx in message.children:
        if obx.name == 'OBX' and hasattr(obx.obx_3, 'ce_3') and obx.obx_3.ce_3.value in ['LN']:
            try:
                loinc_code = obx.obx_3.ce_1.value if hasattr(obx.obx_3, 'ce_1') else None
                lab_test_name = obx.obx_3.ce_2.value if hasattr(obx.obx_3, 'ce_2') else None
                loinc_results.append({
                    'code': loinc_code,
                    'code_system': 'loinc',
                    'description': lab_test_name,
                    'patient_id': patient_id
                })
            except Exception:
                continue
    return loinc_results


def extract_cpt_codes(hl7_text):
    """Extract CPT codes from PR1-3, plus 5- or 7-digit numeric codes from OBR-4."""
    message = parse_message(hl7_text.replace('\n', '\r'), find_groups=False)
    patient_id = extract_patient_id(hl7_text)
    cpt_results = []
    for segment in message.children:
        if segment.name == 'PR1':
            try:
                cpt_code = segment.pr1_3.ce_1.value if hasattr(segment.pr1_3, 'ce_1') else None
                cpt_desc = segment.pr1_3.ce_2.value if hasattr(segment.pr1_3, 'ce_2') else None
                if cpt_code or cpt_desc:
                    cpt_results.append({
                        'code': cpt_code,
                        'code_system': 'cpt',
                        'description': cpt_desc,
                        'patient_id': patient_id})
            except Exception:
                continue
        elif segment.name == 'OBR':
            try:
                code = segment.obr_4.ce_1.value if hasattr(segment.obr_4, 'ce_1') else None
                desc = segment.obr_4.ce_2.value if hasattr(segment.obr_4, 'ce_2') else None
                if code and code.isdigit() and (len(code) == 5 or len(code) == 7):
                    cpt_results.append({
                        'code': code, 
                        'code_system': 'cpt',
                        'description': desc, 
                        'patient_id': patient_id})
            except Exception:
                continue
    return cpt_results


def extract_icd10cm_codes(hl7_text):
    """Extract ICD-10-CM codes from DG1-3 and from OBX-3 tagged 'ICD-10-CM'."""
    message = parse_message(hl7_text.replace('\n', '\r'), find_groups=False)
    patient_id = extract_patient_id(hl7_text)
    icd_results = []
    for segment in message.children:
        if segment.name == 'DG1':
            try:
                icd_code = segment.dg1_3.ce_1.value if hasattr(segment.dg1_3, 'ce_1') else (segment.dg1_3.value if hasattr(segment, 'dg1_3') and hasattr(segment.dg1_3, 'value') else None)
                icd_desc = segment.dg1_4.value if hasattr(segment, 'dg1_4') and hasattr(segment.dg1_4, 'value') else None
                if icd_code or icd_desc:
                    icd_results.append({
                        'code': icd_code, 
                        'code_system': 'icd10cm',
                        'description': icd_desc, 
                        'patient_id': patient_id})
            except Exception:
                continue
        elif segment.name == 'OBX':
            try:
                if hasattr(segment.obx_3, 'ce_3') and segment.obx_3.ce_3.value == 'ICD-10-CM':
                    icd_code = segment.obx_3.ce_1.value if hasattr(segment.obx_3, 'ce_1') else None
                    icd_desc = segment.obx_3.ce_2.value if hasattr(segment.obx_3, 'ce_2') else None
                    if icd_code or icd_desc:
                        icd_results.append({
                            'code': icd_code, 
                            'code_system': 'icd10cm',
                            'description': icd_desc, 
                            'patient_id': patient_id})
            except Exception:
                continue
    return icd_results


def extract_snomedct_codes(hl7_text):
    """Extract SNOMED CT codes (coding system 'SCT' or 'SNOMEDCT') from OBX-3, OBR-4, and PR1-4."""
    message = parse_message(hl7_text.replace('\n', '\r'), find_groups=False)
    patient_id = extract_patient_id(hl7_text)
    snomed_results = []
    for segment in message.children:
        if segment.name in ['OBX', 'OBR', 'PR1']:
            try:
                if segment.name == 'OBX' and hasattr(segment.obx_3, 'ce_3') and segment.obx_3.ce_3.value in ['SCT', 'SNOMEDCT']:
                    code = segment.obx_3.ce_1.value if hasattr(segment.obx_3, 'ce_1') else None
                    desc = segment.obx_3.ce_2.value if hasattr(segment.obx_3, 'ce_2') else None
                    if code or desc:
                        snomed_results.append({
                            'code': code,
                            'code_system': 'snomedct',
                            'description': desc, 
                            'patient_id': patient_id})
                elif segment.name == 'OBR' and hasattr(segment.obr_4, 'ce_3') and segment.obr_4.ce_3.value in ['SCT', 'SNOMEDCT']:
                    code = segment.obr_4.ce_1.value if hasattr(segment.obr_4, 'ce_1') else None
                    desc = segment.obr_4.ce_2.value if hasattr(segment.obr_4, 'ce_2') else None
                    if code or desc:
                        snomed_results.append({
                            'code': code, 
                            'code_system' : 'snomedct',
                            'description': desc, 
                            'patient_id': patient_id})
                elif segment.name == 'PR1' and hasattr(segment.pr1_4, 'ce_3') and segment.pr1_4.ce_3.value in ['SCT', 'SNOMEDCT']:
                    code = segment.pr1_4.ce_1.value if hasattr(segment.pr1_4, 'ce_1') else None
                    desc = segment.pr1_4.ce_2.value if hasattr(segment.pr1_4, 'ce_2') else None
                    if code or desc:
                        snomed_results.append({
                            'code': code, 
                            'code_system' : 'snomedct',
                            'description': desc, 
                            'patient_id': patient_id})
            except Exception:
                continue
    return snomed_results

trial_patient_dict = []
for hl7_data in hl7_data_dict:
    results = extract_loinc_or_labtest_details(hl7_data)
    if results:
        trial_patient_dict.append(results)
        print(results)
        continue
        
    results = extract_cpt_codes(hl7_data)
    if results:
        trial_patient_dict.append(results)
        print(results)
        continue
        
    results = extract_icd10cm_codes(hl7_data)
    if results:
        trial_patient_dict.append(results)  
        print(results)
        continue

    results = extract_snomedct_codes(hl7_data)
    if results:
        trial_patient_dict.append(results)
        print(results)

[{'code': '17968-9', 'code_system': 'loinc', 'description': 'Bacteria6:Prid:Pt:Stool:Nom:Culture', 'patient_id': 'PAT1'}]
[{'code': 'I10', 'code_system': 'icd10cm', 'description': 'Essential (primary) hypertension', 'patient_id': 'PAT003'}]
[{'code': 'E78.5', 'code_system': 'icd10cm', 'description': 'Hyperlipidemia, unspecified', 'patient_id': 'PAT004'}]
[{'code': 'F41.1', 'code_system': 'icd10cm', 'description': 'Generalized anxiety disorder', 'patient_id': 'PAT005'}]
[{'code': 'I10', 'code_system': 'icd10cm', 'description': 'Essential (primary) hypertension', 'patient_id': 'PAT2'}]
[{'code': '45380', 'code_system': 'cpt', 'description': 'Colonoscopy; with removal of polyp, snare technique', 'patient_id': 'PAT3'}]
[{'code': '', 'code_system': 'cpt', 'description': 'Thoracic CT Scan with Contrast', 'patient_id': 'PAT4'}]
[{'code': '80146002', 'code_system': 'snomedct', 'description': 'POST APPENDECTOMY FOLLOW-UP EXAM', 'patient_id': 'PAT5'}]
[{'code': '625-4', 'code_system': 'loinc', '

---

## Step 5: Code Normalization with IMO API

### Normalize Missing or Incomplete Medical Codes

This cell enhances patient data by using IMO's normalization API to fill in missing codes:

#### Key Functions:

1. **`load_config()`**: Loads API credentials from `config.json`
2. **`get_auth0_token()`**: Authenticates with IMO's Auth0 service to obtain access token
3. **`get_imo_domain()`**: Maps code systems to IMO domain categories:
   - `loinc` → `lab` (laboratory)
   - `icd10cm` → `problem` (diagnosis)
   - `cpt` → `procedure` (procedures)
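
The mapping above can also be expressed as a lookup table, which makes the unmapped case explicit. This is a sketch under the assumption that only the three systems listed have a domain; `IMO_DOMAINS` and `imo_domain_for` are names invented for the example.

```python
# Code systems with a listed IMO domain; anything else has no mapping here.
IMO_DOMAINS = {"loinc": "lab", "icd10cm": "problem", "cpt": "procedure"}

def imo_domain_for(code_system):
    """Return the IMO domain for a code system, or None when no mapping is listed."""
    return IMO_DOMAINS.get(code_system.lower())

for system in ("loinc", "icd10cm", "cpt", "snomedct"):
    print(system, "->", imo_domain_for(system))
```

Because `snomedct` has no entry, a caller can check for `None` and skip normalization for those records rather than sending a request without a domain.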

#### Normalization Process:

- **Input**: Patient records with missing codes (empty `code` field)
- **API Call**: Sends description text to IMO's normalization endpoint
- **Enhancement**: Uses LLM-powered transformation and candidate selection
- **Output**: Updates `trial_patient_dict` with normalized codes
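
The selection and request-building steps can be illustrated without calling the API. The sketch below mirrors the payload fields used in this cell (`organization_id`, `client_request_id`, `preferences`, `requests`); the helper names `needs_normalization` and `build_normalize_payload` are assumptions for this example, and no network request is made.

```python
import json

def needs_normalization(record):
    """A record needs normalization when its code is empty or missing."""
    return not record.get("code")

def build_normalize_payload(record, domain, record_id="10001", client_request_id="123"):
    """Assemble a request body with the same fields the notebook cell sends."""
    return {
        "organization_id": "IMO",
        "client_request_id": client_request_id,
        "preferences": {
            "threshold": 0,
            "use_llm_transformation": True,
            "use_llm_select_candidate": True,
        },
        "requests": [
            {"record_id": record_id, "domain": domain, "input_term": record["description"]}
        ],
    }

# Record taken from the extraction output above (PAT4 has a description but no code).
record = {"code": "", "code_system": "cpt",
          "description": "Thoracic CT Scan with Contrast", "patient_id": "PAT4"}
if needs_normalization(record):
    payload = build_normalize_payload(record, domain="procedure")
    print(json.dumps(payload, indent=2))
```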

#### API Configuration:
- **Endpoint**: `https://api.imohealth.com/precision/normalize`
- **Authentication**: Bearer token from Auth0
- **Features**: LLM transformation and candidate selection enabled
- **Organization**: IMO
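
The cell reads the normalized code through a long chain of keys and indexes, which raises an exception if any level is missing. A more forgiving variant might look like the sketch below. The nested path is copied from `get_code_by_codesystem()`; the function name `first_mapped_code` and the hand-written response fragment are assumptions, and the real response may carry additional fields.

```python
def first_mapped_code(response_json, code_system):
    """Walk the nested response and return the first mapped code, or None."""
    try:
        item = response_json["requests"][0]["response"]["items"][0]
        return item["metadata"]["mappings"][code_system]["codes"][0]["code"]
    except (KeyError, IndexError, TypeError):
        return None

# Hand-written fragment that mirrors only the path used in this notebook.
example_response = {
    "requests": [
        {"response": {"items": [
            {"metadata": {"mappings": {"cpt": {"codes": [{"code": "71260"}]}}}}
        ]}}
    ]
}

print(first_mapped_code(example_response, "cpt"))
print(first_mapped_code(example_response, "loinc"))
```

Returning `None` lets the caller leave the record unchanged when no candidate is available, instead of aborting the whole loop.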

> **Prerequisites**: Ensure `config.json` contains valid IMO API credentials before running this cell.

In [7]:
# Normalize trial patient dictionary if it has bad codes or no codes
import requests
import json
import os

def load_config():
    """Load configuration from config.json file"""
    # When running from this notebook (in using-OMOP), config.json is one folder up
    if '__file__' in globals():
        config_path = os.path.join(os.path.dirname(__file__), 'config.json')
    else:
        config_path = os.path.join('..', 'config.json')
    
    with open(config_path, 'r') as f:
        return json.load(f)

def get_auth0_token(config):
    """Get access token from Auth0"""
    auth0_config = config['auth0']
    
    payload = {
        'client_id': auth0_config['client_id'],
        'client_secret': auth0_config['client_secret'],
        'audience': auth0_config['audience'],
        'grant_type': 'client_credentials'
    }
    
    headers = {
        'content-type': 'application/json'
    }
    
    token_url = f"https://{auth0_config['domain']}/oauth/token"
    
    response = requests.post(token_url, json=payload, headers=headers)
    
    if response.status_code == 200:
        token_data = response.json()
        return token_data['access_token']
    else:
        raise Exception(f"Failed to get token: {response.status_code} - {response.text}")

# Load configuration and get token
config = load_config()
Token = get_auth0_token(config)

url = "https://api.imohealth.com/precision/normalize"

def get_imo_domain(code_system):
    if code_system == 'loinc':
        return 'lab'
    if code_system == 'icd10cm':
        return 'problem'
    if code_system == 'cpt':
        return 'procedure'

def get_code_by_codesystem(code_system, response_json):
    if code_system == 'cpt':
        return response_json["requests"][0]["response"]["items"][0]["metadata"]["mappings"]["cpt"]["codes"][0]["code"]
    elif code_system == 'loinc':
        return response_json["requests"][0]["response"]["items"][0]["metadata"]["mappings"]["loinc"]["codes"][0]["code"]
    elif code_system == 'icd10cm':
        print("ICD10CM Code Found")
        return response_json["requests"][0]["response"]["items"][0]["metadata"]["mappings"]["icd10cm"]["codes"][0]["code"]

# POST Normalization patient_trial_dict
for data in trial_patient_dict:
    if data[0]['code'] == '':
        domain = get_imo_domain(data[0]['code_system'])
        
        payload = json.dumps({
          "organization_id": "IMO",
          "client_request_id": "123",
          "preferences": {
            "threshold": 0,
            "use_llm_transformation": True,
            "use_llm_select_candidate": True
          },
          "requests": [
            {
              "record_id": "10001",
              "domain": domain,
              "input_term": data[0]['description']
              
            }
          ]
        })
        headers = {
          'Content-Type': 'application/json',
          'Authorization': f'Bearer {Token}'
        }

        response = requests.request("POST", url, headers=headers, data=payload)
        # Parse the body only after confirming success; error responses may not be JSON
        if response.status_code == 200:
            response_json = response.json()
            data[0]['code'] = get_code_by_codesystem(data[0]['code_system'], response_json)

# POST Normalization patient_trial_dict
for data in trial_patient_dict:
    print(data)

ICD10CM Code Found
[{'code': '17968-9', 'code_system': 'loinc', 'description': 'Bacteria6:Prid:Pt:Stool:Nom:Culture', 'patient_id': 'PAT1'}]
[{'code': 'I10', 'code_system': 'icd10cm', 'description': 'Essential (primary) hypertension', 'patient_id': 'PAT003'}]
[{'code': 'E78.5', 'code_system': 'icd10cm', 'description': 'Hyperlipidemia, unspecified', 'patient_id': 'PAT004'}]
[{'code': 'F41.1', 'code_system': 'icd10cm', 'description': 'Generalized anxiety disorder', 'patient_id': 'PAT005'}]
[{'code': 'I10', 'code_system': 'icd10cm', 'description': 'Essential (primary) hypertension', 'patient_id': 'PAT2'}]
[{'code': '45380', 'code_system': 'cpt', 'description': 'Colonoscopy; with removal of polyp, snare technique', 'patient_id': 'PAT3'}]
[{'code': '71260', 'code_system': 'cpt', 'description': 'Thoracic CT Scan with Contrast', 'patient_id': 'PAT4'}]
[{'code': '80146002', 'code_system': 'snomedct', 'description': 'POST APPENDECTOMY FOLLOW-UP EXAM', 'patient_id': 'PAT5'}]
[{'code': '625-4', '

---

## Step 6: Cohort Criteria Dictionary Search and Download

### Interactive Valueset Search and Code Retrieval

This cell provides a command-line interface for searching and downloading Cohort criteria valuesets:

#### Core Functions:

1. **`search_valuesets()`**: Searches IMO's FHIR ValueSet repository using text queries
2. **`get_valueset_codes_paged()`**: Downloads complete valueset codes with pagination handling
3. **`save_codes_to_csv()`**: Exports downloaded codes to CSV format for analysis
4. **`parse_valuesets_from_search()`**: Formats search results into structured data
5. **`display_valuesets_table()`**: Intended to show search results in a formatted table; it is not defined in this cell and its call is commented out

#### Interactive Workflow:

1. **Search Query**: Prompts user to enter valueset search term
2. **Results Display**: Lists the ID and name of each matching valueset (scope and criteria are parsed but not printed)
3. **Selection**: User selects a specific valueset ID to download
4. **Download**: Retrieves all codes with progress tracking and pagination
5. **Export**: Saves codes to CSV file named `{valueset_id}_codes.csv`
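
The export step can be sketched in memory. The variant below collects column names from every entry rather than only the first, so entries with differing keys do not raise an error; that is a design choice of this example, not the behaviour of `save_codes_to_csv()`. The function name `codes_to_csv_text` and the sample rows are made up.

```python
import csv
import io

def codes_to_csv_text(codes):
    """Serialize code entries to CSV text, using the union of all keys as columns."""
    fieldnames = []
    for entry in codes:
        for key in entry:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(codes)
    return buffer.getvalue()

# Made-up entries; the second one lacks a "display" value.
rows = [
    {"system": "http://example.org/codes", "code": "A1", "display": "Concept A"},
    {"system": "http://example.org/codes", "code": "B2"},
]
csv_text = codes_to_csv_text(rows)
print(csv_text)
```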

#### Technical Features:

- **Pagination Support**: Handles large valuesets with automatic page navigation
- **Progress Tracking**: Shows download progress for each page
- **Error Handling**: Comprehensive error management for API failures
- **Global Variables**: Stores results for use in subsequent matching steps
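
The pagination logic can be exercised offline by swapping the HTTP call for a function that serves pages from a list. In the sketch below, `collect_all_pages`, `fake_fetch`, and the 120 made-up codes are all assumptions for illustration; only the `expansion`, `total`, and `contains` keys come from the cell's own handling of the response.

```python
def collect_all_pages(fetch_page, page_size=50, max_pages=100):
    """Call fetch_page(page, page_size) until the expansion is exhausted."""
    all_codes = []
    for page in range(1, max_pages + 1):
        expansion = fetch_page(page, page_size).get("expansion", {})
        codes = expansion.get("contains", [])
        if not codes:
            break
        all_codes.extend(codes)
        total = expansion.get("total")
        # Stop on the reported total (when present) or on a short final page.
        if (total is not None and len(all_codes) >= total) or len(codes) < page_size:
            break
    return all_codes

# Hypothetical in-memory "server" holding 120 made-up codes.
FAKE_CODES = [
    {"system": "http://example.org/codes", "code": f"C{i:03d}", "display": f"Concept {i}"}
    for i in range(120)
]

def fake_fetch(page, page_size):
    start = (page - 1) * page_size
    return {"expansion": {"total": len(FAKE_CODES),
                          "contains": FAKE_CODES[start:start + page_size]}}

codes = collect_all_pages(fake_fetch)
print(len(codes))
```

Passing the fetch function as a parameter keeps the stopping rules testable without credentials or network access.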

#### API Endpoints:
- **Search**: `https://api.imohealth.com/fhir/r6/ValueSet/search`
- **Retrieve**: `https://api.imohealth.com/fhir/r6/ValueSet/{id}`

In [8]:
# Cohort Dictionary Search and Valueset Selection - Command Line Interface
import pandas as pd
import csv
import os
import requests
from IPython.display import clear_output

def search_valuesets(search_term, token):
    """Search for valuesets using the IMO API"""
    url = "https://api.imohealth.com/fhir/r6/ValueSet/search"
    
    headers = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {token}'
    }
    params = {"searchText": search_term}

    response = requests.get(url, headers=headers, params=params)
    
    if response.status_code == 200:
        return response.json()
    else:
        raise Exception(f"Failed to search valuesets: {response.status_code} - {response.text}")

def get_valueset_codes_paged(valueset_id, token):
    """Get all codes from a valueset with pagination"""
    all_codes = []
    page = 1
    page_size = 50
    max_pages = 100  # Safety limit to prevent infinite loops
    
    while page <= max_pages:
        url = f"https://api.imohealth.com/fhir/r6/ValueSet/{valueset_id}"
        
        headers = {
            'Authorization': f'Bearer {token}'
        }
        
        params = {
            'page': page,
            'pageSize': page_size
        }
        
        response = requests.get(url, headers=headers, params=params)
        
        if response.status_code != 200:
            raise Exception(f"Failed to get valueset codes: {response.status_code} - {response.text}")
        
        data = response.json()
        expansion = data.get('expansion', {})
        total = expansion.get('total', 0)   
        codes = expansion.get('contains', [])

        # If no codes returned, we've reached the end
        if not codes:
            break
            
        all_codes.extend(codes)
        print(f"Retrieved page {page}, codes on this page: {len(codes)}, total codes so far: {len(all_codes)}")
        
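        # Stop when the reported total is reached (if the server reports one)
        # or when a short page signals the last page. Without the `total` guard,
        # a missing total defaults to 0 and the loop would stop after page 1.
        if (total and len(all_codes) >= total) or len(codes) < page_size: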
            print(f"Pagination complete. Total codes retrieved: {len(all_codes)}")
            break
            
        page += 1
    
    if page > max_pages:
        print(f"⚠️ Reached maximum page limit ({max_pages}). Retrieved {len(all_codes)} codes.")
    
    return all_codes

def save_codes_to_csv(codes, filename):
    """Save codes to CSV file in the PythonNotebook folder"""
    filepath = os.path.join(os.path.dirname(__file__) if '__file__' in globals() else '.', filename)
    
    with open(filepath, 'w', newline='', encoding='utf-8') as csvfile:
        if codes:
            fieldnames = codes[0].keys()
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(codes)
    
    return filepath

def parse_valuesets_from_search(search_results):
    """Parse and format valueset data from search results"""
    valuesets = []
    
    if "entry" in search_results:
        for each_item in search_results.get("entry", []):
            resource = each_item.get("resource", {})
            description = resource.get("description", "")
            publisher = resource.get("publisher", "")
            if publisher == "IMO":
                continue
            
            # Extract inclusion and exclusion criteria from description or compose
            inclusion_criteria = "N/A"
            exclusion_criteria = "N/A"
            # Try to get scope information for better criteria display
            scope = resource.get("scope", {})
            if scope:
                inclusion_criteria = scope.get("inclusionCriteria", [])
                exclusion_criteria = scope.get("exclusionCriteria", [])
                
                
            value_set_info = {
                "Id": resource.get("id"),
                "Name": resource.get("title", resource.get("name", "N/A")),
                "Scope": description[:100] + "..." if len(description) > 100 else description,
                "Inclusion Criteria": inclusion_criteria,
                "Exclusion Criteria": exclusion_criteria
            }
            
            valuesets.append(value_set_info)
    
    return valuesets


# Main workflow
print("🔍 Cohort Identification - Dictionary Search and Valueset Selection")
print("=" * 60)

# Step 1: Prompt user for valueset name
search_term = input("Enter the name of the valueset to search for: ").strip()

if not search_term:
    print("❌ No search term provided. Exiting.")
else:
    try:
        # Step 2: Search and display results
        print(f"\n🔍 Searching for valuesets matching '{search_term}'...")
        search_results = search_valuesets(search_term, Token)
        valuesets = parse_valuesets_from_search(search_results)
        
        if valuesets:
            #display_valuesets_table(valuesets)
            
            # Step 3: Prompt user to select valueset ID
            print(f"\nAvailable valueset IDs:")
            for vs in valuesets:
                print(f"  - {vs['Id']}: {vs['Name']}")
            
            selected_id = input("\nEnter the ID of the valueset you would like to download: ").strip()
            
            # Find the selected valueset
            selected_valueset = None
            for vs in valuesets:
                if vs['Id'] == selected_id:
                    selected_valueset = vs
                    break
            
            if selected_valueset:
                print(f"\n📋 Selected valueset: {selected_valueset['Name']} (ID: {selected_id})")

                # Step 4: Download the valueset
                try:
                    print(f"💾 Downloading codes for valueset '{selected_valueset['Name']}'...")
                    codes = get_valueset_codes_paged(selected_id, Token)

                    if codes:
                        filename = f"{selected_id}_codes.csv"
                        filepath = save_codes_to_csv(codes, filename)

                        print(f"✅ Successfully downloaded {len(codes)} codes!")
                        print(f"📁 File saved as: {filename}")

                        # Store global variables for next steps
                        globals()['selected_valueset_info'] = selected_valueset
                        globals()['downloaded_codes_file'] = filename
                        globals()['downloaded_codes_count'] = len(codes)

                        print(f"\n🎉 Download complete! You can now proceed to the cohort matching step.")

                    else:
                        print("⚠️ No codes found for this valueset.")

                except Exception as e:
                    print(f"❌ Error downloading codes: {str(e)}")
            else:
                print(f"❌ Invalid valueset ID '{selected_id}'. Please check the available IDs above.")

        else:
            print("❌ No valuesets found for your search term.")

    except Exception as e:
        print(f"❌ Error searching valuesets: {str(e)}")

🔍 Cohort Identification - Dictionary Search and Valueset Selection
❌ No search term provided. Exiting.


---

## Step 7: Patient-Trial Matching Analysis

### Match Patient Codes Against Cohort Eligibility Criteria

This cell performs comprehensive patient matching against the downloaded trial dictionary:

#### Core Functions:

1. **`find_latest_downloaded_csv()`**: Locates the most recent trial dictionary CSV file
2. **`load_trial_dictionary()`**: Loads and validates the trial dictionary data
3. **`match_patient_codes()`**: Performs code-level matching between patients and trial criteria
4. **`display_matching_summary()`**: Generates detailed match statistics and reports

#### Matching Process:

1. **Dictionary Loading**: Automatically finds and loads the latest downloaded trial CSV
2. **Code Alignment**: Maps patient codes to trial dictionary codes with case-insensitive matching
3. **Match Detection**: Identifies patients whose codes appear in trial eligibility criteria
4. **Result Classification**: Separates patients into matched and unmatched groups
5. **Statistical Analysis**: Calculates match percentages and code system breakdowns
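
The core of the matching process is a normalized set lookup. The sketch below assumes the same nested list-of-lists shape as `trial_patient_dict`; the function name `split_by_match` and the cohort code list are made up, while the two patient records are taken from the extraction output above.

```python
def split_by_match(patient_groups, cohort_codes):
    """Split patient records into (matched, unmatched) using case-insensitive codes."""
    normalized = {str(code).strip().upper() for code in cohort_codes}
    matched, unmatched = [], []
    for group in patient_groups:
        for record in group:
            code = str(record.get("code", "")).strip().upper()
            # Empty codes never count as a match.
            if code and code in normalized:
                matched.append(record)
            else:
                unmatched.append(record)
    return matched, unmatched

patients = [
    [{"code": "I10", "code_system": "icd10cm", "patient_id": "PAT2"}],
    [{"code": "45380", "code_system": "cpt", "patient_id": "PAT3"}],
]
# Hypothetical cohort codes, with deliberate case and whitespace differences.
matched, unmatched = split_by_match(patients, ["i10 ", "E11.9"])
print([r["patient_id"] for r in matched], [r["patient_id"] for r in unmatched])
```

Using a set keeps each lookup constant-time, which matters once a valueset holds thousands of codes.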

#### Output Analysis:

- **Matched Patients**: Complete details of patients meeting trial criteria
- **Unmatched Patients**: Patients not matching current trial eligibility
- **Match Statistics**: Percentage calculations and demographic breakdowns
- **Code System Analysis**: Breakdown by medical coding systems (ICD-10, CPT, LOINC, SNOMED)
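
A minimal version of the statistics step might compute a match percentage and a per-system count as shown below. The function name `match_statistics`, the returned keys, and the sample records are assumptions for this sketch and may differ from what the notebook's own summary function reports.

```python
from collections import Counter

def match_statistics(matched, unmatched):
    """Summarize matched vs. unmatched records and count matches per code system."""
    total = len(matched) + len(unmatched)
    percentage = round(100.0 * len(matched) / total, 1) if total else 0.0
    by_system = Counter(record.get("code_system", "Unknown") for record in matched)
    return {
        "total_records": total,
        "matched_records": len(matched),
        "match_percentage": percentage,
        "matched_by_code_system": dict(by_system),
    }

# Made-up inputs: two matched and two unmatched records.
sample_matched = [{"code_system": "icd10cm"}, {"code_system": "cpt"}]
sample_unmatched = [{"code_system": "loinc"}, {"code_system": "snomedct"}]
stats = match_statistics(sample_matched, sample_unmatched)
print(stats)
```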

#### Data Quality Features:

- **Flexible Column Detection**: Automatically identifies code columns in trial dictionary
- **Case-Insensitive Matching**: Robust string comparison for reliable matching
- **Multiple Match Support**: Handles cases where patient codes match multiple trial entries
- **Progress Reporting**: Real-time feedback during matching process

#### Global Variables Created:
- `matching_results`: Complete results dictionary for further analysis
- Contains matched/unmatched patient lists and trial dictionary metadata

In [9]:
# Patient Code Matching Against Downloaded Trial Dictionary
import pandas as pd
import os
import glob

def find_latest_downloaded_csv():
    """Find the most recently downloaded CSV file"""
    csv_files = glob.glob("*_codes.csv")
    if not csv_files:
        return None
    
    # Get the most recently modified CSV file
    latest_file = max(csv_files, key=os.path.getmtime)
    return latest_file

def load_trial_dictionary(csv_file):
    """Load the trial dictionary from CSV file"""
    try:
        df = pd.read_csv(csv_file)
        print(f"📋 Loaded trial dictionary from: {csv_file}")
        print(f"   Total codes in dictionary: {len(df)}")
        
        # Display column names to understand the structure
        print(f"   Columns: {list(df.columns)}")
        
        # Show first few entries
        if len(df) > 0:
            print(f"   Sample entries:")
            print(df.head(3).to_string(index=False))
        
        return df
    except Exception as e:
        print(f"❌ Error loading CSV file: {e}")
        return None

def match_patient_codes(trial_patient_dict, trial_df):
    """Match patient codes against the trial dictionary.

    Always returns a (matched_patients, unmatched_patients) tuple so the
    caller can unpack the result even when matching cannot run.
    """
    if trial_df is None or len(trial_df) == 0:
        print("❌ No trial dictionary data available for matching")
        return [], []
    
    # Try to identify the code column in the trial dictionary
    code_columns = ['code', 'Code', 'CODE', 'system', 'System']
    code_col = None
    
    for col in code_columns:
        if col in trial_df.columns:
            code_col = col
            break
    
    if code_col is None:
        print(f"❌ Could not identify code column. Available columns: {list(trial_df.columns)}")
        return [], []
    
    print(f"✅ Using '{code_col}' column for code matching")
    
    # Convert trial dictionary codes to set for faster lookup
    trial_codes = set(trial_df[code_col].astype(str).str.strip().str.upper())
    
    matched_patients = []
    unmatched_patients = []
    
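    # Count individual records across all patient groups, not the number of groups
    print(f"\n🔍 Matching {sum(len(group) for group in trial_patient_dict)} patient records against trial dictionary...")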
    print("=" * 80)
    
    for patient_group in trial_patient_dict:
        for patient_record in patient_group:
            patient_code = str(patient_record.get('code', '')).strip().upper()
            patient_id = patient_record.get('patient_id', 'Unknown')
            code_system = patient_record.get('code_system', 'Unknown')
            description = patient_record.get('description', 'N/A')
            
            if patient_code and patient_code in trial_codes:
                # Find the matching trial dictionary entry
                matching_entries = trial_df[trial_df[code_col].astype(str).str.strip().str.upper() == patient_code]
                
                match_info = {
                    'patient_id': patient_id,
                    'patient_code': patient_record.get('code', ''),
                    'code_system': code_system,
                    'patient_description': description,
                    'trial_matches': matching_entries.to_dict('records')
                }
                matched_patients.append(match_info)
                
                print(f"✅ MATCH FOUND!")
                print(f"   Patient ID: {patient_id}")
                print(f"   Code: {patient_record.get('code', '')} ({code_system})")
                print(f"   Description: {description}")
                print(f"   Trial Dictionary Matches: {len(matching_entries)}")
                
                # Show trial dictionary match details
                for idx, match in matching_entries.iterrows():
                    trial_desc = match.get('display', match.get('Display', match.get('description', 'N/A')))
                    print(f"     - Trial Entry: {trial_desc}")
                print("-" * 60)
                
            else:
                unmatched_info = {
                    'patient_id': patient_id,
                    'patient_code': patient_record.get('code', ''),
                    'code_system': code_system,
                    'patient_description': description
                }
                unmatched_patients.append(unmatched_info)
    
    return matched_patients, unmatched_patients

def display_matching_summary(matched_patients, unmatched_patients):
    """Display a summary of matching results"""
    total_patients = len(matched_patients) + len(unmatched_patients)
    
    print(f"\n📊 MATCHING SUMMARY")
    print("=" * 50)
    print(f"Total patients analyzed: {total_patients}")
    print(f"✅ Patients matching trial criteria: {len(matched_patients)}")
    print(f"❌ Patients NOT matching trial criteria: {len(unmatched_patients)}")
    
    if total_patients > 0:
        match_percentage = (len(matched_patients) / total_patients) * 100
        print(f"📈 Match percentage: {match_percentage:.1f}%")
    
    # Group by code system
    if matched_patients:
        print(f"\n🎯 Matched patients by code system:")
        code_systems = {}
        for patient in matched_patients:
            cs = patient['code_system']
            if cs not in code_systems:
                code_systems[cs] = 0
            code_systems[cs] += 1
        
        for cs, count in code_systems.items():
            print(f"   {cs}: {count} patients")
    
    if unmatched_patients:
        print(f"\n⚠️ Unmatched patients by code system:")
        code_systems = {}
        for patient in unmatched_patients:
            cs = patient['code_system']
            if cs not in code_systems:
                code_systems[cs] = 0
            code_systems[cs] += 1
        
        for cs, count in code_systems.items():
            print(f"   {cs}: {count} patients")

# Main execution
print("🔍 Patient Code Matching Against Trial Dictionary")
print("=" * 60)

# Check if we have patient data
if 'trial_patient_dict' not in globals() or not trial_patient_dict:
    print("❌ No patient data found. Please run the HL7 extraction cells first.")
else:
    # Find the latest downloaded CSV file
    csv_file = find_latest_downloaded_csv()
    
    if csv_file:
        print(f"📁 Found downloaded trial dictionary: {csv_file}")
        
        # Load the trial dictionary
        trial_df = load_trial_dictionary(csv_file)
        
        if trial_df is not None:
            # Perform matching
            matched_patients, unmatched_patients = match_patient_codes(trial_patient_dict, trial_df)
            
            # Display summary
            display_matching_summary(matched_patients, unmatched_patients)
            
            # Store results for potential further analysis
            matching_results = {
                'matched_patients': matched_patients,
                'unmatched_patients': unmatched_patients,
                'trial_dictionary_file': csv_file,
                'total_trial_codes': len(trial_df)
            }
            
            print(f"\n💾 Results stored in 'matching_results' variable for further analysis")
    else:
        print("❌ No downloaded CSV files found. Please run the cohort dictionary search cell first to download a trial dictionary.")

🔍 Patient Code Matching Against Trial Dictionary
❌ No downloaded CSV files found. Please run the cohort dictionary search cell first to download a trial dictionary.
