<img src="../static/imo_health.png" alt="IMO Health Logo" width="300"/>

---

# RWE Cohort Identification - Applying Eligibility Criteria

This notebook is the **third step** in the RWE Cohort Identification pipeline. It searches for and downloads clinical trial eligibility criteria valuesets, then matches OMOP patient data against these criteria to identify eligible cohorts.

## Workflow Overview

This notebook performs cohort matching based on inclusion and exclusion criteria:

1. **API Authentication**: Authenticate with IMO's FHIR API services
2. **Valueset Search & Download**: Search for and download cohort eligibility criteria valuesets (both inclusion and exclusion)
3. **OMOP Data Loading**: Load patient data in OMOP CDM format from the previous step
4. **Inclusion Criteria Matching**: Match patients against ALL inclusion criteria lists
5. **Exclusion Criteria Filtering**: Remove patients matching ANY exclusion criteria
6. **Cohort Summary**: Generate detailed statistics and export final matching results

## Prerequisites

- **Previous Steps Completed**:
  - Step 1: NLP extraction completed (`patient_output.xlsx` generated)
  - Step 2: OMOP conversion completed (`Output/OMOP_CSV/*.csv` files created)
- IMO API credentials configured in `config.json`
- Required Python packages: `pandas`, `requests`

## Example Use Case

**Study**: Extract adult patients with Rheumatoid Arthritis on methotrexate

**Inclusion Criteria**:
- Age ≥ 18 at index date
- ≥ 2 encounters with ICD-10: M05*, M06* (Rheumatoid arthritis)
- At least 1 methotrexate prescription

**Exclusion Criteria**:
- Any diagnosis of juvenile idiopathic arthritis (M08*)
- Missing gender or birthdate
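
The diagnosis and medication criteria reduce to valueset lookups, but the encounter-count and missing-demographics criteria need a little extra logic. The following is a minimal sketch with pandas, not part of the notebook's own pipeline. It assumes the OMOP export carries `visit_occurrence_id` in CONDITION_OCCURRENCE and `gender_source_value` / `year_of_birth` in PERSON (standard OMOP CDM columns, though nothing here confirms they are populated in this dataset). Both function names are hypothetical.

```python
import pandas as pd

def patients_with_min_encounters(condition_df, prefixes=("M05", "M06"), min_encounters=2):
    """Return person_ids with at least `min_encounters` distinct visits carrying a matching code."""
    codes = condition_df["condition_source_value"].astype(str).str.strip().str.upper()
    # str.startswith accepts a tuple of prefixes, which covers M05* and M06*
    is_hit = codes.apply(lambda code: code.startswith(tuple(prefixes)))
    hits = condition_df[is_hit]
    visits = hits.groupby("person_id")["visit_occurrence_id"].nunique()
    return set(visits[visits >= min_encounters].index)

def patients_missing_demographics(person_df):
    """Return person_ids whose gender or year of birth is missing (assumed column names)."""
    gender = person_df["gender_source_value"]
    missing = (
        gender.isna()
        | (gender.astype(str).str.strip() == "")
        | person_df["year_of_birth"].isna()
    )
    return set(person_df.loc[missing, "person_id"])

# Illustrative data only
conditions = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 3],
    "condition_source_value": ["M06.9", "M05.79", "M06.9", "m06.9", "M08.0"],
    "visit_occurrence_id": [10, 11, 20, 30, 30],
})
persons = pd.DataFrame({
    "person_id": [1, 2, 3],
    "gender_source_value": ["F", None, "M"],
    "year_of_birth": [1970, 1985, None],
})

enough_encounters = patients_with_min_encounters(conditions)
missing_demographics = patients_missing_demographics(persons)
```

Counting distinct visits rather than rows keeps two diagnosis rows recorded at the same visit from satisfying the "≥ 2 encounters" rule.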

---

## Step 0: Package Installation and Setup


### Verify Required Packages

This cell verifies that all necessary Python packages are installed:

- **`pandas`**: Data manipulation and analysis library
- **`openpyxl`**: Reading/writing Excel files
- **`xlsxwriter`**: Creating formatted Excel files
- **`requests`**: HTTP library for API calls to IMO services

> **Note**: These packages should already be installed in your virtual environment.

In [1]:
import pandas as pd
import requests
import json
import os

# Install required packages (comment this out if they are already installed)
%pip install requests pandas xlsxwriter

Note: you may need to restart the kernel to use updated packages.





---

## Step 1: Load API Configuration

### Load IMO API Credentials

This cell loads the API credentials from `config.json` and obtains an authentication token:

- **Auth0 Authentication**: Authenticates with IMO's Auth0 service
- **Bearer Token**: Retrieves access token for API calls
- **Token Storage**: Stores token globally for use in subsequent steps

The configuration file should contain:
- `client_id`: Your IMO API client ID
- `client_secret`: Your IMO API client secret
- `audience`: API audience identifier
- `domain`: Auth0 domain

This token will be used for both NLP entity extraction and FHIR valueset searches.
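
Before requesting a token, it can help to confirm the configuration is complete. The sketch below assumes the four keys sit under a top-level `auth0` object, which is how `get_auth0_token` reads them in the next cell. The helper name `missing_auth0_keys` and the placeholder values are hypothetical.

```python
REQUIRED_AUTH0_KEYS = ("client_id", "client_secret", "audience", "domain")

def missing_auth0_keys(config):
    """Return the required keys that are absent or empty under config['auth0']."""
    auth0 = config.get("auth0") or {}
    return [key for key in REQUIRED_AUTH0_KEYS if not auth0.get(key)]

# Shape of config.json as the notebook's code reads it (placeholder values)
example_config = {
    "auth0": {
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "audience": "YOUR_API_AUDIENCE",
        "domain": "YOUR_AUTH0_DOMAIN",
    }
}

problems = missing_auth0_keys(example_config)
if problems:
    print(f"config.json is missing: {', '.join(problems)}")
```

Failing early with the names of the missing keys is easier to diagnose than a `KeyError` or an HTTP 401 from the token endpoint.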

In [3]:
import json

def load_config():
    """Load configuration from config.json file"""
    # When running from this notebook (in using-OMOP), config.json is one folder up
    if '__file__' in globals():
        config_path = os.path.join(os.path.dirname(__file__), 'config.json')
    else:
        config_path = os.path.join('..', 'config.json')
    
    with open(config_path, 'r') as f:
        return json.load(f)

def get_auth0_token(config):
    """Get access token from Auth0"""
    auth0_config = config['auth0']
    
    payload = {
        'client_id': auth0_config['client_id'],
        'client_secret': auth0_config['client_secret'],
        'audience': auth0_config['audience'],
        'grant_type': 'client_credentials'
    }
    
    headers = {'content-type': 'application/json'}
    token_url = f"https://{auth0_config['domain']}/oauth/token"
    
    response = requests.post(token_url, json=payload, headers=headers)
    
    if response.status_code == 200:
        token_data = response.json()
        return token_data['access_token']
    else:
        raise Exception(f"Failed to get token: {response.status_code} - {response.text}")

# Load configuration and get token
config = load_config()
Token = get_auth0_token(config)

print("✅ Successfully authenticated with IMO API")
print(f"   Token obtained: {Token[:20]}...")

✅ Successfully authenticated with IMO API
   Token obtained: eyJhbGciOiJSUzI1NiIs...


## Step 2: Search and Download Valuesets Created for the RWE Study

Scope: Extract adult patients (aged 18 or older) with Rheumatoid Arthritis on methotrexate

**Inclusion Criteria**:
- Age ≥ 18 at index date
- ≥ 2 encounters with ICD-10: M05*, M06* (Rheumatoid arthritis)
- At least 1 methotrexate prescription

**Exclusion Criteria**:
- Any diagnosis of juvenile idiopathic arthritis (M08*)
- Missing gender or birthdate



In [4]:
# Cohort Dictionary Search and Valueset Selection - Command Line Interface
import pandas as pd
import csv
import os
import requests
from IPython.display import clear_output

def search_valuesets(search_term, token):
    """Search for valuesets using the IMO API"""
    url = "https://api.imohealth.com/fhir/r6/ValueSet/search"
    
    headers = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {token}'
    }
    params = {"searchText": search_term}

    response = requests.get(url, headers=headers, params=params)
    
    if response.status_code == 200:
        return response.json()
    else:
        raise Exception(f"Failed to search valuesets: {response.status_code} - {response.text}")

def get_valueset_codes_paged(valueset_id, token):
    """Get all codes from a valueset with pagination"""
    all_codes = []
    page = 1
    page_size = 50
    max_pages = 100  # Safety limit to prevent infinite loops
    
    while page <= max_pages:
        url = f"https://api.imohealth.com/fhir/r6/ValueSet/{valueset_id}"
        
        headers = {
            'Authorization': f'Bearer {token}'
        }
        
        params = {
            'page': page,
            'pageSize': page_size
        }
        
        response = requests.get(url, headers=headers, params=params)
        
        if response.status_code != 200:
            raise Exception(f"Failed to get valueset codes: {response.status_code} - {response.text}")
        
        data = response.json()
        expansion = data.get('expansion', {})
        total = expansion.get('total', 0)   
        codes = expansion.get('contains', [])

        # If no codes returned, we've reached the end
        if not codes:
            break
            
        all_codes.extend(codes)
        print(f"Retrieved page {page}, codes on this page: {len(codes)}, total codes so far: {len(all_codes)}")
        
        # Check if we've retrieved all codes or if this page has fewer than page_size
        if len(all_codes) >= total or len(codes) < page_size:
            print(f"Pagination complete. Total codes retrieved: {len(all_codes)}")
            break
            
        page += 1
    
    if page > max_pages:
        print(f"⚠️ Reached maximum page limit ({max_pages}). Retrieved {len(all_codes)} codes.")
    
    return all_codes

def save_codes_to_csv(codes, filename):
    """Save codes to CSV file in the PythonNotebook folder"""
    filepath = os.path.join(os.path.dirname(__file__) if '__file__' in globals() else '.', filename)

    # Collect column names across all rows (first-seen order) so that a code
    # carrying an extra key does not make DictWriter raise a ValueError
    fieldnames = list(dict.fromkeys(key for code in codes for key in code))

    with open(filepath, 'w', newline='', encoding='utf-8') as csvfile:
        if codes:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(codes)

    return filepath

def parse_valuesets_from_search(search_results):
    """Parse and format valueset data from search results"""
    valuesets = []
    
    if "entry" in search_results:
        for each_item in search_results.get("entry", []):
            resource = each_item.get("resource", {})
            description = resource.get("description", "")
            publisher = resource.get("publisher", "")
            if publisher == "IMO":
                continue
            
            # Extract inclusion and exclusion criteria from description or compose
            inclusion_criteria = "N/A"
            exclusion_criteria = "N/A"
            # Try to get scope information for better criteria display
            scope = resource.get("scope", {})
            if scope:
                inclusion_criteria = scope.get("inclusionCriteria", [])
                exclusion_criteria = scope.get("exclusionCriteria", [])
                
                
            value_set_info = {
                "Id": resource.get("id"),
                "Name": resource.get("title", resource.get("name", "N/A")),
                "Scope": description[:100] + "..." if len(description) > 100 else description,
                "Inclusion Criteria": inclusion_criteria,
                "Exclusion Criteria": exclusion_criteria
            }
            
            valuesets.append(value_set_info)
    
    return valuesets


# Main workflow
print("🔍 Cohort Dictionary Search and Valueset Selection")
print("=" * 60)

# Step 1: Define the search terms for the inclusion and exclusion valuesets
print("Search for RWE Rheumatoid arthritis valuesets")
search_inclusion_criteria = ["RWE Rheumatoid arthritis-Condition To Include", "RWE Rheumatoid arthritis-Medications To Include"]
search_exclusion_criteria = ["RWE Rheumatoid arthritis-Condition To Exclude"]

for search_term in search_inclusion_criteria + search_exclusion_criteria:
    if not search_term:
        print("❌ Empty search term. Skipping.")
    else:
        # The filename prefix is how later steps tell inclusion lists from exclusion lists
        if search_term in search_inclusion_criteria:
            prefix = "Inclusion_"
        else:
            prefix = "Exclusion_"
        try:
            # Step 2: Search and list the results
            print(f"\n🔍 Searching for valuesets matching '{search_term}'...")
            search_results = search_valuesets(search_term, Token)
            valuesets = parse_valuesets_from_search(search_results)

            if valuesets:
                # Step 3: Every valueset returned by the search is selected and downloaded
                print(f"\nAvailable valueset IDs:")
                for vs in valuesets:
                    print(f"  - {vs['Id']}: {vs['Name']}")
                    selected_valueset = vs
                    selected_id = vs['Id']
                    print(f"\n📋 Selected valueset: {selected_valueset['Name']} (ID: {selected_id})")

                    # Step 4: Download the valueset
                    try:
                        print(f"💾 Downloading codes for valueset '{selected_valueset['Name']}'...")
                        codes = get_valueset_codes_paged(selected_id, Token)

                        if codes:
                            filename = prefix + f"{selected_id}_codes.csv"
                            filepath = save_codes_to_csv(codes, filename)

                            print(f"✅ Successfully downloaded {len(codes)} codes!")
                            print(f"📁 File saved as: {filename}")

                            # Keep details of the most recent download for later cells
                            selected_valueset_info = selected_valueset
                            downloaded_codes_file = filename
                            downloaded_codes_count = len(codes)

                            print(f"\n🎉 Download complete! You can now proceed to the patient matching step.")
                        else:
                            print("⚠️ No codes found for this valueset.")

                    except Exception as e:
                        print(f"❌ Error downloading codes: {str(e)}")

            else:
                print("❌ No valuesets found for your search term.")

        except Exception as e:
            print(f"❌ Error searching valuesets: {str(e)}")

🔍 Cohort Dictionary Search and Valueset Selection
Search for RWE Rheumatoid arthritis valuesets

🔍 Searching for valuesets matching 'RWE Rheumatoid arthritis-Condition To Include'...

Available valueset IDs:
  - 18950: RWE Rheumatoid arthritis-Conditions To Include updated

📋 Selected valueset: RWE Rheumatoid arthritis-Conditions To Include updated (ID: 18950)
💾 Downloading codes for valueset 'RWE Rheumatoid arthritis-Conditions To Include updated'...
Retrieved page 1, codes on this page: 50, total codes so far: 50
Retrieved page 2, codes on this page: 50, total codes so far: 100
Retrieved page 3, codes on this page: 50, total codes so far: 150
Retrieved page 4, codes on this page: 50, total codes so far: 200
Retrieved page 5, codes on this page: 50, total codes so far: 250
Retrieved page 6, codes on this page: 50, total codes so far: 300
Retrieved page 7, codes on this page: 50, total codes so far: 350
Retrieved page 8, codes on this page: 50, total codes so far: 400
Retri

---

## Step 3: Patient Cohort Matching Against Inclusion Criteria

### Match OMOP Patient Data Against Downloaded Eligibility Criteria

This section performs comprehensive cohort matching between OMOP patient data and inclusion criteria valuesets.

#### Core Functions:

1. **`find_latest_inclusion_csvs()`**: Locates all downloaded inclusion criteria CSV files
2. **`load_omop_data()`**: Loads OMOP clinical tables from CSV files
3. **`match_omop_codes()`**: Matches patients against ALL inclusion criteria lists

#### Matching Logic:

**Important**: A patient is only included in the final cohort if they match **at least one code** in **EVERY** inclusion criteria list.

For example, if you have:
- Inclusion List 1: Rheumatoid arthritis conditions (M05*, M06*)
- Inclusion List 2: Methotrexate medications

A patient must have:
- At least 1 match in List 1 (RA diagnosis) **AND**
- At least 1 match in List 2 (Methotrexate prescription)

Patients matching only some of the inclusion lists are excluded.
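
The AND rule amounts to a set intersection over the per-list matches. A minimal sketch, assuming each inclusion list has already been reduced to a mapping of `person_id` to match count (the shape `match_omop_codes()` builds internally); the function name and patient IDs are illustrative only:

```python
def patients_matching_all_lists(matches_per_list):
    """Return the person_ids that appear in every per-list match mapping."""
    if not matches_per_list:
        return set()
    eligible = set(matches_per_list[0])
    for matches in matches_per_list[1:]:
        eligible &= set(matches)  # keep only patients also present in this list
    return eligible

# Illustrative data: person_id -> number of matching records
ra_matches = {"patient_a": 2, "patient_b": 1}
mtx_matches = {"patient_b": 3, "patient_c": 1}

eligible = patients_matching_all_lists([ra_matches, mtx_matches])
```

Here only `patient_b` has both an RA diagnosis and a methotrexate record, so only `patient_b` remains.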

#### OMOP Tables Processed:

- **CONDITION_OCCURRENCE**: Diagnoses/problems (ICD10CM codes)
- **DRUG_EXPOSURE**: Medications (RXNORM codes)
- **MEASUREMENT**: Lab tests (LOINC codes)
- **PROCEDURE_OCCURRENCE**: Procedures (CPT codes)
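
Each table is matched on its `*_source_value` column. The notebook's cell walks the rows one at a time; for larger exports, the same lookup can be sketched as a vectorized `isin` call. This is an alternative illustration rather than the notebook's implementation, and `count_matches_by_patient` is a hypothetical helper name.

```python
import pandas as pd

# Table name -> column holding the source code, as used by the matching cell
OMOP_CODE_COLUMNS = {
    "CONDITION_OCCURRENCE": "condition_source_value",
    "DRUG_EXPOSURE": "drug_source_value",
    "MEASUREMENT": "measurement_source_value",
    "PROCEDURE_OCCURRENCE": "procedure_source_value",
}

def count_matches_by_patient(df, code_column, code_set):
    """Count, per person_id, the rows whose normalized source code is in code_set."""
    normalized = df[code_column].astype(str).str.strip().str.upper()
    matched = df[normalized.isin(code_set)]
    return matched.groupby("person_id").size().to_dict()

# Illustrative data only
condition_occurrence = pd.DataFrame({
    "person_id": [1, 1, 2],
    "condition_source_value": [" m06.9", "M06.9", "Z00.0"],
})
counts = count_matches_by_patient(
    condition_occurrence,
    OMOP_CODE_COLUMNS["CONDITION_OCCURRENCE"],
    {"M06.9"},
)
```

The normalization (strip, then upper-case) mirrors what the matching cell applies to both the valueset codes and the OMOP source values.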

#### Output:

- **Console**: Detailed matching progress with patient IDs and matched codes
- **CSV Export**: `Output/cohort_matching_results.csv` with matched patient details
- **Global Variable**: `cohort_matching_results` dictionary for further analysis

---


In [5]:
# Patient Cohort Matching Against Trial Dictionary using OMOP Data
import pandas as pd
import os
import glob
from collections import defaultdict

def find_latest_inclusion_csvs():
    """Find all latest downloaded trial dictionary CSV files that start with 'Inclusion'"""
    csv_files = glob.glob("Inclusion*_codes.csv")
    if not csv_files:
        return []
    # Sort by modification time, newest first
    csv_files = sorted(csv_files, key=os.path.getmtime, reverse=True)
    return csv_files

def load_omop_data(omop_dir='Output/OMOP_CSV'):
    """Load OMOP clinical tables from CSV files"""
    omop_tables = {}
    
    if not os.path.exists(omop_dir):
        print(f"❌ OMOP directory not found: {omop_dir}")
        return omop_tables

    # Define the clinical tables to load
    clinical_tables = {
        'CONDITION_OCCURRENCE': 'condition_source_value',
        'DRUG_EXPOSURE': 'drug_source_value',
        'MEASUREMENT': 'measurement_source_value',
        'PROCEDURE_OCCURRENCE': 'procedure_source_value'
    }

    print(f"📊 Loading OMOP clinical tables from: {omop_dir}")

    for table_name, code_column in clinical_tables.items():
        file_path = os.path.join(omop_dir, f'{table_name}.csv')

        if os.path.exists(file_path):
            try:
                df = pd.read_csv(file_path)
                omop_tables[table_name] = df
                print(f"   ✅ {table_name}: {len(df)} records")
            except Exception as e:
                print(f"   ❌ Error loading {table_name}: {e}")
        else:
            print(f"   ⚠️ {table_name}.csv not found")

    return omop_tables

def match_omop_codes(omop_tables, inclusion_trial_dfs):
    """
    Match OMOP source codes against each inclusion trial dictionary.
    Only patients who match at least one code in every inclusion dictionary are returned.
    """
    if not inclusion_trial_dfs or any(df is None or len(df) == 0 for df in inclusion_trial_dfs):
        print("❌ No valid inclusion trial dictionary data available for matching")
        return {}

    # Prepare code sets for each inclusion dictionary.
    # Only genuine code columns are accepted: the 'system' column holds the code
    # system identifier, so matching source codes against it could never succeed.
    code_columns = ['code', 'Code', 'CODE']
    inclusion_code_sets = []
    inclusion_code_cols = []
    for trial_df in inclusion_trial_dfs:
        code_col = next((col for col in code_columns if col in trial_df.columns), None)
        if code_col is None:
            print(f"❌ Could not identify code column. Available columns: {list(trial_df.columns)}")
            return {}
        inclusion_code_cols.append(code_col)
        inclusion_code_sets.append(set(trial_df[code_col].astype(str).str.strip().str.upper()))

    # Track matches by patient for each inclusion list
    patient_matches_per_inclusion = [defaultdict(int) for _ in inclusion_trial_dfs]

    omop_code_columns = {
        'CONDITION_OCCURRENCE': 'condition_source_value',
        'DRUG_EXPOSURE': 'drug_source_value',
        'MEASUREMENT': 'measurement_source_value',
        'PROCEDURE_OCCURRENCE': 'procedure_source_value'
    }

    print(f"\n🔍 Matching OMOP data against each inclusion trial dictionary...")
    print("=" * 80)

    for table_name, df in omop_tables.items():
        code_column = omop_code_columns.get(table_name)
        if code_column not in df.columns or 'person_id' not in df.columns:
            continue
        print(f"\nProcessing OMOP table: {table_name} with {len(df)} records")
        for idx, row in df.iterrows():
            # Normalize the same way as the valueset codes (trimmed, upper-case)
            source_code = str(row[code_column]).strip().upper()
            person_id = row['person_id']
            for i, code_set in enumerate(inclusion_code_sets):
                if source_code in code_set:
                    print(f"   Match found - Patient ID: {person_id}, Code: {source_code}, Inclusion List: {i+1}")
                    patient_matches_per_inclusion[i][person_id] += 1

    # Find patients present in all inclusion lists
    matched_patients = set(patient_matches_per_inclusion[0].keys())
    for matches in patient_matches_per_inclusion[1:]:
        matched_patients &= set(matches.keys())

    print(f"\n{'='*80}")
    print(f"👥 Patients matching ALL inclusion criteria: {len(matched_patients)}")

    # Collect detailed match info for each patient
    final_matches = {}
    for person_id in matched_patients:
        match_info = {
            'person_id': person_id,
            'inclusion_match_counts': [matches[person_id] for matches in patient_matches_per_inclusion],
            'total_inclusion_lists': len(inclusion_trial_dfs)
        }
        final_matches[person_id] = match_info

    return final_matches

# Main execution - Load and Match
print("🔍 OMOP-Based Patient Cohort Matching Against Cohort Dictionary")
print("=" * 80)

# Find and load all inclusion trial dictionaries
inclusion_csv_files = find_latest_inclusion_csvs()

if inclusion_csv_files:
    print(f"📁 Found {len(inclusion_csv_files)} inclusion trial dictionaries:")
    for f in inclusion_csv_files:
        print(f"   - {f}")

    # Load all inclusion trial dictionaries into DataFrames
    inclusion_trial_dfs = []
    for csv_file in inclusion_csv_files:
        try:
            df = pd.read_csv(csv_file)
            inclusion_trial_dfs.append(df)
            print(f"   ✅ Loaded {csv_file}: {len(df)} codes")
        except Exception as e:
            print(f"   ❌ Error loading {csv_file}: {e}")

    # Load OMOP data
    omop_tables = load_omop_data()

    if omop_tables and inclusion_trial_dfs:
        # Perform matching: find patients who match at least one code in every inclusion dictionary
        patient_matches = match_omop_codes(omop_tables, inclusion_trial_dfs)

        print(f"\n✅ Matching complete. Found {len(patient_matches)} eligible patients.")
        print(f"   Proceed to next cell for detailed cohort summary.")
    else:
        patient_matches = {}
        print("❌ No OMOP data tables loaded or no valid inclusion trial dictionaries found.")
else:
    patient_matches = {}
    omop_tables = {}
    inclusion_csv_files = []
    print("❌ No inclusion trial dictionary CSV files found.")
    print("   Please run Step 2 first to search and download trial dictionaries.")


🔍 OMOP-Based Patient Cohort Matching Against Cohort Dictionary
📁 Found 2 inclusion trial dictionaries:
   - Inclusion_18957_codes.csv
   - Inclusion_18950_codes.csv
   ✅ Loaded Inclusion_18957_codes.csv: 212 codes
   ✅ Loaded Inclusion_18950_codes.csv: 2584 codes
📊 Loading OMOP clinical tables from: Output/OMOP_CSV
   ✅ CONDITION_OCCURRENCE: 712 records
   ✅ DRUG_EXPOSURE: 262 records
   ✅ MEASUREMENT: 646 records
   ✅ PROCEDURE_OCCURRENCE: 59 records

🔍 Matching OMOP data against each inclusion trial dictionary...

Processing OMOP table: CONDITION_OCCURRENCE with 712 records
   Match found - Patient ID: 1104223764212704475, Code: M06.9, Inclusion List: 2
   Match found - Patient ID: 443887166406931539, Code: M06.9, Inclusion List: 2

Processing OMOP table: DRUG_EXPOSURE with 262 records
   Match found - Patient ID: 443887166406931539, Code: 328406, Inclusion List: 1

Processing OMOP table: MEASUREMENT with 646 records

Processing OMOP table: PROCEDURE_OCCURRENC

---

## Step 3.5: Apply Exclusion Criteria Filtering

### Remove Patients Matching Exclusion Criteria

This section filters out patients who matched the inclusion criteria but also have any conditions, medications, or procedures listed in the exclusion criteria.

#### Exclusion Logic:

**Important**: A patient is excluded from the final cohort if they match **any code** in **any** exclusion criteria list.

For example, if your exclusion criteria include:
- Exclusion List 1: Juvenile idiopathic arthritis (M08*)
- Exclusion List 2: Pregnancy-related conditions

A patient will be excluded if they have:
- Any match in List 1 **OR**
- Any match in List 2

This is the opposite of inclusion logic (which requires matches in ALL lists).
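
Where inclusion is an intersection, exclusion is a union followed by a set difference. A minimal sketch, assuming the inclusion-matched IDs and the per-list exclusion matches are already available as plain collections; `apply_exclusions` and the patient IDs are hypothetical:

```python
def apply_exclusions(included_ids, exclusion_matches_per_list):
    """Split included patients into (final_cohort, excluded) using ANY-match logic."""
    included = set(included_ids)
    excluded = set()
    for matches in exclusion_matches_per_list:
        # A hit in any single exclusion list is enough to remove the patient
        excluded |= included & set(matches)
    return included - excluded, excluded

# Illustrative data only
included = {"patient_a", "patient_b", "patient_c"}
jia_matches = {"patient_a"}
pregnancy_matches = {"patient_c", "patient_z"}  # patient_z was never included

final_cohort, excluded = apply_exclusions(included, [jia_matches, pregnancy_matches])
```

Intersecting with the included set first means patients who never met the inclusion criteria, such as `patient_z`, are not reported as exclusions.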

#### Process:

1. **Find Exclusion CSV Files**: Locate all `Exclusion*_codes.csv` files
2. **Match Against OMOP Data**: Check if inclusion-matched patients have any exclusion codes
3. **Filter Cohort**: Remove any patient with exclusion matches
4. **Report**: Display how many patients were excluded and why

---


In [7]:
# Apply Exclusion Criteria Filtering
import pandas as pd
import os
import glob
from collections import defaultdict

def find_latest_exclusion_csvs():
    """Find all downloaded exclusion dictionary CSV files that start with 'Exclusion'"""
    csv_files = glob.glob("Exclusion*_codes.csv")
    if not csv_files:
        return []
    # Sort by modification time, newest first
    csv_files = sorted(csv_files, key=os.path.getmtime, reverse=True)
    return csv_files

def match_exclusion_codes(omop_tables, exclusion_trial_dfs, inclusion_matched_patients):
    """
    Check if any inclusion-matched patients have exclusion criteria codes.
    Returns a tuple (excluded_patients, exclusion_details): the set of person_ids
    to exclude and, per person_id, the list of exclusion codes that were found.
    """
    # Every return path yields the same two-element tuple, because the caller unpacks two values
    if not exclusion_trial_dfs or any(df is None or len(df) == 0 for df in exclusion_trial_dfs):
        print("⚠️ No exclusion trial dictionary data available")
        return set(), {}

    # Prepare code sets for each exclusion dictionary.
    # The 'system' column holds the code system identifier, not codes, so it is not used.
    code_columns = ['code', 'Code', 'CODE']
    exclusion_code_sets = []

    for trial_df in exclusion_trial_dfs:
        code_col = next((col for col in code_columns if col in trial_df.columns), None)
        if code_col is None:
            print(f"❌ Could not identify code column in exclusion file. Available columns: {list(trial_df.columns)}")
            continue
        exclusion_code_sets.append(set(trial_df[code_col].astype(str).str.strip().str.upper()))

    if not exclusion_code_sets:
        return set(), {}

    # Track which patients have exclusion matches
    excluded_patients = set()
    exclusion_details = defaultdict(list)  # Track what exclusion codes were found for each patient

    omop_code_columns = {
        'CONDITION_OCCURRENCE': 'condition_source_value',
        'DRUG_EXPOSURE': 'drug_source_value',
        'MEASUREMENT': 'measurement_source_value',
        'PROCEDURE_OCCURRENCE': 'procedure_source_value'
    }

    print(f"\n🔍 Checking inclusion-matched patients against exclusion criteria...")
    print("=" * 80)

    # Only check patients who matched inclusion criteria
    inclusion_person_ids = set(inclusion_matched_patients.keys())

    for table_name, df in omop_tables.items():
        code_column = omop_code_columns.get(table_name)
        if code_column not in df.columns or 'person_id' not in df.columns:
            continue
            
        print(f"\nProcessing OMOP table: {table_name}")
        
        # Filter to only inclusion-matched patients
        df_filtered = df[df['person_id'].isin(inclusion_person_ids)]
        
        for idx, row in df_filtered.iterrows():
            source_code = str(row[code_column]).strip().upper()
            person_id = row['person_id']
            
            # Check against all exclusion code sets
            for excl_idx, code_set in enumerate(exclusion_code_sets):
                if source_code in code_set:
                    excluded_patients.add(person_id)
                    exclusion_details[person_id].append({
                        'code': source_code,
                        'table': table_name,
                        'exclusion_list': excl_idx + 1
                    })
                    print(f"   ❌ EXCLUSION - Patient ID: {person_id}, Code: {source_code}, Exclusion List: {excl_idx+1}")

    return excluded_patients, exclusion_details

# Main execution - Apply Exclusion Filtering
print("\n🚫 EXCLUSION CRITERIA FILTERING")
print("=" * 80)

# Find and load all exclusion trial dictionaries
exclusion_csv_files = find_latest_exclusion_csvs()

if exclusion_csv_files:
    print(f"📁 Found {len(exclusion_csv_files)} exclusion trial dictionaries:")
    for f in exclusion_csv_files:
        print(f"   - {f}")

    # Load all exclusion trial dictionaries into DataFrames
    exclusion_trial_dfs = []
    for csv_file in exclusion_csv_files:
        try:
            df = pd.read_csv(csv_file)
            exclusion_trial_dfs.append(df)
            print(f"   ✅ Loaded {csv_file}: {len(df)} codes")
        except Exception as e:
            print(f"   ❌ Error loading {csv_file}: {e}")

    if patient_matches and omop_tables and exclusion_trial_dfs:
        # Check for exclusions
        excluded_patients, exclusion_details = match_exclusion_codes(omop_tables, exclusion_trial_dfs, patient_matches)

        if excluded_patients:
            print(f"\n{'='*80}")
            print(f"❌ Found {len(excluded_patients)} patients with exclusion criteria:")

            # Show sample excluded patients
            for idx, person_id in enumerate(list(excluded_patients)[:5]):
                details = exclusion_details[person_id]
                print(f"\n   Patient {idx+1} (person_id: {person_id}):")
                print(f"     Exclusion matches: {len(details)}")
                for detail in details[:3]:  # Show first 3 matches
                    print(f"       - Code: {detail['code']} (List {detail['exclusion_list']}, Table: {detail['table']})")

            # Filter out excluded patients from patient_matches
            original_count = len(patient_matches)
            patient_matches = {pid: data for pid, data in patient_matches.items() if pid not in excluded_patients}

            print(f"\n{'='*80}")
            print(f"📊 Cohort Updated After Exclusion Filtering:")
            print(f"   Patients before exclusion: {original_count}")
            print(f"   Patients excluded: {len(excluded_patients)}")
            print(f"   Final cohort size: {len(patient_matches)}")
            print(f"{'='*80}")
        else:
            print(f"\n✅ No patients matched exclusion criteria. All {len(patient_matches)} patients remain in cohort.")
    else:
        print("⚠️ Cannot apply exclusion filtering - missing required data")
else:
    print("ℹ️ No exclusion trial dictionary CSV files found.")
    print("   Skipping exclusion filtering - all inclusion-matched patients will be retained.")
    exclusion_csv_files = []

print(f"\n✅ Exclusion filtering complete. Proceed to next cell for cohort summary.")



🚫 EXCLUSION CRITERIA FILTERING
📁 Found 1 exclusion trial dictionaries:
   - Exclusion_19150_codes.csv
   ✅ Loaded Exclusion_19150_codes.csv: 1106 codes

🔍 Checking inclusion-matched patients against exclusion criteria...

Processing OMOP table: CONDITION_OCCURRENCE

Processing OMOP table: DRUG_EXPOSURE

Processing OMOP table: MEASUREMENT

Processing OMOP table: PROCEDURE_OCCURRENCE

✅ No patients matched exclusion criteria. All 1 patients remain in cohort.

✅ Exclusion filtering complete. Proceed to next cell for cohort summary.


---

## Step 4: Display Final Cohort Summary and Export Results

### Generate Comprehensive Statistics for Final Cohort

This section displays detailed statistics about the final matched cohort (after both inclusion matching and exclusion filtering) and exports results to CSV.

#### Summary Includes:

- **Total Patient Count**: Number of patients in OMOP data vs. final matched patients
- **Match Percentage**: Percentage of patients meeting ALL inclusion criteria AND having NO exclusion criteria
- **Inclusion List Statistics**: Match counts for each individual inclusion criteria list
- **Sample Patient Details**: First 5 matched patients with their match counts
- **CSV Export**: Results saved to `Output/cohort_matching_results.csv`

#### Output Format:

The exported CSV contains:
- `person_id`: Obfuscated patient identifier
- `inclusion_match_counts`: Number of matches per inclusion list (comma-separated)
- `total_inclusion_lists`: Total number of inclusion criteria lists

**Note**: The final cohort displayed here has already been filtered to exclude patients with any exclusion criteria matches.
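
The export format above can be sketched as a small conversion from the `patient_matches` dictionary to a DataFrame. This is an illustration of the described columns, not the notebook's own export code; `matches_to_frame` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

EXPORT_COLUMNS = ["person_id", "inclusion_match_counts", "total_inclusion_lists"]

def matches_to_frame(patient_matches):
    """Flatten the per-patient match info into one row per patient."""
    rows = [
        {
            "person_id": person_id,
            # One count per inclusion list, joined into a comma-separated string
            "inclusion_match_counts": ",".join(str(n) for n in info["inclusion_match_counts"]),
            "total_inclusion_lists": info["total_inclusion_lists"],
        }
        for person_id, info in patient_matches.items()
    ]
    return pd.DataFrame(rows, columns=EXPORT_COLUMNS)

# Illustrative data in the shape produced by match_omop_codes()
example_matches = {
    101: {"person_id": 101, "inclusion_match_counts": [3, 1], "total_inclusion_lists": 2},
}
frame = matches_to_frame(example_matches)
csv_text = frame.to_csv(index=False)
```

Passing `columns=` explicitly keeps the header stable even when the cohort is empty, so downstream readers of the CSV do not need a special case.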

---


In [8]:
def display_cohort_summary(patient_matches, omop_tables, inclusion_csv_files):
    """Display comprehensive cohort matching summary for inclusion criteria"""
    print(f"\n{'='*80}")
    print(f"📊 COHORT MATCHING SUMMARY")
    print(f"{'='*80}")

    if not patient_matches:
        print("❌ No patients matched the trial eligibility criteria")
        return
    
    # Calculate total unique patients in OMOP data
    all_person_ids = set()
    for table_name, df in omop_tables.items():
        if 'person_id' in df.columns:
            all_person_ids.update(df['person_id'].unique())
    
    total_patients = len(all_person_ids)
    matched_patients = len(patient_matches)
    
    print(f"\n👥 Patient Cohort Statistics:")
    print(f"   Total patients in RWE data: {total_patients}")
    print(f"   Patients matching ALL inclusion criteria: {matched_patients}")
    
    if total_patients > 0:
        match_percentage = (matched_patients / total_patients) * 100
        print(f"   Match percentage: {match_percentage:.1f}%")
    
    # Analyze match distribution across inclusion lists
    print(f"\n🎯 Inclusion Criteria Match Distribution:")
    if matched_patients > 0:
        # Every patient record carries the same list count, so read it from the first one
        num_inclusion_lists = next(iter(patient_matches.values()))['total_inclusion_lists']
        print(f"   Total inclusion criteria lists: {num_inclusion_lists}")
        
        # Calculate average matches per inclusion list
        for i in range(num_inclusion_lists):
            matches_for_list = [data['inclusion_match_counts'][i] for data in patient_matches.values()]
            avg_matches = sum(matches_for_list) / len(matches_for_list) if matches_for_list else 0
            min_matches = min(matches_for_list) if matches_for_list else 0
            max_matches = max(matches_for_list) if matches_for_list else 0
            print(f"\n   Inclusion List {i+1} ({inclusion_csv_files[i] if i < len(inclusion_csv_files) else 'Unknown'}):")
            print(f"     Average matches per patient: {avg_matches:.1f}")
            print(f"     Min matches: {min_matches}")
            print(f"     Max matches: {max_matches}")
    
    # Show sample matched patients
    print(f"\n📋 Sample Matched Patients (first 5):")
    for idx, (person_id, data) in enumerate(list(patient_matches.items())[:5]):
        # Count the lists with at least one match instead of printing the total twice
        lists_matched = sum(1 for count in data['inclusion_match_counts'] if count > 0)
        print(f"\n   Patient {idx+1} (person_id: {person_id}):")
        print(f"     Matches per inclusion list: {data['inclusion_match_counts']}")
        print(f"     Total inclusion lists matched: {lists_matched}/{data['total_inclusion_lists']}")
    
    # Save results to CSV
    if patient_matches:
        results_file = 'Output/cohort_matching_results.csv'
        results_list = []
        for person_id, data in patient_matches.items():
            results_list.append({
                'person_id': person_id,
                'inclusion_match_counts': ', '.join(map(str, data['inclusion_match_counts'])),
                'total_inclusion_lists': data['total_inclusion_lists']
            })
        df_results = pd.DataFrame(results_list)
        df_results.to_csv(results_file, index=False)
        print(f"\n💾 Results exported to: {results_file}")
    
    # Store results globally
    globals()['cohort_matching_results'] = {
        'patient_matches': patient_matches,
        'inclusion_dictionary_files': inclusion_csv_files,
        'omop_tables_used': list(omop_tables.keys())
    }
    print(f"\n✅ Results stored in 'cohort_matching_results' variable for further analysis")

# Display summary and export results
if patient_matches and omop_tables:
    display_cohort_summary(patient_matches, omop_tables, inclusion_csv_files)
else:
    print("⚠️ No patient matches to display. Please run Step 3 first.")



📊 COHORT MATCHING SUMMARY

👥 Patient Cohort Statistics:
   Total patients in RWE data: 35
   Patients matching ALL inclusion criteria: 1
   Match percentage: 2.9%

🎯 Inclusion Criteria Match Distribution:
   Total inclusion criteria lists: 2

   Inclusion List 1 (Inclusion_18957_codes.csv):
     Average matches per patient: 1.0
     Min matches: 1
     Max matches: 1

   Inclusion List 2 (Inclusion_18950_codes.csv):
     Average matches per patient: 1.0
     Min matches: 1
     Max matches: 1

📋 Sample Matched Patients (first 5):

   Patient 1 (person_id: 443887166406931539):
     Matches per inclusion list: [1, 1]
     Total inclusion lists matched: 2/2

💾 Results exported to: Output/cohort_matching_results.csv

✅ Results stored in 'cohort_matching_results' variable for further analysis
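
The stored `cohort_matching_results` dictionary can be reshaped for follow-up analysis. The sketch below builds a toy dictionary with the same three keys the function stores, then pivots it into one row per patient and one column per inclusion list file. The helper name `matches_to_frame`, the patient IDs, file names, and table names are all hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-in with the same keys as 'cohort_matching_results' (made-up values)
cohort_matching_results = {
    "patient_matches": {
        1001: {"inclusion_match_counts": [1, 1], "total_inclusion_lists": 2},
        1002: {"inclusion_match_counts": [3, 2], "total_inclusion_lists": 2},
    },
    "inclusion_dictionary_files": ["Inclusion_A_codes.csv", "Inclusion_B_codes.csv"],
    "omop_tables_used": ["CONDITION_OCCURRENCE", "DRUG_EXPOSURE"],
}

def matches_to_frame(results: dict) -> pd.DataFrame:
    """Return one row per patient and one column per inclusion list file
    (hypothetical helper, not part of the notebook)."""
    files = results["inclusion_dictionary_files"]
    rows = {
        person_id: data["inclusion_match_counts"]
        for person_id, data in results["patient_matches"].items()
    }
    frame = pd.DataFrame.from_dict(rows, orient="index", columns=files)
    frame.index.name = "person_id"
    return frame

frame = matches_to_frame(cohort_matching_results)
print(frame)
print(frame.sum())  # total matches per inclusion list across the cohort
```

In the notebook itself, the toy dictionary would be dropped and the global variable created by `display_cohort_summary` used directly.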
