#  **CLINICAL TRIAL MATCHING AI**

##  Project Description

**Clinical Trial Matching AI – Project Summary**

**Description**:  
**Developed an AI system for intelligent clinical trial matching**, designed to link patient clinical notes to relevant trials from ClinicalTrials.gov. The system extracts structured patient information and filters trials using both rule-based and AI-powered techniques to identify the most relevant matches.

---

##  Data Used:

- **Clinical trial eligibility criteria** from ClinicalTrials.gov (full protocol JSON)
- **Unstructured patient data** containing clinical notes (symptoms, diagnoses, medications, history) from **PMC-Patients-V2**

---

##  Techniques and Models Used

###  Named Entity Recognition (NER):

- **SciSpacy + UMLS Entity Linker** – for extracting diseases, medications, procedures with synonym resolution  
- **Stanza** – Stanford NLP for additional NER coverage  
- **AWS Comprehend Medical** – for condition, medication, and test detection  
- **Anthropic Claude (LLM)** – for zero-shot and instruction-following medical NER  

---

###  Trial Filtering Pipeline:

- **Initial condition-based trial filtering** – match patient conditions with trial eligibility conditions to reduce search space  
- **Final LLM-based filtering and reasoning** – used Claude to evaluate full eligibility criteria for remaining trials and rank them based on relevance

---

##  Workflow

1. **Extract structured patient data** using NER
2. **Retrieve and preprocess trials** from ClinicalTrials.gov
3. **Match patient conditions** with trial conditions (ICD/UMLS-linked if available)
4. **Filter to a smaller subset of "close" matches**
5. **Use Claude LLM** to evaluate inclusion/exclusion criteria and output **eligibility reasoning**

---

##  Outcome

-  **Efficient trial matching** with explainable reasoning  


In [42]:
!pip install boto3

Collecting boto3
  Downloading boto3-1.37.25-py3-none-any.whl.metadata (6.7 kB)
Collecting s3transfer<0.12.0,>=0.11.0 (from boto3)
  Downloading s3transfer-0.11.4-py3-none-any.whl.metadata (1.7 kB)
Downloading boto3-1.37.25-py3-none-any.whl (139 kB)
Downloading s3transfer-0.11.4-py3-none-any.whl (84 kB)
Installing collected packages: s3transfer, boto3
Successfully installed boto3-1.37.25 s3transfer-0.11.4


In [98]:
#pip install anthropic
#pip install stanza
#pip install scispacy
# pip install spacy==3.4.4
#pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz
#pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bc5cdr_md-0.4.0.tar.gz
#pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_scibert-0.5.1.tar.gz
#pip install numpy==1.25.0

In [18]:
# IMPORT LIBRARIES
import json
import time
import warnings
from datetime import datetime
from typing import Dict, List, Optional

import aiohttp
import boto3
import pandas as pd
import regex as re  # third-party `regex` package, aliased to `re`
import requests
import scispacy
import spacy
import stanza
from anthropic import AnthropicBedrock
from scispacy.umls_linking import UmlsEntityLinker

warnings.filterwarnings("ignore")

# Initialize the Anthropic client (Claude served through Amazon Bedrock)
client_claude = AnthropicBedrock()


In [49]:

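import json
import os

import boto3
from anthropic import AnthropicBedrock
from google.colab import userdata

# Read AWS credentials from Colab secrets and expose them as environment variables
os.environ["AWS_ACCESS_KEY_ID"] = userdata.get('AWS_ACCESS_KEY_ID')
os.environ["AWS_SECRET_ACCESS_KEY"] = userdata.get("AWS_SECRET_ACCESS_KEY")
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

# Re-create the client now that the credentials are set
client_claude = AnthropicBedrock()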

##  Load Patient Dataset

We will load and preview the first 700 patients from the **PMC-Patients-V2.csv** file.

In [4]:
patient_df = pd.read_csv("/content/PMC-Patients-V2.csv", on_bad_lines='skip')
patient_df = patient_df.head(700)
patient_df.head()



Unnamed: 0,patient_id,patient_uid,PMID,file_path,title,patient,age,gender,pub_date
0,0,497050-1,15268761,oa_comm/PMC000xxxxxx/PMC497050.xml,Echocardiographic assessment and percutaneous ...,A 34-year old woman was referred for considera...,"[[34.0, 'year']]",F,2004
1,1,497050-2,15268761,oa_comm/PMC000xxxxxx/PMC497050.xml,Echocardiographic assessment and percutaneous ...,A 31-year old woman was found to have a secund...,"[[31.0, 'year']]",F,2004
2,2,497050-3,15268761,oa_comm/PMC000xxxxxx/PMC497050.xml,Echocardiographic assessment and percutaneous ...,A 30-year old woman was investigated for palpi...,"[[30.0, 'year']]",F,2004
3,3,503399-1,15272940,oa_comm/PMC000xxxxxx/PMC503399.xml,A unique dedifferentiated tumor of the retrope...,A 65-year-old male presented with a three-week...,"[[65.0, 'year']]",M,2004
4,4,509249-1,15285782,oa_comm/PMC000xxxxxx/PMC509249.xml,Adenoid cystic carcinoma of the parotid metast...,A 60 year old woman presented with a small swe...,"[[60.0, 'year']]",F,2004


In [5]:
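# .copy() makes this an independent DataFrame, so columns can be added later without a SettingWithCopyWarning
patient_df = patient_df[["patient_id", "patient_uid", "patient", "age", "gender"]].copy()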

#  Preview result
patient_df.head()

Unnamed: 0,patient_id,patient_uid,patient,age,gender
0,0,497050-1,A 34-year old woman was referred for considera...,"[[34.0, 'year']]",F
1,1,497050-2,A 31-year old woman was found to have a secund...,"[[31.0, 'year']]",F
2,2,497050-3,A 30-year old woman was investigated for palpi...,"[[30.0, 'year']]",F
3,3,503399-1,A 65-year-old male presented with a three-week...,"[[65.0, 'year']]",M
4,4,509249-1,A 60 year old woman presented with a small swe...,"[[60.0, 'year']]",F


# NER
# NER WITH AWS COMPREHEND MEDICAL

##  Extract Medical Entities Using AWS Comprehend Medical

We use **AWS Comprehend Medical** to extract structured clinical entities from the `patient` notes, including:

- ANATOMY  
- MEDICAL_CONDITION  
- MEDICATION  
- TEST_TREATMENT_PROCEDURE  
- TIME_EXPRESSION  
- PROTECTED_HEALTH_INFORMATION

The extracted data is saved to `patients_with_comprehend.csv`.
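
The grouping step can be sketched without calling AWS. The snippet below is a minimal illustration that works on a hand-written dictionary shaped like a `detect_entities_v2` response. The `sample_response` values, the `min_score` threshold, and the choice to drop entities carrying a `NEGATION` trait are assumptions made for this example. The notebook itself keeps every entity in the listed categories.

```python
from collections import defaultdict

# Hypothetical sample shaped like a detect_entities_v2 response (not real output).
sample_response = {
    "Entities": [
        {"Text": "ASD", "Category": "MEDICAL_CONDITION", "Score": 0.98,
         "Traits": [{"Name": "DIAGNOSIS", "Score": 0.95}]},
        {"Text": "chest pain", "Category": "MEDICAL_CONDITION", "Score": 0.97,
         "Traits": [{"Name": "NEGATION", "Score": 0.93}]},
        {"Text": "TTE", "Category": "TEST_TREATMENT_PROCEDURE", "Score": 0.91,
         "Traits": []},
        {"Text": "asd", "Category": "MEDICAL_CONDITION", "Score": 0.40,
         "Traits": []},
    ]
}


def group_entities(response, categories, min_score=0.5, drop_negated=True):
    """Group entity texts by category, skipping low-confidence and negated mentions."""
    grouped = defaultdict(list)
    for entity in response.get("Entities", []):
        category = entity.get("Category")
        if category not in categories:
            continue
        if entity.get("Score", 0.0) < min_score:
            continue
        trait_names = {trait.get("Name") for trait in entity.get("Traits", [])}
        if drop_negated and "NEGATION" in trait_names:
            continue
        text = entity["Text"]
        if text not in grouped[category]:  # keep first occurrence only
            grouped[category].append(text)
    return dict(grouped)


grouped = group_entities(
    sample_response, {"MEDICAL_CONDITION", "TEST_TREATMENT_PROCEDURE"}
)
print(grouped)
```

Filtering negated mentions matters for trial matching because a note that says a patient has *no* chest pain should not contribute "chest pain" as a condition. Whether that filter helps on this dataset has not been evaluated here.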


In [None]:
import boto3
import pandas as pd
from tqdm import tqdm

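# Load the dataset and keep the first 700 patients
patients_df_full = pd.read_csv('PMC-Patients-V2.csv')
print("Loaded:", patients_df_full.shape)
# .copy() avoids a SettingWithCopyWarning when the category columns are added below
patients_df = patients_df_full.head(700).copy()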

# Create AWS Comprehend Medical client
comprehend_med = boto3.client(service_name='comprehendmedical', region_name='us-east-1')

# Categories to extract
all_categories = ['ANATOMY', 'MEDICAL_CONDITION', 'MEDICATION', 'PROTECTED_HEALTH_INFORMATION', 'TEST_TREATMENT_PROCEDURE', 'TIME_EXPRESSION']

# Initialize empty lists for each category
for cat in all_categories:
    patients_df[cat] = [[] for _ in range(len(patients_df))]

# Process each note and extract entities
for idx, row in tqdm(patients_df.iterrows(), total=len(patients_df)):
    text = row['patient']
    try:
        response = comprehend_med.detect_entities_v2(Text=text)
        for entity in response['Entities']:
            category = entity['Category']
            if category in all_categories:
                patients_df.at[idx, category].append(entity['Text'])
    except Exception as e:
        print(f"❌ Error at row {idx}: {e}")
        continue

    # Optional: Save progress every 10 rows
    if idx % 10 == 0:
        patients_df.to_csv("patients_with_comprehend.csv", index=False)
        print(f"✅ Saved progress at row {idx}")

# Final save
patients_df.to_csv("patients_with_comprehend.csv", index=False)
print("🎉 Completed entity extraction & saved all results.")



In [6]:
aws_patient_df = pd.read_csv('aws_comprehend_700.csv')
print(aws_patient_df.shape)
aws_patient_df.head()

(700, 15)


Unnamed: 0,patient_id,patient_uid,PMID,file_path,title,patient,age,gender,pub_date,ANATOMY,MEDICAL_CONDITION,MEDICATION,PROTECTED_HEALTH_INFORMATION,TEST_TREATMENT_PROCEDURE,TIME_EXPRESSION
0,0,497050-1,15268761,oa_comm/PMC000xxxxxx/PMC497050.xml,Echocardiographic assessment and percutaneous ...,A 34-year old woman was referred for considera...,"[[34.0, 'year']]",F,2004,"['right ventricle', 'cardiac', 'right ventricu...","['ASD', 'right ventricle was dilating', 'eleva...",[],"['34-year', '12']","['percutaneous ASD closure', 'TTE', 'cardiac c...","['six months later', 'six months']"
1,1,497050-2,15268761,oa_comm/PMC000xxxxxx/PMC497050.xml,Echocardiographic assessment and percutaneous ...,A 31-year old woman was found to have a secund...,"[[31.0, 'year']]",F,2004,"['right atrial', 'right ventricular', 'Left at...","['secundum ASD', 'breathlessness', 'defect', '...",[],['31-year'],"['TTE', 'Cardiac catheterisation', 'TEE', 'inf...",['three months later']
2,2,497050-3,15268761,oa_comm/PMC000xxxxxx/PMC497050.xml,Echocardiographic assessment and percutaneous ...,A 30-year old woman was investigated for palpi...,"[[30.0, 'year']]",F,2004,"['atrial septum', 'cardiac', 'fossa ovalis', '...","['palpitations', 'secundum ASD', 'atrial septu...",[],['30-year'],"['TEE', 'device closure', 'cardiac catheterisa...",[]
3,3,503399-1,15272940,oa_comm/PMC000xxxxxx/PMC503399.xml,A unique dedifferentiated tumor of the retrope...,A 65-year-old male presented with a three-week...,"[[65.0, 'year']]",M,2004,"['abdominal', 'abdominal', 'bowel', 'abdomen',...","['abdominal distension', 'abdominal pain', 'bo...",[],"['65', 'surgical oncology outpatient clinic']","['Computed tomographic (CT) scan', 'neoadjuvan...","['three-week', 'preoperative', 'Four months', ..."
4,4,509249-1,15285782,oa_comm/PMC000xxxxxx/PMC509249.xml,Adenoid cystic carcinoma of the parotid metast...,A 60 year old woman presented with a small swe...,"[[60.0, 'year']]",F,2004,"['beneath the right ear lobe', 'superficial pa...","['swelling beneath the right ear lobe', 'swell...",[],['60'],"['fine needle aspiration cytology (FNAC)', 'to...","['4 months', 'post-operatively', 'Three years'..."


##  NER with Stanza + SciSpacy + UMLS

This section demonstrates how to extract structured clinical information using a hybrid NER approach powered by:

-  Stanza (Stanford NLP – `i2b2` + `mimic` clinical models)
-  SciSpacy with UMLS Entity Linking (`en_core_sci_scibert`)
-  Optional export of UMLS-linked terms for standardized condition mapping

The Stanza step extracts three entity types from the patient notes:

- PROBLEM
- TEST
- TREATMENT

The results are saved to `patients_with_stanza_ents.csv` (loaded later in this notebook as `patients_with_stanza_ents (1).csv`).

In [None]:

### NER WITH spacy-Stanza-UMLS


import stanza
import spacy
import scispacy
from scispacy.umls_linking import UmlsEntityLinker
import pandas as pd
from tqdm import tqdm

# --- 1. Initialize Stanza for NER ---
stanza.download('en', package='i2b2')
stanza.download('en', package='mimic')
stanza_nlp = stanza.Pipeline(lang='en', processors='tokenize,ner', package={'tokenize': 'mimic', 'ner': 'i2b2'})

# --- 2. Init empty label columns if missing ---
for col in ["PROBLEM", "TEST", "TREATMENT"]:
    if col not in patients_df.columns:
        patients_df[col] = [[] for _ in range(len(patients_df))]

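# --- 3. Run Stanza for NER ---
# Reuse the clinical pipeline created in step 1. Re-creating a default English
# pipeline here would replace the i2b2 NER model with the general-purpose one,
# which does not emit PROBLEM / TEST / TREATMENT labels.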

for idx, row in tqdm(patients_df.iterrows(), total=len(patients_df)):
    text = row["patient"]
    doc_stanza = stanza_nlp(text)

    problems, tests, treatments = [], [], []
    for ent in doc_stanza.ents:
        if ent.type == "PROBLEM":
            problems.append(ent.text)
        elif ent.type == "TEST":
            tests.append(ent.text)
        elif ent.type == "TREATMENT":
            treatments.append(ent.text)

    patients_df.at[idx, "PROBLEM"] = problems
    patients_df.at[idx, "TEST"] = tests
    patients_df.at[idx, "TREATMENT"] = treatments

    if idx % 10 == 0:
        patients_df.to_csv("patients_with_stanza_ents.csv", index=False)

# --- 4. Initialize SciSpacy + UMLS Linker ---
# nlp = spacy.load("en_core_sci_scibert")
# linker = UmlsEntityLinker(resolve_abbreviations=True, filter_for_definitions=True)
# nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True}, last=True)

# # --- 5. Create UMLS-linked columns ---
# labels = ["PROBLEM", "TEST", "TREATMENT"]
# for label in labels:
#     patients_df[f"{label}_umls"] = [[] for _ in range(len(patients_df))]

# --- 6. UMLS Linking ---
# for idx, row in tqdm(patients_df.iterrows(), total=len(patients_df)):
#     for label in labels:
#         linked_names = []
#         for phrase in row[label]:
#             try:
#                 doc = nlp(phrase)
#                 for ent in doc.ents:
#                     if ent._.umls_ents:
#                         cui, score = ent._.umls_ents[0]
#                         name = linker.umls.cui_to_entity[cui].canonical_name
#                         linked_names.append(name)
#             except Exception as e:
#                 print(f"⚠️ UMLS linking error at row {idx}, label {label}, phrase '{phrase}':", e)

#         patients_df.at[idx, f"{label}_umls"] = linked_names

#     if idx % 10 == 0:
#         patients_df.to_csv("patients_with_stanza_umls.csv", index=False)

# # --- 7. Optional: Rename and Final Columns ---
# patients_df.rename(columns={
#     "PROBLEM": "condition",
#     "TEST": "test",
#     "TREATMENT": "treatment",
#     "PROBLEM_umls": "condition_umls",
#     "TEST_umls": "test_umls",
#     "TREATMENT_umls": "treatment_umls"
# }, inplace=True)

# # Optional columns if present
# optional_cols = ['uid', 'patient_id', 'age', 'sex', 'patient', 'medication', 'medical_history']
# final_columns = [col for col in optional_cols if col in patients_df.columns] + [
#     "condition", "test", "treatment",
#     "condition_umls", "test_umls", "treatment_umls"
# ]

# final_df = patients_df[final_columns]
# final_df.to_csv("final_patient_with_umls (1).csv", index=False)




In [7]:
patient_df_stanza=pd.read_csv('patients_with_stanza_ents (1).csv')
patient_df_stanza.head()

Unnamed: 0,patient_id,patient_uid,PMID,file_path,title,patient,age,gender,pub_date,PROBLEM,TEST,TREATMENT
0,0,497050-1,15268761,oa_comm/PMC000xxxxxx/PMC497050.xml,Echocardiographic assessment and percutaneous ...,A 34-year old woman was referred for considera...,"[[34.0, 'year']]",F,2004,['mildly elevated right ventricular systolic p...,"['TTE', 'cardiac catheterisation', 'systemic f...","['percutaneous ASD closure', 'percutaneous ASD..."
1,1,497050-2,15268761,oa_comm/PMC000xxxxxx/PMC497050.xml,Echocardiographic assessment and percutaneous ...,A 31-year old woman was found to have a secund...,"[[31.0, 'year']]",F,2004,"['a secundum ASD', 'breathlessness', 'The defe...","['TTE', 'further investigation', 'Cardiac cath...","['treatment', 'systemic flow shunt', 'a 24 mm ..."
2,2,497050-3,15268761,oa_comm/PMC000xxxxxx/PMC497050.xml,Echocardiographic assessment and percutaneous ...,A 30-year old woman was investigated for palpi...,"[[30.0, 'year']]",F,2004,"['palpitations', 'a secundum ASD', 'the atrial...","['TEE', 'cardiac catheterisation', 'Equivalent...","['device closure', 'The superior hole', 'a 16 ..."
3,3,503399-1,15272940,oa_comm/PMC000xxxxxx/PMC503399.xml,A unique dedifferentiated tumor of the retrope...,A 65-year-old male presented with a three-week...,"[[65.0, 'year']]",M,2004,['progressively worsening abdominal distension...,"['examination', 'Computed tomographic (CT) sca...","['neoadjuvant chemotherapy', 'radiation', 'a f..."
4,4,509249-1,15285782,oa_comm/PMC000xxxxxx/PMC509249.xml,Adenoid cystic carcinoma of the parotid metast...,A 60 year old woman presented with a small swe...,"[[60.0, 'year']]",F,2004,['a small swelling beneath the right ear lobe'...,"['A fine needle aspiration cytology (FNAC)', '...","['a total conservative parotidectomy', 'Gy adj..."


##  Merge NER Outputs from Stanza and AWS Comprehend Medical

We combine the structured clinical entities extracted by Stanza and AWS Comprehend Medical by joining their outputs on `patient_id`. This lets us compare the two models and enrich each patient record with both sets of entities.

The resulting merged DataFrame includes:

- Patient demographics (`patient_id`, `patient_uid`, `age`, `gender`)
- Stanza-extracted: `stanza_condition`, `stanza_test`, `stanza_treatment`
- AWS-extracted: `ANATOMY`, `aws_condition`, `aws_medication`, `aws_test_treatment_procedure`
- Full patient note under `medical history`


In [9]:
# 1. Keep only the columns needed from each source
columns_to_keep1 = [
    'patient_id', 'patient_uid', 'patient', 'age', 'gender',
    'PROBLEM', 'TEST', 'TREATMENT',  # from Stanza
]
columns_to_keep2 = [
    'patient_id',
    'ANATOMY', 'MEDICAL_CONDITION', 'MEDICATION', 'TEST_TREATMENT_PROCEDURE',  # from AWS
]
aws_patient_df = aws_patient_df[columns_to_keep2]
patient_df_stanza = patient_df_stanza[columns_to_keep1]

# 2. Merge on 'patient_id' (pd.merge performs an inner join by default)
merged_df = pd.merge(
    aws_patient_df,
    patient_df_stanza,
    on='patient_id',
    suffixes=('_aws', '_stanza')
)



# 3. Rename columns as requested
merged_df.rename(columns={
    'PROBLEM': 'stanza_condition',
    'TEST': 'stanza_test',
    'TREATMENT': 'stanza_treatment',
    'MEDICAL_CONDITION': 'aws_condition',
    'MEDICATION': 'aws_medication',
    'patient':'medical history',
    'TEST_TREATMENT_PROCEDURE': 'aws_test_treatment_procedure'
}, inplace=True)

merged_df.head()

merged_df.to_csv("merged_df.csv", index=False)

In [10]:
merged_df=pd.read_csv("merged_df.csv")
merged_df.head()

Unnamed: 0,patient_id,ANATOMY,aws_condition,aws_medication,aws_test_treatment_procedure,patient_uid,medical history,age,gender,stanza_condition,stanza_test,stanza_treatment
0,0,"['right ventricle', 'cardiac', 'right ventricu...","['ASD', 'right ventricle was dilating', 'eleva...",[],"['percutaneous ASD closure', 'TTE', 'cardiac c...",497050-1,A 34-year old woman was referred for considera...,"[[34.0, 'year']]",F,['mildly elevated right ventricular systolic p...,"['TTE', 'cardiac catheterisation', 'systemic f...","['percutaneous ASD closure', 'percutaneous ASD..."
1,1,"['right atrial', 'right ventricular', 'Left at...","['secundum ASD', 'breathlessness', 'defect', '...",[],"['TTE', 'Cardiac catheterisation', 'TEE', 'inf...",497050-2,A 31-year old woman was found to have a secund...,"[[31.0, 'year']]",F,"['a secundum ASD', 'breathlessness', 'The defe...","['TTE', 'further investigation', 'Cardiac cath...","['treatment', 'systemic flow shunt', 'a 24 mm ..."
2,2,"['atrial septum', 'cardiac', 'fossa ovalis', '...","['palpitations', 'secundum ASD', 'atrial septu...",[],"['TEE', 'device closure', 'cardiac catheterisa...",497050-3,A 30-year old woman was investigated for palpi...,"[[30.0, 'year']]",F,"['palpitations', 'a secundum ASD', 'the atrial...","['TEE', 'cardiac catheterisation', 'Equivalent...","['device closure', 'The superior hole', 'a 16 ..."
3,3,"['abdominal', 'abdominal', 'bowel', 'abdomen',...","['abdominal distension', 'abdominal pain', 'bo...",[],"['Computed tomographic (CT) scan', 'neoadjuvan...",503399-1,A 65-year-old male presented with a three-week...,"[[65.0, 'year']]",M,['progressively worsening abdominal distension...,"['examination', 'Computed tomographic (CT) sca...","['neoadjuvant chemotherapy', 'radiation', 'a f..."
4,4,"['beneath the right ear lobe', 'superficial pa...","['swelling beneath the right ear lobe', 'swell...",[],"['fine needle aspiration cytology (FNAC)', 'to...",509249-1,A 60 year old woman presented with a small swe...,"[[60.0, 'year']]",F,['a small swelling beneath the right ear lobe'...,"['A fine needle aspiration cytology (FNAC)', '...","['a total conservative parotidectomy', 'Gy adj..."


## NER WITH LLM

## Extract Structured Patient Info using Claude (LLM-based NER)

We use **Anthropic Claude** to extract structured information from clinical notes. This includes:

- Demographics: `name`, `age`, `sex`
- Conditions: problems, diagnoses, diseases
- Medications
- Medical History

Each note is processed using a carefully designed prompt and results are stored in `nlp_ner.csv`.
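
Model replies do not always consist of bare JSON. They may add a sentence of preamble or wrap the object in a Markdown code fence, in which case a direct `json.loads` fails and the row silently falls back to the empty structure. The sketch below shows one possible way to make parsing more tolerant. The helper names `empty_result` and `parse_llm_json` are hypothetical and not part of the notebook, and the sample reply is hand-written.

```python
import json


def empty_result():
    """Fresh copy of the fallback structure used when parsing fails."""
    return {"demographics": {}, "conditions": [], "medications": [], "history": []}


def parse_llm_json(raw_text):
    """Best-effort parse of the outermost {...} span found in a model reply."""
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end < start:
        return empty_result()
    try:
        parsed = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return empty_result()
    # Keep only the expected keys and fill in any that are missing.
    result = empty_result()
    result.update({key: value for key, value in parsed.items() if key in result})
    return result


# Hand-written reply that wraps the JSON in a Markdown fence with a preamble.
fence = "`" * 3
reply = (
    "Here is the result:\n" + fence + "json\n"
    '{"demographics": {"age": 34, "sex": "F"}, "conditions": ["ASD"]}\n' + fence
)
parsed_reply = parse_llm_json(reply)
print(parsed_reply)
```

Slicing from the first `{` to the last `}` is a deliberately simple heuristic. It would still fail on replies that contain several separate JSON objects, so it is a starting point and not a complete solution.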


In [None]:
from typing import Dict
import json
import time

def extract_patient_info(clinical_note: str) -> Dict:
    prompt = f"""
You are a medical data extraction assistant. Given the clinical note below, extract structured patient information.

Clinical Note:
{clinical_note}

Return only a valid JSON object in the following structure:
{{
  "demographics": {{"name": <optional>, "age": <optional>, "sex": <optional>}},
  "conditions": [list of diagnosed conditions, problems, diagnoses, diseases present],
  "medications": [list of medications],
  "history": [list of relevant medical history]
}}
"""

    message = client_claude.messages.create(
                    max_tokens=1024,
                    temperature=0.0,
                    messages=[{"role": "user", "content": prompt}],
                    model="anthropic.claude-3-sonnet-20240229-v1:0",
                )

    try:
        return json.loads(message.content[0].text)
    except Exception:
        print("⚠️ Failed to parse Claude response. Returning empty structure.")
        return {"demographics": {}, "conditions": [], "medications": [], "history": []}


# List-valued columns use object dtype so each cell can hold a Python list
for col in ["conditions", "medications", "history"]:
    patient_df[col] = pd.Series([[] for _ in range(len(patient_df))], dtype=object)

# Name/age/gender can stay as None
for col in ["extracted_name", "extracted_age", "extracted_gender"]:
    if col not in patient_df.columns:
        patient_df[col] = None

for idx, row in patient_df.iterrows():
    time.sleep(6)
    extracted = extract_patient_info(row["patient"])
    print(extracted)
    demographics = extracted.get("demographics", {})

    patient_df.at[idx, "extracted_name"] = demographics.get("name")
    patient_df.at[idx, "extracted_age"] = demographics.get("age")
    # The prompt asks for "sex", so read that key (a "gender" lookup always returns None)
    patient_df.at[idx, "extracted_gender"] = demographics.get("sex")

    patient_df.at[idx, "conditions"] = list(extracted.get("conditions", []))
    patient_df.at[idx, "medications"] = list(extracted.get("medications", []))
    patient_df.at[idx, "history"] = list(extracted.get("history", []))
    patient_df.to_csv("nlp_ner.csv", index=False)





In [106]:
patient_df_nlp=pd.read_csv('nlp_ner.csv')
#patient_df_nlp=patient_df_nlp[["patient_id","patient_uid",	"patient",	"age",	"gender",	"conditions",	"medications",	"history", "extracted_name"	,"extracted_age",	"extracted_gender"]]
patient_df_nlp.head()

Unnamed: 0,patient_id,patient_uid,PMID,file_path,title,patient,age,gender,pub_date,conditions,medications,history,extracted_name,extracted_age,extracted_gender
0,0,497050-1,15268761,oa_comm/PMC000xxxxxx/PMC497050.xml,Echocardiographic assessment and percutaneous ...,A 34-year old woman was referred for considera...,"[[34.0, 'year']]",F,2004,"['atrial septal defect (ASD)', 'right ventricu...",[],"['ASD diagnosed at age 12', 'percutaneous ASD ...",,34.0,
1,1,497050-2,15268761,oa_comm/PMC000xxxxxx/PMC497050.xml,Echocardiographic assessment and percutaneous ...,A 31-year old woman was found to have a secund...,"[[31.0, 'year']]",F,2004,"['secundum atrial septal defect (ASD)', 'breat...",[],['Cardiac catheterisation demonstrated a pulmo...,,31.0,
2,2,497050-3,15268761,oa_comm/PMC000xxxxxx/PMC497050.xml,Echocardiographic assessment and percutaneous ...,A 30-year old woman was investigated for palpi...,"[[30.0, 'year']]",F,2004,"['palpitations', 'secundum atrial septal defec...",[],"['device closure of ASD', 'cardiac catheteriza...",,30.0,
3,3,503399-1,15272940,oa_comm/PMC000xxxxxx/PMC503399.xml,A unique dedifferentiated tumor of the retrope...,A 65-year-old male presented with a three-week...,"[[65.0, 'year']]",M,2004,"['abdominal distension', 'increased frequency ...",[],['past medical history was unremarkable'],,65.0,
4,4,509249-1,15285782,oa_comm/PMC000xxxxxx/PMC509249.xml,Adenoid cystic carcinoma of the parotid metast...,A 60 year old woman presented with a small swe...,"[[60.0, 'year']]",F,2004,"['swelling beneath the right ear lobe', 'acini...",[],['underwent a total conservative parotidectomy...,,60.0,


## Clinical Trial Match






## 🔍 Clinical Trial Match – Fetch Trials from ClinicalTrials.gov (Async API)

We use an asynchronous pipeline to fetch **clinical trial protocols** from the **ClinicalTrials.gov v2 API**. The pipeline supports:

- Pagination with `nextPageToken`
- Filtering trials that are **currently recruiting**
- Fallback to a local JSON dataset for faster iteration (e.g. `ctg-studies.json`)

Each trial record includes:
- `briefTitle`
- `conditions`
- `eligibilityCriteria`
- `nctId`
- Full `protocolSection`
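
As a small illustration of how those fields sit inside a study record, the sketch below flattens one record with chained `.get()` calls so that missing modules yield empty defaults instead of a `KeyError`. The helper names `extract_trial_fields` and `is_recruiting` are hypothetical, and `sample_study` is a made-up record with a placeholder NCT ID, not data from ClinicalTrials.gov.

```python
def extract_trial_fields(study):
    """Flatten the nested protocolSection of one study into a flat dict."""
    protocol = study.get("protocolSection", {})
    identification = protocol.get("identificationModule", {})
    return {
        "nct_id": identification.get("nctId", ""),
        "title": identification.get("briefTitle", ""),
        "status": protocol.get("statusModule", {}).get("overallStatus", "").upper(),
        "conditions": protocol.get("conditionsModule", {}).get("conditions", []),
        "eligibility": protocol.get("eligibilityModule", {}).get("eligibilityCriteria", ""),
        "protocol": protocol,
    }


def is_recruiting(study):
    """True when the study's overall status is RECRUITING."""
    return extract_trial_fields(study)["status"] == "RECRUITING"


# Made-up record for illustration only (placeholder ID and text).
sample_study = {
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00000000", "briefTitle": "Example Trial"},
        "statusModule": {"overallStatus": "Recruiting"},
        "conditionsModule": {"conditions": ["Atrial Septal Defect"]},
        "eligibilityModule": {"eligibilityCriteria": "Inclusion Criteria:\n\n* Age 18 or older"},
    }
}

fields = extract_trial_fields(sample_study)
print(fields["nct_id"], fields["status"], fields["conditions"])
```

The same helper could be applied to records loaded from the local JSON file, since both sources share the `protocolSection` layout shown in this notebook.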


In [None]:
async def fetch_page(session, base_url, query, page_token=None):
    params = {"query.term": query}
    if page_token:
        params["pageToken"] = page_token

    async with session.get(base_url, params=params) as response:
        if response.status != 200:
            return None
        return await response.json()

async def fetch_trials_async(query: str) -> List[Dict]:
    base_url = "https://clinicaltrials.gov/api/v2/studies"
    headers = {"Accept": "application/json"}
    trials = []
    seen_pages = set()
    next_token = None

    async with aiohttp.ClientSession(headers=headers) as session:
        while True:
            data = await fetch_page(session, base_url, query, next_token)
            if not data:
                break

            studies = data.get("studies", [])
            if not studies:
                break

            for study in studies:
                protocol = study.get("protocolSection", {})
                status = protocol.get("statusModule", {}).get("overallStatus", "").upper()
                if status in { "RECRUITING"}:
                    trials.append({
                        "title": protocol.get("identificationModule", {}).get("briefTitle", ""),
                        "eligibility": protocol.get("eligibilityModule", {}).get("eligibilityCriteria", ""),
                        "conditions": protocol.get("conditionsModule", {}).get("conditions", []),
                        "nct_id": protocol.get("identificationModule", {}).get("nctId", ""),
                        "protocol": protocol
                    })

            next_token = data.get("nextPageToken")
            if not next_token or next_token in seen_pages:
                break
            seen_pages.add(next_token)

    return trials
# trials = await fetch_trials_async("")

In [16]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [19]:
with open("/content/drive/MyDrive/ctg-studies.json", "r", encoding="utf-8") as f:
    LOCAL_TRIALS = json.load(f)
LOCAL_TRIALS[0]

{'protocolSection': {'identificationModule': {'nctId': 'NCT05421520',
   'orgStudyIdInfo': {'id': 'EA-22-005'},
   'organization': {'fullName': 'Renmin Hospital of Wuhan University',
    'class': 'OTHER'},
   'briefTitle': 'Effectiveness of an Ai-based Endoscopic Ultrasound Navigation System in the Training of Endoscopic Ultrasonics',
   'officialTitle': 'Feasibility and Effectiveness of an Ai-based Endoscopic Ultrasound Navigation System in the Training of Endoscopic ultrasonics-a Prospective, Randomized, Multi-center Clinical Trial'},
  'statusModule': {'statusVerifiedDate': '2023-05',
   'overallStatus': 'NOT_YET_RECRUITING',
   'expandedAccessInfo': {'hasExpandedAccess': False},
   'startDateStruct': {'date': '2023-10-08', 'type': 'ESTIMATED'},
   'primaryCompletionDateStruct': {'date': '2024-05-31', 'type': 'ESTIMATED'},
   'completionDateStruct': {'date': '2024-05-31', 'type': 'ESTIMATED'},
   'studyFirstSubmitDate': '2022-05-21',
   'studyFirstSubmitQcDate': '2022-06-12',
   'st

##  Generate Unified Keyword Set from AWS + Stanza NER Outputs

We combine clinical keywords from various extracted columns (conditions, medications, tests, etc.) to create a **deduplicated keyword list** for each patient.

This combined list helps in downstream tasks like:
- Semantic search for trial matching
- Embedding generation
- UMLS linking or filtering

###  Columns Used for Keyword Generation:
- `aws_condition`, `aws_medication`, `aws_test_treatment_procedure`
- `stanza_condition`, `stanza_test`, `stanza_treatment`
- Plus demographics: `patient_uid`, `age`, `gender` (these add non-clinical tokens such as `year` and `f` to the keyword list)
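
Because `merged_df` is read back from CSV, the list columns arrive as strings such as `"['ASD', 'TTE']"`. Splitting those strings on commas also splits any phrase that contains a comma. One alternative, sketched below under the assumption that every cell is either a stringified Python list or plain text, is to parse the cells with `ast.literal_eval`. The helper names `parse_list_cell` and `build_keywords` and the sample `row` are hypothetical. This sketch also leaves out the demographic columns, which is a design choice for the example and not what the notebook does.

```python
import ast


def parse_list_cell(value):
    """Turn a stringified list (as read back from CSV) into a real list."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return [value]  # plain text: treat the whole cell as one item
        return parsed if isinstance(parsed, list) else [parsed]
    return []  # NaN or other non-text values contribute nothing


def build_keywords(row, columns):
    """Collect lowercase, de-duplicated keywords in first-seen order."""
    keywords = []
    for col in columns:
        for item in parse_list_cell(row.get(col)):
            keyword = str(item).strip().lower()
            if keyword and keyword not in keywords:
                keywords.append(keyword)
    return keywords


# Hand-written row that mimics the CSV round trip.
row = {
    "aws_condition": "['ASD', 'right ventricle was dilating']",
    "aws_medication": "[]",
    "stanza_condition": "['a secundum ASD', 'asd']",
}
keywords = build_keywords(row, ["aws_condition", "aws_medication", "stanza_condition"])
print(keywords)
```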


In [20]:
import re

import pandas as pd

# Columns whose values are combined into the keyword set
columns_to_combine = [
    'aws_condition', 'aws_medication', 'aws_test_treatment_procedure',
    'patient_uid', 'age', 'gender',
    'stanza_condition', 'stanza_test', 'stanza_treatment'
]

# Function to extract unique keywords from a row

def extract_unique_keywords(row):
    keywords = set()
    for col in columns_to_combine:
        value = row[col]
        if isinstance(value, list):
            items = [str(v).strip().lower() for v in value]
        elif isinstance(value, str):
            items = [w.strip().lower() for w in value.split(",")]
        else:
            items = [str(value).strip().lower()]

        for item in items:
            # Strip brackets and quote characters left over from stringified lists
            cleaned = re.sub(r"[\[\]\"']", "", item)
            # Keep the item only if it contains at least one letter or digit
            if cleaned and re.search(r"[a-zA-Z0-9]", cleaned):
                keywords.add(cleaned)

    return list(keywords)


# Apply function row-wise
merged_df["keywords"] = merged_df.apply(extract_unique_keywords, axis=1)

In [21]:
merged_df['keywords'][0]

['a second asd',
 'both defects',
 'uncomplicated placement',
 'transesophageal echocardiography',
 'systemic flow ratio',
 'tte',
 'asd',
 'post-procedure shunt',
 'a single posteroanterior projection chest x-ray',
 'residual shunt',
 'procedure',
 'the secundum asd',
 'the second procedure',
 'a small post-procedure shunt',
 'a 9 mm amplatzer septal occluder',
 'right ventricular systolic pressures',
 'defect',
 'left to right shunting',
 'a further device closure',
 'elevated right ventricular systolic pressures',
 'equivalent chest x-ray',
 'the second defect',
 'secundum asd',
 'the defect',
 'year',
 'obstructed flow',
 'percutaneous asd',
 'septal occluder',
 'defect was successfully closed with a 9 mm amplatzer septal occluder',
 'the procedure',
 'a 17 mm amplatzer septal occluder',
 'right ventricle was dilating',
 'pulmonary to systemic flow ratio',
 'tee',
 'aneurysmal formation of the interatrial septum',
 'f',
 'cardiac catheterisation',
 'repeat tte',
 'remained well',
 

In [22]:
def join_keywords(keyword_list):
    return ", ".join(sorted(set(map(str, keyword_list))))

merged_df["keywords_str"] = merged_df["keywords"].apply(join_keywords)

##  Step 1: Rule-Based Matching of Patient Keywords with Trial Text

For the first patient (`merged_df['keywords'][0]`), we loop through every clinical trial and check which of the patient's keywords (from AWS + Stanza NER) appear as substrings of the trial's text, which is built from:

- Trial title (`officialTitle`)
- `conditions`
- `eligibilityCriteria`

For each trial, we store:
- The matched phrases
- Trial metadata
- Match count

This step is a fast, heuristic filter to reduce the search space for more advanced matching later (e.g., LLMs or embeddings).
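
Plain substring checks can over-match: a short keyword such as `tte` is found inside "committee", and a one-letter token such as `f` is found in almost any text. The sketch below shows one possible refinement that matches only whole words or phrases and skips very short keywords. The `find_matches` helper, the `min_length` cutoff, and the sample text are assumptions for illustration, not part of the notebook.

```python
import re


def find_matches(keywords, trial_text, min_length=3):
    """Return keywords that occur in trial_text as whole words or phrases."""
    text = trial_text.lower()
    matches = []
    for phrase in keywords:
        phrase = phrase.lower().strip()
        if len(phrase) < min_length:
            continue  # skip tokens too short to be informative
        # Lookarounds require a non-word character (or string edge) on both sides.
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        if re.search(pattern, text):
            matches.append(phrase)
    return matches


# Hand-written example text.
trial_text = "The steering committee reviews each procedure. Female patients only."
keywords = ["tte", "f", "procedure", "secundum asd"]

plain = [k for k in keywords if k in trial_text.lower()]
bounded = find_matches(keywords, trial_text)
print("substring match:", plain)
print("whole-word match:", bounded)
```

With whole-word matching, the match count reflects clinical overlap more closely, at the cost of missing inflected forms (for example, "procedures" no longer matches `procedure`).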


In [23]:
matched_trials = []

for i, trial in enumerate(LOCAL_TRIALS):
    protocol = trial.get("protocolSection", {})
    title = protocol.get("identificationModule", {}).get("officialTitle", "")
    conditions = protocol.get("conditionsModule", {}).get("conditions", [])
    eligibility = protocol.get("eligibilityModule", {}).get("eligibilityCriteria", "")

    trial_text = f"{title}\n\nConditions: {', '.join(conditions)}\n\nEligibility:\n{eligibility}"
    trial_text_lower = trial_text.lower()
    matched_keywords = []

    for phrase in merged_df['keywords'][0]:
        if phrase.lower() in trial_text_lower:
            matched_keywords.append(phrase)

    matched_trials.append({
        "protocol" : protocol,
        "trial_index": i,
        "title": title,
        "conditions": conditions,
        "eligibility": eligibility,
        "matched_phrases": matched_keywords,
        "num_matches": len(matched_keywords)
    })

In [24]:
matched_trials_sorted = sorted(matched_trials, key=lambda x: x["num_matches"], reverse=True)

pd.DataFrame(matched_trials_sorted[:100])

Unnamed: 0,protocol,trial_index,title,conditions,eligibility,matched_phrases,num_matches
0,{'identificationModule': {'nctId': 'NCT0459139...,6452,Evaluation of the Safety and Efficacy of the r...,"[Heart Septal Defect, Heart Septal Defects, At...",Inclusion Criteria:\n\nAll responses must be Y...,"[tte, asd, procedure, defect, secundum asd, ye...",8
1,{'identificationModule': {'nctId': 'NCT0529614...,11278,A European Feasibility Study of the CroíValve ...,[Tricuspid Regurgitation],Inclusion Criteria:\n\n1. Presence of severe T...,"[transesophageal echocardiography, tte, asd, p...",8
2,{'identificationModule': {'nctId': 'NCT0571282...,27733,"A Multi-center, Open-label, Prospective Study ...",[Heart Failure],Inclusion Criteria:\n\n* General Inclusion Cri...,"[transesophageal echocardiography, tte, asd, p...",8
3,{'identificationModule': {'nctId': 'NCT0646570...,51567,"A Prospective, Multicenter, Randomized Control...",[Atrial Fibrillation],Inclusion Criteria:\n\n* The patient is a male...,"[tte, asd, procedure, defect, year, the proced...",8
4,{'identificationModule': {'nctId': 'NCT0658027...,55025,The GLACE Study: A Prospective Multi-center Si...,"[Atrial Fibrillation, Stroke Prevention]",Inclusion Criteria:\n\n1. Male or non-pregnant...,"[tte, asd, procedure, defect, year, the proced...",8
...,...,...,...,...,...,...,...
95,{'identificationModule': {'nctId': 'NCT0567478...,23459,Vestibular Implantation to Treat Adult-Onset B...,"[Other Disorders of Vestibular Function, Bilat...",Inclusion Criteria:\n\n1. Adults age 22-90 yea...,"[tte, procedure, defect, year, the procedure, f]",6
96,{'identificationModule': {'nctId': 'NCT0520533...,23546,"An Open-label, Single-arm, Phase Ib/IIa Trial ...","[Refractory Metastatic Colorectal Cancer, Soli...",Diagnosis and Main Criteria for Inclusion/Excl...,"[tte, procedure, defect, year, tee, f]",6
97,{'identificationModule': {'nctId': 'NCT0609973...,23608,Safety and Effectiveness of Pulmonary Vein isO...,[Atrial Fibrillation],Inclusion Criteria:\n\n* Symptomatic paroxysma...,"[tte, procedure, year, the procedure, tee, f]",6
98,{'identificationModule': {'nctId': 'NCT0640523...,23809,A Pilot/Exploratory Translational Study to Eva...,"[Lung Cancer, Non-Small Cell]",Inclusion Criteria:\n\n* Participants must hav...,"[tte, procedure, year, the procedure, f, chest...",6


##  Step 2: Semantic Matching with Mistral Embeddings

After rule-based keyword filtering, we now compute **dense vector embeddings** for patient keywords using **Mistral's embedding model**.

- Each patient's keywords are converted to a single string (`keywords_str`)
- Mistral's `mistral-embed` model is used to embed the text
- Embeddings will be compared with trial text embeddings later using similarity measures (e.g., cosine similarity or Euclidean distance)

Ensure you have a valid **Mistral API key** to proceed.


In [3]:
#!pip install mistralai

Collecting mistralai
  Downloading mistralai-1.6.0-py3-none-any.whl.metadata (30 kB)
Collecting eval-type-backport>=0.2.0 (from mistralai)
  Downloading eval_type_backport-0.2.2-py3-none-any.whl.metadata (2.2 kB)
Downloading mistralai-1.6.0-py3-none-any.whl (288 kB)
Downloading eval_type_backport-0.2.2-py3-none-any.whl (5.8 kB)
Installing collected packages: eval-type-backport, mistralai
Successfully installed eval-type-backport-0.2.2 mistralai-1.6.0


In [25]:
import os
import time

from mistralai import Mistral
from sklearn.metrics.pairwise import euclidean_distances

# The API key is read from the environment variable named "mistrial_api"
api_key = os.environ["mistrial_api"]

model = "mistral-embed"

client = Mistral(api_key=api_key)


def get_text_embedding(inputs):
    """Embed `inputs` and return the first embedding in the response."""
    embeddings_batch_response = client.embeddings.create(
        model=model,
        inputs=inputs
    )
    time.sleep(2)  # throttle successive requests
    return embeddings_batch_response.data[0].embedding
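The helper above sends one text per request and sleeps after each call. Because the embeddings endpoint accepts a list of inputs, a batched variant could reduce the number of requests. The sketch below is a generic batching helper, not the notebook's method: `embed_in_batches`, `embed_batch`, `batch_size`, and `pause_s` are hypothetical names, and the commented wrapper assumes the response items come back in input order.

```python
import time


def embed_in_batches(texts, embed_batch, batch_size=16, pause_s=0.0):
    """Embed `texts` by calling `embed_batch` on chunks of at most `batch_size`.

    `embed_batch` is any callable that maps a list of strings to a list of
    vectors of the same length (an assumption of this sketch).
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        vectors.extend(embed_batch(chunk))
        # Pause only between batches, not after the last one
        if pause_s and start + batch_size < len(texts):
            time.sleep(pause_s)
    return vectors


# Hypothetical wrapper around the Mistral client defined above (not executed here):
# def mistral_embed_batch(chunk):
#     response = client.embeddings.create(model=model, inputs=chunk)
#     return [item.embedding for item in response.data]
```

Keeping the API call behind a callable makes the batching logic testable without network access.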

In [26]:
merged_df["embeddings"] = merged_df["keywords_str"].apply(get_text_embedding)

In [27]:
merged_df.head()

Unnamed: 0,patient_id,ANATOMY,aws_condition,aws_medication,aws_test_treatment_procedure,patient_uid,medical history,age,gender,stanza_condition,stanza_test,stanza_treatment,keywords,keywords_str,embeddings
0,0,"['right ventricle', 'cardiac', 'right ventricu...","['ASD', 'right ventricle was dilating', 'eleva...",[],"['percutaneous ASD closure', 'TTE', 'cardiac c...",497050-1,A 34-year old woman was referred for considera...,"[[34.0, 'year']]",F,['mildly elevated right ventricular systolic p...,"['TTE', 'cardiac catheterisation', 'systemic f...","['percutaneous ASD closure', 'percutaneous ASD...","[a second asd, both defects, uncomplicated pla...","34.0, 497050-1, a 17 mm amplatzer septal occlu...","[-0.0482177734375, 0.044830322265625, 0.018310..."
1,1,"['right atrial', 'right ventricular', 'Left at...","['secundum ASD', 'breathlessness', 'defect', '...",[],"['TTE', 'Cardiac catheterisation', 'TEE', 'inf...",497050-2,A 31-year old woman was found to have a secund...,"[[31.0, 'year']]",F,"['a secundum ASD', 'breathlessness', 'The defe...","['TTE', 'further investigation', 'Cardiac cath...","['treatment', 'systemic flow shunt', 'a 24 mm ...","[treatment, the two devices, tte, intra-cardia...","31.0, 497050-2, a 24 mm amplatzer septal occlu...","[-0.03570556640625, 0.04681396484375, 0.011085..."
2,2,"['atrial septum', 'cardiac', 'fossa ovalis', '...","['palpitations', 'secundum ASD', 'atrial septu...",[],"['TEE', 'device closure', 'cardiac catheterisa...",497050-3,A 30-year old woman was investigated for palpi...,"[[30.0, 'year']]",F,"['palpitations', 'a secundum ASD', 'the atrial...","['TEE', 'cardiac catheterisation', 'Equivalent...","['device closure', 'The superior hole', 'a 16 ...","[septal occluders, 15 mm amplatzer septal occl...",15 mm amplatzer septal occluder device was sub...,"[-0.048095703125, 0.041778564453125, 0.0130615..."
3,3,"['abdominal', 'abdominal', 'bowel', 'abdomen',...","['abdominal distension', 'abdominal pain', 'bo...",[],"['Computed tomographic (CT) scan', 'neoadjuvan...",503399-1,A 65-year-old male presented with a three-week...,"[[65.0, 'year']]",M,['progressively worsening abdominal distension...,"['examination', 'Computed tomographic (CT) sca...","['neoadjuvant chemotherapy', 'radiation', 'a f...","[65.0, a heterogeneous nonlipomatous component...","503399-1, 65.0, a calcified, a dedifferentiate...","[-0.01367950439453125, 0.048858642578125, 0.09..."
4,4,"['beneath the right ear lobe', 'superficial pa...","['swelling beneath the right ear lobe', 'swell...",[],"['fine needle aspiration cytology (FNAC)', 'to...",509249-1,A 60 year old woman presented with a small swe...,"[[60.0, 'year']]",F,['a small swelling beneath the right ear lobe'...,"['A fine needle aspiration cytology (FNAC)', '...","['a total conservative parotidectomy', 'Gy adj...","[metastatic lesions, total conservative paroti...","509249-1, 60.0, a fine needle aspiration cytol...","[-0.0307769775390625, 0.0377197265625, 0.06988..."


## 🔗 Step 3: Generate Trial Embeddings for Matched Trials

We now embed the top 100 matched clinical trials using **Mistral's embedding model**.

For each trial, we concatenate:
- `title`
- `conditions`
- `eligibility criteria`

This creates a meaningful representation that can be semantically compared with the patient embedding.


In [29]:
trial_texts = []
trial_meta = []
trial_embeddings=[]

for trial in matched_trials_sorted[:100]:
    protocol = trial.get("protocol", {})
    title = protocol.get("identificationModule", {}).get("officialTitle", "")
    conditions = protocol.get("conditionsModule", {}).get("conditions", [])
    eligibility = protocol.get("eligibilityModule", {}).get("eligibilityCriteria", "")

    combined_text = f"{title}\n\nConditions: {', '.join(conditions)}\n\nEligibility:\n{eligibility}"
    trial_texts.append(combined_text)
    trial_meta.append({
        "protocol" : protocol,
        "title": title,
        "conditions": conditions,
        "eligibility": eligibility
    })

    trial_embeddings.append(get_text_embedding(combined_text))

##  Step 4: Cosine Similarity Matching (Patient - Trial)

Using `cos_sim` from **SentenceTransformers**, we compare the embedded patient profile with each trial’s embedded text (title + conditions + eligibility criteria).

Cosine similarity ranges from `-1` to `1`; the top matches for this patient score between roughly 0.83 and 0.88:
- Higher score → higher semantic alignment between patient profile and trial requirements.
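For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below is a plain-Python illustration of that formula, not the `cos_sim` implementation used in the next cell. Returning `0.0` for a zero vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors `a` and `b`."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: undefined case mapped to 0
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1.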


In [30]:
import torch
from sentence_transformers.util import cos_sim


patient_embedding=merged_df["embeddings"][0]

# Example for one patient
matches = []
for i, trial_emb in enumerate(trial_embeddings):
    score = cos_sim(patient_embedding, trial_emb).item()
    matches.append({
        "score": score,
        "title": trial_meta[i]["title"],
        "conditions": trial_meta[i]["conditions"],
        "protocol" : trial_meta[i].get("protocol", {}),
        "eligibility": trial_meta[i]["eligibility"][:300] + "..."
    })

# Sort by match score
matches_sorted = sorted(matches, key=lambda x: x["score"], reverse=True)

# View the top 5 matches
import pandas as pd
pd.DataFrame(matches_sorted[:5])

Unnamed: 0,score,title,conditions,protocol,eligibility
0,0.87827,Evaluation of the Safety and Efficacy of the r...,"[Heart Septal Defect, Heart Septal Defects, At...",{'identificationModule': {'nctId': 'NCT0459139...,Inclusion Criteria:\n\nAll responses must be Y...
1,0.859271,Comparative Effectiveness of Three Devices for...,[Atrial Septal Defect],{'identificationModule': {'nctId': 'NCT0429189...,Inclusion Criteria:\n\n* ≥18 years old referre...
2,0.856861,An International Multicenter Study on Transcat...,[Perimembranous Ventricular Septal Defect],{'identificationModule': {'nctId': 'NCT0682363...,Inclusion Criteria:\n\n1. Patients with perime...
3,0.834031,Use of 4D Intracardiac Echocardiography (4D-IC...,"[Atrial Fibrillation, Stroke]",{'identificationModule': {'nctId': 'NCT0606175...,Inclusion Criteria:\n\n* Male/Females\n* Age:1...
4,0.829945,GORE® CARDIOFORM Septal Occluder and Antiplate...,"[Stroke, PFO - Patent Foramen Ovale]",{'identificationModule': {'nctId': 'NCT0382112...,Inclusion Criteria:\n\n* Diagnosed with an isc...


In [31]:
merged_df['medical history'][0]

'A 34-year old woman was referred for consideration of percutaneous ASD closure. The ASD had been diagnosed when the patient was 12 years old and TTE had suggested that the right ventricle was dilating. At cardiac catheterisation there were mildly elevated right ventricular systolic pressures and a pulmonary to systemic flow ratio of over two. The secundum ASD was estimated to be 15 mm wide using TTE with aneurysmal formation of the interatrial septum. The patient was admitted for percutaneous ASD closure and underwent uncomplicated placement of a 17 mm Amplatzer septal occluder. Transesophageal echocardiography (TEE) during the procedure revealed the presence of a second ASD near the inferior vena cava and a small post-procedure shunt. The septal occluder did not completely cover both defects. Equivalent chest x-ray radiation dose (assuming a single posteroanterior projection chest x-ray is eight centi-Gray/cm2) was 400. Repeat TTE continued to demonstrate left to right shunting and t

##  Step 5: LLM-Powered Trial Matching using Claude

We use Anthropic's **Claude 3 Sonnet** large language model to perform detailed, explainable trial-patient matching. This complements the earlier rule-based and embedding-based steps by reasoning over complex eligibility criteria.

###  What It Does:
- Takes a selected patient’s clinical note
- Evaluates them against the **top N matched clinical trials**
- Uses Claude to reason through:
  - Inclusion/exclusion criteria
  - Medical history
  - Procedures already performed
  - Missing or inferred information
- Returns:
  - Match Score (0.0–1.0)
  - Final Match Decision (YES/NO)
  - Explanation for transparency

###  Claude Prompt Includes:
- Structured trial metadata (`eligibilityCriteria`, `conditions`, `title`)
- Free-text patient clinical history
- Step-by-step matching tasks with justification for each criterion
- Final eligibility reasoning

###  Output:
A list of trial recommendations, sorted by `match_score`, each containing:
- `trial_id`
- `title`
- `condition`
- `match` (YES/NO)
- `match_score`
- `explanation` (why the patient qualifies or doesn’t)

This LLM-driven approach attaches an explanation to every eligibility decision, which is especially useful for clinicians reviewing those decisions.

In [62]:
import re
import json
import time


# Helper to clean Claude's JSON response
def parse_json_response(content: str):
    try:
        clean_content = re.sub(r'^```json\n?|```$', '', content.strip(), flags=re.MULTILINE)
        return json.loads(clean_content)
    except Exception:
        return None

# Core matching function: evaluate the top_n embedding-ranked trials with Claude
def match_top_trials_with_claude(patient_id: int, top_n: int = 20):
    matches = []
    patient_row = patient_df[patient_df["patient_id"] == patient_id]

    if patient_row.empty:
        return [{"error": "Patient not found"}]

    row = patient_row.iloc[0]
    text = row["patient"]

    for study in matches_sorted[:top_n]:
        time.sleep(6)  # To respect Claude's rate limits
        protocol = study.get("protocol", {})
        nct_id = protocol.get("identificationModule", {}).get("nctId")
        title = protocol.get("identificationModule", {}).get("briefTitle")
        eligibility = protocol.get("eligibilityModule", {}).get("eligibilityCriteria", "")
        conditions = " ".join(protocol.get("conditionsModule", {}).get("conditions", []))

        prompt = f"""
You are a clinical trial matching assistant. Given the following information, determine whether the patient qualifies for the clinical trial.

### Patient Information:
{text}

### Trial Eligibility Criteria:
 {json.dumps(title)} {json.dumps(eligibility)} {json.dumps(conditions)}

Your tasks:
  1. Summarize the patient info in bullet points and consider it the most recent update.
  2. Summarize the inclusion and exclusion criteria.
  3. Check each inclusion criterion against patient details (YES/NO + reasoning).
  4. Check exclusion criteria for conflicts (YES/NO + reasoning).
  5. If information is missing, say so explicitly with a cautious assumption.
  6. Estimate a match score (0.0 to 1.0) using cosine similarity reasoning. The score should match the final YES/NO decision.
  7. If the patient has already received a treatment that the trial plans to give, mark ineligible unless the criteria allow it.
  8. Final YES/NO decision with explanation.

Return ONLY valid JSON, no markdown. Format:
{{
  "trial_id": "{nct_id}",
  "title": "{title}",
  "condition": "{conditions}",
  "match": "YES" or "NO",
  "match_score": float between 0.0 and 1.0,
  "explanation": "Brief explanation on eligibility decision based on the patient's current status"
}}
""".strip()

        try:
            message = client_claude.messages.create(
                max_tokens=1024,
                temperature=0.0,
                messages=[{"role": "user", "content": prompt}],
                model="anthropic.claude-3-sonnet-20240229-v1:0",
            )
            response = message.content[0].text
            result = parse_json_response(response)
            if result:
                matches.append(result)
            else:
                # The call succeeded but the reply was not parseable JSON
                print(f"❌ Could not parse Claude's response for trial {title}")

        except Exception as e:
            print(f"❌ Error matching with trial {title}: {e}")
            continue

    # Fall back to 0.0 if a result lacks match_score, so sorting cannot raise KeyError
    return sorted(matches, key=lambda x: x.get("match_score", 0.0), reverse=True)
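LLM replies sometimes wrap the JSON in prose or code fences, and the prompt asks for a score that agrees with the YES/NO decision. The sketch below shows one possible way to handle both: it scans the reply for the first decodable JSON object and then checks the decision against the score. The names `extract_json_object` and `is_consistent`, and the `0.5` threshold, are assumptions made for illustration and are not part of the original notebook.

```python
import json


def extract_json_object(content):
    """Return the first JSON object found in `content`, or None."""
    decoder = json.JSONDecoder()
    for pos, char in enumerate(content):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `pos`
            obj, _ = decoder.raw_decode(content, pos)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


def is_consistent(result, threshold=0.5):
    """Check that `match` and `match_score` agree (threshold is assumed)."""
    score = result.get("match_score")
    match = result.get("match")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return False
    if not 0.0 <= score <= 1.0 or match not in ("YES", "NO"):
        return False
    return (match == "YES") == (score >= threshold)
```

Results that fail the consistency check could be logged for manual review rather than silently ranked.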


In [63]:
results = match_top_trials_with_claude(patient_id=0)


In [64]:
results_df=pd.DataFrame(results)
results_df.head(20)

Unnamed: 0,trial_id,title,condition,match,match_score,explanation
0,NCT04591392,Safety and Efficacy Study of reSept ASD Occlud...,"Heart Septal Defect Heart Septal Defects, Atri...",NO,0.4,The patient has already undergone percutaneous...
1,NCT04291898,Comparison of Devices for Atrial Septal Defect...,Atrial Septal Defect,NO,0.4,The patient does not meet the inclusion criter...
2,NCT04881578,Atrial Fibrillation Before and After Patent Fo...,"Cryptogenic Stroke Foramen Ovale, Patent Atria...",NO,0.4,The patient does not meet the inclusion criter...
3,NCT05712824,V-Lap System For Wirelessly Measuring And Moni...,Heart Failure,NO,0.3,The patient does not appear to have heart fail...
4,NCT04684212,Can the Lambre Device Occlude IRRegular And La...,Atrial Fibrillation Left Atrial Appendage Thro...,NO,0.3,The patient does not meet several key inclusio...
5,NCT06061757,Use of 4D-ICE in Conjunction With TEE for Left...,Atrial Fibrillation Stroke,NO,0.25,The patient does not have a history of atrial ...
6,NCT06823635,An International Multicenter Study on Transcat...,Perimembranous Ventricular Septal Defect,NO,0.2,The patient does not have a perimembranous ven...
7,NCT03821129,GORE® CARDIOFORM Septal Occluder and Antiplate...,Stroke PFO - Patent Foramen Ovale,NO,0.2,The patient does not meet the key inclusion cr...
8,NCT05951101,Zenith LAA Occlusion System,Atrial Fibrillation Left Atrial Appendage Stroke,NO,0.2,The patient does not have a documented diagnos...
9,NCT06707688,Abbott Medical - VERITAS Study,Atrial Fibrillation (AF) Stoke,NO,0.2,The patient does not have atrial fibrillation ...
