## Demo: Leveraging Large Language Models (LLMs) to Validate Medical Claims with PubMed Research

This notebook demonstrates a workflow for validating medical claims using Large Language Models (LLMs) alongside scientific evidence sourced from PubMed. The steps are:

1. Claim Definition: Define a medical claim, provide an edge from a knowledge graph, or propose a hypothesis that you want to verify.

2. Evidence Retrieval: Use Milvus to search for and retrieve relevant sentences from PubMed articles.

3. Claim Verification: Apply LLMs to assess the accuracy of the medical claim (defined in step 1) based on the retrieved evidence (from step 2).

4. Result Analysis: Present the responses returned by the LLMs and summarize them with simple statistics, such as the fraction of models that support the claim.

##### After launching the Milvus container in Docker, wait until at least **1** node is ready to load the collections.

In [1]:
from pymilvus import connections
from pymilvus import utility

connections.connect(
  alias="default",
  uri="http://localhost:19530",
  token="root:Milvus",
)
info = utility.describe_resource_group(name='__default_resource_group')
num_available_node = info.num_available_node
print(f"Node availability: {num_available_node}")

Node availability: 1


##### If you see at least "1" printed from the cell above, you may continue running the following cells. Otherwise, wait a moment and rerun the cell above until at least "1" appears.
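
Instead of rerunning the cell by hand, the check can be wrapped in a small polling loop. The sketch below is illustrative only: `wait_until` is a hypothetical helper, not part of `pymilvus` or of this notebook's `utils` module, and the timeout and interval values are arbitrary.

```python
import time

def wait_until(check, timeout_s=60.0, interval_s=2.0):
    """Call check() repeatedly until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        value = check()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        time.sleep(interval_s)

# Usage with the connection opened above (requires a running Milvus instance):
# from pymilvus import utility
# nodes = wait_until(
#     lambda: utility.describe_resource_group(
#         name="__default_resource_group"
#     ).num_available_node
# )
```

The helper takes the check as a callable so the same loop can be reused for other readiness conditions.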

In [3]:
from pymilvus import MilvusClient
import time
import requests
import json
from utils import extract_non_think, generate_claim, embed_sentence

##### The following cell defines:

1. The medical claim or edge you'd like to verify.
2. The Milvus collection name (pubmed_sentence_XX, where XX ranges from 00 to 09) to search for supporting evidence.
3. The Large Language Model(s) you'd like to use for claim verification.

In [26]:
# Provide an edge in the following format
edge = {'subject': 'ginger',
        'object': 'nausea',
        'predicate': 'Biolink:treats'}
claim = generate_claim(edge)

# Or provide the claim directly as a sentence
# claim = 'ginger treats nausea'

# Vectorize the claim for the semantic search in the next step
claim_vector = embed_sentence(claim)

which_collection = 'pubmed_sentence_03'
LLMs = ['phi4', 'gemma3:4b', 'deepseek-r1:8b', 'llama3.1:8b', 'mistral:7b']
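
`generate_claim` comes from the notebook's `utils` module, whose source is not shown here. Judging only from the prompt printed further down (`Claim: ginger treats nausea`), it appears to turn an edge into a short sentence. A minimal sketch of that idea follows; `edge_to_claim` is a hypothetical stand-in, and the real helper may handle predicates differently.

```python
def edge_to_claim(edge):
    """Turn a subject/predicate/object edge into a plain-English claim.

    Hypothetical stand-in for utils.generate_claim, which is not shown.
    """
    # Drop a namespace prefix such as "Biolink:" and turn snake_case into words.
    predicate = edge["predicate"].split(":", 1)[-1].replace("_", " ")
    return f"{edge['subject']} {predicate} {edge['object']}"

print(edge_to_claim({"subject": "ginger",
                     "object": "nausea",
                     "predicate": "Biolink:treats"}))
# prints: ginger treats nausea
```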

##### Connect to the milvus-standalone container and load the collection specified above by `which_collection`. (This might take a while - about 1 minute)

In [29]:
client = MilvusClient(
    uri="http://localhost:19530",
    token="root:Milvus"
)

start = time.time()
client.load_collection(
    collection_name=which_collection,
    # replica_number=1
)
end = time.time()
print(f"execution time: {(end - start):.2f}s") 

execution time: 50.92s


##### Perform a semantic search using your claim to retrieve relevant sentences from the Milvus collection with `client.search()`.

In [31]:
start = time.time()
# semantic search
res = client.search(
    collection_name=which_collection,  # target collection
    data=claim_vector,  # query vectors
    limit=30,  # number of returned entities
    search_params={
        # range search: keep only hits whose score lies between
        # radius and range_filter
        "params": {
            "radius": 0.75,
            "range_filter": 1.0
        }
    },
    output_fields=["sentence", "pmid"],  # specifies fields to be returned
)
end = time.time()
print(f"execution time: {(end - start):.2f}s")
pmids = {hit['entity']['pmid'] for hit in res[0]}
context = [hit['entity']['sentence'] for hit in res[0]]
print(f"{len(context)} relevant sentences were retrieved from a subset of PubMed.")

execution time: 0.11s
15 relevant sentences were retrieved from a subset of PubMed.
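
The search returned 15 sentences even though `limit=30`, because `radius` and `range_filter` restrict hits to a score window. The pure-Python sketch below illustrates that filtering. It assumes the collection uses a similarity metric such as cosine or inner product, where larger scores mean closer matches and Milvus keeps hits with `radius < score <= range_filter`. The metric is not stated in this notebook, and `within_range` and its toy scores are made up for illustration.

```python
def within_range(hits, radius, range_filter, limit):
    """Keep (score, text) hits with radius < score <= range_filter, best first."""
    kept = [h for h in hits if radius < h[0] <= range_filter]
    kept.sort(key=lambda h: h[0], reverse=True)
    return kept[:limit]

# Made-up scores for illustration only; not taken from the notebook's results.
toy_hits = [(0.91, "a"), (0.74, "b"), (0.80, "c"), (0.75, "d")]
print(within_range(toy_hits, radius=0.75, range_filter=1.0, limit=30))
# prints: [(0.91, 'a'), (0.8, 'c')]
```

Under this assumption, `limit` acts only as an upper bound: fewer results come back whenever fewer sentences clear the `radius` threshold.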


##### Then, generate a prompt using these retrieved sentences.

In [34]:
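# Join the sentences outside the f-string: a backslash inside an
# f-string expression is a syntax error before Python 3.12.
context_text = "\n".join(context)
prompt = f"""Claim: {claim}
Context:
{context_text}
Question: Does the context support the claim? ***Just return Yes or No.***
"""
print(f"The prompt used for the LLM queries -\n{prompt}")

The prompt used for the LLM queries -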
Claim: ginger treats nausea
Context:
Efficacy of ginger for nausea and vomiting: a systematic review of randomized clinical trials.
Ginger for nausea.
We have performed a systematic review of the evidence from randomized controlled trials for or against the efficacy of ginger for nausea and vomiting.
Ginger (Zingiber officinale) has been used to ameliorate symptoms of nausea.
Ginger (Zingiber officinale) is often advocated as beneficial for nausea and vomiting.
Comparison of efficacy of ginger with various antimotion sickness drugs.
Taking ginger for nausea and vomiting during pregnancy.
Ginger effectively reduces nausea, tachygastric activity, and vasopressin release induced by circular vection.
To determine the effectiveness of ginger for the treatment of nausea and vomiting of pregnancy.
Is ginger root effective for decreasing the severity of nausea and vomiting in early pregnancy?
Ginger for nausea and vomiting in pregnancy: randomized, do

##### Finally, query the LLM(s) you specified earlier, collect their responses, and record the results for statistical analysis.

In [50]:
LLM_url = "http://localhost:11434/api/generate"
headers = {
    "Content-Type": "application/json"
}
results = []
responses = []
start = time.time()
for LLM in LLMs:
    data = {
        "model": LLM,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(LLM_url, headers=headers, data=json.dumps(data))
    if response.status_code == 200:
        actual_response = response.json()['response'].strip()
        # keep only the non-thinking part of a deepseek-r1 reply
        if LLM == 'deepseek-r1:8b':
            actual_response = extract_non_think(actual_response)
        responses.append(f"{LLM}: {actual_response}")
        # record a vote: 1 for "yes", 0 for "no"
        if actual_response[:3].lower() == 'yes':
            results.append(1)
        elif actual_response[:2].lower() == 'no':
            results.append(0)
        else:
            print(f"Error: not a proper answer from {LLM}", actual_response)
    else:
        print("Error", LLM, response.status_code, response.text)
end = time.time()
print(f"execution time: {(end - start):.2f}s")
# guard against division by zero when no model returned a usable answer
score = sum(results) / len(results) if results else float('nan')
responses_text = "\n".join(responses)
print(f"{len(results)} LLMs were queried and returned responses -\n{responses_text}.\nThe confidence score for this edge being correct is {score},\nwith the evidence {pmids}")


execution time: 10.72s
5 LLMs were queried and returned responses -
phi4: Yes. The context supports the claim that ginger treats nausea through various systematic reviews, randomized controlled trials, and studies indicating its efficacy in reducing nausea and vomiting associated with different conditions such as pregnancy, gastrointestinal illnesses, and motion sickness. However, it also notes exceptions where ginger was not effective, like postoperative nausea after laparoscopic surgery. Overall, there is substantial support for the claim within the provided context.
gemma3:4b: Yes
deepseek-r1:8b: Yes
llama3.1:8b: Yes.
mistral:7b: Yes.
The confidence score for this edge being correct is 1.0,
with the evidence {12651648, 12371300, 10446026, 12233808, 12576305, 11509171, 11275030, 11876024, 11538042, 10793599}
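
The `deepseek-r1:8b` reply is passed through `extract_non_think`, another `utils` helper whose source is not shown. Assuming the model wraps its reasoning in `<think>...</think>` tags before the final answer, a helper of this kind could look like the sketch below. `strip_think` is a hypothetical name, and the real implementation may differ.

```python
import re

# Matches a <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_think(text):
    """Remove <think>...</think> reasoning blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()

print(strip_think("<think>\nThe context cites several trials.\n</think>\n\nYes"))
# prints: Yes
```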


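##### Alternatively, run the same queries concurrently with a thread pool. Responses are collected in completion order, so they may not follow the order of `LLMs`. In the runs shown, this took 7.47s versus 10.72s for the sequential loop.

In [60]: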
from concurrent.futures import ThreadPoolExecutor, as_completed

LLM_url = "http://localhost:11434/api/generate"
headers = {"Content-Type": "application/json"}

def query_llm(llm_name, prompt):
    data = {
        "model": llm_name,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(LLM_url, headers=headers, data=json.dumps(data))
    if response.status_code == 200:
        response_data = response.json()
        actual_response = response_data['response'].strip()
        if llm_name == 'deepseek-r1:8b':
            actual_response = extract_non_think(actual_response)
        return llm_name, actual_response
    else:
        return llm_name, f"Error {response.status_code}: {response.text}"

responses = []
results = []

start = time.time()
with ThreadPoolExecutor(max_workers=len(LLMs)) as executor:
    futures = [executor.submit(query_llm, llm, prompt) for llm in LLMs]

    for future in as_completed(futures):
        llm_name, actual_response = future.result()
        responses.append(f"{llm_name}: {actual_response}")
        
        if actual_response.lower().startswith('yes'):
            results.append(1)
        elif actual_response.lower().startswith('no'):
            results.append(0)
        else:
            print(f"Error: not a proper answer from {llm_name}", actual_response)
end = time.time()
print(f"execution time: {(end - start):.2f}s")
# guard against division by zero when no model returned a usable answer
score = sum(results) / len(results) if results else float('nan')
responses_text = "\n".join(responses)
print(f"{len(results)} LLMs were queried and returned responses -\n{responses_text}.\nThe confidence score for this edge being correct is {score},\nwith the evidence {pmids}")


execution time: 7.47s
5 LLMs were queried and returned responses -
mistral:7b: Yes
gemma3:4b: Yes
llama3.1:8b: Yes.
phi4: Yes. 

The context provides multiple references to studies and trials indicating that ginger is effective in treating nausea in various scenarios, such as during pregnancy and due to motion sickness, although it notes one exception related to postoperative nausea. Overall, the evidence supports the claim that ginger treats nausea.
deepseek-r1:8b: Yes.
The confidence score for this edge being correct is 1.0,
with the evidence {12651648, 12371300, 10446026, 12233808, 12576305, 11509171, 11275030, 11876024, 11538042, 10793599}
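
Both cells decide the vote with a prefix check, which would also count a reply such as "Not enough evidence" as "no". A slightly stricter parser is sketched below, assuming the verdict is always the first word of the reply. `parse_verdict` and `agreement_score` are hypothetical helpers, not part of the notebook's `utils` module.

```python
import re

def parse_verdict(answer):
    """Return 1 for a leading 'yes', 0 for a leading 'no', None otherwise."""
    # \b requires the word to end there, so "Not ..." is not read as "no".
    match = re.match(r"\s*(yes|no)\b", answer, flags=re.IGNORECASE)
    if match is None:
        return None
    return 1 if match.group(1).lower() == "yes" else 0

def agreement_score(answers):
    """Fraction of parseable answers that are 'yes'; None if none could be parsed."""
    votes = [v for v in map(parse_verdict, answers) if v is not None]
    return sum(votes) / len(votes) if votes else None

# Toy replies for illustration; the unparseable last one is ignored.
print(agreement_score(["Yes.", "yes", "Yes", "No, it does not.", "Not enough evidence"]))
# prints: 0.75
```

Returning `None` for unparseable replies keeps them out of the denominator, which matches how the cells above skip answers that are neither "yes" nor "no".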


##### (Optional) Release the collection to free RAM.

In [62]:
client.release_collection(
    collection_name=which_collection
)

res = client.get_load_state(
    collection_name=which_collection
)

print(res)

{'state': <LoadState: NotLoad>}
