# Test Query Expansion

This notebook checks whether the expanded PubMed queries retrieve every test-case article.

## Test Cases:
1. **NRF2**: PMC7234996 / PMID:32424161 (Neoaves KEAP1), PMID:28612944 (SKN-1, C. elegans)
2. **SOX2**: SuperSOX study (PMID:38141611)
3. **APOE**: APOE2/3/4 variants and longevity
4. **OCT4**: OCT6→OCT4 conversion (EMBR study, PMID:28007765)

In [1]:
import sys
from pathlib import Path

project_root = Path.cwd().parent
sys.path.append(str(project_root))

from src.tools.pubmed import PubMed
import json

# Initialize PubMed client
pubmed = PubMed()

## TEST CASE 1: NRF2

Expected to find:
- PMID:32424161: Neoaves KEAP1 mutation → overactive NRF2
- PMID:28612944: SKN-1 (NRF2 ortholog) increases lifespan in C. elegans

In [2]:
# Build expanded query for NRF2
nrf2_query = pubmed.build_search_query("NRF2")
print("NRF2 Query:")
print(nrf2_query)

NRF2 Query:
NRF2[TIAB] AND (aging[TIAB] OR longevity[TIAB] OR lifespan[TIAB] OR healthspan[TIAB] OR life span[TIAB] OR centenarian[TIAB] OR survival[TIAB] OR senescence[TIAB] OR age-related[TIAB]) NOT (Review[PT]) AND hasabstract
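
The printed query follows a simple template: the gene symbol, an OR-group of aging terms, a review exclusion, and an abstract filter. The sketch below reproduces that string. `sketch_build_query` and `AGING_TERMS` are hypothetical names used only for illustration; the real `PubMed.build_search_query` may be implemented differently, and this notebook does not show which terms `include_reprogramming=True` adds.

```python
# Hypothetical reconstruction of the query template, inferred from the printed output.
AGING_TERMS = [
    "aging", "longevity", "lifespan", "healthspan", "life span",
    "centenarian", "survival", "senescence", "age-related",
]


def sketch_build_query(gene, extra_terms=()):
    """Return a PubMed query string restricted to title/abstract matches."""
    terms = list(AGING_TERMS) + list(extra_terms)
    topic = " OR ".join(f"{term}[TIAB]" for term in terms)
    return f"{gene}[TIAB] AND ({topic}) NOT (Review[PT]) AND hasabstract"


print(sketch_build_query("NRF2"))
```

Because the gene symbol is matched only in the title and abstract, a paper that never names the queried gene there cannot be retrieved by this template, whatever the result limit.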


In [8]:
# Search for NRF2 papers
nrf2_pmids = pubmed.search(nrf2_query, 130)
print(f"Found {len(nrf2_pmids)} papers for NRF2")

Found 130 papers for NRF2


In [9]:
# Check if our target papers are in the results
target_nrf2 = ["28612944", "32424161"]

print("Checking for target NRF2 papers:")
for target in target_nrf2:
    if target in nrf2_pmids:
        print(f"position:{nrf2_pmids.index(target)}")
        print(f"✓ FOUND: PMID {target}")
    else:
        print(f"✗ MISSING: PMID {target}")

Checking for target NRF2 papers:
position:119
✓ FOUND: PMID 28612944
✗ MISSING: PMID 32424161
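
The same found/missing check is repeated for each gene in this notebook, so it could be factored into one helper. This is a minimal sketch; `check_targets` is a hypothetical name and not part of `src.tools.pubmed`.

```python
def check_targets(result_pmids, targets):
    """Map each target PMID to its 0-based position in result_pmids, or None if absent."""
    positions = {}
    for target in targets:
        if target in result_pmids:
            positions[target] = result_pmids.index(target)
        else:
            positions[target] = None
    return positions


# Toy data for illustration only; in the notebook this would be nrf2_pmids.
example = check_targets(["111", "222", "333"], ["222", "999"])
for pmid, position in example.items():
    status = f"FOUND at position {position}" if position is not None else "MISSING"
    print(f"PMID {pmid}: {status}")
```

Returning a dictionary rather than printing directly makes it possible to assert on the result, for example `check_targets(nrf2_pmids, target_nrf2)["28612944"] is not None`.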


In [10]:
# Fetch metadata for all NRF2 papers
print(f"Fetching metadata for {len(nrf2_pmids)} NRF2 papers...")
nrf2_papers = pubmed.fetch(nrf2_pmids)
print(f"✓ Fetched {len(nrf2_papers)} papers\n")

# Initialize screening
from src.tools.screening import Screening
from tqdm import tqdm

screening = Screening()

# Screen all papers with a progress bar
print("Screening papers for sequence→function→aging links...")
print(f"Screening {len(nrf2_papers)} NRF2 papers for sequence→function→aging links...")
results = []

for paper in tqdm(nrf2_papers, desc="Screening"):
    result = screening.screen_paper(
        title=paper.get("title", ""),
        abstract=paper.get("abstract", ""),
        keywords=paper.get("mesh_terms", [])
    )
    # Add paper metadata to result
    result["pmid"] = paper.get("pmid", "")
    result["title"] = paper.get("title", "")
    result["year"] = paper.get("year", "")
    results.append(result)

# Sort by score (highest first)
results_sorted = sorted(results, key=lambda x: x["score"], reverse=True)

# Display top 10
print("\n" + "="*80)
print("TOP 10 PAPERS BY RELEVANCE SCORE")
print("="*80 + "\n")

for i, paper in enumerate(results_sorted[:10], 1):
    print(f"{i}. PMID: {paper['pmid']} | Score: {paper['score']:.2f} | Year: {paper['year']}")
    print(f"   Title: {paper['title'][:100]}...")
    print(f"   Reasoning: {paper['reasoning']}")
    print()

# Check if target papers made it to top rankings
print("="*80)
print("TARGET PAPER RANKINGS")
print("="*80 + "\n")

target_nrf2 = ["28612944"] 
for target_pmid in target_nrf2:
    for rank, paper in enumerate(results_sorted, 1):
        if paper["pmid"] == target_pmid:
            print(f"✓ PMID {target_pmid}: Rank #{rank} | Score: {paper['score']:.2f}")
            print(f"  Reasoning: {paper['reasoning']}")
            break

Fetching metadata for 130 NRF2 papers...
✓ Fetched 130 papers

Screening papers for sequence→function→aging links...
Screening 130 NRF2 papers for sequence→function→aging links...


Screening: 100%|██████████| 130/130 [06:41<00:00,  3.09s/it]


TOP 10 PAPERS BY RELEVANCE SCORE

1. PMID: 27259148 | Score: 0.90 | Year: 2016
   Title: Repression of the Antioxidant NRF2 Pathway in Premature Aging....
   Reasoning: The paper provides strong evidence for a sequence change (progerin mutation) leading to a functional change (NRF2 sequestration and impaired transcriptional activity) and a measurable aging-related phenotypic effect (premature aging defects in HGPS), with clear mechanistic details and quantitative outcomes.

2. PMID: 40999940 | Score: 0.90 | Year: 2025
   Title: Enhancing Late-Life Survival and Mobility via Mitohormesis by Reducing Mitochondrial Calcium Levels....
   Reasoning: The paper demonstrates a causal chain from genetic knockdown of mcu-1 (sequence change) to reduced mitochondrial calcium levels (functional change) to extended lifespan and improved mobility (aging-related phenotypic effect), with a clear mechanistic explanation involving ROS-mediated signaling and antioxidant pathways.

3. PMID: 37751046 | Scor




In [6]:
# Test the stripping logic directly
test_content = '```json\n{\n  "relevant": true,\n  "score": 0.9,\n  "reasoning": "Test"\n}\n```'

print("Original:")
print(repr(test_content))

# Apply the same stripping logic
content = test_content.strip()

if content.startswith("```"):
    lines = content.split("\n")
    if lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]
    content = "\n".join(lines).strip()

print("\nAfter stripping:")
print(repr(content))

# Try to parse
import json
try:
    result = json.loads(content)
    print("\n✓ Successfully parsed!")
    print(result)
except Exception as e:
    print(f"\n✗ Failed: {e}")

Original:
'```json\n{\n  "relevant": true,\n  "score": 0.9,\n  "reasoning": "Test"\n}\n```'

After stripping:
'{\n  "relevant": true,\n  "score": 0.9,\n  "reasoning": "Test"\n}'

✓ Successfully parsed!
{'relevant': True, 'score': 0.9, 'reasoning': 'Test'}
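
The stripping logic tested above can be wrapped in a reusable function so that the screening code and its tests share one implementation. This is a sketch; `parse_fenced_json` is a hypothetical name, and it assumes the model reply contains at most one surrounding Markdown fence.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this block readable


def parse_fenced_json(raw):
    """Parse JSON from a reply that may be wrapped in a Markdown code fence."""
    content = raw.strip()
    if content.startswith(FENCE):
        lines = content.split("\n")[1:]  # drop the opening fence line (may carry a language tag)
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        content = "\n".join(lines).strip()
    return json.loads(content)


wrapped = FENCE + 'json\n{"relevant": true, "score": 0.9, "reasoning": "Test"}\n' + FENCE
print(parse_fenced_json(wrapped))
```

Unfenced JSON passes through unchanged, and malformed input still raises `json.JSONDecodeError`, so the caller decides how to handle a failed parse.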


## TEST CASE 2: SOX2

Expected to find:
- PMID:38141611: Modified SOX2 with enhanced reprogramming capabilities

In [None]:
# Build expanded query for SOX2 (with reprogramming terms)
sox2_query = pubmed.build_search_query("SOX2", include_reprogramming=True)
print("SOX2 Query (with reprogramming):")
print(sox2_query)

In [None]:
# Search for SOX2 papers
sox2_pmids = pubmed.search(sox2_query, 100)
print(f"Found {len(sox2_pmids)} papers for SOX2")

In [None]:
# Check if our target papers are in the results
target_sox2 = ["38141611"]

print("Checking for target SOX2 papers:")
for target in target_sox2:
    if target in sox2_pmids:
        print(f"position:{sox2_pmids.index(target)}")
        print(f"✓ FOUND: PMID {target}")
    else:
        print(f"✗ MISSING: PMID {target}")

## TEST CASE 3: APOE

Expected to find:
- APOE2: protective variant associated with longevity
- APOE3: common neutral variant
- APOE4: risk variant for Alzheimer's and reduced longevity

## TEST CASE 4: OCT4/OCT6

Expected to find:
- EMBR study (PMID:28007765): Converting OCT6 into a reprogramming factor through sequence modifications

In [None]:
# Build expanded query for OCT4 (with reprogramming terms)
oct4_query = pubmed.build_search_query("OCT4", include_reprogramming=True)
print("OCT4 Query (with reprogramming):")
print(oct4_query)

In [None]:
# Search for OCT4 papers
oct4_pmids = pubmed.search(oct4_query, 800)
print(f"Found {len(oct4_pmids)} papers for OCT4")

In [None]:
# Check if our target papers are in the results
target_oct4 = ["28007765"]

print("Checking for target OCT4 papers:")
for target in target_oct4:
    if target in oct4_pmids:
        print(f"✓ FOUND: PMID {target}")
        print(f"position:{oct4_pmids.index(target)}")
    else:
        print(f"✗ MISSING: PMID {target}")

## All genes results

In [3]:
import pandas as pd
df = pd.read_csv("../data/all_genes_results.csv")

In [6]:
df.head()

| Unnamed: 0 | gene_id | gene_symbol | pmid | title | year | journal | score | relevant | reasoning | search_date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4780 | NRF2 | 27259148 | Repression of the Antioxidant NRF2 Pathway in ... | 2016 | Cell | 0.9 | True | The paper provides strong evidence for a seque... | 2025-10-16 |
| 1 | 4780 | NRF2 | 40999940 | Enhancing Late-Life Survival and Mobility via ... | 2025 | Aging Cell | 0.9 | True | The paper demonstrates a causal chain from gen... | 2025-10-16 |
| 2 | 4780 | NRF2 | 37751046 | Comparative analysis of the molecular and phys... | 2023 | Geroscience | 0.9 | True | The paper provides experimental evidence of a ... | 2025-10-16 |
| 3 | 4780 | NRF2 | 28272406 | O-GlcNAcylation of SKN-1 modulates the lifespa... | 2017 | Sci Rep | 0.9 | True | The paper provides strong evidence for a seque... | 2025-10-16 |
| 4 | 4780 | NRF2 | 34561453 | A PTEN variant uncouples longevity from impair... | 2021 | Nat Commun | 0.9 | True | The paper provides strong evidence for a seque... | 2025-10-16 |


In [5]:
test_cases = {
    "NRF2": ["28612944", "32424161"],
    "SOX2": ["38141611"],
    "OCT4": ["28007765"]
}

In [None]:
gene_symbols = df['gene_symbol'].unique()

In [10]:
for gene in gene_symbols:
    if gene in test_cases:
        # PMIDs for this gene in CSV row order, cast to str to match test_cases
        gene_pmids = df[df['gene_symbol'] == gene]['pmid'].astype(str).tolist()

        for test_pmid in test_cases[gene]:
            if test_pmid in gene_pmids:
                # Rank is the 1-based row position among this gene's rows
                rank = gene_pmids.index(test_pmid) + 1
                print(f"{gene} test case FOUND: PMID {test_pmid} (Rank #{rank})")
            else:
                print(f"{gene} test case MISSING: PMID {test_pmid}")


NRF2 test case MISSING: PMID 28612944
NRF2 test case MISSING: PMID 32424161
OCT4 test case FOUND: PMID 28007765 (Rank #6)
SOX2 test case FOUND: PMID 38141611 (Rank #1)
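
In the outputs above, PMID 28612944 appears in the NRF2 search results (position 119) but is reported missing from `all_genes_results.csv`. One way to localize where a target drops out is to check it against each pipeline stage in order. This is a sketch under assumptions: `first_missing_stage` is a hypothetical helper, and the stage names are illustrative labels rather than names used by the project.

```python
def first_missing_stage(pmid, stages):
    """Return the name of the first stage whose PMID list lacks pmid, or None if all contain it.

    stages maps a stage name to the PMIDs present after that stage, in pipeline order.
    """
    for name, pmids in stages.items():
        # Compare as strings so integer PMIDs read from a CSV match string targets.
        if str(pmid) not in {str(p) for p in pmids}:
            return name
    return None


# Toy data for illustration only; in the notebook the lists would be
# nrf2_pmids (search stage) and the NRF2 rows of df['pmid'] (CSV stage).
toy_stages = {
    "search": ["28612944", "27259148"],
    "results_csv": [27259148],
}
print(first_missing_stage("28612944", toy_stages))
```

A target that is absent from the first stage points to the query or the result limit; one that disappears at a later stage points to filtering applied between the search and the saved results.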
