# Test Query Expansion

This notebook tests whether our expanded PubMed queries retrieve every test-case article.

## Test Cases:
1. **NRF2**: PMC7234996 (Neoaves KEAP1), PMID:28612944 (SKN-1 in C. elegans)
2. **SOX2**: SuperSOX study
3. **APOE**: APOE2/3/4 variants and longevity
4. **OCT4**: OCT6→OCT4 conversion (EMBR study)

In [2]:
# Setup
import sys
sys.path.append('.')

from src.tools.pubmed import PubMed
import json

# Initialize PubMed client
pubmed = PubMed()

## TEST CASE 1: NRF2

Expected to find:
- PMID:32424161: Neoaves KEAP1 mutation → overactive NRF2
- PMID:28612944: SKN-1 (NRF2 ortholog) increases lifespan in C. elegans

In [3]:
# Build expanded query for NRF2
nrf2_query = pubmed.build_search_query("NRF2")
print("NRF2 Query:")
print(nrf2_query)

NRF2 Query:
NRF2[TIAB] AND (aging[TIAB] OR longevity[TIAB] OR lifespan[TIAB] OR healthspan[TIAB] OR life span[TIAB] OR centenarian[TIAB] OR survival[TIAB] OR senescence[TIAB] OR age-related[TIAB]) NOT (Review[PT]) AND hasabstract
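
The printed query follows a fixed pattern: the gene symbol, an OR-group of context terms, a review filter, and an abstract filter. The sketch below reproduces the strings printed in this notebook. `build_query_sketch`, `AGING_TERMS`, and `REPROGRAMMING_TERMS` are hypothetical names introduced for illustration. This is not necessarily how `PubMed.build_search_query` is implemented.

```python
# Hypothetical sketch: rebuilds the query strings printed in this notebook.
# The term lists are copied from the printed output, not from the library source.
AGING_TERMS = [
    "aging", "longevity", "lifespan", "healthspan", "life span",
    "centenarian", "survival", "senescence", "age-related",
]
REPROGRAMMING_TERMS = ["reprogramming", "cellular reprogramming", "Yamanaka factors"]


def build_query_sketch(gene, include_reprogramming=False):
    """Return a PubMed query restricted to title/abstract matches."""
    terms = list(AGING_TERMS)
    if include_reprogramming:
        terms += REPROGRAMMING_TERMS
    # Tag every context term with [TIAB] and join them into one OR-group.
    context = " OR ".join(f"{term}[TIAB]" for term in terms)
    return f"{gene}[TIAB] AND ({context}) NOT (Review[PT]) AND hasabstract"


print(build_query_sketch("NRF2"))
```

Keeping the term lists as plain data makes it easy to compare the effect of adding or removing a single term on recall.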


In [4]:
# Search for NRF2 papers
nrf2_pmids = pubmed.search(nrf2_query, 500)
print(f"Found {len(nrf2_pmids)} papers for NRF2")

Found 500 papers for NRF2


In [5]:
# Check if our target papers are in the results
target_nrf2 = ["28612944", "32424161"]

print("Checking for target NRF2 papers:")
for target in target_nrf2:
    if target in nrf2_pmids:
        print(f"position:{nrf2_pmids.index(target)}")
        print(f"✓ FOUND: PMID {target}")
    else:
        print(f"✗ MISSING: PMID {target}")

Checking for target NRF2 papers:
position:119
✓ FOUND: PMID 28612944
position:461
✓ FOUND: PMID 32424161
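
The same membership check is repeated for each test case in this notebook. One possible way to factor it out is sketched below. `find_targets` and `report` are hypothetical helpers, and the PMIDs in the demo are made up for illustration. They are not search results.

```python
def find_targets(pmids, targets):
    """Map each target PMID to its 0-based position in pmids, or None if absent."""
    first_index = {}
    for i, pmid in enumerate(pmids):
        # setdefault keeps the first occurrence, matching list.index().
        first_index.setdefault(pmid, i)
    return {target: first_index.get(target) for target in targets}


def report(label, pmids, targets):
    """Print one FOUND/MISSING line per target PMID."""
    print(f"Checking for target {label} papers:")
    for target, position in find_targets(pmids, targets).items():
        if position is None:
            print(f"MISSING: PMID {target}")
        else:
            print(f"FOUND: PMID {target} (position {position})")


# Made-up PMIDs for illustration only.
demo_results = ["111", "222", "333"]
report("demo", demo_results, ["222", "999"])
```

Building the index once avoids scanning the result list twice per target (`in` followed by `.index`).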


## TEST CASE 2: SOX2

Expected to find:
- PMID:38141611: Modified SOX2 with enhanced reprogramming capabilities

In [6]:
# Build expanded query for SOX2 (with reprogramming terms)
sox2_query = pubmed.build_search_query("SOX2", include_reprogramming=True)
print("SOX2 Query (with reprogramming):")
print(sox2_query)

SOX2 Query (with reprogramming):
SOX2[TIAB] AND (aging[TIAB] OR longevity[TIAB] OR lifespan[TIAB] OR healthspan[TIAB] OR life span[TIAB] OR centenarian[TIAB] OR survival[TIAB] OR senescence[TIAB] OR age-related[TIAB] OR reprogramming[TIAB] OR cellular reprogramming[TIAB] OR Yamanaka factors[TIAB]) NOT (Review[PT]) AND hasabstract


In [7]:
# Search for SOX2 papers
sox2_pmids = pubmed.search(sox2_query, 100)
print(f"Found {len(sox2_pmids)} papers for SOX2")

Found 100 papers for SOX2


In [8]:
# Check if our target papers are in the results
target_sox2 = ["38141611"]

print("Checking for target SOX2 papers:")
for target in target_sox2:
    if target in sox2_pmids:
        print(f"position:{sox2_pmids.index(target)}")
        print(f"✓ FOUND: PMID {target}")
    else:
        print(f"✗ MISSING: PMID {target}")

Checking for target SOX2 papers:
position:4
✓ FOUND: PMID 38141611


## TEST CASE 3: APOE

Expected to find:
- APOE2: protective variant associated with longevity
- APOE3: common neutral variant
- APOE4: risk variant for Alzheimer's and reduced longevity

## TEST CASE 4: OCT4/OCT6

Expected to find:
- EMBR study (PMID:28007765): Converting OCT6 into a reprogramming factor through sequence modifications

In [9]:
# Build expanded query for OCT4 (with reprogramming terms)
oct4_query = pubmed.build_search_query("OCT4", include_reprogramming=True)
print("OCT4 Query (with reprogramming):")
print(oct4_query)

OCT4 Query (with reprogramming):
OCT4[TIAB] AND (aging[TIAB] OR longevity[TIAB] OR lifespan[TIAB] OR healthspan[TIAB] OR life span[TIAB] OR centenarian[TIAB] OR survival[TIAB] OR senescence[TIAB] OR age-related[TIAB] OR reprogramming[TIAB] OR cellular reprogramming[TIAB] OR Yamanaka factors[TIAB]) NOT (Review[PT]) AND hasabstract


In [10]:
# Search for OCT4 papers
oct4_pmids = pubmed.search(oct4_query, 800)
print(f"Found {len(oct4_pmids)} papers for OCT4")

Found 800 papers for OCT4


In [11]:
# Check if our target papers are in the results
target_oct4 = ["28007765"]

print("Checking for target OCT4 papers:")
for target in target_oct4:
    if target in oct4_pmids:
        print(f"✓ FOUND: PMID {target}")
        print(f"position:{oct4_pmids.index(target)}")
    else:
        print(f"✗ MISSING: PMID {target}")

Checking for target OCT4 papers:
✓ FOUND: PMID 28007765
position:53
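
Every search above returned exactly as many PMIDs as the limit passed to `pubmed.search`, so the result lists were likely truncated. Whether a target is found therefore depends on that limit. The sketch below uses the positions recorded in this notebook to estimate the smallest limit that would still include each target. It assumes the second argument to `pubmed.search` is a maximum result count and that the result order is the same between runs. `min_cap_needed` is a hypothetical helper.

```python
# Limits and 0-based positions copied from the recorded outputs in this notebook.
recorded = {
    "NRF2": {"cap": 500, "positions": [119, 461]},
    "SOX2": {"cap": 100, "positions": [4]},
    "OCT4": {"cap": 800, "positions": [53]},
}


def min_cap_needed(positions):
    """Smallest result limit that still includes every 0-based position."""
    return max(positions) + 1


for gene, info in recorded.items():
    needed = min_cap_needed(info["positions"])
    headroom = info["cap"] - needed
    print(f"{gene}: needs at least {needed} results, used {info['cap']} (headroom {headroom})")
```

Under these assumptions the NRF2 case has the least headroom: the Neoaves KEAP1 paper sits at position 461 of 500.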
