# PubMed Tools Pipeline Test

This notebook tests the full pipeline: **Search → Fetch → Screen (Metadata) → Screen (Abstract)**

The pipeline uses a single `screen_paper()` function with two modes:
1. **Quick screening** using only title and MeSH terms (no abstract)
2. **Deep screening** using title, abstract, and MeSH terms to identify sequence→function→aging links
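
The two modes can be pictured as one entry point that switches on whether an abstract is supplied. The sketch below is illustrative only: `choose_screening_mode` is a hypothetical helper, not part of `src.tools.screening`, and the real `screen_paper()` may decide differently.

```python
from typing import Optional, Sequence


def choose_screening_mode(title: str,
                          keywords: Sequence[str],
                          abstract: Optional[str] = None) -> str:
    """Hypothetical helper: report which screening mode a call would use."""
    # Assumption: a missing or whitespace-only abstract means quick screening.
    if abstract and abstract.strip():
        return "deep"
    return "quick"


print(choose_screening_mode("Klotho and aging", ["Aging"]))
print(choose_screening_mode("Klotho and aging", ["Aging"], abstract="We show ..."))
```

Under this assumption an empty string counts as "no abstract". Step 4 passes `paper.get("abstract", "")`, so it is worth checking how the real function treats that case.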

## Import Required Libraries

In [1]:
# IMPORTANT: If you've changed the model in .env, restart the kernel before running!
# Kernel -> Restart Kernel

import json
from src.tools.pubmed import PubMed
from src.tools.screening import screen_paper

# Initialize PubMed client
pubmed = PubMed()

# Verify the model being used
from src.config import NEBIUS_MODEL
print(f"✓ Using model: {NEBIUS_MODEL}")

✓ Using model: meta-llama/Llama-3.3-70B-Instruct


## Configure Search Parameters

In [2]:
# Customize these parameters
SEARCH_QUERY = "IGF1R aging"
MAX_RESULTS = 10

## Step 1: Search PubMed

In [3]:
print("=" * 70)
print(f"STEP 1: Searching PubMed for '{SEARCH_QUERY}'")
print("=" * 70)

pmids = pubmed.search(SEARCH_QUERY, max_results=MAX_RESULTS)
print(f"\nFound {len(pmids)} PMIDs")

if not pmids or pmids[0].startswith("Error"):
    print("\n❌ Search failed!")
    raise RuntimeError("Search failed: no PMIDs returned" if not pmids else f"Search failed: {pmids[0]}")

print(f"✓ Search successful! Got {len(pmids)} PMIDs")
print(f"\nFirst 5 PMIDs: {pmids[:5]}")

STEP 1: Searching PubMed for 'IGF1R aging'

Found 10 PMIDs
✓ Search successful! Got 10 PMIDs

First 5 PMIDs: ['35857466', '16123266', '37527036', '37441495', '30026579']
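
The failure check in the cell above could be factored into a reusable helper. This is a minimal sketch, assuming (as that cell does) that a failed search comes back as a list whose first element starts with `"Error"`. `validate_pmids` is a hypothetical name, and the digit filter is an extra safeguard that the original code does not apply.

```python
from typing import List


def validate_pmids(pmids: List[str]) -> List[str]:
    """Hypothetical helper: raise on a failed search, else return clean PMIDs."""
    if not pmids:
        raise RuntimeError("Search returned no PMIDs")
    # Assumption taken from the search cell: errors are reported as a string
    # in the first list element, prefixed with "Error".
    if pmids[0].startswith("Error"):
        raise RuntimeError(f"Search failed: {pmids[0]}")
    # PMIDs are numeric identifiers, so drop anything that is not all digits.
    return [p for p in pmids if p.isdigit()]


print(validate_pmids(["35857466", "16123266", "n/a"]))
```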


## Step 2: Fetch Paper Metadata

In [4]:
print("\n" + "=" * 70)
print(f"STEP 2: Fetching metadata for {len(pmids)} papers")
print("=" * 70)

# Fetch papers using PubMed class
papers = pubmed.fetch(pmids)

if papers and "error" in papers[0]:
    print("\n❌ Fetch failed!")
    print(papers[0]["error"])
    raise RuntimeError(f"Fetch failed: {papers[0]['error']}")

print(f"✓ Fetched metadata for {len(papers)} papers")

# Show a sample paper
if papers:
    print("\n" + "-" * 70)
    print("Sample paper:")
    print("-" * 70)
    sample = papers[0]
    print(f"PMID: {sample['pmid']}")
    print(f"Title: {sample['title']}")
    print(f"Year: {sample['year']}")
    print(f"Journal: {sample['journal']}")
    print(f"MeSH Terms: {sample['mesh_terms'][:5] if sample['mesh_terms'] else 'None'}...")


STEP 2: Fetching metadata for 10 papers
✓ Fetched metadata for 10 papers

----------------------------------------------------------------------
Sample paper:
----------------------------------------------------------------------
PMID: 35857466
Title: Progerin modulates the IGF-1R/Akt signaling involved in aging.
Year: 2022
Journal: Sci Adv
MeSH Terms: None...
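
The sample output shows `MeSH Terms: None...`: this record has no MeSH terms, yet the trailing ellipsis is still printed. A small formatter can handle both the missing and the truncated case. This is a sketch only, and `format_mesh` is a hypothetical helper, not part of the `PubMed` class.

```python
from typing import Optional, Sequence


def format_mesh(terms: Optional[Sequence[str]], limit: int = 5) -> str:
    """Hypothetical helper: render MeSH terms for display, tolerating None."""
    if not terms:
        return "None"
    shown = ", ".join(terms[:limit])
    # Append an ellipsis only when terms were actually cut off.
    return shown + ", ..." if len(terms) > limit else shown


print(format_mesh(None))
print(format_mesh(["Aging", "Mice", "Longevity"]))
print(format_mesh(["Aging", "Mice", "Longevity"], limit=2))
```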


## Step 3: Screen Papers Using Title + MeSH Terms

In [5]:
print("\n" + "=" * 70)
print("STEP 3: Screening papers using title + MeSH terms")
print("=" * 70)

relevant_papers = []

for i, paper in enumerate(papers, 1):
    print(f"\nScreening paper {i}/{len(papers)}: {paper['pmid']}")
    print(f"  Title: {paper.get('title', '')[:80]}...")

    result = screen_paper(
        title=paper.get("title", ""),
        keywords=paper.get("mesh_terms", [])
        # No abstract parameter = quick metadata screening
    )

    if result["relevant"]:
        relevant_papers.append({
            **paper,
            "metadata_score": result["score"],
            "metadata_reasoning": result["reasoning"]
        })
        print(f"  ✓ RELEVANT (score: {result['score']:.2f})")
        print(f"    Reason: {result['reasoning']}")
    else:
        print(f"  ✗ Not relevant (score: {result['score']:.2f})")
        print(f"    Reason: {result['reasoning']}")

print("\n✓ Metadata screening complete!")


STEP 3: Screening papers using title + MeSH terms

Screening paper 1/10: 35857466
  Title: Progerin modulates the IGF-1R/Akt signaling involved in aging....
  ✓ RELEVANT (score: 0.80)
    Reason: The title mentions a specific protein, progerin, and its involvement in aging through modulation of the IGF-1R/Akt signaling pathway, suggesting a potential SEQUENCE→PHENOTYPE link. The lack of keywords is a limitation, but the title provides sufficient indication of a link to aging research.

Screening paper 2/10: 16123266
  Title: Suppression of aging in mice by the hormone Klotho....
  ✓ RELEVANT (score: 0.80)
    Reason: The title and keywords suggest a link between the Klotho hormone and aging/longevity, and the presence of terms like 'genetics', 'transgenic mice', and 'recombinant proteins' imply experimental studies, making this paper relevant for SEQUENCE→PHENOTYPE links in aging research.

Screening paper 3/10: 37527036
  Title: IGFBPL1 is a master driver of microglia homeostasis and

## Step 4: Deep Screen Abstracts for Sequence→Function→Aging Links

In [6]:
print("\n" + "=" * 70)
print("STEP 4: Deep screening abstracts for sequence→function→aging links")
print("=" * 70)

highly_relevant_papers = []

for i, paper in enumerate(relevant_papers, 1):
    print(f"\nDeep screening paper {i}/{len(relevant_papers)}: {paper['pmid']}")
    print(f"  Title: {paper.get('title', '')[:80]}...")
    
    result = screen_paper(
        title=paper.get("title", ""),
        keywords=paper.get("mesh_terms", []),
        abstract=paper.get("abstract", "")  # Including abstract = deep screening
    )
    
    print(f"  Score: {result['score']:.2f}")
    
    # Keep only papers flagged relevant with a score of at least 0.5
    if result['score'] >= 0.5 and result['relevant']:
        highly_relevant_papers.append({
            **paper,
            "abstract_score": result["score"],
            "abstract_reasoning": result["reasoning"]
        })
        print(f"  ✓✓ HIGH PRIORITY")
        print(f"    Reasoning: {result['reasoning']}")
    else:
        print(f"  ✗ Lower priority (score < 0.5 or not relevant)")
        print(f"    Reasoning: {result['reasoning']}")

print("\n✓ Deep screening complete!")


STEP 4: Deep screening abstracts for sequence→function→aging links

Deep screening paper 1/5: 35857466
  Title: Progerin modulates the IGF-1R/Akt signaling involved in aging....
  Score: 0.90
  ✓✓ HIGH PRIORITY
    Reasoning: The paper describes a specific sequence change (LMNA mutation leading to progerin production) that causes a functional change (mislocalization and interaction with IGF-1R, down-regulating its expression) which affects an aging-related phenotype (premature aging, cellular senescence, and reduced longevity). The mechanistic connection between sequence, function, and aging is well-supported by experimental validation.

Deep screening paper 2/5: 16123266
  Title: Suppression of aging in mice by the hormone Klotho....
  Score: 0.20
  ✗ Lower priority (score < 0.5 or not relevant)
    Reasoning: The paper describes the effect of Klotho overexpression on aging, but it does not provide explicit sequence-level changes, only gene expression changes, which does not meet the

## Step 5: Display Results

In [7]:
print("\n" + "=" * 70)
print(f"RESULTS: Found {len(highly_relevant_papers)} HIGH PRIORITY papers")
print(f"         (out of {len(relevant_papers)} relevant, {len(papers)} total)")
print("=" * 70)

if highly_relevant_papers:
    print("\nHigh priority papers with sequence→function→aging links:")
    print("-" * 70)
    
    for i, paper in enumerate(highly_relevant_papers, 1):
        print(f"\n{i}. PMID: {paper['pmid']}")
        print(f"   Metadata Score: {paper['metadata_score']:.2f} | Abstract Score: {paper['abstract_score']:.2f}")
        print(f"   Title: {paper['title']}")
        print(f"   Year: {paper['year']} | Journal: {paper['journal']}")
        print(f"   Analysis: {paper['abstract_reasoning']}")
else:
    print("\n⚠️  No high priority papers found in this batch.")
    print("Papers from Step 3 may still be relevant but lack strong sequence→function→aging evidence.")

print("\n" + "=" * 70)
print("✓ Pipeline test completed!")
print("=" * 70)


RESULTS: Found 2 HIGH PRIORITY papers
         (out of 5 relevant, 10 total)

High priority papers with sequence→function→aging links:
----------------------------------------------------------------------

1. PMID: 35857466
   Metadata Score: 0.80 | Abstract Score: 0.90
   Title: Progerin modulates the IGF-1R/Akt signaling involved in aging.
   Year: 2022 | Journal: Sci Adv
   Analysis: The paper describes a specific sequence change (LMNA mutation leading to progerin production) that causes a functional change (mislocalization and interaction with IGF-1R, down-regulating its expression) which affects an aging-related phenotype (premature aging, cellular senescence, and reduced longevity). The mechanistic connection between sequence, function, and aging is well-supported by experimental validation.

2. PMID: 30026579
   Metadata Score: 0.80 | Abstract Score: 0.90
   Title: Reversing wrinkled skin and hair loss in mice by restoring mitochondrial function.
   Year: 2018 | Journal: Cell 

## Summary Statistics

In [8]:
print("\n📊 SUMMARY STATISTICS")
print("=" * 70)
print(f"Search query: {SEARCH_QUERY}")
print(f"Total papers searched: {len(papers)}")
print(f"Papers passing metadata screen: {len(relevant_papers)}")
print(f"Papers passing abstract screen: {len(highly_relevant_papers)}")
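# Guard against division by zero when no papers were fetched
relevance_rate = len(highly_relevant_papers) / len(papers) * 100 if papers else 0.0
print(f"Overall relevance rate: {relevance_rate:.1f}%")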

if highly_relevant_papers:
    avg_metadata_score = sum(p['metadata_score'] for p in highly_relevant_papers) / len(highly_relevant_papers)
    avg_abstract_score = sum(p['abstract_score'] for p in highly_relevant_papers) / len(highly_relevant_papers)
    print(f"Average metadata score: {avg_metadata_score:.2f}")
    print(f"Average abstract score: {avg_abstract_score:.2f}")
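    # The list is in screening order, not sorted, so pick the highest abstract score explicitly
    top_paper = max(highly_relevant_papers, key=lambda p: p['abstract_score'])
    print(f"\nTop scored paper: {top_paper['title']}")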


📊 SUMMARY STATISTICS
Search query: IGF1R aging
Total papers searched: 10
Papers passing metadata screen: 5
Papers passing abstract screen: 2
Overall relevance rate: 20.0%
Average metadata score: 0.80
Average abstract score: 0.90

Top scored paper: Progerin modulates the IGF-1R/Akt signaling involved in aging.
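
The same statistics can be computed in one function that tolerates empty inputs and does not rely on list order to find the top paper. This is a minimal sketch: `summarize` and its dictionary keys are hypothetical, and the example records are placeholders rather than pipeline output.

```python
from typing import Dict, List


def summarize(n_fetched: int, n_metadata: int, scored: List[Dict]) -> Dict:
    """Hypothetical helper: funnel statistics that tolerate empty inputs."""
    # Avoid dividing by zero when nothing was fetched.
    rate = 100.0 * len(scored) / n_fetched if n_fetched else 0.0
    # Pick the highest abstract score explicitly instead of taking element 0.
    top = max(scored, key=lambda p: p["abstract_score"]) if scored else None
    return {
        "fetched": n_fetched,
        "passed_metadata": n_metadata,
        "passed_abstract": len(scored),
        "relevance_rate_pct": round(rate, 1),
        "top_title": top["title"] if top else None,
    }


# Placeholder records for illustration only.
example = [
    {"title": "Paper A", "abstract_score": 0.7},
    {"title": "Paper B", "abstract_score": 0.9},
]
print(summarize(10, 5, example))
print(summarize(0, 0, []))
```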


## Export Results (Optional)

In [None]:
# Uncomment to save results to JSON
# import json
# with open('highly_relevant_papers.json', 'w') as f:
#     json.dump(highly_relevant_papers, f, indent=2)
# print("High priority results saved to highly_relevant_papers.json")
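
If the exported file is meant to be read by people, it may help to write it as UTF-8 with `ensure_ascii=False` so that characters such as `→` in the reasoning text are not escaped. The sketch below assumes this is wanted. `export_papers` is a hypothetical helper, and the record written is placeholder data.

```python
import json
import os
import tempfile
from typing import Dict, List


def export_papers(papers: List[Dict], path: str) -> int:
    """Hypothetical helper: write papers to JSON and return the count written."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as "→" readable in the file.
        json.dump(papers, f, indent=2, ensure_ascii=False)
    return len(papers)


# Round-trip check with placeholder data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "highly_relevant_papers.json")
    written = export_papers(
        [{"pmid": "0", "abstract_reasoning": "sequence→function"}], out
    )
    with open(out, encoding="utf-8") as f:
        loaded = json.load(f)

print(written, loaded[0]["abstract_reasoning"])
```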