<a id="tabel"></a>
<b>Table of contents:</b>


* [1. Import Libraries & Dataset](#import)
* [2. Document Retrieval](#retrive)
    - [2.1. Raw Baseline (Querying the National Library of Medicine)](#1000-document)
    - [2.2. Enhanced Querying by Adding the MeSH Index](#enhanced)
* [3. BM25](#bm-25)
* [4. Out-of-Box Neural Re-ranking](#Baseline-NN)
* [5. Fine-Tune Neural Re-ranking on the Training Dataset](#fine-tune)

[🏠 Table of Contents](#tabel)




<style>
  body {
    background: #111;          /* match your dark page */
    color: #fff;
    font-family: 'Poppins', Roboto, sans-serif;
  }

  .timeline {
    display: flex;
    align-items: stretch;      /* make children all same height */
    justify-content: center;
    gap: 40px;
    padding: 40px;
  }

  /* Left & right columns stack vertically */
  .phase-column {
    display: flex;
    flex-direction: column;
    gap: 30px;
  }

  /* dashed separator */
  .separator {
    border-left: 2px dashed #888;
    width: 0;
    /* stretch separator to the tallest column */
    align-self: stretch;
    margin: 0 20px;
  }

  .column-title {
    font-size: 1.2em;
    font-weight: 600;
    text-align: center;
    margin-bottom: 10px;
    text-transform: uppercase;
    letter-spacing: 1px;
  }

  .phase {
    width: 260px;
    height: 70px;
    line-height: 70px;
    color: white;
    font-weight: 600;
    text-align: center;
    border-radius: 6px;
    font-size: 1.05em;
    clip-path: polygon(0% 0%, 92% 0%, 100% 50%, 92% 100%, 0% 100%);
    box-shadow: 2px 4px 10px rgba(0,0,0,0.3);
    opacity: 0.9;
    transition: transform 0.2s, box-shadow 0.2s, opacity 0.2s;
  }
  .phase:hover {
    transform: scale(1.03);
    box-shadow: 4px 6px 15px rgba(0,0,0,0.4);
    opacity: 1;
  }
  .phase1 { background: #69B8F7; }
  .phase2 { background: #54A6F3; }
  .phase3 { background: #2B7BC1; }
  .phase4 { background: #1F557F; }
  .phase5 { background: #5C9D99; }
  .phase6 { background: #4E9B97; }

  @media (max-width: 800px) {
    .timeline {
      flex-direction: column;
      align-items: center;
    }
    .separator {
      display: none;
    }
  }
</style>

<div class="timeline">
  <!-- Phase A column -->
  <div class="phase-column">
    <div class="column-title">Phase A</div>
    <div class="phase phase1">1. 1000 Document Pool from API</div>
    <div class="phase phase2">2. Add BM25 (“Traditional” IR)</div>
    <div class="phase phase3">3. Representation Learning (BERT)</div>
    <div class="phase phase4">4. Baseline Neural Re-Ranking</div>
    <div class="phase phase5">5. Transfer Learning on the Neural</div>
  </div>

  <!-- Dashed line -->
  <div class="separator"></div>

  <!-- Phase B column -->
  <div class="phase-column">
    <div class="column-title">Phase B</div>
    <div class="phase phase6">6. Create Model for Phase B</div>
  </div>
</div>


## Library Imports


<a id="import"></a>

In [None]:
! pip install -r requirements.txt

In [None]:
# ----------------------------- #
#        Standard Library       #
# ----------------------------- #
import csv
import json
import logging
import math
import os
import pathlib
import random
import re
import statistics
import string
import time
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

# ----------------------------- #
#      Third-Party Libraries    #
# ----------------------------- #

# Scientific Computing
import numpy as np
import pandas as pd
import requests                  # Downloading the BioASQ training file
import torch

# Bioinformatics
from Bio import Entrez, Medline  # Accessing and parsing PubMed/NCBI data

# NLP & Text Processing
import nltk                      # Tokenization, stopwords, etc.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from rank_bm25 import BM25Okapi  # BM25 ranking algorithm for information retrieval

# Sentence Transformers
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# Progress Bar / Visualization
from tqdm import tqdm, trange

# ----------------------------- #
#     Project Helper Module     #
# ----------------------------- #
from query_pubmed import esearch_pmids, fetch_documents, preprocess

<a id="retrive"></a>

## Retrieving a Pool of Documents through Querying

>https://www.nlm.nih.gov

### Baseline (Raw Querying from the National Library of Medicine)

<a id="1000-document"></a>
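The baseline cell turns the gold document URLs into PMIDs and then fetches them in fixed-size batches. The sketch below isolates those two steps without any network call. The `batched` helper and the sample URLs are illustrative assumptions, not part of the notebook.

```python
import re
from typing import Iterator, List


def pmids_from_urls(urls: List[str]) -> List[str]:
    """Extract the numeric PMID from each PubMed URL, skipping non-matching URLs."""
    pmids = []
    for url in urls:
        match = re.search(r"/pubmed/(\d+)", url)
        if match:
            pmids.append(match.group(1))
    return pmids


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up URLs, only to show the shape of the data.
urls = [
    "http://www.ncbi.nlm.nih.gov/pubmed/12345",
    "http://www.ncbi.nlm.nih.gov/pubmed/678",
    "https://example.org/not-a-pubmed-link",
]
pmids = pmids_from_urls(urls)

# Each batch becomes one comma-separated `id` argument for an efetch request.
id_args = [",".join(batch) for batch in batched(pmids + ["9", "10", "11"], size=2)]
print(pmids)
print(id_args)
```

Batching keeps the number of requests low, which matters because NCBI rate-limits clients.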

In [None]:
EMAIL = ""           # Enter your e-mail (the API will start complaining if it is not set)
API_KEY = ""         # Enter your API key (allows more queries per second)

Entrez.email = EMAIL
Entrez.api_key = API_KEY or None   # leave the key unset when none is provided

RETMX = 1000   # Number of documents to fetch per query
BATCH = 500    # PubMed efetch batch size
TRAIN_DATA_URL = (
    "https://participants-area.bioasq.org/"
    "Tasks/13b/trainingDataset/training13b.json"
)
DATA_FILE = "training13b.json"

stop_words = set(stopwords.words('english'))  # Needs nltk.download('stopwords') once
stemmer = PorterStemmer()  # Suffix removal (generalizes query terms)


##############################---Utility Functions---###################################
# Helpers for the fetching step.
# Download the training set if it is not already on disk.
if not os.path.exists(DATA_FILE):
    print(f"Downloading training data from:\n  {TRAIN_DATA_URL}")
    resp = requests.get(TRAIN_DATA_URL)
    resp.raise_for_status()
    with open(DATA_FILE, "wb") as f:
        f.write(resp.content)
    print(f"Saved to {DATA_FILE}")

# Load the Training File
with open(DATA_FILE, "r", encoding="utf-8") as f:
    data = json.load(f)

# Keep only the PMID of each gold document URL
def pmids_from_urls(urls: List[str]) -> List[str]:
    """Extract the numeric PMID from each PubMed URL."""
    pmids: List[str] = []
    for u in urls:
        m = re.search(r'/pubmed/(\d+)', u)
        if m:
            pmids.append(m.group(1))
    return pmids

#####################################################################################


# Main Fetching Logic
def fetch_documents(pmids: List[str], batch_size: int = BATCH) -> List[Tuple[str,str,str]]:
    """Fetch (pmid, title, abstract) for each PMID."""
    docs: List[Tuple[str,str,str]] = []  # Collected (pmid, title, abstract) tuples
    # Iterate over the PMIDs one batch at a time
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start+batch_size]
        # Main API call: fetch the whole batch with a single efetch request
        handle  = Entrez.efetch(db="pubmed", id=",".join(batch), retmode="xml")
        records = Entrez.read(handle)
        handle.close()
        time.sleep(0.34)  # NCBI rate limiting
        for art in records.get('PubmedArticle', []):
            pmid = str(art['MedlineCitation']['PMID'])
            el   = art['MedlineCitation']['Article']
            title = el.get('ArticleTitle', '')
            # AbstractText may hold several labelled sections; join them into one string
            abst  = " ".join(el.get('Abstract',{}).get('AbstractText', []))
            docs.append((pmid, title, abst))
    return docs

# Build the output under a top-level "questions" key, as the competition's test format expects
questions_out = []
for q in data.get("questions", []):
    
    qid   = q.get("id", "")
    qtype = q.get("type")           
    body  = q.get("body", "").strip()
    urls  = q.get("documents", [])
    pmids = pmids_from_urls(urls)

    docs = fetch_documents(pmids)
    docs_json = [
        {"pmid": pmid, "title": title, "abstract": abst}
        for pmid, title, abst in docs
    ]

    questions_out.append({
        "id":        qid,
        "type":      qtype,
        "body":      body,
        "documents": docs_json
    })

#  Write wrapped JSON
out_dir = "api_retrival"
os.makedirs(out_dir, exist_ok=True)
out_path = Path(out_dir) / "bio_api_docs_title+abstract_baseline.json"

with open(out_path, "w", encoding="utf-8") as f:
    json.dump({"questions": questions_out}, f, indent=2, ensure_ascii=False)

print(f"Processed {len(questions_out)} questions; total docs = "
      f"{sum(len(q['documents']) for q in questions_out)}")

<a id="enhanced"></a>

### Enhanced Querying by Adding the MeSH Index
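To make the search strategy concrete, the sketch below builds the PubMed term string the same way the cell that follows does, but as a pure function so the result can be inspected without calling the API. The function name `build_pubmed_term` and the sample question are assumptions made for illustration.

```python
def build_pubmed_term(question: str) -> str:
    """Combine phrase-level and token-level clauses over Title/Abstract and MeSH."""
    phrase = question.lower().strip()
    tokens = phrase.split()
    # The full phrase is searched first, then every token on its own.
    parts = [f'"{phrase}"[Title/Abstract]', f'"{phrase}"[MeSH Terms]']
    for tok in tokens:
        parts.append(f"{tok}[Title/Abstract]")
        parts.append(f"{tok}[MeSH Terms]")
    # Any clause may match (OR), but the record must have an abstract.
    return f"({' OR '.join(parts)}) AND hasabstract[text]"


term = build_pubmed_term("Marfan syndrome")
print(term)
```

Because every clause is joined with `OR`, the query favours recall: a record only needs to match one token in either field, and ranking is left to the later stages.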

In [None]:
# Load the Training File
with open(DATA_FILE, "r", encoding="utf-8") as f:
    data = json.load(f)
questions = data.get("questions", [])


##############################---Utility Functions---###################################

def esearch_pmids(query: str, retmax: int = RETMX) -> list:
    """
    Perform a PubMed search using both Title/Abstract and MeSH terms.
    Returns a list of PMIDs (strings).
    """
    # Clean up the query: normalize it to a lowercase phrase, then split it into tokens.
    phrase = query.lower().strip()
    tokens = phrase.split()
    # Build query parts for the full phrase and for each individual token. MeSH terms are a
    # standardized vocabulary that improves search precision by capturing the semantic
    # meaning of biomedical concepts, which helps the document retrieval step.
    parts = [f'"{phrase}"[Title/Abstract]', f'"{phrase}"[MeSH Terms]']
    for t in tokens:
        parts.append(f'{t}[Title/Abstract]')
        parts.append(f'{t}[MeSH Terms]')
    term = f"({' OR '.join(parts)}) AND hasabstract[text]"

    # Retry up to 3 times on failure
    for attempt in range(3):
        try:
            handle = Entrez.esearch(
                db="pubmed",
                term=term,
                retmax=retmax,
                sort="relevance",
                retmode="xml"
            )
            result = Entrez.read(handle)
            time.sleep(0.1)
            return result.get('IdList', [])
        except Exception as e:
            print(f"Esearch attempt {attempt+1} failed: {e}")
            time.sleep(2 ** attempt)
    return []

#####################################################################################

def fetch_documents(pmids: list, batch_size: int = BATCH) -> list:
    """
    Retrieve (pmid, title, abstract) tuples for given PMIDs in batches.
    """
    # Logic not changed from above
    docs = []
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start+batch_size]
        handle = Entrez.efetch(
            db="pubmed",
            id=','.join(batch),
            retmode="xml"
        )
        records = Entrez.read(handle)
        time.sleep(0.34)
        for art in records.get('PubmedArticle', []):
            pmid = art['MedlineCitation']['PMID']
            el   = art['MedlineCitation']['Article']
            title = el.get('ArticleTitle', '')
            abst  = ''
            if el.get('Abstract'):
                abst = ' '.join(el['Abstract']['AbstractText'])
            docs.append((pmid, title, abst))
    return docs

# Loop over each query in the training file or the test batches.
output = []
for idx, q in enumerate(questions, start=1):
    qid   = q.get('id', '')
    qtype = q.get('type', '')
    body  = q.get('body', '').strip()
    print(f"[{idx}/{len(questions)}] Querying: {qid}")

    # 1) Search PubMed with MeSH and Title/Abstract
    pmids = esearch_pmids(body)
    print(f"  Retrieved {len(pmids)} PMIDs")

    # 2) Fetch titles and abstracts
    docs = fetch_documents(pmids)
    print(f"  Fetched {len(docs)} documents")

    # 3) Structure output
    docs_json = [
        {"pmid": pmid, "title": title, "abstract": abst}
        for pmid, title, abst in docs
    ]

    output.append({
        "id":        qid,
        "type":      qtype,
        "body":      body,
        "documents": docs_json
    })

# WRITE OUTPUT
out_dir = Path("api_retrival")  # same directory name as the baseline output and OUT_DIR
out_dir.mkdir(exist_ok=True)
out_path = out_dir / "bio_api_docs_mesh.json"
with open(out_path, "w", encoding="utf-8") as f:
    json.dump({"questions": output}, f, indent=2, ensure_ascii=False)

%store out_path

print(f"Processed {len(output)} questions; output written to {out_path}")


## BM25 (Traditional IR Model)

<a id="bm-25"></a>
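As a rough illustration of what the BM25 stage computes, here is a from-scratch sketch of Okapi BM25 scoring over a toy corpus. It uses a common smoothed IDF variant and default `k1`/`b` values chosen for the example; it is not the `rank_bm25` implementation used in the notebook, whose IDF handling differs in detail.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], corpus: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document in `corpus` against the tokenized `query`."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates with k1 and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "marfan syndrome is caused by mutations in fbn1".split(),
    "aspirin reduces the risk of stroke".split(),
    "fbn1 mutations and aortic aneurysm".split(),
]
query = "fbn1 marfan".split()
scores = bm25_scores(query, corpus)
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The first document matches both query terms and ranks highest, the third matches only one, and the second matches none and scores zero.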

In [None]:
RETMX   = 1000
TOP_K   = 10
OUT_DIR = "api_retrival"

# Paths
INPUT_FILE  = Path(OUT_DIR) / "bio_api_docs_title+abstract.json"
OUTPUT_FILE = Path("bm_25") / "bm_25_output.json"

def main():
    # Read the list stored under the 'questions' key
    data = json.loads(INPUT_FILE.read_text(encoding="utf-8"))['questions']
    output = []

    for i, item in enumerate(data, 1):
        print(f"Step {i}/{len(data)}: Fetching PMIDs…")
        qid, qtype, body = item['id'], item.get('type'), item['body']

        pmids = esearch_pmids(body, RETMX)
        docs  = fetch_documents(pmids)
        print(f"  Fetched {len(docs)} documents")

        top_urls = []
        if docs:  # BM25Okapi cannot be built from an empty corpus
            # Each doc is (pmid, title, abstract); only the abstract is indexed
            corpus   = [preprocess(abst) for _, _, abst in docs]
            bm25     = BM25Okapi(corpus)
            q_tokens = preprocess(body)
            scores   = bm25.get_scores(q_tokens)

            # Indices of the TOP_K highest-scoring documents
            top_idx   = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:TOP_K]
            top_pmids = [docs[j][0] for j in top_idx]
            top_urls  = [f"http://www.ncbi.nlm.nih.gov/pubmed/{pmid}" for pmid in top_pmids]

        output.append({
            'id':        qid,
            'type':      qtype,
            'body':      body,
            'documents': top_urls,
            'snippets':  []
        })

    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    OUTPUT_FILE.write_text(json.dumps({'questions': output}, indent=2), encoding="utf-8")
    print(f"Saved output to {OUTPUT_FILE}")


main()

## Out-of-Box Neural Re-ranking (Cross-Encoder Trained on MS MARCO Passage Ranking)


<a id="Baseline-NN"></a>
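The re-ranking cell scores all (question, abstract) pairs in one call and then splits the flat score list back per question. The sketch below shows that bookkeeping on toy data. `overlap_score` is a hypothetical stand-in for `CrossEncoder.predict`, so the example runs without a model or GPU.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(questions: List[dict],
           score_fn: Callable[[List[Pair]], Sequence[float]],
           top_k: int = 10) -> Dict[str, List[str]]:
    """Flatten all pairs, score them in one call, then regroup per question."""
    pairs: List[Pair] = []
    counts: List[int] = []
    for q in questions:
        counts.append(len(q["documents"]))
        pairs.extend((q["body"], d["abstract"]) for d in q["documents"])

    scores = list(score_fn(pairs))  # single batched call over every pair

    ranked: Dict[str, List[str]] = {}
    start = 0
    for q, n in zip(questions, counts):
        chunk = scores[start:start + n]  # this question's slice of the scores
        start += n
        order = sorted(range(n), key=lambda i: chunk[i], reverse=True)[:top_k]
        ranked[q["id"]] = [q["documents"][i]["pmid"] for i in order]
    return ranked


def overlap_score(pairs: List[Pair]) -> List[float]:
    """Stand-in scorer (assumption): number of words shared by query and abstract."""
    return [float(len(set(q.lower().split()) & set(a.lower().split())))
            for q, a in pairs]


questions = [
    {"id": "q1", "body": "what causes marfan syndrome",
     "documents": [{"pmid": "1", "abstract": "aspirin and stroke"},
                   {"pmid": "2", "abstract": "fbn1 mutations cause marfan syndrome"}]},
    {"id": "q2", "body": "aspirin stroke prevention",
     "documents": [{"pmid": "3", "abstract": "aspirin lowers stroke risk"}]},
]
result = rerank(questions, overlap_score, top_k=1)
print(result)
```

Keeping `counts` alongside the flat list is what allows the scores to be sliced back correctly when questions have different pool sizes.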

In [None]:
# Sanity check from the first runs, to make sure the model is running on the cloud GPU.
assert torch.cuda.is_available(), "CUDA is not available."


with open(INPUT_FILE, 'r', encoding='utf-8') as f:
    data = json.load(f)

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L-6-v2",
    device="cuda"
)


top_k = 10  # Limited by the competition
questions = data['questions']

# Build one big list of all (query, abstract) pairs
all_pairs = []
# keep track of how many docs each question had
doc_counts = []
for q in questions:
    docs = q['documents']
    doc_counts.append(len(docs))
    all_pairs.extend([(q['body'], d['abstract']) for d in docs])

# Predict scores in large batches
scores = model.predict(all_pairs, batch_size=128)

# Split scores back per-question and pick top_k
output = {"questions": []}
idx = 0
for q, count in zip(questions, doc_counts):
    # slice out this question's scores
    chunk_scores = scores[idx: idx + count]
    idx += count

    # pick top_k within this chunk
    ranked = np.argsort(chunk_scores)[::-1][:top_k]
    top_docs = [q['documents'][i] for i in ranked]

    output["questions"].append({
        "id":       q.get("id"),
        "body":     q['body'],
        "documents": top_docs,
        "snippets": []
    })

# Save results(JSON Dump)

out_dir = "Neural_Reranking"
os.makedirs(out_dir, exist_ok=True)
neural_reranking_path = Path(out_dir) / "Baseline_Neural_Reranking.json"

with open(neural_reranking_path, "w", encoding="utf-8") as out:
    json.dump(output, out, indent=2, ensure_ascii=False)

print(f"Done: ran all questions in one big batch and wrote {neural_reranking_path}")


## Fine-Tune Neural Re-ranking on the Training Dataset

<a id="fine-tune"></a>
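The fine-tuning data is labelled weakly: a retrieved abstract counts as positive when it contains a gold snippet verbatim or an exact answer as a whole word. The sketch below isolates that rule so it can be tried on small inputs. The function name `weak_label` and the sample texts are assumptions for illustration, and the notebook's own labelling additionally checks the PMIDs referenced by the snippets.

```python
import re
from typing import List


def weak_label(abstract: str, snippet_texts: List[str],
               exact_answers: List[str]) -> int:
    """Return 1 if a gold snippet or an exact answer appears in the abstract, else 0."""
    # Rule 1: an expert-provided snippet occurs verbatim in the abstract.
    if any(s and s in abstract for s in snippet_texts):
        return 1
    # Rule 2: an exact answer occurs as a whole word (case-insensitive).
    lowered = abstract.lower()
    for answer in exact_answers:
        answer = answer.strip().lower()
        # Empty answers are skipped, otherwise the pattern would match everything.
        if answer and re.search(r"\b" + re.escape(answer) + r"\b", lowered):
            return 1
    return 0


abstract = "Mutations in FBN1 cause Marfan syndrome."
print(weak_label(abstract, [], ["fbn1"]))                    # exact-answer match
print(weak_label(abstract, ["FBN1 cause Marfan"], []))       # snippet match
print(weak_label("Aspirin lowers stroke risk.", [], ["FBN1"]))
```

The word-boundary anchors stop short answers from matching inside longer words, for example `patient` inside `patients`.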



In [None]:
# Only 500 of the roughly 5,000 training questions were used to train the model because of
# compute limits, which is why this filtered dataset is created.
out_dir = "Neural_Reranking/Fine_Tuning"
Path(out_dir).mkdir(parents=True, exist_ok=True)

filtered_training = Path(out_dir) / "Filtered_Training.json"

# Load the master file (the retrieved document pools) and collect its question IDs
with out_path.open('r', encoding='utf-8') as f:
    master = json.load(f)
master_ids = { q['id'] for q in master.get('questions', []) if 'id' in q }

# Load the ground-truth file (DATA_FILE is a plain string, so wrap it in Path)
with Path(DATA_FILE).open('r', encoding='utf-8') as f:
    meta = json.load(f)

# Keep only the ground-truth questions whose IDs appear in the master file
filtered_meta = [
    q for q in meta.get('questions', [])
    if q.get('id') in master_ids
]

# Write out the filtered meta
with filtered_training.open('w', encoding='utf-8') as f:
    json.dump({"questions": filtered_meta}, f, indent=2, ensure_ascii=False)

print(
    f"Filtered meta: kept {len(filtered_meta)} entries "
    f"out of {len(meta.get('questions', []))}."
)

In [None]:
# ------------------------------------------------------------------
TRAIN_CSV = Path(out_dir) / "train.csv"
DEV_CSV = Path(out_dir) / "dev.csv"
DEV_SPLIT  = 0.10   # 10 % of rows go to dev
BATCH_SIZE = 16
# ------------------------------------------------------------------

def is_positive(abs_txt, snippets, exacts):
    # Weak labelling by cross-referencing the BioASQ annotations: a document gets label 1 if
    # its abstract contains an exact-answer match or an expert-provided relevant snippet,
    # and label 0 otherwise.
    # This approach was chosen because we expect the querying step to make the same systematic
    # errors on the training and test data, so the model can learn to recognize those cases
    # (where we retrieve irrelevant documents).
    for s in snippets:
        if s['text'] in abs_txt:
            return True
    abs_lower = abs_txt.lower()
    for token in exacts:
        token = token.strip().lower()
        # Skip empty tokens: an empty pattern would match every abstract
        if token and re.search(r'\b{}\b'.format(re.escape(token)), abs_lower):
            return True
    return False

def flatten_exact(exact_answer):
    """ exact_answer can be a string, a list of strings, or a list of lists: flatten it """
    # A bare string (e.g. "yes" for yes/no questions) must not be iterated character by character
    if isinstance(exact_answer, str):
        return [exact_answer]
    flat = []
    for item in exact_answer:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat

# build (label, question, abstract) triples 
with filtered_training.open() as f:
    qs = json.load(f)['questions']

with out_path.open() as f:
    pools = {q['id']: q['documents'] for q in json.load(f)['questions']}

rows = []
for q in qs:
    qid   = q['id']
    qtext = q['body']
    pool  = pools.get(qid, [])

    snippets = q.get('snippets', [])
    exacts   = flatten_exact(q.get('exact_answer', []))
    pos_ids  = {s['document'].split('/')[-1] for s in snippets}

    for d in pool:
        abstract = d.get('abstract', '')
        matched  = (str(d['pmid']) in pos_ids) or is_positive(abstract, snippets, exacts)
        rows.append((int(matched), qtext, abstract))

print(f"Collected {len(rows):,} pairs")

# shuffle & split
random.shuffle(rows)
cut = int(len(rows) * DEV_SPLIT)

train_rows = rows[cut:]
dev_rows   = rows[:cut]

# write CSV with full quoting 
for path, split_rows in [(TRAIN_CSV, train_rows), (DEV_CSV, dev_rows)]:
    # `split_rows` avoids overwriting the global `data` loaded from the training file
    with path.open('w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(['label', 'question', 'abstract'])  # header
        writer.writerows(split_rows)
    print(f"Wrote {len(split_rows):,} rows → {path}")

#  helper to read the new CSV 
def read_csv(path: pathlib.Path):
    with path.open(encoding='utf-8') as f:
        rdr = csv.DictReader(f, quoting=csv.QUOTE_ALL)
        for row in rdr:
            yield InputExample(
                texts=[row['question'], row['abstract']],
                label=float(row['label'])
            )

train_samples = list(read_csv(TRAIN_CSV))
dev_samples   = list(read_csv(DEV_CSV))
train_loader  = DataLoader(train_samples, shuffle=True, batch_size=BATCH_SIZE)

print("Data ready:",
      len(train_samples), "train pairs |",
      len(dev_samples),   "dev pairs | batch =", BATCH_SIZE)


In [None]:
# Read back the files written above (same relative paths as TRAIN_CSV / DEV_CSV)
train_df = pd.read_csv(TRAIN_CSV)

test_df = pd.read_csv(DEV_CSV)

In [None]:
print(f"Missing Data in the Training Dataset: \n{train_df.isna().sum()}")

print(f"Missing Data in the Test Dataset: \n{test_df.isna().sum()}")

In [None]:
# Drop rows with missing values so that we have a clean dataset
train_df = train_df.dropna()

test_df = test_df.dropna()

In [None]:
# index=False keeps the pandas row index out of the saved files
train_df.to_csv(TRAIN_CSV, index=False)

test_df.to_csv(DEV_CSV, index=False)

In [2]:
# Initializing the Cross Encoder
model = CrossEncoder(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
    num_labels=1,
    max_length=512,
    device="cuda"
)

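# Dropout stays at the usual BERT value of 0.1. Changing the config after the model is built
# does not alter the existing dropout layers; the small learning rate that protects prior
# knowledge is set later in model.fit().
model.config.hidden_dropout_prob = 0.1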

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [3]:
# Preparing text-pair training data

BATCH     = 16

# Build InputExample pairs from the cleaned DataFrames
# (train_df / test_df are DataFrames, not file paths, so iterate over their rows)
def load_pairs(df):
    for row in df.itertuples(index=False):
        yield InputExample(
            texts=[str(row.question), str(row.abstract)],
            label=float(row.label)
        )

train_samples = list(load_pairs(train_df))
dev_samples   = list(load_pairs(test_df))

train_loader  = DataLoader(train_samples, shuffle=True, batch_size=BATCH)

print(f"{len(train_samples):,} training pairs | "
      f"{len(dev_samples):,} dev pairs | batch={BATCH}")

441,019 training pairs | 48,942 dev pairs | batch=16


In [4]:
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    device      = "cuda"
    print(f"✅ GPU found → {device_name}")
else:
    device = "cpu"
    print("No CUDA GPU detected.")

✅ GPU found → Tesla T4


In [5]:
model = CrossEncoder(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
    num_labels=1,
    max_length=512,
    device="cuda"         # T4 GPU since we are working on Lightning.ai
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
model.fit(
    train_dataloader=train_loader,
    epochs=1,
    warmup_steps=int(0.1 * len(train_loader)),
    use_amp=True,
    output_path="/Neural_Reranking/Fine_Tuning/Model/biomed_cross_encoder",      
    show_progress_bar=True,        
    optimizer_params={'lr': 2e-5}
)

print("✅  Fine‑tuning complete. Model saved to biomed_cross_encoder/")

Token indices sequence length is longer than the specified maximum sequence length for this model (536 > 512). Running this sequence through the model will result in indexing errors


| Step | Training Loss |
|------|---------------|
| 500  | 0.3642 |
| 1000 | 0.2504 |
| 1500 | 0.2234 |
| 2000 | 0.2192 |
| 2500 | 0.1653 |
| 3000 | 0.1449 |
| 3500 | 0.1341 |
| 4000 | 0.1274 |
| 4500 | 0.1172 |
| 5000 | 0.1187 |


✅  Fine‑tuning complete. Model saved to biomed_cross_encoder/


In [13]:
model.save("/Neural_Reranking/Fine_Tuning/Model/biomed_cross_encoder")

In [None]:
# Loading the Model Parameters
model_path = "/Neural_Reranking/Fine_Tuning/Model/biomed_cross_encoder"
model_dev  = CrossEncoder(model_path, device="cuda")

# Build the (question, abstract) pairs and their gold labels
pairs   = [ex.texts for ex in dev_samples]   # [(q, doc), (q, doc), ...]
labels  = [ex.label for ex in dev_samples]   # floats 0 / 1


# Score the dev pairs and compare the scores with the labels
scores  = model_dev.predict(pairs, batch_size=128)
mse     = np.mean((np.array(scores) - np.array(labels)) ** 2)
pearson = np.corrcoef(scores, labels)[0, 1]

print(f"\nDev set MSE       : {mse:.4f}")
print(f"Dev set Pearson r : {pearson:.4f}")

## Testing
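Before the evaluation on the real files, a small worked example may help show how the four reported metrics behave. The sketch below follows the same per-query logic as the evaluation cell, applied to a made-up gold set and ranking; the PMIDs and the name `metrics_at_k` are purely illustrative.

```python
from typing import List, Set, Tuple


def metrics_at_k(gold: Set[str], ranked: List[str], k: int) -> Tuple[float, float, float, float]:
    """Return (precision@k, recall@k, average precision, reciprocal rank) for one query."""
    hits = 0
    precisions = []  # precision measured at each rank where a relevant doc appears
    rr = 0.0
    for rank, pmid in enumerate(ranked[:k], start=1):
        if pmid in gold:
            hits += 1
            precisions.append(hits / rank)
            if rr == 0.0:
                rr = 1 / rank  # the first relevant document fixes the reciprocal rank
    precision = hits / k
    recall = hits / len(gold) if gold else 0.0
    ap = sum(precisions) / len(gold) if gold else 0.0
    return precision, recall, ap, rr


gold = {"11", "22", "33"}
ranked = ["99", "11", "88", "22", "77"]  # relevant documents at ranks 2 and 4
p, r, ap, rr = metrics_at_k(gold, ranked, k=5)
print(f"P@5={p:.3f}  R@5={r:.3f}  AP={ap:.3f}  RR={rr:.3f}")
```

Here two of three gold documents are found, so precision is 2/5 and recall is 2/3. Average precision is (1/2 + 2/4) / 3 = 1/3 because the missed document still counts in the denominator, and the reciprocal rank is 1/2.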

In [2]:
# ------------------------------------------------------------------
PRED_FILE = Path("/Users/greinaldpappa/Downloads/batch4_out.json")
GOLD_FILE = Path("/Users/greinaldpappa/Downloads/BioASQ-task13bPhaseB-testset4.txt")
K         = 10                     # evaluate at top‑K
# ------------------------------------------------------------------

def pmid_from_url(url: str) -> str:
    """extract PubMed ID from any of the URLs"""
    m = re.search(r'/pubmed/(\d+)', url)
    return m.group(1) if m else url.rsplit('/', 1)[-1]

# ---------- load gold ----------
with GOLD_FILE.open(encoding="utf-8") as f:
    gold_data = json.load(f)

gold_docs = {
    q["id"]: { pmid_from_url(d) for d in q["documents"] }
    for q in gold_data["questions"]
}

# ---------- load predictions ----------
with PRED_FILE.open(encoding="utf-8") as f:
    pred_data = json.load(f)

pred_docs = {
    q["id"]: [ pmid_from_url(d) for d in q["documents"] ]
    for q in pred_data["questions"]
}

# ---------- compute metrics ----------
def metrics_for_query(gold_set, pred_list, k):
    """return (precision@k, recall@k, AP, RR) for one query"""
    hits = 0
    precisions = []
    rr = 0
    for rank, pmid in enumerate(pred_list[:k], start=1):
        if pmid in gold_set:
            hits += 1
            precisions.append(hits / rank)      # precision at this hit
            if rr == 0:
                rr = 1 / rank                   # first relevant ⇒ RR
    prec_k  = hits / k
    rec_k   = hits / len(gold_set) if gold_set else 0
    ap      = sum(precisions) / len(gold_set) if gold_set else 0
    return prec_k, rec_k, ap, rr

P_sum = R_sum = AP_sum = RR_sum = 0
count = 0
missing = []

for qid, gold_set in gold_docs.items():
    pred_list = pred_docs.get(qid)
    if pred_list is None:
        missing.append(qid)
        continue
    p, r, ap, rr = metrics_for_query(gold_set, pred_list, K)
    P_sum  += p
    R_sum  += r
    AP_sum += ap
    RR_sum += rr
    count  += 1

print(f"Evaluated {count} questions   (missing preds for {len(missing)})")
print(f"Precision@{K}: {P_sum / count:.3f}")
print(f"Recall@{K}   : {R_sum / count:.3f}")
print(f"MAP          : {AP_sum / count:.3f}")
print(f"MRR          : {RR_sum / count:.3f}")


Evaluated 85 questions   (missing preds for 0)
Precision@10: 0.022
Recall@10   : 0.100
MAP          : 0.059
MRR          : 0.108
