# 03: Extract Metadata, References, and Search Strategies from Cochrane PDFs

## Summary
This notebook extracts metadata, references, and **search strategies** from Cochrane review PDFs using **PyMuPDF** (fitz) - a fast PDF library.

### Extracted Data:

**Metadata:**
- `doi`: DOI from filename
- `title`: Review title
- `authors`: Author list (from abstract section)
- `abstract`: Full abstract text
- `review_type`: review, protocol, or withdrawn
- `cochrane_group`: Cochrane review group

**References:**
- `category`: included, excluded, awaiting, ongoing
- `study_id`: Author Year format
- `authors`: Full author list
- `title`: Reference title
- `year`: Publication year
- `ref_doi`: DOI of reference
- `pmid`: PubMed ID
- `full_citation`: Complete citation text

**Search Strategies (NEW):**
- `doi`: Review DOI
- `database`: Database searched (e.g., "Ovid MEDLINE", "PubMed")
- `raw_strategy`: Original search strategy text (usually Ovid syntax)
- `pubmed_query`: Translated PubMed query (for automated search execution)
- `translation_notes`: Any issues encountered during translation

### Performance:
- **~15-20 minutes** for 16,588 PDFs (using PyMuPDF)
- ~50x faster than pdfplumber

### Output:
- `Data/review_metadata.csv`
- `Data/categorized_references.csv`
- `Data/search_strategies.csv` (NEW)

In [11]:
%pip install -q pymupdf pandas tqdm

Note: you may need to restart the kernel to use updated packages.


In [12]:
import os
from pathlib import Path
import pandas as pd
import fitz  # PyMuPDF - fast PDF library
import re
import time
from collections import Counter
from tqdm.notebook import tqdm
from typing import Dict, List, Tuple

# Setup paths
notebook_dir = Path.cwd()
project_root = notebook_dir if (notebook_dir / "Data").exists() else notebook_dir.parent
DATA_DIR = project_root / "Data"
PDF_DIR = DATA_DIR / "cochrane_pdfs"

METADATA_CSV = DATA_DIR / "review_metadata.csv"
REFERENCES_CSV = DATA_DIR / "categorized_references.csv"
SEARCH_STRATEGIES_CSV = DATA_DIR / "search_strategies.csv"

pdf_files = list(PDF_DIR.glob("*.pdf"))
print(f"Project root: {project_root}")
print(f"PDF directory: {PDF_DIR}")
print(f"Found {len(pdf_files):,} PDFs")

Project root: c:\Users\juanx\Documents\LSE-UKHSA Project
PDF directory: c:\Users\juanx\Documents\LSE-UKHSA Project\Data\cochrane_pdfs
Found 16,618 PDFs


In [3]:
# =============================================================================
# PyMuPDF Helper Functions
# =============================================================================
# PyMuPDF (fitz) is 10-50x faster than pdfplumber for text extraction
# =============================================================================

def extract_text_fast(pdf_path: Path, start_page: int = 0, end_page: int = None) -> str:
    """Extract text from PDF pages using PyMuPDF (fast)."""
    doc = fitz.open(pdf_path)
    if end_page is None:
        end_page = len(doc)
    
    text = ""
    for i in range(start_page, min(end_page, len(doc))):
        text += doc[i].get_text() + "\n\n"
    doc.close()
    return text


def get_pdf_page_count(pdf_path: Path) -> int:
    """Get page count quickly."""
    doc = fitz.open(pdf_path)
    count = len(doc)
    doc.close()
    return count


# Quick speed test
test_pdf = pdf_files[0]
start = time.time()
text = extract_text_fast(test_pdf)
elapsed = time.time() - start
print(f"✓ PyMuPDF extracted {len(text):,} chars in {elapsed:.3f} seconds")
print(f"  Estimated time for all {len(pdf_files):,} PDFs: {elapsed * len(pdf_files) / 60:.1f} minutes")

✓ PyMuPDF extracted 31,979 chars in 0.040 seconds
  Estimated time for all 16,618 PDFs: 11.0 minutes


In [4]:
# =============================================================================
# Metadata Extraction (using PyMuPDF)
# =============================================================================

def extract_metadata(pdf_path: Path) -> Dict:
    """Extract review metadata from first pages of PDF."""
    doi = pdf_path.stem.replace("-", "/")
    result = {
        'doi': doi, 'title': '', 'authors': '', 'abstract': '',
        'review_type': '', 'cochrane_group': '',
    }
    try:
        # Extract first 5 pages for metadata
        text = extract_text_fast(pdf_path, 0, 5)
        
        # --- TITLE ---
        title_patterns = [
            r'Cochrane Database of Systematic Reviews\s*\n+\s*([A-Z][^\.]{10,200})',
            r'CochraneDatabaseofSystematicReviews\s*\n*\s*([A-Z][^\.]{10,200})',
            r'Review\s*\n+([A-Z][A-Za-z\s\-\,\:]{10,200})',
        ]
        for pattern in title_patterns:
            title_match = re.search(pattern, text)
            if title_match:
                title = re.sub(r'\s+', ' ', title_match.group(1)).strip()
                if len(title) > 10 and not title.startswith('Copyright'):
                    result['title'] = title[:500]
                    break
        
        # --- AUTHORS ---
        authors_match = re.search(
            r'^([A-Z][a-z]+(?:\s+[A-Z]\.?)+(?:,\s*[A-Z][a-z]+(?:\s+[A-Z]\.?)+)*)',
            text[200:2000], re.MULTILINE
        )
        if authors_match:
            result['authors'] = re.sub(r'\s+', ' ', authors_match.group(1)).strip()[:500]
        
        # --- ABSTRACT ---
        # Look for "Background" section content (not the table of contents with dots)
        # Key: Find "Background" followed by actual paragraph text, not "..." dots
        abstract_patterns = [
            # Pattern 1: Background section with real content (no dots)
            r'Background\s*\n+([A-Z][^\.]{20,}(?:\.\s+[A-Z][^\.]+)*\.)',
            # Pattern 2: Objectives section
            r'Objectives\s*\n+([A-Z][^\.]{20,}(?:\.\s+[A-Z][^\.]+)*\.)',
            # Pattern 3: Summary section
            r'Summary\s*\n+([A-Z][^\.]{20,}(?:\.\s+[A-Z][^\.]+)*\.)',
        ]
        for pattern in abstract_patterns:
            abs_match = re.search(pattern, text)
            if abs_match:
                abstract = abs_match.group(1).strip()
                # Skip if it's mostly dots (table of contents)
                if '...' not in abstract and len(abstract) > 50:
                    result['abstract'] = re.sub(r'\s+', ' ', abstract)[:3000]
                    break
        
        # Fallback: Try to find any substantive text after "Background"
        if not result['abstract']:
            bg_match = re.search(r'Background[:\s]+(.{100,1500}?)(?=\n\s*(?:Objectives|Methods|Search|Selection))', 
                                 text, re.IGNORECASE | re.DOTALL)
            if bg_match:
                abstract = bg_match.group(1).strip()
                abstract = re.sub(r'\.+\s*\d+', '', abstract)  # Remove "... 2" patterns
                abstract = re.sub(r'\s+', ' ', abstract)
                if len(abstract) > 50 and '...' not in abstract:
                    result['abstract'] = abstract[:3000]
        
        # --- REVIEW TYPE ---
        text_lower = text.lower()[:3000]
        if 'protocol' in text_lower:
            result['review_type'] = 'protocol'
        elif 'withdrawn' in text_lower:
            result['review_type'] = 'withdrawn'
        else:
            result['review_type'] = 'review'
        
        # --- COCHRANE GROUP ---
        group_patterns = [
            r'Cochrane\s+([A-Za-z\s&]+?)\s+Group',
            r'Cochrane\s+([A-Za-z\s&]+?)\s+Review Group',
        ]
        for pattern in group_patterns:
            group_match = re.search(pattern, text)
            if group_match:
                result['cochrane_group'] = group_match.group(1).strip()
                break
                
    except Exception as e:
        result['error'] = str(e)
    return result


# Test on a few PDFs
print("Testing metadata extraction...")
for i in range(5):
    meta = extract_metadata(pdf_files[i])
    print(f"\n{meta['doi']}")
    print(f"  Title: {meta['title'][:60]}..." if meta['title'] else "  Title: (none)")
    print(f"  Type: {meta['review_type']}")
    print(f"  Abstract: {meta['abstract'][:80]}..." if meta['abstract'] else "  Abstract: (none)")

Testing metadata extraction...

10.1002/14651858.CD000004
  Title: Abdominal decompression for suspected fetal compromise/pre-e...
  Type: review
  Abstract: Abdominal decompression was developed as a means of pain relief during labour. I...

10.1002/14651858.CD000004.pub2
  Title: Abdominal decompression for suspected fetal compromise/pre-e...
  Type: review
  Abstract: Abdominal decompression was developed as a means of pain relief during labour. I...

10.1002/14651858.CD000005
  Title: Absorbable staples for uterine incision at caesarean section...
  Type: review
  Abstract: Staples can be placed during the making of an incision, with the aim of decreasi...

10.1002/14651858.CD000005.pub2
  Title: Absorbable staples for uterine incision at caesarean section...
  Type: protocol
  Abstract: (none)

10.1002/14651858.CD000006
  Title: Absorbable synthetic versus catgut suture material for perin...
  Type: review
  Abstract: Approximately 70% of women will experience some degree of perin

In [5]:
# =============================================================================
# Search Strategy Extraction and Ovid-PubMed Translation  (v4)
# =============================================================================
# v3 fixes + v4 improvements:
#  11. Pattern 2b: header-less numbered Ovid blocks (recovers ~261 narratives)
#  12. Tighter quality filter: short queries (<100 chars) must have field tag,
#      wildcard, or quoted phrase — filters out author names & junk
#  13. Better field-tag regex in quality gate (includes Mesh:NoExp)
# =============================================================================


def extract_search_strategy(pdf_path: Path) -> List[Dict]:
    """Extract search strategies from a Cochrane PDF."""
    doi = pdf_path.stem.replace("-", "/")
    strategies = []

    try:
        doc = fitz.open(pdf_path)
        full_text = ""
        for i in range(len(doc)):
            full_text += doc[i].get_text() + "\n"
        doc.close()

        # Pattern 1: numbered search blocks (require "N." or "N)" after DB header)
        numbered_pat = re.compile(
            r'((?:Ovid\s+)?(?:MEDLINE|EMBASE|PubMed|CENTRAL|PsycINFO)[^\n]*)\n'
            r'((?:\s*\d+[\.\)]\s+.+\n)+)',
            re.IGNORECASE,
        )
        for m in numbered_pat.finditer(full_text):
            db = m.group(1).strip()
            raw = m.group(2).strip()
            if "medline" in db.lower():
                strategies.append({"doi": doi, "database": db, "raw_strategy": raw})

        # Pattern 1b: hit-count format  "1 Term/ (7133)"
        if not strategies:
            hc_pat = re.compile(
                r'((?:Ovid\s+)?(?:MEDLINE|EMBASE|PubMed|CENTRAL|PsycINFO)[^\n]*)\n'
                r'((?:\s*\d{1,3}\s+\S.+\(\d[\d,]*\)\s*\n)+)',
                re.IGNORECASE,
            )
            for m in hc_pat.finditer(full_text):
                db = m.group(1).strip()
                raw = m.group(2).strip()
                if "medline" in db.lower():
                    strategies.append({"doi": doi, "database": db, "raw_strategy": raw})

        # Pattern 2: Appendix-style
        if not strategies:
            app_pat = re.compile(
                r'Appendix\s*\d*[.:]*\s*(MEDLINE[^\n]*)\n'
                r'(.{100,3000}?)'
                r'(?=\nAppendix|\nW H A T|\nH I S T O R Y|\Z)',
                re.IGNORECASE | re.DOTALL,
            )
            m = app_pat.search(full_text)
            if m:
                strategies.append({
                    "doi": doi,
                    "database": m.group(1).strip(),
                    "raw_strategy": m.group(2).strip(),
                })

        # Pattern 2b (NEW): header-less numbered Ovid blocks anywhere in text
        # Recovers ~261 strategies previously classified as narrative_only
        if not strategies:
            ovid_block_pat = re.compile(
                r'((?:^\s*\d{1,3}[\.\)]\s+.+(?:\.(tw|ti|ab|sh|mp|kf|kw|pt|fs|hw)\.|/|\bexp\s).+\n){4,})',
                re.IGNORECASE | re.MULTILINE,
            )
            m = ovid_block_pat.search(full_text)
            if m:
                strategies.append({
                    "doi": doi,
                    "database": "MEDLINE (inferred from Ovid syntax)",
                    "raw_strategy": m.group(0).strip(),
                })

        # Pattern 3: narrative-only (kept for stats)
        if not strategies:
            meth_pat = re.compile(
                r'(Electronic searches|Search methods)[^\n]*\n'
                r'(.{100,2000}?)'
                r'(?=Data collection|Selection of studies)',
                re.IGNORECASE | re.DOTALL,
            )
            m = meth_pat.search(full_text)
            if m:
                strategies.append({
                    "doi": doi,
                    "database": "narrative_only",
                    "raw_strategy": m.group(2).strip(),
                })

    except Exception as e:
        strategies.append({
            "doi": doi, "database": "error",
            "raw_strategy": f"ERROR: {str(e)[:100]}",
        })

    return strategies


# =====================  helper: range expansion  ============================

def expand_range_expr(expr: str) -> List[int]:
    """Expand '1-4', '1-4,6', '1-4, 6-8' into sorted list of ints."""
    nums: List[int] = []
    for part in re.split(r'[,\s]+', expr.strip()):
        part = part.strip()
        if not part:
            continue
        if '-' in part:
            try:
                a, b = part.split('-', 1)
                nums.extend(range(int(a), int(b) + 1))
            except (ValueError, IndexError):
                pass
        else:
            try:
                nums.append(int(part))
            except ValueError:
                pass
    return sorted(set(nums))


# =====================  pre-processing  =====================================

def preprocess_raw_strategy(raw: str) -> str:
    """Fix common OCR errors and strip hit-count suffixes."""
    raw = re.sub(r'\b0r/', 'or/', raw)
    raw = re.sub(r'\b0R/', 'OR/', raw)
    raw = re.sub(r'\s+\(\d[\d,]*\)\s*$', '', raw, flags=re.MULTILINE)
    return raw


# =====================  main translator  ====================================

def translate_ovid_to_pubmed(raw_strategy: str) -> Tuple[str, str]:
    """Translate an Ovid MEDLINE search block to PubMed syntax (v4)."""
    if not raw_strategy or len(raw_strategy) < 10:
        return "", "empty_strategy"

    notes: list = []
    clean_raw = preprocess_raw_strategy(raw_strategy)

    # Parse numbered lines
    lines: Dict[int, str] = {}
    line_pat = re.compile(r'^\s*(\d{1,3})(?:[.\)]\s+|\s{1,4})(.+)$', re.MULTILINE)
    for m in line_pat.finditer(clean_raw):
        lines[int(m.group(1))] = m.group(2).strip()

    # Fallback for un-numbered strategies
    if not lines:
        fq = translate_fallback_strategy(raw_strategy, notes)
        if fq:
            return fq, "; ".join(notes) if notes else "fallback_parse"
        return "", "no_numbered_lines"

    # Translate each line
    translated: Dict[int, str] = {}
    for num, content in lines.items():
        translated[num] = translate_ovid_line(content, notes)

    # Pick best final line
    final_line = find_best_final_line(lines, translated)

    # Resolve references
    pubmed_query = resolve_line_references(translated, final_line, notes)

    # Fallback: if resolution is empty, OR all non-empty translated terms
    if not pubmed_query.strip():
        non_empty = [
            t for _, t in sorted(translated.items())
            if t.strip() and not is_line_reference(t)
        ]
        if non_empty:
            pubmed_query = " OR ".join(f"({t})" for t in non_empty)
            notes.append("final_line_empty_used_fallback")

    pubmed_query = clean_pubmed_query(pubmed_query)

    # Quality gate
    if pubmed_query and not is_valid_search_query(pubmed_query):
        notes.append("filtered_not_a_query")
        return "", "filtered_not_a_query"

    # PubMed hard limit ~4000 chars
    if len(pubmed_query) > 4000:
        notes.append("query_truncated")
        pubmed_query = simplify_long_query(pubmed_query, 4000)

    return pubmed_query, "; ".join(notes) if notes else "success"


# =====================  final-line heuristic  ===============================

def find_best_final_line(original_lines: dict, translated_lines: dict) -> int:
    """Return the last line that combines earlier lines (or max line number)."""
    for ln in sorted(original_lines.keys(), reverse=True):
        raw = original_lines[ln]
        if re.match(r'^(?:or|and)/\d', raw, re.IGNORECASE):
            return ln
        if is_line_reference(translated_lines.get(ln, '')):
            return ln
        if re.match(r'^\d[\d\-\s,]*(or|and|not)\s+\d', raw, re.IGNORECASE):
            return ln
    return max(original_lines.keys())


# =====================  quality gate (v4 — tighter)  ========================

def is_valid_search_query(query: str) -> bool:
    """Reject obvious non-search text (database names, author lists, etc.).

    v4: short queries (<100 chars) must contain a PubMed field tag, wildcard,
    or quoted phrase.  This filters ~978 junk short queries while preserving
    legitimate single-concept searches.
    """
    if len(query) < 5:
        return False
    has_field = bool(re.search(
        r'\[(?:Mesh|Mesh:NoExp|tiab|ti|pt|sh|la|ta)\]', query, re.IGNORECASE))
    has_bool  = bool(re.search(r'\b(?:AND|OR|NOT)\b', query))
    has_quote = '"' in query
    has_wild  = '*' in query

    # Short queries: require a concrete search indicator
    if len(query) < 100:
        return has_field or has_wild or has_quote

    # Longer queries: booleans alone are also fine
    return has_field or has_bool or has_quote or has_wild


# =====================  fallback (non-numbered)  ============================

def translate_fallback_strategy(raw_strategy: str, notes: list) -> str:
    """Best-effort parser for strategies without numbered lines."""
    notes.append("fallback_parse")
    terms: list = []

    # PubMed-style: term[field]
    for m in re.finditer(
        r'(["\w\*]+(?:\s+["\w\*]+)*)\s*\[(tiab|ti|ab|mesh|pt|all fields)\]',
        raw_strategy, re.IGNORECASE,
    ):
        term = m.group(1).strip()
        fmap = {'tiab': 'tiab', 'ti': 'ti', 'ab': 'tiab', 'mesh': 'Mesh', 'pt': 'pt'}
        terms.append(f'{term}[{fmap.get(m.group(2).lower(), "tiab")}]')

    # Ovid-style: term.field.
    for m in re.finditer(
        r'([A-Za-z\*\?]+(?:\s+[A-Za-z\*\?]+)*)\.(tw|ti|ab|mp|sh|kf|kw)\.',
        raw_strategy, re.IGNORECASE,
    ):
        term = m.group(1).strip()
        fmap = {'tw': 'tiab', 'ti': 'ti', 'ab': 'tiab', 'mp': 'tiab',
                'sh': 'Mesh', 'kf': 'tiab', 'kw': 'tiab'}
        term = term.replace('$', '*').replace(':', '*').replace('?', '*')
        terms.append(f'{term}[{fmap.get(m.group(2).lower(), "tiab")}]')

    # exp MeSH/
    for m in re.finditer(r'exp\s+([^/\n]+)/', raw_strategy, re.IGNORECASE):
        terms.append(f'"{m.group(1).strip()}"[Mesh]')

    if terms:
        seen: set = set()
        unique = []
        for t in terms:
            key = t.lower()
            if key not in seen:
                seen.add(key)
                unique.append(t)
        return " OR ".join(unique)
    return ""


# =====================  per-line translator  ================================

def translate_ovid_line(line: str, notes: list) -> str:
    """Translate a single Ovid search line to PubMed syntax."""
    original = line.strip()

    # -- combination patterns: or/N-M, and/N-M
    combo = re.match(r'^(or|and)/(.+)$', original, re.IGNORECASE)
    if combo:
        op = combo.group(1).upper()
        nums = expand_range_expr(combo.group(2))
        if nums:
            return f" {op} ".join(str(n) for n in nums)

    # -- field-tagged line refs: "1-4.kf." , "1-8.mp."
    fref = re.match(r'^([\d\-,\s]+)\.(kf|kw|tw|ti|ab|mp|sh)\.$', original, re.IGNORECASE)
    if fref:
        nums = expand_range_expr(fref.group(1))
        if nums:
            return " OR ".join(str(n) for n in nums)

    # -- pure line references (including NOT): "9 not 10", "1-5 or 7"
    if is_line_reference(original):
        return original

    # -- LIMIT commands
    if original.lower().startswith('limit'):
        if 'human' in original.lower():
            if 'humans_limit_converted' not in notes:
                notes.append('humans_limit_converted')
            return '"humans"[Mesh]'
        if 'english' in original.lower():
            if 'english_limit_converted' not in notes:
                notes.append('english_limit_converted')
            return '"english"[la]'
        if 'limit_skipped' not in notes:
            notes.append('limit_skipped')
        return ""

    line = original

    # 1. MeSH with subheadings
    line = re.sub(r'exp\s+([^/]+)/[a-z,\s]+', r'"\1"[Mesh]', line, flags=re.IGNORECASE)
    # 2. Explode MeSH
    line = re.sub(r'exp\s+([^/]+)/', r'"\1"[Mesh]', line, flags=re.IGNORECASE)
    # 3. Non-exp MeSH (not or/ and/ not/)
    line = re.sub(
        r'(?<!["\w])(?!(?:or|and|not)/)([A-Za-z][A-Za-z\s\-,]+)/',
        r'"\1"[Mesh:NoExp]', line, flags=re.IGNORECASE,
    )

    # 4. Field tags
    line = re.sub(r'\.tw\.', '[tiab]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.ti\.', '[ti]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.ab\.', '[tiab]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.pt\.', '[pt]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.sh\.', '[Mesh]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.kf\.', '[tiab]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.kw\.', '[tiab]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.hw\.', '[tiab]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.fs\.', '[sh]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.nm\.', '[tiab]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.rn\.', '[tiab]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.jw\.', '[ta]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.ot\.', '[tiab]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.ti,ab\.', '[tiab]', line, flags=re.IGNORECASE)
    line = re.sub(r'\.ab,ti\.', '[tiab]', line, flags=re.IGNORECASE)

    # 5. .mp. -> [tiab]
    if '.mp.' in line.lower():
        line = re.sub(r'\.mp\.', '[tiab]', line, flags=re.IGNORECASE)
        if 'mp_approximated' not in notes:
            notes.append('mp_approximated')

    # 6. Truncation
    line = re.sub(r'(\w+)[\:\$]', r'\1*', line)

    # 7. Wildcard
    if '?' in line:
        line = re.sub(r'\?', '*', line)
        if 'wildcard_approximated' not in notes:
            notes.append('wildcard_approximated')

    # 8. Adjacency
    if re.search(r'\badj\d*\b', line, re.IGNORECASE):
        line = re.sub(r'\s+adj\d*\s+', ' AND ', line, flags=re.IGNORECASE)
        if 'adjacency_converted_to_AND' not in notes:
            notes.append('adjacency_converted_to_AND')

    # 9. NEAR
    if re.search(r'\bnear/?\d*\b', line, re.IGNORECASE):
        line = re.sub(r'\s+near/?\d*\s+', ' AND ', line, flags=re.IGNORECASE)
        if 'near_converted_to_AND' not in notes:
            notes.append('near_converted_to_AND')

    # 10. Boolean -> uppercase
    line = re.sub(r'\bor\b',  'OR',  line, flags=re.IGNORECASE)
    line = re.sub(r'\band\b', 'AND', line, flags=re.IGNORECASE)
    line = re.sub(r'\bnot\b', 'NOT', line, flags=re.IGNORECASE)

    # 11. Smart quotes -> straight
    line = line.replace('\u201c', '"').replace('\u201d', '"')

    return line


# =====================  line-reference detection  ===========================

def is_line_reference(content: str) -> bool:
    """True when content is purely refs to other lines (with OR/AND/NOT)."""
    content = content.strip()
    if re.match(r'^[\d\s\-,]+(?:(?:or|and|not)[\d\s\-,]+)+$', content, re.IGNORECASE):
        return True
    if re.match(r'^\d+\s*-\s*\d+$', content):
        return True
    return False


# =====================  reference resolver  =================================

def resolve_line_references(
    translated_lines: Dict[int, str], target_line: int, notes: list
) -> str:
    """Recursively resolve line refs into concrete query text."""

    def resolve(line_num: int, visited: set) -> str:
        if line_num in visited:
            if 'circular_reference' not in notes:
                notes.append('circular_reference')
            return ""
        visited.add(line_num)

        content = translated_lines.get(line_num, "")
        if not content.strip():
            return ""

        if is_line_reference(content):
            parts = re.split(r'\s+(or|and|not)\s+', content, flags=re.IGNORECASE)
            resolved_parts: list = []
            operators: list = []

            for p in parts:
                p = p.strip()
                if p.upper() in ('OR', 'AND', 'NOT'):
                    operators.append(p.upper())
                else:
                    line_nums = expand_range_expr(p)
                    subs = []
                    for ln in line_nums:
                        r = resolve(ln, visited.copy())
                        if r:
                            subs.append(f"({r})")
                    if subs:
                        resolved_parts.append(
                            " OR ".join(subs) if len(subs) > 1 else subs[0]
                        )

            if not resolved_parts:
                return ""

            result = resolved_parts[0]
            for i, op in enumerate(operators):
                if i + 1 < len(resolved_parts):
                    result = f"({result}) {op} ({resolved_parts[i + 1]})"
            return result
        else:
            return content

    return resolve(target_line, set())


# =====================  query clean-up  =====================================

def clean_pubmed_query(query: str) -> str:
    """Tidy translated PubMed query."""
    query = re.sub(r'\s+', ' ', query).strip()
    query = re.sub(r'""', '"', query)

    for _ in range(3):
        query = re.sub(r'\(\s*\)', '', query)

    query = re.sub(r'\(\s*(?:OR|AND|NOT)\s*\)', '', query)
    query = re.sub(r'^\s*(?:OR|AND)\s+', '', query)
    query = re.sub(r'\s+(?:OR|AND)\s*$', '', query)
    query = re.sub(r'(?:OR|AND)\s+(?:OR|AND)', 'OR', query)

    o, c = query.count('('), query.count(')')
    if o > c:
        query += ')' * (o - c)
    elif c > o:
        query = '(' * (c - o) + query

    return query.strip()


def simplify_long_query(query: str, max_length: int) -> str:
    """Trim a query that exceeds PubMed's char limit."""
    s = re.sub(r'\([^()]*\[All Fields\][^()]*\)\s*(?:OR|AND)?\s*', '', query)
    if len(s) <= max_length:
        return clean_pubmed_query(s)
    s = re.sub(r'\([^()]*\[sh\][^()]*\)\s*(?:OR|AND)?\s*', '', s)
    if len(s) <= max_length:
        return clean_pubmed_query(s)
    trunc = s[:max_length]
    lp = trunc.rfind(')')
    if lp > max_length * 0.7:
        trunc = trunc[:lp + 1]
    return clean_pubmed_query(trunc)


# ====  Quick smoke test  ====================================================
print("Testing v4 translator on first 10 PDFs ...")
test_strategies = []
for pdf in pdf_files[:10]:
    test_strategies.extend(extract_search_strategy(pdf))

translated_ok = 0
for s in test_strategies:
    if s["database"] != "narrative_only":
        q, n = translate_ovid_to_pubmed(s["raw_strategy"])
        s["pubmed_query"] = q
        s["translation_notes"] = n
        if q:
            translated_ok += 1

print(f"Found {len(test_strategies)} strategies, {translated_ok} translated successfully")
for s in test_strategies[:3]:
    print(f"\n  DOI: {s['doi']}")
    print(f"  Database: {s['database']}")
    print(f"  Raw (first 200): {s['raw_strategy'][:200]}")
    if "pubmed_query" in s:
        print(f"  PubMed: {s.get('pubmed_query','')[:200]}")
        print(f"  Notes: {s.get('translation_notes','')}")

Testing v4 translator on first 10 PDFs ...
Found 7 strategies, 0 translated successfully

  DOI: 10.1002/14651858.CD000004
  Database: narrative_only
  Raw (first 200): The Cochrane Pregnancy and Childbirth Group’s Trials Register (October 2008).
Selection criteria
Randomised or quasi-randomised trials comparing abdominal decompression with no decompression in women 

  DOI: 10.1002/14651858.CD000004.pub2
  Database: narrative_only
  Raw (first 200): The Cochrane Pregnancy and Childbirth Group’s Trials Register (2 February 2012).
Selection criteria
Randomised or quasi-randomised trials comparing abdominal decompression with no decompression in wom

  DOI: 10.1002/14651858.CD000005
  Database: narrative_only
  Raw (first 200): We searched the Cochrane Pregnancy and Childbirth Group trials register.
Selection criteria
Randomised and quasi-randomised trials of extending the uterine incision using a stapler compared with exten


In [6]:
# =============================================================================
# Reference Extraction (using PyMuPDF)
# =============================================================================

def extract_references_from_pdf(pdf_path: Path) -> Tuple[List[Dict], str]:
    """Extract categorized references from PDF."""
    doi = pdf_path.stem.replace("-", "/")
    
    try:
        doc = fitz.open(pdf_path)
        total_pages = len(doc)
        
        # Check first page for protocol/withdrawn
        first_text = doc[0].get_text().lower() if total_pages > 0 else ""
        if 'protocol' in first_text[:1500]:
            doc.close()
            return [], 'protocol'
        if 'withdrawn' in first_text[:1500]:
            doc.close()
            return [], 'withdrawn'
        
        # Find reference pages - search from page 2 onwards
        ref_text = ""
        in_refs = False
        
        for i in range(2, total_pages):
            page_text = doc[i].get_text()
            page_lower = page_text.lower()
            
            # Start capturing when we find reference section markers
            if not in_refs:
                if any(marker in page_lower for marker in [
                    'references to studies included',
                    'references to studies excluded',
                    '{published data only}',
                    '{unpublished data only}'
                ]):
                    in_refs = True
            
            if in_refs:
                ref_text += page_text + "\n"
                # Stop if we hit characteristics section
                if 'characteristics of included' in page_lower:
                    break
                if 'characteristics of excluded' in page_lower:
                    break
        
        doc.close()
        
        if not ref_text:
            return [], 'no_refs'
        
        # Parse references
        references = parse_references(ref_text, doi)
        return references, 'review' if references else 'no_refs'
        
    except Exception as e:
        return [], f'error: {str(e)[:30]}'


def parse_references(text: str, review_doi: str) -> List[Dict]:
    """Parse references with structure: AuthorName Year {datatype} followed by citation."""
    references = []
    
    # Define section markers
    sections = [
        ('included', r'references\s*to\s*studies\s*included'),
        ('excluded', r'references\s*to\s*studies\s*excluded'),
        ('awaiting', r'references\s*to\s*studies\s*awaiting'),
        ('ongoing', r'references\s*to\s*ongoing\s*studies'),
    ]
    
    for category, pattern in sections:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            start = match.end()
            
            # Find end of section
            end = len(text)
            end_patterns = [
                r'references\s*to\s*studies\s*(included|excluded|awaiting)',
                r'references\s*to\s*ongoing',
                r'additional\s+references',
                r'characteristics\s*of',
                r'\*\s*indicates',
            ]
            for ep in end_patterns:
                end_match = re.search(ep, text[start:], re.IGNORECASE)
                if end_match and end_match.start() > 10:  # Avoid matching at very start
                    end = min(end, start + end_match.start())
            
            section_text = text[start:end]
            
            # Pattern: "AuthorName Year {published/unpublished data only}"
            # Can have whitespace/newlines between parts
            study_pattern = re.compile(
                r'([A-Z][A-Za-z\'\-]+(?:\s+et\s+al)?)\s+(\d{4}[a-z]?)\s*\{(published|unpublished)',
                re.IGNORECASE
            )
            
            for m in study_pattern.finditer(section_text):
                author_id = m.group(1).strip()
                year = m.group(2)
                
                # Get citation text after "data only}"
                cite_start = m.end()
                data_only_end = section_text.find('}', cite_start)
                if data_only_end > 0:
                    cite_start = data_only_end + 1
                
                # Find end of this citation (next study ID or section end)
                next_study = study_pattern.search(section_text[cite_start:])
                cite_end = cite_start + (next_study.start() if next_study else 2000)
                
                raw_citation = section_text[cite_start:cite_end].strip()
                
                # Clean up citation
                citation = re.sub(r'\n+', ' ', raw_citation)
                citation = re.sub(r'\s+', ' ', citation).strip()
                
                # Parse fields from citation
                # Format: "Authors. Title. Journal Year;Vol:Pages."
                parts = citation.split('. ')
                authors = parts[0].strip() if parts else ""
                title = parts[1].strip() if len(parts) > 1 else ""
                
                # Extract DOI
                doi_match = re.search(r'(10\.\d{4,}/[^\s\]\)]+)', citation)
                ref_doi = doi_match.group(1).rstrip('.,;])') if doi_match else ""
                
                # Extract PMID
                pmid_match = re.search(r'(?:PMID[:\s]*|PubMed[:\s]*|\[PM[:\s]*)(\d{6,9})', citation, re.IGNORECASE)
                pmid = pmid_match.group(1) if pmid_match else ""
                
                references.append({
                    'review_doi': review_doi,
                    'category': category,
                    'study_id': f"{author_id} {year}",
                    'year': year,
                    'authors': authors[:500],
                    'title': title[:500],
                    'ref_doi': ref_doi[:100],
                    'pmid': pmid,
                    'full_citation': citation[:1000],
                })
    
    return references


# Test
print("Testing reference extraction on first 5 PDFs...")
for i, pdf in enumerate(pdf_files[:5]):
    refs, rtype = extract_references_from_pdf(pdf)
    print(f"{pdf.name}: {len(refs)} refs, type: {rtype}")
    if refs:
        r = refs[0]
        print(f"  [{r['category']}] {r['study_id']}")
        print(f"  Authors: {r['authors'][:50]}")
        print(f"  Title: {r['title'][:50]}")

Testing reference extraction on first 5 PDFs...
10.1002-14651858.CD000004.pdf: 4 refs, type: review
  [included] Blecher 1967
  Authors: Blecher JA
  Title: Aspects of the physiology of decompression and its
10.1002-14651858.CD000004.pub2.pdf: 4 refs, type: review
  [included] Blecher 1967
  Authors: Blecher JA
  Title: Aspects of the physiology of decompression and its
10.1002-14651858.CD000005.pdf: 4 refs, type: review
  [included] Dargent 1990
  Authors: Dargent D, Audra G, Noblot G
  Title: Utilization de la pince POLY CS 57 pour l’operatio
10.1002-14651858.CD000005.pub2.pdf: 0 refs, type: no_refs
10.1002-14651858.CD000006.pdf: 11 refs, type: review
  [included] Banninger 1978
  Authors: Banninger U, Buhrig H, Schreiner WE
  Title: A comparison between chromic catgut and polyglycol


In [7]:
# =============================================================================
# MAIN EXTRACTION - All 16,588 PDFs
# =============================================================================
# Estimated time: ~15-20 minutes
# =============================================================================

print(f"Extracting from {len(pdf_files):,} PDFs...")
print("=" * 70)

start_time = time.time()
all_metadata = []
all_references = []
all_search_strategies = []
stats = Counter()
category_counts = Counter()
strategy_stats = Counter()

for i, pdf_path in enumerate(tqdm(pdf_files, desc="Extracting")):
    # Metadata
    meta = extract_metadata(pdf_path)
    all_metadata.append(meta)
    
    # References
    refs, rtype = extract_references_from_pdf(pdf_path)
    all_references.extend(refs)
    stats[rtype] += 1
    
    for ref in refs:
        category_counts[ref['category']] += 1
    
    # Search strategies (NEW)
    strategies = extract_search_strategy(pdf_path)
    for strat in strategies:
        # Translate Ovid to PubMed
        if strat['database'] != 'narrative_only' and strat['database'] != 'error':
            pubmed_query, notes = translate_ovid_to_pubmed(strat['raw_strategy'])
            strat['pubmed_query'] = pubmed_query
            strat['translation_notes'] = notes
        else:
            strat['pubmed_query'] = ''
            strat['translation_notes'] = strat['database']
        all_search_strategies.append(strat)
        strategy_stats[strat['database'].split()[0] if strat['database'] else 'unknown'] += 1
    
    # Progress every 2000
    if (i + 1) % 2000 == 0:
        elapsed = time.time() - start_time
        rate = (i + 1) / elapsed
        remaining = (len(pdf_files) - i - 1) / rate
        print(f"  {i+1:,} done | {rate:.1f}/sec | {remaining/60:.1f}min left | refs: {len(all_references):,} | strategies: {len(all_search_strategies):,}")

elapsed = time.time() - start_time
print(f"\n" + "=" * 70)
print(f"EXTRACTION COMPLETE")
print(f"=" * 70)
print(f"Time: {elapsed/60:.1f} minutes ({elapsed/len(pdf_files)*1000:.0f}ms per PDF)")
print(f"\nDocument types: {dict(stats)}")
print(f"Reference categories: {dict(category_counts)}")
print(f"Total references: {len(all_references):,}")
print(f"\nSearch strategy databases: {dict(strategy_stats)}")
print(f"Total search strategies: {len(all_search_strategies):,}")

Extracting from 16,618 PDFs...


Extracting:   0%|          | 0/16618 [00:00<?, ?it/s]

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

  2,000 done | 5.1/sec | 47.6min left | refs: 72,677 | strategies: 1,930
MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: syntax error: invalid key in dict

MuPDF error: sy

In [8]:
# SAVE RESULTS
meta_df = pd.DataFrame(all_metadata)
refs_df = pd.DataFrame(all_references)
strategies_df = pd.DataFrame(all_search_strategies)

# ── Add version tracking columns ───────────────────────────────────────────────
# Cochrane DOIs follow pattern: 10.1002/14651858.CD000004.pub2
# Extract CD number and publication version for deduplication downstream
import re as _re
_version_parts = meta_df['doi'].str.extract(r'(CD\d+)(?:\.pub(\d+))?', flags=_re.I)
meta_df['cd_number'] = _version_parts[0].str.upper()
meta_df['version'] = _version_parts[1].fillna(1).astype(int)

# Flag latest version of each review
_has_cd = meta_df[meta_df['cd_number'].notna()]
_latest_idx = _has_cd.groupby('cd_number')['version'].idxmax()
meta_df['is_latest_version'] = False
meta_df.loc[_latest_idx, 'is_latest_version'] = True
# Reviews without CD numbers are treated as latest by default
meta_df.loc[meta_df['cd_number'].isna(), 'is_latest_version'] = True

meta_df.to_csv(METADATA_CSV, index=False)
refs_df.to_csv(REFERENCES_CSV, index=False)
strategies_df.to_csv(SEARCH_STRATEGIES_CSV, index=False)

print(f"Saved {len(meta_df):,} metadata → {METADATA_CSV.name}")
print(f"Saved {len(refs_df):,} references → {REFERENCES_CSV.name}")
print(f"Saved {len(strategies_df):,} search strategies → {SEARCH_STRATEGIES_CSV.name}")

# Version tracking summary
_latest_count = meta_df['is_latest_version'].sum()
_superseded = len(meta_df) - _latest_count
print(f"\nVersion tracking:")
print(f"  Reviews with CD number: {_has_cd.shape[0]:,}")
print(f"  Unique CD numbers:      {_has_cd['cd_number'].nunique():,}")
print(f"  Latest versions:        {_latest_count:,}")
print(f"  Superseded (old):       {_superseded:,}")
print(f"  Version distribution:   {dict(meta_df['version'].value_counts().sort_index())}")

# Summary
print(f"\nReferences by category:")
if len(refs_df) > 0:
    print(refs_df['category'].value_counts())

print(f"\nSearch strategies by database:")
if len(strategies_df) > 0:
    print(strategies_df['database'].value_counts().head(10))

# Translation success rate
if len(strategies_df) > 0:
    has_pubmed = (strategies_df['pubmed_query'].str.len() > 0).sum()
    print(f"\nTranslation success: {has_pubmed:,}/{len(strategies_df):,} ({has_pubmed/len(strategies_df)*100:.1f}%)")
    print(f"\nTranslation notes breakdown:")
    print(strategies_df['translation_notes'].value_counts().head(10))

Saved 16,618 metadata → review_metadata.csv
Saved 630,032 references → categorized_references.csv
Saved 16,929 search strategies → search_strategies.csv

Version tracking:
  Reviews with CD number: 16,405
  Unique CD numbers:      9,755
  Latest versions:        9,968
  Superseded (old):       6,650
  Version distribution:   {1: np.int64(2828), 2: np.int64(8435), 3: np.int64(3424), 4: np.int64(1323), 5: np.int64(435), 6: np.int64(128), 7: np.int64(33), 8: np.int64(8), 9: np.int64(3), 10: np.int64(1)}

References by category:
category
excluded    387757
included    211000
awaiting     22657
ongoing       8618
Name: count, dtype: int64

Search strategies by database:
database
narrative_only                         8801
MEDLINE search strategy                1436
MEDLINE                                 566
MEDLINE (inferred from Ovid syntax)     387
MEDLINE (Ovid) search strategy          370
MEDLINE Ovid search strategy            303
MEDLINE (OvidSP) search strategy        274
MEDLINE (

In [9]:
# =============================================================================
# DIAGNOSE REMAINING FAILURES
# =============================================================================
# Let's understand what's still failing and whether improvement is possible
# =============================================================================

# 1. What do the 123 no_numbered_lines failures look like?
no_num = [s for s in all_search_strategies if s['translation_notes'] == 'no_numbered_lines']
print(f"=== no_numbered_lines failures: {len(no_num)} ===\n")
for s in no_num[:5]:
    print(f"DOI: {s['doi']}")
    print(f"Database: {s['database']}")
    print(f"Raw (first 300 chars):\n{s['raw_strategy'][:300]}")
    print("-" * 60)

# 2. What are the narrative_only strategies? Are any actually structured?
narrative = [s for s in all_search_strategies if s['translation_notes'] == 'narrative_only']
print(f"\n=== narrative_only: {len(narrative)} ===")
# Check if any contain Ovid-like terms
has_ovid_terms = 0
has_mesh = 0
has_pubmed_terms = 0
for s in narrative:
    raw = s['raw_strategy'].lower()
    if '.tw.' in raw or '.ti.' in raw or '.ab.' in raw or '.mp.' in raw:
        has_ovid_terms += 1
    if '[mesh]' in raw or 'exp ' in raw:
        has_mesh += 1
    if '[tiab]' in raw or '[ti]' in raw:
        has_pubmed_terms += 1

print(f"  With Ovid terms (.tw., .ti., etc.): {has_ovid_terms}")
print(f"  With MeSH terms: {has_mesh}")
print(f"  With PubMed terms: {has_pubmed_terms}")

# Show a few that have structured content
print("\n--- Narrative entries that actually contain structured search terms ---")
shown = 0
for s in narrative:
    raw = s['raw_strategy'].lower()
    if '.tw.' in raw or '.ti.' in raw or 'exp ' in raw or '[mesh]' in raw:
        print(f"\nDOI: {s['doi']}")
        print(f"Raw (first 400 chars):\n{s['raw_strategy'][:400]}")
        print("-" * 60)
        shown += 1
        if shown >= 3:
            break

# 3. How many reviews have NO strategy at all?
all_dois = set(f.stem.replace("-", "/") for f in pdf_files)
strategy_dois = set(s['doi'] for s in all_search_strategies)
no_strategy = all_dois - strategy_dois
print(f"\n=== Reviews with zero search strategies: {len(no_strategy)} ===")

# 4. Translation quality check - are any translations empty despite "success"?
empty_success = [s for s in all_search_strategies 
                 if 'success' in s.get('translation_notes', '') 
                 and len(s.get('pubmed_query', '')) < 10]
print(f"\n=== 'success' but empty/very short query: {len(empty_success)} ===")

# 5. Query length distribution for successful translations
successful = [s for s in all_search_strategies if len(s.get('pubmed_query', '')) > 0]
lengths = [len(s['pubmed_query']) for s in successful]
print(f"\n=== Query length stats (n={len(successful)}) ===")
print(f"  Min: {min(lengths)}")
print(f"  Median: {sorted(lengths)[len(lengths)//2]}")
print(f"  Mean: {sum(lengths)/len(lengths):.0f}")
print(f"  Max: {max(lengths)}")
print(f"  >4000 chars: {sum(1 for l in lengths if l > 4000)}")
print(f"  <50 chars: {sum(1 for l in lengths if l < 50)}")

=== no_numbered_lines failures: 548 ===

DOI: 10.1002/14651858.CD000009.pub2
Database: MEDLINE, EMBASE, BIOSIS Previews, PsycINFO, Science and Social Sciences Citation Index, AMED and CISCOM. Date of last search January
Raw (first 300 chars):
2005.
Selection criteria
------------------------------------------------------------
DOI: 10.1002/14651858.CD000029.pub3
Database: MEDLINE (Ovid) (June 1998 to May
Raw (first 300 chars):
2013) (Appendix 2), and EMBASE (Ovid) (June 1998 to May
2013) (Appendix 3).
------------------------------------------------------------
DOI: 10.1002/14651858.CD000031.pub3
Database: MEDLINE and EMBASE (via Ovid, 9th June
Raw (first 300 chars):
2009) using the medication name and 'smoking' as a free text
------------------------------------------------------------
DOI: 10.1002/14651858.CD000039.pub2
Database: MEDLINE (1966 to April
Raw (first 300 chars):
2008) (Appendix 1), EMBASE (1980 to April 2008) (Appendix
----------------------------------------------------