# Data Processing Pipeline Documentation

## Project Overview

This notebook provides a comprehensive summary of the data processing pipeline used in the **LSE-UKHSA Systematic Review Screening Project**. The goal of this project is to evaluate how well open-source Large Language Models (LLMs) can screen paper abstracts for inclusion in systematic reviews.

---

## Table of Contents

1. [Pipeline Overview](#1-pipeline-overview)
2. [Data Sources](#2-data-sources)
3. [Step 1: Obtaining Cochrane Reviews](#3-step-1-obtaining-cochrane-reviews)
4. [Step 2: Exploratory Data Analysis](#4-step-2-exploratory-data-analysis)
5. [Step 3: Fetching Referenced Paper Abstracts](#5-step-3-fetching-referenced-paper-abstracts)
6. [Step 4: Building the Ground Truth Validation Set](#6-step-4-building-the-ground-truth-validation-set)
7. [Step 5: LLM Evaluation](#7-step-5-llm-evaluation)
8. [Data Files Summary](#8-data-files-summary)
9. [Records Excluded Due to Missing Information](#9-records-excluded-due-to-missing-information)
10. [Data Samples](#10-data-samples)

---

## 1. Pipeline Overview

The data processing pipeline consists of **5 sequential steps**, each implemented in a separate Jupyter notebook:

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                           DATA PROCESSING PIPELINE                               │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐              │
│  │   PubMed API    │───>│ 00_obtain...    │───>│ cochrane_pubmed │              │
│  │   (NCBI Entrez) │    │  .ipynb         │    │ _abstracts.csv  │              │
│  └─────────────────┘    └─────────────────┘    │ (17,092 reviews)│              │
│                                   │            │                  │              │
│                                   │            │ cochrane_pubmed │              │
│                                   └───────────>│ _references.csv │              │
│                                                │ (1.18M edges)   │              │
│                                                └────────┬────────┘              │
│                                                         │                        │
│                                                         v                        │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐              │
│  │   PubMed API    │───>│ 02_fetch...     │───>│ referenced_paper│              │
│  │   (NCBI Entrez) │    │  .ipynb         │    │ _abstracts.csv  │              │
│  └─────────────────┘    └─────────────────┘    │ (491,529 papers)│              │
│                                                └────────┬────────┘              │
│                                                         │                        │
│                                                         v                        │
│                         ┌─────────────────┐    ┌─────────────────┐              │
│                         │ 03_build...     │───>│ ground_truth_   │              │
│                         │  .ipynb         │    │ validation_set  │              │
│                         └─────────────────┘    │ .csv (1,000)    │              │
│                                                └────────┬────────┘              │
│                                                         │                        │
│                                                         v                        │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐              │
│  │   Ollama        │───>│ 04_llm...       │───>│ results/        │              │
│  │   (Local LLMs)  │    │  .ipynb         │    │ eval_*.csv      │              │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘              │
│                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────┘
```

---

## 2. Data Sources

### Primary Data Source: PubMed (via NCBI Entrez API)

All data was obtained from **PubMed**, the free biomedical literature database maintained by the National Library of Medicine (NLM) at the National Institutes of Health (NIH).

| Aspect | Details |
|--------|--------|
| **API** | NCBI Entrez E-utilities |
| **Database** | PubMed |
| **Python Library** | BioPython (`Bio.Entrez`) |
| **Authentication** | Email required; API key optional (increases rate limit) |
| **Rate Limits** | 3 requests/sec without API key, 10 requests/sec with key |
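
The rate limits above translate into a minimum spacing between requests. The following is a minimal sketch of a client-side throttle; `min_interval` and `Throttle` are hypothetical helpers written for illustration and are not taken from the project notebooks or from BioPython.

```python
import time

def min_interval(has_api_key: bool) -> float:
    """Seconds between requests: 1/10 s with an API key, 1/3 s without."""
    return 1.0 / (10 if has_api_key else 3)

class Throttle:
    """Hypothetical helper that spaces calls at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first

    def wait(self):
        """Block until at least `interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `Throttle(min_interval(has_api_key=False)).wait()` before each request would keep a single-threaded client at or below 3 requests per second.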

### Why Cochrane Reviews?

**Cochrane systematic reviews** are considered the gold standard in evidence-based medicine. Each Cochrane review:

1. **Clearly defines inclusion criteria** for which studies to include
2. **Cites all "included" studies** in its reference list
3. **Follows rigorous screening protocols** that we can use as ground truth

This makes them ideal for evaluating automated screening tools—we know exactly which papers were included after human screening.

---

## 3. Step 1: Obtaining Cochrane Reviews

**Notebook:** `00_obtain_cochrane_abstracts.ipynb`

### Process

1. **Search Query:** All articles from the Cochrane Database of Systematic Reviews with abstracts
   ```
   ("Cochrane Database Syst Rev"[Journal]) AND hasabstract[text]
   ```

2. **Fetching Strategy:** 
   - Split query by year to keep each result set under 9,500 records, below the 10,000-record cap that ESearch applies to PubMed queries
   - Fetch PMIDs in batches of 1,000
   - Fetch MEDLINE records (abstracts) in batches of 50
   - Fetch XML records (references) in batches of 50

3. **Outputs:**
   - `cochrane_pubmed_abstracts.csv`: Review metadata and abstracts
   - `cochrane_pubmed_references.csv`: Links between reviews and cited papers
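
The fetching strategy above can be sketched as two small helpers: one that splits the search into per-year queries and one that slices an ID list into batches. This is an illustration under stated assumptions, not the notebook's code: the function names are hypothetical, the year range is a placeholder, and the `[dp]` (date of publication) field tag is assumed to be how the year restriction is expressed.

```python
from typing import Iterator, List, Sequence

BASE_QUERY = '("Cochrane Database Syst Rev"[Journal]) AND hasabstract[text]'

def yearly_queries(base_query: str, first_year: int, last_year: int) -> List[str]:
    """One query per publication year, so each result set stays small."""
    # Assumption: the [dp] field tag restricts results to a publication year.
    return [f"({base_query}) AND {year}[dp]"
            for year in range(first_year, last_year + 1)]

def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder year range, for illustration only.
queries = yearly_queries(BASE_QUERY, 2024, 2026)
```

With this shape, PMIDs returned by each yearly query would be passed through `batches(pmids, 50)` before requesting MEDLINE or XML records.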

### Records Pulled

| Dataset | Record Count |
|---------|-------------|
| Cochrane reviews with abstracts | **17,092** |
| Reference edges (review → cited paper) | **1,182,678** |

---

## 4. Step 2: Exploratory Data Analysis

**Notebook:** `01_eda_cochrane_data.ipynb`

### Key Findings

#### Cochrane Reviews Dataset

| Statistic | Value |
|-----------|-------|
| Total reviews | 17,092 |
| Reviews with references | 10,105 (59%) |
| Reviews without references | 6,987 (41%) |
| Missing author info | 33 records |

#### Abstract Length Statistics

| Metric | Words |
|--------|-------|
| Mean | 545 |
| Median | 478 |
| Min | 18 |
| Max | 4,774 |
| Std Dev | 234 |

#### References per Review

| Metric | Count |
|--------|-------|
| Mean | 117 references |
| Median | 84 references |
| Min | 2 references |
| Max | 1,890 references |

#### Reference Identifiers

| Identifier Type | Count | Percentage |
|----------------|-------|------------|
| References with PMID | 848,607 | 71.8% |
| References with DOI | 141,573 | 12.0% |
| References with either | 865,992 | 73.2% |
| **Unique papers with PMIDs** | **491,531** | — |

---

## 5. Step 3: Fetching Referenced Paper Abstracts

**Notebook:** `02_fetch_referenced_abstracts.ipynb`

### Process

1. **Extract unique PMIDs** from the reference edges (491,531 unique papers)
2. **Fetch MEDLINE records** in batches of 200 with exponential backoff for errors
3. **Save incrementally** to allow resuming if interrupted
4. **Total time:** Approximately 2.5-3 hours due to API rate limits
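
A minimal sketch of the retry and resume logic described above. The names `fetch_with_backoff` and `pending_pmids` are hypothetical, and the retry count and base delay are assumed values rather than the notebook's settings; `fetch` stands in for whatever function actually calls the API.

```python
import time
from typing import Callable, Iterable, List, Set

def fetch_with_backoff(fetch: Callable[[List[str]], list], batch: List[str],
                       max_retries: int = 5, base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> list:
    """Call `fetch(batch)`, doubling the wait after each failed attempt."""
    for attempt in range(max_retries):
        try:
            return fetch(batch)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1 s, 2 s, 4 s, ...

def pending_pmids(all_pmids: Iterable[str], already_saved: Set[str]) -> List[str]:
    """PMIDs still to fetch after an interrupted run, in original order."""
    return [p for p in all_pmids if p not in already_saved]
```

On restart, the set of PMIDs already present in the partially written CSV would be passed as `already_saved`, so only the remainder is requested again.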

### Records Pulled

| Metric | Count |
|--------|-------|
| Unique PMIDs to fetch | 491,531 |
| Total records fetched | 491,529 |
| Records with abstracts | **443,977** (90.3%) |
| Records without abstracts | 47,552 (9.7%) |

### Why Some Records Lack Abstracts

- Older papers (pre-1975) often don't have abstracts indexed in PubMed
- Some publication types (letters, editorials, book chapters) may not have abstracts
- 2 PMIDs could not be retrieved (likely withdrawn or merged records)

---

## 6. Step 4: Building the Ground Truth Validation Set

**Notebook:** `03_build_ground_truth.ipynb`

### Methodology

The ground truth dataset contains labeled examples for evaluating LLM screening performance:

- **Positive examples (INCLUDED):** Papers that appear in a Cochrane review's reference list — these passed human screening
- **Negative examples (EXCLUDED):** Papers from *related* reviews that were NOT included in the current review — realistic "near-miss" candidates

### Sampling Process

```
1. Start with 17,092 Cochrane reviews
       ↓ Filter: has valid references with PMIDs
2. 10,077 reviews with valid reference edges
       ↓ Filter: referenced papers have abstracts available
3. 10,013 reviews with ≥5 included papers with abstracts
       ↓ Filter: inclusion criteria can be parsed from abstract
4. 9,660 reviews with extractable criteria (96.5%)
       ↓ Random sample
5. 100 reviews selected for validation set
       ↓ Sample 5 included + 5 excluded papers per review
6. Final: 1,000 validation records (500 included + 500 excluded)
```

### Filtering Statistics

| Stage | Records | Notes |
|-------|---------|-------|
| Total Cochrane reviews | 17,092 | — |
| Reviews with references | 10,105 | 6,987 excluded (no reference list) |
| Valid reference edges (with PMID) | 848,607 | 334,071 excluded (no PMID) |
| Unique papers in reference graph | 491,531 | — |
| Papers with abstracts available | 443,977 | 47,554 excluded (47,552 retrieved without an abstract, 2 not retrieved) |
| Reviews with ≥5 included papers | 10,013 | 64 excluded (too few papers) |
| Reviews with parseable criteria | 9,660 | 353 excluded (criteria not extractable) |
| **Final validation set** | **1,000** | 500 included + 500 excluded |

### How Negative Examples Were Chosen

Negative examples are not random papers—they are **"near-miss" papers** that:

1. Were included in a *related* Cochrane review (same topic area)
2. Were NOT included in the current review
3. Have abstracts available

This creates a realistic screening challenge where the LLM must distinguish between truly relevant papers and closely related but ultimately excluded papers.
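
The selection rules above can be sketched as set operations on the reference graph. This document does not define how a "related" review is identified, so the sketch assumes, purely for illustration, that two reviews are related when they share at least one cited PMID. All function names are hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple

def near_miss_candidates(review: str, refs: Dict[str, Set[str]]) -> Set[str]:
    """Papers cited by related reviews but not by `review`.

    Assumption: reviews are 'related' if they share at least one cited PMID.
    """
    own = refs[review]
    candidates: Set[str] = set()
    for other, cited in refs.items():
        if other != review and own & cited:
            candidates |= cited - own
    return candidates

def sample_pairs(review: str, refs: Dict[str, Set[str]], with_abstract: Set[str],
                 k: int = 5, seed: int = 0) -> List[Tuple[str, str, int]]:
    """Return up to k included (label 1) and k excluded (label 0) papers."""
    rng = random.Random(seed)
    included = sorted(refs[review] & with_abstract)
    excluded = sorted(near_miss_candidates(review, refs) & with_abstract)
    pos = rng.sample(included, min(k, len(included)))
    neg = rng.sample(excluded, min(k, len(excluded)))
    return [(review, p, 1) for p in pos] + [(review, p, 0) for p in neg]
```

Sorting before sampling and fixing the seed keeps the draw repeatable, which matters when the same validation set must be rebuilt later.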

---

## 7. Step 5: LLM Evaluation

**Notebook:** `04_llm_evaluation.ipynb`

### Models Tested

| Model | Description |
|-------|------------|
| **Llama 3.2** | Meta's open-source LLM |
| **Mistral** | Mistral AI's open-source model |

### Prompting Strategies

| Strategy | Description |
|----------|------------|
| **Zero-shot** | Direct instruction to classify as INCLUDE/EXCLUDE |
| **Chain-of-Thought (CoT)** | Step-by-step reasoning before decision |
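
To make the two strategies concrete, here is a hedged sketch of how the prompts and the response parsing could look. The prompt wording, the `DECISION:` convention, and the function names are assumptions for illustration; the actual prompts live in `04_llm_evaluation.ipynb` and are not reproduced in this document. The strategy keys mirror the `prompt_type` values (`zero_shot`, `cot`) in `model_comparison.csv`.

```python
import re

# Hypothetical prompt wording, not the project's actual prompts.
ZERO_SHOT = (
    "You are screening papers for a systematic review.\n"
    "Selection criteria: {criteria}\n"
    "Abstract: {abstract}\n"
    "Answer with one word: INCLUDE or EXCLUDE."
)
COT = (
    "You are screening papers for a systematic review.\n"
    "Selection criteria: {criteria}\n"
    "Abstract: {abstract}\n"
    "Think step by step, then finish with a line 'DECISION: INCLUDE' "
    "or 'DECISION: EXCLUDE'."
)

def build_prompt(criteria: str, abstract: str, strategy: str) -> str:
    """Fill the template for 'zero_shot' or 'cot'."""
    template = {"zero_shot": ZERO_SHOT, "cot": COT}[strategy]
    return template.format(criteria=criteria, abstract=abstract)

def parse_decision(response: str) -> str:
    """Map a raw model response to 'include', 'exclude', or 'unclear'."""
    found = re.findall(r"\b(INCLUDE|EXCLUDE)\b", response.upper())
    if not found:
        return "unclear"
    return found[-1].lower()  # the last mention is taken as the final decision
```

A keyword parser like this is deliberately simple; free-form reasoning that never states a decision falls through to `unclear`, which is the category counted in the Unclear column of the results.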

### Evaluation Runs

The LLM evaluation was run **twice** to verify reproducibility. There are **8 evaluation files** (2 models × 2 prompts × 2 runs), plus one duplicate file that was created during Run 1.

#### Run 1 (evening of January 15 to shortly after midnight on January 16, 2026)

| File | Model | Prompt | Timestamp | Records |
|------|-------|--------|-----------|---------|
| `eval_llama3.2_zero_shot_20260115_193605.csv` | Llama 3.2 | Zero-shot | 19:36:05 | 1,000 |
| `eval_llama3.2_zero_shot_20260115_201927.csv` | Llama 3.2 | Zero-shot | 20:19:27 | 1,000 ⚠️ *duplicate* |
| `eval_llama3.2_cot_20260115_215209.csv` | Llama 3.2 | CoT | 21:52:09 | 1,000 |
| `eval_mistral_zero_shot_20260115_231802.csv` | Mistral | Zero-shot | 23:18:02 | 1,000 |
| `eval_mistral_cot_20260116_003208.csv` | Mistral | CoT | 00:32:08 | 1,000 |

> ⚠️ **Note:** `eval_llama3.2_zero_shot_20260115_201927.csv` is a duplicate of the 19:36:05 file (identical content). This brings the total to 9 files instead of 8.
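
One way to confirm that two result files are identical is to compare content hashes. The sketch below uses only the standard library; `file_digest` and `duplicate_groups` are hypothetical helper names, and this is not necessarily how the duplicate was originally detected.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List

def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def duplicate_groups(paths: Iterable[Path]) -> List[List[Path]]:
    """Group files whose contents are byte-for-byte identical."""
    by_digest: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(paths):
        by_digest[file_digest(path)].append(path)
    return [group for group in by_digest.values() if len(group) > 1]
```

Running `duplicate_groups(Path("results").glob("eval_*.csv"))` from the data directory would list any sets of identical evaluation files.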

#### Run 2 (January 16, 2026 early morning)

| File | Model | Prompt | Timestamp | Records |
|------|-------|--------|-----------|---------|
| `eval_llama3.2_zero_shot_20260116_025453.csv` | Llama 3.2 | Zero-shot | 02:54:53 | 1,000 |
| `eval_llama3.2_cot_20260116_041136.csv` | Llama 3.2 | CoT | 04:11:36 | 1,000 |
| `eval_mistral_zero_shot_20260116_050656.csv` | Mistral | Zero-shot | 05:06:56 | 1,000 |
| `eval_mistral_cot_20260116_073058.csv` | Mistral | CoT | 07:30:58 | 1,000 |

### Results Summary (Run 2 - Final Results)

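`model_comparison.csv` contains the final aggregated results from Run 2. Accuracy, precision, and recall are shown as percentages; F1 and Cohen's κ are on a 0 to 1 scale. Metrics are computed over valid responses only (`n_valid` in the file), so the CoT rows exclude the responses counted in the Unclear column: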

| Model | Prompt | Accuracy | Precision | Recall | F1 Score | Cohen's κ | Unclear |
|-------|--------|----------|-----------|--------|----------|----------|---------|
| **Mistral** | **CoT** | **83.6%** | **83.6%** | **83.7%** | **0.837** | **0.672** | 5 |
| Mistral | Zero-shot | 84.5% | 90.8% | 76.8% | 0.832 | 0.690 | 0 |
| Llama 3.2 | Zero-shot | 80.9% | 76.4% | 89.4% | 0.824 | 0.618 | 0 |
| Llama 3.2 | CoT | 73.9% | 90.1% | 52.9% | 0.667 | 0.476 | 25 |

### Key Findings

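1. **Best F1 Score:** Mistral with CoT prompting (0.837), with balanced precision and recall (83.6% and 83.7%)
2. **Highest Precision:** Mistral zero-shot (90.8%), meaning fewer false positives; it also has the highest accuracy (84.5%) and Cohen's κ (0.690)
3. **Highest Recall:** Llama 3.2 zero-shot (89.4%), meaning fewer false negatives
4. **Unclear Responses:** CoT prompting produced responses that could not be parsed into a decision (5 for Mistral, 25 for Llama 3.2); zero-shot produced none
5. **Reproducibility:** Two runs were performed to check consistency. Metrics differed between them: F1 rose by 0.024 to 0.261 from Run 1 to Run 2, depending on the configuration (see the Run 1 vs Run 2 comparison in Section 10)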
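
For reference, all five reported metrics can be derived from the four cells of a confusion matrix. The sketch below is a plain-Python illustration of the standard formulas; the function name is hypothetical, and the counts are illustrative values chosen to be consistent with the Mistral zero-shot row, not figures read from the evaluation files.

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, F1 and Cohen's kappa for binary labels."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    # Chance agreement: product of marginal rates, summed over both classes.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa}

# Illustrative counts (assumed, not taken from the result files).
m = screening_metrics(tp=384, fp=39, fn=116, tn=461)
print({name: round(value, 3) for name, value in m.items()})
```

On a balanced set such as this one, chance agreement is 0.5, so κ reduces to `2 * accuracy - 1`.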

---

## 8. Data Files Summary

### Primary Data Files

| File | Records | Description |
|------|---------|-------------|
| `cochrane_pubmed_abstracts.csv` | 17,092 | Cochrane review abstracts and metadata |
| `cochrane_pubmed_references.csv` | 1,182,678 | Links between reviews and cited papers |
| `referenced_paper_abstracts.csv` | 491,529 | Abstracts of papers cited in Cochrane reviews |
| `ground_truth_validation_set.csv` | 1,000 | Labeled dataset for LLM evaluation |

### Schema: `cochrane_pubmed_abstracts.csv`

| Column | Type | Description |
|--------|------|-------------|
| `pmid` | string | PubMed identifier |
| `title` | string | Review title |
| `abstract` | string | Full abstract text |
| `journal` | string | Journal name |
| `year` | string | Publication year |
| `authors` | string | Semicolon-separated author list |

### Schema: `cochrane_pubmed_references.csv`

| Column | Type | Description |
|--------|------|-------------|
| `citing_pmid` | string | PMID of the Cochrane review |
| `ref_pmid` | string | PMID of the cited paper (if available) |
| `ref_doi` | string | DOI of the cited paper (if available) |
| `ref_title` | string | Citation text of the reference |

### Schema: `referenced_paper_abstracts.csv`

| Column | Type | Description |
|--------|------|-------------|
| `pmid` | string | PubMed identifier |
| `title` | string | Paper title |
| `abstract` | string | Abstract text (may be empty) |
| `journal` | string | Journal name |
| `year` | string | Publication year |
| `authors` | string | Semicolon-separated author list |

### Schema: `ground_truth_validation_set.csv`

| Column | Type | Description |
|--------|------|-------------|
| `review_pmid` | string | PMID of the Cochrane review |
| `review_title` | string | Title of the review |
| `review_objectives` | string | Extracted objectives section |
| `review_criteria` | string | Extracted selection criteria |
| `paper_pmid` | string | PMID of the candidate paper |
| `paper_title` | string | Title of the candidate paper |
| `paper_abstract` | string | Abstract of the candidate paper |
| `label` | int | 1 = Included, 0 = Excluded |

---

## 9. Records Excluded Due to Missing Information

Throughout the pipeline, records were excluded at various stages due to missing or insufficient information:

### Stage 1: Cochrane Reviews Without References

**6,987 reviews (40.9%)** had no reference list in PubMed. This happens when:
- The review is a protocol (planned study, not yet completed)
- The review is withdrawn
- Reference data wasn't indexed in PubMed

### Stage 2: References Without PMIDs

**334,071 reference edges (28.2%)** lacked a PMID identifier. This happens when:
- The cited paper is not indexed in PubMed (books, grey literature, non-English papers)
- The citation is incomplete or malformed
- The paper was published before PubMed indexing began

### Stage 3: Papers Without Abstracts

**47,554 papers (9.7%)** had no usable abstract: 47,552 were retrieved without abstract text and 2 could not be retrieved at all. Missing abstracts are typical for:
- Older papers (especially pre-1975)
- Short publication types (letters, editorials, corrections)
- Papers whose original publication had no abstract

### Stage 4: Reviews Without Parseable Criteria

**353 reviews (3.5%)** could not have inclusion criteria extracted. This happens when:
- Abstract uses non-standard structure
- Criteria section is labeled differently
- Abstract is malformed or incomplete
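
A rough sketch of how a labelled section could be pulled out of a structured abstract, and why the cases above fail. It assumes section headings are upper-case labels followed by a colon, as in the `SELECTION CRITERIA: ...` sample in Section 10; `extract_section` is a hypothetical name and the notebook's actual parser may differ.

```python
import re
from typing import Optional

# Assumption: headings look like "OBJECTIVES:" or "AUTHORS' CONCLUSIONS:".
HEADING = r"[A-Z][A-Z' ]+:"

def extract_section(abstract: str, label: str) -> Optional[str]:
    """Return the text of one labelled section, or None if it is absent."""
    pattern = re.compile(
        rf"{re.escape(label)}:\s*(.*?)(?=\s+{HEADING}|\Z)", re.DOTALL
    )
    match = pattern.search(abstract)
    return match.group(1).strip() if match else None
```

An abstract with no `SELECTION CRITERIA:` heading, such as a protocol that only states its objectives, returns `None` and would be counted among the non-parseable reviews.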

### Summary of Exclusions

| Exclusion Reason | Records Excluded | Percentage |
|-----------------|-----------------|------------|
| Reviews without references | 6,987 | 40.9% of reviews |
| References without PMID | 334,071 | 28.2% of ref edges |
| Papers without abstract | 47,554 | 9.7% of papers |
| Non-parseable criteria | 353 | 3.5% of eligible reviews |
| Authors missing | 33 | 0.2% of reviews |
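
The percentages in the summary table follow directly from the counts reported earlier. A short arithmetic check, using only figures stated in this document (the dictionary keys and the `pct` helper are illustrative names):

```python
counts = {
    "reviews": 17_092,
    "reviews_without_refs": 6_987,
    "ref_edges": 1_182_678,
    "edges_with_pmid": 848_607,
    "unique_papers": 491_531,
    "papers_with_abstract": 443_977,
    "eligible_reviews": 10_013,
    "non_parseable": 353,
}

def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

edges_without_pmid = counts["ref_edges"] - counts["edges_with_pmid"]
papers_without_abstract = counts["unique_papers"] - counts["papers_with_abstract"]

print(pct(counts["reviews_without_refs"], counts["reviews"]))
print(edges_without_pmid, pct(edges_without_pmid, counts["ref_edges"]))
print(papers_without_abstract, pct(papers_without_abstract, counts["unique_papers"]))
print(pct(counts["non_parseable"], counts["eligible_reviews"]))
```

Each percentage uses the denominator named in the table: all reviews, all reference edges, all unique papers, and the 10,013 eligible reviews respectively.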

---

## 10. Data Samples

Below are examples of what the data looks like at each stage of the pipeline.

In [10]:
# Load libraries
import pandas as pd
from pathlib import Path
from IPython.display import display

# Use ./Data if it exists; otherwise fall back to ../Data
DATA_DIR = Path.cwd() / "Data"
if not DATA_DIR.exists():
    DATA_DIR = Path.cwd().parent / "Data"
print(f"Data directory: {DATA_DIR}")

Data directory: c:\Users\juanx\Documents\LSE-UKHSA Project\Data


### Sample: Cochrane Review Abstract

Below is an example of a Cochrane review record showing the structured abstract format typical of systematic reviews:

In [11]:
# Load and display a sample Cochrane review
cochrane = pd.read_csv(DATA_DIR / "cochrane_pubmed_abstracts.csv", dtype={"pmid": str}, nrows=5)

sample_review = cochrane.iloc[0]
print("=" * 80)
print("SAMPLE COCHRANE REVIEW")
print("=" * 80)
print(f"\nPMID: {sample_review['pmid']}")
print(f"Year: {sample_review['year']}")
print(f"Journal: {sample_review['journal']}")
print(f"\nTitle:\n{sample_review['title']}")
print(f"\nAuthors:\n{sample_review['authors'][:200]}...")
print(f"\nAbstract (first 1000 chars):\n{sample_review['abstract'][:1000]}...")

SAMPLE COCHRANE REVIEW

PMID: 41527994
Year: 2026
Journal: The Cochrane database of systematic reviews

Title:
Surgical interventions for treating vesicovaginal fistula in women.

Authors:
Okada Y; Matsushita T; Hasegawa T; Noma H; Ota E; Achila B; Yoshimura Y...

Abstract (first 1000 chars):
This is a protocol for a Cochrane Review (intervention). The objectives are as follows: To assess the benefits and harms of surgical interventions for treating vesicovaginal fistula in women....


### Sample: Reference Edges

Each row in the references file connects a Cochrane review to one of its cited papers:

In [12]:
# Load and display sample reference edges
refs = pd.read_csv(DATA_DIR / "cochrane_pubmed_references.csv", dtype={"citing_pmid": str, "ref_pmid": str}, nrows=10)

print("SAMPLE REFERENCE EDGES (Review → Cited Paper)")
print("=" * 80)
display(refs)

SAMPLE REFERENCE EDGES (Review → Cited Paper)


| | citing_pmid | ref_pmid | ref_doi | ref_title |
|---|-------------|----------|---------|-----------|
| 0 | 41527994 | | | Hillary CJ, Osman NI, Hilton P, Chapple CR. Th... |
| 1 | 41527994 | | | Hilton P, Ward A. Epidemiological and surgical... |
| 2 | 41527994 | | | Ahmed S, Genadry R, Asiamah B, Liang M, Tripat... |
| 3 | 41527994 | | | World Health Organization (WHO). International... |
| 4 | 41527994 | | | Adler AJ, Ronsmans C, Calvert C, Filippi V. Es... |
| 5 | 41527994 | | | Ahmed S, Curtis SL, Jamil K, Nahar Q, Rahman M... |
| 6 | 41527994 | | | Arrowsmith S, Hamlin EC, Wall LL. Obstructed l... |
| 7 | 41527994 | | | Cichowitz C, Watt MH, Mchome B, Masenga GG. De... |
| 8 | 41527994 | | | Cromwell D, Hilton P. Retrospective cohort stu... |
| 9 | 41527994 | | | Goh J, Romanzi L, Elneil S, Haylen B, Chen G, ... |


### Sample: Referenced Paper Abstract

These are the "included" studies that were cited by Cochrane reviews:

In [13]:
# Load and display sample referenced paper
ref_abstracts = pd.read_csv(DATA_DIR / "referenced_paper_abstracts.csv", dtype={"pmid": str}, nrows=5)

sample_paper = ref_abstracts[ref_abstracts['abstract'].notna()].iloc[0]
print("=" * 80)
print("SAMPLE REFERENCED PAPER (Included in a Cochrane review)")
print("=" * 80)
print(f"\nPMID: {sample_paper['pmid']}")
print(f"Year: {sample_paper['year']}")
print(f"Journal: {sample_paper['journal']}")
print(f"\nTitle:\n{sample_paper['title']}")
print(f"\nAbstract:\n{sample_paper['abstract'][:800]}...")

SAMPLE REFERENCED PAPER (Included in a Cochrane review)

PMID: 2314794
Year: 1990
Journal: Obstetrics and gynecology

Title:
The use of modified Martius graft as an adjunctive technique in vesicovaginal and rectovaginal fistula repair.

Abstract:
The use of the Martius graft, a labial fibro-fatty tissue graft, is described as an adjunctive technique in the repair of 37 complex fistulas in 35 patients. The graft was used to repair three groups of patients with non-radiation-induced vesicovaginal fistulas: 12 patients with large (greater than 4 cm) obstetric fistulas, six patients with obstetric fistulas that caused urethral sloughing, and six patients with recurrent obstetric or post-hysterectomy fistulas. Five other patients had radiation-induced fistulas, and six others had rectovaginal fistulas. The overall success rate was 86.5%. Anatomical studies undertaken of the graft in a cadaver demonstrated that it is composed of fibroadipose tissue from the labium majus, and not from the bul

### Sample: Ground Truth Validation Records

Each validation record pairs a Cochrane review with a candidate paper and a label (1=include, 0=exclude):

In [14]:
# Load the validation set
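# Read PMIDs as strings, matching the schema and the other loaders in this notebook
validation = pd.read_csv(DATA_DIR / "ground_truth_validation_set.csv", dtype={"review_pmid": str, "paper_pmid": str})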

# Show an INCLUDED example
included = validation[validation['label'] == 1].iloc[0]
print("=" * 80)
print("EXAMPLE: INCLUDED PAPER (Label = 1)")
print("=" * 80)
print(f"\nReview PMID: {included['review_pmid']}")
print(f"Review Title:\n{included['review_title'][:100]}...")
print(f"\nSelection Criteria:\n{included['review_criteria'][:400]}...")
print(f"\nCandidate Paper PMID: {included['paper_pmid']}")
print(f"Paper Title:\n{included['paper_title']}")
print(f"\nPaper Abstract:\n{included['paper_abstract'][:500]}...")
print(f"\nLabel: {included['label']} (INCLUDED - this paper was cited in the review)")

EXAMPLE: INCLUDED PAPER (Label = 1)

Review PMID: 21678351
Review Title:
Evaluation of follow-up strategies for patients with epithelial ovarian cancer following completion ...

Selection Criteria:
SELECTION CRITERIA: All relevant randomised controlled trials (RCTs) that evaluated follow-up strategies for patients with epithelial ovarian cancer following completion of primary treatment....

Candidate Paper PMID: 11737464
Paper Title:
A critical evaluation of current protocols for the follow-up of women treated for gynecological malignancies: a pilot study.

Paper Abstract:
This retrospective review was undertaken to determine the efficacy of routine follow-up in the detection and management of recurrent cancer. The case notes of all women attending a regional cancer center who were diagnosed with cancer in 1997 were reviewed. Of 81 new cancers followed up for a median of 42 months (range 36-48), 14 have recurred after curative treatment and there were six cases of persistent disease. T

In [15]:
# Show an EXCLUDED example
excluded = validation[validation['label'] == 0].iloc[0]
print("=" * 80)
print("EXAMPLE: EXCLUDED PAPER (Label = 0)")
print("=" * 80)
print(f"\nReview PMID: {excluded['review_pmid']}")
print(f"Review Title:\n{excluded['review_title'][:100]}...")
print(f"\nSelection Criteria:\n{excluded['review_criteria'][:400]}...")
print(f"\nCandidate Paper PMID: {excluded['paper_pmid']}")
print(f"Paper Title:\n{excluded['paper_title']}")
print(f"\nPaper Abstract:\n{excluded['paper_abstract'][:500]}...")
print(f"\nLabel: {excluded['label']} (EXCLUDED - from a related review, NOT cited in this review)")

EXAMPLE: EXCLUDED PAPER (Label = 0)

Review PMID: 21678351
Review Title:
Evaluation of follow-up strategies for patients with epithelial ovarian cancer following completion ...

Selection Criteria:
SELECTION CRITERIA: All relevant randomised controlled trials (RCTs) that evaluated follow-up strategies for patients with epithelial ovarian cancer following completion of primary treatment....

Candidate Paper PMID: 18363586
Paper Title:
Estimates of the burden of malaria morbidity in Africa in children under the age of 5 years.

Paper Abstract:
OBJECTIVE: To estimate the direct burden of malaria among children younger than 5 years in sub-Saharan Africa (SSA) for the year 2000, as part of a wider initiative on burden estimates. METHODS: A systematic literature review was undertaken in June 2003. Severe malaria outcomes (cerebral malaria, severe malarial anaemia and respiratory distress) and non-severe malaria data were abstracted separately, together with information on the characteristics

### Sample: LLM Evaluation Results

Each evaluation run produces predictions that can be compared to the ground truth:

In [16]:
# Load model comparison results (Run 2 - final results)
results = pd.read_csv(DATA_DIR / "results" / "model_comparison.csv")

print("MODEL COMPARISON RESULTS - RUN 2 (sorted by F1 score)")
print("=" * 80)
display(results.round(3))

MODEL COMPARISON RESULTS - RUN 2 (sorted by F1 score)


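| | model | prompt_type | accuracy | precision | recall | f1 | kappa | n_valid | n_unclear |
|---|-------|-------------|----------|-----------|--------|----|-------|---------|-----------|
| 0 | mistral | cot | 0.836 | 0.836 | 0.837 | 0.837 | 0.672 | 995 | 5 |
| 1 | mistral | zero_shot | 0.845 | 0.908 | 0.768 | 0.832 | 0.69 | 1000 | 0 |
| 2 | llama3.2 | zero_shot | 0.809 | 0.764 | 0.894 | 0.824 | 0.618 | 1000 | 0 |
| 3 | llama3.2 | cot | 0.739 | 0.901 | 0.529 | 0.667 | 0.476 | 975 | 25 |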


In [17]:
# Compare Run 1 vs Run 2 results
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def compute_metrics_from_file(filepath):
    """Compute metrics from an evaluation result file, scoring valid predictions only."""
    df = pd.read_csv(filepath)
    if 'prediction' not in df.columns:
        return None

    # Handle different column names between runs
    y_true = df['true_label'] if 'true_label' in df.columns else df['label']

    pred_col = df['prediction']
    if pd.api.types.is_numeric_dtype(pred_col):
        # Numeric predictions: 1 = include, 0 = exclude; NaN or other values are unclear
        valid = pred_col.isin([0, 1])
        y_pred = pred_col
    else:
        # String predictions: 'include'/'exclude'; any other value is unclear
        valid = pred_col.isin(['include', 'exclude'])
        y_pred = (pred_col == 'include').astype(int)

    # Drop unclear responses before scoring
    y_true = y_true[valid]
    y_pred = y_pred[valid].astype(int)

    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred, zero_division=0),
        'f1': f1_score(y_true, y_pred, zero_division=0),
        'n_valid': int(valid.sum()),
        'n_total': len(df)
    }

# Run 1 files
run1_files = {
    ('Llama 3.2', 'Zero-shot'): 'eval_llama3.2_zero_shot_20260115_193605.csv',
    ('Llama 3.2', 'CoT'): 'eval_llama3.2_cot_20260115_215209.csv',
    ('Mistral', 'Zero-shot'): 'eval_mistral_zero_shot_20260115_231802.csv',
    ('Mistral', 'CoT'): 'eval_mistral_cot_20260116_003208.csv',
}

# Run 2 files
run2_files = {
    ('Llama 3.2', 'Zero-shot'): 'eval_llama3.2_zero_shot_20260116_025453.csv',
    ('Llama 3.2', 'CoT'): 'eval_llama3.2_cot_20260116_041136.csv',
    ('Mistral', 'Zero-shot'): 'eval_mistral_zero_shot_20260116_050656.csv',
    ('Mistral', 'CoT'): 'eval_mistral_cot_20260116_073058.csv',
}

print("=" * 80)
print("COMPARISON: RUN 1 vs RUN 2")
print("=" * 80)

comparison_data = []
for (model, prompt), file1 in run1_files.items():
    file2 = run2_files[(model, prompt)]
    try:
        m1 = compute_metrics_from_file(DATA_DIR / "results" / file1)
        m2 = compute_metrics_from_file(DATA_DIR / "results" / file2)
        if m1 and m2:
            comparison_data.append({
                'Model': model,
                'Prompt': prompt,
                'Run 1 F1': m1['f1'],
                'Run 2 F1': m2['f1'],
                'Δ F1': m2['f1'] - m1['f1'],
                'Run 1 Acc': m1['accuracy'],
                'Run 2 Acc': m2['accuracy'],
            })
    except Exception as e:
        print(f"Error processing {model} {prompt}: {e}")

comparison_df = pd.DataFrame(comparison_data)
display(comparison_df.round(3))

COMPARISON: RUN 1 vs RUN 2


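| | Model | Prompt | Run 1 F1 | Run 2 F1 | Δ F1 | Run 1 Acc | Run 2 Acc |
|---|-------|--------|----------|----------|------|-----------|-----------|
| 0 | Llama 3.2 | Zero-shot | 0.721 | 0.824 | 0.103 | 0.647 | 0.809 |
| 1 | Llama 3.2 | CoT | 0.643 | 0.667 | 0.024 | 0.704 | 0.739 |
| 2 | Mistral | Zero-shot | 0.801 | 0.832 | 0.031 | 0.811 | 0.845 |
| 3 | Mistral | CoT | 0.576 | 0.837 | 0.261 | 0.688 | 0.836 |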


In [18]:
# Show a sample LLM response with reasoning (Chain-of-Thought)
# Use the Run 2 file which has the raw_response column
eval_file = DATA_DIR / "results" / "eval_mistral_cot_20260116_073058.csv"
if eval_file.exists():
    eval_df = pd.read_csv(eval_file, nrows=3)
    sample = eval_df.iloc[0]
    print("=" * 80)
    print("SAMPLE LLM RESPONSE (Mistral with Chain-of-Thought - Run 2)")
    print("=" * 80)
    print(f"\nPaper PMID: {sample['paper_pmid']}")
    # Handle different column names
    label_col = 'true_label' if 'true_label' in sample.index else 'label'
    print(f"True Label: {'INCLUDE' if sample[label_col] == 1 else 'EXCLUDE'}")
    pred = sample['prediction']
    print(f"Prediction: {pred.upper() if isinstance(pred, str) else ('INCLUDE' if pred == 1 else 'EXCLUDE')}")
    # Show raw response if available
    if 'raw_response' in sample.index:
        print(f"\nLLM Reasoning (truncated):\n{sample['raw_response']}")
    elif 'response' in sample.index:
        print(f"\nLLM Response:\n{sample['response']}")
else:
    print("Evaluation file not found.")

SAMPLE LLM RESPONSE (Mistral with Chain-of-Thought - Run 2)

Paper PMID: 11737464
True Label: INCLUDE
Prediction: INCLUDE

LLM Reasoning (truncated):
 1. The main topic of this paper is the evaluation of follow-up strategies for patients with cancer (epithelial ovarian cancer in this case), focusing on their efficacy in detecting and managing recurrent cancer.
2. This paper indeed relates to the systematic review topic as it evaluates different follow-up strategies for patients with epithelial ovarian cancer after primary treatment, which aligns with the systematic review's focus on the same topic.
3. The paper appears to provide relevant evi


---

## Appendix: Quick Reference

### Notebook Execution Order

| # | Notebook | Time | Output |
|---|----------|------|--------|
| 1 | `00_obtain_cochrane_abstracts.ipynb` | ~1 hour | 2 CSV files |
| 2 | `01_eda_cochrane_data.ipynb` | < 1 min | Analysis/plots |
| 3 | `02_fetch_referenced_abstracts.ipynb` | 2.5-3 hours | 1 CSV file |
| 4 | `03_build_ground_truth.ipynb` | < 1 min | 1 CSV file |
| 5 | `04_llm_evaluation.ipynb` | Several hours | Multiple CSV files |

### Evaluation Files Inventory

| Run | Model | Prompt | Filename | Notes |
|-----|-------|--------|----------|-------|
| 1 | Llama 3.2 | Zero-shot | `eval_llama3.2_zero_shot_20260115_193605.csv` | |
| 1 | Llama 3.2 | Zero-shot | `eval_llama3.2_zero_shot_20260115_201927.csv` | ⚠️ Duplicate |
| 1 | Llama 3.2 | CoT | `eval_llama3.2_cot_20260115_215209.csv` | |
| 1 | Mistral | Zero-shot | `eval_mistral_zero_shot_20260115_231802.csv` | |
| 1 | Mistral | CoT | `eval_mistral_cot_20260116_003208.csv` | |
| 2 | Llama 3.2 | Zero-shot | `eval_llama3.2_zero_shot_20260116_025453.csv` | |
| 2 | Llama 3.2 | CoT | `eval_llama3.2_cot_20260116_041136.csv` | |
| 2 | Mistral | Zero-shot | `eval_mistral_zero_shot_20260116_050656.csv` | |
| 2 | Mistral | CoT | `eval_mistral_cot_20260116_073058.csv` | |

**Total: 9 files (8 unique runs + 1 duplicate)**

### Key Numbers to Remember

| Metric | Value |
|--------|-------|
| Cochrane reviews | 17,092 |
| Reference edges | 1,182,678 |
| References with PMIDs | 848,607 (71.8%) |
| Unique cited papers | 491,531 |
| Papers with abstracts | 443,977 (90.3%) |
| Validation set size | 1,000 (balanced) |
| Best LLM F1 score | 0.837 (Mistral CoT) |