# LLM Screening Evaluation Pipeline

**Summary:** In this notebook, I evaluate how well open-source LLMs can screen paper abstracts for inclusion in systematic reviews. I test two models (Llama 3.2 and Mistral) with two prompt strategies (zero-shot and chain-of-thought) on a ground-truth validation set of 1,000 labeled abstracts.

---

## Methodology

### Ground Truth Dataset Construction
The validation set consists of 1,000 paper-review pairs sampled from Cochrane systematic reviews:
- **100 Cochrane reviews** were randomly selected from reviews with clearly defined inclusion criteria and at least 5 included studies with available abstracts
- For each review, I sampled **5 "included" papers** (papers actually included in the review, serving as positive examples) and **5 "excluded" papers** (papers NOT cited by the review that share its medical topic, serving as hard negative examples)
- The excluded papers were sampled from the broader PubMed corpus based on topic similarity, making them realistic "near-miss" candidates that a human screener would need to evaluate
- Final dataset: **500 included + 500 excluded = 1,000 labeled pairs**
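
The balanced sampling described above could be sketched as follows. This is a minimal illustration, not the code used to build the dataset: the `candidates` mapping, the `sample_pairs` helper, and the fixed seed are hypothetical, and the topic-similarity retrieval of excluded candidates is assumed to have happened already.

```python
import random

def sample_pairs(candidates, n_per_class=5, seed=0):
    """Draw n_per_class included and excluded papers for each review.

    `candidates` maps a review ID to {"included": [...], "excluded": [...]}
    lists of paper IDs (a hypothetical structure used only for illustration).
    """
    rng = random.Random(seed)
    pairs = []
    for review_id, pools in candidates.items():
        # Skip reviews that cannot supply a full balanced sample
        if len(pools["included"]) < n_per_class or len(pools["excluded"]) < n_per_class:
            continue
        for paper_id in rng.sample(pools["included"], n_per_class):
            pairs.append({"review_id": review_id, "paper_id": paper_id, "label": 1})
        for paper_id in rng.sample(pools["excluded"], n_per_class):
            pairs.append({"review_id": review_id, "paper_id": paper_id, "label": 0})
    return pairs

# Toy input: two reviews, each with 6 included and 8 excluded candidates
toy = {
    f"review_{r}": {
        "included": [f"inc_{r}_{i}" for i in range(6)],
        "excluded": [f"exc_{r}_{i}" for i in range(8)],
    }
    for r in range(2)
}
pairs = sample_pairs(toy)
print(len(pairs), sum(p["label"] for p in pairs))
```

Sampling the same number of papers from each class per review keeps the labels balanced, so 100 reviews would give the 500 + 500 split reported here.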

### Prompt Generation
Each prompt contains two key components:
1. **Review context:** The title of the Cochrane review, which describes the clinical question (e.g., "Interventions for treating depression after stroke")
2. **Candidate abstract:** The full abstract text of the paper being screened (truncated to 3,000 characters if needed)

I test two prompting strategies:

**Zero-shot prompt:** A direct instruction asking the LLM to decide INCLUDE or EXCLUDE based on whether the paper is relevant to the review topic. No examples are provided.

**Chain-of-thought (CoT) prompt:** The LLM is asked to reason step-by-step before making a decision:
1. What is the main topic of this paper?
2. Does it relate to the systematic review topic?
3. Does it appear to provide relevant evidence?

The model then ends its response with a final line reading either `DECISION: INCLUDE` or `DECISION: EXCLUDE`.

### Evaluation Process
- I run each model (Llama 3.2, Mistral) with each prompt type (zero-shot, CoT) on all 1,000 samples via Ollama
- The LLM's text response is parsed to extract the binary decision
- I compute standard classification metrics against the ground truth labels

---

## Key Results

| Model | Prompt | Accuracy | Precision | Recall | F1 | Kappa | Unclear responses |
|-------|--------|----------|-----------|--------|-----|-------|-------------------|
| Mistral | CoT | 83.6% | 83.6% | 83.7% | 0.837 | 0.672 | 5 |
| Mistral | zero-shot | 84.5% | 90.8% | 76.8% | 0.832 | 0.690 | 0 |
| Llama 3.2 | zero-shot | 80.9% | 76.4% | 89.4% | 0.824 | 0.618 | 0 |
| Llama 3.2 | CoT | 73.9% | 90.1% | 52.9% | 0.667 | 0.476 | 25 |

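**Main finding:** Mistral with chain-of-thought prompting achieves the best F1 score (0.837), with balanced precision (83.6%) and recall (83.7%), although its margin over Mistral zero-shot (0.832) is small. Mistral zero-shot has the highest accuracy (84.5%) and precision (90.8%). Llama 3.2 zero-shot has the highest recall (89.4%) but the lowest precision (76.4%), while chain-of-thought cuts Llama 3.2's recall to 52.9%. All metrics are computed only on responses with a parseable decision, which leaves out 25 Llama 3.2 CoT responses and 5 Mistral CoT responses.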
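
As a cross-check, the reported figures can be reproduced by hand from confusion-matrix counts. The sketch below uses the Mistral zero-shot counts printed by the full run later in this notebook; the helper name `metrics_from_counts` is my own, and it assumes every denominator is non-zero.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive classification metrics from binary confusion-matrix counts."""
    n = tn + fp + fn + tp
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)   # share of predicted includes that are correct
    recall = tp / (tp + fn)      # share of true includes that were found
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }

# Mistral zero-shot: True Excl row = (461, 39), True Incl row = (116, 384)
m = metrics_from_counts(tn=461, fp=39, fn=116, tp=384)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Because the classes are balanced and Mistral zero-shot predicts INCLUDE for only 423 of 1,000 papers, its precision is high while 116 truly included papers are missed, which is what pulls recall down to 76.8%.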

---

## Pipeline Steps
1. I load the ground-truth validation set (500 included, 500 excluded papers)
2. I define two prompt templates: zero-shot and chain-of-thought
3. I run each LLM on all 1,000 samples using Ollama
4. I compute metrics (accuracy, precision, recall, F1, Cohen's kappa)
5. I analyze errors and compare models

In [19]:
# I import all libraries needed for evaluation
import pandas as pd
import numpy as np
import requests
import json
import time
import os
from pathlib import Path
from datetime import datetime
from tqdm import tqdm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, cohen_kappa_score, confusion_matrix

DATA_DIR = Path("../Data")
RESULTS_DIR = DATA_DIR / "results"
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

OLLAMA_URL = "http://localhost:11434/api/generate"
print(f"Data directory: {DATA_DIR.resolve()}")
print(f"Results directory: {RESULTS_DIR.resolve()}")

Data directory: C:\Users\juanx\Documents\LSE-UKHSA Project\Data
Results directory: C:\Users\juanx\Documents\LSE-UKHSA Project\Data\results


In [20]:
# I load the validation set that contains 1,000 labeled abstracts
df = pd.read_csv(DATA_DIR / "ground_truth_validation_set.csv")
print(f"Loaded {len(df):,} records")
print(f"Label distribution: {df['label'].value_counts().to_dict()}")
print(f"\nSample row:")
print(df.iloc[0].to_dict())

Loaded 1,000 records
Label distribution: {1: 500, 0: 500}

Sample row:
{'review_pmid': 21678351, 'review_title': 'Evaluation of follow-up strategies for patients with epithelial ovarian cancer following completion of primary treatment.', 'review_objectives': 'BACKGROUND: Ovarian cancer is the sixth most common cancer and seventh cause of cancer death in women worldwide. Traditionally, many patients who have been treated for cancer undergo long-term follow up in secondary care. Recently however it has been suggested that the use of routine review may not be effective in improving survival, quality of life (QoL), and relieving anxiety. In addition, it may not be cost effective. OBJECTIVES: To compare the potential benefits of different strategies of follow up in women with epithelial ovarian cancer following completion of primary treatment. SEARCH STRATEGY: We searched the Cochrane Gynaecological Cancer Group Trials Register, Cochrane Central Register of Controlled Trials (CENTRAL) (The 

In [None]:
# I define two prompt templates for the LLMs

ZERO_SHOT_TEMPLATE = """You are a systematic review screener. Based on the abstract below, decide if this paper should be INCLUDED or EXCLUDED from a systematic review about:
"{review_title}"

Abstract:
{abstract}

Answer with exactly one word: INCLUDE or EXCLUDE"""

COT_TEMPLATE = """You are a systematic review screener. Your task is to decide if a paper should be included in a systematic review.

Review topic: "{review_title}"

Abstract to screen:
{abstract}

Think through this step by step:
1. What is the main topic of this paper?
2. Does it relate to the systematic review topic?
3. Does it appear to provide relevant evidence?

After your reasoning, give your final answer on a new line as exactly: DECISION: INCLUDE or DECISION: EXCLUDE"""

print("Prompt templates defined.")

Prompt templates defined.


In [22]:
# I define functions to call Ollama and parse LLM responses

def call_ollama(model: str, prompt: str, timeout: int = 120) -> str:
    """Send a prompt to Ollama and return the response text."""
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=timeout
        )
        resp.raise_for_status()
        return resp.json().get("response", "")
    except Exception as e:
        return f"ERROR: {e}"

def parse_decision(response: str, prompt_type: str) -> str:
    """Extract INCLUDE/EXCLUDE from LLM response."""
    text = response.upper()
    if prompt_type == "cot":
        if "DECISION: INCLUDE" in text or "DECISION:INCLUDE" in text:
            return "include"
        elif "DECISION: EXCLUDE" in text or "DECISION:EXCLUDE" in text:
            return "exclude"
    if "INCLUDE" in text and "EXCLUDE" not in text:
        return "include"
    elif "EXCLUDE" in text and "INCLUDE" not in text:
        return "exclude"
    elif text.strip().startswith("INCLUDE"):
        return "include"
    elif text.strip().startswith("EXCLUDE"):
        return "exclude"
    return "unclear"

print("Ollama functions defined.")

Ollama functions defined.
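
One fragile spot in `parse_decision` is a CoT response that repeats the instruction line ("DECISION: INCLUDE or DECISION: EXCLUDE") before giving its answer: because INCLUDE is checked first, such a response always resolves to "include". A possible alternative, sketched below, takes the last explicit marker instead. The name `parse_decision_last` and both regular expressions are my own assumptions, not what produced the results in this notebook, and the function is stricter than the original because it drops the `startswith` fallbacks.

```python
import re

# "DECISION:" followed by optional whitespace or markdown asterisks, then the label
DECISION_RE = re.compile(r"DECISION:[\s*]*(INCLUDE|EXCLUDE)", re.IGNORECASE)
# Bare label words, also accepting the past-tense forms INCLUDED / EXCLUDED
WORD_RE = re.compile(r"\b(INCLUDE|EXCLUDE)D?\b", re.IGNORECASE)

def parse_decision_last(response: str) -> str:
    """Return 'include', 'exclude', or 'unclear' for a model response."""
    # Prefer explicit markers; the last one wins, so an echoed instruction
    # line earlier in the response does not decide the outcome.
    marked = DECISION_RE.findall(response)
    if marked:
        return marked[-1].lower()
    # Otherwise accept the response only if it names exactly one label
    labels = {word.lower() for word in WORD_RE.findall(response)}
    if len(labels) == 1:
        return labels.pop()
    return "unclear"

echoed = "DECISION: INCLUDE or DECISION: EXCLUDE\nThe paper is off-topic.\nDECISION: EXCLUDE"
print(parse_decision_last(echoed))
print(parse_decision_last("This paper should be included."))
print(parse_decision_last("I could include or exclude this."))
```

Swapping this in would change which responses count as unclear, so the metrics reported here would need to be recomputed before any comparison.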


In [23]:
# I check which models are available in Ollama
try:
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    models = [m["name"] for m in resp.json().get("models", [])]
    print(f"Available models: {models}")
except Exception as e:
    print(f"Could not connect to Ollama: {e}")
    print("Make sure Ollama is running (ollama serve)")

Available models: ['mistral:latest', 'llama3.2:latest']


In [30]:
# I define the main evaluation function that runs the LLM on all samples

def run_evaluation(df: pd.DataFrame, model: str, prompt_type: str, template: str, limit: int = None):
    """Run evaluation on the dataset and return results DataFrame."""
    data = df.head(limit) if limit else df
    results = []
    
    for idx, row in tqdm(data.iterrows(), total=len(data), desc=f"{model}/{prompt_type}"):
        prompt = template.format(
            review_title=row["review_title"],
            abstract=row["paper_abstract"][:3000]
        )
        response = call_ollama(model, prompt)
        prediction = parse_decision(response, prompt_type)
        
        results.append({
            "paper_pmid": row["paper_pmid"],
            "true_label": row["label"],
            "prediction": prediction,
            "raw_response": response[:500],
            "model": model,
            "prompt_type": prompt_type
        })
    
    return pd.DataFrame(results)

print("Evaluation function defined.")

Evaluation function defined.


In [32]:
# I define functions to compute and display metrics

def compute_metrics(results_df: pd.DataFrame) -> dict:
    """Compute all evaluation metrics."""
    valid = results_df[results_df["prediction"].isin(["include", "exclude"])].copy()
    # true_label is already int (1=include, 0=exclude)
    y_true = valid["true_label"].astype(int)
    y_pred = (valid["prediction"] == "include").astype(int)
    
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "kappa": cohen_kappa_score(y_true, y_pred),
        "n_valid": len(valid),
        "n_unclear": len(results_df) - len(valid),
        # labels=[0, 1] keeps the matrix 2x2 even if a small sample contains only one class
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]).tolist()
    }

def print_metrics(metrics: dict, model: str, prompt_type: str):
    """Display metrics in a readable format."""
    print(f"\n{'='*50}")
    print(f"Model: {model} | Prompt: {prompt_type}")
    print(f"{'='*50}")
    print(f"Accuracy:  {metrics['accuracy']:.1%}")
    print(f"Precision: {metrics['precision']:.1%}")
    print(f"Recall:    {metrics['recall']:.1%}")
    print(f"F1 Score:  {metrics['f1']:.3f}")
    print(f"Kappa:     {metrics['kappa']:.3f}")
    print(f"Valid:     {metrics['n_valid']} | Unclear: {metrics['n_unclear']}")
    cm = metrics['confusion_matrix']
    print(f"\nConfusion Matrix:")
    print(f"          Pred Excl  Pred Incl")
    print(f"True Excl    {cm[0][0]:4d}       {cm[0][1]:4d}")
    print(f"True Incl    {cm[1][0]:4d}       {cm[1][1]:4d}")

print("Metrics functions defined.")

Metrics functions defined.
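
Because `compute_metrics` drops unclear responses, runs with many unparsed outputs are scored on fewer samples. A simple sensitivity check is to assign every unclear response a fixed label and recompute. The sketch below does this for the Llama 3.2 CoT counts from the full run in this notebook; the helper `fold_unclear` is hypothetical, and the 5/20 split of unclear responses follows from the 500 papers per class minus the valid row totals (495 and 480).

```python
def fold_unclear(valid_cm, unclear_excl, unclear_incl, treat_as):
    """Return (tn, fp, fn, tp) after giving unclear responses a fixed label.

    valid_cm is [[tn, fp], [fn, tp]], the layout produced by
    confusion_matrix(...).tolist(); unclear_excl and unclear_incl count the
    unclear responses whose true label is exclude and include respectively.
    """
    (tn, fp), (fn, tp) = valid_cm
    if treat_as == "exclude":
        return tn + unclear_excl, fp, fn + unclear_incl, tp
    if treat_as == "include":
        return tn, fp + unclear_excl, fn, tp + unclear_incl
    raise ValueError("treat_as must be 'include' or 'exclude'")

# Llama 3.2 CoT: 975 valid responses, 25 unclear (5 true-excluded, 20 true-included)
valid_cm = [[467, 28], [226, 254]]
for choice in ("exclude", "include"):
    tn, fp, fn, tp = fold_unclear(valid_cm, unclear_excl=5, unclear_incl=20, treat_as=choice)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    print(f"unclear -> {choice}: recall={recall:.3f}, accuracy={accuracy:.3f}")
```

Treating unclear responses as exclusions lowers Llama 3.2 CoT recall from 52.9% to 50.8%, while treating them as inclusions (for example, routing them to a human screener) raises it to 54.8%; neither choice changes the ranking of the four runs.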


In [33]:
# I run a quick test with 10 samples to verify everything works
test_results = run_evaluation(df, "llama3.2", "zero_shot", ZERO_SHOT_TEMPLATE, limit=10)
test_metrics = compute_metrics(test_results)
print_metrics(test_metrics, "llama3.2", "zero_shot")
print("\nQuick test complete. Ready for full evaluation.")

llama3.2/zero_shot: 100%|██████████| 10/10 [00:25<00:00,  2.53s/it]


Model: llama3.2 | Prompt: zero_shot
Accuracy:  80.0%
Precision: 80.0%
Recall:    80.0%
F1 Score:  0.800
Kappa:     0.600
Valid:     10 | Unclear: 0

Confusion Matrix:
          Pred Excl  Pred Incl
True Excl       4          1
True Incl       1          4

Quick test complete. Ready for full evaluation.





In [34]:
# I run the full evaluation on all models and prompt types
# WARNING: This takes several hours to complete (about 5 hours 20 minutes in the run recorded below)

MODELS = ["llama3.2", "mistral"]
PROMPTS = {
    "zero_shot": ZERO_SHOT_TEMPLATE,
    "cot": COT_TEMPLATE
}

all_metrics = []

for model in MODELS:
    for prompt_type, template in PROMPTS.items():
        print(f"\n{'#'*60}")
        print(f"Running: {model} with {prompt_type} prompt")
        print(f"{'#'*60}")
        
        results = run_evaluation(df, model, prompt_type, template)
        
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        out_file = RESULTS_DIR / f"eval_{model}_{prompt_type}_{timestamp}.csv"
        results.to_csv(out_file, index=False)
        print(f"Saved results to {out_file}")
        
        metrics = compute_metrics(results)
        print_metrics(metrics, model, prompt_type)
        
        all_metrics.append({
            "model": model,
            "prompt_type": prompt_type,
            **{k: v for k, v in metrics.items() if k != "confusion_matrix"}
        })

print("\nFull evaluation complete!")


############################################################
Running: llama3.2 with zero_shot prompt
############################################################


llama3.2/zero_shot: 100%|██████████| 1000/1000 [45:24<00:00,  2.72s/it]  


Saved results to ..\Data\results\eval_llama3.2_zero_shot_20260116_025453.csv

Model: llama3.2 | Prompt: zero_shot
Accuracy:  80.9%
Precision: 76.4%
Recall:    89.4%
F1 Score:  0.824
Kappa:     0.618
Valid:     1000 | Unclear: 0

Confusion Matrix:
          Pred Excl  Pred Incl
True Excl     362        138
True Incl      53        447

############################################################
Running: llama3.2 with cot prompt
############################################################


llama3.2/cot: 100%|██████████| 1000/1000 [1:16:43<00:00,  4.60s/it]


Saved results to ..\Data\results\eval_llama3.2_cot_20260116_041136.csv

Model: llama3.2 | Prompt: cot
Accuracy:  73.9%
Precision: 90.1%
Recall:    52.9%
F1 Score:  0.667
Kappa:     0.476
Valid:     975 | Unclear: 25

Confusion Matrix:
          Pred Excl  Pred Incl
True Excl     467         28
True Incl     226        254

############################################################
Running: mistral with zero_shot prompt
############################################################


mistral/zero_shot: 100%|██████████| 1000/1000 [55:20<00:00,  3.32s/it]   


Saved results to ..\Data\results\eval_mistral_zero_shot_20260116_050656.csv

Model: mistral | Prompt: zero_shot
Accuracy:  84.5%
Precision: 90.8%
Recall:    76.8%
F1 Score:  0.832
Kappa:     0.690
Valid:     1000 | Unclear: 0

Confusion Matrix:
          Pred Excl  Pred Incl
True Excl     461         39
True Incl     116        384

############################################################
Running: mistral with cot prompt
############################################################


mistral/cot: 100%|██████████| 1000/1000 [2:24:01<00:00,  8.64s/it]   

Saved results to ..\Data\results\eval_mistral_cot_20260116_073058.csv

Model: mistral | Prompt: cot
Accuracy:  83.6%
Precision: 83.6%
Recall:    83.7%
F1 Score:  0.837
Kappa:     0.672
Valid:     995 | Unclear: 5

Confusion Matrix:
          Pred Excl  Pred Incl
True Excl     415         82
True Incl      81        417

Full evaluation complete!





In [35]:
# I create a comparison table of all results
comparison_df = pd.DataFrame(all_metrics)
comparison_df = comparison_df.sort_values("f1", ascending=False)
comparison_df.to_csv(RESULTS_DIR / "model_comparison.csv", index=False)

print("\nModel Comparison (sorted by F1 score):")
print(comparison_df.to_string(index=False))


Model Comparison (sorted by F1 score):
   model prompt_type  accuracy  precision   recall       f1    kappa  n_valid  n_unclear
 mistral         cot  0.836181   0.835671 0.837349 0.836510 0.672361      995          5
 mistral   zero_shot  0.845000   0.907801 0.768000 0.832069 0.690000     1000          0
llama3.2   zero_shot  0.809000   0.764103 0.894000 0.823963 0.618000     1000          0
llama3.2         cot  0.739487   0.900709 0.529167 0.666667 0.475573      975         25


In [38]:
# I define a function to analyze errors and see where models fail

def analyze_errors(results_file: str, n_samples: int = 5):
    """Show examples of false positives and false negatives."""
    df_results = pd.read_csv(results_file)
    df_gt = pd.read_csv(DATA_DIR / "ground_truth_validation_set.csv")
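    # Caution: this merge keys on paper_pmid alone, so a paper that appears more than once
    # in the ground-truth file produces repeated rows and inflates the totals printed below
    merged = df_results.merge(df_gt[["paper_pmid", "paper_abstract", "review_title"]], on="paper_pmid")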
    
    # true_label is int (1=include, 0=exclude), prediction is string
    fp = merged[(merged["true_label"] == 0) & (merged["prediction"] == "include")]
    fn = merged[(merged["true_label"] == 1) & (merged["prediction"] == "exclude")]
    
    print(f"\nFalse Positives ({len(fp)} total):")
    for _, row in fp.head(n_samples).iterrows():
        print(f"  - {row['paper_abstract'][:100]}...")
    
    print(f"\nFalse Negatives ({len(fn)} total):")
    for _, row in fn.head(n_samples).iterrows():
        print(f"  - {row['paper_abstract'][:100]}...")

print("Error analysis function defined.")

Error analysis function defined.


In [None]:
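# I analyze errors for Mistral zero-shot, the run with the highest accuracy and precision (Mistral CoT had the best F1)
# Note: the false-negative total printed below (131) exceeds the 116 in the confusion matrix because
# analyze_errors merges on paper_pmid, which repeats rows for papers that occur more than once in the ground truth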
result_files = list(RESULTS_DIR.glob("eval_mistral_zero_shot_*.csv"))
if result_files:
    latest_file = max(result_files, key=lambda x: x.stat().st_mtime)
    print(f"Analyzing: {latest_file.name}")
    analyze_errors(latest_file)
else:
    print("No Mistral zero-shot results found. Run the full evaluation first.")

Analyzing: eval_mistral_zero_shot_20260116_050656.csv

False Positives (39 total):
  - BACKGROUND: Apneic mass movement of oxygen by applying continuous positive airway pressure (CPAP) is...
  - BACKGROUND: The associations between homocysteine, B vitamin status, and pregnancy outcomes have not...
  - We sought to identify the clinical characteristics and outcomes of patients who had advanced heart f...
  - OBJECTIVES: To investigate the efficacy of a disease-specific Expert Patient Programme (EPP) compare...
  - BACKGROUND: Despite the epidemic rise in obesity, few studies have evaluated the effect of obesity o...

False Negatives (131 total):
  - BACKGROUND: Positron emission tomography-computed tomography (PET-CT) is currently not established i...
  - OBJECTIVE: To assess the effectiveness of nurse led follow up in the management of patients with lun...
  - Interrater reliability assessments were undertaken for the Hamilton Depression Rating Scale, the Ras...
  - In an open study th