# LLM Screening Evaluation Pipeline

**Summary:** This notebook evaluates how well LLMs screen paper abstracts for inclusion in systematic reviews. I test open-source models served locally through Ollama (Llama 3.2 and Mistral) and proprietary API models (Gemini 2.0 Flash, GPT-4o, Claude 3 Haiku) with two prompt strategies (zero-shot and chain-of-thought) on a ground-truth validation set of 1,000 labeled abstracts.

---

## Models Evaluated

| Model | Type | Provider |
|-------|------|----------|
| Llama 3.2 3B | Open-source (via Ollama) | Meta |
| Mistral 7B Instruct | Open-source (via Ollama) | Mistral AI |
| Gemini 2.0 Flash | Proprietary API | Google |
| GPT-4o | Proprietary API | OpenAI |
| Claude 3 Haiku | Proprietary API | Anthropic |

---

## Methodology

### Ground Truth Dataset Construction
The validation set consists of 1,000 paper-review pairs sampled from Cochrane systematic reviews:
- **100 Cochrane reviews** were randomly selected from reviews that have clearly defined inclusion criteria and at least 5 included studies with available abstracts
- For each review, I sample **5 "included" papers** (papers that were actually included in the review, serving as positive examples) and **5 "excluded" papers** (papers that were NOT cited by the review but share the same medical topic, serving as hard negative examples)
- Final dataset: **500 included + 500 excluded = 1,000 labeled pairs**
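
The sampling step above can be sketched roughly as follows. This is only an illustration, not the code that built the dataset: the function name `build_balanced_pairs`, the `review_id` column, and the toy candidate table are all assumptions made for the example.

```python
import pandas as pd

def build_balanced_pairs(candidates: pd.DataFrame, n_per_class: int = 5,
                         seed: int = 42) -> pd.DataFrame:
    """Sample n included and n excluded papers for every review.

    Assumes hypothetical columns: review_id, paper_pmid, label
    (1 = included in the review, 0 = same-topic paper that was not cited).
    Reviews with too few papers in either class are skipped.
    """
    parts = []
    for _, group in candidates.groupby("review_id"):
        pos = group[group["label"] == 1]
        neg = group[group["label"] == 0]
        if len(pos) < n_per_class or len(neg) < n_per_class:
            continue
        parts.append(pos.sample(n=n_per_class, random_state=seed))
        parts.append(neg.sample(n=n_per_class, random_state=seed))
    if not parts:
        return candidates.iloc[0:0].copy()  # empty frame with the same columns
    return pd.concat(parts, ignore_index=True)

# Toy candidate pool: R1 has enough papers in both classes, R2 does not.
toy = pd.DataFrame({
    "review_id": ["R1"] * 14 + ["R2"] * 6,
    "paper_pmid": range(20),
    "label": [1] * 7 + [0] * 7 + [1] * 3 + [0] * 3,
})
pairs = build_balanced_pairs(toy)
print(len(pairs), "pairs from reviews:", sorted(pairs["review_id"].unique()))
```

Applied to 100 eligible reviews, the same procedure would yield the 500 + 500 split described above.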

### Prompt Strategies

**Zero-shot prompt:** A direct instruction asking the LLM to decide INCLUDE or EXCLUDE based on whether the paper is relevant to the review topic.

**Chain-of-thought (CoT) prompt:** The LLM is asked to reason step-by-step before making a decision.

---

## Pipeline Steps
1. Load ground-truth validation set (500 included, 500 excluded papers)
2. Define prompt templates (zero-shot and chain-of-thought)
3. Set up API clients (Ollama for local models, API keys for proprietary)
4. Run evaluation on all models
5. Compute metrics (accuracy, precision, recall, F1, Cohen's kappa)
6. Compare results and analyze errors

In [13]:
# ============================================================
# CELL 1: IMPORTS AND SETUP
# ============================================================
import pandas as pd
import numpy as np
import requests
import json
import time
import os
from pathlib import Path
from datetime import datetime
from tqdm import tqdm
from dotenv import load_dotenv
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, cohen_kappa_score, confusion_matrix

# Load environment variables from .env file (force override)
ENV_FILE = Path(r"c:\Users\juanx\Documents\LSE-UKHSA Project\.env")
load_dotenv(ENV_FILE, override=True)
print(f"Loading .env from: {ENV_FILE}")

DATA_DIR = Path("../Data")
RESULTS_DIR = DATA_DIR / "results"
RESULTS_DIR.mkdir(parents=True, exist_ok=True)  # parents=True also creates ../Data if it is missing

# API Configuration
OLLAMA_URL = "http://localhost:11434/api/generate"
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
OPENAI_API_KEY = os.getenv("OPEN_AI_API_KEY")  # Note: underscore in env var name
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")

print(f"Data directory: {DATA_DIR.resolve()}")
print(f"Results directory: {RESULTS_DIR.resolve()}")
print(f"\nAPI Keys loaded:")
print(f"  Gemini:    {'✅ Set' if GEMINI_API_KEY else '❌ Missing'}")
print(f"  OpenAI:    {'✅ Set' if OPENAI_API_KEY else '❌ Missing'}")
print(f"  Anthropic: {'✅ Set' if ANTHROPIC_API_KEY else '❌ Missing'}")

Loading .env from: c:\Users\juanx\Documents\LSE-UKHSA Project\.env
Data directory: C:\Users\juanx\Documents\LSE-UKHSA Project\Data
Results directory: C:\Users\juanx\Documents\LSE-UKHSA Project\Data\results

API Keys loaded:
  Gemini:    ✅ Set
  OpenAI:    ✅ Set
  Anthropic: ✅ Set


In [14]:
# ============================================================
# CELL 2: LOAD VALIDATION DATA
# ============================================================
df = pd.read_csv(DATA_DIR / "ground_truth_validation_set.csv")
print(f"✅ Loaded {len(df):,} records")
print(f"Label distribution: {df['label'].value_counts().to_dict()}")

✅ Loaded 1,000 records
Label distribution: {1: 500, 0: 500}


In [15]:
# ============================================================
# CELL 3: PROMPT TEMPLATES
# ============================================================

ZERO_SHOT_TEMPLATE = """You are a systematic review screener. Based on the abstract below, decide if this paper should be INCLUDED or EXCLUDED from a systematic review about:
"{review_title}"

Abstract:
{abstract}

Answer with exactly one word: INCLUDE or EXCLUDE"""

COT_TEMPLATE = """You are a systematic review screener. Your task is to decide if a paper should be included in a systematic review.

Review topic: "{review_title}"

Abstract to screen:
{abstract}

Think through this step by step:
1. What is the main topic of this paper?
2. Does it relate to the systematic review topic?
3. Does it appear to provide relevant evidence?

After your reasoning, give your final answer on a new line as exactly: DECISION: INCLUDE or DECISION: EXCLUDE"""

PROMPTS = {
    "zero_shot": ZERO_SHOT_TEMPLATE,
    "cot": COT_TEMPLATE
}

print("✅ Prompt templates defined.")

✅ Prompt templates defined.


In [16]:
# ============================================================
# CELL 4: OLLAMA FUNCTIONS (for local models)
# ============================================================

def call_ollama(model: str, prompt: str, timeout: int = 120) -> str:
    """Send a prompt to Ollama and return the response text."""
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=timeout
        )
        resp.raise_for_status()
        return resp.json().get("response", "")
    except Exception as e:
        return f"ERROR: {e}"

def parse_decision(response: str, prompt_type: str) -> str:
    """Extract INCLUDE/EXCLUDE from LLM response."""
    text = response.upper()
    if prompt_type == "cot":
        if "DECISION: INCLUDE" in text or "DECISION:INCLUDE" in text:
            return "include"
        elif "DECISION: EXCLUDE" in text or "DECISION:EXCLUDE" in text:
            return "exclude"
    if "INCLUDE" in text and "EXCLUDE" not in text:
        return "include"
    elif "EXCLUDE" in text and "INCLUDE" not in text:
        return "exclude"
    elif text.strip().startswith("INCLUDE"):
        return "include"
    elif text.strip().startswith("EXCLUDE"):
        return "exclude"
    return "unclear"

print("✅ Ollama functions defined.")

✅ Ollama functions defined.
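
One weakness of substring matching is that a model which echoes the instruction ("DECISION: INCLUDE or DECISION: EXCLUDE") is parsed as an include. A stricter variant is sketched below as an alternative, not as the parser used for the results in this notebook. `parse_decision_strict` is a hypothetical name, and unlike `parse_decision` it matches whole words only, so "INCLUDED" does not count as a decision.

```python
import re

# Whole-word patterns; IGNORECASE avoids upper-casing the whole response first.
DECISION_RE = re.compile(r"DECISION:\s*(INCLUDE|EXCLUDE)\b", re.IGNORECASE)
WORD_RE = re.compile(r"\b(INCLUDE|EXCLUDE)\b", re.IGNORECASE)

def parse_decision_strict(response: str, prompt_type: str) -> str:
    """Return 'include', 'exclude', or 'unclear' (hypothetical stricter parser)."""
    if response.startswith("ERROR:"):
        return "unclear"  # API failures are never treated as decisions
    if prompt_type == "cot":
        # Use the last line that carries a DECISION marker, and reject it
        # if that line names both options (e.g. an echoed instruction).
        for line in reversed(response.splitlines()):
            found = {m.upper() for m in DECISION_RE.findall(line)}
            if found:
                return found.pop().lower() if len(found) == 1 else "unclear"
    # Fallback: accept the answer only if exactly one of the two words appears.
    words = {w.upper() for w in WORD_RE.findall(response)}
    return words.pop().lower() if len(words) == 1 else "unclear"

print(parse_decision_strict("Relevant RCT.\nDECISION: INCLUDE", "cot"))
print(parse_decision_strict("DECISION: INCLUDE or DECISION: EXCLUDE", "cot"))
```

Swapping this in would change which responses count as "unclear", so metrics computed with it are not directly comparable to those reported below.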


In [33]:
# ============================================================
# CELL 5: PROPRIETARY API CLIENTS
# ============================================================
import google.generativeai as genai
from openai import OpenAI
import anthropic

# --- GEMINI 2.0 Flash ---
gemini_model = None
if GEMINI_API_KEY:
    genai.configure(api_key=GEMINI_API_KEY)
    gemini_model = genai.GenerativeModel('gemini-2.0-flash')  # Available model
    print("✅ Gemini 2.0 Flash initialized")

def generate_gemini(prompt: str, max_tokens: int = 256) -> str:
    """Generate response using Gemini 2.0 Flash."""
    if not gemini_model:
        return "ERROR: Gemini API key not set"
    try:
        response = gemini_model.generate_content(
            prompt,
            generation_config=genai.types.GenerationConfig(
                max_output_tokens=max_tokens,
                temperature=0.1
            )
        )
        return response.text.strip()
    except Exception as e:
        return f"ERROR: {e}"

# --- GPT-4o ---
openai_client = None
if OPENAI_API_KEY:
    openai_client = OpenAI(api_key=OPENAI_API_KEY, timeout=120.0)
    print("✅ OpenAI GPT-4o initialized")

def generate_gpt(prompt: str, max_tokens: int = 256) -> str:
    """Generate response using GPT-4o."""
    if not openai_client:
        return "ERROR: OpenAI API key not set"
    try:
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=0.1
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return f"ERROR: {e}"

# --- Claude 3 Haiku ---
anthropic_client = None
if ANTHROPIC_API_KEY:
    anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY, timeout=120.0)
    print("✅ Claude 3 Haiku initialized")

def generate_claude(prompt: str, max_tokens: int = 256) -> str:
    """Generate response using Claude 3 Haiku."""
    if not anthropic_client:
        return "ERROR: Anthropic API key not set"
    try:
        response = anthropic_client.messages.create(
            model="claude-3-haiku-20240307",  # Most basic/available model
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text.strip()
    except Exception as e:
        return f"ERROR: {e}"

print("\n✅ All API clients ready!")

✅ Gemini 2.0 Flash initialized
✅ OpenAI GPT-4o initialized
✅ Claude 3 Haiku initialized

✅ All API clients ready!
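
The three `generate_*` functions return a string starting with `ERROR:` when a call fails, and that string is later parsed as an "unclear" prediction. A possible way to retry transient failures (rate limits, timeouts) before giving up is sketched below. `with_retries` is a hypothetical helper, not part of this notebook, and the `flaky` function only simulates an API for the demonstration.

```python
import time
from typing import Callable

def with_retries(generate_fn: Callable[[str, int], str], max_attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> Callable[[str, int], str]:
    """Wrap a generate function so 'ERROR:' responses are retried with exponential backoff."""
    def wrapped(prompt: str, max_tokens: int = 256) -> str:
        response = ""
        for attempt in range(max_attempts):
            response = generate_fn(prompt, max_tokens)
            if not response.startswith("ERROR:"):
                return response
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
        return response  # still an error after the final attempt
    return wrapped

# Simulated API that fails twice, then answers.
calls = []
def flaky(prompt: str, max_tokens: int) -> str:
    calls.append(prompt)
    return "ERROR: rate limited" if len(calls) < 3 else "INCLUDE"

delays = []  # record the waits instead of actually sleeping
robust = with_retries(flaky, max_attempts=4, base_delay=0.5, sleep=delays.append)
result = robust("demo prompt", 50)
print(result, len(calls), delays)
```

In the notebook this could be applied as `with_retries(generate_gpt)` and passed as `generate_fn` to the evaluation loop; whether that is worthwhile depends on how often the APIs actually fail.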


In [34]:
# ============================================================
# CELL 6: CHECK AVAILABLE MODELS
# ============================================================

# Check Ollama models
print("🔍 Checking Ollama models...")
try:
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    ollama_models = [m["name"] for m in resp.json().get("models", [])]
    print(f"   Available: {ollama_models}")
except Exception as e:
    print(f"   ⚠️ Could not connect to Ollama: {e}")
    print("   Make sure Ollama is running (ollama serve)")
    ollama_models = []

# Check API models
print("\n🔍 Checking API models...")
api_models = []
if GEMINI_API_KEY:
    api_models.append("gemini-2.0-flash")
if OPENAI_API_KEY:
    api_models.append("gpt-4o")
if ANTHROPIC_API_KEY:
    api_models.append("claude-3-haiku-20240307")
print(f"   Available: {api_models}")

🔍 Checking Ollama models...
   Available: ['mistral:latest', 'llama3.2:latest']

🔍 Checking API models...
   Available: ['gemini-2.0-flash', 'gpt-4o', 'claude-3-haiku-20240307']


In [19]:
# ============================================================
# CELL 8: EVALUATION FUNCTIONS
# ============================================================

def run_evaluation_ollama(df: pd.DataFrame, model: str, prompt_type: str, 
                          template: str, limit: int = None):
    """Run evaluation using Ollama (local models)."""
    data = df.head(limit) if limit else df
    results = []
    
    for idx, row in tqdm(data.iterrows(), total=len(data), desc=f"{model}/{prompt_type}"):
        prompt = template.format(
            review_title=row["review_title"],
            abstract=str(row["paper_abstract"])[:3000]
        )
        response = call_ollama(model, prompt)
        prediction = parse_decision(response, prompt_type)
        
        results.append({
            "paper_pmid": row["paper_pmid"],
            "true_label": row["label"],
            "prediction": prediction,
            "raw_response": response[:500] if response else "",
            "model": model,
            "prompt_type": prompt_type
        })
    
    return pd.DataFrame(results)

def run_evaluation_api(df: pd.DataFrame, model_name: str, generate_fn, 
                       prompt_type: str, template: str, limit: int = None,
                       delay: float = 0.5):
    """Run evaluation using API models (Gemini, GPT, Claude)."""
    data = df.head(limit) if limit else df
    results = []
    max_tokens = 300 if prompt_type == "cot" else 50
    
    for idx, row in tqdm(data.iterrows(), total=len(data), desc=f"{model_name}/{prompt_type}"):
        prompt = template.format(
            review_title=row["review_title"],
            abstract=str(row["paper_abstract"])[:3000]
        )
        
        response = generate_fn(prompt, max_tokens)
        prediction = parse_decision(response, prompt_type)
        
        results.append({
            "paper_pmid": row["paper_pmid"],
            "true_label": row["label"],
            "prediction": prediction,
            "raw_response": response[:500] if response else "",
            "model": model_name,
            "prompt_type": prompt_type
        })
        
        if delay > 0:
            time.sleep(delay)
    
    return pd.DataFrame(results)

print("✅ Evaluation functions defined.")

✅ Evaluation functions defined.


In [20]:
# ============================================================
# CELL 9: METRICS FUNCTIONS
# ============================================================

def compute_metrics(results_df: pd.DataFrame) -> dict:
    """Compute all evaluation metrics."""
    valid = results_df[results_df["prediction"].isin(["include", "exclude"])].copy()
    y_true = valid["true_label"].astype(int)
    y_pred = (valid["prediction"] == "include").astype(int)
    
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "kappa": cohen_kappa_score(y_true, y_pred),
        "n_valid": len(valid),
        "n_unclear": len(results_df) - len(valid),
        # labels=[0, 1] forces a 2x2 matrix even when one class is absent,
        # so print_metrics can always index cm[0][0] .. cm[1][1]
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]).tolist()
    }

def print_metrics(metrics: dict, model: str, prompt_type: str):
    """Display metrics in a readable format."""
    print(f"\n{'='*50}")
    print(f"Model: {model} | Prompt: {prompt_type}")
    print(f"{'='*50}")
    print(f"Accuracy:  {metrics['accuracy']:.1%}")
    print(f"Precision: {metrics['precision']:.1%}")
    print(f"Recall:    {metrics['recall']:.1%}")
    print(f"F1 Score:  {metrics['f1']:.3f}")
    print(f"Kappa:     {metrics['kappa']:.3f}")
    print(f"Valid:     {metrics['n_valid']} | Unclear: {metrics['n_unclear']}")
    cm = metrics['confusion_matrix']
    print(f"\nConfusion Matrix:")
    print(f"          Pred Excl  Pred Incl")
    print(f"True Excl    {cm[0][0]:4d}       {cm[0][1]:4d}")
    print(f"True Incl    {cm[1][0]:4d}       {cm[1][1]:4d}")

print("✅ Metrics functions defined.")

✅ Metrics functions defined.
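
Note that `compute_metrics` drops rows whose prediction is "unclear" before scoring, so a model that often fails to answer is evaluated only on the cases it did answer. A complementary view that also counts unclear responses as errors is sketched below. `accuracy_with_unclear` is a hypothetical helper and the four-row frame is toy data, not notebook output.

```python
import pandas as pd

def accuracy_with_unclear(results_df: pd.DataFrame) -> dict:
    """Report coverage plus accuracy on decided rows and on all rows.

    Expects the same columns as the evaluation output:
    'prediction' ('include' / 'exclude' / 'unclear') and 'true_label' (0 or 1).
    """
    n = len(results_df)
    decided = results_df["prediction"].isin(["include", "exclude"])
    pred = (results_df["prediction"] == "include").astype(int)
    # An unclear row can never be correct.
    correct = decided & (pred == results_df["true_label"].astype(int))
    return {
        "coverage": float(decided.mean()) if n else 0.0,
        "accuracy_decided": float(correct[decided].mean()) if decided.any() else 0.0,
        "accuracy_all": float(correct.mean()) if n else 0.0,
    }

toy_results = pd.DataFrame({
    "true_label": [1, 1, 0, 0],
    "prediction": ["include", "unclear", "exclude", "include"],
})
summary = accuracy_with_unclear(toy_results)
print(summary)
```

Here `accuracy_decided` corresponds to what `compute_metrics` reports, while `accuracy_all` penalizes every unparseable response.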


In [35]:
# ============================================================
# CELL 10: QUICK TEST (10 samples each)
# ============================================================

print("🧪 Quick test with 10 samples each...\n")

# Test Ollama (if available)
if "llama3.2" in str(ollama_models):
    print("Testing Llama 3.2 (Ollama)...")
    test_ollama = run_evaluation_ollama(df, "llama3.2", "zero_shot", ZERO_SHOT_TEMPLATE, limit=10)
    metrics = compute_metrics(test_ollama)
    print(f"   Accuracy: {metrics['accuracy']:.1%}, F1: {metrics['f1']:.3f}")
else:
    print("⏭️ Llama 3.2 not available in Ollama")

# Test Gemini 2.0 Flash
if GEMINI_API_KEY:
    print("Testing Gemini 2.0 Flash...")
    test_gemini = run_evaluation_api(df, "gemini-2.0-flash", generate_gemini, "zero_shot", ZERO_SHOT_TEMPLATE, limit=10, delay=1.0)
    metrics = compute_metrics(test_gemini)
    print(f"   Accuracy: {metrics['accuracy']:.1%}, F1: {metrics['f1']:.3f}")

# Test GPT-4o
if OPENAI_API_KEY:
    print("Testing GPT-4o...")
    test_gpt = run_evaluation_api(df, "gpt-4o", generate_gpt, "zero_shot", ZERO_SHOT_TEMPLATE, limit=10, delay=1.0)
    metrics = compute_metrics(test_gpt)
    print(f"   Accuracy: {metrics['accuracy']:.1%}, F1: {metrics['f1']:.3f}")

# Test Claude 3 Haiku
if ANTHROPIC_API_KEY:
    print("Testing Claude 3 Haiku...")
    test_claude = run_evaluation_api(df, "claude-3-haiku-20240307", generate_claude, "zero_shot", ZERO_SHOT_TEMPLATE, limit=10, delay=1.0)
    metrics = compute_metrics(test_claude)
    print(f"   Accuracy: {metrics['accuracy']:.1%}, F1: {metrics['f1']:.3f}")

print("\n✅ Quick tests complete!")

🧪 Quick test with 10 samples each...

Testing Llama 3.2 (Ollama)...


llama3.2/zero_shot: 100%|██████████| 10/10 [00:27<00:00,  2.76s/it]


   Accuracy: 90.0%, F1: 0.889
Testing Gemini 2.0 Flash...


gemini-2.0-flash/zero_shot: 100%|██████████| 10/10 [00:15<00:00,  1.54s/it]


   Accuracy: 90.0%, F1: 0.889
Testing GPT-4o...


gpt-4o/zero_shot: 100%|██████████| 10/10 [00:16<00:00,  1.65s/it]


   Accuracy: 70.0%, F1: 0.571
Testing Claude 3 Haiku...


claude-3-haiku-20240307/zero_shot: 100%|██████████| 10/10 [00:16<00:00,  1.66s/it]

   Accuracy: 90.0%, F1: 0.889

✅ Quick tests complete!





In [None]:
# ============================================================
# CELL 11: FULL EVALUATION - ALL MODELS
# ============================================================
# This evaluates all available models on all 1,000 samples.
# Estimated time: 
#   - Ollama models: ~1-2 hours each
#   - API models: ~30-60 min each (with rate limiting)

all_metrics = []

# --- LOCAL MODELS (Ollama) ---
OLLAMA_MODELS = ["llama3.2", "mistral"]

for model in OLLAMA_MODELS:
    if model not in str(ollama_models):
        print(f"⏭️ Skipping {model} (not in Ollama)")
        continue
        
    for prompt_type, template in PROMPTS.items():
        print(f"\n{'#'*60}")
        print(f"🚀 Running: {model} with {prompt_type} prompt")
        print(f"{'#'*60}")
        
        results = run_evaluation_ollama(df, model, prompt_type, template)
        
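        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        # Sanitize the name the same way as the API loop, so the glob pattern
        # in Cell 14 (which replaces "." and "-" with "_") finds this file
        safe_name = model.replace(".", "_").replace("-", "_")
        out_file = RESULTS_DIR / f"eval_{safe_name}_{prompt_type}_{timestamp}.csv"
        results.to_csv(out_file, index=False)
        print(f"💾 Saved: {out_file}")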
        
        metrics = compute_metrics(results)
        print_metrics(metrics, model, prompt_type)
        
        all_metrics.append({
            "model": model,
            "prompt_type": prompt_type,
            **{k: v for k, v in metrics.items() if k != "confusion_matrix"}
        })

# --- API MODELS ---
API_MODELS = {
    "gemini-2.0-flash": {"fn": generate_gemini, "delay": 0.5, "enabled": bool(GEMINI_API_KEY)},
    "gpt-4o": {"fn": generate_gpt, "delay": 0.5, "enabled": bool(OPENAI_API_KEY)},
    "claude-3-haiku-20240307": {"fn": generate_claude, "delay": 0.5, "enabled": bool(ANTHROPIC_API_KEY)},
}

for model_name, config in API_MODELS.items():
    if not config["enabled"]:
        print(f"\n⏭️ Skipping {model_name} (no API key)")
        continue
    
    for prompt_type, template in PROMPTS.items():
        print(f"\n{'#'*60}")
        print(f"🚀 Running: {model_name} with {prompt_type} prompt")
        print(f"{'#'*60}")
        
        results = run_evaluation_api(
            df, model_name, config["fn"], 
            prompt_type, template, 
            delay=config["delay"]
        )
        
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        safe_name = model_name.replace(".", "_").replace("-", "_")
        out_file = RESULTS_DIR / f"eval_{safe_name}_{prompt_type}_{timestamp}.csv"
        results.to_csv(out_file, index=False)
        print(f"💾 Saved: {out_file}")
        
        metrics = compute_metrics(results)
        print_metrics(metrics, model_name, prompt_type)
        
        all_metrics.append({
            "model": model_name,
            "prompt_type": prompt_type,
            **{k: v for k, v in metrics.items() if k != "confusion_matrix"}
        })

print("\n" + "="*60)
print("🎉 FULL EVALUATION COMPLETE!")
print("="*60)


############################################################
🚀 Running: llama3.2 with zero_shot prompt
############################################################


llama3.2/zero_shot:   1%|          | 10/1000 [00:29<49:22,  2.99s/it]
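
Because each configuration takes a long time and the loop above only writes its CSV after all rows are processed, an interrupted run loses its progress. One possible remedy is to append each result as it is produced and skip rows already on disk, as sketched below. `run_with_checkpoint` is a hypothetical helper, the demo rows are made up, and keying on `paper_pmid` assumes each paper appears only once in the validation set (a composite key would be needed otherwise).

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Iterable

def run_with_checkpoint(rows: Iterable[dict], predict_fn: Callable[[dict], str],
                        out_path: Path, key: str = "paper_pmid") -> int:
    """Append one prediction per row to out_path, skipping keys already saved.

    Returns the number of rows newly written in this call.
    """
    done = set()
    if out_path.exists():
        with out_path.open(newline="", encoding="utf-8") as f:
            done = {r[key] for r in csv.DictReader(f)}  # keys are read back as strings
    write_header = not out_path.exists() or out_path.stat().st_size == 0
    written = 0
    with out_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[key, "prediction"])
        if write_header:
            writer.writeheader()
        for row in rows:
            if str(row[key]) in done:
                continue
            writer.writerow({key: row[key], "prediction": predict_fn(row)})
            f.flush()  # keep the file current if the kernel is interrupted
            written += 1
    return written

demo_rows = [{"paper_pmid": 101}, {"paper_pmid": 102}, {"paper_pmid": 103}]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval_demo.csv"
    first = run_with_checkpoint(demo_rows[:2], lambda r: "include", path)  # "interrupted" run
    second = run_with_checkpoint(demo_rows, lambda r: "exclude", path)     # resumed run
print(first, second)
```

The resumed call only processes the one row that the first call did not reach.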

In [None]:
# ============================================================
# CELL 12: MODEL COMPARISON TABLE
# ============================================================

comparison_df = pd.DataFrame(all_metrics)
comparison_df = comparison_df.sort_values("f1", ascending=False)
comparison_df.to_csv(RESULTS_DIR / "model_comparison.csv", index=False)

print("\n📊 Model Comparison (sorted by F1 score):")
print("="*80)
print(comparison_df.to_string(index=False))
print("\n💾 Saved to:", RESULTS_DIR / "model_comparison.csv")


Model Comparison (sorted by F1 score):
   model prompt_type  accuracy  precision   recall       f1    kappa  n_valid  n_unclear
 mistral         cot  0.836181   0.835671 0.837349 0.836510 0.672361      995          5
 mistral   zero_shot  0.845000   0.907801 0.768000 0.832069 0.690000     1000          0
llama3.2   zero_shot  0.809000   0.764103 0.894000 0.823963 0.618000     1000          0
llama3.2         cot  0.739487   0.900709 0.529167 0.666667 0.475573      975         25


In [None]:
# ============================================================
# CELL 13: ERROR ANALYSIS
# ============================================================

def analyze_errors(results_file: "str | Path", n_samples: int = 5):
    """Show examples of false positives and false negatives."""
    df_results = pd.read_csv(results_file)
    df_gt = pd.read_csv(DATA_DIR / "ground_truth_validation_set.csv")
    merged = df_results.merge(df_gt[["paper_pmid", "paper_abstract", "review_title"]], on="paper_pmid")
    
    fp = merged[(merged["true_label"] == 0) & (merged["prediction"] == "include")]
    fn = merged[(merged["true_label"] == 1) & (merged["prediction"] == "exclude")]
    
    # str() guards against missing abstracts, which pandas reads as float NaN
    print(f"\n❌ False Positives ({len(fp)} total) - wrongly included:")
    for _, row in fp.head(n_samples).iterrows():
        print(f"  • {str(row['paper_abstract'])[:100]}...")
    
    print(f"\n❌ False Negatives ({len(fn)} total) - wrongly excluded:")
    for _, row in fn.head(n_samples).iterrows():
        print(f"  • {str(row['paper_abstract'])[:100]}...")

print("✅ Error analysis function defined.")

✅ Error analysis function defined.


In [11]:
# ============================================================
# CELL 14: ANALYZE BEST MODEL ERRORS
# ============================================================

# Reuse the table from Cell 12, or reload it from disk if that cell
# has not been run in the current kernel session
comparison_path = RESULTS_DIR / "model_comparison.csv"
if "comparison_df" not in globals():
    comparison_df = pd.read_csv(comparison_path) if comparison_path.exists() else pd.DataFrame()

# Find the best performing model based on F1 score
if not comparison_df.empty:
    best_model = comparison_df.iloc[0]
    print(f"📈 Best model: {best_model['model']} ({best_model['prompt_type']})")
    print(f"   F1: {best_model['f1']:.3f}, Accuracy: {best_model['accuracy']:.1%}")
    
    # Find the corresponding results file
    safe_name = best_model['model'].replace(".", "_").replace("-", "_")
    pattern = f"eval_{safe_name}_{best_model['prompt_type']}_*.csv"
    result_files = list(RESULTS_DIR.glob(pattern))
    
    if result_files:
        latest_file = max(result_files, key=lambda x: x.stat().st_mtime)
        print(f"\n🔍 Analyzing errors from: {latest_file.name}")
        analyze_errors(latest_file)
    else:
        print(f"⚠️ No results file found matching: {pattern}")
else:
    print("⚠️ No evaluation results available. Run the full evaluation first.")