# LLM Screening Evaluation Pipeline

**Objective:** Evaluate open-source LLMs on the task of screening paper abstracts for inclusion in systematic reviews.

## Pipeline Overview
1. Load ground truth validation set
2. Design screening prompts
3. Run LLM inference
4. Parse model outputs
5. Calculate metrics (Precision, Recall, F1, Cohen's Kappa)

In [2]:
# Requires third-party packages: pandas, numpy, scikit-learn, requests (Ollama API calls), tqdm

In [3]:
import pandas as pd
import numpy as np
from pathlib import Path
import json
import re
from datetime import datetime
from typing import Literal
from sklearn.metrics import (
    precision_score, recall_score, f1_score, 
    accuracy_score, confusion_matrix, classification_report,
    cohen_kappa_score
)

# Paths: use ./Data if it exists, otherwise fall back to ../Data
DATA_DIR = Path.cwd() / "Data"
if not DATA_DIR.exists():
    DATA_DIR = Path.cwd().parent / "Data"
VALIDATION_CSV = DATA_DIR / "ground_truth_validation_set.csv"
RESULTS_DIR = DATA_DIR / "results"
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

print(f"Data directory: {DATA_DIR}")
print(f"Results directory: {RESULTS_DIR}")

Data directory: c:\Users\juanx\Documents\LSE-UKHSA Project\Data
Results directory: c:\Users\juanx\Documents\LSE-UKHSA Project\Data\results


In [4]:
# Load validation set
val_df = pd.read_csv(VALIDATION_CSV)
print(f"Loaded {len(val_df):,} validation records")
print(f"Label distribution: {val_df['label'].value_counts().to_dict()}")
val_df.head(2)

Loaded 1,000 validation records
Label distribution: {1: 500, 0: 500}


| | review_pmid | review_title | review_objectives | review_criteria | paper_pmid | paper_title | paper_abstract | label |
|---|---|---|---|---|---|---|---|---|
| 0 | 21678351 | Evaluation of follow-up strategies for patient... | BACKGROUND: Ovarian cancer is the sixth most c... | SELECTION CRITERIA: All relevant randomised co... | 11737464 | A critical evaluation of current protocols for... | This retrospective review was undertaken to de... | 1 |
| 1 | 21678351 | Evaluation of follow-up strategies for patient... | BACKGROUND: Ovarian cancer is the sixth most c... | SELECTION CRITERIA: All relevant randomised co... | 2564410 | [Clinical usefulness of serum sialyl Le(x)-i m... | Sialyl Le(X)-i (Sialyl SSEA-1, SLX) is one of ... | 1 |


## Prompt Design

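We consider three prompt strategies; this notebook implements the zero-shot and chain-of-thought variants:
1. **Zero-shot**: Direct instruction with the review's selection criteria
2. **Few-shot**: Include labelled examples (not implemented here; listed under Next Steps)
3. **Chain-of-thought**: Ask for brief reasoning before the decision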
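
The few-shot strategy could follow the same layout as the zero-shot template. The sketch below is only an illustration: `create_few_shot_prompt` and the `(title, abstract, decision)` example tuples are hypothetical names, not part of this notebook, and it assumes the examples come from outside the validation set so that no labels leak into the evaluation.

```python
from typing import Sequence, Tuple

# Hypothetical example format: (paper_title, paper_abstract, decision),
# where decision is the string "INCLUDE" or "EXCLUDE".
Example = Tuple[str, str, str]


def create_few_shot_prompt(
    criteria: str,
    paper_title: str,
    paper_abstract: str,
    examples: Sequence[Example],
) -> str:
    """Build a screening prompt that shows labelled examples before the target paper."""
    example_blocks = []
    for i, (ex_title, ex_abstract, ex_decision) in enumerate(examples, start=1):
        example_blocks.append(
            f"### Example {i}\n"
            f"Title: {ex_title}\n\n"
            f"Abstract: {ex_abstract}\n\n"
            f"Decision: {ex_decision}"
        )
    examples_text = "\n\n".join(example_blocks)
    return (
        f"## Systematic Review Selection Criteria:\n{criteria}\n\n"
        f"## Labelled Examples:\n{examples_text}\n\n"
        f"## Paper to Screen:\nTitle: {paper_title}\n\n"
        f"Abstract: {paper_abstract}\n\n"
        "## Decision:\n"
        "Answer with one word only: INCLUDE or EXCLUDE"
    )


# Toy inputs, invented purely to show the shape of the prompt.
demo_prompt = create_few_shot_prompt(
    criteria="Randomised controlled trials of treatment X.",
    paper_title="A trial of X versus placebo",
    paper_abstract="We randomised 200 patients...",
    examples=[
        ("An RCT of X in adults", "Patients were randomised...", "INCLUDE"),
        ("A case report on X", "We describe one patient...", "EXCLUDE"),
    ],
)
print(demo_prompt)
```

Because the prompt still ends with a one-word instruction, the existing `parse_decision` function should be able to handle the responses without changes.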

In [5]:
# Prompt templates

SYSTEM_PROMPT = """You are an expert systematic review screener. Your task is to determine whether a research paper should be INCLUDED or EXCLUDED from a systematic review based on the review's selection criteria.

Respond with ONLY one word: INCLUDE or EXCLUDE."""


def create_zero_shot_prompt(criteria: str, paper_title: str, paper_abstract: str) -> str:
    """Create a zero-shot screening prompt."""
    return f"""## Systematic Review Selection Criteria:
{criteria}

## Paper to Screen:
Title: {paper_title}

Abstract: {paper_abstract}

## Decision:
Based on the selection criteria above, should this paper be INCLUDED or EXCLUDED from the systematic review?

Answer with one word only: INCLUDE or EXCLUDE"""


def create_cot_prompt(criteria: str, paper_title: str, paper_abstract: str) -> str:
    """Create a chain-of-thought screening prompt."""
    return f"""## Systematic Review Selection Criteria:
{criteria}

## Paper to Screen:
Title: {paper_title}

Abstract: {paper_abstract}

## Task:
Analyze whether this paper meets the selection criteria for the systematic review.

First, briefly explain your reasoning (2-3 sentences).
Then, provide your final decision on a new line starting with "DECISION:" followed by either INCLUDE or EXCLUDE."""


# Test prompt creation
sample = val_df.iloc[0]
prompt = create_zero_shot_prompt(
    sample["review_criteria"],
    sample["paper_title"],
    sample["paper_abstract"][:500] + "..."
)
print("Sample zero-shot prompt:")
print(prompt[:1000])

Sample zero-shot prompt:
## Systematic Review Selection Criteria:
SELECTION CRITERIA: All relevant randomised controlled trials (RCTs) that evaluated follow-up strategies for patients with epithelial ovarian cancer following completion of primary treatment.

## Paper to Screen:
Title: A critical evaluation of current protocols for the follow-up of women treated for gynecological malignancies: a pilot study.

Abstract: This retrospective review was undertaken to determine the efficacy of routine follow-up in the detection and management of recurrent cancer. The case notes of all women attending a regional cancer center who were diagnosed with cancer in 1997 were reviewed. Of 81 new cancers followed up for a median of 42 months (range 36-48), 14 have recurred after curative treatment and there were six cases of persistent disease. The median number of clinic visits per patient was 3.5 (range 1-16). Eight recurr...

## Decision:
Based on the selection criteria above, should this paper be 

## LLM Inference Setup

The pipeline is designed around two backends:
1. **Ollama** (local models) - recommended for open-source models, and the only backend implemented in this notebook
2. **OpenAI-compatible API** (for cloud-hosted models) - not implemented here
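
For the second backend, a call against an OpenAI-compatible `/chat/completions` endpoint might look like the sketch below. It is not part of this notebook: `build_chat_payload` and `call_openai_compatible` are hypothetical helpers, and the default `base_url` assumes a server exposing the OpenAI chat-completions schema under `/v1` on the local Ollama port. It uses the standard library rather than `requests` to stay dependency-free.

```python
import json
import urllib.request


def build_chat_payload(prompt: str, model: str, system: str) -> dict:
    """Assemble a chat-completions request body (OpenAI-style schema)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.0,  # deterministic decoding, as in the Ollama call
    }


def call_openai_compatible(
    prompt: str,
    model: str,
    system: str,
    base_url: str = "http://localhost:11434/v1",  # assumption: adjust to your server
    api_key: str = "unused",  # assumption: local servers typically ignore the key
    timeout: float = 120.0,
) -> str:
    """POST the payload and return the assistant message text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt, model, system)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the payload needs no running server, so it can be inspected offline.
payload = build_chat_payload("Screen this paper.", "llama3.2", "You are a screener.")
print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from the network call makes the request body easy to check without a server.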

### Installing Ollama
If you haven't installed Ollama:
1. Download from https://ollama.ai
2. Run: `ollama pull llama3.2` or `ollama pull mistral`

In [6]:
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"

def call_ollama(prompt: str, model: str = "llama3.2", system: str = SYSTEM_PROMPT) -> str:
    """Call Ollama API for inference using requests."""
    try:
        response = requests.post(
            OLLAMA_URL,
            json={
                "model": model,
                "messages": [
                    {"role": "system", "content": system},
                    {"role": "user", "content": prompt}
                ],
                "options": {"temperature": 0.0},
                "stream": False
            },
            timeout=120
        )
        response.raise_for_status()
        return response.json()["message"]["content"]
    except requests.exceptions.ConnectionError:
        print("Ollama not running. Start it with: ollama serve")
        return ""
    except Exception as e:
        print(f"Ollama error: {e}")
        return ""


def parse_decision(response: str) -> int | None:
    """Parse LLM response into binary decision."""
    if not response:
        return None
        
    response_upper = response.upper()
    
    # Check for DECISION: prefix (for CoT prompts)
    if "DECISION:" in response_upper:
        after_decision = response_upper.split("DECISION:")[-1].strip()
        words = after_decision.split()
        if words:
            if "INCLUDE" in words[0]:
                return 1
            elif "EXCLUDE" in words[0]:
                return 0
    
    # Simple keyword matching
    if "INCLUDE" in response_upper and "EXCLUDE" not in response_upper:
        return 1
    elif "EXCLUDE" in response_upper and "INCLUDE" not in response_upper:
        return 0
    
    # Check first word
    first_word = response_upper.strip().split()[0] if response_upper.strip() else ""
    if "INCLUDE" in first_word:
        return 1
    elif "EXCLUDE" in first_word:
        return 0
    
    return None  # Unable to parse


# Test the parser
test_responses = [
    "INCLUDE",
    "EXCLUDE",
    "Based on the criteria, this paper should be INCLUDED.",
    "The paper does not meet criteria. DECISION: EXCLUDE",
    "I think we should include it."
]
for r in test_responses:
    print(f"{r[:50]:50s} -> {parse_decision(r)}")

INCLUDE                                            -> 1
EXCLUDE                                            -> 0
Based on the criteria, this paper should be INCLUD -> 1
The paper does not meet criteria. DECISION: EXCLUD -> 0
I think we should include it.                      -> 1
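
The parser above relies on substring matching, so a negated sentence such as "This paper should not be included." also maps to 1. A more conservative variant is sketched below under stated assumptions: `parse_decision_strict` is a hypothetical alternative, not the function used for the results in this notebook, and it returns `None` (a parse failure) whenever free text contains a negation or mentions both decisions.

```python
import re
from typing import Optional

# Explicit "DECISION: <word>" marker, tolerating markdown such as "**INCLUDE**".
_DECISION_RE = re.compile(r"DECISION:\W*(INCLUDE|EXCLUDE)D?\b", re.IGNORECASE)
# Whole-word matches only, so "includes" or "inclusion" do not count.
_WORD_RE = re.compile(r"\b(INCLUDE|EXCLUDE)D?\b", re.IGNORECASE)
# A deliberately crude negation check; any hit makes free text ambiguous.
_NEGATION_RE = re.compile(r"\b(?:NOT|NEVER|CANNOT)\b|N'T\b", re.IGNORECASE)


def parse_decision_strict(response: str) -> Optional[int]:
    """Return 1 (include), 0 (exclude), or None when the response is ambiguous."""
    if not response:
        return None

    # 1. Prefer the last explicit DECISION: marker (chain-of-thought prompts).
    marked = _DECISION_RE.findall(response)
    if marked:
        return 1 if marked[-1].upper() == "INCLUDE" else 0

    # 2. Otherwise refuse to guess if the text is negated.
    if _NEGATION_RE.search(response):
        return None

    # 3. Accept free text only if exactly one of the two decisions appears.
    found = {word.upper() for word in _WORD_RE.findall(response)}
    if len(found) == 1:
        return 1 if found.pop() == "INCLUDE" else 0
    return None


for text in [
    "INCLUDE",
    "The paper does not meet criteria. DECISION: EXCLUDE",
    "This paper should not be included.",
    "It could be INCLUDE or EXCLUDE.",
]:
    print(f"{text[:50]:50s} -> {parse_decision_strict(text)}")
```

The trade-off is a higher parse-failure rate in exchange for fewer silently wrong labels; ambiguous responses can then be routed to manual review.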


In [7]:
# Check available Ollama models
try:
    response = requests.get("http://localhost:11434/api/tags", timeout=5)
    response.raise_for_status()
    models = response.json().get("models", [])
    print("Available Ollama models:")
    for m in models:
        print(f"  - {m['name']}")
    if not models:
        print("  No models found. Run: ollama pull llama3.2")
except requests.exceptions.ConnectionError:
    print("Ollama not running!")
    print("\nTo set up Ollama:")
    print("1. Download from https://ollama.ai")
    print("2. Run: ollama serve")
    print("3. In another terminal: ollama pull llama3.2")
except Exception as e:
    print(f"Error checking Ollama: {e}")

Ollama not running!

To set up Ollama:
1. Download from https://ollama.ai
2. Run: ollama serve
3. In another terminal: ollama pull llama3.2


## Run Evaluation

In [8]:
from tqdm import tqdm

def run_evaluation(
    df: pd.DataFrame,
    model: str,
    prompt_type: Literal["zero_shot", "cot"] = "zero_shot",
    max_samples: int | None = None,
    save_results: bool = True
) -> pd.DataFrame:
    """
    Run LLM evaluation on the validation set.
    
    Args:
        df: Validation DataFrame
        model: Ollama model name
        prompt_type: "zero_shot" or "cot"
        max_samples: Limit number of samples (for testing)
        save_results: Save results to CSV
    
    Returns:
        DataFrame with predictions
    """
    if max_samples:
        df = df.head(max_samples).copy()
    else:
        df = df.copy()
    
    results = []
    prompt_fn = create_zero_shot_prompt if prompt_type == "zero_shot" else create_cot_prompt
    
    print(f"Running evaluation: model={model}, prompt={prompt_type}, samples={len(df)}")
    
    for idx, row in tqdm(df.iterrows(), total=len(df), desc="Evaluating"):
        prompt = prompt_fn(
            row["review_criteria"],
            row["paper_title"],
            row["paper_abstract"]
        )
        
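        # call_ollama sends the default SYSTEM_PROMPT (one-word answers) for both
        # prompt types, including "cot", whose user prompt asks for reasoning first.
        response = call_ollama(prompt, model=model)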
        prediction = parse_decision(response)
        
        results.append({
            "review_pmid": row["review_pmid"],
            "paper_pmid": row["paper_pmid"],
            "label": row["label"],
            "prediction": prediction,
            "response": response[:500],  # Truncate for storage
            "model": model,
            "prompt_type": prompt_type
        })
    
    results_df = pd.DataFrame(results)
    
    if save_results:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"eval_{model.replace(':', '_')}_{prompt_type}_{timestamp}.csv"
        results_df.to_csv(RESULTS_DIR / filename, index=False)
        print(f"Saved results to: {RESULTS_DIR / filename}")
    
    return results_df
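
Since `call_ollama` returns an empty string on any error, a single timeout becomes a parse failure for that record. One way to reduce this is to retry empty responses with exponential backoff. The sketch below is an assumption-laden illustration: `call_with_retries` is a hypothetical helper, and the attempt count and delays are arbitrary defaults rather than tuned values.

```python
import time
from typing import Callable


def call_with_retries(
    call: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Invoke `call()` until it returns a non-empty string or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between attempts.
    Returns "" if every attempt fails, matching call_ollama's error convention.
    """
    for attempt in range(1, max_attempts + 1):
        response = call()
        if response:
            return response
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    return ""


# Offline demonstration: a fake backend that fails twice, then answers.
# The sleep function is replaced by list.append so no real waiting happens.
_outcomes = iter(["", "", "INCLUDE"])
recorded_delays = []
result = call_with_retries(lambda: next(_outcomes), sleep=recorded_delays.append)
print(result, recorded_delays)

# Inside run_evaluation this could be used as:
#   response = call_with_retries(lambda: call_ollama(prompt, model=model))
```

Injecting `sleep` as a parameter keeps the helper testable without waiting for real delays.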

In [9]:
def calculate_metrics(results_df: pd.DataFrame) -> dict:
    """
    Calculate evaluation metrics.
    
    Returns dict with:
        - precision, recall, f1 (for "include" class)
        - accuracy
        - cohen_kappa
        - confusion_matrix
        - parse_failures
    """
    # Filter out parse failures
    valid = results_df[results_df["prediction"].notna()].copy()
    parse_failures = len(results_df) - len(valid)
    
    if len(valid) == 0:
        return {"error": "No valid predictions"}
    
    y_true = valid["label"].values
    y_pred = valid["prediction"].astype(int).values
    
    metrics = {
        "n_samples": len(results_df),
        "n_valid": len(valid),
        "parse_failures": parse_failures,
        "parse_failure_rate": parse_failures / len(results_df),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, pos_label=1),
        "recall": recall_score(y_true, y_pred, pos_label=1),
        "f1": f1_score(y_true, y_pred, pos_label=1),
        "cohen_kappa": cohen_kappa_score(y_true, y_pred),
        # Fix the label order so the matrix is always 2x2, even if one class is absent
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]).tolist(),
    }
    
    # Agreement rate (same as accuracy)
    metrics["agreement_rate"] = metrics["accuracy"]
    
    return metrics


def print_metrics(metrics: dict, model: str = "", prompt_type: str = ""):
    """Pretty-print evaluation metrics."""
    print("\n" + "="*60)
    if model:
        print(f"Model: {model} | Prompt: {prompt_type}")
    print("="*60)
    print(f"Samples: {metrics['n_samples']} | Valid: {metrics['n_valid']} | Parse failures: {metrics['parse_failures']} ({metrics['parse_failure_rate']:.1%})")
    print("-"*60)
    print(f"Accuracy:       {metrics['accuracy']:.3f}")
    print(f"Precision:      {metrics['precision']:.3f}")
    print(f"Recall:         {metrics['recall']:.3f}")
    print(f"F1 Score:       {metrics['f1']:.3f}")
    print(f"Cohen's Kappa:  {metrics['cohen_kappa']:.3f}")
    print("-"*60)
    cm = metrics["confusion_matrix"]
    print("Confusion Matrix:")
    print(f"                 Predicted")
    print(f"                 Excl   Incl")
    print(f"  Actual Excl    {cm[0][0]:4d}   {cm[0][1]:4d}")
    print(f"  Actual Incl    {cm[1][0]:4d}   {cm[1][1]:4d}")
    print("="*60)

## Test Run (Small Sample)

First, let's test with a small sample to verify everything works.

In [10]:
# Test with 10 samples first
MODEL = "llama3.2"  # Change to your available model

print(f"Testing with model: {MODEL}")
test_results = run_evaluation(
    val_df, 
    model=MODEL, 
    prompt_type="zero_shot",
    max_samples=10,
    save_results=False
)

print("\nSample results:")
print(test_results[["label", "prediction", "response"]].head())

Testing with model: llama3.2
Running evaluation: model=llama3.2, prompt=zero_shot, samples=10


Evaluating: 100%|██████████| 10/10 [00:25<00:00,  2.51s/it]


Sample results:
   label  prediction response
0      1           1  INCLUDE
1      1           1  INCLUDE
2      1           1  INCLUDE
3      1           0  EXCLUDE
4      1           1  INCLUDE





In [11]:
# Calculate metrics on test run
test_metrics = calculate_metrics(test_results)
print_metrics(test_metrics, MODEL, "zero_shot")


Model: llama3.2 | Prompt: zero_shot
Samples: 10 | Valid: 10 | Parse failures: 0 (0.0%)
------------------------------------------------------------
Accuracy:       0.700
Precision:      0.667
Recall:         0.800
F1 Score:       0.727
Cohen's Kappa:  0.400
------------------------------------------------------------
Confusion Matrix:
                 Predicted
                 Excl   Incl
  Actual Excl       3      2
  Actual Incl       1      4


## Full Evaluation

Run the evaluation on the complete validation set (1,000 records). Runtime depends on your hardware and the model; the zero-shot llama3.2 run recorded below took about 42 minutes.

In [12]:
# Full evaluation - running on all 1000 samples
MODEL = "llama3.2"

print("Running full evaluation...")
full_results = run_evaluation(
    val_df, 
    model=MODEL, 
    prompt_type="zero_shot",
    save_results=True
)

metrics = calculate_metrics(full_results)
print_metrics(metrics, MODEL, "zero_shot")

Running full evaluation...
Running evaluation: model=llama3.2, prompt=zero_shot, samples=1000


Evaluating: 100%|██████████| 1000/1000 [41:54<00:00,  2.51s/it]

Saved results to: c:\Users\juanx\Documents\LSE-UKHSA Project\Data\results\eval_llama3.2_zero_shot_20260115_193605.csv

Model: llama3.2 | Prompt: zero_shot
Samples: 1000 | Valid: 1000 | Parse failures: 0 (0.0%)
------------------------------------------------------------
Accuracy:       0.647
Precision:      0.596
Recall:         0.910
F1 Score:       0.721
Cohen's Kappa:  0.294
------------------------------------------------------------
Confusion Matrix:
                 Predicted
                 Excl   Incl
  Actual Excl     192    308
  Actual Incl      45    455





## Compare Multiple Models

Run evaluation across different models and prompt types.

In [13]:
# Configuration for multi-model comparison
MODELS_TO_TEST = [
    "llama3.2",
    "mistral",
]

PROMPT_TYPES = ["zero_shot", "cot"]

# Store all results
all_metrics = []

In [14]:
# Run comparison across models and prompt types
for model in MODELS_TO_TEST:
    for prompt_type in PROMPT_TYPES:
        print(f"\n{'='*60}")
        print(f"Evaluating: {model} with {prompt_type}")
        print("="*60)
        
        results = run_evaluation(
            val_df,
            model=model,
            prompt_type=prompt_type,
            save_results=True
        )
        
        metrics = calculate_metrics(results)
        metrics["model"] = model
        metrics["prompt_type"] = prompt_type
        all_metrics.append(metrics)
        
        print_metrics(metrics, model, prompt_type)


Evaluating: llama3.2 with zero_shot
Running evaluation: model=llama3.2, prompt=zero_shot, samples=1000


Evaluating: 100%|██████████| 1000/1000 [42:43<00:00,  2.56s/it]


Saved results to: c:\Users\juanx\Documents\LSE-UKHSA Project\Data\results\eval_llama3.2_zero_shot_20260115_201927.csv

Model: llama3.2 | Prompt: zero_shot
Samples: 1000 | Valid: 1000 | Parse failures: 0 (0.0%)
------------------------------------------------------------
Accuracy:       0.649
Precision:      0.598
Recall:         0.912
F1 Score:       0.722
Cohen's Kappa:  0.298
------------------------------------------------------------
Confusion Matrix:
                 Predicted
                 Excl   Incl
  Actual Excl     193    307
  Actual Incl      44    456

Evaluating: llama3.2 with cot
Running evaluation: model=llama3.2, prompt=cot, samples=1000


Evaluating:  71%|███████   | 712/1000 [1:16:14<53:17:34, 666.16s/it]

Ollama error: HTTPConnectionPool(host='localhost', port=11434): Read timed out. (read timeout=120)


Evaluating: 100%|██████████| 1000/1000 [1:32:42<00:00,  5.56s/it]


Saved results to: c:\Users\juanx\Documents\LSE-UKHSA Project\Data\results\eval_llama3.2_cot_20260115_215209.csv

Model: llama3.2 | Prompt: cot
Samples: 1000 | Valid: 999 | Parse failures: 1 (0.1%)
------------------------------------------------------------
Accuracy:       0.704
Precision:      0.809
Recall:         0.533
F1 Score:       0.643
Cohen's Kappa:  0.407
------------------------------------------------------------
Confusion Matrix:
                 Predicted
                 Excl   Incl
  Actual Excl     437     63
  Actual Incl     233    266

Evaluating: mistral with zero_shot
Running evaluation: model=mistral, prompt=zero_shot, samples=1000


Evaluating: 100%|██████████| 1000/1000 [1:25:52<00:00,  5.15s/it]


Saved results to: c:\Users\juanx\Documents\LSE-UKHSA Project\Data\results\eval_mistral_zero_shot_20260115_231802.csv

Model: mistral | Prompt: zero_shot
Samples: 1000 | Valid: 1000 | Parse failures: 0 (0.0%)
------------------------------------------------------------
Accuracy:       0.811
Precision:      0.846
Recall:         0.760
F1 Score:       0.801
Cohen's Kappa:  0.622
------------------------------------------------------------
Confusion Matrix:
                 Predicted
                 Excl   Incl
  Actual Excl     431     69
  Actual Incl     120    380

Evaluating: mistral with cot
Running evaluation: model=mistral, prompt=cot, samples=1000


Evaluating: 100%|██████████| 1000/1000 [1:14:05<00:00,  4.45s/it]

Saved results to: c:\Users\juanx\Documents\LSE-UKHSA Project\Data\results\eval_mistral_cot_20260116_003208.csv

Model: mistral | Prompt: cot
Samples: 1000 | Valid: 998 | Parse failures: 2 (0.2%)
------------------------------------------------------------
Accuracy:       0.688
Precision:      0.902
Recall:         0.423
F1 Score:       0.576
Cohen's Kappa:  0.377
------------------------------------------------------------
Confusion Matrix:
                 Predicted
                 Excl   Incl
  Actual Excl     476     23
  Actual Incl     288    211





In [15]:
# Create comparison table
if all_metrics:
    comparison_df = pd.DataFrame(all_metrics)
    comparison_df = comparison_df[["model", "prompt_type", "accuracy", "precision", "recall", "f1", "cohen_kappa", "parse_failure_rate"]]
    comparison_df = comparison_df.round(3)
    display(comparison_df.sort_values("f1", ascending=False))
    
    # Save comparison
    comparison_df.to_csv(RESULTS_DIR / "model_comparison.csv", index=False)
    print(f"\nSaved comparison to: {RESULTS_DIR / 'model_comparison.csv'}")

| | model | prompt_type | accuracy | precision | recall | f1 | cohen_kappa | parse_failure_rate |
|---|---|---|---|---|---|---|---|---|
| 2 | mistral | zero_shot | 0.811 | 0.846 | 0.760 | 0.801 | 0.622 | 0.000 |
| 0 | llama3.2 | zero_shot | 0.649 | 0.598 | 0.912 | 0.722 | 0.298 | 0.000 |
| 1 | llama3.2 | cot | 0.704 | 0.809 | 0.533 | 0.643 | 0.407 | 0.001 |
| 3 | mistral | cot | 0.688 | 0.902 | 0.423 | 0.576 | 0.377 | 0.002 |



Saved comparison to: c:\Users\juanx\Documents\LSE-UKHSA Project\Data\results\model_comparison.csv


## Error Analysis

Examine patterns in model failures.

In [16]:
def error_analysis(results_df: pd.DataFrame, val_df: pd.DataFrame) -> dict:
    """
    Analyze patterns in model errors.
    
    Returns:
        - False positives (predicted include, actual exclude)
        - False negatives (predicted exclude, actual include)
    """
    # Merge with original data for context
    merged = results_df.merge(
        val_df[["review_pmid", "paper_pmid", "review_title", "paper_title", "paper_abstract"]],
        on=["review_pmid", "paper_pmid"]
    )
    
    valid = merged[merged["prediction"].notna()].copy()
    valid["prediction"] = valid["prediction"].astype(int)
    
    # False positives: predicted 1, actual 0
    fp = valid[(valid["prediction"] == 1) & (valid["label"] == 0)]
    
    # False negatives: predicted 0, actual 1
    fn = valid[(valid["prediction"] == 0) & (valid["label"] == 1)]
    
    print(f"False Positives (wrongly included): {len(fp)}")
    print(f"False Negatives (wrongly excluded): {len(fn)}")
    
    return {
        "false_positives": fp,
        "false_negatives": fn
    }


# Example usage (uncomment after running evaluation):
# errors = error_analysis(full_results, val_df)
# 
# print("\nSample False Negatives (should have been included):")
# for _, row in errors["false_negatives"].head(3).iterrows():
#     print(f"\nReview: {row['review_title'][:60]}...")
#     print(f"Paper: {row['paper_title']}")
#     print(f"Response: {row['response'][:200]}...")

In [18]:
# Load best model results and run error analysis
best_model_results = pd.read_csv(RESULTS_DIR / "eval_mistral_zero_shot_20260115_231802.csv")

errors = error_analysis(best_model_results, val_df)

print("\n" + "="*60)
print("SAMPLE FALSE POSITIVES (Over-inclusion errors)")
print("="*60)
for _, row in errors["false_positives"].head(3).iterrows():
    print(f"\n📋 Review: {row['review_title'][:80]}...")
    print(f"📄 Paper: {row['paper_title'][:80]}...")
    if 'response' in row:
        print(f"🤖 Response: {str(row['response'])[:200]}...")

print("\n" + "="*60)
print("SAMPLE FALSE NEGATIVES (Missed relevant papers)")
print("="*60)
for _, row in errors["false_negatives"].head(3).iterrows():
    print(f"\n📋 Review: {row['review_title'][:80]}...")
    print(f"📄 Paper: {row['paper_title'][:80]}...")
    if 'response' in row:
        print(f"🤖 Response: {str(row['response'])[:200]}...")

False Positives (wrongly included): 69
False Negatives (wrongly excluded): 120

SAMPLE FALSE POSITIVES (Over-inclusion errors)

📋 Review: Bias due to selective inclusion and reporting of outcomes and analyses in system...
📄 Paper: Impact of nonfatal myocardial infarction on outcomes in patients with advanced h...
🤖 Response:  INCLUDE...

📋 Review: Bias due to selective inclusion and reporting of outcomes and analyses in system...
📄 Paper: Obesity Increases Risk-Adjusted Morbidity, Mortality, and Cost Following Cardiac...
🤖 Response:  INCLUDE...

📋 Review: Bias due to selective inclusion and reporting of outcomes and analyses in system...
📄 Paper: Cardiovascular risk assessment scores for people with diabetes: a systematic rev...
🤖 Response:  INCLUDE...

SAMPLE FALSE NEGATIVES (Missed relevant papers)

📋 Review: Evaluation of follow-up strategies for patients with epithelial ovarian cancer f...
📄 Paper: PET-CT in recurrent ovarian cancer: impact on treat

## Summary & Next Steps

### Metrics Interpretation
- **Recall** is critical for systematic reviews (we don't want to miss relevant papers)
- **Precision** affects workload (false positives mean extra manual review)
- **Cohen's Kappa** measures agreement beyond chance
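
To make these definitions concrete, the metrics can be recomputed by hand from a confusion matrix. The sketch below uses the mistral zero-shot counts reported above (431 true negatives, 69 false positives, 120 false negatives, 380 true positives); `metrics_from_confusion` is a hypothetical helper written for illustration, and it assumes no denominator is zero.

```python
def metrics_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute include-class metrics and Cohen's Kappa from 2x2 counts."""
    n = tn + fp + fn + tp
    precision = tp / (tp + fp)  # share of predicted includes that are correct
    recall = tp / (tp + fn)     # share of true includes that were found
    f1 = 2 * precision * recall / (precision + recall)

    observed = (tp + tn) / n    # raw agreement, identical to accuracy
    # Chance agreement: product of actual and predicted marginals, per class.
    expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n**2
    kappa = (observed - expected) / (1 - expected)

    return {
        "accuracy": observed,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "cohen_kappa": kappa,
    }


worked = metrics_from_confusion(tn=431, fp=69, fn=120, tp=380)
for name, value in worked.items():
    print(f"{name:12s} {value:.3f}")
```

With 500 actual excludes and 500 actual includes, chance agreement is 0.5, so Kappa is the accuracy's margin over 0.5 rescaled to the 0-1 range: (0.811 - 0.5) / 0.5 = 0.622.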

### Recommendations for UKHSA
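1. Prioritize **high-recall** configurations for initial screening. In the runs above, llama3.2 zero-shot reached the highest recall (0.912) but the lowest precision (0.598), mistral zero-shot gave the best F1 (0.801) and Cohen's Kappa (0.622), and chain-of-thought prompting lowered recall for both models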
2. Use LLM as first-pass filter, with human review of borderline cases
3. Consider ensemble approaches (multiple models voting)
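
As a rough illustration of the ensemble idea, per-model decisions can be combined with a simple voting rule. This is a sketch only: `combine_votes` is a hypothetical helper, the "any" rule (include if any model says include) is one recall-oriented choice among several, and neither rule has been evaluated in this notebook.

```python
from typing import Optional, Sequence


def combine_votes(votes: Sequence[Optional[int]], rule: str = "any") -> Optional[int]:
    """Combine per-model decisions (1 include, 0 exclude, None parse failure).

    rule="any":      include if at least one model says include (favours recall).
    rule="majority": include if at least half of the valid votes say include,
                     so ties resolve to include.
    Returns None when no model produced a valid decision.
    """
    valid = [v for v in votes if v is not None]
    if not valid:
        return None
    if rule == "any":
        return 1 if any(v == 1 for v in valid) else 0
    if rule == "majority":
        return 1 if 2 * sum(valid) >= len(valid) else 0
    raise ValueError(f"Unknown rule: {rule!r}")


# Three hypothetical model outputs for one paper.
paper_votes = [1, 0, 0]
print("any:     ", combine_votes(paper_votes, rule="any"))
print("majority:", combine_votes(paper_votes, rule="majority"))
```

The "any" rule trades precision for recall, which fits a first-pass filter where missed papers cost more than extra manual review.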

### Next Steps
1. Test additional models beyond llama3.2 and mistral (e.g., Phi-3, Gemma)
2. Experiment with few-shot prompting
3. Fine-tune a model on Cochrane screening data
4. Build production pipeline with logging and monitoring