# 06: Evaluate LLMs for Systematic Review Screening

## Objective
Evaluate whether LLMs can correctly determine if a paper should be **included** or **excluded** from a Cochrane systematic review.

## Task
Given:
- **Review context** (title + abstract of the Cochrane review - defines the screening criteria)
- **Paper abstract** (the candidate study being screened)

Predict: INCLUDE or EXCLUDE

## Dataset
- **Validation set**: `ground_truth_validation_dataset.csv`, covering all Cochrane groups; use the `cochrane_group` column to filter by group

## Prompt Types
1. **Zero-shot** - Direct question without reasoning
2. **Chain-of-thought (CoT)** - Ask the LLM to reason step-by-step before deciding

## Models (10 Models - All Local via Ollama)

### General-Purpose Models
| Model | Size | Description |
|-------|------|-------------|
| **Llama 3.2** | 3B | Meta's efficient baseline model |
| **Llama 3.1 8B** | 8B | Stronger instruction-following |
| **Mistral 7B** | 7B | Strong general-purpose model |
| **Mistral Nemo 12B** | 12B | Newer architecture, strong reasoning |
| **Qwen 2.5 7B** | 7B | Strong public benchmark results; described as rivaling GPT-3.5 |
| **Gemma 2 9B** | 9B | Google's Gemma 2 release; well suited to classification |
| **Phi-3 Medium** | 14B | Microsoft's efficient model |

### Biomedical-Specialized Models
| Model | Size | Description |
|-------|------|-------------|
| **OpenBioLLM-8B** | 8B | Llama-3 fine-tune; reported to outperform GPT-3.5 on medical benchmarks |
| **BioMistral 7B** | 7B | Mistral fine-tuned on PubMed Central |
| **Meditron 7B** | 7B | Fine-tuned on medical guidelines & PubMed |

## Output
- `Data/results/eval_*.csv` - Predictions with LLM reasoning saved
- Metrics: Accuracy, Precision, Recall (identical to Sensitivity), F1, Specificity
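
As a rough illustration of how the listed metrics relate to one another, the sketch below derives them from confusion-matrix counts, treating INCLUDE as the positive class. The function name `screening_metrics` and the example counts are hypothetical and are not results from this notebook.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive screening metrics from confusion-matrix counts (INCLUDE = positive)."""
    def safe_div(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a denominator is empty
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,  # sensitivity is another name for recall
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical counts, for illustration only
example = screening_metrics(tp=30, fp=10, tn=50, fn=10)
print(example)
```

In screening, a false negative is a relevant study that gets dropped, so recall/sensitivity is usually the number to look at first.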

**IMPORTANT:** All inference is local via Ollama - no data sent to external APIs.

In [1]:
# Install required packages for local LLM inference
%pip install -q ollama pandas scikit-learn tqdm

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Setup and load data
import os
from pathlib import Path
import pandas as pd
import ollama
from tqdm.notebook import tqdm
from datetime import datetime
import time

notebook_dir = Path.cwd()
project_root = notebook_dir if (notebook_dir / "Data").exists() else notebook_dir.parent
DATA_DIR = project_root / "Data"
RESULTS_DIR = DATA_DIR / "results"
RESULTS_DIR.mkdir(parents=True, exist_ok=True)  # also create Data/ if it is missing

GROUND_TRUTH_CSV = DATA_DIR / "ground_truth_validation_dataset.csv"

# Load validation set (all Cochrane groups, filterable by cochrane_group column)
ground_truth = pd.read_csv(GROUND_TRUTH_CSV)
print(f"Loaded validation set: {len(ground_truth):,} examples")
print(f"\nLabel distribution:")
print(f"  Included: {(ground_truth['label'] == 1).sum():,}")
print(f"  Excluded: {(ground_truth['label'] == 0).sum():,}")
print(f"\nUnique reviews: {ground_truth['review_doi'].nunique():,}")
print(f"\nCochrane groups available (filter with cochrane_group column):")
print(ground_truth['cochrane_group'].value_counts().to_string())

Loaded validation set: 12,778 examples (Public Health)

Label distribution:
  Included: 3,704
  Excluded: 9,074

Unique reviews: 75

Note: Full dataset (~360K) preserved in ground_truth_all_categories.csv


In [16]:
# Check available Ollama models
try:
    models = ollama.list()
    print("Available local models:")
    if hasattr(models, 'models'):
        for model in models.models:
            name = model.model if hasattr(model, 'model') else str(model)
            print(f"  - {name}")
    elif isinstance(models, dict) and 'models' in models:
        for model in models['models']:
            name = model.get('name', model.get('model', str(model)))
            print(f"  - {name}")
    else:
        print(f"  Models: {models}")
except Exception as e:
    print(f"Error connecting to Ollama: {e}")
    print("Make sure Ollama is running: ollama serve")

Available local models:
  - cniongolo/biomistral:latest
  - koesn/llama3-openbiollm-8b:latest
  - mistral:latest
  - llama3.2:latest


In [4]:
# =============================================================================
# Prompt Templates - Include Review Context!
# =============================================================================

ZERO_SHOT_PROMPT = """You are a systematic review screening assistant. Your task is to determine whether a candidate paper should be INCLUDED or EXCLUDED from a specific Cochrane systematic review.

=== COCHRANE REVIEW ===
Title: {review_title}

Abstract/Objective: {review_abstract}

=== CANDIDATE PAPER ===
Title: {paper_title}

Abstract: {paper_abstract}

=== TASK ===
Based on the review's objectives and inclusion criteria, should this paper be INCLUDED or EXCLUDED?

Respond with only: INCLUDE or EXCLUDE"""


COT_PROMPT = """You are a systematic review screening assistant. Your task is to determine whether a candidate paper should be INCLUDED or EXCLUDED from a specific Cochrane systematic review.

=== COCHRANE REVIEW ===
Title: {review_title}

Abstract/Objective: {review_abstract}

=== CANDIDATE PAPER ===
Title: {paper_title}

Abstract: {paper_abstract}

=== TASK ===
Think step by step:
1. What is the review looking for? (population, intervention, outcomes)
2. What does the candidate paper study?
3. Does the paper match the review's criteria?

After your reasoning, give your final answer on a new line as: DECISION: INCLUDE or DECISION: EXCLUDE"""


def create_prompt(row: pd.Series, use_cot: bool = False) -> str:
    """Create prompt with review context and paper abstract."""
    template = COT_PROMPT if use_cot else ZERO_SHOT_PROMPT
    return template.format(
        review_title=str(row['review_title'])[:500],
        review_abstract=str(row['review_abstract'])[:2000],
        paper_title=str(row['paper_title'])[:300],
        paper_abstract=str(row['paper_abstract'])[:2000]
    )

# Preview a prompt
sample = ground_truth.iloc[0]
print("=" * 60)
print("SAMPLE ZERO-SHOT PROMPT:")
print("=" * 60)
print(create_prompt(sample, use_cot=False)[:1500] + "...")

SAMPLE ZERO-SHOT PROMPT:
You are a systematic review screening assistant. Your task is to determine whether a candidate paper should be INCLUDED or EXCLUDED from a specific Cochrane systematic review.

=== COCHRANE REVIEW ===
Title: Acupuncture for smoking cessation.

Abstract/Objective: Acupuncture is promoted as a treatment for smoking cessation, and is believed to reduce withdrawal symptoms. The objective of this review is to determine the effectiveness of acupuncture in smoking cessation in comparison with: a) sham acupuncture b) other interventions c) no intervention. We searched the Cochrane Tobacco Addiction Group trials register, Medline, PsycLit, Dissertation Abstracts, Health Planning and Administration, Social SciSearch, Smoking & Health, Embase, Biological Abstracts and DRUG. Randomised trials comparing a form of acupuncture with either sham acupuncture, another intervention or no intervention for smoking cessation. We extracted data in duplicate on the type of subjects, th
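
One caveat about `create_prompt`: pandas reads empty CSV cells as `NaN`, and `str(NaN)` produces the literal text `nan`, which would then appear in the prompt. Whether this dataset contains missing titles or abstracts is not shown here, so the following is only a sketch of a defensive field cleaner. `clean_field` is a hypothetical helper, not part of the notebook.

```python
import math


def clean_field(value, max_chars: int) -> str:
    """Return a trimmed string, mapping missing values to an empty string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""  # avoid sending the literal text "nan" to the model
    text = " ".join(str(value).split())  # collapse runs of whitespace
    return text[:max_chars]


print(repr(clean_field(float("nan"), 2000)))
print(repr(clean_field("  Acupuncture   for smoking cessation. ", 300)))
```

Inside `create_prompt`, a call such as `clean_field(row['paper_abstract'], 2000)` could stand in for `str(row['paper_abstract'])[:2000]`.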

In [5]:
# =============================================================================
# Evaluation Functions - Save full reasoning
# =============================================================================
import re

def extract_decision(response: str) -> int:
    """Extract INCLUDE (1) or EXCLUDE (0) from LLM response."""
    response_upper = response.upper()
    
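    # Prefer an explicit "DECISION: ..." line (CoT). Take the LAST match so that
    # an echoed copy of the instruction earlier in the response does not win,
    # and tolerate markdown emphasis such as "DECISION: **INCLUDE**".
    decision_matches = re.findall(r'DECISION:\s*\**\s*(INCLUDE|EXCLUDE)', response_upper)
    if decision_matches:
        return 1 if decision_matches[-1] == 'INCLUDE' else 0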
    
    # Fall back to last occurrence
    include_pos = response_upper.rfind('INCLUDE')
    exclude_pos = response_upper.rfind('EXCLUDE')
    
    if include_pos > exclude_pos:
        return 1
    elif exclude_pos > include_pos:
        return 0
    
    return -1  # Could not determine


def run_evaluation(model_name: str, data: pd.DataFrame, use_cot: bool = False) -> pd.DataFrame:
    """Run evaluation and save full LLM reasoning."""
    results = []
    prompt_type = 'cot' if use_cot else 'zero_shot'
    
    for idx, row in tqdm(data.iterrows(), total=len(data), desc=f"{model_name} ({prompt_type})"):
        prompt = create_prompt(row, use_cot=use_cot)
        
        try:
            start = time.time()
            response = ollama.generate(model=model_name, prompt=prompt)
            elapsed = time.time() - start
            response_text = response.get('response', '')
            prediction = extract_decision(response_text)
        except Exception as e:
            response_text = f"ERROR: {e}"
            prediction = -1
            elapsed = 0
        
        results.append({
            'review_doi': row['review_doi'],
            'study_id': row['study_id'],
            'label': row['label'],
            'prediction': prediction,
            'correct': prediction == row['label'],
            'reasoning': response_text,  # Full LLM reasoning saved!
            'response_time_sec': round(elapsed, 2)
        })
    
    return pd.DataFrame(results)


print("Evaluation functions defined.")

Evaluation functions defined.


In [9]:
# =============================================================================
# Run Llama 3.2 Evaluation - Two prompt types
# =============================================================================
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

MODEL_NAME = "llama3.2"

# Full evaluation on entire dataset
SAMPLE_SIZE = None  # None = full dataset
eval_data = ground_truth.sample(n=SAMPLE_SIZE, random_state=42) if SAMPLE_SIZE else ground_truth

print(f"Evaluating on {len(eval_data):,} samples")
print(f"  Included: {(eval_data['label'] == 1).sum():,}")
print(f"  Excluded: {(eval_data['label'] == 0).sum():,}")

all_results = {}

Evaluating on 360,743 samples
  Included: 124,119
  Excluded: 236,624
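
Before launching a run of this size, it may help to estimate the wall-clock cost, since every row triggers one model call per model and prompt type. The sketch below does that arithmetic. `estimate_hours` is a hypothetical helper, and the per-call latency is an assumed placeholder, not a measurement from this notebook.

```python
def estimate_hours(n_samples: int, seconds_per_call: float,
                   n_models: int = 1, n_prompt_types: int = 1) -> float:
    """Estimate sequential wall-clock hours for a full evaluation sweep."""
    total_calls = n_samples * n_models * n_prompt_types
    return total_calls * seconds_per_call / 3600


ASSUMED_SECONDS_PER_CALL = 1.0  # hypothetical latency; measure on your own hardware

hours = estimate_hours(360_743, ASSUMED_SECONDS_PER_CALL,
                       n_models=10, n_prompt_types=2)
print(f"Estimated wall-clock time: {hours:,.0f} hours")
```

If the estimate is impractical, setting `SAMPLE_SIZE` to a few hundred rows gives a pilot run through the same code path.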


In [11]:
# Run Zero-Shot evaluation
print("\n" + "=" * 60)
print("ZERO-SHOT EVALUATION")
print("=" * 60)

results_zero = run_evaluation(MODEL_NAME, eval_data, use_cot=False)
all_results['llama3.2_zero_shot'] = results_zero

# Save results with reasoning
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_llama3.2_zero_shot_{timestamp}.csv"
results_zero.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

# Quick metrics
valid = results_zero[results_zero['prediction'] != -1]
print(f"\nZero-shot results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")


ZERO-SHOT EVALUATION


llama3.2 (zero_shot):   0%|          | 0/360743 [00:00<?, ?it/s]

KeyboardInterrupt: 
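
Because `run_evaluation` only returns its results once the loop finishes, an interrupt like the one above discards every prediction made so far. One possible mitigation is to append each row to a checkpoint file as it is produced and to skip rows that are already recorded when the run restarts. The helpers below are a sketch under that assumption. `append_result` and `load_done_keys` are hypothetical names and are not part of the notebook.

```python
import csv
from pathlib import Path


def append_result(path: Path, row: dict, fieldnames: list[str]) -> None:
    """Append one result row, writing the header first if the file is new."""
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def load_done_keys(path: Path) -> set[tuple[str, str]]:
    """Return the (review_doi, study_id) pairs already in the checkpoint.

    Values come back from the CSV as strings, so compare against
    str(row['review_doi']) and str(row['study_id']) in the loop.
    """
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        return {(r["review_doi"], r["study_id"]) for r in csv.DictReader(handle)}
```

In the evaluation loop, a row whose key is already in `load_done_keys(checkpoint_path)` would be skipped, and each new result dict would go through `append_result` right after the model call.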

In [8]:
# Run Chain-of-Thought evaluation
print("\n" + "=" * 60)
print("CHAIN-OF-THOUGHT EVALUATION")
print("=" * 60)

results_cot = run_evaluation(MODEL_NAME, eval_data, use_cot=True)
all_results['llama3.2_cot'] = results_cot

# Save results with reasoning
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_llama3.2_cot_{timestamp}.csv"
results_cot.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

# Quick metrics
valid = results_cot[results_cot['prediction'] != -1]
print(f"\nChain-of-thought results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")


CHAIN-OF-THOUGHT EVALUATION


llama3.2 (cot):   0%|          | 0/500 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [None]:
# =============================================================================
# Run Mistral Evaluation - Two prompt types
# =============================================================================

MODEL_NAME = "mistral"
print(f"\n{'='*60}")
print(f"EVALUATING: {MODEL_NAME}")
print(f"{'='*60}")

# Run Zero-Shot evaluation
print("\n" + "=" * 60)
print("ZERO-SHOT EVALUATION")
print("=" * 60)

results_zero = run_evaluation(MODEL_NAME, eval_data, use_cot=False)
all_results['mistral_zero_shot'] = results_zero

# Save results with reasoning
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_mistral_zero_shot_{timestamp}.csv"
results_zero.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

# Quick metrics
valid = results_zero[results_zero['prediction'] != -1]
print(f"\nZero-shot results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

# Run Chain-of-Thought evaluation
print("\n" + "=" * 60)
print("CHAIN-OF-THOUGHT EVALUATION")
print("=" * 60)

results_cot = run_evaluation(MODEL_NAME, eval_data, use_cot=True)
all_results['mistral_cot'] = results_cot

# Save results with reasoning
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_mistral_cot_{timestamp}.csv"
results_cot.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

# Quick metrics
valid = results_cot[results_cot['prediction'] != -1]
print(f"\nChain-of-thought results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

In [None]:
# =============================================================================
# Run OpenBioLLM-8B Evaluation - Two prompt types
# =============================================================================
# State-of-the-art biomedical LLM based on Llama-3, fine-tuned on medical data
# Outperforms GPT-3.5 on medical benchmarks (72.5% avg accuracy)

MODEL_NAME = "koesn/llama3-openbiollm-8b"
print(f"\n{'='*60}")
print(f"EVALUATING: {MODEL_NAME}")
print(f"{'='*60}")

# Run Zero-Shot evaluation
print("\n" + "=" * 60)
print("ZERO-SHOT EVALUATION")
print("=" * 60)

results_zero = run_evaluation(MODEL_NAME, eval_data, use_cot=False)
all_results['openbiollm_zero_shot'] = results_zero

# Save results with reasoning
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_openbiollm_zero_shot_{timestamp}.csv"
results_zero.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

# Quick metrics
valid = results_zero[results_zero['prediction'] != -1]
print(f"\nZero-shot results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

# Run Chain-of-Thought evaluation
print("\n" + "=" * 60)
print("CHAIN-OF-THOUGHT EVALUATION")
print("=" * 60)

results_cot = run_evaluation(MODEL_NAME, eval_data, use_cot=True)
all_results['openbiollm_cot'] = results_cot

# Save results with reasoning
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_openbiollm_cot_{timestamp}.csv"
results_cot.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

# Quick metrics
valid = results_cot[results_cot['prediction'] != -1]
print(f"\nChain-of-thought results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

In [None]:
# =============================================================================
# Run BioMistral Evaluation - Two prompt types
# =============================================================================
# Mistral 7B fine-tuned on PubMed Central biomedical literature
# Designed specifically for biomedical NLP tasks

MODEL_NAME = "cniongolo/biomistral"
print(f"\n{'='*60}")
print(f"EVALUATING: {MODEL_NAME}")
print(f"{'='*60}")

# Run Zero-Shot evaluation
print("\n" + "=" * 60)
print("ZERO-SHOT EVALUATION")
print("=" * 60)

results_zero = run_evaluation(MODEL_NAME, eval_data, use_cot=False)
all_results['biomistral_zero_shot'] = results_zero

# Save results with reasoning
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_biomistral_zero_shot_{timestamp}.csv"
results_zero.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

# Quick metrics
valid = results_zero[results_zero['prediction'] != -1]
print(f"\nZero-shot results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

# Run Chain-of-Thought evaluation
print("\n" + "=" * 60)
print("CHAIN-OF-THOUGHT EVALUATION")
print("=" * 60)

results_cot = run_evaluation(MODEL_NAME, eval_data, use_cot=True)
all_results['biomistral_cot'] = results_cot

# Save results with reasoning
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_biomistral_cot_{timestamp}.csv"
results_cot.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

# Quick metrics
valid = results_cot[results_cot['prediction'] != -1]
print(f"\nChain-of-thought results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

In [None]:
# =============================================================================
# Run Llama 3.1 8B Evaluation - Two prompt types
# =============================================================================
# Stronger than Llama 3.2, excellent instruction-following

MODEL_NAME = "llama3.1:8b"
print(f"\n{'='*60}")
print(f"EVALUATING: {MODEL_NAME}")
print(f"{'='*60}")

# Run Zero-Shot evaluation
print("\n" + "=" * 60)
print("ZERO-SHOT EVALUATION")
print("=" * 60)

results_zero = run_evaluation(MODEL_NAME, eval_data, use_cot=False)
all_results['llama3.1_8b_zero_shot'] = results_zero

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_llama3.1_8b_zero_shot_{timestamp}.csv"
results_zero.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

valid = results_zero[results_zero['prediction'] != -1]
print(f"\nZero-shot results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

# Run Chain-of-Thought evaluation
print("\n" + "=" * 60)
print("CHAIN-OF-THOUGHT EVALUATION")
print("=" * 60)

results_cot = run_evaluation(MODEL_NAME, eval_data, use_cot=True)
all_results['llama3.1_8b_cot'] = results_cot

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_llama3.1_8b_cot_{timestamp}.csv"
results_cot.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

valid = results_cot[results_cot['prediction'] != -1]
print(f"\nChain-of-thought results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

In [None]:
# =============================================================================
# Run Qwen 2.5 7B Evaluation - Two prompt types
# =============================================================================
# Top benchmarks, rivals GPT-3.5, strong reasoning

MODEL_NAME = "qwen2.5:7b"
print(f"\n{'='*60}")
print(f"EVALUATING: {MODEL_NAME}")
print(f"{'='*60}")

# Run Zero-Shot evaluation
print("\n" + "=" * 60)
print("ZERO-SHOT EVALUATION")
print("=" * 60)

results_zero = run_evaluation(MODEL_NAME, eval_data, use_cot=False)
all_results['qwen2.5_7b_zero_shot'] = results_zero

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_qwen2.5_7b_zero_shot_{timestamp}.csv"
results_zero.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

valid = results_zero[results_zero['prediction'] != -1]
print(f"\nZero-shot results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

# Run Chain-of-Thought evaluation
print("\n" + "=" * 60)
print("CHAIN-OF-THOUGHT EVALUATION")
print("=" * 60)

results_cot = run_evaluation(MODEL_NAME, eval_data, use_cot=True)
all_results['qwen2.5_7b_cot'] = results_cot

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_qwen2.5_7b_cot_{timestamp}.csv"
results_cot.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

valid = results_cot[results_cot['prediction'] != -1]
print(f"\nChain-of-thought results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

In [None]:
# =============================================================================
# Run Gemma 2 9B Evaluation - Two prompt types
# =============================================================================
# Google's latest, excellent for classification tasks

MODEL_NAME = "gemma2:9b"
print(f"\n{'='*60}")
print(f"EVALUATING: {MODEL_NAME}")
print(f"{'='*60}")

# Run Zero-Shot evaluation
print("\n" + "=" * 60)
print("ZERO-SHOT EVALUATION")
print("=" * 60)

results_zero = run_evaluation(MODEL_NAME, eval_data, use_cot=False)
all_results['gemma2_9b_zero_shot'] = results_zero

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_gemma2_9b_zero_shot_{timestamp}.csv"
results_zero.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

valid = results_zero[results_zero['prediction'] != -1]
print(f"\nZero-shot results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

# Run Chain-of-Thought evaluation
print("\n" + "=" * 60)
print("CHAIN-OF-THOUGHT EVALUATION")
print("=" * 60)

results_cot = run_evaluation(MODEL_NAME, eval_data, use_cot=True)
all_results['gemma2_9b_cot'] = results_cot

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_gemma2_9b_cot_{timestamp}.csv"
results_cot.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

valid = results_cot[results_cot['prediction'] != -1]
print(f"\nChain-of-thought results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

In [None]:
# =============================================================================
# Run Phi-3 Medium 14B Evaluation - Two prompt types
# =============================================================================
# Microsoft's efficient model, punches above its weight

MODEL_NAME = "phi3:medium"
print(f"\n{'='*60}")
print(f"EVALUATING: {MODEL_NAME}")
print(f"{'='*60}")

# Run Zero-Shot evaluation
print("\n" + "=" * 60)
print("ZERO-SHOT EVALUATION")
print("=" * 60)

results_zero = run_evaluation(MODEL_NAME, eval_data, use_cot=False)
all_results['phi3_medium_zero_shot'] = results_zero

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_phi3_medium_zero_shot_{timestamp}.csv"
results_zero.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

valid = results_zero[results_zero['prediction'] != -1]
print(f"\nZero-shot results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

# Run Chain-of-Thought evaluation
print("\n" + "=" * 60)
print("CHAIN-OF-THOUGHT EVALUATION")
print("=" * 60)

results_cot = run_evaluation(MODEL_NAME, eval_data, use_cot=True)
all_results['phi3_medium_cot'] = results_cot

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_phi3_medium_cot_{timestamp}.csv"
results_cot.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

valid = results_cot[results_cot['prediction'] != -1]
print(f"\nChain-of-thought results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

In [None]:
# =============================================================================
# Run Meditron 7B Evaluation - Two prompt types
# =============================================================================
# Fine-tuned on medical guidelines & PubMed (biomedical specialized)

MODEL_NAME = "meditron:7b"
print(f"\n{'='*60}")
print(f"EVALUATING: {MODEL_NAME}")
print(f"{'='*60}")

# Run Zero-Shot evaluation
print("\n" + "=" * 60)
print("ZERO-SHOT EVALUATION")
print("=" * 60)

results_zero = run_evaluation(MODEL_NAME, eval_data, use_cot=False)
all_results['meditron_7b_zero_shot'] = results_zero

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_meditron_7b_zero_shot_{timestamp}.csv"
results_zero.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

valid = results_zero[results_zero['prediction'] != -1]
print(f"\nZero-shot results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

# Run Chain-of-Thought evaluation
print("\n" + "=" * 60)
print("CHAIN-OF-THOUGHT EVALUATION")
print("=" * 60)

results_cot = run_evaluation(MODEL_NAME, eval_data, use_cot=True)
all_results['meditron_7b_cot'] = results_cot

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_meditron_7b_cot_{timestamp}.csv"
results_cot.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

valid = results_cot[results_cot['prediction'] != -1]
print(f"\nChain-of-thought results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

In [None]:
# =============================================================================
# Run Mistral Nemo 12B Evaluation - Two prompt types
# =============================================================================
# Newer architecture, strong reasoning capabilities

MODEL_NAME = "mistral-nemo:12b"
print(f"\n{'='*60}")
print(f"EVALUATING: {MODEL_NAME}")
print(f"{'='*60}")

# Run Zero-Shot evaluation
print("\n" + "=" * 60)
print("ZERO-SHOT EVALUATION")
print("=" * 60)

results_zero = run_evaluation(MODEL_NAME, eval_data, use_cot=False)
all_results['mistral_nemo_12b_zero_shot'] = results_zero

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_mistral_nemo_12b_zero_shot_{timestamp}.csv"
results_zero.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

valid = results_zero[results_zero['prediction'] != -1]
print(f"\nZero-shot results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")

# Run Chain-of-Thought evaluation
print("\n" + "=" * 60)
print("CHAIN-OF-THOUGHT EVALUATION")
print("=" * 60)

results_cot = run_evaluation(MODEL_NAME, eval_data, use_cot=True)
all_results['mistral_nemo_12b_cot'] = results_cot

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = RESULTS_DIR / f"eval_mistral_nemo_12b_cot_{timestamp}.csv"
results_cot.to_csv(output_file, index=False)
print(f"\n✓ Saved to {output_file.name}")

valid = results_cot[results_cot['prediction'] != -1]
print(f"\nChain-of-thought results ({len(valid)} valid predictions):")
print(f"  Accuracy: {accuracy_score(valid['label'], valid['prediction']):.3f}")
print(f"  Precision: {precision_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  Recall: {recall_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
print(f"  F1: {f1_score(valid['label'], valid['prediction'], zero_division=0):.3f}")
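
The per-model cells above repeat the same steps with only the model name and result key changing. A loop over a run plan is one way to cut that duplication. The sketch below builds only the plan. `slugify`, `build_run_plan`, and `PROMPT_TYPES` are hypothetical names, and the derived keys may differ from the hand-written ones used above (for example, `koesn/llama3-openbiollm-8b` becomes `llama3_openbiollm_8b` rather than `openbiollm`).

```python
import re

PROMPT_TYPES = {"zero_shot": False, "cot": True}  # label -> use_cot flag


def slugify(model_name: str) -> str:
    """Turn an Ollama model name into a filename-safe key."""
    short = model_name.split("/")[-1]  # drop the publisher prefix, if any
    return re.sub(r"[^A-Za-z0-9.]+", "_", short).strip("_")


def build_run_plan(model_names: list[str]) -> list[dict]:
    """List every (model, prompt type) combination with a result key."""
    plan = []
    for model_name in model_names:
        for label, use_cot in PROMPT_TYPES.items():
            plan.append({
                "model": model_name,
                "use_cot": use_cot,
                "key": f"{slugify(model_name)}_{label}",
            })
    return plan


plan = build_run_plan(["llama3.2", "mistral-nemo:12b"])
for run in plan:
    print(run["key"], run["model"], run["use_cot"])
```

Each entry could then drive one call of the form `run_evaluation(run["model"], eval_data, use_cot=run["use_cot"])`, with the result stored under `all_results[run["key"]]`.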

In [None]:
# =============================================================================
# Compare Results
# =============================================================================

comparison_rows = []

for run_name, results in all_results.items():
    valid = results[results['prediction'] != -1]
    if len(valid) == 0:
        continue
    
    # Compute metrics
    tn = ((valid['label'] == 0) & (valid['prediction'] == 0)).sum()
    tp = ((valid['label'] == 1) & (valid['prediction'] == 1)).sum()
    fn = ((valid['label'] == 1) & (valid['prediction'] == 0)).sum()
    fp = ((valid['label'] == 0) & (valid['prediction'] == 1)).sum()
    
    comparison_rows.append({
        'model': run_name,
        'n_samples': len(valid),
        'accuracy': accuracy_score(valid['label'], valid['prediction']),
        'precision': precision_score(valid['label'], valid['prediction'], zero_division=0),
        'recall': recall_score(valid['label'], valid['prediction'], zero_division=0),
        'f1': f1_score(valid['label'], valid['prediction'], zero_division=0),
        'sensitivity': tp / (tp + fn) if (tp + fn) > 0 else 0,  # Same as recall
        'specificity': tn / (tn + fp) if (tn + fp) > 0 else 0,
        'avg_response_time': results['response_time_sec'].mean()
    })

comparison = pd.DataFrame(comparison_rows)
print("\n" + "=" * 80)
print("COMPARISON: ZERO-SHOT vs CHAIN-OF-THOUGHT")
print("=" * 80)
print(comparison.to_string(index=False))

# Save comparison
comparison.to_csv(RESULTS_DIR / "model_comparison.csv", index=False)
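
The comparison above drops rows whose prediction is `-1` before computing metrics, so a model that often fails to produce a parseable answer can still look strong on the remainder. Reporting the unparsed share next to the metrics makes that visible. The following is a sketch. `coverage_report` is a hypothetical helper, and the example list is made up for illustration.

```python
def coverage_report(predictions: list[int]) -> dict:
    """Summarize how many predictions were parseable (-1 marks unparsed)."""
    total = len(predictions)
    unparsed = sum(1 for p in predictions if p == -1)
    return {
        "total": total,
        "valid": total - unparsed,
        "unparsed": unparsed,
        "unparsed_rate": unparsed / total if total else 0.0,
    }


# Made-up predictions, for illustration only
report = coverage_report([1, 0, -1, 1, -1, 0, 0, 1])
print(report)
```

With the notebook's data frames, `coverage_report(results['prediction'].tolist())` would supply the same numbers for each run.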

In [None]:
# =============================================================================
# View Sample Reasoning
# =============================================================================

print("=" * 60)
print("SAMPLE LLM REASONING (Chain-of-Thought)")
print("=" * 60)

# Show a correct and incorrect example
cot_results = all_results.get('llama3.2_cot')
if cot_results is not None:
    is_correct = cot_results['correct']
    correct = cot_results[is_correct].iloc[0] if is_correct.any() else None
    incorrect = cot_results[~is_correct].iloc[0] if (~is_correct).any() else None
    
    if correct is not None:
        print("\n✓ CORRECT PREDICTION:")
        print(f"  Label: {correct['label']} | Prediction: {correct['prediction']}")
        print(f"  Reasoning:\n{correct['reasoning'][:800]}...")
    
    if incorrect is not None:
        print("\n✗ INCORRECT PREDICTION:")
        print(f"  Label: {incorrect['label']} | Prediction: {incorrect['prediction']}")
        print(f"  Reasoning:\n{incorrect['reasoning'][:800]}...")

In [None]:
# =============================================================================
# Summary
# =============================================================================

print("\n" + "=" * 60)
print("EVALUATION COMPLETE")
print("=" * 60)
print(f"Samples evaluated: {len(eval_data):,}")
print(f"Prompt types: Zero-shot, Chain-of-thought")
print(f"Total experiments: 10 models × 2 prompts = 20 runs")

print(f"\nGeneral-Purpose Models (7):")
print("  - Llama 3.2 (3B)")
print("  - Llama 3.1 8B")
print("  - Mistral 7B")
print("  - Mistral Nemo 12B")
print("  - Qwen 2.5 7B")
print("  - Gemma 2 9B")
print("  - Phi-3 Medium 14B")

print(f"\nBiomedical-Specialized Models (3):")
print("  - OpenBioLLM-8B")
print("  - BioMistral 7B")
print("  - Meditron 7B")

print(f"\nResults saved to: {RESULTS_DIR}")
print("  - eval_<model>_zero_shot_*.csv")
print("  - eval_<model>_cot_*.csv")
print("  - model_comparison.csv")
print("\n✓ All inference was LOCAL via Ollama - no data sent to external APIs.")