# 08 — Agentic Approaches: Summary

This notebook briefly describes the four agentic strategies used in the unified LLM evaluation pipeline (Notebook 07) to improve systematic review screening beyond zero-shot baselines.

All strategies were evaluated on the **full Cochrane Public Health ground-truth dataset** (no sampling).

---
## Baseline: Zero-Shot Prompting

Each model receives the review scope and the candidate paper, then responds with a single word: INCLUDE or EXCLUDE. No examples, no reasoning chain.

- **Best model:** llama3.1:8b — F1 = 0.762, Precision = 0.828, Recall = 0.707
- **Problem:** Misses 29.3% of relevant papers (false-negative rate too high for systematic reviews)
- **Chain-of-thought (CoT):** Tested, but it consistently underperformed zero-shot prompting and was dropped
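
The metrics quoted throughout this summary are related by the standard F1 formula (the harmonic mean of precision and recall). The sketch below recomputes the baseline figures as a sanity check. The function names are illustrative and not taken from the pipeline. Because the inputs are already rounded to three decimals, the recomputed F1 can differ from the reported value in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def miss_rate(recall: float) -> float:
    """Share of relevant papers that were wrongly excluded (1 - recall)."""
    return 1.0 - recall


# Baseline figures reported above for llama3.1:8b.
baseline_f1 = f1_score(precision=0.828, recall=0.707)
baseline_miss = miss_rate(recall=0.707)
print(f"F1 = {baseline_f1:.3f}, miss rate = {baseline_miss:.1%}")
```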

---
## Strategy 1: Dynamic Few-Shot Prompting

For each candidate paper, inject **review-specific examples** (2 included + 2 excluded papers from the same review) into the prompt. The model learns the inclusion boundary for that particular review topic.

- **Result:** F1 = 0.746, Precision = 0.636, **Recall = 0.900** (+0.193 vs baseline)
- **Cost:** 1 LLM call per paper (longer prompt)
- **Strength:** Massive recall boost — the model stops being overly conservative
- **Weakness:** Precision drops because the model over-includes borderline papers
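
A minimal sketch of how such a prompt could be assembled is shown below. The `Paper` class, the function name, and the prompt wording are all assumptions made for illustration; the actual prompt used in Notebook 07 may differ. The sketch drops the candidate from the example pools so that its own label never leaks into the prompt.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Paper:
    title: str
    abstract: str


def build_few_shot_prompt(scope: str, candidate: Paper,
                          included: List[Paper], excluded: List[Paper],
                          k: int = 2, seed: int = 0) -> str:
    """Build a prompt with k included and k excluded examples from the same review."""
    rng = random.Random(seed)
    # Never show the candidate itself as a labelled example (label leakage).
    inc_pool = [p for p in included if p != candidate]
    exc_pool = [p for p in excluded if p != candidate]
    shots = [(p, "INCLUDE") for p in rng.sample(inc_pool, min(k, len(inc_pool)))]
    shots += [(p, "EXCLUDE") for p in rng.sample(exc_pool, min(k, len(exc_pool)))]
    rng.shuffle(shots)  # avoid a fixed INCLUDE-then-EXCLUDE ordering

    lines = [f"Review scope: {scope}", "", "Examples from this review:"]
    for paper, label in shots:
        lines += [f"Title: {paper.title}",
                  f"Abstract: {paper.abstract}",
                  f"Decision: {label}", ""]
    lines += ["Now screen this paper. Answer with one word: INCLUDE or EXCLUDE.",
              f"Title: {candidate.title}",
              f"Abstract: {candidate.abstract}",
              "Decision:"]
    return "\n".join(lines)
```

Shuffling the examples is a design choice of this sketch: it keeps the model from keying on the position of the labels rather than on the content of the examples.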

---
## Strategy 2: Smart Ensemble (OR / Majority / Weighted Vote)

Combine Phase 1 predictions from all 7 models using voting rules — **zero extra LLM calls**.

| Rule | Logic | F1 | Prec | Rec |
|------|-------|-----|------|-----|
| **OR-rule** | Any model says INCLUDE → INCLUDE | 0.761 | 0.703 | 0.829 |
| **Majority** | >50% of models agree | 0.731 | 0.867 | 0.631 |
| **Weighted** | Vote weighted by each model's F1 | 0.731 | 0.867 | 0.631 |

- **Strength:** Free — reuses existing predictions
- **Weakness:** Models are highly correlated (70–78% error overlap), limiting diversity gains
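
The three voting rules can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the model names and the weights are placeholders, and the weighted rule is assumed to use a "more than half of the total weight" cut-off.

```python
from typing import Mapping


def or_rule(votes: Mapping[str, bool]) -> bool:
    """INCLUDE if any model voted INCLUDE."""
    return any(votes.values())


def majority_rule(votes: Mapping[str, bool]) -> bool:
    """INCLUDE only if strictly more than half of the models voted INCLUDE."""
    return sum(votes.values()) > len(votes) / 2


def weighted_rule(votes: Mapping[str, bool], weights: Mapping[str, float]) -> bool:
    """INCLUDE if the INCLUDE votes carry more than half of the total weight."""
    total = sum(weights[name] for name in votes)
    include_weight = sum(weights[name] for name, vote in votes.items() if vote)
    return include_weight > total / 2


# Illustrative votes for one paper from seven models (True = INCLUDE).
votes = {f"model_{i}": i < 2 for i in range(7)}  # two of seven say INCLUDE
weights = {name: 0.7 for name in votes}          # placeholder F1 weights

print(or_rule(votes), majority_rule(votes), weighted_rule(votes, weights))
```

When the per-model weights are close to one another, the weighted vote rarely departs from the plain majority, which would be consistent with the identical Majority and Weighted rows in the table.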

---
## Strategy 3: Calibrated Recall Challenge (Best Overall)

The most **targeted** strategy: it revisits only the exclusions that other models contest. For each paper the best model excluded, count how many other models voted INCLUDE. If at least two did, send the paper to a challenger model with a **recall-biased reconsideration prompt** that asks it to look for *any* reasonable argument for relevance.

**Flow:**
1. Start with best model's predictions (llama3.1:8b)
2. For each EXCLUDE → count how many other models said INCLUDE
3. If ≥2 disagree → challenge with mistral-nemo:12b using recall-biased prompt
4. If challenger says INCLUDE → flip; otherwise keep EXCLUDE

**Key numbers:**
- Only **94 out of ~3,500 excludes** were challenged (2.7%)
- **85 flipped** to INCLUDE (90.4% flip rate)
- **F1 = 0.786** (+0.024), Recall = 0.776 (+0.069), Precision = 0.797 (−0.031)
- Extra compute: **4.2 minutes** (94 LLM calls, versus roughly 4,000 to re-screen every paper)

This was the **best strategy overall** — highest F1 with minimal precision cost and negligible extra compute.
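
The four-step flow described in this section can be sketched as below. This is a hedged illustration: the function name, the data layout (dictionaries keyed by paper ID, `True` meaning INCLUDE), and the `challenger` callable are assumptions. In the real pipeline the challenger would be an LLM call with the recall-biased prompt.

```python
from typing import Callable, Dict, Mapping, Tuple


def calibrated_recall_challenge(
    best: Mapping[str, bool],
    others: Mapping[str, Mapping[str, bool]],
    challenger: Callable[[str], bool],
    min_disagree: int = 2,
) -> Tuple[Dict[str, bool], int, int]:
    """Return (final decisions, number challenged, number flipped)."""
    final = dict(best)
    challenged = flipped = 0
    for paper_id, included in best.items():
        if included:
            continue  # only EXCLUDE decisions are ever reconsidered
        # Count how many other models voted INCLUDE for this paper.
        dissent = sum(1 for preds in others.values() if preds.get(paper_id, False))
        if dissent < min_disagree:
            continue  # not contested enough to justify an extra LLM call
        challenged += 1
        if challenger(paper_id):
            final[paper_id] = True
            flipped += 1
    return final, challenged, flipped


# Toy data: the stub challenger flips only paper "p1".
best = {"p1": False, "p2": False, "p3": True, "p4": False}
others = {
    "m1": {"p1": True, "p2": True, "p4": True},
    "m2": {"p1": True, "p2": False, "p4": True},
    "m3": {"p1": False, "p2": False, "p4": False},
}
final, challenged, flipped = calibrated_recall_challenge(
    best, others, challenger=lambda paper_id: paper_id == "p1"
)
print(final, challenged, flipped)
```

The disagreement gate is what keeps the cost low: the challenger is only called for excludes that at least `min_disagree` other models contest.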

---
## Strategy 4: Few-Shot Debate with Judge

Two screener models each evaluate every paper using the few-shot prompt. If they **agree**, that decision is final. If they **disagree**, a third judge model resolves the conflict using a prompt that favours inclusion.

**Flow:**
1. Screener A (llama3.1:8b) + Screener B (mistral-nemo:12b) both screen with few-shot prompts
2. Agree → take that decision (90.9% of cases)
3. Disagree → Judge (mistral) decides, biased toward INCLUDE

- **Result:** F1 = 0.738, Precision = 0.617, **Recall = 0.917** (highest recall of all strategies)
- **Cost:** 2–3 LLM calls per paper (67 minutes total)
- **Strength:** Highest recall — misses only 8.3% of relevant papers
- **Weakness:** Precision suffers; many false positives from the judge's inclusion bias
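
A minimal sketch of the debate flow follows, with the screeners and the judge passed in as plain callables. These are stand-ins for the LLM calls, and their signatures are assumptions. The helper `expected_calls_per_paper` assumes the judge is invoked only on disagreements.

```python
from typing import Callable, Tuple

Decision = str  # "INCLUDE" or "EXCLUDE"


def debate(
    paper: str,
    screener_a: Callable[[str], Decision],
    screener_b: Callable[[str], Decision],
    judge: Callable[[str, Decision, Decision], Decision],
) -> Tuple[Decision, int]:
    """Return (final decision, number of LLM calls used for this paper)."""
    a = screener_a(paper)
    b = screener_b(paper)
    if a == b:
        return a, 2  # agreement is final, no judge call needed
    return judge(paper, a, b), 3  # the judge resolves the conflict


def expected_calls_per_paper(agreement_rate: float) -> float:
    """Two screener calls, plus one judge call for each disagreement."""
    return 2 + (1 - agreement_rate)


# With the 90.9% agreement rate reported above:
print(f"{expected_calls_per_paper(0.909):.3f} calls per paper on average")
```

Under that assumption the average cost sits close to two calls per paper, at the lower end of the 2–3 range quoted above.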

---
## Automated Strategy Selection (Phase 2 Diagnosis)

The pipeline **automatically selects** which strategies to run based on Phase 1 diagnostics:

| Diagnostic Signal | Threshold | Interpretation | Triggers |
|-------------------|-----------|----------------|----------|
| Precision − Recall gap | > 0.05 | Bottleneck = recall | Calibrated Recall Challenge |
| Cross-model rescue potential | > 25% | False negatives are recoverable | Calibrated Recall Challenge |
| Inter-model disagreement | > 4% | Models are diverse | Few-Shot Debate |
| *(always)* | — | — | Dynamic Few-Shot, Smart Ensemble |

In our run: gap = 0.121 (recall bottleneck), rescue = 41.6%, disagreement = 6.0% → **all 4 strategies were activated**.
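
The selection rules in the table can be sketched as a small function. The function name and signature are hypothetical, and the sketch assumes that either of the first two signals is enough to trigger the Calibrated Recall Challenge; the source does not say whether the pipeline requires one or both.

```python
from typing import List


def select_strategies(precision: float, recall: float,
                      rescue_potential: float, disagreement: float) -> List[str]:
    """Pick strategies from Phase 1 diagnostics (rates given as fractions)."""
    selected = ["Dynamic Few-Shot", "Smart Ensemble"]  # always run
    recall_bottleneck = (precision - recall) > 0.05
    recoverable_fn = rescue_potential > 0.25
    # Assumption: either signal alone activates the recall challenge.
    if recall_bottleneck or recoverable_fn:
        selected.append("Calibrated Recall Challenge")
    if disagreement > 0.04:
        selected.append("Few-Shot Debate")
    return selected


# Diagnostics from the run described above.
print(select_strategies(precision=0.828, recall=0.707,
                        rescue_potential=0.416, disagreement=0.060))
```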

---
## Full Results Comparison

| Approach | F1 | Precision | Recall | Extra Cost |
|----------|-----|-----------|--------|------------|
| Baseline: llama3.1:8b (zero-shot) | 0.762 | 0.828 | 0.707 | — |
| **Calibrated Recall Challenge** | **0.786** | **0.797** | **0.776** | **94 calls (4 min)** |
| Ensemble: OR-rule | 0.761 | 0.703 | 0.829 | 0 calls |
| Dynamic Few-Shot | 0.746 | 0.636 | 0.900 | 4,000 calls (27 min) |
| Few-Shot Debate | 0.738 | 0.617 | 0.917 | 8,000–12,000 calls (67 min) |
| Ensemble: Majority / Weighted | 0.731 | 0.867 | 0.631 | 0 calls |

### Key Takeaways

1. **Calibrated Recall Challenge wins on F1** — best balance of precision and recall with minimal compute
2. **Few-Shot Debate wins on recall** (0.917) — best for minimising missed papers, but expensive and imprecise
3. **Ensembles are free** but limited by high inter-model error correlation
4. **Dynamic Few-Shot** provides the largest recall boost achievable with a single model and one call per paper (+0.193) but gives up 0.192 of precision
5. The **automated diagnosis** correctly identified recall as the bottleneck and activated all relevant strategies without human intervention

---
## Summary

- The **zero-shot baseline** (llama3.1:8b) achieved F1 = 0.762 but missed nearly 30% of relevant papers — unacceptable for systematic reviews
- **Dynamic Few-Shot** injects review-specific examples into the prompt, boosting recall to 0.900 but sacrificing precision
- **Smart Ensembles** combine the predictions already produced by all 7 models during Phase 1, requiring no additional LLM calls; the OR-rule matched baseline F1 while raising recall to 0.829
- **Calibrated Recall Challenge** was the best overall strategy (F1 = 0.786): it re-evaluated only the 94 excludes that at least two other models contested, flipping 85 to INCLUDE, at a cost of about 4 extra minutes
- **Few-Shot Debate** achieved the highest recall (0.917) by combining two screeners with a judge, but at 2–3 LLM calls per paper (67 minutes in total) and with reduced precision
- The pipeline's **automated diagnosis** correctly identified recall as the bottleneck and activated all four strategies without manual intervention
- The key insight is that **targeted, disagreement-aware interventions** (Calibrated Recall Challenge) outperform brute-force approaches (re-running everything with a different prompt) on F1, and at a fraction of the compute