# Introduction: AI + Formal Verification

This notebook introduces the key thesis: **LLMs and formal verification are a perfect match**.

## The Problem

AI-generated code is faster to write but harder to trust. Studies show:
- 40% increase in bug rates with AI coding assistants (FormAI research)
- 51% of GPT-3.5-generated C programs had vulnerabilities
- We're generating code faster than we can review it

## The Kleppmann Thesis

From [Martin Kleppmann's prediction](https://martin.kleppmann.com/2025/12/08/ai-formal-verification.html):

> "Rather than having humans review AI-generated code, I'd much rather have the AI prove to me that the code it has generated is correct."

Three forces are converging:
1. **Cost reduction**: LLMs can draft proofs, dramatically lowering the cost of formal verification
2. **Necessity**: AI-generated code needs *something* to replace human review
3. **Synergy**: Proof checkers reject invalid proofs, so hallucinated steps are caught and the LLM can retry

## Why This Matters for ML Practitioners

If you work with LLMs, you know:
- They hallucinate confidently
- They're great at pattern matching, bad at reasoning
- Output quality is probabilistic, not guaranteed

Formal verification flips this:
- A proof checker is **deterministic**: valid or invalid, no gray area
- When the LLM hallucinates, the checker rejects it
- The LLM can try again with the error message
- If an attempt passes within the retry limit, you get a **verified** solution; if none does, you know nothing was proven
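
To make the "valid or invalid" point concrete, here is a minimal Python sketch of wrapping a command-line verifier. It assumes the verifier signals failure through a non-zero exit code and writes its diagnostics to stdout or stderr; `run_verifier` is a hypothetical helper, not part of this project's code, and the Python one-liner below only stands in for a real verifier invocation.

```python
import subprocess
import sys

def run_verifier(cmd, timeout=60):
    """Run a verifier command and return (ok, output).

    Assumption: exit code 0 means "verified", anything else means "rejected".
    A timeout is treated as a rejection, since nothing was proven.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    # Keep both streams: tools differ in where they print diagnostics.
    return proc.returncode == 0, (proc.stdout + proc.stderr).strip()

# Stand-in for a real verifier call: prints a diagnostic and exits with code 1.
fake_cmd = [sys.executable, "-c",
            "import sys; print('postcondition might not hold'); sys.exit(1)"]
ok, output = run_verifier(fake_cmd)
print(ok, output)
```

The `output` string is what would be fed back to the LLM on a failed attempt.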

## What We'll Demonstrate

This project reproduces key findings from the [FM-ALPACA paper](https://arxiv.org/abs/2501.16207) (ACL'25), which benchmarked LLMs on formal verification across 5 languages.

We'll show the same problem (binary search correctness) verified three ways:

| Tool | What It Checks | Style |
|------|---------------|-------|
| **Dafny** | Pre/postconditions, loop invariants | Annotated imperative code |
| **Lean4** | Mathematical theorems | Proof assistant / tactics |
| **TLA+** | State machine properties | Model checking |
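
To make "pre/postconditions" and "loop invariants" concrete, here is a Python sketch of binary search with those properties written as runtime `assert` statements. It is an illustration only, not one of the project's Dafny, Lean4, or TLA+ artifacts: assertions check just the inputs you actually run, whereas a verifier such as Dafny aims to prove the same properties for every input that satisfies the precondition.

```python
def binary_search(xs, target):
    """Return an index i with xs[i] == target, or -1 if target is absent."""
    # Precondition: the input is sorted in non-decreasing order.
    assert all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1)), "xs must be sorted"

    lo, hi = 0, len(xs)
    while lo < hi:
        # Loop invariant: the bounds are valid and the target, if present,
        # can only be inside xs[lo:hi].
        assert 0 <= lo <= hi <= len(xs)
        assert target not in xs[:lo] and target not in xs[hi:]

        mid = (lo + hi) // 2
        if xs[mid] < target:
            lo = mid + 1
        elif xs[mid] > target:
            hi = mid
        else:
            # Postcondition (found): the returned index holds the target.
            return mid

    # Postcondition (not found): the target occurs nowhere in xs.
    assert target not in xs
    return -1

print(binary_search([1, 3, 5, 7], 5))
print(binary_search([1, 3, 5, 7], 4))
```

The invariant checks here are deliberately expensive (they scan the list); they exist to state the property, not to be efficient.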

## The Verification Loop

The core pattern we demonstrate:

```
┌─────────────────────────────────────────────┐
│  1. LLM generates code/proof                │
│         ↓                                   │
│  2. Verifier checks it                      │
│         ↓                                   │
│  3a. If valid → Done! Proven correct.       │
│  3b. If invalid → Feed error back to LLM    │
│         ↓                                   │
│  4. LLM fixes based on error                │
│         ↓                                   │
│  (repeat until valid or max attempts)       │
└─────────────────────────────────────────────┘
```

This loop is powerful because:
- The verifier provides **precise feedback** (not vague "looks wrong")
- LLMs are good at fixing code given specific error messages
- An accepted result is **mathematically proven** to meet its specification, not just "probably right"
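
The loop above can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the project's implementation: `VerifyResult`, `verify_with_retries`, and the `generate`/`verify` callables are hypothetical names, and the stubs at the bottom stand in for a real LLM call and a real verifier.

```python
from dataclasses import dataclass

@dataclass
class VerifyResult:
    ok: bool
    error: str = ""

def verify_with_retries(generate, verify, max_attempts=3):
    """Generate, verify, and retry with the verifier's error as feedback.

    generate(feedback) -> candidate; feedback is None on the first attempt.
    verify(candidate)  -> VerifyResult.
    Returns (candidate, attempts_used), or (None, max_attempts) if no
    candidate was accepted.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)      # step 1 (and step 4 on retries)
        result = verify(candidate)          # step 2
        if result.ok:
            return candidate, attempt       # step 3a
        feedback = result.error             # step 3b
    return None, max_attempts

# Stubs for illustration: the "LLM" only gets it right after seeing an error.
def fake_llm(feedback):
    return "good proof" if feedback else "bad proof"

def fake_verifier(candidate):
    if candidate == "good proof":
        return VerifyResult(ok=True)
    return VerifyResult(ok=False, error="line 3: invariant not maintained")

print(verify_with_retries(fake_llm, fake_verifier))
```

Returning `None` when the attempt limit is reached keeps the failure case explicit: the caller never mistakes an unverified candidate for a proven one.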

## Setup Check

Let's verify the tools are installed:

In [None]:
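import os
import shutil
import subprocess

def check_tool(name, cmd):
    """Return True if the tool is on PATH and its version command succeeds."""
    if shutil.which(cmd[0]) is None:
        print(f"❌ {name}: '{cmd[0]}' not found on PATH")
        return False
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired) as e:
        print(f"❌ {name}: could not run ({e})")
        return False
    # A non-zero exit code means the tool is present but not working
    if result.returncode != 0:
        print(f"❌ {name}: exited with code {result.returncode}")
        return False
    # Show the first line of the version output
    output = (result.stdout or result.stderr).strip()
    version = output.splitlines()[0] if output else "version unknown"
    print(f"✅ {name}: {version}")
    return True

print("Checking tool installations...\n")

results = [
    check_tool("Dafny", ["dafny", "--version"]),
    check_tool("Lean4", ["lean", "--version"]),
]

# The TLA+ tools ship as a jar, so check for the file rather than a command
tla_jar = os.path.expanduser("~/.tla/tla2tools.jar")
tla_found = os.path.isfile(tla_jar)
if tla_found:
    print(f"✅ TLA+ tools: found at {tla_jar}")
else:
    print(f"❌ TLA+ tools: not found at {tla_jar}")
results.append(tla_found)

if not all(results):
    print("\nSome tools are missing. Run: ./setup.sh")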

In [None]:
# Test the LLM client
import sys
sys.path.insert(0, '..')

from src.llm_client import LLMClient

client = LLMClient()
print(f"✅ LLM Client initialized (model: {client.model})")

## Next Steps

Continue to the demo notebooks:

1. **[02-dafny-demo.ipynb](02-dafny-demo.ipynb)** - LLM generates Dafny annotations
2. **[03-lean4-demo.ipynb](03-lean4-demo.ipynb)** - LLM generates Lean4 proofs
3. **[04-tlaplus-demo.ipynb](04-tlaplus-demo.ipynb)** - LLM generates TLA+ specs
4. **[05-comparison.ipynb](05-comparison.ipynb)** - Side-by-side evaluation