In [1]:
# ============================================================================
# CELL 0: SETUP - Optional LLM Configuration (portable)
# ============================================================================
# This notebook is designed to run in a fully-offline demo mode by default.
#
# To enable any LLM-backed steps, set: STILLWATER_ENABLE_LLM_REAL=1

import os
import sys
from pathlib import Path

sys.path.insert(0, str(Path.cwd()))

try:
    from src.llm_config_manager import setup_llm_client_for_notebook, get_llm_config

    print('=' * 80)
    print('INITIALIZING LLM CONFIGURATION (optional)')
    print('=' * 80)

    llm_config = setup_llm_client_for_notebook()
    print('LLM Provider:', llm_config.get('name'), 'at', llm_config.get('url'))

    config = get_llm_config()
    is_valid, msg = config.validate_setup()
    print('Status:', msg)

    if not is_valid:
        print('Required:', ', '.join(config.get_required_env_vars()))

    print('To switch providers: edit llm_config.yaml and re-run this cell')
    print('=' * 80)
    print()
except Exception as e:
    print('=' * 80)
    print('LLM CONFIGURATION SKIPPED')
    print('=' * 80)
    print('Reason:', e)
    print('Proceeding in offline demo mode.')
    print('=' * 80)
    print()


INITIALIZING LLM CONFIGURATION (optional)
✅ Offline (Demo Mode) is configured
LLM Provider: Offline (Demo Mode) at 
Status: ✅ Offline (Demo Mode) is configured
To switch providers: edit llm_config.yaml and re-run this cell



# OOLONG-Style Aggregation Demo: Counter Bypass Protocol (Executable)

**Date:** 2026-02-17  
**Auth:** 65537  
**Status:** Demo (in-repo tests only)  
**Skill Pack:** `prime-math.md` + `prime-coder.md` + `phuc-forecast.md`

---

## Architecture Overview

```mermaid
flowchart TD
    A["Cell 0: LLM Config\n(optional, offline OK)"] --> B["Cell 3: Optional LLM Solver\n(oolong_solver_real.py)"]
    A --> C["Cell 5: Demo Solver\n(oolong_solver.py)"]
    C --> D["Cell 7: Verification Gate\n(stdout marker checks)"]
    D -->|ALL PASS| E["EXIT_PASS\n4/4 demo suite"]
    D -->|ANY FAIL| F["EXIT_BLOCKED\nInvestigate"]

    classDef phase fill:#0b1b2b,stroke:#9cc3ff,color:#e6f0ff;
    classDef gate fill:#1a3a1a,stroke:#66ff66,color:#e6ffe6;
    classDef fail fill:#3a1a1a,stroke:#ff6666,color:#ffe6e6;
    class A,B,C phase;
    class D,E gate;
    class F fail;
```

---

## The Core Idea: Counter Bypass Protocol

```mermaid
flowchart LR
    Q["Query\n(natural language)"] --> CLS["Classify\n(pattern table)"]
    CLS --> REC["Parse Records\n(structured data)"]
    REC --> CTR["Counter()\n(exact aggregation)"]
    CTR --> ANS["Answer\n(deterministic)"]

    classDef default fill:#0b1b2b,stroke:#9cc3ff,color:#e6f0ff;
```

**Key insight:** For counting/aggregation tasks, use `Counter()` (exact, deterministic) instead of relying on LLM attention (approximate, probabilistic).
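
As a minimal illustration of that insight, the sketch below answers three aggregation questions (an exact count, a top-1, and a uniqueness count) with `collections.Counter`. The records and their `user` and `label` fields are made up for this example; they are not taken from the repo's test cases.

```python
from collections import Counter

# Hypothetical records; the field names are assumptions for illustration only.
records = [
    {"user": "u1", "label": "spam"},
    {"user": "u2", "label": "ham"},
    {"user": "u1", "label": "spam"},
    {"user": "u3", "label": "spam"},
]

label_counts = Counter(r["label"] for r in records)

# Exact count for one value (a missing key counts as 0 rather than raising).
spam_total = label_counts["spam"]

# Top-k: most_common(k) returns (value, count) pairs, highest count first.
top_label, top_count = label_counts.most_common(1)[0]

# Uniqueness: the number of distinct keys in a Counter.
unique_users = len(Counter(r["user"] for r in records))

print(spam_total, top_label, top_count, unique_users)  # prints: 3 spam 3 3
```

Every value here comes from iterating the full record list, so re-running the cell on the same input yields the same integers.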

---

## What This Notebook Runs

This notebook runs an in-repo, CPU-first aggregation demo:
- Parse records from structured input
- Build `Counter()` indexes for exact aggregation
- Classify the query using a deterministic pattern table
- Dispatch a deterministic handler

## Claim Hygiene

- The results below are for the **included test harness** (4 cases) shipped in this repo.
- This is **not** a reproduced external leaderboard score.
- "Lane" labels are about **local evidence** (tests/logs), not a formal proof certificate.

**Result (in this repo):** 4/4 included test cases passed.

## Setup: Import OOLONG Solver

The solver lives at `oolong/src/oolong_solver.py`. It implements the Counter Bypass Protocol:
1. **Parse** input records into structured data
2. **Index** using `collections.Counter()` for exact counts
3. **Classify** the query type via a pattern table
4. **Dispatch** to a deterministic handler (no LLM needed for aggregation)
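
The four steps above can be sketched end to end as follows. This is an illustrative sketch, not the code in `oolong/src/oolong_solver.py`: the `key=value; key=value` record format, the regex patterns, and every function name (`parse_records`, `handle_*`, `answer`) are assumptions chosen to keep the example small.

```python
import re
from collections import Counter

def parse_records(text):
    """Step 1: parse 'key=value; key=value' lines into dicts (assumed format)."""
    records = []
    for line in text.strip().splitlines():
        records.append(dict(part.strip().split("=", 1) for part in line.split(";")))
    return records

def index(records, field):
    """Step 2: build an exact Counter index over one field."""
    return Counter(r[field] for r in records)

def handle_most_common(records, m):
    return index(records, m["field"]).most_common(1)[0][0]

def handle_unique(records, m):
    return len(index(records, m["field"]))

def handle_count(records, m):
    return index(records, m["field"])[m["value"]]

# Step 3: pattern table, first match wins (hypothetical patterns).
PATTERN_TABLE = [
    (re.compile(r"most common (?P<field>\w+)", re.I), handle_most_common),
    (re.compile(r"how many (?:distinct|unique) (?P<field>\w+) values", re.I), handle_unique),
    (re.compile(r"how many records have (?P<field>\w+) (?P<value>\w+)", re.I), handle_count),
]

def answer(query, text):
    """Step 4: dispatch to the handler of the first matching pattern."""
    records = parse_records(text)
    for pattern, handler in PATTERN_TABLE:
        m = pattern.search(query)
        if m:
            return handler(records, m)
    # Refuse to guess when no pattern matches.
    raise ValueError(f"unclassified query: {query!r}")

DATA = """
color=red; size=L
color=blue; size=M
color=red; size=S
"""

print(answer("What is the most common color?", DATA))            # red
print(answer("How many distinct size values are there?", DATA))  # 3
print(answer("How many records have color red?", DATA))          # 2
```

Raising on an unclassified query, rather than falling back to a default handler, keeps classification failures visible instead of turning them into confident wrong answers.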

In [2]:
# Run the optional LLM-backed OOLONG solver (requires local wrapper)
import os
import subprocess
from pathlib import Path

if os.environ.get('STILLWATER_ENABLE_LLM_REAL') == '1':
    result = subprocess.run(
        ['python3', 'oolong/src/oolong_solver_real.py'],
        capture_output=True,
        text=True,
        cwd=Path.cwd(),
    )

    print(result.stdout)
    if result.stderr:
        print('STDERR:', result.stderr)
else:
    print('Skipping oolong/src/oolong_solver_real.py (offline demo mode).')


Skipping oolong/src/oolong_solver_real.py (offline demo mode).


## Execute: Run OOLONG Solver with Counter Bypass Protocol

Run the demo solver via subprocess. The solver:
- Processes 4 test cases covering different aggregation patterns
- Reports pass/fail per case with exact counts
- Emits a verification ladder (Rung 641 / 274177 / 65537)

In [3]:
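# Run the actual solver via subprocess to capture real output
import subprocess
from pathlib import Path  # imported here so this cell also runs on its own

result = subprocess.run(
    ['python3', 'oolong/src/oolong_solver.py'],
    capture_output=True,
    text=True,
    cwd=Path.cwd(),
)

# Show whatever the solver printed, even on failure, to aid debugging
print(result.stdout)

# Fail-closed: stop here on a non-zero returncode instead of trusting output
if result.returncode != 0:
    print('ERROR: solver exited non-zero')
    print('returncode:', result.returncode)
    if result.stderr:
        print('STDERR:')
        print(result.stderr)
    raise RuntimeError('oolong_solver.py exited non-zero')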

OOLONG-STYLE AGGREGATION DEMO: COUNTER BYPASS PROTOCOL SOLVER
Auth: 65537 | Status: Demo (in-repo tests only)

Running test cases...

Test Results: 4/4 passed
Pass rate (in-repo tests): 100.0%

VERIFICATION LADDER

Rung 641 (Edge Sanity): PASS ✓
  - 4 edge cases checked
  - All test inputs valid: True

Rung 274177 (Generalization): PASS ✓
  - All 4 tests must pass: True
  - Success rate: 4/4

Rung 65537 (Explanation): PASS ✓
  - Explanation substantive (>10 words): True
  - Explanation length: 39 words

SUMMARY

✓ Counter Bypass Protocol: DEMO RUN COMPLETE
✓ Verification Ladder: 641 → 274177 → 65537
✓ In-repo tests: 4/4 passed
✓ Explanation present: True
✓ Status: OK (demo)
✓ Confidence: Lane B (Checked in-repo; not an external benchmark certificate)

Difference from pure LLM approach:
  ✓ Deterministic Counter() for exact aggregation (counting step)
  ✓ Multiple test cases (4/4 correct)
  ✓ Honest about limitations (demo-sized suite; parsing/classification can fail)

Auth: 65537 | Nor

## Verification: Check All Requirements Met

Verify that the solver printed all expected markers. This is the **executable evidence gate**:
if any marker is missing, the checklist marks that check as FAIL.
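
One way to turn the marker checks into a single hard verdict is sketched below. The marker strings are copied from the solver output shown in this notebook, and the `EXIT_PASS` / `EXIT_BLOCKED` labels come from the architecture diagram; the `gate` and `enforce` functions are hypothetical helpers, not part of the repo.

```python
# Marker strings copied from the solver stdout captured in this notebook.
REQUIRED_MARKERS = [
    "Counter Bypass Protocol: DEMO RUN COMPLETE",
    "In-repo tests: 4/4 passed",
    "Rung 641 (Edge Sanity): PASS",
    "Rung 274177 (Generalization): PASS",
    "Rung 65537 (Explanation): PASS",
    "Confidence: Lane B",
]

def gate(stdout, returncode=0):
    """Return (verdict, missing_markers) for one solver run."""
    missing = [marker for marker in REQUIRED_MARKERS if marker not in stdout]
    if returncode != 0 or missing:
        return "EXIT_BLOCKED", missing
    return "EXIT_PASS", []

def enforce(stdout, returncode=0):
    """Raise unless the gate passes, so a notebook run stops at the failure."""
    verdict, missing = gate(stdout, returncode)
    if verdict != "EXIT_PASS":
        raise RuntimeError(f"{verdict}: returncode={returncode}, missing={missing}")
    return verdict

# Demonstration with synthetic stdout (not real solver output).
print(gate("\n".join(REQUIRED_MARKERS)))          # ('EXIT_PASS', [])
print(gate("Rung 641 (Edge Sanity): PASS")[0])    # EXIT_BLOCKED
```

In this notebook the call would be `enforce(result.stdout, result.returncode)`. Substring checks only prove that the solver printed a marker, so the gate is as trustworthy as the solver's own reporting and no more.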

### Verification Ladder

```mermaid
flowchart TD
    R641["Rung 641\nEdge Sanity\n(4 edge cases checked)"] --> R274["Rung 274177\nGeneralization\n(all 4 tests must pass)"]
    R274 --> R65["Rung 65537\nExplanation\n(substantive narrative)"]

    classDef rung fill:#0b1b2b,stroke:#9cc3ff,color:#e6f0ff;
    class R641,R274,R65 rung;
```

In [4]:
# Verify notebook requirements against solver stdout
output = result.stdout

checks = {
    'Demo run complete': 'Counter Bypass Protocol: DEMO RUN COMPLETE' in output,
    'In-repo tests: 4/4 passed': 'In-repo tests: 4/4 passed' in output,
    'Rung 641 PASS': 'Rung 641 (Edge Sanity): PASS' in output,
    'Rung 274177 PASS': 'Rung 274177 (Generalization): PASS' in output,
    'Rung 65537 PASS': 'Rung 65537 (Explanation): PASS' in output,
    'Confidence Lane B (demo)': 'Confidence: Lane B' in output,
}

print("VERIFICATION CHECKLIST")
print()
print("Solver Output Analysis:")
for check, ok in checks.items():
    status = 'PASS' if ok else 'FAIL'
    print(f"  [{status}] {check}")

print()
print("Notes:")
print("- This notebook verifies the in-repo demo harness only.")
print("- Do not treat this as an external benchmark reproduction.")


VERIFICATION CHECKLIST

Solver Output Analysis:
  [PASS] Demo run complete
  [PASS] In-repo tests: 4/4 passed
  [PASS] Rung 641 PASS
  [PASS] Rung 274177 PASS
  [PASS] Rung 65537 PASS
  [PASS] Confidence Lane B (demo)

Notes:
- This notebook verifies the in-repo demo harness only.
- Do not treat this as an external benchmark reproduction.


## Summary

### What this notebook demonstrates
- A CPU-first aggregation pattern (`Counter()` + deterministic dispatch) for OOLONG-style tasks.
- A small in-repo harness that you can re-run to validate behavior.

### What this notebook does NOT claim
- It does not reproduce any external OOLONG leaderboard.
- It does not provide a machine-checked formal proof certificate.

### Evidence shipped in this repo
- Executable demo solver: `oolong/src/oolong_solver.py`
- This notebook executes that solver and checks its stdout.

### Peer Review Checklist

| Check | Status |
|-------|--------|
| All cells run without errors | PASS |
| returncode checked on subprocess | PASS |
| Claim hygiene stated | PASS |
| Lane confidence declared | Lane B |
| Mermaid diagrams present | PASS |
| No hardcoded fake outputs | PASS |

### How To Reproduce

```bash
# Run the solver standalone:
python3 oolong/src/oolong_solver.py

# Run this notebook non-interactively:
jupyter nbconvert --execute --to notebook HOW-TO-CRUSH-OOLONG-BENCHMARK.ipynb
```

## FAQ (Harsh QA)

### "Using a CPU is cheating"
Benchmarks typically care about *solving the task*, not restricting you to one compute substrate. For exact aggregation (counts, top-k, uniqueness), a CPU primitive is the right tool.

### "Counter() is trivial"
Yes. That is the point: aggregation should be handled by deterministic tools whenever possible. The insight is not complexity; it is the **correctness guarantee** on the counting step.

### "Is this guaranteed correct end-to-end?"
The counting step is exact, but end-to-end correctness still depends on parsing and query classification. This repo includes a small test harness; expand it if you want stronger evidence.
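
To make that dependency concrete, the sketch below shows how a lenient parser can produce an exact count of the wrong data, while a strict parser surfaces the problem. The `name,category` line format and both parser functions are assumptions for this illustration, not the repo's parser.

```python
from collections import Counter

# Hypothetical input; line 3 is malformed (missing comma).
RAW = "apple,fruit\ncarrot,vegetable\nbanana fruit\npear,fruit"

def parse_lenient(raw):
    """Silently skip lines that do not split into exactly two fields."""
    rows = (line.split(",") for line in raw.splitlines())
    return [tuple(row) for row in rows if len(row) == 2]

def parse_strict(raw):
    """Raise on the first malformed line so the error cannot hide in a count."""
    records = []
    for lineno, line in enumerate(raw.splitlines(), start=1):
        parts = line.split(",")
        if len(parts) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 fields, got {len(parts)}: {line!r}"
            )
        records.append(tuple(parts))
    return records

# Counter is exact for the rows it receives, but one fruit row was dropped.
lenient_counts = Counter(category for _, category in parse_lenient(RAW))
print(lenient_counts["fruit"])  # 2

try:
    parse_strict(RAW)
except ValueError as err:
    print("blocked:", err)
```

A test case built around a malformed record like this one is a cheap way to extend the harness beyond the counting step.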

```mermaid
flowchart LR
    subgraph "Exact (Counter)"
        C["Counting"] --> T["Top-K"] --> U["Uniqueness"]
    end
    subgraph "Approximate (LLM)"
        P["Parsing"] --> CL["Classification"] --> I["Intent"]
    end
    C -.->|"deterministic"| SAFE["Guaranteed Correct"]
    P -.->|"probabilistic"| RISK["Needs Verification"]

    classDef exact fill:#1a3a1a,stroke:#66ff66,color:#e6ffe6;
    classDef approx fill:#3a1a1a,stroke:#ff6666,color:#ffe6e6;
    class C,T,U,SAFE exact;
    class P,CL,I,RISK approx;
```

## Background Note

This notebook is a repo-local demonstration of a general engineering principle:
- Use LLMs (or rules) for **classification/understanding** when needed.
- Use **deterministic computation** for exact aggregation.

If you want to connect this to specific external benchmarks, add a reproducible harness in-repo and log the outputs (see `papers/99-claims-and-evidence.md`).

---

**Auth:** 65537 | **Northstar:** Phuc Forecast | **Skill Pack:** `prime-math.md`

## Why This Pattern Works

### Transformer Attention vs Enumeration

```mermaid
flowchart TD
    subgraph "Transformer (Approximate)"
        ATT["Attention\n(weighted blend)"] --> SOFT["Softmax\n(probability)"]
        SOFT --> GEN["Generate\n(next token)"]
    end
    subgraph "Counter Bypass (Exact)"
        ENUM["Enumerate\n(iterate all)"] --> COUNT["Count\n(exact integer)"]
        COUNT --> RET["Return\n(deterministic)"]
    end

    classDef approx fill:#3a1a1a,stroke:#ff6666,color:#ffe6e6;
    classDef exact fill:#1a3a1a,stroke:#66ff66,color:#e6ffe6;
    class ATT,SOFT,GEN approx;
    class ENUM,COUNT,RET exact;
```

### Division of Labor
- **Classification**: can be handled by an LLM or a ruleset (this demo uses a ruleset).
- **Aggregation**: handled by deterministic enumeration (`Counter()`).

**Claim hygiene:** deterministic aggregation reduces one class of errors, but it does not automatically solve parsing, schema inference, or ambiguous question intent.