# Accessibility Audit — End-to-End Pipeline Walkthrough

**Two sequential passes over a real-world HTML page against WCAG criteria**

---
The audit runs in two passes:

| Pass | Scripts | Purpose |
|---|---|---|
| **Pass 1 — Programmatic** | `programmatic/*.py` — 43 rules | Binary pass/fail: objectively verifiable violations reported immediately |
| **Pass 2 — LLM** | `llm_preprocessing/*.py` + `llm/*.txt` — up to 21 calls | Quality judgment on items that *passed* Pass 1 |

**Design principle**: Pass 1 results **filter** Pass 2 inputs. Items that failed a binary
check (no alt attribute, no label, broken reference) are excluded from LLM evaluation —
there is no value in judging the quality of something that doesn't exist.

The programmatic checks run in milliseconds, so running them first adds negligible overhead.
In return, Pass 2 receives filtered LLM payloads, produces no duplicate findings, and costs fewer tokens.
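
The design principle amounts to a partition of the extracted items. The sketch below is illustrative only: `split_for_llm` and the image records are hypothetical names and data, not part of the pipeline.

```python
def split_for_llm(items, failed_binary_check):
    """Partition items into those worth an LLM quality check and those Pass 1 already reported."""
    for_llm, already_reported = [], []
    for item in items:
        (already_reported if failed_binary_check(item) else for_llm).append(item)
    return for_llm, already_reported

# Hypothetical image records: alt=None means the attribute is absent.
images = [
    {"src": "logo.png", "alt": "Vision Aid logo"},
    {"src": "hero.jpg", "alt": None},
    {"src": "team.jpg", "alt": "Volunteers at a training session"},
]

for_llm, already_reported = split_for_llm(images, lambda im: im["alt"] is None)
print(len(for_llm), "sent to the LLM;", len(already_reported), "reported by Pass 1")
```

Only the two images that have alt text reach the LLM; the third is already a definitive finding.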

---

**Test page**: [Vision Aid](https://visionaid.org) homepage — 1.9 MB, 14,626 lines of WordPress/Elementor HTML

In [1]:
from pathlib import Path
import json, re, sys, importlib.util
from bs4 import BeautifulSoup
from collections import Counter

# ── Paths (notebook lives at processing_scripts/) ─────────────────────────────
NOTEBOOK_DIR  = Path(".").resolve()
PROJECT_ROOT  = (NOTEBOOK_DIR / "..").resolve()
HTML_FILE     = PROJECT_ROOT / "test_files" / "home.html"
PROG_DIR      = NOTEBOOK_DIR / "programmatic"
PREPROCESSING = NOTEBOOK_DIR / "llm_preprocessing"
PROMPTS_DIR   = NOTEBOOK_DIR / "llm"

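def load_module(name, directory):
    """Import <directory>/<name>.py by path.

    Loading by path lets the same file name (e.g. semantic_checklist_01.py)
    be loaded from both programmatic/ and llm_preprocessing/ without clashing.
    """
    path = directory / f"{name}.py"
    spec = importlib.util.spec_from_file_location(name, path)
    mod  = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod

def estimate_tokens(text):
    """Rough token estimate: about 4 characters per token, never less than 1."""
    return max(1, len(text) // 4)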

print(f"HTML file    : {HTML_FILE}  (exists={HTML_FILE.exists()})")
print(f"Programmatic : {PROG_DIR}")
print(f"LLM preproc  : {PREPROCESSING}")
print(f"Prompts      : {PROMPTS_DIR}")

HTML file    : /Users/andrew/git/visionaid-a11y-llm-audit/test_files/home.html  (exists=True)
Programmatic : /Users/andrew/git/visionaid-a11y-llm-audit/processing_scripts/programmatic
LLM preproc  : /Users/andrew/git/visionaid-a11y-llm-audit/processing_scripts/llm_preprocessing
Prompts      : /Users/andrew/git/visionaid-a11y-llm-audit/processing_scripts/llm


---
## The Problem — Raw HTML Is Too Large for Direct LLM Processing

Before running the audit, let's see what we're working with.

In [2]:
raw_text   = HTML_FILE.read_text(encoding="utf-8")
raw_bytes  = HTML_FILE.stat().st_size
raw_chars  = len(raw_text)
raw_lines  = raw_text.count("\n")
raw_tokens = raw_chars // 4

print("=" * 56)
print("  RAW HTML FILE METRICS")
print("=" * 56)
print(f"  File size    : {raw_bytes:>10,} bytes  ({raw_bytes/1024/1024:.2f} MB)")
print(f"  Characters   : {raw_chars:>10,}")
print(f"  Lines        : {raw_lines:>10,}")
print(f"  Est. tokens  : {raw_tokens:>10,}  (4 chars / token heuristic)")
print()

windows = [
    ("GPT-4o / GPT-4o-mini",       128_000),
    ("Claude Sonnet 4.6 (200 k)",  200_000),
]
print(f"  {'Model':<36} {'Limit':>8}   {'% used':>7}   Fits?")
print(f"  {'-'*58}")
for name, limit in windows:
    pct  = raw_tokens / limit * 100
    icon = "✓" if raw_tokens <= limit else "✗"
    print(f"  {name:<36} {limit:>8,}   {pct:>6.0f}%   {icon}")

  RAW HTML FILE METRICS
  File size    :  1,948,296 bytes  (1.86 MB)
  Characters   :  1,947,379
  Lines        :     14,626
  Est. tokens  :    486,844  (4 chars / token heuristic)

  Model                                   Limit    % used   Fits?
  ----------------------------------------------------------
  GPT-4o / GPT-4o-mini                  128,000      380%   ✗
  Claude Sonnet 4.6 (200 k)             200,000      243%   ✗


In [3]:
soup = BeautifulSoup(raw_text, "lxml")

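def char_size(tags):
    # NOTE: find_all() also returns nested matches, so an element inside another
    # matching element is counted once per matching ancestor. Categories also
    # overlap (an <a> inside a <div> counts toward both), so sums can exceed the file size.
    return sum(len(str(el)) for el in tags)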

sections = {
    "<style> blocks  (inline CSS)":    char_size(soup.find_all("style")),
    "<script> blocks (JS / JSON-LD)":  char_size(soup.find_all("script")),
    "<svg> elements":                  char_size(soup.find_all("svg")),
    "<img> tags":                      char_size(soup.find_all("img")),
    "<a> link tags":                   char_size(soup.find_all("a")),
    "<div> / <section> wrappers":      char_size(soup.find_all(["div", "section"])),
}
accounted = sum(sections.values())
sections["Text nodes / other"] = max(0, raw_chars - accounted)

print(f"  {'Element type':<34} {'Chars':>10}  {'~Tokens':>8}  {'% of file':>9}")
print("  " + "-" * 68)
for label, chars in sorted(sections.items(), key=lambda x: -x[1]):
    toks = chars // 4
    pct  = chars / raw_chars * 100
    bar  = "█" * int(pct / 2)
    # Chars, tokens, and % are printed divided by 10; `bar` uses the undivided pct.
    print(f"  {label:<34} {chars/10:>10,}  {toks/10:>8,}  {pct/10:>8.1f}%  {bar}")
print("  " + "-" * 68)
print(f"  {'TOTAL':<34} {raw_chars:>10,}  {raw_chars//4:>8,}  {'100.0%':>9}")
print()
print("  98%+ comes from <div>/<section> wrappers — layout noise, not content.")
print("  The extractors discard all of this and emit only accessibility-relevant data.")

  Element type                            Chars   ~Tokens  % of file
  --------------------------------------------------------------------
  <div> / <section> wrappers         1,914,066.0  478,516.5      98.3%  ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
  <a> link tags                        56,954.0  14,238.5       2.9%  ██████████████
  <img> tags                           17,510.9   4,377.7       0.9%  ████
  <style> blocks  (inline CSS)          9,639.9   2,409.9       0.5%  ██
  <script> blocks (JS / JSON-LD)        9,466.9   2,366.7      

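`find_all` returns nested matches as well as outer ones, so summing `len(str(el))` over every `<div>` counts deeply nested markup several times. The sketch below shows the difference between that nested total and a count of outermost elements only. It uses the standard library's `html.parser` rather than BeautifulSoup, the `SpanSizer` class is hypothetical, and it assumes well-formed markup with end tags written as `</tag>`.

```python
from html.parser import HTMLParser

class SpanSizer(HTMLParser):
    """Measure source characters spanned by target elements, with and without overlap."""

    def __init__(self, html, tags):
        super().__init__()
        self.tags = set(tags)
        # Absolute offset at which each line starts, to turn getpos() into an index.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(html) if ch == "\n"]
        self.open_starts = []      # start offsets of target elements still open
        self.nested_total = 0      # every match counted, like summing over find_all()
        self.outermost_total = 0   # only matches with no target ancestor

    def _offset(self):
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.open_starts.append(self._offset())

    def handle_endtag(self, tag):
        if tag in self.tags and self.open_starts:
            start = self.open_starts.pop()
            size = self._offset() + len(f"</{tag}>") - start
            self.nested_total += size
            if not self.open_starts:   # no enclosing target element remains open
                self.outermost_total += size

html = "<div><div>ab</div></div><p>x</p>"
sizer = SpanSizer(html, ["div", "section"])
sizer.feed(html)
sizer.close()
print(sizer.nested_total, sizer.outermost_total, len(html))
```

In this small case the nested total is larger than the document itself, which is how a single category can appear to exceed 100% of the file.
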
---
## Pass 1 — Programmatic Binary Checks

**Run first.** These checks are objectively verifiable directly from the HTML — no LLM
interpretation needed. Pass 1 produces two outputs:

1. **Direct findings** — binary violations reported immediately
2. **Filter sets** — used in the next step to exclude already-caught items from Pass 2

### Filtering rules

When a rule fires, something is missing or broken, so there is no quality left
for the LLM to evaluate. Those items are excluded from Pass 2.

| Rule that fires | Excluded from Pass 2 |
|---|---|
| `PAGE_TITLE_001` / `_003` (missing/empty title) | CL01 Prompt 1 skipped entirely |
| `IFRAME_001` / `_002` (missing/empty iframe title) | Those iframes filtered from CL01 Prompt 5 |
| `HEAD_004` (empty heading) | That heading filtered from CL01 Prompt 2 |
| `FORM_LABEL_001` (no label at all) | That field filtered from CL02 Prompt 1 |
| `FORM_INSTR_001` (broken `aria-describedby`) | That field filtered from CL02 Prompt 5 |
| `NON_TEXT_001` (missing `alt` on `<img>`) | That image filtered from CL03 Prompts 1 & 2 |
| `NON_TEXT_002` (actionable image, no/empty alt) | That image filtered from CL03 Prompt 3 |

`FORM_LABEL_003` (placeholder-only), `HEAD_001` (skipped level), and `FORM_GROUP_001` (no legend)
are **not** filtered — the LLM adds value by evaluating severity and suggesting remediation.
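
The table can also be written down as data. The sketch below transcribes it into a lookup and reports which prompts need filtered inputs for a given issue list. `FILTER_MAP` and `affected_prompts` are hypothetical names; only the `rule_id` key matches the issue dictionaries used in this notebook.

```python
# Rule ID -> (checklist, prompt numbers), transcribed from the filtering table.
FILTER_MAP = {
    "PAGE_TITLE_001": ("CL01", (1,)),
    "PAGE_TITLE_003": ("CL01", (1,)),
    "IFRAME_001":     ("CL01", (5,)),
    "IFRAME_002":     ("CL01", (5,)),
    "HEAD_004":       ("CL01", (2,)),
    "FORM_LABEL_001": ("CL02", (1,)),
    "FORM_INSTR_001": ("CL02", (5,)),
    "NON_TEXT_001":   ("CL03", (1, 2)),
    "NON_TEXT_002":   ("CL03", (3,)),
}

def affected_prompts(issues):
    """Return the set of (checklist, prompt) pairs whose inputs need filtering."""
    affected = set()
    for issue in issues:
        target = FILTER_MAP.get(issue["rule_id"])
        if target:
            checklist, prompts = target
            affected.update((checklist, p) for p in prompts)
    return affected

# Hypothetical issue list; HEAD_001 is deliberately absent from FILTER_MAP.
sample = [{"rule_id": "NON_TEXT_002"}, {"rule_id": "HEAD_001"}, {"rule_id": "NON_TEXT_001"}]
print(sorted(affected_prompts(sample)))
```

Rules left out of the map, such as `HEAD_001`, never remove anything from Pass 2.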

In [13]:
# ── Load and run all three programmatic checkers ──────────────────────────────
prog_sem   = load_module("semantic_checklist_01", PROG_DIR)
prog_forms = load_module("forms_checklist_02",    PROG_DIR)
prog_ntext = load_module("nontext_checklist_03",  PROG_DIR)

sem_issues   = prog_sem.audit_html_file(str(HTML_FILE))
form_issues  = prog_forms.audit_forms(str(HTML_FILE))
ntext_issues = prog_ntext.audit_nontext(str(HTML_FILE))

all_prog_issues = sem_issues + form_issues + ntext_issues

# ── Display results ────────────────────────────────────────────────────────────
checkers = [
    ("CL01 Semantic",  sem_issues,   29),
    ("CL02 Forms",     form_issues,   7),
    ("CL03 Non-text",  ntext_issues,  7),
]

print("=" * 62)
print("  PASS 1 — PROGRAMMATIC AUDIT RESULTS  (home.html)")
print("=" * 62)

total = 0
for checker_name, issues, rule_count in checkers:
    print(f"\n  {checker_name}  ({rule_count} rules checked, {len(issues)} issue(s) found)")
    print(f"  {'─'*46}")
    if not issues:
        print("    No issues found.")
    else:
        by_rule = {}
        for i in issues:
            key = i["rule_id"]
            if key not in by_rule:
                by_rule[key] = (i["rule_name"], 0)
            by_rule[key] = (by_rule[key][0], by_rule[key][1] + 1)
        for rule_id in sorted(by_rule):
            rule_name, count = by_rule[rule_id]
            suffix = f"  × {count}" if count > 1 else ""
            print(f"    [{rule_id}] {rule_name}{suffix}")
    total += len(issues)

print(f"\n  {'─'*56}")
print(f"  Total Pass 1 issues : {total}")
print(f"  These are definitive findings — no LLM interpretation needed.")

  PASS 1 — PROGRAMMATIC AUDIT RESULTS  (home.html)

  CL01 Semantic  (29 rules checked, 1233 issue(s) found)
  ──────────────────────────────────────────────
    [HEAD_001] Skipped heading level  × 5
    [HEAD_002] Multiple <h1> elements  × 5
    [LAND_002] Multiple main landmarks  × 5
    [LAND_004] Multiple contentinfo landmarks  × 5
    [LAND_005] Multiple 'navigation' landmarks without accessible labels  × 39
    [LAND_006] Content outside landmark regions
    [LINK_001] Link without accessible name  × 197
    [NAV_003] Skip link is not first focusable element
    [PAGE_TITLE_002] Multiple <title> elements  × 3
    [PARSE_001] Duplicate ID  × 972

  CL02 Forms  (7 rules checked, 0 issue(s) found)
  ──────────────────────────────────────────────
    No issues found.

  CL03 Non-text  (7 rules checked, 124 issue(s) found)
  ──────────────────────────────────────────────
    [NON_TEXT_002] Actionable image missing alt text  × 124

  ────────────────────────────────────────────────────

In [14]:
# ── Build filter sets for Pass 2 ──────────────────────────────────────────────
# Each flag below records whether a filtering rule fired in Pass 1. The filtering
# itself is applied to extracted payload values in the slice cells, not by ID
# matching, since many elements lack IDs.

skip_title_prompt = any(
    i["rule_id"] in ("PAGE_TITLE_001", "PAGE_TITLE_003")
    for i in sem_issues
)

iframe_rules_fired = {i["rule_id"] for i in sem_issues
                      if i["rule_id"] in ("IFRAME_001", "IFRAME_002")}

head_empty_fired   = any(i["rule_id"] == "HEAD_004" for i in sem_issues)
form_label_fired   = any(i["rule_id"] == "FORM_LABEL_001" for i in form_issues)
form_instr_fired   = any(i["rule_id"] == "FORM_INSTR_001" for i in form_issues)
ntext_missing_fired = any(i["rule_id"] == "NON_TEXT_001" for i in ntext_issues)
ntext_action_fired  = any(i["rule_id"] == "NON_TEXT_002" for i in ntext_issues)

print("Pass 1 → Pass 2 filter summary")
print(f"  {'Filter':<50} {'Active?':>8}")
print(f"  {'─'*60}")
print(f"  {'Skip CL01 Prompt 1 (page title)':<50} {'YES' if skip_title_prompt else 'no':>8}")
print(f"  {'Filter iframes from CL01 Prompt 5':<50} {'YES' if iframe_rules_fired else 'no':>8}")
print(f"  {'Filter empty headings from CL01 Prompt 2':<50} {'YES' if head_empty_fired else 'no':>8}")
print(f"  {'Filter unlabelled fields from CL02 Prompt 1':<50} {'YES' if form_label_fired else 'no':>8}")
print(f"  {'Filter broken-desc fields from CL02 Prompt 5':<50} {'YES' if form_instr_fired else 'no':>8}")
print(f"  {'Filter no-alt images from CL03 Prompts 1 & 2':<50} {'YES' if ntext_missing_fired else 'no':>8}")
print(f"  {'Filter no-alt actionable images from CL03 Prompt 3':<50} {'YES' if ntext_action_fired else 'no':>8}")
print()
print("  Items marked 'no' pass through to LLM quality evaluation unchanged.")

Pass 1 → Pass 2 filter summary
  Filter                                              Active?
  ────────────────────────────────────────────────────────────
  Skip CL01 Prompt 1 (page title)                          no
  Filter iframes from CL01 Prompt 5                        no
  Filter empty headings from CL01 Prompt 2                 no
  Filter unlabelled fields from CL02 Prompt 1              no
  Filter broken-desc fields from CL02 Prompt 5             no
  Filter no-alt images from CL03 Prompts 1 & 2             no
  Filter no-alt actionable images from CL03 Prompt 3      YES

  Items marked 'no' pass through to LLM quality evaluation unchanged.


---
## Pass 2 — LLM Semantic Quality Evaluation

With Pass 1 complete, we now extract structured JSON payloads for the LLM.
Each extractor pulls only accessibility-relevant content — discarding all CSS,
JavaScript, layout divs, and inline styles. The filter sets built above are then
applied before computing prompt slices.

The LLM evaluates what programmatic tools cannot: whether present attributes are
*meaningful*, *accurate*, and *sufficient* for screen reader users.
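
To make the idea of an extractor concrete, here is a minimal sketch that keeps only headings and drops styles, scripts, and wrapper markup. It is not the project's extractor: `HeadingExtractor` is a hypothetical class built on the standard library's `html.parser`, and the `level`/`text` field names are assumptions.

```python
import json
from html.parser import HTMLParser

class HeadingExtractor(HTMLParser):
    """Collect h1-h6 text and ignore everything else (styles, scripts, wrappers)."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # heading currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._current = {"level": int(tag[1]), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._current is not None:
            # Collapse runs of whitespace before storing the heading text.
            self._current["text"] = " ".join(self._current["text"].split())
            self.headings.append(self._current)
            self._current = None

page = """
<style>.hero{color:red}</style>
<div class="wrap"><h1>Vision  Aid</h1>
<div><h2><span>Our programs</span></h2></div></div>
<script>var x = 1;</script>
"""
extractor = HeadingExtractor()
extractor.feed(page)
extractor.close()
payload = {"headings": extractor.headings}
print(json.dumps(payload, indent=2))
```

The payload carries two short records, while the CSS, JavaScript, and `<div>` wrappers never reach the prompt.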

---
### Checklist 01 — Semantic Structure & Navigation

**Extractor**: `llm_preprocessing/semantic_checklist_01.py`  
**Prompts**: `llm/semantic_checklist_01.txt` (up to 7)

| Section | WCAG | What the LLM judges |
|---|---|---|
| `page_title` + `headings` | 2.4.2, 1.3.1 | Are title and headings meaningful and logically structured? |
| `flagged_links` | 2.4.4 | Are short/generic link texts clear when read out of context? |
| `landmarks` | 1.3.6 | Are multiple navs differentiated? Is the structure appropriate? |
| `tables` / `iframes` | 1.3.1, 4.1.2 | Caption/header clarity; iframe title meaningfulness |

*Binary checks (missing title, invalid lang, duplicate IDs, missing iframe title) are already
covered by Pass 1 — `language` is included in the payload for context only.*
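
How the extractor decides that a link text is "short/generic" is not shown in this walkthrough. One plausible heuristic is sketched below; the phrase list, the `min_chars` threshold, and the function name are all assumptions, and the sample link texts are invented.

```python
GENERIC_PHRASES = {"click here", "here", "read more", "learn more", "more", "link"}

def is_flagged(link_text, min_chars=4):
    """Flag link text that is very short or matches a known generic phrase."""
    text = " ".join(link_text.split()).lower()
    return len(text) < min_chars or text in GENERIC_PHRASES

# Invented sample link texts.
links = ["Read more", "Donate to Vision Aid", "Go", "Annual report (PDF)"]
flagged = [t for t in links if is_flagged(t)]
print(flagged)
```

Pre-flagging keeps the prompt small: only suspicious links are sent for judgment, not every link on the page.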

In [15]:
cl01     = load_module("semantic_checklist_01", PREPROCESSING)
cl01_pay = cl01.extract(str(HTML_FILE))
cl01_json = json.dumps(cl01_pay, indent=2)
cl01_tok  = estimate_tokens(cl01_json)

print("Checklist 01 — Extraction summary")
print(f"  {'Section':<22}  {'Items':>6}")
print(f"  {'─'*32}")
for key, val in cl01_pay.items():
    if isinstance(val, list):
        print(f"  {key:<22}  {len(val):>6}")
    elif key == "images":
        total = sum(len(v) for v in val.values())
        miss  = len(val["missing_alt"])
        empty = len(val["empty_alt"])
        has   = len(val["has_alt"])
        print(f"  {key:<22}  {total:>6}  (missing={miss}, empty={empty}, has_alt={has})")
    elif isinstance(val, dict):
        print(f"  {key:<22}  {'—':>6}")
    else:
        print(f"  {key:<22}  {repr(val)[:20]:>20}")
print(f"\n  Raw payload : {len(cl01_json):>8,} chars  |  ~{cl01_tok:>6,} tokens")

Checklist 01 — Extraction summary
  Section                  Items
  ────────────────────────────────
  language                             'en-US'
  page_title                   —
  headings                   280
  images                     306  (missing=0, empty=134, has_alt=172)
  flagged_links               76
  forms                       16
  buttons                     15
  landmarks                   62
  tables                       0
  iframes                     15

  Raw payload :   70,644 chars  |  ~17,661 tokens


In [None]:
# ── CL01 prompt slices — with Pass 1 filters applied ─────────────────────────

# Filter: empty headings already caught by HEAD_004
# `or ""` also covers headings whose text is stored as None rather than missing.
headings_filtered = [h for h in cl01_pay["headings"] if (h.get("text") or "").strip()]

# Filter: iframes with missing/empty titles already caught by IFRAME_001/002
iframes_filtered  = [f for f in cl01_pay["iframes"]
                     if f.get("title") and str(f["title"]).strip()]

cl01_slices = {
    "1. Page Title":         None if skip_title_prompt
                             else {"page_title": cl01_pay["page_title"]},
    "2. Heading Structure":  {"page_title": cl01_pay["page_title"],
                              "headings":   headings_filtered},
    "3. Link Clarity":       {"flagged_links": cl01_pay["flagged_links"]},
    "4. Table Semantics":    {"tables":    cl01_pay["tables"]},
    "5. Iframe Titles":      {"iframes":   iframes_filtered},
    "6. Landmark Structure": {"landmarks": cl01_pay["landmarks"]},
    "7. Combined Summary":   None,  # skipped — redundant with individual prompts
}

print("Checklist 01 — Token budget per prompt (after Pass 1 filtering)")
print(f"  {'Prompt':<26}  {'Items':>6}  {'~Tokens':>8}  Note")
print(f"  {'─'*60}")
item_counts = {
    "1. Page Title":         0 if skip_title_prompt else 1,
    "2. Heading Structure":  len(headings_filtered),
    "3. Link Clarity":       len(cl01_pay["flagged_links"]),
    "4. Table Semantics":    len(cl01_pay["tables"]),
    "5. Iframe Titles":      len(iframes_filtered),
    "6. Landmark Structure": len(cl01_pay["landmarks"]),
    "7. Combined Summary":   0,
}
for name, data in cl01_slices.items():
    if data is None:
        note = "SKIPPED (Pass 1: missing/empty title)" if "Title" in name else "SKIPPED (summary)"
        print(f"  {name:<26}  {'—':>6}  {'—':>8}  {note}")
    else:
        t = estimate_tokens(json.dumps(data, indent=2))
        n = item_counts[name]
        orig_n = {
            "2. Heading Structure":  len(cl01_pay["headings"]),
            "5. Iframe Titles":      len(cl01_pay["iframes"]),
        }.get(name)
        note = f"  (filtered from {orig_n})" if orig_n and orig_n != n else ""
        print(f"  {name:<26}  {n:>6}  {t:>8,}{note}")

---
### Checklist 02 — Form Labels, Instructions & Groups

**Extractor**: `llm_preprocessing/forms_checklist_02.py`  
**Prompts**: `llm/forms_checklist_02.txt` (up to 6)

| Section | WCAG | What the LLM judges |
|---|---|---|
| `fields.effective_label` | 1.3.1, 2.4.6 | Is the label descriptive and meaningful? |
| `fields.label_source` | 1.3.1 | Is `placeholder_only` being used as a label? (severity + remediation) |
| `fields.instructions` | 3.3.2 | Are aria-describedby instructions clear and helpful? |
| `fields.required` | 3.3.2 | Is required status communicated in the label, not just via attribute? |
| `groups.legend` | 1.3.1 | Does the legend provide sufficient group context? |

*Fields with no label at all (`FORM_LABEL_001`) are filtered from Prompt 1 — there is
no label quality to assess. Fields with broken `aria-describedby` (`FORM_INSTR_001`)
are filtered from Prompt 5 — the instructions are unreachable.*
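
The `label_source` value drives both filters. A rough sketch of how such a value might be derived and then used follows. The precedence order, the input keys (`label_for_text`, `aria_label`, `placeholder`), and the `aria_label` result value are assumptions; only `label_for`, `placeholder_only`, and `none` appear in this notebook.

```python
def label_source(field):
    """Return where a field's label comes from, checked in a fixed (assumed) order."""
    if field.get("label_for_text"):
        return "label_for"
    if field.get("aria_label"):
        return "aria_label"        # hypothetical value, for illustration only
    if field.get("placeholder"):
        return "placeholder_only"
    return "none"

# Hypothetical field records.
fields = [
    {"name": "email", "label_for_text": "Email address"},
    {"name": "q", "placeholder": "Search"},
    {"name": "phone"},
]
sources = {f["name"]: label_source(f) for f in fields}
for_label_quality = [f["name"] for f in fields if label_source(f) != "none"]
print(sources)
print(for_label_quality)
```

The placeholder-only field stays in the label-quality slice because the LLM can still judge its severity, while the field with no label at all is dropped.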

In [17]:
cl02      = load_module("forms_checklist_02", PREPROCESSING)
cl02_pay  = cl02.extract(str(HTML_FILE))
cl02_json = json.dumps(cl02_pay, indent=2)
cl02_tok  = estimate_tokens(cl02_json)

forms      = cl02_pay["forms"]
all_fields = [f for form in forms for f in form["fields"]]
all_groups = [g for form in forms for g in form["groups"]]

src_counts = Counter(f["label_source"] for f in all_fields)
required   = [f for f in all_fields if f["required"]]
with_instr = [f for f in all_fields if f["instructions"]]

print("Checklist 02 — Forms summary")
print(f"  Total forms            : {len(forms)}")
print(f"  Total fields           : {len(all_fields)}")
print(f"  Total groups           : {len(all_groups)}")
print(f"  Required fields        : {len(required)}")
print(f"  Fields with aria-desc  : {len(with_instr)}")
print(f"  Orphan labels          : {len(cl02_pay['orphan_labels'])}")
print()
print("  Label source breakdown:")
for src, count in src_counts.most_common():
    flag = "  ⚠" if src in ("placeholder_only", "none") else ""
    print(f"    {src:<22} {count:>4}{flag}")
print()
print(f"  Raw payload : {len(cl02_json):>8,} chars  |  ~{cl02_tok:>6,} tokens")

Checklist 02 — Forms summary
  Total forms            : 16
  Total fields           : 16
  Total groups           : 0
  Required fields        : 5
  Fields with aria-desc  : 0
  Orphan labels          : 0

  Label source breakdown:
    label_for                16

  Raw payload :    9,975 chars  |  ~ 2,493 tokens


In [None]:
# ── CL02 prompt slices — with Pass 1 filters applied ─────────────────────────

# Filter: fields with no label (FORM_LABEL_001) excluded from label quality prompt
fields_with_labels = [f for f in all_fields if f["label_source"] != "none"]

# Filter: fields with broken aria-describedby (FORM_INSTR_001) excluded from instructions prompt
fields_with_valid_instr = [f for f in all_fields if f["instructions"]]

placeholder_only = [f for f in all_fields if f["label_source"] == "placeholder_only"]

cl02_slices = {
    "1. Label Quality":        {"fields": fields_with_labels},
    "2. Placeholder-as-Label": {"fields": placeholder_only},
    "3. Group Label Quality":  {"groups": all_groups},
    "4. Required Indicators":  {"fields": required},
    "5. Instructions Quality": {"fields": fields_with_valid_instr},
    "6. Overall Form Summary": None,  # skipped — redundant with individual prompts
}

print("Checklist 02 — Token budget per prompt (after Pass 1 filtering)")
print(f"  {'Prompt':<28}  {'Items':>6}  {'~Tokens':>8}  Note")
print(f"  {'─'*62}")
counts = {
    "1. Label Quality":        len(fields_with_labels),
    "2. Placeholder-as-Label": len(placeholder_only),
    "3. Group Label Quality":  len(all_groups),
    "4. Required Indicators":  len(required),
    "5. Instructions Quality": len(fields_with_valid_instr),
    "6. Overall Form Summary": 0,
}
orig_counts = {
    "1. Label Quality": len(all_fields),
}
for name, data in cl02_slices.items():
    if data is None:
        print(f"  {name:<28}  {'—':>6}  {'—':>8}  SKIPPED (summary)")
        continue
    n = counts[name]
    t = estimate_tokens(json.dumps(data, indent=2))
    orig = orig_counts.get(name)
    note = f"  (filtered from {orig})" if orig and orig != n else ""
    if n == 0:
        note = "  SKIPPED (no items)"
    print(f"  {name:<28}  {n:>6}  {t:>8,}{note}")

---
### Checklist 03 — Non-text Content: Images, SVG, Icons & Media

**Extractor**: `llm_preprocessing/nontext_checklist_03.py`  
**Prompts**: `llm/nontext_checklist_03.txt` (up to 8)

Images are split into four categories because the evaluation criteria differ:

| Category | Pass 1 interaction | LLM evaluates |
|---|---|---|
| `informative` | Images with `alt=None` filtered out (NON_TEXT_001) | Is the alt text accurate and concise? |
| `decorative` | Images with `alt=None` filtered out (NON_TEXT_001) | Is empty alt genuinely correct — or does the image convey content? |
| `actionable` | Images with empty/null alt filtered out (NON_TEXT_002) | Does alt describe the destination/action, not the image appearance? |
| `complex` | No filtering — complex images are identified by content hints | Is the long description adequate for charts/diagrams? |

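`sole_content` on icon fonts: an icon that is the sole content of its control (`sole_content=True`), is marked `aria-hidden=True`, and has no `aria-label` leaves that control unlabeled. This is a critical case, and the LLM flags its severity.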
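
The four categories can be read as a small decision procedure. The sketch below is one way to express it, assuming that placement inside a link or button takes precedence and that a "complex" hint is available as a boolean; the function name and parameters are hypothetical.

```python
def categorize_image(alt, in_link_or_button=False, looks_complex=False):
    """Return the checklist category for an image, or None if Pass 1 already owns it."""
    if in_link_or_button:
        return "actionable"        # empty/missing alt is filtered later (NON_TEXT_002)
    if alt is None:
        return None                # missing alt: reported by NON_TEXT_001
    if looks_complex:
        return "complex"
    if alt.strip() == "":
        return "decorative"
    return "informative"

print(categorize_image("", in_link_or_button=True))
print(categorize_image(None))
print(categorize_image(""))
print(categorize_image("Students using a screen reader"))
print(categorize_image("Bar chart of enrolment by year", looks_complex=True))
```

Checking the link/button context first matters because an empty alt on a linked image is a failure, not a decorative image.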

In [19]:
cl03      = load_module("nontext_checklist_03", PREPROCESSING)
cl03_pay  = cl03.extract(str(HTML_FILE))
cl03_json = json.dumps(cl03_pay, indent=2)
cl03_tok  = estimate_tokens(cl03_json)

img = cl03_pay["images"]
total_images = sum(len(v) for v in img.values())

print("Checklist 03 — Non-text content summary")
print(f"  Images — informative   : {len(img['informative'])}")
print(f"  Images — decorative    : {len(img['decorative'])}")
print(f"  Images — actionable    : {len(img['actionable'])}  (in <a> or <button>)")
print(f"  Images — complex       : {len(img['complex'])}")
print(f"  Total images           : {total_images}")
print(f"  SVGs (non-hidden)      : {len(cl03_pay['svgs'])}")
print(f"  Icon fonts (unique)    : {len(cl03_pay['icon_fonts'])}")
print(f"  Media elements         : {len(cl03_pay['media'])}")
print()

flagged_imgs     = [i for i in img["informative"] if i["alt_flags"]]
unlabeled_icons  = [ic for ic in cl03_pay["icon_fonts"]
                    if ic["sole_content"] and ic["aria_hidden"] and not ic["aria_label"]]
broken_svgs      = [s for s in cl03_pay["svgs"] if not s["title"] and not s["aria_label"]]

print("  Pre-detected quality flags:")
print(f"    Informative images with alt_flags  : {len(flagged_imgs)}")
print(f"    Icons: sole_content + aria-hidden  : {len(unlabeled_icons)}  ← unlabeled controls")
print(f"    SVGs with no title or aria-label   : {len(broken_svgs)}")
print()
print(f"  Raw payload : {len(cl03_json):>8,} chars  |  ~{cl03_tok:>6,} tokens")

Checklist 03 — Non-text content summary
  Images — informative   : 60
  Images — decorative    : 10
  Images — actionable    : 236  (in <a> or <button>)
  Images — complex       : 0
  Total images           : 306
  SVGs (non-hidden)      : 5
  Icon fonts (unique)    : 12
  Media elements         : 0

  Pre-detected quality flags:
    Informative images with alt_flags  : 0
    Icons: sole_content + aria-hidden  : 1  ← unlabeled controls
    SVGs with no title or aria-label   : 5

  Raw payload :   76,663 chars  |  ~19,165 tokens


In [None]:
# ── CL03 prompt slices — with Pass 1 filters applied ─────────────────────────

# Filter: images with no alt at all (NON_TEXT_001) excluded from informative + decorative prompts
# In the extractor, images with alt=None don't appear in informative/decorative (they're
# in cl01_pay["images"]["missing_alt"]), so no further filtering needed for Prompts 1 & 2.

# Filter: actionable images with no/empty alt (NON_TEXT_002) excluded from Prompt 3
actionable_filtered = [
    im for im in img["actionable"]
    if im.get("alt") and str(im["alt"]).strip() and im["alt"] != "None"
]

cl03_slices = {
    "1. Informative Alt Quality":  {"images": img["informative"]},
    "2. Decorative Verification":  {"images": img["decorative"]},
    "3. Actionable Image Alt":     {"images": actionable_filtered},
    "4. Complex Descriptions":     {"images": img["complex"]},
    "5. SVG Accessibility":        {"svgs":   cl03_pay["svgs"]},
    "6. Icon Font Accessibility":  {"icon_fonts": cl03_pay["icon_fonts"]},
    "7. Media Captions":           {"media":  cl03_pay["media"]},
    "8. Overall Summary":          None,  # skipped — redundant with individual prompts
}

print("Checklist 03 — Token budget per prompt (after Pass 1 filtering)")
print(f"  {'Prompt':<30}  {'Items':>6}  {'~Tokens':>8}  Note")
print(f"  {'─'*64}")
orig_actionable = len(img["actionable"])
for name, data in cl03_slices.items():
    if data is None:
        print(f"  {name:<30}  {'—':>6}  {'—':>8}  SKIPPED (summary)")
        continue
    items = list(data.values())[0] if len(data) == 1 else []
    n = len(items) if isinstance(items, list) else 0
    t = estimate_tokens(json.dumps(data, indent=2))
    note = ""
    if name == "3. Actionable Image Alt" and orig_actionable != n:
        note = f"  (filtered from {orig_actionable}: {orig_actionable - n} had no/empty alt)"
    if n == 0:
        note = "  SKIPPED (no items)"
    print(f"  {name:<30}  {n:>6}  {t:>8,}{note}")

---
## Pass 2 — Claude API Calls

With the filtered payload slices ready, we now send each slice to the Claude API.

The `llm_client` module (`processing_scripts/llm_client/`) handles:
- Loading numbered prompt templates from `llm/*.txt`
- Filling the `{payload}` placeholder with serialised JSON
- Calling the Anthropic Messages API
- Stripping markdown code fences from responses
- Aggregating token usage across all calls

Configure **MODEL** and **TEMPERATURE** below to compare models or prompt strategies.
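
The template-filling and response-cleaning steps listed above can be sketched without making any API call. The implementation of `llm_client` is not shown in this notebook, so the function names and the fence-matching pattern below are assumptions.

```python
import json
import re

def fill_prompt(template, payload):
    """Insert serialised JSON at the {payload} placeholder.

    str.replace is used instead of str.format so that other literal braces
    in the template (for example a JSON schema) are left alone.
    """
    return template.replace("{payload}", json.dumps(payload, indent=2))

FENCE_MARK = "`" * 3  # three backticks, built indirectly so this example stays fence-safe
FENCE = re.compile(
    r"^\s*" + FENCE_MARK + r"[\w-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE_MARK + r"\s*$",
    re.DOTALL,
)

def strip_code_fences(text):
    """Return the body of a response wrapped in a markdown code fence."""
    match = FENCE.match(text)
    return match.group(1) if match else text.strip()

def parse_response(text):
    """Return (parsed, error); exactly one of the two is None."""
    try:
        return json.loads(strip_code_fences(text)), None
    except json.JSONDecodeError as exc:
        return None, str(exc)

prompt = fill_prompt('Evaluate:\n{payload}\nReturn {"issues": [...]}', {"flagged_links": []})
fenced = FENCE_MARK + 'json\n{"issues": []}\n' + FENCE_MARK
print(parse_response(fenced))
print(parse_response('{"issues": ["cut off mid-str'))
```

Returning the error instead of raising lets one malformed response be recorded against its prompt while the remaining calls continue.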

In [None]:
# ── API Configuration ─────────────────────────────────────────────────────────
# Adjust these to compare models or prompt strategies

MODEL       = "claude-haiku-4-5-20251001"   # model ID passed to the Anthropic Messages API
TEMPERATURE = 0.1                           # low temperature for consistent JSON output
REPORTS_DIR = PROJECT_ROOT / "reports"
REPORTS_DIR.mkdir(exist_ok=True)

print(f"Model       : {MODEL}")
print(f"Temperature : {TEMPERATURE}")
print(f"Reports dir : {REPORTS_DIR}")

In [None]:
# ── Set up LLM client ─────────────────────────────────────────────────────────
sys.path.insert(0, str(NOTEBOOK_DIR))
from llm_client import AuditClient, load_all_prompts, run_all as _run_all

all_prompts = load_all_prompts(PROMPTS_DIR)
client      = AuditClient(model=MODEL, temperature=TEMPERATURE)

for stem, prompts in sorted(all_prompts.items()):
    print(f"  {stem}: {len(prompts)} prompts loaded")
print(f"\nClient ready — {MODEL}")

In [None]:
# ── Run Pass 2 — all checklist prompts ───────────────────────────────────────
# ⚠️  This makes real API calls and incurs token costs.

all_slices = {
    "semantic_checklist_01": cl01_slices,
    "forms_checklist_02":    cl02_slices,
    "nontext_checklist_03":  cl03_slices,
}

api_report = _run_all(client, all_prompts, all_slices, verbose=True)

print(f"\n{'─'*50}")
print(f"  Total input tokens : {api_report['total_usage']['input_tokens']:,}")
print(f"  Total output tokens: {api_report['total_usage']['output_tokens']:,}")

In [None]:
# ── Save structured report ────────────────────────────────────────────────────
import datetime

timestamp   = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
model_slug  = MODEL.replace("/", "-")
report_path = REPORTS_DIR / f"audit_{timestamp}_{model_slug}.json"

full_report = {
    "metadata": {
        "timestamp":    timestamp,
        "model":        MODEL,
        "temperature":  TEMPERATURE,
        "html_file":    str(HTML_FILE),
        "html_bytes":   raw_bytes,
        "raw_tokens":   raw_tokens,
    },
    "pass_1_programmatic": {
        "issues": all_prog_issues,
        "total":  len(all_prog_issues),
    },
    "pass_2_llm": api_report,
}

report_path.write_text(json.dumps(full_report, indent=2), encoding="utf-8")
print(f"Report saved → {report_path}")
print(f"File size    : {report_path.stat().st_size:,} bytes")

---
## Combined Results

In [None]:
# ── Token reduction summary ───────────────────────────────────────────────────
cl01_json = json.dumps(cl01_pay, indent=2)
cl02_json = json.dumps(cl02_pay, indent=2)
cl03_json = json.dumps(cl03_pay, indent=2)

checklists_tok = {
    "CL01 Semantic":  cl01_json,
    "CL02 Forms":     cl02_json,
    "CL03 Non-text":  cl03_json,
}
combined_tok = sum(estimate_tokens(j) for j in checklists_tok.values())

print("=" * 64)
print("  TOKEN REDUCTION — PASS 2 PAYLOADS vs RAW HTML")
print("=" * 64)
print(f"\n  {'Checklist':<20}  {'Chars':>10}  {'~Tokens':>8}  {'vs raw':>8}")
print(f"  {'─'*54}")
for name, js in checklists_tok.items():
    tok = estimate_tokens(js)
    pct = tok / raw_tokens * 100
    print(f"  {name:<20}  {len(js):>10,}  {tok:>8,}  {pct:>7.1f}%")
print(f"  {'─'*54}")
print(f"  {'COMBINED (all 3)':<20}  {'':>10}  {combined_tok:>8,}  "
      f"{combined_tok/raw_tokens*100:>7.1f}%")
print(f"  {'RAW HTML':<20}  {raw_chars:>10,}  {raw_tokens:>8,}  {'100.0%':>8}")
print()
reduction = (1 - combined_tok / raw_tokens) * 100
print(f"  Token reduction : {reduction:.1f}%  ({raw_tokens // combined_tok}× smaller)")
print(f"  Max LLM calls   : 21  (empty slices after filtering are skipped)")
print(f"  Pass 1 findings : {len(all_prog_issues)} definitive issues (no LLM needed)")

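As a quick check on what this cell should report, the arithmetic can be redone with the estimates printed earlier in this walkthrough: 17,661, 2,493, and 19,165 payload tokens against 486,844 raw tokens. These are unfiltered payload sizes, so the per-prompt slices actually sent will differ.

```python
# Token estimates printed by the extraction cells and the raw-file metrics cell.
payload_tokens = {"CL01 Semantic": 17_661, "CL02 Forms": 2_493, "CL03 Non-text": 19_165}
raw_tokens = 486_844

combined = sum(payload_tokens.values())
reduction_pct = (1 - combined / raw_tokens) * 100
ratio = raw_tokens // combined

print(f"combined={combined:,}  reduction={reduction_pct:.1f}%  ({ratio}x smaller)")
# prints: combined=39,319  reduction=91.9%  (12x smaller)
```
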
In [28]:
# ── Pass 2 results summary ────────────────────────────────────────────────────
print("=" * 64)
print("  PASS 2 — LLM RESULTS")
print("=" * 64)

for stem, result in api_report["checklists"].items():
    n_done    = len(result["results"])
    n_skipped = len(result["skipped"])
    n_errors  = len(result["errors"])
    in_tok    = result["usage"]["input_tokens"]
    out_tok   = result["usage"]["output_tokens"]

    print(f"\n  {stem}")
    print(f"    Prompts run : {n_done}  |  skipped: {n_skipped}  |  errors: {n_errors}")
    print(f"    Tokens      : {in_tok:,} in / {out_tok:,} out")

    for label, response in result["results"].items():
        # Responses may be a dict or a list depending on the prompt schema
        if isinstance(response, dict):
            issues = response.get("issues") or response.get("findings") or []
        elif isinstance(response, list):
            issues = response
        else:
            issues = []
        n_issues = len(issues) if isinstance(issues, list) else "?"
        print(f"      [{label}]  {n_issues} issue(s) reported")

    if result["errors"]:
        for label, err in result["errors"].items():
            print(f"    ERROR [{label}]: {err}")

print(f"\n  {'─'*50}")
print("  Total API usage")
print(f"    Input tokens : {api_report['total_usage']['input_tokens']:,}")
print(f"    Output tokens: {api_report['total_usage']['output_tokens']:,}")
print(f"\n  Full report → {report_path}")

  PASS 2 — LLM RESULTS

  semantic_checklist_01
    Prompts run : 5  |  skipped: 1  |  errors: 1
    Tokens      : 38,669 in / 3,309 out
      [1. Page Title]  4 issue(s) reported
      [2. Heading Structure]  12 issue(s) reported
      [5. Iframe Titles]  15 issue(s) reported
      [6. Landmark Structure]  7 issue(s) reported
      [7. Combined Summary]  0 issue(s) reported
    ERROR [3. Link Clarity]: Unterminated string starting at: line 442 column 5 (char 13172)

  forms_checklist_02
    Prompts run : 3  |  skipped: 3  |  errors: 0
    Tokens      : 6,611 in / 3,419 out
      [1. Label Quality]  16 issue(s) reported
      [4. Required Indicators]  5 issue(s) reported
      [6. Overall Form Summary]  0 issue(s) reported

  nontext_checklist_03
    Prompts run : 6  |  skipped: 2  |  errors: 0
    Tokens      : 44,552 in / 10,597 out
      [1. Informative Alt Quality]  12 issue(s) reported
      [2. Decorative Verification]  10 issue(s) reported
      [3. Actionable Image Alt]  19 iss

---
## Architecture

```
home.html (1.9 MB, ~487k tokens)
    │
    │  PASS 1 — Programmatic (run first, milliseconds)
    │
    ├── programmatic/semantic_checklist_01.py   ──▶  29 rules  ──▶  binary findings
    ├── programmatic/forms_checklist_02.py      ──▶   7 rules  ──▶  binary findings
    └── programmatic/nontext_checklist_03.py    ──▶   7 rules  ──▶  binary findings
                                                           │
                                                           │  filter: items that failed a binary
                                                           │  check excluded from Pass 2
                                                           ▼
    │  PASS 2 — LLM Quality Evaluation (on filtered payloads)
    │
    ├── llm_preprocessing/semantic_checklist_01.py + semantic_checklist_01.txt  ──▶  up to 7 LLM calls
    ├── llm_preprocessing/forms_checklist_02.py    + forms_checklist_02.txt     ──▶  up to 6 LLM calls
    └── llm_preprocessing/nontext_checklist_03.py  + nontext_checklist_03.txt   ──▶  up to 8 LLM calls
                                                                                              │
                                                                                              ▼
                                                                                 merge all findings
                                                                                              │
                                                                                              ▼
                                                                                accessibility report
```
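
The hand-off between the two passes in the diagram can be sketched as follows. This is a minimal illustration, not the code in `programmatic/` or `llm_preprocessing/`: the function names `run_pass1` and `build_pass2_payload`, the rule id `img-missing-alt`, and the record shapes are all assumptions made for the example.

```python
# Hypothetical record shapes and function names; the real scripts may
# structure their findings and payloads differently.

def run_pass1(images):
    """Binary check: report every image that has no alt attribute at all."""
    return [
        {"rule": "img-missing-alt", "src": img["src"]}
        for img in images
        if img.get("alt") is None
    ]


def build_pass2_payload(images, pass1_findings):
    """Keep only images that passed the binary check.

    Items already reported by Pass 1 are dropped, so the LLM judges the
    quality of alt text that exists and never re-reports a missing one.
    """
    failed = {finding["src"] for finding in pass1_findings}
    return [img for img in images if img["src"] not in failed]


images = [
    {"src": "logo.png", "alt": "Vision Aid logo"},  # has alt: goes to Pass 2
    {"src": "hero.jpg", "alt": None},               # no alt: Pass 1 finding
    {"src": "spacer.gif", "alt": ""},               # empty alt: goes to Pass 2
]

findings = run_pass1(images)
payload = build_pass2_payload(images, findings)

print(f"Pass 1 findings : {len(findings)}")
print(f"Pass 2 inputs   : {[img['src'] for img in payload]}")
```

An image with an empty `alt` still reaches Pass 2 in this sketch, because whether it is truly decorative is a quality judgment and not a binary failure.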

## Summary

### Pipeline Overview

This notebook runs a complete two-pass WCAG accessibility audit:

| Pass | What it does | Speed | Output |
|------|-------------|-------|--------|
| **Pass 1** — Programmatic | 43 binary rule checks across 3 scripts | Milliseconds | Definitive violations |
| **Pass 2** — LLM | Up to 21 Claude API calls on filtered payloads | Seconds | Qualitative findings |

### Known Issues on This Page (visionaid.org homepage)

**Pass 1 — Programmatic findings** (binary, definitive):
- Run the Pass 1 cells above to see the full list

**Pass 2 — LLM evaluates:**
- **CL01**: Heading structure, 76 flagged links, 62 landmarks, iframe titles (iframes with no title are already reported by Pass 1 and excluded)
- **CL02**: 16 forms — label quality, placeholder detection, aria-describedby instructions
- **CL03**: Actionable images with non-empty alt (filtered), 5 SVGs with no accessible name, icon-font buttons

### Configuration

To compare models or prompts, update **MODEL** and **TEMPERATURE** in the config cell and re-run from that cell down.  
Reports are saved to `reports/audit_<timestamp>_<model>.json` — one file per run for easy comparison.
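
One way to build a per-run report path of that shape is sketched below. The helper name `build_report_path`, the `%Y%m%d_%H%M%S` timestamp format, and the character sanitising are assumptions for illustration; the notebook's config cell may construct the path differently.

```python
from datetime import datetime
from pathlib import Path
import re


def build_report_path(model, reports_dir=Path("reports"), now=None):
    """Return reports/audit_<timestamp>_<model>.json for one audit run.

    Hypothetical helper: the timestamp format and sanitising rule are
    assumptions, chosen so that filenames sort chronologically.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    # Replace characters that are awkward in filenames (spaces, slashes, colons).
    safe_model = re.sub(r"[^A-Za-z0-9._-]+", "-", model)
    return reports_dir / f"audit_{stamp}_{safe_model}.json"


# "example-model" is a placeholder, not a real model identifier.
path = build_report_path("example-model", now=datetime(2024, 1, 2, 3, 4, 5))
print(path)
```

Keeping the timestamp ahead of the model name means that a plain directory listing groups runs by time, which makes side-by-side comparison of two models on the same day straightforward.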