# Accessibility Report Generator — Usage

This notebook runs the LLM accessibility audit benchmark: download the dataset, list samples, run one or more provider×prompt combinations, and view accuracy/F1.

**Benchmark:** [Tabular Accessibility Dataset](https://www.mdpi.com/2306-5729/10/9/149) (Zenodo [10.5281/zenodo.17062188](https://doi.org/10.5281/zenodo.17062188)).

**Benchmark slices:** (1) **dynamic**: 9 samples (Angular, React, Vue, PHP). (2) **vue**: 25 delivery projects × component variants (96 samples in the run shown below). (3) **accessguru**: static HTML snippets with violations (from [AccessGuruLLM](https://github.com/NadeenAhmad/AccessGuruLLM)). On the command line, use `--slices all` or `--slices dynamic,vue,accessguru` to run all three slices, or `--slices dynamic` for the original 9 only. In this notebook, set the `SLICES` tuple in the Setup cell instead.

**Labels:** We derive binary **has_issues** per sample; full labels are on each sample as `reference_issues`.
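
As a minimal sketch of that derivation, assuming a sample counts as having issues whenever its `reference_issues` list is non-empty (the actual rule in `src.dataset` may differ). `Sample` here is a hypothetical stand-in, not the project's real class, and the issue label is a placeholder:

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical stand-in for the benchmark's sample type."""
    file_name: str
    reference_issues: list[str] = field(default_factory=list)

    @property
    def has_issues(self) -> bool:
        # Assumption: any reference issue makes the sample a positive.
        return len(self.reference_issues) > 0


accessible = Sample("vue-table-accessible.vue")
invalid = Sample("vue-table-invalid.vue", ["placeholder issue label"])
print(accessible.has_issues, invalid.has_issues)
```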

## 1. Setup

Run the **pip install** cell below first (once per environment). If you then get an `ImportError` for `google.genai` or `openai`, restart the kernel and rerun the cells. Then run the path/import cell.

In [1]:
# Install all provider SDKs (openai, google-genai) and dotenv. Run once.
%pip install -r requirements.txt -q


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.3[0m[39;49m -> [0m[32;49m26.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/opt/homebrew/opt/python@3.11/bin/python3.11 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
import sys
import importlib
from pathlib import Path

# Repo root: assumes the notebook is run from the repo root, so cwd is used
REPO_ROOT = Path.cwd()
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

# Reload so kernel picks up latest code after edits
if 'src.dataset' in sys.modules:
    importlib.reload(sys.modules['src.dataset'])
if 'src.runner' in sys.modules:
    importlib.reload(sys.modules['src.runner'])

from src.dataset import get_benchmark_available, load_benchmark_slices
from src.llm import get_llm
from src.runner import run_benchmark, write_benchmark_results
from scoring.score import score_binary, f1_binary

#SLICES = ("dynamic", "vue", "accessguru")
SLICES = ("dynamic", "vue")

## 2. Download benchmark (one-time)

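Run once. `scripts/download_benchmark.py` fetches the Zenodo archive (dynamic and vue slices), and `scripts/download_accessguru.py` fetches the AccessGuru CSVs (accessguru slice). Both scripts skip files that already exist.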

In [3]:
if get_benchmark_available(slices=SLICES):
    print("All slice data found.")
else:
    print("Run: %run scripts/download_benchmark.py   and/or   %run scripts/download_accessguru.py")

All slice data found.


In [4]:
%run scripts/download_benchmark.py
%run scripts/download_accessguru.py

Zip already exists: /Users/andrew/git/visionaid-a11y-llm-audit/data/LLM-WebAccessibility-v2.1.0.zip
Extracting...
Done. Benchmark path: /Users/andrew/git/visionaid-a11y-llm-audit/data/manuandru-LLM-WebAccessibility-eae77a6/Dynamic Generated Content
Exists: /Users/andrew/git/visionaid-a11y-llm-audit/data/accessguru/accessguru_dataset/accessguru_sampled_syntax_layout_dataset.csv
Exists: /Users/andrew/git/visionaid-a11y-llm-audit/data/accessguru/accessguru_dataset/accessguru_sampled_semantic_violations.csv
Exists: /Users/andrew/git/visionaid-a11y-llm-audit/data/accessguru/accessguru_dataset/Original_full_data.csv
Done. AccessGuru slice will use CSVs in /Users/andrew/git/visionaid-a11y-llm-audit/data/accessguru/accessguru_dataset


## 3. List samples

Inspect the loaded code samples and ground-truth labels (no API calls).

In [6]:
samples = load_benchmark_slices(slices=SLICES)
by_slice = {}
for s in samples:
    by_slice.setdefault(s.slice, []).append(s)
for sl in SLICES:
    print(f"  [{sl}] {len(by_slice.get(sl, []))} samples")
print(f"Total: {len(samples)} samples.\n")
for s in list(samples)[:15]:
    print(f"  {s.slice}  {s.file_name[:40]:<40} has_issues={s.has_issues}  lang={s.language}")
if len(samples) > 15:  # match the 15-row preview above
    print("  ...")

  [dynamic] 9 samples
  [vue] 96 samples
Total: 105 samples.

  dynamic  angular-table-accessible.js              has_issues=False  lang=js
  dynamic  angular-table-invalid.js                 has_issues=True  lang=js
  dynamic  php-table-accessible.php                 has_issues=False  lang=php
  dynamic  php-table-invalid.php                    has_issues=True  lang=php
  dynamic  react-table-accessible.js                has_issues=False  lang=js
  dynamic  react-table-invalid.js                   has_issues=True  lang=js
  dynamic  vue-table-accessible-llm.vue             has_issues=False  lang=vue
  dynamic  vue-table-accessible.vue                 has_issues=False  lang=vue
  dynamic  vue-table-invalid.vue                    has_issues=True  lang=vue
  vue  delivery-01_accessible-composition-api   has_issues=False  lang=vue
  vue  delivery-01_accessible-options-api       has_issues=False  lang=vue
  vue  delivery-01_minimal-composition-api      has_issues=True  lang=vue
  vue  deli

## 4. Run single audit

Set the API key for your provider (e.g. `GEMINI_API_KEY`, `DEEPSEEK_API_KEY`, `MOONSHOT_API_KEY`) in the environment or `.env`. Results are written to **results/** as JSON (full model responses + per-sample comparison) and TXT (summary). Token counts are printed and saved in the JSON summary.
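
A quick pre-flight check can catch a missing key before any tokens are spent. The sketch below is only an illustration: the provider-to-variable mapping is inferred from the names listed above (Kimi is assumed to use `MOONSHOT_API_KEY`), and `api_key_status` is a hypothetical helper, not part of `src.llm`. It reads only `os.environ`, so keys stored in `.env` must be loaded first, for example with python-dotenv's `load_dotenv()`.

```python
import os

# Assumed mapping, based on the variable names mentioned above.
PROVIDER_ENV_VARS = {
    "gemini": "GEMINI_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "kimi": "MOONSHOT_API_KEY",
}


def api_key_status(provider, env=None):
    """Return (variable_name, is_set) for a provider; hypothetical helper."""
    env = os.environ if env is None else env
    var = PROVIDER_ENV_VARS[provider]
    # Treat empty or whitespace-only values as missing.
    return var, bool(env.get(var, "").strip())


var, is_set = api_key_status("deepseek")
print(f"{var}: {'set' if is_set else 'MISSING'}")
```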

In [7]:
provider = "deepseek"   # or "anthropic", "gemini", "kimi"
prompt_name = "audit_binary"  # or "audit_with_reason", "audit_wcag_focused"

llm = get_llm(provider)
print(f"Running: provider={llm.provider}, model={llm.default_model}, prompt={prompt_name}")
results = run_benchmark(llm, prompt_name=prompt_name, slices=SLICES)
metrics = score_binary(results)
f1 = f1_binary(metrics)

# Write solutions and comparison to results/ (JSON + TXT)
total_in, total_out = write_benchmark_results(results, provider=provider, prompt_name=prompt_name, model=llm.default_model)

print(f"\nAccuracy: {metrics.accuracy:.2%}")
print(f"F1 (has_issues): {f1:.2%}")
print(f"TP={metrics.tp} TN={metrics.tn} FP={metrics.fp} FN={metrics.fn}  unclear={metrics.unclear}")
print(f"Tokens used: input={total_in:,}  output={total_out:,}")
print("Results written to results/ (JSON + TXT)")

Running: provider=deepseek, model=deepseek-chat, prompt=audit_binary

Accuracy: 49.52%
F1 (has_issues): 66.24%
TP=52 TN=0 FP=53 FN=0  unclear=0
Tokens used: input=85,893  output=630
Results written to results/ (JSON + TXT)
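
The counts above show that the model answered has_issues for all 105 samples (TN=0, FN=0), which is why accuracy sits at the positive-class share while F1 looks higher. The sketch below recomputes both numbers from the confusion counts using the standard definitions. `binary_metrics` is a hypothetical helper and may not match how `scoring.score` treats `unclear` responses.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


acc, prec, rec, f1 = binary_metrics(tp=52, tn=0, fp=53, fn=0)
print(f"accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%} F1={f1:.2%}")
# accuracy=49.52% precision=49.52% recall=100.00% F1=66.24%
```

A classifier that always answers has_issues would score the same on this split, so this run is best read as a baseline.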


## 5. Compare all provider × prompt combinations

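Runs OpenAI and Anthropic with each of the three prompts and prints a comparison table. All six runs succeed only if both API keys are set; a run that raises an exception is still listed, with N/A for accuracy and F1 and the start of the error message in the last column.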

In [None]:
PROVIDERS = ["openai", "anthropic"]
PROMPTS = ["audit_binary", "audit_with_reason", "audit_wcag_focused"]

rows = []
for prov in PROVIDERS:
    for pr in PROMPTS:
        try:
            llm = get_llm(prov)
            res = run_benchmark(llm, prompt_name=pr, slices=SLICES)
            m = score_binary(res)
            f1 = f1_binary(m)
            rows.append((prov, pr, m.accuracy, f1, m.unclear))
        except Exception as e:
            rows.append((prov, pr, float("nan"), float("nan"), str(e)))

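# Header widths must match the row format below (12+22+10+8+12 plus 4 spaces = 68).
print(f"{'Provider':<12} {'Prompt':<22} {'Accuracy':>10} {'F1':>8} {'Unclear':>12}")
print("-" * 68)
for prov, pr, acc, f1, unclear in rows:
    # NaN != NaN, so `x == x` is False only for failed runs recorded as NaN.
    acc_s = f"{acc:.2%}" if isinstance(acc, float) and acc == acc else "N/A"
    f1_s = f"{f1:.2%}" if isinstance(f1, float) and f1 == f1 else "N/A"
    # For failed runs, `unclear` holds the error message; truncate it to fit the column.
    u_s = str(unclear) if isinstance(unclear, int) else str(unclear)[:12]
    print(f"{prov:<12} {pr:<22} {acc_s:>10} {f1_s:>8} {u_s:>12}")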
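
The `x == x` comparisons in the table code are a NaN test, since NaN is the only float value not equal to itself. A more explicit variant, sketched here with a hypothetical `fmt_pct` helper that is not part of the repository, uses `math.isnan`:

```python
import math


def fmt_pct(value):
    """Format a fraction as a percentage, or 'N/A' for NaN and non-floats."""
    if isinstance(value, float) and not math.isnan(value):
        return f"{value:.2%}"
    return "N/A"


print(fmt_pct(0.4952), fmt_pct(float("nan")))
```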