Rebase main onto improvement-plan work: cleanup, phases 0-4.4, Phase 5 design + scaffolding#5
Merged
Conversation
- Add 'pip install -r requirements.txt' to 3 workflows that were missing the readability-lxml dependency added in the v2 scoring overhaul (standards-compliance, instant-demo, component-analysis)
- Remove duplicate Python code block in docs-eval action.yml that leaked outside the heredoc, causing bash syntax errors

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Includes prioritized issues based on agent-perspective validation of Learn vs. competitive documentation scoring.
- High: no-JS fetch scoring, dual profiles, content-type-aware scoring
- Medium: extraction output, JSON-LD validation, template analysis
- Nice-to-have: auto-discovery, trend tracking, LLM extraction test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace all YARA/Lighthouse branding with Clipper across 18 files
- Update __init__.py version to 3.0.0 with accurate 6-pillar description
- Delete unused hybrid_score.py (legacy Lighthouse/PageSpeed dependency)
- Remove dead methods from access_gate_evaluator.py: _evaluate_http_compliance, _evaluate_content_quality, _test_content_negotiation, _check_encoding_declaration, _is_external_link
- Remove dead _evaluate_content_quality_sync from performance_evaluator.py
- Fix unreachable code block in _evaluate_http_compliance_enhanced_async
Previously, any Disallow line was matched regardless of its User-agent block, causing incorrect penalties when rules targeted only specific bots. The check now parses User-agent blocks properly and applies only the rules from the wildcard (*) block, using longest-match precedence for Allow/Disallow per the robots.txt specification.
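A minimal sketch of that wildcard-block parsing with longest-match precedence; the function name and the simplified grouping of User-agent lines are illustrative, not the evaluator's actual code (which lives in retrievability/access_gate_evaluator.py).

```python
# Illustrative only: apply rules from the "User-agent: *" block and let the
# longest matching Allow/Disallow prefix win. Grouping of consecutive
# User-agent lines is simplified relative to the full robots.txt spec.
def is_path_allowed(robots_txt: str, path: str) -> bool:
    rules = []            # (directive, path_prefix) pairs from the * block
    in_wildcard = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            in_wildcard = value == "*"
        elif in_wildcard and field in ("allow", "disallow"):
            rules.append((field, value))

    best = None           # (prefix_length, directive) of the most specific match
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            if best is None or len(prefix) > best[0]:
                best = (len(prefix), directive)
    return True if best is None else best[1] == "allow"
```

With a Disallow under "User-agent: googlebot" only, a wildcard-agent fetch of the same path stays allowed, which is exactly the over-penalty case the fix removes.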
Replace the outdated YARA 2.0 / Lighthouse documentation with an accurate description of the current Clipper scoring system: pillar weights, sub-signal breakdowns, score classifications, audit trail format, and architecture overview.
Three bugs caused most URLs to score 0 in batch evaluations:
1. Destructive 75s outer timeout: asyncio.wait_for wrapped the entire evaluation, discarding all partial results when it fired. Removed — individual component timeouts (60s fast, 60s browser) are sufficient.
2. Thread pool starvation: ThreadPoolExecutor(max_workers=4) couldn't handle 5 URLs x 6 tasks per URL. Browser evaluations held threads for 20-30s, starving fast tasks. Increased to max_workers=24.
3. Event loop blocking: Chrome/Selenium calls (driver.get, axe.run) ran directly in the async event loop, freezing all other coroutines. Moved the entire axe evaluation to run_in_executor, along with WebDriver creation in the pool.

Also: removed dead self.http_client code (set but never read), and stopped creating a new AccessGateEvaluator per HTTP compliance call (PerformanceOptimizedEvaluator already inherits from it).
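A rough sketch of the run_in_executor change described in item 3; the helper names and the stubbed Selenium calls are illustrative stand-ins, not the actual evaluator API.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=24)   # sized per item 2 above


def evaluate_axe_blocking(url: str) -> dict:
    # Stand-ins for the blocking calls named above:
    # driver = make_driver()          # WebDriver creation blocks
    # driver.get(url)                 # page load blocks
    # results = Axe(driver).run()     # axe JS execution blocks
    return {"url": url, "violations": []}


async def evaluate_axe(url: str) -> dict:
    loop = asyncio.get_running_loop()
    # The whole blocking evaluation runs on a worker thread, so the event
    # loop stays free to drive the fast HTTP-only pillar tasks.
    return await loop.run_in_executor(executor, evaluate_axe_blocking, url)
```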
Expand HTTP Compliance scoring with a new 'Agent Content Hints' sub-signal (0-20 points) that detects whether pages declare machine-readable alternate formats or LLM-specific endpoints.
Signals detected:
- <link rel='alternate' type='text/markdown'> (6 pts)
- <meta name='markdown_url'> (4 pts) — e.g. Microsoft Learn
- data-llm-hint attributes (4 pts)
- llms.txt references (3 pts)
- Non-HTML <link rel='alternate'> (3 pts)

Existing sub-signals rebalanced: HTML reachability 20->15, redirect efficiency 30->25, crawl permissions 25->20, cache headers 25->20.
Also fixes remaining Unicode emoji in terminal output (performance_score.py, cli.py, score.py) that caused charmap encoding errors on Windows.
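A BeautifulSoup sketch of how that sub-signal could be scored. Point values mirror the list above; the helper name and the exclusion of markdown alternates from the generic non-HTML check are assumptions, not the evaluator's exact logic.

```python
from bs4 import BeautifulSoup


def score_agent_content_hints(html: str) -> int:
    """Illustrative 0-20 pt scorer; point values mirror the commit message."""
    soup = BeautifulSoup(html, "html.parser")
    alternates = [
        link for link in soup.find_all("link")
        if "alternate" in (link.get("rel") or [])
    ]
    points = 0
    if any(link.get("type") == "text/markdown" for link in alternates):
        points += 6                                    # markdown alternate
    if soup.find("meta", attrs={"name": "markdown_url"}):
        points += 4                                    # e.g. Microsoft Learn
    if soup.find(attrs={"data-llm-hint": True}):
        points += 4                                    # element-level LLM hints
    if "llms.txt" in html:
        points += 3                                    # llms.txt reference
    if any(link.get("type") and "html" not in link["type"]
           and "markdown" not in link["type"] for link in alternates):
        points += 3                                    # other non-HTML alternates
    return min(points, 20)
```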
…ent hints
- README: Add agent content hints to HTTP Compliance sub-signals table, update point allocations (15/25/20/20/20), document detected signals
- USER-INSTRUCTIONS: Replace old 5-pillar description with current 6-pillar model, update score format example with real component names, rewrite HTTP compliance section with 5 sub-signals including agent content hints
…ifacts
Per the engineering audit (docs/engineering-audit.md), prune accumulated drift while leaving the core engine and documentation intact.
Removed:
- archived-tests/ (14 subdirs of committed test artifacts + chromedriver.exe)
- Root-level progress/demo markdown (CLIPPER-DEMO-*.md, PERFORMANCE-OPTIMIZATION-*.md, WORKSPACE-CLEANUP-SUMMARY.md, clipper-scoring-overhaul-prompt.md)
- Duplicate root copilot-instructions.md (canonical copy lives in .github/)
- Dead demo scripts: run_clipper_demo.py, github_integration.py + GITHUB-INTEGRATION.md
- Unused scripts/ experiments: hybrid-framework*, boilerpipe-comparison, lighthouse-comparison, _check_scores
- Redundant demo GitHub workflows: clipper-comprehensive-demo, clipper-instant-demo, clipper-interactive-demo, clipper-enterprise-audit, example-usage
Updated:
- .gitignore: block archived-tests/ and results/ from being re-committed
- README.md: file-structure section matches reality; CLI example points to urls/clipper-demo-urls.txt (the actual file path)
Added:
- docs/engineering-audit.md: senior-engineer code + repo audit with Azure migration plan

Verified: all retrievability modules still import cleanly.
Sequences the clipper-improvement-issues.md backlog into 6 phases with prerequisites, exit criteria, and PR-sized scope per phase.
Key differences from the original backlog:
- Issues #1 and #2 merged into a single 'rendering-mode dimension'
- Issue #3 promoted to P0 (biggest accuracy improvement)
- Issue #6 promoted (transforms report usability at Learn scale)
- Issue #9 reframed from 'nice-to-have' to strategic ground-truth validation
- Issue #7 rejected (measurement tool, not discovery tool)
- Three missing prerequisites added: test suite, failure-mode transparency, evaluator reproducibility
- Azure migration integrated as Phase 5 between refinement and LLM work
Replace t-shirt sizing and 'one PR' shorthand with session counts, where a session is ~1-2 hours of focused Copilot-assisted work producing a landed PR or committed deliverable.
Summary:
- Phases 0-4 executable through this workflow: ~13-18 sessions total
- Phase 5 (Azure): code portion ~8-12 sessions, deployment/hardening is human work
- Phase 6 (LLM): ~2-3 sessions to scaffold, research time on top
Each phase that changes user-visible behavior now lists the specific docs that must land alongside the code (README, USER-INSTRUCTIONS, docs/scoring.md, new docs/testing.md / storage.md / deployment.md). Keeps documentation from drifting behind implementation.
Adds tests/ with 11 minimal HTML fixtures (one per pillar success/failure shape), a pytest harness that parses and scores each fixture offline, and 27 assertions covering both absolute score ranges and good-vs-bad ordering for every pillar. Tests complete in <1s, no network, no browser.
- tests/conftest.py: shared score_fixture helper
- tests/test_pillars.py: 27 tests (ranges + ordering + audit-trail checks)
- tests/_calibrate.py: one-off helper for picking range bounds
- pytest.ini: testpaths + strict markers
- requirements-dev.txt: pulls in pytest
- .github/workflows/tests.yml: CI runs pytest on PR and main
- docs/testing.md: layout, rationale, regression-catch sanity check
- README.md: contributor 'Running tests' subsection

Regression sanity verified: deliberately sabotaging _evaluate_semantic_html causes the semantic-HTML range and ordering tests to fail, confirming the suite catches real breakage rather than just asserting tautologies.
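The range-plus-ordering pattern might look roughly like this; the fixture filenames, the score_fixture signature, the thresholds, and the component_scores access are assumptions about tests/conftest.py and ScoreResult, not copies of the real tests.

```python
# Hypothetical fixture names and bounds; shows the range + ordering pattern.
def test_semantic_html_good_in_range(score_fixture):
    result = score_fixture("semantic_html_good.html")
    assert 70 <= result.component_scores["semantic_html"] <= 100


def test_semantic_html_good_beats_bad(score_fixture):
    good = score_fixture("semantic_html_good.html")
    bad = score_fixture("semantic_html_bad.html")
    assert (good.component_scores["semantic_html"]
            > bad.component_scores["semantic_html"])
```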
Pillar failures used to return 0.0 silently, which contaminated every aggregate downstream. They now raise PillarEvaluationError; the orchestrator drops the failing pillar from the weighted average, lists it in failed_pillars, and sets partial_evaluation=True.
- retrievability/schemas.py: ScoreResult gains partial_evaluation (bool) and failed_pillars (List[str])
- retrievability/access_gate_evaluator.py: new PillarEvaluationError, sync orchestrator rewritten, per-pillar outer excepts raise instead of returning 0.0, failure_mode classifier learns 'partial_evaluation' tier, _capture_environment records clipper/python/platform and library versions plus browser/axe versions when the live axe path ran
- retrievability/performance_evaluator.py: async orchestrator mirrors the same behavior with renormalization + failed_pillars collection
- retrievability/cli.py: result line shows [PARTIAL] and a (failed: ...) suffix when any pillar failed

Tests: 5 new tests in tests/test_failure_modes.py cover baseline, single-pillar failure (surviving scores unchanged, renormalized final score), all-pillars failure (evaluation_error + empty component_scores), environment metadata presence, and per-pillar audit recording on failure. Existing 27 pillar tests continue to pass (32 total, <1.2s). Smoke test (live URL, performance mode): evaluation completes, partial_evaluation=false, _environment populated with expected keys.

Docs: docs/scoring.md gets new 'Partial Evaluations' and 'Environment Metadata' sections describing the three-outcome contract, renormalization, CLI tagging, and the audit-trail environment block. README.md scoring-classification list adds entries for partial_evaluation and evaluation_error.
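The renormalization contract in a few lines; the weights here are hypothetical and the arithmetic is illustrative, not the orchestrator's actual helper.

```python
def weighted_score(pillar_scores: dict, weights: dict, failed: set) -> float:
    """Drop failed pillars and rescale the surviving weights to sum to 1."""
    surviving = {p: w for p, w in weights.items() if p not in failed}
    total = sum(surviving.values())
    if total == 0:
        raise ValueError("all pillars failed -> evaluation_error")
    return sum(pillar_scores[p] * (w / total) for p, w in surviving.items())


# Example with hypothetical weights: metadata fails, the rest renormalize over 0.8.
weights = {"semantic_html": 0.4, "structured_data": 0.4, "metadata": 0.2}
scores = {"semantic_html": 80.0, "structured_data": 60.0}
print(weighted_score(scores, weights, failed={"metadata"}))   # 70.0
```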
Each phase section now carries an explicit Status line, and the summary table gains a Status column. 0.1, 0.2, 0.3 marked completed with commit refs; everything else marked not started (or rejected for Issue #7).
Clipper now detects the content type of every evaluated page and applies a per-type weight profile to produce a type-adjusted parseability_score. The default article weights are still reported as universal_score for backward compatibility and cross-type comparisons.
- retrievability/profiles.py (new): six profiles (article, landing, reference, sample, faq, tutorial), each row summing to 1.0; detect_content_type(soup, url) with precedence ms.topic > JSON-LD @type > URL path > DOM heuristics > default article; full detection trace returned alongside the profile name.
- retrievability/schemas.py: ScoreResult gains content_type (str) and universal_score (Optional[float]).
- retrievability/access_gate_evaluator.py and performance_evaluator.py: orchestrators detect content type, apply the matched profile weights via a shared _weighted_score helper (renormalized over surviving pillars, compatible with the Phase 0.2 partial-evaluation contract), and record profile/detection/weights in audit_trail._content_type.
- tests/test_content_types.py (new) + 4 fixtures: detection tests across ms.topic, JSON-LD, URL, and default; integration tests asserting ScoreResult shape and that type-adjusted weighting produces different numbers than article weights on divergent pillar vectors; parameterized invariant that every profile row sums to 1.0. 49/49 tests green (<8s).
- Live calibration: Microsoft Learn /samples/browse/ is now detected as 'sample' and scores 53.6 (parseability) vs 48.7 (universal). A quickstart redirect chain resolves to a 'tutorial'-typed page scoring 64.1 vs 60.4.
- Docs: docs/scoring.md gains a 'Content-type profiles' section with the full weight table, detection precedence, and the parseability vs universal explanation. README.md scoring section now leads with the two-number contract. USER-INSTRUCTIONS.md gains a 'How the score is chosen for a page' walkthrough with a sample audit_trail._content_type snippet.
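A simplified sketch of that detection precedence. The real detect_content_type in retrievability/profiles.py returns a full detection trace; the heuristics, topic mappings, and (profile, source) return shape below are placeholders chosen for illustration.

```python
import json
from bs4 import BeautifulSoup


def detect_content_type(soup: BeautifulSoup, url: str) -> tuple[str, str]:
    """Simplified (profile, source) detector following the stated precedence."""
    ms_topic = soup.find("meta", attrs={"name": "ms.topic"})
    if ms_topic and ms_topic.get("content"):          # 1. ms.topic
        topic = ms_topic["content"].lower()
        if topic in ("tutorial", "quickstart"):
            return "tutorial", "ms_topic"
        if topic == "reference":
            return "reference", "ms_topic"

    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except ValueError:
            continue
        if isinstance(data, dict) and data.get("@type") == "FAQPage":
            return "faq", "schema_type"               # 2. JSON-LD @type

    if "/samples/" in url:
        return "sample", "url"                        # 3. URL path
    if soup.find("form") and soup.find("h1") is None:
        return "landing", "dom"                       # 4. crude DOM heuristic
    return "article", "default"                       # 5. default profile
```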
Content-extractability scores are hard to interpret in isolation. A 33/100 makes instant sense when you can see the three sentences Readability actually pulled out. This change surfaces that preview both in the structured audit trail (consumable by downstream tooling) and in the generated markdown report (consumable by humans).
- retrievability/access_gate_evaluator.py: _evaluate_content_extractability now records extracted_preview (first ~300 chars) and extracted_chars (total extracted length) under audit_trail.content_extractability.extraction_metrics. Existing extracted_text_length is kept for backward compatibility.
- retrievability/report.py: markdown report inserts an 'Extracted Preview' block under each URL, annotated with the extractability sub-score and character count. The block degrades gracefully when no preview is available (e.g., old score files, or an empty extraction).
- tests/test_pillars.py: new test asserting the preview is a non-empty string, capped at ~300 chars, and extracted_chars is a positive int. 50/50 tests green (~15s).
- Live smoke: the Foundry Models overview page renders as '**Extracted Preview** (extractability 84.5/100, 69,751 chars extracted):' followed by the opening sentence of the article.
- Docs: docs/scoring.md extractability section documents the new audit keys and report block. Plan status flipped to completed.
On multi-URL evaluations the report now detects per-pillar score clusters and hoists them into a top 'Template Findings' section. Each cluster is a group of three or more pages scoring the same low value (rounded, within 1 point) on one pillar; when that many pages share the same weak score on semantic_html or structured_data, the finding almost certainly lives in the shared CMS template rather than in per-page authoring, so fixing the template lifts the whole cluster.
- retrievability/report.py: _detect_template_clusters groups by (pillar, rounded_score) rather than full 6-tuple match (real Learn corpora rarely repeat full tuples but routinely share individual weak pillar scores, e.g. structured_data=12 on 14/16 pages). Shared *good* scores are not surfaced as findings. _format_template_section renders a summary table plus per-cluster detail with estimated per-page uplift (weighted by the default article profile). The per-page section now annotates each page's template-covered pillars rather than hiding the page, so unique variation stays visible alongside template context. Single-URL runs keep the original 'Individual Page Results' section unchanged.
- tests/test_report_clusters.py: 9 unit tests covering shared-weak clustering, shared-good suppression, the min-three-member rule, divergent values not clustering, 1pt tolerance, uplift math (weight x gap), partial-pillar exclusion, and markdown rendering. 59/59 tests green (~9s).
- Live smoke (16 Learn URLs): the report now leads with 'Cluster 1: structured_data = 12/100 (14 pages)' and 'Cluster 2: dom_navigability = 35/100 (9 pages)', and every per-page entry in those clusters carries a 'Template-covered pillars:' line pointing back to the top section.
- Docs: USER-INSTRUCTIONS.md gains a 'Reading a report' section explaining the Template Findings / Page-Specific Findings split with an abbreviated example. Plan 2.1 flipped to completed.

Design note: the plan specified full-tuple clustering within 1pt; per-pillar clustering was chosen instead because real corpora show individual weak pillars repeating far more than full tuples do, which is where the actionable template signal lives.
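The per-pillar clustering rule, reduced to a sketch. The weak-score threshold, the use of rounding as the 1-point tolerance, and the function shape are assumptions for illustration, not the report module's actual implementation.

```python
from collections import defaultdict


def detect_template_clusters(pages, weak_threshold=50, min_members=3):
    """pages: iterable of (url, {pillar: score}). Returns weak-score clusters."""
    buckets = defaultdict(list)
    for url, pillar_scores in pages:
        for pillar, score in pillar_scores.items():
            if score < weak_threshold:        # shared *good* scores are not findings
                # Rounding approximates the 'within 1 point' tolerance.
                buckets[(pillar, round(score))].append(url)
    return {key: urls for key, urls in buckets.items() if len(urls) >= min_members}


pages = [(f"page{i}", {"structured_data": 12.0, "metadata": 80.0}) for i in range(14)]
print(detect_template_clusters(pages))   # {('structured_data', 12): ['page0', ...]}
```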
…rting
Adds RenderMode to ScoreResult, --render-mode raw|rendered|both on express and score, and a 'Rendering-Mode Deltas' section in the report that flags pages with |rendered - raw| >= 15 as JS-dependent. Raw mode forces the DOM Navigability pillar through static analysis (no browser, no axe); all other pillars are unchanged. Scoping decision documented in docs/scoring.md: today's 'rendered' mode is a hybrid (only dom_navigability actually runs in a browser); true JS-rendered text-pillar scoring is a follow-up. 69 pytest passing; live smoke on a Learn URL confirms the new report section renders correctly.
Structured-data Field Completeness (30 pts) is now computed per JSON-LD item against required + recommended field tables for the four validated @type values: Article, FAQPage, HowTo, BreadcrumbList. FAQPage requires a non-empty mainEntity of Question items with acceptedAnswer; BreadcrumbList requires itemListElement with >=2 entries; HowTo requires a non-empty step list. Items of other types fall back to the legacy generic key-field check so exotic schemas are not over-penalized. Missing and structurally invalid fields are logged in audit_trail.structured_data.field_completeness_detail. Exit criterion verified by new tests: an FAQPage with empty mainEntity scores below a complete one, and mainEntity appears in the audit trail's missing/invalid lists. 77 pytest passing (69 baseline + 8 new).
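A sketch of what that per-type check amounts to. The required-field table, the point math, and the fallback heuristic are illustrative, not the evaluator's exact tables.

```python
# Illustrative per-type field-completeness check (30 pts max).
REQUIRED_FIELDS = {
    "Article": ["headline", "datePublished", "author"],
    "FAQPage": ["mainEntity"],
    "HowTo": ["name", "step"],
    "BreadcrumbList": ["itemListElement"],
}


def field_completeness(item: dict, max_points: int = 30):
    """Return (points, missing_or_invalid_fields) for one JSON-LD item."""
    item_type = item.get("@type")
    required = REQUIRED_FIELDS.get(item_type)
    if required is None:
        # Other @type values fall back to a generic key-presence check so
        # exotic schemas are not over-penalized.
        generic = {"name", "description", "url"} & set(item)
        return max_points * len(generic) / 3, []

    problems = {f for f in required if not item.get(f)}
    # Structural validity beyond mere presence, per the commit message:
    if item_type == "FAQPage":
        questions = item.get("mainEntity") or []
        if not questions or not all(
            isinstance(q, dict) and q.get("acceptedAnswer") for q in questions
        ):
            problems.add("mainEntity")
    if item_type == "BreadcrumbList" and len(item.get("itemListElement") or []) < 2:
        problems.add("itemListElement")
    if item_type == "HowTo" and not (item.get("step") or []):
        problems.add("step")

    points = max_points * (len(required) - len(problems)) / len(required)
    return points, sorted(problems)


faq = {"@type": "FAQPage", "mainEntity": []}
print(field_completeness(faq))   # (0.0, ['mainEntity'])
```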
- Structured Data Sub-Signals table now documents per-type Field Completeness (Phase 4.1) with the Article/FAQPage/HowTo/BreadcrumbList expectations table.
- Scoring System gains a Rendering Modes section covering --render-mode raw|rendered|both and the pessimistic min(rendered, raw) default (Phase 3.1).
- Remove dead CONTRIBUTING.md and LICENSE links; neither file exists in the repo.
Adds a 'Profile Impact' summary table plus a per-page headline line that shows both the profile-weighted parseability_score and the article-default universal_score with their delta. The section is suppressed when every page was scored under the default article profile, so single-profile runs stay compact. Makes the content-type classifier's contribution to the headline number legible rather than buried in audit JSON, which is the main defense against a brittle-classifier critique: readers can see exactly how much the profile is moving each score. 83 pytest passing (77 + 6 new).
…er lockdown added, Azure demoted to Phase 6
Key changes:
- Insert Phase 4.3 (content-type detector lockdown test) as a hard prerequisite for any correlation work against structural pillars.
- Swap Phases 5 and 6: LLM ground-truth validation becomes Phase 5.1; Azure migration becomes Phase 6. LLM validation needs an inference endpoint, not a deployed service, so the prior 'depends on 5' coupling was false.
- Phase 5.1 scope hardened with the critique from the external LLM-validation proposal review: two-path methodology (extractability vs retrievability), three axes reported separately (grounding / completeness / unsupported-claim rate) with no hand-picked composite, mandatory human calibration gate (>=80% agreement on 20-30 pages), mandatory 5-run variance check (sigma <= 5), Spearman correlation with N>=50 per content-type profile, cost ceiling + caching story, and GitHub Models (gpt-4o-mini + gpt-4o) as default backend with Azure OpenAI as the production option.
- Fix phantom 'Storage abstraction' checkmark in the Azure section; 4.2 is still not started.
- Update sequencing rationale, summary table, and execution notes to reflect the new order.
Locks detect_content_type() output against a golden file derived from the learn-analysis-v3 and competitive-analysis-v3 snapshot corpora.
- New scripts/generate-classifier-golden.py writes tests/fixtures/classifier_corpus_golden.json and supports --check for CI/pre-commit drift detection.
- New tests/test_classifier_lockdown.py parametrizes over 22 URLs covering every profile (article, landing, reference, sample, faq, tutorial) and every detection source (ms_topic, schema_type, url, dom, default). A classifier shift fails CI with the offending URL + signal named.
- Hand-review of the golden surfaced no classifier bugs; behavior is preserved as-is per plan scope (lock, don't tune).
- Fixes a pre-existing regression in access_gate_evaluator._evaluate_structured_data where fields_found was referenced unconditionally and field_completeness_detail was missing from the audit trail expected by Phase 4.1 tests.
- docs/scoring.md gains a Detector stability subsection; docs/testing.md documents the regeneration workflow.

Full suite: 107 passed (was 83 before Phase 4.3 additions + 3 previously failing due to the fields_found bug).
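The lockdown test's shape is roughly the following. The golden entry keys ("url", "profile", "source"), the snapshot_soup fixture, and the assumed (profile, trace) return shape of detect_content_type are all assumptions; the real test lives in tests/test_classifier_lockdown.py.

```python
import json
from pathlib import Path

import pytest

from retrievability.profiles import detect_content_type  # exists per the Phase 1 work

GOLDEN = json.loads(
    Path("tests/fixtures/classifier_corpus_golden.json").read_text(encoding="utf-8")
)


@pytest.mark.parametrize("entry", GOLDEN, ids=lambda e: e["url"])
def test_classifier_matches_golden(entry, snapshot_soup):
    # snapshot_soup is a hypothetical fixture that parses the committed HTML
    # snapshot for a URL. Return shape (profile, trace) is assumed; the commit
    # only says the detection trace is returned alongside the profile name.
    profile, trace = detect_content_type(snapshot_soup(entry["url"]), entry["url"])
    assert profile == entry["profile"], f"profile drift on {entry['url']}"
    assert trace.get("source") == entry["source"], f"signal drift on {entry['url']}"
```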
Scopes a narrow audit-and-prune pass across pillar evaluators for vendor-specific signals accepted as evidence alongside generic ones. Primary confirmed target: ms.topic inside the metadata pillar's topic/category field check (access_gate_evaluator.py:1334). ms.topic is a Learn-internal authoring signal (page role, CMS template hint), not a semantic topic declaration; accepting it alongside meta:keywords/category/articleSection creates asymmetric scoring with no symmetric recognition of equivalent conventions on other doc systems.

Scope is deliberately narrow: audit, classify (equivalent / not equivalent / belongs elsewhere), remove non-equivalent vendor signals from pillar scoring, add two fixture tests (ms.topic alone scores zero, ms.topic + generic scores same as generic alone), re-run both corpora and report the before/after delta. Non-goals include removing ms.topic from the classifier (legitimate use) and adding symmetric vendor recognition (separate work).

Depends on 4.3 (classifier locked so 4.4 side effects are observable). Prerequisite for Phase 5 (LLM correlations are uninterpretable on pillars that silently favor one vendor).
The prior instructions predated content-type profiles: they listed the old 5-pillar weights, described only parseability_score, and gave no guidance on cross-vendor or cross-corpus comparisons. That's why Copilot-authored comparison reports (learn-vs-competitors-v3) used parseability_score for cross-vendor deltas, omitted per-page profile assignments, and mixed exemplars into competitor averages.
Adds:
- Current 6-pillar structure and default (article) weights.
- The two-score model: parseability_score (same-type comparisons) vs universal_score (cross-vendor / cross-corpus comparisons), with an explicit rule that cross-vendor deltas must use universal_score.
- Required disclosures for comparison reports: headline score + rationale, per-page profile column, detection source, methodology caveats section linking to in-flight phases that affect numbers (currently Phase 4.4).
- Required rules: universal_score for headlines, matched sample sizes or disclosed asymmetry, no mixing exemplars into competitor averages, symmetric projections, correct labeling of axe-only rendered-vs-raw deltas.
- Forbidden framings: parseability_score deltas across profiles as like-for-like, attributing metadata leads entirely to CMS quality while 4.4 is in flight, 'agent-ready' bands without stating which score they're applied to, template-fix recommendations without variance/confidence.
ms.topic is a Microsoft Learn authoring signal that declares the page's CMS template role (tutorial/quickstart/overview/reference). It is consumed by the content-type classifier (retrievability/profiles.py) where that meaning is appropriate. Accepting it inside the metadata pillar's topic-field check alongside generic semantic topic signals (meta:topic, meta:category, meta:keywords, schema:articleSection) gave Learn pages a vendor-specific 15-point credit no other doc system could earn from a comparable internal signal.

Change: retrievability/access_gate_evaluator.py:1334 no longer accepts ms.topic in the topic-field check. Classifier behavior in profiles.py is unchanged; ms.topic continues to be an authoritative content-type signal there.

Impact measured offline on the v3 corpora via scripts/measure-4.4-impact.py:
- learn-v3 (16 URLs): 14 pages lose 15 metadata points (the ms.topic-only topic credit); 2 pages unchanged (aks/faq already had other topic signals; azure-data-factory Q&A had none before). Mean metadata delta -13.12. Bounded parseability delta -1.31 average on article profile, -1.50 worst case, up to -2.25 on sample profile.
- competitive-v3 (6 URLs): 0 pages affected. Confirms the removed credit was Learn-exclusive.

Tests:
- tests/fixtures/metadata_ms_topic_only.html: ms.topic alone must score 0 on topic field.
- tests/fixtures/metadata_ms_topic_plus_keywords.html: ms.topic + meta:keywords must score same as meta:keywords alone (no double-credit).
- tests/fixtures/metadata_keywords_only.html: baseline for the equivalence assertion.
- tests/test_pillars.py adds two Phase 4.4 tests; full suite 109 passed (107 + 2 new).

Docs: docs/scoring.md metadata table updated; Vendor-neutrality principle subsection added. docs/improvement-plan.md flipped to Completed.
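The two neutrality tests might look roughly like this; the audit-trail key topic_field_points and the result attribute access are hypothetical, standing in for whatever the metadata pillar actually records.

```python
def topic_field_points(result):
    # Hypothetical audit-trail key; the real tests inspect the metadata
    # pillar's recorded topic-field credit.
    return result.audit_trail["metadata"]["topic_field_points"]


def test_ms_topic_alone_earns_no_topic_credit(score_fixture):
    assert topic_field_points(score_fixture("metadata_ms_topic_only.html")) == 0


def test_ms_topic_adds_nothing_over_keywords(score_fixture):
    both = score_fixture("metadata_ms_topic_plus_keywords.html")
    keywords_only = score_fixture("metadata_keywords_only.html")
    assert (both.component_scores["metadata"]
            == keywords_only.component_scores["metadata"])
```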
Ships trend-view functionality without building the storage abstraction scaffolding the original plan entry called for. The StorageBackend protocol, LocalJSONStorage wrapper, CosmosStorage stub, and docs/storage.md were deliberately not created — they're speculative API surface for a second backend that may never exist. If Phase 6 (Azure migration) actually starts, generalize collect_history() at that time, when there is a concrete second implementation to justify the abstraction.

Delivered:
- retrievability/history.py: plain collect_history(url, root) + format_table() + run_history() entry point. Walks <root>/**/*_scores.json, filters by URL with light normalization (trailing slash and fragment stripped), sorts by score-file mtime, computes parseability delta vs. previous row.
- retrievability/cli.py: new 'history' subcommand wiring with --root and --json flags.
- tests/test_history.py: 7 hermetic tests using tmp_path corpora (URL matching, normalization, malformed-file tolerance, JSON mode, nonzero exit on miss).
- README.md and USER-INSTRUCTIONS.md: one-paragraph command examples.

Verified manually against the committed evaluation/ corpus: 6 evaluations of learn.microsoft.com/en-us/azure/aks/faq across 2026-04-16..22 render correctly with parse deltas.

Plan entry in docs/improvement-plan.md updated to document both what was delivered and what was explicitly left out. Status flipped to Completed (scoped down). Phase 6 dependency note rewritten to reflect that the storage contract will be introduced when that phase has a real consumer.
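A compact sketch of the history collection described above: walk *_scores.json files, match the normalized URL, sort by mtime, and compute per-run deltas. Field names follow the commit message; everything else (structure, normalization rules) is illustrative, not the real retrievability/history.py.

```python
import json
from pathlib import Path


def normalize(url: str) -> str:
    return url.split("#", 1)[0].rstrip("/")


def collect_history(url: str, root: str = "evaluation") -> list[dict]:
    rows = []
    for path in sorted(Path(root).rglob("*_scores.json"),
                       key=lambda p: p.stat().st_mtime):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue                      # malformed files are tolerated
        entries = data if isinstance(data, list) else [data]
        for entry in entries:
            if normalize(entry.get("url", "")) == normalize(url):
                rows.append((path, entry.get("parseability_score")))

    history, prev = [], None
    for path, score in rows:
        delta = None if prev is None or score is None else round(score - prev, 1)
        history.append({"file": str(path), "parseability": score, "delta": delta})
        prev = score if score is not None else prev
    return history
```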
Draft-for-review design document that answers the five open questions before any Phase 5 code ships: LLM choice, corpus size, task design, correlation methodology, honest-null framing, and reporting.
Key commitments the doc makes:
- LLM scoring is an instrument, not a verdict. Reported as a third axis alongside parseability_score and universal_score, never replacing them.
- Three null hypotheses stated up front. Phase 5's finished report must address each explicitly.
- QA accuracy is the primary task (5 hand-authored questions per page, graded against ground truth). Summarization faithfulness is a robustness check on a subset.
- Primary instrument Azure OpenAI GPT-4o; secondary instrument one open-weight model; correlations must agree within ±0.1 Spearman rho across the two to claim the finding is about the pages rather than the primary LLM.
- Corpus N=30 minimum (6 profiles x 5 pages), N=60 ideal. Stratified by profile and vendor. Committed under evaluation/phase5-corpus/.
- Decision gate: design review + five open questions resolved before any code. Pilot run on N=5 before scaling to N=30. Second-person review before publishing findings.

Forbidden framings explicitly noted: Phase 5 does not produce 'Learn leads by X' headlines. It produces a correlation analysis of structural scoring vs. LLM behavior.
…an review
Replaces the hand-authored-questions approach in section 4.1 with a generator-plus-review pipeline: a cross-family LLM (e.g. Claude when scoring uses GPT-4o and Llama) drafts 5 Q/A pairs per page, and a single reviewer spends ~2 min/page editing, accepting, or rejecting each pair.

Why cross-family: prevents the scoring LLM from handshaking with its own question style. The generator only chooses which facts are tested; it cannot help the scorer answer. A weakly-structured page still produces questions the scorer fails to locate.

Updated consequences throughout the doc:
- Section 5 (corpus): ground truth is now reviewer-approved generator output; the corpus directory stores generator prompts, raw generator output, and the review audit trail alongside HTML snapshots.
- Section 5 (grading): inter-rater kappa is now computed on the accept/reject decision across 20 percent of pages rather than on fully hand-authored answers.
- Section 8 (cost): hand-authoring line item (~12 hours) replaced with ~1 hour of review at N=30 and under in generator LLM spend. Critical path shifts from annotator time to reviewer capacity.
- Section 10 (open questions): expanded from five to six. New question 3 asks which non-OpenAI, non-Meta model is the generator (Claude default). Former q3 (secondary annotator) rewritten as secondary reviewer.
- Section 11 (decision gate): now references six open questions instead of five.

Disclosure requirement added to section 4.1.1: the final Phase 5 findings report must state that questions were LLM-generated and human-reviewed, name the generator model, and report the reject rate. A high reject rate is itself a finding about the page.
Resolves open question 3. Generator family is Claude; specific model (Sonnet vs. Opus) deferred to pilot-time API access check. Section 4.1.1 updated to state Claude directly rather than 'or equivalent'.
…secondary
Resolves open question 1. Both LLMs score every page; findings require per-pillar Spearman rho to agree within 0.1 across the two scorers to be attributable to the pages rather than the primary model. Specific Llama 3.x variant deferred to pilot time.
Resolves open question 2. N=60 (6 profiles x 10 pages) is the committed corpus size.

Rationale: the honest-null framing in section 2 includes H0-profile (per-profile LLM delta is not significant); at N=30 we have 5 pages per profile and cannot reject H0-profile with any power, which would force us either to drop H0-profile from the null list or to ship a finding we cannot falsify. N=60 keeps the hypothesis reachable.

Updated throughout:
- Section 5 (Size): rewrote to state N=60 directly, with H0-profile as the justification.
- Section 5 (Source): corpus extension target is now 60 (from the 22-URL golden base).
- Section 8 (Cost): scaled all line items to N=60. Generator LLM ~, review ~2 hours, scoring LLM ~, plus ~1-2 hours of URL curation to fill 38 additional profile x vendor cells. Total human review time now ~3-4 hours (from the ~1-2 previously stated at N=30).
- Section 10 (Q2): marked resolved: N=60, with pilot at N=5 before scaling.
- Section 11 (Decision gate): pilot-to-full-corpus language updated from scaling to N=30 to scaling to N=60.
Resolves open question 4 (Option A). A second reviewer independently re-runs accept/reject on 12 of the 60 pages; Cohen's kappa is computed on the accept/reject decision and reported in the findings doc. Without kappa we cannot claim the ground-truth rubric is reproducible, so findings cannot be published. Tracking: added a 'Blocking dependency for publication' note under Phase 5 in docs/improvement-plan.md so this requirement does not get forgotten when the pilot completes. The entry states the scope (12 pages / 60 Q&A pairs), the estimated effort (~40 minutes for the second reviewer), and the threshold (kappa >= 0.7; below that the rubric must be tightened and the primary review re-run).
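For reference, Cohen's kappa on binary accept/reject decisions reduces to a few lines; this is a generic illustration of the statistic the gate uses (kappa >= 0.7 over the 12 shared pages), not the Phase 5 grading harness.

```python
def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Agreement corrected for chance: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n) for label in labels
    )
    return (observed - expected) / (1 - expected)


a = ["accept", "accept", "reject", "accept", "reject", "accept"]
b = ["accept", "reject", "reject", "accept", "reject", "accept"]
print(round(cohens_kappa(a, b), 2))   # 0.67 — below the 0.7 publication threshold
```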
…ch split
Resolves open question 5. ~3 MB at N=60 does not justify LFS or external storage; commit directly to git.
Layout:
- evaluation/phase5-corpus/ - committed, permanent. URLs, HTML snapshots, generator prompts, raw generator output, reviewer-approved Q&A, review audit trail.
- evaluation/phase5-results/<run-id>/ - committed per published run, tagged by date and LLM version. Only runs backing a published finding are committed.
- evaluation/phase5-scratch/ - gitignored. Pilot runs, prompt iterations, experiments that do not back a finding.

.gitattributes entry for stable diffs. Licensing note added covering Anthropic Claude, OpenAI GPT-4o, and Meta Llama output redistribution under current TOS.
…l product team
Resolves open question 6. Findings doc is written for engineers and PMs who understand Clipper's architecture. Implications: brief inline stats explanations rather than assumed background, no blinded replication protocol, no peer review gate beyond the second-person review in section 11. Forbidden framings still apply because stakeholders will act on the findings.

All six open questions now resolved. Design status is APPROVED. Added phase-5-design-approved marker to docs/improvement-plan.md with a summary of the six resolutions. Decision gate in section 11 updated to reflect approval; implementation may begin with the pilot at N=5. Expansion to an external audience requires re-opening this doc and hardening the methodology section before publishing.
Creates the module layout and data contracts for Phase 5 LLM ground-truth validation. No LLM calls are wired yet; that happens in the pilot runner once Anthropic and Azure OpenAI credentials are in place.
Artifacts:
- evaluation/phase5-corpus/ and evaluation/phase5-results/ trackable; evaluation/phase5-scratch/ gitignored. .gitignore pattern evaluation*/ replaced with specific subdir patterns so phase5 dirs can be committed. .gitattributes pins LF line endings on phase5 JSON for cross-platform diffs.
- retrievability/phase5/ package: schemas (QAPair, ReviewRecord, CorpusPage, ScoringAnswer, Grade, RunManifest with to_dict/from_dict), templates.py (load_template + render with {{TOKEN}} placeholders), generator.py (Claude prompt builder + JSON-array output parser; ClientProtocol for later wiring), reviewer.py (CLI accept/edit/reject loop, injectable input_fn for tests), scorer.py (GPT-4o / Llama prompt builder and driver), grader.py (pilot-grade substring heuristic), analyzer.py (pure-Python Spearman rho + 10k-resample bootstrap CI, no scipy dep).
- retrievability/phase5/prompts/generator.txt and scorer.txt: the exact templates sent to Claude and the scoring LLMs. Generator template enforces the Phase 5 design doc section 4.1.1 rules (document-grounded only, no outside knowledge, cross-section coverage).
- retrievability/cli.py: new 'phase5' subcommand with a 'status' subsubcommand that reports scaffolding state. Smoke-tested via 'python main.py phase5 status'.
- tests/test_phase5_scaffolding.py: 28 tests covering dataclass roundtrips, template loading and substitution, generator output parsing, reviewer CLI loop (accept/edit/reject paths), grader labels, and analyzer correlation math. Full suite: 144 passing (up from 116).
Next: pilot runner that stitches corpus -> generator -> reviewer -> scorer -> grader -> analyzer end-to-end on N=5 pages. Blocked on credentials, not code.
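The pure-Python Spearman ρ + bootstrap CI mentioned for analyzer.py above can be sketched as follows; the resample count follows the commit message, while the function shapes, the simplified tie handling, and the percentile CI are illustrative rather than the module's exact code.

```python
import random


def _ranks(values):
    # Simplified ranking: ties get arbitrary consecutive ranks rather than averages.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = float(rank)
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mean = (n + 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var_x = sum((a - mean) ** 2 for a in rx)
    var_y = sum((b - mean) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def bootstrap_ci(x, y, resamples=10_000, alpha=0.05):
    """Percentile bootstrap CI on the rho estimate."""
    n = len(x)
    rhos = []
    for _ in range(resamples):
        idx = [random.randrange(n) for _ in range(n)]
        rhos.append(spearman_rho([x[i] for i in idx], [y[i] for i in idx]))
    rhos.sort()
    lo = rhos[int(alpha / 2 * resamples)]
    hi = rhos[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi
```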
Rebase main onto the improvement-plan work

This PR fast-forwards main from the state at the start of this cleanup sprint to the current paulsanders/fix-ci-workflows HEAD (49 commits). It folds in:

Repo cleanup

Legacy demo artifacts, archived-tests/, run_clipper_demo.py, github_integration.py, CLIPPER-DEMO-*.md, PERFORMANCE-OPTIMIZATION-*.md, and WORKSPACE-CLEANUP-SUMMARY.md are gone. URL lists moved under urls/. copilot-instructions.md moved under .github/.

Improvement plan — delivered phases

Full plan and rationale in docs/improvement-plan.md.
- structured_data pillar, with a universal-vs-profile-adjusted score view in the report.
- clipper history <url> trend command. Full-storage-backend scope was deferred as YAGNI.
- ms.topic removed from the topic-field check; Learn pages -15 metadata points / -1.5 headline points on average, 0 competitor pages affected. Measurement script committed at scripts/measure-4.4-impact.py.

Copilot comparison-report hygiene

.github/copilot-instructions.md now codifies the two-score model (parseability_score vs universal_score), the cross-vendor rule (must use universal_score), required disclosures (per-page profile + detection source), and the list of forbidden framings. Added in response to a flawed v3 cross-vendor report.

Phase 5 — LLM ground-truth validation

Design doc approved (all six open questions resolved). Full design: docs/phase-5-design.md.

Scaffolding shipped (no LLM calls wired yet — blocked on credentials, not code):
- retrievability/phase5/ package: schemas, prompt templates, generator / reviewer / scorer / grader / analyzer stubs. Pure-Python Spearman ρ + bootstrap CI (no scipy dep).
- python main.py phase5 status reports scaffolding state.
- evaluation/phase5-corpus/ and evaluation/phase5-results/ trackable, evaluation/phase5-scratch/ gitignored, .gitattributes LF-pinned for JSON.
- tests/test_phase5_scaffolding.py.

144 passing (up from pre-sprint baseline; +28 from Phase 5 scaffolding, +7 from Phase 4.2 history, +3 from Phase 4.4 metadata neutrality, plus earlier phase additions).

Risk

Fast-forwardable. No conflicts with main. Recommend squash-merge or merge-commit depending on repo convention; the commit history on this branch is intentionally fine-grained (one commit per decision) and worth preserving, so a plain merge is the better choice.