feat(scripts): dump cross-reference tags to TSV for downstream review #39876
kim-em wants to merge 4 commits
Add `scripts/dump_crossref_tags.lean`: a ~70-line standalone tool that loads the elaborated Mathlib environment, walks `Mathlib.CrossRef.tagExt`, and writes one TSV record per tagged declaration with columns (database, tag, declName, module, comment).

This is the only piece of cross-reference review machinery that needs to live in mathlib4; the rest (snippet fetching, filtering by PR diff, Markdown rendering, comment posting) lives in https://github.com/leanprover-community/external-tags and https://github.com/leanprover-community/mathlib-ci.

The TSV emitted here is consumed by the privileged `workflow_run` job and treated as untrusted data; fields are normalised so tabs/newlines in user-controlled comments can't break the TSV framing, and the output is capped at 2 MB defensively (current population is ~55 KB, 491 tags).

The script uses `importModules (loadExts := true)` rather than the `withImportModules` wrapper because the latter passes `loadExts := false` and would leave `tagExt` empty for imported modules.

Replaces leanprover-community#39662 (which added the full standalone script in-tree at ~1k LOC). Per https://leanprover.zulipchat.com/#narrow/dm/110087,112680-dm/near/597848507, the bulk of that script now lives in https://github.com/leanprover-community/external-tags.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
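The `loadExts` point can be illustrated with a minimal sketch. It is not the script from this PR: it assumes a toolchain whose `Lean.importModules` accepts the `loadExts` argument, that it is run inside the project (for example via `lake env lean --run`), and it only counts imported modules rather than reading `tagExt`.

```lean
import Lean
open Lean

/-- Sketch only: import Mathlib with persistent-extension state loaded.
Nothing here touches `Mathlib.CrossRef.tagExt`; the module count is just a
visible sign that the import succeeded. -/
unsafe def main : IO Unit := do
  -- Make the toolchain's and the project's `.olean` files discoverable.
  initSearchPath (← findSysroot)
  -- Assumption: initializers are enabled before loading extension state.
  enableInitializersExecution
  -- `loadExts := true` is the argument the PR description calls out;
  -- `withImportModules` passes `false` and would leave extensions empty.
  let env ← importModules #[{ module := `Mathlib }] {} (loadExts := true)
  IO.println s!"imported {env.header.moduleNames.size} modules"
```

A real dump would then read the extension state from `env`; that step depends on how `tagExt` is declared and is left out here.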
PR summary 15ba169ded

Import changes for modified files: no significant changes to the import graph.
The lint-style action fails CI on undocumented scripts. Adds an entry in the "CI workflow" section pointing to the workflow_run consumer and the external-tags repo. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…un shim

Stack on top of the dump_crossref_tags.lean PR. Adds two pieces:

* .github/workflows/build_template.yml: three new steps in `post_steps`, after the import-graph upload. Run scripts/dump_crossref_tags.lean to produce crossref-tags.tsv. For pull_request_target builds, build a bridge payload (pr_number, head_sha) and emit it via leanprover-community/privilege-escalation-bridge/emit using artifact name `crossref-tags-bridge`. Non-PR builds skip the emit.
* .github/workflows/crossref_review.yml: workflow_run listener on the two `continuous integration` workflows. Consumes the bridge artifact via privilege-escalation-bridge/consume, checks out leanprover-community/mathlib-ci, and invokes its scripts/crossref_review/post-comment.sh orchestrator. The orchestrator clones leanprover-community/external-tags at a pinned SHA and uses it to fetch snippets and render the PR comment.

Replaces leanprover-community#39666. The original PR rolled its own artifact upload + PR number resolution fallback chain (~158 LOC) and included the snippet fetcher in scripts/crossref.lean (~760 LOC). This version delegates to the existing privilege-escalation-bridge infrastructure (same pattern as olean_report.yaml + olean_report_wf_run.yaml) and to external-tags, cutting the mathlib4 surface to ~97 LOC.

Trust model: the privileged workflow_run job runs only code from mathlib-ci@<pinned SHA> (which runs only code from external-tags at a pinned SHA). Nothing from the build artifact is interpreted as code; the TSV is parsed as untrusted data with user-controllable fields escaped before interpolation into the comment.

Depends on leanprover-community#39876.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The "Safety" paragraph claimed the 2 MB cap "bounds what a malicious PR could emit". It doesn't — this script lives in scripts/ and runs with the PR's permissions, so a malicious PR can remove the cap. The cap is defence-in-depth; the trusted bound lives downstream in mathlib-ci's post-comment.sh. Also: 491 tags in master today, not 539 (the earlier number was a text-scan count; the actual elaborated population is 491). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
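As a rough illustration of what a defence-in-depth cap of this kind looks like, here is a small sketch. The names `maxOutputBytes` and `renderCapped` are hypothetical, and reading "2 MB" as 2 × 1024 × 1024 bytes is an assumption; the script's actual constant and failure behaviour are not shown in this conversation.

```lean
/-- Assumed reading of "2 MB"; the real constant may differ. -/
def maxOutputBytes : Nat := 2 * 1024 * 1024

/-- Join rows with newlines and refuse to emit anything over the cap.
Illustrative only: the trusted bound still has to be enforced downstream,
because a PR can edit or delete a check like this one. -/
def renderCapped (rows : Array String) : Except String String :=
  let out := "\n".intercalate rows.toList
  if out.utf8ByteSize > maxOutputBytes then
    .error s!"TSV is {out.utf8ByteSize} bytes, over the {maxOutputBytes}-byte cap"
  else
    .ok out
```

The check measures UTF-8 bytes rather than characters, since the limit is about artifact size.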
The tag, module, and comment fields all pass through sanitiseField before being written to TSV, but tag.declName was interpolated raw. Lean `Name`s can contain guillemet-quoted segments (`«weird name»`) and could in principle include tabs or newlines; leaving one field unsanitised in a "framing-safe" TSV is inconsistent and risks breaking the downstream parser. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
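To make the framing argument concrete, here is a minimal sketch of a field sanitiser and a row builder. The body of `sanitiseField` is a hypothetical stand-in (the real definition is not shown in this conversation), `tsvRow` is an illustrative helper, and the sample values are invented.

```lean
/-- Hypothetical stand-in for the script's `sanitiseField`: replace every tab,
line feed, or carriage return with a space so that a field can never add a
column or a row. -/
def sanitiseField (s : String) : String :=
  s.map fun c => if c == '\t' || c == '\n' || c == '\r' then ' ' else c

/-- Sanitise each field, then join with tabs to form one TSV record. -/
def tsvRow (fields : List String) : String :=
  "\t".intercalate (fields.map sanitiseField)

-- A name containing a tab and a comment containing a newline still give
-- exactly five tab-separated fields.
#eval (tsvRow ["wikidata", "Q1", "«weird\tname»", "Mathlib.Foo", "line one\nline two"]
  |>.splitOn "\t" |>.length)  -- 5
```

Sanitising every field uniformly, including the declaration name, means the consumer can rely on a fixed column count instead of knowing which fields are "safe".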
Test output from rebasing #39909 onto this branch + #39877 and running the dump script. Breakdown:
All 505 rows have exactly 5 tab-separated fields; 204 carry a non-empty […]

The 14 wikidata rows from #39909: […]

One thing worth flagging for the downstream […]
    rows := rows.push s!"{dbName}\t{sanitiseField tag.tag}\t\
      {sanitiseField tag.declName.toString}\t\
      {sanitiseField file}\t{sanitiseField tag.comment}"
Newbie question: would it be easy to also log the file and line number of this tag? That might double the size of the TSV, but I think it could be quite useful for post-processors.
This PR adds `scripts/dump_crossref_tags.lean`, which walks `Mathlib.CrossRef.tagExt` in the elaborated Mathlib environment and writes one TSV record per tagged declaration. The TSV is consumed by a privileged `workflow_run` job that posts the cross-reference review PR comment; the rest of that machinery lives in https://github.com/leanprover-community/external-tags and https://github.com/leanprover-community/mathlib-ci. Fields are sanitised so tabs/newlines in user-controlled comments can't break the TSV framing, and the output is capped at 2 MB.

Uses `importModules (loadExts := true)` rather than `withImportModules`, because the wrapper passes `loadExts := false` and would leave `tagExt` empty for imported modules.

🤖 Prepared with Claude Code