feat(scripts): dump cross-reference tags to TSV for downstream review by kim-em · Pull Request #39876 · leanprover-community/mathlib4

kim-em · 2026-05-26T14:40:15Z

This PR adds scripts/dump_crossref_tags.lean, which walks Mathlib.CrossRef.tagExt in the elaborated Mathlib environment and writes one TSV record per tagged declaration. The TSV is consumed by a privileged workflow_run job that posts the cross-reference review PR comment; the rest of that machinery lives in https://github.com/leanprover-community/external-tags and https://github.com/leanprover-community/mathlib-ci. Fields are sanitised so tabs/newlines in user-controlled comments can't break the TSV framing, and the output is capped at 2 MB.

Uses importModules (loadExts := true) rather than withImportModules, because the wrapper passes loadExts := false and would leave tagExt empty for imported modules.

🤖 Prepared with Claude Code

Add scripts/dump_crossref_tags.lean: a ~70-line standalone tool that loads the elaborated Mathlib environment, walks Mathlib.CrossRef.tagExt, and writes one TSV record per tagged declaration with columns (database, tag, declName, module, comment). This is the only piece of cross-reference review machinery that needs to live in mathlib4 — the rest (snippet fetching, filtering by PR diff, Markdown rendering, comment posting) lives in https://github.com/leanprover-community/external-tags and https://github.com/leanprover-community/mathlib-ci. The TSV emitted here is consumed by the privileged workflow_run job and treated as untrusted data; fields are normalised so tabs/newlines in user-controlled comments can't break the TSV framing, and the output is capped at 2 MB defensively (current population is ~55 KB, 491 tags). The script uses importModules (loadExts := true) rather than the withImportModules wrapper because the latter passes loadExts := false and would leave tagExt empty for imported modules. Replaces leanprover-community#39662 (which added the full standalone script in-tree at ~1k LOC). Per https://leanprover.zulipchat.com/#narrow/dm/110087,112680-dm/near/597848507, the bulk of that script now lives in https://github.com/leanprover-community/external-tags. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-26T14:41:42Z

PR summary 15ba169ded

Import changes for modified files

No significant changes to the import graph

Import changes for all files

Files	Import difference

Declarations diff

+ maxBytes
+ moduleToFilePath
+ sanitiseField

You can run this locally as follows

## from your `mathlib4` directory:
git clone https://github.com/leanprover-community/mathlib-ci.git ../mathlib-ci

## summary with just the declaration names:
../mathlib-ci/scripts/pr_summary/declarations_diff.sh <optional_commit>

## more verbose report:
../mathlib-ci/scripts/pr_summary/declarations_diff.sh long <optional_commit>

The doc-module for scripts/pr_summary/declarations_diff.sh in the mathlib-ci repository contains some details about this script.

No changes to strong technical debt.

No changes to weak technical debt.

⚠️ Scripts folder reminder

This PR adds files under scripts/.
Please consider whether each added script belongs in this repository or in leanprover-community/mathlib-ci.

A script belongs in mathlib-ci if it is a CI automation script that interacts with GitHub (e.g. managing labels, posting comments, triggering bots), runs from a trusted external checkout in CI, or requires access to secrets.

A script belongs in this repository (scripts/) if it is a developer or maintainer tool to be run locally, a code maintenance or analysis utility, a style linting tool, or a data file used by the library's own linters.

See the mathlib-ci README for more details.

Added scripts files:

scripts/dump_crossref_tags.lean

The lint-style action fails CI on undocumented scripts. Adds an entry in the "CI workflow" section pointing to the workflow_run consumer and the external-tags repo. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…un shim Stack on top of the dump_crossref_tags.lean PR. Adds two pieces: * .github/workflows/build_template.yml: three new steps in `post_steps`, after the import-graph upload. Run scripts/dump_crossref_tags.lean to produce crossref-tags.tsv. For pull_request_target builds, build a bridge payload (pr_number, head_sha) and emit it via leanprover-community/privilege-escalation-bridge/emit using artifact name `crossref-tags-bridge`. Non-PR builds skip the emit. * .github/workflows/crossref_review.yml: workflow_run listener on the two `continuous integration` workflows. Consumes the bridge artifact via privilege-escalation-bridge/consume, checks out leanprover-community/mathlib-ci, and invokes its scripts/crossref_review/post-comment.sh orchestrator. The orchestrator clones leanprover-community/external-tags at a pinned SHA and uses it to fetch snippets and render the PR comment. Replaces leanprover-community#39666. The original PR rolled its own artifact upload + PR number resolution fallback chain (~158 LOC) and included the snippet fetcher in scripts/crossref.lean (~760 LOC). This version delegates to the existing privilege-escalation-bridge infrastructure (same pattern as olean_report.yaml + olean_report_wf_run.yaml) and to external-tags, cutting the mathlib4 surface to ~97 LOC. Trust model: the privileged workflow_run job runs only code from mathlib-ci@<pinned SHA> (which runs only code from external-tags at a pinned SHA). Nothing from the build artifact is interpreted as code; the TSV is parsed as untrusted data with user-controllable fields escaped before interpolation into the comment. Depends on leanprover-community#39876. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The "Safety" paragraph claimed the 2 MB cap "bounds what a malicious PR could emit". It doesn't — this script lives in scripts/ and runs with the PR's permissions, so a malicious PR can remove the cap. The cap is defence-in-depth; the trusted bound lives downstream in mathlib-ci's post-comment.sh. Also: 491 tags in master today, not 539 (the earlier number was a text-scan count; the actual elaborated population is 491). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…un shim Stack on top of the dump_crossref_tags.lean PR. Adds two pieces: * .github/workflows/build_template.yml: three new steps in `post_steps`, after the import-graph upload. Run scripts/dump_crossref_tags.lean to produce crossref-tags.tsv. For pull_request_target builds, build a bridge payload (pr_number, head_sha) and emit it via leanprover-community/privilege-escalation-bridge/emit using artifact name `crossref-tags-bridge`. Non-PR builds skip the emit. * .github/workflows/crossref_review.yml: workflow_run listener on the two `continuous integration` workflows. Consumes the bridge artifact via privilege-escalation-bridge/consume, checks out leanprover-community/mathlib-ci, and invokes its scripts/crossref_review/post-comment.sh orchestrator. The orchestrator clones leanprover-community/external-tags at a pinned SHA and uses it to fetch snippets and render the PR comment. Replaces leanprover-community#39666. The original PR rolled its own artifact upload + PR number resolution fallback chain (~158 LOC) and included the snippet fetcher in scripts/crossref.lean (~760 LOC). This version delegates to the existing privilege-escalation-bridge infrastructure (same pattern as olean_report.yaml + olean_report_wf_run.yaml) and to external-tags, cutting the mathlib4 surface to ~97 LOC. Trust model: the privileged workflow_run job runs only code from mathlib-ci@<pinned SHA> (which runs only code from external-tags at a pinned SHA). Nothing from the build artifact is interpreted as code; the TSV is parsed as untrusted data with user-controllable fields escaped before interpolation into the comment. Depends on leanprover-community#39876. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The tag, module, and comment fields all pass through sanitiseField before being written to TSV, but tag.declName was interpolated raw. Lean `Name`s can contain backtick-quoted segments (`«weird name»`) and in principle could include tabs or newlines; the one unsanitised field in a "framing-safe" TSV is inconsistent and risks breaking the downstream parser. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…un shim Stack on top of the dump_crossref_tags.lean PR. Adds two pieces: * .github/workflows/build_template.yml: three new steps in `post_steps`, after the import-graph upload. Run scripts/dump_crossref_tags.lean to produce crossref-tags.tsv. For pull_request_target builds, build a bridge payload (pr_number, head_sha) and emit it via leanprover-community/privilege-escalation-bridge/emit using artifact name `crossref-tags-bridge`. Non-PR builds skip the emit. * .github/workflows/crossref_review.yml: workflow_run listener on the two `continuous integration` workflows. Consumes the bridge artifact via privilege-escalation-bridge/consume, checks out leanprover-community/mathlib-ci, and invokes its scripts/crossref_review/post-comment.sh orchestrator. The orchestrator clones leanprover-community/external-tags at a pinned SHA and uses it to fetch snippets and render the PR comment. Replaces leanprover-community#39666. The original PR rolled its own artifact upload + PR number resolution fallback chain (~158 LOC) and included the snippet fetcher in scripts/crossref.lean (~760 LOC). This version delegates to the existing privilege-escalation-bridge infrastructure (same pattern as olean_report.yaml + olean_report_wf_run.yaml) and to external-tags, cutting the mathlib4 surface to ~97 LOC. Trust model: the privileged workflow_run job runs only code from mathlib-ci@<pinned SHA> (which runs only code from external-tags at a pinned SHA). Nothing from the build artifact is interpreted as code; the TSV is parsed as untrusted data with user-controllable fields escaped before interpolation into the comment. Depends on leanprover-community#39876. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

kim-em · 2026-05-27T17:18:58Z

Test output from rebasing #39909 onto this branch + #39877 and running the dump script.

$ lake env lean --run scripts/dump_crossref_tags.lean /tmp/crossref-tags.tsv
Wrote 505 cross-reference tag(s) to /tmp/crossref-tags.tsv (55607 bytes).

Breakdown:

482 stacks
9 kerodon
14 wikidata (from [Merged by Bors] - doc: add wikidata cross references #39909's 13 @[wikidata Q…] attributes; the extra is to_additive propagating Q505798 Lagrange to AddSubgroup.card_addSubgroup_dvd_card)

All 505 rows have exactly 5 tab-separated fields; 204 carry a non-empty comment so the sanitisation path is exercised. File size 55 KB, well under the 2 MB cap.

The 14 wikidata rows from #39909:

wikidata  Q193878   Ideal.quotientInfRingEquivPiQuotient                                       Mathlib/RingTheory/Ideal/Quotient/Operations.lean
wikidata  Q505798   Subgroup.card_subgroup_dvd_card                                            Mathlib/GroupTheory/Coset/Card.lean
wikidata  Q505798   AddSubgroup.card_addSubgroup_dvd_card                                      Mathlib/GroupTheory/Coset/Card.lean
wikidata  Q1052678  dense_iInter_of_isOpen                                                     Mathlib/Topology/Baire/Lemmas.lean
wikidata  Q245098   intermediate_value_Icc                                                     Mathlib/Topology/Order/IntermediateValue.lean
wikidata  Q253214   Metric.isCompact_iff_isClosed_bounded                                      Mathlib/Topology/MetricSpace/Bounded.lean
wikidata  Q190546   inner_mul_inner_self_le                                                    Mathlib/Analysis/InnerProductSpace/Basic.lean
wikidata  Q193286   exists_deriv_eq_zero                                                       Mathlib/Analysis/Calculus/LocalExtr/Rolle.lean
wikidata  Q189136   exists_ratio_deriv_eq_ratio_slope                                          Mathlib/Analysis/Calculus/Deriv/MeanValue.lean
wikidata  Q866116   exists_extension_norm_eq                                                   Mathlib/Analysis/Normed/Module/HahnBanach.lean
wikidata  Q192760   Complex.isAlgClosed                                                        Mathlib/Analysis/Complex/Polynomial/Basic.lean
wikidata  Q939927   ContinuousMap.subalgebra_topologicalClosure_eq_top_of_separatesPoints      Mathlib/Topology/ContinuousMap/StoneWeierstrass.lean
wikidata  Q1816967  exists_continuous_zero_one_of_isClosed                                     Mathlib/Topology/UrysohnsLemma.lean
wikidata  Q220680   ContractingWith.exists_fixedPoint                                          Mathlib/Topology/MetricSpace/Contracting.lean

One thing worth flagging for the downstream external-tags diff filter: a single one-line to_additive (attr := wikidata Q…) change in a PR produces ≥2 TSV rows, so the diff-to-tag mapping isn't 1:1.

jcommelin · 2026-05-28T06:40:07Z

+    rows := rows.push s!"{dbName}\t{sanitiseField tag.tag}\t\
+      {sanitiseField tag.declName.toString}\t\
+      {sanitiseField file}\t{sanitiseField tag.comment}"


Noobie question: would it be easy to also log the file + linenr of this tag? That might double the size of the TSV, but I think it might be quite useful for post-processors.

github-actions Bot added the CI Modifies the continuous integration setup or other automation label May 26, 2026

kim-em mentioned this pull request May 27, 2026

[Merged by Bors] - doc: add wikidata cross references #39909

Closed

jcommelin reviewed May 28, 2026

View reviewed changes

joneugster added the LLM-generated PRs with substantial input from LLMs - review accordingly label May 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(scripts): dump cross-reference tags to TSV for downstream review#39876

feat(scripts): dump cross-reference tags to TSV for downstream review#39876
kim-em wants to merge 4 commits into
leanprover-community:masterfrom
kim-em:crossref-dump-script

kim-em commented May 26, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 26, 2026 •

edited

Loading

Declarations diff

⚠️ Scripts folder reminder

Uh oh!

kim-em commented May 27, 2026

Uh oh!

jcommelin May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

kim-em commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR summary 15ba169ded

Import changes for modified files

Declarations diff

⚠️ Scripts folder reminder

Uh oh!

kim-em commented May 27, 2026

Uh oh!

jcommelin May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kim-em commented May 26, 2026 •

edited

Loading

github-actions Bot commented May 26, 2026 •

edited

Loading