Skip to content

feat(scripts): dump cross-reference tags to TSV for downstream review#39876

Open
kim-em wants to merge 4 commits into
leanprover-community:masterfrom
kim-em:crossref-dump-script
Open

feat(scripts): dump cross-reference tags to TSV for downstream review#39876
kim-em wants to merge 4 commits into
leanprover-community:masterfrom
kim-em:crossref-dump-script

Conversation

@kim-em
Copy link
Copy Markdown
Contributor

@kim-em kim-em commented May 26, 2026

This PR adds scripts/dump_crossref_tags.lean, which walks Mathlib.CrossRef.tagExt in the elaborated Mathlib environment and writes one TSV record per tagged declaration. The TSV is consumed by a privileged workflow_run job that posts the cross-reference review PR comment; the rest of that machinery lives in https://github.com/leanprover-community/external-tags and https://github.com/leanprover-community/mathlib-ci. Fields are sanitised so tabs/newlines in user-controlled comments can't break the TSV framing, and the output is capped at 2 MB.

Uses importModules (loadExts := true) rather than withImportModules, because the wrapper passes loadExts := false and would leave tagExt empty for imported modules.

🤖 Prepared with Claude Code

Add scripts/dump_crossref_tags.lean: a ~70-line standalone tool that loads
the elaborated Mathlib environment, walks Mathlib.CrossRef.tagExt, and
writes one TSV record per tagged declaration with columns
(database, tag, declName, module, comment).

This is the only piece of cross-reference review machinery that needs to
live in mathlib4 — the rest (snippet fetching, filtering by PR diff,
Markdown rendering, comment posting) lives in
https://github.com/leanprover-community/external-tags and
https://github.com/leanprover-community/mathlib-ci. The TSV emitted here
is consumed by the privileged workflow_run job and treated as untrusted
data; fields are normalised so tabs/newlines in user-controlled comments
can't break the TSV framing, and the output is capped at 2 MB defensively
(current population is ~55 KB, 491 tags).

The script uses importModules (loadExts := true) rather than the
withImportModules wrapper because the latter passes loadExts := false and
would leave tagExt empty for imported modules.

Replaces leanprover-community#39662 (which added the full standalone script in-tree at ~1k
LOC). Per
https://leanprover.zulipchat.com/#narrow/dm/110087,112680-dm/near/597848507,
the bulk of that script now lives in
https://github.com/leanprover-community/external-tags.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions github-actions Bot added the CI Modifies the continuous integration setup or other automation label May 26, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 26, 2026

PR summary 15ba169ded

Import changes for modified files

No significant changes to the import graph

Import changes for all files
Files Import difference

Declarations diff

+ maxBytes
+ moduleToFilePath
+ sanitiseField

You can run this locally as follows
## from your `mathlib4` directory:
git clone https://github.com/leanprover-community/mathlib-ci.git ../mathlib-ci

## summary with just the declaration names:
../mathlib-ci/scripts/pr_summary/declarations_diff.sh <optional_commit>

## more verbose report:
../mathlib-ci/scripts/pr_summary/declarations_diff.sh long <optional_commit>

The doc-module for scripts/pr_summary/declarations_diff.sh in the mathlib-ci repository contains some details about this script.


No changes to strong technical debt.
No changes to weak technical debt.

⚠️ Scripts folder reminder

This PR adds files under scripts/.
Please consider whether each added script belongs in this repository or in leanprover-community/mathlib-ci.

A script belongs in mathlib-ci if it is a CI automation script that interacts with GitHub (e.g. managing labels, posting comments, triggering bots), runs from a trusted external checkout in CI, or requires access to secrets.

A script belongs in this repository (scripts/) if it is a developer or maintainer tool to be run locally, a code maintenance or analysis utility, a style linting tool, or a data file used by the library's own linters.

See the mathlib-ci README for more details.

Added scripts files:

  • scripts/dump_crossref_tags.lean

The lint-style action fails CI on undocumented scripts. Adds an entry in
the "CI workflow" section pointing to the workflow_run consumer and the
external-tags repo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
kim-em added a commit to kim-em/mathlib4 that referenced this pull request May 27, 2026
…un shim

Stack on top of the dump_crossref_tags.lean PR. Adds two pieces:

* .github/workflows/build_template.yml: three new steps in `post_steps`,
  after the import-graph upload. Run scripts/dump_crossref_tags.lean to
  produce crossref-tags.tsv. For pull_request_target builds, build a
  bridge payload (pr_number, head_sha) and emit it via
  leanprover-community/privilege-escalation-bridge/emit using artifact
  name `crossref-tags-bridge`. Non-PR builds skip the emit.

* .github/workflows/crossref_review.yml: workflow_run listener on
  the two `continuous integration` workflows. Consumes the bridge
  artifact via privilege-escalation-bridge/consume, checks out
  leanprover-community/mathlib-ci, and invokes its
  scripts/crossref_review/post-comment.sh orchestrator. The orchestrator
  clones leanprover-community/external-tags at a pinned SHA and uses it
  to fetch snippets and render the PR comment.

Replaces leanprover-community#39666. The original PR rolled its own artifact upload + PR
number resolution fallback chain (~158 LOC) and included the snippet
fetcher in scripts/crossref.lean (~760 LOC). This version delegates to
the existing privilege-escalation-bridge infrastructure (same pattern
as olean_report.yaml + olean_report_wf_run.yaml) and to external-tags,
cutting the mathlib4 surface to ~97 LOC.

Trust model: the privileged workflow_run job runs only code from
mathlib-ci@<pinned SHA> (which runs only code from external-tags at a
pinned SHA). Nothing from the build artifact is interpreted as code;
the TSV is parsed as untrusted data with user-controllable fields
escaped before interpolation into the comment.

Depends on leanprover-community#39876.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The "Safety" paragraph claimed the 2 MB cap "bounds what a malicious
PR could emit". It doesn't — this script lives in scripts/ and runs
with the PR's permissions, so a malicious PR can remove the cap.
The cap is defence-in-depth; the trusted bound lives downstream in
mathlib-ci's post-comment.sh.

Also: 491 tags in master today, not 539 (the earlier number was a
text-scan count; the actual elaborated population is 491).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
kim-em added a commit to kim-em/mathlib4 that referenced this pull request May 27, 2026
…un shim

Stack on top of the dump_crossref_tags.lean PR. Adds two pieces:

* .github/workflows/build_template.yml: three new steps in `post_steps`,
  after the import-graph upload. Run scripts/dump_crossref_tags.lean to
  produce crossref-tags.tsv. For pull_request_target builds, build a
  bridge payload (pr_number, head_sha) and emit it via
  leanprover-community/privilege-escalation-bridge/emit using artifact
  name `crossref-tags-bridge`. Non-PR builds skip the emit.

* .github/workflows/crossref_review.yml: workflow_run listener on
  the two `continuous integration` workflows. Consumes the bridge
  artifact via privilege-escalation-bridge/consume, checks out
  leanprover-community/mathlib-ci, and invokes its
  scripts/crossref_review/post-comment.sh orchestrator. The orchestrator
  clones leanprover-community/external-tags at a pinned SHA and uses it
  to fetch snippets and render the PR comment.

Replaces leanprover-community#39666. The original PR rolled its own artifact upload + PR
number resolution fallback chain (~158 LOC) and included the snippet
fetcher in scripts/crossref.lean (~760 LOC). This version delegates to
the existing privilege-escalation-bridge infrastructure (same pattern
as olean_report.yaml + olean_report_wf_run.yaml) and to external-tags,
cutting the mathlib4 surface to ~97 LOC.

Trust model: the privileged workflow_run job runs only code from
mathlib-ci@<pinned SHA> (which runs only code from external-tags at a
pinned SHA). Nothing from the build artifact is interpreted as code;
the TSV is parsed as untrusted data with user-controllable fields
escaped before interpolation into the comment.

Depends on leanprover-community#39876.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The tag, module, and comment fields all pass through sanitiseField
before being written to TSV, but tag.declName was interpolated raw.
Lean `Name`s can contain backtick-quoted segments (`«weird name»`) and
in principle could include tabs or newlines; the one unsanitised field
in a "framing-safe" TSV is inconsistent and risks breaking the
downstream parser.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
kim-em added a commit to kim-em/mathlib4 that referenced this pull request May 27, 2026
…un shim

Stack on top of the dump_crossref_tags.lean PR. Adds two pieces:

* .github/workflows/build_template.yml: three new steps in `post_steps`,
  after the import-graph upload. Run scripts/dump_crossref_tags.lean to
  produce crossref-tags.tsv. For pull_request_target builds, build a
  bridge payload (pr_number, head_sha) and emit it via
  leanprover-community/privilege-escalation-bridge/emit using artifact
  name `crossref-tags-bridge`. Non-PR builds skip the emit.

* .github/workflows/crossref_review.yml: workflow_run listener on
  the two `continuous integration` workflows. Consumes the bridge
  artifact via privilege-escalation-bridge/consume, checks out
  leanprover-community/mathlib-ci, and invokes its
  scripts/crossref_review/post-comment.sh orchestrator. The orchestrator
  clones leanprover-community/external-tags at a pinned SHA and uses it
  to fetch snippets and render the PR comment.

Replaces leanprover-community#39666. The original PR rolled its own artifact upload + PR
number resolution fallback chain (~158 LOC) and included the snippet
fetcher in scripts/crossref.lean (~760 LOC). This version delegates to
the existing privilege-escalation-bridge infrastructure (same pattern
as olean_report.yaml + olean_report_wf_run.yaml) and to external-tags,
cutting the mathlib4 surface to ~97 LOC.

Trust model: the privileged workflow_run job runs only code from
mathlib-ci@<pinned SHA> (which runs only code from external-tags at a
pinned SHA). Nothing from the build artifact is interpreted as code;
the TSV is parsed as untrusted data with user-controllable fields
escaped before interpolation into the comment.

Depends on leanprover-community#39876.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@kim-em
Copy link
Copy Markdown
Contributor Author

kim-em commented May 27, 2026

Test output from rebasing #39909 onto this branch + #39877 and running the dump script.

$ lake env lean --run scripts/dump_crossref_tags.lean /tmp/crossref-tags.tsv
Wrote 505 cross-reference tag(s) to /tmp/crossref-tags.tsv (55607 bytes).

Breakdown:

All 505 rows have exactly 5 tab-separated fields; 204 carry a non-empty comment so the sanitisation path is exercised. File size 55 KB, well under the 2 MB cap.

The 14 wikidata rows from #39909:

wikidata  Q193878   Ideal.quotientInfRingEquivPiQuotient                                       Mathlib/RingTheory/Ideal/Quotient/Operations.lean
wikidata  Q505798   Subgroup.card_subgroup_dvd_card                                            Mathlib/GroupTheory/Coset/Card.lean
wikidata  Q505798   AddSubgroup.card_addSubgroup_dvd_card                                      Mathlib/GroupTheory/Coset/Card.lean
wikidata  Q1052678  dense_iInter_of_isOpen                                                     Mathlib/Topology/Baire/Lemmas.lean
wikidata  Q245098   intermediate_value_Icc                                                     Mathlib/Topology/Order/IntermediateValue.lean
wikidata  Q253214   Metric.isCompact_iff_isClosed_bounded                                      Mathlib/Topology/MetricSpace/Bounded.lean
wikidata  Q190546   inner_mul_inner_self_le                                                    Mathlib/Analysis/InnerProductSpace/Basic.lean
wikidata  Q193286   exists_deriv_eq_zero                                                       Mathlib/Analysis/Calculus/LocalExtr/Rolle.lean
wikidata  Q189136   exists_ratio_deriv_eq_ratio_slope                                          Mathlib/Analysis/Calculus/Deriv/MeanValue.lean
wikidata  Q866116   exists_extension_norm_eq                                                   Mathlib/Analysis/Normed/Module/HahnBanach.lean
wikidata  Q192760   Complex.isAlgClosed                                                        Mathlib/Analysis/Complex/Polynomial/Basic.lean
wikidata  Q939927   ContinuousMap.subalgebra_topologicalClosure_eq_top_of_separatesPoints      Mathlib/Topology/ContinuousMap/StoneWeierstrass.lean
wikidata  Q1816967  exists_continuous_zero_one_of_isClosed                                     Mathlib/Topology/UrysohnsLemma.lean
wikidata  Q220680   ContractingWith.exists_fixedPoint                                          Mathlib/Topology/MetricSpace/Contracting.lean

One thing worth flagging for the downstream external-tags diff filter: a single one-line to_additive (attr := wikidata Q…) change in a PR produces ≥2 TSV rows, so the diff-to-tag mapping isn't 1:1.

Comment on lines +72 to +74
rows := rows.push s!"{dbName}\t{sanitiseField tag.tag}\t\
{sanitiseField tag.declName.toString}\t\
{sanitiseField file}\t{sanitiseField tag.comment}"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noobie question: would it be easy to also log the file + linenr of this tag? That might double the size of the TSV, but I think it might be quite useful for post-processors.

@joneugster joneugster added the LLM-generated PRs with substantial input from LLMs - review accordingly label May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI Modifies the continuous integration setup or other automation LLM-generated PRs with substantial input from LLMs - review accordingly

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants