Getting Intermediate States from Combined Tactics #14

AG161 · 2023-07-09T23:07:19Z

AG161
Jul 9, 2023

I'm trying to use LeanDojo to extract the ASTs of mathlib theorems for LLM training data, and I'm concerned that combined tactics hide part of a proof's full AST. For example, get_single_tactic_proof in https://github.com/lean-dojo/LeanDojo/blob/main/src/lean_dojo/data_extraction/traced_data.py can combine a whole proof into a single tactic. In that case, the state before the tactic is just the goal and the state after is "no goals". That's an extreme case, but tactic combinators are quite common, particularly sequences of rewrites. What I'd like is the before and after state for each tactic that makes up a larger combined tactic: essentially the opposite of get_single_tactic_proof, decomposing the entire proof into its smallest tactic components and giving me the "biggest" AST of the theorem.

Anyway, I just wanted to ask how involved something like that would be, given the way LeanDojo works. Would it take some minor modifications to https://github.com/lean-dojo/LeanDojo/blob/main/src/lean_dojo/data_extraction/ExtractData.lean, or would it require a major rework? I'm asking because I'm not really familiar with metaprogramming in Lean. Thanks!

yangky11 · 2023-07-10T02:09:36Z

yangky11
Jul 10, 2023
Maintainer

For Lean 3, I don't think there is an easy way to do that, as the --ast --tsast --tspp flags we use for exporting the ASTs and tactic states are implemented in C++.

For Lean 4, it should be a fairly straightforward change to ExtractData.lean. Currently, in visitTacticInfo, we only export the tactic information for children of tacticSeq1Indented or tacticSeqBracketed. I guess that part should be changed if you want to export more fine-grained tactic info.

3 replies

yangky11 Jul 10, 2023
Maintainer

Copying @antonkov since we discussed similar topics.

antonkov Jul 10, 2023

Adding intermediate rewrites is relatively easy. I'm working on a PR for that and already have them in ast.json. What remains is postprocessing the tactic string: in rw [a, b, c], the sub-rule of rw refers to just a, but we need to process it into rewrite a.

@AG161 do you have any other tactic combinators besides rw/rewrite in mind which LeanDojo currently doesn't work with?

antonkov Jul 10, 2023

Here is the draft #16. I'm still working on taking config and location into account, e.g. so that rw [a, b] at h generates rewrite [a] at h and rewrite [b] at h.
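As a small illustration of the rewrite splitting described above (a hand-written sketch, not output from LeanDojo or from the PR; the hypothesis names h₁, h₂ and h are made up for the example), the two proofs below perform the same rewrites, but only the second one exposes the state between them as its own tactic step:

```lean
-- Combined form: a single `rw` tactic with two rules and a location.
example (a b c : Nat) (h₁ : a = b) (h₂ : b = c) (h : a = 0) : c = 0 := by
  rw [h₁, h₂] at h      -- h : a = 0  becomes  h : c = 0 in one step
  exact h

-- Decomposed form: one `rewrite` per rule, each keeping the `at h` location.
example (a b c : Nat) (h₁ : a = b) (h₂ : b = c) (h : a = 0) : c = 0 := by
  rewrite [h₁] at h     -- h : b = 0
  rewrite [h₂] at h     -- h : c = 0
  exact h
```

One difference to keep in mind: rw also tries to close the goal with rfl after rewriting, while rewrite does not, so the two forms are not always interchangeable.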

AG161 · 2023-07-11T22:14:29Z

AG161
Jul 11, 2023
Author

"@AG161 do you have any other tactic combinators besides rw/rewrite in mind which LeanDojo currently doesn't work with?" (quoting @antonkov)

Hi @antonkov
Sorry for the late reply, I was having issues tracing an example repo and had to switch OS to get it done. Basically what I'm worried about are the tactic combinators in Lean and more complicated ways of writing tactics, for example in Section 5 of Theorem Proving in Lean 4: https://leanprover.github.io/theorem_proving_in_lean4/tactics.html#tactic-combinators

I made a fork of the LeanDojo example repo here with some basic examples: https://github.com/AG161/lean4-example/blob/main/Lean4Example.lean

theorem dojo2_uncombined (p : Prop) : p ∨ p → p := by
  intro h
  cases h
  assumption
  assumption

theorem dojo2_combined (p : Prop) : p ∨ p → p := by
  intro h
  cases h <;> assumption

For the first theorem, get_traced_tactics() returns this:

[TracedTactic(tactic=intro h, state_before=p : Prop
⊢ p ∨ p → p, state_after=p : Prop
h : p ∨ p
⊢ p), TracedTactic(tactic=cases h, state_before=p : Prop
h : p ∨ p
⊢ p, state_after=case inl
p : Prop
h✝ : p
⊢ p

case inr
p : Prop
h✝ : p
⊢ p), TracedTactic(tactic=assumption, state_before=case inl
p : Prop
h✝ : p
⊢ p

case inr
p : Prop
h✝ : p
⊢ p, state_after=case inr
p : Prop
h✝ : p
⊢ p), TracedTactic(tactic=assumption, state_before=case inr
p : Prop
h✝ : p
⊢ p, state_after=no goals)]

The second theorem returns this:

[TracedTactic(tactic=intro h, state_before=p : Prop
⊢ p ∨ p → p, state_after=p : Prop
h : p ∨ p
⊢ p), TracedTactic(tactic=cases h <;> assumption, state_before=p : Prop
h : p ∨ p
⊢ p, state_after=no goals)]
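For comparison, here is a hand-decomposed variant that sits between the two theorems above. It is only a sketch: the name dojo2_all_goals is made up, and I have not traced it with LeanDojo. Because cases h is now a separate step in the tactic sequence, one would expect its resulting two-goal state to show up on its own, while the two assumption calls are still grouped under all_goals:

```lean
theorem dojo2_all_goals (p : Prop) : p ∨ p → p := by
  intro h
  cases h               -- produces the goals `inl` and `inr`
  all_goals assumption  -- runs `assumption` on every remaining goal
```

With a single goal before cases h, this behaves the same as cases h <;> assumption. In general, t <;> s only runs s on the goals produced by t, whereas all_goals s runs on all open goals.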

In dojo2_combined, the tactic combinator <;> has Lean first apply cases h and then apply assumption to each resulting goal. So even though the sequence of tactics that Lean executes is the same as in the first theorem, the AST we look at in the second case "hides" part of the full AST. In principle, there's no limit to the number of tactics that can be hidden in this way. For the next two examples, I didn't provide the decomposed proofs out of laziness, but the same idea applies:

theorem dojo3_uncombined (x y z : Nat)
        : (x + 0) * (0 + y * 1 + z * 0) = x * y := by
  simp

For this theorem, get_traced_tactics() returns the following list:

[TracedTactic(tactic=simp, state_before=x y z : Nat
⊢ (x + 0) * (0 + y * 1 + z * 0) = x * y, state_after=no goals)]

The simp tactic works by searching for rewrite lemmas that are tagged for simplification and using them to prove the identity. Although an LLM could apply the same tactic and succeed, ideally I would like access to the AST of the actual sequence of rewrites that were used to simplify the goal, which Lean has to check for correctness anyway.
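To make concrete what a more explicit version of this proof could look like, here is a sketch with the lemma list written out by hand. The name dojo3_simp_only and the choice of lemmas are my own assumptions about what suffices for this goal, not a record of what simp actually does internally:

```lean
theorem dojo3_simp_only (x y z : Nat)
        : (x + 0) * (0 + y * 1 + z * 0) = x * y := by
  -- Only the four listed lemmas (plus simp's built-in `a = a` check) are used:
  --   x + 0 = x,  0 + y = y,  y * 1 = y,  z * 0 = 0
  simp only [Nat.add_zero, Nat.zero_add, Nat.mul_one, Nat.mul_zero]
```

This still records a single tactic with "no goals" afterwards, so it exposes which lemmas are involved but not the order of the individual rewrite steps. Recent Lean versions also provide simp?, which suggests a simp only call of this kind.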

A final example (maybe @yangky11 can take a look at this one; I found the result perplexing and I'm not sure whether it is intended):

theorem dojo4_uncombined (p q r : Prop) (hp : p)
        : (p ∨ q ∨ r) ∧ (q ∨ p ∨ r) ∧ (q ∨ r ∨ p) := by
  repeat (first | apply And.intro | apply Or.inl; assumption | apply Or.inr | assumption)

Here the combined tactic repeatedly goes through the listed tactics in order, applies the first one that works, and then starts over, until there are no goals left. For this one get_traced_tactics() returns:

[TracedTactic(tactic=repeat (first | apply And.intro | apply Or.inl; assumption | apply Or.inr | assumption), state_before=p q r : Prop
hp : p
⊢ (p ∨ q ∨ r) ∧ (q ∨ p ∨ r) ∧ (q ∨ r ∨ p), state_after=no goals), TracedTactic(tactic=(first | apply And.intro | apply Or.inl; assumption | apply Or.inr | assumption), state_before=no goals, state_after=no goals), TracedTactic(tactic=first | apply And.intro | apply Or.inl; assumption | apply Or.inr | assumption, state_before=no goals, state_after=no goals), TracedTactic(tactic=apply And.intro, state_before=no goals, state_after=no goals), TracedTactic(tactic=apply Or.inl, state_before=no goals, state_after=no goals), TracedTactic(tactic=assumption, state_before=case right.right.h.h
p q r : Prop
hp : p
⊢ r, state_after=case right.right.h.h
p q r : Prop
hp : p
⊢ r), TracedTactic(tactic=apply Or.inr, state_before=no goals, state_after=no goals), TracedTactic(tactic=assumption, state_before=no goals, state_after=no goals)]

The first tactic in the list looks like what I expected the entire output to look like: the combined tactic solves the entire goal. However, several more tactics appear in the list with no goals before and after. At a certain point (where it says apply And.intro), the list seems to track which tactics were selected out of the combined tactic, but this only happens for part of the proof; many of the other tactics are left out, and the "no goals" before and after comes back. The output looks wrong to me. @yangky11, if I should open an issue about this, please let me know.
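For reference, this is roughly what a fully decomposed version of dojo4_uncombined could look like. It is written by hand (the name dojo4_decomposed is made up) and follows only the alternatives of first that end up succeeding on each goal, so it is a sketch of the output I had in mind rather than something LeanDojo produces:

```lean
theorem dojo4_decomposed (p q r : Prop) (hp : p)
        : (p ∨ q ∨ r) ∧ (q ∨ p ∨ r) ∧ (q ∨ r ∨ p) := by
  apply And.intro
  · -- p ∨ q ∨ r
    apply Or.inl
    assumption
  · apply And.intro
    · -- q ∨ p ∨ r
      apply Or.inr
      apply Or.inl
      assumption
    · -- q ∨ r ∨ p
      apply Or.inr
      apply Or.inr
      assumption
```

Each of the ten steps here would have its own before and after state, compared with the single "goal, then no goals" pair reported for the repeat tactic.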

In principle, it might not be reasonable to get the "true" AST from any proof because someone could define their own arbitrarily complex tactics. But in practice, for most of the things in mathlib4 I imagine just decomposing the basic tactic combinators Lean provides would allow us to see the "largest" and most informative AST of the theorem. Anyway, that's what I was thinking about.

antonkov · 2023-07-11T22:35:50Z

antonkov
Jul 11, 2023

Thanks @AG161,

the example

theorem dojo2_combined (p : Prop) : p ∨ p → p := by
  intro h
  cases h <;> assumption

will be parsed correctly by the https://github.com/Paper-Proof/paper-proof parser (please take a look if you're curious), but Paper Proof somehow chokes on the second example too. I'll take a look to see what the bug is.

The Paper Proof parser works more in the way you expect/describe: it parses the elaboration tree (Lean's InfoTree). The LeanDojo parser parses the InfoTree for tactic nodes too, but tries to overlay the InfoTree tactics onto the original Syntax (which is why assumption has only one entry even though it has multiple subentries in the InfoTree). One possible adjustment might be to let every tactic node have multiple (state_before, state_after) pairs, @yangky11?

Regarding the simp tactic, I thought the same as you and was expecting to find rewrite tactic entries in the elaboration tree; it would be nice to display them in the Paper Proof tactic tree. Sadly, as far as I observed, simp doesn't save them in the elaboration info and just constructs a term.

The PR for rewrites is ready. I can take a look at other tactic combinators. @yangky11, let me know what you think about the format for trace.xml.

3 replies

AG161 Jul 11, 2023
Author

Wow that looks really great! I will check out the Paper Proof github and try to see if I can come up with any more examples.

"Regarding the simp tactic, I thought the same as you and was expecting to find rewrite tactic entries in the elaboration tree; it would be nice to display them in the Paper Proof tactic tree. Sadly, as far as I observed, simp doesn't save them in the elaboration info and just constructs a term." (quoting @antonkov)

Ah I see, that's not what I expected, interesting. Thanks @antonkov

yangky11 Jul 12, 2023
Maintainer

It may be quite difficult to trace what simp does under the hood. I'm also not sure if that's necessary, since the user typically just treats it as a simplification tactic without thinking about the exact set of rewrites.

I'm not sure what's happening for repeat (first | apply And.intro | apply Or.inl; assumption | apply Or.inr | assumption). Yes, opening an issue would be great! I'll take a look this weekend.

lakesare Jul 12, 2023

We don't detect backtracking in the Paper Proof parser; we treat all attempted tactics as if they actually happened, so the proof looks weird.

Here apply Or.inl; assumption was tried, but then we backtracked, and yet we draw it (which hides the second tactic apply Or.inr; assumption). I wonder if there is a way to detect backtracking in the parser?
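A minimal example of the backtracking case, in case it helps with debugging. This is a made-up reduction of the repeat/first proof above, not taken from either parser's test suite:

```lean
example (p q : Prop) (hq : q) : p ∨ q := by
  first
    | apply Or.inl; assumption   -- `apply Or.inl` succeeds, `assumption` fails on `p`
    | apply Or.inr; assumption   -- after backtracking, this alternative closes the goal
```

Only the second alternative contributes to the final proof, but the first one was partially executed before first rolled it back, which appears to be the situation described above where an attempted tactic still gets drawn.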
