Format state_before/state_after datapoints as partial updates #17

Merged · 4 commits · Jul 19, 2023

Conversation

@antonkov (Contributor) commented Jul 17, 2023

Generate datapoints in the new format:

  1. we suspect it will make it easier for ReProver to predict tactics
  2. it will make it easier for paperproof to provide the after state

with the following differences from the simple state_before/state_after approach:

  1. Remove the case goalName line from the before and after states.
  2. Keep only the 1st goal in the before state, and only the goals it forked into in the after state.
  3. In the after state, show ReProver only the added and updated hypotheses. Deleted hypotheses are not mentioned at all.
  4. From each example (like the apply And.intro example below), generate multiple datapoints, one per diff. In paperproof, the user shouldn't need to write out every change a tactic would bring: rw might affect 3 hypotheses, the user will mention only one of them and expect ReProver to make exactly that change, without worrying about the other hypotheses not being mentioned in the after state.
    It's like a deepening lens: the model sees that it has a particular before state ⊢ p ∧ q and that it wants to reach ⊢ p. The model does not care about anything else. (A short code sketch of this splitting follows the examples below.)

Examples:

  1. Previous format for tactic trivial
before
case inl
p : Prop
h✝ : p
⊢ True

case inr
p : Prop
h✝ : p
⊢ True

after
case inr
p : Prop
h✝ : p
⊢ True

new format

before
p : Prop
h✝ : p
⊢ True

after
goals accomplished
  2. New format for tactic intro h generates two datapoints
before

⊢ p -> q

after
...
h : p
⊢ q
before

⊢ p -> q

after
...
⊢ q
  3. New format for apply And.intro generates two datapoints
before
q : Prop
p : Prop
hp : p
hq : q
⊢ p ∧ q

after
...
⊢ p
before
q : Prop
p : Prop
hp : p
hq : q
⊢ p ∧ q

after
...
⊢ q
  4. New format for rw [Nat.add_comm a b]
before
a : Nat
b : Nat
c : Nat
⊢ c + (a + b) = c + b + a

after
...
⊢ c + (b + a) = c + b + a
  5. For a synthetic example with the old format
before 
a : 1
b : 2
c : 3
⊢ goalA

after
a : 666
b : 2 
c : 3
m : 777
⊢ goalA

the new format is two separate datapoints

before
a : 1
b : 2 
c : 3
⊢ goalA

after
...
a : 666
⊢ goalA
before
a : 1
b : 2 
c : 3
⊢ goalA

after
...
m : 777
⊢ goalA
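
To make the per-diff splitting above concrete, here is a minimal Python sketch (illustrative helper names, not the actual code in this PR). It assumes hypotheses arrive as plain "name : type" strings and a single goal before and after; it yields one goal-only datapoint when the goal changes, plus one datapoint per added or updated hypothesis, and never mentions deleted hypotheses.

def split_into_partial_updates(hyps_before, hyps_after, goal_before, goal_after):
    # Illustrative sketch, not the PR's code: one datapoint per observable diff.
    before = hyps_before + ["⊢ " + goal_before]
    if goal_after != goal_before:
        # Goal-only datapoint; "..." hides whatever happened to the hypotheses.
        yield before, ["...", "⊢ " + goal_after]
    before_types = dict(h.split(" : ", 1) for h in hyps_before)
    for hyp in hyps_after:
        name, ty = hyp.split(" : ", 1)
        if before_types.get(name) != ty:
            # One datapoint per added or updated hypothesis; deletions are never shown.
            yield before, ["...", hyp, "⊢ " + goal_after]

For the synthetic example above this yields exactly the two datapoints shown (a : 666 and m : 777, each with ⊢ goalA); for intro h it yields the goal-only datapoint and the h : p datapoint.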

@antonkov marked this pull request as ready for review July 18, 2023 15:03
@yangky11 (Member)
I didn't understand this data point generated by intro h:

before

⊢ p -> q

after
...
⊢ q

Where is h : p?

@antonkov (Contributor, Author) commented Jul 18, 2023

The idea is that we want the model to respond to two types of queries:

  1. Specify new desired goal
before
${current_state}
⊢ ${current_goal}

after
...
⊢ ${new_goal}
  2. Specify the desired hypothesis
before
${current_state}
⊢ ${current_goal}

after
...
${new_hypothesis}
⊢ ${current_goal}

The datapoint without any new hypothesis serves query type 1: the user doesn't specify which hypotheses appear, only the goal change. The ... conveys that there may be changes in the state, but the query can't tell exactly which ones and also doesn't care. It might feel more natural on an example: with cases on a type Color = Red | Blue | Green and the goal ⊢ ∀ c : Color, P c, the user specifies the goal ⊢ P Blue (it will create the goals P Red and P Green as a byproduct, but we assume that's what the user wanted).

The other datapoint, with h : p, serves the 2nd type of query, where the user wants a hypothesis change even if it requires a goal change. However, you can notice a discrepancy between the query and the training datapoint, since the old goal will have to be specified in the query (we don't know what the new goal is, but want to preserve the old one if possible):

before

⊢ p -> q

after
...
h : p
⊢ p -> q

I'm not sure whether the model will find this close enough to the intro h training example; please let me know what you think and whether it needs adjustment. Also note that the last snippet is a query example, not a training datapoint: all training datapoints are consistent and valid, with the semantics that ... may hide some hypotheses. Thanks.
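
For concreteness, a small Python sketch of how these two query types could be rendered at inference time (the function names are illustrative, not part of this PR); the "..." marks that unspecified hypotheses may change:

def render_goal_query(hyps, goal, new_goal):
    # Query type 1: only the desired new goal is specified.
    before = "\n".join(hyps + ["⊢ " + goal])
    after = "\n".join(["...", "⊢ " + new_goal])
    return "before\n" + before + "\n\nafter\n" + after

def render_hypothesis_query(hyps, goal, new_hyp):
    # Query type 2: a desired hypothesis is specified and the old goal is kept.
    before = "\n".join(hyps + ["⊢ " + goal])
    after = "\n".join(["...", new_hyp, "⊢ " + goal])
    return "before\n" + before + "\n\nafter\n" + after

render_hypothesis_query([], "p -> q", "h : p") reproduces the query in the last snippet above, which differs from the intro h training datapoint only in keeping the old goal.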

@yangky11 (Member)
That makes sense. Thanks for the explanation. I think it should be good as long as examples at test time are presented to the model in the same format as in training. I'll start training the model with the updated data ASAP, which will finish in one week.

@yangky11 merged commit bb0aec7 into lean-dojo:given-next-state Jul 19, 2023 (1 check passed)
@yangky11 (Member) commented Jul 19, 2023

@antonkov It looks like the code doesn't handle state_after_str == "no goals"? It leads to an AssertionError at assert lines[-1].startswith("⊢").

@antonkov (Contributor, Author)
Oh right, good catch. It should be straightforward to add a check: if state_after_str == "no goals": goals_after = []. Can you do that?
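
A minimal sketch of that guard (illustrative; parse_goals stands in for the existing parsing path that asserts "⊢" on the last line):

if state_after_str == "no goals":
    goals_after = []  # the tactic closed every goal, so there is nothing to diff
else:
    goals_after = parse_goals(state_after_str)  # existing parsing path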

@yangky11 (Member) commented Jul 19, 2023

@antonkov I fixed this error, but there are other cases that cannot be handled by the parsing code. For example, parse_hyps cannot parse inst✝⁷ : (x : B) → AddCommMonoid (E x) due to the additional : in the type. The LeanDojo repo has code for parsing the pretty-printed goals: https://github.com/lean-dojo/LeanDojo/blob/main/src/lean_dojo/interaction/dojo.py. Maybe you can directly use it or adapt it somehow (opening a new PR based on the updated given-next-state branch)?

@yangky11 (Member)
An example goal that your current code cannot parse:

R : Type u_2
B : Type u_1
F : Type u_3
E : B → Type ?u.410614
inst✝⁸ : NontriviallyNormedField R
inst✝⁷ : (x : B) → AddCommMonoid (E x)
inst✝⁶ : (x : B) → Module R (E x)
inst✝⁵ : NormedAddCommGroup F
inst✝⁴ : NormedSpace R F
inst✝³ : TopologicalSpace B
inst✝² : TopologicalSpace (TotalSpace E)
inst✝¹ : (x : B) → TopologicalSpace (E x)
inst✝ : FiberBundle F E
ι : Type u_4
Z : VectorBundleCore R B F ι
b✝ : B
a : F
i j : ι
b : B
hb : b ∈ baseSet Z i
v : F
⊢ Trivialization.symm (localTriv Z i) b v = ↑(coordChange Z i (indexAt Z b) b) v
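
For illustration only (this is not the LeanDojo parser linked above), this particular goal does parse if each hypothesis line is split on its first " : ", since binder names never contain a colon even when the type does; multi-line hypotheses and other pretty-printer output would still need the more robust LeanDojo code:

def parse_hyps_sketch(goal_str):
    # Split on the first " : ": everything after it is the type, so
    # "inst✝⁷ : (x : B) → AddCommMonoid (E x)" parses despite the inner ":".
    hyps = []
    for line in goal_str.splitlines():
        if line.startswith("⊢"):
            break  # goal line reached; hypotheses are above it
        names, ty = line.split(" : ", 1)
        hyps.append((names.split(), ty))  # "i j : ι" declares two hypotheses of type ι
    return hyps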

@antonkov (Contributor, Author)
Yes, thanks. I'll iterate to fix the errors and send a new PR.
