Table-Of-Contents:
* Quick Overview
* Intro via examples
* Basics
* Generic ranges and multiple branches
* Thickets of branches
* Reasons for diverging from cherry-pick & rebase
* Server-side needs
* Decapitate HEAD-centric assumptions
* Performance
* Multiple branches and non-checked out branches
* Preserving topology, replaying merges
* Current status
======================================================================
Quick Overview
======================================================================
`git replay`, at a basic level, can perhaps be thought of as a
"default-to-dry-run rebase" -- meaning no updates to the working tree,
or to the index, or to any references. However, it differs from
rebase in that it:
* Works for branches that aren't checked out
* Works in a bare repository
* Can replay multiple branches simultaneously (with or without common
history in the range being replayed)
* Preserves relative topology by default (merges are replayed too)
* Focuses on performance
* Has several altered defaults as a result of the above
I sometimes think of `git replay` as "fast-replay", a patch-based
analogue to the snapshot-based fast-export & fast-import tools.
======================================================================
Intro via examples
======================================================================
* Basics -- rebasing
Since `git replay` can work in bare repositories or for branches that
are not checked out, there is no implicit assumption that HEAD is part
of what is being replayed nor that it is the target for where to
replay commits. These have to be specified separately:
$ git replay --onto target origin/main..mybranch
update refs/heads/mybranch ${NEW_mybranch_HASH} ${OLD_mybranch_HASH}
Note that this step does create several new commits (with the tip
commit represented by ${NEW_mybranch_HASH}), even though it has not
updated any references to make use of them. The output format is
specifically designed to be usable as input to `git update-ref
--stdin`. An option can be added later to have `git replay`
automatically do the update as well.
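As a minimal sketch of that hand-off, the script below feeds a line in
the same `update <ref> <new> <old>` shape to `git update-ref --stdin`.
It does not call `git replay` itself, since that command may not be
available in every Git build. The throwaway repository, the `commit`
helper, and the commit made with `git commit-tree` (standing in for a
replayed commit) are assumptions made purely for illustration.

```shell
# Throwaway repository with a single branch, "mybranch" (illustrative name).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/mybranch

# Helper (hypothetical): create an empty commit with a fixed identity.
commit() {
    git -c user.name=Example -c user.email=example@example.com \
        commit -q --allow-empty -m "$1"
}
commit "old tip"
old=$(git rev-parse HEAD)

# Stand-in for a commit created by a replay: an unreferenced commit
# that reuses the old tip's tree.
tree=$(git rev-parse "HEAD^{tree}")
new=$(git -c user.name=Example -c user.email=example@example.com \
        commit-tree "$tree" -m "replayed tip")

# Same line format as the replay output shown above. update-ref checks
# that the ref still points at $old before moving it to $new.
printf 'update refs/heads/mybranch %s %s\n' "$new" "$old" |
    git update-ref --stdin

# The branch now points at the new commit.
git rev-parse refs/heads/mybranch
```

With a Git that provides `git replay`, the same effect would presumably
come from piping its output straight into `git update-ref --stdin`.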
* Basics -- cherry-picking
If you want to cherry-pick rather than rebase, just change one flag:
$ git replay --advance target origin/main..mybranch
update refs/heads/target ${NEW_target_HASH} ${OLD_target_HASH}
With both --onto and --advance, the range of commits represented by
origin/main..mybranch is replayed on top of target; the only
question is whether mybranch or target is updated to reflect the newly
replayed commits.
* Scaling up -- rebasing a stack of branches
What if you have a stack of branches, one depending upon another, and
you'd really like to rebase the whole set?
$ git replay --contained --onto origin/main origin/main..tipbranch
update refs/heads/branch1 ${NEW_branch1_HASH} ${OLD_branch1_HASH}
update refs/heads/branch2 ${NEW_branch2_HASH} ${OLD_branch2_HASH}
update refs/heads/tipbranch ${NEW_tipbranch_HASH} ${OLD_tipbranch_HASH}
So much nicer than trying to run N separate rebases, each of which
involves a different <ONTO> and <UPSTREAM> and forces you to first
check out each branch in turn.
* Scaling up -- generic ranges
When calling `git replay`, one can use more generic range expressions than
just a simple `A..B`:
$ git replay --onto origin/main ^base branch1 branch2 branch3
update refs/heads/branch1 ${NEW_branch1_HASH} ${OLD_branch1_HASH}
update refs/heads/branch2 ${NEW_branch2_HASH} ${OLD_branch2_HASH}
update refs/heads/branch3 ${NEW_branch3_HASH} ${OLD_branch3_HASH}
This will simultaneously rebase branch1, branch2, and branch3 -- all
commits they have since base, playing them on top of origin/main.
These three branches may have commits on top of base that they have in
common (in which case, their rebased versions will also share common
history), but they do not need to share any history being replayed.
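One way to preview which commits such a range selects is to hand the
same arguments to `git rev-list` or `git log`, on the assumption that
the range is interpreted as an ordinary revision range. The sketch
below builds a throwaway repository in which branch1 and branch2 share
one commit on top of base and each add one commit of their own. The
branch layout and the `commit` helper are made up for the example.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/base

# Helper (hypothetical): create an empty commit with a fixed identity.
commit() {
    git -c user.name=Example -c user.email=example@example.com \
        commit -q --allow-empty -m "$1"
}

commit "base"
git checkout -q -b branch1
commit "shared by branch1 and branch2"
git checkout -q -b branch2
commit "only on branch2"
git checkout -q branch1
commit "only on branch1"

# Commits reachable from branch1 or branch2 but not from base,
# listed oldest first.
git --no-pager log --reverse --format=%s ^base branch1 branch2

# The shared commit is counted once, so three commits are selected.
count=$(git rev-list --count ^base branch1 branch2)
echo "$count commits selected"
```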
* Scaling up -- replaying merges
Unlike rebase, if the range contains merges, the merges are not
dropped -- they are replayed instead. This is not a simple re-merge like
rebase's --rebase-merges, because it carries over any "fixups" made in
the original merge relative to an automatic merge. See below under
"Preserving topology, replaying merges".
* Scaling up -- thickets of branches
What if we just want to redo the merges in seen, perhaps reordering (or
even dropping some of) them?
$ git replay --interactive --onto next --first-parent next..seen
update refs/heads/seen ${NEW_seen_HASH} ${OLD_seen_HASH}
What if we want to update $TOPIC branch and then update seen to include the
updates to the $TOPIC branch?
$ git replay --interactive --onto next --first-parent next..seen $TOPIC
BUG: as-is, this would replay the commits in $TOPIC on top of next
too, which probably isn't wanted. We may want the commits in
"--first-parent next..seen" to be replayed on top of current next
while the commits in the base of $TOPIC we want to keep the same
parent. We need some kind of syntax to handle that. I think this
might be what the "no-rebase-cousins" is for in `git rebase`, but I
hate the name and I'm not sure if it's as generic as we need.
======================================================================
Reasons for diverging from cherry-pick & rebase
======================================================================
There are multiple reasons to diverge from the defaults in cherry-pick and
rebase.
* Server-side needs
* Both cherry-pick and rebase, via the sequencer, are heavily tied
to updating the working tree, index, some refs, and a lot of
control files with every commit replayed, and invoke a mess of
hooks[1] that might be hard to avoid for backward compatibility
reasons (at least, that's been brought up a few times on the
list).
* cherry-pick and rebase both fork various subprocesses
unnecessarily, but somewhat intrinsically in part to ensure the
same hooks are called that old scripted implementations would
have called.
* "Dry run" behavior, where there are no updates to worktree, index,
or even refs might be important.
* Should not assume users only want to operate on HEAD (see next
section)
* Decapitate HEAD-centric assumptions
* cherry-pick forces commits to be played on top of HEAD; inflexible.
* rebase assumes the range of commits to be replayed is
upstream..HEAD by default, though it allows one to replay
upstream..otherbranch -- but it still forcibly and needlessly
checks out otherbranch before starting to replay things.
* Assuming HEAD is involved severely limits replaying multiple
(possibly divergent) branches.
* Once you stop assuming HEAD has a certain meaning, there's not
much reason to have two separate commands anymore (except for the
funny extra not-necessarily-compatible options both have gained
over time).
* (Micro issue: Assuming HEAD is involved also makes it harder for
new users to learn what rebase means and does; it makes command
lines hard to parse. Not sure I want to harp on this too much, as
I have a suspicion I might be creating a tool for experts with
complicated use cases, but it's a minor quibble.)
* Performance
* jj is slaughtering us on rebase speed[2]. I would like us to become
competitive. (I dropped a few comments in the link at [2] about why
git is currently so bad.)
* From [3], there was a simple 4-patch series in linux.git that took
53 seconds to rebase. Switching to ort dropped it to 16 seconds.
While that sounds great, only 11 *milliseconds* were needed to do
the actual merges. That means almost *all* the time (>99%) was
overhead! Big offenders:
* --reapply-cherry-picks should be the default
* can_fast_forward() should be ripped out, and perhaps other extraneous
revision walks
* avoid updating working tree, index, refs, reflogs, and control
structures except when needed (e.g. hitting a conflict, or operation
finished)
* Other performance ideas:
* single-file control structures instead of directory of files
* avoid forking subprocesses unless explicitly requested (e.g.
--exec, --strategy, --run-hooks). For example, definitely do not
invoke `git commit` or `git merge`.
* Sanitize hooks:
* dispense with all per-commit hooks for sure (pre-commit,
post-commit, post-checkout).
* pre-rebase also seems to assume exactly 1 ref is written, and
invoking it repeatedly would be stupid. Plus, it's specific
to "rebase". So...ignore? (Stolee's --ref-update option for
rebase probably broke the pre-rebase assumptions already...)
* post-rewrite hook might make sense, but fast-import got
exempted, and I think of replay like a patch-based analogue
to the snapshot-based fast-import.
* When not running server side, resolve conflicts in a sparse-cone
sparse-index worktree to reduce number of files written to a
working tree. (See below as well)
* [High risk of possible premature optimization] Avoid large
numbers of newly created loose objects, when replaying large
numbers of commits. Two possibilities: (1) Consider using
tmp-objdir and pack objects from the tmp-objdir at end of
exercise, (2) Lift code from git-fast-import to immediately
stuff new objects into a pack?
* Multiple branches and non-checked out branches
* The ability to operate on non-checked out branches also implies
that we should generally be able to replay when in a dirty working
tree (exception being when we expect to update HEAD and any of the
dirty files is one that needs to be updated by the replay).
* Also, if we are operating locally on a non-checked out branch and
hit a conflict, we should have a way to resolve the conflict without
messing with the user's work on their current branch.
* Idea: new worktree with sparse cone + sparse index checkout,
containing only files in the root directory, and whatever is
necessary to get the conflicts
* Companion to above idea: control structures should be written to
$GIT_COMMON_DIR/replay-${worktree}, so users can have multiple
replay sessions, and so we know which worktrees are associated
with which replay operations.
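A rough sketch of the companion idea only (it does not set up the
sparse cone or sparse index): `git rev-parse --git-common-dir` names
the directory shared by all worktrees, so a per-worktree state
directory can be derived from it. The worktree name "conflicts" and the
`replay-<name>` directory layout are assumptions taken from the
proposal above, not existing Git behavior.

```shell
repo=$(mktemp -d)
git init -q "$repo/main"
cd "$repo/main" || exit 1
git -c user.name=Example -c user.email=example@example.com \
    commit -q --allow-empty -m "initial"

# A separate worktree in which conflicts could be resolved.
git worktree add --detach "$repo/conflicts" >/dev/null 2>&1
cd "$repo/conflicts" || exit 1

# The common dir is shared by the main worktree and all linked ones.
common_dir=$(git rev-parse --git-common-dir)
worktree=$(basename "$(git rev-parse --show-toplevel)")

# Hypothetical location for this replay session's control structures.
state_dir="$common_dir/replay-$worktree"
mkdir -p "$state_dir"
echo "$state_dir"
```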
[1] https://lore.kernel.org/git/pull.749.v3.git.git.1586044818132.gitgitgadget@gmail.com/
[2] https://github.com/martinvonz/jj/discussions/49
[3] https://lore.kernel.org/git/CABPp-BE48=97k_3tnNqXPjSEfA163F8hoE+HY0Zvz1SWB2B8EA@mail.gmail.com/
======================================================================
Preserving topology, replaying merges
======================================================================
`git replay` will preserve relative topology by replaying merges.
Further, much as regular single-parent commits' changes are replayed,
we also want to replay the manual changes users include in merges.
Essentially, this means that after merging the rebased parents, we
need to amend that merge by applying the diff from `git show
--remerge-diff $oldmerge`. Or, equivalently, doing a three way merge
between:
* R: automatic remerge of $oldmerge accepting all conflicts
* O: $oldmerge
* N: (new) merge of rebased parents
A couple things to note about this three-way merge:
* `git diff R O` roughly equals `git show --remerge-diff $oldmerge`
* N is what current `git rebase --rebase-merges` uses, so we have a
superset of the information available to current `git rebase`.
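To make the R/O/N merge concrete at the level of a single file, here is
a small sketch using `git merge-file`, which applies the changes
between a base file and another file onto a current file. The file
contents are invented: R is the automatic remerge, O adds a manual
fixup on the last line, and N differs on the first line because the
parents were rebased. This only illustrates the idea; it is not
a claim about how `git replay` performs the merge internally.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1

# R: automatic remerge of the old merge.
printf '%s\n' one two three four five six seven > R
# O: the old merge as committed, with a manual fixup on the last line.
printf '%s\n' one two three four five six 'seven (fixed up by hand)' > O
# N: automatic merge of the rebased parents; the new base changed line 1.
printf '%s\n' 'one (from new base)' two three four five six seven > N

# Apply the R -> O difference (the user's fixup) on top of N.
# Argument order is: current file, base file, other file.
git merge-file -p N R O > result
cat result
```

The result keeps the first line from N and picks up the hand-made
fixup from O on the last line.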
This was discussed previously on list at [4], using the names pre-M,
M, and N instead of R, O, and N. After digging further, I think we
can do better on conflict resolution and avoiding nested conflict
markers...
Handling conflicts:
* When conflict markers are appropriate
* When creating R, we should "lie" about the hashes & commit summary
so that the conflict markers exactly match those that would be
used for N, because doing so allows us to detect when N has the
same textual conflicts as R.
* Consider using XDL_MERGE_FAVOR_BASE[5] to avoid nested conflicts from
recursive merges.
* We need a special xdiff merging mode for three-way merging R, O, and N:
* Note that O does not have conflict hunks; it was the user-created
merge, not an "automatic" merge. (Okay, the user may be stupid and have
committed with conflict markers, but I don't think we need to pay
attention to that, and users get what they deserve if they did that.)
* This special merging mode should never split a conflict hunk from
either R or N; it must operate on the entire hunk.
* If neither R nor N has conflict markers, then merging proceeds
as normal.
* If R & N have identical conflict hunks, then we can take the
version of text from O and the result is clean.
* If R has conflict hunks, but N does not:
* if merge.conflictStyle="merge", who cares, just two-way merge O & N
* if merge.conflictStyle="diff3", extend the conflict marker length by 1
for R, then three-way merge R, O, & N. You get a nested conflict.
* If N has conflict hunks that do not match R (R may or may not have
conflict hunks), then:
* We ignore both R & O and use the version from N as the resolution
* We do not mark it as resolved, though; we consider it to still be
conflicted.
* We make sure when the replay stops that the user is recommended to
run `git show --remerge-diff $oldmerge` for potential hints at
resolving the conflict. (Helpful since that command shows the diff
of R & O, and we threw away info from R & O here.)
* When conflict markers are not appropriate (binary files, mode
changes, modify/delete, etc., etc.):
* If both R & N have conflicts for a given path, and the three modes
& hashes from R match the three from N, then we can use the version of
that path from O as the resolution.
* If the three modes & hashes do not match between R & N:
* Use N as the resolution
* Do not mark the file as resolved, even if N had no conflicts
* Make sure the user is recommended to run `git show --remerge-diff
$oldmerge` for potential hints at resolving the conflict.
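The rules for textual conflict hunks can be summarized as a small
decision function. This is only a sketch of the case analysis above:
the function name, its yes/no arguments, and the result strings are
invented for illustration, and it does not cover the non-textual cases
(binary files, mode changes, modify/delete).

```shell
# classify_hunk R_CONFLICT N_CONFLICT SAME
#   R_CONFLICT, N_CONFLICT: "yes" if that side has a conflict hunk here
#   SAME: "yes" if the conflict hunks in R and N are identical
classify_hunk() {
    case "$1,$2,$3" in
        no,no,*)     echo "merge R, O, N as normal" ;;
        yes,yes,yes) echo "take O; result is clean" ;;
        yes,no,*)    echo "merge per merge.conflictStyle" ;;
        *,yes,*)     echo "take N; leave conflicted; suggest --remerge-diff" ;;
        *)           echo "invalid arguments" >&2; return 1 ;;
    esac
}

classify_hunk yes yes yes   # identical conflicts in R and N
classify_hunk no yes no     # N conflicts, R does not
```

The order of the patterns matters: the identical-hunk case has to be
tested before the catch-all for "N has a conflict that does not match
R".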
[4] https://lore.kernel.org/git/CABPp-BHp+d62dCyAaJfh1cZ8xVpGyb97mZryd02aCOX=Qn=Ltw@mail.gmail.com/
[5] https://lore.kernel.org/git/CABPp-BF2KnktDTtTfp=hRS36HN-xYC8=P1eYcqaBhJvAJcTCAw@mail.gmail.com/
======================================================================
Current status
======================================================================
I have the basic replay functionality (cherry-pick or rebase), but:
* the code die()s if there are any conflicts. Not halts, die()s.
* no resumability
* no nice output if there are conflicts
* replaying merges works IFF
* an automatic remerge of the original merge has no conflicts AND
* auto-merging the rebased parents has no conflicts AND
* three-way merging those two merges with the original merge has no
conflicts
Up next:
* --interactive support (needed for resumability)
* extend todo_item & friends to support:
play <merge-hash> relative to <label or secondary hash>
update-ref
* maybe:
* use 'play' instead of 'pick'
* drop 'merge'
* new make_replay_script() (similar to make_script_with_merges())
using new "play" & "update-ref"
* resumability when replaying commits locally
* update (current) worktree with conflicts
* save needed metadata as single INI file
* completed todo_list steps
* next todo_list steps
* ref-updates so far
* commit mappings
* flag settings
* handling non-trivial replaying of merges
* handle R & N both have conflicts with same mode,oid triplets
* handle R & N not having textual conflict type
* xdiff changes for handling R & N having textual conflicts
* check out the XDL_MERGE_FAVOR_BASE idea
* resolving conflicts in sparse-index worktrees with maximal sparsity
* when a commit is being replayed on top of the exact same parent,
just use the commit as-is. (Enable fast-forward w/o the can_fast_forward()
penalty.)
* Nice touches
* Assume that only refs under refs/heads/ are meant to be updated, so that
`git replay origin/main..$TOPIC` works as shorthand for
`git replay --onto origin/main origin/main..$TOPIC`
* rewrite commit hashes referenced in commit messages when replaying
(much like filter-repo does)
* More tests
* Figure out what to do with server-side and conflicts
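The "single INI file" idea listed under "Up next" could be prototyped
with `git config --file`, since Git's config format is already
INI-like. Everything about the layout below is hypothetical: the
section and key names (`replay.onto`, `replay.done`, `replay.todo`,
`commit-map.<old>.new`), the todo wording, and the placeholder hashes
are invented for the sketch and do not describe an existing or planned
on-disk format.

```shell
state=$(mktemp)   # stand-in for the single metadata file

# Flag settings and the replay target.
git config --file "$state" replay.onto "origin/main"

# Completed and pending todo steps, stored as multi-valued keys.
git config --file "$state" --add replay.done "pick aaaaaaa"
git config --file "$state" --add replay.todo "pick bbbbbbb"
git config --file "$state" --add replay.todo "update-ref refs/heads/topic"

# Old-to-new commit mapping, keyed by subsection.
git config --file "$state" commit-map.aaaaaaa.new "ccccccc"

# Show the file, then read the pending steps back in order.
cat "$state"
git config --file "$state" --get-all replay.todo
```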