git-filter-repo(1)
==================
NAME
----
git-filter-repo - Rewrite repository history
SYNOPSIS
--------
[verse]
'git filter-repo' --analyze
'git filter-repo' [<path_filtering_options>] [<content_filtering_options>]
[<ref_renaming_options>] [<commit_message_filtering_options>]
[<name_or_email_filtering_options>] [<parent_rewriting_options>]
[<generic_callback_options>] [<miscellaneous_options>]
DESCRIPTION
-----------
Rapidly rewrite entire repository history using user-specified filters.
This is a destructive operation which should not be used lightly; it
writes new commits, trees, tags, and blobs corresponding to (but
filtered from) the original objects in the repository, then deletes the
original history and leaves only the new. See <<DISCUSSION>> for more
details on the ramifications of using this tool. Several different
types of history rewrites are possible; examples include (but are not
limited to):
* stripping large files (or large directories or large extensions)
* stripping unwanted files by path
* extracting wanted paths and their history (stripping everything else)
* restructuring the file layout (such as moving all files into a
subdirectory in preparation for merging with another repo, making a
subdirectory become the new toplevel directory, or merging two
directories with independent filenames into one directory)
* renaming tags (also often in preparation for merging with another repo)
* replacing or removing sensitive text such as passwords
* making mailmap rewriting of user names or emails permanent
* making grafts or replacement refs permanent
* rewriting commit messages
Additionally, several concerns are handled automatically (many of these
can be overridden, but they are all on by default):
* rewriting (possibly abbreviated) hashes in commit messages to
refer to the new post-rewrite commit hashes
* pruning commits which become empty due to the above filters (also
handles edge cases like pruning of merge commits which become
degenerate and empty)
* creating replace-refs (see linkgit:git-replace[1]) for old commit
hashes, which if pushed and fetched will allow users to continue to
refer to new commits using (unabbreviated) old commit IDs
* stripping of original history to avoid mixing old and new history
* repacking the repository post-rewrite to shrink the repo for the
user
Also, it's worth noting that there is an important safety mechanism:
* abort if run from a repo that is not a fresh clone (to prevent
accidental data loss from rewriting local history that doesn't
exist anywhere else)
For those who know that their history contains large unwanted content
and want help finding it, this command also
* provides an option to analyze a repository and generate reports that
can be useful in determining what to filter (or in determining
whether a separate filtering command was successful).
See also <<VERSATILITY>>, <<DISCUSSION>>, <<EXAMPLES>>, and
<<INTERNALS>>.
OPTIONS
-------
Analysis Options
~~~~~~~~~~~~~~~~
--analyze::
Analyze repository history and create a report that may be
useful in determining what to filter in a subsequent run (or
in determining if a previous filtering command did what you
wanted). Will not modify your repo.
Filtering based on paths (see also --filename-callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--invert-paths::
Invert the selection of files from the specified
--path-{match,glob,regex} options below, i.e. only select
files matching none of those options.
--path-match <dir_or_file>::
--path <dir_or_file>::
Exact paths (files or directories) to include in filtered
history. Multiple --path options can be specified to get a
union of paths.
--path-glob <glob>::
Glob of paths to include in filtered history. Multiple
--path-glob options can be specified to get a union of paths.
--path-regex <regex>::
Regex of paths to include in filtered history. Multiple
--path-regex options can be specified to get a union of paths.
--use-base-name::
Match on file base name instead of full path from the top of
the repo. Incompatible with --path-rename.
Renaming based on paths (see also --filename-callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--path-rename <old_name:new_name>::
--path-rename-match <old_name:new_name>::
Path to rename; if filename or directory matches <old_name>
rename to <new_name>. Multiple --path-rename options can be
specified.
Path shortcuts
~~~~~~~~~~~~~~
--paths-from-file <filename>::
Specify several path filtering and renaming directives, one
per line. Lines with `==>` in them specify path renames, and
lines can begin with `literal:` (the default), `glob:`, or
`regex:` to specify different matching styles.
--subdirectory-filter <directory>::
Only look at history that touches the given subdirectory and
treat that directory as the project root. Equivalent to using
`--path <directory>/ --path-rename <directory>/:`
--to-subdirectory-filter <directory>::
Treat the project root as instead being under
<directory>. Equivalent to using `--path-rename :<directory>/`
Content editing filters (see also --blob-callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--replace-text <expressions_file>::
A file with expressions that, if found, will be replaced. By
default, each expression is treated as literal text, but
`regex:` and `glob:` prefixes are supported. You can end the
line with `==>` and some replacement text to choose a
replacement choice other than the default of `***REMOVED***`.
--strip-blobs-bigger-than <size>::
Strip blobs (files) bigger than the specified size (e.g. `5M`,
`2G`, etc).
--strip-blobs-with-ids <blob_id_filename>::
Read git object ids from each line of the given file, and
strip all of them from history.
Renaming of refs (see also --refname-callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--tag-rename <old:new>::
Rename tags starting with <old> to start with <new>. For example,
--tag-rename foo:bar will rename tag foo-1.2.3 to bar-1.2.3;
either <old> or <new> can be empty.
Filtering of commit messages (see also --message-callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--preserve-commit-hashes::
By default, since commits are rewritten and thus gain new
hashes, references to old commit hashes in commit messages are
replaced with new commit hashes (abbreviated to the same
length as the old reference). Use this flag to turn off
updating commit hashes in commit messages.
--preserve-commit-encoding::
Do not reencode commit messages into UTF-8. By default, if the
commit object specifies an encoding for the commit message,
the message is re-encoded into UTF-8.
Filtering of names & emails (see also --name-callback and --email-callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--mailmap <filename>::
Use specified mailmap file (see linkgit:git-shortlog[1] for details
on the format) when rewriting author, committer, and tagger names
and emails. If the specified file is part of git history,
historical versions of the file will be ignored; only the current
contents are consulted.
--use-mailmap::
Same as: '--mailmap .mailmap'
Parent rewriting
~~~~~~~~~~~~~~~~
--replace-refs {delete-no-add, delete-and-add, update-no-add, update-or-add, update-and-add}::
Replace refs (see linkgit:git-replace[1]) are used to rewrite
parents (unless turned off by the usual git mechanism); this
flag specifies what to do with those refs afterward. Replace
refs can either be deleted or updated to point at new commit
hashes. Also, new replace refs can be added for each commit
rewrite. With 'update-or-add', new replace refs are only
added for commit rewrites that aren't used to update an
existing replace ref. The default is 'update-and-add' if
$GIT_DIR/filter-repo/already_ran does not exist;
'update-or-add' otherwise.
--prune-empty {always, auto, never}::
Whether to prune empty commits. 'auto' (the default) means
only prune commits which become empty (not commits which were
empty in the original repo, unless their parent was
pruned). When the parent of a commit is pruned, the first
non-pruned ancestor becomes the new parent.
--prune-degenerate {always, auto, never}::
Since merge commits are needed for history topology, they are
typically exempt from pruning. However, they can become
degenerate with the pruning of other commits (having fewer
than two parents, having one commit serve as both parents, or
having one parent as the ancestor of the other.) If such merge
commits have no file changes, they can be pruned. The default
('auto') is to only prune empty merge commits which become
degenerate (not which started as such).
Generic callback code snippets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--filename-callback <function_body>::
Python code body for processing filenames; see <<CALLBACKS>>.
--message-callback <function_body>::
Python code body for processing messages (both commit messages and
tag messages); see <<CALLBACKS>>.
--name-callback <function_body>::
Python code body for processing names of people; see <<CALLBACKS>>.
--email-callback <function_body>::
Python code body for processing email addresses; see
<<CALLBACKS>>.
--refname-callback <function_body>::
Python code body for processing refnames; see <<CALLBACKS>>.
--blob-callback <function_body>::
Python code body for processing blob objects; see <<CALLBACKS>>.
--commit-callback <function_body>::
Python code body for processing commit objects; see <<CALLBACKS>>.
--tag-callback <function_body>::
Python code body for processing tag objects; see <<CALLBACKS>>.
--reset-callback <function_body>::
Python code body for processing reset objects; see <<CALLBACKS>>.
Location to filter from/to
~~~~~~~~~~~~~~~~~~~~~~~~~~
NOTE: Specifying alternate source or target locations implies --partial
except that the normal default for --replace-refs is used. However, unlike
normal uses of --partial, this doesn't risk mixing old and new history
since the old and new histories are in different repositories.
--source <source>::
Git repository to read from
--target <target>::
Git repository to overwrite with filtered history
Miscellaneous options
~~~~~~~~~~~~~~~~~~~~~
--help::
-h::
Show a help message and exit.
--force::
-f::
Rewrite history even if the current repo does not look like a fresh
clone.
--partial::
Do a partial history rewrite, resulting in a mixture of old and
new history. This implies a default of update-no-add for
--replace-refs, disables rewriting refs/remotes/origin/* to
refs/heads/*, disables removing of the 'origin' remote, disables
removing unexported refs, disables expiring the reflog, and
disables the automatic post-filter gc. Also, this modifies
--tag-rename and --refname-callback options such that instead of
replacing old refs with new refnames, it will instead create new
refs and keep the old ones around. Use with caution.
--refs <refs+>::
Limit history rewriting to the specified refs. Implies --partial.
In addition to the normal caveats of --partial (mixing old and new
history, no automatic remapping of refs/remotes/origin/* to
refs/heads/*, etc.), this also may cause problems for pruning of
degenerate empty merge commits when negative revisions are
specified.
--dry-run::
Do not change the repository. Run `git fast-export` and filter its
output, and save both the original and the filtered version for
comparison. This also disables rewriting commit messages due to
not knowing new commit IDs and disables filtering of some empty
commits due to inability to query the fast-import backend.
--debug::
Print additional information about operations being performed and
commands being run. (If used together with --dry-run, shows
extra information about what would be run).
--stdin::
Instead of running `git fast-export` and filtering its output,
filter the fast-export stream from stdin. The stdin must be in
the expected input format (e.g. it needs to include original-oid
directives).
--quiet::
Pass --quiet to other git commands called.
[[VERSATILITY]]
VERSATILITY
-----------
filter-repo has a hierarchy of capabilities on the spectrum from easy-to-use
use convenience flags that perform pre-defined types of filtering, to
choices that provide lots of flexibility in controlling how filtering
occurs. This spectrum includes the following:
* Convenience flags making common types of history rewriting simple (e.g.
--path, --strip-blobs-bigger-than, --replace-text, --mailmap)
* Options which are shorthand for others or which provide greater control
than others (e.g. --subdirectory-filter could just be written using
both a path selection (--path) and a path rename (--path-rename)
filter; --paths-from-file can handle all other --path* options and more
such as regex renaming of paths)
* Generic python callbacks for handling a certain type of data (the
filename, message, name, email, and refname callbacks)
* Generic python callbacks for handling fundamental git objects, allowing
greater control over the combination of data types the object holds
(the commit, tag, blob, and reset callbacks)
* The ability to import filter-repo as a module in a python program and
use its classes and functions for even greater control and flexibility
while still leveraging lots of basic capabilities. One can even use
this to write new tools with a completely different interface.
For more information about callbacks, see <<CALLBACKS>>. For examples of
writing python programs that import filter-repo as a module to create new
history rewriting tools, look at the contrib/filter-repo-demos/ directory.
That directory includes, among other examples, a reimplementation of
git-filter-branch which is faster than git-filter-branch, and a
reimplementation of BFG Repo Cleaner with several bug fixes and new
features.
[[DISCUSSION]]
DISCUSSION
----------
Using filter-repo is relatively simple, but rewriting history is part of
a larger discussion in terms of collaboration. When you rewrite
history, the old and new histories are no longer compatible; if you push
this history somewhere for others to view, it will look as though you've
done a rebase of all branches and tags. Make sure you are familiar with
the "RECOVERING FROM UPSTREAM REBASE" section of linkgit:git-rebase[1]
(and in particular, "The hard case") before proceeding, in addition to
this section.
Steps to use git-filter-repo as part of the bigger picture of doing a
history rewrite are roughly as follows:
1. Create a clone of your repository (if you created special refs outside
of refs/heads/ or refs/tags/, make sure to fetch those too). Note
that `--bare` and `--mirror` clones are supported too, if you prefer.
2. (Optional) Run `git filter-repo --analyze`. This will create a
directory of reports mentioning renames that have occurred in your
repo and also listing sizes of objects aggregated by
path/directory/extension/blob-id; this information may be useful in
choosing how to filter your repo. It can also be useful to re-run
--analyze after filtering to verify the changes look correct.
3. Run filter-repo with your desired filtering options. Many examples
are given below. For more complex cases, note that doing the
filtering in multiple steps (by running multiple filter-repo
invocations in a sequence) is supported. If anything goes wrong here,
simply delete your clone and restart.
4. Push your new repository to its new home (note that
refs/remotes/origin/* will have been moved to refs/heads/* as the
first part of filter-repo, so you can just deal with normal branches
instead of remote tracking branches). While you can force push this
to the same URL you cloned from, there are good reasons to consider
pushing to a different location instead:
* People who cloned from the original repo will have old history.
When they fetch the new history you force pushed up, unless they
do a `git reset --hard @{u}` on their branches or rebase their
local work, git will think they have hundreds or thousands of
commits whose messages closely resemble those upstream
(but which include files you wanted excised from history), and
will allow the user to merge the two histories, resulting in what
looks like two copies of each commit. If they then push this
history back up, then everyone now has history with two copies of
each commit and the bad files have returned. You're more likely
to succeed in forcing people to get rid of the old history if
they have to clone a new URL.
* Rewriting history will rewrite tags; those who have already
downloaded tags will not get the updated tags by default (see the
"On Re-tagging" section of linkgit:git-tag[1]). Every user
trying to use an existing clone will have to forcibly delete all
tags and re-fetch them; it may be easier for them to just
re-clone, which they are more likely to do with a new clone URL.
* Rewriting history may delete some refs (e.g. branches that only
had files that you wanted excised from history); unless you run
git push with the `--mirror` or `--prune` options, those refs
will continue to exist on the server. If folks then merge these
branches into others, then people have started mixing old and new
history. If users had already cloned these branches, removing
them from the server isn't enough; you need all users to delete
any local branches based on these refs and run fetch with the
`--prune` option as well. Simply re-cloning from a new URL is
easier.
* The server may not allow you to force push over some refs.
For example, code review systems may have special ref
namespaces (e.g. refs/changes/, refs/pull/,
refs/merge-requests/) that they have locked down.
5. (Optional) Some additional considerations
* filter-repo by default creates replace refs (see
linkgit:git-replace[1]) for each rewritten commit ID, allowing
you to use old (unabbreviated) commit hashes to refer to the
newly rewritten commits. If you want to use these replace refs,
push them to the relevant clone URL and tell users to adjust
their fetch refspec (e.g. `git config --add remote.origin.fetch
+refs/replace/*:refs/replace/*`). Sadly, some existing git servers
(e.g. Gerrit, GitHub) do not yet understand replace refs, and
thus one can't use old commit hashes within their UI; this may
change in the future. But replace refs at least help users
locally within the git CLI.
* If you have a central repo, you may want to prevent people
from pushing old commit IDs, in order to avoid mixing old
and new history. Every repository manager does this
differently: some provide specialized commands
(e.g. https://gerrit-review.googlesource.com/Documentation/cmd-ban-commit.html),
while others require you to write hooks.
[[EXAMPLES]]
EXAMPLES
--------
Path based filtering
~~~~~~~~~~~~~~~~~~~~
To only keep the 'README.md' file plus the directories 'guides' and
'tools/releases/':
--------------------------------------------------
git filter-repo --path README.md --path guides/ --path tools/releases
--------------------------------------------------
Directory names can be given with or without a trailing slash, and all
filenames are relative to the toplevel of the repo. To keep all files
except these paths, just add `--invert-paths`:
--------------------------------------------------
git filter-repo --path README.md --path guides/ --path tools/releases --invert-paths
--------------------------------------------------
If you want to have both an inclusion filter and an exclusion filter, just
run filter-repo multiple times. For example, to keep the src/main
subdirectory but exclude files under src/main named 'data', run:
--------------------------------------------------
git filter-repo --path src/main/
git filter-repo --path-glob 'src/*/data' --invert-paths
--------------------------------------------------
Note that the asterisk (`*`) will match across multiple directories, so the
second command would remove e.g. src/main/org/whatever/data. The
second command by itself would also remove e.g. src/not-main/foo/data, but
since src/not-main/ was removed by the first command, that's not an issue.
Finally, quoting the glob is sometimes important to avoid
glob expansion by the shell.
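
The cross-directory behavior of `*` can be checked with Python's `fnmatch`
module. This is only a sketch: it assumes path globs follow `fnmatch` rules,
and the sample paths are invented for illustration.

```python
from fnmatch import fnmatchcase

pattern = "src/*/data"

# Hypothetical paths, used only to illustrate the matching rules.
paths = [
    "src/main/data",               # one directory between src/ and data
    "src/main/org/whatever/data",  # several directories: '*' spans each '/'
    "src/not-main/foo/data",       # outside src/main, but still matched
    "src/main/data.txt",           # name is not exactly 'data'
]

matches = [p for p in paths if fnmatchcase(p, pattern)]
print(matches)
```

Only the last path is left out of `matches`; the first three all satisfy the
glob, including the one outside src/main.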
You can also select paths by regular expression (see
https://docs.python.org/3/library/re.html#regular-expression-syntax).
For example, to only include files from the repo whose name is in the
format YYYY-MM-DD.txt and is found at least two subdirectories deep:
--------------------------------------------------
git filter-repo --path-regex '^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}\.txt$'
--------------------------------------------------
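
To see which paths such a regex selects, it can be tried against a few
sample names with Python's `re` module. In this sketch the paths are
invented, the dot before `txt` is escaped so that it matches only a literal
period, and the use of `re.search` is an assumption about how the pattern is
applied (the `^` and `$` anchors make `search` and `match` agree here).

```python
import re

pattern = re.compile(r"^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}\.txt$")

# Hypothetical paths, used only to illustrate the matching rules.
candidates = [
    "logs/2019/2019-12-31.txt",  # two directories deep: selected
    "a/b/c/2020-01-05.txt",      # deeper than two levels: selected
    "logs/2019-12-31.txt",       # only one directory deep: not selected
    "2019-12-31.txt",            # toplevel file: not selected
    "logs/2019/notes.txt",       # wrong name format: not selected
]

selected = [p for p in candidates if pattern.search(p)]
print(selected)
```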
If you want two directories to be renamed (and maybe merged if both are
renamed to the same location), use --path-rename; for example, to rename
both 'cmds/' and 'src/scripts/' to 'tools/':
--------------------------------------------------
git filter-repo --path-rename cmds:tools --path-rename src/scripts/:tools/
--------------------------------------------------
As with `--path`, directories can be specified with or without a
trailing slash for `--path-rename`.
If you do a `--path-rename` to something that was already in use, it will
be silently overwritten. However, if you try to rename multiple files to
the same location (e.g. src/scripts/run_release.sh and cmds/run_release.sh
both existed and had different content with the renames above), then you
will be given an error. If you have such a case, you may want to add
another rename command to move one of the paths somewhere else where it
won't collide:
--------------------------------------------------
git filter-repo --path-rename cmds/run_release.sh:tools/do_release.sh \
--path-rename cmds/:tools/ \
--path-rename src/scripts/:tools/
--------------------------------------------------
Also, `--path-rename` brings up ordering issues; all path arguments are
applied in order. Thus, a command like
--------------------------------------------------
git filter-repo --path-rename sources/:src/main/ --path src/main/
--------------------------------------------------
would make sense but reversing the two arguments would not (src/main/ is
created by the rename so reversing the two would give you an empty repo).
Also, note that the rename of cmds/run_release.sh a couple examples ago was
done before the other renames.
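
The effect of argument order can be illustrated with a toy model in Python.
This is not filter-repo's implementation: it assumes at least one `--path`
option is given, treats every argument as a plain directory prefix, and uses
a made-up file name.

```python
def new_name(path, directives):
    """Return the rewritten path, or None if no --path directive selected it."""
    wanted = False
    for kind, arg in directives:
        if kind == "path":
            # The inclusion test sees the name as rewritten so far.
            wanted = wanted or path.startswith(arg)
        elif kind == "path-rename":
            old, new = arg.split(":", 1)
            if path.startswith(old):
                path = new + path[len(old):]
    return path if wanted else None

rename_then_filter = [("path-rename", "sources/:src/main/"), ("path", "src/main/")]
filter_then_rename = [("path", "src/main/"), ("path-rename", "sources/:src/main/")]

print(new_name("sources/app.c", rename_then_filter))  # src/main/app.c
print(new_name("sources/app.c", filter_then_rename))  # None
```

In the second ordering the inclusion test runs while the file is still named
sources/app.c, so nothing is selected.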
If you prefer to filter based solely on basename, use the `--use-base-name`
flag (though this is incompatible with `--path-rename`). For example, to
only include README.md and Makefile files from any directory:
--------------------------------------------------
git filter-repo --use-base-name --path README.md --path Makefile
--------------------------------------------------
If you wanted to delete all .DS_Store files in any directory, you could
either use:
--------------------------------------------------
git filter-repo --invert-paths --path '.DS_Store' --use-base-name
--------------------------------------------------
or
--------------------------------------------------
git filter-repo --invert-paths --path-glob '*/.DS_Store' --path '.DS_Store'
--------------------------------------------------
(the `--path-glob` isn't sufficient by itself as it might miss a toplevel
.DS_Store file; furthermore, while something like `--path-glob '*.DS_Store'`
would work around that problem, it would also grab files named `foo.DS_Store`
or `bar/baz.DS_Store`).
If you have a long list of files, directories, globs, or regular
expressions to filter on, you can stick them in a file and use
`--paths-from-file`; for example, with a file named stuff-i-want.txt with
contents of
--------------------------------------------------
README.md
guides/
tools/releases
glob:*.py
regex:^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}\.txt$
tools/==>scripts/
regex:(.*)/([^/]*)/([^/]*)\.text$==>\2/\1/\3.txt
--------------------------------------------------
then you could run
--------------------------------------------------
git filter-repo --paths-from-file stuff-i-want.txt
--------------------------------------------------
to get a repo containing only the toplevel README.md file, the guides/ and
tools/releases/ directories, all python files, files whose name was of the
form YYYY-MM-DD.txt at least two subdirectories deep, and would rename
tools/ to scripts/ and rename files like foo/bar/baz/bleh.text to
baz/foo/bar/bleh.txt. Note the special line prefixes of `glob:` and
`regex:` and the special string `==>` denoting renames.
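
The regex rename on the last line of that file can be reproduced with
Python's `re.sub`. This sketch assumes the text after `==>` is used as an
ordinary `re` replacement template; the second sample path is invented.

```python
import re

# The rename directive from stuff-i-want.txt, split at '==>'.
pattern = re.compile(r"(.*)/([^/]*)/([^/]*)\.text$")
replacement = r"\2/\1/\3.txt"

renamed = pattern.sub(replacement, "foo/bar/baz/bleh.text")
untouched = pattern.sub(replacement, "docs/readme.text")

print(renamed)    # baz/foo/bar/bleh.txt
print(untouched)  # docs/readme.text (fewer than two directories, so no match)
```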
Finally, see also the `--filename-callback` from <<CALLBACKS>>.
Content based filtering
~~~~~~~~~~~~~~~~~~~~~~~
If you want to filter out all files bigger than a certain size, you can use
`--strip-blobs-bigger-than` with some size (K, M, and G suffixes are
recognized), e.g.:
--------------------------------------------------
git filter-repo --strip-blobs-bigger-than 10M
--------------------------------------------------
If you want to strip out all files with specified git object ids (hashes),
list the hashes in a file and run
--------------------------------------------------
git filter-repo --strip-blobs-with-ids FILE_WITH_GIT_BLOB_IDS
--------------------------------------------------
If you want to modify file contents, you can do so based on a list of
expressions in a file, one per line. For example, with a file named
expressions.txt containing
--------------------------------------------------
p455w0rd
foo==>bar
glob:*666*==>
regex:\bdriver\b==>pilot
literal:MM/DD/YYYY==>YYYY-MM-DD
regex:([0-9]{2})/([0-9]{2})/([0-9]{4})==>\3-\1-\2
--------------------------------------------------
then running
--------------------------------------------------
git filter-repo --replace-text expressions.txt
--------------------------------------------------
will go through and replace `p455w0rd` with `***REMOVED***`, `foo` with
`bar`, any line containing `666` with a blank line, the word `driver` with
`pilot` (but not if it has letters before or after; e.g. `drivers` will be
unmodified), the exact text `MM/DD/YYYY` with `YYYY-MM-DD`, and
date strings of the form MM/DD/YYYY with ones of the form
YYYY-MM-DD. In the expressions file, there are a few things to note:
* Every line has a replacement, given by whatever is on the right of
`==>`. If `==>` does not appear on the line, the default replacement
is `***REMOVED***`.
* Lines can start with `literal:`, `glob:`, or `regex:` to specify
whether to do literal string matches,
globs (see https://docs.python.org/3/library/fnmatch.html), or regular
expressions (see https://docs.python.org/3/library/re.html#regular-expression-syntax).
If none of these are specified, `literal:` is assumed.
* globs and regexes are applied to each line of the file; it is not
possible with --replace-text to match a multi-line string.
* If multiple matches are found on a line, all are replaced.
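
The rules in the list above can be modeled in a few lines of Python. This is
a simplified sketch rather than filter-repo's own code: `parse_rule` and
`apply_rules` are hypothetical helpers, `glob:` lines are not handled, rules
are applied strictly in the order given, and the sample text is made up.

```python
import re

DEFAULT_REPLACEMENT = "***REMOVED***"

def parse_rule(line):
    """Turn one expressions-file line into (compiled_pattern, replacement)."""
    pattern, arrow, replacement = line.partition("==>")
    if not arrow:
        replacement = DEFAULT_REPLACEMENT
    if pattern.startswith("regex:"):
        # Regex rules keep backreferences such as \1 in the replacement.
        return re.compile(pattern[len("regex:"):]), replacement
    if pattern.startswith("literal:"):
        pattern = pattern[len("literal:"):]
    # Literal rules match verbatim, and their replacement is inserted verbatim.
    return re.compile(re.escape(pattern)), replacement.replace("\\", "\\\\")

def apply_rules(rules, text):
    """Apply every rule in order; all matches on the line are replaced."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

rules = [parse_rule(line) for line in [
    "p455w0rd",
    "foo==>bar",
    r"regex:\bdriver\b==>pilot",
    "literal:MM/DD/YYYY==>YYYY-MM-DD",
    r"regex:([0-9]{2})/([0-9]{2})/([0-9]{4})==>\3-\1-\2",
]]

sample = "p455w0rd: the driver told two drivers the MM/DD/YYYY date was 12/31/2019"
result = apply_rules(rules, sample)
print(result)
```

For the sample line, the password is masked, `driver` becomes `pilot` while
`drivers` is left alone, and both the literal format string and the actual
date are rewritten.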
See also the `--blob-callback` from <<CALLBACKS>>.
Refname based filtering
~~~~~~~~~~~~~~~~~~~~~~~
To rename tags, use `--tag-rename`, e.g.:
--------------------------------------------------
git filter-repo --tag-rename foo:bar
--------------------------------------------------
This will rename any tags starting with `foo` to now start with `bar`.
Either side of the colon could be blank, e.g.
--------------------------------------------------
git filter-repo --tag-rename '':'my-module-'
--------------------------------------------------
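
The prefix semantics of `--tag-rename` can be summarized by a small Python
function. `rename_tag` is a hypothetical helper written for illustration,
not part of filter-repo, and the tag names are examples.

```python
def rename_tag(tag, old, new):
    """Replace a leading `old` with `new`; tags without that prefix are kept."""
    if tag.startswith(old):
        return new + tag[len(old):]
    return tag

print(rename_tag("foo-1.2.3", "foo", "bar"))    # bar-1.2.3
print(rename_tag("v1.0", "", "my-module-"))     # my-module-v1.0
print(rename_tag("release-2", "foo", "bar"))    # release-2
```

An empty `old` matches every tag, which is why the second command above
prepends `my-module-` to all of them.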
For more general refname modification, see `--refname-callback` from
<<CALLBACKS>>.
User and email based filtering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To modify username and emails of commits, you can create a mailmap
file in the format accepted by linkgit:git-shortlog[1]. For example,
if you have a file named my-mailmap you can run
--------------------------------------------------
git filter-repo --mailmap my-mailmap
--------------------------------------------------
and if the current contents of that file are as follows (if the
specified mailmap file is version controlled, historical versions of
the file are ignored):
--------------------------------------------------
Name For User <email@addre.ss>
<new@ema.il> <old1@ema.il>
New Name And <new@ema.il> <old2@ema.il>
New Name And <new@ema.il> Old Name And <old3@ema.il>
--------------------------------------------------
then user names and/or emails will be updated according to the specified
mapping.
See also the `--name-callback` and `--email-callback` from
<<CALLBACKS>>.
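To make the four line forms in the example concrete, here is a simplified
Python sketch of how such a file could be read and applied. It is not how
git or filter-repo parse mailmaps: `parse_mailmap` and `apply_mailmap`
are hypothetical helpers, the first matching rule wins, and git's
precedence rules between entries are ignored.

```python
import re

# "[Proper Name] <email>" optionally followed by "[Commit Name] <commit email>"
LINE_RE = re.compile(
    rb"^\s*([^<]*?)\s*<([^>]*)>(?:\s*([^<]*?)\s*<([^>]*)>)?\s*$")

def parse_mailmap(text):
    """Return a list of (old_name, old_email, new_name, new_email) rules."""
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith(b"#"):
            continue  # skip blank lines and comments
        match = LINE_RE.match(line)
        if not match:
            continue
        name1, email1, name2, email2 = match.groups()
        if email2 is None:
            # "Name <email>": set the name for anyone using that email
            rules.append((None, email1, name1 or None, None))
        else:
            # Two addresses: the second one is matched, the first is written
            rules.append((name2 or None, email2, name1 or None, email1))
    return rules

def apply_mailmap(name, email, rules):
    """Return the (name, email) pair after applying the first matching rule."""
    for old_name, old_email, new_name, new_email in rules:
        if email.lower() != old_email.lower():
            continue
        if old_name is not None and old_name.lower() != name.lower():
            continue
        return (new_name or name, new_email or email)
    return (name, email)

rules = parse_mailmap(b"""\
Name For User <email@addre.ss>
<new@ema.il> <old1@ema.il>
New Name And <new@ema.il> <old2@ema.il>
New Name And <new@ema.il> Old Name And <old3@ema.il>
""")
print(apply_mailmap(b"Some Nick", b"old2@ema.il", rules))
```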
Parent rewriting
~~~~~~~~~~~~~~~~
To replace $commit_A with $commit_B (i.e. make all commits which had
$commit_A as a parent instead have $commit_B for that parent), and
rewrite history to make it permanent:
--------------------------------------------------
git replace $commit_A $commit_B
git filter-repo --force
--------------------------------------------------
To create a new commit with the same contents as $commit_A except with
different parent(s) and then replace $commit_A with the new commit,
and rewrite history to make it permanent:
--------------------------------------------------
git replace --graft $commit_A $new_parent_or_parents
git filter-repo --force
--------------------------------------------------
Specifying `--force` is needed for two reasons: filter-repo will error
out if no arguments are specified, and the new graft commit would
otherwise trigger the not-a-fresh-clone check.
Partial history rewrites
~~~~~~~~~~~~~~~~~~~~~~~~
To rewrite the history on just one branch (which may cause it to no longer
share any common history with other branches), use `--refs`. For example,
to remove a file named 'extraneous.txt' from the 'master' branch:
--------------------------------------------------
git filter-repo --invert-paths --path extraneous.txt --refs master
--------------------------------------------------
To rewrite just some recent commits:
--------------------------------------------------
git filter-repo --invert-paths --path extraneous.txt --refs master~3..master
--------------------------------------------------
[[CALLBACKS]]
CALLBACKS
---------
For flexibility, filter-repo allows you to specify functions on the
command line to further filter all changes. Please note that there
are some API compatibility caveats associated with these callbacks
that you should be aware of before using them; see the "API BACKWARD
COMPATIBILITY CAVEAT" comment near the top of git-filter-repo source
code.
All callback functions are of the same general format. For a command line
argument like
--------------------------------------------------
--foo-callback 'BODY'
--------------------------------------------------
the following code will be compiled and called:
--------------------------------------------------
def foo_callback(foo):
  BODY
--------------------------------------------------
Thus, you just need to make sure your _BODY_ modifies and returns
_foo_ appropriately. One important thing to note for all callbacks is
that filter-repo uses bytestrings (see
https://docs.python.org/3/library/stdtypes.html#bytes) everywhere
instead of strings.
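One way to picture this compilation step is the following Python sketch.
It only illustrates the idea and is not taken from the filter-repo
source: `make_callback` is a hypothetical helper, and exposing `os` and
`re` to the body is an assumption based on the examples in this section.

```python
import os
import re
import textwrap

def make_callback(kind, body):
    """Wrap BODY in 'def <kind>_callback(<kind>):' and return the function."""
    source = "def {0}_callback({0}):\n{1}\n".format(
        kind, textwrap.indent(body, "  "))
    # Modules the body may use (assumed; the examples rely on os and re).
    namespace = {"os": os, "re": re}
    exec(source, namespace)
    return namespace[kind + "_callback"]

name_callback = make_callback(
    "name", 'return name.replace(b"Wiliam", b"William")')

# Callbacks receive and return bytestrings, not str.
print(name_callback(b"Wiliam Smith"))
```

Building the function this way also gives a convenient means of trying a
callback body on sample bytestrings before running it on a repository.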
There are four callbacks that allow you to operate directly on raw
objects that contain data that's easy to write in
linkgit:git-fast-import[1] format:
--------------------------------------------------
--blob-callback
--commit-callback
--tag-callback
--reset-callback
--------------------------------------------------
We'll come back to these later because it is often the case that the
other callbacks are more convenient. The other callbacks operate on a
small piece of the raw objects or operate on pieces across multiple
types of raw object (e.g. author names and committer names and tagger
names across commits and tags, or refnames across commits, tags, and
resets, or messages across commits and tags). The convenience
callbacks are:
--------------------------------------------------
--filename-callback
--message-callback
--name-callback
--email-callback
--refname-callback
--------------------------------------------------
in each you are expected to simply return a new value based on the one
passed in. For example,
--------------------------------------------------
git-filter-repo --name-callback 'return name.replace(b"Wiliam", b"William")'
--------------------------------------------------
would result in the following function being called:
--------------------------------------------------
def name_callback(name):
  return name.replace(b"Wiliam", b"William")
--------------------------------------------------
The email callback is quite similar:
--------------------------------------------------
git-filter-repo --email-callback 'return email.replace(b".cm", b".com")'
--------------------------------------------------
The refname callback is also similar, but note that the refname passed in
and returned are expected to be fully qualified (e.g. b"refs/heads/master"
instead of just b"master" and b"refs/tags/v1.0.7" instead of b"1.0.7"):
--------------------------------------------------
git-filter-repo --refname-callback '
  # Change e.g. refs/heads/master to refs/heads/prefix-master
  rdir,rpath = os.path.split(refname)
  return rdir + b"/prefix-" + rpath'
--------------------------------------------------
The message callback is quite similar to the previous three callbacks,
though it operates on a bytestring that is likely more than one line:
--------------------------------------------------
git-filter-repo --message-callback '
  if b"Signed-off-by:" not in message:
    message += b"\nSigned-off-by: Me My <self@and.eye>"
  return re.sub(b"[Ee]-?[Mm][Aa][Ii][Ll]", b"email", message)'
--------------------------------------------------
The filename callback is slightly more interesting. Returning None means
the file should be removed from all commits, returning the filename
unmodified marks the file to be kept, and returning a different name means
the file should be renamed. An example:
--------------------------------------------------
git-filter-repo --filename-callback '
  if b"/src/" in filename:
    # Remove all files with a directory named "src" in their path
    # (except when "src" appears at the toplevel).
    return None
  elif filename.startswith(b"tools/"):
    # Rename tools/ -> scripts/misc/
    return b"scripts/misc/" + filename[6:]
  else:
    # Keep the filename and do not rename it
    return filename
  '
--------------------------------------------------
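Since a callback body is ordinary Python, the three possible outcomes
(remove, rename, keep) can be checked outside of filter-repo. The sketch
below wraps the body from the example above in a plain function; the
sample paths are made up for illustration.

```python
def filename_callback(filename):
    if b"/src/" in filename:
        # Removed: some non-toplevel directory is named "src"
        return None
    elif filename.startswith(b"tools/"):
        # Renamed: tools/ -> scripts/misc/
        return b"scripts/misc/" + filename[6:]
    else:
        # Kept as-is
        return filename

for path in [b"lib/src/a.c", b"src/a.c", b"tools/build.sh", b"README"]:
    print(path, "->", filename_callback(path))
```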
In contrast, the blob, reset, tag, and commit callbacks are not
expected to return a value, but are instead expected to modify the
object passed in. Major fields for these objects are (subject to API
backward compatibility caveats mentioned previously):
* Blob: `original_id` (original hash) and `data`
* Reset: `ref` (name of reference) and `from_ref` (hash or integer mark)
* Tag: `ref`, `from_ref`, `original_id`, `tagger_name`, `tagger_email`,
`tagger_date`, `message`
* Commit: `branch`, `original_id`, `author_name`, `author_email`,
`author_date`, `committer_name`, `committer_email`,
`committer_date`, `message`, `file_changes` (list of
FileChange objects, each containing a `type`, `filename`,
`mode`, and `blob_id`), `parents` (list of hashes or integer
marks)
An example of each:
--------------------------------------------------
git filter-repo --blob-callback '
  if len(blob.data) > 25:
    # Mark this blob for removal from all commits
    blob.skip()
  else:
    blob.data = blob.data.replace(b"Hello", b"Goodbye")
  '
--------------------------------------------------
--------------------------------------------------
git filter-repo --reset-callback 'reset.ref = reset.ref.replace(b"master", b"dev")'
--------------------------------------------------
--------------------------------------------------
git filter-repo --tag-callback '
  if tag.tagger_name == b"Jim Williams":
    # Omit this tag
    tag.skip()
  else:
    tag.message = tag.message + b"\n\nTag of %s by %s on %s" % (tag.ref, tag.tagger_email, tag.tagger_date)'
--------------------------------------------------
--------------------------------------------------
git filter-repo --commit-callback '
  # Remove executable files with three 6s in their name (including
  # from leading directories).
  # Also, undo deletion of sources/foo/bar.txt (change types are
  # either b"D" (deletion) or b"M" (add or modify); renames are
  # handled by deleting the old file and adding a new one)
  commit.file_changes = [
         change for change in commit.file_changes
         if not (change.mode == b"100755" and
                 change.filename.count(b"6") == 3) and
            not (change.type == b"D" and
                 change.filename == b"sources/foo/bar.txt")]
  # Mark all .sh files as executable; modes in git are always one of
  # 100644 (normal file), 100755 (executable), 120000 (symlink), or
  # 160000 (submodule)
  for change in commit.file_changes:
    if change.filename.endswith(b".sh"):
      change.mode = b"100755"
  '
--------------------------------------------------
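Because these callbacks modify the object in place rather than returning
a value, it can help to try the logic on stand-in objects first. The
sketch below runs the commit callback body from above against
`types.SimpleNamespace` objects; these stand-ins and the `fake_change`
helper are hypothetical and only carry the fields listed earlier, they
are not filter-repo's real Commit and FileChange classes.

```python
from types import SimpleNamespace

def commit_callback(commit):
    # Drop executable files with three 6s in their path, and drop the
    # deletion of sources/foo/bar.txt (which undoes that deletion).
    commit.file_changes = [
        change for change in commit.file_changes
        if not (change.mode == b"100755" and
                change.filename.count(b"6") == 3) and
           not (change.type == b"D" and
                change.filename == b"sources/foo/bar.txt")]
    # Mark all .sh files as executable.
    for change in commit.file_changes:
        if change.filename.endswith(b".sh"):
            change.mode = b"100755"

def fake_change(change_type, filename, mode=None):
    """Hypothetical stand-in for a FileChange object."""
    return SimpleNamespace(type=change_type, filename=filename,
                           mode=mode, blob_id=None)

commit = SimpleNamespace(file_changes=[
    fake_change(b"M", b"bin/666", b"100755"),
    fake_change(b"D", b"sources/foo/bar.txt"),
    fake_change(b"M", b"run.sh", b"100644"),
    fake_change(b"M", b"README", b"100644"),
])
commit_callback(commit)  # returns nothing; commit itself is changed
for change in commit.file_changes:
    print(change.type, change.filename, change.mode)
```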
[[INTERNALS]]
INTERNALS
---------
You probably don't need to read this section unless you are just very
curious or you are trying to do a very complex history rewrite.
How filter-repo works
~~~~~~~~~~~~~~~~~~~~~
Roughly, filter-repo works by running
--------------------------------------------------
git fast-export <options> | filter | git fast-import <options>
--------------------------------------------------
where filter-repo not only launches the whole pipeline but also serves as
the _filter_ in the middle. However, filter-repo does a few additional
things on top in order to make it into a well-rounded filtering tool. A
sequence that more accurately reflects what filter-repo runs is:
1. Verify we're in a fresh clone
2. `git fetch -u . refs/remotes/origin/*:refs/heads/*`
3. `git remote rm origin`
4. `git fast-export --show-original-ids --reference-excluded-parents --fake-missing-tagger --signed-tags=strip --tag-of-filtered-object=rewrite --use-done-feature --no-data --reencode=yes --mark-tags --all | filter | git fast-import --force --quiet`
5. `git update-ref --no-deref --stdin`, fed with a list of refs to nuke, and a list of replace refs to delete, create, or update.
6. `git reset --hard`
7. `git reflog expire --expire=now --all`
8. `git gc --prune=now`
Some notes or exceptions on each of the above:
1. If we're not in a fresh clone, users will not be able to recover if
they used the wrong command or ran in the wrong repo. (Though
`--force` overrides this check, and it's also off if you've already
run filter-repo once in this repo.)
2. Technically, we actually use a `git update-ref` command fed with a lot
of input due to the fact that users can use `--force` when local
branches might not match remote branches. But this fetch command
catches the intent rather succinctly.
3. We don't want users accidentally pushing back to the original repo, as
discussed in <<DISCUSSION>>. It also reminds users that since history
has been rewritten, this repo is no longer compatible with the
original. Finally, another minor benefit is this allows users to push
with the `--mirror` option to their new home without accidentally
sending remote tracking branches.
4. Some of these flags are always used but others are actually
conditional. For example, filter-repo's `--replace-text` and
`--blob-callback` options need to work on blobs so `--no-data` cannot
be passed to fast-export. But when we don't need to work on blobs,
passing `--no-data` speeds things up. Other flags (e.g. `--dry-run`
and `--debug`) may also change the structure of the pipeline.
5. We use this step to write replace refs for accessing the newly written
commit hashes using their previous names. Also, if refs were renamed
by various steps, we need to delete the old refnames in order to avoid
mixing old and new history.
6. Users also have old versions of files in their working tree and index;
we want those cleaned up to match the rewritten history as well. Note
that this step is skipped in bare repos.
7. Reflogs will hold on to old history, so we need to expire them.
8. We need to gc to avoid mixing new and old history. Also, it shrinks
the repository for users, so they don't have to do extra work. (Odds
are that they've only rewritten trees and commits and maybe a few
blobs, so `--aggressive` isn't needed and would be too slow.)
Information about these steps is printed out when `--debug` is passed
to filter-repo. When doing a `--partial` history rewrite, steps 2, 3,
7, and 8 are unconditionally skipped, step 5 is skipped if
`--replace-refs` is `update-no-add`, and just the nuke-unused-refs
portion of step 5 is skipped if `--replace-refs` is something else.
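The conditional use of `--no-data` described in note 4 can be sketched as
follows. The function `fast_export_args` and its `needs_blobs` parameter
are hypothetical; the flags themselves are copied from step 4 above, and
the real tool may choose them differently.

```python
def fast_export_args(needs_blobs, refs=("--all",)):
    """Build a fast-export command line along the lines of step 4."""
    args = [
        "git", "fast-export",
        "--show-original-ids",
        "--reference-excluded-parents",
        "--fake-missing-tagger",
        "--signed-tags=strip",
        "--tag-of-filtered-object=rewrite",
        "--use-done-feature",
    ]
    if not needs_blobs:
        # Skipping blob contents is faster, but is only possible when no
        # option such as --replace-text or --blob-callback needs them.
        args.append("--no-data")
    args += ["--reencode=yes", "--mark-tags"]
    args += list(refs)
    return args

print(" ".join(fast_export_args(needs_blobs=False)))
print(" ".join(fast_export_args(needs_blobs=True)))
```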
Limitations
~~~~~~~~~~~
Inherited limitations
^^^^^^^^^^^^^^^^^^^^^
Since git filter-repo calls fast-export and fast-import to do a lot of the
heavy lifting, it inherits limitations from those systems: