Improve sync performance for pull-mirrors #19125
I think tests for |
|
#19235 might make this unnecessary. In particular the call to |
I haven't studied the referenced PR in detail. The main merit of this PR is that it, for pull-mirrors, reduces the number of |
I did take #19235 for a spin, and the mirror cloning/sync does indeed look better with that PR (nice work!), but it still does a number of I tried a few of the repos benchmarked in the description of this PR and noted that, although the situation is better after #19235, this PR still gives an additional improvement of roughly a factor of six. So for the case of pull-mirrors it feels like there is still quite a lot to be gained, both in time and in system resources.
For large repositories with many tags, SyncReleasesWithTags can be a costly operation (taking several minutes to complete). The reason is two-fold:

1. on sync, every upstream repo tag is compared (for changes) against existing local entries in the release table to ensure that they are up-to-date.
2. the procedure for getting each tag involves several git operations

       git show-ref --tags -- v8.2.4477
       git cat-file -t 29ab6ce9f36660cffaad3c8789e71162e5db5d2f
       git cat-file -p 29ab6ce9f36660cffaad3c8789e71162e5db5d2f
       git rev-list --count 29ab6ce9f36660cffaad3c8789e71162e5db5d2f

   of which the 'git rev-list --count' can be particularly heavy.

This commit optimizes performance for pull-mirrors. We utilize the fact that a pull-mirror is always identical to its upstream and rebuild the entire release table on every sync and use a batch 'git for-each-ref .. refs/tags' call to retrieve all tags in one go.

For large mirror repos, with hundreds of annotated tags, this brings down the duration of the sync operation from several minutes to a few seconds.

Signed-off-by: Peter Gardfjäll <peter.gardfjall.work@gmail.com>
b10d9f3 to ef6352c
Very cool. Just some small questions from my side.
Generally looks good to me.

The only nit is that the `return len(data), bytes.TrimSpace(data), nil` is really strange.

If it is a must and the test case is correct (which means git has strange behaviors), we need a clear comment here to describe why the `TrimSpace` is needed.
Codecov Report
@@            Coverage Diff             @@
##             main   #19125      +/-   ##
==========================================
+ Coverage   47.50%   47.56%   +0.05%
==========================================
  Files         931      934       +3
  Lines      130513   130681     +168
==========================================
+ Hits        61998    62152     +154
- Misses      61043    61056      +13
- Partials     7472     7473       +1
* giteaoffical/main:
  Fix broken of team create (go-gitea#19288)
  Remove `git.Command.Run` and `git.Command.RunInDir*` (go-gitea#19280)
  Performance improvement for add team user when org has more than 1000 repositories (go-gitea#19227)
  [skip ci] Updated translations via Crowdin
  Update JS dependencies (go-gitea#19281)
  Fix container download counter (go-gitea#19287)
  go.mod: update kevinburke/ssh_config to v1.2.0 (go-gitea#19286)
  Fix global packages enabled avaiable (go-gitea#19276)
  Add Goroutine stack inspector to admin/monitor (go-gitea#19207)
  Move checks for pulls before merge into own function (go-gitea#19271)
  Restore user autoregistration with email addresses (go-gitea#19261)
  Improve sync performance for pull-mirrors (go-gitea#19125)
  Refactor `git.Command.Run*`, introduce `RunWithContextString` and `RunWithContextBytes` (go-gitea#19266)
  Move reaction to models/issues/ (go-gitea#19264)
This PR addresses #18352.

It aims to improve performance (and resource use) of the `SyncReleasesWithTags` operation for pull-mirrors.

For large repositories with many tags, `SyncReleasesWithTags` can be a costly operation (taking several minutes to complete). The reason is two-fold:

1. On sync, every upstream repo tag is compared (for changes) against existing local entries in the release table to ensure that they are up-to-date.
2. The procedure for getting each tag involves a series of git operations, of which `git rev-list --count` can be particularly heavy.

This PR optimizes performance for pull-mirrors. Because a pull-mirror is always identical to its upstream, we rebuild the entire release table on every sync and use a batch `git for-each-ref .. refs/tags` call to retrieve all tags in one go.

For large mirror repos, with hundreds of annotated tags, this brings down the duration of the sync operation from several minutes to a few seconds. A few unscientific examples run on my local machine:
- 0m28,673s before, 0m2,244s after
- 8m00s before, 0m8,520s after
- 14m20,383s before, 0m35,467s after
I added a `foreachref` package which contains a flexible way of specifying which reference fields are of interest (see `git-for-each-ref(1)`) and of producing a parser for the expected output. These could be reused in other places where `for-each-ref` is used. I'll add unit tests for those if the overall PR looks promising.
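To give reviewers a rough idea of the approach, here is a minimal sketch of the two halves: building a `--format` argument from a list of field names, and parsing the batch output back into per-ref values. It is not the actual `foreachref` API. The names `buildFormat` and `parseRefs` are hypothetical, and the sketch assumes every requested field is single-line, so that the newline git writes after each ref can serve as the record separator. It relies on `git for-each-ref` expanding `%00` in a format string to a NUL byte.

```go
package main

import (
	"fmt"
	"strings"
)

// fieldSep separates fields within one record. In a for-each-ref format
// string it is written as %00, which git expands to a NUL byte.
const fieldSep = "\x00"

// buildFormat turns field names such as "refname:short" into a --format
// argument such as "%(refname:short)%00%(objectname)".
func buildFormat(fields []string) string {
	parts := make([]string, len(fields))
	for i, f := range fields {
		parts[i] = "%(" + f + ")"
	}
	return strings.Join(parts, "%00")
}

// parseRefs splits for-each-ref output produced with buildFormat(fields)
// into one map per ref, keyed by field name. It assumes single-line fields.
func parseRefs(output string, fields []string) ([]map[string]string, error) {
	var refs []map[string]string
	for _, line := range strings.Split(strings.TrimRight(output, "\n"), "\n") {
		if line == "" {
			continue
		}
		values := strings.Split(line, fieldSep)
		if len(values) != len(fields) {
			return nil, fmt.Errorf("expected %d fields, got %d in %q", len(fields), len(values), line)
		}
		ref := make(map[string]string, len(fields))
		for i, f := range fields {
			ref[f] = values[i]
		}
		refs = append(refs, ref)
	}
	return refs, nil
}

func main() {
	fields := []string{"refname:short", "objectname", "objecttype"}
	fmt.Println("git for-each-ref --format=" + buildFormat(fields) + " refs/tags")

	// Hand-written sample standing in for real git output.
	sample := "v1.0\x00aaa111\x00tag\nv1.1\x00bbb222\x00commit\n"
	refs, err := parseRefs(sample, fields)
	if err != nil {
		panic(err)
	}
	for _, r := range refs {
		fmt.Println(r["refname:short"], r["objectname"], r["objecttype"])
	}
}
```

The point of the batch form is that one `git for-each-ref` invocation replaces the per-tag sequence of git calls. Multi-line fields such as tag message bodies would need a dedicated record delimiter instead of the newline.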