Fetching data for huge GitHub repositories takes very long #57
...and once it is done performance is bad.
Original title: Restrict fetched pull reqs/issues by criteria?
I've just gotten Forge working with my workplace's main repo (which sits on a GitHub Enterprise instance; there's only a minor thing in ghub that I had to advise to make that work - will file an issue there). Now that it's done fetching all pull requests and issues (which took 6 hours, 3 of which Emacs spent blocked while it updated the SQLite db), I have a 600MB SQLite file full of old and outdated issues/pull requests; there are only about three dozen relevant ones on this repo right now (we create PRs for small changes and they typically live for a week).
Magit-status performance on this repo has also deteriorated noticeably: updating topics blocks my Emacs for a minute (even if only ~3 PRs changed), and hitting RET on an open topic also takes about half a minute.
So that makes me wonder whether it's possible to come up with criteria for which topics to sync. If I understand correctly, the GraphQL endpoint supports queries, so I would like to supply my own, like:
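Something along these lines, perhaps; this is a hypothetical sketch using ghub's `ghub-graphql` against GitHub's GraphQL search API (which accepts the same qualifiers as the web UI), with "owner/name" as a placeholder:

```elisp
;; Hypothetical sketch: restrict fetched topics via GitHub's GraphQL
;; search API.  "owner/name" is a placeholder; Forge does not actually
;; issue this query.
(ghub-graphql
 "query {
    search(query: \"repo:owner/name is:pr updated:>2018-11-01\",
           type: ISSUE, first: 100) {
      nodes { ... on PullRequest { number title updatedAt } }
    }
  }")
```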
I imagine this might cut down on the amount of (dead) data forge has to store for that repo and make it faster to use for the things I tend to do day-to-day.
A careful redefinition of
I don't know how we'd fix the exceptionally long initialization time though (aside from making the SQLite updates non-blocking). The only thing I can think of is to set a lower bound as some sort of
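For illustration, such a lower bound could be exposed as a user option; a minimal sketch (the name and default are made up, nothing like it exists in Forge):

```elisp
;; Hypothetical user option for a lower bound on the initial pull.
(defcustom my/forge-initial-pull-since "2018-11-01T00:00:00Z"
  "Only fetch topics updated at or after this time during the first pull."
  :type 'string
  :group 'forge)
```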
I'm impressed GitHub can handle that. My company uses a home-grown system with a few million (yes.) such records -- I wonder if we'd ever move to something like GitHub if it can indeed scale like that.
Definitely fetching/storing over 100000 records (even 10000) in the foreground seems like a big deal -- at least without some sort of are-you-sure confirmation.
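A minimal sketch of such a confirmation, assuming the record count can be estimated up front:

```elisp
;; Hypothetical guard, not Forge code: ask before a large foreground fetch.
(defun my/confirm-large-fetch (count)
  "Return non-nil if fetching COUNT records in the foreground is okay."
  (or (< count 10000)
      (yes-or-no-p (format "Fetch %d topics in the foreground? " count))))
```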
It's a (fairly beefy) GitHub Enterprise installation (which has its own problems: they run on a single instance, so you can only scale them vertically); I'm not sure I want to see something like this run on public GitHub (-:
One thing that made this import so heavy on my box was that Emacs really was blocked for about 2-3 hours on inserting into SQLite; if that could be interleaved somehow (insert a batch every 1000 fetched elements?), that would probably improve responsiveness a lot, even though it'll likely still cause Emacs to block for a short period every so often.
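A sketch of that interleaving idea, assuming nothing about Forge's internals (`insert-fn` stands in for whatever stores a single topic):

```elisp
(require 'seq)
(require 'emacsql)

;; Hypothetical batching helper: commit every BATCH-SIZE rows in its own
;; transaction and let Emacs process input in between, instead of one
;; giant blocking insert at the end.
(defun my/store-topics-in-batches (db topics insert-fn &optional batch-size)
  "Insert TOPICS into DB via INSERT-FN, committing every BATCH-SIZE rows."
  (let ((batch-size (or batch-size 1000)))
    (while topics
      (emacsql-with-transaction db
        (dolist (topic (seq-take topics batch-size))
          (funcall insert-fn db topic)))
      (setq topics (seq-drop topics batch-size))
      ;; Give Emacs a chance to redisplay and handle input.
      (sit-for 0))))
```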
That's probably because some data is being fetched from scratch every time. Basically everything in the
The problem here is that all available topics are being retrieved from the database and massaged a bit, even though we know that all of that will just be discarded anyway. That doesn't matter for a tiny repository, but starting with a repository of about Magit's size it leads to an annoying hang.
Just like Magit's, many of Forge's commands use
It might be possible to fix that by handing over a function that returns the list of candidates, instead of the candidates themselves, to delay that code from being evaluated until we know that we have to. But then we have to decide whether
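For illustration, the stock `completion-table-dynamic` supports exactly that pattern; a sketch with a stand-in for the expensive database query:

```elisp
;; Sketch of deferring candidate computation until completion happens.
(defun my/expensive-topic-list ()
  "Stand-in for the costly database query that currently runs eagerly."
  (list "#1 Fix the frobnicator" "#2 Update the docs"))

(defun my/read-topic (prompt)
  "Read a topic with PROMPT, computing candidates lazily."
  (completing-read
   prompt
   (completion-table-dynamic
    (lambda (_input)
      ;; Only evaluated once the user actually starts completing.
      (my/expensive-topic-list)))))
```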
I'll probably add a new issue about that.
I wasn't really expecting Forge to behave well in such cases.
Forge is already doing that, more or less. It appears that there is a bug somewhere. Also see #6.
Making all the data available locally is a major design decision I made for Forge. Given such a large repository that obviously doesn't work well, and we should provide a workaround, but there is a limit to how fancy that can get. Basically, we can forgo fetching topics that haven't been updated since a certain date.
We already do that for the second and subsequent times the user pulls data for a given repository. The hack that we could use for humongous repositories is to allow the user to explicitly set that date before doing the first pull.
(reposting from the correct account)
I really like the idea of restricting the time range on topic updates; being able to initially fetch only the last 2 months of topics would still cover all the topics that are relevant to my day-to-day, but would cut down the number of topics that need to be fetched and stored by a lot.
I tried it, and it works:
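Something along these lines, presumably; this is a sketch rather than the exact snippet, and using the repository object's `updated` slot as the cutoff is an assumption about Forge's internals:

```elisp
;; A sketch, not the exact snippet: set the cutoff date on the
;; repository object before the first pull.  That the `updated' slot
;; serves as the "fetch since" cutoff is an assumption.
(let ((repo (forge-get-repository t)))        ; repository at point
  (oset repo updated "2018-11-01T00:00:00Z")  ; skip topics older than this
  (forge-pull))
```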
That fetched about 200 pages of topics in 10 minutes and created a 50MB .sqlite file. It looks great now!
With some slightly better ergonomics around setting that date (prefix arg to the initial fetch command?), I think it would 100% resolve my issues (-:
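Such ergonomics might look something like this sketch: a wrapper command (made up, not part of Forge) that reads the cutoff date when given a prefix argument:

```elisp
(require 'forge)

;; Hypothetical wrapper, not a Forge command: with a prefix argument,
;; prompt for the cutoff date before the initial pull.  The `updated'
;; slot as the cutoff is again an assumption about Forge's internals.
(defun my/forge-pull-since (&optional since)
  "Pull topics; with a prefix argument, prompt for a cutoff date SINCE."
  (interactive
   (list (and current-prefix-arg
              (read-string "Fetch topics updated since: "
                           "2019-01-01T00:00:00Z"))))
  (when since
    (oset (forge-get-repository t) updated since))
  (forge-pull))
```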