Fetching data for huge GitHub repositories takes very long #57

Open
antifuchs opened this Issue Jan 2, 2019 · 11 comments

antifuchs commented Jan 2, 2019

...and once it is done, performance is bad.


Original title: Restrict fetched pull reqs/issues by criteria?

I've just gotten Forge working with my workplace's main repo, which sits on a GH:EE instance (there was only a minor thing in ghub that I had to advise around to make that work; I'll file an issue there). Now that it's done fetching all pull requests and issues (which took 6 hours, 3 of which Emacs spent blocked while updating the SQLite db), I have a 600MB SQLite file full of old and outdated issues/pull requests. There are only about 3 dozen relevant ones on this repo right now (we create PRs for small changes and they typically live for about a week).

Magit-status performance on this repo has also deteriorated a little: updating topics blocks Emacs for a minute (even if only ~3 PRs changed), and hitting RET on an open topic also takes about half a minute.

So that makes me wonder if it's possible to come up with criteria for topics to sync. If I understand correctly, the GraphQL endpoint supports queries, so I would like to come up with my own, like (see the sketch below):

  • assigned to, created by, or with review requested from me
  • created more recently than 2 weeks ago.

I imagine this might cut down on the amount of (dead) data forge has to store for that repo and make it faster to use for the things I tend to do day-to-day.
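
To illustrate what I mean, something like the following, using ghub's generic GraphQL helper and GitHub's search syntax (the repo name, qualifiers, and date are just placeholders; nothing here is wired into Forge):

```elisp
;; Rough sketch only; OWNER/REPO and the qualifiers are placeholders.
(ghub-graphql
 "query ($query: String!) {
    search(query: $query, type: ISSUE, first: 50) {
      nodes {
        ... on PullRequest { number title updatedAt }
        ... on Issue { number title updatedAt } } } }"
 '((query . "repo:OWNER/REPO is:open review-requested:@me created:>2018-12-19")))
```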

vermiculus commented Jan 2, 2019

Related to #42.

vermiculus commented Jan 2, 2019

A careful redefinition of (defconst ghub-fetch-repository ...) should be able to effect this, but that may not be ideal. From a standpoint of 'released code should do the right thing', it may be possible to only fetch data that has changed since the last pull. This 'last pull' timestamp could be stored in the Forge database along with the actual information pulled.

I don't know how we'd fix the exceptionally long initialization time, though (aside from making the SQLite updates non-blocking). The only thing I can think of is to set a lower bound as some sort of defcustom, like 'only retrieve objects updated within the past six months'. The other requirements are complex enough to require a function that inspects and modifies the GraphQL query DSL object, but the time bound may be enough, unless your bug tracker is as insanely busy with others' problems as mine at work :-)
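
For the time bound, I'm picturing something along these lines (the option name and default are purely illustrative; nothing like this exists in Forge yet):

```elisp
(require 'time-date)  ; for `days-to-time'

;; Purely illustrative; neither the option name nor the default exists in Forge.
(defcustom my-forge-fetch-topics-since
  (format-time-string "%F" (time-subtract (current-time) (days-to-time 180)))
  "Only fetch topics that were updated after this date (YYYY-MM-DD)."
  :group 'forge
  :type 'string)
```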

tarsius commented Jan 2, 2019

How many topics does that repository have?

antifuchs commented Jan 2, 2019

A faiiiir amount! I can't figure out how to get the number out of the SQLite DB at the moment, but I'll just cite the rough numbers from the repo pages themselves:

  • PRs: 900 open, 120000 closed.
  • Issues: 3 open, 5500 closed (we don't use issues much anymore).

vermiculus commented Jan 2, 2019

I'm impressed GitHub can handle that. My company uses a home-grown system with a few million (yes.) such records; I wonder if we'd ever move to something like GitHub if it can indeed scale like that 😄 We probably won't, but a man can dream.

Definitely, fetching/storing over 100000 records (even 10000) in the foreground seems like a big deal, at least without some sort of are-you-sure confirmation.
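
Even a crude guard would help, something roughly like this (the count and threshold are made up; Forge doesn't currently compute either):

```elisp
;; Hypothetical guard; `topic-count' would have to come from a cheap
;; API call before the full fetch starts.
(when (> topic-count 10000)
  (unless (yes-or-no-p
           (format "Really fetch all %d topics in the foreground? " topic-count))
    (user-error "Fetch aborted")))
```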

antifuchs commented Jan 3, 2019

> I'm impressed GitHub can handle that

It's a (fairly beefy) GitHub Enterprise installation (which has its own problems: they run on a single instance, so you can only scale them vertically); I'm not sure I want to see something like this run on public GitHub (-:

One thing that made this import so heavy on my box was that Emacs really was blocked for about 2-3 hours on inserting into SQLite; if that could be interleaved somehow (insert a batch every 1000 fetched elements?), that would probably improve responsiveness a lot, even though it would likely cause Emacs to block for a short period every so often.
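
Roughly what I'm picturing, as a sketch only (fetch-all-topics and store-topic are made-up stand-ins for whatever ghub/forge do internally; emacsql-with-transaction and forge-db are real):

```elisp
(let (batch)
  (dolist (topic (fetch-all-topics))      ; made-up helper
    (push topic batch)
    (when (>= (length batch) 1000)
      (emacsql-with-transaction (forge-db)
        (mapc #'store-topic batch))       ; made-up helper
      (setq batch nil)
      (redisplay)))                       ; let Emacs catch up between batches
  (when batch                             ; flush the final partial batch
    (emacsql-with-transaction (forge-db)
      (mapc #'store-topic batch))))
```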

tarsius commented Jan 3, 2019

> Magit-status performance on this repo has also deteriorated a little: updating topics blocks Emacs for a minute (even if only ~3 PRs changed)

That's probably because some data is being fetched from scratch every time: basically everything in the ghub-fetch-repository constant that doesn't have an orderBy associated with it. I think only assignableUsers and labels matter here. I don't think the API can be asked to only return the elements that have changed since a certain date for those fields.

> and hitting RET on an open topic also takes about half a minute.

The problem here is that all available topics are being retrieved from the database and massaged a bit, even though we know that all of that will just be discarded anyway. That doesn't matter for a tiny repository, but starting with a repository of about Magit's size it leads to an annoying hang.

Just like Magit's own, many of Forge's commands use magit-completing-read but are configured to not offer any completion candidates when there is a valid target at point, in which case they should just act on that. The problem is that the "maybe just use the default instead of prompting" functionality is implemented in that function, so callers have to hand it the list of completion candidates anyway.

It might be possible to fix that by handing it a function that returns the list of candidates, instead of the candidates themselves, to delay that code from being evaluated until we know that we have to. But then we have to decide whether magit-completing-read should call that function or just hand it off to completing-read. These functions already take a function in place of the completion candidates, but I think doing that has more effects than just delaying evaluation, so it has to be done carefully.
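
Roughly, handing it a function would look like this (just a sketch; my-topic-choices is a made-up helper, and whether magit-completing-read's defaulting logic copes with a function is exactly the open question):

```elisp
(magit-completing-read
 "Topic"
 (completion-table-dynamic
  (lambda (_input)
    ;; Only hit the database once the user actually has to choose.
    (my-topic-choices)))
 nil t)
```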

I'll probably add a new issue about that.

> 120000

I wasn't really expecting Forge to behave well in such cases.

> One thing that made this import so heavy on my box was that Emacs really was blocked for about 2-3 hours on inserting into SQLite; if that could be interleaved somehow (insert a batch every 1000 fetched elements?), that would probably improve responsiveness a lot, even though it would likely cause Emacs to block for a short period every so often.

Forge is already doing that, more or less. It appears that there is a bug somewhere. Also see #6.

tarsius commented Jan 3, 2019

> So that makes me wonder if it's possible to come up with criteria for topics to sync. If I understand correctly, the GraphQL endpoint supports queries, so I would like to come up with my own,

Making all the data available locally is a major design decision I made for Forge. Given such a large repository that obviously doesn't work well, and we should provide a workaround, but there is a limit to how fancy that can get. Basically, we can forgo fetching topics that haven't been updated since a certain date.

We already do that for the second and subsequent times the user pulls data for a given repository. The hack that we could use for humongous repositories is to allow the user to explicitly set that date before doing the first pull.

antifuchs commented Jan 3, 2019

(reposting from the correct account)

I really like the idea of restricting the time range on topic updates; being able to initially fetch only the last 2 months of topics would still cover all the topics that are relevant to my day-to-day but cut down the number of topics that need to be fetched & stored by a lot.

tarsius changed the title from "Restrict fetched pull reqs/issues by criteria?" to "Fetching data for huge GitHub repositories takes very long" on Jan 3, 2019

tarsius pinned this issue on Jan 3, 2019

tarsius added the enhancement label on Jan 3, 2019

tarsius commented Jan 3, 2019

I have pushed a quick draft to the sparse-fetch branch. Please give that a try.

antifuchs commented Jan 4, 2019

> I have pushed a quick draft to the sparse-fetch branch. Please give that a try.

I tried it, and it works:

  1. Backed up the existing .sqlite db (it was closer to 550MB than to 600MB, sorry for exaggerating a little) and removed the original
  2. M-: (setq forge--initial-topic-until "2018-12-01")
  3. F y in the biggest repo
  4. M-: (setq forge--initial-topic-until nil) as indicated in the docstring

That fetched about 200 pages of topics over about 10 minutes and created a 50MB .sqlite file. It looks great now!

With some slightly better ergonomics around setting that date (prefix arg to the initial fetch command?), I think it would 100% resolve my issues (-:
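
For example, something like this entirely hypothetical wrapper around the manual steps above (assuming the fetch command behind F y is forge-pull):

```elisp
(defun my-forge-initial-pull-since (date)
  "Set the cut-off DATE, then run the initial pull.
Remember to set `forge--initial-topic-until' back to nil afterwards,
as its docstring says."
  (interactive "sOnly fetch topics updated since (YYYY-MM-DD): ")
  (setq forge--initial-topic-until date)
  (call-interactively #'forge-pull))
```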
