Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor github archive downloader #32

Merged
merged 1 commit into from
May 18, 2021

Conversation

radiohead
Copy link
Contributor

@radiohead radiohead commented May 18, 2021

What

Refactor github archive downloader with the following notable changes:

  1. Stop reading full archives in memory. This was unnecessary, since the code is using buffered IO reader anyway. The downside here was huge memory consumption, since individual archives were taking around 200 MB in RAM.
  2. Add optional flag to allow marking files as downloaded even if some events had errors.
  3. Wrap DB operations in transactions - otherwise it was possible that DELETEs would succeed but INSERTs would fail.
  4. Use context.Context instead of "done" channels for timeouts and cancellations.
  5. Migrate to Go 1.13 so that we can use http.NewRequestWithContext.

Why

The downloader job is using so much RAM it's getting OOMKilled. I couldn't even download a week worth of files with 4 workers, because the container was using ~4 GBs of RAM (which was the limit on my Docker VM on macOS).

Some files contain events that cannot be parsed due to broken (?) JSON contents (e.g. invalid character '\\'' looking for beginning of value). Those files will never be marked as downloaded, so skipping broken events is the only way to mark those files as downloaded - hence the optional flag. We could still have the flag set to false (as it is by default) and the functionality won't be changed.

Transactions are nice to have, although I haven't noticed inconsistencies in the DB during my test runs. The rest of refactoring was done to make the code a bit more readable - feedback is welcome.

@radiohead radiohead requested a review from marefr May 18, 2021 11:49
Copy link
Member

@marefr marefr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM great work. Haven't tested it though, but looks good

@radiohead radiohead merged commit 9a3982e into master May 18, 2021
@radiohead radiohead deleted the feature/refactor-archive-downloader branch May 18, 2021 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants