What
Refactor the GitHub archive downloader with the following notable changes:
- context.Context instead of "done" channels for timeouts and cancellations
- http.NewRequestWithContext
Why
The downloader job uses so much RAM that it gets OOMKilled. I couldn't even download a week's worth of files with 4 workers, because the container was using ~4 GB of RAM (the limit on my Docker VM on macOS).
Some files contain events that cannot be parsed due to broken (?) JSON contents (e.g. invalid character '\\'' looking for beginning of value). Those files will never be marked as downloaded, so skipping broken events is the only way to mark them as downloaded, hence the optional flag. We could still leave the flag set to false (as it is by default) and the functionality won't change.
Transactions are nice to have, although I haven't noticed inconsistencies in the DB during my test runs. The rest of the refactoring was done to make the code a bit more readable; feedback is welcome.