Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle GCS bucket record deletion #2030

Merged
merged 36 commits into from
May 2, 2024

Conversation

andrewpollock
Copy link
Contributor

@andrewpollock andrewpollock commented Mar 5, 2024

Detect and plumb through to worker for deletion from Datastore Bugs that are no longer present in GCS.

Related changes (#2023, #2029) to the NVD CVE generation cause CVEs no longer being generated (due to changes in the heuristics like that made in #1707) to remain existing in GCS.

This PR addresses cases like this and the need identified in #1467 by adding a deletion phase to the importing of new/updated records. The functionality is flag-protected, it won't go live in Production until a new --delete flag is included so in the execution of the importer.

Incidental changes:

  • use the GCS bucket directory_path to efficiently filter the blobs returned when listing bucket contents
  • make blob retrieval resilient to blob generation change between blob listing and blob retrieval (this can happen if combine-to-osv happens to have run in between these two points in time)
  • fix a behavior inconsistency with schema validation not being performed when ignore_last_import_time is in effect (addressing head scratching TODO from @michaelkedar)
  • tidy up the existing tests, making them more readable and debuggable
  • add a slow manual test against live data in staging to validate real-world behavior and run time (this adds ~13 minutes to an import run on just the CVE GCS bucket)

andrewpollock added a commit to andrewpollock/osv.dev that referenced this pull request Mar 5, 2024
docker/importer/importer.py Show resolved Hide resolved
docker/importer/importer.py Outdated Show resolved Hide resolved
docker/importer/importer.py Outdated Show resolved Hide resolved
docker/importer/importer.py Show resolved Hide resolved
docker/importer/importer.py Outdated Show resolved Hide resolved
docker/importer/importer.py Outdated Show resolved Hide resolved
docker/importer/importer.py Outdated Show resolved Hide resolved
docker/importer/importer.py Outdated Show resolved Hide resolved
Detect and plumb through to worker for deletion from Datastore Bugs that
are no longer present in GCS.

Related changes (google#2023, google#2029) to the NVD CVE generation cause CVEs no
longer being generated (due to changes in the heuristics like that made
in google#1707) to remain existing in GCS.

This PR addresses cases like this and the need identified in google#1467 by
adding a deletion phase to the importing of new/updated records. The
functionality is flag-protected, it won't go live in Production until
a new `--delete` flag is included so in the execution of the importer.

Incidental changes:

- use the GCS bucket `directory_path` to efficiently filter the blobs
  returned when listing bucket contents
- make blob retrieval resilient to blob generation change between blob
  listing and blob retrieval (this can happen if `combine-to-osv`
  happens to have run in between these two points in time)
- fix a behavior inconsistency with schema validation not being
  performed when `ignore_last_import_time` is in effect (addressing
  head scratching TODO from @michaelkeder)
- tidy up the existing tests, making them more readable and debuggable
- add a slow manual test against live data in staging to validate
  real-world behavior and run time (this adds ~13 minutes to an import
  run on just the CVE GCS bucket)
Use a list comprehension instead of a filter + lambda

For style guide compliance.
docker/importer/importer.py Outdated Show resolved Hide resolved
docker/importer/importer.py Outdated Show resolved Hide resolved
- refactor out inline function from _process_deletions_bucket
  - switch to using executor.submit() rather than map() to follow
    documented examples and reduce reader cognitive burden
- fix bug in safety check and add explicit test for it
Also add some additional instrumentation
So the human running them has a better idea of what the heck is going
on.
Because this takes a long time to run, the human running it has little
insight into what is going on.
Some of the logging shenanigans in the test were superfluous
Copy link
Collaborator

@oliverchang oliverchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this generally looks good, thanks!

how's the performance of this running on the staging CVE bucket? if this ends up being slow and impacts our freshness SLO, we may need to consider moving this to e.g. a separate "cleanup" cron job instead of sticking this in the importer.

docker/importer/importer.py Outdated Show resolved Hide resolved
As expected, this crashes into the safety guardrail, so override it.
(Why did the linter not pick up on that?)
@andrewpollock
Copy link
Contributor Author

how's the performance of this running on the staging CVE bucket? if this ends up being slow and impacts our freshness SLO, we may need to consider moving this to e.g. a separate "cleanup" cron job instead of sticking this in the importer.

Yeah I think based on my testing so far, while not running on the same sized GKE node as it would in production, running it as part of the regular importing run isn't looking tenable. I think we'll need to have a separate say daily cronjob that runs it in deletion mode. Good thing I've got all of that behind a flag :-)

andrewpollock and others added 4 commits April 22, 2024 09:37
Because deletion is going to be slow to run, it will need to be run
separately from regular importing, so as not to cause the overall
runtime to violate our freshness SLO.
andrewpollock added a commit to andrewpollock/osv.dev that referenced this pull request Apr 24, 2024
This runs the code introduced in google#2030 that takes many hours to complete
and isn't practical to run inline with regular imports.

Addresses google#1467
46f1c03 decoupled deletion from updates, this introduces a new test for
testing in deletion mode.
Copy link
Collaborator

@oliverchang oliverchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some minor nits.

docker/importer/importer.py Show resolved Hide resolved
docker/importer/importer.py Outdated Show resolved Hide resolved
docker/importer/importer.py Outdated Show resolved Hide resolved
46f1c03 decoupled deletion from updates, this introduces a new test for
testing in deletion mode.

Signed-off-by: Andrew Pollock <apollock@google.com>
Copy link
Collaborator

@oliverchang oliverchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@andrewpollock andrewpollock merged commit 8174d1c into google:master May 2, 2024
11 checks passed
andrewpollock added a commit that referenced this pull request May 8, 2024
This runs the code introduced in #2030 that takes many hours to complete
and isn't practical to run inline with regular imports.

Addresses #1467
@andrewpollock andrewpollock deleted the delete_stale_bugs branch May 23, 2024 04:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants