
Introduces BulkVersionUpdater to fix Version read/write N+1s in PackageManager::Base #3268

Closed

Conversation

@tiegz (Contributor) commented Dec 1, 2023

these changes take a cold update of npm/gatsby from 67 sec to 34 sec for me locally (down from 83 min originally, before #3267)

  • replace VersionUpdater with BulkVersionUpdater, which does all the version upserts in one query (accompanied by manual callbacks); a rough sketch of the new flow follows this list
  • remove the uniqueness validation on Version for [:number, :project_id] because we have a unique index on those columns
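
For reference, here is a rough sketch of how those pieces fit together, assembled from the hunks quoted later in this review. Names like `to_version_model_attributes` and `bulk_after_create_commit` are quoted from the PR; everything else is an approximation, not the PR's exact code.

```ruby
# Hedged sketch of BulkVersionUpdater, approximated from the discussion below.
class BulkVersionUpdater
  def initialize(db_project, api_versions, repository_source_name: nil)
    @db_project = db_project
    @api_versions = api_versions
    @repository_source_name = repository_source_name
  end

  def run!
    existing_version_ids = @db_project.versions.ids

    # build synthetic Versions so before_save logic and validations still run
    attrs = @api_versions.map do |api_version|
      v = Version.new(api_version.to_version_model_attributes.merge(project: @db_project))
      v.update_spdx_expression # from Version#before_save
      v.validate!              # upsert_all doesn't run validations
      v.attributes.except("id")
    end

    # one upsert instead of one INSERT/UPDATE per version; relies on the unique
    # [number, project_id] index rather than the removed uniqueness validation
    Version.upsert_all(attrs, unique_by: %i[number project_id]) if attrs.any?

    # upsert_all also skips callbacks and counter_culture, so run those manually
    new_versions = @db_project.versions.where.not(id: existing_version_ids)
    Version.bulk_after_create_commit(new_versions, @db_project)
    @db_project.update_column(:versions_count, @db_project.versions.count)
  end
end
```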

app/models/version.rb (resolved)
THEN versions.repository_sources
ELSE
(COALESCE(versions.repository_sources, '[]'::jsonb) || COALESCE(EXCLUDED.repository_sources, '[]'::jsonb))
END)
Contributor Author:

this took a while to get right, but it should simulate the logic we had before of appending repository_sources. See the spec for a table test of examples

.first
&.tap(&:update_repository_async)
&.tap(&:update_project_tags_async)
@db_project.update_column(:versions_count, @db_project.versions.count) # normally counter_culture does this
Contributor Author:

I'm tempted to wrap everything in here in a transaction, but we might want to wait and see how the upsert_all does in production.

If we do that, note that there's a new transaction-aware Sidekiq feature that we could use to defer the job-enqueuing here until after the transaction runs.
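
For what it's worth, a minimal sketch of that idea, assuming the same `attrs` and `existing_version_ids` as in the flow sketched above; this is not part of the diff, just an illustration of deferring the enqueues until after commit:

```ruby
# Hypothetical: wrap the bulk write in one transaction and enqueue follow-up jobs
# only after it commits, so a rolled-back upsert never leaves stray Sidekiq jobs.
new_versions = nil

ActiveRecord::Base.transaction do
  Version.upsert_all(attrs, unique_by: %i[number project_id])
  @db_project.update_column(:versions_count, @db_project.versions.count)
  new_versions = @db_project.versions.where.not(id: existing_version_ids).to_a
end

# runs outside the transaction, after commit
Version.bulk_after_create_commit(new_versions, @db_project)
```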

Contributor:

Since this is now a place where we need to keep potential new hooks in sync with this code, is there a way to limit the amount of knowledge about the Version class we have in this code? Can those hook calls be encapsulated elsewhere, closer to Version?

Contributor Author:

I've put a note in the Version class here: https://github.com/librariesio/libraries.io/pull/3268/files#diff-61977d248b654178763e77a2fddc09152a20b85b1aecfc8222ef4233564593fcR48

But I agree, it'd be nice to make it more explicit. I'll look into a bulk callback thingy

@tiegz changed the title from "Fix Version read/write N+1s in PackageManager::Base" to "Introduces BulkVersionUpdater to fix Version read/write N+1s in PackageManager::Base" on Dec 2, 2023
@tiegz marked this pull request as ready for review on December 2, 2023
# handles merging any existing repository_sources with new repository_source (see specs for table tests)
# note that timestamps act slightly differently: we'll use the provided updated_at here if it exists, which
# might update updated_at even if no attributes changed. Whereas normally with AR's dirty attributes, timestamps
# wouldn't change unless other attributes had changed during the save.
Contributor Author (@tiegz, Dec 2, 2023):

this is really the only big data change: before, we wouldn't update a Version if its attributes hadn't changed. But now, since we're passing all the Version attributes into one query, we'll save the updated_at if it's passed here (which it will be).

(this could be averted with a lot of SQL below, but I don't think it's worth it personally)

Contributor:

This wouldn't trigger extra Release changes when we sync, right?

Contributor Author:

nah, for releases that already exist, the on_duplicate logic below dictates how we update any info (roughly sketched after this list):

  • status: pick new over old
  • runtime_dependencies_count: pick new over old
  • original_license: pick new over old
  • published_at: pick new over old
  • created_at: use old, and if it isn't set then use new
  • updated_at: use new, and if it isn't set then use old
  • repository_sources: merge new into old, without duplicates or nulls
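
Roughly, those rules could be expressed with Rails 7's `upsert_all(on_duplicate:)` like this. This is a hedged sketch within the same `run!` context as above, not the PR's exact SQL; in particular, the CASE condition and the de-duplication of `repository_sources` in the real query are more involved than shown here.

```ruby
Version.upsert_all(
  attrs,
  unique_by: %i[number project_id],
  on_duplicate: Arel.sql(<<~SQL.squish)
    status = EXCLUDED.status,
    runtime_dependencies_count = EXCLUDED.runtime_dependencies_count,
    original_license = EXCLUDED.original_license,
    published_at = EXCLUDED.published_at,
    created_at = COALESCE(versions.created_at, EXCLUDED.created_at),
    updated_at = COALESCE(EXCLUDED.updated_at, versions.updated_at),
    repository_sources = (CASE
      WHEN versions.repository_sources @> COALESCE(EXCLUDED.repository_sources, '[]'::jsonb)
        THEN versions.repository_sources
      ELSE (COALESCE(versions.repository_sources, '[]'::jsonb) || COALESCE(EXCLUDED.repository_sources, '[]'::jsonb))
    END)
  SQL
)
```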

Contributor:

Updating the updated_at if nothing's changed doesn't seem great, especially since it's used for debugging data issues and is not standard object behavior :/ I would be expecting something like the "last checked at" timestamp to be bumped, but not the updated_at.

Contributor Author:

I'll see what I can do! Might not be too bad to preserve dirty state
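
One possible shape for that, purely hypothetical and not what the PR does: skip rows whose attributes haven't actually changed, so untouched rows keep their original updated_at. `synthetic_versions` here stands in for the Version objects built from the API data.

```ruby
# Hypothetical sketch: compare each synthetic Version against the persisted row and
# only upsert the ones that differ.
existing_by_number = @db_project.versions.index_by(&:number)
comparable = ->(record) { record.attributes.except("id", "created_at", "updated_at", "repository_sources") }

changed_attrs = synthetic_versions.filter_map do |v|
  existing = existing_by_number[v.number]
  next v.attributes.except("id") if existing.nil?

  comparable.call(v) == comparable.call(existing) ? nil : v.attributes.except("id")
end

Version.upsert_all(changed_attrs, unique_by: %i[number project_id]) if changed_attrs.any?
```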

@tiegz force-pushed the tz/fix-version-n-plus-ones-in-package-manager-base branch from 8da5988 to d1faffa on December 2, 2023
end

pp "added versions #{api_versions.map(&:version_number).join(', ')}"
Contributor:

pp

Contributor Author:

before I remove this, do you remember if there was a reason for it? 349b6fb#diff-b3d7088ca715b858cc9b988967128f4f0fb6a117c0633b2e0bc7fb4926bda26fR108

Contributor:

Oh, right, this was for debugging since this is run from a Rake task. This can stay.

app/models/package_manager/maven/google.rb (outdated, resolved)
newly_inserted_versions = @db_project.versions.where.not(id: existing_version_ids)

# run callbacks manually since upsert_all doesn't run callbacks.
Version.bulk_after_create_commit(newly_inserted_versions, @db_project)
Contributor:

What happens if the callbacks fail for version #40 out of the 100 to be updated? Do we need to roll things back? Are the other 60 going to lose their callbacks forever?

Contributor Author:

In that case the PackageManagerDownloadWorker job would retry, and it would re-upsert all the versions and run their callbacks, which is fine because the job should be idempotent. The callbacks are basically just hitting Redis (to enqueue Sidekiq jobs), so if that's happening constantly we'll have a bigger problem on our hands.

validates :status, inclusion: { in: STATUSES, allow_blank: true }

belongs_to :project
counter_culture :project
has_many :dependencies, dependent: :delete_all
has_many :runtime_dependencies, -> { where kind: %w[runtime normal] }, class_name: "Dependency"

##############################################################################################
# NB: callbacks are run manually in bulk_after_create_commit, so keep that class in sync with these.
##############################################################################################
Contributor Author:

I made this comment a little more prominent, cc @johnbintz-tidelift
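
For illustration, such a bulk hook could look roughly like this. This is a hedged sketch, not the PR's implementation; the per-version callback body is a placeholder for whatever Version's real after_create_commit hooks do.

```ruby
class Version < ApplicationRecord
  # Runs, in one place, the work that per-record callbacks would normally do for
  # versions written via upsert_all (which skips callbacks). Keeping it on Version
  # makes new hooks added above harder to forget here.
  def self.bulk_after_create_commit(versions, project)
    versions.each(&:send_notifications_async) # placeholder for the real per-version hooks

    # once-per-batch work (mirrors the hunk quoted earlier in this review)
    versions.first&.tap(&:update_repository_async)&.tap(&:update_project_tags_async)

    # normally counter_culture does this
    project.update_column(:versions_count, project.versions.count)
  end
end
```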


# normally counter_culture does this
project.update_column(:versions_count, project.versions.count)
@db_project
Contributor:

Is this return value needed? If so, it's the wrong return value:

Suggested change:
- @db_project
+ project

@tiegz force-pushed the tz/fix-version-n-plus-ones-in-package-manager-base branch from feef05d to fa73f07 on January 30, 2024
@tiegz force-pushed the tz/fix-version-n-plus-ones-in-package-manager-base branch from fa73f07 to cb86ee2 on January 30, 2024
@tiegz force-pushed the tz/fix-version-n-plus-ones-in-package-manager-base branch from cb86ee2 to 5c13a48 on January 30, 2024

# create synthetic versions without saving them, so we can get their attributes
attrs = @api_versions
.map { |api_version| Version.new(api_version.to_version_model_attributes.merge(project: @db_project)) }
Contributor:

Thoughts about doing a bulk find_or_create on the versions here instead? One concern is that while the synthetic objects might be valid when constructed here from scratch, they might not be valid after you do the upsert, due to existing state.

For example, if an existing row has an invalid status, I think the from-scratch version would pass validation here (no changes to status), but after the upsert it would still have an invalid status, so you'd end up with a Version that doesn't pass validation (if the intention here is to only save valid Version objects).

Contributor Author:

the intention here is just to make sure we run validations and callbacks on each Version beforehand, as if they were inserted individually. There's some logic on line 43 to prefer the new data if possible (status = EXCLUDED.status,) so that'll squash the invalid status case.

# from Version#before_save
v.update_spdx_expression
# upsert_all doesn't do validation, so ensure they're valid here.
v.validate!
Contributor:

What happens if there are two identical versions for the same project id in @api_versions, with or without differences in their data? Or if two of these were running for the same project at the same time?

Contributor Author:

that should raise a unique index error in Bugsnag, which'll be desirable because we'll want to know what kind of weird data we're getting (I personally haven't ever seen a duplicate version coming back from an API, FWIW).

repository_source_name: self::HAS_MULTIPLE_REPO_SOURCES ? [self::REPOSITORY_SOURCE_NAME] : nil
).run!

remove_missing_versions(db_project, api_versions.map(&:version_number)) if sync_version == :all
Contributor:

Should this use the version numbers from the existing and newly inserted Versions from BulkVersionUpdater?

I think this would potentially allow for extra API versions to be kept compared to what we actually found/inserted. Not sure how common that would be in practice, or if that matters

Contributor Author:

that's a great idea, although it could take a bit of code surgery. I'm tempted to introduce the fewest changes possible in this PR, so I'll just add a note about that for later.
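
For the note-to-self, the eventual change might look something like this. It is a hedged sketch: having `run!` return the touched Versions is an assumption, not something this PR implements, and the constructor arguments other than `repository_source_name` are guesses.

```ruby
touched_versions = BulkVersionUpdater.new(
  db_project,
  api_versions,
  repository_source_name: self::HAS_MULTIPLE_REPO_SOURCES ? [self::REPOSITORY_SOURCE_NAME] : nil
).run!

# prune against what we actually wrote, not the raw API list
remove_missing_versions(db_project, touched_versions.map(&:number)) if sync_version == :all
```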

@johnbintz-tidelift (Contributor):

Is this still needed?

@tiegz (Contributor Author) commented Apr 1, 2024:

I've not had time to figure out how to keep the timestamps updated properly. I'm closing it for now, but if we see performance issues we can always return to it. RIP

@tiegz closed this on Apr 1, 2024