Introduces BulkVersionUpdater to fix Version read/write N+1s in PackageManager::Base #3268
Conversation
THEN versions.repository_sources
ELSE
  (COALESCE(versions.repository_sources, '[]'::jsonb) || COALESCE(EXCLUDED.repository_sources, '[]'::jsonb))
END)
this took a while to get right, but it should simulate the logic we had before of appending repository_sources. See the spec for a table test of examples.
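For intuition, here is a plain-Ruby sketch of the append-and-dedupe behavior that SQL aims for (the method name and shapes here are illustrative, not the actual spec code):

```ruby
# Illustrative Ruby mirror of the SQL CASE/COALESCE merge: treat a NULL jsonb
# column as an empty array, append the incoming sources, then drop nulls and
# duplicates.
def merge_repository_sources(existing, incoming)
  ((existing || []) + (incoming || [])).compact.uniq
end

merge_repository_sources(nil, ["Rubygems"])              # => ["Rubygems"]
merge_repository_sources(["Rubygems"], nil)              # => ["Rubygems"]
merge_repository_sources(["Rubygems"], ["Maven"])        # => ["Rubygems", "Maven"]
```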
.first
  &.tap(&:update_repository_async)
  &.tap(&:update_project_tags_async)
@db_project.update_column(:versions_count, @db_project.versions.count) # normally counter_culture does this
I'm tempted to wrap everything in here in a transaction, but we might want to wait and see how the upsert_all does in production.
If we do that, note that there's a new transaction-aware Sidekiq feature that we could use to defer the job-enqueuing here until after the transaction runs.
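The defer-until-commit pattern can be illustrated without Rails or Sidekiq at all. This is a toy sketch of the idea only; everything in it (the DeferredQueue class, the job symbols) is hypothetical, and the real thing would be Sidekiq's transaction-aware client rather than hand-rolled code:

```ruby
# Toy sketch: buffer job enqueues during a "transaction" and only hand them to
# the queue once it commits, so a rollback never leaves orphaned jobs.
class DeferredQueue
  def initialize
    @pending = []
  end

  def enqueue(job)
    @pending << job # buffered; no worker can see it yet
  end

  def commit!
    flushed = @pending.dup
    @pending.clear
    flushed # in the real pattern, each job is pushed to Redis here
  end
end

queue = DeferredQueue.new
queue.enqueue(:update_repository_async)
queue.enqueue(:update_project_tags_async)
flushed = queue.commit!
# => [:update_repository_async, :update_project_tags_async]
```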
Since this is now a place where we need to keep potential new hooks in sync with this code, is there a way to limit the amount of knowledge about the Version class we have in this code? Can those hook calls be encapsulated elsewhere, closer to Version?
I've put a note in the Version class here: https://github.com/librariesio/libraries.io/pull/3268/files#diff-61977d248b654178763e77a2fddc09152a20b85b1aecfc8222ef4233564593fcR48
But I agree, it'd be nice to make it more explicit. I'll look into a bulk callback thingy
# handles merging any existing repository_sources with new repository_source (see specs for table tests)
# note that timestamps act slightly differently: we'll use the provided updated_at here if it exists, which
# might update updated_at even if no attributes changed. Whereas normally with AR's dirty attributes, timestamps
# wouldn't change unless other attributes had changed during the save.
this is really the only big data change: before, we wouldn't update a Version if its attributes hadn't changed. But now, since we're passing all the Version attributes into one query, we'll save the updated_at if it's passed here (which it will be).
(this could be averted with a lot of SQL below, but I don't think it's worth it personally)
This wouldn't trigger extra Release changes when we sync, right?
nah, for releases that already exist, the on_duplicate logic below dictates how we update any info:
- status: pick new over old
- runtime_dependencies_count: pick new over old
- original_license: pick new over old
- published_at: pick new over old
- created_at: use old, and if it isn't set then use new
- updated_at: use new, and if it isn't set then use old
- repository_sources: merge new into old, without duplicates or nulls
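Those rules can be rendered as a small Ruby sketch for documentation purposes (the real logic is SQL inside the upsert, so `old`/`new` and the method name here are illustrative only):

```ruby
# Illustrative Ruby rendering of the on_duplicate merge rules; `old` and `new`
# are attribute hashes for the existing row and the incoming row.
def merge_on_duplicate(old, new)
  {
    status: new[:status],                             # new over old
    runtime_dependencies_count: new[:runtime_dependencies_count],
    original_license: new[:original_license],
    published_at: new[:published_at],
    created_at: old[:created_at] || new[:created_at], # old, else new
    updated_at: new[:updated_at] || old[:updated_at], # new, else old
    repository_sources:                               # merge, no dups or nulls
      ((old[:repository_sources] || []) + (new[:repository_sources] || [])).compact.uniq,
  }
end
```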
Updating the updated_at if nothing's changed doesn't seem great, esp since it's used for debugging data issues and is not standard object behavior :/ I would be expecting something like the "last checked at" timestamp to be bumped, but not the updated_at.
I'll see what I can do! Might not be too bad to preserve dirty state
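One way to preserve dirty-style behavior, sketched in Ruby terms (TRACKED_ATTRS and the method are hypothetical; in the actual upsert this would be a CASE over `IS DISTINCT FROM` comparisons in SQL, not Ruby):

```ruby
# Sketch: only take the incoming updated_at when some substantive attribute
# actually differs, mirroring ActiveRecord's dirty-tracking behavior.
TRACKED_ATTRS = %i[status original_license published_at].freeze

def resolve_updated_at(old, new)
  changed = TRACKED_ATTRS.any? { |attr| old[attr] != new[attr] }
  changed ? new[:updated_at] : old[:updated_at]
end
```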
(force-pushed from 8da5988 to d1faffa)
end

pp "added versions #{api_versions.map(&:version_number).join(', ')}"
pp
before I remove this, do you remember if there was a reason for it? 349b6fb#diff-b3d7088ca715b858cc9b988967128f4f0fb6a117c0633b2e0bc7fb4926bda26fR108
Oh, right, this was for debugging since this is run from a Rake task. This can stay.
newly_inserted_versions = @db_project.versions.where.not(id: existing_version_ids)

# run callbacks manually since upsert_all doesn't run callbacks.
Version.bulk_after_create_commit(newly_inserted_versions, @db_project)
What happens if the callbacks fail for version #40 out of the 100 to be updated? Do we need to roll things back? Are the other 60 going to lose their callbacks forever?
In that case the PackageManagerDownloadWorker job would retry, and it would re-upsert all the versions and run their callbacks, which is fine because the job should be idempotent. The callbacks are basically just hitting redis (to enqueue Sidekiq jobs), so if that's happening constantly we'll have a bigger problem on our hands.
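Since that retry story leans on the callbacks being cheap and re-runnable, here is a minimal stand-in sketch of the bulk runner's shape. The hook names come from the diff in this PR; FakeVersion and the free-standing method are stand-ins to show the idea, not the real models:

```ruby
# Stand-in model with the two async hooks the diff taps on new versions.
class FakeVersion
  attr_reader :enqueued

  def initialize
    @enqueued = []
  end

  # In the real app these enqueue Sidekiq jobs; recording them here also makes
  # the idempotency point visible: re-running the bulk callback just
  # re-enqueues the same (idempotent) jobs.
  def update_repository_async
    @enqueued << :update_repository
  end

  def update_project_tags_async
    @enqueued << :update_project_tags
  end
end

# Sketch of a bulk runner: apply the per-record hooks to each newly inserted
# version, mirroring what after_create_commit would have done one at a time.
def bulk_after_create_commit(versions)
  versions.each do |v|
    v.update_repository_async
    v.update_project_tags_async
  end
end

versions = [FakeVersion.new, FakeVersion.new]
bulk_after_create_commit(versions)
```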
validates :status, inclusion: { in: STATUSES, allow_blank: true }

belongs_to :project
counter_culture :project
has_many :dependencies, dependent: :delete_all
has_many :runtime_dependencies, -> { where kind: %w[runtime normal] }, class_name: "Dependency"

##############################################################################################
# NB: callbacks are run manually in bulk_after_create_commit, so keep that class in sync with these.
##############################################################################################
I made this comment a little more prominent, cc @johnbintz-tidelift
app/models/version.rb (outdated)
# normally counter_culture does this
project.update_column(:versions_count, project.versions.count)
@db_project
Is this return value needed? If so, it's the wrong return value:
- @db_project
+ project
(force-pushed from feef05d to fa73f07)
(force-pushed from fa73f07 to cb86ee2)
(force-pushed from cb86ee2 to 5c13a48)
# create synthetic versions without saving them, so we can get their attributes
attrs = @api_versions
  .map { |api_version| Version.new(api_version.to_version_model_attributes.merge(project: @db_project)) }
Thoughts about doing a bulk find_or_create on the versions here instead? One concern: the synthetic objects might be valid when constructed here from scratch, but they might not be valid when you do the upsert, due to existing state.
For example, if an existing row has an invalid status, I think the from-scratch version would pass validation here (no changes to status), but after the upsert it would still have an invalid status, so you'd end up with a Version that doesn't pass validation (if the intention here is to only save valid Version objects).
the intention here is just to make sure we run validations and callbacks on each Version beforehand, as if they were inserted individually. There's some logic on line 43 to prefer the new data if possible (status = EXCLUDED.status) so that'll squash the invalid status case.
# from Version#before_save
v.update_spdx_expression
# upsert_all doesn't do validation, so ensure they're valid here.
v.validate!
What happens if there are two identical versions for the same project id in @api_versions? +/- differences in data? Or if two of these were running for the same project at the same time?
that should raise a unique index error in Bugsnag, which'll be desirable because we'll want to know what kind of weird data we're getting (I personally haven't ever seen a duplicate version coming back from an API, FWIW).
repository_source_name: self::HAS_MULTIPLE_REPO_SOURCES ? [self::REPOSITORY_SOURCE_NAME] : nil
).run!

remove_missing_versions(db_project, api_versions.map(&:version_number)) if sync_version == :all
Should this use the version numbers from the existing and newly inserted Versions from BulkVersionUpdater?
I think this would potentially allow for extra API versions to be kept compared to what we actually found/inserted. Not sure how common that would be in practice, or if that matters
that's a great idea, altho could take a bit of code surgery. I'm tempted to introduce the least changes possible in this PR, so I'll just add a note about that for later.
Is this still needed?
I've not had time to figure out how to keep the timestamps updated properly. I'm closing it for now, but if we see performance issues we can always return to it. RIP
these changes take a cold update of npm/gatsby from 67 sec to 34 sec for me locally (down from 83 min originally, before #3267)