DEV-1125 - run duplicate holdings cleanup under sidekiq #303

Merged (2 commits) on May 29, 2024

@aelkiss aelkiss commented May 28, 2024

Motivation

Running this as a single thread was far too slow, and it crashed partway through, leaving us no easy way to retry or to determine which specific cluster might have failed.

This change:

  • Queues batches of clusters to clean up as jobs in sidekiq
  • Moves logic out of bin and adds a phctl command for cleaning up duplicate holdings (this simplifies loading all the dependencies so the sidekiq stuff works, rather than needing to duplicate that loading in the bin script)
  • Adds an optimization to avoid re-saving the cluster's holdings if there are no changes
  • Fixes sidekiq-web

Reviewing

@mwarin I think this should be fairly straightforward to re-review. Basically, this is the functionality you already looked at, plus the ability to queue small jobs in sidekiq, each covering a batch of (by default) 100 clusters.
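
The batching described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `queue_cluster_batches` and `DEFAULT_BATCH_SIZE` are hypothetical names, and the block stands in for whatever actually enqueues the job (with Sidekiq, typically a `perform_async` call on a job class).

```ruby
# Hypothetical default, matching the "100 clusters at a time" described above.
DEFAULT_BATCH_SIZE = 100

# Splits cluster IDs into fixed-size batches and hands each batch to the
# caller's block, which is responsible for enqueuing a job for it.
# Returns the number of batches handed out.
def queue_cluster_batches(cluster_ids, batch_size: DEFAULT_BATCH_SIZE)
  queued = 0
  cluster_ids.each_slice(batch_size) do |batch|
    yield batch
    queued += 1
  end
  queued
end

# Usage: collect batches in an array instead of enqueuing real jobs.
jobs = []
count = queue_cluster_batches((1..250).map(&:to_s)) { |batch| jobs << batch }
puts "queued #{count} batches with sizes #{jobs.map(&:size).inspect}"
```

Passing IDs as strings keeps the job arguments JSON-friendly, which Sidekiq expects. Small batches also mean a failed job identifies a limited set of clusters to retry.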

The sidekiq-web update comes from https://github.com/sidekiq/sidekiq/wiki/Monitoring#standalone. We missed adding the rack-session gem earlier: it became a requirement with a rack update we did previously, but we hadn't tested that, since the tests don't load sidekiq-web.

I ran this locally by adding 5000 clusters from the production database, queuing up the jobs, and watching them run.

@aelkiss aelkiss marked this pull request as draft May 28, 2024 21:40

aelkiss commented May 28, 2024

However -- I see the jobs running, but I'm not convinced it's properly cleaning things up (despite the passing tests); will investigate.

@aelkiss aelkiss force-pushed the DEV-1125-duplicate-holdings-cleanup branch from dcb5067 to 25a6e2b Compare May 28, 2024 21:42

aelkiss commented May 28, 2024

I think this is an artifact of how I loaded the test data - I get e.g.

date_received: {"$date"=>"2020-07-21T00:00:00.000Z"}

I'll see if I can do a more proper test in dev...
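
For context, a value of this shape is how MongoDB Extended JSON represents a date. One way it can end up stored as a nested hash (an assumption here; the comment doesn't say how the test data was loaded) is parsing an export as plain JSON, which has no notion of `$date`. The sketch below reproduces that symptom with the standard library only; `unwrap_extended_dates` is a hypothetical helper, not part of this PR.

```ruby
require "json"
require "time"

# A document in Extended JSON form, using the value from the comment above.
raw = '{"date_received": {"$date": "2020-07-21T00:00:00.000Z"}}'

# Plain JSON parsing keeps the wrapper as an ordinary Hash.
doc = JSON.parse(raw)
puts doc["date_received"].class

# Hypothetical helper: recursively turns {"$date" => "<ISO 8601>"} wrappers
# into Time objects and leaves every other value untouched.
def unwrap_extended_dates(value)
  case value
  when Hash
    if value.keys == ["$date"] && value["$date"].is_a?(String)
      Time.iso8601(value["$date"])
    else
      value.transform_values { |v| unwrap_extended_dates(v) }
    end
  when Array
    value.map { |v| unwrap_extended_dates(v) }
  else
    value
  end
end

fixed = unwrap_extended_dates(doc)
puts fixed["date_received"].class
```

A loader that understands Extended JSON avoids the problem without any such helper.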


coveralls commented May 28, 2024

Coverage Status

coverage: 95.067% (-0.02%) from 95.082%
when pulling 3078966 on DEV-1125-duplicate-holdings-cleanup
into 2bf7c3f on main.

@aelkiss aelkiss marked this pull request as ready for review May 28, 2024 22:40

aelkiss commented May 29, 2024

After loading the data with mongoimport, I verified that running this in dev does indeed clean up the duplicate holdings.

@aelkiss aelkiss requested a review from mwarin May 29, 2024 13:34

@mwarin mwarin left a comment

Tests pass, code makes sense save for a headscratcher or two, no reason not to APPROVE.

end
end

def dedupe_holdings(cluster)
mwarin (Contributor):

Looks fishy. Should dedupe_holdings have a body?

...or, as I look closer, this might be a remnant from the previous version, and should be deleted.

aelkiss (Member Author):

yeah, needs to be deleted. I inlined the method because I wanted to keep track in remove_duplicate_holdings of how many things had been removed (so that we could avoid saving the cluster if nothing had changed)
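
The idea of counting removals so that an unchanged cluster is never saved can be sketched as below. This is an illustration under stated assumptions rather than the code from lib/cleanup_duplicate_holdings.rb: the `Holding` fields, the `Cluster` stand-in, and the rule for what counts as a duplicate (same organization and ocn) are all hypothetical. Only the method name `remove_duplicate_holdings` comes from the comment above.

```ruby
# Hypothetical, simplified stand-ins for the real models.
Holding = Struct.new(:organization, :ocn, :date_received)

class Cluster
  attr_reader :holdings, :save_count

  def initialize(holdings)
    @holdings = holdings
    @save_count = 0
  end

  # Stands in for a database write; counts calls so the effect is visible.
  def save
    @save_count += 1
  end
end

# Removes duplicate holdings in place and returns how many were removed.
# The cluster is saved only when at least one holding was actually dropped.
def remove_duplicate_holdings(cluster)
  # Assumed duplicate rule: same organization and ocn; the first one is kept.
  unique = cluster.holdings.uniq { |h| [h.organization, h.ocn] }
  removed = cluster.holdings.size - unique.size
  if removed > 0
    cluster.holdings.replace(unique)
    cluster.save
  end
  removed
end

cluster = Cluster.new([
  Holding.new("umich", 1, "2020-07-21"),
  Holding.new("umich", 1, "2020-07-21"),
  Holding.new("upenn", 1, "2020-07-21")
])
puts remove_duplicate_holdings(cluster) # first pass drops one duplicate
puts remove_duplicate_holdings(cluster) # second pass finds nothing to remove
puts cluster.save_count
```

Returning the count also gives a batch job something to log or total up across clusters.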

@aelkiss aelkiss force-pushed the DEV-1125-duplicate-holdings-cleanup branch from 804be5c to 8c0b063 Compare May 29, 2024 14:58
* Queues batches of clusters to clean up as jobs
* Move logic to lib; add phctl command
* Don't re-save the cluster's holdings if there are no changes (optimization)
* add rack-session gem (needed by rack 3)
* update standalone code from https://github.com/sidekiq/sidekiq/wiki/Monitoring#standalone
* fix sidekiq_web service in docker-compose.yml
@aelkiss aelkiss force-pushed the DEV-1125-duplicate-holdings-cleanup branch from 8c0b063 to 3078966 Compare May 29, 2024 15:37
@aelkiss aelkiss merged commit ede5661 into main May 29, 2024
1 check passed