

Fixed concurrent-overlapping-subrepo syncs. #2842

Closed

Conversation


@ggainey ggainey commented Oct 17, 2022

Syncing the 'same' kickstart repo into multiple repositories at the same time would cause fatal errors. Taught pulp_rpm to notice the collisions and assume "whoever got there first" is The Winner.

Issue #2278 has a reliable reproducer script for the failure case.

fixes #2278.
[nocoverage]

# such as RHEL6-kickstart). The collision is due to the kickstart-sub-repos naming
# scheme of `name = f"{repodata}-{treeinfo['hash']}"` from 100 lines or so up.
# This is not unique across repos, if multiple repos are syncing identical .treeinfo
# files.
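The collision described in the comment above can be illustrated with a minimal sketch (the `subrepo_name` helper is hypothetical; it just mirrors the quoted naming scheme):

```python
# Hypothetical sketch of why subrepo names collide: the subrepo name is
# derived only from the repodata directory and the treeinfo hash, neither
# of which differs when two repositories sync identical .treeinfo files.
def subrepo_name(repodata, treeinfo):
    # Mirrors the naming scheme quoted in the comment above.
    return f"{repodata}-{treeinfo['hash']}"

# Two independent, concurrent syncs of the same kickstart tree...
treeinfo_a = {"hash": "abc123"}
treeinfo_b = {"hash": "abc123"}

# ...compute the same subrepo name, so whichever sync creates the subrepo
# second trips the uniqueness constraint (the IntegrityError below).
name_a = subrepo_name("Server", treeinfo_a)
name_b = subrepo_name("Server", treeinfo_b)
assert name_a == name_b
```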
@dralley (Contributor) commented Oct 17, 2022:

Add a little bit more about why simply renaming the subrepos isn't sufficient

@@ -0,0 +1 @@
Addressed concurrent-sync subrepo collisions.
Member:

concurrent sync (no dash)

repo_sync_results[directory] = repo_version
except IntegrityError:
log.info(
_("Collision creating a new version for repository {} - skipping.").format(
Member:

I would at least mention this is a sub-repo for clarity.

repo.last_sync_details = repo_config["sync_details"]
repo.save()

repo_sync_results[directory] = repo_version
Member:

We're not going to reach this line, because the error will happen earlier. Is repo_sync_results going to be used anywhere further down? Have you tried to sync this with the mirror_complete policy?

Contributor:

This is a good point, we should probably try to put as little code in the try block as possible. Maybe it should be something like

try:
    repo_version = dv.create() or repo.latest_version()
except IntegrityError:
    log.info(_("Collision creating a new version for repository {} - skipping.").format(repo.name))
    repo_version = None

Contributor:

Actually I think it should be repo_version = repo.latest_version()?

Member:

Yes, I do prefer having as little as possible in the try block.
I think we should use the latest available version, because repo_sync_results at this point will be an empty dict, which will create problems here: https://github.com/pulp/pulp_rpm/blob/main/pulp_rpm/app/tasks/synchronizing.py#L582
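The shape the reviewers are converging on can be simulated in a self-contained sketch (FakeRepo, create_version, and sync_subrepo are stand-ins for the real pulp objects, dv.create(), and the sync code; IntegrityError stands in for django.db.IntegrityError):

```python
class IntegrityError(Exception):
    """Stand-in for django.db.IntegrityError."""


class FakeRepo:
    def __init__(self, name):
        self.name = name
        self.versions = [0]

    def latest_version(self):
        return self.versions[-1]


def create_version(repo, collide):
    # Stand-in for dv.create(): raises when a concurrent sync has already
    # created the colliding sub-repo version.
    if collide:
        raise IntegrityError()
    repo.versions.append(repo.versions[-1] + 1)
    return repo.versions[-1]


def sync_subrepo(repo, repo_sync_results, directory, collide=False):
    try:
        # Keep only the statement that can raise inside the try block.
        repo_version = create_version(repo, collide)
    except IntegrityError:
        print(f"Collision creating a new version for sub-repository {repo.name} - skipping.")
        repo_version = None
    # Populate repo_sync_results regardless of the collision, falling back
    # to the latest available version so downstream lookups never miss.
    repo_sync_results[directory] = (
        repo_version if repo_version is not None else repo.latest_version()
    )


results = {}
repo = FakeRepo("Server-abc123")
sync_subrepo(repo, results, "Server")                 # winner: creates version 1
sync_subrepo(repo, results, "Server", collide=True)   # loser: falls back to latest
assert results["Server"] == 1
```

As the thread goes on to show, even this fallback is not sufficient in the mirror_complete case, because the loser's publication can be built before the winner has finished the subrepo.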

@ggainey (author):

Ah well damn - def glad I asked for your review. No, I hadn't tried mirror_complete, and no, it doesn't work[0] with this proposed path. I've done a bunch of experiments tonight, and I don't think this approach is salvageable in the face of mirrored-metadata and kstrees. I'm going to close this PR and go back to the drawing board. Great catch @ipanova !

[0] As in, the sub-repo's metadata is broken in the face of concurrent syncs.

Member:

Yes, you just need to make sure the repo_sync_results dict gets populated regardless of whether the IntegrityError happens.

@ggainey (author):

Alas no. In the mirror_complete case, with repo_sync_results[directory] = repo.latest_version() added to the except-block, the sync "works", but the resulting repos have partial publications/distributions. In my test-run, I created 5 copies of RHEL6 KS repos, and with the change here looking at them from the content-app shows the following:

http :/pulp/content/r6-10-ks-1/
                    <li><a href="Packages/">Packages/</a></li>
                    <li><a href="ScalableFileSystem/">ScalableFileSystem/</a></li>
                    <li><a href="Server/">Server/</a></li>
                    <li><a href="config.repo">config.repo</a></li>
                    <li><a href="images/">images/</a></li>
                    <li><a href="repodata/">repodata/</a></li>
                    <li><a href="treeinfo">treeinfo</a></li>

http :/pulp/content/r6-10-ks-2/
                    <li><a href="HighAvailability/">HighAvailability/</a></li>
                    <li><a href="Packages/">Packages/</a></li>
                    <li><a href="ResilientStorage/">ResilientStorage/</a></li>
                    <li><a href="ScalableFileSystem/">ScalableFileSystem/</a></li>
                    <li><a href="config.repo">config.repo</a></li>
                    <li><a href="images/">images/</a></li>
                    <li><a href="repodata/">repodata/</a></li>
                    <li><a href="treeinfo">treeinfo</a></li>

http :/pulp/content/r6-10-ks-3/
                    <li><a href="LoadBalancer/">LoadBalancer/</a></li>
                    <li><a href="Packages/">Packages/</a></li>
                    <li><a href="Server/">Server/</a></li>
                    <li><a href="config.repo">config.repo</a></li>
                    <li><a href="images/">images/</a></li>
                    <li><a href="repodata/">repodata/</a></li>
                    <li><a href="treeinfo">treeinfo</a></li>

http :/pulp/content/r6-10-ks-4/
                    <li><a href="LoadBalancer/">LoadBalancer/</a></li>
                    <li><a href="Packages/">Packages/</a></li>
                    <li><a href="ResilientStorage/">ResilientStorage/</a></li>
                    <li><a href="Server/">Server/</a></li>
                    <li><a href="config.repo">config.repo</a></li>
                    <li><a href="images/">images/</a></li>
                    <li><a href="repodata/">repodata/</a></li>
                    <li><a href="treeinfo">treeinfo</a></li>

http :/pulp/content/r6-10-ks-5/
                    <li><a href="Packages/">Packages/</a></li>
                    <li><a href="ResilientStorage/">ResilientStorage/</a></li>
                    <li><a href="Server/">Server/</a></li>
                    <li><a href="config.repo">config.repo</a></li>
                    <li><a href="images/">images/</a></li>
                    <li><a href="repodata/">repodata/</a></li>
                    <li><a href="treeinfo">treeinfo</a></li>

I didn't take the time to understand exactly why, because I felt this was enough to show that the assumption driving this approach is Not Valid.

@ggainey (author):

Update: The resulting PublishedArtifacts list for each overlapping repo is different (and they're all wrong in different ways) when we try to skip the DV step in the mirror-complete case.

Contributor:

I wonder if that essentially comes down to async swallowing exceptions, as it sometimes likes to do.

@ggainey (author):

No, alas - the problem with skipping subrepos when "someone else is handling them" is, if we finish "our" primary repo before whoever first got to subrepo-1 is done syncing it, we then end up creating the mirror publication/PublishedArtifacts before "they" have finalized the subrepo version - so "our" Publication is missing things, because they weren't published when we completed. (Note: it just occurred to me that specifying autopublish may have the exact same problem, I shall test...)

Basically, the problem is happening because we've invented an entire task-loop for subrepo syncing inside the "sync this kickstart repo" worker, and it needs things to happen in a very specific order, and we don't know that we need to lock on shared-subrepos until we're already in progress (so the tasking system locking can't help us). And when you have a deduplicated DistributionTree that is shared among independent workers all trying to do the same thing at the same time, order goes out the window. In the non-mirror case, it all ends up ok, because the publication process happens outside/after the sync, so by the time the Publication is being created the DistributionTree and all its subrepos are in place and everything is fine.

There are several ways I can think of to fix this "correctly", all of which are complicated, involve significant architectural changes, require complicated data migrations, or all three at once, which won't fix the problem people are having right now. (Example: 1) sync RPM repo foo. 2) foo notices it has subrepos. 3) foo creates subrepos and dispatches a task to sync them. 4) foo completes the main sync. 5) Because foo knows it has subrepos, and which ones they are, foo dispatches a task to clean up its publication-status, locked on itself and its now-known subrepos. Now serialization/locking works because the tasking system handles it for us.)

See @goosemania's excellent explanation for the gory details: #2304 (comment). This concurrency problem is just one more layer on top of that issue.

@hao-yu's suggestion of creating per-repo-named subrepos addresses the immediate problem: the subrepos never collide with each other, the DistributionTree ends up pointing to "whichever ones finished last", and the publication works. As noted by Tanya, the problem is that at the end of the cycle you have a DistributionTree that knows about one subrepo per Addon, but one copy of that subrepo per repository that shared it, and N-1 of those subrepos are abandoned and "unfindable" except by direct django/db queries. This means that orphan-cleanup can never "find" the artifacts from those subrepos - even cleaning up the DistributionTree that caused them to be synced will only cascade through to delete the last-one-in repo copies that that DT's Addons know about.
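The per-repo naming workaround and its cost can be sketched like this (the function and repository names are illustrative, not the actual pulp_rpm change):

```python
# Hypothetical per-repo naming: prefixing the subrepo name with the owning
# repository's name makes collisions between concurrent syncs impossible.
def subrepo_name(owning_repo, repodata, treeinfo_hash):
    return f"{owning_repo}-{repodata}-{treeinfo_hash}"

# Two repositories syncing the identical .treeinfo now get distinct subrepos...
a = subrepo_name("r6-10-ks-1", "Server", "abc123")
b = subrepo_name("r6-10-ks-2", "Server", "abc123")
assert a != b

# ...but the shared DistributionTree's Addons can only reference one of them,
# so the other becomes an "abandoned" subrepo that orphan-cleanup cannot find.
```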

The content and artifacts of the subrepos are de-duplicated, so the price we pay in general for this bandaid is a repo, 2 repo-versions (0 and 1), and repoversion-content entries for the subrepo's content, per "abandoned" subrepo. Forever. (Plus, of course, not being able to orphan-cleanup the subrepo content even after deleting the last use of the DistributionTree).

I would love to hear a "better idea", if anyone has one :)

@ggainey ggainey closed this Oct 18, 2022
@ggainey ggainey deleted the 2278_subrepo_collisions branch March 31, 2023 20:09