Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create better primary keys for subtrees #2304

Closed
fao89 opened this issue Dec 22, 2021 · 8 comments · Fixed by #2851
Closed

Create better primary keys for subtrees #2304

fao89 opened this issue Dec 22, 2021 · 8 comments · Fixed by #2851

Comments

@fao89
Copy link
Member

fao89 commented Dec 22, 2021

Author: holger.hees (holger.hees)
Date: 2021-11-12

Redmine Issue: 9566, https://pulp.plan.io/issues/9566


Scenario:

My primary pulp instance is hosting 2 distributions of the same repository (like staging and production) which are referencing different versions of the same repository. During my initial run, both distributions are point to version 1. So far so good.

Now I have a secondary pulp instance which is mirroring the 2 primary distributions by creating separate remotes and repositories.

The repository on the primary node contains now a subtree which is identically in version 1 for staging and production. Means it has the same hash.

Now, during the sync process the metadata for this subtree are stored by createing a primary key like "{repodata}-{treeinfo['hash']}". This collides with staging and production, because contentwise and with the hash, the subtree is the same for both staging and production. The key should be something like "{repodata}-{treeinfo['hash']}-{repository_pk}"

#2173

@fao89
Copy link
Member Author

fao89 commented Dec 22, 2021

From: pulpbot (pulpbot)
Date: 2021-11-14T10:38:31Z


PR: #2176

@fao89
Copy link
Member Author

fao89 commented Dec 22, 2021

From: pulpbot (pulpbot)
Date: 2021-11-16T14:57:05Z


PR: #2180

@fao89
Copy link
Member Author

fao89 commented Dec 22, 2021

From: pulpbot (pulpbot)
Date: 2021-11-30T19:28:53Z


PR: #2185

@goosemania
Copy link
Member

goosemania commented Jan 22, 2022

TL;DR:
After a lot of experiments and investigation of this issue (and fixing other bugs a long the way), I came to a conclusion that we need to change a way we calculate digest for a distribution tree itself or keep track of subtree repo versions.
Changing the naming of subrepos only causes other problems and does not resolve the original problem fully (at least not in my experiments).

I'm using pulp_rpm 3.17.1.

Reproducer:

The illustration of the situation below. We do not track repo versions for subrepos, so always the latest version is used.


 ----------------                      ----------------
| repo version 1 |                    | repo version 2 |
 ----------------                      ----------------       
                   \                 /
                      -------------               
                     | dist tree 1 |                       
                      -------------                       
                           |                 
                      ----------                          
                     | sub repo |                       
                      ----------
                        | v1 |
                         ----                             
                        | v2 | 
                         ----                             

Problems with changing subrepos names only:

  • we still have only one distribution tree content unit, and the same addons/variants, but we are creating new subrepos. The new subrepos are created but they are not associated with the addon/variant object or distribution tree, they are orphaned. If we start associating them, we'll either orphan old subrepos, or increase the number of subrepos associate with a distribution tree. One way or the other, it's data incorrectness.

Potential solution:

  • During sync identify that a subrepo changed, that would mean that a new distribution tree content unit needs to be created and the old one needs to be removed from a repo for which sync is being run.
    • in the first stage before processing the main variant/repo, deal with subrepos: sync/check if they are the same
      • a subrepo name should contain a hash of its repomd.xml (and not a treeinfo hash)
    • if a new distribution tree needs to be created, its digest should be calculated based on its treeinfo digest + all digests from all variants/addons, new addon/variant objects need to be created as well.

@HolgerHees , for you information, since you brought this problem to our attention.
@dralley, @ggainey , @dkliban , @fao89 , any thoughts? any flaws in the suggested solution?

I also believe the fix won't be backportable (3.14/3.16 will need a separate solution). I suggest to encourage folks to upgrade to 3.17 and not even try to come up with a solution for old branches. It would be very error-prone.

@dralley
Copy link
Contributor

dralley commented Jan 22, 2022

I also believe the fix won't be backportable (3.14/3.16 will need a separate solution). I suggest to encourage folks to upgrade to 3.17 and not even try to come up with a solution for old branches. It would be very error-prone.

That's definitely fine for 3.16 since 3.17 is replacing it going forwards, I don't anticipate we will do any more 3.16.z releases.

For 3.14 it would be good to summarize the practical impact of this bug for any users who might be stuck on that release on a BZ. It sounds like it presents a data correctness problem for anyone using the standard test / prod promotion workflow - they might get a newer distribution tree on their slow branch than they were intending.

@fao89
Copy link
Member Author

fao89 commented Jan 24, 2022

Potential solution:

  • During sync identify that a subrepo changed, that would mean that a new distribution tree content unit needs to be created and the old one needs to be removed from a repo for which sync is being run.

    • in the first stage before processing the main variant/repo, deal with subrepos: sync/check if they are the same

      • a subrepo name should contain a hash of its repomd.xml (and not a treeinfo hash)
    • if a new distribution tree needs to be created, its digest should be calculated based on its treeinfo digest + all digests from all variants/addons, new addon/variant objects need to be created as well.

Good insight! Using repomd.xml hash sounds like the right way to go

@ggainey
Copy link
Contributor

ggainey commented Jan 26, 2022

I don't have useful feedback on the solution here - just want to record that pulp/pulpcore#2192 is related, and we'll need to update whatever we/I end up doing to fix that import/export problem, when we address this.

@dralley dralley removed Post labels Jan 28, 2022
@goosemania goosemania removed their assignment Jan 31, 2022
@dralley dralley removed their assignment Aug 25, 2022
ggainey added a commit to ggainey/pulp_rpm that referenced this issue Oct 31, 2022
DistributionTree digest and subrepo-names now both end with the
pulp-id of the "owning" Repository, making them unique to that
repo and therefore protected from concurrent-updates against
anything that is changing that Repository.

Addon/Variant/Image are transitively made unique by virtue of
having their DistributionTree be part of their unique-together.

Sub-repo **content** (e.g. Packages et al) are de-duplicated via
their existing uniqueness constraints.

The end result is a minor increase in Content objects (i.e.,
DistTrees/Addons/Images/Variants that used to have only one
instance are now one-per-containing-repo), and a small impact
on subrepo-syncing (since previously-unique subrepos will now
have a first-sync that would have been skipped). Content will
continue to only be sync'd once.

fixes pulp#2278.
fixes pulp#2775.
closes pulp#2304.
[nocoverage]
@ggainey
Copy link
Contributor

ggainey commented Oct 31, 2022

I'm addressing this under #2278 - closing this as a dup so we can bring all the discussion to one place.

@ggainey ggainey closed this as completed Oct 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants