Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1789581: Fix osImageURL upgrade race #1357

Merged
merged 2 commits into from Jan 11, 2020

Conversation

runcom
Copy link
Member

@runcom runcom commented Jan 8, 2020

This patches does two things:

  • makes sure we create the osimageurl configmap before the new MCO deployment manifest is applied
  • version the osimageurl configmap

NOTE

  • This patch requires a 4.2.z patch to version the osimageurl config map there as well (done)
  • it might be needed to version the ControllerConfig as well (???)

related bug https://bugzilla.redhat.com/show_bug.cgi?id=1786993

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 8, 2020
@runcom
Copy link
Member Author

runcom commented Jan 8, 2020

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 8, 2020
@runcom runcom changed the title Fox osImageURL upgrade race Fix osImageURL upgrade race Jan 8, 2020
@cgwalters
Copy link
Member

Sad we're not 🦊 ing it...

@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 8, 2020
@@ -686,6 +686,11 @@ func (optr *Operator) getOsImageURL(namespace string) (string, error) {
if err != nil {
return "", err
}
releaseVersion := cm.Data["releaseVersion"]
optrVersion, _ := optr.vStore.Get("operator")
if releaseVersion != optrVersion {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any valid cases where the configmap from a previous setup would either be unversioned or not match while transitioning to versioned?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous configmaps won't be versioned; but that's fine because Golang doesn't have Option<> so the releaseVersion will be the empty string, which won't match. Which is what we want - we'll ignore the previous unversioned configmap.

Copy link
Member Author

@runcom runcom Jan 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notice there would still be a possible race where the 4.2 MCO reads the new 4.3 osimageurl configmap and could generate a new rendered mc with a newer osimageurl, that's why I believe we need to patch 4.2 as well to take versioning into account

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notice there would still be a possible race where the 4.2 MCO reads the new 4.3 osimageurl configmap and could generate a new rendered mc with a newer osimageurl, that's why I believe we need to patch 4.2 as well to take versioning into account

I would rather we address that race condition directly rather than trying to patch 4.2 which requires us to enforce that 4.2.x (unpatched) upgrades through 4.2.x+1(patched) before going to 4.3. That's possible in theory but seems to be cumbersome in practice.

Some thoughts:

  1. Version the key in the osimageurl configmap (e.g. osImageURL-0.0.1-snapshot: ...) so that each MCO release has its own image reference. This would address this upgrade issue by changing the key name in 4.3 and protects us from any future upgrade race-conditions and failures. It would also allow you to do some weird cross-version testing, like if you wanted to have multiple MCOs referencing different osimages for some reason.
  2. Set the osimageurl value on the MCO deployment directly - it can be a flag passed in to the operator, the release infra will replace the image reference in the deployment manifest the same as it does in the configmap. If there are other reasons for using a configmap, the MCO deployment can be changed to reference a configmap mounted in just like all the other images. This would reduce any version issues to mismatches in the CC/MCC which we've dealt with before.
  3. Make MCO updates atomic by putting everything that always and only updates together together. The osimageurl, MCO, MCC controllers, and MC templates. Simplify controller architecture, merge "operator" and "controller" #878

@cgwalters
Copy link
Member

I haven't traced through the code and the logs from the race, but I completely believe you have and this looks highly likely to fix the problem.

I looked at previous variants of this like #1198 and the original was 835278f

Looking over the manifests we install, I think this is the last thing that is input to the MachineConfig that isn't versioned. The other CVO-installed manifests are pure kube level stuff like RBAC rules etc.

/approve
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2020
@ashcrow
Copy link
Member

ashcrow commented Jan 8, 2020

Prometheus was not happy.

/test ci/prow/e2e-aws

@ashcrow
Copy link
Member

ashcrow commented Jan 8, 2020

/test ci/prow/e2e-aws

@kikisdeliveryservice
Copy link
Contributor

/test e2e-aws

This patch also guards against unversioned osImageURL config maps in the
future.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
Signed-off-by: Antonio Murdaca <runcom@linux.com>
@runcom runcom changed the title Fix osImageURL upgrade race Bug 1789581: Fix osImageURL upgrade race Jan 9, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jan 9, 2020
@openshift-ci-robot
Copy link
Contributor

@runcom: This pull request references Bugzilla bug 1789581, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Bug 1789581: Fix osImageURL upgrade race

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@runcom runcom removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2020
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

26 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit a1e3e11 into openshift:master Jan 11, 2020
@openshift-ci-robot
Copy link
Contributor

@runcom: All pull requests linked via external trackers have merged. Bugzilla bug 1789581 has been moved to the MODIFIED state.

In response to this:

Bug 1789581: Fix osImageURL upgrade race

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@runcom runcom deleted the osimageurl-race branch January 11, 2020 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants