New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1858026: add nil check for infra.Status.PlatformStatus in isCloudConfigReqd #1935
Bug 1858026: add nil check for infra.Status.PlatformStatus in isCloudConfigReqd #1935
Conversation
@kikisdeliveryservice: This pull request references Bugzilla bug 1858026, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Replicated underlying bug with a missing struct field:
|
849ed80
to
781f979
Compare
Code LGTM, but I wonder if there are other bugs lurking. Doing a quick grep in the code I also see e.g.
which also doesn't seem to be accounting for a possibly (If this was Rust it'd be |
@kikisdeliveryservice: This pull request references Bugzilla bug 1858026, which is valid. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
…itch stmt If infra.Status.PlatformStatus is empty it will throw a panic as it is a pointer. Some baremetal platforms have an empty Platform status which must be accounted for. Also add basic isCloudConfigRequired unit test Closes: BZ 1858026
781f979
to
804796a
Compare
good point!!! ill check those out |
e6d264d
to
52f0c96
Compare
Ok updated the remaining ones that didnt' have a nil check. |
52f0c96
to
3ea55db
Compare
/cherry-pick 4.5 |
@kikisdeliveryservice: once the present PR merges, I will cherry-pick it on top of 4.5 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@kikisdeliveryservice: This pull request references Bugzilla bug 1858026, which is valid. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
1 similar comment
@kikisdeliveryservice: This pull request references Bugzilla bug 1858026, which is valid. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test e2e-gcp-upgrade |
/retest |
@kikisdeliveryservice: This pull request references Bugzilla bug 1858026, which is valid. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cgwalters, kikisdeliveryservice The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Please review the full test history for this PR and help us cut down flakes. |
7 similar comments
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
@kikisdeliveryservice: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1935. Bugzilla bug 1858026 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@kikisdeliveryservice: cannot checkout 4.5: error checking out 4.5: exit status 1. output: error: pathspec '4.5' did not match any file(s) known to git In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
return true | ||
default: | ||
|
||
if infra.Status.PlatformStatus != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job narrowing this down! I just have a nit and a follow up to make this whole situation better in the future and fix one last potential panic we still have
nit: instead of checking != nil
you can remove a nest with == nil { return ... }
..
Also, there's another place missed (which I myself missed originally) and I think instead of doing this check everywhere which anybody can miss and/or forget, maybe we need to make the thing better, the other check is here, https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L322
the change in lib/resourcemerge/machineconfig.go
LGTM but following the trails for the panic indicates that our code isn't that robust to changes and we don't have units for all the functions that access the field being potentially nil (and we're human and miss things in reviews, it happens).
Then we can instead notice https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/render.go#L80-L91 within createDiscoveredControllerConfigSpec
which is the first thing we do after grabbing the infra object. In those lines we make sure that if the PlatformStatus is nil, we "adjust" it for compatibility reasons and then we store the infra object in the ControllerConfig object which makes sure we can safely check that field as we made sure it's always non-nil (there's another bug there tho, unrelated, we're modifying an object from the indexers which is a no-no).
Anyway, it seems that after grabbing the initial infra object (https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L233), we do run it through a compatibility fix to make sure PlatformStatus is never nil from there (https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/render.go#L80-L91), perhaps a forward-looking fix for this BZ would be to:
-
grab the infra object (we do this already)
-
run it through
createDiscoveredControllerConfigSpec
(we do this already) -
pass the resulting
ControllerConfig.Spec.Infra...
object in funcs like this andgetIgnitionHost
, instead of theinfra
global one, as we're sure that the infra in the CC already went through this very same nil check (we need to do this)
With the above we:
- centralize the nil check (also there's a comment telling us why it could be nil, the key thing for this BZ seems to be an upgrade all the way from 4.1 which didn't have PlatformStatus)
- adjust it so it's never nil in our functions
The above allows us to avoid adding specific nil checks for the same thing over and over and help us tomorrow when the next PR adds something that accesses PlatformStatus w/o checking it
I'd say now that this is merged, let's backport the fix to the very next 4.5 z-release as it fixes the panic and land the above in master to make it more robust to future changes? (as well as fixing the latest potential panic in master)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Me imagining what Antonio means when he says on Twitter he's going to take some time to do some writing: Antonio sitting at his computer, sipping an espresso and writing a gripping Italian crime novel 📖
The actual result: Antonio sitting at his computer, sipping an espresso and writing longer pull request review comments 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cgwalters 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 it was fun to imagine myself wiring a crime novel in from of my laptop sipping an expresso tho! thanks for that Colin! 😂
/cherry-pick release-4.5 |
@kikisdeliveryservice: #1935 failed to apply on top of branch "release-4.5":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cherry-pick release-4.5 |
@kikisdeliveryservice: #1935 failed to apply on top of branch "release-4.5":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
infra.Status.PlatformStatus is *PlatformStatus and in some <4.5.x setups this entire thing is empty and only platform is set to none. So when we hit the switch statement in the MCO we panic bc pointers, so check that it's really there before you do the switch stmt case comparisions. Before my fix, the unit test I added failed with same panic, now it passes and func returns false. This behavior was seen in bm clusters updating from 4.4.12->4.5.2, which is also when Platform was deprecated in favor of PlatformStatus and the MCO was missing the check. The checks do exist in the MCC transitioning them to the new type.
As for why we saw this with users but not in CI, AFAIK there isn't an e2e metal upgrade job accounting for this anywhere and I believe the existing metal job would just install a 4.5.x cluster with the new PlatformStatus.