Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1858026: add nil check for infra.Status.PlatformStatus in isCloudConfigReqd #1935

Conversation

kikisdeliveryservice
Copy link
Contributor

@kikisdeliveryservice kikisdeliveryservice commented Jul 17, 2020

infra.Status.PlatformStatus is *PlatformStatus and in some <4.5.x setups this entire thing is empty and only platform is set to none. So when we hit the switch statement in the MCO we panic bc pointers, so check that it's really there before you do the switch stmt case comparisions. Before my fix, the unit test I added failed with same panic, now it passes and func returns false. This behavior was seen in bm clusters updating from 4.4.12->4.5.2, which is also when Platform was deprecated in favor of PlatformStatus and the MCO was missing the check. The checks do exist in the MCC transitioning them to the new type.

As for why we saw this with users but not in CI, AFAIK there isn't an e2e metal upgrade job accounting for this anywhere and I believe the existing metal job would just install a 4.5.x cluster with the new PlatformStatus.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 17, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 17, 2020
@kikisdeliveryservice kikisdeliveryservice changed the title [WIP] add isCloudConfigRequired unit test [WIP] Bug 1858026: add isCloudConfigRequired unit test Jul 17, 2020
@openshift-ci-robot
Copy link
Contributor

@kikisdeliveryservice: This pull request references Bugzilla bug 1858026, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

[WIP] Bug 1858026: add isCloudConfigRequired unit test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jul 17, 2020
@kikisdeliveryservice
Copy link
Contributor Author

Replicated underlying bug with a missing struct field:

 --- FAIL: TestIsCloudConfigRequired (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x14e7959] 

@kikisdeliveryservice kikisdeliveryservice requested review from cgwalters and removed request for ashcrow July 17, 2020 23:38
@cgwalters
Copy link
Member

Code LGTM, but I wonder if there are other bugs lurking. Doing a quick grep in the code I also see e.g.

if !equality.Semantic.DeepEqual(existing.Infra.Status.PlatformStatus.Type, required.Infra.Status.PlatformStatus.Type) {

which also doesn't seem to be accounting for a possibly nil status.

(If this was Rust it'd be Option<PlatformStatus> and the compiler would warn error on this...billion dollar mistake etc.)

@openshift-ci-robot
Copy link
Contributor

@kikisdeliveryservice: This pull request references Bugzilla bug 1858026, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

[WIP] Bug 1858026: add isCloudConfigRequired unit test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kikisdeliveryservice kikisdeliveryservice changed the title [WIP] Bug 1858026: add isCloudConfigRequired unit test [WIP] Bug 1858026: add nil check for infra.Status.PlatformStatus in isCloudConfigReqd Jul 17, 2020
…itch stmt

If infra.Status.PlatformStatus is empty it will throw a panic as it is a pointer.
Some baremetal platforms have an empty Platform status which must be accounted for.

Also add basic isCloudConfigRequired unit test

Closes: BZ 1858026
@kikisdeliveryservice kikisdeliveryservice changed the title [WIP] Bug 1858026: add nil check for infra.Status.PlatformStatus in isCloudConfigReqd Bug 1858026: add nil check for infra.Status.PlatformStatus in isCloudConfigReqd Jul 17, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 17, 2020
@kikisdeliveryservice
Copy link
Contributor Author

Code LGTM, but I wonder if there are other bugs lurking. Doing a quick grep in the code I also see e.g.

if !equality.Semantic.DeepEqual(existing.Infra.Status.PlatformStatus.Type, required.Infra.Status.PlatformStatus.Type) {

which also doesn't seem to be accounting for a possibly nil status.

(If this was Rust it'd be Option<PlatformStatus> and the compiler would warn error on this...billion dollar mistake etc.)

good point!!! ill check those out

@kikisdeliveryservice
Copy link
Contributor Author

Ok updated the remaining ones that didnt' have a nil check.

@kikisdeliveryservice
Copy link
Contributor Author

/cherry-pick 4.5

@openshift-cherrypick-robot

@kikisdeliveryservice: once the present PR merges, I will cherry-pick it on top of 4.5 in a new PR and assign it to you.

In response to this:

/cherry-pick 4.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

@kikisdeliveryservice: This pull request references Bugzilla bug 1858026, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1858026: add nil check for infra.Status.PlatformStatus in isCloudConfigReqd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment
@openshift-ci-robot
Copy link
Contributor

@kikisdeliveryservice: This pull request references Bugzilla bug 1858026, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1858026: add nil check for infra.Status.PlatformStatus in isCloudConfigReqd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kikisdeliveryservice
Copy link
Contributor Author

/test e2e-gcp-upgrade

@kikisdeliveryservice
Copy link
Contributor Author

/retest

@openshift-ci-robot openshift-ci-robot removed the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Jul 18, 2020
@openshift-ci-robot
Copy link
Contributor

@kikisdeliveryservice: This pull request references Bugzilla bug 1858026, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1858026: add nil check for infra.Status.PlatformStatus in isCloudConfigReqd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Jul 18, 2020
@cgwalters
Copy link
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 19, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, kikisdeliveryservice

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,kikisdeliveryservice]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

7 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 081315e into openshift:master Jul 20, 2020
@openshift-ci-robot
Copy link
Contributor

@kikisdeliveryservice: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1935. Bugzilla bug 1858026 has been moved to the MODIFIED state.

In response to this:

Bug 1858026: add nil check for infra.Status.PlatformStatus in isCloudConfigReqd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@kikisdeliveryservice: cannot checkout 4.5: error checking out 4.5: exit status 1. output: error: pathspec '4.5' did not match any file(s) known to git

In response to this:

/cherry-pick 4.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

return true
default:

if infra.Status.PlatformStatus != nil {
Copy link
Member

@runcom runcom Jul 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job narrowing this down! I just have a nit and a follow up to make this whole situation better in the future and fix one last potential panic we still have

nit: instead of checking != nil you can remove a nest with == nil { return ... }..

Also, there's another place missed (which I myself missed originally) and I think instead of doing this check everywhere which anybody can miss and/or forget, maybe we need to make the thing better, the other check is here, https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L322

the change in lib/resourcemerge/machineconfig.go LGTM but following the trails for the panic indicates that our code isn't that robust to changes and we don't have units for all the functions that access the field being potentially nil (and we're human and miss things in reviews, it happens).

Then we can instead notice https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/render.go#L80-L91 within createDiscoveredControllerConfigSpec which is the first thing we do after grabbing the infra object. In those lines we make sure that if the PlatformStatus is nil, we "adjust" it for compatibility reasons and then we store the infra object in the ControllerConfig object which makes sure we can safely check that field as we made sure it's always non-nil (there's another bug there tho, unrelated, we're modifying an object from the indexers which is a no-no).
Anyway, it seems that after grabbing the initial infra object (https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L233), we do run it through a compatibility fix to make sure PlatformStatus is never nil from there (https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/render.go#L80-L91), perhaps a forward-looking fix for this BZ would be to:

  • grab the infra object (we do this already)

  • run it through createDiscoveredControllerConfigSpec (we do this already)

  • pass the resulting ControllerConfig.Spec.Infra... object in funcs like this and getIgnitionHost, instead of the infra global one, as we're sure that the infra in the CC already went through this very same nil check (we need to do this)

With the above we:

  • centralize the nil check (also there's a comment telling us why it could be nil, the key thing for this BZ seems to be an upgrade all the way from 4.1 which didn't have PlatformStatus)
  • adjust it so it's never nil in our functions

The above allows us to avoid adding specific nil checks for the same thing over and over and help us tomorrow when the next PR adds something that accesses PlatformStatus w/o checking it

I'd say now that this is merged, let's backport the fix to the very next 4.5 z-release as it fixes the panic and land the above in master to make it more robust to future changes? (as well as fixing the latest potential panic in master)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me imagining what Antonio means when he says on Twitter he's going to take some time to do some writing: Antonio sitting at his computer, sipping an espresso and writing a gripping Italian crime novel 📖

The actual result: Antonio sitting at his computer, sipping an espresso and writing longer pull request review comments 😉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cgwalters 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 it was fun to imagine myself wiring a crime novel in from of my laptop sipping an expresso tho! thanks for that Colin! 😂

@kikisdeliveryservice
Copy link
Contributor Author

/cherry-pick release-4.5

@openshift-cherrypick-robot

@kikisdeliveryservice: #1935 failed to apply on top of branch "release-4.5":

In response to this:

/cherry-pick release-4.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kikisdeliveryservice
Copy link
Contributor Author

/cherry-pick release-4.5

@openshift-cherrypick-robot

@kikisdeliveryservice: #1935 failed to apply on top of branch "release-4.5":

In response to this:

/cherry-pick release-4.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants