Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-19537: OCB should fail if node is not coreos based #4442

Merged
merged 1 commit into from
Jul 23, 2024

Conversation

umohnani8
Copy link
Contributor

If a non-coreos based node, such as RHEL8, is added to a cluster and the user tries to enable on cluster builds, throw an error as that is not supported on non-coreos based nodes.

Fixes https://issues.redhat.com/browse/OCPBUGS-19537

- What I did
Added a check for when OCB is enabled to verify which OS the nodes have and to fail if they are non-coreos based.

- How to verify it
Start a cluster, add a RHEL node to it and then try enabling OCB

- Description for the changelog

Throw an error if cluster has non-coreos node when trying to enable OCB

@umohnani8
Copy link
Contributor Author

@cheesesashimi @inesqyx PTAL.

I am not completely sure if I am on the right track, this is where it seems we check whether OCL is enabled via the feature gate.

@cheesesashimi
Copy link
Member

cheesesashimi commented Jun 27, 2024

Here are some ideas to consider:

The osrelease wrapper works by looking at the /etc/os-release file that is present on the filesystem. In the MCD, this is fine because the MCD chroots into the nodes' filesystem and executes within that context. However, running it within the operator would read the /etc/os-release file that's present within the MCO's container image and not the one on the underlying host OS.

Each corev1.Node object has some information in its status field that we can use to figure out what the underlying node OS is. We could potentially extend our osrelease wrapper to infer all of this information (or at least as much as possible) from the corev1.Node objects in a given MachineConfigPool. I'm not sure what this looks like for a RHEL node, but here's what it looks like for an RHCOS node (we'd also need to consider OKD FCOS and OKD SCOS as well):

$ oc get node ip-10-0-27-18.ec2.internal -o yaml | yq '.status.nodeInfo'
architecture: amd64
bootID: 63e74981-e918-46fc-b9c4-99374a5a620e
containerRuntimeVersion: cri-o://1.29.4-6.rhaos4.16.git0e93ae2.el9
kernelVersion: 5.14.0-427.20.1.el9_4.x86_64
kubeProxyVersion: v1.30.1+9798e19
kubeletVersion: v1.30.1+9798e19
machineID: ec28d02b7ad61b67b62f3681741924b9
operatingSystem: linux
osImage: Red Hat Enterprise Linux CoreOS 417.94.202406050030-0
systemUUID: ec28d02b-7ad6-1b67-b62f-3681741924b9

Then, within the MCO, whenever we get a new MachineOSConfig, we can get a list of all of the nodes for the targeted MachineConfigPool, pass them into our modified osrelease wrapper which examines that field on each node, and degrade if we find a non-CoreOS node.

@umohnani8 umohnani8 force-pushed the ocb-yum branch 2 times, most recently from 73a925b to 1ac72d7 Compare June 27, 2024 19:42
@umohnani8
Copy link
Contributor Author

Thanks @cheesesashimi, I have updated the PR based on your review to grab the osImage info from corev1.Node an do a string comparison to check if it is a coreos based OS or not. PTAL

Copy link
Member

@cheesesashimi cheesesashimi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off to a great start so far! The only improvement I'm specifically requesting is the addition of the node name and the name(s) of the MachineConfigPool(s) it's a member of to the error message. My other ideas are purely at your discretion 😄

pkg/operator/sync.go Outdated Show resolved Hide resolved
pkg/operator/sync.go Outdated Show resolved Hide resolved
@umohnani8
Copy link
Contributor Author

@sergiordlr can you please help verify we error out when adding a non-coreos based node.

@@ -164,3 +179,16 @@ func ListPools(node *corev1.Node, mcpLister v1.MachineConfigPoolLister) (*mcfgv1

return master, worker, custom, nil
}

// IsCoreOSNode checks whether the pretty name of a node matches any of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought (non-blocking): I was originally going to suggest for this to be moved to pkg/daemon/osrelease before I noticed the IsWindows() function above it, which made me realize that this is a fine home for this function. However, seeing the IsWindows() function brings up a question about what we should do about OCL for Windows nodes since they can't be updated that way. At the very least, we may want to do a small spike to better understand that relationship.

(To be clear: I'm not asking you to make any changes; I'm just putting my thoughts here for posterity 😄)

@@ -39,11 +34,11 @@ func TestOSDetection(t *testing.T) {
assert.False(t, nodeOSRelease.OS.IsLikeTraditionalRHEL7(), "expected IsLikeTraditionalRHEL7() to be false: %s", nodeOSRelease.EtcContent)

switch {
case strings.Contains(nodeOSRelease.EtcContent, rhcos):
case strings.Contains(nodeOSRelease.EtcContent, osrelease.RHCOS):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: Nice work here! Thanks for making this change.

pkg/operator/sync.go Outdated Show resolved Hide resolved
@umohnani8 umohnani8 changed the title [BUG-19537] OCB should fail if node is not coreos based OCPBUGS-19537: OCB should fail if node is not coreos based Jul 2, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Jul 2, 2024
@openshift-ci-robot
Copy link
Contributor

@umohnani8: This pull request references Jira Issue OCPBUGS-19537, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

If a non-coreos based node, such as RHEL8, is added to a cluster and the user tries to enable on cluster builds, throw an error as that is not supported on non-coreos based nodes.

Fixes https://issues.redhat.com/browse/OCPBUGS-19537

- What I did
Added a check for when OCB is enabled to verify which OS the nodes have and to fail if they are non-coreos based.

- How to verify it
Start a cluster, add a RHEL node to it and then try enabling OCB

- Description for the changelog

Throw an error if cluster has non-coreos node when trying to enable OCB

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jul 2, 2024
@umohnani8
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jul 2, 2024
@openshift-ci-robot
Copy link
Contributor

@umohnani8: This pull request references Jira Issue OCPBUGS-19537, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from sergiordlr July 2, 2024 17:38
If a non-coreos based node, such as RHEL8, is added to
a cluster and the user tries to enable on cluster builds,
throw an error as that is not supported on non-coreos based
nodes.

Signed-off-by: Urvashi Mohnani <umohnani@redhat.com>
@umohnani8
Copy link
Contributor Author

/retest

@umohnani8
Copy link
Contributor Author

This is ready, @sergiordlr PTAL

@ptalgulk01
Copy link

ptalgulk01 commented Jul 15, 2024

Pre-merge verified -

$ oc get clusterversions.config.openshift.io 
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.ci.test-2024-07-15-054707-ci-ln-6p4dlvt-latest   True        False         9m11s   Cluster version is 4.17.0-0.ci.test-2024-07-15-054707-ci-ln-6p4dlvt-latest

Added RHEL nodes using ocp4-scale-runner flexy job

$ oc get node
NAME                                    STATUS   ROLES                  AGE     VERSION
ptalgulk-1507-tc-6xndc-master-0         Ready    control-plane,master   47m     v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-master-1         Ready    control-plane,master   48m     v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-master-2         Ready    control-plane,master   48m     v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-w-a-l-rhel-0     Ready    worker                 6m23s   v1.30.2
ptalgulk-1507-tc-6xndc-w-a-l-rhel-1     Ready    worker                 6m17s   v1.30.2
ptalgulk-1507-tc-6xndc-worker-a-8k6gq   Ready    worker                 34m     v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-worker-b-tbpcw   Ready    worker                 36m     v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-worker-c-h8mdb   Ready    worker                 35m     v1.30.1+9798e19

Enabled and applied OCB on custom MCP rhel node

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
 name: check
spec:
 buildInputs:
   baseImagePullSecret:
     name: global-pull-secret-copy
   containerFile:
   - containerfileArch: noarch
     content: RUN echo 'test' > /etc/test-image.txt
   imageBuilder:
     imageBuilderType: PodImageBuilder
   renderedImagePushSecret:
     name: $(oc get -n openshift-machine-config-operator sa builder -o jsonpath='{.secrets[0].name}')
   renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
 buildOutputs:
   currentImagePullSecret:
     name: global-pull-secret-copy
 machineConfigPool:
   name: check
EOF

We are able to see error logs message in operator instead of RHEL node logs and MCP is not in degraded state

$  oc get co machine-config     
NAME             VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config         4.17.0-0.ci.test-2024-07-15-054707-ci-ln-6p4dlvt-latest   True        False         True       79m     Failed to resync 4.17.0-0.ci.test-2024-07-15-054707-ci-ln-6p4dlvt-latest because: on-cluster layering is only supported on CoreOS-based nodes: non-CoreOS OS "Red Hat Enterprise Linux 8.9 (Ootpa)" detected on node "ptalgulk-1507-tc-6xndc-w-a-l-rhel-0" in MachineConfigPool(s) ["check" "worker"]

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
check    rendered-check-679dc8cb31540add3d08dca713fcc6e6    True      False      False      1              1                   1                     0                      8m7s
master   rendered-master-f77f9da088d8b1313cc504456aa675fd   True      False      False      3              3                   3                     0                      81m
worker   rendered-worker-679dc8cb31540add3d08dca713fcc6e6   True      False      False      4              4                   4                     0                      81m

adding qe-approved label

/label qe-approved

Thankyou!

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Jul 15, 2024
@openshift-ci-robot
Copy link
Contributor

@umohnani8: This pull request references Jira Issue OCPBUGS-19537, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (ptalgulk@redhat.com), skipping review request.

In response to this:

If a non-coreos based node, such as RHEL8, is added to a cluster and the user tries to enable on cluster builds, throw an error as that is not supported on non-coreos based nodes.

Fixes https://issues.redhat.com/browse/OCPBUGS-19537

- What I did
Added a check for when OCB is enabled to verify which OS the nodes have and to fail if they are non-coreos based.

- How to verify it
Start a cluster, add a RHEL node to it and then try enabling OCB

- Description for the changelog

Throw an error if cluster has non-coreos node when trying to enable OCB

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@umohnani8
Copy link
Contributor Author

@cheesesashimi PTAL at the test results above. This is ready to be merged!

@cheesesashimi
Copy link
Member

Nice work!

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 16, 2024
Copy link
Contributor

openshift-ci bot commented Jul 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, umohnani8

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 16, 2024
@umohnani8
Copy link
Contributor Author

@yuqi-zhang @cheesesashimi any idea how to get the label needed to merge this - it wants the acknowledge-critical-fixes-only label.

@yuqi-zhang
Copy link
Contributor

We can apply this label if we want this bug to merge this week, anyone should have the ability. The label was added as a requirement, since we wanted to be more stable for shift week.

The requirement should drop off by next week and this can merge normally as well.

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 5fbf0e5 and 2 for PR HEAD 21f91d0 in total

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD fe7da36 and 1 for PR HEAD 21f91d0 in total

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD a2f5c74 and 0 for PR HEAD 21f91d0 in total

Copy link
Contributor

openshift-ci bot commented Jul 23, 2024

@umohnani8: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 0528946 into openshift:master Jul 23, 2024
17 checks passed
@openshift-ci-robot
Copy link
Contributor

@umohnani8: Jira Issue OCPBUGS-19537: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-19537 has been moved to the MODIFIED state.

In response to this:

If a non-coreos based node, such as RHEL8, is added to a cluster and the user tries to enable on cluster builds, throw an error as that is not supported on non-coreos based nodes.

Fixes https://issues.redhat.com/browse/OCPBUGS-19537

- What I did
Added a check for when OCB is enabled to verify which OS the nodes have and to fail if they are non-coreos based.

- How to verify it
Start a cluster, add a RHEL node to it and then try enabling OCB

- Description for the changelog

Throw an error if cluster has non-coreos node when trying to enable OCB

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.17.0-202407231314.p0.g0528946.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants