Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MCO-565: MCO-568: MCO-659: MCO-660 On-cluster build opt-in function, building machine-os-builder stub, RBAC and service acct inclusions #3763

Merged
merged 1 commit into from Jul 26, 2023

Conversation

dkhater-redhat
Copy link
Contributor

…ler with label

and rework Machine OS Builder startup logic

- What I did

- How to verify it

- Description for the changelog

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Jun 28, 2023

@dkhater-redhat: This pull request references MCO-565 which is a valid jira issue.

In response to this:

…ler with label

and rework Machine OS Builder startup logic

- What I did

- How to verify it

- Description for the changelog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 28, 2023
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2023
@cheesesashimi
Copy link
Member

I don't think the e2e test failures are related to the changes this PR introduces.

@dkhater-redhat
Copy link
Contributor Author

/test e2e-gcp-op

@dkhater-redhat dkhater-redhat changed the title MCO-565: On-cluster build opt-in function MCO-565: MCO-568: On-cluster build opt-in function and building machine-os-builder stub Jul 5, 2023
@dkhater-redhat dkhater-redhat changed the title MCO-565: MCO-568: On-cluster build opt-in function and building machine-os-builder stub MCO-565: MCO-568: MCO-659: MCO-660 On-cluster build opt-in function, building machine-os-builder stub, RBAC and service acct inclusions Jul 5, 2023
@dkhater-redhat dkhater-redhat force-pushed the mco-565b branch 2 times, most recently from deb6feb to 7eb7bba Compare July 10, 2023 14:46
@dkhater-redhat
Copy link
Contributor Author

/retest-required

Copy link
Contributor

@jkyros jkyros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Looking pretty good, just a couple of small things and whatever you want to do with that e2e test.

I did try this on a cluster, and rocked it back and forth with:

oc label mcp worker machineconfiguration.openshift.io/layering-enabled=
oc label mcp worker machineconfiguration.openshift.io/layering-enabled-

And it seemed to work great 🎉

EDIT: If you could update the commit name/description to match what this PR does (vs pausing NodeController) future generations will be extremely grateful 😄

manifests/machineosbuilder/controllerconfig.yaml Outdated Show resolved Hide resolved
pkg/operator/sync.go Show resolved Hide resolved
Comment on lines +10 to +17
resources: ["*"]
verbs: ["*"]
- apiGroups: [""]
resources: ["configmaps", "secrets"]
verbs: ["*"]
- apiGroups: ["config.openshift.io"]
resources: ["images", "clusterversions", "featuregates", "nodes", "nodes/status"]
verbs: ["*"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look like some of these were probably copy-paste from the machine-config-controller. It's probably fine for now, but eventually the least privilege/wildcard police will come for us.

Like, I'm sure the MOB doesn't need to create/delete nodes, etc -- ideally we'd only give it the verbs/resources it actually needed 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Once we've evolved on-cluster builds, we should audit these and get them down to a bare minimum.

pkg/operator/sync.go Outdated Show resolved Hide resolved
test/e2e/mob_test.go Outdated Show resolved Hide resolved
test/e2e/mob_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@jkyros jkyros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lookin' good. Not trying to be nitpicky, just making sure the test is telling us what we want it to 😄

After this we can poke QE and have them take a look.

test/e2e/mob_test.go Outdated Show resolved Hide resolved
test/e2e/mob_test.go Outdated Show resolved Hide resolved
test/e2e/mob_test.go Outdated Show resolved Hide resolved
@dkhater-redhat
Copy link
Contributor Author

/retest-required

@dkhater-redhat
Copy link
Contributor Author

/test okd-scos-e2e-aws-ovn
/test e2e-gcp-op
/test e2e-aws-ovn

@djoshy
Copy link
Contributor

djoshy commented Jul 14, 2023

Looks like gcp-op is failing on machine os builder test, the pod isn't coming up. Could be flakey, but this is like the first failure of the new test v/s all the other previous runs

@djoshy
Copy link
Contributor

djoshy commented Jul 14, 2023

/retest-required

@djoshy
Copy link
Contributor

djoshy commented Jul 14, 2023

Looking at the pod log of mos builder, it looks like its stuck in a deadlock, any ideas? @dkhater-redhat

2023-07-14T17:33:24.301512448Z Hello, World!
2023-07-14T17:33:24.301795305Z fatal error: all goroutines are asleep - deadlock!
2023-07-14T17:33:24.303713202Z 
2023-07-14T17:33:24.303788294Z goroutine 1 [select (no cases)]:
2023-07-14T17:33:24.304085801Z main.main()
2023-07-14T17:33:24.304085801Z 	/go/src/github.com/openshift/machine-config-operator/cmd/machine-os-builder/main.go:26 +0x57

...and I'm sorry for the 3 separate comments :P

@dkhater-redhat
Copy link
Contributor Author

@djoshy thank you for pointing that out! i removed something in the main.go that i believe was causing the deadlock. hoping it works!

@jkyros
Copy link
Contributor

jkyros commented Jul 14, 2023

Without that select{} it's probably also going to run and exit immediately (albeit with zero status),

I assume that select was put in there to make sure it was long-running. (The deadlock detector is right "all goroutines are asleep" because there is only one goroutine and it's asleep.)

A sleep might work, I don't think it counts as a deadlock:

        <-time.After(876000*time.Hour) 

At first I was just like "we could just start a webserver!":

func main() {
        fmt.Println("Hello, World!")
        http.ListenAndServe("1234",nil)
}

but then I was like "hmm, I don't know how the FIPS scanner works, do I want to risk triggering it by shipping a dummy webserver right now?" 😄

@dkhater-redhat
Copy link
Contributor Author

/retest-required

@jkyros
Copy link
Contributor

jkyros commented Jul 17, 2023

/test verify

@jkyros
Copy link
Contributor

jkyros commented Jul 17, 2023

/retest-required

2 similar comments
@dkhater-redhat
Copy link
Contributor Author

/retest-required

@dkhater-redhat
Copy link
Contributor Author

/retest-required

@dkhater-redhat
Copy link
Contributor Author

/test e2e-gcp-op
/test e2e-hypershift

2 similar comments
@dkhater-redhat
Copy link
Contributor Author

/test e2e-gcp-op
/test e2e-hypershift

@dkhater-redhat
Copy link
Contributor Author

/test e2e-gcp-op
/test e2e-hypershift

@djoshy
Copy link
Contributor

djoshy commented Jul 19, 2023

/test e2e-gcp-op
/test e2e-hypershift

@dkhater-redhat
Copy link
Contributor Author

/test e2e-gcp-op
/test e2e-hypershift

@dkhater-redhat
Copy link
Contributor Author

/test e2e-gcp-op

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 20, 2023
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 20, 2023
@dkhater-redhat
Copy link
Contributor Author

/retest-required

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 21, 2023

@dkhater-redhat: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-images 777d777 link false /test okd-images
ci/prow/okd-scos-e2e-aws-ovn 777d777 link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dkhater-redhat
Copy link
Contributor Author

/retest-required

@dkhater-redhat
Copy link
Contributor Author

@sergiordlr @rioliu-rh ready for QE :-)

@sergiordlr
Copy link

No initial builder pod deployed

oc get pods -n openshift-machine-config-operator 
NAME                                         READY   STATUS    RESTARTS       AGE
machine-config-controller-6bb6fd8468-bkcbs   2/2     Running   2 (121m ago)   136m
machine-config-daemon-4qrnl                  2/2     Running   0              137m
machine-config-daemon-8dlvq                  2/2     Running   0              129m
machine-config-daemon-cg7hg                  2/2     Running   0              137m
machine-config-daemon-k2xqc                  2/2     Running   0              129m
machine-config-daemon-r8hcw                  2/2     Running   0              131m
machine-config-daemon-x2ngr                  2/2     Running   0              137m
machine-config-operator-54b7d9fbd6-8lbfw     2/2     Running   2 (121m ago)   140m
machine-config-server-7hb6q                  1/1     Running   0              136m
machine-config-server-rb7x9                  1/1     Running   0              136m
machine-config-server-szxjt                  1/1     Running   0              136m

Verify synced resources:

$ oc get clusterrole  |grep builder
machine-os-builder                                                          2023-07-26T10:47:17Z
machine-os-builder-events                                                   2023-07-26T10:47:17Z
$ oc get ClusterRoleBinding |grep builder
machine-os-builder                                                          ClusterRole/machine-os-builder                                                          137m
machine-os-builder-anyuid                                                   ClusterRole/system:openshift:scc:anyuid                                                 137m
$ oc get deployment -n openshift-machine-config-operator 
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
machine-config-controller   1/1     1            1           151m
machine-config-operator     1/1     1            1           160m
machine-os-builder          0/0     0            0           151m

$ oc get rolebindings -n default |grep event
machine-config-controller-events   ClusterRole/machine-config-controller-events   158m
machine-config-daemon-events       ClusterRole/machine-config-daemon-events       158m
machine-os-builder-events          ClusterRole/machine-os-builder-events          158m

$ oc get rolebindings -n openshift-machine-config-operator |grep event
machine-config-controller-events   ClusterRole/machine-config-controller-events   159m
machine-config-daemon-events       ClusterRole/machine-config-daemon-events       159m
machine-os-builder-events          ClusterRole/machine-os-builder-events          158m
$ oc get sa -n openshift-machine-config-operator machine-os-builder
NAME                 SECRETS   AGE
machine-os-builder   1         160m
$ oc rsh -n openshift-machine-config-operator daemonset/machine-config-daemon ls -larth /usr/bin |grep machine
Defaulted container "machine-config-daemon" out of: machine-config-daemon, oauth-proxy
-rwxr-xr-x. 1 root root  25K Apr 27 12:11  systemd-machine-id-setup
-rwxr-xr-x. 1 root root 1.8M Jul 26 10:24  machine-os-builder
-rwxr-xr-x. 1 root root  44M Jul 26 10:24  machine-config-server
-rwxr-xr-x. 1 root root  51M Jul 26 10:24  machine-config-operator
-rwxr-xr-x. 1 root root  55M Jul 26 10:24  machine-config-daemon
-rwxr-xr-x. 1 root root  80M Jul 26 10:24  machine-config-controller
-rwxr-xr-x. 1 root root  55M Jul 26 10:26  machine-config-daemon.rhel9

Verify deployment behavior:

When labeling and unlabeling a MCP the machine-os-builder deployment is scaled to 1 and a new pod is created:

oc label mcp infra/master/worker machineconfiguration.openshift.io/layering-enabled=

machine-os-builder-849f777495-t57t7 1/1 Running 0 67s

It worked fine with "worker" pool, "master" pool and a custom "infra" pool

When the label is removed the deployment is scaled to zero.

When we label more than 1 MCP, only 1 pod is create. We can label "master", "worker" and "infra" pool and only 1 machine-os-builder pod will be created. This pod will only be removed once we remove the label from all the MCP.

"version" command is added in this PR, but it is never used. We assume that it will be included later when developing the whole machine-os-builder functionality.

$ oc rsh -n openshift-machine-config-operator daemonset/machine-config-daemon  /usr/bin/machine-os-builder version
Defaulted container "machine-config-daemon" out of: machine-config-daemon, oauth-proxy
Hello, World!

Please, could you confirm that the "version" command behavior and the "machine-os-builder" deployment behavior when several MCPs are tagged are expected behaviors?

That's my only concern, if those behaviors are expected we will add the qe-approved label.

Thank you very much!

PS: I have not checked the bare minimum requirements in the permissions since in the comments it was said that it will be done at the end of the development.

@sergiordlr
Copy link

After talking with @dkhater-redhat we have confirmed that the behavior that we are seeing is the expected one.

We can add the qe-approved label.
/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Jul 26, 2023
@jkyros
Copy link
Contributor

jkyros commented Jul 26, 2023

woohoo!
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 26, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 26, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dkhater-redhat, jkyros

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [dkhater-redhat,jkyros]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants