Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MCO-831: added feature gate to mco for on cluster builds #4060

Merged

Conversation

dkhater-redhat
Copy link
Contributor

- What I did

- How to verify it

- Description for the changelog

@openshift-ci openshift-ci bot requested review from cdoern and djoshy December 5, 2023 19:25
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 5, 2023
@dkhater-redhat
Copy link
Contributor Author

/retest

@dkhater-redhat
Copy link
Contributor Author

/retest

Copy link
Contributor

@cdoern cdoern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, this should work since the operator is already wired up to wait for featuregates.

pkg/operator/sync.go Show resolved Hide resolved
@cdoern
Copy link
Contributor

cdoern commented Dec 7, 2023

@dkhater-redhat here is some info on the failures:

it looks like its possible the FGs changed since we last vendored in the API, that is why unit is failing. Make sure you check the cluster config operator and/or API to see if anyone removed any features for the default FG. if they did, you'll need to edit templates/master/01-master-kubelet/_base/files/kubelet.yaml and templates/worker/01-worker-kbelet/_base/files/kubelet.yaml and remove the no longer needed featuregates

Other than that, verify is failing bc you bumped k8s to a version that changed what errorf and a few other functions allow as formatted args you'll need to go into the functions and change %w to %s and change err references to err.Error() to make it a string.

Going to take a look at e2e failures now, I have a feeling its all FG related.

@cdoern
Copy link
Contributor

cdoern commented Dec 7, 2023

actually. It seems the failure is on build: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/4060/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1732782030187401216/build-log.txt at the bottom either

  1. containers/common changed their retry functions (I reallllllly hope not)
  2. you need to run go mod tidy go mod vendor and that should hopefully fix it.

I'll go look at c/common to make sure they didn't delete anything

Copy link
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please squash the commits? Thanks!

@dkhater-redhat
Copy link
Contributor Author

/test ci/prow/unit
/test ci/prow/verify

Copy link
Contributor

openshift-ci bot commented Dec 7, 2023

@dkhater-redhat: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test 4.12-upgrade-from-stable-4.11-images
  • /test cluster-bootimages
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-op
  • /test e2e-gcp-op-single-node
  • /test e2e-hypershift
  • /test images
  • /test okd-scos-images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
  • /test bootstrap-unit
  • /test e2e-alibabacloud-ovn
  • /test e2e-aws-disruptive
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-fips-op
  • /test e2e-aws-ovn-workers-rhel8
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-upgrade-single-node
  • /test e2e-aws-workers-rhel8
  • /test e2e-azure
  • /test e2e-azure-ovn-upgrade
  • /test e2e-azure-upgrade
  • /test e2e-gcp-op-layering
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-gcp-rt
  • /test e2e-gcp-rt-op
  • /test e2e-gcp-single-node
  • /test e2e-gcp-upgrade
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack
  • /test e2e-openstack-dualstack
  • /test e2e-openstack-externallb
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-ovirt-upgrade
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test e2e-vsphere-upi-zones
  • /test e2e-vsphere-zones
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images
  • /test okd-scos-e2e-aws-ovn
  • /test okd-scos-e2e-gcp-op
  • /test okd-scos-e2e-gcp-ovn-upgrade
  • /test okd-scos-e2e-vsphere

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-machine-config-operator-master-bootstrap-unit
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-layering
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node
  • pull-ci-openshift-machine-config-operator-master-e2e-hypershift
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-okd-scos-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test ci/prow/unit
/test ci/prow/verify

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dkhater-redhat
Copy link
Contributor Author

/test unit
/test verify

@dkhater-redhat dkhater-redhat force-pushed the ocb-feature-gate branch 2 times, most recently from 99c8cf4 to 06c6bb6 Compare December 7, 2023 21:48
@cdoern
Copy link
Contributor

cdoern commented Dec 7, 2023

Do you have unit tests that depend on the build controller working? If so, those aren't going to work since the FG isn't enabled. Also, if they are in unit tests, and you try to check if a FG exists you will get errors.

I can't tell if that is the issue or if the pods can't build. Try make binaries locally maybe and see if there are any failures?

@cdoern
Copy link
Contributor

cdoern commented Dec 7, 2023

/test unit

@dkhater-redhat dkhater-redhat force-pushed the ocb-feature-gate branch 2 times, most recently from 9e1a1ac to 5b26e22 Compare December 7, 2023 22:44
@yuqi-zhang
Copy link
Contributor

/retest-required

@jkyros
Copy link
Contributor

jkyros commented Dec 8, 2023

I know the SCOS test isn't required, but the operator pod on SCOS is panic-ing because the SCOS featuregate list doesn't have the new featuregate in it:

{  pods/openshift-machine-config-operator_machine-config-operator-5ff5d87b48-25fqg_machine-config-operator.log.gz:E1207 23:54:57.410264       1 runtime.go:79] Observed a panic: &errors.errorString{s:"feature \"OnClusterBuild\" is not registered in FeatureGates [AdminNetworkPolicy AlibabaPlatform AutomatedEtcdBackup AzureWorkloadIdentity BuildCSIVolumes CSIDriverSharedResource CloudDualStackNodeIPs ClusterAPIInstall DNSNameResolver DisableKubeletCloudCredentialProviders DynamicResourceAllocation EventedPLEG ExternalCloudProvider ExternalCloudProviderAzure ExternalCloudProviderExternal ExternalCloudProviderGCP GCPClusterHostedDNS GCPLabelsTags GatewayAPI InsightsConfigAPI InstallAlternateInfrastructureAWS MachineAPIOperatorDisableMachineHealthCheckController MachineAPIProviderOpenStack MachineConfigNodes ManagedBootImages MaxUnavailableStatefulSet MetricsServer MixedCPUsAllocation NetworkLiveMigration NodeSwap OpenShiftPodSecurityAdmission PrivateHostedZoneAWS RouteExternalCertificate SignatureStores 

And it's right, it doesn't:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/4060/pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn/1732893976983244800/artifacts/e2e-aws-ovn/gather-extra/artifacts/featuregate.json

But it does have MachineConfigNodes which merged last week, so is this...our SCOS release images are just a little behind our OCP images?

@rioliu-rh
Copy link

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 8, 2023
@dkhater-redhat
Copy link
Contributor Author

/retest-required

@rioliu-rh
Copy link

rioliu-rh commented Dec 8, 2023

by default featureGate: OnClusterBuild is disabled.

$ oc get featuregate/cluster -o yaml | yq -y '.status.featureGates[].disabled' | grep OnClusterBuild
 - name: OnClusterBuild

try to turn on OCB when this featureGate is disabled

# create custom mcp
$ cat infra.mcp.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""

$ oc apply -f infra.mcp.yaml
machineconfigpool.machineconfiguration.openshift.io/infra created

$ oc label node/ip-10-0-118-209.us-west-1.compute.internal node-role.kubernetes.io/infra=
node/ip-10-0-118-209.us-west-1.compute.internal labeled

$ mcp infra
NAME    CONFIG                                            UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra   rendered-infra-ee03b72baa24ec9b3184492526b023b3   True      False      False      1              1                   1                     0                      11m

# create configmap, pullsecret etc.
$ cat cm-on-cluster-build-config.yaml
apiVersion: v1
data:
  baseImagePullSecretName: mco-global-pull-secret
  finalImagePushSecretName: mco-test-pull-secret
  finalImagePullspec: "quay.io/mcoqe/layering"
kind: ConfigMap
metadata:
  name: on-cluster-build-config
  namespace: openshift-machine-config-operator

$ oc apply -f cm-on-cluster-build-config.yaml
configmap/on-cluster-build-config created

$ oc apply -f base-image-pull-secret.yaml
secret/mco-global-pull-secret created

$ oc apply -f final-image-pull-secret.yaml
secret/mco-test-pull-secret created

$ oc get cm/on-cluster-build-config -n openshift-machine-config-operator
NAME                      DATA   AGE
on-cluster-build-config   3      60s

$ oc get secret -n openshift-machine-config-operator | grep pull-secret
mco-global-pull-secret                      kubernetes.io/dockerconfigjson        1      81s
mco-test-pull-secret                        kubernetes.io/dockerconfigjson        1      65s

# label mcp/infra
$ oc label mcp/infra machineconfiguration.openshift.io/layering-enabled=
machineconfigpool.machineconfiguration.openshift.io/infra labeled

$ oc get mcp/infra -o yaml | yq -y '.metadata.labels'
machineconfiguration.openshift.io/layering-enabled: ''

# check deployment 
$ oc get deployment -n openshift-machine-config-operator
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
machine-config-controller   1/1     1            1           63m
machine-config-operator     1/1     1            1           67m

so when the featureGate is disabled, even required resources are created, deployment will not be created. this feature cannot be enabled.

patch featuregate/cluster to enable featureGate: OnClusterBuild

$ cat ../mc/featuregate_techpreview.yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade

$ oc apply -f ../mc/featuregate_techpreview.yaml
Warning: resource featuregates/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
featuregate.config.openshift.io/cluster configured

$ oc get featuregate/cluster -o yaml | yq -y '.status.featureGates[].enabled' | grep OnClusterBuild
- name: OnClusterBuild

check whether deployment and pod are ready and running

$ oc get deployment/machine-os-builder -n openshift-machine-config-operator
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
machine-os-builder   1/1     1            1           20m

$ oc get pod -n openshift-machine-config-operator -l k8s-app=machine-os-builder
NAME                                 READY   STATUS    RESTARTS   AGE
machine-os-builder-bdc4f7d8c-cm6c2   1/1     Running   0          20m

btw, featureGate: DisableKubeletCloudCredentialProviders is added in disabled list.

$ oc get featuregate/cluster -o yaml | yq -y '.status.featureGates[].disabled'
- name: ClusterAPIInstall
- name: DisableKubeletCloudCredentialProviders
- name: EventedPLEG
- name: MachineAPIOperatorDisableMachineHealthCheckController

@rioliu-rh
Copy link

rioliu-rh commented Dec 8, 2023

/retitle MCO-831 added feature gate to mco for on cluster builds

@openshift-ci openshift-ci bot changed the title added feature gate to mco for on cluster builds MCO-831 added feature gate to mco for on cluster builds Dec 8, 2023
@jkyros
Copy link
Contributor

jkyros commented Jan 19, 2024

CI is just rotten today
/test e2e-gcp-op

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 6745e60 and 0 for PR HEAD 251523f in total

@openshift-ci-robot
Copy link
Contributor

/hold

Revision 251523f was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 19, 2024
@yuqi-zhang
Copy link
Contributor

There were issues with build02 earlier, hopefully better now

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 19, 2024
@yuqi-zhang
Copy link
Contributor

/test e2e-gcp-op

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 6745e60 and 2 for PR HEAD 251523f in total

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 139e1d0 and 1 for PR HEAD 251523f in total

@jkyros
Copy link
Contributor

jkyros commented Jan 19, 2024

/override ci/prow/e2e-gcp-op-single-node

Copy link
Contributor

openshift-ci bot commented Jan 19, 2024

@jkyros: Overrode contexts on behalf of jkyros: ci/prow/e2e-gcp-op-single-node

In response to this:

/override ci/prow/e2e-gcp-op-single-node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jkyros
Copy link
Contributor

jkyros commented Jan 20, 2024

hahaha I don't think build02 is fixed:

{  error occurred handling build machine-config-operator-amd64: build didn't start running within 30m0s (phase: Pending):
Found 1 events for Pod machine-config-operator-amd64-build:
* 2024-01-18T20:49:20Z 128x default-scheduler: 0/34 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, 10 node(s) had untolerated taint {node-role.kubernetes.io/ci-tests-worker: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}, 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 6 node(s) were unschedulable, 7 node(s) didn't match Pod's node affinity/selector. no new claims to deallocate, preemption: 0/34 nodes are available: 1 No preemption victims found for incoming pod, 33 Preemption is not helpful for scheduling.}

@jkyros
Copy link
Contributor

jkyros commented Jan 22, 2024

/test e2e-gcp-op

@jkyros
Copy link
Contributor

jkyros commented Jan 22, 2024

It's not telling me here what the conflict is but locally it looks like we just got beat to a dependency bump, probably #4119 ?

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 22, 2024
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 22, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 22, 2024
@jkyros
Copy link
Contributor

jkyros commented Jan 24, 2024

That looks like it worked -- the tests passed, we just failed in teardown. I'd pull out that "merge" commit 09ec41d (I assume that was just collateral damage from the rebase or somesuch) and then we can try again to get this in? 😄

@dkhater-redhat dkhater-redhat force-pushed the ocb-feature-gate branch 2 times, most recently from 12a01d0 to 1f41fcb Compare January 24, 2024 22:07
@jkyros
Copy link
Contributor

jkyros commented Jan 25, 2024

/test e2e-hypershift

@jkyros
Copy link
Contributor

jkyros commented Jan 25, 2024

/lgtm

@jkyros
Copy link
Contributor

jkyros commented Jan 25, 2024

/override ci/prow/e2e-gcp-op-single-node

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 25, 2024
Copy link
Contributor

openshift-ci bot commented Jan 25, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cdoern, dkhater-redhat, jkyros

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cdoern,dkhater-redhat,jkyros]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

openshift-ci bot commented Jan 25, 2024

@jkyros: Overrode contexts on behalf of jkyros: ci/prow/e2e-gcp-op-single-node

In response to this:

/override ci/prow/e2e-gcp-op-single-node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

openshift-ci bot commented Jan 25, 2024

@dkhater-redhat: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure-ovn-upgrade-out-of-change df610a6 link false /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/okd-scos-e2e-aws-ovn df610a6 link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 3d833c0 into openshift:master Jan 25, 2024
14 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants