-
Notifications
You must be signed in to change notification settings - Fork 396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-20152,OCPBUGS-22364: fix timing issues upon cluster installation or upgrade #3987
Conversation
@cdoern: This pull request references Jira Issue OCPBUGS-20152, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cdoern The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
9795d0c
to
6f6375e
Compare
/payload-job periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade |
@cdoern: trigger 1 job(s) for the /payload-(job|aggregate) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a5a99250-7347-11ee-8c99-a9eeae8ad1fc-0 |
/hold for QE verification |
/retitle OCPBUGS-20152,OCPBUGS-22364: fix timing issues upon cluster installation or upgrade |
@cdoern: This pull request references Jira Issue OCPBUGS-20152, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. This pull request references Jira Issue OCPBUGS-22364, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
setup a cluster w/o capability $ cv version -o yaml | yq -y '.status.capabilities'
enabledCapabilities:
- OperatorLifecycleManager
knownCapabilities:
- Build
- CSISnapshot
- Console
- DeploymentConfig
- ImageRegistry
- Insights
- MachineAPI
- NodeTuning
- OperatorLifecycleManager
- Storage
- baremetal
- marketplace
- openshift-samples $ debug node/ip-10-0-69-28.us-east-2.compute.internal -- chroot /host stat /etc/docker/certs.d
stat: cannot statx '/etc/docker/certs.d': No such file or directory
error: non-zero exit code from debug container upgrade cluster from 4.14.1 to CI image built based on this PR and #4003, 4.15.0-0.test-2023-10-30-071930-ci-ln-hb38gnb-latest $ cv version -o yaml | yq -y '.status.history'
- acceptedRisks: 'Target release version="" image="registry.build05.ci.openshift.org/ci-ln-hb38gnb/release:latest"
cannot be verified, but continuing anyway because the update was forced: release
images that are not accessed via digest cannot be verified
Forced through blocking failures: Multiple precondition checks failed:
* Precondition "EtcdRecentBackup" failed because of "ControllerStarted": RecentBackup:
The etcd backup controller is starting, and will decide if recent backups are
available or if a backup is required
* Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate":
RetrievedUpdates=False (VersionNotFound), so the recommended status of updating
from 4.14.1 to 4.15.0-0.test-2023-10-30-071930-ci-ln-hb38gnb-latest is unknown.'
completionTime: '2023-10-30T09:55:42Z'
image: registry.build05.ci.openshift.org/ci-ln-hb38gnb/release:latest
startedTime: '2023-10-30T08:55:05Z'
state: Completed
verified: false
version: 4.15.0-0.test-2023-10-30-071930-ci-ln-hb38gnb-latest
- completionTime: '2023-10-30T08:00:07Z'
image: quay.io/openshift-release-dev/ocp-release@sha256:05ba8e63f8a76e568afe87f182334504a01d47342b6ad5b4c3ff83a2463018bd
startedTime: '2023-10-30T07:31:34Z'
state: Completed
verified: false
version: 4.14.1 check keywords 'no such file or directory' in mcc pod log, $ logs openshift-machine-config-operator -c machine-config-controller machine-config-controller-68444f8547-99qck | grep 'no such file or directory'
>> empty check fields $ oc get controllerconfig/machine-config-controller -o yaml | yq -y '.status.controllerCertificates'
- bundleFile: KubeAPIServerServingCAData
notAfter: '2033-10-27T07:18:30Z'
notBefore: '2023-10-30T07:18:30Z'
signer: CN=admin-kubeconfig-signer,OU=openshift
subject: CN=admin-kubeconfig-signer,OU=openshift
- bundleFile: KubeAPIServerServingCAData
notAfter: '2023-10-31T07:18:32Z'
notBefore: '2023-10-30T07:34:56Z'
signer: CN=kubelet-signer,OU=openshift
subject: CN=kube-csr-signer_@1698651297
- bundleFile: KubeAPIServerServingCAData
notAfter: '2023-10-31T07:18:32Z'
notBefore: '2023-10-30T07:18:32Z'
signer: CN=kubelet-signer,OU=openshift
subject: CN=kubelet-signer,OU=openshift
- bundleFile: KubeAPIServerServingCAData
notAfter: '2024-10-29T07:18:33Z'
notBefore: '2023-10-30T07:18:33Z'
signer: CN=kube-apiserver-to-kubelet-signer,OU=openshift
subject: CN=kube-apiserver-to-kubelet-signer,OU=openshift
- bundleFile: KubeAPIServerServingCAData
notAfter: '2024-10-29T07:18:33Z'
notBefore: '2023-10-30T07:18:33Z'
signer: CN=kube-control-plane-signer,OU=openshift
subject: CN=kube-control-plane-signer,OU=openshift
- bundleFile: KubeAPIServerServingCAData
notAfter: '2033-10-27T07:18:30Z'
notBefore: '2023-10-30T07:18:30Z'
signer: CN=kubelet-bootstrap-kubeconfig-signer,OU=openshift
subject: CN=kubelet-bootstrap-kubeconfig-signer,OU=openshift
- bundleFile: KubeAPIServerServingCAData
notAfter: '2024-10-29T07:34:54Z'
notBefore: '2023-10-30T07:34:53Z'
signer: CN=openshift-kube-apiserver-operator_node-system-admin-signer@1698651293
subject: CN=openshift-kube-apiserver-operator_node-system-admin-signer@1698651293
- bundleFile: RootCAData
notAfter: '2033-10-27T07:18:26Z'
notBefore: '2023-10-30T07:18:26Z'
signer: CN=root-ca,OU=openshift
subject: CN=root-ca,OU=openshift /unhold |
@rioliu-rh: This pull request references Jira Issue OCPBUGS-20152, which is invalid:
Comment This pull request references Jira Issue OCPBUGS-22364, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/jira refresh |
@rioliu-rh: This pull request references Jira Issue OCPBUGS-20152, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: This pull request references Jira Issue OCPBUGS-22364, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
try again today, cluster setup is failed with same error |
5f0d86d
to
740a867
Compare
0580591
to
aee0b8a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this looks good. I do have a few suggestions for you to have a look at.
@@ -637,11 +645,105 @@ func (optr *Operator) syncCustomResourceDefinitions() error { | |||
return err | |||
} | |||
} | |||
|
|||
if strings.Contains(crd, "controllerconfig") { | |||
currentCR, err := optr.apiExtClient.ApiextensionsV1().CustomResourceDefinitions().Get(context.TODO(), "controllerconfigs.machineconfiguration.openshift.io", metav1.GetOptions{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Hoist everything in the if strings.Contains(crd, "controllerconfig")
block into a separate function or method so it's a bit easier to follow.
For the section where you continue
, you can early return nil there to achieve the same effect.
@@ -475,6 +476,18 @@ func (ctrl *Controller) syncKubeletConfig(key string) error { | |||
klog.V(4).Infof("Finished syncing kubeletconfig %q (%v)", key, time.Since(startTime)) | |||
}() | |||
|
|||
if err := wait.PollUntilContextTimeout(context.TODO(), ctrlcommon.ControllerConfigRolloutInterval, ctrlcommon.ControllerConfigTimeout, false, func(_ context.Context) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Hoist this wait code into a separate function for easier reuse since it looks like it is being used in multiple places.
func waitForControllerConfigCreation() error {
return wait.PollUntilContextTimeout(context.TODO(), ctrlcommon.ControllerConfigRolloutInterval, ctrlcommon.ControllerConfigTimeout, false, func(_ context.Context) (bool, error) {
// ...
}
}
@@ -473,6 +474,18 @@ func (ctrl *Controller) syncKubeletConfig(key string) error { | |||
klog.V(4).Infof("Finished syncing kubeletconfig %q (%v)", key, time.Since(startTime)) | |||
}() | |||
|
|||
if err := wait.PollUntilContextTimeout(context.TODO(), ctrlcommon.ControllerConfigRolloutInterval, ctrlcommon.ControllerConfigTimeout, false, func(_ context.Context) (bool, error) { | |||
_, err := ctrl.client.MachineconfigurationV1().ControllerConfigs().Get(context.TODO(), ctrlcommon.ControllerConfigName, metav1.GetOptions{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth a CI cycle just to see if we can get away with using the lister instead. Although I don't know how difficult this particular issue is to reproduce there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failing unit test is:
=== RUN TestUpdatesGeneratedMachineConfig
I0125 20:37:34.248250 20898 render_controller.go:436] Controller Config has not been created. Continuing context deadline exceeded
W0125 20:37:34.248323 20898 render_controller.go:580] No BaseOSContainerImage set
render_controller_test.go:135: Expected
testing.GetActionImpl{ActionImpl:testing.ActionImpl{Namespace:"", Verb:"get", Resource:schema.GroupVersionResource{Group:"", Version:"", Resource:"machineconfigs"}, Subresource:""}, Name:"rendered-test-cluster-master-a39f193e46d2048532b91e0ddc0faceb"}
got
testing.GetActionImpl{ActionImpl:testing.ActionImpl{Namespace:"", Verb:"get", Resource:schema.GroupVersionResource{Group:"machineconfiguration.openshift.io", Version:"v1", Resource:"controllerconfigs"}, Subresource:""}, Name:"machine-config-controller"}
render_controller_test.go:135: Expected
testing.UpdateActionImpl{ActionImpl:testing.ActionImpl{Namespace:"", Verb:"update", Resource:schema.GroupVersionResource{Group:"", Version:"", Resource:"machineconfigs"}, Subresource:""}, Object:(*v1.MachineConfig)(0xc000007520)}
got
testing.GetActionImpl{ActionImpl:testing.ActionImpl{Namespace:"", Verb:"get", Resource:schema.GroupVersionResource{Group:"machineconfiguration.openshift.io", Version:"v1", Resource:"controllerconfigs"}, Subresource:""}, Name:"machine-config-controller"}
render_controller_test.go:135: Expected
testing.UpdateActionImpl{ActionImpl:testing.ActionImpl{Namespace:"", Verb:"update", Resource:schema.GroupVersionResource{Group:"", Version:"", Resource:"machineconfigpools"}, Subresource:""}, Object:(*v1.MachineConfigPool)(0xc000248840)}
got
testing.GetActionImpl{ActionImpl:testing.ActionImpl{Namespace:"", Verb:"get", Resource:schema.GroupVersionResource{Group:"machineconfiguration.openshift.io", Version:"v1", Resource:"controllerconfigs"}, Subresource:""}, Name:"machine-config-controller"}
Some of the fields have got new values now. Not sure but this could be related to API change and we need to update our test?
494d0e2
to
04b336e
Compare
fix degrading on /etc/docker/certs.d not existing as well as cconfig not existing Signed-off-by: Charlie Doern <cdoern@redhat.com>
@cdoern: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /close |
@cdoern: This pull request references Jira Issue OCPBUGS-22364. The bug has been updated to no longer refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
fix degrading on /etc/docker/certs.d not existing as well as cconfig not existing.
Also fix controllerconfig validation issues upon upgrade.
We seem to have some places where we assume the operator has beat everyone else to the punch in creating some directories and even some resources. We should never assume things exist and also we should not degrade on this type of race.