Bug 1985997: Enable static pod fallback logic for SNO, with disruptive e2e test #1198
Force-pushed from b19d13c to 025d0f7.
/test e2e-gcp-operator-encryption-single-node
/test e2e-gcp-operator-encryption-rotation-single-node
}
}

func ensureClusterInGoodState(t testing.TB, cs clientSet, waitPollInterval, waitPollTimeout, mustBeReadyFor time.Duration) {
Does it need waitPollInterval?
Do you want to use wait.UntilWithContext?
No, I would just make this a constant: 10s or 30s.
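For illustration, a minimal sketch of the suggestion above: the poll interval becomes a package-level constant and callers pass only the timeout. It uses only the standard library; `poll`, `waitForGoodState`, and `goodStatePollInterval` are hypothetical names for this sketch and are not taken from the PR, and `poll` is a simplified stand-in for `wait.Poll` that checks the condition immediately.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// goodStatePollInterval is fixed instead of being passed by every caller.
// 10s is one of the two values floated in the review; it is an assumption here.
const goodStatePollInterval = 10 * time.Second

var errTimedOut = errors.New("timed out waiting for the condition")

// poll checks cond immediately and then once per interval until cond
// reports done, returns an error, or the next check would pass the deadline.
func poll(interval, timeout time.Duration, cond func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := cond()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return errTimedOut
		}
		time.Sleep(interval)
	}
}

// waitForGoodState is a hypothetical wrapper: the caller chooses only the timeout.
func waitForGoodState(timeout time.Duration, cond func() (bool, error)) error {
	return poll(goodStatePollInterval, timeout, cond)
}

func main() {
	// The condition is already satisfied, so this returns without sleeping.
	err := waitForGoodState(time.Minute, func() (bool, error) { return true, nil })
	fmt.Println("good state reached, err =", err)
}
```

With the interval fixed, the signature of a helper such as ensureClusterInGoodState would shrink by one parameter, and the log line would no longer need to print the interval.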
t.Logf("Waiting %s for the cluster to be in a good condition, interval = %v, timeout %v", mustBeReadyFor.String(), waitPollInterval, waitPollTimeout)

return wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) {
monitorPod, err := cs.Kube.CoreV1().Pods("openshift-kube-apiserver").Get(context.TODO(), "kube-apiserver-startup-monitor", metav1.GetOptions{})
These are static pods, so they have a different name (the node name is appended).
return false, nil /*retry*/
}
}
// TODO: on an HA cluster we could also check if pods are on the same revision
I would consider a cluster that is not in progress as good (i.e. nodeStatus has no transitioning node). Why do we need all of this?
Just playing it safe. Would checking that nodeStatus contains only currentRevision and nodeName be enough?
I think it is enough to start with. We also don't want to start from some "dirty" state.
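A rough sketch of the check agreed on above, assuming a simplified per-node status: a cluster counts as settled when every node entry carries only a node name and a current revision, with no target or failed revision recorded. The `nodeStatus` struct is a local stand-in written for this example, and its field names are assumptions rather than a copy of the operator API type.

```go
package main

import "fmt"

// nodeStatus is a local stand-in for the operator's per-node status.
// The field names are assumptions made for this sketch.
type nodeStatus struct {
	NodeName           string
	CurrentRevision    int32
	TargetRevision     int32
	LastFailedRevision int32
}

// settled reports whether every node has only a name and a current revision,
// i.e. no node is transitioning and no failed revision is left over.
func settled(statuses []nodeStatus) bool {
	if len(statuses) == 0 {
		return false // nothing reported yet, so not a known-good state
	}
	for _, s := range statuses {
		if s.NodeName == "" || s.CurrentRevision == 0 {
			return false
		}
		if s.TargetRevision != 0 || s.LastFailedRevision != 0 {
			return false // in progress, or starting from a "dirty" state
		}
	}
	return true
}

func main() {
	clean := []nodeStatus{{NodeName: "master-0", CurrentRevision: 7}}
	rolling := []nodeStatus{{NodeName: "master-0", CurrentRevision: 7, TargetRevision: 8}}
	fmt.Println("clean settled:", settled(clean))
	fmt.Println("rolling settled:", settled(rolling))
}
```

Treating a leftover failed revision as not settled covers the concern about starting from a "dirty" state.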

t.Logf("Setting UnsupportedConfigOverrides to %v", cfg)
err := retry.OnError(retry.DefaultRetry, func(error) bool { return true }, func() error {
raw, err := json.Marshal(cfg)
nit: once outside of the loop is enough
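To make the nit concrete, here is a small sketch that marshals the config once and retries only the update step. It is standard library only: `retry` is a simplified stand-in for `retry.OnError`, and `applyConfig`, `errTransient`, and the `update` callback are hypothetical names introduced for this example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// errTransient simulates a retriable failure such as an update conflict.
var errTransient = errors.New("transient update failure")

// retry calls fn up to attempts times and returns nil on the first success,
// otherwise the last error. It is a simplified stand-in for a retry helper.
func retry(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

// applyConfig marshals cfg once, outside the retry loop, because the result
// does not change between attempts. Only the update itself is retried.
func applyConfig(cfg map[string][]string, update func(raw []byte) error) error {
	raw, err := json.Marshal(cfg)
	if err != nil {
		return err // a marshalling error is not worth retrying
	}
	return retry(3, func() error { return update(raw) })
}

func main() {
	calls := 0
	err := applyConfig(map[string][]string{"exampleKey": {"exampleValue"}}, func(raw []byte) error {
		calls++
		if calls < 2 {
			return errTransient
		}
		fmt.Printf("applied %s after %d attempts\n", raw, calls)
		return nil
	})
	fmt.Println("err =", err)
}
```

A side effect of this shape is that a marshalling failure surfaces immediately instead of being retried like an API conflict.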
t.Helper()

t.Log("Waiting for StaticPodFallbackRevisionDegraded condition, interval = 20s, timeout = 6min")
err := wait.Poll(20*time.Second, 6*time.Minute, func() (bool, error) {
Why 6min?
return false, nil /*retry*/
}

if v1helpers.IsOperatorConditionFalse(ckaso.Status.Conditions, "StaticPodFallbackRevisionDegraded") {
Why not check for True and return true? Seems to be more direct in a loop like this.
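As a sketch of the more direct form suggested above, the poll body can return the result of the "is True" check instead of branching on "is False". The `condition` type and `isConditionTrue` below are local stand-ins written for this example so that it compiles without the operator libraries; `fallbackDegraded` is a hypothetical name for the poll body.

```go
package main

import "fmt"

// condition is a local stand-in for an operator condition.
type condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// isConditionTrue reports whether condType is present with status "True".
// A missing condition counts as not true.
func isConditionTrue(conds []condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

// fallbackDegraded is the poll body: done exactly when the condition is True,
// so "False", "Unknown", and "missing" all mean "keep polling".
func fallbackDegraded(conds []condition) (bool, error) {
	return isConditionTrue(conds, "StaticPodFallbackRevisionDegraded"), nil
}

func main() {
	before := []condition{{Type: "StaticPodFallbackRevisionDegraded", Status: "False"}}
	after := []condition{{Type: "StaticPodFallbackRevisionDegraded", Status: "True"}}
	for _, conds := range [][]condition{before, after} {
		done, _ := fallbackDegraded(conds)
		fmt.Println("done:", done)
	}
}
```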
require.NoError(t, err)
}

func assertNodeStatus(t testing.TB, cs clientSet) (string, int32) {
Assert what?
return nodeName, failedRevision
}

func assertKasPodAnnotatedOnNode(t testing.TB, cs clientSet, expectedFailedRevision int32, expectedFallbackReason, expectedFallbackMessage, nodeName string) {
I don't think we have to check the message. It makes maintenance hard. We have unit tests for that.
If you really want to be specific, just check for a keyword in the string (strings.Contains).
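A minimal sketch of the keyword check proposed above, as opposed to comparing the whole message. The annotation key `example.test/fallback-message` and the helper name `hasFallbackKeyword` are hypothetical and exist only for this example; the real annotation keys are not shown in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// fallbackMessageKey is a hypothetical annotation key used only in this sketch.
const fallbackMessageKey = "example.test/fallback-message"

// hasFallbackKeyword reports whether the annotations carry a fallback message
// that mentions keyword. It deliberately does not compare the full message,
// so rewording the message elsewhere does not break the e2e test.
func hasFallbackKeyword(annotations map[string]string, keyword string) bool {
	msg, ok := annotations[fallbackMessageKey]
	return ok && strings.Contains(msg, keyword)
}

func main() {
	annotations := map[string]string{
		fallbackMessageKey: "the pod did not become ready in time (illustrative text)",
	}
	fmt.Println(hasFallbackKeyword(annotations, "did not become ready"))
	fmt.Println(hasFallbackKeyword(annotations, "etcd"))
}
```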
That's true as well; there might be cases that would break etcd. I can remove it.
return filteredKasPods
}

func getDefaultUnsupportedConfig(t testing.TB, cs clientSet) map[string][]string {
Not sure about this func. The name is unspecific, and it abstracts something that is better kept explicit.
Would getPlatformSpecificConfig be better?
I think the infra check is not something one should hide in a func.
Force-pushed from 025d0f7 to 646d10a.
/test e2e-aws-operator-disruptive-single-node
Force-pushed from d9c6e3b to cb79c1d.
@p0lyn0mial: This pull request references Bugzilla bug 1989633, which is invalid:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from cb79c1d to c54d014.
setUnsupportedConfig(t, cs, cfg)

// validate if the fallback condition is reported and the cluster is stable
waitForFallbackDegradedCondition(t, cs, waitForFallbackDegradedConditionTimeout)
Can't we derive waitForFallbackDegradedConditionTimeout from some per-platform number? E.g. oneNodeRolloutTimeout.
Added a TODO.
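One way the TODO could eventually look, sketched under loud assumptions: the fallback-condition timeout is computed from a per-platform one-node rollout estimate plus a margin. The platform keys, every duration, and the "rollout plus margin" formula are placeholders invented for this example; none of them come from the PR or from measurements.

```go
package main

import (
	"fmt"
	"time"
)

// oneNodeRolloutTimeout holds a per-platform estimate of how long a single
// node takes to roll out a revision. Keys and values are placeholders.
var oneNodeRolloutTimeout = map[string]time.Duration{
	"platform-a": 5 * time.Minute,
	"platform-b": 8 * time.Minute,
}

// defaultOneNodeRolloutTimeout applies to platforms missing from the table (placeholder).
const defaultOneNodeRolloutTimeout = 10 * time.Minute

// fallbackMargin is extra slack for the fallback to be detected and reported (placeholder).
const fallbackMargin = 2 * time.Minute

// fallbackDegradedConditionTimeout derives the wait timeout from the
// per-platform rollout estimate instead of hard-coding a single number.
func fallbackDegradedConditionTimeout(platform string) time.Duration {
	rollout, ok := oneNodeRolloutTimeout[platform]
	if !ok {
		rollout = defaultOneNodeRolloutTimeout
	}
	return rollout + fallbackMargin
}

func main() {
	fmt.Println(fallbackDegradedConditionTimeout("platform-a"))
	fmt.Println(fallbackDegradedConditionTimeout("some-other-platform"))
}
```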
Force-pushed from c54d014 to a2514eb.
Force-pushed from 6a6110b to c11afe0.
Force-pushed from c11afe0 to 6170104.
/retest
@p0lyn0mial: This pull request references Bugzilla bug 1985997, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact:
/lgtm
Force-pushed from 6170104 to 6f77ec8.
pulled openshift/library-go#1177
/retest
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: p0lyn0mial, sttts
/retest-required
Please review the full test history for this PR and help us cut down flakes.
@p0lyn0mial: The following tests failed, say
@p0lyn0mial: All pull requests linked via external trackers have merged:
Bugzilla bug 1985997 has been moved to the MODIFIED state.
Replacing #1195. Enabling https://github.com/openshift/enhancements/blob/master/enhancements/kube-apiserver/startup-monitor.md for single node.
cc @romfreiman @eranco74