test for excessive etcd leadership changes #24291
I need to make this a "long running" test.
origin/test/extended/prometheus/prometheus.go Lines 246 to 247 in 254adc6
creative. I wonder if we'll get lucky and flake /retest
3a71260 to 91c3342
/retest
test/extended/etcd/leader_changes.go
Outdated
	totalLeaderChanges := result["data"].(map[string]interface{})["result"].([]interface{})[0].(map[string]interface{})["value"].([]interface{})[1]
	e2e.Logf("sum(etcd_server_leader_changes_seen_total) = %v", totalLeaderChanges)
	return strconv.Atoi(totalLeaderChanges.(string))
}, "5ms", "30s").Should(o.BeNumerically("<", 10))
please make tests readable
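One way to address the readability concern is to decode the response into typed structs instead of chaining type assertions. The following is a minimal sketch, not the code that landed in this PR: it assumes the standard Prometheus HTTP API instant-query response shape, and the names queryResponse and firstSampleValue are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// queryResponse mirrors only the subset of a Prometheus instant-query
// response that the check needs. (Hypothetical type, for illustration.)
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			// Value is a [unixTimestamp, "sampleValue"] pair.
			Value [2]interface{} `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// firstSampleValue (hypothetical helper) decodes body and returns the first
// sample as a float64, with an error for each way the response can be malformed.
func firstSampleValue(body []byte) (float64, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, fmt.Errorf("decoding response: %w", err)
	}
	if resp.Status != "success" {
		return 0, fmt.Errorf("query status %q", resp.Status)
	}
	if len(resp.Data.Result) == 0 {
		return 0, fmt.Errorf("query returned no samples")
	}
	raw, ok := resp.Data.Result[0].Value[1].(string)
	if !ok {
		return 0, fmt.Errorf("unexpected sample value type %T", resp.Data.Result[0].Value[1])
	}
	return strconv.ParseFloat(raw, 64)
}

func main() {
	// Hand-written sample payload, not captured from a real cluster.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1574200000.123,"3"]}]}}`)
	v, err := firstSampleValue(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("sum(etcd_server_leader_changes_seen_total) = %v\n", v)
}
```

Compared with the assertion chain, a malformed or empty response produces a descriptive error rather than a panic inside the polled function.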
@sttts I have added some more comments. Let me know if it's still unclear.
@sttts I've re-written it again. PTAL
@sttts it is pretty readable to me :-)
it is! Thanks :)
249d37c to 6c61274
6c61274 to aa44485
/retest
test/extended/etcd/leader_changes.go
Outdated
var _ = g.Describe("etcd", func() {
	defer g.GinkgoRecover()
	var (
nit: make this a one-liner
// NewE2EPrometheusRouterClient returns a Prometheus HTTP API client configured to
// use the Prometheus route host, a bearer token, and no certificate verification.
func NewE2EPrometheusRouterClient(oc *util.CLI) (prometheusv1.API, error) {
	var err error
nit: remove this declaration and make the one on line 32 err := wait.PollImmediate(
/lgtm We can iterate; this looks good and gets us what we want. /cc @hexfusion
/retest Please review the full test history for this PR and help us cut down flakes.
test/extended/etcd/leader_changes.go
Outdated
// check sum(etcd_server_leader_changes_seen_total) every 30s for 5m.
// the value should consistently be less than 10.
o.Consistently(func() model.SampleValue {
	result, err := prometheus.Query(context.Background(), "sum(etcd_server_leader_changes_seen_total)", time.Now())
max(etcd_server_leader_changes_seen_total) should equal one, per a buried thread in the incident channel.
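The "sample repeatedly and require the condition to hold every time" pattern discussed here can be sketched with only the standard library. This is a rough stand-in for Gomega's Consistently, not its implementation; the consistently helper and the fake sampler are hypothetical, and a real test would query Prometheus for max(etcd_server_leader_changes_seen_total) instead.

```go
package main

import (
	"fmt"
	"time"
)

// consistently (hypothetical helper) calls sample every interval until
// duration has elapsed. It returns an error as soon as sampling fails or
// ok rejects a value, and nil if every sample was accepted.
func consistently(sample func() (float64, error), ok func(float64) bool, duration, interval time.Duration) error {
	deadline := time.Now().Add(duration)
	for {
		v, err := sample()
		if err != nil {
			return fmt.Errorf("sampling failed: %w", err)
		}
		if !ok(v) {
			return fmt.Errorf("value %v violated the condition", v)
		}
		// Stop once the next sample would land at or past the deadline.
		if !time.Now().Add(interval).Before(deadline) {
			return nil
		}
		time.Sleep(interval)
	}
}

func main() {
	// Fake sampler standing in for a Prometheus query; the third sample
	// simulates an extra leader election.
	samples := []float64{1, 1, 2}
	i := 0
	sample := func() (float64, error) {
		v := samples[i%len(samples)]
		i++
		return v, nil
	}
	atMostOne := func(v float64) bool { return v <= 1 }

	err := consistently(sample, atMostOne, 50*time.Millisecond, 5*time.Millisecond)
	fmt.Println("result:", err)
}
```

Because the metric is a counter, a single violating sample is enough to fail: once the value exceeds the limit it will not drop back within the same process lifetime.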
Updated.
It’s important to understand that this is the optimal case (max leader changes of 1) and, to my knowledge, only holds on AWS, so we will most likely fail on most other clouds. I think we will need a knob to tune this per cloud, so that we can either adjust the limit per cloud or only run on AWS to watch for regressions.
I need some guidance on what the numbers should be for other clouds.
@wking, if you have a suggestion for Azure, I could at least start with that.
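The per-cloud knob suggested above could be as simple as a lookup table with an explicit "not tuned" signal. The sketch below is an assumption about how that might look, not what the PR implements: only the AWS limit of 1 comes from this thread, the platform strings are illustrative, and the fallback of 9 is a placeholder borrowed from the earlier sum-based threshold rather than a measured value.

```go
package main

import "fmt"

// maxLeaderChangesByPlatform holds limits that have actually been discussed.
// Only the AWS entry comes from the review thread; other clouds are left out
// on purpose until someone supplies numbers for them.
var maxLeaderChangesByPlatform = map[string]int{
	"AWS": 1,
}

// defaultMaxLeaderChanges is a placeholder fallback (assumption), not a
// validated limit for any cloud.
const defaultMaxLeaderChanges = 9

// maxLeaderChangesFor (hypothetical helper) returns the limit for platform
// and whether that limit was explicitly tuned. A caller can use tuned == false
// either to apply the loose default or to skip the test entirely.
func maxLeaderChangesFor(platform string) (limit int, tuned bool) {
	if v, ok := maxLeaderChangesByPlatform[platform]; ok {
		return v, true
	}
	return defaultMaxLeaderChanges, false
}

func main() {
	for _, p := range []string{"AWS", "Azure"} {
		limit, tuned := maxLeaderChangesFor(p)
		fmt.Printf("%s: limit=%d tuned=%v\n", p, limit, tuned)
	}
}
```

Returning the tuned flag keeps both options from the discussion open: run everywhere with a per-cloud limit, or run only where a limit is known.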
aa44485 to aefb3e7
ca567ae to 67b0147
/lgtm
/retest Please review the full test history for this PR and help us cut down flakes.
27 similar comments
Adds an extended test which watches the etcd_server_leader_changes_seen_total metric and fails if the total across all nodes is ever greater than 9.