Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test for excessive etcd leadership changes #24291

Merged
merged 1 commit into from Jan 23, 2020

Conversation

sanchezl
Copy link
Contributor

@sanchezl sanchezl commented Dec 12, 2019

Adds an extended test which watches the etcd_server_leader_changes_seen_total metric and fails if the total across all nodes is ever greater than 9.

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 12, 2019
@sanchezl
Copy link
Contributor Author

I need to make this a "long running" test.

@deads2k
Copy link
Contributor

deads2k commented Dec 12, 2019

g.It("shouldn't report any alerts in firing state apart from Watchdog", func() {
if len(os.Getenv("TEST_UNSUPPORTED_ALLOW_VERSION_SKEW")) > 0 {

@deads2k
Copy link
Contributor

deads2k commented Dec 12, 2019

creative. I wonder if we'll get lucky and flake

/retest

@sanchezl
Copy link
Contributor Author

/retest

totalLeaderChanges := result["data"].(map[string]interface{})["result"].([]interface{})[0].(map[string]interface{})["value"].([]interface{})[1]
e2e.Logf("sum(etcd_server_leader_changes_seen_total) = %v", totalLeaderChanges)
return strconv.Atoi(totalLeaderChanges.(string))
}, "5ms", "30s").Should(o.BeNumerically("<", 10))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please make tests readable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sttts I have added some more comments. Let me know if its still unclear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sttts I've re-written it again. PTAL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sttts it is pretty readable to me :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is! Thanks :)

@sanchezl sanchezl force-pushed the e2e-leader-change branch 3 times, most recently from 249d37c to 6c61274 Compare December 13, 2019 22:00
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 13, 2019
@mfojtik
Copy link
Member

mfojtik commented Dec 16, 2019

/retest


var _ = g.Describe("etcd", func() {
defer g.GinkgoRecover()
var (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: make this one liner

// NewE2EPrometheusRouterClient returns a Prometheus HTTP API client configured to
// use the Prometheus route host, a bearer token, and no certificate verification.
func NewE2EPrometheusRouterClient(oc *util.CLI) (prometheusv1.API, error) {
var err error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove and make the one in 32 err := wait.PollImmediate(

@mfojtik
Copy link
Member

mfojtik commented Dec 16, 2019

/lgtm

We can iterate, this looks good and get us what we want.

/cc @hexfusion

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Dec 16, 2019
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

// check sum(etcd_server_leader_changes_seen_total) every 30s for 5m.
// the value should consistently be less than 10.
o.Consistently(func() model.SampleValue {
result, err := prometheus.Query(context.Background(), "sum(etcd_server_leader_changes_seen_total)", time.Now())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max(etcd_server_leader_changes_seen_total) should equal one. Per a buried thread in the incident channel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s important to understand this is optimal (max leader change 1) and only true to my knowledge with AWS. So we will most likely fail on most clouds. I think we will need a knob to tune this per cloud. So we can adjust per cloud or only run on AWS to watch for regression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need some guidance on what the numbers should be for other clouds.

@wking, if you have a suggestion for azure, I could at least start with that.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 16, 2019
@sanchezl sanchezl force-pushed the e2e-leader-change branch 2 times, most recently from ca567ae to 67b0147 Compare December 16, 2019 14:54
@deads2k
Copy link
Contributor

deads2k commented Dec 16, 2019

/lgtm

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

27 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet