
Retry Elasticsearch logging health check #10233

Merged: 1 commit into kubernetes:master on Jun 23, 2015

Conversation

satnam6502 (Contributor)

Addresses #9486
This was the only REST query to Elasticsearch logging that was not retried in case of failure (usually due to one of the replicas not being ready); it is now retried. An attempt to use a readiness probe did not work, so I will revisit that approach later.
Risk: low.
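For readers skimming the thread, here is a minimal sketch of the retry-until-healthy pattern the PR adds around that query. It uses plain net/http against an illustrative localhost URL rather than the Kubernetes REST client the e2e test actually goes through, and the interval and grace period are assumptions, not necessarily the values in the PR.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// checkHealth repeatedly queries an Elasticsearch-style health endpoint until it
// answers with 200 OK or graceTime elapses, sleeping 5 seconds between attempts.
// Transient failures (e.g. a replica that is not ready yet) are retried instead of
// failing the check immediately.
func checkHealth(url string, graceTime time.Duration) ([]byte, error) {
	var lastErr error
	for start := time.Now(); time.Since(start) < graceTime; time.Sleep(5 * time.Second) {
		resp, err := http.Get(url)
		if err != nil {
			lastErr = err
			continue
		}
		body, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if readErr == nil && resp.StatusCode == http.StatusOK {
			return body, nil
		}
		lastErr = fmt.Errorf("status %d, read error: %v", resp.StatusCode, readErr)
	}
	return nil, fmt.Errorf("health check did not succeed within %v: %v", graceTime, lastErr)
}

func main() {
	// Illustrative endpoint; the real test proxies the query through the API server.
	body, err := checkHealth("http://localhost:9200/_cluster/health?pretty", 2*time.Minute)
	if err != nil {
		fmt.Println("unhealthy:", err)
		return
	}
	fmt.Println(string(body))
}
```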

@k8s-bot commented Jun 23, 2015

GCE e2e build/test passed for commit 2d57c1d21046c82387d05cd8d8ee4fa826b279c3.

Param("health", "pretty").
DoRaw()
var body []byte
for start := time.Now(); time.Since(start) < graceTime; time.Sleep(5 * time.Second) {
Review comment:

We're trying to standardise on using wait.Poll() for retry loops, but given that this source file has so many instances of homegrown retry loops, let's leave the migration to wait.Poll() to a separate PR.
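For context on the suggestion: wait.Poll in the Kubernetes util/wait package takes a poll interval, a timeout, and a condition function returning (done bool, err error), and keeps invoking the condition until it reports done, returns an error, or the timeout expires. A rough sketch of the same health check written with wait.Poll follows; the import path is the present-day apimachinery location (at the time of this PR the package lived in the main repository), and the HTTP call and URL are simplified stand-ins rather than the test's actual client code.

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForHealthy polls an illustrative health URL every 5 seconds until it returns
// 200 OK or the timeout expires. Returning (false, nil) from the condition means
// "not ready yet, keep polling"; wait.Poll reports an error if time runs out first.
func waitForHealthy(url string, timeout time.Duration) error {
	return wait.Poll(5*time.Second, timeout, func() (bool, error) {
		resp, err := http.Get(url)
		if err != nil {
			return false, nil // transient failure; retry on the next tick
		}
		resp.Body.Close()
		return resp.StatusCode == http.StatusOK, nil
	})
}

func main() {
	if err := waitForHealthy("http://localhost:9200/_cluster/health", 2*time.Minute); err != nil {
		fmt.Println("gave up waiting for Elasticsearch:", err)
	}
}
```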

satnam6502 (Contributor, author) replied:

Very happy to convert to wait.Poll() in a subsequent PR (or to get readiness probes working instead).

Review comment:

Given that we now have three separate retry loops that each retry for up to graceTime = 10 minutes (i.e. a total of 30 minutes), this test is at risk of becoming extremely slow to fail. I've checked our Jenkins history, and in the past 100 runs the test either passed in 50-90 seconds total or failed completely. Can we please reduce the grace period from 10 minutes to 90 seconds?

satnam6502 (Contributor, author) replied:

At some point a single grace period was forced upon all the loops in this test, which I did not like: some retry loops should take only a short amount of total time (all the ones up to the last one), while others need longer, e.g. the final loop that waits for the log lines to be gathered. Previously, the retry timeouts were tuned to the functionality being tested.

@ghost commented Jun 23, 2015

LGTM barring the reduction of grace period from 10 minutes to 90 seconds.

@ghost commented Jun 23, 2015

@lavalamp for a second set of eyes and "ok to merge" approval.

@satnam6502 (Contributor, author)

90 seconds is not enough to wait for all the logs to be ingested.

@satnam6502 (Contributor, author)

I have added a new constant to express how long to wait for the log lines (10 minutes) and adjusted the grace time (for status queries) down to 2 minutes.
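A rough sketch of the shape of that change is below; the constant names, values shown, and comments are assumptions for illustration, not the identifiers actually used in the PR.

```go
package e2e

import "time"

const (
	// graceTime bounds each retry loop that checks Elasticsearch status
	// (the health query and similar REST calls).
	graceTime = 2 * time.Minute

	// ingestionTimeout bounds the final loop that waits for all the synthetic
	// log lines to appear in Elasticsearch, which can take considerably longer.
	ingestionTimeout = 10 * time.Minute
)
```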

@ghost commented Jun 23, 2015

According to Jenkins, the test passes 93% of the time, and 100% of those passes complete in under 90 seconds. I cannot understand why you believe a grace period of longer than 90 seconds is necessary on any of these retry loops. Am I missing something?

@satnam6502 (Contributor, author)

Well, that is based on my experience of running the test manually, where I sometimes notice that it takes a long time for the log lines to end up in Elasticsearch. OK: I shall adjust the timeout for observing the log lines downwards.

@satnam6502 (Contributor, author)

The timeout for observing the ingestion of the log lines has been reduced to 3 minutes.

@k8s-bot commented Jun 23, 2015

GCE e2e build/test passed for commit a2622c1b7f549e9d2c2bd510abf496b82d6cbfac.

@ghost commented Jun 23, 2015

LGTM. Still awaiting "ok to merge" from @lavalamp.

@ghost added the lgtm label ("Looks good to me"; indicates that a PR is ready to be merged) on Jun 23, 2015
@k8s-bot commented Jun 23, 2015

GCE e2e build/test passed for commit 52461b7.

@lavalamp (Member)

LGTM

j3ffml added a commit that referenced this pull request on Jun 23, 2015: "Retry Elasticsearch logging health check"
@j3ffml merged commit c3dd781 into kubernetes:master on Jun 23, 2015
@satnam6502 unassigned ghost on Aug 12, 2015