Retry Elasticsearch logging health check #10233
Conversation
GCE e2e build/test passed for commit 2d57c1d21046c82387d05cd8d8ee4fa826b279c3.
		Param("health", "pretty").
		DoRaw()
	var body []byte
	for start := time.Now(); time.Since(start) < graceTime; time.Sleep(5 * time.Second) {
We're trying to standardise on using wait.Poll() for retry loops, but given that this source file has so many instances of homegrown retry loops, let's leave migration to wait.Poll() to a separate PR.
Very happy to convert to using wait.Poll() in a subsequent PR (or get readiness probes working instead).
Given that we now have three separate retry loops that each retry for up to graceTime=10 minutes (i.e. a total of 30 minutes), this test is at risk of becoming extremely slow to fail. I've checked our Jenkins, and in the past 100 runs the test either passed in 50-90 seconds total or failed completely. Can we please reduce the grace period from 10 minutes to 90 seconds?
At some point a single grace period was forced upon all the loops in this test, which I did not like: some retry loops should take only a short amount of total time (all the ones up to the last one), while some need longer, e.g. to gather the logs (the last one). Previously, the retry timeouts were tuned to the functionality being tested.
LGTM barring the reduction of grace period from 10 minutes to 90 seconds.
@lavalamp for a second set of eyes and "ok to merge" approval.
90 seconds is not enough to wait for all the logs to be ingested.
I have added a new constant to express how long to wait for the log lines (10 minutes) and adjusted the grace time (for status queries) down to 2 minutes.
According to Jenkins, the test passes 93% of the time, and 100% of those passes take under 90 seconds. I cannot understand why you believe it's necessary to have a grace period of longer than 90 seconds on any of those retry loops. Am I missing something?
Well, that is based on my experience of running the test manually, when I sometimes notice it takes a long time for the log lines to end up in Elasticsearch. OK: I shall adjust down the timeout for observing the log lines.
The timeout for observing the ingestion of the log lines has been reduced to 3 minutes.
GCE e2e build/test passed for commit a2622c1b7f549e9d2c2bd510abf496b82d6cbfac.
LGTM. Still awaiting "ok to merge" from @lavalamp.
GCE e2e build/test passed for commit 52461b7.
LGTM |
Retry Elasticsearch logging health check
Addresses #9486
This was the only REST query to Elasticsearch logging that was not retried in case of failure (usually due to one of the replicas not being ready). This query is now retried. An attempt to use a readiness probe did not work; I will revisit that approach later.
Risk: low.