Retry Elasticsearch logging health check #10233
Conversation
GCE e2e build/test passed for commit 2d57c1d21046c82387d05cd8d8ee4fa826b279c3.
		Param("health", "pretty").
		DoRaw()
	var body []byte
	for start := time.Now(); time.Since(start) < graceTime; time.Sleep(5 * time.Second) {
We're trying to standardise on using wait.Poll() for retry loops, but given that this source file has so many instances of homegrown retry loops, let's leave migration to wait.Poll() to a separate PR.
Very happy to convert to using wait.Poll() in a subsequent PR (or get readiness probes working instead).
Given that we now have three separate retry loops that each retry for up to graceTime=10 minutes (i.e. a total of 30 minutes), this test is at risk of becoming extremely slow to fail. I've checked our Jenkins, and in the past 100 runs the test either passed in 50-90 seconds total or failed completely. Can we please reduce the grace period from 10 minutes to 90 seconds?
At some point a single grace period was forced upon all the loops in this test, which I did not like: some retry loops should take only a short amount of total time (all the ones up to the last one), while some need longer, e.g. to gather the logs (the last one). Previously, the retry timeouts were tuned to the functionality being tested.
LGTM barring the reduction of grace period from 10 minutes to 90 seconds.
@lavalamp for a second set of eyes and "ok to merge" approval.
90 seconds is not enough to wait for all the logs to be ingested.
I have added a new constant to express how long to wait for the log lines (10 minutes) and adjusted the grace time (for status queries) down to 2 minutes.
According to Jenkins, the test passes 93% of the time, and 100% of those passes take under 90 seconds. I cannot understand why you believe it's necessary to have a grace period of longer than 90 seconds on any of those retry loops. Am I missing something?
Well, that is based on my experience of running the test manually, when I sometimes notice it takes a long time for the log lines to end up in Elasticsearch. OK: I shall adjust down the timeout for observing the log lines.
The timeout for observing the ingestion of the log lines has been reduced to 3 minutes.
GCE e2e build/test passed for commit a2622c1b7f549e9d2c2bd510abf496b82d6cbfac.
LGTM. Still awaiting "ok to merge" from @lavalamp.
GCE e2e build/test passed for commit 52461b7.
LGTM |
Retry Elasticsearch logging health check
Addresses #9486
This was the only REST query to Elasticsearch logging that was not retried in case of failure (usually due to one of the replicas not being ready). This query is now retried. An attempt to use a readiness probe did not work; I will revisit that approach later.
Risk: low.