
WriteTimeout is too short for index-pattern creation in Kibana #69

Closed · eldis80 opened this issue Dec 29, 2020 · 8 comments

eldis80 commented Dec 29, 2020

Describe the bug
Index-pattern creation fails in Kibana because no indices are listed with the default query

Environment

  • OpenShift 4.5.15
  • CLO version 4.5.0-202012120433.p0

Logs
Couldn't find much relevant information in the logs.

Expected behavior
List of available indices when creating an index-pattern in Kibana

Actual behavior
When creating an index-pattern in Kibana, it queries the indices with a POST like this:
URL: https://kibana-openshift-logging.apps./elasticsearch/*/_search?ignore_unavailable=true
Payload: {"size":0,"aggs":{"indices":{"terms":{"field":"_index","size":200}}}}

After a while a toast pops up saying Kibana was unable to fetch indices.

Same query using Kibana's Dev Tools gives:
{
  "message": "Client request error: socket hang up",
  "statusCode": 502,
  "error": "Bad Gateway"
}

To Reproduce
Steps to reproduce the behavior:

  1. Create an Elasticsearch cluster with a large number of docs and shards.
  2. Try to create an index-pattern in Kibana.
  3. No indices are returned, so the index-pattern cannot be created.

Additional context
I believe this happens because the query goes through elasticsearch-proxy, and a WriteTimeout of 5 seconds was introduced in #57. This WriteTimeout effectively closes the connection if the response takes longer than 5 seconds to write.
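For reference, setting a write timeout on a Go http.Server looks roughly like the sketch below; this is only an illustration of the mechanism, not the actual code in elasticsearch-proxy's http.go, and the handler is a placeholder:

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:    ":8443",
		Handler: http.NotFoundHandler(), // placeholder handler for illustration
		// With a 5-second WriteTimeout, a response that is not fully
		// written within 5 seconds is cut off and the connection closed.
		WriteTimeout: 5 * time.Second,
	}
	_ = srv.ListenAndServe()
}
```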

We have so many docs and shards because we have set the application log retention to 30 days. The other logs (infra and audit) have a retention of 7 days.

The beginning of the response, when the same query is run from within the ES pod using the es_util tool, shows that our query takes about 8 seconds:
{
  "took" : 8072,
  "timed_out" : false,
  "_shards" : {
    "total" : 223,
    "successful" : 223,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1213667064,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "indices" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "app-000050",
          "doc_count" : 58109669
        },
        {
          "key" : "app-000053",
          "doc_count" : 41653740

jcantrill (Contributor) commented

This may be a side effect of another issue we discovered, where memory is consumed by a query across all indices to get metadata. Maybe your cluster needs more memory. Please consider posting the results of running a must-gather; instructions can be found at openshift/cluster-logging-operator.

eldis80 (Author) commented Dec 30, 2020

I already increased the memory for elasticsearch-proxy, as I saw those other issues, but that didn't help. Even before that, we hadn't run into OOM situations. As I understand from Go's http documentation, with TLS enabled the WriteTimeout covers the time from when the request headers are read until the response is completely written. In our case the response from ES itself takes ~8 seconds, so elasticsearch-proxy has already closed the connection. I've tested this quite extensively with Kibana's Dev Tools, and any query taking longer than 5 seconds fails.
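To illustrate that last point, here is a minimal, self-contained Go sketch (plain HTTP rather than TLS, and not the actual elasticsearch-proxy code) showing that a handler which takes longer than the server's WriteTimeout never gets its response to the client:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Simulate an upstream Elasticsearch query that takes ~6 seconds.
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(6 * time.Second)
		fmt.Fprintln(w, `{"took":6000}`)
	})

	srv := httptest.NewUnstartedServer(slow)
	srv.Config.WriteTimeout = 5 * time.Second // the value introduced in #57
	srv.Start()
	defer srv.Close()

	// The write deadline expires before the handler responds, so the server
	// drops the connection and the client gets an error (typically EOF)
	// instead of the JSON body -- analogous to Kibana's "socket hang up".
	_, err := http.Get(srv.URL)
	fmt.Println("client error:", err)
}
```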

In any case, I think the WriteTimeout shouldn't be less than what is configured as Kibana's elasticsearch.requestTimeout (with CLO it's 300000 ms). Otherwise, requests from Kibana going through elasticsearch-proxy will be disconnected once the WriteTimeout elapses.

P.S. We will provide the must-gather information through a support ticket shortly.

eldis80 (Author) commented Jan 7, 2021

Could you comment on the reasoning why the elasticsearch-proxy's Go http.Server.WriteTimeout is set to 5 seconds when the requestTimeout in Kibana is set to 300 seconds?

eldis80 (Author) commented Jan 7, 2021

I have now tested this by compiling a new version of this elasticsearch-proxy with http.Server.WriteTimeout in http.go set to 600 seconds. Instead of getting the "socket hang up" error described in the bug description, I now get this (in Kibana's Dev Tools tab):
{
  "took" : 15495,
  "timed_out" : false,
  "_shards" : {
    "total" : 235,
    "successful" : 235,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1134384077,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "indices" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,

This means that queries from Kibana to ES that take more than 5 seconds now succeed. Previously, anything that took longer than 5 seconds failed because the connection was closed by elasticsearch-proxy.

jcantrill (Contributor) commented

Closing; fixed by #73, which bumps the timeout to a minute and additionally allows overriding the default configuration.
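For context, making the timeout overridable could look roughly like the hypothetical sketch below; the flag name, default, and wiring here are illustrative only and not taken from #73:

```go
package main

import (
	"flag"
	"net/http"
	"time"
)

func main() {
	// Hypothetical flag; the real option added in #73 may be named and
	// wired differently.
	writeTimeout := flag.Duration("http-write-timeout", time.Minute, "HTTP server write timeout")
	flag.Parse()

	srv := &http.Server{
		Addr:         ":8443",
		Handler:      http.NotFoundHandler(), // placeholder handler for illustration
		WriteTimeout: *writeTimeout,          // defaults to one minute, can be overridden
	}
	_ = srv.ListenAndServe()
}
```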

eldis80 (Author) commented Feb 10, 2021

Great. I still don't understand why you would have a shorter timeout at the proxy than what is configured in Kibana as the ElasticSearch queryTimeout. As I understand it, all queries from Kibana go through this elasticsearch-proxy in your implementation of ClusterLogging.

eldis80 (Author) commented Feb 10, 2021

And is the fix also coming to the 4.5 or 4.6 releases? I can only see it in master and 4.7.

jcantrill (Contributor) commented

> Great. I still don't understand why you would have a shorter timeout at the proxy than what is configured in Kibana as the ElasticSearch queryTimeout. As I understand it, all queries from Kibana go through this elasticsearch-proxy in your implementation of ClusterLogging.

The timeout was modified to address a memory issue when FIPS was enabled. I did not pick the current value myself, so I can't speak to the motivation behind it. Regardless, I'm certain it was chosen from the perspective of the heavy write traffic from the collector, not the read traffic from Kibana. This change will be cherry-picked back to 4.5 and is awaiting verification from QE.
