Writetimeout is too short for index-pattern creation in Kibana #69
This may be a side effect of another issue we discovered where memory is exhausted by a query across all indices to get metadata. Maybe your cluster needs more memory. Please consider posting the results of running a must-gather; instructions can be found at openshift/cluster-logging-operator.
I already increased the memory for elasticsearch-proxy after seeing those other issues, but that didn't help, and even before that we hadn't run into OOM situations. As I understand from Go's http documentation, with TLS enabled the WriteTimeout covers the time from the request headers until the response is fully written. In our case the response from ES itself takes ~8 seconds, so elasticsearch-proxy has already closed the connection. I've tested this extensively with Kibana's Dev Tools, and any query taking longer than 5 seconds fails. In any case, I think the WriteTimeout shouldn't be less than what is configured as Kibana's elasticsearch.requestTimeout (with CLO it's 300000 ms); otherwise requests from Kibana going through elasticsearch-proxy will be disconnected after the WriteTimeout. PS: We are providing the must-gather information through a support ticket shortly.
Could you comment on the reasoning why elasticsearch-proxy's Go http.Server.WriteTimeout is set to 5 seconds when the requestTimeout in Kibana is set to 300 seconds?
I have now tested this by compiling a new version of elasticsearch-proxy where http.Server.WriteTimeout in http.go is set to 600 seconds. Instead of getting the "socket hang up" error described in the bug report, the query now succeeds in Kibana's Dev Tools tab. This means that queries taking more than 5 seconds now complete successfully from Kibana to ES; previously, anything that took longer than 5 seconds failed because the connection was closed by elasticsearch-proxy.
Closing: fixed by #73, which bumps it to a minute and additionally allows overriding the default configuration.
Great. I still don't understand why you would have a shorter timeout at the proxy than what is configured in Kibana as the Elasticsearch request timeout. As I understand it, all queries from Kibana go through this elasticsearch-proxy in your implementation of ClusterLogging.
And is the fix also coming to 4.5 or 4.6 releases? I can only see it in master and 4.7. |
The timeout was modified to address a memory issue when FIPS was enabled. I did not pick the current value myself, so I can't speak to the motivation for choosing it. Regardless, I'm certain it was chosen from the perspective of the heavy write traffic from the collector, not reads from Kibana. This change will be cherry-picked back to 4.5 and is awaiting verification from QE.
Describe the bug
Index-pattern creation fails in Kibana because no indices are listed with the default query
Environment
Logs
Couldn't find much information in the logs.
Expected behavior
List of available indices when creating an index-pattern in Kibana
Actual behavior
When creating an index-pattern in Kibana, it queries the indices with a POST like this:
URL: https://kibana-openshift-logging.apps./elasticsearch/*/_search?ignore_unavailable=true
Payload: {"size":0,"aggs":{"indices":{"terms":{"field":"_index","size":200}}}}
After a while a toast pops up saying Kibana was unable to fetch indices.
Same query using Kibana's Dev Tools gives:
{
"message": "Client request error: socket hang up",
"statusCode": 502,
"error": "Bad Gateway"
}
To Reproduce
Steps to reproduce the behavior:
Additional context
I believe this happens because the query goes through elasticsearch-proxy, and a WriteTimeout of 5 seconds was introduced in #57. This WriteTimeout closes the connection if writing the response takes more than 5 seconds.
We have so many docs and shards because we have set the application logs retention to 30 days. Other logs (infra and audit) have retention for 7 days.
The beginning of the response when the same query is run from within the ES pod using the es_util tool shows that our query takes ~8 seconds:
{
  "took" : 8072,
  "timed_out" : false,
  "_shards" : {
    "total" : 223,
    "successful" : 223,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1213667064,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "indices" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "app-000050",
          "doc_count" : 58109669
        },
        {
          "key" : "app-000053",
          "doc_count" : 41653740