Internal Server Error after 20 seconds for big queries in Jaeger UI #958

Closed
darthlynx opened this issue Jun 13, 2022 · 2 comments

@darthlynx

Describe the bug

We use jaeger-query version 1.34.1 with Elasticsearch storage (an OpenSearch cluster in our case). Both the EC2 instances running jaeger-query and the OpenSearch cluster live in the same VPC, and we use a VPC endpoint for the --es.server-urls parameter.

This setup works as expected when we search for small traces, but Jaeger returns a 500 Internal Server Error after a 20-second timeout for big traces (please find the screenshot below). At the same time, we set es.timeout to 120 seconds, but it looks like Jaeger does not apply this setting.

The logs contain the following error:

{
  "level": "error",
  "ts": 1655134725.2137299,
  "caller": "app/http_handler.go:487",
  "msg": "HTTP handler, Internal Server Error",
  "error": "search services failed: Post \"https://localhost:8888/jaeger-span-2022-06-13%2Cjaeger-span-2022-06-12%2Cjaeger-span-2022-06-11/_search?ignore_unavailable=true&rest_total_hits_as_int=true\": context canceled",
  "stacktrace": "github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleError\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:487\ngithub.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).search\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:236\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngithub.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5\n\tgithub.com/opentracing-contrib/go-stdlib@v1.0.0/nethttp/server.go:154\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/mux@v1.8.0/mux.go:210\ngithub.com/jaegertracing/jaeger/cmd/query/app.additionalHeadersHandler.func1\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/additional_headers_handler.go:28\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngithub.com/gorilla/handlers.CompressHandlerLevel.func1\n\tgithub.com/gorilla/handlers@v1.5.1/compress.go:141\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngithub.com/gorilla/handlers.recoveryHandler.ServeHTTP\n\tgithub.com/gorilla/handlers@v1.5.1/recovery.go:78\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2916\nnet/http.(*conn).serve\n\tnet/http/server.go:1966"
}

Here are the jaeger-query parameters we use:

/opt/jaeger/jaeger-query --log-level=debug --es.bulk.actions=1000 --es.bulk.flush-interval=200ms  --es.bulk.size=5000000 --es.bulk.workers=1 --es.max-doc-count=10000 --es.max-span-age=48h0m0s --es.timeout=120s --es-archive.timeout=120s --es.server-urls=https://${elasticsearch_url} --es.tls.server-name=${jaeger_cert_domain} --es.tls.enabled

To Reproduce
Steps to reproduce the behavior:

  1. Create a query that matches a large number of documents over a long time period (depends on the number of documents available for search)
  2. The Jaeger UI returns a 500 Internal Server Error after 20 seconds

Expected behavior
Jaeger should not fail queries that take more than 20 seconds. Ideally, it should apply the es.timeout setting.

Screenshots
(screenshot: jaeger_internal_server_error_20s_timeout)

Version (please complete the following information):

  • OS: Linux (RHEL 7)
  • Jaeger version: v1.34.1
  • Deployment: AWS EC2 instances with Elasticsearch storage (OpenSearch cluster). Both the EC2 instances running jaeger-query and the OpenSearch cluster live in the same VPC, and we use a VPC endpoint for the --es.server-urls parameter.

What troubleshooting steps did you try?

  1. Tried the debug logging level
  2. Changed the timeout using --es.timeout (even tried 0s, which means no timeout)
  3. Checked for any additional timeouts (which might be set on the ALB, the OpenSearch cluster, etc.)

Additional context
The problem is more frequent in the production environment, where we have more documents stored in Elasticsearch, so we hit the 20s limit even for short-period queries. At the same time, individual traces load as expected, as does every other query that takes less than 20 seconds.

darthlynx added the bug label on Jun 13, 2022
@yurishkuro
Member

Based on the error message, the Context passed from the HTTP server down to the query/storage layer is getting cancelled while the request is being executed. To my knowledge, http.Server does not have a timeout setting for how long the handler runs; instead, there are other reasons why the context may be cancelled, one of them being that the client connection is closed.
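
A minimal sketch (not Jaeger's actual code; the /search route and port are illustrative) of the mechanism described above: in Go's net/http, the request context is tied to the client connection, so if whatever is in front of the server gives up and disconnects after 20 seconds, any downstream call using that context fails with "context canceled":

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(60 * time.Second): // stand-in for a slow storage query
			w.Write([]byte("done"))
		case <-r.Context().Done(): // fires when the client (or a proxy) disconnects
			// This surfaces downstream as the "context canceled" error in the log above.
			log.Printf("search failed: %v", r.Context().Err())
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}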

Is it possible that the UI is accessing the query service via some kind of proxy that enforces a 20-second timeout? What happens if you curl the query service directly with the same API query?
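
For instance (hypothetical host and service name; 16686 is the query service's default HTTP port):

curl -v "http://localhost:16686/api/traces?service=my-service&limit=20"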

@darthlynx
Author

Hi @yurishkuro,
Thank you for the quick response.
Yes, you are right: there is a proxy server, which I didn't know about, that sits between Jaeger and the Elasticsearch cluster. It uses the Tornado web server, whose request_timeout defaults to 20 seconds.

So there are no issues on the Jaeger side, and this issue may be closed.

Thank you
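
For reference, a minimal sketch of where that default lives, assuming the proxy forwards requests with Tornado's AsyncHTTPClient (the forward_search function and its arguments are hypothetical):

# Hypothetical forwarding helper; Tornado's per-request timeout
# (request_timeout) defaults to 20 seconds and must be raised
# explicitly for searches that run longer.
from tornado.httpclient import AsyncHTTPClient, HTTPRequest

async def forward_search(url: str, body: bytes) -> bytes:
    request = HTTPRequest(
        url,
        method="POST",
        body=body,
        headers={"Content-Type": "application/json"},
        request_timeout=120.0,  # default is 20.0 seconds
    )
    response = await AsyncHTTPClient().fetch(request)
    return response.body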
