Internal Server Error after 20 seconds for big queries in Jaeger UI #958

Closed
darthlynx opened this issue Jun 13, 2022 · 2 comments

@darthlynx

Describe the bug

We use jaeger-query version 1.34.1 with Elasticsearch storage (an OpenSearch cluster in our case). Both the EC2 instances running jaeger-query and the OpenSearch cluster live in the same VPC, and we use a VPC endpoint for the --es.server-urls parameter.

This setup works as expected when we search for small traces, but Jaeger returns a 500 Internal Server Error after a 20-second timeout for big traces (please find the screenshot below). At the same time, we set es.timeout to 120 seconds, but it looks like Jaeger does not apply this setting.

The logs contain the following error:

{
  "level": "error",
  "ts": 1655134725.2137299,
  "caller": "app/http_handler.go:487",
  "msg": "HTTP handler, Internal Server Error",
  "error": "search services failed: Post \"https://localhost:8888/jaeger-span-2022-06-13%2Cjaeger-span-2022-06-12%2Cjaeger-span-2022-06-11/_search?ignore_unavailable=true&rest_total_hits_as_int=true\": context canceled",
  "stacktrace": "github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleError\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:487\ngithub.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).search\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:236\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngithub.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5\n\tgithub.com/opentracing-contrib/go-stdlib@v1.0.0/nethttp/server.go:154\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/mux@v1.8.0/mux.go:210\ngithub.com/jaegertracing/jaeger/cmd/query/app.additionalHeadersHandler.func1\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/additional_headers_handler.go:28\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngithub.com/gorilla/handlers.CompressHandlerLevel.func1\n\tgithub.com/gorilla/handlers@v1.5.1/compress.go:141\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngithub.com/gorilla/handlers.recoveryHandler.ServeHTTP\n\tgithub.com/gorilla/handlers@v1.5.1/recovery.go:78\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2916\nnet/http.(*conn).serve\n\tnet/http/server.go:1966"
}

Here are the jaeger-query parameters we use:

/opt/jaeger/jaeger-query --log-level=debug --es.bulk.actions=1000 --es.bulk.flush-interval=200ms  --es.bulk.size=5000000 --es.bulk.workers=1 --es.max-doc-count=10000 --es.max-span-age=48h0m0s --es.timeout=120s --es-archive.timeout=120s --es.server-urls=https://${elasticsearch_url} --es.tls.server-name=${jaeger_cert_domain} --es.tls.enabled

To Reproduce
Steps to reproduce the behavior:

  1. Create a query that matches a large number of documents over a long time period (depends on the number of documents available for search)
  2. The Jaeger UI returns a 500 Internal Server Error after 20 seconds

Expected behavior
Jaeger should not fail queries that take more than 20 seconds. Ideally, it should apply the es.timeout setting.

Screenshots
(screenshot: jaeger_internal_server_error_20s_timeout)

Version (please complete the following information):

  • OS: Linux (RHEL 7)
  • Jaeger version: v1.34.1
  • Deployment: AWS EC2 instances with Elasticsearch storage (OpenSearch cluster). Both the EC2 instances running jaeger-query and the OpenSearch cluster live in the same VPC, and we use a VPC endpoint for the --es.server-urls parameter.

What troubleshooting steps did you try?

  1. Tried the debug logging level
  2. Changed the timeout using --es.timeout (even tried 0s, which means no timeout)
  3. Checked for any additional timeouts (which might be set on the ALB, the OpenSearch cluster, etc.)

Additional context
The problem is more frequent in the production environment, where we have more documents stored in Elasticsearch, so we hit the 20s limit even for short-period queries. At the same time, individual traces load as expected, as does every other query that takes less than 20 seconds.

darthlynx added the bug label on Jun 13, 2022
@yurishkuro
Member

Based on the error message, the Context passed from the HTTP server down to the query/storage layer is getting cancelled while the request is being executed. To my knowledge, http.Server does not have a timeout setting for how long the handler runs; instead, there are other reasons why the context may be cancelled, one of them being that the client connection is closed.
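
A minimal sketch (not Jaeger's actual code; the /search route and port are illustrative) of the mechanism described above: in Go's net/http, the request context is tied to the client connection, so if whatever is in front of the server gives up and disconnects after 20 seconds, any downstream call using that context fails with "context canceled":

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(60 * time.Second): // stand-in for a slow storage query
			w.Write([]byte("done"))
		case <-r.Context().Done(): // fires when the client (or a proxy) disconnects
			// This surfaces downstream as the "context canceled" error in the log above.
			log.Printf("search failed: %v", r.Context().Err())
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}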

Is it possible that the UI is accessing the query service via some kind of proxy that enforces a 20-second timeout? What happens if you curl the query service directly with the same API query?
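
For instance (hypothetical host and service name; 16686 is the query service's default HTTP port):

curl -v "http://localhost:16686/api/traces?service=my-service&limit=20"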

@darthlynx
Author

Hi @yurishkuro,
Thank you for the quick response.
Yes, you are right: there is a proxy server, which I didn't know about, that sits between Jaeger and the Elasticsearch cluster. It uses the Tornado web server, whose request_timeout defaults to 20 seconds.

So there are no issues on the Jaeger side, and this issue may be closed.

Thank you
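
For reference, a minimal sketch of where that default lives, assuming the proxy forwards requests with Tornado's AsyncHTTPClient (the forward_search function and its arguments are hypothetical):

# Hypothetical forwarding helper; Tornado's per-request timeout
# (request_timeout) defaults to 20 seconds and must be raised
# explicitly for searches that run longer.
from tornado.httpclient import AsyncHTTPClient, HTTPRequest

async def forward_search(url: str, body: bytes) -> bytes:
    request = HTTPRequest(
        url,
        method="POST",
        body=body,
        headers={"Content-Type": "application/json"},
        request_timeout=120.0,  # default is 20.0 seconds
    )
    response = await AsyncHTTPClient().fetch(request)
    return response.body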
