
Adjust the error message of "Your question took too long" #12423

Open
flamber opened this issue Apr 29, 2020 · 18 comments
Labels: Difficulty:Hard, Priority:P1 (Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness), Querying/, .Team/QueryProcessor :hammer_and_wrench:, Type:Bug (Product defects)

Comments

flamber commented Apr 29, 2020

Describe the bug
The new connection handling in 0.35 brings many benefits, like better pool handling, faster responses, and the ability to handle much higher load, but it also means that Metabase no longer sends a newline every second, which before 0.35 kept the connection alive.

This means the timeouts of the reverse-proxy/load-balancer now need to be adjusted whenever a question's query time exceeds the proxy's timeout.

Otherwise the proxy will close the connection, and this error will be shown in the dashboard/question:

Your question took too long
We didn't get an answer back from your database in time, so we had to stop. You can try again in a minute, or if the problem persists, you can email an admin to let them know.

Workaround
Change the timeout of the reverse-proxy/load-balancer, so the connection between the user/browser and Metabase isn't closed before results are returned.
If you're unsure which proxy is closing the connection, use the browser developer tools' Network tab to see the response headers of the failing request.
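
If Nginx is the proxy in front of Metabase, a minimal sketch of raising the timeouts could look like the following (the server_name and the upstream address/port are assumptions, and 600 seconds is only an example value - pick something above your longest query):

server {
   listen 80;
   server_name metabase.example.com;       # assumed hostname

   location / {
      # Raise the proxy timeouts above the longest expected query time
      proxy_read_timeout 600;
      proxy_send_timeout 600;

      proxy_set_header Host $host;
      proxy_pass http://127.0.0.1:3000;    # assumed Metabase address/port
   }
}

Reload Nginx afterwards (nginx -s reload) for the change to take effect.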

To Reproduce

  1. Set up Metabase behind a proxy with a timeout of 60 seconds (the default of Nginx and many other proxies) - see the sketch after this list
  2. Create a question where the query will exceed 60 seconds (for example Postgres select pg_sleep(65); or MySQL select sleep(65);)
  3. Run the question (either directly or via a dashboard)
  4. After 60 seconds, the proxy closes the connection and the error is shown in the interface.
    There will also be an error in the log, saying that Metabase has lost the connection to the user/browser, since the proxy in between has closed the connection:
    ERROR async.streaming-response :: Error determining whether HTTP request was canceled
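
As a rough sketch of step 1, an Nginx reverse-proxy with default timeouts is enough to reproduce this, since proxy_read_timeout defaults to 60 seconds (the upstream address/port is an assumption):

server {
   listen 80;

   location / {
      # No proxy_read_timeout is set, so the default of 60s applies and
      # Nginx closes the connection if Metabase takes longer to answer
      proxy_set_header Host $host;
      proxy_pass http://127.0.0.1:3000;    # assumed Metabase address/port
   }
}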

Expected behavior
Either of these would probably help:

  1. The error message should probably be rephrased to reflect the behavior of 0.35
  2. It would be great if headers could be validated, so that if the answer doesn't come from Metabase, a different error message is shown
  3. If Metabase could send a keepalive signal to the browser every X seconds (lower than 60), that would be magic and probably avoid the need for proxy/LB timeout adjustments. But given all the issues with the old newline keepalive method, it might not be worth it.

Information about your Metabase Installation:
Metabase 0.35.3 behind a reverse-proxy with a timeout of 60 seconds

Additional context
Based on #12335 and https://discourse.metabase.com/t/your-question-took-too-long-0-35-1/9621
Giving it P2, since this seems like it might be a common problem
Related #11463

The Elastic Beanstalk image has a hardcoded timeout of 600 seconds (10 minutes), which should probably be raised, since it means it's not possible to run queries longer than 10 minutes on EBS even if the load balancer has a higher timeout.
It is possible to change that manually, but every time the instance is upgraded, those changes need to be applied again.

⬇️ Please click the 👍 reaction instead of leaving a "+1" or "update?" comment

viblo commented Jun 25, 2020

Just to add another case: we have deployed Metabase on an Azure App Service, and App Services have a hard timeout of 230 seconds which is not possible to increase.

I wonder if an alternative solution would be to allow running queries in the background somehow? I have some queries that can take 10-20 minutes to run, and for queries that long I don't want to wait in the UI anyway. Instead I could start the query in some background/task list, then come back later and list the queries, their status and, if available, their results. This would extend the types of queries possible to run through Metabase, especially for analytics databases such as Snowflake or BigQuery.

flamber commented Jun 25, 2020

@viblo Queuing is difficult. The old method had severe problems that could overload or crash Metabase. We're still investigating what the best approach would be. You might also be interested in these issues: #10690 and #11328

mfpinhal commented Jul 2, 2020

We are experiencing the same issue, due to Cloudflare's 100s limit (reference here).

@dariusdev

It is not possible to change Cloudflare's 100s limit without an 'enterprise' plan.
That makes it impossible to run longer queries. It would be nice to have an option to keep sending a newline to keep the connection active.

EnilPajic commented May 27, 2021

Hello. Is there any progress on this?
The mentioned "workarounds" of increasing the LB/proxy timeout do not work if we do not control the proxy, as is the case for us - we use Cloudflare, and there is a limit of 100s (mentioned by two previous comments too).

A single option "keep connection alive on long-running queries (send a newline every 1s)" in the Metabase admin would be nice (also mentioned in a previous comment almost 11 months ago).

Limess commented Jun 21, 2021

This is also causing difficulties for us:
We have:
Cloudflare -> ALB -> Nginx -> ALB -> Metabase

with the ALBs being shared between several services.

Increasing idle timeouts/read timeouts across the board doesn't really work for us here and makes Metabase a special case - instead we're just dealing with a hard limit.

If there were still an option to send the keep-alives, we'd definitely enable it.

@hopeswiller

I have come across this issue and followed the process.
I have set this in my Nginx config, but I still get the "Question took too long" message and the request times out:

http{
   ...
   proxy_read_timeout 3600;
   proxy_connect_timeout 3600;
   proxy_send_timeout 3600;
   ...
}

Unsure which proxy might be closing the connection, then use the browser developer Network-tab to see the response headers of the failing request.

I'm not sure what I should be checking for in the response headers

Below is my diagnostic info

{
  "browser-info": {
    "language": "en-GB",
    "platform": "Win32",
    "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.11+9",
    "java.vendor": "AdoptOpenJDK",
    "java.vendor.url": "https://adoptopenjdk.net/",
    "java.version": "11.0.11",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.11+9",
    "os.name": "Linux",
    "os.version": "4.4.0-87-generic",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "postgres",
      "mongo",
      "googleanalytics"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "application-database-details": {
      "database": {
        "name": "PostgreSQL",
        "version": "11.8 (Ubuntu 11.8-1.pgdg16.04+1)"
      },
      "jdbc-driver": {
        "name": "PostgreSQL JDBC Driver",
        "version": "42.2.18"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2021-07-14",
      "tag": "v0.40.1",
      "branch": "release-x.40.x",
      "hash": "ed8f9c8"
    },
    "settings": {
      "report-timezone": null
    }
  }
}

flamber commented Jul 30, 2021

@hopeswiller

Please use the forum for questions and troubleshooting: https://discourse.metabase.com/

You haven't said how long it takes before you get the timeout.
I cannot tell you which response header to look at, since not all proxies replace the Server header. Post all response headers.

Remember to restart Nginx after making the change. 3600 seconds seems excessive; 600 or 1200 should be plenty (for most).

You are setting the timeouts in the http context (at a higher level). It does not show whether you have other configurations in lower contexts like server or location, which might override the higher level.
See all your configuration with nginx -T
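
As an illustration (the values and address are made up), a directive in a more specific context overrides the one inherited from the http level:

http {
   proxy_read_timeout 3600;                # set at the http level

   server {
      location / {
         proxy_read_timeout 60;            # this lower-level value wins for this location
         proxy_pass http://127.0.0.1:3000; # assumed Metabase address/port
      }
   }
}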

flamber added the Difficulty:Hard and Priority:P1 (Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness) labels and removed the Priority:P2 (Average run of the mill bug) and .Frontend labels on Nov 11, 2021
@Czlenson95

Will this problem be addressed in future releases?
The workaround of increasing the timeout on the proxy is not a solution for us.

kszarlej commented Mar 2, 2022

Hello, I also think this should be addressed. For example, the AWS ALB maximum idle timeout is 4000 seconds. This essentially means that if Metabase runs on AWS behind an ALB (which is a pretty standard setup), you cannot use Metabase with queries whose runtime exceeds 4000 seconds (~1 hour and 6 minutes), and we need to query Redshift directly.

Also, in AWS the timeout can only be configured per load balancer. Typically you run multiple applications on the same load balancer, and each gets a unique target group. Since you cannot configure idle_timeout per target group, you end up treating Metabase specially and have to create another, dedicated ALB just for it.

It would be good if we could optionally switch the Metabase frontend to poll the backend for cached results over short-lived connections, instead of waiting for those results on a long-running open connection.

ranquild added the .Team/QueryProcessor :hammer_and_wrench: label on Jun 2, 2023
@seangibeault

Any chance this is on the roadmap?

j-ro commented Dec 14, 2023

I'll add my voice in saying that the existing workarounds here are not sufficient, as sometimes you do not control the proxy (a la Cloudflare).

I'd typically try to break up concerns here, with the frontend kicking off long-running background jobs like fetching a query and then polling regularly for the result. The long-running background job in turn could listen for keepalives from the frontend and cancel if the heartbeat stops. But relying on very long-running processes from the frontend all the way to the database and back seems like perhaps a mistake, and it is certainly causing some pain for us because we cannot get around a hard 100-second timeout via Cloudflare.

@davyzhang

I am experiencing the same problem, and changing Cloudflare is not possible for me. Using WebSockets to send these long-running queries might be a solution.

ketandoshi commented Feb 23, 2024

We are experiencing the same issue, a 524 timeout error after 100s, with Metabase 0.46.8 & Cloudflare. Has anyone found a solution yet?

@alice-telescoop

Same here; we use Metabase on another hosting provider that does not allow controlling the proxy. Is there any work in progress on this?

@paoliniluis
Contributor

@alice-telescoop, even if we adjust the message, the hosting provider will still cut the connection and leave the user without an answer. Why can't you change the hosting provider if the queries are slow?

@alice-telescoop

Our Metabase is a tool for the few business developers on our small team. Changing the hosting provider would take quite a lot of time, which we don't necessarily have. I was actually directed to this issue by the hosting provider itself. Reading your answer, I realize this issue only aims at changing the message. Is there any issue or planned work on some sort of keep-alive system?
