Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection. #9227

Open
julianomcl opened this issue Nov 18, 2022 · 17 comments
Labels
k/bug Something isn't working

Comments

@julianomcl
Copy link

Version Information

Server Version: v2.10.0
CLI Version (for CLI related issue): v2.10.0

What is the current behaviour?

We have Hasura hosted on a Cloud Run on Google Cloud Platform connected with a Postgres and a SQL Server databases.
From time to time, Hasura is losing connection with the SQL Server database and unable to reconnect again.

We are receiving this exception message:

[Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection.

What is the expected behaviour?

We would like to Hasura reconnect with SQL Server automatically, so we don't have to release a new revision on Cloud SQL to recover the connection.

Please provide any traces or logs that could help here.

This is the log we have from Cloud Run:

{
  "insertId": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "jsonPayload": {
    "level": "error",
    "detail": {
      "error": "database query error",
      "internal": {
        "exception": {
          "type": "unsuccessful_return_code",
          "message": "[Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection."
        },
        "query": "BEGIN TRANSACTION"
      },
      "path": "$",
      "code": "unexpected"
    },
    "timestamp": "2022-11-18T11:49:34.097+0000",
    "type": "event-trigger"
  },
  "resource": {
    "type": "cloud_run_revision",
    "labels": {
      "configuration_name": "hasura-prd",
      "service_name": "hasura-prd",
      "location": "southamerica-east1",
      "revision_name": "hasura-prd-xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
      "project_id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    }
  },
  "timestamp": "2022-11-18T11:49:35.103839Z",
  "labels": {
    "instanceId": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  },
  "logName": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "receiveTimestamp": "2022-11-18T11:49:35.195698667Z"
}

Keywords

sql server connection broken

@julianomcl julianomcl added the k/bug Something isn't working label Nov 18, 2022
@ajohnson1200
Copy link

Thanks for bringing this up. My guess here is that we're not doing proper connection pool validation (examples: https://confluence.atlassian.com/conf76/surviving-database-connection-closures-1018769693.html) and that behind the scenes GCP is doing maintenance / moving around DB's (as is their operational model).

Could you share what you have set for idle timeout and max connections for the DB in question in the Hasura Console --> Data tab?

@julianomcl
Copy link
Author

@ajohnson1200 Thanks for the response.
We have these values:

  • Max Connections Per Instance: 50
  • Idle Timeout: 5 (seconds)

@SamirTalwar
Copy link
Contributor

Hi @julianomcl.

I've been digging into this issue and unfortunately can't reproduce it. The idle timeout of 5 seconds that you have set should mean that any connections to the database are either dropped within 5 seconds, or used once and then discarded, as we drop any connections where we receive an error (for exactly this reason). As far as I can tell this has been the case since versions of Hasura pre-v2.10.

In addition, I would have thought that Google Cloud Run would be periodically restarting your containers anyway (the documentation suggests that there's a maximum container lifetime of 1 hour, but I might be wrong here). Are you using a mechanism which makes the Hasura container persist longer than this?

Can you tell me if you're you still seeing these issues? And have you tried upgrading Hasura to see if it makes the problem go away?

@tadeumaia
Copy link

tadeumaia commented May 30, 2023

Hello, I work with @julianomcl

Thanks for all the work you already did debugging this issue, but we are still having this issue.

Hasura is connected to two DB's one MSSQL and one POSTGRES, these problems only happen on the MSSQL database and locks all queries, after a container restart at cloud run the error is resolved and everything goes back to normal until we get hit by the error again.

Could you please provide us with some guidance on how to further debug this?

Here is some more information that could help:

Connection String
Driver={ODBC Driver 17 for SQL Server};Server=cloud-sql-internal-hostname:1433;Database=ourdb;Uid=user;Pwd=pwd;

Hasura Settings
HASURA_GRAPHQL_ENABLED_LOGTYPES | startup,http-log,query-log,websocket-log,webhook-log
HASURA_GRAPHQL_DEV_MODE | false
HASURA_GRAPHQL_METADATA_DATABASE_URL | Our postgressdb
HASURA_GRAPHQL_ENABLE_CONSOLE | true
HASURA_GRAPHQL_ENABLED_APIS | graphql, metadata

Cloud Run settings
resources:
limits:
cpu: 2000m
memory: 1G
startupProbe:
timeoutSeconds: 240
periodSeconds: 240
failureThreshold: 1
tcpSocket:
port: 8080

@lucasnad27
Copy link

lucasnad27 commented Jun 16, 2023

I'm connecting to a MS SQL Server database, running in Azure and encountering the same issue :/

I'm running v2.27.0-cloud.1 Happy to provide more details if needed. I can update an ENV var, which I believe triggers a restart in the hasura cloud environment and everything goes back to normal.

@lucasnad27
Copy link

I'm trying to understand better how vital this bug is to address. I'm preparing to put my Hasura Cloud instance into its first production environment and expect these connection errors to attempt to self-heal. I have this happen at least once every other week, leading to my entire site being down.

My only way around this issue is to go into the Hasura cloud dashboard, update an env var, triggering a restart of the server. I've hooked Hasura up to numerous Postgres services without any issues. Is anyone running SQL Server instances on top of Hasura cloud in a production capacity? Surely there are others with this problem.

I know these issues are hard to reproduce. Fwiw -- it happens when I'm actively developing. e.g., adding new tables, changing permissions, etc.,

@SamirTalwar
Copy link
Contributor

Hey folks, after a lot of prodding, I have come no closer to reproducing this.

Could you please give me as much information as you can about your database setup? For example, the last test I ran was on a Microsoft Azure SQL Database, with the pricing tier "General Purpose - Serverless: Gen5, 1 vCore".

It'd also be helpful to get exact version information:

> SELECT @@version AS version;
Microsoft SQL Azure (RTM) - 12.0.2000.8 Jun 1 2023 13:36:49 Copyright (C) 2022 Microsoft Corporation

@julianomcl
Copy link
Author

Hi, @SamirTalwar, thank you for your assistance with this matter.

I have executed the query and obtained the following result:

Microsoft SQL Server 2019 (RTM-CU19) (KB5023049) - 15.0.4298.1 (X64) Jan 27 2023 16:44:09 Copyright (C) 2019 Microsoft Corporation Standard Edition (64-bit) on Linux (Ubuntu 18.04.6 LTS) <X64>

@lucasnad27
Copy link

Version info as requested:

Microsoft SQL Azure (RTM) - 12.0.2000.8 Jun 1 2023 13:36:49 Copyright (C) 2022 Microsoft Corporation

I'm using Azure's fully-managed SQL Server:

  • Service tier: "Standard"
  • DTUs: 10
  • Max Storage 250 GB

Let me know if I can provide any additional information to help you debug. If I discover any sort of pattern to these outages, I'll be sure to post here as well. Given that it doesn't happen very often, it might be awhile 😮‍💨

@SamirTalwar
Copy link
Contributor

Hi folks, we're having trouble getting to the bottom of this. To find out if it's our own connection pooling mechanism or not, we have added a new toggle in v2.30.1. You can set your pool settings to {"enable": false} (using the CLI or API; the Console does not support this).

When you have access to v2.30.1, please give this a shot and let us know if it resolves the issue or not. Either way, it will be valuable information for tracking down the issue.

Thanks.

@lucasnad27
Copy link

Happy to do so Samir. I haven't used the CLI much, but should be able to get this done by EOW. I'll let you know when I'm able to flip the bit and hopefully I can re-produce the issue in the coming weeks 🤞🏻

@tadeumaia
Copy link

Some update from us to help you debug this.

We migrated our Hasura deployment from Cloud Run to GKE, both were using GCP VPN to connect to the database. All hasura configuration stayed the same.

We have been 22 days without a problem now.

@lucasnad27
Copy link

@SamirTalwar, perhaps I'm being dense, but I don't see an obvious way to update my pool settings via hasura CLI. I've configured my CLI to point to the Hasura cloud instance via HASURA_GRAPHQL_ENDPOINT & HASURA_GRAPHQL_ADMIN_SECRET env vars. I've looked at the help documentation but I need help finding the right command.

Hasura version: v2.30.1-cloud.1
CLI version: v2.17.0

@SamirTalwar
Copy link
Contributor

@lucasnad27: Using the CLI, you'll need to export your metadata with hasura metadata export, find your pool_settings (typically in databases.yaml), and update them to say something like:

pool_settings:
  enable: false

Then re-apply them with hasura metadata apply.

I highly recommend upgrading your CLI to match the server version first.

@lucasnad27
Copy link

lucasnad27 commented Jul 31, 2023

@SamirTalwar thanks for the guidance. Very helpful. I've updated the setting -- found in databases.yaml as mentioned -- to disable pool settings.

I'm hoping my staging hasura instance has enough traffic to replicate the issue I'm seeing in production (happened again a few days ago). If I don't see any issues on the staging database over the coming days, I'll make the same change to my production server. I'll keep this thread updated with my progress & findings.

@SamirTalwar
Copy link
Contributor

Great, thanks!

Assuming it makes it to your production instance and you're monitoring machine statistics, it would be great to know if there's major changes to CPU load, memory usage, or network traffic anywhere.

If you still have issues, please let me know.

@lucasnad27
Copy link

Promoted metadata update to production a few minutes ago. I don't have a ton of traffic on that instance yet, but it should be enough to give us some data points. I'll circle back early next week on this thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
k/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants