[Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection. #9227

julianomcl · 2022-11-18T16:08:09Z

Version Information

Server Version: v2.10.0
CLI Version (for CLI related issue): v2.10.0

What is the current behaviour?

We have Hasura hosted on a Cloud Run on Google Cloud Platform connected with a Postgres and a SQL Server databases.
From time to time, Hasura is losing connection with the SQL Server database and unable to reconnect again.

We are receiving this exception message:

[Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection.

What is the expected behaviour?

We would like to Hasura reconnect with SQL Server automatically, so we don't have to release a new revision on Cloud SQL to recover the connection.

Please provide any traces or logs that could help here.

This is the log we have from Cloud Run:

{
  "insertId": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "jsonPayload": {
    "level": "error",
    "detail": {
      "error": "database query error",
      "internal": {
        "exception": {
          "type": "unsuccessful_return_code",
          "message": "[Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection."
        },
        "query": "BEGIN TRANSACTION"
      },
      "path": "$",
      "code": "unexpected"
    },
    "timestamp": "2022-11-18T11:49:34.097+0000",
    "type": "event-trigger"
  },
  "resource": {
    "type": "cloud_run_revision",
    "labels": {
      "configuration_name": "hasura-prd",
      "service_name": "hasura-prd",
      "location": "southamerica-east1",
      "revision_name": "hasura-prd-xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
      "project_id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    }
  },
  "timestamp": "2022-11-18T11:49:35.103839Z",
  "labels": {
    "instanceId": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  },
  "logName": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "receiveTimestamp": "2022-11-18T11:49:35.195698667Z"
}

Keywords

sql server connection broken

The text was updated successfully, but these errors were encountered:

ajohnson1200 · 2022-11-18T22:41:21Z

Thanks for bringing this up. My guess here is that we're not doing proper connection pool validation (examples: https://confluence.atlassian.com/conf76/surviving-database-connection-closures-1018769693.html) and that behind the scenes GCP is doing maintenance / moving around DB's (as is their operational model).

Could you share what you have set for idle timeout and max connections for the DB in question in the Hasura Console --> Data tab?

julianomcl · 2022-11-21T14:45:19Z

@ajohnson1200 Thanks for the response.
We have these values:

Max Connections Per Instance: 50
Idle Timeout: 5 (seconds)

SamirTalwar · 2022-12-20T13:03:23Z

Hi @julianomcl.

I've been digging into this issue and unfortunately can't reproduce it. The idle timeout of 5 seconds that you have set should mean that any connections to the database are either dropped within 5 seconds, or used once and then discarded, as we drop any connections where we receive an error (for exactly this reason). As far as I can tell this has been the case since versions of Hasura pre-v2.10.

In addition, I would have thought that Google Cloud Run would be periodically restarting your containers anyway (the documentation suggests that there's a maximum container lifetime of 1 hour, but I might be wrong here). Are you using a mechanism which makes the Hasura container persist longer than this?

Can you tell me if you're you still seeing these issues? And have you tried upgrading Hasura to see if it makes the problem go away?

tadeumaia · 2023-05-30T14:55:42Z

Hello, I work with @julianomcl

Thanks for all the work you already did debugging this issue, but we are still having this issue.

Hasura is connected to two DB's one MSSQL and one POSTGRES, these problems only happen on the MSSQL database and locks all queries, after a container restart at cloud run the error is resolved and everything goes back to normal until we get hit by the error again.

Could you please provide us with some guidance on how to further debug this?

Here is some more information that could help:

Connection String
Driver={ODBC Driver 17 for SQL Server};Server=cloud-sql-internal-hostname:1433;Database=ourdb;Uid=user;Pwd=pwd;

Hasura Settings
HASURA_GRAPHQL_ENABLED_LOGTYPES | startup,http-log,query-log,websocket-log,webhook-log
HASURA_GRAPHQL_DEV_MODE | false
HASURA_GRAPHQL_METADATA_DATABASE_URL | Our postgressdb
HASURA_GRAPHQL_ENABLE_CONSOLE | true
HASURA_GRAPHQL_ENABLED_APIS | graphql, metadata

Cloud Run settings
resources:
limits:
cpu: 2000m
memory: 1G
startupProbe:
timeoutSeconds: 240
periodSeconds: 240
failureThreshold: 1
tcpSocket:
port: 8080

lucasnad27 · 2023-06-16T18:35:58Z

I'm connecting to a MS SQL Server database, running in Azure and encountering the same issue :/

I'm running v2.27.0-cloud.1 Happy to provide more details if needed. I can update an ENV var, which I believe triggers a restart in the hasura cloud environment and everything goes back to normal.

lucasnad27 · 2023-06-27T15:40:48Z

I'm trying to understand better how vital this bug is to address. I'm preparing to put my Hasura Cloud instance into its first production environment and expect these connection errors to attempt to self-heal. I have this happen at least once every other week, leading to my entire site being down.

My only way around this issue is to go into the Hasura cloud dashboard, update an env var, triggering a restart of the server. I've hooked Hasura up to numerous Postgres services without any issues. Is anyone running SQL Server instances on top of Hasura cloud in a production capacity? Surely there are others with this problem.

I know these issues are hard to reproduce. Fwiw -- it happens when I'm actively developing. e.g., adding new tables, changing permissions, etc.,

SamirTalwar · 2023-06-29T18:18:31Z

Hey folks, after a lot of prodding, I have come no closer to reproducing this.

Could you please give me as much information as you can about your database setup? For example, the last test I ran was on a Microsoft Azure SQL Database, with the pricing tier "General Purpose - Serverless: Gen5, 1 vCore".

It'd also be helpful to get exact version information:

> SELECT @@version AS version;
Microsoft SQL Azure (RTM) - 12.0.2000.8 Jun 1 2023 13:36:49 Copyright (C) 2022 Microsoft Corporation

julianomcl · 2023-06-29T18:29:57Z

Hi, @SamirTalwar, thank you for your assistance with this matter.

I have executed the query and obtained the following result:

Microsoft SQL Server 2019 (RTM-CU19) (KB5023049) - 15.0.4298.1 (X64) Jan 27 2023 16:44:09 Copyright (C) 2019 Microsoft Corporation Standard Edition (64-bit) on Linux (Ubuntu 18.04.6 LTS) <X64>

lucasnad27 · 2023-06-29T20:35:18Z

Version info as requested:

I'm using Azure's fully-managed SQL Server:

Service tier: "Standard"
DTUs: 10
Max Storage 250 GB

Let me know if I can provide any additional information to help you debug. If I discover any sort of pattern to these outages, I'll be sure to post here as well. Given that it doesn't happen very often, it might be awhile 😮‍💨

SamirTalwar · 2023-07-24T19:35:15Z

Hi folks, we're having trouble getting to the bottom of this. To find out if it's our own connection pooling mechanism or not, we have added a new toggle in v2.30.1. You can set your pool settings to {"enable": false} (using the CLI or API; the Console does not support this).

When you have access to v2.30.1, please give this a shot and let us know if it resolves the issue or not. Either way, it will be valuable information for tracking down the issue.

Thanks.

lucasnad27 · 2023-07-26T23:09:47Z

Happy to do so Samir. I haven't used the CLI much, but should be able to get this done by EOW. I'll let you know when I'm able to flip the bit and hopefully I can re-produce the issue in the coming weeks 🤞🏻

tadeumaia · 2023-07-27T13:20:44Z

Some update from us to help you debug this.

We migrated our Hasura deployment from Cloud Run to GKE, both were using GCP VPN to connect to the database. All hasura configuration stayed the same.

We have been 22 days without a problem now.

lucasnad27 · 2023-07-29T20:08:17Z

@SamirTalwar, perhaps I'm being dense, but I don't see an obvious way to update my pool settings via hasura CLI. I've configured my CLI to point to the Hasura cloud instance via HASURA_GRAPHQL_ENDPOINT & HASURA_GRAPHQL_ADMIN_SECRET env vars. I've looked at the help documentation but I need help finding the right command.

Hasura version: v2.30.1-cloud.1
CLI version: v2.17.0

SamirTalwar · 2023-07-31T09:01:03Z

@lucasnad27: Using the CLI, you'll need to export your metadata with hasura metadata export, find your pool_settings (typically in databases.yaml), and update them to say something like:

pool_settings:
  enable: false

Then re-apply them with hasura metadata apply.

I highly recommend upgrading your CLI to match the server version first.

lucasnad27 · 2023-07-31T21:09:14Z

@SamirTalwar thanks for the guidance. Very helpful. I've updated the setting -- found in databases.yaml as mentioned -- to disable pool settings.

I'm hoping my staging hasura instance has enough traffic to replicate the issue I'm seeing in production (happened again a few days ago). If I don't see any issues on the staging database over the coming days, I'll make the same change to my production server. I'll keep this thread updated with my progress & findings.

SamirTalwar · 2023-08-01T11:43:38Z

Great, thanks!

Assuming it makes it to your production instance and you're monitoring machine statistics, it would be great to know if there's major changes to CPU load, memory usage, or network traffic anywhere.

If you still have issues, please let me know.

lucasnad27 · 2023-08-03T23:27:07Z

Promoted metadata update to production a few minutes ago. I don't have a ton of traffic on that instance yet, but it should be enough to give us some data points. I'll circle back early next week on this thread.

julianomcl added the k/bug Something isn't working label Nov 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection. #9227

[Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection. #9227

julianomcl commented Nov 18, 2022

ajohnson1200 commented Nov 18, 2022

julianomcl commented Nov 21, 2022

SamirTalwar commented Dec 20, 2022

tadeumaia commented May 30, 2023 •

edited

lucasnad27 commented Jun 16, 2023 •

edited

lucasnad27 commented Jun 27, 2023

SamirTalwar commented Jun 29, 2023

julianomcl commented Jun 29, 2023

lucasnad27 commented Jun 29, 2023

SamirTalwar commented Jul 24, 2023

lucasnad27 commented Jul 26, 2023

tadeumaia commented Jul 27, 2023

lucasnad27 commented Jul 29, 2023

SamirTalwar commented Jul 31, 2023

lucasnad27 commented Jul 31, 2023 •

edited

SamirTalwar commented Aug 1, 2023

lucasnad27 commented Aug 3, 2023

[Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection. #9227

[Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection. #9227

Comments

julianomcl commented Nov 18, 2022

Version Information

What is the current behaviour?

What is the expected behaviour?

Please provide any traces or logs that could help here.

Keywords

ajohnson1200 commented Nov 18, 2022

julianomcl commented Nov 21, 2022

SamirTalwar commented Dec 20, 2022

tadeumaia commented May 30, 2023 • edited

lucasnad27 commented Jun 16, 2023 • edited

lucasnad27 commented Jun 27, 2023

SamirTalwar commented Jun 29, 2023

julianomcl commented Jun 29, 2023

lucasnad27 commented Jun 29, 2023

SamirTalwar commented Jul 24, 2023

lucasnad27 commented Jul 26, 2023

tadeumaia commented Jul 27, 2023

lucasnad27 commented Jul 29, 2023

SamirTalwar commented Jul 31, 2023

lucasnad27 commented Jul 31, 2023 • edited

SamirTalwar commented Aug 1, 2023

lucasnad27 commented Aug 3, 2023

tadeumaia commented May 30, 2023 •

edited

lucasnad27 commented Jun 16, 2023 •

edited

lucasnad27 commented Jul 31, 2023 •

edited