API does not reconnect to stacks-blockchain node in some situations #1584

Closed
CharlieC3 opened this issue Mar 15, 2023 · 8 comments · Fixed by #1640

@CharlieC3
Member

Describe the bug
When the postgres database and stacks-node are terminated while an API writer continues running, the API writer sometimes is unable to re-connect to the stacks-blockchain nodes after it and the postgres database come back online after being down for an extended period (~30 minutes or so).
Despite this, the API logs still continue to indicate the API writer is attempting to connect to the stacks-node as one would expect:

Warning: failed to connect to node RPC server at mainnet-stacks-blockchain-snapshot:20443

However, I still think something is wrong here. If I restart the stacks-node again, the API keeps printing the log above and still can't connect. If I restart the API writer after the stacks-node and postgres db have already come back online, it starts working.
This test therefore suggests the issue resides in the API rather than the stacks-node.

@zone117x
Member

If you exec into the pod when this is happening, is mainnet-stacks-blockchain-snapshot reachable with curl? I'm wondering if there is some DNS caching issue somewhere.
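For reference, a minimal sketch of that kind of in-pod check (Node 18+ with its global fetch is assumed; the host and port are taken from the log line above):

    // Ad-hoc diagnostic sketch, not part of the API codebase.
    import { lookup } from 'node:dns/promises';

    const host = 'mainnet-stacks-blockchain-snapshot';
    const port = 20443;

    async function checkNodeReachable(): Promise<void> {
      // Resolve the hostname directly so a stale cached IP would be visible.
      const { address } = await lookup(host);
      console.log(`resolved ${host} -> ${address}`);

      // Hit the same RPC endpoint the API uses.
      const res = await fetch(`http://${host}:${port}/v2/info`);
      console.log(`GET /v2/info -> HTTP ${res.status}`);
    }

    checkNodeReachable().catch(err => console.error('node RPC unreachable:', err));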

@CharlieC3
Member Author

I'll test that the next time it happens, but it doesn't occur every day, more like every two weeks or so. However, we explicitly have DNS caching disabled, so I would not expect DNS caching to be the cause.
In the meantime, anything you can find here would be a big help.

@rafaelcr
Collaborator

Next steps: set up some alerts to get more info about this

@zone117x
Member

Here's a PR that will include verbose error details in that log entry which could help narrow down the problem: #1585

@CharlieC3
Member Author

I was able to find this happening again. I've confirmed a few interesting things:

  • DNS caching is not a problem (IPs matched)
  • The API writer could not reach the stacks-node on port 20443 because the node was not listening on that port
  • The API writer could not reach localhost:3999 because port 3999 was not open
  • Restarting the API writer did not solve the problem
  • Restarting both the API writer and stacks-node fixed the problem

What's strange is that this deployment was previously working, then broke without the API writer ever restarting. That means port 3999 was open at some point, but the API apparently closes port 3999 if it cannot contact the stacks-node for a while, creating this circular dependency.
I think that's the actual bug, and the solution is to keep port 3999 open and responding even when the stacks-node goes down.
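To illustrate the shape of that fix, here is a rough sketch of a listener that stays up regardless of core-node reachability (the URL, interval, and response body are illustrative assumptions, not the API's actual implementation):

    // Sketch only: decouple the public listener from core-node reachability.
    import { createServer } from 'node:http';

    const CORE_RPC_INFO_URL = 'http://mainnet-stacks-blockchain-snapshot:20443/v2/info';
    let coreNodeReachable = false;

    // Background probe: it only updates a flag and never touches the listener.
    setInterval(async () => {
      try {
        const res = await fetch(CORE_RPC_INFO_URL);
        coreNodeReachable = res.ok;
      } catch {
        coreNodeReachable = false;
      }
    }, 5_000);

    const server = createServer((req, res) => {
      // Report degraded state via the response instead of closing the port.
      res.writeHead(coreNodeReachable ? 200 : 503, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ core_node_reachable: coreNodeReachable }));
    });

    // Port 3999 is opened once and stays open even while the stacks-node is down.
    server.listen(3999);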

As for why the stacks-node needed to be restarted, @rafaelcr and I suspect the stacks-node may have hit a retry limit and given up (unconfirmed). There should not be a limit on the number of retries in that logic, but in any case that's outside the scope of this issue.

@CharlieC3
Member Author

The maintenance I'm working on caused this to happen again. This time I tried restarting just the stacks-node, and once it came back up it was able to connect to the API writer. This might warrant a look at how the stacks-node attempts to retry connections to an event observer.

@deantchi
Contributor

@rafaelcr per our morning discussion, here is a sample of the logs from yesterday's API issue
api
stx node

It was around the 6:03 PM PST mark.

@zone117x
Member

zone117x commented May 1, 2023

I'm now more confident about the cause: it is essentially two different issues, which @CharlieC3 has mostly narrowed down in the comments above.

During startup, the API-writer-mode process performs the following sequential steps; each must succeed before the next:

(Legend: 🟢 = probably okay, 🟠 = potentially sus)

  1. 🟢 Wait until a successful connection to postgres is established:
    dbStore = await PgStore.connect({
      usageName: `datastore-${apiMode}`,
    });
  2. 🟢 Start the event-observer server (opens port 3700), which allows the stacks-node to issue POST requests to the API. The stacks-node will stall all other operations and retry these POST requests until successful:
    const eventServer = await startEventServer({
      datastore: dbWriteStore,
      chainId: configuredChainID,
    });
    • 🟠 The stacks-node should be configured so that it does not have a retry limit here; otherwise the whole deployment can fail if it gives up. One of the failure cases described in the comments above sounds like it could be attributed to a retry failure here.
  3. 🟠 Issue an http request to the stacks-node RPC :20443/v2/info endpoint:
    const networkChainId = await getStacksNodeChainID();
    • This retries forever until it succeeds once. It should not cause a circular dependency; however, it does seem like a brittle setup. If for some reason the stacks-node is not responding or is stalling on :20443/v2/info, then we're likely in a deadlock. This scenario also sounds like one of the failure modes described in the comments above.
    • Fortunately, this /v2/info request is not strictly necessary -- it's only a safety check to ensure the STACKS_CHAIN_ID env var matches the stacks-node mode (mainnet vs testnet). Bad things happen later on during runtime if this is misconfigured, so we should probably leave it as the default behavior, but we can add another env var to disable this check and thus prevent a possible deadlock here (a sketch of such an opt-out follows this list).
  4. 🟢 A background connection monitor is started:
    monitorCoreRpcConnection().catch(error => {
      logger.error(`Error monitoring RPC connection: ${error}`, error);
    });
    • This is what logs the message Warning: failed to connect to node RPC ...; however, it is only informational. It does not stop or slow the rest of the API startup process, and it has no impact on the API during runtime. The API does not close/re-open ports here; it simply monitors and logs.
  5. 🟢 The API starts the public http server (opens port 3999):
    const apiServer = await startApiServer({
      datastore: dbStore,
      writeDatastore: dbWriteStore,
      chainId: getApiConfiguredChainID(),
    });
    • There are no intentional code paths where port 3999 is closed/re-opened. It's possible that the API could enter a bugged shutdown state where something leaves it partially shut down; however, we'd see pretty loud indications of that buggy state, and we haven't seen a shutdown bug like this in over a year IIRC. I don't think this could be the cause of any of the issues described in the comments above.
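To make the step 3 opt-out concrete, here is a rough sketch of how such a check with an escape hatch could look. The env var name STACKS_API_SKIP_CHAIN_ID_CHECK is hypothetical (not necessarily what the eventual fix uses), and the two helpers are passed in because they are the functions quoted in the steps above:

    // Sketch only: optionally skip the startup chain-ID safety check (step 3).
    async function verifyChainIdOrSkip(
      getStacksNodeChainID: () => Promise<number>, // quoted in step 3; retries until the node answers
      getApiConfiguredChainID: () => number // quoted in step 5; reads STACKS_CHAIN_ID
    ): Promise<void> {
      // Hypothetical env var: opting out avoids a deadlock when :20443/v2/info never responds.
      if (process.env.STACKS_API_SKIP_CHAIN_ID_CHECK === 'true') {
        console.warn('Skipping the stacks-node chain-ID safety check');
        return;
      }
      const networkChainId = await getStacksNodeChainID();
      if (networkChainId !== getApiConfiguredChainID()) {
        throw new Error('STACKS_CHAIN_ID does not match the chain ID reported by the stacks-node');
      }
    }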

The likely steps to fix this issue:

  • Add a new config option that disables step 3, eliminating that possible deadlock.
  • Ensure the event-emitter stacks-node does not have a retry limit configured for http POSTs to the API's port 3700.
  • (Informational) Change the message in the background connection monitor described in step 4 to indicate that it's informational-only (see the sketch below).
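And for that last bullet, a sketch of what an explicitly informational-only monitor could look like (the loop, interval, and wording are illustrative; the real monitorCoreRpcConnection may be structured differently):

    // Sketch only: a purely informational core-RPC monitor (step 4).
    // It never closes ports or interrupts startup; it only logs reachability changes.
    async function monitorCoreRpcConnection(rpcInfoUrl: string): Promise<void> {
      let wasReachable = true;
      for (;;) {
        try {
          const res = await fetch(rpcInfoUrl);
          if (!res.ok) throw new Error(`HTTP ${res.status}`);
          if (!wasReachable) console.info(`[info only] node RPC reachable again at ${rpcInfoUrl}`);
          wasReachable = true;
        } catch (error) {
          console.warn(`[info only] failed to connect to node RPC server at ${rpcInfoUrl}: ${error}`);
          wasReachable = false;
        }
        await new Promise(resolve => setTimeout(resolve, 10_000));
      }
    }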

blockstack-devops pushed a commit that referenced this issue Jun 7, 2023
## [7.2.2](v7.2.1...v7.2.2) (2023-06-07)

### Bug Fixes

* option to skip RPC request during init which may avoid startup deadlocks [#1584](#1584) ([#1640](#1640)) ([e0e61d3](e0e61d3))
LakshmiLavanyaKasturi pushed a commit that referenced this issue Jun 14, 2023
@smcclellan smcclellan added this to the Q2-2023 milestone Jun 21, 2023