API does not reconnect to stacks-blockchain node in some situations #1584

Closed
CharlieC3 opened this issue Mar 15, 2023 · 8 comments · Fixed by #1640

@CharlieC3
Member

Describe the bug
When the postgres database and stacks-node are terminated while an API writer continues running, the API writer sometimes is unable to re-connect to the stacks-blockchain nodes after it and the postgres database come back online after being down for an extended period (~30 minutes or so).
Despite this, the API logs still continue to indicate the API writer is attempting to connect to the stacks-node as one would expect:

Warning: failed to connect to node RPC server at mainnet-stacks-blockchain-snapshot:20443

However, I still think something is wrong here. If I restart the stacks-node again, the API keeps printing the log above and still can't connect. If I restart the API writer after the stacks-node and postgres db have already come back online, it starts working.
This test therefore suggests the issue resides in the API rather than the stacks-node.

@zone117x
Member

If you exec into the pod when this is happening, is mainnet-stacks-blockchain-snapshot reachable with curl? I'm wondering if there is some DNS caching issue somewhere.
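For reference, a minimal sketch of that kind of in-pod check (Node 18+ with its global fetch is assumed; the host and port are taken from the log line above):

    // Ad-hoc diagnostic sketch, not part of the API codebase.
    import { lookup } from 'node:dns/promises';

    const host = 'mainnet-stacks-blockchain-snapshot';
    const port = 20443;

    async function checkNodeReachable(): Promise<void> {
      // Resolve the hostname directly so a stale cached IP would be visible.
      const { address } = await lookup(host);
      console.log(`resolved ${host} -> ${address}`);

      // Hit the same RPC endpoint the API uses.
      const res = await fetch(`http://${host}:${port}/v2/info`);
      console.log(`GET /v2/info -> HTTP ${res.status}`);
    }

    checkNodeReachable().catch(err => console.error('node RPC unreachable:', err));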

@CharlieC3
Member Author

I'll test that the next time it happens, but it doesn't occur every day, more like every two weeks or so. However, we explicitly have DNS caching disabled, so I would not expect DNS caching to be the cause.
In the meantime, anything you can find here would be a big help.

@rafaelcr
Collaborator

Next steps: set up some alerts to get more info about this

@zone117x
Member

Here's a PR that will include verbose error details in that log entry which could help narrow down the problem: #1585

@CharlieC3
Member Author

I was able to find this happening again. I've confirmed a few interesting things:

  • DNS caching is not a problem (IPs matched)
  • The API writer could not reach the stacks-node on port 20443 because the node was not listening on that port
  • The API writer could not reach localhost:3999 because port 3999 was not open
  • Restarting the API writer did not solve the problem
  • Restarting both the API writer and stacks-node fixed the problem

What's strange is that this deployment was previously working, then broke without the API writer ever restarting. That means port 3999 was open at some point, but the API apparently closes port 3999 if it cannot contact the stacks-node for a while, creating this circular dependency.
I think that's the actual bug, and the solution is to keep port 3999 open and responding even when the stacks-node goes down.
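To illustrate the shape of that fix, here is a rough sketch of a listener that stays up regardless of core-node reachability (the URL, interval, and response body are illustrative assumptions, not the API's actual implementation):

    // Sketch only: decouple the public listener from core-node reachability.
    import { createServer } from 'node:http';

    const CORE_RPC_INFO_URL = 'http://mainnet-stacks-blockchain-snapshot:20443/v2/info';
    let coreNodeReachable = false;

    // Background probe: it only updates a flag and never touches the listener.
    setInterval(async () => {
      try {
        const res = await fetch(CORE_RPC_INFO_URL);
        coreNodeReachable = res.ok;
      } catch {
        coreNodeReachable = false;
      }
    }, 5_000);

    const server = createServer((req, res) => {
      // Report degraded state via the response instead of closing the port.
      res.writeHead(coreNodeReachable ? 200 : 503, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ core_node_reachable: coreNodeReachable }));
    });

    // Port 3999 is opened once and stays open even while the stacks-node is down.
    server.listen(3999);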

As for why the stacks-node needed to be restarted, @rafaelcr and I suspect the stacks-node may have hit a retry limit and given up (unconfirmed). There should not be a limit on the number of retries in that logic, but in any case that's outside the scope of this issue.

@CharlieC3
Member Author

The maintenance I'm working on caused this to happen again. This time I tried restarting just the stacks-node, and once it came back up it was able to connect to the API writer. This might warrant a look at how the stacks-node attempts to retry connections to an event observer.

@deantchi
Contributor

@rafaelcr per our morning discussion, here is a sample of the logs from yesterday's API issue
api
stx node

It was around the 6:03 PM PST mark.

@zone117x
Member

zone117x commented May 1, 2023

I'm now more confident about the cause: it is essentially two different issues, which @CharlieC3 has mostly narrowed down in the comments above.

During startup, the API-writer-mode process performs the following sequential steps; each must succeed before the next:

(Legend: 🟢 = probably okay, 🟠 = potentially sus)

  1. 🟢 Wait until a successful connection to postgres is established:
    dbStore = await PgStore.connect({
      usageName: `datastore-${apiMode}`,
    });
  2. 🟢 Start the event-observer server (opens port 3700), which allows the stacks-node to issue POST requests to the API. The stacks-node will stall all other operations and retry these POST requests until successful:
    const eventServer = await startEventServer({
      datastore: dbWriteStore,
      chainId: configuredChainID,
    });
    • 🟠 The stacks-node should be configured so that it does not have a retry limit here; otherwise the whole deployment can fail if it gives up. One of the failure cases described in the comments above sounds like it could be attributed to a retry failure here.
  3. 🟠 Issue an http request to the stacks-node RPC :20443/v2/info endpoint:
    const networkChainId = await getStacksNodeChainID();
    • This retries forever until it succeeds once. It should not cause a circular dependency; however, it does seem like a brittle setup. If for some reason the stacks-node is not responding or is stalling on :20443/v2/info, then we're likely in a deadlock. This scenario also sounds like one of the failure modes described in the comments above.
    • Fortunately, this /v2/info request is not strictly necessary -- it's only a safety check to ensure the STACKS_CHAIN_ID env var matches the stacks-node mode (mainnet vs testnet). Bad things happen later on during runtime if this is misconfigured, so we should probably leave it as the default behavior, but we can add another env var to disable this check and thus prevent a possible deadlock here (a sketch of such an opt-out follows this list).
  4. 🟢 A background connection monitor is started:
    monitorCoreRpcConnection().catch(error => {
      logger.error(`Error monitoring RPC connection: ${error}`, error);
    });
    • This is what logs the message Warning: failed to connect to node RPC ...; however, it is only informational. It does not stop or slow the rest of the API startup process, and it has no impact on the API during runtime. The API does not close/re-open ports here; it simply monitors and logs.
  5. 🟢 The API starts the public http server (opens port 3999):
    const apiServer = await startApiServer({
      datastore: dbStore,
      writeDatastore: dbWriteStore,
      chainId: getApiConfiguredChainID(),
    });
    • There are no intentional code paths where port 3999 is closed/re-opened. It's possible that the API could enter a bugged shutdown state where something leaves it partially shut down; however, we'd see pretty loud indications of that buggy state, and we haven't seen a shutdown bug like this in over a year IIRC. I don't think this could be the cause of any of the issues described in the comments above.
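To make the step 3 opt-out concrete, here is a rough sketch of how such a check with an escape hatch could look. The env var name STACKS_API_SKIP_CHAIN_ID_CHECK is hypothetical (not necessarily what the eventual fix uses), and the two helpers are passed in because they are the functions quoted in the steps above:

    // Sketch only: optionally skip the startup chain-ID safety check (step 3).
    async function verifyChainIdOrSkip(
      getStacksNodeChainID: () => Promise<number>, // quoted in step 3; retries until the node answers
      getApiConfiguredChainID: () => number // quoted in step 5; reads STACKS_CHAIN_ID
    ): Promise<void> {
      // Hypothetical env var: opting out avoids a deadlock when :20443/v2/info never responds.
      if (process.env.STACKS_API_SKIP_CHAIN_ID_CHECK === 'true') {
        console.warn('Skipping the stacks-node chain-ID safety check');
        return;
      }
      const networkChainId = await getStacksNodeChainID();
      if (networkChainId !== getApiConfiguredChainID()) {
        throw new Error('STACKS_CHAIN_ID does not match the chain ID reported by the stacks-node');
      }
    }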

The likely steps to fix this issue:

  • Add a new config option that disables step 3, eliminating that possible deadlock.
  • Ensure the event-emitter stacks-node does not have a retry limit configured for http POSTs to the API's port 3700.
  • (Informational) Change the message in the background connection monitor described in step 4 to indicate that it's informational-only (see the sketch below).
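And for that last bullet, a sketch of what an explicitly informational-only monitor could look like (the loop, interval, and wording are illustrative; the real monitorCoreRpcConnection may be structured differently):

    // Sketch only: a purely informational core-RPC monitor (step 4).
    // It never closes ports or interrupts startup; it only logs reachability changes.
    async function monitorCoreRpcConnection(rpcInfoUrl: string): Promise<void> {
      let wasReachable = true;
      for (;;) {
        try {
          const res = await fetch(rpcInfoUrl);
          if (!res.ok) throw new Error(`HTTP ${res.status}`);
          if (!wasReachable) console.info(`[info only] node RPC reachable again at ${rpcInfoUrl}`);
          wasReachable = true;
        } catch (error) {
          console.warn(`[info only] failed to connect to node RPC server at ${rpcInfoUrl}: ${error}`);
          wasReachable = false;
        }
        await new Promise(resolve => setTimeout(resolve, 10_000));
      }
    }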

blockstack-devops pushed a commit that referenced this issue Jun 7, 2023
## [7.2.2](v7.2.1...v7.2.2) (2023-06-07)

### Bug Fixes

* option to skip RPC request during init which may avoid startup deadlocks [#1584](#1584) ([#1640](#1640)) ([e0e61d3](e0e61d3))
LakshmiLavanyaKasturi pushed a commit that referenced this issue Jun 14, 2023
@smcclellan smcclellan added this to the Q2-2023 milestone Jun 21, 2023