API does not reconnect to stacks-blockchain node in some situations #1584
Comments
If you exec into the pod when this is happening, is |
I'll test that the next time it happens, but it doesn't occur every day; more like every two weeks or so. However, we have DNS caching explicitly disabled, so I would not expect this to be DNS-related.
Next steps: set up some alerts to get more info about this.
Here's a PR that will include verbose error details in that log entry, which could help narrow down the problem: #1585
I was able to find this happening again. I've confirmed a few interesting things:
What's strange is that this deployment was previously working, then broke without the API writer ever restarting. This means port
As for why the stacks-node needed to be restarted, @rafaelcr and I think the stacks-node may have a retry limit after which it gives up (unconfirmed). There should be no limit on the number of retries in that logic, but in any case that is outside the scope of this issue.
The maintenance I'm working on caused this to happen again. This time I tried restarting just the stacks-node, and once it came back up it was able to connect to the API writer. This might warrant a look at how the stacks-node retries connections to an event observer.
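To make the "no retry limit" idea concrete, here is a minimal sketch of an unbounded retry loop with capped exponential backoff. It is an illustration only, written in Python rather than the stacks-node's Rust, and it is not the stacks-node's actual retry logic; `connect_with_unbounded_retry`, `backoff_delays`, and the `base`/`cap` defaults are all hypothetical names and values.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 1.0, cap: float = 60.0) -> Iterator[float]:
    """Yield delays that double from `base` and never exceed `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


def connect_with_unbounded_retry(
    connect: Callable[[], object],
    sleep: Callable[[float], None] = time.sleep,
    base: float = 1.0,
    cap: float = 60.0,
) -> object:
    """Call `connect` until it succeeds; there is deliberately no attempt limit."""
    for delay in backoff_delays(base, cap):
        try:
            return connect()
        except OSError:
            # Connection refused, timeout, DNS failure, etc.: wait, then try again.
            sleep(delay)
    raise AssertionError("unreachable: backoff_delays never ends")
```

Capping the delay keeps the time to reconnect bounded once the observer comes back, while the missing attempt limit means a long outage (such as the ~30 minutes described in this issue) cannot exhaust the retries.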
I'm more confident now about the cause of this issue, and it is essentially two different issues that @CharlieC3 has mostly narrowed down in the above comments. During startup, the API process in writer mode performs the following steps in sequence. Each must succeed before the next begins: (Legend: 🟢 = probably okay, 🟠 = potentially sus)
The likely steps to fixing this issue:
Describe the bug
When the postgres database and stacks-node are terminated while an API writer continues running, the API writer is sometimes unable to reconnect to the stacks-blockchain nodes after they and the postgres database come back online following an extended outage (~30 minutes or so).
Despite this, the API logs continue to indicate that the API writer is attempting to connect to the stacks-node, as one would expect:
However, I still think there is something wrong here. If I try restarting the stacks-node again, the API will continue printing the above log and won't be able to connect. If I try restarting the API writer after the stacks-node and postgres db have already come back online, it starts working.
Thus, this test seems to indicate the issue resides in the API rather than the stacks-node.
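One way to check this from the outside the next time it happens is to probe whether the API writer's event observer port still accepts TCP connections. The sketch below is only a diagnostic aid, not part of the API; `port_accepts_connections` is a hypothetical helper, and the host and port must be filled in from the actual deployment.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

If this returns False from inside the stacks-node pod while the API writer process is still running, that would point at the API no longer listening, rather than at the stacks-node's retry behavior.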