Websocket protocol errors cause full node halt/stall/crash #3124
Comments
And from another node
If it's reproducible it would be really helpful to gather logs with
@tomusdrw here are detailed trace logs. Let me know if you want me to post more, but I think this is just being replicated from a downed node. https://gist.github.com/drewstone/6b53510604755f18443880a6b6cbba88 EDIT: The node is only outputting info of the sort contained in the gist; it has dropped off the network.
@tomaka maybe something in your future rework could have led to this problem?
@bkchr The RPC runs on separate threads that aren't covered by any of my rework. The logic of these threads is entirely contained within the
@drewstone unfortunately these logs don't really tell me anything because I can't correlate them with any other activity happening on that node (is it pre-crash, right after crash, etc.). Could you please send a complete log that shows both
@bkchr I also don't think it's related. This error seems to come purely from someone sending an incorrect WS upgrade request, which should simply be rejected, but I'm really interested in why it interferes with regular node operations.
@tomusdrw Here are more extensive logs; my tmux has clipped them, though, so I cannot see how the error originates. On the topic, the node never crashes per se, but it does drop its connection to the rest of the network. Instead, the node is just stuck outputting these logs. https://gist.github.com/drewstone/32432cd67a0e55f8ea4a8707e8170105
So the error doesn't seem to be related:
It seems like some kind of automated WordPress exploit script that tries to connect to your WS server, but the server correctly closes the connection and keeps operating after that afaict. Can you post
Yep, will report back once it halts again.
One way to look at this, which isn't going to solve this particular problem but might still be worth thinking about, is that we could try to run different components in their own processes, so that a crash in one doesn't bring down the entire node. There is precedent for this in ongoing refactorings in web browsers, see for example: https://www.chromium.org/servicification
I got the same error when I used a load balancer configured to forward https(wss):9944 -> http(ws):9944. The Substrate node got:
@jiangfuyao This error is not critical - it just means that an unexpected request has been made to the WebSockets server (for instance an HTTP POST). It's just a regular node operation. Did your node stop working after that? If not, then it's not related to this issue. Also, I'm going to close this as it seems it has been 6 months without a halt ("Yep, will report back once it halts again."). Please re-open if the issue happens again.
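To illustrate what "unexpected request" means here: a WebSocket server only accepts an HTTP GET that carries the upgrade headers described in RFC 6455, so a plain HTTP POST is rejected at the handshake. The sketch below is a simplified check under that assumption. The function name `is_websocket_upgrade` is hypothetical, and this is not the code the node's WS server actually runs.

```python
def is_websocket_upgrade(method, headers):
    """Return True if the request looks like an RFC 6455 opening handshake.

    Simplified illustration: real servers also validate the HTTP version,
    the Host header, and the format of Sec-WebSocket-Key.
    """
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value.strip() for name, value in headers.items()}
    # Connection is a comma-separated token list, e.g. "keep-alive, Upgrade".
    connection_tokens = {
        token.strip().lower() for token in h.get("connection", "").split(",")
    }
    return (
        method.upper() == "GET"
        and h.get("upgrade", "").lower() == "websocket"
        and "upgrade" in connection_tokens
        and bool(h.get("sec-websocket-key"))
        and h.get("sec-websocket-version") == "13"
    )
```

A request that fails this kind of check (a POST, or a GET without the upgrade headers) is what produces the protocol error in the log. On its own it should only cause that one connection to be closed.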
The node works, but wss can't connect. If I stop the load balancer service, the error disappears.
@jiangfuyao How does this service work? Maybe it expects some liveness endpoint to be open on the WS server? It tries to ping that endpoint, fails (you get the error message), and then assumes the service is dead, hence it doesn't work. I think we are running
For anyone who lands here in the future: I got this websocket error (but not a node crash or anything like that) when I was using curl to submit RPC requests to the ws port (9944 by default) rather than the correct http port (9933 by default).
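A minimal sketch of keeping the two transports apart, assuming the default ports mentioned in this comment (a node can be configured differently). The helper names `rpc_url` and `rpc_body` are hypothetical and only build the URL and the JSON-RPC 2.0 payload; they do not send anything.

```python
import json

# Default ports as mentioned in this thread; a node may be configured differently.
DEFAULT_PORTS = {"http": 9933, "ws": 9944}


def rpc_url(transport, host="localhost"):
    """Build the RPC URL for the given transport ("http" or "ws")."""
    if transport not in DEFAULT_PORTS:
        raise ValueError(f"unknown transport: {transport!r}")
    return f"{transport}://{host}:{DEFAULT_PORTS[transport]}"


def rpc_body(method, params=None, request_id=1):
    """Serialise a JSON-RPC 2.0 request body."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params or []}
    )
```

A curl-style HTTP POST would then target `rpc_url("http")` with a body such as `rpc_body("system_health")` (used here only as an example method name), while `rpc_url("ws")` is for clients that perform a WebSocket handshake first.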
We're running a custom runtime against master from ~July 14th. Some nodes are configured with rpc-cors "*". Oftentimes, the following happens, which requires a full restart of the node. The node doesn't respond to ctrl + c; it must be shut down externally (closing the window, etc.).