Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: add retry logic to the default WS client #1190

Closed
niklasad1 opened this issue Oct 5, 2023 · 11 comments · Fixed by #1396
Closed

rpc: add retry logic to the default WS client #1190

niklasad1 opened this issue Oct 5, 2023 · 11 comments · Fixed by #1396

Comments

@niklasad1
Copy link
Member

niklasad1 commented Oct 5, 2023

Apparently users are having issues with that RPC client connection closes paritytech/jsonrpsee#949 (comment) and it's not possible to detect that easily in subxt.

Suggestions:

  1. Provide an abstraction for "a reconnecting WS client" re-established active subscriptions and re-send requests that were not finished. The downside it that it's possible to loose a subscription notifications when reconnecting. Lost notifications which may cause to submit a transaction twice for something author_submitAndWatch if the Finalized event is lost.
  2. Add example code how to deal with lost connections in subxt, it's unfortunately not super easy to implement because to RpcError variant has to be downcasted to a JsonrpseeError and check if it's JsonrpseeError::RestartedNeeded. Sadly this is not possible to do without restarting the subxt client.

There may be more similar issues with the recently added RPC spec v2 implementation.
Other ideas are welcome

/cc @jsdw @lexnv

@niklasad1 niklasad1 changed the title rpc: add reconnect logic to the default rpc client rpc: add retry logic to the default WS client Oct 6, 2023
@ara-selini
Copy link

ara-selini commented Nov 20, 2023

Hi, coming from paritytech/jsonrpsee#949

I am using a simple block subscriber now. To simulate a network error, I turn off my wifi. The code does not provide any errors or even Nones -- it just keeps awaiting for the next element. Both for subxt and jsonrpsee.

When using jsonrpsee, it is possible to detect failures similarly to the code snippet you shared in the above link but doing that in subxt is pretty difficult e.g. this code doesn't work because subxt takes ownership of the client:

        let ws_client = WsClientBuilder::default().build(node_url).await.unwrap();
        tokio::spawn(async move {
            ws_client.on_disconnect().await;  <----- 1st move here
            println!("Disconnected");
        });

        let rpc_client = RpcClient::new(ws_client);   <--- 2nd move here
        let client = OnlineClient::<PolkadotConfig>::from_rpc_client(rpc_client).await?;

So it sounds like detecting disconnects is difficult in subxt which means I can't go ahead with it in sensitive production code. Are there any temporary fixes (or other libraries) you would recommend in the meantime? Alternatively, I might just go ahead with implementing the functionality I need with reconnecting on top of jsonrpsee but that wouldn't be elegant.

@niklasad1
Copy link
Member Author

niklasad1 commented Nov 21, 2023

@ara-selini

You need to put the ws-client in an Arcfor that to work but that doesn't seems to work in subxt v0.32

@jsdw

Isn't possible to inject in an Arc<ClientT> anymore after the latest rpc refactoring?
Ideally, it should be possible to do

  let client = Arc::new(
        WsClientBuilder::default()
            .build("ws://localhost:9944")
            .await
            .unwrap(),
    );

    let c = RpcClient::new(client.clone());

    let subxt_client = OnlineClient::<PolkadotConfig>::from_rpc_client(c);
}

Can we implement RpcClientT for Arc<jsonrpsee::WsClient> to help with that or what do you reckon?

@niklasad1
Copy link
Member Author

I forgot to mention in this issue that I written a reconnecting-ws-client-wrapper crate that you could try but I haven't tested it with subxt yet.

@ara-selini
Copy link

Thanks a lot for the prompt response and commit! Any idea when the next release for subxt will happen -- or should I just use master in the meantime?

I reviewed your reconnecting-ws-client-wrapper crate today for a few hours and ran some tests. It looks pretty good! A few points:

  1. When the WS server crashes or closes the WS, it works great.

  2. When I test by connecting to an Alchemy node and stream blocks and turn my wifi off after a few seconds to simulate a network error, the code does not detect the failure. I did the same with equivalent code I had in TypeScript and it still did not detect a failure. My theory would be that a ping/pong is explicitly required and a heartbeat task is needed which checks that the last pong was not too early ago. I think this would be something very useful to avoid network failure edge cases.

  3. When I turn the wifi back on, Alchemy sends all the missed blocks (I suppose they store a cache of un-ACKed messages maybe and resend). This is a bit of an edge case because in some cases you might not want the missed messages e.g. if you have a trigger on a new block, the code would react to the first missed block (unless the client code explicitly sets a timeout-based cancellation of the trigger i.e. if haven't received a new block in the last 10 seconds, stop the trigger). That would work but an "Err" message when a disconnect happens would be much easier and it should be fairly easy to implement.

Also it doesn't work with Rust 1.71 so might want to mention it. (It works with 1.74).

@niklasad1
Copy link
Member Author

niklasad1 commented Nov 21, 2023

Yeah, I think the client needs to make an actual request for it to detect socket I/O error when you disconnect the network.

The jsonrpsee ws client has ping/pong support but it's disabled by default but IIRC we don't terminate the connection if the pong doesn't arrive in time.However, one would likely end up with an I/O error anyway if the network is disconnected and you will see the error.

I can enable the websocket ping/pongs in my repo other for you to test.

@ara-selini
Copy link

Yup, for making requests, (I haven't tested) but I think you'd just get an error.

Edge case mainly is important for subscriptions (right now there is no way to detect). ping/pong would be great and am happy to test if you implement it!

An Err returned on the subscription would be useful as well e.g. Err(Disconnect) so the client code knows how to handle a reconnect. (That way the scenario you mentioned regarding clients who need to receive all requests should mostly be handled as the client codes can just restart the ReconnectingWsClient). I think you can do it in the reconnect function -- just need to pass around the rx returned to the client code and send an Err in reconnect function before the actual reconnecting. Alternatively, for code that requires a restart if the WS disconnects, clients can just use the underlying jsonrpsee and detect the error themselves.

@niklasad1
Copy link
Member Author

An Err returned on the subscription would be useful as well e.g. Err(Disconnect) so the client code knows how to handle a reconnect. (That way the scenario you mentioned regarding clients who need to receive all requests should mostly be handled as the client codes can just restart the ReconnectingWsClient). I think you can do it in the reconnect function -- just need to pass around the rx returned to the client code and send an Err in reconnect function before the actual reconnecting. Alternatively, for code that requires a restart if the WS disconnects, clients can just use the underlying jsonrpsee and detect the error themselves.

The idea is to hide such "reconnects" from the user and the requests and subscriptions, the library just takes care of it under the hood "like it never has been disconnected" but it has a bunch of edge-cases where one may loose subscription message(s) when reconnecting/network is unavailable but as you say with a proper ping/pong mechanism I think it could work quite well.

An Err returned on the subscription would be useful as well e.g. Err(Disconnect) so the client code knows how to handle a reconnect. (That way the scenario you mentioned regarding clients who need to receive all requests should mostly be handled as the client codes can just restart the ReconnectingWsClient). I think you can do it in the reconnect function -- just need to pass around the rx returned to the client code and send an Err in reconnect function before the actual reconnecting. Alternatively, for code that requires a restart if the WS disconnects, clients can just use the underlying jsonrpsee and detect the error themselves.

For such things it's better to folks to implement their "own disconnect logic" on top of jsonrpsee instead, it would not a very pleasant API to deal with. Then the subscription stream in jsonrpsee isn't really easy to determine whether a subscription was closed or the connection was closed, so we could improve that.

However, I added functionality the reconnecting ws client to enable configure ws ping/pongs.
You can do just do:

        let client = ReconnectingWsClient::new(
            addr,
            ExponentialBackoff::from_millis(10),
            PingConfig::Enabled(Duration::from_secs(6))
        )

@ara-selini
Copy link

I ran some tests and it seems to work reasonably well. I haven't tested with subxt yet but I'd let you know if I face any issues with it. Thanks for the prompt responses!

@jsdw
Copy link
Collaborator

jsdw commented Jan 4, 2024

Some thoughts on the next steps for how we could handle reconnection better in Subxt with this reconnecting jsonrpsee Client:

  • Add an explicit RpcError::DisconnectedWillReconnect variant to subxt::error::RpcError to allow subscriptions produced from the Backend trait to signal that the RPC client has disconnected and intends to reconnect (so that users will know that events may be lost at this point while there is a network issue, but also that reconnection will occur). A non-reconnecting client will never use this variant (perhaps we should also add an RpcError::Disconnected variant too, but less important to think about now).
  • Then, have the underlying subxt::backend::rpc::RpcClientT implementation from the reconnecting jsonrpsee client spit out this error variant on disconnection.
  • Add a method to subxt::Error like .is_disconnected_will_reconnect() to easily check for this variant, so that users know that they can keep listening to subscriptions if they want (they can of course still opt to drop them and restart them or do whatever else, if they prefer).
  • We may need to modify some bits in the Backend impls to be aware of this DisconnectedWillReconnect error and handle themselves appropriately in this case. I think the error should be forwarded to users always (so that people know that messages may be lost), but in some cases it may not make sense to keep the subscription open (eg when you submit a tx; we can't restart that except by trying to submit the same TX again, which may or may not be the right thing to do..). Look through the methods and see which ones may need special handling.

@niklasad1
Copy link
Member Author

All right I get it now, I have modified the reconnecting ws client now to emit an event DisconnectWillReconnect every time a reconnection occurs and I'll add this extra error variant in subxt to test how it will work out.

There is also an API for subscribe to re-connections which a future that returns once an attempt to a reconnection occurs because the subscriptions notifs may be buffered and lagg behind a bit until all messages have read.

@niklasad1
Copy link
Member Author

niklasad1 commented Feb 9, 2024

Hey again,

We have added experimental support for a reconnecting rpc client behind the feature flag unstable-reconnecting-rpc-client which really is first step towards having some kind of reconnecting scheme in subxt.

As described earlier in this "reconnecting scheme" has some issues such as that subscriptions are "re-subscribed" on re-connect and it may try to submit the transaction twice if for instance author_submitAndWatch but the second transaction will be rejected by chain but still not good.

Still, would be good to get feedback after you have tested/tried it.

Morever, the reconnecting scheme will not work well with the new rpc v2 specification and we will open another issue to track this and provide some more specific subxt reconnecting features.

Example can be found here: https://github.com/paritytech/subxt/blob/master/subxt/examples/reconnecting_rpc_client.rs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants