
Surface connection problems in the WASM to the Javascript #435

Closed
raoulmillais opened this issue Feb 1, 2021 · 14 comments
@raoulmillais

Consumers of the WASM light client JavaScript need to know about connection errors to boot nodes, as they will want to either alert the user or show a different UI if the client cannot sync. It shouldn't be a hard crash but a connection_error_callback function. I would handle this and emit an 'error' event on the SmoldotProvider (consistent with other PolkadotJS providers).

@tomaka

tomaka commented Feb 2, 2021

There are two categories of reasons why we can fail to connect to a node in general:

  • Because the node is offline, because we have no Internet connection, or because the node is full or overloaded. These are all "normal situations".
  • Because of a protocol error in the communications with the node, or a mismatch between the local chain specs and the ones on the node, or in general if the node is misbehaving. These are all "abnormal situations" that happen because of an incompatibility between the local node and the remote node.

The bootnodes should as much as possible be treated the same way as all the other nodes on the network. Failing to connect to a bootnode because of one of these "normal situations" should in my opinion not be reported to the user.

However, the "abnormal situations" aren't supposed to happen in the case of bootnodes. The reason why you put a bootnode in your chain specs is because you know that it's compatible with your node, and incompatibilities should indeed be reported to the user.

Instead of having a callback for connection errors, I would suggest that you emit an 'error' whenever smoldot emits an error in its logs, and smoldot would emit an error whenever there's an "abnormal situation" w.r.t. the bootnodes.

@raoulmillais

raoulmillais commented Feb 2, 2021

You're right, this is indeed more subtle than my description did it justice.

What if smoldot cannot connect to any bootnodes?

The bootnodes should as much as possible be treated the same way as all the other nodes on the network. Failing to connect to a bootnode because of one of these "normal situations" should in my opinion not be reported to the user.

I'm not sure I agree with this. The client is not usable from the perspective of the consumer if the bootnodes are not connectable and syncing the chain. While this is perfectly "normal" from the perspective of smoldot in the scenario where the user is offline, the consumer will want to know about it so that they can, for example, display some sort of client-offline message.

What if smoldot can connect to some but not all bootnodes?

However, the "abnormal situations" aren't supposed to happen in the case of bootnodes. The reason why you put a bootnode in your chain specs is because you know that it's compatible with your node, and incompatibilities should indeed be reported to the user.

Agreed, this suggests a JavaScript consumer needs more context when smoldot reports bootnode connectivity problems. You can see in the devtools when something is wrong (e.g. the recent 502 bad gateways, and when a TLS connection is not supported), but there is currently no way to programmatically catch these conditions or make UI decisions about what to report to the user and what not to report.

Do we think that we might want to display some sort of degraded performance UI when some but not all bootnodes are available for syncing? Currently on westend the implication is that only one bootnode is available and therefore syncing will be severely hampered as we trigger the DOS protection on the node, but maybe that will not be the case long term and on other networks.

What if the consumer is building some sort of network monitor tool? Maybe the devops team of a substrate chain wants to build a JavaScript dashboard and is using a WASM client to display network status? ... OK, this is probably not realistic as they'd be better off using other tools for monitoring, but maybe they want to have a nicer UI for a dashboard.

What if smoldot discovers other non-bootnodes from the bootnode and fails to connect to them?

In the future, when more than just the bootnodes for a given chain have public TLS-secured websockets, I can think of scenarios where we would like to be able to visualise the adoption of pure browser WASM clients across that network to demonstrate the effectiveness of unstoppable apps (maybe also as a marketing tool to encourage adoption): "How many nodes do I know about and how many did I fail to connect to?"

What other node connectivity use cases are there? What do we want to support? cc @Stefie @goldsteinsveta @wirednkod

@raoulmillais

raoulmillais commented Feb 2, 2021

  • Because of a protocol error in the communications with the node,

Not sure what we do about this... can you give an example of why this would happen @tomaka ?

or a mismatch between the local chain specs and the ones on the node,

I categorise this as programmer error, i.e. the consumer made a mistake doing the integration. We should unceremoniously throw an error.

or in general if the node is misbehaving.

I think what we show depends on if it is one or all of the bootnodes and the consumer needs to make the decision. I.E. is their UI interested in partial bootnode connectivity problems?

These are all "abnormal situations" that happen because of an incompatibility between the local node and the remote node.

@raoulmillais

Instead of having a callback for connection errors, I would suggest that you emit an 'error' whenever smoldot emits an error in its logs, and smoldot would emit an error whenever there's an "abnormal situation" w.r.t. the bootnodes.

This would work, but log grepping feels brittle to me, especially if we want to provide more context as outlined in the "What if smoldot can connect to some but not all bootnodes?" scenario above.

@tomaka

tomaka commented Feb 2, 2021

I'm not sure I agree with this. The client is not usable from the perspective of the consumer if the bootnodes are not connectable and syncing the chain. While this is perfectly "normal" from the perspective of smoldot in the scenario where the user is offline, the consumer will want to know about it so that they can, for example, display some sort of client-offline message.

This could be done by querying the number of peers smoldot is connected to (e.g. through the system_health RPC query), not specifically bootnodes.
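[Editor's note] A minimal sketch of this check: ask the light client for `system_health` over JSON-RPC and treat a non-zero peer count as "connected". `sendJsonRpc` is a stand-in for however the consumer sends requests; only the `system_health` method name and result shape are taken from the discussion:

```javascript
// sendJsonRpc is assumed to take a JSON-RPC request string and resolve to
// the response string. Only the system_health shape is real.
async function isConnected(sendJsonRpc) {
  const response = await sendJsonRpc(JSON.stringify({
    jsonrpc: '2.0',
    id: 1,
    method: 'system_health',
    params: [],
  }));
  // result looks like { isSyncing, peers, shouldHavePeers }
  const { result } = JSON.parse(response);
  return result.peers > 0;
}
```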

Do we think that we might want to display some sort of degraded performance UI when some but not all bootnodes are available for syncing? Currently on westend the implication is that only one bootnode is available and therefore syncing will be severely hampered as we trigger the DOS protection on the node, but maybe that will not be the case long term and on other networks.

Not being connected to any bootnode, but instead to other nodes is a completely normal situation. In fact, in the long term, we probably want to disconnect from bootnodes as soon as we are connected to non-bootnodes, as a polite way to give room to other nodes connecting.

What if the consumer is building some sort of network monitor tool? Maybe the devops team of a substrate chain wants to build a JavaScript dashboard and is using a WASM client to display network status? ... OK, this is probably not realistic as they'd be better off using other tools for monitoring, but maybe they want to have a nicer UI for a dashboard.

I'd suggest that they build their own alternative to what is in bin/wasm-node.

While the content of src is extremely general-purpose, the wasm-node however, as it currently is, is designed to be optimized towards the substrate-connect use-case, not a general purpose node.

This design choice influences more than just logging: old blocks are discarded, storage entries aren't cached, we assume that the finalized chain is close to the head of the chain, etc.

What if smoldot discovers other non-bootnodes from the bootnode and fails to connect to them?

For what it's worth, discovering other nodes is disabled (because there's no other node on the network at the moment anyway), but can be added by adding the missing ~10 lines of code.

However, the discovery code as it exists would just randomly discover parts of the network. Discovering the entire network is a bigger beast.

Not sure what we do about this... can you give an example of why this would happen @tomaka ?

Well, a bug in the code.

I categorise this as programmer error, i.e. the consumer made a mistake doing the integration. We should unceremoniously throw an error.

It can also be caused by a mistake in the bootnode's configuration. We shouldn't make smoldot crash if a bootnode is misconfigured. One of the selling points of the light client is that it should continue to work and can't be hacked even if bootnodes are hacked.

This would work, but log grepping feels brittle to me, especially if we want to provide more context as outlined in the "What if smoldot can connect to some but not all bootnodes?" scenario above.

The reason why I'm suggesting logs is that the entire purpose of error and warn logs is to indicate to the user that something is wrong. Even if we add a mechanism separate from logs to report problems when connecting to bootnodes, it would still be desirable to show error/warn logs to the user anyway.

@raoulmillais

raoulmillais commented Feb 2, 2021

This could be done by querying the number of peers smoldot is connected to (e.g. through the system_health RPC query), not specifically bootnodes.

So you suggest SmoldotProvider reports an error event if isSyncing is true but peers is 0? I don't think that alone gives enough context. We need to treat the window where we only know about bootnodes differently to accurately report when we fail to connect to any nodes (and are probably offline).

I also don't think it is the responsibility of the consumer of the smoldot NPM package to keep track of whether it ever thought smoldot managed to connect to a peer (i.e. that the peer count was non-zero at some point in time), because "client offline" is a common scenario.

I'd suggest that they build their own alternative to what is in bin/wasm-node.

While the content of src is extremely general-purpose, the wasm-node however, as it currently is, is designed to be optimized towards the substrate-connect use-case, not a general purpose node.

This design choice influences more than just logging: old blocks are discarded, storage entries aren't cached, we assume that the finalized chain is close to the head of the chain, etc.

Ok 👍

What if smoldot discovers other non-bootnodes from the bootnode and fails to connect to them?

For what it's worth, discovering other nodes is disabled (because there's no other node on the network at the moment anyway), but can be added by adding the missing ~10 lines of code.

However, the discovery code as it exists would just randomly discover parts of the network. Discovering the entire network is a bigger beast.

Is there a tracking issue where we can talk about plans and ideas for peer discovery, I'm sure there will be lots to think about!

Not sure what we do about this... can you give an example of why this would happen @tomaka ?

Well, a bug in the code.

Ok - see next reply, we should treat it the same

I categorise this as programmer error, i.e. the consumer made a mistake doing the integration. We should unceremoniously throw an error.

It can also be caused by a mistake in the bootnode's configuration. We shouldn't make smoldot crash if a bootnode is misconfigured. One of the selling points of the light client is that it should continue to work and can't be hacked even if bootnodes are hacked.

I didn't mean crash smoldot; I agree it should never crash. We should throw a JavaScript error in the glue code that lives in the smoldot repo. Bugs in the code / incorrect usage should cause the JavaScript application to crash.

This would work, but log grepping feels brittle to me, especially if we want to provide more context as outlined in the "What if smoldot can connect to some but not all bootnodes?" scenario above.

The reason why I'm suggesting logs is that the entire purpose of error and warn logs is to indicate to the user that something is wrong. Even if we add a mechanism separate from logs to report problems when connecting to bootnodes, it would still be desirable to show error/warn logs to the user anyway.

I think, given what you've said above, logs are reasonable beyond the initial bootstrapping phase, and for more complex scenarios a different custom WASM should be built. However we still need to know about "I never managed to connect to any nodes" and I don't think that can reasonably be done by a consumer.

So do we do that in the smoldot javascript code or is it the responsibility of the rust code?

If it is in the JavaScript glue code, then that would be the websocket_new supplied when instantiating the WASM in https://github.com/paritytech/smoldot/blob/main/bin/wasm-node/javascript/index.js#L155, right? Is it reasonable for us to be parsing the chain spec and keeping track of 1. connection attempts to non-bootnodes and 2. counting to see whether all bootnodes failed there? (In case 1 we consider ourselves out of the bootstrapping window, and in case 2 we report that we think we are offline.)
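[Editor's note] The bookkeeping proposed here could be sketched as follows. All names are illustrative (this is not existing smoldot glue code): track connection attempts to the bootnodes listed in the chain spec, and report "probably offline" only once every bootnode has been tried and none succeeded:

```javascript
// Hypothetical tracker for the bootstrapping window described above.
class BootnodeTracker {
  constructor(bootNodes) {
    this.pending = new Set(bootNodes); // bootnodes not yet tried
    this.anySucceeded = false;
  }
  recordSuccess(addr) {
    this.pending.delete(addr);
    this.anySucceeded = true; // we are out of the bootstrapping window
  }
  // Returns true when every bootnode has been tried and none succeeded,
  // i.e. we are probably offline.
  recordFailure(addr) {
    this.pending.delete(addr);
    return this.pending.size === 0 && !this.anySucceeded;
  }
}
```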

@tomaka

tomaka commented Feb 2, 2021

So you suggest SmoldotProvider reports an error event if isSyncing is true but peers is 0? I don't think that alone gives enough context.

I think that whenever peers == 0 you should keep a message on screen saying "No connectivity". Or even treat the situation the same way as when PolkadotJS is configured to talk to a regular JSON-RPC node and can't reach it.

We need to treat the window where we only know about bootnodes differently to accurately report when we fail to connect to any nodes (and are probably offline).

However we still need to know about "I never managed to connect to any nodes" and I don't think that can reasonably be done by a consumer.

I don't understand why you want a different treatment between before and after we have managed to contact bootnodes for the first time.

If we didn't manage to connect to any bootnode because of a bug/problem/error, then it will be reported in the logs.

If we didn't manage to connect to any bootnode because of a connectivity issue (the "normal situation"), then what is the point of telling the user "you were connected to the Internet before, but now it doesn't work anymore", compared to just "no connectivity"?

Is there a tracking issue where we can talk about plans and ideas for peer discovery, I'm sure there will be lots to think about!

Peer discovery is already fully designed and has been fully working for years now. Smoldot only implements the existing system, so there's no concrete plan/idea. The latest development is that W3F has plans to contact academic researchers to explore possible improvements, but that's not tracked by an issue.

I didn't mean crash smoldot, I agree it should never crash. We should throw a JavaScript error in the glue code that lives in the smoldot repo. Bugs in the code / incorrect usage should cause JavaScript application to crash.

The point is the same: substrate-connect/smoldot should continue to be usable normally even if someone manages to get into the Parity servers (or RandomBlockchainCompany's servers) and alter the bootnodes in some way.

@raoulmillais

raoulmillais commented Feb 2, 2021

So you suggest SmoldotProvider reports an error event if isSyncing is true but peers is 0? I don't think that alone gives enough context.

I think that whenever peers == 0 you should keep a message on screen saying "No connectivity". Or even treat the situation the same way as when PolkadotJS is configured to talk to a regular JSON-RPC node and can't reach it.

We need to treat the window where we only know about bootnodes differently to accurately report when we fail to connect to any nodes (and are probably offline).

However we still need to know about "I never managed to connect to any nodes" and I don't think that can reasonably be done by a consumer.

I don't understand why you want a different treatment between before and after we have managed to contact bootnodes for the first time.

Because if the client fails to connect to any boot nodes, it can never make a successful RPC call to find out it isn't connected to any peers. I.e. if it doesn't connect to any peers, it has no way of knowing programmatically that it never did so, and cannot choose to show UI saying it's not connected.

PolkadotJS makes some metadata etc. RPC calls to bootstrap the client, but they aren't triggered by us in the provider, so if they error we can't catch the errors. The failures also don't cause the Api to fail to instantiate.

So the current behaviour is that no errors or events are thrown to indicate that the client isn't connected: The PolkadotJS Api instantiates successfully and the only indication given by smoldot is out-of-band in log messages.

For comparison, in the pre-existing scenario where the PolkadotJS API is connecting to a remote RPC node, it would emit an error when failing to connect the websocket to that node, and so the consumer can handle that and show appropriate UI.

Either PolkadotJS or smoldot needs to change if we want to be able to handle "offline at instantiation" like we can with the existing WsProvider. This is definitely something to keep in mind when/if we upstream SmoldotProvider to PolkadotJS, but it seemed more sensible to me to deal with it in smoldot until then.

If we didn't manage to connect to any bootnode because of a bug/problem/error, then it will be reported in the logs.

If we didn't manage to connect to any bootnode because of a connectivity issue (the "normal situation"), then what is the point of telling the user "you were connected to the Internet before, but now it doesn't work anymore", compared to just "no connectivity"?

Is there a tracking issue where we can talk about plans and ideas for peer discovery, I'm sure there will be lots to think about!

Peer discovery is already fully designed and has been fully working for years now. Smoldot only implements the existing system, so there's no concrete plan/idea. The latest development is that W3F has plans to contact academic researchers to explore possible improvements, but that's not tracked by an issue.

Cool, if you have any pointers for more reading that would be much appreciated 😄 I got the impression from gav's comments in Element that there was more to be done in this area:

"done Truly Right, there should be nothing in there that identifies anybody, not even IP addresses. [S]o that would mean that it'd need to connect to some DHT/kademlia style tracker network to download the current peers."

This is what I was thinking about when I asked if this has a place for discussion.

I didn't mean crash smoldot; I agree it should never crash. We should throw a JavaScript error in the glue code that lives in the smoldot repo. Bugs in the code / incorrect usage should cause the JavaScript application to crash.

The point is the same: substrate-connect/smoldot should continue to be usable normally even if someone manages to get into the Parity servers (or RandomBlockchainCompany's servers) and alter the bootnodes in some way.

Ok, I need to think about this more to get my head around the scenarios. Let's focus this issue on being able to detect not connecting to any bootnodes on instantiation.

@raoulmillais

So the current behaviour is that no errors or events are thrown to indicate that the client isn't connected: The PolkadotJS Api instantiates successfully and the only indication given by smoldot is out-of-band in log messages.

I just tested this by putting Firefox in offline mode, and with the default max_log_level (which we decided not to expose to consumers) nothing at all is logged. There are, however, uncaught browser errors for each of the bootnodes because the websocket failed to connect.

@tomaka

tomaka commented Feb 2, 2021

What is supposed to happen if at initialization smoldot manages to connect to one bootnode, then immediately disconnects from it 10 milliseconds later?
Would you consider that since the initialization is successful, you show an inert node?

Again, my opinion is that the equivalent of "connected to the JSON-RPC" in the WsProvider should be system_health().peers == 0.
If the number of peers is non-zero, it's as if you're connected to the JSON-RPC node. If it goes down to zero, it's as if you're not connected.

"done Truly Right, there should be nothing in there that identifies anybody, not even IP addresses. [S]o that would mean that it'd need to connect to some DHT/kademlia style tracker network to download the current peers."

According to the context, I think he's referring to GitHub not being able to track your IP address.
The bootnodes can still identify users, but there's not much we can do about it because of browsers requiring TLS certificates.
Once connected to bootnodes, peers are indeed discovered through Kademlia.

@raoulmillais

What is supposed to happen if at initialization smoldot manages to connect to one bootnode, then immediately disconnects from it 10 milliseconds later?
Would you consider that since the initialization is successful, you show an inert node?

Again, my opinion is that the equivalent of "connected to the JSON-RPC" in the WsProvider should be system_health().peers == 0.
If the number of peers is non-zero, it's as if you're connected to the JSON-RPC node. If it goes down to zero, it's as if you're not connected.

Ok, I see what you're saying and that makes sense. The idea of connectivity is just fundamentally different. So that suggests that I add system_health pinging in the SmoldotProvider as part of instantiation and trigger error events when peers === 0. This would make the SmoldotProvider a drop-in replacement for a WsProvider and the behaviour consistent 👍 (I'd done that in the browser demo before and it didn't feel right.)
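[Editor's note] The polling idea described here can be sketched with a small pure helper that decides which WsProvider-style event (if any) a fresh `system_health` result should trigger. Wiring it to `setInterval` and to the provider's emitter is assumed, not shown:

```javascript
// Given the previous connectivity state and a fresh system_health result,
// return the provider event to emit, or null if nothing changed.
// Connectivity is judged purely on the peer count, per the discussion.
function healthTransition(wasConnected, health) {
  const connected = health.peers > 0;
  if (connected === wasConnected) return null; // no state change, no event
  return connected ? 'connected' : 'disconnected';
}
```

A polling loop would call this once per interval and emit the returned event on the provider, making the behaviour line up with WsProvider's connected/disconnected events.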

"done Truly Right, there should be nothing in there that identifies anybody, not even IP addresses. [S]o that would mean that it'd need to connect to some DHT/kademlia style tracker network to download the current peers."

According to the context, I think he's referring to GitHub not being able to track your IP address.
The bootnodes can still identify users, but there's not much we can do about it because of browsers requiring TLS certificates.
Once connected to bootnodes, peers are indeed discovered through Kademlia.

@raoulmillais

So the current behaviour is that no errors or events are thrown to indicate that the client isn't connected: The PolkadotJS Api instantiates successfully and the only indication given by smoldot is out-of-band in log messages.

I just tested this by putting Firefox in offline mode, and with the default max_log_level (which we decided not to expose to consumers) nothing at all is logged. There are, however, uncaught browser errors for each of the bootnodes because the websocket failed to connect.

So based on what we discussed: we shouldn't have uncaught browser errors, i.e. we should handle them gracefully in the websocket_new code, but we should see a smoldot error (warn?) log message.

@tomaka

tomaka commented Feb 2, 2021

So based on what we discussed: we shouldn't have uncaught browser errors, i.e. we should handle them gracefully in the websocket_new code, but we should see a smoldot error (warn?) log message.

Well, if the error is about the node being unreachable, I'd say to not print a log message.
If the error is caused for example by a 502 error code, or a missing TLS certificate, print a log message.

In both cases, yes, indeed, it shouldn't go straight in the console because of the browser.
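[Editor's note] The agreed behaviour can be sketched as a classification step in the glue code: swallow "normal" failures (node unreachable) silently, and only produce a log line for "abnormal" ones (e.g. a 502 from a gateway, or TLS problems). The reason strings below are illustrative; a browser's WebSocket API actually exposes very little detail about why a connection failed, so a real implementation has less information than this pretends:

```javascript
// Returns the log line to print for a failed bootnode connection, or null
// when the failure is a "normal situation" that should stay silent.
function logLineForFailure(reason) {
  const normal = ['unreachable', 'timeout', 'offline']; // illustrative
  if (normal.includes(reason)) return null; // stay silent
  return `bootnode connection failed: ${reason}`;
}
```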

@tomaka

tomaka commented Apr 12, 2021

The only remaining actionable item in this issue is #450.
Consequently, closing in favour of #450.

@tomaka tomaka closed this as completed Apr 12, 2021