Terminal only supports UTF-8 data, non-UTF-8 data disconnects clients #77

Open
mossblaser opened this issue Oct 10, 2023 · 0 comments
Because the WebREPL protocol uses text-encoded websocket messages to pass terminal input and output, sending or receiving any byte sequence that is not valid UTF-8 will cause any conformant websocket implementation to immediately drop the connection.

Proof of concept

As a minimal proof-of-concept, consider the following snippet:

>>> import sys
>>> _ = sys.stdout.buffer.write(b"\xff\n")
Disconnected

For a REPL interacting with a human, this is not a major limitation. However, as a WiFi-attached analogue to a serial port (which is certainly an appealing use case for webrepl IMO) it is problematic.

As a perhaps slightly more motivating example, consider the following snippet, which prints a '£', encoded in UTF-8 as a two-byte sequence:

>>> import sys
>>> for d in b'\xc2\xa3\n':
...     _ = sys.stdout.buffer.write(bytes([d]))
...
Disconnected

Because the two bytes in '£' are written separately, they are transmitted in separate websocket messages -- neither of which (individually) contains a valid UTF-8 sequence, again forcing compliant websocket implementations to disconnect.
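To illustrate why the per-byte writes fail, here is a small sketch in plain CPython (not MicroPython or WebREPL code) that checks each payload the way a validating receiver might. The helper name `is_valid_utf8` is made up for this example.

```python
def is_valid_utf8(payload: bytes) -> bool:
    """Return True if the payload decodes as UTF-8 on its own."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


pound = "£".encode("utf-8")  # the two-byte sequence b'\xc2\xa3'

# Sent as one text message, the payload is well-formed.
whole_ok = is_valid_utf8(pound)

# Sent as two one-byte text messages, each payload is checked in isolation:
# the first is a truncated sequence, the second a stray continuation byte.
per_byte_ok = [is_valid_utf8(bytes([b])) for b in pound]

print(whole_ok, per_byte_ok)
```

The same check rejects `b"\xff\n"` from the first snippet, since `0xff` can never appear in UTF-8.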

Note: the webrepl_cli.py script's websocket implementation is non-compliant (it does not verify UTF-8 well-formedness) so the above snippets will not result in disconnection. Many "industrial strength" websocket implementations, however, will drop the connection -- including those in browsers and many libraries.

What can be done?

It seems to me that there isn't an obvious good fix here but it boils down to whether supporting non-UTF-8 values is in-scope or not.

If non-UTF-8 data is not supported, at a minimum this ought to be made explicit and documented. In an ideal world, micropython ought to buffer incomplete UTF-8 sequences and substitute/delete non-UTF-8 values. Obviously this adds quite a bit of complexity to the implementation which would likely be undesirable.
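As a rough sketch of what "buffer incomplete sequences and substitute invalid values" could look like, the following uses CPython's `codecs` incremental decoder. This is only an illustration of the behaviour: the function name `sanitize_chunks` is hypothetical, and MicroPython may not ship the `codecs` module, so an on-device version would presumably need its own small state machine.

```python
import codecs


def sanitize_chunks(chunks):
    """Yield str payloads that are always safe to send as text messages.

    Incomplete multi-byte sequences are held back until the remaining
    bytes arrive; bytes that can never be valid are replaced with U+FFFD.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:  # nothing to send while a sequence is still incomplete
            yield text
    # Flush: a sequence left dangling at the end is also replaced.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# The '£' written one byte at a time now arrives as a single text payload.
print(list(sanitize_chunks([b"\xc2", b"\xa3", b"\n"])))
# The invalid 0xff byte is substituted rather than killing the connection.
print(list(sanitize_chunks([b"\xff\n"])))
```

Note that buffering delays output until a sequence completes, and substitution is lossy, so this only suits the "non-UTF-8 data is out of scope" option.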

Alternatively, non-UTF-8 data needs to be transmitted/received via some side-channel. As I understand, it appears that binary messages are currently being explored as a side-channel for non-REPL messages and could serve this purpose.

As a possibly minimal example of a potential backward-compatible change: incomplete or non-UTF-8 byte sequences could be encoded as a series of binary messages with length 1, interleaved with text-based messages for all other terminal input/output. On the assumption that non-UTF-8 encoded data will be rare (and indeed nobody is currently complaining about this so this might be a good assumption), this implementation, whilst inefficient, might be simple to implement both in micropython and clients.
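A minimal sketch of that interleaving idea, written in plain CPython for clarity. The `("text", ...)` / `("binary", ...)` tuples stand in for websocket text and binary messages, and the names `frame_chunk` and `reassemble` are invented for this example; nothing here reflects an existing WebREPL API or agreed wire format.

```python
def frame_chunk(chunk: bytes):
    """Split one write into text messages and 1-byte binary messages."""
    messages = []
    pos = 0
    while pos < len(chunk):
        rest = chunk[pos:]
        try:
            messages.append(("text", rest.decode("utf-8")))
            break
        except UnicodeDecodeError as err:
            # err.start/err.end are offsets into `rest`, not `chunk`.
            if err.start > 0:
                valid = rest[:err.start].decode("utf-8")
                messages.append(("text", valid))
            # Each offending (invalid or incomplete) byte goes out alone.
            for b in rest[err.start:err.end]:
                messages.append(("binary", bytes([b])))
            pos += err.end
    return messages


def reassemble(messages) -> bytes:
    """Receiver side: concatenate payloads back into the raw byte stream."""
    return b"".join(
        payload.encode("utf-8") if kind == "text" else payload
        for kind, payload in messages
    )


sent = []
for write in (b"ok\xff\n", b"\xc2", b"\xa3"):
    sent.extend(frame_chunk(write))
print(sent)
print(reassemble(sent))
```

Because no state is kept between writes, the sender stays simple; a client that wants to display the split '£' only has to concatenate the payloads before decoding.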

Alternatively, a new binary message type could be defined which would be used for all inputs/outputs containing any non-UTF-8 values. This would have the advantage of not losing efficiency but could potentially require greater commitment regarding the binary message formatting used by the protocol going forward.
