feature: improve hashmail scalability #94

Open
guggero opened this issue May 11, 2023 · 0 comments

This is a brainstorming issue for future improvements to the hashmail server and the LNC transport protocol, with the goal of increasing scalability.

Mailbox ID based horizontal scaling

In most cluster environments, services are scaled horizontally by using load balancers with some kind of "session stickiness" mechanism turned on, which makes sure the same client is always forwarded to the same backend.
With LNC this is not sufficient, because both the server (litd) and the client (browser or LNC client application) must connect to the same instance of aperture in order to be able to communicate.
Since both server and client derive the same mailbox or session ID (=SID), they can use that ID to look up the server instance to connect to by asking any of the instances:

  1. The LNC application (server or client) connects to the load-balanced aperture instance (e.g. mailbox.terminal.lightning.engineering) and queries ServerForMailbox(SID=xyz).
  2. The aperture instance that received the request checks the connected etcd instance to see whether the SID is already mapped.
    1. If the SID is mapped, return the direct server instance (e.g. srv-004.mailbox.terminal.lightning.engineering) as the response.
    2. If the SID is not yet mapped, pick a server at random (each aperture instance knows every other instance through etcd) and store the mapping in etcd. Return the server instance as the response.
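
The lookup steps above can be sketched roughly as follows. This is only an illustration under assumptions: the mappingStore type, its in-memory map (standing in for etcd) and the example.com host names are all hypothetical, and only the ServerForMailbox name is taken from this issue.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sync"
)

// mappingStore is a hypothetical in-memory stand-in for etcd. It knows all
// aperture instances and remembers which instance serves which SID.
type mappingStore struct {
	mu      sync.Mutex
	servers []string          // all known aperture instances
	bySID   map[string]string // SID -> assigned instance
}

func newMappingStore(servers []string) *mappingStore {
	return &mappingStore{servers: servers, bySID: make(map[string]string)}
}

// ServerForMailbox returns the instance already mapped to sid. If there is no
// mapping yet, it picks an instance at random and stores the mapping.
func (s *mappingStore) ServerForMailbox(sid string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if len(s.servers) == 0 {
		return "", errors.New("no aperture instances registered")
	}
	if srv, ok := s.bySID[sid]; ok {
		return srv, nil
	}
	srv := s.servers[rand.Intn(len(s.servers))]
	s.bySID[sid] = srv
	return srv, nil
}

func main() {
	// Host names are placeholders, not real deployment names.
	store := newMappingStore([]string{
		"srv-001.mailbox.example.com",
		"srv-002.mailbox.example.com",
		"srv-003.mailbox.example.com",
	})

	// litd and the LNC client both ask for the same SID and must get the
	// same instance back.
	fromServer, _ := store.ServerForMailbox("xyz")
	fromClient, _ := store.ServerForMailbox("xyz")
	fmt.Println(fromServer == fromClient) // true
}
```

With a real etcd backend the mutex would not be enough, since several aperture instances could try to map the same SID at the same time. The check-and-store step would probably need to be a single create-if-absent transaction.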

The extra call, followed by a direct connection to a non-load-balanced instance, is necessary because of the nature of the LNC transport connection: it is a long-lived gRPC (or WebSocket) connection, so it uses up one of the at most 65k TCP client ports on a machine. If all connections ran through the same Layer 4 load balancer, we would still be limited to a maximum of 65k clients in total. The extra call makes the initial connection slightly more complex, but it allows each individual instance to accept up to 65k clients.

Hashmail protocol improvements

The current protocol requires both the server and client of an LNC connection to open two half-duplex mailbox streams to the mailbox server, each with a unique (but related) SID to create the virtual full-duplex connection between server and client.
Because in practice the two SIDs are identical except for the last bit, which is flipped, the protocol could quite easily be simplified to allow a single full-duplex stream to be opened. That would basically merge two uni-directional gRPC (or WebSocket) streams into a single, bi-directional one.
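
To illustrate the relationship between the two SIDs, here is a minimal sketch. It assumes a 64-byte SID and that "last bit" means the least significant bit of the final byte. The SID type and the partnerSID and streamKey helpers are hypothetical names made up for this example.

```go
package main

import "fmt"

// sidLen is an assumed SID length for this sketch.
const sidLen = 64

// SID is a hypothetical fixed-size mailbox/session ID.
type SID [sidLen]byte

// partnerSID returns the SID of the opposite direction, which differs from
// sid only in the last bit.
func partnerSID(sid SID) SID {
	sid[sidLen-1] ^= 0x01
	return sid
}

// streamKey clears the last bit, so both half-duplex SIDs map to the same
// key. A single full-duplex stream could be identified by such a key.
func streamKey(sid SID) SID {
	sid[sidLen-1] &^= 0x01
	return sid
}

func main() {
	var a SID
	a[sidLen-1] = 0x2a

	b := partnerSID(a)
	fmt.Printf("%02x %02x\n", a[sidLen-1], b[sidLen-1]) // 2a 2b
	fmt.Println(streamKey(a) == streamKey(b))           // true
}
```

Since either SID can be derived from the other, a merged endpoint would only need one of them (or the shared key) plus the role of the caller to know which direction is which.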

A further improvement that could lead to better memory efficiency on the aperture side is to remove the requirement for asynchronous mailbox usage. Originally the mailbox protocol was developed for sending a single piece of information from a sender to a receiver, potentially asynchronously: the sender would send their message and disconnect, and the receiver would later connect to the server and retrieve the message. That is useful for sending updates of a Pool sidecar channel order from the buyer to the recipient, or for transferring Taproot asset proofs from sender to receiver. But for a protocol that requires both parties to be online in the first place, that message buffering feature is not strictly necessary.
So the LNC mailbox gRPC endpoint could be improved further by making writes block until there is a reader on the other end (instead of allowing writes up to a certain buffer size).

Versioning

Because all the improvements described above will largely be incompatible with older clients, some form of versioning system needs to be implemented at the LNC transport (=hashmail) layer.
There are version numbers available at the LNC transmission (=Go-Back-N) and LNC noise layers, but those won't be of any use, as they can only be exchanged once an initial transport layer connection has been established.

Most likely some sort of new connection phrase encoding (either as a URL or a QR code) would be required to indicate that a new mailbox protocol needs to be used, so that an old client would not understand the new scheme and would show an error message instead.
