Remove varlink and bring back the native server/client socket serialization #22

Profpatsch · 2021-03-19T23:22:06Z

Fixes #4

Please see commit messages.

The lorri internal stream-events output probably changes a lot, I didn’t spend any time keeping it backwards compatible. But since it’s experimental and internal we don’t have to care as much. It’s still using serde Serialize instances, but some care was taken to split it from the internal representation.

Amended the changelog in release.nix (see release.nix for instructions)

If the other side of that channel is closed, it means that the pool propagated a panic, so we should just log the fact instead of panicing again. The logging then needs to happen via an explicit logger, since slog_scope doesn’t seem to reach into the thread. We add explicit loggers to both `Pool` and `Async`, and adjust calls. In order to call these in tests, we provide a simple stderr logger setup via `logging::test_logger()`.

tl;dr: varlink bad, serde good, use `LORRI_BACKEND=native` to make `lorri daemon` and `lorri internal ping` use the serde-based protocol. All the other commands to follow. ~~~ This re-adds a slightly modified version of the native communication protocol between lorri server and client(s) that was removed in b0cbe27 in favor of a varlink-based RPC format. I was skeptical back when it was introduced, since the arguments in favor of varlink where mostly “we don’t have to maintain any custom socket code” and “RPC is better than whatever the current socket protocol does”, without actually trying to work with the existing socket code first. As it turns out, varlink has multiple weaknesss that make working with it a pain in the ass: 1. It is JSON-based. Thus is cannot encode arbitrary bytes like filepaths. lorri is full of filepaths. Every event that happens contains filepaths. Also program output. Which is also random bytes. Which lead to hundreds of `TryFrom` instances which convert paths to utf-8, crashing lorri in case they are not utf-8 clean. Also all these conversions from and to strings were written by hand. There is a format that can encode arbitrary rust structs and enums, and it’s used by most rust code that does any serialization, it’s called serde. It has a binary representation and also a json one if you really need the debuggability. Which brings us to 2. Varlink is a format made for public interfaces It is based on the assumption that there is external consumers of the API which want to auto-generate some code to talk to it. But the lorri server/client protocol is not that. It’s internal. We don’t want to talk to other tools through this API, we want to talk to a lorri client/daemon of exactly the same version. Yes, we have `lorri internal stream-events`, which is a prototype of an external interface to the build events, and which currently uses varlink, but that is entirely an implementation detail. In fact, how the client (`stream-events`) talks to the server (`lorri daemon`) and how it prints those messages to the user should probably be different data types in the first place anyway, so that we can maintain backwards-compat while being able to refactor internally. 3. Varlink does not understand ADTs and rust enums Json does not do enums. There is a bunch of encodings for them, and they can also be done manually, but varlink is a “protocol” written by go programmers, who do not know how ADTs work generally. Thus it does not have any good enum type. Thus we need to manually encode/decode enums into the format it understands. There is a rust library which does this automatically, serde. 4. Varlink has bad documentation, no specification, and little users Initially when we introduced it, it might have looked as if varlink was picking up steam, but the simple fact is that varlink documentation is very sparse, and hasn’t been improved since we first introduced it in 2019. Probably because there is a bunch of formats with better adoption out there that actually can encode public APIs and have tooling to support things like versioning, like protobuf or thrift. Also the part that varlink uses a custom format, and the EBNF is the best you get, no examples of usage, no patterns, no FAQ. Q: So why not switch to protobuf? A: It was an idea, and may yet be a good idea, but only for the public interface of lorri. For the internal interface where we essentially have a unix socket that `accept()`s in a loop, and can require users to keep the version of the server and client in exact sync, there is little reason to introduce manual encoding and parsing steps that come with the benefit of backward/forward compat. Q: Why is this so much code? A: The code, as the one that was originally replaced, has a module that describes a generic typesafe reader/writer (`src/socket/read_writer.rs`), and a higher-level module with a server and a client implementation that uses it (`src/socket/communicate.rs`). The read/writer module is handling the type-based serialization and reading/writing typed messages to a socket via the serde bincode format. It supports timeouts, which complicates the code slighly but is reqiured for not blocking indefinitly. The code should be relatively easy to understand. The communicate module implements the meat of the communication, and it also defines all different communications that can happen between the various clients and the server, using the underlying read/writer. There is a bit of trait wrangling to make sure that a communication always binds the same two request/response types together, and a bit of code to define a typesafe listener and clients for all handlers. The most messy module right now is `src/socket.rs`, which implements the actual accept loop, and tries its best to not run out of threads while doing that. lorri might run for a long time and be called many times (e.g. via `lorri direnv`), so we don’t want to generate too many thread zombies. We could opt for a thread pool instead, which doesn’t have this problem. Q: This is just the ping event, right? Yes, this commit only goes to the point of where the code was when it was first replaced with varlink, which is the ping event. The rest of the events is going to be implemented in (a) following commit(s). Let’s look forward to dropping varlink in a bit.

It complicates the code and the commit that introduces it didn’t give any good reason we’d want to set the socket path for only one command.

Since it was just a ping, we don’t need to implement another `CommunicationType` for it.

Since we are using serde, we need to add a few more deserialize calls to types, but it comes for free. There was a particularly abhorrent Serialize instance for a path type, which is removed. I’m so looking forward to removing all the horrible TryFrom instances and other war crimes.

We should get away from carnix as fast as possible, it hasn’t been maintained and the quality of the tool is questionable. I wasn’t even able to reach nest.pijul.org, and this is the only place the source is hosted. The error this works around was undebuggable from the message itself, and I’m still not entirely sure what exactly happens, since there is no source code comments anywhere to be found. Very bad software.

Finally. See previous commits for reasoning. Nothing in the public interface should change. The client code for the native protocol still needs to be improved somewhat so that it does not die on `unwrap()` or similar. This removes some tests that were just hopelessly broken, e.g. the “integration” “test” that had to poll for the socket to be available after starting the server, instead of propagating out readiness. Not even talking about the fact that these tests should just call the actual binaries.

We only ever have one path, make it less confusing.

Previously we’ve been using the same Serialize instance for the internal “over the wire” protocol between daemon/client and for the stream-events command json serialization. This needs to be split up if we ever want to stabilize the interface. Thus a first step, which keeps the inner structure of `Event`, but stubs out all the inner types that have to be converted for the json interface. This pattern looks quite nice, even though implementing `map()` for each of these interface types is a bit of boilerplate.

The Server struct had an unnecessary indirection after removing the varlink backend.

After getting rid of varlink the new clients were temporarily created with lots of `unwrap()`ping, but we can do better. In the spirit of #12, I first tried to add the `failure` crate, but after a while I noticed that the usability is really low. Even though it has documentation, the documentation doesn’t talk about how to make backtraces work. Additionally, you would have to build your own error pretty printing. Based on the recommendation in the issue, I switched the error definitions to `anyhow` instead, which feels a lot more modern and *does* integrate better with the `std::error::Error` trait (even though backtraces don’t work on stable, but they didn’t for `failure` either, and we will get them for free in time). There is a small trait `ExitAs`, that gives an automatic `From` instance to every error type that implements it and `Error`, which is nice. Though I couldn’t get it to work with the `Context` trait like I did with the `Context` type from `failure. ~~~ A small improvement was done to the `thread::Pool`, in addition to only returning instantly when a thread panics, we now also return when any thread returns an `Err`. This allows us to actually transmit typed errors, and use the `?` operator directly in the thread closure. ~~~ In ops itself, we factor out all the client connections into the `daemon::client` module, and replace all unwraps by ExitErrors. This gives us back the nice exit errors we had before, plus error traces for anything that implements `Error` and uses `#[source]`. Nice. The `lorri direnv` call was still broken when the daemon was not running, this fixes that as well.

It’s a major version bump because we remove a CLI flag (even though it was probably not used much). The internal change might also expose new bugs, so let’s keep it safe and tell people that the behaviour might have changed. Plus `internal stream-events` also has seen some damage, so it would be unfair to make this a patch release.

Profpatsch · 2021-03-20T11:46:27Z

Okay we really need to get rid of the “always up to date clippy in CI” thing, it’s aggravating.

symphorien · 2021-03-20T21:19:37Z

lorri internal stream-events --kind snapshot just hangs for me.
I'm running the daemon with systemd, and paid attention that the lorri command and the lorri daemon are both from this PR.
lorri internal ping works, though.

Profpatsch · 2021-03-28T11:18:43Z

lorri internal stream-events --kind snapshot just hangs for me.

have you been using this before? The logic around this is very confused, I copied it verbatim. I don’t think I broke anything, but it’s rather hard to figure out. I’d say this will need a rewrite to clean up in the future, so let’s ignore it for now.

Profpatsch · 2021-03-28T11:41:08Z

Okay, again, the only CI failure is coming from the newer clippy warnings, which happens because CI always fetches the nightly clippy, which we should remove because they are not easy to reproduce locally.

Profpatsch · 2021-03-28T11:41:20Z

LGTM

symphorien · 2021-03-28T14:58:30Z

have you been using this before? The logic around this is very confused, I copied it verbatim. I don’t think I broke anything, but it’s rather hard to figure out.

Yes, since 2020-04-25 modulo command renaming (it used to be lorri internal__stream_events --kind snapshot)
Here is at least half the fix #25
The command is still hanging, I don't remember if it used to.

symphorien · 2021-03-28T15:42:15Z

I think I found the source of the problem: when thread in ops::stream_events is dropped, we wait on it, that is forever. So the function cannot return.
I tried to fix it with

diff --git a/src/ops.rs b/src/ops.rs
index a7f3ca0..95ca37f 100644
--- a/src/ops.rs
+++ b/src/ops.rs
@@ -606,6 +606,12 @@ pub fn stream_events(kind: EventKind) -> OpResult {
         })
     };
 
+    // when running in snapshot mode, we should exit as soon as we print the answer
+    // of the daemon. However, when dropping `thread` at the end of the function,
+    // we will wait for it and hang indefinitely.
+    // Let's leak it, anyway we are going to exit the process as a whole very soon.
+    let thread = std::mem::ManuallyDrop::new(thread);
+
     let mut snapshot_done = false;
     loop {
         select! {

but then lorri internal stream-events fails with a timeout after 0.5 second:

$  target/debug/lorri internal stream-events
{"Completed":{"nix_file":"/home/symphorien/src/lorri/shell.nix","rooted_output_paths":{"shell_gc_root":"/home/symphorien/.cache/lorri/gc_roots/07d6e569120731e7ee9a3297613c71b7/gc_root/shell_gc_root"}}}
Mar 28 17:36:23.148 ERRO Message(R(Timeout(D(Millis(500)))))

Maybe it always used to timeout but when trying to return the error it would hang as explained above?

symphorien · 2021-03-28T16:08:45Z

Mmh this leaks file descriptors very fast...

symphorien · 2021-03-28T18:01:28Z

file descriptors in the server*

Profpatsch · 2021-03-28T18:06:10Z

Ah, it’s a regression then, async will block until the thread inside finishes, since everything else is just sloppy.

Profpatsch · 2021-03-28T18:58:02Z

@symphorien I think the solution here is to allow Async to not require a thread to join itself when it goes out of scope. There is another solution, but that requires a bunch of backwards-pointing shutdown channels and I don’t think it’s worth the overhead.

Profpatsch · 2021-03-28T19:14:29Z

Okay, I have a fix.

Profpatsch force-pushed the death-to-varlink branch 4 times, most recently from a103dc5 to 7714ae2 Compare March 20, 2021 00:21

Profpatsch and others added 12 commits March 20, 2021 01:30

chore(ops/ping): remove socket_address argument

9451d37

It complicates the code and the commit that introduces it didn’t give any good reason we’d want to set the socket path for only one command.

feat(ops): add native backend support for lorri direnv

0bea215

Since it was just a ping, we don’t need to implement another `CommunicationType` for it.

chore(builder): OutputPaths -> OutputPath

0ce3e2a

We only ever have one path, make it less confusing.

chore(daemon): socket.rs -> daemon/server.hs

9c3f2c8

The Server struct had an unnecessary indirection after removing the varlink backend.

chore(tests): “fix” varmap_leak test test on CI

ade01bf

Profpatsch force-pushed the death-to-varlink branch from 7714ae2 to 6e417ef Compare March 20, 2021 00:30

Profpatsch mentioned this pull request Mar 23, 2021

slog panics when I pipe the output to another command #20

Closed

Profpatsch mentioned this pull request Mar 28, 2021

RFC: use failure crate for errors #12

Closed

Profpatsch merged commit fee4ffa into canon Mar 28, 2021

Profpatsch deleted the death-to-varlink branch March 28, 2021 11:41

symphorien mentioned this pull request Mar 28, 2021

(fix) don't busywait when hitting file descriptor limit #26

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove varlink and bring back the native server/client socket serialization #22

Remove varlink and bring back the native server/client socket serialization #22

Profpatsch commented Mar 19, 2021

Profpatsch commented Mar 20, 2021

symphorien commented Mar 20, 2021

Profpatsch commented Mar 28, 2021

Profpatsch commented Mar 28, 2021

Profpatsch commented Mar 28, 2021

symphorien commented Mar 28, 2021

symphorien commented Mar 28, 2021

symphorien commented Mar 28, 2021

symphorien commented Mar 28, 2021

Profpatsch commented Mar 28, 2021

Profpatsch commented Mar 28, 2021

Profpatsch commented Mar 28, 2021

Remove varlink and bring back the native server/client socket serialization #22

Remove varlink and bring back the native server/client socket serialization #22

Conversation

Profpatsch commented Mar 19, 2021

Profpatsch commented Mar 20, 2021

symphorien commented Mar 20, 2021

Profpatsch commented Mar 28, 2021

Profpatsch commented Mar 28, 2021

Profpatsch commented Mar 28, 2021

symphorien commented Mar 28, 2021

symphorien commented Mar 28, 2021

symphorien commented Mar 28, 2021

symphorien commented Mar 28, 2021

Profpatsch commented Mar 28, 2021

Profpatsch commented Mar 28, 2021

Profpatsch commented Mar 28, 2021