Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove varlink and bring back the native server/client socket serialization #22

Merged
merged 13 commits into from Mar 28, 2021

Conversation

Profpatsch
Copy link
Collaborator

Fixes #4

Please see commit messages.

The lorri internal stream-events output probably changes a lot, I didn’t spend any time keeping it backwards compatible. But since it’s experimental and internal we don’t have to care as much. It’s still using serde Serialize instances, but some care was taken to split it from the internal representation.

  • Amended the changelog in release.nix (see release.nix for instructions)

If the other side of that channel is closed, it means that the pool
propagated a panic, so we should just log the fact instead of panicing
again.

The logging then needs to happen via an explicit logger, since
slog_scope doesn’t seem to reach into the thread. We add explicit
loggers to both `Pool` and `Async`, and adjust calls. In order to call
these in tests, we provide a simple stderr logger setup via
`logging::test_logger()`.
@Profpatsch Profpatsch force-pushed the death-to-varlink branch 4 times, most recently from a103dc5 to 7714ae2 Compare March 20, 2021 00:21
Profpatsch and others added 12 commits March 20, 2021 01:30
tl;dr: varlink bad, serde good, use `LORRI_BACKEND=native` to make
`lorri daemon` and `lorri internal ping` use the serde-based protocol.
All the other commands to follow.

~~~

This re-adds a slightly modified version of the native communication
protocol between lorri server and client(s) that was removed in
b0cbe27 in favor of a varlink-based
RPC format.

I was skeptical back when it was introduced, since the arguments in
favor of varlink where mostly “we don’t have to maintain any custom
socket code” and “RPC is better than whatever the current socket
protocol does”, without actually trying to work with the existing
socket code first.

As it turns out, varlink has multiple weaknesss that make working with
it a pain in the ass:

1. It is JSON-based.

Thus is cannot encode arbitrary bytes like filepaths. lorri is full of
filepaths. Every event that happens contains filepaths. Also program
output. Which is also random bytes.

Which lead to hundreds of `TryFrom` instances which convert paths to
utf-8, crashing lorri in case they are not utf-8 clean. Also all these
conversions from and to strings were written by hand.

There is a format that can encode arbitrary rust structs and enums,
and it’s used by most rust code that does any serialization, it’s
called serde. It has a binary representation and also a json one if
you really need the debuggability. Which brings us to

2. Varlink is a format made for public interfaces

It is based on the assumption that there is external consumers of the
API which want to auto-generate some code to talk to it. But the lorri
server/client protocol is not that. It’s internal. We don’t want to
talk to other tools through this API, we want to talk to a lorri
client/daemon of exactly the same version.

Yes, we have `lorri internal stream-events`, which is a prototype of
an external interface to the build events, and which currently uses
varlink, but that is entirely an implementation detail. In fact, how
the client (`stream-events`) talks to the server (`lorri daemon`) and
how it prints those messages to the user should probably be different
data types in the first place anyway, so that we can maintain
backwards-compat while being able to refactor internally.

3. Varlink does not understand ADTs and rust enums

Json does not do enums. There is a bunch of encodings for them, and
they can also be done manually, but varlink is a “protocol” written by
go programmers, who do not know how ADTs work generally. Thus it does
not have any good enum type. Thus we need to manually encode/decode
enums into the format it understands. There is a rust library which
does this automatically, serde.

4. Varlink has bad documentation, no specification, and little users

Initially when we introduced it, it might have looked as if varlink
was picking up steam, but the simple fact is that varlink
documentation is very sparse, and hasn’t been improved since we first
introduced it in 2019. Probably because there is a bunch of formats
with better adoption out there that actually can encode public APIs
and have tooling to support things like versioning, like protobuf or
thrift.

Also the part that varlink uses a custom format, and the EBNF is the
best you get, no examples of usage, no patterns, no FAQ.

Q: So why not switch to protobuf?

A: It was an idea, and may yet be a good idea, but only for the public
interface of lorri. For the internal interface where we essentially
have a unix socket that `accept()`s in a loop, and can require users
to keep the version of the server and client in exact sync, there is
little reason to introduce manual encoding and parsing steps that come
with the benefit of backward/forward compat.

Q: Why is this so much code?

A: The code, as the one that was originally replaced, has a module
that describes a generic typesafe
reader/writer (`src/socket/read_writer.rs`), and a higher-level module
with a server and a client implementation that uses
it (`src/socket/communicate.rs`).

The read/writer module is handling the type-based serialization and
reading/writing typed messages to a socket via the serde bincode
format. It supports timeouts, which complicates the code slighly but
is reqiured for not blocking indefinitly. The code should be
relatively easy to understand.

The communicate module implements the meat of the communication, and
it also defines all different communications that can happen between
the various clients and the server, using the underlying read/writer.
There is a bit of trait wrangling to make sure that a communication
always binds the same two request/response types together, and a bit
of code to define a typesafe listener and clients for all handlers.

The most messy module right now is `src/socket.rs`, which implements
the actual accept loop, and tries its best to not run out of threads
while doing that. lorri might run for a long time and be called many
times (e.g. via `lorri direnv`), so we don’t want to generate too many
thread zombies. We could opt for a thread pool instead, which doesn’t
have this problem.

Q: This is just the ping event, right?

Yes, this commit only goes to the point of where the code was when it
was first replaced with varlink, which is the ping event. The rest of
the events is going to be implemented in (a) following commit(s).

Let’s look forward to dropping varlink in a bit.
It complicates the code and the commit that introduces it didn’t give
any good reason we’d want to set the socket path for only one command.
Since it was just a ping, we don’t need to implement another
`CommunicationType` for it.
Since we are using serde, we need to add a few more deserialize calls
to types, but it comes for free.

There was a particularly abhorrent Serialize instance for a path type,
which is removed.

I’m so looking forward to removing all the horrible TryFrom instances
and other war crimes.
We should get away from carnix as fast as possible, it hasn’t been
maintained and the quality of the tool is questionable. I wasn’t even
able to reach nest.pijul.org, and this is the only place the source is
hosted.

The error this works around was undebuggable from the message itself,
and I’m still not entirely sure what exactly happens, since there is
no source code comments anywhere to be found. Very bad software.
Finally. See previous commits for reasoning. Nothing in the public
interface should change.

The client code for the native protocol still needs to be improved
somewhat so that it does not die on `unwrap()` or similar.

This removes some tests that were just hopelessly broken, e.g. the
“integration” “test” that had to poll for the socket to be available
after starting the server, instead of propagating out readiness. Not
even talking about the fact that these tests should just call the
actual binaries.
We only ever have one path, make it less confusing.
Previously we’ve been using the same Serialize instance for the
internal “over the wire” protocol between daemon/client and for the
stream-events command json serialization.

This needs to be split up if we ever want to stabilize the interface.

Thus a first step, which keeps the inner structure of `Event`, but
stubs out all the inner types that have to be converted for the json
interface.

This pattern looks quite nice, even though implementing `map()` for
each of these interface types is a bit of boilerplate.
The Server struct had an unnecessary indirection after removing the
varlink backend.
After getting rid of varlink the new clients were temporarily created
with lots of `unwrap()`ping, but we can do better.

In the spirit of #12, I
first tried to add the `failure` crate, but after a while I noticed
that the usability is really low. Even though it has documentation,
the documentation doesn’t talk about how to make backtraces work.
Additionally, you would have to build your own error pretty printing.

Based on the recommendation in the issue, I switched the error
definitions to `anyhow` instead, which feels a lot more modern and
*does* integrate better with the `std::error::Error` trait (even
though backtraces don’t work on stable, but they didn’t for `failure`
either, and we will get them for free in time).

There is a small trait `ExitAs`, that gives an automatic `From`
instance to every error type that implements it and `Error`, which is
nice. Though I couldn’t get it to work with the `Context` trait like I
did with the `Context` type from `failure.

~~~

A small improvement was done to the `thread::Pool`, in addition to
only returning instantly when a thread panics, we now also return when
any thread returns an `Err`. This allows us to actually transmit typed
errors, and use the `?` operator directly in the thread closure.

~~~

In ops itself, we factor out all the client connections into the
`daemon::client` module, and replace all unwraps by ExitErrors.
This gives us back the nice exit errors we had before, plus error
traces for anything that implements `Error` and uses `#[source]`.
Nice.

The `lorri direnv` call was still broken when the daemon was not
running, this fixes that as well.
It’s a major version bump because we remove a CLI flag (even though it
was probably not used much).

The internal change might also expose new bugs, so let’s keep it safe
and tell people that the behaviour might have changed.
Plus `internal stream-events` also has seen some damage, so it would
be unfair to make this a patch release.
@Profpatsch
Copy link
Collaborator Author

Okay we really need to get rid of the “always up to date clippy in CI” thing, it’s aggravating.

@symphorien
Copy link
Contributor

lorri internal stream-events --kind snapshot just hangs for me.
I'm running the daemon with systemd, and paid attention that the lorri command and the lorri daemon are both from this PR.
lorri internal ping works, though.

@Profpatsch
Copy link
Collaborator Author

lorri internal stream-events --kind snapshot just hangs for me.

have you been using this before? The logic around this is very confused, I copied it verbatim. I don’t think I broke anything, but it’s rather hard to figure out. I’d say this will need a rewrite to clean up in the future, so let’s ignore it for now.

@Profpatsch
Copy link
Collaborator Author

Okay, again, the only CI failure is coming from the newer clippy warnings, which happens because CI always fetches the nightly clippy, which we should remove because they are not easy to reproduce locally.

@Profpatsch
Copy link
Collaborator Author

LGTM

@Profpatsch Profpatsch merged commit fee4ffa into canon Mar 28, 2021
@Profpatsch Profpatsch deleted the death-to-varlink branch March 28, 2021 11:41
@symphorien
Copy link
Contributor

have you been using this before? The logic around this is very confused, I copied it verbatim. I don’t think I broke anything, but it’s rather hard to figure out.

Yes, since 2020-04-25 modulo command renaming (it used to be lorri internal__stream_events --kind snapshot)
Here is at least half the fix #25
The command is still hanging, I don't remember if it used to.

@symphorien
Copy link
Contributor

I think I found the source of the problem: when thread in ops::stream_events is dropped, we wait on it, that is forever. So the function cannot return.
I tried to fix it with

diff --git a/src/ops.rs b/src/ops.rs
index a7f3ca0..95ca37f 100644
--- a/src/ops.rs
+++ b/src/ops.rs
@@ -606,6 +606,12 @@ pub fn stream_events(kind: EventKind) -> OpResult {
         })
     };
 
+    // when running in snapshot mode, we should exit as soon as we print the answer
+    // of the daemon. However, when dropping `thread` at the end of the function,
+    // we will wait for it and hang indefinitely.
+    // Let's leak it, anyway we are going to exit the process as a whole very soon.
+    let thread = std::mem::ManuallyDrop::new(thread);
+
     let mut snapshot_done = false;
     loop {
         select! {

but then lorri internal stream-events fails with a timeout after 0.5 second:

$  target/debug/lorri internal stream-events
{"Completed":{"nix_file":"/home/symphorien/src/lorri/shell.nix","rooted_output_paths":{"shell_gc_root":"/home/symphorien/.cache/lorri/gc_roots/07d6e569120731e7ee9a3297613c71b7/gc_root/shell_gc_root"}}}
Mar 28 17:36:23.148 ERRO Message(R(Timeout(D(Millis(500)))))

Maybe it always used to timeout but when trying to return the error it would hang as explained above?

@symphorien
Copy link
Contributor

Mmh this leaks file descriptors very fast...

@symphorien
Copy link
Contributor

file descriptors in the server*

@Profpatsch
Copy link
Collaborator Author

Ah, it’s a regression then, async will block until the thread inside finishes, since everything else is just sloppy.

@Profpatsch
Copy link
Collaborator Author

@symphorien I think the solution here is to allow Async to not require a thread to join itself when it goes out of scope. There is another solution, but that requires a bunch of backwards-pointing shutdown channels and I don’t think it’s worth the overhead.

@Profpatsch
Copy link
Collaborator Author

Okay, I have a fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lorri internal stream-event fails with varlinkerror
2 participants