
Interop test runner: Use exit code of dialer as source of truth #115

Closed
thomaseizinger opened this issue Jan 25, 2023 · 9 comments · Fixed by #121

Comments

@thomaseizinger
Contributor

Relevant discussion: #114 (comment)

Basic idea:

  • On success, the listener loops with a timeout (1min?) and waits to be terminated by docker compose
  • On success, the dialer exits with 0
  • On failure the dialer exits with 1
  • On failure, the listener exits with 1
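The listener half of the scheme above can be sketched with a simple promise race. `runWithTimeout` is a hypothetical helper, not code from this repo; on success the listener idles under the safety timeout and waits for docker compose to stop it, and only exits with 1 if the timeout fires:

```typescript
// Sketch of the listener-side safety timeout: race the listener's
// work against a timer. If the timer fires first, the returned
// promise rejects and the caller can exit the process with 1.
async function runWithTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("listener timed out")), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

Usage would be something like `runWithTimeout(serveForever(), 60_000).catch(() => process.exit(1))`, where `serveForever` is the listener's (never-resolving) serving loop.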

This should simplify the following code snippets because we can use the exit code of the dialer as the source of truth:

```typescript
try {
  const { stdout, stderr } = await exec(`docker compose -f ${path.join(dir, "compose.yaml")} up ${upFlags.join(" ")}`);
  console.log("Finished:", stdout)
} catch (e: any) {
  if (e !== null && typeof e === "object" && typeof e["stdout"] === "string") {
    if (e["stdout"].match(/dialer.*ping successful/i) !== null) {
      // The ping succeeded, but the listener exited first. Common if
      // the dialer tear-down is slow as is the case with browser
      // tests.
      return null
    }
  }
  console.log("Failure", e)
  return e
} finally {
  try {
    const { stdout, stderr } = await exec(`docker compose -f ${path.join(dir, "compose.yaml")} down`);
  } catch (e) {
    console.log("Failed to compose down", e)
  }
  await fs.rm(dir, { recursive: true, force: true })
}
```

@MarcoPolo
Contributor

We should still check the ping duration so that if an implementation silently fails it doesn't get ignored. Much harder to accidentally fake a ping duration.

@thomaseizinger
Contributor Author

What do you mean silently fails? The test binary decides whether or not it is successful. If we start putting this logic into the test runner, things will get very messy as we add more tests.

If we expect the ping duration to be within a certain range, I'd suggest that the test itself asserts that and exits with 1 or 0 respectively. This is consistent with how other test frameworks & runners work.

It might get a bit tricky once we add a delay to the network communication but even then I think it would be better to pass that information into the test and derive a correct assertion from there rather than asserting things within the test runner.

@marten-seemann
Contributor

marten-seemann commented Jan 27, 2023

What do you mean silently fails? The test binary decides whether or not it is successful.

It's relatively easy to mess up the exit code, depending on your language. It would be really bad if an implementation is relying on the interop tests passing, only to notice later that this was due to an incorrectly set exit code. Checking the RTT would be more robust.

If we expect the ping duration to be within a certain range, I'd suggest that the test itself asserts that and exits with 1 or 0 respectively. This is consistent with how other test frameworks & runners work.

That works (only!) for the ping RTT. For the handshake RTT test, the implementation would not know how many RTTs to expect.
In this case as well, it would be more robust to do that check in the runner and not in the test itself.
This is how the QUIC interop runner works, by the way.
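The runner-side handshake check could, for instance, bound the measured duration using the network RTT the runner itself configured. The function name and the slack value below are purely illustrative:

```typescript
// The runner injects the network latency itself, so unlike the
// implementation under test it knows how many round trips a given
// handshake should take and can bound the measured duration.
function handshakeWithinBudget(
  measuredMs: number,
  networkRTTMs: number,
  expectedRoundTrips: number,
  slackMs = 50 // allowance for processing time; arbitrary value
): boolean {
  return measuredMs <= expectedRoundTrips * networkRTTMs + slackMs;
}
```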

@thomaseizinger
Contributor Author

What do you mean silently fails? The test binary decides whether or not it is successful.

It's relatively easy to mess up the exit code, depending on your language. It would be really bad if an implementation is relying on the interop tests passing, only to notice later that this was due to an incorrectly set exit code. Checking the RTT would be more robust.

Really? A test with bad assertions is useless, I agree, but I am not sure we should be making design decisions based on that. I'd rather try to push for making it super easy to write these tests, with minimal boilerplate, such that they are easy to review and we can catch these problems. Also, wouldn't it always need two implementations to be buggy? Assuming we run all implementations in all combinations, it seems unlikely that a buggy one slips through because it is asserting the wrong thing AND has a bug.

If we expect the ping duration to be within a certain range, I'd suggest that the test itself asserts that and exits with 1 or 0 respectively. This is consistent with how other test frameworks & runners work.

That works (only!) for the ping RTT. For the handshake RTT test, the implementation would not know how many RTTs to expect. In this case as well, it would be more robust to do that check in the runner and not in the test itself. This is how the QUIC interop runner works, by the way.

What we will have to work out is the scope of the tests and of the test runner. For a QUIC interop test runner, it makes perfect sense to know about RTT because the entire scope of the test runner is QUIC connections. Ping is one of many protocols in the libp2p space. In an ideal world, our test runner shouldn't know about individual protocols. Measuring and exposing handshake RTT makes sense to me because every test will be expected to do a handshake. I don't expect every test to run the ping protocol.

@MarcoPolo
Contributor

Also, wouldn't it always need two implementations to be buggy?

No. Only the dialer needs to return a false positive (i.e. exit with code 0 despite a failure).

This isn't hypothetical. Glen's original PR that introduced the js-libp2p tests had a similar subtle bug (https://github.com/libp2p/test-plans/pull/98/files#diff-0b6202f38d6093a4c51465493b638fc8e1acadac2384bd814968202bdd994b11). Can you find it?

This doesn't complicate the test at all. You simply have to print some data to stdout in JSON format. You can even manually build the JSON string for all I care. We shouldn't bikeshed this. I'll write up what this test should do in a small doc tomorrow. Basically: run the ping protocol and print two results in JSON format, the ping RTT and the total RTT from before connect to after ping.
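The dialer-side reporting described above could be sketched as follows. The field names and the `reportResult` helper are made up for illustration; Marco's doc would define the actual shape:

```typescript
// Illustrative result shape: the dialer prints one JSON object on
// stdout with the ping RTT and the total time from before connect to
// after ping.
interface PingResult {
  pingRTTMillis: number;
  totalRTTMillis: number;
}

// `connectStart` and `pingDone` are timestamps in ms, `pingRTT` is
// the measured ping round trip in ms.
function reportResult(connectStart: number, pingDone: number, pingRTT: number): string {
  const result: PingResult = {
    pingRTTMillis: pingRTT,
    totalRTTMillis: pingDone - connectStart,
  };
  // JSON.stringify is the simplest way to "print some data to stdout
  // in JSON format"; manually building the string would also work.
  return JSON.stringify(result);
}
```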

What we will have to work out is the scope of the tests and of the test runner.

As little as possible. The only reason I want to put RTT in here is because it's a superset of the ping test. Any other kind of test, will be a separate thing.

I think in an ideal world, our test runner shouldn't know about individual protocols.

Ping is often the first protocol an implementation will support and test. It's easy to implement and easy to understand. It adds one round trip after connection setup, so it makes calculating the handshake RTT easy.


Let's try not to bikeshed this, please. It's really not a problem to read the stdout from the dialer and make sure the ping result is reasonable, as well as checking the exit code. And if the exit code is non-zero, that's a straightforward failure (the current code should be changed).
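The combined check could look like the sketch below. The JSON field name, the last-line parsing, and the plausibility bound are assumptions, not the actual runner code:

```typescript
// Sanity-check a finished dialer: a non-zero exit code is a
// straightforward failure; on exit code 0 the reported ping RTT must
// also be a plausible number, guarding against an implementation that
// silently fails yet exits 0.
function evaluateDialer(exitCode: number, stdout: string, maxPlausibleRTTMs = 60_000): boolean {
  if (exitCode !== 0) return false;
  let parsed: any;
  try {
    // Assume the JSON result is the last line of stdout (illustrative).
    parsed = JSON.parse(stdout.trim().split("\n").pop() ?? "");
  } catch {
    return false; // no structured result: treat as failure
  }
  const rtt = parsed?.pingRTTMillis;
  return typeof rtt === "number" && rtt >= 0 && rtt < maxPlausibleRTTMs;
}
```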

@thomaseizinger
Contributor Author

This doesn't complicate the test at all.

I am not too worried about the complexity of the test but of the test runner. Unless every test runs the ping protocol, how will the test runner know what to scan the output for? Unless I am missing something, it is the test runner's job to compute the matrix of test binaries to run and start them up, but it is ignorant of which particular test it is running.

I am happy to stick to one exception for ping but what I am worried about is the following ending up in the test runner:

```typescript
if (testname == "ping") {
  // scan output for ping duration
}

if (testname == "kademlia-put") {
  // scan output for something kademlia specific
}

// etc.
```

@thomaseizinger
Contributor Author

thomaseizinger commented Jan 27, 2023

The above assumes that we will extend these interop tests by building multiple different binaries and containers, each one with a particular test goal in mind, and we always just feed a list of containers to the test runner to execute. Side note: that is also where this idea came from. As soon as we write another test, it can no longer just be versions.ts; the GitHub action needs to configure which containers should be started.

Perhaps I am making a wrong assumption somewhere here and you are planning a different approach of extending these tests?

@MarcoPolo
Contributor

Maybe there's a misunderstanding?

I'm explicitly not trying to make this a generic runner for every test we want. The RTT counter is a special case because it's a superset of the ping test.

The kademlia interop test is different because we don't care what transport/muxer/secure channel that runs on. Therefore it should simply be a different test. It'll be easier to understand and iterate on, even if we duplicate a little bit of code.

@thomaseizinger
Contributor Author

Definitely a misunderstanding then, I thought we were going to use this harness for more tests!

There is probably still a way to do that but we can design that with the next test :)
