Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang in eio_linux tests #319

Closed
talex5 opened this issue Sep 20, 2022 · 5 comments
Closed

Hang in eio_linux tests #319

talex5 opened this issue Sep 20, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@talex5
Copy link
Collaborator

talex5 commented Sep 20, 2022

@haesbaert wrote:

it seems to hang on lib_eio_linux/tests/test.ml on the very first test

We should try to find out what's causing this. I wrote:

What happens if you just run ./_build/default/lib_eio_linux/tests/test.exe in a loop? It doesn't hang for me.

@talex5 talex5 added the bug Something isn't working label Sep 20, 2022
@haesbaert
Copy link
Contributor

haesbaert commented Sep 20, 2022

Sorry I had missed the original message, yes it hangs for me, rather fast:

RUN 579
Testing `eio_linux'.
This run has ID `26FSRR2T'.

  ...           io          0   copy.

with

sam:eio: local i=0;while true; do echo RUN $i; i=$((i+1)); ./_build/default/lib_eio_linux/tests/test.exe;done
Linux sam 5.18.16-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 03 Aug 2022 11:25:04 +0000 x86_64 GNU/Linux

@haesbaert
Copy link
Contributor

I reduced the program that hangs as much as I could to this, I think my initial impression is correct, somehow the EOF is never seen by the reader.

let test_copy () =
  Eio_linux.run ~queue_depth:10 @@ fun _stdenv ->
  Eio.Switch.run @@ fun sw ->
  let from_pipe, to_pipe = Eio_linux.pipe sw in
  let buffer = Cstruct.create 20 in
  Eio.Flow.copy (Eio.Flow.string_source "a") to_pipe;
  Eio.Flow.close to_pipe;
  let () = 
    try 
      while true do
        ignore (Eio.Flow.read from_pipe buffer)
      done
  with End_of_file -> ()
  in
  Eio.Flow.close from_pipe

strace here: https://gist.github.com/haesbaert/437fd9e30e4568cc3f5ba95f0387d63a

Writer is FD6, which is actually closed during the hang (by looking at /proc/foo), FD5 (reader) is still opened and we are blocked in io_uring_wait_cqe().

The pattern I see is that if the close happens before a readv is queued, sometimes the readv will never see EOF. If the readv is submitted before the close, it always sees the EOF. I've tested this in two machines with slightly different kernels 5.18 vs 5.19, with released as well as current code base for eio and uring, behaviour is the same.

My only theory of why you can't trigger the bug is because on your tests the writer/reader dance terminates always in the order where the close only happens after the reader is queued, the order really depends on which CQE comes back first on the Fiber.both() tests. This program above should always trigger the bad case, I can hang it in < 5 seconds.

Tomorrow I wanna try to peek at the uring stats, like dropped requests and whatnot, I'll also write the equivalent in C and try to trigger.

At this point, it smells like a kernel bug though.

haesbaert added a commit to haesbaert/eio that referenced this issue Sep 26, 2022
Turns out the kernel needs to handle blocking file descriptors differently, it
needs to do thread-pooling and it's a different code path (also slower).

It also seems that this is bugged in some cases. This makes sense as I believe
most people don't test with blocking FDs. I believe we should make all our FDs
non blocking as a precaution.

There is a program listed in issue ocaml-multicore#319 that replicates this. I could hang it in
< 5 seconds, and it's now running for 10minutes with the fix, so kaboom.

TLDR: linux being linux.
@haesbaert
Copy link
Contributor

I can confirm the bug with a simple C program https://gist.github.com/haesbaert/10d3e3bb5fa9171dfcf65e1f5b58e95c
The non-blocking version works.
cc -o uring_tests uring_tests.c -Wall -luring && while true;do date; ./uring_tests;done

@talex5
Copy link
Collaborator Author

talex5 commented Sep 26, 2022

I can confirm the bug with a simple C program

OK, that hangs for me after a while too! Using Linux 5.19.9.

@haesbaert
Copy link
Contributor

I've tested the kernel patch from axboe in axboe/liburing#665 (comment) and indeed it fixes the bug.

talex5 pushed a commit to haesbaert/eio that referenced this issue Oct 20, 2022
Turns out the kernel needs to handle blocking file descriptors differently, it
needs to do thread-pooling and it's a different code path (also slower).

It also seems that this is bugged in some cases. This makes sense as I believe
most people don't test with blocking FDs. I believe we should make all our FDs
non blocking as a precaution.

There is a program listed in issue ocaml-multicore#319 that replicates this. I could hang it in
< 5 seconds, and it's now running for 10minutes with the fix, so kaboom.

A fix has been committed to mainline Linux (should be in Linux 6.1):
torvalds/linux@46a525e
haesbaert added a commit to haesbaert/eio that referenced this issue Oct 20, 2022
This issue turned out to be a kernel bug which has been fixed in:
torvalds/linux@46a525e

Reported here:
axboe/liburing#665 (comment)

We can workaround it by making sure pipes are non-blocking, uring considers
pipes unbounded work and relies on uring worker threads if they are blocking,
the bug is only triggered if this is the case, so force them to be non-blocking.

If we use splice, the splice call itself needs the worker threads and the bug
surfaces again (I verified this with perf probes), so disable splice for now.
Even with the fix, it's desirable to keep pipes as non-blocking to avoid thread
pooling.

The splice call can return EAGAIN in uring, this happens even with the kernel
patched, so handle it for the future.

We can tune this better by disabling splice only for the unpatched kernels.
haesbaert added a commit that referenced this issue Oct 21, 2022
haesbaert added a commit to haesbaert/eio that referenced this issue Oct 24, 2022
Make sure we workaround uring pipe bug from ocaml-multicore#319 in Eio_unix.pipe.
Zap Eio_linux.pipe since it's only used in tests.
haesbaert added a commit to haesbaert/eio that referenced this issue Oct 24, 2022
Make sure we use pipes as nonblocking see ocaml-multicore#319 in Eio_unix.pipe.
Zap Eio_linux.pipe since it's only used in tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants