eio_linux backend hangs #409

Closed
bikallem opened this issue Jan 18, 2023 · 2 comments · Fixed by #428
Labels
bug Something isn't working

Comments

@bikallem
Contributor

Consider the following program, which uses Eio.Net.accept_fork to run a server on port 8081 and then makes client connections to it. The number of clients is user-configurable via the CLI option -clients. If the number is <= 62, we get output like the following:

+Server accepted connection from client
+Server received: "Hello from client"
+client received: "Bye"

However, if the number is >= 63, the program prints none of the above and simply hangs.

$ cat test_run_server.ml
open Eio

let addr = `Tcp (Eio.Net.Ipaddr.V4.loopback, 8081)

let read_all flow =
  let b = Buffer.create 100 in
  Eio.Flow.copy flow (Eio.Flow.buffer_sink b);
  Buffer.contents b

let eio_run_server ~clients env sw =
  let run_client id () =
    traceln "client: Connecting to server ...%d" id;
    let flow = Eio.Net.connect ~sw env#net addr in
    Eio.Flow.copy_string "Hello from client" flow;
    Eio.Flow.shutdown flow `Send;
    let msg = read_all flow in
    traceln "client received: %S" msg;
  in
  let connection_handler clock flow _addr =
    traceln "Server accepted connection from client";
    Fun.protect (fun () ->
      let msg = read_all flow in
      traceln "Server received: %S" msg;
      Eio.Time.sleep clock 0.01
    ) ~finally:(fun () -> Eio.Flow.copy_string "Bye" flow)
  in
  let server_sock = Eio.Net.listen ~reuse_addr:true ~backlog:128 ~sw env#net addr in
  let connection_handler = connection_handler env#clock in
  let clients = List.init clients (fun id -> run_client (id+1)) in
  let server () =
    traceln "starting server ..."; 
    Eio.Net.accept_fork ~sw server_sock ~on_error:raise connection_handler 
  in
  Fiber.all (server :: clients)

let () =
  let clients = ref 60 in
  Arg.parse
    [ ("-clients", Arg.Set_int clients, " total clients to spawn")]
    ignore "test Eio.Net.fork_accept()";

  Printexc.record_backtrace true ;
  Eio_main.run @@ fun env ->
  Switch.run @@ fun sw ->
  eio_run_server ~clients:!clients env sw
$ cat dune
(executable (name test_run_server) (libraries eio eio_main))
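
For reference, a dune invocation along these lines (the exact command is an assumption about the build setup, not part of the original report) reproduces both behaviours:

$ dune exec ./test_run_server.exe -- -clients 62   # prints the expected output
$ dune exec ./test_run_server.exe -- -clients 63   # hangs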
@talex5
Collaborator

talex5 commented Jan 18, 2023

It's because the program submits enough requests to fill uring's SQE buffer. Then uring returns None on the next request, indicating that a submit is needed. Ideally, we would call submit at this point. Maybe uring could even do it automatically for us. Instead, eio_linux adds the failed request to the io_q queue, and doesn't try to resume anything from there until some existing request actually finishes.

It would probably also be better if eio_linux just failed the request completely if it really can't get an SQE (even after submit). That should never happen under normal use, and it's probably better to let the user know the system is badly overloaded.

/cc @haesbaert @TheLortex

(this is with the dev version of uring; the current uring 0.4 release also fails but for a different reason: it runs out of space in its collection of in-progress requests and gives up)
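
To make the failure mode concrete, here is a minimal self-contained OCaml model of the behaviour just described. Every name in it (sqe_capacity, pending, io_q, enqueue, buggy_enqueue) is a hypothetical stand-in, not eio_linux's or uring's actual API; requests are modeled as plain integer ids.

let sqe_capacity = 64                        (* fixed ring size; illustrative *)
let pending : int Queue.t = Queue.create ()  (* SQEs filled, not yet submitted *)
let io_q : int Queue.t = Queue.create ()     (* requests that got no SQE *)

(* Models the uring enqueue: None means the ring is full and a submit
   is needed before more SQEs can be obtained. *)
let enqueue req =
  if Queue.length pending < sqe_capacity
  then (Queue.add req pending; Some req)
  else None

(* The buggy path: park the request on io_q and wait for an existing
   request to complete. But the filled SQEs were never submitted to the
   kernel, so no completion ever arrives and the parked request waits
   forever: the hang reported above. *)
let buggy_enqueue req =
  match enqueue req with
  | Some _ -> ()
  | None -> Queue.add req io_q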

talex5 added the bug label Jan 18, 2023
@haesbaert
Contributor

My instinct would be to make the application do a Uring.submit when it gets None, and then retry; if the retry also fails, then do what we are already doing:

match op with

In this case, if read_fixed or write_fixed gets None, we would try a submit once before returning. Since Eio already ignores the return value of submit, we're no worse off than before.

I wouldn't want uring doing submits by itself when needed, since that discards state. In fact, when submit itself fails with a certain error (I can't remember which), we should drain the CQEs, since that is the kernel's back-pressure mechanism.
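
Here is a sketch of that submit-once-and-retry idea, again with hypothetical names standing in for eio_linux internals: try_get_sqe models an enqueue that returns None on a full ring, submit models Uring.submit, and defer_to_io_q is the existing fallback path.

(* Hypothetical sketch, not the actual eio_linux code. *)
let enqueue_with_retry ~try_get_sqe ~submit ~defer_to_io_q req =
  match try_get_sqe req with
  | Some _ -> ()                   (* got an SQE on the first attempt *)
  | None ->
    ignore (submit ());            (* flush the full SQE ring to the kernel *)
    match try_get_sqe req with     (* retry once: slots should now be free *)
    | Some _ -> ()
    | None -> defer_to_io_q req    (* still no SQE: existing fallback *)

The fix that eventually landed (see the commits below) takes essentially this shape: flush the SQE queue and retry, rather than waiting for a completion that may never come.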

talex5 added a commit to talex5/eio that referenced this issue Jan 31, 2023
We may fail to submit a job because the SQE queue is full. Previously
we would wait until some existing request completed, but that might
never happen. Instead, we just flush the SQE queue and retry.

Fixes ocaml-multicore#409.