eio_linux backend hangs #409

Closed
bikallem opened this issue Jan 18, 2023 · 2 comments · Fixed by #428
Labels
bug Something isn't working

Comments

@bikallem
Contributor

Consider the following program, which uses Eio.Net.accept_fork to run a server on port 8081 and then makes client connections to it. The number of clients is user-configurable via the CLI option -clients. If the number is <= 62, we get output like the following:

+Server accepted connection from client
+Server received: "Hello from client"
+client received: "Bye"

However, if the number is >= 63, the program prints none of the above and simply hangs.

$ cat test_run_server.ml
open Eio

let addr = `Tcp (Eio.Net.Ipaddr.V4.loopback, 8081)

let read_all flow =
  let b = Buffer.create 100 in
  Eio.Flow.copy flow (Eio.Flow.buffer_sink b);
  Buffer.contents b

let eio_run_server ~clients env sw =
  let run_client id () =
    traceln "client: Connecting to server ...%d" id;
    let flow = Eio.Net.connect ~sw env#net addr in
    Eio.Flow.copy_string "Hello from client" flow;
    Eio.Flow.shutdown flow `Send;
    let msg = read_all flow in
    traceln "client received: %S" msg;
  in
  let connection_handler clock flow _addr =
    traceln "Server accepted connection from client";
    Fun.protect (fun () ->
      let msg = read_all flow in
      traceln "Server received: %S" msg;
      Eio.Time.sleep clock 0.01
    ) ~finally:(fun () -> Eio.Flow.copy_string "Bye" flow)
  in
  let server_sock = Eio.Net.listen ~reuse_addr:true ~backlog:128 ~sw env#net addr in
  let connection_handler = connection_handler env#clock in
  let clients = List.init clients (fun id -> run_client (id+1)) in
  let server () =
    traceln "starting server ..."; 
    Eio.Net.accept_fork ~sw server_sock ~on_error:raise connection_handler 
  in
  Fiber.all (server :: clients)

let () =
  let clients = ref 60 in
  Arg.parse
    [ ("-clients", Arg.Set_int clients, " total clients to spawn")]
    ignore "test Eio.Net.fork_accept()";

  Printexc.record_backtrace true ;
  Eio_main.run @@ fun env ->
  Switch.run @@ fun sw ->
  eio_run_server ~clients:!clients env sw
$ cat dune
(executable (name test_run_server) (libraries eio eio_main))
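
For reference, a dune invocation along these lines (the exact command is an assumption about the build setup, not part of the original report) reproduces both behaviours:

$ dune exec ./test_run_server.exe -- -clients 62   # prints the expected output
$ dune exec ./test_run_server.exe -- -clients 63   # hangs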
@talex5
Collaborator

talex5 commented Jan 18, 2023

It's because the program submits enough requests to fill uring's SQE buffer. Then uring returns None on the next request, indicating that a submit is needed. Ideally, we would call submit at this point. Maybe uring could even do it automatically for us. Instead, eio_linux adds the failed request to the io_q queue, and doesn't try to resume anything from there until some existing request actually finishes.

It would probably also be better if eio_linux just failed the request completely if it really can't get an SQE (even after submit). That should never happen under normal use, and it's probably better to let the user know the system is badly overloaded.

/cc @haesbaert @TheLortex

(this is with the dev version of uring; the current uring 0.4 release also fails but for a different reason: it runs out of space in its collection of in-progress requests and gives up)
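
To make the failure mode concrete, here is a minimal self-contained OCaml model of the behaviour just described. Every name in it (sqe_capacity, pending, io_q, enqueue, buggy_enqueue) is a hypothetical stand-in, not eio_linux's or uring's actual API; requests are modeled as plain integer ids.

let sqe_capacity = 64                        (* fixed ring size; illustrative *)
let pending : int Queue.t = Queue.create ()  (* SQEs filled, not yet submitted *)
let io_q : int Queue.t = Queue.create ()     (* requests that got no SQE *)

(* Models the uring enqueue: None means the ring is full and a submit
   is needed before more SQEs can be obtained. *)
let enqueue req =
  if Queue.length pending < sqe_capacity
  then (Queue.add req pending; Some req)
  else None

(* The buggy path: park the request on io_q and wait for an existing
   request to complete. But the filled SQEs were never submitted to the
   kernel, so no completion ever arrives and the parked request waits
   forever: the hang reported above. *)
let buggy_enqueue req =
  match enqueue req with
  | Some _ -> ()
  | None -> Queue.add req io_q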

talex5 added the bug label Jan 18, 2023
@haesbaert
Contributor

My instinct would be to make the application do a Uring.submit when it gets None, and then retry; if the retry also fails, then do what we are already doing:

match op with

In this case, if read_fixed or write_fixed gets None, we would try a submit once before returning. Since Eio already ignores the return value of submit, we're no worse off than before.

I wouldn't want uring doing submits by itself when needed, since that discards state. In fact, when submit itself fails with a certain error (I can't remember which), we should drain the CQEs, since that is the kernel's back-pressure mechanism.
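
Here is a sketch of that submit-once-and-retry idea, again with hypothetical names standing in for eio_linux internals: try_get_sqe models an enqueue that returns None on a full ring, submit models Uring.submit, and defer_to_io_q is the existing fallback path.

(* Hypothetical sketch, not the actual eio_linux code. *)
let enqueue_with_retry ~try_get_sqe ~submit ~defer_to_io_q req =
  match try_get_sqe req with
  | Some _ -> ()                   (* got an SQE on the first attempt *)
  | None ->
    ignore (submit ());            (* flush the full SQE ring to the kernel *)
    match try_get_sqe req with     (* retry once: slots should now be free *)
    | Some _ -> ()
    | None -> defer_to_io_q req    (* still no SQE: existing fallback *)

The fix that eventually landed (see the commits below) takes essentially this shape: flush the SQE queue and retry, rather than waiting for a completion that may never come.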

talex5 added a commit to talex5/eio that referenced this issue Jan 31, 2023
We may fail to submit a job because the SQE queue is full. Previously
we would wait until some existing request completed, but that might
never happen. Instead, we just flush the SQE queue and retry.

Fixes ocaml-multicore#409.