
feat(udp): use sendmmsg and recvmmsg #1741

Closed · wants to merge 19 commits
Conversation

@mxinden (Collaborator) commented Mar 13, 2024

Write and read up to BATCH_SIZE datagrams with a single sendmmsg or recvmmsg syscall.

Previously, neqo_bin::udp::Socket::send would use sendmmsg but provide only a single buffer, effectively degrading it to sendmsg. The same applied to Socket::recv.

With this commit, Socket::send provides BATCH_SIZE buffers on each sendmmsg syscall, writing more than one datagram at a time when available. The same applies to Socket::recv.


Part of #1693.
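
For readers unfamiliar with the syscall, the sketch below shows roughly what a batched receive looks like when going through `recvmmsg(2)` directly via the `libc` crate on Linux. It is illustrative only and not the code in this pull request; `recv_batch`, `BATCH_SIZE`, and `DATAGRAM_SIZE` are made-up names and values, and error, GRO, and ECN handling is omitted.

```rust
use std::io;
use std::os::fd::AsRawFd;

const BATCH_SIZE: usize = 32; // assumed batch size
const DATAGRAM_SIZE: usize = 2048; // assumed maximum datagram size

/// Read up to BATCH_SIZE datagrams from a UDP socket with one recvmmsg(2) call.
fn recv_batch(socket: &std::net::UdpSocket) -> io::Result<Vec<Vec<u8>>> {
    // One receive buffer, one iovec, and one mmsghdr per slot in the batch.
    let mut bufs = vec![[0u8; DATAGRAM_SIZE]; BATCH_SIZE];
    let mut iovecs: Vec<libc::iovec> = bufs
        .iter_mut()
        .map(|buf| libc::iovec {
            iov_base: buf.as_mut_ptr().cast(),
            iov_len: buf.len(),
        })
        .collect();
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::mmsghdr = unsafe { std::mem::zeroed() };
            hdr.msg_hdr.msg_iov = iov as *mut libc::iovec;
            hdr.msg_hdr.msg_iovlen = 1;
            hdr
        })
        .collect();

    // A single syscall fills up to BATCH_SIZE datagrams.
    let n = unsafe {
        libc::recvmmsg(
            socket.as_raw_fd(),
            msgs.as_mut_ptr(),
            BATCH_SIZE as u32,
            0,
            std::ptr::null_mut(),
        )
    };
    if n < 0 {
        return Err(io::Error::last_os_error());
    }

    // msg_len records how many bytes each received datagram carries.
    Ok(msgs[..n as usize]
        .iter()
        .zip(&bufs)
        .map(|(msg, buf)| buf[..msg.msg_len as usize].to_vec())
        .collect())
}
```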

@mxinden force-pushed the recvmmsg branch 3 times, most recently from 1995639 to c06253f on March 15, 2024 09:50
github-actions bot commented Mar 15, 2024

Benchmark results

Performance differences relative to 3e53709.

  • drain a timer quickly
    time: [434.66 ns 441.22 ns 447.54 ns]
    change: [-0.6964% +0.8250% +2.9072%] (p = 0.38 > 0.05)
    No change in performance detected.

  • coalesce_acked_from_zero 1+1 entries
    time: [196.62 ns 197.11 ns 197.61 ns]
    change: [-0.8171% -0.4829% -0.1380%] (p = 0.01 < 0.05)
    Change within noise threshold.

  • coalesce_acked_from_zero 3+1 entries
    time: [240.43 ns 241.00 ns 241.64 ns]
    change: [-0.5382% -0.0751% +0.3426%] (p = 0.76 > 0.05)
    No change in performance detected.

  • coalesce_acked_from_zero 10+1 entries
    time: [239.00 ns 239.69 ns 240.53 ns]
    change: [-1.3511% -0.5261% +0.0766%] (p = 0.17 > 0.05)
    No change in performance detected.

  • coalesce_acked_from_zero 1000+1 entries
    time: [216.68 ns 216.96 ns 217.27 ns]
    change: [-0.7531% -0.1853% +0.3917%] (p = 0.55 > 0.05)
    No change in performance detected.

  • RxStreamOrderer::inbound_frame()
    time: [119.87 ms 119.94 ms 120.00 ms]
    change: [+0.4749% +0.6968% +0.8697%] (p = 0.00 < 0.05)
    Change within noise threshold.

  • transfer/Run multiple transfers with varying seeds
    time: [118.08 ms 118.39 ms 118.71 ms]
    thrpt: [33.696 MiB/s 33.786 MiB/s 33.875 MiB/s]
    change:
    time: [-3.4361% -3.1033% -2.7586%] (p = 0.00 < 0.05)
    thrpt: [+2.8369% +3.2027% +3.5584%]
    Change within noise threshold.

  • transfer/Run multiple transfers with the same seed
    time: [118.90 ms 119.07 ms 119.25 ms]
    thrpt: [33.542 MiB/s 33.594 MiB/s 33.642 MiB/s]
    change:
    time: [-2.7935% -2.5730% -2.3522%] (p = 0.00 < 0.05)
    thrpt: [+2.4089% +2.6410% +2.8738%]
    Change within noise threshold.

  • 1-conn/1-100mb-resp (aka. Download)/client
    time: [1.0852 s 1.0898 s 1.0946 s]
    thrpt: [91.361 MiB/s 91.760 MiB/s 92.150 MiB/s]
    change:
    time: [-2.3395% -0.9694% +0.1106%] (p = 0.16 > 0.05)
    thrpt: [-0.1105% +0.9789% +2.3956%]
    No change in performance detected.

  • 1-conn/10_000-parallel-1b-resp (aka. RPS)/client
    time: [542.73 ms 546.39 ms 550.03 ms]
    thrpt: [18.181 Kelem/s 18.302 Kelem/s 18.425 Kelem/s]
    change:
    time: [+26.586% +27.556% +28.610%] (p = 0.00 < 0.05)
    thrpt: [-22.245% -21.603% -21.002%]
    💔 Performance has regressed.

  • 1-conn/1-1b-resp (aka. HPS)/client
    time: [52.299 ms 52.653 ms 53.052 ms]
    thrpt: [18.849 elem/s 18.992 elem/s 19.121 elem/s]
    change:
    time: [+2.6478% +3.6365% +4.7018%] (p = 0.00 < 0.05)
    thrpt: [-4.4906% -3.5089% -2.5795%]
    💔 Performance has regressed.

Client/server transfer results

Transfer of 134217728 bytes over loopback.

| Client | Server | CC | Pacing | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|:---|:---|:---|---:|---:|---:|---:|
| msquic | msquic | | | 671.8 ± 293.0 | 379.2 | 1321.6 | 1.00 |
| neqo | msquic | reno | on | 908.9 ± 211.4 | 713.1 | 1268.5 | 1.00 |
| neqo | msquic | reno | | 873.0 ± 170.4 | 713.5 | 1221.6 | 1.00 |
| neqo | msquic | cubic | on | 958.8 ± 398.9 | 697.1 | 1869.5 | 1.00 |
| neqo | msquic | cubic | | 832.3 ± 127.4 | 710.1 | 1009.0 | 1.00 |
| msquic | neqo | reno | on | 3230.5 ± 230.0 | 2967.8 | 3778.0 | 1.00 |
| msquic | neqo | reno | | 3232.8 ± 200.1 | 2888.6 | 3582.1 | 1.00 |
| msquic | neqo | cubic | on | 3478.9 ± 224.7 | 3270.5 | 3986.4 | 1.00 |
| msquic | neqo | cubic | | 3262.9 ± 74.2 | 3161.2 | 3363.7 | 1.00 |
| neqo | neqo | reno | on | 2627.0 ± 131.2 | 2445.5 | 2849.8 | 1.00 |
| neqo | neqo | reno | | 2537.0 ± 227.2 | 2377.2 | 3146.6 | 1.00 |
| neqo | neqo | cubic | on | 3032.9 ± 272.3 | 2734.0 | 3479.1 | 1.00 |
| neqo | neqo | cubic | | 2626.5 ± 205.8 | 2403.5 | 3060.2 | 1.00 |


codecov bot commented Mar 15, 2024

Codecov Report

Attention: Patch coverage is 73.68421%, with 5 lines in your changes missing coverage. Please review.

Project coverage is 93.22%. Comparing base (3e53709) to head (38fca6b).

| Files | Patch % | Lines |
|:---|---:|---:|
| neqo-http3/src/send_message.rs | 50.00% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1741      +/-   ##
==========================================
- Coverage   93.26%   93.22%   -0.04%     
==========================================
  Files         110      110              
  Lines       35669    35680      +11     
==========================================
- Hits        33266    33262       -4     
- Misses       2403     2418      +15     


@mxinden force-pushed the recvmmsg branch 3 times, most recently from 08b4f32 to a2f2d82 on March 15, 2024 16:37
@mxinden (Collaborator, Author) commented Mar 15, 2024

Status update:

Thus far I am having a hard time finding clear signals that this is a performance improvement. Not saying it isn't. Just missing a proof.

Durations of client/server transfers on the benchmark machine vary significantly (30 s to 45 s) across CI runs. The same holds for I/O syscall samples in perf traces.

Maybe worth hooking into something like criterion to get stable metrics and filter out outliers. I will investigate further. Suggestions welcome.

@larseggert (Collaborator) commented:

Remember that pacing interferes here. You might want to log how large the batches are that you are actually doing on TX/RX. I think @KershawChang had a feature to turn off pacing, which should increase those batches.

As discussed elsewhere, we likely have other bugs/bottlenecks in the code that limit the impact this change will have, too.
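
As a concrete, purely illustrative example of that kind of logging, assuming the `log` crate that neqo already uses; `log_batch_size` is a hypothetical helper, not an existing function:

```rust
/// Hypothetical helper: record how many datagrams each TX/RX batch carried.
fn log_batch_size(direction: &str, datagrams_in_batch: usize) {
    log::debug!("{direction} batch: {datagrams_in_batch} datagram(s)");
}

// e.g. after a batched read or write:
// log_batch_size("rx", received.len());
// log_batch_size("tx", sent_count);
```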

@mxinden (Collaborator, Author) commented Mar 16, 2024

Thank you for the input.

You might want to log how large the batches are that you are actually doing on TX/RX

Will do.

Remember that pacing interferes here.

It is actually disabled by default, though by mistake. Thanks for the hint! See #1753.

Read up to `BATCH_SIZE = 32` datagrams with a single `recvmmsg` syscall.

Previously `neqo_bin::udp::Socket::recv` would use `recvmmsg`, but provide only a
single buffer to write into, effectively using `recvmsg` instead of `recvmmsg`.

With this commit `Socket::recv` provides `BATCH_SIZE` buffers on each
`recvmmsg` syscall, thus reading more than one datagram at a time if available.
@mxinden (Collaborator, Author) commented Apr 3, 2024

Small update:

Solely using recvmmsg (this pull request) without sendmmsg or any buffer optimizations does not show any performance improvements. Quite the opposite, our benchmarks report regressions.

I hacked together a sendmmsg implementation (not yet pushed). The results (sendmmsg + recvmmsg) look promising:

➜  neqo-bin git:(main) ✗ critcmp main sendmmsg                                            
group                                          main                                     sendmmsg
-----                                          ----                                     --------
1-conn/1-100mb-resp (aka. Download)/client     1.09    541.4±5.05ms   184.7 MB/sec      1.00  494.7±213.29ms   202.1 MB/sec
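
For context, a sendmmsg transmit path looks roughly like the sketch below (again via the `libc` crate on Linux, not the actual patch; `send_batch` is a made-up name, the socket is assumed to be connected so no per-datagram destination address is set, and GSO/ECN handling is omitted):

```rust
use std::io;
use std::os::fd::AsRawFd;

/// Hand a batch of datagrams to the kernel with one sendmmsg(2) call.
/// Returns how many datagrams were actually accepted.
fn send_batch(socket: &std::net::UdpSocket, datagrams: &[Vec<u8>]) -> io::Result<usize> {
    // One iovec and one mmsghdr per datagram in the batch.
    let mut iovecs: Vec<libc::iovec> = datagrams
        .iter()
        .map(|d| libc::iovec {
            iov_base: d.as_ptr() as *mut libc::c_void,
            iov_len: d.len(),
        })
        .collect();
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::mmsghdr = unsafe { std::mem::zeroed() };
            hdr.msg_hdr.msg_iov = iov as *mut libc::iovec;
            hdr.msg_hdr.msg_iovlen = 1;
            hdr
        })
        .collect();

    // A single syscall transmits up to datagrams.len() datagrams.
    let sent =
        unsafe { libc::sendmmsg(socket.as_raw_fd(), msgs.as_mut_ptr(), msgs.len() as u32, 0) };
    if sent < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(sent as usize)
}
```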

@larseggert (Collaborator) commented:

recvmmsg without sendmmsg will have no benefit, since without the latter batches will be very short.

@mxinden (Collaborator, Author) commented Apr 4, 2024

  • 1-conn/1-100mb-resp (aka. Download)/client
    time: [781.71 ms 787.17 ms 792.92 ms]
    thrpt: [126.12 MiB/s 127.04 MiB/s 127.93 MiB/s]
    change:
    time: [-30.837% -28.888% -26.932%] (p = 0.00 < 0.05)
    thrpt: [+36.858% +40.623% +44.586%]
    💚 Performance has improved.

🎉 40% improvement in throughput. First (?) time breaking 1 Gbit/s on CI.

@mxinden changed the title from "feat(udp): use recvmmsg" to "feat(udp): use sendmmsg and recvmmsg" on Apr 4, 2024
@mxinden (Collaborator, Author) commented Apr 5, 2024

  • 1-conn/10_000-1b-seq-resp (aka. RPS)/client
    time: [446.15 ms 448.43 ms 450.77 ms]
    thrpt: [22.184 Kelem/s 22.300 Kelem/s 22.414 Kelem/s]
    change:
    time: [+14.829% +15.832% +16.877%] (p = 0.00 < 0.05)
    thrpt: [-14.440% -13.668% -12.914%]
    💔 Performance has regressed.

  • 100-seq-conn/1-1b-resp (aka. HPS)/client
    time: [3.6042 s 3.6064 s 3.6087 s]
    thrpt: [27.711 elem/s 27.728 elem/s 27.745 elem/s]
    change:
    time: [+6.1249% +6.2317% +6.3299%] (p = 0.00 < 0.05)
    thrpt: [-5.9531% -5.8662% -5.7714%]
    💔 Performance has regressed.

Currently investigating why performance regresses here. It does not seem to be tied to the BATCH_SIZE. Setting BATCH_SIZE to 1, effectively disabling sendmmsg and recvmmsg, does not improve the RPS benchmark:

➜  neqo-bin git:(recvmmsg) ✗ critcmp main recvmmsg-batch-32 recvmmsg-batch-1 -f "RPS"                            
group                                          main                                   recvmmsg-batch-1                       recvmmsg-batch-32
-----                                          ----                                   ----------------                       -----------------
1-conn/10_000-1b-seq-resp (aka. RPS)/client    1.00   122.8±16.81ms 79.5 KElem/sec    1.24   152.6±17.01ms 64.0 KElem/sec    1.24   152.6±23.00ms 64.0 KElem/sec

`<SendMessage as SendStream>::send_data` attempts to send a slice of data down
into the QUIC layer, more specifically `neqo_transport::Connection::stream_send_atomic`.

While it attempts to send any existing buffered data at the http3 layer first,
it does not itself fill the http3 layer buffer; instead it only sends data if
the lower QUIC layer has capacity, i.e. only if it can send the data down to the
QUIC layer right away.

https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-http3/src/send_message.rs#L168-L221

`<SendMessage as SendStream>::send_data` is called via
`Http3ServerHandler::send_data`. The wrapper first marks the stream as
`stream_has_pending_data`, marks itself as `needs_processing`, and then calls
down into `<SendMessage as SendStream>::send_data`.

https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-http3/src/connection_server.rs#L51-L74

Thus the wrapper always marks the stream as `stream_has_pending_data`, even
though `send_data` never writes into the buffer and the stream thus might not
actually have pending data.

Why is this problematic?

1. Say that the buffer of the `BufferedStream` of `SendMessage` is empty.

2. Say that the user attempts to write data via `Http3ServerHandler::send_data`.
Despite not filling the http3 layer buffer, the stream is marked as
`stream_has_pending_data`.

3. Say that next the user calls `Http3Server::process`, which will call
`Http3Server::process_http3`, which will call
`Http3ServerHandler::process_http3`, which will call
`Http3Connection::process_sending`, which will call `Http3Connection::send_non_control_streams`.

`Http3Connection::send_non_control_streams` will attempt to flush all http3
layer buffers of streams marked via `stream_has_pending_data`, e.g. the stream
from step (2). Thus it will call `<SendMessage as SendStream>::send` (note
`send` and not the previous `send_data`). This function will attempt to send the
stream's http3 layer buffer. In the case where the http3 layer stream buffer is
empty, it will enqueue a `DataWritable` event for the user. Given that the
buffer of our stream is empty (see (1)), such a `DataWritable` event is always
emitted.

https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-http3/src/send_message.rs#L236-L264

The user, on receiving the `DataWritable` event, will attempt to write to the
stream via `Http3ServerHandler::send_data`, returning to step (2) and thus
closing the busy loop.

How to fix?

This commit adds an additional check to the `has_pending_data` function to
ensure it indeed does have pending data. This breaks the above busy loop.
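
In simplified, hypothetical form (the names below are stand-ins, not neqo's actual types), the fix amounts to consulting the buffer itself rather than only the "marked as pending" flag:

```rust
// Hypothetical, heavily simplified illustration of the fix.
struct BufferedStream {
    buf: Vec<u8>,
}

struct SendMessageLike {
    marked_pending: bool,
    stream: BufferedStream,
}

impl SendMessageLike {
    fn has_pending_data(&self) -> bool {
        // Old behaviour: return the mark alone, which stays true even when
        // nothing was ever written into the buffer, feeding the busy loop.
        // New behaviour: the mark only counts if bytes are actually buffered.
        self.marked_pending && !self.stream.buf.is_empty()
    }
}
```
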
This reverts commit 5d314b6.
mxinden added a commit to mxinden/neqo that referenced this pull request Apr 7, 2024
`neqo_transport::Connection` offers 3 process methods:

- `process`
- `process_output`
- `process_input`

Where `process` is a wrapper around `process_input` and `process_output` calling
both in sequence.

https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L1099-L1107

Previously, `neqo-client` would use `process` only, thus continuously
interleaving output and input. Say `neqo-client` has multiple datagrams
buffered through a GRO read; it could potentially have to do a write in between
each `process` call, as each call to `process` with an input datagram might
return an output datagram to be written.

With this commit `neqo-client` uses `process_output` and `process_input`
directly, thus reducing interleaving on batch reads (GRO and in the future
recvmmsg) and in the future batch writes (GSO and sendmmsg).

Extracted from mozilla#1741.
mxinden added a commit to mxinden/neqo that referenced this pull request Apr 7, 2024
`neqo_transport::Connection` offers 4 process methods:

- `process`
- `process_output`
- `process_input`
- `process_multiple_input`

Where `process` is a wrapper around `process_input` and `process_output` calling
both in sequence.

https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L1099-L1107

Where `process_input` delegates to `process_multiple_input`.

https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L979-L1000

Previously, `neqo-client` would use `process` only, thus continuously
interleaving output and input. Say `neqo-client` has multiple datagrams
buffered through a GRO read; it could potentially have to do a write in between
each `process` call, as each call to `process` with an input datagram might
return an output datagram to be written.

With this commit `neqo-client` uses `process_output` and `process_multiple_input`
directly, thus reducing interleaving on batch reads (GRO and in the future
recvmmsg) and in the future batch writes (GSO and sendmmsg).

By using `process_multiple_input` instead of `process` or `process_input`,
auxiliary logic like `self.cleanup_closed_streams` only has to run once per
input datagram batch, and not for each input datagram.

Extracted from mozilla#1741.
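
Conceptually, the client loop changes as in the sketch below. The types are simplified stand-ins for the real `neqo_transport` ones, so the actual method signatures may differ; the point is that one call ingests the whole input batch and the write side is then drained in one go:

```rust
use std::time::Instant;

// Hypothetical stand-ins for the real neqo_transport types.
struct Datagram;

enum Output {
    Datagram(Datagram),
    None,
}

struct Connection;

impl Connection {
    fn process_multiple_input<'a>(
        &mut self,
        _dgrams: impl IntoIterator<Item = &'a Datagram>,
        _now: Instant,
    ) {
    }

    fn process_output(&mut self, _now: Instant) -> Output {
        Output::None
    }
}

// Feed a whole batch of received datagrams (e.g. from GRO or recvmmsg) into
// the connection at once, then drain all pending output, instead of
// alternating one read and one write per `process` call.
fn handle_batch(conn: &mut Connection, batch: &[Datagram], now: Instant) {
    conn.process_multiple_input(batch.iter(), now);
    while let Output::Datagram(_dgram) = conn.process_output(now) {
        // Write `_dgram` to the UDP socket, ideally batched via sendmmsg/GSO.
    }
}
```
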
github-merge-queue bot pushed a commit that referenced this pull request Apr 9, 2024
* refactor(client): use process_output and process_multiple_input

`neqo_transport::Connection` offers 4 process methods:

- `process`
- `process_output`
- `process_input`
- `process_multiple_input`

Where `process` is a wrapper around `process_input` and `process_output` calling
both in sequence.

https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L1099-L1107

Where `process_input` delegates to `process_multiple_input`.

https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L979-L1000

Previously, `neqo-client` would use `process` only, thus continuously
interleaving output and input. Say `neqo-client` has multiple datagrams
buffered through a GRO read; it could potentially have to do a write in between
each `process` call, as each call to `process` with an input datagram might
return an output datagram to be written.

With this commit `neqo-client` uses `process_output` and `process_multiple_input`
directly, thus reducing interleaving on batch reads (GRO and in the future
recvmmsg) and in the future batch writes (GSO and sendmmsg).

By using `process_multiple_input` instead of `process` or `process_input`,
auxiliary logic like `self.cleanup_closed_streams` only has to run once per
input datagram batch, and not for each input datagram.

Extracted from #1741.

* process_output before handle

* process_output after each input batch

Always use up the send space on the QUIC layer to ensure a
`ConnectionEvent::SendStreamWritable` event is received when more send space is
available.

See mozilla#1819 for details. This commit
implements option 2.

Fixes mozilla#1819.
@larseggert (Collaborator) commented:

@mxinden is this OBE now?

@mxinden (Collaborator, Author) commented Jul 1, 2024

Sorry for the delay here. I plan to extract parts of this pull request for #1693. That said, no need for a draft pull request, I can just cherry-pick off of my branch.

@mxinden closed this on Jul 1, 2024