
reimplement sendmmsg/recvmmsg with better API #1744

Merged · 7 commits into nix-rust:master · Oct 14, 2022

Conversation

@pacak (Contributor) commented Jun 13, 2022

Motivation is explained in #1602: the new version allows receiving data without performing allocations inside the receive loop and using the received data without extra copying.

This pull request contains a breaking change to the recvmmsg API (obviously) and also affects recvmsg: the new version does not set the length of the control message buffer if one is passed. The latter change could be avoided with a bit more copy-paste.

Fixes #1602

@pacak force-pushed the master branch 6 times, most recently from 7f5ccbc to 9e80118 on June 13, 2022 at 04:12
@rtzoeller (Collaborator) left a comment

Thanks @pacak! Overall the changes look very reasonable, although I've not done an entirely thorough review just yet.

A few requests:

  • IMHO it makes sense to make a similar change for sendmmsg in the same release as this PR, to avoid splitting up the breaking changes into multiple releases (thus prolonging any associated pains). Would you be willing to make a similar change for sendmmsg?
  • Do you have an example of retrofitting an existing caller to use the new API you can share? In places where we make nontrivial API changes, I'd like to maintain a list of example changes to help other clients migrate.

Review comments (outdated, resolved): CHANGELOG.md, src/sys/socket/mod.rs
@pacak (Contributor, Author) commented Jun 20, 2022

Would you be willing to make a similar change for sendmmsg?

I can, but it would take some time. It would be great if #1748 were merged before that. Currently, making any changes to socket/mod.rs is unnecessarily painful with an editor configured to automatically format everything.

Do you have an example of retrofitting an existing caller to use the new API you can share?

Sort of. I updated the tests in test/sys/test_socket.rs to use the new API, nothing crazy: just one extra parameter allocated in advance and passed into recvmmsg, and the implicit collect replaced with an explicit one. Other than that, I can't really imagine how you would keep using the existing API. #1602 contains the explanation of what's wrong with the old API, along with some "me too" messages (and a few more "me too"s that were later removed).
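
For reference, a rough sketch of what a retrofitted recvmmsg call looks like with the new API, modeled on the updated tests in this PR. Function names, buffer sizes, and exact signatures here are illustrative and may differ slightly from what finally landed:

```rust
use std::io::IoSliceMut;
use std::os::unix::io::RawFd;

use nix::sys::socket::{recvmmsg, MsgFlags, MultiHeaders};

fn receive_batch(fd: RawFd) -> nix::Result<usize> {
    // Receive buffers, allocated once up front.
    let mut buffers = [[0u8; 1500]; 8];
    let iovs: Vec<_> = buffers
        .iter_mut()
        .map(|buf| [IoSliceMut::new(&mut buf[..])])
        .collect();

    // The "extra parameter allocated in advance": preallocated headers (and
    // optionally a control-message buffer) that recvmmsg fills on each call.
    let mut data = MultiHeaders::<()>::preallocate(iovs.len(), None);

    // The explicit collect replaces the allocation the old API did internally.
    let msgs: Vec<_> = recvmmsg(fd, &mut data, iovs.iter(), MsgFlags::empty(), None)?
        .collect();

    Ok(msgs.iter().map(|m| m.bytes).sum())
}
```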

@rtzoeller (Collaborator)

I can, but it would take some time. It would be great if #1748 were merged before that. Currently, making any changes to socket/mod.rs is unnecessarily painful with an editor configured to automatically format everything.

Fair enough. The alternative would be to just disable the auto-formatting, but I agree that's annoying to do for one repo (when the proper fix is apparent). I've been reluctant to submit any bulk reformatting changes up until now to avoid causing merge conflicts for all outstanding PRs, but I think we've reached a cross-over point where most PRs already have merge conflicts anyways so we should just do it. @asomers thoughts?

@asomers (Member) commented Jun 20, 2022

Yeah, I think it's time to do it. I for one don't have any big outstanding PRs.

@pacak force-pushed the master branch 2 times, most recently from 276e8a3 to 90e4d53 on June 22, 2022 at 02:21
@pacak changed the title from "reimplement recvmmsg with better API" to "reimplement sendmmsg/recvmmsg with better API" on Jun 22, 2022
@pacak (Contributor, Author) commented Jun 22, 2022

Would you be willing to make a similar change for sendmmsg?

Ended up implementing it. Tests seem to work. I don't have use cases in real code, so I can't test it properly.
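
For completeness, a rough sketch of the corresponding sendmmsg call, again modeled on the tests in this PR. The function name, parameter order, and exact types are approximate and may differ from what actually landed:

```rust
use std::io::IoSlice;
use std::os::unix::io::RawFd;

use nix::sys::socket::{sendmmsg, ControlMessage, MsgFlags, MultiHeaders, SockaddrIn};

fn send_batch(fd: RawFd, payloads: &[Vec<u8>], dest: SockaddrIn) -> nix::Result<usize> {
    // One iovec array per message, plus a destination address for each.
    let iovs: Vec<_> = payloads.iter().map(|p| [IoSlice::new(&p[..])]).collect();
    let addrs: Vec<Option<SockaddrIn>> = vec![Some(dest); payloads.len()];
    let cmsgs: [ControlMessage; 0] = [];

    // Headers are preallocated once and can be reused across calls in a send loop.
    let mut data = MultiHeaders::<SockaddrIn>::preallocate(payloads.len(), None);

    let results = sendmmsg(fd, &mut data, &iovs, &addrs, cmsgs, MsgFlags::empty())?;
    Ok(results.map(|r| r.bytes).sum())
}
```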

@rtzoeller (Collaborator)

@pacak can you rebase and reformat? Thanks!

@pacak (Contributor, Author) commented Jun 24, 2022

Seems to work.

@rroohhh commented Jul 5, 2022

Something seems to be wrong here. This testcase: https://gist.github.com/rroohhh/acb6247f7634e4e68d53604c4c3f88b1
fails in debug mode with

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `[207]`,
 right: `[170]`', src/bin/testcase.rs:52:9

and in release mode with

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: EFAULT', src/bin/testcase.rs:38:6

@pacak (Contributor, Author) commented Jul 5, 2022

Something seems to be wrong here. This testcase: https://gist.github.com/rroohhh/acb6247f7634e4e68d53604c4c3f88b1

Interesting. Moving the buffer allocations out of lines 33 and 45 fixes the problem for me; it looks like something holds an invalid pointer in both cases... Thank you for a great test case, I'll see how to fix the problem tomorrow.

@pacak (Contributor, Author) commented Jul 6, 2022

A good test case.

A temporary array created on lines 33/45 used to be stored as a raw pointer inside the MultiHeaders structure. I fixed it by requiring a reference with an appropriate lifetime, so the test case now fails to compile.
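
To illustrate the shape of the fix, here is a hypothetical, heavily simplified example (not the actual nix types; `Headers` and its fields are stand-ins): a struct that stores a raw pointer into caller-provided storage also carries a lifetime parameter bound to that storage, so the borrow checker rejects pointing it at a temporary.

```rust
use std::marker::PhantomData;

// Simplified stand-in for a header struct that keeps a raw pointer into
// caller-provided buffers; the lifetime parameter ties the pointer's
// validity to those buffers.
struct Headers<'a> {
    buf_ptr: *const u8,
    _bufs: PhantomData<&'a [u8]>,
}

impl<'a> Headers<'a> {
    fn new(bufs: &'a [u8]) -> Self {
        Headers { buf_ptr: bufs.as_ptr(), _bufs: PhantomData }
    }
}

fn main() {
    // OK: the buffer outlives the header that points into it.
    let bufs = vec![0u8; 64];
    let hdr = Headers::new(&bufs);
    println!("{:p}", hdr.buf_ptr);

    // Pointing the header at a temporary no longer compiles once the header
    // is used later (this is where the old code kept a dangling pointer):
    // let hdr2 = Headers::new(&vec![0u8; 64]); // temporary dropped while borrowed
}
```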

@rtzoeller (Collaborator)

Looks like nightly clippy introduced some new warnings. I'll look at fixing them in the next day or two.

@rroohhh commented Jul 6, 2022

@pacak thank you for the swift response and fix

@rtzoeller (Collaborator)

@pacak apologies for the delay in responding to this. I've been trying to track down any potentially impacted clients to verify that this new solution fully meets their needs, but had previously been unsuccessful. Fortunately I've just found two.


@amatissart your tokio-udt crate makes use of sendmmsg and recvmmsg. In the interest of helping us avoid breaking clients, would you be willing to try adapting your crate to these proposed changes?

@nullchinchilla same as above, except with the sosistab crate and just recvmmsg.

It'd be greatly appreciated if you're willing to help out, but there's no obligation. Thank you both either way. 😃

@rtzoeller (Collaborator)

@pacak a few observations; I've not had too much time to dig in:

  1. I can bang on test_recvmm2 indefinitely on Linux AMD64 with no issues.
  2. Anecdotally, there appears to be a "clumpiness" to the failures. Running the test in a loop until failure, I might see it fail on the first iteration three or four attempts in a row, and then on the next run take dozens of iterations to fail. Not sure what to make of this - perhaps an alignment issue or something similar where we get "lucky" frequently, in a streaky way?
  3. Here are the values of recv and rmsg for the passing and failing cases on Linux aarch64. Perhaps you can make something out of them - it seems like we might be getting back data but not the control message?
Passing
[src/sys/socket/mod.rs:1827] &recv = MultiResults {
    rmm: MultiHeaders {
        items: [
            mmsghdr {
                msg_hdr: msghdr {
                    msg_name: 0x0000000000000000,
                    msg_namelen: 0,
                    msg_iov: 0x0000ffffa4001290,
                    msg_iovlen: 2,
                    msg_control: 0x0000ffffa40013f0,
                    msg_controllen: 64,
                    msg_flags: 0,
                },
                msg_len: 400,
            },
            mmsghdr {
                msg_hdr: msghdr {
                    msg_name: 0x0000000000000000,
                    msg_namelen: 0,
                    msg_iov: 0x0000ffffa4001350,
                    msg_iovlen: 2,
                    msg_control: 0x0000ffffa4001430,
                    msg_controllen: 64,
                    msg_flags: 0,
                },
                msg_len: 0,
            },
        ],
        addresses: [
            core::mem::maybe_uninit::MaybeUninit<()>,
            core::mem::maybe_uninit::MaybeUninit<()>,
        ],
        cmsg_buffers: Some(
            [
                64,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                1,
                0,
                0,
                0,
                37,
                0,
                0,
                0,
                11,
                73,
                4,
                99,
                0,
                0,
                0,
                0,
                84,
                54,
                93,
                30,
                0,
                0,
                <zeroes>,
                0,
                0,
            ],
        ),
        msg_controllen: 64,
    },
    current_index: 0,
    received: 1,
}
[src/sys/socket/mod.rs:1832] &rmsg = RecvMsg {
    bytes: 400,
    cmsghdr: Some(
        cmsghdr {
            cmsg_len: 64,
            cmsg_level: 1,
            cmsg_type: 37,
        },
    ),
    address: Some(
        (),
    ),
    flags: (empty),
    iobufs: PhantomData,
    mhdr: msghdr {
        msg_name: 0x0000000000000000,
        msg_namelen: 0,
        msg_iov: 0x0000ffffa4001290,
        msg_iovlen: 2,
        msg_control: 0x0000ffffa40013f0,
        msg_controllen: 64,
        msg_flags: 0,
    },
}
Failing
[src/sys/socket/mod.rs:1827] &recv = MultiResults {
    rmm: MultiHeaders {
        items: [
            mmsghdr {
                msg_hdr: msghdr {
                    msg_name: 0x0000000000000000,
                    msg_namelen: 0,
                    msg_iov: 0x0000ffffb4001290,
                    msg_iovlen: 2,
                    msg_control: 0x0000ffffb40013f0,
                    msg_controllen: 0,
                    msg_flags: 0,
                },
                msg_len: 400,
            },
            mmsghdr {
                msg_hdr: msghdr {
                    msg_name: 0x0000000000000000,
                    msg_namelen: 0,
                    msg_iov: 0x0000ffffb4001350,
                    msg_iovlen: 2,
                    msg_control: 0x0000ffffb4001430,
                    msg_controllen: 64,
                    msg_flags: 0,
                },
                msg_len: 0,
            },
        ],
        addresses: [
            core::mem::maybe_uninit::MaybeUninit<()>,
            core::mem::maybe_uninit::MaybeUninit<()>,
        ],
        cmsg_buffers: Some(
            [
                0,
                0,
                <zeroes>,
                0,
                0,
            ],
        ),
        msg_controllen: 64,
    },
    current_index: 0,
    received: 1,
}
[src/sys/socket/mod.rs:1832] &rmsg = RecvMsg {
    bytes: 400,
    cmsghdr: None,
    address: Some(
        (),
    ),
    flags: (empty),
    iobufs: PhantomData,
    mhdr: msghdr {
        msg_name: 0x0000000000000000,
        msg_namelen: 0,
        msg_iov: 0x0000ffffb4001290,
        msg_iovlen: 2,
        msg_control: 0x0000ffffb40013f0,
        msg_controllen: 0,
        msg_flags: 0,
    },
}

@pacak (Contributor, Author) commented Aug 23, 2022

Yeah, it looks like we are getting the data but not the control header.

Things to try in no particular order:

  1. Change the test to transmit a bunch of messages at once and see whether it fails on the first iteration or can also fail on a later one, where it should perform no allocations.
  2. Backport this test to the old API in some way. It's a new test I added; maybe there's a problem with recvmmsg itself on aarch64.

@pacak (Contributor, Author) commented Oct 6, 2022

Rebased the branch and backported the test in #1837.

@pacak (Contributor, Author) commented Oct 12, 2022

@rtzoeller Can you try running both this and backported tests on your Pi?

@rtzoeller (Collaborator)

bors try

@bors bot (Contributor) commented Oct 13, 2022

try

Merge conflict.

@pacak (Contributor, Author) commented Oct 14, 2022

ubuntu@ip-10-90-3-99:~/nix$ cat /proc/cpuinfo  | head -8
processor       : 0
BogoMIPS        : 243.75
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x3
CPU part        : 0xd0c
CPU revision    : 1
ubuntu@ip-10-90-3-99:~/nix$ lsb_release  -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy
ubuntu@ip-10-90-3-99:~/nix$ rustc -vV
rustc 1.56.1 (59eed8a2a 2021-11-01)
binary: rustc
commit-hash: 59eed8a2aac0230a8b53e89d4e99d55912ba6b35
commit-date: 2021-11-01
host: aarch64-unknown-linux-gnu
release: 1.56.1
LLVM version: 13.0.0
ubuntu@ip-10-90-3-99:~/nix$ for i in `seq 1 10000` ;do cargo test test_recvmm2 --lib ; done 2> /dev/null > a
ubuntu@ip-10-90-3-99:~/nix$ cat a | sort | uniq -c
  30000 
  10000 running 1 test
  10000 test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 60 filtered out; finished in 0.00s
  10000 test sys::socket::test::test_recvmm2 ... ok
thread 'sys::test_socket::test_af_alg_cipher' panicked at 'bind failed: ENOENT', test/sys/test_socket.rs:861:27

Something tells me the problem is in the image or container the CI uses.

@pacak (Contributor, Author) commented Oct 14, 2022

Also works on a bare metal instance.

@rtzoeller (Collaborator)

Something tells me the problem is in the image or container the CI uses.

I would agree, except it is reproducing on the physical hardware in front of me as well. 😭


Tired of being the bottleneck here, I spent some additional time on this today with the intent of finding a root cause. I had the idea of using strace to look for differences in syscalls between runs that pass and runs that fail.

The only problem with that strategy is that using strace causes the runs to always succeed - I can reliably reproduce the failure by invoking the command directly (with --test-threads=1), but running it under e.g. strace -e trace=recvmmsg leads to the test always passing...

Which leads me to conclude that there is probably some sort of race condition going on, but that it's probably outside the scope of this code (i.e. in the kernel) and that it's only hitting very slow devices. It reproduces on the CI, and on my RPi3 with both the 5.17 and 5.19 kernels on Fedora 36, but I wouldn't expect either to be particularly quick. Alternatively, there is a bug in this code, but it's sufficiently subtle that we are unlikely to characterize it here without additional reports from other users.

For good measure, I confirmed that the bug also reproduces on my hardware under the aarch64-unknown-linux-musl target but fails to reproduce under strace (as expected).

@pacak thank you for your effort and patience on this. Can you disable test_recvmm2 on Linux aarch64, clean up the git history (either all squashed or combining the format + clippy fixes into the previous commits, whichever you prefer) and then I'll submit this?

@pacak (Contributor, Author) commented Oct 14, 2022

I would agree, except it is reproducing on the physical hardware in front of me as well.

It could be related to the kernel version; I used Ubuntu 22.04, the latest LTS, which uses 5.15. Also, not shown in what I posted, but I ran the same test on 4 different instances, from fast 64-core bare metal ones down to a slow dual-core virtual one.

Can you disable test_recvmm2 on Linux aarch64, clean up the git history

Will do. The test itself mostly works on aarch64; I'll just disable the assertion about the timestamps on that platform.

Do you think it's worth mentioning a potential problem on aarch64 in the docs? Or at least in comments? Or are the conditions rare enough that it's unlikely to matter? It needs a combination of aarch64, slow hardware, recvmmsg, and an attempt to collect kernel timestamps - the latter is something you usually do only when you care about top performance and accurate measurements, and the workaround is to use recvmsg instead...
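
A sketch of what gating the assertion could look like (illustrative only; the test name, the `saw_timestamp` flag, and the assertion message are placeholders, not the actual test code):

```rust
#[test]
fn recvmm2_timestamp_check() {
    // ... call recvmmsg and record whether a timestamp control message arrived ...
    let saw_timestamp = true; // placeholder for the real check

    // The control message is sometimes missing on slow aarch64 machines (see
    // the discussion above), so only assert on other architectures.
    #[cfg(not(target_arch = "aarch64"))]
    assert!(saw_timestamp, "expected a timestamp control message");

    #[cfg(target_arch = "aarch64")]
    let _ = saw_timestamp; // keep the rest of the test running, skip the assertion
}
```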

Commit messages:

CMSG_FIRSTHDR/CMSG_NEXTHDR operate in terms of pointers contained inside the msghdr structure; vector capacity doesn't matter to them.

This changes the external behavior of recvmsg/recvmmsg in the sense that the buffer passed to store control messages won't have its length updated, but the intended way to receive control messages is the cmsgs iterator on `RecvMsg`, which still works.

This change is required to allow using a single vector to store control messages from multiple packets.

This is already an unsafe function; dealing with pointers directly does not make it much more unsafe, but it simplifies lifetimes later on and, similar to the previous commit, allows allocating a single buffer to store all the control messages.

We'll be using that to reinitialize buffers later.

The new implementation performs no allocations after all the necessary structures are created, removes potentially unsound code used by the old version (see below), and adds a bit more documentation about bugs in how the timeout is actually handled:
```
    let timeout = if let Some(mut t) = timeout {
        t.as_mut() as *mut libc::timespec
    } else {
        ptr::null_mut()
    };
```
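
In the snippet above, the pointer is taken from the local binding `t`, which is dropped at the end of the `if let` expression, so the pointer can dangle. A minimal sketch of one way to avoid that, assuming `timeout: Option<TimeSpec>` as in the old code (illustrative; not necessarily the exact code that landed in this PR):

```rust
    // Copy the timespec into a binding that stays alive until after the
    // libc call, then take the raw pointer from that binding.
    let mut timeout = timeout.map(|mut t| *t.as_mut());
    let timeout_ptr = timeout
        .as_mut()
        .map_or(std::ptr::null_mut(), |ts| ts as *mut libc::timespec);
```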
pacak added a commit to pacak/nix that referenced this pull request Oct 14, 2022
@rtzoeller (Collaborator)

Do you think it's worth mentioning a potential problem on aarch64 in the docs? Or at least in comments?

A comment would be a good idea - I don't think we need to call it out in the public docs.

@pacak (Contributor, Author) commented Oct 14, 2022

Added a comment. Also, if we are talking about race conditions: the backported tests might be passing on the old version of recvmmsg because it's less efficient and performs multiple allocations between calls.

@rtzoeller (Collaborator) left a comment

bors r+

@bors bot merged commit fbdac70 into nix-rust:master on Oct 14, 2022
@pacak (Contributor, Author) commented Oct 14, 2022

Hmmm... We could test this theory by adding a sleep or some redundant allocations in the new test between calling recvmmsg and using the results - line 1847 or so. Well, you can, because I can't reproduce the problem on my side.

Successfully merging this pull request may close: recvmmsg is strange (#1602)

5 participants