Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alloc_sockaddr: handle abstract paths (v2) #2248

Merged
merged 9 commits into from Feb 21, 2019

Conversation

@timbertson
Copy link
Contributor

commented Feb 13, 2019

Attempting #2233 again.

I got the tests working on MacOS, but they only failed due to a logic error - the path of an unnamed socket was being returned as a string of 14 null bytes, rather than "" or raising EAFNOSUPPORT. Since the first byte was null, it looked like a perfectly valid (if weird) abstract socket name. I've had to add an ugly #ifdef __linux__ now, which makes me sad.

For context, support for returning "" for an unnamed socket was originally added in https://caml.inria.fr/mantis/view.php?id=7039 (@xavierleroy). The dangerous part of that bug (reading past the end of the allocated space) is already fixed by using strnlen, so we could decide that you once again get garbage for an unnamed socket. But it's probably more useful to keep the current behaviour, even if the code is a bit awkward.

/cc @diml @shindere. If you're able to run this through pre-check and find out (a) the backtrace and (b) the test output, that would be super useful. I'm assuming EAFNOSUPPORT is coming from the case where the sender has an unnamed socket. But I'd like to figure out which function is actually raising it.

@shindere

This comment has been minimized.

Copy link
Contributor

commented Feb 13, 2019

@timbertson timbertson force-pushed the timbertson:sockaddr_unix_abstract_v2 branch from 6b14cdb to 70cbf16 Feb 13, 2019
@shindere

This comment has been minimized.

Copy link
Contributor

commented Feb 13, 2019

@diml

This comment has been minimized.

Copy link
Member

commented Feb 13, 2019

precheck looks good, there are a few failures but they look unrelated to this PR.

AFAIK, support for abstract socket is a Linux specific feature, so the #ifdef __linux__ seems reasonable here.

@diml
diml approved these changes Feb 13, 2019
@shindere

This comment has been minimized.

Copy link
Contributor

commented Feb 13, 2019

@shindere

This comment has been minimized.

Copy link
Contributor

commented Feb 13, 2019

@diml

This comment has been minimized.

Copy link
Member

commented Feb 13, 2019

Ah indeed, I missed that. @timbertson do you have enough information to understand and fix the issue?

@timbertson

This comment has been minimized.

Copy link
Contributor Author

commented Feb 13, 2019

Thanks, @shindere. Is it possible to get a backtrace? In particular it could be either the sendto or the recvfrom call which could be raising. I can't log into https://ci.inria.fr/ocaml/job/precheck/188 so I don't know if there are more details in there which I can't see.

If it's not easy to get a backtrace, I could push a temporary commit with a bit of extra logging between the calls to find out how far it gets...

You mention zsystems, is that https://en.wikipedia.org/wiki/Linux_on_z_Systems ?

timbertson added 2 commits Feb 14, 2019
@shindere

This comment has been minimized.

Copy link
Contributor

commented Feb 14, 2019

@timbertson

This comment has been minimized.

Copy link
Contributor Author

commented Feb 14, 2019

Thanks @shindere , I can reproduce now too with the latest two commits. I meant to add this comment when I pushed them, but github was timing out:


OK, this is getting very weird. I added some extra logging and a try/catch so I could see exactly where it failed, and it suddenly started failing locally on linux with EAFNOSUPPORT. It only failed with native compilation, the bytecode mode still passed.

I reduced the diff to its minimal form, and it ended up that if I surround the recvfrom line with a try/catch, it throws (and gets caught). If I don't surround it, no error is thrown.

Debug changes (still passes): tree 718e21e
Minimal diff (fails due to EAFNOSUPPORT being thrown and caught): commit 8283b3e

Further, I added some debug logging to sockaddr.c itself to print out the sa_family value in alloc_sockaddr (commented out in the above commits, because it breaks the test output checks). With the code that passes, it prints address family: 1 for both test cases.

With the try/catch, it prints 1 for the first test case (a named socket), then 0 for the second test case (the unnamed socket). This causes that function to raise EAFNOSUPPORT, since 0 isn't a known address family and it rightly falls through to the default switch case.

So my question is: how on earth could adding a try/catch affect that? This seems like either I'm forgetting something dumb, or it's quite a nasty bug.

I cherry-picked my branch onto 4.07.1 to see if it was a recent bug, but it has the same behaviour there.


Concrete steps for anyone who wants to reproduce (please reproduce, tell me I'm not going mad 😉 ):

$ git checkout 8283b3e8c06f4b18a787ad7332aabcc323ddb66a^
$ make world.opt
$ ocamltest/ocamltest testsuite/tests/lib-unix/unix-socket/recvfrom_unix.ml -e
# (passes)

$ git checkout 8283b3e8c06f4b18a787ad7332aabcc323ddb66a
$ ocamltest/ocamltest testsuite/tests/lib-unix/unix-socket/recvfrom_unix.ml -e
# (fails)
@shindere

This comment has been minimized.

Copy link
Contributor

commented Feb 14, 2019

@shindere

This comment has been minimized.

Copy link
Contributor

commented Feb 15, 2019

@shindere

This comment has been minimized.

Copy link
Contributor

commented Feb 15, 2019

By the way, just for the record I ran valgrind as follows:

valgrind --track-origins=yes ./recvfrom_unix.opt
@timbertson

This comment has been minimized.

Copy link
Contributor Author

commented Feb 15, 2019

That's incredibly helpful, thanks! I've got a few ideas now, I'll do some digging and report back.

@timbertson

This comment has been minimized.

Copy link
Contributor Author

commented Feb 16, 2019

OK, I think I've figured it out: the existing code only ever worked by chance. When the sender is unbound, recvfrom sets adr_len to 0, and doesn't write anything to the address pointer. We then switch on the first field of that address (sa_family), which is uninitialized.

Since this can only happen for AF_UNIX, I've added a guard so that if adr_len is zero, we set sa_family explicitly to AF_UNIX (since the rest of the code already handles a zero-length address correctly).

It might be more correct to use getsockname to get the receiver's sa_family and use that in unix_recvfrom, but that would have to be done differently at each callsite to alloc_sockaddr.

@shindere could you please run the latest commit through pre-check? If everything passes I'll squash the commits and we'll be good to go 😀

@timbertson timbertson force-pushed the timbertson:sockaddr_unix_abstract_v2 branch from 409f1d8 to 472447b Feb 16, 2019
@shindere

This comment has been minimized.

Copy link
Contributor

commented Feb 18, 2019

@timbertson

This comment has been minimized.

Copy link
Contributor Author

commented Feb 18, 2019

@shindere

This comment has been minimized.

Copy link
Contributor

commented Feb 18, 2019

@timbertson

This comment has been minimized.

Copy link
Contributor Author

commented Feb 19, 2019

That seems reasonable, like this? (d75fc80)

I haven't worked with ocaml's C bindings much, so I followed the existing style and hopefully I'm not doing anything silly ;)

@diml

This comment has been minimized.

Copy link
Member

commented Feb 20, 2019

Yes, that looks correct. A more idiomatic version would be:

value alloc_unix_sockaddr(value path) {
  CAMLparam1(path);
  CAMLlocal1(res);
  res = caml_alloc_small(1, 0);
  Field(res,0) = path;
  CAMLreturn(res);
}

The Begin/End_roots macros are more low-level, you need to understand more in detail what they are doing to check the correctness of the code.

I restarted precheck with the tip of this PR.

Field(res,0) = path;
End_roots();
return res;
CAMLparam1(path);

This comment has been minimized.

Copy link
@hannesm

hannesm Feb 20, 2019

Member

seems like a unicode object replacement character slipped into the previous line (first codepoint), should instead be a normal whitespace (0x20).

@diml

This comment has been minimized.

Copy link
Member

commented Feb 21, 2019

The tip of this PR looks ready to me. @shindere does it look good to you as well?

@shindere

This comment has been minimized.

Copy link
Contributor

commented Feb 21, 2019

@diml

This comment has been minimized.

Copy link
Member

commented Feb 21, 2019

Sounds good!

@diml diml merged commit 3424192 into ocaml:trunk Feb 21, 2019
2 checks passed
2 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@diml

This comment has been minimized.

Copy link
Member

commented Feb 21, 2019

BTW, I moved the changelog entry to trunk since this didn't make it to 4.08.

diml added a commit that referenced this pull request Feb 21, 2019
@shindere

This comment has been minimized.

Copy link
Contributor

commented Feb 21, 2019

@timbertson-zd

This comment has been minimized.

Copy link

commented Feb 21, 2019

Thanks for your help @shindere and @diml! :)

I think we do not want it in 4.08, do we?

I don't think so, it's just that the initial PR snuck into 4.08 so I put the Changes entry there the first time around.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.