Enable usage of experimental inet_backend option for TCP listeners #297

Maria-12648430
Contributor

In OTP/23, there is an experimental socket backend for inet/gen_tcp. It can be activated by the undocumented {inet_backend, socket} option for gen_tcp:listen.

If present, it has to be the first option in the list. As ranch_tcp:listen modifies the options list given to it, it was not possible to activate the socket backend by specifying {inet_backend, socket} in the socket_opts parameter given to ranch:start_listener. This PR aims to change this behavior, ensuring that this option remains in the first position if present.
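The ordering constraint can be illustrated with a minimal sketch. It is not the code from this PR: the module name, the function name, and the default options it prepends are assumptions made up for illustration. It only shows the idea that a leading `{inet_backend, _}` tuple stays at the head while other options are added behind it.

```erlang
-module(backend_opts_sketch).
-export([add_defaults/1]).

%% Hypothetical helper, not part of Ranch.
%% If the list starts with {inet_backend, _}, keep that tuple at the head
%% and add the defaults after it; otherwise just prepend the defaults.
add_defaults([Backend = {inet_backend, _} | Opts]) ->
    [Backend | add_defaults(Opts)];
add_defaults(Opts) ->
    %% The defaults below are placeholders chosen for this sketch.
    [binary, {active, false} | Opts].
```

With this sketch, `add_defaults([{inet_backend, socket}, {port, 0}])` keeps `{inet_backend, socket}` as the first element, which is what `gen_tcp:listen` is described as requiring above.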

As this feature is experimental and undocumented in OTP, I did not add it to the documentation and specs in ranch. After all, it is not meant for widespread adoption by users, but just to enable you (@essen, @juhlig, others?) to experiment with it.

@juhlig
Contributor

juhlig commented Jun 18, 2020

@Maria-12648430 the test you supplied fails on MacOS. It looks like the failure is due to a problem with the socket backend itself on MacOS, not with your changes to ranch. Starting a listener with the socket backend seems to work, connecting seems to work. The error occurs later, apparently (as far as I can tell from the error report) when the server receives something from the client. I retried the test on the MacOS environment (which is the most I can do, I don't have MacOS on my hands to play with), but the error persists, so it is not an intermittent failure and should probably be reported at https://bugs.erlang.org/.

Anyway, as I see it, the purpose of this PR is to enable us and others to activate the socket backend in the first place, not to guarantee that using it works. It is experimental after all, and making it work is OTP's responsibility.

With that in mind, as I see it, the following approaches are possible:

  • no test at all; the feature itself is experimental, in turn making it experimental in ranch (for that reason, I agree that it should not be documented). In any case, it should not be used in any real world projects
  • keep the test, just skip it on MacOS
  • keep the test, even for MacOS, but reduce it to only start a listener with the socket backend (which seems to work in all tested environments), but do not send/receive anything; the test can be extended later once the socket backend has made it to some official state
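For illustration, the second approach (skipping on MacOS) could look roughly like the sketch below. The module name, function name, and group name are hypothetical and not taken from the ranch test suites. The sketch only relies on `os:type/0` reporting `{unix, darwin}` on MacOS and on Common Test accepting a `{skip, Reason}` return value from `init_per_group/2`.

```erlang
-module(skip_macos_sketch).
-export([maybe_skip/2]).

%% Hypothetical helper: return a Common Test skip tuple on MacOS,
%% otherwise pass the Config through unchanged.
maybe_skip({unix, darwin}, _Config) ->
    {skip, "socket backend fails on MacOS"};
maybe_skip(_OsType, Config) ->
    Config.

%% Intended use inside a suite (the group name is an assumption):
%%   init_per_group(socket_backend, Config) ->
%%       skip_macos_sketch:maybe_skip(os:type(), Config).
```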

@essen what do you think? My tendency is slightly in favor of the first option, no test =^^=

@essen
Member

essen commented Jun 18, 2020

Opened https://bugs.erlang.org/browse/ERL-1284

Yes we need a test and this test is fine, OTP just needs a small fix. I have one comment about the test though.

test/acceptor_SUITE.erl (outdated, resolved)
Maria-12648430 force-pushed the enable_usage_of_inet_backend_option branch from df68672 to eab9859 on June 18, 2020 11:03
@Maria-12648430
Contributor Author

> I don't have MacOS on my hands to play with

Neither do I 😜

@essen thanks for opening the ticket at bugs.erlang.org. I didn't know they would accept just the error report; I thought they would require a minimal program to demonstrate the problem.

@juhlig
Contributor

juhlig commented Jun 18, 2020

Oh, I just noticed something. In ranch_acceptors_sup, when num_listen_sockets is >1 and the port is either absent or set to 0, the port of the first connected socket is inserted at the head of the socket options list of the other sockets... This then defeats what you did with your changes, as {inet_backend, _} is then not the first option any more, even though the user specified them that way.

If you want to look into that, to prevent you from going on a wild goose chase, I'll mention that multiple listen sockets don't seem to work right now anyway, because of reasons in the socket backend that I'll look into more deeply tomorrow.

@Maria-12648430
Contributor Author

Thanks for the heads-up :)

Maria-12648430 force-pushed the enable_usage_of_inet_backend_option branch from eab9859 to 018b1bb on June 19, 2020 08:20
@Maria-12648430
Contributor Author

I changed how the port is set for subsequent listening sockets in ranch_acceptors_sup: it now uses lists:keystore. This solves the problem pointed out by @juhlig. An unconditional lists:keystore is also simpler than the previous approach (lists:keyfind(port, ...), then, if the value differs from Port, lists:keydelete(port, ...) and prepend {port, Port}). The result is the same, except that the port tuple is replaced in place (possibly by the same value), or appended if not present, instead of being prepended.
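A minimal sketch of the `lists:keystore` approach described above. The module and function names are made up for illustration and this is not the actual ranch_acceptors_sup code; it only shows that `lists:keystore/4` replaces an existing `{port, _}` tuple in place or appends one, so a leading `{inet_backend, _}` option is never displaced.

```erlang
-module(port_opts_sketch).
-export([set_port/2]).

%% Hypothetical helper, not taken from ranch_acceptors_sup.
%% Replaces the first {port, _} tuple in place, or appends {port, Port}
%% at the end if the list has no port option. Non-tuple options such as
%% the atom 'binary' are left untouched.
set_port(Port, SocketOpts) ->
    lists:keystore(port, 1, SocketOpts, {port, Port}).
```

For example, `set_port(40000, [{inet_backend, socket}, {port, 0}])` yields `[{inet_backend, socket}, {port, 40000}]`, and `set_port(40000, [{inet_backend, socket}, binary])` yields `[{inet_backend, socket}, binary, {port, 40000}]`.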

@juhlig
Contributor

juhlig commented Jun 19, 2020

Nice :)

@Maria-12648430
Contributor Author

Thanks 😁

Btw, I noticed a loophole there, unrelated to inet_backend.

It is legal to give an option multiple times in the option list; gen_tcp:listen (and ssl, for that matter) will not complain but will silently take the last one. If multiple listen sockets are used and the port option was given multiple times with the last value being 0, all the sockets will end up listening on different random ports.

This behavior is undocumented in OTP, probably just circumstantial, and an absolutely pointless thing to do deliberately. But it is a possibility for an unintentional misconfiguration of ranch.

@juhlig
Contributor

juhlig commented Jun 19, 2020

@essen
Member

essen commented Jun 19, 2020

About a potential issue please open a separate PR with a test case but it's not really a priority to fix.

@Maria-12648430
Contributor Author

Maybe next week 😉 Enjoy your weekend.

@essen
Member

essen commented Jun 22, 2020

OK so we were almost ready to release 2.0 when this dropped. Due to the problem on macOS we have chosen to merge this after releasing 2.0. We want the PR to work on macOS first. I think the patch to Erlang/OTP should just add the missing case clause, based on a quick reading of the code. I am not sure why the code singles out {error,closed} and crashes on other errors though.

Thanks for the PR! And sorry this won't make it into 2.0. Feel free to send a PR to OTP. I would like to eventually test Cowboy with the inet_backend option set to help OTP get this backend ready so I will get to it eventually if nobody does it first, but probably not before the end of summer.

@Maria-12648430
Contributor Author

Maria-12648430 commented Jun 22, 2020

I was mistaken about the issue with multiple listen sockets and repeated port options. Sort of, anyway: While the possibility for the described scenario does exist in ranch_acceptors_sup, a list with repeated options does not get there because ranch:filter_options drops all but the last one before that.
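To make "drops all but the last one" concrete, here is a small sketch of that kind of filtering. It is not the implementation of ranch:filter_options; the module and function names are hypothetical, and it treats the first tuple element (or the bare atom) as the option key.

```erlang
-module(dedup_opts_sketch).
-export([keep_last/1]).

%% Hypothetical helper: for each option key keep only the last occurrence,
%% preserving the relative order of the surviving options.
keep_last(Opts) ->
    lists:reverse(keep_first(lists:reverse(Opts), [])).

%% Walk the reversed list and keep the first occurrence of each key.
keep_first([], _Seen) ->
    [];
keep_first([Opt | Rest], Seen) ->
    Key = key(Opt),
    case lists:member(Key, Seen) of
        true -> keep_first(Rest, Seen);
        false -> [Opt | keep_first(Rest, [Key | Seen])]
    end.

%% Tuples are keyed by their first element, anything else by itself.
key(Opt) when is_tuple(Opt), tuple_size(Opt) > 0 -> element(1, Opt);
key(Opt) -> Opt.
```

With the repeated port option from the scenario above, `keep_last([{port, 8080}, binary, {port, 0}])` returns `[binary, {port, 0}]`, so only a single port tuple is left by the time the options reach later code.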

@Maria-12648430
Contributor Author

> Thanks for the PR!

You're very welcome 😄

> sorry this won't make it into 2.0

No problem at all.

> Feel free to send a PR to OTP.

I'm pretty sure that this is way out of my league 😅 Yet 😁

@essen
Member

essen commented Jun 22, 2020

It's probably an easy one, although time consuming without direct access to a macOS environment. First produce a test case that reproduces the problem without Ranch (you can take the test you wrote and strip Ranch bit by bit). Then either report the test case to the OTP ticket or fix the OTP code. The fix is probably to update socket_cancel/2 here: https://github.com/erlang/otp/blob/master/lib/kernel/src/gen_tcp_socket.erl#L487 and add a clause {error, einval} -> ok.

Anyway without a macOS environment I wouldn't worry too much about it. 15 minutes feedback loops are not fun. :-)

@Maria-12648430
Contributor Author

> 15 minutes feedback loops are not fun. :-)

Not exactly a huge load of it anyway 😅 But I have some free time on my hands right now, so I'll at least have a look.

@Maria-12648430
Contributor Author

I moved most of the tests with the socket backend into their own test group and removed the now obsolete single tcp_socket_echo test. Two tests are commented out (until ERL-1287 is solved).

I also tried the same in the sendfile suite, but it looks like file:sendfile does not work with the socket backend yet.

@essen
Member

essen commented Jun 26, 2020

> I also tried the same in the sendfile suite, but it looks like file:sendfile does not work with the socket backend yet.

Huh, I wonder why. Do you get errors?

@Maria-12648430
Contributor Author

Just {error, badarg}, and this then causes a badmatch in ranch_tcp:sendfile etc.

@Maria-12648430
Contributor Author

Minimized example:

1> {ok, S}=gen_tcp:listen(8888, [{inet_backend, socket}, {packet, raw}, binary]).
{ok,{'$inet',gen_tcp_socket,
{<0.82.0>,{'$socket',#Ref<0.511248149.1624113153.72213>}}}}
2> {ok, C}=gen_tcp:accept(S).
{ok,{'$inet',gen_tcp_socket,
{<0.84.0>,{'$socket',#Ref<0.511248149.1624113153.72233>}}}}
3> file:sendfile("./test.txt", C).
{error,badarg}

@Maria-12648430
Contributor Author

Maria-12648430 force-pushed the enable_usage_of_inet_backend_option branch from 67370b8 to 85763d9 on June 26, 2020 14:33
@Maria-12648430
Contributor Author

The last commit adds socket backend tests for sendfile. They are failing right now, but should start passing when OTP/23.1 arrives (https://bugs.erlang.org/browse/ERL-1293).

test/acceptor_SUITE.erl (outdated, resolved)
Maria-12648430 force-pushed the enable_usage_of_inet_backend_option branch from 4d3df8c to 7b18793 on August 24, 2020 15:29
Maria-12648430 force-pushed the enable_usage_of_inet_backend_option branch 2 times, most recently from 35416e6 to 666a8ec on November 6, 2020 09:24
@Maria-12648430
Contributor Author

> Alpine is probably a flaky test.

Hm, can't imagine what could be flaky here... it happens on different lines on different runs, the one that caused the last failure was:

"true\n" = do_exec_log(Rel ++ " eval "
                        "'maps:is_key(metrics, ranch:info(" ++ ExampleStr ++ "))'"),

and do_exec_log is

do_exec_log(Cmd) ->
        ct:log("Command: ~s~n", [Cmd]),
        Out=os:cmd(Cmd),
        ct:log("Output:~n~n~s~n", [Out]),
        Out.

> Windows could be the return of https://bugs.erlang.org/browse/ERL-938 except this time it's platform specific. It's not related to your branch, see for example this from more than one month ago: https://builds.ninenines.eu/logs/ranch/226/win10/
>
> It wasn't detected earlier because 23.0 was not actually installed on the Windows environment until 23.1 was released. So the tests you remember passing probably were not running against 23.0. I don't think you'll make them pass if you retry forever, there's probably a legitimate issue there to investigate.

Yes, we were just trying to find out whether the failure is intermittent or reliable. As it worked on 23.0 once (but not always), it is intermittent.

As you already opened ERL-938 before and probably have more insight into this issue, could you nudge it at bugs.erlang.org again? ;)

@essen
Member

essen commented Nov 6, 2020

The question is whether there is a test that doesn't fail after https://builds.ninenines.eu/logs/ranch/226/win10/ because before that, they weren't really running against OTP 23.0 because it wasn't installed (even if the logs said 23.0 that wasn't the case it was running against the version in the PATH). I don't think it's intermittent.

@Maria-12648430
Contributor Author

Maria-12648430 commented Nov 6, 2020

> The question is whether there is a test that doesn't fail after https://builds.ninenines.eu/logs/ranch/226/win10/ because before that, they weren't really running against OTP 23.0 because it wasn't installed (even if the logs said 23.0 that wasn't the case it was running against the version in the PATH). I don't think it's intermittent.

If it helps, look at https://buildkite.com/ninenines/ranch-prs/builds/325, there are 3 runs for Windows. The first and last one failed on both 23.0 and 23.1, but the second one passed on 23.0 and only failed on 23.1. Doesn't seem to happen very often, though, so it looks like there definitely is something wrong starting with 23.0. Actually, we may better call it "intermittent success" instead of "intermittent failure" ;)

@essen
Member

essen commented Nov 6, 2020

The run you point to fails in both 23.0 and 23.1 though? https://builds.ninenines.eu/logs/ranch-prs/325/win10/ There's 8 failures on both 23.x in all 3 runs.

@Maria-12648430
Contributor Author

Maria-12648430 commented Nov 6, 2020

No, look at the output in the "Log" tab, not the logs.html in "Artifacts", I think that gets overwritten if you do a retry.

@essen
Member

essen commented Nov 6, 2020

Oh right it gets overwritten. I need to improve that. Fair enough then, it doesn't always happen. But very frequently. Would be great to isolate a short snippet to reproduce the issue.

@essen
Member

essen commented Nov 6, 2020

I'll try restarting the CI machines as well just in case it's just a lack of resources due to some unrelated process. But you'll have to wait a bit for that to happen.

@Maria-12648430
Contributor Author

No sweat ;)

@juhlig
Contributor

juhlig commented Nov 23, 2020

Heh... so it looks like https://bugs.erlang.org/browse/ERL-960 never got really fixed, and only popped up in ranch again because the major version changed in OTP/23...

@juhlig
Contributor

juhlig commented Nov 23, 2020

Or rather, it seems to be related to https://bugs.erlang.org/browse/ERL-960, not the same. We couldn't reproduce it locally with your z module, or other... But still, with internal_active_n set to 1, the ssl tests pass on Windows.
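For readers who want to try the same workaround: `internal_active_n` is an application environment parameter of the ssl application. The comment above does not say how it was set, so the following is only one possible way, shown as a sys.config fragment.

```erlang
%% sys.config fragment (a sketch; how the value was set in the test
%% runs mentioned above is not stated in this thread).
[
    {ssl, [
        %% Make ssl handle its internal socket like {active, once}
        %% (N = 1) instead of the default active-N behaviour.
        {internal_active_n, 1}
    ]}
].
```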

@essen
Member

essen commented Nov 23, 2020

OK there's probably a new issue then.

@essen
Member

essen commented Nov 26, 2020

@essen
Member

essen commented Apr 17, 2021

I forget, can we merge this now?

@Maria-12648430
Contributor Author

Oh, I completely forgot this... ^^;
You may want to run the tests again, don't know if anything changed since it was last tested. If it passes, it can be merged.

Maria-12648430 force-pushed the enable_usage_of_inet_backend_option branch from 1c9656d to fca1334 on April 20, 2021 08:00
@Maria-12648430
Contributor Author

@essen so, how about merging this? :)

tcp_getopts_capability,
tcp_getstat_capability,
tcp_upgrade,
%% @TODO: Enable when https://bugs.erlang.org/browse/ERL-1287 is fixed
Member

This has been fixed so I guess the tests can be enabled again.

Member

Contributor Author

~~Yes~~ No... If I run them on OTP 24, they fail. I think I read somewhere (can't remember where) that while it was fixed in 23.something, it re-occurred in 24. I'm investigating, I don't have the needed version of 23 in kerl right now, it's currently building.

Contributor Author

On top of that, with 24 tcp_error_eacces fails with an obscure error message, but this one was definitely working before. Investigating that one, too, but it looks like this will need another report and fixing in OTP.

Contributor

Let me give you a hand? =)

I ran them on 23.3 for you, and while tcp_10_acceptors_10_listen_sockets ultimately fails, it is for another reason, starting the listener actually works there. With 24, I can confirm that it doesn't start.

Contributor Author

@juhlig thanks 🥰 I'll collect the bits and pieces later and open a ticket at OTP.

Well... so I guess we can't merge this just yet, or if we do, only with the 2 tests plus the eacces one disabled... @essen?

Member

Right they are related to that raw issue, OK, should have checked more thoroughly... So we can leave those disabled until an OTP release fixes the issue and then enable them for 24.x+.

I think we can merge with the 2 tests + eacces disabled for now. But we should try them against current OTP master before opening any ticket there. Want to do that?

Contributor Author

> Want to do that?

Can do, but not today :)

Contributor Author

Oh, master build was faster than expected.

Ok, so on a quick glance, it all seems to work (except for what Jan pointed out, a test failing for other reasons than the socket backend) when run on current master, even the eacces test.
So... I'll need to fix that test (needs some thinking, but I'll manage. Tomorrow. For real this time 😅), and then we can enable them when 25 is out, or 24.1 if it's already fixed in there.

Contributor Author

Ok, the 3 tests are disabled with a todo note to re-enable them when fixed (probably 24.1), and the one test that wasn't working has been fixed (this post listen callback thing came at just the right time for this :)), but it is among the disabled ones.

test/acceptor_SUITE.erl (outdated, resolved)
essen added this to the 2.1.0 milestone on Sep 1, 2021
@essen
Member

essen commented Sep 3, 2021

OK let's see what tests are saying and then merge if all good. Will review at the same time.

@essen
Member

essen commented Sep 3, 2021

By the way will we need to do the same for ssl? I think we can change the backend since OTP 24.

@Maria-12648430
Contributor Author

> By the way will we need to do the same for ssl? I think we can change the backend since OTP 24.

Hm, I don't know, will have to check. Last thing I remember was Ingela saying they won't use it in ssl before it is stable, but I'll try.

@Maria-12648430
Contributor Author

No, doesn't work, not even on current master. And see erlang/otp#4234

@essen
Member

essen commented Sep 3, 2021

Yes but I have seen some bug reports around that. Guess they're for clients though.

@essen
Member

essen commented Sep 6, 2021

Merged, thanks!

essen closed this on Sep 6, 2021