Resolv fails under load with SocketError: bind: name or service not known #3659

Closed
jsvd opened this Issue Feb 12, 2016 · 17 comments

jsvd commented Feb 12, 2016

I set up dnsmasq on my Mac and ran the following script:

require 'resolv'
Resolv::DNS.open(:nameserver => "127.0.0.1") do |dns|
  10000.times.each do |i|
    begin
      dns.getaddress("server-a.my.lan")
      print "#{i} " if (i % 2000 == 0)
    rescue => e
      puts i
      raise
    end
  end
end

% rvm use 1.9.3
Using /Users/joaoduarte/.rvm/gems/ruby-1.9.3-p551
% ruby dns.rb
0 2000 4000 6000 8000
% rvm use 2.2.1
Using /Users/joaoduarte/.rvm/gems/ruby-2.2.1
% ruby dns.rb
0 2000 4000 6000 8000
% rvm use jruby-1.7.23
Using /Users/joaoduarte/.rvm/gems/jruby-1.7.23
% ruby dns.rb
0 136
SocketError: bind: name or service not known
                bind at org/jruby/ext/socket/RubyUDPSocket.java:160
    bind_random_port at /Users/joaoduarte/.rvm/rubies/jruby-1.7.23/lib/ruby/1.9/resolv.rb:638
          initialize at /Users/joaoduarte/.rvm/rubies/jruby-1.7.23/lib/ruby/1.9/resolv.rb:777
  make_udp_requester at /Users/joaoduarte/.rvm/rubies/jruby-1.7.23/lib/ruby/1.9/resolv.rb:543
       each_resource at /Users/joaoduarte/.rvm/rubies/jruby-1.7.23/lib/ruby/1.9/resolv.rb:500
        each_address at /Users/joaoduarte/.rvm/rubies/jruby-1.7.23/lib/ruby/1.9/resolv.rb:396
          getaddress at /Users/joaoduarte/.rvm/rubies/jruby-1.7.23/lib/ruby/1.9/resolv.rb:372
              (root) at dns.rb:5
               times at org/jruby/RubyFixnum.java:280
                each at org/jruby/RubyEnumerator.java:274
              (root) at dns.rb:3
                open at /Users/joaoduarte/.rvm/rubies/jruby-1.7.23/lib/ruby/1.9/resolv.rb:307
              (root) at dns.rb:2

jsvd commented Feb 12, 2016

For completeness' sake:

% rvm use jruby-9.0.5.0
Using /Users/joaoduarte/.rvm/gems/jruby-9.0.5.0
% ruby dns.rb
0 109
SocketError: bind: name or service not known
                bind at org/jruby/ext/socket/RubyUDPSocket.java:167
    bind_random_port at /Users/joaoduarte/.rvm/rubies/jruby-9.0.5.0/lib/ruby/stdlib/resolv.rb:658
          initialize at /Users/joaoduarte/.rvm/rubies/jruby-9.0.5.0/lib/ruby/stdlib/resolv.rb:809
  make_udp_requester at /Users/joaoduarte/.rvm/rubies/jruby-9.0.5.0/lib/ruby/stdlib/resolv.rb:563
      fetch_resource at /Users/joaoduarte/.rvm/rubies/jruby-9.0.5.0/lib/ruby/stdlib/resolv.rb:520
       each_resource at /Users/joaoduarte/.rvm/rubies/jruby-9.0.5.0/lib/ruby/stdlib/resolv.rb:513
        each_address at /Users/joaoduarte/.rvm/rubies/jruby-9.0.5.0/lib/ruby/stdlib/resolv.rb:410
          getaddress at /Users/joaoduarte/.rvm/rubies/jruby-9.0.5.0/lib/ruby/stdlib/resolv.rb:386
     block in dns.rb at dns.rb:5
               times at org/jruby/RubyFixnum.java:302
                each at org/jruby/RubyEnumerator.java:293
     block in dns.rb at dns.rb:3
                open at /Users/joaoduarte/.rvm/rubies/jruby-9.0.5.0/lib/ruby/stdlib/resolv.rb:306
               <top> at dns.rb:2

jsvd commented Feb 12, 2016

After a bit more investigation, I see the problem: in resolv, when a bind is attempted on a random port, the rescue block expects an exception from the Errno family:

    def self.bind_random_port(udpsock, bind_host="0.0.0.0") # :nodoc:
      begin
        port = rangerand(1024..65535)
        udpsock.bind(bind_host, port)
      rescue Errno::EADDRINUSE, # POSIX
             Errno::EACCES, # SunOS: See PRIV_SYS_NFS in privileges(5)
             Errno::EPERM # FreeBSD: security.mac.portacl.port_high is configurable.  See mac_portacl(4).
        retry
      end
    end

So I created a small script to test which exception is raised if the port is already in use:

$ cat udp.rb
require 'socket'
u1 = UDPSocket.new
u1.bind("127.0.0.1", 53333)
u2 = UDPSocket.new
begin
  u2.bind("127.0.0.1", 53333)
rescue => e
  puts e.class
  puts e.message
  puts e.class.ancestors.inspect
end

So MRI Ruby 2.2.1 behaves as expected:

$ rvm use 2.2.1
Using /Users/joaoduarte/.rvm/gems/ruby-2.2.1
$ ruby udp.rb
Errno::EADDRINUSE
Address already in use - bind(2) for "127.0.0.1" port 53333
[Errno::EADDRINUSE, SystemCallError, StandardError, Exception, Object, Kernel, BasicObject]

But JRuby raises a SocketError instead:

$ rvm use jruby-1.7.23
Using /Users/joaoduarte/.rvm/gems/jruby-1.7.23
$ ruby udp.rb
SocketError
bind: name or service not known
[SocketError, StandardError, Exception, Object, Kernel, BasicObject]

Member

enebo commented Feb 12, 2016

@jsvd I do not have time to look at this today, but I will add some extra info. I can see that this behavior of expecting EADDRINUSE goes all the way back to 1.8.7, so this has been broken for a long time. FWIW, we get back pretty generic error messages from Java's net layer. It looks like we should be examining the message string (sad but likely true) to figure out whether we should be raising EADDRINUSE.
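To illustrate the idea of examining the message string, here is a minimal sketch in Ruby. It is not JRuby's actual code (that lives in the Java socket layer). The helper name `bind_error_for` is hypothetical, and the message fragments it matches are assumptions about how the underlying layer words its errors.

```ruby
require 'socket'

# Hypothetical helper: pick an exception class from a low-level error message.
# The patterns are assumptions about the wording of the underlying message.
def bind_error_for(message)
  case message
  when /address already in use/i then Errno::EADDRINUSE
  when /permission denied/i      then Errno::EACCES
  else SocketError # fall back to the generic error when nothing matches
  end
end

puts bind_error_for("Address already in use")
puts bind_error_for("something unexpected")
```

String matching like this is fragile, because the wording can differ between platforms and JVM versions, which is presumably why it is described as "sad but likely true".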

jsvd commented Feb 12, 2016

Thanks for the feedback @enebo

Member

headius commented Feb 13, 2016

Are you able to reproduce this on 9k?

I'm having trouble getting your script to fail the same way. I get ResolvError generally if there's a DNS server at the target address. I'm on OS X.

Member

headius commented Feb 13, 2016

Even though I can't reproduce it, if you can give me the SocketError trace while passing -Xbacktrace.style=full I should be able to track down where it is coming from and fix it.

jsvd commented Feb 13, 2016

The problem comes from what @enebo said: binding to a port that is already in use raises a SocketError instead of EADDRINUSE.

Now, the reason why resolv fails in the first place is that getresource calls DNS.bind_random_port (http://www.rubydoc.info/stdlib/resolv/Resolv%2FDNS.bind_random_port) for each request. The method chooses a random port from the non-privileged range. If another application has a UDP socket open, it's only a matter of time until the random pick hits that used port.

Since it's raising "the wrong exception", it bubbles up instead of being retried (normal behaviour).
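The difference can be sketched without any real network traffic. This is only an illustration: `FakeSocket` and `bind_with_retry` are hypothetical names, and the method merely mirrors the rescue/retry shape of bind_random_port quoted earlier in this thread.

```ruby
require 'socket'

# Hypothetical stand-in for UDPSocket: the first `failures` calls to bind
# raise the given exception class, later calls succeed.
class FakeSocket
  attr_reader :attempts

  def initialize(error_class, failures)
    @error_class = error_class
    @failures = failures
    @attempts = 0
  end

  def bind(_host, _port)
    @attempts += 1
    raise @error_class, "simulated bind failure" if @attempts <= @failures
    0
  end
end

# Same rescue list as resolv.rb: only Errno exceptions trigger a retry.
def bind_with_retry(sock, host = "0.0.0.0")
  sock.bind(host, rand(1024..65535))
rescue Errno::EADDRINUSE, Errno::EACCES, Errno::EPERM
  retry
end

# MRI-like behaviour: the Errno exception is rescued and the bind is retried.
mri_like = FakeSocket.new(Errno::EADDRINUSE, 3)
bind_with_retry(mri_like)
puts "Errno::EADDRINUSE: bound after #{mri_like.attempts} attempts"

# JRuby-like behaviour: SocketError is not in the rescue list, so it escapes.
jruby_like = FakeSocket.new(SocketError, 3)
begin
  bind_with_retry(jruby_like)
rescue SocketError => e
  puts "SocketError escaped after #{jruby_like.attempts} attempt: #{e.message}"
end
```

With the Errno exception the loop keeps picking new ports until one works; with SocketError the very first collision propagates to the caller, which matches the traces above.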

Member

headius commented Feb 14, 2016

@jsvd Confirmed! I believe this will be fixed by the ruby-2.3+socket branch (still in progress), which aligns our socket implementations much more closely with CRuby. That should get merged in any day now and be in 9.1.

@headius headius added this to the JRuby 9.1.0.0 milestone Feb 14, 2016

@headius headius modified the milestones: JRuby 9.1.1.0, JRuby 9.1.0.0 Apr 20, 2016

Member

headius commented Apr 20, 2016

The socket branch will not make it into 9.1. Bumping.

@headius headius closed this Apr 20, 2016

@headius headius reopened this Apr 20, 2016

Contributor

nbarrientos commented Apr 27, 2016

It's likely that we're hitting this issue too on JRuby 1.7.20.1, which is the JRuby version embedded in puppet-server 1.1.3 :( Is there a way to work around it? Quite a few things rely on Resolv over here :/

Member

headius commented May 2, 2016

@nbarrientos It's unlikely we'll be putting a lot of effort into the Socket subsystem on JRuby 1.7, so your best bet would be to try JRuby 9.1 when it comes out. We may not have it fixed, but we'll be closer, and we'll work to get it fixed for a 9.1 update.

Member

headius commented May 2, 2016

I have a fix for this we could include in 9.1: https://gist.github.com/5b101a92d8c2a3d4cb414140f882ebc5

It's up to @enebo if it's too risky the day of the release :-)

headius added a commit that referenced this issue May 2, 2016

Member

headius commented May 2, 2016

I've incorporated a localized fix for this issue into 9.1 and we can call this fixed. There's another bug outstanding for the socket rework that still needs to be done.

@headius headius modified the milestones: JRuby 9.1.0.0, JRuby 9.1.1.0 May 2, 2016

@headius headius closed this May 2, 2016

headius added a commit that referenced this issue May 2, 2016

Member

headius commented May 2, 2016

@nbarrientos I have made the same fix for 1.7, so it will be in 1.7.26 whenever we release that. In the short term your only workaround would be to modify resolv.rb so it also rescues SocketError around bind.
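For anyone who would rather not edit the installed resolv.rb, one possible shape of that workaround is to reopen Resolv::DNS at load time. This is a sketch under stated assumptions, not the fix that went into JRuby. It assumes the bind_random_port signature quoted earlier in this thread, uses `rand` in place of the stdlib's own random helper, and adds a retry cap (`MAX_BIND_ATTEMPTS`, a hypothetical constant) so that a permanent SocketError cannot loop forever. `FlakySocket` is a hypothetical stub used only to exercise the patch.

```ruby
require 'resolv'
require 'socket'

class Resolv
  class DNS
    # Hypothetical cap, not part of the stdlib.
    MAX_BIND_ATTEMPTS = 100

    # Replacement that also retries on SocketError, as JRuby raises it here.
    def self.bind_random_port(udpsock, bind_host = "0.0.0.0")
      attempts = 0
      begin
        attempts += 1
        udpsock.bind(bind_host, rand(1024..65535))
      rescue Errno::EADDRINUSE, Errno::EACCES, Errno::EPERM, SocketError
        retry if attempts < MAX_BIND_ATTEMPTS
        raise # give up and surface the last error
      end
    end
  end
end

# Stub socket mimicking the JRuby behaviour: the first two binds raise
# SocketError, the third succeeds.
class FlakySocket
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def bind(_host, _port)
    @calls += 1
    raise SocketError, "bind: name or service not known" if @calls < 3
    0
  end
end

sock = FlakySocket.new
Resolv::DNS.bind_random_port(sock)
puts "bound after #{sock.calls} attempts"
```

Rescuing SocketError is broader than rescuing EADDRINUSE, since it also covers genuine failures such as an unresolvable bind host, so the cap is there to keep those from retrying indefinitely. The patch should be removed once the runtime is upgraded to a release that contains the fix.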

Contributor

nbarrientos commented May 10, 2016

Thanks.

Contributor

perlun commented Aug 15, 2016

@headius - any chance we could get a 1.7.26 to get this fix incorporated? It would be Very Nice indeed. 😇

Contributor

perlun commented Sep 8, 2016

@headius - just to make things extremely clear, was this included in 1.7.26? I think I failed to find it when I skimmed through the release notes.
