Currently the reconnect logic will only retry once if EM.reconnect throw... #63

Merged
merged 1 commit into from Jan 18, 2014

2 participants
Contributor

corlettb commented Jan 7, 2014

...s an exception. EM.reconnect can throw an exception in the case of a failed DNS lookup for example.

It looks as though, in eventmachine, if a block passed to EM.add_periodic_timer throws an exception, no further calls to that block will be made. Presumably the timer reschedules itself after executing the callback, and that rescheduling never happens when an exception is raised. Arguably that's a bug in eventmachine, maybe?

Here is a simple program to demonstrate that (You'll only get one exception, not one every 5 seconds):

require 'eventmachine'

def throw_exception
  raise "Issues and problems"
end

def puts_running
  puts "Running...."
end

EM.run {
  EM.add_periodic_timer(1) { puts_running }
  EM.add_periodic_timer(5) { throw_exception }

  EM.error_handler do |e|
    puts "Eventmachine problem, #{e}"
    puts e.backtrace.join("\n")
  end
}


It looks as though EM.connect has two ways of reporting errors: the first is an exception (e.g. a bad DNS lookup), and the second is via callbacks.

This change simply handles any exception created by EM.reconnect so further retries are attempted.

This also adds a test case for the issue, which simulates a DNS failure to trigger an exception.
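
To illustrate the idea without eventmachine installed, here is a minimal pure-Ruby sketch. It is not the code in this PR and not eventmachine's implementation: `ToyPeriodicTimer` and `FlakyConnection` are hypothetical names, and the timer only models the presumed "reschedule after the callback returns" behaviour described above.

```ruby
# Toy stand-in for a periodic timer. ASSUMPTION: it reschedules itself only
# after the callback returns normally, as guessed at in the description.
class ToyPeriodicTimer
  attr_reader :fired

  def initialize(&callback)
    @callback = callback
    @scheduled = true
    @fired = 0
  end

  # Simulate one timer tick; returns the exception if the callback raised.
  def tick
    return nil unless @scheduled
    @scheduled = false   # the pending firing is consumed
    @fired += 1
    @callback.call
    @scheduled = true    # never reached if the callback raises
    nil
  rescue StandardError => e
    e                    # plays the role of an error handler
  end
end

# Hypothetical connection whose reconnect fails like a DNS lookup a few times.
class FlakyConnection
  attr_reader :attempts, :connected

  def initialize(failures)
    @failures = failures
    @attempts = 0
    @connected = false
  end

  def reconnect
    @attempts += 1
    raise "unable to resolve server address" if @attempts <= @failures
    @connected = true
  end
end

# Without a rescue: the first exception stops all further attempts.
unguarded = FlakyConnection.new(2)
timer = ToyPeriodicTimer.new { unguarded.reconnect }
5.times { timer.tick }

# With a rescue inside the callback: the timer keeps firing until success.
guarded = FlakyConnection.new(2)
errors = []
safe_timer = ToyPeriodicTimer.new do
  begin
    guarded.reconnect unless guarded.connected
  rescue StandardError => e
    errors << e.message  # record the failure and wait for the next tick
  end
end
5.times { safe_timer.tick }

puts "unguarded: attempts=#{unguarded.attempts} connected=#{unguarded.connected}"
puts "guarded:   attempts=#{guarded.attempts} connected=#{guarded.connected}"
```

In this model the unguarded connection makes a single attempt and never connects, while the guarded one connects on its third attempt after two recorded failures.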

Contributor

corlettb commented Jan 7, 2014

For clarification, we are running an old version of NATS in production (0.4.22-beta.4), although this bit of code looks unchanged in the latest version.

We recently had an issue in production where certain VMs lost their networking (unrelated to NATS; it was down to a driver issue with VMware).

To fix the networking, the VMware team vMotioned the VMs onto another physical box. This did fix the networking.

Unfortunately the NATS client did not reconnect. As a result, the Cloud Foundry router nodes were re-added to the load balancer without a good routing table, and a number of user requests failed.

Looking in the log file, I found:

Eventmachine problem, unable to resolve server address
./vendor/bundle/ruby/1.9.1/gems/eventmachine-0.12.11.cloudfoundry.3/lib/eventmachine.rb:822:in `connect_server'
./vendor/bundle/ruby/1.9.1/gems/eventmachine-0.12.11.cloudfoundry.3/lib/eventmachine.rb:822:in `reconnect'
./vendor/bundle/ruby/1.9.1/gems/nats-0.4.22.beta.4/lib/nats/client.rb:553:in `attempt_reconnect'
./vendor/bundle/ruby/1.9.1/gems/nats-0.4.22.beta.4/lib/nats/client.rb:524:in `block in schedule_reconnect'
./vendor/bundle/ruby/1.9.1/gems/eventmachine-0.12.11.cloudfoundry.3/lib/em/timers.rb:51:in `call'
./vendor/bundle/ruby/1.9.1/gems/eventmachine-0.12.11.cloudfoundry.3/lib/em/timers.rb:51:in `fire'
./vendor/bundle/ruby/1.9.1/gems/eventmachine-0.12.11.cloudfoundry.3/lib/eventmachine.rb:256:in `call'
./vendor/bundle/ruby/1.9.1/gems/eventmachine-0.12.11.cloudfoundry.3/lib/eventmachine.rb:256:in `run_machine'
./vendor/bundle/ruby/1.9.1/gems/eventmachine-0.12.11.cloudfoundry.3/lib/eventmachine.rb:256:in `run'
./lib/router.rb:67:in `<top (required)>'
./bin/router:6:in `require'
./bin/router:6:in `<main>'

It would seem that the first reconnection attempt triggered this exception and no further attempts were made. The machine was evidently unable to reach the DNS server as well as the NATS server while the network was down.

Owner

derekcollison commented Jan 7, 2014

Thanks, will take a look. Also need to check why tests are not running properly.

derekcollison added a commit that referenced this pull request Jan 18, 2014

Merge pull request #63 from corlettb/master
Currently the reconnect logic will only retry once if EM.reconnect throw...

derekcollison merged commit c15edfc into nats-io:master Jan 18, 2014

1 check failed

default The Travis CI build failed