No error if RabbitMQ goes down #76

Closed
ebuildy opened this Issue Mar 29, 2016 · 78 comments

ebuildy commented Mar 29, 2016

Before version 3.3.0, when the RabbitMQ server went down, Logstash would log:

Queue subscription ended! Will retry in 5s
RabbitMQ connection down, will wait to retry subscription

With version 4.0.1, nothing happens (as you can see in the docker-compose logs here: https://gist.github.com/ebuildy/d405dab7dbaca2e3a03dc2a7fa6f5b29) when the queue server is down. Is this normal?

Thank you,

andrewvc commented Mar 30, 2016

@ebuildy that is because we now use the RabbitMQ library's built-in auto-reconnect facility, which is more robust. We'd be glad to accept a patch that hooks into it and logs these events, however!
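
For example, a rough sketch of what such a logging hook could look like, assuming March Hare's on_shutdown callback is available in the bundled version (a sketch, not the plugin's actual code):

require "march_hare"

conn = MarchHare.connect(host: "localhost")

# Log connection shutdowns so operators can see broker outages;
# the client library's automatic recovery still performs the reconnect.
conn.on_shutdown do |session, cause|
  # `cause` wraps the Java client's ShutdownSignalException
  puts "RabbitMQ connection shut down: #{cause}"
end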

ebuildy commented Mar 30, 2016

Are you talking about this: https://www.rabbitmq.com/api-guide.html#recovery ?

Thank you. I must finish some PRs on other Logstash plugins and the RabbitMQ Java client (rabbitmq/rabbitmq-java-client#138) first, then I will take a look at it.

digitalkaoz commented Oct 18, 2016

@andrewvc besides there being no logs when the RabbitMQ server closes the channel, a reconnect isn't happening either, which is why I assume Logstash with the rabbitmq input is not usable at the moment, especially in a dockerized environment... Any solutions here? We use HAproxy between RabbitMQ and Logstash.

andrewvc commented Oct 18, 2016

@digitalkaoz why is the RabbitMQ server closing the channel?

AFAIK, if you do that, it means you want the client to stop and not restart. Please correct me if I'm wrong here.

andrewvc commented Oct 18, 2016

@digitalkaoz that's a different situation, may I add, from one where the server has gone down as a result of an error.

andrewvc commented Oct 18, 2016

@ebuildy @digitalkaoz, if you can share a reproducible test case, I'd be glad to try it and find a fix. Just list the steps to reproduce, preferably using just Logstash and a local RabbitMQ.

digitalkaoz commented Oct 18, 2016

@andrewvc the RabbitMQ and HAproxy servers are restarted during deployments (they are just Docker containers).

Here is the setup:

rabbitmq -> 5672(tcp) -> haproxy -> 5672(tcp) -> logstash

rabbitmq.conf is default

haproxy.conf

listen amqp
        mode tcp
        option tcplog
        bind *:5672
        log global
        balance         roundrobin
        timeout connect 2s
        timeout client  2m
        timeout server  2m
        option          clitcpka
        option tcp-check
        server rabbitmq:5672 check port 5672

logstash.conf

input {
    rabbitmq {
        connection_timeout => 1000
        heartbeat => 30
        host => "haproxy"
        subscription_retry_interval_seconds => 30
        queue => "logs"
        user => "guest"
        password => "guest"
    }
}

if we only restart RabbitMQ, it seems fine when we gracefully shut down the server:

$ docker exec rabbitmq rabbitmqctl close_connection "${CON_ID}" "rabbit shutdown"
$ docker exec rabbitmq rabbitmqctl stop_app

we explicitly close the connection to force the logstash-input-rabbitmq plugin to restart

but if we restart HAproxy, the TCP connection gets cut off ungracefully and the Logstash plugin won't notice that the channel is closed (and doesn't go into restart mode) because it received an error frame. The Java recovery machinery doesn't seem to handle that.

hope that's enough information

digitalkaoz commented Oct 18, 2016

I think that's a quite common scenario in a containerized world.

digitalkaoz commented Oct 18, 2016

when I restart Logstash I get the following error:

An unexpected error occurred!", :error=>com.rabbitmq.client.AlreadyClosedException: connection is already closed due to connection error; cause: java.io.EOFException, :class=>"Java::ComRabbitmqClient::AlreadyClosedException", :backtrace=>["com.rabbitmq.client.impl.AMQChannel.ensureIsOpen(com/rabbitmq/client/impl/AMQChannel.java:195)", "com.rabbitmq.client.impl.AMQChannel.rpc(com/rabbitmq/client/impl/AMQChannel.java:241)", "com.rabbitmq.client.impl.ChannelN.basicCancel(com/rabbitmq/client/impl/ChannelN.java:1127)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:498)", "RUBY.method_missing(/opt/logstash/vendor/bundle/jruby/1.9/gems/march_hare-2.15.0-java/lib/march_hare/channel.rb:874)", "RUBY.shutdown_consumer(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-rabbitmq-4.1.0/lib/logstash/inputs/rabbitmq.rb:279)", "RUBY.stop(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-rabbitmq-4.1.0/lib/logstash/inputs/rabbitmq.rb:273)", "RUBY.do_stop(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/inputs/base.rb:83)", "org.jruby.RubyArray.each(org/jruby/RubyArray.java:1613)", "RUBY.shutdown(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:385)", "RUBY.stop_pipeline(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:413)", "RUBY.shutdown_pipelines(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:406)", "org.jruby.RubyHash.each(org/jruby/RubyHash.java:1342)", "RUBY.shutdown_pipelines(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:406)", "RUBY.shutdown(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:402)", "RUBY.execute(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:233)", "RUBY.run(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/runner.rb:94)", "org.jruby.RubyProc.call(org/jruby/RubyProc.java:281)", "RUBY.run(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/runner.rb:99)", "org.jruby.RubyProc.call(org/jruby/RubyProc.java:281)", "RUBY.initialize(/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.22/lib/stud/task.rb:24)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
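
For what it's worth, the trace shows shutdown_consumer calling basicCancel on a channel that is already closed. A hypothetical defensive variant (the names and structure here are illustrative, not the plugin's actual code) could swallow that specific error during shutdown:

# Hypothetical sketch: ignore "already closed" errors when cancelling
# the consumer while the pipeline is shutting down.
def shutdown_consumer
  return unless @consumer
  begin
    @hare_info.channel.basic_cancel(@consumer.consumer_tag)
  rescue Java::ComRabbitmqClient::AlreadyClosedException
    @logger.warn("Channel already closed while cancelling consumer; ignoring")
  end
end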

andrewvc commented Oct 18, 2016

@digitalkaoz thanks for the very detailed description here!

@michaelklishin should the Java client's automatic connection recovery just handle this?

michaelklishin commented Oct 19, 2016

It takes time to detect peer unavailability. I don't see how this is a connection recovery issue: recovery cannot start if a client connection is still considered to be alive.

michaelklishin commented Oct 19, 2016

Also: do not set heartbeats to a value lower than 5 seconds. That is a recipe for false positives.
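
For reference, a minimal March Hare sketch with an explicit heartbeat (the requested_heartbeat option also appears in the experiment later in this thread; the value is illustrative):

require "march_hare"

# The heartbeat interval is negotiated in seconds (AMQP 0-9-1);
# values below ~5s invite false positives under load.
conn = MarchHare.connect(host: "localhost", requested_heartbeat: 10)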

digitalkaoz commented Oct 19, 2016

We waited a lot, but Logstash won't recover the connection. I would rather get false positives than nothing... any more ideas? The error message at the end says it very clearly: the channel was already closed because it received an error frame. The channel is unusable once it has received an error frame...

and neither this plugin, nor march_hare, nor the Java AMQP library seems to notice that

andrewvc commented Oct 19, 2016

@michaelklishin thanks for popping in!

He's setting the heartbeat to 30 seconds. It's confusing because the connection timeout is in millis, but the heartbeat is in seconds.

I think this is correct, yes? Looking at the source of the Java client, it looks like the timeout is set in millis and heartbeats in seconds. Is that all correct?
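
If March Hare forwards these options unchanged to the Java client (an assumption: :connection_timeout mapping to setConnectionTimeout, in milliseconds, and :requested_heartbeat to setRequestedHeartbeat, in seconds), the unit mismatch would look like this sketch:

require "march_hare"

conn = MarchHare.connect(
  host:                "haproxy",
  connection_timeout:  1000, # milliseconds
  requested_heartbeat: 30    # seconds
)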

digitalkaoz commented Oct 25, 2016

@andrewvc @michaelklishin I created a tiny docker-compose setup to reproduce this issue:

https://github.com/digitalkaoz/rabbit-haproxy-logstash-fail

Simply follow the reproduction steps...

michaelklishin commented Oct 25, 2016

Heartbeat timeout is in seconds per protocol spec. Also see this thread.

michaelklishin commented Oct 25, 2016

So you have HAproxy in between. HAproxy has its own timeouts (which heartbeats in theory make irrelevant but still) and what RabbitMQ sees is a TCP connection from HAproxy, not the client. I can do a March Hare release with a preview version of the 3.6.6 Java client, however.

andrewvc commented Oct 25, 2016

Thanks @digitalkaoz, I'll try to repro today.

andrewvc commented Oct 25, 2016

Also, thanks for the background, @michaelklishin. I'm wondering if what's happening is that HAproxy is switching the backend without letting Logstash know, and without disconnecting from Logstash. Will report back.

digitalkaoz commented Oct 25, 2016

It still fails even when I wait, let's say, 5 minutes... @michaelklishin as I understand it, the relevant fix is in https://github.com/rabbitmq/rabbitmq-java-client/releases/tag/v4.0.0.M2 ?

michaelklishin commented Oct 25, 2016

@andrewvc typically proxies propagate upstream connection loss. At least that's what HAproxy does by default. In fact, if it didn't then the two logical ends of the TCP connection would be out of sync.

I doubt it's a Java client problem here but I've updated March Hare to the latest 3.6.x version from source.

michaelklishin commented Oct 25, 2016

@digitalkaoz I'm sorry but there is no evidence that I have seen that this is a client library issue. A good thing to try would be to eliminate HAproxy from the picture entirely and also verify the effective heartbeat timeout value (management UI displays it on the connection page).

digitalkaoz commented Oct 25, 2016

@michaelklishin sure, if I remove HAproxy from the setup the reconnect works, but that's not a typical setup (at least not at scale).

With HAproxy in the middle, the reconnect fails (even with a heartbeat of 10s), but my question is why? Shouldn't the heartbeat tell the plugin that the connection isn't healthy anymore and reinitiate a recovery/reconnect?

michaelklishin commented Oct 25, 2016

It should. We don't know why that doesn't happen: that's why I recommended checking the actual effective value.

digitalkaoz commented Oct 25, 2016

After restarting RabbitMQ there is no connection alive on the Rabbit node, because nobody initiated a new connection: neither HAproxy nor Logstash, because their connection is still there but de facto useless?

andrewvc commented Oct 25, 2016

@michaelklishin, is there a best practice for accomplishing what @digitalkaoz wants here? I think having a layer of indirection via a load balancer makes sense. This has come up multiple times; I'm wondering if the RabbitMQ project has an official stance.

michaelklishin commented Oct 25, 2016

There's no stance, as HAproxy is supposed to be transparent to the client.

@digitalkaoz HAproxy is not supposed to initiate any connections upstream unless a new client connects to it.

Can we please stop guessing and move on to forming hypotheses and verifying them? The first one I have is: the heartbeat value is actually not set to what the reporter thinks it is. I've mentioned above how this can be verified. Without this we can pour in hours and hours of time and get nothing out of it. My team already spends hours every day on various support questions, and we would appreciate it if reporters cooperated a bit to save us all some time.

I also pushed a Java client update to March Hare, even though I doubt it will change anything.

A Wireshark/tcpdump capture will provide a lot of information to look at.

michaelklishin commented Oct 25, 2016

A Wireshark capture should make it easy to see heartbeat frames on the wire: not only their actual presence but also timestamps (they are sent at roughly 1/2 of the effective timeout value).

digitalkaoz commented Oct 25, 2016

@michaelklishin the heartbeat is set to 10s, as I told you above... but that didn't change anything. Or do you mean some other value?

[screenshot: screen shot 2016-10-25 at 16 17 12]

michaelklishin commented Oct 25, 2016

What is in the server logs? The HAproxy logs?

digitalkaoz commented Oct 25, 2016

Hmm, the HAproxy logs are not very helpful:

proxy_1     | 00000000:amqp.srvcls[0007:0008]
proxy_1     | 00000000:amqp.clicls[0007:0008]
proxy_1     | 00000000:amqp.closed[0007:0008]
rabbit_1 exited with code 137
proxy_1     | 00000001:amqp.accept(0006)=0008 from [172.18.0.5:49608]
proxy_1     | [WARNING] 298/143248 (1) : Server amqp/rabbit is DOWN, reason: Layer4 timeout, check duration: 5001ms. 0 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
proxy_1     | [ALERT] 298/143248 (1) : proxy 'amqp' has no server available!

172.18.0.5 is Logstash, btw

michaelklishin commented Oct 25, 2016

@digitalkaoz can you please post a less abridged log together with RabbitMQ node and Logstash logs?

michaelklishin commented Oct 25, 2016

According to the abridged log above, HAproxy does accept a client connection, then detects that an upstream node is down and there are no more upstreams to try. Then… we don't know what happens because the log ends.

digitalkaoz commented Oct 25, 2016

There are no more logs from any of the systems (neither HAproxy nor Logstash).

michaelklishin commented Oct 25, 2016

@digitalkaoz what about RabbitMQ? Can you take a tcpdump or Wireshark capture?

digitalkaoz commented Oct 25, 2016

Can I recommend cloning the repo? I'm not very familiar with TCP connection debugging :/

digitalkaoz commented Oct 25, 2016

@michaelklishin I attached the full log here: https://gist.github.com/digitalkaoz/8f1825dd1ad2976a1516afdccf120e7f

logstash: 172.18.0.5
haproxy: 172.18.0.2
rabbit: 172.18.0.3

michaelklishin commented Oct 25, 2016

From the RabbitMQ log, after a few inbound connections and restarts:

rabbit_1    | accepting AMQP connection <0.400.0> (172.18.0.3:38102 -> 172.18.0.2:5672)
rabbit_1    |
rabbit_1    | =WARNING REPORT==== 25-Oct-2016::14:49:07 ===
rabbit_1    | closing AMQP connection <0.400.0> (172.18.0.3:38102 -> 172.18.0.2:5672):
rabbit_1    | client unexpectedly closed TCP connection

This means that at 25-Oct-2016::14:49:07 RabbitMQ did register an inbound AMQP 0-9-1 client connection and then its TCP connection was immediately closed by HAproxy.

I am not sure what exactly srvcls and clicls are, probably "server connection closed" and "client connection closed", respectively:

proxy_1     | 00000001:amqp.srvcls[0008:0007]
proxy_1     | 00000001:amqp.clicls[0008:0007]
proxy_1     | 00000001:amqp.closed[0008:0007]

I'm not sure why it happens, perhaps HAproxy decides to reset all connections now that it has an upstream to connect to. Also this should not prevent March Hare from attempting to connect again after a period of time.

Can you try March Hare master?
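
(For anyone who wants to try master before a release cut: a minimal Gemfile sketch using Bundler's git source, assuming the ruby-amqp/march_hare repository:)

# Gemfile
gem "march_hare", git: "https://github.com/ruby-amqp/march_hare.git"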

michaelklishin commented Oct 25, 2016

@andrewvc clients have no way of identifying what RabbitMQ server they are connected to. It would be pretty silly to only allow recovery to re-connect to the same node because in practice reconnection to the same node will not succeed for a while.

michaelklishin commented Oct 25, 2016

So I have conducted a little experiment with March Hare master, vanilla RabbitMQ 3.6.5 with all defaults, and HAproxy 1.5 (not sure what the exact version is, possibly even a -dev one) with the following config:

global
        log 127.0.0.1   local0
        log 127.0.0.1   local1 notice
        maxconn 4096
        user antares
        group staff
        # daemon
        debug

defaults
        log        global
        option     dontlognull
        option     redispatch
        retries    3
        maxconn    2000
        contimeout 5000
        clitimeout 50000
        srvtimeout 50000

backend rabbitmq
        mode         tcp
        server node1 127.0.0.1:5672 maxconn 2048

frontend rabbitmq-in
        bind *:5673
        balance
        default_backend rabbitmq

My March Hare connection code is about as trivial as it gets:

c = MarchHare.connect(port: 5673, requested_heartbeat: 8)

and then I periodically check if the connection is open:

c.open? # => true

I then shut down RabbitMQ a few times, observed that the connection was closed, started it back up, and observed that it was open again within a few seconds. This worked repeatedly, 5 times in a row.

Here's the HAproxy log; you can see inbound connections from different ephemeral ports (read: different client sockets):

00000001:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53004]
00000001:rabbitmq.srvcls[0005:0006]
00000001:rabbitmq.clicls[0005:0006]
00000001:rabbitmq.closed[0005:0006]
00000002:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53023]
00000002:rabbitmq.clicls[0005:0006]
00000003:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53033]
00000003:rabbitmq.srvcls[0005:0007]
00000003:rabbitmq.clicls[0005:0007]
00000003:rabbitmq.closed[0005:0007]
00000004:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53038]
00000004:rabbitmq.srvcls[0005:0007]
00000004:rabbitmq.clicls[0005:0007]
00000004:rabbitmq.closed[0005:0007]
00000005:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53054]
00000005:rabbitmq.srvcls[0005:0007]
00000005:rabbitmq.clicls[0005:0007]
00000005:rabbitmq.closed[0005:0007]
00000006:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53060]
00000006:rabbitmq.clicls[0005:0007]
00000007:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53067]

Here's the RabbitMQ log from the most recent restart:

=INFO REPORT==== 25-Oct-2016::22:45:03 ===
Stopped RabbitMQ application

=INFO REPORT==== 25-Oct-2016::22:45:03 ===
Halting Erlang VM

=INFO REPORT==== 25-Oct-2016::22:45:13 ===
Starting RabbitMQ 3.6.5 on Erlang 19.1
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

…

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Memory limit set to 5381MB of 5978MB total.

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Disk free limit set to 6269MB

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Limiting to approx 499900 file handles (449908 sockets)

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
FHC read buffering:  OFF
FHC write buffering: ON

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Management plugin: using rates mode 'basic'

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
started TCP Listener on [::]:5672

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
started SSL Listener on [::]:5671

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
rabbit_stomp: default user 'guest' enabled

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
started STOMP TCP Listener on [::]:61613

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
MQTT retained message store: rabbit_mqtt_retained_msg_store_dets

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
started MQTT TCP Listener on [::]:1883

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
opening log file: "/tmp/rabbit-mgmt/access.log.2016_10_25_19"

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Management plugin started. Port: 15672

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics event collector started.

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics channel stats collector started.

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics queue stats collector started.

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics database started.

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics garbage collector started for table aggr_queue_stats_fine_stats with interval 5000.

…

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics garbage collector started for table aggr_exchange_stats_fine_stats with interval 5000.

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Server startup complete; 12 plugins started.
 * rabbitmq_shovel_management
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * webmachine
 * rabbitmq_jms_topic_exchange
 * rabbitmq_consistent_hash_exchange
 * rabbitmq_shovel
 * rabbitmq_management_agent
 * rabbitmq_mqtt
 * rabbitmq_stomp
 * mochiweb
 * amqp_client

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics garbage collector started for table aggr_node_stats_coarse_node_stats with interval 5000.

…

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics garbage collector started for table connection_stats with interval 5000.

=INFO REPORT==== 25-Oct-2016::22:45:19 ===
accepting AMQP connection <0.629.0> (127.0.0.1:53068 -> 127.0.0.1:5672)

As you can see, the most recent HAproxy connection is on port 53067 and RabbitMQ accepts a connection from 53068 (HAproxy), which means the two descriptors were opened in rapid succession. So yes, this is real: March Hare recovers successfully when an HAproxy upstream goes down.

michaelklishin commented Oct 25, 2016

In the provided TCP dump file, if we apply a display filter that only displays the packets/segments/frames sent by the Logstash process (which seems to be on host 172.19.0.5, according to my understanding of the heartbeat frame exchange) like this:

ip.src eq 172.19.0.5

we see that it attempts to connect and gets an RST (reset) packet:

173 2016-10-25 17:19:06.146341  172.19.0.5  59566   172.19.0.4  5672    TCP 54  59566 → 5672 [RST] Seq=10 Win=0 Len=0

and then the next time it tries and succeeds is over a minute later:

236 2016-10-25 17:20:26.381197  172.19.0.5  59812   172.19.0.4  5672    TCP 66  59812 → 5672 [ACK] Seq=9 Ack=494 Win=30336 Len=0 TSval=362368 TSecr=362368

so depending on the recovery interval used, this may or may not indicate that the client did not try to reconnect more than once. Default interval is 5 seconds.

digitalkaoz commented Oct 25, 2016

@michaelklishin thanks for digging into this. I'm trying to help as much as I can, but I'm not deeply familiar with the Java RabbitMQ client, march_hare, or the Logstash plugins.

Can you open a channel too, not just a connection? Because that's what the Logstash plugin does, and that is what gets closed unnoticed.

michaelklishin commented Oct 25, 2016

jruby-9.1.2.0 :003 > [c.open?, ch.open?]
 => [true, true]
# stop RabbitMQ
jruby-9.1.2.0 :004 > [c.open?, ch.open?]
 => [false, false]
jruby-9.1.2.0 :005 > [c.open?, ch.open?]
 => [false, false]
# start RabbitMQ
jruby-9.1.2.0 :006 > [c.open?, ch.open?]
 => [false, false]
# node finishes booting
jruby-9.1.2.0 :007 > [c.open?, ch.open?]
 => [true, true]
michaelklishin commented Oct 25, 2016

March Hare 2.19.0 is out.

andrewvc commented Oct 25, 2016

Great, I'll try upgrading tomorrow. With 2.18.0 I was seeing a state where the connection was closed but auto-recovery wasn't touching it. I'll have more tomorrow.

andrewvc added a commit to andrewvc/logstash-mixin-rabbitmq_connection that referenced this issue Oct 26, 2016

Bump march_hare version to 2.19.0 to fix reconnect issues

This fixes logstash-plugins/logstash-input-rabbitmq#76, where HAproxy switching RabbitMQ backends would cause it to stall indefinitely.

andrewvc commented Oct 26, 2016

Bumping versions fixed this issue for me, @michaelklishin. Thanks for pushing the new version.

Opened this PR to bump it: logstash-plugins/logstash-mixin-rabbitmq_connection#32

We need a separate PR to do a better job logging bad connections.

andrewvc commented Oct 26, 2016

Please respond to this ticket if you're still experiencing this bug, @digitalkaoz, and we can reopen it. You should be able to upgrade by running logstash-plugin update

digitalkaoz commented Oct 26, 2016

@andrewvc I'm not sure how it fixes it for you?

march_hare-2.19.0
logstash-mixin-rabbitmq_connection-4.2.0
logstash-input-rabbitmq-4.1.0

Same thing: if I kill and start RabbitMQ, Logstash still doesn't recover the connection. The last message from Logstash was now (thanks to your newest commit ;) ):

{:timestamp=>"2016-10-26T19:12:31.687000+0000", :message=>"RabbitMQ connection was closed!", :url=>"amqp://guest:XXXXXX@proxy:5672/", :automatic_recovery=>true, :cause=>com.rabbitmq.client.ShutdownSignalException: connection error, :level=>:warn}

The important thing is to not restart Logstash!
Can you tell me your steps, i.e. how it worked on your side?

andrewvc commented Oct 26, 2016

@digitalkaoz yes, I followed your procedure (using your excellent docker-compose setup):

  1. Start up the whole stack, wait for LS to come up
  2. docker-compose kill rabbit producer
  3. docker-compose up rabbit producer

That all works for me. Not for you? BTW, I'd recommend upgrading to logstash-input-rabbitmq 5.2.0

digitalkaoz commented Oct 27, 2016

still no luck:

logstash:5
logstash-mixin-rabbitmq_connection-4.2.0
logstash-input-rabbitmq-5.2.0
march_hare-2.19.0

What am I doing wrong?

my steps:

$ docker-compose up # all up
$ docker-compose kill rabbit producer # kill rabbit+producer
$ sleep 10
$ docker-compose start rabbit producer # restart rabbit+producer

Logstash still won't catch up :/

michaelklishin commented Oct 27, 2016

@digitalkaoz I'm sorry to be so blunt but if you want to get to the bottom of it you need to put in some debugging effort of your own. We have provided several hypotheses, investigated at least two, explained what various log entries mean, what can be altered to identify what application may be introducing the behaviour observed, analyzed at least one Wireshark dump, and based on all the findings produced a new March Hare version. That's enough information to understand how to debug a small distributed system like this one.

Asking "what am I doing wrong" is unlikely to be productive. Trying things at random is unlikely to be efficient.

andrewvc commented Oct 28, 2016

@digitalkaoz can you update the docker-compose stuff to the latest? One thing that was different in my repro is that I ran everything except Logstash in Docker. That shouldn't make a difference, but maybe it does.

andrewvc commented Oct 31, 2016

Hey @digitalkaoz I'm still glad to help you get to the bottom of this. @michaelklishin thanks so much for your help, I really appreciate it!

I'm glad to put in some more hours here myself. What would help me move this forward, @digitalkaoz, is if you could update your Docker config to the latest versions so we can be reproducing the same things.

digitalkaoz commented Nov 1, 2016

Sorry @andrewvc, I had a busy weekend.

I updated everything to the latest versions, but still no luck. The repo is up to date now.

$ docker-compose up # all up
$ docker-compose kill rabbit producer
$ sleep 10
$ docker-compose start rabbit producer

michaelklishin commented Nov 1, 2016

In the snippet above both RabbitMQ and publisher are shut down and started at the same time. Note that this is not the same steps that at least some in this thread performed and creates a natural race condition between the two starting. If the producer starts before RabbitMQ then connection recovery WILL NOT BE ENGAGED since there never was a successful connection to recover: initial TCP connection will fail.
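To illustrate that point: because automatic recovery only engages after one successful connection, a client that may start before the broker needs its own retry loop around the initial connect. A minimal Ruby sketch, assuming march_hare's MarchHare.connect API (the rescued exception class and option names are assumptions to verify against the library):

    require "march_hare"

    # Retry the *initial* connect ourselves; march_hare's automatic
    # recovery only takes over once a connection has been established.
    def connect_with_retry(attempts = 30, interval = 5)
      attempts.times do
        begin
          return MarchHare.connect(:host => "rabbit", :automatically_recover => true)
        rescue MarchHare::Exception, java.io.IOException
          sleep interval # broker not reachable yet; try again
        end
      end
      raise "RabbitMQ never became reachable"
    end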

@digitalkaoz

digitalkaoz commented Nov 1, 2016

@michaelklishin the producer waits 10s before doing anything, so rabbit has time to start

@andrewvc

andrewvc commented Nov 1, 2016

Contributor

I've reopened this issue now that I've reconfirmed it, @digitalkaoz, but as I've traced the issue to what I believe is definitely a March Hare bug, we should move the discussion to ruby-amqp/march_hare#107.

Interestingly, March Hare does recover if RabbitMQ comes back in under the 5s window. If it takes any longer, it is perma-wedged.
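The 5s window presumably corresponds to the client's network recovery interval. In march_hare that interval is configurable at connect time; a sketch (option name taken from the library's documented connection options, value illustrative):

    # Widen the delay between automatic recovery attempts.
    conn = MarchHare.connect(:host => "rabbit", :network_recovery_interval => 10)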

@andrewvc andrewvc reopened this Nov 1, 2016

@andrewvc

andrewvc commented Nov 1, 2016

Contributor

BTW, thanks for the excellent docker-compose setup, @digitalkaoz.

@jordansissel

jordansissel commented Nov 1, 2016

Contributor

@andrewvc and I worked on this for a while today. A packet capture in Wireshark showed that the RabbitMQ client in Logstash did indeed wait 5 seconds (as expected) after RabbitMQ was stopped before reconnecting. However, reconnection was only attempted once:

  • ✔️ The client's connection is terminated by HAProxy because RabbitMQ was terminated
  • ✔️ The client initiates a new connection to HAProxy
  • ✔️ The client sends the protocol negotiation header: AMQP\0\0\9\1 ("AMQP" + hex 00000901)
  • ✔️ 5 seconds later, the client starts TCP teardown, sending FIN to close the connection, presumably because the backend (HAProxy) did not respond to the client's protocol negotiation
  • No further connection attempts are made by the client. The expectation is that, after another 5 seconds, the client would attempt to reconnect again.

The problem we are looking at now is understanding why the RabbitMQ client does not reconnect after the first reconnection attempt has timed out.
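For reference, the negotiation header above is the literal ASCII bytes "AMQP" followed by 0x00 0x00 0x09 0x01 (AMQP 0-9-1). A quick Ruby check of those eight bytes:

    header = "AMQP" + [0, 0, 9, 1].pack("C4")
    puts header.bytes.map { |b| "%02x" % b }.join(" ")
    # prints: 41 4d 51 50 00 00 09 01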

@jordansissel

jordansissel commented Nov 1, 2016

Contributor

Based on this behavior, I was able to reproduce the problem without Docker and without HAProxy:

  • Run RabbitMQ + Logstash
  • Let the Logstash rmq input connect
  • Stop RabbitMQ
  • Run nc -l 5672 to accept the next connection retry from Logstash
  • Wait until 'AMQP' appears on nc (this is the rmq client attempting to reconnect)
  • After this attempt to reconnect, the rmq input never tries again.
@michaelklishin

michaelklishin commented Nov 1, 2016

Contributor

@jordansissel this is an edge case in the protocol. Before connection negotiation happens, we cannot set up a heartbeat timer, but without one it takes a long time to detect an unresponsive peer (whether there is a TCP connection issue or simply no response, as in the nc example).

Reproducing it therefore requires an intermediary or a bogus server that doesn't respond to the protocol header.
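Such a bogus server is easy to emulate. A minimal Ruby stand-in for the nc -l 5672 step above, which accepts the connection, reads the protocol header, and deliberately never replies:

    require "socket"

    server = TCPServer.new(5672)
    client = server.accept              # accept the client's reconnect attempt
    puts client.readpartial(8).inspect  # expect "AMQP\x00\x00\x09\x01"
    sleep                               # never respond, so negotiation hangs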

@andrewvc

andrewvc commented Nov 1, 2016

Contributor

@michaelklishin confirmed this with @jordansissel. The fix is in this PR: ruby-amqp/march_hare#108

Thanks for pointing us to the right place!

@michaelklishin

michaelklishin commented Nov 1, 2016

Contributor

@andrewvc @jordansissel thank you for finding a way to reproduce with netcat. If there is a way for you to verify end-to-end with March Hare master, I'd be happy to do a March Hare release tomorrow morning (European time).

@michaelklishin

michaelklishin commented Nov 2, 2016

Contributor

March Hare 2.20.0 is up.

@digitalkaoz

digitalkaoz commented Nov 2, 2016

@michaelklishin thanks, I'll test it now.
@andrewvc can you release a 4.2.1 of logstash-mixin-rabbitmq_connection so I get march_hare 2.20.0 in?

@digitalkaoz

digitalkaoz commented Nov 2, 2016

@michaelklishin @andrewvc @jordansissel I can confirm that march_hare 2.20.0 fixed the reconnection issues. Thanks for all your efforts!

@andrewvc

andrewvc commented Nov 2, 2016

Contributor

@digitalkaoz fantastic!

@andrewvc andrewvc closed this Nov 2, 2016

@jakauppila

jakauppila commented Nov 3, 2016

Should these changes be ported over to logstash-output-rabbitmq as well?

@digitalkaoz

digitalkaoz commented Nov 3, 2016

I think it does: because the mixin (used by both the input and the output) loads march_hare, the output gets this fix too. Or am I wrong?
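That reading matches how the dependency is wired: the march_hare requirement is declared by the shared connection mixin, so bumping it there reaches both the input and the output plugin. Illustratively (the exact constraint in the real gemspec may differ):

    # logstash-mixin-rabbitmq_connection.gemspec (sketch)
    s.add_runtime_dependency "march_hare", [">= 2.20.0"]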
