No error if RabbitMQ goes down #76

Closed
ebuildy opened this issue Mar 29, 2016 · 78 comments

@ebuildy ebuildy commented Mar 29, 2016

Before version 3.3.0, when the RabbitMQ server went down, Logstash logged:

Queue subscription ended! Will retry in 5s
RabbitMQ connection down, will wait to retry subscription

With version 4.0.1, nothing is logged if the queue server is down (as you can see in the docker-compose logs here: https://gist.github.com/ebuildy/d405dab7dbaca2e3a03dc2a7fa6f5b29). Is this normal?

Thank you,

Contributor

@andrewvc andrewvc commented Mar 30, 2016

@ebuildy that is because we now use the RabbitMQ library's built-in auto-reconnect facility, which is more robust. We'd be glad to accept a patch that hooks into it and logs these events, however!

Author

@ebuildy ebuildy commented Mar 30, 2016

Are you talking about this: https://www.rabbitmq.com/api-guide.html#recovery ?

Thank you. I must first finish some PRs on other Logstash plugins and the RabbitMQ Java client (rabbitmq/rabbitmq-java-client#138), then I will take a look at it.

@digitalkaoz digitalkaoz commented Oct 18, 2016

@andrewvc besides there being no logs when the rabbitmq server closes the channel, a reconnect isn't happening, which is why i assume logstash with the rabbitmq input is not usable at the moment, especially in a dockerized environment... any solutions here? we use haproxy in between rabbit and logstash

Contributor

@andrewvc andrewvc commented Oct 18, 2016

@digitalkaoz why is the rabbitmq server closing the channel?

AFAIK if you do that the meaning is you want the client to stop and not restart. Please correct me if I'm wrong here.

Contributor

@andrewvc andrewvc commented Oct 18, 2016

@digitalkaoz that's a different situation, may I add, from one where the server has gone down as a result of an error.

Contributor

@andrewvc andrewvc commented Oct 18, 2016

@ebuildy @digitalkaoz , if you all can share a reproducible test case I'd be glad to try it and find a fix. Just list the steps to reproduce, preferably just using logstash and a local rabbitmq.

@digitalkaoz digitalkaoz commented Oct 18, 2016

@andrewvc the rabbit and haproxy server are restarted during deployments (they are just docker containers)

here is the setup:

rabbitmq -> 5672(tcp) -> haproxy -> 5672(tcp) -> logstash

rabbitmq.conf is default

haproxy.conf

listen amqp
        mode tcp
        option tcplog
        bind *:5672
        log global
        balance         roundrobin
        timeout connect 2s
        timeout client  2m
        timeout server  2m
        option          clitcpka
        option tcp-check
        server rabbitmq:5672 check port 5672

logstash.conf

input {
    rabbitmq {
        connection_timeout => 1000
        heartbeat => 30
        host => "haproxy"
        subscription_retry_interval_seconds => 30
        queue => "logs"
        user => "guest"
        password => "guest"
    }
}

if we only restart rabbitmq it seems fine, as long as we gracefully shut down the server:

$ docker exec rabbitmq rabbitmqctl close_connection {$CON_ID} "rabbit shutdown"
$ docker exec rabbitmq rabbitmqctl stop_app

we explicitly close the connection to force the logstash-rabbitmq-input-plugin to restart

but if we restart haproxy the TCP connection gets cut off ungracefully and the logstash plugin won't notice the channel is closed (and doesn't go into restart mode) because it received an error frame. the java recovery stuff doesn't seem to work for that.

hope that's enough information

@digitalkaoz digitalkaoz commented Oct 18, 2016

i think that's quite a common scenario in a containerized world

@digitalkaoz digitalkaoz commented Oct 18, 2016

when i restart logstash i get the following error:

An unexpected error occurred!", :error=>com.rabbitmq.client.AlreadyClosedException: connection is already closed due to connection error; cause: java.io.EOFException, :class=>"Java::ComRabbitmqClient::AlreadyClosedException", :backtrace=>["com.rabbitmq.client.impl.AMQChannel.ensureIsOpen(com/rabbitmq/client/impl/AMQChannel.java:195)", "com.rabbitmq.client.impl.AMQChannel.rpc(com/rabbitmq/client/impl/AMQChannel.java:241)", "com.rabbitmq.client.impl.ChannelN.basicCancel(com/rabbitmq/client/impl/ChannelN.java:1127)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:498)", "RUBY.method_missing(/opt/logstash/vendor/bundle/jruby/1.9/gems/march_hare-2.15.0-java/lib/march_hare/channel.rb:874)", "RUBY.shutdown_consumer(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-rabbitmq-4.1.0/lib/logstash/inputs/rabbitmq.rb:279)", "RUBY.stop(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-rabbitmq-4.1.0/lib/logstash/inputs/rabbitmq.rb:273)", "RUBY.do_stop(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/inputs/base.rb:83)", "org.jruby.RubyArray.each(org/jruby/RubyArray.java:1613)", "RUBY.shutdown(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:385)", "RUBY.stop_pipeline(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:413)", "RUBY.shutdown_pipelines(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:406)", "org.jruby.RubyHash.each(org/jruby/RubyHash.java:1342)", "RUBY.shutdown_pipelines(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:406)", "RUBY.shutdown(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:402)", "RUBY.execute(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:233)", 
"RUBY.run(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/runner.rb:94)", "org.jruby.RubyProc.call(org/jruby/RubyProc.java:281)", "RUBY.run(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/runner.rb:99)", "org.jruby.RubyProc.call(org/jruby/RubyProc.java:281)", "RUBY.initialize(/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.22/lib/stud/task.rb:24)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}

Contributor

@andrewvc andrewvc commented Oct 18, 2016

@digitalkaoz thanks for the very detailed description here!

@michaelklishin should the java client's auto connection recovery just handle this?

Contributor

@michaelklishin michaelklishin commented Oct 19, 2016

It takes time to detect peer unavailability. I don't see how this is a connection recovery issue: recovery cannot start if a client connection is still considered to be alive.

Contributor

@michaelklishin michaelklishin commented Oct 19, 2016

Also: do not set heartbeats to a value lower than 5 seconds. That is a recipe for false positives.

@digitalkaoz digitalkaoz commented Oct 19, 2016

We waited a lot but logstash won't recover the connection. I would rather get false positives than nothing... any more ideas? The error message at the end says it very clearly: the channel was already closed because it received an error frame. The channel is unusable once it has received an error frame...

and neither this plugin, nor march_hare, nor the Java AMQP lib seems to notice that

Contributor

@andrewvc andrewvc commented Oct 19, 2016

@michaelklishin thanks for popping in!

He's setting the heartbeat to 30 seconds. It's confusing because the connection timeout is in millis, but the heartbeat is in seconds.

I think this is correct yes? Looking at the source of the java client it looks like timeout is set in millis and heartbeats in seconds. Is that all correct?

@digitalkaoz digitalkaoz commented Oct 25, 2016

@andrewvc @michaelklishin i created a tiny docker-compose setup to reproduce this issue:

https://github.com/digitalkaoz/rabbit-haproxy-logstash-fail

simply follow the reproduction steps...

Contributor

@michaelklishin michaelklishin commented Oct 25, 2016

Heartbeat timeout is in seconds per protocol spec. Also see this thread.

Contributor

@michaelklishin michaelklishin commented Oct 25, 2016

So you have HAproxy in between. HAproxy has its own timeouts (which heartbeats in theory make irrelevant but still) and what RabbitMQ sees is a TCP connection from HAproxy, not the client. I can do a March Hare release with a preview version of the 3.6.6 Java client, however.

Contributor

@andrewvc andrewvc commented Oct 25, 2016

Thanks @digitalkaoz , I'll try and repro today.

Contributor

@andrewvc andrewvc commented Oct 25, 2016

Also thanks for the background @michaelklishin . I'm wondering if what's happening is that HAProxy is switching the backend without letting logstash know, and without disconnecting from logstash. Will report back.

@digitalkaoz digitalkaoz commented Oct 25, 2016

it still fails when i wait, let's say, 5 minutes... @michaelklishin as i understand it, the relevant fix is in https://github.com/rabbitmq/rabbitmq-java-client/releases/tag/v4.0.0.M2 ?

Contributor

@michaelklishin michaelklishin commented Oct 25, 2016

@andrewvc typically proxies propagate upstream connection loss. At least that's what HAproxy does by default. In fact, if it didn't then the two logical ends of the TCP connection would be out of sync.

I doubt it's a Java client problem here but I've updated March Hare to the latest 3.6.x version from source.

Contributor

@michaelklishin michaelklishin commented Oct 25, 2016

@digitalkaoz I'm sorry but there is no evidence that I have seen that this is a client library issue. A good thing to try would be to eliminate HAproxy from the picture entirely and also verify the effective heartbeat timeout value (management UI displays it on the connection page).

@digitalkaoz digitalkaoz commented Oct 25, 2016

@michaelklishin sure, if i remove haproxy from the setup the reconnect works. but that's not a typical setup (at least not at scale)

with haproxy in the middle, reconnect fails (even with a heartbeat of 10s), but my question is why? shouldn't the heartbeat tell the plugin that the connection isn't healthy anymore and initiate a recovery/reconnect?

Contributor

@michaelklishin michaelklishin commented Oct 25, 2016

It should. We don't know why that doesn't happen: that's why I recommended checking the actual effective value.

@digitalkaoz digitalkaoz commented Oct 25, 2016

after restarting rabbit there is no connection alive on the rabbit node because nobody initiated a new connection, neither haproxy nor logstash. is that because their connection is still there but de facto useless?

Contributor

@andrewvc andrewvc commented Oct 25, 2016

@michaelklishin, is there a best practice here for accomplishing what @digitalkaoz wants to accomplish? I think having a layer of indirection via a load balancer makes sense. This is something that's come up multiple times, I'm wondering if the RabbitMQ project has an official stance here.

Contributor

@michaelklishin michaelklishin commented Oct 25, 2016

There's no stance as HAproxy is supposed to be transparent to the client.

@digitalkaoz HAproxy is not supposed to initiate any connections upstream unless a new client connects to it.

Can we please stop guessing and move on to forming hypotheses and verifying them? The first one I have is: the heartbeat value is actually not set to what the reporter thinks it is set to. I've mentioned how this can be verified above. Without this we can pour hours and hours of time and get nothing out of it. My team already spends hours every day on various support questions and we would appreciate if the reporters cooperated a bit to save us all some time.

I also pushed a Java client update to March Hare, even though I doubt it will change anything.

A Wireshark/tcpdump capture will provide a lot of information to look at.

Contributor

@michaelklishin michaelklishin commented Oct 25, 2016

A Wireshark capture should make it easy to see heartbeat frames on the wire: not only their actual presence but also timestamps (they are sent at roughly 1/2 of the effective timeout value).
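As a rough illustration of that check, the sketch below compares the gaps between heartbeat timestamps read off a capture with half of the timeout one believes is in effect. The function names, the 25% tolerance, and the sample timestamps are assumptions made for this example; nothing here comes from the RabbitMQ client or from Wireshark.

```python
from statistics import median
from typing import Sequence


def expected_heartbeat_interval(timeout_s: float) -> float:
    """Heartbeat frames are sent at roughly half the effective timeout."""
    return timeout_s / 2.0


def looks_like_timeout(frame_times: Sequence[float], timeout_s: float,
                       tolerance: float = 0.25) -> bool:
    """Return True if the median gap between heartbeat frames is within
    `tolerance` (a fraction, assumed here) of timeout_s / 2."""
    if len(frame_times) < 2:
        raise ValueError("need at least two heartbeat timestamps")
    gaps = [b - a for a, b in zip(frame_times, frame_times[1:])]
    expected = expected_heartbeat_interval(timeout_s)
    return abs(median(gaps) - expected) <= tolerance * expected


# Hypothetical timestamps (in seconds) of heartbeat frames in a capture.
capture = [0.0, 5.1, 10.0, 15.2, 20.1]
print(looks_like_timeout(capture, timeout_s=10))  # gaps near 5 s fit a 10 s timeout
print(looks_like_timeout(capture, timeout_s=30))  # a 30 s timeout would show gaps near 15 s
```

If the observed gaps do not match the configured value, the effective timeout is probably not what the configuration suggests, which is the hypothesis being tested here.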

@digitalkaoz digitalkaoz commented Oct 25, 2016

@michaelklishin the heartbeat is set to 10s as i told you above... but that didn't change anything, or do you mean some other value?

[screenshot: screen shot 2016-10-25 at 16 17 12]

Contributor

@andrewvc andrewvc commented Oct 26, 2016

Bumping versions fixed this issue for me @michaelklishin . Thanks for pushing the new version.

Opened this PR to bump it: logstash-plugins/logstash-mixin-rabbitmq_connection#32

We need a separate PR to do a better job logging bad connections.

Contributor

@andrewvc andrewvc commented Oct 26, 2016

Please respond to this ticket if you're still experiencing this bug @digitalkaoz and we can reopen it. You should be able to upgrade by running logstash-plugin update

@digitalkaoz digitalkaoz commented Oct 26, 2016

@andrewvc i'm not sure how it fixes it for you?

march_hare-2.19.0
logstash-mixin-rabbitmq_connection-4.2.0
logstash-input-rabbitmq-4.1.0

same thing: if i kill and start rabbit, logstash still doesn't recover the connection. the last message from logstash is now (thanks to your newest commit ;) )

{:timestamp=>"2016-10-26T19:12:31.687000+0000", :message=>"RabbitMQ connection was closed!", :url=>"amqp://guest:XXXXXX@proxy:5672/", :automatic_recovery=>true, :cause=>com.rabbitmq.client.ShutdownSignalException: connection error, :level=>:warn}

the important thing is to not restart logstash!
can you tell me the steps that worked on your side?

Contributor

@andrewvc andrewvc commented Oct 26, 2016

@digitalkaoz yes, I followed your procedure (using your excellent docker-compose):

  1. Start up the whole stack, wait for LS to come up
  2. docker-compose kill rabbit producer
  3. docker-compose up rabbit producer

That all works for me. Not for you? BTW, I'd recommend upgrading to logstash-input-rabbitmq 5.2.0

@digitalkaoz digitalkaoz commented Oct 27, 2016

still no luck:

logstash:5
logstash-mixin-rabbitmq_connection-4.2.0
logstash-input-rabbitmq-5.2.0
march_hare-2.19.0

what am i doing wrong?

my steps:

$ docker-compose up #all up
$ docker-compose kill rabbit producer #kill rabbit+producer
$ sleep 10
$ docker-compose start rabbit producer #restart rabbit+producer

logstash still won't catch up :/

Contributor

@michaelklishin michaelklishin commented Oct 27, 2016

@digitalkaoz I'm sorry to be so blunt but if you want to get to the bottom of it you need to put in some debugging effort of your own. We have provided several hypotheses, investigated at least two, explained what various log entries mean, what can be altered to identify what application may be introducing the behaviour observed, analyzed at least one Wireshark dump, and based on all the findings produced a new March Hare version. That's enough information to understand how to debug a small distributed system like this one.

Asking "what am I doing wrong" is unlikely to be productive. Trying things at random is unlikely to be efficient.

Contributor

@andrewvc andrewvc commented Oct 28, 2016

@digitalkaoz can you update the docker-compose stuff with the latest? One thing that was different in my repro is that I ran everything except logstash in docker. That shouldn't make a difference, but maybe it does.

Contributor

@andrewvc andrewvc commented Oct 31, 2016

Hey @digitalkaoz I'm still glad to help you get to the bottom of this. @michaelklishin thanks so much for your help, I really appreciate it!

I'm glad to personally put in some more hours here myself. What would help me move this forward @digitalkaoz is if you could update your docker config to the latest versions so we can be reproing the same things.

@digitalkaoz digitalkaoz commented Nov 1, 2016

sorry @andrewvc i had a busy weekend.

i updated everything to the latest versions, but still no luck. the repo is up to date now.

$ docker-compose up #all up
$ docker-compose kill rabbit producer
$ sleep 10
$ docker-compose start rabbit producer

Contributor

@michaelklishin michaelklishin commented Nov 1, 2016

In the snippet above both RabbitMQ and the publisher are shut down and started at the same time. Note that these are not the same steps that at least some in this thread performed, and they create a natural race condition between the two starting. If the producer starts before RabbitMQ then connection recovery WILL NOT BE ENGAGED since there never was a successful connection to recover: the initial TCP connection will fail.
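One way to take that race out of a reproduction is to have the producer wait until the broker port accepts TCP connections before it makes its first real connection attempt. The helper below is only a sketch of that idea; `wait_for_port` and its defaults are hypothetical and are not part of any RabbitMQ client.

```python
import socket
import time


def wait_for_port(host: str, port: int, deadline_s: float = 30.0,
                  interval_s: float = 0.5) -> bool:
    """Poll until a TCP connect to host:port succeeds or the deadline passes."""
    end = time.monotonic() + deadline_s
    while True:
        try:
            # A successful connect is closed again immediately.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= end:
                return False
            time.sleep(interval_s)
```

A producer could call `wait_for_port("rabbit", 5672)` (host name assumed) before opening its AMQP connection. With a proxy in between, a successful connect only proves that the proxy is listening, not that the broker behind it is up.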

@digitalkaoz digitalkaoz commented Nov 1, 2016

@michaelklishin the producer waits 10s before doing anything, so rabbit has time to start

Contributor

@andrewvc andrewvc commented Nov 1, 2016

I've reopened this issue now that I've reconfirmed it @digitalkaoz , but as I've traced the issue to what I believe to definitely be a MarchHare bug we should move the discussion to ruby-amqp/march_hare#107

Interestingly MarchHare does recover if rabbitmq comes back in under the 5s window. If it takes any longer it is perma-wedged.

@andrewvc andrewvc reopened this Nov 1, 2016

Contributor

@andrewvc andrewvc commented Nov 1, 2016

BTW, thanks for the excellent docker compose @digitalkaoz

Contributor

@jordansissel jordansissel commented Nov 1, 2016

@andrewvc and I worked on this for a while today. Packet capture in wireshark showed indeed that the rabbitmq client from Logstash was waiting 5 seconds (as expected) when rabbitmq was stopped before reconnecting. However, reconnecting was only attempted once:

  • ✔️ Client's connection is terminated by HAproxy because RabbitMQ was terminated
  • ✔️ Client initiates new connection to haproxy
  • ✔️ Client sends protocol negotiation: AMQP\0\0\9\1 ("AMQP" + hex 00000901)
  • ✔️ 5 seconds later, Client starts tcp teardown, sending FIN to close the connection. Presumably because the backend (haproxy) did not respond to the client's protocol negotiation
  • No further connection attempts are made by the client. Expectation is that after 5 seconds, again, the client would attempt to reconnect.

The problem we are looking at now is to try and understand why the rabbitmq client is not reconnecting after the first reconnection attempt had timed out.
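The expectation in the list above (another attempt every 5 seconds, even after an attempt that timed out) can be sketched as a plain retry loop. This is a generic illustration, not March Hare's or the Java client's recovery code; `recover` and its parameters are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def recover(connect: Callable[[], T], interval_s: float = 5.0,
            sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds, pausing interval_s between attempts.

    Any OSError, which includes socket timeouts while waiting for the peer
    to answer the protocol header, schedules another attempt instead of
    ending the loop.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except OSError as exc:
            print(f"attempt {attempt} failed: {exc!r}; retrying in {interval_s}s")
            sleep(interval_s)
```

The point of the sketch is that a handshake timeout is treated like any other failed attempt, so the loop cannot stop after the first retry.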

Contributor

@jordansissel jordansissel commented Nov 1, 2016

Based on the behavior, I was able to reproduce this without docker and without haproxy

  • Run rabbitmq + logstash
  • Let logstash rmq input connect
  • Stop rabbitmq
  • run nc -l 5672 to accept the next connection retry from Logstash
  • wait until 'AMQP' appears on nc (this is the rmq client attempting to reconnect)
  • After this attempt to reconnect, rmq input never tries again.

Contributor

@michaelklishin michaelklishin commented Nov 1, 2016

@jordansissel this is an edge case in the protocol. Before connection negotiation happens we cannot set up a heartbeat timer but without it it takes a long time to detect unresponsive peers (whether there is a TCP connection issue or just no response as in the nc example).

It therefore requires an intermediary or a bogus server that doesn't respond to a protocol header.
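Such a bogus server is easy to stand up for testing. The sketch below plays the role of `nc -l`: it accepts connections and reads whatever arrives but never answers. A small probe then sends the AMQP 0-9-1 protocol header and reports whether anything came back before a timeout. Both function names are made up for this example.

```python
import socket
import threading
from typing import Tuple

AMQP_HEADER = b"AMQP\x00\x00\x09\x01"  # AMQP 0-9-1 protocol header


def start_silent_server(host: str = "127.0.0.1") -> Tuple[socket.socket, int]:
    """Listen on an ephemeral port, accept connections, read, never reply."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))
    srv.listen(5)

    def swallow(conn: socket.socket) -> None:
        with conn:
            try:
                while conn.recv(4096):  # discard input until the client closes
                    pass
            except OSError:
                pass

    def serve() -> None:
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:  # listening socket was closed
                return
            threading.Thread(target=swallow, args=(conn,), daemon=True).start()

    threading.Thread(target=serve, daemon=True).start()
    return srv, srv.getsockname()[1]


def peer_answers_header(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Send the protocol header; True only if any bytes come back in time."""
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.sendall(AMQP_HEADER)
        try:
            return len(sock.recv(1)) > 0
        except socket.timeout:
            return False
```

Pointing a client at the port returned by `start_silent_server()` should reproduce the situation described here: the TCP connection succeeds, the header is sent, and no heartbeat timer exists yet to notice that the peer never replies.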

Contributor

@andrewvc andrewvc commented Nov 1, 2016

@michaelklishin confirmed this with @jordansissel . The fix is in this PR: ruby-amqp/march_hare#108

Thanks for pointing us to the right place!

Contributor

@michaelklishin michaelklishin commented Nov 1, 2016

@andrewvc @jordansissel thank you for finding a way to reproduce with netcat. If there is a way for you to verify end-to-end with March Hare master, I'd be happy to do a March Hare release tomorrow morning (European time).

Contributor

@michaelklishin michaelklishin commented Nov 2, 2016

March Hare 2.20.0 is up.

@digitalkaoz digitalkaoz commented Nov 2, 2016

@michaelklishin thanks, i'll test it now
@andrewvc can you release a 4.2.1 of logstash-mixin-rabbitmq_connection so i get march_hare 2.20.0 in?

@digitalkaoz digitalkaoz commented Nov 2, 2016

@michaelklishin @andrewvc @jordansissel i can confirm 2.20.0 of march_hare fixed the reconnection issues. thanks for all your efforts!

Contributor

@andrewvc andrewvc commented Nov 2, 2016

@digitalkaoz fantastic!

@andrewvc andrewvc closed this Nov 2, 2016

@jakauppila jakauppila commented Nov 3, 2016

Should these changes be ported over to logstash-output-rabbitmq as well?

@digitalkaoz digitalkaoz commented Nov 3, 2016

I think they already are: because the mixin (used by both input and output) loads march_hare, the output gets this fix too. Or am I wrong?
