No error if RabbitMQ goes down #76

Closed
ebuildy opened this Issue Mar 29, 2016 · 78 comments

@ebuildy
ebuildy commented Mar 29, 2016

Before version 3.3.0, when the Rabbit server went down, Logstash threw:

Queue subscription ended! Will retry in 5s
RabbitMQ connection down, will wait to retry subscription

With version 4.0.1, nothing happens (as you can see in the docker-compose logs here: https://gist.github.com/ebuildy/d405dab7dbaca2e3a03dc2a7fa6f5b29) if the queue server is down. Is that normal?

Thank you,

@andrewvc
Contributor

@ebuildy that is because we now use the RabbitMQ library's built-in auto-reconnect facility, which is more robust. We'd be glad to accept a patch that hooks into it and logs these events, however!
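
A patch along those lines could look roughly like the March Hare sketch below. This is only an illustration, not the plugin's actual code: it assumes the connection-level on_shutdown callback described in March Hare's error handling guide, and the logger is a stand-in for whatever the plugin would use.

require "march_hare"
require "logger"

logger = Logger.new($stdout)
connection = MarchHare.connect(host: "localhost")

# Assumed callback: fires when the connection closes, whether the broker went
# away, the network dropped, or the client shut down cleanly; `cause` carries
# the shutdown signal.
connection.on_shutdown do |conn, cause|
  logger.warn("RabbitMQ connection shut down: #{cause}")
end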

@ebuildy
ebuildy commented Mar 30, 2016

You are talking about this: https://www.rabbitmq.com/api-guide.html#recovery ?

Thank you. I must finish some PRs on other Logstash plugins and the RabbitMQ Java client (rabbitmq/rabbitmq-java-client#138) first, then I will take a look at it.

@digitalkaoz

@andrewvc besides there being no logs when the RabbitMQ server closes the channel, a reconnect isn't happening either, which is why I assume Logstash with the rabbitmq input is not usable at the moment, especially in a dockerized environment... any solutions here? We use HAProxy between Rabbit and Logstash.

@andrewvc
Contributor

@digitalkaoz why is the rabbitmq server closing the channel?

AFAIK if you do that the meaning is you want the client to stop and not restart. Please correct me if I'm wrong here.

@andrewvc
Contributor

@digitalkaoz that's a different situation, may I add, from one where the server has gone down as a result of an error.

@andrewvc
Contributor

@ebuildy @digitalkaoz , if you all can share a reproducible test case I'd be glad to try it and find a fix. Just list the steps to reproduce, preferably just using logstash and a local rabbitmq.

@digitalkaoz
digitalkaoz commented Oct 18, 2016 edited

@andrewvc the Rabbit and HAProxy servers are restarted during deployments (they are just Docker containers)

here is the setup:

rabbitmq -> 5672(tcp) -> haproxy -> 5672(tcp) -> logstash

rabbitmq.conf is default

haproxy.conf

listen amqp
        mode tcp
        option tcplog
        bind *:5672
        log global
        balance         roundrobin
        timeout connect 2s
        timeout client  2m
        timeout server  2m
        option          clitcpka
        option tcp-check
        server rabbitmq:5672 check port 5672

logstash.conf

input {
    rabbitmq {
        connection_timeout => 1000
        heartbeat => 30
        host => "haproxy"
        subscription_retry_interval_seconds => 30
        queue => "logs"
        user => "guest"
        password => "guest"
    }
}

If we only restart RabbitMQ, it seems fine when we gracefully shut down the server:

$ docker exec rabbitmq rabbitmqctl close_connection {$CON_ID} "rabbit shutdown"
$ docker exec rabbitmq rabbitmqctl stop_app

We explicitly close the connection to force the logstash-input-rabbitmq plugin to restart.

But if we restart HAProxy, the TCP connection gets cut off ungracefully and the Logstash plugin won't notice the channel is closed (and doesn't go into restart mode) because it received an error frame. The Java recovery code doesn't seem to handle that.

Hope that's enough information.

@digitalkaoz

I think that's a quite common scenario in a containerized world.

@digitalkaoz
digitalkaoz commented Oct 18, 2016 edited

When I restart Logstash I get the following error:

An unexpected error occurred!", :error=>com.rabbitmq.client.AlreadyClosedException: connection is already closed due to connection error; cause: java.io.EOFException, :class=>"Java::ComRabbitmqClient::AlreadyClosedException", :backtrace=>["com.rabbitmq.client.impl.AMQChannel.ensureIsOpen(com/rabbitmq/client/impl/AMQChannel.java:195)", "com.rabbitmq.client.impl.AMQChannel.rpc(com/rabbitmq/client/impl/AMQChannel.java:241)", "com.rabbitmq.client.impl.ChannelN.basicCancel(com/rabbitmq/client/impl/ChannelN.java:1127)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:498)", "RUBY.method_missing(/opt/logstash/vendor/bundle/jruby/1.9/gems/march_hare-2.15.0-java/lib/march_hare/channel.rb:874)", "RUBY.shutdown_consumer(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-rabbitmq-4.1.0/lib/logstash/inputs/rabbitmq.rb:279)", "RUBY.stop(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-rabbitmq-4.1.0/lib/logstash/inputs/rabbitmq.rb:273)", "RUBY.do_stop(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/inputs/base.rb:83)", "org.jruby.RubyArray.each(org/jruby/RubyArray.java:1613)", "RUBY.shutdown(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:385)", "RUBY.stop_pipeline(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:413)", "RUBY.shutdown_pipelines(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:406)", "org.jruby.RubyHash.each(org/jruby/RubyHash.java:1342)", "RUBY.shutdown_pipelines(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:406)", "RUBY.shutdown(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:402)", "RUBY.execute(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:233)", "RUBY.run(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/runner.rb:94)", "org.jruby.RubyProc.call(org/jruby/RubyProc.java:281)", "RUBY.run(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/runner.rb:99)", "org.jruby.RubyProc.call(org/jruby/RubyProc.java:281)", "RUBY.initialize(/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.22/lib/stud/task.rb:24)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}

@andrewvc
Contributor

@digitalkaoz thanks for the very detailed description here!

@michaelklishin should the java client's auto connection recovery just handle this?

@michaelklishin
Contributor

It takes time to detect peer unavailability. I don't see how this is a connection recovery issue: recovery cannot start if a client connection is still considered to be alive.

@michaelklishin
Contributor

Also: do not set heartbeats to a value lower than 5 seconds. That is a recipe for false positives.

@digitalkaoz
digitalkaoz commented Oct 19, 2016 edited

We waited a lot but Logstash won't recover the connection. I would rather get
false positives than nothing... any more ideas? The error message at the end
says it very clearly: the channel was already closed because it received an error
frame. The channel is unusable once it has received an error frame...

and neither this plugin, nor march_hare, nor the Java AMQP lib seems to notice that

@andrewvc
Contributor

@michaelklishin thanks for popping in!

He's setting the heartbeat to 30 seconds. It's confusing because the connection timeout is in millis, but the heartbeat is in seconds.

I think this is correct, yes? Looking at the source of the Java client, it looks like the timeout is set in millis and heartbeats in seconds. Is that all correct?
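
To illustrate the units being discussed, here is a sketch (not authoritative; the assumption is that March Hare passes :connection_timeout through to the Java client's setConnectionTimeout, which takes milliseconds, and :requested_heartbeat through to setRequestedHeartbeat, which takes seconds per the spec):

require "march_hare"

conn = MarchHare.connect(
  host:                "haproxy",
  connection_timeout:  1000, # milliseconds -> 1 second to establish the connection
  requested_heartbeat: 30    # seconds -> heartbeat frames roughly every 15 seconds
)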

@digitalkaoz

@andrewvc @michaelklishin I created a tiny docker-compose setup to reproduce this issue:

https://github.com/digitalkaoz/rabbit-haproxy-logstash-fail

simply follow the reproduction steps...

@michaelklishin
Contributor

Heartbeat timeout is in seconds per protocol spec. Also see this thread.

@michaelklishin
Contributor

So you have HAproxy in between. HAproxy has its own timeouts (which heartbeats in theory make irrelevant but still) and what RabbitMQ sees is a TCP connection from HAproxy, not the client. I can do a March Hare release with a preview version of the 3.6.6 Java client, however.

@andrewvc
Contributor

Thanks @digitalkaoz , I'll try and repro today.

@andrewvc
Contributor

Also thanks for the background @michaelklishin . I'm wondering if what's happening is that HAProxy is switching the backend without letting logstash know, and without disconnecting from logstash. Will report back.

@digitalkaoz

It still fails when I wait, let's say, 5 minutes... @michaelklishin as I understand it, the relevant fix is in https://github.com/rabbitmq/rabbitmq-java-client/releases/tag/v4.0.0.M2 ?

@michaelklishin
Contributor

@andrewvc typically proxies propagate upstream connection loss. At least that's what HAproxy does by default. In fact, if it didn't then the two logical ends of the TCP connection would be out of sync.

I doubt it's a Java client problem here but I've updated March Hare to the latest 3.6.x version from source.

@michaelklishin
Contributor

@digitalkaoz I'm sorry but there is no evidence that I have seen that this is a client library issue. A good thing to try would be to eliminate HAproxy from the picture entirely and also verify the effective heartbeat timeout value (management UI displays it on the connection page).

@digitalkaoz
digitalkaoz commented Oct 25, 2016 edited

@michaelklishin sure, if I remove HAProxy from the setup the reconnect works. But that's not a typical setup (at least not at scale).

With HAProxy in the middle, the reconnect fails (even with a heartbeat of 10s), but my question is why? Shouldn't the heartbeat tell the plugin that the connection isn't healthy anymore and reinitiate a recovery/reconnect?

@michaelklishin
Contributor

It should. We don't know why that doesn't happen: that's why I recommended checking the actual effective value.

@digitalkaoz

After restarting Rabbit there is no connection alive on the Rabbit node because nobody initiated a new connection, neither HAProxy nor Logstash, because their connection is still there but de facto useless?

@andrewvc
Contributor

@michaelklishin, is there a best practice for accomplishing what @digitalkaoz wants to accomplish? I think having a layer of indirection via a load balancer makes sense. This is something that's come up multiple times; I'm wondering if the RabbitMQ project has an official stance here.

@michaelklishin
Contributor

There's no stance as HAproxy is supposed to be transparent to the client.

@digitalkaoz HAproxy is not supposed to initiate any connections upstream unless a new client connects to it.

Can we please stop guessing and move on to forming hypotheses and verifying them? The first one I have is: the heartbeat value is actually not set to what the reporter thinks it is. I've mentioned how this can be verified above. Without this we can pour hours and hours of time into it and get nothing out of it. My team already spends hours every day on various support questions and we would appreciate it if reporters cooperated a bit to save us all some time.

I also pushed a Java client update to March Hare, even though I doubt it will change anything.

A Wireshark/tcpdump capture will provide a lot of information to look at.

@michaelklishin
Contributor

A Wireshark capture should make it easy to see heartbeat frames on the wire: not only their actual presence but also timestamps (they are sent at roughly 1/2 of the effective timeout value).

@digitalkaoz

@michaelklishin the heartbeat is set to 10s, as I told you above... but that didn't change anything. Or do you mean some other value?

[Screenshot attached: screen shot 2016-10-25 at 16 17 12]

@michaelklishin
Contributor

What is in server logs? HAproxy logs?

@digitalkaoz
digitalkaoz commented Oct 25, 2016 edited

Hmm, the HAproxy logs are not very helpful:

proxy_1     | 00000000:amqp.srvcls[0007:0008]
proxy_1     | 00000000:amqp.clicls[0007:0008]
proxy_1     | 00000000:amqp.closed[0007:0008]
rabbit_1 exited with code 137
proxy_1     | 00000001:amqp.accept(0006)=0008 from [172.18.0.5:49608]
proxy_1     | [WARNING] 298/143248 (1) : Server amqp/rabbit is DOWN, reason: Layer4 timeout, check duration: 5001ms. 0 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
proxy_1     | [ALERT] 298/143248 (1) : proxy 'amqp' has no server available!

172.18.0.5 is logstash btw

@michaelklishin
Contributor

@digitalkaoz can you please post a less abridged log together with RabbitMQ node and Logstash logs?

@michaelklishin
Contributor

According to the abridged log above, HAproxy does accept a client connection, then detects that an upstream node is down and there are no more upstreams to try. Then… we don't know what happens because the log ends.

@digitalkaoz

there are no more logs from any of the systems (neither haproxy, nor logstash)

@michaelklishin
Contributor

@digitalkaoz what about RabbitMQ? Can you take a tcpdump or Wireshark capture?

@digitalkaoz

Can I recommend cloning the repo? I'm not very familiar with TCP connection debugging :/

@digitalkaoz
digitalkaoz commented Oct 25, 2016 edited

@michaelklishin I attached the full log here: https://gist.github.com/digitalkaoz/8f1825dd1ad2976a1516afdccf120e7f

logstash: 172.18.0.5
haproxy: 172.18.0.2
rabbit: 172.18.0.3

@michaelklishin
Contributor

From RabbitMQ log, after a few inbound connections and restarts:

rabbit_1    | accepting AMQP connection <0.400.0> (172.18.0.3:38102 -> 172.18.0.2:5672)
rabbit_1    |
rabbit_1    | =WARNING REPORT==== 25-Oct-2016::14:49:07 ===
rabbit_1    | closing AMQP connection <0.400.0> (172.18.0.3:38102 -> 172.18.0.2:5672):
rabbit_1    | client unexpectedly closed TCP connection

This means that at 25-Oct-2016::14:49:07 RabbitMQ did register an inbound AMQP 0-9-1 client connection and then its TCP connection was immediately closed by HAproxy.

I am not sure what exactly srvcls and clicls are, probably "server connection closed" and "client connection closed", respectively:

proxy_1     | 00000001:amqp.srvcls[0008:0007]
proxy_1     | 00000001:amqp.clicls[0008:0007]
proxy_1     | 00000001:amqp.closed[0008:0007]

I'm not sure why it happens, perhaps HAproxy decides to reset all connections now that it has an upstream to connect to. Also this should not prevent March Hare from attempting to connect again after a period of time.

Can you try March Hare master?
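
For reference, one way to try an unreleased March Hare from git in a standalone JRuby reproduction script is a Gemfile like the one below (illustrative only; the git URL is the public repository, the rest is ordinary Bundler boilerplate):

# Gemfile
source "https://rubygems.org"

gem "march_hare", git: "https://github.com/ruby-amqp/march_hare"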

@digitalkaoz

Maybe we also need an HAProxy ninja? Everything I try is an educated guess at how to configure all of those 5000 different options in my TCP proxy. Do you know any?

@andrewvc
Contributor

Here's one interesting datapoint: if I kill haproxy and start it back up (using your very useful docker-compose @digitalkaoz), everything works great. So it's something in the haproxy config, I think.

@andrewvc
Contributor

OK, well, this is definitely something between the Java client and HAProxy. Here's a pcap dump using the docker compose stuff from the haproxy node.

The sequence of events was:

  1. Everything running correctly
  2. RMQ killed
  3. RMQ brought back up
  4. Logstash restarted
  5. Producer Restarted

I'm too busy to keep digging on this till next week. Maybe someone else could find something here.

Fun fact, if I kill and restart haproxy everything recovers correctly. It's something about the way connections are re-initiated through haproxy that is funky.

@digitalkaoz

But nevertheless, the Logstash plugin should recognize the closed channel and
start the recovery process.

@andrewvc
Contributor

@digitalkaoz that's the tricky thing. The Java library is supposed to do this! According to http://rubymarchhare.info/articles/error_handling.html March Hare should handle that itself. That makes it a client library bug. Please correct me if I'm wrong @michaelklishin

@andrewvc
Contributor

I agree that we should make this work in Logstash, BTW. If the client cannot be fixed then we'll have to find a workaround, but I'd prefer to see if we can fix the client first.

@andrewvc
Contributor

I think the key issue here is RMQ server identity. If I stop and start the proxy it's fine. If I start and stop the proxy AND the rabbitmq server it is not. @michaelklishin could the issue be that the recovery code assumes the consumer will be bound to the same server?

That is to say, the automatic recovery assumes that the other end would be the same rabbitmq instance that was there before, but swapping instances out is not supported?

@michaelklishin
Contributor

As soon as there's evidence (and not guesses) that there's a March Hare or Java client bug, I'd be happy to look into it. So far I see no solid evidence. I have provided enough pointers, hypotheses and general investigation directions. My recommendations are being largely ignored (it takes several requests to get data, no traffic captures provided, March Hare master wasn't tried).

@michaelklishin
Contributor
michaelklishin commented Oct 25, 2016 edited

@andrewvc clients have no way of identifying what RabbitMQ server they are connected to. It would be pretty silly to only allow recovery to re-connect to the same node because in practice reconnection to the same node will not succeed for a while.

@michaelklishin
Contributor
michaelklishin commented Oct 25, 2016 edited

So I have conducted a little experiment with March Hare master, vanilla RabbitMQ 3.6.5 with all defaults, and HAproxy 1.5 (not sure what the exact version is, possibly even a -dev one) with
the following config:

global
        log 127.0.0.1   local0
        log 127.0.0.1   local1 notice
        maxconn 4096
        user antares
        group staff
        # daemon
        debug

defaults
        log        global
        option     dontlognull
        option     redispatch
        retries    3
        maxconn    2000
        contimeout 5000
        clitimeout 50000
        srvtimeout 50000

backend rabbitmq
        mode         tcp
        server node1 127.0.0.1:5672 maxconn 2048

frontend rabbitmq-in
        bind *:5673
        balance
        default_backend rabbitmq

My March Hare connection code is about as trivial as it gets:

c = MarchHare.connect(port: 5673, requested_heartbeat: 8)

and then I periodically check if the connection is open:

c.open? # => true

I then shut down RabbitMQ a few times, observe that the connection is closed, start it back, observe that it is open again in a few seconds. This repeatedly worked 5 times in a row.

Here's HAproxy log, you can see inbound connections from different ephemeral ports (read: different client sockets):

00000001:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53004]
00000001:rabbitmq.srvcls[0005:0006]
00000001:rabbitmq.clicls[0005:0006]
00000001:rabbitmq.closed[0005:0006]
00000002:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53023]
00000002:rabbitmq.clicls[0005:0006]
00000003:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53033]
00000003:rabbitmq.srvcls[0005:0007]
00000003:rabbitmq.clicls[0005:0007]
00000003:rabbitmq.closed[0005:0007]
00000004:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53038]
00000004:rabbitmq.srvcls[0005:0007]
00000004:rabbitmq.clicls[0005:0007]
00000004:rabbitmq.closed[0005:0007]
00000005:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53054]
00000005:rabbitmq.srvcls[0005:0007]
00000005:rabbitmq.clicls[0005:0007]
00000005:rabbitmq.closed[0005:0007]
00000006:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53060]
00000006:rabbitmq.clicls[0005:0007]
00000007:rabbitmq-in.accept(0004)=0005 from [127.0.0.1:53067]

Here's RabbitMQ log from the most recent restart:

=INFO REPORT==== 25-Oct-2016::22:45:03 ===
Stopped RabbitMQ application

=INFO REPORT==== 25-Oct-2016::22:45:03 ===
Halting Erlang VM

=INFO REPORT==== 25-Oct-2016::22:45:13 ===
Starting RabbitMQ 3.6.5 on Erlang 19.1
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

…

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Memory limit set to 5381MB of 5978MB total.

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Disk free limit set to 6269MB

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Limiting to approx 499900 file handles (449908 sockets)

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
FHC read buffering:  OFF
FHC write buffering: ON

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Management plugin: using rates mode 'basic'

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
started TCP Listener on [::]:5672

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
started SSL Listener on [::]:5671

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
rabbit_stomp: default user 'guest' enabled

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
started STOMP TCP Listener on [::]:61613

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
MQTT retained message store: rabbit_mqtt_retained_msg_store_dets

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
started MQTT TCP Listener on [::]:1883

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
opening log file: "/tmp/rabbit-mgmt/access.log.2016_10_25_19"

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Management plugin started. Port: 15672

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics event collector started.

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics channel stats collector started.

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics queue stats collector started.

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics database started.

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics garbage collector started for table aggr_queue_stats_fine_stats with interval 5000.

…

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics garbage collector started for table aggr_exchange_stats_fine_stats with interval 5000.

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Server startup complete; 12 plugins started.
 * rabbitmq_shovel_management
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * webmachine
 * rabbitmq_jms_topic_exchange
 * rabbitmq_consistent_hash_exchange
 * rabbitmq_shovel
 * rabbitmq_management_agent
 * rabbitmq_mqtt
 * rabbitmq_stomp
 * mochiweb
 * amqp_client

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics garbage collector started for table aggr_node_stats_coarse_node_stats with interval 5000.

…

=INFO REPORT==== 25-Oct-2016::22:45:15 ===
Statistics garbage collector started for table connection_stats with interval 5000.

=INFO REPORT==== 25-Oct-2016::22:45:19 ===
accepting AMQP connection <0.629.0> (127.0.0.1:53068 -> 127.0.0.1:5672)

As you can see, the most recent HAproxy connection is on port 53067 and RabbitMQ accepts a connection from 53068 (HAproxy), which means the two descriptors were opened in rapid succession. So yes, this is real: March Hare recovers successfully when an HAproxy upstream goes down.

@michaelklishin
Contributor

In the provided TCP dump file, if we apply a display filter that only displays the packets/segments/frames sent by the Logstash process (which seems to be on host 172.19.0.5, according to my understanding of the heartbeat frame exchange) like this:

ip.src eq 172.19.0.5

we see that it attempts to connect and gets an RST (reset) packet:

173 2016-10-25 17:19:06.146341  172.19.0.5  59566   172.19.0.4  5672    TCP 54  59566 → 5672 [RST] Seq=10 Win=0 Len=0

and then the next time it tries and succeeds is over a minute later:

236 2016-10-25 17:20:26.381197  172.19.0.5  59812   172.19.0.4  5672    TCP 66  59812 → 5672 [ACK] Seq=9 Ack=494 Win=30336 Len=0 TSval=362368 TSecr=362368

so depending on the recovery interval used, this may or may not indicate that the client did not try to reconnect more than once. Default interval is 5 seconds.
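
For anyone experimenting with that interval: it is a knob on the underlying Java client, and in plain JRuby it could be adjusted roughly like this (a sketch; it assumes JRuby with the RabbitMQ Java client on the classpath, and is not the Logstash plugin's code):

require "java"
java_import "com.rabbitmq.client.ConnectionFactory"

factory = ConnectionFactory.new
factory.automatic_recovery_enabled = true    # setAutomaticRecoveryEnabled(true)
factory.network_recovery_interval  = 10_000  # setNetworkRecoveryInterval, in milliseconds (default 5000)
conn = factory.new_connection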

@digitalkaoz

@michaelklishin thanks for digging into this. I'm trying to help all I can, but I'm not really familiar with the Java RabbitMQ client, nor march_hare, nor the Logstash plugins.

Can you open a channel too (not just a connection)? Because that's what the Logstash plugin does, and that is what gets closed unnoticed.

@michaelklishin
Contributor
jruby-9.1.2.0 :003 > [c.open?, ch.open?]
 => [true, true]
# stop RabbitMQ
jruby-9.1.2.0 :004 > [c.open?, ch.open?]
 => [false, false]
jruby-9.1.2.0 :005 > [c.open?, ch.open?]
 => [false, false]
# start RabbitMQ
jruby-9.1.2.0 :006 > [c.open?, ch.open?]
 => [false, false]
# node finishes booting
jruby-9.1.2.0 :007 > [c.open?, ch.open?]
 => [true, true]
@michaelklishin
Contributor

March Hare 2.19.0 is out.

@andrewvc
Contributor

Great, I'll try upgrading tomorrow. With 2.18.0 I was seeing a state where the connection was closed but auto-recovery wasn't touching it. I'll have more tomorrow.

@andrewvc andrewvc added a commit to andrewvc/logstash-mixin-rabbitmq_connection that referenced this issue Oct 26, 2016
@andrewvc andrewvc Bump march_hare version to 2.19.0 to fix reconnect issues
This fixes
logstash-plugins/logstash-input-rabbitmq#76
where with ha_proxy switching rabbitmq backends would cause it to stall
indefinitely
de9af23
@andrewvc andrewvc added a commit to andrewvc/logstash-mixin-rabbitmq_connection that referenced this issue Oct 26, 2016
@andrewvc andrewvc Bump march_hare version to 2.19.0 to fix reconnect issues
This fixes
logstash-plugins/logstash-input-rabbitmq#76
where with ha_proxy switching rabbitmq backends would cause it to stall
indefinitely
6d9aa88
@andrewvc andrewvc referenced this issue in logstash-plugins/logstash-mixin-rabbitmq_connection Oct 26, 2016
Closed

Bump march_hare version to 2.19.0 to fix reconnect issues #32

@andrewvc
Contributor

Bumping versions fixed this issue for me @michaelklishin. Thanks for pushing the new version.

Opened this PR to bump it logstash-plugins/logstash-mixin-rabbitmq_connection#32

We need a separate PR to do a better job logging bad connections.

@elasticsearch-bot elasticsearch-bot pushed a commit to logstash-plugins/logstash-mixin-rabbitmq_connection that closed this issue Oct 26, 2016
@andrewvc andrewvc Bump march_hare version to 2.19.0 to fix reconnect issues
This fixes
logstash-plugins/logstash-input-rabbitmq#76
where with ha_proxy switching rabbitmq backends would cause it to stall
indefinitely

Fixes #32
8d65704
@andrewvc
Contributor

Please respond to this ticket if you're still experiencing this bug @digitalkaoz and we can reopen it. You should be able to upgrade by running logstash-plugin --update

@andrewvc andrewvc referenced this issue in logstash-plugins/logstash-mixin-rabbitmq_connection Oct 26, 2016
Closed

Add conn shutdown error log. Improve other error log messages #33

@digitalkaoz
digitalkaoz commented Oct 26, 2016 edited

@andrewvc I'm not sure how it fixes it for you?

march_hare-2.19.0
logstash-mixin-rabbitmq_connection-4.2.0
logstash-input-rabbitmq-4.1.0

Same thing: if I kill and start Rabbit, Logstash still doesn't recover the connection. The last message from Logstash is now (thanks to your newest commit ;) ):

{:timestamp=>"2016-10-26T19:12:31.687000+0000", :message=>"RabbitMQ connection was closed!", :url=>"amqp://guest:XXXXXX@proxy:5672/", :automatic_recovery=>true, :cause=>com.rabbitmq.client.ShutdownSignalException: connection error, :level=>:warn}

The important thing is to not restart Logstash!
Can you tell me your steps, how did it work on your side?

@andrewvc
Contributor
andrewvc commented Oct 26, 2016 edited

@digitalkaoz yes, I followed your procedure (using your excellent docker-compose):

  1. Start up the whole stack, wait for LS to come up
  2. docker-compose kill rabbit producer
  3. docker-compose up rabbit producer

That all works for me. Not for you? BTW, I'd recommend upgrading to logstash-input-rabbitmq 5.2.0

@digitalkaoz
digitalkaoz commented Oct 27, 2016 edited

still no luck:

logstash:5
logstash-mixin-rabbitmq_connection-4.2.0
logstash-input-rabbitmq-5.2.0
march_hare-2.19.0

What am I doing wrong?

my steps:

$ docker-compose up # all up
$ docker-compose kill rabbit producer #kill rabbit+producer
$ sleep 10
$ docker-compose start rabbit producer #restart rabbit+producer

Still, Logstash won't catch up :/

@michaelklishin
Contributor

@digitalkaoz I'm sorry to be so blunt but if you want to get to the bottom of it you need to put in some debugging effort of your own. We have provided several hypotheses, investigated at least two, explained what various log entries mean, what can be altered to identify what application may be introducing the behaviour observed, analyzed at least one Wireshark dump, and based on all the findings produced a new March Hare version. That's enough information to understand how to debug a small distributed system like this one.

Asking "what am I doing wrong" is unlikely to be productive. Trying things at random is unlikely to be efficient.

@andrewvc
Contributor

@digitalkaoz can you update the docker-compose stuff with the latest? One thing that was different in my repro is that I ran everything except logstash in docker. That shouldn't make a difference, but maybe it does.

@andrewvc
Contributor

Hey @digitalkaoz I'm still glad to help you get to the bottom of this. @michaelklishin thanks so much for your help, I really appreciate it!

I'm glad to personally put in some more hours here myself. What would help me move this forward @digitalkaoz is if you could update your docker config to the latest versions so we can be reproing the same things.

@digitalkaoz

Sorry @andrewvc, I had a busy weekend.

I updated everything to the latest versions, but still no luck. The repo is up to date now.

$ docker-compose up # all up
$ docker-compose kill rabbit producer
$ sleep 10
$ docker-compose start rabbit producer
@michaelklishin
Contributor

In the snippet above both RabbitMQ and the publisher are shut down and started at the same time. Note that these are not the same steps that at least some in this thread performed, and it creates a natural race condition between the two starting. If the producer starts before RabbitMQ then connection recovery WILL NOT BE ENGAGED, since there never was a successful connection to recover: the initial TCP connection will fail.

@digitalkaoz

@michaelklishin the producer waits 10s before doing anything, so Rabbit has time to start.

@andrewvc
Contributor
andrewvc commented Nov 1, 2016

I've reopened this issue now that I've reconfirmed it @digitalkaoz, but as I've traced the issue to what I believe is definitely a March Hare bug, we should move the discussion to ruby-amqp/march_hare#107

Interestingly, March Hare does recover if RabbitMQ comes back within the 5s window. If it takes any longer it is perma-wedged.

@andrewvc andrewvc reopened this Nov 1, 2016
@andrewvc
Contributor
andrewvc commented Nov 1, 2016

BTW, thanks for the excellent docker compose @digitalkaoz

@jordansissel
jordansissel commented Nov 1, 2016 edited

@andrewvc and I worked on this for a while today. A packet capture in Wireshark showed that the RabbitMQ client in Logstash was indeed waiting 5 seconds (as expected) after RabbitMQ was stopped before reconnecting. However, reconnecting was only attempted once:

  • ✔️ Client's connection is terminated by HAproxy because RabbitMQ was terminated
  • ✔️ Client initiates new connection to haproxy
  • ✔️ Client sends protocol negotiation: AMQP\0\0\9\1 ("AMQP" + hex 00000901)
  • ✔️ 5 seconds later, Client starts tcp teardown, sending FIN to close the connection. Presumably because the backend (haproxy) did not respond to the client's protocol negotiation
  • No further connection attempts are made by the client. Expectation is that after 5 seconds, again, the client would attempt to reconnect.

The problem we are looking at now is to try and understand why the rabbitmq client is not reconnecting after the first reconnection attempt had timed out.

@jordansissel
jordansissel commented Nov 1, 2016 edited

Based on the behavior, I was able to reproduce this without docker and without haproxy

  • Run rabbitmq + logstash
  • Let logstash rmq input connect
  • Stop rabbitmq
  • run nc -l 5672 to accept the next connection retry from Logstash
  • wait until 'AMQP' appears on nc (this is the rmq client attempting to reconnect)
  • After this attempt to reconnect, rmq input never tries again.
@michaelklishin
Contributor

@jordansissel this is an edge case in the protocol. Before connection negotiation happens we cannot set up a heartbeat timer, but without one it takes a long time to detect unresponsive peers (whether there is a TCP connection issue or just no response, as in the nc example).

Reproducing it therefore requires an intermediary or a bogus server that doesn't respond to the protocol header.
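
For anyone who prefers to script the nc -l 5672 step, here is a minimal Ruby stand-in (a sketch for local experimentation only, not part of any test suite): a bogus "server" that accepts the TCP connection and reads the protocol header but never answers, which is exactly the pre-negotiation state where no heartbeat timer exists yet.

require "socket"

server = TCPServer.new(5672)
loop do
  client = server.accept
  header = client.read(8) # the client sends "AMQP" followed by 0x00 0x00 0x09 0x01
  puts "got protocol header #{header.inspect} -- deliberately not replying"
  # Send nothing back and keep the socket open: the client is left waiting on
  # protocol negotiation, with no heartbeat timer to detect the stalled peer.
end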

@andrewvc
Contributor
andrewvc commented Nov 1, 2016

@michaelklishin confirmed this with @jordansissel . The fix is in this PR: ruby-amqp/march_hare#108

Thanks for pointing us to the right place!

@michaelklishin
Contributor

@andrewvc @jordansissel thank you for finding a way to reproduce with netcat. If there is a way for you to verify end-to-end with March Hare master, I'd be happy to do a March Hare release tomorrow morning (European time).

@michaelklishin
Contributor

March Hare 2.20.0 is up.

@digitalkaoz
digitalkaoz commented Nov 2, 2016 edited

@michaelklishin thanks, I'll test it now.
@andrewvc can you release a 4.2.1 of logstash-mixin-rabbitmq_connection so I get march_hare 2.20.0 in?

@andrewvc andrewvc referenced this issue in logstash-plugins/logstash-mixin-rabbitmq_connection Nov 2, 2016
Closed

Bump march hare to 2.20.0 #34

@digitalkaoz

@michaelklishin @andrewvc @jordansissel I can confirm that march_hare 2.20.0 fixed the reconnection issues. Thanks for all your efforts!

@andrewvc
Contributor
andrewvc commented Nov 2, 2016

@digitalkaoz fantastic!

@andrewvc andrewvc closed this Nov 2, 2016
@Jakauppila

Should these changes be ported over to logstash-output-rabbitmq as well?

@digitalkaoz

I think it is: because the mixin (used by both the input and the output) loads
march_hare, the output gets this fix too. Or am I wrong?
