
Reconnection issues after a failure #377

Closed · freyssin opened this issue Feb 26, 2020 · 7 comments · Fixed by #370

@freyssin

We have a client that subscribes to a topic with QoS 2 in a durable session (cleanSession = false). This client uses the HiveMQ blocking API with the automatic reconnection feature. The test works correctly in the absence of failures, but the subscriber stops receiving messages after the server is stopped and restarted.

Expected behavior

We expect the client to reconnect to the server and continue to receive messages (no loss, no duplication).

Actual behavior

The client regularly reconnects to the server and then closes the connection. No messages are received.

Detailed execution

Below is the initialization code of the subscriber:

Mqtt3BlockingClient client = Mqtt3Client.builder()
      .identifier(clientid)
      .serverHost(uri.getHost())
      .serverPort(uri.getPort())
      .automaticReconnectWithDefaultConfig()
      .buildBlocking();
// connect with cleanSession = false to get a durable session
client.connectWith().cleanSession(false).send();
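For completeness, the QoS 2 subscription described above might look like the following with the blocking API; this is a sketch, and the topic filter is a hypothetical placeholder rather than the one from our test:

// import com.hivemq.client.mqtt.datatypes.MqttQos;
// Hypothetical sketch of the QoS 2 subscription described above;
// "some/topic" is a placeholder, not the actual topic from the report.
client.subscribeWith()
      .topicFilter("some/topic")
      .qos(MqttQos.EXACTLY_ONCE)
      .send();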

In the table below we reproduce the sequence of messages before and after the disconnection (we have a Wireshark capture of this sequence if needed; the numbers in the first column are Wireshark frame numbers). As explained below, the problem seems to come from the client's session recovery after reconnection.

Frame   Server (frames sent)              Client (frames sent)
        PUBLISH #41
                                          PUBREC #41
        PUBREL #41
                                          PUBCOMP #41
        PUBLISH #42
                                          PUBREC #42
        == Server Stop/Restart ==         == Server Stop/Restart ==
571                                       CONNECT
572     CONNACK
574     PUBLISH #42 (DUP)
575                                       PUBREC #42
577     PUBREL #42
578                                       TCP Connection closed [FIN/ACK]
580     TCP Connection closed [RST/ACK]

613                                       CONNECT
614     CONNACK
615     PUBREL #42
617                                       TCP Connection closed [FIN/ACK]
619     TCP Connection closed [RST/ACK]

The sequence 613-619 is then repeated regularly.

The behavior of the HiveMQ client seems surprising to me:

  1. Before the failure, message #42 is received by the client and the corresponding PUBREC is sent in response before the flow is interrupted.
  2. After the failure, the server resends message #42 (frame 574). This corresponds to the case where the server has not received the PUBREC: the message was in fact received at the TCP level but had not been processed before the failure. The duplicated message #42 is acknowledged by the client with a new PUBREC (575). This behavior also complies with the specification.
  3. The server then sends the PUBREL corresponding to the PUBREC sent by the client, which seems to trigger the client to close the connection. This is strange and could be interpreted as the client considering the PUBREL message illegal. On the contrary, we believe it complies with the specification.
  4. After each reconnection the server retransmits the PUBREL of message #42 (the last unacknowledged protocol message). We believe this complies with the specification, yet each time the client responds by closing the connection (see the sketch after this list for the receiver-side flow we expect).
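
For reference, here is a minimal sketch of the receiver-side QoS 2 flow as we read the MQTT 3.1.1 specification. This is not the HiveMQ client's actual implementation; all names are hypothetical and the session store is simplified. The point is that a retransmitted PUBREL must be answered with PUBCOMP, not treated as a protocol error:

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of receiver-side QoS 2 handling per MQTT 3.1.1;
// not the HiveMQ client's actual code.
final class Qos2Receiver {

    // Packet ids for which a PUBREC was sent but no PUBREL received yet.
    // For a durable session (cleanSession = false) this survives reconnections.
    private final Set<Integer> awaitingPubrel = new HashSet<>();

    void onPublishQos2(int packetId, boolean dup) {
        // Every QoS 2 PUBLISH is answered with PUBREC, including a DUP
        // redelivery after a failure (frames 574/575 in the table above).
        awaitingPubrel.add(packetId);
        sendToServer("PUBREC " + packetId);
    }

    void onPubrel(int packetId) {
        // A PUBREL, even one retransmitted after a reconnection (frame 615),
        // completes the flow and is answered with PUBCOMP.
        awaitingPubrel.remove(packetId);
        sendToServer("PUBCOMP " + packetId);
    }

    private void sendToServer(String frame) {
        System.out.println(frame); // placeholder for the real network write
    }
}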

Details

  • Affected HiveMQ MQTT Client version(s): 1.1.3
  • Used JVM version:
    java version "1.8.0_221"
    Java(TM) SE Runtime Environment (build 1.8.0_221-b11)
    Java HotSpot(TM) 64-Bit Server VM (build 25.221-b11, mixed mode)
@SgtSilvio (Member)

Hi @freyssin
We have never seen such behavior before.
Can you please add information about:

  • which broker you are using
  • more detailed information about the packets, such as packet ids

@freyssin (Author)

Hi Silvio,

First of all, I want to thank you for the speed of your response.

The broker is proprietary software that has worked for many years with Fuse and Paho clients. I would like to validate it with the HiveMQ client because it offers an interesting and ambitious API.

I understand that you cannot test and support all MQTT brokers, which is why my report gives a detailed description of the MQTT exchanges between the client and the server. From my point of view all the actions initiated by the broker seem to comply with the MQTT specification, but it is possible that a detail escaped me. To let you analyze the situation easily, I am enclosing a Wireshark capture containing the last packets exchanged before the failure and all of the packets during reconnection (only 10 MQTT frames). The ZIP file also contains a CSV export of these frames:

hivemq-reconnect-filtered.zip

Thank you.

@SgtSilvio (Member)

Thank you for the information.
Could you please add a DisconnectedListener to the client and provide the stack trace of the cause of the disconnection:

Mqtt5Client.builder()
        ...
        .addDisconnectedListener(context -> context.getCause().printStackTrace())
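
Applied to the Mqtt3 builder from the report, the same listener might be attached like this (a sketch reusing the variables from the initialization code above):

// Sketch: the suggested DisconnectedListener on the reporter's Mqtt3 builder,
// printing the stack trace of each disconnection cause.
Mqtt3Client.builder()
        .identifier(clientid)
        .serverHost(uri.getHost())
        .serverPort(uri.getPort())
        .automaticReconnectWithDefaultConfig()
        .addDisconnectedListener(context -> context.getCause().printStackTrace())
        .buildBlocking();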

@freyssin (Author)

I had to investigate a little because the problem does not appear on every stop. After 3 attempts I got the following message (multiple times, but without a stack trace):
"com.hivemq.client.mqtt.mqtt3.exceptions.Mqtt3DisconnectException: PUBREL must not be received with the same packet identifier as a QoS 2 PUBLISH when no PUBREC has been sent yet".

I will analyze the network traces to verify that we are in the same situation.

@SgtSilvio (Member)

This message helped.
The current behavior is indeed wrong, but it is already fixed by #370, which will be merged and released soon.
Please try out the snapshot version improvement~memory-improvements-SNAPSHOT (documentation on how to use snapshots: https://hivemq.github.io/hivemq-mqtt-client/docs/installation/#snapshots).

@SgtSilvio SgtSilvio added the bug label Feb 26, 2020
@SgtSilvio SgtSilvio added this to To do in Version 1.1.x via automation Feb 26, 2020
@SgtSilvio SgtSilvio added this to the 1.1.4 milestone Feb 26, 2020
@SgtSilvio SgtSilvio moved this from To do to Review in progress in Version 1.1.x Feb 26, 2020
@SgtSilvio SgtSilvio self-assigned this Feb 26, 2020
@freyssin (Author)

I have tested again with the snapshot version you suggested, and the problem no longer seems to occur.
Thank you, and congratulations on the speed of your responses.

@SgtSilvio (Member)

@freyssin Thank you for your valuable and very detailed issue. This issue will be closed when the PR is merged.

Version 1.1.x automation moved this from Review in progress to Done Mar 2, 2020