Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tcp connection timeout every ~4hours #386

Closed
0hax opened this issue Mar 12, 2020 · 7 comments
Closed

tcp connection timeout every ~4hours #386

0hax opened this issue Mar 12, 2020 · 7 comments

Comments

@0hax
Copy link

0hax commented Mar 12, 2020

Hi Martin,
Thanks for all the work you have done on this project, it's really amazing!

I'm currently running the esp wifi repeater with multiple ESPurna enabled Sonoff devices connected to it.
All devices are communicating with a MQTT broker directly connected to my router.

Every ~4hours I can see all devices disconnecting from the MQTT broker with Mosquitto broker saying:
"Client ESPURNA-BE9FA4 has exceeded timeout, disconnecting."

So I decided to do a tcpdump of this issue and found out that when the disconnect happens, we get a TCP Reset packet.
The tcpdump runs from the raspberry where the broker runs.

It's like the esp wifi repeater is closing TCP connection every ~4hours.

To ensure that my ESPurna devices was correctly configured, I connected it directly to my router and didn't see any disconnect or TCP reset.

Is this a know issue?

@martin-ger
Copy link
Owner

This is not a known issue. Let us try to trace it down:

  • 4 hours is not directly related to any NAT timeouts (30 min for a TCP connection)
  • You receive a proper TCP Reset packet from the ESPurna device? This is certainly not sent by the esp_wifi_repeater, as it cannot create any TCP message, it just rewrites IP packets and adapts TCP port numbers. So it must be sent by the ESPurnas. The questions is: why...? Probably they don't receive anything before that timeout. Can you see any hints for that in the traces? Retransmissions?

@0hax
Copy link
Author

0hax commented Mar 12, 2020

It's hard to tell with the trace I have. I would need to do another one with only one ESPurna device connected to the repeater and logging enabled on mosquitto, repeater and ESPurna.

In one of the case, I get:

mosquitto broker log:

Socket error on client ESPURNA-E52C6B, disconnecting.

Wireshark:

MQTT Publish from homeassistant to the broker
MQTT Publish from the broker to the relay (for ESPURNA-E52C6B)
TCP [RST, ACK] from the relay (from ESPURNA-E52C6B) to the broker

In other case, I get

mosquitto broker log:

Client ESPURNA-515B2B has exceeded timeout, disconnecting.

Wireshark:

TCP [FIN, ACK] from the broker to the relay (for ESPURNA-515B2B)
TCP [RST, ACK] from the relay (from ESPURNA-515B2B) to the broker

@0hax
Copy link
Author

0hax commented Mar 12, 2020

I reproduced the issue this afternoon with only one ESPurna device connected to the repeater and to the broker.

From the mosquitto broker log, I can see PINGREQ and PINGRESP every 300 seconds as excepted (this is the keepalive timeout in ESPurna config)

And at 15:04:19, there is this error:

Client ESPURNA-B76EC4 has exceeded timeout, disconnecting.

On the ESPurna console, I got this at the same time:

[200227] [MQTT] Sending light/status => 1 (PID 1)
[500235] [MQTT] Sending light/status => 1 (PID 1)
[800241] [MQTT] Sending light/status => 1 (PID 1)
[019683] [MQTT] TCP Disconnected
[019684] [MQTT] Disconnected!
[019685] [RELAY] Reset relay (0) due to MQTT disconnection

And on the wireshark trace, I get below packets:

1099 15:01:17.798666 192.168.1.3 → 192.168.1.63 TCP 66 54867 → 1883 [ACK] Seq=1037 Ack=1171 Win=237 Len=0 TSval=3896277863 TSecr=3428795574
1100 15:02:17.916956 192.168.1.3 → 192.168.1.63 MQTT 68 Ping Request
1101 15:02:17.917355 192.168.1.63 → 192.168.1.3 MQTT 68 Ping Response
1102 15:02:17.919774 192.168.1.3 → 192.168.1.63 TCP 66 54867 → 1883 [ACK] Seq=1039 Ack=1173 Win=237 Len=0 TSval=3896337985 TSecr=3428855695
1103 15:03:18.051488 192.168.1.3 → 192.168.1.63 MQTT 68 Ping Request
1104 15:03:18.051869 192.168.1.63 → 192.168.1.3 MQTT 68 Ping Response
1105 15:03:18.053943 192.168.1.3 → 192.168.1.63 TCP 66 54867 → 1883 [ACK] Seq=1041 Ack=1175 Win=237 Len=0 TSval=3896398119 TSecr=3428915830
1106 15:04:18.202663 192.168.1.3 → 192.168.1.63 MQTT 68 Ping Request
1107 15:04:18.202997 192.168.1.63 → 192.168.1.3 MQTT 68 Ping Response
1108 15:04:18.204947 192.168.1.3 → 192.168.1.63 TCP 66 54867 → 1883 [ACK] Seq=1043 Ack=1177 Win=237 Len=0 TSval=3896458271 TSecr=3428975982
1109 15:04:19.405037 192.168.1.63 → 192.168.1.10 TCP 54 1883 → 54743 [FIN, ACK] Seq=119 Ack=1131 Win=63784 Len=0
1110 15:04:19.412232 192.168.1.10 → 192.168.1.63 TCP 54 54743 → 1883 [RST, ACK] Seq=1131 Ack=120 Win=2144 Len=0
1111 15:04:19.505599 192.168.1.63 → 192.168.1.3 MQTT 83 Publish Message [light/status]
1112 15:04:19.508867 192.168.1.3 → 192.168.1.63 TCP 66 54867 → 1883 [ACK] Seq=1043 Ack=1194 Win=237 Len=0 TSval=3896459575 TSecr=3428977284
1113 15:05:18.643846 192.168.1.3 → 192.168.1.63 MQTT 68 Ping Request
1114 15:05:18.644237 192.168.1.63 → 192.168.1.3 MQTT 68 Ping Response
1115 15:05:18.646224 192.168.1.3 → 192.168.1.63 TCP 66 54867 → 1883 [ACK] Seq=1045 Ack=1196 Win=237 Len=0 TSval=3896518713 TSecr=3429036423
1116 15:06:18.784419 192.168.1.3 → 192.168.1.63 MQTT 68 Ping Request
1117 15:06:18.784831 192.168.1.63 → 192.168.1.3 MQTT 68 Ping Response
1118 15:06:18.786785 192.168.1.3 → 192.168.1.63 TCP 66 54867 → 1883 [ACK] Seq=1047 Ack=1198 Win=237 Len=0 TSval=3896578854 TSecr=3429096564
1119 15:07:18.929349 192.168.1.3 → 192.168.1.63 MQTT 68 Ping Request
1120 15:07:18.929759 192.168.1.63 → 192.168.1.3 MQTT 68 Ping Response
1121 15:07:18.934138 192.168.1.3 → 192.168.1.63 TCP 66 54867 → 1883 [ACK] Seq=1049 Ack=1200 Win=237 Len=0 TSval=3896639001 TSecr=3429156709
1122 15:08:19.079026 192.168.1.3 → 192.168.1.63 MQTT 68 Ping Request
1123 15:08:19.079418 192.168.1.63 → 192.168.1.3 MQTT 68 Ping Response
1124 15:08:19.081177 192.168.1.3 → 192.168.1.63 TCP 66 54867 → 1883 [ACK] Seq=1051 Ack=1202 Win=237 Len=0 TSval=3896699149 TSecr=3429216860
1125 15:09:19.219690 192.168.1.3 → 192.168.1.63 MQTT 68 Ping Request
1126 15:09:19.220081 192.168.1.63 → 192.168.1.3 MQTT 68 Ping Response
1127 15:09:19.222228 192.168.1.3 → 192.168.1.63 TCP 66 54867 → 1883 [ACK] Seq=1053 Ack=1204 Win=237 Len=0 TSval=3896759291 TSecr=3429277001
1128 15:10:19.353111 192.168.1.3 → 192.168.1.63 MQTT 68 Ping Request
1129 15:10:19.353432 192.168.1.63 → 192.168.1.3 MQTT 68 Ping Response
1130 15:10:19.355566 192.168.1.3 → 192.168.1.63 TCP 66 54867 → 1883 [ACK] Seq=1055 Ack=1206 Win=237 Len=0 TSval=3896819425 TSecr=3429337134
1131 15:10:24.342262 192.168.1.10 → 192.168.1.63 TCP 58 54643 → 1883 [SYN] Seq=0 Win=5840 Len=0 MSS=1460
1132 15:10:24.342437 192.168.1.63 → 192.168.1.10 TCP 58 1883 → 54643 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
1133 15:10:24.349120 192.168.1.10 → 192.168.1.63 MQTT 99 Connect Command
1134 15:10:24.349254 192.168.1.63 → 192.168.1.10 TCP 54 1883 → 54643 [ACK] Seq=1 Ack=46 Win=64195 Len=0
1135 15:10:24.349935 192.168.1.63 → 192.168.1.10 MQTT 58 Connect Ack
1136 15:10:24.392085 192.168.1.10 → 192.168.1.63 MQTT 154 Unsubscribe Request (id=1), Subscribe Request (id=2) [light/relay/+/set], Subscribe Request (id=3) [light/pulse/+/set], Subscribe Request (id=4) [light/led/+/set], Subscribe Request (id=5) [light/action/set]
1137 15:10:24.392497 192.168.1.63 → 192.168.1.10 MQTT 58 Unsubscribe Ack (id=1)

I have added packets before and after to give some context.
Interesting packets are at 15:04:19.
192.168.1.3 is homeassistant listening to MQTT. (Odroid board)
192.168.1.10 is the esp repeater (with ESPurna connected to it.)
192.168.1.63 is the raspberry with the broker.

The problem is still that we get a TCP [FIN, ACK] packet from the broker followed by the TCP [RST, ACK] packet from the ESPurna.

I don't get why I have this issue only when connected to the repeater.

@0hax
Copy link
Author

0hax commented Mar 13, 2020

I managed to capture a trace with the monitor mode of the repeater.
It shows what happens before the broker send the TCP[FIN, ACK] packet and why the broker sends it.

The issue starts when the ESPurna try to publish its status. When the broker sends the ACK to this status, it doesn't get transmitted to the ESPurna because the esp relay doesn't manage to connect to the allocated port. The error is a ICMP Port unreachable.
Because the ESPurna can no more connect with the broker, after the keepalive time since the last ping, the broker sends a TCP[FIN, ACK] to the ESPurna.

Here is the interesting part:

1119 01:13:07.367797 192.168.4.2 → 192.168.1.63 MQTT 56 Ping Request
1120 01:13:07.376429 192.168.1.63 → 192.168.4.2 TCP 54 1883 → 25324 [ACK] Seq=181 Ack=1437 Win=64009 Len=0
1121 01:13:07.378429 192.168.1.63 → 192.168.4.2 MQTT 56 Ping Response
1122 01:13:07.492682 192.168.4.2 → 192.168.1.63 TCP 54 25324 → 1883 [ACK] Seq=1437 Ack=183 Win=5625 Len=0

1130 01:15:54.415933 192.168.4.2 → 192.168.1.63 MQTT 74 Publish Message [speakers/status]
1131 01:15:54.417622 192.168.4.1 → 192.168.4.2 ICMP 70 Destination unreachable (Port unreachable)
1132 01:15:55.056621 192.168.4.2 → 192.168.1.63 TCP 74 [TCP Retransmission] 25324 → 1883 [PSH, ACK] Seq=1437 Ack=183 Win=5625 Len=20
1133 01:15:55.058412 192.168.4.1 → 192.168.4.2 ICMP 70 Destination unreachable (Port unreachable)
1134 01:15:56.556451 192.168.4.2 → 192.168.1.63 TCP 74 [TCP Retransmission] 25324 → 1883 [PSH, ACK] Seq=1437 Ack=183 Win=5625 Len=20
1135 01:15:56.558286 192.168.4.1 → 192.168.4.2 ICMP 70 Destination unreachable (Port unreachable)

Any idea why the port would be closed after ~4 hours ?

@0hax
Copy link
Author

0hax commented Mar 20, 2020

I have found the issue and sent a fix for it in esp-open-lwip. (martin-ger/esp-open-lwip#28)
Let me know what you think!

@martin-ger
Copy link
Owner

Thanks a lot! Great analysis and thank you for the patch!
Will include it in the binary right now.

martin-ger added a commit that referenced this issue Mar 21, 2020
@0hax
Copy link
Author

0hax commented Mar 21, 2020

Thanks a lot! Great analysis and thank you for the patch!
Will include it in the binary right now.

Happy to help! Thanks for merging it.

st0ff3r pushed a commit to nabovarme/esp-open-lwip that referenced this issue Mar 23, 2020
@0hax 0hax closed this as completed Mar 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants