Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempting MQTT connection...failed, rc=-2 try again in 5 seconds #13

Closed
SimonMerrett opened this issue Aug 26, 2020 · 26 comments
Closed

Comments

@SimonMerrett
Copy link

Hi, I realise I'm clutching at straws here but I'm starting to run out of options. Before the recent releases in July and August I set up a link from an Arduino Nano to mqtt2zigbee running on a Pi 4 with Raspbian desktop. I recently tried to port this over to an Arduino Nano and a Pi 3 A+ on Raspbian light. I used the most recent mqtt2zigbee, mosquitto and couldn't get things working with the ncurses, gatewayNode or ncursesInt to start with. After updating the pubsubclient and RF24 etc libraries for the Arduino, I can get intermittent connections but I get long periods of Attempting MQTT connection...failed, rc=-2 try again in 5 seconds.

I have tried going back to the original setup with the Pi 4 and the Nano but the Nano is now working from the latest RF24 etc version so it isn't exactly the same (same intermittent connection dropping and failure to reconnect). I have interrupt errors on the ncursesInt example for every attempted packet and this is using the same hardware connections that were working when testing my last issue (ie no changes). I did go through the example and explicitly made sure interrupts were turned back on with gw.interrupts(1) rather than gw.interrupts() - like I said, running out of things to check/tweak.

I notice that the installer now gives options for Pi SPI drivers which weren't there before. I selected SPIdev for compatibility on the A+ but I don't know if that's what would have been installed by default on the pre-Jul update version on the Pi 4.

I acknowledge that this could be a problem with mosquitto or pubsubclient but given the raft of recent changes to RF24, I thought it would at least be worth asking if there are any thoughts on this. Even confirming that the process followed in the guide (https://tmrh20.blogspot.com/2019/05/automationiot-with-nrf24l01-and-mqtt.html) still results in reliable performance would be really helpful. I set up mosquitto local service following https://randomnerdtutorials.com/how-to-install-mosquitto-broker-on-raspberry-pi/ .

Not sure what else to include/mention at this stage because it is such a tricky issue to localise. Please shout for more detail that might be helpful.

@TMRh20
Copy link
Member

TMRh20 commented Aug 26, 2020 via email

@SimonMerrett
Copy link
Author

Brilliant pointers - thanks. I will probably start with a fresh install of RaspPi OS with desktop on the A+ as it will make setup and reinstalling everything faster. I'm definitely going to try the original SPI drivers too. Then I can run through the checklist. I have been deleting the dhcplist.txt files every now and then. Sometimes the node is quickly detected, other times not. Oh yes, I have also turned the Pi examples and Arduino code to radio.setPALevel(RF24_PA_MIN) as the antennas are only about a metre apart for testing.

@SimonMerrett
Copy link
Author

pi@raspberrypi:~ $ ping 10.10.3.5
PING 10.10.3.5 (10.10.3.5) 56(84) bytes of data.
64 bytes from 10.10.3.5: icmp_seq=1 ttl=64 time=33.3 ms
64 bytes from 10.10.3.5: icmp_seq=2 ttl=64 time=30.1 ms
64 bytes from 10.10.3.5: icmp_seq=3 ttl=64 time=44.4 ms
^C
--- 10.10.3.5 ping statistics ---
36 packets transmitted, 3 received, 91.6667% packet loss, time 332ms
rtt min/avg/max/mdev = 30.075/35.936/44.395/6.131 ms
pi@raspberrypi:~ $ ping 10.10.3.5
PING 10.10.3.5 (10.10.3.5) 56(84) bytes of data.
64 bytes from 10.10.3.5: icmp_seq=9 ttl=64 time=1022 ms
64 bytes from 10.10.3.5: icmp_seq=42 ttl=64 time=29.7 ms
64 bytes from 10.10.3.5: icmp_seq=44 ttl=64 time=28.9 ms
64 bytes from 10.10.3.5: icmp_seq=46 ttl=64 time=23.3 ms
64 bytes from 10.10.3.5: icmp_seq=52 ttl=64 time=24.1 ms
64 bytes from 10.10.3.5: icmp_seq=54 ttl=64 time=14.0 ms
64 bytes from 10.10.3.5: icmp_seq=56 ttl=64 time=20.1 ms
64 bytes from 10.10.3.5: icmp_seq=57 ttl=64 time=36.2 ms
64 bytes from 10.10.3.5: icmp_seq=58 ttl=64 time=34.7 ms
64 bytes from 10.10.3.5: icmp_seq=60 ttl=64 time=24.4 ms
64 bytes from 10.10.3.5: icmp_seq=61 ttl=64 time=25.5 ms
64 bytes from 10.10.3.5: icmp_seq=62 ttl=64 time=22.2 ms
64 bytes from 10.10.3.5: icmp_seq=63 ttl=64 time=20.5 ms
64 bytes from 10.10.3.5: icmp_seq=64 ttl=64 time=22.1 ms
64 bytes from 10.10.3.5: icmp_seq=66 ttl=64 time=29.3 ms
64 bytes from 10.10.3.5: icmp_seq=68 ttl=64 time=29.7 ms
64 bytes from 10.10.3.5: icmp_seq=71 ttl=64 time=15.8 ms
^C
--- 10.10.3.5 ping statistics ---
71 packets transmitted, 17 received, 76.0563% packet loss, time 644ms
rtt min/avg/max/mdev = 14.027/83.679/1022.149/234.688 ms

for an example of what happens to ping as the node stops being connected. BTW, the decent ping times in the second run are still while the Arduino Zero is failing mqtt connection. While I have been typing this message it has reconnected! Just so sporadic.

@TMRh20
Copy link
Member

TMRh20 commented Aug 26, 2020 via email

@SimonMerrett
Copy link
Author

Just a quick update. I changed to channel 50 without doing any RF survey and moved the Arduino Zero node away from the desktop with laptop, raspi, mobile phone etc. There seemed to be a significant improvement in "uptime" but I don't have the serial monitor available away from the desktop and am only able to monitor with RF24Gateway_ncurses packet flow. There was a point where the gateway wasn't receiving anything and a power cycle on the Zero didn't help - it took ctrl+c and restarting RF24Gateway_ncurses but it did get going again. I will try and do better testing soon. I'm considering going to RF24_250KBPS to see if that helps.

@TMRh20
Copy link
Member

TMRh20 commented Aug 28, 2020 via email

@SimonMerrett
Copy link
Author

Brilliant tips again - I will check if I'm restarting on the correct channel.
In the meantime I found this ref a deeper call into clocks on raspberry pi which may be of use: https://yosh.ke.mu/article/raspberry_pi_getting_time

@SimonMerrett
Copy link
Author

SimonMerrett commented Aug 28, 2020

so I changed the failure detection section in ncurses to mesh.begin(channel) where uint8_t channel = 50. I will leave it running now to see if that is more consistent. However, there's nothing in failLog.txt so I don't know if that function was called. Will delete so that we know if it is created in the next run.

Edit I rechecked 'failLog.txt' and there was a 1 in there. We could be on to something here.

and I also wondered about the commented out delay here https://github.com/nRF24/RF24Mesh/blob/fce52350416070d3fae40a866119010fad631cc8/RF24Mesh.cpp#L17 as I'm using SPIDEV.

@SimonMerrett
Copy link
Author

So I left it again and after about two hours the connection dropped again. I left it 40 mins and there was no reconnection. failLog.tx contained 1. But I'm not sure if the problem is at the Pi end. As a test, I rebooted the Arduino and the packets start flowing and the mqtt messages are successfully being received by the mosquitto server.

I have create a gist with my node code here, in case you can see something silly I have done in there.

After deleting the failLog.txt and waiting for the next irrecoverable drop in connection, I plan to move the Pi away from the desk so that I can bring the Arduino back and hook it up to the serial monitor. Is it worth turning on any of the #defines like DEBUG?

@TMRh20
Copy link
Member

TMRh20 commented Aug 28, 2020

https://gist.github.com/SimonMerrett/b09132b51bb1e9642e240995cf500ba2#file-rf24_mqtt_bulb_string-ino-L117 

Not sure how this works without causing problems, because radio.begin() is not called until mesh.begin();
Either way, you would need to put that line after mesh.begin() or call radio.begin() before setPALevel();

It is difficult to identify issues just looking at code, so I would suggest running the default mqtt example on another device if possible, with the only changes made being ones required to make it operational, pin #s, ip address 

RF24Network_config.h:

    //#define SERIAL_DEBUG
    //#define SERIAL_DEBUG_MINIMAL
    //#define SERIAL_DEBUG_ROUTING

Arduino Sketch, AVR devices need printf enabled for debug output:
#include "printf.h"
After serial.begin
printf_begin();

You said you are using a Nano?

In the debug output, there should be a bunch of these during renwal:

NET Sending id 180 from 04444 to 0100 type 194 //Poll request
MAC Sending to 00 via 00 on pipe 0
NET Pipe 0 on node 00 has address ccccccccc3
MAC Received on 0 id 180 from 00 to 04444 type 194  //Poll response
NET message ffef
System payload rcvd 194
Sending id 180 from 04444 to 00 type 195   //Address request
MAC Sending to 00 via 00 on pipe 0
NET Pipe 0 on node 00 has address ccccccccc3
MAC Received on 0 id 180 from 00 to 04444 type 128  //Address response
NET message 0004
MAC: System payload rcvd 128
setup_address node=04 mask=07 parent=00 pipe=04

...

Attempting MQTT connection... NET Sending id 182 from 04 to 00 type 148  //Fragmented payload
NET Sending id 182 from 04 to 00 type 150  //Fragmented payload

That is mainly what you should see, along with some type 198 messages (lookups) and partial printouts of the IP data.
You should be able to see what is happening, if the device is getting poll responses etc, if writes are failing, not getting address responses, interference from another device, etc.

@SimonMerrett
Copy link
Author

SimonMerrett commented Aug 28, 2020

Oof, really struggling here. I was using an arduino Zero but I can't see a way to enable printf() for SAMD series* so I switched across to a Nano and I can't even get as far as I was with the Zero! The debug output is in the gist but it isn't representative of the issues I was observing with the Zero.

*Tried the libprintf to no avail

@TMRh20
Copy link
Member

TMRh20 commented Aug 28, 2020 via email

@SimonMerrett
Copy link
Author

SimonMerrett commented Aug 28, 2020

image

Well I was just about to try that when this happened! I can't even copy/paste it because Win10 clipboard is refusing to capture it properly. So do I need to change the buffer size in #define MAX_PAYLOAD_SIZE 144 in RF24NetworkConfig.h?
Interesting that libprintf managed to break through to the serial monitor during the crash!

@TMRh20
Copy link
Member

TMRh20 commented Aug 28, 2020 via email

@SimonMerrett
Copy link
Author

No that suggests problems with your code. (memory issues: buffer overruns etc) Again, test with a known working example subscribed to the same topic... As a developer, I really don't have time to provide technical support, debugging other peoples code etc, it just doesn't make sense/not possible. Users need to test with known working code if finding issues.

Sorry - as I said at the beginning I was clutching at straws and I'm going to use the example and build back up. I agree it's not your job to debug my code - sorry and thanks for everything you've done.

@TMRh20
Copy link
Member

TMRh20 commented Aug 28, 2020 via email

@SimonMerrett
Copy link
Author

I just noticed the removal of delay() in reconnect(), as fixed in nRF24/RF24Ethernet@09ff6d3
Because I had to adapt the example sketch for mqtt_basic.ino to the SAMD21 my "plain" copy (made before updating the libraries) didn't receive the benefit of the updated example code. That led to my more customised code inheriting the old version of reconnect and I have just changed it. Based on the stability I have seen over the last couple of hours, this could have been the main cause of the issue I experienced. Without throwing too many rocks, a quick look at the buffer and memory management-related issues on the pubsub repo makes me think that's where I should focus effort now.

@TMRh20
Copy link
Member

TMRh20 commented Aug 30, 2020

Excellent. Updates are a dual edged sword sometimes, fixing one issue and causing another.
I think it is a memory issue of some sort. I did notice the pubsub client will fragment large payloads depending on the configured size of the client buffer, so modifying MAX_PAYLOAD_SIZE in RF24Network_config.h may actually make a difference if there are issues there, as well as minimizing the length of the messages of course.

@SimonMerrett
Copy link
Author

More tests over the last couple of days suggest that I was also suffering from a poor rf link as a more favourable position for rf propagation and reception has also improved stability (still testing with RF24_PA_MIN - moved after mesh.begin()).
I don't have a choice to reduce the length of the long messages without digging into zigbee2mqtt but the good news is that for the foreseeable future I don't need to subscribe to any topics, only publish commands. There is also a long message example for the pubsubclient library that I may look into if I need to handle long messages.
Due to the rf link being so important, I was thinking of implementing a power ramping scheme in reconnect(). I can't think of a good way to mirror this on the gateway though (to ensure acks get through). If anything comes to mind, please mention it.

@TMRh20
Copy link
Member

TMRh20 commented Aug 31, 2020

I'm not sure about what you mean by mirroring that on the gateway. How would the gateway know when to adjust the power level? I keep my nodes at the max pa level.

One of my nodes dropped into a state where it could renew its address and verify connectivity, but not connect to mqtt. A quick reset brought it back up, so there could be some issue either in RF24Ethernet or mqtt. I may try a few things with the default examples i'm using as time allows, maybe a different mqtt client too just to be sure.

@TMRh20 TMRh20 reopened this Aug 31, 2020
@SimonMerrett
Copy link
Author

The gateway absolutely wouldn't know the power level but there may be a scheme which allows you to ensure that at some point all combinations of power are tried at both ends. The wrong but easiest analogy is that a 12hr clock that is slow will eventually be in sync with an accurate clock because they're cycling at different speeds. I really don't know if it's possible but if it were, I imagine it would employ a similar scheme. If the gateway were expecting a known node to always be connected, it could ramp through power levels at rate A for a window of time since the known node was last connected. Meanwhile the known node could ramp through power levels at rate B as it tried to reconnect. Anyway I'm dragging this off topic.

@TMRh20
Copy link
Member

TMRh20 commented Aug 31, 2020

Interesting, trying not to get too far off topic, that would probably require some internal changes to RF24Gateway unless it was just done randomly or based off RF24Mesh/Network level messages.
I think I previously tried averaging the RPD values and adjusting power levels, but it never worked out well. In RF24Mesh it is used for debugging: https://github.com/nRF24/RF24Mesh/blob/master/RF24Mesh.cpp#L292

@TMRh20
Copy link
Member

TMRh20 commented Nov 12, 2020

Still trying to find time to fully investigate this issue. I've been testing with the https://github.com/256dpi/arduino-mqtt library and just filed an issue to get it working with IP address, so you need to download the library from github, not the library manager until a release is made to have it work out of the box.

Its working nicely, so should help rule out/in issues between RF24Ethernet and pubsub library.

@SimonMerrett
Copy link
Author

Thanks. I tried implementing an oled display and buttons on my samd21-based device but the various calls to mqtt etc were taking so long it killed the responsiveness to the user. So I resorted to a samd21 attached to the Pi with an ethernet spi adapter and a 5cm ethernet cable between them, and basic rf24 messages from the gui samd21 and the rpi samd21. The project was terribly overrun so that was the easiest avenue avaible. But I do want to see where the delays may have originated. BTW, for integration with e.g. zigbee2mqtt where the messages can get quite long, being able to whack the pubsub payload buffer size up to 2048 bytes on the samd21 without fear of piling in on memory is very nice.

@TMRh20
Copy link
Member

TMRh20 commented Nov 23, 2020

I've been using the alternate MQTT library for the past while on all of my nodes and have not seen any downtime on them. This of course indicates a problem between RF24Ethernet and the pubsub lib, although I'm not certain if it is the combination of them or just the mqtt library.
I think I'm going to close this issue again though, as there is a working alternative that doesn't seem to be affected.

@TMRh20 TMRh20 closed this as completed Nov 23, 2020
@SimonMerrett
Copy link
Author

Thanks for testing it out with known-working radios. I will take a look at the alternate library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants