[Bug]: Pico W-based node crashes when rebroadcasting received floodmsg #3931
The very next line which appears after the rebroadcast entry is one of two follow-on entries, and the former accounts for most of the next actions. I don't know how the logs are formed; assuming that the action takes place and is then logged, it stands to reason that the crash happens during the action that follows the final logged entry. Counts of that follow-on entry in each log:
Log 1 - normal use: 5116 instances
Log 2 - normal use: 5273 instances
Log 3 - normal use: 5481 instances
Log 4 - dummy load: no instances of the [Router] module
Log 5 - continuing from previous log, switching to normal antenna: 4171 instances
This sounds like the problem of the LoRa transmission interfering with the USB: #3241 In that case it makes sense that the last log entry is that it is about to transmit. With the antenna on the other side of the board from the USB port, I've been running the Pico W without issues.
It's an interesting idea but I don't think it applies here.
In any case I've started a new session with the
I was running another test and have had another crash, this one after 11h 50m, same result and same final entry in the log. On this occasion the Pico W also lost its configuration 2 seconds before the crash, and this is captured in the log as a failure of the prefs save. The previous logs show successful saves. I don't know if this is related to this issue or whether it is a separate issue. Perhaps it is analogous to a previous issue where the Pico W would crash after 45 seconds when unconfigured.
Log 6
To be honest this still really sounds like a hardware-related issue. The logic where it fails is the same for all devices running Meshtastic, which don't have this problem. When it doesn't receive anything, it will also transmit much less, so the chances of it happening are lower. Regarding the flash issue: do you use exactly the same flash chip as the Pico W? Because the RP2040 on its own doesn't have any flash.
Just want to add that I don't have logs but my uptime rarely gets up to 2 days since upgrading to version 2.3.7.
Also aware of several other people running Pi Pico & Pi Pico W based nodes in the UK that are experiencing similar issues with crashes.
It is a Raspberry Pi Pico W.
If you can, it would be interesting to have it attached to a terminal and log where it stops.
Even more useful would be to use another Pico or a Pico probe as a debugger: https://www.digikey.ca/en/maker/projects/raspberry-pi-pico-and-rp2040-cc-part-2-debugging-with-vs-code/470abc7efb07432b82c95f6f67f184c0 If it's really a crash caused by the firmware, this will tell us more exactly what the cause is.
This session lasted for 13h 5m and then crashed at the same point. I think that this, coupled with the other tests, rules out RF interference as a cause. It appears that the firmware really does eventually reach some invalid state, seemingly related to the Router module and the other node data it is rebroadcasting.
Log 7
Nice one, thanks for that link. I don't have another Pico at the moment but if I do later on I'll take a good look at it. Someone else may have the kit to hand to make one.
A mate and I were chatting about this and it looks like an overrun which corrupts the available storage. It's as if the available storage reaches capacity, and then any action which needs free space overruns the area and crashes the node. Presumably one such action is storing a packet which qualifies for floodmsg rebroadcast, so it would stand to reason that this would be commonly seen as the last event before a crash. Another such event is saving the prefs, where a new prefs file is written before the old one is deleted and the new one renamed. Since this takes place about 1/10th as often as a rebroadcast, statistically it would be less likely to trigger the overrun, but when it does it would be followed by the rebroadcast crash. That's what was seen in Log 6, where the prefs failed 2 seconds before the final crash.
What are your thoughts on that, @GUVWAF? What would fill up a Pico W's storage during use, and can I make a firmware tweak to turn that off? For example, if it is the nodedb that fills it up, can I limit the nodedb to a certain number? Or perhaps packets stored for rebroadcast are not garbage collected and so accumulate and eventually consume all the available space. This would also explain the results seen in the various tests and observations above (eg dummy load, running elsewhere in the country, etc), and could explain why it's okay for some time (typically 12 to 20 hours) and then dies, as this might be how long it takes at this location for all the space to become consumed, based on the nodes I can see and how noisy they are.
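To illustrate the sequence being described, a rough sketch of a write-then-swap prefs save in plain C stdio follows; this is not the firmware's actual filesystem code, and the path and buffer handling are made up:

```cpp
#include <cstddef>
#include <cstdio>

// Sketch of the save pattern described above: write a fresh file, remove the
// old one, then rename the new one into place. Illustrative only.
bool savePrefs(const char *path, const void *data, size_t len)
{
    char tmpPath[64];
    std::snprintf(tmpPath, sizeof(tmpPath), "%s.tmp", path);

    // 1. Write the new contents to a temporary file. An out-of-space
    //    condition would first show up here.
    std::FILE *f = std::fopen(tmpPath, "wb");
    if (!f)
        return false;
    bool ok = std::fwrite(data, 1, len, f) == len;
    ok = (std::fclose(f) == 0) && ok;
    if (!ok) {
        std::remove(tmpPath); // don't leave a truncated temp file behind
        return false;
    }

    // 2. Remove the old file (ignoring failure if it doesn't exist yet) and
    //    move the new one into place.
    std::remove(path);
    return std::rename(tmpPath, path) == 0;
}
```

If the node resets or the write fails between removing the old file and renaming the new one, the prefs would appear lost on the next boot, which would match what Log 6 showed.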
With "storage" you mean flash, right? Did you always see it saving something to flash just before it happened? To be honest I don't think flash has anything to do with it. It's persistent storage, so if you had issues because it's running full (which I highly doubt), you would also have them immediately after reboot. I think at this point the only way to find out whether the problem lies in the firmware, and if so where, would be to connect a debugger.
I'm referring to wherever this is going on. The nodedb is also stored somewhere on the node. What if the firmware, when running on the Pico / Pico W, is inadvertently allowing this space to become filled? That could be a reason one might see something like the above. There must be ways I can add code to the firmware to make a custom uf2 which adds more logging and helps reveal things that might be relevant, such as free space? I don't know enough about the firmware to know how to do that.
Yes, this is saving to flash, but there's plenty available and it's persistent storage so you would always have issues after it's gotten full.
Yes, but that happens in RAM and is not related to the above log message.
This is in both RAM and flash.
You can add a LOG_DEBUG() call that prints the free heap.
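For example, a minimal sketch, assuming the firmware's LOG_DEBUG macro and memGet helper are available where you add it (exact names and headers may differ in the actual tree):

```cpp
#include <cstdint>
#include "configuration.h" // assumed to provide the LOG_DEBUG macro
#include "memGet.h"        // assumed to provide the global memGet helper

// Sketch only: print the current free heap so the serial capture shows
// whether memory trends downward over the hours before a crash. Call this
// from an existing periodic task (e.g. once a minute) rather than a tight
// loop, to avoid flooding the log.
void reportFreeHeap()
{
    LOG_DEBUG("Free heap: %u bytes\n", (unsigned)memGet.getFreeHeap());
}
```

Calling this periodically would show in the serial log whether free memory trends downward over the 12-20 hours before a crash.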
I've attached a Raspberry Pi Debug Probe to a Raspberry Pi Pico W node running in VS Code debug mode from this tagged release: https://github.com/meshtastic/firmware/releases/tag/v2.3.9.f06c56a I've also connected to serial output over USB and will leave this running to see if anything is caught in VS Code. If anyone wants me to do anything in particular in VS Code, let me know.
@AeroXuk Nice, if it crashes, can you copy the call stack of both cores and post it here?
Hi GUVWAF, Would you be so kind as to provide some info about LOG_DEBUG() lines to print out the FreeHeap, etc? Just so I can also attempt to assist with tracking this down. Cheers
I ran another test with the same config but with tx_enabled set to false. It also had the prefs failure seen before.
Log 8 - tx_enabled false
Could this be a bug in the filesystem library, which results in the prefs process not completing, thus losing the prefs on the next boot? And the process for the expansion of the PSK or the enqueuing for send – the actions which follow the rebroadcast floodmsg log entry – are similarly compromised?
Seems indeed I was wrong and it's not related to RF interference, and it might indeed be related to the flash issues. I'm currently travelling so can't really look into this now.
Updated to add - I am also seeing this problem on separate PicoW hardware.
I notice that we're seeing nodes with emojis in the long name. Could there be a bug related to the handling of these nodes? I'm just floating ideas here.
For example, when the nodedb is updated and the quantity of nodes goes up, the oldest nodes are purged from it. But what if nodes with emojis are not correctly identified and not purged, and the rest of the firmware assumes this was a successful operation when, in fact, it was not? I can see how this might eventually lead to a state where data is present where it shouldn't be, and perhaps this is the trigger for a crash - for example when a new prefs file is being written, or when a PSK is being expanded or a message enqueued for sending (these last two being actions which follow the rebroadcast floodmsg entry in the log). Alternatively, perhaps nodes with emojis can inadvertently create a form of injection where invalid data ends up being executed.
Again, just floating ideas, not asserting that this exact sequence of events is taking place. It very much feels like some sort of overrun / no free space state being reached.
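As one way to test the emoji idea, a hypothetical standalone helper like the following could flag long names containing non-ASCII bytes (emoji always encode as multi-byte UTF-8); this is illustrative only and not taken from the firmware:

```cpp
// Returns true if the NUL-terminated name contains any byte outside 7-bit
// ASCII, which is the case for every emoji (they encode as multi-byte UTF-8).
bool nameHasNonAscii(const char *name)
{
    for (const unsigned char *p = reinterpret_cast<const unsigned char *>(name); *p != '\0'; ++p) {
        if (*p >= 0x80)
            return true;
    }
    return false;
}
```

Logging this flag alongside nodedb purge decisions would show whether emoji-named entries are handled differently.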
Might this bug (now fixed) be relevant? The Pico W node does not have a display. Is it experiencing a memory leak?
More testing the Pico W with different firmwares, to see if things like throttling nodedb saves are relevant.
Log 9 - fw 2.3.5
The node crashed after 13h 11m when saving prefs. On reboot the prefs had been lost and it was back at firmware defaults.
Log 10 - fw 2.2.24
The node crashed after 24h 47m. Seconds before crashing it started putting out corrupted log lines, including an empty entry.
Another Pico W node running
Summary
It appears to be the same free space problem, manifesting slightly differently with this older firmware. Hopefully the way in which it fails in this older version helps narrow things down. The next test is a build which includes this memory leak fix, in case this is relevant.
Testing the Pico W with the nightly build. Feeling cautiously optimistic that this memory leak bug has been behind the failures after 12-20 hours. I'll leave this open for now and will test again once the fix is in a release.
I'll close it and we can re-open if it pops up again.
Okay, thanks
Summary
I have tested this for a few more days with the release alpha. The symptoms manifested in various ways, all related to running out of working memory.
Category
Other
Hardware
Raspberry Pi Pico (W)
Firmware Version
2.3.6.7a3570a + 2.3.9.f06c56a
Description
Details
Raspberry Pi Pico W-based node runs for several hours, typically 12-20 hours, and then crashes. The heartbeat LED stops flashing, normally ending up off, but in one case it remained on solid, presumably because the crash happened at the exact moment the LED was on during a heartbeat flash.
I started logging the serial monitor and found that in every case the final line recorded before the crash is
[Router] Rebroadcasting received floodmsg to neighbors
There are a few thousand such messages in each log.
To see how relevant this was I ran the node as before but replaced the antenna with a 50 ohm dummy load. This allowed it to access the radio and transmit nodeinfo and position broadcasts, but it received no incoming packets.
This time the node stayed up and running without problems, and I stopped logging after around 26 hours. Without rebooting, I then replaced the dummy load with the normal antenna and could see the node now processing incoming packets. After around 13 hours the node crashed with the same final line in the log.
This node has been up and running continuously for a week in a completely different location, with the same power supply and batteries, without this issue.
Node details
Raspberry Pi Pico W-based node with a custom PCB and an Ebyte E22 LoRa module, which is equivalent to a Pico W + Waveshare LoRa board. Running firmwares 2.3.6.7a3570a and 2.3.9.f06c56a. Configured as follows:
Log 1 - normal use
Firmware 2.3.6.7a3570a
Running for 18h 23m before crash
5145 instances of "Rebroadcasting received floodmsg to neighbors" in log
Log 2 - normal use
Firmware 2.3.9.f06c56a
Running for 18h 48m before crash
5297 instances of "Rebroadcasting received floodmsg to neighbors" in log
Log 3 - normal use
Firmware 2.3.9.f06c56a
Running for 20h 30m before crash
5492 instances of "Rebroadcasting received floodmsg to neighbors" in log
Log 4 - dummy load
Firmware 2.3.9.f06c56a
Running for 26h 16m before manually stopping logging
Node sent nodeinfo broadcasts every 1hr and fixed position broadcasts every 15m
No packets received
No instances of "Rebroadcasting received floodmsg to neighbors" in log
Log 5 - continuing from previous log, switching to normal antenna
Firmware 2.3.9.f06c56a
Running for a further 13h 15m, since reconnecting the antenna, before crash
4279 instances of "Rebroadcasting received floodmsg to neighbors" in log
Thoughts
The node has been running without this problem in another location, and runs fine at this location when it has no incoming packets to handle. It appears that some combination of the incoming packets local to this area, and these rebroadcasting events, is causing the firmware to eventually crash. I've looked for any common cause, such as a specific node ID being referenced, or emojis in the node name from a node that is perhaps sending malformed packets, but I can't find anything.
Since you know how the firmware runs, I'd welcome any ideas on what might be happening to cause the eventual crash in the specific and reproducible way seen.
I have the full log files here if you want anything searched for, or I can make them available to devs in their entirety if that helps.
Relevant log output