
[Bug]: Pico W-based node crashes when rebroadcasting received floodmsg #3931

Closed
clwgh opened this issue May 18, 2024 · 29 comments
Labels
bug Something isn't working

Comments

clwgh commented May 18, 2024

Category

Other

Hardware

Raspberry Pi Pico (W)

Firmware Version

2.3.6.7a3570a + 2.3.9.f06c56a

Description

Details

A Raspberry Pi Pico W-based node runs for several hours, typically 12-20 hours, and then crashes. The heartbeat LED stops flashing; it is normally left off, but in one case it remained on solid, presumably because the crash happened at the exact moment the LED was lit during a heartbeat flash.

I started logging the serial monitor output and found that in every case the final line recorded before the crash is [Router] Rebroadcasting received floodmsg to neighbors. There are a few thousand such messages in each log.

To see how relevant this was, I ran the node as before but replaced the antenna with a 50 ohm dummy load. This still allowed it to drive the radio and transmit nodeinfo and position broadcasts, but it no longer received any incoming packets.

This time the node stayed up and running without problems, and I stopped logging after around 26 hours. Without rebooting, I then replaced the dummy load with the normal antenna and could see the node now processing incoming packets. After around 13 hours the node crashed with the same final line in the log.

This node has been up and running continuously for a week in a completely different location, with the same power supply and batteries, without this issue.

Node details

Raspberry Pi Pico W-based node with a custom PCB and an Ebyte E22 LoRa module, which is equivalent to a Pico W + Waveshare LoRa board. Running firmware versions 2.3.6.7a3570a and 2.3.9.f06c56a. Configured as follows:

  • owner and short owner names set
  • nodeinfo broadcast interval 3600 seconds
  • lat, lon and alt set to fixed point
  • fixed position true
  • smart position disabled
  • position flags 11
  • position broadcast interval 900 seconds
  • TZ set to "GMT0BST,M3.5.0/1:00:00,M10.4.0/2:00:00"
  • LoRa region EU_868
  • TX power 22
  • Hop limit 4
  • everything else at firmware defaults

Log 1 - normal use

[...]
DEBUG | 15:38:24 66208 [Router] Module 'nodeinfo' considered
DEBUG | 15:38:24 66208 [Router] Module 'routing' wantsPacket=1
INFO  | 15:38:24 66208 [Router] Received routing from=0xf96a8c6c, id=0x2da2a395, portnum=4, payloadlen=48
DEBUG | 15:38:24 66208 [Router] Routing sniffing (id=0x2da2a395 fr=0x6c to=0xb1, WantAck=0, HopLim=3 Ch=0x0 Portnum=4 requestId=5869e064 rxtime=1715701104 rxSNR=2.25 rxRSSI=-87)
INFO  | 15:38:24 66208 [Router] Rebroadcasting received floodmsg to neighbors

Firmware 2.3.6.7a3570a
Running for 18h 23m before crash
5145 instances of Rebroadcasting received floodmsg to neighbors in log

Log 2 - normal use

[...]
DEBUG | 12:46:32 67714 [Router] Module 'EnvironmentTelemetry' wantsPacket=1
INFO  | 12:46:32 67714 [Router] Received EnvironmentTelemetry from=0xf96a8c6c, id=0x2da2a643, portnum=67, payloadlen=22
DEBUG | 12:46:32 67714 [Router] Module 'EnvironmentTelemetry' considered
DEBUG | 12:46:32 67714 [Router] Module 'PowerTelemetry' wantsPacket=1
INFO  | 12:46:32 67714 [Router] Received PowerTelemetry from=0xf96a8c6c, id=0x2da2a643, portnum=67, payloadlen=22
DEBUG | 12:46:32 67714 [Router] Module 'PowerTelemetry' considered
DEBUG | 12:46:32 67714 [Router] Module 'routing' wantsPacket=1
INFO  | 12:46:32 67714 [Router] Received routing from=0xf96a8c6c, id=0x2da2a643, portnum=67, payloadlen=22
DEBUG | 12:46:32 67714 [Router] Routing sniffing (id=0x2da2a643 fr=0x6c to=0xff, WantAck=0, HopLim=2 Ch=0x0 Portnum=67 rxtime=1715777192 rxSNR=-16 rxRSSI=-107)
INFO  | 12:46:32 67714 [Router] Rebroadcasting received floodmsg to neighbors

Firmware 2.3.9.f06c56a
Running for 18h 48m before crash
5297 instances of Rebroadcasting received floodmsg to neighbors in log

Log 3 - normal use

[...]
DEBUG | 11:22:45 73782 [Router] Module 'nodeinfo' wantsPacket=1
INFO  | 11:22:45 73782 [Router] Received nodeinfo from=0xfa6e6c7c, id=0x1b8a625, portnum=4, payloadlen=34
DEBUG | 11:22:45 73782 [Router] old user !fa6e6c7c/V7HCL1/V71, channel=0
DEBUG | 11:22:45 73782 [Router] updating changed=0 user !fa6e6c7c/V7HCL1/V71, channel=0
DEBUG | 11:22:45 73782 [Router] Module 'nodeinfo' considered
DEBUG | 11:22:45 73782 [Router] Module 'routing' wantsPacket=1
INFO  | 11:22:45 73782 [Router] Received routing from=0xfa6e6c7c, id=0x1b8a625, portnum=4, payloadlen=34
DEBUG | 11:22:45 73782 [Router] Routing sniffing (id=0x01b8a625 fr=0x7c to=0xff, WantAck=0, HopLim=2 Ch=0x0 Portnum=4 rxtime=1715858565 rxSNR=-17.5 rxRSSI=-108)
INFO  | 11:22:45 73782 [Router] Rebroadcasting received floodmsg to neighbors

Firmware 2.3.9.f06c56a
Running for 20h 30m before crash
5492 instances of Rebroadcasting received floodmsg to neighbors in log

Log 4 - dummy load

[...]
DEBUG | ??:??:?? 93686 [PositionModule] localSend to channel 0
DEBUG | ??:??:?? 93686 [PositionModule] Add packet record (id=0x37260b96 fr=0x39 to=0xff, WantAck=0, HopLim=4 Ch=0x0 Portnum=3 priority=10)
DEBUG | ??:??:?? 93686 [PositionModule] Expanding short PSK #1
DEBUG | ??:??:?? 93686 [PositionModule] Using AES128 key!
DEBUG | ??:??:?? 93686 [PositionModule] Installing AES128 key!
DEBUG | ??:??:?? 93686 [PositionModule] enqueuing for send (id=0x37260b96 fr=0x39 to=0xff, WantAck=0, HopLim=4 Ch=0x8 encrypted hopStart=4 priority=10)
DEBUG | ??:??:?? 93687 [PositionModule] txGood=236,rxGood=0,rxBad=0
DEBUG | ??:??:?? 93687 [PositionModule] NOTE: tophone queue status queue is full, discarding oldest
DEBUG | ??:??:?? 93687 [PositionModule] Using channel 0 (hash 0x8)
DEBUG | ??:??:?? 93687 [PositionModule] Expanding short PSK #1
DEBUG | ??:??:?? 93687 [PositionModule] Using AES128 key!
DEBUG | ??:??:?? 93687 [PositionModule] Installing AES128 key!
DEBUG | ??:??:?? 93687 [PositionModule] decoded message (id=0x37260b96 fr=0x39 to=0xff, WantAck=0, HopLim=4 Ch=0x0 Portnum=3 hopStart=4 priority=10)
WARN  | ??:??:?? 93687 [PositionModule] ToPhone queue is full, dropping packet.
DEBUG | ??:??:?? 93687 [RadioIf] Starting low level send (id=0x37260b96 fr=0x39 to=0xff, WantAck=0, HopLim=4 Ch=0x8 encrypted hopStart=4 priority=10)
DEBUG | ??:??:?? 93687 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=35, time 493 ms
DEBUG | ??:??:?? 93687 [RadioIf] AirTime - Packet transmitted : 493ms
DEBUG | ??:??:?? 93687 [RadioIf] Completed sending (id=0x37260b96 fr=0x39 to=0xff, WantAck=0, HopLim=4 Ch=0x8 encrypted hopStart=4 priority=10)
DEBUG | ??:??:?? 93691 [Power] Battery: usbPower=0, isCharging=0, batMv=4140, batPct=96
DEBUG | ??:??:?? 93711 [Power] Battery: usbPower=0, isCharging=0, batMv=4139, batPct=96
DEBUG | ??:??:?? 93731 [Power] Battery: usbPower=0, isCharging=0, batMv=4138, batPct=96
DEBUG | ??:??:?? 93751 [Power] Battery: usbPower=0, isCharging=0, batMv=4138, batPct=96
DEBUG | ??:??:?? 93771 [Power] Battery: usbPower=0, isCharging=0, batMv=4138, batPct=96
[...]
DEBUG | ??:??:?? 94472 [Power] Battery: usbPower=0, isCharging=0, batMv=4139, batPct=96
DEBUG | ??:??:?? 94492 [Power] Battery: usbPower=0, isCharging=0, batMv=4138, batPct=96
DEBUG | ??:??:?? 94512 [Power] Battery: usbPower=0, isCharging=0, batMv=4139, batPct=96
DEBUG | ??:??:?? 94532 [Power] Battery: usbPower=0, isCharging=0, batMv=4139, batPct=96
DEBUG | ??:??:?? 94552 [Power] Battery: usbPower=0, isCharging=0, batMv=4138, batPct=96

Firmware 2.3.9.f06c56a
Running for 26h 16m before manually stopping logging
Node sent nodeinfo broadcasts every 1hr and fixed position broadcasts every 15m
No packets received
No instances of Rebroadcasting received floodmsg to neighbors in log

Log 5 - continuing from previous log, switching to normal antenna

[...]
DEBUG | 08:22:11 143307 [Router] Module 'routing' wantsPacket=1
INFO  | 08:22:11 143307 [Router] Received routing from=0x8c632643, id=0x73a70eab, portnum=5, payloadlen=2
DEBUG | 08:22:11 143307 [Router] Routing sniffing (id=0x73a70eab fr=0x43 to=0xb1, WantAck=0, HopLim=2 Ch=0x0 Portnum=5 requestId=7e7c1c3c rxtime=1716020530 rxSNR=3.25 rxRSSI=-85)
DEBUG | 08:22:11 143307 [Router] Receiving an ACK or reply not for me, but don't need to rebroadcast this direct message anymore.
DEBUG | 08:22:11 143307 [Router] cancelSending id=0x7e7c1c3c, removed=0
INFO  | 08:22:11 143307 [Router] Rebroadcasting received floodmsg to neighbors

Firmware 2.3.9.f06c56a
Running for a further 13h 15m, since reconnecting the antenna, before crash
4279 instances of Rebroadcasting received floodmsg to neighbors in log

Thoughts

The node has been running without this problem in another location, and runs fine at this location when it has no incoming packets to handle. It appears that some combination of the incoming packets local to this area and these rebroadcasting events is causing the firmware to eventually crash. I've looked for a common cause, such as a specific node ID being referenced, or emojis in a node name perhaps producing malformed packets, but I can't find anything.

Since you know how the firmware works internally, I'd welcome any ideas on what might be causing the eventual crash in the specific and reproducible way seen.

I have the full log files here if you want anything searched for, or I can make them available to devs in their entirety if that helps.

Relevant log output

See log sections above
clwgh added the bug label on May 18, 2024

clwgh (Author) commented May 18, 2024

The very next line which appears after Rebroadcasting received floodmsg to neighbors in all cases is:

DEBUG | xx:xx:xx xxxx [Router] Expanding short PSK #1

or

DEBUG | xx:xx:xx xxxx [Router] enqueuing for send (...various parameters...)

The former accounts for most of the follow-on actions. I don't know exactly how the logging works; assuming each action is logged after it takes place, or the entry simply announces that a process is about to start, it stands to reason that the Rebroadcasting received floodmsg to neighbors step completed, and that it is one of the two actions above which actually crashes the node, since that action never gets logged.

Log 1 - normal use

5116 instances of [Router] Expanding short PSK #1
28 instances of [Router] enqueuing for send

Log 2 - normal use

5273 instances of [Router] Expanding short PSK #1
23 instances of [Router] enqueuing for send

Log 3 - normal use

5481 instances of [Router] Expanding short PSK #1
10 instances of [Router] enqueuing for send

Log 4 - dummy load

No instances of the [Router] module

Log 5 - continuing from previous log, switching to normal antenna

4171 instances of [Router] Expanding short PSK #1
107 instances of [Router] enqueuing for send

GUVWAF (Member) commented May 19, 2024

This sounds like the problem of the LoRa transmission interfering with the USB: #3241

In that case it makes sense that the last log entry shows it is about to transmit.

With the antenna on the other side of the board from the USB port, I've been running the Pico W without issues.

clwgh (Author) commented May 19, 2024

It's an interesting idea but I don't think it applies here:

  1. The node crashes regardless of whether the USB port is connected or has a cable in it. The USB console was added later and simply allows me to log the activity before it crashes. The USB is fine and is not the issue.
  2. The node was not crashing when in a completely different part of the country and was up for a week on test with a similarly nearby indoor antenna.
  3. The dummy load test was still transmitting nodeinfo and position broadcasts at full power for 26 hours and the node did not crash. I was able to hear the transmissions on an SDR in another part of the house without any problem.
  4. It only crashes at this location, specifically when the firmware's Router module logs a floodmsg rebroadcast, and only after many such rebroadcasts have taken place.

In any case I've started a new session with the tx_power set to 1 and will report the results in due course as log 7.

clwgh (Author) commented May 19, 2024

I was running another test and have had another crash, this one after 11h 50m, with the same result and the same final entry in the log. On this occasion the Pico W also lost its configuration 2 seconds before the crash; this is captured in the log as a failure to save /prefs/db.proto. After restarting, the node was back to firmware defaults.

The previous logs show successful /prefs/db.proto saves, and the node did not factory reset when crashing.

I don't know whether this is related to this issue or is a separate one. Perhaps it is analogous to a previous issue where the Pico W would crash after 45 seconds when unconfigured.

Log 6

[...]
DEBUG | 03:00:46 42606 [Router] Module 'nodeinfo' wantsPacket=1
INFO  | 03:00:46 42606 [Router] Received nodeinfo from=0xe2e26330, id=0x58962b2c, portnum=4, payloadlen=41
WARN  | 03:00:46 42606 [Router] Node database full! Erasing oldest entry
DEBUG | 03:00:46 42606 [Router] old user //, channel=0
DEBUG | 03:00:46 42606 [Router] updating changed=1 user !e2e26330/Stockport 01/SK01, channel=0
DEBUG | 03:00:46 42606 [Router] Enter state: ON
DEBUG | 03:00:46 42606 [Router] Node status update: 89 online, 100 total
INFO  | 03:00:46 42606 [Router] Saving /prefs/db.proto
ERROR | 03:00:46 42606 [Router] Failed to open destination file /prefs/db.proto
ERROR | 03:00:46 42606 [Router] Error: can't rename new pref file
WARN  | 03:00:47 42606 [Router] ToPhone queue is full, dropping packet.
DEBUG | 03:00:47 42606 [Router] Module 'nodeinfo' considered
DEBUG | 03:00:47 42606 [Router] Module 'routing' wantsPacket=1
INFO  | 03:00:47 42606 [Router] Received routing from=0xe2e26330, id=0x58962b2c, portnum=4, payloadlen=41
DEBUG | 03:00:47 42606 [Router] Routing sniffing (id=0x58962b2c fr=0x30 to=0x14, WantAck=0, HopLim=0 Ch=0x0 Portnum=4 WANTRESP rxtime=1716087646 rxSNR=3 rxRSSI=-85)
DEBUG | 03:00:47 42606 [Router] Module 'routing' considered
WARN  | 03:00:47 42606 [RadioIf] Can not send yet, busyRx
DEBUG | 03:00:47 42606 [RadioIf] rx_snr found. hop_limit:0 rx_snr:3.250000
DEBUG | 03:00:47 42606 [RadioIf] rx_snr found in packet. Setting tx delay:1050
DEBUG | 03:00:48 42608 [RadioIf] Ignore false preamble detection.
DEBUG | 03:00:48 42608 [RadioIf] rx_snr found. hop_limit:0 rx_snr:3.250000
DEBUG | 03:00:48 42608 [RadioIf] rx_snr found in packet. Setting tx delay:1134
DEBUG | 03:00:48 42608 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=44, time 559 ms
DEBUG | 03:00:48 42608 [RadioIf] Lora RX (id=0x73b33971 fr=0x7c to=0xff, WantAck=0, HopLim=1 Ch=0x8 encrypted rxSNR=2 rxRSSI=-85)
DEBUG | 03:00:48 42608 [RadioIf] AirTime - Packet received : 559ms
DEBUG | 03:00:48 42608 [Router] Add packet record (id=0x73b33971 fr=0x7c to=0xff, WantAck=0, HopLim=1 Ch=0x8 encrypted rxSNR=2 rxRSSI=-85)
DEBUG | 03:00:48 42608 [Router] Using channel 0 (hash 0x8)
DEBUG | 03:00:48 42608 [Router] Expanding short PSK #1
DEBUG | 03:00:48 42608 [Router] Using AES128 key!
DEBUG | 03:00:48 42608 [Router] Installing AES128 key!
DEBUG | 03:00:48 42608 [Router] decoded message (id=0x73b33971 fr=0x7c to=0xff, WantAck=0, HopLim=1 Ch=0x0 Portnum=67 rxtime=1716087648 rxSNR=2 rxRSSI=-85)
DEBUG | 03:00:48 42608 [Router] handleReceived(REMOTE) (id=0x73b33971 fr=0x7c to=0xff, WantAck=0, HopLim=1 Ch=0x0 Portnum=67 rxtime=1716087648 rxSNR=2 rxRSSI=-85)
DEBUG | 03:00:48 42608 [Router] Module 'DeviceTelemetry' wantsPacket=1
INFO  | 03:00:48 42608 [Router] Received DeviceTelemetry from=0xfa6e6c7c, id=0x73b33971, portnum=67, payloadlen=24
INFO  | 03:00:48 42608 [Router] (Received from V71): air_util_tx=10.697194, channel_utilization=29.330000, battery_level=93, voltage=4.135000
DEBUG | 03:00:48 42608 [Router] updateTelemetry REMOTE node=0xfa6e6c7c 
DEBUG | 03:00:48 42608 [Router] Node status update: 89 online, 100 total
DEBUG | 03:00:48 42608 [Router] Module 'DeviceTelemetry' considered
DEBUG | 03:00:48 42608 [Router] Module 'EnvironmentTelemetry' wantsPacket=1
INFO  | 03:00:48 42608 [Router] Received EnvironmentTelemetry from=0xfa6e6c7c, id=0x73b33971, portnum=67, payloadlen=24
DEBUG | 03:00:48 42608 [Router] Module 'EnvironmentTelemetry' considered
DEBUG | 03:00:48 42608 [Router] Module 'PowerTelemetry' wantsPacket=1
INFO  | 03:00:48 42608 [Router] Received PowerTelemetry from=0xfa6e6c7c, id=0x73b33971, portnum=67, payloadlen=24
DEBUG | 03:00:48 42608 [Router] Module 'PowerTelemetry' considered
DEBUG | 03:00:48 42608 [Router] Module 'routing' wantsPacket=1
INFO  | 03:00:48 42608 [Router] Received routing from=0xfa6e6c7c, id=0x73b33971, portnum=67, payloadlen=24
DEBUG | 03:00:48 42608 [Router] Routing sniffing (id=0x73b33971 fr=0x7c to=0xff, WantAck=0, HopLim=1 Ch=0x0 Portnum=67 rxtime=1716087648 rxSNR=2 rxRSSI=-85)
INFO  | 03:00:48 42608 [Router] Rebroadcasting received floodmsg to neighbors

GUVWAF (Member) commented May 19, 2024

To be honest, this still sounds like a hardware-related issue. The logic where it fails is the same for all devices running Meshtastic, and they don't have this problem. When the node doesn't receive anything it also transmits much less, so the chance of it happening is lower.
Maybe it's not the LoRa transmission affecting the USB, but something else on the PCB.

Regarding the flash issue: do you use exactly the same flash chip as the Pico W? Because the RP2040 on its own doesn't have any flash.

AeroXuk (Contributor) commented May 19, 2024

Just want to add that I don’t have logs but my uptime rarely gets up to 2 days since upgrading to version 2.3.7.
I’ll have to check what version I was on before but on an older version with uptime tracking I regularly reached a week of uptime.

AeroXuk (Contributor) commented May 19, 2024

I'm also aware of several other people running Pi Pico and Pi Pico W based nodes in the UK who are experiencing similar crashes.

clwgh (Author) commented May 19, 2024

Regarding the flash issue: do you use exactly the same flash chip as the Pico W? Because the RP2040 on its own doesn't have any flash.

It is a Raspberry Pi Pico W.

clwgh (Author) commented May 19, 2024

my uptime rarely gets up to 2 days since upgrading to version 2.3.7

If you can, it would be interesting to have it attached to a terminal and log where it stops.

GUVWAF (Member) commented May 19, 2024

Even more useful would be to use another Pico or a Pico probe as a debugger: https://www.digikey.ca/en/maker/projects/raspberry-pi-pico-and-rp2040-cc-part-2-debugging-with-vs-code/470abc7efb07432b82c95f6f67f184c0

If it's really a crash caused by the firmware, this will tell us more precisely what the cause is.

clwgh (Author) commented May 19, 2024

I've started a new session with the tx_power set to 1 and will report the results in due course as log 7.

This session lasted for 13h 5m and then crashed at the same point. The tx_power was set to 1, which is around 7-8 dBm / 6-7 mW. There are 4717 instances of Rebroadcasting received floodmsg to neighbors in the log.

I think that this, coupled with the other tests, rules out RF interference as a cause. It appears that the firmware really does eventually reach some invalid state, seemingly related to the Router module and the other-node data it is rebroadcasting.

Log 7

[...]
INFO  | ??:??:?? 3 Power output set to 1
[...]
DEBUG | 17:16:31 47096 [Router] Module 'nodeinfo' wantsPacket=1
INFO  | 17:16:31 47096 [Router] Received nodeinfo from=0xb05b28aa, id=0x2384defb, portnum=4, payloadlen=39
DEBUG | 17:16:31 47096 [Router] old user !b05b28aa/M7DZH Base/dzhb, channel=0
DEBUG | 17:16:31 47096 [Router] updating changed=0 user !b05b28aa/M7DZH Base/dzhb, channel=0
DEBUG | 17:16:31 47096 [Router] Module 'nodeinfo' considered
DEBUG | 17:16:31 47096 [Router] Module 'routing' wantsPacket=1
INFO  | 17:16:31 47096 [Router] Received routing from=0xb05b28aa, id=0x2384defb, portnum=4, payloadlen=39
DEBUG | 17:16:31 47096 [Router] Routing sniffing (id=0x2384defb fr=0xaa to=0x6c, WantAck=0, HopLim=3 Ch=0x0 Portnum=4 WANTRESP rxtime=1716138991 rxSNR=-17.75 rxRSSI=-106 hopStart=5)
INFO  | 17:16:31 47096 [Router] Rebroadcasting received floodmsg to neighbors

Even more useful would be to use another Pico or a Pico probe as a debugger

Nice one, thanks for that link. I don't have another Pico at the moment but if I do later on I'll take a good look at it. Someone else may have the kit to hand to make one.

clwgh (Author) commented May 19, 2024

A mate and I were chatting about this, and it looks like an overrun that corrupts the available storage. It's as if the available storage fills up to capacity, and then any action that needs free space overruns the area and crashes the node.

Presumably one such action is storing a packet that qualifies for floodmsg rebroadcast, so it stands to reason that this would commonly be the last event seen before a crash.

Another such event is saving the prefs, where a new prefs file is written before the old one is deleted and the new one renamed into place (see the sketch at the end of this comment). Since this happens about a tenth as often as a rebroadcast, it is statistically less likely to trigger the overrun, but when it does, it is followed by the rebroadcast crash. That's what was seen in Log 6, where the prefs save failed 2 seconds before the final crash.

What are your thoughts on that, @GUVWAF? What would fill up a Pico W's storage during use, and can I make a firmware tweak to turn that off? For example, if it is the nodedb that fills it up, can I limit the nodedb to a certain number of entries? Or perhaps packets stored for rebroadcast are not garbage collected, and so they accumulate and eventually consume all the available space.

This would also explain the results seen in the various tests and observations above (e.g. dummy load, running elsewhere in the country), and could explain why it's okay for some time (typically 12 to 20 hours) and then dies, as that might be how long it takes at this location for all the space to be consumed, given the nodes I can see and how noisy they are.
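
To be clear about the pattern I mean, here is a generic sketch of a non-atomic "write the new file, then swap it in" save (an illustration only, using standard C file calls rather than the firmware's actual code):

#include <cstdio>
#include <string>

// Generic illustration of a non-atomic prefs save (not the Meshtastic code).
// If the open or write fails, or the node dies after the old file is removed
// but before the rename completes, there is no valid prefs file on next boot.
bool savePrefs(const std::string &path, const void *data, size_t len)
{
    const std::string tmp = path + ".tmp";

    FILE *f = std::fopen(tmp.c_str(), "wb");              // 1. write the new contents
    if (!f)
        return false;
    std::fwrite(data, 1, len, f);
    std::fclose(f);

    std::remove(path.c_str());                            // 2. drop the old file
    return std::rename(tmp.c_str(), path.c_str()) == 0;   // 3. swap the new one in
}

If something like this is what happens with /prefs/db.proto, it would match Log 6: a failed open/rename during the save, then defaults after the next power cycle.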

GUVWAF (Member) commented May 19, 2024

With "storage" you mean flash, right? Did you always see it saving something to flash just before it happened?

To be honest, I don't think flash has anything to do with it. It's persistent storage, so if you had issues because it was running full (which I highly doubt), you would also have them immediately after a reboot.

I think at this point the only way to find out whether the problem lies in the firmware, and if so where, would be to connect a debugger.

clwgh (Author) commented May 19, 2024

With "storage" you mean flash, right? Did you always see it saving something to flash just before it happened?

I'm referring to wherever this is going on – Failed to open destination file /prefs/db.proto ... Error: can't rename new pref file – and to how a floodmsg rebroadcast works. I assume it has to store something in order to rebroadcast it? So I'm asking whether anyone involved in the firmware can think of where in these processes this happens, and whether, if that space became full, it might lead to the behaviour seen. Presumably this is taking place in the 264 KB of SRAM.

The nodedb is also stored somewhere on the node. What if the firmware, when running on the Pico / Pico W, is inadvertently allowing this space to fill up? That could be one reason for seeing something like the above.

There must be ways I can add code to the firmware to make a custom uf2 which adds more logging and which will help reveal things going on that might be relevant, such as free space? I don't know enough about the firmware to know how to do that.

GUVWAF (Member) commented May 19, 2024

I'm referring to wherever this is going on – Failed to open destination file /prefs/db.proto ... Error: can't rename new pref file

Yes, this is saving to flash, but there's plenty available and it's persistent storage so you would always have issues after it's gotten full.

I assume it has to store something in order to rebroadcast it?

Yes, but that happens in RAM and is not related to the above log message.

The nodedb is also stored somewhere on the node.

This is in both RAM and flash.

There must be ways I can add code to the firmware to make a custom uf2 which adds more logging and which will help reveal things going on that might be relevant, such as free space?

You can add a call to MemGet::getFreeHeap() and print that to check the amount of free memory, plus other LOG_DEBUG() messages after "Rebroadcasting received floodmsg", to narrow down where it really fails. However, using a debugger would give us the precise call stack, including platform function calls (so not just Meshtastic's code), if it really crashed due to the firmware.
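
For example, something roughly along these lines (a sketch only; the exact header name and call site are assumptions and need adapting to wherever that log line lives in the firmware):

// Sketch: log the remaining heap every time a floodmsg is rebroadcast, so the
// serial log shows whether free RAM trends towards zero in the hours before a
// crash. memGet is the global MemGet helper; adjust the include to match.
#include "memGet.h"

    LOG_INFO("Rebroadcasting received floodmsg to neighbors\n");
    LOG_DEBUG("Free heap before rebroadcast: %u bytes\n", (unsigned)memGet.getFreeHeap());

If that number keeps dropping towards zero over the hours before a crash, it would point at memory exhaustion rather than flash.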

AeroXuk (Contributor) commented May 20, 2024

I've attached a Raspberry Pi Debug Probe to a Raspberry Pi Pico W node running in VS Code debug mode from this tagged release https://github.com/meshtastic/firmware/releases/tag/v2.3.9.f06c56a

I've also connected to the serial output over USB, and I will leave this running to see if anything is caught in VS Code.

If anyone wants me to do anything in particular in VS Code let me know

GUVWAF (Member) commented May 20, 2024

@AeroXuk Nice, if it crashes, can you copy the call stack of both cores and post it here?

Vanguard4893 commented:

Hi GUVWAF,

Would you be so kind as to provide some info about LOG_DEBUG() lines to print out the FreeHeap, etc?

Just so I can also attempt to assist with tracking this down.

Cheers

clwgh (Author) commented May 20, 2024

I ran another test with the same config but with lora.tx_enabled set to false. This allows the node to go through the same motions but doesn't transmit. The same issue occurred; the node was up for 11h 58m and then crashed with the same message. This rules out RF interference.

It also had the Failed to open destination file /prefs/db.proto message 2 seconds before the final Rebroadcasting received floodmsg to neighbors, as in Log 6 above.

Log 8 - tx_enabled false

[...]
WARN  | 06:26:05 43072 [Router] send - !config.lora.tx_enabled
[...]
DEBUG | 06:26:11 43078 [Router] Module 'nodeinfo' wantsPacket=1
INFO  | 06:26:11 43078 [Router] Received nodeinfo from=0xe2e4b0bc, id=0x6ba56224, portnum=4, payloadlen=42
WARN  | 06:26:11 43078 [Router] Node database full! Erasing oldest entry
DEBUG | 06:26:11 43078 [Router] old user //, channel=0
DEBUG | 06:26:11 43079 [Router] updating changed=1 user !e2e4b0bc/MunkUKSK2_A/MnkA, channel=0
DEBUG | 06:26:11 43079 [Router] Enter state: ON
DEBUG | 06:26:11 43079 [Router] Node status update: 99 online, 100 total
INFO  | 06:26:11 43079 [Router] Saving /prefs/db.proto
ERROR | 06:26:12 43079 [Router] Failed to open destination file /prefs/db.proto
ERROR | 06:26:12 43079 [Router] Error: can't rename new pref file
WARN  | 06:26:12 43079 [Router] ToPhone queue is full, dropping packet.
DEBUG | 06:26:12 43079 [Router] Module 'nodeinfo' considered
DEBUG | 06:26:12 43079 [Router] Module 'routing' wantsPacket=1
INFO  | 06:26:12 43079 [Router] Received routing from=0xe2e4b0bc, id=0x6ba56224, portnum=4, payloadlen=42
DEBUG | 06:26:12 43079 [Router] Routing sniffing (id=0x6ba56224 fr=0xbc to=0x30, WantAck=0, HopLim=0 Ch=0x0 Portnum=4 requestId=26930bed rxtime=1716186371 rxSNR=2.75 rxRSSI=-87)
DEBUG | 06:26:12 43079 [Router] Receiving an ACK or reply not for me, but don't need to rebroadcast this direct message anymore.
DEBUG | 06:26:12 43079 [Router] cancelSending id=0x26930bed, removed=0
DEBUG | 06:26:12 43079 [Router] Module 'routing' considered
DEBUG | 06:26:14 43081 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=72, time 772 ms
DEBUG | 06:26:14 43081 [RadioIf] Lora RX (id=0x72a966a8 fr=0x48 to=0x60, WantAck=0, HopLim=1 Ch=0x8 encrypted rxSNR=2.5 rxRSSI=-87 hopStart=3)
DEBUG | 06:26:14 43081 [RadioIf] AirTime - Packet received : 772ms
DEBUG | 06:26:14 43081 [Router] Add packet record (id=0x72a966a8 fr=0x48 to=0x60, WantAck=0, HopLim=1 Ch=0x8 encrypted rxSNR=2.5 rxRSSI=-87 hopStart=3)
DEBUG | 06:26:14 43081 [Router] Using channel 0 (hash 0x8)
DEBUG | 06:26:14 43081 [Router] Expanding short PSK #1
DEBUG | 06:26:14 43081 [Router] Using AES128 key!
DEBUG | 06:26:14 43081 [Router] Installing AES128 key!
DEBUG | 06:26:14 43081 [Router] decoded message (id=0x72a966a8 fr=0x48 to=0x60, WantAck=0, HopLim=1 Ch=0x0 Portnum=4 WANTRESP rxtime=1716186374 rxSNR=2.5 rxRSSI=-87 hopStart=3)
DEBUG | 06:26:14 43081 [Router] handleReceived(REMOTE) (id=0x72a966a8 fr=0x48 to=0x60, WantAck=0, HopLim=1 Ch=0x0 Portnum=4 WANTRESP rxtime=1716186374 rxSNR=2.5 rxRSSI=-87 hopStart=3)
DEBUG | 06:26:14 43081 [Router] Module 'nodeinfo' wantsPacket=1
INFO  | 06:26:14 43081 [Router] Received nodeinfo from=0xe2e1d848, id=0x72a966a8, portnum=4, payloadlen=50
DEBUG | 06:26:14 43081 [Router] old user !e2e1d848/Dayle M0OUE Base 🗼/OUEB, channel=0
DEBUG | 06:26:14 43081 [Router] updating changed=0 user !e2e1d848/Dayle M0OUE Base 🗼/OUEB, channel=0
DEBUG | 06:26:14 43081 [Router] Module 'nodeinfo' considered
DEBUG | 06:26:14 43081 [Router] Module 'routing' wantsPacket=1
INFO  | 06:26:14 43081 [Router] Received routing from=0xe2e1d848, id=0x72a966a8, portnum=4, payloadlen=50
DEBUG | 06:26:14 43081 [Router] Routing sniffing (id=0x72a966a8 fr=0x48 to=0x60, WantAck=0, HopLim=1 Ch=0x0 Portnum=4 WANTRESP rxtime=1716186374 rxSNR=2.5 rxRSSI=-87 hopStart=3)
INFO  | 06:26:14 43081 [Router] Rebroadcasting received floodmsg to neighbors

Could this be a bug in the filesystem library which results in the prefs save not completing, thus losing the prefs on the next boot? And could the PSK expansion or the enqueuing for send – the actions which follow the rebroadcast floodmsg log entry – be similarly compromised?

GUVWAF (Member) commented May 20, 2024

It seems I was wrong and it's not related to RF interference; it might indeed be related to the flash issues.

I'm currently travelling so can't really look into this now.
@caveman99 any ideas?

Vanguard4893 commented:

Updated to add: I am also seeing this problem on separate Pico W hardware:

DEBUG | 16:43:25 79093 [Router] Module 'routing' wantsPacket=1
INFO  | 16:43:25 79093 [Router] Received routing from=0xf96a8c6c, id=0x3160fa43, portnum=4, payloadlen=48
DEBUG | 16:43:25 79093 [Router] Routing sniffing (id=0x3160fa43 fr=0x6c to=0xc8, WantAck=0, HopLim=4 Ch=0x0 Portnum=4 WANTRESP rxtime=1716482605 rxSNR=6.5 rxRSSI=-46)
INFO  | 16:43:25 79093 [Router] Rebroadcasting received floodmsg to neighbors

clwgh (Author) commented May 24, 2024

I notice that we're seeing nodes with emojis in the long name. Could there be a bug related to the handling of these nodes? I'm just floating ideas here.

If you do a --nodes, the nodes are presented in a formatted table; however, nodes with emojis in their names break the formatting, which implies they have not been considered in the formatting logic (see the small example at the end of this comment). Could the same be true for more core operations?

For example, when updating the nodedb as it fills up, the oldest nodes are purged from it. But what if nodes with emojis are not correctly identified and so are not purged, while the rest of the firmware assumes the purge succeeded when, in fact, it did not?

I can see how this might eventually lead to a state where data is present where it shouldn't be, and perhaps this is the trigger for a crash: for example when a new prefs file is being written, or when a PSK is being expanded or a message enqueued for sending (the last two being the actions which follow the rebroadcast floodmsg entry in the log).

Alternatively, perhaps nodes with emojis can inadvertently create a form of injection where invalid data ends up being executed.

Again, just floating ideas, not asserting that this exact sequence of events is taking place. It very much feels like some sort of overrun / no free space state being reached.
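
As a small illustration of the formatting point (a toy example, nothing to do with the actual firmware or CLI code): padding that counts bytes rather than displayed characters drifts as soon as a name contains an emoji, because the emoji is several bytes of UTF-8 but renders as a single glyph.

#include <cstdio>
#include <cstring>

// Toy example: %-20s pads to 20 *bytes*. The tower emoji is 4 bytes of UTF-8
// but renders as one (double-width) glyph, so padded columns stop lining up
// on screen once a name contains it.
int main()
{
    const char *plain = "M0OUE Base";
    const char *emoji = "M0OUE Base \xF0\x9F\x97\xBC"; // the same name with the emoji appended

    std::printf("|%-20s| %zu bytes\n", plain, std::strlen(plain));
    std::printf("|%-20s| %zu bytes\n", emoji, std::strlen(emoji));
    return 0;
}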

clwgh (Author) commented May 27, 2024

Might this bug (now fixed) be relevant? The Pico W node does not have a display. Is it experiencing a memory leak?

clwgh (Author) commented May 28, 2024

More testing of the Pico W with different firmware versions, to see whether things like throttling nodedb saves are relevant.

Log 9 - fw 2.3.5

The node crashed after 13h 11m when saving prefs. On reboot the prefs had been lost and it was back at firmware defaults.

[...]
DEBUG | 04:17:51 47517 [Router] Module 'nodeinfo' wantsPacket=1
INFO  | 04:17:51 47517 [Router] Received nodeinfo from=0x30326c78, id=0x258840a5, portnum=4, payloadlen=47
WARN  | 04:17:51 47517 [Router] Node database full! Erasing oldest entry
DEBUG | 04:17:51 47517 [Router] old user //, channel=0
DEBUG | 04:17:51 47517 [Router] updating changed=1 user !30326c78/Meshtastic M0ODB/GONZ, channel=0
DEBUG | 04:17:51 47517 [Router] Enter state: ON
DEBUG | 04:17:51 47517 [Router] Node status update: 90 online, 100 total
INFO  | 04:17:51 47517 [Router] Saving /prefs/db.proto

Log 10 - fw 2.2.24

The node crashed after 24h 47m. Seconds before crashing, it started putting out corrupted log lines, including an empty [Router] entry three lines from the end. The prefs were not lost (prefs are only lost if the prefs update process happens to be the event which causes the crash).

[...]
DEBUG | 20:45:56 89185 [Router] Resending implicit ack for a repeated floodmsg
DEBUG | 20:45:56 89185 [Router] 0DEBUG | 20:45:56 89185 [Router] txGood=4842,rxGood=8616,rxBad=2944
DEBUG | 20:45:56 89185 [Router] rx_snr found. hop_limit:2 rx_snr:6.000000
DEBUG | 20:45:56 89185 [Router] rx_snr found in packet. Setting tx delay:2478
DEBUG | 20:45:56 89185 [Router] Found existing packet record for fr=0xda562e20,to=0xffffffff,id=0x79f69492
DEBUG | 20:45:56 89185 [Router] 0DEBUG | 20:45:56 89185 [Router] 0DEBUG | 20:45:56 89185 [Router] cancelSending id=0x79f69492, removed=1
DEBUG | 20:45:56 89185 [Router] Incoming message was filtered 0xda562e20
WARN  | 20:45:57 89185 [RadioIf] Can not send yet, busyRx
DEBUG | 20:45:57 89185 [RadioIf] rx_snr found. hop_limit:3 rx_snr:-10.750000
[...]
DEBUG | 20:46:05 89193 [Router] localSend to channel 0
DEBUG | 20:46:05 89193 [Router] 
DEBUG | 20:46:05 89193 [Router] recentPackets size=91
DEBUG | 20:46:05 89193 [Router] recentPackets size=77 (after clearing expired packets)

Another Pico W node running 2.2.24, in another part of the country, crashed after 23h 30m at the same point in the logged process, including the same empty [Router] entry three lines from the end.

[...]
DEBUG | 06:31:26 84626 [Router] localSend to channel 0
DEBUG | 06:31:26 84626 [Router] 
DEBUG | 06:31:26 84626 [Router] recentPackets size=98
DEBUG | 06:31:26 84626 [Router] recentPackets size=98 (after clearing expired packets)

Summary

It appears to be the same free-space problem, manifesting slightly differently on this older firmware. Hopefully the way it fails in 2.3.x and here in 2.2.24 offers more insight into what's actually happening, given the devs' knowledge of how actions are logged and executed.

The next test is a build which includes this memory leak fix, in case this is relevant.

clwgh (Author) commented May 28, 2024

@GUVWAF I see the latest release has

RP2040: Add getFreeHeap() and getHeapSize() support by @GUVWAF in #3890

Can you advise if there's a way to get this info into the logs please? Many thanks.

clwgh (Author) commented May 29, 2024

Testing the Pico W with the nightly 2.3.11.77cf5c6, which includes the memory leak fix. The node has been up for 40 hours without crashing. Two other Pico Ws are being tested in other parts of the country and they are also now staying up without crashing.

Feeling cautiously optimistic that this memory leak bug has been behind the failures after 12-20 hours. I'll leave this open for now and will test again once 2.3.11 reaches alpha.

thebentern (Contributor) commented:

I'll close it and we can re-open if it pops up again.

clwgh (Author) commented May 29, 2024

Okay, thanks

clwgh (Author) commented Jun 6, 2024

Summary

I have tested this for a few more days with the release alpha 2.3.11.2740a56 and the Pico W stays up and does not crash. Others testing 2.3.11 with their Pico Ws report the same result. I'm therefore confident in saying that the memory leak bug, which was fixed in 2.3.11, was the cause all along.

The symptoms manifested in various ways, all related to running out of working memory:

  • The more packets being dealt with, the shorter the time between crashes. This accounts for the varied results across different Pico Ws in different parts of the country.
  • If a floodmsg rebroadcast was taking place, the next action in the log was either PSK expansion or message enqueuing. This would fail and the node would crash.
  • If prefs were being saved, the non-atomic save process would fail part way through, followed by a crash, leaving all preferences back at firmware defaults after cycling power.
  • If logging was taking place, log entries would often become corrupted, with multiple lines concatenated or empty module lines just before the crash.
  • Packets would be reported as corrupted just before the crash; almost certainly not received corrupted, but stored corrupted because memory had run out.
