Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: T-Lora v2 1.6 goes into boot-loop #2084

Closed
rpsainio opened this issue Dec 31, 2022 · 31 comments
Closed

[Bug]: T-Lora v2 1.6 goes into boot-loop #2084

rpsainio opened this issue Dec 31, 2022 · 31 comments
Labels
bug Something isn't working

Comments

@rpsainio
Copy link

Category

Other

Hardware

T-Lora v2 1.6

Firmware Version

2.0.8.090e166-d

Description

Every now and then T-Lora v2 1.6 gets stuck in boot loop and only pressing the reset-button makes it tick again. This one has extra BME680 sensor connected so it is not just a plain board. Normally it runs for days without a hitch.
I have two similar nodes (with BME680) and they both behave the same way.

Relevant log output

------------------------Boot loop -------------------------------------------
Rebooting...
ets Jun  8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 188777542, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:1184
load:0x40078000,len:13132
load:0x40080400,len:3036
entry 0x400805e4
[    11][D][esp32-hal-cpu.c:244] setCpuFrequencyMhz(): PLL: 480 / 2 = 240 Mhz, APB: 80000000 Hz
\357\277\275\357\277\275@??:??:?? 0 

//\ E S H T /\ S T / C

??:??:?? 0 booted, wake cause 0 (boot count 1), reset_reason=reset
??:??:?? 0 Filesystem files (479232/1048576 Bytes):
??:??:?? 0  /prefs/channels.proto (53 Bytes)
??:??:?? 0  /prefs/config.proto (83 Bytes)
??:??:?? 0  /prefs/db.proto (566 Bytes)
??:??:?? 0  /prefs/module.proto (26 Bytes)
??:??:?? 0  /static/.gitkeep (0 Bytes)
??:??:?? 0  /static/Logo_Black.svg.gz (592 Bytes)
??:??:?? 0  /static/icon.svg.gz (842 Bytes)
??:??:?? 0  /static/index.2759ba5d.js.gz (347 Bytes)
??:??:?? 0  /static/index.85831067.css.gz (15140 Bytes)
??:??:?? 0  /static/index.d4fbecc1.js.gz (410906 Bytes)
??:??:?? 0  /static/index.html.gz (507 Bytes)
??:??:?? 0  /static/robots.txt.gz (42 Bytes)
??:??:?? 0  /static/site.webmanifest.gz (196 Bytes)
[   273][I][esp32-hal-i2c.c:75] i2cInit(): Initialising I2C Master: sda=21 scl=22 freq=100000
??:??:?? 0 Using analog input 35 for battery level
??:??:?? 0 I2C device found at address 0x3c
??:??:?? 0 0x4 subtype probed in 2 tries 
??:??:?? 0 ssd1306 display found
??:??:?? 0 I2C device found at address 0x77
??:??:?? 0 Wire.available() = 1
??:??:?? 0 BME-680 sensor found at address 0x77
??:??:?? 0 2 I2C devices found
Guru Meditation Error: Core  1 panic'ed (Interrupt wdt timeout on CPU1). 

Core  1 register dump:
PC      : 0x40085a96  PS      : 0x00050035  A0      : 0x4000bff0  A1      : 0x3ffbfccc  
A2      : 0x8400025c  A3      : 0x3ffbed40  A4      : 0x40085af0  A5      : 0x3ffbfc9c  
A6      : 0x00000000  A7      : 0x3ffbe71c  A8      : 0x80084740  A9      : 0x40096432  
A10     : 0x3ffd7db0  A11     : 0x00000804  A12     : 0x00000001  A13     : 0x3ffbfc7c  
A14     : 0x00000000  A15     : 0x00000000  SAR     : 0x00000006  EXCCAUSE: 0x00000006  
EXCVADDR: 0x00000000  LBEG    : 0x40091a48  LEND    : 0x40091a53  LCOUNT  : 0xffffffff  
Core  1 was running in ISR context:
EPC1    : 0x4013a64f  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x00000000


Backtrace:0x40085a93:0x3ffbfccc |<-CORRUPTED


Core  0 register dump:
PC      : 0x402091ae  PS      : 0x00060835  A0      : 0x801384c0  A1      : 0x3ffbd6c0  
A2      : 0x00000000  A3      : 0x40091004  A4      : 0x00060820  A5      : 0x3ffbd020  
A6      : 0x007bfce8  A7      : 0x003fffff  A8      : 0x80137ffe  A9      : 0x3ffbd690  
A10     : 0x00000000  A11     : 0x3ffbfce4  A12     : 0x3ffbfce4  A13     : 0x00000000  
A14     : 0x00060820  A15     : 0x00000000  SAR     : 0x0000001a  EXCCAUSE: 0x00000006  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  


Backtrace:0x402091ab:0x3ffbd6c00x401384bd:0x3ffbd6e0 0x40097990:0x3ffbd700 




ELF file SHA256: 0000000000000000

Rebooting...

------------------------------Pressing Reset ---------------
ERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 188777542, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:1184
load:0x40078000,len:13132
load:0x40080400,len:3036
entry 0x400805e4
[    11][D][esp32-hal-cpu.c:244] setCpuFrequencyMhz(): PLL: 480 / 2 = 240 Mhz, APB: 80000000 Hz
\357\277\275\357\277\275@??:??:?? 0 

//\ E S H T /\ S T / C

??:??:?? 0 booted, wake cause 0 (boot count 1), reset_reason=reset
??:??:?? 0 Filesystem files (479232/1048576 Bytes):
??:??:?? 0  /prefs/channels.proto (53 Bytes)
??:??:?? 0  /prefs/config.proto (83 Bytes)
??:??:?? 0  /prefs/db.proto (566 Bytes)
??:??:?? 0  /prefs/module.proto (26 Bytes)
??:??:?? 0  /static/.gitkeep (0 Bytes)
??:??:?? 0  /static/Logo_Black.svg.gz (592 Bytes)
??:??:?? 0  /static/icon.svg.gz (842 Bytes)
??:??:?? 0  /static/index.2759ba5d.js.gz (347 Bytes)
??:??:?? 0  /static/index.85831067.css.gz (15140 Bytes)
??:??:?? 0  /static/index.d4fbecc1.js.gz (410906 Bytes)
??:??:?? 0  /static/index.html.gz (507 Bytes)
??:??:?? 0  /static/robots.txt.gz (42 Bytes)
??:??:?? 0  /static/site.webmanifest.gz (196 Bytes)
[   273][I][esp32-hal-i2c.c:75] i2cInit(): Initialising I2C Master: sda=21 scl=22 freq=100000
??:??:?? 0 Using analog input 35 for battery level
??:??:?? 0 I2C device found at address 0x3c
??:??:?? 0 0x4 subtype probed in 2 tries 
??:??:?? 0 ssd1306 display found
??:??:?? 0 I2C device found at address 0x77
??:??:?? 0 Wire.available() = 1
??:??:?? 0 BME-680 sensor found at address 0x77
??:??:?? 0 2 I2C devices found
??:??:?? 0 Meshtastic hwvendor=3, swver=2.0.8.090e166-d
??:??:?? 0 Setting random seed 1863777771
??:??:?? 0 Total heap: 198940
??:??:?? 0 Free heap: 164848
??:??:?? 0 Total PSRAM: 0
??:??:?? 0 Free PSRAM: 0
??:??:?? 0 NVS: UsedEntries 2, FreeEntries 628, AllEntries 630, NameSpaces 1
??:??:?? 0 Setup Preferences in Flash Storage
??:??:?? 0 Number of Device Reboots: 180
??:??:?? 0 OTA firmware version 0.1.1.98ab02e
??:??:?? 0 Initializing NodeDB
??:??:?? 0 Loading /prefs/db.proto
??:??:?? 0 Loaded saved devicestate version 20
??:??:?? 0 Loading /prefs/config.proto
??:??:?? 0 Loaded saved config version 20
??:??:?? 0 Loading /prefs/module.proto
??:??:?? 0 Loaded saved moduleConfig version 20
??:??:?? 0 Loading /prefs/channels.proto
??:??:?? 0 Loaded saved channelFile version 20
[   443][E][vfs_api.cpp:104] open(): /littlefs/oem/oem.proto does not exist, no permits for creation
??:??:?? 0 No /oem/oem.proto preferences found
??:??:?? 0 Number of Device Reboots: 180
??:??:?? 0 Expanding short PSK #1
??:??:?? 0 Wanted region 3, using EU_868
??:??:?? 0 region=3, NODENUM=0x25c12400, dbsize=5
??:??:?? 0 Saving /prefs/db.proto
??:??:?? 0 Read RTC time as 0
??:??:?? 0 WANT GPS=1
~??:??:?? 1 Warning: Failed to find UBlox & MTK GNSS Module
??:??:?? 1 GxGSA NOT available
??:??:?? 1 WANT GPS=0
??:??:?? 1 Codec2 disabled (AudioModule 0, Region EU_868, permitted 0)
??:??:?? 1 External Notification Module Disabled
??:??:?? 1 Turning on screen
??:??:?? 2 Module wants a UI Frame
??:??:?? 2 Module wants a UI Frame
@rpsainio rpsainio added the bug Something isn't working label Dec 31, 2022
@rpsainio rpsainio changed the title [Bug]: [Bug]: T-Lora v2 1.6 goes into boot-loop Dec 31, 2022
@karamo
Copy link

karamo commented Jan 3, 2023

I suspect this has to do with the heap leak.
I also know the reboot, which lasts up to several minutes. Sometimes the module starts again and runs "normally" for a certain time.

@rpsainio
Copy link
Author

rpsainio commented Jan 3, 2023

This one stays in loop until RESET is pressed

@caveman99
Copy link
Sponsor Member

i saw the same on my t-beam but running 2.0.10 fixed it for me.

@rpsainio
Copy link
Author

rpsainio commented Jan 3, 2023

OK, I will give 2.0.10 a try.

@rpsainio
Copy link
Author

rpsainio commented Jan 4, 2023

Sorry to say, but it ended up in boot-loop again with Meshtastic hwvendor=3, swver=2.0.10.e09b12c-d

loop_10_out.txt

@thebentern
Copy link
Contributor

Does removing the BME sensor have any affect?

@rpsainio
Copy link
Author

Thought about that as well as it loops after I2-init. I have normally "power.isPowerSaving true" and switching this off seems to help, but it will of course shorten batterytime. I just soldered an elko over VCC-GND next to BME680 - did not help. Next now is to disconnect BME680. Let's see

@karamo
Copy link

karamo commented Jan 19, 2023

@rpsainio Your "every now and then" can be the Problem with the Heap-Leak.
See issue #2165

@rpsainio
Copy link
Author

@karamo Earlier versions used to print out heap-usage when there was a change, but I have not seen this on console output since versions. I saw that the heap grew, but then it also released space so I never saw running out of heap. So, I am still concentrating on finding out how that externally connected BME680 might cause the boot-loop. The system remains in this boot-loop until RESET is pressed and I would expect a boot to clean up a heap-leak like you have implemented with WDT

@rpsainio
Copy link
Author

@thebentern It ended up in boot-loop also without BME680.
swver=2.0.13.7e27729-d
I will now try to catch the situation when it first time enters the loop, I guess that is the interesting point to see. BTW, is there a way to activate heap-status so that we could see if this is actually #2165 .

@rpsainio
Copy link
Author

@thebentern So, this time it happened faster than I hoped:
assert_01.txt

@caveman99
Copy link
Sponsor Member

@rpsainio the heap status is off on release builds. if you grab any build artefact from our CI that is not a release you will have it back.

@caveman99
Copy link
Sponsor Member

@thebentern So, this time it happened faster than I hoped: assert_01.txt

this is most likely a stuck core. "Guru Meditation Error: Core 1 panic'ed (Interrupt wdt timeout on CPU1)." 2 seconds into the boot suggests so. Because from a cold boot we only use CPU0. The second core is only enabled when threading starts. That would be the codec2 thread (if enabled) or the webserver. To my knowledge the rest of the firmware does not use FreeRTOS threads.

@rpsainio
Copy link
Author

It started by asserting esp_sleep_wakeup_cause_t doLightSleep(uint64_t) sleep.cpp:308 (esp_light_sleep_start() == ESP_OK)

@rpsainio
Copy link
Author

This is a standalone node and should not have codec2 or webserver enabled. Besides Core1 panics only after reboot, so is there some missing init for CPU1 ?

@caveman99
Copy link
Sponsor Member

i have decoded the coredumps...

First one is straightforward...

assert failed: esp_sleep_wakeup_cause_t doLightSleep(uint64_t) sleep.cpp:308 (esp_light_sleep_start() == ESP_OK)


Backtrace: 0x4008448d:0x3ffd8790 0x400966a1:0x3ffd87b0 0x4009c821:0x3ffd87d0 0x400f77eb:0x3ffd8900 0x400d43e0:0x3ffd8920 0x400f9f4d:0x3ffd8940 0x400de075:0x3ffd8960 0x400d57aa:0x3ffd8980 0x400fa8a1:0x3ffd89a0 0x400de43a:0x3ffd89d0 0x4012c80d:0x3ffd89f0

    0x4008448d: panic_abort at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/panic.c:402
    0x400966a1: esp_system_abort at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/esp_system.c:128
    0x4009c821: __assert_func at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/newlib/assert.c:85
    0x400f77eb: doLightSleep(unsigned long long) at /home/runner/work/firmware/firmware/src/sleep.cpp:307
    0x400d43e0: lsIdle at /home/runner/work/firmware/firmware/src/PowerFSM.cpp:82
    0x400d43e0: lsIdle at /home/runner/work/firmware/firmware/src/PowerFSM.cpp:61
    0x400f9f4d: Fsm::run_machine() at /home/runner/work/firmware/firmware/.pio/libdeps/tlora-v2-1-1.6/arduino-fsm/Fsm.cpp:142
    0x400de075: concurrency::PowerFSMThread::runOnce() at /home/runner/work/firmware/firmware/src/PowerFSMThread.h:20
    0x400d57aa: concurrency::OSThread::run() at /home/runner/work/firmware/firmware/src/concurrency/OSThread.cpp:81
    0x400fa8a1: ThreadController::runOrDelay() at /home/runner/work/firmware/firmware/.pio/libdeps/tlora-v2-1-1.6/Thread/ThreadController.cpp:59
    0x400de43a: loop() at /home/runner/work/firmware/firmware/src/main.cpp:546
    0x4012c80d: loopTask(void*) at /home/runner/.platformio/packages/framework-arduinoespressif32/cores/esp32/main.cpp:50

@caveman99
Copy link
Sponsor Member

Second is way more elusive.

Guru Meditation Error: Core  1 panic'ed (Interrupt wdt timeout on CPU1). 

Core  1 register dump:
PC      : 0x40081980  PS      : 0x00060035  A0      : 0x800819c3  A1      : 0x3ffc00fc  
A2      : 0x04000000  A3      : 0x00000000  A4      : 0x00000001  A5      : 0x3ffd8820  
A6      : 0x3ffbe750  A7      : 0x3ff440e8  A8      : 0x0000001a  A9      : 0xffffffff  
A10     : 0x04000000  A11     : 0x003fffff  A12     : 0x00002680  A13     : 0x00002000  
A14     : 0x00000001  A15     : 0x00000001  SAR     : 0x00000006  EXCCAUSE: 0x00000006  
EXCVADDR: 0x00000000  LBEG    : 0x40093485  LEND    : 0x40093495  LCOUNT  : 0xfffffffe  
Core  1 was running in ISR context:
EPC1    : 0x4013cb27  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x00000000


Backtrace: 0x4008197d:0x3ffc00fc |<-CORRUPTED

    0x4008197d: gpio_isr_loop at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/driver/gpio.c:412


Core  0 register dump:
PC      : 0x4020fdaa  PS      : 0x00060835  A0      : 0x8013a910  A1      : 0x3ffbda40  
A2      : 0x00000000  A3      : 0x400911a8  A4      : 0x00060020  A5      : 0x80000000  
A6      : 0x007c0168  A7      : 0x003fffff  A8      : 0x8013a452  A9      : 0x3ffbda10  
A10     : 0x00000000  A11     : 0x80000001  A12     : 0x8009a2c0  A13     : 0x3ffbd940  
A14     : 0x00000003  A15     : 0x00060023  SAR     : 0x0000001a  EXCCAUSE: 0x00000006  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  


Backtrace: 0x4020fda7:0x3ffbda40 0x4013a90d:0x3ffbda60 0x40097c38:0x3ffbda80

    0x4020fda7: cpu_ll_waiti at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/esp32/include/hal/cpu_ll.h:183
    0x4020fda7: esp_pm_impl_waiti at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_pm/pm_impl.c:837
    0x4013a90d: esp_vApplicationIdleHook at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/freertos_hooks.c:63
    0x40097c38: prvIdleTask at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/freertos/tasks.c:3987

@caveman99
Copy link
Sponsor Member

and third one is actually not the same.

Guru Meditation Error: Core 1 panic'ed (Interrupt wdt timeout on CPU1).

Core  1 register dump:
PC      : 0x40085c2c  PS      : 0x00050035  A0      : 0x4000bff0  A1      : 0x3ffc014c  
A2      : 0x00000200  A3      : 0x3ffba530  A4      : 0x000001ff  A5      : 0x3ffc011c  
A6      : 0x00000000  A7      : 0x3ffbe738  A8      : 0x80084860  A9      : 0x400966ce  
A10     : 0x3ffd96cc  A11     : 0x00000804  A12     : 0x00000001  A13     : 0x3ffc00fc  
A14     : 0x00000000  A15     : 0x00000000  SAR     : 0x00000006  EXCCAUSE: 0x00000006  
EXCVADDR: 0x00000000  LBEG    : 0x40091bec  LEND    : 0x40091bf7  LCOUNT  : 0xffffffff  
Core  1 was running in ISR context:
EPC1    : 0x4013cb27  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x00000000


Backtrace: 0x40085c29:0x3ffc014c |<-CORRUPTED

    0x4013cb27: uart_hal_write_txfifo at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/uart_hal_iram.c:35


Core  0 register dump:
PC      : 0x4020fdaa  PS      : 0x00060835  A0      : 0x8013a910  A1      : 0x3ffbda40  
A2      : 0x00000000  A3      : 0x400911a8  A4      : 0x00060020  A5      : 0x80000000  
A6      : 0x007c0168  A7      : 0x003fffff  A8      : 0x8013a452  A9      : 0x3ffbda10  
A10     : 0x00000000  A11     : 0x3ffc0164  A12     : 0x3ffc0164  A13     : 0x00000000  
A14     : 0x00060020  A15     : 0x00000000  SAR     : 0x0000001a  EXCCAUSE: 0x00000006  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  


Backtrace: 0x4020fda7:0x3ffbda40 0x4013a90d:0x3ffbda60 0x40097c38:0x3ffbda80

    0x4020fdaa: esp_pm_impl_waiti at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_pm/pm_impl.c:839
    0x400911a8: _cleanup_r at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/findfp.c:205
    0x4020fda7: cpu_ll_waiti at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/esp32/include/hal/cpu_ll.h:183
    0x4020fda7: esp_pm_impl_waiti at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_pm/pm_impl.c:837
    0x4013a90d: esp_vApplicationIdleHook at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/freertos_hooks.c:63
    0x40097c38: prvIdleTask at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/freertos/tasks.c:3987

@caveman99
Copy link
Sponsor Member

this really looks if CPU1 is not initialized after the assert/reboot. But we don't do that, the IDF does that. Crashes/Dumps are random and show whatever is on the call stack when the WDT triggers.

@caveman99
Copy link
Sponsor Member

going back to the original assert. Building a firmware now that will tell us the actual error code from ESP32.

@caveman99
Copy link
Sponsor Member

can you check https://github.com/meshtastic/artifacts/blob/device/pr/pr2183-firmware-2.0.14.2e0c4a2.zip?raw=true please? - if it asserts again, please post the logfile.

@rpsainio
Copy link
Author

rpsainio commented Jan 21, 2023

@caveman99 The logfile is 40M, but here is most likely the interesting part; the return code from esp_light_sleep_start
DEBUG | 09:01:01 7134 [Router] ++++++ Thread Router freed heap 68220 -> 68544 (324) ++++++
INFO | 09:01:03 7135 [PowerFSM] lsEnter begin, ls_secs=300
INFO | 09:01:03 7135 [PowerFSM] GPS prepare sleep!
DEBUG | 09:01:03 7135 [PowerFSM] esp_light_sleep_start result 1

assert failed: esp_sleep_wakeup_cause_t doLightSleep(uint64_t) sleep.cpp:317 (res == ESP_OK)

Backtrace: 0x4008448d:0x3ffd88f0 0x400966a1:0x3ffd8910 0x4009c821:0x3ffd8930 0x400f7b8d:0x3ffd8a60 0x400d44c5:0x3ffd8a80 0x400fa7cd:0x3ffd8aa0 0x400de239:0x3ffd8ac0 0x400d58df:0x3ffd8ae0 0x400fb121:0x3
ffd8b10 0x400de606:0x3ffd8b40 0x4012d085:0x3ffd8b60

ELF file SHA256: 8842e7ec291e84ff

E (24421) esp_core_dump_flash: Core dump flash config is corrupted! CRC=0x7bd5c66f instead of 0x0
Rebooting...
ets Jun 8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 188777542, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:1184
load:0x40078000,len:13132
load:0x40080400,len:3036
entry 0x400805e4

assert_02.txt

@rpsainio
Copy link
Author

@rpsainio Your "every now and then" can be the Problem with the Heap-Leak. See issue #2165

@karamo in fact there was also a reboot caused by heap-leak (seems to me), but the original problem lies still in entering LS
heap01.txt

@rpsainio
Copy link
Author

@caveman99 I was not able to find out how you implemented the printing of return code as the value 1 looks something I cannot verify from documentation. Looking into esp_sleep.h tells that the function is supposed to return ESP_ERR_INVALID_STATE (is defined as 0x0103 ), OTOH online documentation claims that it will return other values starting with ESP_ERR_SLEEP... So, if you print only first byte then it might match or mask by 0xff. Supposing ESP_ERR_INVALID_STATE as value, then it is caused by BT or WIFI not being shutdown.
You mentioned freeRTOS - is it possible that this function gets hooked by freeRTOS and having some other meaning for return values ?

@rpsainio
Copy link
Author

@caveman99 UPDATE: I am now testing a way to solve this problem, but I need to let it run several days - So, I let you know then

@rpsainio
Copy link
Author

@caveman99
So the assert happens in sleep.ccp when esp_light_sleep_start() return non-zero value. Furthermore running that extra FW revealed that the return value is 1. As I already mentioned I could not find this return value anywhere in original ESP32-documentation so it might come from some Arduino-layers, but I never followed that up.
I started to look how doLightSleep gets called and found it actually in PowerFSM.cpp in lsIdle(). It gets called two times: the LS-time is splitted up into 30 second slices:

    uint32_t sleepTime = 30;

    // If some other service would stall sleep, don't let sleep happen yet
    if (doPreflightSleep()) {
        setLed(false); // Never leave led on while in light sleep
        esp_sleep_source_t wakeCause2 = doLightSleep(sleepTime * 1000LL);

Then if there is nothing else going the timer returns and here starts the funny part

            case ESP_SLEEP_WAKEUP_TIMER:
            // Normal case: timer expired, we should just go back to sleep ASAP

            setLed(true);                 // briefly turn on led
            wakeCause2 = doLightSleep(1); // leave led on for 1ms

            secsSlept += sleepTime;
            // LOG_INFO("sleeping, flash led!\n");
            break;

So, we flash the LED for 1ms....
Now according to https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/sleep_modes.html one of the reasons why esp_light_sleep_start() might return non-zero is ESP_ERR_SLEEP_TOO_SHORT_SLEEP_DURATION , so my bet is that every now and then 1ms LS-sleep is considered too short.

I tested this by commenting out lines below.

// setLed(true); // briefly turn on led
// wakeCause2 = doLightSleep(1); // leave led on for 1ms

I also replaced the original assert with code that prints out returnvalue of esp_light_sleep_start()
Now it always returns 0 and would not assert so I think 1ms LS-sleep was the culprit.

@geeksville
Copy link
Member

This issue has been mentioned on Meshtastic. There might be relevant details there:

https://meshtastic.discourse.group/t/clarification-on-router-client-mode-app-visibility/7208/2

@caveman99
Copy link
Sponsor Member

@rpsainio good catch. Interesting part is, that code is 3 years old (!) Maybe a regression or behaviour change from the newer ESP-IDF I'll change that to a 100ms flash for testing.

@caveman99
Copy link
Sponsor Member

Could you test the firmware from that Pull Request?

@rpsainio
Copy link
Author

rpsainio commented Feb 4, 2023

@caveman99 just flashed and started again

@rpsainio
Copy link
Author

rpsainio commented Feb 5, 2023

@caveman99 It has been running now 47.000 seconds and earlier it failed around 2.000 - 6.000 seconds. No assert or esp_light_sleep_start messages so time to close this now. BTW, as a side effect I can tell that heap consumption is also OK
../heap_leak/2228/heap_summary.py console_01_fixed.txt
DeviceTelemetryModule -55548
RadioIf -573480
Router 626156
NodeInfoModule -17104

Total heap used:-19976

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants