Skip to content
This repository has been archived by the owner on Feb 4, 2023. It is now read-only.

ESPAsync_WiFiManager::startConfigPortal() will cause a watchdog timeout when called from a higher-priority task. #39

Closed
russelljahn opened this issue Feb 13, 2021 · 14 comments

Comments

@russelljahn
Copy link

Hi, first I want to say thank you for this library. It's fantastically well documented. I've also compared the old version of this library to this newer async version, and the improved snappiness of the experience is night & day.

I'm working on fairly complex firmware for a sensor-based project, and can consistently repro a watchdog timeout crash that only happens when startConfigPortal() is called from a task with a priority above the default of 0.

Debugging my firmware, I've simplified the code down to these 2 tasks:

  1. Network Task (Priority of 1). This fires ESPAsync_WiFiManager::startConfigPortal() inside of a task.
  2. Idle Task (Priority of 0). This is the default implicit FreeRTOS task for that feeds the Task watchdog: https://gitdemo.readthedocs.io/en/latest/api/system/wdts.html

I noted that if I decrease the Network Task's priority to 0, the watchdog crash will not happen. This is intuitive if I examine the source code:

  • On line 887 of ESPAsync_WiFiManager.cpp, there's a yield() statement in the spin loop waiting for the ESPAsync_WiFiManager config portal to timeout or be closed.
  • Unlike delay(milliseconds)/vTaskDelay(ticks), which suspends the currently running Task and triggers the Task scheduler to schedule the next available Task, yield() doesn't necessarily suspend the current Task.
  • This would naturally cause a watchdog crash - in my case, the Network Task with a priority of 1 never yields time to the Idle Task do to the yield() in startConfigPortal(). This starves the watchdog and eventually triggering a crash.
  • Decreasing the Network Task's priority to 0 allows the Idle Task to get yielded time and feed the watchdog.
  • And lastly, replacing that yield() with delay() fixes the crash(!)

I'd suggest using delay() in line 887 instead, as that will allow yielding processing time to lower priority tasks. This allows using the library without modifications for more complex projects like mine, which require different task priorities. If you'd like to keep the same default behaviour for existing users, this would be easy to #ifdef with a flag like your other opt-in features.

I'm more than happy to create a PR with the fix if you'd review it. 😃

@russelljahn russelljahn changed the title Calling ESPAsync_WiFiManager::startConfigPortal() will cause a watchdog timeout when called from a higher-priority task. Calling ESPAsync_WiFiManager::startConfigPortal() will cause a watchdog timeout when called from a higher-priority task. Feb 13, 2021
@russelljahn russelljahn changed the title Calling ESPAsync_WiFiManager::startConfigPortal() will cause a watchdog timeout when called from a higher-priority task. ESPAsync_WiFiManager::startConfigPortal() will cause a watchdog timeout when called from a higher-priority task. Feb 13, 2021
@khoih-prog
Copy link
Owner

Hi @russelljahn

Thanks for your interests and good research on the issue.

I'm currently doing exact the same thing, introducing delay(1) additional to the mentioned yield(), to provide support to ESP32-S2. But this delay(1) is designed only for ESP32-S2.

For your WDT issue, I think it's better to change the priority of WDT task to be higher than that of the task you're monitoring.

Check Watchdog-Timers

The watchdog monitor task runs at a higher priority than the tasks it is monitoring.

The nature of the tasks

Most tasks have some minimum period during which they are required to run. A task may run in reaction to a timer that occurs at a regular interval. These tasks have a start point through which they pass in each execution loop)

@russelljahn
Copy link
Author

I appreciate the quick response. 👍 I'll check out the article you linked.

@russelljahn
Copy link
Author

russelljahn commented Feb 14, 2021

@khoih-prog I've read the article and done more research. I'm not convinced that increasing the WDT task's priority is a good design decision.

The lower-priority watchdog is doing its job and catching a legitimate bug - The WiFi Portal's loop is starving any lower priority tasks from running. There are other tasks in my system (logging, battery, sensor polling) that aren't able to meet their schedules on time while the WiFi Portal's loop is hogging the CPU.

@khoih-prog
Copy link
Owner

In that case, you have to lower the priority of the Config-Portal, even to the lowest possible priority.
Assigning priorities for tasks requires a good and detailed knowledge of the tasks' functionalities to do correctly. Arbitrarily priority assigning will finally comes back to haunt you. Been there, done that, I just can share some experience with you.
Anyway, it's up to you to investigate and solve that issue.

@russelljahn
Copy link
Author

Thanks for the replies and perspective, as always. My vantage is from a seasoned software design perspective of designing commercial APIs and SDKs for others. If end-users have to understand the internal mechanics inside the library, rather than using a well-designed API that works for a variety of use-cases, it breaks prevents the basic principle of encapsulation. As this is an asynchronous library, I imagine you'll get similar tickets to mine.

There is a non-arbitrary rhyme and reason for assigning priority order. If you're reluctant to allow the library to delay() rather than yield(), I do think wrapping the Config Manager in its own low priority task will be a good compromise to avoid modifying this library, and simplify updating it.

May I understand your reluctance on switching out the yield()? If the library needs to apply a delay anyways for the ESP32-S2 boards, wouldn't it make sense to have one common code path that works for both ESP32 and ESP32-S2 in a broader variety of situations?

@khoih-prog
Copy link
Owner

khoih-prog commented Feb 15, 2021

If end-users have to understand the internal mechanics inside the library

You actually don't need to know the internal details of the library, just what is the function you're using it for. Then assign the priority accordingly after comparing to all other tasks..

Config-Portal, and GUI related tasks, are generally time-consuming and don't need high priority. Some acceptable delay will still be OK.

The other tasks, measuring, etc., requires higher/highest priority as they must do something, even very fast, in-time, and delay is normally not acceptable.

If you're reluctant to allow the library to delay() rather than yield()

I just do something necessary, as delay() is generally a waste of MPU time. For ESP32-S2, I believe we still need delay() because of certain mistake in design of that still immature esp32-s2 core. Especially this is just 1-core MPU.

Just rechecking this ESPAsync_WiFiManager library code, it's so bad and inefficient when using the AsyncWebServer. I'll rewrite the library soon to make it more efficient and not so unnecessarily time-consuming because I'm now even having issue to convert it to support ESP32-S2.

The unnecessary time-hogging and inefficiency are the reasons you got issue with Config Portal.

Try this better ESPAsync_WiFiManager_Lite and see if your issue disappear.

@russelljahn
Copy link
Author

Just rechecking this ESPAsync_WiFiManager library code, it's so bad and inefficient when using the AsyncWebServer. I'll rewrite the library soon to make it more efficient and not so unnecessarily time-consuming because I'm now even having issue to convert it to support ESP32-S2.

The unnecessary time-hogging and inefficiency are the reasons you got issue with Config Portal.

Yep, I think we see the same thing. Even my suggestion of a delay() is a patch on top of that spin-loop, which is appears to be an artifact from the non-async version.

Appreciate the link, the help, and what you're doing for the community! I'll stop adding to the thread here. 😊

@khoih-prog
Copy link
Owner

Hi @russelljahn

You can test the new release v1.5.0, which I think fixes the time-hogging issue when scanning Networks

If you still need delay(), you can change to whatever value (ms) as necessary.

#define TIME_BETWEEN_CONFIG_PORTAL_LOOP            1L   // or how many ms to delay between CP loop

Major Releases v1.5.0

  1. Add support to ESP32-S2 (ESP32-S2 Saola, AI-Thinker ESP-12K, ESP32S2 Dev Module, UM FeatherS2, UM ProS2, microS2, etc.)
  2. Add Instructions to install ESP32-S2 core
  3. Rewrite library code to be more efficient and multitask friendly

@russelljahn
Copy link
Author

Thanks for the update! I'll check out the update when I'm able.

@russelljahn
Copy link
Author

I've been testing the newer library for a week now across different devices. When calling startConfigPortal from a default-priority task (priority of 0), it appears to run fine. When running it from a priority 1 task, I get the watchdog panic in a different form. This is with and without #define TIME_BETWEEN_CONFIG_PORTAL_LOOP 1 declared. I don't encounter the same crash when modifying the previous version with the delay suggestion above:

0x4009190c: invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 156
0x40091b89: abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 171
0x401a9330: task_wdt_isr at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/task_wdt.c line 174
0x40094501: vTaskExitCritical at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/tasks.c line 4274
0x400966b7: multi_heap_internal_unlock at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/heap/multi_heap.c line 380
0x40096d0a: multi_heap_malloc at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/heap/multi_heap_poisoning.c line 206
0x400817cd: heap_caps_malloc at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/heap/heap_caps.c line 111
0x400817fe: heap_caps_malloc_default at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/heap/heap_caps.c line 140
0x40082495: _malloc_r at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/syscalls.c line 37
0x401b7307: operator new(unsigned int) at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/new_op.cc line 50
0x401b72d9: operator new[](unsigned int) at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/new_opv.cc line 32
0x401d6a57: WiFiUDP::parsePacket() at C:/users/russell/.platformio/packages/framework-arduinoespressif32@src-cba3def1496a47e6af73c0b73bd2e13c/libraries/WiFi/src/WiFiUdp.cpp line 210
0x401e13b5: DNSServer::processNextRequest() at C:/users/russell/.platformio/packages/framework-arduinoespressif32@src-cba3def1496a47e6af73c0b73bd2e13c/libraries/DNSServer/src/DNSServer.cpp line 62
0x400e04e0: ESPAsync_WiFiManager::startConfigPortal(char const*, char const*) at lib/molecule/Features/WiFi/ESPAsync_WiFiManager.cpp line 536

Heads up, a separate regression with the 1.5 release I've noticed is the WiFi scan results no longer show in the config portal. Here's a screenshot:
File_000 (2)

To verify that it wasn't an issue with my firmware, I took the Async_AutoConnect_ESP32_minimal.ino example and added starting the config portal. The screenshot above is from this example. I'm running on an ESP32 board, and I've tested across Windows/Android/iOS.

#include <ESPAsync_WiFiManager.h>              //https://github.com/khoih-prog/ESPAsync_WiFiManager
AsyncWebServer webServer(80);
DNSServer dnsServer;

void setup()
{
  Serial.begin(115200); while (!Serial); delay(200);
  Serial.print("\nStarting Async_AutoConnect_ESP32_minimal on " + String(ARDUINO_BOARD)); Serial.println(ESP_ASYNC_WIFIMANAGER_VERSION);
  ESPAsync_WiFiManager ESPAsync_wifiManager(&webServer, &dnsServer, "AutoConnectAP");
  ESPAsync_wifiManager.startConfigPortal();
}

loop() {}

@khoih-prog
Copy link
Owner

Thanks for the feedback.

I've just tested and got this preliminary results

  1. ESP32 release v1.0.5 breaks the Async scanning function (always returning "No network found") this library is relied on. I'll investigate, find out the where the issue is and fix.
  2. ESP32 release v1.0.4 is working OK. The WiFi scanning is still OK.

image


Are you running ESP32 core v1.0.5? If so, try core v1.0.4.


I'm not sure that will finally fix your priority issue as this priority even seems simple but sometimes very time-consuming and frustrating. Only you can solve it as only you possess all the data, code and know the use case / scenario.

@russelljahn
Copy link
Author

@khoih-prog In my build toolchain, I use the master branch of the ESP32 core to be able to use the latest C++ capabilities: https://github.com/espressif/arduino-esp32.git#master (if you use PlatformIO).
For what it's worth, the wifi scan results show up fine when downgrading this library to v1.4.3 with that same ESP32 core version.

I understand your sentiment about the complexity of diagnosing priority/watchdog issues. For now this isn't a blocker, as I'm sticking with the v1.4.3 with my patch, which appears the most stable/reliable solution for my project. I can't afford more development time to troubleshoot these issues, but as always, appreciate your responses.

@khoih-prog
Copy link
Owner

I already found out and will fix and release today a new version, which combine the best working code of v1.4.3 as well as v1.5.0.

@khoih-prog
Copy link
Owner

The new ESPAsync_WiFiManager v1.6.0 has just been released to fix the mentioned WiFi Scanning.

The following TIME_BETWEEN_CONFIG_PORTAL_LOOP is still usable

#define TIME_BETWEEN_CONFIG_PORTAL_LOOP            1L   // or how many ms to delay between CP loop

Releases v1.6.0

  1. Fix WiFi Scanning bug.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants