Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic reset when scanning and several concurrent http requests occur #167

Closed
pletch opened this issue Jan 7, 2021 · 90 comments
Closed

Panic reset when scanning and several concurrent http requests occur #167

pletch opened this issue Jan 7, 2021 · 90 comments

Comments

@pletch
Copy link

pletch commented Jan 7, 2021

Working with others on a project which leverages NimBLE as a client only receiving advertisements (ADV_NONCONN_IND) from up to 8 devices. We have recently converted the project over to use ESPAsyncWebserver and have started experiencing sporadic PANIC resets. A reset can be fairly reliably triggered by hard reloading one of the pages with several concurrent jQuery requests repeatedly. We suspect this relates to competition for radio time between WiFi and BLE but are unsure about this aspect. Have done some troubleshooting such as trying to adjust the Scan Interval and Scan Window but have not been able to reliably eliminate the issue and am looking for ideas. The backtrace is always the same and appears to indicate that the PANIC is being triggered in NimBLE related code.

Any guidance on whether this likely points to a problem with our implementation or whether we have uncovered a library issue is much appreciated!

Logging output:
ble_hs_hci_cmd_send: ogf=0x08 ocf=0x000c len=2
0x0c 0x20 0x02 0x00 0x00
Command complete: cmd_pkts=5 ogf=0x8 ocf=0xc status=0
GAP procedure initiated: discovery; own_addr_type=0 filter_policy=0 passive=1 limited=0 filter_duplicates=0 duration=3000ms
ble_hs_hci_cmd_send: ogf=0x08 ocf=0x000b len=7
0x0b 0x20 0x07 0x00 0x54 0x00 0x3b 0x00 0x00 0x00
Command complete: cmd_pkts=5 ogf=0x8 ocf=0xb status=0
ble_hs_hci_cmd_send: ogf=0x08 ocf=0x000c len=2
0x0c 0x20 0x02 0x01 0x00

Backtrace: 0x40091fa4:0x3ffbdc20 0x400921d5:0x3ffbdc40 0x40085a9d:0x3ffbdc60 0x40085bc9:0x3ffbdc90 0x40180377:0x3ffbdcb0 0x4017ad15:0x3ffbdf70 0x40177de9:0x3ffbdfc0 0x400ee0e2:0x3ffbdff0 0x401c75ea:0x3ffbe010 0x401bba2a:0x3ffbe030 0x4001791d:0x3ffbe050 0x401bc029:0x3ffbe070 0x401bc106:0x3ffbe0b0 0x400188f5:0x3ffbe0e0 0x400175ee:0x3ffbe120 0x40017688:0x3ffbe140 0x4008cb36:0x3ffbe170 0x4008cd0b:0x3ffbe190 0x401bb9d7:0x3ffbe1b0 0x4008cd28:0x3ffbe1d0 0x40085025:0x3ffbe1f0 0x4008583e:0x3ffb9d00 0x4008cadd:0x3ffb9d20 0x401c7789:0x3ffb9d40 0x401aa088:0x3ffb9d60 0x400edd8d:0x3ffb9d80 0x40082b53:0x3ffb9da0 0x4008e33d:0x3ffb9dc0
#0 0x40091fa4:0x3ffbdc20 in invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c:707
#1 0x400921d5:0x3ffbdc40 in abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c:707
#2 0x40085a9d:0x3ffbdc60 in lock_acquire_generic at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/locks.c:143
#3 0x40085bc9:0x3ffbdc90 in _lock_acquire_recursive at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/locks.c:171
#4 0x40180377:0x3ffbdcb0 in _vfiprintf_r at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdio/../../../.././newlib/libc/stdio/vfprintf.c:860 (discriminator 2)
#5 0x4017ad15:0x3ffbdf70 in fiprintf at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdio/../../../.././newlib/libc/stdio/fiprintf.c:50
#6 0x40177de9:0x3ffbdfc0 in __assert_func at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdlib/../../../.././newlib/libc/stdlib/assert.c:59 (discriminator 8)
#7 0x400ee0e2:0x3ffbdff0 in host_rcv_pkt at .pio\libdeps\lcd_ssd1306\NimBLE-Arduino@src-e248a3b4ca528fb2d5b4751077939ebd\src\esp-hci\src/esp_nimble_hci.c:365
#8 0x401c75ea:0x3ffbe010 in vhci_send at ??:?
#9 0x401bba2a:0x3ffbe030 in r_eif_send at ??:?
#10 0x4001791d:0x3ffbe050 in ?? ??:0
#11 0x401bc029:0x3ffbe070 in hci_tx_start at hci_tl.c:?
#12 0x401bc106:0x3ffbe0b0 in r_hci_tl_send at ??:?
#13 0x400188f5:0x3ffbe0e0 in ?? ??:0
#14 0x400175ee:0x3ffbe120 in ?? ??:0
#15 0x40017688:0x3ffbe140 in ?? ??:0
#16 0x4008cb36:0x3ffbe170 in vhci_recv_end at vhci.c:?
#17 0x4008cd0b:0x3ffbe190 in r_vhci_isr at ??:?
#18 0x401bb9d7:0x3ffbe1b0 in r_eif_isr at ??:?
#19 0x4008cd28:0x3ffbe1d0 in r_eif_isr_wrapper at intc.c:?
#20 0x40085025:0x3ffbe1f0 in _xt_lowint1 at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/xtensa_vectors.S:1154
#21 0x4008583e:0x3ffb9d00 in cause_sw_intr_to_core_wrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/bt/bt.c:1322
#22 0x4008cadd:0x3ffb9d20 in vhci_set_interrupt at vhci.c:?
#23 0x401c7789:0x3ffb9d40 in API_vhci_host_send_packet at ??:?
#24 0x401aa088:0x3ffb9d60 in esp_vhci_host_send_packet at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/bt/bt.c:1322
#25 0x400edd8d:0x3ffb9d80 in ble_hci_trans_hs_cmd_tx_on_core_0 at .pio\libdeps\lcd_ssd1306\NimBLE-Arduino@src-e248a3b4ca528fb2d5b4751077939ebd\src\esp-hci\src/esp_nimble_hci.c:171
#26 0x40082b53:0x3ffb9da0 in ipc_task at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/ipc.c:62
#27 0x4008e33d:0x3ffb9dc0 in vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c:355 (discriminator 1)

@doudar
Copy link
Contributor

doudar commented Jan 7, 2021

I can only add that I fought the exact same issue for a couple weeks with espasyncwebserver and the old Bluetooth library. I ended up dropping back to the regular webserver and using client side JavaScript to provide the functionality I needed.

@h2zero
Copy link
Owner

h2zero commented Jan 7, 2021

Well I cannot say for sure what is happening here but I found this interesting, in 7 this assert is triggered. This is a message from the controller that it has encountered a hardware error.

Not much to work with here, except perhaps a power issue. If not that, then perhaps a bug in the controller, guessing by the previous area of the backtrace (who knows here, it's black magic).

One other thing it could be that could be tested however, is the commands being sent from core 0. This is a modification made to the esp-nimble files to address delay issues when sending commands from core 1. If you revert this commit and test I'd be very curious of the results.

@justoke
Copy link

justoke commented Jan 7, 2021

Hi. I have run into the same issue running an http server using ESPAsyncWebserver and scanning BLE - my understanding is that this is hardware problem in that the WiFi and BLE share the same hardware radio. Meaning that you have a harder software problem to balance scanning BLE for devices and servicing Wifi packets as they share the data pipeline. I've not found a solution yet and may have to make a separate controller for BLE scanning as the ESP32 crashes often or the http server takes as long to respond as the shortest scanning cycle. It seems while BLE scan is in progress you can't service http requests.

I'll share my findings if I manage to come up with a workable approach.

@justoke
Copy link

justoke commented Jan 7, 2021

Backtrace: 0x40091fa4:0x3ffbdc20 0x400921d5:0x3ffbdc40 0x40085a9d:0x3ffbdc60 0x40085bc9:0x3ffbdc90 0x40180377:0x3ffbdcb0 0x4017ad15:0x3ffbdf70 0x40177de9:0x3ffbdfc0 0x400ee0e2:0x3ffbdff0 0x401c75ea:0x3ffbe010 0x401bba2a:0x3ffbe030 0x4001791d:0x3ffbe050 0x401bc029:0x3ffbe070 0x401bc106:0x3ffbe0b0 0x400188f5:0x3ffbe0e0 0x400175ee:0x3ffbe120 0x40017688:0x3ffbe140 0x4008cb36:0x3ffbe170 0x4008cd0b:0x3ffbe190 0x401bb9d7:0x3ffbe1b0 0x4008cd28:0x3ffbe1d0 0x40085025:0x3ffbe1f0 0x4008583e:0x3ffb9d00 0x4008cadd:0x3ffb9d20 0x401c7789:0x3ffb9d40 0x401aa088:0x3ffb9d60 0x400edd8d:0x3ffb9d80 0x40082b53:0x3ffb9da0 0x4008e33d:0x3ffb9dc0

I've put that through an Exception Decoder. It looks like its related to the WiFi data, DNS and so on - I get the same sort of issues. If I find a workable approach I'll share it.
image

@doudar
Copy link
Contributor

doudar commented Jan 7, 2021

Nimble works perfectly fine with me with all of these libraries, even during scan. It's just the espasync . Which sucks because it is a nice webserver. You can however spin off the regular webserver below on its own task and it works reasonably well. Good luck!

#include <WebServer.h>
#include <HTTPClient.h>
#include <HTTPUpdate.h>
#include <SPIFFS.h>
#include <ESPmDNS.h>
#include <WiFiClientSecure.h>
#include <Update.h>
#include <DNSServer.h>

@justoke
Copy link

justoke commented Jan 7, 2021

Someone did mention power supply maybe needing a capacitor to deal with extra power needed during scan. Has anyone tried the setActiveScan with false or observed anything related to this setting and stability with Wifi and BLE. I'll try this myself anyway and see if this helps.

image

@lbussy
Copy link

lbussy commented Jan 7, 2021

I'm one of the unnamed co-conspirators.

Not much to work with here, except perhaps a power issue.

I can help rule-out the power unless it's a design flaw in the dev board itself (Lolin D32 Pro.) There's three of us working on this with similar results, it seems unlikely all of us have flakey power. I'd also expect to see a reset caused by brownout, not a panic. Unless you see something I do not?

Has anyone tried the setActiveScan with false or observed anything related to this setting and stability with Wifi and BLE.

setActiveScan is set to false. We are just using the iBeacon in this project.

@justoke which exception decoder are you using? I've had horrible results trying to consistently use those, but that would be a nice thing to figure out. I am using the esp32_exception_decoder built in to PIO and I get this which does not show the same information:

Backtrace: 0x40091978:0x3ffbe190 0x40091ba9:0x3ffbe1b0 0x40121f58:0x3ffbe1d0 0x400847ed:0x3ffbe1f0 0x401c4ecf:0x3ffb5cf0 0x401c65c5:0x3ffb5d60 0x401c3adb:0x3ffb5db0 0x401c106e:0x3ffb5e00 0x40019d11:0x3ffb5e40 0x40055b4d:0x3ffb5e60 0x401bb1e7:0x3ffb5e80 0x401bb731:0x3ffb5ea0 0x4008dd45:0x3ffb5ed0
  #0  0x40091978:0x3ffbe190 in invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c:707
  #1  0x40091ba9:0x3ffbe1b0 in abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c:707
  #2  0x40121f58:0x3ffbe1d0 in task_wdt_isr at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/task_wdt.c:252
  #3  0x400847ed:0x3ffbe1f0 in _xt_lowint1 at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/xtensa_vectors.S:1154
  #4  0x401c4ecf:0x3ffb5cf0 in r_llm_le_adv_report_ind at ??:?
  #5  0x401c65c5:0x3ffb5d60 in r_llm_pdu_defer at ??:?
  #6  0x401c3adb:0x3ffb5db0 in r_lld_pdu_check at ??:?
  #7  0x401c106e:0x3ffb5e00 in r_lld_evt_deffered_elt_handler at ??:?
  #8  0x40019d11:0x3ffb5e40 in ?? ??:0
  #9  0x40055b4d:0x3ffb5e60 in ?? ??:0
  #10 0x401bb1e7:0x3ffb5e80 in r_rw_schedule at ??:?
  #11 0x401bb731:0x3ffb5ea0 in btdm_controller_task at ??:?
  #12 0x4008dd45:0x3ffb5ed0 in vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c:355 (discriminator 1)

I ended up dropping back to the regular webserver

Rolling back to use the synchronous web server is an undesirable outcome. The performance difference between the two is quite pronounced for this application. Unless I inline all of the code (undesirable from a re-use and maintenance standpoint), a single connection would be quite bad.

One other thing it could be that could be tested however, is the commands being sent from core 0. This is a modification made to the esp-nimble files to address delay issues when sending commands from core 1. If you revert this commit and test I'd be very curious of the results.

The web server (async specifically) is pinned to core 1 per this note. Removing that makes things quite a bit worse, I did attempt to pin NimBLE to core 1 as well with no improvement.

Reverting that commit has proved challenging for me, and I'm sure it's something I am forgetting but I don't see it:

git log --oneline  
bc52eac (grafted, HEAD -> master, origin/master, origin/HEAD) Fix attributes not found with 16/32bit UUIDs.

I appreciate the folks here willing to help out. If it's something we need to bring upstream we can obviously do that. It's just quite a bit more challenging to work there.

@justoke
Copy link

justoke commented Jan 7, 2021

I think this might resolve your issue on ESP32. Add this define you your code - in main.cpp if you are using Platform IO
#define CONFIG_ASYNC_TCP_RUNNING_CORE 1

This will pin the AsyncWebserver which runs as a FreeRTOS task on CORE 1. I tried this and although the web server can still get blocked by the BLE(apparent because the web server response times are longer when the scan is in progress), it has not crashed since I tried this and often it responds in sub 100ms times which is as good as I could hope for. I implemented some other things specific to my use case - like blocking my loop() task logic while updating data from the BLE onResult() event.

@lbussy
Copy link

lbussy commented Jan 7, 2021

I think this might resolve your issue on ESP32.

Yes, sir - already doing that. My comment about it was not horribly clear in retrospect. I've edited my post above.

@justoke
Copy link

justoke commented Jan 7, 2021

@lbussy I used this decode the exceptions https://github.com/me-no-dev/EspExceptionDecoder. Although I mainly use PIO, this is the only thing I use Arduino IDE for.

@justoke
Copy link

justoke commented Jan 7, 2021

I think this might resolve your issue on ESP32.

Yes, sir - already doing that. My comment about it was not horribly clear in retrospect. I've edited my post above.

I had replied before I saw your comment. I'm happy to try help you. I was definitely in a world of pain before I made the core assignment. Of course my use case will differ from yours, but maybe there is an approach to what I have done that might help you.

@lbussy
Copy link

lbussy commented Jan 7, 2021

Thanks - I really hate going back to the Arduino IDE after PIO.

So here's something interesting. I have a fork of ESPAsynchWebServer I was using elsewhere and I pulled that in instead. The behavior is much improved (but not perfect). Here's the PR I merged in my fork:

https://github.com/me-no-dev/ESPAsyncWebServer/pull/836/files

As you can see the difference is basically cleaning up connections. It's not making perfect sense yet as I've not had my coffee - but you are 5 hours ahead so maybe it strikes a chord with you.

@lbussy
Copy link

lbussy commented Jan 7, 2021

I was definitely in a world of pain before I made the core assignment. Of course my use case will differ from yours, but maybe there is an approach to what I have done that might help you.

I agree - and we're using that:

    -D CONFIG_ASYNC_TCP_RUNNING_CORE=1  ; per: https://github.com/me-no-dev/ESPAsyncWebServer/issues/731#issuecomment-628163515
    -D CONFIG_ASYNC_TCP_USE_WDT=1

So aside from the PR up there having a net positive impact, if I remove the CONFIG_ASYNC_TCP_USE_WDT=1 define I get no crashes, but eventually the webserver stops responding.

We're poking in the right place for sure.

@h2zero
Copy link
Owner

h2zero commented Jan 7, 2021

I can help rule-out the power unless it's a design flaw in the dev board itself (Lolin D32 Pro.) There's three of us working on this with similar results, it seems unlikely all of us have flakey power. I'd also expect to see a reset caused by brownout, not a panic. Unless you see something I do not?

If the regulator is weak, which it may be on this board, the voltage might dip low enough to cause an error but not low enough to trigger the brownout detection (2.43v default in arduino). So if your regulator dropped to, say, 2.9v, that might be enough to trigger this. Just speculating here though, you would have to measure with a scope or meter to detect min levels.

Hopefully your PR there will lead to a software resolution though.

@doudar
Copy link
Contributor

doudar commented Jan 7, 2021

Thanks - I really hate going back to the Arduino IDE after PIO.

So here's something interesting. I have a fork of ESPAsynchWebServer I was using elsewhere and I pulled that in instead. The behavior is much improved (but not perfect). Here's the PR I merged in my fork:

https://github.com/me-no-dev/ESPAsyncWebServer/pull/836/files

As you can see the difference is basically cleaning up connections. It's not making perfect sense yet as I've not had my coffee - but you are 5 hours ahead so maybe it strikes a chord with you.

FYI, you can still work in PIO and just use the trace debugger
In Arduino IDE separately. When you select it from the menu, it asks for your firmware.elf file (grab it from your .pio folder) and paste the debug message into the window that pops up.

@justoke
Copy link

justoke commented Jan 7, 2021

I was definitely in a world of pain before I made the core assignment. Of course my use case will differ from yours, but maybe there is an approach to what I have done that might help you.

I agree - and we're using that:

    -D CONFIG_ASYNC_TCP_RUNNING_CORE=1  ; per: https://github.com/me-no-dev/ESPAsyncWebServer/issues/731#issuecomment-628163515
    -D CONFIG_ASYNC_TCP_USE_WDT=1

So aside from the PR up there having a net positive impact, if I remove the CONFIG_ASYNC_TCP_USE_WDT=1 define I get no crashes, but eventually the webserver stops responding.

We're poking in the right place for sure.

So I've had 5 hrs of up time on my ESP32 and not a single panic crash. The basic app use case is as follows:
Scan BLE every 2 seconds and read sensor values from detected beacons/sensors
Manage thermostat task when sensor readings change
When sensor values changes a flag is set and this prevents further BLE scanning until the thermostat task has responded and then the state machine repeats the cycle.
The loop() function simply services the web server - again while scan results in sensor changes, the web server is not serviced.

To test the web api response, I fired up Postman and fired requests at 500 millisecond intervals for a few minutes. What is apparent is during the BLE scan response time goes up to 4s, otherwise typical response is sub 200 millisecond with 80 to 90 ms being fairly average.

@lbussy
Copy link

lbussy commented Jan 7, 2021

Stopping the BLE scans while servicing the jquery (which are JSON requests) is a possibility. That might help. Thank you for your ideas - there might be something in there we can use!

I wonder if the AsyncWebServer can be paused during BLE scans - sort of the opposite.

@lbussy
Copy link

lbussy commented Jan 7, 2021

Hopefully your PR there will lead to a software resolution though.

He has 40 open PRs going back to 2017. I'm not going to hold my breath.

@justoke
Copy link

justoke commented Jan 7, 2021

Stopping the BLE scans while servicing the jquery (which are JSON requests) is a possibility. That might help. Thank you for your ideas - there might be something in there we can use!

I wonder if the AsyncWebServer can be paused during BLE scans - sort of the opposite.

I do this - so when BLE onResult() event finds a matching device that I am after, it reads the sensor data and sets HasChanges to true. In that state I don't process web requests until the changes have been processed by the relevant observers. In practice this doesn't take long and the stability gained is worth the slightly slower request response.

image

@justoke
Copy link

justoke commented Jan 7, 2021

Hopefully your PR there will lead to a software resolution though.

He has 40 open PRs going back to 2017. I'm not going to hold my breath.

I'm using the latest version and haven't had to rely on any PRs which is preferred in my case as I don't want to have to keep track of such nuances in my projects

@lbussy
Copy link

lbussy commented Jan 7, 2021

Yeah, that would be fine if I was synchronous, I'm not. I hope to not have to go back to the other webserver. If I can somehow "pause" the async, that might help.

@doudar
Copy link
Contributor

doudar commented Jan 7, 2021 via email

@pletch
Copy link
Author

pletch commented Jan 7, 2021

I reverted the mentioned commit and it did not appear to make much difference as both myself and my fellow conspirators continue to experience the PANIC. Thank you for providing the suggestion. We'll continue to brainstorm ideas.

One other thing it could be that could be tested however, is the commands being sent from core 0. This is a modification made to the esp-nimble files to address delay issues when sending commands from core 1. If you revert this commit and test I'd be very curious of the results.

@lbussy
Copy link

lbussy commented Jan 7, 2021

Are you able to completely shut down the async webserver then restart it?

I do this in a different project (ESPAsyncWebServer::stop() if I remember correctly) and it works fine. It seems heavy-handed here, it takes a sec to start and stop, so I'm more looking for a "hold up a sec" method. :)

@h2zero
Copy link
Owner

h2zero commented Jan 7, 2021

Thanks for letting me know.

Could I ask what your scan interval and window values are? I know you've tried adjusting them I'm just curious. If I get time tonight I'll load up my async server BLE project and see if I can reproduce.

Also is the backtrace the same every time or does it change?

@lbussy
Copy link

lbussy commented Jan 7, 2021

Here's our init:

void tiltScanner::init()
{
    NimBLEDevice::init("");
    pBLEScan = NimBLEDevice::getScan();
    pBLEScan->setAdvertisedDeviceCallbacks(callbacks);

    pBLEScan->setActiveScan(false);
    pBLEScan->setInterval(53);
    pBLEScan->setWindow(37);
}

Between us, we get a slightly different backtrace, but we each keep the same one if that makes sense.

Happy to share a link to this project if that will help you help us.

@h2zero
Copy link
Owner

h2zero commented Jan 8, 2021

Those values look pretty tight in reality if you are sharing the antenna with a webserver.

You've got 20ms open for the webserver to do its job between scan windows. If your latency to the server is more than that you'll have issues. Try increasing the interval by 200, you'll get less responses on BLE but it may solve the abort.

You can work your way up or down from there with those values to find the sweet spot.

@lbussy
Copy link

lbussy commented Jan 8, 2021

I had randomly chosen some values with (as expected) no improvement. Can you recommend what might work?

For reference, we're working with up to eight devices that are BLEacon-only.

@h2zero
Copy link
Owner

h2zero commented Jan 8, 2021

Only values I can suggest would be setting the interval much higher and keeping the window as is. Something extreme, say 5000 interval.

This is not an easy issue to solve for sure and the feedback we have is not ideal. We don't know where or why the problem occurs but we can try.

If the extreme interval value does not improve the results we need to look elsewhere. Let me know what happens if you try that.

@lbussy
Copy link

lbussy commented Jan 8, 2021

Well, it's consistent anyway :) :

Backtrace: 0x40091978:0x3ffbe190 0x40091ba9:0x3ffbe1b0 0x40122074:0x3ffbe1d0 0x400847ed:0x3ffbe1f0 0x401d6617:0x3ffbd8e0 0x40123626:0x3ffbd900 0x4008f539:0x3ffbd920 0x4008dd45:0x3ffbd940
  #0  0x40091978:0x3ffbe190 in invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c:707
  #1  0x40091ba9:0x3ffbe1b0 in abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c:707
  #2  0x40122074:0x3ffbe1d0 in task_wdt_isr at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/task_wdt.c:252
  #3  0x400847ed:0x3ffbe1f0 in _xt_lowint1 at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/xtensa_vectors.S:1154
  #4  0x401d6617:0x3ffbd8e0 in esp_pm_impl_waiti at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/pm_esp32.c:492
  #5  0x40123626:0x3ffbd900 in esp_vApplicationIdleHook at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/freertos_hooks.c:108
  #6  0x4008f539:0x3ffbd920 in prvIdleTask at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/tasks.c:4813
  #7  0x4008dd45:0x3ffbd940 in vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c:355 (discriminator 1)

Rebooting...

@justoke
Copy link

justoke commented Jan 15, 2021

Found this when looking at async issues me-no-dev/ESPAsyncWebServer#900 (comment)
might be worth giving that a shot.

Thank you. I've tried to reference that branch as it does sound promising. Unfortunately there are a lot of compile errors which I don't really understand related to things like this:
image
image

I'll keep trying.

@pletch
Copy link
Author

pletch commented Jan 25, 2021

I have recently achieved very good stability and performance. I attribute the improvements to the following:

  • using the latest arduino-espressif platform packages
  • revising event queue management in AsyncTCP after several hints in some of the ESPAsyncWebserver issue threads
  • implementing better parallelism and task flow control using FreeRTOS tasks to distribute work between the two ESP32 cores.

WDT errors are completely gone and I haven't been able to produce any other crashes for several days now under heavy load. I am closing this issue as I consider it resolved from my perspective. If others wish to continue the discussion, feel free to reopen.

@pletch pletch closed this as completed Jan 25, 2021
@h2zero
Copy link
Owner

h2zero commented Jan 25, 2021

Glad you got it stable!

@justoke
Copy link

justoke commented Jan 25, 2021

I attribute the improvements to the following:

Would you mind outlining what you did in more detail? When you say latest arduino-espressif are you adding anything specific to the platformio.ini file or you are using latest PlatformIO Espressif 2.1? I pretty much followed the last 2 of your list and got to a much more stable place, but have not been able to eliminate the crashes all together. My impression is that while scanning is in progress and a WifiUDP::parsePacket is processed - in my case as part of OTA handler(note OTA is not in progress, just the handler is polled by the loop function) - a crash often occurs. I run the scan in a Freertos task on core 0, and as part of a state machine so that tasks run according to state and not concurrently. That said I'm happy for the issue to be closed as I don't think there is anything wrong with the NimBLE library, rather it is not easy to get BLE and WiFi working in the scenario described. Running BLE on it's own is fine, I run into heap running out and not recovering when I combine everything together, even with a state machine.

Thanks to @h2zero for finding this me-no-dev/ESPAsyncWebServer#876 - I modified the AsynTCP.cpp according to the suggestions there and the web server responds near instantly most of the time. I was getting responses taking up to five seconds, now it is typically under 200ms.

image

image

@doudar
Copy link
Contributor

doudar commented Jan 25, 2021

I attribute the improvements to the following:

Would you mind outlining what you did in more detail? When you say latest arduino-espressif are you adding anything specific to the platformio.ini file or you are using latest PlatformIO Espressif 2.1? I pretty much followed the last 2 of your list and got to a much more stable place, but have not been able to eliminate the crashes all together. My impression is that while scanning is in progress and a WifiUDP::parsePacket is processed - in my case as part of OTA handler(note OTA is not in progress, just the handler is polled by the loop function) - a crash often occurs. I run the scan in a Freertos task on core 0, and as part of a state machine so that tasks run according to state and not concurrently. That said I'm happy for the issue to be closed as I don't think there is anything wrong with the NimBLE library, rather it is not easy to get BLE and WiFi working in the scenario described. Running BLE on it's own is fine, I run into heap running out and not recovering when I combine everything together, even with a state machine.

Thanks to @h2zero for finding this me-no-dev/ESPAsyncWebServer#876 - I modified the AsynTCP.cpp according to the suggestions there and the web server responds near instantly most of the time. I was getting responses taking up to five seconds, now it is typically under 200ms.

image

image

I can't answer for him but you can use:
platform_packages = framework-arduinoespressif32 @ https://github.com/espressif/arduino-esp32.git#master
in platformio.ini.

It's a blessing and a curse though. Small things they change have caused issues for me that made it past my initial testing and into binaries.

@h2zero
Copy link
Owner

h2zero commented Jan 29, 2021

@lbussy, @pletch I have not tested this with tiltbridge but I made some significant changes to NimBLEScan and NimBLEAdvertised device that significantly increase performance and may alleviate some of the watchdog timeouts. If you're interested in testing this please checkout this branch.

Some notes for your use: in the scan setup call pBLEScan->setMaxResults(0); (you also will not need to call clearResults() when the scan stops with this setting).

@lbussy
Copy link

lbussy commented Jan 29, 2021

We'll give it a go and let you know how it works out, thank you!

@justoke
Copy link

justoke commented Jan 29, 2021

Some notes for your use: in the scan setup call pBLEScan->setMaxResults(0);

I added this to my platformio.ini file lib_deps:
https://github.com/h2zero/NimBLE-Arduino.git#refactor-scan

But after building it I don't see the setMaxResults() method. Am I missing something? Thanks.

@justoke
Copy link

justoke commented Jan 29, 2021

I managed to get it to compile - had to clean out all the environments and used this reference:
https://github.com/h2zero/NimBLE-Arduino.git#79cd9464b2c98f6ade1d1ec1fc01959cd16521a1

@justoke
Copy link

justoke commented Jan 29, 2021

@lbussy, @pletch I have not tested this with tiltbridge but I made some significant changes to NimBLEScan and NimBLEAdvertised device that significantly increase performance and may alleviate some of the watchdog timeouts. If you're interested in testing this please checkout this branch.

Some notes for your use: in the scan setup call pBLEScan->setMaxResults(0); (you also will not need to call clearResults() when the scan stops with this setting).

I've given this a try. It works fine, most notably for me is the heap seems to be almost entirely preserved. I would often see the heap go from 80k down to 35k(the danger zone from my experience). So far it has been fluctuating around only a few hundred bytes during the scan. This is also probably because I am now specifying max results to 3 as I only have 2 devices of interest. This is anecdotal and my use case is likely to differ from others, but still a good improvement and I see the payload is now complete which was another issue that seems to be rolled into this branch.

image

@h2zero Excellent work here - this project is so useful and was much needed. Thank you to you and everyone else who is making this better and better!

@lbussy
Copy link

lbussy commented Jan 29, 2021

I don't have time to "dig in" and learn what may be beneficial, I slapped it in just to see what happens. Unfortunately, the heap usage went up significantly.

Before:

Free Heap: 163544, Max Allocated: 93028, Frag: 44

After:

Free Heap: 135536, Max Allocated: 97392, Frag: 29

Fortunately (but unfortunately I suppose) we are using a deinit() before calling some memory intensive functions. Therefore the delta is not impactful right now.

@justoke
Copy link

justoke commented Jan 29, 2021

@lbussy As I said, my use case will differ from others. Nonetheless I would say the heap is very healthy from my point of view. I'm pegged at around 85k. I expect that it is not unusual for it to drop, but I've experienced that once you get to the <35k it is likely to crash the ESP32.

@justoke
Copy link

justoke commented Jan 29, 2021

I should add that I have now had a stable web server and BLE running without crashes for 7hrs now - not related to this branch of NimBLE directly, but mainly around the way the BLE and web server and other tasks are orchestrated using Freertos. I had 2 weeks of pain trying to get to this stage. I need to get through 24 hrs straight to be satisfied that I have indeed cracked it and that the pattern will be fit for a MVP.

@h2zero
Copy link
Owner

h2zero commented Jan 29, 2021

I don't have time to "dig in" and learn what may be beneficial, I slapped it in just to see what happens. Unfortunately, the heap usage went up significantly.

Before:

Free Heap: 163544, Max Allocated: 93028, Frag: 44
After:

Free Heap: 135536, Max Allocated: 97392, Frag: 29

Yikes, that shouldn't happen, I'll have to look into this. Thanks for letting me know.

@justoke Awesome!, glad to hear it's working out for you.

@lbussy
Copy link

lbussy commented Jan 29, 2021

I have a VERY chatty BLE environment. Is that a factor?

@h2zero
Copy link
Owner

h2zero commented Jan 29, 2021

I have a VERY chatty BLE environment. Is that a factor?

That's what this branch was intended to help with, if max results is set to 0 then no advertised devices will be stored unless done by copy in the callback in the user code. It's quite shocking that you are seeing the opposite of the intended effect.

@h2zero
Copy link
Owner

h2zero commented Jan 30, 2021

@lbussy, I'm not able to reproduce the heap use you see.

This is master branch:

34527 V: Free Heap: 132928, Max Allocated: 95112, Frag: 29

This is refactor-scan:

34508 V: Free Heap: 133252, Max Allocated: 95524, Frag: 29

Max results set to 0:

34514 V: Free Heap: 133408, Max Allocated: 95780, Frag: 29

@h2zero
Copy link
Owner

h2zero commented Jan 30, 2021

Well after testing this branch for a while and spamming the about page like a mad man I have not been able to reproduce the crash originally discussed in this thread 😄 . Whatever changes you all made, good work!

@lbussy
Copy link

lbussy commented Jan 30, 2021

Maybe we need to get you to compile and send it to me? I can still kill it. :)

Which branch are you testing with?

@h2zero
Copy link
Owner

h2zero commented Jan 30, 2021

I'm working with this thorrak/tiltbridge@8c2c81f

using lcd_ssd1306 env

I can stuff it up and lag it for sure still if I spam the page fast enough but I have not caused a panic so far.

@lbussy
Copy link

lbussy commented Jan 30, 2021

Ah yeah. The 1306 is quite a bit more stable for some reason. If I use d32_pro_tft I can kill it in no time.

@h2zero
Copy link
Owner

h2zero commented Jan 30, 2021

Interesting, sounds like the display updates may be a contributing factor then. I'm not sure how they are handled though so just thinking out loud.

@lbussy
Copy link

lbussy commented Jan 30, 2021

We've been leaning in that direction, but we're not able to prove it (yet).

@h2zero
Copy link
Owner

h2zero commented Jan 30, 2021

Are you still seeing a big difference in heap consumption with the refactor-scan branch?

@lbussy
Copy link

lbussy commented Jan 30, 2021

Are you still seeing a big difference in heap consumption with the refactor-scan branch?

It's looking good so far.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants