[rtl872x] fix multiple BLE issues #2710

XuGuohui · 2023-11-15T06:54:46Z

Problem

BLE scanning may stall. BLE scanning API acquires the BLE lock. When error occured in BLE stack during scanning, it tries to reset the stack, which then sends a command to a queue for an internal thread to process. If the thread is trying to acquire the BLE lock to process other command in queue before handling the command to reset the stack, then deadlock occurs.
Due to reset of the BLE stack, for whatever reason, all of the on-going BLE activities are stopped. This typically causes the BLE provisioning mode not functioning well.
BLE states's event may be notified in messed order.

Solution

Unlock the BLE lock before resetting the BLE stack if required.
Cache BLE current states and restore them after the BLE stack is reset successfully.
Do not completely rely on the notified state event. Use SDK API to check the state before waiting the target state is accomplished.

Steps to Test

Build and run the attached user application.

Example App

#include "Particle.h"

STARTUP(System.enableFeature(FEATURE_DISABLE_LISTENING_MODE));

// delay after which to enable provisioning mode
// set to a very high value to effectively disable
#define PROVISION_MODE_DELAY (10)
// #define PROVISION_MODE_DELAY (100000)

SYSTEM_MODE(SEMI_AUTOMATIC);
SYSTEM_THREAD(ENABLED);

Serial1LogHandler logHandler(115200, LOG_LEVEL_ALL);

void scan_result_cb(const BleScanResult *result, void *context) {
    Log.info("\tgot advert");
}

void scan_thread_f() {
    BLE.on();

    // short repeated scans for improved chance to catch all sensor adverts
    // (each scan will only return the first advert from a device so we can't
    // use long-running scan)
    BLE.setScanTimeout(50);

    // disable active scan
    // only get advert, don't request scan response for less overhead and
    // hopefully reduced power consumption on sensors
    BleScanParams params = {};
    params.version = BLE_API_VERSION;
    params.size = sizeof(BleScanParams);
    BLE.getScanParameters(params);
    params.active = false;
    BLE.setScanParameters(params);

    while (true) {
        Log.info("start scan");
        unsigned int t0 = millis();
        BLE.scan(scan_result_cb, nullptr);
        unsigned int t1 = millis();
        if (t1 - t0 > 500) {
            Log.info("\tscan took too long: %dms", t1 - t0);
        }
        Log.info("\tend scan");
        delay(1);
    }
}

void setup() {
    // while (!Serial.isConnected()) {
        // delay(100);
    // }
    static Thread *thread = new Thread("sensor_ble_scan", scan_thread_f, OS_THREAD_PRIORITY_DEFAULT);
}

void loop() {
    static bool provision_once = true;
    static uint32_t last_sec = System.uptime();
    uint32_t now_sec = System.uptime();

    if (provision_once && now_sec > PROVISION_MODE_DELAY) {
        provision_once = false;

        // ---------Setup device name---------
        char buf[16] = {0};
        char name[BLE_MAX_DEV_NAME_LEN+1];
        auto len = hal_get_device_serial_number(buf, sizeof(buf), nullptr);
        // name gets last 6 of the serial number to be (mostly) unique
        snprintf(name, sizeof(name), "SmartTub-%s", &(buf[len-6]));
        Log.info("before BLE.setDeviceName");
        BLE.setDeviceName(name);
        Log.info("after BLE.setDeviceName");

        // ---------Set company ID---------
        Log.info("before BLE.setProvisioningCompanyId");
        BLE.setProvisioningCompanyId(0x1234);
        Log.info("after BLE.setProvisioningCompanyId");

        // ---------BLE provisioning mode---------
        Log.info("before BLE.provisioningMode");
        BLE.provisioningMode(true);
        Log.info("after BLE.provisioningMode");
    }

    if (now_sec - last_sec >= 10) {
        last_sec = now_sec;
        Log.info("loop: getProvisioningStatus: %d", BLE.getProvisioningStatus());
    }
}

References

N/A

Completeness

User is totes amazing for contributing!
Contributor has signed CLA (Info here)
Problem and Solution clearly stated
Run unit/integration/application tests on device
Added documentation
Added to CHANGELOG.md after merging (add links to docs and issues)

avtolstoy · 2023-11-15T14:12:00Z

hal/src/rtl872x/ble_hal.cpp

+    struct BleGapCache {
+        bool isAdv;
+        bool isScan;
+        bool isconn;


Nit: isConn

avtolstoy · 2023-11-15T14:13:38Z

hal/src/rtl872x/ble_hal.cpp

@@ -1585,13 +1646,14 @@ int BleGap::startScanning(hal_ble_on_scan_result_cb_t callback, void* context) {

    SCOPE_GUARD ({
        if (isScanning_) {
-            const int LE_SCAN_STOP_RETRIES = 10;
+            const int LE_SCAN_STOP_RETRIES = 1;


Remove loop entirely if no longer required? I think when we ran into this previously a few retries did help, but it was always very difficult to reliably reproduced without logging enabled.

I can't reproduce this. But if you did observe the retries help, let's keep it as-is at cost of 100ms postponed.

avtolstoy · 2023-11-15T14:14:51Z

hal/src/rtl872x/ble_hal.cpp

@@ -1616,8 +1681,13 @@ int BleGap::startScanning(hal_ble_on_scan_result_cb_t callback, void* context) {
    CHECK_RTL(le_scan_start());
    isScanning_ = true;
    // GAP_SCAN_STATE_SCANNING may be propagated immediately following the GAP_SCAN_STATE_START
-    if (waitState(BleGapDevState().scan(GAP_SCAN_STATE_SCANNING))) {
-        return SYSTEM_ERROR_TIMEOUT;
+    if (waitState(BleGapDevState().scan(GAP_SCAN_STATE_START), 10, true) && waitState(BleGapDevState().scan(GAP_SCAN_STATE_SCANNING), 10, true)) {


10ms sounds like a very tight condition?

That's designated. I just utilize the waitState() to do a quick check before a long wait.

Is it critical to call BleGapDevState().scan(GAP_SCAN_STATE_START) before BleGapDevState().scan(GAP_SCAN_STATE_SCANNING)?
That seems to be the only real change here?

Like you mention, we still call waitState(BleGapDevState().scan(GAP_SCAN_STATE_SCANNING)) for the full wait duration after this

Is it critical to call BleGapDevState().scan(GAP_SCAN_STATE_START) before BleGapDevState().scan(GAP_SCAN_STATE_SCANNING)?

Nope. Just treat it as a check (force poll is set to true) without waiting any time (although there is 10ms). The GAP_SCAN_STATE_START and GAP_SCAN_STATE_SCANNING sometimes are notified in wrong sequence.

scott-brust

I have some questions about general behavior.

I do see these errors in the app somewhat frequently:

0016113700 [app] INFO: start scan
0016113721 [app] INFO:  got advert
0016113747 [app] INFO:  got advert
0016113825 [app] INFO:  got advert
0016113849 [app] INFO:  got advert
0016114006 [app] INFO:  got advert
0016114134 [app] INFO:  got advert
0016114213 [hal.ble] TRACE: Failed to stop scanning, resetting stack
0016114216 [hal.ble] TRACE: Going to stop the stack...
0016114227 [app] INFO:  got advert
0016114243 [app] INFO:  got advert
0016114248 [app] INFO:  got advert
0016114587 [net.lwip_rltk] INFO: promisc_deinit TODO
0016114589 [hal] INFO: WiFi off
0016114607 [hal] INFO: rltk_wlan_set_netif_info: 0, 94:94:4a:04:29:f8
0016114944 [hal] INFO: WiFi on
0016115277 [hal.ble] TRACE: Restore advertising state
0016115285 [app] INFO:  scan took too long: 1584ms
0016115287 [app] INFO:  end scan

The app continues to scan / advertise so I assume this is expected?

I also see this pattern frequently:

0016229913 [app] INFO: start scan
0016230041 [app] INFO:  got advert
0016230141 [app] INFO:  got advert
0016230352 [app] INFO:  got advert
0016230353 [app] INFO:  got advert
0016230437 [app] INFO:  scan took too long: 523ms
0016230439 [app] INFO:  end scan

Is the 50ms BLE scan timeout per beacon?
I see the time difference between some of the individual scan results is >50ms, is this expected?

scott-brust · 2023-11-15T21:54:51Z

hal/src/rtl872x/ble_hal.cpp

@@ -1616,8 +1681,13 @@ int BleGap::startScanning(hal_ble_on_scan_result_cb_t callback, void* context) {
    CHECK_RTL(le_scan_start());
    isScanning_ = true;
    // GAP_SCAN_STATE_SCANNING may be propagated immediately following the GAP_SCAN_STATE_START
-    if (waitState(BleGapDevState().scan(GAP_SCAN_STATE_SCANNING))) {
-        return SYSTEM_ERROR_TIMEOUT;
+    if (waitState(BleGapDevState().scan(GAP_SCAN_STATE_START), 10, true) && waitState(BleGapDevState().scan(GAP_SCAN_STATE_SCANNING), 10, true)) {


Is it critical to call BleGapDevState().scan(GAP_SCAN_STATE_START) before BleGapDevState().scan(GAP_SCAN_STATE_SCANNING)?
That seems to be the only real change here?

Like you mention, we still call waitState(BleGapDevState().scan(GAP_SCAN_STATE_SCANNING)) for the full wait duration after this

scott-brust · 2023-11-15T21:54:58Z

hal/src/rtl872x/ble_hal.cpp

@@ -1005,19 +1030,28 @@ void BleGap::bleCommandThread(void *context) {
    BleGap* gap = (BleGap*)context;
    while (true) {
        uint8_t command;
-        if (!os_queue_take(gap->cmdQueue_, &command, CONCURRENT_WAIT_FOREVER, nullptr)) {
+        if (!os_queue_peek(gap->cmdQueue_, &command, CONCURRENT_WAIT_FOREVER, nullptr)) {


What is the effective difference between peek on the queue rather than take?

In int BleGap::stop(bool restore) you'll see this code snippet:

if (isQueueAvailable()) { LOCAL_DEBUG(" >>>>>>>>>>>> s_bleMutex.unlock() <<<<<<<<<<<<<<"); s_bleMutex.unlock(); }

This is going to fix a deadlock in BLE driver. Using peek will set isQueueAvailable() to let us know if there is command that might be acquiring the BLE lock and need to be processed and unlock the lock accordingly.

hal/src/rtl872x/ble_hal.cpp

XuGuohui · 2023-11-16T05:57:29Z

The app continues to scan / advertise so I assume this is expected?

Yes.

0016230437 [app] INFO: scan took too long: 523ms

The scan timeout can be changed in user app through BLE.setScanTimeout(50);, where the actual timeout is 50 * 10ms. So that's expected result.

[rtl872x] hal: ble scan may stall.

95c5c04

XuGuohui requested review from avtolstoy, YutingYou and scott-brust November 15, 2023 06:54

avtolstoy reviewed Nov 15, 2023

View reviewed changes

scott-brust added this to the 5.6.0 milestone Nov 15, 2023

scott-brust reviewed Nov 15, 2023

View reviewed changes

[rtl872x] hal: BLE minor.

ea6646c

scott-brust approved these changes Nov 17, 2023

View reviewed changes

XuGuohui merged commit 5143009 into develop Nov 18, 2023
13 checks passed

XuGuohui deleted the fix/rtl872x/ble-scan-stall branch November 18, 2023 11:20

technobly added bug 5.x 5.x-only labels Dec 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[rtl872x] fix multiple BLE issues #2710

[rtl872x] fix multiple BLE issues #2710

XuGuohui commented Nov 15, 2023 •

edited

avtolstoy Nov 15, 2023

avtolstoy Nov 15, 2023

XuGuohui Nov 15, 2023

avtolstoy Nov 15, 2023

XuGuohui Nov 15, 2023 •

edited

scott-brust Nov 15, 2023

XuGuohui Nov 16, 2023

scott-brust left a comment

scott-brust Nov 15, 2023

scott-brust Nov 15, 2023

XuGuohui Nov 16, 2023

XuGuohui commented Nov 16, 2023

[rtl872x] fix multiple BLE issues #2710

[rtl872x] fix multiple BLE issues #2710

Conversation

XuGuohui commented Nov 15, 2023 • edited

Problem

Solution

Steps to Test

Example App

References

Completeness

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

XuGuohui Nov 15, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

scott-brust left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

XuGuohui commented Nov 16, 2023

XuGuohui commented Nov 15, 2023 •

edited

XuGuohui Nov 15, 2023 •

edited