Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rtl872x] fix multiple BLE issues #2710

Merged
merged 2 commits into from Nov 18, 2023
Merged

Conversation

XuGuohui
Copy link
Member

@XuGuohui XuGuohui commented Nov 15, 2023

Problem

  1. BLE scanning may stall. BLE scanning API acquires the BLE lock. When error occured in BLE stack during scanning, it tries to reset the stack, which then sends a command to a queue for an internal thread to process. If the thread is trying to acquire the BLE lock to process other command in queue before handling the command to reset the stack, then deadlock occurs.
  2. Due to reset of the BLE stack, for whatever reason, all of the on-going BLE activities are stopped. This typically causes the BLE provisioning mode not functioning well.
  3. BLE states's event may be notified in messed order.

Solution

  1. Unlock the BLE lock before resetting the BLE stack if required.
  2. Cache BLE current states and restore them after the BLE stack is reset successfully.
  3. Do not completely rely on the notified state event. Use SDK API to check the state before waiting the target state is accomplished.

Steps to Test

Build and run the attached user application.

Example App

#include "Particle.h"

STARTUP(System.enableFeature(FEATURE_DISABLE_LISTENING_MODE));

// delay after which to enable provisioning mode
// set to a very high value to effectively disable
#define PROVISION_MODE_DELAY (10)
// #define PROVISION_MODE_DELAY (100000)

SYSTEM_MODE(SEMI_AUTOMATIC);
SYSTEM_THREAD(ENABLED);

Serial1LogHandler logHandler(115200, LOG_LEVEL_ALL);

void scan_result_cb(const BleScanResult *result, void *context) {
    Log.info("\tgot advert");
}

void scan_thread_f() {
    BLE.on();

    // short repeated scans for improved chance to catch all sensor adverts
    // (each scan will only return the first advert from a device so we can't
    // use long-running scan)
    BLE.setScanTimeout(50);

    // disable active scan
    // only get advert, don't request scan response for less overhead and
    // hopefully reduced power consumption on sensors
    BleScanParams params = {};
    params.version = BLE_API_VERSION;
    params.size = sizeof(BleScanParams);
    BLE.getScanParameters(params);
    params.active = false;
    BLE.setScanParameters(params);

    while (true) {
        Log.info("start scan");
        unsigned int t0 = millis();
        BLE.scan(scan_result_cb, nullptr);
        unsigned int t1 = millis();
        if (t1 - t0 > 500) {
            Log.info("\tscan took too long: %dms", t1 - t0);
        }
        Log.info("\tend scan");
        delay(1);
    }
}

void setup() {
    // while (!Serial.isConnected()) {
        // delay(100);
    // }
    static Thread *thread = new Thread("sensor_ble_scan", scan_thread_f, OS_THREAD_PRIORITY_DEFAULT);
}

void loop() {
    static bool provision_once = true;
    static uint32_t last_sec = System.uptime();
    uint32_t now_sec = System.uptime();

    if (provision_once && now_sec > PROVISION_MODE_DELAY) {
        provision_once = false;

        // ---------Setup device name---------
        char buf[16] = {0};
        char name[BLE_MAX_DEV_NAME_LEN+1];
        auto len = hal_get_device_serial_number(buf, sizeof(buf), nullptr);
        // name gets last 6 of the serial number to be (mostly) unique
        snprintf(name, sizeof(name), "SmartTub-%s", &(buf[len-6]));
        Log.info("before BLE.setDeviceName");
        BLE.setDeviceName(name);
        Log.info("after BLE.setDeviceName");

        // ---------Set company ID---------
        Log.info("before BLE.setProvisioningCompanyId");
        BLE.setProvisioningCompanyId(0x1234);
        Log.info("after BLE.setProvisioningCompanyId");

        // ---------BLE provisioning mode---------
        Log.info("before BLE.provisioningMode");
        BLE.provisioningMode(true);
        Log.info("after BLE.provisioningMode");
    }

    if (now_sec - last_sec >= 10) {
        last_sec = now_sec;
        Log.info("loop: getProvisioningStatus: %d", BLE.getProvisioningStatus());
    }
}

References

N/A


Completeness

  • User is totes amazing for contributing!
  • Contributor has signed CLA (Info here)
  • Problem and Solution clearly stated
  • Run unit/integration/application tests on device
  • Added documentation
  • Added to CHANGELOG.md after merging (add links to docs and issues)

struct BleGapCache {
bool isAdv;
bool isScan;
bool isconn;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: isConn

@@ -1585,13 +1646,14 @@ int BleGap::startScanning(hal_ble_on_scan_result_cb_t callback, void* context) {

SCOPE_GUARD ({
if (isScanning_) {
const int LE_SCAN_STOP_RETRIES = 10;
const int LE_SCAN_STOP_RETRIES = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove loop entirely if no longer required? I think when we ran into this previously a few retries did help, but it was always very difficult to reliably reproduced without logging enabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't reproduce this. But if you did observe the retries help, let's keep it as-is at cost of 100ms postponed.

@@ -1616,8 +1681,13 @@ int BleGap::startScanning(hal_ble_on_scan_result_cb_t callback, void* context) {
CHECK_RTL(le_scan_start());
isScanning_ = true;
// GAP_SCAN_STATE_SCANNING may be propagated immediately following the GAP_SCAN_STATE_START
if (waitState(BleGapDevState().scan(GAP_SCAN_STATE_SCANNING))) {
return SYSTEM_ERROR_TIMEOUT;
if (waitState(BleGapDevState().scan(GAP_SCAN_STATE_START), 10, true) && waitState(BleGapDevState().scan(GAP_SCAN_STATE_SCANNING), 10, true)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10ms sounds like a very tight condition?

Copy link
Member Author

@XuGuohui XuGuohui Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's designated. I just utilize the waitState() to do a quick check before a long wait.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it critical to call BleGapDevState().scan(GAP_SCAN_STATE_START) before BleGapDevState().scan(GAP_SCAN_STATE_SCANNING)?
That seems to be the only real change here?

Like you mention, we still call waitState(BleGapDevState().scan(GAP_SCAN_STATE_SCANNING)) for the full wait duration after this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it critical to call BleGapDevState().scan(GAP_SCAN_STATE_START) before BleGapDevState().scan(GAP_SCAN_STATE_SCANNING)?

Nope. Just treat it as a check (force poll is set to true) without waiting any time (although there is 10ms). The GAP_SCAN_STATE_START and GAP_SCAN_STATE_SCANNING sometimes are notified in wrong sequence.

@scott-brust scott-brust added this to the 5.6.0 milestone Nov 15, 2023
Copy link
Member

@scott-brust scott-brust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some questions about general behavior.

I do see these errors in the app somewhat frequently:

0016113700 [app] INFO: start scan
0016113721 [app] INFO:  got advert
0016113747 [app] INFO:  got advert
0016113825 [app] INFO:  got advert
0016113849 [app] INFO:  got advert
0016114006 [app] INFO:  got advert
0016114134 [app] INFO:  got advert
0016114213 [hal.ble] TRACE: Failed to stop scanning, resetting stack
0016114216 [hal.ble] TRACE: Going to stop the stack...
0016114227 [app] INFO:  got advert
0016114243 [app] INFO:  got advert
0016114248 [app] INFO:  got advert
0016114587 [net.lwip_rltk] INFO: promisc_deinit TODO
0016114589 [hal] INFO: WiFi off
0016114607 [hal] INFO: rltk_wlan_set_netif_info: 0, 94:94:4a:04:29:f8
0016114944 [hal] INFO: WiFi on
0016115277 [hal.ble] TRACE: Restore advertising state
0016115285 [app] INFO:  scan took too long: 1584ms
0016115287 [app] INFO:  end scan

The app continues to scan / advertise so I assume this is expected?

I also see this pattern frequently:

0016229913 [app] INFO: start scan
0016230041 [app] INFO:  got advert
0016230141 [app] INFO:  got advert
0016230352 [app] INFO:  got advert
0016230353 [app] INFO:  got advert
0016230437 [app] INFO:  scan took too long: 523ms
0016230439 [app] INFO:  end scan

Is the 50ms BLE scan timeout per beacon?
I see the time difference between some of the individual scan results is >50ms, is this expected?

@@ -1616,8 +1681,13 @@ int BleGap::startScanning(hal_ble_on_scan_result_cb_t callback, void* context) {
CHECK_RTL(le_scan_start());
isScanning_ = true;
// GAP_SCAN_STATE_SCANNING may be propagated immediately following the GAP_SCAN_STATE_START
if (waitState(BleGapDevState().scan(GAP_SCAN_STATE_SCANNING))) {
return SYSTEM_ERROR_TIMEOUT;
if (waitState(BleGapDevState().scan(GAP_SCAN_STATE_START), 10, true) && waitState(BleGapDevState().scan(GAP_SCAN_STATE_SCANNING), 10, true)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it critical to call BleGapDevState().scan(GAP_SCAN_STATE_START) before BleGapDevState().scan(GAP_SCAN_STATE_SCANNING)?
That seems to be the only real change here?

Like you mention, we still call waitState(BleGapDevState().scan(GAP_SCAN_STATE_SCANNING)) for the full wait duration after this

@@ -1005,19 +1030,28 @@ void BleGap::bleCommandThread(void *context) {
BleGap* gap = (BleGap*)context;
while (true) {
uint8_t command;
if (!os_queue_take(gap->cmdQueue_, &command, CONCURRENT_WAIT_FOREVER, nullptr)) {
if (!os_queue_peek(gap->cmdQueue_, &command, CONCURRENT_WAIT_FOREVER, nullptr)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the effective difference between peek on the queue rather than take?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In int BleGap::stop(bool restore) you'll see this code snippet:

            if (isQueueAvailable()) {
                LOCAL_DEBUG(" >>>>>>>>>>>> s_bleMutex.unlock() <<<<<<<<<<<<<<");
                s_bleMutex.unlock();
            }

This is going to fix a deadlock in BLE driver. Using peek will set isQueueAvailable() to let us know if there is command that might be acquiring the BLE lock and need to be processed and unlock the lock accordingly.

hal/src/rtl872x/ble_hal.cpp Show resolved Hide resolved
hal/src/rtl872x/ble_hal.cpp Show resolved Hide resolved
hal/src/rtl872x/ble_hal.cpp Outdated Show resolved Hide resolved
@XuGuohui
Copy link
Member Author

The app continues to scan / advertise so I assume this is expected?

Yes.

0016230437 [app] INFO: scan took too long: 523ms

The scan timeout can be changed in user app through BLE.setScanTimeout(50);, where the actual timeout is 50 * 10ms. So that's expected result.

@XuGuohui XuGuohui merged commit 5143009 into develop Nov 18, 2023
13 checks passed
@XuGuohui XuGuohui deleted the fix/rtl872x/ble-scan-stall branch November 18, 2023 11:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants