Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve resilience of BLE Inquiry mode #887

Merged
merged 10 commits into from
Sep 21, 2021
Merged

Improve resilience of BLE Inquiry mode #887

merged 10 commits into from
Sep 21, 2021

Conversation

PeteBa
Copy link
Contributor

@PeteBa PeteBa commented Aug 28, 2021

Describe the change
Looks to address issues identified in #852. A quick summary:

  • The first two new commits address the comments here and here.

  • The next two new commits allow for aborting connection and query logic on a timeout or disconnect. Basically, I have pushed promiseWithTimeout down to wrap the specific Noble functions. I have retained the overall timeout duration as per master.

  • The last commit refactors the "unlocking" disconnect listener as that can be handled inline at the end of queryLowEnergyDevice directly.

There is an outstanding question about the sleep call after connection attempts. I have read the noble issue and the fix recently committed on the abandonware hci repo. Unfortunately, the fix didn't make any difference for me. I also found 100ms sleep to be much better than 500ms at filtering out the erroneous disconnect events . Obviously, happy to adjust.

I have done this as a stacked PR building on the miflora_battery PR as it has allowed me to test using a large number of active devices. But I'm thinking this will make it very difficult to review. Let me know if it would help to split the commits into separate PRs and base off of main (or maybe beta). I can then fixup miflora_battery once this has landed.

I've marked as WIP as I am still soak testing this and looking at a few other tweaks but be good to get any thoughts thus far.

Checklist
If you changed code:

  • Tests run locally and pass (npm test)
  • Code has correct format (npm run format)

If you added a new integration:

  • Documentation page added in docs/integrations/
  • Page linked in docs/.vuepress/config.ts and docs/integrations/README.md

Additional information
Is this PR related to any issues? Do you have anything else to add?

@PeteBa PeteBa requested a review from mKeRix as a code owner August 28, 2021 09:36
@codecov
Copy link

codecov bot commented Aug 28, 2021

Codecov Report

Merging #887 (f82f574) into main (cefd445) will decrease coverage by 0.05%.
The diff coverage is 91.35%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #887      +/-   ##
==========================================
- Coverage   91.61%   91.55%   -0.06%     
==========================================
  Files          97       97              
  Lines        2624     2665      +41     
  Branches      397      417      +20     
==========================================
+ Hits         2404     2440      +36     
- Misses        216      221       +5     
  Partials        4        4              
Flag Coverage Δ
unittests 91.55% <91.35%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...uetooth-low-energy/bluetooth-low-energy.service.ts 98.99% <50.00%> (-1.01%) ⬇️
src/integrations/xiaomi-mi/xiaomi-mi.service.ts 96.36% <50.00%> (-1.75%) ⬇️
...integration-support/bluetooth/bluetooth.service.ts 86.11% <95.89%> (+1.40%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cefd445...f82f574. Read the comment docs.

@PeteBa PeteBa changed the title Ble query Improve resilience of BLE Inquiry mode Aug 28, 2021
@mKeRix
Copy link
Owner

mKeRix commented Aug 29, 2021

From a first look the PR looks good! It does seem like me merging #852 has introduced some conflicts, possibly because of the squash & merge. Let me know if I can help fix that somehow.

As for the sleep: I looked at the recent fix commit on abandonware/bluetooth-hci-socket as well, it's a bit out of my depth. Either way, the current safeguard implementation shouldn't hurt even if they fix the issue upstream someday. Since the previous implementation was just based on experiments as well (if I remember correctly), I see nothing wrong with 100ms if that is working better in your tests.

@PeteBa
Copy link
Contributor Author

PeteBa commented Aug 29, 2021

De-stacked the PR to resolve the conflict.

@PeteBa
Copy link
Contributor Author

PeteBa commented Aug 29, 2021

I added two "guard" commits to avoid concurrent calls to queryLowEnergyDevice. I have seen this occur in two scenarios: 1) both BLE and Xiaomi modules trigger a query at similar times; 2) a quick succession of iPhone discovery messages are received. This can result for example in a new query attempt happening during a reset triggered by the previous query.

Rather than relying on adapter.state, I added a simple non-blocking mutex to address this. This results in companion app/battery queries being dropped until the mutex can be claimed. I think this is OK for these use-cases as subsequent discovery messages will re-attempt the query.

27/08/2021, 08:58:23 - log - BluetoothLowEnergyService: [BLE_00132] Attempting app discovery for tag 686a4b15a776
27/08/2021, 08:58:23 - trace - BluetoothService: [BLE_00132]               (queryLowEnergyDevice)
27/08/2021, 08:58:23 - trace - BluetoothService: [BLE_00132]               (connectLowEnergyDevice)
27/08/2021, 08:58:23 - trace - BluetoothService: [BLE_00132]               (lockAdapter)
27/08/2021, 08:58:23 - trace - BluetoothService: [BLE_00132]               (connectLowEnergyDeviceWithRetry)
27/08/2021, 08:58:23 - log - BluetoothLowEnergyService: [BLE_00133] Attempting app discovery for tag 686a4b15a776
27/08/2021, 08:58:23 - log - BluetoothLowEnergyService: [BLE_00133] Canceled discovery as BLE adapter is already in use
27/08/2021, 08:58:24 - trace - BluetoothService: [BLE_00132]               (readLowEnergyCharacteristic)
27/08/2021, 08:58:24 - trace - BluetoothService: [BLE_00132]               (disconnectLowEnergyDevice)
27/08/2021, 08:58:24 - trace - BluetoothService: [BLE_00132]               (unlockAdapter)
27/08/2021, 08:58:24 - trace - BluetoothService: [BLE_00132]               (handleAdapterStateChange)
27/08/2021, 08:58:24 - log - BluetoothLowEnergyService: [BLE_00132] Discovered companion app with ID F231E501-7661-40DD-BD46-7E4528619063 for tag 686a4b15a776

@PeteBa
Copy link
Contributor Author

PeteBa commented Sep 5, 2021

@mKeRix , so I'm seeing something "weird" going on and appreciate any insight ...

Basically, Noble hci-socket seems to generate a spurious Scan Started event in the middle of a queryLowEnergyDevice call. This results in the query failing and scanning not resuming. Eventually, the watchdog will kick in to recover the situation but not before the Tag flicks to "not_home". This happens about ~1 in 100 queries or few times a day so enough to impact automation reliability.

The event is triggered by this line in gap.js and isnt as a result of a noble.startScanning call. Code commentary around that line suggests the adapter is responding to another connected application. The issue seems to go away if you disable Bleno so could be a problem with sharing the same BLE adapter. I have seen this on both RPiZ and RPi4. I've disabled bluetoothd as per Bleno readme so its not that. Went back to stock 2.18.4 to confirm it isnt a recent regression. I've seen the beta branch thread and wonder if this could be an underlying factor.

A function trace below to show context:

31/08/2021, 09:53:40 - info - BluetoothLowEnergyService: Attempting app discovery for tag 7d8803412f5a
31/08/2021, 09:53:40 - info - BluetoothService:      (queryLowEnergyDevice)
31/08/2021, 09:53:40 - info - BluetoothService:        (connectLowEnergyDevice)
31/08/2021, 09:53:40 - info - BluetoothService:          (lockAdapter)
31/08/2021, 09:53:40 - info - BluetoothService:          (connectLowEnergyDeviceWithRetry)
31/08/2021, 09:53:40 - info - BluetoothService:  (handleScanStop)        <--- Correct response to lockAdapter call
31/08/2021, 09:53:41 - info - BluetoothService:  (handleScanStart)        <--- Error as adapter state overwritten to "scan"
31/08/2021, 09:53:41 - info - BluetoothService:        (readLowEnergyCharacteristic)
31/08/2021, 09:53:42 - info - BluetoothService:        (disconnectLowEnergyDevice)
31/08/2021, 09:53:42 - info - BluetoothService:        (unlockAdapter)     <--- startScanning not called assuming already running
31/08/2021, 09:53:42 - info - BluetoothLowEnergyService: Discovered companion app ...
31/08/2021, 09:54:16 - info - BluetoothService:  (verifyLowEnergyScanner)     <--- Reset triggered given adapter not actually scanning
31/08/2021, 09:54:16 - info - BluetoothService:    (resetHciDevice)
31/08/2021, 09:54:21 - info - BluetoothService:      (handleAdapterStateChange)
31/08/2021, 09:54:21 - info - BluetoothService:  (handleScanStart)

I would be interested to know if this is seen by others. While I dont have a simple repo, if you add a breakpoint or console.log('Unrecognised Scan Event'); on the gap.js line then you can leave RA running for a day and then check back in later. Not sure that there is a simple fix unfortunately.

@mKeRix
Copy link
Owner

mKeRix commented Sep 13, 2021

Hm - bleno in itself isn't instructed to start scanning at any time, in fact it shouldn't even have to do that for any internal reason. It should only ever toggle the advertising. It certainly could be a root cause for the discussions in the beta issue though... great find!

Given that this seems to be resolved once you get rid off bleno I'd suggest to just leave it as is though. bleno was only used for the toggle feature in the app, which also suffered from some other issues on the iOS side. I'm still planning to drop that (and hopefully replace the feature with smth else) like it is on the beta branch currently. The recent changes on main will also be made available as part of a new beta release once this PR is merged.

(happy to discuss if you disagree!)

@PeteBa
Copy link
Contributor Author

PeteBa commented Sep 16, 2021

I agree that Bleno is not directly initiating a scan. I suspect it is either a timing or an inter-operability issue that happens say one in ten times in my setup.

Looking at HCI traces during a failed query attempt, you can see an rogue inbound connection to the bleno interface - bleno configures the adapter in connectable mode - this is followed by a disconnect and scan started event coming into the noble side. I cant prove it but I wonder if a nearby device is interrogating the Pi and this causes the adapter to "reset" the noble side. Could be a red-herring but consistent across the observed failed queries.

Unfortunately, just not starting advertising is insufficient as importing bleno triggers its initialisation and subsequent processing of hci socket commands. I have done a quick dynamic loading commit to avoid this when instanceBeaconEnabled: false. This lets you disable bleno completely if seeing stability issues.

I think this is a good place to finish up this PR and happy to take any comments/feedback.

--- Looks like some test errors from the rebase, will take a look -Done

Copy link
Owner

@mKeRix mKeRix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Looks good, LGTM. Will issue a release shortly.

I like your idea of dynamically loading bleno... could be a better path for the next major version (= current beta), instead of dropping the feature just turning it off by default.

@mKeRix mKeRix merged commit 9fc0eec into mKeRix:main Sep 21, 2021
github-actions bot pushed a commit that referenced this pull request Sep 21, 2021
# [2.19.0](v2.18.4...v2.19.0) (2021-09-21)

### Features

* **bluetooth-low-energy:** add option for reducing discovery log entries ([#853](#853)) ([e405558](e405558))
* **bluetooth-low-energy:** improve resilience of inquiry mode ([#887](#887)) ([9fc0eec](9fc0eec))
* **grid-eye:** add option to mask zero based values ([#891](#891)) ([c8f5b95](c8f5b95))
* support NodeJS 16 ([b795995](b795995))
* **xiaomi-mi:** add support for HA long-term statistics ([92bf8e8](92bf8e8)), closes [#869](#869)
* **xiaomi-mi:** publish miflora battery state as sensors ([#852](#852)) ([1cdfcd2](1cdfcd2))
@github-actions
Copy link

🎉 This PR is included in version 2.19.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

github-actions bot pushed a commit that referenced this pull request Sep 21, 2021
# [3.0.0-beta.3](v3.0.0-beta.2...v3.0.0-beta.3) (2021-09-21)

### Features

* **bluetooth-low-energy:** add option for reducing discovery log entries ([#853](#853)) ([e405558](e405558))
* **bluetooth-low-energy:** improve resilience of inquiry mode ([#887](#887)) ([9fc0eec](9fc0eec))
* **grid-eye:** add option to mask zero based values ([#891](#891)) ([c8f5b95](c8f5b95))
* support NodeJS 16 ([b795995](b795995))
* **xiaomi-mi:** add support for HA long-term statistics ([92bf8e8](92bf8e8)), closes [#869](#869)
* **xiaomi-mi:** publish miflora battery state as sensors ([#852](#852)) ([1cdfcd2](1cdfcd2))
@github-actions
Copy link

🎉 This PR is included in version 3.0.0-beta.3 🎉

The release is available on:

Your semantic-release bot 📦🚀

@PeteBa
Copy link
Contributor Author

PeteBa commented Sep 21, 2021

Thanks for taking this on-board, much appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants