[Bug] Edimax BLE Dongle Fails After Teardown and Re-Instantiation #497

hkpeprah · 2024-05-28T23:16:53Z

Overview

This PR addresses an issue where the some RTK BLE dongles fail to perform an HCI reset after the transport is torn down and re-instantiated. To address that, we prevent crashing the background threads when invalid data comes in, and handle resetting the parser. A retry for the HCI reset has been implemented for the RTK dongles where errant data comes in after being torn down. This is related to #476.

Setup

Dongle: Edimax BT-8500 Nano BT Adapter
OS: Ubuntu 20.04.6 LTS
Firmware: BT-8500 Linux Bluetooth Driver 1.0.0.3 (Also tried the Linux Kernel FW)

Problem Description

Issue 1

I observed that when closing and re-opening a transport to the Edimax BT Dongle (which has a RealTek chipset), that an HCI Reset would sometimes result in a second HCI reset response on the next command, requiring two HCI resets to be performed in the driver specific code.

Issue 2

I observed that when the first HCI reset was performed after closing and re-opening a transport to the Edimax BT Dongle that sometimes the call would time out. Digging into the packet data, I observed that the data was sometimes invalid (examples follow):

b'\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0e\x04\x03\x03\x0c\x00'

b'\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

These would cause the background threads to crash, either from parsing the packet data, failing to handle the packet in on_hci_packet(), or failing to feed the data due to an invalid packet type.

Proposed Changes

When performing an HCI reset in RTK's driver_info_for_host, perform the reset twice.
Expose the hci_source from the Host, and if the HCI reset fails in RTK's driver_info_for_host, clear the parser; this addresses the issue where we get errant packet data that causes the reset we get from the second command to fail to parse.
Add optional timeout to send_command() to allow for timing out the command.
In host.on_packet() capture when hci.HCI_Packet.from_bytes() fails.
Changed raise ValueError to reset and break in ParserSource.feed_data() to prevent crashing background thread when invalid packet type is found; allows the parser to continue parsing with the next bit of (hopefully good) data.

This patch addresses an issue where the some RTK BLE dongles fail to perform an HCI reset after the transport is torn down and re-instantiated. To address that, we prevent crashing the background threads when invalid data comes in, and handle resetting the parser. A retry for the HCI reset has been implemented for the RTK dongles where errant data comes in after being torn down.

barbibulle · 2024-05-30T04:28:49Z

bumble/host.py

@@ -514,7 +516,7 @@ def send_hci_packet(self, packet: hci.HCI_Packet) -> None:
        if self.hci_sink:
            self.hci_sink.on_packet(bytes(packet))

-    async def send_command(self, command, check_result=False):
+    async def send_command(self, command, check_result=False, timeout=None):


send_command is async, so we don't need to add a timeout parameter to it in order to wait at most some amount of time: we can leave that up to the caller, which can invoke await asyncio.wait_for(host.send_command(...), timeout).

Sorry, I'm a bit new to AsyncIO. If the calling thread times out on the co-routine, while the co-routine terminate and yield the lock so that another send_command() can run?

with wait_for, if the timeout expires before the coroutine returns, the coroutine is canceled, which means that here the await self.pending_response with raise a TimeoutError, which will exit out of the async with self.command_semaphore, releasing the semaphore, so if another task was waiting on it, it can now proceed.

The only small different between putting the wait_for on the outside, rather than just around asyncio.wait_for(self.pending_response,...) is that the timeout encompasses not only waiting for the response, but also waiting for the semaphore before that.
So, maybe there's a case to be made for specifying an inner timeout, like you have it, which only applies to waiting for the response. I can see how this could make sense, even if I don't have an actual use case in mind. So maybe we leave things the way you have them here, but rename the timeout parameter to response_timeout, to make it clear that it's not a timeout on the whole method, but just on the waiting for a response inside that method.

I can do that!

Rename timeout -> response_timeout

tests/transport_test.py

bumble/transport/common.py

barbibulle · 2024-06-03T02:24:24Z

bumble/host.py

@@ -152,6 +152,7 @@ class Host(AbortableEventEmitter):
    acl_packet_queue: Optional[AclPacketQueue] = None
    le_acl_packet_queue: Optional[AclPacketQueue] = None
    hci_sink: Optional[TransportSink] = None
+    hci_source: Optional[TransportSource] = None


This isn't needed. The host doesn't need to keep a reference to the source that's feeding it. It does need to keep a reference to the sink that it feeds, but the source only comes into the picture when calling set_packet_source, which is a convenience method for telling the source that its sink is the host object.

I used that in the RTK driver when we get invalid packet data, as I noticed sometimes I got a large packet that had a payload of 254, but never got the remaining packet data for that packet:

if host.hci_source and hasattr(host.hci_source, "parser"): host.hci_source.parser.reset()

Is there another way we can reset the parser from the driver? The packet I was getting from the dongle had the string RTK_5 or something like that.

Maybe a parser property or some other reset exposed from the host?

@barbibulle, any thoughts on a way of working around this?

Sorry for the delay. There are a couple issues to work through here:
1/ The source/sink relationship is a push model, and source is a purely abstract concept of "some object that pushes data to a sink". A sink is just an object that has a on_packet method. This means that an object like Host, which receives HCI data from a "source" doesn't necessarily know what the source is, it just receives packets. The Host.set_packet_source method is purely a convenience function that tells the source where to push data (and looks up a metadata property on the source, but that isn't a requirement. In theory, one could attach a source to a Host without calling that method. This means that looking for an host.hci_source may or may not work depending on how objects were instantiated.

2/ Not all sources have a parser property. It is common to see it, because most sources are Transport implementations, which often need to extract HCI packets from a stream, and they usually use a PacketParser for that. But you could very well have a source that doesn't need to parse an HCI stream (for example, a transport where HCI packets arrive already delimited). So calling something like host.hci_source.parser.reset() only works for a transport that happens to use a PacketParser and happens to store that parser in a property named parser. That could break at any time, since that's not a requirement of transport implementations.

But in this case, the problem we need to solve is narrow, so we should be able to address it differently.
I'll try a few things out over the next week to see if I can think of a maintainable way to achieve what's needed here (i.e without having to put too many constraints on what types of objects and classes can be used as HCI data sources).

Any ideas on a workaround for this?

I could break out the changes into two separate PRs:

One for the RTK / HCI Source

The other for the parser fixes.

Would that be better?

Sorry but took so long to follow up. Yes, breaking up in two PRs would help. For the parser changes, I’ll try to find some time over the weekend to come up with an alternative.

Update: I though about a maintainable way to allow signaling of errors back to an HCI source so that it can reset itself, without creating an explicit reference from a sink to a source (i.e no storing of hci_source in the Host object): a source can listen for events from its sink (first testing if the sink is an EventEmitter, since not all are required to be), then, in the RTK driver case here, the driver could have the host emit an error (upon encountering the timeout), which the usb source would get called back for, and reset itself. But... there's probably an even simpler way: it turns out that the usb source really doesn't need to use an HCI parser, since packets coming from USB are always delimited. The current implementation subclasses ParserSource by convenience, and uses self.parser.feed_data(packet) to emit packets to its sink, but that's really not needed. I'll push a PR where the USB source doesn't use a parser anymore, which should solve the issue here, since even after receiving corrupted data in one USB transfer, the next USB transfer should be a whole packet again.

Created #534 . Thank you!

hkpeprah · 2024-08-14T18:08:28Z

Replaced by #534 .

hkpeprah added 2 commits May 28, 2024 15:46

Lint and unit test.

31a7294

barbibulle reviewed May 30, 2024

View reviewed changes

tests/transport_test.py Show resolved Hide resolved

Update test to indicate exception not thrown.

aace6ba

barbibulle reviewed May 30, 2024

View reviewed changes

bumble/transport/common.py Show resolved Hide resolved

Rename timeout -> response_timeout

17d369a

barbibulle reviewed Jun 3, 2024

View reviewed changes

hkpeprah mentioned this pull request Aug 14, 2024

[Bug] Edimax BLE Dongle Fails After Teardown and Re-Instantiation #534

Open

hkpeprah closed this Aug 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Edimax BLE Dongle Fails After Teardown and Re-Instantiation #497

[Bug] Edimax BLE Dongle Fails After Teardown and Re-Instantiation #497

hkpeprah commented May 28, 2024

barbibulle May 30, 2024

hkpeprah May 30, 2024

barbibulle May 30, 2024

hkpeprah May 30, 2024 •

edited

Loading

barbibulle Jun 3, 2024

hkpeprah Jun 3, 2024

hkpeprah Jun 6, 2024 •

edited

Loading

hkpeprah Jun 18, 2024

barbibulle Jun 20, 2024

hkpeprah Jul 2, 2024

hkpeprah Aug 8, 2024

barbibulle Aug 10, 2024

barbibulle Aug 12, 2024

hkpeprah Aug 15, 2024

hkpeprah commented Aug 14, 2024

[Bug] Edimax BLE Dongle Fails After Teardown and Re-Instantiation #497

[Bug] Edimax BLE Dongle Fails After Teardown and Re-Instantiation #497

Conversation

hkpeprah commented May 28, 2024

Overview

Setup

Problem Description

Issue 1

Issue 2

Proposed Changes

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hkpeprah May 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hkpeprah Jun 6, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hkpeprah commented Aug 14, 2024

hkpeprah May 30, 2024 •

edited

Loading

hkpeprah Jun 6, 2024 •

edited

Loading