Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Edimax BLE Dongle Fails After Teardown and Re-Instantiation #497

Closed
wants to merge 4 commits into from

Conversation

hkpeprah
Copy link

Overview

This PR addresses an issue where the some RTK BLE dongles fail to perform an HCI reset after the transport is torn down and re-instantiated. To address that, we prevent crashing the background threads when invalid data comes in, and handle resetting the parser. A retry for the HCI reset has been implemented for the RTK dongles where errant data comes in after being torn down. This is related to #476.

Setup

  1. Dongle: Edimax BT-8500 Nano BT Adapter
  2. OS: Ubuntu 20.04.6 LTS
  3. Firmware: BT-8500 Linux Bluetooth Driver 1.0.0.3 (Also tried the Linux Kernel FW)

Problem Description

Issue 1

I observed that when closing and re-opening a transport to the Edimax BT Dongle (which has a RealTek chipset), that an HCI Reset would sometimes result in a second HCI reset response on the next command, requiring two HCI resets to be performed in the driver specific code.

Issue 2

I observed that when the first HCI reset was performed after closing and re-opening a transport to the Edimax BT Dongle that sometimes the call would time out. Digging into the packet data, I observed that the data was sometimes invalid (examples follow):

b'\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0e\x04\x03\x03\x0c\x00'
b'\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

These would cause the background threads to crash, either from parsing the packet data, failing to handle the packet in on_hci_packet(), or failing to feed the data due to an invalid packet type.

Proposed Changes

  1. When performing an HCI reset in RTK's driver_info_for_host, perform the reset twice.
  2. Expose the hci_source from the Host, and if the HCI reset fails in RTK's driver_info_for_host, clear the parser; this addresses the issue where we get errant packet data that causes the reset we get from the second command to fail to parse.
  3. Add optional timeout to send_command() to allow for timing out the command.
  4. In host.on_packet() capture when hci.HCI_Packet.from_bytes() fails.
  5. Changed raise ValueError to reset and break in ParserSource.feed_data() to prevent crashing background thread when invalid packet type is found; allows the parser to continue parsing with the next bit of (hopefully good) data.

This patch addresses an issue where the some RTK BLE dongles fail to perform
an HCI reset after the transport is torn down and re-instantiated. To address
that, we prevent crashing the background threads when invalid data comes in,
and handle resetting the parser. A retry for the HCI reset has been implemented
for the RTK dongles where errant data comes in after being torn down.
bumble/host.py Outdated
@@ -514,7 +516,7 @@ def send_hci_packet(self, packet: hci.HCI_Packet) -> None:
if self.hci_sink:
self.hci_sink.on_packet(bytes(packet))

async def send_command(self, command, check_result=False):
async def send_command(self, command, check_result=False, timeout=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

send_command is async, so we don't need to add a timeout parameter to it in order to wait at most some amount of time: we can leave that up to the caller, which can invoke await asyncio.wait_for(host.send_command(...), timeout).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm a bit new to AsyncIO. If the calling thread times out on the co-routine, while the co-routine terminate and yield the lock so that another send_command() can run?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with wait_for, if the timeout expires before the coroutine returns, the coroutine is canceled, which means that here the await self.pending_response with raise a TimeoutError, which will exit out of the async with self.command_semaphore, releasing the semaphore, so if another task was waiting on it, it can now proceed.

The only small different between putting the wait_for on the outside, rather than just around asyncio.wait_for(self.pending_response,...) is that the timeout encompasses not only waiting for the response, but also waiting for the semaphore before that.
So, maybe there's a case to be made for specifying an inner timeout, like you have it, which only applies to waiting for the response. I can see how this could make sense, even if I don't have an actual use case in mind. So maybe we leave things the way you have them here, but rename the timeout parameter to response_timeout, to make it clear that it's not a timeout on the whole method, but just on the waiting for a response inside that method.

Copy link
Author

@hkpeprah hkpeprah May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that!

  • Rename timeout -> response_timeout

@@ -152,6 +152,7 @@ class Host(AbortableEventEmitter):
acl_packet_queue: Optional[AclPacketQueue] = None
le_acl_packet_queue: Optional[AclPacketQueue] = None
hci_sink: Optional[TransportSink] = None
hci_source: Optional[TransportSource] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't needed. The host doesn't need to keep a reference to the source that's feeding it. It does need to keep a reference to the sink that it feeds, but the source only comes into the picture when calling set_packet_source, which is a convenience method for telling the source that its sink is the host object.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used that in the RTK driver when we get invalid packet data, as I noticed sometimes I got a large packet that had a payload of 254, but never got the remaining packet data for that packet:

            if host.hci_source and hasattr(host.hci_source, "parser"):
                host.hci_source.parser.reset()

Is there another way we can reset the parser from the driver? The packet I was getting from the dongle had the string RTK_5 or something like that.

Copy link
Author

@hkpeprah hkpeprah Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a parser property or some other reset exposed from the host?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@barbibulle, any thoughts on a way of working around this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. There are a couple issues to work through here:
1/ The source/sink relationship is a push model, and source is a purely abstract concept of "some object that pushes data to a sink". A sink is just an object that has a on_packet method. This means that an object like Host, which receives HCI data from a "source" doesn't necessarily know what the source is, it just receives packets. The Host.set_packet_source method is purely a convenience function that tells the source where to push data (and looks up a metadata property on the source, but that isn't a requirement. In theory, one could attach a source to a Host without calling that method. This means that looking for an host.hci_source may or may not work depending on how objects were instantiated.

2/ Not all sources have a parser property. It is common to see it, because most sources are Transport implementations, which often need to extract HCI packets from a stream, and they usually use a PacketParser for that. But you could very well have a source that doesn't need to parse an HCI stream (for example, a transport where HCI packets arrive already delimited). So calling something like host.hci_source.parser.reset() only works for a transport that happens to use a PacketParser and happens to store that parser in a property named parser. That could break at any time, since that's not a requirement of transport implementations.

But in this case, the problem we need to solve is narrow, so we should be able to address it differently.
I'll try a few things out over the next week to see if I can think of a maintainable way to achieve what's needed here (i.e without having to put too many constraints on what types of objects and classes can be used as HCI data sources).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any ideas on a workaround for this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could break out the changes into two separate PRs:

  1. One for the RTK / HCI Source
  2. The other for the parser fixes.

Would that be better?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry but took so long to follow up. Yes, breaking up in two PRs would help. For the parser changes, I’ll try to find some time over the weekend to come up with an alternative.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I though about a maintainable way to allow signaling of errors back to an HCI source so that it can reset itself, without creating an explicit reference from a sink to a source (i.e no storing of hci_source in the Host object): a source can listen for events from its sink (first testing if the sink is an EventEmitter, since not all are required to be), then, in the RTK driver case here, the driver could have the host emit an error (upon encountering the timeout), which the usb source would get called back for, and reset itself. But... there's probably an even simpler way: it turns out that the usb source really doesn't need to use an HCI parser, since packets coming from USB are always delimited. The current implementation subclasses ParserSource by convenience, and uses self.parser.feed_data(packet) to emit packets to its sink, but that's really not needed. I'll push a PR where the USB source doesn't use a parser anymore, which should solve the issue here, since even after receiving corrupted data in one USB transfer, the next USB transfer should be a whole packet again.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #534 . Thank you!

@hkpeprah
Copy link
Author

Replaced by #534 .

@hkpeprah hkpeprah closed this Aug 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants