Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STM32 Observing malformed Advertisement Reports on STM32WB55 #8663

Closed
kt-work opened this issue May 13, 2022 · 9 comments
Closed

STM32 Observing malformed Advertisement Reports on STM32WB55 #8663

kt-work opened this issue May 13, 2022 · 9 comments

Comments

@kt-work
Copy link

kt-work commented May 13, 2022

I have been observing erroneous data when receiving BLE Advertisements with our micropython setup. Our data shows RSSI and general data errors at an interval of ~5h. Now I looked a bit closer and found that LE Advertisement reports have fairly frequently wrong RSSI data.

If I print the data passed to ble_hs_hci_evt_le_adv_rpt() I get:
LeAdvRpt: 020100004312b5d1d8ea1d0201061916f7fd011219cae292db59043e1f020100012792f34bfb591302

The associated data entering through mp_bluetooth_nimble_hci_uart_process() looks almost identical:

04
3e
29
02
01
00
00
43
12
b5
d1
d8
ea
1d
02
01
06
19
16
f7
fd
01
12
19
ca
e2
92
db
59
04
3e
1f
02
01
00
01
27
92
f3
4b
fb
59
13
02

I run this on our proprietary hardware using a STM32WB55VGQ6. Our Wireless FW version is the HCI variant of 1.11.1 The micropython version is 1.18 .

In addition to the above I found that the hci/ipcc interface yields apparently correct HCI data when I added HCI_TRACE 1 and Debug_printf to rfcore.c. The data that arrives in the Nimble BLE stack however contains parts of two advertisement reports. I marked them bold and italic.

[ 875062] <HCI_EVT(04:3e:1f:02:01:04:00:30:f5:0a:f4:d9:7c:13:0b:09:49:44:31:5f:42:43:30:45:41:41:06:ff:9a:08:01:80:6b:a8)
[ 875074] <HCI_EVT(04:3e:27:02:01:00:00:26:36:e2:42:4b:e6:1b:02:01:06:06:ff:57:07:a2:2c:01:10:09:50:20:50:52:4f:58:49:52:20:30:30:30:30:31:34:c0)
[ 875088] <HCI_EVT(04:3e:13:02:01:04:00:26:36:e2:42:4b:e6:07:06:ff:57:07:f2:bd:0b:c0)
[ 875097] <HCI_EVT(04:3e:2b:02:01:03:01:48:50:15:01:87:3a:1f:1e:ff:06:00:01:09:20:02:bd:6b:03:94:fb:24:a3:2d:96:cf:49:19:47:61:95:d2:15:9a:8a:77:e3:24:af:af)
[ 875112] <HCI_EVT(04:3e:2b:02:01:03:01:4b:b9:04:01:13:33:1f:1e:ff:06:00:01:09:20:02:af:49:64:f6:f6:fb:60:83:36:90:b9:bb:1e:db:c4:75:66:42:44:4d:42:27:21:b4)
[ 875127] <HCI_EVT(04:3e:2b:02:01:03:01:8b:9c:a8:1c:03:54:1f:1e:ff:06:00:01:09:20:02:56:40:47:3d:8d:ca:82:95:12:54:75:a4:55:66:cf:ef:65:48:97:6a:bf:08:24:ad)
[ 875143] <HCI_EVT(04:3e:2a:02:01:00:01:0a:c6:b1:a1:bd:74:1e:02:01:1a:1a:ff:4c:00:0c:0e:00:2c:6e:7f:04:67:1d:bc:64:b3:ae:63:93:18:10:05:45:1c:fe:33:f6:9e)
[ 875158] <HCI_EVT(04:3e:24:02:01:00:00:db:68:4c:93:7a:18:18:02:01:06:03:02:1d:18:03:19:80:0c:07:1b:db:68:4c:93:7a:18:04:16:1d:18:01:bf)
[ 875171] <HCI_EVT(04:3e:13:02:01:04:00:db:68:4c:93:7a:18:07:06:09:42:46:36:30:30:bf)
[ 875180] <HCI_EVT(04:3e:2b:02:01:03:01:4b:b9:04:01:13:33:1f:1e:ff:06:00:01:09:20:02:af:49:64:f6:f6:fb:60:83:36:90:b9:bb:1e:db:c4:75:66:42:44:4d:42:27:21:b4)
[ 875195] <HCI_EVT(04:3e:2b:02:01:03:01:48:50:15:01:87:3a:1f:1e:ff:06:00:01:09:20:02:bd:6b:03:94:fb:24:a3:2d:96:cf:49:19:47:61:95:d2:15:9a:8a:77:e3:24:af:af)
LeAdrfcore: IPCC_C1_RX_IRQHandler
vRpt: 020103018b9ca81c03541f1eff0600010920025640473d8dca8295043e1f0201040030f50af4d97c130b09
LeAdvRpt: 020100002636e2424be61b02010606ff5707a22c011009502050524f58495220303030303134c0
LeAdvRpt: 020104002636e2424be60706ff5707f2bd0bc0

So far I have reduced the issue to the buffering mechanism in mp_bluetooth_hci_uart_msg_cb() and mp_bluetooth_hci_uart_readchar() in mpbthciport.c . If the array passed to the callback doesn't fit the buffer hci_uart_rx_buf_data[] only a part is copied. The function mp_bluetooth_hci_uart_readchar is then able to read a fragment of a HCI event and then another packet. The Nimble stack will not be able to recognize the error since the start will be correct, so the user will have to deal with the incorrect advertisements.
I replaced the buffering mechanism with a circular buffer I wrote up pretty quick and so far it is running ok, I'll know more after the weekend. I increased the buffer size to 2048 Bytes, I haven't seen dropped packets yet.
Please let me know how you would like to fix this.

Regards,

Kilian

@jimmo
Copy link
Member

jimmo commented May 13, 2022

Thanks @kt-work for the detailed analysis! Your explanation makes sense, and you're right that particular bit of code should clearly be a circular buffer. It would be good to repro it here...have you come across a way to increase the rate at which this happens, e.g. by increasing the number/rate/size of advertising packets being broadcast by nearby devices?

Please let me know how you would like to fix this.

If your testing goes well, would you be willing to share your patch to mpbthciport.c (in whatever form it's in, even just a rough draft) and we can take it from there? Did you just use ringbuf.h ? If you want to make a PR that would be even better :)

increased the buffer size to 2048 Bytes

This is pretty big! Was it necessary to solve this problem?

@andrewleech FYI I wonder if this could improve l2cap data transfer reliability.

@kt-work
Copy link
Author

kt-work commented May 13, 2022

The rate if you look at the RSSI in ble_hs_hci_evt_le_adv_rpt() is already pretty high in an office environment with about 20 different BLE devices. Otherwise you can always reduce the size of the buffer I guess. I have been looking at a specific device in 1min intervals that is how I ended up with 5 h initially. If you just look for packets with an error in the above function it will only take a few minutes. Maybe there is more filtering going on so my application was only rarely affected.

You have a ringbuf.h already, nice I'll check it out.

I'll have to ask management what I can share, but we are interested to get fixes into the mainline code base if possible.

Well if you can live with a few dropped packets you can reduce it, but 2048 was the first value after doubling that gave me no easily observable drops. We have been wondering for a long time where all the advertisements are going despite RSSI = -45dBm, so if I can find a cause I'm motivated to reduce the lost packet count to zero.

Best Regards,

Kilian

@andrewleech
Copy link
Sponsor Contributor

Well if you can live with a few dropped packets you can reduce it, but 2048 was the first value after doubling that gave me no easily observable drops.

Really solid work here, I know full well how difficult and time consuming chasing down these kind of issues can be!
I agree with @jimmo that 2048 seems really big, considering how small most BLE packets are. It's great to hear this is the result of testing. My feeling here is that it sounds like it's just not getting serviced enough at the other end of the buffer. Perhaps with a suitable ringbuffer we might be able to have a 3/4 full trigger that schedules whatever is needed to read from the buffer.

@andrewleech
Copy link
Sponsor Contributor

In a really strange coincidence... I was actually working on a problem yesterday that involved l2cap transfers not working on nucleo wb55. Adding some debuging prints to this today, I see I get ~4 l2cap packets of 235 bytes long through before it locks up (BLE mtu set to 244). This is also in a custom application (using aioble however), with a WB55 dongle on the other end of the l2cap transfer.

If I halve the l2cap transfer size, it continues through the whole transfer.

Alternatively, if in mpbthciport.c I change: STATIC uint8_t hci_uart_rx_buf_data[256]; to STATIC uint8_t hci_uart_rx_buf_data[2048];, it continues through the whole transfer.

I swapped out the static buffer for a ringbuffer based on in the existing definition, which cleans up the code somewhat - but I don't think will improve things on it its own. The "read message from ipcc" function which uses the callback is only called when the buffer is empty anyway, as checked when reading bytes from the buffer. So arguably using the ringbuffer is just adding a (small) amount of overhead. I saw then with extra printf's that with the 235byte l2cap it was trying to put ~280 bytes into the buffer each time, which explains the repeatable failure every time.

Taking a step back, I think it'd be better to restructure the interface to allow direct handing of the buffers from ipcc callbacks, and remove the ringbuffer entirely. I've tested this here (messy code with debug prints still) and seems to work well!
l2cap works well for me, and we save some static ram.
#8665

@andrewleech
Copy link
Sponsor Contributor

I now think #8665 is in pretty good shape, works well in my initial testing. Haven't done any multitests yet though...

@kt-work if you get a chance to look / test it out it'd be great to hear if it fixes your issue too!

@kt-work
Copy link
Author

kt-work commented May 16, 2022

Using the ringbuffer removed the erroneous RSSI values while testing over the weekend.

image

In front of the blue line the old code was used.

@andrewleech Thanks for getting rid of the ring buffer, I'm totally glad you took it on, since I would have been out of my depth changing that much code in the BLE frontend. Also we should get more RAM out of it that could be available to the user, this is great.
I'll go and see how I can work it into our code to perform some testing.

@kt-work
Copy link
Author

kt-work commented May 16, 2022

As a side note while running the 1.18 code I noticed exceptions occuring when stopping the BLE scanning with gap_scan(). The exception was reported as ENODEV which was a major misunderstanding since the number returned was -19 which meant internally something like HCI timeout. I only discovered this when turning on debug prints. Now it seems explainable with HCI events going missing in the buffering mechanism. With our ring buffer implementation I noticed that I don't get those exceptions anymore.

@kt-work
Copy link
Author

kt-work commented May 17, 2022

@andrewleech I ran your changes in our code and didn't find any issues in the results. Below you can find a graph with the RSSI reported by one of the BLE devices.

image

Thanks for your help.

@dpgeorge
Copy link
Member

dpgeorge commented Jun 3, 2022

Should be fixed by 8fb01be

@dpgeorge dpgeorge closed this as completed Jun 3, 2022
tannewt pushed a commit to tannewt/circuitpython that referenced this issue Nov 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants