Data poll delayed when receiving fragmented packets #9820
Comments
It has been a while since I last read the
Gotta love a Heisenbug 😄
@CodingRays, can you attach the pcap file?
So here's a minimal setup to reproduce the issue I found. This uses only dongles: From a fresh clone of the nrf52840:
On the FTD to test
Since the bug appears ~50% of the time, a few transmissions might be required, but it should trigger very reliably. I sadly lost the pcap file from the image, but here is a capture from a test using the above setup.
Looking at the packet trace again, it appears that the ACK was sent with the Frame Pending bit set, but no data frame transmission followed. The default Data Poll Timeout is set to 100ms: openthread/src/core/config/mac.h Line 597 in 44c3906
The Data Sequence Number jumps from 125 in frame 169 to 127 in frame 175, so it appears that the parent tried to send a data frame. But without more logging from the sender, it is difficult to determine the exact root cause. It could be due to CCA failures, or the frame could have been transmitted but lost to a collision. Either would be a legitimate reason for packet loss. Can you provide detailed logs from the parent?
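The sequence number check described above can be automated over a capture. The following is a minimal sketch, not part of OpenThread or Wireshark: `find_dsn_gaps` is a hypothetical helper, and it assumes the capture has already been reduced to `(frame_number, dsn)` pairs for data frames from a single sender (ACK frames, which echo the sequence number of the acknowledged frame, are assumed to be filtered out).

```python
def find_dsn_gaps(frames):
    """Return (prev_frame, next_frame, missing_dsns) for every jump in the DSN.

    `frames` is a list of (frame_number, dsn) pairs in capture order.
    The IEEE 802.15.4 sequence number is 8 bits wide, so it wraps at 256.
    """
    gaps = []
    for (prev_no, prev_dsn), (next_no, next_dsn) in zip(frames, frames[1:]):
        step = (next_dsn - prev_dsn) % 256
        # step == 0 is a retransmission, step == 1 is the normal case.
        if step > 1:
            missing = [(prev_dsn + i) % 256 for i in range(1, step)]
            gaps.append((prev_no, next_no, missing))
    return gaps


# The two frames mentioned above: DSN 125 in frame 169, DSN 127 in frame 175.
print(find_dsn_gaps([(169, 125), (175, 127)]))  # [(169, 175, [126])]
```

A reported gap only says that a sequence number was consumed without the frame showing up in the capture; it does not distinguish a CCA failure from a collision.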
I can try to get the logs. But it's hard to believe it's completely unrelated to the logging, given how consistent it is. I have not had a single such failure in all my testing with logging disabled, while with it enabled it affects 50% of all IP packets / ~16% of all fragments.
I'm hoping the logs can provide some visibility into why data frames are not being transmitted following an ACK with the Frame Pending bit set.
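As a rough consistency check on those two percentages, the sketch below relates a per-fragment delay rate to a per-packet rate. It assumes that fragments are delayed independently of each other, and the fragment count per packet is a hypothetical parameter, since the thread does not state how many fragments each packet was split into.

```python
def packet_delay_rate(fragment_rate, fragments_per_packet):
    """Probability that at least one fragment of a packet is delayed,
    assuming each fragment is delayed independently with `fragment_rate`."""
    return 1 - (1 - fragment_rate) ** fragments_per_packet


# Hypothetical fragment counts; 0.16 is the per-fragment rate quoted above.
for n in range(3, 7):
    print(n, round(packet_delay_rate(0.16, n), 3))
```

Under these assumptions, a 16% per-fragment rate and roughly four fragments per packet give a per-packet rate close to 50%.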
OK, so I've spent a while testing and I was no longer able to reproduce the bug. It did occur once and, as before, when it did happen it was reliable until I changed the configuration. Here, 10s after the change, I continued testing and it immediately stopped. However, without knowing the conditions that trigger it, it's hard to investigate. I also couldn't get logs from that test. One thing to note, though, is that the Wireshark capture now shows the data poll being sent but no response from the parent. I am now using different hardware (newer chip revisions), and will try to test with the original hardware soon and report back once I do.
Alright, I can't reproduce the bug anymore, even with different hardware. I have no idea what changed.
Describe the bug
We discovered this bug while working on #9763. It appears that, while receiving a fragmented packet, the data poll for the next fragment can randomly be delayed by approximately 100ms. This matches the data poll timeout; however, no previous data poll was sent. During this delay the receiver remains on.
We have tested this inside a Faraday cage to eliminate any external causes, and the bug seems to be easily reproducible. Out of 20 512-byte packets, 11 had at least one such delayed fragment. We were able to reproduce it on multiple commits, days and locations, making external influences unlikely.
After further testing we found that this only happens when logging is enabled at debug level, which leads me to believe that it could be related to some very tight timeout. But how that would be possible is unclear to me. Furthermore, the bug only appears when attached as a SED, not as an SSED.
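To count delayed fragments in a capture without reading it by hand, something like the following sketch could be used. It is an assumption-laden illustration: `delayed_polls` is a hypothetical helper, the events are assumed to be exported as time-sorted `(timestamp_s, kind)` pairs with `kind` being `"fragment"` or `"data_poll"`, and the 90 ms threshold is an arbitrary cut-off chosen just below the 100ms delay described above. The timestamps in the usage example are made up for illustration and are not taken from the actual capture.

```python
def delayed_polls(events, threshold_s=0.09):
    """Return (fragment_ts, poll_ts, delay) for each data poll that follows
    a received fragment by at least `threshold_s` seconds.

    `events` is a time-sorted list of (timestamp_s, kind) pairs, where kind
    is "fragment" (received from the parent) or "data_poll" (sent by the SED).
    """
    delayed = []
    last_fragment = None
    for ts, kind in events:
        if kind == "fragment":
            last_fragment = ts
        elif kind == "data_poll" and last_fragment is not None:
            delay = ts - last_fragment
            if delay >= threshold_s:
                delayed.append((last_fragment, ts, delay))
            # Each fragment is matched with at most one following poll.
            last_fragment = None
    return delayed


# Illustrative timestamps only: the second poll comes ~102 ms after its fragment.
events = [
    (0.000, "data_poll"),
    (0.004, "fragment"),
    (0.006, "data_poll"),
    (0.010, "fragment"),
    (0.112, "data_poll"),
]
for fragment_ts, poll_ts, delay in delayed_polls(events):
    print(f"poll at {poll_ts:.3f}s delayed by {delay * 1000:.0f} ms")
```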
To Reproduce
Git commit id: ce8bbfe (Tested with multiple older commits as well)
Platform: FTD: nrf52840 dongle, MTD: nrf52840 dev kit
Topology: MTD attached as a SED to the leader
Other: poll interval set to 10s, logging dynamic at level 5
Expected behavior
A data poll should be sent without waiting 100ms.
Console/log output
Unable to obtain because messages overwrite each other when logging at level 5. Logging at level 4 does not trigger the bug.