Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data poll delayed when receiving fragmented packets #9820

Open
CodingRays opened this issue Feb 1, 2024 · 8 comments
Open

Data poll delayed when receiving fragmented packets #9820

CodingRays opened this issue Feb 1, 2024 · 8 comments

Comments

@CodingRays
Copy link
Contributor

CodingRays commented Feb 1, 2024

Describe the bug

We discovered this bug while working on #9763 . It appears that while receiving a fragmented packet the data poll for the next fragments can randomly be delayed by approximately 100ms. This matches the data poll timeout however no previous data poll was sent. During this delay the receiver remains on.
poll_wireshark

We have tested this inside a Faraday cage to eliminate any external causes and the bug seems to be easily reproducible. Out of 20 512byte packets 11 had at least 1 such delayed fragment. We were able to reproduce it on multiple commits, days and locations making external influences unlikely.

After further testing we found that this only happens when logging is enabled at debug level. Which leads me to believe that it could be related to some very tight timeout. But how that would be possible is unclear to me. Further the bug only appears when attached as a SED not as a SSED.

To Reproduce

Git commit id: ce8bbfe (Tested with multiple older commits as well)
Platform: FTD: nrf52840 dongle, MTD: nrf52840 dev kit
Topology: MTD attached as a SED to the leader
Other: poll interval set to 10s, logging dynamic at level 5

Expected behavior

A data poll should be sent without waiting 100ms.

Console/log output

Unable to obtain due to messages overriding each other when logging at level 5. Logging at level 4 does not trigger the bug.

@abtink
Copy link
Member

abtink commented Feb 1, 2024

It is been a while since I last read the DataPollSender and Mac code, but checking the code I dont see anything obvious that would lead to such a behavior.

  • The DataPollSender should immediately SendDataPoll() from ProcessRxFrame() if it sees aFrame.GetFramePending() on the received frames.
  • Mac should finish kOperationWaitingForData from HandleReceivedFrame()

Unable to obtain due to messages overriding each other when logging at level 5. Logging at level 4 does not trigger the bug.

Gotta love a Heisenbug 😄

@jwhui
Copy link
Member

jwhui commented Feb 1, 2024

@CodingRays , can you attach the pcap file?

@CodingRays
Copy link
Contributor Author

So heres a minimal setup to reproduce the issue i found. This uses only dongles:

From a fresh clone of the nrf52840:
Build using ./script/build nrf52840 USB_trans -DOT_BOOTLOADER=USB -DOT_LOG_LEVEL_DYNAMIC=ON
On the mtd following commands:

log level 5
mode -
pollperiod 10000

On the ftd to test

udp send <mleid> 5000 -s 512

Since the bug appears ~50% of the time a few transmission might be required but it should trigger very reliably.

I sadly lost the pcap file from the image but here is a capture from a test using the above setup.
The bug appears between packet 172 and 173. Network key: 6ce43b5b088aa1a10db563e803d1a28b
data_poll_dongle.zip

@jwhui
Copy link
Member

jwhui commented Mar 5, 2024

Looking at the packet trace again, it appears that the ACK was sent with the Frame Pending bit set, but no data frame transmission followed. The default Data Poll Timeout is set to 100ms:

#define OPENTHREAD_CONFIG_MAC_DATA_POLL_TIMEOUT 100

The Data Sequence Number jumps from 125 in frame 169 to 127 in frame 175. So it appears that it tried to send a data frame. But without more logging from the sender, it is difficult to determine exactly the root cause. It could be due to CCA failures. It could be that the frame was transmitted but a collision occurred. Both of which would be legitimate reasons for packet loss.

Can you provide detailed logs from the parent?

@CodingRays
Copy link
Contributor Author

I can try to get the logs.

But its hard to believe its completely unrelated to the logging due to just how consistent it is. I have not had a single such failure during all my testing while logging was disabled while with it enabled its 50% of all ip packets / ~16% of all fragments.

@jwhui
Copy link
Member

jwhui commented Mar 7, 2024

I'm hoping the logs can provide some visibility into why data frames are not being transmitted following an ACK with the Frame Pending bit set.

@CodingRays
Copy link
Contributor Author

CodingRays commented Mar 8, 2024

Ok so ive spent a while testing and i was no longer able to reproduce the bug. It did occur once and as previously when it did happen it was reliable until i changed the configuration. Here 10s after the change i continued testing and it immediately stopped. However without knowing any conditions to trigger it its hard to investigate. I also couldnt get logs from that test. One thing to note though is that now the wireshark logs show the data poll being sent but no response from the parent.

I am using different hardware (newer chip revisions) now though and will try and test with the original hardware hopefully soon and get back after i do that.

@CodingRays
Copy link
Contributor Author

Alright i cant reproduce the bug anymore event with different hardware. I have no idea what changed.
I guess the issue can be closed then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants