During Router promotion, the COAP DUA request packet can't be decrypted #7978

Closed
Ursescu opened this issue Aug 3, 2022 · 6 comments
@Ursescu

Ursescu commented Aug 3, 2022

Describe the bug

During the 5.9.x and 5.10.22 Harness tests, we observed that a CoAP packet is sent immediately after the Address Solicit response but before the Link Request. As a result, the CoAP packet is encrypted with the new address, and the parent cannot decrypt it. This happens when there is no MLE Advertisement from the Leader and the MLE Link Request is sent after a delay of 5 seconds introduced here. The CoAP /n/dr (DUA.req) is sent by the DUA manager component; after role promotion to Router, the registration delay is updated to kNewRouterRegistrationDelay, which is 3 seconds. So the CoAP packet is sent before the Link Request (3 s < 5 s).
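The ordering argument can be written down as a small sketch. It uses only the two delays quoted above; the constant names with an `Ms` suffix, the `FirstMessage` enum, and both helper functions are hypothetical and are not OpenThread identifiers. It assumes the Link Request goes out either when an Advertisement with the new Router ID is heard or when the 5-second delay expires, whichever comes first.

```cpp
#include <cstdint>

// Delays quoted in the report. The "Ms" names are illustrative only.
constexpr uint32_t kDuaRegistrationDelayMs = 3000; // kNewRouterRegistrationDelay (3 s)
constexpr uint32_t kLinkRequestDelayMs     = 5000; // delayed Link Request (5 s)

enum class FirstMessage { kDuaRequest, kLinkRequest };

// Hypothetical helper: time of the Link Request, measured from the Address
// Solicit response. aAdvertisementAtMs is when an Advertisement carrying the
// new Router ID is heard; a negative value means none is heard.
constexpr uint32_t LinkRequestAtMs(int32_t aAdvertisementAtMs)
{
    return (aAdvertisementAtMs >= 0 && static_cast<uint32_t>(aAdvertisementAtMs) < kLinkRequestDelayMs)
               ? static_cast<uint32_t>(aAdvertisementAtMs)
               : kLinkRequestDelayMs;
}

// Hypothetical helper: which of the two messages leaves the device first.
constexpr FirstMessage FirstAfterPromotion(int32_t aAdvertisementAtMs)
{
    return (kDuaRegistrationDelayMs < LinkRequestAtMs(aAdvertisementAtMs)) ? FirstMessage::kDuaRequest
                                                                           : FirstMessage::kLinkRequest;
}

// No Advertisement from the Leader: DUA.req (3 s) precedes the Link Request (5 s).
static_assert(FirstAfterPromotion(-1) == FirstMessage::kDuaRequest, "DUA.req goes first");
```

With an Advertisement heard well before 3 seconds, the same helper reports the Link Request first, which matches the ordering we expected to see.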

While the tests from 5.9.x are passing, 5.10.22 is failing because the number of packets is not as expected.

This is a capture from our board (NXP K32W0):

[Image: capture_nxp]

Packet 46 is the CoAP packet in question. It is sent 3 seconds after the CoAP ADDR_SOL (/a/as) ACK and before the MLE Link Request at packet 50.

Analyzing EFR32 BRD4166A compiled with the same stack commit shows the same problem:

[Image: capture_silabs]

Packet 45 is the CoAP packet, sent before the MLE Link Request at packet 50 with the same timings.

Looking at NRF 52840 compiled with the same stack, the behavior is different:

[Image: capture_nordic]

Now the CoAP packet is sent after the MLE Link Request. To understand the difference, we traced the path from the DUA manager through the Mesh Forwarder and found that the packet is still submitted for transmission after the 3 s delay, but is dropped at PrepareNextDirectTransmission -> UpdateIp6Route -> UpdateIp6RouteFtd with an OT_ERROR_NO_ROUTE error. The packet is sent later, after the MLE Link Request, and everything seems to work correctly.

To Reproduce

  1. Git commit id : 78f8437
  2. IEEE 802.15.4 hardware platform: NXP K32W0, NRF 52840, EFR32 BRD4166A
  3. Build steps: 
    Perform the build steps indicated by each platform and enable OPENTHREAD_CONFIG_DUA_ENABLE.
  4. Network topology
    Kirale - KTBRN1 with v2.8.5 image
    BR (Kirale)  <-> DUT FTD
  5. Running 5.10.22 or 5.9.x

Question 

What should be the correct behavior in this case and why are there differences between vendors? This issue seems related more to the stack than the platform.

Maybe related to: #7664

@George-Stefan
Contributor

^^ @jwhui

@jwhui jwhui added the question label Aug 11, 2022
@jwhui
Member

jwhui commented Aug 11, 2022

@Ursescu , thanks for reporting this issue!

The change in #7745 includes sending a Link Request immediately after receiving an Advertisement that has the newly allocated Router ID set.

IgnoreError(SendLinkRequest(nullptr));

I noticed in the packet traces that an Advertisement is not being sent soon after allocating the new Router ID. When the Leader allocates a new Router ID, it should reset the Advertisement Trickle timer interval to minimum.

Get<Mle::MleRouter>().ResetAdvertiseInterval();
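As background for the reset mentioned above, here is a minimal sketch of Trickle interval bookkeeping in the sense of RFC 6206. It is not OpenThread's implementation: the class `TrickleIntervalSketch`, its method names, and the 1 s / 32 s bounds used in the demo are assumptions for illustration. It only tracks the interval length; a real Trickle timer also picks a random transmit time in the second half of each interval.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative Trickle interval bookkeeping (RFC 6206). Hypothetical class,
// not OpenThread code.
class TrickleIntervalSketch
{
public:
    TrickleIntervalSketch(uint32_t aIntervalMinMs, uint32_t aIntervalMaxMs)
        : mIntervalMinMs(aIntervalMinMs)
        , mIntervalMaxMs(aIntervalMaxMs)
        , mIntervalMs(aIntervalMinMs)
    {
        assert(aIntervalMinMs > 0 && aIntervalMinMs <= aIntervalMaxMs);
    }

    uint32_t GetIntervalMs() const { return mIntervalMs; }

    // An interval ended without a reset: double it, capped at the maximum.
    void HandleIntervalExpired()
    {
        uint64_t doubled = static_cast<uint64_t>(mIntervalMs) * 2;

        mIntervalMs = static_cast<uint32_t>(std::min<uint64_t>(doubled, mIntervalMaxMs));
    }

    // Something changed (here: a new Router ID was allocated): restart at
    // the minimum interval so the next Advertisement goes out soon.
    void Reset() { mIntervalMs = mIntervalMinMs; }

private:
    uint32_t mIntervalMinMs;
    uint32_t mIntervalMaxMs;
    uint32_t mIntervalMs;
};

// Usage sketch with assumed bounds of 1 s and 32 s.
inline void DemoResetShortensInterval()
{
    TrickleIntervalSketch trickle(1000, 32000);

    for (int i = 0; i < 6; i++)
    {
        trickle.HandleIntervalExpired();
    }

    assert(trickle.GetIntervalMs() == 32000); // steady state: long interval

    trickle.Reset();
    assert(trickle.GetIntervalMs() == 1000); // after allocation: short interval
}
```

If the Leader in the capture never performs the equivalent of `Reset()`, its next Advertisement could be many seconds away, which would be consistent with the traces.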

It's not immediately obvious to me why the Advertisement message is not being sent with a short interval. Can you help dig into why we might be observing this behavior?

@Ursescu
Author

Ursescu commented Aug 12, 2022

@jwhui the test is performed using Kirale - KTBRN1 as Leader. When using OTBR as Leader and performing the same test, I can see that the Advertisement is sent shortly after the Address Solicit response and, in this case, the Link Request packet is sent immediately.

Further analysis of why the test (5.10.22) fails in this case reveals another possible problem, both with the way the test works and with Kirale's behavior. The EFR32 BRD4166A capture also exposes our problem: CoAP /n/mr (53) is sent before /n/dr (55) (which is retried because it was first sent before the Link Request), and there is no response from the Leader, causing some retries on our side. Because the test counts CoAP /n/mr packets without filtering out retries, it fails (the number of /n/mr packets is higher than expected). Manually validating the capture didn't show any problems. After the retries, Kirale is able to respond and the MLR.req is successful.

So in the end, the questions that would conclude this issue are whether Kirale is behaving correctly, whether the test's assumption that there are no retries is correct, and why Nordic doesn't transmit the packet in the first place (error OT_ERROR_NO_ROUTE).

@EskoDijk
Contributor

I think a cert test in general should allow for retries of a TMF (CoAP) message - that is fully within specification. The test script should count the number of unique CoAP transactions, not the number of packets, otherwise it's too strict. @Ursescu is this still an issue? If so I can create an internal issue for the CoAP message count/validation.
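To illustrate the difference between the two counting rules, here is a rough sketch. It relies on the CoAP rule that a retransmitted Confirmable message keeps its Message ID (RFC 7252), so retries from one sender collapse into one transaction. The `CoapRequest` struct and both functions are hypothetical and are not taken from the actual test scripts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one CoAP request seen in a capture.
struct CoapRequest
{
    std::string mSource;    // sender address as shown in the capture
    uint16_t    mMessageId; // CoAP Message ID; a retransmission reuses it
    std::string mUriPath;   // for example "/n/mr"
};

// Counts every packet with the given URI path, retries included.
inline std::size_t CountPackets(const std::vector<CoapRequest> &aCapture, const std::string &aUriPath)
{
    std::size_t count = 0;

    for (const CoapRequest &request : aCapture)
    {
        if (request.mUriPath == aUriPath)
        {
            count++;
        }
    }

    return count;
}

// Counts unique (sender, Message ID) pairs, so retries count once.
inline std::size_t CountTransactions(const std::vector<CoapRequest> &aCapture, const std::string &aUriPath)
{
    std::set<std::pair<std::string, uint16_t>> seen;

    for (const CoapRequest &request : aCapture)
    {
        if (request.mUriPath == aUriPath)
        {
            seen.insert(std::make_pair(request.mSource, request.mMessageId));
        }
    }

    return seen.size();
}

// Usage sketch with made-up addresses and Message IDs.
inline void DemoRetriesCountOnce()
{
    std::vector<CoapRequest> capture = {
        {"dut", 10, "/n/mr"}, // first attempt
        {"dut", 10, "/n/mr"}, // retry, same Message ID
        {"dut", 10, "/n/mr"}, // retry, same Message ID
        {"dut", 11, "/n/dr"},
    };

    assert(CountPackets(capture, "/n/mr") == 3);
    assert(CountTransactions(capture, "/n/mr") == 1);
}
```

A check built on the second function would accept a capture like the one described earlier in this thread, where MLR.req only succeeds after some retries.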

On the message timing: it should not matter whether a CoAP DUA.req is sent before or after upgrading to Router, because the TLV payload is independent of the device's RLOC address.

Some possible issues that could be relevant here:

  • If the former-Child just upgraded to Router but hasn't made any links to other Routers yet, that may cause the "No Route" error, because a link to at least one other Router is needed to route the packet away from the device. (@jwhui is that correct, or would the former-Child still use its Parent for this?)
  • If the former-Child uses its Parent (given there are no router-links yet), is the issue then that the Parent checks the Child RLOC and doesn't recognize it, since the new RLOC is not a known Child from the Parent's point of view?
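The second possibility can be made concrete with the RLOC16 layout, where the Router ID sits in the top 6 bits and the Child ID in the low 9 bits. The sketch below is only an illustration of that layout; the helper functions and the example IDs are hypothetical, and it does not claim to show how the Parent actually validates the sender.

```cpp
#include <cstdint>

constexpr uint8_t  kRouterIdOffset = 10;     // Router ID occupies the top 6 bits
constexpr uint16_t kChildIdMask    = 0x01ff; // Child ID occupies the low 9 bits

// Hypothetical helpers for composing and splitting an RLOC16.
constexpr uint16_t Rloc16FromRouterId(uint8_t aRouterId)
{
    return static_cast<uint16_t>(aRouterId << kRouterIdOffset);
}

constexpr uint8_t RouterIdFromRloc16(uint16_t aRloc16)
{
    return static_cast<uint8_t>(aRloc16 >> kRouterIdOffset);
}

constexpr uint16_t ChildIdFromRloc16(uint16_t aRloc16)
{
    return static_cast<uint16_t>(aRloc16 & kChildIdMask);
}

// True when the RLOC16 has the form of a Child address under the given Parent.
constexpr bool LooksLikeChildOf(uint16_t aRloc16, uint8_t aParentRouterId)
{
    return RouterIdFromRloc16(aRloc16) == aParentRouterId && ChildIdFromRloc16(aRloc16) != 0;
}

// Example IDs are made up: Parent has Router ID 1, the Child had Child ID 1,
// and the Leader then assigned it Router ID 2.
constexpr uint16_t kOldChildRloc16  = Rloc16FromRouterId(1) | 1; // 0x0401
constexpr uint16_t kNewRouterRloc16 = Rloc16FromRouterId(2);     // 0x0800

static_assert(LooksLikeChildOf(kOldChildRloc16, 1), "old address belongs to the Parent's range");
static_assert(!LooksLikeChildOf(kNewRouterRloc16, 1), "new address is outside the Parent's range");
```

So a frame sourced from the new RLOC16 no longer looks like it comes from one of the Parent's Children, which is the situation the question above is asking about.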

@George-Stefan
Contributor

> I think a cert test in general should allow for retries of a TMF (CoAP) message - that is fully within specification. The test script should count the number of unique CoAP transactions, not the number of packets, otherwise it's too strict. @Ursescu is this still an issue? If so I can create an internal issue for the CoAP message count/validation.

We created an issue at the Thread Group asking for some clarifications: https://threadgroup.atlassian.net/browse/DEV-2299

@jwhui
Member

jwhui commented Sep 23, 2022

Closing stale issue.

@jwhui jwhui closed this as completed Sep 23, 2022