Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

During Router promotion, there is a time-window which Leader and Router have different router RLOC #7664

Closed
AlanLCollins opened this issue May 5, 2022 · 10 comments
Labels

Comments

@AlanLCollins
Copy link

AlanLCollins commented May 5, 2022

Describe the bug
During Router promotion, there is a time-window which Leader and Router have different Router RLOC.
The Router starts using the new RLOC right after getting the "Delay Link response"
But the Leader does not update the neighbor table until receives advertisement from the router. Then "delayed message" is sent. So, during that time-window if the Router attempts sending higher layer protocol/application message to the Leader. The Leader rejects as "UnknownNeighbor".

Expected behavior
Either the Leader needs to update its neighbor table sooner, or the Router needs to wait to start using the new RLOC.

Console/log output
Fail_duringRouterPromotion.txt

Additional context
I reproduced this issue with SRP protocol during matter provisioning. The Router goes into promotion from CHILD to ROUTER right in the before triggering a DNS resolution.

@AlanLCollins AlanLCollins changed the title During Router promotion, there is a time-window which Leader and Router have different RLOCs During Router promotion, there is a time-window which Leader and Router have different router RLOC May 5, 2022
@AlanLCollins
Copy link
Author

AlanLCollins commented Jun 16, 2022

I reproduced this issue with:
github/openthread (commit 'f3774e3368a5656ba2aa994276a1f1dca6ed734c') May-02-2022
github/ot-br-posix (5d7c91c) May-9th-2022

Two devices:

  • BR = Matter controller / Border router device
  • TR = Matter light-bulb / Thread router.

Context: BR is provisioning the TR with Matter & Thread nwk credentials.

  1. TR joins as CHILD:
06-16 00:42:30.835   840   840 I otbr-agent: [I] Mle-----------: Receive Child ID Request (fe80:0:0:0:7c71:c6a2:6d8b:39fe)
06-16 00:42:30.836   840   840 I otbr-agent: [I] Settings------: Added ChildInfo {rloc:0xbc01, extaddr:7e71c6a26d8b39fe, timeout:240, mode:0x0f, version:3}
06-16 00:42:30.836   840   840 I otbr-agent: [I] Mle-----------: Send Child ID Response (fe80:0:0:0:7c71:c6a2:6d8b:39fe,0xbc01)
  1. TR registers service to BR SRP server:
06-16 00:42:32.572   840   840 I otbr-agent: [INFO]-ADPROXY-: Handle publish SRP service 'AAEF30DEA3506875-024C184F1B9161BD._matter._tcp.default.service.arpa.': OK
...
06-16 00:42:32.601   840   840 I otbr-agent: [I] SrpServer-----: Send success response 
  1. Matter SDK gets the DNS query ok, and finds TR IP as CHILD.
06-16 00:42:33.721  1076  2444 I CHIP DIS: Node ID resolved for AAEF30DEA3506875:024C184F1B9161BD
06-16 00:42:33.721  1076  2444 I CHIP DIS:     Addr 0: [fd11:b241:da07:1:782e:20ea:9d4e:9b68]:5540 
  1. BR Matter SDK creates a CASE session with TR. Right after, TR initiates promotion to Router, BR responded the link request, and erases the old CHILD settings:
06-16 00:42:36.914   840   840 I otbr-agent: [I] Mle-----------: Receive Link Request (fe80:0:0:0:7c71:c6a2:6d8b:39fe)
06-16 00:42:37.326   840   840 I otbr-agent: [I] Mle-----------: Send delayed message (fe80:0:0:0:7c71:c6a2:6d8b:39fe)
...
06-16 00:42:37.360   840   840 I otbr-agent: [I] AddrResolver--: Cache entry removed: fd11:b241:da07:1:782e:20ea:9d4e:9b68, 0xbc01, list:cached - removing rloc16
06-16 00:42:37.360   840   840 I otbr-agent: [I] AddrResolver--: Cache entry removed: fd87:d537:a34a:0:5fe4:e20:c44f:bd0c, 0xbc01, list:cached - removing rloc16
06-16 00:42:37.360   840   840 I otbr-agent: [I] Settings------: Read ChildInfo {rloc:0xbc01, extaddr:7e71c6a26d8b39fe, timeout:240, mode:0x0f, version:3}
06-16 00:42:37.360   840   840 I otbr-agent: [I] Settings------: Removed ChildInfo {rloc:0xbc01, extaddr:7e71c6a26d8b39fe, timeout:240, mode:0x0f, version:3}
  1. BR Matter wants to end the commissioning session over Thread sending cluster 0x30 cmd 0x4. But the IP address stored/published to the DNS server are no longer valid. So the attempt fails:
06-16 00:42:39.913  1076  1673 I CHIP TOO: Sending cluster (0x0030) command (0x04) on endpoint 0
06-16 00:42:39.914   840   840 I otbr-agent: [I] AddrResolver--: Sending address query for fd11:b241:da07:1:782e:20ea:9d4e:9b68
..
06-16 00:42:45.911  1076  2444 I CHIP EM : Retransmitting MessageCounter:10252493 on exchange 59006i Send Cnt 0
...
06-16 00:42:50.935  1076  2444 I CHIP EM : Retransmitting MessageCounter:10252493 on exchange 59006i Send Cnt 1

  1. 15 seconds later, 06-16 00:42:37.326 --> 06-16 00:42:52.027 , the SRP entry is updated with the correct new IP.
06-16 00:42:52.027   840   840 I otbr-agent: [INFO]-ADPROXY-: Advertise SRP service updates: host=7E71C6A26D8B39FE.default.service.arpa.
06-16 00:42:52.027   840   840 I otbr-agent: [INFO]-MDNS----: Removing service 4F842D64CBC47B11._matterc._udp
06-16 00:42:52.027   840   840 I otbr-agent: [INFO]-MDNS----: Releasing avahi entry group @0xaeb53320
06-16 00:42:52.034   840   840 I otbr-agent: [INFO]-ADPROXY-: Handle unpublish SRP service '4F842D64CBC47B11._matterc._udp.default.service.arpa.': OK
06-16 00:42:52.034   840   840 I otbr-agent: [INFO]-ADPROXY-: Waiting for more publishing callbacks 2
06-16 00:42:52.034   840   840 I otbr-agent: [INFO]-MDNS----: Removing service AAEF30DEA3506875-024C184F1B9161BD._matter._tcp
06-16 00:42:52.034   840   840 I otbr-agent: [INFO]-MDNS----: Releasing avahi entry group @0xaeb532a0
06-16 00:42:52.040   840   840 I otbr-agent: [INFO]-ADPROXY-: Handle unpublish SRP service 'AAEF30DEA3506875-024C184F1B9161BD._matter._tcp.default.service.arpa.': OK
06-16 00:42:52.040   840   840 I otbr-agent: [INFO]-ADPROXY-: Waiting for more publishing callbacks 1
06-16 00:42:52.040   840   840 I otbr-agent: [INFO]-MDNS----: Removing host 7E71C6A26D8B39FE
06-16 00:42:52.040   840   840 I otbr-agent: [INFO]-MDNS----: Releasing avahi entry group @0xaeb53260
06-16 00:42:52.045   840   840 I otbr-agent: [INFO]-ADPROXY-: Handle unpublish SRP host '7E71C6A26D8B39FE.default.service.arpa.': OK
06-16 00:42:52.046   840   840 I otbr-agent: [I] SrpServer-----: Handler result of SRP update (id = 14426521) is received: OK
06-16 00:42:52.046   840   840 I otbr-agent: [I] SrpServer-----: Remove host 7E71C6A26D8B39FE.default.service.arpa. (but retain its name)
06-16 00:42:52.046   840   840 I otbr-agent: [I] SrpServer-----: Remove but retain name of service 'AAEF30DEA3506875-024C184F1B9161BD._matter._tcp.default.service.arpa.' subtype:_IAAEF30DEA35068
06-16 00:42:52.046   840   840 I otbr-agent: [I] SrpServer-----: Remove but retain name of service 'AAEF30DEA3506875-024C184F1B9161BD._matter._tcp.default.service.arpa.'
06-16 00:42:52.046   840   840 I otbr-agent: [I] SrpServer-----: Remove but retain name of service '4F842D64CBC47B11._matterc._udp.default.service.arpa.' subtype:_L1416
06-16 00:42:52.046   840   840 I otbr-agent: [I] SrpServer-----: Remove but retain name of service '4F842D64CBC47B11._matterc._udp.default.service.arpa.' subtype:_S5
06-16 00:42:52.046   840   840 I otbr-agent: [I] SrpServer-----: Remove but retain name of service '4F842D64CBC47B11._matterc._udp.default.service.arpa.' subtype:_V4456
06-16 00:42:52.046   840   840 I otbr-agent: [I] SrpServer-----: Remove but retain name of service '4F842D64CBC47B11._matterc._udp.default.service.arpa.'
06-16 00:42:52.046   840   840 I otbr-agent: [I] SrpServer-----: Send success response

@jwhui jwhui added the question label Jun 16, 2022
@jwhui
Copy link
Member

jwhui commented Jun 16, 2022

@AlanLCollins , thanks for raising this issue.

One point of clarification, what needs updating here is the IPv6-to-RLOC mapping, not the SRP entry itself. The IPv6 address in the SRP entry should be the same for both child and router roles.

Is it possible for you to provide debug-level logs on otbr-agent? I'd like to see what is going on between when TR requests a new router id until resynchronizes its link with the BR.

@AlanLCollins
Copy link
Author

ticket-7664-otbr-agent.txt

Please refer to the attachment which includes CHIP & otbr-agent log events.
Thank you in advance!

@jwhui
Copy link
Member

jwhui commented Jun 16, 2022

One thing I'm noticing is that there is no response to the Address Query:

06-16 00:42:39.914   840   840 I otbr-agent: [I] AddrResolver--: Sending address query for fd11:b241:da07:1:782e:20ea:9d4e:9b68
06-16 00:42:39.914   840   840 I otbr-agent: [I] AddrResolver--: Cache entry added: fd11:b241:da07:1:782e:20ea:9d4e:9b68, 0xfffe - query request

The Address Query multicast message is sent twice:

06-16 00:42:39.929   840   840 W otbr-agent: [W] Platform------: Error decoding hdlc frame: Parse
06-16 00:42:39.929   840   840 I otbr-agent: [I] MeshForwarder-: Sent IPv6 HopOpts msg, len:84, chksum:0000, ecn:no, to:0xffff, sec:yes, prio:net
06-16 00:42:39.929   840   840 I otbr-agent: [I] MeshForwarder-:     src:[fd87:d537:a34a:0:0:ff:fe00:bc00]
06-16 00:42:39.929   840   840 I otbr-agent: [I] MeshForwarder-:     dst:[ff03:0:0:0:0:0:0:2]
06-16 00:42:39.962   840   840 W otbr-agent: [W] Platform------: Error decoding hdlc frame: Parse
06-16 00:42:39.962   840   840 I otbr-agent: [I] MeshForwarder-: Sent IPv6 HopOpts msg, len:84, chksum:0000, ecn:no, to:0xffff, sec:yes, prio:net
06-16 00:42:39.962   840   840 I otbr-agent: [I] MeshForwarder-:     src:[fd87:d537:a34a:0:0:ff:fe00:bc00]
06-16 00:42:39.962   840   840 I otbr-agent: [I] MeshForwarder-:     dst:[ff03:0:0:0:0:0:0:2]
06-16 00:42:41.964   840   840 I otbr-agent: [I] AddrResolver--: Timed out waiting for address notification for fd11:b241:da07:1:782e:20ea:9d4e:9b68, retry: 15

But no response is received:

06-16 00:42:41.964   840   840 I otbr-agent: [I] AddrResolver--: Timed out waiting for address notification for fd11:b241:da07:1:782e:20ea:9d4e:9b68, retry: 15

I'm also noticing a fair number of HDLC frame errors, an one happens right after the first Address Query is transmitted:

06-16 00:42:39.962   840   840 W otbr-agent: [W] Platform------: Error decoding hdlc frame: Parse

On a well-implemented serial port, we typically do not see frequent HDLC errors. Are these HDLC errors expected on your platform?

@AlanLCollins
Copy link
Author

Hello @jwhui , my apologies for the late response.

  • Our platform engineers are following up regarding the HDLC implementation. Those warning messages have always been harmless for the application. We will take a look.
  • Is the AddrResolver flow the one that updates the IP-to-RLOC mapping?
  • This looks like an improvement to OT --> Shouldn't our device update the mapping during the MLE Link Request flow ?

@jwhui
Copy link
Member

jwhui commented Jul 1, 2022

Our platform engineers are following up regarding the HDLC implementation. Those warning messages have always been harmless for the application. We will take a look.

Great. Thanks.

Is the AddrResolver flow the one that updates the IP-to-RLOC mapping?

Yes.

This looks like an improvement to OT --> Shouldn't our device update the mapping during the MLE Link Request flow ?

We could consider adding something like this as an optimization, but it would only work for 1-hop neighbors. If the device upgrading to router is multiple hops away from the BR, then this optimization would not apply.

@AlanLCollins
Copy link
Author

I will close this ticket, as there are recent improvements in the OT code which triggers the new Router to send MLE Advertisement right after the Router promotion, so this triggers EID-to-RLOC mapping to start faster.

@jwhui
Copy link
Member

jwhui commented Feb 6, 2023

We still have the following:

But this requires some Thread Group Spec / Test Spec updates.

@AlanLCollins AlanLCollins reopened this Feb 6, 2023
@AlanLCollins
Copy link
Author

Ok. I will reopen the ticket to keep track of this.
Thanks !

@jwhui
Copy link
Member

jwhui commented Mar 7, 2023

Resolved by #8307

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants