IPv6-specific "Host is unreachable" error that exits the matter runtime #100

Closed
jasta opened this issue Sep 25, 2023 · 10 comments · Fixed by #110

Comments

@jasta
Contributor

jasta commented Sep 25, 2023

Environment
Chip: ESP32-C3-MINI-1
Hardware: ESP32-C3-DevKitM-1
Platform: esp-idf (Rust std)

Problem
I likely have something misconfigured on my network that causes IPv6 broadcasts to yield a surprising "Host is unreachable" error. The more important issue, however, is that the way the master future is structured in my example (and in onoff_light) causes the entire Matter runtime to effectively shut down without automatically restarting.

An abridged version of the log shows the issue:

I (7607) rs_matter::transport::core: Comissioning started
I (7617) rs_matter::transport::core: Creating queue for 1 exchanges
I (7617) rs_matter::transport::core: Creating 8 handlers
I (7627) rs_matter::transport::core: Handlers size: 9992
I (7637) rs_matter::transport::core: Transport: waiting for incoming packets
I (7647) rs_matter::transport::udp::async_io: Listening on [::]:5353
I (7647) rs_matter::transport::udp::async_io: Joined IPV6 multicast ff02::fb/2
I (7657) rs_matter::transport::udp::async_io: Joined IP multicast 224.0.0.251/192.168.86.32
I (7667) rs_matter::mdns::builtin: Broadcasting mDNS entry to 224.0.0.251:5353
I (7687) rs_matter::mdns::builtin: Broadcasting mDNS entry to ff02::fb:5353
W (7697) rs_matter::transport::udp::async_io: Error on the network: Os { code: 118, kind: HostUnreachable, message: "Host is unreachable" }
Error: Error::Network

The last line in particular appears to be coming from the master future in the onoff light example: https://github.com/project-chip/rs-matter/blob/main/examples/onoff_light/src/main.rs#L165

This is "fixed" for me by just disabling IPv6, but I do think it highlights some bigger issues with the robustness of error handling inside the runtime. In particular, I'd expect the IPv4 and IPv6 behaviour to be split into separate futures that can error out independently, so that one reaching a terminal state doesn't negatively impact the other. Further, I think some measure of error handling policy is appropriate ("Host is unreachable" seems like it should probably be retryable, for example). I could take a crack at a patch, but I do worry, based on the current state of the code, that it might be a bit intrusive. Any guidance from the maintainers would be greatly appreciated before getting started!

Thanks again for this awesome project, it's renewed my interest big time in IoT :)
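The policy suggested above can be sketched with the standard library alone. This is an illustration of the idea, not how rs-matter is structured: `is_transient` and `announce_all` are hypothetical names, the send operation is passed in as a closure, and the raw error codes are hard-coded assumptions (118 is the code in the log above, 113 is EHOSTUNREACH on Linux).

```rust
use std::io;
use std::net::SocketAddr;

/// Hypothetical policy: treat "host unreachable" as transient.
/// 118 is the raw OS code from the log above (ESP-IDF/lwip);
/// 113 is EHOSTUNREACH on Linux. Both values are assumptions.
fn is_transient(err: &io::Error) -> bool {
    matches!(err.raw_os_error(), Some(118) | Some(113))
}

/// Sends one announcement to each destination independently.
/// A transient failure on one destination (for example the IPv6 group)
/// is logged and skipped, so the remaining destinations are still served.
/// Any other error is returned to the caller immediately.
/// Returns the number of destinations that were reached.
fn announce_all<F>(dests: &[SocketAddr], mut send: F) -> io::Result<usize>
where
    F: FnMut(&SocketAddr) -> io::Result<usize>,
{
    let mut delivered = 0;
    for dest in dests {
        match send(dest) {
            Ok(_) => delivered += 1,
            Err(e) if is_transient(&e) => {
                eprintln!("skipping {}: {}", dest, e);
            }
            Err(e) => return Err(e),
        }
    }
    Ok(delivered)
}
```

With a real `std::net::UdpSocket`, the closure could be `|dest| socket.send_to(&packet, dest)`. Whether skipping is the right policy is a separate question, since the runtime may legitimately want a hard failure when IPv6 is unusable.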

@ivmarkov
Contributor

ivmarkov commented Sep 27, 2023

The mDNS responder is doing broadcasting (strictly speaking, multicasting). In other words, it is not sending the UDP packet to a specific host but to the mDNS multicast address ff02::fb, which should always be available.

Not using IPv6 is less than ideal, to put it mildly. The way Matter is implemented in the field (Google Home, and I suspect others) is that it requires IPv6 connectivity: link-local IPv6 addresses suffice, but those are necessary. Moreover, the mDNS responder also needs IPv6 support. Without it, I was not able to get Google Home provisioning to complete.

So where I'm going is that if ipv6 (including for broadcasting) does not work for you, a hard failure for now is probably OK. It is another story why it fails, as per above. Can you pinpoint the exact code line where it fails?

@jasta
Contributor Author

jasta commented Sep 28, 2023

> So where I'm going is that if ipv6 (including for broadcasting) does not work for you, a hard failure for now is probably OK. It is another story why it fails, as per above. Can you pinpoint the exact code line where it fails?

Ack'd, I'll dig a little deeper into why this isn't working. My network definitely should support IPv6, as it's a Google WiFi mesh with no custom configuration, which makes me think the fault lies somewhere in the rs-matter code, but we'll see...

@ivmarkov
Contributor

Might be... see, with link-local IPv6 there is no need for any explicit "IPv6 support" per se. As in, there is no DHCP and you don't need a gateway either.

What ESP-IDF version are you using with the example? It should be 4.4.x, which I know works, unless you've explicitly changed it...

@jasta
Contributor Author

jasta commented Sep 28, 2023

> Might be... see, with link-local IPv6 there is no need for any explicit "IPv6 support" per se. As in, there is no DHCP and you don't need a gateway either.

Ack'd that's good context for the debugging.

> What ESP-IDF version are you using with the example? It should be 4.4.x, which I know works, unless you've explicitly changed it...

I'm using 5.0.x, but I can try dropping back to 4.4.x to confirm that's the issue. Another good clue, thanks!

@jasta
Contributor Author

jasta commented Oct 1, 2023

Confirmed that 4.4.x fixed this specific issue. I'll try to debug deeper why 5.x would be broken in this way.

@jasta
Contributor Author

jasta commented Oct 2, 2023

Digging deeper into why this doesn't work in release/v5.0 (and I presume v5.1, but that fails to compile with esp-idf-svc), the story here seems really hairy. I think Espressif might have broken something when attempting to backport fixes to v4.4. After many hours of debugging, I am fairly confident the offending code is:

espressif/esp-idf/components/lwip @ release/v4.4:
https://github.com/espressif/esp-idf/tree/release/v4.4/components/lwip
https://github.com/espressif/esp-lwip/blob/4f24c9baf9101634b7c690802f424b197b3bb685/src/core/ipv6/ip6.c#L175-L185

espressif/esp-idf/components/lwip @ release/v5.0:
https://github.com/espressif/esp-idf/tree/release/v5.0/components/lwip
https://github.com/espressif/esp-lwip/blob/8dad8d3ee66840deee4acfc1601de4e396c594be/src/core/ipv6/ip6.c#L175-L177

No idea why these things are different or what actual diff introduced this inconsistency. The v4.4 branch of esp-lwip has only one commit, and it seems unrelated, as if somebody squashed a big merge into one commit (possibly by accident?). Even weirder is that I can't find any evidence of upstream or esp-lwip having code like this. There's also support in the v4.4 branch for IPV6_MULTICAST_IF (which probably would also fix the issue rs-matter is seeing), but that support isn't in upstream or v5.0/v5.1, or really anywhere I can see...

@ivmarkov
Contributor

ivmarkov commented Oct 2, 2023

My hypothesis:
Rust STD's join_multicast_v6 is (partially) broken on the ESP IDF (join_multicast_v4 was totally broken and I had to fix it a while back), in that it likely does not correctly set the IPv6 network interface. It therefore hits the "fallback paths" in ESP IDF 4.4 and 5.0, which try to derive a network interface (and then use the default one on 4.4 and fail on 5.0).

One test we can try to do is "manually" re-implement join_multicast_v6 here, as I did for join_multicast_v4. If it works, next step is to upstream in libc correct signatures for setsockopt, associated constants, and maybe the ipv6_mreq structure.
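For reference, the standard library call under suspicion already takes an explicit interface index. A minimal sketch, assuming the caller knows the index of the interface it wants (the helper name `join_mdns_v6` and the constant are illustrative, not rs-matter APIs). Whether the ESP-IDF port of the standard library honours that index is exactly the open question here:

```rust
use std::io;
use std::net::{Ipv6Addr, UdpSocket};

/// mDNS link-local multicast group (ff02::fb), as seen in the log above.
const MDNS_V6_GROUP: Ipv6Addr = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 0x00fb);

/// Hypothetical helper: join the mDNS group on one specific interface.
/// `interface_index` is supplied by the caller (an assumption of this
/// sketch); 0 asks the network stack to pick an interface itself.
fn join_mdns_v6(socket: &UdpSocket, interface_index: u32) -> io::Result<()> {
    socket.join_multicast_v6(&MDNS_V6_GROUP, interface_index)
}
```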

@jasta
Contributor Author

jasta commented Oct 2, 2023

> My hypothesis:
> Rust STD's join_multicast_v6 is (partially) broken on the ESP IDF (join_multicast_v4 was totally broken and I had to fix it a while back), in that it likely does not correctly set the IPv6 network interface. It therefore hits the "fallback paths" in ESP IDF 4.4 and 5.0, which try to derive a network interface (and then use the default one on 4.4 and fail on 5.0).

I think you're right. I was able to find a commit indicating that the behavior I identified in 4.4 is actually wrong according to the standard; they tried to fix it but seemingly regressed this other behavior we care about. I'll do some more digging and see if any workarounds exist.

> One test we can try to do is "manually" re-implement join_multicast_v6 here, as I did for join_multicast_v4. If it works, next step is to upstream in libc correct signatures for setsockopt, associated constants, and maybe the ipv6_mreq structure.

I don't think IPV6_JOIN_GROUP even has the correct support in lwip to set the multicast interface as it probably should. So there are two unknowns that we need to work out:

  1. How, if at all, can we work around this issue in newer lwip? I have been pretty deep in the code and I don't see any obvious hack that'll work given that IPV6_MULTICAST_IF support was mysteriously removed.

  2. What is the proper fix upstream to make it so we can remove any hack we find in (1)? The hard thing to discern from lwip code is what the intended implementation behaviour even is. That is, for Linux and OS X stacks, is it supposed to be that join_multicast_v6 enables routing of multicast destination IPs? Or is it expected that we call setsockopt with IPV6_MULTICAST_IF? Or something else I hadn't considered? In other words, which exact behaviour does lwip have wrong?

I'll think on this a little more and see if I can find something...

@jasta
Contributor Author

jasta commented Oct 3, 2023

Nope, you were right, I think it's not setting the zone flag in the ip6_addr struct which is causing the route to fail. I'll prep a patch soon to fix it.

@jasta
Contributor Author

jasta commented Oct 3, 2023

After some digging, I have good news. I believe the issue is that in lwip you have to call ip6_addr_set_zone on an ip6_addr_t (which has an extra u8 zone field at the end) that is then used to route packets. This is achieved using the scope_id field in SocketAddrV6. I believe this should be required/important on all platforms; it's just very likely that Linux has a less fragile heuristic to figure this out for you.

See the discussion on the scope_id field here: https://datatracker.ietf.org/doc/html/rfc2553#section-3.3.

So, good news addressing my unknowns above:

  1. We can just pass scope_id into the SocketAddr we use for send_to, confirmed this works.
  2. Nothing needed upstream, IPv6 scopes are implemented properly for newer versions of lwip (found in esp-idf 5.x)

I'll prep a PR to fix this
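The workaround in (1) can be sketched with the standard library alone. This is a minimal illustration, not the code from the eventual PR: the helper name `mdns_v6_destination` is hypothetical, and how the interface index is obtained is left to the caller.

```rust
use std::net::{Ipv6Addr, SocketAddr, SocketAddrV6};

/// mDNS link-local multicast group and port, as seen in the log above.
const MDNS_V6_GROUP: Ipv6Addr = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 0x00fb);
const MDNS_PORT: u16 = 5353;

/// Hypothetical helper: build the mDNS destination with the interface
/// index stored in `scope_id`, so the stack knows which link to use.
/// `interface_index` is an assumption provided by the caller.
fn mdns_v6_destination(interface_index: u32) -> SocketAddr {
    // Arguments: address, port, flowinfo (unused here, 0), scope_id.
    SocketAddr::V6(SocketAddrV6::new(
        MDNS_V6_GROUP,
        MDNS_PORT,
        0,
        interface_index,
    ))
}
```

The resulting address can be passed straight to `UdpSocket::send_to`. With `scope_id` left at 0, the stack has to guess the outgoing interface for a link-local destination, which is the case that fails on newer lwip as described above.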

jasta added a commit to jasta/rs-matter that referenced this issue Oct 4, 2023
According to the RFC
(https://datatracker.ietf.org/doc/html/rfc2553#section-3.3), it is
necessary to disambiguate link-local addresses with the interface index
(in the scope_id field).  Lacking this field, newer versions of lwip that
support proper IPv6 scopes will yield EHOSTUNREACH (Host unreachable).
Other implementations like on Linux and OS X will likely be affected by
the lack of this field for more complex networking setups.

Fixes project-chip#100
jasta added a commit to jasta/rs-matter that referenced this issue Oct 4, 2023
According to the RFC
(https://datatracker.ietf.org/doc/html/rfc2553#section-3.3), it is
necessary to disambiguate link-local addresses with the interface index
(in the scope_id field).  Lacking this field, newer versions of lwip that
support proper IPv6 scopes will yield EHOSTUNREACH (Host unreachable).
Other implementations like on Linux and OS X will likely be affected by
the lack of this field for more complex networking setups.

Fixes project-chip#100

Run cargo fmt again

Run cargo clippy again

Revert "Run cargo clippy again"

This reverts commit e3bba1f.