test_icmp_probe_eu_derper flaky on windows #2069

Closed
rklaehn opened this issue Mar 12, 2024 · 4 comments · Fixed by #2075

Comments

@rklaehn
Contributor

rklaehn commented Mar 12, 2024

It seems that on Windows we sometimes can't do this probe. The derper looks healthy, and if it were down all the other tests would fail as well. We have seen this not just in tests but also in real life.

Possibly related: n0-computer/dumbpipe#17


github-merge-queue bot pushed a commit that referenced this issue Mar 12, 2024
…#2068)

## Description

test(iroh-net): disable test_icmp_probe_eu_derper as flaky on windows

See for example
https://github.com/n0-computer/iroh/actions/runs/8245723076/job/22550253677?pr=2051

Related issue: #2069

## Notes & open questions

<!-- Any notes, remarks or open questions you have to make about the PR.
-->

## Change checklist

- [ ] Self-review.
- [ ] Documentation updates if relevant.
- [ ] Tests if relevant.
@flub
Contributor

flub commented Mar 12, 2024

See also the iroh-net/src/dns.rs tests in #2073.

The issue here seems to be that DNS resolution sometimes fails completely on Windows. We have no idea why yet. But without DNS resolution nothing is going to work.

github-merge-queue bot pushed a commit that referenced this issue Mar 12, 2024
## Description

This is extremely likely to be related to #2069, so not filing a new
issue.

## Notes & open questions

<!-- Any notes, remarks or open questions you have to make about the PR.
-->

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.
@flub
Contributor

flub commented Mar 12, 2024

https://github.com/n0-computer/iroh/actions/runs/8245723076/job/22550253677

The problem is that the system config chooses the fec0:0:0:ffff::3 (and ...::1 and ...::2) DNS server for some reason. But IPv6 is not routable on the machine, probably because it's a dual-stack machine with no IPv6 connectivity, like the vast majority of systems in the world.

On our CI machine it also tries an IPv4 server fairly quickly afterwards, but the whole DNS lookup has a 1s timeout, so it fails before we get a response from a working server, probably because the resolver spends too much time on the broken server.
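
To make the timing problem concrete, here is a toy model using only the standard library. It is a sketch under stated assumptions, not how hickory-resolver actually schedules its attempts: `Server` and `lookup_with_budget` are hypothetical names, and servers are simply tried in order with a fixed per-attempt timeout while the caller waits for a shorter overall budget.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical model of a nameserver (not a hickory-resolver type).
#[derive(Clone, Copy)]
enum Server {
    /// Answers after the given delay.
    Reachable(Duration),
    /// Never answers, e.g. an unroutable IPv6 address.
    Unreachable,
}

/// Tries `servers` in order, spending up to `attempt_timeout` on each one,
/// while the caller only waits `budget` for the overall result.
/// Returns the index of the server that answered within the budget, if any.
fn lookup_with_budget(
    servers: Vec<Server>,
    attempt_timeout: Duration,
    budget: Duration,
) -> Option<usize> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for (idx, server) in servers.into_iter().enumerate() {
            match server {
                Server::Reachable(delay) if delay <= attempt_timeout => {
                    thread::sleep(delay);
                    // The caller may already have given up; ignore send errors.
                    let _ = tx.send(idx);
                    return;
                }
                // Unreachable or too slow: the whole attempt timeout is spent.
                _ => thread::sleep(attempt_timeout),
            }
        }
    });
    // The caller's budget is independent of the per-attempt timeout.
    rx.recv_timeout(budget).ok()
}
```

With a budget shorter than the per-attempt timeout, landing on an unreachable server first means the caller gives up before the working server is ever tried, which matches the failure described above.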

@flub
Contributor

flub commented Mar 13, 2024

So it seems fec0:0:0:ffff::1 (and 2 & 3) are deprecated site-local anycast addresses that Microsoft DNS servers might listen on. How they end up on those Windows boxes I still have no clue.
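
For reference, "site-local" here means the `fec0::/10` prefix, which RFC 3879 deprecated. A minimal sketch of a check for that prefix, using only the standard library (the function name `is_deprecated_site_local` is made up for illustration):

```rust
use std::net::Ipv6Addr;

/// Returns true if `ip` falls inside fec0::/10, the deprecated
/// IPv6 site-local prefix. Hypothetical helper, for illustration only.
fn is_deprecated_site_local(ip: &Ipv6Addr) -> bool {
    // A /10 prefix covers the top 10 bits of the first 16-bit segment.
    (ip.segments()[0] & 0xffc0) == 0xfec0
}
```

All three addresses seen in the CI logs match this prefix, while link-local (`fe80::/10`) and global addresses do not.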

@flub
Contributor

flub commented Mar 13, 2024

Ah, I can find the resolvers configured on the host by running Get-DnsClientServerAddress in PowerShell.

github-merge-queue bot pushed a commit that referenced this issue Mar 13, 2024
## Description

This actively refuses to use the `fec0:0:0:ffff::1`, `fec0:0:0:ffff::2`
and `fec0:0:0:ffff::3` DNS servers if the system has them configured.

Windows by default adds 3 IPv6 site-local anycast addresses to the DNS
servers: `fec0:0:0:ffff::1`, `fec0:0:0:ffff::2` and `fec0:0:0:ffff::3`.
Supposedly Microsoft DNS servers listen on those by default. They seem
to be present as soon as an IPv6 interface is configured, even a
loopback interface, which is extremely common if not the default.

Our hickory-resolver loads the system configuration, which includes
these 3 IPv6 DNS servers. When it needs to make a DNS query it selects a
random nameserver and tries it. If that fails it tries another one.
For the next query there is a bias: it remembers which servers to
avoid or use. So if you get lucky and your first query lands on an
actual DNS server then you are good. If you get unlucky, recovering is a
bit of a tussle because:

Inside netcheck we do DNS queries with a 1s timeout, because all
the probes have a 3s timeout. However, hickory-resolver has a 5s timeout
configured, so its queries stay alive longer than ours. This means
almost all subsequent DNS queries will end up reusing an existing
connection to one of those bad servers if you are unlucky enough to land
on one. The interplay of these timeouts and the connection reuse makes
recovering to a good DNS server a rather tough prospect for netcheck. It
probably would recover eventually, given enough netcheck runs (which run
at intervals of ~30s).

The odds of these nameservers being the sole way of having working DNS
are basically zero. The odds of these nameservers breaking the resolver
are about 50%. So remove these deprecated things.
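
The filtering idea can be sketched roughly as follows. This is a standalone illustration over plain `SocketAddr` values, not the actual iroh-net change; the names `WINDOWS_BAD_SITE_LOCAL_DNS`, `is_bad_site_local_dns` and `filter_nameservers` are hypothetical, and the real code has to work with the resolver's own configuration types.

```rust
use std::net::{IpAddr, Ipv6Addr, SocketAddr};

/// The three deprecated IPv6 site-local anycast DNS addresses that
/// Windows adds by default (hypothetical constant name).
const WINDOWS_BAD_SITE_LOCAL_DNS: [Ipv6Addr; 3] = [
    Ipv6Addr::new(0xfec0, 0, 0, 0xffff, 0, 0, 0, 1),
    Ipv6Addr::new(0xfec0, 0, 0, 0xffff, 0, 0, 0, 2),
    Ipv6Addr::new(0xfec0, 0, 0, 0xffff, 0, 0, 0, 3),
];

/// Returns true if the nameserver address is one of the three bad ones.
fn is_bad_site_local_dns(addr: &SocketAddr) -> bool {
    match addr.ip() {
        IpAddr::V6(ip) => WINDOWS_BAD_SITE_LOCAL_DNS.contains(&ip),
        IpAddr::V4(_) => false,
    }
}

/// Drops the bad nameservers and keeps everything else, in order.
fn filter_nameservers(servers: Vec<SocketAddr>) -> Vec<SocketAddr> {
    servers
        .into_iter()
        .filter(|addr| !is_bad_site_local_dns(addr))
        .collect()
}
```

Matching the three exact addresses, rather than a whole prefix, keeps the filter narrow: any other nameserver the system has configured is left untouched.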

## Notes & open questions

Unfortunately the resolver returned by `get_resolver()` does not have an
API that allows testing it. But the test would basically be the inverse
of the logic that removes the bad servers, so it is perhaps not that
useful anyway.

Closes #2069 
Closes n0-computer/dumbpipe#17

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.