HTTP-01 IPv6 to IPv4 fallback not working properly #2770

Closed
cpu opened this issue May 17, 2017 · 24 comments · Fixed by #2852

Comments

@cpu
Contributor

cpu commented May 17, 2017

A user in IRC noticed that they were suffering HTTP-01 validation failures for a domain that previously worked. On investigation, it appears the domain had both an AAAA record and an A record, but the AAAA address wasn't working. I expected the IPv6 to IPv4 fallback code to mask this issue, but the validation record shows it did not: addressesTried is empty and addressUsed is the v6 address:

    "validationRecord": [
      {
        "url": "http://xxxxx/.well-known/acme-challenge/XXXXXXXXXXXXX",
        "hostname": "XXXXX",
        "port": "80",
        "addressesResolved": [
          "92.XXX.XXX.XXX",
          "2001:XXXX:XXXX:XXXX::111"
        ],
        "addressUsed": "2001:XXXX:XXXX:XXXX::111",
        "addressesTried": null
      }
    ]
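
As a rough illustration of how to read a record like this, the Go sketch below flags records in which an IPv4 address was resolved, the address used was IPv6, and `addressesTried` is empty, which suggests no fallback was attempted. The `ValidationRecord` type and `fallbackSkipped` function are hypothetical names that mirror only the JSON fields shown above (they are not taken from Boulder), and the addresses are placeholders from the documentation ranges because the real ones are redacted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// ValidationRecord mirrors only the JSON fields visible in the record above.
// The type name is hypothetical.
type ValidationRecord struct {
	URL               string   `json:"url"`
	Hostname          string   `json:"hostname"`
	Port              string   `json:"port"`
	AddressesResolved []string `json:"addressesResolved"`
	AddressUsed       string   `json:"addressUsed"`
	AddressesTried    []string `json:"addressesTried"`
}

// isIPv4 reports whether s parses as an IPv4 address.
func isIPv4(s string) bool {
	ip := net.ParseIP(s)
	return ip != nil && ip.To4() != nil
}

// isIPv6 reports whether s parses as an IP address that is not IPv4.
func isIPv6(s string) bool {
	ip := net.ParseIP(s)
	return ip != nil && ip.To4() == nil
}

// fallbackSkipped reports whether the record used an IPv6 address, tried no
// other address, and had at least one IPv4 address available to fall back to.
func fallbackSkipped(r ValidationRecord) bool {
	if !isIPv6(r.AddressUsed) || len(r.AddressesTried) > 0 {
		return false
	}
	for _, addr := range r.AddressesResolved {
		if isIPv4(addr) {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder addresses; a JSON null decodes to an empty slice.
	raw := `{
		"hostname": "example.com",
		"port": "80",
		"addressesResolved": ["192.0.2.10", "2001:db8::111"],
		"addressUsed": "2001:db8::111",
		"addressesTried": null
	}`
	var rec ValidationRecord
	if err := json.Unmarshal([]byte(raw), &rec); err != nil {
		panic(err)
	}
	fmt.Println("fallback skipped:", fallbackSkipped(rec))
}
```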

The VA logged:

HTTP request to http://xxxxx/.well-known/acme-challenge/XXXXXXXXXXXXX failed. err=[&url.Error{Op:"Get", URL:"http://xxxxx/.well-known/acme-challenge/XXXXXXXXXXXXX", Err:(*http.httpError)(0xc420c89260)}] errStr=[Get http://xxxxx/.well-known/acme-challenge/XXXXXXXXXXXXX: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)]

The root cause is that the VA's HTTP-01 dialer wrapper re-uses the same underlying net.Dialer for the fallback connection, after its timeout has already been expended by the initial connection.

@cpu
Contributor Author

cpu commented May 22, 2017

One false-positive for this issue I've seen so far is a host with an A and AAAA record failing an HTTP-01 challenge because the webserver on the AAAA IP returned a 404 while the A webserver had the correct webroot configured. This doesn't meet the conditions for the retry because the failure is at the HTTP challenge validation level and not the IP connectivity level.

@jsha
Contributor

jsha commented May 22, 2017

That seems like a situation where it's consistent with our other behavior to treat the validation as failed due to a misconfigured server.

@cpu
Contributor Author

cpu commented May 22, 2017

@jsha I agree, that's why I called it a false positive.

@jangrewe

This also happened to me, although with dehydrated.
The issue for me was that the IPv6 address in my AAAA wasn't properly routed by my ISP, but the IPv4 still worked fine.

Processing pokemap.berlin with alternative names: www.pokemap.berlin dev.pokemap.berlin
 + Checking domain name(s) of existing cert... changed!
 + Domain name(s) are not matching!
 + Names in old certificate: dev.pokemap.berlin pmg.faked.org pokemap.berlin www.pokemap.berlin
 + Configured names: dev.pokemap.berlin pokemap.berlin www.pokemap.berlin
 + Forcing renew.
 + Checking expire date of existing cert...
 + Valid till Jun 14 22:01:00 2017 GMT Certificate will expire
(Less than 30 days). Renewing!
 + Signing domains...
 + Generating private key...
 + Generating signing request...
 + Requesting challenge for pokemap.berlin...
 + Requesting challenge for www.pokemap.berlin...
 + Requesting challenge for dev.pokemap.berlin...
 + Responding to challenge for pokemap.berlin...
 + Responding to challenge for www.pokemap.berlin...
 + Responding to challenge for dev.pokemap.berlin...
ERROR: Challenge is invalid! (returned: invalid) (result: {
  "type": "http-01",
  "status": "invalid",
  "error": {
    "type": "urn:acme:error:connection",
    "detail": "Could not connect to dev.pokemap.berlin",
    "status": 400
  },
  "uri": "https://acme-v01.api.letsencrypt.org/acme/challenge/aL5EsF2NuL8zBQHzgBw8qFxKFmnrlt91RdfNxm30lAk/1210320188",
  "token": "MQyfKns3ATGGPMS-pKSieBlrWpjE86FIj2pnuYcAFfo",
  "keyAuthorization": "MQyfKns3ATGGPMS-pKSieBlrWpjE86FIj2pnuYcAFfo.ymn7rrjFsLBQUTzWYgdoacDjsIe-B36saKrAYkAh2Tk",
  "validationRecord": [
    {
      "url": "http://dev.pokemap.berlin/.well-known/acme-challenge/MQyfKns3ATGGPMS-pKSieBlrWpjE86FIj2pnuYcAFfo",
      "hostname": "dev.pokemap.berlin",
      "port": "80",
      "addressesResolved": [
        "87.128.111.190",
        "2003:a:37f:ef4f::"
      ],
      "addressUsed": "2003:a:37f:ef4f::",
      "addressesTried": []
    }
  ]
})

@sahsanu

sahsanu commented May 22, 2017

Regarding this ipv6 preference https://community.letsencrypt.org/t/certbot-ipv6-address-on-domain-misconfigured-and-challenges-fail-prefer-ipv6/34626

In this case the user has a domain with both records, A and AAAA, but the web server is only configured for ipv4. The ipv6 request reaches the web server but not the right virtual host, so the challenge obviously fails. I don't know whether it is worth falling back to ipv4 in this case.

@cpu
Contributor Author

cpu commented May 22, 2017

@sahsanu - Thanks for commenting. That thread is the same one I mentioned earlier in this issue as a false positive (I should have linked to it, apologies). In this case I don't expect a fallback and everything appears to be working as intended.

@jsha
Contributor

jsha commented May 22, 2017

Got it! I didn't understand that part. :-)

@sahsanu

sahsanu commented May 23, 2017

Just another "false positive" https://community.letsencrypt.org/t/404-on-well-known-acme-challenge-but-accessable-from-browser/34730

I can understand the decision to prefer AAAA if both records are available for a domain, but sadly we are still living in an ipv4 world. The use case for ipv6 is very limited, as the majority of domestic ISPs don't provide ipv6 to their customers. Also, a lot of people get dedicated, VPS, or shared hosting whose providers auto-configure DNS with both addresses (ipv4 and ipv6), but people don't care about ipv6 (yet) and don't configure their services to use it properly, so I'm afraid we will see a lot of cases with this "false positive" issue ;).

@cpu
Contributor Author

cpu commented May 23, 2017

> so I'm afraid we will see a lot of cases with this "false positive" issue ;).

I will be posting an announcement in the community forum about the IPv6 preference today. Hopefully that will help clear up the confusion.

Ultimately if you run a website that publishes an AAAA address that doesn't work you're going to run into problems sooner or later!

@jsha
Contributor

jsha commented May 30, 2017

From the entry in https://community.letsencrypt.org/t/unable-to-update-challenge-the-challenge-is-not-pending/35118/3, looking in the logs, it appears that the problem was a timeout. So one possibility is that the fallback doesn't happen correctly on timeouts, perhaps because the first try uses up all of the available time?

@cpu
Contributor Author

cpu commented Jun 12, 2017

@jsha That's indeed a possibility. We didn't increase the timeout to accommodate making two back-to-back requests.

@mhofman

mhofman commented Jun 13, 2017

Same problem here. The IPv6 address times out since the HTTP server isn't listening on that interface, and the IPv4 is never checked ("addressesTried": []).

Why not run both in parallel and let the faster one win? If you want to give IPv6 a preference, you can start that check a second before the IPv4.

@mirion

mirion commented Jun 15, 2017

Any news on this issue? In my case, the IPv6 address is unreachable (out of my direct control). For example curl -v6 returns "Immediate connect fail", "Network is unreachable".

Thanks

@cpu
Contributor Author

cpu commented Jun 15, 2017

No news - if the issue isn't assigned & placed into a milestone for a sprint then it's safe to assume it isn't being actively worked on yet.

@jsha jsha modified the milestone: Sprint 2017-06-27 Jun 27, 2017
@jsha jsha assigned cpu Jun 27, 2017
@ArchangeGabriel

I’ve just stumbled upon this. My server apparently drops its IPv6 connectivity from time to time (no idea why at this point), and then the challenge verification fails by timing out. I would have expected a fallback to IPv4, but apparently not.

@cpu cpu added this to the Sprint 2017-06-27 milestone Jul 7, 2017
@cpu cpu changed the title IPv6 to IPv4 fallback not working properly HTTP-01 IPv6 to IPv4 fallback not working properly Jul 7, 2017
@cpu
Contributor Author

cpu commented Jul 7, 2017

I've updated this issue description to reflect my understanding after some debugging & working on a fix. The fallback problem is isolated to HTTP-01 and I have opened #2852 with a proposed fix.

@cpu cpu closed this as completed in #2852 Jul 10, 2017
cpu added a commit that referenced this issue Jul 10, 2017
The implementation of the dialer used by the HTTP01 challenge, constructed with `resolveAndConstructDialer`, used the same wrapped `net.Dialer` for both the initial IPv6 connection, and any subsequent IPv4 fallback connections. This caused the IPv4 fallback to never succeed for cases where the initial IPv6 connection expended the `validationTimeout`.

This commit updates the http01Dialer (newly renamed from `dialer` since it is in fact specific to HTTP01 challenges) to use a fresh dialer for each connection. To facilitate testing the http01Dialer maintains a count of how many dialer instances it has constructed. We use this in a unit test to ensure the correct behaviour without a great deal of new mocking/interfaces.

Resolves #2770
@derekatkins

Hi,
Does this require a change on the client or is this a server change? I just started hitting this issue myself. My IPv6 service is "broken" so, even though I have both A and AAAA records, only IPv4 will successfully reach my server. I'm getting this timeout when I try to renew.
I just upgraded certbot to certbot-0.19.0-1.fc25.noarch but it didn't seem to fix the problem.
If it requires a change to the service, has this change been pushed to the LetsEncrypt service?

@cpu
Contributor Author

cpu commented Oct 30, 2017

Hi @derekatkins - the fallback behaviour is a server-side change, and has been deployed to production already.

The catch is that it's not a complete solution for 100% of all broken IPv6 configurations. In practice there are a handful of cases where IPv6 will not validate for ACME and is broken, but in which the actual IPv6 connectivity works enough to prevent a fallback from occurring. At this point we've decided that we can't invest any more resources in improving the fallback and are not pursuing additional improvements to the server-side code.

> My IPv6 service is "broken" so, even though I have both A and AAAA records, only IPv4 will successfully reach my server. I'm getting this timeout when I try to renew.

@derekatkins I recommend that you resolve the IPv6 connectivity or remove the AAAA record entirely. Unfortunately these are the only two options that will be able to fix your problem.

If you need further help diagnosing the problem I recommend starting a new forum topic in the Let's Encrypt Community Forum. Thanks!

@derekatkins

Interesting. I would think that a lack-of-connect would trigger the fallback after the connect() times out -- which is my case.

I'll work on getting the AAAA records removed (I don't control the DNS) until I get IPv6 working again.

@darkain

darkain commented Mar 28, 2018

I just ran into this issue as well. The error messages were not descriptive enough in the main client to even clue me in to why I was receiving timeouts, but only on one of my domains (I have several with shared IP addresses). After an entire day of investigating, it became apparent that it was because that was the only domain with dual-stack listed in DNS, and there were some routing issues upstream with IPv6 between Let's Encrypt and my servers. In this particular case, because no TCP connection could even be made and the attempt just timed out, shouldn't that qualify as a condition for downgrading to IPv4? This is EXACTLY how web browsers handle this exact situation. Sadly, even today in 2018, there are still routing issues with the IPv6 global network at the backbone/BGP level, and because of this, it literally took my production web site offline due to the fact I could not renew certs through LetsEncrypt, and simply got the rate limit (only 5?) when it really seems like an IPv4 fallback would have been preferable.

@alfonsonishikawa

Any updates on this? I rely on IPv4 connection too. I don't have AAAA record defined, only A, and still does not work.

@jsha
Contributor

jsha commented May 14, 2018

> I don't have AAAA record defined, only A, and still does not work.

It sounds like you have a different problem. I recommend posting on https://community.letsencrypt.org/. Thanks!

@navara

navara commented Jul 21, 2018

Another thumbs up for this problem.

We do have two domains with IPv6 on port 443 enabled and those update certificates correctly. Remaining domains are for our use however, not published to clients, so without IPv4 (no need for it, only universities use IPv6 there).
When the server connects it reaches one of those public domains and fails the check without reverting to the IPv4 version of the site, where the file is created and accessible - I see a letsencrypt record in one of their logs.

I can symlink all challenge dirs into one, but option -ipv4only for certbot would be cooler...

@jsha
Contributor

jsha commented Dec 13, 2018

Hi @navara! It sounds like you've got a configuration problem. I recommend posting on https://community.letsencrypt.org/.

I'm going to lock this conversation for now - I think most followups are best sent to the forum. Thanks all!

@letsencrypt letsencrypt locked as resolved and limited conversation to collaborators Dec 13, 2018