
Traefik in the autohttps pod should reprovision on redeploy if needed #1602

Closed

mikebranski opened this issue Mar 18, 2020 · 4 comments

@mikebranski

Let's say I'm setting up a new install at jhub.example.com on AWS EKS. I need to point that host at the proxy-public service sitting in front of the hub, which I can't do until it's been deployed and I can run kubectl get svc proxy-public. The problem is that the cert-manager's traefik container is going to fail to provision a certificate from Let's Encrypt because jhub.example.com isn't pointing to anything yet, so it can't complete the ACME challenge.
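
(For reference, this is the kind of lookup I mean; the EXTERNAL-IP column is what the DNS record needs to point at. Illustrative output only, the hostname and values are made up.)

$ kubectl get svc proxy-public
NAME           TYPE           CLUSTER-IP     EXTERNAL-IP                                          PORT(S)                      AGE
proxy-public   LoadBalancer   10.100.12.34   a1b2c3d4e5f6-123456789.us-east-1.elb.amazonaws.com   80:31234/TCP,443:32345/TCP   5m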

$ kubectl logs pod/$(kubectl get pods -o custom-columns=POD:metadata.name | grep autohttps-) traefik -f
# (init logs removed for brevity)
time="2020-03-18T22:15:17Z" level=error msg="Unable to obtain ACME certificate
for domains \"jhub.example.com\" : unable to generate a certificate for the
domains [jhub.example.com]: acme: Error -> One or more domains had a problem:\n
[jhub.example.com] acme: error: 400 :: urn:ietf:params:acme:error:dns :: No
valid IP addresses found for jhub.example.com, url: \n" providerName=le.acme

That makes sense. Now, if I point jhub.example.com to the proxy's public URL and re-deploy, I would expect cert-manager to try provisioning again, but it doesn't. The autohttps-* pod does not get recreated, either.
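
(To rule out DNS before re-deploying, I check that the record actually resolves to the load balancer; illustrative command only, the hostname is made up.)

$ dig +short jhub.example.com
a1b2c3d4e5f6-123456789.us-east-1.elb.amazonaws.com.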

$ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
autohttps-5bf7787fcb-ql8k6        2/2     Running   0          5m43s 👈 old cert-manager
continuous-image-puller-jtkww     1/1     Running   0          52m
continuous-image-puller-pmrqp     1/1     Running   0          52m
continuous-image-puller-pzdfn     1/1     Running   0          52m
continuous-image-puller-qwxw7     1/1     Running   0          52m
continuous-image-puller-qz8cr     1/1     Running   0          52m
hub-78cfbb76f7-hmhwr              1/1     Running   0          105s 👈 new hub
proxy-766c8fb85b-97pcw            1/1     Running   0          101s 👈 new proxy
user-scheduler-746d49c857-nm55r   1/1     Running   0          52m
user-scheduler-746d49c857-xz97f   1/1     Running   0          52m

The hub is still unreachable – both through the proxy's public URL and the subdomain – and there's only one new entry about upgrading in the logs.

time="2020-03-18T22:25:15Z" level=warning msg="A new release has been found: 2.1.7. Please consider updating."

Finally, if I then delete the autohttps-* pod, it gets recreated and attempts another provision, which succeeds and everything loads as I'd expect.

This was my first foray into JupyterHub and Kubernetes, so I could be missing something very elementary, but I've been working with the 0.9.x chart for a few months and this has plagued me the entire time. Am I doing something incorrectly, or could this behavior be changed or documented more clearly?

Here is our values.yml for posterity. Our real one has a lot more to it, but the issue was reproducible with just this subset.

proxy:
  secretToken: [REDACTED]
  https:
    enabled: true
    hosts:
      - jhub.example.com
    letsencrypt:
      contactEmail: email@example.com
@betatim
Member

betatim commented Mar 23, 2020

Thanks for finding and debugging this.

I am not sure exactly what the way forward is, but I'd start by investigating a solution like:

# This lets us autorestart when the secret changes!
checksum/config-map: {{ include (print .Template.BasePath "/hub/configmap.yaml") . | sha256sum }}
checksum/secret: {{ include (print .Template.BasePath "/hub/secret.yaml") . | sha256sum }}

We'd need to find the right bit of config to include in the sha256. Probably something from the Ingress objects?
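
For concreteness, a rough sketch of what that could look like on the autohttps deployment template (the file path and the choice of values to hash are assumptions on my part, not the chart's actual layout):

# templates/proxy/autohttps/deployment.yaml (hypothetical path)
spec:
  template:
    metadata:
      annotations:
        # Recreate the autohttps pod whenever the relevant proxy.https values change.
        checksum/autohttps-config: {{ .Values.proxy.https | toYaml | sha256sum }}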

The problem is a pod only gets recreated if something about its configuration changes, which in the case of pointing the domain name to the right IP doesn't happen. Maybe cert-manager would have retried getting the certificate but my guess would be that the wait time for that is quite long compared to wanting to deploy things.

@consideRatio changed the title from "Certificate manager should reprovision on redeploy if needed" to "Traefik in the autohttps pod should reprovision on redeploy if needed" on Oct 7, 2020
@consideRatio
Member

consideRatio commented Oct 7, 2020

Thank you for a clearly written and formatted issue @mikebranski! ❤️

Note that cert-manager is a different tool, unrelated to the Traefik instance running in the autohttps pod. Traefik makes use of the LEGO library (a Let's Encrypt / ACME client written in Go).

We want Traefik to retry the ACME challenge interaction with Let's Encrypt once the domain name points to the proxy-public load balancer IP, but I don't think there is a sensible mechanism to trigger that.

  • Should Traefik retry a lot of times? No, that would spam Let's Encrypt and you would run into its rate limits.
  • Should Helm upgrade trigger a restart of the autohttps pod every time, or only when something changes? Not every time, at least: it is a very bad pod to disrupt, because all traffic goes through it. But perhaps sometimes? Well, how would we determine that "sometime"? I don't see a sensible way to spot when a helm upgrade should trigger a restart of that pod, and think it would be better to document that users should kubectl delete pod -l component=autohttps once during the initial setup, if they activated proxy.https.enabled etc. before the domain name was pointed to the right IP.

Summary

I don't see a fix other than documenting that one may need to kubectl delete pod the autohttps pod once, if the proxy.https settings were configured before the domain name was directed to the proxy-public service's external IP.
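
For example, the documented workaround could be as short as the following (using the component=autohttps label mentioned above; add --namespace as needed):

# Run once, after pointing the DNS record at the proxy-public load balancer,
# so that the recreated pod's Traefik retries the ACME challenge:
$ kubectl delete pod -l component=autohttps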

@pvanliefland

@consideRatio this issue has been bugging me for a while... It seems that even if the domain name already points to the proxy-public service's external IP, I need to manually restart the autohttps pod after the first deployment.

I use kubectl rollout restart deployment/autohttps --namespace=a-namespace.

Am I missing something?

I would be happy to help with a PR documenting that, or to take a shot at a fix; let me know :)

@consideRatio
Member

I opened #2150 @pvanliefland !
