
Traefik in the autohttps pod should reprovision on redeploy if needed #1602

Closed

mikebranski opened this issue Mar 18, 2020 · 4 comments

@mikebranski

Let's say I'm setting up a new install at jhub.example.com on AWS EKS. I need to point that host at the proxy-public service sitting in front of the hub, which I can't do until it's been deployed and I can run kubectl get svc proxy-public. The problem is that the cert-manager's traefik container is going to fail to provision a certificate from Let's Encrypt because jhub.example.com isn't pointing to anything yet, so it can't complete the ACME challenge.
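
(For reference, this is the kind of lookup I mean; the EXTERNAL-IP column is what the DNS record needs to point at. Illustrative output only, the hostname and values are made up.)

$ kubectl get svc proxy-public
NAME           TYPE           CLUSTER-IP     EXTERNAL-IP                                          PORT(S)                      AGE
proxy-public   LoadBalancer   10.100.12.34   a1b2c3d4e5f6-123456789.us-east-1.elb.amazonaws.com   80:31234/TCP,443:32345/TCP   5m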

$ kubectl logs pod/$(kubectl get pods -o custom-columns=POD:metadata.name | grep autohttps-) traefik -f
# (init logs removed for brevity)
time="2020-03-18T22:15:17Z" level=error msg="Unable to obtain ACME certificate
for domains \"jhub.example.com\" : unable to generate a certificate for the
domains [jhub.example.com]: acme: Error -> One or more domains had a problem:\n
[jhub.example.com] acme: error: 400 :: urn:ietf:params:acme:error:dns :: No
valid IP addresses found for jhub.example.com, url: \n" providerName=le.acme

That makes sense. Now, if I point jhub.example.com to the proxy's public URL and re-deploy, I would expect cert-manager to try provisioning again, but it doesn't. The autohttps-* pod does not get recreated, either.
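
(To rule out DNS before re-deploying, I check that the record actually resolves to the load balancer; illustrative command only, the hostname is made up.)

$ dig +short jhub.example.com
a1b2c3d4e5f6-123456789.us-east-1.elb.amazonaws.com.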

$ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
autohttps-5bf7787fcb-ql8k6        2/2     Running   0          5m43s 👈 old cert-manager
continuous-image-puller-jtkww     1/1     Running   0          52m
continuous-image-puller-pmrqp     1/1     Running   0          52m
continuous-image-puller-pzdfn     1/1     Running   0          52m
continuous-image-puller-qwxw7     1/1     Running   0          52m
continuous-image-puller-qz8cr     1/1     Running   0          52m
hub-78cfbb76f7-hmhwr              1/1     Running   0          105s 👈 new hub
proxy-766c8fb85b-97pcw            1/1     Running   0          101s 👈 new proxy
user-scheduler-746d49c857-nm55r   1/1     Running   0          52m
user-scheduler-746d49c857-xz97f   1/1     Running   0          52m

The hub is still unreachable – both through the proxy's public URL and the subdomain – and there's only one new entry about upgrading in the logs.

time="2020-03-18T22:25:15Z" level=warning msg="A new release has been found: 2.1.7. Please consider updating."

Finally, if I then delete the autohttps-* pod, it gets recreated and attempts another provision, which succeeds and everything loads as I'd expect.

This was my first foray into JupyterHub and Kubernetes, so I could be missing something very elementary, but I've been working with the 0.9.x chart for a few months and this has plagued me the entire time. Am I doing something incorrectly, or could this behavior be changed or documented more clearly?

Here is our values.yml for posterity. Our real one has a lot more to it, but the issue was reproducible with just this subset.

proxy:
  secretToken: [REDACTED]
  https:
    enabled: true
    hosts:
      - jhub.example.com
    letsencrypt:
      contactEmail: email@example.com
@betatim
Member

betatim commented Mar 23, 2020

Thanks for finding and debugging this.

I am not sure exactly what the way forward is, but I'd start by investigating a solution like:

# This lets us autorestart when the secret changes!
checksum/config-map: {{ include (print .Template.BasePath "/hub/configmap.yaml") . | sha256sum }}
checksum/secret: {{ include (print .Template.BasePath "/hub/secret.yaml") . | sha256sum }}

We'd need to find the right bit of config to include in the sha256. Probably something from the Ingress objects?
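
For concreteness, a rough sketch of what that could look like on the autohttps deployment template (the file path and the choice of values to hash are assumptions on my part, not the chart's actual layout):

# templates/proxy/autohttps/deployment.yaml (hypothetical path)
spec:
  template:
    metadata:
      annotations:
        # Recreate the autohttps pod whenever the relevant proxy.https values change.
        checksum/autohttps-config: {{ .Values.proxy.https | toYaml | sha256sum }}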

The problem is a pod only gets recreated if something about its configuration changes, which in the case of pointing the domain name to the right IP doesn't happen. Maybe cert-manager would have retried getting the certificate but my guess would be that the wait time for that is quite long compared to wanting to deploy things.

@consideRatio changed the title from "Certificate manager should reprovision on redeploy if needed" to "Traefik in the autohttps pod should reprovision on redeploy if needed" on Oct 7, 2020
@consideRatio
Member

consideRatio commented Oct 7, 2020

Thank you for a clearly written and formatted issue @mikebranski! ❤️

Note that cert-manager is a different tool, unrelated to the Traefik instance running in the autohttps pod. Traefik makes use of the LEGO library (a Let's Encrypt / ACME client written in Go).

We want Traefik to retry the ACME challenge interaction with Let's Encrypt once the domain name points to the proxy-public load balancer IP, but I don't think there is a sensible mechanism to trigger that.

  • Should Traefik retry a lot of times? No, that would spam Let's Encrypt and you would run into its rate limits.
  • Should Helm upgrade trigger a restart of the autohttps pod every time, or only when something changes? Not every time, at least: it is a very bad pod to disrupt, because all traffic goes through it. But perhaps sometimes? Well, how would we determine that "sometime"? I don't see a sensible way to spot when a helm upgrade should trigger a restart of that pod, and think it would be better to document that users should kubectl delete pod -l component=autohttps once during the initial setup, if they activated proxy.https.enabled etc. before the domain name was pointed to the right IP.

Summary

I don't see a fix other than documenting that one may need to kubectl delete pod the autohttps pod once, if the proxy.https settings were configured before the domain name was directed to the proxy-public service's external IP.
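
For example, the documented workaround could be as short as the following (using the component=autohttps label mentioned above; add --namespace as needed):

# Run once, after pointing the DNS record at the proxy-public load balancer,
# so that the recreated pod's Traefik retries the ACME challenge:
$ kubectl delete pod -l component=autohttps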

@pvanliefland

@consideRatio this issue has been bugging me for a while... It seems that even if the domain name already points to the proxy-public service's external IP, I need to manually restart the autohttps pod after the first deployment.

I use kubectl rollout restart deployment/autohttps --namespace=a-namespace.

Am I missing something?

I would be happy to help with a PR documenting that, or to take a shot at a fix; let me know :)

@consideRatio
Member

I opened #2150 @pvanliefland !
