
secret-sync: selective write to secret / functional logs #1678

Merged 4 commits into jupyterhub:master on Jun 3, 2020

Conversation

consideRatio (Member) commented:

While working to set up the test suite for HTTPS, this script became quite relevant and I learned some things.

  1. The logs didn't work, so I tried learning about Python logging and fixed it by setting PYTHONUNBUFFERED=1. I also print something on startup to provide a life sign, and I added some logging statements here and there to help me grasp what was going on (see the sketch after this list).
  2. Traefik, running in the autohttps pod with secret-sync.py as a sidecar, often ended up writing partial state to the k8s secret that was soon outdated. If such a k8s secret with outdated partial state was used to bootstrap a new autohttps pod, I was forced to manually delete the secret. To avoid this, we now only write to the k8s secret when we have acquired a certificate.
  3. SIGTERM signals passed by kubectl delete or docker stop weren't making the script exit, but with tini they did. I also realized that tini-static is a tini binary with its dependencies embedded, allowing it to run on the Alpine distribution. This change made things quite a bit more robust: without it, it was common to get two autohttps pods with an active secret-sync.py running, both writing content to the k8s secret. This happened because one autohttps pod would be in the terminating state but took 30 seconds to actually quit (that is when SIGKILL is sent), so a new pod that had perhaps acquired a certificate would get the k8s secret overwritten by the old pod.
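A minimal sketch of the logging fix from item 1 (illustrative only; the exact messages and format here are my own, not the PR's literal code):

import logging
import sys

# Log to stdout so kubectl logs picks the messages up. Running the
# container with PYTHONUNBUFFERED=1 (or python -u) ensures output
# isn't held back in Python's stdio buffers.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger(__name__)

# A life sign on startup makes it obvious the sidecar actually started.
logger.info("secret-sync.py started")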

consideRatio (Member, Author) commented:

This PR will hardcode assumptions about working specifically with Traefik's acme.json secret in this script, but I think that makes sense for the time being. The docstring speaks specifically about Traefik, and this is currently the script's only use case, so I figure it is premature to generalize it with hooks etc. for making selective syncs. If we want this to be generalized further, I think we should do it as part of an extraction from the z2jh repo.

@yuvipanda - what do you think, merge?

else:
    file_dict = json.loads(file_content)

    for cert_resolver in file_dict.values():
yuvipanda (Collaborator) commented:

Instead of this, how do you feel about waiting until the file hasn't changed for at least 15s or something before writing it? Would that help mitigate the partially written secret issues?
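For concreteness, a rough sketch of that debounce idea (names and numbers are illustrative; as the replies below explain, this is not what the PR ended up doing):

import os
import time

def wait_until_quiescent(path, quiet_seconds=15, poll_seconds=1):
    # Block until path has gone quiet_seconds without a modification.
    last_mtime = os.stat(path).st_mtime
    quiet_since = time.monotonic()
    while time.monotonic() - quiet_since < quiet_seconds:
        time.sleep(poll_seconds)
        mtime = os.stat(path).st_mtime
        if mtime != last_mtime:
            # The file changed, so restart the quiet-period clock.
            last_mtime = mtime
            quiet_since = time.monotonic()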

consideRatio (Member, Author) commented on May 28, 2020:

I dislike that I embed assumptions about the acme.json format here and use that to determine when we have successfully completed a challenge, but I think we need logic that only writes completed challenges to ensure this is robust without manual intervention.

Having a 15-second throttle-like system is problematic: if something wasn't right with the DNS setup and the ACME server didn't retain its state during this time, I've seen the interaction get stuck. Pebble as an ACME server loses its state on restarts, and I'm not sure how long LE staging / LE production retain state.

The best outcome in my mind is if we get a clear signal from LEGO, which is Traefik's ACME client, that it is in a happy state, and only accept syncs during those times. That would perhaps cover partial situations during renewal as well.

yuvipanda (Collaborator) replied:

That seems reasonable. I think if we rename this script to acme-secret-sync.py and put a variant of this comment somewhere in the code, I'm happy with this.

GeorgianaElena (Member) commented:

@consideRatio, in traefik-proxy exponential_backoff is used a lot to check if Traefik is in a ready state. Maybe it could be used here too, instead of the busy wait.
This is how I check if the cert is available in acme.json with exponential_backoff: commit link here
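The pattern looks roughly like this (a generic sketch of exponential backoff, not traefik-proxy's actual exponential_backoff API):

import time

def exponential_backoff(check, timeout=60, start_wait=0.2, max_wait=5):
    # Retry check() with exponentially growing waits until it returns a
    # truthy value or the timeout passes, instead of busy-waiting.
    deadline = time.monotonic() + timeout
    wait = start_wait
    while time.monotonic() < deadline:
        result = check()
        if result:
            return result
        time.sleep(wait)
        wait = min(wait * 2, max_wait)
    raise TimeoutError("check did not succeed within the timeout")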

consideRatio (Member, Author) commented:

I've gained some more experience and insight, and am now quite confident it makes sense to lock in a check that we have acquired a certificate, which makes the script acme.json specific. See the topics below regarding that.

@yuvipanda I've updated the docstring about this matter, and I've renamed it to acme-secret-sync.py

@GeorgianaElena The crux here is to avoid a state that will keep failing forever (forever being >= 10 minutes). I've actually not seen a failing Traefik / ACME server interaction recover properly yet.

Traefik not fully robust

Apparently Traefik can sometimes make two ACME orders to the ACME server, which means two separate challenges will be performed. Sometimes when this has happened it can fail, and when that happens we will have locked in a state that keeps failing on restarts.

I'm not 100% sure how to reproduce it, but here is an issue I opened: traefik/traefik#6871

ACME servers can do things to cause disruptions

From having used Pebble, an ACME server designed to ensure that ACME clients interacting with it behave as they should, I conclude that Traefik + LEGO don't always do that. Even though we can tame Pebble to behave more leniently (PEBBLE_AUTHZREUSE=100), this failure can still occur with LE, and we could avoid it by syncing only when we have fully acquired a cert.
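The gist of the check this locks in, as a sketch (assuming Traefik's acme.json layout, where each cert resolver entry carries a Certificates list once a certificate has been acquired):

import json

def has_acquired_certificate(file_content):
    # Return True only if acme.json records at least one acquired cert.
    if not file_content:
        return False
    file_dict = json.loads(file_content)
    for cert_resolver in file_dict.values():
        if cert_resolver.get("Certificates"):
            return True
    return False

# Only copy acme.json into the k8s secret once this returns True, so a
# restarted autohttps pod never bootstraps from a partial ACME state.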

GeorgianaElena (Member) commented:

@consideRatio, I don't understand what happens if the Traefik-ACME interaction goes wrong and no certificate appears in acme.json.
From the script, it looks like it will stay in the while loop waiting forever.
This is what you want, right? But I don't understand why (:confused: sorry for the silly question).

consideRatio (Member, Author) replied:

@GeorgianaElena I'm not sure what happens either. The key thing I want to avoid with this change is that a user of this setup ends up needing to delete a k8s secret manually because it makes Traefik start up with a state it gets stuck in.

You don't ask silly questions, but it was quite silly to think you just asked a silly question =D

consideRatio (Member, Author) commented:

Btw, lots of love for asking what you perceived to be a silly question anyhow @GeorgianaElena, I like this culture!

yuvipanda (Collaborator) left a review:

Thanks, @consideRatio. This LGTM. I'll wait for @GeorgianaElena to take a look and merge?

GeorgianaElena (Member) left a review:

LGTM 🎉 !

consideRatio merged commit 6970ddc into jupyterhub:master on Jun 3, 2020
consideRatio (Member, Author) commented:

Thanks for your review @yuvipanda @GeorgianaElena! ❤️ 🎉
