This repository has been archived by the owner on Jun 26, 2023. It is now read-only.

HNC: very slow startup after first installation #765

Closed
adrianludwin opened this issue May 25, 2020 · 3 comments · Fixed by #836

@adrianludwin
Contributor

The builtin cert manager seems to stall the first time we install HNC. This is a poor user experience.

@adrianludwin adrianludwin added this to the hnc-v0.4 milestone May 25, 2020
@adrianludwin
Contributor Author

There seem to be two problems here:

  • When the Secret is first created, it takes a long time to propagate the updated secret to the Pod. This is expected.
  • Leader election seems to be taking a while too, even though we only have one pod and no ability to run a hot standby.

@adrianludwin
Contributor Author

Here are some experiments, from the time kubectl apply -f <manifests> finishes to the time the HNCConfiguration reconciler finishes for the first time.

  1. At head:
     • Attempt 1: 95s
     • Attempt 2: 95s
  2. Force HNC to exit after the secret is first written and restart:
     • Attempt 1: 35s
     • Attempt 2: 32s
  3. As above, but without leader election:
     • Attempt 1: 11s
     • Attempt 2: 12s

The difference between 2 and 3 (~22s) represents two leader elections, from which we can gather that LE takes about 11s. This suggests that without forcing HNC to exit but disabling LE, we'd reduce our startup time from ~95s to ~84s - not a huge advantage.

However, the second restart of HNC, without LE, takes only ~4s, vs ~15s with LE. This is probably worth doing if only for the high availability aspect (e.g. validators going down for 4s vs 15s).
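
The arithmetic behind the ~11s and ~84s estimates can be checked with a short Go sketch. The timings are the ones measured above; the helper names `mean` and `perElectionCost` are made up for this example.

```go
package main

import "fmt"

// mean returns the arithmetic mean of the samples, in seconds.
func mean(samples ...float64) float64 {
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// perElectionCost estimates the cost of one leader election from the gap
// between runs with and without LE, given how many elections each run had.
func perElectionCost(withLE, withoutLE float64, elections int) float64 {
	return (withLE - withoutLE) / float64(elections)
}

func main() {
	head := mean(95, 95)          // experiment 1
	restartWithLE := mean(35, 32) // experiment 2
	restartNoLE := mean(11, 12)   // experiment 3

	// The forced-exit run starts the process twice, so it pays for two elections.
	le := perElectionCost(restartWithLE, restartNoLE, 2)

	// At head there is no restart, so disabling LE saves only one election.
	fmt.Printf("leader election ~%.0fs each; head without LE ~%.0fs\n", le, head-le)
	// prints: leader election ~11s each; head without LE ~84s
}
```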

@adrianludwin adrianludwin modified the milestones: hnc-v0.4, hnc-backlog May 26, 2020
@adrianludwin
Contributor Author

Here's the patch to make the cert manager exit when the secret is first written:

diff --git a/incubator/hnc/third_party/open-policy-agent/gatekeeper/pkg/webhook/certs.go b/incubator/hnc/third_party/open-policy-agent/gatekeeper/pkg/webhook/certs.go
index 95e40af2..f238c0b1 100644
--- a/incubator/hnc/third_party/open-policy-agent/gatekeeper/pkg/webhook/certs.go
+++ b/incubator/hnc/third_party/open-policy-agent/gatekeeper/pkg/webhook/certs.go
@@ -122,6 +122,7 @@ func (cr *CertRotator) refreshCertIfNeeded() error {
                if err := cr.client.Get(context.Background(), cr.SecretKey, secret); err != nil {
                        return false, errors.Wrap(err, "acquiring secret to update certificates")
                }
+               noSecret := secret.Data == nil
                if secret.Data == nil || !cr.validCACert(secret.Data[caCertName], secret.Data[caKeyName]) {
                        crLog.Info("refreshing CA and server certs")
                        if err := cr.refreshCerts(true, secret); err != nil {
@@ -129,6 +130,10 @@ func (cr *CertRotator) refreshCertIfNeeded() error {
                                return false, nil
                        }
                        crLog.Info("server certs refreshed")
+                       if noSecret {
+                               crLog.Info("Initial secret has just been written; exiting so secret gets projected faster")
+                               os.Exit(0)
+                       }
                        return true, nil
                }
                // make sure our reconciler is initialized on startup (either this or the above refreshCerts() will call this)

adrianludwin added a commit to adrianludwin/multi-tenancy that referenced this issue Jul 1, 2020
See kubernetes-retired#765. If a mounted secret changes _after_ a pod is started, it can
take a fairly long time (~60s) for the kubelet to notice the change and
project the new secret to the pod. Since our internal cert manager
writes a secret but then needs to wait for it to become available as a
file, this leads to a poor onboarding experience with HNC.

This change introduces a flag that exits the process as soon as the
internal cert manager changes a secret, which should only occur on
initial installation of HNC or every ten years (!). The restart itself
takes <5s, so this is overall a much better experience.

Tested: without changing the flags in the default manifest, observed no
change when HNC is installed for the first time (i.e. from the first log
message to when the HNCConfiguration is first reconciled takes 103s, and
there are no restarts). When the flag is added, the startup time
decreases to 10s with the one expected restart. Further restarts of HNC
(e.g. deleting and recreating the deployment but not the secret) do
not trigger an extra restart and complete in 4s.
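
As a rough illustration of the flag described in this commit message, the Go sketch below gates the exit on an opt-in flag. The flag name `restart-on-secret-refresh`, the helper `shouldExitAfterRefresh`, and the hard-coded `secretChanged` value are all assumptions made for this example and are not taken from the actual change.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// shouldExitAfterRefresh reports whether the process should exit once the
// cert manager has changed the secret. (Hypothetical helper.)
func shouldExitAfterRefresh(restartOnRefresh, secretChanged bool) bool {
	return restartOnRefresh && secretChanged
}

func main() {
	// Hypothetical flag name; off by default so existing manifests behave as before.
	restart := flag.Bool("restart-on-secret-refresh", false,
		"exit after the cert manager changes the secret so the pod restarts with it mounted")
	flag.Parse()

	// Stand-in for the result of the cert manager's refresh step.
	secretChanged := true

	if shouldExitAfterRefresh(*restart, secretChanged) {
		fmt.Println("secret changed; exiting so the restarted pod mounts it immediately")
		os.Exit(0)
	}
	fmt.Println("continuing without restart")
}
```

Keeping the behavior behind a flag matches the test notes above: with the default manifest nothing changes, and the faster startup only applies when the flag is set.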
@adrianludwin adrianludwin modified the milestones: hnc-backlog, hnc-v0.6 Jul 1, 2020
adrianludwin added a commit to adrianludwin/multi-tenancy that referenced this issue Jul 2, 2020
@adrianludwin adrianludwin modified the milestones: hnc-v0.6, hnc-v0.5 Jul 2, 2020