Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Wait for the certs to be mounted inside the container #2198

Merged
merged 3 commits into from
Aug 15, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
57 changes: 51 additions & 6 deletions pkg/certgenerator/v1beta1/generator.go
Original file line number Diff line number Diff line change
Expand Up @@ -26,12 +26,15 @@ import (
"errors"
"fmt"
"math/big"
"os"
"path/filepath"
"strings"
"time"

admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/util/wait"
"k8s.io/klog"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/manager"
Expand All @@ -53,11 +56,11 @@ type CertGenerator struct {
namespace string
webhookServiceName string
webhookSecretName string
fullServiceDomain string
kubeClient client.Client
certsReady chan struct{}

certs *certificates
fullServiceDomain string
certs *certificates
}

var _ manager.Runnable = &CertGenerator{}
Expand All @@ -67,11 +70,50 @@ func (c *CertGenerator) Start(ctx context.Context) error {
if err := c.generate(ctx); err != nil {
return err
}
klog.Info("Waiting for certs to get ready.")
if err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a requirement for a timeout? How does the user get to know that it is being waited for certs to be ready?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a requirement for a timeout?

Yes, timeout is 5 minutes.

How does the user get to know that it is being waited for certs to be ready?

The controller pod doesn't ready until the cert is ready.

NAME                                READY   STATUS              RESTARTS   AGE
katib-controller-687798c75b-mtlz7   0/1     Running             0          4s

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-08-13T11:48:12Z"
    message: 'containers with unready status: [katib-controller]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
controller@sha256:86697de94ac8dfbe53082beb96f1ea4ffea2bac982a2332719a99224e17335c7
    lastState: {}
    name: katib-controller
    ready: false
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-08-13T11:48:14Z"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant, how will the user know that Katib controller is waiting for certs to get ready?

Copy link
Member Author

@tenzen-y tenzen-y Aug 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The controller doesn't say, Waiting for certs to get ready. So, there are two ways that users can indirectly know whether the controller waits for certs to get ready:

  1. As shown above, confirmation pod condition.
  2. Creating the Experiments. If the controller waits for certs, users can get the following error:
Error from server (InternalError): error when creating "../../testdata/valid-experiment.yaml": Internal error occurred: failed calling webhook "defaulter.experiment.katib.kubeflow.org": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block

Adding a log, Waiting for certs to get ready, to L72 in this file might be better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a log is better because user can understand the reason behind the issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though we use cert-manager, the controller must wait for certs to be injected into the Secret resource.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnugeorge @tenzen-y Does it look confusing that we have this timeout in our generator package even if Katib Cert Generator doesn't used ?
Maybe we can add this timeout to the setupControllers section before we adding the Webhook to the manager ?

We can rename certsReady channel to certsGenerated to be more precise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think locating this timeout here would be better since this timeout is a function for the webhook certs.
However, I agree with your confusion. So renaming the certgenerator with certmanager or certcontroller might be better to clarify the purpose of this component.

@andreyvelich WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think certcontroller would be better since certmanager conflicts context with cert-manager :-/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we rename it to certcontroller how we can deal with Katib Config Settings ?
If user wants the enable cert generator they set the following setting:

  certGenerator:
    enable: true

Copy link
Member Author

@tenzen-y tenzen-y Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry. I will revert 69e9dfe since we can avoid the timeout error in the kubeflow installation by removing katib-webhook-cert Secret from kubeflow installation.

So, we wouldn't have to change the KatibConfig. WDYT?

Copy link
Member Author

@tenzen-y tenzen-y Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the certcontroller will start only in the standalone installation (certGenerator.enable=true).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, if we are going to remove this timeout from Kubeflow installation, we need to name it as certgenerator since it is only internal logic that we have when Katib certGenerator is enabled.
Let's keep it as certgenerator for now and use this timeout only when our internal Cert Generator is used.
Does it sound good @tenzen-y ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich That sounds good to me. Thanks for the great suggestion!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Duration: time.Second,
Factor: 2,
Jitter: 1,
Steps: 10,
Cap: time.Minute * 5,
}, ensureCertMounted(time.Now())); err != nil {
return err
}
// Sending an empty data to a certsReady means it starts to register controllers to the manager.
c.certsReady <- struct{}{}
return nil
}

// ensureCertMounted ensures that the generated certs are mounted inside the container.
func ensureCertMounted(start time.Time) func(context.Context) (bool, error) {
return func(ctx context.Context) (bool, error) {
now := time.Now()
outputLog := false
if now.Sub(start) >= 15*time.Second {
start = now
outputLog = true
}

certFile := filepath.Join(consts.CertDir, serverCertName)
if _, err := os.Stat(certFile); err != nil {
if outputLog {
klog.Infof("Public key file %q doesn't exist in the container yet", certFile)
}
return false, nil
}
keyFile := filepath.Join(consts.CertDir, serverKeyName)
if _, err := os.Stat(keyFile); err != nil {
if outputLog {
klog.Infof("Private key file %q doesn't exist in the container yet", keyFile)
}
return false, nil
}
klog.Info("Succeeded to be mounted certs inside the container.")
return true, nil
}
}

func (c *CertGenerator) NeedLeaderElection() bool {
return false
}
Expand All @@ -82,8 +124,13 @@ func AddToManager(mgr manager.Manager, config configv1beta1.CertGeneratorConfig,
namespace: consts.DefaultKatibNamespace,
webhookServiceName: config.WebhookServiceName,
webhookSecretName: config.WebhookSecretName,
kubeClient: mgr.GetClient(),
certsReady: certsReady,
fullServiceDomain: strings.Join([]string{
config.WebhookServiceName,
consts.DefaultKatibNamespace,
"svc",
}, "."),
kubeClient: mgr.GetClient(),
certsReady: certsReady,
})
}

Expand All @@ -99,8 +146,6 @@ func (c *CertGenerator) generate(ctx context.Context) error {
return fmt.Errorf("%w: %v", errCertCheckFail, err)
}
if !certExist {
c.fullServiceDomain = strings.Join([]string{c.webhookServiceName, c.namespace, "svc"}, ".")

if err = c.createCert(); err != nil {
return fmt.Errorf("%w: %v", errCreateCertFail, err)
}
Expand Down
62 changes: 62 additions & 0 deletions pkg/certgenerator/v1beta1/generator_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,8 +18,11 @@ package certgenerator

import (
"context"
"os"
"path/filepath"
"strings"
"testing"
"time"

"github.com/google/go-cmp/cmp"
"github.com/google/go-cmp/cmp/cmpopts"
Expand All @@ -31,6 +34,7 @@ import (
"sigs.k8s.io/controller-runtime/pkg/client/fake"

configv1beta1 "github.com/kubeflow/katib/pkg/apis/config/v1beta1"
"github.com/kubeflow/katib/pkg/controller.v1beta1/consts"
)

func TestGenerate(t *testing.T) {
Expand Down Expand Up @@ -210,3 +214,61 @@ func buildFakeClient(kubeResources []client.Object) client.Client {
}
return fakeClientBuilder.Build()
}

func TestEnsureCertMounted(t *testing.T) {
tests := map[string]struct {
keyExist bool
certExist bool
wantExist bool
}{
"key and cert exist": {
keyExist: true,
certExist: true,
wantExist: true,
},
"key doesn't exist": {
keyExist: false,
certExist: true,
wantExist: false,
},
"cert doesn't exist": {
keyExist: true,
certExist: false,
wantExist: false,
},
"all files doesn't exist": {
keyExist: false,
certExist: false,
wantExist: false,
},
}
for name, tc := range tests {
t.Run(name, func(t *testing.T) {
if tc.keyExist || tc.certExist {
if err := os.MkdirAll(consts.CertDir, 0760); err != nil {
t.Fatalf("Failed to set up directory: %v", err)
}
defer func() {
if err := os.RemoveAll(consts.CertDir); err != nil {
t.Fatalf("Failed to clean up directory: %v", err)
}
}()
}
if tc.keyExist {
if _, err := os.Create(filepath.Join(consts.CertDir, serverKeyName)); err != nil {
t.Fatalf("Failed to create tls.key: %v", err)
}
}
if tc.certExist {
if _, err := os.Create(filepath.Join(consts.CertDir, serverCertName)); err != nil {
t.Fatalf("Failed to create tls.crt: %v", err)
}
}
ensureFunc := ensureCertMounted(time.Now())
got, _ := ensureFunc(context.Background())
if tc.wantExist != got {
t.Errorf("Unexpected value from ensureCertMounted: \n(want: %v, got: %v)\n", tc.wantExist, got)
}
})
}
}