Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add healthz check for KMS Providers on kube-apiserver. #78540

Merged

Conversation

@immutableT
Copy link
Contributor

commented May 30, 2019

What type of PR is this?
/kind feature

What this PR does / why we need it:
When kube-apiserver starts before kms-plugin, kube-apiserver essentially is unable to serve secrets until kms-plugin is up. This inability to process secrets (just after being started) leads to large number of errors and eventually to a restart of kube-apiserver.
To avoid this untested scenario, KMS Provider will install a healthz check on kube-apiserver for the status of kms-plugin.
Therefore, (via a readiness probe) kube-apiserver will be protected from serving requests until kms-plugin becomes available. Thus allowing kube-apiserver properly handle requests for Secrets and Service Accounts.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

KMS Providers will install a healthz check for the status of kms-pluign in kube-apiservers' encryption config. 
@immutableT

This comment has been minimized.

Copy link
Contributor Author

commented May 30, 2019

/assign @awly

@fejta-bot

This comment has been minimized.

Copy link

commented May 30, 2019

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@immutableT

This comment has been minimized.

Copy link
Contributor Author

commented May 30, 2019

/test pull-kubernetes-bazel-build

@immutableT

This comment has been minimized.

Copy link
Contributor Author

commented May 30, 2019

/test pull-kubernetes-bazel-test

@@ -89,4 +89,13 @@ type KMSConfiguration struct {
// Timeout for gRPC calls to kms-plugin (ex. 5s). The default is 3 seconds.
// +optional
Timeout *metav1.Duration
// HealthzURL is the url on which kube-apiserver could check the health status of kms-plugin.
// +optional
HealthzURL string

This comment has been minimized.

Copy link
@awly

awly May 31, 2019

Contributor

Could healthz probes be passed over the plugin Endpoint above?

This comment has been minimized.

Copy link
@immutableT

immutableT May 31, 2019

Author Contributor

I assume that you meant to create a dedicated type for healthz probe. If I understood you correctly, the latest commit accomplishes this.

This comment has been minimized.

Copy link
@awly

awly May 31, 2019

Contributor

I meant adding a healthz RPC to the KMS plugin itself on the gRPC service.
So you could send those healthz requests over KMSConfiguration.Endpoint (the unix socket) instead of a separate new HTTP endpoint

This comment has been minimized.

Copy link
@immutableT

immutableT Jun 1, 2019

Author Contributor

I think this is a good idea, but my survey of the currently implemented plugins shows that they do not have this functionality yet.
This is something we could add later.

This comment has been minimized.

Copy link
@awly

awly Jun 3, 2019

Contributor

Do currently implemented plugins already expose a HTTP endpoint for healthz?
If not, you could put in a healthz grpc endpoint and just skip polling health if it returns "unimplemented" (or whatever error grpc will send).

This comment has been minimized.

Copy link
@immutableT

immutableT Jun 3, 2019

Author Contributor

Yes, Google CloudKMS Plugin and AWS CloudHSM plugins have implemented healthz. Hence, these plugins could take advantage of this feature right away.

This comment has been minimized.

Copy link
@awly

awly Jun 3, 2019

Contributor

Ack, that makes sense then

"os"
"time"

"k8s.io/klog"

This comment has been minimized.

Copy link
@awly

awly May 31, 2019

Contributor

remove empty line

@@ -309,6 +322,17 @@ func GetSecretboxPrefixTransformer(config *apiserverconfig.SecretboxConfiguratio
// getEnvelopePrefixTransformer returns a prefix transformer from the provided config.
// envelopeService is used as the root of trust.
func getEnvelopePrefixTransformer(config *apiserverconfig.KMSConfiguration, envelopeService envelope.Service, prefix string) (value.PrefixTransformer, error) {
if config.HealthzURL != "" {
klog.V(4).Infof("Validating healthz of KMS-Plugin via %s", config.HealthzURL)
ready, err := isKMSPluginReady(config)

This comment has been minimized.

Copy link
@awly

awly May 31, 2019

Contributor

So this only happens on kube-apiserver startup, right?
What if kms plugin becomes unhealthy later on? I assume it will just show up as errors from all Secrets operations?

This comment has been minimized.

Copy link
@immutableT

immutableT May 31, 2019

Author Contributor

Yes, this is only to ensure that kube-apiserver comes-up after kms-plugin is reporting healthz OK.
KMS-Plugin may have its own liveness probe.
And yes, kube-apiserver will log and increment failure metrics when the plugin is down.

@@ -318,3 +342,62 @@ func getEnvelopePrefixTransformer(config *apiserverconfig.KMSConfiguration, enve
Prefix: []byte(prefix + config.Name + ":"),
}, nil
}

func isKMSPluginReady(config *apiserverconfig.KMSConfiguration) (bool, error) {

This comment has been minimized.

Copy link
@awly

awly May 31, 2019

Contributor

Could this only return the error? If arg parsing fails or if healthz polling fails, return a non-nil error

This comment has been minimized.

Copy link
@immutableT

immutableT May 31, 2019

Author Contributor

Maybe, but I want to deffirentate between the scenarios

  1. Healthz could not be checked due to configuration errors - false, error
  2. Heathz returned non 200 status - false,nil
  3. Healthz return 200 - true,nil

I think this approach fits with the signature of the poll.Wait.

This comment has been minimized.

Copy link
@awly

awly May 31, 2019

Contributor

Hmm, but shouldn't you retry even if healthz endpoint is unreachable?
For example when KMS plugin starts later than kube-apiserver.
If yes, then there's no difference between non-200 status and any request errors

This comment has been minimized.

Copy link
@immutableT

immutableT Jun 1, 2019

Author Contributor

Let me step back.
wait.PollImmediate takes a condition function of the following signature:
type ConditionFunc func() (done bool, err error)
Therefore, getHelathz must match that signature.

I think the reason for that signature is that polling status has three potential states:

  1. true - we are done ,no need to retry further (kms-plugin OK)
  2. false - poll failed, keep retrying (kms-plugin is not up yet)
  3. error - got negative status, no need to poll anymore (ex. missing permissions on the key)

This comment has been minimized.

Copy link
@awly

awly Jun 3, 2019

Contributor

getHealthz doesn't have to conform to what wait.PollImmediate expects, you can always adapt it with a closure.

I understand the purpose of those separate return values, but my point is: getHealthz should never cause wait.PollImmediate to exit early with an error.
Currently you return non-nil error when response code is not 200. I think kube-apiserver should keep retrying in that case, let the plugin flip response code to 200 later on.
Then KMS plugin can say "i'm up, but not ready to serve requests yet" and kube-apiserver won't crashloop because of it.

This comment has been minimized.

Copy link
@immutableT

immutableT Jun 3, 2019

Author Contributor

ACK on the signature, you are correct.

Here a scenario:
kube-apiserver polls kms-plugin and the plugin responds with 500 because KMS returned "Key Destroyed Error" (or other non-retriable error). Assuming we use the logic that your propose, kube-apiserver will continue to retry for the duration of the Poll timeout which seems unnecessary to me.

So I see a couple of choices here:

  1. Retry until we get either true or poll timeout expires
  2. Differentiate between 500(return an error) and 503 (return false, nil)

WDYT?

This comment has been minimized.

Copy link
@awly

awly Jun 3, 2019

Contributor

Special-casing 500 and causing it to abort retries immediately sounds fine.
I'm not sure what that achieves though.
It'll cause kube-apiserver to crash and be restarted a few seconds later anyway.

Retries will continue indefinitely (indirectly through kube-apiserver restarts) in any case.

This comment has been minimized.

Copy link
@immutableT

immutableT Jun 4, 2019

Author Contributor

@awly PTAL - added a case for 500.

I agree that this may not make a lot of difference. However, I don't want to make any assumptions about how kube-apiserver will be restarted (i.e. its restart policy) or whether or not kube-apiserver should crash if kms-plugin is down.
My objective is to stop polling if KMS-Plugin returned a non-retryable error.

This comment has been minimized.

Copy link
@awly

awly Jun 5, 2019

Contributor

SGTM, as long as there's some way for KMS plugin implementors to discover this expectation from /healthz

}

func getHealthz(url string) (bool, error) {
var c = &http.Client{

This comment has been minimized.

Copy link
@awly

awly May 31, 2019

Contributor

c := &http.Client{...

Show resolved Hide resolved staging/src/k8s.io/apiserver/pkg/server/options/encryptionconfig/config.go Outdated
Show resolved Hide resolved .../src/k8s.io/apiserver/pkg/server/options/encryptionconfig/config_test.go Outdated
Show resolved Hide resolved .../src/k8s.io/apiserver/pkg/server/options/encryptionconfig/config_test.go Outdated
Show resolved Hide resolved .../src/k8s.io/apiserver/pkg/server/options/encryptionconfig/config_test.go Outdated
Show resolved Hide resolved .../src/k8s.io/apiserver/pkg/server/options/encryptionconfig/config_test.go Outdated
@immutableT

This comment has been minimized.

Copy link
Contributor Author

commented May 31, 2019

/assign @liggitt

@fedebongio

This comment has been minimized.

Copy link
Contributor

commented May 31, 2019

/assign @logicalhan

@immutableT

This comment has been minimized.

Copy link
Contributor Author

commented Jun 1, 2019

/test pull-kubernetes-dependencies

@immutableT immutableT force-pushed the immutableT:kms-plugin-healthz-check branch from 39975f8 to 7a0e4d2 Jun 1, 2019

@immutableT immutableT force-pushed the immutableT:kms-plugin-healthz-check branch from 7a0e4d2 to 16ad233 Jun 1, 2019

@immutableT immutableT force-pushed the immutableT:kms-plugin-healthz-check branch from c5ca068 to 73f1de8 Jun 21, 2019

@immutableT immutableT changed the title Allow KMS Provider to install a healthz check on kube-apiserver for the status of kms-plugin. Add healthz check for KMS Providers on kube-apiserver. Jun 21, 2019

@immutableT immutableT force-pushed the immutableT:kms-plugin-healthz-check branch 3 times, most recently from b37a5df to 0a01687 Jun 21, 2019

@immutableT

This comment has been minimized.

Copy link
Contributor Author

commented Jun 22, 2019

/test pull-kubernetes-e2e-gce-100-performance

}

h.lastResponse = &kmsPluginHealthzResponse{err: nil, received: time.Now()}
klog.V(4).Infof("Healthz check for KMS provider: %s succeed", h.name)

This comment has been minimized.

Copy link
@liggitt

liggitt Jun 28, 2019

Member

this seems more like v5, if at all

This comment has been minimized.

Copy link
@immutableT

immutableT Jul 1, 2019

Author Contributor

Removed.

p, err := h.Service.Encrypt([]byte("ping"))
if err != nil {
h.lastResponse = &kmsPluginHealthzResponse{err: err, received: time.Now()}
klog.Warningf("Healthz check for KMS provider %s failed, error: %v", h.name, err)

This comment has been minimized.

Copy link
@liggitt

liggitt Jun 28, 2019

Member

we should include the name in the returned error so it is surfaced in healthz details. if we do that, we don't need to separately log the error here. same comment applies to the decrypt call on line 136

This comment has been minimized.

Copy link
@immutableT

immutableT Jul 1, 2019

Author Contributor

Done, included provider's name into the error message.


for i, h := range healthzConfigs {
c.HealthzChecks = append(c.HealthzChecks, healthz.NamedCheck(fmt.Sprintf("kms-plugin-%d", i), func(r *http.Request) error {
return h.Check()

This comment has been minimized.

Copy link
@liggitt

liggitt Jun 28, 2019

Member

the h captured here by the closure is incorrect, it means if you have multiple checks, they will all test the last iterated config

This comment has been minimized.

Copy link
@liggitt

liggitt Jun 28, 2019

Member

add a test with at least two configs and verify the checks exercise the correct KMS

This comment has been minimized.

Copy link
@liggitt

liggitt Jun 28, 2019

Member

isn't the first KMS in the list used to encrypt, and subsequent ones to decrypt? would we ever encounter a case where a later KMS would only permit decrypting, so the encrypt("ping") check would fail?

This comment has been minimized.

Copy link
@immutableT

immutableT Jul 1, 2019

Author Contributor

Fixed the closure bug.
Added integration test for multiple providers' scenario (see, kms_treansformer_test.go TestHealthz.

Yes, theoretically, the scenario you described in the third comment is possible. KMS Administrator may remove "encrypt" IAM privilege from the service account under which the kms-plugin runs once the corresponding provider is moved into the second place - used only for decryption.
In practice, this is unlikely, but we should document this.
Also, there is no generic way to test decryption without providing a valid encryption payload.
I think the long term solution here is to move towards gRPC Health Checking Protocol
https://github.com/grpc/grpc/blob/master/doc/health-checking.md
KMS-Plugin developers may have vendor specific way to ascertain the status of the plugin based on the requested operation type. This is something I would like to add after this PR.

defer h.l.Unlock()

if (time.Now().Sub(h.lastResponse.received)) < kmsPluginHealthzTTL {
klog.V(4).Infof("Returning cached value:%v for kms-plugin's healthz response, which was received at %v", h.lastResponse.err, h.lastResponse.received)

This comment has been minimized.

Copy link
@liggitt

liggitt Jun 28, 2019

Member

I wouldn't expect a log here, or if we do, at v(5)

This comment has been minimized.

Copy link
@immutableT

immutableT Jul 1, 2019

Author Contributor

Removed.

klog.V(4).Infof("Returning cached value:%v for kms-plugin's healthz response, which was received at %v", h.lastResponse.err, h.lastResponse.received)
return h.lastResponse.err
}
klog.V(4).Info("TTL for kms-plugin's healthz expired, attempting a new check.")

This comment has been minimized.

Copy link
@liggitt

liggitt Jun 28, 2019

Member

no logging here?

This comment has been minimized.

Copy link
@immutableT

immutableT Jul 1, 2019

Author Contributor

Removed.

@k8s-ci-robot k8s-ci-robot added size/XL and removed size/L labels Jul 1, 2019

@immutableT immutableT force-pushed the immutableT:kms-plugin-healthz-check branch from fec4e6c to e033d95 Jul 1, 2019

@immutableT

This comment has been minimized.

Copy link
Contributor Author

commented Jul 1, 2019

/test pull-kubernetes-integration

@liggitt

This comment has been minimized.

Copy link
Member

commented Jul 2, 2019

looks relevant

E0702 00:03:19.695781  107386 grpc_service.go:71] failed to create connection to unix socket: @kms-provider-2.sock, error: dial unix @kms-provider-2.sock: connect: connection refused
W0702 00:03:19.695822  107386 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {@kms-provider-2.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix @kms-provider-2.sock: connect: connection refused". Reconnecting...
panic: Conflicting storage tracking

@immutableT immutableT force-pushed the immutableT:kms-plugin-healthz-check branch from e033d95 to d395fe1 Jul 2, 2019

@immutableT

This comment has been minimized.

Copy link
Contributor Author

commented Jul 2, 2019

/test pull-kubernetes-integration
/test pull-kubernetes-bazel-build

@immutableT

This comment has been minimized.

Copy link
Contributor Author

commented Jul 2, 2019

@liggitt Yes, it was relevant - I was missing a call to cleanup Test API Server - fixed.

@liggitt
Copy link
Member

left a comment

/approve

go ahead and squash commits, and fix a couple import and logging nits while you're at it, then lgtm

}

h.lastResponse = &kmsPluginHealthzResponse{err: nil, received: time.Now()}
klog.V(5).Infof("Healthz check for KMS provider: %s succeed", h.name)

This comment has been minimized.

Copy link
@liggitt

liggitt Jul 3, 2019

Member

do we log on other health check success? wondering when this would be useful

@@ -21,6 +21,8 @@ import (
"testing"
"time"

"k8s.io/apiserver/pkg/server"

This comment has been minimized.

Copy link
@liggitt

liggitt Jul 3, 2019

Member

group imports together

@@ -22,7 +22,10 @@ import (
"context"
"encoding/base64"
"fmt"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"

This comment has been minimized.

Copy link
@liggitt

liggitt Jul 3, 2019

Member

group google.golang.org imports together

@immutableT immutableT force-pushed the immutableT:kms-plugin-healthz-check branch from 073e4e5 to d9af661 Jul 3, 2019

@immutableT immutableT force-pushed the immutableT:kms-plugin-healthz-check branch from d9af661 to 05fdbb2 Jul 3, 2019

@liggitt

This comment has been minimized.

Copy link
Member

commented Jul 3, 2019

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Jul 3, 2019

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Jul 3, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: immutableT, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit a7cde2e into kubernetes:master Jul 3, 2019

23 checks passed

cla/linuxfoundation immutableT authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-node-e2e-containerd Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.