Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make bootstrap client cert loading part of rotation #69890

Merged
merged 2 commits into from
Nov 17, 2018

Conversation

smarterclayton
Copy link
Contributor

@smarterclayton smarterclayton commented Oct 16, 2018

Ensure that bootstrap+client-cert rotation in the Kubelet can:

  1. happen in the background so that static pods are launched immediately
  2. collapse down to a single CSR request codepath
  3. reorganize the code to allow future improvements to fetching bootstrap creds

Reorganize how the kubelet client config is determined. If rotation is
off, simplify the code path. If rotation is on, load the config
from disk, and then pass that into the cert manager. The cert manager
creates a client each time it tries to request a new cert.

Preserves existing behavior where:

  1. bootstrap kubeconfig is used if the current kubeconfig is invalid/expired
  2. we create the kubeconfig file based on the bootstrap kubeconfig, pointing to
    the location that new client certs will be placed
  3. the newest client cert is used once it has been loaded

reverted in #71173, original release note was:

When a kubelet is using --bootstrap-kubeconfig and certificate rotation, it no longer waits for bootstrap to succeed before launching static pods.
NONE

Fixes #68686

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. area/kubelet sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 16, 2018
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 16, 2018
@smarterclayton
Copy link
Contributor Author

/assign @awly

@mikedanese @liggitt as discussed, @aaronlevy will allow bootkube and other self-hosting mechanisms to rely on bootstrapped master nodes and use static pods.

@smarterclayton
Copy link
Contributor Author

I reverted #66056 in this - perhaps we could instead do a more aggressive dial timeout for bootstrap credentials (now that we always do both in the background)?

@smarterclayton
Copy link
Contributor Author

Testing this here and in openshift/origin#21274 so I have a better baseline across different environments

@smarterclayton smarterclayton force-pushed the bootstrap_retry branch 2 times, most recently from 47a2f3d to 3c493bc Compare October 17, 2018 19:52
@smarterclayton
Copy link
Contributor Author

smarterclayton commented Oct 17, 2018

There are four core scenarios to test:

  1. rotation on, bootstrap on
    1. no client certs, --kubeconfig file doesn't exist -> kubeconfig is created pointed at destination of client certs, cert is refreshed
    2. client certs exist, --kubeconfig file doesn't exist -> kubeconfig is created pointing at client certs, cert is not refreshed
    3. client certs exist, --kubeconfig exists, certs are valid -> cert is not refreshed until next interval
    4. client certs exist, --kubeconfig exists, certs are expired -> static pod starts and cert is refreshed
  2. rotation off, bootstrap on
    1. bootstrap cert is expired -> rebootstrap
    2. bootstrap file doesn't exist -> exit with error
    3. regular kubeconfig exists but is invalid -> exit with error
  3. rotation off, bootstrap off
    1. kubeconfig doesn't exist -> exit with error
    2. kubeconfig exists but is invalid/expired -> static pods start, errors are logged
  4. rotation on, bootstrap off

Copy link
Contributor

@awly awly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re #66056: it should be fine to remove if we add aggressive timeout and fast retries for initial bootstrap.

For context: it was needed on GCE because network programming can lag behind and bootstrapping would get trapped in a very long timeout, even after master IP becomes reachable from Node. cc @mikedanese

// XXX: When an external bootstrap source is available, it should be possible to always use that source
// to retrieve new credentials.
config := certConfig
if current != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, if certificate manager has an expired cert and fails to rotate it, Current() will return that expired cert.
Can we check that current is still valid here and fall back to certConfig if it's not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current() returns nil if the cert is expired.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or at least it should, that was the intended behavior a while back. The transport rotation loop checks for the transition from valid -> invalid and uses that as the "die if this persists"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I added the filtering to getCurrentCertificateOrBootstrap. There's very little point in the cert manager handing me back an expired cert - can you think of a case where that makes sense? If not, we may want to change the signature and verify callers get the right thing (callers should have to deal with nil in general). Alternatively, we make every caller filter for a valid cert.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think manager.Current() should always check expiration and return nil when cert is expired.

Specifically, scenario I'm concerned about is:

  1. kubelet has valid bootstrap creds
  2. kubelet bootstraps new client cert with short expiration
  3. manager.Current() starts returning this cert
  4. later, rotation using bootstrapped client cert fails
  5. even later, client cert expires
  6. rotation flow still gets it from manager.Current(), but fails trying to rotate because it's expired

I'd expect rotation to switch back to using bootstrap cert again when active client cert expires.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I agree. I'll make that change and add tests

return clientConfig, clientConfig, nil
}

store, err := certificate.NewFileStore("kubelet-client", certDir, certDir, "", "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only used to generate pemPath?
moveit down to where it's used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's high in the function to fail earlier on local config

// kubeconfigPath on disk is populated based on bootstrapPath but pointing to the location of the client cert
// in certDir. This preserves the historical behavior of bootstrapping where on subsequent restarts the
// most recent client cert is used to request new client certs instead of the initial token.
func LoadClientConfig(kubeconfigPath string, bootstrapPath string, certDir string) (certConfig, userConfig *restclient.Config, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bootstrapPath, certDir string

return &transportConfig, closeAllConns, nil
}

if len(s.BootstrapKubeconfig) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this get called before starting the cert managed above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is in the unmanaged path. In the managed path bootstrap is managed by the manager.

@fedebongio
Copy link
Contributor

/assign @caesarxuchao

@smarterclayton
Copy link
Contributor Author

/assign @awly

@smarterclayton
Copy link
Contributor Author

still looking at making the waitForServer path able to have a more aggressive timeout and dial.

@smarterclayton smarterclayton added the kind/bug Categorizes issue or PR as related to a bug. label Nov 10, 2018
@smarterclayton
Copy link
Contributor Author

Since the timeouts seemed prone to follow up, split all that out. This is just pulling bootstrap into rotation, and having m.Current() return nil when the cert is expired.

@smarterclayton smarterclayton force-pushed the bootstrap_retry branch 2 times, most recently from f6e246b to f008184 Compare November 16, 2018 19:07
@smarterclayton
Copy link
Contributor Author

/retest

@smarterclayton
Copy link
Contributor Author

@awly this is back to effectively the original PR you reviewed. Was your previous review sufficient or were you looking for more?

Copy link
Member

@mikedanese mikedanese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 on getting this in to CI and seeing how it goes.

/lgtm
/approve

cmd/kubelet/app/server.go Show resolved Hide resolved
cmd/kubelet/app/server.go Outdated Show resolved Hide resolved
cmd/kubelet/app/server.go Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 17, 2018
@mikedanese mikedanese added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 17, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mikedanese, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@liggitt
Copy link
Member

liggitt commented Nov 17, 2018

/hold
for the csiClient question

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 17, 2018
Ensure that bootstrap+clientcert-rotation in the Kubelet can:

1. happen in the background so that static pods aren't blocked by bootstrap
2. collapse down to a single call path for requesting a CSR
3. reorganize the code to allow future flexibility in retrieving bootstrap creds

Fetching the first certificate and later certificates when the kubelet
is using client rotation and bootstrapping should share the same code
path. We also want to start the Kubelet static pod loop before
bootstrapping completes. Finally, we want to take an incremental step
towards improving how the bootstrap credentials are loaded from disk
(potentially allowing for a CLI call to get credentials, or a remote
plugin that better integrates with cloud providers or KSMs).

Reorganize how the kubelet client config is determined. If rotation is
off, simplify the code path. If rotation is on, load the config
from disk, and then pass that into the cert manager. The cert manager
creates a client each time it tries to request a new cert.

Preserve existing behavior where:

1. bootstrap kubeconfig is used if the current kubeconfig is invalid/expired
2. we create the kubeconfig file based on the bootstrap kubeconfig, pointing to
   the location that new client certs will be placed
3. the newest client cert is used once it has been loaded
Expose both a Stop() method (for cleanup) and a method to force
cert rotation, but only expose Stop() on the interface.

Verify that we choose the correct client.
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 17, 2018
@smarterclayton
Copy link
Contributor Author

Comments addressed

@liggitt
Copy link
Member

liggitt commented Nov 17, 2018

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 17, 2018
@smarterclayton
Copy link
Contributor Author

/retest

@mikedanese mikedanese added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 17, 2018
@k8s-ci-robot k8s-ci-robot merged commit 3be3510 into kubernetes:master Nov 17, 2018
@k8s-ci-robot
Copy link
Contributor

@smarterclayton: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-integration de293b2 link /test pull-kubernetes-integration
pull-kubernetes-kubemark-e2e-gce-big de293b2 link /test pull-kubernetes-kubemark-e2e-gce-big

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@smarterclayton
Copy link
Contributor Author

This was reverted in #71173 due to breaking kubeadm's requirement that --kubeconfig be provided on masters as a high powered credential and cause a rotation of the cert. Restored in #71174 with coverage for kubeadm's flow.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Nov 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants