
Kubelet responds to ConfigMap mutations for dynamic Kubelet config #63221

Merged
merged 1 commit into kubernetes:master from mtaufen:dkcfg-live-configmap on May 22, 2018

Conversation

mtaufen (Contributor) commented Apr 27, 2018

This PR makes dynamic Kubelet config easier to reason about by leaving less room for silent skew scenarios. The new behavior is as follows:

  • ConfigMap does not exist: Kubelet reports error status due to missing source
  • ConfigMap is created: Kubelet starts using it
  • ConfigMap is updated: Kubelet respects the update (but we discourage this pattern, in favor of incrementally migrating to a new ConfigMap)
  • ConfigMap is deleted: Kubelet keeps using the config (non-disruptive), but reports error status due to missing source
  • ConfigMap is recreated: Kubelet respects any updates (but, again, we discourage this pattern)
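
For readers unfamiliar with the watch machinery, the sketch below illustrates these transitions with a client-go informer; it is a minimal illustration under assumed names, not the Kubelet's actual implementation:

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchAssignedConfigMap reacts to the ConfigMap lifecycle events listed
// above. Illustrative only; the Kubelet's real plumbing differs.
func watchAssignedConfigMap(client kubernetes.Interface, namespace, name string, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 0,
		informers.WithNamespace(namespace),
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			// Watch only the assigned ConfigMap.
			o.FieldSelector = "metadata.name=" + name
		}),
	)
	inf := factory.Core().V1().ConfigMaps().Informer()
	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// ConfigMap is created: start using it.
		AddFunc: func(obj interface{}) { /* checkpoint and apply */ },
		// ConfigMap is updated: respect the update (discouraged pattern).
		UpdateFunc: func(oldObj, newObj interface{}) { /* re-checkpoint and apply */ },
		// ConfigMap is deleted: keep the current config, report error status.
		DeleteFunc: func(obj interface{}) { /* set error condition */ },
	})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}
```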

This PR also makes a small change to the config checkpoint file tree structure, because ResourceVersion is now taken into account when saving checkpoints. The new structure is as follows:

- dir named by --dynamic-config-dir (root for managing dynamic config)
| - meta
  | - assigned (encoded kubeletconfig/v1beta1.SerializedNodeConfigSource object, indicating the assigned config)
  | - last-known-good (encoded kubeletconfig/v1beta1.SerializedNodeConfigSource object, indicating the last-known-good config)
| - checkpoints
  | - uid1 (dir for versions of object identified by uid1)
    | - resourceVersion1 (dir for unpacked files from resourceVersion1)
    | - ...
  | - ...
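
For illustration, a hypothetical helper (the function name and example path are mine, not the Kubelet's) showing how a checkpoint location falls out of this tree:

```go
package example

import "path/filepath"

// checkpointDir composes the on-disk location for one checkpointed version:
// <dynamic-config-dir>/checkpoints/<uid>/<resourceVersion>
func checkpointDir(dynamicConfigDir, uid, resourceVersion string) string {
	return filepath.Join(dynamicConfigDir, "checkpoints", uid, resourceVersion)
}
```

For example, checkpointDir("/var/lib/kubelet/dynamic-config", "uid1", "resourceVersion1") yields /var/lib/kubelet/dynamic-config/checkpoints/uid1/resourceVersion1.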

fixes: #61643

The dynamic Kubelet config feature will now update config in the event of a ConfigMap mutation, which reduces the chance for silent config skew. Only name, namespace, and kubeletConfigKey may now be set in Node.Spec.ConfigSource.ConfigMap. The least disruptive pattern for config management is still to create a new ConfigMap and incrementally roll out a new Node.Spec.ConfigSource.
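
For illustration, a minimal sketch of that recommended pattern using client-go (the namespace and ConfigMap name are hypothetical, and this targets the dynamic Kubelet config API as it existed around this release):

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rollOutNodeConfig points a node at a brand-new ConfigMap instead of
// editing the existing one in place. Only Name, Namespace, and
// KubeletConfigKey are set; UID and ResourceVersion are resolved downstream.
func rollOutNodeConfig(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.ConfigSource = &corev1.NodeConfigSource{
		ConfigMap: &corev1.ConfigMapNodeConfigSource{
			Namespace:        "kube-system",    // hypothetical
			Name:             "node-config-v2", // hypothetical new ConfigMap
			KubeletConfigKey: "kubelet",
		},
	}
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```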
```go
// NewRemoteConfigSource constructs a RemoteConfigSource from a v1/NodeConfigSource object.
// We assume the source has been validated beforehand - e.g. by an API server.
func NewRemoteConfigSource(source *apiv1.NodeConfigSource) RemoteConfigSource {
	return &remoteConfigMap{source}
}
```

mtaufen (Author) commented on the diff, Apr 27, 2018

Actually, we still need to check that not all known fields are nil, so that an old client talking to a new API server (one that allows a new config source type) fails cleanly, i.e. doesn't crash on a nil pointer dereference.
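
For illustration, a sketch of the check described here, building on the snippet above (the error-returning signature is my own; see the merged code for the real shape):

```go
// If every source subfield known to this client is nil, the API server may
// be newer and allow a source type we don't understand; fail cleanly rather
// than dereferencing a nil pointer later.
func NewRemoteConfigSource(source *apiv1.NodeConfigSource) (RemoteConfigSource, error) {
	if source.ConfigMap == nil {
		return nil, fmt.Errorf("no known config source subfield is set in NodeConfigSource")
	}
	return &remoteConfigMap{source}, nil
}
```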

mtaufen force-pushed the mtaufen:dkcfg-live-configmap branch from 87f5ade to 7ec2708 on Apr 30, 2018
mtaufen force-pushed the mtaufen:dkcfg-live-configmap branch from 7ec2708 to fd0b29e on May 1, 2018
mtaufen force-pushed the mtaufen:dkcfg-live-configmap branch from f642905 to 517b7d9 on May 21, 2018

liggitt (Member) commented May 21, 2018

just looking at the description, for the current and last-known-good configs, where is the resolved uid/resourceVersion recorded (to be able to look up the correct dir in checkpoints)?

is there any expectation that multiple objects will need to be pulled in/checkpointed in the future for a node config (e.g. a configmap and some set of secrets)? should that inform our config/checkpoint disk structure?

mtaufen (Author) commented May 21, 2018

> just looking at the description, for the current and last-known-good configs, where is the resolved uid/resourceVersion recorded (to be able to look up the correct dir in checkpoints)?

@liggitt in the SerializedNodeConfigSources written to --dynamic-config-dir/meta/assigned and --dynamic-config-dir/meta/last-known-good.

> is there any expectation that multiple objects will need to be pulled in/checkpointed in the future for a node config (e.g. a configmap and some set of secrets)? should that inform our config/checkpoint disk structure?

Not yet. If we added this, I'd expect NodeConfigSource to be extended to represent a request for multiple objects (which would transitively ensure --dynamic-config-dir/meta can record the same), and since UID and ResourceVersion are consistent concepts across all objects, the structure of --dynamic-config-dir/checkpoints should work for additional object types.
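
For reference, a sketch of that wrapper type, approximating kubeletconfig/v1beta1 (consult the actual API types for the authoritative definition):

```go
// SerializedNodeConfigSource wraps the fully resolved source so the Kubelet
// can record it at meta/assigned and meta/last-known-good. The embedded
// UID and ResourceVersion identify the checkpoint dir under checkpoints/.
type SerializedNodeConfigSource struct {
	metav1.TypeMeta `json:",inline"`
	Source apiv1.NodeConfigSource `json:"source,omitempty"`
}
```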

liggitt (Member) commented May 21, 2018

> @liggitt in the SerializedNodeConfigSources written to --dynamic-config-dir/meta/assigned and --dynamic-config-dir/meta/last-known-good.

ah, so those are enhanced versions of what is in the Node API object

> Not yet. If we added this, I'd expect NodeConfigSource to be extended to represent a request for multiple objects (which would transitively ensure --dynamic-config-dir/meta can record the same), and since UID and ResourceVersion are consistent concepts across all objects, the structure of --dynamic-config-dir/checkpoints should work for additional object types.

That makes sense

mtaufen force-pushed the mtaufen:dkcfg-live-configmap branch from 517b7d9 to b5648c3 on May 21, 2018

```diff
@@ -142,6 +145,8 @@ func (s *nodeConfigStatus) Sync(client clientset.Interface, nodeName string) {
 	return
```

dashpole (Contributor) commented on the diff, May 21, 2018

I realize this isn't part of this PR, but is there a reason you chose to use a periodic, jittered check rather than just a long-running loop?
```go
for range s.syncCh {
	utillog.Infof("updating Node.Status.Config")
	// ...
}
```
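
For contrast, a minimal sketch of the periodic, jittered pattern, reusing the identifiers from the snippet above; wait.JitterUntil is from k8s.io/apimachinery/pkg/util/wait, and the period and jitter factor here are made up:

```go
// Run the sync check every period (scaled by up to the jitter factor),
// doing work only when something is pending on syncCh.
wait.JitterUntil(func() {
	select {
	case <-s.syncCh:
		utillog.Infof("updating Node.Status.Config")
		// ... push the status update ...
	default:
		// nothing pending this period
	}
}, 10*time.Second, 0.5, true, wait.NeverStop)
```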

dashpole (Contributor) commented May 21, 2018

/lgtm
comment above doesn't need to be dealt with in this PR.
Great work, especially on the test coverage for this.

k8s-ci-robot added the lgtm label on May 21, 2018

mtaufen (Author) commented May 21, 2018

/retest

dchen1107 (Member) commented May 21, 2018

/approve

Approving the PR based on offline discussions with @mtaufen on the high-level design and @dashpole's detailed review.

k8s-ci-robot commented May 21, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dashpole, dchen1107, mtaufen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot pushed a commit to kubernetes/test-infra that referenced this pull request May 21, 2018

increase ci-kubernetes-node-kubelet-serial timeout
https://k8s-testgrid.appspot.com/sig-node-kubelet#kubelet-serial-gce-e2e
is currently failing due to timeouts. I have some new tests in
kubernetes/kubernetes#63221 which will further
increase the test runtime. This PR should fix the current failures and
give us a margin of safety for the new tests.

k8s-github-robot (Contributor) commented May 22, 2018

Automatic merge from submit-queue (batch tested with PRs 63881, 64046, 63409, 63402, 63221). If you want to cherry-pick this change to another branch, please follow the instructions here.

k8s-github-robot merged commit 2a989c6 into kubernetes:master on May 22, 2018

18 checks passed

  • Submit Queue: Queued to run github e2e tests a second time.
  • cla/linuxfoundation: mtaufen authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-cross: Skipped
  • pull-kubernetes-e2e-gce: Job succeeded.
  • pull-kubernetes-e2e-gce-100-performance: Job succeeded.
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
  • pull-kubernetes-e2e-gke: Skipped
  • pull-kubernetes-e2e-kops-aws: Job succeeded.
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
  • pull-kubernetes-local-e2e: Skipped
  • pull-kubernetes-local-e2e-containerized: Job succeeded.
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
  • pull-kubernetes-verify: Job succeeded.
dashpole (Contributor) commented May 22, 2018

I think we may need to update the node e2e framework. Tests using dynamic kubelet config are failing with the error: Forbidden: uid must not be set in spec

mtaufen (Author) commented May 23, 2018

That's surprising, bad merge?

mtaufen (Author) commented May 23, 2018

Ah, I misread: you said tests using dynamic config, not the dynamic config tests.

mtaufen (Author) commented May 23, 2018

Yup, probably have to do that.

mtaufen added a commit to mtaufen/kubernetes that referenced this pull request May 30, 2018

remove unused status per TODO
This should have been deleted in kubernetes#63221, as it is now unused.

jingax10 pushed a commit to jingax10/kubernetes that referenced this pull request May 31, 2018

Kubernetes Submit Queue
Merge pull request kubernetes#64486 from mtaufen/cleanup-unused-status-message

Automatic merge from submit-queue (batch tested with PRs 64338, 64219, 64486, 64495, 64347). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

remove unused status per TODO

This should have been deleted in kubernetes#63221, as it is now unused.

```release-note
NONE
```

wenjiaswe added a commit to wenjiaswe/kubernetes that referenced this pull request Jun 19, 2018

remove unused status per TODO
This should have been deleted in kubernetes#63221, as it is now unused.