Checkpoint and activate itself. #366

yifan-gu · 2017-03-10T02:14:26Z

This PR enables the checkpointer to GC itself using the existing codepath.
It also removes the need for the checkpoint-installer, which
in turn lets us update the checkpointer by just updating the
checkpointer's manifest.

yifan-gu · 2017-03-11T03:24:56Z

Fix #253
Fix #206

Ready for a review.

I did several tests manually:

Boot up 2 nodes.
Schedule checkpointer on both.
Schedule a test pod (checkpoints enabled) on the worker node.
See 4 checkpointers, 2 running, 2 standby.
Suspend the master, reboot the worker node.
The standby checkpointer starts, which then starts checkpoints.
Resume the master, see 4 checkpointers, 2 running, 2 standby, and the static checkpointer now is the effective one on the worker node.
Checkpointer stops the test pod's checkpoints after the test pod is running.

Boot up 2 nodes.
Schedule checkpointer on both.
Schedule a test pod (checkpoints enabled) on the worker node.
See 4 checkpointers, 2 running, 2 standby.
Suspend the worker, delete the test pod, recreate the checkpointer daemonset so it only schedules on the master node.
Suspend the master.
Resume/reboot the worker.
The standby checkpointer starts, which then starts checkpoints.
Resume the master.
The standby checkpointer cleans the checkpoints and itself.

/cc @aaronlevy @stuart-warren

aaronlevy

Couple minor comments, but this is looking awesome

aaronlevy · 2017-03-13T18:10:28Z

cmd/checkpoint/main.go

@@ -295,6 +351,11 @@ func writeCheckpointManifest(pod *v1.Pod) error {
 	return writeAndAtomicRename(path, b, 0644)
 }

+// isPodCheckpointer returns true if the pod is the checkpointer itself.
+func isPodCheckpointer(pod *v1.Pod) bool {
+	return strings.HasPrefix(pod.Name, "pod-checkpointer")


This seems a bit fragile. Is there another way we could determine this? We can know our "self" via the downward api + inject as env vars - maybe that's an option?

@aaronlevy Which downward API? Getting the podname?

Yeah. We could probably check that maybe:

pod.Namespace == self.Namespace
pod.Name == self.Name + "-" + self.NodeName // Because we care about the static pod, right?

Hmm actually, it won't have the node-name suffix in the checkpoint manifest - so maybe just pod.Name == self.Name if we care about the parent pod

or, if we care about the checkpointed copy, something like:

pod.Name == strings.TrimSuffix(self.Name, self.NodeName) && pod.Name != self.Name
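The comparisons suggested in this thread can be sketched roughly as follows. This is only an illustration, not the code that was merged: the `selfInfo` type, the helper names, and the environment variable names `POD_NAME`, `POD_NAMESPACE` and `NODE_NAME` are all assumptions, and the pod is represented by plain strings to keep the example free of Kubernetes imports. Note that the sketch trims `"-" + NodeName` rather than just `NodeName`, so the separator hyphen does not stay behind in the trimmed name.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// selfInfo holds the checkpointer's own identity. The type and its field
// names are hypothetical; the values are assumed to be injected as
// environment variables through the downward API.
type selfInfo struct {
	Name      string
	Namespace string
	NodeName  string
}

// selfFromEnv reads the identity from the environment. The variable names
// are assumptions chosen for this sketch.
func selfFromEnv() selfInfo {
	return selfInfo{
		Name:      os.Getenv("POD_NAME"),
		Namespace: os.Getenv("POD_NAMESPACE"),
		NodeName:  os.Getenv("NODE_NAME"),
	}
}

// matchesSelf reports whether the given pod is the running checkpointer itself.
func matchesSelf(self selfInfo, podNamespace, podName string) bool {
	return podNamespace == self.Namespace && podName == self.Name
}

// matchesParentOfSelf reports whether the given pod is the parent of the
// running checkpointer, assuming the running one is a static copy whose name
// carries a "-<nodeName>" suffix. If self carries no such suffix, the trimmed
// name equals self.Name and the function returns false.
func matchesParentOfSelf(self selfInfo, podNamespace, podName string) bool {
	parent := strings.TrimSuffix(self.Name, "-"+self.NodeName)
	return podNamespace == self.Namespace && podName == parent && podName != self.Name
}

func main() {
	self := selfFromEnv()
	fmt.Println("is self:", matchesSelf(self, "kube-system", "pod-checkpointer"))
	fmt.Println("is parent of self:", matchesParentOfSelf(self, "kube-system", "pod-checkpointer"))
}
```

Compared with the `strings.HasPrefix(pod.Name, "pod-checkpointer")` check in the hunk above, this kind of comparison does not depend on the checkpointer keeping a particular name, at the cost of requiring the extra environment variables in the manifest.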

aaronlevy · 2017-03-13T18:14:15Z

cmd/checkpoint/main.go

+		if err := os.Remove(p); err != nil && !os.IsNotExist(err) {
+			glog.Errorf("Failed to remove active checkpoint %s: %v", p, err)
+			continue
+		}


I'm slightly worried about swapping the order of removal here. My thought being that we should stop the pods first, then remove the assets they rely on. It might not be an issue - because we weren't waiting for the pods to actually stop - but do you foresee issues in removing a secret / configMap, for example, from underneath a running pod?

I tested this, it actually successfully removed without an issue. But I want to just double check and understand why it's not an issue and report back.

Alternatively, we can keep the order and just leave the configmap/secret for the checkpointer untouched. The next time we create the configmap/secret, we always overwrite the existing one. This seems safer and doesn't cause any issue except for two small leftover files. WDYT?

cc @aaronlevy ^

So actually, removing the config or secret will always succeed; it's not an unmounting operation.
But maybe that's not graceful to pods, as they might have prestop hooks or some other graceful termination process that requires the secrets/configs.
I will revert this ordering change.

Had more discussion on this. For now we are not waiting for the pod to be deleted anyway, so maybe it's fine to leave it as is and improve this later.
Adding a TODO to capture that.

aaronlevy · 2017-03-13T18:15:23Z

cmd/checkpoint/main_test.go

@@ -104,6 +104,22 @@ func TestProcess(t *testing.T) {
 			activeCheckpoints: map[string]*v1.Pod{"AA": {}},
 			localParents:      map[string]*v1.Pod{"AA": {}},
 		},
+		{
+			desc:                "Inactive pod-checkpointer, local parent, local running, api parent: should start",


May be worth adding another test to make sure it still starts even with a local parent (the checkpointed copy always needs to run)

yifan-gu · 2017-03-14T03:54:30Z

Comments addressed, PTAL @aaronlevy

aaronlevy · 2017-03-14T22:17:24Z

This LGTM - but going to hold off on merging this to coordinate these changes at the same time as k8s v1.6.0

yifan-gu · 2017-03-16T19:21:50Z

Added one more commit for cleaning up all checkpoints when the pod checkpointer is removed.
Will squash if this passes review @aaronlevy

aaronlevy

lgtm

k8s-ci-robot added the cncf-cla: yes label Mar 10, 2017

yifan-gu force-pushed the gc_checkpointer branch from ffeba83 to 6d60cae Compare March 10, 2017 02:17

yifan-gu changed the title ~~Checkpoint and activate itself.~~ WIP (don't review): Checkpoint and activate itself. Mar 10, 2017

yifan-gu force-pushed the gc_checkpointer branch 5 times, most recently from e05ca93 to dc12082 Compare March 11, 2017 00:58

yifan-gu changed the title ~~WIP (don't review): Checkpoint and activate itself.~~ Checkpoint and activate itself. Mar 11, 2017

yifan-gu force-pushed the gc_checkpointer branch from dc12082 to 6a695df Compare March 11, 2017 03:12

aaronlevy reviewed Mar 13, 2017

yifan-gu force-pushed the gc_checkpointer branch 2 times, most recently from 22d4c2e to e077b08 Compare March 14, 2017 02:49

cmd/checkpoint: Checkpoint and activate itself.

38c4b6e

This enables the checkpointer to GC itself using the existing codepath. It also removes the need for the checkpoint-installer, which in turn lets us update the checkpointer by just updating the checkpointer's manifest.

yifan-gu force-pushed the gc_checkpointer branch from e077b08 to f0631b5 Compare March 14, 2017 03:14

*.*: Update README and image building utilities.

fc04e43

Remove pod-checkpoint-installer.

aaronlevy added this to the v0.4.0 milestone Mar 14, 2017

aaronlevy mentioned this pull request Mar 15, 2017

checkpoint installer should source manifests from api-object #206

Closed

yifan-gu force-pushed the gc_checkpointer branch from f0631b5 to 247f62f Compare March 16, 2017 19:21

yifan-gu force-pushed the gc_checkpointer branch from 247f62f to 8fe3298 Compare March 16, 2017 20:14

yifan-gu mentioned this pull request Mar 16, 2017

pluton/tests/bootkube: Add checkpointer tests. coreos/mantle#513

Merged

Remove all checkpoints when the pod checkpointer is unscheduled.

c4ec0bb

yifan-gu force-pushed the gc_checkpointer branch from 8fe3298 to c4ec0bb Compare March 16, 2017 22:25

aaronlevy approved these changes Mar 29, 2017

yifan-gu merged commit a936ee1 into kubernetes-retired:master Mar 30, 2017

yifan-gu deleted the gc_checkpointer branch March 30, 2017 00:23

aaronlevy mentioned this pull request Apr 3, 2017

checkpointer: should GC itself if installer no longer scheduled #253

Closed
