-
Notifications
You must be signed in to change notification settings - Fork 224
Conversation
ffeba83
to
6d60cae
Compare
e05ca93
to
dc12082
Compare
dc12082
to
6a695df
Compare
Ready for a review. I did several tests manually:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple minor comments, but this is looking awesome
cmd/checkpoint/main.go
Outdated
@@ -295,6 +351,11 @@ func writeCheckpointManifest(pod *v1.Pod) error { | |||
return writeAndAtomicRename(path, b, 0644) | |||
} | |||
|
|||
// isPodCheckpointer returns true if the pod is the checkpointer itself. | |||
func isPodCheckpointer(pod *v1.Pod) bool { | |||
return strings.HasPrefix(pod.Name, "pod-checkpointer") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit fragile. Is there another way we could determine this? We can know our "self" via the downward api + inject as env vars - maybe that's an option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aaronlevy Which downward API? Getting the podname?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. We could probably check that maybe:
pod.Name == self.Namespace
pod.Name == self.Name + "-" + self.NodeName // Because we care about the static pod, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm actually, it won't have the node-name suffix in the checkpoint manifest - so maybe just pod.Name == self.Name
if we care about the parent pod
or, if we care about the checkpointed copy, something like:
pod.Name == strings.TrimSuffix(self.Name, self.NodeName) && pod.Name != self.Name
if err := os.Remove(p); err != nil && !os.IsNotExist(err) { | ||
glog.Errorf("Failed to remove active checkpoint %s: %v", p, err) | ||
continue | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a slightly worried about swapping the order of removal here. My thought being that we should stop the pods first, then remove the assets they rely on. It might not be an issue - because we weren't waiting for the pods to actually stop - but do you forsee issues in removing a secret / configMap for example from underneath a running pod?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this, it actually successfully removed without an issue. But I want to just double check and understand why it's not an issue and report back.
Alternatively, we can keep the order. and just leave the configmap/scret for the checkpointer untouched. And next time we create the configmap/scret again, we always overwrite the existing one. This seems safer and doesn't cause any issue except for two small files. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @aaronlevy ^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So actually removing the config, secret will always succeed, it's not a unmounting operaton.
But maybe that's not graceful to pods as they might have some prestop hooks or whatever graceful termination process that will require the secrets/configs.
I will revert this ordering change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had more discussion on this. As for now we are not waiting the pod to be deleted anyway, so maybe it's fine to leave as is, and improve this later.
Adding a TODO to capture that.
cmd/checkpoint/main_test.go
Outdated
@@ -104,6 +104,22 @@ func TestProcess(t *testing.T) { | |||
activeCheckpoints: map[string]*v1.Pod{"AA": {}}, | |||
localParents: map[string]*v1.Pod{"AA": {}}, | |||
}, | |||
{ | |||
desc: "Inactive pod-checkpointer, local parent, local running, api parent: should start", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be worth adding another test to make sure it still starts even with a local parent (the checkpointed copy always needs to run)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
22d4c2e
to
e077b08
Compare
This enables us to GC itself using the existing codepath. Also it removes the needs for the checkpoint-installer, which also enables us to update the checkpointer by just updating the checkpointer's manifest.
e077b08
to
f0631b5
Compare
Remove pod-checkpoint-installer.
Comments addressed, PTAL @aaronlevy |
This LGTM - but going to hold off on merging this to coordinate these changes at same time as k8s v1.6.0 |
f0631b5
to
247f62f
Compare
Add one more commit for cleaning up all checkpoints when the pod checkpointer is removed. |
247f62f
to
8fe3298
Compare
8fe3298
to
c4ec0bb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
This PR enables us to GC itself using the existing codepath.
Also it removes the needs for the checkpoint-installer, which
also enables us to update the checkpointer by just updating the
checkpointer's manifest.