Pod migration #3949

Open
bgrant0607 opened this Issue Jan 29, 2015 · 3 comments

@bgrant0607
Member

Filing this issue for discussion and tracking, since it has come up a number of times.

Starting with background:

Pods are scheduled, started, and eventually terminate. They are replaced with new pods by a replication controller (or some other controller, once we add more controllers). That's both the reality and the model. Today pods are replaced reactively, but eventually they will be replaced proactively for planned moves. We currently do not preempt pods in order to schedule other pods, and likely won't for some time.

Currently, new pods have no obvious relationship to the pods they replace. They have different names, different uids, different IP addresses, different hostnames (since we set the pod hostname to pod name), and newly initialized volumes.

Replication controllers themselves are not durable objects. They are tied to deployments. New deployments create new replication controllers. This simplifies sophisticated deployment and rollout strategies without making simple scenarios complex. Both rollout tools/components and auto-scaling will deal with groups of replication controllers.

Naming/discovery is addressed using services, DNS, and the Endpoints API. The evolution of these mechanisms is being discussed in #2585.

This is a flexible model that facilitates transparency, high availability, and dynamic deployment and scaling, and simplifies handling of inevitable distributed-systems scenarios.

But the model is not without issues. The main ones are:

  1. Data durability
  2. Self-discovery
  3. Work/role assignment

Data durability is being discussed in the persistent storage proposal #3318. We will also need to address it for local storage, but local storage is less relevant to "migration" anyway, since local data can't feasibly be migrated. For remote storage, it will be possible to detach the devices and reattach them to new pods/hosts.

Self-discovery: Pods know their IP addresses, but currently do not know the names or IPs of the services targeting them. This will be solved by the service redesign #2585 and the downward API #386.

Work/role assignment: We encourage dynamic role assignment: master election, fine-grained locking, sharding, task queues, pub/sub, etc. That said, some servers are "pet-like", particularly those requiring large amounts of persistent storage. Many of these are replicated and/or sharded, with application-specific clustering implementations that tie together names/addresses and persistent data. We've discussed a concept tentatively called "nominal services" #260 to stably assign names and IP addresses to individual pods, and we aim to address that in the service redesign #2585.

So, do we need "pod migration", and, if so, what should it mean? I think it minimally should mean that the replacement pod has the same hostname, IP address, and storage.

We should aim to minimize disruption for high-availability servers. What could we do, besides the things planned above?

  • Don't use the pod name as the hostname. Associate a name with the IP address instead (e.g., by hashing the address; see the sketch after this list). The names of pods created by replication controllers aren't currently predictable, so this wouldn't be a regression.
  • Migrate the pod IP address. Currently pod IP addresses are statically partitioned among hosts and are not migratable. This would likely be problematic on some cloud providers with the way we're currently configuring routing, but could be done with an overlay network.
  • Lifecycle hooks pre- and post-migration.
  • Actual live state transfer via CRIU (http://criu.org/Docker).
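
To make the first bullet concrete, here is a minimal sketch of deriving a stable hostname by hashing the pod IP. The "pod-" prefix, hash function, and truncation length are illustrative assumptions, not a proposed scheme:

```go
// Sketch: derive a stable hostname from a pod's IP address rather than from
// the pod name, so any pod holding that address (including a replacement
// created during a migration) gets the same hostname.
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// hostnameForIP hashes the pod IP and truncates the digest to keep the name short.
func hostnameForIP(podIP string) string {
	sum := sha1.Sum([]byte(podIP))
	return "pod-" + hex.EncodeToString(sum[:])[:12]
}

func main() {
	fmt.Println(hostnameForIP("10.244.1.17")) // same IP -> same hostname
}
```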

With respect to Kubernetes objects, the "migrated" pod would still be a new pod with a new name and uid. The orchestration of the migration would be performed by a controller -- possibly an enhanced replication controller, perhaps in collaboration with a network controller to move the address, similar to the separation of concerns in the persistent storage proposal. During the migration process, the old and new pods would coexist, and that coexistence would be visible to clients of the Kubernetes API, but the application being migrated and its clients would not need to be aware of the migration.
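
As a rough sketch of that ordering only (not an actual controller), with invented PodAPI and NetworkAPI interfaces standing in for whatever the real pod and network controllers would expose:

```go
// Sketch only: the interfaces below are hypothetical stand-ins.
package migration

import "fmt"

type PodAPI interface {
	// CreateReplacement creates a new pod (new name, new uid) to take over
	// from oldName and returns the new pod's name.
	CreateReplacement(oldName string) (newName string, err error)
	Delete(name string) error
}

type NetworkAPI interface {
	// MoveIP reassigns the old pod's IP address to the new pod; having a
	// separate network controller own this mirrors the separation of
	// concerns in the persistent storage proposal (#3318).
	MoveIP(fromPod, toPod string) error
}

// Migrate creates the replacement first, then moves the address, then deletes
// the original, so the old and new pods briefly coexist as API objects while
// the migrated application keeps the same IP.
func Migrate(pods PodAPI, net NetworkAPI, oldName string) error {
	newName, err := pods.CreateReplacement(oldName)
	if err != nil {
		return fmt.Errorf("create replacement for %s: %v", oldName, err)
	}
	if err := net.MoveIP(oldName, newName); err != nil {
		return fmt.Errorf("move IP from %s to %s: %v", oldName, newName, err)
	}
	return pods.Delete(oldName)
}
```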

/cc @smarterclayton @thockin @alex-mohr

@thockin
Member
thockin commented Jan 30, 2015

Fixing the hostname seems like an obvious change. We should consider where else that information might pop up. In a different issue we are discussing allowing users to expose their own pod UID and name as custom env vars. This would still be safe as long as we don't do live migration. We solved this internally by allocating a virtual UID (VUID) at pod creation time which travels across migrations (live or otherwise), but is allocated as a UUID. When a migrating controller knows that a new pod is a migration, it sets the VUID of the new pod.

I don't know if we will ever really get to pervasive live migration, but I hope so.
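
A hypothetical illustration of the VUID idea described above; the PodSpec struct and newUUID helper are stand-ins, not Kubernetes API types:

```go
// Sketch: a VUID is minted once when a pod is first created and copied
// forward by whatever controller performs the migration.
package main

import (
	"crypto/rand"
	"fmt"
)

type PodSpec struct {
	Name string
	UID  string // system-assigned, changes on every replacement
	VUID string // virtual UID, stable across migrations
}

// newUUID returns a random 128-bit identifier formatted like a UUID.
func newUUID() string {
	b := make([]byte, 16)
	rand.Read(b)
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

// replacementFor builds the spec for a migrated pod: fresh name and UID,
// but the VUID of the pod being replaced is carried over.
func replacementFor(old PodSpec, newName string) PodSpec {
	return PodSpec{Name: newName, UID: newUUID(), VUID: old.VUID}
}

func main() {
	orig := PodSpec{Name: "web-abc12", UID: newUUID(), VUID: newUUID()}
	moved := replacementFor(orig, "web-def34")
	fmt.Println(orig.VUID == moved.VUID) // true: identity survives the move
}
```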

@bgrant0607
Member

I'd make it a requirement that anything that uses the Kubernetes API to introspect or manage its own pods would need to be migration-aware. It could use the post-migration hook to get the pod's new name.

Why would someone want the uid? Which issue is that? (Other than #386.)

@smarterclayton
Contributor

They want the uid as a unique instance identifier. We could give them anything, but self-registration into an external system is one way ("I'm pod foo, serving X, here's my unique identifier").
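
A minimal sketch of that self-registration pattern; the registry URL, payload shape, and POD_* environment variables are assumptions for illustration:

```go
// Sketch: on startup a pod announces itself to a hypothetical external
// registry using a stable instance identifier.
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"os"
)

func main() {
	record := map[string]string{
		"pod":      os.Getenv("POD_NAME"),
		"service":  "X",
		"instance": os.Getenv("POD_UID"), // the unique identifier in question
	}
	body, err := json.Marshal(record)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("http://registry.example.com/instances",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```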
