Pod migration #3949

Open
bgrant0607 opened this Issue Jan 29, 2015 · 3 comments

@bgrant0607
Member

Filing this issue for discussion and tracking, since it has come up a number of times.

Starting with background:

Pods are scheduled, started, and eventually terminate. They are replaced with new pods by a replication controller (or some other controller, once we add more controllers). That's both the reality and the model. Today pods are replaced reactively, but eventually they will be replaced proactively for planned moves. We currently do not preempt pods in order to schedule other pods, and likely won't for some time.

Currently, new pods have no obvious relationship to the pods they replace. They have different names, different uids, different IP addresses, different hostnames (since we set the pod hostname to pod name), and newly initialized volumes.

Replication controllers themselves are not durable objects. They are tied to deployments. New deployments create new replication controllers. This simplifies sophisticated deployment and rollout strategies without making simple scenarios complex. Both rollout tools/components and auto-scaling will deal with groups of replication controllers.

Naming/discovery is addressed using services, DNS, and the Endpoints API. The evolution of these mechanisms is being discussed in #2585.

This is a flexible model that facilitates transparency, high availability, and dynamic deployment and scaling, and simplifies handling of inevitable distributed-systems scenarios.

But the model is not without issues. The main ones are:

  1. Data durability
  2. Self-discovery
  3. Work/role assignment

Data durability is being discussed in the persistent storage proposal #3318. We will also need to address it for local storage, but local storage is less relevant to "migration" anyway, since local data can't feasibly be migrated. For remote storage, it will be possible to detach the devices and reattach them to new pods/hosts.

Self-discovery: Pods know their IP addresses, but currently do not know the names or IPs of the services targeting them. This will be solved by the service redesign #2585 and the downward API #386.

Work/role assignment: We encourage dynamic role assignment: master election, fine-grained locking, sharding, task queues, pub/sub, etc. That said, some servers are "pet-like", particularly those requiring large amounts of persistent storage. Many of these are replicated and/or sharded, with application-specific clustering implementations that tie together names/addresses and persistent data. We've discussed a concept tentatively called "nominal services" #260 to stably assign names and IP addresses to individual pods, and we aim to address that in the service redesign #2585.

So, do we need "pod migration", and, if so, what should it mean? I think it minimally should mean that the replacement pod has the same hostname, IP address, and storage.

We should aim to minimize disruption for high-availability servers. What could we do, besides the things planned above?

  • Don't use the pod name as the hostname. Associate a name with the IP address instead (e.g., by hashing the address; see the sketch after this list). The names of pods created by replication controllers aren't currently predictable, so this wouldn't be a regression.
  • Migrate the pod IP address. Currently pod IP addresses are statically partitioned among hosts and are not migratable. This would likely be problematic on some cloud providers with the way we're currently configuring routing, but could be done with an overlay network.
  • Lifecycle hooks pre- and post-migration.
  • Actual live state transfer via CRIU (http://criu.org/Docker).
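
To make the first bullet concrete, here is a minimal sketch of deriving a stable hostname by hashing the pod IP. The "pod-" prefix, hash function, and truncation length are illustrative assumptions, not a proposed scheme:

```go
// Sketch: derive a stable hostname from a pod's IP address rather than from
// the pod name, so any pod holding that address (including a replacement
// created during a migration) gets the same hostname.
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// hostnameForIP hashes the pod IP and truncates the digest to keep the name short.
func hostnameForIP(podIP string) string {
	sum := sha1.Sum([]byte(podIP))
	return "pod-" + hex.EncodeToString(sum[:])[:12]
}

func main() {
	fmt.Println(hostnameForIP("10.244.1.17")) // same IP -> same hostname
}
```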

With respect to Kubernetes objects, the "migrated" pod would still be a new pod with a new name and uid. The orchestration of the migration would be performed by a controller -- possibly an enhanced replication controller, perhaps in collaboration with a network controller to move the address, similar to the separation of concerns in the persistent storage proposal. During the migration process, the old and new pods would coexist, and that coexistence would be visible to clients of the Kubernetes API, but the application being migrated and its clients would not need to be aware of the migration.
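
As a rough sketch of that ordering only (not an actual controller), with invented PodAPI and NetworkAPI interfaces standing in for whatever the real pod and network controllers would expose:

```go
// Sketch only: the interfaces below are hypothetical stand-ins.
package migration

import "fmt"

type PodAPI interface {
	// CreateReplacement creates a new pod (new name, new uid) to take over
	// from oldName and returns the new pod's name.
	CreateReplacement(oldName string) (newName string, err error)
	Delete(name string) error
}

type NetworkAPI interface {
	// MoveIP reassigns the old pod's IP address to the new pod; having a
	// separate network controller own this mirrors the separation of
	// concerns in the persistent storage proposal (#3318).
	MoveIP(fromPod, toPod string) error
}

// Migrate creates the replacement first, then moves the address, then deletes
// the original, so the old and new pods briefly coexist as API objects while
// the migrated application keeps the same IP.
func Migrate(pods PodAPI, net NetworkAPI, oldName string) error {
	newName, err := pods.CreateReplacement(oldName)
	if err != nil {
		return fmt.Errorf("create replacement for %s: %v", oldName, err)
	}
	if err := net.MoveIP(oldName, newName); err != nil {
		return fmt.Errorf("move IP from %s to %s: %v", oldName, newName, err)
	}
	return pods.Delete(oldName)
}
```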

/cc @smarterclayton @thockin @alex-mohr

@thockin
Member
thockin commented Jan 30, 2015

Fixing the hostname seems like an obvious change. We should consider where else that information might pop up. In a different issue we are discussing allowing users to expose their own pod UID and name as custom env vars. This would still be safe as long as we don't do live migration. We solved this internally by allocating a virtual UID (VUID) at pod creation time which travels across migrations (live or otherwise), but is allocated as a UUID. When a migrating controller knows that a new pod is a migration, it sets the VUID of the new pod.

I don't know if we will ever really get to pervasive live migration, but I hope so.
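
A hypothetical illustration of the VUID idea described above; the PodSpec struct and newUUID helper are stand-ins, not Kubernetes API types:

```go
// Sketch: a VUID is minted once when a pod is first created and copied
// forward by whatever controller performs the migration.
package main

import (
	"crypto/rand"
	"fmt"
)

type PodSpec struct {
	Name string
	UID  string // system-assigned, changes on every replacement
	VUID string // virtual UID, stable across migrations
}

// newUUID returns a random 128-bit identifier formatted like a UUID.
func newUUID() string {
	b := make([]byte, 16)
	rand.Read(b)
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

// replacementFor builds the spec for a migrated pod: fresh name and UID,
// but the VUID of the pod being replaced is carried over.
func replacementFor(old PodSpec, newName string) PodSpec {
	return PodSpec{Name: newName, UID: newUUID(), VUID: old.VUID}
}

func main() {
	orig := PodSpec{Name: "web-abc12", UID: newUUID(), VUID: newUUID()}
	moved := replacementFor(orig, "web-def34")
	fmt.Println(orig.VUID == moved.VUID) // true: identity survives the move
}
```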

@bgrant0607
Member

I'd make it a requirement that anything that uses the Kubernetes API to introspect or manage its own pods would need to be migration-aware. It could use the post-migration hook to get the pod's new name.

Why would someone want the uid? Which issue is that? (Other than #386.)

@smarterclayton
Contributor

They want the uid as a unique instance identifier. We could give them anything, but self-registration into an external system is one way ("I'm pod foo, serving X, here's my unique identifier").
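
A minimal sketch of that self-registration pattern; the registry URL, payload shape, and POD_* environment variables are assumptions for illustration:

```go
// Sketch: on startup a pod announces itself to a hypothetical external
// registry using a stable instance identifier.
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"os"
)

func main() {
	record := map[string]string{
		"pod":      os.Getenv("POD_NAME"),
		"service":  "X",
		"instance": os.Getenv("POD_UID"), // the unique identifier in question
	}
	body, err := json.Marshal(record)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("http://registry.example.com/instances",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```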
