Detailed Design for Volume Attach/Detach Controller #20262

Closed
saad-ali opened this Issue Jan 28, 2016 · 26 comments

@saad-ali commented Jan 28, 2016

Objective

  • Make volume attachment and detachment independent of any single node's availability
    • If a node or kubelet goes down, the volumes attached to that node should be detached so that they are free to be attached to other nodes.
  • Secure Cloud Provider Credentials
    • Because each kubelet is responsible for triggering attach/detach logic, every node currently needs (often broad) cloud provider permissions. These permissions should be limited to the master. For Google Compute Engine (GCE), for example, this means nodes should no longer need the compute-rw auth scope.
  • Improve stability of volume attach/detach code
    • Existing design has inherent race conditions that the new design should address.

Background

In the existing Kubernetes design, the kubelet is responsible for determining what volumes to attach or detach from the node it is running on. Therefore, when the kubelet or the node unexpectedly terminates, volumes attached to the node remain attached.

When a node becomes inaccessible (due to network issues, kubelet crash, node restart, etc.), the node controller marks the node as down and deletes all pods that were scheduled to it. These pods are subsequently rescheduled to other nodes. Many volume types can only be attached to one node at a time (for example, GCE PDs may only be attached to one instance in read-write mode), so pods that reference such a volume fail to start on the new node because the volume is still attached to the old node (issue #14642).

In addition, in the existing design, the loop in kubelet that determines whether a volume should be attached (the pod creation loop) is separate from, and completely independent of (it runs on a separate thread), the loop that determines whether a volume should be detached (the orphaned volumes loop). This leads to race conditions when a pod is rapidly created, deleted, and recreated between attach and detach, resulting in unpredictable behavior.

Furthermore, for many volume types, in order for nodes to trigger volume attach and detach, they require broad permissions from the cloud provider. Kubernetes nodes running on GCE, for example, require compute-rw auth scope to be able to initiate GCE PD attach/detach operations.

Solution Overview

A node should not be responsible for determining whether to attach or detach a volume to itself. Instead, the responsibility should be moved (out of Kubelet) to a separate off-node “Volume Attachment/Detachment Controller” that has a life cycle independent of any individual node.

More specifically, a new controller, responsible for attaching and detaching all volumes across the cluster, will be added to the controllermanager, running on the master. This controller will watch the API server to determine when pods are scheduled or killed. When a new pod is scheduled it will trigger the appropriate attach logic for that volume type based on the node it is scheduled to. When a pod is terminated, it will similarly trigger the appropriate detach logic. Some volume types may not have a concept of attaching to a node (e.g. NFS volume), and for these volume types the attach and detach operations will be no-ops.

Detailed Design

Attach/Detach Controller

The controller will maintain an in-memory cache containing a list of volumes that are managed by the controller (i.e. volumes that require attach/detach). Each of these volumes will, in addition to the volume name, specify a list of nodes the volume is attached to, and similarly, each of these nodes will, in addition to the node name, specify a list of pods that are scheduled to the node and referencing the volume. This cache defines the state of the world according to the controller. This cache must be thread-safe. The cache will contain a single top level boolean used to track if state needs to be persisted or not (the value is reset every time state is persisted, and it is set anytime an attach or detach succeeds).
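As a rough illustration, the cache could be modeled along the lines of the sketch below. This is only a sketch under the assumptions in this section; the type, field, and method names are made up for illustration and are not the actual implementation.

```go
// Illustrative sketch only: names are hypothetical, not the real Kubernetes types.
package attachdetach

import "sync"

// attachedVolume is one managed volume: for every node the volume is attached
// to, the set of scheduled pods on that node that reference it.
type attachedVolume struct {
	volumeName      string
	nodesAttachedTo map[string]map[string]struct{} // node name -> set of pod names
}

// volumeAttachmentCache is the controller's thread-safe in-memory state of the world.
type volumeAttachmentCache struct {
	sync.Mutex
	volumes map[string]*attachedVolume
	// stateChanged is the single top-level boolean: set whenever an attach or
	// detach succeeds, reset every time state is persisted.
	stateChanged bool
}

func newVolumeAttachmentCache() *volumeAttachmentCache {
	return &volumeAttachmentCache{volumes: map[string]*attachedVolume{}}
}

// markVolumeAttached records that volumeName is attached to nodeName and flags
// the cache as needing persistence.
func (c *volumeAttachmentCache) markVolumeAttached(volumeName, nodeName string) {
	c.Lock()
	defer c.Unlock()
	vol, found := c.volumes[volumeName]
	if !found {
		vol = &attachedVolume{
			volumeName:      volumeName,
			nodesAttachedTo: map[string]map[string]struct{}{},
		}
		c.volumes[volumeName] = vol
	}
	if _, found := vol.nodesAttachedTo[nodeName]; !found {
		vol.nodesAttachedTo[nodeName] = map[string]struct{}{}
	}
	c.stateChanged = true
}
```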

On initial startup, the controller will fetch a list of all persistent volume (PV) objects from the API server and pre-populate the cache with volumes and the nodes they are attached to.

The controller will have a loop that does the following (a simplified code sketch follows the list):

  • Fetch State
    • Fetch all pod and mirror pod objects from the API server.
  • Acquire Lock
    • Acquire a lock on the in-memory cache.
  • Reconcile State
    • Search for newly terminated/deleted pods:
      • Loop through all cached pods (i.e. volume->node->pods) and delete them from in-memory cache if:
        • The cached pod is not present in the list of fetched pods or the node it is scheduled to is different (indicating the pod object was deleted from the API server or rescheduled).
        • The cached pod is present in the list of fetched pods and the PodPhase is Succeeded or Failed and the VolumeStatus (detailed in “Safe Mount/Unmount” section below) indicates the volume is safe to detach.
    • Search for newly attached volumes:
      • Iterate through the fetched pods, and for each pod with a PodPhase Pending, check the volumes it defines or references (dereference any PersistentVolumeClaim objects to fetch associated PersistentVolume objects). If the volume is already tracked in memory cache, and the node the pod is scheduled to is listed in the in-memory cache for the specified volume (indicating the volume was successfully attached to the node):
        • Add the pod to the list of pods in the in-memory cache under the node (indicating a new pod is referencing a volume that is already attached to the node it is scheduled to).
  • Act
    • Trigger detaches:
      • Loop through all cached nodes (i.e. volume->node) and trigger detach logic (detailed below) for any volumes that are attached to a node (exist in-memory cache) but have no pods listed under that node (indicating no running pods using the volume) and implement the Detacher interface.
        • Detaches are triggered before attaches so that volumes referenced by pods that are rescheduled to a different node are detached first.
    • Trigger attaches:
      • Iterate through the fetched pods, and for each pod with a PodPhase Pending check the volumes it defines or references (dereference any PersistentVolumeClaim objects to fetch associated PersistentVolume objects). For each volume that implements the Attacher interface:
        • If the volume is not already in the in-memory cache (indicating a new volume has been discovered), then trigger attach logic (detailed in section below) to attach the volume to the node the pod is scheduled to.
        • If the volume is already tracked in memory cache, and the node the pod is scheduled to is not listed in the in-memory cache for the specified volume (indicating the volume is not attached to the node), trigger attach logic (detailed in section below) to attach the volume to the node the pod is scheduled to.
  • Persist State
    • Persist the in-memory cache to API server for PersistentVolume volume types:
      • If the top level boolean in the in-memory cache used to track if state needs to be persisted or not is not set, skip this operation (to prevent unnecessary writes to the API server).
      • For each volume in the in-memory cache that is a PersistentVolume and is attached to at least one node (its list of nodes is non-empty), write the list of nodes to the associated PersistentVolume object via the API server.
      • Reset the top level boolean in the in-memory cache to indicate that state does not need to be persisted.
  • Release Lock
    • Release the lock on the in-memory cache (spawned threads have this opportunity to update state indicating volume attachment or detachment).
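A minimal skeleton of that loop is sketched below, continuing the illustrative package from the cache sketch above. The struct fields, the placeholder helpers (fetchPods, reconcile, triggerDetaches, triggerAttaches, persist), and the loopPeriod value are assumptions made for the sketch, not the actual implementation.

```go
// Hypothetical skeleton of the controller loop; helpers are placeholders.
package attachdetach

import (
	"log"
	"time"
)

const loopPeriod = 10 * time.Second // assumed value

type attachDetachController struct {
	cache *volumeAttachmentCache // in-memory cache from the earlier sketch

	// Placeholders standing in for the fetch/reconcile/act/persist logic above.
	fetchPods       func() ([]string, error) // pod keys, e.g. "namespace/name"
	reconcile       func(pods []string)
	triggerDetaches func()
	triggerAttaches func(pods []string)
	persist         func()
}

func (adc *attachDetachController) run(stopCh <-chan struct{}) {
	ticker := time.NewTicker(loopPeriod)
	defer ticker.Stop()
	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
		}

		// Fetch State: all pod and mirror pod objects from the API server.
		pods, err := adc.fetchPods()
		if err != nil {
			log.Printf("failed to fetch pods: %v", err)
			continue
		}

		adc.cache.Lock() // Acquire Lock on the in-memory cache.

		adc.reconcile(pods)       // Reconcile State: prune terminated pods, record attached volumes.
		adc.triggerDetaches()     // Act: detaches first, so rescheduled volumes are freed...
		adc.triggerAttaches(pods) // ...then attaches for newly scheduled pods.

		// Persist State only if an attach or detach succeeded since the last write.
		if adc.cache.stateChanged {
			adc.persist()
			adc.cache.stateChanged = false
		}

		adc.cache.Unlock() // Release Lock; spawned threads may now update the cache.
	}
}
```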

Attach and detach operations can take a long time to complete, so the primary controller loop should not block on these operations. Instead the attach and detach operations should spawn separate threads for these operations. To prevent multiple attach or detach operations on the same volume, the main thread will maintain a table mapping volumes to currently active operations. The number of threads that can be spawned will be capped (possibly using a thread pool) and once the cap is hit, subsequent requests will have to wait for a thread to become available.
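A sketch of that per-volume operation table is below; it roughly corresponds to the "go routines by name" structure that was later added in #24838 (see the referenced commits further down), but the version here is simplified. One deliberate simplification: instead of making the caller wait when the goroutine cap is hit, this sketch rejects the request and lets the main loop retry on its next pass.

```go
// Simplified sketch of the per-volume operation table; not the real implementation.
package attachdetach

import (
	"fmt"
	"sync"
)

// operationRunner ensures at most one attach or detach is in flight per volume
// and caps the total number of concurrently spawned goroutines.
type operationRunner struct {
	mu        sync.Mutex
	inFlight  map[string]bool // volume name -> operation pending
	semaphore chan struct{}   // caps concurrent operations (a simple "thread pool")
}

func newOperationRunner(maxConcurrent int) *operationRunner {
	return &operationRunner{
		inFlight:  make(map[string]bool),
		semaphore: make(chan struct{}, maxConcurrent),
	}
}

// run starts op for volumeName in a new goroutine unless an operation for that
// volume is already pending or the concurrency cap has been reached; in either
// case it returns an error and the main loop simply retries later.
func (r *operationRunner) run(volumeName string, op func()) error {
	r.mu.Lock()
	if r.inFlight[volumeName] {
		r.mu.Unlock()
		return fmt.Errorf("operation for volume %q already pending", volumeName)
	}
	r.inFlight[volumeName] = true
	r.mu.Unlock()

	select {
	case r.semaphore <- struct{}{}: // reserve a slot
	default:
		r.mu.Lock()
		delete(r.inFlight, volumeName)
		r.mu.Unlock()
		return fmt.Errorf("operation cap reached; volume %q will be retried", volumeName)
	}

	go func() {
		defer func() {
			<-r.semaphore
			r.mu.Lock()
			delete(r.inFlight, volumeName)
			r.mu.Unlock()
		}()
		op()
	}()
	return nil
}
```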

For backwards compatibility, the controller binary will accept a flag (or some mechanism) to disable the new controller altogether.

To allow rolling upgrades and mixed-version clusters, the controller will not manage attach/detach for any node except nodes that explicitly have a volume.temporary.kubernetes.io/controller-managed annotation (detailed in "Upgrade Plan" below).
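For illustration only, the per-node gate could be as simple as the check below. The annotation key used here is the one proposed in this design; note that the name that eventually shipped is different (see the release note in the merged kubelet PR near the end of this thread).

```go
// Sketch of the per-node gate; the annotation key is the one proposed in this design.
package attachdetach

const controllerManagedAnnotation = "volume.temporary.kubernetes.io/controller-managed"

// nodeIsControllerManaged reports whether the controller should handle
// attach/detach for a node, based on the annotation an upgraded kubelet sets
// on its own Node API object.
func nodeIsControllerManaged(nodeAnnotations map[string]string) bool {
	_, found := nodeAnnotations[controllerManagedAnnotation]
	return found
}
```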

Attach Logic

To attach a volume (a simplified sketch follows this list):

  • Acquire operation lock for volume so that no other attach or detach operations can be started for volume.
    • Abort if there is already a pending operation for the specified volume (main loop will retry, if needed).
  • Check “policies” to determine if the persistent volume can be attached to the specified node, if not, do nothing.
    • This includes logic that prevents a ReadWriteOnce persistent volume from being attached to more than one node.
  • Spawn a new thread:
    • Execute the volume-specific logic to attach the specified volume to the specified node.
    • If there is an error indicating the volume is already attached to the specified node, assume attachment was successful (this will be the responsibility of the plugin code).
    • For all other errors, log the error, and terminate the thread (the main controller will retry as needed).
    • Once attachment completes successfully:
      • Acquire a lock on the in-memory cache (block until lock is acquired).
      • Add the node the volume was attached to, to in-memory cache (i.e. volume->node), to indicate the volume was successfully attached to the node.
      • Set the top level boolean in the in-memory cache to indicate that state does need to be persisted.
      • Release the lock on the in-memory cache.
      • Make a call to the API server to update the VolumeStatus object under the PodStatus for the volume to indicate that it is now safeToMount.
  • Release operation lock for volume.
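Tying the earlier sketches together, the attach path might look roughly like the sketch below. The attacher interface, the canAttach policy check, and the markSafeToMount callback are placeholders invented for illustration, not real Kubernetes APIs; the detach path is symmetric (detach via the plugin, then remove the node from the volume's entry in the cache).

```go
// Rough sketch of the attach path; placeholder names, not real Kubernetes APIs.
package attachdetach

import "log"

// attacher is a stand-in for the volume plugin's Attacher interface. Attach
// must treat "already attached to this node" as success.
type attacher interface {
	Attach(volumeName, nodeName string) error
}

func attachVolume(
	cache *volumeAttachmentCache, // in-memory cache (earlier sketch)
	ops *operationRunner, // per-volume operation table (earlier sketch)
	plugin attacher,
	canAttach func(volumeName, nodeName string) bool, // policy check, e.g. ReadWriteOnce
	markSafeToMount func(volumeName, nodeName string), // updates VolumeStatus via the API server
	volumeName, nodeName string,
) {
	// Policy check: for example, never attach a ReadWriteOnce volume to a second node.
	if !canAttach(volumeName, nodeName) {
		return
	}
	// run refuses to start if an operation for this volume is already pending,
	// and otherwise spawns the long-running work on its own goroutine.
	err := ops.run(volumeName, func() {
		if attachErr := plugin.Attach(volumeName, nodeName); attachErr != nil {
			// Log and give up; the main controller loop retries as needed.
			log.Printf("attach of %q to node %q failed: %v", volumeName, nodeName, attachErr)
			return
		}
		// Record success (this also sets the cache's dirty bit), then signal
		// kubelet that the volume is now safe to mount.
		cache.markVolumeAttached(volumeName, nodeName)
		markSafeToMount(volumeName, nodeName)
	})
	if err != nil {
		log.Print(err) // operation already pending or cap reached; main loop retries
	}
}
```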

Detach Logic

To detach a volume:

  • Acquire operation lock for volume so that no other attach or detach operations can be started for volume.
    • Abort if there is already a pending operation for the specified volume (main loop will retry, if needed).
  • Spawn a new thread:
    • Execute the volume-specific logic to detach the specified volume from the specified node.
    • If there is an error indicating the volume is not attached to the specified node, assume detachment was successful (this will be the responsibility of the plugin code).
    • For all other errors, log the error, and terminate the thread (the main controller will retry as needed).
    • Once detachment completes successfully:
      • Acquire a lock on the in-memory cache (block until lock is acquired).
      • Remove the node that the specified volume was detached from, from the list of attached nodes under the volume in-memory cache, to indicate the volume was successfully detached from the node.
      • Set the top level boolean in the in-memory cache to indicate that state does need to be persisted.
      • Release the lock on the in-memory cache.
  • Release operation lock for volume.

Rationale for Storing State

The new controller will need to constantly be aware of which volumes are attached to which nodes, and which scheduled pods reference those volumes. The controller should also be robust enough that if it terminates unexpectedly, it should continue operating where it left off when restarted. Ideally, the controller should also be a closed-loop system--it should store little or no state and, instead, operate strictly with an ”observe and rectify” philosophy; meaning, it should trust no cached state (never assume the state of the world), and should instead always verify (with the source of truth) which volumes are actually attached to which nodes, then take the necessary actions to push them towards the desired state (attach or detach), and repeat (re-verify with the source of truth the state of the world and take actions to rectify).

The problem with this purist approach is that it puts an increased load on volume providers by constantly asking for the state of the world (repeated operations to list all nodes or volumes to determine attachment mapping). It makes the controller susceptible to instability/delays in the provider APIs. And, most importantly, some state simply cannot (currently) be reliably inferred through observation.

For example, the controller must be able to figure out when to trigger volume detach operations. When a pod is terminated, the pod object will be deleted (normally by kubelet, unless kubelet becomes unresponsive, in which case the node controller will mark the node as unavailable, resulting in the pod being forcibly terminated and the pod object getting deleted). So the controller could use the deletion of the pod object (a state change) as the observation that indicates either that the volume was safely unmounted or that the node became unavailable, and trigger the volume detach operation.

However, having the controller trigger such critical behavior based on observation of a state change makes the controller inherently unreliable, because the controller cannot retrieve the state of the world on-demand and must instead constantly monitor for state changes; if a state change is missed, the controller will not trigger the correct behavior. Imagine that the controller unexpectedly terminates and the pod object is deleted during this period: when the controller comes back up, it will have missed the pod deletion event and will have no knowledge that a volume was once attached to a node and should now be detached.

Therefore, in order to make the controller reliable, it must maintain some state (specifically which volumes are attached to which nodes). It can maintain a mapping of volumes and the nodes they are attached to in local memory, and operate based on this state.

To prevent this cached state of the world (which volumes are attached to which nodes) from becoming stale or out-of-sync with the true state of the world, the controller could, optionally, periodically poll the volume providers to find out exactly which volumes are attached to which nodes and update its cached state accordingly.

In order to make the controller more robust, the cached state should also be persisted outside the controller. This can be done by adding a field to the PersistentVolume object, exposed by the Kubernetes API server. If the controller is unexpectedly terminated, it can come back up and pick up from where it left off by reading this state from the API server (except for “Pod Defined” and “Static Pod” Volumes, see below). If a PersistentVolume object is deleted by the user before it is detached, the controller will still behave normally, as long as the controller is not restarted during this period, since the volume still exists in the controller's local memory.

What about Pod Defined Volume Objects and Static Pods?

In the existing design, most volume types can be defined either as PersistentVolumeSource types or as plain VolumeSource types. During pod creation, Kubelet looks at any Volumes (dereferencing PersistentVolumeClaims) defined in the pod object and attaches/mounts them as needed.

The solution proposed above for persisting controller state (by extending the PersistentVolume object with additional fields that the controller can use to save and read attach/detach state) works only for volumes created as PersistentVolume objects, because the API server stores and exposes PersistentVolume objects. However, plain Volume objects are defined directly in the pod definition and are simply fields in the pod object (not first-class API server objects). If any state were stored inside the pod object (like what node a volume is attached to), then once the pod API object is deleted (which happens routinely when a pod is terminated), information about what node(s) the volume was attached to would be lost. Additionally, users can create Static Pods and bypass the API server altogether.

In order to support such “Pod Defined” and “Static Pod” Volumes, the controller will manage all attach/detach but persist only some state in API server objects:

  • Controller handles “persistent volume” objects by parsing PersistentVolume API objects from the API server, and persists state in the PersistentVolume API object.
  • Controller handles “pod defined volumes” by parsing pod objects from the API server, and stores state only in local controller memory (best effort, no persisted state).
  • Controller handles “static pods” by parsing mirror pod objects from the API server, and stores state only in local controller memory (best effort, no persisted state).

Because “static pods” can be created via a kubelet running without a master (i.e. standalone mode), kubelet will support a legacy mode of operation where it continues to be responsible for attach/detach logic; same behavior as today (see below for details).

Safe Mount/Unmount

The controller should wait for a volume to be safely unmounted before it tries to detach it. Specifically, the controller should wait for the kubelet (on the node that a volume is attached to) to safely unmount the volume before the controller detaches it. This means the controller must monitor some state (set by kubelet) indicating a volume has been unmounted, and detach once the signal is given. The controller should also be robust enough that, if the kubelet is unavailable, it acts unilaterally to detach the disk anyway (so that no disks are left attached when a kubelet becomes inaccessible).

Similarly kubelet should wait for a volume to be safely attached before it tries to mount it. Specifically, the kubelet should wait for the attach/detach controller to indicate that a volume is attached before the kubelet attempts to mount it. This means the kubelet must monitor some state (set by the controller) indicating if a volume has been attached, and mount only once the signal is given.

Both these goals can be achieved by introducing a list of VolumeStatus objects under the PodStatus API object. The VolumeStatus object, similar to ContainerStatus, will contain state information for a volume in the pod. This includes a safeToMount field indicating the volume is attached and ready to mount (set by controller, read by kubelet), and a safeToDetach field indicating the volume is unmounted and ready to detach (set by kubelet, read by controller).
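As proposed in this section, the new status type could look something like the sketch below (the field names follow the prose above). Note that, as discussed later in this thread, the implementation that eventually merged dropped these fields in favor of a VolumesInUse field on the Node status.

```go
// Sketch of the proposed API addition; the merged implementation differs (VolumesInUse).
package attachdetach

// VolumeStatus would live in a list under PodStatus, one entry per volume in
// the pod, analogous to ContainerStatus.
type VolumeStatus struct {
	// Name of the volume within the pod spec.
	Name string

	// SafeToMount is set by the attach/detach controller once the volume is
	// attached to the node the pod is scheduled to; kubelet mounts only after
	// this becomes true.
	SafeToMount bool

	// SafeToDetach is set by kubelet once the volume has been unmounted on the
	// node; the controller detaches only after this becomes true (or the node
	// is unreachable).
	SafeToDetach bool
}
```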

Kubelet Changes

For backwards compatibility and upgrade purposes, kubelet will support two modes of operation, controlled via a flag to the binary.

  • Kubelet volume attach/detach enabled
    • Legacy mode of operation, where kubelet continues to be responsible for handling all attach/detach operations for volumes by itself (this assumes the new attach/detach controller is either not present or disabled).
    • Running in this mode eliminates all the benefits of this redesign (defined in objectives above).
    • This mode is primarily provided to support two scenarios:
      • Support rolling upgrades of Kubernetes
        • Detailed in "Upgrade Plan" below.
      • To support volumes defined in static pods defined on standalone (no master) kubelets.
    • This functionality may be removed from future releases of Kubernetes.
  • Kubelet volume attach/detach disabled
    • New/default mode of operation, where all logic to attach/detach a volume in kubelet is disabled.
    • In this mode Kubelet will only be responsible for mounting volumes once the VolumeStatus indicates it is attached, and unmounting them and setting the VolumeStatus accordingly once the pod is terminated (this assumes the new attach/detach controller is present and handling attachment/detachment instead).
    • In this mode Kubelet adds a volume.temporary.kubernetes.io/controller-managed annotation to its node API object indicating the controller should manage attaches/detaches (detailed in "Upgrade Plan" below).

Upgrade Plan

In order to safely upgrade an existing Kubernetes cluster without interruption of volume attach/detach logic:

  1. Upgrade the master first.
    • This will start the new attach/detach controller.
    • The new controller will initially ignore volumes for all nodes since they lack the volume.temporary.kubernetes.io/controller-managed annotation.
    • As nodes are upgraded, the associated node objects will gain the volume.temporary.kubernetes.io/controller-managed annotation, which will cause the controller to start doing attaches/detaches for volumes that land on those nodes.
  2. Upgrade node machines.
    • As node machines are upgraded, the default behavior will be "Kubelet volume attach/detach disabled" (detailed above).
    • In this mode kubelet will disable attach/detach and add a volume.temporary.kubernetes.io/controller-managed annotation to its node API object, indicating to the new controller that it should start doing attaches/detaches on behalf of the node.

Since rolling upgrades are only supported across at most two Kubernetes releases, this rolling-upgrade logic can be removed after two releases. That means the volume.temporary.kubernetes.io/controller-managed annotation can be retired and the controller can operate on all nodes by default (ignoring the annotation).

Updated February 10, 2016: Typo fix. Changed persistence logic from every 5 minutes to every cycle (as long as there is a state change).

Updated February 22, 2016: Reordered main loop logic so that a volume that is created is not immediately detached while continuing to support a rescheduled volume being detached before it is attached.

Updated April 18, 2016: Added detailed upgrade plan.

@stvnwrgs commented Jan 29, 2016

Great proposal! Hope this will fix the persistent volume issues we are struggling with at the moment.

General question: this will land in a minor release, right? So the earliest possible release version would be 1.3.0?

@saad-ali i have a few questions/suggestions:

Attach/Detach Controller

To prevent the controller from constantly writing to the API server, this should happen periodically (e.g. every 5 minutes).

I think it could be useful to write the list of nodes associated with the volume directly when a disk is attached/detached. This could be done in addition to a general sync. Reason: I think if you describe a PersistentVolume or Node with kubectl, you won't get the actual state of the disk for up to 5 minutes, because this data comes from etcd, right?
I would add an update of the PersistentVolume in the attach/detach functions.

Similarly kubelet should wait for a volume to be safely mounted before it tries to attach it

Attach <-> Mount ;)

@markturansky commented Jan 29, 2016

@kubernetes/rh-storage

@alfred-huangjian commented Feb 3, 2016

cc @kubernetes/huawei

@saad-ali commented Feb 11, 2016

Thanks @stvnwrgs.

General question: this will land in a minor release, right? So the earliest possible release version would be 1.3.0?

Yes, this is targeted for 1.3.0

I think it could be useful to write the list of nodes associated with the volume directly when a disk is attached/detached. This could be done in addition to a general sync. Reason: I think if you describe a PersistentVolume or Node with kubectl, you won't get the actual state of the disk for up to 5 minutes, because this data comes from etcd, right?
I would add an update of the PersistentVolume in the attach/detach functions.

Good point. I went back and forth on this. My original plan was to persist state as soon as a state change happened. But I want to balance the primary goal of persisting state (so there is enough info for the controller to recover if it crashes) against not overwhelming the API server/etcd with tons of writes. I think a good balance will be to write more frequently but still periodically (every sync cycle, potentially), with the caveat that state is only persisted if it has changed since the previous write (the in-memory cache has a boolean value that is reset every time it is persisted, and anytime an attach or detach succeeds it flips that bit). I'll update the doc to reflect this.

Attach <-> Mount ;)

Good catch, fixed.

@swagiaal commented Feb 22, 2016

Hmm.. I am a little uneasy with all the stored state :/... I'll offer some suggestions which could help:

The problem with this purist approach is that it puts an increased load on volume providers by constantly asking for the state of the world (repeated operations to list all nodes or volumes to determine attachment mapping).

What if we ask the nodes instead of the provider? All the plugins know the path to the device (or at least know how to find it). So maybe we can extend that to provide a list of attached volumes. And when attachedVolumes != desiredVolumes we attach/detach to match.

Similar to health checks, the update is pushed from the node to the master, and if the updates stop we assume the node is dead and do the right thing (detach volumes?). Of course if the node is dead we can fall back to the provider as the backup source of truth.

We can use pod creation/deletion events to drive attach/detach and do the above check periodically to clean up.

The list of 'desiredVolumes' can be calculated by inspecting pods in the API server and should cover PDs and old fashioned pod volumes.

This means the kubelet must monitor some state (set by the controller) indicating if a volume has been attached, and mount only once the signal is given.

There will always be a gap between when the provider API says that the attach is complete and when the attach is actually complete. That is why volume plugins currently poll until the device appears. So perhaps we can continue to rely on the appearance of the device as the indicator that the attach is complete. Is it possible for the controller to block the scheduling of the pod until the provider API says that the needed devices have been attached? If so, perhaps the remainder of the time gap can be covered by polling for the device.

@saad-ali commented Feb 24, 2016

What if we ask the nodes instead of the provider? All the plugins know the path to the device (or at least know how to find it). So maybe we can extend that to provide a list of attached volumes. And when attachedVolumes != desiredVolumes we attach/detach to match.

Similar to health checks, the update is pushed from the node to the master, and if the updates stop we assume the node is dead and do the right thing (detach volumes?). Of course if the node is dead we can fall back to the provider as the backup source of truth.

Because node availability is not guaranteed and the controller must be able to function independently anyway, this will just add unnecessary complexity and overhead. The source of truth must be the storage provider, not the nodes.

We can use pod creation/deletion events to drive attach/detach and do the above check periodically to clean up.

I am ok with adding a reconciler that ensures that the expected state of the world matches the actual state of the world. But, I would want this to check with the storage provider, not the nodes. From the doc: "To prevent this cached state of the world (which volumes are attached to which nodes) from becoming stale or out-of-sync with the true state of the world, the controller could, optionally, periodically poll the volume providers to find out exactly which volumes are attached to which nodes and update its cached state accordingly." That said, this won't be part of the initial implementation, we can add it in later if needed.

There will always be a gap between when the provider API says that the attach is complete and when the attach is actually complete. That is why volume plugins currently poll until the device appears. So perhaps we can continue to rely on the appearance of the device as the indicator that the attach is complete.

The appearance of the device does not necessarily indicate that it is safe to use. For example, there was a GCE PD bug a while ago, #7972, where even after the device appeared it was not ready to use. Therefore I wanted to make sure I gated on both the storage provider giving the thumbs up and then the existing kubelet code that relies on the appearance of the device to ensure that a volume is indeed attached and ready to mount.

Is it possible for the controller to block the scheduling of the pod until the provider API says that the needed devices have been attached ? If so perhaps the remainder of the time gap can be covered by polling for the device.

No need to get the scheduler involved. With the proposed design, kubelet should have sufficient checks in place to ensure that a volume is not mounted before it is attached.

@jsafrane commented Mar 20, 2016

Working on #22994 I found out there should be something that enforces the number of EBS volumes or PDs attached to a node: AWS supports ~40 volumes (this number should be configurable) and GCE has a hard limit of 16 PDs attached to a node.

The scheduler is now aware of these limits, however it sees only "running" pods and not the deleted ones, so it may happen that more volumes than the scheduler's limit become attached to a node: new pods are already attaching their volumes while some dying ones are still slowly detaching.

Is it something that the attach controller could enforce at the cluster level?

@erinboyd commented Mar 22, 2016

@jsafrane How did you find this limitation? Is it through an API call that we can set this value from? Do you currently have it hardcoded? Is it possible that it could vary by the level of service you select with either provider? Just thinking forward if it's something we can call and get the value from the provider and set on each run...

@jsafrane commented Mar 22, 2016

@erinboyd, AWS [link] supports only 40 EBS volumes per node; everything above this limit "is supported on a best effort basis only and is not guaranteed."

GCE [link] has a limit of 16 PDs, which seems to be a really hard limit.

39 and 15 are the hardcoded scheduler defaults, tunable via the KUBE_MAX_PD_VOLS environment variable in the scheduler process. I'm not sure how well it's documented. Introduced by PR #19880.

@erinboyd commented Mar 22, 2016

@jsafrane thanks for the clarification

@thockin commented Apr 4, 2016

I just did a full re-read.

Each of these volumes will, in addition to the volume name, specify a list of
nodes the volume is attached to

Is this in the PV resource? Beware size limits. Things like NFS should be able to skip this entirely.

On initial startup, the controller will fetch a list of all persistent volume
(PV) objects from the API server and pre-populate the cache with volumes and
the nodes they are attached to.

This should pull from the cloud provider and refresh periodically. Otherwise it's just a potential for things to get out of sync and hopelessly confuse users. You talk about this later, but I think it's a pretty big deal. The hard part is knowing which attaches are "managed" vs "user-owned".

dereference any PersistentVolumeClaim objects

So you're watching all PV, PVC, Pod, and Node objects? You should call that out somewhere.

Persist the in-memory cache to API server for PersistentVolume volume types

This is very racy. What if you crash before you persist the state? What reconciles?

If a PersistentVolume object is deleted by the user before it is detached,
the controller will still behave normally, as long as the controller is not
restarted during this period

Not good enough? We need to define what happens in all cases.

Similarly kubelet should wait for a volume to be safely attached before it
tries to mount it

Won't the mount just not work? This seems unneeded to me. If so, that changes "safeTo*" into something like "isMounted bool"

Support rolling upgrades of Kubernetes

This doc needs a section covering upgrades specifically. How to do an upgrade from 1.2 to 1.3 and enable this feature? What happens if we upgrade nodes first? What happens if we upgrade master first? How do hosted products like GKE manage the crossover?

@saad-ali commented Apr 19, 2016

there should be something that enforces the number of EBS volumes or PDs attached to a node... Is it something that the attach controller could enforce at the cluster level?

We'll handle it separately from this design (See #24317).

Summarizing the discussion @thockin and I had offline:

This should pull from the cloud provider and refresh periodically. Otherwise it's just a potential for things to get out of sync and hopelessly confuse users. You talk about this later, but I think it's a pretty big deal. The hard part is knowing which attaches are "managed" vs "user-owned".

This was an optional part of the design: "To prevent this cached state of the world (which volumes are attached to which nodes) from becoming stale or out-of-sync with the true state of the world, the controller could, optionally, periodically poll the volume providers to find out exactly which volumes are attached to which nodes and update its cached state accordingly."

After speaking with @thockin, I agree that it is worth making this a requirement. It will need each cloud provider to expose a "disks attached to nodes" API, which would be queried periodically to reconcile the controller's state of the world. Doing this would also enable us to eliminate persisting the state of the world in PV objects: if the controller crashes, the storage provider can be queried directly for the current state. If a cloud provider chooses not to provide the API, then it would risk a less robust implementation.
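A hedged sketch of what such a provider-facing query could look like is below; the interface and method names are assumptions made for illustration, not an actual cloud provider API. The controller would call this periodically (and on startup) and correct its cached volume-to-node mapping to match the provider's answer.

```go
// Illustrative shape of a "disks attached to nodes" query; names are assumptions.
package attachdetach

// AttachmentLister is what the controller would periodically query to
// reconcile its cached state of the world with the storage provider.
type AttachmentLister interface {
	// VolumesAttachedToNodes returns, for each of the given node names, the
	// volume names the provider reports as currently attached to that node.
	VolumesAttachedToNodes(nodeNames []string) (map[string][]string, error)
}
```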

Is this in the PV resource? Beware size limits. Things like NFS should be able to skip this entirely.

Now that we are planning to query the storage provider to get the state of the world, we don't even need to persist state to the PV for this scenario. That said, the same comment could apply to the safeToDetach field. Regardless, it's not too much of a worry because it contains a list of nodes, not pods.

So you're watching all PV, PVC, Pod, and Node objects? You should call that out somewhere.

Yes, will do.

This is very racey. What if you crash before you persist the state? What reconciles?
Not good enough? We need to define what happens in all cases.

The persisting of state to PV objects was designed to be a best effort attempt at preventing losing track of volumes that should be detached if the controller crashes and the pod gets deleted before it comes back up. The other case, new volumes to attach during down time, reconcile themselves once the controller comes back up (looks like I need to attach volume X to node A, oh looks like it is already attached, all is good). However, now that we are planning to query the storage provider to get the state of the world, we don't even need to persist state to PV for this scenario.

We still need to use the PV object to coordinate safe mount/unmount between the controller and kubelet which means

Won't the mount just not work? This seems unneeded to me. If so, that changes "safeTo*" into something like "isMounted bool"

Sami had the same question. The reason I wanted to gate mounting was because the appearance of the device does not necessarily indicate that it is safe to use. For example, there was a GCE PD bug a while ago, #7972, where even after the device appeared it was not ready to use. Therefore I wanted to make sure I gated on both the controller giving the thumbs up and then the existing kubelet code that relies on the appearance of the device to ensure that a volume is indeed attached and ready to mount. However, I agree that it is not critical. So, I will not include it, and we can always add it in later if there is a need.

This doc needs a section covering upgrades specifically. How to do an upgrade from 1.2 to 1.3 and enable this feature? What happens if we upgrade nodes first? What happens if we upgrade master first? How do hosted products like GKE manage the crossover?

Added.

@fabioy commented Apr 25, 2016

Great proposal. One more item: the controller needs to be able to handle having a different max PD value per node.

With heterogeneous node clusters, it's possible that nodes of different machine types are part of the same cluster. The controller needs to be able to handle this, as well as having some mechanism for discovering this value (probably from kubelet?).

@saad-ali commented Apr 25, 2016

Thanks @fabioy. We will handle the max PD value per node independent of this in Issue #24317

@fabioy commented Apr 25, 2016

SG. Added a comment there as well. Thanks.

@swagiaal commented May 2, 2016

@saad-ali if we are no longer persisting any state and are polling the cloud providers for attachment information, is it now OK to expand this to include regular volumes and not just PV-based volumes?

@saad-ali commented May 2, 2016

@saad-ali if we are no longer persisting any state and are polling the cloud providers for attachment information, is it now OK to expand this to include regular volumes and not just PV-based volumes?

The existing design already includes non-PV based volumes. See the "What about Pod Defined Volume Objects and Static Pods" section. The only difference was that PV volumes would persist to PV objects and now neither will since we'll poll the storage providers.

@swagiaal commented May 2, 2016

Great! I think I misunderstood when I last read this.

k8s-github-robot added a commit that referenced this issue May 9, 2016

Merge pull request #24696 from saad-ali/attachControllerSkeleton
Automatic merge from submit-queue

Add data structure for storing attach detach controller state.

This PR introduces the data structure for maintaining the in-memory state for the new attach/detach controller (#20262).

k8s-github-robot added a commit that referenced this issue May 9, 2016

Merge pull request #24838 from saad-ali/attachControllerOpMap
Automatic merge from submit-queue

Add data structure for managing go routines by name

This PR introduces a data structure for managing go routines by name. It prevents the creation of new go routines if an existing go routine with the same name exists. This will enable parallelization of the designs in #20262 and #21931 with sufficient protection to prevent starting multiple operations on the same volume.

k8s-github-robot added a commit that referenced this issue May 10, 2016

Merge pull request #25008 from saad-ali/attachControllerSkeletonActual
Automatic merge from submit-queue

Introduce skeleton of new attach/detach controller

This PR introduces the skeleton of the new attach/detach controller for #20262

k8s-github-robot added a commit that referenced this issue May 26, 2016

Merge pull request #25457 from saad-ali/expectedStateOfWorldDataStructure

Automatic merge from submit-queue

Attach Detach Controller Business Logic

This PR adds the meat of the attach/detach controller proposed in #20262.

The PR splits the in-memory cache into a desired and actual state of the world.

@sgotti commented Jun 2, 2016

@saad-ali Will the attach/detach controller work also with volumes (like rbd, iscsi) that must be attached directly from the host (unlike gce pd, ebs, cinder)? How will it handle the detach case when a node with an attached rbd/iscsi volume fails or is partitioned (active but unreachable)?

I'm asking this because every volume plugin is actually different: for example, pkg/volume/rbd has internal fencing to avoid a volume being attached by multiple nodes (but needs manual intervention when a node fails without coming back or is partitioned), while it looks like iscsi doesn't do this (so it might lead to corruption), and probably additional flex volume plugins (I'm thinking of the new coreos/torus) will have the same problems.

From #18949 (comment) and openshift/origin#7983 (comment) it looks like this proposal will avoid per-volume-plugin ad hoc fencing and fix the need for manual unfencing when a node fails without coming back.

@saad-ali commented Jun 2, 2016

@sgotti No, it will not. Only plugins that implement the attacher interface (GCE PD, AWS EBS, flex, etc.) will work with the attach/detach controller. An attach operation can be triggered from any machine in the cluster, whereas mount operations must be triggered by the node machine itself. Since plugins like RBD, iSCSI, NFS, etc. only implement the mounter interface and not the attacher interface, they are at the mercy of the node agent, kubelet. But it's an interesting point you bring up. I think it boils down to this: if a storage plugin requires fencing (i.e. must be detached if the node goes down), then it probably should implement the attacher interface.
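For readers following along, the split being described looks roughly like the sketch below. These are simplified, illustrative shapes only; the real interfaces in pkg/volume carry more methods and parameters (for example, SetUpAt takes additional arguments).

```go
// Simplified, illustrative shapes of the two plugin roles; not the real pkg/volume interfaces.
package attachdetach

// Attacher/Detacher operations can run on any machine in the cluster (here, the
// controller on the master), so they take the target node as an argument.
type Attacher interface {
	Attach(volumeName, nodeName string) error
}

type Detacher interface {
	Detach(volumeName, nodeName string) error
}

// Mounter/Unmounter operations act on local devices and paths, so they can only
// run on the node itself (in kubelet). Plugins such as RBD, iSCSI, and NFS that
// implement only this side are not handled by the new controller.
type Mounter interface {
	SetUpAt(dir string) error
}

type Unmounter interface {
	TearDownAt(dir string) error
}
```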

k8s-github-robot added a commit that referenced this issue Jun 3, 2016

Merge pull request #26351 from saad-ali/attachDetachControllerKubeletChanges

Automatic merge from submit-queue

Attach/Detach Controller Kubelet Changes

This PR contains changes to enable attach/detach controller proposed in #20262.

Specifically it:
* Introduces a new `enable-controller-attach-detach` kubelet flag to enable control by attach/detach controller. Default enabled.
* Removes all references to the `SafeToDetach` annotation from the controller.
* Adds the new `VolumesInUse` field to the Node Status API object.
* Modifies the controller to use `VolumesInUse` instead of `SafeToDetach` annotation to gate detachment.
* Modifies kubelet to set `VolumesInUse` before Mount and after Unmount.
  * There is a bug in the `node-problem-detector` binary that causes `VolumesInUse` to get reset to nil every 30 seconds. Issue kubernetes/node-problem-detector#9 (comment) opened to fix that.
  * There is a bug here in the mount/unmount code that prevents resetting `VolumesInUse` in some cases; this will be fixed by the mount/unmount refactor.
* Have controller process detaches before attaches so that volumes referenced by pods that are rescheduled to a different node are detached first.
* Fix misc bugs in controller.
* Modify GCE attacher to: remove retries, remove mutex, and not fail if volume is already attached or already detached.

Fixes #14642, #19953

```release-note
Kubernetes v1.3 introduces a new Attach/Detach Controller. This controller manages attaching and detaching volumes on-behalf of nodes that have the "volumes.kubernetes.io/controller-managed-attach-detach" annotation.

A kubelet flag, "enable-controller-attach-detach" (default true), controls whether a node sets the "controller-managed-attach-detach" or not.
```

@sgotti commented Jun 5, 2016

@saad-ali Thanks for the detailed answer!

Correct me if I'm wrong, but looking at the current k8s master it looks like only GCE PD implements the volume.Attacher/Detacher interfaces. All the others, including AWS EBS and the flex volume, only implement the volume.Mounter/Unmounter interfaces (probably AWS EBS should be changed to implement the attacher/detacher interfaces).

Regarding the flexvolume, it looks like it has the concept of attach/detach/mount/unmount, but these aren't related to the volume.Attacher/Detacher/Mounter/Unmounter interfaces. In fact, all the flex volume plugin calls are executed inside the flexvolume.SetUpAt (volume.Mounter interface) function.

So it actually looks like flexvolume won't be handled by the attach/detach controller.

Will the flexvolume implementation change in the future to make the plugin's attach/detach/mount/unmount methods map to the volume interfaces (so attach/detach will accept a hostname)?

I think it boils down to if a storage plugin requires fencing (i.e. must be detached if node goes down) then it probably should be implementing the attacher interface.

Agree. I think the fence/unfence wording in the rbd plugin is not correct; it is more like a lock/unlock, since this uses a ceph cluster rbd function to lock/unlock a volume, but there's no function to really block operations (fence) on a volume by a node (for this reason, if a node goes down the user should make sure that it's really down, not partitioned, and then unlock the rbd volume to let it be used by another host).

@mcluseau commented Jun 6, 2016

On 06/05/2016 08:36 PM, Simone Gotti wrote:

Agree, I think that the fence/unfence wording in the rbd plugin is not correct; it is more like a lock/unlock, since this is using a ceph cluster rbd function to lock/unlock a volume, but there's no function to really force detach a volume from a node (for this reason, if a node goes down one should make sure that it's really down (not partitioned) and then unlock the rbd volume to let it be used by another host).

I'd like to note that RBD now has an "exclusive" flag on images; it's not in the kernel yet (at least not in 4.4), but will eventually land there. Also, there's the rbd-nbd introduced in Ceph Jewel that can provide this support (extra note: the NBD interface is also used by CoreOS Torus, so there may be some factorization possible here).

So, at least for Ceph, fencing will be managed by having the failed node time out (30s by default), which will release the exclusive lock and let the new node take the volume. And we have two timelines that will allow that quite soon (i.e. exclusive support in the kernel's RBD client on one track, and Jewel + rbd-nbd on another track).

@saad-ali commented Jun 6, 2016

Correct me if I'm wrong, but looking at the current k8s master it looks like only GCE PD implements the volume.Attacher/Detacher interfaces. All the others, including AWS EBS and the flex volume, only implement the volume.Mounter/Unmounter interfaces (probably AWS EBS should be changed to implement the attacher/detacher interfaces).

The other attachers are in the works, for example: #25888

Regarding the flexvolume, it looks like it has the concept of attach/detach/mount/unmount, but these aren't related to the volume.Attacher/Detacher/Mounter/Unmounter interfaces. In fact, all the flex volume plugin calls are executed inside the flexvolume.SetUpAt (volume.Mounter interface) function.

So it actually looks like flexvolume won't be handled by the attach/detach controller.

Will the flexvolume implementation change in the future to make the plugin's attach/detach/mount/unmount methods map to the volume interfaces (so attach/detach will accept a hostname)?

Yes. Flex also needs to be refactored to split SetUpAt apart and implement the Attacher interface. I believe @rootfs is planning on doing this (he's tackling a few plugins at a time), but if anyone else wants to jump on it, I'd love a PR sometime this week (since v1.3 is being cut on Friday).

@mcluseau commented Jun 7, 2016

On 06/07/2016 05:08 AM, Saad Ali wrote:

but if anyone else wants to jump on it, I'd love a PR sometime this week

should it be "thread-safe"?

@saad-ali commented Jun 7, 2016

should it be "thread-safe"?

Nope, the controller will take care of that.

@saad-ali commented Jun 19, 2016

Closed with #26351, which will be part of v1.3.
