
Reconcile VolumeAttachment with ListVolumes Published Nodes #184

Merged

merged 1 commit into kubernetes-csi:master from davidz627:feature/verifyVolumes Dec 20, 2019

Conversation

@davidz627
Member

davidz627 commented Sep 28, 2019

Main code-paths ready for review.

Tested with one dynamically provisioned volume with the native PD CSI Driver. After detaching the volume on the cloud provider -> external-attacher logs show correct identification of state change and update of VolumeAttachment object -> external-attacher succeeds with immediate reattachment of volume.

WIP:

  • Depends on CSI Spec 1.2.
  • Depends on kubernetes/kubernetes#83593
  • Only reconcile volumes if plugin supports ListVolumePublishedNodes capability
  • Handle migration cases for volume ID
  • Update logging to include verbosity
  • Write some unit tests

/kind feature
/assign @jsafrane @msau42

Reconciles VolumeAttachment attachment status with actual back-end volume attachment state if plugin supports LIST_VOLUMES_PUBLISHED_NODES capability.
@msau42

Collaborator

msau42 commented Oct 1, 2019

cc @misterikkit
Do we also want to handle the reverse scenario? Detach if it got attached out of band?

@davidz627

Member Author

davidz627 commented Oct 1, 2019

Do we also want to handle the reverse scenario? Detach if it got attached out of band?

I thought we didn't want to do this for data safety reasons (don't want to detach a disk if it may be in use)

@msau42

Collaborator

msau42 commented Oct 1, 2019

I think to handle the dangling volumes case we may want to do something like that. Maybe for data safety can we do something like only detach if we're trying to attach it to a different node? cc @gnufied

@davidz627 force-pushed the davidz627:feature/verifyVolumes branch from a59a16d to 363c2fb Oct 5, 2019
@msau42

Collaborator

msau42 commented Oct 5, 2019

/assign @misterikkit

@davidz627 force-pushed the davidz627:feature/verifyVolumes branch 2 times, most recently from 9f887be to 00dba93 Oct 7, 2019
@davidz627

Member Author

davidz627 commented Oct 8, 2019

@msau42 @misterikkit @jsafrane
This is ready for review. The only reason it is still WIP is the dependencies that need to be updated and vendored in. The code for the actual feature is complete 👍

@davidz627 force-pushed the davidz627:feature/verifyVolumes branch 2 times, most recently from 91e58e8 to b936885 Oct 9, 2019
@davidz627 changed the title from "WIP Reconcile VolumeAttachment with ListVolumes Published Nodes" to "[WIP blocked on CSI Spec v1.2] Reconcile VolumeAttachment with ListVolumes Published Nodes" Oct 9, 2019
@davidz627 force-pushed the davidz627:feature/verifyVolumes branch 3 times, most recently from acc6172 to 1c6bae8 Oct 9, 2019
@davidz627 force-pushed the davidz627:feature/verifyVolumes branch from 1c6bae8 to 428b476 Oct 26, 2019
@davidz627 changed the title from "[WIP blocked on CSI Spec v1.2] Reconcile VolumeAttachment with ListVolumes Published Nodes" to "Reconcile VolumeAttachment with ListVolumes Published Nodes" Oct 26, 2019
@davidz627 force-pushed the davidz627:feature/verifyVolumes branch from 939950e to 5e151c6 Oct 29, 2019
Contributor

jsafrane left a comment

How does this new goroutine interact with the informer / SyncNewOrUpdatedVolumeAttachment that can run in parallel? A volume was just attached by the handler, yet this reconciler is processing a slightly older ListVolumes response and marks the volume as detached.

for _, va := range vas {
	nodeID, ok := va.Annotations[vaNodeIDAnnotation]
	if !ok {
		return errors.New("failed to find node ID in VolumeAttachment annotation")

@jsafrane

jsafrane Oct 31, 2019

Contributor

This happens only when the VA was not processed by the main handler, right? IMO the message should say that + I would prefer to continue and process the other VAs.

if !ok {
	return errors.New("failed to find node ID in VolumeAttachment annotation")
}
// TODO(dyzz): Maybe all of these should just be an "error" then continue.

@jsafrane

jsafrane Oct 31, 2019

Contributor

+1, I'd prefer to continue processing other VAs. Otherwise one bad attachment destroys whole attachment reconciliation.

I wonder if it would make sense to put the errors below into VolumeAttachment or to an event for better visibility.

@davidz627

davidz627 Dec 13, 2019

Author Member

ack. The errors are returned and then logged with klog.Errorf at the top level.
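The continue-and-collect pattern agreed on above can be sketched roughly as follows; `volumeAttachment`, `reconcileAll`, and the annotation value are simplified stand-ins for illustration, not the external-attacher's actual types:

```go
package main

import "fmt"

// Illustrative annotation key; the real key lives in the attacher's controller package.
const vaNodeIDAnnotation = "csi.alpha.kubernetes.io/node-id"

// volumeAttachment is a simplified stand-in for storage.VolumeAttachment,
// reduced to the two fields this sketch needs.
type volumeAttachment struct {
	Name        string
	Annotations map[string]string
}

// reconcileAll records a per-VA error and continues, so one bad
// attachment does not abort reconciliation of the remaining VAs.
func reconcileAll(vas []volumeAttachment) []error {
	var errs []error
	for _, va := range vas {
		nodeID, ok := va.Annotations[vaNodeIDAnnotation]
		if !ok {
			// Most likely the VA has not been processed by the main handler yet.
			errs = append(errs, fmt.Errorf("VA %s has no node ID annotation; skipping", va.Name))
			continue
		}
		// ... compare nodeID against the ListVolumes published-nodes state ...
		_ = nodeID
	}
	return errs
}

func main() {
	vas := []volumeAttachment{
		{Name: "va-1", Annotations: map[string]string{vaNodeIDAnnotation: "node-a"}},
		{Name: "va-2"}, // missing annotation: recorded, not fatal
	}
	for _, err := range reconcileAll(vas) {
		fmt.Println(err) // in the attacher this would be klog.Errorf at the top level
	}
}
```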

pkg/controller/csi_handler.go
@msau42 added this to In progress in K8s 1.17 Dec 11, 2019
@davidz627

Member Author

davidz627 commented Dec 13, 2019

A volume was just attached by the handler, yet this reconciler is processing slightly older ListVolumes and marks the volume as detached.

There are two directions of race condition to think about:

1. We have an old ListVolumes result ("actual state" has changed) when we compare that state to the VolumeAttachment

Let's say actual state was updated from `x` to `y`, but we got an "old" actual state `x` while the VA has state `y`.
In this case we would "incorrectly" reconcile the `VA` to state `x`.
This would then get resolved in the next loop, as we would get actual state `y` and reconcile the `VA` to `y`.
Waiting an entire reconcile cycle is probably not ideal, though, and could cause errors.

To fix we can reverse the direction and always get the `VA` first before the actual state. This way we have two possibilities:
  1a. The `VA` changes between get and update and we catch it in the second case below
  1b. The `VA` is up to date, we get actual state `x`, some other controller changes actual state to `y` before we update the `VA` then:
    1bi. `VA` gets updated before we can change it and we're in case 2.
    1bii. We set the "old" `VA` to state `x` and the controller that changed state to `y` can now go and change `VA` state to `y` still.

2. We have an up-to-date ListVolumes result (actual state is accurate) when we compare that state to an old VolumeAttachment
  2a. This should be easy to catch by checking `ResourceVersion` and making sure it hasn't changed. (This is done implicitly, as a Patch error occurs when the resource is out of date.)
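Case 2a's optimistic-concurrency guard can be illustrated with a small in-memory model; `fakeStore` is hypothetical and only mimics how the API server rejects an Update/Patch whose ResourceVersion is stale:

```go
package main

import (
	"fmt"
	"sync"
)

// fakeStore models an API object with optimistic concurrency via a
// resource version, standing in for the API server's conflict check.
type fakeStore struct {
	mu              sync.Mutex
	resourceVersion int
	attached        bool
}

func (s *fakeStore) get() (attached bool, rv int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.attached, s.resourceVersion
}

// update succeeds only if the caller's resource version is current,
// mirroring how a stale Patch/Update is rejected with a conflict error.
func (s *fakeStore) update(attached bool, rv int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if rv != s.resourceVersion {
		return fmt.Errorf("conflict: object modified (rv %d != %d)", rv, s.resourceVersion)
	}
	s.attached = attached
	s.resourceVersion++
	return nil
}

func main() {
	va := &fakeStore{attached: true}
	_, rv := va.get()           // the reconciler reads the VA first...
	va.update(false, rv)        // ...another controller updates it in between...
	err := va.update(false, rv) // ...so the reconciler's stale write is rejected
	fmt.Println(err)            // the reconciler simply retries next cycle
}
```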

wdyt? @jsafrane @msau42

@msau42

Collaborator

msau42 commented Dec 13, 2019

If all of the controller paths are calling either Update() or Patch() with resource version, then I think that should handle the races?

The only issue is case 1bii: this would probably cause the VA update in the attach handler to fail, and we would have to retry the attach operation. I'm not sure how big of an issue that is, though, since the driver should be idempotent, and I'm not sure how frequent this race will be.

We could potentially reduce the window of a race by checking if an operation is in progress, and skipping that volume for this loop or having some other per-volume locking mechanism.

@davidz627 force-pushed the davidz627:feature/verifyVolumes branch from 5e151c6 to 080839f Dec 14, 2019
@jsafrane

Contributor

jsafrane commented Dec 16, 2019

What if ReconcileVA does not update the VolumeAttachment to record that something is wrong, but instead silently puts the VA into the controller workqueue, marked as "force resync"*? The attacher then calls ControllerPublish / ControllerUnpublish first, and updates the VA only if that fails, using the already existing code.

*) ReconcileVA needs to mark the VA so VA handler knows it should call ControllerPublish even when the volume looks attached here:

func (h *csiHandler) syncAttach(va *storage.VolumeAttachment) error {
	if va.Status.Attached {
		// Volume is attached, there is nothing to be done.
		klog.V(4).Infof("%q is already attached", va.Name)
		return nil
	}

(+similar check in detach?)

There won't be any race and volume attachments get fixed automatically; however, the user (or the A/D controller) won't see a short blip that the volume got detached and attached back.

@davidz627

Member Author

davidz627 commented Dec 17, 2019

Thanks for the suggestion Jan. After some careful thought here are the reasons why I think we should continue with the current separate reconciler model:

  1. From a conceptual standpoint it is more sound to me to have one controller whose job is to reconcile actual state to desired state (the attacher) and a separate loop that reconciles known state to actual state (ReconcileVA). This is simple and results in eventual consistency even if there are race conditions. The model of injecting imperative forced re-syncs into the attacher work-queue breaks abstractions and increases complexity within the attacher (it now has to branch on the force resync).
  2. I think a "short blip that volume got detached and back attached." is actually desirable from a debugging/logs/user perspective since we can tell that the actual state got out of sync with the VolumeAttachment state and we can see the timeline of when that was detected + remedied. This is better than trying to fix it silently IMO. If the fix goes wrong it is harder to figure out at what step things broke down.
    1. Consider the case where VolumeAttachment.Status.Attached=true but the actual state has the volume detached. If there is a long-running Attach error that recurs every loop, we would be stuck retrying Attach in the background while not updating the state for the user or any other controllers that may be watching this object.

I can definitely be convinced of the force resync model but the above are my major concerns. In summary, it feels like a smell to have a "force" something flag as well as to hide actual state (when we know it) from the user and attempt to silently fix it. Let me know what you think.

@jsafrane

Contributor

jsafrane commented Dec 17, 2019

From a conceptual standpoint it is more sound to me to have one controller that's job is to reconcile actual state to the desired state (attacher) and a separate loop that reconciles the known state to the actual state (reconcileVA). This is simple and results in eventual consistency even if there are race conditions.

Not really. I can agree that va.Status.Attached gets eventually fixed even with races. But what about va.Status.AttachmentMetadata? You don't get that from CSI via ListVolumes. The first Attached = true must be saved by the VA handler, together with the metadata. And I would argue that all of them should be saved by it; what if AttachmentMetadata changes? (Not that we have a facility in the A/D controller / kubelet to use that information, but that's a separate issue.)

So, can the ListVolumes loop write only Attached = false and ignore Attached = true?

  • During attach, if a volume is attached and the VA is not yet marked so, we can be sure that the VA is in the attacher workqueue.
  • During detach, if a volume is attached and the VA is already marked as detached, that means that Detach succeeded and removed its finalizer from the VA, and the VA most likely already does not exist (it had no finalizer and a set deletion timestamp).

I think a "short blip that volume got detached and back attached." is actually desirable from a debugging/logs/user perspective

I agree.

the case when we have a VolumeAttachment.Status.Attached=true but the actual state has the volume detached. If there is a long-running Attach error that happens every loop, we would be stuck in a state retrying Attach in the background.

That could be fixed if we wanted...

@davidz627

Member Author

davidz627 commented Dec 18, 2019

Ack. I changed the functionality to add volumes back to the workqueue with a forced resync flag. This causes the intended behavior of reattaching "accidentally detached" volumes with minimal API updates and no known races. For the detach workflow the forced resync is unlikely to be necessary, as it's very difficult to get into a state where Status.Attached=false, the finalizer still exists, and the volume is still attached. However, I don't think it's wrong to cover this case as well.

@davidz627 force-pushed the davidz627:feature/verifyVolumes branch 3 times, most recently from e1a6665 to f359106 Dec 18, 2019
Contributor

jsafrane left a comment

lgtm-ish, with a minor nit.

pkg/attacher/lister.go
pkg/controller/csi_handler.go
@davidz627 force-pushed the davidz627:feature/verifyVolumes branch from f359106 to 836557e Dec 18, 2019
@davidz627

Member Author

davidz627 commented Dec 18, 2019

/assign @misterikkit

@davidz627 force-pushed the davidz627:feature/verifyVolumes branch from 836557e to ab2ed5f Dec 18, 2019

misterikkit left a comment

how did it get up to 22 comments? Those were there when I got here, promise.

pkg/attacher/lister.go
pkg/controller/controller.go
pkg/controller/controller.go
// status of the volume is different from the state on the VolumeAttachment the
// VolumeAttachment object is patched to the correct state.
func (h *csiHandler) ReconcileVA() error {
	ctx, cancel := context.WithTimeout(context.Background(), h.timeout)

@misterikkit

misterikkit Dec 19, 2019

In theory we should be passing context into this func. Using Background discards any notion of what could be happening at higher levels.

@misterikkit

misterikkit Dec 19, 2019

I see the Run func takes a stop channel rather than a context. That is worth fixing before or at the same time as plumbing context into this function.

@davidz627

davidz627 Dec 19, 2019

Author Member

@misterikkit could you create an issue on this repo for that?

pkg/controller/csi_handler.go
pkg/controller/csi_handler.go
@@ -114,8 +215,8 @@ func (h *csiHandler) SyncNewOrUpdatedVolumeAttachment(va *storage.VolumeAttachme
}

 func (h *csiHandler) syncAttach(va *storage.VolumeAttachment) error {
-	if va.Status.Attached {
-		// Volume is attached, there is nothing to be done.
+	if !h.consumeForceSync(va.Name) && va.Status.Attached {

@misterikkit

misterikkit Dec 19, 2019

Can we be sure that the va we pulled from the queue is the one we pushed in ReconcileVA? What if some other operation is sitting near the front of the queue when we reconcile? Hopefully that's fine?

  • if the two queued vas agree on state, then the second one will be skipped but that's fine
  • if the two queued vas disagree on state, then they will both pass the already attached/detached check?

@davidz627

davidz627 Dec 19, 2019

Author Member

The forceSync flag is keyed by vaName, so it will not affect any other items on the queue. When the queue eventually pops the item we have requeued for force sync, it will go into this modified logic.
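A minimal sketch of a per-vaName force-sync mark as described above; `forceSyncTracker` and its method names are illustrative, not the actual external-attacher implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// forceSyncTracker marks VA names that ReconcileVA wants re-processed.
// The handler consumes the mark so it calls ControllerPublish even when
// va.Status.Attached is already true, and a single mark triggers exactly
// one forced resync.
type forceSyncTracker struct {
	mu    sync.Mutex
	names map[string]bool
}

func newForceSyncTracker() *forceSyncTracker {
	return &forceSyncTracker{names: map[string]bool{}}
}

// mark flags a VA (by name) for forced resync; called by the reconciler
// just before requeueing the VA into the workqueue.
func (t *forceSyncTracker) mark(vaName string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.names[vaName] = true
}

// consume reports whether the VA was marked and clears the mark, so other
// items on the queue with different names are unaffected.
func (t *forceSyncTracker) consume(vaName string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	marked := t.names[vaName]
	delete(t.names, vaName)
	return marked
}

func main() {
	t := newForceSyncTracker()
	t.mark("va-1")
	fmt.Println(t.consume("va-1"), t.consume("va-1"), t.consume("va-2"))
}
```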

}

for _, va := range vas {
	nodeID, ok := va.Annotations[vaNodeIDAnnotation]

@misterikkit

misterikkit Dec 19, 2019

When does nodeID move from annotation to spec?

@davidz627

davidz627 Dec 19, 2019

Author Member

VA.Spec.NodeName is different from NodeID, I think. NodeName refers to the Kubernetes node name; the node ID is the reference for that node given by the CSI driver... I think :) In the PD case the nodeID is fully qualified with project and zone, but the node name is not.

@msau42 in case I'm misrepresenting this

pkg/attacher/lister.go
pkg/controller/csi_handler.go
@davidz627 force-pushed the davidz627:feature/verifyVolumes branch from 019916a to fdbf3e1 Dec 19, 2019
…cing items into the VA workqueue to be processed by main thread if ListVolumesPublishedNodes capability is specified
@davidz627 force-pushed the davidz627:feature/verifyVolumes branch from fdbf3e1 to 12abab4 Dec 19, 2019
@jsafrane

Contributor

jsafrane commented Dec 20, 2019

/lgtm

@k8s-ci-robot added the lgtm label Dec 20, 2019
@jsafrane

Contributor

jsafrane commented Dec 20, 2019

/approve

@k8s-ci-robot

Contributor

k8s-ci-robot commented Dec 20, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidz627, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 2d2ab15 into kubernetes-csi:master Dec 20, 2019
6 of 7 checks passed

tide: Not mergeable.
cla/linuxfoundation: davidz627 authorized
continuous-integration/travis-ci/pr: The Travis CI build passed
pull-kubernetes-csi-external-attacher-1-15-on-kubernetes-1-15: Job succeeded.
pull-kubernetes-csi-external-attacher-1-16-on-kubernetes-1-16: Job succeeded.
pull-kubernetes-csi-external-attacher-1-17-on-kubernetes-1-17: Job succeeded.
pull-kubernetes-csi-external-attacher-unit: Job succeeded.
K8s 1.17 automation moved this from In progress to Done Dec 20, 2019
@davidz627 deleted the davidz627:feature/verifyVolumes branch Dec 20, 2019