CSI block volume refactor to fix target path #68635
Conversation
This PR is tested with the CSI RBD driver and CSI provisioner below. Please also take a look. [csi rbd driver] [csi provisioner]
/kind bug
Overall I think this is getting closer. Let me think about the issue of delegating the creation of the block file to the driver. While this works, it means that driver pods will be forced to have privileged access to the node.
pkg/volume/csi/csi_block.go
Outdated
specName := m.specName
glog.V(4).Infof(log("blockMapper.GetPodDeviceMapPath [path=%s; name=%s]", path, specName))
return path, specName
fileName := "file"
Returning the value "file" seems extremely specific and may not be useful for all drivers. Returning the volumeID may be needed during operations such as TearDown.
Thank you for your comment. volumeID seems to be included in the path as pods/{podUid}/volumeDevices/kubernetes.io~csi/{volumeID}/dev, so I think that just making the filename "file" should be fine, because the volumeID could be resolved from the path name. However, there is no disadvantage to making it the volumeID, so I'm willing to change it, as it might be safer.
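The naming question above can be sketched as follows. This is a hypothetical helper, not the actual csi_block.go code (the path layout and names are assumed for illustration): the returned file name is the volumeID rather than the fixed string "file", so it can be recovered during TearDown.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// podDeviceMapPath is a hypothetical sketch of the pod device map path logic
// discussed above: the directory encodes the volumeID, and the returned file
// name is the volumeID itself rather than the fixed string "file".
func podDeviceMapPath(kubeletRoot, podUID, volumeID string) (dir, fileName string) {
	dir = filepath.Join(kubeletRoot, "pods", podUID, "volumeDevices", "kubernetes.io~csi", volumeID)
	return dir, volumeID
}

func main() {
	dir, name := podDeviceMapPath("/var/lib/kubelet", "pod-uid-1234", "pvc-abc")
	fmt.Println(dir)
	fmt.Println(name)
}
```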
@@ -139,18 +139,6 @@ func (m *csiBlockMapper) SetUpDevice() (string, error) {
}
glog.V(4).Info(log("blockMapper.SetupDevice created global device map path successfully [%s]", globalMapPath))

// create block device file
Delegating the creation of the block device to the driver may be a good approach. However, before this is removed, I think it should be clarified/documented that this step will be delegated to the driver.
I will add comments to SetUpDevice and MapDevice before calling NodeStageVolume and NodePublishVolume.
pkg/volume/csi/csi_block.go
Outdated
@@ -201,8 +189,8 @@ func (m *csiBlockMapper) MapDevice(devicePath, globalMapPath, volumeMapPath, vol
defer cancel()

globalMapPathBlockFile := devicePath
dir, _ := m.GetPodDeviceMapPath()
targetBlockFilePath := filepath.Join(dir, "file")
dir, file := m.GetPodDeviceMapPath()
Again, returning "file" will not work properly for drivers that need the volumeID during TearDown.
I will rename file to volumeID when GetPodDeviceMapPath is changed to return the volumeID.
This portion needs careful attention so as not to affect other drivers.
@@ -887,24 +888,26 @@ func (og *operationGenerator) GenerateMapVolumeFunc(
return volumeToMount.GenerateError("MapVolume failed", fmt.Errorf("Device path of the volume is empty"))
}

// When kubelet is containerized, devicePath may be a symlink at a place unavailable to
// kubelet, so evaluate it on the host and expect that it links to a device in /dev,
// Map device to global and pod device map path
Changes here will affect how all block volume devices work including in-tree drivers. Any changes here should be tested to make sure other drivers are working as before.
As for in-tree drivers, I added e2e tests for block volume, so it would be tested there. (The block volume feature missed becoming beta in v1.12, so it won't be automatically tested by CI, though.)
In addition, it won't be tested in combination with a containerized kubelet, so we might need to find a way to test in a containerized kubelet test environment. Let me think about how to test it.
As for CSI drivers, I'm working on applying the same testsuites that in-tree drivers use to CSI drivers, here, and it will make all CSI drivers be tested for block volume. (Actually, no CSI drivers there support block volume yet, though.)
I implemented the above approach in another commit in a separate branch. PTAL. A summary of the conceptual difference is as below: [Previous Implementation]
[New Implementation above]
@mkimuram I am reviewing your proposed changes today. I will post feedback here and on the other PR where appropriate.
@mkimuram
Other than that, I think this may be workable. Good progress.
Thank you for your comment.
I will fix based on your comment and make a PR.
Actually, as for the changes in operation_generator.go, I found another two issues below while testing with my rbd prototype code, and these changes are the fixes for those issues.
--> As specName seems to be expected there, I fixed it accordingly.
--> As the descriptor lock seems to be required to be released before
Anyway, we need to check and test them carefully to make sure they won't affect any existing drivers.
I will fix it as such. Thank you for pointing it out.
/ok-to-test
Force-pushed 7382008 to 994ceaa (compare).
I've updated the commit to the second version and added a unit test for it. PTAL. Please also confirm the log below.
3.2 symlink to the path provided by
3.3 symlink to the path provided by
4.2 symlink to the path provided by
4.3 symlink to the path provided by
5.2 Unmap:
/retest
As for the change in operation_generator.go, I think the below two points need to be confirmed:
As a result of my investigation, it should be safe for both (a) and (b). (Also, I ran the e2e block tests for iscsi and rbd and they passed.) Let me explain the reason. There are two types of plugins:
And
Therefore:
For (b), we might be able to keep the descriptor lock as is in
// Call NodePublishVolume
publishPath, err := m.publishVolumeForBlock(csiSource, attachment, stagingPath)
if err != nil {
	return "", err
should the volume be unstaged here?
Thank you for your comment.
IIUC, the reconciler will keep calling MountVolume to try to make ASW and DSW the same, even when MountVolume fails in SetUpDevice. So, even if we call NodeUnstageVolume here to unstage the volume that failed in publish, another attempt at MountVolume would stage it again. (Then, MountVolume may fail in publish again.) In addition, if NodeUnstageVolume for the Nth MountVolume is called before NodePublishVolume and after NodeStageVolume of the N+1th MountVolume, it will make NodePublishVolume be called for an unstaged volume, which violates the CSI spec. Therefore, just calling NodeUnstageVolume here won't roll back the status properly.
However, this raises a question for me of whether it is safe to keep the volume staged after the failure in publish. At a glance, it seems safe from a data-corruption viewpoint, since we won't directly touch the staged device and staging path.
The worst-case scenarios that I noticed are below:
- If the PVC deletion is done before the publish-failed pod deletion, the device might be deleted while it is attached to a node.
- If the PVC is used in another pod scheduled on a different node before the publish-failed pod deletion, the device might be attached to multiple nodes.
Is my understanding correct? And are there any issues that you can think of by not unstaging the volume there?
Sorry I'm late to the review; I haven't tested it yet, but I believe it makes sense to release the loopback device before detaching the underlying device. I see this PR does two things: a CSI block refactoring and the operation generator reordering. What about splitting them into separate PRs?
In the 11/14 CSI meeting we agreed that staging paths should always be directories, even for block volumes. This PR clarifies the wording: container-storage-interface/spec#335
To conform to the spec, this PR should be updated to create a directory for the staging path.
pkg/volume/csi/csi_block.go
Outdated
// create block device file
blockFile, err := os.OpenFile(globalMapPathBlockFile, os.O_CREATE|os.O_RDWR, 0750)
// create an empty file on staging path where block device is bind mounted to
stagingPathFile, err := os.OpenFile(stagingPath, os.O_CREATE|os.O_RDWR, 0750)
Based on the new wording in the CSI spec, this should be a directory always.
Force-pushed be6e010 to d8c6fe5 (compare).
Changed.
@mkimuram We finally settled on another spec change, which is that the CO should not create the 0-byte file during NodePublishVolume() either. It will be the SP's responsibility to create and delete the file at target_path. The CO must create the parent directory of target_path.
Updated. PTAL
I updated my test driver to match the spec, and ran it against this patch, and it worked great. That's all I have time for tonight. If there's time tomorrow I'll take a look at the code changes.
/lgtm
@bswartz: changing LGTM is restricted to assignees, and only kubernetes/kubernetes repo collaborators may be assigned issues. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thank you for your review. @bswartz @saad-ali @vladimirvivien One thing that I would like to confirm last is that
I guess that the SP should already have permission to the parent directory, so from a security point of view there will be no difference even if we choose to use a deeper path. From a safety point of view, should we separate them per CSI driver (to protect from different drivers, at least), or should we create one more directory for each volume like
Also, the current implementation assumes that the driver "Bidirectional"-mounts that directory, so we still have a chance to choose a completely different directory from
/approve
/approve cancel
Why not align this with what mount does?
pkg/volume/csi/csi_block.go
Outdated
}

// getPublishPath returns a publish path for a file (on the node) that should be used on NodePublishVolume/NodeUnpublishVolume
// Example: plugins/kubernetes.io/csi/volumeDevices/publish/{volumeID}
If multiple containers use this, it will collide.
Thank you for your comment.
Is my understanding right that you are suggesting that we should use spec.PersistentVolume.Name instead of kstrings.EscapeQualifiedNameForDisk(m.specName), as is done in makeDeviceMountPath for filesystem volumes?
Then, it will be as below:
- block:
/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/staging/{pvname}
/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/{pvname}
- filesystem:
/var/lib/kubelet/plugins/kubernetes.io/csi/pv/{pvname}/globalmount
m.specName is originally used there because the in-tree block volume code uses it for GlobalMapPath and PodDeviceMapPath.
- GlobalMapPath:
/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/{volumeID}/dev
- PodDeviceMapPath:
/var/lib/kubelet/pods/{podUid}/volumeDevices/kubernetes.io~csi/{volumeID}
However, as explained in #68635 (comment), for block volume we chose not to let the CSI driver handle GlobalMapPath and PodDeviceMapPath directly; instead, the CSI driver makes the block volume visible at another path (target_path), and then k8s (ioutil.MapBlockVolume) makes target_path visible from GlobalMapPath and PodDeviceMapPath. So, there is no restriction on using m.specName for target_path and staging_target_path.
@bswartz @vladimirvivien
Any concern?
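The path layout described above can be sketched as follows. The constant, function names, and kubelet root are assumptions for illustration; the real code derives these from the kubelet root and kstrings.EscapeQualifiedNameForDisk(m.specName).

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Assumed layout based on the discussion above: staging_target_path and
// target_path live under the CSI volumeDevices directory, keyed by the
// escaped spec name.
const volumeDevicesDir = "plugins/kubernetes.io/csi/volumeDevices"

// getStagingPath returns the assumed staging_target_path for a volume.
func getStagingPath(kubeletRoot, specName string) string {
	return filepath.Join(kubeletRoot, volumeDevicesDir, "staging", specName)
}

// getPublishPath returns the assumed target_path for NodePublishVolume.
func getPublishPath(kubeletRoot, specName string) string {
	return filepath.Join(kubeletRoot, volumeDevicesDir, "publish", specName)
}

func main() {
	fmt.Println(getStagingPath("/var/lib/kubelet", "pvc-abc"))
	fmt.Println(getPublishPath("/var/lib/kubelet", "pvc-abc"))
}
```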
I checked the log and code again and found out that the block code already uses the same ID as the filesystem code. I mean, {pvname} should be the same as {volumeID}, like pvc-a618a282-e929-11e8-bb66-525400b854f0, so it should not collide.
Is it OK to just update the comment to make them look the same?
Would spec.PersistentVolume.Name collide if the PVC is used by multiple pods? Ideally the target_path should be unique per pod (have the pod ID in it), right? Otherwise, if 2 pods use the same volume on the same node, each pod will have its volume setup code running in parallel.
I'm basing this on my understanding of mount, so it may well be different for block. If so, I'd like to understand why.
@saad-ali @bswartz @vladimirvivien Fixed the comment. PTAL
/retest
@saad-ali @bswartz @vladimirvivien is this fix good to go?
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: mkimuram, saad-ali, vladimirvivien. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/lgtm Based on verification from @bswartz
What this PR does / why we need it:
Fix for adding block volume support to CSI RBD driver
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #68424
Special notes for your reviewer:
/sig storage
This PR fixes the following 3 issues:
[Issues]
- symlinkPath used in MapDevice doesn't match the one makeBlockVolumes expects
- No need to create a file in either targetBlockFilePath nor globalMapPathBlockFile (withdrawn: it is not an issue; the CSI spec requires the CO to create an empty file to bind mount the device on, so the file should be created)
- EvalHostSymlinks on device fails for CSI drivers that don't implement NodeStageVolume
Release note: