Fix discovery/deletion of iscsi block devices #63176
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
CLA issue should be fixed |
/assign @saad-ali |
/sig storage |
/assign @humblec |
cc @bmarzins |
I am not a multipath expert, looping in @bmarzins |
// The list of block devices on the scsi bus will be in a
// directory called "target%d:%d:%d".
// See drivers/scsi/scsi_scan.c in Linux
// We assume the channel/bus and device/controller are always zero for iSCSI
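To make the naming assumption in that comment concrete, here is a minimal sketch (a hypothetical helper, not code from this PR) that derives the sysfs target directory name from the SCSI host number alone, with channel and id fixed at zero:

```go
package main

import "fmt"

// buildTargetDir returns the sysfs directory name for a SCSI target,
// assuming (per the comment above) that the channel/bus and
// device/controller components are always zero for iSCSI sessions.
func buildTargetDir(hostNumber int) string {
	// Matches the "target%d:%d:%d" naming from drivers/scsi/scsi_scan.c.
	return fmt.Sprintf("target%d:0:0", hostNumber)
}

func main() {
	// e.g. an iSCSI session attached via SCSI host 3
	fmt.Println(buildTargetDir(3)) // prints "target3:0:0"
}
```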
Is this true for iSCSI offloading cards?
I'm trying to determine this. I don't have an offload card but I've ordered some and when I get them installed I'll do experiments.
Okay I was able to get my hands on some iSCSI offload cards and confirm that channel/bus and device/controller remain zero for those cards too. Each "port" on the card shows up as a separate SCSI host, and LUNs are numbered according to the iSCSI LUN number as you would expect.
@humblec do you have a multipath setup to test this? |
/ok-to-test |
@bswartz Thanks.. I am starting the review with a question :)
Is it really required to flush the device mapper multipath entries? AFAIK, the device mapper will take care of flushing stale device entries after some period. I am trying to understand why we need to do it in code; if it is required, every storage admin would have to perform this operation manually as soon as an unmount occurs, which I think is not the case. Also, without 'mandatory' flushing, are we seeing any functional issues in the iscsi workflow? |
@humblec , AFAIK multipathd reacts to udev events and therefore can lag behind processing them. IMHO it is best to delete multipath device explicitly to avoid any potential issues and bugs in other parts of the chain. |
@humblec What redbaron says is true. Also, I was following RedHat's recommended practices from this document: |
@redbaron @bswartz It's definitely good to have; however, my attempt was to find out whether it's mandatory, or whether there are any potential issues in its absence. The guide pointed to above belongs to RHEL 5; the device mapper and the rest of the software stack would have seen a good amount of change in later versions. |
@humblec I don't think it is necessary, and ideally multipathd would remove the device once all paths are deleted. Removing the device explicitly in the code has the following benefits in my view:
|
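To make the explicit-flush step discussed above concrete, here is a minimal Go sketch (a hypothetical helper, not the PR's actual code) that builds the `multipath -f` invocation used to flush a device-mapper multipath map before its underlying block devices are deleted; actually running it requires multipath-tools on the node:

```go
package main

import "fmt"

// multipathFlushArgs returns the argv for flushing a multipath device
// map (e.g. "dm-2"). A caller would pass this to exec.Command; we only
// build the command line here, since execution needs multipath-tools
// and root privileges.
func multipathFlushArgs(dmDevice string) []string {
	return []string{"multipath", "-f", dmDevice}
}

func main() {
	fmt.Println(multipathFlushArgs("dm-2")) // prints "[multipath -f dm-2]"
}
```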
pkg/volume/iscsi/iscsi_util.go (Outdated)
// Build a map of SCSI hosts for each target portal. We will need this to
// issue the bus rescans.
portalHostMap, err := b.deviceUtil.GetIscsiPortalHostMapForTarget(b.Iqn)
if nil != err {
err != nil
looks consistent.
Fixed
pkg/volume/iscsi/iscsi_util.go (Outdated)
if nil != err {
	return "", err
}
glog.V(6).Infof("AttachDisk portal->host map for %s is %v", b.Iqn, portalHostMap)
We can place this at V(3) or V(4).
Fixed
pkg/volume/util/device_util_linux.go (Outdated)
@@ -80,3 +83,202 @@ func (handler *deviceHandler) FindSlaveDevicesOnMultipath(dm string) []string {
}
return devices
}

// GetIscsiPortalHostMapForTarget given a target iqn, find all the scsi hosts logged into
// that target. Returns a map by target portal to
Can you please expand the source code comment to describe what the portal map looks like?
Fixed
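For illustration (the portal addresses and host numbers below are hypothetical), the map returned by GetIscsiPortalHostMapForTarget associates each target portal with the SCSI host number of the session logged into it, and a comma-ok lookup tells the caller whether a given portal is logged in:

```go
package main

import "fmt"

func main() {
	// Hypothetical example: two portals for the same target IQN,
	// each reached through a different SCSI host.
	portalHostMap := map[string]int{
		"10.0.0.1:3260": 3,
		"10.0.0.2:3260": 4,
	}
	// Comma-ok lookup: loggedIn is false when the portal has no session.
	hostNumber, loggedIn := portalHostMap["10.0.0.1:3260"]
	fmt.Println(hostNumber, loggedIn) // prints "3 true"
}
```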
pkg/volume/iscsi/iscsi_util.go (Outdated)
if err != nil {
	glog.Errorf("iscsi: failed to rescan session with error: %s (%v)", string(out), err)
hostNumber, loggedIn := portalHostMap[tp]
if !loggedIn {
How are we addressing the race condition here? I mean: before we log in, another routine could do the login; or we could assume it's already logged in, and before the next operation it gets logged out.
I'm not sure the old code addressed that case. When I wrote this I had assumed there was locking at a higher level. I can add proper protection from races here in attach/detach but it will either require a Big Giant Lock or some very complex locking on the individual portals.
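A sketch of the "complex locking on the individual portals" alternative mentioned above (hypothetical, not part of the PR): a keyed mutex that serializes login/scan/logout for the same portal while letting different portals proceed in parallel:

```go
package main

import (
	"fmt"
	"sync"
)

// portalLocks hands out one mutex per portal address, so operations on
// the same portal serialize while different portals run concurrently.
type portalLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// get returns the mutex for a portal, creating it on first use.
func (p *portalLocks) get(portal string) *sync.Mutex {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.locks == nil {
		p.locks = make(map[string]*sync.Mutex)
	}
	if _, ok := p.locks[portal]; !ok {
		p.locks[portal] = &sync.Mutex{}
	}
	return p.locks[portal]
}

func main() {
	pl := &portalLocks{}
	l := pl.get("10.0.0.1:3260")
	l.Lock()
	// ... login/scan for this portal would happen here ...
	l.Unlock()
	// Repeated lookups return the same mutex for the same portal.
	fmt.Println(pl.get("10.0.0.1:3260") == l) // prints "true"
}
```

This avoids a Big Giant Lock at the cost of tracking one mutex per portal.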
lastErr = fmt.Errorf("iscsi: failed to sendtargets to portal %s output: %s, err %v", tp, string(out), err)
continue
}
err = updateISCSINode(b, tp)
Can updateISCSIDiscoverydb() or updateISCSINode() have an inconsistent view of the portal map compared to the iscsi db? |
I didn't write that code -- my change just moved it. Based on what I know about iscsiadm, it stores some things in its databases, but probably not the SCSI hosts -- those it gets at runtime.
Are you worried about a new bug here, or are you considering an alternative implementation?
status/approved-for-milestone |
@rootfs I think the command is "/status approved-for-milestone" |
/status approved-for-milestone |
You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to add status labels. |
/retest |
This change ensures that iSCSI block devices are deleted after unmounting, and implements scanning of individual LUNs rather than scanning the whole iSCSI bus.

In cases where an iSCSI bus is in use by more than one attachment, detaching used to leave behind phantom block devices, which could cause I/O errors, long timeouts, or even corruption in the case when the underlying LUN number was recycled. This change makes sure to flush references to the block devices after unmounting.

The original iSCSI code scanned the whole target every time a LUN was attached. On storage controllers that export multiple LUNs on the same target IQN, this led to a situation where nodes would see SCSI disks that they weren't supposed to -- possibly dozens or hundreds of extra SCSI disks. This caused 3 significant problems:

1) The large number of disks wasted resources on the node and caused a minor drag on performance.

2) The scanning of all the devices caused a huge number of uevents from the kernel, causing udev to bog down for multiple minutes in some cases, triggering timeouts and other transient failures.

3) Because Kubernetes was not tracking all the "extra" LUNs that got discovered, they would not get cleaned up until the last LUN on a particular target was detached, causing a logout. This led to significant complications: In the time window between when a LUN was unintentionally scanned, and when it was removed due to a logout, if it was deleted on the backend, a phantom reference remained on the node. In the best case, the phantom LUN would cause I/O errors and timeouts in the udev system. In the worst case, the backend could reuse the LUN number for a new volume, and if that new volume were to be scheduled to a pod with a phantom reference to the old LUN by the same number, the initiator could get confused and possibly corrupt data on that volume.

To avoid these problems, the new implementation only scans for the specific LUN number it expects to see.
It's worth noting that the default behavior of iscsiadm is to automatically scan the whole bus on login. That behavior can be disabled by setting node.session.scan = manual in iscsid.conf, and for the reasons mentioned above, it is strongly recommended to set that option. This change still works regardless of the setting in iscsid.conf, and while automatic scanning will cause some problems, this change doesn't make the problems any worse, and can make things better in some cases.
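The iscsid.conf setting referenced above looks like this (the file typically lives at /etc/iscsi/iscsid.conf):

```
# Disable automatic bus-wide scans on login; the kubelet then scans
# only the specific LUN it needs.
node.session.scan = manual
```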
Squashed 5 commits down to 1 |
/test pull-kubernetes-e2e-gce |
@jsafrane any more comments? |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bswartz, rootfs |
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue (batch tested with PRs 64844, 63176). If you want to cherry-pick this change to another branch, please follow the instructions here. |
@bswartz: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. |
Big thanks to everybody who was involved in fixing it |
This PR modifies the iSCSI attach/detach codepaths in the following ways:
- On detach, flush the multipath device mapper entry (if it exists) and delete all block devices so the kernel forgets about them.
- On attach, before attempting to scan for the new LUN, first determine if the target is already logged into, and if not, do the login first. Once every portal is logged into, the scan is done.
- Scan only the specific LUN, not the whole bus. This avoids discovering LUNs that kubelet has no interest in.
- ... for the new functionality.
- ... work.

Fixes #59946
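A hedged sketch of the attach-side login step described above (a hypothetical helper; the flags follow standard iscsiadm usage and are not necessarily the PR's exact invocation, and the portal/IQN values are made up):

```go
package main

import "fmt"

// iscsiLoginArgs builds the iscsiadm invocation that logs the node
// into a target portal. The kubelet would run this only for portals
// that are absent from the portal->host map, i.e. not yet logged in.
func iscsiLoginArgs(portal, iqn string) []string {
	return []string{"iscsiadm", "-m", "node", "-p", portal, "-T", iqn, "--login"}
}

func main() {
	fmt.Println(iscsiLoginArgs("10.0.0.1:3260", "iqn.2003-01.example:vol-1"))
}
```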