Fix discovery/deletion of iscsi block devices #63176

Merged — 1 commit merged into kubernetes:master from the bug/59946 branch on Jul 25, 2018

Conversation

Contributor

@bswartz bswartz commented Apr 26, 2018

This PR modifies the iSCSI attach/detach codepaths in the following
ways:

  1. After unmounting a filesystem on an iSCSI block device, always
    flush the multipath device mapper entry (if it exists) and delete
    all block devices so the kernel forgets about them.
  2. When attaching an iSCSI block device, instead of blindly
    attempting to scan for the new LUN, first determine whether the
    target is already logged in, and log in if not. The scan is
    performed only once every portal is logged in.
  3. Scans are now done for specific devices, instead of the whole
    bus. This avoids discovering LUNs that kubelet has no interest in.
  4. Additions to the underlying utility interfaces, with new tests
    for the new functionality.
  5. Some existing code was shifted up or down, to make the new logic
    work.
  6. A typo in an existing exec call on the attach path was fixed.

Fixes #59946

When attaching iSCSI volumes, kubelet now scans only the specific
LUNs being attached, and also deletes them after detaching. This avoids
dangling references to LUNs that no longer exist, which used to be the
cause of random I/O errors/timeouts in kernel logs, slowdowns during
block-device related operations, and very rare cases of data corruption.
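
As background for the scanning change: a LUN-specific scan is issued by writing the channel, SCSI id, and LUN number to the host's sysfs scan file, rather than the "- - -" wildcard that probes the entire bus. Below is a minimal sketch of that mechanism, assuming channel and id are zero as discussed later in the review — an illustration, not the PR's actual code:

import (
	"fmt"
	"os"
)

// scanLUN asks the kernel to probe a single LUN on the given SCSI host.
// Writing "0 0 <lun>" instead of the "- - -" wildcard limits discovery
// to the one device kubelet cares about; channel and id are assumed zero.
func scanLUN(hostNumber, lun int) error {
	scanFile := fmt.Sprintf("/sys/class/scsi_host/host%d/scan", hostNumber)
	return os.WriteFile(scanFile, []byte(fmt.Sprintf("0 0 %d", lun)), 0200)
}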

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 26, 2018
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 26, 2018
@bswartz
Contributor Author

bswartz commented Apr 26, 2018

CLA issue should be fixed

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 26, 2018
@bswartz
Contributor Author

bswartz commented Apr 26, 2018

/assign @saad-ali

@krmayankk

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Apr 26, 2018
@bswartz bswartz changed the title Bug/59946 Fix discovery/deletion of iscsi block devices Apr 26, 2018
@humblec
Contributor

humblec commented Apr 26, 2018

/assign @humblec

@rootfs
Contributor

rootfs commented Apr 26, 2018

cc @bmarzins

@rootfs
Contributor

rootfs commented Apr 26, 2018

I am not a multipath expert, looping in @bmarzins

// The list of block devices on the scsi bus will be in a
// directory called "target%d:%d:%d".
// See drivers/scsi/scsi_scan.c in Linux
// We assume the channel/bus and device/controller are always zero for iSCSI
Contributor

Is this also true for iSCSI offload cards?

Contributor Author

I'm trying to determine this. I don't have an offload card, but I've ordered some, and once they're installed I'll run experiments.

Contributor Author

Okay I was able to get my hands on some iSCSI offload cards and confirm that channel/bus and device/controller remain zero for those cards too. Each "port" on the card shows up as a separate SCSI host, and LUNs are numbered according to the iSCSI LUN number as you would expect.
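
To make the zero channel/id assumption concrete, here is a hypothetical helper showing how the block-device directory for a single LUN can be located once channel and id are pinned to zero. The names and exact sysfs layout are my illustration, not the PR's code:

import (
	"fmt"
	"path/filepath"
)

// lunBlockDevices returns the kernel block-device names (e.g. "sdb") for
// one LUN, relying on channel/bus and device/controller being zero, as
// confirmed above for both software iSCSI and offload cards.
func lunBlockDevices(host, lun int) ([]string, error) {
	pattern := fmt.Sprintf(
		"/sys/class/scsi_host/host%d/device/session*/target%d:0:0/%d:0:0:%d/block/*",
		host, host, host, lun)
	paths, err := filepath.Glob(pattern)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(paths))
	for _, p := range paths {
		names = append(names, filepath.Base(p))
	}
	return names, nil
}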

@rootfs
Contributor

rootfs commented Apr 27, 2018

@humblec do you have a multipath setup to test this?

@rootfs
Contributor

rootfs commented Apr 30, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 30, 2018
@humblec
Contributor

humblec commented May 7, 2018

@bswartz Thanks. I am starting the review with a question :)

After unmounting a filesystem on an iSCSI block device, always flush the multipath device mapper entry (if it exists) and delete all block devices so the kernel forgets about them.

Is it really required to flush the device-mapper multipath entries? AFAIK, the device mapper will take care of flushing stale device entries after some period. I am trying to understand why we need to do it in code; if it were required, every storage admin would have to perform this operation manually as soon as an unmount occurs, which I don't think is the case. Also, without mandatory flushing, are we seeing any functional issues in the iSCSI workflow?

@redbaron
Contributor

redbaron commented May 7, 2018

@humblec, AFAIK multipathd reacts to udev events and can therefore lag behind in processing them. IMHO it is best to delete the multipath device explicitly to avoid potential issues and bugs in other parts of the chain.

@bswartz
Contributor Author

bswartz commented May 8, 2018

@humblec What redbaron says is true. Also, I was following Red Hat's recommended practices from this document:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/online_storage_reconfiguration_guide/removing_devices

@humblec
Contributor

humblec commented May 8, 2018

@redbaron @bswartz It's definitely good to have; however, my aim was to find out whether it's mandatory, and whether there are any potential issues in its absence. The guide pointed to above is for RHEL 5; the device mapper and the rest of the software stack have changed considerably in later versions.

@redbaron
Contributor

redbaron commented May 8, 2018

@humblec I don't think it is necessary, and ideally multipathd would remove the device once all paths are deleted.

Removing the device explicitly in the code has the following benefits, in my view:

  • It is less likely to trigger bugs in the interaction between multipathd, udev, and the kernel. With so many combinations of versions spanning years of development, there are bound to be differences in behaviour that are hard to predict.
  • It generates a much more visible kubelet log entry should the flush operation fail, for instance if the device is still open for whatever reason.
  • Explicit is better than implicit: it is good to have code for every effect we anticipate. Not everybody (me included) is fluent in iSCSI, and things that are obvious once you dive deep into the topic can be surprising when someone else tries to debug another instance of odd behaviour.
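
For illustration, the explicit flush amounts to invoking multipath -f on the device-mapper entry, so a failure (for example, the device still being open) surfaces as a kubelet error rather than being handled silently by multipathd. A minimal sketch under those assumptions, not the PR's exact code:

import (
	"fmt"
	"os/exec"
)

// flushMultipathDevice explicitly removes a multipath device-mapper entry
// instead of waiting for multipathd to react to udev events. Any failure
// becomes a visible, loggable error.
func flushMultipathDevice(dmDevice string) error {
	out, err := exec.Command("multipath", "-f", dmDevice).CombinedOutput()
	if err != nil {
		return fmt.Errorf("failed to flush multipath device %s: %s (%v)",
			dmDevice, string(out), err)
	}
	return nil
}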

// Build a map of SCSI hosts for each target portal. We will need this to
// issue the bus rescans.
portalHostMap, err := b.deviceUtil.GetIscsiPortalHostMapForTarget(b.Iqn)
if nil != err {
Contributor

err != nil looks more consistent with convention.

Contributor Author

Fixed

if nil != err {
return "", err
}
glog.V(6).Infof("AttachDisk portal->host map for %s is %v", b.Iqn, portalHostMap)
Contributor

We can place this log at v3 or v4.

Contributor Author

Fixed

@@ -80,3 +83,202 @@ func (handler *deviceHandler) FindSlaveDevicesOnMultipath(dm string) []string {
}
return devices
}

// GetIscsiPortalHostMapForTarget given a target iqn, find all the scsi hosts logged into
// that target. Returns a map of target portal to SCSI host number.
Contributor

Can you please expand the source code comment to describe what the portal map looks like?

Contributor Author

Fixed
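
For readers following along: judging by the usage later in the diff (portalHostMap[tp] yielding a host number), the map associates each target portal with the SCSI host number of its session, roughly like this (values are illustrative):

// Hypothetical portal->host map for a target reachable via two portals:
var portalHostMap = map[string]int{
	"192.168.1.10:3260": 3, // session on SCSI host3
	"192.168.1.11:3260": 4, // second path, separate SCSI host
}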

if err != nil {
glog.Errorf("iscsi: failed to rescan session with error: %s (%v)", string(out), err)
hostNumber, loggedIn := portalHostMap[tp]
if !loggedIn {
Contributor

How are we addressing the race condition here? I mean: before we log in, another routine does the login; or we assume it's already logged in, and before the next operation it gets logged out.

Contributor Author

I'm not sure the old code addressed that case. When I wrote this I assumed there was locking at a higher level. I can add proper protection against races here in attach/detach, but it will require either a Big Giant Lock or some very complex locking on the individual portals.
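
One shape the per-portal locking could take is a keyed mutex, so concurrent attach/detach operations serialize only when they touch the same portal. This is a hypothetical sketch, not code from this PR:

import "sync"

// portalLocks hands out one mutex per portal so login/logout for the same
// session cannot race, while different portals proceed in parallel.
var portalLocks = struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}{locks: map[string]*sync.Mutex{}}

func lockPortal(portal string) *sync.Mutex {
	portalLocks.mu.Lock()
	l, ok := portalLocks.locks[portal]
	if !ok {
		l = &sync.Mutex{}
		portalLocks.locks[portal] = l
	}
	portalLocks.mu.Unlock()
	l.Lock()
	return l // caller is responsible for l.Unlock()
}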

lastErr = fmt.Errorf("iscsi: failed to sendtargets to portal %s output: %s, err %v", tp, string(out), err)
continue
}
err = updateISCSINode(b, tp)
Contributor

Can updateISCSIDiscoverydb() or updateISCSINode() have an inconsistent view of the portal map compared to the iscsi db?

Contributor Author

I didn't write that code -- my change just moved it. Based on what I know about iscsiadm, it stores some things in its databases, but probably not the SCSI hosts -- those it gets at runtime.

Are you worried about a new bug here, or are you considering an alternative implementation?

@childsb childsb added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/bug Categorizes issue or PR as related to a bug. labels Jul 24, 2018
@rootfs
Contributor

rootfs commented Jul 24, 2018

status/approved-for-milestone

@bswartz
Contributor Author

bswartz commented Jul 24, 2018

@rootfs I think the command is "/status approved-for-milestone"

@rootfs
Contributor

rootfs commented Jul 24, 2018

/status approved-for-milestone

@k8s-ci-robot
Contributor

You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to add status labels.

@bswartz
Contributor Author

bswartz commented Jul 24, 2018

/retest

This change ensures that iSCSI block devices are deleted after
unmounting, and implements scanning of individual LUNs rather
than scanning the whole iSCSI bus.

In cases where an iSCSI bus is in use by more than one attachment,
detaching used to leave behind phantom block devices, which could
cause I/O errors, long timeouts, or even corruption in the case
when the underlying LUN number was recycled. This change makes
sure to flush references to the block devices after unmounting.

The original iSCSI code scanned the whole target every time a LUN
was attached. On storage controllers that export multiple LUNs on
the same target IQN, this led to a situation where nodes would
see SCSI disks that they weren't supposed to -- possibly dozens or
hundreds of extra SCSI disks. This caused 3 significant problems:

1) The large number of disks wasted resources on the node and
caused a minor drag on performance.
2) The scanning of all the devices caused a huge number of uevents
from the kernel, causing udev to bog down for multiple minutes in
some cases, triggering timeouts and other transient failures.
3) Because Kubernetes was not tracking all the "extra" LUNs that
got discovered, they would not get cleaned up until the last LUN
on a particular target was detached, causing a logout. This led
to significant complications:

In the time window between when a LUN was unintentionally scanned,
and when it was removed due to a logout, if it was deleted on the
backend, a phantom reference remained on the node. In the best
case, the phantom LUN would cause I/O errors and timeouts in the
udev system. In the worst case, the backend could reuse the LUN
number for a new volume, and if that new volume were to be
scheduled to a pod with a phantom reference to the old LUN by the
same number, the initiator could get confused and possibly corrupt
data on that volume.

To avoid these problems, the new implementation only scans for
the specific LUN number it expects to see. It's worth noting that
the default behavior of iscsiadm is to automatically scan the
whole bus on login. That behavior can be disabled by setting
node.session.scan = manual
in iscsid.conf, and for the reasons mentioned above, it is
strongly recommended to set that option. This change still works
regardless of the setting in iscsid.conf, and while automatic
scanning will cause some problems, this change doesn't make the
problems any worse, and can make things better in some cases.
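
The deletion half of this description relies on the kernel's per-device delete hook: writing "1" to /sys/block/<dev>/device/delete makes the kernel forget the device entirely. A minimal sketch of that mechanism (assumed layout, not the PR's exact code):

import (
	"fmt"
	"os"
)

// deleteBlockDevice tells the kernel to forget a SCSI block device such as
// "sdb", removing the kind of phantom reference described above. This runs
// after unmounting and after flushing any multipath map.
func deleteBlockDevice(deviceName string) error {
	deleteFile := fmt.Sprintf("/sys/block/%s/device/delete", deviceName)
	if err := os.WriteFile(deleteFile, []byte("1"), 0200); err != nil {
		return fmt.Errorf("failed to delete block device %s: %v", deviceName, err)
	}
	return nil
}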
@bswartz
Contributor Author

bswartz commented Jul 25, 2018

Squashed 5 commits down to 1

@redbaron
Contributor

/test pull-kubernetes-e2e-gce

@rootfs
Contributor

rootfs commented Jul 25, 2018

@jsafrane any more comments?

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request Needs Approval

@bswartz @humblec @rootfs @saad-ali @kubernetes/sig-cluster-lifecycle-misc @kubernetes/sig-storage-misc

Action required: This pull request must have the status/approved-for-milestone label applied by a SIG maintainer.

Pull Request Labels
  • sig/cluster-lifecycle, sig/storage: Pull request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@rootfs
Contributor

rootfs commented Jul 25, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 25, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bswartz, rootfs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 64844, 63176). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 845a55d into kubernetes:master Jul 25, 2018
@k8s-ci-robot
Contributor

k8s-ci-robot commented Jul 25, 2018

@bswartz: The following tests failed, say /retest to rerun them all:

Test name                                 Commit    Rerun command
pull-kubernetes-local-e2e-containerized   ad9722c   /test pull-kubernetes-local-e2e-containerized
pull-kubernetes-e2e-kops-aws              6d23d8e   /test pull-kubernetes-e2e-kops-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@bswartz bswartz deleted the bug/59946 branch July 26, 2018 02:29
@redbaron
Contributor

Big thanks to everybody who was involved in fixing it!

Successfully merging this pull request may close these issues.

iSCSI should be deleting scsi devices on PV unmount to prevent data corruption