
Fix discovery/deletion of iscsi block devices #63176

Merged
merged 1 commit into kubernetes:master from NetApp:bug/59946 on Jul 25, 2018

Conversation

@bswartz (Contributor) commented Apr 26, 2018

This PR modifies the iSCSI attach/detach codepaths in the following
ways:

  1. After unmounting a filesystem on an iSCSI block device, always
    flush the multipath device mapper entry (if it exists) and delete
    all block devices so the kernel forgets about them.
  2. When attaching an iSCSI block device, instead of blindly
    attempting to scan for the new LUN, first determine whether the
    target is already logged into, and if not, perform the login first.
    Once every portal is logged in, the scan is done.
  3. Scans are now done for specific devices, instead of the whole
    bus. This avoids discovering LUNs that kubelet has no interest in.
  4. Additions to the underlying utility interfaces, with new tests
    for the new functionality.
  5. Some existing code was shifted up or down, to make the new logic
    work.
  6. A typo in an existing exec call on the attach path was fixed.

Fixes #59946

When attaching iSCSI volumes, kubelet now scans only the specific
LUNs being attached, and also deletes them after detaching. This avoids
dangling references to LUNs that no longer exist, which used to be the
cause of random I/O errors/timeouts in kernel logs, slowdowns during
block-device related operations, and very rare cases of data corruption.
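
For context, a targeted scan at the Linux sysfs level looks roughly like the sketch below. This is a minimal sketch of the technique, not the PR's exact code; the helper name, host number, and LUN are illustrative.

```go
package main

import (
	"fmt"
	"os"
)

// scanSingleLUN asks the kernel to probe exactly one LUN on the given SCSI
// host by writing "channel id lun" to the host's sysfs scan file. For iSCSI,
// channel and id are assumed to be 0 (see the review discussion below).
func scanSingleLUN(hostNumber, lun int) error {
	scanPath := fmt.Sprintf("/sys/class/scsi_host/host%d/scan", hostNumber)
	f, err := os.OpenFile(scanPath, os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	// "0 0 <lun>" scans only that LUN, unlike "- - -", which scans the whole
	// bus and discovers LUNs kubelet has no interest in.
	_, err = fmt.Fprintf(f, "0 0 %d", lun)
	return err
}

func main() {
	// Example: scan LUN 3 on SCSI host 4 (hypothetical values).
	if err := scanSingleLUN(4, 3); err != nil {
		fmt.Fprintln(os.Stderr, "scan failed:", err)
	}
}
```
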
@k8s-ci-robot (Contributor) commented Apr 26, 2018

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@bswartz (Contributor, Author) commented Apr 26, 2018

CLA issue should be fixed

@bswartz (Contributor, Author) commented Apr 26, 2018

/assign @saad-ali

@krmayankk (Contributor) commented Apr 26, 2018

/sig storage

@bswartz changed the title from "Bug/59946" to "Fix discovery/deletion of iscsi block devices" on Apr 26, 2018

@humblec (Contributor) commented Apr 26, 2018

/assign @humblec

@rootfs (Member) commented Apr 26, 2018

I am not a multipath expert, looping in @bmarzins

// The list of block devices on the scsi bus will be in a
// directory called "target%d:%d:%d".
// See drivers/scsi/scsi_scan.c in Linux
// We assume the channel/bus and device/controller are always zero for iSCSI

@redbaron (Contributor) commented Apr 26, 2018

is it true for iSCSI offloading cards?

@bswartz (Author, Contributor) commented May 7, 2018

I'm trying to determine this. I don't have an offload card but I've ordered some and when I get them installed I'll do experiments.

@bswartz (Author, Contributor) commented May 22, 2018

Okay I was able to get my hands on some iSCSI offload cards and confirm that channel/bus and device/controller remain zero for those cards too. Each "port" on the card shows up as a separate SCSI host, and LUNs are numbered according to the iSCSI LUN number as you would expect.
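
To make the naming concrete, here is a minimal sketch of how block device names for a given host and LUN could be looked up under sysfs, assuming channel and SCSI id of 0 as confirmed above. The path layout and helper name are illustrative assumptions, not the PR's exact code.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// blockDevicesForLUN returns the block device names (e.g. "sdb") that the
// kernel created for a given SCSI host and iSCSI LUN, assuming channel and
// SCSI id are both 0 as discussed above.
func blockDevicesForLUN(hostNumber, lun int) ([]string, error) {
	// Devices are assumed to appear under
	// /sys/class/scsi_device/<host>:0:0:<lun>/device/block/<name>
	pattern := fmt.Sprintf("/sys/class/scsi_device/%d:0:0:%d/device/block/*", hostNumber, lun)
	paths, err := filepath.Glob(pattern)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(paths))
	for _, p := range paths {
		names = append(names, filepath.Base(p))
	}
	return names, nil
}

func main() {
	names, err := blockDevicesForLUN(4, 3) // hypothetical host 4, LUN 3
	fmt.Println(names, err)
}
```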

@rootfs (Member) commented Apr 27, 2018

@humblec do you have a multipath setup to test this?

@rootfs (Member) commented Apr 30, 2018

/ok-to-test

@humblec (Contributor) commented May 7, 2018

@bswartz Thanks.. I am starting the review with a question :)

After unmounting a filesystem on an iSCSI block device, always flush the multipath device mapper entry (if it exists) and delete all block devices so the kernel forgets about them.

Is it really required to flush the device mapper multipath entries? AFAIK, the device mapper will take care of flushing stale device entries after some period. I am trying to understand why we need to do it in code; if it were required, every storage admin would have to do this operation manually as soon as an unmount occurs, which I don't think is the case. Also, without 'mandatory' flushing, are we seeing any functional issues in the iSCSI workflow?

@redbaron (Contributor) commented May 7, 2018

@humblec , AFAIK multipathd reacts to udev events and therefore can lag behind in processing them. IMHO it is best to delete the multipath device explicitly to avoid any potential issues and bugs in other parts of the chain.

@bswartz (Contributor, Author) commented May 8, 2018

@humblec What redbaron says is true. Also, I was following RedHat's recommended practices from this document:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/online_storage_reconfiguration_guide/removing_devices

@humblec (Contributor) commented May 8, 2018

@redbaron @bswartz It's definitely good to have; however, my intent was to find out whether it's mandatory, or whether there are any potential issues in its absence. The guide pointed to above is for RHEL 5; the device mapper and the rest of the software stack have changed a good amount in more recent versions.

@redbaron (Contributor) commented May 8, 2018

@humblec I don't think it is strictly necessary; ideally, multipathd would remove the device once all paths are deleted.

Removing the device explicitly in the code has the following benefits in my view:

  • it is less likely to trigger bugs in the interaction between multipathd, udev and the kernel. There are so many combinations, spanning years of development between them, that there are bound to be differences in behaviour which are hard to predict.
  • it generates a much more visible kubelet log entry should the flush operation fail, for instance if the device is still open for whatever reason.
  • explicit is better than implicit. It is good to have code for every effect we anticipate. Not everybody (== me) is fluent with iSCSI, and things which are obvious once you dive deep into the topic might be surprising when someone else tries to debug another instance of odd behaviour.
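
As a rough illustration of the explicit cleanup being discussed: flush the multipath map first, then ask the kernel to delete each underlying SCSI block device. This is a minimal sketch; the helper names, the exec-based multipath call, and the example device names are assumptions for illustration, not the PR's exact code.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// flushMultipathDevice asks multipath to flush the device mapper map for the
// given multipath device (an mpath alias here; the exact argument form may
// differ). If the map is still open, this fails loudly, which shows up in the
// kubelet log.
func flushMultipathDevice(mpathName string) error {
	out, err := exec.Command("multipath", "-f", mpathName).CombinedOutput()
	if err != nil {
		return fmt.Errorf("failed to flush multipath device %s: %s (%v)", mpathName, out, err)
	}
	return nil
}

// deleteBlockDevice tells the kernel to forget a SCSI block device (e.g.
// "sdb") by writing "1" to its sysfs delete file, so no stale reference to a
// removed LUN remains on the node.
func deleteBlockDevice(device string) error {
	path := fmt.Sprintf("/sys/block/%s/device/delete", device)
	return os.WriteFile(path, []byte("1"), 0200)
}

func main() {
	// Hypothetical example: flush map "mpatha", then drop its two path devices.
	if err := flushMultipathDevice("mpatha"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	for _, dev := range []string{"sdb", "sdc"} {
		if err := deleteBlockDevice(dev); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}
```
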
// Build a map of SCSI hosts for each target portal. We will need this to
// issue the bus rescans.
portalHostMap, err := b.deviceUtil.GetIscsiPortalHostMapForTarget(b.Iqn)
if nil != err {

@humblec (Contributor) commented May 8, 2018

err != nil looks consistent.

@bswartz (Author, Contributor) commented May 18, 2018

Fixed

if nil != err {
return "", err
}
glog.V(6).Infof("AttachDisk portal->host map for %s is %v", b.Iqn, portalHostMap)

@humblec (Contributor) commented May 8, 2018

We can place this at v3 or v4.

@bswartz (Author, Contributor) commented May 18, 2018

Fixed

@@ -80,3 +83,202 @@ func (handler *deviceHandler) FindSlaveDevicesOnMultipath(dm string) []string {
}
return devices
}

// GetIscsiPortalHostMapForTarget given a target iqn, find all the scsi hosts logged into
// that target. Returns a map by target portal to

@humblec (Contributor) commented May 8, 2018

Can you please expand the source code comment to describe what the portal map looks like?

@bswartz (Author, Contributor) commented May 18, 2018

Fixed
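
For reference, the map in question is keyed by target portal and holds the SCSI host number logged into that portal (compare the `hostNumber, loggedIn := portalHostMap[tp]` lookup quoted below). A hypothetical sketch of how such a map could be documented and consumed; the concrete doc comment in the PR may differ:

```go
package main

import "fmt"

// A portal-to-host map records, for one target IQN, which SCSI host number on
// the node is logged into each target portal, e.g.:
//
//	map[string]int{
//		"10.0.0.1:3260": 3, // portal -> /sys/class/scsi_host/host3
//		"10.0.0.2:3260": 5, // portal -> /sys/class/scsi_host/host5
//	}
//
// A missing key means the node is not logged into that portal yet, so the
// attach path must log in before it can issue a targeted LUN scan.
func hostForPortal(portalHostMap map[string]int, portal string) (int, bool) {
	hostNumber, loggedIn := portalHostMap[portal]
	return hostNumber, loggedIn
}

func main() {
	m := map[string]int{"10.0.0.1:3260": 3} // hypothetical values
	if host, ok := hostForPortal(m, "10.0.0.2:3260"); !ok {
		fmt.Println("not logged in yet, need to log in before scanning")
	} else {
		fmt.Println("already logged in via SCSI host", host)
	}
}
```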

if err != nil {
glog.Errorf("iscsi: failed to rescan session with error: %s (%v)", string(out), err)
hostNumber, loggedIn := portalHostMap[tp]
if !loggedIn {

@humblec (Contributor) commented May 8, 2018

How are we addressing the race condition here? I mean, before we log in, another routine might do the login; or we assume it's already logged in, but before performing the next operation, it gets logged out.

@bswartz (Author, Contributor) commented May 18, 2018

I'm not sure the old code addressed that case. When I wrote this, I had assumed there was locking at a higher level. I can add proper protection against races here in attach/detach, but it will require either a Big Giant Lock or some very complex locking on the individual portals.
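
A minimal sketch of the coarse "Big Giant Lock" option mentioned above, where a single mutex serializes login and scan so a concurrent detach cannot log the session out in between. This is only an illustration of the idea, with hypothetical helper signatures, not what the PR implements:

```go
package main

import (
	"fmt"
	"sync"
)

// iscsiTargetMutex is the coarse option: one lock serializing every iSCSI
// login, scan, and logout on the node, so a concurrent detach cannot log a
// session out between another attach's "already logged in?" check and its LUN
// scan. Finer-grained, per-portal locking would be more complex.
var iscsiTargetMutex sync.Mutex

func attachWithLock(portal string, lun int, login func(string) error, scan func(string, int) error) error {
	iscsiTargetMutex.Lock()
	defer iscsiTargetMutex.Unlock()
	if err := login(portal); err != nil { // expected to be a no-op if already logged in
		return err
	}
	return scan(portal, lun)
}

func main() {
	login := func(p string) error { fmt.Println("login", p); return nil }
	scan := func(p string, l int) error { fmt.Println("scan", p, "lun", l); return nil }
	_ = attachWithLock("10.0.0.1:3260", 3, login, scan) // hypothetical values
}
```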

lastErr = fmt.Errorf("iscsi: failed to sendtargets to portal %s output: %s, err %v", tp, string(out), err)
continue
}
err = updateISCSINode(b, tp)

@humblec (Contributor) commented May 8, 2018

Can updateISCSIDiscoverydb() or updateISCSINode() have an inconsistent view of the portal map compared to the iscsi db?

@bswartz (Author, Contributor) commented May 18, 2018

I didn't write that code -- my change just moved it. Based on what I know about iscsiadm, it stores some things in its databases, but probably not the SCSI hosts -- those it gets at runtime.

Are you worried about a new bug here, or are you considering an alternative implementation?
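
For readers unfamiliar with the workflow this hunk belongs to, here is a minimal sketch of the standard open-iscsi discovery and login steps. Real code also updates node settings between these steps and handles more error cases; the helper name, portal, and IQN are illustrative.

```go
package main

import (
	"fmt"
	"os/exec"
)

// discoverAndLogin runs the two standard open-iscsi steps for one portal:
// sendtargets discovery (which populates iscsiadm's node database) and the
// node login for the given IQN.
func discoverAndLogin(portal, iqn string) error {
	out, err := exec.Command("iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal).CombinedOutput()
	if err != nil {
		return fmt.Errorf("iscsi: failed to sendtargets to portal %s output: %s, err %v", portal, out, err)
	}
	out, err = exec.Command("iscsiadm", "-m", "node", "-p", portal, "-T", iqn, "--login").CombinedOutput()
	if err != nil {
		return fmt.Errorf("iscsi: failed to log in to portal %s output: %s, err %v", portal, out, err)
	}
	return nil
}

func main() {
	// Hypothetical portal and IQN.
	if err := discoverAndLogin("10.0.0.1:3260", "iqn.2018-04.example.com:target1"); err != nil {
		fmt.Println(err)
	}
}
```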

@rootfs (Member) commented Jul 24, 2018

status/approved-for-milestone

@bswartz (Contributor, Author) commented Jul 24, 2018

@rootfs I think the command is "/status approved-for-milestone"

@bswartz closed this on Jul 24, 2018

@bswartz reopened this on Jul 24, 2018

@bswartz force-pushed the NetApp:bug/59946 branch from 04cac22 to 1c449c7 on Jul 24, 2018

@rootfs (Member) commented Jul 24, 2018

/status approved-for-milestone

@k8s-ci-robot (Contributor) commented Jul 24, 2018

You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to add status labels.

@bswartz (Contributor, Author) commented Jul 24, 2018

/retest

Avoid deleted iSCSI LUNs in the kernel
This change ensures that iSCSI block devices are deleted after
unmounting, and implements scanning of individual LUNs rather
than scanning the whole iSCSI bus.

In cases where an iSCSI bus is in use by more than one attachment,
detaching used to leave behind phantom block devices, which could
cause I/O errors, long timeouts, or even corruption in the case
when the underlying LUN number was recycled. This change makes
sure to flush references to the block devices after unmounting.

The original iSCSI code scanned the whole target every time a LUN
was attached. On storage controllers that export multiple LUNs on
the same target IQN, this led to a situation where nodes would
see SCSI disks that they weren't supposed to -- possibly dozens or
hundreds of extra SCSI disks. This caused 3 significant problems:

1) The large number of disks wasted resources on the node and
caused a minor drag on performance.
2) The scanning of all the devices caused a huge number of uevents
from the kernel, causing udev to bog down for multiple minutes in
some cases, triggering timeouts and other transient failures.
3) Because Kubernetes was not tracking all the "extra" LUNs that
got discovered, they would not get cleaned up until the last LUN
on a particular target was detached, causing a logout. This led
to significant complications:

In the time window between when a LUN was unintentionally scanned,
and when it was removed due to a logout, if it was deleted on the
backend, a phantom reference remained on the node. In the best
case, the phantom LUN would cause I/O errors and timeouts in the
udev system. In the worst case, the backend could reuse the LUN
number for a new volume, and if that new volume were to be
scheduled to a pod with a phantom reference to the old LUN by the
same number, the initiator could get confused and possibly corrupt
data on that volume.

To avoid these problems, the new implementation only scans for
the specific LUN number it expects to see. It's worth noting that
the default behavior of iscsiadm is to automatically scan the
whole bus on login. That behavior can be disabled by setting
node.session.scan = manual
in iscsid.conf, and for the reasons mentioned above, it is
strongly recommended to set that option. This change still works
regardless of the setting in iscsid.conf, and while automatic
scanning will cause some problems, this change doesn't make the
problems any worse, and can make things better in some cases.

@bswartz force-pushed the NetApp:bug/59946 branch from 1c449c7 to 6d23d8e on Jul 25, 2018

@bswartz (Contributor, Author) commented Jul 25, 2018

Squashed 5 commits down to 1

@redbaron (Contributor) commented Jul 25, 2018

/test pull-kubernetes-e2e-gce

@rootfs (Member) commented Jul 25, 2018

@jsafrane any more comments?

@k8s-github-robot (Contributor) commented Jul 25, 2018

[MILESTONENOTIFIER] Milestone Pull Request Needs Approval

@bswartz @humblec @rootfs @saad-ali @kubernetes/sig-cluster-lifecycle-misc @kubernetes/sig-storage-misc

Action required: This pull request must have the status/approved-for-milestone label applied by a SIG maintainer.

Pull Request Labels
  • sig/cluster-lifecycle sig/storage: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@rootfs (Member) commented Jul 25, 2018

/lgtm

@k8s-ci-robot added the lgtm label on Jul 25, 2018

@k8s-ci-robot (Contributor) commented Jul 25, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bswartz, rootfs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot (Contributor) commented Jul 25, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot (Contributor) commented Jul 25, 2018

Automatic merge from submit-queue (batch tested with PRs 64844, 63176). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot merged commit 845a55d into kubernetes:master on Jul 25, 2018

11 of 18 checks passed

  • Submit Queue: Required Github CI test is not green: pull-kubernetes-e2e-gce
  • pull-kubernetes-e2e-gce: Job triggered.
  • pull-kubernetes-e2e-gce-100-performance: Job triggered.
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job triggered.
  • pull-kubernetes-e2e-kops-aws: Job triggered.
  • pull-kubernetes-kubemark-e2e-gce-big: Job triggered.
  • pull-kubernetes-verify: Job triggered.
  • cla/linuxfoundation: bswartz authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-cross: Skipped
  • pull-kubernetes-e2e-gke: Skipped
  • pull-kubernetes-e2e-kubeadm-gce: Skipped
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-local-e2e: Skipped
  • pull-kubernetes-local-e2e-containerized: Skipped
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
@k8s-ci-robot (Contributor) commented Jul 25, 2018

@bswartz: The following tests failed, say /retest to rerun them all:

  • pull-kubernetes-local-e2e-containerized (commit ad9722c): rerun with /test pull-kubernetes-local-e2e-containerized
  • pull-kubernetes-e2e-kops-aws (commit 6d23d8e): rerun with /test pull-kubernetes-e2e-kops-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@bswartz deleted the NetApp:bug/59946 branch on Jul 26, 2018

@redbaron (Contributor) commented Jul 26, 2018

Big thanks to everybody who was involved in fixing it.
