Fix discovery/deletion of iscsi block devices #63176

Merged — 1 commit merged into kubernetes:master from the bug/59946 branch on Jul 25, 2018

Conversation

Contributor

@bswartz bswartz commented Apr 26, 2018

This PR modifies the iSCSI attach/detach codepaths in the following
ways:

  1. After unmounting a filesystem on an iSCSI block device, always
    flush the multipath device mapper entry (if it exists) and delete
    all block devices so the kernel forgets about them.
  2. When attaching an iSCSI block device, instead of blindly
    attempting to scan for the new LUN, first determine whether the
    target is already logged in, and log in if not. The scan is
    performed only once every portal is logged in.
  3. Scans are now done for specific devices, instead of the whole
    bus. This avoids discovering LUNs that kubelet has no interest in.
  4. Additions to the underlying utility interfaces, with new tests
    for the new functionality.
  5. Some existing code was shifted up or down, to make the new logic
    work.
  6. A typo in an existing exec call on the attach path was fixed.

Fixes #59946

When attaching iSCSI volumes, kubelet now scans only the specific
LUNs being attached, and also deletes them after detaching. This avoids
dangling references to LUNs that no longer exist, which used to be the
cause of random I/O errors/timeouts in kernel logs, slowdowns during
block-device related operations, and very rare cases of data corruption.
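
As background for the scanning change: a LUN-specific scan is issued by writing the channel, SCSI id, and LUN number to the host's sysfs scan file, rather than the "- - -" wildcard that probes the entire bus. Below is a minimal sketch of that mechanism, assuming channel and id are zero as discussed later in the review — an illustration, not the PR's actual code:

import (
	"fmt"
	"os"
)

// scanLUN asks the kernel to probe a single LUN on the given SCSI host.
// Writing "0 0 <lun>" instead of the "- - -" wildcard limits discovery
// to the one device kubelet cares about; channel and id are assumed zero.
func scanLUN(hostNumber, lun int) error {
	scanFile := fmt.Sprintf("/sys/class/scsi_host/host%d/scan", hostNumber)
	return os.WriteFile(scanFile, []byte(fmt.Sprintf("0 0 %d", lun)), 0200)
}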

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 26, 2018
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 26, 2018
@bswartz
Contributor Author

bswartz commented Apr 26, 2018

CLA issue should be fixed

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 26, 2018
@bswartz
Contributor Author

bswartz commented Apr 26, 2018

/assign @saad-ali

@krmayankk

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Apr 26, 2018
@bswartz bswartz changed the title Bug/59946 Fix discovery/deletion of iscsi block devices Apr 26, 2018
@humblec
Contributor

humblec commented Apr 26, 2018

/assign @humblec

@rootfs
Contributor

rootfs commented Apr 26, 2018

cc @bmarzins

@rootfs
Contributor

rootfs commented Apr 26, 2018

I am not a multipath expert, looping in @bmarzins

// The list of block devices on the scsi bus will be in a
// directory called "target%d:%d:%d".
// See drivers/scsi/scsi_scan.c in Linux
// We assume the channel/bus and device/controller are always zero for iSCSI
Contributor

Is this also true for iSCSI offload cards?

Contributor Author

I'm trying to determine this. I don't have an offload card, but I've ordered some, and once they're installed I'll run experiments.

Contributor Author

Okay I was able to get my hands on some iSCSI offload cards and confirm that channel/bus and device/controller remain zero for those cards too. Each "port" on the card shows up as a separate SCSI host, and LUNs are numbered according to the iSCSI LUN number as you would expect.
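
To make the zero channel/id assumption concrete, here is a hypothetical helper showing how the block-device directory for a single LUN can be located once channel and id are pinned to zero. The names and exact sysfs layout are my illustration, not the PR's code:

import (
	"fmt"
	"path/filepath"
)

// lunBlockDevices returns the kernel block-device names (e.g. "sdb") for
// one LUN, relying on channel/bus and device/controller being zero, as
// confirmed above for both software iSCSI and offload cards.
func lunBlockDevices(host, lun int) ([]string, error) {
	pattern := fmt.Sprintf(
		"/sys/class/scsi_host/host%d/device/session*/target%d:0:0/%d:0:0:%d/block/*",
		host, host, host, lun)
	paths, err := filepath.Glob(pattern)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(paths))
	for _, p := range paths {
		names = append(names, filepath.Base(p))
	}
	return names, nil
}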

@rootfs
Contributor

rootfs commented Apr 27, 2018

@humblec do you have a multipath setup to test this?

@rootfs
Contributor

rootfs commented Apr 30, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 30, 2018
@humblec
Contributor

humblec commented May 7, 2018

@bswartz Thanks. I am starting the review with a question :)

After unmounting a filesystem on an iSCSI block device, always flush the multipath device mapper entry (if it exists) and delete all block devices so the kernel forgets about them.

Is it really required to flush the device-mapper multipath entries? AFAIK, the device mapper will take care of flushing stale device entries after some period. I am trying to understand why we need to do it in code; if it were required, every storage admin would have to perform this operation manually as soon as an unmount occurs, which I don't think is the case. Also, without mandatory flushing, are we seeing any functional issues in the iSCSI workflow?

@redbaron
Contributor

redbaron commented May 7, 2018

@humblec, AFAIK multipathd reacts to udev events and can therefore lag behind in processing them. IMHO it is best to delete the multipath device explicitly to avoid potential issues and bugs in other parts of the chain.

@bswartz
Contributor Author

bswartz commented May 8, 2018

@humblec What redbaron says is true. Also, I was following Red Hat's recommended practices from this document:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/online_storage_reconfiguration_guide/removing_devices

@humblec
Contributor

humblec commented May 8, 2018

@redbaron @bswartz It's definitely good to have; however, my aim was to find out whether it's mandatory, and whether there are any potential issues in its absence. The guide pointed to above is for RHEL 5; the device mapper and the rest of the software stack have changed considerably in later versions.

@redbaron
Contributor

redbaron commented May 8, 2018

@humblec I don't think it is necessary, and ideally multipathd would remove the device once all paths are deleted.

Removing the device explicitly in the code has the following benefits, in my view:

  • It is less likely to trigger bugs in the interaction between multipathd, udev, and the kernel. With so many combinations of versions spanning years of development, there are bound to be differences in behaviour that are hard to predict.
  • It generates a much more visible kubelet log entry should the flush operation fail, for instance if the device is still open for whatever reason.
  • Explicit is better than implicit: it is good to have code for every effect we anticipate. Not everybody (me included) is fluent in iSCSI, and things that are obvious once you dive deep into the topic can be surprising when someone else tries to debug another instance of odd behaviour.
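
For illustration, the explicit flush amounts to invoking multipath -f on the device-mapper entry, so a failure (for example, the device still being open) surfaces as a kubelet error rather than being handled silently by multipathd. A minimal sketch under those assumptions, not the PR's exact code:

import (
	"fmt"
	"os/exec"
)

// flushMultipathDevice explicitly removes a multipath device-mapper entry
// instead of waiting for multipathd to react to udev events. Any failure
// becomes a visible, loggable error.
func flushMultipathDevice(dmDevice string) error {
	out, err := exec.Command("multipath", "-f", dmDevice).CombinedOutput()
	if err != nil {
		return fmt.Errorf("failed to flush multipath device %s: %s (%v)",
			dmDevice, string(out), err)
	}
	return nil
}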

// Build a map of SCSI hosts for each target portal. We will need this to
// issue the bus rescans.
portalHostMap, err := b.deviceUtil.GetIscsiPortalHostMapForTarget(b.Iqn)
if nil != err {
Contributor

err != nil looks more consistent with convention.

Contributor Author

Fixed

if nil != err {
return "", err
}
glog.V(6).Infof("AttachDisk portal->host map for %s is %v", b.Iqn, portalHostMap)
Contributor

We can place this log at v3 or v4.

Contributor Author

Fixed

@@ -80,3 +83,202 @@ func (handler *deviceHandler) FindSlaveDevicesOnMultipath(dm string) []string {
}
return devices
}

// GetIscsiPortalHostMapForTarget given a target iqn, find all the scsi hosts logged into
// that target. Returns a map of target portal to SCSI host number.
Contributor

Can you please expand the source code comment to describe what the portal map looks like?

Contributor Author

Fixed
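
For readers following along: judging by the usage later in the diff (portalHostMap[tp] yielding a host number), the map associates each target portal with the SCSI host number of its session, roughly like this (values are illustrative):

// Hypothetical portal->host map for a target reachable via two portals:
var portalHostMap = map[string]int{
	"192.168.1.10:3260": 3, // session on SCSI host3
	"192.168.1.11:3260": 4, // second path, separate SCSI host
}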

if err != nil {
glog.Errorf("iscsi: failed to rescan session with error: %s (%v)", string(out), err)
hostNumber, loggedIn := portalHostMap[tp]
if !loggedIn {
Contributor

How are we addressing the race condition here? I mean: before we log in, another routine does the login; or we assume it's already logged in, and before the next operation it gets logged out.

Contributor Author

I'm not sure the old code addressed that case. When I wrote this I assumed there was locking at a higher level. I can add proper protection against races here in attach/detach, but it will require either a Big Giant Lock or some very complex locking on the individual portals.
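
One shape the per-portal locking could take is a keyed mutex, so concurrent attach/detach operations serialize only when they touch the same portal. This is a hypothetical sketch, not code from this PR:

import "sync"

// portalLocks hands out one mutex per portal so login/logout for the same
// session cannot race, while different portals proceed in parallel.
var portalLocks = struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}{locks: map[string]*sync.Mutex{}}

func lockPortal(portal string) *sync.Mutex {
	portalLocks.mu.Lock()
	l, ok := portalLocks.locks[portal]
	if !ok {
		l = &sync.Mutex{}
		portalLocks.locks[portal] = l
	}
	portalLocks.mu.Unlock()
	l.Lock()
	return l // caller is responsible for l.Unlock()
}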

lastErr = fmt.Errorf("iscsi: failed to sendtargets to portal %s output: %s, err %v", tp, string(out), err)
continue
}
err = updateISCSINode(b, tp)
Contributor

Can updateISCSIDiscoverydb() or updateISCSINode() have an inconsistent view of the portal map compared to the iscsi db?

Contributor Author

I didn't write that code -- my change just moved it. Based on what I know about iscsiadm, it stores some things in its databases, but probably not the SCSI hosts -- those it gets at runtime.

Are you worried about a new bug here, or are you considering an alternative implementation?

@childsb childsb added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/bug Categorizes issue or PR as related to a bug. labels Jul 24, 2018
@rootfs
Contributor

rootfs commented Jul 24, 2018

status/approved-for-milestone

@bswartz
Contributor Author

bswartz commented Jul 24, 2018

@rootfs I think the command is "/status approved-for-milestone"

@rootfs
Contributor

rootfs commented Jul 24, 2018

/status approved-for-milestone

@k8s-ci-robot
Contributor

You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to add status labels.

@bswartz
Contributor Author

bswartz commented Jul 24, 2018

/retest

This change ensures that iSCSI block devices are deleted after
unmounting, and implements scanning of individual LUNs rather
than scanning the whole iSCSI bus.

In cases where an iSCSI bus is in use by more than one attachment,
detaching used to leave behind phantom block devices, which could
cause I/O errors, long timeouts, or even corruption in the case
when the underlying LUN number was recycled. This change makes
sure to flush references to the block devices after unmounting.

The original iSCSI code scanned the whole target every time a LUN
was attached. On storage controllers that export multiple LUNs on
the same target IQN, this led to a situation where nodes would
see SCSI disks that they weren't supposed to -- possibly dozens or
hundreds of extra SCSI disks. This caused 3 significant problems:

1) The large number of disks wasted resources on the node and
caused a minor drag on performance.
2) The scanning of all the devices caused a huge number of uevents
from the kernel, causing udev to bog down for multiple minutes in
some cases, triggering timeouts and other transient failures.
3) Because Kubernetes was not tracking all the "extra" LUNs that
got discovered, they would not get cleaned up until the last LUN
on a particular target was detached, causing a logout. This led
to significant complications:

In the time window between when a LUN was unintentionally scanned,
and when it was removed due to a logout, if it was deleted on the
backend, a phantom reference remained on the node. In the best
case, the phantom LUN would cause I/O errors and timeouts in the
udev system. In the worst case, the backend could reuse the LUN
number for a new volume, and if that new volume were to be
scheduled to a pod with a phantom reference to the old LUN by the
same number, the initiator could get confused and possibly corrupt
data on that volume.

To avoid these problems, the new implementation only scans for
the specific LUN number it expects to see. It's worth noting that
the default behavior of iscsiadm is to automatically scan the
whole bus on login. That behavior can be disabled by setting
node.session.scan = manual
in iscsid.conf, and for the reasons mentioned above, it is
strongly recommended to set that option. This change still works
regardless of the setting in iscsid.conf, and while automatic
scanning will cause some problems, this change doesn't make the
problems any worse, and can make things better in some cases.
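
The deletion half of this description relies on the kernel's per-device delete hook: writing "1" to /sys/block/<dev>/device/delete makes the kernel forget the device entirely. A minimal sketch of that mechanism (assumed layout, not the PR's exact code):

import (
	"fmt"
	"os"
)

// deleteBlockDevice tells the kernel to forget a SCSI block device such as
// "sdb", removing the kind of phantom reference described above. This runs
// after unmounting and after flushing any multipath map.
func deleteBlockDevice(deviceName string) error {
	deleteFile := fmt.Sprintf("/sys/block/%s/device/delete", deviceName)
	if err := os.WriteFile(deleteFile, []byte("1"), 0200); err != nil {
		return fmt.Errorf("failed to delete block device %s: %v", deviceName, err)
	}
	return nil
}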
@bswartz
Contributor Author

bswartz commented Jul 25, 2018

Squashed 5 commits down to 1

@redbaron
Contributor

/test pull-kubernetes-e2e-gce

@rootfs
Contributor

rootfs commented Jul 25, 2018

@jsafrane any more comments?

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request Needs Approval

@bswartz @humblec @rootfs @saad-ali @kubernetes/sig-cluster-lifecycle-misc @kubernetes/sig-storage-misc

Action required: This pull request must have the status/approved-for-milestone label applied by a SIG maintainer.

Pull Request Labels
  • sig/cluster-lifecycle, sig/storage: Pull request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@rootfs
Contributor

rootfs commented Jul 25, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 25, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bswartz, rootfs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 64844, 63176). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 845a55d into kubernetes:master Jul 25, 2018
@k8s-ci-robot
Contributor

k8s-ci-robot commented Jul 25, 2018

@bswartz: The following tests failed, say /retest to rerun them all:

Test name                                 Commit    Rerun command
pull-kubernetes-local-e2e-containerized   ad9722c   /test pull-kubernetes-local-e2e-containerized
pull-kubernetes-e2e-kops-aws              6d23d8e   /test pull-kubernetes-e2e-kops-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@bswartz bswartz deleted the bug/59946 branch July 26, 2018 02:29
@redbaron
Contributor

Big thanks to everybody who was involved in fixing it!

Successfully merging this pull request may close these issues.

iSCSI should be deleting scsi devices on PV unmount to prevent data corruption