
Retry attaching multipath iSCSI volumes #68141

Merged Oct 8, 2018 — 1 commit merged into kubernetes:master from jsafrane:iscsi-retry

Conversation

jsafrane
Member

@jsafrane jsafrane commented Aug 31, 2018

So far, iSCSI volume plugin tried to log into all targets only once. If all targets but one were busy or unavailable, the plugin mounted just a single path.

This PR tries harder to get the missing paths:

  • In case only one path is available but multipath was requested, try again (up to 5 times)
  • In case at least two paths are available and some paths are missing, try again (up to 3 times).

In the worst case, the volume plugin uses the single available path after 5 attempts to log into all targets. The only difference from the current code is that the plugin is slower in pathological cases where some targets are unavailable for a longer time. But it still sets up the volume even in that case.

What this PR does / why we need it:
See #68140; I think we should wait for at least two paths in case of network hiccups or timeouts.

Fixes #68140

@j-griffith @bswartz @rootfs @humblec

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 31, 2018
@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Aug 31, 2018
@childsb childsb added the kind/bug Categorizes issue or PR as related to a bug. label Aug 31, 2018
@childsb childsb added this to the v1.12 milestone Aug 31, 2018
@dims
Member

dims commented Sep 2, 2018

/retest

@humblec
Contributor

humblec commented Sep 4, 2018

Don't mount single path instead of multipath volumes and always wait until at least 2 paths are available. Try up to 3 times to get all paths, then continue with 2 paths if they're available. 

@jsafrane I doubt this is necessary. dm-multipath is designed to take care of this scenario; restricting or denying the mpath configuration when only one path is available should be the job of the device-mapper multipath layer, not kubelet or the plugin. Multipath is supposed to work even when only one path is available, and there is nothing wrong with that. So, IMO, we should not impose this restriction from our layer.

@bswartz
Contributor

bswartz commented Sep 5, 2018

@humblec @jsafrane I agree with Humble that this change isn't needed.

The main way you can get into trouble today is if you configure multipath.conf with find_multipaths=yes (the default is no, except on Red Hat). If you do that, then multipathd will NOT create a /dev/dm-X device when there is only a single path, and kubelet will attach to the /dev/sdX device, preventing the multipath daemon from properly setting up multipath when the second path becomes available.

The workaround is simple -- just set find_multipaths=no and properly configure your blacklist, and you should never have a problem.
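For illustration, a minimal multipath.conf along these lines might look like this (an example sketch only — the blacklist entries must be adapted to the node's actual local disks):

```
defaults {
    find_multipaths no
    user_friendly_names yes
}

blacklist {
    # keep multipathd away from local, non-iSCSI disks
    devnode "^sda$"
}
```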

@jsafrane
Member Author

Indeed, multipath can work when only a single path is available, and it should be the task of DM/multipath to cope with that. However, Kubernetes currently won't retry attaching the missing paths. It tries to log into / attach all specified targets and is satisfied with only one path attached. It mounts it (either as /dev/dm-X or /dev/sdY; it really does not matter in this case) and the iSCSI volume plugin's work is done.

Is that what the user expected by specifying e.g. three targets in an iSCSI PV? In my opinion, the user wanted at least some sort of fault tolerance, which is impossible with only one path attached.

The best solution would be for Kubernetes to retry attaching missing paths periodically, even after the volume has been set up and a pod is running. We don't have such logic in Kubernetes, and adding it would IMO be too big a change. I'd suggest implementing it in a CSI driver (@j-griffith?).

For the in-tree iSCSI volume plugin, I am proposing to wait a while for at least two paths to become available. It will slow down pod startup when all but one target are offline. It will help in all other cases. In my (virtualized) Gluster environment, it's enough to start ~20 pods at the same time for some targets to time out and to get a volume with just one path instead of three.

@bswartz
Contributor

bswartz commented Sep 18, 2018

@jsafrane You raise a good point that I hadn't thought of. Something does need to be responsible for eventually going back and logging into/rescanning all the missing paths so multipath daemon can eventually pick them up.

This is a task that doesn't need to be done before kubernetes can attach to and start using the volume, but if we never do it, then we're going to have worse performance and poor reliability.

My instinct says it's the kind of thing that kubernetes should address because kubernetes' desired state/reconciler loop model is the perfect solution to this kind of problem. I agree with your point that it wouldn't be a small amount of new code though, and it might be too much to swallow, at least not without giving it a lot more thought.

@guineveresaenger
Contributor

Please continue this work in 1.13.

/milestone v1.13

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.12, v1.13 Sep 18, 2018
@jsafrane
Member Author

@bswartz @humblec @rootfs, until we have a better iSCSI volume plugin that retries logging into all missing volume paths after a pod has started using the volume, would this PR be a viable workaround?

I admit it will slow down setup if all but one path are unavailable. On the other hand, it provides at least some multipath capability when one of the paths is unavailable only temporarily.

@bswartz
Contributor

bswartz commented Sep 19, 2018

I think at least some users would view this patch as a regression. They will say "It used to work and now it doesn't." I understand that other users may view the situation differently. If there was a switch to let the user decide which behavior they wanted, that might let us deliver the improvement without breaking anyone's existing use case.

@humblec
Contributor

humblec commented Sep 19, 2018

Is that what the user expected by specifying e.g. three targets in an iSCSI PV? In my opinion, the user wanted at least some sort of fault tolerance, which is impossible with only one path attached.

If we forget about the Kubernetes layer, what is the general practice? The admin knows there is more than one connection to the LUN/IQN, so they configure multipath, log in, and keep all paths ready from the start. From then on, multipath takes care of path grouping and path switching based on path availability. If an active path goes down, a failover happens; who reinstates the failed path? IIUC, it's the admin's responsibility. If that is the general practice, then what we are trying to cover in this PR is the initial configuration, and the latter part (reinstating a failed path) is targeted for CSI or later PRs. Now the question is: is Kubernetes supposed to take over the admin's actions? IIUC it's better if kube takes care of it, and this PR marches towards that. At the same time, we have to think about regression/backward compatibility, as @bswartz pointed out. Because the portal configuration is not controlled via the SC or PVC, but rather lives in the spec, I think it's difficult to avoid a backward-compatibility break. If we find a way, I ack getting this PR in.

@bswartz
Contributor

bswartz commented Sep 19, 2018

If we forget about the Kubernetes layer, what is the general practice? The admin knows there is more than one connection to the LUN/IQN, so they configure multipath, log in, and keep all paths ready from the start. From then on, multipath takes care of path grouping and path switching based on path availability. If an active path goes down, a failover happens; who reinstates the failed path? IIUC, it's the admin's responsibility.

In my mind, kubernetes is meant to run in a mostly lights-out datacenter type environment. If you assume that someone has decided to deploy kubernetes with iscsi and multipath as the primary mechanism for supplying storage (maybe not likely, but this is the only case we care about in this part of the code), then ask yourself what happens when one of the switches carrying iscsi traffic inevitably dies and needs replacing. Hopefully monitoring will catch the dead switch and it will get replaced quickly, but in the meantime, it's possible that all pod/PV attaching grinds to a halt across the cluster while the system waits for the missing path to be restored.

If the system is going to respond to a lost path by ceasing to function, then what is the point of multipath? It just makes the system less reliable because now it's twice as likely to fail (2 single points of failure rather than 1). My opinion is that multipath should make the system more reliable, otherwise nobody will use it, and it becomes pointless.

@humblec
Contributor

humblec commented Sep 21, 2018

If the system is going to respond to a lost path by ceasing to function, then what is the point of multipath? It just makes the system less reliable because now it's twice as likely to fail (2 single points of failure rather than 1). My opinion is that multipath should make the system more reliable, otherwise nobody will use it, and it becomes pointless.

I understand this part, and it's one of the reasons I initially pushed back on this approach in my first comment. But at present we don't have a mechanism for kubelet to notify an admin that it was only able to log into one path even though multipath is configured and HA via more than one path is expected. Monitoring each and every LUN/IQN on all kubelets and identifying which path(s) are missing is also difficult, especially in the lights-out datacenter type environment we are considering. In this situation, failing and rescheduling the pod to another kubelet when only one path is available brings us the following advantages:

  • It is quite possible the second path will work on the new kubelet, if the issue was specific to the initiator or the issue on the target side was temporary.
  • If the pod fails to run on any kubelet (because of the restriction we introduce in this patch), or keeps failing on the same kubelet (in configurations like a DaemonSet), the admin will be notified by the user and gets a chance to look at what is going wrong.

IMO, the user experience is slightly better than serving I/O via only one path, especially when the path failure was a temporary issue. For permanent failures, I agree that the admin will be notified via some monitoring mechanism, as you mentioned.

Additionally, if we consider an active/active configuration instead of active/passive, this approach gives even more advantage, again in scenarios where the failure is temporary rather than the other paths being permanently down. With the current behavior, if there is a temporary issue, I/O is served via only one path for as long as the discovered path is available. That hurts the performance of an application that expected its I/O to be distributed among more than one path group.

With the above, I am more inclined to get this PR in, but we need to sort out how to handle the backward-compatibility breakage. @jsafrane @bswartz thoughts or counter-arguments? :)

@jsafrane
Member Author

I think at least some users would view this patch as a regression. They will say "It used to work and now it doesn't."

When all but one path are unavailable, kubelet tries a while longer to get the missing paths. But after 3 attempts it gives up and continues with just the single path, as before. IMO that's not a regression.

I am afraid there is no easy way to configure kubelet in this regard. It's nearly impossible to get a new configuration option there, and even if we had one, someone would need to configure it on all kubelets in the cluster.

@humblec
Contributor

humblec commented Sep 24, 2018

When all but one path are unavailable, kubelet tries a while longer to get the missing paths. But after 3 attempts it gives up and continues with just the single path, as before. IMO that's not a regression.

@jsafrane, I think it's better to describe the behavior when fewer than minMultipathCount paths are available in the description below. I think that's the source of the confusion.

--snip--
Don't mount single path instead of multipath volumes and always wait until at least 2 paths are available. Try up to 3 times to get all paths, then continue with 2 paths if they're available. After 5 attempts report error and kubelet will retry with exponential backoff.
--/snip--

Even if we proceed with fewer than minMultipathCount paths, the only regression we can think of here is the delay in bringing up the app pod when fewer paths are available. But IMO that's fine and should not be a big concern.

I am afraid that there are no easy ways how to configure kubelet in this regard. It's nearly impossible to get a new configuration option there and even if we had it, someone would need to configure it in all kubelets in the cluster.

Agreed.

@bswartz
Contributor

bswartz commented Sep 24, 2018

@humblec If I understand your stance, you're saying that, in the absence of any mechanism to repair broken paths, it's better to let attaches fail so that the kubernetes retry mechanism can hopefully find another place that doesn't have failed paths, or so that the pod never starts and a human gets annoyed enough to investigate more deeply.

I can see the logic of the first argument, if the missing path is specific to the kubernetes node and another node is likely to succeed. Failures related to down switches or down storage-controller ports won't be fixed by this mechanism, though, so it's of limited use.

The argument that it's better to fail completely in order to get the user involved in troubleshooting makes less sense to me. There are better ways to monitor the health of the cluster and to get notified of failures related to paths than allowing workloads to fail completely when they could proceed in some kind of degraded mode (where degraded could refer to lower reliability due to fewer backup paths, or lower performance due to fewer active connections).

In an ideal world, the attachment operation would be able to report success but also report a degraded status that would cause the system to periodically try to "fix" the attachment, and also the node status would reflect the degraded I/O state in case the problem was specific to some nodes and the scheduler wanted to avoid those nodes.

The main piece that's missing is the ability to actually log into the iSCSI targets and rescan the LUNs for the failed paths of an active attachment. Kubernetes has all the information required to do this, it just doesn't have a way to mark an attachment as degraded such that it needs to do more work on it. I think the right answer is to work on adding this capability so we can arrive at something closer to the ideal world.

@redbaron
Contributor

In theory, a cron/systemd timer unit with an iscsiadm call on a node to rediscover all SCSI devices could "repair" a degraded mpath, but it would also discover new SCSI devices, which probably would not be handled well by the existing iscsi code, which has to delete devices when it stops using them so as not to reuse the wrong one in the future.

@bswartz
Copy link
Contributor

bswartz commented Sep 24, 2018

In theory, a cron/systemd timer unit with an iscsiadm call on a node to rediscover all SCSI devices could "repair" a degraded mpath, but it would also discover new SCSI devices, which probably would not be handled well by the existing iscsi code, which has to delete devices when it stops using them so as not to reuse the wrong one in the future.

The specific repair steps would be:

  1. Log into any targets for which the SCSI host was missing. This would require:
    a) The portal(s) which were down at attach time
    b) The target IQN
    c) Any CHAP secrets required
  2. Scan the SCSI bus for the missing disk(s). This would require:
    a) Knowing which SCSI bus to scan, which can be derived from the target IQN and portal information
    b) Knowing the LUN number

I find it unlikely that a cron job could do a satisfactory job here, unless you had a really simple setup with no CHAP and just 1 target.

A CSI driver might be able to do the right thing without help from kubernetes, although it would have to bend the rules of statelessness and introduce some kind of background task to periodically attempt to fix broken connections.

I still believe that it makes more sense for Kubernetes to be aware of the degraded state of the attachment and for Kubernetes to periodically attempt the remediation.

@redbaron
Contributor

Currently we have "catch all" script, which discovers everything:

/sbin/iscsiadm -m discovery -t st -p portal_ip_1 -p portal_ip_2
/sbin/iscsiadm -m node --loginall=automatic

@bswartz
Contributor

bswartz commented Sep 24, 2018

@redbaron Are you arguing that you already have a working implementation of "multipath self-healing" for your environment and that this bugfix would be harmful to you if there was no way to turn off the new behavior?

@redbaron
Contributor

I don't argue for or against this change, but I saw a discussion about self-healing and thought I could throw in some ideas: administrators who care about it can run a cron job to periodically rediscover missing paths.

Besides that, it looks like once iscsid has logged into a portal, it keeps retrying to reach it, and it may self-discover missing paths once the portal is back up (unconfirmed, but very plausible). Portal login can be done on a host before starting kubelet, but kubelet may log off after the last use of a portal :(

@bswartz
Contributor

bswartz commented Sep 24, 2018

Yeah the problem is that different iSCSI systems have significantly different workflows. The one you're describing is the more traditional scheme where a storage system has 1 target IQN, 1 or more portals, and a bunch of LUNs on the same bus. Other systems use a different target IQN for each volume, which requires kubernetes to be capable of automating the login/logout procedures. We'd be in a better position if the iscsi plugin knew what kind of system it was dealing with, but there's no information in the PV spec to help distinguish.

@jsafrane
Member Author

@bswartz, I agree that a complete refactoring of the iSCSI volume plugin that would retry attaching missing paths after SetUp() finishes would be more appropriate. This PR is just a quick fix to help with temporarily unavailable targets.

If a target is unavailable for a longer time, this PR does not introduce any significant behavior change. Sure, it will be slower to set up a volume, but in the end it will set it up even with a single path, just as before.

I rephrased the commit message to emphasize that; sorry if it was not clear before.

@bswartz
Contributor

bswartz commented Sep 25, 2018

@jsafrane thanks!

@humblec
Contributor

humblec commented Sep 25, 2018

@bswartz

Yeah the problem is that different iSCSI systems have significantly different workflows. The one you're describing is the more traditional scheme where a storage system has 1 target IQN, 1 or more portals, and a bunch of LUNs on the same bus. Other systems use a different target IQN for each volume, which requires kubernetes to be capable of automating the login/logout procedures. We'd be in a better position if the iscsi plugin knew what kind of system it was dealing with, but there's no information in the PV spec to help distinguish.
--

The reason for the above is that our current in-tree iSCSI plugin is a generic one, to support as many iSCSI storage systems out there as possible. At the same time, I understand it's possible to leave an option to accommodate storage-specific configuration inside the spec, and maybe make it compatible with every iSCSI storage system, but that is wider-scope material.

Also, @jsafrane has updated the commit message with details on how this behaves in different scenarios. The regression we are mainly worried about here is very minimal, so I support getting this patch in. We can definitely improve, and can take this as a first step towards it.


const (
// Minimum number of paths required for a multipath iSCSI volume to continue with mounting the volume.
// Any lesser number will be reported as error and such volume won't be mounted.
Contributor

@jsafrane is this source comment true, specifically that such a volume won't be mounted?

Member Author

Thanks for spotting this, fixed.

// are available after this nr. of attempts, the volume plugin continues with mounting the volume.
minAttachAttempts = 3

// Total nr. of attempts to attach a volume before returning an error to kubelet.
Contributor

nr/number


// build discoverydb and discover iscsi target
b.exec.Run("iscsiadm", "-m", "discoverydb", "-t", "sendtargets", "-p", tp, "-I", b.Iface, "-o", "new")
// update discoverydb with CHAP secret
Contributor

A blank line above //update is better

Member Author

Done

}
out, err = b.exec.Run("iscsiadm", "-m", "discoverydb", "-t", "sendtargets", "-p", tp, "-I", b.Iface, "--discover")
if err != nil {
// delete discoverydb record
Contributor

same here.

Member Author

added blank spaces all over the place :-)
(but not between if and the true block, that looks weird).

pkg/volume/iscsi/iscsi_util.go (resolved review thread)
@humblec
Contributor

humblec commented Oct 3, 2018

@mtanino @bswartz Please review.

@jsafrane jsafrane force-pushed the iscsi-retry branch 2 times, most recently from 9d167bf to 8a2f346 Compare October 3, 2018 09:54
@jsafrane
Member Author

jsafrane commented Oct 3, 2018

@humblec, I fixed the comments on all the constants, please review.

In addition, I lowered minAttachAttempts to 2. This gives the initiator 2×10 seconds to resolve "small" network hiccups or overloaded targets in iscsiadm --login. It turned out to be enough in my overloaded Gluster environment.

devicePath = waitForMultiPathToExist(devicePaths, 10, b.deviceUtil)
if len(bkpPortal) > 1 {
// Multipath volume was requested. Wait up to 10 seconds for the multipath device to appear.
devicePath = waitForMultiPathToExist(devicePathList, 10, b.deviceUtil)
Contributor

Can we make 10 a constant and avoid hardcoding it?

Contributor

@jsafrane in case you missed above comment :)

Member Author

Changed to const.

Contributor

Thanks @jsafrane

Don't mount single path instead of multipath volumes and always wait until
at least 2 paths are available. Try up to 3 times to get all paths. Try 5
times to get at least 2 paths.
@humblec
Contributor

humblec commented Oct 8, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 8, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: humblec, jsafrane


@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 5d0c19c into kubernetes:master Oct 8, 2018