Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix iscsi logout issues #78941

Merged
merged 1 commit into from Jul 31, 2019

Conversation

@jsafrane
Copy link
Member

commented Jun 12, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
Our iSCSI e2e tests sometimes report that a file that has been just written to an iSCSI volume is not available there. e2e systems are chronically overloaded and I noticed two issues there:

  • iscsiadm sometimes reports iscsiadm: Could not execute operation on all records: encountered iSCSI database failure. We should check all iscsiadm exit codes and re-try logging out from a target.

    • This change can be tricky, we ignored iscsiadm errors previously and now we may enter endless loop if iscsiadm fails constantly. kubelet should cope with that IMO. I added extra checks that make sure that logging off from already removed target succeeds.
  • Looking at Red Hat docs, it suggests flushing devices before removing them:
    https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/removing_devices. I added blockdev --flushbufs <device> to make sure we don't loose data, especially when using raw block devices.

Does this PR introduce a user-facing change?:

Failed iscsi logout is now re-tried periodically.

@jsafrane jsafrane force-pushed the jsafrane:fix-iscsi-logout branch from c5d2364 to c8fd65e Jun 12, 2019

@k8s-ci-robot k8s-ci-robot requested review from humblec and saad-ali Jun 12, 2019

@jsafrane jsafrane force-pushed the jsafrane:fix-iscsi-logout branch from c8fd65e to 346a1b2 Jun 12, 2019

@jsafrane jsafrane changed the title Fix iscsi logout issues WIP: Fix iscsi logout issues Jun 12, 2019

@jsafrane

This comment has been minimized.

Copy link
Member Author

commented Jun 12, 2019

/test pull-kubernetes-e2e-gce-iscsi

2 similar comments
@jsafrane

This comment has been minimized.

Copy link
Member Author

commented Jun 12, 2019

/test pull-kubernetes-e2e-gce-iscsi

@jsafrane

This comment has been minimized.

Copy link
Member Author

commented Jun 13, 2019

/test pull-kubernetes-e2e-gce-iscsi

@jsafrane

This comment has been minimized.

Copy link
Member Author

commented Jun 17, 2019

/test pull-kubernetes-e2e-gce-iscsi

@jsafrane jsafrane force-pushed the jsafrane:fix-iscsi-logout branch from a3b96ec to 5d78f78 Jul 19, 2019

@jsafrane

This comment has been minimized.

Copy link
Member Author

commented Jul 19, 2019

Rebased and removed testing patches. Now it's ready for review

@j-griffith @bswartz @humblec @bertinatto, PTAL

@jsafrane jsafrane force-pushed the jsafrane:fix-iscsi-logout branch from 5d78f78 to dbaec52 Jul 19, 2019

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Jul 19, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jsafrane

This comment has been minimized.

Copy link
Member Author

commented Jul 19, 2019

/test pull-kubernetes-node-e2e-containerd/test

(iscsi tests are broken because of #78940)

@jsafrane

This comment has been minimized.

Copy link
Member Author

commented Jul 19, 2019

/test pull-kubernetes-node-e2e-containerd

if err != nil {
klog.Errorf("iscsi: failed to delete node record Error: %s", string(out))
return err

This comment has been minimized.

Copy link
@bertinatto

bertinatto Jul 19, 2019

Member

The changes LGTM, specially the device flushing; however, IDK iSCSI well enough to be sure these are the only exit codes we could/should ignore.

For example, looking through the exit codes it seems we should always retry on ISCSI_ERR_AGAIN and ISCSI_ERR_BUSY, but not for ISCSI_ERR_ACCESS and ISCSI_ERR_INVAL. To make it more difficult, I think each of these codes are tied to a specific set of commands.

This comment has been minimized.

Copy link
@bertinatto

bertinatto Jul 19, 2019

Member

Anyway, I think it's OK to retry regardless

This comment has been minimized.

Copy link
@jsafrane

jsafrane Jul 23, 2019

Author Member

I don't see ISCSI_ERR_AGAIN and ISCSI_ERR_BUSY in my man page (iscsiadm version 6.2.0.876-1)

This comment has been minimized.

Copy link
@jsafrane

jsafrane Jul 23, 2019

Author Member

I saw ISCSI_ERR_NO_OBJS_FOUND for deleting already deleted session and I think I saw ISCSI_ERR_SESS_NOT_FOUND in some log.

This comment has been minimized.

Copy link
@bertinatto

bertinatto Jul 23, 2019

Member

I was looking at the iscsi_err.h file, so maybe these exit codes aren't returned by iscsiadm (they're probably used in other utilities).

This comment has been minimized.

Copy link
@j-griffith

j-griffith Jul 23, 2019

Contributor

There are a bunch of different things that can be returned from iscsiadm that may or may not matter. One way to approach this is ignore any error, and do a verify that things were cleaned up, if it has the error is irrelevant.

This comment has been minimized.

Copy link
@jsafrane

jsafrane Jul 24, 2019

Author Member

verify that things were cleaned up

I can check that iface and node need to be deleted easily (-m [node|iface] -o show).

I am not sure if it's possible to check if iscsiadm -m node -p 127.0.0.1 -T iqn.2003-01.io.k8s:e2e.volume-8751 --logout is needed. iscsiadm -m node -o show is the same before and after logout. How can I map node / iqn to a session without too much parsing of iscsiadm -m session -o show -P 2?

This comment has been minimized.

Copy link
@j-griffith

j-griffith Jul 24, 2019

Contributor

Yeah, I was thinking of using the approach to check for sessions; I may be overthinking this though.

@j-griffith
Copy link
Contributor

left a comment

I think this is "ok" as is, but if you wanted to be sure about the iscsiadm errors I'd suggest just removing the explicit checks for errors being returned, instead just if there is an error check if thing were cleaned up or not.

if err != nil {
klog.Errorf("iscsi: failed to delete node record Error: %s", string(out))
return err

This comment has been minimized.

Copy link
@j-griffith

j-griffith Jul 23, 2019

Contributor

There are a bunch of different things that can be returned from iscsiadm that may or may not matter. One way to approach this is ignore any error, and do a verify that things were cleaned up, if it has the error is irrelevant.

@bswartz

This comment has been minimized.

Copy link
Contributor

commented Jul 24, 2019

Flushing the device happens automatically on unmount. Is this bug about raw block iSCSI? Or is there good evidence that unmount by itself isn't good enough?

@jsafrane

This comment has been minimized.

Copy link
Member Author

commented Jul 25, 2019

Unmount should be enough, this is extra precaution for block.

Report error when iscsiadm fails during detach
Kubernetes should retry detaching iSCSI volumes on error. In addition, it
should not report an error when detaching a disk while the disk is already
detached.

@jsafrane jsafrane force-pushed the jsafrane:fix-iscsi-logout branch from dbaec52 to cdcd2e2 Jul 25, 2019

@k8s-ci-robot k8s-ci-robot added size/S and removed size/M labels Jul 25, 2019

@jsafrane

This comment has been minimized.

Copy link
Member Author

commented Jul 25, 2019

/retest

@j-griffith

This comment has been minimized.

Copy link
Contributor

commented Jul 30, 2019

/lgtm

@fejta-bot

This comment has been minimized.

Copy link

commented Jul 31, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@humblec

This comment has been minimized.

Copy link
Contributor

commented Jul 31, 2019

/retest

@fejta-bot

This comment has been minimized.

Copy link

commented Jul 31, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@fejta-bot

This comment has been minimized.

Copy link

commented Jul 31, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@jsafrane

This comment has been minimized.

Copy link
Member Author

commented Jul 31, 2019

/test pull-kubernetes-integration

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Jul 31, 2019

@jsafrane: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce-iscsi cdcd2e2 link /test pull-kubernetes-e2e-gce-iscsi
pull-kubernetes-e2e-gce-iscsi-serial cdcd2e2 link /test pull-kubernetes-e2e-gce-iscsi-serial

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot merged commit fde94e9 into kubernetes:master Jul 31, 2019

18 of 23 checks passed

pull-kubernetes-e2e-gce-iscsi Job failed.
Details
pull-kubernetes-e2e-gce-iscsi-serial Job failed.
Details
pull-kubernetes-e2e-gce-100-performance Job triggered.
Details
pull-kubernetes-kubemark-e2e-gce-big Job triggered.
Details
pull-kubernetes-verify Job triggered.
Details
cla/linuxfoundation jsafrane authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-storage-slow Job succeeded.
Details
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-node-e2e-containerd Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details

@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Jul 31, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.