[3.11] Bug 1970977: UPSTREAM: multiple: Fix corruption of FibreChannel volumes #26222

Merged

Conversation

jsafrane
Contributor

@jsafrane jsafrane commented Jun 14, 2021

There are two upstream PRs; a bot will link them. The code lives in different files in 3.11 than in 4.x, but the backport went quite smoothly.

Fix corruption of FS devices when a pod is deleted when kubelet is not running. The volume reconstruction in this case must not fail, otherwise the volume is marked as unused and can be mounted on another node, corrupting the filesystem on it.

Therefore try hard not to fail 1) volume reconstruction and 2) unmounting the volume on error.

Upstream cherry pick to 1.21: kubernetes/kubernetes#102656

@openshift-ci
Contributor

openshift-ci bot commented Jun 14, 2021

@jsafrane: This pull request references Bugzilla bug 1970977, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (3.11.z) matches configured target release for branch (3.11.z)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (wduan@redhat.com), skipping review request.

In response to this:

Bug 1970977: UPSTREAM: multiple: Fix corruption of FibreChannel volumes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jun 14, 2021
@jsafrane
Contributor Author

/retest

@jsafrane jsafrane changed the title from "Bug 1970977: UPSTREAM: multiple: Fix corruption of FibreChannel volumes" to "[3.11] Bug 1970977: UPSTREAM: multiple: Fix corruption of FibreChannel volumes" Jun 15, 2021
@jsafrane
Contributor Author

I tested this on a 3.11 cluster and was not able to reproduce the "could not get consistent content of /proc/mounts after X attempts" error (either with or without this PR), even with hundreds of processes constantly mounting and unmounting tmpfs in parallel. At least the PR did not make things worse: an iSCSI volume disguised as FC was always unmounted, and its multipath device was flushed and removed. No corrupted filesystem was found.
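
For context, the error text quoted in this comment is what a "consistent read" of /proc/mounts reports when it gives up: the file is read repeatedly until two consecutive reads return identical content, because the mount table can change while it is being read. A minimal Go sketch of that idea follows. The names `consistentRead` and `consistentReadFile` are illustrative and are not taken from this PR, and the attempt count of 3 is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// consistentRead calls read until two consecutive calls return identical
// content. It performs one initial read plus at most `attempts` further reads.
// The name and signature are illustrative, not the kubelet's actual helper.
func consistentRead(read func() ([]byte, error), attempts int) ([]byte, error) {
	prev, err := read()
	if err != nil {
		return nil, err
	}
	for i := 0; i < attempts; i++ {
		cur, err := read()
		if err != nil {
			return nil, err
		}
		if bytes.Equal(prev, cur) {
			// Two reads in a row agree, so the content was stable.
			return cur, nil
		}
		prev = cur
	}
	return nil, fmt.Errorf("could not get consistent content after %d attempts", attempts)
}

// consistentReadFile applies consistentRead to a file on disk.
func consistentReadFile(filename string, attempts int) ([]byte, error) {
	return consistentRead(func() ([]byte, error) { return os.ReadFile(filename) }, attempts)
}

func main() {
	// /proc/mounts exists only on Linux; elsewhere this prints the read error.
	content, err := consistentReadFile("/proc/mounts", 3)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("read %d bytes of mount table\n", len(content))
}
```

With many processes mounting and unmounting in parallel, consecutive reads are more likely to differ, which is the condition the test in this comment tried to provoke.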

@jsafrane
Contributor Author

/retest

@gnufied
Member

gnufied commented Jun 28, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 28, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jun 29, 2021

@jsafrane: This pull request references Bugzilla bug 1970977, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (3.11.z) matches configured target release for branch (3.11.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @chao007

@openshift-ci openshift-ci bot requested a review from chao007 June 29, 2021 09:43
@jsafrane
Contributor Author

/retest

@openshift-ci
Contributor

openshift-ci bot commented Jun 30, 2021

@jsafrane: This pull request references Bugzilla bug 1970977, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (3.11.z) matches configured target release for branch (3.11.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @chao007

@bparees
Contributor

bparees commented Jul 1, 2021

/override ci/openshift-jenkins-cmd
/override ci/openshift-jenkins/extended_clusterup

failing for unrelated reasons

@openshift-ci
Contributor

openshift-ci bot commented Jul 1, 2021

@bparees: /override requires a failed status context to operate on.
The following unknown contexts were given:

  • ci/openshift-jenkins-cmd

Only the following contexts were expected:

  • ci/openshift-jenkins/cmd
  • ci/openshift-jenkins/end_to_end
  • ci/openshift-jenkins/extended_clusterup
  • ci/openshift-jenkins/extended_conformance_install
  • ci/prow/e2e-gcp
  • ci/prow/images
  • ci/prow/integration
  • ci/prow/unit
  • ci/prow/verify
  • tide

@bparees
Contributor

bparees commented Jul 1, 2021

/override ci/openshift-jenkins/cmd
/override ci/openshift-jenkins/extended_clusterup

@openshift-ci
Contributor

openshift-ci bot commented Jul 1, 2021

@bparees: Overrode contexts on behalf of bparees: ci/openshift-jenkins/cmd, ci/openshift-jenkins/extended_clusterup

@jsafrane
Contributor Author

Override does not help, re-testing
/test ci/openshift-jenkins/extended_clusterup
/test ci/openshift-jenkins/cmd

@openshift-ci
Contributor

openshift-ci bot commented Jul 21, 2021

@jsafrane: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test artifacts
  • /test e2e-gcp
  • /test e2e-gcp-crio
  • /test images
  • /test integration
  • /test unit
  • /test verify
  • /test cmd
  • /test end_to_end
  • /test extended_builds
  • /test extended_clusterup
  • /test extended_conformance_install
  • /test extended_gssapi
  • /test extended_image_ecosystem
  • /test extended_image_registry
  • /test extended_ldap_groups
  • /test extended_networking
  • /test service-catalog

Use /test all to run the following jobs:

  • pull-ci-openshift-origin-release-3.11-e2e-gcp
  • pull-ci-openshift-origin-release-3.11-images
  • pull-ci-openshift-origin-release-3.11-integration
  • pull-ci-openshift-origin-release-3.11-unit
  • pull-ci-openshift-origin-release-3.11-verify
  • test_pull_request_origin_cmd
  • test_pull_request_origin_end_to_end_311
  • test_pull_request_origin_extended_clusterup-release-3.11
  • test_pull_request_origin_extended_conformance_install-release-3.11

@jsafrane
Contributor Author

/test extended_clusterup
/test cmd

@jsafrane
Contributor Author

/test extended_clusterup
/test cmd

Does it work on an overridden test?

@brenton
Contributor

brenton commented Jul 22, 2021

/test extended_clusterup
/test cmd

@brenton
Contributor

brenton commented Jul 22, 2021

/retest

When UnmountDevice() of a FibreChannel volume fails after unmounting the
device and before the device is fully cleaned up, a subsequent
UnmountDevice() retry won't find the device mounted and will return without
retrying the device cleanup.

Therefore implement its own retry inside UnmountDevice() to make sure that
the volume devices are either fully cleaned up or the error is serious enough
that even 1 minute of trying does not help.
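
The retry described in this commit message can be sketched roughly as follows. This is not the code from the PR: `retryUntil` and `simulateCleanup` are hypothetical names, the failing cleanup is a stand-in for real device removal, and the retry interval is an assumption. Only the idea of retrying the cleanup in place until a time budget runs out (1 minute in the commit message, shortened here so the example finishes quickly) comes from the text above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryUntil calls op every interval until it succeeds or the next attempt
// would start after the timeout. It returns the number of attempts made and,
// on failure, the last error wrapped with context.
func retryUntil(op func() error, interval, timeout time.Duration) (int, error) {
	deadline := time.Now().Add(timeout)
	attempts := 0
	for {
		attempts++
		err := op()
		if err == nil {
			return attempts, nil
		}
		if time.Now().Add(interval).After(deadline) {
			return attempts, fmt.Errorf("giving up after %d attempts: %w", attempts, err)
		}
		time.Sleep(interval)
	}
}

// simulateCleanup runs a fake device cleanup that fails `failures` times
// before succeeding. A real caller would pass the actual cleanup step and a
// budget such as time.Minute instead of the short values used here.
func simulateCleanup(failures int) (int, error) {
	remaining := failures
	cleanup := func() error {
		if remaining > 0 {
			remaining--
			return errors.New("device busy")
		}
		return nil
	}
	return retryUntil(cleanup, time.Millisecond, 200*time.Millisecond)
}

func main() {
	attempts, err := simulateCleanup(2)
	fmt.Println("attempts:", attempts, "error:", err)
}
```

Keeping the loop inside the unmount call matters because, as the message explains, an outer retry would see the device as no longer mounted and return early without ever reaching the cleanup step again.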
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 23, 2021
@jsafrane
Contributor Author

Rebased to today's release-3.11, hoping to get fresh tests...

@gnufied
Member

gnufied commented Jul 23, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 23, 2021
@sttts
Contributor

sttts commented Jul 26, 2021

/approve

@mfojtik
Member

mfojtik commented Jul 26, 2021

/approve

@openshift-ci
Contributor

openshift-ci bot commented Jul 26, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, jsafrane, mfojtik, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 26, 2021
@openshift-merge-robot openshift-merge-robot merged commit 5f037af into openshift:release-3.11 Jul 26, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 26, 2021

@jsafrane: All pull requests linked via external trackers have merged:

Bugzilla bug 1970977 has been moved to the MODIFIED state.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

7 participants