Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect if GCE PD udev link is wrong and try to correct it #66832

Merged
merged 1 commit into from Aug 2, 2018

Conversation

@msau42
Copy link
Member

commented Jul 31, 2018

What this PR does / why we need it:
udev can miss scsi events and not update device links properly when disks are detached and reattached to the same node in a different order. This PR workarounds the issue by comparing the udev link name, which is the PD disk name, with the scsi disk serial number returned from scsi_id, and prevents mounting of the disk if the link is wrong. It will also run udevadm trigger on the disk to try to correct the bad link.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

This fix prevents a GCE PD volume from being mounted if the udev device link is stale and tries to correct the link.
@msau42

This comment has been minimized.

Copy link
Member Author

commented Jul 31, 2018

// Calls scsi_id on the given devicePath to get the serial number reported by that device.
func getScsiSerial(devicePath string) (string, error) {
out, err := exec.New().Command(
"/lib/udev/scsi_id",

This comment has been minimized.

Copy link
@wonderfly

wonderfly Jul 31, 2018

Contributor

This puts a hard dependency on the installation path, which could vary from distro to distro, or from version to version in the same distro. That said, I don't have a better idea off the top of my head. Just wanted to point it out.

This comment has been minimized.

Copy link
@ddebroy

ddebroy Jul 31, 2018

Member

For kubelets running in containers, the above scsi_id tool will also need to be part of the kubelet image. How about reading the vpd_pg83 file in sysfs for reading in the identifier?

This comment has been minimized.

Copy link
@msau42

msau42 Jul 31, 2018

Author Member

The logic that scsi_id does to parse the vpd information looks quite involved, so would rather not have to repeat it here.

I will add in a check if the binary exists, and skip it if it doesn't.

@davidz627
Copy link
Contributor

left a comment

Should also submit a bugreport to maintainers of udevadm to get this actually fixed and link it in the PR

// validate that the path actually resolves to the correct disk
serial, err := getScsiSerial(path)
if err != nil {
return "", err

This comment has been minimized.

Copy link
@davidz627

davidz627 Jul 31, 2018

Contributor

nit: more error context

func parseScsiSerial(output string) (string, error) {
// Output should be in the form of:
// 0Google PersistentDisk <disk name>
scsiPattern := regexp.MustCompile(`^0Google\s+PersistentDisk\s+([\S]+)\s*$`)

This comment has been minimized.

Copy link
@davidz627

davidz627 Jul 31, 2018

Contributor

nit: pattern should be a const, might even be shared elsewhere (?)


Describe("Default StorageClass", func() {
Context("pods that use multiple volumes", func() {
It("should be reschedulable", func() {

This comment has been minimized.

Copy link
@davidz627

davidz627 Jul 31, 2018

Contributor

does this test consistently reproduce the error?

This comment has been minimized.

Copy link
@msau42

msau42 Jul 31, 2018

Author Member

no, the issue is very incredibly difficult to reproduce and I have not found a consistent way to get into the error state

// Output should be in the form of:
// 0Google PersistentDisk <disk name>
scsiPattern := regexp.MustCompile(`^0Google\s+PersistentDisk\s+([\S]+)\s*$`)
substrings := scsiPattern.FindStringSubmatch(string(output))

This comment has been minimized.

Copy link
@ddebroy

ddebroy Jul 31, 2018

Member

nit: output is already of type string. So the string(output) is probably not necessary?

@msau42 msau42 force-pushed the msau42:udev branch from e14ba44 to bcebd81 Jul 31, 2018

@msau42

This comment has been minimized.

Copy link
Member Author

commented Jul 31, 2018

Updated

@msau42 msau42 force-pushed the msau42:udev branch from bcebd81 to eaf460d Jul 31, 2018

@msau42

This comment has been minimized.

Copy link
Member Author

commented Jul 31, 2018

The e2e test is failing on aws for some reason. Since this fix is only for gce pd, I've removed aws from the test for now.

@msau42

This comment has been minimized.

Copy link
Member Author

commented Aug 1, 2018

/test pull-kubernetes-bazel-build

@jsafrane
Copy link
Member

left a comment

Just minor nit, otherwise lgtm


// Parse the output returned by scsi_id and extract the serial number
func parseScsiSerial(output string) (string, error) {
scsiRegex := regexp.MustCompile(scsiPattern)

This comment has been minimized.

Copy link
@jsafrane

jsafrane Aug 1, 2018

Member

You should compile the regex only once.

This comment has been minimized.

Copy link
@msau42

msau42 Aug 1, 2018

Author Member

done

@msau42 msau42 force-pushed the msau42:udev branch from eaf460d to dd0e12a Aug 1, 2018

@msau42 msau42 force-pushed the msau42:udev branch from dd0e12a to cf10715 Aug 1, 2018

@davidz627

This comment has been minimized.

Copy link
Contributor

commented Aug 1, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 1, 2018

@msau42

This comment has been minimized.

Copy link
Member Author

commented Aug 1, 2018

/test pull-kubernetes-e2e-kops-aws

@saad-ali
Copy link
Member

left a comment

/lgtm
/approve

Thanks @msau42

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Aug 1, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidz627, msau42, saad-ali

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot

This comment has been minimized.

Copy link
Contributor

commented Aug 2, 2018

Automatic merge from submit-queue (batch tested with PRs 64645, 66832). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit d3c0965 into kubernetes:master Aug 2, 2018

18 checks passed

Submit Queue Queued to run github e2e tests a second time.
Details
cla/linuxfoundation msau42 authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gke Skipped
pull-kubernetes-e2e-kops-aws Job succeeded.
Details
pull-kubernetes-e2e-kubeadm-gce Skipped
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped
pull-kubernetes-local-e2e-containerized Skipped
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
k8s-github-robot pushed a commit that referenced this pull request Aug 2, 2018
Kubernetes Submit Queue
Merge pull request #66878 from msau42/automated-cherry-pick-of-#66832…
…-upstream-release-1.11

Automatic merge from submit-queue.

Automated cherry pick of #66832: Detect if GCE PD udev link is wrong and try to correct it

Cherry pick of #66832 on release-1.11.

#66832: Detect if GCE PD udev link is wrong and try to correct it
k8s-github-robot pushed a commit that referenced this pull request Aug 2, 2018
Kubernetes Submit Queue
Merge pull request #66879 from msau42/automated-cherry-pick-of-#66832…
…-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #66832: Detect if GCE PD udev link is wrong and try to correct it

Cherry pick of #66832 on release-1.10.

#66832: Detect if GCE PD udev link is wrong and try to correct it
k8s-github-robot pushed a commit that referenced this pull request Aug 3, 2018
Kubernetes Submit Queue
Merge pull request #66925 from msau42/udev
Automatic merge from submit-queue (batch tested with PRs 66933, 66925). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Rework multi-volume test to use StatefulSet

**What this PR does / why we need it**:
The e2e test that got added as part of #66832 fails in a multi-zone environment because the volumes get provisioned in random zones.  This PR reworks the test to use StatefulSet instead, which handles provisioning multiple PVCs in the same zone.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
@@ -261,39 +270,73 @@ func createRegionalPD(
}

// Returns the first path that exists, or empty string if none exist.
func verifyDevicePath(devicePaths []string, sdBeforeSet sets.String) (string, error) {
func verifyDevicePath(devicePaths []string, sdBeforeSet sets.String, diskName string) (string, error) {

This comment has been minimized.

Copy link
@davidz627

davidz627 Aug 27, 2018

Contributor

@msau42 is this method going to be refactored out as part of the mount utilities refactor?

Also can this be made into a public interface?

This comment has been minimized.

Copy link
@msau42

msau42 Aug 27, 2018

Author Member

It's not being considered as part of the Mount utils refactor, but we can look into adding this there after the initial refactor is done. Right now, only GCE PD does this kind of checking.

k8s-github-robot pushed a commit that referenced this pull request Aug 29, 2018
Kubernetes Submit Queue
Merge pull request #66880 from msau42/automated-cherry-pick-of-#66832…
…-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #66832: Detect if GCE PD udev link is wrong and try to correct it

Cherry pick of #66832 on release-1.9.

#66832: Detect if GCE PD udev link is wrong and try to correct it

#66925: Rework multi-volume test to use StatefulSet
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.