Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: support external-health-monitor #210

Merged
merged 1 commit into from
Jan 26, 2021

Conversation

fengzixu
Copy link
Contributor

@fengzixu fengzixu commented Oct 18, 2020

What type of PR is this?

/kind feature
What this PR does / why we need it:
Implementing the ListVolumes, GetVolume and NodeStats interfaces to support external-health-monitor project
Which issue(s) this PR fixes:

Fixes kubernetes-csi/external-health-monitor#21

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

support external-health-monitor feature in hostpath driver
- Check if volume mount path of pod exists or not
- Check if filesystem that volume relies on may be out of capacity
- Check if volume usage is almost full

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. labels Oct 18, 2020
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 18, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2020
@xing-yang
Copy link
Contributor

/assign

@xing-yang
Copy link
Contributor

This is a new feature. Please add release notes.

@fengzixu fengzixu changed the title [WIP]feature: support external-heath-monitor feature: support external-heath-monitor Nov 7, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 7, 2020
@fengzixu
Copy link
Contributor Author

fengzixu commented Nov 7, 2020

The implementation of support external-health-monitor in host-path-driver has been finished. I have tested this PR in my personal env. But I'm not sure If I need to change the orchestration file in deploy directory to add related argument of deploying external-health-monitor-controller and agent.

@xing-yang After you reviewed the implementation, please tell me if i need to do it

@xing-yang
Copy link
Contributor

Add a release note.

@xing-yang
Copy link
Contributor

Can you take a look at the CI failures? Did you do go mod vendor and go mod tidy?

cmd/hostpathplugin/main.go Outdated Show resolved Hide resolved
pkg/hostpath/controllerserver.go Outdated Show resolved Hide resolved
pkg/hostpath/controllerserver.go Outdated Show resolved Hide resolved
pkg/hostpath/healthcheck.go Outdated Show resolved Hide resolved
pkg/hostpath/hostpath.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Nov 15, 2020
@fengzixu fengzixu changed the title feature: support external-heath-monitor feature: support external-health-monitor Nov 15, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 15, 2020
go.mod Outdated Show resolved Hide resolved
return false, "The source path of the volume doesn't exist"
}

mpExist, err := checkMountPointExist(volumeHandle)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only this mountpoint check is relevant for NodeGetVolumeStats.
Other checks are for ListVolumes() and GetVolume() from the controller side.

return nil, status.Error(codes.NotFound, "The volume not found")
}

healthy, msg := doHealthCheck(in.GetVolumeId())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you are calling the same doHealthCheck for NodeGetVolumeStats, ListVolumes, and GetVolume. That will produce duplicate events on PVCs and Pods. Health check from controller side should be separate from node side so there should not be duplicate events.
For NodeGetVolumeStats, we should only need to check things like the mount condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, it’s hard to separate health check cases in the host-path driver. Because host-path-driver runs all of CSI components on the same node.

But, to avoid duplicate the event, it's one of the reasons we need to separate them.

@xing-yang
Copy link
Contributor

/assign @NickrenREN

configvar CSI_PROW_SANITY_VERSION 5421d9f3c37be3b95b241b44a094a3db11bee789 "csi-test version" # latest master
configvar CSI_PROW_SANITY_IMPORT_PATH github.com/kubernetes-csi/csi-test "csi-test package"
configvar CSI_PROW_SANITY_VERSION v4.0.2 "csi-test version" # latest master
t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did this pass testing?

@pohly pohly mentioned this pull request Jan 11, 2021
@NickrenREN
Copy link

hostpath unit task still fails:

make: *** [release-tools/build.make:259: test-shellcheck] Error 1
make: Target 'test' not remade because of errors.
ERROR: 'make test' failed
WARNING: 'make test' failed, proceeding anyway

@fengzixu
Copy link
Contributor Author

hostpath unit task still fails:

make: *** [release-tools/build.make:259: test-shellcheck] Error 1
make: Target 'test' not remade because of errors.
ERROR: 'make test' failed
WARNING: 'make test' failed, proceeding anyway

@NickrenREN I already fixed it in this PR kubernetes-csi/csi-release-tools#128.
Actually, we cannot directly modify prow.sh in this PR, it is just for testing if other e2e test can be passed.

We need to merge kubernetes-csi/csi-release-tools#128 first. And submit a new PR to host-path repo to update prow.sh by git subtree command. Finally, we can rebase this PR to resolve all of test failure

@NickrenREN
Copy link

@NickrenREN I already fixed it in this PR kubernetes-csi/csi-release-tools#128.
Actually, we cannot directly modify prow.sh in this PR, it is just for testing if other e2e test can be passed.
We need to merge kubernetes-csi/csi-release-tools#128 first. And submit a new PR to host-path repo to update prow.sh by git subtree command. Finally, we can rebase this PR to resolve all of test failure

get it, thanks

@NickrenREN
Copy link

thanks for this.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 22, 2021
@xing-yang
Copy link
Contributor

Under the section "Special notes for your reviewer" in the PR description, can you add more details on what are checked and what are reported as abnormal volume condition in the health monitor controller and agent respectively?

@xing-yang
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fengzixu, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 25, 2021
@xing-yang
Copy link
Contributor

/retest

3 similar comments
@xing-yang
Copy link
Contributor

/retest

@xing-yang
Copy link
Contributor

/retest

@xing-yang
Copy link
Contributor

/retest

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Jan 26, 2021

@fengzixu: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kubernetes-csi-csi-driver-host-path-1-17-on-kubernetes-1-17 9e7831d link /test pull-kubernetes-csi-csi-driver-host-path-1-17-on-kubernetes-1-17
pull-kubernetes-csi-csi-driver-host-path-1-20-on-kubernetes-1-20 9e7831d link /test pull-kubernetes-csi-csi-driver-host-path-1-20-on-kubernetes-1-20

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@@ -4,6 +4,6 @@ LABEL description="HostPath Driver"
ARG binary=./bin/hostpathplugin

# Add util-linux to get a new version of losetup.
RUN apk add util-linux
RUN apk add util-linux && apk update && apk upgrade
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this related to "support external-health-monitor"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I implement this feature, I found I need run findmnt command with jsonargument. It was supported in newer version. So, I update it

fmt.Printf("Failed to run driver: %s", err.Error())
os.Exit(1)

}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here? This looks like an unrelated enhancement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An error may be returned from that function, just checking it right?

@pohly
Copy link
Contributor

pohly commented Apr 1, 2021

Updating E2E testing in Kubernetes from csi-driver-host-path 1.4.0 to 1.6.2 causes jobs to fail and the code from this PR seems to be involved, see kubernetes/kubernetes#100637 (comment).

@pohly
Copy link
Contributor

pohly commented Apr 1, 2021

Do we have any test for the discoveryExistingVolumes function?

for _, pv := range mountInfosOfPod.ContainerFileSystem {
if !strings.Contains(pv.Target, csiSignOfVolumeTargetPath) {
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this code distinguish between mounts from some other CSI driver and the mounts of this hostpath driver instance?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it doesn't.

Both podVolumeTargetPath = /var/lib/kubelet/pods and csiSignOfVolumeTargetPath = kubernetes.io~csi/pvc are shared with all other CSI driver instances.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the obvious failure of adding volumes that aren't actually managed by this driver, we also get a race condition that leads to the "failed to get capacity info: no such file or directory" error from kubernetes/kubernetes#100637 (comment)

  • one CSI driver starts listing mounts
  • another CSI driver concurrently unmounts
  • the first CSI driver tries to get capacity information which then fails

Does that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is discoveryExistingVolumes related to volume health checks?

I understand the desire to support container restarts. I just don't understand how it is related to health checks.

Regarding container restarts: the information restored is incomplete (for example, the more recently added nominal capacity of a volume cannot be discovered). Instead of making this code more complex, can we simply rip it out and go back to the state from the hostpath 1.4 release, i.e. the driver must not be restarted?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with discovering snapshots?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the implementation of this volume discovery: wouldn't it be better to use a JSON file in the dataDir to store the in-memory list each time changes are made to it? Then we can restore all information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When testing upgrade for moving snapshot from v1beta1 to v1, for example, driver restart will happen. So we need to handle driver restarts.

I'm open to different ways of implementing this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you needed that for some other purpose. Now it makes more sense to me.

I started implementing this, but won't have time to finish it. If someone else wants to take a stab at it, feel free.

IMHO the code which Volume and Snaphshot structs and the corresponding maps should be in its own package, with get/add/update/delete functions that have to be used to make changes. Currently it's in the same package and I already found code which directly manipulated the maps instead of using the existing helper functions.

Then in that package, we can dump the entire struct into one file to store volumes and snapshots. That removes a whole lot of custom code, including the one which we discuss here. Combine that with a configurable data dir and we can add unit tests for this...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fengzixu can you take a stab at this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Let me try

@pohly pohly mentioned this pull request Apr 7, 2021
andyzhangx added a commit to andyzhangx/csi-driver-host-path that referenced this pull request Dec 16, 2023
b54c1ba4 Merge pull request kubernetes-csi#246 from xing-yang/go_1.21
5436c81e Change go version to 1.21.5
267b40e9 Merge pull request kubernetes-csi#244 from carlory/sig-storage
b42e5a2d nominate self (carlory) as kubernetes-csi reviewer
a17f536f Merge pull request kubernetes-csi#210 from sunnylovestiramisu/sidecar
011033de Use set -x instead of die
5deaf667 Add wrapper script for sidecar release

git-subtree-dir: release-tools
git-subtree-split: b54c1ba49469d4d5d1b5d75285e8868ffe3d328f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement health monitoring in CSI hostpath driver
5 participants