Fix a panic for in-tree drivers that partially support Block volume metrics #101587

Merged
merged 4 commits into from May 24, 2021

Conversation

nixpanic
Contributor

@nixpanic nixpanic commented Apr 28, 2021

What type of PR is this?

/kind feature
/sig storage

What this PR does / why we need it:

#97972 added support for gathering metrics for Block volumes provided by CSI drivers. The current in-tree drivers that support Block volumes can return at least the Capacity of the block-device.

CSI drivers are currently not tested for metrics gathering in the e2e framework. Adding support for this in the in-tree drivers makes it possible to verify the functionality and prevent regressions.

Which issue(s) this PR fixes:

Fixes #101431

Special notes for your reviewer:

Block volume metrics detection is quite limited. Apart from the Capacity (the size of the volume), there is little that can be gathered with standard tools. The contents of a Block volume cannot be inspected like a filesystem. In theory, drivers could thin-provision volumes (like a sparse file) and provide Used/Available in addition to the Capacity. However, this requires access to, and detailed knowledge of, the storage platform, and cannot be detected with standard tools.

Does this PR introduce a user-facing change?

Some of the in-tree storage drivers indicate support for the MetricsProvider interface but fail to configure it for BlockMode volumes. With a recent change, Kubelet calls GetMetrics() for BlockMode volumes, and the in-tree drivers that lack this support cause a Go panic. Now the in-tree storage drivers that support BlockMode volumes return the Capacity of the volume in the GetMetrics() call.
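
For readers unfamiliar with this failure mode, here is a minimal sketch of how a type can satisfy an interface through embedding and still panic when the method is called. The types below (`Metrics`, `MetricsProvider`, `blockMapper`, `callGetMetrics`) are simplified stand-ins written for illustration; they are not the actual Kubernetes definitions.

```go
package main

import "fmt"

// Metrics and MetricsProvider are simplified stand-ins for illustration;
// only Capacity is modelled here.
type Metrics struct {
	Capacity int64
}

type MetricsProvider interface {
	GetMetrics() (*Metrics, error)
}

// blockMapper is a hypothetical driver type. Embedding MetricsProvider makes
// it satisfy the interface at compile time, even if the field is never set.
type blockMapper struct {
	MetricsProvider
}

// callGetMetrics calls GetMetrics() and converts a panic into an error so the
// failure can be observed without crashing the program.
func callGetMetrics(p MetricsProvider) (m *Metrics, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	return p.GetMetrics()
}

func main() {
	var v interface{} = blockMapper{} // embedded field left nil

	// The type assertion succeeds, so it cannot detect the missing implementation.
	p, ok := v.(MetricsProvider)
	fmt.Println("implements MetricsProvider:", ok)

	// The call is promoted to the nil embedded interface and panics at runtime.
	_, err := callGetMetrics(p)
	fmt.Println("GetMetrics error:", err)
}
```

The sketch recovers from the panic only to make the behaviour visible; it is not meant as a suggested fix.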

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/storage Categorizes an issue or PR as relevant to SIG Storage. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 28, 2021
@k8s-ci-robot
Contributor

Hi @nixpanic. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added area/test sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Apr 28, 2021
@nixpanic
Contributor Author

@gnufied, you expressed interest in adding a metrics test for BlockMode volumes, so assigning you already, before others have posted review comments.

/assign gnufied

@jsafrane
Member

jsafrane commented May 4, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 4, 2021
@nixpanic
Contributor Author

nixpanic commented May 4, 2021

The Azure jobs (not required to pass) seem to have failed due to something unrelated:

ProvisioningFailed: Failed to provision volume with StorageClass "azurefile-3090-kubernetes.io-azure-file-dynamic-sc-vv4t2": invalid option "enableLargeFileshares" for volume plugin kubernetes.io/azure-file

@andyzhangx
Member

> The Azure jobs (not required to pass) seem to have failed due to something unrelated:
>
> ProvisioningFailed: Failed to provision volume with StorageClass "azurefile-3090-kubernetes.io-azure-file-dynamic-sc-vv4t2": invalid option "enableLargeFileshares" for volume plugin kubernetes.io/azure-file

kubernetes-sigs/azurefile-csi-driver#645 would fix the issue soon

@andyzhangx
Member

/retest

@nixpanic
Contributor Author

nixpanic commented May 6, 2021

/kind bug

Without this change, some of the in-tree storage drivers can cause a panic. See #101431 (comment) for more details.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 6, 2021
@pohly
Contributor

pohly commented May 7, 2021

> /kind bug
>
> Without this change, some of the in-tree storage drivers can cause a panic. See #101431 (comment) for more details.

That should be mentioned in the release note section of the PR description.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 7, 2021
@nixpanic nixpanic changed the title Add Capacity metric for Block volumes supported by in-tree drivers Fix a panic for in-tree Block volumes that do not implement MetricsProvider May 11, 2021
@nixpanic nixpanic changed the title Fix a panic for in-tree Block volumes that do not implement MetricsProvider Fix a panic for in-tree drivers that partially support Block volume metrics May 11, 2021
@gnufied
Member

gnufied commented May 19, 2021

So, without a solution to this, I am not able to make calcAndStoreStats() safe when drivers expose the MetricsProvider interface but fail to provide a GetMetrics() function.

Hmm - we are running into golang limitations. :-) But this code has broken in the past and somehow got past the review process, so I think it is in urgent need of fixing.

I am thinking we should tweak the way we are embedding interfaces in the volume plugin interface so that it is possible to query whether MetricsProvider has been set. For example, the BlockVolume interface can have an explicit function:

type BlockVolume interface {
	GetGlobalMapPath(spec *Spec) (string, error)
	GetPodDeviceMapPath() (string, string)

	// GetMetricsProvider returns the MetricsProvider if one is set, otherwise nil.
	GetMetricsProvider() MetricsProvider

	MetricsProvider
}

This way, it should be possible to write code like:

	if provider := v.GetMetricsProvider(); provider != nil {
		metrics, err := provider.GetMetrics()
		fmt.Printf("metrics: %v, err: %v\n", metrics, err)
	}

That is just one idea though; there could be other ways of solving this. But I do think this code is a bit fragile. :(

Similar to how NewMetricsStatFS() works, the new NewMetricsBlock()
provides the GetMetrics() interface for Block volumes.

Additional metrics for Block volumes are difficult to gather. There is
no guarantee that there is a filesystem on the volume, which makes most
of the volume metrics useless.

Advanced storage might be able to detect the actual consumption (when
thin-provisioned) vs the capacity. However, this is out of the scope for
a standard helper function and requires intimate knowledge of the used
storage system.

PR kubernetes#97972 added support for gathering metrics for Block PVCs provided by
CSI drivers. The in-tree drivers can support at least the most basic
metric: Capacity.

The in-tree drivers support gathering the capacity of the Block volume.
Make sure that Kubelet exposes these for the matching PVCs.
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. labels May 20, 2021
@nixpanic
Contributor Author

Thanks for the idea, @gnufied! Instead of GetMetricsProvider(), I have added SupportsMetrics(), which returns a bool, in the commit "Add SupportsMetrics() for Block-mode volumes". Please have another look and see if this addresses your concerns.

Volumes that are provisioned with `VolumeMode: Block` often have a
MetricsProvider interface declared in their type. However, the
MetricsProvider should implement a GetMetrics() function. In the cases
where the storage drivers do not implement GetMetrics(), a panic can
occur.

Usual type-assertions are not sufficient in this case. All assertions
assume the interface is present. There is no straightforward way to
verify that a valid GetMetrics() function is provided.

By adding SupportsMetrics(), storage driver implementations require
careful reviewing for metrics support.
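
The guard described in this commit message could look roughly like the sketch below. Only the method names SupportsMetrics() and GetMetrics() come from the discussion; the interface and type names (`BlockVolumeMapper`, `withMetrics`, `withoutMetrics`, `fixedCapacity`, `collect`) are hypothetical and reduced to the minimum needed to run.

```go
package main

import (
	"errors"
	"fmt"
)

// Metrics and MetricsProvider are simplified stand-ins for illustration.
type Metrics struct {
	Capacity int64
}

type MetricsProvider interface {
	GetMetrics() (*Metrics, error)
}

// BlockVolumeMapper is a reduced, hypothetical interface: implementations
// must state explicitly whether GetMetrics() is safe to call.
type BlockVolumeMapper interface {
	SupportsMetrics() bool
	MetricsProvider
}

// fixedCapacity is a toy provider that reports a constant capacity.
type fixedCapacity struct{ bytes int64 }

func (f fixedCapacity) GetMetrics() (*Metrics, error) {
	return &Metrics{Capacity: f.bytes}, nil
}

// withMetrics sets the embedded provider and says so.
type withMetrics struct{ MetricsProvider }

func (withMetrics) SupportsMetrics() bool { return true }

// withoutMetrics leaves the embedded provider nil and reports that honestly.
type withoutMetrics struct{ MetricsProvider }

func (withoutMetrics) SupportsMetrics() bool { return false }

var errNoMetrics = errors.New("metrics are not supported")

// collect only calls GetMetrics() when the mapper declares support for it.
func collect(m BlockVolumeMapper) (*Metrics, error) {
	if !m.SupportsMetrics() {
		return nil, errNoMetrics
	}
	return m.GetMetrics()
}

func main() {
	_, err := collect(withoutMetrics{})
	fmt.Println("without metrics:", err)

	m, err := collect(withMetrics{MetricsProvider: fixedCapacity{bytes: 1 << 30}})
	fmt.Println("with metrics:", m.Capacity, err)
}
```

The guard still relies on each driver answering SupportsMetrics() truthfully, which is why the commit message stresses careful review of every implementation.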
@nixpanic
Contributor Author

/retest

@k8s-ci-robot
Contributor

@nixpanic: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-e2e-gce-storage-snapshot | b997e0e | link | /test pull-kubernetes-e2e-gce-storage-snapshot |
| pull-kubernetes-e2e-gce-csi-serial | b997e0e | link | /test pull-kubernetes-e2e-gce-csi-serial |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@ehashman ehashman added this to Triage in SIG Node PR Triage May 21, 2021
Member

@ehashman ehashman left a comment

/priority important-longterm

Kubelet changes LGTM.

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 21, 2021
@ehashman ehashman moved this from Triage to Needs Approver in SIG Node PR Triage May 21, 2021
@gnufied
Member

gnufied commented May 24, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 24, 2021
@gnufied
Member

gnufied commented May 24, 2021

/approve

@mrunalp mrunalp moved this from Needs Approver to Done in SIG Node PR Triage May 24, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, mrunalp, nixpanic

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Development

Successfully merging this pull request may close these issues.

Add support for gathering BlockVolume metrics from in-tree storage drivers