Add attachable pvc in use metrics #64527

Merged
merged 1 commit into from Jun 28, 2018

Conversation

@gnufied
Member

gnufied commented May 30, 2018

This metric reports the number of PVCs that are in use in Kubernetes, with plugin and node name as dimensions.

This allows us to figure out how many PVCs each node is using, which is very helpful when debugging attach/detach issues.

/sig storage

cc @jsafrane @tsmetana @msau42

Add metrics for PVC in-use
@verult
Contributor

verult left a comment

Just a quick initial pass

}

type nodePVCCount struct {
    pvcCount map[types.NodeName]map[string]int

@verult

verult Jun 1, 2018

Contributor

To reduce a nested layer, what about this:

type nodePVCCount map[types.NodeName]map[string]int
...
func (pvcInUse nodePVCCount) add(...) {...}
...
nodePVCMap := make(nodePVCCount)

also nit: type PluginName string for better readability
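
The suggestion above can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the code that was merged: `NodeName` is a local stand-in for `types.NodeName`, `PluginName` is the alias proposed in the nit, and the node and plugin strings in `main` are made up for the example.

```go
package main

import "fmt"

// NodeName is a local stand-in for types.NodeName (assumption, so that the
// sketch compiles without Kubernetes imports).
type NodeName string

// PluginName is the alias proposed in the review nit.
type PluginName string

// nodePVCCount maps node name -> plugin name -> number of in-use PVCs.
type nodePVCCount map[NodeName]map[PluginName]int

// add increments the counter for (node, plugin), creating the inner map on
// first use.
func (c nodePVCCount) add(node NodeName, plugin PluginName) {
	if _, ok := c[node]; !ok {
		c[node] = make(map[PluginName]int)
	}
	c[node][plugin]++
}

func main() {
	nodePVCMap := make(nodePVCCount)
	nodePVCMap.add("node-1", "kubernetes.io/aws-ebs")
	nodePVCMap.add("node-1", "kubernetes.io/aws-ebs")
	nodePVCMap.add("node-2", "kubernetes.io/gce-pd")
	fmt.Println(nodePVCMap)
}
```

Declaring the map itself as the named type removes the wrapper struct and its single field, while still allowing methods such as `add` to be attached.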

}
if pvc.Status.Phase != v1.ClaimBound || pvc.Spec.VolumeName == "" {
    return nil, fmt.Errorf(
        "PVC %s has non-bound phase (%q) or empty pvc.Spec.VolumeName (%q)",

@verult

verult Jun 1, 2018

Contributor

nit: The error message could be confusing if only one of the conditions evaluates to true. Split up the error?

@gnufied

gnufied Jun 5, 2018

Author Member

I am going to leave this as it is for now. To be honest, I have not even considered logging these errors, so whatever we return from here simply gets ignored.
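
For reference, splitting the error as the reviewer proposed might look like the sketch below. The PR keeps the combined message, so this only illustrates the alternative. The `claim` type and the `checkBound` helper are hypothetical stand-ins for `v1.PersistentVolumeClaim` and the check in the diff; the `"Bound"` string mirrors the value of `v1.ClaimBound`.

```go
package main

import "fmt"

// claim is a hypothetical, trimmed-down stand-in for v1.PersistentVolumeClaim.
type claim struct {
	Name       string
	Phase      string // "Bound" mirrors v1.ClaimBound
	VolumeName string // stands in for pvc.Spec.VolumeName
}

// checkBound reports each failed condition with its own message, so the
// caller can tell which of the two checks did not hold.
func checkBound(c claim) error {
	if c.Phase != "Bound" {
		return fmt.Errorf("PVC %s has non-bound phase (%q)", c.Name, c.Phase)
	}
	if c.VolumeName == "" {
		return fmt.Errorf("PVC %s has empty pvc.Spec.VolumeName", c.Name)
	}
	return nil
}

func main() {
	fmt.Println(checkBound(claim{Name: "data", Phase: "Pending"}))
	fmt.Println(checkBound(claim{Name: "data", Phase: "Bound"}))
	fmt.Println(checkBound(claim{Name: "data", Phase: "Bound", VolumeName: "pv-1"}))
}
```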

@gnufied gnufied force-pushed the gnufied:add-pvc-in-use-metrics branch from c365c99 to 21fbd9f Jun 5, 2018

volumePluginMgr *volume.VolumePluginMgr
}

type nodePVCCount map[types.NodeName]map[string]int

@msau42

msau42 Jun 5, 2018

Member

Can you add a comment here describing what each of the types represents?

@gnufied

gnufied Jun 7, 2018

Author Member

fixed

        glog.V(3).Infof("Error finding volume plugin for : %v", volumeSpec)
        continue
    }
    nodePVCMap.add(nodeName, volumePlugin.GetPluginName())

@msau42

msau42 Jun 5, 2018

Member

Is this to help with the attachable limit count? If so, do we want to use the attachable limit resource name instead of the plugin name? For CSI, I think this just returns "kubernetes.io/csi"

@gnufied

gnufied Jun 5, 2018

Author Member

This is not just for the attachable limit count. It is also useful for capacity planning in multi-tenant clusters, and perhaps there are other uses too. In a multi-tenant cluster one typically has many more PVs/PVCs than PVCs that are actually in use. This metric helps a cluster admin determine whether new nodes need to be spun up to accommodate in-use PVCs, taking into account both the attachable limit and limits that apply in general to volume types that we don't even consider "attachable" right now.

I agree that the situation with the CSI plugin name is unfortunate and that GetPluginName is not a satisfactory solution. But all storage metrics are affected by this problem, because they all use GetPluginName as a label. I have filed #64590 to solve this. I think we are going to need an additional function call or something similar for CSI plugins.

@gnufied gnufied force-pushed the gnufied:add-pvc-in-use-metrics branch from 21fbd9f to 2c47de3 Jun 6, 2018

    if nodeName == "" {
        continue
    }
    for _, podVolume := range pod.Spec.Volumes {

@msau42

msau42 Jun 6, 2018

Member

So this is actually just counting pods that have been scheduled to a node, but not necessarily attached. Is that what we want?

@gnufied

gnufied Jun 6, 2018

Author Member

Yeah, for now I think this should be fine. We may need separate metrics for actually attached volumes, but that requires querying the cloud provider or going through the volume plugin at the very minimum. I am still thinking about how to implement that interface so that it can be useful to most volume plugins.
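
The counting behaviour discussed in this thread can be sketched as below. It is a simplified illustration, not the collector from the PR: `pod`, `volume`, `countByNode`, and the `pluginFor` lookup are hypothetical stand-ins for the Kubernetes types, the collector method, and the plugin manager lookup. It counts volumes of pods that have a node assigned, whether or not the volume is actually attached.

```go
package main

import "fmt"

// volume and pod are hypothetical, trimmed-down stand-ins for v1.Volume and v1.Pod.
type volume struct {
	ClaimName string // empty when the volume is not backed by a PVC
}

type pod struct {
	NodeName string // empty until the pod is scheduled
	Volumes  []volume
}

// countByNode counts PVC-backed volumes per node and plugin for scheduled
// pods. pluginFor is an assumed lookup that resolves a claim to a plugin name
// and reports false when no plugin is found.
func countByNode(pods []pod, pluginFor func(claim string) (string, bool)) map[string]map[string]int {
	counts := make(map[string]map[string]int)
	for _, p := range pods {
		if p.NodeName == "" {
			continue // pod is not scheduled yet
		}
		for _, v := range p.Volumes {
			if v.ClaimName == "" {
				continue // not a PVC-backed volume
			}
			plugin, ok := pluginFor(v.ClaimName)
			if !ok {
				continue // no plugin found, skip the volume
			}
			if counts[p.NodeName] == nil {
				counts[p.NodeName] = make(map[string]int)
			}
			counts[p.NodeName][plugin]++
		}
	}
	return counts
}

func main() {
	pods := []pod{
		{NodeName: "node-1", Volumes: []volume{{ClaimName: "db-data"}}},
		{NodeName: "", Volumes: []volume{{ClaimName: "pending-data"}}},
	}
	lookup := func(string) (string, bool) { return "kubernetes.io/aws-ebs", true }
	fmt.Println(countByNode(pods, lookup))
}
```

Note that in this sketch two pods on the same node that reference the same claim are counted twice; whether the real collector deduplicates is not stated in the conversation.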

}

func (collector *pvcInUseCollector) CreateVolumeSpec(podVolume v1.Volume, namespace string) (*volume.Spec, error) {
pvcSource := podVolume.VolumeSource.PersistentVolumeClaim

@jsafrane

jsafrane Jun 12, 2018

Member

I think it would be useful to also include inline volumes in pods and not just PVCs. It would give a broader (and more correct) picture.

@gnufied

gnufied Jun 15, 2018

Author Member

Hmm, I overlooked something. It turns out the A/D controller only initializes attachable plugins (and similarly the PV controller only initializes provisionable plugins), and hence it is pretty hard to emit a metric for all volumes in use without initializing all volume plugins in the control plane.

The alternatives are:

  1. Emit these metrics from the kubelet. The downside is that any unresponsive node could cause the metrics to be incorrect.
  2. Somehow find a way of initializing all plugins in the control plane.

Still thinking about how to work around that...

@gnufied

gnufied Jun 15, 2018

Author Member

For now I have just renamed the metric and added support for inline attachable volumes too. I have yet to think about how to truly report ALL volumes, not just attachable types.

metricCollector := newPVCInUseCollector(pvcLister, fakePodInformer.Lister(), pvLister, fakeVolumePluginMgr)
nodeUseMap := metricCollector.getPVCUseByNode()
if len(nodeUseMap) < 1 {
    t.Errorf("Expected one pvc in use got %d", len(nodeUseMap))

@bertinatto

bertinatto Jun 14, 2018

Member

One or at least one?

@gnufied gnufied force-pushed the gnufied:add-pvc-in-use-metrics branch 2 times, most recently from ee2c689 to e11b8af Jun 15, 2018

@jsafrane

Member

jsafrane commented Jun 28, 2018

/lgtm

@gnufied

Member Author

gnufied commented Jun 28, 2018

/test pull-kubernetes-verify

@gnufied gnufied changed the title Add pvc in use metrics Add attachable pvc in use metrics Jun 28, 2018

@gnufied gnufied force-pushed the gnufied:add-pvc-in-use-metrics branch from e11b8af to 8d46912 Jun 28, 2018

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jun 28, 2018

@gnufied

Member Author

gnufied commented Jun 28, 2018

@jsafrane can you lgtm this again? I had to rebase this (on master) to resolve some bazel/verify failures.

@jsafrane

Member

jsafrane commented Jun 28, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jun 28, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jun 28, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gnufied

Member Author

gnufied commented Jun 28, 2018

/test pull-kubernetes-e2e-kops-aws

@k8s-github-robot

Contributor

k8s-github-robot commented Jun 28, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jun 28, 2018

@gnufied: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-bazel-test | 8d46912 | link | /test pull-kubernetes-bazel-test |
| pull-kubernetes-e2e-gce | 8d46912 | link | /test pull-kubernetes-e2e-gce |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot

Contributor

k8s-github-robot commented Jun 28, 2018

Automatic merge from submit-queue (batch tested with PRs 65361, 64527). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit a68a909 into kubernetes:master Jun 28, 2018

14 of 17 checks passed

pull-kubernetes-bazel-test: Job failed.
pull-kubernetes-e2e-gce: Job failed.
Submit Queue: Required Github CI test is not green: pull-kubernetes-bazel-test
cla/linuxfoundation: gnufied authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-local-e2e-containerized: Skipped
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.

@wongma7 wongma7 referenced this pull request Jul 6, 2018

Merged

Remove crappy fmt.Println #65911

k8s-github-robot pushed a commit that referenced this pull request Jul 9, 2018

Kubernetes Submit Queue
Merge pull request #65911 from wongma7/crap
Automatic merge from submit-queue (batch tested with PRs 63194, 65911). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Remove crappy fmt.Println

Remove @gnufied's debug message #64527
```release-note
NONE
```