-
Notifications
You must be signed in to change notification settings - Fork 39k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add attachable pvc in use metrics #64527
Add attachable pvc in use metrics #64527
Conversation
d2d5811
to
abd3470
Compare
abd3470
to
c365c99
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick initial pass
} | ||
|
||
type nodePVCCount struct { | ||
pvcCount map[types.NodeName]map[string]int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reduce a nested layer, what about this:
type nodePVCCount map[types.NodeName]map[string]int
...
func (pvcInUse nodePVCCount) add(...) {...}
...
nodePVCMap := make(nodePVCCount)
also nit: type PluginName string
for better readability
} | ||
if pvc.Status.Phase != v1.ClaimBound || pvc.Spec.VolumeName == "" { | ||
return nil, fmt.Errorf( | ||
"PVC %s has non-bound phase (%q) or empty pvc.Spec.VolumeName (%q)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Error message could be confusing if only one of the conditions evaluate to true. Split up the error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to leave this as it is for now. I have not even considered logging these errors tbh, so whatever we return from here - simply gets ignored.
c365c99
to
21fbd9f
Compare
volumePluginMgr *volume.VolumePluginMgr | ||
} | ||
|
||
type nodePVCCount map[types.NodeName]map[string]int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment here describe what each of the types represents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
glog.V(3).Infof("Error finding volume plugin for : %v", volumeSpec) | ||
continue | ||
} | ||
nodePVCMap.add(nodeName, volumePlugin.GetPluginName()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this to help with the attachable limit count? If so, do we want to use the attachable limit resource name instead of the plugin name? For CSI, I think this just returns "kubernetes.io/csi"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not just for attachable limit count. It is useful for capacity planning in multi-tenant clusters and perhaps there are other usage too. In a multi-tenant cluster one typically has many more PV/PVCs than actually in-use PVCs. This metric obviously helps in determining if a cluster admin needs to spin up new nodes to accomodate in-use PVCs that takes into account both attachable limit or limits that apply in general to volume types that we don't even consider "attachable" right now.
I agree with unfortunate situation with CSI plugin name and GetPluginName
is not a satisfactory solution. But all storage metrics are affected by this problem, because they all use GetPluginName
as a label. I have filed #64590 to solve this. I think we are going to need additional function call or something for CSI plugins.
21fbd9f
to
2c47de3
Compare
if nodeName == "" { | ||
continue | ||
} | ||
for _, podVolume := range pod.Spec.Volumes { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is actually just counting pods that have been scheduled to a node, but not necessarily attached. Is that what we want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah for now I think - this should be fine. We may need separate metrics for actually attached volumes. But that requires querying cloudprovider or going through volume plugin at very minimum. I am still thinking, how to implement that interface so as it can be useful to most volume plugins.
} | ||
|
||
func (collector *pvcInUseCollector) CreateVolumeSpec(podVolume v1.Volume, namespace string) (*volume.Spec, error) { | ||
pvcSource := podVolume.VolumeSource.PersistentVolumeClaim |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be useful to include also inline volumes in pods and not just PVCs. It would bring broader (and more correct) picture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hrm, I overlooked something. turns out A/D controller only initializes attachable plugins (and similarly pv controller only initializes provisionable plugins) and hence it is pretty hard to emit a metric for all volumes in use without initiaizing all volume plugins in control plane.
The alternatives are:
- Emit these metrics from kubelet. But the downside of that is, any unresponsive node could cause metrics to be incorrect.
- Somehow find a way of initializing all plugins in control plane.
Still thinking how to workaround that....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for now I have just renamed the metric and added support for inline attachable volumes too. But yet to think how to truly report ALL volumes... not just attachable types.
metricCollector := newPVCInUseCollector(pvcLister, fakePodInformer.Lister(), pvLister, fakeVolumePluginMgr) | ||
nodeUseMap := metricCollector.getPVCUseByNode() | ||
if len(nodeUseMap) < 1 { | ||
t.Errorf("Expected one pvc in use got %d", len(nodeUseMap)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One or at least one?
ee2c689
to
e11b8af
Compare
/lgtm |
/test pull-kubernetes-verify |
e11b8af
to
8d46912
Compare
@jsafrane can you lgtm this again? I had tor rebase this (with master) to resolve some bazel/verify failures. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gnufied, jsafrane The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test pull-kubernetes-e2e-kops-aws |
/test all [submit-queue is verifying that this PR is safe to merge] |
@gnufied: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Automatic merge from submit-queue (batch tested with PRs 65361, 64527). If you want to cherry-pick this change to another branch, please follow the instructions here. |
Automatic merge from submit-queue (batch tested with PRs 63194, 65911). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Remove crappy fmt.Println Remove @gnufied's debug message #64527 ```release-note NONE ```
This metric reports number of PVCs that are in-use in Kubernetes with plugin and node name as dimensions.
This allows us to figure out, how many PVCs each node is using. It is super helpful in figuring out attach/detach issues.
/sig storage
cc @jsafrane @tsmetana @msau42