Add Kubernetes node name to exported labels #78

Right now, when dcgm-exporter is deployed in Kubernetes (we're using gpu-operator), the `Hostname` label is set to the pod name, which is not particularly useful. I'd like to suggest either adding a `node` label, or setting `Hostname` to the node name when running in Kubernetes. It should be fairly straightforward to inject the node name into the container using the Downward API.
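A minimal sketch of that injection via the Downward API, assuming a standard pod spec for the exporter container (the `NODE_NAME` variable name is an assumption):

```yaml
# Pod spec fragment: expose the scheduled node's name to the container.
env:
  - name: NODE_NAME              # placeholder name; the exporter would read this
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName # standard Downward API field for the node name
```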
Comments
Would it be possible to have labels in general added as well? We have Kubernetes pods that have label.app=dev, for example, and these are not visible in DCGM metrics.
@neggert as a workaround right now you can do it from the Prometheus side in the dcgm-exporter scrape job, e.g. with relabel_config. For example:
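A minimal sketch of such a scrape job, assuming pod-role service discovery (the job name and the `app` label value are assumptions):

```yaml
scrape_configs:
  - job_name: dcgm-exporter
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only the dcgm-exporter pods; the label value is an assumption.
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: dcgm-exporter
      # Attach the Kubernetes node name to every scraped series.
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
```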
Hey, I have solved this by adding some relabeling to my ServiceMonitor. I also overwrote the instance label so I don't have to customize my Grafana boards.
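A sketch of what that could look like, assuming a prometheus-operator ServiceMonitor (the selector labels and port name are assumptions); overwriting `instance` keeps dashboards that key on it working unchanged:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter   # assumed service label
  endpoints:
    - port: metrics        # assumed port name
      relabelings:
        # Overwrite the default instance label (pod IP:port) with the node name.
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: instance
```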
I needed `node_ip` and `hostname` to join DCGM metrics to Node Exporter metrics. I used the following relabelings in my config:
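For example, a sketch along these lines (the label names `node_ip` and `hostname` come from the comment above; the rest is assumed):

```yaml
relabel_configs:
  # Node's internal IP, for joining against node-exporter series.
  - source_labels: [__meta_kubernetes_pod_host_ip]
    target_label: node_ip
  # Kubernetes node name, exposed as "hostname".
  - source_labels: [__meta_kubernetes_pod_node_name]
    target_label: hostname
```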
How can I get the name of the pod which is actually using the GPU? Right now what we get is the name of the dcgm-exporter pod, and the job name is being populated with the GPU metrics job name.
Hi, as I commented here in #99 (comment), this feature does not need to be added to dcgm-exporter. It is already available in Prometheus. My comment has references on how to configure it.
@thekuffs Do you mind elaborating on how to do this? I've got a GPU node which is running two pods: dcgm-exporter and a workload which uses the GPU. I'd like to be able to associate that workload with the metrics exposed by dcgm-exporter. How exactly is this done? I understand how to do the pod role with kubernetes_sd_configs, but it's not clear how this helps the problem. I can target my workload through this, but my workload doesn't expose a /metrics endpoint (only dcgm-exporter does). I can target the dcgm-exporter pod, but then I'm still not sure how to associate those metrics with the workload.
@francescov1 I think I was just wrong when I posted that comment. I had forgotten that available GPU resources can be shared among multiple pods. If your workload is 1:1:1 (gpu:pod:node), you can use something like the cadvisor/kubelet metrics to "join" between DCGM metrics and your own workload, i.e. there are metrics in cadvisor/kubelet that have both a pod and a node label. You can select those by your workload's pod name, join them against themselves to find the specific instance of the dcgm-exporter for that node, and use that to find the DCGM metrics for the node your workload is running on. Like I said though, that requires you to be allocating the entire GPU on a node to a single pod. So, I apologize for my comment. I made a few other comments on other similar issues without thinking it through thoroughly.
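A rough sketch of that join as a Prometheus recording rule, assuming the DCGM series already carry a `node` label (e.g. via the relabeling shown earlier) and using kube-state-metrics' `kube_pod_info` as the pod-to-node mapping in place of the cadvisor/kubelet metrics described above; the pod name and rule name are placeholders:

```yaml
groups:
  - name: gpu-workload-join
    rules:
      # Attach the workload's pod name to each GPU utilization series on the
      # same node. kube_pod_info has value 1, so the GPU value is preserved.
      - record: workload:dcgm_gpu_util
        expr: |
          DCGM_FI_DEV_GPU_UTIL
            * on (node) group_left (pod)
          kube_pod_info{pod="my-gpu-workload"}
```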
@thekuffs Thanks for the insight, this is exactly what I need. Will give it a shot!