
Add Kubernetes node name to exported labels #78

Closed
neggert opened this issue Jul 8, 2022 · 9 comments

neggert commented Jul 8, 2022

Right now, when dcgm-exporter is deployed in Kubernetes (we're using gpu-operator), the Hostname label is set to the pod name, which is not particularly useful. I'd like to suggest either:

  • Adding a new node label, or
  • Using different logic to populate Hostname when running in Kubernetes

It should be fairly straightforward to inject the node name into the container using the Downward API.
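
For illustration, a minimal sketch of what that could look like in the dcgm-exporter pod spec (the NODE_NAME variable name is just a placeholder here, not an existing dcgm-exporter option):

spec:
  containers:
    - name: dcgm-exporter
      env:
        # Downward API: expose the Kubernetes node name to the container.
        # NODE_NAME is a hypothetical variable name for illustration only.
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName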

@alex-g-tejada

Would it be possible to have pod labels in general added as well? We have Kubernetes pods with labels like app=dev, for example, and these are not visible in the DCGM metrics.

k0nstantinv commented Nov 1, 2022

@neggert As a workaround, you can do this right now on the Prometheus side in the dcgm-exporter scrape job, e.g. with relabel_configs. For example:

- job_name: dcgm-exporter
  scrape_interval: 30s
  scrape_timeout: 10s
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    - source_labels: [__meta_kubernetes_endpoints_name]
      regex: 'dcgm-exporter'
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod_name
    - source_labels: [__meta_kubernetes_pod_node_name]
      action: replace
      target_label: node

Result:

DCGM_FI_DEV_FB_FREE{UUID="GPU-16e319ba-0b7d-3a4b-e35f-915bf484870f", cluster="test-cluster01", device="nvidia0", gpu="0", instance="10.109.133.139:9400", job="dcgm-exporter", namespace="dcgm-exporter", node="node06", pod_name="dcgm-exporter-k22f5"}

PlayMTL commented Nov 4, 2022

Hey, I solved this by adding some relabelings to my ServiceMonitor.

I also overwrote the instance label so I don't have to customize my Grafana dashboards.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
...
spec:
  endpoints:
  - relabelings:
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: nodename
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
    ...

slyt commented Dec 6, 2022

I needed node_ip and hostname to join DCGM metrics to Node Exporter metrics.

I used the following relabelings in my ServiceMonitor to add node_ip and hostname labels to all DCGM metrics:

spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter # target service
  namespaceSelector:
    matchNames:
      - nvidia-gpu-operator
  endpoints:
    - port: gpu-metrics
      interval: 5s # Scrape interval
      relabelings:
        - sourceLabels: ["__meta_kubernetes_pod_host_ip"]
          regex: "(.*)"
          replacement: "$1"
          targetLabel: "node_ip"
        - sourceLabels: ["__meta_kubernetes_pod_node_name"]
          regex: "(.*)"
          replacement: "$1"
          targetLabel: "hostname"

@harjitdotsingh

How can I get the name of the pod that is actually using the GPU? Right now what we get is the name of the dcgm-exporter pod, and the job label is populated with the GPU metrics job name.

@thekuffs

Hi, as I commented in #99 (comment), this feature does not need to be added to dcgm-exporter, since it is already available in Prometheus. My comment there has references on how to configure it.

francescov1 commented Mar 12, 2024

@thekuffs Do you mind elaborating on how to do this?

I've got a GPU node which is running two pods: dcgm-exporter and a workload which uses the GPU. I'd like to be able to associate that workload with the metrics exposed by dcgm-exporter.

How exactly is this done? I understand how to use the pod role with kubernetes_sd_configs, but it's not clear to me how that helps here. I can target my workload through it, but my workload doesn't expose a /metrics endpoint (only dcgm-exporter does). I can target the dcgm-exporter pod, but then I'm still not sure how to associate those metrics with the workload.

@thekuffs

@francescov1 I think I was just wrong when I posted that comment. I had forgotten that available GPU resources can be shared among multiple pods. If your workload is 1:1:1 (gpu:pod:node), you can use something like the cadvisor/kubelet metrics to "join" between DCGM metrics and your own workload, i.e. there are metrics in cadvisor/kubelet that have both a pod and a node label. You can select those by your workload's pod name, then use the node label to find the dcgm-exporter instance on that node, which in turn gives you the DCGM metrics for the node your workload is running on. Like I said though, that requires allocating the entire GPU on a node to a single pod.

So, I apologize for my comment. I made a few other comments on other similar issues without thinking thoroughly about it.
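
Roughly, the kind of join I mean looks like this (a sketch only, using kube_pod_info from kube-state-metrics as the metric carrying both pod and node labels; it assumes the DCGM metrics already have a node label from relabeling as shown earlier, and that the workload pod has the node's GPU to itself):

# Filter DCGM metrics to the node(s) running the workload and copy its pod name over.
# "my-gpu-workload.*" is a placeholder for your workload's pod name pattern;
# the copied pod label overwrites any existing pod label on the DCGM series.
DCGM_FI_DEV_GPU_UTIL
  * on (node) group_left (pod)
    kube_pod_info{pod=~"my-gpu-workload.*"}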

@francescov1

@thekuffs Thanks for the insight, this is exactly what I need. Will give it a shot!
