
Add Kubernetes node name to exported labels #78

Closed
neggert opened this issue Jul 8, 2022 · 9 comments

neggert commented Jul 8, 2022

Right now, when dcgm-exporter is deployed in Kubernetes (we're using gpu-operator), the Hostname label is set to the pod name, which is not particularly useful. I'd like to suggest either:

  • Adding a new node label, or
  • Using different logic to populate Hostname when running in Kubernetes

It should be fairly straightforward to inject the node name into the container using the Downward API.
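
For illustration, a minimal sketch of what that could look like in the dcgm-exporter pod spec (the NODE_NAME variable name is just a placeholder here, not an existing dcgm-exporter option):

spec:
  containers:
    - name: dcgm-exporter
      env:
        # Downward API: expose the Kubernetes node name to the container.
        # NODE_NAME is a hypothetical variable name for illustration only.
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName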

@alex-g-tejada

Would it be possible to have pod labels in general added as well? We have Kubernetes pods with labels like app=dev, for example, and these are not visible in the DCGM metrics.

k0nstantinv commented Nov 1, 2022

@neggert As a workaround, you can do this right now on the Prometheus side in the dcgm-exporter scrape job, e.g. with relabel_configs. For example:

- job_name: dcgm-exporter
  scrape_interval: 30s
  scrape_timeout: 10s
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    - source_labels: [__meta_kubernetes_endpoints_name]
      regex: 'dcgm-exporter'
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod_name
    - source_labels: [__meta_kubernetes_pod_node_name]
      action: replace
      target_label: node

Result:

DCGM_FI_DEV_FB_FREE{UUID="GPU-16e319ba-0b7d-3a4b-e35f-915bf484870f", cluster="test-cluster01", device="nvidia0", gpu="0", instance="10.109.133.139:9400", job="dcgm-exporter", namespace="dcgm-exporter", node="node06", pod_name="dcgm-exporter-k22f5"}

PlayMTL commented Nov 4, 2022

Hey, I solved this by adding some relabelings to my ServiceMonitor.

I also overwrote the instance label so I don't have to customize my Grafana dashboards.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
...
spec:
  endpoints:
  - relabelings:
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: nodename
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
    ...

slyt commented Dec 6, 2022

I needed node_ip and hostname to join DCGM metrics to Node Exporter metrics.

I used the following relabelings in my ServiceMonitor to add node_ip and hostname labels to all DCGM metrics:

spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter # target service
  namespaceSelector:
    matchNames:
      - nvidia-gpu-operator
  endpoints:
    - port: gpu-metrics
      interval: 5s # Scrape interval
      relabelings:
        - sourceLabels: ["__meta_kubernetes_pod_host_ip"]
          regex: "(.*)"
          replacement: "$1"
          targetLabel: "node_ip"
        - sourceLabels: ["__meta_kubernetes_pod_node_name"]
          regex: "(.*)"
          replacement: "$1"
          targetLabel: "hostname"

@harjitdotsingh

How can I get the name of the pod that is actually using the GPU? Right now what we get is the name of the dcgm-exporter pod, and the job label is populated with the GPU metrics job name.

@thekuffs

Hi, as I commented in #99 (comment), this feature does not need to be added to dcgm-exporter, since it is already available in Prometheus. My comment there has references on how to configure it.

francescov1 commented Mar 12, 2024

@thekuffs Do you mind elaborating on how to do this?

I've got a GPU node which is running two pods: dcgm-exporter and a workload which uses the GPU. I'd like to be able to associate that workload with the metrics exposed by dcgm-exporter.

How exactly is this done? I understand how to use the pod role with kubernetes_sd_configs, but it's not clear to me how that helps here. I can target my workload through it, but my workload doesn't expose a /metrics endpoint (only dcgm-exporter does). I can target the dcgm-exporter pod, but then I'm still not sure how to associate those metrics with the workload.

@thekuffs

@francescov1 I think I was just wrong when I posted that comment. I had forgotten that available GPU resources can be shared among multiple pods. If your workload is 1:1:1 (gpu:pod:node), you can use something like the cadvisor/kubelet metrics to "join" between DCGM metrics and your own workload, i.e. there are metrics in cadvisor/kubelet that have both a pod and a node label. You can select those by your workload's pod name, then use the node label to find the dcgm-exporter instance on that node, which in turn gives you the DCGM metrics for the node your workload is running on. Like I said though, that requires allocating the entire GPU on a node to a single pod.

So, I apologize for my comment. I made a few other comments on other similar issues without thinking thoroughly about it.
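
Roughly, the kind of join I mean looks like this (a sketch only, using kube_pod_info from kube-state-metrics as the metric carrying both pod and node labels; it assumes the DCGM metrics already have a node label from relabeling as shown earlier, and that the workload pod has the node's GPU to itself):

# Filter DCGM metrics to the node(s) running the workload and copy its pod name over.
# "my-gpu-workload.*" is a placeholder for your workload's pod name pattern;
# the copied pod label overwrites any existing pod label on the DCGM series.
DCGM_FI_DEV_GPU_UTIL
  * on (node) group_left (pod)
    kube_pod_info{pod=~"my-gpu-workload.*"}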

@francescov1

@thekuffs Thanks for the insight, this is exactly what I need. Will give it a shot!
