[Bug]: Intel GPU integration only pulls from one GPU #17881

Closed
rebelonion opened this issue Jun 13, 2024 · 3 comments · Fixed by #17884
Labels
area/collectors Everything related to data collection bug collectors/go.d

Comments

@rebelonion

Bug description

I have a system with three Intel GPUs (one iGPU and two Arc cards). Running intel_gpu_top -J -s 1000 (or any variation without -d) appears to report only the first GPU in the list; the Netdata metrics tab shows the same result.
Using intel_gpu_top -L, I can get a list of all Intel GPUs:

> intel_gpu_top -L
card2                    Intel Dg2 (Gen12)                 pci:vendor=8086,device=56A0,card=0
└─renderD129            
card1                    Intel Alderlake_s (Gen12)         pci:vendor=8086,device=4680,card=0
└─renderD128            
card0                    Intel Dg2 (Gen12)                 pci:vendor=8086,device=56A0,card=1
└─renderD130

With these filters, each GPU's metrics can be retrieved individually with a command such as intel_gpu_top -d pci:vendor=8086,device=4680,card=0 -J -s 1000.
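
As a rough illustration of the workaround described above, the following sketch parses the text output of intel_gpu_top -L and builds one intel_gpu_top -d ... command per GPU. The listing is copied from this issue; the function names parse_listing and build_commands are hypothetical, and the regular expression assumes every card line ends with a pci: filter, as in the listing shown here.

```python
import re

# Listing copied from this issue (output of `intel_gpu_top -L`).
LISTING = """\
card2                    Intel Dg2 (Gen12)                 pci:vendor=8086,device=56A0,card=0
└─renderD129
card1                    Intel Alderlake_s (Gen12)         pci:vendor=8086,device=4680,card=0
└─renderD128
card0                    Intel Dg2 (Gen12)                 pci:vendor=8086,device=56A0,card=1
└─renderD130
"""

# Assumption: each card line is "<cardN> <name> <pci:filter>"; render node lines are skipped.
CARD_LINE = re.compile(r"^(card\d+)\s+(.*?)\s+(pci:\S+)\s*$")


def parse_listing(text):
    """Return one dict per GPU with its card id, name, and pci: filter."""
    gpus = []
    for line in text.splitlines():
        match = CARD_LINE.match(line)
        if match:
            gpus.append({
                "card": match.group(1),
                "name": match.group(2),
                "filter": match.group(3),
            })
    return gpus


def build_commands(gpus, interval_ms=1000):
    """Build one argv list per GPU, mirroring the command used in this issue."""
    return [
        ["intel_gpu_top", "-d", gpu["filter"], "-J", "-s", str(interval_ms)]
        for gpu in gpus
    ]


if __name__ == "__main__":
    for argv in build_commands(parse_listing(LISTING)):
        print(" ".join(argv))
```

The sketch only builds the command lines; it does not start any process. Each command would have to run as its own process, since a single invocation reports one GPU.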

Selection options from the docs:

       On  systems  where multiple GPUs are present it is possible to select a specific GPU to be
       monitored. A GPU can be selected by sysfs path, drm device node or using various  PCI  sub
       filters.

          ┌───────┬────────────────────────────────────────────────┬──────────────────────────┐
          │Filter │ Syntax                                         │ GPU selection criteria   │
          ├───────┼────────────────────────────────────────────────┼──────────────────────────┤
          │sys    │ sys:/sys/devices/pci0000:00/0000:00:02.0       │ Select  using  the sysfs │
          │       │                                                │ path.                    │
          ├───────┼────────────────────────────────────────────────┼──────────────────────────┤
          │drm    │ drm:/dev/dri/<node>                            │ Select     using     the │
          │       │                                                │ /dev/dri/* device node.  │
          ├───────┼────────────────────────────────────────────────┼──────────────────────────┤
          │pci    │ pci:[vendor=%04x/name][,device=%04x][,card=%d] │ Select   using  the  PCI │
          │       │                                                │ address.    Vendor    is │
          │       │                                                │ hexadecimal   number  or │
          │       │                                                │ vendor name.             │
          └───────┴────────────────────────────────────────────────┴──────────────────────────┘
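
To make the three filter syntaxes in the table concrete, here is a minimal sketch that formats filter strings for the -d option. The helper names are hypothetical. Hex digits are emitted in uppercase to match the intel_gpu_top -L listing in this issue; whether the tool treats case as significant is not verified here.

```python
from typing import Optional, Union


def sys_filter(sysfs_path: str) -> str:
    """sys:<sysfs path>, e.g. sys:/sys/devices/pci0000:00/0000:00:02.0"""
    return f"sys:{sysfs_path}"


def drm_filter(node: str) -> str:
    """drm:/dev/dri/<node>, where <node> is a device node name under /dev/dri."""
    return f"drm:/dev/dri/{node}"


def pci_filter(vendor: Union[int, str],
               device: Optional[int] = None,
               card: Optional[int] = None) -> str:
    """pci:[vendor=%04x/name][,device=%04x][,card=%d]"""
    # Per the table, the vendor is either a hexadecimal number or a vendor name.
    vendor_text = vendor if isinstance(vendor, str) else format(vendor, "04X")
    parts = [f"vendor={vendor_text}"]
    if device is not None:
        parts.append(f"device={device:04X}")
    if card is not None:
        parts.append(f"card={card}")
    return "pci:" + ",".join(parts)


if __name__ == "__main__":
    # Reproduces the filter used in this issue for the iGPU.
    print(pci_filter(0x8086, 0x4680, 0))
```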

There doesn't seem to be a way to output metrics for all GPUs with a single command.

Expected behavior

The ability to see metrics from all Intel GPUs.

Steps to reproduce

  1. Have a system with multiple Intel GPUs
  2. Enable the Intel GPU integration
  3. View the Intel GPU charts on the metrics page

Installation method

kickstart.sh

System info

Linux Loudhouse 6.8.4-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-3 (2024-05-02T11:55Z) x86_64 GNU/Linux
/etc/os-release:PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
/etc/os-release:NAME="Debian GNU/Linux"
/etc/os-release:VERSION_ID="12"
/etc/os-release:VERSION="12 (bookworm)"
/etc/os-release:VERSION_CODENAM

Netdata build info

Packaging:
    Netdata Version ____________________________________________ : v1.45.0-581-nightly
    Installation Type __________________________________________ : binpkg-deb
    Package Architecture _______________________________________ : x86_64
    Package Distro _____________________________________________ : debian 12
    Configure Options __________________________________________ : dummy-configure-command
Default Directories:
    User Configurations ________________________________________ : /etc/netdata
    Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
    Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
    Permanent Databases ________________________________________ : /var/lib/netdata
    Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
    Static Web Files ___________________________________________ : /usr/share/netdata/web
    Log Files __________________________________________________ : /var/log/netdata
    Lock Files _________________________________________________ : /var/lib/netdata/lock
    Home _______________________________________________________ : /var/lib/netdata
Operating System:
    Kernel _____________________________________________________ : Linux
    Kernel Version _____________________________________________ : 6.8.4-3-pve
    Operating System ___________________________________________ : Debian GNU/Linux
    Operating System ID ________________________________________ : debian
    Operating System ID Like ___________________________________ : unknown
    Operating System Version ___________________________________ : 12 (bookworm)
    Operating System Version ID ________________________________ : none
    Detection __________________________________________________ : /etc/os-release
Hardware:
    CPU Cores __________________________________________________ : 20
    CPU Frequency ______________________________________________ : 3600000000
    RAM Bytes __________________________________________________ : 67176378368
    Disk Capacity ______________________________________________ : 113006509514752
    CPU Architecture ___________________________________________ : x86_64
    Virtualization Technology __________________________________ : none
    Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
    Container __________________________________________________ : none
    Container Detection ________________________________________ : systemd-detect-virt
    Container Orchestrator _____________________________________ : none
    Container Operating System _________________________________ : none
    Container Operating System ID ______________________________ : none
    Container Operating System ID Like _________________________ : none
    Container Operating System Version _________________________ : none
    Container Operating System Version ID ______________________ : none
    Container Operating System Detection _______________________ : none
Features:
    Built For __________________________________________________ : Linux
    Netdata Cloud ______________________________________________ : YES
    Health (trigger alerts and send notifications) _____________ : YES
    Streaming (stream metrics to parent Netdata servers) _______ : YES
    Back-filling (of higher database tiers) ____________________ : YES
    Replication (fill the gaps of parent Netdata servers) ______ : YES
    Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip)
    Contexts (index all active and archived metrics) ___________ : YES
    Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
    Machine Learning ___________________________________________ : YES
Database Engines:
    dbengine (compression) _____________________________________ : YES (zstd lz4)
    alloc ______________________________________________________ : YES
    ram ________________________________________________________ : YES
    none _______________________________________________________ : YES
Connectivity Capabilities:
    ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
    static (Netdata internal web server) _______________________ : YES
    h2o (web server) ___________________________________________ : YES
    WebRTC (experimental) ______________________________________ : NO
    Native HTTPS (TLS Support) _________________________________ : YES
    TLS Host Verification ______________________________________ : YES
Libraries:
    LZ4 (extremely fast lossless compression algorithm) ________ : YES
    ZSTD (fast, lossless compression algorithm) ________________ : YES
    zlib (lossless data-compression library) ___________________ : YES
    Brotli (generic-purpose lossless compression algorithm) ____ : NO
    protobuf (platform-neutral data serialization protocol) ____ : YES (system)
    OpenSSL (cryptography) _____________________________________ : YES
    libdatachannel (stand-alone WebRTC data channels) __________ : NO
    JSON-C (lightweight JSON manipulation) _____________________ : YES
    libcap (Linux capabilities system operations) ______________ : NO
    libcrypto (cryptographic functions) ________________________ : YES
    libyaml (library for parsing and emitting YAML) ____________ : YES
Plugins:
    apps (monitor processes) ___________________________________ : YES
    cgroups (monitor containers and VMs) _______________________ : YES
    cgroup-network (associate interfaces to CGROUPS) ___________ : YES
    proc (monitor Linux systems) _______________________________ : YES
    tc (monitor Linux network QoS) _____________________________ : YES
    diskspace (monitor Linux mount points) _____________________ : YES
    freebsd (monitor FreeBSD systems) __________________________ : NO
    macos (monitor MacOS systems) ______________________________ : NO
    statsd (collect custom application metrics) ________________ : YES
    timex (check system clock synchronization) _________________ : YES
    idlejitter (check system latency and jitter) _______________ : YES
    bash (support shell data collection jobs - charts.d) _______ : YES
    debugfs (kernel debugging metrics) _________________________ : YES
    cups (monitor printers and print jobs) _____________________ : YES
    ebpf (monitor system calls) ________________________________ : YES
    freeipmi (monitor enterprise server H/W) ___________________ : YES
    nfacct (gather netfilter accounting) _______________________ : YES
    perf (collect kernel performance events) ___________________ : YES
    slabinfo (monitor kernel object caching) ___________________ : YES
    Xen ________________________________________________________ : YES
    Xen VBD Error Tracking _____________________________________ : NO
    Logs Management ____________________________________________ : NO
Exporters:
    AWS Kinesis ________________________________________________ : NO
    GCP PubSub _________________________________________________ : NO
    MongoDB ____________________________________________________ : YES
    Prometheus (OpenMetrics) Exporter __________________________ : YES
    Prometheus Remote Write ____________________________________ : YES
    Graphite ___________________________________________________ : YES
    Graphite HTTP / HTTPS ______________________________________ : YES
    JSON _______________________________________________________ : YES
    JSON HTTP / HTTPS __________________________________________ : YES
    OpenTSDB ___________________________________________________ : YES
    OpenTSDB HTTP / HTTPS ______________________________________ : YES
    All Metrics API ____________________________________________ : YES
    Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
    Trace All Netdata Allocations (with charts) ________________ : NO
    Developer Mode (more runtime checks, slower) _______________ : NO

Additional info

No response

@rebelonion rebelonion added bug needs triage Issues which need to be manually labelled labels Jun 13, 2024
@ilyam8
Member

ilyam8 commented Jun 13, 2024

Hi, @rebelonion.

Please show the output of intel_gpu_top -L -J. Edit: listing available GPUs doesn't support JSON output 😢

@rebelonion
Author

Listing available GPUs doesn't support JSON output 😢

Yeah, I noticed that. Pretty unfortunate 😅

@ilyam8
Member

ilyam8 commented Jun 13, 2024

@rebelonion As a quick fix, I added a configuration option to select a device in #17884.

I plan to add an option to monitor all devices with a single job, but that will come later, if there is demand for it (feature request).
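
Until a single job can monitor all devices, one job per device is the obvious workaround. The sketch below generates such job definitions from a list of device filters. The key names name and device and the jobs: layout are assumptions modeled on typical collector job files; this thread does not confirm them, so check #17884 for the actual option name.

```python
def jobs_for_devices(filters):
    """Create one hypothetical job definition per device filter."""
    return [
        {"name": f"gpu{index}", "device": device_filter}
        for index, device_filter in enumerate(filters)
    ]


def render_yaml(jobs):
    """Render the jobs as simple YAML text (key names are assumptions)."""
    lines = ["jobs:"]
    for job in jobs:
        lines.append(f"  - name: {job['name']}")
        # Quote the filter because it contains ':' and ',' characters.
        lines.append(f"    device: '{job['device']}'")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Filters taken from the intel_gpu_top -L listing in this issue.
    filters = [
        "pci:vendor=8086,device=56A0,card=0",
        "pci:vendor=8086,device=4680,card=0",
        "pci:vendor=8086,device=56A0,card=1",
    ]
    print(render_yaml(jobs_for_devices(filters)), end="")
```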

@ilyam8 ilyam8 added collectors/go.d area/collectors Everything related to data collection and removed needs triage Issues which need to be manually labelled labels Jun 14, 2024