Add AMD GPU collector #15515

Merged
merged 27 commits into from
Aug 7, 2023
Changes from 23 commits
Commits
9724115
Initial commit with GPU and MEM utilization
Dim-P Jul 24, 2023
24fe573
Add GPU and MEM clock frequency metrics
Dim-P Jul 24, 2023
6143cd2
Add MEM usage metrics
Dim-P Jul 25, 2023
536e493
Configure per chart, not per dimension
Dim-P Jul 25, 2023
c9604c8
Refactor to reduce boilerplate code
Dim-P Jul 25, 2023
b309a74
Add MEM percentage usage metrics
Dim-P Jul 25, 2023
71b2fef
Add error handling in case of file read error
Dim-P Jul 27, 2023
2d55060
Update related README.md
Dim-P Jul 27, 2023
ae9148c
Update metrics.csv and multi_metadata.yaml
Dim-P Jul 27, 2023
b3b866e
Update dashboard_info.js
Dim-P Jul 27, 2023
732a081
Fix YAML indentation
Dim-P Jul 27, 2023
071311f
Split up dimensions in separate charts and add free memory metrics
Dim-P Jul 28, 2023
6231256
Remove config options and change chart IDs
Dim-P Jul 28, 2023
bbd65fc
Remove forgotten copyright comment
Dim-P Jul 28, 2023
22649f4
Add cardX suffix to IDs
Dim-P Jul 31, 2023
b04350d
Do not monitor if device or revision hex cannot be read
Dim-P Jul 31, 2023
bde603e
Proper fix for card_free() in case of error
Dim-P Jul 31, 2023
ddea66d
Fix marketing name in case of unknown AMD GPU
Dim-P Jul 31, 2023
5c106c7
Refactor code that updates charts.
Dim-P Aug 3, 2023
2438c19
Add some missing RX 7XXX series cards
Dim-P Aug 3, 2023
7ab7a3a
Add some logging in case of initial file read errors
Dim-P Aug 3, 2023
5211801
Add logging in case of unreadable asic_id or pci_rev_id
Dim-P Aug 3, 2023
9c64b7e
Update some error logs to info logs
Dim-P Aug 7, 2023
3f1a064
Add check to skip initialization if is false
Dim-P Aug 7, 2023
5163695
Update method_description in metadata.yaml
Dim-P Aug 7, 2023
9431a7f
Update chart priorities
Dim-P Aug 7, 2023
49f2dc0
Change some dimension names from used to usage
Dim-P Aug 7, 2023
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -671,6 +671,7 @@ set(PROC_PLUGIN_FILES
collectors/proc.plugin/sys_fs_btrfs.c
collectors/proc.plugin/sys_class_power_supply.c
collectors/proc.plugin/sys_devices_pci_aer.c
collectors/proc.plugin/sys_class_drm.c
)

set(TC_PLUGIN_FILES
1 change: 1 addition & 0 deletions Makefile.am
@@ -423,6 +423,7 @@ PROC_PLUGIN_FILES = \
collectors/proc.plugin/sys_fs_btrfs.c \
collectors/proc.plugin/sys_class_power_supply.c \
collectors/proc.plugin/sys_class_infiniband.c \
collectors/proc.plugin/sys_class_drm.c \
$(NULL)

PROFILE_PLUGIN_FILES = \
4 changes: 4 additions & 0 deletions collectors/all.h
@@ -70,6 +70,10 @@

#define NETDATA_CHART_PRIO_INTERRUPTS_PER_CORE 1100 // +1 per core

// AMD GPUs

#define NETDATA_CHART_PRIO_DRM_AMDGPU 1005

// Memory Section - 1xxx

#define NETDATA_CHART_PRIO_MEM_SYSTEM_AVAILABLE 1010
31 changes: 31 additions & 0 deletions collectors/proc.plugin/README.md
@@ -31,6 +31,7 @@ In detail, it collects metrics from:
- `/proc/spl/kstat/zfs/pool/state` (state of ZFS pools)
- `/sys/class/power_supply` (power supply properties)
- `/sys/class/infiniband` (infiniband interconnect)
- `/sys/class/drm` (AMD GPUs)
- `ipc` (IPC semaphores and message queues)
- `ksm` Kernel Same-Page Merging performance (several files under `/sys/kernel/mm/ksm`).
- `netdata` (internal Netdata resources utilization)
@@ -579,6 +580,36 @@ Default configuration will monitor only enabled infiniband ports, and refresh ne
# refresh ports state every seconds = 30
```

## AMD GPUs

This module monitors every AMD GPU card discovered at agent startup.

### Monitored GPU metrics

The following charts will be provided:

- **GPU utilization**
- **GPU memory utilization**
- **GPU clock frequency**
- **GPU memory clock frequency**
- **VRAM memory usage percentage**
- **VRAM memory usage**
- **visible VRAM memory usage percentage**
- **visible VRAM memory usage**
- **GTT memory usage percentage**
- **GTT memory usage**

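The percentage and used/free charts listed above can each be derived from two counters per memory pool: a total and a used byte count. The following is a minimal sketch of that arithmetic, not the collector's actual code. The helper names `mem_free_bytes` and `mem_usage_percent` are hypothetical, and clamping `used` to `total` is an assumption made here to keep the results in range.

```c
#include <stdint.h>

/* Hypothetical helper: bytes still free, clamped to 0 if used exceeds total. */
uint64_t mem_free_bytes(uint64_t used, uint64_t total) {
    return used < total ? total - used : 0;
}

/* Hypothetical helper: usage as a percentage in [0, 100]; 0 when total is 0. */
double mem_usage_percent(uint64_t used, uint64_t total) {
    if (total == 0)
        return 0.0;          /* avoid dividing by zero for an absent pool */
    if (used > total)
        used = total;        /* assumption: clamp inconsistent readings */
    return 100.0 * (double)used / (double)total;
}
```

For example, with 256 bytes used out of 1024, `mem_free_bytes` yields 768 and `mem_usage_percent` yields 25.0. The same two helpers would apply equally to the VRAM, visible VRAM and GTT pools.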
### Configuration

The `drm` path can be configured if it differs from the default:

```
[plugin:proc:/sys/class/drm]
# directory to monitor = /sys/class/drm
```
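As a rough illustration of how a configurable base directory could be combined with a card name to locate a metric file, here is a small sketch. The function `drm_build_path` and the `<dir>/<card>/device/<file>` layout are assumptions for illustration. The amdgpu kernel driver does expose files such as `device/gpu_busy_percent` under each card, but `sys_class_drm.c` is not part of the diff shown here, so how the module actually builds its paths is not confirmed.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write "<drm_dir>/<card>/device/<file>" into dst.
 * Returns 0 on success, -1 if the path would not fit in dst_size bytes. */
int drm_build_path(char *dst, size_t dst_size,
                   const char *drm_dir, const char *card, const char *file) {
    int n = snprintf(dst, dst_size, "%s/%s/device/%s", drm_dir, card, file);
    return (n < 0 || (size_t)n >= dst_size) ? -1 : 0;
}
```

Called with `"/sys/class/drm"`, `"card0"` and `"gpu_busy_percent"`, the helper produces `/sys/class/drm/card0/device/gpu_busy_percent`. Checking the `snprintf` return value guards against silently truncated paths when a non-default directory is configured.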

> [!NOTE]
> Temperature, fan speed, voltage and power metrics for AMD GPUs can be monitored using the [Sensors](https://github.com/netdata/netdata/blob/master/collectors/charts.d.plugin/sensors/README.md) plugin.

## IPC

135 changes: 135 additions & 0 deletions collectors/proc.plugin/metadata.yaml
@@ -5180,3 +5180,138 @@ modules:
                - name: now
                - name: max
                - name: max_design
  - meta:
      plugin_name: proc.plugin
      module_name: /sys/class/drm
      monitored_instance:
        name: AMD GPU
        link: "https://www.amd.com"
        categories:
          - data-collection.hardware-devices-and-sensors
        icon_filename: "amd.png"
      related_resources:
        integrations:
          list: []
      info_provided_to_referring_integrations:
        description: ""
      keywords:
        - amd
        - gpu
        - hardware
      most_popular: false
    overview:
      data_collection:
        metrics_description: "This integration monitors AMD GPU metrics, such as utilization, clock frequency and memory usage."
        method_description: ""
      supported_platforms:
        include:
          - Linux
        exclude: []
      multi_instance: true
      additional_permissions:
        description: ""
      default_behavior:
        auto_detection:
          description: ""
        limits:
          description: ""
        performance_impact:
          description: ""
    setup:
      prerequisites:
        list: []
      configuration:
        file:
          name: ""
          description: ""
        options:
          description: ""
          folding:
            title: ""
            enabled: true
          list: []
        examples:
          folding:
            enabled: true
            title: ""
          list: []
    troubleshooting:
      problems:
        list: []
    alerts: []
    metrics:
      folding:
        title: Metrics
        enabled: false
      description: ""
      availability: []
      scopes:
        - name: gpu
          description: "These metrics refer to the GPU."
          labels:
            - name: product_name
              description: GPU product name (e.g. AMD RX 6600)
          metrics:
            - name: amdgpu.gpu_utilization
              description: GPU utilization
              unit: "percentage"
              chart_type: line
              dimensions:
                - name: utilization
            - name: amdgpu.gpu_mem_utilization
              description: GPU memory utilization
              unit: "percentage"
              chart_type: line
              dimensions:
                - name: utilization
            - name: amdgpu.gpu_clk_frequency
              description: GPU clock frequency
              unit: "MHz"
              chart_type: line
              dimensions:
                - name: frequency
            - name: amdgpu.gpu_mem_clk_frequency
              description: GPU memory clock frequency
              unit: "MHz"
              chart_type: line
              dimensions:
                - name: frequency
            - name: amdgpu.gpu_mem_vram_usage_perc
              description: VRAM memory usage percentage
              unit: "percentage"
              chart_type: line
              dimensions:
                - name: used
            - name: amdgpu.gpu_mem_vram_usage
              description: VRAM memory usage
              unit: "bytes"
              chart_type: area
              dimensions:
                - name: used
                - name: free
            - name: amdgpu.gpu_mem_vis_vram_usage_perc
              description: visible VRAM memory usage percentage
              unit: "percentage"
              chart_type: line
              dimensions:
                - name: used
            - name: amdgpu.gpu_mem_vis_vram_usage
              description: visible VRAM memory usage
              unit: "bytes"
              chart_type: area
              dimensions:
                - name: used
                - name: free
            - name: amdgpu.gpu_mem_gtt_usage_perc
              description: GTT memory usage percentage
              unit: "percentage"
              chart_type: line
              dimensions:
                - name: used
            - name: amdgpu.gpu_mem_gtt_usage
              description: GTT memory usage
              unit: "bytes"
              chart_type: area
              dimensions:
                - name: used
                - name: free
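The two clock metrics above report a single `frequency` dimension in MHz. The amdgpu driver lists available clock levels in files such as `pp_dpm_sclk`, one level per line, with the active level marked by a `*`. Assuming the collector takes its value from a file in that format (`sys_class_drm.c` is not part of the diff shown here, so this is a guess), the extraction could be sketched as follows. The function name `active_clock_mhz` is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the MHz value of the line marked with '*',
 * or -1 if no marked line can be parsed.
 * Expects lines shaped like "1: 1200Mhz *". */
long active_clock_mhz(const char *text) {
    const char *line = text;
    while (line && *line) {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);
        if (memchr(line, '*', len)) {          /* only the active level */
            int level;
            long mhz;
            if (sscanf(line, "%d: %ldMhz", &level, &mhz) == 2)
                return mhz;
        }
        line = end ? end + 1 : NULL;           /* advance to the next line */
    }
    return -1;
}
```

Given the text `"0: 500Mhz\n1: 1200Mhz *\n2: 2100Mhz\n"`, the helper returns 1200. Returning -1 when no level is marked lets a caller skip the chart update rather than plot a bogus value.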
5 changes: 4 additions & 1 deletion collectors/proc.plugin/plugin_proc.c
@@ -70,8 +70,11 @@ static struct proc_module {
  // IPC metrics
  {.name = "ipc", .dim = "ipc", .func = do_ipc},

- {.name = "/sys/class/power_supply", .dim = "power_supply", .func = do_sys_class_power_supply},
  // linux power supply metrics
+ {.name = "/sys/class/power_supply", .dim = "power_supply", .func = do_sys_class_power_supply},
+
+ // GPU metrics
+ {.name = "/sys/class/drm", .dim = "drm", .func = do_sys_class_drm},

  // the terminator of this array
  {.name = NULL, .dim = NULL, .func = NULL}
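The table above ends with an all-`NULL` terminator entry, which lets the plugin walk it without a separate length. Below is a minimal, self-contained sketch of that pattern. The `enabled` field, the `run_modules` function, the two dummy collectors and the convention that a non-zero return disables a module are assumptions for illustration, and `usec_t` is replaced by a local stand-in typedef.

```c
#include <stddef.h>

typedef unsigned long long usec_t;   /* stand-in for Netdata's usec_t */

struct proc_module {
    const char *name;
    const char *dim;
    int enabled;                      /* assumed field, not shown in the diff */
    int (*func)(int update_every, usec_t dt);
};

/* Hypothetical collectors: 0 means "keep calling me", non-zero means "disable me". */
int do_ok(int update_every, usec_t dt)      { (void)update_every; (void)dt; return 0; }
int do_missing(int update_every, usec_t dt) { (void)update_every; (void)dt; return 1; }

/* Run every enabled module once; return how many remain enabled. */
int run_modules(struct proc_module *modules, int update_every, usec_t dt) {
    int still_enabled = 0;
    for (struct proc_module *pm = modules; pm->name; pm++) {  /* stop at terminator */
        if (!pm->enabled)
            continue;
        pm->enabled = !pm->func(update_every, dt);
        still_enabled += pm->enabled;
    }
    return still_enabled;
}
```

With this shape, adding a collector such as the `/sys/class/drm` entry only requires one more row before the terminator; the loop itself does not change.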
1 change: 1 addition & 0 deletions collectors/proc.plugin/plugin_proc.h
@@ -46,6 +46,7 @@ int do_ipc(int update_every, usec_t dt);
int do_sys_class_power_supply(int update_every, usec_t dt);
int do_proc_pagetypeinfo(int update_every, usec_t dt);
int do_sys_class_infiniband(int update_every, usec_t dt);
int do_sys_class_drm(int update_every, usec_t dt);
int get_numa_node_count(void);
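The README in this PR states that GPUs are discovered at agent startup, which suggests a collector that initializes once and then only refreshes values. The following sketch shows one way a function with the `do_sys_class_drm` signature could be organized under that assumption. It is not the PR's implementation: `do_sys_class_drm_sketch`, `discover_cards`, `fake_card_count` and `discover_calls` are hypothetical names, `usec_t` is a local stand-in, and the "non-zero return means stop calling" convention is assumed.

```c
typedef unsigned long long usec_t;   /* stand-in for Netdata's usec_t */

int fake_card_count = 0;   /* hypothetical: number of GPUs discovery would find */
int discover_calls = 0;    /* counts how often discovery runs */

/* Hypothetical discovery step; a real one would scan the drm directory. */
static int discover_cards(void) {
    discover_calls++;
    return fake_card_count;
}

/* Returns 0 to keep being called, non-zero to ask the plugin to stop. */
int do_sys_class_drm_sketch(int update_every, usec_t dt) {
    static int initialized = 0;
    static int cards = 0;
    (void)update_every;
    (void)dt;

    if (!initialized) {
        initialized = 1;
        cards = discover_cards();    /* runs once, on the first call */
    }
    if (cards == 0)
        return 1;                    /* nothing to monitor */

    /* ... read per-card files and update charts here ... */
    return 0;
}
```

A consequence of this design is that cards appearing after the first call are not picked up until the agent restarts, which matches the "discovered at agent startup" wording.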

// metrics that need to be shared among data collectors