Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AMD GPU collector #15515

Merged
merged 27 commits into from Aug 7, 2023
Merged

Add AMD GPU collector #15515

merged 27 commits into from Aug 7, 2023

Conversation

Dim-P
Copy link
Contributor

@Dim-P Dim-P commented Jul 24, 2023

Summary

This PR adds a collector that monitors all the AMD GPU cards on the system.

Fixes #12616.

The collector provides the following metrics:

  • Utilization
    • GPU utilization
    • GPU memory utilization
  • Clock frequencies:
    • GPU clock frequency
    • GPU memory clock frequency
  • Memory usage:
    • VRAM memory usage percentage
    • VRAM memory usage
    • visible VRAM memory usage percentage
    • visible VRAM memory usage
    • GTT memory usage percentage
    • GTT memory usage

The following can be provided by Sensors, so they were not implemented:

  • Power consumption
  • Temperature
  • Fan speed
  • Voltage

Out of scope of this PR, but could potentially be implemented in the future (needs further research):

  • PCIe RX/TX bandwidth metrics
  • Overclocking metrics
  • Usage and consumption per process (similar to nvtop)
  • Power state metrics
Test Plan
  • Run this PR on a system without an AMD GPU and check that this branch behaves as master.
  • Run this PR on a system with an AMD GPU and check that all the above metrics are displayed. Check also that the "marketing name" for each GPU is correct (can be checked at the product_name label).

For users: How does this change affect me?

You will be able to monitor basic metrics for AMD GPUs. It does not affect any instances that do not have an AMD GPU (see the nvidia-smi collector for NVIDIA GPUs).

@github-actions github-actions bot added area/collectors Everything related to data collection area/build Build system (autotools and cmake). collectors/proc area/docs labels Jul 24, 2023
@Dim-P Dim-P marked this pull request as ready for review July 27, 2023 18:36
@Dim-P
Copy link
Contributor Author

Dim-P commented Jul 28, 2023

@ilyam8 Removed configuration as discussed. Please let me know if chart IDs and contexts make sense as they are, I think they do now. I tested with multiple GPU cards and aggregated metrics seem to work correctly.

@MrZammler
Copy link
Contributor

Hi Dimitri! Nice!

So, I tested this on my laptop, which has an AMD 4500U CPU, with integrated GPU.

The card is on this list.

I've added {0x1636, 0xC3, "AMD 4500U"}, in your list, since it wasn't included. So maybe there are some more cards that could be added, but that can be iterated.

What I've found next, is that still it wouldn't display any charts for it. The problem was that in my case util_mem does not exist. Commenting out the part for checking util_mem, did allow other charts to appear (gpu_utilization, gpu_clk_frequency and gpu_mem_clk_frequency).

So a suggestion would be that if in case something is missing, don't give up on the whole card?

Also, if possible to capitalize the amdgpu entry in the list on the dashboard (e.g. AMDGpu) and an icon would be cool as well :-)

@Dim-P
Copy link
Contributor Author

Dim-P commented Aug 3, 2023

Hi Dimitri! Nice!

So, I tested this on my laptop, which has an AMD 4500U CPU, with integrated GPU.

The card is on this list.

I've added {0x1636, 0xC3, "AMD 4500U"}, in your list, since it wasn't included. So maybe there are some more cards that could be added, but that can be iterated.

What I've found next, is that still it wouldn't display any charts for it. The problem was that in my case util_mem does not exist. Commenting out the part for checking util_mem, did allow other charts to appear (gpu_utilization, gpu_clk_frequency and gpu_mem_clk_frequency).

So a suggestion would be that if in case something is missing, don't give up on the whole card?

Also, if possible to capitalize the amdgpu entry in the list on the dashboard (e.g. AMDGpu) and an icon would be cool as well :-)

Thanks, good points.

About additional GPU marketing names, I believe these can be added later, as it's a never-ending piece of work. I did add some RX 7xxx though that were missing from the original list.

I changed how chart updates are performed and now a linked list is used, so any metrics (and charts) that cannot be monitored any more can easily be removed from the list. This seems to be working fine but please test on your system again.

About amdgpu, it is capitalized if you check the first version of the dashboard (localhost:19999/v1), but I believe for v2, this (and the icon) is handled by @netdata/cloud-fe .

I also checked @Ferroin 's concerns about introducing unnecessary GPU usage due to /sys/class/drm calls, but I didn't notice any difference on my setup.

@Dim-P Dim-P requested review from ilyam8 and MrZammler August 3, 2023 17:11
MrZammler
MrZammler previously approved these changes Aug 4, 2023
Copy link
Contributor

@MrZammler MrZammler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to work fine here! Just needs more cards, we can add later!

@github-actions github-actions bot added the area/metadata Integrations metadata label Aug 7, 2023
MrZammler
MrZammler previously approved these changes Aug 7, 2023
title: 'AMD GPUs',
icon: '<i class="fas fa-microchip"></i>',
info: 'Detailed information for each AMD GPU of the system. Temperature, fan speed, voltage and power metrics can be found at the <a href="#menu_sensors">Sensors</a> section.'
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, do not forget to create an analogous PR to also bring this description for new dashboard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will do, just need the information of what needs updating from @netdata/cloud-fe as mentioned in my comment above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: we shouldn't update this file in netdata/netdata but in https://github.com/netdata/dashboard. If there is a new version of the old dashboard, this file will be overwritten.

underhood
underhood previously approved these changes Aug 7, 2023
ilyam8
ilyam8 previously approved these changes Aug 7, 2023
Copy link
Contributor

@thiagoftsm thiagoftsm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have hardware to test, bu considering that other teammate already tested it, and possible changes will come in other PRs, LGTM!

@ilyam8 ilyam8 merged commit 1bcb3fb into netdata:master Aug 7, 2023
125 of 126 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/build Build system (autotools and cmake). area/collectors Everything related to data collection area/docs area/metadata Integrations metadata area/web collectors/proc
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feat]: AMD GPU monitoring
5 participants