Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gpu levelzero sidecar #1803

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Conversation

tkatila
Copy link
Contributor

@tkatila tkatila commented Aug 9, 2024

Add GPU Levelzero sidecar to allow fetching health data for the GPU devices from the Levelzero API.

As a bonus, this also adds support for detecting Intel GPUs within Windows Subsystem for Linux (WSL).

@tkatila tkatila force-pushed the gpu-levelzero-sidecar branch 4 times, most recently from fc65eff to d3afded Compare August 13, 2024 11:59
Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few doc comments.

PS. I would also suggest running following (untested) command on the code to fix PCI abbreviation use in error messages:
sed -i 's/ pci / PCI /' $(git grep -l ' pci ')

README.md Outdated Show resolved Hide resolved
cmd/gpu_levelzero/README.md Outdated Show resolved Hide resolved
cmd/internal/levelzero/README.md Show resolved Hide resolved
tkatila and others added 6 commits September 6, 2024 09:06
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Retrieve Intel GPU indices from the Level-Zero sidecar

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
@tkatila
Copy link
Contributor Author

tkatila commented Sep 6, 2024

@eero-t & @uniemimu can you please review?

This probably requires some restructuring after the GPU CDI support has been merged (assuming it will happen first).

(Failed check is due to internal certificate issue, I do not suspect it to be a real problem.)

@eero-t
Copy link
Contributor

eero-t commented Sep 6, 2024

@eero-t & @uniemimu can you please review?

What are the differences to earlier version? (you did force push, so github does not show those.)

@tkatila
Copy link
Contributor Author

tkatila commented Sep 6, 2024

@eero-t & @uniemimu can you please review?

What are the differences to earlier version? (you did force push, so github does not show those.)

https://github.com/intel/intel-device-plugins-for-kubernetes/compare/82a387760e4c2a9a12cc366405e19153f67e7efd..76b1e5a04abc600ec1992a1adf0474d42ffad1b9

~mid third is my changes. Top third and bottom third are baseline changes. But to summarize:

  1. Some logging to C code
  2. Added temperature read from device and its handling in the plugin

cmd/gpu_plugin/gpu_plugin.go Outdated Show resolved Hide resolved
cmd/gpu_levelzero/zes.c Outdated Show resolved Hide resolved
cmd/gpu_levelzero/zes.c Outdated Show resolved Hide resolved
cmd/gpu_levelzero/zes.c Outdated Show resolved Hide resolved
cmd/gpu_levelzero/zes.c Outdated Show resolved Hide resolved
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants