Add fake GPU device generator for scalability testing #1116

Merged
merged 5 commits into from
Dec 7, 2022

Contributor

@eero-t eero-t commented Aug 24, 2022

The whole picture and earlier review comments are in the RFC PR #1104, from which this is split off.

Compared to the RFC PR, I've moved and renamed the generator code to gpu_fakedev / gpu_fakedev.go, and added documentation and an intel-gpu-fakedev container for it.

Contributor Author

eero-t commented Aug 25, 2022

The 7 plugin tests that remain in "expected" state show:

Error: .github#L1
An unexpected error has occurred and we've been automatically notified. Errors are sometimes temporary, so please try again.

What should I do next?

Contributor Author

eero-t commented Aug 25, 2022

While this adds a new container for the devices project, the program itself has no dependencies outside the Go standard library and golang.org/x/sys/unix.

It currently has no unit tests, and I'm not sure what those should do, as this is itself a (scalability) testing tool for the GPU plugin. The most relevant test would be whether the GPU plugin reports the expected number of GPUs for the content created by the tool, but because the tool generates device files, running it requires either root or a suitable capability for creating them.

Contributor

mythi commented Aug 25, 2022

The 7 plugin tests that remain in "expected" state show:

Error: .github#L1
An unexpected error has occurred and we've been automatically notified. Errors are sometimes temporary, so please try again.

What should I do next?

https://www.githubstatus.com/

To facilitate GPU plugin scalability testing on a real cluster.

Pre-existing (fake) sysfs & devfs content needs to be removed first:

* Fake devfs directory is mounted from host so OCI runtime can "mount"
  device files also to workloads requesting fake devices. This means
  that those files can persist over fake GPU plugin life-time, and
  earlier files need to be removed, as they may not match

* DaemonSet restarts failing init containers, so errors about content
  created on previous generator run would prevent getting logs of the
  real error on first generator run

* Before removal, check that removed directory content is as expected,
  to avoid accidentally removing host sysfs/devfs content (in case
  container was erroneously granted access to the real thing)

Container runtime requires fake device files to be real devices:

* Use NULL devices to represent fake GPU devices:
  https://www.kernel.org/doc/Documentation/admin-guide/devices.txt

* Give more detailed logging for Mknod() failures as device
  node creation is the most likely operation to fail when container
  does not have the necessary access rights

Created content is based on JSON config file (instead of e.g.
commandline options) so that (configMap providing) it can be updated
independently of the pod where generator is run.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Config file is suitably indented so that it can be directly
appended to a suitable configMap header.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Contributor Author

eero-t commented Aug 26, 2022

Updated doc + rebased on main now that GitHub problems are fixed, so that CI tests run for the first time.

@codecov-commenter

Codecov Report

Merging #1116 (72f1010) into main (6347609) will not change coverage.
The diff coverage is n/a.

❗ Current head 72f1010 differs from pull request most recent head ff5cc41. Consider uploading reports for the commit ff5cc41 to get more accurate results

@@           Coverage Diff           @@
##             main    #1116   +/-   ##
=======================================
  Coverage   53.01%   53.01%           
=======================================
  Files          40       40           
  Lines        4350     4350           
=======================================
  Hits         2306     2306           
  Misses       1917     1917           
  Partials      127      127           


@eero-t eero-t changed the title WIP: Add fake GPU device generator for scalability testing Add fake GPU device generator for scalability testing Oct 27, 2022
Contributor Author

eero-t commented Oct 27, 2022

@mythi Any comments on this now that release is done?

Contributor Author

eero-t commented Nov 23, 2022

Note: fakedev-exporter project relies on this functionality, as documented here: https://github.com/intel/fakedev-exporter/blob/main/deployments/README.md

Member

@bart0sh bart0sh left a comment


/lgtm

@bart0sh bart0sh merged commit b4c2bd3 into intel:main Dec 7, 2022
@eero-t eero-t deleted the gpu_fakedev branch December 12, 2022 11:55