
RFC: add (path) "prefix" option to GPU plugin for scalability testing #1104

Closed
wants to merge 13 commits

Conversation

@eero-t (Contributor) commented Aug 18, 2022

This PR adds a -prefix option to the GPU plugin, which allows it to read fake devfs and sysfs GPU device files (please check the commit messages for details).

The PR also includes a tool for generating such files, which I've tested with the (included) modified GPU plugin pod spec in a k8s cluster. Those are included just for reference, and should be split into separate PR(s) before this is merged.

For now, they are in a "fakedev" subdirectory under gpu_plugin/. What would be the best place for the generator code, maybe cmd/gpu_fake_devs/generator.go?

A GPU plugin pod running with -prefix will have that generator as its first init container. For NFD labeling to work properly with fake devices, gpu_nfdhook must be run in the GPU plugin pod context, not copied to the host as an NFD hook, as is done by the modified GPU plugin pod spec (gpu_nfdhook itself does not need changes for now, see [1]).

With these, one can fake an arbitrary number of devices, which can be used for scalability testing of both the GPU plugin and things depending on it.
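A minimal sketch of the -prefix idea (the function name is hypothetical, not the PR's actual code): with an empty prefix the plugin scans the real kernel paths, while a non-empty prefix redirects it to the same layout under a directory of generated fake files.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// prefixedPath illustrates the -prefix option: it joins an optional
// prefix directory to a kernel path such as /sys/class/drm, so fake
// sysfs/devfs trees can live outside the real (read-only) paths.
func prefixedPath(prefix, kernelPath string) string {
	if prefix == "" {
		return kernelPath // normal operation: real host paths
	}
	return filepath.Join(prefix, kernelPath)
}

func main() {
	fmt.Println(prefixedPath("", "/sys/class/drm"))
	fmt.Println(prefixedPath("/tmp/fake", "/sys/class/drm"))
}
```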

[1] Links:

@tkatila (Contributor) commented Aug 19, 2022

Can you give an example yaml of the deployment?

@uniemimu (Contributor) left a comment:

I don't really see much benefit in the prefix; I see complications and unnecessary logic.

You should primarily aim to fake the resource under its existing name, without breaking dependencies on the existing resource name.

@eero-t eero-t changed the title RFC: add "fake-mode" to GPU plugin for scalability testing RFC: add (path) "prefix" option to GPU plugin for scalability testing Aug 19, 2022
@eero-t (Contributor, Author) commented Aug 19, 2022

@uniemimu & @tkatila I dropped the resource prefix commit, renamed the option just to "-prefix", added generator code & faked GPU plugin pod spec as reference, and updated the title + PR description accordingly.

@eero-t (Contributor, Author) commented Aug 19, 2022

The pod spec is just an example of the changes. When using it in a cluster that also runs the real GPU plugin, you would add a nodeSelector to the DaemonSet to constrain the fake GPU plugin to non-GPU nodes.

With the example configMap, a nodeSelector for a fake GPU workload [1] would ask for nodes where the "gpu.intel.com/platform_fake_DG1.present" label is true, and real workloads would need to ask for the non-fake GPU label.

Eventually there would be the following PRs:

  • "-prefix" option to GPU plugin (this PR)
  • fake device file generator code + container for it + example config(s)
  • GPU plugin kustomize file(s) for customizing GPU plugin pod to use those
  • Relevant documentation updates

[1] If the fake workload is one from the "fakedev-exporter" project, it can fake a suitable GPU load for the (fake) GPU assigned to it, which allows testing further GPU-related k8s components.

Devices can be faked for scalability testing when non-standard paths
are used (GPU plugin code assumes container paths to match host paths,
and container runtime prevents creating fake files under real paths).

Note: If one wants to run both normal GPU plugin and faked one in same
cluster, all nodes providing fake "i915" resources should be labeled
differently from ones with real GPU plugin + devices, so that real GPU
workloads can be limited to correct nodes with a suitable
nodeSelector.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
The fake devfs directory is mounted from the host so that the OCI runtime can "mount" the device files also into workloads requesting fake devices. This means those files can persist over the fake GPU plugin's lifetime, so files from earlier runs need to be removed, as they may not match.

Also, a DaemonSet restarts failing init containers, so errors about directories generated on a previous generator run would prevent getting the logs of the real error from the first generator run.
Represent fake GPU devices with null devices:
https://www.kernel.org/doc/Documentation/admin-guide/devices.txt

The real devfs check also needed changing, and the removal warnings were simplified, as there's always just one entry.
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
@eero-t (Contributor, Author) commented Aug 19, 2022

Pushed fixes to CI complaints (including some white-space complaints for already existing GPU plugin code).

@uniemimu (Contributor) left a comment:

Overall I think this looks good. I'm still going to have a nitpicking round if I find some style issues or so but in the bigger picture this is excellent.

immutable: false
data:
  fakedev.json: |-
    {
Review comment (Contributor):

Should the base object be an array? In case we ever want to define a heterogenous fake system.

Reply (Contributor Author):

If you mean different GPUs on different nodes, one could just run multiple fake plugins, restricted to different nodes. Or the code could be changed to read multiple JSON files; if a JSON file includes new include & exclude keys listing node names, the code would compare them against the specified env var value.

If you mean different GPUs on the same node, there could be a device offset key in the JSON, with the code changed to parse multiple config files.

Both of the above changes would be fully backwards compatible, i.e. they could be done later.

=> If needed, I'll add those features later on, but keep the code simple for now.

PS. I do not think changing the API later would be a problem either, as this is just a (scalability) test tool; a config API change only requires a matching generator version, and one gets an error if they do not match.

@tkatila (Contributor) commented Aug 22, 2022

I like that it's all deployable from k8s; no need to ssh into a node and run the generator there, etc. Overall, it looks good!

As suggested by Ukri.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Give more detailed logging for the most likely failure, as MkNod() device node creation can fail as a normal user.

Additional error checking in a new dir removal helper function fixes Ukri's review comments. There's now an error if the to-be-removed fake sysfs has more content than expected (earlier such a check existed only for the fake devfs content).

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Noticed by Tuomas.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
@eero-t (Contributor, Author) commented Aug 22, 2022

With the above comments addressed, do you have any advice on where to put the generator code, container & example config?

(It's probably best to leave just the platform item and remove the rest of the keys from the Capabilities map in its config.)

@tkatila (Contributor) commented Aug 23, 2022

For the locations. The yaml would probably go to /deployments/fake-gpu-plugin (or similar).
For the generator code, I was wondering if it would make sense to include it in the gpu init container? We wouldn't then need another container for it.

@eero-t (Contributor, Author) commented Aug 24, 2022

> For the locations. The yaml would probably go to /deployments/fake-gpu-plugin (or similar).

I think the YAML is best placed as a new kustomize dir for the GPU plugin.

> For the generator code, I was wondering if it would make sense to include it in the gpu init container? We wouldn't then need another container for it.

Including it in the gpu_nfdhook/ directory and container image would make things simpler => I'll do that, and split the rest of the stuff into separate PRs once Mikko fixes the SGX CI builds.

As it has no extra deps and is only 3.5MB when statically compiled, the image size increase and its impact on GPU plugin deployment could be fine I guess, although the generator is used only for testing.

However:

[1] I.e. that functionality would either be included in the GPU plugin itself, or run as a GPU plugin sidecar that updates the NFD feature file whenever GPUs change.

So I think it still makes sense to run the generator in a separate init container, although it would be in the same image as the NFD hook.

@eero-t (Contributor, Author) commented Aug 24, 2022

@tkatila The project's container templating does not support building or installing multiple binaries into the same container image, so I need to put the generator into a separate directory / container after all. Maybe "gpu_fakedev"?

@tkatila (Contributor) commented Aug 25, 2022

> Maybe "gpu_fakedev"?

Sounds fine to me.

@eero-t (Contributor, Author) commented Aug 25, 2022

Closing now that the features discussed here have their own PRs.

@eero-t eero-t closed this Aug 25, 2022