Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "-enable-monitoring" option to GPU plugin + testing for its options #628

Merged
merged 4 commits into from
May 6, 2021

Conversation

eero-t
Copy link
Contributor

@eero-t eero-t commented Apr 28, 2021

Commits in this PR do following changes for the GPU plugin:

  • Disable "i915_monitoring" resource by default, and add "-enable-monitoring" option so that this resource (exposing all node GPUs) can be enabled only when needed
  • Move options to their own struct, so their use is more explicit (no magic values) and they can be more easily passed around
  • Add testing for the new option, and also for the device sharing option that was missing testing support

These changes were dropped from latest versions of the PR:

  • WIP/untested operator support for the new option (I will test it properly after it's clear whether rest of the changes are going to be accepted)
  • Because option combinations change how GPU plugin should interpret the test environment, they and the expected resulting values are an additional testing dimension. Therefore, each test-case can have an array of them, and there's an additional loop going through these combinations

Alternatives for the last item are:

  • Having the option combinations array separate from sysfs/devfs setup array. Advantage of this is the best testing coverage (all options combos are tested with all test setups) with least amount of lines, but it requires few additional ifs to the test loop. When this was discussed in Gpu plugin testing improvements #615, people seemed to prefer slightly the approach I've taken now, but I'm fine with either of them. Example of how this could look (with additional options) is here: eero-t@1901af8
  • Flattening the 2-dimensional test-case data to 1 dimension. This avoids the need for additional loop, but has the downside of needing larger amount of test-case duplication to provide as good coverage

@mythi
Copy link
Contributor

mythi commented May 3, 2021

This avoids the need for additional loop, but has the huge downside of needing large amount of test-case duplication, so I'm not going do it

I don't see how "needing large amount of test-case duplication" can be justified. Since monitor count is independent from how many devices there are, it can be tested separately:

name: "two devices, enable monitoring"

Further, since expectedDevs is defined by a scalar multiplier sharedDevNum only one check for it is sufficient:

name: "two devices, sharedDevNum=4"

(or how testing 2 and 4 makes math different?)

Finally, what could be added still is the check that if there are no devices, also expectedMonitors gives zero even when monitoring is enabled:

name: "no device installed, enable monitoring"

@eero-t
Copy link
Contributor Author

eero-t commented May 3, 2021

I don't see how "needing large amount of test-case duplication" can be justified. Since monitor count is independent from how many devices there are, it can be tested separately

Current expectation is that options are independent. Combining the options is testing for that.

But combined options testing works also if a dependent option like this is added later on: eero-t@3fb02c0

Further, since expectedDevs is defined by a scalar multiplier sharedDevNum only one check for it is sufficient:
...
(or how testing 2 and 4 makes math different?)

I can think of some (contrived) code changes after which two out of 1, 2 & 4 cases would pass, but not all. While such changes might not pass manual review, I think it's good if unit tests catch them before one considers sending changes for review.

But I'll change 2 & 4 to a single, larger prime number, that should catch even those contrived examples well enough. :-)

In general, I think it's good practice to add few guards also against "impossible" happening, because unit tests aren't supposed to test only current code, but detect also when (potentially much) later changes break expectations in (supposedly) unrelated earlier functionality.

Finally, what could be added still is the check that if there are no devices, also expectedMonitors gives zero even when monitoring is enabled:

name: "no device installed, enable monitoring"

Good point, testing that is definitely needed! I'll add options to test that for one of the no-dev cases.

@eero-t eero-t force-pushed the gpu-plugin-monitoring-option branch from e1ca88f to d6f7614 Compare May 3, 2021 11:35
@eero-t
Copy link
Contributor Author

eero-t commented May 3, 2021

Did the changes mentioned above and pushed rebased version. That dang unrelated "opae-nlb-demo" was broken again.

Test update commit message lists what testing it is supposed to add: eero-t@11609a3

@mythi
Copy link
Contributor

mythi commented May 3, 2021

Combining the options is testing for that.

my feedback is that we should drop []optionsCheck looping but others can decide

@uniemimu
Copy link
Contributor

uniemimu commented May 4, 2021

The cmd-line flag itself looks ok. I do not have an opinion about testing.

@eero-t eero-t force-pushed the gpu-plugin-monitoring-option branch from d6f7614 to 6211033 Compare May 4, 2021 12:23
@eero-t
Copy link
Contributor Author

eero-t commented May 4, 2021

After discussing with Ukri about windmills, I replaced []optionsCheck looping[1] with minimal testing adding just "cliOptions" to test-cases: eero-t@6cec11e

(CI failed as "opae-nlb-demo" breakage seems now to be permanent.)

[1] it moved to "gpu-plugin-proper-option-testing" branch from where suitable bits can be resurrected after GPU plugin test code has acquired more sysfs/devfs setups it needs to test, and more options they need to be tested against.

Copy link
Contributor

@rojkov rojkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Added a couple of nits.

cmd/gpu_plugin/gpu_plugin.go Outdated Show resolved Hide resolved
cmd/gpu_plugin/gpu_plugin.go Outdated Show resolved Hide resolved
@eero-t eero-t force-pushed the gpu-plugin-monitoring-option branch from 6211033 to b1a1949 Compare May 4, 2021 14:15
rojkov
rojkov previously approved these changes May 4, 2021
@eero-t
Copy link
Contributor Author

eero-t commented May 4, 2021

Besides getting ClearLinux testing fixed for CI, only thing still missing before this could be merged, is testing the operator support (last, trivial WIP commit).

Do CI checks include any kind of operator testing, or is that something that needs to be done completely manually?

According to Ukri, testing operator support is a major undertaking / pain (especially when one does it for the first time), so I was wondering could it be dropped from this PR, or would somebody more familiar with it be willing to review / test the commit that tries to add operator support? :-)

@mythi
Copy link
Contributor

mythi commented May 4, 2021

so I was wondering could it be dropped from this PR,

It's OK to drop from this PR.

@eero-t eero-t force-pushed the gpu-plugin-monitoring-option branch from b1a1949 to ca423b1 Compare May 5, 2021 11:16
@eero-t eero-t changed the title RFC: add "-enable-monitoring" option to GPU plugin + testing for its options Add "-enable-monitoring" option to GPU plugin + testing for its options May 5, 2021
@eero-t
Copy link
Contributor Author

eero-t commented May 5, 2021

Dropped the WIP/operator commit and updated PR description to reflect current state, so that what gets in will be properly documented.

rojkov
rojkov previously approved these changes May 5, 2021
eero-t added 4 commits May 5, 2021 17:08
To reduce scan() function complexity before adding more functionality
to it.
To help in:
* adding more CLI options in next and later commits, and
* to replace magic newDevicePlugin() input parameters with
  explicitly named one(s)
Make "i915_monitoring" resource (granting access to all GPUs) optional
so that it can be enabled only when it's needed.
Tests plugin scan results in setups having none, one and multiple
eligible GPU devices, with and without SRIOV enabled, with two
different options values.

This does not cover verifying number of devices added under
"i915_monitoring" resource as that would be much larger change.
@eero-t eero-t force-pushed the gpu-plugin-monitoring-option branch from 0877f32 to 57c8d76 Compare May 5, 2021 14:09
@eero-t
Copy link
Contributor Author

eero-t commented May 5, 2021

I fixed the conflicts with main updates, but the result with previous PR changes to GPU plugin scan() function pushed it over the CI lint complexity limit => I added commit moving GPU device sysfs compatibility checks to its own function.

Do you want a separate PR for that?

Copy link
Contributor

@mythi mythi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to split to another PR

Copy link
Contributor

@uniemimu uniemimu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@uniemimu uniemimu merged commit de96a4b into intel:main May 6, 2021
@eero-t eero-t deleted the gpu-plugin-monitoring-option branch May 6, 2021 08:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants