Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "i915_monitoring" resource to GPU plugin #622

Merged
merged 1 commit into from
Apr 28, 2021

Conversation

eero-t
Copy link
Contributor

@eero-t eero-t commented Apr 20, 2021

Added "i915_monitoring" resource mounts all GPU devices to requesting container.

This is needed to get GPU metrics from the host. Requesting pod doesn't know how many GPUs the node it's assigned to has, so there needs to means to request them all. Only alternative for this would be requesting Privileged mode, which is clearly worse as that would grant access also to all other devices and capabilities.

Checking for expected number of monitoring resources (0/1) is added to tests.

@eero-t eero-t force-pushed the gpu-plugin-i915_monitoring-resource branch from 8d6f3a6 to b07cdc2 Compare April 20, 2021 11:24
@eero-t
Copy link
Contributor Author

eero-t commented Apr 20, 2021

As can be seen from previous CI results, simply adding testing for new GPU plugin resource pushed TestScan() over project "golangci-lint" complexity limit.

Avoiding that requires either:

  • Doing what Mikko suggested for file creation: mythi@479cfdc
  • Lifting the notifier checks to its own function, e.g. validateNotifierScan(), but that requires specifying testCase as its own type
  • Lifting the whole test case running to its own function. That also requires specifying testCase as its own type

I chose last option because moving the file system creation there also (as suggested earlier by Dmitry Rozhkov) allows simplifying the mock file system management, and means that testCase type is defined only few lines above its usage: https://github.com/eero-t/intel-device-plugins-for-kubernetes/blob/gpu-plugin-i915_monitoring-resource/cmd/gpu_plugin/gpu_plugin_test.go#L45

(Also, if testCase run at some later point point grows also too complex, it's then simply to split validateNotifierScan() off it, or still do Mikko's change.)

I've split the commits for easier review, so that "gofmt" is done as separate (last) commit.

EDIT: updated above options list based on Mikko's feedback.

@mythi
Copy link
Contributor

mythi commented Apr 20, 2021

I chose last option because moving

@eero-t FWIW, this version gives gocognit score of 22 while my version gives 16 (out of 31).

@eero-t
Copy link
Contributor Author

eero-t commented Apr 20, 2021

@eero-t FWIW, this version gives gocognit score of 22 while my version gives 16 (out of 31).

@mythi Oh, my assumption was really off. I didn't realize gocognit penalizes those loops that badly (Fedora doesn't have package for that or even gocyclo) => I'll update the options comment

While your suggestion simplifies FS cleanup I still like a bit more using defer for that in separate function, and not needing to "explode" testcase contents to createTestFiles() args. But I don't have strong preference.

What the others say?

@mythi
Copy link
Contributor

mythi commented Apr 21, 2021

While your suggestion simplifies FS cleanup I still like a bit more using defer for that in separate function,
and not needing to "explode" testcase contents to createTestFiles() args.

my version was hacked rather quickly, maybe defer works with t.Run() too so its worth giving a try. one other alternative could be to create root in createTestFiles() and have it return sysfs and devfs paths...

@rojkov
Copy link
Contributor

rojkov commented Apr 21, 2021

my version was hacked rather quickly, maybe defer works with t.Run() too so its worth giving a try.

Yes, defer works for lambda functions too.

@eero-t eero-t force-pushed the gpu-plugin-i915_monitoring-resource branch 2 times, most recently from 2b68cc4 to e8869d4 Compare April 26, 2021 15:55
@eero-t
Copy link
Contributor Author

eero-t commented Apr 26, 2021

I replaced separate test function with a variant of Mikko's earlier suggestion of having separate function just for setting up the sysfs/devfs mock files. Changes from that suggestion were:

  • returning sysfs & devfs paths from the function (as I thought creating the path strings separately on two different levels of code was a bit nasty)
  • Using defer for root removal

rojkov
rojkov previously approved these changes Apr 27, 2021
Copy link
Contributor

@rojkov rojkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mythi
Copy link
Contributor

mythi commented Apr 27, 2021

@eero-t LGTM too but can you squash the commits please

Copy link
Member

@bart0sh bart0sh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: createTestFiles can be done in a separate PR.

@eero-t
Copy link
Contributor Author

eero-t commented Apr 27, 2021

LGTM too but can you squash the commits please

@mythi I'll push rebase with squashed commits. Isn't there any project setting where you could specify that preference, or is this something you only request on per-PR basis?

nit: createTestFiles can be done in a separate PR.

@bart0sh If that is missing, PR would fail golangci-lint complexity check (as mentioned in my second comment).

Which mounts all (Intel) GPU devices to requesting container.

This is needed e.g. to get GPU metrics from the node.  Requesting pod
does not know how many GPUs are on the node it gets assigned to, so
there needs to means to request them all.

(Only alternative for the new resource would be requesting Privileged
mode, which is clearly worse as that would grant pod access also to
all other devices and capabilities.)

This commit also:

* Adds "i915_monitoring" resource testing to: go test -v -run Scan

* Splits GPU plugin tests mock file system setup to a separate
  createTestFiles() function because otherwise TestScan() does not
  pass project's golangci-lint complexity limits
Copy link
Contributor

@uniemimu uniemimu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mythi
Copy link
Contributor

mythi commented Apr 27, 2021

I'll push rebase with squashed commits. Isn't there any project setting where you could specify that preference

@eero-t thanks. I guess my rule of thumb is that the individual commits in a PR would be justified as PRs of their own too but I don't think we're too strict on this. This PR had monitoring+testing for it + gofmt as their own commits but those would be logical to have in a single commit (or PR). createTestFiles could have been a PR of its own and merged before this....

Copy link
Contributor

@mythi mythi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@eero-t
Copy link
Contributor Author

eero-t commented Apr 27, 2021

createTestFiles could have been a PR of its own and merged before this....

Why it would need separate PR instead of just being a separate commit in this PR (like it was earlier)?

@bart0sh
Copy link
Member

bart0sh commented Apr 27, 2021

@eero-t

Why it would need separate PR instead of just being a separate commit in this PR (like it was earlier)?

Because introducing createTestFiles is an independent change.

@mythi mythi merged commit 1e7865e into intel:main Apr 28, 2021
@eero-t
Copy link
Contributor Author

eero-t commented Apr 28, 2021

Because introducing createTestFiles is an independent change.

@bart0sh All commits are supposed to be "independent" to some degree, but I assume you're talking here about the change being independent enough that the commit order can be changed without conflicts.

This doesn't seem to be a documented project policy, and it's applied inconsistently. For example all 4 commits in the previous PR merged from me were independent to that degree: https://github.com/eero-t/intel-device-plugins-for-kubernetes/commits/gpu-plugin-testing-improvements

Are you suggesting that also all of them should have been separate PRs? If not, could you explain more in detail where the line should be drawn, agree with the others, and document the "Good pull request guidelines"?

(Or if there's already a good policy documented elsewhere, point to it.)

That way you can just link the doc if PR seems non-conforming, and my & others' future PRs get less bikeshedding.

@eero-t eero-t deleted the gpu-plugin-i915_monitoring-resource branch April 28, 2021 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants