Add monitoring support for hardware accelerators #55188
// unit: bytes
MemoryUsed uint64 `json:"memory_used"`

// Percent of time over the past sample period during which
Please define what the sampling period is. It is 10 seconds, right?
Done.
This PR LGTM. I'd recommend making the cAdvisor update in a separate PR though.
292c044 to d5dc8bf
d5dc8bf to 6cfe359
@mindprince @vishh I can help add GPU info to heapster based on this. BTW, nice work~
Also update golang.org/x/sys because of google/cadvisor#1786
assert.Contains() checks if its second argument (which is supposed to be a single element) is contained in its first argument (which is supposed to be a slice/map etc.) The third and following arguments are supposed to be message and args for the output in case of failure. Because of this bad form, a failure was hidden: the system container is named "misc", not "system".
6cfe359 to 7a68e8a
7a68e8a to 9c38abd
/assign @dchen1107
@mindprince One thing for my clarification: will this PR show the GPU metrics on the following endpoint: http://KUBELET-IP:10255/stats/summary ?
@abdasgupta Yes.
/cc @dashpole @Random-Liu @yguo0905
LGTM
@@ -46,7 +46,7 @@ func TestSummaryProvider(t *testing.T) {
 	node = &v1.Node{ObjectMeta: metav1.ObjectMeta{Name: "test-node"}}
 	nodeConfig = cm.NodeConfig{
 		RuntimeCgroupsName: "/runtime",
-		SystemCgroupsName: "/system",
+		SystemCgroupsName: "/misc",
Just curious why do we need this change?
I explain this here: 238b4a0
Fix TestSummaryProvider.
assert.Contains() checks if its second argument (which is supposed to be
a single element) is contained in its first argument (which is supposed
to be a slice/map etc.) The third and following arguments are supposed
to be message and args for the output in case of failure.
Because of this bad form, a failure was hidden: the system container is
named "misc", not "system".
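To make the pitfall described in that commit message concrete, here is a small sketch. `containsName` is a hypothetical stand-in that only mimics the argument shape of `assert.Contains` (collection, single element, then message arguments); it is not the testify implementation, and the container names other than "misc" and "system" are illustrative.

```go
package main

import "fmt"

// containsName mimics the argument shape of testify's assert.Contains:
// the collection first, one element second, then optional failure-message
// arguments. Hypothetical stand-in, not the testify implementation.
func containsName(names []string, want string, msgAndArgs ...interface{}) bool {
	for _, n := range names {
		if n == want {
			return true
		}
	}
	// msgAndArgs would only be used to format the failure output.
	return false
}

// containsAll (hypothetical helper) checks each expected name on its own,
// so none can be silently swallowed as a message argument.
func containsAll(names []string, wants ...string) (missing []string) {
	for _, w := range wants {
		if !containsName(names, w) {
			missing = append(missing, w)
		}
	}
	return missing
}

func main() {
	// Illustrative container names; the system container is "misc".
	names := []string{"kubelet", "runtime", "misc"}

	// Bad form: only "kubelet" is checked. "runtime" and "system" are
	// treated as message arguments, so the missing "system" goes unnoticed.
	fmt.Println(containsName(names, "kubelet", "runtime", "system")) // true

	// One check per element exposes the mismatch.
	fmt.Println(containsAll(names, "kubelet", "runtime", "system")) // [system]
}
```

The first call passes even though "system" is absent, which is how the wrong cgroup name survived in the test until the assertions were split up.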
I only reviewed the last commit: Expose accelerator metrics in the summary API. /lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dchen1107, mindprince
Associated issue: 369
/retest
Is anything wrong? The submit queue seems to be pending.
Automatic merge from submit-queue (batch tested with PRs 55798, 49579, 54862, 55188, 51990). If you want to cherry-pick this change to another branch, please follow the instructions here.
How was this merged without a proposal merged to kubernetes/community?
// Percent of time over the past sample period (10s) during which
// the accelerator was actively processing.
DutyCycle uint64 `json:"duty_cycle"`
This is not consistent with our API conventions - use of snake_case is not allowed in public APIs.
// Total accelerator memory.
// unit: bytes
MemoryTotal uint64 `json:"memory_total"`
This is not consistent with our API conventions - use of snake_case is not allowed in public APIs.
// Total accelerator memory allocated.
// unit: bytes
MemoryUsed uint64 `json:"memory_used"`
This is not consistent with our API conventions - use of snake_case is not allowed in public APIs.
Looks like there are some API errors in this PR. @mindprince please open a new PR for 1.9 that brings the API in line with our API conventions.
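As a rough illustration of what the requested change would look like, here is a sketch of the same three fields with camelCase JSON names. The exact tag names are an assumption derived by camel-casing the snake_case tags flagged above; the conversation does not show the final names, and `encode` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// acceleratorStats repeats the three fields flagged in review.
// ASSUMPTION: the camelCase tag names below are illustrative; the
// conversation does not show the names chosen in the follow-up PR.
type acceleratorStats struct {
	// Total accelerator memory, in bytes.
	MemoryTotal uint64 `json:"memoryTotal"`
	// Total accelerator memory allocated, in bytes.
	MemoryUsed uint64 `json:"memoryUsed"`
	// Percent of time over the past sample period (10s) during which
	// the accelerator was actively processing.
	DutyCycle uint64 `json:"dutyCycle"`
}

// encode (hypothetical helper) serializes the stats to a JSON string.
func encode(s acceleratorStats) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

func main() {
	out, err := encode(acceleratorStats{MemoryTotal: 1 << 30, MemoryUsed: 1 << 29, DutyCycle: 42})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"memoryTotal":1073741824,"memoryUsed":536870912,"dutyCycle":42}
}
```

Only the struct tags change; the Go field names and types stay as they are in the diff, so callers inside the kubelet would not need to be touched.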
@kubernetes/sig-testing-feature-requests I think I need to make sure
@smarterclayton can we build some CI tooling to enforce these semantics?
Automatic merge from submit-queue (batch tested with PRs 55908, 55829, 55293, 55653, 55665). If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.
Fix accelerator stats API to follow API conventions.
Introduced in #55188
**Release note**:
```release-note
None
```
Automatic merge from submit-queue.
Add design proposal for accelerator monitoring.
For kubernetes/enhancements#369, google/cadvisor#1762 and kubernetes/kubernetes#55188
Conversion to markdown from Google doc: https://docs.google.com/document/d/13O4HNrB7QFpKQcLcJm28R-QBH3Xo0VmJ7w_Pkvmsf68/edit (accessible to anyone who is a member of the kubernetes-dev@ or kubernetes-users@ Google Groups).
There was a lot of discussion on the doc, which is hard to recreate here now.
Currently only NVIDIA GPU monitoring is implemented.
Feature repo issue: kubernetes/enhancements#369
cAdvisor PR: google/cadvisor#1762
/kind feature
/sig node
/sig instrumentation
/area hw-accelerators
Release note: