Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add monitoring support for hardware accelerators #55188

Merged

Conversation

rohitagarwal003
Copy link
Member

Currently only NVIDIA GPU monitoring is implemented.

Feature repo issue: kubernetes/enhancements#369
cAdvisor PR: google/cadvisor#1762

/kind feature
/sig node
/sig instrumentation
/area hw-accelerators

Release note:

Kubelet now exposes metrics for NVIDIA GPUs attached to the containers.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. area/hw-accelerators size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 6, 2017
@rohitagarwal003
Copy link
Member Author

/assign @vishh @yujuhong

// unit: bytes
MemoryUsed uint64 `json:"memory_used"`

// Percent of time over the past sample period during which
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define what the sampling period is ? It is 10 seconds right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@vishh
Copy link
Contributor

vishh commented Nov 7, 2017

This PR LGTM. I'd recommend making the cAdvisor update in a separate PR though.
This PR can have just the summary API changes, ideally with corresponding e2es updates as well.

@rohitagarwal003 rohitagarwal003 force-pushed the accelerator-monitoring branch 2 times, most recently from 292c044 to d5dc8bf Compare November 7, 2017 20:54
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 7, 2017
@andyxning
Copy link
Member

@mindprince @vishh I can help to add GPU info to heapster based on this.

BTW, nice work~

Also update golang.org/x/sys because of google/cadvisor#1786
assert.Contains() checks if its second argument (which is supposed to be
a single element) is contained in its first argument (which is supposed
to be a slice/map etc.) The third and following arguments are supposed
to be message and args for the output in case of failure.

Because of this bad form, a failure was hidden, the system container is
named "misc", not "system".
@rohitagarwal003
Copy link
Member Author

/assign @dchen1107
This is ready to go in.

@abdasgupta
Copy link

abdasgupta commented Nov 13, 2017

@mindprince One thing for my clarification. Will this PR show the GPU metrics on the following endpoint: http://KUBELET-IP:10255/stats/summary ?

@rohitagarwal003
Copy link
Member Author

@abdasgupta Yes.

@rohitagarwal003
Copy link
Member Author

/cc @dashpole @Random-Liu @yguo0905
Adding more people from sig node who have worked with the summary API.

Copy link
Contributor

@yguo0905 yguo0905 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -46,7 +46,7 @@ func TestSummaryProvider(t *testing.T) {
node = &v1.Node{ObjectMeta: metav1.ObjectMeta{Name: "test-node"}}
nodeConfig = cm.NodeConfig{
RuntimeCgroupsName: "/runtime",
SystemCgroupsName: "/system",
SystemCgroupsName: "/misc",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious why do we need this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explain this here: 238b4a0

Fix TestSummaryProvider.

assert.Contains() checks if its second argument (which is supposed to be
a single element) is contained in its first argument (which is supposed
to be a slice/map etc.) The third and following arguments are supposed
to be message and args for the output in case of failure.

Because of this bad form, a failure was hidden, the system container is
named "misc", not "system".

@dchen1107
Copy link
Member

I only reviewed the last commit: Expose accelerator metrics in the summary API.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 15, 2017
@dchen1107
Copy link
Member

/approve

@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, mindprince

Associated issue: 369

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 15, 2017
@rohitagarwal003
Copy link
Member Author

/retest

@andyxning
Copy link
Member

Any thing wrong. Seems submit queue is pending.

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 55798, 49579, 54862, 55188, 51990). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 7791056 into kubernetes:master Nov 16, 2017
@smarterclayton
Copy link
Contributor

How was this merged without a proposal merged to kubernetes/community?


// Percent of time over the past sample period (10s) during which
// the accelerator was actively processing.
DutyCycle uint64 `json:"duty_cycle"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not consistent with our API conventions - use of snake_case is not allowed in public APIs.


// Total accelerator memory.
// unit: bytes
MemoryTotal uint64 `json:"memory_total"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not consistent with our API conventions - use of snake_case is not allowed in public APIs.


// Total accelerator memory allocated.
// unit: bytes
MemoryUsed uint64 `json:"memory_used"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not consistent with our API conventions - use of snake_case is not allowed in public APIs.

@smarterclayton
Copy link
Contributor

Looks like there are some API errors on this - @mindprince please open a new PR for 1.9 that fixes the API to be covered by our API conventions.

@smarterclayton
Copy link
Contributor

@kubernetes/sig-testing-feature-requests I think I need to make sure pkg/kubelet/apis appears under the API changes label (it wasn't flagged here and so I missed it). Can someone point me to where that can be done?

@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Nov 16, 2017
@rohitagarwal003
Copy link
Member Author

@smarterclayton #55908

@vishh
Copy link
Contributor

vishh commented Nov 16, 2017

@smarterclayton can we build some CI tooling to enforce these semantics?
I feel a full design review just for the sake of API schema review is heavy weight.
The design doc was shared as a google doc and can now be committed to k8s/community repo if that will be useful.
However, it is not clear if the entire golang object definition is expected in design docs though.

k8s-github-robot pushed a commit that referenced this pull request Nov 18, 2017
Automatic merge from submit-queue (batch tested with PRs 55908, 55829, 55293, 55653, 55665). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Fix accelerator stats API to follow API conventions.

Introduced in #55188

**Release note**:
```release-note
None
```
k8s-github-robot pushed a commit to kubernetes/community that referenced this pull request Nov 21, 2017
Automatic merge from submit-queue.

Add design proposal for accelerator monitoring.

For kubernetes/enhancements#369, google/cadvisor#1762 and  kubernetes/kubernetes#55188

Conversion to markdown from Google doc: https://docs.google.com/document/d/13O4HNrB7QFpKQcLcJm28R-QBH3Xo0VmJ7w_Pkvmsf68/edit
(accessible to anyone who is a member of kubernetes-dev@ or kubernetes-users@ Google Groups). Lots of discussion on the doc which is hard to recreate here now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/hw-accelerators cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants