add nvidia-gpu-enhancement.md pr #30756

Closed · wants to merge 10 commits into base: master

Conversation

@Hui-Zhi (Member) commented Aug 17, 2016

This PR is for NVIDIA GPU capabilities enhancement.

@bgrant0607 assigned erictune and unassigned bgrant0607 Aug 18, 2016

@bgrant0607 (Member) commented Aug 18, 2016

@erictune Would you like to review this, or suggest someone else?

@Hui-Zhi (Member) commented Aug 18, 2016

@dchen1107 @vishh @erictune This is the PR we discussed in the sig-node meeting; I would appreciate it if you could help review it.

@vishh (Member) commented Aug 18, 2016

@Hui-Zhi Most of us are currently occupied with the k8s v1.4 release. Give this PR a week or two for us to review it. Apologies for the delay.

@Hui-Zhi (Member) commented Aug 19, 2016

@vishh That's fine with me; I'll also take a look at how I can help with the v1.4 release.

@Hui-Zhi (Member) commented Aug 22, 2016

@erictune I added some details in the "Features" section; I'd appreciate it if you could let me know which parts are still missing or unclear. Thanks in advance.

Hui-Zhi added some commits Aug 22, 2016

@erictune (Member) commented Aug 24, 2016

Thanks for the improvements. The one thing I am not clear on is what happens when a GPU is unassigned and then reassigned.

When a GPU program runs on a GPU, it loads state into the GPU's memory, registers, and so on. If a GPU is first assigned to Pod A, and then to Pod B, then, ideally, no information from Pod A's computation should leak to Pod B. How can this property be guaranteed? I see that there is an nvidia-smi --gpu-reset command. Can that help?

@Hui-Zhi (Member) commented Aug 25, 2016

@erictune I think information from Pod A leaking to Pod B is a security issue, and it should be handled inside the GPU mechanism, not in the kubelet. There is a paper about it (covering both memory and register leakage):
https://arxiv.org/pdf/1305.7383.pdf

Regarding the "nvidia-smi --gpu-reset" command: it can reset all of the GPU hardware state, but I don't recommend we use it, because it can also clear errors or other useful information.

@erictune (Member) commented Aug 26, 2016

how does the "GPU mechanism" know that a GPU has been reallocated?

@Hui-Zhi (Member) commented Aug 26, 2016

The GPU isolates memory and registers between processes: once a GPU has been reallocated, it uses different memory, registers, and clock cycles for the new process in Pod B. It's like closing one game and starting another. So the GPU itself only needs to ensure that different processes can run on it successfully in sequence (we do not need to put effort into this). What we want to do in the kubelet is isolate the GPU via the devices cgroup and expose the NVIDIA GPU device files to the container, so the application can use the NVIDIA GPU from inside the container.
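
To make that intent concrete, here is a minimal sketch of exposing assigned GPUs to a container with identical host and container paths. The `deviceMapping` struct is illustrative (it mirrors the shape container runtimes use when whitelisting a device in the devices cgroup); this is not the actual kubelet code:

```go
// devices.go: illustrative sketch of building device mappings for a container.
package main

import "fmt"

// deviceMapping mirrors what a container runtime expects when whitelisting
// a device node in the container's devices cgroup.
type deviceMapping struct {
	PathOnHost        string
	PathInContainer   string
	CgroupPermissions string // "rwm" = read, write, mknod
}

// gpuDeviceMappings exposes each assigned GPU at the same path inside the
// container as on the host, e.g. /dev/nvidia1 -> /dev/nvidia1.
func gpuDeviceMappings(assigned []string) []deviceMapping {
	mappings := make([]deviceMapping, 0, len(assigned))
	for _, dev := range assigned {
		mappings = append(mappings, deviceMapping{
			PathOnHost:        dev,
			PathInContainer:   dev,
			CgroupPermissions: "rwm",
		})
	}
	return mappings
}

func main() {
	for _, m := range gpuDeviceMappings([]string{"/dev/nvidia1", "/dev/nvidia2"}) {
		fmt.Printf("%s -> %s (%s)\n", m.PathOnHost, m.PathInContainer, m.CgroupPermissions)
	}
}
```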

@erictune (Member) commented Aug 26, 2016

What about health checking a GPU? Sometimes GPUs break. Could we have a way to check that a GPU is working before allocating it to a Pod? One idea would be to run a particular image in a container with a GPU allocated to it, which can do a quick check of the features. If the GPU fails the health check, we could put it in a Broken state. As a side effect, the health check program would ensure the GPU memory really was clear.

@Hui-Zhi @vishh what do you think?

@binarybana commented Aug 26, 2016

Speaking as a large GPU user, I'd be interested in seeing the support described here and implemented in #28216 merged as soon as possible, with nice-to-haves such as isolation and health checking decided and implemented later down the line (even if they necessitate changes to alpha APIs). The current workarounds for not having this basic functionality, such as running pods in privileged mode, are even worse on the security front.

@Hui-Zhi (Member) commented Aug 27, 2016

@erictune A health check would be great, but that needs NVML. If I used NVML to discover GPUs instead of using a regex (the current approach in #28216) to match the file names (nvidia0, nvidia1, ...) under "/dev/", I could check GPU status while the kubelet discovers GPUs. But in this PR I just want to add the NVIDIA GPU feature without requiring NVML. For dedicated GPUs, we don't have to know what's inside the GPU, such as cores, memory size, family, or even status (only the broken status; the free and in-use statuses need to be handled by the kubelet according to the dedicated GPU allocation).

Sure, we could put the health check (and the NVML libraries) in a container, but the container image would be large, so I don't think this is necessary in this version.

If this feature is merged, I think we can get some feedback from users, like @binarybana, and then decide what to enable next: maybe scheduling GPU cores/memory, or comparing CUDA versions and card families to ensure the container image can run without compatibility issues. Speaking of which, a card health check can't guarantee the user's container will run successfully, because there are still CUDA version and card family compatibility issues.
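
A minimal sketch of the regex-based discovery described above (the function name and layout are hypothetical; the actual implementation lives in #28216):

```go
// discover.go: illustrative sketch of discovering NVIDIA GPU device files.
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"regexp"
)

// nvidiaDeviceRE matches per-card device nodes such as nvidia0, nvidia1, ...
var nvidiaDeviceRE = regexp.MustCompile(`^nvidia[0-9]+$`)

// discoverGPUs lists entries under devDir whose names look like NVIDIA GPU devices.
func discoverGPUs(devDir string) ([]string, error) {
	entries, err := os.ReadDir(devDir)
	if err != nil {
		return nil, err
	}
	var gpus []string
	for _, e := range entries {
		if nvidiaDeviceRE.MatchString(e.Name()) {
			gpus = append(gpus, filepath.Join(devDir, e.Name()))
		}
	}
	return gpus, nil
}

func main() {
	gpus, err := discoverGPUs("/dev")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("found GPUs:", gpus)
}
```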

@erictune (Member) commented Aug 28, 2016

Sure, we can move forward. It is still alpha.

/lgtm

@k8s-bot commented Aug 29, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message will repeat several times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

@erictune (Member) commented Aug 29, 2016

ok to test

@erictune (Member) commented Aug 29, 2016

add to whitelist

@vishh (Member) commented Aug 30, 2016

> What about health checking a GPU? Sometimes GPUs break. Could we have a way to check that a GPU is working before allocating it to a Pod? One idea would be to run a particular image in a container with a GPU allocated to it, which can do a quick check of the features. If the GPU fails the health check, we could put it in a Broken state. As a side effect, the health check program would ensure the GPU memory really was clear.

@erictune @Hui-Zhi We do not have a concept of hardware health checks yet in the kubelet. Even if we did, how far would we go with the health checks? Trivial health checks can be performed continuously in the background; for example, the kubelet could track the temperature and evict pods assigned to an overclocked GPU. The idea of running a pod to validate a GPU might be possible as long as the startup latency for that pod is very small (<1s). Otherwise, we are better off letting hardware validation be handled by external tooling.
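
A minimal sketch of the kind of background temperature probe mentioned above, shelling out to nvidia-smi (the query flags are standard nvidia-smi options; the threshold and eviction message are illustrative, not kubelet behavior):

```go
// gputemp.go: illustrative background temperature probe for NVIDIA GPUs.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strconv"
	"strings"
)

// gpuTemperatures returns per-GPU temperatures in degrees Celsius, keyed by GPU index.
func gpuTemperatures() (map[int]int, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=index,temperature.gpu",
		"--format=csv,noheader").Output()
	if err != nil {
		return nil, err
	}
	temps := make(map[int]int)
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Split(line, ",")
		if len(fields) != 2 {
			continue
		}
		idx, err1 := strconv.Atoi(strings.TrimSpace(fields[0]))
		t, err2 := strconv.Atoi(strings.TrimSpace(fields[1]))
		if err1 == nil && err2 == nil {
			temps[idx] = t
		}
	}
	return temps, nil
}

func main() {
	const maxTempC = 90 // illustrative threshold
	temps, err := gpuTemperatures()
	if err != nil {
		log.Fatal(err)
	}
	for idx, t := range temps {
		if t > maxTempC {
			fmt.Printf("GPU %d is running hot (%d C); candidate for eviction\n", idx, t)
		}
	}
}
```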

@Hui-Zhi (Member) commented Nov 15, 2016

@vishh BTW, if we want to schedule per-core or per-memory resources inside a GPU, the GPU card would need to be shareable by multiple pods; for example, container1 in pod1 could use 20% of the cores in /dev/nvidia1, and container2 in pod2 could use 80% of the cores in /dev/nvidia1. We can discuss this in another proposal after this one is implemented.

@k8s-ci-robot commented Nov 17, 2016

Jenkins verification failed for commit 5a34338. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@Hui-Zhi (Member) commented Nov 17, 2016

@vishh Do you have any other comments about this?

@Hui-Zhi (Member) commented Nov 21, 2016

@davidopp I think this PR is ready to be merged; do you agree?

@duglin commented Nov 21, 2016

ping @vishh - any more comments on this one?

@Hui-Zhi (Member) commented Nov 24, 2016

@davidopp Do you think this is good enough to be merged?

@davidopp (Member) commented Nov 24, 2016

@vishh is out for a while, and I think this proposal is missing some important details, but I think it's OK to merge it as is, so I'll LGTM it.

@davidopp added the lgtm label Nov 24, 2016

@Hui-Zhi (Member) commented Nov 24, 2016

@davidopp Thanks for your reply. Do we need to wait for @vishh to be back, or can you merge it yourself?

@davidopp (Member) commented Nov 24, 2016

I already marked it LGTM

@Hui-Zhi (Member) commented Nov 24, 2016

@davidopp Thank you! :)

@Hui-Zhi (Member) commented Nov 28, 2016

@ddysher Any comments are welcome.

2.1 The kubelet needs to find free GPUs from the current ```GPUDevice``` list and the ```ContainerGPUMapping``` list and assign them to the container. The device paths on the host and in the container are the same: if we assign ```/dev/nvidia1``` and ```/dev/nvidia2``` to a container, then inside the container we see ```/dev/nvidia1``` and ```/dev/nvidia2```, exactly as on the host.
2.2 Once a container dies or is killed, the GPUs attached to it must be freed. If the container is restarted, the attached GPUs are freed and reassigned; if a pod is evicted, all GPUs assigned to the containers in that pod are released immediately (a sketch of this bookkeeping follows the exchange below).

@vishh (Member) commented Nov 29, 2016

what do you mean by "freed & reassigned"?

@Hui-Zhi (Member) commented Nov 29, 2016

I mean detach from the container and reattach to it.
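
A minimal sketch of the allocation and release bookkeeping described in 2.1/2.2 above (the ```GPUDevice``` and ```ContainerGPUMapping``` names come from the proposal; the function names and shapes here are illustrative, not the actual kubelet code):

```go
// gpualloc.go: illustrative bookkeeping for dedicated GPU assignment.
package main

import (
	"fmt"
	"log"
)

// GPUDevice is a device node on the host, e.g. /dev/nvidia1.
type GPUDevice struct {
	Path string
}

// ContainerGPUMapping tracks which GPUs are currently assigned to which container.
type ContainerGPUMapping map[string][]GPUDevice // containerID -> devices

// allocate picks n free GPUs (present in devices but not in any mapping entry)
// and records them against the container.
func allocate(devices []GPUDevice, m ContainerGPUMapping, containerID string, n int) ([]GPUDevice, error) {
	inUse := make(map[string]bool)
	for _, devs := range m {
		for _, d := range devs {
			inUse[d.Path] = true
		}
	}
	var free []GPUDevice
	for _, d := range devices {
		if !inUse[d.Path] {
			free = append(free, d)
		}
	}
	if len(free) < n {
		return nil, fmt.Errorf("want %d GPUs, only %d free", n, len(free))
	}
	m[containerID] = free[:n]
	return free[:n], nil
}

// release frees all GPUs held by a container, e.g. when it dies or is evicted.
func release(m ContainerGPUMapping, containerID string) {
	delete(m, containerID)
}

func main() {
	devices := []GPUDevice{{"/dev/nvidia0"}, {"/dev/nvidia1"}, {"/dev/nvidia2"}}
	mapping := ContainerGPUMapping{}
	got, err := allocate(devices, mapping, "container-a", 2)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("assigned:", got)
	release(mapping, "container-a") // container died; its GPUs become free again
}
```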

@vishh (Member) commented Nov 29, 2016

The implementation plan needs some more improvements. I don't see the point of this proposal actually. This proposal could actually be an issue instead. In any case, let's not be slowed down by this PR and instead focus our energy on the implementation.

@k8s-merge-robot commented Dec 1, 2016

Adding label:do-not-merge because PR changes docs prohibited to auto merge
See http://kubernetes.io/editdocs/ for information about editing docs

@Hui-Zhi (Member) commented Dec 29, 2016

keep-open

@k8s-merge-robot commented Jan 24, 2017

[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @brendandburns
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@Hui-Zhi (Member) commented Jan 24, 2017

@k8s-merge-robot commented Feb 23, 2017

This PR hasn't been active in 30 days. It will be closed in 59 days (Apr 24, 2017).

cc @Hui-Zhi @brendandburns @davidopp @smarterclayton @vishh

You can add the 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days

@Hui-Zhi closed this Mar 1, 2017
