
[GSoC] Persistent Device Claims for KubeVirt #254

Open
alicefr opened this issue Feb 1, 2024 · 11 comments

@alicefr
Member

alicefr commented Feb 1, 2024

Title: Persistent Device Claims for KubeVirt

Description

KubeVirt [1] is a Kubernetes extension to deploy Virtual Machines like pods and integrate with the Kubernetes ecosystem.

For handling host devices, KubeVirt depends on the Kubernetes device plugin framework [2], which is used for scheduling, allocating, and attaching the requested devices and resources to a pod.

One of the limitations of this framework is that the device allocation does not persist while the pod isn't running. This is especially problematic for devices that require significant initialization time, such as FPGAs, or for storage devices, such as NVMe or USB devices, where users may have saved data. Devices registered under the same resource name may be allocated arbitrarily, with no way to identify or request a specific device within the set.

For KubeVirt, the device is released when the VM is shut down or restarted, because the VM pod is deleted or recreated. Hence, when the VM is restarted, it might be assigned a different device than before.

The Dynamic Resource Allocation [3] API provides a solution by introducing resource claims. The claims are independent of the pod lifetime, and they persist until the user deletes them. In this way, we are able to recognize the device that was previously assigned to a VM and preserve its state across restarts.
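To make the contrast concrete, here is a hypothetical sketch (plain Go, not the real DRA API types) of why a claim helps: the allocation is recorded on the claim object, which lives independently of any pod, so each new VM pod that references the same claim gets the same device back until the user deletes the claim. Claim, ClaimStore, Bind and Delete are illustrative names only.

```go
package main

import (
	"errors"
	"fmt"
)

// Claim is a simplified, hypothetical stand-in for a DRA ResourceClaim: it
// records which concrete device was allocated and lives until it is deleted.
type Claim struct {
	Name      string
	Allocated string // PCI address, empty until allocated
}

// ClaimStore models the API server: claims are stored independently of pods.
type ClaimStore struct {
	claims map[string]*Claim
	inUse  map[string]bool // device -> already allocated to some claim
}

func NewClaimStore() *ClaimStore {
	return &ClaimStore{claims: map[string]*Claim{}, inUse: map[string]bool{}}
}

// Bind returns the claim's device, allocating a free one only the first time.
func (s *ClaimStore) Bind(name string, devices []string) (string, error) {
	c, ok := s.claims[name]
	if !ok {
		c = &Claim{Name: name}
		s.claims[name] = c
	}
	if c.Allocated != "" {
		return c.Allocated, nil // same device as on the previous start
	}
	for _, dev := range devices {
		if !s.inUse[dev] {
			c.Allocated = dev
			s.inUse[dev] = true
			return dev, nil
		}
	}
	return "", errors.New("no free device")
}

// Delete releases the device; only an explicit user action does this.
func (s *ClaimStore) Delete(name string) {
	if c, ok := s.claims[name]; ok {
		delete(s.inUse, c.Allocated)
		delete(s.claims, name)
	}
}

func main() {
	devices := []string{"0000:00:05.0", "0000:00:06.0"}
	s := NewClaimStore()
	// Simulate three VM starts: each new pod binds the same claim again.
	for start := 1; start <= 3; start++ {
		dev, err := s.Bind("vm-a-nvme", devices)
		if err != nil {
			panic(err)
		}
		fmt.Printf("start %d of vm-a got %s\n", start, dev)
	}
}
```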

Expected Outcome

The project goal is to design, develop and integrate Resource Claims in KubeVirt for host device allocation. As it is the case with PVCs nowadays, the user should be able to declare and assign a resource claim and/or template to a KubeVirt virtual machine.
To support this new Kubernetes API, the project must research and propose ways to extend KubeVirt.
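As a starting point for the API discussion, one possible shape is sketched below in Go. Everything here is an assumption: the field names only loosely follow the pod-level resourceClaims field of the alpha DRA API at the time of writing (one entry per claim, with a source that names either an existing ResourceClaim or a ResourceClaimTemplate), and HostDevice.ClaimName is a hypothetical field, not part of the current KubeVirt API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ClaimSource: exactly one of the two fields should be set (hypothetical type).
type ClaimSource struct {
	ResourceClaimName         *string `json:"resourceClaimName,omitempty"`
	ResourceClaimTemplateName *string `json:"resourceClaimTemplateName,omitempty"`
}

// VMResourceClaim is a hypothetical VM-level entry, analogous to a volume
// that references a PVC.
type VMResourceClaim struct {
	Name   string      `json:"name"`
	Source ClaimSource `json:"source"`
}

// HostDevice is a simplified, hypothetical host device entry that points at
// a claim instead of a device-plugin resource name.
type HostDevice struct {
	Name      string `json:"name"`
	ClaimName string `json:"claimName"`
}

type VMSpec struct {
	ResourceClaims []VMResourceClaim `json:"resourceClaims"`
	HostDevices    []HostDevice      `json:"hostDevices"`
}

// validate checks that each claim has exactly one source and that every host
// device references a declared claim.
func validate(spec VMSpec) error {
	declared := map[string]bool{}
	for _, c := range spec.ResourceClaims {
		if (c.Source.ResourceClaimName == nil) == (c.Source.ResourceClaimTemplateName == nil) {
			return fmt.Errorf("claim %q: set exactly one source", c.Name)
		}
		declared[c.Name] = true
	}
	for _, d := range spec.HostDevices {
		if !declared[d.ClaimName] {
			return fmt.Errorf("host device %q: unknown claim %q", d.Name, d.ClaimName)
		}
	}
	return nil
}

func main() {
	claim := "my-nvme-claim"
	spec := VMSpec{
		ResourceClaims: []VMResourceClaim{{Name: "nvme", Source: ClaimSource{ResourceClaimName: &claim}}},
		HostDevices:    []HostDevice{{Name: "disk1", ClaimName: "nvme"}},
	}
	if err := validate(spec); err != nil {
		panic(err)
	}
	out, err := json.MarshalIndent(spec, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keeping the claim declaration separate from the device that consumes it mirrors how volumes and disks are already split in a KubeVirt VM spec.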

A successful project will implement a POC for an example device that is already available in the KubeVirt infrastructure. The outcome of this project will provide a base for future integration of DRA, once the API reaches maturity.

Project requirements

Project size: 350 hours
Difficulty: Hard
Required skills: Kubernetes knowledge and GoLang programming skills
Desirable skills: Virtualization
Mentors: Alice Frosi afrosi@redhat.com, Victor Toso de Carvalho vtosodec@redhat.com, Luboslav Pivarc lpivarc@redhat.com

How and where to find help

First, try checking the KubeVirt documentation [4]; we cover many topics and you might already find some of the answers there. If something is unclear, feel free to open an issue or a PR. This is already a great way to get familiar with the process.
For questions related to KubeVirt and not strictly to the GSoC program, try to use the Slack channel [5] and the issues [6] as much as possible. Your question can be useful for other people, and the mentors may have limited time. It is also important to interact with the community as much as possible.

If something doesn't work, try to document the steps and how to reproduce the issue as clearly as possible. The more information you provide, the easier it is for us to help you. If you open an issue in KubeVirt, the issue template already guides you on the kind of information we generally need.

How to start

  1. Install KubeVirt and deploy KubeVirt VMs following the getting started guide [7]
  2. [Optional] Look for good-first issues [8] and try to solve one to get familiar with the project (if there isn't a PR linked to it, feel free to pick it up)
  3. Understand how device plugins and DRA work and their differences. Try to deploy a pod using a device plugin [9] and using DRA [10]
  4. Understand how device plugins work in KubeVirt. Try to deploy a VM using a device (see point 5) [11]
  5. Understand the device assignment problem. Try to start kubevirtci [12] with 2 emulated NVMe devices that are allocated under the same resource name [13]. Start kubevirtci with export KUBEVIRT_PROVIDER_EXTRA_ARGS="--nvme 1G --nvme 1G"
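The reason the two emulated controllers from step 5 are indistinguishable to a VM can be sketched as follows, under stated assumptions: the vendor:device ID 1b36:0010 (which I believe is what recent QEMU versions report for the emulated NVMe controller; check with lspci -nn on the kubevirtci node), the PCI addresses and the resource name are illustrative values. The point is that a single selector maps both devices into one pool, and the device plugin only advertises a count for that pool.

```go
package main

import "fmt"

// PCIDevice is a simplified view of a host PCI function.
type PCIDevice struct {
	Address      string // e.g. "0000:00:05.0"
	VendorDevice string // "vendor:device" as shown by lspci -nn
}

// groupBySelector builds per-resource pools by mapping a vendor:device
// selector to one resource name, as a permitted-host-devices style
// configuration does.
func groupBySelector(devs []PCIDevice, selectors map[string]string) map[string][]string {
	pools := map[string][]string{}
	for _, d := range devs {
		if res, ok := selectors[d.VendorDevice]; ok {
			pools[res] = append(pools[res], d.Address)
		}
	}
	return pools
}

func main() {
	// Assumed values: both emulated controllers report the same ID.
	devs := []PCIDevice{
		{Address: "0000:00:05.0", VendorDevice: "1b36:0010"},
		{Address: "0000:00:06.0", VendorDevice: "1b36:0010"},
	}
	selectors := map[string]string{"1b36:0010": "devices.kubevirt.io/nvme"}
	for res, addrs := range groupBySelector(devs, selectors) {
		// Only the count is advertised for res; a pod requesting one unit
		// cannot say which address it wants.
		fmt.Printf("%s: %d devices %v\n", res, len(addrs), addrs)
	}
}
```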

How to submit the proposal

The preferred way is to create a Google Doc and share it with the mentors (via Slack or email). If, for any reason, a Google Doc doesn't work for you, please share your proposal by email. Early submissions have higher chances, as they can be reviewed over multiple iterations and further improved.

What the proposal should contain

The design and your strategy for solving the challenge should be concisely explained in the proposal. The components you anticipate touching and an example of an API are good starting points. The updates or APIs are merely a draft of what the candidate hopes to extend and change rather than being final. The details and possible issues can be discussed during the project with the mentors, who can help to refine the proposal.

It is not necessary to provide an introduction to Kubernetes or KubeVirt; instead, candidates should demonstrate their familiarity with KubeVirt by describing in detail how they intend to approach the task.

Mentors may find it helpful to have a schematic drawing of the flows and examples to better grasp the solution. They will select a couple of good proposals at the end of the selection period and this will be followed by an interview with the candidate.

The proposal can be free-form, or you can take inspiration from the KubeVirt design proposals [14] and template [15]. However, it should contain a draft schedule of the project phases, with some extra time planned to overcome unexpected difficulties.

Links

[1] https://github.com/kubevirt/kubevirt
[2] https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
[3] https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
[4] https://github.com/kubevirt/kubevirt/tree/main/docs
[5] https://kubernetes.slack.com/archives/C0163DT0R8X
[6] https://github.com/kubevirt/kubevirt/issues
[7] https://github.com/kubevirt/kubevirt/blob/main/docs/getting-started.md
[8] https://github.com/kubevirt/kubevirt/issues?q=is%3Aopen+is%3Aissue+label%3Agood-first-issue
[9] https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#example-pod
[10] https://gist.github.com/alicefr/592591b18a99cf126dd82110d8fa74ea
[11] https://kubevirt.io/user-guide/virtual_machines/host-devices/
[12] https://github.com/kubevirt/kubevirtci
[13] https://kubevirt.io/user-guide/virtual_machines/host-devices/#nvme-pci-passthrough
[14] https://github.com/kubevirt/community/tree/main/design-proposals
[15] https://github.com/kubevirt/community/blob/main/design-proposals/proposal-template.md

Other good resources to check:

@alicefr
Member Author

alicefr commented Feb 1, 2024

@victortoso
Member

Hi Alice, nice!

As it is the case with PVCs nowadays

This is a good reference to understand what is more or less expected. I'd add links for it, such as the Kubernetes PVC documentation and the KubeVirt user guide.

At the end of the description, we could add a reference to kubernetes/enhancements#3063 to highlight the ongoing discussion about DRA.

A successful project will implement a POC for an example device that is already available in Kubevirt infrastructure.

I think this is very reasonable indeed!

For the "How to start" section, I'd recommend this KubeCon talk, as it provides a lot of insight into how DRA works, how to implement a driver, etc.

@alicefr
Member Author

alicefr commented Mar 8, 2024

cc @sarthaksarthak9

Proposal : kubevirt/community#254

The proposal mentions two potential solutions:

  • Using the Dynamic Resource Allocation (DRA) API
  • Implementing custom Persistent Device Claims (PDCs)

The detailed plan focuses on developing PDCs. I understand this might be because DRA is still under development. However, the initial proposal also mentioned DRA as a possibility.

Could you please give some more clarity over approaching the implementation ?

PDCs are part of DRA. If you look at DRA, you can see that you can define claim templates (ResourceClaimTemplates). The goal of the project is to have a POC where we can use this new API with one of the already supported device types, for example PCI devices for passthrough. As I mentioned in the description, you could use emulated NVMe devices.

In this way, the users who want to use a PCI device could also create a PDC based on the new template. Is this clearer?
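One detail worth keeping in mind while drafting the API: as far as I understand the current alpha API, a ResourceClaim that Kubernetes generates from a ResourceClaimTemplate for a pod is owned by that pod and deleted with it, so on its own it would not survive a VM restart. A possible approach, similar in spirit to dataVolumeTemplates, is for KubeVirt to create the claim from the template with the VirtualMachine as owner and a stable name. The Go sketch below illustrates only that idea; the types, owner fields and naming scheme are hypothetical.

```go
package main

import "fmt"

// Hypothetical simplified objects; real ones would be Kubernetes API types.
type ClaimTemplate struct {
	Name        string
	DeviceClass string
}

type Claim struct {
	Name        string
	DeviceClass string
	OwnerKind   string // "VirtualMachine", so the claim outlives the pod
	OwnerName   string
}

// claimNameFor derives a stable claim name from the VM and the claim entry,
// so every restart of the VM finds the same claim.
func claimNameFor(vm, entry string) string {
	return vm + "-" + entry
}

// ensureClaim creates the claim from the template only if it does not exist.
func ensureClaim(store map[string]Claim, vm, entry string, tmpl ClaimTemplate) Claim {
	name := claimNameFor(vm, entry)
	if c, ok := store[name]; ok {
		return c
	}
	c := Claim{Name: name, DeviceClass: tmpl.DeviceClass, OwnerKind: "VirtualMachine", OwnerName: vm}
	store[name] = c
	return c
}

func main() {
	store := map[string]Claim{}
	tmpl := ClaimTemplate{Name: "nvme-template", DeviceClass: "example-nvme"}
	first := ensureClaim(store, "vm-a", "nvme", tmpl)
	again := ensureClaim(store, "vm-a", "nvme", tmpl) // e.g. after a VM restart
	fmt.Println(first.Name, again.Name, len(store))
}
```

Because the claim is owned by the VM rather than the pod, deleting and recreating the virt-launcher pod leaves it (and its allocation) in place; deleting the VM would garbage-collect it.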

@jgrady15

jgrady15 commented Mar 24, 2024

Hi @alicefr, I'm a master's student at Georgia Tech and I'm interested in potentially joining KubeVirt on this particular project for GSoC 2024.

I am building on the first potential solution for this issue:

Proposal : #254

The proposal mentions two potential solutions:

Using the Dynamic Resource Allocation (DRA) API
Implementing custom Persistent Device Claims (PDCs)
The detailed plan focuses on developing PDCs. I understand this might be because DRA is still under development. However, the initial proposal also mentioned DRA as a possibility.

Could you please give some more clarity over approaching the implementation ?

I have been reading up on this bit of documentation that allows us to specify a CRD for our ResourceClaim template and store resource definitions and properties inside of a .yaml file for persistence. From my understanding, this CRD will allow us to cache the desired device configuration using relevant metadata. I also think that utilizing custom controllers will allow us to use the metadata within the CRD.

Additionally, the potential use of ControllerRevision may also be helpful in speeding up the initialization process for devices, as this might allow us to embed and serialize/deserialize objects that contain their internal state.

Please let me know if my understanding or logic is flawed somewhere. I would appreciate any feedback on this expanded potential solution. :)

@alicefr
Member Author

alicefr commented Mar 25, 2024

Hi @alicefr, I'm a master's student at Georgia Tech and I'm interested in potentially joining KubeVirt on this particular project for GSoC 2024.

I am building on the first potential solution for this issue:

Proposal : #254
The proposal mentions two potential solutions:
Using the Dynamic Resource Allocation (DRA) API
Implementing custom Persistent Device Claims (PDCs)
The detailed plan focuses on developing PDCs. I understand this might be because DRA is still under development. However, the initial proposal also mentioned DRA as a possibility.
Could you please give some more clarity over approaching the implementation ?

I have been reading up on this bit of documentation that allows us to specify a CRD for our ResourceClaim template and store resource definitions and properties inside of a .yaml file for persistence. From my understanding, this CRD will allow us to cache the desired device configuration using relevant metadata. I also think that utilizing custom controllers will allow us to use the metadata within the CRD.

The CRD can allow you to model the device parameters.

Additionally, the potential use of ControllerRevision may also be helpful in speeding up the initialization process for devices, as this might allow us to embed and serialize/deserialize objects that contain their internal state.

Please have a look at the DRA documentation. The goal is to implement a DRA driver. This will certainly also use Kubernetes controllers to watch the custom resources.
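To illustrate the controller side, here is a deliberately simplified allocation pass in Go. It is not the interface of the k8s.io/dynamic-resource-allocation controller library; NVMeClaimParameters stands in for a hypothetical parameters CRD (here only an optional PCI address pin), and the claim status is reduced to a node and a device. The important property is idempotency: claims that are already allocated keep their device on every pass, which is what gives the VM the same device after a restart.

```go
package main

import (
	"fmt"
	"sort"
)

// NVMeClaimParameters is a hypothetical custom resource used to model device
// parameters; only an optional PCI address pin is shown.
type NVMeClaimParameters struct {
	PCIAddress string
}

type ClaimStatus struct {
	Allocated bool
	Node      string
	Device    string
}

type Claim struct {
	Name   string
	Params NVMeClaimParameters
	Status ClaimStatus
}

// reconcile is one allocation pass: allocated claims keep their device,
// pending claims get a free device that matches their parameters.
func reconcile(claims []*Claim, node string, devices []string) {
	used := map[string]bool{}
	for _, c := range claims {
		if c.Status.Allocated {
			used[c.Status.Device] = true
		}
	}
	// Process a sorted copy so the allocation order is deterministic.
	ordered := append([]*Claim(nil), claims...)
	sort.Slice(ordered, func(i, j int) bool { return ordered[i].Name < ordered[j].Name })
	for _, c := range ordered {
		if c.Status.Allocated {
			continue
		}
		for _, dev := range devices {
			if used[dev] {
				continue
			}
			if c.Params.PCIAddress != "" && c.Params.PCIAddress != dev {
				continue // pinned to a different controller
			}
			c.Status = ClaimStatus{Allocated: true, Node: node, Device: dev}
			used[dev] = true
			break
		}
	}
}

func main() {
	devices := []string{"0000:00:05.0", "0000:00:06.0"}
	a := &Claim{Name: "vm-a-nvme", Params: NVMeClaimParameters{PCIAddress: "0000:00:06.0"}}
	b := &Claim{Name: "vm-b-nvme"}
	claims := []*Claim{b, a}
	reconcile(claims, "node01", devices)
	reconcile(claims, "node01", devices) // a second pass changes nothing
	for _, c := range []*Claim{a, b} {
		fmt.Printf("%s -> %s on %s\n", c.Name, c.Status.Device, c.Status.Node)
	}
}
```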

Please let me know if my understanding or logic is flawed somewhere. I would appreciate any feedback on this expanded potential solution. :)

Once you have a draft of your proposal, you can share it with us and we can review your solution in more detail.

@jgrady15

The CRD can allow you to model the device parameters.

I see, that makes more sense. I watched the KubeCon talk that @victortoso linked a few comments above, and I now have a basic understanding of what DRA is and how we can create a DRA driver to start. The talk mentions the use of CDI; however, after looking through the KubeVirt documentation on host devices, I noticed that KubeVirt uses VFIO for mediated devices. My question is: should I be looking into CDI for this particular project, or should I stick strictly to understanding how the VFIO interface prepares a specific device for device assignment?

Once you have a draft for your proposal you can share it with us and we can review your solution in more details

For drafting my proposal, should I add your email directly to the Google Doc itself?

@alicefr
Member Author

alicefr commented Mar 26, 2024

The CRD can allow you to model the device parameters.

I see that makes more sense. I watched through the KubeCon talk that @victortoso linked a few comments above, and have some form of basic understanding on what DRA is and how we can create a DRA Driver to start. My question is, they mention the use of CDIs, however after looking through the KubeVirt documentation on Host Devices, I noticed that KubeVirt uses VFIO for Mediated Devices. My question is, are CDIs something I should be looking into for this particular project, or should I strictly stick to understanding the VFIO interface on how to prepare a specific device for device assignment?

CDI is a different kind of interface, but you should rely on DRA.
As far as VFIO is concerned, it is used for creating mediated devices, mostly for vGPUs. As I mentioned in the description, please focus only on PCI passthrough with NVMe devices. GPUs are a great use case, but you need to have at least one such device available, whereas for NVMe you can create kubevirtci instances that already have emulated NVMe devices.

The goal of the project is to model a device, and this should serve as an example for more complex ones.

Once you have a draft for your proposal you can share it with us and we can review your solution in more details

For drafting my proposal, should I directly add your email to the google document itself?

As you prefer. We need to be able to comment in the doc

@jgrady15

Hi Alice!

As you prefer. We need to be able to comment in the doc

I've gone ahead and shared my draft proposal in Google Docs with you; it should come from my email: gradyjonathan55@gmail.com

Please let me know what you think. 😄

@xpivarc
Member

xpivarc commented Apr 1, 2024

Reminder, don't forget to submit a proposal through GSoC by 2nd April - 18:00 UTC.

@hkiiita

hkiiita commented Apr 21, 2024

Dear Mods, is this still open to work on, or has someone already been assigned to it?

@alicefr
Member Author

alicefr commented Apr 22, 2024

Dear Mods, Is this still open to work ? or someone has been assigned on it ?

The project deadline has already passed; unfortunately, you can no longer participate.
