
PVC creation as part of PyTorch job spec #1971

Open
Tracked by #2003
deepanker13 opened this issue Dec 21, 2023 · 8 comments

@deepanker13
Contributor

deepanker13 commented Dec 21, 2023

To avoid granting unnecessary permissions to the Kubeflow user, and to make this more general across the PyTorchJob APIs, we need to introduce API changes so that the following can be set in a PyTorchJob:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorchjob-yaml
  namespace: notebooks-test
spec:
  pytorchReplicaSpecs:
     .....
  storageSpec:
    storageClassName: my-sc
    resources:
      requests:
        storage: 8Gi

So the parameter looks like this:

type PyTorchJobSpec struct {
  // ... existing fields ...
  StorageSpec *corev1.PersistentVolumeClaimSpec `json:"storageSpec,omitempty"`
}
@deepanker13 deepanker13 mentioned this issue Dec 21, 2023
1 task
@tenzen-y
Member

I don't think we should introduce a custom API, since such changes would increase maintenance costs.
Instead, I would suggest reusing the K8s core API's volumeClaimTemplates, as StatefulSet does:

https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#volume-claim-templates
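For illustration, a hedged sketch of what this could look like on a PyTorchJob. The `volumeClaimTemplates` field and its placement are assumptions that mirror the StatefulSet API; this field is not part of the current PyTorchJob spec:

```yaml
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorchjob-yaml
  namespace: notebooks-test
spec:
  # Hypothetical field mirroring StatefulSet's volumeClaimTemplates;
  # the controller would create one PVC per template.
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: my-sc
      resources:
        requests:
          storage: 8Gi
  pytorchReplicaSpecs:
    ...
```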

cc: @andreyvelich @johnugeorge

@andreyvelich
Member

@tenzen-y So your suggestion is to use volumeClaimTemplates in the PyTorchJob .spec to orchestrate PVCs as part of our Training Operator controller loop, as discussed here: #1962 (comment)?
Do we have a use-case where we need to create more than one PVC for a distributed training job?

@tenzen-y
Member

tenzen-y commented Jan 25, 2024

So your suggestion is to use volumeClaimTemplates in the PyTorchJob .spec to orchestrate PVCs as part of our Training Operator controller loop, as discussed here: #1962 (comment)?

Yes, that's right.

Do we have a use-case where we need to create more than one PVC for a distributed training job?

@andreyvelich We can imagine that users want to create volumes with different StorageClasses.
For example, a user might want one large, slower volume for downloading a big dataset, and a second small, faster volume for the uncompressed data.
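As a hedged sketch of that two-volume use case (again assuming a hypothetical `volumeClaimTemplates` field; the volume names and StorageClass names below are made up for illustration):

```yaml
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorchjob-two-volumes
spec:
  # Hypothetical field mirroring StatefulSet's volumeClaimTemplates.
  volumeClaimTemplates:
  - metadata:
      name: dataset                    # large, slower volume for the downloaded dataset
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard-hdd   # assumed slow/cheap StorageClass
      resources:
        requests:
          storage: 500Gi
  - metadata:
      name: scratch                    # small, faster volume for uncompressed data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd       # assumed fast StorageClass
      resources:
        requests:
          storage: 50Gi
  pytorchReplicaSpecs:
    ...
```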

@andreyvelich
Member

For example, a user might want one large, slower volume for downloading a big dataset, and a second small, faster volume for the uncompressed data.

@tenzen-y In that case, the user would typically do some pre-processing on the PyTorch master Pod to prepare the uncompressed data, and then distribute that data to the workers via the small, faster storage?


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@deepanker13
Contributor Author

/reopen

@google-oss-prow google-oss-prow bot reopened this May 15, 2024

@deepanker13: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
