new command: put-url OR rsync/rclone #6485
Does it make sense to have a separate
👀 well that sounds like a problem
Maybe this wouldn't be too difficult to implement. Summarising what I think the most useful
We've endeavoured to provide buffered
😬 Yup, I only realized this pretty recently myself. Extracted to #6494.
The initial motivation for this command was to upload data to storage and preserve a pointer file. It seems like we need an "upload equivalent". Proposed names:
New commands:
PS: it is related to the idea of lightweight model and data management
I also think
I'd suggest focusing on the core scenario -
does it mean "aws s3 cp local_path_in_repo s3://remote_path && dvc import-url s3://remote_path -o local_path_in_repo"? It sounds then that this should be an extension to import itself (move a file before import, or something). I don't feel that this deserves the full name "export", since it won't be doing anything different from import-url.
My original intention with this feature request was just the first bit (
Yep, and name
I thought we were solving the CML and MLEM problems with this proposal, weren't we? If not, I'm creating a separate issue for the integrations and we can keep the current one as is.
I'm fine with this one :) My question stays - it doesn't feel like it's export. It does copy + import under the hood, right? So why export then? Why not an option for the import command (to copy the artifact to the cloud first)?
@shcheklein yes, it can be just an option of import. import/export naming is definitely not perfect, so an option might be a safer choice in the short term.
See #6494 (comment) for a proposed syntax to cover both dvc urls and arbitrary urls with one command.
Seems like it's unclear whether CML needs
Hm, I thought
It might make sense to name it |
I'm trying to aggregate our discussions here and in person to action points:
Important:
Below are user use cases that should help to understand the scenarios.

From local to Cloud/S3

A model
Why the user needs the pointer file:

Uploading

Note, this command is an equivalent to

Updating

A model file was changed (as a result of re-training), for example:

From cloud to workspace

Users write models/data to cloud from user's code (or it is already updated by an external tool). Saving a pointer to a model file still might be useful. Why:

Tracking a cloud file

After training is done and a file is saved to s3://mybucket/ml/prod/2022-03-07-model.h5:

Tracking a cloud file without a local copy

In some cases, the user writes a file in storage and does not need a copy in the workspace.
Technically, the file will still have a virtual representation in the workspace as

Pros/Cons:

To cover the latest cons, we can consider introducing
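The "pointer file" idea in these scenarios can be illustrated with a short sketch. Everything here is hypothetical: `write_pointer` is an invented helper and the JSON layout only stands in for DVC's actual .dvc file format. The point is that the pointer records a checksum plus the storage location, and only this small file needs to be committed to Git:

```python
import hashlib
import json
from pathlib import Path


def md5sum(path):
    """Compute the MD5 of a file, read incrementally in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def write_pointer(data_path, storage_url, pointer_path):
    """Write a minimal pointer file: checksum + where the data lives.

    Illustration only; DVC's real .dvc files use a different schema.
    """
    pointer = {"md5": md5sum(data_path), "url": storage_url}
    Path(pointer_path).write_text(json.dumps(pointer, indent=2))
    return pointer
```

The data file itself stays in storage; the pointer is what Git tracks and what a later download/materialize step would consume.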
Could you please clarify this: "It should work now if the Uploading part is based on". My initial reaction is that external outputs remind me of export.
The logic is:
To be honest, I'd rename the first two to
Yes, but the internal machinery and logic are very different. You need a pipeline for external outputs, which is not compatible with no-DVC requirements and won't be intuitive for users.
that's exactly the sign that we are mixing the semantics
not necessarily btw,
Yep, I understand. It's not so much about the redundancy of a command, it's more about the semantics still. It confuses me a bit that export internally does import. For example, we can make
And
If we want to keep this semantics (an import link created inside export), I would probably even prefer to have
Also, if we go back to the
This looks like the direction of
It means the upload is happening as a result of

From local to Cloud/S3

In this scenario, the user has their own local
If they want to upload to the cloud and keep a pointer locally,
As @shcheklein noted, the workflow here assumes the user saves updates locally, so it makes sense for
Similar to how
I'll follow up on the other scenarios in another comment to keep this from being too convoluted 😅

Edit: On second thought, maybe it's better to resolve this scenario first 😄. The others might require a separate discussion.
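The local-to-cloud flow in this scenario can be sketched in a few lines. This is a toy illustration, not DVC code: `upload_and_track` is a hypothetical name, and a plain directory copy stands in for the actual cloud transfer:

```python
import hashlib
import shutil
from pathlib import Path


def upload_and_track(local_path, remote_dir):
    """Copy a local artifact to 'remote' storage and return a pointer record.

    A real command would transfer to s3:// etc.; shutil.copy2 is a stand-in.
    """
    local_path = Path(local_path)
    remote_dir = Path(remote_dir)
    remote_dir.mkdir(parents=True, exist_ok=True)
    dest = remote_dir / local_path.name
    shutil.copy2(local_path, dest)
    md5 = hashlib.md5(local_path.read_bytes()).hexdigest()
    # The record (checksum + remote location) is what a pointer file
    # committed to Git would capture.
    return {"path": local_path.name, "md5": md5, "remote": str(dest)}
```

On re-training, running the same step again would overwrite the remote copy and yield a new checksum, which is the "Updating" case above.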
If we go to the bi-directional
@shcheklein @dberenbaum WDYT?
@dmpetrov I think it makes sense. However, I think the "Storage to local" scenarios are a little convoluted. If model updates are happening external to user code and saved in the cloud, or the user already has a model in the cloud saved previously by their code,
If instead they are using dvc to track a model that their user code saves in the cloud,
An alternative is to change how external outputs work. I put a mini-proposal at the bottom of https://www.notion.so/iterative/Model-Management-in-DVC-af279e36b8be4e929b08df7a491e1a4c. It's still a work in progress, but if you have time, PTAL and see if the direction makes sense.
Right. This does not look like a core scenario. Just to make it clear - Storage to local covers use cases when a model was created from outside of a repository. Examples: user imports an external model to use GitOps/Model-Registry functionality, importing a pre-trained model or existing dataset. |
In your earlier comment, you seemed to indicate that the scope was broader:
Are we now limiting it to cases where the model was updated by an external tool?

Edit: Or maybe writing models to cloud from user's code is part of "Local to Storage." Either way, I think there's a core scenario for writing models directly to cloud from user's code that isn't covered by
It was described as a broader scenario, but the major goal was to cover the Lightweight Model Management use case (see "user imports an external model to use GitOps/Model-Registry functionality"). It can be useful in some other scenarios (see "importing a pre-trained model or existing dataset"). However, importing a training model into the same repo does not make sense. We are introducing From local to Cloud for this.
In the context of model management we nailed down the scope - see #7918
Any news on this?
@bhack No progress here, sorry.
@bhack Could you explain more about your scenario? One option might be to push to a cloud-versioned remote, which would show files in a more typical directory structure.
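For context, a cloud-versioned remote is configured by marking the remote as `version_aware`, which makes pushed files appear in the bucket under their workspace names rather than as content-addressed hashes. A sketch of the config (bucket name and remote name are placeholders):

```ini
# .dvc/config (sketch; bucket and remote names are placeholders)
['remote "models"']
    url = s3://mybucket/models
    version_aware = true
[core]
    remote = models
```

The same can be set via `dvc remote modify models version_aware true`.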
@dberenbaum In some cases I need to use gcsfuse or similar.
Do you mind pulling up to a higher level? Once you sync with the cloud bucket, how are you using gcsfuse and why do you need this setup? When you materialize multiple commits, do you sync each to a different location?
This is a quite common (emerging?) ML setup. See:
Yes, I need at least a different prefix in the destination bucket to replicate the whole dataset, e.g. with the related dvc/git commit, and to handle garbage collection on the materialized commits when they are not needed anymore. At least I don't find any
Hi @bhack, sorry for the lack of response here. Would you have time to discuss this in more depth on a call sometime?
@dberenbaum Yes, but I think it would be more useful to open a new discussion thread on GitHub so that it could also be useful for other users.
Summary

An upload equivalent of `dvc get-url`. We currently use `get-url` as a cross-platform replacement for `wget`. However, together with `get-url`, `put-url` will turn DVC into a replacement for `rsync`/`rclone`.

Motivation

- We have `get-url`, so adding `put-url` seems natural for the same reasons.
- `put-url` will be used in place of `rsync`/`rclone`. What's not to love?

Detailed Design

How We Teach This

`put-url` seems to be in line with the existing `get-url` (vis. HTTP `GET` & `PUT`).

Drawbacks

Alternatives

Unresolved Questions

- Naming (`put-url`)?
- CLI syntax (`url targets [targets...]`)?
- Python API (`dvc.api.put_url()`)?

Please do assign me if happy with the proposal.

(`dvc get-url` + `put-url` = `dvc rsync` :))
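For the single-file case, the proposed command could be sketched as below. This is an assumption-laden illustration, not DVC's implementation: only local and file:// destinations are handled, where a real `put-url` would dispatch to the s3/gs/azure/ssh backends that `get-url` already reads from:

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse


def put_url(src, dest_url):
    """Upload a local file to a destination URL (sketch).

    Only local paths and file:// URLs are supported here; a real
    implementation would dispatch on the URL scheme to cloud backends.
    """
    parsed = urlparse(str(dest_url))
    if parsed.scheme in ("", "file"):
        dest = Path(parsed.path) if parsed.scheme == "file" else Path(dest_url)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        return str(dest)
    raise NotImplementedError(f"scheme {parsed.scheme!r} not handled in this sketch")
```

Unlike `dvc add`/`dvc import-url`, nothing is written to the cache or to a .dvc file; the command is a pure transfer, mirroring how `get-url` is a pure download.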