Data sharing scenarios #784
Changes from all commits
1203c11
ef27cb2
e2486a5
Does this one replace the shared server use case? I think the explanation in the use case article is better, in the sense that it gives the context for why a single shared server is used in the first place.
Sorry, you got this one wrong. This one is replacing the shared server use case: https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/shared-server
@dashohoxha then, I'm even more confused, since `shared-server.md` is still part of this PR.
it can be hardlinks and reflinks
No, if the cache is mounted from a NAS, neither hardlinks nor reflinks work (because they don't work across different filesystems).
I think this is a good example of why these cases need to be explained separately and cannot be consolidated further. If you can be confused by such cases, then normal users might be much more confused than you. So it can't be simplified further.
If users use the same mount point for their workspaces, the way your example is written, DVC might pick hardlinks or reflinks. Think of a single EC2 box with a single huge SSD formatted as XFS, used by multiple people.
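For illustration only (this sketch is not part of the PR or of DVC): the disagreement above comes down to whether the cache and the workspace sit on the same filesystem. Below is a minimal Python sketch of two ways to check that, assuming both directories already exist. The names `same_filesystem` and `can_hardlink` are hypothetical helpers, not DVC functions.

```python
import os
import tempfile


def same_filesystem(path_a: str, path_b: str) -> bool:
    """Return True if both existing paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def can_hardlink(cache_dir: str, workspace_dir: str) -> bool:
    """Probe whether a file in cache_dir can be hardlinked into workspace_dir.

    Creates a temporary probe file in cache_dir, tries os.link() into
    workspace_dir, and removes both files again.
    """
    fd, src = tempfile.mkstemp(dir=cache_dir)
    os.close(fd)
    dst = os.path.join(workspace_dir, os.path.basename(src) + ".link")
    try:
        os.link(src, dst)
    except OSError:
        # Typical failure: the directories are on different filesystems,
        # or the filesystem does not support hardlinks at all.
        return False
    else:
        os.remove(dst)
        return True
    finally:
        os.remove(src)
```

Comparing `st_dev` is only a hint (a filesystem can still refuse hardlinks), so the probe is the more direct check. With a NAS-mounted cache and a workspace on a local disk, the probe would be expected to fail; with both on the same mount, as in the single-EC2-box scenario, it would be expected to succeed.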
I still think that explaining SSHFS here is way too much: it's too specific, takes too much time, and distracts from the point, even if expandable sections are used.
This looks like your personal opinion. I don't think the same. I think that without some specific details and a concrete example, the explanation would be too abstract and much more difficult to understand.
I don't even disagree with this. It's a matter of the number of details, and of a concrete example taking up 1% of the page vs. 80% of it. SSHFS is way too specific and we have too many details. It's like 50% of the page is about setting it up.
Protected mode is missing here.
No it is not missing, maybe you have missed it: https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/mounted-cache#optimize-data-management
👍 It's probably worth mentioning why it is needed. I was mostly reading the text, which is probably why I missed this.
I think the primary point of using a shared cache is usually to optimize resources, not to share data in the sense of passing some data to another user.
Sorry, I don't understand what you are trying to say here. Can you please explain further what is the problem that you perceive?
Kk. Let me step back a bit then. Maybe I'm still missing the whole point of this PR. In the index page for this new section, you write:
And I understand this more or less. The primary mechanism for this is just regular remotes (that's why I'm still confused, considering that you have a second PR that is about external data but also includes remotes). So both overlap in this sense, but neither quite focuses on remote management.
So, could you reiterate (and fix index.md?) to communicate what is the goal here? What part of the DVC workflow for end users do we cover? How does it relate to the second PR?
Again, see my other comments. There are at least two possibilities: a shared cache or a shared remote. In the case of a NAS, it's actually beneficial to share the cache (and use some regular cloud remote to still do backups).
Sharing the cache in the case of a NAS may cause problems when we try to use `dvc gc`. I remember seeing some discussions about this.
Yes, and we even introduced a special flag to pass multiple projects at once to `dvc gc`. GC in DVC is still a big pain, but it does not change the fact I mentioned above.
The option `-p, --projects` of `dvc gc` gets a path to a project (at least this is how I understand the man page; I have never tried it). In the case of NAS-mounted storage, I assume that the collaborating projects are located on different machines, aren't they? So the option `-p, --projects` cannot be used in this case.
There are probably still ways to run GC, e.g. clone the projects to a single machine. It's not ideal (nothing about GC is), but it's a maintenance operation, as opposed to the day-to-day workflow that is optimized with links if you share the cache rather than sharing a remote directly.
Also, I think it is the same problem with your other cases, and one of them is about people sharing the same machine, right?
I would first understand the options in terms of organizing data, understand which of them are more general than others, and then try to come up with a couple of sections that explain them in a general way. By general I mean concepts like: is the cache shared or not? Do people use a single machine or not? Etc.
I don't see how my initial concern is resolved or addressed here. Please 🙏 , don't resolve them on your own - it makes it extremely hard to do reviews (check and follow up the previously raised concerns).
To be precise, I don't see much value in three (two?) sections that explain different variations of the mounted remote. And to be even more precise, I haven't seen a mounted shared remote case. The only benefit I see is support for an otherwise unsupported storage type. I would just create a How-to or FAQ entry with a one- or two-paragraph explanation of the options: mount, use rsync/rclone, etc.
I have added this page:
and this interactive example:
that explain the case of a mounted cache, which is more efficient if we share data through a NAS (with the caveat of being careful with the command `dvc gc`).
It does not answer my question unless again I'm missing the whole point of this PR. Could you please elaborate on this: