Data sharing scenarios #784

Closed
wants to merge 3 commits into from

dashohoxha
Contributor

@dashohoxha dashohoxha commented Nov 12, 2019

Explain the different data sharing scenarios:

  • Remote DVC Storage
  • Shared Development Server
  • Mounted DVC Storage
  • Synced DVC Storage

Deprecates:

Related: #103, #648 (comment), #54, #455, #194

Fix #54, address #103, fix #194, fix #648

@shcheklein shcheklein temporarily deployed to dvc-org-pr-784 November 12, 2019 07:49 Inactive
@jorgeorpinel
Contributor

jorgeorpinel commented Nov 12, 2019

Looking really cool! Probably Ivan will review this? Feel free to request my review otherwise.

Just one Q: How does this fix #429?

@shcheklein
Member

@dashohoxha looks great! will be definitely reviewing this. It will take a while though :)

@dashohoxha
Contributor Author

How does this fix #429 use-case: update shared dev server case with an optimization link?

@jorgeorpinel I have moved the "shared dev server case" to the User Guide, under the section "Data Sharing", along with other data sharing possibilities (or scenarios). I have also revised it a bit to use a local DVC storage instead of a shared cache. In this context, the suggested optimization is using a deduplicating filesystem.

So, my assumption is that the issue is outdated/deprecated and maybe should be closed (as soon as this PR is merged).

Feel free to request my review otherwise.

I am sure that it is not perfect and there are things that can be improved, but please let's merge it first and do any improvements later (in other PRs).

@shcheklein
Member

@dashohoxha @jorgeorpinel I still think Jorge should review it. We all need to see what's happening, we can fix some small issues right away, and I would love to know his opinion.

@jorgeorpinel
Contributor

jorgeorpinel commented Nov 13, 2019

I have moved the "shared dev server case" to the User Guide

OK. Need to make a more careful review but for now one comment: Why is the use case still there if you're suggesting to move it to the user guide? Same for /doc/use-cases/sharing-data-and-model-files. Thanks

I have also revised it a bit to use a local DVC storage instead of a shared cache. In this context, the suggested optimization is using a deduplicating filesystem... So, my assumption is that the issue is outdated/deprecated.

I see. In that case we could just close it now; please comment there for Ivan to confirm. Tbh I don't think this change addresses it. Removing "fix #429" from here, for now.

please let's merge it first and do any improvements later

Review first, work on changes until we agree it's mergeable, merge, then continue to improve, for sure. If that's what you meant, I agree.

@shcheklein
Member

please let's merge it first and do any improvements later

my 2c on this. The problem with this is that it never happens. Making small improvements is considered "unsexy" for some reason (I have no idea why) and people just move on to some new stuff. Just one example: we created a few tickets after we merged the Install page and they are still not resolved, even those that have p-1 priority, like the first get started page.

I'm not saying that we should be polishing to perfection, but we def should review, come up with reasonable changes, create follow-up tickets and agree on resolving them, then merge.

@dashohoxha
Contributor Author

I think the full move including redirections (glad you remembered) should be included along with all this new content

I would prefer to do it separately, but if Ivan also thinks that it should be included in this PR, I'd be glad to do it.

@dashohoxha
Contributor Author

@jorgeorpinel I created an issue as a follow-up of this PR: #793

@@ -0,0 +1,193 @@
# Sharing Data Through a Remote DVC Storage
Member

it is strange to see that it contains only examples, no explanation of what's happening whatsoever. I would expect it to explain remotes way better - this is a primary purpose of this.

SSH example is too complicated -

  • the same dir for git and dvc on remote is too specific, very uncommon and distracts a lot
  • name of the remote storage should not use cache in it

Contributor Author

it is strange to see that it contains only examples, no explanation of what's happening whatsoever

It is a modified version of this page: https://dvc.org/doc/use-cases/sharing-data-and-model-files
That one does not have much explanation either and things are explained mostly by the example.

Actually I don't find it feasible to explain a solution without using at least a few DVC commands, and for those commands to make sense they have to be used in the context of an example. So, the description mainly describes the situation, and the solution is described by the examples. The hope is that once the reader has understood the solution he can generalize and adapt it for his own case.

I would expect it to explain remotes way better - this is a primary purpose of this.

I am planning to explain the details of the remotes (and their types) in another section. This section is about data sharing scenarios, so let's just refer to the remote details, but not include them here.

the same dir for git and dvc on remote is too specific, very uncommon and distracts a lot

Yes, the Git repository is usually located on GitHub. But this is just an example, an assumption to keep things simple and interactive.

name of the remote storage should not use cache in it

I tried to keep the analogy with Git. In Git a central bare repository is usually named project.git. So, a central DVC storage/cache is named project.cache.

Member

I am planning to explain the details of the remotes (and their types) on another section. This section is about data sharing scenarios, so let's just refer to the remote details, but not include them here.

I don't think we need both, that was my point. I'm confused about why we need both. I think the remote section is enough. There should be some "DVC workflow" section that explains the workflow from a high-level perspective.

Contributor Author

I don't think we need both

I don't think that these UG pages:
https://dvc-org-pr-807.herokuapp.com/doc/user-guide/external-data (from another PR)
can be merged with the pages of this PR. I think they should be separate sections.

@@ -0,0 +1,115 @@
# Sharing Data Through a Mounted DVC Storage
Member

again, see my other comments. There are at least two possibilities - shared cache or shared remote. In case of NAS it's actually beneficial to share cache (and use some regular cloud remote to still do backups).

Contributor Author

In case of NAS it's actually beneficial to share cache (and use some regular cloud remote to still do backups).

Sharing cache in the case of a NAS may cause problems when we try to use dvc gc. I remember seeing some discussions about this.

Member

Yes, and we even introduced a special flag - to pass multiple projects at once to dvc gc. Gc in DVC is a big pain still but it does not change the fact I mentioned above.

Contributor Author

we even introduced a special flag - to pass multiple projects at once to dvc gc

The option -p, --projects of dvc gc gets a path to a project (at least this is how I understand the man page, I have never tried it).

In the case of a NAS-mounted storage I assume that the collaborating projects are located on different machines, aren't they? So, the option -p, --projects cannot be used in this case.

Member

There are probably still ways to run GC, e.g. clone the projects to a single machine. It's not ideal (nothing about GC is), but it's a maintenance operation, whereas the day-to-day workflow is what gets optimized with links if you share the cache instead of sharing the remote directly.

Also, I think it is the same problem with your other cases, and in one of them it's about people sharing the same machine, right?

Member

@shcheklein shcheklein Nov 20, 2019

I may extend the tutorial and the user-guide page to explain this optimization as well.

I would first understand the options in terms of organizing data, understand which of them are more general than others, then would try to come up with a couple of sections that explain them in a general way. And by general I mean concepts like: is the cache shared or not? do people use a single machine or not? etc.

Member

I don't see how my initial concern is resolved or addressed here. Please 🙏 , don't resolve them on your own - it makes it extremely hard to do reviews (check and follow up the previously raised concerns).

Member

To be precise - I don't see much value in three (two?) sections that explain different variations of the mounted remote. And to be even more precise - I haven't seen a mounted shared remote case. The only benefit I see is an unsupported storage type. I would just create a How to or FAQ or something with a one- or two-paragraph explanation of the options - mount, use rsync/rclone, etc.

Contributor Author

I don't see how my initial concern is resolved or addressed here.

I have added this page:

and this interactive example:

that explain the case of a mounted cache, which is more efficient if we share data through a NAS (with the caveat of being careful with the command dvc gc).

Member

It does not answer my question unless again I'm missing the whole point of this PR. Could you please elaborate on this:

To be precise - I don't see much value in three (two?) sections that explain different variations of the mounted remote. And to be even more precise - I haven't seen a mounted shared remote case. The only benefit I see is an unsupported storage type. I would just create a How to or FAQ or something with a one- or two-paragraph explanation of the options - mount, use rsync/rclone, etc.

Contributor

@jorgeorpinel jorgeorpinel left a comment

I created an issue as a follow-up of this PR: #793

Again, please do it all together here, as discussed.

The following comment from that issue is indeed worth discussing though:

If they are removed, use-cases is left with only two pages...

The first one is too generic to be a use-case. Data Versioning is rather a core feature of DVC...
So, I think that this page can be removed.

The second page is about the case when DVC is used only as a dataset catalog... This page could be moved to the User Guide, maybe in the section HowTo (like: How to Build and Use a Data Registry).

My first question here is why are we moving these use cases in the first place? Is this something you guys discussed elsewhere? Cc @shcheklein

@shcheklein shcheklein temporarily deployed to dvc-org-pr-784 November 20, 2019 03:23 Inactive
@dashohoxha
Contributor Author

Cool! What kind of UML diagrams are they? (class, sequence, etc.) Are all of them done with https://www.umlet.com/? http://www.umletino.com/? How do you export .uxf ?

They are not UML diagrams. They might be called deployment diagrams, but they are not UML deployment diagrams. I used Umlet because I find it handy and efficient (and it is also open source).
The format .uxf is the basic format of Umlet, and the images (.png) are exported from it.

@dashohoxha dashohoxha dismissed jorgeorpinel’s stale review November 25, 2019 22:42

I would prefer to remove those pages on another PR (unless Ivan thinks otherwise). I reopened the issue #793 because I think that it is better to discuss it there.

@shcheklein shcheklein temporarily deployed to dvc-org-pr-784 November 25, 2019 23:14 Inactive
@@ -0,0 +1,144 @@
# Sharing Data Through a Mounted Cache
Member

@shcheklein shcheklein Nov 25, 2019

this one replaces the shared server use case? I think the explanation in the use case article is better in a sense that it gives the context why a single shared server is used in the first place

Contributor Author

this one replaces the shared server use case?

Sorry, you got this one wrong. This one is replacing the shared server use case: https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/shared-server

Member

@dashohoxha then, I'm even more confused since the shared-server.md is still part of this PR.

> cleans obsolete data from the cache) and consult the other users of the
> project before using this command.

The optimization in data management comes from using the _symlink_ cache type.
Member

it can be hardlinks and reflinks

Contributor Author

it can be hardlinks and reflinks

No, if the cache is mounted from a NAS, neither hardlinks nor reflinks work (because they don't work across different filesystems).

I think this is a good example of why these cases need to be explained separately and they cannot be consolidated further. If you can be confused on such cases, then normal users might be much more confused than you. So, it can't be simplified further.

Member

If users use the same mount point for their workspaces, the way your example is written, DVC might pick hardlinks or reflinks. Think of a single EC2 box with a single huge SSD formatted with XFS, used by multiple people.
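
The disagreement above comes down to filesystem behavior: a hard link can only be created when the cache and the workspace are on the same filesystem, while a symlink can point across a mount boundary. A minimal sketch of that behavior, using only standard `ln`, is shown here. It is not DVC's link-selection code; the scratch directory, the file names, and the `link_from_cache` helper are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Illustration only: this is NOT how DVC chooses a link type internally.
set -eu

work=$(mktemp -d)                      # hypothetical scratch directory
trap 'rm -rf "$work"' EXIT
mkdir "$work/cache" "$work/workspace"
printf 'sample data\n' > "$work/cache/data"

# Try a hard link first; fall back to a symlink when the filesystem refuses,
# which is what happens when cache and workspace are on different filesystems.
link_from_cache() {
  # $1 = file in the cache, $2 = path in the workspace
  if ln "$1" "$2" 2>/dev/null; then
    echo hardlink
  else
    ln -s "$1" "$2"
    echo symlink
  fi
}

kind=$(link_from_cache "$work/cache/data" "$work/workspace/data.csv")
echo "link type used: $kind"
```

Here both directories live under the same temporary directory, so the hard link is expected to succeed. Pointing the cache path at a directory on a separately mounted filesystem (for example an SSHFS or NFS mount) would be expected to take the symlink branch instead.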

You can find more details about it in the page of
[Large Dataset Optimization](https://dvc.org/doc/user-guide/large-dataset-optimization).

## Mounted Cache Example
Member

I still think that explaining SSHFS here is way too much - too specific, takes too much time and distracts from the point even if expandable sections are being used

Contributor Author

@dashohoxha dashohoxha Nov 26, 2019

I still think that explaining SSHFS here is way too much - too specific, takes too much time and distracts from the point even if expandable sections are being used

This looks like your personal opinion. I don't think the same. I think that without some specific details and a concrete example, the explanation would be too abstract and much more difficult to understand.

Member

I think that without some specific details and a concrete example

I don't even disagree with this. It's a matter of number of details and also a matter of using 1% concrete example vs using 80% example. SSHFS is way too specific and we have too many details. It's like 50% of the page is about setting up this.

~/project/.dvc/cache/
```

### Optimize data management
Member

protected mode is missed here

Contributor Author

protected mode is missed here

No it is not missing, maybe you have missed it: https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/mounted-cache#optimize-data-management

Member

👍 probably worth mentioning why it is needed. I was mostly reading the text, which is probably why I missed this.
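
On the "why": when workspace files are links into the cache, an in-place edit in the workspace silently changes the cached copy, and protected mode guards against this by making the linked files read-only. A minimal sketch of the underlying behavior, using only standard shell tools and no DVC commands, is shown here. The paths and file names are hypothetical, and the `chmod` step only approximates what a read-only cache entry looks like.

```shell
#!/bin/sh
# Illustration only: plain shell, not DVC. Paths and names are hypothetical.
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/cache" "$work/workspace"
printf 'original\n' > "$work/cache/data"
ln -s "$work/cache/data" "$work/workspace/data.csv"

# 1. Writable cache entry: an edit through the link changes the cached copy.
printf 'edited\n' > "$work/workspace/data.csv"
unprotected=$(cat "$work/cache/data")
echo "cache content after unprotected edit: $unprotected"

# 2. Read-only cache entry: the same edit is refused.
#    (A root user bypasses file permissions, so it may still succeed as root.)
chmod u+w "$work/cache/data"
printf 'original\n' > "$work/cache/data"
chmod a-w "$work/cache/data"
if ( printf 'edited\n' > "$work/workspace/data.csv" ) 2>/dev/null; then
  protected_write=allowed
else
  protected_write=refused
fi
echo "write through link to read-only cache entry: $protected_write"
```

The first step shows the corruption risk with an unprotected link; the second shows how a read-only target turns the same mistake into an error for a regular user.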

$ git push
```

### Sharing data
Member

I think the primary point of using a shared cache is usually to optimize resources, not to share data in the sense of passing some data to another user

Contributor Author

I think the primary point of using a shared cache is usually to optimize resources, not to share data in the sense of passing some data to another user

Sorry, I don't understand what you are trying to say here. Can you please explain further what is the problem that you perceive?

Member

Kk. Let me step back a bit then. Maybe I'm still missing the whole point of this PR. In the index page for this new section, you write:

Like Git, DVC facilitates collaboration and data sharing on a distributed
environment. It makes it easy to consistently get all your data files and
directories to any machine, along with the source code.

And I understand this more or less. And the primary mechanism for this is just regular remotes. That's why I'm still confused, considering that you have a second PR that is about external data but also includes remotes. So both overlap in this sense, but neither quite focuses on remotes management.

So, could you reiterate (and fix index.md?) to communicate what is the goal here? What part of the DVC workflow for end users do we cover? How does it relate to the second PR?

Member

@shcheklein shcheklein left a comment

@dashohoxha A lot of good stuff has been done. But I still have the same concerns.

See my comments. I've unresolved a few of those.

It's too complicated. It puts too much focus on some strange cases like mounted remotes. Not clear why we need remotes here if we have a new PR about them. Examples with SSHFS are too specific and complicated.

Contributor

@jorgeorpinel jorgeorpinel left a comment

Same as #784 (review) which was dismissed unilaterally.

Note that I'm only reviewing BASIC high-level structural matters in this PR. It doesn't make sense to extract this to another issue or PR, since these possible problems are introduced by this PR. This is also as agreed in #784 (comment).

Mainly talking about removing the docs that these changes intend to deprecate:

If they are removed, use-cases is left with only two pages. 1) The first one is too generic to be a use-case. Data Versioning is rather a core feature of DVC, so, I think that this page can be removed. 2) The second page is about the case when DVC is used only as a dataset catalog. This page could be moved to the User Guide, maybe in the section HowTo (like: How to Build and Use a Data Registry).

My initial question here is why are we moving these use cases in the first place? Don't we want to have a Use Cases section by design of the docs? @shcheklein

Can we perhaps focus on this, decide, and address? So we can move on to Ivan's detailed review 🙂 Thanks

@shcheklein
Member

@jorgeorpinel I would say I have mixed feelings about the use cases like versioning data and sharing data, so I understand where the intention to deprecate them comes from :)

I agree that the way they are written now they are too low level and do not serve as a good "landing page", as @jorgeorpinel called them. But I think we should be improving them in that direction (the same way we do with data registry) and write more use cases like this if possible. This section should answer the questions of infrastructure, DevOps, and MLOps folks on how DVC can help them: how it can be used, and in what high-level scenarios. And it's even okay if they overlap in this case. I can totally see that data versioning is to some extent about data management, as data registry is. Or data sharing potentially overlaps with data registry, in the sense that a data registry can be used to share data in slightly different scenarios.

To reiterate on the Use Cases vs User Guide.

The User Guide is like a car owner's manual - a lot of details, how-tos, etc.
Use Cases would be like explaining how that car can be used (potentially referring a lot to the owner's manual to cover the technical details) - driving from home to work, renting, taxi, etc.

Hope that makes sense :)

@jorgeorpinel
Contributor

jorgeorpinel commented Nov 26, 2019

OK, so here are the decisions I propose:

the way they are written now they are too low level, do not serve as a good "landing page"... But I think we should be improving them (and write more use cases)

I agree. So the first decision 🥇 is that we're not removing the use-cases section. ✔️

it's even okay if they overlap in this case.

Again agree. If we can come up with a good balance on how much overlap is good to have between use cases and these deeper new docs in /doc/user-guide/data-sharing/, then I think we can keep both (having in mind we will improve the use cases in the future to be more high level).

I think that this specifically means that 🥈 the new docs should try to not repeat the more high level parts of the use cases. ✔️
I just hope we do review this now, to prevent merging the PR with lots of repetition vs. existing use cases. I could focus on reviewing this aspect if you agree, @shcheklein. (So far I haven't looked at the detailed changes here.)

The first one is too generic to be a use-case. Data Versioning is rather a core feature of DVC...

data versioning is to some extent about data management as data registry is.

The second page is about the case when DVC is used only as a dataset catalog...

data sharing overlaps potentially with data registry in a sense that data registry can be used to share data

That level of overlap between use cases is fine, I agree. And also that data versioning is a core feature of DVC. Data Registry is our newest use case and we're already working on its 2nd iteration, so it's possibly the best use case we have.

So 🥉 let's keep these 2 use cases where they are as well. ✔️

Assuming you both agree, I'll remove myself for now.

@dashohoxha
Contributor Author

It's too complicated. It puts too much focus on some strange cases like mounted remotes. Not clear why we need remotes here if we have a new PR about them. Examples with SSHFS are too specific and complicated.

I have tried to answer some of the concerns in the other comments. User Guide pages are supposed to be detail-oriented (as opposed to Use Cases, which are supposed to be high level; at least this is what I gather from the other discussions). In this sense I don't think these pages are too complicated.

I am afraid that you are not getting my vision, and I don't get yours either. At this point I am stuck as I don't see how this PR can be improved further.

@jorgeorpinel: So let's keep these 2 use cases where they are as well.

If we agree on this, then it seems to me that no major changes need to be added on this PR.

@jorgeorpinel: I could focus on reviewing this aspect if you agree

I would be more than happy if @jorgeorpinel took over making further improvements to these pages.
But I would be even happier if we merged this PR first and made further improvements in another PR.

@shcheklein
Member

@dashohoxha see my comments on some discussions on this ticket.

I am afraid that you are not getting my vision, and I don't get yours either.

yep. 👍 And that prevents it from being merged or improved. I asked about the vision in some comments above. Please share your thoughts and we'll see if I can understand the purpose and end goal of this PR.

@shcheklein
Member

Even though some parts could be useful, I'm closing this :( :

  • it feels that it was created without an understanding of the use cases, which translates into an overcomplicated structure, some artificial examples, and some non-relevant statements
  • it looks like the author does not want to continue iterating with us on this
  • the two things above mean for us that it would take less time to rewrite the relevant materials from scratch than to address them here

@shcheklein shcheklein closed this Dec 14, 2019
@jorgeorpinel jorgeorpinel deleted the data-sharing branch January 23, 2020 17:01