Data sharing scenarios #784

dashohoxha · 2019-11-12T07:49:29Z

Explain the different data sharing scenarios:

Remote DVC Storage
Shared Development Server
Mounted DVC Storage
Synced DVC Storage

Deprecates:

Related: #103, #648 (comment), #54, #455, #194

Fix #54, address #103, fix #194, fix #648

jorgeorpinel · 2019-11-12T21:18:49Z

Looking really cool! Probably Ivan will review this? Feel free to request my review otherwise.

Just one Q: How does this fix #429?

shcheklein · 2019-11-13T02:28:47Z

@dashohoxha looks great! will be definitely reviewing this. It will take a while though :)

dashohoxha · 2019-11-13T03:00:54Z

How does this fix #429 use-case: update shared dev server case with an optimization link?

@jorgeorpinel I have moved the "shared dev server case" to the User Guide, under the section "Data Sharing", along with other data sharing possibilities (or scenarios). I have also revised it a bit to use a local DVC storage instead of a shared cache. In this context, the suggested optimization is using a deduplicating filesystem.

So, my assumption is that that issue is outdated/deprecated and maybe should be closed (as soon as this PR is merged).

Feel free to request my review otherwise.

I am sure that it is not perfect and there are things that can be improved, but please let's merge it first and do any improvements later (in other PRs).

shcheklein · 2019-11-13T03:04:43Z

@dashohoxha @jorgeorpinel I still think Jorge should review it. We all need to see what's happening, we can fix some small issues right away, and I would love to know his opinion.

jorgeorpinel · 2019-11-13T20:02:41Z

I have moved the "shared dev server case" to the User Guide

OK. Need to make a more careful review but for now one comment: Why is the use case still there if you're suggesting to move it to the user guide? Same for /doc/use-cases/sharing-data-and-model-files. Thanks

I have also revised it a bit to use a local DVC storage instead of a shared cache. In this context, the suggested optimization is using a deduplicating filesystem... So, my assumption is that that issue is outdated/deprecated.

I see. In that case we could just close it now; please comment there for Ivan to confirm. Tbh I don't think this change addresses it. Removing "fix #429" from here, for now.

please let's merge it first and do any improvements later

Review first, work on changes until we agree it's mergeable, merge, then continue to improve, for sure. If that's what you meant, I agree.

shcheklein · 2019-11-13T20:13:40Z

please let's merge it first and do any improvements later

My 2c on this: the problem is that it never happens. Making small improvements is considered "unsexy" for some reason (I have no idea why) and people just move on to some new stuff. Just one example - we've created a few tickets after we merged the Install page and they are still not resolved, even those that have p-1 priority, like the first get started page.

I'm not saying that we should be polishing to perfection, but we def should review, come up with reasonable changes, create follow-up tickets and agree on resolving them, then merge.

dashohoxha · 2019-11-14T15:42:13Z

I think the full move including redirections (glad you remembered) should be included along with all this new content

I would prefer to do it separately, but if Ivan also thinks that it should be included in this PR, I'd be glad to do it.

dashohoxha · 2019-11-15T04:59:19Z

@jorgeorpinel I created an issue as a follow-up of this PR: #793

src/Documentation/sidebar.json

shcheklein · 2019-11-15T23:54:24Z

static/docs/user-guide/data-sharing/remote-storage.md

@@ -0,0 +1,193 @@
+# Sharing Data Through a Remote DVC Storage


it is strange to see that it contains only examples, no explanation of what's happening whatsoever. I would expect it to explain remotes way better - this is the primary purpose of this.

SSH example is too complicated:

the same dir for git and dvc on remote is too specific, very uncommon and distracts a lot

name of the remote storage should not use cache in it

it is strange to see that it contains only examples, no explanation of what's happening whatsoever

It is a modified version of this page: https://dvc.org/doc/use-cases/sharing-data-and-model-files
That one does not have much explanation either and things are explained mostly by the example.

Actually I don't find it feasible to explain a solution without using at least a few DVC commands, and for those commands to make sense they have to be used in the context of an example. So, the description mainly describes the situation, and the solution is described by the examples. The hope is that once the reader has understood the solution he can generalize and adapt it for his own case.

I would expect it to explain remotes way better - this is the primary purpose of this.

I am planning to explain the details of the remotes (and their types) in another section. This section is about data sharing scenarios, so let's just refer to the remote details, but not include them here.

the same dir for git and dvc on remote is too specific, very uncommon and distracts a lot

Yes, the Git repository is usually located on GitHub. But this is just an example, an assumption to keep things simple and interactive.

name of the remote storage should not use cache in it

I tried to keep the analogy with Git. In Git a central bare repository is usually named project.git. So, a central DVC storage/cache is named project.cache.

I am planning to explain the details of the remotes (and their types) in another section. This section is about data sharing scenarios, so let's just refer to the remote details, but not include them here.

I don't think we need both; that was my point. I'm confused about why we need both. I think the remote section is enough. There should be some "DVC workflow" section that explains the workflow from a high-level perspective.

I don't think we need both

I don't think that these UG pages:
https://dvc-org-pr-807.herokuapp.com/doc/user-guide/external-data (from another PR)
can be merged with the pages of this PR. I think they should be separate sections.

static/docs/user-guide/data-sharing/shared-server.md

shcheklein · 2019-11-15T23:59:53Z

static/docs/user-guide/data-sharing/mounted-storage.md

@@ -0,0 +1,115 @@
+# Sharing Data Through a Mounted DVC Storage


again, see my other comments. There are at least two possibilities - shared cache or shared remote. In case of NAS it's actually beneficial to share cache (and use some regular cloud remote to still do backups).
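As a rough illustration of why sharing the cache can pay off when the cache and the workspace sit on the same filesystem: a workspace file can then be a link to the cached object rather than a second copy. The sketch below is a toy in plain Python, not DVC code; the directory and file names are hypothetical, and it assumes the temporary directory's filesystem supports hard links.

```python
import os
import shutil
import tempfile

# Toy illustration only: all names here are hypothetical, this is not DVC code.
with tempfile.TemporaryDirectory() as root:
    cache = os.path.join(root, "shared-cache")
    workspace = os.path.join(root, "workspace")
    os.makedirs(cache)
    os.makedirs(workspace)

    # One object stored in the (shared) cache.
    cached = os.path.join(cache, "object")
    with open(cached, "wb") as f:
        f.write(b"x" * 1024)

    # Shared cache on the same filesystem: the workspace file can be a hard link.
    linked = os.path.join(workspace, "data-linked")
    os.link(cached, linked)

    # Fetching from a separate storage location: the bytes are copied.
    copied = os.path.join(workspace, "data-copied")
    shutil.copyfile(cached, copied)

    # A hard link shares the inode with the cached object; a copy does not.
    same_inode = os.stat(cached).st_ino == os.stat(linked).st_ino
    copy_is_separate = os.stat(cached).st_ino != os.stat(copied).st_ino
```

The linked file occupies no extra space, while the copied one doubles the storage used for that object.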

In case of NAS it's actually beneficial to share cache (and use some regular cloud remote to still do backups).

Sharing cache in the case of a NAS may cause problems when we try to use dvc gc. I remember seeing some discussions about this.

Yes, and we even introduced a special flag - to pass multiple projects at once to dvc gc. Gc in DVC is a big pain still but it does not change the fact I mentioned above.

we even introduced a special flag - to pass multiple projects at once to dvc gc

The option -p, --projects of dvc gc gets a path to a project (at least this is how I understand the man page, I have never tried it).

In the case of a NAS mounted storage I assume that the collaborating projects are located on different machines, aren't they? So, the option -p, --projects cannot be used in this case.

There are probably still ways to run GC, e.g. clone the projects to a single machine. It's not ideal (nothing about GC is), but it's a maintenance operation, whereas it's the day-to-day workflow that is optimized with links if you share the cache instead of sharing the remote directly.

Also, I think it is the same problem with your other cases, and one of them is about people sharing the same machine, right?
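To make the garbage-collection concern in this exchange concrete, here is a toy model of a cache shared by two projects. It is plain Python, not DVC's implementation; `gc`, `project_a`, `project_b` and the object names are made up for illustration. It only shows the general idea: a collector that sees just one project's references will delete objects another project still needs, which is why all projects have to be in view at once.

```python
# Toy model of a cache shared by several projects (not DVC's real implementation).
# Each project is represented by the set of object hashes its tracked files use.

def gc(cache, *projects):
    """Keep only objects referenced by the given projects; return what was removed."""
    in_use = set().union(*projects)
    removed = cache - in_use
    cache.intersection_update(in_use)
    return removed

project_a = {"a1", "shared"}
project_b = {"b1", "shared"}

# Collecting with only project A in view deletes an object project B still needs.
cache = {"a1", "b1", "shared", "stale"}
removed_single = gc(cache, project_a)

# Collecting with both projects in view removes only the unreferenced object.
cache = {"a1", "b1", "shared", "stale"}
removed_all = gc(cache, project_a, project_b)
```

In the first call `b1` is lost along with `stale`; in the second only `stale` goes.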

I may extend the tutorial and the user-guide page to explain this optimization as well.

I would first understand the options in terms of organizing data, understand which of them are more general than others, then would try to come up with a couple of sections that explain them in a general way. And by general I mean concepts like - cache is shared or not? people use a single machine or not? etc

I don't see how my initial concern is resolved or addressed here. Please 🙏 , don't resolve them on your own - it makes it extremely hard to do reviews (check and follow up the previously raised concerns).

To be precise - I don't see much value in three (two?) sections that explain different variations of the mounted remote. And to be even more precise - I haven't seen a mounted shared remote case. The only benefit I see - unsupported storage type. I would just create a How to or FAQ or something with one-two paragraphs explanation on options - mount, use rsync/rclone, etc.

I don't see how my initial concern is resolved or addressed here.

I have added this page:

https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/mounted-cache

and this interactive example:

https://katacoda.com/dvc/courses/examples/mounted-cache

that explain the case of a mounted cache, which is more efficient if we share data through a NAS (with the caveat of being careful with the command dvc gc).

It does not answer my question unless again I'm missing the whole point of this PR. Could you please elaborate on this:

To be precise - I don't see much value in three (two?) sections that explain different variations of the mounted remote. And to be even more precise - I haven't seen a mounted shared remote case. The only benefit I see - unsupported storage type. I would just create a How to or FAQ or something with one-two paragraphs explanation on options - mount, use rsync/rclone, etc.

jorgeorpinel

I created an issue as a follow-up of this PR: #793

Again, please do it all together here, as discussed.

The following comment from that issue is indeed worth discussing though:

If they are removed, use-cases is left with only two pages...

The first one is too generic to be a use-case. Data Versioning is rather a core feature of DVC...
So, I think that this page can be removed.

The second page is about the case when DVC is used only as a dataset catalog... This page could be moved to the User Guide, maybe in the section HowTo (like: How to Build and Use a Data Registry).

My first question here is why are we moving these use cases in the first place? Is this something you guys discussed elsewhere? Cc @shcheklein

static/img/user-guide/data-sharing/shared-server.uxf

dashohoxha · 2019-11-20T05:38:27Z

Cool! What kind of UML diagrams are they? (class, sequence, etc.) Are all of them done with https://www.umlet.com/? http://www.umletino.com/? How do you export .uxf ?

They are not UML diagrams. They might be called deployment diagrams, but they are not UML deployment diagrams. I used Umlet because I find it handy and efficient (and it is also open source).
The format .uxf is the basic format of Umlet, and the images (.png) are exported from it.

I would prefer to remove those pages in another PR (unless Ivan thinks otherwise). I reopened the issue #793 because I think that it is better to discuss it there.

shcheklein · 2019-11-25T23:36:00Z