Data sharing scenarios #784

Closed · wants to merge 3 commits
27 changes: 27 additions & 0 deletions src/Documentation/sidebar.json
@@ -119,6 +119,33 @@
"label": "Managing External Data",
"slug": "managing-external-data"
},
{
"label": "Data Sharing",
"slug": "data-sharing",
"source": "data-sharing/index.md",
"children": [
{
"label": "Remote DVC Storage",
"slug": "remote-storage"
},
{
"label": "Shared Development Server",
dashohoxha marked this conversation as resolved.
Show resolved Hide resolved
"slug": "shared-server"
},
{
"label": "Mounted DVC Storage",
"slug": "mounted-storage"
},
{
"label": "Mounted DVC Cache",
"slug": "mounted-cache"
},
{
"label": "Synced DVC Storage",
"slug": "synced-storage"
}
]
},
{
"label": "Contributing",
"slug": "contributing",
Expand Down
52 changes: 52 additions & 0 deletions static/docs/user-guide/data-sharing/index.md
@@ -0,0 +1,52 @@
# Data Sharing and Collaboration with DVC

Like Git, DVC facilitates collaboration and data sharing in a distributed
environment. It makes it easy to consistently get all your data files and
directories to any machine, along with the source code.

![](/static/img/model-sharing-digram.png)

There are several ways to set up data sharing with DVC. We will discuss the
most common scenarios.

- [Sharing Data Through a Remote DVC Storage](/doc/user-guide/data-sharing/remote-storage)

This is the recommended and most common data sharing scenario. In this case we
set up a [remote storage](/doc/command-reference/remote) on a data storage
provider, to store data files online where others can reach them. DVC currently
supports Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage, SSH,
HDFS, and other remote locations, and the list is constantly growing. (A
minimal command sketch of this setup is shown at the end of this list.)

- [Using Local Storage on a Shared Development Server](/doc/user-guide/data-sharing/shared-server)

Some teams may prefer to use a single shared machine for running their
experiments. This allows better resource utilization, such as the ability to
use multiple GPUs. In this case we can use local data storage, which allows the
team to store and share data very efficiently, with no duplication of data
files and almost instantaneous transfers.

- [Sharing Data Through a Mounted DVC Storage](/doc/user-guide/data-sharing/mounted-storage)

If the data storage server (or provider) uses a protocol that is not yet
supported by DVC but allows us to mount a remote directory on the local
filesystem, we can still set up data sharing with DVC. This can be useful, for
example, when the data files are located on network-attached storage (NAS) and
can be accessed through protocols like NFS, Samba, SSHFS, etc.

- [Sharing Data Through a Mounted DVC Cache](/doc/user-guide/data-sharing/mounted-cache)

This case is similar to the Mounted DVC Storage (mentioned above), but instead
of mounting the DVC storage from the server, we directly mount the cache
directory (`.dvc/cache/`). If all the users do this, then effectively they are
using the same cache directory (which is mounted from the NAS server). So, if
one of them adds something to the cache, it automatically appears in the cache
of all the others.

- [Sharing Data Through a Synchronized DVC Storage](/doc/user-guide/data-sharing/synced-storage)

There are cloud data storage providers that are not yet supported by DVC, but
this does not mean that we cannot use them to share data with the help of DVC.
If it is possible to synchronize a local directory with a remote one (which
almost all storage providers support), then we can still create a setup that
allows us to share DVC data.
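
As a quick illustration of the first (and recommended) scenario, the setup
usually boils down to a few commands like the following sketch (the S3 bucket
name and path here are just hypothetical placeholders):

```dvc
# Configure a default remote storage (hypothetical bucket/path)
$ dvc remote add --default storage s3://mybucket/dvcstore

# Upload the cached data files to the remote storage
$ dvc push

# On another machine: download the data files referenced by the DVC-files
$ dvc pull
```
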
144 changes: 144 additions & 0 deletions static/docs/user-guide/data-sharing/mounted-cache.md
@@ -0,0 +1,144 @@
# Sharing Data Through a Mounted Cache

> **shcheklein (Member), Nov 25, 2019:** this one replaces the shared server
> use case? I think the explanation in the use case article is better in a
> sense that it gives the context why a single shared server is used in the
> first place
>
> **dashohoxha (Contributor, Author):**
>
> > this one replace the shared server use case?
>
> Sorry, you got this one wrong. This one is replacing the shared server use
> case:
> https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/shared-server
>
> **shcheklein (Member):** @dashohoxha then, I'm even more confused since the
> shared-server.md is still part of this PR.


We have already seen how to share data through a
[mounted DVC storage](/doc/user-guide/data-sharing/mounted-storage). In that
case there is a copy of the data on the DVC storage and at least one copy in
each user's project, since deduplication does not work across filesystems.

However, data management can be further optimized by using a shared cache. The
idea is that instead of mounting the DVC storage from the server, we can
directly mount the cache directory (`.dvc/cache/`). If all the users do this,
then effectively they will be using the same cache directory (which is mounted
from the NAS server). So, if one of them adds something to the cache, it will
automatically appear in the cache of all the others. As a result, no `dvc push`
or `dvc pull` is needed to share the data; a `dvc checkout` is sufficient.

> **❗ Caution:** Deleting data from the cache will also make it disappear from
> the cache of the other users. So be careful with the command `dvc gc` (which
> cleans obsolete data from the cache), and consult the other users of the
> project before running it.

The optimization in data management comes from using the _symlink_ cache type.

> **shcheklein (Member):** it can be hardlinks and reflinks
>
> **dashohoxha (Contributor, Author):**
>
> > it can be hardlinks and reflinks
>
> No, if the cache is mounted from a NAS, neither hardlinks nor reflinks work
> (because they don't work across different filesystems).
>
> I think this is a good example of why these cases need to be explained
> separately and they cannot be consolidated further. If you can be confused on
> such cases, then normal users might be much more confused than you. So, it
> can't be simplified further.
>
> **shcheklein (Member):** If users use the same mount point for their
> workspaces the way your example is written, DVC might pick hardlinks or
> reflinks. Think of a single EC2 box with a single huge SSD xfs on it, being
> used by multiple people.

You can find more details on the
[Large Dataset Optimization](https://dvc.org/doc/user-guide/large-dataset-optimization)
page.

## Mounted Cache Example

> **shcheklein (Member):** I still think that explaining SSHFS here is way too
> much - too specific, takes too much time and distracts from the point even if
> expandable sections are being used
>
> **dashohoxha (Contributor, Author), Nov 26, 2019:**
>
> > I still think that explaining SSHFS here is way too much - too specific,
> > takes too much time and distracts from the point even if expandable
> > sections are being used
>
> This looks like your personal opinion. I don't think the same. I think that
> without some specific details and a concrete example, the explanation would
> be too abstract and much more difficult to understand.
>
> **shcheklein (Member):**
>
> > I think that without some specific details and a concrete example
>
> I don't even disagree with this. It's a matter of number of details and also
> a matter of using 1% concrete example vs using 80% example. SSHFS is way too
> specific and we have too many details. It's like 50% of the page is about
> setting up this.

In this example we will see how to share data with the help of a cache
directory that is mounted through SSHFS. We are using an SSHFS example because
it is easy to network-mount a directory with SSHFS. However, once you
understand how it works, it should be easy to implement for other types of
network-mounted storage (like NFS, Samba, etc.).

> For more detailed instructions check out this
> [interactive example](https://katacoda.com/dvc/courses/examples/mounted-cache).

<p align="center">
<img src="/static/img/user-guide/data-sharing/mounted-cache.png"/>
</p>

<details>

### Prerequisites: Set up the server

We have to make the following configurations on the SSH server (a command
sketch is shown below):

- Create accounts for each user and add them to groups for accessing the Git
  repository and the DVC cache.
- Create a bare Git repository (for example at `/srv/project.git/`) and an
  empty directory for the DVC cache (for example at `/srv/project.cache/`).
- Grant the users read/write access to these directories (through the groups).

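For illustration, here is a minimal sketch of these steps, assuming a
hypothetical group named `dvc-users` and the example paths above (adjust
usernames, group names, and permissions to your environment):

```dvc
# Create a group for the collaborators and add each user account to it
$ sudo groupadd dvc-users
$ sudo usermod -aG dvc-users user1

# Create a bare Git repository and an empty directory for the DVC cache
$ sudo git init --bare /srv/project.git
$ sudo mkdir -p /srv/project.cache

# Give the group read/write access to both directories
$ sudo chgrp -R dvc-users /srv/project.git /srv/project.cache
$ sudo chmod -R g+rws /srv/project.git /srv/project.cache
```
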
</details>

<details>

### Set up each user

When we have to access an SSH server, we definitely want to generate SSH key
pairs and set up the SSH config so that we can access the server without a
password.

Let's assume that for each user we can use the private SSH key
`~/.ssh/dvc-server` to access the server without a password, and that we have
also added to `~/.ssh/config` lines like these:

```
Host dvc-server
    HostName host01
    User user1
    IdentityFile ~/.ssh/dvc-server
    IdentitiesOnly yes
```

Here `dvc-server` is the name or alias that we can use for our server, `host01`
can actually be the IP or the FQDN of the server, and `user1` is the username of
the first user on the server.

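As an illustration, the key pair and the passwordless login can be set up
roughly like this (a sketch; `user1@host01` and the key path are just the
placeholders used above):

```dvc
# Generate a key pair without a passphrase
$ ssh-keygen -t rsa -b 4096 -f ~/.ssh/dvc-server -N ''

# Install the public key on the server, so that passwordless login works
$ ssh-copy-id -i ~/.ssh/dvc-server.pub user1@host01

# Test it: this should log in without asking for a password
$ ssh dvc-server
```
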
</details>

### Mount the DVC cache

With SSHFS (and the SSH configuration from the section above), we can mount the
remote directory to the project's `.dvc/cache/` like this:

```dvc
$ mkdir -p ~/project/.dvc/cache
$ sshfs \
      dvc-server:/srv/project.cache/ \
      ~/project/.dvc/cache/
```
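
If the mount ever needs to be removed (a side note; this is standard SSHFS
usage, not DVC-specific), on Linux it is typically done with `fusermount`:

```dvc
$ fusermount -u ~/project/.dvc/cache
```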

### Optimize data management

> **shcheklein (Member):** protected mode is missed here
>
> **dashohoxha (Contributor, Author):**
>
> > protected mode is missed here
>
> No it is not missing, maybe you have missed it:
> https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/mounted-cache#optimize-data-management
>
> **shcheklein (Member):** 👍 probably worth mentioning why is it needed. Was
> mostly reading the text that's probably missed this.


Since the cache directory is located on a mounted filesystem, we cannot use the
_reflink_ optimization for data management. However, we can use _symlinks_
(which work across filesystems). We also enable the _protected_ mode, which
makes the linked files in the workspace read-only, so that the shared cache
cannot be corrupted by accidentally editing a linked file in place:

```dvc
$ dvc config cache.type 'reflink,symlink,hardlink,copy'
$ dvc config cache.protected true
```

The configuration file `.dvc/config` should look like this:

```ini
[cache]
type = "reflink,symlink,hardlink,copy"
protected = true
```

This configuration is the same for all the users, so we can add it to Git in
order to share it with the other users:

```dvc
$ git add .dvc/config
$ git commit -m "Use symlinks if reflinks are not available"
$ git push
```

### Sharing data

> **shcheklein (Member):** I think the primary point of using shared cache
> usually is optimize resources - not sharing data - in sense to pass some data
> to another user
>
> **dashohoxha (Contributor, Author):**
>
> > I think the primary point of using shared cache usually is optimize
> > resources - not sharing data - in sense to pass some data to another user
>
> Sorry, I don't understand what you are trying to say here. Can you please
> explain further what is the problem that you perceive?
>
> **shcheklein (Member):** Kk. Let me step back a bit then. May be I'm missing
> the whole point of this PR still. In the index page for this new section, you
> write:
>
> > Like Git, DVC facilitates collaboration and data sharing on a distributed
> > environment. It makes it easy to consistently get all your data files and
> > directories to any machine, along with the source code.
>
> And I understand this more or less. And the primary mechanism for this is
> just regular remotes (that's why I still confused considering that you have
> second PR that is about external data but also includes remotes. So both
> overlap in this sense but not quite focus on remotes management.
>
> So, could you reiterate (and fix index.md?) to communicate what is the goal
> here? What part of the DVC workflow for end users do we cover? How does it
> relate to the second PR?


When we add data to the project with `dvc add` or `dvc run`, some DVC-files are
created, the data is stored in `.dvc/cache/`, and it is linked (with a symlink)
from the workspace.

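For example, adding a data file could look like this (just a sketch; the file
name `data.xml` is a hypothetical placeholder):

```dvc
$ dvc add data.xml
$ git add data.xml.dvc .gitignore
$ git commit -m "Add raw data"
```
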
We can share the DVC-files with:

```dvc
$ git push
```

In order to receive the changes, the other users should do:

```dvc
$ git pull
$ dvc checkout
```

Notice that there is no need to use `dvc push` and `dvc pull` for sharing the
data, because all the collaborating users are effectively using the same
directory for the DVC cache. As soon as one of them saves a file to the cache,
it is immediately available for `dvc checkout` to all the others. All they need
to do is synchronize their DVC-files (with `git push` and `git pull`).
149 changes: 149 additions & 0 deletions static/docs/user-guide/data-sharing/mounted-storage.md
@@ -0,0 +1,149 @@
# Sharing Data Through a Mounted DVC Storage

> **shcheklein (Member):** again, see my other comments. There are at least two
> possibilities - shared cache or shared remote. In case of NAS it's actually
> beneficial to share cache (and use some regular cloud remote to still do
> backups).
>
> **dashohoxha (Contributor, Author):**
>
> > In case of NAS it's actually beneficial to share cache (and use some
> > regular cloud remote to still do backups).
>
> Sharing cache in the case of a NAS may cause problems when we try to use
> dvc gc. I remember seeing some discussions about this.
>
> **shcheklein (Member):** Yes, and we even introduced a special flag - to pass
> multiple projects at once to dvc gc. Gc in DVC is a big pain still but it
> does not change the fact I mentioned above.
>
> **dashohoxha (Contributor, Author):**
>
> > we even introduced a special flag - to pass multiple projects at once to
> > dvc gc
>
> The option -p, --projects of dvc gc gets a path to a project (at least this
> is how I understand the man page, I have never tried it).
>
> In the case of a NAS mounted storage I assume that the collaborating projects
> are located on different machines, isn't it? So, the option -p, --projects
> cannot be used in this case.
>
> **shcheklein (Member):** There are probably ways still to run GC - clone
> projects to a single machine. It's not ideal (all about GC not) but it's a
> maintenance operations vs day to day workflow that is being optimized with
> links if you share cache vs sharing remote directly.
>
> Also, I think it is the same problem with other your cases and in one of them
> it's about people sharing the same machine, right?
>
> **shcheklein (Member), Nov 20, 2019:**
>
> > I may extend the tutorial and the user-guide page to explain this
> > optimization as well.
>
> I would first understand the options in terms of organizing data, understand
> which of them are more general then others, then would try to come up with a
> couple of sections that explain them in a general way. And by general I mean
> concepts like - cache is shared or not? people use a single machine or not?
> etc
>
> **shcheklein (Member):** I don't see how my initial concern is resolved or
> addressed here. Please 🙏 , don't resolve them on your own - it makes it
> extremely hard to do reviews (check and follow up the previously raised
> concerns).
>
> **shcheklein (Member):** To be precise - I don't see much value in three
> (two?) sections that explain different variations of the mounted remote. And
> to be even more precise - I haven't see a mounted share remote case. The only
> benefit I see - unsupported storage type. I would just create a How to or FAQ
> or something with one-two paragraphs explanation on options - mount, use
> rsync/rclone, etc.
>
> **dashohoxha (Contributor, Author):**
>
> > I don't see how my initial concern is resolved or addressed here.
>
> I have added this page:
>
> and this interactive example:
>
> that explain the case of mounted cache, which is more efficient if we share
> data through a NAS (with caveat of being careful with the command dvc gc).
>
> **shcheklein (Member):** It does not answer my question unless again I'm
> missing the whole point of this PR. Could you please elaborate on this:
>
> > To be precise - I don't see much value in three (two?) sections that
> > explain different variations of the mounted remote. And to be even more
> > precise - I haven't see a mounted share remote case. The only benefit I
> > see - unsupported storage type. I would just create a How to or FAQ or
> > something with one-two paragraphs explanation on options - mount, use
> > rsync/rclone, etc.


If the data storage server (or provider) uses a protocol that is not yet
supported by DVC but allows us to mount a remote directory on the local
filesystem, we can still set up data sharing with DVC.

> This can be useful, for example, when the data files are located on
> network-attached storage (NAS) and can be accessed through protocols like
> NFS, Samba, SSHFS, etc.

The solution is very similar to that of a
[Shared Development Server](/doc/user-guide/data-sharing/shared-server): we use
a local DVC storage, which is actually located on the mounted directory.
Whenever we push data to our mounted storage, it immediately becomes available
on the mounted storage of each user. So the data sharing workflow is the usual
one, with `dvc push` and `dvc pull`.

> Unlike the case of the Shared Development Server, the local DVC storage and
> the project cannot be on the same filesystem (because the DVC storage is on a
> mounted remote directory). So the deduplication optimization does not work as
> well: there is a copy of the data on the DVC storage, and at least one copy
> in each user's project.

## Mounted Storage Example

In this example we will see how to share data with the help of a storage
directory that is mounted through SSHFS.

> Normally we don't need to do this, since we can
> [use an SSH remote storage](https://katacoda.com/dvc/courses/examples/ssh-storage)
> directly. But we are using it as an example, since it is easy to
> network-mount a directory with SSHFS. Once you understand how it works, it
> should be easy to implement for other types of mounted storage (like NFS,
> Samba, etc.).

<p align="center">
<img src="/static/img/user-guide/data-sharing/mounted-storage.png"/>
</p>

> For more detailed instructions check out this
> [interactive example](https://katacoda.com/dvc/courses/examples/mounted-storage).

<details>

### Prerequisite: Set up the server

We have to make the following configurations on the SSH server:

- Create accounts for each user and add them to groups for accessing the Git
  repository and the DVC storage.
- Create a bare Git repository (for example at `/srv/project.git/`) and an
  empty directory for the DVC storage (for example at `/srv/project.cache/`).
- Grant the users read/write access to these directories (through the groups).

</details>

<details>

### Prerequisite: Set up each user

When we have to access an SSH server, we definitely want to generate SSH key
pairs and set up the SSH config so that we can access the server without a
password.

Let's assume that for each user we can use the private SSH key
`~/.ssh/dvc-server` to access the server without a password, and that we have
also added to `~/.ssh/config` lines like these:

```
Host dvc-server
    HostName host01
    User user1
    IdentityFile ~/.ssh/dvc-server
    IdentitiesOnly yes
```

Here `dvc-server` is the name or alias that we can use for our server, `host01`
can actually be the IP or the FQDN of the server, and `user1` is the username of
the first user on the server.

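As a quick sanity check (using the placeholder names from the configuration
above), logging in should now work without a password, and the shared
directories created on the server should be visible:

```dvc
# Should log in without asking for a password
$ ssh dvc-server

# Should list the Git repository and the storage directory
$ ssh dvc-server ls /srv/
```
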
</details>

<details>

### Prerequisite: Mount the remote storage directory

With SSHFS (and the SSH configuration from the section above) we can mount the
remote directory of the server to a local one (let's say `$HOME/project.cache`)
like this:

```dvc
$ mkdir -p $HOME/project.cache
$ sshfs \
      dvc-server:/srv/project.cache \
      $HOME/project.cache
```

</details>

### Set the DVC storage

We can set up the project to use `$HOME/project.cache` as a
[local DVC storage](/doc/user-guide/external-data/local#local-dvc-storage) by
adding a _default remote_ like this:

```dvc
$ dvc remote add --local --default \
      mounted-storage $HOME/project.cache

$ dvc remote list --local
mounted-storage /home/username/project.cache
```

Note that this configuration is specific to each user, so we have used the
`--local` option in order to save it in `.dvc/config.local`, which is ignored
by Git.

Now this configuration file should have content like this:

```ini
['remote "mounted-storage"']
url = /home/username/project.cache
[core]
remote = mounted-storage
```

### Sharing data

When we add data to the project with `dvc add` or `dvc run`, some DVC-files are
created and the data is stored in `.dvc/cache/`. We can upload DVC-files to the
Git server with `git push`, and upload the cached files to the DVC storage with
`dvc push`:

```dvc
$ git push
$ dvc push
```

The command `dvc push` copies the cached files from `.dvc/cache/` to
`$HOME/project.cache/`. Since this is a mounted directory, the files are
actually written to the server, and they immediately become available on the
mounted directories of the other users.

The other users can receive the DVC-files and the cached data like this:

```dvc
$ git pull
$ dvc pull
```