Data sharing scenarios #784

Closed · wants to merge 3 commits
27 changes: 27 additions & 0 deletions src/Documentation/sidebar.json
@@ -119,6 +119,33 @@
"label": "Managing External Data",
"slug": "managing-external-data"
},
{
"label": "Data Sharing",
"slug": "data-sharing",
"source": "data-sharing/index.md",
"children": [
{
"label": "Remote DVC Storage",
"slug": "remote-storage"
},
{
"label": "Shared Development Server",
dashohoxha marked this conversation as resolved.
Show resolved Hide resolved
"slug": "shared-server"
},
{
"label": "Mounted DVC Storage",
"slug": "mounted-storage"
},
{
"label": "Mounted DVC Cache",
"slug": "mounted-cache"
},
{
"label": "Synced DVC Storage",
"slug": "synced-storage"
}
]
},
{
"label": "Contributing",
"slug": "contributing",
Expand Down
52 changes: 52 additions & 0 deletions static/docs/user-guide/data-sharing/index.md
@@ -0,0 +1,52 @@
# Data Sharing and Collaboration with DVC

Like Git, DVC facilitates collaboration and data sharing in a distributed
environment. It makes it easy to consistently get all your data files and
directories to any machine, along with the source code.

![](/static/img/model-sharing-digram.png)

There are several ways to set up data sharing with DVC. We will discuss the
most common scenarios.

- [Sharing Data Through a Remote DVC Storage](/doc/user-guide/data-sharing/remote-storage)

This is the recommended and most common data sharing scenario. In this case we
set up a [remote storage](/doc/command-reference/remote) on a data storage
provider, to store data files online where others can reach them. DVC currently
supports Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage, SSH,
HDFS, and other remote locations, and the list is constantly growing. (A
minimal command sketch of this setup is shown at the end of this list.)

- [Using Local Storage on a Shared Development Server](/doc/user-guide/data-sharing/shared-server)

Some teams may prefer to use a single shared machine for running their
experiments. This allows better resource utilization, such as the ability to
use multiple GPUs. In this case we can use local data storage, which allows the
team to store and share data very efficiently, with no duplication of data
files and almost instantaneous transfers.

- [Sharing Data Through a Mounted DVC Storage](/doc/user-guide/data-sharing/mounted-storage)

If the data storage server (or provider) uses a protocol that is not yet
supported by DVC but allows us to mount a remote directory on the local
filesystem, we can still set up data sharing with DVC. This can be useful, for
example, when the data files are located on network-attached storage (NAS) and
can be accessed through protocols like NFS, Samba, SSHFS, etc.

- [Sharing Data Through a Mounted DVC Cache](/doc/user-guide/data-sharing/mounted-cache)

This case is similar to the Mounted DVC Storage (mentioned above), but instead
of mounting the DVC storage from the server, we directly mount the cache
directory (`.dvc/cache/`). If all the users do this, then effectively they are
using the same cache directory (which is mounted from the NAS server). So, if
one of them adds something to the cache, it automatically appears in the cache
of all the others.

- [Sharing Data Through a Synchronized DVC Storage](/doc/user-guide/data-sharing/synced-storage)

There are cloud data storage providers that are not yet supported by DVC, but
this does not mean that we cannot use them to share data with the help of DVC.
If it is possible to synchronize a local directory with a remote one (which
almost all storage providers support), then we can still create a setup that
allows us to share DVC data.
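
As a quick illustration of the first (and recommended) scenario, the setup
usually boils down to a few commands like the following sketch (the S3 bucket
name and path here are just hypothetical placeholders):

```dvc
# Configure a default remote storage (hypothetical bucket/path)
$ dvc remote add --default storage s3://mybucket/dvcstore

# Upload the cached data files to the remote storage
$ dvc push

# On another machine: download the data files referenced by the DVC-files
$ dvc pull
```
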
144 changes: 144 additions & 0 deletions static/docs/user-guide/data-sharing/mounted-cache.md
@@ -0,0 +1,144 @@
# Sharing Data Through a Mounted Cache

> **shcheklein (Member), Nov 25, 2019:** this one replaces the shared server
> use case? I think the explanation in the use case article is better in a
> sense that it gives the context why a single shared server is used in the
> first place
>
> **dashohoxha (Contributor, Author):**
>
> > this one replace the shared server use case?
>
> Sorry, you got this one wrong. This one is replacing the shared server use
> case:
> https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/shared-server
>
> **shcheklein (Member):** @dashohoxha then, I'm even more confused since the
> shared-server.md is still part of this PR.


We have already seen how to share data through a
[mounted DVC storage](/doc/user-guide/data-sharing/mounted-storage). In that
case there is a copy of the data on the DVC storage and at least one copy in
each user's project, since deduplication does not work across filesystems.

However, data management can be further optimized by using a shared cache. The
idea is that instead of mounting the DVC storage from the server, we can
directly mount the cache directory (`.dvc/cache/`). If all the users do this,
then effectively they will be using the same cache directory (which is mounted
from the NAS server). So, if one of them adds something to the cache, it will
automatically appear in the cache of all the others. As a result, no `dvc push`
or `dvc pull` is needed to share the data; a `dvc checkout` is sufficient.

> **❗ Caution:** Deleting data from the cache will also make it disappear from
> the cache of the other users. So be careful with the command `dvc gc` (which
> cleans obsolete data from the cache), and consult the other users of the
> project before running it.

The optimization in data management comes from using the _symlink_ cache type.

> **shcheklein (Member):** it can be hardlinks and reflinks
>
> **dashohoxha (Contributor, Author):**
>
> > it can be hardlinks and reflinks
>
> No, if the cache is mounted from a NAS, neither hardlinks nor reflinks work
> (because they don't work across different filesystems).
>
> I think this is a good example of why these cases need to be explained
> separately and they cannot be consolidated further. If you can be confused on
> such cases, then normal users might be much more confused than you. So, it
> can't be simplified further.
>
> **shcheklein (Member):** If users use the same mount point for their
> workspaces the way your example is written, DVC might pick hardlinks or
> reflinks. Think of a single EC2 box with a single huge SSD xfs on it, being
> used by multiple people.

You can find more details on the
[Large Dataset Optimization](https://dvc.org/doc/user-guide/large-dataset-optimization)
page.

## Mounted Cache Example

> **shcheklein (Member):** I still think that explaining SSHFS here is way too
> much - too specific, takes too much time and distracts from the point even if
> expandable sections are being used
>
> **dashohoxha (Contributor, Author), Nov 26, 2019:**
>
> > I still think that explaining SSHFS here is way too much - too specific,
> > takes too much time and distracts from the point even if expandable
> > sections are being used
>
> This looks like your personal opinion. I don't think the same. I think that
> without some specific details and a concrete example, the explanation would
> be too abstract and much more difficult to understand.
>
> **shcheklein (Member):**
>
> > I think that without some specific details and a concrete example
>
> I don't even disagree with this. It's a matter of number of details and also
> a matter of using 1% concrete example vs using 80% example. SSHFS is way too
> specific and we have too many details. It's like 50% of the page is about
> setting up this.

In this example we will see how to share data with the help of a cache
directory that is mounted through SSHFS. We are using an SSHFS example because
it is easy to network-mount a directory with SSHFS. However, once you
understand how it works, it should be easy to implement for other types of
network-mounted storage (like NFS, Samba, etc.).

> For more detailed instructions check out this
> [interactive example](https://katacoda.com/dvc/courses/examples/mounted-cache).

<p align="center">
<img src="/static/img/user-guide/data-sharing/mounted-cache.png"/>
</p>

<details>

### Prerequisites: Set up the server

We have to make the following configurations on the SSH server (a command
sketch is shown below):

- Create accounts for each user and add them to groups for accessing the Git
  repository and the DVC cache.
- Create a bare Git repository (for example at `/srv/project.git/`) and an
  empty directory for the DVC cache (for example at `/srv/project.cache/`).
- Grant the users read/write access to these directories (through the groups).

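For illustration, here is a minimal sketch of these steps, assuming a
hypothetical group named `dvc-users` and the example paths above (adjust
usernames, group names, and permissions to your environment):

```dvc
# Create a group for the collaborators and add each user account to it
$ sudo groupadd dvc-users
$ sudo usermod -aG dvc-users user1

# Create a bare Git repository and an empty directory for the DVC cache
$ sudo git init --bare /srv/project.git
$ sudo mkdir -p /srv/project.cache

# Give the group read/write access to both directories
$ sudo chgrp -R dvc-users /srv/project.git /srv/project.cache
$ sudo chmod -R g+rws /srv/project.git /srv/project.cache
```
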
</details>

<details>

### Set up each user

When we have to access an SSH server, we definitely want to generate SSH key
pairs and set up the SSH config so that we can access the server without a
password.

Let's assume that for each user we can use the private SSH key
`~/.ssh/dvc-server` to access the server without a password, and that we have
also added to `~/.ssh/config` lines like these:

```
Host dvc-server
    HostName host01
    User user1
    IdentityFile ~/.ssh/dvc-server
    IdentitiesOnly yes
```

Here `dvc-server` is the name or alias that we can use for our server, `host01`
can actually be the IP or the FQDN of the server, and `user1` is the username of
the first user on the server.

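As an illustration, the key pair and the passwordless login can be set up
roughly like this (a sketch; `user1@host01` and the key path are just the
placeholders used above):

```dvc
# Generate a key pair without a passphrase
$ ssh-keygen -t rsa -b 4096 -f ~/.ssh/dvc-server -N ''

# Install the public key on the server, so that passwordless login works
$ ssh-copy-id -i ~/.ssh/dvc-server.pub user1@host01

# Test it: this should log in without asking for a password
$ ssh dvc-server
```
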
</details>

### Mount the DVC cache

With SSHFS (and the SSH configuration from the section above), we can mount the
remote directory to the project's `.dvc/cache/` like this:

```dvc
$ mkdir -p ~/project/.dvc/cache
$ sshfs \
      dvc-server:/srv/project.cache/ \
      ~/project/.dvc/cache/
```
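
If the mount ever needs to be removed (a side note; this is standard SSHFS
usage, not DVC-specific), on Linux it is typically done with `fusermount`:

```dvc
$ fusermount -u ~/project/.dvc/cache
```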

### Optimize data management

> **shcheklein (Member):** protected mode is missed here
>
> **dashohoxha (Contributor, Author):**
>
> > protected mode is missed here
>
> No it is not missing, maybe you have missed it:
> https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/mounted-cache#optimize-data-management
>
> **shcheklein (Member):** 👍 probably worth mentioning why is it needed. Was
> mostly reading the text that's probably missed this.


Since the cache directory is located on a mounted filesystem, we cannot use the
_reflink_ optimization for data management. However, we can use _symlinks_
(which work across filesystems). We also enable the _protected_ mode, which
makes the linked files in the workspace read-only, so that the shared cache
cannot be corrupted by accidentally editing a linked file in place:

```dvc
$ dvc config cache.type 'reflink,symlink,hardlink,copy'
$ dvc config cache.protected true
```

The configuration file `.dvc/config` should look like this:

```ini
[cache]
type = "reflink,symlink,hardlink,copy"
protected = true
```

This configuration is the same for all the users, so we can add it to Git in
order to share it with the other users:

```dvc
$ git add .dvc/config
$ git commit -m "Use symlinks if reflinks are not available"
$ git push
```

### Sharing data

> **shcheklein (Member):** I think the primary point of using shared cache
> usually is optimize resources - not sharing data - in sense to pass some data
> to another user
>
> **dashohoxha (Contributor, Author):**
>
> > I think the primary point of using shared cache usually is optimize
> > resources - not sharing data - in sense to pass some data to another user
>
> Sorry, I don't understand what you are trying to say here. Can you please
> explain further what is the problem that you perceive?
>
> **shcheklein (Member):** Kk. Let me step back a bit then. May be I'm missing
> the whole point of this PR still. In the index page for this new section, you
> write:
>
> > Like Git, DVC facilitates collaboration and data sharing on a distributed
> > environment. It makes it easy to consistently get all your data files and
> > directories to any machine, along with the source code.
>
> And I understand this more or less. And the primary mechanism for this is
> just regular remotes (that's why I still confused considering that you have
> second PR that is about external data but also includes remotes. So both
> overlap in this sense but not quite focus on remotes management.
>
> So, could you reiterate (and fix index.md?) to communicate what is the goal
> here? What part of the DVC workflow for end users do we cover? How does it
> relate to the second PR?


When we add data to the project with `dvc add` or `dvc run`, some DVC-files are
created, the data is stored in `.dvc/cache/`, and it is linked (with a symlink)
from the workspace.

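For example, adding a data file could look like this (just a sketch; the file
name `data.xml` is a hypothetical placeholder):

```dvc
$ dvc add data.xml
$ git add data.xml.dvc .gitignore
$ git commit -m "Add raw data"
```
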
We can share the DVC-files with:

```dvc
$ git push
```

In order to receive the changes, the other users should do:

```dvc
$ git pull
$ dvc checkout
```

Notice that there is no need to use `dvc push` and `dvc pull` for sharing the
data, because all the collaborating users are effectively using the same
directory for the DVC cache. As soon as one of them saves a file to the cache,
it is immediately available for `dvc checkout` to all the others. All they need
to do is synchronize their DVC-files (with `git push` and `git pull`).
149 changes: 149 additions & 0 deletions static/docs/user-guide/data-sharing/mounted-storage.md
@@ -0,0 +1,149 @@
# Sharing Data Through a Mounted DVC Storage

> **shcheklein (Member):** again, see my other comments. There are at least two
> possibilities - shared cache or shared remote. In case of NAS it's actually
> beneficial to share cache (and use some regular cloud remote to still do
> backups).
>
> **dashohoxha (Contributor, Author):**
>
> > In case of NAS it's actually beneficial to share cache (and use some
> > regular cloud remote to still do backups).
>
> Sharing cache in the case of a NAS may cause problems when we try to use
> dvc gc. I remember seeing some discussions about this.
>
> **shcheklein (Member):** Yes, and we even introduced a special flag - to pass
> multiple projects at once to dvc gc. Gc in DVC is a big pain still but it
> does not change the fact I mentioned above.
>
> **dashohoxha (Contributor, Author):**
>
> > we even introduced a special flag - to pass multiple projects at once to
> > dvc gc
>
> The option -p, --projects of dvc gc gets a path to a project (at least this
> is how I understand the man page, I have never tried it).
>
> In the case of a NAS mounted storage I assume that the collaborating projects
> are located on different machines, isn't it? So, the option -p, --projects
> cannot be used in this case.
>
> **shcheklein (Member):** There are probably ways still to run GC - clone
> projects to a single machine. It's not ideal (all about GC not) but it's a
> maintenance operations vs day to day workflow that is being optimized with
> links if you share cache vs sharing remote directly.
>
> Also, I think it is the same problem with other your cases and in one of them
> it's about people sharing the same machine, right?
>
> **shcheklein (Member), Nov 20, 2019:**
>
> > I may extend the tutorial and the user-guide page to explain this
> > optimization as well.
>
> I would first understand the options in terms of organizing data, understand
> which of them are more general then others, then would try to come up with a
> couple of sections that explain them in a general way. And by general I mean
> concepts like - cache is shared or not? people use a single machine or not?
> etc
>
> **shcheklein (Member):** I don't see how my initial concern is resolved or
> addressed here. Please 🙏 , don't resolve them on your own - it makes it
> extremely hard to do reviews (check and follow up the previously raised
> concerns).
>
> **shcheklein (Member):** To be precise - I don't see much value in three
> (two?) sections that explain different variations of the mounted remote. And
> to be even more precise - I haven't see a mounted share remote case. The only
> benefit I see - unsupported storage type. I would just create a How to or FAQ
> or something with one-two paragraphs explanation on options - mount, use
> rsync/rclone, etc.
>
> **dashohoxha (Contributor, Author):**
>
> > I don't see how my initial concern is resolved or addressed here.
>
> I have added this page:
>
> and this interactive example:
>
> that explain the case of mounted cache, which is more efficient if we share
> data through a NAS (with caveat of being careful with the command dvc gc).
>
> **shcheklein (Member):** It does not answer my question unless again I'm
> missing the whole point of this PR. Could you please elaborate on this:
>
> > To be precise - I don't see much value in three (two?) sections that
> > explain different variations of the mounted remote. And to be even more
> > precise - I haven't see a mounted share remote case. The only benefit I
> > see - unsupported storage type. I would just create a How to or FAQ or
> > something with one-two paragraphs explanation on options - mount, use
> > rsync/rclone, etc.


If the data storage server (or provider) uses a protocol that is not yet
supported by DVC but allows us to mount a remote directory on the local
filesystem, we can still set up data sharing with DVC.

> This can be useful, for example, when the data files are located on
> network-attached storage (NAS) and can be accessed through protocols like
> NFS, Samba, SSHFS, etc.

The solution is very similar to that of a
[Shared Development Server](/doc/user-guide/data-sharing/shared-server): we use
a local DVC storage, which is actually located on the mounted directory.
Whenever we push data to our mounted storage, it immediately becomes available
on the mounted storage of each user. So the data sharing workflow is the usual
one, with `dvc push` and `dvc pull`.

> Unlike the case of the Shared Development Server, the local DVC storage and
> the project cannot be on the same filesystem (because the DVC storage is on a
> mounted remote directory). So the deduplication optimization does not work as
> well: there is a copy of the data on the DVC storage, and at least one copy
> in each user's project.

## Mounted Storage Example

In this example we will see how to share data with the help of a storage
directory that is mounted through SSHFS.

> Normally we don't need to do this, since we can
> [use an SSH remote storage](https://katacoda.com/dvc/courses/examples/ssh-storage)
> directly. But we are using it as an example, since it is easy to
> network-mount a directory with SSHFS. Once you understand how it works, it
> should be easy to implement for other types of mounted storage (like NFS,
> Samba, etc.).

<p align="center">
<img src="/static/img/user-guide/data-sharing/mounted-storage.png"/>
</p>

> For more detailed instructions check out this
> [interactive example](https://katacoda.com/dvc/courses/examples/mounted-storage).

<details>

### Prerequisite: Set up the server

We have to make the following configurations on the SSH server:

- Create accounts for each user and add them to groups for accessing the Git
  repository and the DVC storage.
- Create a bare Git repository (for example at `/srv/project.git/`) and an
  empty directory for the DVC storage (for example at `/srv/project.cache/`).
- Grant the users read/write access to these directories (through the groups).

</details>

<details>

### Prerequisite: Set up each user

When we have to access an SSH server, we definitely want to generate SSH key
pairs and set up the SSH config so that we can access the server without a
password.

Let's assume that for each user we can use the private SSH key
`~/.ssh/dvc-server` to access the server without a password, and that we have
also added to `~/.ssh/config` lines like these:

```
Host dvc-server
    HostName host01
    User user1
    IdentityFile ~/.ssh/dvc-server
    IdentitiesOnly yes
```

Here `dvc-server` is the name or alias that we can use for our server, `host01`
can actually be the IP or the FQDN of the server, and `user1` is the username of
the first user on the server.

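As a quick sanity check (using the placeholder names from the configuration
above), logging in should now work without a password, and the shared
directories created on the server should be visible:

```dvc
# Should log in without asking for a password
$ ssh dvc-server

# Should list the Git repository and the storage directory
$ ssh dvc-server ls /srv/
```
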
</details>

<details>

### Prerequisite: Mount the remote storage directory

With SSHFS (and the SSH configuration from the section above) we can mount the
remote directory of the server to a local one (let's say `$HOME/project.cache`)
like this:

```dvc
$ mkdir -p $HOME/project.cache
$ sshfs \
      dvc-server:/srv/project.cache \
      $HOME/project.cache
```

</details>

### Set the DVC storage

We can set up the project to use `$HOME/project.cache` as a
[local DVC storage](/doc/user-guide/external-data/local#local-dvc-storage) by
adding a _default remote_ like this:

```dvc
$ dvc remote add --local --default \
      mounted-storage $HOME/project.cache

$ dvc remote list --local
mounted-storage /home/username/project.cache
```

Note that this configuration is specific to each user, so we have used the
`--local` option in order to save it in `.dvc/config.local`, which is ignored
by Git.

Now this configuration file should have content like this:

```ini
['remote "mounted-storage"']
url = /home/username/project.cache
[core]
remote = mounted-storage
```

### Sharing data

When we add data to the project with `dvc add` or `dvc run`, some DVC-files are
created and the data is stored in `.dvc/cache/`. We can upload DVC-files to the
Git server with `git push`, and upload the cached files to the DVC storage with
`dvc push`:

```dvc
$ git push
$ dvc push
```

The command `dvc push` copies the cached files from `.dvc/cache/` to
`$HOME/project.cache/`. Since this is a mounted directory, the files are
actually written to the server, and they immediately become available on the
mounted directories of the other users.

The other users can receive the DVC-files and the cached data like this:

```dvc
$ git pull
$ dvc pull
```