ref: revert changed related to #2302

iterative · Mar 18, 2021 · fa5a590
1 parent 4d5bd5f
commit fa5a590
Showing 7 changed files with 182 additions and 165 deletions.
diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md
@@ -33,23 +33,23 @@ option to avoid this, and `dvc commit` to finish the process when needed).
 > See also `dvc.yaml` and `dvc run` for more advanced ways to track and version
 > intermediate and final results (like ML models).
 
-After checking that each `target` isn't already tracked with DVC, a few actions
-are taken under the hood:
+After checking that each `target` hasn't been added before (or tracked with
+other DVC commands), a few actions are taken under the hood:
 
 1. Calculate the file hash.
-2. Move the file contents to the cache, using the file hash to form the cached
-   file path (see
+2. Move the file contents to the cache (by default in `.dvc/cache`) (or to
+   remote storage if `--to-remote` is given), using the file hash to form the
+   cached file path. (See
    [Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
-   for details). Using `--out`, or `--to-remote` with an external target, the
-   data is copied instead (to cache or remote storage).
-3. Attempt to replace the file with a link to (or copy of) the cached data (more
-   details on file linking ahead). A new link is created if a different `--out`
-   `path` is given. Skipped if `--to-remote` is used
-4. Create a `.dvc` file to track the file or directory, saving it's path, and
-   the hash as a pointer to the cached data. The `.dvc` file lists the data as
-   an <abbr>output</abbr> (`outs` field). Unless the `--file` option is used,
-   the `.dvc` file name generated by default is `<file>.dvc`, where `<file>` is
-   the file name of the first target.
+   for more details.)
+3. Attempt to replace the file with a link to the cached data (more details on
+   file linking further down). Skipped if `--to-remote` is used.
+4. Create a corresponding `.dvc` file to track the file, using its path and hash
+   to identify the cached data (with `--to-remote`/`-o`, an external path is
+   moved to the workspace). The `.dvc` file lists the DVC-tracked file as an
+   <abbr>output</abbr> (`outs` field). Unless the `--file` option is used, the
+   `.dvc` file name generated by default is `<file>.dvc`, where `<file>` is the
+   file name of the first target.
 5. Add the `targets` to `.gitignore` in order to prevent them from being
    committed to the Git repository (unless `dvc init --no-scm` was used when
    initializing the <abbr>DVC project</abbr>).
@@ -145,32 +145,28 @@ not.
   [pattern](https://docs.python.org/3/library/glob.html) specified in `targets`.
   Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**`
 
-- `-o <path>`, `--out <path>` - destination `path` inside the workspace to place
-  the data target. By default the data file basename is used in the current
-  working directory (if this option isn't used). Directories in the given `path`
-  will be created. Note that for external targets, this can be combined
-  [with an external cache](#example-external-data) to skip the local file
-  system.
-
-- `--to-remote` - allow a target outside of the DVC repository (e.g. an S3
-  object, SSH directory URL, file on mounted volume, etc.) but don't move it
-  into the workspace, nor cache it. [Store a copy](#straight-to-remote) on a DVC
-  remote instead (the default one unless `-r` is specified) to skip the local
-  file system. Use `dvc pull` to get the data later.
-
-- `-r <name>`, `--remote <name>` - name of the
-  [remote](/doc/command-reference/remote) to store data on (can only be used
-  with `--to-remote`).
-
-- `--external` - allow `targets` that are outside of the DVC repository, to
-  track in-place. See
+- `--external` - allow `targets` that are outside of the DVC repository. See
   [Managing External Data](/doc/user-guide/managing-external-data).
 
   > ⚠️ Note that this is an advanced feature for very specific situations and
   > not recommended except if there's absolutely no other alternative.
   > Additionally, this typically requires an external cache setup (see link
   > above).
 
+- `-o <path>`, `--out <path>` - destination `path` to make a local target copy,
+  or to [transfer](#example-transfer-to-cache) an external target into the cache
+  (and link to workspace). Note that this can be combined with `--to-remote` to
+  avoid storing the data locally, while still adding it to the project.
+
+- `--to-remote` - import an external target, but don't move it into the
+  workspace, nor cache it. [Transfer it](#example-transfer-to-remote-storage)
+  directly to remote storage (the default one, unless `-r` is specified)
+  instead. Use `dvc pull` to get the data locally.
+
+- `-r <name>`, `--remote <name>` - name of the
+  [remote storage](/doc/command-reference/remote) to transfer external target to
+  (can only be used with `--to-remote`).
+
 - `--desc <text>` - user description of the data (optional). This doesn't affect
   any DVC operations.
 
@@ -340,82 +336,95 @@ $ tree .dvc/cache
 Only the hash values of the `dir/` directory (with `.dir` file extension) and
 `file2` have been cached.
 
-## Example: External data
+## Example: Transfer to the cache
+
+When you have a large dataset in an external location, you may want to add it to
+the <abbr>project</abbr> without having to copy it into the workspace. Maybe
+your local disk doesn't have enough space, but you have set up an
+[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
+that could handle it.
 
-Sometimes you may want to add a large dataset currently found in an external
-location. But what if there's not enough disk space to download the data? Here's
-one method!
+The `--out` option lets you add external paths in a way that they are
+<abbr>cached</abbr> first, and then
+[linked](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
+to a given path inside the <abbr>workspace</abbr>. Let's initialize an example
+DVC project to try this:
 
-The `--out` option lets you add external so that it's linked to a given path
-inside the <abbr>workspace</abbr> after being copied to the <abbr>cache</abbr>.
-Combined with
-[symlinking](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
-an
-[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache),
-this let's you avoid using the local file system completely.
+```dvc
+$ mkdir example # workspace
+$ cd example
+$ git init
+$ dvc init
+```
 
-For example, we can add a `data.xml` file via HTTP, outputting it to a local
-path in our project:
+Now we can add a `data.xml` file via HTTP, for example, placing it at a local
+path in our project:
 
 ```dvc
-$ dvc add https://data.dvc.org/get-started/data.xml -o raw/data.xml
+$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml
 
 $ ls
 data.xml data.xml.dvc
 ```
 
-The local `data.xml` should be a symlink to the (externally) <abbr>cached</abbr>
-data copy. The resulting `.dvc` file will save the local `path` as if the data
-was already there before this command. Let's check the contents of
-`data.xml.dvc`:
+The resulting `.dvc` file will save the provided local `path` as if the data was
+already in the workspace, while the `md5` hash points to the copy of the data
+that has now been transferred to the <abbr>cache</abbr>. Let's check the
+contents of `data.xml.dvc` in this case:
 
 ```yaml
 outs:
   - md5: a304afb96060aad90176268345e10355
     nfiles: 1
-    path: raw/data.xml
+    path: data.xml
 ```
 
 > For a similar operation that actually keeps a connection to the data source,
 > please see `dvc import-url`.
 
-## Example: `--to-remote` usage {#straight-to-remote}
-
-Here's another method to add a large dataset found in an external location
-without downloading the data (refer to previous example).
+## Example: Transfer to remote storage
 
-The `--to-remote` option lets you store a copy of the target data on a
-[DVC remote](/doc/command-reference/remote), while creating a `.dvc` file
-locally so it can be [pulled](/doc/command-reference/plots) later. This is a way
-to "bootstrap" a project in your local machine, to be
-[reproduced](/doc/command-reference/repro) on the right environment later (e.g.
-a GPU cloud server or a CI/CD system).
-
-Let's setup a simple remote and add a `data.xml` file from the web this way:
+When you have a large dataset in an external location, you may want to track it
+as if it was in your project, but without downloading it locally (for now). The
+`--to-remote` option lets you do so, while storing a copy
+[remotely](/doc/command-reference/remote) so it can be
+[pulled](/doc/command-reference/pull) later. Let's initialize a DVC project,
+and set up a remote:
 
 ```dvc
+$ mkdir example # workspace
+$ cd example
+$ git init
+$ dvc init
 $ mkdir /tmp/dvc-storage
 $ dvc remote add myremote /tmp/dvc-storage
+```
+
+Now let's add `data.xml` to our remote storage from the given external
+location.
+
+```dvc
 $ dvc add https://data.dvc.org/get-started/data.xml -o data.xml \
                  --to-remote -r myremote
 ...
+```
+
+The only difference is that the dataset is transferred straight to remote
+storage, so DVC won't control the original location you gave, but will rather
+manage your remote storage, where the data now lives. The operation still
+results in a `.dvc` file:
+
+```dvc
 $ ls
 data.xml.dvc
 ```
 
-> Note that this can be combined with `--out` to specify a local destination
-> `path` (written to the `.dvc` file).
-
-DVC won't control the original data source after this, but rather continue
-managing your remote storage, where the data is now found. To actually download
-the data to <abbr>cache</abbr>, you can use `dvc fetch` or `dvc pull` as usual
-(on a system that can handle it):
+Whenever anyone wants to actually download the added data (for example from a
+system that can handle it), they can use `dvc pull` as usual:
 
 ```dvc
-$ dvc pull data.xml.dvc -r tmp_remote
+$ dvc pull data.xml.dvc -r myremote
+
 A       data.xml
 1 file added and 1 file fetched
 ```
-
-> Note that `dvc repro` will try to download the data too, as part of the
-> pipeline execution.
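As an illustration of steps 1 and 2 in the `add.md` list above (calculate the file hash, then use it to form the cached file path), here is a minimal Python sketch. It is not DVC's implementation: the helper names `file_md5` and `cache_path` are hypothetical, and the layout (the first two hash characters as a subdirectory under `.dvc/cache`) is an assumption based on the cache-structure page the docs link to, as it applied to DVC releases of this period.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5_hex, cache_dir=os.path.join(".dvc", "cache")):
    """Build a cache path from a hash.

    Assumed layout: the first two hex characters name a subdirectory,
    and the remaining characters name the file inside it.
    """
    return os.path.join(cache_dir, md5_hex[:2], md5_hex[2:])


# Small demo on a throwaway file (hypothetical data, not from the docs).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
demo_hash = file_md5(tmp.name)
demo_path = cache_path(demo_hash)
os.remove(tmp.name)

print(demo_hash)
print(demo_path)
```

Under the same assumption, the `md5` value from the `data.xml.dvc` example (`a304afb96060aad90176268345e10355`) would map to `.dvc/cache/a3/04afb96060aad90176268345e10355`.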
diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md
@@ -24,13 +24,13 @@ repository (e.g. source code, small image/other files). `dvc get` copies the
 target file or directory (found at `path` in `url`) to the current working
 directory. (Analogous to `wget`, but for repos.)
 
-> See `dvc list` for a way to browse repository contents to find files or
-> directories to download.
-
 > Note that unlike `dvc import`, this command does not track the downloaded
 > files (does not create a `.dvc` file). For that reason, it doesn't require an
 > existing DVC project to run in.
 
+> See `dvc list` for a way to browse repository contents to find files or
+> directories to download.
+
 The `url` argument specifies the address of the DVC or Git repository containing
 the data source. Both HTTP and SSH protocols are supported (e.g.
 `[user@]server:project.git`). `url` can also be a local file system path
@@ -56,10 +56,10 @@ name.
 
 ## Options
 
-- `-o <path>`, `--out <path>` - destination `path` to place the downloaded file
-  or directory. By default the data file basename is used in the current working
-  directory (if this option isn't used). Directories in the given `path` will be
-  created.
+- `-o <path>`, `--out <path>` - specify a path to the desired location in the
+  workspace to place the downloaded file or directory (instead of using the
+  current working directory). Directories specified in the path will be created
+  by this command.
 
 - `--rev <commit>` - commit hash, branch or tag name, etc. (any
   [Git revision](https://git-scm.com/docs/revisions)) of the repository to