Commit

ref: revert changed related to #2302
jorgeorpinel committed Mar 18, 2021
1 parent 4d5bd5f commit fa5a590
Showing 7 changed files with 182 additions and 165 deletions.
159 changes: 84 additions & 75 deletions content/docs/command-reference/add.md
@@ -33,23 +33,23 @@ option to avoid this, and `dvc commit` to finish the process when needed).
> See also `dvc.yaml` and `dvc run` for more advanced ways to track and version
> intermediate and final results (like ML models).
After checking that each `target` isn't already tracked with DVC, a few actions
are taken under the hood:
After checking that each `target` hasn't been added before (or tracked with
other DVC commands), a few actions are taken under the hood:

1. Calculate the file hash.
2. Move the file contents to the cache, using the file hash to form the cached
file path (see
2. Move the file contents to the cache (by default in `.dvc/cache`) (or to
remote storage if `--to-remote` is given), using the file hash to form the
cached file path. (See
[Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
for details). Using `--out`, or `--to-remote` with an external target, the
data is copied instead (to cache or remote storage).
3. Attempt to replace the file with a link to (or copy of) the cached data (more
details on file linking ahead). A new link is created if a different `--out`
`path` is given. Skipped if `--to-remote` is used.
4. Create a `.dvc` file to track the file or directory, saving its path, and
the hash as a pointer to the cached data. The `.dvc` file lists the data as
an <abbr>output</abbr> (`outs` field). Unless the `--file` option is used,
the `.dvc` file name generated by default is `<file>.dvc`, where `<file>` is
the file name of the first target.
for more details.)
3. Attempt to replace the file with a link to the cached data (more details on
file linking further down). Skipped if `--to-remote` is used.
4. Create a corresponding `.dvc` file to track the file, using its path and hash
to identify the cached data (with `--to-remote`/`-o`, an external path is
moved to the workspace). The `.dvc` file lists the DVC-tracked file as an
<abbr>output</abbr> (`outs` field). Unless the `--file` option is used, the
`.dvc` file name generated by default is `<file>.dvc`, where `<file>` is the
file name of the first target.
5. Add the `targets` to `.gitignore` in order to prevent them from being
committed to the Git repository (unless `dvc init --no-scm` was used when
initializing the <abbr>DVC project</abbr>).
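
Steps 1 to 3 above can be sketched roughly as follows. This is a minimal illustration, not DVC's actual implementation: `add_to_cache` is a hypothetical helper, the two-character subdirectory layout under the cache is an assumption made for the example, and the hard-link-then-copy fallback is a simplification of the link types DVC supports.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def add_to_cache(target: Path, cache_dir: Path) -> Path:
    """Hypothetical helper mirroring steps 1-3; not DVC's real code."""
    # Step 1: calculate the file hash (MD5, as in the `md5` field of .dvc files).
    digest = hashlib.md5(target.read_bytes()).hexdigest()

    # Step 2: form the cached path from the hash. Assumed layout: the first
    # two hex characters name a subdirectory, the rest names the file.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if cached.exists():
        target.unlink()  # identical content is already cached
    else:
        shutil.move(str(target), str(cached))

    # Step 3: try to replace the file with a link to the cached data,
    # falling back to a copy when hard links are not available.
    try:
        os.link(cached, target)
    except OSError:
        shutil.copy2(cached, target)
    return cached


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    data_file = workspace / "data.xml"
    data_file.write_text("<data/>")

    cached_file = add_to_cache(data_file, workspace / ".dvc" / "cache")
    relative = cached_file.relative_to(workspace).as_posix()
    restored = data_file.read_text()
    print(relative)
```

The workspace file still reads the same after the call, while the cached copy's location depends only on the file contents, which is why identical files end up sharing one cache entry.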
@@ -145,32 +145,28 @@ not.
[pattern](https://docs.python.org/3/library/glob.html) specified in `targets`.
Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**`

- `-o <path>`, `--out <path>` - destination `path` inside the workspace to place
the data target. By default the data file basename is used in the current
working directory (if this option isn't used). Directories in the given `path`
will be created. Note that for external targets, this can be combined
[with an external cache](#example-external-data) to skip the local file
system.

- `--to-remote` - allow a target outside of the DVC repository (e.g. an S3
object, SSH directory URL, file on mounted volume, etc.) but don't move it
into the workspace, nor cache it. [Store a copy](#straight-to-remote) on a DVC
remote instead (the default one unless `-r` is specified) to skip the local
file system. Use `dvc pull` to get the data later.

- `-r <name>`, `--remote <name>` - name of the
[remote](/doc/command-reference/remote) to store data on (can only be used
with `--to-remote`).

- `--external` - allow `targets` that are outside of the DVC repository, to
track in-place. See
- `--external` - allow `targets` that are outside of the DVC repository. See
[Managing External Data](/doc/user-guide/managing-external-data).

> ⚠️ Note that this is an advanced feature for very specific situations and
> not recommended except if there's absolutely no other alternative.
> Additionally, this typically requires an external cache setup (see link
> above).
- `-o <path>`, `--out <path>` - destination `path` to make a local target copy,
or to [transfer](#example-transfer-to-cache) an external target into the cache
(and link to workspace). Note that this can be combined with `--to-remote` to
avoid storing the data locally, while still adding it to the project.

- `--to-remote` - import an external target, but don't move it into the
workspace, nor cache it. [Transfer it](#example-transfer-to-remote-storage)
directly to remote storage (the default one, unless `-r` is specified)
instead. Use `dvc pull` to get the data locally.

- `-r <name>`, `--remote <name>` - name of the
[remote storage](/doc/command-reference/remote) to transfer external target to
(can only be used with `--to-remote`).

- `--desc <text>` - user description of the data (optional). This doesn't affect
any DVC operations.
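
The shell-style wildcards listed for `targets` patterns above follow the Python `glob` conventions that the option links to. The sketch below only illustrates those conventions on a throwaway directory tree; the file names and the `match` helper are made up for the example, and it does not show how DVC itself expands patterns.

```python
import glob
import os
import tempfile


def match(root: str, pattern: str) -> list:
    """Return sorted, root-relative paths matching a glob pattern."""
    hits = glob.glob(os.path.join(root, pattern), recursive=True)
    return sorted(os.path.relpath(h, root).replace(os.sep, "/") for h in hits)


with tempfile.TemporaryDirectory() as root:
    # Hypothetical files, created only for this illustration.
    for rel in ("data/a.csv", "data/b.csv", "data/raw/c.csv", "data/notes.txt"):
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()

    top_level = match(root, "data/*.csv")     # `*` stays within one directory
    recursive = match(root, "data/**/*.csv")  # `**` also descends into subdirectories
    single = match(root, "data/?.csv")        # `?` matches exactly one character
    negated = match(root, "data/[!a].csv")    # `[!seq]` excludes the listed characters

    print(top_level, recursive, single, negated)
```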

@@ -340,82 +336,95 @@ $ tree .dvc/cache
Only the hash values of the `dir/` directory (with `.dir` file extension) and
`file2` have been cached.

## Example: External data
## Example: Transfer to the cache

When you have a large dataset in an external location, you may want to add it to
the <abbr>project</abbr> without having to copy it into the workspace. Maybe
your local disk doesn't have enough space, but you have set up an
[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
that could handle it.

Sometimes you may want to add a large dataset currently found in an external
location. But what if there's not enough disk space to download the data? Here's
one method!
The `--out` option lets you add external paths in a way that they are
<abbr>cached</abbr> first, and then
[linked](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
to a given path inside the <abbr>workspace</abbr>. Let's initialize an example
DVC project to try this:

The `--out` option lets you add external data so that it's linked to a given path
inside the <abbr>workspace</abbr> after being copied to the <abbr>cache</abbr>.
Combined with
[symlinking](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
an
[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache),
this lets you avoid using the local file system completely.
```dvc
$ mkdir example # workspace
$ cd example
$ git init
$ dvc init
```

For example, we can add a `data.xml` file via HTTP, outputting it to a local
path in our project:
Now we can add a `data.xml` file via HTTP, for example, putting it in a local
path in our project:

```dvc
$ dvc add https://data.dvc.org/get-started/data.xml -o raw/data.xml
$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml
$ ls
data.xml data.xml.dvc
```

The local `data.xml` should be a symlink to the (externally) <abbr>cached</abbr>
data copy. The resulting `.dvc` file will save the local `path` as if the data
was already there before this command. Let's check the contents of
`data.xml.dvc`:
The resulting `.dvc` file will save the provided local `path` as if the data was
already in the workspace, while the `md5` hash points to the copy of the data
that has now been transferred to the <abbr>cache</abbr>. Let's check the
contents of `data.xml.dvc` in this case:

```yaml
outs:
- md5: a304afb96060aad90176268345e10355
nfiles: 1
path: raw/data.xml
path: data.xml
```

> For a similar operation that actually keeps a connection to the data source,
> please see `dvc import-url`.

## Example: `--to-remote` usage {#straight-to-remote}

Here's another method to add a large dataset found in an external location
without downloading the data (refer to previous example).
## Example: Transfer to remote storage

The `--to-remote` option lets you store a copy of the target data on a
[DVC remote](/doc/command-reference/remote), while creating a `.dvc` file
locally so it can be [pulled](/doc/command-reference/pull) later. This is a way
to "bootstrap" a project on your local machine, to be
[reproduced](/doc/command-reference/repro) in the right environment later (e.g.
a GPU cloud server or a CI/CD system).

Let's set up a simple remote and add a `data.xml` file from the web this way:
When you have a large dataset in an external location, you may want to track it
as if it was in your project, but without downloading it locally (for now). The
`--to-remote` option lets you do so, while storing a copy
[remotely](/doc/command-reference/remote) so it can be
[pulled](/doc/command-reference/pull) later. Let's initialize a DVC project,
and set up a remote:

```dvc
$ mkdir example # workspace
$ cd example
$ git init
$ dvc init
$ mkdir /tmp/dvc-storage
$ dvc remote add myremote /tmp/dvc-storage
```

Now let's add the `data.xml` to our remote storage from the given remote
location.

```dvc
$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml \
--to-remote -r myremote
...
```

The only difference is that the dataset is transferred straight to the remote,
so DVC won't control the external location you gave, but will rather continue
managing your remote storage, where the data now resides. The operation still
results in a `.dvc` file:

```dvc
$ ls
data.xml.dvc
```

> Note that this can be combined with `--out` to specify a local destination
> `path` (written to the `.dvc` file).

DVC won't control the original data source after this, but rather continue
managing your remote storage, where the data is now found. To actually download
the data to <abbr>cache</abbr>, you can use `dvc fetch` or `dvc pull` as usual
(on a system that can handle it):
Whenever anyone wants to actually download the added data (for example from a
system that can handle it), they can use `dvc pull` as usual:

```dvc
$ dvc pull data.xml.dvc -r myremote
A data.xml
1 file added and 1 file fetched
```

> Note that `dvc repro` will try to download the data too, as part of the
> pipeline execution.
14 changes: 7 additions & 7 deletions content/docs/command-reference/get.md
@@ -24,13 +24,13 @@ repository (e.g. source code, small image/other files). `dvc get` copies the
target file or directory (found at `path` in `url`) to the current working
directory. (Analogous to `wget`, but for repos.)

> See `dvc list` for a way to browse repository contents to find files or
> directories to download.
> Note that unlike `dvc import`, this command does not track the downloaded
> files (does not create a `.dvc` file). For that reason, it doesn't require an
> existing DVC project to run in.
> See `dvc list` for a way to browse repository contents to find files or
> directories to download.
The `url` argument specifies the address of the DVC or Git repository containing
the data source. Both HTTP and SSH protocols are supported (e.g.
`[user@]server:project.git`). `url` can also be a local file system path
@@ -56,10 +56,10 @@ name.

## Options

- `-o <path>`, `--out <path>` - destination `path` to place the downloaded file
or directory. By default the data file basename is used in the current working
directory (if this option isn't used). Directories in the given `path` will be
created.
- `-o <path>`, `--out <path>` - specify a path to the desired location in the
workspace to place the downloaded file or directory (instead of using the
current working directory). Directories specified in the path will be created
by this command.

- `--rev <commit>` - commit hash, branch or tag name, etc. (any
[Git revision](https://git-scm.com/docs/revisions)) of the repository to