Commit

Regular updates (Feb 10+) (#988)
* cmd ref: copy edits and improve `move` examples, adding one with `import`
related to https://discordapp.com/channels/485586884165107732/485596304961962003/676360262416203776

* tutorials: correct section title in versioning
per #933 (comment)

* term: review "point" (file hash context)
per #552 (comment)
jorgeorpinel committed Feb 14, 2020
1 parent 01b5efa commit 8f2e4b3
Showing 10 changed files with 69 additions and 66 deletions.
9 changes: 5 additions & 4 deletions public/static/docs/command-reference/add.md
@@ -72,10 +72,11 @@ to work with directory hierarchies with `dvc add`:
`--no-commit` flag is used).
2. When not using `--recursive` a DVC-file is created for the top of the
directory (with default name `dirname.dvc`). Every file in the hierarchy is
added to the cache (unless `--no-commit` flag is added), but DVC does not
produce individual DVC-files for each file in the directory tree. Instead,
the single DVC-file references a file in the cache that in turn points to the
files in the added hierarchy.
added to the cache (unless the `--no-commit` option is used), but DVC does
not produce individual DVC-files for each file in the directory tree.
Instead, the single DVC-file references a special JSON file in the cache
(with `.dir` extension), that in turn points to the files added from the
hierarchy.
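
To make the `.dir` indirection concrete, here is a minimal Python sketch of how one manifest file can stand in for a whole directory. It is an illustration under stated assumptions, not DVC's implementation: the helper names (`file_md5`, `build_dir_manifest`, `demo`), the `md5`/`relpath` keys, and the way the manifest name is derived from its own hash are all hypothetical.

```python
import hashlib
import json
import os
import tempfile


def file_md5(path):
    """Return the hex MD5 digest of a file's contents."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_dir_manifest(root):
    """List every file under root with its hash (hypothetical layout)."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append({"md5": file_md5(full), "relpath": rel})
    return sorted(entries, key=lambda entry: entry["relpath"])


def demo():
    """Build a manifest for a throwaway directory with two files."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "sub"))
        with open(os.path.join(root, "foo"), "wb") as fobj:
            fobj.write(b"foo\n")
        with open(os.path.join(root, "sub", "bar"), "wb") as fobj:
            fobj.write(b"bar\n")
        manifest = build_dir_manifest(root)
        # Assumption: the manifest is itself stored under the hash of its JSON.
        payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
        manifest_name = hashlib.md5(payload).hexdigest() + ".dir"
        return manifest, manifest_name


if __name__ == "__main__":
    manifest, manifest_name = demo()
    print(manifest_name)
    print(json.dumps(manifest, indent=2))
```

The point of the design is that the single DVC-file only needs to record one hash (that of the manifest), while the manifest carries the per-file hashes.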

In a <abbr>DVC project</abbr>, `dvc add` can be used to version control any
<abbr>data artifact</abbr> (input, intermediate, or output files and
13 changes: 7 additions & 6 deletions public/static/docs/command-reference/get.md
@@ -34,12 +34,13 @@ to an "offline" repo (if it's a DVC repo without a default remote, instead of
downloading, DVC will try to copy the target data from its <abbr>cache</abbr>).

The `path` argument of this command is used to specify the location of the
target to be downloaded within the source repository at `url`. It can point to
any file or directory in there, including <abbr>outputs</abbr> tracked by DVC,
as well as files tracked by Git. Note that for DVC repos, the target should be
found in one of the [DVC-files](/doc/user-guide/dvc-file-format) of the project.
The project should also have a default
[DVC remote](/doc/command-reference/remote), containing the actual data.
target to be downloaded within the source repository at `url`. `path` can
specify any file or directory in the source repo, including <abbr>outputs</abbr>
tracked by DVC, as well as files tracked by Git. Note that for DVC repos, the
target should be found in one of the
[DVC-files](/doc/user-guide/dvc-file-format) of the project. The project should
also have a default [DVC remote](/doc/command-reference/remote), containing the
actual data.

> See `dvc get-url` to download data from other supported locations such as S3,
> SSH, HTTP, etc.
17 changes: 9 additions & 8 deletions public/static/docs/command-reference/import.md
@@ -35,12 +35,13 @@ to an "offline" repo (if it's a DVC repo without a default remote, instead of
downloading, DVC will try to copy the target data from its <abbr>cache</abbr>).

The `path` argument of this command is used to specify the location of the
target to be downloaded within the source repository at `url`. It can point to
any file or directory in there, including <abbr>outputs</abbr> tracked by DVC,
as well as files tracked by Git. Note that for DVC repos, the target should be
found in one of the [DVC-files](/doc/user-guide/dvc-file-format) of the project.
The project should also have a default
[DVC remote](/doc/command-reference/remote), containing the actual data.
target to be downloaded within the source repository at `url`. `path` can
specify any file or directory in the source repo, including <abbr>outputs</abbr>
tracked by DVC, as well as files tracked by Git. Note that for DVC repos, the
target should be found in one of the
[DVC-files](/doc/user-guide/dvc-file-format) of the project. The project should
also have a default [DVC remote](/doc/command-reference/remote), containing the
actual data.

> See `dvc import-url` to download and track data from other supported locations
> such as S3, SSH, HTTP, etc.
@@ -156,8 +157,8 @@ deps:
rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c
```

If `rev` is a Git branch or tag (where the commit it points to changes), the
data source may have updates at a later time. To bring it up to date if so (and
If `rev` is a Git branch or tag (where the underlying commit changes), the data
source may have updates at a later time. To bring it up to date if so (and
update `rev_lock` in the DVC-file), simply use `dvc update <stage>.dvc`. If
`rev` is a specific commit hash (does not change), `dvc update` will never have
an effect on the import stage. You may **re-import** a different commit instead,
42 changes: 21 additions & 21 deletions public/static/docs/command-reference/move.md
@@ -81,31 +81,33 @@ outs:

- `-v`, `--verbose` - displays detailed tracing information.

## Examples
## Example: change an output file name

Here we use `dvc add`to put a file under DVC control. Then we change the name of
it using `dvc move`.
We first use `dvc add` to track a file with DVC. Then, we change its name using
`dvc move`.

```dvc
$ dvc add data.csv
...
$ tree
.
├── data.csv
└── data.csv.dvc
$ dvc move data.csv other.csv
...
$ tree
.
├── other.csv
└── other.csv.dvc
```

Here we use `dvc add` to put a file under DVC control. Then we use `dvc move` to
change its location. Note that the `data.csv.dvc`
[DVC-file](/doc/user-guide/dvc-file-format) is also moved. If target path
already exists and is a directory, data file is moved with unchanged name into
this folder.
## Example: change an output location

We use `dvc add` to track a file with DVC, then we use `dvc move` to change its
location. If target path already exists and is a directory, data file is moved
with unchanged name into this folder. Note that the `data.csv.dvc`
[DVC-file](/doc/user-guide/dvc-file-format) is also moved.

```dvc
$ tree
@@ -116,6 +118,7 @@ $ tree
└── subdir
$ dvc add data/foo
...
$ tree
.
├── data
@@ -125,6 +128,7 @@ $ tree
└── subdir
$ dvc move data/foo data2/subdir/
...
$ tree
.
├── data
@@ -134,28 +138,24 @@ $ tree
└── foo.dvc
```

In this example we use `dvc add` to put a directory under DVC control. Then we
use `dvc move` to move the whole directory. As in other cases, DVC-file is also
moved.
## Example: change an imported directory name and location

```dvc
$ tree
.
├── data
│   ├── bar
│   └── foo
└── data2
Let's try the same with an entire directory imported from an external <abbr>DVC
repository</abbr> with `dvc import`. Note that, as in the previous cases, the
DVC-file is also moved.

$ dvc add data
```dvc
$ dvc import ../another-repo data
...
$ tree
.
├── data
│   ├── bar
│   └── foo
├── data2
└── data.dvc
$ dvc move data data2/data3
...
$ tree
.
└── data2
2 changes: 1 addition & 1 deletion public/static/docs/command-reference/remote/add.md
@@ -24,7 +24,7 @@ positional arguments:
## Description

`name` and `url` are required. `url` specifies a location to store your data. It
can point to a cloud storage service, an SSH server, network-attached storage,
can represent a cloud storage service, an SSH server, network-attached storage,
or even a directory in the local file system. (See all the supported remote
storage types in the examples below.) If `url` is a relative path, it will be
resolved against the current working directory, but saved **relative to the
20 changes: 10 additions & 10 deletions public/static/docs/tutorials/versioning.md
@@ -15,8 +15,8 @@ to build a powerful image classifier using a pretty small dataset.
We first train a classifier model using 1000 labeled images, then we double the
number of images (2000) and retrain our model. We capture both datasets and
classifier results and show how to use `dvc checkout` to switch between data
and/or model versions.
classifier results and show how to use `dvc checkout` to switch between
<abbr>workspace</abbr> versions.

The specific algorithm used to train and validate the classifier is not
important, and no prior knowledge of Keras is required. We'll reuse the
@@ -165,7 +165,7 @@ $ git tag -a "v1.0" -m "model v1.0, 1000 images"
As we mentioned briefly, DVC does not commit the `data/` directory and
`model.h5` file with Git. Instead, `dvc add` stores them in the cache (usually
in `.dvc/cache`) and adds them to `.gitignore`. We then `git commit` DVC-files
that contain pointers to the cached data.
that contain file hashes that point to cached data.

In this case we created `data.dvc` and `model.h5.dvc`. Refer to
[DVC-File Format](/doc/user-guide/dvc-file-format) to learn more about how these
@@ -241,11 +241,11 @@ $ git commit -m "Second model, trained with 2000 images"
$ git tag -a "v2.0" -m "model v2.0, 2000 images"
```

That's it! We have a second model and dataset saved and pointers to them
committed with Git. Let's now look at how DVC can help us go back to the
previous version if we need to.
That's it! We have a second dataset, model, and metrics versioned with DVC, and
the DVC-files that point to them committed with Git. Let's now look at how DVC
can help us go back to the previous version if we need to.

## Switching between data and/or model versions
## Switching between workspace versions

The DVC command that helps get a specific committed version of data is designed
to be similar to `git checkout`. All we need to do in our case is to
@@ -291,8 +291,8 @@ directory inside the repository). Instead, DVC creates
placeholders that point to the cached files, and they can be easily version
controlled with Git.

When we run `git checkout` we restore pointers (DVC-files) first, then when we
run `dvc checkout` we use these pointers to put the right data in the right
When we run `git checkout` we restore pointers (DVC-files) first. Then, when we
run `dvc checkout`, we use these pointers to put the right data in the right
place.
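
The two-step restore described here can be sketched in a few lines of Python. This is a deliberately simplified model, not how DVC is implemented: the `.pointer` file, its JSON keys, and the `add`, `checkout`, and `demo` helpers are hypothetical stand-ins for DVC-files, `dvc add`, and `dvc checkout`, and restoring the saved pointer text stands in for `git checkout`.

```python
import hashlib
import json
import os
import shutil
import tempfile


def add(path, cache_dir):
    """Copy a file into the cache under its MD5 and write a pointer beside it."""
    with open(path, "rb") as fobj:
        md5 = hashlib.md5(fobj.read()).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    shutil.copyfile(path, os.path.join(cache_dir, md5))
    with open(path + ".pointer", "w") as fobj:
        json.dump({"md5": md5, "path": os.path.basename(path)}, fobj)
    return md5


def checkout(pointer_path, cache_dir):
    """Use a pointer to put the matching cached data back in the workspace."""
    with open(pointer_path) as fobj:
        pointer = json.load(fobj)
    target = os.path.join(os.path.dirname(pointer_path), pointer["path"])
    shutil.copyfile(os.path.join(cache_dir, pointer["md5"]), target)
    return target


def demo():
    """Track two versions of a file, then go back to the first one."""
    with tempfile.TemporaryDirectory() as tmp:
        cache = os.path.join(tmp, "cache")
        data = os.path.join(tmp, "model.h5")

        with open(data, "wb") as fobj:
            fobj.write(b"version 1")
        add(data, cache)
        with open(data + ".pointer") as fobj:
            v1_pointer = fobj.read()  # what Git would keep for the first tag

        with open(data, "wb") as fobj:
            fobj.write(b"version 2")
        add(data, cache)

        # Step 1 (stand-in for `git checkout`): restore the old pointer.
        with open(data + ".pointer", "w") as fobj:
            fobj.write(v1_pointer)
        # Step 2 (stand-in for `dvc checkout`): restore the data it points to.
        checkout(data + ".pointer", cache)

        with open(data, "rb") as fobj:
            return fobj.read()


if __name__ == "__main__":
    print(demo())
```

Restoring the pointer alone leaves the workspace file at the newer version; only the second step brings the data in line with the pointer.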

</details>
@@ -312,7 +312,7 @@ When you have a script that takes some data as an input and produces other data
<abbr>outputs</abbr>, a better way to capture them is to use `dvc run`:

> If you tried the commands in the
> [Switching between data and/or model versions](#switching-between-data-and-or-model-versions)
> [Switching between workspace versions](#switching-between-workspace-versions)
> section, go back to the master branch code and data with:
>
> ```dvc
5 changes: 3 additions & 2 deletions public/static/docs/understanding-dvc/core-features.md
@@ -6,8 +6,9 @@
- It makes data science projects **reproducible** by creating lightweight
[pipelines](/doc/command-reference/pipeline) using implicit dependency graphs.

- **Large data file versioning** works by creating pointers in your Git
repository to the <abbr>cache</abbr>, typically stored on a local hard drive.
- **Large data file versioning** works by creating special files in your Git
repository that point to the <abbr>cache</abbr>, typically stored on a local
hard drive.

- DVC is **Programming language agnostic**: Python, R, Julia, shell scripts,
etc. as well as ML library agnostic: Keras, Tensorflow, PyTorch, Scipy, etc.
4 changes: 2 additions & 2 deletions public/static/docs/user-guide/external-dependencies.md
@@ -28,8 +28,8 @@ supported:
> `dvc remote`.
In order to specify an external dependency for your stage, use the usual '-d'
option in `dvc run` with the external path or URL pointing to your desired file
or directory.
option in `dvc run` with the external path or URL to your desired file or
directory.

## Examples

16 changes: 8 additions & 8 deletions public/static/docs/user-guide/large-dataset-optimization.md
@@ -16,17 +16,17 @@ will be duplicated between the workspace and the cache? **That would not be
efficient!** Especially with large files (several Gigabytes or larger).

In order to have the files present in both directories without duplication, DVC
can automatically create **file links** in the workspace that "point" to the
data in cache. In fact, by default it will attempt to use reflinks\* if
supported by the file system.
can automatically create **file links** to the cached data in the workspace. In
fact, by default it will attempt to use reflinks\* if supported by the file
system.

## File link types for the DVC cache

File links are entries in the file system that don't necessarily hold the file
contents, but point to where the file is actually stored. File links are more
common in file systems used with UNIX-like operating systems and come in
different kinds, that differ in how they connect file names to _inodes_ in the
system.
File links are lightweight entries in the file system that don't hold the file
contents, but work as shortcuts to where the original data is actually stored.
They're more common in file systems used with UNIX-like operating systems, and
come in different kinds that differ in how they connect file names to _inodes_
in the system.
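
As a rough illustration of how link kinds differ in the way they connect names to inodes, the following Python sketch creates a hard link and a symbolic link to the same file and compares inode numbers. It assumes a UNIX-like file system; the file names and the `link_demo` helper are made up for this example, and reflinks are not shown because the Python standard library has no portable call for them.

```python
import os
import tempfile


def link_demo():
    """Create a hard link and a symlink to one file and inspect their inodes."""
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "cache_entry")  # stands in for cached data
        with open(cached, "wb") as fobj:
            fobj.write(b"data\n")

        hard = os.path.join(tmp, "hardlink.csv")
        soft = os.path.join(tmp, "symlink.csv")
        os.link(cached, hard)     # second name for the same inode
        os.symlink(cached, soft)  # separate inode that stores a path

        with open(soft, "rb") as fobj:
            soft_content = fobj.read()

        return {
            "hard_same_inode": os.stat(hard).st_ino == os.stat(cached).st_ino,
            "soft_is_link": os.path.islink(soft),
            # lstat() inspects the link itself instead of following it.
            "soft_own_inode": os.lstat(soft).st_ino != os.stat(cached).st_ino,
            "link_count": os.stat(cached).st_nlink,
            "soft_content": soft_content,
        }


if __name__ == "__main__":
    for key, value in link_demo().items():
        print(key, value)
```

In both cases the contents are stored once, which is what lets the workspace and the cache share large files without duplicating them.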

> **Inodes** are metadata file records to locate and store permissions to the
> actual file contents. See **Linking files** in
7 changes: 3 additions & 4 deletions public/static/docs/user-guide/managing-external-data.md
@@ -29,10 +29,9 @@ supported:
> `dvc remote`.
In order to specify an external output for a stage file, use the usual `-o` or
`-O` options of the `dvc run` command, but with the external path or URL
pointing to the file in question. For <abbr>cached</abbr> external outputs
(`-o`) you will need to
[setup an external cache](/doc/command-reference/config#cache) location.
`-O` options of the `dvc run` command, but with the external path or URL to the
file in question. For <abbr>cached</abbr> external outputs (`-o`) you will need
to [setup an external cache](/doc/command-reference/config#cache) location.
Non-cached external outputs (`-O`) do not require an external cache to be setup.

> Avoid using the same remote location that you are using for `dvc push`,
