Regular updates & plots 1.0 update (#1382)
* cmd ref: add note that move creates dirs

* cmd ref: improve structure of add ref desc.

* grammar: add some commas

* term: checksum -> hash value in dvcignore guide

* style: lower case bullet text

* cmd ref: remove some redundancy in metrics index

* cmd ref: update plots refs synopsis and descriptions
per iterative/dvc/issues/3924 et al.

* Add plots modify cmd

* typo: CSV->csv

* term: working tree -> workspace
per iterative/dvc/pull/3914

* cmd ref: couple improvements to add ref
per #1382 (review)
and #1382 (review)

* Update config/prismjs/dvc-commands.js

* cmd ref: update plots modify description

* cmd ref: add plots modify to nav, with a few more improvements

* cmd ref: plots --show-json -> --show-vega
per iterative/dvc#3891 (comment)

* rename x-lab to x-label

* cmd ref: review descriptions of plots index, show, and diff

* cmd ref: review and update old plots cmds options
per iterative/dvc#3948 et al.

* cmd ref: a couple more option updates
per #1382 (review)

* cmd ref: emphasize add works with any large file/dir
per #1382 (review)

* cmd ref: update plots modify top half (definition, description)
per #1382 (review) et al.

* cmd ref: improve all plot cmd option descriptions

* Update content/docs/command-reference/plots/modify.md

* cmd ref: review examples (mainly images) in plots modify
per #1382 (comment) et al.

* cmd ref: rephrase info about how data arrays are injected to plot templates
per #1382 (review)

* cmd ref: update info on how targets work for plots show/diff
per #1382 (review)

* cmd ref: double check all plots examples
per #1382 (comment)

* cmd ref: remove info about plots show --select

* cmd ref: update add desc
per #1382 (review)

* cmd ref: re-explain dvc add for dirs
per #1382 (review)

* cmd ref: improve description about targets in plots diff
per #1382 (review)

* cmd ref: make emoji note in plots index
per #1382 (review)

* cmd ref: remove ineffective CSV code block highlighting from plots refs
per #1382 (review)

* get started: improve intro in index

* glossary: remove external deps entry (no need)

* cmd ref: add info about column indexing for headerless tables
per #1382 (comment)

* cmd ref: update template metavar for plots subcommands
per #1382 (review)

* cmd ref: mention YAML is supported for plots
per #1382 (comment)

* cmd ref: rename template metavar again in plots
per #1382 (comment)

* cmd ref: clarify plots modify --no-csv-header
per #1382 (review)

* cmd ref: add note about plots modify in show and diff

* cmd ref: update all plots options again

* cmd ref: more updates to plots et al. per Ivan's review

* cmd ref: multiple plots diff --targets allowed

* cmd ref: update note about default metrics in index
per #1382 (review)

* cmd ref: emphasize add --recursive is rarely needed
per #1382 (review)

* cmd ref: plots diff: update revisions arg desc
per #1382 (review)

* cmd ref: mention column names and numbers in plots {cmd} -x and -y
per #1382 (review)

* cmd ref: emphasize that metrics diff is not a real diff
per #1382 (review)

* cmd ref: simplify note on plots targets
per #1382 (review)

* cmd ref: how to id columns in plots modify --no-csv-header
per #1382 (review)

* cmd ref: add default target behavior to plots show and diff
rel: #1382 (review)

* cmd ref: rename plots option --no-header
per iterative/dvc/pull/4001

* cmd ref: term: prop->property (plots)
per #1382 (comment)

* cmd ref: more details on metrics index
per #1382 (review)
and #1382 (review)

* cmd ref: more details on plots index
per #1382 (review)
and #1382 (review)

* cmd ref: note about display props in plots modify
per #1382 (review)
and #1382 (review)

Co-authored-by: Dmitry Petrov <dmitry.petrov@nevesomo.com>
jorgeorpinel and dmpetrov committed Jun 11, 2020
1 parent a258bbc commit 37f4e90
Showing 28 changed files with 442 additions and 309 deletions.
1 change: 1 addition & 0 deletions config/prismjs/dvc-commands.js
@@ -22,6 +22,7 @@ module.exports = [
'pull',
'pkg',
'plots show',
'plots modify',
'plots diff',
'plots',
'pipeline show',
66 changes: 35 additions & 31 deletions content/docs/command-reference/add.md
@@ -16,23 +16,27 @@ positional arguments:
## Description

The `dvc add` command is analogous to `git add`, in that it makes DVC aware of
the target data, as a first step to version it. It creates a
the target data, in order to start versioning it. It creates a
[`.dvc` file](/doc/user-guide/dvc-file-format) to track the added data.

The `targets` are files or directories to add with this command, that are turned
into <abbr>data artifacts</abbr> of the <abbr>project</abbr>. By default, these
are committed to the <abbr>cache</abbr> (use the `--no-commit` option to avoid
this, and `dvc commit` to finish the process when needed).
This command can be used to
[version control](/doc/use-cases/versioning-data-and-model-files) large files,
models, dataset directories, etc. that are too big for Git.

Note that [external data](/doc/user-guide/managing-external-data) (targets
outside the <abbr>workspace</abbr>) is supported.
The `targets` are the files or directories to add, which are turned into
<abbr>data artifacts</abbr> of the <abbr>project</abbr>. These are stored in the
<abbr>cache</abbr> by default (use the `--no-commit` option to avoid this, and
`dvc commit` to finish the process when needed).

> See also `dvc run` for more advanced ways to version intermediate and final
> results (like ML models).
Under the hood, a few actions are taken for each file (or directory) in
`targets`:

1. Calculate the file hash.
2. Move the file contents to the cache directory (by default in `.dvc/cache`),
using the file hash to form the cached file names. (See
2. Move the file contents to the cache (by default in `.dvc/cache`), using the
file hash to form the cached file names. (See
[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
for more details.)
3. Attempt to replace the file with a link to the cached data (more details
@@ -59,34 +63,34 @@ files that can be tracked with Git. See
To avoid adding files inside a directory accidentally, you can add the
corresponding [patterns](/doc/user-guide/dvcignore) in a `.dvcignore` file.

By default DVC tries to use reflinks (see
By default, DVC tries to use reflinks (see
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
to avoid copying any file contents and to optimize `.dvc` file operations for
large files. DVC also supports other link types for use on file systems without
`reflink` support, but they have to be specified manually. Refer to the
`cache.type` config option in `dvc config cache` for more information.

A `dvc add` target can be an individual file or a directory. There are two ways
to work with directory hierarchies with `dvc add`:

1. With `dvc add --recursive`, the hierarchy is traversed and every file is
added individually as described above. This means every file has its own
`.dvc` file, and a corresponding cached file is created (unless the
`--no-commit` option is used).
2. When not using `--recursive` a `.dvc` file is created for the top of the
directory (with default name `dirname.dvc`). Every file in the hierarchy is
added to the cache (unless the `--no-commit` option is used), but DVC does
not produce individual `.dvc` files for each file in the directory tree.
Instead, the single `.dvc` file references a special JSON file in the cache
(with `.dir` extension), that in turn points to the files added from the
hierarchy.

`dvc add` is typically used to version control raw data or initial datasets from
which data processing [pipelines](/doc/command-reference/pipeline) are built,
but it can be used to track any large file or directory. We recommend using
`dvc run` to version control intermediate and final results (like ML models).
This way you bring data provenance and make your project
[reproducible](/doc/command-reference/repro).
### Tracking directories

A `dvc add` target can be an individual file or a directory. In the latter case,
a [`.dvc` file](/doc/user-guide/dvc-file-format) is created for the top of the
directory (with default name `<dir_name>.dvc`).

Every file in the hierarchy is added to the cache (unless the `--no-commit`
option is used), but DVC does not produce individual `.dvc` files for each file
in the directory tree. Instead, the single `.dvc` file references a special JSON
file in the cache (with `.dir` extension), that in turn points to the added
files.

Note that DVC commands that use tracked files support granular targeting of
files, even when the directory is added as a whole. Examples: `dvc push`,
`dvc pull`, `dvc get`, `dvc import`, etc.

As a rarely needed alternative, the `--recursive` option causes every file in
the hierarchy to be added individually. A corresponding `.dvc` file will be
generated for each file in the same location. This may be helpful to save time
adding several data files grouped in a structural directory, but it's
undesirable for data directories with a large number of files.
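
As a rough illustration of the hashing, cache naming, and `.dir` manifest ideas described for `dvc add` (this is not DVC's actual code), a minimal Python sketch might look like the following. It assumes MD5 hashes, a cache layout that splits the hash into a two-character directory plus the remainder, and manifest entries with `md5` and `relpath` keys. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir: Path, hash_value: str, is_dir: bool = False) -> Path:
    """Form the cached file name from the hash (assumed layout: ab/cdef...)."""
    suffix = ".dir" if is_dir else ""
    return cache_dir / hash_value[:2] / (hash_value[2:] + suffix)


def dir_manifest(root: Path) -> list:
    """List every file under root with its hash, as a .dir file might."""
    entries = [
        {"md5": file_md5(p), "relpath": p.relative_to(root).as_posix()}
        for p in root.rglob("*")
        if p.is_file()
    ]
    return sorted(entries, key=lambda entry: entry["relpath"])
```

Keeping one manifest per directory, rather than one `.dvc` file per data file, is what lets a whole dataset be tracked through a single small file.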

## Options

51 changes: 22 additions & 29 deletions content/docs/command-reference/metrics/index.md
@@ -46,6 +46,28 @@ $ dvc metrics diff
summary.json AUC 0.801807 0.037826
```

`dvc metrics` subcommands by default use the metric files specified in
`dvc.yaml` (if any), for example `summary.json` below:

```yaml
stages:
train:
cmd: python train.py
deps:
- users.csv
outs:
- model.pkl
metrics:
- summary.json:
cache: false
```

> `cache: false` above specifies that `summary.json` is not tracked or
> <abbr>cached</abbr> by DVC (`-M` option of `dvc run`). These metric files are
> normally committed with Git instead. See
> [`dvc.yaml`](/doc/user-guide/dvc-file-format) for more information on the file
> format above.
### Supported file formats

Metrics can be organized as tree hierarchies in JSON or YAML files. DVC
@@ -69,35 +91,6 @@ DVC itself does not ascribe any specific meaning for these numbers. Usually they
are produced by the model training or model evaluation code and serve as a way
to compare and pick the best performing experiment.
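
To make the idea of tree-structured metrics and their comparison concrete, here is a small Python sketch, offered as an assumption about how such a comparison could be done rather than as DVC's implementation. It flattens nested metric dictionaries into dotted names and subtracts the numeric values that appear in both versions. The helper names `flatten` and `metrics_diff` are hypothetical.

```python
def flatten(tree, prefix=""):
    """Turn nested metrics like {"train": {"AUC": 0.8}} into {"train.AUC": 0.8}."""
    flat = {}
    for key, value in tree.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            # Only numeric leaves count as metrics; strings etc. are skipped.
            flat[name] = value
    return flat


def metrics_diff(old, new):
    """Numeric change (new - old) for every metric present in both versions."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        name: new_flat[name] - old_flat[name]
        for name in new_flat
        if name in old_flat
    }
```

For example, `metrics_diff({"AUC": 0.5}, {"AUC": 0.75})` yields a change of `0.25` for `AUC`.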

### Default metric files

`dvc metrics` subcommands use all metric files that are specified in `dvc.yaml`
by default. There's no need to specify metric file names to see these metrics.
Metric files can be added to `dvc.yaml` with the `--metrics` (`-m`) or
`--metrics-no-cache` (`-M`) options of `dvc run`, or manually to the `metrics`
section of a stage in `dvc.yaml`:

```yaml
stages:
train:
cmd: python train.py
deps:
- users.csv
params:
- epochs
- dropout
- lr
outs:
- model.pkl
metrics:
- summary.json:
cache: false
```

`cache: false` above specifies that `summary.json` is not a data file: it will
not be <abbr>cached</abbr> by DVC. Metric files are normally committed with Git
instead.

## Options

- `-h`, `--help` - prints the usage/help message, and exit.
4 changes: 2 additions & 2 deletions content/docs/command-reference/metrics/show.md
@@ -79,7 +79,7 @@ history use `--all-commits` option:

```dvc
$ dvc metrics show --all-commits
working tree:
workspace:
eval.json:
AUC: 0.66729
error: 0.16982
@@ -100,7 +100,7 @@ Metrics from different branches can be shown by `--all-branches` (`-a`) option:

```dvc
$ dvc metrics show -a
working tree:
workspace:
eval.json:
AUC: 0.66729
error: 0.16982
5 changes: 3 additions & 2 deletions content/docs/command-reference/move.md
@@ -20,8 +20,9 @@ positional arguments:
`dvc move` is useful when a `src` file or directory has previously been added to
the <abbr>project</abbr> with `dvc add`, creating a
[`.dvc` file](/doc/user-guide/dvc-file-format) (with `src` as a dependency).
`dvc move` behaves like `mv src dst`, moving `src` to the given `dst` path, but
it also renames and updates the corresponding `.dvc` file appropriately.
`dvc move` behaves similarly to `mv src dst`, moving `src` to the given `dst`
path, but it also renames and updates the corresponding `.dvc` file
appropriately.

> Note that `src` may be a copy or a
> [link](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
120 changes: 63 additions & 57 deletions content/docs/command-reference/plots/diff.md
@@ -1,81 +1,86 @@
# plots diff

Show multiple versions of [plot metrics](/doc/command-reference/plots) by
plotting them in a single image.
plotting them in a single image. This makes them easy to compare.

## Synopsis

```usage
usage: dvc plots diff [-h] [-q | -v] [-t [TEMPLATE]] [-d [DATAFILE]] [-f FILE]
[-s SELECT] [-x X] [-y Y] [--stdout] [--no-csv-header]
[--no-html] [--title TITLE] [--xlab XLAB] [--ylab YLAB]
[revisions [revisions ...]]
usage: dvc plots diff [-h] [-q | -v] [--targets [<path> [<path> ...]]]
[-t <name_or_path>] [-x <field>] [-y <field>]
[--no-header] [--title <text>]
[--x-label <text>] [--y-label <text>] [-o <path>]
[--show-vega]
[revisions [revisions ...]]
positional arguments:
revisions Git commits to plot from
revisions Git commits to find metrics to compare
```

## Description

This command visualizes the difference between metrics among experiments in the
repository history. Requires that Git is being used to version the metrics
files.
This command is a way to visualize the "difference" between metrics among
experiments in the <abbr>repository</abbr> history, by plotting multiple
versions of the metrics. All plots defined in `dvc.yaml` are used by default.

The metrics file needs to be specified through `-d`/`--datafile` option. Also, a
plot can be customized with
[plot templates](/doc/command-reference/plots#plot-templates) using the
`--template` option. To learn more about the file formats and templates please
see `dvc plots`.
> Note that unlike `dvc metrics diff`, this command does not calculate numeric
> differences between metric file values.
`revisions` are Git commit hashes, tag, or branch names. If none are specified,
`dvc plots diff` compares metrics currently present in the
<abbr>workspace</abbr> (uncommitted changes) with the latest committed version.
A single specified revision results in plotting the difference in metrics
between the workspace and that version.
`dvc plots diff` compares targets currently present in the
<abbr>workspace</abbr> (uncommitted changes) with their latest committed
versions (required). A single specified revision results in comparing the
workspace and that version.

In contrast to commands such as `git diff`, `dvc metrics diff` and
`dvc params diff`, **any number of `revisions` can be provided**, and the
resulting plot shows all of them in a single output.
Note that any number of `revisions` can be provided, and the resulting plot
shows all of them in a single output.

This command can work with metric files that are committed to a repository
history, data files controlled by DVC, or any other file in the workspace. In
the case of DVC-tracked `datafile`, the `revisions` are used to find the
corresponding [DVC-files](/doc/user-guide/dvc-file-format).
The plot style can be customized with
[plot templates](/doc/command-reference/plots#plot-templates), using the
`--template` option. To learn more about metric file formats and templates
please see `dvc plots`.

> Note that the default behavior of this command can be modified per metrics
> file with `dvc plots modify`.
## Options

- `-d [DATAFILE], --datafile [DATAFILE]` - metrics file to visualize.
- `--targets <path>` - specific metric files to visualize. These must be listed
in a [`dvc.yaml`](/doc/user-guide/dvc-file-format) file (see the `--plots`
option of `dvc run`).

- `-o <path>, --out <path>` - name of the generated file. By default, the output
file name is equal to the input filename with a `.html` file extension (or
`.json` when using `--show-vega`).

- `-t [TEMPLATE], --template [TEMPLATE]` -
- `-t <name_or_path>, --template <name_or_path>` -
[plot template](/doc/command-reference/plots#plot-templates) to be injected
with data. The default template is `.dvc/plots/default.json`. See more details
in `dvc plots`.

- `-f FILE, --file FILE` - name of the generated file. By default, the output
file name is equal to the input filename with additional `.html` suffix or
`.json` suffix for `--no-html` mode.

- `--no-html` - do not wrap output Vega specification (JSON) with HTML.

- `-x X` - field name for X axis. An auto-generated `index` field is used by
default.
- `-x <field>` - field name from which the X axis data comes. An
auto-generated `index` field is used by default. See
[Custom templates](/doc/command-reference/plots#custom-templates) for more
information on this `index` field. Column names or numbers are expected for
tabular metric files.

- `-y Y` - field name for Y axis. The last column or field found in the
`datafile` is used by default.
- `-y <field>` - field name from which the Y axis data comes. The last
field found in the `--targets` is used by default. Column names or numbers are
expected for tabular metric files.

- `-s SELECT, --select SELECT` - select which fields or JSONPath to store in the
metrics file [metadata](https://vega.github.io/vega/docs/data/). The
auto-generated, zero-based `index` column is always included.
- `--x-label <text>` - X axis label. The X field name is the default.

- `--xlab XLAB` - X axis title. The X field name is the default title.
- `--y-label <text>` - Y axis label. The Y field name is the default.

- `--ylab YLAB` - Y axis title. The Y field name is the default title.
- `--title <text>` - plot title.

- `--title TITLE` - plot title.
- `--show-vega` - produce a
[Vega specification](https://vega.github.io/vega/docs/specification/) file
instead of HTML. See `dvc plots` for more info.

- `-o, --stdout` - print plot content to stdout.

- `--no-csv-header` - provided CSV or TSV datafile does not have a header.
- `--no-header` - lets DVC know that CSV or TSV `--targets` do not have a
header. A 0-based numeric index can be used to identify each column instead of
names.

- `-h`, `--help` - prints the usage/help message, and exit.

@@ -86,21 +91,22 @@ corresponding [DVC-files](/doc/user-guide/dvc-file-format).

## Examples

To visualize the difference between uncommitted changes of a metrics file and
the last commit:
To compare uncommitted changes of a metrics file and its last committed version:

```dvc
$ dvc plots diff -d logs.csv
$ dvc plots diff --targets logs.csv --x-label x
file:///Users/dmitry/src/plots/logs.html
```

![](/img/plots_auc.svg)

The difference between two versions (commit hashes, tags, or branches can be
provided):
> Note that we renamed the X axis label with option `--x-label x`.
Compare two specific versions (commit hashes, tags, or branches can be provided,
for example):

```dvc
$ dvc plots diff -d logs.csv HEAD 0135527
$ dvc plots diff --targets logs.csv HEAD 0135527
file:///Users/usr/src/plots/logs.csv.html
```

@@ -110,7 +116,7 @@ file:///Users/usr/src/plots/logs.csv.html

We'll use tabular metrics file `classes.csv` for this example:

```csv
```
predicted,actual
cat,cat
cat,cat
@@ -124,13 +130,13 @@ cat,turtle
...
```

A predefined confusion matrix
The predefined confusion matrix
[template](/doc/command-reference/plots#plot-templates) (in
`.dvc/plots/confusion.json`) shows how metric differences can be faceted by
separate plots:
`.dvc/plots/confusion.json`) shows how metric comparisons can be faceted by
separate plots. It can be enabled with `-t` (`--template`):

```dvc
$ dvc plots diff -t confusion -x predicted -d classes.csv
$ dvc plots diff -t confusion --targets classes.csv -x predicted
file:///Users/usr/src/test/plot_old/classes.csv.html
```
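
For readers who want to see what a confusion matrix over a file like `classes.csv` boils down to, the following Python sketch counts each (actual, predicted) pair. It is only an illustration of the underlying tally, not how the `confusion` template is rendered, and `confusion_counts` is a hypothetical helper. It assumes the CSV has the `predicted` and `actual` header columns shown above.

```python
import csv
import io
from collections import Counter


def confusion_counts(csv_text):
    """Count how often each (actual, predicted) pair occurs in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["actual"], row["predicted"]) for row in reader)


# Made-up sample rows in the same shape as classes.csv.
sample = "predicted,actual\ncat,cat\ncat,cat\ndog,cat\ncat,turtle\n"
for (actual, predicted), count in sorted(confusion_counts(sample).items()):
    print(f"actual={actual} predicted={predicted} count={count}")
```

Each distinct pair corresponds to one cell of the matrix, with the count as the cell value.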
