cmd: update commit-related info. #1989

Merged
merged 23 commits into from
Dec 13, 2020
5f5c1ff
cmd: update commit-related info
jorgeorpinel Dec 1, 2020
68c9c1a
Merge branch 'master' into cmd/commit
jorgeorpinel Dec 8, 2020
94eb1f8
cmd: improve commit intro
jorgeorpinel Dec 8, 2020
9c22f7e
cmd: update commit description
jorgeorpinel Dec 8, 2020
45da240
cmd: shorten commit intro
jorgeorpinel Dec 9, 2020
9b06323
cmd: mention that commit is an alternative to add
jorgeorpinel Dec 9, 2020
6ba7ea0
cmd: generalize use case of commit (not just about stages)
jorgeorpinel Dec 9, 2020
98c464f
cmd: separate add from repro cases of commit
jorgeorpinel Dec 10, 2020
a0cb751
cmd: term: don't say "under development"
jorgeorpinel Dec 10, 2020
ba6e109
cmd: clarify commit scenarios
jorgeorpinel Dec 10, 2020
3711dc8
cmd: clarify diffs among -no-cache options in run, repro
jorgeorpinel Dec 10, 2020
ce8ccad
cmd: update import/run --no-exec regarding caching
jorgeorpinel Dec 10, 2020
1c7d4a1
cmd: reinstate note on caching in import refs.
jorgeorpinel Dec 10, 2020
0be1fe0
cmd: rephrase first p in commit
jorgeorpinel Dec 10, 2020
f63a664
cmd: simplify main scenario in commit desc.
jorgeorpinel Dec 10, 2020
12a1a8e
Merge branch 'master' into cmd/commit
jorgeorpinel Dec 12, 2020
97be1b9
cmd: more uses for run -O
jorgeorpinel Dec 12, 2020
144749f
cmd: mention import --no-exec in commit
jorgeorpinel Dec 12, 2020
859e874
cmd: restructure commit desc
jorgeorpinel Dec 12, 2020
cf34cf6
cmd: impro/add motivation to run/repro/import --no-commit/exec
jorgeorpinel Dec 12, 2020
11c2768
cmd: update motivation for --no-exec
jorgeorpinel Dec 13, 2020
36997ed
cmd: Other->Specifically in secondary commit scenarios
jorgeorpinel Dec 13, 2020
b45e324
cmd: simplify import* --no-exec
jorgeorpinel Dec 13, 2020
7 changes: 3 additions & 4 deletions content/docs/command-reference/add.md
@@ -128,10 +128,9 @@ not.
among the `targets`, this option is ignored. For each file found, a new `.dvc`
file is created using the process described in this command's description.

- `--no-commit` - do not save outputs to cache. A `.dvc` file is created, while
nothing is added to the cache. (`dvc status` will report that the file is
`not in cache`.) Use `dvc commit` when ready to commit outputs with DVC. This
is analogous to using `git add` before `git commit`.
- `--no-commit` - do not store `targets` in the cache (the `.dvc` file is still
created). Use `dvc commit` to finish the operation (similar to `git commit`
after `git add`).

- `--file <filename>` - specify name of the `.dvc` file it generates. This
option works only if there is a single target. By default the name of the
123 changes: 56 additions & 67 deletions content/docs/command-reference/commit.md
@@ -1,7 +1,6 @@
# commit

Record changes to DVC-tracked files in the <abbr>project</abbr>, by saving them
to the <abbr>cache</abbr> and updating the `dvc.lock` or `.dvc` files.
Record changes to files or directories tracked by DVC.

## Synopsis

@@ -17,65 +16,54 @@ positional arguments:

## Description

The `dvc commit` command is useful for several scenarios, when data already
tracked by DVC changes: when a [stage](/doc/command-reference/run) or
[pipeline](/doc/command-reference/dag) is in development/experimentation; to
force-update the `dvc.lock` or `.dvc` files without reproducing stages or
pipelines; or to mark existing files/dirs as stage <abbr>outputs</abbr>. These
scenarios are further detailed below.

- Code or data for a stage is under active development, with multiple iterations
(experiments) in code, configuration, or data. Use the `--no-commit` option of
DVC commands (`dvc add`, `dvc run`, `dvc repro`) to avoid caching unnecessary
data repeatedly. Use `dvc commit` when the DVC-tracked data is final.

💡 For convenience, a pre-commit Git hook is available to remind you to
`dvc commit` when needed. See `dvc install` for more details.

- Sometimes we want to edit source code, config, or data files in a way that
doesn't cause changes in the results of their data pipeline. We might
add code comments, change indentation, remove some debugging printouts, or any
other change that doesn't cause changed stage outputs. However, DVC will
notice that some <abbr>dependencies</abbr> have changed, and expect you to
reproduce the whole pipeline. If you're sure no pipeline results would change,
use `dvc commit` to force update the `dvc.lock` or `.dvc` files and cache.

- In some cases, we have previously executed a stage, and later notice that some
of the files/directories used by the stage as dependencies or created as
outputs are missing from `dvc.yaml`. It is possible to
[add missing data to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage),
and then `dvc commit` can be used to save outputs to the cache (and update
`dvc.lock`)

- It's always possible to manually execute the command or source code used in a
stage without DVC (outputs must be unprotected or removed first in certain
cases, see `dvc unprotect`). Once the desired result is reached, use
`dvc commit` to update the `dvc.lock` file(s) and store changed data to the
cache.

Let's take a look at what is happening in the first scenario closely. Normally
DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the
<abbr>cache</abbr> after creating or updating a `dvc.lock` or `.dvc` file. What
_commit_ means is that DVC:

- Computes a hash for the file/directory.
- Enters the hash value and file name in the `dvc.lock` or `.dvc` file.
- Tells Git to ignore the file/directory (adding them to `.gitignore`). (Note
that if the <abbr>project</abbr> was initialized with no Git support
(`dvc init --no-scm`), this does not happen.)
- Adds the file(s) in question to the cache.

There are many cases where the last step is not desirable (for example rapid
iterations on an experiment). The `--no-commit` option prevents it (on the
commands where it's available). The file hash is still computed and added to the
`dvc.lock` or `.dvc` file, but the actual data is not cached. And this is where
the `dvc commit` command comes into play: It performs that last step when
needed.

Note that it's best to avoid the last three scenarios. They essentially
force-update the `dvc.lock` or `.dvc` files and save data to cache. They are
still useful, but keep in mind that DVC can't guarantee reproducibility in those
cases.
<abbr>Caches</abbr> the current contents of files and directories tracked by
DVC, and updates `dvc.lock` or `.dvc` files as needed.

This can be useful for several scenarios, when the project is under development,
or to force DVC to accept changed data (avoiding `dvc add` or `dvc repro`).
We'll expand on these uses below.

Normally, `dvc repro` and `dvc run` finish up with the same steps as `dvc add`,
for each <abbr>output</abbr> involved. In summary:

- Compute the hash value of the file or directory and save it in the `dvc.lock`
or `.dvc` file.
- If using Git, append the file/directory path to `.gitignore`.
- Store the data in question in the <abbr>cache</abbr>.

The last step can be skipped with the `--no-commit` option of those commands,
for example when testing or experimenting during the development of the project.
This avoids caching unfinished data (hash values are still calculated and added
to `dvc.lock` or `.dvc` files). This is where `dvc commit` comes into play: It
performs that last step when needed.
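
The steps above can be sketched with plain shell tools. This is only a simplified simulation under stated assumptions, not how DVC is implemented: the `track` and `commit` functions, the `data.txt.meta` metafile, and the flat `cache/` directory are all hypothetical names chosen for illustration. `track` covers the first two steps (roughly what still happens with `--no-commit`), and `commit` performs the deferred last step.

```shell
#!/bin/sh
# Simplified simulation of the "commit" steps; NOT DVC's actual implementation.
# All names here (data.txt, data.txt.meta, cache/) are hypothetical.
set -eu

cd "$(mktemp -d)"
mkdir cache
printf 'some data\n' > data.txt

# Steps 1 and 2: record the hash in a metafile and ignore the data file in Git.
track() {
  digest=$(md5sum "$1" | cut -d ' ' -f 1)
  printf 'md5: %s\npath: %s\n' "$digest" "$1" > "$1.meta"
  printf '/%s\n' "$1" >> .gitignore
}

# Step 3: store the contents in the cache under the recorded hash.
commit() {
  digest=$(sed -n 's/^md5: //p' "$1.meta")
  cp "$1" "cache/$digest"
}

track data.txt     # hash is recorded, but the cache is still empty
ls cache
commit data.txt    # the deferred step: contents are now stored in the cache
ls cache
```

In this sketch the metafile is complete after `track`, which is why the later `commit` only needs to copy data: it reuses the hash that was already recorded.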

💡 For convenience, a pre-commit Git hook is available to remind you to
`dvc commit` when needed. See `dvc install` for more info.

Other scenarios include:
Member

Not sure Other applies here? it's more like a detailed explanation?

Contributor Author

@jorgeorpinel jorgeorpinel Dec 12, 2020

Hmmm they're meant as different scenarios. The part before this is about add/run/repro --no-commit + dvc commit — main use case. The bullets in this list are

  • dvc add *
  • force-accepting cosmetic changes to dependencies
  • adding missing deps/outs
  • executing commands manually (this one I guess is pretty similar to the main case)

Member

some of them (all of them?) are part of the main use case in this terminology.

add/run/repro --no-commit is not even a use case by itself, right? it doesn't explain a specific need.

Contributor Author

I thought the main reason commit exists as a stand-alone command is to complement the --no-commit options of add/run/repro?

In terms of the "story", it's explained as "forces DVC to accept the contents of tracked data currently in the workspace" in the first p of the description. So do you mean they're all different flavors of that explanation and that there should be a single bullet list (including add/run/repro --no-commit)?

Member

> I thought the main reason commit exists as a stand-alone command is to complement the --no-commit options of add/run/repro?

probably initially it was the case, now we reference it in other places? (like --no-exec?) So we might need to revisit, generalize? (not 100% sure, just asking)

> So do you mean they're all different flavors of that explanation

it seems so (not sure about all)

> that there should be a single bullet list

ah, not necessarily. Just removing Other might help? or rephrasing it a bit if the first paragraph is already general.

again, not very constructive feedback here - just highlighting stuff as I read it, a thing that seemed strange

Contributor Author

Gotcha. I iterated on this to make clear that the first p is a generalization, then a main scenario is explained, and finally a list with other scenarios. PTAL.


- Often we edit source code, configuration, or input data in a way that doesn't
cause changes to any outputs, for example reformatting data, adding code
comments, etc. However, DVC notices all changes to <abbr>dependencies</abbr>
and expects you to re-add the files/dirs, or to reproduce the corresponding
stages. Use `dvc commit` instead as an alternative to `dvc add`, or to force
accepting stage-related changes without having to `dvc repro`.

- Sometimes, after executing a stage, we realize that we forgot to specify some
of its dependencies or outputs in `dvc.yaml`. Fortunately it's possible to
[add the missing deps/outs](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage),
and `dvc commit` may be needed to finalize the remedy (see link).

- It's also possible to execute stage commands by hand (without `dvc repro`), or
to manually modify their output files or directories. Use `dvc commit` to
register the changes with DVC once you're done.

> Note that `dvc unprotect` (or removing the outputs) is usually required
> before rewriting files/dirs tracked by DVC.

Note that it's best to try avoiding these scenarios, where the cache,
`dvc.lock`, and `.dvc` files are force-updated. DVC can't guarantee
reproducibility in those cases.
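
To illustrate why a cosmetic edit is reported as a change, and what force-accepting it amounts to, here is a minimal shell sketch. It is an assumption-laden simulation, not DVC's implementation: the `digest`, `record`, and `status` helpers and the `train.py.md5` file are hypothetical stand-ins for the hash bookkeeping described above.

```shell
#!/bin/sh
# Simplified simulation of change detection; NOT DVC's actual implementation.
# File names (train.py, train.py.md5) and helper names are hypothetical.
set -eu

cd "$(mktemp -d)"
printf 'print("train")\n' > train.py

digest() { md5sum "$1" | cut -d ' ' -f 1; }

# Record the dependency's current hash, as a lock file would.
record() { digest "$1" > "$1.md5"; }

# Report whether the file still matches its recorded hash.
status() {
  if [ "$(digest "$1")" = "$(cat "$1.md5")" ]; then
    echo "up to date"
  else
    echo "changed"
  fi
}

record train.py
printf '# cosmetic comment\n' >> train.py
status train.py    # the edit alters the hash, so this reports "changed"
record train.py    # force-accept the current contents without re-running
status train.py    # reports "up to date" again
```

The second `record` call accepts whatever is in the workspace without checking that outputs would be unaffected, which is the reason reproducibility can't be guaranteed in these scenarios.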

## Options

@@ -228,20 +216,21 @@ ba000ba83b341a423a81eed8ff9238
We've verified that `dvc commit` has saved the changes into the cache, and that
the new instance of `model.pkl` is there.

## Example: Running commands without DVC
## Example: Executing stage commands without DVC

It is also possible to execute the commands that are executed by `dvc repro` by
hand. You won't have DVC helping you, but you have the freedom to run any
command you like, even ones not defined in `dvc.yaml` stages. For example:
Sometimes you may want to execute stage commands manually (instead of using
`dvc repro`). You won't have DVC helping you, but you'll have the freedom to run
any command, even ones not defined in `dvc.yaml`. For example:

```dvc
$ python src/featurization.py data/prepared data/features
$ python src/train.py data/features model.pkl
$ python src/evaluate.py model.pkl data/features auc.metric
```

As before, `dvc status` will show which files have changed, and when your work
is finalized `dvc commit` will commit everything to the <abbr>cache</abbr>.
As before, `dvc status` will show which tracked files/dirs have changed, and
when your work is finalized, `dvc commit` will save the outputs to the
<abbr>cache</abbr>.

## Example: Updating dependencies

22 changes: 9 additions & 13 deletions content/docs/command-reference/repro.md
@@ -26,10 +26,8 @@ implicitly defined by the stages listed in `dvc.yaml`. The commands defined in
these stages can then be executed in the correct order, reproducing pipeline
results.

> Pipeline stages are defined in a
> [`dvc.yaml` file](/doc/user-guide/dvc-files-and-directories#dvcyaml-file)
> (either manually or by using `dvc run`) while initial data dependencies can be
> registered with `dvc add`.
> Pipeline stages are defined in a `dvc.yaml` file (either manually or by using
> `dvc run`) while initial data dependencies can be registered with `dvc add`.

This command is similar to [Make](https://www.gnu.org/software/make/) in
software build automation, but DVC captures build requirements
@@ -105,11 +103,9 @@ up-to-date and only execute the final stage.
target directory and its subdirectories for stages (in `dvc.yaml`) to inspect.
If there are no directories among the targets, this option is ignored.

- `--no-commit` - do not save outputs to cache. A DVC-file is created, while
nothing is added to the cache. (`dvc status` will report that the file is
`not in cache`.) Use `dvc commit` when ready to commit outputs with DVC.
Useful to avoid caching unnecessary data repeatedly when running multiple
experiments.
- `--no-commit` - do not store outputs in the cache (`dvc.yaml` and `dvc.lock`
are still created or updated); useful to avoid caching unnecessary data when
executing tests or experiments. Use `dvc commit` to finish the operation.

- `-m`, `--metrics` - show metrics after reproduction. The target pipelines must
have at least one metrics file defined either with the `dvc metrics` command,
@@ -141,10 +137,10 @@ up-to-date and only execute the final stage.
stages (`A` and below) depend on `requirements.txt`, we can specify it in `A`,
and omit it in `B` and `C`.

Like with the same option on `dvc run`, this is a way to force-execute stages
without changes. This can also be useful for pipelines containing stages that
produce non-deterministic (semi-random) outputs, where outputs can vary on
each execution, meaning the cache cannot be trusted for such stages.
Like with the `--force` option on `dvc run`, this is a way to force-execute
stages without changes. This can also be useful for pipelines containing
stages that produce non-deterministic (semi-random) outputs, where outputs can
vary on each execution, meaning the cache cannot be trusted for such stages.

- `--downstream` - only execute the stages after the given `targets` in their
corresponding pipelines, including the target stages themselves. This option
23 changes: 9 additions & 14 deletions content/docs/command-reference/run.md
@@ -227,14 +227,14 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'
It's used by `dvc repro` to change the working directory before executing the
`command`.

- `--no-exec` - create a stage file, but do not execute the `command` defined in
it, nor cache dependencies or outputs (like with `--no-commit`, explained
below). DVC will also add your outputs to `.gitignore`, same as it would do
without `--no-exec`. Use `dvc commit` to force committing existing output file
versions to cache.
- `--no-exec` - write the stage to `dvc.yaml`, but do not execute its `command`.
Any dependencies and outputs will be entered in `.gitignore`, but won't be
cached (like with `--no-commit` below) or recorded in `dvc.lock`. Use
`dvc commit` to save any existing dep/out files to the cache and record their
hashes to the lock file.

This is useful if, for example, you need to build a pipeline quickly first,
and run it all at once later.
and run it all at once later (with `dvc repro`).

- `-f`, `--force` - overwrite an existing stage in `dvc.yaml` file without
asking for confirmation.
@@ -244,14 +244,9 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'
command's code is non-deterministic
([not recommended](#avoiding-unexpected-behavior)).

- `--no-commit` - do not save outputs to cache. A stage is created, while nothing
is added to the cache. In the stage file, the file hash values will be empty;
They will be populated the next time this stage is actually executed, or
`dvc commit` can be used to force committing existing output file versions to
cache.

This is useful to avoid caching unnecessary data repeatedly when running
multiple experiments.
- `--no-commit` - do not store outputs in the cache (`dvc.yaml` and `dvc.lock`
are still created or updated); useful to avoid caching unnecessary data when
executing tests or experiments. Use `dvc commit` to finish the operation.

- `--always-changed` - always consider this stage as changed (uses the
`always_changed` field in `dvc.yaml`). As a result `dvc status` will report it