How to doc - Adding dependency to stage #1913

Closed
wants to merge 20 commits into from
12 changes: 7 additions & 5 deletions content/docs/command-reference/cache/dir.md
@@ -1,7 +1,7 @@
# cache dir

Set/unset the <abbr>cache</abbr> directory location intuitively (compared to
using `dvc config cache`).
using `dvc config cache`), or shows the current configured value.

## Synopsis

@@ -19,11 +19,13 @@ positional arguments:

Helper to set the `cache.dir` configuration option. (See
[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).)
Unlike doing so with `dvc config cache`, this command transform paths (`value`)
that are provided relative to the current working directory into paths
Unlike doing so with `dvc config cache`, `dvc cache dir` transforms paths
(`value`) that are provided relative to the current working directory into paths
**relative to the config file location**. However, if the `value` provided is an
absolute path, then it's preserved as it is. If no path is provided, it prints
the path for current cache directory.
absolute path, then it's preserved as it is.

If no path `value` is provided to this command, it prints the path of the
current cache directory.
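
The path handling described above can be sketched in Python. This is a minimal illustration only, not DVC's implementation: the function name `to_config_relative`, its parameters, and the example paths are all hypothetical, and POSIX-style paths are assumed.

```python
import os


def to_config_relative(value: str, cwd: str, config_dir: str) -> str:
    """Hypothetical helper mirroring the behavior described in the text."""
    # Absolute paths are preserved as they are.
    if os.path.isabs(value):
        return value
    # A relative value is first resolved against the current working
    # directory, then re-expressed relative to the config file location.
    absolute = os.path.normpath(os.path.join(cwd, value))
    return os.path.relpath(absolute, start=config_dir)


# Assumed layout: config file in /repo/.dvc, command run from /repo/src.
print(to_config_relative("../cache", "/repo/src", "/repo/.dvc"))    # ../cache
print(to_config_relative("/mnt/cache", "/repo/src", "/repo/.dvc"))  # /mnt/cache
```

The sketch keeps the two cases separate on purpose: only relative values depend on where the command is run from.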

## Options

2 changes: 1 addition & 1 deletion content/docs/command-reference/check-ignore.md
@@ -119,7 +119,7 @@ file1
file2
```

It can also be used as a component of a POSIX pipe:
It can also be used as part of a POSIX pipe:

```dvc
cat file_list | dvc check-ignore --stdin
1 change: 0 additions & 1 deletion content/docs/command-reference/unfreeze.md
@@ -10,7 +10,6 @@ usage: dvc unfreeze [-h] [-q | -v] targets [targets ...]

positional arguments:
targets Stages or .dvc files to unfreeze
(see also `dvc freeze`).
```

## Description
2 changes: 1 addition & 1 deletion content/docs/sidebar.json
@@ -100,8 +100,8 @@
"slug": "how-to",
"source": false,
"children": [
"add-output-to-stage",
"undo-adding-data",
"add-deps-or-outs-to-a-stage",
"update-tracked-files"
]
},
12 changes: 6 additions & 6 deletions content/docs/start/index.md
@@ -49,13 +49,13 @@ Changes to be committed:
$ git commit -m "Initialize DVC"
```

DVC features can be grouped into layers. We'll explore them one by one in the
next few sections:
DVC features can be grouped into functional components. We'll explore them one
by one in the next few sections:

- [**Data versioning**](/doc/start/data-versioning) is the core part of DVC for
large files, datasets, machine learning models versioning and efficient
sharing. We'll show how to use a regular Git workflow, without storing large
files with Git. Think "Git for data".
- [**Data versioning**](/doc/start/data-versioning) is the base layer of DVC for
large files, datasets, and machine learning models. It looks like a regular
Git workflow, but without storing large files in the repo (think "Git for
data"). Data is stored separately, which allows for efficient sharing.

- [**Data access**](/doc/start/data-access) shows how to use data artifacts from
outside of the project and how to import data artifacts from another DVC
13 changes: 13 additions & 0 deletions content/docs/use-cases/versioned-storage.md
@@ -0,0 +1,13 @@
# Versioned storage

What if we could **combine data and ML model versioning features with large file
storage** solutions like traditional hard drives, NAS, or cloud services such as
Amazon S3 and Google Drive? DVC brings together the best of both worlds by
implementing easy synchronization between the data <abbr>cache</abbr> and
on-premises or cloud storage for sharing.

![](/img/model-versioning-diagram.png) _DVC's hybrid versioned storage_

> Note that [remote storage](/doc/command-reference/remote) is optional in DVC:
> no server setup or special services are needed, just the `dvc` command-line
> tool.
48 changes: 21 additions & 27 deletions content/docs/user-guide/dvc-files-and-directories.md
@@ -282,22 +282,29 @@ Full <abbr>parameters</abbr> (key and value) are listed separately under

## Structure of the cache directory

There are two ways in which the data is stored in <abbr>cache</abbr>: As a
single file (eg. `data.csv`), or a directory of files.
The DVC cache is a
[content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage),
which adds a layer of indirection between code and data.

For the first case, we calculate the file hash, a 32 characters long string
(usually MD5). The first two characters are used to name the directory inside
`.dvc/cache`, and the rest become the file name of the cached file. For example,
if a data file `Posts.xml.zip` has a hash value of
`ec1d2935f811b77cc49b031b999cbf17`, its path in the cache will be
`.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17`.
There are two ways in which the data is <abbr>cached</abbr>: as a single file
(e.g. `data.csv`), or as a directory.

### For files

DVC calculates the file hash, a 32-character string (usually MD5). The first
two characters are used to name the directory inside `.dvc/cache`, and the rest
become the file name of the cached file. For example, if a data file
`Posts.xml.zip` has a hash value of `ec1d2935f811b77cc49b031b999cbf17`, its path
in the cache will be `.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17`.

> Note that file hashes are calculated from file contents only. Two or more
> files with different names but the same contents can exist in the workspace
> and be tracked by DVC, but only one copy is stored in the cache. This helps
> avoid data duplication in the cache and remotes.
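
The mapping from a file hash to its cache path can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DVC's code: the function names are hypothetical, MD5 is assumed as the hash, and `.dvc/cache` is assumed as the cache directory.

```python
import hashlib
import posixpath


def cache_path_from_hash(file_hash: str, cache_dir: str = ".dvc/cache") -> str:
    """Split a hash into a 2-character directory name and the remaining file name."""
    return posixpath.join(cache_dir, file_hash[:2], file_hash[2:])


def cache_path_for_content(content: bytes, cache_dir: str = ".dvc/cache") -> str:
    """Hash the contents only (names play no part), then map the hash to a path."""
    return cache_path_from_hash(hashlib.md5(content).hexdigest(), cache_dir)


# The hash value used in the example above.
print(cache_path_from_hash("ec1d2935f811b77cc49b031b999cbf17"))
# .dvc/cache/ec/1d2935f811b77cc49b031b999cbf17

# Identical contents map to the same cache path, whatever the files are called.
print(cache_path_for_content(b"same bytes") == cache_path_for_content(b"same bytes"))
# True
```

Because only the contents are hashed, two workspace files with the same bytes resolve to a single cache entry, which is the deduplication the note describes.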

For the second case, let us consider a directory with 2 images.
### For directories

Let's imagine [adding](/doc/command-reference/add) a directory with 2 images:

```dvc
$ tree data/images/
@@ -308,21 +315,10 @@ data/images/
$ dvc add data/images
```

When running `dvc add` on this directory of images, a `data/images.dvc`
[DVC-file](/doc/user-guide/dvc-files-and-directories) is created, containing the
hash value of the directory:

```yaml
outs:
- md5: 196a322c107c2572335158503c64bfba.dir
path: data/images
```

The directory in cache is stored as a JSON file (with `.dir` file extension)
describing it's contents, along with the files it contains in cache, like this:
The directory entry in the cache is stored as a JSON file with a `.dir` file
extension, along with the files it contains in the cache, like this:

```dvc
$ tree .dvc/cache
.dvc/cache/
├── 19
│   └── 6a322c107c2572335158503c64bfba.dir
@@ -332,11 +328,9 @@ $ tree .dvc/cache
    └── 0b40427ee0998e9802335d98f08cd98f
```

The cache file with `.dir` extension is a special text file that contains the
mapping of files in the `data/` directory (as a JSON array), along with their
hash values. The other two cache files are the files inside `data/`.

A typical `.dir` cache file looks like this:
This `.dir` file contains the mapping of files in `data/images` (as a JSON
array), including their hash values. That's how DVC knows that the other two
cached files belong in the directory:

```dvc
$ cat .dvc/cache/19/6a322c107c2572335158503c64bfba.dir
56 changes: 56 additions & 0 deletions content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md
@@ -0,0 +1,56 @@
# Add Deps or Outs to a Stage

There are situations where we have executed a stage (either by writing
`dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later notice
that some of the build requirements are missing from `dvc.yaml`:

- Files or directories in the <abbr>workspace</abbr> that are dependencies of
  the stage but are missing from its `deps` field.

- Output files or directories that the stage creates, and which are already in
  the workspace, but are missing from its `outs` field.

Follow the steps below to add existing files or directories as
<abbr>dependencies</abbr> or <abbr>outputs</abbr> of a stage without
re-executing it, which can be expensive or time-consuming and is unnecessary
here.

We start with an example stage, `prepare`, which has a single dependency and a
single output. To add a missing dependency `data/data.csv` and a missing output
`data/validate` to this stage, we can edit `dvc.yaml` like this:

```git
 stages:
   prepare:
     cmd: python src/prepare.py
     deps:
+    - data/data.csv
     - src/prepare.py
     outs:
     - data/train
+    - data/validate
```
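
If the same edit has to be scripted, it can be sketched as an operation on the parsed contents of `dvc.yaml`. The snippet below is only an illustration: `add_to_stage` is a hypothetical helper, not part of DVC, and it works on a plain dictionary (loading and saving the YAML file with a parser of your choice is assumed and left out). It appends new entries, so their order can differ from the manual edit shown above.

```python
def add_to_stage(pipeline: dict, stage: str, deps=(), outs=()) -> dict:
    """Add missing entries to a stage's `deps` and `outs`, skipping duplicates."""
    entry = pipeline["stages"][stage]
    for field, new_items in (("deps", deps), ("outs", outs)):
        current = entry.setdefault(field, [])
        for item in new_items:
            if item not in current:
                current.append(item)
    return pipeline


# The `prepare` stage from the example, as it looks before the edit.
pipeline = {
    "stages": {
        "prepare": {
            "cmd": "python src/prepare.py",
            "deps": ["src/prepare.py"],
            "outs": ["data/train"],
        }
    }
}

add_to_stage(pipeline, "prepare", deps=["data/data.csv"], outs=["data/validate"])
print(pipeline["stages"]["prepare"]["deps"])  # ['src/prepare.py', 'data/data.csv']
print(pipeline["stages"]["prepare"]["outs"])  # ['data/train', 'data/validate']
```

Running the helper twice leaves the stage unchanged the second time, so it is safe to apply repeatedly.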

> Note that you can also use `dvc run` with the `-f` and `--no-exec` options to
> add the missing dependency and output to the stage:
>
> ```dvc
> $ dvc run -f --no-exec \
>           -n prepare \
>           -d data/data.csv \
>           -d src/prepare.py \
>           -o data/train \
>           -o data/validate \
>           python src/prepare.py
> ```
>
> `-f` overwrites the stage in `dvc.yaml`, while `--no-exec` updates the stage
> without executing it.

Finally, we need to run `dvc commit` to save the newly specified output(s) to
the <abbr>cache</abbr> (and to update the hash values of `deps` and `outs` in
`dvc.lock`):

```dvc
$ dvc commit
```
46 changes: 0 additions & 46 deletions content/docs/user-guide/how-to/add-output-to-stage.md

This file was deleted.

26 changes: 9 additions & 17 deletions content/docs/user-guide/how-to/undo-adding-data.md
@@ -3,41 +3,33 @@
There are situations where you want to stop tracking data added previously.
Follow the steps listed here to undo `dvc add`.

Let's first add a data file into an example <abbr>project</abbr> using
`dvc add`, which creates a `.dvc` file to track the data:
Let's first add a data file into an example <abbr>project</abbr>, which creates
a `.dvc` file to track the data:

```dvc
$ dvc add data.csv
$ ls
data.csv data.csv.dvc
```

> Note, if you are using `symlink` or `hardlink` as
> [link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
> for DVC <abbr>cache</abbr>, you will have to unprotect the tracked file first
> (see `dvc unprotect`):
>
> ```dvc
> $ dvc unprotect data.csv
> ```
> Note, if you're using `symlink` or `hardlink` as the project's
> [link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache),
> you'll have to unprotect the tracked file first (see `dvc unprotect`).

Now let's reverse `dvc add` by removing the corresponding `.dvc` file and
`.gitignore` entry using `dvc remove`:
Now let's reverse that with `dvc remove`. This removes the `.dvc` file (and the
corresponding `.gitignore` entry), so the data file is no longer tracked:

```dvc
$ dvc remove data.csv.dvc
```

Data file `data.csv` is now no longer being tracked by DVC.

```dvc
$ git status
Untracked files:
data.csv
```

You can run `dvc gc` with the `-w` option to remove the data that isn't
referenced in the current workspace from the cache:
referenced in the current workspace from the <abbr>cache</abbr>:

```dvc
$ dvc gc -w
2 changes: 1 addition & 1 deletion content/docs/user-guide/how-to/update-tracked-files.md
@@ -1,4 +1,4 @@
# Updating Tracked Files
# Update Tracked Files

Due to the way DVC handles linking of data files between the
<abbr>cache</abbr> and their counterparts in the <abbr>workspace</abbr> (refer
5 changes: 3 additions & 2 deletions content/docs/user-guide/related-technologies.md
@@ -82,8 +82,9 @@ _Luigi_, etc.

## Experiment management software

- DVC uses Git as the underlying layer for data, pipelines, an experiment
versioning, instead of a custom web application.
- DVC uses Git as the underlying version control layer for data, pipelines, and
experiments. Data versions exist as metadata in Git, as opposed to using
external databases or APIs, so no additional services are required.

- DVC doesn't need to run any services. There's no GUI as a result, but we
expect some GUI services will be created on top of DVC.