Misc. updates #1901

Merged
merged 13 commits into from
Nov 9, 2020
12 changes: 7 additions & 5 deletions content/docs/command-reference/cache/dir.md
@@ -1,7 +1,7 @@
# cache dir

Set/unset the <abbr>cache</abbr> directory location intuitively (compared to
using `dvc config cache`).
using `dvc config cache`), or show the currently configured value.

## Synopsis

@@ -19,11 +19,13 @@ positional arguments:

Helper to set the `cache.dir` configuration option. (See
[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).)
Unlike doing so with `dvc config cache`, this command transform paths (`value`)
that are provided relative to the current working directory into paths
Unlike doing so with `dvc config cache`, `dvc cache dir` transforms paths
(`value`) that are provided relative to the current working directory into paths
**relative to the config file location**. However, if the `value` provided is an
absolute path, then it's preserved as it is. If no path is provided, it prints
the path for current cache directory.
absolute path, then it's preserved as it is.

If no path `value` is provided to this command, it prints the path of the
current cache directory.
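
The path handling described above can be sketched in Python. This is only an illustration under stated assumptions, not DVC's implementation: the helper name `resolve_cache_dir` and the example paths are hypothetical, and the config file is assumed to live in a `.dvc/` directory at the project root.

```python
import os.path


def resolve_cache_dir(value, cwd, config_dir):
    """Return the path that would be stored for `value` (hypothetical helper).

    Absolute paths are kept as they are. Relative paths are interpreted
    relative to `cwd` and rewritten relative to `config_dir`.
    """
    if os.path.isabs(value):
        return value
    absolute = os.path.normpath(os.path.join(cwd, value))
    return os.path.relpath(absolute, start=config_dir)


# Made-up layout: the command runs in /repo/src, the config is in /repo/.dvc.
print(resolve_cache_dir("cache", "/repo/src", "/repo/.dvc"))
print(resolve_cache_dir("/mnt/cache", "/repo/src", "/repo/.dvc"))
```

The first call yields a path relative to the config directory, while the second returns the absolute path unchanged.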

## Options

2 changes: 1 addition & 1 deletion content/docs/command-reference/check-ignore.md
@@ -119,7 +119,7 @@ file1
file2
```

It can also be used as a component of a POSIX pipe:
It can also be used as part of a POSIX pipe:

```dvc
cat file_list | dvc check-ignore --stdin
1 change: 0 additions & 1 deletion content/docs/command-reference/unfreeze.md
@@ -10,7 +10,6 @@ usage: dvc unfreeze [-h] [-q | -v] targets [targets ...]

positional arguments:
targets Stages or .dvc files to unfreeze
(see also `dvc freeze`).
```

## Description
12 changes: 6 additions & 6 deletions content/docs/start/index.md
@@ -49,13 +49,13 @@ Changes to be committed:
$ git commit -m "Initialize DVC"
```

DVC features can be grouped into layers. We'll explore them one by one in the
next few sections:
DVC features can be grouped into functional components. We'll explore them one
by one in the next few sections:

- [**Data versioning**](/doc/start/data-versioning) is the core part of DVC for
large files, datasets, machine learning models versioning and efficient
sharing. We'll show how to use a regular Git workflow, without storing large
files with Git. Think "Git for data".
- [**Data versioning**](/doc/start/data-versioning) is the base layer of DVC for
large files, datasets, and machine learning models. It looks like a regular
Git workflow, but without storing large files in the repo (think "Git for
data"). Data is stored separately, which allows for efficient sharing.

- [**Data access**](/doc/start/data-access) shows how to use data artifacts from
outside of the project and how to import data artifacts from another DVC
48 changes: 21 additions & 27 deletions content/docs/user-guide/dvc-files-and-directories.md
@@ -282,22 +282,29 @@ Full <abbr>parameters</abbr> (key and value) are listed separately under

## Structure of the cache directory

There are two ways in which the data is stored in <abbr>cache</abbr>: As a
single file (eg. `data.csv`), or a directory of files.
The DVC cache is a
[content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage),
which adds a layer of indirection between code and data.

For the first case, we calculate the file hash, a 32 characters long string
(usually MD5). The first two characters are used to name the directory inside
`.dvc/cache`, and the rest become the file name of the cached file. For example,
if a data file `Posts.xml.zip` has a hash value of
`ec1d2935f811b77cc49b031b999cbf17`, its path in the cache will be
`.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17`.
There are two ways in which the data is <abbr>cached</abbr>: as a single file
(e.g. `data.csv`), or as a directory.

### For files

DVC calculates the file hash, a 32-character string (usually MD5). The
first two characters are used to name the directory inside `.dvc/cache`, and the
rest become the file name of the cached file. For example, if a data file
`Posts.xml.zip` has a hash value of `ec1d2935f811b77cc49b031b999cbf17`, its path
in the cache will be `.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17`.
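
The mapping from a hash to a cache path can be sketched as follows. This is a minimal illustration of the rule stated above, not DVC's own code: the function names `path_for_hash` and `path_for_contents` are hypothetical, and MD5 is assumed as the hash.

```python
import hashlib
import posixpath


def path_for_hash(digest, cache_dir=".dvc/cache"):
    """Split a hex digest into a 2-character directory and a file name."""
    return posixpath.join(cache_dir, digest[:2], digest[2:])


def path_for_contents(data, cache_dir=".dvc/cache"):
    """Hash raw bytes with MD5 (an assumption) and map them to a cache path."""
    return path_for_hash(hashlib.md5(data).hexdigest(), cache_dir)


# The hash value from the example above.
print(path_for_hash("ec1d2935f811b77cc49b031b999cbf17"))
# prints .dvc/cache/ec/1d2935f811b77cc49b031b999cbf17
```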

> Note that file hashes are calculated from file contents only. 2 or more files
> with different names but the same contents can exist in the workspace and be
> tracked by DVC, but only one copy is stored in the cache. This helps avoid
> data duplication in cache and remotes.
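
The deduplication described in the note can be illustrated with a toy in-memory store. This is a sketch under assumptions: the `ContentStore` class and the file names are made up for the example, and it keeps everything in dictionaries rather than on disk.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store (hypothetical, in memory only)."""

    def __init__(self):
        self.objects = {}  # hash -> contents (one copy per distinct content)
        self.names = {}    # workspace file name -> hash

    def add(self, name, data):
        digest = hashlib.md5(data).hexdigest()
        # Identical contents map to the same key, so no second copy is stored.
        self.objects.setdefault(digest, data)
        self.names[name] = digest
        return digest


store = ContentStore()
store.add("a.csv", b"x,y\n1,2\n")
store.add("copy_of_a.csv", b"x,y\n1,2\n")  # same contents, different name
store.add("b.csv", b"x,y\n3,4\n")
print(len(store.names), "tracked names,", len(store.objects), "cached objects")
# prints 3 tracked names, 2 cached objects
```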

For the second case, let us consider a directory with 2 images.
### For directories

Let's imagine [adding](/doc/command-reference/add) a directory with 2 images:

```dvc
$ tree data/images/
@@ -308,21 +315,10 @@ data/images/
$ dvc add data/images
```

When running `dvc add` on this directory of images, a `data/images.dvc`
Member

why this change? it looks more or less good for me

Contributor Author

I removed this chunk because it was explaining dvc add/.dvc files, not the cache structure.

I can put it back if you prefer.

[DVC-file](/doc/user-guide/dvc-files-and-directories) is created, containing the
hash value of the directory:

```yaml
outs:
- md5: 196a322c107c2572335158503c64bfba.dir
path: data/images
```

The directory in cache is stored as a JSON file (with `.dir` file extension)
describing it's contents, along with the files it contains in cache, like this:
The directory entry in the cache is stored as a JSON file with a `.dir` file
extension, along with the files it contains in cache, like this:

```dvc
$ tree .dvc/cache
.dvc/cache/
├── 19
│   └── 6a322c107c2572335158503c64bfba.dir
@@ -332,11 +328,9 @@ $ tree .dvc/cache
    └── 0b40427ee0998e9802335d98f08cd98f
```

The cache file with `.dir` extension is a special text file that contains the
mapping of files in the `data/` directory (as a JSON array), along with their
hash values. The other two cache files are the files inside `data/`.

A typical `.dir` cache file looks like this:
This `.dir` file contains the mapping of files in `data/images` (as a JSON
array), including their hash values. That's how DVC knows that the other two
cached files belong in the directory:

```dvc
$ cat .dvc/cache/19/6a322c107c2572335158503c64bfba.dir
5 changes: 3 additions & 2 deletions content/docs/user-guide/related-technologies.md
@@ -82,8 +82,9 @@ _Luigi_, etc.

## Experiment management software

- DVC uses Git as the underlying layer for data, pipelines, an experiment
versioning, instead of a custom web application.
- DVC uses Git as the underlying version control layer for data, pipelines, and
experiments. Data versions exist as metadata in Git, as opposed to using
external databases or APIs, so no additional services are required.

- DVC doesn't need to run any services. There's no GUI as a result, but we
expect some GUI services will be created on top of DVC.