diff --git a/content/docs/command-reference/cache/dir.md b/content/docs/command-reference/cache/dir.md index d2da878f14..9f2cc9e751 100644 --- a/content/docs/command-reference/cache/dir.md +++ b/content/docs/command-reference/cache/dir.md @@ -1,7 +1,7 @@ # cache dir Set/unset the cache directory location intuitively (compared to -using `dvc config cache`). +using `dvc config cache`), or show the currently configured value. ## Synopsis @@ -19,11 +19,13 @@ positional arguments: Helper to set the `cache.dir` configuration option. (See [cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).) -Unlike doing so with `dvc config cache`, this command transform paths (`value`) -that are provided relative to the current working directory into paths +Unlike doing so with `dvc config cache`, `dvc cache dir` transforms paths +(`value`) that are provided relative to the current working directory into paths **relative to the config file location**. However, if the `value` provided is an -absolute path, then it's preserved as it is. If no path is provided, it prints -the path for current cache directory. +absolute path, then it's preserved as it is. + +If no path `value` is provided to this command, it prints the path of the +current cache directory. 
## Options diff --git a/content/docs/command-reference/check-ignore.md b/content/docs/command-reference/check-ignore.md index 21e7f3d74c..60ad9f2889 100644 --- a/content/docs/command-reference/check-ignore.md +++ b/content/docs/command-reference/check-ignore.md @@ -119,7 +119,7 @@ file1 file2 ``` -It can also be used as a component of a POSIX pipe: +It can also be used as part of a POSIX pipe: ```dvc cat file_list | dvc check-ignore --stdin diff --git a/content/docs/command-reference/unfreeze.md b/content/docs/command-reference/unfreeze.md index d1a784dc58..0002bb4045 100644 --- a/content/docs/command-reference/unfreeze.md +++ b/content/docs/command-reference/unfreeze.md @@ -10,7 +10,6 @@ usage: dvc unfreeze [-h] [-q | -v] targets [targets ...] positional arguments: targets Stages or .dvc files to unfreeze - (see also `dvc freeze`). ``` ## Description diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 6794fbff8f..efc7899783 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -100,8 +100,8 @@ "slug": "how-to", "source": false, "children": [ - "add-output-to-stage", "undo-adding-data", + "add-deps-or-outs-to-a-stage", "update-tracked-files" ] }, diff --git a/content/docs/start/index.md b/content/docs/start/index.md index a2f465ca94..060edd3e9a 100644 --- a/content/docs/start/index.md +++ b/content/docs/start/index.md @@ -49,13 +49,13 @@ Changes to be committed: $ git commit -m "Initialize DVC" ``` -DVC features can be grouped into layers. We'll explore them one by one in the -next few sections: +DVC features can be grouped into functional components. We'll explore them one +by one in the next few sections: -- [**Data versioning**](/doc/start/data-versioning) is the core part of DVC for - large files, datasets, machine learning models versioning and efficient - sharing. We'll show how to use a regular Git workflow, without storing large - files with Git. Think "Git for data". 
+- [**Data versioning**](/doc/start/data-versioning) is the base layer of DVC for + large files, datasets, and machine learning models. It looks like a regular + Git workflow, but without storing large files in the repo (think "Git for + data"). Data is stored separately, which allows for efficient sharing. - [**Data access**](/doc/start/data-access) shows how to use data artifacts from outside of the project and how to import data artifacts from another DVC diff --git a/content/docs/use-cases/versioned-storage.md b/content/docs/use-cases/versioned-storage.md new file mode 100644 index 0000000000..3ced05eacf --- /dev/null +++ b/content/docs/use-cases/versioned-storage.md @@ -0,0 +1,13 @@ +# Versioned storage + +What if we could **combine data and ML model versioning features with large file +storage** solutions like traditional hard drives, NAS, or cloud services such as +Amazon S3 and Google Drive? DVC brings together the best of both worlds by +implementing easy synchronization between the data cache and +on-premises or cloud storage for sharing. + +![](/img/model-versioning-diagram.png) _DVC's hybrid versioned storage_ + +> Note that [remote storage](/doc/command-reference/remote) is optional in DVC: +> no server setup or special services are needed, just the `dvc` command-line +> tool. diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 09e2ff8ee7..7398c5903a 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -282,22 +282,29 @@ Full parameters (key and value) are listed separately under ## Structure of the cache directory -There are two ways in which the data is stored in cache: As a -single file (eg. `data.csv`), or a directory of files. +The DVC cache is a +[content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage), +which adds a layer of indirection between code and data. 
-For the first case, we calculate the file hash, a 32 characters long string -(usually MD5). The first two characters are used to name the directory inside -`.dvc/cache`, and the rest become the file name of the cached file. For example, -if a data file `Posts.xml.zip` has a hash value of -`ec1d2935f811b77cc49b031b999cbf17`, its path in the cache will be -`.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17`. +There are two ways in which the data is cached: as a single file +(e.g. `data.csv`), or as a directory. + +### For files + +DVC calculates the file hash, a 32-character string (usually MD5). The +first two characters are used to name the directory inside `.dvc/cache`, and the +rest become the file name of the cached file. For example, if a data file +`Posts.xml.zip` has a hash value of `ec1d2935f811b77cc49b031b999cbf17`, its path +in the cache will be `.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17`. > Note that file hashes are calculated from file contents only. 2 or more files > with different names but the same contents can exist in the workspace and be > tracked by DVC, but only one copy is stored in the cache. This helps avoid > data duplication in cache and remotes. -For the second case, let us consider a directory with 2 images. 
+### For directories + +Let's imagine [adding](/doc/command-reference/add) a directory with 2 images: ```dvc $ tree data/images/ @@ -308,21 +315,10 @@ data/images/ $ dvc add data/images ``` -When running `dvc add` on this directory of images, a `data/images.dvc` -[DVC-file](/doc/user-guide/dvc-files-and-directories) is created, containing the -hash value of the directory: - -```yaml -outs: - - md5: 196a322c107c2572335158503c64bfba.dir - path: data/images -``` - -The directory in cache is stored as a JSON file (with `.dir` file extension) -describing it's contents, along with the files it contains in cache, like this: +The directory entry in the cache is stored as a JSON file with `.dir` file +extension, along with the files it contains in cache, like this: ```dvc -$ tree .dvc/cache .dvc/cache/ ├── 19 │   └── 6a322c107c2572335158503c64bfba.dir @@ -332,11 +328,9 @@ $ tree .dvc/cache     └── 0b40427ee0998e9802335d98f08cd98f ``` -The cache file with `.dir` extension is a special text file that contains the -mapping of files in the `data/` directory (as a JSON array), along with their -hash values. The other two cache files are the files inside `data/`. - -A typical `.dir` cache file looks like this: +This `.dir` file contains the mapping of files in `data/images` (as a JSON +array), including their hash values. 
 That's how DVC knows that the other two +cached files belong in the directory: ```dvc $ cat .dvc/cache/19/6a322c107c2572335158503c64bfba.dir diff --git a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md new file mode 100644 index 0000000000..bc495a6f3a --- /dev/null +++ b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md @@ -0,0 +1,56 @@ +# Add Deps or Outs to a Stage + +There are situations where we have executed a stage (either by writing +`dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later notice +that some of the build requirements are missing from `dvc.yaml`: + +- Files or directories in the workspace that are dependencies of + the stage, are missing from `deps` field. + +- Output files or directories that the stage creates, which are already in the + workspace, are missing from `outs` field. + +Follow the steps below to add existing files/directories as +dependencies or outputs to a stage without +re-executing it again, which can be expensive/time-consuming, and is +unnecessary. + +We start with an example `prepare`, which has a single dependency and output. To +add a missing dependency `data/data.csv`, and output `data/validate` to this +stage, we can edit `dvc.yaml` like this: + +```git + stages: + prepare: + cmd: python src/prepare.py + deps: ++ - data/data.csv + - src/prepare.py + outs: + - data/train ++ - data/validate +``` + +> Note that you can also use `dvc run` with the `-f` and `--no-exec` options to +> add missing dependencies or outputs to the stage: +> +> ```dvc +> $ dvc run -f --no-exec \ +> -n prepare \ +> -d data/data.csv \ +> -d src/prepare.py \ +> -o data/train \ +> -o data/validate \ +> python src/prepare.py +> ``` +> +> `-f` overwrites the stage in `dvc.yaml`, while `--no-exec` updates the stage +> without executing it. 
+ +Finally, we need to run `dvc commit` to save the newly specified output(s) to +the cache (and to update the hash values of `deps` and `outs` in +`dvc.lock`): + +```dvc +$ dvc commit +``` diff --git a/content/docs/user-guide/how-to/add-output-to-stage.md b/content/docs/user-guide/how-to/add-output-to-stage.md deleted file mode 100644 index 79e3ef292d..0000000000 --- a/content/docs/user-guide/how-to/add-output-to-stage.md +++ /dev/null @@ -1,46 +0,0 @@ -# Add Output to Stage - -There are situations where we have executed a stage (either by writing -`dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later notice -that some of the output files or directories it creates, which are already in -the workspace, are missing from `dvc.yaml` (`outs` field). Follow -the steps below to add existing files or directories as outputs to -a stage without re-executing it again, which can be expensive/time-consuming, -and is unnecessary. - -We start with an example `prepare`, which has a single output. To add a missing -output `data/validate` to this stage, we can edit `dvc.yaml` like this: - -```git - stages: - prepare: - cmd: python src/prepare.py - deps: - - src/prepare.py - outs: - - data/train -+ - data/validate -``` - -> Note that you can also use `dvc run` with the `-f` and `--no-exec` options to -> add another output to the stage: -> -> ```dvc -> $ dvc run -f --no-exec \ -> -n prepare \ -> -d src/prepare.py \ -> -o data/train \ -> -o data/validate \ -> python src/prepare.py -> ``` -> -> `-f` overwrites the stage in `dvc.yaml`, while `--no-exec` updates the stage -> without executing it. 
- -Finally, we need to run `dvc commit` to save the newly specified output(s) to -the cache (and to update the corresponding hash values in -`dvc.lock`): - -```dvc -$ dvc commit -``` diff --git a/content/docs/user-guide/how-to/undo-adding-data.md b/content/docs/user-guide/how-to/undo-adding-data.md index d749f8fc5c..e5a59d747c 100644 --- a/content/docs/user-guide/how-to/undo-adding-data.md +++ b/content/docs/user-guide/how-to/undo-adding-data.md @@ -3,8 +3,8 @@ There are situations where you want to stop tracking data added previously. Follow the steps listed here to undo `dvc add`. -Let's first add a data file into an example project using -`dvc add`, which creates a `.dvc` file to track the data: +Let's first add a data file into an example project, which creates +a `.dvc` file to track the data: ```dvc $ dvc add data.csv @@ -12,32 +12,24 @@ $ ls data.csv data.csv.dvc ``` -> Note, if you are using `symlink` or `hardlink` as -> [link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) -> for DVC cache, you will have to unprotect the tracked file first -> (see `dvc unprotect`): -> -> ```dvc -> $ dvc unprotect data.csv -> ``` +> Note, if you're using `symlink` or `hardlink` as the project's +> [link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache), +> you'll have to unprotect the tracked file first (see `dvc unprotect`). -Now let's reverse `dvc add` by removing the corresponding `.dvc` file and -`.gitignore` entry using `dvc remove`: +Now let's reverse that with `dvc remove`. This removes the `.dvc` file (and +corresponding `.gitignore` entry). The data file is now no longer being tracked +after this: ```dvc $ dvc remove data.csv.dvc -``` - -Data file `data.csv` is now no longer being tracked by DVC. 
-```dvc $ git status Untracked files: data.csv ``` You can run `dvc gc` with the `-w` option to remove the data that isn't -referenced in the current workspace from the cache: +referenced in the current workspace from the cache: ```dvc $ dvc gc -w diff --git a/content/docs/user-guide/how-to/update-tracked-files.md b/content/docs/user-guide/how-to/update-tracked-files.md index e74a1d06b6..554974d263 100644 --- a/content/docs/user-guide/how-to/update-tracked-files.md +++ b/content/docs/user-guide/how-to/update-tracked-files.md @@ -1,4 +1,4 @@ -# Updating Tracked Files +# Update Tracked Files Due to the way DVC handles linking between the data files between the cache and their counterparts in the workspace (refer diff --git a/content/docs/user-guide/related-technologies.md b/content/docs/user-guide/related-technologies.md index 3a7d72c704..9be7a4a93f 100644 --- a/content/docs/user-guide/related-technologies.md +++ b/content/docs/user-guide/related-technologies.md @@ -82,8 +82,9 @@ _Luigi_, etc. ## Experiment management software -- DVC uses Git as the underlying layer for data, pipelines, an experiment - versioning, instead of a custom web application. +- DVC uses Git as the underlying version control layer for data, pipelines, and + experiments. Data versions exist as metadata in Git, as opposed to using + external databases or APIs, so no additional services are required. - DVC doesn't need to run any services. There's no GUI as a result, but we expect some GUI services will be created on top of DVC.
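
The cache layout described in the `dvc-files-and-directories.md` hunk above (the first two characters of the hash name a directory under `.dvc/cache`, and the remaining characters become the file name) can be sketched in Python. This is only an illustration of the naming rule as the docs state it, not DVC's actual implementation; `md5_of_file` and `cache_path_for` are hypothetical helper names introduced here.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file's contents, read in chunks.

    Only the contents are hashed, so two files with different names but
    identical contents produce the same digest.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(file_hash, cache_dir=".dvc/cache"):
    """Map a hash to <cache_dir>/<first 2 chars>/<remaining chars>.

    Hypothetical helper: it mirrors the rule described in the docs and
    also works for directory entries, whose hash carries a `.dir` suffix.
    """
    return Path(cache_dir) / file_hash[:2] / file_hash[2:]


# Hash values taken from the examples in the docs being changed.
print(cache_path_for("ec1d2935f811b77cc49b031b999cbf17").as_posix())
print(cache_path_for("196a322c107c2572335158503c64bfba.dir").as_posix())
```

Because the path depends only on the content hash, identical files map to the same cache entry, which is why the docs note that only one copy is stored.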