Commit

Merge branch 'guide/how-to' of https://github.com/iterative/dvc.org i…
…nto guide/how-to
imhardikj committed Nov 9, 2020
2 parents 195d561 + 6ecc645 commit da629e9
Showing 6 changed files with 75 additions and 20 deletions.
2 changes: 1 addition & 1 deletion content/docs/sidebar.json
@@ -100,8 +100,8 @@
"slug": "how-to",
"source": false,
"children": [
"add-output-to-stage",
"undo-adding-data",
"add-outputs-to-a-stage",
"update-tracked-files"
]
},
13 changes: 13 additions & 0 deletions content/docs/use-cases/versioned-storage.md
@@ -0,0 +1,13 @@
# Versioned storage

What if we could **combine data and ML model versioning features with large file
storage** solutions like traditional hard drives, NAS, or cloud services such as
Amazon S3 and Google Drive? DVC brings together the best of both worlds by
implementing easy synchronization between the data <abbr>cache</abbr> and
on-premises or cloud storage for sharing.

![](/img/model-versioning-diagram.png) _DVC's hybrid versioned storage_

> Note that [remote storage](/doc/command-reference/remote) is optional in DVC:
> no server setup or special services are needed, just the `dvc` command-line
> tool.
@@ -0,0 +1,50 @@
# Add Dependency or Output to Stage

There are situations where we have executed a stage (either by writing
`dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later notice
that some of the dependencies, or the output files/directories it creates, which
are already in the <abbr>workspace</abbr>, are missing from `dvc.yaml` (the
`deps` and `outs` fields, respectively). Follow the steps below to add existing
files or directories as <abbr>dependencies</abbr> or <abbr>outputs</abbr> to a
stage without re-executing it, which can be expensive or time-consuming and is
unnecessary.

We start with an example stage, `prepare`, which has a single dependency and output. To
add a missing dependency `data/data.csv` and output `data/validate` to this
stage, we can edit `dvc.yaml` like this:

```git
 stages:
   prepare:
     cmd: python src/prepare.py
     deps:
+    - data/data.csv
     - src/prepare.py
     outs:
     - data/train
+    - data/validate
```
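
The same edit can be sketched programmatically. The snippet below is a minimal illustration, not part of DVC: `add_to_stage` is a hypothetical helper, and it assumes `dvc.yaml` has already been parsed into a plain Python mapping (for example with a YAML library of your choice).

```python
from copy import deepcopy


def add_to_stage(pipeline, stage, deps=(), outs=()):
    """Return a copy of a parsed dvc.yaml mapping with extra deps/outs on one stage.

    Hypothetical helper for illustration only; it is not a DVC API.
    """
    updated = deepcopy(pipeline)
    entry = updated["stages"][stage]  # raises KeyError if the stage is missing
    for field, paths in (("deps", deps), ("outs", outs)):
        current = entry.setdefault(field, [])
        for path in paths:
            if path not in current:  # skip entries that are already listed
                current.append(path)
    return updated


# Assumed in-memory form of the example `prepare` stage before the edit
pipeline = {
    "stages": {
        "prepare": {
            "cmd": "python src/prepare.py",
            "deps": ["src/prepare.py"],
            "outs": ["data/train"],
        }
    }
}

updated = add_to_stage(
    pipeline, "prepare", deps=["data/data.csv"], outs=["data/validate"]
)
print(updated["stages"]["prepare"]["deps"])
print(updated["stages"]["prepare"]["outs"])
```

The helper works on a copy, so the original mapping stays unchanged until the result is written back to `dvc.yaml`.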

> Note that you can also use `dvc run` with the `-f` and `--no-exec` options to
> add another dependency/output to the stage:
>
> ```dvc
> $ dvc run -f --no-exec \
>           -n prepare \
>           -d data/data.csv \
>           -d src/prepare.py \
>           -o data/train \
>           -o data/validate \
>           python src/prepare.py
> ```
>
> `-f` overwrites the stage in `dvc.yaml`, while `--no-exec` updates the stage
> without executing it.

Finally, we need to run `dvc commit` to save the newly specified dependency or
output(s) to the <abbr>cache</abbr> (and to update the corresponding hash values
in `dvc.lock`):

```dvc
$ dvc commit
```
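
To see why the hash values need refreshing, here is a rough sketch of content hashing. It assumes the recorded values are MD5 digests of file contents; `file_md5` is a hypothetical helper, and DVC's own hashing may differ in details (directories, for instance, are handled differently).

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file's contents in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "wb") as fobj:
        fobj.write(b"id,label\n1,cat\n")
    before = file_md5(path)

    # Any change to the contents yields a different digest
    with open(path, "ab") as fobj:
        fobj.write(b"2,dog\n")
    after = file_md5(path)

print(before != after)
```

A digest stored before a file changed no longer matches its contents, which is the kind of mismatch the commit step resolves by recording the current values.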
@@ -1,4 +1,4 @@
-# Add Output to Stage
+# Add Output to a Stage

There are situations where we have executed a stage (either by writing
`dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later notice
26 changes: 9 additions & 17 deletions content/docs/user-guide/how-to/undo-adding-data.md
@@ -3,41 +3,33 @@
There are situations where you want to stop tracking data added previously.
Follow the steps listed here to undo `dvc add`.

-Let's first add a data file into an example <abbr>project</abbr> using
-`dvc add`, which creates a `.dvc` file to track the data:
+Let's first add a data file into an example <abbr>project</abbr>, which creates
+a `.dvc` file to track the data:

```dvc
$ dvc add data.csv
$ ls
data.csv data.csv.dvc
```

-> Note, if you are using `symlink` or `hardlink` as
-> [link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
-> for DVC <abbr>cache</abbr>, you will have to unprotect the tracked file first
-> (see `dvc unprotect`):
->
-> ```dvc
-> $ dvc unprotect data.csv
-> ```
+> Note, if you're using `symlink` or `hardlink` as the project's
+> [link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache),
+> you'll have to unprotect the tracked file first (see `dvc unprotect`).

-Now let's reverse `dvc add` by removing the corresponding `.dvc` file and
-`.gitignore` entry using `dvc remove`:
+Now let's reverse that with `dvc remove`. This removes the `.dvc` file (and
+corresponding `.gitignore` entry). The data file is now no longer being tracked
+after this:

```dvc
$ dvc remove data.csv.dvc
```

-Data file `data.csv` is now no longer being tracked by DVC.

```dvc
$ git status
Untracked files:
data.csv
```
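
For intuition, the cleanup described above (deleting the `.dvc` file and its `.gitignore` entry) can be approximated by hand. This is only a sketch under stated assumptions: `untrack` is a hypothetical helper, not a DVC function, and it assumes the ignore entry has the form `/data.csv` in a `.gitignore` next to the data file.

```python
import os
import tempfile


def untrack(data_path):
    """Delete `<data_path>.dvc` and drop the matching line from the sibling .gitignore.

    Hypothetical helper for illustration; the data file itself is left in place.
    """
    data_path = os.path.abspath(data_path)
    directory, name = os.path.split(data_path)

    dvc_file = data_path + ".dvc"
    if os.path.exists(dvc_file):
        os.remove(dvc_file)

    ignore_path = os.path.join(directory, ".gitignore")
    if os.path.exists(ignore_path):
        with open(ignore_path) as fobj:
            lines = fobj.read().splitlines()
        # Assumed entry format: a leading slash followed by the file name
        kept = [line for line in lines if line.strip() != "/" + name]
        with open(ignore_path, "w") as fobj:
            fobj.write("".join(line + "\n" for line in kept))


with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, "data.csv")
    files = (
        (data, "1,2\n"),
        (data + ".dvc", "outs:\n- path: data.csv\n"),
        (os.path.join(tmp, ".gitignore"), "/data.csv\n/other.bin\n"),
    )
    for path, text in files:
        with open(path, "w") as fobj:
            fobj.write(text)

    untrack(data)

    remaining = sorted(os.listdir(tmp))
    with open(os.path.join(tmp, ".gitignore")) as fobj:
        ignore_text = fobj.read()

print(remaining)
```

In practice `dvc remove` is the supported way to do this; the sketch only shows which files are involved.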

You can run `dvc gc` with the `-w` option to remove the data that isn't
-referenced in the current workspace from the cache:
+referenced in the current workspace from the <abbr>cache</abbr>:

```dvc
$ dvc gc -w
2 changes: 1 addition & 1 deletion content/docs/user-guide/how-to/update-tracked-files.md
@@ -1,4 +1,4 @@
-# Updating Tracked Files
+# Update Tracked Files

Due to the way DVC handles linking of the data files between the
<abbr>cache</abbr> and their counterparts in the <abbr>workspace</abbr> (refer