Merge branch 'guide/how-to' of https://github.com/iterative/dvc.org into guide/how-to
iterative · Nov 9, 2020 · da629e9
2 parents 195d561 + 6ecc645
commit da629e9
Showing 6 changed files with 75 additions and 20 deletions.
diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json
@@ -100,8 +100,8 @@
         "slug": "how-to",
         "source": false,
         "children": [
-          "add-output-to-stage",
           "undo-adding-data",
+          "add-outputs-to-a-stage",
           "update-tracked-files"
         ]
       },

diff --git a/content/docs/use-cases/versioned-storage.md b/content/docs/use-cases/versioned-storage.md
@@ -0,0 +1,13 @@
+# Versioned storage
+
+What if we could **combine data and ML model versioning features with large file
+storage** solutions like traditional hard drives, NAS, or cloud services such as
+Amazon S3 and Google Drive? DVC brings together the best of both worlds by
+implementing easy synchronization between the data <abbr>cache</abbr> and
+on-premises or cloud storage for sharing.
+
+![](/img/model-versioning-diagram.png) _DVC's hybrid versioned storage_
+
+> Note that [remote storage](/doc/command-reference/remote) is optional in DVC:
+> no server setup or special services are needed, just the `dvc` command-line
+> tool.
diff --git a/content/docs/user-guide/how-to/add-dependency-or-output-to-stage.md b/content/docs/user-guide/how-to/add-dependency-or-output-to-stage.md
@@ -0,0 +1,50 @@
+# Add Dependency or Output to Stage
+
+There are situations where we have executed a stage (either by writing
+`dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later notice
+that some of the dependencies, or the output files/directories it creates, which
+are already in the <abbr>workspace</abbr>, are missing from `dvc.yaml` (`deps`
+and `outs` field respectively). Follow the steps below to add existing files or
+directories as <abbr>dependency</abbr> or <abbr>outputs</abbr> to a stage
+without re-executing it again, which can be expensive/time-consuming, and is
+unnecessary.
+
+We start with an example `prepare`, which has a single dependency and output. To
+add a missing dependency `data/data.csv` and output `data/validate` to this
+stage, we can edit `dvc.yaml` like this:
+
+```git
+ stages:
+   prepare:
+     cmd: python src/prepare.py
+     deps:
++    - data/data.csv
+     - src/prepare.py
+     outs:
+     - data/train
++    - data/validate
+```
+
+> Note that you can also use `dvc run` with the `-f` and `--no-exec` options to
+> add another dependency/output to the stage:
+>
+> ```dvc
+> $ dvc run -f --no-exec \
+>           -n prepare \
+>           -d data/data.csv \
+>           -d src/prepare.py \
+>           -o data/train \
+>           -o data/validate \
+>           python src/prepare.py
+> ```
+>
+> `-f` overwrites the stage in `dvc.yaml`, while `--no-exec` updates the stage
+> without executing it.
+
+Finally, we need to run `dvc commit` to save the newly specified dependency or
+output(s) to the <abbr>cache</abbr> (and to update the corresponding hash values
+in `dvc.lock`):
+
+```dvc
+$ dvc commit
+```
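
For reference, the `prepare` stage in `dvc.yaml` would presumably read as follows once the edit shown in the new how-to above is applied. This is a sketch reconstructed from the `git` block in that hunk; the stage name, paths, and indentation are taken from it and not from an actual project file.

```yaml
# Resulting stage after adding the missing dependency and output
# (reconstructed from the how-to's diff block; illustrative only)
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - data/data.csv      # newly added dependency
    - src/prepare.py
    outs:
    - data/train
    - data/validate      # newly added output
```

Per the how-to text, `dvc commit` would then be run so that the new entries are cached and their hashes recorded in `dvc.lock`, without re-executing the stage.
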
diff --git a/.../user-guide/how-to/add-output-to-stage.md → ...er-guide/how-to/add-outputs-to-a-stage.md b/.../user-guide/how-to/add-output-to-stage.md → ...er-guide/how-to/add-outputs-to-a-stage.md
@@ -1,4 +1,4 @@
-# Add Output to Stage
+# Add Output to a Stage
 
 There are situations where we have executed a stage (either by writing
 `dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later notice

diff --git a/content/docs/user-guide/how-to/undo-adding-data.md b/content/docs/user-guide/how-to/undo-adding-data.md
@@ -3,41 +3,33 @@
 There are situations where you want to stop tracking data added previously.
 Follow the steps listed here to undo `dvc add`.
 
-Let's first add a data file into an example <abbr>project</abbr> using
-`dvc add`, which creates a `.dvc` file to track the data:
+Let's first add a data file into an example <abbr>project</abbr>, which creates
+a `.dvc` file to track the data:
 
 ```dvc
 $ dvc add data.csv
 $ ls
 data.csv    data.csv.dvc
 ```
 
-> Note, if you are using `symlink` or `hardlink` as
-> [link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
-> for DVC <abbr>cache</abbr>, you will have to unprotect the tracked file first
-> (see `dvc unprotect`):
->
-> ```dvc
-> $ dvc unprotect data.csv
-> ```
+> Note, if you're using `symlink` or `hardlink` as the project's
+> [link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache),
+> you'll have to unprotect the tracked file first (see `dvc unprotect`).
 
-Now let's reverse `dvc add` by removing the corresponding `.dvc` file and
-`.gitignore` entry using `dvc remove`:
+Now let's reverse that with `dvc remove`. This removes the `.dvc` file (and
+corresponding `.gitignore` entry). The data file is now no longer being tracked
+after this:
 
 ```dvc
 $ dvc remove data.csv.dvc
-```
-
-Data file `data.csv` is now no longer being tracked by DVC.
 
-```dvc
 $ git status
     Untracked files:
         data.csv
 ```
 
 You can run `dvc gc` with the `-w` option to remove the data that isn't
-referenced in the current workspace from the cache:
+referenced in the current workspace from the <abbr>cache</abbr>:
 
 ```dvc
 $ dvc gc -w

diff --git a/content/docs/user-guide/how-to/update-tracked-files.md b/content/docs/user-guide/how-to/update-tracked-files.md
@@ -1,4 +1,4 @@
-# Updating Tracked Files
+# Update Tracked Files
 
 Due to the way DVC handles linking between the data files between the
 <abbr>cache</abbr> and their counterparts in the <abbr>workspace</abbr> (refer