get-started: fixup example-versioning

per #564 (comment)
iterative · Aug 28, 2019 · 29d3dcc
1 parent 6aba9c4
commit 29d3dcc

Showing 4 changed files with 42 additions and 9 deletions.
diff --git a/static/docs/get-started/add-files.md b/static/docs/get-started/add-files.md
@@ -7,8 +7,9 @@ Let's get a dataset example to play with:
 ```dvc
 $ mkdir data
 $ cd data
-dvc get https://github.com/iterative/dataset-registry \
+$ dvc get https://github.com/iterative/dataset-registry \
         get-started/data.xml
+...
 $ cd ..
 ```
 

diff --git a/static/docs/get-started/connect-code-and-data.md b/static/docs/get-started/connect-code-and-data.md
@@ -43,8 +43,8 @@ recommend creating a virtual environment with a tool such as
 
 ```dvc
 $ virtualenv -p python3 .env
-$ source .env/bin/activate
 $ echo ".env/" >> .gitignore
+$ source .env/bin/activate
 $ pip install -r src/requirements.txt
 ```
 

diff --git a/static/docs/get-started/example-pipeline.md b/static/docs/get-started/example-pipeline.md
@@ -53,8 +53,8 @@ recommend creating a virtual environment with a tool such as
 
 ```dvc
 $ virtualenv -p python3 .env
-$ source .env/bin/activate
 $ echo ".env/" >> .gitignore
+$ source .env/bin/activate
 $ pip install -r requirements.txt
 ```
 

diff --git a/static/docs/get-started/example-versioning.md b/static/docs/get-started/example-versioning.md
@@ -53,7 +53,6 @@ recommend creating a virtual environment with a tool such as
 ```dvc
 $ virtualenv -p python3 .env
 $ source .env/bin/activate
-$ echo ".env/" >> .gitignore
 $ pip install -r requirements.txt
 ```
 
@@ -87,11 +86,21 @@ browser to download `data.xml`. Save it into the `data` subdirectory.
 </details>
 
 ```dvc
-$ wget https://data.dvc.org/tutorial/ver/data.zip
+$ mkdir data
+$ cd data
+$ dvc get https://github.com/iterative/dataset-registry \
+          tutorial/ver/data.zip
+...
 $ unzip data.zip
 $ rm -f data.zip
 ```
 
+> `dvc get` is a special command to download <abbr>data artifacts</abbr> from
+> other DVC projects into the current working directory (similar to `wget` but
+> for DVC repositories). In this case we use our own
+> [iterative/dataset-registry](https://github.com/iterative/dataset-registry)
+> project as the external data source.
+
 This command downloads and extracts our raw dataset, consisting of 1000 labeled
 images for training and 800 labeled images for validation. In summary, it's a 43
 MB dataset, with a directory structure like this:
@@ -151,7 +160,7 @@ it a little bit later. For now, let's commit the current state:
 
 ```dvc
 $ git add .gitignore model.h5.dvc data.dvc metrics.json
-$ git commit -m "model first version, 1000 images"
+$ git commit -m "First model, trained with 1000 images"
 $ git tag -a "v1.0" -m "model v1.0, 1000 images"
 ```
 
@@ -171,13 +180,25 @@ files work.
 
 </details>
 
+> Note that executing `train.py` produced other intermediate files. This is OK;
+> we will use them later.
+>
+> ```dvc
+> $ git status
+> ...
+> 	bottleneck_features_train.npy
+> 	bottleneck_features_validation.npy
+> ```
+
 ## Second model version
 
 Let's imagine that our images dataset is growing, we were able to double it.
 Next command extracts 500 cat and 500 dog images into `data/train`:
 
 ```dvc
-$ wget https://data.dvc.org/tutorial/ver/new-labels.zip
+$ dvc get https://github.com/iterative/dataset-registry \
+          tutorial/ver/new-labels.zip
+...
 $ unzip new-labels.zip
 $ rm -f new-labels.zip
 ```
@@ -220,7 +241,7 @@ Let's commit the second version:
 
 ```dvc
 $ git add model.h5.dvc data.dvc metrics.json
-$ git commit -m "model second version, 2000 images"
+$ git commit -m "Second model, trained with 2000 images"
 $ git tag -a "v2.0" -m "model v2.0, 2000 images"
 ```
 
@@ -295,8 +316,16 @@ our example, please notice that `train.py` produces binary files (e.g.
 When you have a script that takes some data as an input and produces other data
 outputs, a better way to capture them is to use `dvc run`:
 
+> If you tried the commands in the
+> [Switching between versions](#switching-between-versions) section, first go
+> back to the master branch code and data with:
+>
+> ```dvc
+> $ git checkout master
+> $ dvc checkout
+> ```
+
 ```dvc
-$ dvc remove -p model.h5.dvc
+$ dvc remove -pf model.h5.dvc
 $ dvc run -f Dvcfile \
           -d train.py -d data \
           -M metrics.json \
@@ -310,6 +339,9 @@ control the same way as `dvc add` does. Unlike, `dvc add`, `dvc run` also tracks
 dependencies (`-d`) and the command (`python train.py`) that was run to produce
 the result.
 
+> BTW, at this point you could `git add .` and `git commit` to save the
+> `Dvcfile` stage file and its changed output files to the repository.
+
 `dvc repro` will run `Dvcfile` if any of its dependencies (`-d`) changed, for
 example after we added new images like we did when we built the second model
 version. It also updates outputs and puts them into the cache.
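
The commit-and-tag pattern these docs use for each model version can be tried in isolation. The following is a minimal, Git-only sketch: the throwaway repository, the user identity, and the placeholder contents of `metrics.json` are assumptions made for illustration, and no DVC commands are run. In the tutorial itself, `dvc checkout` would follow `git checkout` to bring the data files in sync.

```shell
#!/bin/sh
set -eu

# Throwaway repository; the location is arbitrary (assumption for this sketch).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

# Hypothetical placeholder standing in for the tutorial's metrics.json.
echo "placeholder metrics, 1000 images" > metrics.json
git add metrics.json
git commit -q -m "First model, trained with 1000 images"
git tag -a "v1.0" -m "model v1.0, 1000 images"

# Second version: same file, new contents, new annotated tag.
echo "placeholder metrics, 2000 images" > metrics.json
git add metrics.json
git commit -q -m "Second model, trained with 2000 images"
git tag -a "v2.0" -m "model v2.0, 2000 images"

# Switch the working tree back to the first tagged version.
git checkout -q v1.0
cat metrics.json
```

After the final checkout the working tree holds the file as it was at `v1.0`, which is why the annotated tags make it easy to return to a specific model version later.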