get-started: fixup example-versioning
jorgeorpinel committed Aug 28, 2019
1 parent 6aba9c4 commit 29d3dcc
Showing 4 changed files with 42 additions and 9 deletions.
3 changes: 2 additions & 1 deletion static/docs/get-started/add-files.md
@@ -7,8 +7,9 @@ Let's get a dataset example to play with:
```dvc
$ mkdir data
$ cd data
dvc get https://github.com/iterative/dataset-registry \
$ dvc get https://github.com/iterative/dataset-registry \
get-started/data.xml
...
$ cd ..
```

2 changes: 1 addition & 1 deletion static/docs/get-started/connect-code-and-data.md
@@ -43,8 +43,8 @@ recommend creating a virtual environment with a tool such as

```dvc
$ virtualenv -p python3 .env
$ source .env/bin/activate
$ echo ".env/" >> .gitignore
$ source .env/bin/activate
$ pip install -r src/requirements.txt
```

2 changes: 1 addition & 1 deletion static/docs/get-started/example-pipeline.md
@@ -53,8 +53,8 @@ recommend creating a virtual environment with a tool such as

```dvc
$ virtualenv -p python3 .env
$ source .env/bin/activate
$ echo ".env/" >> .gitignore
$ source .env/bin/activate
$ pip install -r requirements.txt
```

44 changes: 38 additions & 6 deletions static/docs/get-started/example-versioning.md
@@ -53,7 +53,6 @@ recommend creating a virtual environment with a tool such as
```dvc
$ virtualenv -p python3 .env
$ source .env/bin/activate
$ echo ".env/" >> .gitignore
$ pip install -r requirements.txt
```

@@ -87,11 +86,21 @@ browser to download `data.xml`. Save it into the `data` subdirectory.
</details>

```dvc
$ wget https://data.dvc.org/tutorial/ver/data.zip
$ mkdir data
$ cd data
$ dvc get https://github.com/iterative/dataset-registry \
tutorial/ver/data.zip
...
$ unzip data.zip
$ rm -f data.zip
```

> `dvc get` is a special command to download <abbr>data artifacts</abbr> from
> other DVC projects into the current working directory (similar to `wget` but
> for DVC repositories). In this case we use our own
> [iterative/dataset-registry](https://github.com/iterative/dataset-registry)
> project as the external data source.
This command downloads and extracts our raw dataset, consisting of 1000 labeled
images for training and 800 labeled images for validation. In summary, it's a 43
MB dataset, with a directory structure like this:
@@ -151,7 +160,7 @@ it a little bit later. For now, let's commit the current state:

```dvc
$ git add .gitignore model.h5.dvc data.dvc metrics.json
$ git commit -m "model first version, 1000 images"
$ git commit -m "First model, trained with 1000 images"
$ git tag -a "v1.0" -m "model v1.0, 1000 images"
```

@@ -171,13 +180,25 @@ files work.

</details>

> Note that executing `train.py` produced other intermediate files. This is OK,
> we will use them later.
>
> ```dvc
> $ git status
> ...
> bottleneck_features_train.npy
> bottleneck_features_validation.npy
> ```
## Second model version

Let's imagine that our images dataset is growing, we were able to double it.
Next command extracts 500 cat and 500 dog images into `data/train`:

```dvc
$ wget https://data.dvc.org/tutorial/ver/new-labels.zip
$ dvc get https://github.com/iterative/dataset-registry \
tutorial/ver/new-labels.zip
...
$ unzip new-labels.zip
$ rm -f new-labels.zip
```
@@ -220,7 +241,7 @@ Let's commit the second version:

```dvc
$ git add model.h5.dvc data.dvc metrics.json
$ git commit -m "model second version, 2000 images"
$ git commit -m "Second model, trained with 2000 images"
$ git tag -a "v2.0" -m "model v2.0, 2000 images"
```

@@ -295,8 +316,16 @@ our example, please notice that `train.py` produces binary files (e.g.
When you have a script that takes some data as an input and produces other data
outputs, a better way to capture them is to use `dvc run`:

> Please go back to the master branch code and data if you tried the commands in
> the [Switching between versions](#switching-between-versions) section with:
>
> ```dvc
> $ git checkout master
> $ dvc checkout
> ```
```dvc
$ dvc remove -p model.h5.dvc
$ dvc remove -pf model.h5.dvc
$ dvc run -f Dvcfile \
-d train.py -d data \
-M metrics.json \
@@ -310,6 +339,9 @@ control the same way as `dvc add` does. Unlike, `dvc add`, `dvc run` also tracks
dependencies (`-d`) and the command (`python train.py`) that was run to produce
the result.

> BTW, at this point you could `git add .` and `git commit` to save the
> `Dvcfile` stage file and its changed output files to the repository.
`dvc repro` will run `Dvcfile` if any of its dependencies (`-d`) changed, for
example after we added new images like we did when we built the second model
version. It also updates outputs and puts them into the cache.