use-cases: rewrite data registry intro (1)
jorgeorpinel committed Nov 9, 2019
1 parent 2e31691 commit 6425a5d
Showing 2 changed files with 33 additions and 24 deletions.
43 changes: 25 additions & 18 deletions static/docs/use-cases/data-registry.md
@@ -1,21 +1,28 @@
# Data Registry

We developed the `dvc get`, `dvc import`, and `dvc update` commands with the aim
to enable reusability of any <abbr>data artifacts</abbr> (raw data, intermediate
results, models, etc) between different projects. For example, project A may use
a data file to begin its data [pipeline](/doc/command-reference/pipeline), but
project B also requires this same file; instead of
One of the main uses of <abbr>DVC repositories</abbr> is the
[versioning of data and model files](/doc/use-cases/data-and-model-files-versioning).
This is provided by commands such as `dvc add` and `dvc run`, which allow
tracking of datasets and any other <abbr>data artifacts</abbr>.

To enable reuse of these versioned artifacts between
different projects (similar to package management systems, but for data), DVC
also includes the `dvc get`, `dvc import`, and `dvc update` commands. For
example, project A may use a data file to begin its data
[pipeline](/doc/command-reference/pipeline), but project B also requires this
same file; instead of
[adding it](/doc/command-reference/add#example-single-file) to both projects,
B can simply import it from A.

Taking this idea to a useful extreme, we could create a <abbr>project</abbr>
that is exclusively dedicated to
[tracking and versioning](/doc/use-cases/data-and-model-files-versioning)
datasets (or any kind of large files) – by mainly using `dvc add` to build it.
Other projects can then share these artifacts by downloading (`dvc get`) or
importing (`dvc import`) them for use in different data processes – and these
don't even have to be _DVC projects_, as `dvc get` works anywhere in your
system.
B can simply import it from A. Furthermore, the version of the data file
imported to B can be an older iteration than what's currently used in A.

Keeping this in mind, we could build a <abbr>DVC project</abbr> dedicated to
tracking and versioning datasets (or any kind of large files). This way we would
have a repository that has all the metadata and change history for the project's
data. We can see who updated what, and when; use pull requests to update data
the same way we do with code; and avoid ad-hoc conventions for storing
different data versions. Other projects can share the data in the registry by
downloading (`dvc get`) or importing (`dvc import`) it for use in different
data processes.

The advantages of using a DVC **data registry** project are:

@@ -114,9 +121,9 @@ See the `dvc import` command reference for more details on the `--rev`

Importing keeps the connection between the local project and the source data
registry where we are downloading the dataset from. This is achieved by creating
a special [DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import stage_)
that uses the `repo` field. (This file can be used for versioning the import
with Git.)
a particular kind of [DVC-file](/doc/user-guide/dvc-file-format) that uses the
`repo` field (a.k.a. _import stage_). (This file can be used for versioning the
import with Git.)

> For a sample DVC-file resulting from `dvc import`, refer to
> [this example](/doc/command-reference/import#example-data-registry).
14 changes: 8 additions & 6 deletions static/docs/use-cases/index.md
@@ -9,13 +9,15 @@ range from basic to more advanced:
- [Data Versioning](/doc/use-cases/versioning-data-and-model-files) describes
our primary use: tracking and versioning large files with Git + DVC.
- [Sharing Data and Model Files](/doc/use-cases/sharing-data-and-model-files)
goes over basic collaboration possibilities enabled by DVC.
- [Shared Development Server](/doc/use-cases/shared-development-server)
describes a single development machine setup for teams that prefer so.
goes over the basic collaboration possibilities enabled by DVC.
- [Shared Development Server](/doc/use-cases/shared-development-server) provides
instructions to set up a single development machine for teams that prefer one.
- [Data Registry](/doc/use-cases/data-registry) explains how to use a <abbr>DVC
repository</abbr> as a shared hub for reusing datasets among several projects.

This list of use cases is _not_ exhaustive. We keep reviewing our docs and will
include interesting scenarios that surface in our community. Please,
[contact us](/support) if you need help or have suggestions!
> This list of use cases is **not** exhaustive. We keep reviewing our docs and
> will include interesting scenarios that surface in the community. Please
> [contact us](/support) if you need help or have suggestions!
Use cases are not written to be run end-to-end. For more general, hands-on
experience with DVC, we recommend following the [Get Started](/doc/get-started),