Merge pull request #694 from iterative/understanding-dvc-copy-edits
understanding-dvc: copy edits
shcheklein committed Oct 13, 2019
2 parents 8a46479 + 256a056 commit e9c7ab4
Showing 7 changed files with 77 additions and 76 deletions.
39 changes: 19 additions & 20 deletions static/docs/understanding-dvc/collaboration-issues.md
@@ -1,51 +1,50 @@
# Collaboration Issues in Data Science

-Even with all the successes today in machine learning (ML), specifically deep
-learning and its applications in business, the data science community is still
-lacking good practices for organizing their projects and effectively
-collaborating across their varied ML projects. This is a massive challenge for
-the community and the industry now, when ML algorithms and methods are no longer
-simply "tribal knowledge" but are still difficult to implement, reuse, and
-manage.
-
-To make progress on this challenge, many areas of the ML experimentation process
-need to be formalized. Many common questions need to be answered in an unified,
-principled way.
+Even with all the success we've seen today in machine learning (ML),
+specifically deep learning and its applications in business, the data science
+community still lacks good practices for organizing their projects and
+effectively collaborating across their varied ML projects. This is a critical
+challenge: we need to evolve towards ML algorithms and methods no longer being
+"tribal knowledge" and making them easy to implement, reuse, and manage.
+
+To make progress, many areas of the ML experimentation process need to be
+formalized. Common questions need to be answered in a unified, principled way.

## Questions

### Source code and data versioning

-- How do you avoid any discrepancies between versions of the source code and
-versions of the data files when the data cannot fit into a repository?
+- How do you avoid discrepancies between versions of the source code and
+versions of the data files when the data cannot fit into a traditional
+repository format?

### Experiment time log

-- How do you track which of the
+- How do you track which of your
[hyperparameter](<https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)>)
-changes contributed the most to producing your target
-[metric](/doc/command-reference/metrics)? How do you monitor the extent of
+changes contributed the most to producing or improving your target
+[metric](/doc/command-reference/metrics)? How do you monitor the degree of
each change?

### Navigating through experiments

- How do you recover a model from last week without wasting time waiting for the
model to retrain?

-- How do you quickly switch between the large dataset and a small subset without
+- How do you quickly switch between a large dataset and a small subset without
modifying source code?

### Reproducibility

-- How do you run a model's evaluation again without retraining the model and
-preprocessing a raw dataset?
+- How do you run a model's evaluation process again without retraining the model
+and preprocessing a raw dataset?

### Managing and sharing large data files

- How do you share models trained in a GPU environment with colleagues who don't
have access to a GPU?

-- How do you share the entire 147 GB of your project, with all of its data
+- How do you share the entire 147 GB of your ML project, with all of its data
sources, intermediate data files, and models?

Some of these questions are easy to answer individually. Any data scientist,
8 changes: 4 additions & 4 deletions static/docs/understanding-dvc/core-features.md
@@ -9,11 +9,11 @@
- **Large data file versioning** works by creating pointers in your Git
repository to the <abbr>cache</abbr>, typically stored on a local hard drive.

-- **Programming language agnostic**: Python, R, Julia, shell scripts, etc. ML
-library agnostic: Keras, Tensorflow, PyTorch, scipy, etc.
+- DVC is **Programming language agnostic**: Python, R, Julia, shell scripts,
+etc. as well as ML library agnostic: Keras, TensorFlow, PyTorch, SciPy, etc.

-- **Open-sourced** and **Self-served**: DVC is free and doesn't require any
+- It's **Open-source** and **Self-serve**: DVC is free and doesn't require any
additional services.

- DVC supports cloud storage (Amazon S3, Azure Blob Storage, and Google Cloud
-Storage) for **data sources and pre-trained models sharing**.
+Storage) for **data sources and pre-trained model sharing**.
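
A minimal sketch of the pointer-file mechanism described in the first bullet
above (the file name here is hypothetical, and the exact files DVC writes
depend on the DVC version):

```dvc
# The data itself goes to the DVC cache; Git only tracks a small
# .dvc pointer file that records the data's checksum.
$ dvc add data/images.zip
$ git add data/images.zip.dvc data/.gitignore
$ git commit -m "Track raw dataset with DVC"
```

Because the pointer file stores a checksum rather than the data, the matching
version of the data can later be restored from the cache or a remote.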
35 changes: 18 additions & 17 deletions static/docs/understanding-dvc/existing-tools.md
@@ -2,32 +2,33 @@

## Existing engineering tools

-There is one common opinion regarding data science tooling. Data scientists as
-engineers are supposed to use the best practices and collaboration software from
-software engineering. Source code version control system (Git), continuous
-integration services (CI), and unit test frameworks are all expected to be
-utilized in data science [pipelines](/doc/command-reference/pipeline).
+There is one thing that data scientists seem to agree on around tooling: as
+engineers, we should use the same best practices and collaboration software
+that's standard in software engineering. A source code version control system
+(Git), continuous integration services (CI), and unit test frameworks are all
+expected to be utilized in data science
+[pipelines](/doc/command-reference/pipeline).

But a comprehensive look at data science processes shows that the software
-engineering toolset does not cover data science needs. Try to answer all the
-questions from the above using only engineering tools, and you are likely to be
-left wanting for more.
+engineering toolset does not completely cover data science needs. Try to answer
+all the questions from the above using only engineering tools, and you're likely
+to be left wanting more.

## Experiment management software

-This new type of software was created to solve data scientists collaboration
-issues. This software aims to cover the gap between data scientist needs and the
-existing toolset.
+This new type of software was created to solve data science collaboration
+issues. Experiment management software aims to cover the gap between data
+scientist needs and the existing toolsets from software engineering.

Experiment management software is usually **graphical user interface** (GUI)
based, in contrast to existing command line engineering tools. The GUI is a
bridge to a separate **cloud based environment**. The cloud environment is
-usually not so flexible as local data scientists environment. And the cloud
-environment is not fully integrated with the local environment.
+usually not as flexible as local data scientist environments, and isn't fully
+integrated with local environments either.

The separation of the local data scientist environment and the experimentation
-cloud environment creates another discrepancy issue and the environment
+cloud environment creates another discrepancy issue, and environment
synchronization requires additional work. Also, this style of software usually
-require external services, typically accompanied with a monthly bill. This might
-be a good solution for a particular companies or groups of data scientists.
-However a more accessible, free tool is needed for a wider audience.
+requires external services that aren't free. This might be a good solution for
+particular companies or groups of data scientists, but a more accessible, free
+tool is needed for a wider audience.
4 changes: 2 additions & 2 deletions static/docs/understanding-dvc/how-it-works.md
@@ -57,7 +57,7 @@
```

- DVC makes repositories reproducible. DVC-files can be easily shared through
-any Git server, and allows for experiments to be easily reproduced:
+any Git server, and allow for experiments to be easily reproduced:

```dvc
$ git clone https://github.com/dataversioncontrol/myrepo.git
@@ -73,7 +73,7 @@
```
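
The reproduction flow that the (truncated) example above illustrates can be
sketched end to end. This is a hedged sketch: it assumes the cloned repository
already has a default DVC remote configured and defines a default target for
`dvc repro`; the repository URL is the one used in the document.

```dvc
$ git clone https://github.com/dataversioncontrol/myrepo.git
$ cd myrepo
$ dvc pull      # fetch the data and models referenced by the DVC-files
$ dvc repro     # re-run only the stages whose dependencies have changed
```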

- The cache of a DVC project can be shared with colleagues through Amazon S3,
-Azure Blob Storage, Google Cloud Storage, among others:
+Azure Blob Storage, and Google Cloud Storage, among others:

```dvc
$ git push
53 changes: 27 additions & 26 deletions static/docs/understanding-dvc/related-technologies.md
@@ -1,7 +1,7 @@
# Comparison to Existing Technologies

-Due to the the novelty of our approach, it may be easier to understand DVC in
-comparison to existing technologies and tools.
+DVC takes a novel approach, and it may be easier to understand DVC in comparison
+to existing technologies and tools.

DVC combines a number of existing ideas into a single product, with the goal of
bringing best practices from software engineering into the data science field.
@@ -21,23 +21,24 @@ Pipelines and dependency graphs
Luigi, etc.

- DVC is focused on data science and modeling. As a result, DVC pipelines are
-lightweight, easy to create and modify. However, DVC lacks pipeline execution
-features like execution monitoring, execution error handling, and recovering.
+lightweight and easy to create and modify. However, DVC lacks pipeline
+execution features like execution monitoring, execution error handling, and
+recovery.

- DVC is purely a command line tool without a graphical user interface (GUI) and
doesn't run any daemons or servers. Nevertheless, DVC can generate images with
-pipeline and experiment workflow visualization.
+pipeline and experiment workflow visualizations.

### Experiment management software

-Mostly designed for enterprise usage, but with open-sourced options such as
+Mostly designed for enterprise usage, but with open source options such as
http://studio.ml/

- DVC uses Git as the underlying platform for experiment tracking instead of a
web application.

-- DVC doesn't need to run any services. No graphical user interface as a result,
-but we expect some GUI services will be created on top of DVC.
+- DVC doesn't need to run any services. There's no graphical user interface as a
+result, but we expect some GUI services will be created on top of DVC.

- DVC has a transparent design. Its
[internal files and directories](/doc/user-guide/dvc-files-and-directories)
@@ -48,10 +49,10 @@ http://studio.ml/

- DVC supports a new experimentation methodology that integrates easily with a
Git workflow. A separate branch should be created for each experiment, with a
-subsequent merge of this branch if it was successful.
+subsequent merge of the branch if the experiment was successful.

- DVC innovates by giving experimenters the ability to easily navigate through
-past experiments without recomputing them.
+past experiments without recomputing them each time.
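
A minimal sketch of the branch-per-experiment workflow described above,
assuming a hypothetical experiment branch and a `master` base branch:

```dvc
$ git checkout -b bigger-model     # one Git branch per experiment
# ...edit code or hyperparameters...
$ dvc repro                        # re-run only the affected stages
$ git commit -am "Experiment: bigger model"

# if the experiment is successful, merge it back
$ git checkout master
$ git merge bigger-model
```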

### Build automation tools

@@ -62,37 +63,37 @@
(DAG):

- The DAG or dependency graph is defined implicitly by the connections between
-[DVC-file](/doc/user-guide/dvc-file-format) (with file names `<file>.dvc` or
-`Dvcfile`), based on their dependencies and <abbr>outputs</abbr>.
+[DVC-files](/doc/user-guide/dvc-file-format) (with file names `<file>.dvc`
+or `Dvcfile`), based on their dependencies and <abbr>outputs</abbr>.

- Each DVC-file defines one node in the DAG. All DVC-files in a repository
make up a single pipeline (think a single Makefile). All DVC-files (and
corresponding pipeline commands) are implicitly combined through their
-inputs and outputs, to simplify conflict resolving during merges.
+inputs and outputs, simplifying conflict resolution during merges.

- DVC provides a simple command `dvc run` to generate a DVC-file or "stage
file" automatically, based on the provided command, dependencies, and
outputs.

- File tracking:

- DVC tracks files based on checksum (MD5) instead of file timestamps. This
helps avoid running into heavy processes like model retraining when you
-checkout a previous, trained version of a modeling code (Make would retrain
+checkout a previous, trained version of a model's code (Make would retrain
the model).

- DVC uses file timestamps and inodes for optimization. This allows DVC to
-avoid recomputing all dependency files checksum, which would be highly
+avoid recomputing all dependency files' checksums, which would be highly
problematic when working with large files (10 GB+).
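
A hedged sketch of the `dvc run` command mentioned above (the script, data,
and output names are hypothetical): the DVC-file it writes is what defines a
node in the dependency graph.

```dvc
# Record a stage: its command, dependencies (-d), and output (-o).
$ dvc run -d src/train.py -d data/train.tsv \
          -o model.pkl \
          python src/train.py data/train.tsv model.pkl
```

The resulting DVC-file (named something like `model.pkl.dvc`, depending on the
DVC version's defaults) records checksums for the dependencies and outputs,
which is how stages are linked into a pipeline and how changes are detected.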

### Git-annex

- DVC uses the idea of storing the content of large files (that you don't want
-to see in your Git repository) in a local key-value store and use file
+to see in your Git repository) in a local key-value store and uses file
symlinks instead of the actual files.

- DVC can use reflinks\* or hardlinks (depending on the system) instead of
-symlinks to improve performance and make the user experience better.
+symlinks to improve performance and the user experience.

- DVC optimizes checksum calculation.

@@ -105,23 +106,23 @@ http://studio.ml/
workflow) are always included in the Git repository and hence can be recreated
locally with minimal effort.

-- DVC is not fundamentally bound to Git, having the option of changing the
-repository format.
+- DVC is not fundamentally bound to Git, and users have the option of changing
+the repository format.
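
The link-type behavior mentioned above is configurable; a hedged sketch,
noting that the accepted values and defaults vary across DVC versions:

```dvc
# Ask DVC to prefer reflinks when linking files out of the cache.
$ dvc config cache.type reflink
```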

### Git-LFS (Large File Storage)

- DVC does not require special Git servers like Git-LFS demands. Any cloud
-storage like S3, GCS, or on-premises SSH server can be used as a backend for
-datasets and models, no additional databases, servers or infrastructure are
-required.
+storage like S3, GCS, or an on-premises SSH server can be used as a backend
+for datasets and models. No additional databases, servers, or infrastructure
+are required.

-- DVC is not fundamentally bound to Git, having the option of changing the
-repository format.
+- DVC is not fundamentally bound to Git, and users have the option of changing
+the repository format.

- DVC does not add any hooks to Git by default. To checkout data files, the
`dvc checkout` command has to be run after each `git checkout` and `git clone`
command. It gives more granularity in managing data and code separately. Hooks
-could be configured to make workflow simpler.
+could be configured to make workflows simpler.

- DVC attempts to use reflinks\* and has other
[file linking options](/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache).
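
A hedged sketch of the remote storage and explicit data checkout described in
this section (the remote name, bucket path, and Git revision are hypothetical):

```dvc
# Any supported cloud or SSH storage can act as the data backend.
$ dvc remote add -d myremote s3://mybucket/dvc-storage
$ dvc push                  # upload cached data files to the remote

# DVC adds no Git hooks by default, so data is checked out explicitly:
$ git checkout baseline     # switch the code to another revision
$ dvc checkout              # restore the matching data files from the cache
```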
4 changes: 2 additions & 2 deletions static/docs/understanding-dvc/resources.md
@@ -9,14 +9,14 @@
picture-in-picture" allowfullscreen></iframe>

- DVC Co-founder Dmitry Petrov talking about Model and Dataset versioning
-practices using DVC in PyCon, 2019:
+practices using DVC at PyCon, 2019:

<iframe width="560" height="315" src="https://www.youtube.com/embed/jkfh2PM5Sz8"
frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope;
picture-in-picture" allowfullscreen></iframe>

- DVC Co-founder Dmitry Petrov talking about Model and Dataset versioning
-practices using DVC in PyData Berlin, 2018:
+practices using DVC at PyData Berlin, 2018:

<iframe width="560" height="315" src="https://www.youtube.com/embed/BneW7jgB298"
frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope;
10 changes: 5 additions & 5 deletions static/docs/understanding-dvc/what-is-dvc.md
@@ -1,11 +1,11 @@
# What Is DVC?

Data Version Control, or DVC, is **a new type of experiment management
-software** that has been built **on top of the existing engineering toolset**,
-and particularly on a source code version control system (currently Git). DVC
-reduces the gap between the existing tools and the data scientist needs. This
-gives an ability to use the advantages of experiment management software while
-reusing existing skills and intuition.
+software** that has been built **on top of the existing engineering toolset that
+you're already used to**, and particularly on a source code version control
+system (currently Git). DVC reduces the gap between existing tools and data
+science needs, allowing users to take advantage of experiment management
+software while reusing existing skills and intuition.

The underlying source code control system eliminates the need to use external
services. Data science experiment sharing and collaboration can be done through
