Merge pull request #694 from iterative/understanding-dvc-copy-edits
understanding-dvc: copy edits
shcheklein committed Oct 13, 2019
2 parents 8a46479 + 256a056 commit e9c7ab4
Showing 7 changed files with 77 additions and 76 deletions.
39 changes: 19 additions & 20 deletions static/docs/understanding-dvc/collaboration-issues.md
@@ -1,51 +1,50 @@
# Collaboration Issues in Data Science

-Even with all the successes today in machine learning (ML), specifically deep
-learning and its applications in business, the data science community is still
-lacking good practices for organizing their projects and effectively
-collaborating across their varied ML projects. This is a massive challenge for
-the community and the industry now, when ML algorithms and methods are no longer
-simply "tribal knowledge" but are still difficult to implement, reuse, and
-manage.
-
-To make progress on this challenge, many areas of the ML experimentation process
-need to be formalized. Many common questions need to be answered in an unified,
-principled way.
+Even with all the success we've seen today in machine learning (ML),
+specifically deep learning and its applications in business, the data science
+community still lacks good practices for organizing their projects and
+effectively collaborating across their varied ML projects. This is a critical
+challenge: we need to evolve towards ML algorithms and methods no longer being
+"tribal knowledge" and making them easy to implement, reuse, and manage.
+
+To make progress, many areas of the ML experimentation process need to be
+formalized. Common questions need to be answered in a unified, principled way.

## Questions

### Source code and data versioning

-- How do you avoid any discrepancies between versions of the source code and
-versions of the data files when the data cannot fit into a repository?
+- How do you avoid discrepancies between versions of the source code and
+versions of the data files when the data cannot fit into a traditional
+repository format?

### Experiment time log

-- How do you track which of the
+- How do you track which of your
[hyperparameter](<https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)>)
-changes contributed the most to producing your target
-[metric](/doc/command-reference/metrics)? How do you monitor the extent of
+changes contributed the most to producing or improving your target
+[metric](/doc/command-reference/metrics)? How do you monitor the degree of
each change?

### Navigating through experiments

- How do you recover a model from last week without wasting time waiting for the
model to retrain?

-- How do you quickly switch between the large dataset and a small subset without
+- How do you quickly switch between a large dataset and a small subset without
modifying source code?

### Reproducibility

-- How do you run a model's evaluation again without retraining the model and
-preprocessing a raw dataset?
+- How do you run a model's evaluation process again without retraining the model
+and preprocessing a raw dataset?

### Managing and sharing large data files

- How do you share models trained in a GPU environment with colleagues who don't
have access to a GPU?

-- How do you share the entire 147 GB of your project, with all of its data
+- How do you share the entire 147 GB of your ML project, with all of its data
sources, intermediate data files, and models?

Some of these questions are easy to answer individually. Any data scientist,
8 changes: 4 additions & 4 deletions static/docs/understanding-dvc/core-features.md
@@ -9,11 +9,11 @@
- **Large data file versioning** works by creating pointers in your Git
repository to the <abbr>cache</abbr>, typically stored on a local hard drive.

-- **Programming language agnostic**: Python, R, Julia, shell scripts, etc. ML
-library agnostic: Keras, Tensorflow, PyTorch, scipy, etc.
+- DVC is **Programming language agnostic**: Python, R, Julia, shell scripts,
+etc. as well as ML library agnostic: Keras, TensorFlow, PyTorch, SciPy, etc.

-- **Open-sourced** and **Self-served**: DVC is free and doesn't require any
+- It's **Open-source** and **Self-serve**: DVC is free and doesn't require any
additional services.

- DVC supports cloud storage (Amazon S3, Azure Blob Storage, and Google Cloud
-Storage) for **data sources and pre-trained models sharing**.
+Storage) for **data sources and pre-trained model sharing**.
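
A minimal sketch of the pointer-file mechanism described in the first bullet
above (the file name here is hypothetical, and the exact files DVC writes
depend on the DVC version):

```dvc
# The data itself goes to the DVC cache; Git only tracks a small
# .dvc pointer file that records the data's checksum.
$ dvc add data/images.zip
$ git add data/images.zip.dvc data/.gitignore
$ git commit -m "Track raw dataset with DVC"
```

Because the pointer file stores a checksum rather than the data, the matching
version of the data can later be restored from the cache or a remote.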
35 changes: 18 additions & 17 deletions static/docs/understanding-dvc/existing-tools.md
@@ -2,32 +2,33 @@

## Existing engineering tools

-There is one common opinion regarding data science tooling. Data scientists as
-engineers are supposed to use the best practices and collaboration software from
-software engineering. Source code version control system (Git), continuous
-integration services (CI), and unit test frameworks are all expected to be
-utilized in data science [pipelines](/doc/command-reference/pipeline).
+There is one thing that data scientists seem to agree on around tooling: as
+engineers, we should use the same best practices and collaboration software
+that's standard in software engineering. A source code version control system
+(Git), continuous integration services (CI), and unit test frameworks are all
+expected to be utilized in data science
+[pipelines](/doc/command-reference/pipeline).

But a comprehensive look at data science processes shows that the software
-engineering toolset does not cover data science needs. Try to answer all the
-questions from the above using only engineering tools, and you are likely to be
-left wanting for more.
+engineering toolset does not completely cover data science needs. Try to answer
+all the questions from the above using only engineering tools, and you're likely
+to be left wanting more.

## Experiment management software

-This new type of software was created to solve data scientists collaboration
-issues. This software aims to cover the gap between data scientist needs and the
-existing toolset.
+This new type of software was created to solve data science collaboration
+issues. Experiment management software aims to cover the gap between data
+scientist needs and the existing toolsets from software engineering.

Experiment management software is usually **graphical user interface** (GUI)
based, in contrast to existing command line engineering tools. The GUI is a
bridge to a separate **cloud based environment**. The cloud environment is
-usually not so flexible as local data scientists environment. And the cloud
-environment is not fully integrated with the local environment.
+usually not as flexible as local data scientist environments, and isn't fully
+integrated with local environments either.

The separation of the local data scientist environment and the experimentation
-cloud environment creates another discrepancy issue and the environment
+cloud environment creates another discrepancy issue, and environment
synchronization requires additional work. Also, this style of software usually
-require external services, typically accompanied with a monthly bill. This might
-be a good solution for a particular companies or groups of data scientists.
-However a more accessible, free tool is needed for a wider audience.
+requires external services that aren't free. This might be a good solution for
+particular companies or groups of data scientists, but a more accessible, free
+tool is needed for a wider audience.
4 changes: 2 additions & 2 deletions static/docs/understanding-dvc/how-it-works.md
@@ -57,7 +57,7 @@
```

- DVC makes repositories reproducible. DVC-files can be easily shared through
-any Git server, and allows for experiments to be easily reproduced:
+any Git server, and allow for experiments to be easily reproduced:

```dvc
$ git clone https://github.com/dataversioncontrol/myrepo.git
@@ -73,7 +73,7 @@
```
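
The reproduction flow that the (truncated) example above illustrates can be
sketched end to end. This is a hedged sketch: it assumes the cloned repository
already has a default DVC remote configured and defines a default target for
`dvc repro`; the repository URL is the one used in the document.

```dvc
$ git clone https://github.com/dataversioncontrol/myrepo.git
$ cd myrepo
$ dvc pull      # fetch the data and models referenced by the DVC-files
$ dvc repro     # re-run only the stages whose dependencies have changed
```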

- The cache of a DVC project can be shared with colleagues through Amazon S3,
-Azure Blob Storage, Google Cloud Storage, among others:
+Azure Blob Storage, and Google Cloud Storage, among others:

```dvc
$ git push
53 changes: 27 additions & 26 deletions static/docs/understanding-dvc/related-technologies.md
@@ -1,7 +1,7 @@
# Comparison to Existing Technologies

-Due to the the novelty of our approach, it may be easier to understand DVC in
-comparison to existing technologies and tools.
+DVC takes a novel approach, and it may be easier to understand DVC in comparison
+to existing technologies and tools.

DVC combines a number of existing ideas into a single product, with the goal of
bringing best practices from software engineering into the data science field.
@@ -21,23 +21,24 @@ Pipelines and dependency graphs
Luigi, etc.

- DVC is focused on data science and modeling. As a result, DVC pipelines are
-lightweight, easy to create and modify. However, DVC lacks pipeline execution
-features like execution monitoring, execution error handling, and recovering.
+lightweight and easy to create and modify. However, DVC lacks pipeline
+execution features like execution monitoring, execution error handling, and
+recovery.

- DVC is purely a command line tool without a graphical user interface (GUI) and
doesn't run any daemons or servers. Nevertheless, DVC can generate images with
-pipeline and experiment workflow visualization.
+pipeline and experiment workflow visualizations.

### Experiment management software

-Mostly designed for enterprise usage, but with open-sourced options such as
+Mostly designed for enterprise usage, but with open source options such as
http://studio.ml/

- DVC uses Git as the underlying platform for experiment tracking instead of a
web application.

-- DVC doesn't need to run any services. No graphical user interface as a result,
-but we expect some GUI services will be created on top of DVC.
+- DVC doesn't need to run any services. There's no graphical user interface as a
+result, but we expect some GUI services will be created on top of DVC.

- DVC has a transparent design. Its
[internal files and directories](/doc/user-guide/dvc-files-and-directories)
@@ -48,10 +49,10 @@ http://studio.ml/

- DVC supports a new experimentation methodology that integrates easily with a
Git workflow. A separate branch should be created for each experiment, with a
-subsequent merge of this branch if it was successful.
+subsequent merge of the branch if the experiment was successful.

- DVC innovates by giving experimenters the ability to easily navigate through
-past experiments without recomputing them.
+past experiments without recomputing them each time.
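
A minimal sketch of the branch-per-experiment workflow described above,
assuming a hypothetical experiment branch and a `master` base branch:

```dvc
$ git checkout -b bigger-model     # one Git branch per experiment
# ...edit code or hyperparameters...
$ dvc repro                        # re-run only the affected stages
$ git commit -am "Experiment: bigger model"

# if the experiment is successful, merge it back
$ git checkout master
$ git merge bigger-model
```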

### Build automation tools

@@ -62,37 +63,37 @@
(DAG):

- The DAG or dependency graph is defined implicitly by the connections between
-[DVC-file](/doc/user-guide/dvc-file-format) (with file names `<file>.dvc` or
-`Dvcfile`), based on their dependencies and <abbr>outputs</abbr>.
+[DVC-files](/doc/user-guide/dvc-file-format) (with file names `<file>.dvc`
+or `Dvcfile`), based on their dependencies and <abbr>outputs</abbr>.

- Each DVC-file defines one node in the DAG. All DVC-files in a repository
make up a single pipeline (think a single Makefile). All DVC-files (and
corresponding pipeline commands) are implicitly combined through their
-inputs and outputs, to simplify conflict resolving during merges.
+inputs and outputs, simplifying conflict resolution during merges.

- DVC provides a simple command `dvc run` to generate a DVC-file or "stage
file" automatically, based on the provided command, dependencies, and
outputs.

- File tracking:

- DVC tracks files based on checksum (MD5) instead of file timestamps. This
helps avoid running into heavy processes like model retraining when you
-checkout a previous, trained version of a modeling code (Make would retrain
+checkout a previous, trained version of a model's code (Make would retrain
the model).

- DVC uses file timestamps and inodes for optimization. This allows DVC to
-avoid recomputing all dependency files checksum, which would be highly
+avoid recomputing all dependency files' checksums, which would be highly
problematic when working with large files (10 GB+).
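
A hedged sketch of the `dvc run` command mentioned above (the script, data,
and output names are hypothetical): the DVC-file it writes is what defines a
node in the dependency graph.

```dvc
# Record a stage: its command, dependencies (-d), and output (-o).
$ dvc run -d src/train.py -d data/train.tsv \
          -o model.pkl \
          python src/train.py data/train.tsv model.pkl
```

The resulting DVC-file (named something like `model.pkl.dvc`, depending on the
DVC version's defaults) records checksums for the dependencies and outputs,
which is how stages are linked into a pipeline and how changes are detected.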

### Git-annex

- DVC uses the idea of storing the content of large files (that you don't want
-to see in your Git repository) in a local key-value store and use file
+to see in your Git repository) in a local key-value store and uses file
symlinks instead of the actual files.

- DVC can use reflinks\* or hardlinks (depending on the system) instead of
-symlinks to improve performance and make the user experience better.
+symlinks to improve performance and the user experience.

- DVC optimizes checksum calculation.

@@ -105,23 +106,23 @@ http://studio.ml/
workflow) are always included in the Git repository and hence can be recreated
locally with minimal effort.

-- DVC is not fundamentally bound to Git, having the option of changing the
-repository format.
+- DVC is not fundamentally bound to Git, and users have the option of changing
+the repository format.
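
The link-type behavior mentioned above is configurable; a hedged sketch,
noting that the accepted values and defaults vary across DVC versions:

```dvc
# Ask DVC to prefer reflinks when linking files out of the cache.
$ dvc config cache.type reflink
```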

### Git-LFS (Large File Storage)

- DVC does not require special Git servers like Git-LFS demands. Any cloud
-storage like S3, GCS, or on-premises SSH server can be used as a backend for
-datasets and models, no additional databases, servers or infrastructure are
-required.
+storage like S3, GCS, or an on-premises SSH server can be used as a backend
+for datasets and models. No additional databases, servers, or infrastructure
+are required.

-- DVC is not fundamentally bound to Git, having the option of changing the
-repository format.
+- DVC is not fundamentally bound to Git, and users have the option of changing
+the repository format.

- DVC does not add any hooks to Git by default. To checkout data files, the
`dvc checkout` command has to be run after each `git checkout` and `git clone`
command. It gives more granularity in managing data and code separately. Hooks
-could be configured to make workflow simpler.
+could be configured to make workflows simpler.

- DVC attempts to use reflinks\* and has other
[file linking options](/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache).
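
A hedged sketch of the remote storage and explicit data checkout described in
this section (the remote name, bucket path, and Git revision are hypothetical):

```dvc
# Any supported cloud or SSH storage can act as the data backend.
$ dvc remote add -d myremote s3://mybucket/dvc-storage
$ dvc push                  # upload cached data files to the remote

# DVC adds no Git hooks by default, so data is checked out explicitly:
$ git checkout baseline     # switch the code to another revision
$ dvc checkout              # restore the matching data files from the cache
```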
4 changes: 2 additions & 2 deletions static/docs/understanding-dvc/resources.md
@@ -9,14 +9,14 @@
picture-in-picture" allowfullscreen></iframe>

- DVC Co-founder Dmitry Petrov talking about Model and Dataset versioning
-practices using DVC in PyCon, 2019:
+practices using DVC at PyCon, 2019:

<iframe width="560" height="315" src="https://www.youtube.com/embed/jkfh2PM5Sz8"
frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope;
picture-in-picture" allowfullscreen></iframe>

- DVC Co-founder Dmitry Petrov talking about Model and Dataset versioning
-practices using DVC in PyData Berlin, 2018:
+practices using DVC at PyData Berlin, 2018:

<iframe width="560" height="315" src="https://www.youtube.com/embed/BneW7jgB298"
frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope;
10 changes: 5 additions & 5 deletions static/docs/understanding-dvc/what-is-dvc.md
@@ -1,11 +1,11 @@
# What Is DVC?

Data Version Control, or DVC, is **a new type of experiment management
-software** that has been built **on top of the existing engineering toolset**,
-and particularly on a source code version control system (currently Git). DVC
-reduces the gap between the existing tools and the data scientist needs. This
-gives an ability to use the advantages of experiment management software while
-reusing existing skills and intuition.
+software** that has been built **on top of the existing engineering toolset that
+you're already used to**, and particularly on a source code version control
+system (currently Git). DVC reduces the gap between existing tools and data
+science needs, allowing users to take advantage of experiment management
+software while reusing existing skills and intuition.

The underlying source code control system eliminates the need to use external
services. Data science experiment sharing and collaboration can be done through
