added a BC: workspace document #2197

Merged
merged 17 commits into from
May 10, 2021
Changes from 6 commits
6 changes: 6 additions & 0 deletions content/docs/sidebar.json
@@ -91,6 +91,12 @@
"label": "What is DVC?",
"slug": "what-is-dvc"
},
{
"label": "Basic Concepts",
"slug": "basic-concepts",
"source": false,
"children": ["workspace"]
},
{
"slug": "project-structure",
"source": "user-guide/project-structure/index.md",
37 changes: 37 additions & 0 deletions content/docs/user-guide/basic-concepts/workspace.md
@@ -6,3 +6,40 @@ tooltip: >-
models, etc. Typically, it's also a Git repository. It will contain your DVC
project.
---

<!-- keywords: data science project architecture, machine learning project architecture, machine learning workflow, data science workflow, machine learning file system, data science file system, data science project structure, machine learning project structure, notebook version control -->
Contributor

Just a note to remove this comment later before merging (but for now it's useful to have the list during review 👍)


# Workspace

A data science project consists of data obtained from many different sources.
This data may be split into multiple files or directories or, as the project
structure requires, exist in different versions for different needs. For
example, a smaller or simplified version might be used in prototyping for faster
feedback and shorter training times. A single workspace to manage all artifacts
of a project is desirable, although versioning needs and dependency management
make this increasingly complex.

DVC allows a single directory to contain all your project artifacts. In the
documentation, the workspace is the _user-visible_ part of the directory that
contains all your <abbr>project</abbr> files, e.g. raw datasets, source code, ML
models, etc. Users work in this directory with their data and model files and
manipulate its contents through DVC commands.

Files and directories in the workspace can be added to DVC (`dvc add`) or they
can be downloaded from external sources (`dvc get`, `dvc import`,

Contributor

But actually add can also download data to the workspace (see --out and --to-remote options). Also, import* commands download AND track data. You may want to rephrase this part accordingly 🙂

Contributor Author

Sometimes I think there is no command that hasn't got a duplicate, somehow :) I try to mention commands in passing, if we'd consider each and every option to commands, we'll need to duplicate the command reference here IMHO.

We may just delete the commands if you would like.

Contributor Author

I can also list all possibilities for each functionality, like

In the workspace, you can

  • Import files and directories (dvc add --out, dvc import-url...)

but I think this will turn the document into a list of commands and options.

Contributor

No need to list every command usage of course, agreed!

My point was that these 3 commands mentioned actually overlap in a way that makes the current text slightly incorrect. In any case, the main use case of add is not to "add" but to "track", actually. Please check each cmd ref to try to find the right terms when needed 🙂

"Download" is correct for get/import but add can also download (and they can all "transfer") so I'd avoid that term probably. And in fact I wouldn't even mention get here, since it doesn't require a DVC project/workspace. For import I'd try to use the cmd name as the relevant action (to "import") I guess...

Contributor

@jorgeorpinel, Mar 15, 2021

> the main use case of add is not to "add" but to "track"
>
> For import I'd try to use the cmd name as the relevant action (to "import")

then again import* also track the downloaded data 😅 ("adds"). Maybe it should be a single sentence about tracking and put all add, import, import-url in the same parenthesis.

`dvc import-url`). Changes to files, directories, notebooks, models and, in
general, the machine learning file system can be recorded (`dvc commit`) and
versioned with Git, and the workspace can be synchronized with a given version
(`dvc checkout`). <abbr>Pipelines</abbr> and the <abbr>dependencies</abbr>
between their stages can be defined. Data and model files can be uploaded to
remote storage (configured with `dvc remote`) with `dvc push` and retrieved when
necessary with `dvc pull`. They can also be removed from the workspace
(`dvc remove`). DVC supports typical file system operations on files and
directories through its commands.
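
As a rough illustration of how a few of these commands might fit together, the
sketch below tracks one file, versions its metafile in Git, pushes the data to a
remote and pulls it back. It is not taken from the PR: the file name `data.csv`,
the remote name `localstore`, and the temporary directories are all assumptions
made for the example, and a local directory stands in for cloud storage.

```shell
#!/usr/bin/env bash
# Minimal sketch of a workspace round trip. All paths and names below
# (data.csv, localstore, the temporary directories) are hypothetical.
set -euo pipefail

WORKSPACE="$(mktemp -d)"   # stands in for your project directory
STORAGE="$(mktemp -d)"     # a local directory used as a DVC remote

if ! command -v git >/dev/null 2>&1 || ! command -v dvc >/dev/null 2>&1; then
  echo "git and dvc are both required; skipping the sketch."
else
  cd "$WORKSPACE"
  git init -q
  dvc init -q                          # creates .dvc/ and stages it in Git

  printf 'id,label\n1,cat\n2,dog\n' > data.csv
  dvc add -q data.csv                  # start tracking; writes data.csv.dvc
  cat data.csv.dvc                     # the metafile that Git will version

  git add data.csv.dvc .gitignore
  git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "Track data.csv with DVC"

  dvc remote add -q -d localstore "$STORAGE"   # configure the default remote
  dvc push -q                          # upload cached data to the remote

  rm -rf .dvc/cache data.csv           # simulate a fresh clone without data
  dvc pull -q                          # download data and restore data.csv
fi
```

Only the small `data.csv.dvc` metafile and `.gitignore` go into Git here, while
the data itself travels through the cache and the remote.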

Behind the scenes, these operations use <abbr>metafiles</abbr> in a
<abbr>DVC project</abbr> to track content and dependencies.

## Further Reading

- [What is DVC?](/doc/user-guide/what-is-dvc)
- [Versioning Data and Model](/doc/use-cases/versioning-data-and-model-files)
from Use Cases