added a BC: workspace document #2197

Merged 17 commits on May 10, 2021

6 changes: 6 additions & 0 deletions content/docs/sidebar.json
@@ -91,6 +91,12 @@
"label": "What is DVC?",
"slug": "what-is-dvc"
},
{
"label": "Basic Concepts",
"slug": "basic-concepts",
"source": false,
"children": ["workspace"]
},
{
"slug": "project-structure",
"source": "user-guide/project-structure/index.md",
44 changes: 42 additions & 2 deletions content/docs/user-guide/basic-concepts/workspace.md
@@ -2,7 +2,47 @@
name: Workspace
match: [workspace]
tooltip: >-
- Directory containing all your project files e.g. raw datasets, source code, ML
- models, etc. Typically, it's also a Git repository. It will contain your DVC
+ The directory containing all your project files, e.g., the raw data, source
+ code, ML models. Typically, it's also a Git repository. It contains your DVC
project.
---

# Workspace

A data science project can consist of data obtained from many distinct sources.
These may be split into multiple files or directories or (as the project
structure needs) have different versions for different requirements, e.g., a
smaller / simplified version might be required in prototyping for faster
feedback and shorter training times. A single workspace to manage all artifacts
Comment on lines +13 to +16
@jorgeorpinel (Contributor), Mar 15, 2021:

or have different versions for different requirements...

Let's not go into versioning here, I think. At least not by implying they're all in the workspace, because in DVC the workspace only holds one version (the rest are cached and managed via Git, metafiles, etc.).

(Mentioned in #2197 (comment))

Contributor:

versioning needs and managing dependencies make it increasingly difficult

p.s. This is a better way to mention versioning very subtly (it could even link to the corresponding Use Case doc).

of a project is desirable, although versioning needs and managing dependencies
make it increasingly difficult.

DVC allows a single directory to contain all your project artifacts. The
@jorgeorpinel (Contributor), Mar 15, 2021:

a single directory to contain all your project artifacts

Not exactly. File contents are organized in the cache with a special file structure (see https://dvc.org/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory).

The workspace is the directory containing the visible part of your project

That part is correct. And contradicts the previous part 🙂 (because "visible part" implies there's a hidden part which must be in other dirs).

Let's open this paragraph with that sentence.

workspace is the directory containing the _visible_ part of your
<abbr>project</abbr>, e.g., the raw data, source code, model files. You can have
Comment on lines +21 to +22
Contributor:

"e.g., the raw data, source code, model files you're currently using"

multiple versions of data, models, and other kinds of artifacts within the
workspace and limit your focus to a subset of these. You can record your
Comment on lines +22 to +24
Contributor:

You can have multiple versions of data ... within the workspace

Again contradicting 😕

progress in a commit and analyze your data and model history. DVC provides a
Comment on lines +24 to +25
Contributor:

record your progress in a commit and analyze...

Again probably too many details about versioning. Doesn't really fall within the 'workspace' concept, I think. This can be simpler.

Contributor:

No need to rename your models for minor changes...
or save tens of different renamed files for training

Those are better mentions of versioning (clear benefits i.e. how much you'd have to suffer without DVC)

save cleaned up data in different directories

That specific example isn't great because that's still pretty common even with DVC (e.g. in our own example-get-started repo we have a prepared/ dir).

_machine learning file system_ to manipulate your data and models using its
Comment on lines +25 to +26
@jorgeorpinel (Contributor), Mar 15, 2021:

This "ML file system" keyword is pretty tricky. No need to force it (just skip it if you can't find a correct way to use it).

I can only think of something like "DVC turns your project into a sort of machine learning file system for..." but not sure.

commands. No need to rename your models for minor changes, save cleaned up data
in different directories or save tens of different renamed files for training
programs. DVC can keep track of all of these in a single directory called the
workspace.
Comment on lines +29 to +30
@jorgeorpinel (Contributor), Mar 15, 2021:

DVC can keep track of all of these in a single directory called the workspace.

Again contradicting and also, repetitive at this point.
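
To illustrate the review point that the workspace shows only one version of a file while other versions live in a cache, here is a minimal Python sketch. It is not DVC's implementation: the `.cache_sketch` directory, the function names, and the split of an MD5 digest into a two-character directory plus the remaining characters are assumptions chosen to resemble the cache layout described on the internal-files page linked above.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def cache_path_for(cache_dir: Path, file_path: Path) -> Path:
    """Return a content-addressed location for file_path inside cache_dir.

    Assumption: the MD5 hex digest is split into a 2-character directory
    name and a 30-character file name.
    """
    digest = hashlib.md5(file_path.read_bytes()).hexdigest()
    return cache_dir / digest[:2] / digest[2:]


def store_in_cache(cache_dir: Path, file_path: Path) -> Path:
    """Copy file_path into the cache unless identical content is already stored."""
    target = cache_path_for(cache_dir, file_path)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(file_path, target)
    return target


def demo() -> dict:
    """Save two versions of one file and report what is visible vs. cached."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        cache = workspace / ".cache_sketch"  # hypothetical hidden cache directory
        data = workspace / "data.csv"

        data.write_text("a,b\n1,2\n")
        store_in_cache(cache, data)  # first version

        data.write_text("a,b\n1,2\n3,4\n")
        store_in_cache(cache, data)  # second version, different content

        visible = sorted(
            p.name for p in workspace.iterdir() if not p.name.startswith(".")
        )
        cached = [p for p in cache.rglob("*") if p.is_file()]
        return {"visible_files": visible, "cached_versions": len(cached)}


if __name__ == "__main__":
    print(demo())
```

In this sketch the workspace still contains a single `data.csv`, while both versions of its content sit in the hidden directory, which is the distinction the comments above ask the text to respect.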


Files and directories in the workspace can be added to DVC (`dvc add`), or they
can be downloaded from external sources (`dvc get`, `dvc import`,

This comment was marked as resolved.


Contributor:

But actually add can also download data to the workspace (see --out and --to-remote options). Also, import* commands download AND track data. You may want to rephrase this part accordingly 🙂

Contributor, Author:

Sometimes I think there is no command that doesn't have a duplicate somehow :) I try to mention commands in passing; if we considered each and every option of each command, we'd need to duplicate the command reference here, IMHO.

We may just delete the commands if you would like.

Contributor, Author:

I can also list all possibilities for each functionality, like

In the workspace, you can

  • Import files and directories (dvc add --out, dvc import-url...)

but I think this will turn the document into a list of commands and options.

Contributor:

No need to list every command usage of course, agreed!

My point was that these 3 commands mentioned actually overlap in a way that makes the current text slightly incorrect. In any case, the main use case of add is not to "add" but to "track", actually. Please check each cmd ref to try to find the right terms when needed 🙂

"Download" is correct for get/import but add can also download (and they can all "transfer") so I'd avoid that term probably. And in fact I wouldn't even mention get here, since it doesn't require a DVC project/workspace. For import I'd try to use the cmd name as the relevant action (to "import") I guess...

@jorgeorpinel (Contributor), Mar 15, 2021:

the main use case of add is not to "add" but to "track"
For import I'd try to use the cmd name as the relevant action (to "import")

Then again, the import* commands also track the downloaded data 😅 (they "add" it). Maybe it should be a single sentence about tracking that puts add, import, and import-url in the same parentheses.

`dvc import-url`). Changes to the data, notebooks, models, and any related
machine learning artifact can be tracked (`dvc commit`), and their content can
be synchronized (`dvc checkout`). Tracked data can be removed (`dvc remove`)
from the workspace.

DVC supports all typical operations of a versioned data file system through its
commands. Behind the scenes, these operations use <abbr>metafiles</abbr> like the
Comment on lines +39 to +40
Contributor:

DVC supports all typical operations of a versioned data file system through its commands.

Maybe open the previous paragraph with that?

p.s. With this, I think there's definitely no need for the "ml file system" keyword. But keeping "machine learning" somewhere would be nice.

Contributor:

Let's try to incorporate the metafile mentions into the main (2nd) paragraph somehow. After simplifying it per my previous comments, there should be enough room in there. That way there's no need for this 4th paragraph.

`.dvc/` directory, `dvc.yaml` files or files with `*.dvc` extension to track the
content and dependencies.
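
As a rough illustration of what "tracking the content" through a metafile can mean, the Python sketch below builds text shaped like the `outs` entry of a `*.dvc` file for a single tracked file. This is a sketch under assumptions, not DVC's code: the function name `dvc_file_sketch` is hypothetical, the field set (`md5`, `size`, `path`) is a simplified guess at what `dvc add` records, and DVC's real hashing and file format may differ between versions.

```python
import hashlib
import tempfile
from pathlib import Path


def dvc_file_sketch(file_path: Path) -> str:
    """Return YAML-like text describing file_path by content hash, size, and name.

    Assumption: a metafile identifies tracked data by a hash of its content,
    so the small text file can be versioned with Git while the data itself
    is stored elsewhere.
    """
    data = file_path.read_bytes()
    md5 = hashlib.md5(data).hexdigest()
    return (
        "outs:\n"
        f"- md5: {md5}\n"
        f"  size: {len(data)}\n"
        f"  path: {file_path.name}\n"
    )


def demo() -> str:
    """Create a throwaway file and return its sketched metafile text."""
    with tempfile.TemporaryDirectory() as tmp:
        model = Path(tmp) / "model.pkl"  # hypothetical tracked file
        model.write_bytes(b"weights-v1")
        return dvc_file_sketch(model)


if __name__ == "__main__":
    print(demo())
```

Changing the file's content changes the recorded hash, which is how a small text metafile can stand in for large data in version control.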

## Further Reading

- [What is DVC?](/doc/user-guide/what-is-dvc)
- [Versioning Data and Model](/doc/use-cases/versioning-data-and-model-files)
from Use Cases