added a BC: workspace document #2197

Merged
merged 17 commits into from
May 10, 2021
Changes from 10 commits
6 changes: 6 additions & 0 deletions content/docs/sidebar.json
@@ -91,6 +91,12 @@
"label": "What is DVC?",
"slug": "what-is-dvc"
},
{
"label": "Basic Concepts",
"slug": "basic-concepts",
"source": false,
"children": ["workspace"]
},
{
"slug": "project-structure",
"source": "user-guide/project-structure/index.md",
36 changes: 36 additions & 0 deletions content/docs/user-guide/basic-concepts/workspace.md
@@ -6,3 +6,39 @@ tooltip: >-
models, etc. Typically, it's also a Git repository. It will contain your DVC
project.
---

# Workspace

A data science project consists of data obtained from many different sources.
This data may be split into multiple files or directories or (as the project
structure needs) have different versions for different requirements. e.g. A
smaller / simplified version might be required in prototyping for faster
feedback and shorter training times. A single workspace to manage all artifacts
of a project is desirable, although versioning needs and managing dependencies
make it increasingly complex.

DVC allows a single directory to contain all your project artifacts. The
Contributor @jorgeorpinel, Mar 15, 2021:

> a single directory to contain all your project artifacts

Not exactly. File contents are organized in the cache with a special file structure (see https://dvc.org/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory).

> The workspace is the directory containing the visible part of your project

That part is correct, and it contradicts the previous part 🙂 (because "visible part" implies there's a hidden part, which must be in other directories).

Let's open this paragraph with that sentence.

workspace is the directory containing _user visible_ part of your
<abbr>project</abbr> e.g. raw datasets, source code, ML models, etc. Users work
in this directory using their data and model files that and manipulate the
contents through DVC commands.

Files and directories in the workspace can be added to DVC (`dvc add`) or they
can be downloaded from external sources (`dvc get`, `dvc import`,

Contributor:

But actually `add` can also download data to the workspace (see the `--out` and `--to-remote` options). Also, the `import*` commands download AND track data. You may want to rephrase this part accordingly 🙂

Contributor, Author:

Sometimes I think there is no command that hasn't got a duplicate, somehow :) I try to mention commands in passing; if we considered each and every command option, we'd need to duplicate the command reference here, IMHO.

We can just delete the commands if you like.

Contributor, Author:

I can also list all possibilities for each functionality, like

In the workspace, you can

  • Import files and directories (dvc add --out, dvc import-url...)

but I think this will turn the document into a list of commands and options.

Contributor:

No need to list every command usage of course, agreed!

My point was that the 3 commands mentioned actually overlap in a way that makes the current text slightly incorrect. In any case, the main use case of `add` is not to "add" but to "track". Please check each command reference to find the right terms when needed 🙂

"Download" is correct for `get`/`import`, but `add` can also download (and they can all "transfer"), so I'd probably avoid that term. In fact, I wouldn't even mention `get` here, since it doesn't require a DVC project/workspace. For `import`, I'd try to use the command name as the relevant action (to "import"), I guess...

Contributor @jorgeorpinel, Mar 15, 2021:

> the main use case of add is not to "add" but to "track"
> For import I'd try to use the cmd name as the relevant action (to "import")

Then again, the `import*` commands also track the downloaded data 😅 (they "add" it). Maybe it should be a single sentence about tracking, with `add`, `import`, and `import-url` all in the same parentheses.

`dvc import-url`). Changes to the files, directories, notebooks, models, and in
general machine learning file system can be tracked (`dvc commit`) and versioned
in Git (`dvc checkout`). They can be removed (`dvc remove`) from the workspace.
Contributor:

  • It sounds like we're saying that dvc checkout versions data with Git. Can you clarify a bit?
  • "They can be removed" -> "Tracked data can be..."

Contributor, Author:

Modified in 6cc0f6c

Contributor @jorgeorpinel, Feb 27, 2021:

> data, notebooks, models, and any related machine learning artifact

That list got a little long? Probably no need for "machine learning" here, since the term is found elsewhere in the doc...

> their content can be synchronized (dvc checkout)

Good one! But it needs some clarifying (sync what with what?).

Also, people usually need checkout before commit (which is included in add/repro). So maybe something like:

"When switching between |repository| versions, use dvc checkout to sync DVC-tracked data with Git-tracked |metafiles|. If you manually modify the workspace status, use dvc commit to record the changes. And if needed, tracked data can be removed..."

|word| means tooltip

Contributor, Author:

> "When switching between |repository| versions, use dvc checkout to sync DVC-tracked data with Git-tracked |metafiles|. If you manually modify the workspace status, use dvc commit to record the changes. And if needed, tracked data can be removed..."

This sentence looks more suitable for the use cases or the user guide. Here I would like to mention just the capabilities. Conceptually, `dvc checkout` is similar to `git checkout` and `dvc commit` is akin to `git commit`. Maybe this can be turned into a bullet list:

  • You can add your files and directories ...
  • You can download external sources
  • You can synchronize your workspace with the cache
  • You can do this
  • You can do that

I feel this paragraph is too dense.

Contributor, Author:

But reading it again, this looks more like a political ad than technical documentation. Maybe the "you can"s are superfluous. Just

  • Add your files (dvc add)
  • Download external sources (dvc get...)
  • ...

is better.

<abbr>Pipelines</abbr> and <abbr>dependencies</abbr> between them can be
defined. Data and model files can be moved to the cloud and retrieved when
necessary (`dvc push`, `dvc pull`). DVC supports all typical operations of files
and directories of a file system through its commands.
Contributor:

> DVC supports all typical operations...

What do we mean by this?

Contributor, Author:

We mean the user can live in the workspace and use `dvc` to do what they normally do with files: create, copy, rename, etc.

Contributor @jorgeorpinel, Feb 27, 2021:

I see. Yes it may be a relevant note. This sentence seems to belong more to the previous paragraph though? Something like:

"There's usually no need to modify tracked data manually as DVC provides commands to safely perform any update needed, but if you do, use dvc commit to register the changes..."

(see my previous comment, these would have to be merged somehow).


Behind the scene these operations of a <abbr>DVC project</abbr> uses
<abbr>metafiles</abbr> like the `.dvc/` directory, `dvc.yaml` or files with
`*.dvc` extension to track the content and dependencies.
Contributor:

these -> the?

But I'm still not sure I'm getting how this paragraph relates to the concept of workspace. It may just need some rephrasing, because mentioning `dvc.yaml` and `*.dvc` files could definitely be relevant. Or fit those files into the intro paragraph (probably best).

Contributor, Author:

"these" refers to the typical operations from the previous paragraph.

Contributor, Author:

I think introducing .dvc files in the intro is a bit distracting.

Could you check e85a0d7 for changes to this?

Contributor:

I think we can definitely incorporate `.dvc` files and `dvc.yaml` mentions earlier, closer to the main concept definition. Metafiles are among the most important contents of the workspace, along with the corresponding data and any Git-tracked assets (mainly code). `.dvc/` is not considered part of the workspace. The workspace is analogous to the working tree in Git.
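
The distinction drawn in this comment can be sketched as a small path classifier: `dvc.yaml` and `*.dvc` files count as workspace metafiles, anything under `.dvc/` is internal and outside the workspace, and everything else is ordinary data or code. This is only an illustrative sketch; the `classify` helper and its category labels are hypothetical and are not part of DVC.

```python
from pathlib import PurePosixPath


def classify(path: str) -> str:
    """Label a project-relative path (hypothetical helper, not a DVC API)."""
    parts = PurePosixPath(path).parts
    # Anything under the top-level .dvc/ directory is internal,
    # so it is not considered part of the workspace.
    if parts and parts[0] == ".dvc":
        return "internal"
    name = parts[-1] if parts else ""
    # dvc.yaml and files with the .dvc extension are metafiles.
    if name == "dvc.yaml" or PurePosixPath(name).suffix == ".dvc":
        return "metafile"
    # Everything else: data, models, source code, notebooks, etc.
    return "data-or-code"


if __name__ == "__main__":
    for p in [".dvc/config", "dvc.yaml", "data/data.xml.dvc", "src/train.py"]:
        print(p, "->", classify(p))
```

In this sketch, both "metafile" and "data-or-code" paths would belong to the workspace; only "internal" paths fall outside it.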


## Further Reading

- [What is DVC?](/doc/user-guide/what-is-dvc)
- [Versioning Data and Model](/doc/use-cases/versioning-data-and-model-files)
  from Use Cases