Added user guide page on remote storage #3337

Closed
wants to merge 5 commits into from


@michael-mayo michael-mayo commented Mar 7, 2022

Rel. #2866

@shcheklein shcheklein changed the title from "Added user guide page on remote storage (Fix#1792)" to "Added user guide page on remote storage" Mar 7, 2022
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-org-master-yllpkpwgl95aphz March 9, 2022 05:33 Inactive

@jorgeorpinel jorgeorpinel left a comment


Yep, this looks like a guide! Here's a first round of mid-level comments in the intro and high-level comments for the other sections. Thanks!

BTW, I deployed the page here for easier reviewing 🙂

Comment on lines +5 to +6
DVC can use remote storage instead of local disk space for storing previously
committed versions of your project. You may need to do this if, for example,

committed versions of your project

Let's clarify that (cache and) remote storage only stores the DVC-tracked data side of the project. The versioning side is done with Git. This is a key aspect of DVC storage in general.

Comment on lines +6 to +8
committed versions of your project. You may need to do this if, for example,

- You don't have enough disk space for storing all the old versions locally.

Nice to include the motivation/uses of remote storage! You got some of the main ones right, but:

  1. The order should be inverted, I think: sharing/collaboration first, then backup and "not enough space", which are basically the same case.
  2. A missing case is allowing custom data management designs, e.g. storing raw data in one remote, features in another, and models in a third, all with different access rights (the authentication layer provided by storage platforms is key).
  3. I'd try to keep the bullets much shorter to make the list more effective. If needed, some of the details can be moved to later parts of the doc after the intro.
  4. (minor) Previous versions aren't necessarily "old" (that seems to imply "outdated"). They may just be different in other ways. This term is currently used throughout the doc and can be misleading.

![Local cache and remote storage](/img/remote_storage.png)

Multiple old versions of the project (six of them, in fact) are being archived
on remote storage. The current working version of the project (version 7) has

Suggested change
on remote storage. The current working version of the project (version 7) has
on remote storage. The latest version of the project's data (7) has


Loving the diagrams BTW 👍🏼

Comment on lines +14 to +16
and this is updated everytime you issue `dvc commit`. If you are committing
frequently, and making big changes with each commit, you could easily run out
of local storage after a while.

`dvc commit` is too low-level here. That command is in fact a helper: the operation is included in `dvc add` and, when needed, in `dvc repro`/`dvc exp run`. We don't need to get so specific in this guide though. We can keep it general, e.g. "every time a new version of data dependencies or outputs is saved with DVC" or something like that. Keep in mind the main workflows that lead to these data commits: again, 1. adding base data and 2. somehow reproing a pipeline or experiment.


Also, the term "committing" is tricky: are you referring to using `git commit` on an entire project version? DVC "commits" data internally (to the cache) even when no Git commit happens (which is why `dvc repro --no-commit` exists, for example).

Please review the entire doc with these comments in mind, since at the moment there are lots of mentions of committing and `dvc commit`. Thanks

Comment on lines +38 to +39
stored. The data scientist recently cleared the local cache using DVC's garbage
collection command `dvc gc`. Periodically, she/he issues `dvc push` to send new

I wouldn't complicate the intro with `dvc gc` mentions for now. That deserves a section or even a page of its own, probably (not expected for this PR).

Comment on lines +43 to +45
## Connecting and Pushing to Remote Storage

Multiple cloud storage providers can be used with DVC, and connecting is fairly

Let's make a section specific to setting up and connecting first? OK to mention `dvc push` in the examples (if needed) as the way to confirm connectivity, but the focus should be on general setup (`dvc remote add`/`modify` commands, including general authentication info), ideally listing all supported types of remotes.

In fact it will probably be a series of pages for all that (extracted from https://dvc.org/doc/command-reference/remote/modify#available-parameters-per-storage-type) but no need to go nearly that far for now, we can keep it much more general in this PR.

Then the later section about Sharing can have all the details or example involving push.

Comment on lines +94 to +96
## Content Addressable Storage Format

DVC optimises the storage space used in the local cache and remotes by ensuring

Thanks for this section. It's definitely interesting content, as I mentioned in the previous PR, but it is still more about the DVC cache mechanism. To make the review process easier, please make it into a separate page somewhere and leave it out of this doc, or link to it if/where needed.

Comment on lines +156 to +159
The next brief example shows a directory `myDir` tracked by DVC containing two
files `a` and `b`:

![Local cache and remote storage](/img/cache_structure.png)

This 2nd diagram is confusing though. But again, for now we can leave out all these details about the caching mechanism and storage optimization, which are not specifically about remote storage.

from the local workstation to the remote storage, there should be identical
folders in both places.

## Sharing Files via Remote Storage

Finally, this section could probably be more hands-on as well, showing push and pull and maybe even another diagram. Take a look, for example, at this page, which we recently removed because it wasn't in the right area of our docs (but the content is still relevant and can be recovered to some extent here).

Also, let's make the title more general? And it's nice to mention ML models sometimes in the context of data management (which includes local and remote storage). So here's a new title suggestion:

Suggested change
## Sharing Files via Remote Storage
## Sharing Data and Models

@jorgeorpinel jorgeorpinel added the status: stale You've been groomed! label Mar 30, 2022
@jorgeorpinel

Closing as stale for now. Please reopen if needed.

Labels
status: stale You've been groomed!
2 participants