how to use DVC when data is stored in an external drive #563

Open
dashohoxha opened this issue Aug 15, 2019 · 11 comments · May be fixed by #565

Comments

@dashohoxha
Collaborator

commented Aug 15, 2019

In this case the data is located in a partition of size 16TB on an external drive, while the DVC project is on /home/, on a partition of size 320GB.

This Use-Case should explain the best solution (or a couple of possible solutions) for this situation.

Context: https://discordapp.com/channels/485586884165107732/485596304961962003/611244643685892153

@shcheklein shcheklein changed the title use-case: How to use DVC when data is stored in an external drive how to use DVC when data is stored in an external drive Aug 15, 2019

@dashohoxha

Collaborator Author

commented Aug 15, 2019

@shcheklein can I give this a try?

@shcheklein

Member

commented Aug 15, 2019

These are related/solve similar problems:

#455 (fixes #103 )
https://dvc.org/doc/use-cases/multiple-data-scientists-on-a-single-machine

Keep in mind:

#497

@shcheklein

Member

commented Aug 15, 2019

> @shcheklein can I give this a try?

@dashohoxha absolutely! just take a look at those tickets I mentioned above ^^ They potentially overlap, for one of them a PR is almost done.

@dashohoxha

Collaborator Author

commented Aug 15, 2019

The solution described by @efiop (tracking a data file that is external (outside the dvc project)) seems to be a different solution. Having a remote DVC cache (same as multiple-users-on-a-single-machine) is another solution. The NFS case seems to have a similar solution to multiple-users-on-a-single-machine.

@shcheklein

Member

commented Aug 15, 2019

@dashohoxha gotcha. This is a different one indeed. These sections - https://dvc.org/doc/user-guide/external-outputs and https://dvc.org/doc/user-guide/external-dependencies - should be reorganized/taken into account.

Also, keep in mind: my take on this is that there should be a very strong reason to complicate your workflow with external deps/outs/cache in the case of multiple drives. As I mentioned on Discord, I think in most cases the ideal scenario is to use an external cache and symlinks (similar to the NFS and shared cache scenarios).
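
For reference, the "external cache and symlinks" setup mentioned above could look roughly like this in the project's `.dvc/config`. This is only a minimal sketch: the cache path `/mnt/external/dvc-cache` is a hypothetical mount point (it does not come from this thread), and link-type behavior may differ between DVC versions.

```ini
[cache]
# Hypothetical location on the large external partition (assumption)
dir = /mnt/external/dvc-cache
# Link files from the cache into the workspace instead of copying them
type = symlink
```

The same values can also be set from the command line with `dvc cache dir <path>` and `dvc config cache.type symlink`. With a setup along these lines, the workspace under /home/ would hold only links, while the file contents stay on the external drive.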

@dashohoxha

Collaborator Author

commented Aug 16, 2019

> These sections - https://dvc.org/doc/user-guide/external-outputs and https://dvc.org/doc/user-guide/external-dependencies - should be reorganized/taken into account.

They seem accurate to me (unless there is some missing information that I don't know).
The problem is that it is difficult for the user to read all the details and intricacies on user guides and manual pages, and find the best solution for his case. Showing him what the best solution would be in a particular case (or a similar case) should be helpful.

@shcheklein

Member

commented Aug 16, 2019

@dashohoxha your PR looks good. There are some improvements that can be made, which I'll review and let you know about, but first I would like to understand the "use case" itself better: what the possible solutions for that "use case" are, how we should improve those sections in the User Guide, and how all this corresponds with the shared machine case (when there is a single cache set up on a separate partition). Without this holistic plan, we are potentially duplicating information, we are not properly communicating the use case, and we are not properly structuring the User Guide.

To give just some concerns:

  1. The "Huge data on external local drive" title. It's a very confusing title for the use case, starting from the "external local drive" (is it external or local, after all?) to the way it's formulated (huge data is probably not the problem; versioning it or managing it is the problem). "Huge" is a very vague term as well. Some people use a single huge drive for everything.

     Some better titles off the top of my head: Managing data storage on a separate drive, Versioning data and processing data outside your repo, etc ...

  2. No matter how good a name we come up with, there should be some integration with other parts of the docs (user guide, versioning examples). For example, in most cases we assume that the data is part of your workspace. Why don't we clarify somehow that if your data is substantially large there are ways to manage it "externally"?

  3. Back to the use case. It's basically about trying to version files that are located on a second large drive (it can be a second large HDD, it can be some shared NAS, etc.; the point is that it's a second large volume with tons of data and tons of space on it). Using external outs/deps is not the only way to deal with this. It's also not ideal. Should we include in this use case different ways of doing this, like "local external cache" + links? They overlap substantially, to my mind.

  4. The User Guide part of it. If a use case (especially its title) should be written in a way that immediately matches the user's request (rule of thumb: what words would I use to describe this situation if I needed to ask a question on chat?), then the User Guide is more like a well-structured manual. For example, "Managing External Data" is a good section that should actually combine external deps, external outs, some intro, and an overview of the use cases with links and instructions on how specific cases could be solved.

So, let's please discuss and agree on some strategy behind this.

@jorgeorpinel would love to hear your opinion on this.

@jorgeorpinel

Collaborator

commented Aug 17, 2019

> Without this holistic plan, we are potentially duplicating information, we are not properly communicating the use case, and we are not properly structuring User Guide.

Yes!

It's funny because I've been noticing significant confusion around external X topics so I opened #566 recently. I also feel like we may need to regroup and figure out the connections between all the external data stuff before deciding which docs to change.

That said it's good to have more use cases and I'll review the PR but if we don't figure out the big picture, this doc may only add to the confusion of some users, like Dashamir mentioned in #563 (comment).

@jorgeorpinel

Collaborator

commented Aug 17, 2019

Questions about #563 (comment) @shcheklein:

> 2. No matter how good a name we come up with, there should be some integration with other parts of the docs (user guide, versioning examples)... Why don't we clarify somehow that if your data is substantially large there are ways to manage it "externally".

Do you mean to add notes and links in all other documents where it can be useful?

> 4. The User Guide part of it...

Similar question. Are you suggesting that Dashamir update the existing user guides accordingly in the same PR (#565)?

@dashohoxha

Collaborator Author

commented Aug 17, 2019

@shcheklein I believe I get your point. These discussions certainly help me to think about the problems and to look for solutions.

I am still trying to understand DVC and figure out any problems with the docs, what can be improved etc. I am also trying to follow the discussions (as much as I can), which may help me with understanding the problems. The simple tasks that I try to do are just to get familiar with the workflow, the tools, the community, and with DVC itself and its docs (of course).

So, I don't have any quick answers yet. Are you asking me to finish the hard part of the job without even starting yet? :)

@shcheklein

Member

commented Aug 20, 2019

@dashohoxha not at all! it was not even a critique of your PR, it was an attempt from my end to systematize my thoughts about the current state of the stuff related to the external data management, external cache, NFS, etc, and come up with some initial strategy.

I'll review the latest changes asap (we are traveling now, so please give us a bit more time).
