Shared cache on NFS Introduced #455

Closed
wants to merge 13 commits into from

Conversation

ryokugyu
Contributor

@ryokugyu ryokugyu commented Jun 25, 2019

fix #103

@shcheklein shcheklein temporarily deployed to dvc-org-pr-455 June 26, 2019 23:19 Inactive
link data files from cache to your workspace. It enables symlinks to avoid
copying large files.

`cache.protected true` - to make links `read only` so that we you don't corrupt
Member

Again, explain this better: since we are going to use symlinks between the cache and the workspace in this case (they are located on different file systems), it's important to protect the files so that we don't corrupt the cache accidentally. Mention that `dvc unprotect` should be used in this case, and link to https://dvc.org/doc/user-guide/update-tracked-file

Contributor Author

@ryokugyu Jul 22, 2019

@shcheklein do we need `dvc unprotect` only when we are writing to NFS directly?

Member

We need to run `dvc unprotect` in the client's workspace if we want to edit or rewrite a file that is under DVC control.
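
A minimal sketch of the idea discussed above, using only POSIX tools rather than DVC itself. It assumes that protecting a linked file amounts to making the cache copy read-only, and that unprotecting amounts to swapping the symlink for a private, writable copy; how DVC does this internally is not shown in this thread. All paths are hypothetical.

```shell
#!/bin/sh
# Illustration only: no DVC commands are run, and every path is made up.
set -eu

demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
mkdir -p "$demo/cache" "$demo/workspace"

# A file in the "cache" (standing in for the shared cache on the NFS mount).
printf 'v1\n' > "$demo/cache/data.csv"

# Protect it by making it read-only (the assumed effect of cache.protected).
chmod 444 "$demo/cache/data.csv"

# The workspace holds only a symlink, so no data is copied.
ln -s "$demo/cache/data.csv" "$demo/workspace/data.csv"

# Permissions seen through the link are those of the cache file.
link_mode=$(ls -lL "$demo/workspace/data.csv" | cut -c1-10)
echo "mode seen through the link: $link_mode"

# To edit, replace the link with a writable copy
# (an approximation of what `dvc unprotect` is for).
cp "$demo/cache/data.csv" "$demo/workspace/data.csv.tmp"
rm -f "$demo/workspace/data.csv"
mv "$demo/workspace/data.csv.tmp" "$demo/workspace/data.csv"
chmod u+w "$demo/workspace/data.csv"
printf 'v2\n' >> "$demo/workspace/data.csv"

# The edit lands in the workspace copy; the cache file is untouched.
echo "cache:     $(cat "$demo/cache/data.csv")"
echo "workspace: $(tr '\n' ' ' < "$demo/workspace/data.csv")"
```

Without the copy step, a write through the symlink would go straight to the shared cache file, which is the accidental corruption the review comment warns about.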

Now, add first version of the dataset into the DVC cache (this is done once for
a dataset).

```dvc
...
```
Member

Let's simplify this whole workflow. Let's just ask users to SSH into the NFS server machine and run `git clone .../project`, then move the data into the project and run `dvc add`, `git commit`, `git push` (`dvc push` is optional). Everything below can then be adjusted a bit.

Contributor Author

@shcheklein I think it will just confuse the user.

Member

`cp -r . /project/` is also very confusing. I would say we need to explain the motivation here: we want to avoid copying existing data to a client machine just to take it under DVC control.

I also think `git clone` is a standard way to collaborate and update different requirements. It's better to do this from the NFS server machine; it emphasizes that NFS takes care of the data.
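
The workflow proposed above could look roughly like the sketch below, meant to be run on the NFS server after SSHing in. It is a dry run that only prints the commands, so nothing is cloned or committed. `REPO_URL` and `DATA_SRC` are placeholders (the thread does not give the real repository URL or data path), and the file names `data` and `data.dvc` are assumptions for illustration.

```shell
#!/bin/sh
# Dry run: prints each step instead of executing it.
# REPO_URL and DATA_SRC are placeholders, not values taken from this PR.
set -eu

REPO_URL=${REPO_URL:-"<repo-url>"}
DATA_SRC=${DATA_SRC:-"<path-to-existing-data>"}

# Print a command the way `set -x` would, without running it.
step() { printf '+ %s\n' "$*"; }

plan() {
  step git clone "$REPO_URL" project
  step cd project
  step mv "$DATA_SRC" data          # move the data in, no copy to a client machine
  step dvc add data                 # take the dataset under DVC control
  step git add data.dvc .gitignore  # assumed names of the files `dvc add` creates
  step git commit -m "Add first version of the dataset"
  step git push
  step dvc push                     # optional, per the comment above
}

plan
```

Running the steps on the server keeps the dataset where it already lives; clients then only need `git clone` or `git pull` plus a cache that points at the shared mount.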

Member

@shcheklein left a comment

Really good stuff 🎉 Requires a second iteration to clarify/simplify certain things. Let me know if you need some help with it.

@ryokugyu
Contributor Author

@shcheklein please review this.

possible and have a workspace restoration/switching speed as instant as
`git checkout` for your code.

With large data files it is better to set the cache directory to external NFS.
Member

Minor: we use "it's" here; it's less formal.

`git checkout` for your code.

With large data files it is better to set the cache directory to external NFS.
Not only just it will cache the data faster but also version the data. Suppose,
Member

"cache the data faster": I'm not sure I understand this.

of a complete dataset. With `cache directory` set to `NFS server` you would
avoid copying large files from NFS server to the machine and DVC will manage the
links from the workspace to cache.

Member

The paragraph above is good, but it feels repetitive with the first paragraph in the document and has minor problems. What information are you trying to deliver here? Can you summarize it here in the comments, please? Then we'll see how we can improve the text.

@@ -0,0 +1,148 @@
# Shared Storage on NFS

In the modern software development environment, teams are working together on
Member

software development -> machine learning

(I even agree that it's software engineering, but it's better to delineate them for now)

@@ -0,0 +1,148 @@
# Shared Storage on NFS
Member

I want to rename it: it's not only about NFS, it's about any network-attached storage. We can do something like:

Shared Storage on NAS (NFS)


In the modern software development environment, teams are working together on
same dataset to get the results. It became necessary that data is accessible and
every team member has a same updated dataset. NFS (Network File System) storage
Member

NAS (NFS is one common example) is widely ...

Member

We can also mention something like: "Here we would like to show you how to set up a shared cache on NFS, but the same idea applies to any other NAS."

team member is using the same cache location.

After configuring NFS on both server and client side. Let's create an export
directory on server side where all data will be stored.
Member

It's better to end the sentence with a colon (`:`) when it introduces a code block.

From `/mnt/dataset/` you will be able to access `/storage` directory present in
host server from your local machine.

## Configuring Cache location
Member

location -> Location

Next, you can easily get this appear in your workspace by:

```dvc
$ cd /home/user/project/
```
Member

Earlier, the project path was `/project`.

information on `.dvc` file format, visit
[here](/doc/user-guide/dvc-file-format).

`data` directory will now be a symbolic link to the NFS storage.
Member

It's worth writing something similar to the last paragraph of the introduction to reiterate why links are important. It's also worth showing the output of `ls -a`.

Member

@shcheklein left a comment

It looks great! We are almost there. Please check some comments. Also, I'll try to come up with an image, similar to what we have for other use cases. Good stuff.

@shcheklein
Member

@ryokugyu any updates on this? :) it's almost done as far as I can tell, would be great to get it merged.

@ryokugyu
Contributor Author

> @ryokugyu any updates on this? :) it's almost done as far as I can tell, would be great to get it merged.

@shcheklein will work on it. Sorry for the delay!

@dashohoxha
Contributor

I think that the "Mounted DVC Storage" (which is explained on this interactive example: https://katacoda.com/dvc/courses/examples/mounted-storage) is more general than just NFS and it deprecates this one.

@shcheklein
Member

> is more general than just NFS

My concern is that it's very specific because of SSHFS, and it's not emphasized enough that NFS, NAS (whatever else?) is covered.

> deprecates this one

don't think so. Especially the way interactive tutorials are made - they are extremely dry and do not explain motivation very well, do not explain what is happening behind the scene and what commands are doing.

@dashohoxha
Contributor

> don't think so. Especially the way interactive tutorials are made - they are extremely dry and do not explain motivation very well, do not explain what is happening behind the scene and what commands are doing.

In this case it is just an interactive example (not a tutorial) and it is referenced from a User Guide page: https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/mounted-storage
So, the motivation and high level explanations are supposed to be elaborated on the UG page.

@shcheklein
Member

@dashohoxha

> In this case it is just an interactive example (not a tutorial) and it is referenced from a User Guide page: https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/mounted-storage
> So, the motivation and high level explanations are supposed to be elaborated on the UG page.

kk. It's just that in your initial comment you mentioned only the interactive tutorial, and I hadn't had enough time to see the UG changes. Will get back to this one when I have time to read the epic PR :)

@ryokugyu ryokugyu closed this Nov 17, 2019
@shcheklein
Member

I think it is still relevant. Unfortunately, no easy way to reopen it now.

Development

Successfully merging this pull request may close these issues.

guide: using NFS as a remote storage
3 participants