cmd ref: elaborate on how DVC understands a file's status #576

Closed
drorata opened this issue Aug 21, 2019 · 11 comments
Labels
A: docs Area: user documentation (gatsby-theme-iterative) C: ref Content of /doc/*-reference type: enhancement Something is not clear, small updates, improvement suggestions

Comments

@drorata
Contributor

drorata commented Aug 21, 2019

DVC only takes into account the actual content of files (and ignores, for example, the last modification timestamp). This is not mentioned explicitly in the docs and is worth documenting somewhere. As per @pared's recommendation, a reasonable place is https://dvc.org/doc/get-started/share-data#save-and-share-data.

UPDATE: Probably start in https://dvc.org/doc/command-reference/status and see mentions in other docs after that.

@dashohoxha
Contributor

I think it uses a combination of both the timestamp and a checksum of the content (for efficiency). It has a cache database of "filename,timestamp,checksum" entries. If the timestamp of a file matches what it has in the database, the cached checksum is used; otherwise the checksum of the file is recomputed.

Correct me if I am wrong.

However, I don't know how I know this: maybe I read it in the docs, maybe in some discussion.

@pared
Contributor

pared commented Aug 21, 2019

@dashohoxha you are right, we use mtime and size to quickly detect that a file has not been modified. However, what @drorata said here is that it's not obvious how we calculate the md5 for a file.

@drorata
Contributor Author

drorata commented Aug 21, 2019

I also referred to the clarity around the process that @dashohoxha described. Furthermore, in this case, I'd raise the issue that this description doesn't seem to hold when the file is located on S3. I tested something like:

  1. Placed a file on S3
  2. Imported it to a dvc workspace (using dvc import-url)
  3. Then, I uploaded the same file again to S3 (meaning its timestamp on S3 changed)
  4. Lastly, dvc status didn't mention any change in the file.

@pared
Contributor

pared commented Aug 21, 2019

@drorata I am afraid we do not support checking status of imported file at its source.
What @dashohoxha mentioned is our internal optimization to avoid recalculating the md5 for each file in our workspace. It works as follows:

  • we have some file/directory in a workspace
  • we add it under dvc control with add, import-url, run -o or -d
  • during adding, we calculate the md5, mtime and size for the given object
  • the next time we call status or some command relying on the file (e.g. repro), to avoid the md5 calculation, we check in the state database whether mtime and size have changed; if they are the same as they were at the time of add, we are good to go; if not, we need to retrigger the md5 calculation
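
The steps above can be sketched roughly as follows. This is only an illustration of the idea, not DVC's actual code: the `state` dictionary stands in for the state database, and the names `file_md5` and `cached_md5` are hypothetical.

```python
import hashlib
import os


def file_md5(path):
    """Compute the md5 of a file's content, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_md5(path, state):
    """Return (md5, recomputed) for `path`.

    `state` is a plain dict standing in for the state database (an
    assumption for this sketch): path -> (mtime_ns, size, md5).
    """
    st = os.stat(path)
    key = (st.st_mtime_ns, st.st_size)
    entry = state.get(path)
    if entry is not None and entry[:2] == key:
        # mtime and size are unchanged: trust the stored md5.
        return entry[2], False
    # Something changed (or the file is new): recompute and remember.
    md5 = file_md5(path)
    state[path] = (key[0], key[1], md5)
    return md5, True
```

Note that a changed mtime only triggers a recomputation. If the content is identical, the resulting md5 is the same as before, so the file would still be reported as unchanged.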

Why @drorata's scenario does not work:

  • dvc does not remember the metadata of an imported file at its source; it remembers the metadata of the imported file in the workspace (so if we did dvc import-url {url} {workspace_path}, dvc tracks workspace_path, not url)

EDIT:

I am afraid we do not support checking status of imported file at its source.

That statement is false. We do check the status of imported data, by comparing its checksum with the one available in the workspace.

@pared
Contributor

pared commented Aug 21, 2019

@drorata Can you elaborate on what scenario you have in mind? Why would you like status to work on data at its source?

@drorata
Contributor Author

drorata commented Aug 21, 2019

First, @pared's comment is important and I'm not sure it is part of the documentation. Secondly, I'm now confused. If at step 3 in my comment I uploaded a new file (with different content) to S3, then dvc status would spot it. So it does track the remote location as well. Or am I missing something?

@pared
Contributor

pared commented Aug 21, 2019

@drorata You are right, I have made a mistake.

To clarify:
For an imported file, status will detect if it has changed, but the detection is performed by checking whether its checksum has changed. In the case of an S3 remote we should note that "checksum" == "etag".

@pared
Contributor

pared commented Aug 21, 2019

So it would seem to me that S3 also does not take the timestamp into consideration when calculating the checksum.

@drorata do you think that status should detect a change of timestamp? Why could it be useful? If the file did not change, even if its metadata changed, is there a reason why we should detect it?

@drorata
Contributor Author

drorata commented Aug 21, 2019

I'm not saying anything yet. I'm trying to better understand the behavior. So for local file if neither mtime nor size changed the checksum is checked. If this hasn't changed either, dvc deems the file unchanged. For a file which is imported from S3 only the content is checked?

@pared
Contributor

pared commented Aug 21, 2019

@drorata got it.

So for local file if neither mtime nor size changed the checksum is checked.

It is quite the contrary: when running status for a local file, if mtime and size did not change since the last time we did something with the target, we can safely assume that the file did not change. If mtime or size did change, it's an indicator that something has happened, and we should check the md5.

For a file which is imported from S3 only the content is checked?

In this particular use case, we rely on the checksum (etag) provided by S3.
So when you check status, we ping S3 with the file url to get the etag for the given file. If it's the same as it was when we imported the file, the status is unchanged. If the etag did change, status will report that this dependency has changed.
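
A minimal sketch of that comparison, for illustration only. The names `import_status`, `fetch_etag` and `simple_upload_etag` are hypothetical and not part of DVC or any S3 client. The sketch also assumes a simple single-part upload without KMS encryption, where S3 reports the MD5 of the object data as the ETag; multipart uploads produce ETags that are not a plain MD5.

```python
import hashlib


def simple_upload_etag(content: bytes) -> str:
    """ETag for a single-part, non-KMS upload (assumed case): MD5 of the bytes."""
    return hashlib.md5(content).hexdigest()


def import_status(recorded_etag: str, fetch_etag) -> str:
    """Compare the etag recorded at import time with the current one.

    `fetch_etag` is a hypothetical callable that asks the remote for the
    object's current etag; a real version would issue a request to S3.
    """
    current = fetch_etag()
    return "unchanged" if current == recorded_etag else "changed"


# Etag recorded when the file was imported.
recorded = simple_upload_etag(b"some data")

# Re-uploading the same bytes later: the upload time does not enter the digest.
same_content = import_status(recorded, lambda: simple_upload_etag(b"some data"))

# Uploading different bytes under the same key.
new_content = import_status(recorded, lambda: simple_upload_etag(b"other data"))
```

Under these assumptions `same_content` is "unchanged" and `new_content` is "changed", which matches the behavior reported earlier in this thread: re-uploading an identical file is not reported by status, while uploading different content is.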

@shcheklein shcheklein changed the title Elaborate the explanation how dvc understands a file's status elaborate the explanation how dvc understands a file's status Aug 21, 2019
@shcheklein shcheklein added A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions user-guide labels Aug 21, 2019
@drorata
Contributor Author

drorata commented Aug 21, 2019

That all makes much more sense now. It would be great to improve the docs to reflect these different scenarios.

BTW: I assume there are very good reasons to have different behavior for the two use cases, but from a user experience perspective it is a little confusing.

@dashohoxha dashohoxha mentioned this issue Oct 25, 2019
10 tasks
@jorgeorpinel jorgeorpinel changed the title elaborate the explanation how dvc understands a file's status user-guide: elaborate on how DVC understands a file's status Jan 20, 2020
@jorgeorpinel jorgeorpinel changed the title user-guide: elaborate on how DVC understands a file's status get-started: elaborate on how DVC understands a file's status Jan 20, 2020
@jorgeorpinel jorgeorpinel changed the title get-started: elaborate on how DVC understands a file's status cmd ref: elaborate on how DVC understands a file's status Mar 15, 2020
@iesahin iesahin added the C: ref Content of /doc/*-reference label Oct 21, 2021
@dberenbaum dberenbaum closed this as not planned Won't fix, can't repro, duplicate, stale Apr 24, 2024

7 participants