cmd ref: elaborate on how DVC understands a file's status #576
Comments
I think it uses a combination of timestamp and content checksum (for efficiency). It has a cache database of "filename,timestamp,checksum" entries. If the timestamp of a file matches what is in the database, the cached checksum is used; otherwise the checksum of the file is recomputed. Correct me if I am wrong. However, I don't remember how I know this: maybe I read it in the docs, maybe in some discussion. |
@dashohoxha you are right, we use mtime and size to quickly detect that a file has not been modified. However, what @drorata said here is that it's not obvious how we calculate the md5 for a file. |
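The shortcut described in the two comments above can be sketched roughly as follows. This is only an illustration of the idea, not DVC's actual code: the in-memory `_state` dictionary and the `file_md5` and `cached_md5` helpers are hypothetical names made up for this example, and DVC's real state database is not shown in this thread.

```python
import hashlib
import os

# Hypothetical in-memory stand-in for a state database:
# absolute path -> (mtime_ns, size, md5)
_state = {}


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the md5 of a file's content, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def cached_md5(path):
    """Return (md5, recomputed).

    If mtime and size match the stored entry, reuse the stored md5.
    Otherwise hash the content again and refresh the entry.
    """
    st = os.stat(path)
    key = os.path.abspath(path)
    entry = _state.get(key)
    if entry is not None and entry[0] == st.st_mtime_ns and entry[1] == st.st_size:
        return entry[2], False
    digest = file_md5(path)
    _state[key] = (st.st_mtime_ns, st.st_size, digest)
    return digest, True
```

In this sketch, mtime and size only decide whether the content has to be hashed again. Whether the file counts as changed still depends on the md5 alone, so touching a file without editing it triggers a re-hash but yields the same checksum.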
I also referred to the clarity around the process that @dashohoxha described. Furthermore, in this case, I'd raise the issue that this description doesn't seem to hold when the file is located on S3. I tested something like:
|
@drorata I am afraid we do not support checking the status of an imported file at its source.
Why @drorata's scenario does not work:
EDIT:
The statement above is false. We do check the status of imported data, by comparing its checksum with the one available in the workspace. |
@drorata Can you elaborate on what scenario you have in mind? Why would you like for |
First, @pared's comment is important and I'm not sure it is part of the documentation. Secondly, I'm now confused. If at step 3 in my comment I uploaded a new file (with different content) to S3, then |
@drorata You are right, I have made a mistake. To clarify: |
So it would seem to me that S3 also does not take timestamp into consideration when calculating checksum. @drorata do you think that |
I'm not saying anything yet. I'm trying to better understand the behavior. So for a local file, if neither mtime nor size changed, the checksum is checked. If this hasn't changed either, dvc deems the file unchanged. For a file which is imported from S3, only the content is checked? |
@drorata got it.
It's quite the contrary: when running
In this particular use case, we rely on the checksum (ETag) provided by S3. |
That all makes much more sense now. It would be great to improve the docs to reflect these different scenarios. BTW: I assume there are very good reasons to have different behavior for the two use cases, but from a user experience perspective it is a little confusing. |
DVC only takes into account the actual content of files (and ignores, for example, the last modification timestamp). This is not mentioned explicitly in the docs and is worth having somewhere.
As per @pared's recommendation, a reasonable place is https://dvc.org/doc/get-started/share-data#save-and-share-data.
UPDATE: Probably start with https://dvc.org/doc/command-reference/status and look at mentions in other docs after that.