guide: describe how hash values (md5) are calculated #68

Closed
efiop opened this issue Aug 7, 2018 · 14 comments
Labels
C: guide Content of /doc/user-guide type: enhancement Something is not clear, small updates, improvement suggestions

Comments

@efiop
Contributor

efiop commented Aug 7, 2018

Explain how we compute checksums for files, why we apply dos2unix when computing the md5, and how to map the md5 value found in a dvcfile to the actual cache file in .dvc/cache.
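
The last point, mapping an md5 value to a cache file, can be sketched as below. This is a minimal illustration, not DVC's actual code: it assumes a layout in which the first two hex characters of the hash name a subdirectory of .dvc/cache and the remaining characters name the file. The helper name `cache_path_for` is hypothetical, and the layout may differ between DVC versions.

```python
from pathlib import PurePosixPath


def cache_path_for(md5: str, cache_dir: str = ".dvc/cache") -> PurePosixPath:
    """Return the assumed cache location for a given md5 hex digest.

    Assumption (not confirmed in this thread): the cache is sharded by the
    first two hex characters of the hash.
    """
    if len(md5) != 32:
        raise ValueError("expected a 32-character md5 hex digest")
    # First two characters -> subdirectory, remaining 30 -> file name.
    return PurePosixPath(cache_dir) / md5[:2] / md5[2:]


# Example with the md5 of empty input.
print(cache_path_for("d41d8cd98f00b204e9800998ecf8427e"))
```

Under that assumption, looking up a hash from a dvcfile is a matter of splitting the string and checking whether the resulting path exists.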

@efiop efiop added the A: docs Area: user documentation (gatsby-theme-iterative) label Aug 7, 2018
@shcheklein shcheklein added type: enhancement Something is not clear, small updates, improvement suggestions user-guide labels Mar 25, 2019
@shcheklein shcheklein changed the title docs: user guide: DVC Checksums user guide: DVC checksums Mar 25, 2019
@shcheklein shcheklein changed the title user guide: DVC checksums describe the way DVC checksums are calculated Mar 25, 2019
@ghost ghost added the p1-important Active priorities to deal within next sprints label Jul 23, 2019
@jorgeorpinel
Contributor

More details from Ruslan

Our checksums usually match the output of the md5/md5sum commands, except for text files with Windows-style CRLF line endings.
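
A minimal sketch of why the values can differ, assuming that the dos2unix step mentioned in the issue description amounts to replacing CRLF with LF before hashing. The function name `md5_normalized` and the `is_text` flag are hypothetical; how DVC decides whether a file counts as text is not covered here.

```python
import hashlib


def md5_normalized(data: bytes, is_text: bool) -> str:
    """Hash `data` with md5, normalizing CRLF to LF first for text content.

    Assumption: this mirrors the dos2unix behavior described in the issue;
    it is not copied from DVC's implementation.
    """
    if is_text:
        data = data.replace(b"\r\n", b"\n")
    return hashlib.md5(data).hexdigest()


crlf = b"a,b\r\n1,2\r\n"
lf = b"a,b\n1,2\n"

# Same hash regardless of line-ending style when treated as text.
print(md5_normalized(crlf, is_text=True) == md5_normalized(lf, is_text=True))
# A plain md5 of the raw CRLF bytes (what md5sum computes) differs.
print(hashlib.md5(crlf).hexdigest() == md5_normalized(crlf, is_text=True))
```

The first comparison prints `True` and the second prints `False`, which is the mismatch a user would see when comparing against md5sum on a CRLF text file.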

@jorgeorpinel jorgeorpinel changed the title describe the way DVC checksums are calculated user-guide: describe the way DVC checksums are calculated Jan 20, 2020
@jorgeorpinel jorgeorpinel changed the title user-guide: describe the way DVC checksums are calculated user-guide: describe the way file (MD5) hashes aka "checksums" are calculated Feb 12, 2020
@jorgeorpinel jorgeorpinel changed the title user-guide: describe the way file (MD5) hashes aka "checksums" are calculated user-guide: describe the way file (MD5) hashes a.k.a. "checksums" are calculated Feb 12, 2020
@jorgeorpinel
Contributor

@efiop can you or someone from core summarize the explanation here or even submit a draft PR for this when you have some time?

We can take it from there, but it would be great if someone already familiar with the exact algorithm for generating file hashes worked on it.

And/or if you can point me to the part of the code where it's done. Thanks!

@efiop efiop added p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. and removed p1-important Active priorities to deal within next sprints labels Feb 22, 2020
@efiop
Contributor Author

efiop commented Feb 22, 2020

@jorgeorpinel Sure! I just wonder if this is really p1. These seem like minor internal implementation details. This ticket was created before dvc commit existed, so it is even less important these days (people used to hack around to generate a correct md5). I've lowered the priority to p2 for now; let me know if you think differently. Thanks.

@shcheklein
Member

@jorgeorpinel agreed with Ruslan, not sure why it was bumped to p1.

@jorgeorpinel
Contributor

Yeah sure, Idk either. p2 sounds good. Thanks!

@jorgeorpinel jorgeorpinel changed the title user-guide: describe the way file (MD5) hashes a.k.a. "checksums" are calculated user-guide: describe the way file (MD5) hashes a.k.a. "hash values" are calculated Jun 21, 2020
@jorgeorpinel
Contributor

Hi! I wonder if users need to know this. What's the advantage or need to document it? Seems like an implementation detail. Or perhaps for security reasons (being able to confirm checksums)?

@MetalBlueberry
Contributor

In my case, we work with a custom HTTP remote that validates the checksum of uploaded content, so I need to be able to confirm the checksum.

@jorgeorpinel jorgeorpinel changed the title user-guide: describe the way file (MD5) hashes a.k.a. "hash values" are calculated guide: describe the how hash values (md5) are calculated May 7, 2021
@jorgeorpinel
Contributor

jorgeorpinel commented May 19, 2021

This seems to keep coming up in support channels (1). Bumping priority a little.

Also, here's the relevant code: https://github.com/iterative/dvc/blob/aab5dec7f5f6dd16c00942285bfd229c693446b6/dvc/utils/__init__.py#L44

@jorgeorpinel jorgeorpinel added p1-important Active priorities to deal within next sprints and removed p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. labels May 19, 2021
@jorgeorpinel jorgeorpinel added C: guide Content of /doc/user-guide and removed A: docs Area: user documentation (gatsby-theme-iterative) p1-important Active priorities to deal within next sprints labels Oct 12, 2021
@jorgeorpinel
Contributor

Hi @efiop, do you think the current content in https://dvc.org/doc/user-guide/project-structure/dvc-files#output-entries is enough? It's still not a full explanation of how the hash is calculated internally, but that's probably too deep? If so, feel free to close this. Thanks

Cc @MetalBlueberry

@jorgeorpinel
Contributor

In fact I'm going to close this as stale for now, but please reopen if you disagree.

5 participants