decide best term for "md5" "checksum" or "hash" strings used in DVC-files, and Git commit (SHA) "hash" #552

Closed
jorgeorpinel opened this issue Aug 12, 2019 · 8 comments · Fixed by #962
Labels
A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions

Comments

@jorgeorpinel
Contributor

jorgeorpinel commented Aug 12, 2019

and implement throughout the docs.

And consider changing the md5 field name (also in DVC-files)?

Notes:

@dashohoxha
Contributor

If it is just dos2unix <datafile> && md5sum <datafile> (as described here: #68 (comment)),
then it actually is pure MD5, but the datafile is normalized first, to avoid any mismatching in different operating systems.

I think that when the explanation is meant for the users (who don't need to know the internal details) the terms hash/checksum are ok.
When explaining internal details, then MD5 is better, and also accurate. However, a note on end-of-line normalization may also be useful.
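
To illustrate the point about normalization, here is a minimal sketch of "normalize line endings, then take a plain MD5". It assumes that normalization only means converting CRLF to LF, as `dos2unix` does for ordinary text files. The function name `normalized_md5` and the sample data are hypothetical, and this is not taken from DVC's source, so the real implementation may differ.

```python
import hashlib


def normalized_md5(data: bytes) -> str:
    """Return the MD5 hex digest of data after converting CRLF to LF.

    Assumption: normalization is a plain CRLF -> LF replacement.
    """
    normalized = data.replace(b"\r\n", b"\n")
    return hashlib.md5(normalized).hexdigest()


# The same two-line file saved with Unix and with Windows line endings.
unix_text = b"col1,col2\n1,2\n"
dos_text = b"col1,col2\r\n1,2\r\n"

# A plain MD5 differs between the two variants...
print(hashlib.md5(unix_text).hexdigest() == hashlib.md5(dos_text).hexdigest())
# ...while the normalized digest is the same for both.
print(normalized_md5(unix_text) == normalized_md5(dos_text))
```

Under this assumption the value is still a pure MD5, just computed over normalized content, which is why the same file gets the same hash on different operating systems.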

@shcheklein shcheklein changed the title docs: decide best term for "md5" "checksum" or "hash" strings used in DVC-files decide best term for "md5" "checksum" or "hash" strings used in DVC-files Aug 12, 2019
@shcheklein shcheklein added A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions labels Aug 12, 2019
@shcheklein
Member

I agree with @dashohoxha .

Checksum vs. hash: "hash" is the better term because the primary purpose of these MD5s is to store and find binary files in our storage (though we check consistency as well).

I'm fine to use checksum everywhere, especially considering that it won't be hard to switch to a new version.

MD5, etag, other checksums make sense when we go into details.
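
The two roles mentioned above (a hash as a lookup key, a checksum as a consistency check) can be sketched with a toy content-addressed store. Everything here is hypothetical, including the `ContentStore` class and its in-memory layout; it only illustrates the distinction and is not how DVC's storage is implemented.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """MD5 hex digest, used here as the content address."""
    return hashlib.md5(data).hexdigest()


class ContentStore:
    """Hypothetical in-memory store keyed by content hash."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # "Hash" role: the digest is the key used to store and find the data.
        key = file_hash(data)
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # "Checksum" role: verify the stored bytes still match their key.
        if file_hash(data) != key:
            raise ValueError("stored content does not match its hash")
        return data


store = ContentStore()
key = store.put(b"model weights")
print(store.get(key) == b"model weights")
```

Storing identical content twice yields the same key, so the store keeps a single copy. That addressing behavior is what the word "hash" captures and "checksum" does not.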

@jorgeorpinel
Contributor Author

I also vote for "checksum". In fact I don't see how it's the wrong concept at all, other than the purpose we use it for. I think "hash" is too general, TBH.

image

@jorgeorpinel
Contributor Author

jorgeorpinel commented Aug 12, 2019

  • I also just found the term "pointer" being used in some of our docs to refer to these fields.

@jorgeorpinel
Contributor Author

jorgeorpinel commented Oct 3, 2019

UPDATE: I made sure every instance of MD5 is upper case unless it's a file or command code block as part of #669.

@jorgeorpinel
Contributor Author

jorgeorpinel commented Feb 12, 2020

Hi! Notice that per some relatively recent conversation with @shcheklein, we stopped using "checksum" in docs, in favor of terms like (MD5/md5) "file hash" and "hash values" (in DVC-files). See #962

Should we apply the same in docs?

Note that there is somewhat of a collision between that term and (Git) "commit hash" for which we're using "commit SHA hash" in the docs PR right now (may review that one). Cc @efiop

@jorgeorpinel jorgeorpinel changed the title decide best term for "md5" "checksum" or "hash" strings used in DVC-files decide best term for "md5" "checksum" or "hash" strings used in DVC-files, and Git commit (SHA) "hash" Feb 12, 2020
@pared
Contributor

pared commented Feb 12, 2020

@jorgeorpinel I would reconsider using MD5. As iterative/dvc#1676 and iterative/dvc#3069 are still open, and we do not intend to drop them, there is a possibility that at some point MD5 will become an obsolete term. That would trigger a lot of changes on the docs side.

@jorgeorpinel
Contributor Author

jorgeorpinel commented Feb 12, 2020

Agree. Right now we are reducing the usage of MD5 in docs because AFAIK our file hashes are not always direct MD5 (and this is still not properly explained, see #68).
This issue is to decide what to do with all these terms, so if you guys tell me we should remove MD5 as much as possible, then we'll do that.

shcheklein pushed a commit that referenced this issue Feb 14, 2020
* cmd ref: copy edits and improve `move` examples, adding one with `import`
related to https://discordapp.com/channels/485586884165107732/485596304961962003/676360262416203776

* tutorials: correct section title in versioning
per #933 (comment)

* term: review "point" (file hash context)
per #552 (comment)