Support hashings other than MD5 #3069
Hi @adrinjalali ! In our case md5 is not used for cryptography, it is simply used for file hashing, where it is still suitable (at least I'm not aware of any issues with it in such scenarios), so it is safe to use for that purpose. The contents of the cache files are not encrypted; you can see that for yourself by opening any file under .dvc/cache. If you need to encrypt your data, you could do that pretty easily on the remote itself (e.g. …)
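For context, "file hashing" here just means streaming a file's bytes through a digest; nothing is encrypted. A minimal sketch of that kind of content hashing with Python's standard `hashlib` (an illustration of the idea, not DVC's actual implementation):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 of a file by streaming it in chunks,
    so arbitrarily large files fit in constant memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The resulting hex digest is used purely as a content address: identical bytes map to the same digest, so the cache can deduplicate by name.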
I understand these hashes shouldn't be the only defense layer against a malicious player, but it is still a very useful thing. Imagine a malicious player who has write access to the storage (an s3 bucket in this case). They can take a model file, modify it maliciously in a way that keeps the md5 unchanged, and upload the new one. The rest of the pipeline now won't detect a change in the file, and will serve the new, wrong model. Again, I understand this should not be seen as the only way to secure the models, but in places where software goes through a security audit before being used, md5 is one of the first things they're gonna use to reject it. Another point is that having a SHA is not expensive or complicated at all, so why not include it?
I agree with @adrinjalali, using MD5 to check for integrity is far from ideal. As SHA-1 is starting to show collisions as well, DVC could join …
@adrinjalali @JavierLuna Thanks for the info, guys! We will support other hash functions eventually, it just wasn't the focus for us, but we will reconsider it. We've had an attempt from an outside contributor in #1920, but he wasn't able to finalise it 🙁 If anyone is interested in contributing this feature (similar to #1920, where sha is an opt-in config option for now), we will be happy to help with everything we can. I'll leave this issue open in addition to #1676, as this current one is about simply replacing md5 with something else (possibly as an option), while #1676 is originally about custom hash functions.
I'd like to help with both issues! |
@JavierLuna That would be great! 🙏
Let's start with the current one, similar to how #1920 does it. Custom hashes could be implemented later, esp since we don't really know what type of hash creator of #1676 wants.
Please don't set it as the default one, it should be configurable. MD5 should stay the default for now. If the feature is implemented correctly, we will be able to switch to sha-256 later with a one-line PR 🙂 Please let us know if you have any questions :) FYI: we also have a #dev-general channel in our discord, feel free to join 🙂
Also, we need to remember that changing the default checksum will force a full cache recalculation for users upgrading their DVC version, and this could bring a lot of issues. If keeping md5 is that severe a problem, the change of the default checksum should probably be left for a major version release.
Currently we kinda assume that whatever is returned by `get_file_hash` is of type self.PARAM_CHECKSUM, which is not actually true. E.g. for http it might return `etag` or `md5`, but we don't distinguish between those and call both `etag`. This is becoming more relevant for dir hashes that are computed a few different ways (e.g. in-memory md5 or upload to remote and get etag for the dir file). Prerequisite for iterative#4144 and iterative#3069
Related to iterative#4144 , iterative#3069 , iterative#1676
I agree that the fewer degrees of freedom, the less complexity there is. However, I still see two different points of parameterization: within the DVC codebase, and user-configurable. The former just means it should be straightforward to swap out the hash algo used by the codebase, i.e. minimize reliance on hard-coded values. I think this ought to happen regardless, as MD5 should be sunsetted. Replacing MD5 with SHA256 still implies the ability of DVC to read a metadata checksum name and value and validate it. Whether to expose that ability to user configuration is a second discussion. Personally, I don't have that much trouble with exposing this as a config. The least-effort route is to just accept anything in `algorithms_guaranteed` and pass it to … Catch the …
Documentation: "DVC default_hash can be configured to be one of the following: ..." Security: it's already using MD5, so it's hard to do worse than that (if ensuring …). Allowing pluggable hash algos, e.g. to support Blake3: yeah, I can see that being a can of worms.
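The "accept anything in `algorithms_guaranteed`" route could look roughly like the sketch below. This is an illustration, not DVC code; `make_hasher` is a hypothetical helper and the validation policy is an assumption:

```python
import hashlib

def make_hasher(name: str):
    """Return a fresh hash object for a configured algorithm name,
    accepting only names hashlib guarantees on every platform.
    (Note: the shake_* variants need a length argument to hexdigest().)"""
    if name not in hashlib.algorithms_guaranteed:
        raise ValueError(f"unsupported hash algorithm: {name!r}")
    return hashlib.new(name)

h = make_hasher("sha256")
h.update(b"some file contents")
digest = h.hexdigest()  # 64 hex characters for a 256-bit digest
```

Catching the `ValueError` at config-load time would give the user an early, readable error instead of a failure deep in the data-management code.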
Let me give an example: suppose Alice and Bob share a remote. Alice configures DVC to use Blake3, Bob wants to use SHA256. Now, we don't have any metadata in remotes that shows which hash algorithm is used; they are simple directory trees with a deterministic structure. How will we solve this? We need to keep metadata in remotes, which means all remote- and cache-related code has to be updated just to support this. Alice has a local cache in Blake3 and a remote shared with Bob in SHA512, while Bob has a local cache in SHA256. When one of them wants to send a file to the other, a conversion is needed; where, and why? Their remote is on a Windows server, and suppose they put it to a …

All of DVC's data sharing and versioning code needs an update, and all these algorithms will need separate tests. We'll have to ask "which hash algorithm do you use?" in new tickets. We'll replace each "MD5" occurrence with at least a few sentences in the docs. If some of these algorithms turn out to have export restrictions to some countries, DVC will have to maintain separate branches/controls/whatever for each, because once you start to support n algorithms, you can't go back and say "err, we actually didn't mean this one."

The hashing part should be abstracted away, and if a new algorithm appears that is just better for DVC, we should be able to use it without much change in the codebase. I also fully support forks modifying this layer for their own use, but allowing users to change the algorithm in …

Thank you.
The choice of MD5 is curious. Git is built upon SHA-1 (also broken, but alas, far more secure). Why choose a hash that has been known to be broken from day 0? One could start to be suspicious. The blasé way this issue has been treated here again verges on suspicious. We should upgrade to Blake3 as a matter of security urgency. MD5 is a security-critical failure and should be addressed as such!
We need to stop shipping a completely critical security vulnerability as a matter of urgency. We need to pick any not-broken hash and do a one-way move as a matter of desperate urgency. Later we can think about whether we want support for multiple hashing algorithms; that is a different topic from addressing this security vulnerability.
You are simply wrong here. There are no practical export restrictions on crypto in this world anymore. This fear is about 20 years out of date. Remember when the wonderful people published the PGP physical book, with the source code printed inside?
I agree completely.
Laws change. That was an example though. More algorithms mean more points to consider; more options require more updates and more maintenance in …
Guys, we've been using md5 for historical reasons. Initially, we only had pipeline management, where md5 was used as a checksum and not a hash, so it was perfectly suitable for the task. Data management grew from it and kept using md5. We are well aware that md5 is not a great hash, and will be switching to something more suitable (likely some SHA) in 3.0, so stay tuned.
Why don't you release a security update that switches to https://github.com/BLAKE3-team/BLAKE3 (my recommendation) as a matter of priority? Having this sort of security vulnerability addressed "in the next major update" isn't very professional IMHO...
@da2ce7 While I agree DVC needs to move away from MD5 ASAP, I disagree that the choice to delay to the next major update is unprofessional. Changing the hashing algorithm is a major change that will require existing users to rename all of the files in their dvc-caches. This is not trivial. In fact, IMHO it is highly professional to be mindful of semantic versioning when making large breaking changes like this, even in the case of security. As I mentioned beforehand, the primary use-case of DVC does not require a cryptographically secure hashing algorithm. The attack vectors in the most common cases are niche at best (note: I recognize there are serious concerns in some cases, but my argument is that these are not common). Much of the security community will pounce on anything that would be insecure in some context without ever really considering the real-world threat model in which that thing is applied. I think that is a mistake. It certainly doesn't warrant getting personal. Regardless, I would second an urgent push to release DVC 3.0 with BLAKE3 as the default (and possibly only) hash algorithm.
I agree with @Erotemic on his points. One thing about blake, though, is that it is not FIPS/NIST certified, which is a significant problem for people that need to comply with those standards, so I doubt that we will choose blake by default, and hence why I think that some version of SHA is more realistic. One more thing is that sometimes people confuse the checksums we use for pipeline checks with the (hash-ish, i.e. hashing-function, not cryptographic-hash) hashes that we use for data management. For the former, we will support other custom checksums that could be more relaxed than md5, which depends on the full binary content. We've been working hard on 3.0 prerequisites for the past half a year or so, and are continually working on those as we speak. We not only want to switch to a different algorithm in 3.0, but also to provide better performance/ui/architecture/ecosystem for data management, and all of that without ceasing releases with new features (experiments, dvc machine, plots, etc) and bug fixes for 2.0, so we've been gradually rebuilding things and will likely be ready for 3.0 in the upcoming months. We really truly appreciate your interest 🙏
In the context of semantic versioning making a breaking change should always be signalled by a major number change. However, I do not feel like that was the core of the statement that @efiop was making. He wasn't referring to the fact that this breaking change will be called 3.0, but that this change will be included in 3.0.
The problem with MD5 is that it is desperately broken: somebody could commit to a hash of a good file, and then later silently swap that file out for an evil variant (if they control the server).
Now that is a very niche use case. FIPS/NIST is normally enforced in the core security aspects of a system, such as hashing passwords or computing the digest for a cryptographic signature. The use case for DVC is different: here we use the cryptographic hash to ensure that there is only a single matching blob of data that can be referenced by any single commit within the git repository. DVC routinely deals with large files, so the performance of the hashing algorithm is of quite some importance: all three SHA variants are not very good for our case.
In addition, SHA-1 and SHA-2 are both not parallelizable, and are subject to length-extension attacks. On the other hand, BLAKE3 is perfect for our case. It is extremely fast and very secure, and in particular it is infinitely parallelizable, so we can get a speedup from multi-core hardware for large files.
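To illustrate why a tree-structured hash helps on multi-core machines, here is a toy sketch: hash fixed-size chunks independently, then hash the concatenation of the leaf digests. This is not BLAKE3 and not a vetted construction; it only shows the parallel shape (SHA-2's sequential chaining, by contrast, cannot split one file across cores). Python's `hashlib` releases the GIL for large buffers, so even a thread pool gives real speedup here:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB leaves; an arbitrary choice for this sketch

def _leaf(chunk: bytes) -> bytes:
    # Each leaf is hashed independently, so leaves can run concurrently.
    return hashlib.sha256(chunk).digest()

def tree_hash(data: bytes) -> str:
    """Toy tree hash: leaf digests in parallel, then one root digest.
    Illustrates tree-mode parallelism only; NOT a secure construction."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
    with ThreadPoolExecutor() as pool:
        leaves = list(pool.map(_leaf, chunks))
    return hashlib.sha256(b"".join(leaves)).hexdigest()
```

A sequentially-chained hash must consume byte N before byte N+1; the tree shape removes that data dependency between chunks, which is exactly what BLAKE3 exploits.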
Thanks, our project is considering the use of DVC, and the choice of the hashing algorithm is naturally an important consideration for us.
I also believe this change requires more testing and a major version upgrade. BLAKE3 has technical advantages, but FIPS/NIST certification is an important point for some organizations. Setting the default to SHA-256, with builds that support other algorithms, might be a good solution. Installing like …
For the very few organisations that are constrained by FIPS/NIST certification, they could use a different build of dvc. You are suggesting that we should cripple the speed of the default because of a hand-waving concern about FIPS/NIST certification. There is a dramatic difference in performance between SHA256 and BLAKE3. I think the default should be to have good performance. If, and only if, you are forced to use FIPS/NIST-certified hashes for this part of your company infrastructure, then you could use SHA-512/256 (SHA-256 would be silly, as it is slower than SHA-512 for everything other than the smallest of strings). Primarily, can you point to any interested organisations that require FIPS/NIST-certified hashes for the DVC files they have in their git repo? Is this actually a realistic goal? Is this such an important case that it should be the default? Otherwise, I suggest this is just FUD about FIPS/NIST certification requirements.
I can't because they probably can't use DVC now. :)
This is a good idea too. I'll leave the decision to the project maintainers. |
@da2ce7 I think you underestimate how many organizations that is. I work at an open source company and even I'm impacted by it (as much as it irritates me). Often it's not about actual security in these cases. Someone sees "OMG you're using a hash, it better be one on our passlist!" and it devolves into pseudo-security hell from there. The gov funds a lot of companies, and to be eligible you have to jump through their hoops, even if they don't make sense. (I wasn't able to use my ed25519 ssh key to log in to a server recently. There was no way I was going to use the NIST curve, so I had to make a new RSA key. It felt a little dirty.) I would really love something configurable in this case (and IMO it would be best to design in configurability from the ground up). I'd love to see blake3 as the default, and then a sha256 option as a hedge against FIPS/NIST certification issues. My prediction is that the probability these issues will come up is high. But I think you're right that we shouldn't cripple DVC's speed in the default case because of short-sighted policy decisions that have a lot more impact than you might think if you aren't exposed to them.
This is the only thing that gives me doubt. The wording in these certification guidelines is often vague and open to interpretation (quite the opposite of where I like to be). It could be the case that we argue DVC isn't a core security aspect (and based on my earlier arguments about why a crypto hash isn't always necessary for DVC, I would agree with that). @iesahin your solution is effectively configuration of hashes, which I don't know is on the roadmap. I think the arguments here show that even if there is one technically superior hash (blake3), there are meatspace tradeoffs that need to be considered. If configuration is not on the roadmap, I suggest re-evaluating that decision. My opinion is that blake3 should be the default, with sha256 offered for the poor souls who need it.
I'm not in a position to make these decisions. What I'm saying is: not setting an option in …
Looks like Linux is using blake now: https://lore.kernel.org/lkml/20211223141113.1240679-2-Jason@zx2c4.com/ It sounds reasonable to make it configurable and make blake default. |
If it's really required to make it configurable, I think fixing the digest size (256 bits = 32 bytes) allows the digest -> path conversion to stay abstract. Instead of MD5, DVC may require "any hash function that outputs 256 bits." I still think making it configurable won't add much value, and using SHA2-256 is the most obvious option due to NSA blessings. Longer digests may also be problematic for file system paths: 512 bits means 64 bytes, encoded as 128 hexadecimal characters. In this case …
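The digest -> path conversion in question is the content-addressed layout where the first two hex characters become a fan-out directory (the same shape as DVC's .dvc/cache). A sketch, assuming that two-character fan-out convention:

```python
import hashlib
from pathlib import Path

def cache_path(cache_dir: str, digest_hex: str) -> Path:
    """Map a hex digest to a content-addressed cache path using the
    first two hex characters as a fan-out directory, e.g.
    .dvc/cache/ab/cdef... for digest abcdef..."""
    return Path(cache_dir) / digest_hex[:2] / digest_hex[2:]

d = hashlib.sha256(b"example contents").hexdigest()  # always 64 hex chars
p = cache_path(".dvc/cache", d)
```

With the digest size fixed at 256 bits, every path has the same, predictable length no matter which algorithm produced the digest, which is the abstraction being argued for above.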
I would like to point out that with hash configuration, any future change would then not be a breaking change requiring a major version release. I've just been watching this thread because the text/binary distinction in the hash has been limiting for my use cases. I don't think the use of MD5 is that bad or in need of urgent change, but it would be nice to be able to use another hash. I would personally be a lot more comfortable with something like SHA-256 (or 512/256) vs blake3 if there were not a choice. I like to be able to access the data externally, and sha256 is generally more available than blake3 (i.e. it is in the go standard library).
Because the discussion doesn't seem to reference this issue when I raised it, I'll note that I think a configurable hash could solve the main issue with immutable storage backends like IPFS as discussed here: #6777 (reply in thread) If I run: |
@efiop Has there been any movement towards any of the ideas discussed here? I have an idea that I'd like to test out. Having an API that specifies the hash is necessary for it. Is a PR that adds a configurable hash API that is hard-coded to MD5 to start with (adding more hash algorithms can be discussed later) something that would be considered? |
I would hope that the migration to a newer hashing algorithm would be done in a backwards-compatible way. Saying it has to wait until 3.0 because it's a breaking change implies something pretty troubling: does that mean dvc3 can only use the new hash and won't work with existing stores? @shcheklein @efiop (I sure hope not.) For organizations that have really invested in using DVC, migrating their existing stores from one hash func to another will be at least a big hassle and perhaps completely intractable. At the very least you need to redo the entire storage, which is certainly expensive but possible. But then you also need to track down every …

The right way to do this doesn't require waiting for a major version. You just add an option today which is "new hash / old hash" when creating a new repo. In 3.0 you can/should make "new hash" the default. IMHO it would be very reasonable to offer this option for the new hash without any migration tool whatsoever. It just means people who start using dvc today are better off. And those who have legacy MD5 data stores can either figure out the migration themselves or stick with what they have.
Seeing as I just stumbled upon this and just experimented, let's be very clear with everyone: if your dataset includes files with the same MD5 sum, only the first one will ever be pushed to the remote; the others will be falsely deduplicated. It's super easy to test with the bins from https://github.com/corkami/collisions. Given how easy the files are to produce and the potential impact, I don't think this issue is treated with nearly the urgency needed. I'd say use Blake3 already, FIPS be damned. (possibly a bit more discussion at https://pouet.chapril.org/@brohee/110537633498422926)
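The failure mode is easy to model. The sketch below uses a deliberately collision-prone stand-in hash (just the first byte of the content) so that two different payloads collide, as the real corkami binaries do under MD5; a push keyed only by digest then silently drops the second file. This is a toy model of the pattern, not DVC's actual push logic:

```python
def weak_hash(data: bytes) -> str:
    # Deliberately collision-prone stand-in for MD5 in this sketch:
    # any two payloads sharing a first byte "collide".
    return data[:1].hex()

def push(remote: dict, data: bytes) -> bool:
    """Upload keyed only by digest; returns False when 'deduplicated'."""
    key = weak_hash(data)
    if key in remote:
        return False  # remote already "has" this digest: silently skipped
    remote[key] = data
    return True

remote = {}
push(remote, b"good model")            # stored
second = push(remote, b"gone wrong")   # colliding digest: falsely deduplicated
```

After both pushes the remote holds only the first payload, and the caller gets no error for the second; that is exactly why collisions in the content hash corrupt the store silently.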
Hey folks, thanks for your interest in this. We've been working hard towards making the hash configurable in our internals, and at this point we are pretty much only missing an actual config option exposed to the public and some high-level stuff (and obviously some polish around it). We don't have an immediate plan to work on that right now, but it is on our radar and we will be considering it in the near/mid future. Will keep everyone posted on the changes here.
The new … Having spent the weekend slogging through md5 hashes, I'm so excited for blake3 and/or xxhash.
Nice, this …
I would really like the option to use faster hash algorithms, e.g. …
The docs seem to imply that md5 is the only hash users can use. The vulnerabilities of md5 have made it not usable in many organizations which require a certain level of security.
Is there a way to use SHA-1 or SHA-256 or other hashes instead?
I saw some PRs (#2022 for instance), but they're closed w/o being merged. What's the progress on that front?