Update to the hashing systems for PotcarSingle that fixes some small bugs and allows for verification with the sha256 hash written in new POTCARs #2762
…ave any consequences.
…ata hash. Switched the warning messages for hash failures.
for more information, see https://pre-commit.ci
that attempts to use the sha256 hash that is present in new POTCAR files for checking file integrity. If the POTCAR file does not contain such a hash, the md5 hash of the whole file is compared to the database of hashes. If the comparison with the internal sha256 hash fails, the POTCAR is certainly corrupted and a ValueError is raised. If no sha256 hash is found and the md5 hash of the file is not in the database, only a warning is issued, to conform to previous behavior.
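The decision logic described in this commit message can be sketched roughly as follows. This is a minimal illustration, not pymatgen's implementation: the function name `check_integrity`, its parameters, and the returned labels are hypothetical, and which part of the file the embedded sha256 hash actually covers is not stated in this thread, so the caller is assumed to pass in the already-extracted payload and hash.

```python
import hashlib
import warnings
from typing import Optional, Set


def check_integrity(payload: str, embedded_sha256: Optional[str], known_md5: Set[str]) -> str:
    """Hypothetical helper: return which check the payload passed.

    payload         -- text the hash is computed over (assumed to be prepared by the caller)
    embedded_sha256 -- hash read from the file itself, or None if the file has none
    known_md5       -- stand-in for the database of known whole-file md5 hashes
    """
    if embedded_sha256 is not None:
        computed = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if computed != embedded_sha256.lower():
            # A mismatch with the file's own hash is treated as corruption.
            raise ValueError("sha256 hash in file does not match computed hash")
        return "sha256"

    # No embedded hash: fall back to the md5 lookup and only warn on a miss.
    md5 = hashlib.md5(payload.encode("utf-8")).hexdigest()
    if md5 not in known_md5:
        warnings.warn("md5 hash of file not found in the hash database", UserWarning)
        return "unknown"
    return "md5"
```

The asymmetry is the point of the design: a failed sha256 comparison stops execution, while an unknown md5 hash only produces a warning.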
Please also note the corresponding thread on matsci! |
I should have checked the test set... Of course the |
Thanks for all the work on this @MichaelWolloch, all look like great enhancements. I'll add my $0.02 below, although @mkhorton and other maintainers may have a different view on a few of the points.
I advise against this change based on previous experience. We initially raised a
Excellent; this is a more future-proof approach and I'm glad to see VASP has started to include this information in their files.
Good catch |
Thanks for your input, @rkingsbury. I just pushed a small commit that should resolve the failed tests. I also added a test for the sha256 hash and a corresponding modified POTCAR with a hash that should result in a raised
Would be great if more people can get in on this discussion!
I understand the concern, but since only new POTCARs have these sha256 hashes, I think the problem is not really that complex. I just ran a loop over some POTCAR folders in Georg Kresse's home directory (which probably includes experimental and unreleased material). Of the 4737 POTCARs found, all 1562 that had a hash passed the verification. The other 3175 did not have a hash. I have not checked them against the hashes stored in pymatgen yet, but I saw a lot of warnings, suggesting that they are not hashed, or not hashed correctly. So I think this is good news for the robustness of the sha256 hashes. I will run some more tests on all currently officially downloadable POTCARs next week. On a different note: I am not sure why some tests fail for (ubuntu-latest, 3.8, 1), but I am pretty sure that this is nothing I did. |
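A directory scan of the kind described in this comment could look roughly like the sketch below. It only counts files with and without an embedded hash line and does not verify anything. The function name `tally_potcars` is hypothetical, and the regular expression is an assumption about how the hash line is laid out in a POTCAR; the thread does not show the actual format.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed layout of the embedded hash line; not confirmed by this thread.
SHA256_LINE = re.compile(r"SHA256\s*=\s*([0-9a-fA-F]{64})")


def tally_potcars(root: str) -> Counter:
    """Count files named POTCAR* under root, split by presence of a sha256 line."""
    counts = Counter()
    for path in Path(root).rglob("POTCAR*"):
        if not path.is_file():
            continue  # skip directories that happen to match the pattern
        text = path.read_text(encoding="utf-8", errors="replace")
        key = "with_sha256" if SHA256_LINE.search(text) else "without_sha256"
        counts[key] += 1
    return counts
```

Calling `tally_potcars("/path/to/potentials")` returns a `Counter` whose two keys correspond to the two groups reported above.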
Sounds great. Thanks for your work on this. |
Hi again, I now checked the sha256 and md5 hashes (both for whole files and for metadata) for all POTCARs that can be downloaded from the VASP portal. Those are the following sets:
Looping through all directories and loading the POTCARs with the new PotcarSingle class leads to this output:
Of course, a lot of those potentials will be duplicates, since they have not changed between releases! So if sha256 hashes are there, they work. No exceptions (I also tested this for some unreleased PBE POTCARs, and all hashes worked there as well). I think this means that it is safe to actually keep the … The metadata hashes also work. This is the good news. However, none of the file hashes work, which is troubling. I had expected at least the 5254 set to work. I think more investigation and maybe some re-hashing has to happen. Does anyone have a POTCAR to test that actually passes the file hash check without warnings? I always thought the problem with my potentials was the sha256 hash lines, which I assumed to also be part of the header before I looked into the code. |
OK, I think I solved the problem with the file hashes! There was another bug that was pretty hard to catch. So I tried to change the line to … And now everything works as expected:
Apparently the hashes were created either using … I think now all the questions are resolved, and this should be ready to merge. |
…arning is issued now that the hash is corrected!
Thank you for all the thorough testing and bugfixing @MichaelWolloch! I remember that newline thing causing issues in the past and I thought that the way we had it (before your changes) was required in order to get the tests to pass on the development system at the time, but either something has changed or I missed something back then.
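The changed line itself is not shown in this thread. As a generic illustration of why a newline matters for a whole-file hash, the sketch below hashes the same text with and without a trailing newline. The helper `md5_hex` and the sample text are made up for this example.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the md5 hex digest of text, encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


body = "line one\nline two"
variants = {
    "as_is": body,
    "trailing_newline": body + "\n",
    "stripped": (body + "\n").strip(),  # strip() removes the trailing newline again
}
digests = {name: md5_hex(text) for name, text in variants.items()}

for name, digest in digests.items():
    print(f"{name:17s} {digest}")
```

The "as_is" and "stripped" digests agree, while "trailing_newline" differs, so a stored hash only matches if the file content is normalized in the same way it was when the hash was created.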
Yes I think it would be a good idea to include a unit test for a valid POTCAR. I realize now that we only test for invalid ones. I think you should be able to use one of the POTCARs already distributed in pymatgen's test files for this
I'm on board with raising a Thoughts @mkhorton ? |
Thank you @MichaelWolloch for your efforts on this PR -- this has been a pain point for a while!
Perhaps not unreleased stuff, but since these are hashes only, is the current pymatgen dataset missing anything important that could be added? Does it need to be updated? It seems like, based on your most recent comment, the pymatgen dataset may actually be "complete"(?)
I'm not so worried about this specifically. There should be nothing OS-specific that would change the hash if it's written correctly. My only major concern is an incomplete pymatgen dataset. |
This is a very good point. I initially assumed that the added sha256 hashes screwed up the cached md5 hashes, since I always got the familiar warnings, but now that I have discovered the other bugs, this is actually quite unlikely. In my previous test runs I did not check the md5 hashes for the POTCARs that have a sha256 hash, but I have now, and all is fine; the hash library is complete:
Now of course the whole sha256 thing is a little less pressing. But I would still keep this functionality for file integrity checks, since it makes adoption of new potentials quicker. E.g. I ran a loop over an unreleased set, and of course the updated potentials (115 of 1655) fail both the file and the metadata checks. I am happy to talk to my colleagues at the VASP company to ensure that once these potentials are launched, I can update the pymatgen hash files ASAP, but of course this does not mean that all users upgrade their pymatgen version, although they might upgrade their potentials. So I feel that my changes regarding file validation of POTCARs are still a superior solution. Of course warnings regarding the metadata will still be printed, but no warnings about possible file corruption will be shown if the sha256 hash matches the file.
I also do not think that platform specifics should play a role. We explicitly force UTF-8 encoding (in Python 3, hashlib only accepts bytes, so strings have to be encoded first), and I found nothing online suggesting that hashlib has any platform dependency if string encoding is handled correctly. |
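The point about encoding can be checked with the standard library alone. The short sketch below shows that hashlib rejects a str, that a fixed encoding gives a fixed digest, and that a different encoding of the same non-ASCII text gives a different digest. The sample strings are arbitrary.

```python
import hashlib

text = "abc"

# hashlib works on bytes; passing a str raises TypeError in Python 3.
try:
    hashlib.sha256(text)
    accepted_str = True
except TypeError:
    accepted_str = False

# With an explicit encoding the digest depends only on the resulting bytes.
digest = hashlib.sha256(text.encode("utf-8")).hexdigest()

# For non-ASCII text the chosen encoding changes the bytes, and thus the digest.
accented = "\u00e9"  # e with acute accent
digest_utf8 = hashlib.sha256(accented.encode("utf-8")).hexdigest()
digest_latin1 = hashlib.sha256(accented.encode("latin-1")).hexdigest()

print(accepted_str, digest, digest_utf8 != digest_latin1)
```

This is why fixing the encoding to UTF-8 in one place is enough to make the hash independent of the platform's default encoding.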
This would certainly be appreciated! In terms of this PR, I would like to merge it. Are there any remaining changes you would like to make before a merge? |
I will keep on it!
I just removed the parsing of the copyright lines since it in fact only parsed the last of the three lines, and is not needed. Other than that, I think that I am happy with the changes. |
Merging. Thanks again @MichaelWolloch! Noting a test failure unrelated to this PR. |
Thanks a lot for fixing this and thanks to @MichaelWolloch for contribs. Greatly appreciated. |
@mkhorton Any ETA for a release containing this fix? Thanks. |
Hi @mkhorton, I just found a BIG bug! I will make a new pull request immediately! |
Noted; will continue discussion in new PR. |
BUG FIX for merged pull request #2762
This pull request has been updated quite a bit from the initial small bugfix. It now allows the SHA256 hash found in the latest POTCAR sets to be used to verify the POTCAR.
Summary
Booleans in self.PSCTR.values() are now detected correctly and converted to strings as intended.
… self.PSCTR, but not added to the metadata hash string to retain compatibility with vasp_potcar_file_hashes.json.
Previously, the check for booleans actually tried to add the string "<class 'bool'>" to the hash, which is obviously wrong. However, this never happened, because bool is a subclass of int in Python for historical reasons. Thus the booleans were actually converted correctly to strings when converting the other integers in the if isinstance(v, int): check.
By putting the check for booleans first, we can correctly convert them and theoretically do something different with them in the future.
An alternative fix for this would be to just remove lines 2134 and 2135 altogether, but I think my version is clearer and better.
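The ordering issue described above is easy to reproduce in isolation. The two functions below are hypothetical stand-ins, not the code from pymatgen; they only show which branch a boolean reaches depending on the order of the isinstance checks.

```python
def classify_bool_first(value):
    """Check bool before int, so booleans get their own branch."""
    if isinstance(value, bool):
        return ("bool", str(value))
    if isinstance(value, int):
        return ("int", str(value))
    return ("other", str(value))


def classify_int_first(value):
    """Check int first: the bool branch below can never be reached."""
    if isinstance(value, int):
        return ("int", str(value))
    if isinstance(value, bool):  # unreachable, since every bool is also an int
        return ("bool", str(value))
    return ("other", str(value))


print(issubclass(bool, int))       # True
print(classify_bool_first(True))   # ('bool', 'True')
print(classify_int_first(True))    # ('int', 'True')
```

In both orderings the resulting string is "True", which matches the observation above that the old code produced correct hashes by accident. Only the bool-first version leaves room to treat booleans differently later.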
The warnings issued after failed hash comparisons were swapped and have been corrected: previously, the changed-metadata warning was issued after a failed full-file hash comparison, and vice versa.
I think using the internal SHA256 hash to validate the file is a good idea, since it should also work with new POTCARs that might come out, while the saved hash lists in vasp_potcar_file_hashes.json and vasp_potcar_pymatgen_hashes.json would need to be updated. (This might of course still be true for the metadata hashes if completely new potentials come out.)
I decided to raise a ValueError upon a failed comparison with the internal SHA256 hash. IMO it cannot mean anything other than that the POTCAR file is corrupted, so execution should stop. A warning is not enough, I think.