Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to the hashing systems for PotcarSingle that fixes some small bugs and allows for verification with the sha256 hash written in new POTCARs #2762

Merged
merged 12 commits into from Dec 15, 2022

Conversation

MichaelWolloch
Copy link
Contributor

@MichaelWolloch MichaelWolloch commented Dec 1, 2022

This pull request has been updated quite a bit from the initial small bugfix. It now allows to use the SHA256 hash that is found in the latest POTCAR sets to be used to verify the POTCAR.

Summary

  • Fix 1: Booleans in self.PSCTR.values() are now detected correctly and converted to strings as intended.
  • Fix 2: Warnings regarding failed hash comparison are now longer switched.
  • Feature 1: SHA256 line and CPRYR lines are now parsed into self.PSCTR, but not added to the metadata hash string to retain compatibility
  • Feature 2: Instead trying to identify the POTCAR twice (once using metadata only, then again using the whole file hash), only the metadata is used to identify the POTCAR and return the corresponding functional list and symbol.
  • Feature 3: A validation is added to check file integrity. This is done comparing the file to the SHA256 hash in it, or, if there is no hash, with the md5 hashes saved in vasp_potcar_file_hashes.json.

Previously the check for boolean actually tried to add the string "<class 'bool'>" to the hash which is obviously wrong. However, this never happened, because bool is a subclass of int in python for historical reason. Thus the booleans were actually converted correctly to strings when converting the other integers in the if isinstance(v, int): check.
By putting the check for booleans first, we can correctly convert them and theoretically do something different with them in the future.

An alternative fix for this would be to just remove lines 2134 and 2135 altogether, but I think my version is clearer and better.

The warnings returned at a failed hash comparisons were switched around. The changed metadata warning was issued after a failed full-file hash comparison, and vice versa.

I think using the internal SHA256 hash to validate the file is a good idea, since it should be working also with new POTCARS that might come out, while the saved hashes list in vasp_potcar_file_hashes.json and vasp_potcar_pymatgen_hashes.json would need to be updated. (This might of course still be true for the metadata hashes if completely new potentials come out.)

I decided to raise an ValueError upon a failed hash comparison using the SHA256 internal hash. It can not mean anything else than that the POTCAR file is corrupted IMO, so execution should stop. A warning is not enough I think.

@coveralls
Copy link

coveralls commented Dec 1, 2022

Coverage Status

Coverage remained the same at 0.0% when pulling 0884c63 on MichaelWolloch:master into 7a51c9b on materialsproject:master.

MichaelWolloch and others added 4 commits December 5, 2022 14:43
…ata hash. Switched the warning messages for hash failures.
that attempts to use the sha256 hash that is present in new POTCAR
files for checking file integrity. If the POTCAR file does not
have such a hash in it, the md5 hash of the whole file is
compared to the database of hashes.

If the comparison with the internal sha256 hash fails, that means
that the POTCAR is for sure corrupted and a ValueError is raised.
If the sha256 hash is not found, but the md5 hash of the file is
not in the database, only a warning is raised to conform to
previous behavior.
@MichaelWolloch MichaelWolloch changed the title fixed a small bug in PotcarSingle.get_potcar_hash that actually has no consequences Update to the hashing systems for PotcarSingle that fixes some small bugs and allows for verification with the sha256 hash written in new POTCARs Dec 5, 2022
@MichaelWolloch
Copy link
Contributor Author

MichaelWolloch commented Dec 5, 2022

Please also note the corresponding thread on matsci!

@MichaelWolloch
Copy link
Contributor Author

I should have checked the test set... Of course the test_potcar_file_hash_warning needs to be updated. However, I have no time today, and probably also until next week. Will work on it then, and until then I am thankful for other tips, and of course any discussion!
Maybe @rkingsbury wants to comment here as well?

@rkingsbury
Copy link
Contributor

Thanks for all the work on this @MichaelWolloch, all look like great enhancements. I'll add my $0.02 below, although @mkhorton and other maintainers may have a different view on a few of the points.

I decided to raise an ValueError upon a failed hash comparison using the SHA256 internal hash. It can not mean anything else than that the POTCAR file is corrupted IMO, so execution should stop. A warning is not enough I think.

I advise against this change based on previous experience. We initially raised a ValueError instead of a warning and that caused a lot of pushback and confusion from users, so we relaxed to a warning. That said, that initial implementation was buggier than it is now and there were cases where the error/warning was being raised when it should not have been. Basically, I would encourage either 1) doing extensive testing against many versions of POTCARs across different systems to make sure there are no spurious ValueError OR 2) keeping the warning instead.

Feature 3: A validation is added to check file integrity. This is done comparing the file to the SHA256 hash in it, or, if there is no hash, with the md5 hashes saved in vasp_potcar_file_hashes.json.

I think using the internal SHA256 hash to validate the file is a good idea, since it should be working also with new POTCARS that might come out, while the saved hashes list in vasp_potcar_file_hashes.json and vasp_potcar_pymatgen_hashes.json would need to be updated. (This might of course still be true for the metadata hashes if completely new potentials come out.)

Excellent; this is a more future-proof approach and I'm glad to see VASP has started to include this information in their files.

The warnings returned at a failed hash comparisons were switched around. The changed metadata warning was issued after a failed full-file hash comparison, and vice versa.

Good catch

@MichaelWolloch
Copy link
Contributor Author

MichaelWolloch commented Dec 7, 2022

Tanks for your input @rkingsbury

I just pushed a small commit that should resolve the failed tests. I also added a test for the sha256 hash and a corresponding modified POTCAR with a hash that should result in a raised ValueError

Thanks for all the work on this @MichaelWolloch, all look like great enhancements. I'll add my $0.02 below, although @mkhorton and other maintainers may have a different view on a few of the points.

Would be great if more people can get in on this discussion!

I advise against this change based on previous experience. We initially raised a ValueError instead of a warning and that caused a lot of pushback and confusion from users, so we relaxed to a warning. That said, that initial implementation was buggier than it is now and there were cases where the error/warning was being raised when it should not have been. Basically, I would encourage either 1) doing extensive testing against many versions of POTCARs across different systems to make sure there are no spurious ValueError OR 2) keeping the warning instead.

I understand the concern, but I think due to only new POTCARS having these sha256 hashes, the problem is not really that complex. I just let a loop run over some potcar folders in Georg Kresse's home directory (which probably includes experimental and unreleased stuff), and of the 4737 POTCARs found, all 1562 that had a hash did complete the verification process. The other 3175 did not have a hash, and I did not check them against the pymatgen stored hashes for the moment, but I saw a lot of warnings, suggesting that they are not, or not correctly, hashed. So this is good news for the robustness of the sha256 hashes I think. I will run some more tests on all currently officially downloadable POTCARs next week.
However, I am not totally opposed to change this into a warning as well. Lets see what the other maintainers want.

On a different note: I am not sure why some tests fail for (ubuntu-latest, 3.8, 1), but I am pretty sure that this is nothing I did.

@espenfl
Copy link

espenfl commented Dec 8, 2022

I understand the concern, but I think due to only new POTCARS having these sha256 hashes, the problem is not really that complex. I just let a loop run over some potcar folders in Georg Kresse's home directory (which probably includes experimental and unreleased stuff), and of the 4737 POTCARs found, all 1562 that had a hash did complete the verification process. The other 3175 did not have a hash, and I did not check them against the pymatgen stored hashes for the moment, but I saw a lot of warnings, suggesting that they are not, or not correctly, hashed. So this is good news for the robustness of the sha256 hashes I think. I will run some more tests on all currently officially downloadable POTCARs next week.
However, I am not totally opposed to change this into a warning as well. Lets see what the other maintainers want.

Sounds great. Thanks for your work on this.

@MichaelWolloch
Copy link
Contributor Author

Hi again,

I now checked the sha256 and md5 hashes (both for whole files and for metadata) for all POTCARs that can be downloaded from the VASP portal. Those are the following sets:

  1. 91 set: LDA, PBE, and PW91 PAW sets & LDA and PW91 USPP sets which are all outdated.
  2. 52 set: LDA and PBE PAW sets with early GW variants. However, sha256 hashes have been added here!
  3. 5254 set: LDA and PBE for both the .52 (originally released 2012) and the .54 (2015) sets that were left "original" and thus do not contain any hashes.
  4. 54 set: LDA and PBE for the .54 set with GW potentials and hashes. This is the recommended set.

Looping through all directories and loading POTCARS with the new PotcarSingle class leads to this output:

##############################
14930 total potentials checked.
6005 have a sha256 hash and the file matches.
0 have a sha256 hash and the file does not match.
0 completed both file and metadata md5 comparison with saved pymatgen data.
0 completed only file md5 comparison with saved pymatgen data.
8925 completed only metadata md5 comparison with saved pymatgen data.
0 complete neither file or metadata md5 comparison with saved pymatgen data.
##############################

Of course a lot of those potentials will be duplicated, since they have not changed between releases!

So if sha256 hashes are there, they work. No exceptions (I also tested this for some unreleased PBE POTCARs, and also there all hashes worked. I think this means that it is save to actually keep the ValueError and not revert to a warning.

The metadata hashes also work, This is the good news. However, none of the file hashes work, which is troublesome. I had expected the 5254 set at least to work. I think more investigation and maybe some re-hashing has to happen.
I do not think that I made a mistake in my checks, I only check file hashes if no sha256 hash is found in the file, so actually I expected most to pass the test.

Does someone has a POTCAR to test that actually passes the file hash without warnings? I always thought the problem with my potentials were the sha256 hash lines, which I assumed to be also be part of the header before I looked into the code.

@MichaelWolloch
Copy link
Contributor Author

OK, I think I solved the problem with the file hashes! There was another bug that was pretty hard to catch.
The get_potcar_file_hash(self) method converts the whole file into a hash, this was done using md5(str(self).encode("utf-8")).hexdigest(), which is a perfectly valid way to do it. However, in the beginning of the class there is the following line:
self.data = data # raw POTCAR as a string The ONLY difference between self.data and str(self) seems to be an empty line at the end of the string that is present for str(self), but not in self.data (and also not in the actual POTCAR).

So I tried to change the line to md5(self.data.encode("utf-8")).hexdigest() and rerun the loop over all POTCARs.

And now everything works as expected:

##############################
14930 total potentials checked.
6005 have a sha256 hash and the file matches.
0 have a sha256 hash and the file does not match.
8925 completed both file and metadata md5 comparison with saved pymatgen data.
0 completed only file md5 comparison with saved pymatgen data.
0 completed only metadata md5 comparison with saved pymatgen data.
0 completed neither file or metadata md5 comparison with saved pymatgen data.
##############################

Apparently the hashes were created either using PotcarSingle.data strings, or another way of reading in the POTCARs. However, the checks were done using the string representation of the class instance. And of course the extra line changes the hashes.

I think now all the questions are resolved, and this should be ready to merge.
Or is there the need for another unit test, that checks the sha256 and md5 hashes for a valid POTCAR?
Maybe @rkingsbury and @mkhorton can comment on this? Of course @espenfl is also more than welcome to chime in, as are all other interested parties.

@rkingsbury
Copy link
Contributor

OK, I think I solved the problem with the file hashes! There was another bug that was pretty hard to catch. The get_potcar_file_hash(self) method converts the whole file into a hash, this was done using md5(str(self).encode("utf-8")).hexdigest(), which is a perfectly valid way to do it. However, in the beginning of the class there is the following line: self.data = data # raw POTCAR as a string The ONLY difference between self.data and str(self) seems to be an empty line at the end of the string that is present for str(self), but not in self.data (and also not in the actual POTCAR).

So I tried to change the line to md5(self.data.encode("utf-8")).hexdigest() and rerun the loop over all POTCARs.

And now everything works as expected:

##############################
14930 total potentials checked.
6005 have a sha256 hash and the file matches.
0 have a sha256 hash and the file does not match.
8925 completed both file and metadata md5 comparison with saved pymatgen data.
0 completed only file md5 comparison with saved pymatgen data.
0 completed only metadata md5 comparison with saved pymatgen data.
0 completed neither file or metadata md5 comparison with saved pymatgen data.
##############################

Apparently the hashes were created either using PotcarSingle.data strings, or another way of reading in the POTCARs. However, the checks were done using the string representation of the class instance. And of course the extra line changes the hashes.

Thank you for all the thorough testing and bugfixing @MichaelWolloch! I remember that newline thing causing issues in the past and I thought that the way we had it (before your changes) was required to in order to get the tests to pass on the development system at the time, but either something has changed or I missed something back then.

I think now all the questions are resolved, and this should be ready to merge. Or is there the need for another unit test, that checks the sha256 and md5 hashes for a valid POTCAR? Maybe @rkingsbury and @mkhorton can comment on this? Of course @espenfl is also more than welcome to chime in, as are all other interested parties.

Yes I think it would be a good idea to include a unit test for a valid POTCAR. I realize now that we only test for invalid ones. I think you should be able to use one of the POTCARs already distributed in pymatgen's test files for this

So if sha256 hashes are there, they work. No exceptions (I also tested this for some unreleased PBE POTCARs, and also there all hashes worked. I think this means that it is save to actually keep the ValueError and not revert to a warning.

I'm on board with raising a ValueError given your extensive testing. However, I am still a bit worried that we might run into system- or OS-specific differences in hashing and I would hate for the ValueError to prevent users from doing their work if they really want to ignore it. So I suggest we provide a keyword argument to Potcar (passed through to PotcarSingle) that makes it possible to disable the hash checking by converting the ValueError to a warning.

Thoughts @mkhorton ?

@mkhorton
Copy link
Member

Thank you @MichaelWolloch for your efforts on this PR -- this has been a pain point for a while!

I just let a loop run over some potcar folders in Georg Kresse's home directory (which probably includes experimental and unreleased stuff), and of the 4737 POTCARs found, all 1562 that had a hash did complete the verification process.

Perhaps not unreleased stuff, but since these are hashes only, is the current pymatgen dataset missing anything important that could be added? Does it need to be updated? It seems like, based on your most recent comment, the pymatgen dataset may actually be "complete"(?)

However, I am still a bit worried that we might run into system- or OS-specific differences in hashing and I would hate for the ValueError to prevent users from doing their work if they really want to ignore it.

I'm not so worried about this specifically. There should be nothing OS-specific that would change the hash if it's written correctly. My only major concern is an incomplete pymatgen dataset.

@MichaelWolloch
Copy link
Contributor Author

Perhaps not unreleased stuff, but since these are hashes only, is the current pymatgen dataset missing anything important that could be added? Does it need to be updated? It seems like, based on your most recent comment, the pymatgen dataset may actually be "complete"(?)

This is a very good point. I initially assumed that the added sha256 hashes screwed up the cashed md5 hashes, since I always got the familiar warnings, but of course now that I have discovered the other bugs, this is actually quite unlikely. In my previous test runs I did not check the md5 hashes for the POTCARs that have a sha256 hashes, but I did now, and all is fine, the hash library is complete:

##############################
14930 total potentials checked.
0 have a sha256 hash and file matches, but no md5 check worked.
6005 have a sha256 hash and file matches, and both md5 checks worked.
0 have a sha256 hash and file matches, but only metadata md5 check worked.
0 have a sha256 hash and file matches, but only file md5 check worked.
0 have a sha256 hash and the file does not match.
8925 have no sha256 hash and completed both file and metadata md5 comparison with saved pymatgen data.
0 completed only file md5 comparison with saved pymatgen data.
0 completed only metadata md5 comparison with saved pymatgen data.
0 completed neither file or metadata md5 comparison with saved pymatgen data.
##############################

Now of course the whole sha256 thing is a little less pressing. But I would still keep this functionality for file integrity checks, since it makes adoption of new potentials quicker. E.g. I ran a loop over an unreleased set, and of course the updated potentials (115 of 1655) fail both the file and the metadata checks. I am happy to talk to my VASP company colleagues, to ensure that once these potentials are launched, I can update the pymatgen hash files ASAP, but of course this does not mean that all users upgrade their version although they might upgrade their potentials. So I feel that my changes regarding file validation of POTCARs is still a superior solution. Of course warnings regarding the metadata will still be printed, but no warnings about possible file corruption will be shown if the sha256 hash matches the file.

However, I am still a bit worried that we might run into system- or OS-specific differences in hashing and I would hate for the ValueError to prevent users from doing their work if they really want to ignore it.

I'm not so worried about this specifically. There should be nothing OS-specific that would change the hash if it's written correctly. My only major concern is an incomplete pymatgen dataset.

I also do not think that platform specifics should play a role. We force uft8 encoding specifically (you have to specify encoding in the python 3 version of hashlib), and I found nothing on the net of hashlib showing any platform dependency if string encoding is handled correctly.

@mkhorton
Copy link
Member

I am happy to talk to my VASP company colleagues, to ensure that once these potentials are launched, I can update the pymatgen hash files ASAP

This would certainly be appreciated!

In terms of this PR, I would like to merge it. Are there any remaining changes you would like to make before a merge?

@MichaelWolloch
Copy link
Contributor Author

I am happy to talk to my VASP company colleagues, to ensure that once these potentials are launched, I can update the pymatgen hash files ASAP

This would certainly be appreciated!

I will keep on it!

In terms of this PR, I would like to merge it. Are there any remaining changes you would like to make before a merge?

I just removed the parsing of the copyright lines since it in fact only parsed the last of the three lines, and is not needed. Other than that, I think that I am happy with the changes.

@mkhorton mkhorton merged commit d0ab147 into materialsproject:master Dec 15, 2022
@mkhorton
Copy link
Member

Merging. Thanks again @MichaelWolloch! Noting a test failure unrelated to this PR.

@espenfl
Copy link

espenfl commented Dec 19, 2022

Thanks a lot for fixing this and thanks to @MichaelWolloch for contribs. Greatly appreciated.

@espenfl
Copy link

espenfl commented Dec 19, 2022

@mkhorton Any eta for release containing this fix? Thanks.

@MichaelWolloch
Copy link
Contributor Author

Hi @mkhorton , I just found a BIG bug!
Unfortunately I tested this only for single POTCARs not multi element ones! (Also nothing in the unit tests) and there is a problem with the sha256 hashes when reading POTCARs from a file. Again a newline at the end of the file was missing, so a similar problem as for the md5 hashes. As @rkingsbury mentioned there was some confusion about this previously.

I will make a new pull request immediatly!

@mkhorton
Copy link
Member

Noted; will continue discussion in new PR.

mkhorton added a commit that referenced this pull request Dec 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants