Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace md5 hashing with sha256. #1251

Merged
merged 1 commit into from
Mar 9, 2023
Merged

Conversation

wilrodriguez
Copy link

Fixes #1250

This change is intended to replace the md5 hashing currently utilized to validate grammar caches with sha256 hashing, which will help mitigate situations where md5 is being blocklisted, such as by FIPS-mode

@erezsh
Copy link
Member

erezsh commented Feb 13, 2023

I have no problem with this change, but we have to make sure that in attempting to fix this one edge case, we're not creating new breaking changes for other edge cases. (i.e. other python versions or operating systems)

@wilrodriguez
Copy link
Author

I was mildly concerned about that as well. That said, sha256 is supported as part of the core library in all the versions of python that lark currently supports and this hashing library is only used within lark. The only downside I'd see is that it would need to recache files after the upgrade to the new version, but that's hardly much of a concern. This method should only be used internally. It's certainly not like it provides any novel functionality that a user wouldn't be able to more easily get from just calling hashlib directly. So, I'm not sure it's really anything to worry about.

@MegaIng
Copy link
Member

MegaIng commented Feb 13, 2023

The recaching already happens since the lark version is part of what get's hashed.

@wilrodriguez
Copy link
Author

Even better.

@erezsh erezsh merged commit b94cbc1 into lark-parser:master Mar 9, 2023
@erezsh
Copy link
Member

erezsh commented Mar 9, 2023

@wilrodriguez Thanks for contributing!

@Erotemic
Copy link
Contributor

Small comment on this. I'm +1 for anything modern over md5, and sha256 on modern machines is faster than md5. I'm not sure how much speed is a factor here, but there are hashers that are faster than sha256 - especially if we don't need them to be cryptographic. However some of these faster hashers are not FIPS validated (although blake3 probably should be):

image

Here is a small benchmark:

import ubelt as ub
import timerit

ti = timerit.Timerit(100_000, bestof=100, verbose=2)

hashers = ['md5', 'sha256', 'blake3', 'xxh64']

data = 'foobar' * 10_000

for hasher in hashers:
    for timer in ti.reset(f'{hasher=}'):
        with timer:
            ub.hash_data(data, hasher=hasher)

mean_rankings = ub.udict(ti.rankings['mean']).sorted_values()
print(ub.urepr(mean_rankings, nl=2, align=':'))

And the larger benchmark that produced the image is here: https://github.com/Erotemic/timerit/blob/dev/1.0.2/examples/benchmark_hashers.py

Likely not a big deal, but I thought I'd chime in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lark doesn't run on FIPS enabled systems on older versions of python 3
4 participants