Hash UDFs should return zero-padded strings of uniform length even when leading bits are zero. #93

Closed
mrflip opened this issue May 18, 2014 · 2 comments


@mrflip mrflip commented May 18, 2014

The Hash UDFs in 'hex' mode do not always return strings of the same length, because BigInteger.toString() omits leading zeros. In a stream where about 94% of the strings are full length, 1/16th are shorter by one or more characters and 1/256th by two or more. In the unlikely case that an MD5 hash's value were 124 zero bits followed by 4 one bits, the UDF would return the one-character string 'f'.
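A minimal sketch of the behavior described above. It assumes the hex string is produced with `new BigInteger(1, digest).toString(16)`; the class and method names are hypothetical and are not taken from the UDF source.

```java
import java.math.BigInteger;

public class HexLengthDemo {
    // Hex-encode digest bytes via BigInteger (assumed to mirror the UDF's
    // approach): leading zero digits are silently dropped.
    static String unpaddedHex(byte[] digest) {
        return new BigInteger(1, digest).toString(16);
    }

    public static void main(String[] args) {
        byte[] digest = new byte[16];   // 128 bits, the size of an MD5 digest
        digest[15] = 0x0f;              // 124 zero bits followed by four one bits
        System.out.println(unpaddedHex(digest));           // prints "f"

        digest[0] = 0x0a;               // top hex digit is still zero
        System.out.println(unpaddedHex(digest).length());  // prints 31, not 32
    }
}
```

Any digest whose top four bits are zero comes back at least one character short, which is the 1/16th case mentioned above.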

This is surprising behavior, and a trap for those practicing the frequent trick of generating a hash and chopping off just the number of bits you need:

-- returns one-fifteenth, not one-sixteenth, of the input.
sampled_lines = FILTER(FOREACH lines GENERATE MD5(val) AS digest, val) BY (STARTSWITH(digest, 'f'));
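As a rough check on the one-fifteenth figure, assuming the hex digits of the digest are uniformly distributed: the unpadded string starts with 'f' whenever the digest has k leading zero digits followed by an 'f', for any k from 0 to 31 in a 32-digit MD5 hash.

```latex
P(\text{starts with 'f'})
  = \sum_{k=0}^{31} \left(\frac{1}{16}\right)^{k} \cdot \frac{1}{16}
  = \frac{1 - 16^{-32}}{15}
  \approx \frac{1}{15}
```

With zero-padded output only the k = 0 term survives, giving the expected one-sixteenth.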

mrflip@5c4a77c makes the UDFs return a string zero-padded to (length of the hash in bits / 4) characters. It needs a lookup table to know how to format a SHA hash; all of the potential SHA-prefixed algorithms in Java 7 are covered.
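The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the commit: the class name, the method names, and the contents of the lookup table are all hypothetical, and the table lists only a few common algorithm names.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class PaddedHexHash {
    // Hypothetical lookup table: digest size in bits for each algorithm name.
    private static final Map<String, Integer> DIGEST_BITS = new HashMap<>();
    static {
        DIGEST_BITS.put("MD5", 128);
        DIGEST_BITS.put("SHA-1", 160);
        DIGEST_BITS.put("SHA-256", 256);
        DIGEST_BITS.put("SHA-384", 384);
        DIGEST_BITS.put("SHA-512", 512);
    }

    // Left-pad the BigInteger hex string with zeros to (bits / 4) characters.
    static String paddedHex(byte[] digest, int bits) {
        String hex = new BigInteger(1, digest).toString(16);
        int width = bits / 4;
        StringBuilder sb = new StringBuilder(width);
        for (int i = hex.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(hex).toString();
    }

    // Hash a string and return a hex digest of uniform length.
    static String hash(String algorithm, String input) {
        Integer bits = DIGEST_BITS.get(algorithm);
        if (bits == null) {
            throw new IllegalArgumentException("No digest length known for " + algorithm);
        }
        try {
            byte[] digest = MessageDigest.getInstance(algorithm)
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            return paddedHex(digest, bits);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] mostlyZero = new byte[16];
        mostlyZero[15] = 0x0f;
        System.out.println(paddedHex(mostlyZero, 128)); // 31 zeros followed by "f"
        System.out.println(hash("SHA-1", "hello").length()); // prints 40
    }
}
```

With output of uniform width, a prefix test such as STARTSWITH(digest, 'f') selects the intended one-sixteenth of the input.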


@mrflip mrflip commented May 18, 2014

And now I see that this issue should be on the Apache JIRA. I'll punch in this issue and my patch over there.

@mrflip mrflip closed this May 18, 2014

@matthayes matthayes commented May 18, 2014

Thanks for reporting this! I went ahead and filed an issue in JIRA here:

https://issues.apache.org/jira/browse/DATAFU-46

Since you've already figured out a fix, can you attach a patch to the JIRA?
