Consistent hash code values between JVM instances #590
Hash codes for symbols, booleans, nil, and (sometimes) strings are not consistent between JVM instances. This means that JRuby objects can't be used in frameworks like Hadoop, which uses consistent hashing for synchronization. Hadoop, specifically, uses object hash codes to determine which reduce task is responsible for a particular key. If the hash codes differ, then map tasks send keys to different reduce tasks and there are duplicates in the result.
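For concreteness, Hadoop's default partitioner derives the reduce task index from the key's hash code (`(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`). A minimal Ruby sketch of the same idea — the method name `partition_for` is made up for illustration:

```ruby
# Hash-based partitioning, as Hadoop's default partitioner does it:
# the reducer index is derived from the key's hash code. If key.hash
# differs between two JVM instances, the same key lands on different
# reducers and the results contain duplicates.
def partition_for(key, num_reduce_tasks)
  (key.hash & 0x7fffffff) % num_reduce_tasks
end

# Within a single process the mapping is stable; with per-JVM-randomized
# hashes, another JVM may compute a different partition for the same key.
partition_for(:user_id, 4) == partition_for(:user_id, 4)  # => true
```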
This adds or updates hashCode (and hash) implementations for booleans, nil, and symbols. It also adds an option to use default hash seed values rather than random ones, so that string hashes are consistent as well. Some tests are included, where it was obvious where to put them.
Please let me know if I did something wrong so I can fix it and send another pull request.
Ok, so I have a couple of concerns. nil, true, and false hashcodes were explicitly changed to be random in MRI 1.9ish, so our forcing them to specific values concerns me. It is at least a visible behavior change, and at worst a deviation that could break something (that doesn't seem likely, but I don't like behavioral differences).
I think the best plan here would be to also link up nil, true, and false to the option you added, so we can basically just turn on "predictable hashing" for all these types at once.
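A rough sketch of what gating all these types behind one option might look like. The environment variable name and the seed values here are purely illustrative, not JRuby's actual implementation:

```ruby
# Hypothetical sketch: one "predictable hashing" switch controls whether
# nil, true, and false get fixed hash seeds or per-process random ones.
# CONSISTENT and the seed constants are invented for illustration.
CONSISTENT = ENV['JRUBY_CONSISTENT_HASHING'] == 'true'

NIL_HASH_SEED   = CONSISTENT ? 1 : Random.new.rand(1 << 30)
TRUE_HASH_SEED  = CONSISTENT ? 2 : Random.new.rand(1 << 30)
FALSE_HASH_SEED = CONSISTENT ? 3 : Random.new.rand(1 << 30)
```

With the option off, every JVM instance draws fresh random seeds (matching MRI's behavior); with it on, every instance agrees on the same values.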
Out of curiosity, why do you need nil, true, and false to have consistent hashcodes? Using them as keys seems like a bad idea.
Nil sneaks in whenever a value doesn't exist--for example, when a CSV row doesn't have a column--and we try to put the value into a compound key. True and false happen less often, but sometimes you don't want to serialize a large value from mapper to reducer and all you need to do is check it for some property, so you do the check map-side and encode it as a boolean. I've seen nils in practice quite a bit, booleans less (or maybe not at all, I don't remember). It just seems like a good idea to ensure everything works consistently.
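A sketch of how nil ends up inside a compound key without anyone putting it there on purpose — the CSV contents are invented for illustration:

```ruby
require 'csv'

# A row that is missing its third column; indexing past the end yields nil.
row = CSV.parse_line("alice,2013-01-15")
key = [row[0], row[2]]   # => ["alice", nil]

# The compound key's hash incorporates nil.hash, so if nil.hash differs
# between JVM instances, key.hash differs too, and two map tasks can
# route the same logical key to different reducers.
key.hash
```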
I'll add hashCode fields on nil, true, and false and default them to random values unless the consistent hashing option is set. Why did MRI move to random values explicitly? I can understand not wanting predictable strings, but you don't use true for a hash key very often and it always collides with itself anyway.