Consistent hash code values between JVM instances #590

Closed
wants to merge 3 commits into from

@rdblue
rdblue commented Mar 18, 2013

Hash codes for symbols, booleans, nil, and (sometimes) strings are not consistent between JVM instances. This means that JRuby objects can't be used as keys in frameworks like Hadoop, which rely on a key having the same hash code in every JVM. Hadoop, specifically, uses object hash codes to determine which reduce task is responsible for a particular key. If the hash codes differ between JVMs, then map tasks send the same key to different reduce tasks and the key shows up more than once in the result.
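
To make the failure mode concrete, here is a minimal sketch of hash partitioning. The arithmetic in `partitionFor` is, to my understanding, the same as in Hadoop's default `HashPartitioner`; the class name `PartitionSketch` and the sample key are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartitionSketch {
    // Clear the sign bit, then take the modulus, so the result is in [0, numReduceTasks).
    public static int partitionFor(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        byte[] key = "user-42".getBytes(StandardCharsets.UTF_8);

        // Content-based hash: depends only on the bytes, so every JVM agrees.
        int stable = Arrays.hashCode(key);

        // Identity-based hash: not derived from content, so it can differ from run to run.
        int unstable = System.identityHashCode(new Object());

        System.out.println("content-based partition:  " + partitionFor(stable, 8));
        System.out.println("identity-based partition: " + partitionFor(unstable, 8));
    }
}
```

Running this in two separate JVMs should print the same content-based partition both times, while the identity-based one is not guaranteed to match.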

This adds or updates hashCode (and hash) implementations for booleans, nil, and symbols. It also adds an option to use default hash seed values rather than random ones, so that string hashes are consistent. Some tests are included, where it was obvious where to put them.

Please let me know if I did something wrong so I can fix it and send another pull request.

@rdblue rdblue Fixing hashing consistency across JVM instances.
Several of the core JRuby classes calculate hash codes based on Java or Ruby
object ids. This doesn't produce consistent hashing across JVM instances, which
is needed for distributed frameworks. For example, Hadoop uses hashCode values
to distribute keys from the map phase to the same reducer task (partitioning).

This commit adds hashCode (and Ruby's hash method) implementations for
RubyBoolean, RubyNil, and RubySymbol. RubyBoolean and RubyNil simply return
static, randomly-generated hashCode values that are hard-coded. This replaces
the default Java Object#hashCode.

For RubySymbol, the previous implementation of hashCode returned the symbol's
id, which could be different depending on the order in which symbols are
created. This updates it to calculate a hashCode based on the raw symbolBytes
like the RubyString implementation, but with a RubySymbol-specific seed and
without the encoding addition for 1.9. This value is calculated when symbols
are instantiated so the performance impact should be minimal.
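
A rough sketch of the RubySymbol idea described above: hash the raw bytes with a symbol-specific seed and cache the result at construction. Everything here is illustrative. `SymbolSketch`, the seed value, and the 31-multiplier byte hash are assumptions, not the code in this commit, which uses the same hashing approach as RubyString.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public final class SymbolSketch {
    // Hypothetical symbol-specific seed; the real constant is not given in this thread.
    private static final int SYMBOL_SEED = 0x5bd1e995;

    private final byte[] symbolBytes;
    private final int hashCode; // computed once, at construction

    public SymbolSketch(String name) {
        this.symbolBytes = name.getBytes(StandardCharsets.UTF_8);
        this.hashCode = hashBytes(symbolBytes, SYMBOL_SEED);
    }

    // Stand-in byte hash (31-multiplier polynomial), not JRuby's actual function.
    public static int hashBytes(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h = 31 * h + (b & 0xff);
        }
        return h;
    }

    @Override
    public int hashCode() {
        return hashCode; // no recomputation on lookup
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof SymbolSketch
                && Arrays.equals(symbolBytes, ((SymbolSketch) other).symbolBytes);
    }
}
```

Because the hash depends only on the bytes and a fixed seed, the order in which symbols are created no longer matters.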

This commit also adds a RubyInstanceConfig setting and CLI option for
consistent hashing, jruby.consistent.hashing.enabled, which controls whether
the Ruby runtime's hash seeds (k0 and k1) are generated randomly. When set to
true, they are set to static values. These hash seeds are used to hash
RubyString objects, so this will make string hash codes consistent across JVMs.
a74232a
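
The seed handling in the commit above might look roughly like the following. This is a sketch under assumptions: `HashSeedSketch` and the two constants are hypothetical, and the flag is passed in as a plain boolean rather than read from RubyInstanceConfig.

```java
import java.security.SecureRandom;

public final class HashSeedSketch {
    // Hypothetical fixed seeds; the values used in the commit are not shown in this thread.
    private static final long DEFAULT_K0 = 0x0123456789abcdefL;
    private static final long DEFAULT_K1 = 0xfedcba9876543210L;

    public final long k0;
    public final long k1;

    public HashSeedSketch(boolean consistentHashing) {
        if (consistentHashing) {
            // Same seeds in every JVM, so string hashes agree across processes.
            k0 = DEFAULT_K0;
            k1 = DEFAULT_K1;
        } else {
            // Random per-runtime seeds, which make hash collisions hard to predict.
            SecureRandom random = new SecureRandom();
            k0 = random.nextLong();
            k1 = random.nextLong();
        }
    }
}
```

The trade-off is the one discussed in the comments that follow: fixed seeds give repeatable hashes but give up the protection that random seeds provide against crafted collisions.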
@headius
JRuby Team member
headius commented Mar 20, 2013

Ahh, interesting. I assume when the default hash setting is active, you're not concerned about the hash-based DoS everyone got upset about last year...

Will look into this; having default hash values for booleans, nil, and symbols seems mostly reasonable.

@rdblue
rdblue commented Mar 21, 2013

No, I'm not concerned with a DoS since I'm using it for Hadoop. The reference to SecureRandom is what made me add the option and default it to "off". Seems like a situation where you either need one or the other.

@headius
JRuby Team member
headius commented Mar 22, 2013

Ok, so I have a couple concerns. nil, true, and false hashcodes were explicitly changed to be random in MRI 1.9ish, so us forcing them to be specific values concerns me. It is at least a visible behavior change, and at most a deviation that could break something (that doesn't seem likely, but I don't like behavioral differences).

I think the best plan here would be to also link up nil, true, and false to the option you added, so we can basically just turn on "predictable hashing" for all these types at once.

Out of curiosity, why do you need nil, true, and false to have consistent hashcodes? Using them as keys seems like a bad idea.

@rdblue
rdblue commented Mar 22, 2013

Nil sneaks in whenever a value doesn't exist--for example, when a CSV row doesn't have a column--and we try to put the value into a compound key. True and false happen less often, but sometimes you don't want to serialize a large value from mapper to reducer and all you need to do is check it for some property, so you do the check map-side and encode it as a boolean. I've seen nils in practice quite a bit, booleans less (or maybe not at all, I don't remember). It just seems like a good idea to ensure everything works consistently.

I'll add hashCode fields on nil, true, and false and set them to either the default values or fixed values, depending on the consistent hashing option. Why did MRI move to random values explicitly? I can understand not wanting predictable strings, but you don't use true for a hash key very often and it always collides with itself anyway.

@rdblue rdblue Updating hashCode implementations.
Per discussion on the last commit's pull request [1], updating the
implementations of hashCode for RubyNil and RubyBoolean. Now the hashCode
behavior for nil and booleans will only change when consistent hashing is
enabled. Adds a hashCode instance variable to RubyBoolean and RubyNil that is
set in the constructor to the Object#hashCode value (using
System.identityHashCode) or a static value.

[1]: jruby#590
e96b458
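
For readers following along, the nil/boolean change in the commit above could be sketched like this. `NilSketch` and the constant are hypothetical stand-ins, and the option is again passed as a plain boolean for simplicity.

```java
public final class NilSketch {
    // Hypothetical hard-coded value used only when consistent hashing is enabled.
    private static final int CONSISTENT_NIL_HASH = 0x1f3a9c27;

    private final int hashCode;

    public NilSketch(boolean consistentHashing) {
        // Default behavior is unchanged: fall back to the identity hash,
        // which is what Object#hashCode would have returned.
        this.hashCode = consistentHashing
                ? CONSISTENT_NIL_HASH
                : System.identityHashCode(this);
    }

    @Override
    public int hashCode() {
        return hashCode;
    }
}
```

With the option off, the value is whatever the JVM assigns to that instance; with it on, every JVM reports the same number.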
@rdblue
rdblue commented Apr 13, 2013

Everything should be enabled by the consistent hashing option now. Are there any other problems with this pull request that I can fix?

@rdblue rdblue Adding RubyBoolean's annotated methods.
Annotated methods on RubyBoolean were not being added to the ruby class, just
the static methods in RubyBoolean.True and RubyBoolean.False. Now hash is
actually defined on the ruby TrueClass and FalseClass.
7871de3
@rdblue
rdblue commented Apr 14, 2013

I ended up finding a bug: RubyBoolean's annotated methods were not being added because there weren't any previously. The above commit fixes the problem.

@headius
JRuby Team member
headius commented Apr 16, 2013

Can you squash this into a single commit? We can then merge it for 1.7.4.

@rdblue
rdblue commented Apr 16, 2013

I squashed the commits in a new branch (shouldn't have been using master) and added a new pull request:
#640

I'll close this one. Thanks!

@rdblue rdblue closed this Apr 16, 2013
@headius
JRuby Team member
headius commented Apr 16, 2013

FWIW, you can just force-push your squashed branch and the PR would pick it up. But this is fine for now :-)

@rdblue
rdblue commented Apr 16, 2013

Ah, good to know. I was on my way to work and wanted to just get something to you. Thanks!

@vipulnsward vipulnsward pushed a commit to vipulnsward/jruby that referenced this pull request Nov 30, 2013
@rdblue rdblue Fixing hashing consistency across JVM instances.
af1d387