
ORecordId - hashCode function collisions #3903

Closed
swimmesberger opened this issue Apr 9, 2015 · 18 comments

@swimmesberger

The hashCode function of the ORecordId class does not seem to work properly:

    ORecordId rec = new ORecordId(12, 31);
    int hash = rec.hashCode();
    ORecordId rec1 = new ORecordId(13, 0);
    int hash1 = rec1.hashCode();

Both hash and hash1 return 403.

Version: 2.0.5
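For readers following along, the collision falls directly out of the arithmetic. A minimal standalone sketch, assuming ORecordId.hashCode() in 2.0.5 is equivalent to the `oldHashCode` formula reproduced later in this thread:

```java
// Standalone sketch: assumes ORecordId.hashCode() is equivalent to
// 31 * clusterId + (int) (clusterPosition ^ (clusterPosition >>> 32)).
public class RidHashCollision {
    static int hash(int clusterId, long clusterPosition) {
        return 31 * clusterId + (int) (clusterPosition ^ (clusterPosition >>> 32));
    }

    public static void main(String[] args) {
        // 31 * 12 + 31 = 403 and 31 * 13 + 0 = 403 -> collision
        System.out.println(hash(12, 31)); // 403
        System.out.println(hash(13, 0));  // 403
    }
}
```

Any pair of RIDs where incrementing the clusterId is offset by decrementing the position contribution by 31 collides the same way.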

@laa
Member

laa commented Apr 9, 2015

We have the com.orientechnologies.orient.core.index.hashindex.local.OMurmurHash3HashFunction hash function. Would you mind implementing it as the source of the hash code and testing how performance changes?

@lvca
Member

lvca commented Apr 9, 2015

@laa do you mean implementing ORecordId.hashCode() using Murmur algorithm? If I'm right, +1

@laa
Member

laa commented Apr 9, 2015

Yes, but I would really appreciate it if @swimmesberger tried it himself, if he is interested of course.

@swimmesberger
Author

    final OMurmurHash3HashFunction<OIdentifiable> hasher = new OMurmurHash3HashFunction<OIdentifiable>();
    hasher.setValueSerializer(new OStreamSerializerRID());
    ORecordId rec = new ORecordId(12, 31);
    long hash = hasher.hashCode(rec);
    ORecordId rec1 = new ORecordId(13, 0);
    long hash1 = hasher.hashCode(rec1);

This seems to work better (no collisions so far), but it is not really a drop-in replacement for hashCode, because hashCode returns an int while OHashFunction.hashCode(V value) returns a long (if we don't want to discard any information).

OMurmurHash3HashFunction is indeed much slower than the normal hashCode function, but for my use case the performance drop is justifiable.

In my quick, non-optimized test, 10,000 calls to the OMurmurHash3HashFunction hashCode function take ~52ms, while 10,000 calls to the current hashCode function take ~8ms.
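For reference, a rough standalone way to reproduce that kind of timing comparison. This is a sketch only: the Murmur-style function below is the public MurmurHash3 fmix64 finalizer applied to the packed RID, not OrientDB's actual OMurmurHash3HashFunction, and the plain formula is an assumption about ORecordId.hashCode():

```java
// Rough, non-optimized comparison in the spirit of the numbers above.
public class HashTiming {
    static int plainHash(int clusterId, long pos) {
        return 31 * clusterId + (int) (pos ^ (pos >>> 32));
    }

    // MurmurHash3's fmix64 finalizer applied to the packed RID.
    // fmix64 is a bijection on 64-bit values, so distinct packed RIDs
    // always get distinct 64-bit hashes.
    static long murmurStyleHash(int clusterId, long pos) {
        long k = ((long) clusterId << 48) ^ pos;
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    public static void main(String[] args) {
        long sink = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < 10_000; i++) sink += plainHash(12, i);
        long t1 = System.nanoTime();
        for (int i = 0; i < 10_000; i++) sink += murmurStyleHash(12, i);
        long t2 = System.nanoTime();
        System.out.printf("plain: %d us, murmur-style: %d us (sink=%d)%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000, sink);
    }
}
```

The actual OrientDB class goes through a serializer as well, which accounts for part of the gap measured above.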

@lvca
Member

lvca commented Apr 25, 2015

Any idea how to improve the current ORecordId.hashCode(), or can I close this issue?

@lvca lvca self-assigned this Apr 25, 2015
@swimmesberger
Author

No idea, but I just wanted to say that we can't really rely on ORecordId.hashCode() for the reason stated above, so using ORecordId objects in HashMaps, for example, could be pretty dangerous.

@doppelrittberger

Hi,
I took a look at the hashCode() method of the class ORecordId. You should use this method:

    return 31 * clusterId + 37 * (int) (clusterPosition ^ (clusterPosition >>> 32));

instead of the old one (as proposed in Josh Bloch's Effective Java; see http://stackoverflow.com/questions/113511/best-implementation-for-hashcode-method). This resolves most of the collisions for me.
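A quick sanity check of that formula against the pair from the original report (a standalone sketch, not the actual ORecordId class):

```java
// Hypothetical helper replicating the proposed formula above.
public class ProposedHashCheck {
    static int proposedHash(int clusterId, long clusterPosition) {
        return 31 * clusterId + 37 * (int) (clusterPosition ^ (clusterPosition >>> 32));
    }

    public static void main(String[] args) {
        // 31 * 12 + 37 * 31 = 1519 vs 31 * 13 + 37 * 0 = 403
        System.out.println(proposedHash(12, 31)); // 1519
        System.out.println(proposedHash(13, 0));  // 403
    }
}
```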

@lvca
Member

lvca commented Oct 20, 2015

I don't think 37 brings any advantage. Do you have any data on how this number reduced your collision rate?

@doppelrittberger

I made a small test:

    public static void main (String[] args)
    {
        Set<Integer> oldHashCodes = new HashSet<>();
        int oldHashCodeCollisions = 0;
        for (int clusterId = 0; clusterId < 100; clusterId++) {
            for (int clusterPosition = -1000; clusterPosition < 100000; clusterPosition++) {
                int oldHashCode = oldHashCode(clusterId, clusterPosition);
                if (!oldHashCodes.add(oldHashCode)) {
                    oldHashCodeCollisions++;
                }
            }
        }
        System.out.println("Old hashCode collisions: " + oldHashCodeCollisions);
        Set<Integer> newHashCodes = new HashSet<>();
        int newHashCodeCollisions = 0;
        for (int clusterId = 0; clusterId < 100; clusterId++) {
            for (int clusterPosition = -1000; clusterPosition < 100000; clusterPosition++) {
                int newHashCode = newHashCode(clusterId, clusterPosition);
                if (!newHashCodes.add(newHashCode)) {
                    newHashCodeCollisions++;
                }
            }
        }
        System.out.println("New hashCode collisions: " + newHashCodeCollisions);
    }

    private static int oldHashCode (int clusterId, long clusterPosition)
    {
        return 31 * clusterId + (int) (clusterPosition ^ (clusterPosition >>> 32));
    }

    private static int newHashCode (int clusterId, long clusterPosition)
    {
        return 31 * clusterId + 103 * (int) clusterPosition;
    }

and the results are impressive:
Old hashCode collisions: 9996931
New hashCode collisions: 0

@lvca
Member

lvca commented Oct 22, 2015

Impressive! Why 103?

@doppelrittberger

I did some testing with bigger prime numbers, and 103 worked well :)

@swimmesberger
Author

Sounds very good! :) Thanks for the test! I didn't really have time to do any testing with different hashCode functions.

@doppelrittberger

You're welcome

@lvca
Member

lvca commented Oct 25, 2015

I also tested with random RIDs: the lower the RIDs, the better the results. And in 99% of cases that is what happens: RIDs are incremental. So definitely +1. @doppelrittberger Seems you found the magic number (103). Thanks ;-)

@lvca lvca closed this as completed Oct 26, 2015
lvca added a commit that referenced this issue Oct 26, 2015
@lvca lvca added this to the 2.2 milestone Oct 26, 2015
@PhantomYdn
Contributor

Guys, the new hash will fail if you test with more than just 100 clusters :)

Simple example:

Records 103:62 and 206:31 will have the same hash code:

    31*103 + 103*62 = 31*206 + 103*31

because:

    31*1*103 + 103*2*31 = 31*2*103 + 103*1*31

and, after reordering the factors:

    31*1*103 + 103*2*31 = 31*1*103 + 103*2*31

I'm no expert in ciphers, but you will definitely need an XOR: affine transformations will definitely not work.
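That collision is easy to verify against the `newHashCode` formula from the test above (a standalone sketch):

```java
// Replicates the 103-multiplier formula under discussion.
public class MagicPrimeCollision {
    static int newHash(int clusterId, long clusterPosition) {
        return 31 * clusterId + 103 * (int) clusterPosition;
    }

    public static void main(String[] args) {
        // 31 * 103 + 103 * 62 = 9579 and 31 * 206 + 103 * 31 = 9579
        System.out.println(newHash(103, 62)); // 9579
        System.out.println(newHash(206, 31)); // 9579
    }
}
```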

@PhantomYdn
Contributor

And, btw, there is no requirement in Java for different objects to have different hash codes.
But for ORecordId's hashCode I would recommend the following requirements:

  1. Very low probability of collisions within the same cluster, rather than between clusters
  2. Very fast to calculate

The following (auto-generated in Eclipse) function is much better than the proposed one:

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + clusterId;
        result = prime * result + (int) (clusterPosition ^ (clusterPosition >>> 32));
        return result;
    }
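Expanded into a standalone sketch (plain static parameters instead of the real class fields), it is worth noting that this form is still affine, and a quick check shows it maps the pair from the opening report to the same value:

```java
// Standalone sketch of the Eclipse-generated method above. Note: it is
// affine in the same way as the original formula, so it still collides
// on the pair from the opening report.
public class EclipseStyleHash {
    static int hash(int clusterId, long clusterPosition) {
        final int prime = 31;
        int result = 1;
        result = prime * result + clusterId;
        result = prime * result + (int) (clusterPosition ^ (clusterPosition >>> 32));
        return result;
    }

    public static void main(String[] args) {
        // 31 * (31 + 12) + 31 = 1364 and 31 * (31 + 13) + 0 = 1364
        System.out.println(hash(12, 31)); // 1364
        System.out.println(hash(13, 0));  // 1364
    }
}
```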

@smolinari
Contributor

Nothing like what you came up with, Ilia, but we were looking at using this very small library for hash IDs. We are thinking more along the lines of URL IDs too, which I know isn't really part of this discussion or issue.

http://hashids.org/java/

The cool thing with this hashing algorithm is that you can put in an RID and get a nice URL-friendly hash ID. And, when using a salt (the third input), the RID can be tokenized to an extent.

Pretty sure this isn't helpful, and sorry if it isn't. But still, I thought I'd throw it into the conversation.

Scott

@doppelrittberger

@PhantomYdn: you're right, but since in a typical use case the clusterId is much smaller than the clusterPosition, this simple "trick" can prevent hash collisions. And while you're also right that hash collisions are not a correctness issue in Java, they can impact the performance of hash-based sets and maps (as analyzed here: http://www.nurkiewicz.com/2014/04/hashmap-performance-improvements-in.html). So my suggestion was just to multiply by a bigger prime to avoid collisions. The current version simply does 31*clusterId + clusterPosition, and this generates many more collisions. In addition, taking the first 32 bits of a long value just cuts off the sign and causes additional collisions (since every new record gets a negative clusterPosition) -> 1:1 collides with 1:-1, and so on...
One final thought: since every database can only store up to 32,767 clusters, you can also use the next prime (32771) as the multiplier and prevent all collisions caused by clusterIds:

    public static void main (String[] args)
    {
        Set<Integer> oldHashCodes = new HashSet<>();
        int oldHashCodeCollisions = 0;
        for (int clusterId = 0; clusterId < 10000; clusterId++) {
            for (int clusterPosition = -1000; clusterPosition < 1000; clusterPosition++) {
                int oldHashCode = oldHashCode(clusterId, clusterPosition);
                if (!oldHashCodes.add(oldHashCode)) {
                    oldHashCodeCollisions++;
                }
            }
        }
        System.out.println("Old hashCode collisions: " + oldHashCodeCollisions);
        Set<Integer> newHashCodes = new HashSet<>();
        int newHashCodeCollisions = 0;
        for (int clusterId = 0; clusterId < 10000; clusterId++) {
            for (int clusterPosition = -1000; clusterPosition < 1000; clusterPosition++) {
                int newHashCode = newHashCode(clusterId, clusterPosition);
                if (!newHashCodes.add(newHashCode)) {
                    newHashCodeCollisions++;
                }
            }
        }
        System.out.println("New hashCode collisions: " + newHashCodeCollisions);
    }

    private static int oldHashCode (int clusterId, long clusterPosition)
    {
        return 31 * clusterId + (int) (clusterPosition ^ (clusterPosition >>> 32));
    }

    private static int newHashCode (int clusterId, long clusterPosition)
    {
        return 31 * clusterId + 32771 * (int) clusterPosition;
    }

This code prints:
Old hashCode collisions: 19689031
New hashCode collisions: 0
