
ORecordId - hashCode function collisions #3903

Closed
swimmesberger opened this issue Apr 9, 2015 · 18 comments

@swimmesberger

The hashCode function of the ORecordId class does not seem to work properly:

    ORecordId rec = new ORecordId(12, 31);
    int hash = rec.hashCode();
    ORecordId rec1 = new ORecordId(13, 0);
    int hash1 = rec1.hashCode();

Both hash and hash1 return 403.

Version: 2.0.5
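For readers following along, the collision falls directly out of the arithmetic. A minimal standalone sketch, assuming ORecordId.hashCode() in 2.0.5 is equivalent to the `oldHashCode` formula reproduced later in this thread:

```java
// Standalone sketch: assumes ORecordId.hashCode() is equivalent to
// 31 * clusterId + (int) (clusterPosition ^ (clusterPosition >>> 32)).
public class RidHashCollision {
    static int hash(int clusterId, long clusterPosition) {
        return 31 * clusterId + (int) (clusterPosition ^ (clusterPosition >>> 32));
    }

    public static void main(String[] args) {
        // 31 * 12 + 31 = 403 and 31 * 13 + 0 = 403 -> collision
        System.out.println(hash(12, 31)); // 403
        System.out.println(hash(13, 0));  // 403
    }
}
```

Any pair of RIDs where incrementing the clusterId is offset by decrementing the position contribution by 31 collides the same way.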

@laa
Member

laa commented Apr 9, 2015

We have the com.orientechnologies.orient.core.index.hashindex.local.OMurmurHash3HashFunction hash function. Would you mind implementing it as the source of the hash code and testing how performance changes?

@lvca
Member

lvca commented Apr 9, 2015

@laa do you mean implementing ORecordId.hashCode() using Murmur algorithm? If I'm right, +1

@laa
Member

laa commented Apr 9, 2015

Yes, but I would really appreciate it if @swimmesberger tried it himself, if he is interested of course.

@swimmesberger
Author

    final OMurmurHash3HashFunction<OIdentifiable> hasher = new OMurmurHash3HashFunction<OIdentifiable>();
    hasher.setValueSerializer(new OStreamSerializerRID());
    ORecordId rec = new ORecordId(12, 31);
    long hash = hasher.hashCode(rec);
    ORecordId rec1 = new ORecordId(13, 0);
    long hash1 = hasher.hashCode(rec1);

This seems to work better (no collisions so far), but it is not really a drop-in replacement for hashCode, because hashCode returns an int while OHashFunction.hashCode(V value) returns a long (if we don't want to discard any information).

OMurmurHash3HashFunction is indeed much slower than the normal hashCode function, but for my use case the performance drop is justifiable.

In my quick, non-optimized test, 10,000 calls to the OMurmurHash3HashFunction hashCode function take ~52ms, while 10,000 calls to the current hashCode function take ~8ms.
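For reference, a rough standalone way to reproduce that kind of timing comparison. This is a sketch only: the Murmur-style function below is the public MurmurHash3 fmix64 finalizer applied to the packed RID, not OrientDB's actual OMurmurHash3HashFunction, and the plain formula is an assumption about ORecordId.hashCode():

```java
// Rough, non-optimized comparison in the spirit of the numbers above.
public class HashTiming {
    static int plainHash(int clusterId, long pos) {
        return 31 * clusterId + (int) (pos ^ (pos >>> 32));
    }

    // MurmurHash3's fmix64 finalizer applied to the packed RID.
    // fmix64 is a bijection on 64-bit values, so distinct packed RIDs
    // always get distinct 64-bit hashes.
    static long murmurStyleHash(int clusterId, long pos) {
        long k = ((long) clusterId << 48) ^ pos;
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    public static void main(String[] args) {
        long sink = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < 10_000; i++) sink += plainHash(12, i);
        long t1 = System.nanoTime();
        for (int i = 0; i < 10_000; i++) sink += murmurStyleHash(12, i);
        long t2 = System.nanoTime();
        System.out.printf("plain: %d us, murmur-style: %d us (sink=%d)%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000, sink);
    }
}
```

The actual OrientDB class goes through a serializer as well, which accounts for part of the gap measured above.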

@lvca
Member

lvca commented Apr 25, 2015

Any idea how to improve the current ORecordId.hashCode(), or can I close this issue?

@lvca lvca self-assigned this Apr 25, 2015
@swimmesberger
Author

No idea, but I just wanted to say that we can't really rely on ORecordId.hashCode() for the reason stated above, so using ORecordId objects in HashMaps, for example, could be pretty dangerous.

@doppelrittberger

Hi,
I took a look at the hashCode() method of the class ORecordId. You should use this method:

    return 31 * clusterId + 37 * (int) (clusterPosition ^ (clusterPosition >>> 32));

instead of the old one (as proposed in Josh Bloch's Effective Java; see http://stackoverflow.com/questions/113511/best-implementation-for-hashcode-method). This resolves most of the collisions for me.
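A quick sanity check of that formula against the pair from the original report (a standalone sketch, not the actual ORecordId class):

```java
// Hypothetical helper replicating the proposed formula above.
public class ProposedHashCheck {
    static int proposedHash(int clusterId, long clusterPosition) {
        return 31 * clusterId + 37 * (int) (clusterPosition ^ (clusterPosition >>> 32));
    }

    public static void main(String[] args) {
        // 31 * 12 + 37 * 31 = 1519 vs 31 * 13 + 37 * 0 = 403
        System.out.println(proposedHash(12, 31)); // 1519
        System.out.println(proposedHash(13, 0));  // 403
    }
}
```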

@lvca
Member

lvca commented Oct 20, 2015

I don't think 37 brings any advantage. Do you have any data on how this number reduced your collision rate?

@doppelrittberger

I made a small test:

    public static void main (String[] args)
    {
        Set<Integer> oldHashCodes = new HashSet<>();
        int oldHashCodeCollisions = 0;
        for (int clusterId = 0; clusterId < 100; clusterId++) {
            for (int clusterPosition = -1000; clusterPosition < 100000; clusterPosition++) {
                int oldHashCode = oldHashCode(clusterId, clusterPosition);
                if (!oldHashCodes.add(oldHashCode)) {
                    oldHashCodeCollisions++;
                }
            }
        }
        System.out.println("Old hashCode collisions: " + oldHashCodeCollisions);
        Set<Integer> newHashCodes = new HashSet<>();
        int newHashCodeCollisions = 0;
        for (int clusterId = 0; clusterId < 100; clusterId++) {
            for (int clusterPosition = -1000; clusterPosition < 100000; clusterPosition++) {
                int newHashCode = newHashCode(clusterId, clusterPosition);
                if (!newHashCodes.add(newHashCode)) {
                    newHashCodeCollisions++;
                }
            }
        }
        System.out.println("New hashCode collisions: " + newHashCodeCollisions);
    }

    private static int oldHashCode (int clusterId, long clusterPosition)
    {
        return 31 * clusterId + (int) (clusterPosition ^ (clusterPosition >>> 32));
    }

    private static int newHashCode (int clusterId, long clusterPosition)
    {
        return 31 * clusterId + 103 * (int) clusterPosition;
    }

and the results are impressive:
Old hashCode collisions: 9996931
New hashCode collisions: 0

@lvca
Member

lvca commented Oct 22, 2015

Impressive! Why 103?

@doppelrittberger

I did some testing with bigger prime numbers, and 103 worked well :)

@swimmesberger
Author

Sounds very good! :) Thanks for the test! I didn't really have time to do any testing with different hashCode functions.

@doppelrittberger

You're welcome

@lvca
Member

lvca commented Oct 25, 2015

I also tested with random RIDs: the lower the RIDs, the better the results. And in 99% of cases that is what happens: RIDs are incremental. So definitely +1. @doppelrittberger Seems you found the magic number (103). Thanks ;-)

@lvca lvca closed this as completed Oct 26, 2015
lvca added a commit that referenced this issue Oct 26, 2015
@lvca lvca added this to the 2.2 milestone Oct 26, 2015
@PhantomYdn
Contributor

Guys, the new hash will fail if you test with more than just 100 clusters :)

Simple example:

Records 103:62 and 206:31 will have the same hash code:

    31*103 + 103*62 = 31*206 + 103*31

because:

    31*1*103 + 103*2*31 = 31*2*103 + 103*1*31

and, after reordering the factors:

    31*1*103 + 103*2*31 = 31*1*103 + 103*2*31

I'm no expert in ciphers, but you will definitely need an XOR: affine transformations will definitely not work.
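That collision is easy to verify against the `newHashCode` formula from the test above (a standalone sketch):

```java
// Replicates the 103-multiplier formula under discussion.
public class MagicPrimeCollision {
    static int newHash(int clusterId, long clusterPosition) {
        return 31 * clusterId + 103 * (int) clusterPosition;
    }

    public static void main(String[] args) {
        // 31 * 103 + 103 * 62 = 9579 and 31 * 206 + 103 * 31 = 9579
        System.out.println(newHash(103, 62)); // 9579
        System.out.println(newHash(206, 31)); // 9579
    }
}
```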

@PhantomYdn
Contributor

And, btw, there is no requirement in Java for different objects to have different hash codes.
But for ORecordId's hashCode I would recommend the following requirements:

  1. Very low probability of collisions within the same cluster, rather than between clusters
  2. Very fast to calculate

The following (auto-generated in Eclipse) function is much better than the proposed one:

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + clusterId;
        result = prime * result + (int) (clusterPosition ^ (clusterPosition >>> 32));
        return result;
    }
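Expanded into a standalone sketch (plain static parameters instead of the real class fields), it is worth noting that this form is still affine, and a quick check shows it maps the pair from the opening report to the same value:

```java
// Standalone sketch of the Eclipse-generated method above. Note: it is
// affine in the same way as the original formula, so it still collides
// on the pair from the opening report.
public class EclipseStyleHash {
    static int hash(int clusterId, long clusterPosition) {
        final int prime = 31;
        int result = 1;
        result = prime * result + clusterId;
        result = prime * result + (int) (clusterPosition ^ (clusterPosition >>> 32));
        return result;
    }

    public static void main(String[] args) {
        // 31 * (31 + 12) + 31 = 1364 and 31 * (31 + 13) + 0 = 1364
        System.out.println(hash(12, 31)); // 1364
        System.out.println(hash(13, 0));  // 1364
    }
}
```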

@smolinari
Contributor

Nothing like what you came up with, Ilia, but we were looking at using this very small library for hash IDs. We are thinking more along the lines of URL IDs too, which I know isn't really part of this discussion or issue.

http://hashids.org/java/

The cool thing with this hashing algorithm is that you can put in an RID and get a nice URL-friendly hash ID. And, when using a salt (the third input), the RID can be tokenized to an extent.

Pretty sure this isn't helpful, and sorry if it isn't. But still, I thought I'd throw it into the conversation.

Scott

@doppelrittberger

@PhantomYdn: you're right, but since in a typical use case the clusterId is much smaller than the clusterPosition, this simple "trick" can prevent hash collisions. And while you're also right that hash collisions are not a correctness issue in Java, they can impact the performance of hash-based sets and maps (as analyzed here: http://www.nurkiewicz.com/2014/04/hashmap-performance-improvements-in.html). So my suggestion was just to multiply by a bigger prime to avoid collisions. The current version simply does 31*clusterId + clusterPosition, and this generates many more collisions. In addition, taking the first 32 bits of a long value just cuts off the sign and causes additional collisions (since every new record gets a negative clusterPosition) -> 1:1 collides with 1:-1, and so on...
One final thought: since every database can only store up to 32,767 clusters, you can also use the next prime (32771) as the multiplier and prevent all collisions caused by clusterIds:

    public static void main (String[] args)
    {
        Set<Integer> oldHashCodes = new HashSet<>();
        int oldHashCodeCollisions = 0;
        for (int clusterId = 0; clusterId < 10000; clusterId++) {
            for (int clusterPosition = -1000; clusterPosition < 1000; clusterPosition++) {
                int oldHashCode = oldHashCode(clusterId, clusterPosition);
                if (!oldHashCodes.add(oldHashCode)) {
                    oldHashCodeCollisions++;
                }
            }
        }
        System.out.println("Old hashCode collisions: " + oldHashCodeCollisions);
        Set<Integer> newHashCodes = new HashSet<>();
        int newHashCodeCollisions = 0;
        for (int clusterId = 0; clusterId < 10000; clusterId++) {
            for (int clusterPosition = -1000; clusterPosition < 1000; clusterPosition++) {
                int newHashCode = newHashCode(clusterId, clusterPosition);
                if (!newHashCodes.add(newHashCode)) {
                    newHashCodeCollisions++;
                }
            }
        }
        System.out.println("New hashCode collisions: " + newHashCodeCollisions);
    }

    private static int oldHashCode (int clusterId, long clusterPosition)
    {
        return 31 * clusterId + (int) (clusterPosition ^ (clusterPosition >>> 32));
    }

    private static int newHashCode (int clusterId, long clusterPosition)
    {
        return 31 * clusterId + 32771 * (int) clusterPosition;
    }

This code prints:
Old hashCode collisions: 19689031
New hashCode collisions: 0
