Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use MurmurHash3 for interaction features #28

Merged
merged 3 commits into from Jan 4, 2017

Conversation

Projects
None yet
2 participants
@stegben
Copy link
Contributor

commented Dec 29, 2016

#23
MurmurHash3 is used by sklearn's FeatureHash, which is fast and robust enough.

Before

         3 function calls in 9.690 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 base.py:99(get_shape)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    9.690    9.690    9.690    9.690 {method 'fit' of 'kaggler.online_model.ftrl.FTRL' objects}

After

         3 function calls in 2.265 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 base.py:99(get_shape)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    2.265    2.265    2.265    2.265 {method 'fit' of 'kaggler.online_model.ftrl.FTRL' objects}

@stegben

This comment has been minimized.

Copy link
Contributor Author

commented Dec 29, 2016

As for the original index, a function that directly returns the input integer is a good enough hash function, so I remove the hash function for it.

@jeongyoonlee

This comment has been minimized.

Copy link
Owner

commented Dec 29, 2016

Have you tried 'mmh3' python module (https://pypi.python.org/pypi/mmh3/2.3.1) instead of the C++ one? Unless there's a significant performance difference, the python module is preferred for simplicity.

@jeongyoonlee

This comment has been minimized.

Copy link
Owner

commented Dec 29, 2016

Regarding not using hash for original features, I agree with you. There might be more collision for features with low index if the number of features is greater than n, but as long as n is set big enough, which is usually the case, it should be fine.

Thanks for the improvement!

@stegben

This comment has been minimized.

Copy link
Contributor Author

commented Dec 30, 2016

mmh3 looks great, I'll try it in 2 days

@stegben

This comment has been minimized.

Copy link
Contributor Author

commented Jan 4, 2017

Sorry I was on my vacation these days.

I found that if I use mmh3, which cannot be imported by cimport, there will be extra function call overheads. On the other hand, the implementation of it use lots of PyObjects, which is much slower.

@jeongyoonlee

This comment has been minimized.

Copy link
Owner

commented Jan 4, 2017

No worries. Then let's merge this. Thanks for looking into it.

@jeongyoonlee jeongyoonlee merged commit c90506e into jeongyoonlee:master Jan 4, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.