Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use MurmurHash3 for interaction features #28

Merged
merged 3 commits into from Jan 4, 2017

Conversation

stegben
Copy link
Contributor

@stegben stegben commented Dec 29, 2016

#23
MurmurHash3 is used by sklearn's FeatureHash, which is fast and robust enough.

Before

         3 function calls in 9.690 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 base.py:99(get_shape)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    9.690    9.690    9.690    9.690 {method 'fit' of 'kaggler.online_model.ftrl.FTRL' objects}

After

         3 function calls in 2.265 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 base.py:99(get_shape)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    2.265    2.265    2.265    2.265 {method 'fit' of 'kaggler.online_model.ftrl.FTRL' objects}

@stegben
Copy link
Contributor Author

stegben commented Dec 29, 2016

As for the original index, a function that directly returns the input integer is a good enough hash function, so I remove the hash function for it.

@jeongyoonlee
Copy link
Owner

Have you tried 'mmh3' python module (https://pypi.python.org/pypi/mmh3/2.3.1) instead of the C++ one? Unless there's a significant performance difference, the python module is preferred for simplicity.

@jeongyoonlee
Copy link
Owner

Regarding not using hash for original features, I agree with you. There might be more collision for features with low index if the number of features is greater than n, but as long as n is set big enough, which is usually the case, it should be fine.

Thanks for the improvement!

@stegben
Copy link
Contributor Author

stegben commented Dec 30, 2016

mmh3 looks great, I'll try it in 2 days

@stegben
Copy link
Contributor Author

stegben commented Jan 4, 2017

Sorry I was on my vacation these days.

I found that if I use mmh3, which cannot be imported by cimport, there will be extra function call overheads. On the other hand, the implementation of it use lots of PyObjects, which is much slower.

@jeongyoonlee
Copy link
Owner

No worries. Then let's merge this. Thanks for looking into it.

@jeongyoonlee jeongyoonlee merged commit c90506e into jeongyoonlee:master Jan 4, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants