Compile some processing intensive modules #2093
Comments
It would be interesting to see how this works out and to know which modules can be easily cythonized. It might take some time to fully create wheels for all platforms, but it's worth a try. If you're up for it, please do submit a PR with a cythonized version of the setup.py and we'll take a look at it.
I think you need to be careful about this and consider the disadvantages of compiling modules. One issue is that users can no longer inspect the code as easily (e.g. using IPython's `??` introspection). As far as I understand, efficiency has not been a major goal for NLTK, so you need to decide on priorities. But if you think compilation is a fair route to efficiency, I think you should empirically justify it for each module.
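To make the inspection concern concrete: once a module ships as a compiled extension, source-reading tools stop working on it. A small sketch (the module picked below is just an example):

```python
# Sketch of the inspectability problem: inspect.getsource() returns the
# source of a pure-Python module, but raises OSError for a compiled
# extension (.so/.pyd), since there is no source file to read.
import inspect
import nltk.grammar  # pure Python today, so this works

print(inspect.getsource(nltk.grammar)[:120])  # would raise OSError if compiled
```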
I agree with this, although I would note that the […]
I think it's a good experiment to try though, thus the […]. If the cost of maintaining a […] is low, […]. Personally, if the tokenization or ngram extraction works 2-5x faster (though I don't think it would), it would be quite awesome.
It would still be good to know what's possible to compile automatically with Cython. At the end of this exercise, it would be worth putting up some benchmarks against a JIT like PyPy as well as native CPython.
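A rough sketch of such a benchmark (the workload below is an assumed example; `word_tokenize` also needs the punkt data downloaded first). The same script can be run under plain CPython, a Cython-compiled install, and PyPy:

```python
# bench_tokenize.py — rough benchmark sketch; run under plain CPython,
# a Cython-compiled NLTK install, and PyPy, then compare the timings.
# Requires the punkt data: python -c "import nltk; nltk.download('punkt')"
import timeit

stmt = "word_tokenize(text)"
setup = (
    "from nltk.tokenize import word_tokenize; "
    "text = 'The quick brown fox jumps over the lazy dog. ' * 200"
)
print(min(timeit.repeat(stmt, setup=setup, number=50, repeat=3)))
```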
Yes, if it were available as `nltk[binary]`, then great.
AFAIU, an extra like `nltk[binary]` could still lead to versioning problems, because pip might allow installing a different version of the compiled package than of nltk itself. And it's not exactly great to have two packages install their modules into the same base package … but then, there are namespace packages for reasons like this. All in all, it sounds like a bit of a complicated setup …
Can't make a prediction here since I don't know the code, but once it's compiled with Cython, you might be able to apply some further tuning at the C level. See the Cython pure Python mode docs, specifically the section on augmenting .pxd files. |
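For reference, in Cython's pure Python mode the module stays ordinary Python and an augmenting .pxd file adds C types at compile time. A minimal sketch with a hypothetical module (not actual NLTK code):

```python
# textutils.py — hypothetical module; plain Python, so it still runs
# uncompiled on CPython or PyPy.
def char_histogram(text):
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    return counts

# textutils.pxd — the augmenting file Cython picks up when compiling
# textutils.py; it adds C-level types without touching the source:
#
#     cpdef dict char_histogram(str text)
```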
From what I understand, the […]
We are using NLTK 3.3 for large-scale text processing with custom grammars, and found that compiling two of the NLTK modules with Cython speeds up our processing by about 30%, essentially for free. All we had to do was run a short in-place Cython build step after the NLTK installation (a sketch follows below). Note that this compilation remains entirely optional; we didn't change any code.
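A minimal sketch of such an in-place compilation step; which two modules were compiled is not specified above, so the module choices here are assumptions:

```python
# compile_nltk_inplace.py — minimal sketch, not the reporter's exact command.
# Run from inside the installed nltk package directory after
# `pip install nltk Cython`. The two modules below are assumed examples.
from setuptools import setup
from Cython.Build import cythonize

setup(
    script_args=["build_ext", "--inplace"],
    ext_modules=cythonize(
        ["tokenize/punkt.py", "grammar.py"],  # assumed module choices
        language_level=3,
    ),
)
```

After the build, Python picks up the generated extension modules (.so/.pyd) in place of the corresponding .py files, so no imports or code need to change.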
The small downside for us is that we now need to keep our own binary wheel build of NLTK around in order to keep the faster package pip-installable, but that is easily worth the hours of processing time we save every day.
So, would this be something that the NLTK project would want to do as well?
Basically, it would be an option in the setup.py script to compile some modules (specifically for CPython), and the NLTK project could start uploading binary wheels for the different Python versions to PyPI, which, when used, would speed up processing on the end-user side. Since no code change is involved, choosing whether to use the wheels is entirely up to the users. In particular, Linux binary wheels would certainly be appreciated by all the CI test runners out there.
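A sketch of what such an opt-in could look like in setup.py; the environment-variable gate and the module list below are assumptions, not NLTK's actual build configuration:

```python
# setup.py sketch — a hypothetical opt-in, not NLTK's actual build logic.
# Compiles a few modules only when NLTK_CYTHONIZE=1 is set and Cython is
# importable; otherwise it falls back to a normal pure-Python install.
import os
from setuptools import setup, find_packages

ext_modules = []
if os.environ.get("NLTK_CYTHONIZE") == "1":
    try:
        from Cython.Build import cythonize
        # Assumed list of processing-intensive modules.
        ext_modules = cythonize(
            ["nltk/tokenize/punkt.py", "nltk/grammar.py"],
            language_level=3,
        )
    except ImportError:
        pass  # Cython not installed; keep the pure-Python modules

setup(
    name="nltk",
    packages=find_packages(),
    ext_modules=ext_modules,
    # ... remaining metadata as in the real setup.py ...
)
```

Binary wheels built this way would carry the compiled modules, while the sdist and default install would stay pure Python, keeping the choice with the user.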