
Compile some processing intensive modules #2093

Open
scoder opened this issue Aug 23, 2018 · 9 comments

Comments

@scoder

scoder commented Aug 23, 2018

We are using NLTK 3.3 for large scale text processing with custom grammars, and found that compiling two of the NLTK modules with Cython speeds up our processing by about 30%, essentially for free. All we had to do was

cythonize -i -3 nltk/parse/chart.py nltk/grammar.py

after the NLTK installation. Note that this compilation remains entirely optional, we didn't change any code.

Obviously, the little downside for us is now that we need to keep our own binary wheel build of NLTK around in order to keep the faster package pip installable for us, but that is easily worth the hours of processing time that we save every day.

So, would this be something that the NLTK project would want to do as well?

Basically, it would be an option in the setup.py script to compile some modules (specifically for CPython), and the NLTK project could start uploading binary wheels for different Python versions to PyPI, which, when used, would speed up the processing on end-user side. Since no code change is involved, choosing to use the wheels or not is entirely up to the users in that case. Specifically, Linux binary wheels would certainly be appreciated for all the CI test runners out there.
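For illustration, here is a minimal sketch of what such an opt-in compilation step could look like in setup.py. The NLTK_CYTHONIZE switch is a made-up name, and the module list simply mirrors the two modules mentioned above; without Cython (or without the switch), the install stays pure Python.

```python
# setup.py (sketch) -- opt-in Cython compilation of selected modules.
# NLTK_CYTHONIZE is a hypothetical switch, not an existing NLTK option.
import os
from setuptools import setup

ext_modules = []
if os.environ.get("NLTK_CYTHONIZE") == "1":
    try:
        from Cython.Build import cythonize
        ext_modules = cythonize(
            ["nltk/parse/chart.py", "nltk/grammar.py"],
            language_level=3,
        )
    except ImportError:
        pass  # no Cython available: fall back to the pure Python install

setup(
    name="nltk",
    packages=["nltk", "nltk.parse"],
    ext_modules=ext_modules,
)
```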

@alvations
Contributor

It would be interesting to see how this works out and to know which modules can be easily cythonized.

It might take some time to fully create wheels for all platforms, but it's worth a try. If you're up for it, please do submit a PR with a cythonized version of the setup.py and we'll take a look at it.

@jnothman
Contributor

I think you need to be careful about this and consider the disadvantages of compiling modules. One issue is that users can no longer inspect the code as easily (e.g. using IPython's ??).

As far as I understand, efficiency has not been a major goal for nltk. You need to decide on priorities. But if you think this is a fair solution to efficiency, I think you should empirically justify it for each module.

@scoder
Author

scoder commented Aug 29, 2018

I agree with this, although I would note that the .py files are at least still available right next to the .so files, even in the installed package.

@alvations
Contributor

I think it's a good experiment to try though, thus the cython branch =)

If the cost of maintaining a pip install -U nltk[cython] extra is as low as keeping this PR up to date, then I think it's worth putting in. And everyone can still install the pure Python version with pip install nltk.

Personally, if the tokenization or ngram extraction works 2-5x faster (though I don't think it would), it would be quite awesome.

@alvations
Contributor

It's still good to know what's possible to compile automatically with Cython. At the end of this exercise we could put up some benchmarks against a JIT implementation like PyPy and against native CPython.
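For example, a minimal benchmark sketch along these lines (the toy grammar and sentence are placeholders, not a realistic workload) could be run once against the plain install, once against a cythonized build, and unchanged under PyPy:

```python
# bench_chart.py (sketch) -- times chart parsing on whichever NLTK build is
# currently installed (pure Python, cythonized, or under PyPy).
import timeit
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N -> 'dog' | 'cat'
    V -> 'chased'
""")
parser = nltk.ChartParser(grammar)
sentence = "the dog chased the cat".split()

elapsed = timeit.timeit(lambda: list(parser.parse(sentence)), number=10000)
print(f"10000 parses: {elapsed:.2f}s")
```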

@jnothman
Copy link
Contributor

jnothman commented Aug 29, 2018 via email

@scoder
Author

scoder commented Aug 29, 2018

AFAIU, nltk[binary] means "nltk, plus all optional dependencies that it defines in its binary list". I guess that could be made to work by separating nltk into two packages: one pure Python package (nltk) and one package (let's call it nltk-binary) that installs the binary modules into it. nltk[binary] would then depend on that nltk-binary package of the exact same version, which would depend back on the exact same version of nltk.

Could still lead to versioning problems, because pip might allow installing a different version of nltk-binary if users really want to have it, but in general, assuming that users don't do "non-adult stuff", this might work.

And it's not exactly great to have two packages install their modules into the same base package … but then, there are namespace packages for reasons like this.

All in all, sounds like a bit of a complicated setup …
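For illustration, the extras_require side of that two-package layout might look roughly like this in the pure Python nltk setup.py, with nltk-binary being the hypothetical companion package described above:

```python
# setup.py of the pure Python nltk package (sketch).
# "nltk-binary" is the hypothetical compiled companion package discussed
# above; pinning it to the exact same version keeps the two in lock-step.
from setuptools import setup

VERSION = "3.3"

setup(
    name="nltk",
    version=VERSION,
    packages=["nltk"],
    extras_require={
        # pip install nltk[binary] pulls in the compiled modules;
        # plain pip install nltk stays pure Python.
        "binary": ["nltk-binary==" + VERSION],
    },
)
```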

@scoder
Author

scoder commented Aug 29, 2018

if the tokenization or ngram extraction works 2-5x faster

Can't make a prediction here since I don't know the code, but once it's compiled with Cython, you might be able to apply some further tuning at the C level. See the Cython pure Python mode docs, specifically the section on augmenting .pxd files.
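As a rough illustration (the helper name and signature here are made up, not actual NLTK internals), an augmenting nltk/grammar.pxd placed next to grammar.py could add C-level typing that Cython picks up at compile time while the plain interpreter ignores it:

```
# nltk/grammar.pxd (sketch) -- read by Cython when grammar.py is compiled,
# ignored entirely by the plain Python interpreter, so grammar.py itself
# stays untouched.  The helper below is hypothetical.

# Give a hypothetical hot helper a typed, C-level signature; the matching
# "def _productions_for(productions, lhs)" would live in grammar.py.
cpdef list _productions_for(list productions, object lhs)
```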

@scoder
Author

scoder commented May 5, 2020

From what I understand, the cythonize branch now works. Are there any plans to merge it into mainline and start shipping binary wheels?
