
Compile some processing intensive modules #2093

Open
scoder opened this issue Aug 23, 2018 · 9 comments

Comments

@scoder

scoder commented Aug 23, 2018

We are using NLTK 3.3 for large scale text processing with custom grammars, and found that compiling two of the NLTK modules with Cython speeds up our processing by about 30%, essentially for free. All we had to do was

cythonize -i -3 nltk/parse/chart.py nltk/grammar.py

after the NLTK installation. Note that this compilation remains entirely optional, we didn't change any code.

Obviously, the little downside for us is now that we need to keep our own binary wheel build of NLTK around in order to keep the faster package pip installable for us, but that is easily worth the hours of processing time that we save every day.

So, would this be something that the NLTK project would want to do as well?

Basically, it would be an option in the setup.py script to compile some modules (specifically for CPython), and the NLTK project could start uploading binary wheels for different Python versions to PyPI, which, when used, would speed up the processing on end-user side. Since no code change is involved, choosing to use the wheels or not is entirely up to the users in that case. Specifically, Linux binary wheels would certainly be appreciated for all the CI test runners out there.
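For illustration, here is a minimal sketch of what such an opt-in compilation step could look like in setup.py. The NLTK_CYTHONIZE switch is a made-up name, and the module list simply mirrors the two modules mentioned above; without Cython (or without the switch), the install stays pure Python.

```python
# setup.py (sketch) -- opt-in Cython compilation of selected modules.
# NLTK_CYTHONIZE is a hypothetical switch, not an existing NLTK option.
import os
from setuptools import setup

ext_modules = []
if os.environ.get("NLTK_CYTHONIZE") == "1":
    try:
        from Cython.Build import cythonize
        ext_modules = cythonize(
            ["nltk/parse/chart.py", "nltk/grammar.py"],
            language_level=3,
        )
    except ImportError:
        pass  # no Cython available: fall back to the pure Python install

setup(
    name="nltk",
    packages=["nltk", "nltk.parse"],
    ext_modules=ext_modules,
)
```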

@alvations
Contributor

It would be interesting to see how this works out and to know which modules can be easily cythonized.

It might take some time to fully create wheels for all platforms, but it's worth a try. If you're up for it, please do submit a PR with a cythonized version of the setup.py and we'll take a look at it.

@jnothman
Contributor

I think you need to be careful about this and consider the disadvantages of compiling modules. One issue is that users can no longer inspect the code as easily (e.g. using IPython's ??).

As far as I understand, efficiency has not been a major goal for nltk. You need to decide on priorities. But if you think this is a fair solution to efficiency, I think you should empirically justify it for each module.

@scoder
Author

scoder commented Aug 29, 2018

I agree with this, although I would note that the .py files are at least still available right next to the .so files, even in the installed package.

@alvations
Contributor

I think it's a good experiment to try though, thus the cython branch =)

If the cost of maintaining a pip install -U nltk[cython] extra is as low as keeping this PR up to date, then I think it's worth putting in. And everyone can still install the pure Python version with pip install nltk.

Personally, if the tokenization or ngram extraction works 2-5x faster (though I don't think it would), it would be quite awesome.

@alvations
Contributor

It's still good to know what's possible to compile automatically with Cython. At the end of this exercise we could put up some benchmarks against a JIT implementation like PyPy and against native CPython.
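For example, a minimal benchmark sketch along these lines (the toy grammar and sentence are placeholders, not a realistic workload) could be run once against the plain install, once against a cythonized build, and unchanged under PyPy:

```python
# bench_chart.py (sketch) -- times chart parsing on whichever NLTK build is
# currently installed (pure Python, cythonized, or under PyPy).
import timeit
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N -> 'dog' | 'cat'
    V -> 'chased'
""")
parser = nltk.ChartParser(grammar)
sentence = "the dog chased the cat".split()

elapsed = timeit.timeit(lambda: list(parser.parse(sentence)), number=10000)
print(f"10000 parses: {elapsed:.2f}s")
```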

@jnothman
Copy link
Contributor

jnothman commented Aug 29, 2018 via email

@scoder
Author

scoder commented Aug 29, 2018

AFAIU, nltk[binary] means "nltk, plus all optional dependencies that it defines in its binary list". I guess that could be made to work by separating nltk into two packages: one pure Python package (nltk) and one package (let's call it nltk-binary) that installs the binary modules into it. nltk[binary] would then depend on that nltk-binary package of the exact same version, which would depend back on the exact same version of nltk.

Could still lead to versioning problems, because pip might allow installing a different version of nltk-binary if users really want to have it, but in general, assuming that users don't do "non-adult stuff", this might work.

And it's not exactly great to have two packages install their modules into the same base package … but then, there are namespace packages for reasons like this.

All in all, sounds like a bit of a complicated setup …
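For illustration, the extras_require side of that two-package layout might look roughly like this in the pure Python nltk setup.py, with nltk-binary being the hypothetical companion package described above:

```python
# setup.py of the pure Python nltk package (sketch).
# "nltk-binary" is the hypothetical compiled companion package discussed
# above; pinning it to the exact same version keeps the two in lock-step.
from setuptools import setup

VERSION = "3.3"

setup(
    name="nltk",
    version=VERSION,
    packages=["nltk"],
    extras_require={
        # pip install nltk[binary] pulls in the compiled modules;
        # plain pip install nltk stays pure Python.
        "binary": ["nltk-binary==" + VERSION],
    },
)
```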

@scoder
Author

scoder commented Aug 29, 2018

if the tokenization or ngram extraction works 2-5x faster

Can't make a prediction here since I don't know the code, but once it's compiled with Cython, you might be able to apply some further tuning at the C level. See the Cython pure Python mode docs, specifically the section on augmenting .pxd files.
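As a rough illustration (the helper name and signature here are made up, not actual NLTK internals), an augmenting nltk/grammar.pxd placed next to grammar.py could add C-level typing that Cython picks up at compile time while the plain interpreter ignores it:

```
# nltk/grammar.pxd (sketch) -- read by Cython when grammar.py is compiled,
# ignored entirely by the plain Python interpreter, so grammar.py itself
# stays untouched.  The helper below is hypothetical.

# Give a hypothetical hot helper a typed, C-level signature; the matching
# "def _productions_for(productions, lhs)" would live in grammar.py.
cpdef list _productions_for(list productions, object lhs)
```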

@scoder
Author

scoder commented May 5, 2020

From what I understand, the cythonize branch now works. Are there any plans to merge it into mainline and start shipping binary wheels?
