
Improve performance of nltk.ngrams #3083

Closed
BLKSerene opened this issue Dec 13, 2022 · 2 comments
BLKSerene commented Dec 13, 2022

With the current implementation of nltk.ngrams, performance degrades as the n-gram size increases:

>>> timeit.timeit('''nltk.ngrams(tokens, 1)''', setup='import nltk; tokens = list(range(1000))')
0.6647043000000394
>>> timeit.timeit('''nltk.ngrams(tokens, 2)''', setup='import nltk; tokens = list(range(1000))')
0.9033590999999888
>>> timeit.timeit('''nltk.ngrams(tokens, 3)''', setup='import nltk; tokens = list(range(1000))')
1.1749377000001004
>>> timeit.timeit('''nltk.ngrams(tokens, 4)''', setup='import nltk; tokens = list(range(1000))')
1.4639482000000044
>>> timeit.timeit('''nltk.ngrams(tokens, 5)''', setup='import nltk; tokens = list(range(1000))')
1.787133600000061

Though this is the expected behavior, perhaps it could be improved further. more-itertools provides a function, sliding_window, which seemingly implements the same functionality as nltk.ngrams:

>>> timeit.timeit('more_itertools.sliding_window(tokens, 1)', setup='import more_itertools; tokens = list(range(1000))')
0.1839231999997537
>>> timeit.timeit('more_itertools.sliding_window(tokens, 2)', setup='import more_itertools; tokens = list(range(1000))')
0.188646499999777
>>> timeit.timeit('more_itertools.sliding_window(tokens, 3)', setup='import more_itertools; tokens = list(range(1000))')
0.17990640000016356
>>> timeit.timeit('more_itertools.sliding_window(tokens, 4)', setup='import more_itertools; tokens = list(range(1000))')
0.18609590000005483
>>> timeit.timeit('more_itertools.sliding_window(tokens, 5)', setup='import more_itertools; tokens = list(range(1000))')
0.19976090000000113

This implementation is faster, and its performance does not seem to degrade as the n-gram size increases. I suppose this would be a better alternative to the current implementation, but I'm not sure about compatibility with the other optional parameters (e.g. pad_left, pad_right) or about license compatibility (Apache vs. MIT).
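A caveat when reading the timings above: both nltk.ngrams and more_itertools.sliding_window return lazy generators, so calling them without iterating measures only iterator construction, not the cost of actually producing the n-grams. A comparison that also drains the iterator would capture the full cost; a sketch of that idea, using a simplified stand-in generator rather than NLTK's actual code:

```python
import timeit

# Stand-in generator and input, defined in the timeit setup string.
# This is a simplified illustration, not nltk.ngrams itself.
setup = """
from collections import deque

def ngrams(sequence, n):  # simplified stand-in for nltk.ngrams
    history = []
    for item in sequence:
        history.append(item)
        if len(history) == n:
            yield tuple(history)
            del history[0]

tokens = list(range(1000))
"""

# Timing only generator construction (what the snippets above measure):
lazy = timeit.timeit("ngrams(tokens, 5)", setup=setup, number=1000)

# Timing construction plus consumption; deque(it, maxlen=0) is a
# standard idiom for exhausting an iterator cheaply:
full = timeit.timeit("deque(ngrams(tokens, 5), maxlen=0)", setup=setup, number=1000)

print(f"construct only: {lazy:.4f}s, construct + consume: {full:.4f}s")
```

Draining the iterator dominates the measurement, so the relative numbers may look quite different under this methodology.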

If the maintainers do not have time, I could work on this.

Note: more-itertools has another function, windowed, which might similarly serve as a replacement for nltk.skipgrams.
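For reference, the core of the sliding-window approach can be sketched in a few lines with a bounded deque (a simplified sketch, not more-itertools' exact code):

```python
from collections import deque
from itertools import islice

def sliding_window(iterable, n):
    """Yield overlapping n-tuples from iterable (simplified sketch)."""
    it = iter(iterable)
    # Pre-fill the window with the first n - 1 items.
    window = deque(islice(it, n - 1), maxlen=n)
    for item in it:
        window.append(item)
        yield tuple(window)
```

Because the deque has maxlen=n, appending the next item evicts the oldest in O(1), which is plausibly why the cost per window stays roughly constant as n grows.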

@tomaarsen (Member)

Although the performance increase is interesting, I'm hesitant to increase the number of dependencies of NLTK. Unless you're suggesting that we rework NLTK's ngrams implementation to mirror that of sliding_window; that would be worth considering.


BLKSerene commented Dec 14, 2022

@tomaarsen Yes, I'm suggesting that this function, only a few lines of code, be incorporated (perhaps after some modifications) into NLTK; the issue that needs to be considered is license compatibility.
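On the pad_left/pad_right compatibility question raised above, padding could plausibly be layered on top of the sliding-window approach before windowing. A hypothetical sketch, with an illustrative name and a single pad_symbol parameter rather than NLTK's actual signature:

```python
from collections import deque
from itertools import chain, islice

def padded_ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """Sliding-window n-grams with optional NLTK-style padding (sketch).

    Hypothetical illustration: NLTK's real ngrams uses separate
    left_pad_symbol/right_pad_symbol parameters.
    """
    # Prepend/append n - 1 pad symbols so edge tokens appear in full windows.
    if pad_left:
        sequence = chain((pad_symbol,) * (n - 1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n - 1))
    it = iter(sequence)
    window = deque(islice(it, n - 1), maxlen=n)
    for item in it:
        window.append(item)
        yield tuple(window)
```

This keeps the O(1)-per-window deque core while reproducing the padded-edge behavior of the optional parameters.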
