
Improve performance of nltk.ngrams #3083

Closed
BLKSerene opened this issue Dec 13, 2022 · 2 comments
BLKSerene commented Dec 13, 2022

With the current implementation of nltk.ngrams, performance degrades as the n-gram size increases:

>>> timeit.timeit('''nltk.ngrams(tokens, 1)''', setup='import nltk; tokens = list(range(1000))')
0.6647043000000394
>>> timeit.timeit('''nltk.ngrams(tokens, 2)''', setup='import nltk; tokens = list(range(1000))')
0.9033590999999888
>>> timeit.timeit('''nltk.ngrams(tokens, 3)''', setup='import nltk; tokens = list(range(1000))')
1.1749377000001004
>>> timeit.timeit('''nltk.ngrams(tokens, 4)''', setup='import nltk; tokens = list(range(1000))')
1.4639482000000044
>>> timeit.timeit('''nltk.ngrams(tokens, 5)''', setup='import nltk; tokens = list(range(1000))')
1.787133600000061

Though this is the expected behavior, perhaps it could be improved further. more-itertools provides a function, sliding_window, which seemingly implements the same functionality as nltk.ngrams:

>>> timeit.timeit('more_itertools.sliding_window(tokens, 1)', setup='import more_itertools; tokens = list(range(1000))')
0.1839231999997537
>>> timeit.timeit('more_itertools.sliding_window(tokens, 2)', setup='import more_itertools; tokens = list(range(1000))')
0.188646499999777
>>> timeit.timeit('more_itertools.sliding_window(tokens, 3)', setup='import more_itertools; tokens = list(range(1000))')
0.17990640000016356
>>> timeit.timeit('more_itertools.sliding_window(tokens, 4)', setup='import more_itertools; tokens = list(range(1000))')
0.18609590000005483
>>> timeit.timeit('more_itertools.sliding_window(tokens, 5)', setup='import more_itertools; tokens = list(range(1000))')
0.19976090000000113

This implementation is faster, and its performance does not seem to degrade as the n-gram size increases. I suppose this would be a better alternative to the current implementation, but I'm not sure about compatibility with the other optional parameters (e.g. pad_left, pad_right) or about license compatibility (Apache vs. MIT).
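A caveat when reading the timings above: both nltk.ngrams and more_itertools.sliding_window return lazy generators, so calling them without iterating measures only iterator construction, not the cost of actually producing the n-grams. A comparison that also drains the iterator would capture the full cost; a sketch of that idea, using a simplified stand-in generator rather than NLTK's actual code:

```python
import timeit

# Stand-in generator and input, defined in the timeit setup string.
# This is a simplified illustration, not nltk.ngrams itself.
setup = """
from collections import deque

def ngrams(sequence, n):  # simplified stand-in for nltk.ngrams
    history = []
    for item in sequence:
        history.append(item)
        if len(history) == n:
            yield tuple(history)
            del history[0]

tokens = list(range(1000))
"""

# Timing only generator construction (what the snippets above measure):
lazy = timeit.timeit("ngrams(tokens, 5)", setup=setup, number=1000)

# Timing construction plus consumption; deque(it, maxlen=0) is a
# standard idiom for exhausting an iterator cheaply:
full = timeit.timeit("deque(ngrams(tokens, 5), maxlen=0)", setup=setup, number=1000)

print(f"construct only: {lazy:.4f}s, construct + consume: {full:.4f}s")
```

Draining the iterator dominates the measurement, so the relative numbers may look quite different under this methodology.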

If the maintainers do not have time, I could work on this.

Note: more-itertools has another function, windowed, which might similarly serve as a replacement for nltk.skipgrams.
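For reference, the core of the sliding-window approach can be sketched in a few lines with a bounded deque (a simplified sketch, not more-itertools' exact code):

```python
from collections import deque
from itertools import islice

def sliding_window(iterable, n):
    """Yield overlapping n-tuples from iterable (simplified sketch)."""
    it = iter(iterable)
    # Pre-fill the window with the first n - 1 items.
    window = deque(islice(it, n - 1), maxlen=n)
    for item in it:
        window.append(item)
        yield tuple(window)
```

Because the deque has maxlen=n, appending the next item evicts the oldest in O(1), which is plausibly why the cost per window stays roughly constant as n grows.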

@tomaarsen (Member)

Although the performance increase is interesting, I'm hesitant to increase the number of dependencies of NLTK. Unless you're suggesting that we rework NLTK's ngrams implementation to mirror that of sliding_window; that would be worth considering.


BLKSerene commented Dec 14, 2022

@tomaarsen Yes, I'm suggesting that this function, only a few lines of code, be incorporated (perhaps after some modifications) into NLTK; the issue that needs to be considered is license compatibility.
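On the pad_left/pad_right compatibility question raised above, padding could plausibly be layered on top of the sliding-window approach before windowing. A hypothetical sketch, with an illustrative name and a single pad_symbol parameter rather than NLTK's actual signature:

```python
from collections import deque
from itertools import chain, islice

def padded_ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """Sliding-window n-grams with optional NLTK-style padding (sketch).

    Hypothetical illustration: NLTK's real ngrams uses separate
    left_pad_symbol/right_pad_symbol parameters.
    """
    # Prepend/append n - 1 pad symbols so edge tokens appear in full windows.
    if pad_left:
        sequence = chain((pad_symbol,) * (n - 1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n - 1))
    it = iter(sequence)
    window = deque(islice(it, n - 1), maxlen=n)
    for item in it:
        window.append(item)
        yield tuple(window)
```

This keeps the O(1)-per-window deque core while reproducing the padded-edge behavior of the optional parameters.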
