
Resolved serious ReDoS in PunktSentenceTokenizer #2869

Merged
merged 3 commits into nltk:develop on Nov 26, 2021

Conversation


@tomaarsen tomaarsen commented Oct 27, 2021

Resolves #2866

Hello!

Pull request overview

  • Resolved serious Regular Expression Denial of Service (ReDoS) issue in PunktSentenceTokenizer.
    • This has severe consequences for the commonly used sent_tokenize and word_tokenize.

The consequences of the ReDoS

The ReDoS is described in #2866, where an example of a malicious string is given. The consequences can be seen with this script:

import time

from nltk.tokenize import word_tokenize

n = 8
for length in [10**i for i in range(2, n)]:
    text = "a" * length
    start_t = time.time()
    word_tokenize(text)
    print(f"A length of {length:<{n}} takes {time.time() - start_t:.4f}s")

Which outputs:

A length of 100      takes 0.0060s
A length of 1000     takes 0.0060s
A length of 10000    takes 0.6320s
A length of 100000   takes 56.3322s
...

At that point I canceled the program: the next input length had been running for several hours without completing.

Potential malicious inputs

Any long sequence of non-whitespace characters other than ., ? or ! is malicious input for this regex. A quick example is simply:

"a" * 100000

It takes word_tokenize 56.3322 seconds to tokenize this input.

The details of the ReDoS

The ReDoS is caused by this regex:

nltk/nltk/tokenize/punkt.py

Lines 268 to 278 in ad3c84c

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""
"""Format of a regular expression to find contexts including possible
sentence boundaries. Matches token which the possible sentence boundary
ends, and matches the following token within a lookahead expression."""

After being formatted with the default values, this becomes:

_period_context_fmt = r"""
\S*                              # some word material
[\.\?!]                          # a potential sentence ending
(?=(?P<after_tok>
    (?:[)\";}\]\*:@\'\({\[!\?]) # either other punctuation
    |
    \s+(?P<next_tok>\S+)        # or whitespace and some other token
))"""
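For reference, the template is turned into this concrete pattern by plain %-style string formatting. The sketch below is hypothetical standalone code (the attribute names and compile flags inside NLTK may differ); the dictionary values are copied from the formatted pattern:

```python
import re

_period_context_fmt = r"""
\S*                             # some word material
%(SentEndChars)s                # a potential sentence ending
(?=(?P<after_tok>
    %(NonWord)s                 # either other punctuation
    |
    \s+(?P<next_tok>\S+)        # or whitespace and some other token
))"""

pattern = _period_context_fmt % {
    "SentEndChars": r"[\.\?!]",
    # closing ')' included so the non-capturing group is balanced
    "NonWord": r"(?:[)\";}\]\*:@\'\({\[!\?])",
}
period_context_re = re.compile(pattern, re.UNICODE | re.VERBOSE)

m = period_context_re.search("Hello world. Next sentence.")
```

Here `m` matches "world." and captures "Next" as `next_tok` via the lookahead.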

The vulnerability is already at the very start of the regex. The following regex is also vulnerable:

_period_context_fmt = r"""
\S*                             # some word material
[\.\?!]                         # a potential sentence ending
"""

The issue is that the Python regex engine starts matching tokens with \S* (i.e. any run of non-whitespace), and only recognizes that it failed to find a ., ? or ! when it reaches a whitespace character or the end of the string. So, given an input of a thousand "a" characters, the regex engine starts matching at the very first "a", and only after consuming all 1000 characters does it realize that no match is possible this way. It then backtracks all the way to the first "a", skips it, and tries again from the second "a", now matching 999 characters before failing. This continues until the engine finally concludes that the regex does not match the string at all.
Given an input string of length n, this takes (n^2 + n) / 2 steps. In other words, word_tokenize becomes at least O(n^2) in time complexity, while O(n) should be attainable for most regex problems.
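The quadratic blow-up can be reproduced with just the vulnerable prefix of the pattern. The snippet below is a standalone sketch (not NLTK code) that times the failing search on inputs with no sentence-ending character; the runtime roughly quadruples each time the input length doubles:

```python
import re
import time

# Just the vulnerable prefix of the pattern: a greedy \S* followed by a
# required sentence-ending character.
vulnerable = re.compile(r"\S*[.?!]")

for length in (2_000, 4_000, 8_000):
    text = "a" * length  # no sentence-ending character anywhere
    start = time.time()
    # The search fails, but only after roughly (n^2 + n) / 2 steps.
    assert vulnerable.search(text) is None
    print(f"length {length:>5}: {time.time() - start:.3f}s")
```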

The fix

The fix seems elementary:

  • Remove \S* from the regex, then the regex engine will only start trying to match when it actually finds a potential end of sentence token (i.e. ., ! or ? by default).
  • Then, use a simple regex like \S*$ to match the final word before the end-of-sentence token.

And this works very well, with one small but crucial detail: re.finditer will find all non-overlapping matches, and we've now reduced the size of each match. See the consequences with this example:

On the develop branch

>>> pst = PunktSentenceTokenizer()
>>> text = "Very bad acting!!! I promise."
>>> list(pst._lang_vars.period_context_re().finditer(text)) # doctest: +SKIP
[<re.Match object; span=(9, 18), match='acting!!!'>]

After removing \S* from the regex

>>> pst = PunktSentenceTokenizer()
>>> text = "Very bad acting!!! I promise."
>>> list(pst._lang_vars.period_context_re().finditer(text)) # doctest: +NORMALIZE_WHITESPACE
[<re.Match object; span=(15, 16), match='!'>,
<re.Match object; span=(16, 17), match='!'>,
<re.Match object; span=(17, 18), match='!'>]

In short, we need to manually merge these matches, which all belong to the same word, back into a single context per word. This is what the new _match_potential_end_contexts method is for. I could not come up with a regex-only solution for this problem.
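The idea can be sketched as follows. This is hypothetical simplified code, not the exact method merged in this PR: match only the cheap end-of-sentence pattern, use \S*$ on the preceding text to recover the word, and keep one (the longest) context per word:

```python
import re

# Cheap pattern without the leading \S*: the engine only starts work at an
# actual sentence-ending character, so there is no catastrophic backtracking.
_end_context_re = re.compile(
    r"[.?!](?=(?P<after_tok>[)\";}\]*:@'({\[!?]|\s+(?P<next_tok>\S+)))"
)

def match_potential_end_contexts(text):
    contexts = []
    for match in _end_context_re.finditer(text):
        # \S*$ always matches: it finds the (possibly empty) run of
        # non-whitespace characters ending right before the match.
        word_start = re.search(r"\S*$", text[: match.start()]).start()
        context = (word_start, text[word_start : match.end()])
        if contexts and contexts[-1][0] == word_start:
            # Same word as the previous match: keep the longer context,
            # mirroring the greedy behaviour of the original regex.
            contexts[-1] = context
        else:
            contexts.append(context)
    return contexts

print(match_potential_end_contexts("Very bad acting!!! I promise."))
# → [(9, 'acting!!!')]
```

Note how the three per-character matches for "acting!!!" collapse into the single context the original regex produced, while "promise." is still excluded because its lookahead fails at the end of the string.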

Results

Upon running the script again, but now with the fix in place, we get:

A length of 100      takes 0.0070s
A length of 1000     takes 0.0010s
A length of 10000    takes 0.0060s
A length of 100000   takes 0.0400s
A length of 1000000  takes 0.3520s
A length of 10000000 takes 3.4641s

As you can see, the performance is significantly better, and much closer to the expected O(n).

Beyond that, the performance on normal sentences (i.e. with dots) is in line with the original performance.

If all is well, then only the performance is (positively) impacted; the tokenization output itself is unchanged.

Note

I believe that this PR does not require re-training of the Punkt models.

  • Tom Aarsen

@stevenbird stevenbird self-assigned this Nov 4, 2021
@stevenbird stevenbird merged commit 1405aad into nltk:develop Nov 26, 2021
16 checks passed

@stevenbird stevenbird commented Nov 26, 2021

Thanks @tomaarsen
