Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolved ReDoS vulnerability in Corpus Reader #2816

Merged
merged 2 commits into from
Sep 24, 2021

Conversation

tomaarsen
Copy link
Member

Hello!

Pull request overview

  • Resolved Regular Expression Denial Of Service vulnerability in the CorpusReader for the Comparative Sentences Dataset.
  • Added performance tests hinting that the ReDoS is removed.

Background

Wikipedia page for ReDoS.

The ReDoS

The regular expression vulnerable to a ReDoS is compiled here:

KEYWORD = re.compile(r"\((?!.*\()(.*)\)$")

And only used once, right here:
keyword = KEYWORD.findall(line)

Regex breakdown

In full:

\((?!.*\()(.*)\)$

It consists of 4 segments, described here:

Regex section Explanation
\(

Match the character ( exactly.

(?!.*\()

Negative lookahead. Makes sure that .*\( can not be matched. In short, this means that the remainder of the match (after the previous segment) can not contain another (.

(.*)
Matches any number of characters, and places them in a group for extraction.
\)$

Match the character ) exactly, and ensure that this is the end of the line.

What does this regex try to do?

It tries to find information within the brackets in each line in e.g.

1_its 2_models 3_fast-forward 3_rewind (more)
 1_4.5 gb 2_0.7 gb (nicer)
1_mirror 2_silver 3_finish (more)
 1_dvd players 3_quality 2_this (better)
 1_panasonics 1_toshibas 1_sonys (cheaper)

i.e. extract more, nicer, more, better, cheaper. This is what segments 1, 3 and 4 do. Segment 2 ensures that the starting ( is the most right-most ( in the entire input. This way, in e.g.

1_its 2_models (3_fast-forward) 3_rewind (more)

Only more is extracted.

The ReDoS comes from the combination of the two .*'s, and likely results from naive backtracking in Python's regular expression engine.

The fix

The new regex looks like so:

KEYWORD = re.compile(r"\(([^\(]*)\)$")

It's fairly similar to the previous regex. Segments 1 and 4 are reused, while segment 2 is removed, and segment 3 is modified.
Segment 3 is now:

([^\(]*)

This regular expression will now, rather than match anything, match anything except (. Because the $ at the end anchors us at the end of the line, this is still guaranteed to only get the last case of e.g. ... (more).

I've quickly tested this new regex with all lines in the Comparative Sentences Dataset, and it finds just as many matches as the old regex. I believe it's identical in output, and only differs in performance.


Tests

In order to help convince that this has actually resolved the issue, I've created a doctest in corpus.doctest. It will:

  • Perform KEYWORD.findall(payload) 9 times with a malicious payload of 4000 characters. Then, take the mean execution time between these 9 calls. This is dubbed the short mean.
  • Perform KEYWORD.findall(payload) 9 times with a malicious payload of 40000 (!) characters. Then, take the mean execution time between these 9 calls. This is dubbed the long mean.
  • If the regular expression execution time is linear to the size of the input (as it should be in such a relatively simple regex), then the long mean should be roughly 10 times as big as the short mean.
    To play it safe, we ensure that the long mean is at most 30 times as big. A value of 30 seems to work fine.
    When the ReDoS was still intact, the long mean would be rougly 80 times as big, which definitely indicates that the regex is not linear in execution time.

Thank you for reporting this vulnerability through our team email.

  • Tom Aarsen

@stevenbird stevenbird merged commit 277711a into nltk:develop Sep 24, 2021
@stevenbird
Copy link
Member

@tomaarsen nice fix... not just safer and more efficient, but more readable

@tomaarsen tomaarsen deleted the vulnerability/comparative_redos branch September 25, 2021 11:08
@iliakur
Copy link
Contributor

iliakur commented Sep 26, 2021

I'm starting to lean towards a similar "empirical threshold" approach for the language vocabulary performance test.

Copy link

@PeterJCLaw PeterJCLaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing!

nltk/test/corpus.doctest Show resolved Hide resolved
icanhasmath pushed a commit to ActiveState/nltk that referenced this pull request Dec 21, 2023
* Resolved ReDoS vulnerability in the Corpus Reader for the Comparative Sentence Dataset

* Solidified performance tests
icanhasmath added a commit to ActiveState/nltk that referenced this pull request Dec 21, 2023
Resolved ReDoS vulnerability in Corpus Reader (nltk#2816)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants