Resolved ReDoS vulnerability in Corpus Reader #2816

tomaarsen · 2021-09-24T16:21:08Z

Hello!

Pull request overview

Resolved Regular Expression Denial Of Service vulnerability in the CorpusReader for the Comparative Sentences Dataset.
Added performance tests indicating that the ReDoS has been removed.

Background

Wikipedia page for ReDoS.

The ReDoS

The regular expression vulnerable to a ReDoS is compiled here:

nltk/nltk/corpus/reader/comparative_sents.py

Line 48 in 23f4b1c

KEYWORD = re.compile(r"\((?!.*\()(.*)\)$")

And only used once, right here:

nltk/nltk/corpus/reader/comparative_sents.py

Line 259 in 23f4b1c

keyword = KEYWORD.findall(line)

Regex breakdown

In full:

\((?!.*\()(.*)\)$

It consists of 4 segments, described here:

Regex section	Explanation
\(	Match the character `(` exactly.
(?!.*\()	Negative lookahead. Makes sure that `.*\(` can *not* be matched. In short, this means that the remainder of the line (after the previous segment) cannot contain another `(`.
(.*)	Matches any number of characters, and places them in a group for extraction.
\)$	Match the character `)` exactly, and ensure that this is the end of the line.
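The four segments can also be written out one per line using Python's re.VERBOSE flag. This is only an annotated restatement of the original pattern for illustration; the name KEYWORD_OLD is made up here and is not part of NLTK.

```python
import re

# The original pattern, one segment per line.
# re.VERBOSE ignores the layout whitespace and allows the inline comments.
KEYWORD_OLD = re.compile(
    r"""
    \(          # segment 1: a literal opening bracket
    (?!.*\()    # segment 2: no further opening bracket may follow on this line
    (.*)        # segment 3: capture any number of characters
    \)$         # segment 4: a literal closing bracket at the end of the line
    """,
    re.VERBOSE,
)

print(KEYWORD_OLD.findall("1_mirror 2_silver 3_finish (more)"))  # ['more']
```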

What does this regex try to do?

It tries to extract the text within the final pair of brackets on each line of e.g.

1_its 2_models 3_fast-forward 3_rewind (more)
 1_4.5 gb 2_0.7 gb (nicer)
1_mirror 2_silver 3_finish (more)
 1_dvd players 3_quality 2_this (better)
 1_panasonics 1_toshibas 1_sonys (cheaper)

i.e. extract more, nicer, more, better, cheaper. This is what segments 1, 3 and 4 do. Segment 2 ensures that the opening ( is the right-most ( in the entire input. This way, in e.g.

1_its 2_models (3_fast-forward) 3_rewind (more)

Only more is extracted.
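As a quick check of the behaviour described above, the original regex can be run over a few of the sample lines. The helper name extract_keywords is hypothetical and only used for this sketch.

```python
import re

# The original (vulnerable) pattern from comparative_sents.py.
KEYWORD_OLD = re.compile(r"\((?!.*\()(.*)\)$")

def extract_keywords(lines):
    """Return the bracketed keyword found at the end of each line, if any."""
    keywords = []
    for line in lines:
        keywords.extend(KEYWORD_OLD.findall(line))
    return keywords

sample_lines = [
    "1_its 2_models 3_fast-forward 3_rewind (more)",
    " 1_4.5 gb 2_0.7 gb (nicer)",
    # Two bracketed parts: only the right-most one should be extracted.
    "1_its 2_models (3_fast-forward) 3_rewind (more)",
]

print(extract_keywords(sample_lines))  # ['more', 'nicer', 'more']
```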

The ReDoS comes from the combination of the two .*'s, and likely results from naive backtracking in Python's regular expression engine.
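One way to observe this is to time the old regex on inputs of growing size. The sketch below assumes a payload made up purely of ( characters; with such input, the lookahead's .* plausibly has to scan the rest of the line again for every (, so the total work should grow much faster than the input. The sizes are kept small on purpose, and the exact timings will vary per machine.

```python
import re
import time

# The original (vulnerable) pattern.
KEYWORD_OLD = re.compile(r"\((?!.*\()(.*)\)$")

def time_findall(pattern, payload):
    """Return (elapsed seconds, matches) for a single findall call."""
    start = time.perf_counter()
    matches = pattern.findall(payload)
    return time.perf_counter() - start, matches

for size in (1000, 2000, 4000):
    payload = "(" * size  # assumed malicious payload: opening brackets only
    elapsed, matches = time_findall(KEYWORD_OLD, payload)
    print(f"{size:>5} characters: {elapsed:.4f}s, matches={matches}")
```

If the growth is roughly quadratic, each doubling of the input should take around four times as long, although noise can blur this at small sizes.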

The fix

The new regex looks like so:

KEYWORD = re.compile(r"\(([^\(]*)\)$")

It's fairly similar to the previous regex. Segments 1 and 4 are reused, while segment 2 is removed, and segment 3 is modified.
Segment 3 is now:

([^\(]*)

This segment now matches anything except (, rather than matching anything at all. Because the $ at the end anchors us at the end of the line, this is still guaranteed to only get the last case of e.g. ... (more).

I've quickly tested this new regex with all lines in the Comparative Sentences Dataset, and it finds just as many matches as the old regex. I believe it's identical in output, and only differs in performance.
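A small-scale version of that comparison could look like the sketch below. It only uses a handful of the sample lines quoted above plus two made-up edge cases rather than the full dataset, so it illustrates the check and does not prove equivalence.

```python
import re

KEYWORD_OLD = re.compile(r"\((?!.*\()(.*)\)$")  # before the fix
KEYWORD_NEW = re.compile(r"\(([^\(]*)\)$")      # after the fix

samples = [
    "1_its 2_models 3_fast-forward 3_rewind (more)",
    " 1_dvd players 3_quality 2_this (better)",
    " 1_panasonics 1_toshibas 1_sonys (cheaper)",
    "1_its 2_models (3_fast-forward) 3_rewind (more)",
    "no brackets on this line",        # made-up edge case
    "(bracket not at the end) tail",   # made-up edge case
]

for line in samples:
    old, new = KEYWORD_OLD.findall(line), KEYWORD_NEW.findall(line)
    print(f"{line!r}: old={old} new={new}")
    assert old == new
```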

Tests

To help show that this has actually resolved the issue, I've created a doctest in corpus.doctest. It will:

Perform KEYWORD.findall(payload) 9 times with a malicious payload of 4000 characters. Then, take the mean execution time across these 9 calls. This is dubbed the short mean.
Perform KEYWORD.findall(payload) 9 times with a malicious payload of 40000 (!) characters. Then, take the mean execution time across these 9 calls. This is dubbed the long mean.
If the regular expression execution time is linear in the size of the input (as it should be for such a relatively simple regex), then the long mean should be roughly 10 times as big as the short mean.
To play it safe, we ensure that the long mean is at most 30 times as big. A value of 30 seems to work fine.
When the ReDoS was still present, the long mean would be roughly 80 times as big, which clearly indicates that the regex is not linear in execution time.
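The procedure above can be sketched roughly as follows. This is not the doctest itself: the helper name mean_findall_time and the payload of repeated ( characters are assumptions made for illustration, and only the fixed regex is timed so that the script finishes quickly.

```python
import re
import time
from statistics import mean

# The fixed pattern.
KEYWORD_NEW = re.compile(r"\(([^\(]*)\)$")

def mean_findall_time(pattern, size, repeats=9):
    """Mean wall-clock time of pattern.findall over `repeats` runs."""
    payload = "(" * size  # assumed malicious payload: opening brackets only
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        pattern.findall(payload)
        times.append(time.perf_counter() - start)
    return mean(times)

short_mean = mean_findall_time(KEYWORD_NEW, 4000)
long_mean = mean_findall_time(KEYWORD_NEW, 40000)

# Linear behaviour should give a ratio near 10; the threshold used is 30.
ratio = long_mean / short_mean
print(f"long mean / short mean = {ratio:.1f} (threshold: 30)")
```

Wall-clock measurements like these are noisy, especially on shared machines, so the printed ratio may occasionally stray from 10 even when the regex behaves linearly.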

Thank you for reporting this vulnerability through our team email.

Tom Aarsen

stevenbird · 2021-09-24T23:16:58Z

@tomaarsen nice fix... not just safer and more efficient, but more readable

iliakur · 2021-09-26T08:25:50Z

I'm starting to lean towards a similar "empirical threshold" approach for the language vocabulary performance test.

PeterJCLaw

Thanks for fixing!

nltk/test/corpus.doctest

* Resolved ReDoS vulnerability in the Corpus Reader for the Comparative Sentence Dataset
* Solidified performance tests

Resolved ReDoS vulnerability in Corpus Reader (nltk#2816)

tomaarsen added 2 commits September 24, 2021 17:11

Resolved ReDoS vulnerability in the Corpus Reader for the Comparative…

bc7ab4a

… Sentence Dataset

Solidified performance tests

bb2ac6f

tomaarsen added corpus bug needs review labels Sep 24, 2021

purificant approved these changes Sep 24, 2021

stevenbird merged commit 277711a into nltk:develop Sep 24, 2021

tomaarsen deleted the vulnerability/comparative_redos branch September 25, 2021 11:08

PeterJCLaw reviewed Sep 30, 2021

nltk/test/corpus.doctest

This was referenced Sep 30, 2021

Change mean -> median in ReDoS docstring comment #2831

Merged

Skip ReDoS test - performance testing isn't viable with cloud computing #2841

Merged

tomaarsen mentioned this pull request Nov 28, 2021

Multiple CVEs CVEProject/cvelist#2998

Merged

icanhasmath pushed a commit to ActiveState/nltk that referenced this pull request Dec 21, 2023

Resolved ReDoS vulnerability in Corpus Reader (nltk#2816)

d97326e

* Resolved ReDoS vulnerability in the Corpus Reader for the Comparative Sentence Dataset
* Solidified performance tests

icanhasmath added a commit to ActiveState/nltk that referenced this pull request Dec 21, 2023

Merge pull request #1 from ActiveState/cve-2021-3828

817c76f

Resolved ReDoS vulnerability in Corpus Reader (nltk#2816)
