Resolve ReDoS opportunity by fixing incorrectly specified regex #2906
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Hello!
Pull request overview
The regex
The original regex was
^-?[0-9]+(.[0-9]+)?$
, which is vulnerable to a ReDoS due to[0-9]+
followed by any character (due to.
), and then again[0-9]+
. However, this original regex does not seem to match what it was intended for: Matching cardinal numbers. The regex seems to want to match:-
.
followed by a non-zero amount of numbers.This would match e.g.
3
,-4
and3.12
. However, the regex actually does not match a dot directly, it matches any number. This means that-3a12
is also allowed, which we would rather not have. I believe the original designer of the regex meant to use\.
, but used.
instead, and did not realise the small mistake.ReDoS reproduction:
outputs
After several minutes I stopped the execution.
The fix
Simply adding a
\
before the.
should fix the problem.The consequences
The regex will no longer be vulnerable to a ReDoS. Beyond that,
-3a12
will no longer be considered a cardinal number in the RegexpTagger examples throughout NLTK. This is desirable behaviour, with one small subjective exception:3e12
will no longer be considered a cardinal number, even though one could argue that it is.If we want to still consider it one, then we can of course go for
^-?[0-9]+([e\.][0-9]+)?$
, but I really don't think it matters.ReDoS reproduction
now outputs
As is correct for a simple regex, the execution time of the regex is linear in the length of the input. This was not the case before.
Thanks to @Scara31 for making us aware of this vulnerability through proper means.