Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libc++][regex] Correctly adjust match prefix for zero-length matches. #94550

Merged
merged 2 commits into from
Jun 7, 2024

Conversation

var-const
Copy link
Member

@var-const var-const commented Jun 6, 2024

For regex patterns that produce zero-length matches, there is one (imaginary) match in-between every character in the sequence being searched (as well as before the first character and after the last character). It's easiest to demonstrate using replacement: std::regex_replace("abc"s, "!", "") should produce !a!b!c!, where each exclamation mark makes a zero-length match visible.

Currently our implementation doesn't correctly set the prefix of each zero-length match, "swallowing" the characters separating the imaginary matches -- e.g. when going through zero-length matches within abc, the corresponding prefixes should be {'', 'a', 'b', 'c'}, but before this patch they will all be empty ({'', '', '', ''}). This happens in the implementation of regex_iterator::operator++. Note that the Standard spells out quite explicitly that the prefix might need to be adjusted when dealing with zero-length matches in re.regiter.incr:

In all cases in which the call to regex_search returns true, match.prefix().first shall be equal to the previous value of match[0].second... It is unspecified how the implementation makes these adjustments.

Reproduction example

#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string str = "abc";
  std::regex empty_matching_pattern("");

  { // The underlying problem is that `regex_iterator::operator++` doesn't update
    // the prefix correctly.
    std::sregex_iterator i(str.begin(), str.end(), empty_matching_pattern), e;
    std::cout << "\"";
    for (; i != e; ++i) {
      const std::ssub_match& prefix = i->prefix();
      std::cout << prefix.str();
    }
    std::cout << "\"\n";
    // Before the patch: ""
    // After the patch: "abc"
  }

  { // `regex_replace` makes the problem very visible.
    std::string replaced = std::regex_replace(str, empty_matching_pattern, "!");
    std::cout << "\"" << replaced << "\"\n";
    // Before the patch: "!!!!"
    // After the patch: "!a!b!c!"
  }
}

Fixes #64451

rdar://119912002

@var-const var-const requested a review from a team as a code owner June 6, 2024 00:39
@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Jun 6, 2024
@var-const var-const added the regex Issues related to regex label Jun 6, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented Jun 6, 2024

@llvm/pr-subscribers-libcxx

Author: Konstantin Varlamov (var-const)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/94550.diff

2 Files Affected:

  • (modified) libcxx/include/regex (+20-2)
  • (modified) libcxx/test/std/re/re.iter/re.regiter/re.regiter.incr/post.pass.cpp (+48)
diff --git a/libcxx/include/regex b/libcxx/include/regex
index b3869d36de1df..8b2637ead328a 100644
--- a/libcxx/include/regex
+++ b/libcxx/include/regex
@@ -792,6 +792,7 @@ typedef regex_token_iterator<wstring::const_iterator> wsregex_token_iterator;
 #include <__algorithm/find.h>
 #include <__algorithm/search.h>
 #include <__assert>
+#include <__availability>
 #include <__config>
 #include <__iterator/back_insert_iterator.h>
 #include <__iterator/default_sentinel.h>
@@ -4700,6 +4701,9 @@ private:
 
   template <class, class>
   friend class __lookahead;
+
+  template <class, class, class>
+  friend class regex_iterator;
 };
 
 template <class _BidirectionalIterator, class _Allocator>
@@ -5410,7 +5414,9 @@ template <class _BidirectionalIterator, class _CharT, class _Traits>
 regex_iterator<_BidirectionalIterator, _CharT, _Traits>&
 regex_iterator<_BidirectionalIterator, _CharT, _Traits>::operator++() {
   __flags_ |= regex_constants::__no_update_pos;
-  _BidirectionalIterator __start = __match_[0].second;
+  _BidirectionalIterator __start        = __match_[0].second;
+  _BidirectionalIterator __prefix_start = __start;
+
   if (__match_[0].first == __match_[0].second) {
     if (__start == __end_) {
       __match_ = value_type();
@@ -5424,9 +5430,21 @@ regex_iterator<_BidirectionalIterator, _CharT, _Traits>::operator++() {
     else
       ++__start;
   }
+
   __flags_ |= regex_constants::match_prev_avail;
-  if (!std::regex_search(__start, __end_, __match_, *__pregex_, __flags_))
+  if (!std::regex_search(__start, __end_, __match_, *__pregex_, __flags_)) {
     __match_ = value_type();
+
+  } else {
+    // The Standard mandates that if `regex_search` returns true ([re.regiter.incr]), "`match.prefix().first` shall be
+    // equal to the previous value of `match[0].second`... It is unspecified how the implementation makes these
+    // adjustments." The adjustment is necessary if we incremented `__start` above (the branch that deals with
+    // zero-length matches).
+    auto& __prefix = __match_.__prefix_;
+    __prefix.first = __prefix_start;
+    __prefix.matched = __prefix.first != __prefix.second;
+  }
+
   return *this;
 }
 
diff --git a/libcxx/test/std/re/re.iter/re.regiter/re.regiter.incr/post.pass.cpp b/libcxx/test/std/re/re.iter/re.regiter/re.regiter.incr/post.pass.cpp
index 9332158f0de95..596a71c70a484 100644
--- a/libcxx/test/std/re/re.iter/re.regiter/re.regiter.incr/post.pass.cpp
+++ b/libcxx/test/std/re/re.iter/re.regiter/re.regiter.incr/post.pass.cpp
@@ -114,5 +114,53 @@ int main(int, char**)
         assert(i == e);
     }
 
+  {
+    // Check that we correctly adjust the match prefix when dealing with zero-length matches -- this is explicitly
+    // required by the Standard ([re.regiter.incr]: "In all cases in which the call to `regex_search` returns true,
+    // `match.prefix().first` shall be equal to the previous value of `match[0].second`"). For a pattern that matches
+    // empty sequences, there is an implicit zero-length match between every character in a string -- make sure the
+    // prefix of each of these matches (except the first one) is the preceding character.
+
+    auto validate = [](const std::regex& empty_matching_pattern) {
+      const char source[] = "abc";
+
+      std::cregex_iterator i(source, source + 3, empty_matching_pattern);
+      assert(!i->prefix().matched);
+      assert(i->prefix().length() == 0);
+      assert(i->prefix().first == source);
+      assert(i->prefix().second == source);
+
+      ++i;
+      assert(i->prefix().matched);
+      assert(i->prefix().length() == 1);
+      assert(i->prefix().first == source);
+      assert(i->prefix().second == source + 1);
+      assert(i->prefix().str() == "a");
+
+      ++i;
+      assert(i->prefix().matched);
+      assert(i->prefix().length() == 1);
+      assert(i->prefix().first == source + 1);
+      assert(i->prefix().second == source + 2);
+      assert(i->prefix().str() == "b");
+
+      ++i;
+      assert(i->prefix().matched);
+      assert(i->prefix().length() == 1);
+      assert(i->prefix().first == source + 2);
+      assert(i->prefix().second == source + 3);
+      assert(i->prefix().str() == "c");
+
+      ++i;
+      assert(i == std::cregex_iterator());
+    };
+
+    // An empty pattern produces zero-length matches.
+    validate(std::regex(""));
+    // Any character repeated zero or more times can produce zero-length matches.
+    validate(std::regex("X*"));
+    validate(std::regex("X{0,3}"));
+  }
+
   return 0;
 }

assert(i->prefix().second == source);

++i;
assert(i->prefix().matched);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this repetition warrants another helper function (calls would look something like

check_prefix(i->prefix(), /*matched=*/true, /*length=*/1, /*first=*/source, /*second=*/source + 1);

).

@@ -114,5 +114,53 @@ int main(int, char**)
assert(i == e);
}

{
// Check that we correctly adjust the match prefix when dealing with zero-length matches -- this is explicitly
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please let me know if you think I'm writing too much detail for this single specific issue (here and/or in the test file).

Copy link

github-actions bot commented Jun 6, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@var-const
Copy link
Member Author

Fixes #64451

// equal to the previous value of `match[0].second`... It is unspecified how the implementation makes these
// adjustments." The adjustment is necessary if we incremented `__start` above (the branch that deals with
// zero-length matches).
auto& __prefix = __match_.__prefix_;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another call to regex_search above, but I don't think prefix adjustment should ever be necessary there since it happens before __start is incremented.

// empty sequences, there is an implicit zero-length match between every character in a string -- make sure the
// prefix of each of these matches (except the first one) is the preceding character.

auto validate = [](const std::regex& empty_matching_pattern) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I actually didn't realize we support <regex> in the C++03 mode. Would you prefer to change this to a helper function, or to make the test C++11 and above? (perhaps split into a separate file)

Copy link
Member

@ldionne ldionne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the comment, and also please make sure to provide a very short explanation of the bug we're fixing with this, since the only explanation is on an internal bug tracker. Basically, please make the commit message sufficient in itself to quickly understand the bug we're fixing.

@ldionne ldionne merged commit e9adcc4 into llvm:main Jun 7, 2024
57 checks passed
@HerrCai0907 HerrCai0907 mentioned this pull request Jun 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. regex Issues related to regex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

libc++ regex drops the character between a zero-length match and its subsequent match
3 participants