Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FileCheck] Don't use regex to find prefixes #72237

Merged
merged 2 commits into from
Nov 15, 2023

Conversation

nikic
Copy link
Contributor

@nikic nikic commented Nov 14, 2023

FileCheck currently compiles a regular expression of the form Prefix1|Prefix2|... and uses it to find the next prefix in the input.

If we had a fast regex implementation, this would be a useful thing to do, as the regex implementation would be able to match multiple prefixes more efficiently than a naive approach. However, with our actual regex implementation, finding the prefixes basically becomes O(InputLen * RegexLen * LargeConstantFactor), which is a lot worse than a simple string search.

Replace the regex with StringRef::find(), and keeping track of the next position of each prefix. There are various ways this could be improved on, but it's already significantly faster that the previous approach.

For me, this improves check-llvm time from 138.5s to 132.5s, so by around 4-5%.

For vector-interleaved-load-i16-stride-7.ll in particular, test time drops from 5s to 2.5s.

FileCheck currently compiles a regular expression of the form
`Prefix1|Prefix2|...` and uses it to find the next prefix in the
input.

If we had a fast regex implementation, this would be a useful
thing to do, as the regex implementation would be able to match
multiple prefix more efficiently than a naive approach. However,
with our actual regex implementation, finding the prefixes
basically becomes O(InputLen * RegexLen * LargeConstantFactor),
which is a lot worse than a simple string search.

Replace the regex with StringRef::find(), and keeping track of the
next position of each prefix. There are various ways this could
be improved on, but it's already significantly faster that the
previous approach.

For me, this improves check-llvm time from 138.5s to 132.5s, so
by around 4-5%.

For vector-interleaved-load-i16-stride-7.ll in particular, test
time drops from 5s to 2.5s.
@llvmbot
Copy link
Collaborator

llvmbot commented Nov 14, 2023

@llvm/pr-subscribers-testing-tools

@llvm/pr-subscribers-llvm-globalisel

Author: Nikita Popov (nikic)

Changes

FileCheck currently compiles a regular expression of the form Prefix1|Prefix2|... and uses it to find the next prefix in the input.

If we had a fast regex implementation, this would be a useful thing to do, as the regex implementation would be able to match multiple prefixes more efficiently than a naive approach. However, with our actual regex implementation, finding the prefixes basically becomes O(InputLen * RegexLen * LargeConstantFactor), which is a lot worse than a simple string search.

Replace the regex with StringRef::find(), and keeping track of the next position of each prefix. There are various ways this could be improved on, but it's already significantly faster that the previous approach.

For me, this improves check-llvm time from 138.5s to 132.5s, so by around 4-5%.

For vector-interleaved-load-i16-stride-7.ll in particular, test time drops from 5s to 2.5s.


Full diff: https://github.com/llvm/llvm-project/pull/72237.diff

5 Files Affected:

  • (modified) llvm/include/llvm/FileCheck/FileCheck.h (+1-11)
  • (modified) llvm/lib/FileCheck/FileCheck.cpp (+62-39)
  • (modified) llvm/unittests/CodeGen/GlobalISel/GISelMITest.h (+1-2)
  • (modified) llvm/unittests/MIR/MachineMetadata.cpp (+1-2)
  • (modified) llvm/utils/FileCheck/FileCheck.cpp (+1-12)
diff --git a/llvm/include/llvm/FileCheck/FileCheck.h b/llvm/include/llvm/FileCheck/FileCheck.h
index d6d8dc531e1001c..321ce1d26e1639b 100644
--- a/llvm/include/llvm/FileCheck/FileCheck.h
+++ b/llvm/include/llvm/FileCheck/FileCheck.h
@@ -187,24 +187,14 @@ class FileCheck {
   explicit FileCheck(FileCheckRequest Req);
   ~FileCheck();
 
-  // Combines the check prefixes into a single regex so that we can efficiently
-  // scan for any of the set.
-  //
-  // The semantics are that the longest-match wins which matches our regex
-  // library.
-  Regex buildCheckPrefixRegex();
-
   /// Reads the check file from \p Buffer and records the expected strings it
   /// contains. Errors are reported against \p SM.
   ///
-  /// Only expected strings whose prefix is one of those listed in \p PrefixRE
-  /// are recorded. \returns true in case of an error, false otherwise.
-  ///
   /// If \p ImpPatBufferIDRange, then the range (inclusive start, exclusive end)
   /// of IDs for source buffers added to \p SM for implicit patterns are
   /// recorded in it.  The range is empty if there are none.
   bool
-  readCheckFile(SourceMgr &SM, StringRef Buffer, Regex &PrefixRE,
+  readCheckFile(SourceMgr &SM, StringRef Buffer,
                 std::pair<unsigned, unsigned> *ImpPatBufferIDRange = nullptr);
 
   bool ValidateCheckPrefixes();
diff --git a/llvm/lib/FileCheck/FileCheck.cpp b/llvm/lib/FileCheck/FileCheck.cpp
index bef6dd69fadfdee..e8a7521836f99bf 100644
--- a/llvm/lib/FileCheck/FileCheck.cpp
+++ b/llvm/lib/FileCheck/FileCheck.cpp
@@ -1634,6 +1634,59 @@ static size_t SkipWord(StringRef Str, size_t Loc) {
   return Loc;
 }
 
+static const char *DefaultCheckPrefixes[] = {"CHECK"};
+static const char *DefaultCommentPrefixes[] = {"COM", "RUN"};
+
+static void addDefaultPrefixes(FileCheckRequest &Req) {
+  if (Req.CheckPrefixes.empty()) {
+    for (const char *Prefix : DefaultCheckPrefixes)
+      Req.CheckPrefixes.push_back(Prefix);
+    Req.IsDefaultCheckPrefix = true;
+  }
+  if (Req.CommentPrefixes.empty())
+    for (const char *Prefix : DefaultCommentPrefixes)
+      Req.CommentPrefixes.push_back(Prefix);
+}
+
+struct PrefixMatcher {
+  /// Prefixes and their first occurrence past the current position.
+  SmallVector<std::pair<StringRef, size_t>> Prefixes;
+  StringRef Input;
+
+  PrefixMatcher(ArrayRef<StringRef> CheckPrefixes,
+                ArrayRef<StringRef> CommentPrefixes, StringRef Input)
+      : Input(Input) {
+    for (StringRef Prefix : CheckPrefixes)
+      Prefixes.push_back({Prefix, Input.find(Prefix)});
+    for (StringRef Prefix : CommentPrefixes)
+      Prefixes.push_back({Prefix, Input.find(Prefix)});
+
+    // Sort by descending length.
+    llvm::sort(Prefixes,
+               [](auto A, auto B) { return A.first.size() > B.first.size(); });
+  }
+
+  /// Find the next match of a prefix in Buffer.
+  std::optional<StringRef> match(StringRef Buffer) {
+    assert(Buffer.data() >= Input.data() &&
+           Buffer.data() + Buffer.size() == Input.data() + Input.size() &&
+           "Buffer must be suffix of Input");
+
+    size_t From = Buffer.data() - Input.data();
+    std::optional<StringRef> Match;
+    for (auto &[Prefix, Pos] : Prefixes) {
+      // If the last occurrence was before From, find the next one after From.
+      if (Pos < From)
+        Pos = Input.find(Prefix, From);
+      // Find the first prefix with the lowest position.
+      if (Pos != StringRef::npos &&
+          (!Match || size_t(Match->data() - Input.data()) > Pos))
+        Match = StringRef(Input.substr(Pos, Prefix.size()));
+    }
+    return Match;
+  }
+};
+
 /// Searches the buffer for the first prefix in the prefix regular expression.
 ///
 /// This searches the buffer using the provided regular expression, however it
@@ -1658,19 +1711,17 @@ static size_t SkipWord(StringRef Str, size_t Loc) {
 /// If no valid prefix is found, the state of Buffer, LineNumber, and CheckTy
 /// is unspecified.
 static std::pair<StringRef, StringRef>
-FindFirstMatchingPrefix(const FileCheckRequest &Req, Regex &PrefixRE,
+FindFirstMatchingPrefix(const FileCheckRequest &Req, PrefixMatcher &Matcher,
                         StringRef &Buffer, unsigned &LineNumber,
                         Check::FileCheckType &CheckTy) {
-  SmallVector<StringRef, 2> Matches;
-
   while (!Buffer.empty()) {
-    // Find the first (longest) match using the RE.
-    if (!PrefixRE.match(Buffer, &Matches))
+    // Find the first (longest) prefix match.
+    std::optional<StringRef> Res = Matcher.match(Buffer);
+    if (!Res)
       // No match at all, bail.
       return {StringRef(), StringRef()};
 
-    StringRef Prefix = Matches[0];
-    Matches.clear();
+    StringRef Prefix = *Res;
 
     assert(Prefix.data() >= Buffer.data() &&
            Prefix.data() < Buffer.data() + Buffer.size() &&
@@ -1720,7 +1771,7 @@ FileCheck::FileCheck(FileCheckRequest Req)
 FileCheck::~FileCheck() = default;
 
 bool FileCheck::readCheckFile(
-    SourceMgr &SM, StringRef Buffer, Regex &PrefixRE,
+    SourceMgr &SM, StringRef Buffer,
     std::pair<unsigned, unsigned> *ImpPatBufferIDRange) {
   if (ImpPatBufferIDRange)
     ImpPatBufferIDRange->first = ImpPatBufferIDRange->second = 0;
@@ -1769,6 +1820,8 @@ bool FileCheck::readCheckFile(
   // found.
   unsigned LineNumber = 1;
 
+  addDefaultPrefixes(Req);
+  PrefixMatcher Matcher(Req.CheckPrefixes, Req.CommentPrefixes, Buffer);
   std::set<StringRef> PrefixesNotFound(Req.CheckPrefixes.begin(),
                                        Req.CheckPrefixes.end());
   const size_t DistinctPrefixes = PrefixesNotFound.size();
@@ -1779,7 +1832,7 @@ bool FileCheck::readCheckFile(
     StringRef UsedPrefix;
     StringRef AfterSuffix;
     std::tie(UsedPrefix, AfterSuffix) =
-        FindFirstMatchingPrefix(Req, PrefixRE, Buffer, LineNumber, CheckTy);
+        FindFirstMatchingPrefix(Req, Matcher, Buffer, LineNumber, CheckTy);
     if (UsedPrefix.empty())
       break;
     if (CheckTy != Check::CheckComment)
@@ -2431,9 +2484,6 @@ static bool ValidatePrefixes(StringRef Kind, StringSet<> &UniquePrefixes,
   return true;
 }
 
-static const char *DefaultCheckPrefixes[] = {"CHECK"};
-static const char *DefaultCommentPrefixes[] = {"COM", "RUN"};
-
 bool FileCheck::ValidateCheckPrefixes() {
   StringSet<> UniquePrefixes;
   // Add default prefixes to catch user-supplied duplicates of them below.
@@ -2454,33 +2504,6 @@ bool FileCheck::ValidateCheckPrefixes() {
   return true;
 }
 
-Regex FileCheck::buildCheckPrefixRegex() {
-  if (Req.CheckPrefixes.empty()) {
-    for (const char *Prefix : DefaultCheckPrefixes)
-      Req.CheckPrefixes.push_back(Prefix);
-    Req.IsDefaultCheckPrefix = true;
-  }
-  if (Req.CommentPrefixes.empty()) {
-    for (const char *Prefix : DefaultCommentPrefixes)
-      Req.CommentPrefixes.push_back(Prefix);
-  }
-
-  // We already validated the contents of CheckPrefixes and CommentPrefixes so
-  // just concatenate them as alternatives.
-  SmallString<32> PrefixRegexStr;
-  for (size_t I = 0, E = Req.CheckPrefixes.size(); I != E; ++I) {
-    if (I != 0)
-      PrefixRegexStr.push_back('|');
-    PrefixRegexStr.append(Req.CheckPrefixes[I]);
-  }
-  for (StringRef Prefix : Req.CommentPrefixes) {
-    PrefixRegexStr.push_back('|');
-    PrefixRegexStr.append(Prefix);
-  }
-
-  return Regex(PrefixRegexStr);
-}
-
 Error FileCheckPatternContext::defineCmdlineVariables(
     ArrayRef<StringRef> CmdlineDefines, SourceMgr &SM) {
   assert(GlobalVariableTable.empty() && GlobalNumericVariableTable.empty() &&
diff --git a/llvm/unittests/CodeGen/GlobalISel/GISelMITest.h b/llvm/unittests/CodeGen/GlobalISel/GISelMITest.h
index 27d599671db6ddf..06786b15252a0ef 100644
--- a/llvm/unittests/CodeGen/GlobalISel/GISelMITest.h
+++ b/llvm/unittests/CodeGen/GlobalISel/GISelMITest.h
@@ -189,8 +189,7 @@ static inline bool CheckMachineFunction(const MachineFunction &MF,
   SourceMgr SM;
   SM.AddNewSourceBuffer(MemoryBuffer::getMemBuffer(CheckFileText, "CheckFile"),
                         SMLoc());
-  Regex PrefixRE = FC.buildCheckPrefixRegex();
-  if (FC.readCheckFile(SM, CheckFileText, PrefixRE))
+  if (FC.readCheckFile(SM, CheckFileText))
     return false;
 
   auto OutBuffer = OutputBuf->getBuffer();
diff --git a/llvm/unittests/MIR/MachineMetadata.cpp b/llvm/unittests/MIR/MachineMetadata.cpp
index f50e9b562942b62..bea11e4c734fc0a 100644
--- a/llvm/unittests/MIR/MachineMetadata.cpp
+++ b/llvm/unittests/MIR/MachineMetadata.cpp
@@ -193,8 +193,7 @@ static bool checkOutput(std::string CheckString, std::string Output) {
   SourceMgr SM;
   SM.AddNewSourceBuffer(MemoryBuffer::getMemBuffer(CheckFileText, "CheckFile"),
                         SMLoc());
-  Regex PrefixRE = FC.buildCheckPrefixRegex();
-  if (FC.readCheckFile(SM, CheckFileText, PrefixRE))
+  if (FC.readCheckFile(SM, CheckFileText))
     return false;
 
   auto OutBuffer = OutputBuffer->getBuffer();
diff --git a/llvm/utils/FileCheck/FileCheck.cpp b/llvm/utils/FileCheck/FileCheck.cpp
index 5f85f4f75d13a97..e74a79e1312b25e 100644
--- a/llvm/utils/FileCheck/FileCheck.cpp
+++ b/llvm/utils/FileCheck/FileCheck.cpp
@@ -810,17 +810,6 @@ int main(int argc, char **argv) {
   if (!FC.ValidateCheckPrefixes())
     return 2;
 
-  Regex PrefixRE = FC.buildCheckPrefixRegex();
-  std::string REError;
-  if (!PrefixRE.isValid(REError)) {
-    errs() << "Unable to combine check-prefix strings into a prefix regular "
-              "expression! This is likely a bug in FileCheck's verification of "
-              "the check-prefix strings. Regular expression parsing failed "
-              "with the following error: "
-           << REError << "\n";
-    return 2;
-  }
-
   SourceMgr SM;
 
   // Read the expected strings from the check file.
@@ -842,7 +831,7 @@ int main(int argc, char **argv) {
                             SMLoc());
 
   std::pair<unsigned, unsigned> ImpPatBufferIDRange;
-  if (FC.readCheckFile(SM, CheckFileText, PrefixRE, &ImpPatBufferIDRange))
+  if (FC.readCheckFile(SM, CheckFileText, &ImpPatBufferIDRange))
     return 2;
 
   // Open the file to check and add it to SourceMgr.

@jayfoad
Copy link
Contributor

jayfoad commented Nov 14, 2023

Nice! test/CodeGen/AMDGPU/memory-legalizer-* may benefit from this.

@nikic
Copy link
Contributor Author

nikic commented Nov 14, 2023

Nice! test/CodeGen/AMDGPU/memory-legalizer-* may benefit from this.

I checked, and it doesn't. These tests spend most time in llc rather than FileCheck. More specifically, they spend most of the time in -verify-machineinstrs. verifyUseLists() and getMinimalPhysRegClassLLT() together take up about 2/3 of the profile.

}

/// Find the next match of a prefix in Buffer.
std::optional<StringRef> match(StringRef Buffer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't see the point of the optional, the empty StringRef will naturally act like none

Copy link
Collaborator

@tstellar tstellar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@nikic nikic merged commit 261b471 into llvm:main Nov 15, 2023
3 checks passed
zahiraam pushed a commit to zahiraam/llvm-project that referenced this pull request Nov 20, 2023
FileCheck currently compiles a regular expression of the form
`Prefix1|Prefix2|...` and uses it to find the next prefix in the input.

If we had a fast regex implementation, this would be a useful thing to
do, as the regex implementation would be able to match multiple prefixes
more efficiently than a naive approach. However, with our actual regex
implementation, finding the prefixes basically becomes O(InputLen *
RegexLen * LargeConstantFactor), which is a lot worse than a simple
string search.

Replace the regex with StringRef::find(), and keeping track of the next
position of each prefix. There are various ways this could be improved
on, but it's already significantly faster that the previous approach.

For me, this improves check-llvm time from 138.5s to 132.5s, so by
around 4-5%.

For vector-interleaved-load-i16-stride-7.ll in particular, test time
drops from 5s to 2.5s.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants