[FIX] Fix undefined-behaviour in regex engine. #73071

Merged on Dec 1, 2023 (1 commit)

tanmaysachan
Contributor

@tanmaysachan tanmaysachan commented Nov 22, 2023

Running the mlir-text-parser-fuzzer on a random corpus discovers a path that applies a zero offset to a null pointer (UB) in the regex engine.

This patch adds a check.

Input:
Binary input, generated by fuzzer.

Output:

/Users/tsachan/Documents/llvm-project/llvm/lib/Support/regengine.inc:152:18: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /Users/tsachan/Documents/llvm-project/llvm/lib/Support/regengine.inc:152:18 in
==35858== ERROR: libFuzzer: deadly signal
    #0 0x10424236c in __sanitizer_print_stack_trace+0x28 (libclang_rt.asan_osx_dynamic.dylib:arm64+0x5e36c)
    #1 0x1030c1ce0 in fuzzer::PrintStackTrace()+0x2c (mlir-text-parser-fuzzer:arm64+0x100b4dce0)
    #2 0x1030b43b4 in fuzzer::Fuzzer::CrashCallback()+0x54 (mlir-text-parser-fuzzer:arm64+0x100b403b4)
    #3 0x1940aaa20 in _sigtramp+0x34 (libsystem_platform.dylib:arm64+0x3a20)
    #4 0xeb4900019407bc24  (<unknown module>)
    #5 0x6b3b000193f89ae4  (<unknown module>)
    #6 0x824c80010425c834  (<unknown module>)
    #7 0x10425bfa0 in __sanitizer::Die()+0xcc (libclang_rt.asan_osx_dynamic.dylib:arm64+0x77fa0)
    #8 0x104271334 in __ubsan_handle_pointer_overflow_abort+0x24 (libclang_rt.asan_osx_dynamic.dylib:arm64+0x8d334)
    #9 0x102be79c8 in llvm_regexec+0x4df8 (mlir-text-parser-fuzzer:arm64+0x1006739c8)
    #10 0x102bf98a8 in llvm::Regex::match(llvm::StringRef, llvm::SmallVectorImpl<llvm::StringRef>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*) const+0x608 (mlir-text-parser-fuzzer:arm64+0x1006858a8)
    #11 0x102c31880 in mlir::Dialect::isValidNamespace(llvm::StringRef)+0xf0 (mlir-text-parser-fuzzer:arm64+0x1006bd880)
    #12 0x102b7d748 in mlir::OpaqueType::verify(llvm::function_ref<mlir::InFlightDiagnostic ()>, mlir::StringAttr, llvm::StringRef)+0x13c (mlir-text-parser-fuzzer:arm64+0x100609748)
    #13 0x102b76280 in mlir::OpaqueType::getChecked(llvm::function_ref<mlir::InFlightDiagnostic ()>, mlir::StringAttr, llvm::StringRef)+0x158 (mlir-text-parser-fuzzer:arm64+0x100602280)
    #14 0x102f91368 in mlir::detail::Parser::parseExtendedType()+0xe00 (mlir-text-parser-fuzzer:arm64+0x100a1d368)
    #15 0x10300ea00 in mlir::detail::Parser::parseNonFunctionType()+0x420 (mlir-text-parser-fuzzer:arm64+0x100a9aa00)
    #16 0x102fb6944 in mlir::parseAsmSourceFile(llvm::SourceMgr const&, mlir::Block*, mlir::ParserConfig const&, mlir::AsmParserState*, mlir::AsmParserCodeCompleteContext*)+0x1668 (mlir-text-parser-fuzzer:arm64+0x100a42944)
    #17 0x102ebe8e8 in mlir::parseSourceFile(llvm::SourceMgr const&, mlir::Block*, mlir::ParserConfig const&, mlir::LocationAttr*)+0x2b0 (mlir-text-parser-fuzzer:arm64+0x10094a8e8)
    #18 0x102ec0144 in mlir::parseSourceString(llvm::StringRef, mlir::Block*, mlir::ParserConfig const&, llvm::StringRef, mlir::LocationAttr*)+0x258 (mlir-text-parser-fuzzer:arm64+0x10094c144)
    #19 0x1025773e4 in mlir::OwningOpRef<mlir::ModuleOp> mlir::parseSourceString<mlir::ModuleOp>(llvm::StringRef, mlir::ParserConfig const&, llvm::StringRef)+0x160 (mlir-text-parser-fuzzer:arm64+0x1000033e4)
    #20 0x102576ed4 in LLVMFuzzerTestOneInput+0x578 (mlir-text-parser-fuzzer:arm64+0x100002ed4)
    #21 0x1030b57ec in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long)+0x138 (mlir-text-parser-fuzzer:arm64+0x100b417ec)
    #22 0x1030a6558 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long)+0xd0 (mlir-text-parser-fuzzer:arm64+0x100b32558)
    #23 0x1030ab974 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long))+0x1a98 (mlir-text-parser-fuzzer:arm64+0x100b37974)
    #24 0x1030c25d4 in main+0x24 (mlir-text-parser-fuzzer:arm64+0x100b4e5d4)
    #25 0x193d23f24  (<unknown module>)
    #26 0x121afffffffffffc  (<unknown module>)

NOTE: libFuzzer has rudimentary signal handlers.
      Combine libFuzzer with AddressSanitizer or similar for better crash reports.
SUMMARY: libFuzzer: deadly signal

@llvmbot
Collaborator

llvmbot commented Nov 22, 2023

@llvm/pr-subscribers-llvm-support

Author: Tanmay (tanmaysachan)

Changes

Running the mlir-text-parser-fuzzer discovers a path that causes application of offset to a null pointer (UB) in the regex engine.

This patch adds a check.


Full diff: https://github.com/llvm/llvm-project/pull/73071.diff

1 Files Affected:

  • (modified) llvm/lib/Support/regengine.inc (+3-1)
diff --git a/llvm/lib/Support/regengine.inc b/llvm/lib/Support/regengine.inc
index f23993abc6e7e71..54dd96ab9cfada5 100644
--- a/llvm/lib/Support/regengine.inc
+++ b/llvm/lib/Support/regengine.inc
@@ -146,7 +146,9 @@ matcher(struct re_guts *g, const char *string, size_t nmatch,
 	const char *stop;
 
 	/* simplify the situation where possible */
-	if (g->cflags&REG_NOSUB)
+        if (!string)
+		return(REG_INVARG);
+        if (g->cflags&REG_NOSUB)
 		nmatch = 0;
 	if (eflags&REG_STARTEND) {
 		start = string + pmatch[0].rm_so;

@tanmaysachan tanmaysachan changed the title Fix undefined-behaviour in regex engine. [LLVM][Support]Fix undefined-behaviour in regex engine. Nov 22, 2023
@tanmaysachan tanmaysachan changed the title [LLVM][Support]Fix undefined-behaviour in regex engine. [LLVM][Support] Fix undefined-behaviour in regex engine. Nov 22, 2023
@tanmaysachan tanmaysachan changed the title [LLVM][Support] Fix undefined-behaviour in regex engine. [FIX] Fix undefined-behaviour in regex engine. Nov 22, 2023

github-actions bot commented Nov 29, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@tanmaysachan
Contributor Author

tanmaysachan commented Nov 29, 2023

@nikic ping.
Moved the fix to Regex.cpp; it now returns false, since no match is possible on a null string.

Edit: the build seems to be failing for some reason; trying to fix.
Edit: Changed the empty() check to a nullptr check, which works.
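
For illustration, here is a minimal sketch of why a nullptr check behaves differently from an empty() check. It is not the code from Regex.cpp: `matchEmptyPattern` and `engineMatchesEmptyPattern` are hypothetical names, `std::string_view` stands in for `llvm::StringRef`, and the "engine" only models the pattern `^$`.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for the C engine: the pattern ^$ matches exactly
// when the range [start, stop) is empty.
static bool engineMatchesEmptyPattern(const char *start, const char *stop) {
  assert(start != nullptr && stop != nullptr);
  return start == stop;
}

// Hypothetical wrapper, loosely modelled on a Regex::match-style entry point.
bool matchEmptyPattern(std::string_view input) {
  // Guard on the data pointer, not on emptiness: a null pointer must never
  // reach the pointer arithmetic below.
  if (input.data() == nullptr)
    return false;

  // Safe now: input.data() is non-null, so adding size() (even 0) is fine.
  const char *start = input.data();
  const char *stop = start + input.size();
  return engineMatchesEmptyPattern(start, stop);
}
```

With this guard, a default-constructed view (null data) is rejected, while a non-null empty string such as `""` still reaches the engine. An `empty()` check would have rejected both.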

@dwblaikie
Collaborator

Any chance of a test case? (Not sure, but I'd expect we have unit tests for the regex api that could be extended to cover this)

@nikic
Contributor

nikic commented Nov 30, 2023

> Any chance of a test case? (Not sure, but I'd expect we have unit tests for the regex api that could be extended to cover this)

We do have some tests here: https://github.com/llvm/llvm-project/blob/main/llvm/unittests/Support/RegexTest.cpp

@tanmaysachan
Contributor Author

tanmaysachan commented Nov 30, 2023

@dwblaikie Thanks! Added the unittest.

Collaborator

@dwblaikie dwblaikie left a comment

Awesome, thanks!

- Running the regex engine on an empty string with a null data pointer causes "applying zero offset to null pointer" UB.
- Bug discovered through "mlir-text-parser-fuzzer" module.
- This patch puts a check in the matcher and adds a corresponding test.
@tanmaysachan
Contributor Author

Squashed the commits.

@dwblaikie
Collaborator

Hmm, I was going to merge this but I started thinking about how to tidy up the commit message and then went down a rabbit hole of: Wait, why do we need this?

Could you help me understand a bit more what code is invalid here, down in the regex implementation? Because we do pass down the length (which should be zero) - so the underlying code should never dereference the pointer, right? So how do we end up with UB?

@tanmaysachan
Contributor Author

tanmaysachan commented Dec 1, 2023

@dwblaikie In regengine.inc (line 152), we use the start and end (both 0 here) and add them to the string pointer (null in this case) to find the bounds. That pointer arithmetic is what UBSan flags.
The resulting pointers are dereferenced later on.
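
As a rough sketch of the bounds computation described above, the following rejects a null base pointer before any offset is applied. The names `computeBounds`, `Bounds`, and `MatchStatus` are hypothetical and are not part of regengine.inc; the `so > eo` check is an extra assumption added for the example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical status codes, standing in for values such as REG_INVARG.
enum MatchStatus { kOk = 0, kInvalidArg = 1 };

// Hypothetical pair of pointers delimiting the region to search.
struct Bounds {
  const char *start;
  const char *stop;
};

// Computes [string + so, string + eo) only when the base pointer is non-null.
MatchStatus computeBounds(const char *string, std::size_t so, std::size_t eo,
                          Bounds *out) {
  assert(out != nullptr);
  // Adding an offset, even zero, to a null pointer is what UBSan reports,
  // so bail out before any pointer arithmetic happens.
  if (string == nullptr)
    return kInvalidArg;
  // Assumption for this sketch: an inverted range is also invalid.
  if (so > eo)
    return kInvalidArg;
  out->start = string + so;
  out->stop = string + eo;
  return kOk;
}
```

A caller would check the returned status before touching `out->start` or `out->stop`, since they are left unset on failure.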

@dwblaikie
Collaborator

> @dwblaikie In regengine.inc (line 152), we use the start and end (both 0 here)

Oh, is it that null+0 is UB? I guess that's fair... Hmm :/

@tanmaysachan
Contributor Author

tanmaysachan commented Dec 1, 2023

Null+0 is UB, but even if we ignore that, the engine still crashes if string = null with length = 0 along with null bounds. I believe that's by design since the regex ^$ should match with length = 0 string, so the function should still execute (just not with a null string).

@dwblaikie
Collaborator

> the engine still crashes if string = null with length = 0 along with null bounds. I believe that's by design since the regex ^$ should match with length = 0 string, so the function should still execute (just not with a null string).

I don't understand that bit, and it's what got me asking more questions: if you pass in a pointer+length where length is zero, the implementation cannot/should not dereference that pointer. So it shouldn't be a problem if that pointer is null, because it should never be dereferenced. (This comes up with memcpy, which technically requires a non-null pointer: even when the length is zero it's still technically UB and is a problem ( https://www.imperialviolet.org/2016/06/26/nonnull.html ). But that doesn't usually come up in user code; they'd have to have specifically annotated a pointer parameter as nonnull for the compiler to make any assumptions there, etc.)

But, yeah, the null+0 is enough to explain why we need a fix here... and I guess this is as good as any.
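
To make the memcpy aside concrete, here is a small sketch of the usual defensive pattern: skip the call entirely when the length is zero, so null pointers never reach `memcpy`. `copyBytes` is a hypothetical helper, not an LLVM or libc function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies n bytes, tolerating null pointers when n == 0.
void *copyBytes(void *dst, const void *src, std::size_t n) {
  // With nothing to copy, return early so memcpy never sees the pointers.
  if (n == 0)
    return dst;
  // For a non-empty copy both pointers must be valid.
  assert(dst != nullptr && src != nullptr);
  return std::memcpy(dst, src, n);
}
```

The same idea applies to the regex fix: check the pointer or the length up front rather than relying on the callee never touching a null pointer.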

@dwblaikie dwblaikie merged commit deca805 into llvm:main Dec 1, 2023
3 checks passed