@zeyi2
Member

@zeyi2 zeyi2 commented Nov 23, 2025

FormatStringConverter::appendFormatText mishandled non-ASCII bytes (e.g. UTF-8 encoded text) on platforms where char is signed: such bytes have negative values, so they passed the Ch < 32 check intended for control characters.

The negative values were then passed to llvm::hexdigit, resulting in an out-of-bounds access and a crash.

This closes #169198
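
To illustrate the failure mode described above, here is a minimal standalone sketch. It does not use LLVM: hexDigitSketch, looksLikeControlSigned, isControlByte and escapeControlBytes are hypothetical names made up for this example, and the table lookup is only an assumption about how a hex-digit helper might index its data. The signed variant shows why a UTF-8 lead byte such as 0xE4 is misclassified; the unsigned variant mirrors the cast introduced by this PR.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a table-based hex-digit helper (not LLVM's code).
// A negative or otherwise out-of-range index would read outside the table.
static char hexDigitSketch(unsigned X) {
  static const char Table[] = "0123456789abcdef";
  assert(X < 16 && "index must be a single nibble");
  return Table[X];
}

// The problematic comparison: a signed byte >= 0x80 is negative, so it
// compares less than 32 and is mistaken for a control character.
static bool looksLikeControlSigned(signed char Ch) { return Ch < 32; }

// The corrected comparison: look at the byte as an unsigned value.
static bool isControlByte(char Ch) {
  return static_cast<unsigned char>(Ch) < 32;
}

// Escape control bytes as \xNN and copy every other byte unchanged.
static std::string escapeControlBytes(const std::string &Text) {
  std::string Out;
  for (const char Ch : Text) {
    const unsigned char UCh = static_cast<unsigned char>(Ch);
    if (UCh < 32) {
      Out += "\\x";
      Out += hexDigitSketch(UCh >> 4);  // high nibble, always 0 or 1 here
      Out += hexDigitSketch(UCh & 0xf); // low nibble
    } else {
      Out += Ch; // includes UTF-8 lead and continuation bytes
    }
  }
  return Out;
}
```

With these helpers, looksLikeControlSigned(static_cast<signed char>(0xE4)) is true on two's-complement targets, while isControlByte applied to the same byte is false, and escapeControlBytes leaves the UTF-8 sequence "\xE4\xBD\xA0" untouched.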

@llvmbot
Member

llvmbot commented Nov 23, 2025

@llvm/pr-subscribers-github-workflow

@llvm/pr-subscribers-clang-tools-extra

Author: mitchell (zeyi2)

Changes

FormatStringConverter::appendFormatText incorrectly treated non-ASCII characters (e.g. UTF-8) as negative values when using signed chars. This caused them to pass the < 32 check for control characters.

The negative values were passed to llvm::hexdigit, resulting in an OOB access and a crash.

This closes #169198


Full diff: https://github.com/llvm/llvm-project/pull/169215.diff

1 Files Affected:

  • (modified) clang-tools-extra/clang-tidy/utils/FormatStringConverter.cpp (+4-3)
diff --git a/clang-tools-extra/clang-tidy/utils/FormatStringConverter.cpp b/clang-tools-extra/clang-tidy/utils/FormatStringConverter.cpp
index 23dae04916e9b..a3af9504e6542 100644
--- a/clang-tools-extra/clang-tidy/utils/FormatStringConverter.cpp
+++ b/clang-tools-extra/clang-tidy/utils/FormatStringConverter.cpp
@@ -700,6 +700,7 @@ void FormatStringConverter::finalizeFormatText() {
 /// Append literal parts of the format text, reinstating escapes as required.
 void FormatStringConverter::appendFormatText(const StringRef Text) {
   for (const char Ch : Text) {
+    const unsigned char UCh = static_cast<unsigned char>(Ch);
     if (Ch == '\a')
       StandardFormatString += "\\a";
     else if (Ch == '\b')
@@ -724,10 +725,10 @@ void FormatStringConverter::appendFormatText(const StringRef Text) {
     } else if (Ch == '}') {
       StandardFormatString += "}}";
       FormatStringNeededRewriting = true;
-    } else if (Ch < 32) {
+    } else if (UCh < 32) {
       StandardFormatString += "\\x";
-      StandardFormatString += llvm::hexdigit(Ch >> 4, true);
-      StandardFormatString += llvm::hexdigit(Ch & 0xf, true);
+      StandardFormatString += llvm::hexdigit(UCh >> 4, true);
+      StandardFormatString += llvm::hexdigit(UCh & 0xf, true);
     } else
       StandardFormatString += Ch;
   }

@llvmbot
Member

llvmbot commented Nov 23, 2025

@llvm/pr-subscribers-clang-tidy


@github-actions

github-actions bot commented Nov 23, 2025

✅ With the latest revision this PR passed the C/C++ code linter.

@vbvictor
Contributor

Please add tests and a release note.


void printf_utf8_text() {
  // Non-ASCII UTF-8 in format string should not crash.
  printf("你好世界\n");
Member Author

For clarity: "你好世界" is the Chinese equivalent of "Hello, World".

@zeyi2
Member Author

zeyi2 commented Nov 23, 2025

The Python script inside the CI container fails with UnicodeEncodeError: 'charmap' codec can't encode characters in position 1451-1454: character maps to <undefined>, which is weird. I'm trying to fix this issue, but it will take some time because I don't have a Windows setup.

I've skipped the test case on Windows, and CI now passes.

Also, there may be another issue with CI:

https://github.com/llvm/llvm-project/actions/runs/19613463479?pr=169215

The aarch64 workflow reports an error but is marked as passed; I'm not sure whether this is the expected behaviour.


void printf_utf8_text() {
  // Hex encodes U+4F60 U+597D U+4E16 U+754C (你好世界) in UTF-8
  printf("\xE4\xBD\xA0\xE5\xA5\xBD\xE4\xB8\x96\xE7\x95\x8C\n");
Member Author

Can be verified using: python3 -c 'print(b"\xE4\xBD\xA0\xE5\xA5\xBD\xE4\xB8\x96\xE7\x95\x8C\n".decode("utf-8"), end="")'
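
The same check can also be sketched in C++ by encoding the four code points by hand and comparing the result with the escaped bytes used in the test. This is only an illustration: encodeUtf8 and encodeUtf8String are hypothetical helpers written for this example, and the encoder assumes valid Unicode scalar values (it does not reject surrogates or values above U+10FFFF).

```cpp
#include <cassert>
#include <string>

// Encode a single code point as UTF-8. Assumes CP is a valid scalar value.
static std::string encodeUtf8(char32_t CP) {
  std::string Out;
  if (CP < 0x80) {
    // 1 byte: 0xxxxxxx
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
  return Out;
}

// Encode a whole sequence of code points by concatenating the encodings.
static std::string encodeUtf8String(const std::u32string &Text) {
  std::string Out;
  for (const char32_t CP : Text)
    Out += encodeUtf8(CP);
  return Out;
}
```

Encoding U+4F60, U+597D, U+4E16 and U+754C this way yields the twelve bytes E4 BD A0 E5 A5 BD E4 B8 96 E7 95 8C, matching the literal in the test.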

@zeyi2 force-pushed the fix-format-string-converter branch from 5dd20c6 to 88e6348 on November 25, 2025 03:18
Development

Successfully merging this pull request may close these issues.

modernize-use-std-print checker crashes when printf contains Chinese