-
Notifications
You must be signed in to change notification settings - Fork 15.5k
[CIR] Support wide string literals in CIR codegen #171541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Implement support for wide string literals (wchar_t, char16_t, char32_t) in getConstantArrayFromStringLiteral. This migrates the feature from the incubator to upstream. The implementation handles wide character string literals by: - Collecting code units using getCodeUnit() - Creating constant arrays with IntAttr elements - Using ZeroAttr for null-filled strings Add test file wide-string.cpp copied from incubator, expanded with wchar_t test cases.
|
@llvm/pr-subscribers-clangir Author: None (adams381) ChangesThis PR migrates support for wide string literals from the incubator to upstream. Changes
Testing
Full diff: https://github.com/llvm/llvm-project/pull/171541.diff 2 Files Affected:
diff --git a/clang/lib/CIR/CodeGen/CIRGenModule.cpp b/clang/lib/CIR/CodeGen/CIRGenModule.cpp
index 41a5d9db83e2b..c0dcd3f55f328 100644
--- a/clang/lib/CIR/CodeGen/CIRGenModule.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenModule.cpp
@@ -21,7 +21,9 @@
#include "clang/AST/GlobalDecl.h"
#include "clang/AST/RecordLayout.h"
#include "clang/Basic/SourceManager.h"
+#include "clang/CIR/Dialect/IR/CIRAttrs.h"
#include "clang/CIR/Dialect/IR/CIRDialect.h"
+#include "clang/CIR/Dialect/IR/CIRTypes.h"
#include "clang/CIR/Interfaces/CIROpInterfaces.h"
#include "clang/CIR/MissingFeatures.h"
@@ -31,6 +33,8 @@
#include "mlir/IR/MLIRContext.h"
#include "mlir/IR/Verifier.h"
+#include <algorithm>
+
using namespace clang;
using namespace clang::CIRGen;
@@ -960,9 +964,36 @@ CIRGenModule::getConstantArrayFromStringLiteral(const StringLiteral *e) {
return builder.getString(str, eltTy, finalSize);
}
- errorNYI(e->getSourceRange(),
- "getConstantArrayFromStringLiteral: wide characters");
- return mlir::Attribute();
+ auto arrayTy = mlir::dyn_cast<cir::ArrayType>(convertType(e->getType()));
+ assert(arrayTy && "string literals must be emitted as an array type");
+
+ auto arrayEltTy = mlir::dyn_cast<cir::IntType>(arrayTy.getElementType());
+ assert(arrayEltTy &&
+ "string literal elements must be emitted as integral type");
+
+ auto arraySize = arrayTy.getSize();
+ auto literalSize = e->getLength();
+
+ // Collect the code units.
+ SmallVector<uint32_t, 32> elementValues;
+ elementValues.reserve(arraySize);
+ for (unsigned i = 0; i < literalSize; ++i)
+ elementValues.push_back(e->getCodeUnit(i));
+ elementValues.resize(arraySize);
+
+ // If the string is full of null bytes, emit a #cir.zero instead.
+ if (std::all_of(elementValues.begin(), elementValues.end(),
+ [](uint32_t x) { return x == 0; }))
+ return cir::ZeroAttr::get(arrayTy);
+
+ // Otherwise emit a constant array holding the characters.
+ SmallVector<mlir::Attribute, 32> elements;
+ elements.reserve(arraySize);
+ for (uint64_t i = 0; i < arraySize; ++i)
+ elements.push_back(cir::IntAttr::get(arrayEltTy, elementValues[i]));
+
+ auto elementsAttr = mlir::ArrayAttr::get(&getMLIRContext(), elements);
+ return builder.getConstArray(elementsAttr, arrayTy);
}
bool CIRGenModule::supportsCOMDAT() const {
diff --git a/clang/test/CIR/CodeGen/wide-string.cpp b/clang/test/CIR/CodeGen/wide-string.cpp
new file mode 100644
index 0000000000000..9f145d022a943
--- /dev/null
+++ b/clang/test/CIR/CodeGen/wide-string.cpp
@@ -0,0 +1,40 @@
+// RUN: %clang_cc1 -std=c++17 -triple x86_64-unknown-linux-gnu -fclangir -emit-cir %s -o %t.cir
+// RUN: FileCheck --input-file=%t.cir %s
+
+const char16_t *test_utf16() {
+ return u"你好世界";
+}
+
+// CHECK: cir.global "private" constant cir_private dso_local @{{.+}} = #cir.const_array<[#cir.int<20320> : !u16i, #cir.int<22909> : !u16i, #cir.int<19990> : !u16i, #cir.int<30028> : !u16i, #cir.int<0> : !u16i]> : !cir.array<!u16i x 5>
+
+const char32_t *test_utf32() {
+ return U"你好世界";
+}
+
+// CHECK: cir.global "private" constant cir_private dso_local @{{.+}} = #cir.const_array<[#cir.int<20320> : !u32i, #cir.int<22909> : !u32i, #cir.int<19990> : !u32i, #cir.int<30028> : !u32i, #cir.int<0> : !u32i]> : !cir.array<!u32i x 5>
+
+const char16_t *test_zero16() {
+ return u"\0\0\0\0";
+}
+
+// CHECK: cir.global "private" constant cir_private dso_local @{{.+}} = #cir.zero : !cir.array<!u16i x 5>
+
+const char32_t *test_zero32() {
+ return U"\0\0\0\0";
+}
+
+// CHECK: cir.global "private" constant cir_private dso_local @{{.+}} = #cir.zero : !cir.array<!u32i x 5>
+
+#include <stddef.h>
+
+const wchar_t *test_wchar() {
+ return L"1234";
+}
+
+// CHECK: cir.global "private" constant cir_private dso_local @{{.+}} = #cir.const_array<[#cir.int<49> : !s32i, #cir.int<50> : !s32i, #cir.int<51> : !s32i, #cir.int<52> : !s32i, #cir.int<0> : !s32i]> : !cir.array<!s32i x 5>
+
+const wchar_t *test_wchar_zero() {
+ return L"";
+}
+
+// CHECK: cir.global "private" constant cir_private dso_local @{{.+}} = #cir.zero : !cir.array<!s32i x 1>
|
|
@llvm/pr-subscribers-clang Author: None (adams381) ChangesThis PR migrates support for wide string literals from the incubator to upstream. Changes
Testing
Full diff: https://github.com/llvm/llvm-project/pull/171541.diff 2 Files Affected:
diff --git a/clang/lib/CIR/CodeGen/CIRGenModule.cpp b/clang/lib/CIR/CodeGen/CIRGenModule.cpp
index 41a5d9db83e2b..c0dcd3f55f328 100644
--- a/clang/lib/CIR/CodeGen/CIRGenModule.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenModule.cpp
@@ -21,7 +21,9 @@
#include "clang/AST/GlobalDecl.h"
#include "clang/AST/RecordLayout.h"
#include "clang/Basic/SourceManager.h"
+#include "clang/CIR/Dialect/IR/CIRAttrs.h"
#include "clang/CIR/Dialect/IR/CIRDialect.h"
+#include "clang/CIR/Dialect/IR/CIRTypes.h"
#include "clang/CIR/Interfaces/CIROpInterfaces.h"
#include "clang/CIR/MissingFeatures.h"
@@ -31,6 +33,8 @@
#include "mlir/IR/MLIRContext.h"
#include "mlir/IR/Verifier.h"
+#include <algorithm>
+
using namespace clang;
using namespace clang::CIRGen;
@@ -960,9 +964,36 @@ CIRGenModule::getConstantArrayFromStringLiteral(const StringLiteral *e) {
return builder.getString(str, eltTy, finalSize);
}
- errorNYI(e->getSourceRange(),
- "getConstantArrayFromStringLiteral: wide characters");
- return mlir::Attribute();
+ auto arrayTy = mlir::dyn_cast<cir::ArrayType>(convertType(e->getType()));
+ assert(arrayTy && "string literals must be emitted as an array type");
+
+ auto arrayEltTy = mlir::dyn_cast<cir::IntType>(arrayTy.getElementType());
+ assert(arrayEltTy &&
+ "string literal elements must be emitted as integral type");
+
+ auto arraySize = arrayTy.getSize();
+ auto literalSize = e->getLength();
+
+ // Collect the code units.
+ SmallVector<uint32_t, 32> elementValues;
+ elementValues.reserve(arraySize);
+ for (unsigned i = 0; i < literalSize; ++i)
+ elementValues.push_back(e->getCodeUnit(i));
+ elementValues.resize(arraySize);
+
+ // If the string is full of null bytes, emit a #cir.zero instead.
+ if (std::all_of(elementValues.begin(), elementValues.end(),
+ [](uint32_t x) { return x == 0; }))
+ return cir::ZeroAttr::get(arrayTy);
+
+ // Otherwise emit a constant array holding the characters.
+ SmallVector<mlir::Attribute, 32> elements;
+ elements.reserve(arraySize);
+ for (uint64_t i = 0; i < arraySize; ++i)
+ elements.push_back(cir::IntAttr::get(arrayEltTy, elementValues[i]));
+
+ auto elementsAttr = mlir::ArrayAttr::get(&getMLIRContext(), elements);
+ return builder.getConstArray(elementsAttr, arrayTy);
}
bool CIRGenModule::supportsCOMDAT() const {
diff --git a/clang/test/CIR/CodeGen/wide-string.cpp b/clang/test/CIR/CodeGen/wide-string.cpp
new file mode 100644
index 0000000000000..9f145d022a943
--- /dev/null
+++ b/clang/test/CIR/CodeGen/wide-string.cpp
@@ -0,0 +1,40 @@
+// RUN: %clang_cc1 -std=c++17 -triple x86_64-unknown-linux-gnu -fclangir -emit-cir %s -o %t.cir
+// RUN: FileCheck --input-file=%t.cir %s
+
+const char16_t *test_utf16() {
+ return u"你好世界";
+}
+
+// CHECK: cir.global "private" constant cir_private dso_local @{{.+}} = #cir.const_array<[#cir.int<20320> : !u16i, #cir.int<22909> : !u16i, #cir.int<19990> : !u16i, #cir.int<30028> : !u16i, #cir.int<0> : !u16i]> : !cir.array<!u16i x 5>
+
+const char32_t *test_utf32() {
+ return U"你好世界";
+}
+
+// CHECK: cir.global "private" constant cir_private dso_local @{{.+}} = #cir.const_array<[#cir.int<20320> : !u32i, #cir.int<22909> : !u32i, #cir.int<19990> : !u32i, #cir.int<30028> : !u32i, #cir.int<0> : !u32i]> : !cir.array<!u32i x 5>
+
+const char16_t *test_zero16() {
+ return u"\0\0\0\0";
+}
+
+// CHECK: cir.global "private" constant cir_private dso_local @{{.+}} = #cir.zero : !cir.array<!u16i x 5>
+
+const char32_t *test_zero32() {
+ return U"\0\0\0\0";
+}
+
+// CHECK: cir.global "private" constant cir_private dso_local @{{.+}} = #cir.zero : !cir.array<!u32i x 5>
+
+#include <stddef.h>
+
+const wchar_t *test_wchar() {
+ return L"1234";
+}
+
+// CHECK: cir.global "private" constant cir_private dso_local @{{.+}} = #cir.const_array<[#cir.int<49> : !s32i, #cir.int<50> : !s32i, #cir.int<51> : !s32i, #cir.int<52> : !s32i, #cir.int<0> : !s32i]> : !cir.array<!s32i x 5>
+
+const wchar_t *test_wchar_zero() {
+ return L"";
+}
+
+// CHECK: cir.global "private" constant cir_private dso_local @{{.+}} = #cir.zero : !cir.array<!s32i x 1>
|
| auto arrayTy = mlir::dyn_cast<cir::ArrayType>(convertType(e->getType())); | ||
| assert(arrayTy && "string literals must be emitted as an array type"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| auto arrayTy = mlir::dyn_cast<cir::ArrayType>(convertType(e->getType())); | |
| assert(arrayTy && "string literals must be emitted as an array type"); | |
| auto arrayTy = mlir::cast<cir::ArrayType>(convertType(e->getType())); |
| auto arrayEltTy = mlir::dyn_cast<cir::IntType>(arrayTy.getElementType()); | ||
| assert(arrayEltTy && | ||
| "string literal elements must be emitted as integral type"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| auto arrayEltTy = mlir::dyn_cast<cir::IntType>(arrayTy.getElementType()); | |
| assert(arrayEltTy && | |
| "string literal elements must be emitted as integral type"); | |
| auto arrayEltTy = mlir::cast<cir::IntType>(arrayTy.getElementType()); |
| auto arraySize = arrayTy.getSize(); | ||
| auto literalSize = e->getLength(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use auto here.
| auto literalSize = e->getLength(); | ||
|
|
||
| // Collect the code units. | ||
| SmallVector<uint32_t, 32> elementValues; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| SmallVector<uint32_t, 32> elementValues; | |
| SmallVector<uint32_t> elementValues; |
This is an artifact from the classic codegen implementation. The size used to be required, but it's not anymore and unless there is a known reason to reserve a specific size, it's usually better to omit it. In this case, since you're calling reserve on the next line the explicit size is entirely redundant.
| elementValues.resize(arraySize); | ||
|
|
||
| // If the string is full of null bytes, emit a #cir.zero instead. | ||
| if (std::all_of(elementValues.begin(), elementValues.end(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do this before we build the element value vector? Maybe something like this?
bool isAllZero = false;
for (unsigned i = 0; i < literalSize; ++i) {
if (!e->getCodeUnit(i) == 0) {
isAllZero = true;
break;
}
}
if (isAllZero)
return cir::ZeroAttr::get(arrayTy);
In most non-zero cases, this will break out of the loop on the first element.
| SmallVector<mlir::Attribute, 32> elements; | ||
| elements.reserve(arraySize); | ||
| for (uint64_t i = 0; i < arraySize; ++i) | ||
| elements.push_back(cir::IntAttr::get(arrayEltTy, elementValues[i])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| elements.push_back(cir::IntAttr::get(arrayEltTy, elementValues[i])); | |
| elements.push_back(cir::IntAttr::get(arrayEltTy, e->getCodeUnit(i))); |
If we do the zero check as I suggested, we can avoid looping twice to build our vector of attributes.
| return cir::ZeroAttr::get(arrayTy); | ||
|
|
||
| // Otherwise emit a constant array holding the characters. | ||
| SmallVector<mlir::Attribute, 32> elements; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| SmallVector<mlir::Attribute, 32> elements; | |
| SmallVector<mlir::Attribute> elements; |
| @@ -0,0 +1,40 @@ | |||
| // RUN: %clang_cc1 -std=c++17 -triple x86_64-unknown-linux-gnu -fclangir -emit-cir %s -o %t.cir | |||
| // RUN: FileCheck --input-file=%t.cir %s | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add LLVM and OGCG checks here.
| for (uint64_t i = 0; i < arraySize; ++i) | ||
| elements.push_back(cir::IntAttr::get(arrayEltTy, elementValues[i])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| for (uint64_t i = 0; i < arraySize; ++i) | |
| elements.push_back(cir::IntAttr::get(arrayEltTy, elementValues[i])); | |
| for (uint32_t elementValue : elementValues) | |
| elements.push_back(cir::IntAttr::get(arrayEltTy, elementValue)); |
- Use mlir::cast<> instead of dyn_cast<> + assert - Use explicit types (uint64_t, unsigned) instead of auto for method returns - Remove unnecessary SmallVector size parameters - Optimize zero-check to happen before building vector (early exit) - Avoid double-looping by building elements directly - Add LLVM and OGCG checks to test file
|
Addressed all reviewer feedback: Code Style Improvements:
Performance Optimizations:
Test Improvements:
All reviewer suggestions have been implemented. The code now follows LLVM coding standards and includes performance optimizations. |
andykaylor
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. I just have a few more suggestions.
| for (unsigned i = 0; i < literalSize; ++i) | ||
| elements.push_back(cir::IntAttr::get(arrayEltTy, e->getCodeUnit(i))); | ||
| // Pad with zeros if needed. | ||
| for (uint64_t i = literalSize; i < arraySize; ++i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see why you'd be inclined to do this, but classic codegen doesn't. I wonder if it's necessary. Have you seen a case where arraySize is greater than literalSize? Maybe you could just assert their equivalence.
| // LLVM: @{{.+}} = private constant [5 x i32] zeroinitializer | ||
| // OGCG: @{{.+}} = private unnamed_addr constant [5 x i32] zeroinitializer | ||
|
|
||
| #include <stddef.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #include <stddef.h> | |
| typedef __WCHAR_TYPE__ wchar_t; |
| const wchar_t *test_wchar_zero() { | ||
| return L""; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add test cases using typedef __CHAR16_TYPE__ char16_t;?
- Replace padding logic with assert that arraySize == literalSize - Improve assert message clarity - Replace #include <stddef.h> with typedef __WCHAR_TYPE__ wchar_t - Add test case using typedef __CHAR16_TYPE__ char16_t - Add explanatory comments for test organization Addresses reviewer comments on PR llvm#171541.
|
Addressed remaining reviewer feedback: Code Changes:
Test Improvements:
All changes comply with LLVM coding standards and .cursorrules requirements. |
🐧 Linux x64 Test Results
Failed Tests(click on a test name to see its output) ClangClang.CIR/CodeGen/mms-bitfields.cClang.CIR/CodeGen/mms-bitfields.cIf these failures are unrelated to your changes (for example tests are broken or flaky at HEAD), please open an issue at https://github.com/llvm/llvm-project/issues and add the |
|
|
||
| uint64_t arraySize = arrayTy.getSize(); | ||
| unsigned literalSize = e->getLength(); | ||
| assert(arraySize == literalSize && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this assert is firing in testing. I wasn't expecting that. Looking closer at the classic codegen implementation, it fills the elements array from zero to E->getLength() (i.e. literalSize here) and then resizes the array to NumElements. Given that classic codegen is using a vector of integers, maybe the resize handles the zero padding. I guess you need to bring back the zero-padding loop. Sorry about the bad suggestion.
- Change assert to arraySize == literalSize + 1 to account for null terminator - Explicitly add null terminator to elements array - Remove typedef statements that conflict with C++17 built-in types Fixes CI assertion failure in PR llvm#171541.
|
Fixed the CI assertion failure by accounting for the null terminator in the assertion (arraySize == literalSize + 1) and explicitly adding it to the elements array. Also removed typedef statements from the test file that conflict with C++17 built-in types. All tests pass locally. Ready for review. |
andykaylor
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great.
This PR migrates support for wide string literals from the incubator to upstream. ## Changes - Implement wide string literal support in `getConstantArrayFromStringLiteral` - Handle wchar_t, char16_t, and char32_t string literals - Collect code units and create constant arrays with IntAttr elements - Use ZeroAttr for null-filled strings ## Testing - Copied `wide-string.cpp` test file from incubator - Expanded test to include wchar_t test cases (incubator only had char16_t and char32_t) - All tests pass --------- Co-authored-by: Andy Kaylor <akaylor@nvidia.com>
This PR migrates support for wide string literals from the incubator to upstream.
Changes
getConstantArrayFromStringLiteralTesting
wide-string.cpptest file from incubator