@cor3ntin cor3ntin commented Oct 22, 2025

UTF-16 to UTF-32 conversions seem widespread,
and lone surrogates have a distinct representation in UTF-32.

Let's not warn on this case, to make the warning easier to adopt. This follows the SG-16 guideline:

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3695r2.html#changes-since-r1

Fixes #163719
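
To illustrate the point about lone surrogates, here is a minimal sketch that is not part of the patch. The helper names `is_surrogate`, `widen`, and `narrow` are hypothetical and exist only for this example. Widening a single UTF-16 code unit to `char32_t` keeps its numeric value, so a lone surrogate such as 0xD800 remains identifiable in the wider type, whereas narrowing a code point outside the BMP truncates it.

```cpp
#include <cassert>

// Hypothetical helper: true if the value lies in the surrogate range 0xD800..0xDFFF.
constexpr bool is_surrogate(char32_t c) { return c >= 0xD800 && c <= 0xDFFF; }

// Hypothetical helper: the implicit char16_t -> char32_t widening discussed here.
constexpr char32_t widen(char16_t c) { return c; }

// Hypothetical helper: the opposite direction, which can drop the high bits.
constexpr char16_t narrow(char32_t c) { return static_cast<char16_t>(c); }

// Every char16_t value survives widening unchanged, lone surrogates included.
static_assert(widen(u'A') == U'A', "ASCII is preserved");
static_assert(widen(char16_t(0xD800)) == 0xD800, "lone surrogate keeps its value");
static_assert(is_surrogate(widen(char16_t(0xDFFF))), "still recognisable as a surrogate");
static_assert(!is_surrogate(widen(char16_t(0xE000))), "0xE000 is not a surrogate");

// Narrowing U+1F409 keeps only the low 16 bits.
static_assert(narrow(U'\U0001F409') == 0xF409, "code point above the BMP is truncated");
```

The asymmetry between `widen` and `narrow` is the reason only the `char16_t` to `char32_t` direction is exempted.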

@cor3ntin cor3ntin requested a review from AaronBallman October 22, 2025 16:11
@llvmbot llvmbot added the clang and clang:frontend labels Oct 22, 2025
llvmbot commented Oct 22, 2025

@llvm/pr-subscribers-clang

Author: Corentin Jabot (cor3ntin)

Changes

UTF-16 to UTF-16 conversions seems widespread,
and lone surrogate have a distinct representation in UTF-32.

Lets not warn on this case to make the warning easier to adopt. This follows SG-16 guideline

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3695r2.html#changes-since-r1

Fixes #163719


Full diff: https://github.com/llvm/llvm-project/pull/164654.diff

2 Files Affected:

  • (modified) clang/lib/Sema/SemaChecking.cpp (+8-1)
  • (modified) clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp (+4-4)
diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index dd5b710d7e1d4..41bcf8fd493fc 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -12014,13 +12014,20 @@ static void DiagnoseMixedUnicodeImplicitConversion(Sema &S, const Type *Source,
                                                    SourceLocation CC) {
   assert(Source->isUnicodeCharacterType() && Target->isUnicodeCharacterType() &&
          Source != Target);
+
+  // Lone surrogates have a distinct representation in UTF-32.
+  // Converting between UTF-16 and UTF-32 codepoints seems very widespread,
+  // so don't warn on such conversion.
+  if (Source->isChar16Type() && Target->isChar32Type())
+    return;
+
   Expr::EvalResult Result;
   if (E->EvaluateAsInt(Result, S.getASTContext(), Expr::SE_AllowSideEffects,
                        S.isConstantEvaluatedContext())) {
     llvm::APSInt Value(32);
     Value = Result.Val.getInt();
     bool IsASCII = Value <= 0x7F;
-    bool IsBMP = Value <= 0xD7FF || (Value >= 0xE000 && Value <= 0xFFFF);
+    bool IsBMP = Value <= 0xDFFF || (Value >= 0xE000 && Value <= 0xFFFF);
     bool ConversionPreservesSemantics =
         IsASCII || (!Source->isChar8Type() && !Target->isChar8Type() && IsBMP);
 
diff --git a/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp b/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp
index fcff006d0e028..f17f20ca25295 100644
--- a/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp
+++ b/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp
@@ -14,7 +14,7 @@ void test(char8_t u8, char16_t u16, char32_t u32) {
     c16(u32); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' may lose precision and change the meaning of the represented code unit}}
 
     c32(u8);  // expected-warning {{implicit conversion from 'char8_t' to 'char32_t' may change the meaning of the represented code unit}}
-    c32(u16); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' may change the meaning of the represented code unit}}
+    c32(u16);
     c32(u32);
 
 
@@ -30,7 +30,7 @@ void test(char8_t u8, char16_t u16, char32_t u32) {
     c16(char32_t(0x7f));
     c16(char32_t(0x80));
     c16(char32_t(0xD7FF));
-    c16(char32_t(0xD800)); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the code unit '<0xD800>'}}
+    c16(char32_t(0xD800));
     c16(char32_t(0xE000));
     c16(char32_t(U'🐉')); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the code point '🐉'}}
 
@@ -44,8 +44,8 @@ void test(char8_t u8, char16_t u16, char32_t u32) {
     c32(char16_t(0x80));
 
     c32(char16_t(0xD7FF));
-    c32(char16_t(0xD800)); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' changes the meaning of the code unit '<0xD800>'}}
-    c32(char16_t(0xDFFF)); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' changes the meaning of the code unit '<0xDFFF>'}}
+    c32(char16_t(0xD800));
+    c32(char16_t(0xDFFF));
     c32(char16_t(0xE000));
     c32(char16_t(u'☕'));
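
For readers following the constant-evaluated branch, the classification in the `SemaChecking.cpp` hunk can be modelled outside of Clang. This is a rough sketch under stated assumptions: the `Kind` enum and the functions `preserves_semantics` and `diagnoses` are hypothetical stand-ins for `isChar8Type()`, `isChar16Type()`, `isChar32Type()` and the surrounding Sema logic. It mirrors the post-patch expressions for constant values only and is not the Sema code itself.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the Unicode character types Sema distinguishes.
enum class Kind { Char8, Char16, Char32 };

// Mirrors IsASCII, IsBMP and ConversionPreservesSemantics as written after this patch.
constexpr bool preserves_semantics(Kind source, Kind target, std::uint32_t value) {
  return (value <= 0x7F) ||
         (source != Kind::Char8 && target != Kind::Char8 &&
          (value <= 0xDFFF || (value >= 0xE000 && value <= 0xFFFF)));
}

// Mirrors the new early return: char16_t -> char32_t is never diagnosed.
// Only the constant-value path is modelled here.
constexpr bool diagnoses(Kind source, Kind target, std::uint32_t value) {
  return !(source == Kind::Char16 && target == Kind::Char32) &&
         !preserves_semantics(source, target, value);
}
```

Under this model, `diagnoses(Kind::Char32, Kind::Char16, 0xD800)` is false and `diagnoses(Kind::Char32, Kind::Char16, 0x1F409)` is true, which matches the updated expectations in the test file.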
 

@efriedma-quic efriedma-quic added this to the LLVM 21.x Release milestone Oct 22, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in LLVM Release Status Oct 22, 2025
@c-rhodes c-rhodes moved this from Needs Triage to Needs Review in LLVM Release Status Oct 23, 2025
@cor3ntin
Contributor Author

@AaronBallman

@AaronBallman
Collaborator

> UTF-16 to UTF-16 conversions seems widespread,

UTF-16 to UTF-32, right?

@AaronBallman AaronBallman left a comment

LGTM aside from the PR summary. This should have a release note (perhaps on the release branch).

h-vetinari commented Oct 26, 2025

> should have a release note (perhaps on the release branch).

sidenote: updated release notes on the maintenance branches almost never get published (though they should).

| Version | Last Patch Release | Release Notes (RN) exist | RN don't exist | RN for most recent patch release published? |
|---|---|---|---|---|
| v21 | 21.1.4 (currently) | 21.1.0, 21.1.2 | 21.1.1, 21.1.3, 21.1.4 | no |
| v20 | 20.1.8 | 20.1.0 | 20.1.1 and onward | no |
| v19 | 19.1.7 | 19.1.0 | 19.1.1 and onward | no |
| v18 | 18.1.8 | 18.1.0, 18.1.1, 18.1.4, 18.1.6, 18.1.7, 18.1.8 | 18.1.2, 18.1.3, 18.1.5 | yes |
| v17 | 17.0.6 | 17.0.1 | 17.0.0, 17.0.2 and onward | no |
| v16 | 16.0.6 | 16.0.0 | 16.0.1 and onward | no |

It would help immensely if the release notes were built and published in an automated fashion per branch (some prior thoughts on this), rather than being dependent on the availability and goodwill of the release managers. Though I realize this would be a big initial lift.

@c-rhodes c-rhodes moved this from Needs Review to Needs Merge in LLVM Release Status Oct 27, 2025
@c-rhodes
Collaborator

> sidenote: updated release notes on the maintenance branches almost never get published (though they should).

I'm new to release maintenance so this is new to me, thanks for mentioning. I'll raise it with the other release maintainers.

UTF-16 to UTF-16 conversions seems widespread,
and lone surrogate have a distinct representation in UTF-32.

Lets not warn on this case to make the warning easier to adopt. This
follows SG-16 guideline


https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3695r2.html#changes-since-r1

Fixes llvm#163719
@c-rhodes c-rhodes force-pushed the wconversion_backport branch from 9b3789a to 5c802f9 Compare October 27, 2025 10:19
@c-rhodes c-rhodes merged commit 5c802f9 into llvm:release/21.x Oct 27, 2025
4 of 8 checks passed
@github-project-automation github-project-automation bot moved this from Needs Merge to Done in LLVM Release Status Oct 27, 2025
@github-actions

@cor3ntin (or anyone else): if you would like to add a note about this fix in the release notes (completely optional), please reply to this comment with a one- or two-sentence description of the fix. When you are done, please add the release:note label to this PR.
