<print>: std::println("{}", ...) UTF-8 Truncation Bug with /utf-8 (256-byte Buffer Split) #5894

@739C1AE2

Describe the bug

When compiling with MSVC and /utf-8, std::println corrupts long UTF-8 strings at internal buffer boundaries when a format argument is used (std::println("{}", str)). The multi-byte character that straddles a boundary is replaced with U+FFFD replacement characters.

Reproduction Code and Output

#include <print>

int main()
{
    std::println("{}", "这是一段超长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长的文本。");
}

Compilation Command: cl.exe /std:c++latest /utf-8 repro.cpp

Compiler Version: Microsoft (R) C/C++ Optimizing Compiler Version 19.50.35718 for x86

Expected Behavior: The UTF-8 string is output completely and correctly.

Observed Behavior: 这是一段超长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长长���的文本。

Possible Cause

The issue is likely caused by the output mechanism splitting the long string into small, fixed-size chunks (e.g., 256 bytes) before sending them to the console.

The failure is suspected to lie in the specialized handler responsible for committing these chunks to the console: _Fmt_iterator_flush<_Print_to_unicode_console_it>.

This handler, which manages the UTF-8 to Console conversion, appears to simply pass the raw byte chunk's range (_First to _Last) to the underlying write function without ensuring the chunk contains complete UTF-8 characters:

// https://github.com/microsoft/STL/blob/main/stl/inc/print 
template <>
struct _Fmt_iterator_flush<_Print_to_unicode_console_it> {
    static _Print_to_unicode_console_it _Flush(
        const char* const _First, const char* const _Last, _Print_to_unicode_console_it _Output) {
        _STD _Print_noformat_unicode_to_console_nonlocking(_Output._Get_console_handle(), {_First, _Last});
        return _Output;
    }
};

If a chunk ends in the middle of a multi-byte UTF-8 character, the incomplete sequence is committed as-is, which may cause the downstream MultiByteToWideChar conversion to fail on it and produce the observed U+FFFD characters. This suggests that this specialization lacks a UTF-8 boundary check before the data is written to the console handle.

Labels: bug, format
