Skip to content

[libcxx] Miscompilation of string code when using new-struct-path-tbaa #55384

@brunodefraine

Description

@brunodefraine

When using libcxx and clang 14, we observe a miscompilation of the following simple std::string code:

std::string r;
...
r = std::string("Hello, ") + "world";

We expect string r to be equal to "Hello, world" but we end up with "H" followed by 11 garbage characters, depending on the earlier contents of r. This is a godbolt demo. The only non-trivial factor to trigger the wrong behavior, is the use of the new TBAA struct path format (using clang option -new-struct-path-tbaa).

Analysis

After appending to the left-hand side string, the statement triggers a move constructor and move assignment to get the result in place. Each move consists of a copy of the string representation, followed by a zero operation that clears the representation of the stolen string. It seems these memory access operations do not have proper TBAA information, leading to incorrect memory optimization.

TBAA tags

In the LLVM IR for the move assignment operator and move constructor, we can find back the TBAA tags for the memory operations. These tags should be compatible as they may refer to overlapping memory access. The internal representation of a basic_string is the following structure:

    union __ulx{__long __lx; __short __lxx;};

    enum {__n_words = sizeof(__ulx) / sizeof(size_type)};

    struct __raw
    {
        size_type __words[__n_words];
    };

    struct __rep
    {
        union
        {
            __long  __l;
            __short __s;
            __raw   __r;
        };
    };

    __compressed_pair<__rep, allocator_type> __r_;

Type-punning using a union violates strict aliasing, but is supported by e.g. GCC provided the memory is accessed through the union type. Also in clang, access using a union is normally given the tag of "omnipotent char".

The move assignment operator is doing (in __move_assign):

  __r_.first() = __str.__r_.first();
  __str.__set_short_size(0);

The assignment at the level of __rep gives a memcpy intrinsic with TBAA base and access tag of __rep, which has a member (the anonymous union) of omnipotent char. __set_short_size uses the access expression __r_.first().__s.__size_, the union access has base and access tag of omnipotent char.

The move constructor is doing:

    : __r_(_VSTD::move(__str.__r_))
{
    __str.__zero();

The move at the level of the __compressed_pair gives a memcpy intrinsic with TBAA base and access tag of the __compressed_pair, but this structure has no members in the TBAA info, this seems incomplete:

!8 = !{!9, !9, i64 0, i64 24}
!9 = !{!6, i64 24, !"_ZTSNSt3__117__compressed_pairINS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE5__repES5_EE"}

The function __zero is doing:

    void __zero() _NOEXCEPT
        {
            size_type (&__a)[__n_words] = __r_.first().__r.__words;
            for (unsigned __i = 0; __i < __n_words; ++__i)
                __a[__i] = 0;
        }

The store of the zero value in the loop body is given TBAA base and access tag of long, this is clearly wrong. However, function __zero violates the constraint "memory is accessed through the union type" from the GCC manual (cited above). The use of the intermediate reference variable a is equivalent to the counter-example (using an intermediate pointer variable int* ip) given there. If the source code is changed to remove the use of the reference variable, the base and access tag is omnipotent char, as expected.

Workaround

Changing the code of __move_assign do the assignment at the level of __raw:

  __r_.first().__r = __str.__r_.first().__r;

Gives a memcpy with TBAA tag of omnipotent char (since we are accessing a union member) and prevents the incorrect memory optimization. Presumably because this forces to alias with the __compressed_pair (but as noted above, that tag seems incomplete).

Credits

Bhramar Bhushan Vatsa, Wouter Vermaelen and Jeroen Dobbelaere contributed to this bug report.

Metadata

Metadata

Assignees

No one assigned

    Labels

    TBAAType-Based Alias Analysis / Strict Aliasingclang:codegenIR generation bugs: mangling, exceptions, etc.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions