BUG: Integer conversion misbehavior on QEMU ppc64le #20964

ckastner · 2022-02-01T17:43:35Z

Describe the issue:

[I want to begin by first acknowledging that this is a fringe case of a bug appearing only in a VM (I haven't yet been able to reproduce it on real hardware), so please feel free to close this issue right away. However, I felt that I still should at least point it out, in case there is some latent real issue that this might uncover. In any case, hints as to how I can dig deeper into this so that I may report it to the proper channel (be it QEMU, GNU libc, or whatever) would be welcome.]

On a QEMU ppc64le VM, When casting an array of floats to np.intc, a weird truncation seems to happen. This only occurs in the VM; it does not happen onthe real ppc64le hardware I had access to (although that system hat an older 4.19 kernel; I cannot change that).

In the code below, I create arrays of value 1. of length 1 through 10, and then cast them to np.int64 and np.intc for comparison.

This works fine for arrays with length <= 3. If fails for longer ones, and apparently there is a pattern in this failure.

Reproduce the code example:

import numpy as np
for i in range(1, 11):
    a = np.full(i, 1.)
    print(i, a, a.astype(np.int64), a.astype(np.intc))

Error message:

1 [1.] [1] [1]
2 [1. 1.] [1 1] [1 1]
3 [1. 1. 1.] [1 1 1] [1 1 1]
4 [1. 1. 1. 1.] [1 1 1 1] [0 0 0 0]
5 [1. 1. 1. 1. 1.] [1 1 1 1 1] [0 0 0 0 1]
6 [1. 1. 1. 1. 1. 1.] [1 1 1 1 1 1] [0 0 0 0 1 1]
7 [1. 1. 1. 1. 1. 1. 1.] [1 1 1 1 1 1 1] [0 0 0 0 1 1 1]
8 [1. 1. 1. 1. 1. 1. 1. 1.] [1 1 1 1 1 1 1 1] [0 0 0 0 0 0 0 0]
9 [1. 1. 1. 1. 1. 1. 1. 1. 1.] [1 1 1 1 1 1 1 1 1] [0 0 0 0 0 0 0 0 1]
10 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] [1 1 1 1 1 1 1 1 1 1] [0 0 0 0 0 0 0 0 1 1]

NumPy/Python version information:

1.22.1 3.9.10 (main, Jan 16 2022, 17:12:18)
[GCC 11.2.0]

The text was updated successfully, but these errors were encountered:

seberg · 2022-02-01T17:55:59Z

Aside from being unreadable code due to templating, NumPy should be doing nothing special here. This should effectively be a simple contiguous loop, such as:

while (N--) {
    *(int *)out = (int)(*(double *)in);
    out += sizeof(int);
    in += sizeof(double);
}

The one little "extra" is, that the function is marked with __attribute__((optimize("O3"))).

So, I suspect there is some issue with the emulation and GCC generated highly optimized and architecture specific code? But I don't really have an additional lead (maybe someone else remembers a pointer).

EDIT: Just as a note, SIMD probably explains the pattern here, that any "left over" elements beyond multiples of 4 are handled in a non-vectorized way and thus work fine.

ckastner · 2022-02-01T18:15:31Z

Thanks for the quick reply. Indeed, SIMD would be a reasonable explanation.

I've seen this in a VM with a userspace (based on Debian unstable) identical to a physical machine where this isn't happening, so perhaps it's the emulation after all.

I'll see what I can find out, and will close this bug soon unless somebody else closes it sooner.

seberg · 2022-02-01T18:18:11Z

Thanks, if you are trying to reproduce this in pure C, I expect putting my loop into a small function with the optimization attribute will do the trick (with in being a passed in, allocated, double array and out being an integer one).

ckastner · 2022-02-01T21:43:45Z

I could indeed create a minimal this by putting it into a function. gcc -O2 output is fine, gcc -O3 produces results consistent with the above. I created an issue in the QEMU tracker.

Many thanks for your advice. This has been bugging me immensely. I stumbled over this in a scikit-learn build, which triggered this in scipy, and so on.

ckastner · 2022-02-02T06:48:20Z

@seberg: FYI, the issue was closed in QEMU with the comment that "this type punning through casts is invalid C". If that comment is true, then there might be a latent issue after all.

Either way, I'm admittedly no longer fluid in C, so I'll leave it at that.

mattip · 2022-02-02T08:05:10Z

@seiko2plus thoughts?

seiko2plus · 2022-02-02T09:52:53Z

@mattip, I'm going to dig into it. apparently, the issue seems related to VSX instruction xvcvdpsxws which truncates vector of double-precision into a signed-word vector (f64 -> int32).

mattip · 2022-02-02T10:13:21Z

Thanks. Maybe @rafaelcfsousa who seems to know about VSX could also help?

seiko2plus · 2022-02-02T11:57:11Z

Here is what GCC generate from the provided source code (convert.c) within the upstream bug, see https://godbolt.org/z/ej7naaP7G.

.L4:
        lxvd2x 33,4,9
        lxvd2x 32,6,9
        sldi 10,8,4
        addi 9,9,32
        addi 8,8,1
        xvcvdpsxws 33,33
        xvcvdpsxws 32,32
        vperm 0,0,1,13
        stxvd2x 32,5,10
        bdnz .L4
        rldicr 9,3,0,61
        rldicr 8,3,3,58
        rldicr 10,3,2,59
        cmpld 0,3,9
        subf 7,9,7
        add 4,4,8
        add 5,5,10
        beqlr 0
# vperm indices
.LC1:
        .byte   23
        .byte   22
        .byte   21
        .byte   20
        .byte   31
        .byte   30
        .byte   29
        .byte   28
        .byte   7
        .byte   6
        .byte   5
        .byte   4
        .byte   15
        .byte   14
        .byte   13
        .byte   12

GCC performs 2 loads, 2 truncate via xvcvdpsxws, then permuting the even elements and here is the bug. GCC suppose to permute the odd elements on little-endian mode since according to the Power/ISA_V2.07B:

VSX Vector truncate Double-Precision to
integer and Convert to Signed Integer Word
format with Saturate XX2-form
...
The result is placed into bits 0:31 of doubleword
element i of VSR[XT].
The contents of bits 32:63 of doubleword element
1 of VSR[XT] are undefined

see page 497 (Chapter 7. Vector-Scalar Floating-Point Operations [Category: VSX])

Now if GCC is wrong about permuting then why does it work on native hardware?

By compiling and executing the following code:

#include <stdio.h>
#include <altivec.h>

void print_simd_trunc(vector double a)
{
    vector signed int fxpt;
    __asm__ ("xvcvdpsxws %x0,%x1" : "=wa" (fxpt) : "wa" (a));
    printf("xvcvspsxws:( ");
    for (int i = 0; i < 4; ++i) {
        printf("%d, ", vec_extract(fxpt, i));
    }
    printf(")\n");
}

int main(void)
{
    print_simd_trunc((vector double) {1, 2});
    print_simd_trunc((vector double) {3, 4});
    return 0;
}

On Power[8, 9]/VM(native), I got the following result:

xvcvspsxws:( 1, 1, 2, 2, )
xvcvspsxws:( 3, 3, 4, 4, )

And on qemu(emulate):

xvcvspsxws:( 0, 1, 0, 2, )
xvcvspsxws:( 0, 3, 0, 4, )

On Power[8, 9] the contents of bits 32:63 aren't undefined as mentioned in the ISA but it contains a duplicate of the bits 0:31 and that's why GCC code works!

Funny but it seems qemu is strict in respecting the Power/ISA more than IBM and GCC do.

I think this issue should be forwarded to GCC unless IBM has a different opinion on this matter.

@ckastner,

the issue was closed in QEMU with the comment that "this type punning through casts is invalid C"

Your code is just fine, and there's no type punning. maybe they got confused by the null reinterpreting:

-                *(int *)out = (int) (*(double *)in);
+               *out = (int) (*in);

ckastner · 2022-02-02T17:01:54Z

@seiko2plus, thank you for your remarkable resolution.

I will indeed report this to GCC, and also ping the QEMU issue again just to make them aware of this oddity.

ckastner · 2022-02-10T23:45:31Z

Just to documented this here: I reported the issue as GCC#104353.

edelsohn · 2022-02-16T15:51:26Z

second programming note for xvcvdpsxws in ISA 3.1:

Previous versions of the architecture allowed the contents of words 1
and 3 of the result register to be undefined. However, all processors
that support this instruction write the result into words 0 and 1 and
words 2 and 3 of the result register, as is required by this version
of the architecture.

This needs to go back to a bug in QEMU.

seiko2plus · 2022-02-16T16:29:21Z

Previous versions of the architecture allowed the contents of words 1
and 3 of the result register to be undefined.

xvcvdpsxws suppose to be affected by the endianness, isn't it?

edelsohn · 2022-02-16T16:35:27Z

The NumPy code is correct and the GCC code generation is correct.

How is endianness an issue for this?

QEMU had implemented the original ISA and had not matched the hardware. The hardware behavior was codified in the ISA as correct. QEMU needs to be updated to match the latest ISA.

seiko2plus · 2022-02-16T17:05:05Z

The NumPy code is correct and the GCC code generation is correct.

How is endianness an issue for this?

GCC permute the even elements(0, 2) on both big/little-endian modes under flags -mcpu=power8(ISA 2.07) & -mcpu=power9(ISA 3.00) while clang & qemu respect the endianness on these architectures.

QEMU had implemented the original ISA and had not matched the hardware. The hardware behavior was codified in the ISA as correct. QEMU needs to be updated to match the latest ISA.

I didn't know this behavior changed on ISA 3.1. I agree with you that qemu should update the latest ISA but only when option -cpu power10 is enabled, Am I missing something?

edelsohn · 2022-02-16T17:31:02Z

The endianness is a separate issue. There also was an issue recently where cnttz_lsbb were implemented with different endianness in Clang and GCC. GCC was fixed to conform to the ISA. If there is an endianness issue for xvcvdpsxws, please open a GCC issue solely for the endianness issue.

In regard to xvcvdpsxws elements, QEMU presumably should emulate a hardware processor, not an abstract ISA. All hardware implementations behave as specified in the latest ISA. The ISA is clarified / corrected in ISA 3.1, but it is not a change in behavior limited to Power10. QEMU should emulate the hardware, which means that it should be modified to implement xvcvdpsxws as specified in ISA 3.1, which is the behavior implemented in all hardware.

seiko2plus · 2022-02-16T17:56:13Z

The ISA is clarified / corrected in ISA 3.1, but it is not a change in behavior limited to Power10

Thank you for the clarification, it's been clear now this bug belongs to qemu.

seiko2plus · 2022-02-16T19:22:19Z

@edelsohn, Clang/LLVM needs to update their VSX intrinsics and remove the unnecessary shuffles, take a look at the following code:

https://github.com/llvm/llvm-project/blob/da7c77b82c217592cc14f5b5a3c6a9e6741896af/clang/lib/Headers/altivec.h#L3465-L3484

And in general, there's no need for VSX intrinsics like vec_signede, vec_signedo, vec_unsignede, vec_unsignedo, vec_floate, and vec_floato. just keep one version for each conversion rather than provide one for even elements and another for odd, it seems everyone was following the ISA without checking the actual hardware behavior, not just qemu.

henryiii · 2022-02-16T19:36:22Z

In #21062, you can see float64 -> float32 is broken on emulation too:

>>> import numpy as np
>>> s = np.random.normal(size=4); s
array([ 0.04760427, -0.11949105, -1.67610564,  1.38181202])
>>> s.astype(np.float32)
array([ 4.7665749e-18,  1.9329982e-10,  6.0247680e-16, -6.2795307e+26],
      dtype=float32)
>>> np.array([1., 2., 3., 1.00001]).astype(np.float32)
array([0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 4.5436204e+36],
      dtype=float32)

edelsohn · 2022-02-16T19:37:34Z

Can you open an issue in the LLVM Github repository about the unnecessary shuffles?

seiko2plus · 2022-02-16T19:49:09Z

@edelsohn,

Can you open an issue in the LLVM Github repository about the unnecessary shuffles?

sure, I will. also to deprecate the following intrinsics:
vec_signede, vec_signedo, vec_unsignede, vec_unsignedo, vec_floate, and vec_floato

And not just clang, GCC performs the same too. Take a look:
https://godbolt.org/z/hcx4zvnoh

seiko2plus · 2022-02-16T19:52:57Z

@henryiii,
Thank you for updating, any SIMD operation that performs any kind of conversion from 64-bit to 32-bit is affected by this bug.

edelsohn · 2022-02-16T19:56:16Z

GCC inserts endian swap code to match the instruction semantics. When GCC can provide that the endian swap code will cancel out or is unnecessary, it can avoid the endian swap code. Are you suggesting that the endian swap is wrong, or never is necessary or that the compiler should be able to provide that it is unnecessary for that function? You should open GCC and LLVM issues and explain why the compiler should know that the swap is unnecessary.

seiko2plus · 2022-02-16T20:12:11Z

@edelsohn,

Are you suggesting that the endian swap is wrong, or never is necessary or that the compiler should be able to provide that it is unnecessary for that function?

according to ISA 3.1 update

However, all processors that support this instruction write the result into words 0 and 1 and
words 2 and 3 of the result register, as is required by this version
of the architecture.

Then yes, all shuffles that related to permute even/odd depend on the endianness is unnecessary.

seiko2plus · 2022-02-18T08:54:33Z

@edelsohn, informed both projects, llvm-project and GCC.

ckastner added the 00 - Bug label Feb 1, 2022

seberg added 57 - Close? Issues which may be closable unless discussion continued and removed 00 - Bug labels Feb 1, 2022

ckastner closed this as completed Feb 1, 2022

seiko2plus mentioned this issue Feb 16, 2022

BUG: ppc64le conversion float64->32 are broken for arrays 4+ elements #21062

Closed

henryiii mentioned this issue Feb 16, 2022

[BUG]: Numpy test failure on ppc64le architecture pybind/pybind11#3710

Closed

3 tasks

seberg mentioned this issue Mar 30, 2022

BUG: astype corrupts data on ppc64le #21267

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Integer conversion misbehavior on QEMU ppc64le #20964

BUG: Integer conversion misbehavior on QEMU ppc64le #20964

ckastner commented Feb 1, 2022 •

edited

Loading

seberg commented Feb 1, 2022 •

edited

Loading

ckastner commented Feb 1, 2022 •

edited

Loading

seberg commented Feb 1, 2022

ckastner commented Feb 1, 2022

ckastner commented Feb 2, 2022

mattip commented Feb 2, 2022

seiko2plus commented Feb 2, 2022

mattip commented Feb 2, 2022

seiko2plus commented Feb 2, 2022 •

edited

Loading

ckastner commented Feb 2, 2022

ckastner commented Feb 10, 2022

edelsohn commented Feb 16, 2022

seiko2plus commented Feb 16, 2022

edelsohn commented Feb 16, 2022

seiko2plus commented Feb 16, 2022

edelsohn commented Feb 16, 2022

seiko2plus commented Feb 16, 2022

seiko2plus commented Feb 16, 2022 •

edited

Loading

henryiii commented Feb 16, 2022 •

edited

Loading

edelsohn commented Feb 16, 2022

seiko2plus commented Feb 16, 2022 •

edited

Loading

seiko2plus commented Feb 16, 2022

edelsohn commented Feb 16, 2022

seiko2plus commented Feb 16, 2022 •

edited

Loading

seiko2plus commented Feb 18, 2022

BUG: Integer conversion misbehavior on QEMU ppc64le #20964

BUG: Integer conversion misbehavior on QEMU ppc64le #20964

Comments

ckastner commented Feb 1, 2022 • edited Loading

Describe the issue:

Reproduce the code example:

Error message:

NumPy/Python version information:

seberg commented Feb 1, 2022 • edited Loading

ckastner commented Feb 1, 2022 • edited Loading

seberg commented Feb 1, 2022

ckastner commented Feb 1, 2022

ckastner commented Feb 2, 2022

mattip commented Feb 2, 2022

seiko2plus commented Feb 2, 2022

mattip commented Feb 2, 2022

seiko2plus commented Feb 2, 2022 • edited Loading

ckastner commented Feb 2, 2022

ckastner commented Feb 10, 2022

edelsohn commented Feb 16, 2022

seiko2plus commented Feb 16, 2022

edelsohn commented Feb 16, 2022

seiko2plus commented Feb 16, 2022

edelsohn commented Feb 16, 2022

seiko2plus commented Feb 16, 2022

seiko2plus commented Feb 16, 2022 • edited Loading

henryiii commented Feb 16, 2022 • edited Loading

edelsohn commented Feb 16, 2022

seiko2plus commented Feb 16, 2022 • edited Loading

seiko2plus commented Feb 16, 2022

edelsohn commented Feb 16, 2022

seiko2plus commented Feb 16, 2022 • edited Loading

seiko2plus commented Feb 18, 2022

ckastner commented Feb 1, 2022 •

edited

Loading

seberg commented Feb 1, 2022 •

edited

Loading

ckastner commented Feb 1, 2022 •

edited

Loading

seiko2plus commented Feb 2, 2022 •

edited

Loading

seiko2plus commented Feb 16, 2022 •

edited

Loading

henryiii commented Feb 16, 2022 •

edited

Loading

seiko2plus commented Feb 16, 2022 •

edited

Loading

seiko2plus commented Feb 16, 2022 •

edited

Loading