Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Integer conversion misbehavior on QEMU ppc64le #20964

Closed
ckastner opened this issue Feb 1, 2022 · 25 comments
Closed

BUG: Integer conversion misbehavior on QEMU ppc64le #20964

ckastner opened this issue Feb 1, 2022 · 25 comments
Labels
57 - Close? Issues which may be closable unless discussion continued

Comments

@ckastner
Copy link
Contributor

ckastner commented Feb 1, 2022

Describe the issue:

[I want to begin by first acknowledging that this is a fringe case of a bug appearing only in a VM (I haven't yet been able to reproduce it on real hardware), so please feel free to close this issue right away. However, I felt that I still should at least point it out, in case there is some latent real issue that this might uncover. In any case, hints as to how I can dig deeper into this so that I may report it to the proper channel (be it QEMU, GNU libc, or whatever) would be welcome.]

On a QEMU ppc64le VM, When casting an array of floats to np.intc, a weird truncation seems to happen. This only occurs in the VM; it does not happen onthe real ppc64le hardware I had access to (although that system hat an older 4.19 kernel; I cannot change that).

In the code below, I create arrays of value 1. of length 1 through 10, and then cast them to np.int64 and np.intc for comparison.

This works fine for arrays with length <= 3. If fails for longer ones, and apparently there is a pattern in this failure.

Reproduce the code example:

import numpy as np
for i in range(1, 11):
    a = np.full(i, 1.)
    print(i, a, a.astype(np.int64), a.astype(np.intc))

Error message:

1 [1.] [1] [1]
2 [1. 1.] [1 1] [1 1]
3 [1. 1. 1.] [1 1 1] [1 1 1]
4 [1. 1. 1. 1.] [1 1 1 1] [0 0 0 0]
5 [1. 1. 1. 1. 1.] [1 1 1 1 1] [0 0 0 0 1]
6 [1. 1. 1. 1. 1. 1.] [1 1 1 1 1 1] [0 0 0 0 1 1]
7 [1. 1. 1. 1. 1. 1. 1.] [1 1 1 1 1 1 1] [0 0 0 0 1 1 1]
8 [1. 1. 1. 1. 1. 1. 1. 1.] [1 1 1 1 1 1 1 1] [0 0 0 0 0 0 0 0]
9 [1. 1. 1. 1. 1. 1. 1. 1. 1.] [1 1 1 1 1 1 1 1 1] [0 0 0 0 0 0 0 0 1]
10 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] [1 1 1 1 1 1 1 1 1 1] [0 0 0 0 0 0 0 0 1 1]

NumPy/Python version information:

1.22.1 3.9.10 (main, Jan 16 2022, 17:12:18)
[GCC 11.2.0]

@seberg
Copy link
Member

seberg commented Feb 1, 2022

Aside from being unreadable code due to templating, NumPy should be doing nothing special here. This should effectively be a simple contiguous loop, such as:

while (N--) {
    *(int *)out = (int)(*(double *)in);
    out += sizeof(int);
    in += sizeof(double);
}

The one little "extra" is, that the function is marked with __attribute__((optimize("O3"))).

So, I suspect there is some issue with the emulation and GCC generated highly optimized and architecture specific code? But I don't really have an additional lead (maybe someone else remembers a pointer).

EDIT: Just as a note, SIMD probably explains the pattern here, that any "left over" elements beyond multiples of 4 are handled in a non-vectorized way and thus work fine.

@seberg seberg added 57 - Close? Issues which may be closable unless discussion continued and removed 00 - Bug labels Feb 1, 2022
@ckastner
Copy link
Contributor Author

ckastner commented Feb 1, 2022

Thanks for the quick reply. Indeed, SIMD would be a reasonable explanation.

I've seen this in a VM with a userspace (based on Debian unstable) identical to a physical machine where this isn't happening, so perhaps it's the emulation after all.

I'll see what I can find out, and will close this bug soon unless somebody else closes it sooner.

@seberg
Copy link
Member

seberg commented Feb 1, 2022

Thanks, if you are trying to reproduce this in pure C, I expect putting my loop into a small function with the optimization attribute will do the trick (with in being a passed in, allocated, double array and out being an integer one).

@ckastner
Copy link
Contributor Author

ckastner commented Feb 1, 2022

I could indeed create a minimal this by putting it into a function. gcc -O2 output is fine, gcc -O3 produces results consistent with the above. I created an issue in the QEMU tracker.

Many thanks for your advice. This has been bugging me immensely. I stumbled over this in a scikit-learn build, which triggered this in scipy, and so on.

@ckastner ckastner closed this as completed Feb 1, 2022
@ckastner
Copy link
Contributor Author

ckastner commented Feb 2, 2022

@seberg: FYI, the issue was closed in QEMU with the comment that "this type punning through casts is invalid C". If that comment is true, then there might be a latent issue after all.

Either way, I'm admittedly no longer fluid in C, so I'll leave it at that.

@mattip
Copy link
Member

mattip commented Feb 2, 2022

@seiko2plus thoughts?

@seiko2plus
Copy link
Member

@mattip, I'm going to dig into it. apparently, the issue seems related to VSX instruction xvcvdpsxws which truncates vector of double-precision into a signed-word vector (f64 -> int32).

@mattip
Copy link
Member

mattip commented Feb 2, 2022

Thanks. Maybe @rafaelcfsousa who seems to know about VSX could also help?

@seiko2plus
Copy link
Member

seiko2plus commented Feb 2, 2022

Here is what GCC generate from the provided source code (convert.c) within the upstream bug, see https://godbolt.org/z/ej7naaP7G.

.L4:
        lxvd2x 33,4,9
        lxvd2x 32,6,9
        sldi 10,8,4
        addi 9,9,32
        addi 8,8,1
        xvcvdpsxws 33,33
        xvcvdpsxws 32,32
        vperm 0,0,1,13
        stxvd2x 32,5,10
        bdnz .L4
        rldicr 9,3,0,61
        rldicr 8,3,3,58
        rldicr 10,3,2,59
        cmpld 0,3,9
        subf 7,9,7
        add 4,4,8
        add 5,5,10
        beqlr 0
# vperm indices
.LC1:
        .byte   23
        .byte   22
        .byte   21
        .byte   20
        .byte   31
        .byte   30
        .byte   29
        .byte   28
        .byte   7
        .byte   6
        .byte   5
        .byte   4
        .byte   15
        .byte   14
        .byte   13
        .byte   12

GCC performs 2 loads, 2 truncate via xvcvdpsxws, then permuting the even elements and here is the bug. GCC suppose to permute the odd elements on little-endian mode since according to the Power/ISA_V2.07B:

VSX Vector truncate Double-Precision to
integer and Convert to Signed Integer Word
format with Saturate XX2-form
...
The result is placed into bits 0:31 of doubleword
element i of VSR[XT].
The contents of bits 32:63 of doubleword element
1 of VSR[XT] are undefined

see page 497 (Chapter 7. Vector-Scalar Floating-Point Operations [Category: VSX])

Now if GCC is wrong about permuting then why does it work on native hardware?

By compiling and executing the following code:

#include <stdio.h>
#include <altivec.h>

void print_simd_trunc(vector double a)
{
    vector signed int fxpt;
    __asm__ ("xvcvdpsxws %x0,%x1" : "=wa" (fxpt) : "wa" (a));
    printf("xvcvspsxws:( ");
    for (int i = 0; i < 4; ++i) {
        printf("%d, ", vec_extract(fxpt, i));
    }
    printf(")\n");
}

int main(void)
{
    print_simd_trunc((vector double) {1, 2});
    print_simd_trunc((vector double) {3, 4});
    return 0;
}

On Power[8, 9]/VM(native), I got the following result:

xvcvspsxws:( 1, 1, 2, 2, )
xvcvspsxws:( 3, 3, 4, 4, )

And on qemu(emulate):

xvcvspsxws:( 0, 1, 0, 2, )
xvcvspsxws:( 0, 3, 0, 4, )

On Power[8, 9] the contents of bits 32:63 aren't undefined as mentioned in the ISA but it contains a duplicate of the bits 0:31 and that's why GCC code works!

Funny but it seems qemu is strict in respecting the Power/ISA more than IBM and GCC do.

I think this issue should be forwarded to GCC unless IBM has a different opinion on this matter.

@ckastner,

the issue was closed in QEMU with the comment that "this type punning through casts is invalid C"

Your code is just fine, and there's no type punning. maybe they got confused by the null reinterpreting:

-                *(int *)out = (int) (*(double *)in);
+               *out = (int) (*in);

@ckastner
Copy link
Contributor Author

ckastner commented Feb 2, 2022

@seiko2plus, thank you for your remarkable resolution.

I will indeed report this to GCC, and also ping the QEMU issue again just to make them aware of this oddity.

@ckastner
Copy link
Contributor Author

Just to documented this here: I reported the issue as GCC#104353.

@edelsohn
Copy link

second programming note for xvcvdpsxws in ISA 3.1:

Previous versions of the architecture allowed the contents of words 1
and 3 of the result register to be undefined. However, all processors
that support this instruction write the result into words 0 and 1 and
words 2 and 3 of the result register, as is required by this version
of the architecture.

This needs to go back to a bug in QEMU.

@seiko2plus
Copy link
Member

Previous versions of the architecture allowed the contents of words 1
and 3 of the result register to be undefined.

xvcvdpsxws suppose to be affected by the endianness, isn't it?

@edelsohn
Copy link

The NumPy code is correct and the GCC code generation is correct.

How is endianness an issue for this?

QEMU had implemented the original ISA and had not matched the hardware. The hardware behavior was codified in the ISA as correct. QEMU needs to be updated to match the latest ISA.

@seiko2plus
Copy link
Member

The NumPy code is correct and the GCC code generation is correct.

How is endianness an issue for this?

GCC permute the even elements(0, 2) on both big/little-endian modes under flags -mcpu=power8(ISA 2.07) & -mcpu=power9(ISA 3.00) while clang & qemu respect the endianness on these architectures.

QEMU had implemented the original ISA and had not matched the hardware. The hardware behavior was codified in the ISA as correct. QEMU needs to be updated to match the latest ISA.

I didn't know this behavior changed on ISA 3.1. I agree with you that qemu should update the latest ISA but only when option -cpu power10 is enabled, Am I missing something?

@edelsohn
Copy link

The endianness is a separate issue. There also was an issue recently where cnttz_lsbb were implemented with different endianness in Clang and GCC. GCC was fixed to conform to the ISA. If there is an endianness issue for xvcvdpsxws, please open a GCC issue solely for the endianness issue.

In regard to xvcvdpsxws elements, QEMU presumably should emulate a hardware processor, not an abstract ISA. All hardware implementations behave as specified in the latest ISA. The ISA is clarified / corrected in ISA 3.1, but it is not a change in behavior limited to Power10. QEMU should emulate the hardware, which means that it should be modified to implement xvcvdpsxws as specified in ISA 3.1, which is the behavior implemented in all hardware.

@seiko2plus
Copy link
Member

The ISA is clarified / corrected in ISA 3.1, but it is not a change in behavior limited to Power10

Thank you for the clarification, it's been clear now this bug belongs to qemu.

@seiko2plus
Copy link
Member

seiko2plus commented Feb 16, 2022

@edelsohn, Clang/LLVM needs to update their VSX intrinsics and remove the unnecessary shuffles, take a look at the following code:

https://github.com/llvm/llvm-project/blob/da7c77b82c217592cc14f5b5a3c6a9e6741896af/clang/lib/Headers/altivec.h#L3465-L3484

And in general, there's no need for VSX intrinsics like vec_signede, vec_signedo, vec_unsignede, vec_unsignedo, vec_floate, and vec_floato. just keep one version for each conversion rather than provide one for even elements and another for odd, it seems everyone was following the ISA without checking the actual hardware behavior, not just qemu.

@henryiii
Copy link
Contributor

henryiii commented Feb 16, 2022

In #21062, you can see float64 -> float32 is broken on emulation too:

>>> import numpy as np
>>> s = np.random.normal(size=4); s
array([ 0.04760427, -0.11949105, -1.67610564,  1.38181202])
>>> s.astype(np.float32)
array([ 4.7665749e-18,  1.9329982e-10,  6.0247680e-16, -6.2795307e+26],
      dtype=float32)
>>> np.array([1., 2., 3., 1.00001]).astype(np.float32)
array([0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 4.5436204e+36],
      dtype=float32)

@edelsohn
Copy link

Can you open an issue in the LLVM Github repository about the unnecessary shuffles?

@seiko2plus
Copy link
Member

seiko2plus commented Feb 16, 2022

@edelsohn,

Can you open an issue in the LLVM Github repository about the unnecessary shuffles?

sure, I will. also to deprecate the following intrinsics:
vec_signede, vec_signedo, vec_unsignede, vec_unsignedo, vec_floate, and vec_floato

And not just clang, GCC performs the same too. Take a look:
https://godbolt.org/z/hcx4zvnoh

@seiko2plus
Copy link
Member

@henryiii,
Thank you for updating, any SIMD operation that performs any kind of conversion from 64-bit to 32-bit is affected by this bug.

@edelsohn
Copy link

GCC inserts endian swap code to match the instruction semantics. When GCC can provide that the endian swap code will cancel out or is unnecessary, it can avoid the endian swap code. Are you suggesting that the endian swap is wrong, or never is necessary or that the compiler should be able to provide that it is unnecessary for that function? You should open GCC and LLVM issues and explain why the compiler should know that the swap is unnecessary.

@seiko2plus
Copy link
Member

seiko2plus commented Feb 16, 2022

@edelsohn,

Are you suggesting that the endian swap is wrong, or never is necessary or that the compiler should be able to provide that it is unnecessary for that function?

according to ISA 3.1 update

However, all processors that support this instruction write the result into words 0 and 1 and
words 2 and 3 of the result register, as is required by this version
of the architecture.

Then yes, all shuffles that related to permute even/odd depend on the endianness is unnecessary.

@seiko2plus
Copy link
Member

@edelsohn, informed both projects, llvm-project and GCC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
57 - Close? Issues which may be closable unless discussion continued
Projects
None yet
Development

No branches or pull requests

6 participants