BUG: Integer conversion misbehavior on QEMU ppc64le #20964
Comments
Aside from being unreadable code due to templating, NumPy should be doing nothing special here. This should effectively be a simple contiguous loop, such as:
The one little "extra" is that the function is marked with So, I suspect there is some issue with the emulation, and GCC generated highly optimized, architecture-specific code? But I don't really have an additional lead (maybe someone else remembers a pointer). EDIT: Just as a note, SIMD probably explains the pattern here: any "left over" elements beyond multiples of 4 are handled in a non-vectorized way and thus work fine. |
Thanks for the quick reply. Indeed, SIMD would be a reasonable explanation. I've seen this in a VM with a userspace (based on Debian unstable) identical to a physical machine where this isn't happening, so perhaps it's the emulation after all. I'll see what I can find out, and will close this bug soon unless somebody else closes it sooner. |
Thanks, if you are trying to reproduce this in pure C, I expect putting my loop into a small function with the optimization attribute will do the trick (with |
I could indeed reproduce this minimally by putting it into a function. Many thanks for your advice. This has been bugging me immensely. I stumbled over this in a scikit-learn build, which triggered this in scipy, and so on. |
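A minimal sketch of the kind of pure-C reproducer discussed above. The names `convert` and `count_mismatches` are hypothetical, and the optimization attribute mentioned earlier is not reproduced here; the assumption is that building with a high optimization level on ppc64le is what lets the compiler vectorize the loop.

```c
#include <stddef.h>

/* Hypothetical stand-in for the contiguous cast loop: converts n doubles
 * to ints one element at a time. An optimizing compiler may vectorize it. */
void convert(const double *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = (int)in[i];
    }
}

/* Converts an array of n ones (n is clamped to 16) and returns how many
 * results differ from the expected value 1. */
size_t count_mismatches(size_t n)
{
    double in[16];
    int out[16];
    size_t bad = 0;

    if (n > 16) {
        n = 16;
    }
    for (size_t i = 0; i < n; ++i) {
        in[i] = 1.0;
        out[i] = -1; /* sentinel, so an unwritten slot is also detected */
    }
    convert(in, out, n);
    for (size_t i = 0; i < n; ++i) {
        if (out[i] != 1) {
            ++bad;
        }
    }
    return bad;
}
```

Calling `count_mismatches(n)` for `n` from 1 through 10 mirrors the array lengths used in the report; on a correct platform every call should return 0.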
@seberg: FYI, the issue was closed in QEMU with the comment that "this type punning through casts is invalid C". If that comment is true, then there might be a latent issue after all. Either way, I'm admittedly no longer fluent in C, so I'll leave it at that. |
@seiko2plus thoughts? |
@mattip, I'm going to dig into it. apparently, the issue seems related to VSX instruction |
Thanks. Maybe @rafaelcfsousa who seems to know about VSX could also help? |
Here is what GCC generates from the provided source code (convert.c) within the upstream bug, see https://godbolt.org/z/ej7naaP7G.
.L4:
lxvd2x 33,4,9
lxvd2x 32,6,9
sldi 10,8,4
addi 9,9,32
addi 8,8,1
xvcvdpsxws 33,33
xvcvdpsxws 32,32
vperm 0,0,1,13
stxvd2x 32,5,10
bdnz .L4
rldicr 9,3,0,61
rldicr 8,3,3,58
rldicr 10,3,2,59
cmpld 0,3,9
subf 7,9,7
add 4,4,8
add 5,5,10
beqlr 0
# vperm indices
.LC1:
.byte 23
.byte 22
.byte 21
.byte 20
.byte 31
.byte 30
.byte 29
.byte 28
.byte 7
.byte 6
.byte 5
.byte 4
.byte 15
.byte 14
.byte 13
.byte 12
GCC performs 2 loads, 2 truncations via
Now, if GCC is wrong about permuting, then why does it work on native hardware? By compiling and executing the following code:
#include <stdio.h>
#include <altivec.h>
void print_simd_trunc(vector double a)
{
vector signed int fxpt;
__asm__ ("xvcvdpsxws %x0,%x1" : "=wa" (fxpt) : "wa" (a));
printf("xvcvspsxws:( ");
for (int i = 0; i < 4; ++i) {
printf("%d, ", vec_extract(fxpt, i));
}
printf(")\n");
}
int main(void)
{
print_simd_trunc((vector double) {1, 2});
print_simd_trunc((vector double) {3, 4});
return 0;
}
On Power[8, 9]/VM (native), I got the following result:
xvcvspsxws:( 1, 1, 2, 2, )
xvcvspsxws:( 3, 3, 4, 4, )
And on QEMU (emulated):
xvcvspsxws:( 0, 1, 0, 2, )
xvcvspsxws:( 0, 3, 0, 4, )
On Power[8, 9], the contents of bits 32:63 aren't undefined, as the ISA says they are; instead they contain a duplicate of bits 0:31, and that's why the GCC code works! Funny, but it seems QEMU respects the Power ISA more strictly than IBM and GCC do. I think this issue should be forwarded to GCC unless IBM has a different opinion on this matter.
Your code is just fine, and there's no type punning. Maybe they got confused by the no-op reinterpreting casts:
- *(int *)out = (int) (*(double *)in);
+ *out = (int) (*in); |
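The difference between the two outputs above can be modeled in portable C. This is only a sketch: `cvt_hw_model`, `cvt_emu_model`, and `pick_even` are hypothetical names, the element order follows the printed `vec_extract` output, and the "pick elements 0 and 2" step is a simplified stand-in for the permute in the generated code.

```c
#include <stdint.h>

/* Model of the output seen on POWER8/9 hardware: each truncated value
 * appears in both 32-bit words of its doubleword, e.g. (1, 1, 2, 2). */
void cvt_hw_model(const double in[2], int32_t out[4])
{
    out[0] = out[1] = (int32_t)in[0];
    out[2] = out[3] = (int32_t)in[1];
}

/* Model of the output seen under emulation: only one word of each
 * doubleword holds the result, the other is zero, e.g. (0, 1, 0, 2). */
void cvt_emu_model(const double in[2], int32_t out[4])
{
    out[0] = 0;
    out[1] = (int32_t)in[0];
    out[2] = 0;
    out[3] = (int32_t)in[1];
}

/* Simplified stand-in for the permute: gathers elements 0 and 2 of a,
 * followed by elements 0 and 2 of b. */
void pick_even(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    out[0] = a[0];
    out[1] = a[2];
    out[2] = b[0];
    out[3] = b[2];
}
```

Under the hardware model, picking elements 0 and 2 recovers the converted values; under the emulation model, the same selection yields only zeros. That matches the observation that the vectorized part of the conversion goes wrong only in the VM.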
@seiko2plus, thank you for your remarkable resolution. I will indeed report this to GCC, and also ping the QEMU issue again just to make them aware of this oddity. |
Just to document this here: I reported the issue as GCC#104353. |
second programming note for xvcvdpsxws in ISA 3.1: Previous versions of the architecture allowed the contents of words 1 This needs to go back to a bug in QEMU. |
The NumPy code is correct and the GCC code generation is correct. How is endianness an issue for this? QEMU had implemented the original ISA and had not matched the hardware. The hardware behavior was codified in the ISA as correct. QEMU needs to be updated to match the latest ISA. |
GCC permutes the even elements (0, 2) in both big- and little-endian modes under flags
I didn't know this behavior changed in ISA 3.1. I agree with you that QEMU should be updated to the latest ISA, but only when option |
The endianness is a separate issue. There also was an issue recently where cnttz_lsbb were implemented with different endianness in Clang and GCC. GCC was fixed to conform to the ISA. If there is an endianness issue for In regard to |
Thank you for the clarification; it's clear now that this bug belongs to QEMU. |
@edelsohn, Clang/LLVM needs to update their VSX intrinsics and remove the unnecessary shuffles, take a look at the following code: And in general, there's no need for VSX intrinsics like |
In #21062, you can see that float64 -> float32 conversion is broken under emulation too:
>>> import numpy as np
>>> s = np.random.normal(size=4); s
array([ 0.04760427, -0.11949105, -1.67610564, 1.38181202])
>>> s.astype(np.float32)
array([ 4.7665749e-18, 1.9329982e-10, 6.0247680e-16, -6.2795307e+26],
dtype=float32)
>>> np.array([1., 2., 3., 1.00001]).astype(np.float32)
array([0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 4.5436204e+36],
dtype=float32) |
Can you open an issue in the LLVM Github repository about the unnecessary shuffles? |
Sure, I will, and I'll also propose deprecating the following intrinsics: And it's not just Clang; GCC does the same. Take a look: |
@henryiii, |
GCC inserts endian swap code to match the instruction semantics. When GCC can prove that the endian swap code will cancel out or is unnecessary, it can avoid the endian swap code. Are you suggesting that the endian swap is wrong, that it is never necessary, or that the compiler should be able to prove that it is unnecessary for that function? You should open GCC and LLVM issues and explain why the compiler should know that the swap is unnecessary. |
according to ISA 3.1 update
Then yes, all endianness-dependent shuffles that permute even/odd elements are unnecessary. |
@edelsohn, I've informed both projects, llvm-project and GCC. |
Describe the issue:
[I want to begin by first acknowledging that this is a fringe case of a bug appearing only in a VM (I haven't yet been able to reproduce it on real hardware), so please feel free to close this issue right away. However, I felt that I still should at least point it out, in case there is some latent real issue that this might uncover. In any case, hints as to how I can dig deeper into this so that I may report it to the proper channel (be it QEMU, GNU libc, or whatever) would be welcome.]
On a QEMU ppc64le VM, when casting an array of floats to np.intc, a weird truncation seems to happen. This only occurs in the VM; it does not happen on the real ppc64le hardware I had access to (although that system had an older 4.19 kernel; I cannot change that). In the code below, I create arrays of value 1. of length 1 through 10, and then cast them to np.int64 and np.intc for comparison. This works fine for arrays with length <= 3. It fails for longer ones, and apparently there is a pattern in this failure.
Reproduce the code example:
Error message:
NumPy/Python version information:
1.22.1 3.9.10 (main, Jan 16 2022, 17:12:18)
[GCC 11.2.0]