Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent Output at -O1 and -O2 Optimization Levels on PowerPC64 Due to Complex Type Casting and Nested Loop Structure #71030

Open
gyuminb opened this issue Nov 2, 2023 · 8 comments

Comments

@gyuminb
Copy link

gyuminb commented Nov 2, 2023

Description:

The Proof-of-Concept (PoC) code provided demonstrates an inconsistency in the computed results of an unsigned long long int variable when compiled using Clang-18 for the PowerPC64 architecture. The discrepancy is observed specifically under the optimization levels -O1 and -O2. The output of computedResultUll displays inconsistency as shown below:

Computed Result (ULL): ffffffffffffffff
Computed Result (ULL): ffffffff
Computed Result (ULL): ffffffff
Computed Result (ULL): ffffffffffffffff

Environment:

  • Compiler: Clang-18
  • Target Architecture: PowerPC64
  • Optimization Level: This issue is exclusively observed at O1 and O2 optimization levels.

PoC:

#include <stdio.h>

// Define a macro to find the minimum value between two numbers
#define MIN(a,b) \
    ({ __typeof__ (a) _a = (a); \
       __typeof__ (b) _b = (b); \
       _a < _b ? _a : _b; })

// Global variables
short globalShortValue = (short)1;
signed char globalCharValue = (signed char)0;
unsigned long long largeNumber = 14782061517590169264ULL;
int someIntValue = 378441747;
unsigned int unitIncrement = 1U;

// Variables to store the results of computations
unsigned long long computedResultUll = 0ULL;
short computedResultShort = (short)0;
unsigned char computedResultUChar = (unsigned char)0;
_Bool computedResultBool = (_Bool)0;
unsigned char computedResultChar = (unsigned char)0;

// Arrays used in computations
short shortArray[8];
long long int longArray[8][8];
int intArray[8][8][8];
unsigned long long ullArray[8];
signed char charArray[8][8][8];
short resultArray[8][8];

// Initialize arrays
void initializeArrays() {
    for (size_t i = 0; i < 8; ++i) {
        shortArray[i] = (short)-1;
        ullArray[i] = 1;
        for (size_t j = 0; j < 8; ++j) {
            longArray[i][j] = 0LL;
            resultArray[i][j] = (short)1;
            for (size_t k = 0; k < 8; ++k) {
                intArray[i][j][k] = 0;
                charArray[i][j][k] = (signed char)0;
            }
        }
    }
}

int main() {
    initializeArrays();
    // Main loop for computations
    for (short index = 3; index < ((int) (short) largeNumber) - 1705/*7*/; index += 4) {
        computedResultUll = (unsigned long long) ((int) MIN(globalShortValue, shortArray[index])); // Potential issue here
        for (int i = 0; i < 8; i++) {
            for (signed char j = ((int) (signed char) someIntValue) - 19/*0*/; j < 8; j += 4) {
                for (long long int k = 2; k < 4; k += 4) {
                    computedResultShort -= (short) unitIncrement;
                    computedResultUChar = (unsigned char) (_Bool) MIN((short) globalCharValue, shortArray[index - 1]);
                    charArray[2][0][index] = (signed char) globalShortValue;
                    resultArray[0][0] &= (short) longArray[0][j];
                }
                for (int l = 1; l < 7; l++) {
                    computedResultBool = (_Bool) (ullArray[index]);
                    computedResultChar -= (unsigned char) intArray[0][index][j];
                }
            }
        }
    }
    // Print the result
    printf("Computed Result (ULL): %llx\n", computedResultUll);
}

Expected Behavior:

Regardless of the optimization level, the value of computedResultUll should be consistently and accurately computed as an unsigned long long int.

Observed Behavior:

When compiled with Clang-18 under -O1 and -O2 optimization levels, the computed value for computedResultUll shows inconsistency:

Computed Result (ULL): ffffffffffffffff
Computed Result (ULL): ffffffff
Computed Result (ULL): ffffffff
Computed Result (ULL): ffffffffffffffff

Analysis:

The inconsistency is identified under the following conditions:

  1. Nested Loops and Type Casting: The issue is seen in a complex nested loop structure where boundary conditions involve type casting.
  2. Array Operations: The presence of operations involving multiple arrays seems to be a contributing factor.
  3. Unsigned Type Extension: The primary concern appears to be an incorrect extension of a value to unsigned long long int.

These conditions are specific and intricate, but the inconsistency is notable and is not attributed to Undefined Behavior.

Steps to Reproduce:

  1. Compile the PoC code using Clang-18 targeting PowerPC64 with O1 and O2 optimization levels.
  2. Run the compiled binary.
  3. Observe the inconsistency in the output for computedResultUll.

Conclusion:

The observed inconsistency in extending values to unsigned long long int, under specific conditions involving complex loop structures and type casting operations, when compiled using Clang-18 at -O1 and -O2 optimization levels, warrants further investigation and resolution.

@llvmbot
Copy link
Collaborator

llvmbot commented Nov 2, 2023

@llvm/issue-subscribers-backend-powerpc

Author: None (gyuminb)

### **Description:**

The Proof-of-Concept (PoC) code provided demonstrates an inconsistency in the computed results of an unsigned long long int variable when compiled using Clang-18 for the PowerPC64 architecture. The discrepancy is observed specifically under the optimization levels -O1 and -O2. The output of computedResultUll displays inconsistency as shown below:

Computed Result (ULL): ffffffffffffffff
Computed Result (ULL): ffffffff
Computed Result (ULL): ffffffff
Computed Result (ULL): ffffffffffffffff

Environment:

  • Compiler: Clang-18
  • Target Architecture: PowerPC64
  • Optimization Level: This issue is exclusively observed at O1 and O2 optimization levels.

PoC:

#include &lt;stdio.h&gt;

// Define a macro to find the minimum value between two numbers
#define MIN(a,b) \
    ({ __typeof__ (a) _a = (a); \
       __typeof__ (b) _b = (b); \
       _a &lt; _b ? _a : _b; })

// Global variables
short globalShortValue = (short)1;
signed char globalCharValue = (signed char)0;
unsigned long long largeNumber = 14782061517590169264ULL;
int someIntValue = 378441747;
unsigned int unitIncrement = 1U;

// Variables to store the results of computations
unsigned long long computedResultUll = 0ULL;
short computedResultShort = (short)0;
unsigned char computedResultUChar = (unsigned char)0;
_Bool computedResultBool = (_Bool)0;
unsigned char computedResultChar = (unsigned char)0;

// Arrays used in computations
short shortArray[8];
long long int longArray[8][8];
int intArray[8][8][8];
unsigned long long ullArray[8];
signed char charArray[8][8][8];
short resultArray[8][8];

// Initialize arrays
void initializeArrays() {
    for (size_t i = 0; i &lt; 8; ++i) {
        shortArray[i] = (short)-1;
        ullArray[i] = 1;
        for (size_t j = 0; j &lt; 8; ++j) {
            longArray[i][j] = 0LL;
            resultArray[i][j] = (short)1;
            for (size_t k = 0; k &lt; 8; ++k) {
                intArray[i][j][k] = 0;
                charArray[i][j][k] = (signed char)0;
            }
        }
    }
}

int main() {
    initializeArrays();
    // Main loop for computations
    for (short index = 3; index &lt; ((int) (short) largeNumber) - 1705/*7*/; index += 4) {
        computedResultUll = (unsigned long long) ((int) MIN(globalShortValue, shortArray[index])); // Potential issue here
        for (int i = 0; i &lt; 8; i++) {
            for (signed char j = ((int) (signed char) someIntValue) - 19/*0*/; j &lt; 8; j += 4) {
                for (long long int k = 2; k &lt; 4; k += 4) {
                    computedResultShort -= (short) unitIncrement;
                    computedResultUChar = (unsigned char) (_Bool) MIN((short) globalCharValue, shortArray[index - 1]);
                    charArray[2][0][index] = (signed char) globalShortValue;
                    resultArray[0][0] &amp;= (short) longArray[0][j];
                }
                for (int l = 1; l &lt; 7; l++) {
                    computedResultBool = (_Bool) (ullArray[index]);
                    computedResultChar -= (unsigned char) intArray[0][index][j];
                }
            }
        }
    }
    // Print the result
    printf("Computed Result (ULL): %llx\n", computedResultUll);
}

Expected Behavior:

Regardless of the optimization level, the value of computedResultUll should be consistently and accurately computed as an unsigned long long int.

Observed Behavior:

When compiled with Clang-18 under -O1 and -O2 optimization levels, the computed value for computedResultUll shows inconsistency:

Computed Result (ULL): ffffffffffffffff
Computed Result (ULL): ffffffff
Computed Result (ULL): ffffffff
Computed Result (ULL): ffffffffffffffff

Analysis:

The inconsistency is identified under the following conditions:

  1. Nested Loops and Type Casting: The issue is seen in a complex nested loop structure where boundary conditions involve type casting.
  2. Array Operations: The presence of operations involving multiple arrays seems to be a contributing factor.
  3. Unsigned Type Extension: The primary concern appears to be an incorrect extension of a value to unsigned long long int.

These conditions are specific and intricate, but the inconsistency is notable and is not attributed to Undefined Behavior.

Steps to Reproduce:

  1. Compile the PoC code using Clang-18 targeting PowerPC64 with O1 and O2 optimization levels.
  2. Run the compiled binary.
  3. Observe the inconsistency in the output for computedResultUll.

Conclusion:

The observed inconsistency in extending values to unsigned long long int, under specific conditions involving complex loop structures and type casting operations, when compiled using Clang-18 at -O1 and -O2 optimization levels, warrants further investigation and resolution.

@stefanp-ibm
Copy link
Contributor

I've taken a quick look at the assembly for this test.

For O2 we produce the following code to compute the MIN for computedResultUll.

        lwz 4, 100(1)                           # 4-byte Folded Reload
        extsh 3, 10
        cmpw    3, 4
        isellt  4, 3, 4
        addis 3, 2, computedResultUll@toc@ha
        std 4, computedResultUll@toc@l(3)

For O3 we produce the following code:

        extsh 4, 21
        extsh 3, 3
        cmpw    3, 4
        isellt  4, 3, 4
        addis 3, 2, computedResultUll@toc@ha
        std 4, computedResultUll@toc@l(3)

For O1

        lhax 17, 5, 17
<... lots of code in here ...>   
        extsh 3, 11
        addis 4, 2, computedResultUll@toc@ha
        cmpw    3, 17
        addi 4, 4, computedResultUll@toc@l
        isellt  3, 3, 17
        std 3, 0(4)

The bottom line is that for -O2 we seem to use a load with zero extend which is why we have the different result.

@stefanp-ibm
Copy link
Contributor

stefanp-ibm commented Nov 3, 2023

Looks like we are changing LHAX to LWZ in the register allocator. Is this an attempt to reduce live range?

# *** IR Dump Before Greedy Register Allocator (greedy) ***:

<...>
1968B   bb.3.for.cond.for.cond.cleanup_crit_edge:
        ; predecessors: %bb.6
          successors: %bb.4(0x80000000); %bb.4(100.00%)

1984B     %223:gprc_and_gprc_nor0 = EXTSH %4:gprc
2000B     %224:crrc = CMPW %223:gprc_and_gprc_nor0, %222:gprc_and_gprc_nor0
2016B     undef %232.sub_32:g8rc = ISEL %223:gprc_and_gprc_nor0, %222:gprc_and_gprc_nor0, %224.sub_lt:crrc
2048B     %226:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @computedResultUll
2064B     STD %232:g8rc, target-flags(ppc-toc-lo) @computedResultUll, %226:g8rc_and_g8rc_nox0, implicit $x2 :: (store (s64) into @computedResultUll, !tbaa !9)

<... Different BB ... >
2544B     %222:gprc_and_gprc_nor0 = LHAX %78:g8rc_and_g8rc_nox0, %158:g8rc :: (load (s16) from %ir.arrayidx, !tbaa !5)
# *** IR Dump After Greedy Register Allocator (greedy) ***:
<...>
2000B   bb.3.for.cond.for.cond.cleanup_crit_edge:
        ; predecessors: %bb.6
          successors: %bb.4(0x80000000); %bb.4(100.00%)

2016B     %223:gprc_and_gprc_nor0 = EXTSH %4:gprc
2024B     %285:gprc_and_gprc_nor0 = LWZ 0, %stack.5 :: (load (s32) from %stack.5)
2032B     %224:crrc = CMPW %223:gprc_and_gprc_nor0, %285:gprc_and_gprc_nor0
2048B     undef %232.sub_32:g8rc = ISEL %223:gprc_and_gprc_nor0, %285:gprc_and_gprc_nor0, %224.sub_lt:crrc
2080B     %226:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @computedResultUll
2096B     STD %232:g8rc, target-flags(ppc-toc-lo) @computedResultUll, %226:g8rc_and_g8rc_nox0, implicit $x2 :: (store (s64) into @computedResultUll, !tbaa !9)

<... Different BB ... >
2608B     %286:gprc_and_gprc_nor0 = LHAX %271:g8rc_and_g8rc_nox0, %158:g8rc :: (load (s16) from %ir.arrayidx, !tbaa !5)
2616B     STW %286:gprc_and_gprc_nor0, 0, %stack.5 :: (store (s32) into %stack.5)

Should that be an STD and LD instead of STW and LWZ?

@fhahn
Copy link
Contributor

fhahn commented Nov 3, 2023

@gyuminb Did you check that the code is UB free using UBSan? Seems like for (short index = 3; index < ((int) (short) largeNumber) - 1705/*7*/; index += 4) { might be converting a unsigned value to a signed value that doesn't fit in a short.

@gyuminb
Copy link
Author

gyuminb commented Nov 4, 2023

@gyuminb Did you check that the code is UB free using UBSan? Seems like for (short index = 3; index < ((int) (short) largeNumber) - 1705/*7*/; index += 4) { might be converting a unsigned value to a signed value that doesn't fit in a short.

Yes, when I checked the code using the -fsanitize=undefined option, there were no instances of undefined behavior (UB).

@nemanjai
Copy link
Member

This is a bug in PPCMIPeepholes. It is quite subtle, but a bug nonetheless. This is probably why it requires all the complexity in the test case. Here's the gist of it:

  1. We have two i16 that are inputs to a select_cc which is then sign extended to i64
  2. We select the two inputs to something, followed by EXTSH and the result is extended with EXTSW_32_64
  3. In the MI Peephole, we then check if the EXTSW_32_64 is fed by an operation that extends 32 to 64 and because the two EXTSH (or whatever they're transformed to) are marked as such, the EXTSW32_64 is converted to an INSERT_SUBREG
  4. The register allocator then spills the EXTSH (or the LHA that it is converted to) and because it is a GPRC register, it is spilled (and reloaded) using STW/LWZ

Ultimately, we have to remove all the SExt32To64 decorations from $LLVM_SRC/lib/Target/PowerPC/PPCInstrInfo.td because when they are spilled, they will be reloaded using a zero-extending load.

In order to prevent lost opportunities to remove redundant sign-extend instructions, perhaps the peephole can be modified to look at the use of instructions that sign-extend to 32-bits and if all the uses will then sign-extend, then convert the instruction to a sign-extend to 64-bits and update the uses to 64-bit uses. But that's a bit more involved.

@diggerlin diggerlin self-assigned this Jan 29, 2024
@diggerlin
Copy link
Contributor

diggerlin commented Feb 5, 2024

I compare the asm code. the different is at
the error output
r10 has the vaule of shortArray[3], when it store to stack, it only store 4 byte.
lha 10, 6(31) 6(31) is shortArray[3] ,
.....
stw 10, 124(1) # 4-byte Folded Spill,

when it load the value from stack , it only load 4 bytes

L..BB0_11:                              # %for.cond.cleanup15
   lwz 4, 124(1)                           # 4-byte Folded Reload , shortArray[3] = 0xFFFF,
   extsh 3, 12                             # globalShortValue is 1
   cmpw	3, 4.
   isellt	4, 3, 4
   ld 3, L..C16(2)                         # @computedResultUll
   std 4, 0(3)                             # store r4 to computedResultUll
   ld 3, L..C2(2)                          # @_MergedGlobals
   bl .printf[PR]
   

the correct one:

   	lha 4, 6(29)                            # @shortArray[3]
	lha 3, 0(28)                            # content of @globalShortValue
	cmpw	3, 4
	isellt	30, 3, 4                            int tmp in r30 
        std 30, 0(11)                           store R30(tmp) into computedResultUll
  	std 30, 112(1)                          # 8-byte Folded Spill    int tmp in r30 , 112(1)
  	
L..BB0_11:                              # %for.cond.cleanup16
	ld 3, 120(1)                            # 8-byte Folded Reload
	ld 4, 112(1)                            # 8-byte Folded Reload ,   shortArray[3] 
	addi 3, 3, 4
	bl .printf[PR]

@diggerlin
Copy link
Contributor

diggerlin commented Mar 19, 2024

The bug only happen in 64bit mode. In the PPCMIPeepholes optimization , if there are a instruction extsw (EXTSW_32_64), it may be eliminated after optimization when the following condition met.

  1. the instruction which defined the register RS which is used by extsw RA,RS is already a signed extend instruction. for example:
   LHA 4, 2(2)         # if the content in the memory address 2(r2) is -1, so the r4 will 0xFFFFFFFFFFFFFFFF in 64bit mode
   EXTSW 5, 4        #  so the r5 will 0xFFFFFFFFFFFFFFFF. 
   ADDI  6, 5, 3      # the content of r6 is 2.

LHA will be lower to lha in instruction selection, since lha is 16bit -> 64bit signed extend in 64 bit mode application, after the PPCMIPeepholes, the code will change to

LHA 4, 2(2)   # if the content in the memory address 2(r2) is -1, so the r4 will 0xFFFFFFFFFFFFFFFF. 
ADDI 6, 4, 3  # the content of r6 is 2.

2 . the instruction defined the register RS which is used by extsw RA,RS is not a signed extend instruction. but the content of register RS defined in the instruction is 64 bit signed extend(which is deduced by the logic flow of the instructions)

for example:

        LHA       4, 2(2)        # if the content in the memory address 2(2) is -1, so the r4 will 0xFFFFFFFFFFFFFFFF. 
	EXTSH    3, 12         # if the content in   r12 is 1.                 
	CMPW   3, 4.          # compare r3 and r4
	ISEL	      4, 3, 4       # r4 will be 0xFFFFFFFFFFFFFFFF
	EXTSW   5,4            # r5 will be 0xFFFFFFFFFFFFFFFF
        ADDI      6, 5,3        # r6 will be 2

in the scenario, the extsw still can be eliminated by PPCMIPeepholes even if the instruction isellt is not signed extended instruction, but the Registers which used in isellt is defined by 64 signed extended instruction: lha and extsh

the code can be optimize to

        lha 4, 2(2)         # if the content in the memory address 2(r2) is -1, so the r4 will 0xFFFFFFFFFFFFFFFF. 
	extsh 3, 12        # if the content in   r12 is 1.                            
	cmpw	3, 4.   # compare r3 and r4
	isellt	4, 3, 4        # r4 will be 0xFFFFFFFFFFFFFFFF
        addi 6, 4,3         # r6 will be 2

But in some special situation, there is the problem when there is spill happen. In the example 1 (which is snippet code of a function in 64bit mode, which has spill on r4),

       LHA 4, 2(2)   # if the content in the memory address 2(2) is -1, so the r4 will 0xFFFFFFFFFFFFFFFF. 
       ADDI 6, 4, 3  # the content of r6 is 2 =3  + (-1) .

will change to following code after spill. it spill r4 into memory (4 bytes) with STW since the LHA is 32 bit pseudo instruction, the length of Register is 32 bit,

           LHA 4, 2(2)   # if the content in the memory address 2(2) is -1, so the r4 will 0xFFFFFFFFFFFFFFFF. 
           STW 4   8(1)   # spill r4 to memory 8(1),  since the LHA is 32 bit pseudo instruction, the length of Register is 32 bit, 
                                # it will store the 0xFFFFFFFF to  8(1).
             ....
           LWZ 4, 8(1)     # reload r4 from memory, the r4 will 0x00000000FFFFFFFF
           ADDI 6, 4, 3  # the content of r6 is 3+0x00000000FFFFFFFF, not 2 any more.

to fix the problem , we need to promote the LHA to LHA8 which is 64-bit REGISTER

        LHA 4, 2(2)         # if the content in the memory address 2(2) is -1, so the r4 will 0xFFFFFFFFFFFFFFFF. 
       EXTSW 5, 4        #  so the r5 will 0xFFFFFFFFFFFFFFFF. 
       ADDI  6, 5, 3      # the content of r6 is 2.

will be changed to

         LHA8 4, 2(2)   # if the content in the memory address 2(2) is -1, so the r4 will 0xFFFFFFFFFFFFFFFF. 
         ADDI 6, 4, 3  # the content of r6 is 2.

if there is a spill between the LHA8 4, 2(2) and ADDI 6, 4, 3, the code will be change to following after spill , it spill r4 into memory (8 bytes) with STD since the LHA8 is 64 bit pseudo instruction, the length of Register is 64 bit,

           LHA8 4, 2(2)   # if the content in the memory address 2(2) is -1, so the r4 will 0xFFFFFFFFFFFFFFFF. 
           std 4   8(1)   # spill r4 to memory 8(1),  since the LHA8 is 64 bit instruction, the length of Register is 64 bit, 
                                # it will store the 0xFFFFFFFFFFFFFFFF to  8(1).
             ....
           ld  4, 8(1)     # reload r4 from memory, the r4 will 0xFFFFFFFFFFFFFFF
           ADDI 6, 4, 3  # the content of r6 is 2

in example 2:

        LHA       4, 2(2)        # if the content in the memory address 2(2) is -1, so the r4 will 0xFFFFFFFFFFFFFFFF. 
	EXTSH    3, 12         # if the content in   r12 is 1.                 
	CMPW   3, 4.          # compare r3 and r4
	ISEL	      4, 3, 4       # r4 will be 0xFFFFFFFFFFFFFFFF
	EXTSW   5,4            # r5 will be 0xFFFFFFFFFFFFFFFF
        ADDI      6, 5,3        # r6 will be 2

we do not know when the spill will be happen ,All these instructions in the chain used to deduce sign extension to eliminate the 'extsw' will need to be promoted to 64-bit pseudo instructions. We need to promote the EXTSH, LHA, ISEL to EXTSH8, LHA8, ISEL8

diggerlin added a commit that referenced this issue Apr 30, 2024
Add pre-commit MIR test for PR "[Promote Pseudo Opcode from 32-bit to
64-bit after eliminating the extsw instruction in PPCMIPeepholes
optimization](#85451)" which
fixes bug reported in the issue "[Inconsistent Output at -O1 and -O2
Optimization Levels on PowerPC64 Due to Complex Type Casting and Nested
Loop Structure](#71030)".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants