Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

powerpc: don't use asm optimisation with clang #210

Closed
wants to merge 1 commit into from
Closed

powerpc: don't use asm optimisation with clang #210

wants to merge 1 commit into from

Conversation

julianaito
Copy link

@julianaito julianaito commented Apr 28, 2020

OpenBSD and FreeBSD use clang to build MilkyTracker, and this compiler does not like the asm optimisations used for ppc32/64.

On OpenBSD/macppc (but it seems to be the same on FreeBSD/powerpc), this generates these errors:

conv.c:78:51: error: invalid operand for instruction
DEFINE_CLIPCONVERT_POWERPC(s8,f32, -128.0, 127.0, LFSUX, LBZ_STBUX)
                                                  ^
conv.c:65:16: note: expanded from macro 'LFSUX'
#define LFSUX   "1:     lfsux f0,%1,%6          \n"
                 ^
<inline asm>:4:10: note: instantiated into assembly here
1:      lfsux f0,5,6            
              ^
conv.c:78:1: error: invalid operand for instruction
DEFINE_CLIPCONVERT_POWERPC(s8,f32, -128.0, 127.0, LFSUX, LBZ_STBUX)
^
conv.c:48:4: note: expanded from macro 'DEFINE_CLIPCONVERT_POWERPC'
                "       fsub f1,f0,%3           \n"                     \
                 ^
<inline asm>:5:7: note: instantiated into assembly here
        fsub f1,f0,2        
(and more)    

Using -fno-integrated-as did not solve the issue, it emits relocations errors. I can't test every platform using clang on ppc32/64 so maybe it would need further OS-filtering, but i wanted more opinions before eventually polishing that PR, or using some different code if my proposal is considered a bad idea :)

CC'ing @pkubaj if he's interested.

This allows MilkyTracker to be built with that compiler on this arch.
@pkubaj
Copy link

pkubaj commented Apr 28, 2020

Building with GCC9 also fails (FreeBSD 12.1-RELEASE on powerpc64, base compiler is GCC 4.2.1):

/usr/local/bin/g++9   -I/wrkdirs/usr/ports/audio/milkytracker/work/MilkyTracker-1.02.00/src/ppui/osinterface/posix -I/usr/local/include/SDL2 -I/wrkdirs/usr/ports/audio/milkytracker/work/MilkyTracker-1.02.00/src/ppui -I/wrkdirs/usr/ports/audio/milkytracker/work/MilkyTracker-1.02.00/src/ppui/osinterface -O2 -pipe  -fstack-protector-strong -Wl,-rpath=/usr/local/lib/gcc9  -Wl,-rpath=/usr/local/lib/gcc9 -O2 -pipe  -fstack-protector-strong -Wl,-rpath=/usr/local/lib/gcc9  -Wl,-rpath=/usr/local/lib/gcc9 -MD -MT src/ppui/CMakeFiles/ppui.dir/Graphics_15BIT.cpp.o -MF src/ppui/CMakeFiles/ppui.dir/Graphics_15BIT.cpp.o.d -o src/ppui/CMakeFiles/ppui.dir/Graphics_15BIT.cpp.o -c /wrkdirs/usr/ports/audio/milkytracker/work/MilkyTracker-1.02.00/src/ppui/Graphics_15BIT.cpp
{standard input}: Assembler messages:
{standard input}:1802: Error: unsupported relocation against r9
{standard input}:1803: Error: unsupported relocation against r10
{standard input}:1803: Error: unsupported relocation against r5
{standard input}:1805: Error: unsupported relocation against r10
{standard input}:1805: Error: unsupported relocation against r9
{standard input}:1809: Error: unsupported relocation against r4
{standard input}:1809: Error: unsupported relocation against r3
{standard input}:1810: Error: unsupported relocation against r4
{standard input}:1810: Error: unsupported relocation against r3
{standard input}:1811: Error: unsupported relocation against r4
{standard input}:1811: Error: unsupported relocation against r3
{standard input}:1812: Error: unsupported relocation against r4
{standard input}:1812: Error: unsupported relocation against r3
{standard input}:1813: Error: unsupported relocation against r10
{standard input}:1813: Error: unsupported relocation against r10
{standard input}:1814: Error: unsupported relocation against r3
{standard input}:1814: Error: unsupported relocation against r3
{standard input}:1815: Error: unsupported relocation against r10
{standard input}:1815: Error: unsupported relocation against r9
{standard input}:1817: Error: unsupported relocation against r11
{standard input}:1817: Error: unsupported relocation against r5
{standard input}:1819: Error: unsupported relocation against r11
{standard input}:1819: Error: unsupported relocation against r9
{standard input}:1822: Error: unsupported relocation against r4
{standard input}:1822: Error: unsupported relocation against r3
{standard input}:1823: Error: unsupported relocation against r11
{standard input}:1823: Error: unsupported relocation against r11
{standard input}:1824: Error: unsupported relocation against r3
{standard input}:1824: Error: unsupported relocation against r3
{standard input}:1825: Error: unsupported relocation against r11
{standard input}:1825: Error: unsupported relocation against r9

@julianaito
Copy link
Author

Building with GCC9 also fails (FreeBSD 12.1-RELEASE on powerpc64, base compiler is GCC 4.2.1):

Oh, it used to build with GCC-8.3 on OpenBSD/macppc.

It looks like we're going straight into an #ifdef hell the way i'm proposing this :|

@kernigh
Copy link

kernigh commented Apr 28, 2020

@julianaito: GCC-8.3 on OpenBSD doesn't define __ppc__, so it would have skipped the asm and used the C alternative. Clang does define __ppc__.

The asm is in the wrong syntax. li r9, 0 is Darwin syntax, but BSD and Linux use ELF syntax: li %r9, 0 or li 9, 0. PowerPC Darwin (Mac OS X) became obsolete in 2009, so we might want to delete the asm and just use the C code on PowerPC. If so, we might also delete __attribute__((noinline)).

@Deltafire
Copy link
Member

I'm in favour of removing these asm blocks and relying on the compiler to generate any optimisations. Could someone test that there are no performance bottlenecks if we do so?

@kernigh
Copy link

kernigh commented May 6, 2020

@Deltafire: I made a simple benchmark of fill_dword() and fill_dword_vertical() for my old G3 (PowerPC 750 at 400 MHz) running OpenBSD/macppc. The C code looks about as fast as the asm code.

I needed to edit the asm code. I added some % signs for ELF syntax. Then the asm crashed because clang optimized away the function arguments. I moved the asm code to a label, and constrained bl label to require the arguments. I also tried to realign the loops in fill_dword() by deleting the first 2 nops.

gcc 8.3.0 -O2 with len = 100, pitch = 32:

$ ./timefill-n 100 32 
USE_ASM? no
fill_dword 1: 0.000007931 s
fill_dword 2: 0.000002884 s
fill_dword 3: 0.000002884 s
fdvertical 4: 0.000002443 s
fdvertical 5: 0.000002564 s
fdvertical 6: 0.000002723 s
buff is ok
$ ./timefill-y 100 32 
USE_ASM? yes
fill_dword 1: 0.000007450 s
fill_dword 2: 0.000003245 s
fill_dword 3: 0.000003005 s
fdvertical 4: 0.000002564 s
fdvertical 5: 0.000002443 s
fdvertical 6: 0.000002563 s
buff is ok

clang 8.0.1 -O2 with len = 100, pitch 32:

$ ./timefill-n 100 32 
USE_ASM? no
fill_dword 1: 0.000005807 s
fill_dword 2: 0.000005928 s
fill_dword 3: 0.000003886 s
fdvertical 4: 0.000002444 s
fdvertical 5: 0.000002524 s
fdvertical 6: 0.000002443 s
buff is ok
$ ./timefill-y 100 32 
USE_ASM? yes
fill_dword 1: 0.000007009 s
fill_dword 2: 0.000002924 s
fill_dword 3: 0.000002924 s
fdvertical 4: 0.000002604 s
fdvertical 5: 0.000002564 s
fdvertical 6: 0.000002444 s
buff is ok

The compilers and the asm use different optimizations. Both gcc and clang use mtctr/bdnz loops, and clang also uses stwu, but the asm never uses bdnz nor stwu. clang also unrolls fill_dword() to prepend a 3rd loop (of 8 stores). The asm aligns the loops to 16 bytes, but gcc and clang don't align the loops.

I removed __attribute__((noinline)) from the C code, but I didn't inline the asm code. gcc inlined fill_dword_vertical(). clang inlined both fill_dword() and fill_dword_vertical().

@Deltafire
Copy link
Member

I removed the PPC assembly code and replaced with the library function memset(), which should already by optimised for the target platforms. See commit ba5699e.

@Deltafire Deltafire closed this May 8, 2020
@kernigh
Copy link

kernigh commented May 10, 2020

@Deltafire, does memset() really work? As I understood it, fill_dword() fills in 32-bit values, but memset() fills in 8-bit values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants