Inefficient code generated for inline asm with multiple output register operands #2466
This is a classic case that suffers from the lack of multiple return values. Because we can't express them in LLVM IR, asms that have multiple output constraints (= or +) have to return them through memory, which is grossly inefficient.
Specifically, the LLVM IR for the first function compiles the outputs to multiple =*r constraints, which are "indirect outputs".
Reduced testcase:
int sad16_sse2(void *v, unsigned char *blk2, unsigned char *blk1,
We compile this to:
_sad16_sse2:
GCC compiles this to:
_sad16_sse2:
This is what the .ll will contain in the future. I'm working through teaching the code generator to grok this:
; ModuleID = 'pr2094.c'
define i32 @sad16_sse2(i8* %v, i8* %blk2, i8* %blk1, i32 %stride, i32 %h) nounwind {
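As a rough sketch of the direction (my wording, not from the thread): with multiple-return-value support, an asm with several "=r" output constraints can return an aggregate directly, with the members pulled out by extractvalue, instead of storing each result through a stack slot. The constraint string and types below are illustrative only:

```llvm
; Sketch only: two register outputs returned as a struct, no stack traffic.
%pair = call { i32, i32 } asm "...", "=r,=r,r"(i32 %h)
%ret  = extractvalue { i32, i32 } %pair, 0
%blk  = extractvalue { i32, i32 } %pair, 1
```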
I believe the code generator now handles multiple-result .ll files. Time to update the front end.
Implemented. Here are the last codegen pieces:
Here's the CFE piece:
-Chris
The testcase inline-asm-mrv.c is failing on x86-32 Linux.
; ModuleID = 'inline-asm-mrv.c'
define i32 @sad16_sse2(i8* %v, i8* %blk2, i8* %blk1, i32 %stride, i32 %h) nounwind {
Sorry, my mistake: I accidentally tested using an old version of llvm-gcc. |
Ok, the new code is much better! :)
mentioned in issue llvm/llvm-bugzilla-archive#2095 |
mentioned in issue #1234 |
Extended Description
Testcase:
#include <stdint.h>
int sad16_sse2(void *v, uint8_t *blk2, uint8_t *blk1, int stride, int h)
{
int ret;
asm volatile(
"pxor %%xmm6, %%xmm6 \n\t"
//ASMALIGN(4)
"1: \n\t"
"movdqu (%1), %%xmm0 \n\t"
"movdqu (%1, %3), %%xmm1 \n\t"
"psadbw (%2), %%xmm0 \n\t"
"psadbw (%2, %3), %%xmm1 \n\t"
"paddw %%xmm0, %%xmm6 \n\t"
"paddw %%xmm1, %%xmm6 \n\t"
"lea (%1,%3,2), %1 \n\t"
"lea (%2,%3,2), %2 \n\t"
"sub $2, %0 \n\t"
" jg 1b \n\t"
: "+r" (h), "+r" (blk1), "+r" (blk2)
: "r" ((long)stride)
);
asm volatile(
"movhlps %%xmm6, %%xmm0 \n\t"
"paddw %%xmm0, %%xmm6 \n\t"
"movd %%xmm6, %0 \n\t"
: "=r"(ret)
);
return ret;
}
Generated code:
(We'll put aside for the moment the fact that this code is extremely dangerous because a compiler using certain kinds of optimizations might actually end up using the xmm regs between the two asm statements.)
The generated code ends up being rather inefficient in that it emits four unnecessary stores to the stack, plus the stack allocation for that space. I think it's because blk1 and blk2 have to be put into allocas at the IL level, and codegen isn't smart enough to eliminate them. Not sure what the right fix is; maybe inline asm should take advantage of the multiple return value work?
(I don't know how much fixing this will help, but this function shows up at the top of a profile in ffmpeg re-encoding from h.264 to mpeg4, so every bit likely helps.)