__ir_pure seems much more expensive in large projects #4388
Comments
I've just stumbled upon: Line 718 in 3eb3190
This means that the IR fragment is generated (in its own IR module) and then 'linked' for every call site, instead of once per template instantiation of |
Confirmed:
import ldc.llvmasm;

version (all)
{
    // fast variant - IR-inlining once
    pragma(inline, true)
    double muladdFast(double a, double b, double c)
    {
        return __ir!(`%p = fmul fast double %0, %1
                      %r = fadd fast double %p, %2
                      ret double %r`,
                     double, double, double, double)(a, b, c);
    }
}
else
{
    // slow variant - IR-inlining at every call site
    alias muladdFast = __ir!(`%p = fmul fast double %0, %1
                              %r = fadd fast double %p, %2
                              ret double %r`,
                             double, double, double, double);
}

static foreach (int i; 0 .. 5_000)
{
    mixin("double foo" ~ i.stringof ~ `(double a, double b, double c) {
        return muladdFast(a, b, c);
    }`);
}

On my box, the fast variant compiles in one second, while the slow one takes 7 seconds. So aliasing an |
AFAIK, this compiles everything (incl. all dub deps, direct and indirect) to a single object file (= IR module). Each inline IR fragment is generated in its own temporary IR module, which is then 'linked' into the referencing IR module. Maybe this linking step scales very badly with huge object files. So I'd try without |
My fear with Should I pragma(inline, true) only those intrinsics that have __ir? That's possible of course. The big problem is that pragma(inline, true) has two meanings: always inline, and always export the body (header generation). I need one without the other. I never want to force the compiler to inline. And similarly, in |
Well with If a |
I need to try and see if there is any performance loss. I trust the huge single object file and have not had the same experience with multiple translation units in the past. It also tends to be quicker for a full build. Also do you agree that without |
Well, if a function is supposed to be inlined at every call site, the .di header needs to contain the body.
Oh well, dub... I recommend reggae for builds taking more than a few seconds. That enables parallel and proper incremental builds, and makes it easy to add D flags for the whole build.
Not sure what you mean. In your case, if a dub package wraps intrinsics, I'd expect ~every 'intrinsic' to be marked with PS: It's hard to keep up with the many edits to your posts. ;) |
But you can want to have the body in a .di header while still not inlining the body all the time.
What if it's not faster? intel-intrinsics emulates what is missing in some arch. Well this is frustrating, I point to 50ms slowdown for a single
Is there really no other way to do it? |
Hold on, I didn't tell you what to do, I'm just offering explanations and avenues to tackle the problem now, by changing the build. I'd never use |
Yes, I'm sorry. |
Another route for intrinsics are function literals:
alias muladdFast = (double a, double b, double c)
{
    return __ir!(`%p = fmul fast double %0, %1
                  %r = fadd fast double %p, %2
                  ret double %r`,
                 double, double, double, double)(a, b, c);
};

They have the nice property that they are only codegen'd into each referencing object file. So if Edit: This should be very close to the C++ |
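The function-literal route can be tried outside LDC with a portable fallback. The following is only a sketch under stated assumptions: the name `muladd`, the `version (LDC)` split and the plain-arithmetic fallback are illustrative additions, not part of the comment above, and the sketch makes no claim about compile times.

```d
import std.stdio : writeln;

version (LDC)
{
    import ldc.llvmasm;

    // Inline IR wrapped in a function literal, as in the comment above
    // (the alias is renamed to `muladd` for this sketch).
    alias muladd = (double a, double b, double c)
    {
        return __ir!(`%p = fmul fast double %0, %1
                      %r = fadd fast double %p, %2
                      ret double %r`,
                     double, double, double, double)(a, b, c);
    };
}
else
{
    // Assumed fallback for compilers without LDC inline IR:
    // the same multiply-add written as ordinary D arithmetic.
    alias muladd = (double a, double b, double c)
    {
        return a * b + c;
    };
}

void main()
{
    // The alias is called exactly like a regular function.
    writeln(muladd(2.0, 3.0, 4.0));
}
```

Because the alias names a function literal rather than the `__ir` instantiation itself, call sites go through an ordinary function body, which is the property the comment above relies on.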
I've tried building that Dplug
[Note that a Interestingly, thin LTO seems to speed up the overall build significantly over non-LTO. This is mainly to show that reggae is easy to use and way faster; and you might be able to check the performance of these libraries. I'd hope that the LTO reggae builds are on par with the dub Edit: And with the |
Well that's very interesting, full LTO would possibly yield superior performance (the increase in code size might indicate more inlining). And the gains are not too shabby, a bit like redub I think (which I don't use). |
I mean that this is not a "full" rebuild? But a full rebuild is also unneeded since inlining happens at link stage? |
Not sure what you mean - with the mentioned |
Well that's damn fast! Didn't expect that. |
Reference: AuburnSounds/intel-intrinsics#130
Problem statement
It seems some functions that instantiate a template get a lot more expensive to generate code for in larger projects.
EDIT: actually, it seems everything that uses __ir_pure pays a (growing) cost.
Example:
_mm_unpacklo_epi8 takes 1ms 45us in "LDC build times woes" (AuburnSounds/intel-intrinsics#130)
_mm_unpackhi_ps takes more than 50ms.

How to reproduce
You can reproduce by building this project: https://github.com/AuburnSounds/Dplug/tree/master/examples/clipit with LDC 1.32.1 and --ftime-trace (type dub --combined).
All intrinsics with __ir_pure take about 50ms. intel-intrinsics then becomes a very significant contributor to total build times (this is not the -g regression).
At first I thought this was all about ldc.simd.shufflevector doing too much CTFE, but when precomputing the LLVM IR and using __ir_pure instead, the build performance is the same, or even reduced.