__ir_pure seems much more expensive in large projects #4388

p0nce · 2023-04-30T18:37:15Z

Reference: AuburnSounds/intel-intrinsics#130

Problem statement
It seems some functions that instantiate a template get a lot more expensive to generate code for, in larger projects.
EDIT: actually, it seems everything that uses __ir_pure pays a (growing) cost

Example:

_mm_unpacklo_epi8 takes 1ms 45us in LDC build times woes AuburnSounds/intel-intrinsics#130
but in a larger project, the same kind of function, such as _mm_unpackhi_ps, takes more than 50ms.

How to reproduce
You can reproduce by building this project: https://github.com/AuburnSounds/Dplug/tree/master/examples/clipit
with LDC 1.32.1 and --ftime-trace (type dub --combined).
All intrinsics with __ir_pure take about 50ms. intel-intrinsics then become a very significant contributor to total build times (this is not the -g regression).

At first I thought this was all about ldc.simd.shufflevector having too many CTFE, when when precomputing the LLVM IR and using __ir_pure instead the performance of build is the same, or even reduced.

The text was updated successfully, but these errors were encountered:

p0nce · 2023-04-30T18:40:20Z

And that can be avoided by not using __ir_pure

kinke · 2024-03-16T15:37:49Z

I've just stumbled upon:

ldc/gen/toir.cpp

Line 718 in 3eb3190

return DtoInlineIRExpr(e->loc, fd, e->arguments, sretPointer);

This means that the IR fragment is generated (in its own IR module) and then 'linked' for every call site, instead of once per template instantiation of __ir[Ex][_pure]. I'm pretty sure this makes it extremely slow.

kinke · 2024-03-16T16:04:21Z

Confirmed:

import ldc.llvmasm;

version (all)
{
    // fast variant - IR-inlining once
    pragma(inline, true)
    double muladdFast(double a, double b, double c)
    {
        return __ir!(`%p = fmul fast double %0, %1
                             %r = fadd fast double %p, %2
                             ret double %r`,
                             double, double, double, double)(a, b, c);
    }
}
else
{
    // slow variant - IR-inlining at every call site
    alias muladdFast = __ir!(`%p = fmul fast double %0, %1
                             %r = fadd fast double %p, %2
                             ret double %r`,
                             double, double, double, double);
}

static foreach (int i; 0 .. 5_000)
{
    mixin("double foo" ~ i.stringof ~ `(double a, double b, double c) {
        return muladdFast(a, b, c);
    }`);
}

On my box, the fast variant compiles in one second, while the slow one takes 7 seconds. So aliasing an __ir* template instantiation should be avoided, wrapping it instead to IR-inline once.

kinke · 2024-03-16T16:27:08Z

dub --combined

AFAIK, this compiles everything (incl. all dub deps, direct and indirect) to a single object file (= IR module). Each inline IR fragment is generated in its own temporary IR module, which is then 'linked' into the referencing IR module. Maybe this linking step scales very badly with huge object files. So I'd try without --combined (brrr, single-threaded build, and potentially huge mem requirements for the single compiler invocation), and letting LTO or pragma(inline, true) take care of cross-module optimizations.

p0nce · 2024-03-16T16:35:20Z

My fear with pragma(inline, true) is that without --combined the body of ALL intrinsics will get built by every dependees, making the task of compiling intel-intrinsics each time quite expensive. For now, I just avoid __ir in debug builds which works ok. Say I have 17 packages, that 17 builds of every function body, and there are over a thousands intrinsics, not all of them even need to be inlined. Until now I had trouble everytime getting them to inline when needed without --combined

Should I pragma(inline, true) only those intrinsics that have __ir ? That's possible of course.

The big problem is that there are two meanings to pragma(inline, true) => always inline, and always export the body (header generation) I need one without the other. I never want to force the compiler to inline. And similarly, in pragma(inline, cond) I don't always want a false to mean it shouldn't ever inline.

kinke · 2024-03-16T16:47:25Z

Well with --combined, that's definitely what dependees get, everything needs to be built at once, into a single huge object file (similar to full LTO, at compile-time). Without --combined, they'll just link a built-once static library of intel-intrinsics, with one object file per D module.

If a pragma(inline, true) function is used by the dependee, it is semantically analyzed once when compiling the dependee, and codegen'd into every referencing IR module ('object file') of the dependee (so that the optimizer can inline it later). So only used pragma(inline, true) functions are codegen'd when compiling the dependee.

p0nce · 2024-03-16T16:51:51Z

I need to try and see if there is any performance loss. I trust the huge single object file and have not had the same experience with multiple translation units in the past. Also it tends to be quicker to full build.

Also do you agree that without pragma(inline, xxx) the compiler will be able to inline or not in codegen? It has real benefits for performance.

kinke · 2024-03-16T16:56:50Z

The big problem is that there are two meanings to pragma(inline, true) => always inline, and always export the body (header generation)

Well, if a function is supposed to be inlined at every call site, the .di header needs to contain the body.

Also it tends to be quicker to full build.

Oh well, dub... I recommend reggae for builds taking more than a few seconds. That enables parallel and proper incremental builds. And easily allows to add D flags for the whole build.

Also do you agree that without pragma(inline, xxx) the compiler will be able to inline or not in codegen? It has real benefits for performance.

Not sure what you mean. In your case, if a dub package wraps intrinsics, I'd expect ~every 'intrinsic' to be marked with pragma(inline, true), so that it's a proper inlined intrinsic, regardless of how the dependee's build looks like.

PS: It's hard to keep up with your many editings of your posts. ;)

p0nce · 2024-03-16T17:04:37Z

Well, if a function is supposed to be inlined at every call site, the .di header needs to contain the body.

But you can want to have the body in a .di header while still not inlining the body all the time.
In C++ there is inline and force_inline, so they don't have the issue?
IIRC we have a single thing in dlang to mean those two things.

I'd expect ~every 'intrinsic' to be marked with pragma(inline, true), so that it's a proper inlined intrinsic

What if it's not faster? intel-intrinsics emulates what is missing in some arch.
Some intrinsics like _mm_cmpestrs have a pretty complicated alternative path if SSE4.2 is absent in target.
Some intrinsics may be useful not to inline (though I don't have a ready exemple, true).

Well this is frustrating, I point to 50ms slowdown for a single __ir_pure and while this is obviously a LLVM and LDC performance bottleneck I'm told that I should stop using a --combined flag that everyone uses, and use reggae (I don't want to!)

Each inline IR fragment is generated in its own temporary IR module, which is then 'linked' into the referencing IR module. Maybe this linking step scales very badly with huge object files.

Is there really no other way to do it?

kinke · 2024-03-16T17:13:46Z

Well this is frustrating, I point to 50ms slowdown for a single __ir_pure and while this is obviously a LLVM and LDC performance bottleneck I'm told that I should stop using a --combined flag that everyone uses, and use reggae (I don't want to!)

Hold on, I didn't tell you what to do, I'm just offering explanations and avenues to tackle the problem now, by changing the build. I'd never use --combined myself if I can help it [I'm actually using it for building the bundled reggae :D, but that's just laziness and overkill...].

p0nce · 2024-03-16T17:14:56Z

Yes, I'm sorry.
I could probably try to pragma(inline, true) only the needed intrinsics (10% of them) and see what happens.
This will also help with redub build actually.
For now it's not a problem since release builds can pay the price, and debug builds just use the alt paths instead.

kinke · 2024-03-16T17:24:44Z

Another route for intrinsics are function literals:

alias muladdFast = (double a, double b, double c)
{
    return __ir!(`%p = fmul fast double %0, %1
                         %r = fadd fast double %p, %2
                         ret double %r`,
                         double, double, double, double)(a, b, c);
};

They have the nice property that they are only codegen'd into each referencing object file. So if intel-intrinsics only had these, it could be a header-only library, or otherwise still be very quick to compile (no lambda to compile). And dependees only codegen what they use (into every referencing object file => inlineable everywhere). They don't need a pragma(inline, true); if wanting to let the optimizer decide what to do.

Edit: This should be very close to the C++ inline semantics. In D, the function literals are codegen'd lazily though, not sure if that's the case with C++ too (i.e., whether the compiler skips codegen of an inline function if it isn't referenced anywhere in the preprocessed .cpp).

kinke · 2024-03-16T22:15:07Z

I've tried building that Dplug clipit example; some build timings and resulting libclipit.so sizes on my Ubuntu 22 box (24 physical CPU cores) using LDC v1.37.0 (and env var VST2_SDK="" to overcome build errors), best of 3:

dub build --config=VST3 -b release-nobounds --force: 17 secs, 4.4 MB
dub build --config=VST3 -b release-nobounds --force --combined: 15.5 secs, 4.3 MB
reggae --dub-config=VST3 --dub-build-type=release-nobounds && ninja: 3.6 secs, 4.4 MB
reggae --dub-config=VST3 --dub-build-type=release-nobounds --dflags="--flto=full -linker=gold" && ninja: 3.9 secs, 4.7 MB
reggae --dub-config=VST3 --dub-build-type=release-nobounds --dflags="--flto=thin -linker=gold" && ninja: 2.1 secs, 4.6 MB

[Note that a rm -rf .reggae .ninja_log is required before each reggae invocation to enforce a fresh build from scratch, to get comparable timings.]

Interestingly, thin LTO seems to speed-up the overall build significantly over non-LTO.

This is mainly to show that reggae is easy to use and way faster; and you might be able to check the performance of these libraries. I'd hope that the LTO reggae builds are on-par with the dub --combined one.

Edit: And with the debug dub build type: 7.8 secs with dub, 4.4 with --combined, and 1.3 with reggae (0.2 for reggae, and 1.1 for ninja).

p0nce · 2024-03-17T12:53:42Z

Well that's very interesting, possibly full LTO would perhaps yield superior performance (increase in code size might indicate higher inlining amount). And the gains are not too shabby, a bit like redub I think (which I don't use).
I will schedule a test for this indeed.

p0nce · 2024-03-17T12:57:11Z

reggae --dub-config=VST3 --dub-build-type=release-nobounds --dflags="--flto=thin -linker=gold" && ninja

I mean that this is not a "full" rebuild? But a full rebuild is also unneeded since inlining happens at link stage?

kinke · 2024-03-17T15:44:07Z

I mean that this is not a "full" rebuild? But a full rebuild is also unneeded since inlining happens at link stage?

Not sure what you mean - with the mentioned rm -rf .reggae .ninja_log, you get a full build from scratch every time, just for benchmarking or comparing with dub --force. Normal incremental builds are simply done via ninja, only re-compiling what has changed and then linking. reggae is similar to CMake.

p0nce · 2024-03-17T21:08:57Z

Well that's damn fast! Didn't expected that.

p0nce changed the title ~~Templates seem much more expensive in large projects~~ __ir_pure seems much more expensive in large projects May 1, 2023

p0nce mentioned this issue Mar 16, 2024

Doing nothing vs going pragma(inline, true) vs function literals AuburnSounds/intel-intrinsics#137

Open

p0nce mentioned this issue Mar 17, 2024

Consider other build options AuburnSounds/Dplug#843

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

__ir_pure seems much more expensive in large projects #4388

__ir_pure seems much more expensive in large projects #4388

p0nce commented Apr 30, 2023 •

edited

Loading

p0nce commented Apr 30, 2023 •

edited

Loading

kinke commented Mar 16, 2024 •

edited

Loading

kinke commented Mar 16, 2024

kinke commented Mar 16, 2024

p0nce commented Mar 16, 2024 •

edited

Loading

kinke commented Mar 16, 2024

p0nce commented Mar 16, 2024 •

edited

Loading

kinke commented Mar 16, 2024

p0nce commented Mar 16, 2024 •

edited

Loading

kinke commented Mar 16, 2024

p0nce commented Mar 16, 2024 •

edited

Loading

kinke commented Mar 16, 2024 •

edited

Loading

kinke commented Mar 16, 2024 •

edited

Loading

p0nce commented Mar 17, 2024 •

edited

Loading

p0nce commented Mar 17, 2024 •

edited

Loading

kinke commented Mar 17, 2024

p0nce commented Mar 17, 2024

__ir_pure seems much more expensive in large projects #4388

__ir_pure seems much more expensive in large projects #4388

Comments

p0nce commented Apr 30, 2023 • edited Loading

p0nce commented Apr 30, 2023 • edited Loading

kinke commented Mar 16, 2024 • edited Loading

kinke commented Mar 16, 2024

kinke commented Mar 16, 2024

p0nce commented Mar 16, 2024 • edited Loading

kinke commented Mar 16, 2024

p0nce commented Mar 16, 2024 • edited Loading

kinke commented Mar 16, 2024

p0nce commented Mar 16, 2024 • edited Loading

kinke commented Mar 16, 2024

p0nce commented Mar 16, 2024 • edited Loading

kinke commented Mar 16, 2024 • edited Loading

kinke commented Mar 16, 2024 • edited Loading

p0nce commented Mar 17, 2024 • edited Loading

p0nce commented Mar 17, 2024 • edited Loading

kinke commented Mar 17, 2024

p0nce commented Mar 17, 2024

p0nce commented Apr 30, 2023 •

edited

Loading

p0nce commented Apr 30, 2023 •

edited

Loading

kinke commented Mar 16, 2024 •

edited

Loading

p0nce commented Mar 16, 2024 •

edited

Loading

p0nce commented Mar 16, 2024 •

edited

Loading

p0nce commented Mar 16, 2024 •

edited

Loading

p0nce commented Mar 16, 2024 •

edited

Loading

kinke commented Mar 16, 2024 •

edited

Loading

kinke commented Mar 16, 2024 •

edited

Loading

p0nce commented Mar 17, 2024 •

edited

Loading

p0nce commented Mar 17, 2024 •

edited

Loading