FMA instruction #1438

Closed
9il opened this Issue Apr 16, 2016 · 53 comments

9il commented Apr 16, 2016

Hello,

I am looking for a portable way of vectorized FMA operations for BLAS.

Variant 1

fast attribute

double dot(double[] a, double[] b)
{
    typeof(return) s = 0;

    foreach(size_t i; 0..a.length)
    {
        s = inlineIR!(`
        %p = fmul fast double %0, %1
        %r = fadd fast double %p, %2
        ret double %r`, double)(a[i], b[i], s);
    }

    return s;
}

Result: No FMA

LBB0_8:
    vmovupd -224(%rax), %ymm4
    vmovupd -192(%rax), %ymm5
    vmovupd -160(%rax), %ymm6
    vmovupd -128(%rax), %ymm7
    vmulpd  -224(%rsi), %ymm4, %ymm4
    vmulpd  -192(%rsi), %ymm5, %ymm5
    vmulpd  -160(%rsi), %ymm6, %ymm6
    vmulpd  -128(%rsi), %ymm7, %ymm7
    vaddpd  %ymm0, %ymm4, %ymm0
    vaddpd  %ymm1, %ymm5, %ymm1
    vaddpd  %ymm2, %ymm6, %ymm2
    vaddpd  %ymm3, %ymm7, %ymm3
    vmovupd -96(%rax), %ymm4
    vmovupd -64(%rax), %ymm5
    vmovupd -32(%rax), %ymm6
    vmovupd (%rax), %ymm7
    vmulpd  -96(%rsi), %ymm4, %ymm4
    vmulpd  -64(%rsi), %ymm5, %ymm5
    vmulpd  -32(%rsi), %ymm6, %ymm6
    vmulpd  (%rsi), %ymm7, %ymm7
    vaddpd  %ymm0, %ymm4, %ymm0
    vaddpd  %ymm1, %ymm5, %ymm1
    vaddpd  %ymm2, %ymm6, %ymm2
    vaddpd  %ymm3, %ymm7, %ymm3
    addq    $256, %rsi
    addq    $256, %rax
    addq    $-32, %rdx
    jne LBB0_8

Variant 2

llvm_fmuladd function

double dot(double[] a, double[] b)
{
    typeof(return) s = 0;

    foreach(size_t i; 0..a.length)
    {
        s = llvm_fmuladd(a[i], b[i], s);
    }

    return s;
}

Result: No Vectorization

LBB0_7:
    vmovsd  -56(%rcx), %xmm1
    vmovsd  -48(%rcx), %xmm2
    vfmadd132sd -56(%rax), %xmm0, %xmm1
    vfmadd231sd -48(%rax), %xmm2, %xmm1
    vmovsd  -40(%rcx), %xmm0
    vfmadd132sd -40(%rax), %xmm1, %xmm0
    vmovsd  -32(%rcx), %xmm1
    vfmadd132sd -32(%rax), %xmm0, %xmm1
    vmovsd  -24(%rcx), %xmm0
    vfmadd132sd -24(%rax), %xmm1, %xmm0
    vmovsd  -16(%rcx), %xmm1
    vfmadd132sd -16(%rax), %xmm0, %xmm1
    vmovsd  -8(%rcx), %xmm2
    vfmadd132sd -8(%rax), %xmm1, %xmm2
    vmovsd  (%rcx), %xmm0
    vfmadd132sd (%rax), %xmm2, %xmm0
    addq    $64, %rax
    addq    $64, %rcx
    addq    $-8, %rdx
    jne LBB0_7

JohanEngelen (Member) commented Apr 17, 2016

I will have to figure out how LLVM handles "fast", and how to trigger optimizations... looking through LLVM's testcases now.

9il commented Apr 17, 2016

Thanks!

JohanEngelen (Member) commented Apr 17, 2016

Does @target("+fma") help?

JohanEngelen (Member) commented Apr 17, 2016

In LLVM tests, the flag -enable-unsafe-fp-math is passed to llc to get the optimization. But we want to apply it per function. I see that some tests pass "unsafe-fp-math"="false" as a function attribute (separate from e.g. "target-features"="+sse,+sse2", so probably @target("+fma") does nothing).

9il commented Apr 17, 2016

> so probably @target("+fma") does nothing

Yes

9il commented Apr 17, 2016

> But we want to apply it per function.

This is very important for a generic library.

JohanEngelen (Member) commented Apr 17, 2016

https://github.com/llvm-mirror/clang/blob/master/lib/CodeGen/CGCall.cpp#L1708

So we have to somehow allow people to specify this with LDC. Do you know how clang does this? Can you specify this per function in clang, or only on the command line?

JohanEngelen (Member) commented Apr 17, 2016

(And which do you think is best: command line or per function?)

9il commented Apr 17, 2016

At the LLVM level this can be implemented with attribute groups.

> command line or per function

Only per function. For example, mir.las.sum contains summation algorithms; this attribute should be added for fast but not for pairwise. Furthermore, both of these functions are generic, which means that a user flag in dub would affect both functions, whatever module they are located in.

JohanEngelen (Member) commented Apr 17, 2016

I could add a very generic @(ldc.attributes.llvmattr("...","...")) UDA that would put attributes on functions, so that we don't need maintenance on each new special attribute that LLVM invents. But it would be quite a low-level thing that is perhaps not portable to different LLVM versions, for which the name of an attribute may change.
Random examples from LLVM tests: "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+sse,+sse2" "unsafe-fp-math"="false" "use-soft-float"="false".
In this particular case, it would look like: @(ldc.attributes.llvmattr("unsafe-fp-math","true")).

The other option is to add an ldc.attributes.... special UDA for each thing we want to pass to LLVM.
Something like @(ldc.attributes.unsafeFPMath) might pass these three to LLVM: "no-infs-fp-math"="true" "no-nans-fp-math"="true" "unsafe-fp-math"="true".

@klickverbot @redstar Which solution would you prefer?

9il commented Apr 17, 2016

> @(ldc.attributes.llvmattr("...","..."))

Yes, please! This would be awesome for BLAS ^_^

JohanEngelen (Member) commented Apr 17, 2016

@9il what other attributes would you want to attach to functions?
I don't know how stable the exact attribute names are in LLVM, and how they apply to all the different targets. If we hide the exact name(s) behind @(ldc.attributes.unsafeFPMath), LDC can apply different LLVM attributes depending on the compile target.

9il commented Apr 17, 2016

> @9il what other attributes would you want to attach to functions?
> I don't know how stable the exact attribute names are in LLVM, and how they apply to all the different targets. If we hide the exact name(s) behind @(ldc.attributes.unsafeFPMath), LDC can apply different LLVM attributes depending on the compile target.

I want as many as possible. I will write code that generates other code. Note that portability is not a big problem here, because I would define aliases in a configuration file hidden from the user.

Please add the ability to define attributes for inlineIR too.

9il commented Apr 17, 2016

BTW, how can I get the LLVM version at compile time?

JohanEngelen (Member) commented Apr 17, 2016

version(LDC_LLVM_305)

JohanEngelen (Member) commented Apr 17, 2016

Haha, LLVM is toying with me :)
@llvmattr("unsafe-fp-math", "true")
results in
attributes #2 = { uwtable "unsafe-fp-math"="false" }

klickverbot (Member) commented Apr 17, 2016

We can certainly add a UDA for generic LLVM attributes, but we can't really make any guarantees about compatibility with future LLVM versions from the LDC side. Applying not-quite-valid combinations of attributes, etc. might also lead to internal compiler errors from within LLVM, which might again turn out to be cumbersome to catch from LDC.

I would suggest adding both: a raw attribute for direct experiments with LLVM codegen, plus "high-level" UDAs for certain common use cases. As Johan points out, we could actually guarantee version-independent availability for the latter. For FMA and other math optimisations in particular, I would suggest having a close look at both the Clang and the LLVM-level (opt, …) interfaces. Between the frontend emitting the llvm.fma and/or llvm.fmuladd intrinsics and the various math optimisation flags, there seems to be quite a large design space.

JohanEngelen (Member) commented Apr 17, 2016

Yeah, clang does some fusing of operations itself (instead of LLVM). That would be quite some work to do in LDC too.

I am adding @llvmattr now, and then we can alias something like:
enum unsafeFPMath = llvmattr("unsafe-fp-math", "true"); ?

klickverbot (Member) commented Apr 17, 2016

Sounds like a plan, although in the default D style it should probably be called llvmAttr.

JohanEngelen (Member) commented Apr 17, 2016

I had to make another modification: inlineIR functions now inherit attributes from the enclosing function.

@9il

@llvmAttr("unsafe-fp-math", "true")
double dot(double[] a, double[] b)
{
    double s = 0;
    foreach (size_t i; 0 .. a.length)
    {
        s = inlineIR!(`
         %p = fmul fast double %0, %1
         %r = fadd fast double %p, %2
         ret double %r`,
            double)(a[i], b[i], s);
    }
    return s;
}

ldc2 -mcpu=haswell -O3 -release -c -output-s gives:

LBB2_8:
    vmovupd -224(%rdx), %ymm4
    vmovupd -192(%rdx), %ymm5
    vmovupd -160(%rdx), %ymm6
    vmovupd -128(%rdx), %ymm7
    vfmadd132pd -224(%rax), %ymm0, %ymm4
    vfmadd132pd -192(%rax), %ymm1, %ymm5
    vfmadd132pd -160(%rax), %ymm2, %ymm6
    vfmadd132pd -128(%rax), %ymm3, %ymm7
    vmovupd -96(%rdx), %ymm0
    vmovupd -64(%rdx), %ymm1
    vmovupd -32(%rdx), %ymm2
    vmovupd (%rdx), %ymm3
    vfmadd132pd -96(%rax), %ymm4, %ymm0
    vfmadd132pd -64(%rax), %ymm5, %ymm1
    vfmadd132pd -32(%rax), %ymm6, %ymm2
    vfmadd132pd (%rax), %ymm7, %ymm3
    addq    $256, %rax
    addq    $256, %rdx
    addq    $-32, %r9
    jne LBB2_8

happy?

9il commented Apr 17, 2016

Whohoo!

9il commented Apr 17, 2016

Can I force LLVM to not inline a function?

9il commented Apr 17, 2016

Or is there a simpler way with LDC?

JohanEngelen (Member) commented Apr 17, 2016

In codegen tests, I now often apply @weak to a function, so LLVM cannot reason about it and cannot inline it.

9il commented Apr 17, 2016

Which LDC version will include this change, 1.0.0 or 1.1.0?

JohanEngelen added a commit to JohanEngelen/ldc that referenced this issue Apr 17, 2016

klickverbot (Member) commented Apr 17, 2016

As of recently, the proper D way is pragma(inline, false). There is also the old LDC_never_inline. Supporting the former should be trivial, if we don't already.

klickverbot (Member) commented Apr 17, 2016

Regarding the code sample: is inline IR even needed? It is also very fragile across LLVM versions, and I'd avoid it whenever possible for maintenance reasons.

JohanEngelen added a commit to JohanEngelen/ldc that referenced this issue Apr 17, 2016

JohanEngelen (Member) commented Apr 17, 2016

> Regarding the code sample: is inline IR even needed?

I think so. I don't think we have a way at the moment to emit the "fast" attribute on fmul and fadd.

Edit: this has been on my "todo" list for a while, and I had forgotten about it. I think in the future the code should look more like this, and it should just work then:

@(ldc.attributes.awesomeSuperFastMathYeah)
double dot(double[] a, double[] b) {
    double s = 0;
    foreach(size_t i; 0..a.length) {
        s += a[i] * b[i];
    }
    return s;
}
JohanEngelen (Member) commented Apr 17, 2016

@9il wrote:

> Please add the ability to define attributes for inlineIR too.

Have a look at the test cases in #1441. I've found that this works:

pragma(LDC_inline_ir)
    R inlineIR_fastmath(string s, R, P...)(P) @llvmAttr("unsafe-fp-math", "true");
alias inlineIR_fastmath!(`%p = fmul fast double %0, %1
                          %r = fadd fast double %p, %2
                          ret double %r`,
                          double, double, double, double) muladd;

Edit: note that the calling function will need @llvmAttr("unsafe-fp-math", "true") too!

klickverbot (Member) commented Apr 17, 2016

@JohanEngelen: I think awesomeSuperFastMathYeah is definitely what we should go for, a quick fix for Ilya to use in the meantime notwithstanding. It shouldn't be hard to implement either – just store the appropriate flags in IrFunction and check them in the FP codegen routines.

9il commented Apr 17, 2016

> I think awesomeSuperFastMathYeah is definitely what we should go for, a quick fix for Ilya to use in the meantime notwithstanding. It shouldn't be hard to implement either – just store the appropriate flags in IrFunction and check them in the FP codegen routines.

I also need control over loop & SLP vectorization and unrolling parameters. Please add @llvmAttr.

klickverbot (Member) commented Apr 17, 2016

@9il: I'm not opposed to adding llvmAttr, as mentioned above. It's just that it is a very sharp tool (like inline IR in general) and not appropriate for most users (it requires knowledge of LLVM IR and tightly couples code to the LLVM version). "Fast" math is something enough people need that an easier-to-use tool should be available in addition to it.

9il commented Apr 17, 2016

@klickverbot oh, ok. Thanks)

JohanEngelen added a commit to JohanEngelen/ldc that referenced this issue Apr 18, 2016

JohanEngelen added a commit to JohanEngelen/ldc that referenced this issue Apr 18, 2016

JohanEngelen added a commit to JohanEngelen/ldc that referenced this issue Apr 18, 2016

JohanEngelen (Member) commented Apr 18, 2016

(incorrectly closed by merging #1441)

JohanEngelen (Member) commented Apr 18, 2016

I have either hit an LLVM bug, or it is intended behavior; I am not sure. It appears that once one function carries the "unsafe-fp-math"="true" attribute, all other (subsequent) functions that do not explicitly set it to "false" are treated as if they have it on!

9il commented Apr 18, 2016

This looks like proper behavior.

JohanEngelen (Member) commented Apr 18, 2016

Really? The order in which functions are defined in IR determines their codegen?

9il commented Apr 18, 2016

> Really? The order in which functions are defined in IR determines their codegen?

Oh, I misunderstood you; I thought you meant functions called from a function with the attribute.

9il commented Apr 18, 2016

Do you have the LLVM IR output?

JohanEngelen (Member) commented Apr 19, 2016

Test case:

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64--linux-gnu"

define void @leading_function() #0 {
  ret void
}

define float @test_f32_fmadd(float %a0, float %a1, float %a2) #0 {
; CHECK-NOT: vfmadd
  %x = fmul float %a0, %a1
  %res = fadd float %x, %a2
  ret float %res
}

define void @trailing_function() #1 {
  ret void
}

attributes #0 = { uwtable }
attributes #1 = { uwtable "unsafe-fp-math"="true" }

Compile with llc -mcpu=haswell --> no FMA.
Change the attribute on leading_function to #1 --> FMA in test_f32_fmadd!?


9il commented Apr 19, 2016

This is terrible... Is any workaround or bug fix possible?

JohanEngelen (Member) commented Apr 19, 2016

LLVM trunk code:

void TargetMachine::resetTargetOptions(const Function &F) const {
#define RESET_OPTION(X, Y)                                                     \
  do {                                                                         \
    if (F.hasFnAttribute(Y))                                                   \
      Options.X = (F.getFnAttribute(Y).getValueAsString() == "true");          \
  } while (0)

  RESET_OPTION(LessPreciseFPMADOption, "less-precise-fpmad");
  RESET_OPTION(UnsafeFPMath, "unsafe-fp-math");
  RESET_OPTION(NoInfsFPMath, "no-infs-fp-math");
  RESET_OPTION(NoNaNsFPMath, "no-nans-fp-math");
}

The proposed work-around is to apply these four attributes explicitly to all functions. For this purpose, I think clang has an options struct that is applied to all functions, so that the default values are overridable with command-line switches (instead of setting them to "false" for all functions).
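For illustration (an editorial sketch following the resetTargetOptions code above, not from the thread), the work-around makes every function spell out all four options explicitly, so codegen never inherits a stale value from a previously processed function:

```llvm
define double @safe_fn(double %x) #0 {
  ret double %x
}

define double @fast_fn(double %x) #1 {
  ret double %x
}

; Every function carries all four options explicitly; none is left unset.
attributes #0 = { "less-precise-fpmad"="false" "unsafe-fp-math"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" }
attributes #1 = { "less-precise-fpmad"="true" "unsafe-fp-math"="true" "no-infs-fp-math"="true" "no-nans-fp-math"="true" }
```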

redstar (Member) commented Apr 19, 2016

@JohanEngelen I once started to add various command-line options regarding math in #789, but that is only half of the way.

JohanEngelen (Member) commented Apr 19, 2016

@redstar Nice. So the plan was to apply these attributes to all functions then ("false" when no command-line switch is given, and "true" when it is)? It is not very nice to explicitly disable all of these by default, but I think it is the only way if we want to be able to selectively apply these attributes to functions. For the (rare) use case where someone compiles to bitcode and then manually compiles that to machine code with optimization options, he will not be able to override it via the command-line interface, I think...

Temtaime commented Apr 19, 2016

The IR is really ugly.
GCC and Clang can generate an FMA from a * b + c.
Why can't LDC?

JohanEngelen (Member) commented Apr 19, 2016

I don't think we have a way at the moment to emit the fast attribute on fmul and fadd.

In the test case above, fast is not needed to generate an FMA instruction; "unsafe-fp-math" is enough.

JohanEngelen added a commit to JohanEngelen/ldc that referenced this issue Apr 19, 2016

For `LDC_inline_ir` functions: instantiate a new "alwaysinline" function for each call, which is the current behavior for inline ASM too.

When certain attributes are applied to the calling function, like "unsafe-fp-math", the inlined inlineIR function has to have the same attributes; otherwise the calling function's attribute will be reset to the safe merge of the two: "false". Because the same inlineIR function can be called in different functions using an alias definition, a new function (with possibly different attributes) has to be instantiated upon every call.

Related GH issue #1438
JohanEngelen (Member) commented Apr 19, 2016

> GCC and Clang can generate an FMA from a * b + c.
> Why can't LDC?

LDC does generate FMA instructions if you tell it that your processor can do that (-mattr=+fma or @target("fma")). However, for some reason the dot product function is not vectorized.
The generated IR is slightly different (compared with the inlineIR version) for

double dot(double[] a, double[] b) {
    double s = 0;
    foreach(size_t i; 0..a.length) {
        s += a[i] * b[i];
    }
    return s;
}

I think a closer inspection of how and why the IR is different will lead to enabling vectorization. The ordering of operations is different, and perhaps LLVM has to assume the possibility of pointer aliasing. (The order of some loads from memory differs from the inlineIR version that does vectorize.)

JohanEngelen (Member) commented Apr 19, 2016

> I think a closer inspection of how and why the IR is different will lead to enabling vectorization. The ordering of operations is different, and perhaps LLVM has to assume the possibility of pointer aliasing.

This is all untrue or irrelevant. What is relevant is the fast attribute on fmul and fadd: when I add fast to the IR generated from pure D code, it vectorizes exactly like the inline IR version.

JohanEngelen added a commit to JohanEngelen/ldc that referenced this issue Apr 21, 2016

For `LDC_inline_ir` functions: instantiate a new "alwaysinline" function for each call, which is the current behavior for inline ASM too. (Commit message as above.)

Related GH issue #1438
JohanEngelen (Member) commented Jun 9, 2016

With #1472 this generates, target CPU permitting, vectorized fused multiply-add machine code:

@fastmath
extern (C) double dot(double[] a, double[] b)
{
    double s = 0;
    foreach (size_t i; 0 .. a.length)
    {
        s += a[i] * b[i];
    }
    return s;
}

(part of LDC's testsuite now)


9il commented Jul 4, 2016

@JohanEngelen This works pretty well! For example, see the trsm micro kernel composed with the gemm micro kernel. The kernels' code can be found here.

JohanEngelen (Member) commented Jul 4, 2016

@9il Nice :)
