Added support for SIMD.jl; WIP #15

Open: wants to merge 7 commits into musm:master from chriselrod:master

Conversation

@chriselrod commented Dec 25, 2018

  • Add tests for Vec{N,T} where T <: FloatTypes.
  • Make sure all of these tests also pass.
  • Investigate performance regressions vs the SLEEF C library.

Overview of this PR:
The C SLEEF library (SIMD Library for Evaluating Elementary Functions) provides vectorized elementary functions, so it makes sense for SLEEF.jl to support SIMD.jl's Vec{N,T} vector type.

This PR provides preliminary support.

using SIMD, SLEEF, SLEEFwrap, BenchmarkTools, Random
@inline extract(x) = x.elts # 64-byte vectors segfault when returned while wrapped in a struct
sv8 = Vec{8,Float32}(ntuple(Val(8)) do x Core.VecElement(randexp(Float32)) end)
dv4 = Vec{4,Float64}(ntuple(Val(4)) do x Core.VecElement(randexp(Float64)) end)
sv16 = Vec{16,Float32}(ntuple(Val(16)) do x Core.VecElement(randexp(Float32)) end)
dv8 = Vec{8,Float64}(ntuple(Val(8)) do x Core.VecElement(randexp(Float64)) end)
function bench(jl, c, x)
    display(@benchmark extract($jl($x)))
    display(@benchmark $c(extract($x)))
end

Testing a bunch of functions:
exp:

julia> bench(SLEEF.exp, SLEEFwrap.exp, sv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.545 ns (0.00% GC)
  median time:      5.686 ns (0.00% GC)
  mean time:        5.816 ns (0.00% GC)
  maximum time:     23.974 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.689 ns (0.00% GC)
  median time:      4.722 ns (0.00% GC)
  mean time:        4.740 ns (0.00% GC)
  maximum time:     23.272 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> bench(SLEEF.exp, SLEEFwrap.exp, dv4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.408 ns (0.00% GC)
  median time:      7.449 ns (0.00% GC)
  mean time:        7.467 ns (0.00% GC)
  maximum time:     24.513 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.615 ns (0.00% GC)
  median time:      6.722 ns (0.00% GC)
  mean time:        6.737 ns (0.00% GC)
  maximum time:     20.488 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> bench(SLEEF.exp, SLEEFwrap.exp, sv16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.691 ns (0.00% GC)
  median time:      5.731 ns (0.00% GC)
  mean time:        5.779 ns (0.00% GC)
  maximum time:     22.034 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.256 ns (0.00% GC)
  median time:      5.287 ns (0.00% GC)
  mean time:        5.297 ns (0.00% GC)
  maximum time:     14.432 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> bench(SLEEF.exp, SLEEFwrap.exp, dv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.284 ns (0.00% GC)
  median time:      7.321 ns (0.00% GC)
  mean time:        7.336 ns (0.00% GC)
  maximum time:     25.833 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.036 ns (0.00% GC)
  median time:      11.553 ns (0.00% GC)
  mean time:        11.370 ns (0.00% GC)
  maximum time:     38.117 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

log:

julia> bench(SLEEF.log, SLEEFwrap.log, sv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.225 ns (0.00% GC)
  median time:      15.276 ns (0.00% GC)
  mean time:        15.310 ns (0.00% GC)
  maximum time:     31.264 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.967 ns (0.00% GC)
  median time:      10.042 ns (0.00% GC)
  mean time:        10.065 ns (0.00% GC)
  maximum time:     32.280 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> bench(SLEEF.log, SLEEFwrap.log, dv4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.762 ns (0.00% GC)
  median time:      16.993 ns (0.00% GC)
  mean time:        16.964 ns (0.00% GC)
  maximum time:     30.792 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.829 ns (0.00% GC)
  median time:      12.873 ns (0.00% GC)
  mean time:        12.897 ns (0.00% GC)
  maximum time:     27.613 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> bench(SLEEF.log, SLEEFwrap.log, sv16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.331 ns (0.00% GC)
  median time:      16.536 ns (0.00% GC)
  mean time:        16.543 ns (0.00% GC)
  maximum time:     42.043 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.060 ns (0.00% GC)
  median time:      8.115 ns (0.00% GC)
  mean time:        8.130 ns (0.00% GC)
  maximum time:     31.205 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> bench(SLEEF.log, SLEEFwrap.log, dv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.395 ns (0.00% GC)
  median time:      18.477 ns (0.00% GC)
  mean time:        18.613 ns (0.00% GC)
  maximum time:     45.013 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.021 ns (0.00% GC)
  median time:      11.084 ns (0.00% GC)
  mean time:        11.114 ns (0.00% GC)
  maximum time:     35.427 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

sin:

julia> bench(SLEEF.sin, SLEEFwrap.sin, sv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.354 ns (0.00% GC)
  median time:      19.471 ns (0.00% GC)
  mean time:        19.612 ns (0.00% GC)
  maximum time:     37.226 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.906 ns (0.00% GC)
  median time:      9.953 ns (0.00% GC)
  mean time:        9.972 ns (0.00% GC)
  maximum time:     21.988 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> bench(SLEEF.sin, SLEEFwrap.sin, dv4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     28.163 ns (0.00% GC)
  median time:      28.265 ns (0.00% GC)
  mean time:        28.329 ns (0.00% GC)
  maximum time:     52.633 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.484 ns (0.00% GC)
  median time:      10.541 ns (0.00% GC)
  mean time:        10.568 ns (0.00% GC)
  maximum time:     27.162 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> bench(SLEEF.sin, SLEEFwrap.sin, sv16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.364 ns (0.00% GC)
  median time:      20.458 ns (0.00% GC)
  mean time:        20.502 ns (0.00% GC)
  maximum time:     47.938 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.426 ns (0.00% GC)
  median time:      10.565 ns (0.00% GC)
  mean time:        10.587 ns (0.00% GC)
  maximum time:     33.371 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> bench(SLEEF.sin, SLEEFwrap.sin, dv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     28.796 ns (0.00% GC)
  median time:      28.919 ns (0.00% GC)
  mean time:        29.123 ns (0.00% GC)
  maximum time:     55.898 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.913 ns (0.00% GC)
  median time:      12.026 ns (0.00% GC)
  mean time:        12.050 ns (0.00% GC)
  maximum time:     33.233 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

tan:

julia> bench(SLEEF.tan, SLEEFwrap.tan, sv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     36.797 ns (0.00% GC)
  median time:      36.895 ns (0.00% GC)
  mean time:        36.988 ns (0.00% GC)
  maximum time:     58.675 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.273 ns (0.00% GC)
  median time:      16.346 ns (0.00% GC)
  mean time:        16.381 ns (0.00% GC)
  maximum time:     34.868 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> bench(SLEEF.tan, SLEEFwrap.tan, dv4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     51.512 ns (0.00% GC)
  median time:      51.640 ns (0.00% GC)
  mean time:        52.010 ns (0.00% GC)
  maximum time:     73.956 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     986
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.053 ns (0.00% GC)
  median time:      14.161 ns (0.00% GC)
  mean time:        14.179 ns (0.00% GC)
  maximum time:     31.734 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> bench(SLEEF.tan, SLEEFwrap.tan, sv16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     38.064 ns (0.00% GC)
  median time:      38.202 ns (0.00% GC)
  mean time:        38.285 ns (0.00% GC)
  maximum time:     62.710 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.630 ns (0.00% GC)
  median time:      18.712 ns (0.00% GC)
  mean time:        18.756 ns (0.00% GC)
  maximum time:     44.121 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> bench(SLEEF.tan, SLEEFwrap.tan, dv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     55.713 ns (0.00% GC)
  median time:      55.881 ns (0.00% GC)
  mean time:        56.035 ns (0.00% GC)
  maximum time:     78.817 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     984
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.800 ns (0.00% GC)
  median time:      17.898 ns (0.00% GC)
  mean time:        18.053 ns (0.00% GC)
  maximum time:     42.916 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

cbrt:

julia> bench(SLEEF.cbrt, SLEEFwrap.cbrt, sv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     31.845 ns (0.00% GC)
  median time:      32.018 ns (0.00% GC)
  mean time:        32.143 ns (0.00% GC)
  maximum time:     54.500 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.222 ns (0.00% GC)
  median time:      26.324 ns (0.00% GC)
  mean time:        26.364 ns (0.00% GC)
  maximum time:     43.927 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> bench(SLEEF.cbrt, SLEEFwrap.cbrt, dv4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     36.175 ns (0.00% GC)
  median time:      36.303 ns (0.00% GC)
  mean time:        36.564 ns (0.00% GC)
  maximum time:     57.701 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     28.349 ns (0.00% GC)
  median time:      29.205 ns (0.00% GC)
  mean time:        29.250 ns (0.00% GC)
  maximum time:     46.513 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> bench(SLEEF.cbrt, SLEEFwrap.cbrt, sv16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     34.463 ns (0.00% GC)
  median time:      34.570 ns (0.00% GC)
  mean time:        34.634 ns (0.00% GC)
  maximum time:     58.556 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     23.273 ns (0.00% GC)
  median time:      25.731 ns (0.00% GC)
  mean time:        25.492 ns (0.00% GC)
  maximum time:     50.618 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> bench(SLEEF.cbrt, SLEEFwrap.cbrt, dv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     42.291 ns (0.00% GC)
  median time:      42.392 ns (0.00% GC)
  mean time:        42.476 ns (0.00% GC)
  maximum time:     65.205 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     26.524 ns (0.00% GC)
  median time:      26.741 ns (0.00% GC)
  mean time:        26.800 ns (0.00% GC)
  maximum time:     46.431 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

Performance is currently often 2 or 3x worse than SLEEFwrap.jl (which wraps the C library).

@coveralls

Coverage Status

Coverage increased (+36.7%) to 65.182% when pulling 8b83a5a on chriselrod:master into b089af5 on musm:master.

@coveralls

coveralls commented Dec 25, 2018

Coverage Status

Coverage increased (+36.6%) to 65.074% when pulling e57ed3c on chriselrod:master into b089af5 on musm:master.

src/utils.jl (resolved)
@musm (Owner) commented Dec 25, 2018

Awesome progress! Can you please remove the Manifest file?

src/log.jl Outdated
(d < 0 || isnan(d)) && (x = T(NaN))
d == 0 && (x = -T(Inf))

x = muladd(x, t, T(MLN2) * e)
musm (Owner):

I'm not sure you can safely replace the previous code with a muladd; I recall this replacement actually makes it less accurate, missing the ulp requirements.

@chriselrod (Author), Dec 25, 2018:

I went through and reverted all of the muladds I had added.
I had avoided adding any in the Doubles code, figuring the existing formulation was necessary there. Now I'll avoid touching anything unless someone confirms (or I learn enough to say) that it's okay.

musm (Owner):

Yes, let's try to keep the PR minimal; we can open up further PRs if we see that the accuracy is not affected, although I'm pretty sure it is, since I recall trying this.

src/priv.jl (outdated; resolved)
src/priv.jl (outdated; resolved)
invy = 1 / y.hi
zhi = x.hi * invy
Double(zhi, (fma(-zhi, y.hi, x.hi) + fma(-zhi, y.lo, x.lo)) * invy)
end

-@inline function ddiv(x::T, y::T) where {T<:IEEEFloat}
+@inline function ddiv(x::vIEEEFloat, y::vIEEEFloat)
musm (Owner):

So I'm guessing these changes to the type signature are required to operate on the vector version?
I'm sure you are aware that the change is not equivalent to the version on master.
I just want to confirm.

chriselrod (Author):

The old version forced x and y to be of the same type; the current version does not.
The reason for the change is that often one of x or y will be a scalar, and the other a Vec.
If you would like type checking to enforce that you don't mix Float32s and Float64s, we could:

function foo(x::vIEEEFloat, y::vIEEEFloat)
    @assert eltype(x) == eltype(y)
    ...

or

function foo(x::T1, y::T2) where {T <: IEEEFloat, T1 <: Union{T,Vec{<:Any,T}}, T2 <: Union{T,Vec{<:Any,T}}}
    ...

I haven't tested the second option, but I think something like it should work.
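For illustration only, the first option can be mimicked with a dependency-free toy (`foo` here is hypothetical, and a plain tuple stands in for SIMD.Vec):

```julia
# Hypothetical toy version of the eltype-assertion approach; a plain
# NTuple stands in for SIMD.Vec so the sketch has no dependencies.
function foo(x, y)
    @assert eltype(x) == eltype(y) "mixed Float32/Float64 arguments"
    return x, y
end

foo(1.0, (2.0, 3.0))      # scalar Float64 with a Float64 "vector": passes
# foo(1f0, (2.0, 3.0))    # would throw: Float32 mixed with Float64
```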

src/exp.jl Outdated
@@ -26,48 +26,49 @@ const min_exp2(::Type{Float32}) = -150f0
c3 = 0.5550410866482046596e-1
c2 = 0.2402265069591012214
c1 = 0.6931471805599452862
-return @horner x c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
+@horner x c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
musm (Owner):

Can you please leave in the explicit return? Thanks.

chriselrod (Author):

Added explicit returns.

src/hyp.jl Outdated
@@ -48,16 +50,17 @@ over_th(::Type{Float32}) = 18.714973875f0

Compute hyperbolic tangent of `x`.
"""
-function tanh(x::T) where {T<:Union{Float32,Float64}}
+function tanh(x::V) where V <: FloatType
musm (Owner):

Stylistically, I'd prefer we left this as where {V <: FloatType}.

https://github.com/jrevels/YASGuide

Type variable bindings should always be enclosed within {} brackets when using where syntax, e.g. Vector{Vector{T} where T} is good, Vector{Vector{T}} where {T} is good, Vector{Vector{T}} where T is bad.

The return keyword should always be omitted from return statements within short-form method definitions (f(...) = ...). The return keyword should never be omitted from return statements within any other context (function ... end, macro ... end, etc.).
If a function does not have a clearly appropriate return value, then explicitly return nothing.
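The quoted rules can be illustrated with toy definitions (not from this PR):

```julia
# Illustrative only: short form omits `return`, long form keeps it,
# and `where` type-variable bindings are enclosed in braces.
square(x::T) where {T <: Real} = x * x

function cube(x::T) where {T <: Real}
    return x * x * x
end
```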

chriselrod (Author):

Got it. Added brackets.

src/SLEEF.jl Outdated

EquivalentInteger(::Type{Float64}) = Int == Int32 ? Int32 : Int64
EquivalentInteger(::Type{Float32}) = Int32
EquivalentInteger(::Type{Vec{N,Float64}}) where N = Int == Int32 ? Vec{N,Int32} : Vec{N,Int64}
musm (Owner):

Braces around the where clause, please. Otherwise this is really hard to comprehend :)

chriselrod (Author):

I also converted these functions to long form to make them even easier to read.
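A sketch of what those long-form definitions might look like (the exact PR code may differ; assumes `using SIMD: Vec` is in scope):

```julia
using SIMD: Vec  # assumed dependency for the Vec method

# Long-form rewrites of the one-line definitions shown above.
function EquivalentInteger(::Type{Float64})
    return Int === Int32 ? Int32 : Int64
end

function EquivalentInteger(::Type{Float32})
    return Int32
end

function EquivalentInteger(::Type{Vec{N,Float64}}) where {N}
    return Int === Int32 ? Vec{N,Int32} : Vec{N,Int64}
end
```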

src/SLEEF.jl Outdated
const IntegerType32 = Union{Int32,Vec{<:Any,Int32}}
const IntegerType = Union{IntegerType64,IntegerType32}

EquivalentInteger(::Type{Float64}) = Int == Int32 ? Int32 : Int64
musm (Owner):

I think these should all just return Int.

Can we also rename this to fpinttype? The function is quite similar to https://github.com/JuliaLang/julia/blob/master/base/atomics.jl#L331, except that it always returns the machine word size.

If you look at the previous code, we always use Int, even for code that operates on Float32, because using Int32 for Float32 inputs on a 64-bit machine can crush the range of the calculations for the trig functions (I'm pretty sure this is true, if I recall correctly).

@chriselrod (Author), Dec 26, 2018:

I changed the name from EquivalentInteger to fpinttype.

If Int is Int64 and you're doing vector operations on Float32, the resulting integer vectors will take twice the bytes, and therefore twice the register space.

This can have a significant impact on runtime; in the case of exp, it can add 1-2 nanoseconds.

With 64 bit integers:

.text
vmovups	(%rsi), %zmm2
movabsq	$139690975831012, %rax  # imm = 0x7F0C56FE23E4
vmulps	(%rax){1to16}, %zmm2, %zmm0
vrndscaleps	$4, %zmm0, %zmm3
vextractf64x4	$1, %zmm3, %ymm0
vcvttps2qq	%ymm0, %zmm0
vcvttps2qq	%ymm3, %zmm1
movabsq	$139690975831052, %rax  # imm = 0x7F0C56FE240C
vbroadcastss	(%rax), %zmm4
movabsq	$139690975831060, %rax  # imm = 0x7F0C56FE2414
vcmpnltps	(%rax){1to16}, %zmm2, %k1
movabsq	$139690975831016, %rax  # imm = 0x7F0C56FE23E8
vcmpnltps	%zmm2, %zmm4, %k2
vfmadd231ps	(%rax){1to16}, %zmm3, %zmm2
movabsq	$139690975831020, %rax  # imm = 0x7F0C56FE23EC
vfmadd231ps	(%rax){1to16}, %zmm3, %zmm2
movabsq	$139690975831024, %rax  # imm = 0x7F0C56FE23F0
vbroadcastss	(%rax), %zmm3
movabsq	$139690975831028, %rax  # imm = 0x7F0C56FE23F4
vfmadd213ps	(%rax){1to16}, %zmm2, %zmm3
movabsq	$139690975831032, %rax  # imm = 0x7F0C56FE23F8
vfmadd213ps	(%rax){1to16}, %zmm2, %zmm3
movabsq	$139690975831036, %rax  # imm = 0x7F0C56FE23FC
vfmadd213ps	(%rax){1to16}, %zmm2, %zmm3
movabsq	$139690975831040, %rax  # imm = 0x7F0C56FE2400
vfmadd213ps	(%rax){1to16}, %zmm2, %zmm3
movabsq	$139690975831044, %rax  # imm = 0x7F0C56FE2404
vfmadd213ps	(%rax){1to16}, %zmm2, %zmm3
vmulps	%zmm2, %zmm2, %zmm4
vmulps	%zmm3, %zmm4, %zmm3
vaddps	%zmm3, %zmm2, %zmm2
movabsq	$139690975831048, %rax  # imm = 0x7F0C56FE2408
vaddps	(%rax){1to16}, %zmm2, %zmm2
vpsraq	$1, %zmm1, %zmm3
vpsraq	$1, %zmm0, %zmm4
movabsq	$139690975831064, %rax  # imm = 0x7F0C56FE2418
vpbroadcastq	(%rax), %zmm5
vpaddq	%zmm5, %zmm4, %zmm6
vpaddq	%zmm5, %zmm3, %zmm7
movabsq	$139690975831104, %rax  # imm = 0x7F0C56FE2440
vmovdqa32	(%rax), %zmm8
vpermt2d	%zmm6, %zmm8, %zmm7
vpslld	$23, %zmm7, %zmm6
vmulps	%zmm6, %zmm2, %zmm2
vpsubq	%zmm3, %zmm1, %zmm1
vpsubq	%zmm4, %zmm0, %zmm0
vpaddq	%zmm5, %zmm0, %zmm0
vpaddq	%zmm5, %zmm1, %zmm1
vpermt2d	%zmm0, %zmm8, %zmm1
vpslld	$23, %zmm1, %zmm0
movabsq	$139690975831056, %rax  # imm = 0x7F0C56FE2410
vbroadcastss	(%rax), %zmm1
vmulps	%zmm0, %zmm2, %zmm1 {%k2}
vmovaps	%zmm1, %zmm0 {%k1} {z}
vmovaps	%zmm0, (%rdi)
movq	%rdi, %rax
vzeroupper
retq
nopw	%cs:(%rax,%rax)

This is 60 lines. With 32 bit integers, we have 49 lines:

.text
vmovups	(%rsi), %zmm0
movabsq	$139690975841852, %rax  # imm = 0x7F0C56FE4E3C
vmulps	(%rax){1to16}, %zmm0, %zmm1
vrndscaleps	$4, %zmm1, %zmm1
vcvttps2dq	%zmm1, %zmm2
movabsq	$139690975841892, %rax  # imm = 0x7F0C56FE4E64
vbroadcastss	(%rax), %zmm3
movabsq	$139690975841900, %rax  # imm = 0x7F0C56FE4E6C
vcmpnltps	(%rax){1to16}, %zmm0, %k1
movabsq	$139690975841856, %rax  # imm = 0x7F0C56FE4E40
vcmpnltps	%zmm0, %zmm3, %k2
vfmadd231ps	(%rax){1to16}, %zmm1, %zmm0
movabsq	$139690975841860, %rax  # imm = 0x7F0C56FE4E44
vfmadd231ps	(%rax){1to16}, %zmm1, %zmm0
movabsq	$139690975841864, %rax  # imm = 0x7F0C56FE4E48
vbroadcastss	(%rax), %zmm1
movabsq	$139690975841868, %rax  # imm = 0x7F0C56FE4E4C
vfmadd213ps	(%rax){1to16}, %zmm0, %zmm1
movabsq	$139690975841872, %rax  # imm = 0x7F0C56FE4E50
vfmadd213ps	(%rax){1to16}, %zmm0, %zmm1
movabsq	$139690975841876, %rax  # imm = 0x7F0C56FE4E54
vfmadd213ps	(%rax){1to16}, %zmm0, %zmm1
movabsq	$139690975841880, %rax  # imm = 0x7F0C56FE4E58
vfmadd213ps	(%rax){1to16}, %zmm0, %zmm1
movabsq	$139690975841884, %rax  # imm = 0x7F0C56FE4E5C
vfmadd213ps	(%rax){1to16}, %zmm0, %zmm1
vmulps	%zmm0, %zmm0, %zmm3
vmulps	%zmm1, %zmm3, %zmm1
vaddps	%zmm1, %zmm0, %zmm0
movabsq	$139690975841888, %rax  # imm = 0x7F0C56FE4E60
vaddps	(%rax){1to16}, %zmm0, %zmm0
vpsrld	$1, %zmm2, %zmm1
vpslld	$23, %zmm1, %zmm3
vpbroadcastd	(%rax), %zmm4
vpaddd	%zmm4, %zmm3, %zmm3
vmulps	%zmm3, %zmm0, %zmm0
vpsubd	%zmm1, %zmm2, %zmm1
vpslld	$23, %zmm1, %zmm1
vpaddd	%zmm4, %zmm1, %zmm1
movabsq	$139690975841896, %rax  # imm = 0x7F0C56FE4E68
vbroadcastss	(%rax), %zmm2
vmulps	%zmm1, %zmm0, %zmm2 {%k2}
vmovaps	%zmm2, %zmm0 {%k1} {z}
vmovaps	%zmm0, (%rdi)
movq	%rdi, %rax
vzeroupper
retq
nopl	(%rax)

Here are the parts related to this, in the 64 bit int version:

vrndscaleps	$4, %zmm0, %zmm3
vextractf64x4	$1, %zmm3, %ymm0
vcvttps2qq	%ymm0, %zmm0
vcvttps2qq	%ymm3, %zmm1
...
vpsraq	$1, %zmm1, %zmm3
vpsraq	$1, %zmm0, %zmm4
movabsq	$139690975831064, %rax  # imm = 0x7F0C56FE2418
vpbroadcastq	(%rax), %zmm5
vpaddq	%zmm5, %zmm4, %zmm6
vpaddq	%zmm5, %zmm3, %zmm7
movabsq	$139690975831104, %rax  # imm = 0x7F0C56FE2440
vmovdqa32	(%rax), %zmm8
vpermt2d	%zmm6, %zmm8, %zmm7
vpslld	$23, %zmm7, %zmm6
vmulps	%zmm6, %zmm2, %zmm2
vpsubq	%zmm3, %zmm1, %zmm1
vpsubq	%zmm4, %zmm0, %zmm0
vpaddq	%zmm5, %zmm0, %zmm0
vpaddq	%zmm5, %zmm1, %zmm1
vpermt2d	%zmm0, %zmm8, %zmm1
vpslld	$23, %zmm1, %zmm0

32 bit int:

vrndscaleps	$4, %zmm1, %zmm1
vcvttps2dq	%zmm1, %zmm2
...
vpsrld	$1, %zmm2, %zmm1
vpslld	$23, %zmm1, %zmm3
vpbroadcastd	(%rax), %zmm4
vpaddd	%zmm4, %zmm3, %zmm3
vmulps	%zmm3, %zmm0, %zmm0
vpsubd	%zmm1, %zmm2, %zmm1
vpslld	$23, %zmm1, %zmm1

That is, it allocates and operates on two registers instead of one.

However, you are correct:

julia> using SLEEF, SIMD

julia> x = Vec{16,Float32}((1f3,1f5,1f7,1f9,1f11,1f13,1f15,1f17,1f19,1f21,1f23,1f25,1f27,1f29,1f31,1f33))
<16 x Float32>[1000.0, 100000.0, 1.0e7, 1.0e9, 1.0e11, 1.0e13, 1.0e15, 1.0e17, 1.0e19, 1.0e21, 1.0e23, 1.0e25, 1.0e27, 1.0e29, 1.0e31, 1.0e33]

julia> SLEEF.sin(x)
<16 x Float32>[0.82687956, 0.0357488, 0.13669702, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0]

julia> using SLEEFwrap

julia> Vec{16,Float32}(SLEEFwrap.sin(x.elts)) # let's print pretty
<16 x Float32>[0.82687956, 0.0357488, 0.42054778, 0.5458434, 0.99810874, 0.96887577, 0.9944343, -0.5699717, 0.5780979, 0.7704365, -0.925232, -0.40585858, -0.97865087, 0.8592228, -0.039693512, 0.33392745]

julia> sin(x)
<16 x Float32>[0.82687956, 0.0357488, 0.42054778, 0.5458434, 0.99810874, 0.96887577, 0.9944343, -0.5699717, 0.5780979, 0.7704365, -0.925232, -0.40585858, -0.97865087, 0.8592228, -0.039693512, 0.33392745]

julia> using BenchmarkTools

julia> @benchmark SLEEFwrap.sin($x.elts)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     77.960 ns (0.00% GC)
  median time:      79.492 ns (0.00% GC)
  mean time:        79.636 ns (0.00% GC)
  maximum time:     125.621 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     968

julia> @benchmark SLEEF.sin($x).elts
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.888 ns (0.00% GC)
  median time:      19.992 ns (0.00% GC)
  mean time:        20.038 ns (0.00% GC)
  maximum time:     49.129 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark sin($x).elts
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     198.555 ns (0.00% GC)
  median time:      199.522 ns (0.00% GC)
  mean time:        200.183 ns (0.00% GC)
  maximum time:     254.833 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     600

SLEEF (the C library) still has solid range on these trig functions, but its sin slows down dramatically for extreme arguments compared to values close to 0:

julia> using SIMD: VE

julia> vx16 = Vec{16,Float32}(ntuple(Val(16)) do x VE(randn(Float32)) end);
julia> @benchmark SLEEF.sin($vx16).elts
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.904 ns (0.00% GC)
  median time:      20.008 ns (0.00% GC)
  mean time:        20.170 ns (0.00% GC)
  maximum time:     48.694 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark SLEEFwrap.sin($vx16.elts)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.420 ns (0.00% GC)
  median time:      10.553 ns (0.00% GC)
  mean time:        10.575 ns (0.00% GC)
  maximum time:     33.753 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark sin($vx16).elts
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     60.156 ns (0.00% GC)
  median time:      60.584 ns (0.00% GC)
  mean time:        60.808 ns (0.00% GC)
  maximum time:     83.603 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     982

The C library behaves much better in both cases: for values close to 0 it gets the correct answer much more quickly, and for extreme values it gets the correct answer at all.

Perhaps we could use 32-bit integers for functions like exp, and 64-bit integers for the periodic trig functions?

chriselrod (Author):

I replaced all instances of fpinttype(T) with Int in the trig file (locally), but:

julia> SLEEF.sin(x)
<16 x Float32>[0.82687956, 0.0357488, 0.13669702, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0]

is still the answer I get.
That is because of line 213:

u = vifelse((!isinf(t)) & (isnegzero(t) | (abs(t) > TRIG_MAX(T))), T(-0.0), u)

and SLEEF.TRIG_MAX(Float32) returning 1.0f7.
However, that still does not explain:

julia> (SLEEF.sin(1e7),SLEEF.sin(1f7),SLEEF.sin_fast(1f7),sin(1e7))
(0.4205477931907825, 0.13669702f0, 0.4205478f0, 0.4205477931907825)

why SLEEF.sin is getting the wrong answer here.

Checking out the latest master (rather than this PR)...

julia> (SLEEF.sin(1e7),SLEEF.sin(1f7),SLEEF.sin_fast(1f7),sin(1e7))
(0.4205477931907825, 0.13669702f0, 0.4205478f0, 0.4205477931907825)

so it is an existing problem.

If you're using TRIG_MAX(::Type{Float32}) = 1f7, then 32 bit integers should be okay.

@musm (Owner), Dec 26, 2018:

OK, I see, but the trig functions are only guaranteed over:

Notes
The trigonometric functions are tested to return values with specified accuracy when the argument is within the following range:

Double (Float64) precision trigonometric functions : [-1e+14, 1e+14]
Single (Float32) precision trigonometric functions : [-39000, 39000]

not out to 1f7.

If I recall correctly, it's a lot faster for non-vectorized code to always use a machine-size Int, even when operating on 32-bit floats.

However, according to your analysis this is not true for the vector versions.

chriselrod (Author):

AFAIK, 32- and 64-bit operations should be equally fast when not vectorized, as long as you don't need to promote from one to the other. (Note that pointers on 64-bit machines are 64 bits.)

The difference appears when they are vectorized: half the bits means you can fit twice as many elements into a register, and operate on twice as many per instruction.
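The width difference can be checked directly: 16 Int32 lanes fit in one 64-byte (512-bit) register, while 16 Int64 lanes need two.

```julia
# Core.VecElement marks tuple elements for SIMD register layout;
# sizeof reports the packed byte width of each lane set.
sizeof(NTuple{16,Core.VecElement{Int32}})   # 64 bytes: one zmm register
sizeof(NTuple{16,Core.VecElement{Int64}})   # 128 bytes: two zmm registers
```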
