Added support for SIMD.jl; WIP #15
base: master
Conversation
…teger(::Type{T}) to always default to 32 bit integers on 32 bit systems.
Awesome progress. Can you please remove the Manifest file?
src/log.jl
Outdated
(d < 0 || isnan(d)) && (x = T(NaN))
d == 0 && (x = -T(Inf))

x = muladd(x, t, T(MLN2) * e)
I'm not sure you can safely replace the previous code with a muladd; I recall this replacement actually makes the result less accurate, missing the ulp requirements.
I went through and reverted all of the muladds I had added.
I had avoided adding any in the Doubles code, figuring the existing form was necessary there. But now I'll avoid touching anything unless someone confirms (or I learn enough) to say it's okay.
Yes, let's try to keep the PR minimal; we can open further PRs if we see that the accuracy is not affected, although I'm pretty sure it is, since I recall trying this.
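To illustrate the concern with a generic example (not the package's actual code): muladd is allowed to contract to a single fma, which skips an intermediate rounding step that the double-double algorithms rely on.

x = 1.0 + 2.0^-30
t = 1.0 - 2.0^-30
c = -1.0
x * t + c        # 0.0: x*t rounds 1 - 2^-60 to exactly 1.0 before the add
fma(x, t, c)     # -2^-60 ≈ -8.67e-19: a single rounding of the exact result
muladd(x, t, c)  # either of the above, depending on hardware and compiler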
…dd changes, and removed a commented-out function.
invy = 1 / y.hi
zhi = x.hi * invy
Double(zhi, (fma(-zhi, y.hi, x.hi) + fma(-zhi, y.lo, x.lo)) * invy)
end

@inline function ddiv(x::T, y::T) where {T<:IEEEFloat}
@inline function ddiv(x::vIEEEFloat, y::vIEEEFloat)
So I'm guessing these changes in the type signature are required to operate on the vector version?
I'm sure you are aware that the change is not equivalent to the version on master.
I just want to confirm.
The old version forced x and y to be of the same type. The current version does not.
The reason for that change is that often one of x or y will be a scalar, and the other a Vec.
If you would like type checking to enforce that you don't mix Float32 and Float64, we could:
function foo(x::vIEEEFloat, y::vIEEEFloat)
@assert eltype(x) == eltype(y)
...
or
function foo(x::T1, y::T2) where {T <: IEEEFloat, T1 <: Union{T,Vec{<:Any,T}}, T2 <: Union{T,Vec{<:Any,T}}}
...
I haven't tested the second option, but I think something like it should work.
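For what it's worth, here is a self-contained sketch of the first option (the foo name and the simplified union are illustrative stand-ins, not the package's actual definitions):

using SIMD

# Illustrative stand-in for the package's vIEEEFloat union.
const vFloat = Union{Float32, Float64, Vec{<:Any,Float32}, Vec{<:Any,Float64}}

# Allows scalar/Vec mixing, but rejects mixing Float32 with Float64 at runtime.
@inline function foo(x::vFloat, y::vFloat)
    @assert eltype(x) == eltype(y)
    return x * y
end

foo(2.0, Vec{4,Float64}((1.0, 2.0, 3.0, 4.0)))     # ok: both eltypes are Float64
# foo(2.0f0, Vec{4,Float64}((1.0, 2.0, 3.0, 4.0))) # would trip the assertion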
src/exp.jl
Outdated
@@ -26,48 +26,49 @@ const min_exp2(::Type{Float32}) = -150f0
c3 = 0.5550410866482046596e-1
c2 = 0.2402265069591012214
c1 = 0.6931471805599452862
return @horner x c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
@horner x c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
Can you please leave in the explicit return? Thanks.
Added explicit returns.
src/hyp.jl
Outdated
@@ -48,16 +50,17 @@ over_th(::Type{Float32}) = 18.714973875f0

Compute hyperbolic tangent of `x`.
"""
function tanh(x::T) where {T<:Union{Float32,Float64}}
function tanh(x::V) where V <: FloatType
Stylistically I'd prefer we left this as where {V <: FloatType}
https://github.com/jrevels/YASGuide
Type variable bindings should always be enclosed within {} brackets when using where syntax, e.g. Vector{Vector{T} where T} is good, Vector{Vector{T}} where {T} is good, Vector{Vector{T}} where T is bad.
The return keyword should always be omitted from return statements within short-form method definitions (f(...) = ...). The return keyword should never be omitted from return statements within any other context (function ... end, macro ... end, etc.).
If a function does not have a clearly appropriate return value, then explicitly return nothing.
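For instance, applying those two rules (a generic illustration, not package code):

double(x) = 2x                        # short form: `return` omitted
function double_all(x::V) where {V <: AbstractVector}
    return 2 .* x                     # long form: explicit `return`, braced `where`
end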
Got it. Added brackets.
src/SLEEF.jl
Outdated
EquivalentInteger(::Type{Float64}) = Int == Int32 ? Int32 : Int64
EquivalentInteger(::Type{Float32}) = Int32
EquivalentInteger(::Type{Vec{N,Float64}}) where N = Int == Int32 ? Vec{N,Int32} : Vec{N,Int64}
Braces around the where clause. Otherwise this is really hard to comprehend :)
I also converted these functions to long form to make them even easier to read.
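For reference, the long-form versions presumably look something like this (a sketch reconstructed from the one-line definitions in the diff above; the PR's exact bodies may differ):

using SIMD

function EquivalentInteger(::Type{Float64})
    # Match the machine word size: Int32 on 32 bit systems, Int64 otherwise.
    return Int === Int32 ? Int32 : Int64
end

function EquivalentInteger(::Type{Vec{N,Float64}}) where {N}
    return Int === Int32 ? Vec{N,Int32} : Vec{N,Int64}
end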
src/SLEEF.jl
Outdated
const IntegerType32 = Union{Int32,Vec{<:Any,Int32}}
const IntegerType = Union{IntegerType64,IntegerType32}

EquivalentInteger(::Type{Float64}) = Int == Int32 ? Int32 : Int64
I think these should all just return Int.
Can we also rename this to fpinttype? The function is quite similar to https://github.com/JuliaLang/julia/blob/master/base/atomics.jl#L331, except it always returns the machine word size.
If you look at the previous code, we always use Int, even for code that operates on Float32, because using Int32 for Float32 inputs on a 64-bit machine can crush the range of the calculations for the trig functions (I'm pretty sure this is true, if I recall correctly).
I changed the name from EquivalentInteger to fpinttype.
If Int is Int64 and you're doing vector operations on Float32, the resulting integer vectors will be twice the bytes, taking up twice the register space.
This can have a significant impact on runtime. In the case of exp, it can add 1-2 nanoseconds.
With 64 bit integers:
.text
vmovups (%rsi), %zmm2
movabsq $139690975831012, %rax # imm = 0x7F0C56FE23E4
vmulps (%rax){1to16}, %zmm2, %zmm0
vrndscaleps $4, %zmm0, %zmm3
vextractf64x4 $1, %zmm3, %ymm0
vcvttps2qq %ymm0, %zmm0
vcvttps2qq %ymm3, %zmm1
movabsq $139690975831052, %rax # imm = 0x7F0C56FE240C
vbroadcastss (%rax), %zmm4
movabsq $139690975831060, %rax # imm = 0x7F0C56FE2414
vcmpnltps (%rax){1to16}, %zmm2, %k1
movabsq $139690975831016, %rax # imm = 0x7F0C56FE23E8
vcmpnltps %zmm2, %zmm4, %k2
vfmadd231ps (%rax){1to16}, %zmm3, %zmm2
movabsq $139690975831020, %rax # imm = 0x7F0C56FE23EC
vfmadd231ps (%rax){1to16}, %zmm3, %zmm2
movabsq $139690975831024, %rax # imm = 0x7F0C56FE23F0
vbroadcastss (%rax), %zmm3
movabsq $139690975831028, %rax # imm = 0x7F0C56FE23F4
vfmadd213ps (%rax){1to16}, %zmm2, %zmm3
movabsq $139690975831032, %rax # imm = 0x7F0C56FE23F8
vfmadd213ps (%rax){1to16}, %zmm2, %zmm3
movabsq $139690975831036, %rax # imm = 0x7F0C56FE23FC
vfmadd213ps (%rax){1to16}, %zmm2, %zmm3
movabsq $139690975831040, %rax # imm = 0x7F0C56FE2400
vfmadd213ps (%rax){1to16}, %zmm2, %zmm3
movabsq $139690975831044, %rax # imm = 0x7F0C56FE2404
vfmadd213ps (%rax){1to16}, %zmm2, %zmm3
vmulps %zmm2, %zmm2, %zmm4
vmulps %zmm3, %zmm4, %zmm3
vaddps %zmm3, %zmm2, %zmm2
movabsq $139690975831048, %rax # imm = 0x7F0C56FE2408
vaddps (%rax){1to16}, %zmm2, %zmm2
vpsraq $1, %zmm1, %zmm3
vpsraq $1, %zmm0, %zmm4
movabsq $139690975831064, %rax # imm = 0x7F0C56FE2418
vpbroadcastq (%rax), %zmm5
vpaddq %zmm5, %zmm4, %zmm6
vpaddq %zmm5, %zmm3, %zmm7
movabsq $139690975831104, %rax # imm = 0x7F0C56FE2440
vmovdqa32 (%rax), %zmm8
vpermt2d %zmm6, %zmm8, %zmm7
vpslld $23, %zmm7, %zmm6
vmulps %zmm6, %zmm2, %zmm2
vpsubq %zmm3, %zmm1, %zmm1
vpsubq %zmm4, %zmm0, %zmm0
vpaddq %zmm5, %zmm0, %zmm0
vpaddq %zmm5, %zmm1, %zmm1
vpermt2d %zmm0, %zmm8, %zmm1
vpslld $23, %zmm1, %zmm0
movabsq $139690975831056, %rax # imm = 0x7F0C56FE2410
vbroadcastss (%rax), %zmm1
vmulps %zmm0, %zmm2, %zmm1 {%k2}
vmovaps %zmm1, %zmm0 {%k1} {z}
vmovaps %zmm0, (%rdi)
movq %rdi, %rax
vzeroupper
retq
nopw %cs:(%rax,%rax)
This is 60 lines. With 32 bit integers, we have 49 lines:
.text
vmovups (%rsi), %zmm0
movabsq $139690975841852, %rax # imm = 0x7F0C56FE4E3C
vmulps (%rax){1to16}, %zmm0, %zmm1
vrndscaleps $4, %zmm1, %zmm1
vcvttps2dq %zmm1, %zmm2
movabsq $139690975841892, %rax # imm = 0x7F0C56FE4E64
vbroadcastss (%rax), %zmm3
movabsq $139690975841900, %rax # imm = 0x7F0C56FE4E6C
vcmpnltps (%rax){1to16}, %zmm0, %k1
movabsq $139690975841856, %rax # imm = 0x7F0C56FE4E40
vcmpnltps %zmm0, %zmm3, %k2
vfmadd231ps (%rax){1to16}, %zmm1, %zmm0
movabsq $139690975841860, %rax # imm = 0x7F0C56FE4E44
vfmadd231ps (%rax){1to16}, %zmm1, %zmm0
movabsq $139690975841864, %rax # imm = 0x7F0C56FE4E48
vbroadcastss (%rax), %zmm1
movabsq $139690975841868, %rax # imm = 0x7F0C56FE4E4C
vfmadd213ps (%rax){1to16}, %zmm0, %zmm1
movabsq $139690975841872, %rax # imm = 0x7F0C56FE4E50
vfmadd213ps (%rax){1to16}, %zmm0, %zmm1
movabsq $139690975841876, %rax # imm = 0x7F0C56FE4E54
vfmadd213ps (%rax){1to16}, %zmm0, %zmm1
movabsq $139690975841880, %rax # imm = 0x7F0C56FE4E58
vfmadd213ps (%rax){1to16}, %zmm0, %zmm1
movabsq $139690975841884, %rax # imm = 0x7F0C56FE4E5C
vfmadd213ps (%rax){1to16}, %zmm0, %zmm1
vmulps %zmm0, %zmm0, %zmm3
vmulps %zmm1, %zmm3, %zmm1
vaddps %zmm1, %zmm0, %zmm0
movabsq $139690975841888, %rax # imm = 0x7F0C56FE4E60
vaddps (%rax){1to16}, %zmm0, %zmm0
vpsrld $1, %zmm2, %zmm1
vpslld $23, %zmm1, %zmm3
vpbroadcastd (%rax), %zmm4
vpaddd %zmm4, %zmm3, %zmm3
vmulps %zmm3, %zmm0, %zmm0
vpsubd %zmm1, %zmm2, %zmm1
vpslld $23, %zmm1, %zmm1
vpaddd %zmm4, %zmm1, %zmm1
movabsq $139690975841896, %rax # imm = 0x7F0C56FE4E68
vbroadcastss (%rax), %zmm2
vmulps %zmm1, %zmm0, %zmm2 {%k2}
vmovaps %zmm2, %zmm0 {%k1} {z}
vmovaps %zmm0, (%rdi)
movq %rdi, %rax
vzeroupper
retq
nopl (%rax)
Here are the parts related to this, in the 64 bit int version:
vrndscaleps $4, %zmm0, %zmm3
vextractf64x4 $1, %zmm3, %ymm0
vcvttps2qq %ymm0, %zmm0
vcvttps2qq %ymm3, %zmm1
...
vpsraq $1, %zmm1, %zmm3
vpsraq $1, %zmm0, %zmm4
movabsq $139690975831064, %rax # imm = 0x7F0C56FE2418
vpbroadcastq (%rax), %zmm5
vpaddq %zmm5, %zmm4, %zmm6
vpaddq %zmm5, %zmm3, %zmm7
movabsq $139690975831104, %rax # imm = 0x7F0C56FE2440
vmovdqa32 (%rax), %zmm8
vpermt2d %zmm6, %zmm8, %zmm7
vpslld $23, %zmm7, %zmm6
vmulps %zmm6, %zmm2, %zmm2
vpsubq %zmm3, %zmm1, %zmm1
vpsubq %zmm4, %zmm0, %zmm0
vpaddq %zmm5, %zmm0, %zmm0
vpaddq %zmm5, %zmm1, %zmm1
vpermt2d %zmm0, %zmm8, %zmm1
vpslld $23, %zmm1, %zmm0
32 bit int:
vrndscaleps $4, %zmm1, %zmm1
vcvttps2dq %zmm1, %zmm2
...
vpsrld $1, %zmm2, %zmm1
vpslld $23, %zmm1, %zmm3
vpbroadcastd (%rax), %zmm4
vpaddd %zmm4, %zmm3, %zmm3
vmulps %zmm3, %zmm0, %zmm0
vpsubd %zmm1, %zmm2, %zmm1
vpslld $23, %zmm1, %zmm1
Allocating and operating on two registers instead of one.
However, you are correct:
julia> using SLEEF, SIMD
julia> x = Vec{16,Float32}((1f3,1f5,1f7,1f9,1f11,1f13,1f15,1f17,1f19,1f21,1f23,1f25,1f27,1f29,1f31,1f33))
<16 x Float32>[1000.0, 100000.0, 1.0e7, 1.0e9, 1.0e11, 1.0e13, 1.0e15, 1.0e17, 1.0e19, 1.0e21, 1.0e23, 1.0e25, 1.0e27, 1.0e29, 1.0e31, 1.0e33]
julia> SLEEF.sin(x)
<16 x Float32>[0.82687956, 0.0357488, 0.13669702, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0]
julia> using SLEEFwrap
julia> Vec{16,Float32}(SLEEFwrap.sin(x.elts)) # let's print pretty
<16 x Float32>[0.82687956, 0.0357488, 0.42054778, 0.5458434, 0.99810874, 0.96887577, 0.9944343, -0.5699717, 0.5780979, 0.7704365, -0.925232, -0.40585858, -0.97865087, 0.8592228, -0.039693512, 0.33392745]
julia> sin(x)
<16 x Float32>[0.82687956, 0.0357488, 0.42054778, 0.5458434, 0.99810874, 0.96887577, 0.9944343, -0.5699717, 0.5780979, 0.7704365, -0.925232, -0.40585858, -0.97865087, 0.8592228, -0.039693512, 0.33392745]
julia> using BenchmarkTools
julia> @benchmark SLEEFwrap.sin($x.elts)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 4400699998796492385
--------------
minimum time: 77.960 ns (0.00% GC)
median time: 79.492 ns (0.00% GC)
mean time: 79.636 ns (0.00% GC)
maximum time: 125.621 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 968
julia> @benchmark SLEEF.sin($x).elts
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 4400699998796492385
--------------
minimum time: 19.888 ns (0.00% GC)
median time: 19.992 ns (0.00% GC)
mean time: 20.038 ns (0.00% GC)
maximum time: 49.129 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 997
julia> @benchmark sin($x).elts
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 4400699998796492385
--------------
minimum time: 198.555 ns (0.00% GC)
median time: 199.522 ns (0.00% GC)
mean time: 200.183 ns (0.00% GC)
maximum time: 254.833 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 600
SLEEF (the C library)'s sin still has solid range on these trig functions, but slows down dramatically compared to values close to 0:
julia> using SIMD: VE
julia> vx16 = Vec{16,Float32}(ntuple(Val(16)) do x VE(randn(Float32)) end);
julia> @benchmark SLEEF.sin($vx16).elts
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 4570671147679511021
--------------
minimum time: 19.904 ns (0.00% GC)
median time: 20.008 ns (0.00% GC)
mean time: 20.170 ns (0.00% GC)
maximum time: 48.694 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 997
julia> @benchmark SLEEFwrap.sin($vx16.elts)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 4570671147679511021
--------------
minimum time: 10.420 ns (0.00% GC)
median time: 10.553 ns (0.00% GC)
mean time: 10.575 ns (0.00% GC)
maximum time: 33.753 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
julia> @benchmark sin($vx16).elts
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 4570671147679511021
--------------
minimum time: 60.156 ns (0.00% GC)
median time: 60.584 ns (0.00% GC)
mean time: 60.808 ns (0.00% GC)
maximum time: 83.603 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 982
The C library behaves much better in both cases: it gets the correct answer much more quickly for typical values, and for extreme values it gets the correct answer at all.
Perhaps we use 32 bit integers for functions like exp, and 64 bit integers for the periodic trig functions?
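If we went that route, the dispatch could look something like this (purely hypothetical helper names, not part of this PR):

using SIMD

# Hypothetical per-function integer-type selection:
# exp-style functions only need a small exponent range, so 32 bit lanes suffice;
# trig argument reduction would keep the wider 64 bit range.
expinttype(::Type{Vec{N,Float32}}) where {N}  = Vec{N,Int32}
triginttype(::Type{Vec{N,Float32}}) where {N} = Vec{N,Int64}
expinttype(::Type{T}) where {T<:Union{Float32,Float64}}  = Int
triginttype(::Type{T}) where {T<:Union{Float32,Float64}} = Int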
I replaced all instances of fpinttype(T) with Int in the trig file (locally), but:
julia> SLEEF.sin(x)
<16 x Float32>[0.82687956, 0.0357488, 0.13669702, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0]
is still the answer I get.
That is because of line 213:
u = vifelse((!isinf(t)) & (isnegzero(t) | (abs(t) > TRIG_MAX(T))), T(-0.0), u)
and SLEEF.TRIG_MAX(Float32) returning 1.0f7.
However, that still does not explain:
julia> (SLEEF.sin(1e7),SLEEF.sin(1f7),SLEEF.sin_fast(1f7),sin(1e7))
(0.4205477931907825, 0.13669702f0, 0.4205478f0, 0.4205477931907825)
why SLEEF.sin is getting the wrong answer here.
Checking out the latest master (rather than this PR)...
julia> (SLEEF.sin(1e7),SLEEF.sin(1f7),SLEEF.sin_fast(1f7),sin(1e7))
(0.4205477931907825, 0.13669702f0, 0.4205478f0, 0.4205477931907825)
so it is an existing problem.
If you're using TRIG_MAX(::Type{Float32}) = 1f7, then 32 bit integers should be okay: the range-reduction quotients stay around 1e7, far below typemax(Int32) ≈ 2.1e9.
OK, I see, but the trig functions are only guaranteed over the following ranges (from the SLEEF notes), not 1f7:
The trigonometric functions are tested to return values with specified accuracy when the argument is within the following range:
Double (Float64) precision trigonometric functions: [-1e+14, 1e+14]
Single (Float32) precision trigonometric functions: [-39000, 39000]
If I recall correctly, it's a lot faster for non-vectorized code to always use the machine-size Int, even when operating on 32 bit floats.
However, according to your analysis this is not true for the vector versions.
AFAIK, 32 and 64 bit operations should be equally fast when not vectorized, as long as you don't need to promote from one to the other. (Note that pointers on 64 bit machines are 64 bits.)
The difference is when they are vectorized: half the bits means you can fit twice as many values into a register, and operate on twice as many per instruction.
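A quick way to see the register-packing point, assuming SIMD.jl is loaded (sizes in bytes):

using SIMD

sizeof(Vec{16,Float32})  # 64: fills one 512 bit register
sizeof(Vec{16,Int32})    # 64: one register
sizeof(Vec{16,Int64})    # 128: spills across two 512 bit registers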
Overview of this PR:
The C SLEEF (SIMD Library for Evaluating Elementary Functions) library provides vectorized elementary functions. Therefore, I thought it makes sense to let SLEEF.jl support SIMD.jl's Vec{N,T} vector type. This PR provides preliminary support.
Testing a bunch of functions: exp, log, sin, tan, cbrt.
Performance is currently often 2 or 3x worse than SLEEFwrap.jl (which wraps the C library).
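A minimal usage sketch of the new vector support (assuming this branch of SLEEF.jl; the tested functions are listed above):

using SLEEF, SIMD

v = Vec{8,Float64}((0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8))
SLEEF.exp(v)   # elementwise exp across all 8 lanes at once
SLEEF.sin(v)   # likewise for sin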