usage of sincos #41

Closed
jaakkor2 opened this issue Jan 30, 2020 · 7 comments

@jaakkor2 (Contributor) commented Jan 30, 2020

Here are two variants of a function where I cannot figure out how to apply @avx in front of the for loop:

function f1(N)
	x = π*cumsum(ones(N))/2
	y = repeat(similar(x),1,2)
	for i = 1:length(x)
		y[i,1:2] .= sincos(x[i])
	end
	return y
end

function f2(N)
	x = π*cumsum(ones(N))/2
	y = repeat(similar(x),1,2)
	for i = 1:length(x)
		y[i,1], y[i,2] = sincos(x[i])
	end
	return y
end

If I refer to https://software.intel.com/en-us/mkl-vmperfdata-sincos and https://software.intel.com/en-us/mkl-vmperfdata-sin, those pages suggest that there is not much to gain from evaluating sin and cos together with sincos rather than separately. But I would like to test it myself.

@chriselrod (Member)

Hmm. I need to add support for multiple return values (or, equivalently, returning a tuple and unpacking it).
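
In the meantime, a possible workaround (an untested sketch on my part, with a made-up name f3; it assumes @avx handles sin and cos inside a loop, as it does other SLEEFPirates-backed functions) is to evaluate them separately:

using LoopVectorization

function f3(N)
    x = π*cumsum(ones(N))/2
    y = Matrix{Float64}(undef, N, 2)
    @avx for i = 1:length(x)
        y[i,1] = sin(x[i])  # evaluated via the vectorized sin
        y[i,2] = cos(x[i])  # likewise for cos
    end
    return y
end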

In case your curiosity is impatient and you don't want to wait for me to fix this issue: LoopVectorization uses SLEEFPirates for evaluating special functions, so you can also call them directly.

julia> using SLEEFPirates, SIMDPirates

julia> x = SVec(ntuple(Val(4)) do i Core.VecElement(rand()) end)
SVec{4,Float64}<0.25773116159740384, 0.6887968844827816, 0.3906818208655902, 0.5101711516658685>

julia> sin(x)
SVec{4,Float64}<0.25488730940177806, 0.6356088237080095, 0.38081894899915164, 0.4883266114084775>

julia> cos(x)
SVec{4,Float64}<0.966970764555952, 0.7720112843893673, 0.9246496244973993, 0.8726609425145104>

julia> sincos(x)
(SVec{4,Float64}<0.25488730940177806, 0.6356088237080095, 0.38081894899915164, 0.4883266114084775>, SVec{4,Float64}<0.966970764555952, 0.7720112843893673, 0.9246496244973993, 0.8726609425145105>)

julia> using BenchmarkTools

julia> @benchmark sincos($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.699 ns (0.00% GC)
  median time:      21.903 ns (0.00% GC)
  mean time:        22.186 ns (0.00% GC)
  maximum time:     333.447 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark sin($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.016 ns (0.00% GC)
  median time:      20.105 ns (0.00% GC)
  mean time:        20.424 ns (0.00% GC)
  maximum time:     326.284 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark cos($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.389 ns (0.00% GC)
  median time:      20.481 ns (0.00% GC)
  mean time:        20.926 ns (0.00% GC)
  maximum time:     226.778 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> versioninfo()
Julia Version 1.5.0-DEV.168
Commit c4c4a13* (2020-01-28 16:23 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.0 (ORCJIT, haswell)

To compare with the links: the CPU I tested on is closest to the Intel® Xeon® CPU E5-2699 v4.
At 2.5 GHz, 20 ns corresponds to 20 ns × 2.5 GHz / 4 elements = 12.5 clocks per element.

I assume "HA" and "LA" mean high and low accuracy, respectively?
This would make Intel's high accuracy sin roughly twice as fast as the sin from SLEEFPirates, while their sincos is only a little faster.

SLEEFPirates does far better than base Julia, but is still definitely suboptimal. It would be great to improve it, or to switch to a better library, in the future.
This is true for most functions. I included an exp benchmark in the Discourse thread, where you'll see that LoopVectorization was much faster than base Julia and Clang, but lagged behind gfortran and the Intel compilers.

Note that the C library SLEEF is now on major version 3, while SLEEFPirates.jl was forked from SLEEF.jl, which was based on SLEEF major version 2.

@zsoerenm commented Feb 4, 2020

Adding to this conversation, since you were interested in fixed-point sine and cosine approximation: I have written an article on Nextjournal about my findings: https://nextjournal.com/zorn/fast-fixed-point-sine-and-cosine-approximation-with-julia
There is indeed a better way than simply doing cos(x) = sin(x + pi / 2).
As you will see in the article, the Int16 and Int32 variants vectorize quite well using standard Julia. The Int64 variant is a little slower. I hope that once LoopVectorization supports multiple return values, I can get this faster.

@chriselrod (Member) commented Feb 11, 2020

@zsoerenm You may be interested in VectorizedRNG.jl, which I discussed before.

I'm still using a floating point approximation of sin and cosine instead of fixed point, but I am using a very simple approximation I wrote. Floating point has the advantage of not needing to keep track of the exponents and lets you avoid all those shifts (although shifts are very fast).
Because I need a random sin and cos on the interval (0,2pi), I approximate a single quadrant of sin on (0,pi/2), shift cos appropriately, and copy random sign bits to get the full (0, 2pi) range.

I used Remez.jl for my polynomial coefficients.

My random log(U), where U ~ Uniform(0,1), isn't quite as big an improvement. That one takes a random integer as input.
I pass in a random series of bits, count the number of leading zeros to get a random 2^lz exponent, and then use the remaining bits, up to a maximum of 52, for a uniform random floating-point value in (1,2).
Then I approximate log2(x) over the interval (1,2) and add on the leading zeros.
Multiplying by 1/log2(e) gives a random base-e log. Math:

log(U)  = log2(U) / log2(e)
log2(U) = log2(U * 2^lz / 2^lz)
        = log2(U * 2^lz) - log2(2^lz)
        = log2(U * 2^lz) - lz

and U * 2^lz is in the interval (1,2). This works for now, but I'm sure there are better approximations, so I'll look into this more later.
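
As a self-contained sketch of that construction (my own illustration with a made-up name; Base's log2 stands in for the polynomial approximation over (1,2)):

# Sample log(U), U ~ Uniform(0,1), from one random UInt64.
function randlog(u::UInt64)
    lz = leading_zeros(u)  # geometric: P(lz = k) = 2^-(k+1)
    # the bits after the leading one fill the mantissa of m in [1, 2)
    m = reinterpret(Float64, ((u << (lz + 1)) >> 12) | 0x3ff0000000000000)
    # U ≈ m * 2^-(lz + 1), so log2(U) = log2(m) - lz - 1,
    # and log(U) = log2(U) / log2(e) = log2(U) * log(2)
    (log2(m) - lz - 1) * log(2)
end

so randlog(rand(UInt64)) should behave like log(rand()).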

Unfortunately, counting the number of leading zeros requires AVX512CD to vectorize, but this is still reasonably fast on AVX2.

Putting these together, I get about a 2x improvement on Random.randn! with AVX2, and a more than 4x improvement with AVX512.

@zsoerenm commented Feb 13, 2020

Very interesting!

Floating point has the advantage of not needing to keep track of the exponents and lets you avoid all those shifts (although shifts are very fast).

Sure; however, my requirement for accuracy is very low: 2 or 3 bits would already be enough. Unfortunately there is no built-in Float16 that I can use to make things faster, so Int16 was the best bet.

I used Remez.jl for my polynomial coefficients

As far as I understand, it does a sort of least-squares fit, doesn't it? In that case you'll have to pay attention to the boundaries of the quarter sine/cosine approximation; otherwise the function jumps at every quarter of a circle. I have chosen the polynomials in such a way that the boundaries have zero error.

Here you'll find a solution for a sixth-order polynomial that has zero error at the boundaries and lets you calculate sin and cos simultaneously: http://www.olliw.eu/2014/fast-functions/

and copy random sign bits to get the full (0, 2pi) range

How do you restrict the phase to be between 0 and 2pi? I found that the mod function is quite expensive.

@chriselrod (Member) commented Feb 14, 2020

Sure; however, my requirement for accuracy is very low: 2 or 3 bits would already be enough. Unfortunately there is no built-in Float16 that I can use to make things faster, so Int16 was the best bet.

Ah, if you want to minimize the number of bits per number, that makes sense.

As far as I understand, it does a sort of least-squares fit, doesn't it?

It does min-max fitting. That is, it minimizes the maximum error over the given range, which in my case was only the first of four quadrants.
Also, because I'm only approximating a quarter of the circle, it can't exactly jump every quarter circle ;).
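
For illustration, a minimax fit of this kind with Remez.jl might look roughly like the following (the degrees and interval here are my guesses, not necessarily what VectorizedRNG uses):

using Remez
# Degree-5 polynomial (denominator degree 0, i.e. not rational) minimax
# fit of the quarter-wave sine on [0, 1]; E is the worst-case error.
ncoefs, dcoefs, E, X = ratfn_minimax(x -> sin(big(π)/2 * x), (0, 1), 5, 0)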

How do you restrict the phase to be between 0 and 2pi? I found that the mod function is quite expensive.

The inputs are random 64-bit integers. I use the first two bits to ultimately pick the signs of sin and cos respectively, after evaluating the quarter-circle approximation. This is equivalent to randomly picking the quadrant. I ignore the next 10 bits.
The last 52 bits make the fractional component of a 64-bit Float in [1,2) (after appropriately setting the first 12 bits):

julia> VectorizedRNG.mask(typemin(UInt), Float64)
1.0

julia> VectorizedRNG.mask(typemax(UInt), Float64)
1.9999999999999998
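
At the bit level, that mask amounts to something like this one-liner (my own sketch, not VectorizedRNG's definition):

# Keep the low 52 bits of u as the mantissa and splice in the exponent
# bits of 1.0 (0x3ff), yielding a Float64 in [1, 2).
mask12(u::UInt64) = reinterpret(Float64, (u & 0x000fffffffffffff) | 0x3ff0000000000000)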

From there, I adjust into the open-open (0,1) interval, where sin and cos are mirrored across 0.5.
Roughly, if r is the random float in [1,2), then I evaluate a scaled sin approximation at roughly both r - 1 and 2 - r to get the sine and cosine, respectively.

The sin approximation is scaled so that numbers in the range (0,1) correspond to (0,pi/2). Of course, it is only valid over this range.
"Roughly", because I translate [1,2) to (0,1) instead of to [0,1) and (0,1] as would happen if I used exactly r - 1 and 2 - r.

So no need for mod. The phase is restricted by design.
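
Putting the pieces together, the quadrant trick could be sketched like this (hypothetical names; the actual randsincos uses a Remez polynomial rather than Base.sin, and nudges the argument into the open interval (0,1)):

quarter_sin(t) = sin(π/2 * t)  # stand-in for the scaled minimax polynomial

function sketch_randsincos(u::UInt64)
    ssign = u & 0x8000000000000000         # first random bit: sign of sin
    csign = (u << 1) & 0x8000000000000000  # second random bit: sign of cos
    r = mask12(u)                          # mask12 from the sketch above; r in [1, 2)
    s = quarter_sin(r - 1)                 # quadrant value for sin
    c = quarter_sin(2 - r)                 # mirrored argument gives cos
    (reinterpret(Float64, xor(reinterpret(UInt64, s), ssign)),  # copy the sign bits
     reinterpret(Float64, xor(reinterpret(UInt64, c), csign)))
end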

Additionally, because the purpose is as part of a random number generator, and because the only inputs are themselves these random unsigned integers, continuity isn't important.
What is important is that the outputs are equivalent to what I'd get if I randomly sampled

sincos(2pi*rand())

The boundary values are:

julia> using VectorizedRNG

julia> u = Core.VecElement.((0x0000000000000000, 0x0fffffffffffffff))
(VecElement{UInt64}(0x0000000000000000), VecElement{UInt64}(0x0fffffffffffffff))

julia> VectorizedRNG.randsincos(u, Float64)
((VecElement{Float64}(3.9734697296186967e-7), VecElement{Float64}(1.000000397346973)), (VecElement{Float64}(1.000000397346973), VecElement{Float64}(3.973469727874718e-7)))

So it doesn't match exactly at 0 or 1. In fact, at 1, it is outside of the valid range for sin and cos (slightly greater than 1).
But what matters is that it is relatively close. The final numbers are supposed to be normally distributed, and randsincos is an internal function, not intended to be used by others. Before its outputs are returned to someone using the random normal generator, they're multiplied by another random number. That number is exponentially distributed, meaning it can be greater or less than 1, effectively hiding these small errors and the fact that this result is slightly greater than 1.

What I do want is to use a lot of bits to generate the random number, so that a long stream of them would be unique, and so that it passes RNG tests like those in RNGTest.jl. I should run it against BigCrush when I get home tonight, but SmallCrush isn't a problem:

julia> using RNGTest, Distributions, Random, VectorizedRNG

julia> struct RandNormal01{T<:VectorizedRNG.AbstractPCG} <: Random.AbstractRNG
          pcg::T
       end

julia> function Random.rand!(r::RandNormal01, x::AbstractArray)
           randn!(r.pcg, x)
           x .= cdf.(Normal(0,1), x) # small crush tests uniform numbers, so transform normal to uniform
       end

julia> rngnorm = RNGTest.wrap(RandNormal01(local_pcg()), Float64);

julia> RNGTest.smallcrushTestU01(rngnorm)

========= Summary results of SmallCrush =========

 Version:          TestU01 1.2.3
 Generator:
 Number of statistics:  15
 Total CPU time:   00:00:15.92

 All tests were passed

EDIT:
Unfortunately, BigCrush has been showing me that I need a lot more accuracy.

@chriselrod (Member) commented Feb 19, 2020

@zsoerenm
BTW, integer mod with respect to powers of 2 is very fast. The trick is to replace a % m with a & (m - 1). For example, if you want the mod with respect to 2^30:

julia> const MODMASK30 = (one(UInt32) << 30) - 0x01

julia> mod30(u) = u & MODMASK30

julia> u = rand(UInt32)
0xea0e3557

julia> mod30(u) === u % (UInt32(2)^30)
true

This works for powers of 2 because

julia> bitstring(one(UInt32) << 30)
"01000000000000000000000000000000"

julia> bitstring(one(UInt32) << 30 - one(UInt32))
"00111111111111111111111111111111"

You're just setting all the higher bits to 0, which is the same as subtracting off all the higher powers of 2 to leave only the remainder.

Although, if you use % with a constant power of two value, LLVM should have figured this out for you:

julia> rem30(u) = u % 0x40000000
rem30 (generic function with 1 method)

julia> @code_llvm debuginfo=:none rem30(u)

define i32 @julia_rem30_19002(i32) {
top:
  %1 = and i32 %0, 1073741823
  ret i32 %1
}

julia> reinterpret(Int32, MODMASK30)
1073741823

It's using an and with 1073741823 instead of calculating the mod, so LLVM obviously noticed that 0x40000000 is a power of two and used this trick itself.

@zsoerenm

Yes, I am aware of that. I also use it for the look-up-table comparison in https://nextjournal.com/zorn/fast-fixed-point-sine-and-cosine-approximation-with-julia and in the function get_angle(x), which I use to calculate sin(x) and cos(x) based on A(x) and B(x).
I thought that you were doing it in float arithmetic.
But it is interesting to see that LLVM does this automatically!
