# STREAMBenchmark.jl

* [JuliaPerf/STREAMBenchmark.jl: A version of the STREAM benchmark which measures the sustainable memory bandwidth.](https://github.com/JuliaPerf/STREAMBenchmark.jl)
* [JuliaSIMD/LoopVectorization.jl: Macro(s) for vectorizing loops.](https://github.com/JuliaSIMD/LoopVectorization.jl)

In [1]:
versioninfo()

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) Gold 6226 CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_DEPOT_PATH = /home/manabu/.julia-1.7.3
  JULIA_NUM_THREADS = 12


In [2]:
Threads.nthreads()

12

In [3]:
#import Pkg; Pkg.add("ThreadPinning")
using ThreadPinning

In [29]:
threadinfo()


| 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
  16,17,18,19,20,21,22,23 |

CPU IDs 0-11 = Julia thread, 12-23 = HT (no Julia thread on HT), | = Socket separator

Julia threads: 12
├ Occupied CPU-threads: 12
└ Mapping (Thread => CPUID): 1 => 0, 2 => 1, 3 => 2, 4 => 3, 5 => 4, ...



In [28]:
pinthreads(:compact)

In [5]:
#import Pkg; Pkg.add("STREAMBenchmark")
using STREAMBenchmark

In [7]:
STREAMBenchmark.last_cachesize() / 1024 / 1024  # last-level cache size in MiB

19.25

In [8]:
STREAMBenchmark.default_vector_length() / 1024 / 1024  # default vector length in Mi elements

9.625
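
The two values above are consistent with the default vector length being four times the last-level cache size, counted in `Float64` elements. A minimal sketch of that arithmetic, with the cache size hard-coded from the output above; the factor of 4 is inferred from these numbers, not read from the package source:

```julia
# Last-level cache size in bytes, taken from the output above (19.25 MiB).
cache_bytes = 19.25 * 1024^2

# Assumption: each vector spans 4x the cache, measured in 8-byte Float64 elements.
vector_length = Int(4 * cache_bytes / sizeof(Float64))

println(vector_length)                              # elements per array
println(vector_length / 1024^2)                     # the same count in Mi elements
println(vector_length * sizeof(Float64) / 1024^2)   # MiB per array
```

Sizing each array well beyond the cache is what keeps the benchmark measuring main-memory traffic rather than cache traffic.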

In [8]:
memory_bandwidth(verbose=true)

╔══╡ Multi-threaded:
╠══╡ (12 threads)
╟─ COPY:  119028.2 MB/s
╟─ SCALE: 117682.5 MB/s
╟─ ADD:   106470.1 MB/s
╟─ TRIAD: 114864.0 MB/s
╟─────────────────────
║ Median: 116273.2 MB/s
╚═════════════════════


(median = 116273.2, minimum = 106470.1, maximum = 119028.2)

In [6]:
benchmark()

╔══╡ Single-threaded:
╟─ COPY:  22871.2 MB/s
╟─ SCALE: 23005.5 MB/s
╟─ ADD:   20066.4 MB/s
╟─ TRIAD: 19901.2 MB/s
╟─────────────────────
║ Median: 21468.8 MB/s
╚═════════════════════

╔══╡ Multi-threaded:
╠══╡ (12 threads)
╟─ COPY:  89700.4 MB/s
╟─ SCALE: 116668.5 MB/s
╟─ ADD:   114404.0 MB/s
╟─ TRIAD: 113976.9 MB/s
╟─────────────────────
║ Median: 114190.4 MB/s
╚═════════════════════



(single = (median = 21468.8, minimum = 19901.2, maximum = 23005.5), multi = (median = 114190.4, minimum = 89700.4, maximum = 116668.5))

In [12]:
STREAMBenchmark.vector_length_dependence()

1: 2523136 => 155508.2
2: 5046272 => 122824.1
3: 7569408 => 118074.6
4: 10092544 => 115707.4


Dict{Int64, Float64} with 4 entries:
  7569408  => 1.18075e5
  2523136  => 1.55508e5
  5046272  => 1.22824e5
  10092544 => 1.15707e5
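
One plausible reading of the falling bandwidth is a cache effect: the shortest vector is exactly as large as the last-level cache, so part of its traffic may never reach main memory. The sketch below only relates the four lengths to the 19.25 MiB cache size reported earlier; it is an interpretation aid, not something the package prints:

```julia
cache_mib = 19.25   # last-level cache size in MiB, from the earlier output
lengths = [2523136, 5046272, 7569408, 10092544]   # lengths from the output above

for n in lengths
    array_mib = n * sizeof(Float64) / 1024^2      # footprint of one Float64 array
    println(n, " elements = ", array_mib, " MiB per array = ",
            array_mib / cache_mib, " x cache")
end
```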

In [13]:
y = scaling_benchmark()

# Threads: 1	Max. memory bandwidth: 22707.2
# Threads: 2	Max. memory bandwidth: 43882.3
# Threads: 3	Max. memory bandwidth: 63514.0
# Threads: 4	Max. memory bandwidth: 79941.0
# Threads: 5	Max. memory bandwidth: 94599.4
# Threads: 6	Max. memory bandwidth: 102420.3
# Threads: 7	Max. memory bandwidth: 107036.7
# Threads: 8	Max. memory bandwidth: 111792.3
# Threads: 9	Max. memory bandwidth: 114532.9
# Threads: 10	Max. memory bandwidth: 116031.9
# Threads: 11	Max. memory bandwidth: 118681.9
# Threads: 12	Max. memory bandwidth: 119601.9


12-element Vector{Float64}:
  22707.2
  43882.3
  63514.0
  79941.0
  94599.4
 102420.3
 107036.7
 111792.3
 114532.9
 116031.9
 118681.9
 119601.9
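
To judge how well the bandwidth scales, the values can be turned into speedup and parallel efficiency relative to one thread. A small sketch with the numbers copied from the output above; the names `bw`, `speedup` and `efficiency` are illustrative only:

```julia
# Maximum bandwidths in MB/s for 1 to 12 threads, copied from the output above.
bw = [22707.2, 43882.3, 63514.0, 79941.0, 94599.4, 102420.3,
      107036.7, 111792.3, 114532.9, 116031.9, 118681.9, 119601.9]

speedup = bw ./ bw[1]                      # relative to a single thread
efficiency = speedup ./ (1:length(bw))     # fraction of ideal linear scaling

for (n, s, e) in zip(1:length(bw), speedup, efficiency)
    println(n, " threads: speedup ", round(s, digits = 2),
            ", efficiency ", round(e, digits = 2))
end
```

The efficiency drops as threads are added, which is the expected pattern once the memory controllers rather than the cores become the limit.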

In [22]:
#import Pkg; Pkg.add("UnicodePlots")
using UnicodePlots

In [25]:
lineplot(1:length(y), y, title = "Bandwidth Scaling", xlabel = "# cores", ylabel = "MB/s", border = :ascii, canvas = AsciiCanvas)

[Line plot "Bandwidth Scaling": x-axis "# cores", y-axis "MB/s", curve rising toward 120000 MB/s; plot output truncated in the export]

In [12]:
STREAMBenchmark.download_original_STREAM()

- Creating folder "stream"
- Downloading C STREAM benchmark
- Done.


In [41]:
STREAMBenchmark.compile_original_STREAM(compiler="gcc", multithreading=true)

- Trying to compile "stream.c" using gcc -march=skylake-avx512


LoadError: Unknown compiler option: gcc -march=skylake-avx512.

In [31]:
8*10092544/1024/1024  # MiB per Float64 array of the default length

77.0

In [13]:
STREAMBenchmark.execute_original_STREAM()

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10092544 (elements), Offset = 0 (elements)
Memory per array = 77.0 MiB (= 0.1 GiB).
Total memory required = 231.0 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 12
Number of Threads counted = 12
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1521 microseconds.
   (= 1521 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.

In [12]:
ENV["OMP_NUM_THREADS"]="12"

"12"

In [14]:
run(`stream/a.out`)

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10092544 (elements), Offset = 0 (elements)
Memory per array = 77.0 MiB (= 0.1 GiB).
Total memory required = 231.0 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 12
Number of Threads counted = 12
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1552 microseconds.
   (= 1552 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.

Process(`stream/a.out`, ProcessExited(0))

In [25]:
cmd = `stream/a.out`

`stream/a.out`

In [9]:
run(`stream/a.out`)

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10092544 (elements), Offset = 0 (elements)
Memory per array = 77.0 MiB (= 0.1 GiB).
Total memory required = 231.0 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 24
Number of Threads counted = 24
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1551 microseconds.
   (= 1551 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.

Process(`stream/a.out`, ProcessExited(0))

* [CPU/SIMD Optimizations — NumPy v1.24.dev0 Manual](https://numpy.org/devdocs/reference/simd/index.html)
* [NEP 38 — Using SIMD optimization instructions for performance — NumPy Enhancement Proposals](https://numpy.org/neps/nep-0038-SIMD-optimizations.html)

* [Multi-Threading · The Julia Language](https://docs.julialang.org/en/v1/manual/multi-threading/)
* [Home · LoopVectorization.jl](https://docs.juliahub.com/LoopVectorization/4TogI/0.12.12/)
  - [Multithreading · LoopVectorization.jl](https://docs.juliahub.com/LoopVectorization/4TogI/0.12.12/examples/multithreading/)
* [ThreadPinning · ThreadPinning.jl](https://carstenbauer.github.io/ThreadPinning.jl/dev/)

In [18]:
using PyCall

In [19]:
np = pyimport("numpy");

In [20]:
const N = 2^17

131072

In [103]:
@time x0 = rand(Float64, (77*N,));
@time y0 = rand(Float64, (77*N,));

  0.009057 seconds (2 allocations: 77.000 MiB)
  0.008506 seconds (2 allocations: 77.000 MiB)


In [42]:
using LoopVectorization

In [105]:
@time for j in 1:10
    @avxt for i in eachindex(x0)
        @inbounds y0[i] = x0[i]
    end
end

  0.020797 seconds (330 allocations: 7.656 KiB)


In [106]:
sizeof(x0)/1024/1024

77.0

In [111]:
# Copy bandwidth in GiB/s: each pass reads 77 MiB and writes 77 MiB; 10 passes took 0.020797 s
2*77/(0.020797/10)/1024

72.31361494446315
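
For comparison, the same copy measurement can be sketched with Base threading only. The helpers `copy_kernel!` and `copy_bandwidth` are hypothetical names introduced here, and the sketch reports the best of several passes (as the C STREAM does) instead of the 10-pass average used above, so its numbers are not directly comparable:

```julia
using Base.Threads: @threads

# Hypothetical helper: copy x into y with statically scheduled threads.
function copy_kernel!(y, x)
    @threads :static for i in eachindex(x, y)
        @inbounds y[i] = x[i]
    end
    return y
end

# Hypothetical helper: best-of-`reps` copy bandwidth in GiB/s.
function copy_bandwidth(x, y; reps = 10)
    copy_kernel!(y, x)                          # warm-up pass, excludes compilation
    t = minimum(@elapsed(copy_kernel!(y, x)) for _ in 1:reps)
    bytes = sizeof(x) + sizeof(y)               # one read plus one write per element
    return bytes / t / 1024^3
end

n = 77 * 2^17                                   # same length as x0 and y0 above
x = rand(Float64, n)
y = similar(x)
println(copy_bandwidth(x, y), " GiB/s")
```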

In [41]:
using Base.Threads: nthreads, @threads

In [59]:
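# Switch: true selects LoopVectorization's @avxt, false selects Base's static @threads
avxt() = true
# Wrap a loop so that avxt() decides which threading backend runs it
macro threaded(code)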
    return esc(:(if $(@__MODULE__).avxt()
                     @avxt($code)
                 else
                     @threads(:static, $code)
                 end))
end

@threaded (macro with 1 method)