Inline masking #83

Merged · 1 commit into mtrudel:main · Jan 19, 2023

Conversation

moogle19 (Contributor) commented:

Inlining the mask function seems to yield some additional performance benefit.
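For context, the change boils down to adding an inline compile directive to the hot masking functions. A minimal sketch, with an illustrative module name; the masker itself mirrors the pre-existing `do_mask/3` implementation quoted later in this thread:

```elixir
# Sketch of the PR's change: a @compile directive asking the compiler to
# inline the hot masking functions. The module name is illustrative.
defmodule InlinedMask do
  @compile {:inline, mask: 2, do_mask: 3}

  # Masking is an involution, so the same function also unmasks
  def mask(payload, key),
    do: payload |> do_mask(<<key::32>>, []) |> IO.iodata_to_binary()

  # Consume the payload 32 bits at a time while it is long enough
  defp do_mask(<<h::32, rest::binary>>, <<m::32>> = mask, acc),
    do: do_mask(rest, mask, [acc, <<Bitwise.bxor(h, m)::32>>])

  # Byte-wise tail, rotating the mask one byte per step
  defp do_mask(<<h::8, rest::binary>>, <<c::8, m::24>>, acc),
    do: do_mask(rest, <<m::24, c::8>>, [acc, <<Bitwise.bxor(h, c)::8>>])

  defp do_mask(<<>>, _mask, acc), do: acc
end

masked = InlinedMask.mask("hello websocket", 1234)
true = masked != "hello websocket"
true = "hello websocket" == InlinedMask.mask(masked, 1234)
```

Because XOR masking round-trips, applying `mask/2` twice with the same key restores the payload, which makes the function easy to sanity-check.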

Benchmark
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.15.0-dev
Erlang 25.2.1

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: huge, large, medium, micro, small, tiny
Estimated total run time: 2.80 min

Benchmarking current with input huge ...
Benchmarking current with input large ...
Benchmarking current with input medium ...
Benchmarking current with input micro ...
Benchmarking current with input small ...
Benchmarking current with input tiny ...
Benchmarking current_inlined with input huge ...
Benchmarking current_inlined with input large ...
Benchmarking current_inlined with input medium ...
Benchmarking current_inlined with input micro ...
Benchmarking current_inlined with input small ...
Benchmarking current_inlined with input tiny ...

##### With input huge #####
Name                      ips        average  deviation         median         99th %
current_inlined        377.47        2.65 ms    ±46.17%        2.34 ms        8.56 ms
current                315.21        3.17 ms    ±52.98%        2.71 ms       11.99 ms

Comparison:
current_inlined        377.47
current                315.21 - 1.20x slower +0.52 ms

Memory usage statistics:

Name               Memory usage
current_inlined         4.11 MB
current                 4.41 MB - 1.07x memory usage +0.30 MB

**All measurements for memory usage were the same**

##### With input large #####
Name                      ips        average  deviation         median         99th %
current_inlined        3.47 K      287.88 μs    ±62.34%      277.08 μs      555.56 μs
current                3.11 K      321.55 μs   ±118.10%      294.75 μs      862.33 μs

Comparison:
current_inlined        3.47 K
current                3.11 K - 1.12x slower +33.67 μs

Memory usage statistics:

Name               Memory usage
current_inlined       422.02 KB
current               452.69 KB - 1.07x memory usage +30.66 KB

**All measurements for memory usage were the same**

##### With input medium #####
Name                      ips        average  deviation         median         99th %
current_inlined       26.36 K       37.93 μs   ±326.09%       19.38 μs      614.08 μs
current               23.46 K       42.63 μs   ±464.79%       20.25 μs      608.37 μs

Comparison:
current_inlined       26.36 K
current               23.46 K - 1.12x slower +4.70 μs

Memory usage statistics:

Name               Memory usage
current_inlined        42.77 KB
current                45.89 KB - 1.07x memory usage +3.13 KB

**All measurements for memory usage were the same**

##### With input micro #####
Name                      ips        average  deviation         median         99th %
current_inlined        1.21 M      824.34 ns ±13368.33%         208 ns        1250 ns
current                1.13 M      881.30 ns ±12486.04%         208 ns        1292 ns

Comparison:
current_inlined        1.21 M
current                1.13 M - 1.07x slower +56.95 ns

Memory usage statistics:

Name               Memory usage
current_inlined           488 B
current                   528 B - 1.08x memory usage +40 B

**All measurements for memory usage were the same**

##### With input small #####
Name                      ips        average  deviation         median         99th %
current_inlined      189.49 K        5.28 μs  ±1578.45%        2.50 μs        7.50 μs
current              167.15 K        5.98 μs  ±1661.28%        2.54 μs        7.63 μs

Comparison:
current_inlined      189.49 K
current              167.15 K - 1.13x slower +0.71 μs

Memory usage statistics:

Name               Memory usage
current_inlined         5.23 KB
current                 5.70 KB - 1.09x memory usage +0.47 KB

**All measurements for memory usage were the same**

##### With input tiny #####
Name                      ips        average  deviation         median         99th %
current_inlined      489.05 K        2.04 μs  ±5209.21%        0.71 μs        2.08 μs
current              397.25 K        2.52 μs  ±5079.90%        0.71 μs        2.08 μs

Comparison:
current_inlined      489.05 K
current              397.25 K - 1.23x slower +0.47 μs

Memory usage statistics:

Name               Memory usage
current_inlined         1.36 KB
current                 1.55 KB - 1.14x memory usage +0.195 KB

**All measurements for memory usage were the same**

mtrudel (Owner) commented Jan 19, 2023

Inlining the (trivial) String implementation of String.valid?/2 yields a similar 15% difference. I'll land that one here too!

mtrudel (Owner) commented Jan 19, 2023

On second thought, I've pushed UTF-8 validation improvements upstream: elixir-lang/elixir#12354.

This looks great! Merging!

@mtrudel mtrudel merged commit 60960d6 into mtrudel:main Jan 19, 2023
@mtrudel mtrudel added the benchmark Assign this to a PR to have the benchmark CI suite run label Jan 19, 2023
mtrudel (Owner) commented Jan 19, 2023

Bad news: this is actually quite a bit slower on x86:

Benchmark

Mix.install([:benchee])

defmodule Old do
  # Note that masking is an involution, so we don't need a separate unmask function
  def mask(payload, mask) do
    payload
    |> do_mask(<<mask::32>>, [])
    |> IO.iodata_to_binary()
  end

  defp do_mask(<<h::32, rest::binary>>, <<int_mask::32>> = mask, acc) do
    do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::32>>])
  end

  defp do_mask(<<h::8, rest::binary>>, <<current::8, mask::24>>, acc) do
    do_mask(rest, <<mask::24, current::8>>, [acc, <<Bitwise.bxor(h, current)::8>>])
  end

  defp do_mask(<<>>, _mask, acc), do: acc
end

defmodule New512 do
  @mask_size 512
  # Note that masking is an involution, so we don't need a separate unmask function
  def mask(payload, mask) do
    payload
    |> do_mask(String.duplicate(<<mask::32>>, div(@mask_size, 32)), [])
    |> IO.iodata_to_binary()
  end

  # Matching the full mask size
  defp do_mask(
         <<h::unquote(@mask_size), rest::binary>>,
         <<int_mask::unquote(@mask_size)>> = mask,
         acc
       ) do
    do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::unquote(@mask_size)>>])
  end

  defp do_mask(<<h::32, rest::binary>>, <<int_mask::32, _mask_rest::binary>> = mask, acc) do
    do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::32>>])
  end

  defp do_mask(<<h::8, rest::binary>>, <<current::8, mask::24, _mask_rest::binary>>, acc) do
    do_mask(rest, <<mask::24, current::8>>, [acc, <<Bitwise.bxor(h, current)::8>>])
  end

  defp do_mask(<<>>, _mask, acc), do: acc
end

defmodule Adaptive512 do
  @compile {:inline, mask: 2, do_mask: 3}
  @mask_size 512
  # Note that masking is an involution, so we don't need a separate unmask function
  def mask(payload, mask) do
    payload
    |> do_mask(String.duplicate(<<mask::32>>, div(@mask_size, 32)), [])
    |> IO.iodata_to_binary()
  end

  # Matching the full mask size
  defp do_mask(
         <<h::unquote(@mask_size), rest::binary>>,
         <<int_mask::unquote(@mask_size)>> = mask,
         acc
       ) do
    do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::unquote(@mask_size)>>])
  end

  defp do_mask(<<h::32, rest::binary>>, <<int_mask::32, _mask_rest::binary>> = mask, acc) do
    do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::32>>])
  end

  defp do_mask(<<h::8, rest::binary>>, <<current::8, mask::24, _mask_rest::binary>>, acc) do
    do_mask(rest, <<mask::24, current::8>>, [acc, <<Bitwise.bxor(h, current)::8>>])
  end

  defp do_mask(<<>>, _mask, acc), do: acc
end

Benchee.run(
  %{
    "old" => fn input -> Old.mask(input, 1234) end,
    "new_512" => fn input -> New512.mask(input, 1234) end,
    "adaptive_512" => fn input -> Adaptive512.mask(input, 1234) end
  },
  time: 10,
  memory_time: 2,
  inputs: %{
    "micro" => String.duplicate("a", 10),
    "tiny" => String.duplicate("a", 102),
    "small" => String.duplicate("a", 1_002),
    "medium" => String.duplicate("a", 10_002),
    "large" => String.duplicate("a", 100_002),
    "huge" => String.duplicate("a", 1_000_002)
  }
)
Operating System: Linux
CPU Information: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz
Number of Available Cores: 2
Available memory: 7.76 GB
Elixir 1.14.3
Erlang 25.2

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: huge, large, medium, micro, small, tiny
Estimated total run time: 4.20 min

Benchmarking adaptive_512 with input huge ...
Benchmarking adaptive_512 with input large ...
Benchmarking adaptive_512 with input medium ...
Benchmarking adaptive_512 with input micro ...
Benchmarking adaptive_512 with input small ...
Benchmarking adaptive_512 with input tiny ...
Benchmarking new_512 with input huge ...
Benchmarking new_512 with input large ...
Benchmarking new_512 with input medium ...
Benchmarking new_512 with input micro ...
Benchmarking new_512 with input small ...
Benchmarking new_512 with input tiny ...
Benchmarking old with input huge ...
Benchmarking old with input large ...
Benchmarking old with input medium ...
Benchmarking old with input micro ...
Benchmarking old with input small ...
Benchmarking old with input tiny ...

##### With input huge #####
Name                   ips        average  deviation         median         99th %
new_512             156.81        6.38 ms     ±5.56%        6.35 ms        7.25 ms
adaptive_512        118.30        8.45 ms    ±61.21%        6.01 ms       22.54 ms
old                  20.91       47.83 ms     ±6.28%       48.14 ms       59.11 ms

Comparison:
new_512             156.81
adaptive_512        118.30 - 1.33x slower +2.08 ms
old                  20.91 - 7.50x slower +41.46 ms

Memory usage statistics:

Name            Memory usage
new_512              4.41 MB
adaptive_512         4.11 MB - 0.93x memory usage -0.29814 MB
old                 22.89 MB - 5.19x memory usage +18.48 MB

**All measurements for memory usage were the same**

##### With input large #####
Name                   ips        average  deviation         median         99th %
new_512             1.91 K      522.66 μs    ±17.41%      525.41 μs      831.00 μs
adaptive_512        1.80 K      556.32 μs    ±52.98%      544.70 μs      753.57 μs
old                 0.33 K     3010.50 μs     ±8.84%     2985.59 μs     3451.06 μs

Comparison:
new_512             1.91 K
adaptive_512        1.80 K - 1.06x slower +33.66 μs
old                 0.33 K - 5.76x slower +2487.84 μs

Memory usage statistics:

Name            Memory usage
new_512            453.55 KB
adaptive_512       422.88 KB - 0.93x memory usage -30.66406 KB
old               2344.98 KB - 5.17x memory usage +1891.43 KB

**All measurements for memory usage were the same**

##### With input medium #####
Name                   ips        average  deviation         median         99th %
new_512            16.24 K       61.57 μs   ±156.28%       37.55 μs      617.97 μs
adaptive_512       15.79 K       63.33 μs   ±219.60%       42.44 μs      634.05 μs
old                 3.84 K      260.37 μs    ±41.70%      236.68 μs      543.08 μs

Comparison:
new_512            16.24 K
adaptive_512       15.79 K - 1.03x slower +1.76 μs
old                 3.84 K - 4.23x slower +198.80 μs

Memory usage statistics:

Name            Memory usage
new_512             46.75 KB
adaptive_512        43.63 KB - 0.93x memory usage -3.12500 KB
old                235.58 KB - 5.04x memory usage +188.83 KB

**All measurements for memory usage were the same**

##### With input micro #####
Name                   ips        average  deviation         median         99th %
old               360.05 K        2.78 μs  ±4248.51%        0.83 μs        3.62 μs
adaptive_512      333.70 K        3.00 μs  ±3878.65%        0.97 μs        3.92 μs
new_512           333.67 K        3.00 μs  ±3937.37%        0.95 μs        3.82 μs

Comparison:
old               360.05 K
adaptive_512      333.70 K - 1.08x slower +0.22 μs
new_512           333.67 K - 1.08x slower +0.22 μs

Memory usage statistics:

Name            Memory usage
old                  1.38 KB
adaptive_512         1.41 KB - 1.03x memory usage +0.0391 KB
new_512              1.45 KB - 1.06x memory usage +0.0781 KB

**All measurements for memory usage were the same**

##### With input small #####
Name                   ips        average  deviation         median         99th %
adaptive_512       86.47 K       11.56 μs  ±1065.86%        5.07 μs       19.42 μs
new_512            84.64 K       11.81 μs  ±1022.02%        5.02 μs       18.67 μs
old                31.18 K       32.07 μs   ±323.61%       13.37 μs      576.31 μs

Comparison:
adaptive_512       86.47 K
new_512            84.64 K - 1.02x slower +0.25 μs
old                31.18 K - 2.77x slower +20.51 μs

Memory usage statistics:

Name            Memory usage
adaptive_512         6.09 KB
new_512              6.55 KB - 1.08x memory usage +0.47 KB
old                 24.64 KB - 4.05x memory usage +18.55 KB

**All measurements for memory usage were the same**

##### With input tiny #####
Name                   ips        average  deviation         median         99th %
adaptive_512      224.28 K        4.46 μs  ±2753.79%        1.65 μs        5.44 μs
new_512           217.81 K        4.59 μs  ±2586.96%        1.69 μs        5.54 μs
old               155.02 K        6.45 μs  ±1800.17%        2.27 μs        7.70 μs

Comparison:
adaptive_512      224.28 K
new_512           217.81 K - 1.03x slower +0.133 μs
old               155.02 K - 1.45x slower +1.99 μs

Memory usage statistics:

Name            Memory usage
adaptive_512         2.22 KB
new_512              2.41 KB - 1.09x memory usage +0.195 KB
old                  3.55 KB - 1.60x memory usage +1.33 KB

**All measurements for memory usage were the same**

I've rescinded the upstream PR, and think I'll likely back this one out too. WDYT?
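Before weighing the benchmarks, the byte/word and wide-mask strategies can be cross-checked for equivalence. A self-contained sketch under stated assumptions: illustrative module names (`Mask32`, `Mask128`) and a shortened 128-bit mask for brevity, rather than the exact 512-bit modules in the script above:

```elixir
# Illustrative, standalone copies of the two strategies benchmarked above.
# Mask32 is the byte/word masker; Mask128 widens the mask (128-bit here;
# the PR's variant is 512-bit).
defmodule Mask32 do
  def mask(payload, key),
    do: payload |> do_mask(<<key::32>>, []) |> IO.iodata_to_binary()

  defp do_mask(<<h::32, rest::binary>>, <<m::32>> = mask, acc),
    do: do_mask(rest, mask, [acc, <<Bitwise.bxor(h, m)::32>>])

  defp do_mask(<<h::8, rest::binary>>, <<c::8, m::24>>, acc),
    do: do_mask(rest, <<m::24, c::8>>, [acc, <<Bitwise.bxor(h, c)::8>>])

  defp do_mask(<<>>, _mask, acc), do: acc
end

defmodule Mask128 do
  def mask(payload, key) do
    # Repeat the 32-bit key four times to build a 128-bit mask
    payload
    |> do_mask(String.duplicate(<<key::32>>, 4), [])
    |> IO.iodata_to_binary()
  end

  # Fast path: consume 128 bits of payload per iteration
  defp do_mask(<<h::128, rest::binary>>, <<m::128>> = mask, acc),
    do: do_mask(rest, mask, [acc, <<Bitwise.bxor(h, m)::128>>])

  # Tail shorter than 16 bytes: fall back to 32-bit steps
  defp do_mask(<<h::32, rest::binary>>, <<m::32, _::binary>> = mask, acc),
    do: do_mask(rest, mask, [acc, <<Bitwise.bxor(h, m)::32>>])

  # Tail shorter than 4 bytes: byte-wise with mask rotation
  defp do_mask(<<h::8, rest::binary>>, <<c::8, m::24, _::binary>>, acc),
    do: do_mask(rest, <<m::24, c::8>>, [acc, <<Bitwise.bxor(h, c)::8>>])

  defp do_mask(<<>>, _mask, acc), do: acc
end

input = String.duplicate("a", 1_002)
masked = Mask32.mask(input, 1234)

# Both variants emit the same repeating XOR keystream, so their outputs
# must match byte-for-byte, and masking twice must round-trip the input
true = masked == Mask128.mask(input, 1234)
true = input == Mask128.mask(masked, 1234)
```

Since both produce the same repeating keystream, any performance difference between them is purely a matter of how many bytes each reduction step touches.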

mtrudel (Owner) commented Jan 19, 2023

In general it's surprising how often things end up benching differently between Apple Silicon and x86.

moogle19 (Contributor, Author) commented Jan 19, 2023

> In general it's surprising how often things end up benching differently between Apple Silicon and x86.

Looks like it is not just Apple Silicon vs. x86; there are differences even between Intel and AMD:

AMD Benchmark
Operating System: Windows
CPU Information: AMD Ryzen 7 3700X 8-Core Processor
Number of Available Cores: 16
Available memory: 15.93 GB
Elixir 1.13.2
Erlang 24.1.7

##### With input huge #####
Name                   ips        average  deviation         median         99th %
adaptive_512        168.38        5.94 ms     ±5.96%        5.94 ms        7.17 ms
new_512             161.73        6.18 ms     ±5.81%        6.25 ms        7.48 ms
old                  27.98       35.74 ms     ±2.79%       35.53 ms       39.01 ms

Comparison:
adaptive_512        168.38
new_512             161.73 - 1.04x slower +0.24 ms
old                  27.98 - 6.02x slower +29.80 ms

Memory usage statistics:

Name            Memory usage
adaptive_512         4.11 MB
new_512              4.41 MB - 1.07x memory usage +0.30 MB
old                 22.89 MB - 5.57x memory usage +18.78 MB

**All measurements for memory usage were the same**

##### With input large #####
Name                   ips        average  deviation         median         99th %
adaptive_512        1.70 K      587.95 μs    ±19.27%      614.40 μs      921.60 μs
new_512             1.63 K      611.64 μs    ±16.68%      614.40 μs      921.60 μs
old                 0.37 K     2708.38 μs     ±7.54%     2662.40 μs     3481.60 μs

Comparison:
adaptive_512        1.70 K
new_512             1.63 K - 1.04x slower +23.69 μs
old                 0.37 K - 4.61x slower +2120.43 μs

Memory usage statistics:

Name            Memory usage
adaptive_512       422.02 KB
new_512            452.69 KB - 1.07x memory usage +30.66 KB
old               2344.01 KB - 5.55x memory usage +1921.98 KB

**All measurements for memory usage were the same**

##### With input medium #####
Name                   ips        average  deviation         median         99th %
adaptive_512       15.27 K       65.48 μs    ±18.60%       61.44 μs       92.16 μs
new_512            14.89 K       67.18 μs   ±192.73%           0 μs      819.20 μs
old                 3.53 K      283.30 μs    ±48.07%      307.20 μs      614.40 μs

Comparison:
adaptive_512       15.27 K
new_512            14.89 K - 1.03x slower +1.69 μs
old                 3.53 K - 4.33x slower +217.82 μs

Memory usage statistics:

Name            Memory usage
adaptive_512        42.77 KB
new_512             45.89 KB - 1.07x memory usage +3.13 KB
old                234.74 KB - 5.49x memory usage +191.98 KB

**All measurements for memory usage were the same**

##### With input micro #####
Name                   ips        average  deviation         median         99th %
old                 1.31 M      761.90 ns    ±26.03%      716.80 ns     1228.80 ns
new_512             1.12 M      890.21 ns    ±24.17%      921.60 ns     1331.20 ns
adaptive_512        0.99 M     1005.94 ns   ±210.98%        1024 ns       12288 ns

Comparison:
old                 1.31 M
new_512             1.12 M - 1.17x slower +128.31 ns
adaptive_512        0.99 M - 1.32x slower +244.05 ns

Memory usage statistics:

Name            Memory usage
old                    528 B
new_512                608 B - 1.15x memory usage +80 B
adaptive_512           568 B - 1.08x memory usage +40 B

**All measurements for memory usage were the same**

##### With input small #####
Name                   ips        average  deviation         median         99th %
new_512           108.55 K        9.21 μs   ±215.74%       10.24 μs      122.88 μs
adaptive_512       81.60 K       12.26 μs  ±1243.51%           0 μs      102.40 μs
old                29.47 K       33.93 μs    ±48.36%       30.72 μs       71.68 μs

Comparison:
new_512           108.55 K
adaptive_512       81.60 K - 1.33x slower +3.04 μs
old                29.47 K - 3.68x slower +24.72 μs

Memory usage statistics:

Name            Memory usage
new_512              5.70 KB
adaptive_512         5.23 KB - 0.92x memory usage -0.46875 KB
old                 23.78 KB - 4.18x memory usage +18.09 KB

**All measurements for memory usage were the same**

##### With input tiny #####
Name                   ips        average  deviation         median         99th %
adaptive_512      460.18 K        2.17 μs    ±91.32%        1.02 μs        9.22 μs
new_512           452.29 K        2.21 μs    ±84.12%        1.02 μs        8.19 μs
old               241.77 K        4.14 μs    ±35.99%        4.10 μs        8.19 μs

Comparison:
adaptive_512      460.18 K
new_512           452.29 K - 1.02x slower +0.0379 μs
old               241.77 K - 1.90x slower +1.96 μs

Memory usage statistics:

Name            Memory usage
adaptive_512         1.36 KB
new_512              1.55 KB - 1.14x memory usage +0.195 KB
old                  2.69 KB - 1.98x memory usage +1.33 KB

**All measurements for memory usage were the same**

mtrudel (Owner) commented Jan 19, 2023

Wild. In any case there's enough of a downside on enough platforms that I think we ought to pull this. Too bad. 😞

stevensonmt commented:

One thing that jumps out is the difference in available cores: the Apple machine shows 10 cores, the AMD 16, and the Intel Xeon only 2. I don't know enough about the benchmark you're running, but if it's heavy on parallelization, that might explain the differences.

mtrudel added a commit that referenced this pull request Jan 19, 2023
mtrudel (Owner) commented Jan 19, 2023

Reverted at 934106f
