52 changes: 45 additions & 7 deletions prometheus-client/benchmarks/Main.hs
@@ -4,13 +4,20 @@
module Main (main) where

import Prometheus
import GHC.Stack

import Control.DeepSeq
import qualified Prometheus.Metric.Vector.STM as Vector.STM
import qualified Prometheus.Metric.Vector as Vector

import Control.Monad
import Criterion
import Criterion.Main
import Data.Foldable (for_)
import qualified Data.Text as T
import Data.Text (Text)
import System.Random
import Control.Concurrent.Async

withMetric m =
envWithCleanup
@@ -32,11 +39,13 @@ withHistogram buckets =
main :: IO ()
main =
defaultMain
[ benchCounter
, benchGauge
, benchSummary
, benchHistogram
, benchExport
[ -- benchCounter
-- , benchGauge
-- , benchSummary
-- , benchHistogram
-- , benchExport
benchVector "Vector" Vector.vector Vector.withLabel Vector.getVectorWith
, benchVector "Vector.STM" Vector.STM.vector Vector.STM.withLabel Vector.STM.getVectorWith
]


@@ -104,8 +113,6 @@ benchSummaryWithQuantiles q =
benchSummaryObserve s =
bench "observe" $ whnfIO (observe s 42)



-- Histogram benchmarks


@@ -147,3 +154,34 @@ benchExportHistograms nHistograms =
setup = replicateM_ nHistograms $ do
register $ histogram (Info (T.pack $ show nHistograms) "") defaultBuckets
teardown _ = unregisterAll

benchVector
  :: (HasCallStack, NFData (vector (Text, Text, Text) Counter))
  => String
  -> ((Text, Text, Text) -> Metric Counter -> Metric (vector (Text, Text, Text) Counter))
  -> (vector (Text, Text, Text) Counter -> (Text, Text, Text) -> (Counter -> IO ()) -> IO ())
  -> (vector (Text, Text, Text) Counter -> (Counter -> IO Double) -> IO [((Text, Text, Text), Double)])
  -> Benchmark
benchVector str mkVector withLabel' getVectorWith' =
  envWithCleanup
    (do
      v <- register $ mkVector ("route", "status_code", "verb") (counter (Info "hits" "The number of hits on this combination of endpoint, status, and verb"))
      _ <- pure $!! labelCombinations
      pure v
    )
    (\_ -> unregisterAll)
    (\vec ->
      bgroup str
        [ bench "increment all combinations concurrently" $ nfIO $
            forConcurrently_ labelCombinations $ \label ->
              replicateConcurrently_ 100 $
                replicateM_ 10 $
                  withLabel' vec label incCounter
        ])

labelCombinations :: [(Text, Text, Text)]
labelCombinations = do
  route <- ["home", "cool", "users", "users/#{id}", "organizations", "organizations/#{id}", "login", "logout", "subscriptions"]
  statusCode <- ["200", "201", "204", "400", "401", "404", "500"]
  verb <- ["GET", "POST", "PUT"]
  pure (route, statusCode, verb)
259 changes: 259 additions & 0 deletions prometheus-client/docs/concurrent-vector.md
@@ -0,0 +1,259 @@
# Concurrency in the `Vector` type

The `Vector` type in `Prometheus.Metric.Vector` uses an `IORef (Map l (Metric m))`.
This is a known antipattern that performs poorly under concurrent load.
The proper fix is to use the [`stm-containers`](https://nikita-volkov.github.io/stm-containers/) package, which allows significantly better concurrent updates to the `Map`.

The problem is that the `Vector` requires a lock on the entire `Map` in order to update a single value.
Suppose you have 200 threads, each wanting to update the `Map`.
The first will do `atomicModifyIORef`, which will grab a lock on the map.
Every other thread is now blocked.
The first thread will update the label `k0` and put the map back in.
The next thread will now lock the whole map for label `k1`.
Rinse and repeat.

With `StmContainers.Map`, the *key* is locked - so there is only contention on specific keys.
If 200 threads want to update the map, and they all have distinct keys, then this can all happen concurrently without any locking or STM retrying.
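The whole-map versus per-key contrast can be sketched using only the GHC boot libraries. Here a `TVar` per key stands in for `StmContainers.Map`'s per-key locking; this is an approximation for illustration, not the library's actual representation:

```haskell
import Control.Concurrent.STM
import Data.IORef
import qualified Data.Map.Strict as Map

-- Whole-map update: every writer contends on the single IORef,
-- even when they touch different keys.
incWholeMap :: Ord k => IORef (Map.Map k Int) -> k -> IO ()
incWholeMap ref k =
  atomicModifyIORef' ref (\m -> (Map.insertWith (+) k 1 m, ()))

-- Per-key update: writers touching distinct keys never contend,
-- because each key has its own transactional variable.
incPerKey :: Ord k => Map.Map k (TVar Int) -> k -> IO ()
incPerKey m k = case Map.lookup k m of
  Just tv -> atomically (modifyTVar' tv (+ 1))
  Nothing -> pure ()

main :: IO ()
main = do
  ref <- newIORef (Map.singleton "k0" (0 :: Int))
  incWholeMap ref "k0"
  readIORef ref >>= print . Map.lookup "k0"   -- Just 1

  tv <- newTVarIO (0 :: Int)
  incPerKey (Map.singleton "k0" tv) "k0"
  readTVarIO tv >>= print                     -- 1
```

The simplification above only handles keys that already exist; `StmContainers.Map` also handles concurrent insertion of new keys without a global lock.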

# Benchmarks

To understand this behavior, I created a benchmark that concurrently makes many writes to a `Vector`.

To measure the impact of concurrency, I rerun the benchmark with varying `-with-rtsopts=-Nx` settings.

I am using an Intel i9 with 32 cores.
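Since `-with-rtsopts` bakes the RTS flags in at compile time, each `-Nx` value means a rebuild. A hypothetical cabal stanza (the target name and surrounding fields are assumptions, not the package's real configuration) would look like:

```
benchmark benchmarks
  type:           exitcode-stdio-1.0
  main-is:        Main.hs
  hs-source-dirs: benchmarks
  ghc-options:    -threaded "-with-rtsopts=-N4"
```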

NOTE: Oops. This code is not faithful to the released Hackage code.
The released code uses `Data.Atomics.atomicModifyIORefCAS`, which avoids the locking and contention.
However, I'm going to leave the benchmarks here because they're *very* interesting as a way of seeing how `Data.Atomics.atomicModifyIORefCAS` compares to `Data.IORef.atomicModifyIORef'`.
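A minimal sketch (not the library's actual code) of the two call sites being compared. Both functions share the type `IORef a -> (a -> (a, b)) -> IO b`, so the swap is a one-line change; `atomicModifyIORefCAS` lives in the `atomic-primops` package and retries with a compare-and-swap loop instead of blocking:

```haskell
import Data.IORef
import qualified Data.Map.Strict as Map

-- The locking variant benchmarked first in this document.
-- The released code instead uses (same shape, different import):
--   bump ref k = Data.Atomics.atomicModifyIORefCAS ref
--                  (\m -> (Map.insertWith (+) k 1 m, ()))
bump :: IORef (Map.Map String Int) -> String -> IO ()
bump ref k = atomicModifyIORef' ref (\m -> (Map.insertWith (+) k 1 m, ()))

main :: IO ()
main = do
  ref <- newIORef Map.empty
  bump ref "hits"
  readIORef ref >>= print   -- fromList [("hits",1)]
```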

## `-N1`

```
benchmarking Vector/increment all combinations concurrently
time 474.6 ms (429.8 ms .. 514.9 ms)
0.999 R² (0.996 R² .. 1.000 R²)
mean 511.6 ms (493.2 ms .. 529.9 ms)
std dev 21.50 ms (18.13 ms .. 24.06 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking Vector.STM/increment all combinations concurrently
time 194.4 ms (184.5 ms .. 201.0 ms)
0.999 R² (0.994 R² .. 1.000 R²)
mean 202.3 ms (197.9 ms .. 206.2 ms)
std dev 5.733 ms (3.561 ms .. 8.113 ms)
variance introduced by outliers: 14% (moderately inflated)
```

In a single-threaded context, `stm-containers` is still over twice as fast.

## `-N2`


```
benchmarking Vector/increment all combinations concurrently
time 791.9 ms (780.6 ms .. 800.9 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 789.9 ms (787.3 ms .. 791.3 ms)
std dev 2.581 ms (5.418 μs .. 3.169 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking Vector.STM/increment all combinations concurrently
time 110.8 ms (108.3 ms .. 113.4 ms)
0.999 R² (0.997 R² .. 1.000 R²)
mean 113.8 ms (112.3 ms .. 115.4 ms)
std dev 2.452 ms (1.767 ms .. 3.111 ms)
variance introduced by outliers: 11% (moderately inflated)
```

With two threads, the regular vector encounters contention and becomes significantly slower (1.6x the single-threaded time).
Meanwhile, the STM vector takes 56% of the time it took at `-N1`.

## `-N4`

```
benchmarking Vector/increment all combinations concurrently
time 862.9 ms (812.8 ms .. 918.9 ms)
0.999 R² (0.999 R² .. 1.000 R²)
mean 849.7 ms (837.4 ms .. 857.4 ms)
std dev 12.52 ms (4.446 ms .. 15.37 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking Vector.STM/increment all combinations concurrently
time 67.63 ms (64.95 ms .. 69.86 ms)
0.997 R² (0.995 R² .. 0.999 R²)
mean 65.72 ms (64.07 ms .. 66.88 ms)
std dev 2.405 ms (1.510 ms .. 3.519 ms)
```

With four threads, contention on the traditional `Vector` is high, and it takes even more time - though the increase is modest.
The STM vector again improves significantly, taking 57% of its 2-thread time.

## `-N8`

```
benchmarking Vector/increment all combinations concurrently
time 1.043 s (981.4 ms .. 1.099 s)
1.000 R² (0.998 R² .. 1.000 R²)
mean 954.4 ms (890.9 ms .. 984.0 ms)
std dev 50.79 ms (26.11 ms .. 62.92 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking Vector.STM/increment all combinations concurrently
time 43.53 ms (42.78 ms .. 44.29 ms)
0.999 R² (0.997 R² .. 1.000 R²)
mean 43.83 ms (43.26 ms .. 44.58 ms)
std dev 1.254 ms (919.2 μs .. 1.745 ms)
```

The pattern repeats: contention makes the `IORef` vector slower (120% of its prior time), while more threads make the STM version faster (64% of its prior time).

## `-N16`

```
benchmarking Vector/increment all combinations concurrently
time 2.415 s (2.019 s .. 2.984 s)
0.994 R² (0.982 R² .. 1.000 R²)
mean 2.603 s (2.461 s .. 2.805 s)
std dev 191.8 ms (71.20 ms .. 244.5 ms)
variance introduced by outliers: 21% (moderately inflated)

benchmarking Vector.STM/increment all combinations concurrently
time 44.69 ms (43.21 ms .. 46.21 ms)
0.998 R² (0.996 R² .. 0.999 R²)
mean 48.07 ms (46.33 ms .. 51.82 ms)
std dev 4.928 ms (1.915 ms .. 7.111 ms)
variance introduced by outliers: 41% (moderately inflated)
```

With 16 threads, contention has more than doubled the time on the `IORef`.
Meanwhile, the `STM` implementation has stabilized around 45ms.

# Oops

At this point, I realize an error that I have made.
The Hackage version of the code uses `Data.Atomics.atomicModifyIORefCAS`, but I had switched it to `atomicModifyIORef'`.
If I switch it back and rerun, I get significantly better results:

## `-N1`

```
benchmarking Vector/increment all combinations concurrently
time 144.2 ms (137.9 ms .. 152.0 ms)
0.998 R² (0.995 R² .. 1.000 R²)
mean 141.7 ms (140.2 ms .. 143.8 ms)
std dev 2.614 ms (1.425 ms .. 3.817 ms)
variance introduced by outliers: 12% (moderately inflated)

benchmarking Vector.STM/increment all combinations concurrently
time 207.0 ms (182.7 ms .. 230.8 ms)
0.994 R² (0.979 R² .. 1.000 R²)
mean 203.6 ms (197.3 ms .. 209.6 ms)
std dev 8.605 ms (6.082 ms .. 10.73 ms)
variance introduced by outliers: 14% (moderately inflated)
```

With a single thread, the stock implementation is faster.
The STM version takes roughly 1.4x as long (207 ms vs. 144 ms).

## `-N2`

```
benchmarking Vector/increment all combinations concurrently
time 122.5 ms (120.4 ms .. 124.6 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 119.4 ms (118.1 ms .. 120.5 ms)
std dev 1.858 ms (1.325 ms .. 2.617 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking Vector.STM/increment all combinations concurrently
time 110.9 ms (104.2 ms .. 116.0 ms)
0.997 R² (0.995 R² .. 1.000 R²)
mean 115.6 ms (113.3 ms .. 118.4 ms)
std dev 3.859 ms (2.682 ms .. 5.249 ms)
variance introduced by outliers: 11% (moderately inflated)
```

The stock implementation is slightly faster with two threads than with one.
The STM implementation is almost twice as fast as its single-threaded run, and now slightly faster than the stock one.

## `-N4`

```
benchmarking Vector/increment all combinations concurrently
time 99.45 ms (98.89 ms .. 100.6 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 98.95 ms (98.27 ms .. 99.38 ms)
std dev 875.4 μs (576.1 μs .. 1.230 ms)

benchmarking Vector.STM/increment all combinations concurrently
time 58.75 ms (57.62 ms .. 59.41 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 59.72 ms (59.11 ms .. 60.80 ms)
std dev 1.487 ms (589.8 μs .. 2.436 ms)
```

With 4 threads, both see significant performance improvements, with STM gaining more of a lead.

## `-N8`

```
benchmarking Vector/increment all combinations concurrently
time 105.3 ms (104.1 ms .. 108.0 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 105.0 ms (104.0 ms .. 106.1 ms)
std dev 1.664 ms (1.196 ms .. 2.480 ms)

benchmarking Vector.STM/increment all combinations concurrently
time 39.39 ms (38.23 ms .. 40.39 ms)
0.998 R² (0.996 R² .. 1.000 R²)
mean 40.21 ms (39.74 ms .. 40.54 ms)
std dev 780.2 μs (546.1 μs .. 1.251 ms)
```

With 8 cores, performance of the `IORef` has stabilized.
The `STM` implementation continues to improve.

## `-N16`

```
benchmarking Vector/increment all combinations concurrently
time 153.0 ms (148.8 ms .. 154.1 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 152.8 ms (151.7 ms .. 153.4 ms)
std dev 1.213 ms (511.2 μs .. 1.869 ms)
variance introduced by outliers: 12% (moderately inflated)

benchmarking Vector.STM/increment all combinations concurrently
time 44.94 ms (43.59 ms .. 47.01 ms)
0.995 R² (0.989 R² .. 0.999 R²)
mean 45.78 ms (44.95 ms .. 46.58 ms)
std dev 1.609 ms (1.311 ms .. 2.102 ms)
```

At 16 cores, contention on the IORef increases overhead.
STM appears stable.

## `-N32`

```
benchmarking Vector/increment all combinations concurrently
time 3.400 s (654.0 ms .. 4.890 s)
0.921 R² (NaN R² .. 1.000 R²)
mean 3.877 s (3.432 s .. 4.264 s)
std dev 457.1 ms (376.4 ms .. 523.3 ms)
variance introduced by outliers: 23% (moderately inflated)

benchmarking Vector.STM/increment all combinations concurrently
time 51.69 ms (50.44 ms .. 53.48 ms)
0.997 R² (0.994 R² .. 1.000 R²)
mean 52.45 ms (51.51 ms .. 55.21 ms)
std dev 2.943 ms (1.060 ms .. 5.225 ms)
variance introduced by outliers: 14% (moderately inflated)
```

With 32 cores, the `STM` implementation holds constant, while even the CAS-based atomic IORef runs into significant contention.

# Defaults

Based on these observations, this commit changes the default exported in `Prometheus` to the STM vector.