
When using multiple benchmarks, earlier ones affect the ones coming later #166

Open
harendra-kumar opened this issue Nov 2, 2017 · 8 comments


@harendra-kumar

I have the following benchmarks in a group:

        bgroup "map"
          [ bench "machines" $ whnf drainM (M.mapping (+1))
          , bench "streaming" $ whnf drainS (S.map (+1))
          , bench "pipes" $ whnf drainP (P.map (+1))
          , bench "conduit" $ whnf drainC (C.map (+1))
          , bench "list-transformer" $ whnf drainL (lift . return . (+1))
          ]

The last two benchmarks take significantly more time when I run all these benchmarks in one go using stack bench --benchmark-arguments "-m glob ops/map/*".

$ stack bench --benchmark-arguments "-m glob ops/map/*"

benchmarking ops/map/machines
time                 30.23 ms   (29.22 ms .. 31.04 ms)

benchmarking ops/map/streaming
time                 17.91 ms   (17.48 ms .. 18.37 ms)

benchmarking ops/map/pipes
time                 29.30 ms   (28.12 ms .. 30.03 ms)

benchmarking ops/map/conduit
time                 36.69 ms   (35.73 ms .. 37.58 ms)

benchmarking ops/map/list-transformer
time                 84.06 ms   (75.02 ms .. 90.34 ms)

However, when I run individual benchmarks, the results are different:

$ stack bench --benchmark-arguments "-m glob ops/map/conduit"

benchmarking ops/map/conduit
time                 31.64 ms   (31.30 ms .. 31.86 ms)

$ stack bench --benchmark-arguments "-m glob ops/map/list-transformer"

benchmarking ops/map/list-transformer
time                 68.67 ms   (66.84 ms .. 70.96 ms)

To reproduce the issue, just run those commands in this repo.

I cannot figure out what the problem is here. I tried using env to run the benchmarks, putting a threadDelay of a few seconds and a performGC in the setup, but nothing helps.
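
For reference, here is a minimal sketch of that attempt: wrapping a benchmark in criterion's env so that a performGC and a few-second threadDelay run as setup before measurement. The benchmark body (summing a list) is a stand-in, since drainC and the other helpers from the snippet above live in the repo and aren't shown here:

    import Control.Concurrent (threadDelay)
    import Criterion.Main
    import System.Mem (performGC)

    main :: IO ()
    main = defaultMain
      [ bgroup "map"
          [ -- run a GC and sleep for 3 seconds before this benchmark is measured
            env (performGC >> threadDelay 3000000) $ \_ ->
              bench "stand-in" $ whnf (sum . map (+ 1)) [1 .. 100000 :: Int]
          ]
      ]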

I am now resorting to always running each benchmark individually in a separate process. Maybe criterion itself could support running each benchmark in a separate process to guarantee isolation of benchmarks, as I have seen this sort of problem too often. At this point I am always skeptical of the results produced by criterion.
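
A hedged sketch of that separate-process idea (this is not criterion's API): re-invoke the compiled benchmark executable once per benchmark name, so that each measurement starts with a fresh heap. The executable path and the benchmark list are placeholders for illustration:

    import System.Process (callProcess)

    benchNames :: [String]
    benchNames =
      [ "ops/map/machines"
      , "ops/map/streaming"
      , "ops/map/pipes"
      , "ops/map/conduit"
      , "ops/map/list-transformer"
      ]

    main :: IO ()
    main = mapM_ runOne benchNames
      where
        -- one child process per benchmark, selected with criterion's -m glob filter
        runOne name = callProcess "./ops-bench" ["-m", "glob", name]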

@RyanGlScott
Member

Thanks for the bug report. This is effectively the same issue as #60, so I'll close this in favor of that issue.

@harendra-kumar
Author

@RyanGlScott I am aware of #60, but this may not be the same issue. The binary is the same in this case, so there is no question of the generated code being different. Within the same binary, selecting multiple benchmarks vs. a single benchmark gives different results, and therefore it is not the same as #60.

Please reopen if you agree with this reasoning, or let me know if I am missing something.

@RyanGlScott
Member

The reason I suspect they're the same issue is that internally, both of those examples run the same flavor of code. Before criterion runs, it filters out all the benchmarks it needs to run based on command-line arguments, so in principle there shouldn't be any difference between running a file with lots of benchmarks that are filtered out with -m, as opposed to recompiling the program with benchmarks commented out.
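
To illustrate that point (a simplified, hedged sketch, not criterion's actual code): selection is just a filter over the flat list of benchmark names, applied before anything runs, so the compiled code is identical whether or not -m excludes some benchmarks. Prefix matching stands in for the real glob matching here:

    import Data.List (isPrefixOf)

    -- keep only the benchmarks whose names match the selection
    selectBenchmarks :: String -> [String] -> [String]
    selectBenchmarks prefix = filter (prefix `isPrefixOf`)

    -- selectBenchmarks "ops/map/" ["ops/map/conduit", "ops/zip/conduit"]
    --   == ["ops/map/conduit"]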

That being said, it's difficult for me to verify this claim, since I can't reproduce the results in #60 (comment) anymore, and the program here relies on a text file that isn't provided.

@harendra-kumar
Author

As I understand from @mikeizbicki's comments in #60 the issue there was code generation being different when the source code was actually changed. The code generated was more efficient in one case than in the other.

What's going on is that different core is generated for the sumV and VU.sum benchmarks depending on whether the other benchmark is present. Essentially, the common bits are being factored out and placed in a function, and this function is getting called in both benchmarks. This happens to make both benchmarks faster.

However, in this case it is a dynamic issue rather than a static one. For example, it could be due to something done at runtime by the previous tests (e.g. more garbage being generated that is collected later, during the other tests) that affects the later tests. When the other tests are not run, the dynamic state created by the previous test is not present and the issue goes away.
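
One hedged way to check the "leftover garbage" hypothesis would be to print the live heap just before each group runs; the sketch below assumes the executable is run with +RTS -T, and GHC >= 8.2 for these field names:

    import GHC.Stats (getRTSStats, getRTSStatsEnabled, gc, gcdetails_live_bytes)
    import System.Mem (performGC)

    -- print how much data is still live on the heap at this point
    reportLiveBytes :: String -> IO ()
    reportLiveBytes label = do
      enabled <- getRTSStatsEnabled
      if not enabled
        then putStrLn "RTS stats not enabled; run with +RTS -T"
        else do
          performGC   -- force a major GC so live bytes reflect reachable data only
          stats <- getRTSStats
          putStrLn $ label ++ ": live bytes = "
                  ++ show (gcdetails_live_bytes (gc stats))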

The fundamental cause in both the cases is entirely different (static vs dynamic), even though the symptoms are similar.

@harendra-kumar
Author

I fixed this in gauge: vincenthz/hs-gauge#3. It can be pulled from there. The fix runs each benchmark in a separate process. However, the root cause seems to be space being held up in the benchmark; see vincenthz/hs-gauge#10 (comment). Perhaps the space is not released because other benchmarks are sharing it; I have not investigated yet.

@OlivierSohn

OlivierSohn commented Jan 8, 2018

@harendra-kumar: could it be that, while running the benchmarks, the CPU temperature increases, and when the CPU reaches a given temperature it is slowed down by the OS (or by the CPU itself)? Maybe monitoring the CPU temperature in parallel with the tests could help to see if that is the case.

I once encountered this kind of behaviour in performance tests: the tests that would run first were running faster just because the CPU was cooler at that moment!
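
A hedged, Linux-only sketch of that monitoring idea: poll a sysfs thermal sensor while the benchmarks run. The sensor path is an assumption and varies between machines:

    import Control.Concurrent (threadDelay)
    import Control.Monad (forever)

    -- print the CPU temperature once per second (sysfs reports millidegrees C)
    monitorCpuTemp :: IO ()
    monitorCpuTemp = forever $ do
      milliC <- readFile "/sys/class/thermal/thermal_zone0/temp"
      putStrLn $ "CPU temperature (C): " ++ show (read milliC `div` 1000 :: Int)
      threadDelay 1000000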

@patrickdoc
Contributor

The heap may be another confounding factor. I ran the ops/map/* benchmarks to generate an eventlog. At the start of the conduit benchmark, the heap size was 300 million bytes. I then ran the ops/map/conduit benchmark on its own with +RTS -H300000000 and it took ~10 seconds more than without the heap change.
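
For reference, the kind of commands involved in that experiment (hedged: this assumes the benchmark executable is built with -rtsopts, and with -eventlog for the event log):

$ stack bench --benchmark-arguments "-m glob ops/map/* +RTS -l -RTS"
$ stack bench --benchmark-arguments "-m glob ops/map/conduit +RTS -H300000000 -RTS"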

I would imagine (completely untested) that running each benchmark in a separate process would perform something of a hard reset on the heap. While this would make results more stable when the set of benchmarks being run changes, I'm not sure it necessarily makes them more accurate. Programs won't execute these operations in isolation, so why should the benchmark?

I don't have a particular solution, but thought I'd provide my two cents while this is all fresh in my head.

@RyanGlScott
Member

Indeed. It's worth noting that gauge has an extremely grotesque workaround for this problem, which is to specify the path of the benchmarking executable (via a command-line flag or otherwise) and use that to run each benchmark in an isolated process. There has to be a better way to do this, though!
