
When using multiple benchmarks, earlier ones affect the ones coming later #166

Open
harendra-kumar opened this issue Nov 2, 2017 · 8 comments


@harendra-kumar

I have the following benchmarks in a group:

        bgroup "map"
          [ bench "machines" $ whnf drainM (M.mapping (+1))
          , bench "streaming" $ whnf drainS (S.map (+1))
          , bench "pipes" $ whnf drainP (P.map (+1))
          , bench "conduit" $ whnf drainC (C.map (+1))
          , bench "list-transformer" $ whnf drainL (lift . return . (+1))
          ]

The last two benchmarks take significantly more time when I run all these benchmarks in one go using stack bench --benchmark-arguments "-m glob ops/map/*".

$ stack bench --benchmark-arguments "-m glob ops/map/*"

benchmarking ops/map/machines
time                 30.23 ms   (29.22 ms .. 31.04 ms)

benchmarking ops/map/streaming
time                 17.91 ms   (17.48 ms .. 18.37 ms)

benchmarking ops/map/pipes
time                 29.30 ms   (28.12 ms .. 30.03 ms)

benchmarking ops/map/conduit
time                 36.69 ms   (35.73 ms .. 37.58 ms)

benchmarking ops/map/list-transformer
time                 84.06 ms   (75.02 ms .. 90.34 ms)

However, when I run individual benchmarks, the results are different:

$ stack bench --benchmark-arguments "-m glob ops/map/conduit"

benchmarking ops/map/conduit
time                 31.64 ms   (31.30 ms .. 31.86 ms)

$ stack bench --benchmark-arguments "-m glob ops/map/list-transformer"

benchmarking ops/map/list-transformer
time                 68.67 ms   (66.84 ms .. 70.96 ms)

To reproduce the issue, just run those commands in this repo.

I cannot figure out what the problem is here. I tried using env to run the benchmarks, putting a threadDelay of a few seconds and a performGC in the setup, but nothing helps.
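
For reference, here is a minimal sketch of that attempt: wrapping a benchmark in criterion's env so that a performGC and a few-second threadDelay run as setup before measurement. The benchmark body (summing a list) is a stand-in, since drainC and the other helpers from the snippet above live in the repo and aren't shown here:

    import Control.Concurrent (threadDelay)
    import Criterion.Main
    import System.Mem (performGC)

    main :: IO ()
    main = defaultMain
      [ bgroup "map"
          [ -- run a GC and sleep for 3 seconds before this benchmark is measured
            env (performGC >> threadDelay 3000000) $ \_ ->
              bench "stand-in" $ whnf (sum . map (+ 1)) [1 .. 100000 :: Int]
          ]
      ]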

I am now resorting to always running each benchmark individually in a separate process. Maybe criterion itself could support running each benchmark in a separate process to guarantee isolation of benchmarks, as I have seen this sort of problem too often. At this point I am always skeptical of the results produced by criterion.
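
A hedged sketch of that separate-process idea (this is not criterion's API): re-invoke the compiled benchmark executable once per benchmark name, so that each measurement starts with a fresh heap. The executable path and the benchmark list are placeholders for illustration:

    import System.Process (callProcess)

    benchNames :: [String]
    benchNames =
      [ "ops/map/machines"
      , "ops/map/streaming"
      , "ops/map/pipes"
      , "ops/map/conduit"
      , "ops/map/list-transformer"
      ]

    main :: IO ()
    main = mapM_ runOne benchNames
      where
        -- one child process per benchmark, selected with criterion's -m glob filter
        runOne name = callProcess "./ops-bench" ["-m", "glob", name]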

@RyanGlScott
Member

Thanks for the bug report. This is effectively the same issue as #60, so I'll close this in favor of that issue.

@harendra-kumar
Author

@RyanGlScott I am aware of #60, but this may not be the same issue. The binary is the same in this case, so there is no question of the generated code being different. Within the same binary, selecting multiple benchmarks vs. a single benchmark gives different results, and therefore it is not the same as #60.

Please reopen if you agree with this reasoning, or let me know if I am missing something.

@RyanGlScott
Member

The reason I suspect they're the same issue is that internally, both of those examples run the same flavor of code. Before criterion runs, it filters out all the benchmarks it needs to run based on command-line arguments, so in principle there shouldn't be any difference between running a file with lots of benchmarks that are filtered out with -m, as opposed to recompiling the program with benchmarks commented out.
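
To illustrate that point (a simplified, hedged sketch, not criterion's actual code): selection is just a filter over the flat list of benchmark names, applied before anything runs, so the compiled code is identical whether or not -m excludes some benchmarks. Prefix matching stands in for the real glob matching here:

    import Data.List (isPrefixOf)

    -- keep only the benchmarks whose names match the selection
    selectBenchmarks :: String -> [String] -> [String]
    selectBenchmarks prefix = filter (prefix `isPrefixOf`)

    -- selectBenchmarks "ops/map/" ["ops/map/conduit", "ops/zip/conduit"]
    --   == ["ops/map/conduit"]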

That being said, it's difficult for me to verify this claim, since I can't reproduce the results in #60 (comment) anymore, and the program here relies on a text file that isn't provided.

@harendra-kumar
Author

As I understand from @mikeizbicki's comments in #60 the issue there was code generation being different when the source code was actually changed. The code generated was more efficient in one case than in the other.

What's going on is that different core is generated for the sumV and VU.sum benchmarks depending on whether the other benchmark is present. Essentially, the common bits are being factored out and placed in a function, and this function is getting called in both benchmarks. This happens to make both benchmarks faster.

However, in this case it is a dynamic issue rather than a static one. For example, it could be due to something done at runtime by the previous tests (e.g. more garbage being generated that is collected later, during the other tests) that affects the later tests. When the other tests are not run, the dynamic state created by the previous test is not present and the issue goes away.
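
One hedged way to check the "leftover garbage" hypothesis would be to print the live heap just before each group runs; the sketch below assumes the executable is run with +RTS -T, and GHC >= 8.2 for these field names:

    import GHC.Stats (getRTSStats, getRTSStatsEnabled, gc, gcdetails_live_bytes)
    import System.Mem (performGC)

    -- print how much data is still live on the heap at this point
    reportLiveBytes :: String -> IO ()
    reportLiveBytes label = do
      enabled <- getRTSStatsEnabled
      if not enabled
        then putStrLn "RTS stats not enabled; run with +RTS -T"
        else do
          performGC   -- force a major GC so live bytes reflect reachable data only
          stats <- getRTSStats
          putStrLn $ label ++ ": live bytes = "
                  ++ show (gcdetails_live_bytes (gc stats))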

The fundamental cause in both the cases is entirely different (static vs dynamic), even though the symptoms are similar.

@harendra-kumar
Author

I fixed this in gauge: vincenthz/hs-gauge#3. It can be pulled from there. The fix runs each benchmark in a separate process. However, the root cause seems to be space being held up in the benchmark; see vincenthz/hs-gauge#10 (comment). Perhaps the space is not released because other benchmarks are sharing it; I have not investigated yet.

@OlivierSohn

OlivierSohn commented Jan 8, 2018

@harendra-kumar: could it be that, while running the benchmarks, the CPU temperature increases, and when the CPU reaches a given temperature it is slowed down by the OS (or by the CPU itself)? Maybe monitoring the CPU temperature in parallel with the tests could help to see if that is the case.

I once encountered this kind of behaviour in performance tests: the tests that would run first were running faster just because the CPU was cooler at that moment!
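
A hedged, Linux-only sketch of that monitoring idea: poll a sysfs thermal sensor while the benchmarks run. The sensor path is an assumption and varies between machines:

    import Control.Concurrent (threadDelay)
    import Control.Monad (forever)

    -- print the CPU temperature once per second (sysfs reports millidegrees C)
    monitorCpuTemp :: IO ()
    monitorCpuTemp = forever $ do
      milliC <- readFile "/sys/class/thermal/thermal_zone0/temp"
      putStrLn $ "CPU temperature (C): " ++ show (read milliC `div` 1000 :: Int)
      threadDelay 1000000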

@patrickdoc
Contributor

The heap may be another confounding factor. I ran the ops/map/* benchmarks to generate an eventlog. At the start of the conduit benchmark, the heap size was 300 million bytes. I then ran the ops/map/conduit benchmark on its own with +RTS -H300000000 and it took ~10 seconds more than without the heap change.
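
For reference, the kind of commands involved in that experiment (hedged: this assumes the benchmark executable is built with -rtsopts, and with -eventlog for the event log):

$ stack bench --benchmark-arguments "-m glob ops/map/* +RTS -l -RTS"
$ stack bench --benchmark-arguments "-m glob ops/map/conduit +RTS -H300000000 -RTS"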

I would imagine (completely untested) that running each benchmark in a separate process would perform something of a hard reset on the heap. While this would make results more stable when the set of benchmarks being run changes, I'm not sure it necessarily makes them more accurate. Programs won't execute these operations in isolation, so why should the benchmark?

I don't have a particular solution, but thought I'd provide my two cents while this is all fresh in my head.

@RyanGlScott
Member

Indeed. It's worth noting that gauge has an extremely grotesque workaround for this problem, which is to specify the path of the benchmarking executable (via a command-line flag or otherwise) and use that to run each benchmark in an isolated process. There has to be a better way to do this, though!
