Refactor benchmark tools for statistical significance #7094

Merged
merged 9 commits into from Jul 26, 2016

Conversation

10 participants
@AndreasMadsen
Member

AndreasMadsen commented Jun 1, 2016

Checklist
  • tests and code linting passes
  • a test and/or benchmark is included
  • documentation is changed or added
  • the commit message follows commit guidelines
Affected core subsystem(s)

benchmark

Description of change

I have been rather confused about the benchmark suite and I don't think it is as user friendly as the rest of node core. This PR attempts to remove most of the confusion I faced when I started using it. Primarily it:

  • removes unused/undocumented files
  • allows partially setting the benchmark variables using process arguments
  • refactors compare.js so that comparing node versions and getting statistical significance is easy
  • refactors the plot.R tool (now called scatter) to show a scatter plot with confidence bars
  • refactors the CLI tools so that the CLI API is more homogeneous
  • documents all the tools
  • removes the implicit process.exit(0) after bench.end()
  • uses process.send to avoid most parsing (the benchmark CLI arguments haven't changed)

The specifics are documented in the commit messages. Please also see the new README, as quite a lot has changed (be sure to check my spelling!).

Note that some benchmarks take a very long time to complete, e.g. timers/timers.js type=depth thousands=500 takes 11.25 min. Thus running it 30 times for statistical significance is unreasonable. I suspect the only reason it is set to so many iterations is to get a small variance, but with the new compare tool the variance can be estimated instead of being reduced. Thus we can reduce the number of iterations and still get the information we need. But I suggest we do that in another pull request, as it is a very different discussion.

Motivation (long story): I wanted to benchmark the effect of some async_wrap changes. I went to the benchmark/ directory and read the README. However, I quickly discovered that it was primarily about running benchmarks a single time and how to write benchmarks. Most importantly, it didn't explain how to compare two node versions. This is now documented in the new README.

I then had to search for the tools myself and discovered the large number of benchmark files which were not put into categorized directories. I assumed they were somehow extra significant, but in reality they just appear to be unused. These files are now removed.

After discovering the compare tool, which had the following CLI API:

node benchmark/compare.js
            <node-binary1> <node-binary2> +
            [--html] [--red|-r] [--green|-g] +
            [-- <type> [testFilter]]

I was confused about what --red and --green were and how node-binary1 and node-binary2 compared: should I write ./node-old ./node-new or ./node-new ./node-old if I wanted a positive improvement factor to signify an improvement? The new compare API is:

usage: ./node benchmark/compare.js <type> ...
  --new    ./new-node-binary  new node binary (required)
  --old    ./old-node-binary  old node binary (required)
  --runs   30                 number of samples
  --filter pattern            string to filter benchmark scripts
  --var    variable=value     set benchmark variable (can be repeated)

After understanding common.js, it was still unclear whether the performance difference was statistically significant. I tried running the benchmark 5 times and got that 4/5 runs showed an improvement, while I was expecting the same performance or slower. (Spoiler: it wasn't significant.) The compare.js script now runs the benchmarks many times (30 by default) and there is an R script to analyse the csv results.

At this point I wanted to do a rewrite of the benchmark tools (not the benchmarks themselves) and changed a few other things in the process as well. I'm a mathematician, so I care a lot about statistical significance :)

@AndreasMadsen

Member

AndreasMadsen commented Jun 4, 2016

I'm not sure who to cc for this one.
/cc @Trott as you appear to have made some recent benchmark changes.

@Trott

Member

Trott commented Jun 5, 2016

In theory, this sounds fantastic to me! In practice, there's so much about benchmarking that I'm ignorant about, I have to defer to others.

@jasnell

Member

jasnell commented Jun 6, 2016

Very nice. @mscdex @bnoordhuis ... any thoughts on this?

var s;
for (var i = 0; i < n; i++) {
  s = '01234567890';
  s[1] = 'a';

@mscdex

mscdex Jun 6, 2016

Contributor

Perhaps this line was added to prevent v8 from optimizing the for-loop away or something (since s wouldn't have been referenced)?

@AndreasMadsen

AndreasMadsen Jun 6, 2016

Member

Perhaps. With use strict it is definitely broken. Looking at the original commit (12a169e, 6 years ago) it seems like it was just a misunderstanding of how strings work. The commit appears to compare strings and buffers, which is not comparable in this case as strings are immutable.
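The immutability point can be demonstrated in a few lines (a standalone illustration, not the removed benchmark):

```javascript
'use strict';

// Index assignment on a string never mutates it: in sloppy mode it fails
// silently, and in strict mode (as here) it throws a TypeError.
const s = '01234567890';

let threw = false;
try {
  s[1] = 'a';
} catch (err) {
  threw = err instanceof TypeError;
}

console.log(threw); // true
console.log(s);     // 01234567890 (unchanged either way)
```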

@mscdex

benchmark/common.js Outdated
// Construct confiuration string, " A=a, B=b, ..."
let conf = '';
for (const key of Object.keys(data.conf)) {
conf += ' ' + key + '=' + data.conf[key];

@mscdex

mscdex Jun 6, 2016

Contributor

data.conf[key] may need to be JSON.stringify()ed for strings with control characters (including newlines, etc.), otherwise you'll end up with messed up output. I had to change this with the existing benchmark runner when benchmarking some http header character validation functions.
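The problem and the fix can be illustrated with a hypothetical helper (a sketch of the idea, not the actual common.js code):

```javascript
'use strict';

// Configuration values containing control characters corrupt the
// one-line " A=a, B=b, ..." output unless they are escaped with
// JSON.stringify(). formatConf and the sample values are made up.
function formatConf(conf, stringify) {
  let out = '';
  for (const key of Object.keys(conf)) {
    const value = stringify ? JSON.stringify(conf[key]) : conf[key];
    out += ' ' + key + '=' + value;
  }
  return out;
}

const conf = { header: 'x\r\ny', n: 1024 };

console.log(formatConf(conf, false)); // the raw newline breaks the line
console.log(formatConf(conf, true));  // header="x\r\ny" n=1024 (escaped)
```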

@AndreasMadsen

AndreasMadsen Jun 6, 2016

Member

I'm happy to do that. I didn't want to change the format in case it is used by some third-party benchmark monitoring setup. Edit: changed to use JSON.stringify() as suggested.

@mscdex

benchmark/run.js Outdated
// Construct configuration string, " A=a, B=b, ..."
let conf = '';
for (const key of Object.keys(data.conf)) {
conf += ' ' + key + '=' + data.conf[key];

@mscdex

mscdex Jun 6, 2016

Contributor

Same thing here about JSON.stringify()ing data.conf[key].

@mscdex

Contributor

mscdex commented Jun 6, 2016

Just briefly looking over it, it mostly seems to look ok except for a few nits.

I did spot a typo in the benchmark: add script for creating scatter plot commit message body.

@mscdex

benchmark/compare.js Outdated
//
// Parse arguments
//
const cli = CLI(`usage: ./node benchmark/compare.js <type> ...

@mscdex

mscdex Jun 6, 2016

Contributor

There should probably be some explanation (in the help text) about what <type> should be exactly...

@AndreasMadsen

AndreasMadsen Jun 6, 2016

Member

are you referring to <type>?

@mscdex

mscdex Jun 6, 2016

Contributor

Yes, markdown cut that part out.

@trevnorris

trevnorris Jun 15, 2016

Contributor

IIRC there was another one of these in another commit. Just to watch out for it.

@AndreasMadsen

AndreasMadsen Jun 15, 2016

Member

They should all be fixed. Unless you are talking about the R scripts, but they only take -- arguments.

@addaleax

addaleax Jul 23, 2016

Member

btw, the first few times I tried running the new tool I found it very confusing that the type argument needed to appear before the arguments starting with -- (i.e. compare.js --new bla --old blah http did not work). I almost never use CLIs with that argument order, and just showing this usage text wasn’t exactly helpful, either.

You don’t need to change the behaviour, but maybe add a note here about that and for the other scripts where it applies?

@AndreasMadsen

AndreasMadsen Jul 23, 2016

Member

Can you elaborate on that note? I think it is very specific. This is the message you get now:

usage: ./node benchmark/compare.js <type> ...
  Run each benchmark in the <type> directory many times using two diffrent
  node versions. More than one <type> directory can be specified. The output is
  formatted as csv, which can be processed using for example 'compare.R'.

  --new    ./new-node-binary  new node binary (required)
  --old    ./old-node-binary  old node binary (required)
  --runs   30                 number of samples
  --filter pattern            string to filter benchmark scripts
  --set    variable=value     set benchmark variable (can be repeated)

I chose this order because it could be implemented using less code.

I will try to change the argument order; it appears to cause a lot of confusion for many people, but I would love to understand why.

@addaleax

addaleax Jul 23, 2016

Member

I will try to change the argument order; it appears to cause a lot of confusion for many people, but I would love to understand why.

If I had to guess, I’d say it’s because that’s the order usually suggested in man pages and --help texts, and maybe because the positional arguments are the ones one is most likely to spend more time editing before hitting enter… idk, maybe there’s more to it.

@AndreasMadsen

AndreasMadsen Jul 23, 2016

Member

Oh, I understand the order is confusing (it is fixed now). But this is the third comment I have gotten about a missing note; unless I'm misunderstanding the comment, there is a note just one line below.

@AndreasMadsen

Member

AndreasMadsen commented Jun 6, 2016

@mscdex thanks. Updated as suggested.

@AndreasMadsen

Member

AndreasMadsen commented Jun 11, 2016

ping

@mscdex

benchmark/README.md Outdated
```
## How to write a benchmark test
After generating the csv, a comparens table can be created using the `scatter.R`

@mscdex

mscdex Jun 11, 2016

Contributor

s/comparens/comparison ?

@mscdex

benchmark/README.md Outdated
## Creating a benchmark
All benchmarks uses the `require('../common.js')` module. This contains the

@mscdex

mscdex Jun 11, 2016

Contributor

s/uses/use/

@mscdex

benchmark/README.md Outdated
var bench = common.createBenchmark(main, {
type: ['fast', 'slow'], // Two types of buffer
n: [512] // Number of times (each unit is 1024) to call the slice API
The first argument `main` is the benchmark function, the second arguments

@mscdex

mscdex Jun 11, 2016

Contributor

s/arguments/argument/

@mscdex

benchmark/README.md Outdated
The first argument `main` is the benchmark function, the second arguments
specifies the benchmark parameters. `createBenchmark` will run all possible
combinations of these parameters, unless specified otherwise. Note that the
configuration values can only be strings and numbers.

@mscdex

mscdex Jun 11, 2016

Contributor

s/and/or/

@mscdex

benchmark/README.md Outdated
available through your preferred package manager. If not `wrk` can be built
[from source][wrk] via `make`.
The R scripts uses `ggplot2` and `plyr`, you can install them using from the

@mscdex

mscdex Jun 11, 2016

Contributor

s/uses/use/
s/from// or s/using//

@mscdex

benchmark/README.md Outdated
### Run all tests of a given type
individual benchmarks can be executed by simply executing the benchmark script

@mscdex

mscdex Jun 11, 2016

Contributor

s/individual/Individual/

@AndreasMadsen

AndreasMadsen Jun 15, 2016

Member

The search and replace parts are the same? I looked it up in Oxford; it appears individual is the correct spelling.

@thefourtheye

thefourtheye Jun 22, 2016

Contributor

First letter should be a capital letter.

@mscdex

mscdex Jun 22, 2016

Contributor

Yes, sorry, that is what I meant. First letter should be capitalized.

@mscdex

benchmark/README.md Outdated
the test function with each of the combined arguments in spawned processes. For
example, buffers/buffer-read.js has the following configuration:
Each line represents a single benchmark with parameters specified as
`${variable}=${value}`. Each configuration combination is executed in separate

@mscdex

mscdex Jun 11, 2016

Contributor

s/in separate processes/in a separate process/

@AndreasMadsen

AndreasMadsen Jun 15, 2016

Member

Thanks. I'm really curious why; do you know the name of the rule?

@mscdex

benchmark/README.md Outdated
The last number is the rate of operations. Higher is better.
### Comparing node versions
To compare the effect of a new node versions use the `compare.js` tool. This

@mscdex

mscdex Jun 11, 2016

Contributor

s/versions/version/

@mscdex

benchmark/README.md Outdated
For example, buffer-slice.js:
First build two versions of node, one from the master branch (here called
`./node-mater`) and another with the pull request applied (here called

@mscdex

mscdex Jun 11, 2016

Contributor

s/mater/master/

@mscdex

benchmark/compare.R Outdated
}
r = list(
improvment = improvment,

@mscdex

mscdex Jun 11, 2016

Contributor

s/improvment/improvement/ for all instances in this script and in the new benchmark document changes

@mscdex

benchmark/README.md Outdated
This example will run only the first type of url test, with one iteration.
(Note: benchmarks require __many__ iterations to be statistically accurate.)
The `compare.R` tool can also produces a box plot by using the `--plot filename`

@mscdex

mscdex Jun 11, 2016

Contributor

s/produces/produce/

@mscdex

benchmark/README.md Outdated
(Note: benchmarks require __many__ iterations to be statistically accurate.)
The `compare.R` tool can also produces a box plot by using the `--plot filename`
option. In this case there are 48 different benchmark combinations, thus you
may want to filter the csv file. This can be during benchmarking using for

@mscdex

mscdex Jun 11, 2016

Contributor

s/be during/be done while/
This sentence might be totally reworded though, it sounds a bit awkward as-is. Perhaps something like:

This can be done while benchmarking using the `--var` parameter (e.g. `--var encoding=ascii`) or by filtering results afterwards using tools such as `sed` or `grep`.
@mscdex

benchmark/README.md Outdated
var common = require('../common.js'); // Load the test runner
Because the scatter plot can only show two variables (in this case _chunk_ and
_encoding_) the rest is aggregated. Sometimes aggregating is a problem, this
can be solved by filtering. This can be done during benchmarking by using

@mscdex

mscdex Jun 11, 2016

Contributor

I think this same sentence can be reworded similarly as suggested earlier.

@ChALkeR

Member

ChALkeR commented Jun 11, 2016

@mscdex What's the semver status of this? Major?

@mscdex

Contributor

mscdex commented Jun 11, 2016

@ChALkeR I don't know how benchmarks are covered when it comes to that kind of thing. I would guess they are treated like tests or docs since they are not a part of the runtime?

@mcollina

Member

mcollina commented Jun 14, 2016

I'll go for major, it makes things easier and less complicated.

One thing that is not clear from the document is how the statistical significance is achieved.

@AndreasMadsen

Member

AndreasMadsen commented Jun 14, 2016

@mscdex Thanks for the suggestions, I will update the documentation tomorrow.

@mcollina It runs each benchmark a given number of times (--runs) using the new and old node binaries that are provided to compare.js. Using the R script it then ...

... makes an independent/unpaired 2-group t-test, with the null hypothesis that the performance is the same for both versions. The significant field will show a star if the p-value is less than 0.05.

I think the compare documentation is fairly clear on this. But do tell me how I can improve it.
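For illustration, the unpaired two-group (Welch's) t statistic that such a test relies on can be computed by hand. This is a sketch of the idea with made-up sample data, not the compare.R code:

```javascript
'use strict';

// Sketch of the unpaired two-group t-test idea: run each binary many
// times and compare the sample means relative to the sampling error.
// The run counts and ops/sec numbers below are made up.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function variance(xs) {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

// Welch's t statistic for two independent samples.
function welchT(a, b) {
  const se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
  return (mean(a) - mean(b)) / se;
}

// Ops/sec from repeated runs of the old and the new binary (made up).
const oldRuns = [100, 102, 98, 101, 99, 103, 100, 97, 102, 101];
const newRuns = [110, 108, 112, 109, 111, 107, 113, 110, 108, 112];

const t = welchT(newRuns, oldRuns);
// The R script converts |t| into a p-value via the t distribution and
// flags significance when p < 0.05; a large |t| like this one clearly
// rejects the "same performance" null hypothesis.
console.log(t > 2); // true
```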

@AndreasMadsen

Member

AndreasMadsen commented Jul 26, 2016

Thanks for the review. Landed in ee2843b edbed3f 0f9bfaa f3463cf3061931b5c94ba9c753c1d75ee4d2b712 1f64ceba89a074f9e23196d019d56f00cdd4577a 01fbf656a3874d189cadeced08266a26ea526491 de9b44c0889d2264436277848762f1ebf868aa57 6e745d7a7586b12b894537192726bf2b999a456d 693e7be399e4c0964b5bbceaee6e8326c7c02a42

@addaleax

Member

addaleax commented Jul 26, 2016

Uh, you might want to back these commits out of master for now, the linter complains about benchmark/_cli.js

@AndreasMadsen

Member

AndreasMadsen commented Jul 26, 2016

As in force push?

@addaleax

Member

addaleax commented Jul 26, 2016

@AndreasMadsen I’d do that for now. Could you fix that, and maybe do a CI or linter run before re-landing? ;)

@addaleax addaleax reopened this Jul 26, 2016

AndreasMadsen added some commits Feb 1, 2016

benchmark: refactor to use process.send
This removes the need for parsing stdout from the benchmarks. If the
process wasn't executed by fork, it will just print like it used to.

This also fixes the parsing of CLI arguments, by inferring the type
from the options object instead of the value content.

Only two benchmarks had to be changed:

* http/http_server_for_chunky_client.js: this previously used spawn;
now it uses fork and relays the messages using common.sendResult.

* misc/v8-bench.js: this relied on v8/benchmark/run.js calling
global.print and reformatted the input. It now interfaces directly
with the benchmark runner global.BenchmarkSuite.

PR-URL: #7094
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
benchmark: missing process.exit after bench.end
Previously bench.end would call process.exit(0); however, this was rather
confusing and indeed a few benchmarks had code that assumed otherwise.

This adds process.exit(0) to the benchmarks that need it.

PR-URL: #7094
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
benchmark: use t-test for comparing node versions
The data sampling is done in node and the data processing is done in R.
Only plyr was added as an R dependency and it is fairly standard.

PR-URL: #7094
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
benchmark: add script for creating scatter plot
Previously this was a tool in `plot.R`. It is now a more complete tool
which executes the benchmarks many times and creates a boxplot.

PR-URL: #7094
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
benchmark: update docs after refactor
PR-URL: #7094
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
benchmark: remove broken string-creation.js
Strings were never mutable, so it is not clear what this benchmark
attempts to do. It did work at some point, but only because the
benchmark wasn't using strict mode.

PR-URL: #7094
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
@AndreasMadsen

Member

AndreasMadsen commented Jul 26, 2016

Thanks for the quick eye. I have force pushed and updated the PR. I wish I knew how it happened.

CI: https://ci.nodejs.org/job/node-test-pull-request/3422/

@addaleax

Member

addaleax commented Jul 26, 2016

Well, yeah, I’ve had the, ahem, pleasure of breaking master by not having run CI again before landing myself in the recent past. :)

Anyway, CI looked good before it went all 502 (FreeBSD failure is unrelated and only the Windows tests were remaining), I’d say you can land this. Thanks!

@AndreasMadsen AndreasMadsen merged commit d525e6c into nodejs:master Jul 26, 2016

@addaleax

Member

addaleax commented Jul 27, 2016

Labelled this semver-major because that’s what has been suggested above, and #7890 shows that people obviously were using APIs of the old benchmarking scripts.

@AndreasMadsen

Member

AndreasMadsen commented Jul 27, 2016

Sounds good. This is obviously not backward compatible and it is quite easy to use the new tools on an old node version.

Also I don't really want to backport this ;)

Trott added a commit to Trott/io.js that referenced this pull request Aug 31, 2016

MylesBorins added a commit that referenced this pull request Sep 4, 2016

tools: enforce JS brace style with linting
Enable `brace-style` in ESLint.

Ref: #7094 (comment)

PR-URL: #8348
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Myles Borins <myles.borins@gmail.com>

MylesBorins added a commit that referenced this pull request Sep 28, 2016


rvagg added a commit that referenced this pull request Oct 18, 2016


MylesBorins added a commit that referenced this pull request Oct 26, 2016


@gibfahn gibfahn referenced this pull request Jun 15, 2017

Closed

Auditing for 6.11.1 #230
