Measure performance with Canvas #77

Closed
aruneshchandra opened this issue Feb 1, 2017 · 8 comments

@aruneshchandra
Contributor

No description provided.

@jasongin
Member

jasongin commented Feb 8, 2017

I ran the Canvas benchmarks on my Windows machine, comparing results from before and after the NAPI port (using the same build of NAPI-enabled node). While some benchmarks show no measurable change, others are up to 5x slower on NAPI:

[image: chart of canvas benchmark results before and after the NAPI port]

Note that the benchmarks that show significant slowness from NAPI are the ones that have a high number of operations per second -- that is, they have very frequent calls through the NAPI layer. The first data point there, lineTo(), does very little work on its own, so a majority of the benchmark time is spent calling in and out of the NAPI layer.

With the current APIs, every call from JavaScript to C++ requires 4-6 NAPI calls, not including any additional parameter type validation and retrieval that may be done by the C++ function being called. The sequence is (in pseudocode):

argc = napi_get_cb_args_length();
argv = malloc(argc * sizeof(napi_value)); // One slot per argument
napi_get_cb_args(argv);
callbackData = napi_get_cb_data();
thisWrapper = napi_get_cb_this(); // Even static methods in JS have a 'this' (usually 'global')
thisArg = napi_unwrap(thisWrapper); // Only called for instance methods
returnValue = callbackData->method(thisArg, argv, callbackData->userData); // Call the user function
napi_set_return_value(returnValue); // Only called for methods that have a non-void return type

The lineTo() benchmark scenario makes four additional NAPI calls to validate and retrieve its arguments: 2 calls each to napi_get_type_of_value() and napi_get_value_number().

I did some experiments and some math, and found that on my machine every NAPI call costs approximately 25ns. That's actually not much, but I think there are some things we can do to reduce the number of NAPI calls required.

To be continued...

@jasongin
Member

jasongin commented Feb 9, 2017

To reduce the number of NAPI calls required for every call, we could define an ugly API that looks something like this, to retrieve all the callback info at once:

napi_status napi_get_cb_info(
  napi_env e,                // [in] NAPI environment handle
  napi_callback_info cbinfo, // [in] Opaque callback-info handle
  int* argc,                 // [in-out] Specifies the size of the provided argv and argt arrays
                             // and receives the actual count of args.
  napi_value* argv,          // [out] Array of values
  napi_valuetype* argt,      // [out] Optional array of value types, for optimizing arg validation
  napi_value* thisArg,       // [out] Receives the JS 'this' arg for the call
  void** data);              // [out] Receives the data pointer for the callback.

While we could skip the optional argt array there, it would make canvas faster because canvas does frequent type-checking on arguments, for both validation and method overloading.
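
As a rough illustration of how a callback might use such a combined getter, here is a minimal, self-contained C++ sketch. It does not use the real NAPI headers: every `mock_*` name, the `Context2d` struct, and the `line_to` function are hypothetical stand-ins, and `g_layer_calls` only counts calls through the mocked layer.

```cpp
// Hypothetical stand-ins so the sketch compiles without Node.js headers.
enum mock_valuetype { mock_undefined, mock_number, mock_object };

struct mock_value {
  mock_valuetype type;
  double number;  // payload when type == mock_number
  void* wrapped;  // native pointer when type == mock_object
};

// What the runtime would know about one JS-to-C++ call.
struct mock_callback_info {
  const mock_value* args;
  int argc;
  mock_value this_arg;
  void* data;
};

static int g_layer_calls = 0;  // counts calls through the mocked layer

// Mock of the proposed combined getter. On input *argc is the capacity of
// argv/argt; on output it is the actual number of arguments passed.
bool mock_get_cb_info(const mock_callback_info& info, int* argc,
                      mock_value* argv, mock_valuetype* argt,
                      mock_value* this_arg, void** data) {
  ++g_layer_calls;
  const int capacity = *argc;
  for (int i = 0; i < capacity && i < info.argc; ++i) {
    argv[i] = info.args[i];
    if (argt != nullptr) argt[i] = info.args[i].type;  // argt is optional
  }
  *argc = info.argc;
  *this_arg = info.this_arg;
  *data = info.data;
  return true;
}

bool mock_get_value_number(const mock_value& v, double* out) {
  ++g_layer_calls;
  *out = v.number;
  return true;
}

void* mock_unwrap(const mock_value& this_arg) {
  ++g_layer_calls;
  return this_arg.wrapped;
}

struct Context2d { double x = 0, y = 0; };  // hypothetical native object

// A lineTo()-style callback: validates two number args, then stores them.
bool line_to(const mock_callback_info& info) {
  int argc = 2;
  mock_value argv[2];
  mock_valuetype argt[2];
  mock_value this_arg;
  void* data;
  mock_get_cb_info(info, &argc, argv, argt, &this_arg, &data);  // call 1
  (void)data;  // unused in this sketch
  if (argc < 2 || argt[0] != mock_number || argt[1] != mock_number) {
    return false;  // type check needs no extra calls thanks to argt
  }
  Context2d* ctx = static_cast<Context2d*>(mock_unwrap(this_arg));  // call 2
  double x, y;
  mock_get_value_number(argv[0], &x);  // call 3
  mock_get_value_number(argv[1], &y);  // call 4
  ctx->x = x;
  ctx->y = y;
  return true;
}
```

Under these assumptions the callback crosses the layer four times: once for the combined getter, once to unwrap `this`, and twice to read the numeric arguments.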

In the case of the lineTo() benchmark scenario, this API could reduce the number of NAPI calls per operation from 9 to 4. That would theoretically reduce the per-operation time from 0.38 μs to around 0.25 μs, reducing the NAPI overhead to 0.25/0.08 = 313%. Still not great, but this is an extreme case.
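
The arithmetic behind that estimate can be written out as a small C++ sketch. The numbers come from the comments above; the linear "fixed cost per NAPI call" model and all names here are assumptions made for illustration, and 0.08 μs is taken to be the baseline per-operation time that the percentage is computed against.

```cpp
// Figures quoted in this thread; the linear cost model is an assumption.
constexpr double kUsPerNapiCall = 0.025;  // roughly 25 ns per NAPI call
constexpr double kCurrentUs = 0.38;       // per-operation time with NAPI today
constexpr double kBaselineUs = 0.08;      // baseline used for the percentage

// Projected per-operation time after removing some NAPI calls.
constexpr double projected_us(int calls_before, int calls_after) {
  return kCurrentUs - (calls_before - calls_after) * kUsPerNapiCall;
}

// Per-operation time expressed as a multiple of the baseline.
constexpr double relative_to_baseline(double us) {
  return us / kBaselineUs;
}
```

With these inputs, `projected_us(9, 4)` comes out at about 0.255 μs, which the estimate above rounds to 0.25 μs before dividing by the baseline.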

@jasongin
Member

jasongin commented Feb 9, 2017

I still want to test canvas perf on a non-Windows system, since we might find different performance characteristics for calling through the NAPI layer.

@mhdawson
Member

mhdawson commented Feb 9, 2017

Are you actually doing a malloc as shown in the pseudocode? That's going to be a killer, I think. We should do a stack allocation (even if we have to overestimate the size we need) up to a certain number of parameters, since 99% of the time the count will probably be less than ~6.

@jasongin
Member

jasongin commented Feb 9, 2017

Currently it's using a std::vector which I believe allocates on the heap. But yes, I had thought about allocating space for some small fixed number of args on the stack, then using the heap only for extreme cases. I'll try that and see if it makes a measurable difference in performance.
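
A minimal sketch of that idea in C++, assuming a hypothetical `ArgBuffer` helper (not part of NAPI) with an inline capacity of 6, following the estimate above: small argument counts use storage inside the object, which lives on the stack when the buffer is a local variable, and only larger counts fall back to the heap.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical helper: keeps up to InlineCount items inside the object itself
// and allocates on the heap only when more are requested.
template <typename T, std::size_t InlineCount = 6>
class ArgBuffer {
 public:
  explicit ArgBuffer(std::size_t count) : size_(count) {
    if (count > InlineCount) {
      heap_.reset(new T[count]());  // rare path: many arguments
    }
  }

  // Pointer to storage with room for size() items.
  T* data() { return heap_ ? heap_.get() : inline_; }
  std::size_t size() const { return size_; }
  bool on_heap() const { return heap_ != nullptr; }

 private:
  T inline_[InlineCount] = {};   // common path: no allocation
  std::unique_ptr<T[]> heap_;    // empty unless count > InlineCount
  std::size_t size_;
};
```

A callback wrapper could declare `ArgBuffer<napi_value> args(argc);` as a local and pass `args.data()` wherever an `argv` array is expected. Whether this makes a measurable difference would still need to be benchmarked.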

@ianwjhalliday

I think the argt array would be useful. In the leveldown conversion I found most of the callbacks did overload resolution and/or parameter validation based on the types of the arguments, so most of them requested the types of the arguments.

I also put the idea of one big ugly API as you propose here in the back of my mind to explore later if we ever hit a case where the performance overhead was significant, so +1 to this proposal.
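
To illustrate the kind of overload resolution described above, here is a small C++ sketch that picks an overload purely from an array of argument types, which is what an `argt`-style output would make cheap. The `ArgType` and `Overload` enums and the two signatures (key plus callback, or key plus options plus callback) are hypothetical and are not taken from leveldown's actual code.

```cpp
#include <cstddef>

// Hypothetical value types, standing in for what an argt array would hold.
enum class ArgType { Undefined, Number, String, Object, Function };

// Hypothetical overloads of a single native method.
enum class Overload { Invalid, KeyCallback, KeyOptionsCallback };

// Chooses an overload from the argument types alone, with no further calls
// through the layer needed for validation.
Overload resolve_overload(const ArgType* argt, std::size_t argc) {
  if (argc == 2 && argt[0] == ArgType::String &&
      argt[1] == ArgType::Function) {
    return Overload::KeyCallback;
  }
  if (argc == 3 && argt[0] == ArgType::String &&
      argt[1] == ArgType::Object && argt[2] == ArgType::Function) {
    return Overload::KeyOptionsCallback;
  }
  return Overload::Invalid;  // caller would throw a TypeError here
}
```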

@jasongin
Member

The above benchmark data was collected on a 5-year-old workstation PC running Windows, with a Xeon W3530 @2.8GHz, 20 GB RAM.

I also ran on a 1.5-year-old Mac Mini and the results were very similar percentage-wise.

@jasongin
Member

jasongin commented Mar 7, 2017

The performance improvements in my PR reduce the NAPI overhead in the worst-case canvas benchmark from 505% to 277%. Other benchmarks that stress the JS-to-C++ NAPI callback layer show similar improvements.
