libuv performance experiments (Network IO) - request feedback #1217
Comments
Note that the 64K libuv passes is just a suggestion. Node (and any other application) is free to ignore it and pick a better value. In fact, we are removing it for v2 :-) So, how did you modify libuv, if at all? What kind of feedback do you expect? On the methodology? Any actionable items? Cheers!
Thanks @saghul. The approach is to query the OS to confirm the window scaling option, pick up /proc/sys/net/core/rmem_max for the socket receive buffer size, and then modify the chunk size in uv__read to this value. An improvement here would be to take both rmem_max and wmem_max into account, to cater to full-duplex sockets (the most common case). The performance is really a function of what we do in the read callback and how loaded the uv loop is. So the feedback requested is: i) does customizing the chunk size for workloads that demand it sound good to you? ii) does deriving the chunk sizes from the underlying system appear reasonable to you? iii) are there any specific workloads you know of that this experiment can bank on?
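For illustration only, here is a minimal Node.js sketch of the derivation described above; the actual change lives in libuv's uv__read (C), and the procfs paths used are the standard Linux tunables:

```js
'use strict';
// Illustrative sketch: derive a candidate read-chunk size from the Linux
// tunables mentioned above. Not the actual libuv patch.
const fs = require('fs');

function readProcInt(path, fallback) {
  try {
    return parseInt(fs.readFileSync(path, 'utf8').trim(), 10);
  } catch (e) {
    return fallback; // non-Linux platform, or /proc not available
  }
}

const DEFAULT_CHUNK = 64 * 1024;
const windowScaling = readProcInt('/proc/sys/net/ipv4/tcp_window_scaling', 0);
const rmemMax = readProcInt('/proc/sys/net/core/rmem_max', DEFAULT_CHUNK);
const wmemMax = readProcInt('/proc/sys/net/core/wmem_max', DEFAULT_CHUNK);

// Only go beyond 64 KB when window scaling is on; taking the smaller of
// rmem_max / wmem_max caters to full-duplex sockets.
const chunkSize = windowScaling
  ? Math.max(DEFAULT_CHUNK, Math.min(rmemMax, wmemMax))
  : DEFAULT_CHUNK;
console.log('candidate read chunk size:', chunkSize);
```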
@gireeshpunathil Did you profile node.js or plain libuv + something else? Also, can I suggest you use perf instead of strace? strace is ptrace-based and has rather significant overhead; it's not a good tool for benchmarking. |
@bnoordhuis - I used node.js. sure, will use perf and come up with observations. thanks. |
Test details: uv__read with 64KB chunk, and TCP auto-tuning (no SNDBUF and RCVBUF adjustments):
uv__read with 256KB chunk (SNDBUF = RCVBUF = 256KB with window scaling ON):
Also ran with appmetrics (https://www.npmjs.com/package/appmetrics) to see how the uv loop latency for pending tasks (the time elapsed between requesting a task to be run on the loop and the start of its execution, for tasks such as process.nextTick(), setImmediate(), setTimeout()) is faring:
This leads to the inference that the loop residents are better attended to in the second case.
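(For reference, loop latency of this sort can also be probed without appmetrics; the hand-rolled sketch below simply times how long a queued task waits before running, and is not the appmetrics API.)

```js
'use strict';
// Hand-rolled loop-latency probe: schedule a task on the loop every 100 ms
// and record how long it waits before it actually runs. The 5-second window
// is arbitrary, for illustration.
const samples = [];

const timer = setInterval(() => {
  const scheduled = process.hrtime();
  setImmediate(() => {
    const [s, ns] = process.hrtime(scheduled);
    samples.push(s * 1e3 + ns / 1e6); // delay in milliseconds
  });
}, 100);

setTimeout(() => {
  clearInterval(timer);
  const avg = samples.reduce((a, b) => a + b, 0) / samples.length;
  console.log(`avg setImmediate delay: ${avg.toFixed(3)} ms over ${samples.length} samples`);
}, 5000);
```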
By the way: for reliable benchmark numbers, bare metal is still best, but for now, try
I am working with @gireeshpunathil on this. The top CPU consumers for the original and modified node are as shown below. I am still exploring how to get fine-tuned info with the call stack, as well as how to filter probes. Top 10 consumers in original node (full profile):
Top 10 consumers in modified node (full profile):
The key observation is that the buffer copy (copy_user_enhanced_fast_string) overhead has increased due to the large read. libc:memchr also shows up; it is caused by the Buffer.includes() call in the test case, which takes more time to search through the larger data. Top 10 consumers in the kernel (original node):
Top 10 consumers in the kernel (modified node):
This is in alignment with the full profile, where the TCP read and memory copy take more CPU. [I need to figure out who calls iowrite16.]
@sreepurnajasti Can you paste the output of
...is the (or at least: a) thing of interest. |
@bnoordhuis - Last time I had misplaced the --perf_basic_prof flag in the command line, so the JS symbols were not resolved and were accumulated into one single entity with their total CPU consumption. Currently, I am not able to produce the correct profile due to an internal network problem on the machine we are running on. I will update as soon as I have data.
When the perf report --stdio command is executed, the single biggest CPU consumer reported earlier
is now (with the correct command-line options) distributed across many entries. NOTE: I am not pasting each command's output in detail as it is very large. The observations are below:
Original Node:
Instead of posting large text inline, you can create a gist at gist.github.com, and post links. |
@sreepurnajasti If you see Scanner and ParserBase in there, the application can't have been doing anything very interesting. That's the JS source code parser and it normally takes up an insignificant fraction of the total running time, especially in programs that run for longer than a second or two. You wouldn't happen to be spawning child processes, would you? Because in that case you've probably been profiling the parent process while it was waiting for the child process. |
The test case and the full logs are available here. The test case is an attempt to simulate a real-world workload from the UV loop perspective. Suggestions to improve the test case are much appreciated! Here is an explanation of what I am doing in the code:
That's not how I would describe what the gist does. I would say it does this:
and starts a whole bunch of http.gets at the same time, so they run in parallel:
The whole test looks like it runs too fast to learn anything, but then pauses doing nothing but a setInterval for 5 or 6 seconds. I'd suggest setting off hundreds of http.gets in parallel, but when each one completes, have it do another http.get, and have the entire thing run for a few hundred seconds at least, and then call process.exit.
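A rough sketch of that shape, for concreteness (the URL, concurrency and duration are placeholders, not values from the original gist):

```js
'use strict';
// Keep CONCURRENCY requests in flight continuously: each completed http.get
// immediately starts another, until the deadline passes.
const http = require('http');

const URL = 'http://localhost:8000/';   // placeholder: a server sending large payloads
const CONCURRENCY = 200;
const DURATION_MS = 300 * 1000;         // run for a few hundred seconds
const deadline = Date.now() + DURATION_MS;
let completed = 0;

function one() {
  http.get(URL, (res) => {
    res.on('data', () => {});           // consume and discard the body
    res.on('end', () => {
      completed++;
      if (Date.now() < deadline) one(); // keep the pipeline full
    });
  }).on('error', () => {
    if (Date.now() < deadline) setTimeout(one, 100);
  });
}

for (let i = 0; i < CONCURRENCY; i++) one();

setTimeout(() => {
  console.log('requests completed:', completed);
  process.exit(0);
}, DURATION_MS);
```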
@sam-github, thanks. I guess your description matches the scenario where the main client read takes very little time. The data being streamed is of importance here; the program derives its meaning from some 'big' data. Can you please try with 100MB or more being sent from a server? (Over a short RTT or on a high-speed network; otherwise 10MB or more over a long RTT.)

In terms of what the code is doing: agreed that the operations look meaningless on the surface, but this is simulation testing, meant to do nothing externally, only to make their impact visible in the area of code we are interested in (libuv in general, uv__read in particular).

The interval timer is meant to make sure the timer is kicked in every 100ms. If there is starvation due to tight reading of chunks, this will be delayed and will show up in the overall performance.

Reading from 3 well-known web sites: again, they don't do any meaningful work, but they cause some tasks to be pending in the pre-loop queue (the loop watcher queue) and/or some I/O-ready fds in the OS queue. The overall run time depends on their completion, which in turn is influenced by the chunk-reading logic in uv__read.

The two point-less calls: please suggest how we could otherwise encapsulate a variety of streaming / piping use cases in a few lines here. My rationale is to combine an invariant and a variant operation (w.r.t. the incoming data); a sketch of what I mean follows below. If we don't do anything in the callback, we deviate from the most common use cases, and if we over-do it here, we move away from piping scenarios. Care has been taken to strike the right balance, but I am happy to amend this with a better abstraction if one is available.

Yes, the server supports large TCP windows. If I read back-to-back (as opposed to in parallel), the overhead of socket creation, connection etc. can surface and may cause noise in the read. On the other hand, there is a limit on the number of sockets which can be established between two endpoints, hence 128 clients. The intent of the test is to see how efficiently we are reading, where efficiency refers to the combination of speed and concurrency. Having said that, I will follow your suggestion to increase the iterations and the running time and post the results.
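For concreteness, a hypothetical read callback in that spirit (the pattern, port and counters are invented for illustration and are not taken from the gist):

```js
'use strict';
// One O(1) operation per callback (byte counting, "invariant") plus one
// operation whose cost grows with the chunk (a Buffer.includes() scan,
// "variant"), to stay roughly representative of piping/streaming consumers.
const net = require('net');

const PATTERN = Buffer.from('error');   // placeholder search token
let bytes = 0;
let hits = 0;

const socket = net.connect(8000, 'localhost');  // placeholder endpoint
socket.on('data', (chunk) => {
  bytes += chunk.length;                 // invariant work
  if (chunk.includes(PATTERN)) hits++;   // variant work: scans the whole chunk
});
socket.on('end', () => console.log({ bytes, hits }));
```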
As per the earlier suggestions on the test case, I have made the changes, and here is the link. The modified test case does the following:
With this test case, an improvement of 12% is visible with modified node. |
@sreepurnajasti Can you post the output from perf? Intuitively, I would expect that benchmark to spend some (possibly quite a lot) of its CPU time doing work that isn't directly related to network I/O, e.g. the |
@bnoordhuis Output of the top 20 consumers from perf is here. (perf was run without the call graph as it produced huge data.) At a high level, this seems to match your expectations. Is there any other info you are keen on from this run?
@sreepurnajasti No, that about confirms my intuition. I'd say it's hard to draw meaningful conclusions from a benchmark where <1% is spent doing the thing we're actually interested in. I imagine the 12% speedup is because you enjoy some secondary benefits from being able to operate on larger chunks of data (less garbage to collect, better cache efficiency, etc.) so it might still be worthwhile to pursue. One last quick gauge: what does |
@sreepurnajasti I know there are pros and cons to microbenchmarks, and you are trying to emulate some kind of real-world workload, but I wonder if a straight-line program that reads and discards data as quickly as possible from a TCP socket (in node), with data coming from a server with large window sizes, would give a better initial assessment of what kinds of improvements are possible, if any? (Something like the sketch below.)
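A minimal version of that, perhaps (host and port are placeholders; the program just connects, drains the socket as fast as possible, and reports throughput):

```js
'use strict';
// Straight-line read-and-discard TCP client: no per-chunk work beyond
// counting bytes, so the numbers mostly reflect the read path itself.
const net = require('net');

const start = process.hrtime();
let bytes = 0;

const socket = net.connect(8000, 'localhost');   // placeholder endpoint
socket.on('data', (chunk) => { bytes += chunk.length; });
socket.on('end', () => {
  const [s, ns] = process.hrtime(start);
  const secs = s + ns / 1e9;
  console.log(`${(bytes / (1024 * 1024)).toFixed(1)} MB in ${secs.toFixed(2)} s ` +
              `(${(bytes / secs / (1024 * 1024)).toFixed(1)} MB/s)`);
});
```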
@bnoordhuis, thanks.
Agreed on garbage, cache etc., but how about the potential improvements that come from the reduced number of callbacks: JIT factors (flow-based optimizations when code runs on larger volumes of data), less call-linkage effort (C -> C++ -> JS -> C++ -> C), less object-creation overhead, etc.? Here is the time profile, which shows a reduction both in system+user time and in page faults. Original Node:
Modified Node:
@sreepurnajasti Re: potential improvements, I agree, they're the "etc." :-) |
@sam-github, thanks. If we read and discard data as fast as we can, I guess the performance becomes a function of the network bandwidth alone. For example, on most of the dev systems I tested (with varying degrees of network proximity), the chunks that arrive are in the order of a few KBs. In such setups, the current logic of 64 KB reads is sufficient (and works better) to cater to back-to-back reading. On networks which offer better bandwidth, data is received faster than the socket is read, and then large-chunk reading shows improvements (for the reasons explained in previous comments). The result of a co-located (server and client on the same system) test with an empty read callback (all other factors kept intact) is as below: Original Node: 1035620 millis (total run time). My bottom line is that when the code tends towards real-world workload scenarios, the large buffer brings a benefit. Now, I am investigating a couple of other platforms.
The purpose of this issue has been served:
This issue is opened to gather early feedback on an experiment within the libuv loop to investigate improving network throughput by exploiting the operating system tunables in the TCP stack and leveraging them in libuv.
Background:
Node.js big-data use cases (with / without streaming and piping) may involve transferring large chunks of data between endpoints. The current unit of data transfer in Node is 64 KB (see uv__read for example) and is hard-coded.
This may be good enough if the underlying TCP stack supports only small window sizes, but modern operating systems support larger TCP data windows (ref: TCP Extensions for High Performance, RFC 1323, now RFC 7323: https://tools.ietf.org/html/rfc7323).
Goal:
Investigate improvement in Node's network throughput by exploiting the operating system tunables in the TCP stack and leveraging them in libuv.
Non-goal:
Externals:
A --bigdata flag to be introduced to node.js for use in large data transfer scenarios. Using this flag will cause node to query the platform for the maximum possible TCP read/write buffer sizes, and the TCP sockets thus created will be made capable of transferring chunks of that size.
Testing:
The stream-bench module, which is a benchmark for measuring streaming performance, can be used for measurement, with a few combinations of client-server proximity (short- and long-RTT systems).
While [1] may not reflect a real use case, [2] (cluster), [3] (collective cluster / db / cloud) and [4] (db / microservices / cloud) do.
Some early experiments on Linux revealed the following (I plan to experiment with AIX, Windows, macOS etc.):
A test case with 128 concurrent clients accessing 5MB of data across the network showed these strace statistics (node is on the client side, so read() is the focus):
A few observations [and their explanations]:
[due to the increased work in the kernel for allocating and copying large chunks]
[due to the increased work in the kernel for allocating and copying large chunks]
[the number of callbacks is reduced, OS buffers are efficiently utilized, and other parties in the uv loop are sufficiently attended to]
[no code can match the speed of a single-threaded, dedicated, blocking C client-server data transfer]
[while the read callback is executing, the kernel read buffer accumulates data; the longer the callback runs, the more the buffers are filled, making the next read relatively optimal]
In summary, my inference is that in an application where the uv loop is sufficiently exercised, the large-buffer-read overhead is masked by the gains we get from: i) better utilization of network resources in the system, ii) reduced callbacks, iii) reduced starvation in the loop.
Please let me know what you think.