libuv performance experiments (Network IO) - request feedback #1217

Closed
gireeshpunathil opened this issue Feb 6, 2017 · 25 comments

@gireeshpunathil
Contributor

gireeshpunathil commented Feb 6, 2017

This issue is opened to gather early feedback on an experiment within the libuv loop that investigates improving network throughput by exploiting the operating system tunables in the TCP stack and leveraging them in libuv.

Background:
Node.js big data use cases (with or without streaming and piping) may involve large chunks of data xfer between endpoints. The current unit of data xfer in Node is 64 KB (see uv__read for example) and is hard-coded.

This may be good enough if the underlying TCP stack supports only small windows, but modern operating systems support larger TCP data windows (ref: RFC 7323, TCP Extensions for High Performance, which obsoletes RFC 1323: https://tools.ietf.org/html/rfc7323).

Goal:
Investigate improving Node's network throughput by exploiting the operating system tunables in the TCP stack and leveraging them in libuv.

Non-goal:

  • Improvements in the most common use cases (a single-page web server, etc.) are not a goal of this experiment.
  • Tuning the OS itself to fit the workload is also not a goal; node running under normal user permissions can only query the values and adjust the TCP chunking accordingly.

Externals:
A --bigdata flag is to be introduced to Node.js for use in large data xfer scenarios. Using this flag will cause node to query the platform for the maximum possible TCP read/write buffer sizes, and the TCP sockets thus created will be made capable of transferring chunks of that size.
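
A minimal sketch of the query-and-apply step such a flag might perform on Linux is below. The /proc tunable paths are standard Linux; the helper names and the per-socket application are illustrative assumptions, not the actual patch.

```c
/* Sketch: query the Linux TCP buffer tunables and size a socket accordingly.
 * read_tunable() and size_bigdata_socket() are hypothetical helpers. */
#include <stdio.h>
#include <sys/socket.h>

static long read_tunable(const char* path, long fallback) {
  long val = fallback;
  FILE* f = fopen(path, "r");
  if (f != NULL) {
    if (fscanf(f, "%ld", &val) != 1)
      val = fallback;
    fclose(f);
  }
  return val;
}

/* Returns the chunk size the read path could use for this socket. */
static long size_bigdata_socket(int fd) {
  long rmem = read_tunable("/proc/sys/net/core/rmem_max", 64 * 1024);
  long wmem = read_tunable("/proc/sys/net/core/wmem_max", 64 * 1024);
  int rcv = (int) rmem;
  int snd = (int) wmem;

  /* Best effort: the kernel may clamp or double the requested values. */
  setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcv, sizeof(rcv));
  setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd, sizeof(snd));
  return rmem;
}
```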

Testing:
The stream-bench module, a benchmark for measuring streaming performance, can be used for measurement with a few combinations of client-server proximity (short and long RTT systems):

  1. Client and Server in the same system, using loopback address.
  2. Client and Server in the same system, using IP address.
  3. Client and Server in different systems, under same subnet.
  4. Client and Server in different systems, under different subnet.
    While [1] may not reflect a real use case, [2] (cluster), [3] (collective cluster / db / cloud), and [4] (db / microservices / cloud) do.

Some early experiments on Linux revealed the following (I plan to experiment with AIX, Windows, macOS, etc.):

A test case with 128 concurrent clients accessing 5 MB of data across the network showed the following strace statistics. node is on the client side, so read() is the focus.

Original node (reading in terms of 64KB): total runtime: 13500 millis.
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
98.00    0.786124        2629       299        34 futex
  0.31    0.002518          18       142           mmap
  0.20    0.001574           1      1349           read
  0.17    0.001331          23        59           mprotect
  0.14    0.001158          33        35           rt_sigaction
  0.14    0.001084           2       514       128 epoll_ctl
.......
------ ----------- ----------- --------- --------- ----------------
100.00    0.802162                  3420       310 total
Modified node (reading in terms of 256KB): total runtime: 7639 millis.
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
99.35    2.773892       11276       246        28 futex
  0.32    0.009062          16       566           read
  0.05    0.001328           9       140           mmap
  0.05    0.001260           2       514       128 epoll_ctl
  0.03    0.000977           7       131           write
  0.03    0.000835           2       384           setsockopt
  0.03    0.000727           4       169           epoll_wait
  0.03    0.000700           7        96           munmap
  0.02    0.000583           4       142           close
  0.02    0.000497           4       128       128 connect
....
------ ----------- ----------- --------- --------- ----------------
100.00    2.792172                  3087       304 total

A few observations [and their explanations]:

  • The amount of time spent in read has increased (1.5 ms vs. 9 ms).
    [due to the increased work in the kernel for allocating and copying large chunks]
  • The amount of time spent in the kernel has increased (800 ms vs. 2700 ms).
    [due to the increased work in the kernel for allocating and copying large chunks]
  • But overall, the large-buffer node performs better (7.6 s vs. 13.5 s).
    [the number of callbacks is reduced, OS buffers are efficiently utilized, and other parties in the uv loop are sufficiently attended to]
  • The performance gain is a function of concurrency in the system. If a single dedicated client attempts to read, the gain is not visible; instead, the read overhead surfaces.
    [no code can match the speed of a single-threaded, dedicated, blocking C client-server data xfer]
  • The performance gain is also a function of the actions in the read callback. If an empty callback is employed, the performance gain is not visible.
    [while the read callback is executing, the kernel read buffer accumulates data. The longer the callback runs, the more the buffers are filled, making the next read relatively optimal]
    In summary, my inference is that in an application where the uv loop is sufficiently exercised, the large-buffer read overhead is masked by the gains from: i) better utilization of network resources in the system, ii) fewer callbacks, and iii) reduced starvation in the loop.

Please let me know what you think.

@saghul
Member

saghul commented Feb 6, 2017

The current unit of data xfer in Node is 64 KB (see uv__read for example) and is hard-coded.

Note that those 64K libuv passes are just a suggestion. Node (and any other application) is free to ignore it and pick a better value. In fact, we are removing that for v2 :-)
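
For reference, a minimal sketch of how an application can already override that suggestion through its allocation callback; the 1 MB size here is an arbitrary illustration, not a recommendation:

```c
#include <stdlib.h>
#include <uv.h>

/* The application decides the buffer size here; libuv's suggested_size
 * (historically 64 KB) can simply be ignored. */
static void big_alloc_cb(uv_handle_t* handle, size_t suggested_size, uv_buf_t* buf) {
  (void) handle;
  (void) suggested_size;            /* ignore libuv's hint */
  buf->len = 1024 * 1024;           /* pick a size that suits the workload */
  buf->base = malloc(buf->len);
}

static void read_cb(uv_stream_t* stream, ssize_t nread, const uv_buf_t* buf) {
  /* consume the data here, then release the buffer */
  if (buf->base != NULL)
    free(buf->base);
  if (nread < 0)                    /* UV_EOF or error */
    uv_close((uv_handle_t*) stream, NULL);
}

/* usage (tcp_handle is a connected uv_tcp_t):
 *   uv_read_start((uv_stream_t*) &tcp_handle, big_alloc_cb, read_cb); */
```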

So, how did you modify libuv, if at all? What kind of feedback do you expect? On the methodology? Any actionable items?

Cheers!

@gireeshpunathil
Contributor Author

gireeshpunathil commented Feb 6, 2017

Thanks @saghul. The approach is to query the OS to confirm the window scaling option, pick up /proc/sys/net/core/rmem_max for the socket receive buffer size, and then set the chunk size in uv__read to this value. An improvement here would be to take both rmem_max and wmem_max into account, to cater to full-duplex sockets (the most common case).

The performance is 'really' a function of what we do in the read callback and how loaded the uv loop is. So the feedback requested is: i) does customizing the chunk size for workloads that demand it sound good to you? ii) does deriving the chunk sizes from the underlying system appear reasonable to you? iii) is there any specific workload you know of that this experiment can bank on?
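
For completeness, a sketch of the window-scaling check mentioned above, assuming the standard Linux tunable path (the helper name is illustrative):

```c
#include <stdio.h>

/* Returns 1 if TCP window scaling (RFC 7323) is enabled, 0 if disabled,
 * and -1 if the tunable cannot be read (e.g. not on Linux). */
static int tcp_window_scaling_enabled(void) {
  int enabled = -1;
  FILE* f = fopen("/proc/sys/net/ipv4/tcp_window_scaling", "r");
  if (f != NULL) {
    if (fscanf(f, "%d", &enabled) != 1)
      enabled = -1;
    fclose(f);
  }
  return enabled;
}
```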

@bnoordhuis
Member

@gireeshpunathil Did you profile node.js or plain libuv + something else?

Also, can I suggest you use perf instead of strace? strace is ptrace-based and has rather significant overhead; it's not a good tool for benchmarking.

@gireeshpunathil
Contributor Author

@bnoordhuis - I used node.js. sure, will use perf and come up with observations. thanks.

@gireeshpunathil
Contributor Author

test details:
data xfer: 10MB
no. of concurrent clients: 128
network: short RTT.

uv__read with a 64 KB chunk and TCP auto-tuning (no SNDBUF or RCVBUF adjustments):

Performance counter stats for 'node_old down.js [IP] 25000 128':

      25918.928625 task-clock                #    0.744 CPUs utilized          
             2,545 context-switches          #    0.098 K/sec                  
               133 cpu-migrations            #    0.005 K/sec                  
            23,658 page-faults               #    0.913 K/sec                  
   <not supported> cycles                  
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   <not supported> instructions            
   <not supported> branches                
   <not supported> branch-misses           

      34.841539894 seconds time elapsed

uv__read with a 256 KB chunk (SNDBUF = RCVBUF = 256 KB, with window scaling ON):

Performance counter stats for 'node down.js [IP] 25000 128':

       7615.776219 task-clock                #    0.472 CPUs utilized          
             1,680 context-switches          #    0.221 K/sec                  
                52 cpu-migrations            #    0.007 K/sec                  
            22,619 page-faults               #    0.003 M/sec                  
   <not supported> cycles                  
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   <not supported> instructions            
   <not supported> branches                
   <not supported> branch-misses           

      16.144609213 seconds time elapsed

@gireeshpunathil
Contributor Author

Also ran with appmetrics (https://www.npmjs.com/package/appmetrics) to see how the uv loop latency for pending tasks (the time elapsed between requesting that a task be run on the loop, such as process.nextTick(), setImmediate(), or setTimeout(), and the start of its execution) is faring:

64KB case:
4916
4154
4452
289
0
256KB case:
615
185
0

This leads to the inference that the loop residents are better attended to in the second case.

@bnoordhuis
Member

By the <not supported> markers I'm guessing you're testing in a VM.

For reliable benchmark numbers, bare metal is still best but for now, try perf record -c 40000 -i -g node --perf_basic_prof to get call stacks. If you want to filter/group on system calls, use perf probe.

@sreepurnajasti

I am working with @gireeshpunathil on this. The top CPU consumers for the original and modified node are shown below. I am still exploring how to get more fine-grained info with call stacks as well as filter probes.

Top 10 consumers in original node (full profile):


+  23.63%  node_org  perf-11287.map        [.] 0x000037d67f60bb77
+   2.29%  node_org  node_org              [.] v8::internal::Scanner::Scan()
+   2.20%  node_org  node_org              [.] v8::internal::Scanner::ScanIdentifierOrKeyword()
+   1.66%  node_org  node_org              [.] v8::internal::LookupIterator::PropertyOrElement(v8::internal::Isolate*, v8
+   1.54%  node_org  node_org              [.] v8::internal::ParserBase<v8::internal::Parser>::ParseAssignmentExpression(
+   1.52%  node_org  node_org              [.] void v8::internal::String::WriteToFlat<unsigned short>(v8::internal::Strin
+   1.47%  node_org  node_org              [.] v8::internal::Heap::AllocateRaw(int, v8::internal::AllocationSpace, v8::in
+   1.27%  node_org  node_org              [.] void v8::internal::String::WriteToFlat<unsigned char>(v8::internal::String
+   1.26%  node_org  node_org              [.] v8::internal::StringReplaceGlobalRegExpWithString(v8::internal::Isolate*,
+   1.14%  node_org  node_org              [.] v8::internal::AstValueFactory::GetOneByteStringInternal(v8::internal::Vect


Top 10 consumers in modified node (full profile):

+  22.18%  node  perf-10270.map        [.] 0x0000122fc5408529
+   2.99%  node  [kernel.kallsyms]     [k] copy_user_enhanced_fast_string
+   2.15%  node  node                  [.] v8::internal::Scanner::Scan()
+   2.07%  node  node                  [.] v8::internal::Scanner::ScanIdentifierOrKeyword()
+   1.41%  node  node                  [.] void v8::internal::String::WriteToFlat<unsigned short>(v8::internal::String*,
+   1.40%  node  node                  [.] v8::internal::ParserBase<v8::internal::Parser>::ParseAssignmentExpression(bool
+   1.38%  node  libc-2.17.so          [.] memchr
+   1.38%  node  node                  [.] v8::internal::LookupIterator::PropertyOrElement(v8::internal::Isolate*, v8::in
+   1.28%  node  node                  [.] v8::internal::Heap::AllocateRaw(int, v8::internal::AllocationSpace, v8::intern
+   1.21%  node  node                  [.] void v8::internal::String::WriteToFlat<unsigned char>(v8::internal::String*, u

The key observation is that the buffer copy (copy_user_enhanced_fast_string) overhead has increased due to the large reads. libc's memchr also shows up; it is root-caused by the Buffer.includes() call in the test case, which takes more time to search through larger data.

Top 10 consumers in the kernel (original node):

   0.81%  node_org  [kernel.kallsyms]     [k] copy_user_enhanced_fast_string
    0.49%  node_org  [kernel.kallsyms]     [k] iowrite16 
    0.17%  node_org  [kernel.kallsyms]     [k] free_hot_cold_page  
    0.14%  node_org  [kernel.kallsyms]     [k] __do_softirq        
    0.11%  node_org  [kernel.kallsyms]     [k] put_page
    0.08%  node_org  [kernel.kallsyms]     [k] get_page_from_freelist        
    0.08%  node_org  [kernel.kallsyms]     [k] run_timer_softirq   
    0.07%  node_org  [kernel.kallsyms]     [k] tcp_recvmsg         
    0.05%  node_org  [kernel.kallsyms]     [k] clear_page_c_e      
    0.04%  node_org  [kernel.kallsyms]     [k] _raw_spin_unlock_irqrestore   

Top 10 consumers in the kernel (modified node):

     2.99%     node  [kernel.kallsyms]     [k] copy_user_enhanced_fast_string        
     0.59%     node  [kernel.kallsyms]     [k] iowrite16       
     0.46%     node  [kernel.kallsyms]     [k] free_hot_cold_page         
     0.41%     node  [kernel.kallsyms]     [k] get_page_from_freelist     
     0.38%     node  [kernel.kallsyms]     [k] __do_softirq    
     0.28%     node  [kernel.kallsyms]     [k] put_page        
     0.25%     node  [kernel.kallsyms]     [k] clear_page_c_e  
     0.15%     node  [kernel.kallsyms]     [k] __do_page_fault 
     0.13%     node  [kernel.kallsyms]     [k] _raw_spin_unlock_irqrestore
     0.09%     node  [kernel.kallsyms]     [k] tcp_recvmsg     

This is in alignment with the full profile, where the TCP read and memory copy take more CPU. [I need to figure out who calls iowrite16.]

@bnoordhuis
Member

@sreepurnajasti Can you paste the output of perf report --stdio? In the first two, this...

  • 23.63% node_org perf-11287.map [.] 0x000037d67f60bb77

...is the (or at least: a) thing of interest.

@sreepurnajasti

sreepurnajasti commented Feb 8, 2017

@bnoordhuis - Last time I misplaced the --perf_basic_prof option on the command line, so the JS symbols were not resolved and got accumulated into one single entry with their total CPU consumption.

Currently, I am not able to produce the correct profile due to an internal network problem on the machine we are running on. I will update as soon as I have data.

@sreepurnajasti

When the perf report --stdio command is executed:
In the modified node: copy_user_enhanced_fast_string, Scan(), and ScanIdentifierOrKeyword() are the top 3 consumers.
In the original node: Scan() and ScanIdentifierOrKeyword() are the top 2 consumers.

As noted earlier, the single biggest CPU consumer

23.63% node_org perf-11287.map [.] 0x000037d67f60bb77

is now (with the correct command line options) distributed all around.

NOTE: I am not pasting each entry in full detail as the output is very large.

The observations are below:
Modified Node:


Overhead        Command         Shared Object
# ........  ....................  ..................................................
3.02%     node  [kernel.kallsyms]     [k] copy_user_enhanced_fast_string
     |
     --- copy_user_enhanced_fast_string
        |
        |--99.70%-- skb_copy_datagram_iovec
        |tcp_recvmsg
        |inet_recvmsg
        |sock_aio_read.part.7
        |sock_aio_read
        |do_sync_read
        |vfs_read
        |sys_read
        |system_call_fastpath
        |0x7f8a6746325d
        |uv__stream_io
        |uv__io_poll
        |uv_run
        |node::Start(uv_loop_s*, int, char const* const*, int, char const* const*)
        |node::Start(int, char**)
        |__libc_start_main
         --0.30%-- [...]

1.99%     node  node        [.] v8::internal::Scanner::Scan()
     |
     --- v8::internal::Scanner::Scan()
        |
        |--89.08%-- v8::internal::Scanner::Next()
        ||
        ||--17.08%-- v8::internal::ParserBase<v8::internal::Parser>::Expect(v8::internal::Token::Value, bool*)

(this continues …)

1.98%     node  node        [.] v8::internal::Scanner::ScanIdentifierOrKeyword()
     |
     --- v8::internal::Scanner::ScanIdentifierOrKeyword()
        |
        |--98.19%-- v8::internal::Scanner::Scan()
        ||
        ||--99.43%-- v8::internal::Scanner::Next()

(this continues …)

Original Node:

# Overhead   Command         Shared Object
# ........   ....................  .....................................................
  2.26%  node_org  node_org    [.] v8::internal::Scanner::Scan()
  |
  --- v8::internal::Scanner::Scan()
     |
     |--88.64%-- v8::internal::Scanner::Next()
     ||
     ||--17.27%-- v8::internal::ParserBase<v8::internal::Parser>::Expect(v8::internal::Token::Value, bool*)
(this continues...)

2.21%  node_org  node_org    [.] v8::internal::Scanner::ScanIdentifierOrKeyword()
  |
  --- v8::internal::Scanner::ScanIdentifierOrKeyword()
     |
     |--98.00%-- v8::internal::Scanner::Scan()
     ||
     ||--99.44%-- v8::internal::Scanner::Next()
(this continues…)

0.97%  node_org  [kernel.kallsyms]     [k] copy_user_enhanced_fast_string
  |
  --- copy_user_enhanced_fast_string
     |
     |--99.63%-- skb_copy_datagram_iovec
     |tcp_recvmsg
     |inet_recvmsg
     |sock_aio_read.part.7
     |sock_aio_read
     |do_sync_read
     |vfs_read
     |sys_read
     |system_call_fastpath
     |0x7f24860f925d
     |uv__stream_io
     |uv__io_poll
     |uv_run
     |node::Start(uv_loop_s*, int, char const* const*, int, char const* const*)
     |node::Start(int, char**)
     |__libc_start_main
      --0.37%--

@sam-github
Contributor

NOTE: I am not pasting each entry in full detail as the output is very large.

Instead of posting large text inline, you can create a gist at gist.github.com, and post links.

@bnoordhuis
Member

@sreepurnajasti If you see Scanner and ParserBase in there, the application can't have been doing anything very interesting. That's the JS source code parser and it normally takes up an insignificant fraction of the total running time, especially in programs that run for longer than a second or two.

You wouldn't happen to be spawning child processes, would you? Because in that case you've probably been profiling the parent process while it was waiting for the child process.

@sreepurnajasti

The test case and the full logs are available here

The test case is an attempt to simulate a real-world workload from the UV loop perspective. Suggestions to improve the test case are much appreciated!

Here is an explanation to what I am doing in the code:

  • Three sites are accessed 128 times in conjunction with the main streaming, to make sure that there are other I/O operations in the work queue.

  • An empty function is called at a 100 ms interval, 100 times, to make sure that the UV loop is attended to and to make program completion depend on this.

  • App-metrics is used to measure the latency incurred in the loop.

  • The following actions are performed per callback in down.js:
    I. The individual chunk lengths are stored, which can be used to measure how much we read each time.
    II. An object serialization is performed, which incurs a constant amount of work (invariant of the chunk volume) to stand in for general CPU-bound activities.
    III. A raw search is performed in the received chunk, with effort linearly proportional to the chunk volume.

  • The test does not spawn any child process. As I have observed earlier, the more work we do in the callback, the more the performance gain increases, and vice versa. My reasoning is as follows (see the sketch after this list):
    I. When more CPU is spent in the callback, the time to return to uv__read increases, and as a result the I/O buffer accumulates more incoming data (potentially > 64 KB).
    II. When the next read is attempted (in uv__read), we get more data than before, thereby reducing the callback count.
    III. While the amount read is increased, the probability of the bytes read being less than the requested length (256 KB in this case) is still higher than in the old case. This causes the thread to return to the uv loop, and attend to the other parties, more frequently.
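
A minimal way to observe that accumulation directly is the diagnostic sketch below (an illustration on Linux, not part of the proposed change): query the bytes already queued in the kernel receive buffer with FIONREAD before each read.

```c
#include <stdio.h>
#include <sys/ioctl.h>

/* Diagnostic sketch: print how much data is already queued in the kernel
 * receive buffer for `fd` before the next read() is issued. On a loaded
 * loop this tends to exceed 64 KB, which is what motivates larger reads. */
static void report_pending_bytes(int fd) {
  int pending = 0;
  if (ioctl(fd, FIONREAD, &pending) == 0)
    fprintf(stderr, "fd %d: %d bytes pending before read\n", fd, pending);
}
```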

@sam-github
Contributor

That's not how I would describe what the gist does. I would say it does this:

  • sets an interval timer to force node to stay alive for at least 10 seconds, whether it is actively doing any work or not; I am not sure what it means for the uv loop to be "attended to", or why this is here

and starts a whole bunch of http.gets at the same time, so they run in parallel:

  • makes (128 * 3) http.gets across 3 well-known web sites, and doesn't bother reading their responses, which takes < 4 seconds on my machine over slow wifi, most of it i/o blocked, only 1 sec spent in cpu
  • makes 128 http.gets to a specific URL from cmd line, collects the response, and does two pointless calls (util.inspect, and Buffer.includes) - I assume this URL is to a server that supports large TCP windows?

The whole test looks like it runs too fast to learn anything, but then pauses doing nothing but a set interval for 5 or 6 seconds.

I'd suggest setting off hundreds of http.gets in parallel, but when each one completes, have it do another http.get, and have the entire thing run for a few hundred seconds at least, and then call process.exit.

@sreepurnajasti

@sam-github, thanks.

I guess your description matches the scenario where the main client read takes very little time. The data being streamed is what matters here, and the program becomes meaningful with some 'big' data.

Can you please try with 100 MB or more being sent from a server (over a short RTT or a high-speed network; otherwise 10 MB or more over a long RTT)?

In terms of what the code is doing: I agree that the calls look meaningless on the surface. But this is simulation testing, meant to do nothing externally, only to make its impact visible in the area of code of interest (libuv in general, uv__read in particular).

The interval timer is meant to make sure the timer fires every 100 ms. If there is starvation due to tight reading of chunks, the timer will be delayed, and that will show up in the overall performance.

Reading from 3 well-known web sites: again, these don't do any meaningful work, but they cause some tasks to be pending in the pre-loop queue (the loop watcher queue) and/or some I/O-ready fds in the OS queue. The overall run time depends on their completion, which in turn is influenced by the chunk reading logic in uv__read.

The two 'pointless' calls: please suggest how else we could 'encapsulate' a variety of streaming / piping use cases in a few lines here. My rationale is to combine an invariant and a variant operation (with respect to the incoming data). If we don't do anything in the callback, we deviate from the most common use cases, and if we overdo it, we move too far from piping scenarios. Care has been taken to strike the right balance, but I am happy to amend with a better abstraction if one is available.

Yes, the server supports large TCP windows.

If I read back-to-back (as opposed to in parallel), the overhead of socket creation, connection, etc. can surface and may add noise to the read measurements. On the other hand, there is a limit on the number of sockets that can be established between two endpoints, hence 128 clients.

The intent of the test is to see how efficiently we are reading; efficiency here refers to the combination of speed and concurrency.

Having said that, I will follow your suggestion to increase the iterations and the running time and post the results.

@sreepurnajasti

As per the earlier suggestions on the test case, I have made the changes and here is the link.

The modified test case does the following:

  1. Generates 128 http.get connections to the server, which sends about 93 MB of data over a short RTT.
  2. At each connection:
    • Runs a custom word-count transformation on the incoming data and pipes the result into a file.
    • Repeats this 99 more times, at the end of each connection.

With this test case, an improvement of 12% is visible with the modified node.

@bnoordhuis
Member

@sreepurnajasti Can you post the output from perf? Intuitively, I would expect that benchmark to spend some (possibly quite a lot) of its CPU time doing work that isn't directly related to network I/O, e.g. the d.toString() call and the regex, plus I can see it getting backlogged on DNS queries and disk writes because they go through the thread pool.

@sreepurnajasti

@bnoordhuis The output of the top 20 perf consumers is here. (perf was run without the call graph as it produced a huge amount of data.) At a high level, this seems to match your expectations. Is there any other info you are keen on from this run?

@bnoordhuis
Member

@sreepurnajasti No, that about confirms my intuition. I'd say it's hard to draw meaningful conclusions from a benchmark where <1% is spent doing the thing we're actually interested in.

I imagine the 12% speedup is because you enjoy some secondary benefits from being able to operate on larger chunks of data (less garbage to collect, better cache efficiency, etc.) so it might still be worthwhile to pursue.

One last quick gauge: what does /usr/bin/time (not the time bash builtin) print for both binaries? The sys/user breakdown and number of pagefaults might tell us something.

@sam-github
Contributor

@sreepurnajasti I know there are pros and cons of microbenchmarks, and you are trying to emulate some kind of real-world workload, but I wonder if a straight-line program to read and discard data as quickly as possible from a TCP socket (in node), with data coming from a server with large window sizes, would give a better initial assessment of what kinds of improvements are possible, if any?
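
For reference, a straight-line read-and-discard client at the libuv level might look roughly like the sketch below (the address, port, and 256 KB buffer size are illustrative assumptions, and error handling is mostly omitted):

```c
#include <stdio.h>
#include <uv.h>

/* Straight-line libuv client: connect, read as fast as possible, discard. */

static char slab[256 * 1024];       /* one static slab is fine for a single stream */
static long long total_bytes;

static void alloc_cb(uv_handle_t* handle, size_t suggested_size, uv_buf_t* buf) {
  (void) handle;
  (void) suggested_size;            /* ignore the 64 KB suggestion */
  buf->base = slab;
  buf->len = sizeof(slab);
}

static void read_cb(uv_stream_t* stream, ssize_t nread, const uv_buf_t* buf) {
  (void) buf;                       /* discard the data */
  if (nread > 0) {
    total_bytes += nread;
  } else if (nread < 0) {           /* UV_EOF or error: stop */
    fprintf(stderr, "done, %lld bytes read (%s)\n",
            total_bytes, uv_strerror((int) nread));
    uv_close((uv_handle_t*) stream, NULL);
  }
}

static void connect_cb(uv_connect_t* req, int status) {
  if (status < 0) {
    fprintf(stderr, "connect failed: %s\n", uv_strerror(status));
    uv_close((uv_handle_t*) req->handle, NULL);
    return;
  }
  uv_read_start(req->handle, alloc_cb, read_cb);
}

int main(void) {
  uv_loop_t* loop = uv_default_loop();
  uv_tcp_t tcp;
  uv_connect_t req;
  struct sockaddr_in addr;

  uv_tcp_init(loop, &tcp);
  uv_ip4_addr("10.0.0.1", 25000, &addr);   /* assumed server address and port */
  uv_tcp_connect(&req, &tcp, (const struct sockaddr*) &addr, connect_cb);
  return uv_run(loop, UV_RUN_DEFAULT);
}
```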

@sreepurnajasti

sreepurnajasti commented Feb 16, 2017

@bnoordhuis, thanks.

I imagine the 12% speedup is because you enjoy some secondary benefits from being able to operate on larger chunks of data (less garbage to collect, better cache efficiency, etc.) so it might still be worthwhile to pursue.

Agreed on garbage, cache, etc., but how about potential improvements that come from the reduced number of callbacks, JIT factors (flow-based optimizations when the code runs on a larger volume of data), reduced call-linkage effort (C -> C++ -> JS -> C++ -> C), object creation overhead, etc.?
For example, we see 7531830 read callbacks in the modified case as opposed to 19161368 with the old one.

Here is the time profile, which shows a reduction in both system and user time as well as in page faults.

Original Node:

[bash node]$ /usr/bin/time ./node_org exp.js X Y 128
2602101 millis.
1665.49user 740.30system 43:22.29elapsed 92%CPU (0avgtext+0avgdata 86888maxresident)k
0inputs+64944outputs (0major+78803minor)pagefaults 0swaps

Modified Node:

[bash node]$ /usr/bin/time ./node exp.js X Y 128
2401625 millis.
1553.76user 673.00system 40:01.76elapsed 92%CPU (0avgtext+0avgdata 97508maxresident)k
0inputs+45120outputs (0major+457397minor)pagefaults 0swaps

@bnoordhuis
Member

@sreepurnajasti Re: potential improvements, I agree, they're the "etc." :-)

@sreepurnajasti

@sam-github, thanks.

If we are reading and discarding data as fast as we can, I guess the performance becomes a function of the network bandwidth alone. For example, in most of the dev systems I tested (with varying degrees of network proximity), the chunks that arrive are on the order of a few KB. In such setups, the current logic of 64 KB reads is sufficient (and works better) for back-to-back reading.

On networks that offer better bandwidth, data is received faster than the socket is read, and then large-chunk reading shows improvements (the reason is explained in previous comments). The result of a co-located (server and client in the same system) test with an empty read callback (all other factors kept intact) is below:

Original Node: 1035620 millis (total run time)
Modified Node: 1010135 millis (2.4% improvement over the default)

My bottom line is that as the code tends towards real-world workload scenarios, the large buffer brings benefit.

Now, I am investigating a couple of other platforms.

@gireeshpunathil
Contributor Author

The purpose of this issue has been served:

  1. Experiment with the approach and make performance observations.
  2. Gather review comments from the community on its usefulness.

As the PR is the place to drive this work to conclusion, I am closing this issue.
