Performance Testing on Dedicated Hardware
We seek to prove the hypothesis that the Netty stack (Zuul-Netty) is able to produce a more stable, higher-throughput and lower-latency performance characteristic than the original Netflix Zuul implementation (Zuul-Tomcat). We posit that significant gains will be achieved by utilising non-blocking outbound IO and zero-copy buffer transfer.
An experiment will be carried out to tune and stress both implementations, measuring throughput and latency as strain effects surface.
Takeaways - For the Impatient
- We were able to achieve throughputs on Zuul-Netty of ~28,000 requests/s with fair outliers. Zuul-Tomcat made it to ~16,000 requests/s but suffered a progressive, non-linear degradation in latency and significant problems with badly latent outliers.
- Non-blocking inbound and outbound IO scales far more linearly. The caveat is that the processing filters employed in the proxy must never block the execution stage threads: any filter that contacts another service (e.g. a config service) must do so with non-blocking outbound IO rather than a synchronous call.
- Time spent in GC is reduced significantly by Netty so long as the request and response buffers are not intercepted in processing filters. Doing so would drag the content into user space and cause an increase in GC utilisation. We chose to carefully expose only the minimum request and response data in the filter API so that the framework developers can make guarantees to filter developers about performance.
- Netty's scalability becomes non-linear under heavy inbound connection churn. Tomcat's APR connector performs much better in this case. Note that the performance characteristic is most predictable when a system under our control (e.g. a load balancer) can guarantee that keep-alives are maintained with the proxy despite inbound connection churn.
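The caveat about filters never blocking the execution stage threads can be illustrated with a minimal sketch. It uses only the JDK's `CompletableFuture` (Java 8 or later, which is an assumption about the runtime) and every name in it (`NonBlockingFilterSketch`, `fetchRouteAsync`, `applyFilter`) is hypothetical; it is not the Zuul-Netty filter API. The remote config lookup is simulated by a task on a separate pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; none of these names come from the Zuul-Netty API.
final class NonBlockingFilterSketch {

    // Stand-in for a remote config lookup whose result arrives later on an IO thread.
    static CompletableFuture<String> fetchRouteAsync(String path, ExecutorService ioPool) {
        return CompletableFuture.supplyAsync(() -> "backend-for:" + path, ioPool);
    }

    // The calling (stage) thread only registers a continuation and returns at once;
    // it never calls get() or join() on the future.
    static CompletableFuture<String> applyFilter(String path, ExecutorService ioPool) {
        return fetchRouteAsync(path, ioPool)
                .thenApply(route -> "X-Route: " + route);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService ioPool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<String> header = applyFilter("/orders", ioPool);
            // Waiting here is acceptable only because main() is not a stage thread.
            System.out.println(header.get(1, TimeUnit.SECONDS));
        } finally {
            ioPool.shutdown();
        }
    }
}
```

A synchronous variant would call `get()` inside `applyFilter`, parking the stage thread for the duration of the lookup; with a fixed pool of stage threads equal to the core count, a handful of such calls would be enough to stall the proxy.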
Equipment and Tool Specification
- Intel(R) Xeon(R) CPU E5430 @ 2.66GHz (dual quad core - total 8 CPU cores).
- Single 1Gbps NICs employed on each server.
- Unspecified Production-grade, low-latency networking equipment utilised for interconnects.
- Mixture of WRK load generator and HP LoadRunner used to provide a sample probe.
- -Dxorg.jboss.netty.epollBugWorkaround=true - enable the epoll CPU fix, although not experienced
- -Dxorg.jboss.netty.selectTimeout=10 - the default, has an effect only when idling
- -Dcom.netflix.zuul.workers.inbound=4 - limit inbound IO workers to 4 threads (half the cores)
- -Dcom.netflix.zuul.workers.stage=8 - limit execution stage threads to the number of cores
- -Dcom.netflix.zuul.workers.outbound=4 - limit outbound IO workers to 4 threads (half the cores)
- set to ensure no pool bottleneck
- net.core.rmem_max = 8388608 - increase read buffer size
- net.core.wmem_max = 8388608 - increase write buffer size
- net.core.netdev_max_backlog = 5000 - increase the maximum number of packets queued on the INPUT side, especially when the NIC receives packets faster than kernel can process them
- net.core.somaxconn = 5000 - increase the maximum number of inbound connections allowed
- net.ipv4.tcp_max_syn_backlog = 5000 - increase the backlog per port to surface queueing effects
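As a rough sketch of how the worker-count overrides listed above might be consumed, the following reads the three `-D` properties with the JDK's `Integer.getInteger` and sizes a fixed pool for the execution stage. The class and method names are hypothetical, and the fallback values (half the cores for IO workers, one thread per core for the stage) simply mirror the settings used in this test; they are not Zuul-Netty's or Netty's actual defaults.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper; shows one way to turn the -D overrides into pool sizes.
final class WorkerPoolSizing {

    // Returns the configured value of a -D system property, or the fallback
    // when the property is absent, malformed, or not positive.
    static int workers(String property, int fallback) {
        Integer configured = Integer.getInteger(property);
        return (configured != null && configured > 0) ? configured : fallback;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // Fallbacks are assumptions that mirror the tested configuration.
        int inbound = workers("com.netflix.zuul.workers.inbound", Math.max(1, cores / 2));
        int stage = workers("com.netflix.zuul.workers.stage", cores);
        int outbound = workers("com.netflix.zuul.workers.outbound", Math.max(1, cores / 2));

        // A fixed pool keeps the stage thread count bounded at the chosen size.
        ExecutorService stagePool = Executors.newFixedThreadPool(stage);
        try {
            System.out.printf("inbound=%d stage=%d outbound=%d%n", inbound, stage, outbound);
        } finally {
            stagePool.shutdown();
        }
    }
}
```

Running it with `-Dcom.netflix.zuul.workers.stage=8` on the command line would override the per-core fallback for the stage pool only.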
The load test scenario was simulated with the wrk benchmarking tool, which sent requests over keep-alive connections to a target stub that returned a random 1k payload with a 50ms uniform latency. In a realistic deployment we assume this implementation will sit behind a Netscaler/VIP, which would reuse sockets rather than expose the proxy to constant socket churn.
The number of simultaneous connections was ramped up in steps, starting at 700 and ending at 2000, with 10 minutes spent in steady-state load at each step.
Highlights of Observations
- We achieved the objective of tuning Zuul-Netty to reach linear scalability, and highlighted the benefits of non-blocking outbound IO through a comparison with Zuul-Tomcat. To achieve this we made some TCP tweaks and added overrides for Netty's IO threads.
- We observed that, comparatively, Tomcat’s APR connector was much more efficient in cases where there was no keep-alive, i.e. every request involved the client opening a new TCP connection to the proxy.
- The scalability point of Tomcat versus Netty was determined by discounting the overheads/latencies incurred in processing the response content, as this is noise present in both tests. Based on the table below, we can conclude that Zuul-Tomcat reached its scalability point at 1300 concurrent connections with a throughput of around 16K TPS, whereas the Netty implementation remained linearly scalable [we tested up to 2000 connections ~ 28K TPS].
- The Zuul-Netty response time at the maximum load of 2000 connections was around 71ms. With the stub's 50ms latency subtracted, that is a total of ~21ms spent on the wire and in the proxy. The increase in response time from one load level to the next was almost uniform, which can be attributed to the overhead of processing the increasing server load [overhead of the HTTP codec and network latencies within the stream; at maximum load we were using 680Mbps on a 1G NIC, and network saturation effects contributed to an increase in latency at this load level].
- The Zuul-Tomcat response times increased significantly as connections increased, which is a composite effect of blocking connections, frequent GC and a high number of busy worker threads. This is reflected in the Operating System’s run queue length and the JVM’s GC throughput. Comparison graphs can be found below.
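The figures quoted in these observations can be cross-checked with Little's law (connections = throughput × response time), which is a reasonable fit if each keep-alive connection carries one request at a time; that closed-loop behaviour of the load generator is an assumption here. The sketch below is plain arithmetic over the numbers already reported, and the class and method names are hypothetical.

```java
// Hypothetical sanity check of the reported figures using Little's law.
final class LittlesLawCheck {

    // Throughput (requests/s) implied by N in-flight connections and a mean response time.
    static double throughputPerSecond(int connections, double responseTimeSeconds) {
        return connections / responseTimeSeconds;
    }

    // Mean response time (ms) implied by N in-flight connections and a throughput.
    static double responseTimeMillis(int connections, double throughputPerSecond) {
        return 1000.0 * connections / throughputPerSecond;
    }

    public static void main(String[] args) {
        // Zuul-Netty: 2000 connections at ~71ms per request.
        System.out.printf("Netty implied throughput: %.0f req/s%n",
                throughputPerSecond(2000, 0.071));

        // Zuul-Tomcat: 1300 connections at ~16K TPS.
        System.out.printf("Tomcat implied response time: %.2f ms%n",
                responseTimeMillis(1300, 16_000));

        // Proxy and wire overhead once the stub's 50ms latency is removed.
        System.out.printf("Netty overhead: %.0f ms%n", 71.0 - 50.0);
    }
}
```

The Netty figure comes out just above 28,000 requests/s, in line with the ~28K TPS reported. The Tomcat figure of roughly 81ms at its scalability point is an inference from the reported numbers, not a measured value.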
Results for Ramped Load Test
Aggregate Performance fn(connections)
|Observed Performance||Resource Utilisation|
Result Samples fn(time)
|Zuul running on Tomcat||Zuul-Netty|
Observed Benefits of Zuul-Netty
It is clear that Zuul-Netty has a much more stable performance characteristic than Zuul-Tomcat.
- Non blocking inbound AND outbound IO. The threads doing the useful filter work never block and so can be set to a fixed number, equal to the number of cores. This has the effect of preventing contention for CPU resource, something which was observed in Zuul-Tomcat.
- During profiling and thread activity analysis of the Netty instance we observed that the major contributor to CPU usage was the Selector method, so reducing the inbound and outbound IO worker threads [default: 2 * number of CPUs] helped us save some CPU cycles. These values need to be determined according to the application's usage.
- Efficiently utilises the system resources - CPU, network etc. Even higher throughput can be achieved with bonded or dedicated NICs. Depending upon the additional tasks on the proxy layer we might need additional CPU capacity.
- A high number of connections doesn’t affect the stability of the Zuul-Netty proxy instance, whereas with Zuul-Tomcat the instance became unresponsive when the worker threads reached their limits.
- Memory utilisation is very efficient: the low thread count means far fewer thread stacks are allocated, and zero-copy request and response content buffers are employed. Higher application throughput is visible in the GC graphs shown above.
Observed Limitations of Zuul-Tomcat
- In Zuul-Netty we observed that context switches started off at 100K per second and settled at 48K at peak load. Because Netty NIO is event based, the efficiency of the Zuul-Netty proxy is at its peak from the start of the run [all threads were active from the start of the run]; this trend matches the TPS and mirrors the response latency. In Zuul-Tomcat, context switching caused by the high number of worker threads was high and remained high throughout, in turn causing inefficient use of CPU resource.
- In Zuul-Tomcat the worker threads increased with the increase in concurrent connections and leveled off as soon as the throughput settled. Since higher load demands more worker threads, the throughput scalability is non-linear with respect to load as more context-switching and more CPU core contention occurs.
- The CPU Load Averages for Zuul-Tomcat quickly outstrip those of Zuul-Netty. As the load passes the number of available cores, a result of the copious worker threads, a queueing effect ensues which presents as an increase in latency and therefore a lower throughput for a given number of connections.
This work was carried out by Raamnath Mani & Neil Beveridge.