# Performance Characterization of 10-Gigabit Ethernet: From Head to TOE

W. -C. Feng, P. Balaji, C. Baron, L. N. Bhuyan and D. K. Panda

Technical Report LA-UR-05-2635


W. Feng<sup>†</sup> P. Balaji<sup>‡</sup> C. Baron<sup>§</sup> L. N. Bhuyan<sup>§</sup> D. K. Panda<sup>‡</sup>

<sup>†</sup>Advanced Computing Lab, Los Alamos National Lab, feng@lanl.gov; <sup>‡</sup>Comp. Sci. and Engg., Ohio State University, {balaji, panda}@cse.ohio-state.edu; <sup>§</sup>Comp. Sci. and Engg., UC Riverside, {cbaron, bhuyan}@cs.ucr.edu

## **Abstract**

Though traditional Ethernet-based network architectures such as Gigabit Ethernet have suffered from a large performance gap compared to other high-performance networks (e.g., InfiniBand, Quadrics, Myrinet), Ethernet continues to be the most widely used network architecture today. This trend is mainly attributed to the low cost of the network components and their backward compatibility with the existing Ethernet infrastructure. With the advent of 10-Gigabit Ethernet and TCP Offload Engines (TOEs), whether this performance gap can be bridged is an open question.

In this paper, we present a detailed performance evaluation of the Chelsio T110 10-Gigabit Ethernet adapter with TOE. We have done performance evaluations in three broad categories: (i) detailed micro-benchmark performance evaluation at the sockets layer, (ii) performance evaluation of the Message Passing Interface (MPI) stack atop the sockets interface with both micro-benchmarks and the NAS Parallel Benchmarks, and (iii) application-level evaluations using the Apache web server. Our experimental results demonstrate latency as low as 9.6 $\mu$s and throughput of nearly 7.6 Gbps for these adapters (limited by the PCI-X interface). Further, we see nearly a three-fold improvement in the performance of the Apache web server when utilizing the TOE as compared to the basic 10-Gigabit Ethernet adapter without TCP offloading.

Keywords: 10-Gigabit Ethernet, TCP Offload Engine

#### 1 Introduction

Despite the performance criticisms of Ethernet for high-performance computing (HPC), the Top500 Supercomputer List [4] continues to move towards more commodity-based Ethernet clusters. Just three years ago, there were zero Gigabit Ethernet-based clusters in the Top500 list; now, Gigabit Ethernet-based clusters make up 176 (or 35.2%) of the list. The primary drivers of this Ethernet trend are ease of deployment and cost. So, even though the end-to-end throughput and latency of Gigabit Ethernet (GigE) lag exotic high-speed networks such as Quadrics [23], Myrinet [11], and InfiniBand [6] by as much as ten-fold, the current trend indicates that GigE-based clusters will soon make up over half of the Top500 (as early as November 2005). Further, Ethernet is already the ubiquitous interconnect technology for commodity grid computing because it leverages the legacy Ethernet/IP infrastructure whose roots date back to the mid-1970s. Its ubiquity will become even more widespread as long-haul network providers move towards 10-Gigabit Ethernet (10GigE) [17, 15] backbones, as recently demonstrated by the longest continuous 10GigE connection between Tokyo, Japan and Geneva, Switzerland via Canada and the United States [14]. Specifically, in late 2004, researchers from Japan, Canada, the United States, and Europe completed an 18,500-km 10GigE connection between the Japanese Data Reservoir project in Tokyo and the CERN particle physics laboratory in Geneva. The connection used 10GigE WAN PHY technology to set up a local-area network at the University of Tokyo that appeared to include systems at CERN, which were 17 time zones away.

Given that GigE is so far behind the curve with respect to network performance, can 10GigE bridge the performance divide while achieving the ease of deployment and cost of GigE? Arguably yes. The IEEE 802.3ae 10-Gb/s standard, supported by the 10GigE Alliance, already ensures interoperability with existing Ethernet/IP infrastructures, and the manufacturing volume of 10GigE is already driving costs down exponentially,<sup>1</sup> just as it did for Fast Ethernet and Gigabit Ethernet. This leaves us with the "performance divide" between 10GigE and the more exotic network technologies.

In a distributed grid environment, the performance difference is a non-issue mainly because of the ubiquity of Ethernet and IP as the routing language of choice for local-, metropolitan-, and wide-area networks in support of grid computing. Ethernet has become synonymous with IP for these environments, allowing clusters that use Ethernet to communicate over them with complete compatibility. On the other hand, networks such as Quadrics, Myrinet, and InfiniBand are unusable in such environments because they are incompatible with Ethernet and because they must avoid the IP stack in order to maintain high performance.

With respect to the cluster environment, Gigabit Ethernet suffers from an order-of-magnitude performance penalty when compared to networks such as Quadrics and InfiniBand. In our previous work [17, 15, 8], we demonstrated the capabilities of the basic 10GigE adapters in bridging this gap. In this paper, we take the next step by demonstrating the capabilities of the Chelsio T110 10GigE adapter with TCP Offload Engine (TOE). We present performance evaluations in three broad categories: (i) detailed micro-benchmark performance evaluation at the sockets layer, (ii) performance evaluation of the Message Passing Interface (MPI) [22] stack atop the sockets interface with both micro-benchmarks and the NAS Parallel Benchmarks (NPB) [2], and (iii) application-level evaluation using the Apache web server [3]. Our experimental results demonstrate latency as low as 9.6 $\mu$s and throughput of nearly 7.6 Gbps for these adapters (limited by the PCI-X interface).

<sup>1</sup>Per-port costs for 10GigE have dropped nearly ten-fold in two years.



Figure 1. TCP Offload Engines

Further, we see nearly a three-fold improvement in the performance of the Apache web server while utilizing the TOE as compared to a 10GigE adapter without TCP Offloading.

#### 2 Background

In this section, we briefly discuss the TOE architecture and provide an overview of the Chelsio T110 10GigE adapter.

#### 2.1 Overview of TCP Offload Engines (TOEs)

The processing of traditional protocols such as TCP/IP and UDP/IP is accomplished by software running on the server's central processor (CPU). As network connections scale beyond GigE speeds, the CPU becomes burdened with the large amount of protocol processing required. Resource-intensive memory copies, checksum computation, interrupts, and reassembly of out-of-order packets put a tremendous load on the host CPU. In high-speed networks, the CPU has to dedicate more processing to handling the network traffic than to the applications it is running. TCP Offload Engines (TOEs) are emerging as a solution to limit the processing required by CPUs for networking.

The basic idea of a TOE is to offload the processing of protocols from the host processor to the hardware on the adapter or in the system (Figure 1). A TOE can be implemented with a network processor and firmware, specialized ASICs, or a combination of both. Most TOE implementations available in the market concentrate on offloading the TCP and IP processing, while a few of them focus on other protocols such as UDP/IP.

As a precursor to complete protocol offloading, some operating systems started incorporating support for offloading certain compute-intensive tasks from the host to the underlying adapter, e.g., TCP/UDP and IP checksum offload. But as Ethernet speeds increased beyond 100 Mbps, further protocol processing offload became a clear requirement. Some GigE adapters addressed this requirement by offloading TCP/IP and UDP/IP segmentation or even the whole protocol stack onto the network adapter [18, 13].

A TOE can be implemented in different ways depending on the end-user preference between various factors like deployment flexibility and performance. Traditionally, firmware-based solutions provided the flexibility to implement new features, while ASIC solutions provided performance but were not flexible enough to add new features. Today, there is a new breed of performance-optimized ASICs utilizing multiple processing engines to provide ASIC-like performance with more deployment flexibility.

Figure 2. Chelsio T110 Adapter Architecture

#### 2.2 Chelsio 10-Gigabit Ethernet TOE

The Chelsio T110 is a PCI-X network adapter capable of supporting full TCP/IP offloading from a host system at line speeds of 10 Gbps. The adapter consists of multiple components: the Terminator, which provides the basis for offloading; separate memory systems, each designed for holding particular types of data; and a MAC and XPAC Optical Transceiver for physically transferring data over the line. An overview of the T110's architecture can be seen in Figure 2.

Context (CM) and Packet (PM) memory are available onboard, as well as a 64 KB EEPROM. A 4.5 MB TCAM is used to store a Layer 3 routing table and can filter out invalid segments for non-offloaded connections. At the heart of the T110 is the Terminator ASIC, the core of the offload engine, which is capable of handling 64,000 connections at once, with a setup and teardown rate of about 3 million connections per second.

**Memory Layout:** Two types of on-board memory are available to the Terminator. 256 MB of EFF FCRAM Context Memory stores TCP state information for each offloaded and protected non-offloaded connection as well as a Layer 3 routing table and its associated structures. Each connection uses 128 bytes of memory to store state information in a TCP Control Block. For payload (packets), standard ECC SDRAM (PC2700) can be used, ranging from 128 MB to 4 GB.
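As a rough check on the sizes quoted above, the following sketch computes how much Context Memory the TCP Control Blocks would occupy if all 64,000 connections were active at once. It assumes the 128-byte figure is the only per-connection state and ignores the routing table that shares the same memory; the function and variable names are illustrative only.

```python
# Back-of-the-envelope sizing of TCP Control Block (TCB) storage.
# The figures come from the text; the names are illustrative.
MAX_CONNECTIONS = 64_000                  # connections handled at once
TCB_BYTES = 128                           # state stored per connection
CONTEXT_MEMORY_BYTES = 256 * 1024 * 1024  # 256 MB of Context Memory


def tcb_footprint(connections: int, tcb_bytes: int = TCB_BYTES) -> int:
    """Return the bytes of Context Memory consumed by TCBs."""
    return connections * tcb_bytes


used = tcb_footprint(MAX_CONNECTIONS)
fraction = used / CONTEXT_MEMORY_BYTES
print(f"TCBs: {used} bytes ({used / 2**20:.2f} MiB), "
      f"{fraction:.1%} of Context Memory")
```

Under these assumptions the control blocks take up only a few percent of the Context Memory, which leaves most of it for the routing table and its associated structures.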

**Terminator Core:** The Terminator sits between a system's host and its Ethernet interface. When offloading a TCP/IP connection, it can handle such tasks as connection management, checksums, route lookup from the TCAM, congestion control, and most other TCP/IP processing. When offloading is not desired, a connection can be tunneled directly to the host's TCP/IP stack. In most cases, the PCI-X interface is used to send both data and control messages between the host and the adapter, but an SPI-4.2 interface can be used to pass data to and from a network processor (NPU) for further processing.

#### 3 Interfacing with the TOE

Since the Linux kernel does not currently support TCP Offload Engines (TOEs), researchers have taken various approaches to allow applications to interface with TOEs. The two predominant approaches are High Performance Sockets (HPS) [24, 20, 21, 9, 10, 7] and TCP Stack Override. The Chelsio T110 adapter uses the latter approach, which is described below.

In this approach, the kernel-based sockets layer is retained and used by the applications. However, the TCP/IP stack is overridden, and the data is pushed directly to the offloaded protocol stack, bypassing the host TCP/IP stack implementation. One of Chelsio's goals in constructing a TOE was to keep it from being too invasive to the current structure of the system. By adding kernel hooks inside the TCP/IP stack and avoiding actual code changes, the current TCP/IP stack remains usable for all other network interfaces, including loopback.

The architecture used by Chelsio essentially has two software components: the TCP Offload Module and the Offload driver.

**TCP Offload Module:** As mentioned earlier, the Linux operating system lacks support for TOE devices. Chelsio provides a framework of a TCP offload module (TOM) that is not device specific. Other vendors can use the same module to migrate connections between their TOE and the software stack. The TOM can be thought of as the upper layer of the TOE stack. It is responsible for implementing portions of TCP processing that cannot be done on the TOE (e.g., TCP TIME\_WAIT processing). The state of all offloaded connections is also maintained by the TOM. Not all of the Linux network API calls (e.g., tcp\_sendmsg, tcp\_recvmsg) are compatible with offloading to the TOE, and making them compatible would require extensive changes in the TCP/IP stack. To avoid this, the TOM implements its own subset of the transport layer API. TCP connections that are offloaded have certain function pointers redirected to the TOM's functions. Thus, non-offloaded connections can continue through the network stack normally.
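The per-connection redirection described above can be pictured with a small model. This is only a sketch of the idea: the names `Connection`, `host_sendmsg`, `tom_sendmsg`, and `offload` are hypothetical and do not correspond to the Linux kernel's or Chelsio's actual interfaces.

```python
# Illustrative model of per-connection function-pointer redirection.
# All names here are hypothetical; this is not kernel or Chelsio code.
from dataclasses import dataclass
from typing import Callable


def host_sendmsg(data: bytes) -> str:
    """Stand-in for the host TCP/IP stack's send path."""
    return f"host stack sent {len(data)} bytes"


def tom_sendmsg(data: bytes) -> str:
    """Stand-in for the TOM's send path, which hands data to the TOE."""
    return f"TOE sent {len(data)} bytes"


@dataclass
class Connection:
    # Each connection carries its own send handler (host stack by default).
    sendmsg: Callable[[bytes], str] = host_sendmsg


def offload(conn: Connection) -> None:
    """Redirect only this connection's handler to the TOM."""
    conn.sendmsg = tom_sendmsg


regular = Connection()
offloaded = Connection()
offload(offloaded)

print(regular.sendmsg(b"hello"))    # still uses the host stack
print(offloaded.sendmsg(b"hello"))  # now goes through the TOM
```

Because the handler is swapped per connection rather than globally, connections that are not offloaded keep the default path untouched, which mirrors the non-invasive goal stated in the text.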

**Offload Driver:** The offload driver is the lower layer of the TOE stack. It is directly responsible for manipulating the Terminator and its associated resources. TOEs have a many-to-one relationship with a TOM. A TOM can support multiple TOEs as long as it provides all functionality required by each. Each TOE can only be assigned one TOM. More than one driver may be associated with a single TOE device. If a TOE wishes to act as a normal Ethernet device (capable of only inputting/outputting Layer 2 level packets), a separate device driver may be required.

#### 4 Experimental Evaluation

In this section, we evaluate the performance achieved by the Chelsio T110 10GigE adapter with TOE. In Section 4.1, we perform evaluations on the native sockets layer; in Section 4.2, we perform evaluations of the Message Passing Interface (MPI) stack atop the sockets interface both with microbenchmarks as well as the NAS Parallel Benchmarks; and in Section 4.3, we perform evaluations using the Apache web server as an end application.

We used two clusters for the experimental evaluation. **Cluster 1** consists of two Opteron 248 nodes, each with a 2.2-GHz CPU along with 1 GB of 400-MHz DDR SDRAM and 1 MB of L2-Cache. These nodes are connected back-to-back with Chelsio T110 10GigE adapters with TOEs. **Cluster 2** consists of four Opteron 846 nodes, each with four 2.0-GHz CPUs (quad systems) along with 4 GB of 333-MHz DDR SDRAM and 1 MB of L2-Cache. It is connected with similar network adapters (Chelsio T110 10GigE-based TOEs) but via a 12-port Fujitsu XG1200 10GigE switch (with a latency of approximately 450 ns and capable of up to 240 Gbps of aggregate throughput). The experiments on both clusters were performed with the SuSE Linux distribution installed with kernel.org kernel 2.6.6 (patched with Chelsio TCP Offload modules). In general, we used Cluster 1 for all experiments requiring only two nodes and Cluster 2 for all experiments requiring more nodes. We point out the cluster used for each experiment throughout this section.

For optimizing the performance of the network adapters, we have modified several settings on the hardware as well as the software systems, e.g., (i) increased PCI burst size to 2 KB, (ii) increased send and receive socket buffer sizes to 512 KB each, and (iii) increased window size to 10 MB. Detailed descriptions about these optimizations and their impacts can be found in our previous work [17, 15].
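The text does not say how the socket buffer sizes were raised. One common application-level way to request 512 KB buffers is through the standard `SO_SNDBUF` and `SO_RCVBUF` socket options, sketched below as an assumption rather than as the authors' method. On Linux the granted size is also bounded by the system-wide `net.core.wmem_max` and `net.core.rmem_max` limits, so the value read back may differ from the request.

```python
import socket

BUF_BYTES = 512 * 1024  # 512 KB, the buffer size quoted in the text


def make_tuned_socket(buf_bytes: int = BUF_BYTES) -> socket.socket:
    """Create a TCP socket and request larger send/receive buffers."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf_bytes)
    return sock


sock = make_tuned_socket()
# The kernel may round, double, or cap the request, so read back the result.
sndbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"SO_SNDBUF={sndbuf} bytes, SO_RCVBUF={rcvbuf} bytes")
sock.close()
```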

#### 4.1 Sockets-level Evaluation

In this section, we evaluate the performance of the native sockets layer atop the TOEs as compared to the native host-based TCP/IP stack. We perform micro-benchmark level evaluations in two sub-categories. First, we perform evaluations based on a single connection measuring the point-to-point latency and uni-directional bandwidth together with the CPU utilization. Second, we perform evaluations based on multiple connections using the multi-stream bandwidth test, hot-spot test, fan-in and fan-out tests.

#### 4.1.1 Single Connection Micro-Benchmarks

Figures 3 and 4 show the basic single-stream performance of the 10GigE-based TOE as compared to the traditional host-based TCP/IP stack. The point-to-point latency numbers are measured using the NetPIPE benchmark and the throughput numbers using the Nttcp benchmark. All experiments in this section have been performed on Cluster 1 (described in Section 4).
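Point-to-point latency is usually measured with a ping-pong exchange: one side sends a message, the other echoes it, and half of the mean round-trip time is reported. The sketch below illustrates that technique over a local socket pair. It is not the benchmark used in the paper and does not touch a network adapter, so its numbers are not comparable to those in Figures 3 and 4; the function names are illustrative.

```python
import socket
import threading
import time


def recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes from a stream socket."""
    chunks = []
    remaining = size
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def echo_server(sock: socket.socket, size: int, iterations: int) -> None:
    """Receive each message in full and send it straight back."""
    for _ in range(iterations):
        sock.sendall(recv_exact(sock, size))


def ping_pong_latency(size: int = 1, iterations: int = 1000) -> float:
    """Return the mean one-way latency in microseconds."""
    client, server = socket.socketpair()
    worker = threading.Thread(target=echo_server,
                              args=(server, size, iterations))
    worker.start()
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(iterations):
        client.sendall(payload)
        recv_exact(client, size)
    elapsed = time.perf_counter() - start
    worker.join()
    client.close()
    server.close()
    return elapsed / iterations / 2 * 1e6  # half the round-trip time


latency_us = ping_pong_latency()
print(f"mean one-way latency: {latency_us:.1f} us")
```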

Figure 3a shows that the TCP Offload Engine (TOE) can achieve a point-to-point latency of about 9.6 $\mu$s as compared to the 11.1 $\mu$s achievable by the host-based TCP/IP stack (non-TOE), an improvement of about 13.5%. Figure 3b shows the uni-directional throughput achieved by the TOE as compared to the non-TOE. As shown in the figure, the TOE achieves a throughput of up to 7.6 Gbps as compared to the 5 Gbps achievable by the non-TOE (an improvement of about 52%).
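The text does not state the formulas behind these percentages. Taking the non-TOE value as the baseline in both cases (an assumption) reproduces the quoted figures, as this short calculation shows.

```python
def latency_improvement(baseline_us: float, improved_us: float) -> float:
    """Percentage reduction in latency relative to the baseline."""
    return (baseline_us - improved_us) / baseline_us * 100


def throughput_improvement(baseline_gbps: float, improved_gbps: float) -> float:
    """Percentage increase in throughput relative to the baseline."""
    return (improved_gbps - baseline_gbps) / baseline_gbps * 100


lat = latency_improvement(11.1, 9.6)      # non-TOE vs. TOE latency
tput = throughput_improvement(5.0, 7.6)   # non-TOE vs. TOE throughput
print(f"latency: {lat:.1f}%  throughput: {tput:.0f}%")
```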

The drop in non-TOE performance for very large messages (e.g., 256 KB) has been consistent in all our experiments. We believe the reason for this is the continuous per segment checksum computations done by the host stack as a part of the bottom-half handler; this results in starving the lower priority data-copy phase, causing additional cache misses [19]. We are currently investigating the cache misses in the stack to verify this.

By increasing the MTU size of the network adapter to 9 KB (Jumbo frames), we improve the performance of the non-TOE TCP/IP stack to 5.5 Gbps (Figure 4b). There is no additional improvement in the TOE throughput with Jumbo frames. The reason for this is the way the Chelsio T110 adapter handles message transmission. For the offloaded TCP/IP stack (TOE), the device driver hands over large message chunks (currently 10 KB) to be sent out. The actual segmentation of each message chunk into smaller MTU-sized frames is carried out by the network adapter. Thus, the host does not see any difference between a 1500-byte MTU and a 9000-byte MTU. On the other hand, for the host-based TCP/IP stack (non-TOE), an MTU of 1500 bytes results in more segments and correspondingly more interrupts to be handled for every message, causing lower performance as compared to an MTU of 9 KB.

Figure 3. Sockets-level Micro-Benchmarks (MTU 1500): (a) Latency and (b) Bandwidth

Figure 4. Sockets-level Micro-Benchmarks (MTU 9000): (a) Latency and (b) Bandwidth
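To make the segment counts concrete, the sketch below estimates how many frames one 10 KB chunk becomes at each MTU. It assumes 40 bytes of IP and TCP headers per frame with no options, which the text does not specify; the names are illustrative.

```python
import math

IP_TCP_HEADER_BYTES = 40  # assumed: 20-byte IP + 20-byte TCP, no options
CHUNK_BYTES = 10 * 1024   # 10 KB chunk handed over by the driver


def segments_per_chunk(mtu: int, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Number of MTU-sized frames needed to carry one chunk."""
    payload_per_frame = mtu - IP_TCP_HEADER_BYTES
    return math.ceil(chunk_bytes / payload_per_frame)


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {segments_per_chunk(mtu)} frames per 10 KB chunk")
```

Under these assumptions a chunk is split into 8 frames at an MTU of 1500 bytes and 2 frames at 9000 bytes. With the TOE this split happens on the adapter, whereas the host-based stack has to process each segment itself.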

#### 4.1.2 Multiple Connection Micro-Benchmarks

Here we evaluate the TOE and non-TOE stacks with microbenchmarks utilizing multiple simultaneous connections.

**Multi-stream Throughput Test:** Figure 5a illustrates the aggregate throughput achieved by two nodes (in Cluster 1) performing multiple instances of uni-directional throughput tests. The TOE results for this experiment varied considerably from run to run (the non-TOE results were more consistent), so it is difficult to characterize the improvement of the TOE over the non-TOE precisely. We have observed, however, that the TOE generally achieves a throughput about 1000-1500 Mbps higher than the non-TOE.

**Hot-Spot Latency Test:** Figure 5b shows the impact of multiple connections on small message transactions. In this experiment, a number of client nodes perform a point-to-point latency test with the same server, forming a hot-spot on the server. We performed this experiment on Cluster 2 with one node acting as the server and the other three 4-processor nodes hosting a total of 12 client processes. The clients are allotted in a cyclic manner, so 3 clients refers to 1 client on each node, 6 clients refers to 2 clients on each node, and so on. As seen in the figure, both the non-TOE and the TOE stacks show similar scalability with an increasing number of clients, i.e., the performance difference seen with just one client persists as the number of clients increases. This indicates that the lookup of connection-related data structures is performed efficiently enough on the TOE and does not form a significant bottleneck.
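The cyclic placement of clients can be expressed as a simple modulo assignment. The sketch below is only an illustration of that placement rule, with hypothetical names and 0-based node indices.

```python
from collections import Counter

CLIENT_NODES = 3  # the three non-server nodes of Cluster 2


def assign_clients(num_clients: int, num_nodes: int = CLIENT_NODES) -> list[int]:
    """Return the node index (0-based) hosting each client, in cyclic order."""
    return [client % num_nodes for client in range(num_clients)]


for total in (3, 6, 12):
    per_node = Counter(assign_clients(total))
    print(total, "clients ->", dict(sorted(per_node.items())))
```

With 12 clients, each of the three nodes ends up hosting four clients, one per processor.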

**Fan-out and Fan-in Tests:** With the hot-spot test, we have verified that the lookup time for connection-related data structures is quite efficient on the TOE. However, the hot-spot test does not stress the other resources on the network adapter, such as the management of memory regions for buffering data during transmission and reception. In order to stress such resources, we designed two other tests, namely fan-out and fan-in. In both tests, one server process carries out uni-directional throughput tests simultaneously with a number of client threads (performed on Cluster 2). The difference is that in a fan-out test the server pushes data to the different clients (stressing the transmission path on the network adapter), whereas in a fan-in test the clients push data to the server process (stressing the receive path on the network adapter). Figure 6 shows the performance of the TOE stack as compared to the non-TOE stack for both these tests. As seen in Figure 6a, the fan-out performance is quite consistent with an increasing number of clients, suggesting an efficient transmission path implementation on the TOE. On the other hand, as seen in Figure 6b, the fan-in performance is highly inconsistent and unable to maintain the peak throughput, suggesting that the receive path might be implemented less efficiently.

Figure 5. (a) Multi-stream Throughput and (b) Hot-Spot Latency

Figure 6. (a) Fan-out Test and (b) Fan-in Test

#### 4.2 MPI-level Evaluation

In this section, we evaluate the Message Passing Interface (MPI) stack written using the sockets interface on the TOE and non-TOE stacks. MPI is considered the *de facto* standard programming model for scientific applications; thus this evaluation would allow us to understand the implications of the TOE stack for such applications.

We performed two kinds of evaluations on the MPI stack. First, we performed basic micro-benchmark level evaluation in the form of point-to-point latency and uni-directional bandwidth, but on top of the MPI stack. Second, we used a set of standard benchmark applications, the NAS Parallel Benchmarks (NPB), to understand the impact of the TOE on the communication performed by the end applications. The micro-benchmark level evaluations were done using the LAM [12] and MPICH [16] implementations of MPI. However, we report only the LAM numbers here because LAM performed better with both the TOE and the non-TOE stacks. The NPB numbers, on the other hand, are reported for MPICH since the profiling tool we used (mpiP) [1] seems to be incompatible with LAM for Fortran applications.

Figure 7 illustrates the point-to-point latency and uni-directional bandwidth achievable with the TOE and non-TOE stacks for an MTU size of 9000 bytes. As shown in Figure 7a, MPI over the TOE stack achieves a latency of about 10.86 $\mu$s compared to the 12.5-$\mu$s latency achieved by the non-TOE stack (an improvement of 13.1%). The increased point-to-point latency of the MPI stack as compared to that of the native sockets layer (9.6 $\mu$s) is attributed to the overhead of the MPI implementation. Figure 7b shows the uni-directional throughput achieved by the two stacks. The TOE achieves a throughput of about 7 Gbps as compared to the 5 Gbps achieved by the non-TOE stack (an improvement of about 40%).



Figure 7. MPI-level Micro-Benchmarks (MTU 9000): (a) Latency and (b) Bandwidth

Next, we evaluate some of the NAS Parallel Benchmarks on top of both the TOE and the non-TOE stacks on a 2-node cluster. The NAS Parallel Benchmarks are a set of MPI-based benchmark applications which together imitate the behavior of several scientific applications. Figure 8 illustrates the percentage improvement in the communication time for these benchmarks while using the TOE stack as compared to the non-TOE stack. We see that some benchmarks such as IS achieve an improvement of up to 11%.

Figure 8. NAS Parallel Benchmarks

Figure 9. Apache Web-Server Performance: (a) Single File Traces and (b) Zipf Traces

#### 4.3 Application-level Evaluation

In this section, we evaluate the performance of the two stacks (TOE and non-TOE) using a real application, namely the Apache web server. We perform two kinds of experiments.

In the first experiment, we use a simulated trace which consists of only one file. By evaluating the two stacks with various sizes for this file, we can understand their performance without being diluted by other system parameters. Figure 9a depicts the performance achievable (number of transactions per second) by the two stacks for different file sizes. As seen in the figure, the TOE stack achieves a significantly better performance as compared to the non-TOE stack for large files (e.g., a file size of 32 KB shows a factor of 3.16 better performance for the TOE stack). For a 2-KB file size, however, we see a drastically lower performance for the TOE stack. This could be because of some interaction with the burst size of the adapters which is also 2 KB. We are investigating the exact reason for this behavior.

In the second experiment (Figure 9b), we build a trace based on a popular file request distribution named Zipf [25]. The Zipf distribution states that the probability of requesting the $I^{th}$ most popular document is inversely proportional to a constant power $(\alpha)$ of $I$. In some sense, $\alpha$ represents the amount of temporal locality in the trace. A high $\alpha$ value (close to one) represents high temporal locality, while a low $\alpha$ value (close to zero) represents low temporal locality. We used the World-Cup trace [5] as a base to associate file sizes with the Zipf pattern; like several other traces, this trace makes small files the most popularly requested ones, while larger files tend to be less popular. Following the same logic, when the $\alpha$ value is very close to one, a lot of small files (of around 2 KB in size) tend to be accessed. This causes the TOE stack to perform worse than the non-TOE stack for high $\alpha$ values (note that the TOE stack performs much worse than the non-TOE stack for files close to 2 KB in size, as shown in Figure 9a). However, when the $\alpha$ value becomes smaller, the requests are spread out more to the larger files as well. For these ranges, the TOE stack shows better performance (up to 3 times in some cases) as compared to the non-TOE stack.
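The request probabilities implied by this definition can be generated directly by normalizing $1/I^{\alpha}$ over all documents. The sketch below does this for a hypothetical catalogue of 1000 documents; the catalogue size and the two $\alpha$ values are illustrative choices, not parameters taken from the experiments.

```python
def zipf_probabilities(num_docs: int, alpha: float) -> list[float]:
    """Request probability for documents ranked 1..num_docs."""
    weights = [1.0 / rank ** alpha for rank in range(1, num_docs + 1)]
    total = sum(weights)
    return [w / total for w in weights]


for alpha in (0.25, 0.9):
    probs = zipf_probabilities(1000, alpha)
    print(f"alpha={alpha}: top-10 documents receive "
          f"{sum(probs[:10]):.1%} of requests")
```

A larger $\alpha$ concentrates more of the requests on the most popular documents, which is the sense in which it reflects temporal locality.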

#### 5 Concluding Remarks

In this paper, we presented a detailed performance evaluation of the Chelsio T110 10GigE adapter with TOE. We have performed evaluations in three categories: (i) Detailed microbenchmark level evaluation of the native sockets layer, (ii) Evaluation of the Message Passing Interface (MPI) stack over the sockets interface both with micro-benchmarks as well as the NAS Parallel Benchmarks, and (iii) Application-level evaluation of the Apache web server. These experimental evaluations provide several useful insights into the effectiveness of the TOE stack in scientific as well as commercial domains.

## References

- [1] mpiP: Lightweight, Scalable MPI Profiling. http://www.llnl.gov/CASC/mpip/.
- [2] NAS Parallel Benchmarks. http://www.nas.nasa.gov.
- [3] The Apache Web Server. http://www.apache.org/.
- [4] Top500 supercomputer list. http://www.top500.org, November 2004.
- [5] The Internet Traffic Archive. http://ita.ee.lbl.gov/html/traces.html.
- [6] Infiniband Trade Association. http://www.infinibandta.org.
- [7] P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D. K. Panda. Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial? In *ISPASS* '04.
- [8] P. Balaji, H. V. Shah, and D. K. Panda. Sockets vs RDMA Interface over 10-Gigabit Networks: An In-depth analysis of the Memory Traffic Bottleneck. In RAIT workshop '04.
- [9] P. Balaji, P. Shivam, P. Wyckoff, and D. K. Panda. High Performance User Level Sockets over Gigabit Ethernet. In *Cluster Computing* '02.
- [10] P. Balaji, J. Wu, T. Kurc, U. Catalyurek, D. K. Panda, and J. Saltz. Impact of High Performance Sockets on Data Intensive Applications. In HPDC '03.
- [11] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. K. Su. Myrinet: A Gigabit-per-Second Local Area Network. *IEEE Micro* '95.
- [12] Greg Burns, Raja Daoud, and James Vaigl. LAM: An Open Cluster Environment for MPI. In Supercomputing Symposium.
- [13] Chelsio Communications. http://www.chelsio.com/.
- [14] Chelsio Communications. http://www.gridtoday.com/04/1206/104373.html, December 2004.
- [15] W. Feng, J. Hurwitz, H. Newman, S. Ravot, L. Cottrell, O. Martin, F. Coccetti, C. Jin, D. Wei, and S. Low. Optimizing 10-Gigabit Ethernet for Networks of Workstations, Clusters and Grids: A Case Study. In SC '03.
- [16] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. *Parallel Computing*.
- [17] J. Hurwitz and W. Feng. End-to-End Performance of 10-Gigabit Ethernet on Commodity Systems. IEEE Micro '04.
- [18] Ammasso Incorporation. http://www.ammasso.com/.
- [19] H. W. Jin, P. Balaji, C. Yoo, J. Y. Choi, and D. K. Panda. Exploiting NIC Architectural Support for Enhancing IP based Protocols on High Performance Networks. JPDC '05.
- [20] J. S. Kim, K. Kim, and S. I. Jung. Building a High-Performance Communication Layer over Virtual Interface Architecture on Linux Clusters. In ICS '01.
- [21] J. S. Kim, K. Kim, and S. I. Jung. SOVIA: A User-level Sockets Layer Over Virtual Interface Architecture. In Cluster Computing '01.
- [22] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, March 1994.
- [23] F. Petrini, W. Feng, A. Hoisie, S. Coll, and E. Frachtenberg. The Quadrics Network (QsNet): High-Performance Clustering Technology. In *HotI* '01.
- [24] H. V. Shah, C. Pu, and R. S. Madukkarumukumana. High Performance Sockets and RPC over Virtual Interface (VI) Architecture. In CANPC workshop '99.
- [25] George Kingsley Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley Press, 1949.