Overview

At present, when a request is made from a client to a remote Voldemort node, a socket is “checked out” for the entire duration of the request/response cycle. The amount of time the socket will be used (and thus unavailable to others) is:

total time = write request + network + server processing + network + read response

For the whole of “total time” the socket is unavailable for other requests, so as the average “total time” increases, fewer requests can be made concurrently.

For processing within the same rack/data center, “total time” can be as short as a couple of milliseconds. For remote data centers, the “network” portion of “total time” alone can reach 100 milliseconds or more.
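
As a rough back-of-the-envelope illustration: if every request holds a socket for the full “total time”, a pool of N sockets can complete at most N / “total time” requests per second. The sketch below just does that arithmetic. The pool size of 10 and the 2 ms and 100 ms figures are illustrative assumptions, not measurements, and `SocketHoldTime` is a made-up name.

```java
// Hypothetical helper: upper bound on throughput when each request
// holds one socket for the entire request/response cycle.
class SocketHoldTime {
    /** Maximum requests per second for `sockets` sockets, each held for `totalTimeMs`. */
    static double maxRequestsPerSecond(int sockets, double totalTimeMs) {
        return sockets * 1000.0 / totalTimeMs;
    }

    public static void main(String[] args) {
        int poolSize = 10; // assumed pool size, for illustration only
        System.out.println(maxRequestsPerSecond(poolSize, 2.0));   // same data center
        System.out.println(maxRequestsPerSecond(poolSize, 100.0)); // remote data center
    }
}
```

With these assumed numbers the ceiling drops from 5,000 requests per second to 100, purely because of the time spent waiting on the network.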

The essential question is:

Is there a way to decouple writing, reading, and network transit time such that we can process more requests concurrently?

Certainly, there’s nothing we can do to eliminate the network transit time, but we should move requests onto the network in parallel to keep requests from blocking and timing out.

Proposal

The basic premise of the desired change is to decouple writing of the request and reading of the response. We should consider making the request and response asynchronous to achieve this. The act of “checking out” a socket becomes the “critical section” of code that we need to shorten.

Can we “check out” a socket for only the writing and reading portions separately, thus making the socket available for other outbound requests while the client is waiting for the response? Essentially we want to check out the socket when we have something to write and, as soon as the last byte is written, check it back in. Then, at some later point in time, we check out the socket to read the response and, as soon as the last byte is read, check it back in. (This won’t exactly fly, but I’ll come back to it in a moment.)

Assertions

There are a few conceptual changes that we (or at least I) need to verify as true in order for this to work:

  1. Does the same socket need to be used for both the writing and reading of a given request? That is, is there any reason that a request can’t be written to socket A and the response read from socket B given that both sockets connect the client and the server?
  2. Is there a reason to have multiple sockets between a client and a server any longer? If the socket is not “blocked” for the entire operation, is anything gained by creating multiple connections between the client and server?
  3. Can we avoid having to continuously toggle a given socket between reading and writing? TCP sockets are full-duplex, so we can write data to socket A from thread 1 while we read from socket A in thread 2. As long as we keep the buffers for reading and writing separate this should work, right?

If all of the above check out, then we can assert that any given socket can be registered permanently for reads and, as needed, registered temporarily for writes.
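
To make the third question concrete, here is a minimal loopback sketch (not Voldemort code) in which one thread writes to a socket while a second thread reads from the same socket. The class name `DuplexDemo`, the toy echo server, and the payload of integers are all assumptions made for the illustration.

```java
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class DuplexDemo {
    /** Writes 1..count on this thread while a second thread reads the echoes; returns their sum. */
    static long run(int count) {
        long[] sum = new long[1];
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread echo = new Thread(() -> echo(server));
            echo.start();
            try (Socket socketA = new Socket(server.getInetAddress(), server.getLocalPort())) {
                DataInputStream in = new DataInputStream(socketA.getInputStream());
                DataOutputStream out = new DataOutputStream(
                        new BufferedOutputStream(socketA.getOutputStream()));
                // Thread 2: blocks in read() on socket A for the whole exchange.
                Thread reader = new Thread(() -> {
                    try {
                        for (int i = 0; i < count; i++) {
                            sum[0] += in.readInt();
                        }
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
                reader.start();
                // Thread 1 (this thread): writes to socket A while the reader is waiting.
                for (int i = 1; i <= count; i++) {
                    out.writeInt(i);
                }
                out.flush();
                reader.join();
            }
            echo.join();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum[0];
    }

    /** Toy server: echoes every byte back until the client closes the connection. */
    private static void echo(ServerSocket server) {
        try (Socket s = server.accept()) {
            InputStream in = s.getInputStream();
            OutputStream out = s.getOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            // Connection closed by the peer; nothing more to echo.
        }
    }
}
```

The read and write sides use separate streams and separate buffers, which is the condition the third question asks about. This only shows that the transport allows it; it says nothing about how responses are matched to requests.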

A fundamental concept I’m trying to work out is whether the sockets can essentially always be registered to listen for responses and only occasionally registered to write requests. Essentially this means that in this context ‘checking out a socket’ really only means that we have permission to write to that socket. It’s implied that it’s always reading or waiting to read.
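
One way to picture “checked out means permission to write” is a per-connection write lock that is held only while request bytes are going out. This is a sketch under that assumption; `WriteCheckout` and its methods are hypothetical names, and the reading side is deliberately left out because it is assumed to be always active elsewhere.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: "checking out" a connection guards only its write side.
class WriteCheckout {
    private final OutputStream out;
    private final ReentrantLock writeLock = new ReentrantLock();

    WriteCheckout(OutputStream out) {
        this.out = out;
    }

    /** Holds the write permission only while the request bytes are being written. */
    void writeRequest(byte[] request) {
        writeLock.lock(); // check out: permission to write, nothing more
        try {
            out.write(request);
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writeLock.unlock(); // check in as soon as the last byte is written
        }
    }

    /** True while some thread holds the write permission. */
    boolean isCheckedOut() {
        return writeLock.isLocked();
    }
}
```

The caller gets control back as soon as the request is on the wire, so the connection is free for the next outbound request while the first one is still in flight.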

Observations

One result of this change is that the sockets need to be “checked out” at a very low level, in a just-in-time fashion. At present, higher levels of the client code check out the socket and only check it back in after the request is finished. This forces us to push the socket check out/check in code down to a fairly low level, i.e. in the layer that deals with the NIO APIs.
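
At the NIO layer, “always reading, occasionally writing” could map onto a selection key whose interest set always contains `OP_READ` and only temporarily contains `OP_WRITE`. The sketch below shows that toggling on a loopback connection; `InterestOps`, `beginWrite`, `endWrite` and `demo` are hypothetical names, and the real event loop is omitted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

class InterestOps {
    /** Adds OP_WRITE while a request is waiting to be written. */
    static void beginWrite(SelectionKey key) {
        key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
    }

    /** Drops OP_WRITE once the last byte has been written; OP_READ stays set. */
    static void endWrite(SelectionKey key) {
        key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
    }

    /** Loopback demo: returns the interest sets before, during, and after a write. */
    static int[] demo() {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                client.configureBlocking(false);
                // The channel is registered for reads for its whole lifetime.
                SelectionKey key = client.register(selector, SelectionKey.OP_READ);
                int before = key.interestOps();
                beginWrite(key);
                int during = key.interestOps();
                endWrite(key);
                int after = key.interestOps();
                return new int[] {before, during, after};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this picture the check out/check in calls sit right next to the channel writes, which is why the logic has to live in the layer that deals with the NIO APIs.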

Although this seemingly only applies to the client side of things, really this same approach is required for the server as well. We want to be able to listen for requests on all sockets, and only register a socket for writing when an operation is complete and its response is ready to be sent.

Outstanding Issues and Questions

Socket pool

I have one lingering conceptual issue regarding a socket pool. Traditionally, a pool guards usage of a resource so that it is only accessed for one operation at a time. In the case of our duplex approach, we want a given socket in our pool to be used by two operations:

  1. Writing requests
  2. Reading responses

It is only the latter that needs to be guarded. We need to make sure that only one operation is writing to, or reading from, a given socket at a given time. I haven’t given enough thought to whether there is anything we need to do explicitly on the server, or in how we manage the reading of responses, to ensure that a given socket isn’t asked to (somehow) read two different responses from the server at once.

Essentially then, in a sense we always have the sockets checked out — for reading. We sometimes have the sockets checked out for writing. An area of concern is accurately reporting the notion of available vs. in-use sockets for monitoring purposes. Presently there are JMX hooks to determine a handful of metrics about the current state of a given socket pool.

Sockets are not always available for read. Once the socket has data available, the “read” needs to be locked until a complete request is read. Once a read is started, that read must be completed before the next request can begin to be read. Partial request reads would need to be handled so that a request can span multiple reads. Comparably, a socket is checked out for the entire write operation, even if the operation cannot be completed in a single pass.
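
Handling a message that spans multiple reads usually comes down to buffering bytes until a whole frame has arrived. The sketch below assumes a framing of a 4-byte big-endian length followed by the payload; that framing, and the name `FrameAssembler`, are assumptions for illustration and not the actual Voldemort wire format. It also does no validation of the length field.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical framing: 4-byte big-endian length, then that many payload bytes.
class FrameAssembler {
    // Bytes received so far that do not yet form a complete frame (in read mode).
    private ByteBuffer pending = ByteBuffer.allocate(0);

    /** Feeds one chunk read from the socket; returns every frame it completes. */
    List<byte[]> feed(ByteBuffer chunk) {
        ByteBuffer merged = ByteBuffer.allocate(pending.remaining() + chunk.remaining());
        merged.put(pending).put(chunk);
        merged.flip();

        List<byte[]> frames = new ArrayList<>();
        while (merged.remaining() >= 4) {
            int length = merged.getInt(merged.position()); // peek, do not consume yet
            if (merged.remaining() < 4 + length) {
                break; // payload is still incomplete; wait for the next read
            }
            merged.getInt(); // consume the length prefix
            byte[] payload = new byte[length];
            merged.get(payload);
            frames.add(payload);
        }
        pending = merged; // leftover bytes are carried over to the next call
        return frames;
    }
}
```

Because the assembler keeps state between calls, only one reader may feed a given connection at a time, which matches the requirement that a read in progress must finish before the next one starts.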

Performance

How do we vet a given design to ensure that we’re not going to hurt performance for ‘intra-data center’ mode? Is there any need to support dual modes: requests in the same data center use the “classic” communications mode and inter-data center communications use the duplex mode?

Correlation IDs

Because the requests are being decoupled from their responses, we need a way to correlate an incoming response from the server with the request that triggered it. It was suggested to look at JMS’ correlation IDs. My question here is, how robust an implementation do we need in order to generate good IDs? How much data needs to be captured in the correlation ID?

How complicated correlation IDs need to be partially depends on the security requirements of the system. The server will need to know which client the request came from (its IP and port) as well as a unique ID for this request. The ID could be something as simple as a sequential or random integer or something more complex. Similarly, the client will need to know the same information from the server (server IP and port, the ID of this request) in order to perform the same correlation.
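
At the simple end of that range, the client side could be as small as a per-connection table keyed by a sequential ID. The sketch below assumes one such table per client/server connection, so the IP and port are implied by the connection rather than stored in the ID. `PendingRequests` and `Ticket` are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-connection table of requests that are awaiting a response.
class PendingRequests {
    static final class Ticket {
        final long id; // correlation ID sent along with the request
        final CompletableFuture<byte[]> response = new CompletableFuture<>();

        Ticket(long id) {
            this.id = id;
        }
    }

    private final AtomicLong nextId = new AtomicLong();
    private final Map<Long, Ticket> pending = new ConcurrentHashMap<>();

    /** Called before writing a request: allocates an ID and remembers the waiter. */
    Ticket register() {
        Ticket ticket = new Ticket(nextId.incrementAndGet());
        pending.put(ticket.id, ticket);
        return ticket;
    }

    /** Called by the reader for each response; returns false for an unknown ID. */
    boolean complete(long id, byte[] payload) {
        Ticket ticket = pending.remove(id);
        if (ticket == null) {
            return false;
        }
        ticket.response.complete(payload);
        return true;
    }

    /** Number of requests still waiting for their response. */
    int outstanding() {
        return pending.size();
    }
}
```

Responses can then arrive in any order; the reader only needs the ID from the response to wake up the right caller. Sequential IDs are predictable, so a deployment with stricter security requirements might want something harder to guess.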

Support for Non-Duplex Clients

The server side of the equation will need to handle both duplex- and simplex-style methods of communication in order to support older Java clients as well as non-Java clients. How complicated this will make the code has yet to be determined.

The simplest approach to making this work may be to introduce “brand-new” protocol families (“va” and “pa” as opposed to the current “vp” and “pb”, respectively). Based on the protocol family in use, the server and client will know whether it is a synchronous or asynchronous protocol.
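
A sketch of how that lookup might read in code, assuming the two-letter family code is available when the connection is set up. The enum `ProtocolFamily` and its members are hypothetical; only the four codes come from the text above.

```java
// Hypothetical mapping from protocol family code to communication style.
enum ProtocolFamily {
    VP("vp", false), // current family, synchronous
    PB("pb", false), // current family, synchronous
    VA("va", true),  // proposed asynchronous counterpart of "vp"
    PA("pa", true);  // proposed asynchronous counterpart of "pb"

    final String code;
    final boolean asynchronous;

    ProtocolFamily(String code, boolean asynchronous) {
        this.code = code;
        this.asynchronous = asynchronous;
    }

    /** Looks up a family by its two-letter code; rejects anything unknown. */
    static ProtocolFamily fromCode(String code) {
        for (ProtocolFamily family : values()) {
            if (family.code.equals(code)) {
                return family;
            }
        }
        throw new IllegalArgumentException("Unknown protocol family: " + code);
    }
}
```

Older Java clients and non-Java clients would keep announcing “vp” or “pb” and be served synchronously, without any change on their side.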

Support for Streaming Operations

It seems obvious (to me) that streaming operations cannot support this ‘write-and-return’ style of requests. This makes the client and server more complicated because we have to support both synchronous and asynchronous request/response types.

There are two approaches that could be used for streaming protocol operations.

  • Eliminate “streaming” operations in favor of “pagination” operations. In this approach, the client would control the streaming (rather than the server). Instead of asking for all values, the client would ask for a subset of values starting at some location. The server would respond with a number of values and some continuation code. When the client requested the next set, it would pass the continuation code to the server, and the server would use this value to find where it left off.

  • If the sockets are only checked out during the “write” operation, the streaming protocol would have to change to check out the socket, write the next value (along with the correlation ID) and check the socket back in.
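
The pagination approach in the first bullet can be sketched with an in-memory store. Here the continuation code is simply the last key returned, which assumes the store can iterate keys in a stable order; that is an assumption of the sketch, not a property claimed of any Voldemort storage engine. `PagedStore` and `Page` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical server-side store that answers "give me the next `limit` values".
class PagedStore {
    static final class Page {
        final List<String> values;
        final String continuation; // null when there is nothing left to fetch

        Page(List<String> values, String continuation) {
            this.values = values;
            this.continuation = continuation;
        }
    }

    private final NavigableMap<String, String> data = new TreeMap<>();

    void put(String key, String value) {
        data.put(key, value);
    }

    /** Returns up to `limit` values after `continuation` (null means from the start). */
    Page fetch(String continuation, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        NavigableMap<String, String> rest =
                continuation == null ? data : data.tailMap(continuation, false);
        List<String> values = new ArrayList<>();
        String lastKey = null;
        for (Map.Entry<String, String> entry : rest.entrySet()) {
            if (values.size() == limit) {
                return new Page(values, lastKey); // more entries remain
            }
            values.add(entry.getValue());
            lastKey = entry.getKey();
        }
        return new Page(values, null); // reached the end
    }
}
```

Each `fetch` is an ordinary request/response pair, so it fits the write-and-return style: the client drives the iteration by passing back the continuation code until it receives `null`.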