New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Packing entries into AppendEntries requests is slow #161

ongardie opened this Issue Jun 5, 2015 · 2 comments


None yet
1 participant
Copy link

ongardie commented Jun 5, 2015

This is closely related to #160. Adding 4000 entries totaling 1 megabyte to the AppendEntries request takes about 900 milliseconds. That's too long.

The first problem is that there's probably too many invocations of request.ByteSize(). It'd be better to accumulate entry.ByteSize() instead as an approximation.

We may also want a limit on the number of entries, since each one requires some CPU time to process.


This comment has been minimized.

Copy link

ongardie commented Jun 9, 2015

I started on this, but I want to benchmark it in a real-world scenario, so it probably depends on #168. In case someone (likely @nhardt if anyone) needs it before then, here's a patch that might help:

diff --git a/Server/ b/Server/
index ee6fab9..196771f 100644
--- a/Server/
+++ b/Server/
@@ -2201,6 +2201,7 @@ void
 RaftConsensus::appendEntries(std::unique_lock<Mutex>& lockGuard,
                              Peer& peer)
+    using Core::Util::downCast;
     uint64_t lastLogIndex = log->getLastLogIndex();
     uint64_t prevLogIndex = peer.nextIndex - 1;
     assert(prevLogIndex <= lastLogIndex);
@@ -2235,25 +2236,45 @@ RaftConsensus::appendEntries(std::unique_lock<Mutex>& lockGuard,
     // Add as many as entries as will fit comfortably in the request. It's
     // easiest to add one entry at a time until the RPC gets too big, then back
     // the last one out.
+    //
+    // Calculating the size of the request ProtoBuf is a bit expensive, so this
+    // estimates high, then if it reaches the size limit, corrects the estimate
+    // and keep going. This is a dumb algorithm but does well enough. It gets
+    // the number of calls to request.ByteSize() down to about 15 even with
+    // extremely small entries (10 bytes of payload data in each of 50,000
+    // entries filling to a 1MB max).
     uint64_t numEntries = 0;
+    uint64_t baseSize = downCast<uint64_t>(request.ByteSize());
+    uint64_t currentSize = baseSize;
     if (!peer.forceHeartbeat) {
         for (uint64_t index = peer.nextIndex;
              index <= lastLogIndex;
              ++index) {
             const Log::Entry& entry = log->getEntry(index);
             *request.add_entries() = entry;
-            uint64_t requestSize =
-                Core::Util::downCast<uint64_t>(request.ByteSize());
-            if (requestSize < SOFT_RPC_SIZE_LIMIT || numEntries == 0) {
-                // this entry fits, send it
-                VERBOSE("sending entry <index=%lu,term=%lu>",
-                        index, entry.term());
-                ++numEntries;
-            } else {
-                // this entry doesn't fit, discard it
-                request.mutable_entries()->RemoveLast();
-                break;
+            // Each member of a repeated message field is encoded with a tag
+            // and a length. We conservatively assume the tag and length will
+            // be up to 10 bytes each (2^64), though in practice the tag is
+            // probably one byte and the length is probably two.
+            currentSize += entry.ByteSize() + 20;
+            if (currentSize >= SOFT_RPC_SIZE_LIMIT) {
+                // The message might be too big: calculate more exact but more
+                // expensive size.
+                uint64_t actualSize = downCast<uint64_t>(request.ByteSize());
+                assert(currentSize >= actualSize);
+                currentSize = actualSize;
+                if (currentSize >= SOFT_RPC_SIZE_LIMIT && numEntries > 0) {
+                    // this entry doesn't fit and we've already got some
+                    // entries to send: discard this one and stop adding more
+                    request.mutable_entries()->RemoveLast();
+                    break;
+                }
+            // this entry fits, send it
+            VERBOSE("sending entry <index=%lu,term=%lu>",
+                    index, entry.term());
+            ++numEntries;
     request.set_commit_index(std::min(commitIndex, prevLogIndex + numEntries));

This patch does seem to help significantly when you're packing 10 or 100-byte log entries. I wasn't able to measure a significant difference for 1000-byte objects with the benchmark I was running (that had several seconds of overhead).

@ongardie ongardie added the perf label Jun 10, 2015


This comment has been minimized.

Copy link

ongardie commented Jun 13, 2015

In the benchmarks I ran on my laptop, the above patch took packing 1MB of:

  • 1000 byte objects down from 60ms to 2ms
  • 100 byte objects down from 4s to 30ms.
  • 10 byte objects (10 bytes of command data, each entry is ~55 bytes, 19064 entries) down to 42ms (original time before patch not measured).

So this patch is definitely needed and pretty much does the job. Leaders flip like crazy with 100-byte objects before the patch and are stable after.

42ms is still a bit high for packing very small entries, so I think I'll limit each request to 5000 entries max as well.

@ongardie ongardie closed this in ec08743 Jun 15, 2015

nhardt pushed a commit to nhardt/logcabin that referenced this issue Aug 21, 2015

Improve performance of packing entries into AppendEntries requests
Before, it possible for packing up the requests to take so long as to
cause availability and performance problems. The number of calls to
calculate the ProtoBuf's size have been greatly reduced now, and the
number of entries in any given request is also limited to 5000.

Close logcabin#161: Packing entries into AppendEntries requests is slow
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment