Concurrent get_multi #295

Merged
merged 15 commits on Dec 5, 2012

Contributor

tmm1 commented Nov 28, 2012

WIP re-implementation of Dalli::Client#get_multi

Ditches the serialized blocking reads in favor of a non-blocking IO.select + read() loop. Responses from servers are drained into buffers as fast as possible, and then parsed into key/value pairs.

/cc @brianmario @rtomayko github/github#6208
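
In rough terms, the read path looks like this (a condensed sketch rather than the exact diff; the 8KB chunk size is illustrative and the buffer parsing is elided):

buffers = Hash.new { |h, k| h[k] = '' }

while servers_in_use.any?
  timeout = servers_in_use.first.options[:socket_timeout]
  readable, _ = IO.select(servers_in_use.map(&:sock), nil, nil, timeout)
  break if readable.nil?  # select() timed out; mark the stragglers as failed and bail

  readable.each do |sock|
    buffers[sock.server] << sock.read_nonblock(8 * 1024)  # drain without blocking
    # ...parse complete responses out of the buffer, and drop the server from
    # servers_in_use once its terminating NOOP response shows up...
  end
end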

Collaborator

mperham commented Nov 28, 2012

Looks very promising. Got any benchmark numbers?

tmm1 and 1 other commented on an outdated diff Nov 28, 2012

lib/dalli/client.rb
+
+ buffers = Hash.new{ |h,k| h[k] = '' }
+
+ while servers_in_use.any?
+ timeout = servers_in_use.first.options[:socket_timeout]
+ servers_in_use.delete_if{ |s| s.sock.nil? }
+ sockets = servers_in_use.map(&:sock)
+ readable, _ = IO.select(sockets, nil, nil, timeout)
+
+ if readable.nil?
+ # mark all pending servers as failed
+ servers_in_use.each do |server|
+ begin
+ server.send(:failure!)
+ rescue NetworkError => e
+ end

tmm1 Nov 28, 2012

Contributor

failure! is a private method and it raises a Dalli::NetworkError, so I have to use send() and rescue anytime there's an IO error.

rtomayko Nov 28, 2012

Yeah, seems like a new method on Server designed for this would be ideal.

tmm1 Nov 28, 2012

Contributor

Agreed. A lot of this can move back into Server and each connection can maintain its own buffer. The API exposed could be something like:

# initialize buffer, send NOOP
Server#multi_response_start

# read into buffer and parse incrementally. return all values on success and cleanup buffer
Server#multi_response_nonblock # => :again
Server#multi_response_nonblock # => :again
Server#multi_response_nonblock # => {'key' => 'value'}

# assume IO timeout and mark server as failed/down, clean up buffer
Server#multi_response_abort
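
A client-side driver loop over that API could then look roughly like this (a sketch only; servers, values, and the timeout handling are assumptions, not taken from the diff):

servers.each(&:multi_response_start)   # initialize buffers and send the trailing NOOPs

values  = {}
pending = servers.dup
while pending.any?
  timeout = pending.first.options[:socket_timeout]
  readable, _ = IO.select(pending.map(&:sock), nil, nil, timeout)

  if readable.nil?
    pending.each(&:multi_response_abort)  # timed out: mark servers down, drop buffers
    break
  end

  readable.each do |sock|
    result = sock.server.multi_response_nonblock
    next if result == :again              # partial response; keep buffering
    values.merge!(result)                 # finished: fold in its key/value pairs
    pending.delete(sock.server)
  end
end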

tmm1 commented on an outdated diff Nov 28, 2012

lib/dalli/client.rb
+ readable, _ = IO.select(sockets, nil, nil, timeout)
+
+ if readable.nil?
+ # mark all pending servers as failed
+ servers_in_use.each do |server|
+ begin
+ server.send(:failure!)
+ rescue NetworkError => e
+ end
+ end
+
+ break
+
+ else
+ readable.each do |sock|
+ server = sock.server

tmm1 Nov 28, 2012

Contributor

Server sets @inprogress = true internally when doing IO; we should probably be setting it here as well.
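
(Hypothetically, something like this around the non-blocking read, going through instance_variable_set since that flag is internal Server state; whether to clear it after each read is an open question:)

server.instance_variable_set(:@inprogress, true)   # hypothetical: mirror Server's own IO bookkeeping
buffers[server] << sock.read_nonblock(8 * 1024)
server.instance_variable_set(:@inprogress, false)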

rtomayko commented on an outdated diff Nov 28, 2012

lib/dalli/client.rb
+ # all done!
+ buffers.delete(server)
+ servers_in_use.delete(server)
+ break
+
+ elsif buf.bytesize >= (24 + body_length)
+ buf.slice!(0, 24)
+ flags = buf.slice!(0, 4).unpack('N')[0]
+ key = buf.slice!(0, key_length)
+ value = buf.slice!(0, body_length - key_length - 4) if body_length - key_length - 4 > 0
+
+ begin
+ values[key_without_namespace(key)] = server.send(:deserialize, value, flags)
+ rescue DalliError => e
+ Dalli.logger.debug { e.inspect }
+ Dalli.logger.debug { "results from this server will be missing" }

rtomayko Nov 28, 2012

I'd pull this buffer parsing out into a separate method. It'd cut this method down a lot.
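
For illustration, the extraction might look something like this (a sketch only; the offsets follow the 24-byte memcached binary response header, and the DalliError handling from the diff is omitted):

def parse_multi_buffer(server, buf, values)
  while buf.bytesize >= 24
    key_length  = buf[2, 2].unpack('n')[0]   # big-endian uint16 at offset 2
    body_length = buf[8, 4].unpack('N')[0]   # big-endian uint32 at offset 8

    return :done  if key_length == 0                   # terminating NOOP: this server is finished
    return :again if buf.bytesize < 24 + body_length   # need more data from select()

    buf.slice!(0, 24)
    flags = buf.slice!(0, 4).unpack('N')[0]
    key   = buf.slice!(0, key_length)
    value = buf.slice!(0, body_length - key_length - 4) if body_length - key_length - 4 > 0

    values[key_without_namespace(key)] = server.send(:deserialize, value, flags)
  end
  :again
end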

Yeah, so has this run in github.com production yet? Can you deploy these versions on separate frontends and graph the slowness we're seeing with heavy get_multi?

Contributor

tmm1 commented Nov 28, 2012

Refactored the code based on earlier discussion. Pretty happy with the diff now.

I ran some benchmarks locally using TEST=get-multi CLIENT=dalli ruby test/profile/benchmark.rb from memcached.gem. There was no significant difference, which is to be expected: the select() loop only helps hide network latency, and against a local memcached there is essentially none to hide.

Collaborator

mperham commented Nov 28, 2012

@tmm1 Remove the "WIP" description when this is ready to go. I would love to see before/after graph/numbers in production if you have time.

Contributor

tmm1 commented Nov 29, 2012

Here are some benchmarks from our production network:

4 x 100 requests to https://github.com/github/github/pull/6208 as tmm1

## memcached.gem

          cache:  20,611 / 9,251ms total         (92ms avg/req,  58ms - 276ms)
          cache:  20,607 / 8,602ms total         (86ms avg/req,  58ms - 270ms)
          cache:  20,600 / 7,690ms total         (76ms avg/req,  57ms - 272ms)
          cache:  20,607 / 7,926ms total         (79ms avg/req,  57ms - 271ms)

## dalli 2.5.0

          cache:  20,613 / 11,230ms total        (112ms avg/req,  90ms - 219ms)
          cache:  20,630 / 13,353ms total        (133ms avg/req,  89ms - 410ms)
          cache:  20,627 / 12,534ms total        (125ms avg/req,  88ms - 317ms)
          cache:  20,624 / 11,681ms total        (116ms avg/req,  91ms - 412ms)

## dalli 2.5.0 + mperham/dalli#295

          cache:  20,609 / 10,962ms total        (109ms avg/req,  78ms - 243ms)
          cache:  20,609 / 10,239ms total        (102ms avg/req,  78ms - 283ms)
          cache:  20,622 / 11,066ms total        (110ms avg/req,  80ms - 302ms)
          cache:  20,610 / 9,830ms total         (98ms avg/req,  78ms - 293ms)

Each line accounts for time spent in cache calls over 100 requests, where each request performed ~206 cache operations.


With this pull request, both min and max times are improved as network latency becomes less of an issue: 78ms - 243ms vs 91ms - 412ms.

Compared to memcached.gem, dalli is still slower. In the best case, there is an additional 0.1ms of overhead per operation: (78ms - 57ms) / 206 ops = 0.10ms per op. This lines up with benchmarks in memcached.gem:

                                           user     system      total        real
set: dalli:bin                         5.550000   3.190000   8.740000 ( 13.469940)
set: libm:ascii                        0.840000   1.900000   2.740000 (  6.663731)

Here, set operations also show a 0.1ms overhead per operation: (13.4s - 6.6s) / 50000 ops * 1000ms = 0.13ms per op.

0.1ms is not much, but it adds up when making several hundred cache calls. There might be additional CPU performance improvements possible in dalli, but it's likely that memcached.gem's C implementation is giving it the edge here.

rtomayko commented Dec 1, 2012

Great stuff.

Compared to memcached.gem, dalli is still slower. In the best case, there is an additional 0.1ms of overhead per operation ((78ms - 57ms) / 206 ops = 0.10ms)

Not worried about that much personally.

Collaborator

mperham commented Dec 3, 2012

@tmm1 Anything still pending here?

Contributor

tmm1 commented Dec 5, 2012

I'm happy with this diff if you want to merge. I haven't done any testing in threaded mode, but I'm pretty sure it will work as expected.

mperham merged commit 0f3ad9e into petergoldstein:master Dec 5, 2012

1 check failed: default (The Travis build failed)
Collaborator

mperham commented Dec 5, 2012

Seeing this test failure on master now:

  1) Failure:
test_0006_support multi-get(Dalli::using a live server) [/Users/mperham/src/dalli/test/test_dalli.rb:197]:
Expected: 10000
  Actual: 8801
Collaborator

mperham commented Dec 5, 2012

This is a single-threaded test and it's racy: sometimes it works. In the case of failure, the socket timeout hits and multi_response_abort is triggered.
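
For context, the failing test is roughly this shape (an illustrative reconstruction, not the exact test code; dc stands in for the test's Dalli::Client):

pairs = {}
10_000.times { |i| pairs["key#{i}"] = "value#{i}" }
pairs.each { |k, v| dc.set(k, v) }

result = dc.get_multi(pairs.keys)
assert_equal 10_000, result.size  # comes back short (e.g. 8801) when the socket
                                  # timeout fires and multi_response_abort runs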

Contributor

tmm1 commented Dec 5, 2012

Does increasing the socket_timeout help?
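
(For reference, that's the per-client :socket_timeout option the select() loop reads; a hypothetical way to raise it for the test:)

# host/port and the 2.0 value are illustrative
dc = Dalli::Client.new('localhost:11211', :socket_timeout => 2.0)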

Collaborator

mperham commented Dec 5, 2012

It didn't, but I reduced the test's get_multi key count to 1,000 and can't reproduce it anymore, so I'll ship it as is.
