Add option to drop replies from duplex servers #22


@jib
jib commented Mar 28, 2012

I'd like your feedback on the following code change; it addresses an issue where the duplex servers take significantly longer to respond than the relay server, which causes the proxy's throughput to drop significantly.

Please see this gist for an explanation and an Apache Benchmark session that illustrates the problem + solution:

https://gist.github.com/2222700
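In short, a plain-Ruby sketch of the effect (the names, delays, and helper functions here are illustrative stand-ins, not em-proxy's API):

```ruby
# Illustrative stand-ins (plain threads, not em-proxy's API) for a fast
# relay backend and a slow duplex backend.
def backend(name, delay)
  Thread.new { sleep(delay); name }
end

# Waiting for every backend before answering means the client always
# pays for the slowest backend...
def respond_after_all(backends)
  backends.each(&:join)
end

# ...whereas answering once the relay responds, and dropping the duplex
# reply (what this change proposes), keeps the fast path fast.
def respond_after_relay(relay, duplex)
  relay.join
  duplex.kill
end

t0 = Time.now
respond_after_all([backend(:prod, 0.01), backend(:test, 0.2)])
all_latency = Time.now - t0

t0 = Time.now
respond_after_relay(backend(:prod, 0.01), backend(:test, 0.2))
relay_latency = Time.now - t0

puts "wait-for-all: %.3fs  relay-only: %.3fs" % [all_latency, relay_latency]
```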

Now, I'm not very well versed in Ruby or EventMachine, so please let me know if this is the right way to insert this code, and if it's likely to have any (dangerous) side effects.

If it's a good solution, of course I'd appreciate it if you could pull it into master.

Thanks,

-Jos

jib added some commits Mar 28, 2012
@jib jib * chmod +x + hashbang so its executable with rvm 048bf12
@jib jib * support shadow traffic by dropping traffic from the relay servers a…
…s soon as the data is sent. this stops slow/down relay servers from holding up any other connections
adda696
@igrigorik
Owner

Hey Jos, thanks for the gist and reporting this - very helpful.

I think there is a simpler way to achieve the same effect. I just pushed a commit; take a look: e386de7

It seems to achieve what you're after. Let me know if that works for you.

@jib
jib commented Mar 28, 2012

Hi Ilya,

thanks for the quick response; I like how clean your solution is, however it doesn't ensure that the duplex server actually gets the data. If you use em-proxy on localhost with two localhost backends, the request will reach both before the relay answers on all but the most loaded machines.

However, if your proxy and relay are on localhost but your duplex is on a different staging host, the network roundtrip to the duplex can actually take longer than the localhost relay's response, meaning the connection is closed before the duplex even gets the data.

So this config will see next to no traffic on the duplex server provided the relay server responds fast enough:

Proxy.start(:host => "0.0.0.0", :port => 8080) do |conn|
  conn.server :prod, :host => "127.0.0.1", :port => 80            # production, will render response
  conn.server :test, :host => "staging.example.com", :port => 80  # testing, internal only
end

In our case this is almost always true; we respond to requests in the 1-2 ms range.

@jib
jib commented Mar 30, 2012

The above commit adds more graceful handling for the case where a duplex server is unavailable or generally slow to respond to a connect request; that case has the same underlying problem as a slow response from the duplex server.

Again, your insights would be much appreciated.

@igrigorik
Owner

Ah, hmm.. right. I think we need a more generic solution here:

(1) You have multiple backends, only one of which will be used as a response
(2) The response from other backends should not hold up the response of the main backend in any way - the data should be streamed back as soon as available, and the connection should be closed as soon as the upstream closes it.

However, while dropping the response from other upstreams is one scenario, it is also a perfectly good use case to want to aggregate all the responses to perform extra analysis, etc. That's the problem we need to solve.

P.S. Any chance you can revert the whitespace formatting changes? All green/red diffs are not great for revision history..

jib added some commits Mar 30, 2012
@jib jib * if a duplex server is down or very slow to respond, that holds up t…
…he connection. Add a timeout (default 5 seconds) if you are willing to drop the reply
46578cc
@jib jib * redo the s/relay/duplex/ 1098bf8
@jib
jib commented Mar 30, 2012

Agreed, we should have a cleaner, more generic solution here that addresses the points you raise. I'm happy to work with you on this, especially if you can point me in the right direction, but right now, I'm stuck for a better implementation than the one currently in my repo.

I've also reverted the whitespace changes as you requested (it looks like backend.rb has different line endings than the other files).

Let me know what you think we can do to proceed.

@igrigorik
Owner

Jos, going over the code and the use cases.. Few thoughts:

  1. We don't need the extra timeout flags; you can already accomplish this via existing APIs, e.g.:
  prod = conn.server :prod, :host => "127.0.0.1", :port => 8100
  test = conn.server :test, :host => "127.0.0.1", :port => 8200

  prod.pending_connect_timeout = 10
  test.pending_connect_timeout = 5
  2. Your current "drop_reply" logic within relay_to_servers assumes that the entire request will come in as a single packet + callback, which will definitely break in many cases. In other words, closing the connection needs to be controlled by the client from within the on_data callback. With that in mind, I think this could do the trick:
  prod = conn.server :prod, :host => "127.0.0.1", :port => 8100
  test = conn.server :test, :host => "127.0.0.1", :port => 8200

  conn.on_data do |data|
    # some custom logic to determine if we're good to close the connection
    if request_complete?
      # this will schedule a close on the test connection after the code below executes..
      # since we will queue up data on the socket, it will finish writing and then terminate it
      EM.next_tick { test.close_connection_after_writing }
    end

    # apply data manipulation logic, etc
    data.gsub(/User-Agent: .*?\r\n/, "User-Agent: em-proxy/0.1\r\n")
  end
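The single-packet caveat above can be illustrated in plain Ruby; request_complete? is sketched here as a hypothetical check for the HTTP end-of-headers marker, run against an accumulated buffer rather than one packet:

```ruby
# A request can arrive split across several data callbacks, so any
# "is the request done?" check has to run against an accumulated buffer,
# not against a single packet.
buffer = +""

# Hypothetical completeness check: a header-only HTTP request is complete
# once the blank-line terminator has arrived.
request_complete = ->(buf) { buf.include?("\r\n\r\n") }

packets = ["GET / HTTP/1.1\r\n", "Host: example.com\r\n", "\r\n"]
completed_at = nil

packets.each_with_index do |data, i|
  buffer << data                       # what each on_data callback sees
  completed_at ||= i if request_complete.call(buffer)
end

puts "complete after packet #{completed_at + 1} of #{packets.length}"
```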
@jib
jib commented Apr 2, 2012

Ilya,

thanks for those two snippets; I get what you're doing there and it makes perfect sense.

The one part that had me wondering, though, is why not just use the 'on_finish' callback instead, like below, so there'd be no need to write custom 'request_complete?' logic:

  conn.on_data do |data|
    data
  end

  conn.on_finish do |backend, name|
    if name == :prod then
      EM.next_tick { test.close_connection_after_writing }
    end
  end

Testing the code, it seems to work well, but running a benchmark against it shows it's about 15% slower than your code, which surprised me. The only culprit I could think of is that two callbacks now need to be called instead of one.

Doing it in 'on_response' also seems to have the desired effect, though, and performs comparably to your code despite also having two callbacks:

  conn.on_data do |data|
    data
  end

  conn.on_response do |name, resp|
    if name == :prod then
      EM.next_tick { test.close_connection_after_writing }
    end
  end

Did you already think of this and is that why you moved it all to on_data? Is there a bit of logic I'm missing to make this a viable multi-packet solution?

@igrigorik
Owner

on_finish is called when the downstream closes the connection: https://github.com/igrigorik/em-proxy/blob/master/lib/em-proxy/connection.rb#L88. Which means that you would proxy the request, let it do the work, stream it back and then discard the response.. As per your use case, if you don't care for the results of the response at all, then there is no reason to even listen for the response. Similar logic applies to on_response.

What you're doing also makes sense, but it could be a different use case.. Let's say I have two backends, and all I want is just to serve the response that comes back the fastest. In this case, I'd use the on_response, stream that data back to client, and then close the connection to the other backend.

In other words, all of these are fairly similar, but subtly different. :)
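The "fastest backend wins" flow can be sketched outside of EventMachine with plain threads (names, bodies, and delays are illustrative; kill stands in for closing the connection to the losing backend):

```ruby
# Two simulated backends race; whichever answers first is served to the
# client, and the connection to the slower one is then closed.
results = Queue.new
threads = {
  prod: Thread.new { sleep 0.15; results << [:prod, "prod body"] },
  test: Thread.new { sleep 0.05; results << [:test, "test body"] },
}

winner, body = results.pop            # first response wins the race
loser = (threads.keys - [winner]).first
threads[loser].kill                   # drop the slower backend

puts "served #{body.inspect} from #{winner}"
```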

@jib
jib commented Apr 3, 2012

Ilya,

thanks for explaining; I think I have enough to go on to do some production tests. Either way, I think my more invasive changes aren't necessary at this point, and I can accomplish what I need through the API.

I still think this might be a useful addition to the em-proxy command (or a separate command), and I'm happy to offer up the patch. Let me know if that's something you're interested in.

Thanks again for your help here, and I'll be sure to report back if I come across any other challenges.

@jib jib closed this Apr 3, 2012
@miyagawa

Guess I'm in the same situation as @jib and want to do exactly this:

However, while dropping the response from other upstreams is one scenario, it is also a perfectly good use case to want to aggregate all the responses to perform extra analysis, etc. That's the problem we need to solve.

i.e. return the fastest data to the client, close client connection, wait for all the backends to respond and compare them for analysis.

The :close flag in on_finish looks helpful, but it also drops the connections to the backends that haven't returned data yet. Any code sample to accomplish this?

@miyagawa

Looks like the issue here is that Connection#unbind automatically closes the servers as well; I was apparently able to work around it with the following code:

conn.on_finish do |backend|
  if all_servers_finished?
    unbind
  end
  :close if backend == :prod   
end

def conn.close_connection(*args)
  @server_side_close = true
  super
end

def conn.unbind
  super if !@server_side_close or all_servers_finished?
end

def conn.all_servers_finished?
  @servers.values.compact.size.zero?
end

This way we call unbind twice: first when :prod closes the connection, so that the client isn't blocked, and second when all servers have returned a response, in which case we call close_connection on any active servers (which should usually be empty).

This seems to do what I want - do you see any issues with this code flow?

@igrigorik
Owner

@miyagawa seems to make sense to me!

@miyagawa

Very cool - just a note that I updated the code so that unbind calls super when a client disconnects, by overriding close_connection per http://eventmachine.rubyforge.org/EventMachine/Connection.html#unbind-instance_method; otherwise this would wait forever when a client drops before sending an HTTP request :)
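The guard being described can be sketched without EventMachine (FakeConnection and its methods are stand-ins, not em-proxy's classes): close_connection marks the close as server-initiated, so unbind only finalizes once all backends are done, while a bare unbind from a client disconnect finalizes immediately.

```ruby
# Minimal stand-ins (not em-proxy's classes) for the double-unbind flow:
# close_connection flags the close as server-initiated, so unbind can
# tell it apart from a genuine client disconnect.
class FakeConnection
  attr_reader :finalized

  def initialize(servers)
    @servers = servers          # name => connection (nil once finished)
    @server_side_close = false
    @finalized = false          # stands in for the real teardown
  end

  # Server-side close: flag it, then let unbind decide what to do.
  def close_connection
    @server_side_close = true
    unbind
  end

  # Client disconnects arrive as a bare unbind with no flag set, so they
  # finalize immediately; server-side closes wait for all backends.
  def unbind
    @finalized = true if !@server_side_close || all_servers_finished?
  end

  def server_finished(name)
    @servers[name] = nil
    unbind
  end

  def all_servers_finished?
    @servers.values.compact.empty?
  end
end

conn = FakeConnection.new(prod: :up, test: :up)
conn.close_connection            # :prod answered; client is released...
early = conn.finalized           # ...but teardown waits for :test
conn.server_finished(:prod)
conn.server_finished(:test)      # last backend done: safe to finalize

dropped = FakeConnection.new(prod: :up, test: :up)
dropped.unbind                   # client went away before any response

puts "early=#{early} final=#{conn.finalized} drop=#{dropped.finalized}"
```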
