Provide a configuration option for buffering responses in order to better cope with slow clients #519

Closed
FooBarWidget opened this issue May 29, 2014 · 18 comments

Comments

@FooBarWidget
Member

From dustym on November 20, 2009 02:10:45

What steps will reproduce the problem?
1. Have only seen this in production.

What is the expected output? What do you see instead?
Passenger freezes.

What version of Phusion Passenger are you using? Which version of Rails? On what operating system?
Passenger 2.2.5
Rails 2.2.2

Please provide any additional information below.
We are running 64 bit Debian Etch. Each box has 8 gigs of ram, 4 of which
are dedicated to passenger and apache. Here is our passenger config:

PassengerRoot /usr/local/lib/ruby/gems/1.8/gems/passenger-2.2.5
PassengerRuby /usr/local/bin/ruby
PassengerUseGlobalQueue on
PassengerUserSwitching off
PassengerDefaultUser www-data
PassengerPoolIdleTime 0
RailsFrameworkSpawnerIdleTime 0
RailsAppSpawnerIdleTime 0
RailsSpawnMethod smart
PassengerMaxPoolSize 10

PassengerLogLevel 3

We are running REE 1.8.6:
ruby 1.8.6 (2008-08-11 patchlevel 287) [x86_64-linux]
Ruby Enterprise Edition 20090610

Our server header:
Server: Apache/2.2.3 (Debian) Phusion_Passenger/2.2.5

When we first put passenger into production, a couple of times a day we
would see requests back up in the global queue to the point that apache
would stop accepting connections due to MaxClients. This was with a timeout
of 300s in apache. We dug around the issue tracker and found this bug ( https://code.google.com/p/phusion-passenger/issues/detail?id=318 ) which
suggested disabling the global queue. We also lowered the apache timeout.
After that we would periodically see a backup at the handler level that
would last for a brief period of time and would only affect a couple of
handlers, but the other handlers would continue to process requests
quickly. Overall the system seems more stable with a 30s timeout and we've
since switched back to global queueing. But we are wary of what is
going on: Apache might just be "unhanging" Passenger when it's frozen by
dropping the connection. Indeed when passenger does occasionally completely
hang, we'll see a burst of these errors:


[ pid=16687 file=ext/apache2/Hooks.cpp:654 time=2009-11-10 11:36:25.519 ]:
Either the vistor clicked on the 'Stop' button in the web browser, or the
visitor's connection has stalled and couldn't receive the data that Apache
is sending to it. As a result, you will probably see a 'Broken Pipe' error
in this log file. Please ignore it, this is normal. You might also want to
increase Apache's TimeOut configuration option if you experience this
problem often.

Generally the box never swaps, and we kill rails handlers that go above a
certain memory threshold. Load is usually 1.5 to 5 on a quad core box.

At some point I found this support request for New Relic's RPM plugin: https://newrelic.tenderapp.com/discussions/support/1204-timeout-issues-should-switch-to-systemtimer-timeout-library The issue explained in that support request is pretty much identical to
ours. We also had the new relic plugin installed, but we've since
completely removed it and the problem persists.

When I send a kill -3 to the passenger processes during a freeze I'll get a
dump like that in the attached file kill_3.txt.

I installed Aman Gupta's gdb.rb ( http://github.com/tmm1/gdb.rb ) and
modified it to connect to each passenger related process (including
spawners) and output each thread's backtrace. This can be found in the
attached file gdb_rb.txt. I don't understand what's going on completely in
those backtraces, but I believe the first 3 sets of backtraces (split by
the lines DEBUGGING ) are the Passenger spawn server, Passenger
FrameworkSpawner: 2.2.2 and the Passenger ApplicationSpawner. The rest are
Rails handlers, seven of which are in this state:

node_fcall select in
/usr/local/lib/ruby/gems/1.8/gems/passenger-2.2.5/lib/phusion_passenger/abstract_request_handler.rb:367

The other three (pids 29327,29368,29247) are in this state:

node_call write in
/usr/local/lib/ruby/gems/1.8/gems/actionpack-2.2.2/lib/action_controller/cgi_process.rb:176

When I took this snapshot, the global queue had something like 450 backed
up connections and all of the processes in passenger-status reported to
be working on 1 session.

Based on Ludwig's notes in the new relic bug report above I'm leaning
toward this being a blocking IO issue, but I can't be sure.
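
The blocking-IO hypothesis can be illustrated outside Passenger. The following Python sketch is only an illustration under assumptions: a local socket pair stands in for a connection to a stalled client, and the function name is hypothetical. It counts how many bytes a writer can hand to the kernel before a further write would block because the peer never reads.

```python
import socket


def bytes_accepted_before_blocking(chunk_size=4096, limit=64 * 1024 * 1024):
    """Write to a peer that never reads; return bytes accepted before a write would block."""
    writer, stalled_reader = socket.socketpair()
    writer.setblocking(False)  # report "would block" as an exception instead of hanging
    chunk = b"x" * chunk_size
    sent = 0
    try:
        while sent < limit:
            try:
                sent += writer.send(chunk)
            except BlockingIOError:
                break  # kernel buffers are full; a blocking write would stall here
    finally:
        writer.close()
        stalled_reader.close()
    return sent


if __name__ == "__main__":
    print(bytes_accepted_before_blocking(), "bytes accepted before the write would block")
```

Once that point is reached, a process using ordinary blocking writes sits in `write` until the peer reads or the connection is dropped, which is consistent with the three handlers stuck in `write` above.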

If you guys need any other instrumentation or experiments done I can
somewhat consistently recreate the issue by upping the Apache Timeout value
to 300s on one of our cluster nodes during peak traffic. At some point
passenger will freeze.

Thanks for the help.

Attachment: kill_3.txt gdb_rb.txt

Original issue: http://code.google.com/p/phusion-passenger/issues/detail?id=419

@FooBarWidget
Member Author

From honglilai on November 22, 2009 01:55:19

It looks like Apache is stuck while writing the response back to the HTTP client.
There might be a bunch of slow or frozen HTTP clients who are keeping the connections
open but aren't actually receiving data.

What's the value of your TimeOut config option? I think you need to decrease it, not
increase it.

@FooBarWidget
Member Author

From dustym on November 22, 2009 12:04:37

Thanks for the response Hongli,

TimeOut is 30. I think that's pretty reasonable. I only increase it to recreate the
issue faster. 30 seconds makes the system stable, but it's just a hack. Passenger
still freezes and backs up, but the timeout keeps it from being overwhelming.

Which is to say a low TimeOut isn't an adequate fix as it doesn't address the real
problem.

Do you have any recommendations on how I might debug this issue further? I forgot to
say we do get TimeOut errors like this very often:

[ pid=12365 file=ext/apache2/Hooks.cpp:654 time=2009-11-22 15:01:58.440 ]:
Either the vistor clicked on the 'Stop' button in the web browser, or the visitor's
connection has stalled and couldn't receive the data that Apache is sending to it. As
a result, you will probably see a 'Broken Pipe' error in this log file. Please ignore
it, this is normal. You might also want to increase Apache's TimeOut configuration
option if you experience this problem often.

We've traced that to code/interaction within our app where in certain situations the
browser is redirected using JavaScript before the whole response can be written back
to the user. This was never an issue with mongrel/haproxy, so I'm curious whether
this is something Passenger would have a problem with.

@FooBarWidget
Member Author

From honglilai on November 23, 2009 14:28:54

Well looking at your backtraces it's definitely slow HTTP clients that are causing
the problem. Maybe this can be solved by having the web server buffer the response.
At this time Phusion Passenger doesn't buffer the response in order to make streaming
responses work, but you can enable it by commenting out one line, as follows:

  1. Go to the Phusion Passenger source directory.

  2. Edit ext/apache2/Bucket.cpp

  3. Change:

         return APR_EAGAIN;

     into:

         // return APR_EAGAIN;

  4. Save the file and type 'rake apache2' as root.

  5. Restart Apache.

Check whether this helps.
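
Why buffering might help can be sketched with a toy model. This is not Passenger's implementation: the function and parameter names are hypothetical, and the model assumes an application process stays busy until its last byte has been accepted downstream.

```python
def worker_busy_seconds(response_bytes, generate_seconds, client_bytes_per_second, buffered):
    """Estimate how long one application process is tied up by a single request.

    Toy model: without buffering the process must wait until the client has
    drained the response; with buffering the web server absorbs the response
    at once and the process is free as soon as generation finishes.
    """
    if buffered:
        return generate_seconds
    drain_seconds = response_bytes / client_bytes_per_second
    return max(generate_seconds, drain_seconds)


if __name__ == "__main__":
    # A 100 kB page generated in 0.1 s and sent to a client reading 1 kB/s.
    print(worker_busy_seconds(100_000, 0.1, 1_000, buffered=False))  # 100.0
    print(worker_busy_seconds(100_000, 0.1, 1_000, buffered=True))   # 0.1
```

Under this model the cost of buffering is that the whole response is held by the web server, which is why it sits poorly with large or streaming responses.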

@FooBarWidget
Member Author

From travisdbell on November 24, 2009 15:37:45

We are having the EXACT same problem.

I'll give the source change a try and report back.

@FooBarWidget
Member Author

From andy.paul.007 on November 30, 2009 08:23:16

I've been getting a lot of these error messages as well. I'm not sure if they're
related or not, but sooner or later, I start getting "cannot fork process" errors and
the server needs a reboot altogether :S

I'm going to try commenting out the code line and see how that goes, although I'm not
really hopeful. Something seems to be driving resource consumption up, I just don't know
what it is ...

When you guys say "Passenger hangs", does your web app stay available or does the
server hang? Just wondering if the issues I'm having are similar to yours ...

thx

cheers

@FooBarWidget
Member Author

From honglilai on November 30, 2009 12:41:16

andy.paul.007, your issue is unrelated, but if you get "cannot fork process" then
that means your server doesn't have enough memory. You'll need to expand your RAM or
increase your swap space.

@FooBarWidget
Member Author

From dustym on November 30, 2009 13:14:33

Just to give you guys an update, I patched passenger on one of our web servers in
our cluster and brought it online with the increased Apache timeout (reminder: I
only use the longer timeout to test volatility, by default we always use a shorter
timeout). The system was stable for a greater part of the day under peak traffic, but I
still observed some of the rails instances stop serving requests until, ostensibly, the
connection was timed out by apache. For example, after enabling the long timeout 9
out of 10 of our Rails processes had served ~650 requests each in 4 minutes, while 1
of them had only served 46 requests and was stuck on one request. After Apache
timed out that request, it began serving requests at a normal rate.

This is fine and we plan on rolling passenger across our cluster soon, as we've
deemed it stable enough. I am worried about the nature of the blocking though.

Hongli, is it possible there could be an issue in the interaction somewhere between
forked Rails processes, Passenger and Apache that could cause a long blocking
read/write if a client aborts a connection very early, before the response is fully
delivered? We know that we get a lot of the "Either the vistor clicked on the 'Stop'"
errors because we have situations where we issue a JavaScript location.href =
'/some/url' before the content is completely sent back to the client. We've done a
great deal of work implementing timeouts on potentially long running actions and we
are pretty confident in the performance of our app, but of course we can't rule out
our application as the cause in this scenario.

Additionally, do you think it would be possible to provide a buffering enable/disable
configuration parameter at some point? We never send very large responses back, so
turning that on is not a problem for us. Dunno if there is sufficient demand for such
a thing, but it would be helpful for us (and maybe others here).

Thanks again for the time and help.

@FooBarWidget
Member Author

From honglilai on November 30, 2009 13:25:22

Javascript redirection: in that case the 'Stop' warning is completely legit and you
should ignore the warnings, because that's what's actually happening: the browser
aborted the connection before everything is received.

Buffering: yes providing an option for buffering would be a good idea. I've marked
this issue as such.

Summary: Provide a configuration option for buffering responses in order to better cope with slow clients
Labels: -Priority-Medium Priority-High

@FooBarWidget
Member Author

From honglilai on November 15, 2010 05:27:31

The Nginx version already provides such an option. Next up is Apache.

Labels: Milestone-3.0.2

@FooBarWidget
Member Author

From honglilai on November 15, 2010 05:28:25

It should be noted that buffering responses conflicts with apps that try to stream large responses. At least without some very intelligent form of buffering such as implemented by mod_accel. This fact should be documented.

@FooBarWidget
Member Author

From honglilai on November 15, 2010 06:10:05

Issue 396 has been merged into this issue.

@FooBarWidget
Member Author

From Martin.Kammerlander on November 24, 2010 08:33:44

We had the same issue:


[ pid=16687 file=ext/apache2/Hooks.cpp:654 time=2009-11-10 11:36:25.519 ]:
Either the vistor clicked on the 'Stop' button in the web browser, or the
visitor's connection has stalled and couldn't receive the data that Apache
is sending to it. As a result, you will probably see a 'Broken Pipe' error
in this log file. Please ignore it, this is normal. You might also want to
increase Apache's TimeOut configuration option if you experience this
problem often.

The Ajax request did not load and Rails/Passenger hung.

We now think (actually we're pretty sure) that in our case this is a jQuery/JavaScript issue that shows up when the Ajax response takes a long time to arrive. It seems that the jQuery timeout is set to a very low value of 1 second.

We solved this by adding a global setting, including this after the jQuery include:

/* Global configuration for Ajax requests */
jQuery.ajaxSetup({
  timeout: 30000 // milliseconds
});

This sets the timeout to 30 seconds and all works fine again. No more errors or warnings, and the jQuery Ajax request is no longer aborted.

best
Martin

@FooBarWidget
Member Author

From ryoqun on November 11, 2011 01:32:59

Hi, I faced this issue on my deployed server.

After applying the patch from comment #3, the slow-client problem was fixed. Thank you for the info!

So I thought it would be a good idea to make this configurable.

Here is the patch: ryoqun@247a719 I'll also submit a pull request shortly.

Can anybody review this patch?

regards,

@FooBarWidget
Member Author

From ryoqun on November 11, 2011 01:40:12

Here is the pull request: #30

@FooBarWidget
Member Author

From ryoqun on November 15, 2011 20:15:52

Since then, I have improved my patch.

The improvements are:

  1. PassengerBufferResponse can now be configured in any context (previously only in the global server configuration).
  2. Added documentation for PassengerBufferResponse (mostly copied from the Nginx counterpart directive; as an exception, I modified the first paragraph and added a note mentioning the lack of a file-backed buffering fallback mechanism for large responses).
  3. Explicitly mention and set the default value of PassengerBufferResponse to On.
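
Based on the description above, usage might look like the following Apache fragment. This is a sketch, not taken from the patch: the host name, document root, and the `/reports/export` location are hypothetical, and only the `PassengerBufferResponse` directive name comes from this thread.

```apache
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example/public

    # Buffer responses so slow clients do not tie up application processes.
    PassengerBufferResponse on

    # Hypothetical streaming endpoint: turn buffering off so output is
    # forwarded to the client as it is produced.
    <Location /reports/export>
        PassengerBufferResponse off
    </Location>
</VirtualHost>
```

Setting the directive explicitly in both places avoids relying on whichever default a given Passenger version ships with.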

@FooBarWidget
Member Author

From ryoqun on November 15, 2011 20:22:12

I updated the description in the pull request as well.

I forgot to mention that this patch changes Passenger's default behavior from NOT buffering responses to buffering them.

@FooBarWidget
Member Author

From honglilai on November 23, 2011 09:01:18

Labels: -Milestone-3.0.2 Milestone-3.0.10

@FooBarWidget
Member Author

From honglilai on November 26, 2011 00:31:42

Implemented in commit 247a719.

Status: Fixed
