Passenger 6 runaway cpu issue with crystal http server #2144

Closed
anamba opened this Issue Dec 3, 2018 · 21 comments

anamba commented Dec 3, 2018

I've been using Passenger for about as long as it has existed. Until now, I've only used it with Ruby, but now I'm getting into Crystal, and I was excited to see that I might be able to use Passenger 6 with Crystal.

And it turns out it's pretty great, on Xenial at least! I am very pleased with Passenger GLS there.

However, I am encountering a runaway CPU issue on Bionic.

Issue report

Question 1: What is the problem?

  • What is the expected behavior?
    This should run my Crystal app via Passenger GLS: `passenger start --app-start-command='env PORT=$PORT ./mycrystalprog'`

  • What is the actual behavior?
    Runaway CPU usage on PassengerAgent and ./mycrystalprog. Interestingly, while this is happening, you can still curl the Crystal app on the port Passenger assigns, and it responds to HTTP requests normally (albeit a little slower than usual due to the competition for CPU cycles).

  • How can we reproduce it?

Your answer:

It's a little hard to explain if you're unfamiliar with Crystal, and neither the Crystal app nor Passenger ever outputs any kind of error (even at log-level=7), so I put together a demo with a Docker config:
https://github.com/anamba/docker-passenger-issue
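
For reference, here is a minimal sketch of the kind of server the demo runs (illustrative only; the handler body and exact structure are my assumptions, not the repo's code). The relevant part is that it reads the port Passenger assigns via $PORT and binds 127.0.0.1 on it:

```crystal
# Minimal sketch of a Crystal HTTP server started via Passenger GLS.
# Illustrative only; the actual demo app may differ.
require "http/server"

server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "hello from crystal"
end

# Passenger passes the port to bind in the PORT environment variable
# (see the --app-start-command above).
port = ENV["PORT"].to_i
server.bind_tcp "127.0.0.1", port
server.listen
```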

Question 2: Passenger version and integration mode:

Your answer: open source 6.0.0 standalone

Question 3: OS or Linux distro, platform (including version):

Your answer: Ubuntu 18.04 LTS

Have repro'd the issue in Docker and on two different Linode servers (one fresh install and one upgraded from Xenial).

Question 4: Passenger installation method:

Your answer:
[ ] RubyGems + Gemfile
[ ] RubyGems, no Gemfile
[X] Phusion APT repo
[ ] Phusion YUM repo
[ ] OS X Homebrew
[ ] source tarball
[ ] Other, please specify:

Question 5: Your app's programming language (including any version managers) and framework (including versions):

Your answer: Crystal 0.27.0

Question 6: Are you using a PaaS and/or containerization? If so which one?

  • For example: Heroku, Amazon Container Services, Docker 1.9 with an image based on passenger-docker

Your answer: repro'd on Docker for Mac and Linode (KVM)

CamJN (Contributor) commented Dec 3, 2018

So this is interesting: I actually don't see this happen when running your example in Docker for Mac. Would you mind providing the version of Docker for Mac you are using, along with the settings for CPU count, memory, swap, disk image type (raw/qcow2), and any other non-default settings?

anamba commented Dec 3, 2018

!!! There have been several plot twists in my investigation of this issue, but that one was the most shocking. I will experiment a little bit more and report back...

anamba commented Dec 3, 2018

I uploaded an image to Docker Hub and tested it on my two machines in various configurations: https://hub.docker.com/r/anamba/passenger-issue/

So far, no Docker settings made any difference, but it does look like the issue is not present when running the image via Docker for Mac 2.0.0.0, only in the version before that one (sorry, somehow I failed to grab a screenshot before upgrading, but it was the last one with the YY.MM.p version format).

anamba commented Dec 4, 2018

I tried different kernel versions and that didn't seem to have much of an effect.

A new and VERY STRANGE observation:

  1. Turns out, the runaway CPU usage does not go on forever. I waited it out (~3 minutes?) and it eventually stopped, and the app was served normally after that.
  2. Subsequent app starts, either via touch tmp/restart.txt or by stopping and starting Passenger, have a varying delay, from as little as a second or two up to the original ~3 minutes. ??!?

Have not yet figured out any pattern to the duration of the runaway cpu usage.

However, after noticing that the runaway cpu doesn't happen the same way every time, I tried to see if I could get it to happen in Docker 2.0.0.0 (using the hub image I mentioned in the previous comment) by running passenger start repeatedly. Sure enough, after about 9 runs, I got this:

top - 01:12:56 up  1:59,  0 users,  load average: 0.41, 0.12, 0.03
Tasks:  18 total,   1 running,  17 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.4 us, 50.3 sy,  0.0 ni, 47.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem :  4042388 total,  2111400 free,   340956 used,  1590032 buff/cache
KiB Swap:   524284 total,   524284 free,        0 used.  3452108 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1072 app       20   0 1165836  19172  15536 S 100.3  0.5   0:28.45 PassengerAgent
 1113 app       20   0  143376   8616   4904 S   4.7  0.2   0:01.43 httpserver

(CPU usage pattern is different in this example because I gave Docker 2 cores on this machine. So perhaps the high CPU issue is mainly with PassengerAgent itself.)

Eight runs later, the same thing. So it may be possible for you to reproduce the issue after all, just not as often.

anamba commented Dec 4, 2018

I also managed to trigger the issue on the Xenial server that I thought was working, just by touching restart.txt a few times. So the picture is becoming a little clearer now.

anamba changed the title from "Passenger 6 runaway cpu issue with bionic + crystal" to "Passenger 6 runaway cpu issue with crystal http server" on Dec 4, 2018

anamba commented Dec 4, 2018

Last observation for today, then I have to get back to work.

It felt like I encountered this issue more frequently with the --release flag, which increases compile time, but produces a more optimized binary.

Based on that hunch, I added a sleep 1 before server.listen, to avoid binding the port assigned by Passenger right away. So far, that appears to have solved the issue 100%!

So my wild guess here is that there is some kind of race condition... my Crystal app starting up too quickly 😁 and binding the port before Passenger is ready.
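
For context, a sketch of the sleep-before-listen workaround described above (assuming the bind_tcp/listen style of startup; the exact demo code may differ):

```crystal
# Workaround sketch: delay startup so the port Passenger assigns via $PORT
# is not bound immediately. Illustrative only.
require "http/server"

server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "hello from crystal"
end

sleep 1 # give Passenger a moment before binding and listening

server.bind_tcp "127.0.0.1", ENV["PORT"].to_i
server.listen
```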

anamba commented Dec 5, 2018

Sigh... forget what I said about the workaround solving the issue 100% of the time. I still can't come up with any kind of pattern that will work on ALL the environments that are exhibiting this issue (which is to say, every single environment I've tried): one Xenial Linode, two Bionic Linodes (one fresh and one upgraded from Xenial), Docker on my desktop, Docker on my laptop. They all exhibit the issue sometimes (but not always), and sometimes I can wait it out, from a few seconds up to ~3 min (but not always).

anamba commented Dec 5, 2018

If I intentionally bind the wrong port, Passenger waits patiently and eventually times out, using little to no CPU.

However, as soon as I bind $PORT, boom, Passenger core starts hogging the CPU.

I tried to look through the SpawningKit code, but there's just too much going on there, with not enough debugging output for me to understand which path is being taken. So all I can see is that I am definitely getting to watchSocketPingability, and the else branch with syscalls::usleep is fine, but once I bind the port, then either pingTcpServer does something weird (unlikely?), or something after that does (very likely, there's sooo much stuff that happens after that).

FooBarWidget (Member) commented Dec 5, 2018

Can you maybe use gdb, lldb or crash-watch --dump to obtain a backtrace of the Passenger Core to see where it's doing a CPU-intensive loop?

anamba commented Dec 5, 2018

Tried to sample one of the crashed processes 3x, but they all look the same, and I can't make much sense of any of them. Hope there's something interesting in there...

crash-watch-dump.txt
crash-watch-dump2.txt
crash-watch-dump3.txt

Also sampled Passenger core from another server 3x:
crash-watch-dump-a.txt
crash-watch-dump-a2.txt
crash-watch-dump-a3.txt

CamJN (Contributor) commented Dec 7, 2018

I recreated the crash with dbg symbols installed:

#0  0x00007f0d5074e03f in __GI___select (nfds=11, readfds=0x7ffc39fea600,
    writefds=0x0, exceptfds=0x0, timeout=0x0)
    at ../sysdeps/unix/sysv/linux/select.c:41
        resultvar = 18446744073709551102
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
#1  0x00005631b41e81cd in oxt::syscalls::select (nfds=11,
    readfds=0x7ffc39fea600, writefds=0x0, errorfds=0x0, timeout=0x0)
    at src/cxx_supportlib/oxt/system_calls.cpp:487
        ctx = 0x5631b4fd6040
        _my_errno = 22065
        _intr_requested = false
        ret = -1258463168
#2  0x00005631b3f881f1 in waitForExitEvent ()
    at src/agent/Core/CoreMain.cpp:1220

That frame is `if (syscalls::select(largestFd + 1, &fds, NULL, NULL, NULL) == -1) {`, which got a SIGABRT because the thread was killed with SI_TKILL by PID 1291, running as the app user.

And that happened because

0x00005631b3e9fa20 in Passenger::pingTcpServer (host=..., port=5000,
    timeout=0x7f0d114dfc08) at src/cxx_supportlib/IOTools/IOUtils.cpp:751

which is:

 736 bool
 737 pingTcpServer(const StaticString &host, unsigned int port, unsigned long long *timeout) {
 738         TRACE_POINT();
 739         NTCP_State state;
 740 
 741         setupNonBlockingTcpSocket(state, host, port, __FILE__, __LINE__);
 742 
 743         try {
 744                 if (connectToTcpServer(state)) {
 745                         return true;
 746                 }
 747         } catch (const SystemException &e) {
 748                 if (e.code() == ECONNREFUSED) {
 749                         return false;
 750                 } else {
 751                         throw e;
 752                 }
 753         }
 754 ...snip...

This shows we have an unhandled exception raised in pingTcpServer: a SystemException thrown while connecting to the TCP server that is not ECONNREFUSED. I don't yet know what the exception is, but this should help move things forward. The next step is to log the exception and see whether it makes sense to handle it there, or whether some other problem is causing the issue indirectly.

CamJN (Contributor) commented Dec 10, 2018

So the unhandled SystemException has code 99 (EADDRNOTAVAIL), which gives the description:
Cannot connect to TCP socket '127.0.0.1:5000': Cannot assign requested address (errno=99). Or, from the man page: "The specified address is not available from the local machine." Which is really weird, because the address is 127.0.0.1.

CamJN (Contributor) commented Dec 10, 2018

Based on puma/puma#1062, my guess is that the Crystal app might be binding to ::1 instead of 127.0.0.1, and therefore when we try to connect to 127.0.0.1 we fail.

Never mind: the Crystal app specifies 127.0.0.1 directly; it's not using /etc/hosts.

CamJN (Contributor) commented Dec 10, 2018

I can stop the crash, though the CPU still burns for a while until the connect call succeeds.

CamJN (Contributor) commented Dec 10, 2018

The Linux man page specifically says this is due to ephemeral port exhaustion, coupled with the socket file descriptor not having previously been bound to an address. But since this is in the perform step, not the prepare step, that is probably not true (the socket is first connected in prepare, then again in perform). @FooBarWidget thoughts?

CamJN (Contributor) commented Dec 10, 2018

We are in fact using up all of the ephemeral ports somehow; they're all in the TIME_WAIT state, as verified with netstat -an.
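
For illustration (this is not Passenger's code), the mechanism can be reproduced in isolation: a tight connect/close loop against a local listener leaves each closed client socket in TIME_WAIT, and once the ephemeral range toward that destination is used up, connect fails with EADDRNOTAVAIL (errno 99). The port and loop below are assumptions for the sketch, and the exact point of failure depends on kernel settings such as the ephemeral port range and tcp_tw_reuse:

```crystal
# Rough sketch (not Passenger code): exhaust ephemeral ports by connecting
# to and closing a local listener in a tight loop. Each closed client
# socket lingers in TIME_WAIT, so connect() eventually fails with
# EADDRNOTAVAIL. Port 5000 is an assumption for illustration.
require "socket"

port = 5000
server = TCPServer.new("127.0.0.1", port)
spawn do
  # Drain incoming connections so the client loop can keep connecting.
  loop { server.accept?.try(&.close) }
end

count = 0
loop do
  begin
    TCPSocket.new("127.0.0.1", port).close
    count += 1
  rescue ex
    puts "connect failed after #{count} connections: #{ex.message}"
    break
  end
end
```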

CamJN (Contributor) commented Dec 10, 2018

I think the wakeupEventLoop function is not stopping the connection attempts the way it seems it should. I'm seeing a successful connect followed by more connect attempts in a tight loop, forever.

CamJN added a commit that referenced this issue Dec 10, 2018

Fixes an issue during the GLS startup handshake where ephemeral ports could be exhausted. Closes GH-2144.

The watchSocketPingability loop would wake up the other tasks waiting on the cond variable, but would then immediately continue trying to connect to the socket, which eventually exhausted the available ephemeral ports. I now wait on the cond var after waking all the other tasks instead.
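
As a rough illustration of that change, here is the wake-then-wait pattern sketched in Crystal using channels rather than Passenger's C++ condition variable (not Passenger code; the port, helper, and channel names are assumptions):

```crystal
# Analogy of the fix: once the ping succeeds, the poller notifies the
# waiters and then parks itself instead of continuing to connect.
require "socket"

def port_reachable?(host, port) : Bool
  TCPSocket.new(host, port).close
  true
rescue
  false
end

# Stand-in for the app: something listening on the watched port.
app = TCPServer.new("127.0.0.1", 5000)
spawn { loop { app.accept?.try(&.close) } }

ready = Channel(Nil).new
done  = Channel(Nil).new

spawn do
  # watchSocketPingability-like poller.
  until port_reachable?("127.0.0.1", 5000)
    sleep 0.05
  end
  ready.close   # wake everything waiting for the app socket to come up...
  done.receive? # ...then wait, instead of pinging the socket forever.
end

ready.receive? # a waiter: blocks until the poller reports the socket is up
puts "app socket is reachable"
done.close
```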

CamJN (Contributor) commented Dec 10, 2018

Possible fix here: https://github.com/phusion/passenger/tree/GH-2144_GLS_handshake_issues if you'd like to try it out @anamba.

anamba commented Dec 10, 2018

Whoa, this was my first time building straight from GitHub and it was almost too easy. Nice job.

It would be nice if I could do the compile step locally though; I have a lot more CPU and memory available there. Even building without optimizations, I ran out of memory on the server the first time. It looks like the build maxes out at 1.23 GB, and I normally have ~1 GB free on this server (out of 2 GB). I was able to free up some memory by temporarily stopping a few things.

Once I finally got it built, I realized I was on master. Doh. I checked out the branch for this issue and, after a couple of minutes, figured out how to get it to rebuild (I just deleted ~/.passenger/support-binaries/6.0.1/PassengerAgent).

From my limited testing so far, it works! Due to the nature of the problem, it will take some time to be 100% sure, but so far so good!

CamJN (Contributor) commented Dec 11, 2018

We have the Passenger binary automation project that we use for cross-compilation; however, it's a bit complicated to set up, since C++ is much harder to cross-compile reliably than, say, Go or Rust.

anamba commented Dec 18, 2018

Quick follow-up: happy to report that the fix from 7 days ago is still working great; I have not encountered this issue once since then.
