Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

App takes longer to boot on Ubuntu 18 #1969

Closed
collimarco opened this issue Sep 15, 2019 · 11 comments
Closed

App takes longer to boot on Ubuntu 18 #1969

collimarco opened this issue Sep 15, 2019 · 11 comments

Comments

@collimarco
Copy link

Steps to reproduce

  1. Send a simple request to puma and everything works properly
  2. Enable cluster mode (e.g. by passing -w 2)
  3. Request hangs forever and no response is received (after some time you need to terminate the connection from client)

This is the only information that you see in the puma logs:

=== puma startup: 2019-09-15 11:01:53 +0000 ===
[892] ! Terminating timed out worker: 1145
[892] ! Terminating timed out worker: 1146
[892] ! Terminating timed out worker: 1147
[892] ! Terminating timed out worker: 1148
[892] ! Terminating timed out worker: 1397
[892] ! Terminating timed out worker: 1399
...

Expected behavior

Puma should work fine in cluster mode.

Actual behavior

Request hang forever and no response is received. Logs don't give enough information.

System configuration

Ruby version: 2.6.4
Rails version: 5.2.3
Puma version: 4.1.1
OS/version: Ubuntu 18.04 LTS
Server: DigitalOcean CPU-Optimized 4vCPU

@collimarco
Copy link
Author

I could partially solve the issue by setting this:

worker_timeout 300

The workers start responding properly after about 2 minutes. However that really seems a strange startup time.

We run this Rails app in production since 2015 and we were using DigitalOcean Standard 2vCPU. We have always used clustered mode and we had never increased the worker_timeout: everything was working fine. Now we use more powerful servers with dedicated hyperthreads and more CPUs (i.e. DigitalOcean CPU-Optimized 4vCPU) and the startup is extremely slow (causing that Terminating timed out worker if we don't increase worker_timeout).

@collimarco
Copy link
Author

I have made more tests: after a restart Puma takes ~90 seconds before responding to the first request. During that time there is no CPU usage (i.e. top says 99.5 idle) and there is a lot of free RAM (i.e. top says 8GB free). So why the worker startup is so low?

@nateberkopec
Copy link
Member

So, there's not really a bug in Puma, it's just that your app is suddenly taking a long time to start up?

Does it take so long to start in another application server, like Unicorn?

@nateberkopec
Copy link
Member

Also does the application boot quickly in single mode? That is, after it boots and you immediately send it a request, how long does it take for that request to return?

@collimarco
Copy link
Author

The app is currently (today) running on the old servers (Ubuntu 16.04) without the need of worker_timeout: everything works fine and the startup is fast (in any case < 60s).
The exact same app, with the same configuration, also runs on some new servers (which have the newer Ubuntu 18.04 and much more powerful hardware): on these new servers the startup is slow and the timeout occurs if you don't increase it.

Also does the application boot quickly in single mode?

In single mode takes about the same time as in cluster mode (the only difference is that it returns Connection refused in the meantime, instead of keeping the connection).

@collimarco
Copy link
Author

collimarco commented Sep 15, 2019

I have tried with ruby 2.5 and the problem persists on the new server.
On the old server the startup (with same number of threads and workers) takes only 10s.
So it's really weird:

  • DigitalOcean 4vCPU dedicated ($80): 90s
  • DigitalOcean 2vCPU shared ($20): 10s

The problem is the same on all the new servers. Apart from hardware, the only difference is Ubuntu (on old server was 16.04LTS, on the new server is 18.04LTS).

@collimarco
Copy link
Author

Another note: the startup time doesn't change if I increase or decrease the number of workers.

@nateberkopec
Copy link
Member

🤷‍♂ Not really any evidence of a Puma issue here, though. Could be any one of a million things in your app thats different w/the new boxes. The new OS version especially - there's literally thousands of different new packages.

If you try Unicorn and it boots in 10s, please let us know.

The SIGINFO/log_thread_status changes on master may help you debug this eventually once we actually get that working on linux (#1964), until then you're stuck using gdb to figure out where it might be hanging.

@nateberkopec nateberkopec changed the title Puma not accepting requests in cluster mode App takes longer to boot on Ubuntu 18 Sep 15, 2019
@collimarco
Copy link
Author

Yes, it might not be strictly related to Puma and it is also not related to our app (it's working fine on the previous servers).

Now I have also realized that another similar issue happens on the new database server (Ubuntu 18.04): rake db:migrate takes about ~90 seconds (same time as Puma boot) to run, even when there are no migrations to run (on the previous server took only a few seconds).

In general the performance of the new servers (tested with bechmarks, like redis benchmark) are very high. So this must be a bug specific to Rails / Ruby / Ubuntu 18. If you have any other suggestion on how to debug this would be appreciated.

@nateberkopec
Copy link
Member

nateberkopec commented Sep 15, 2019

rake db:migrate takes about ~90 seconds

That task + booting the app both have to set up the Rails "environment". My guess is that $ rails runner "puts 'hello'" also takes 90 seconds.

You might try rbspy to see where the time is going. I think it would work well here.

@collimarco
Copy link
Author

@nateberkopec Thank you! That command saved my day... The runner indeed takes 90 seconds, exactly the same.

So I have tried to CTRL-C and it was clear that the process was stuck on Redis gem. After some investigation I found that there was this line in config/environments/production.rb:

config.cache_store = :redis_cache_store, { host: '10.129.123.123' }

That cache store was never used, but it was configured. The problem is that the host was pointing to a machine in the old datacenter: when we tried to connect from the old servers (located in the old datacenter), the connection was simply rejected immediately. However in the new datacenter, probably due to different firewall configurations, that line produced a long waiting for a response.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants