CPU pinned / spinning on epoll_wait #53

Closed
igrigorik opened this issue Dec 19, 2010 · 11 comments

Haven't been able to identify the exact culprit, but out of the blue one of our beanstalk processes is pinning the CPU (system). Doing an strace shows:

clock_gettime(CLOCK_MONOTONIC, {3162510, 551700172}) = 0
epoll_wait(4, {{EPOLLIN, {u32=20, u64=20}}, {EPOLLIN, {u32=26, u64=26}}, {EPOLLIN, {u32=23, u64=23}}}, 32, 998) = 3
clock_gettime(CLOCK_MONOTONIC, {3162510, 551773172}) = 0
epoll_wait(4, {{EPOLLIN, {u32=23, u64=23}}, {EPOLLIN, {u32=26, u64=26}}, {EPOLLIN, {u32=20, u64=20}}}, 32, 998) = 3
clock_gettime(CLOCK_MONOTONIC, {3162510, 551844172}) = 0
epoll_wait(4, {{EPOLLIN, {u32=20, u64=20}}, {EPOLLIN, {u32=26, u64=26}}, {EPOLLIN, {u32=23, u64=23}}}, 32, 998) = 3
clock_gettime(CLOCK_MONOTONIC, {3162510, 551916172}) = 0
epoll_wait(4, {{EPOLLIN, {u32=23, u64=23}}, {EPOLLIN, {u32=26, u64=26}}, {EPOLLIN, {u32=20, u64=20}}}, 32, 998) = 3
clock_gettime(CLOCK_MONOTONIC, {3162510, 551987172}) = 0
epoll_wait(4, {{EPOLLIN, {u32=20, u64=20}}, {EPOLLIN, {u32=26, u64=26}}, {EPOLLIN, {u32=23, u64=23}}}, 32, 998) = 3
clock_gettime(CLOCK_MONOTONIC, {3162510, 552059172}) = 0
epoll_wait(4, {{EPOLLIN, {u32=23, u64=23}}, {EPOLLIN, {u32=26, u64=26}}, {EPOLLIN, {u32=20, u64=20}}}, 32, 998) = 3
clock_gettime(CLOCK_MONOTONIC, {3162510, 552131172}) = 0
epoll_wait(4, {{EPOLLIN, {u32=7, u64=7}}, {EPOLLIN, {u32=20, u64=20}}, {EPOLLIN, {u32=26, u64=26}}, {EPOLLIN, {u32=23, u64=23}}}, 32, 998) = 4
clock_gettime(CLOCK_MONOTONIC, {3162510, 552203172}) = 0

Running 1.4.6 -- anyone else experienced this?

lucky commented Feb 9, 2011

Haven't seen this. What version of libevent?

lucky commented Mar 11, 2011

As an update, I believe we've run into the same behavior. Note that this elevated CPU usage does not correspond with an increase in throughput. Mean reserves per second is 150, though it does spike to 1000/s for very short periods of time (a few minutes).

Here's what our CPU usage looked like before and after this problem cropped up.

We're running a 32-bit EC2 instance (a small):

  • Ubuntu 10.04.1 LTS
  • Ubuntu/Debian package of libevent 1.4.13-stable-1, libc6 (>= 2.4)
  • beanstalkd 1.4.6 (built from source)

CPU Info
model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
stepping : 10
cpu MHz : 2666.760
cache size : 6144 KB

uname -a:
Linux queue-0 2.6.32-305-ec2 #9-Ubuntu SMP Thu Apr 15 04:14:01 UTC 2010 i686 GNU/Linux

ghost assigned kr Apr 10, 2011
kr (Member) commented May 8, 2011

We don't use libevent at all any more, so it's possible the bug is
simply gone. Even if not, it should be easier to fix now.

Can anyone reproduce this with recent code from git (since April 19)?
If so, please reopen the issue.

kr closed this as completed May 8, 2011
igrigorik (Author) commented

Keith, would you consider trunk to be "production quality" at this point? We still have the IO-pinned instance in prod, which I'd love to fix up, but given that I've had a very hard time trying to synthetically reproduce this behavior, I don't want to break other stuff while building from master. ;-)

kr (Member) commented May 8, 2011

Understood. Yes, I'd consider it production quality, but if you're
using the binlog, be aware that the format has changed (from
version 5 to 6) and I haven't yet written code to read version 5.
Also, it will change again (to at least version 7) before the next
release, and I wasn't planning on writing a routine to read version 6.

igrigorik (Author) commented

Ok, thanks, we'll give it a shot and let you know what we find out.

await commented Jun 10, 2011

Hi Keith, I compiled and deployed the tip of beanstalkd as of June 10, 2011, and I'm seeing the same behavior described in the bug. It's not completely pegging the CPU, but we're getting the following from an strace. I didn't see the same behavior on 1.4.6. We have a large number of clients doing a 'reserve-with-timeout'. Would that cause timestamp calls in a tight for-loop?

epoll_wait(6, {}, 500, 10) = 0
gettimeofday({1307731157, 547447}, NULL) = 0
gettimeofday({1307731157, 547522}, NULL) = 0
epoll_wait(6, {}, 500, 10) = 0
gettimeofday({1307731157, 553925}, NULL) = 0
epoll_wait(6, {}, 500, 10) = 0
gettimeofday({1307731157, 563925}, NULL) = 0
gettimeofday({1307731157, 564007}, NULL) = 0
epoll_wait(6, {}, 500, 10) = 0
gettimeofday({1307731157, 573923}, NULL) = 0
epoll_wait(6, {}, 500, 10) = 0
gettimeofday({1307731157, 583918}, NULL) = 0
gettimeofday({1307731157, 583975}, NULL) = 0
epoll_wait(6, {}, 500, 10) = 0
gettimeofday({1307731157, 598114}, NULL) = 0
gettimeofday({1307731157, 598163}, NULL) = 0
epoll_wait(6, {}, 500, 10) = 0
gettimeofday({1307731157, 603916}, NULL) = 0
epoll_wait(6, {}, 500, 10) = 0
gettimeofday({1307731157, 613894}, NULL) = 0
gettimeofday({1307731157, 613947}, NULL) = 0
epoll_wait(6, {}, 500, 10) = 0
gettimeofday({1307731157, 620005}, NULL) = 0
epoll_wait(6, {}, 500, 10) = 0
gettimeofday({1307731157, 633839}, NULL) = 0
gettimeofday({1307731157, 633897}, NULL) = 0
epoll_wait(6, {}, 500, 10) = 0
gettimeofday({1307731157, 643883}, NULL) = 0
epoll_wait(6, {}, 500, 10) = 0
gettimeofday({1307731157, 649118}, NULL) = 0
gettimeofday({1307731157, 649167}, NULL) = 0
epoll_wait(6, {}, 500, 10) = 0
gettimeofday({1307731157, 660758}, NULL) = 0
gettimeofday({1307731157, 660833}, NULL) = 0

kr (Member) commented Jun 15, 2011

This output looks normal. The process wakes up at least once
every 10ms to see if anything needs to be done (such as moving
a job from the delay queue to the ready queue).

Beanstalkd used to compute the exact amount of time to wait,
so you'd see absolutely no activity when nothing needs to be
done. I changed it to wake up at a fixed period while refactoring
the networking code and removing libevent, because doing it
this way is easier to get right. Eventually I'd like to switch back
to an exact calculation of the sleep time, but that's pretty far
down on the priority list, since the performance impact is tiny,
and it only makes a difference when the system is idle.

kr reopened this Feb 5, 2012
kr (Member) commented Feb 5, 2012

See https://groups.google.com/d/topic/beanstalk-talk/MZNodUgzGHM for
further discussion. I suspect it's the same problem.

Edit: fixed link.

chexov commented Feb 7, 2012

Just to keep it in the GitHub issue tracker, here's my current situation with the event loop on beanstalkd 1.5 on OpenBSD (compiled with port-bsd.c, sock-bsd.c).

Here's the loop eating up to 100% CPU (one core):

CALL kevent(0x6,0,0,0x7f7ffffc5f80,0x1,0x829e80)
RET kevent 1
CALL gettimeofday(0x7f7ffffc5f50,0)
RET gettimeofday 0
CALL read(0x1d,0x207db7854,0xd0)
RET read -1 errno 35 Resource temporarily unavailable
CALL kevent(0x6,0,0,0x7f7ffffc5f80,0x1,0x829e80)
RET kevent 1
CALL gettimeofday(0x7f7ffffc5f50,0)
RET gettimeofday 0
CALL read(0x1b,0x207db7c54,0xd0)
RET read -1 errno 35 Resource temporarily unavailable
CALL kevent(0x6,0,0,0x7f7ffffc5f80,0x1,0x829e80)
RET kevent 1
CALL gettimeofday(0x7f7ffffc5f50,0)
RET gettimeofday 0
CALL kevent(0x6,0,0,0x7f7ffffc5f80,0x1,0x829e80)
RET kevent 1
CALL gettimeofday(0x7f7ffffc5f50,0)
RET gettimeofday 0
CALL kevent(0x6,0,0,0x7f7ffffc5f80,0x1,0x829e80)
RET kevent 1
CALL gettimeofday(0x7f7ffffc5f50,0)
RET gettimeofday 0
CALL read(0x10,0x20cd73054,0xd0)
RET read -1 errno 35 Resource temporarily unavailable
CALL kevent(0x6,0,0,0x7f7ffffc5f80,0x1,0x829e80)
RET kevent 1
CALL gettimeofday(0x7f7ffffc5f50,0)
RET gettimeofday 0
CALL kevent(0x6,0,0,0x7f7ffffc5f80,0x1,0x829e80)
RET kevent 1

kr (Member) commented Feb 13, 2012

This was separately reported as #96, and it's fixed now.

kr closed this as completed Feb 13, 2012