Memcached gets stuck in an infinite loop in assoc_find #271
Comments
How are you causing that? What's the full

gdb bt info:

Are you able to reproduce this state? Have you modified the source in any way? Never seen this failure. Very interested in finding out how it got into that state. Any chance of getting

This state has reproduced many times. We have 30 memcached instances, but it always happens on only one of them. I haven't modified the source code. The process rejects TCP connections; stats info from gdb:

Looks like none of the new features are even enabled, so it's probably not a side effect of any of those... How is memcached being installed? You're sure it still happens on 1.4.36? Need to think about how to figure this out. There isn't any way that should happen. Are you using the binary protocol or ASCII? Maybe you can add:

About the install: I got the 1.4.36 release source code from GitHub and ran make && make install. Btw, in assoc_insert, doesn't expand_bucket need a lock?

It doesn't need a lock because of how expansion is started/stopped. All threads synchronize first. If you're worried about that, you can give the startup option

All right, I'll try your suggestion about the "assert..."; hope it's helpful.

Hi guys, I have tried your suggestion, but it didn't help. There is no info about the 'assert'; I guess the assert was never executed. And I suspect the expansion is the cause, so is there any other way?

Hey, can you confirm you were running the memcached-debug binary? Have you tried adding the assert? I'm hesitant to blame the hash expansion, because I have tried to blame it for many bugs in the past, but it always turns out to be something else.

Yeah, I have used the memcached-debug binary, and got this on stderr: "memcached-debug: items.c:344: item_free: Assertion `(it->it_flags & 1) == 0' failed", but not the "assert(it->h_next != it)". What does that error mean?

And when it hit the error, the memcached process went away, so I couldn't debug it.

Sorry, you have to start it within GDB, or enable coredumps. That assertion is a little scary... It means an item is being freed back to memory while it's still linked into the hash table and the linked list. Something like that could cause this same problem. Any chance you could run it under GDB and get the backtrace for this assert? Also, from any of your live, non-bugged instances in the same pool, I would greatly appreciate the output of all the various stats commands (stats, settings, items, slabs). The prints you did earlier are missing a lot of data.

Hi guys, I have got a backtrace for the assertion '(it->it_flags & 1) == 0'. I also started memcached-debug with the argument '-ohashpower=21'; when it crashed, hashpower was still 21. So you are right, it's not the expansion's fault.

Thanks! Hmm, not super enlightening. That's the normal code path, now freeing something that's still linked. Any chance I could get a look (you can send them privately) at full stats output from a running instance?

When I got the assertion, the memcached-debug process had already gone away; I could only get stats via 'p stats' in gdb.

You said you have several instances, but only one of them crashes, correct? Can I see the stats output from a different instance, or do they all get different types of traffic? How long does it take to crash after you start it, btw?

Yes, only one crashes, and it crashes every day. Stats info from another healthy instance:

Hey, I got full stats output before memcached crashed today; hope it's helpful.

And is there any other way for me to debug the crash?

Thanks! That is helpful. Sorry, I've been a little busy, so I don't have a lot of great ideas yet. The area of code is very narrow though, since you are using so few features. The stats output helps a lot. I'd like to try two sort-of-crazy things just to rule them out:
The first is to rule out the refcount overflowing, hitting 0, and then the item getting freed while still linked.
The second changes the key distribution in the hash table, in case that's a bug (very far-fetched). Since your connection count is low, my only guess for a refcount overflow would be that you are accidentally generating massive multigets for the same key. We can protect better against that by forcing a miss instead of upping the refcount; I just haven't written it yet since it's never been reported before. But we have to prove the theory first. If I come up with any other ideas for things to test, I'll let you know.

Thanks very much!

Hey, you are right!!

Hah, nice. No, you should really look at your app and find out why that's happening. Since your connection count is very low, you're likely fetching the same key over and over again, and either not reading the responses or asking for many at once, i.e. "get key1 key1 key1 key1". First and foremost, that's not doing you any favors, so figure out how you're getting into that state. In 1.4.36 you have access to "watch fetchers", which gives you some view into it. I'll try to write up a patch to force a miss if the refcount is too high, probably not until the weekend though (it's a little more complicated than just doing that).

Hey dormando, you are right again!! BTW, I have temporarily modified the code in items.c to keep the refcount from overflowing:

Good luck! Your change can still cause some really bad problems. Something safer would be a check down in the expiry code in do_item_get: treat the item as expired (unlink it) if its refcount is > USHRT_MAX-10, which leaves some headroom for other functions that increase the refcount on their own. It's not safe for some functions to get a miss, only the get protocol commands.

Got it. Our app has recovered now. Thanks!

Awesome! I really appreciate you sticking around to work through it.

Didn't get to this over the weekend; should happen sometime during the week or next weekend.

OK

The fix is here. It's gone into the 'next' branch for the next release. Thanks again for your report and patience! Unfortunately, it's not obvious when the fix is kicking in.

Is there a reproducer for this issue? We're trying to see whether older releases of memcached (1.4.13, in Debian LTS/wheezy) are affected, and it would be useful to have a regression test. Thanks!

Great, thanks!
OS Ver: CentOS 6.5 x64
Kernel Ver: 2.6.32-431.11.7.el6
Memcached Ver: 1.4.24 and 1.4.36
gdb info:
(gdb) info thread
7 Thread 0x7f103e389700 (LWP 29762) logger_thread (arg=) at logger.c:476
5 Thread 0x7f103cf87700 (LWP 29764) 0x00007f103e93a334 in __lll_lock_wait () from /lib64/libpthread.so.0
4 Thread 0x7f1034586700 (LWP 29765) 0x00007f103e93a334 in __lll_lock_wait () from /lib64/libpthread.so.0
3 Thread 0x7f1037fff700 (LWP 29766) 0x00007f103e93a334 in __lll_lock_wait () from /lib64/libpthread.so.0
2 Thread 0x7f10375fe700 (LWP 29767) 0x00007f103e93768c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
1 Thread 0x7f103f1aa700 (LWP 29761) 0x00007f103e680f33 in epoll_wait () from /lib64/libc.so.6
In thread 6 (assoc_find):
infinite loop in:
(gdb) n
92 while (it) {
(gdb)
93 if ((nkey == it->nkey) && (memcmp(key, ITEM_key(it), nkey) == 0)) {
(gdb)
97 it = it->h_next;
items info:
(gdb) p it
$31 = (item *) 0x7f101a4fd7a0
(gdb) p *it
$30 = {next = 0x7f101a4fd7a0, prev = 0x7f0ffaeee700, h_next = 0x7f101a4fd7a0, time = 166607, exptime = 253007, nbytes = 2378, refcount = 1, nsuffix = 10 '\n',
it_flags = 11 '\v', slabs_clsid = 144 '\220', nkey = 24 '\030', data = 0x7f101a4fd7a0}
it->h_next points back to it itself, so the infinite loop occurs.
So is this a bug?