
Segfault in createFilter() #425

Open
michihenning opened this issue Nov 11, 2016 · 12 comments

@michihenning

I'm getting segfaults with leveldb 1.19. This is in code that has been working perfectly for more than a year with 1.18. The segfault happens some time during shutdown and looks like a race condition. Valgrind has this to say:

http://pastebin.ubuntu.com/23459626/

gdb shows this stack trace:

http://pastebin.ubuntu.com/23459624/

I can reproduce the segfault on arm64, amd64, powerpc, s390x, ppc64el, armhf, and i386 simply by changing from libleveldb.so.1.18 to libleveldb.so.1.19.

More details here: https://bugs.launchpad.net/ubuntu/+source/thumbnailer/+bug/1640326

@michihenning
Author

I did a bisect. The core dump first shows up with this commit:

ac1d69d

@ghemawat
Contributor

From the stack trace, one potential explanation is that the caller is deleting the filter policy object before deleting the DB. E.g., suppose the caller does:

  1. At initialization time:
    options.filter_policy = ... create some filter policy ...
    open db with options
  2. Run for a while
  3. Cleanup:
    delete options.filter_policy;
    delete db;

This would explain the crash you are seeing, since background threads inside leveldb may attempt to access the filter policy until the db deletion is finished.
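
For illustration, here is a minimal sketch of the two cleanup orders (the database path and the bloom-filter choice are just examples; DB::Open and NewBloomFilterPolicy are the real leveldb API):

#include <cassert>

#include "leveldb/db.h"
#include "leveldb/filter_policy.h"

int main() {
  leveldb::Options options;
  options.create_if_missing = true;
  options.filter_policy = leveldb::NewBloomFilterPolicy(10);

  leveldb::DB* db = nullptr;
  leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
  assert(status.ok());

  // ... use the db ...

  // Unsafe: background threads inside leveldb may still consult the
  // filter policy until the DB destructor has finished:
  //   delete options.filter_policy;
  //   delete db;

  // Safe: destroy the DB first, then the filter policy.
  delete db;
  delete options.filter_policy;
}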


@ghemawat
Contributor

Also, an easy thing to do might be to try running under address sanitizer and thread sanitizer.


@michihenning
Author

Thanks for the quick response!

We are not using any filter policy.

We use ReadOptions and WriteOptions. Both have static storage duration.

For the read options, we set:
read_options.verify_checksums = true; // Only for debug builds

For the write options, we don't set anything, the variable is default-constructed.

We also pass an instance of leveldb::Options to DB::Open(). That Options instance is on the stack, and the only flags set for it are

options.paranoid_checks = true;
options.create_if_missing = true;

Could any of these have something to do with this?
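
For reference, a minimal sketch of the setup described above (the OpenDb helper and the lambda-initialized static are hypothetical; the option fields are the real leveldb ones):

#include <cassert>
#include <string>

#include "leveldb/db.h"

// Static storage duration, as described above.
static leveldb::ReadOptions read_options = [] {
  leveldb::ReadOptions ro;
#ifndef NDEBUG
  ro.verify_checksums = true;  // Only for debug builds
#endif
  return ro;
}();
static leveldb::WriteOptions write_options;  // default-constructed

leveldb::DB* OpenDb(const std::string& path) {
  // The Options instance lives on the stack; only two flags are set.
  leveldb::Options options;
  options.paranoid_checks = true;
  options.create_if_missing = true;

  leveldb::DB* db = nullptr;
  leveldb::Status status = leveldb::DB::Open(options, path, &db);
  assert(status.ok());
  return db;
}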

@michihenning
Author

One other point that might be relevant: we have three instances of the DB open, each one for a different purpose.

@michihenning
Author

Hmmm... Scanning through our code, we also use

options.block_cache = ...;

The lifetime of that block cache ends before the DB is destroyed. Could this be the cause?

@michihenning
Author

Address sanitizer says this:
http://pastebin.ubuntu.com/23474089/

The address on the first line isn't always 0x0; it varies from run to run. About half the time I get 0x0; the other half, I get this instead (the stack trace is otherwise the same):

==107173==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000018 (pc 0x7fcf567be24a bp 0x60d00002d7a0 sp 0x7fcf41958f80 T5)

@ghemawat
Contributor

Re the block cache whose lifetime ends before the DB is destroyed: could be. But from the code I found on the web, we have

http://bazaar.launchpad.net/~unity-api-team/persistent-cache-cpp/trunk/view/head:/include/core/internal/persistent_string_cache_impl.h

std::unique_ptr<leveldb::Cache> block_cache_;  // Must be defined before db_!
std::unique_ptr<leveldb::DB> db_;

That should prevent the block_cache_ from being deleted until after the db_ has been deleted?
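
(That reasoning relies on C++ destruction order: non-static data members are destroyed in reverse order of declaration. A minimal sketch, with a hypothetical holder type:)

#include <memory>

#include "leveldb/cache.h"
#include "leveldb/db.h"

// Hypothetical holder mirroring the declarations above. Members are
// destroyed in reverse declaration order, so the destructor destroys
// db_ first; block_cache_ therefore outlives the DB, as required.
struct CacheAndDb {
  std::unique_ptr<leveldb::Cache> block_cache_;  // Must be defined before db_!
  std::unique_ptr<leveldb::DB> db_;
};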

I think the first thing to do would be to get more information from the ASAN failure. Perhaps build with line-number information (the -g1 or -g flag?).

It would also be good to know where the main thread was when the ASAN failure happened. Do you have stack traces for the other threads? That would tell us how far the DB destruction had progressed.


@michihenning
Author

My apologies about the block_cache_ lifetime; I wasn't thinking straight.

Here is the ASAN output with line numbers:
http://pastebin.ubuntu.com/23477477/

Here are the stacks for all the threads. The main thread is in the DB destructor, four threads are waiting for an event to arrive, and there is a DB background thread that hits the segfault.
http://pastebin.ubuntu.com/23477508/

@michihenning
Author

BTW, to avoid any confusion, this is with leveldb compiled from commit ac1d69d.

@ghemawat
Contributor

ghemawat commented Nov 14, 2016

I am confused by the two occurrences of ~DBImpl in the stack trace:

#2  0x00007ffff53144d9 in leveldb::DBImpl::~DBImpl (this=0x61700001f880, __in_chrg=<optimised out>)
    at db/db_impl.cc:152
#3  0x00007ffff5314fe9 in leveldb::DBImpl::~DBImpl (this=0x61700001f880, __in_chrg=<optimised out>)
    at db/db_impl.cc:174

The top one is in the following loop:

151   while (bg_compaction_scheduled_) {
152     bg_cv_.Wait();
153   }

The second one (the caller???) is the end of the destructor:

173 }
174 }

Is the program perhaps throwing exceptions while some leveldb library function is on the stack? (leveldb code is not exception-safe.)

Perhaps you could add fprintfs in various parts of ~DBImpl to figure out which lines of the destructor have already been executed by the time of the ASAN crash.

And maybe compile without optimization.


@michihenning
Author

michihenning commented Nov 14, 2016

There are no exceptions anywhere near this.
Compiling with -g2 and without optimization, the main thread stack trace still shows the same two frames for the destructor:

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007ffff53734b5 in leveldb::port::CondVar::Wait (this=0x61700001f970) at port/port_posix.cc:38
#2  0x00007ffff5300a69 in leveldb::DBImpl::~DBImpl (this=0x61700001f880, __in_chrg=<optimised out>)
    at db/db_impl.cc:153
#3  0x00007ffff5301038 in leveldb::DBImpl::~DBImpl (this=0x61700001f880, __in_chrg=<optimised out>)
    at db/db_impl.cc:175
#4  0x000055555560ac3a in core::internal::PersistentStringCacheImpl::~PersistentStringCacheImpl() ()
#5  0x0000555555604441 in core::PersistentStringCache::~PersistentStringCache() ()

I've seen this before--I think it's just an idiosyncrasy of gcc/gdb with inlined destructors?

I inserted some trace into the constructor and destructor:

DBImpl::~DBImpl() {
  std::cerr << "~DBImpl()" << std::endl;
  // Wait for background work to finish
  mutex_.Lock();
  shutting_down_.Release_Store(this);  // Any non-NULL value is ok
  std::cerr << "called Release_Store()" << std::endl;
  while (bg_compaction_scheduled_) {
    std::cerr << "waiting" << std::endl;
    bg_cv_.Wait();
  }
  mutex_.Unlock();
  std::cerr << "done waiting" << std::endl;

We have three instances of PersistentCache in the code; here is the trace I get:

[ RUN      ] ThumbnailerTest.basic
DBImpl()
DBImpl()
DBImpl()
cache version update from 0 to 2
cleared all caches
~DBImpl()
called Release_Store()
done waiting
~DBImpl()
called Release_Store()
waiting
ASAN:DEADLYSIGNAL

@cmumford added the bug label Jan 4, 2017