
Segfault in createFilter() #425

Open
michihenning opened this issue Nov 11, 2016 · 12 comments

@michihenning

I'm getting segfaults with leveldb 1.19. This is in code that has been working perfectly for more than a year with 1.18. The segfault happens some time during shutdown and looks like a race condition. Valgrind has this to say:

http://pastebin.ubuntu.com/23459626/

gdb shows this stack trace:

http://pastebin.ubuntu.com/23459624/

I can reproduce the segfault on arm64, amd64, powerpc, s390x, ppc64el, armhf, and i386 simply by changing from libleveldb.so.1.18 to libleveldb.so.1.19.

More details here: https://bugs.launchpad.net/ubuntu/+source/thumbnailer/+bug/1640326

@michihenning
Author

I did a bisect. The core dump first shows up with this commit:

ac1d69d

@ghemawat
Contributor

From the stack trace, one potential explanation is that the caller is deleting the filter policy object before deleting the DB. E.g., suppose the caller does:

  1. At initialization time:
    options.filter_policy = ... create some filter policy ...
    open db with options
  2. Run for a while
  3. Cleanup:
    delete options.filter_policy;
    delete db;

This would explain the crash you are seeing, since background threads inside leveldb may attempt to access the filter policy until the db deletion is finished.
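
For illustration, here is a minimal sketch of the two cleanup orders (the database path and the bloom-filter choice are just examples; DB::Open and NewBloomFilterPolicy are the real leveldb API):

#include <cassert>

#include "leveldb/db.h"
#include "leveldb/filter_policy.h"

int main() {
  leveldb::Options options;
  options.create_if_missing = true;
  options.filter_policy = leveldb::NewBloomFilterPolicy(10);

  leveldb::DB* db = nullptr;
  leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
  assert(status.ok());

  // ... use the db ...

  // Unsafe: background threads inside leveldb may still consult the
  // filter policy until the DB destructor has finished:
  //   delete options.filter_policy;
  //   delete db;

  // Safe: destroy the DB first, then the filter policy.
  delete db;
  delete options.filter_policy;
}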


@ghemawat
Contributor

Also, an easy thing to do might be to try running under address sanitizer and thread sanitizer.


@michihenning
Author

Thanks for the quick response!

We are not using any filter policy.

We use ReadOptions and WriteOptions. Both have static storage duration.

For the read options, we set:
read_options.verify_checksums = true; // Only for debug builds

For the write options, we don't set anything, the variable is default-constructed.

We also pass an instance of leveldb::Options to DB::Open(). That Options instance is on the stack, and the only flags set for it are

options.paranoid_checks = true;
options.create_if_missing = true;

Could any of these have something to do with this?
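
For reference, a minimal sketch of the setup described above (the OpenDb helper and the lambda-initialized static are hypothetical; the option fields are the real leveldb ones):

#include <cassert>
#include <string>

#include "leveldb/db.h"

// Static storage duration, as described above.
static leveldb::ReadOptions read_options = [] {
  leveldb::ReadOptions ro;
#ifndef NDEBUG
  ro.verify_checksums = true;  // Only for debug builds
#endif
  return ro;
}();
static leveldb::WriteOptions write_options;  // default-constructed

leveldb::DB* OpenDb(const std::string& path) {
  // The Options instance lives on the stack; only two flags are set.
  leveldb::Options options;
  options.paranoid_checks = true;
  options.create_if_missing = true;

  leveldb::DB* db = nullptr;
  leveldb::Status status = leveldb::DB::Open(options, path, &db);
  assert(status.ok());
  return db;
}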

@michihenning
Author

One other point that might be relevant: we have three instances of the DB open, each one for a different purpose.

@michihenning
Author

Hmmm... Scanning through our code, we also use

options.block_cache = ...;

The lifetime of that block cache ends before the DB is destroyed. Could this be the cause?

@michihenning
Author

Address sanitizer says this:
http://pastebin.ubuntu.com/23474089/

The address on the first line isn't always 0x0; it varies from run to run. About half the time I get 0x0; the other half, I get this instead (the stack trace is otherwise the same):

==107173==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000018 (pc 0x7fcf567be24a bp 0x60d00002d7a0 sp 0x7fcf41958f80 T5)

@ghemawat
Contributor

Re the block cache whose lifetime ends before the DB is destroyed: could be. But from the code I found on the web, we have

http://bazaar.launchpad.net/~unity-api-team/persistent-cache-cpp/trunk/view/head:/include/core/internal/persistent_string_cache_impl.h

std::unique_ptr<leveldb::Cache> block_cache_;  // Must be defined before db_!
std::unique_ptr<leveldb::DB> db_;

That should prevent the block_cache_ from being deleted until after the db_ has been deleted?
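
(That reasoning relies on C++ destruction order: non-static data members are destroyed in reverse order of declaration. A minimal sketch, with a hypothetical holder type:)

#include <memory>

#include "leveldb/cache.h"
#include "leveldb/db.h"

// Hypothetical holder mirroring the declarations above. Members are
// destroyed in reverse declaration order, so the destructor destroys
// db_ first; block_cache_ therefore outlives the DB, as required.
struct CacheAndDb {
  std::unique_ptr<leveldb::Cache> block_cache_;  // Must be defined before db_!
  std::unique_ptr<leveldb::DB> db_;
};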

I think the first thing to do would be to get more information from the ASAN failure. Perhaps build with line-number information (the -g1 or -g flag?).

It would also be good to know where the main thread was when the ASAN failure happened. Do you have stack traces for the other threads? That would tell us how far the DB destruction had progressed.


@michihenning
Author

My apologies about the block_cache_ lifetime; I wasn't thinking straight.

Here is the ASAN output with line numbers:
http://pastebin.ubuntu.com/23477477/

Here are the stacks for all the threads. The main thread is in the DB destructor, four threads are waiting for an event to arrive, and there is a DB background thread that hits the segfault.
http://pastebin.ubuntu.com/23477508/

@michihenning
Author

BTW, to avoid any confusion, this is with leveldb compiled from commit ac1d69d.

@ghemawat
Contributor

ghemawat commented Nov 14, 2016

I am confused by the two occurrences of ~DBImpl in the stack trace:

#2  0x00007ffff53144d9 in leveldb::DBImpl::~DBImpl (this=0x61700001f880, __in_chrg=<optimised out>)
    at db/db_impl.cc:152
#3  0x00007ffff5314fe9 in leveldb::DBImpl::~DBImpl (this=0x61700001f880, __in_chrg=<optimised out>)
    at db/db_impl.cc:174

The top one is in the following loop:

151   while (bg_compaction_scheduled_) {
152     bg_cv_.Wait();
153   }

The second one (the caller???) is the end of the destructor:

173 }
174 }

Is the program perhaps throwing exceptions while some leveldb library function is on the stack? (leveldb code is not exception-safe.)

Perhaps you could add fprintfs in various parts of ~DBImpl to figure out which lines of the destructor have already been executed by the time of the ASAN crash.

And maybe compile without optimization.


@michihenning
Author

michihenning commented Nov 14, 2016

There are no exceptions anywhere near this.
Compiling with -g2 and without optimization, the main thread stack trace still shows the same two frames for the destructor:

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007ffff53734b5 in leveldb::port::CondVar::Wait (this=0x61700001f970) at port/port_posix.cc:38
#2  0x00007ffff5300a69 in leveldb::DBImpl::~DBImpl (this=0x61700001f880, __in_chrg=<optimised out>)
    at db/db_impl.cc:153
#3  0x00007ffff5301038 in leveldb::DBImpl::~DBImpl (this=0x61700001f880, __in_chrg=<optimised out>)
    at db/db_impl.cc:175
#4  0x000055555560ac3a in core::internal::PersistentStringCacheImpl::~PersistentStringCacheImpl() ()
#5  0x0000555555604441 in core::PersistentStringCache::~PersistentStringCache() ()

I've seen this before--I think it's just an idiosyncrasy of gcc/gdb with inlined destructors?

I inserted some trace into the constructor and destructor:

DBImpl::~DBImpl() {
  std::cerr << "~DBImpl()" << std::endl;
  // Wait for background work to finish
  mutex_.Lock();
  shutting_down_.Release_Store(this);  // Any non-NULL value is ok
  std::cerr << "called Release_Store()" << std::endl;
  while (bg_compaction_scheduled_) {
    std::cerr << "waiting" << std::endl;
    bg_cv_.Wait();
  }
  mutex_.Unlock();
  std::cerr << "done waiting" << std::endl;

We have three instances of PersistentCache in the code; here is the trace I get:

[ RUN      ] ThumbnailerTest.basic
DBImpl()
DBImpl()
DBImpl()
cache version update from 0 to 2
cleared all caches
~DBImpl()
called Release_Store()
done waiting
~DBImpl()
called Release_Store()
waiting
ASAN:DEADLYSIGNAL

@cmumford added the bug label Jan 4, 2017