Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aggressive decommit mode alloc may trigger core in 2.8.0 #1227

Closed
qhsong opened this issue Oct 18, 2020 · 2 comments
Closed

aggressive decommit mode alloc may trigger core in 2.8.0 #1227

qhsong opened this issue Oct 18, 2020 · 2 comments

Comments

@qhsong
Copy link

qhsong commented Oct 18, 2020

I want to alloc large memory in aggressive decommit mode, but it trigger an coredump:

#0  std::_Rb_tree_increment (__x=0x0)
    at ../../../.././libstdc++-v3/src/c++98/tree.cc:91
#1  0x00000000004535af in std::_Rb_tree_const_iterator<tcmalloc::SpanPtrWithLength>::operator++ (this=0x7fffffffdd30)
    at /usr/local/gcc-4.9.4/include/c++/4.9.4/bits/stl_tree.h:283
#2  0x00000000004539fc in std::_Rb_tree<tcmalloc::SpanPtrWithLength, tcmalloc::SpanPtrWithLength, std::_Identity<tcmalloc::SpanPtrWithLength>, tcmalloc::SpanBestFitLess, tcmalloc::STLPageHeapAllocator<tcmalloc::SpanPtrWithLength, void> >::erase[abi:cxx11](std::_Rb_tree_const_iterator<tcmalloc::SpanPtrWithLength>) (
    this=0x843cf8 <tcmalloc::Static::pageheap_+1572920>, __position=...)
    at /usr/local/gcc-4.9.4/include/c++/4.9.4/bits/stl_tree.h:856
#3  0x0000000000453543 in std::set<tcmalloc::SpanPtrWithLength, tcmalloc::SpanBestFitLess, tcmalloc::STLPageHeapAllocator<tcmalloc::SpanPtrWithLength, void> >::erase[abi:cxx11](std::_Rb_tree_const_iterator<tcmalloc::SpanPtrWithLength>) (
    this=0x843cf8 <tcmalloc::Static::pageheap_+1572920>, __position=...)
    at /usr/local/gcc-4.9.4/include/c++/4.9.4/bits/stl_set.h:591
#4  0x0000000000451b4b in tcmalloc::PageHeap::RemoveFromFreeList (
    this=0x6c3cc0 <tcmalloc::Static::pageheap_>, span=0x84f1e0)
    at src/page_heap.cc:445
#5  0x0000000000451635 in tcmalloc::PageHeap::CheckAndHandlePreMerge (
    this=0x6c3cc0 <tcmalloc::Static::pageheap_>, span=0x84f1b0, other=0x84f1e0)
    at src/page_heap.cc:349
---Type <return> to continue, or q <return> to quit---
#6  0x0000000000451894 in tcmalloc::PageHeap::MergeIntoFreeList (
    this=0x6c3cc0 <tcmalloc::Static::pageheap_>, span=0x84f1b0)
    at src/page_heap.cc:392
#7  0x0000000000451d25 in tcmalloc::PageHeap::ReleaseSpan (
    this=0x6c3cc0 <tcmalloc::Static::pageheap_>, span=0x84f1b0)
    at src/page_heap.cc:511
#8  0x00000000004515e8 in tcmalloc::PageHeap::CheckAndHandlePreMerge (
    this=0x6c3cc0 <tcmalloc::Static::pageheap_>, span=0x84f1e0, other=0x84f1b0)
    at src/page_heap.cc:339
#9  0x000000000045176b in tcmalloc::PageHeap::MergeIntoFreeList (
    this=0x6c3cc0 <tcmalloc::Static::pageheap_>, span=0x84f1e0)
    at src/page_heap.cc:382
#10 0x000000000045156e in tcmalloc::PageHeap::Delete (
    this=0x6c3cc0 <tcmalloc::Static::pageheap_>, span=0x84f1e0)
    at src/page_heap.cc:318
#11 0x0000000000452634 in tcmalloc::PageHeap::GrowHeap (
    this=0x6c3cc0 <tcmalloc::Static::pageheap_>, n=256)
    at src/page_heap.cc:699

tcmalloc alloc 0x84f1e0 span and then delete it, in aggressive decommit mode, it will try merge prev span 0x84f1b0, so call ReleaseSpan(0x84f1b0).
But ReleaseSpan(0x84f1b0) will call RemoveFromFreeList(0x84f1e0), it is a new span and iter is empty, so erase it may core.

This is my test source code:

  MallocExtension::instance()->SetNumericProperty("tcmalloc.aggressive_memory_decommit", 1);
  std::vector<void *> ptr_vec;
  for(int count = 0; count < 10; ++count)
  for(int i = 0; i < 512; i++)
  {
      size_t size = 256 * 8 * 1024;
      void *ptr = malloc(size);   //Core here and i=0
      if(ptr == nullptr)
      {
        std::cout<<"can not alloc memory"<<std::endl;
        continue;
      }
      memset(ptr, 1, size);
      ptr_vec.push_back(ptr);
  }

  std::cout<<"memory alloc: "<<ptr_vec.size()<<std::endl;
qhsong pushed a commit to qhsong/gperftools that referenced this issue Oct 18, 2020
qhsong added a commit to qhsong/gperftools that referenced this issue Oct 18, 2020
alk added a commit that referenced this issue Dec 20, 2020
This reverts commit be3da70.

There are reports of crashes and false-positive OOMs from this
patch. Crashes under aggressive decommit mode are understood, but I
have yet to get confirmations whether false-positive OOMs were seen
under aggressive decommit or not. Thus lets revert for now.

Updates issue #1227 and issue #1204.
@alk
Copy link
Contributor

alk commented Dec 20, 2020

Much thanks for high quality bug report and also thanks for the patch. This is actually even trickier a bit. And I think your patch doesn't fully fix the problem.

The problem is that after we decommited our "main" span that is being deleted, we mark it as sitting on freelist, where in fact it isn't properly inserted there yet. Then when we run decommit for adjacent spans, we drop lock. And while that lock is dropped other operations may "see" our original span and merge with it, corrupting memory in the process.

I think better fix would be to just not bothering with decommitting adjacent spans at all. We will rarely face adjacent spans not on returned free list anyways. And all we need from aggressive decommit is dealing with memory we're freeing. All this "adjacent spans" business is not important and tricky to get right.

There also another reports of issues with problematic commit. Which may be caused by same bug, or maybe some other bug. I looked through the code and I think it there should be no other bugs. But in any case I've asked if those other reports also do aggressive decommit. For now I have reverted problematic commit.

@qhsong
Copy link
Author

qhsong commented Dec 20, 2020

Got it, I'm not fully understand how span release work and do some tricky patch.
Maybe I can pull a test case on this situation, I found aggressive decommit mode do not have enough test cases.

I will close this issues.
Thanks for your reply

@qhsong qhsong closed this as completed Dec 20, 2020
mbautin added a commit to yugabyte/yugabyte-db-thirdparty that referenced this issue May 17, 2021
mbautin added a commit to yugabyte/yugabyte-db that referenced this issue May 18, 2021
Summary:
Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

Test Plan: Jenkins

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11633
mbautin added a commit to mbautin/yugabyte-db that referenced this issue May 18, 2021
Summary:
Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

Test Plan: Jenkins

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase
mbautin added a commit to yugabyte/yugabyte-db that referenced this issue May 19, 2021
Summary:
Original differential revision: https://phabricator.dev.yugabyte.com/D11633
Original commit: e5d4a27

Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

Test Plan: Jenkins: urgent, rebase: 2.4

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11655
mbautin added a commit to yugabyte/yugabyte-db that referenced this issue May 20, 2021
Summary:
Original revision: https://phabricator.dev.yugabyte.com/D11633
Original commit: e5d4a27

Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

Note that we are using the updated third-party URL built for the 2.4 branch here, because the previous yugabyte-db-thirdparty commit we used in the 2.5.3 branch was yugabyte/yugabyte-db-thirdparty@45c97f4, which was also used in the 2.4 branch, and had the problematic gperftools version 2.8.0. The new commit we are using is https://github.com/yugabyte/yugabyte-db-thirdparty/commits/07aad696773b3db7976568a3c827e96d8c3d24c9, with gperftools downgraded to 2.7.0.

Test Plan: Jenkins: urgent, rebase: 2.5.3

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11665
mbautin added a commit to yugabyte/yugabyte-db that referenced this issue May 21, 2021
Summary:
Original revision: https://phabricator.dev.yugabyte.com/D11633
Original commit: e5d4a27

Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

Test Plan: Jenkins: rebase: 2.7.1

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11680
mbautin added a commit to yugabyte/yugabyte-db that referenced this issue May 21, 2021
Summary:
Original differential revision: https://phabricator.dev.yugabyte.com/D11633
Original commit: e5d4a27

Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

We are using the updated third-party URL specifically built for the 2.6 branch here. The previous yugabyte-db-thirdparty commit we used in the 2.6 branch was yugabyte/yugabyte-db-thirdparty@ee4e2e4, and it had the problematic gperftools version 2.8. The new commit we are using is https://github.com/yugabyte/yugabyte-db-thirdparty/commits/d83a2e241523b48e9cd8b7bd5dd248e74bf0132c, with gperftools downgraded to 2.7.

Test Plan: Jenkins: rebase: 2.6

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11682
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary:
Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

Test Plan: Jenkins

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11633
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants