tcmalloc for large object return null always #1204

Closed
ghost opened this issue Jul 23, 2020 · 10 comments

@ghost

ghost commented Jul 23, 2020

The server has 256 GB of memory, and my application uses only about 2% of it. Plenty of memory is free, yet the application often gets null back from an allocation. The requests that fail are probably between 8 MB and 80 MB.
Can someone help me analyze the cause of the problem?

@gcsmith

gcsmith commented Aug 10, 2020

I'm experiencing similar issues. After updating to gperftools-2.8 (w/ libunwind-1.4.0) our applications started crashing with null pointer dereferences and std::bad_alloc exceptions. Rolling back to gperftools-2.7 (w/ libunwind-1.2.1) resolved the issue.

These regressions run on a server farm with ~500GB of memory.

@gcsmith

gcsmith commented Aug 10, 2020

Here's an example backtrace:

 #0  __cxxabiv1::__cxa_throw (obj=0x1fb711f0, tinfo=0x7fffdf137790 <typeinfo for std::bad_alloc>, dest=0x7fffdee56e50 <std::bad_alloc::~bad_alloc()>) at ../../../../gcc-5.2.0/libstdc++-v3/libsupc++/eh_throw.cc:62                                                                                                                                                     
 #1  0x00007fffdfad28b3 in (anonymous namespace)::handle_oom (retry_fn=retry_fn@entry=0x7fffdfad3b30 <(anonymous namespace)::retry_malloc(void*)>, retry_arg=retry_arg@entry=0xc0000, from_operator=from_operator@entry=true, nothrow=nothrow@entry=false) at src/tcmalloc.cc:1264                                                                                       
 #2  0x00007fffdfaf0c96 in tcmalloc::cpp_throw_oom (size=size@entry=786432) at src/tcmalloc.cc:1736                                                                                                                                                                                                                                                                      
 #3  0x00007fffdfaf23aa in tcmalloc::do_allocate_full<tcmalloc::cpp_throw_oom> (size=786432) at src/tcmalloc.cc:1779                                                                                                                                                                                                                                                     
 #4  tcmalloc::allocate_full_cpp_throw_oom (size=786432) at src/tcmalloc.cc:1791                                                                                                                                                                                                                                                                                         
 #5  0x000000000055e788 in __gnu_cxx::new_allocator<unsigned long>::allocate (this=0x247d90b8, __n=98304) at /home/utils/gcc-5.2.0/include/c++/5.2.0/ext/new_allocator.h:104                                                                                                                                                                                             
 #6  0x000000000055b27f in std::allocator_traits<std::allocator<unsigned long> >::allocate (__a=..., __n=98304) at /home/utils/gcc-5.2.0/include/c++/5.2.0/bits/alloc_traits.h:360                                                                                                                                                                                       
 #7  0x0000000000559292 in std::_Vector_base<unsigned long, std::allocator<unsigned long> >::_M_allocate (this=0x247d90b8, __n=98304) at /home/utils/gcc-5.2.0/include/c++/5.2.0/bits/stl_vector.h:170                                                                                                                                                                   
 #8  0x0000000000592bcd in std::_Vector_base<unsigned long, std::allocator<unsigned long> >::_M_create_storage (this=0x247d90b8, __n=98304) at /home/utils/gcc-5.2.0/include/c++/5.2.0/bits/stl_vector.h:185                                                                                                                                                             
 #9  0x0000000000591fcd in std::_Vector_base<unsigned long, std::allocator<unsigned long> >::_Vector_base (this=0x247d90b8, __n=98304, __a=...) at /home/utils/gcc-5.2.0/include/c++/5.2.0/bits/stl_vector.h:136                                                                                                                                                         
 #10 0x000000000059172b in std::vector<unsigned long, std::allocator<unsigned long> >::vector (this=0x247d90b8, __x=std::vector of length 98304, capacity 98304 = {...}) at /home/utils/gcc-5.2.0/include/c++/5.2.0/bits/stl_vector.h:320                                                                                                                                
 #11 0x00007fffe211aa07 in sparse::EWAHBoolArray<unsigned long>::EWAHBoolArray (this=0x247d90b8, other=...) at xxx
 ...
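
As a quick sanity check on the frames above, the requested size and the vector length agree: 98304 elements of unsigned long come to 786432 bytes (0xc0000, the retry_arg in frame #1), or 768 KiB. The sketch below only reproduces that arithmetic. The 8-byte element size is an assumption (LP64 Linux), and RequestBytes is a hypothetical helper, not part of gperftools.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: LP64 Linux, where unsigned long occupies 8 bytes.
constexpr std::size_t kElementBytes = 8;

// Hypothetical helper: bytes requested for a vector<unsigned long> of this length.
constexpr std::size_t RequestBytes(std::size_t element_count) {
  return element_count * kElementBytes;
}

// __n=98304 in frames #5 to #10 corresponds to size=786432 in frames #2 to #4.
static_assert(RequestBytes(98304) == 786432, "vector length matches request size");
static_assert(RequestBytes(98304) == 0xc0000, "matches retry_arg in frame #1");
static_assert(RequestBytes(98304) / 1024 == 768, "768 KiB");
```

So this particular failure is a request well under 1 MB, smaller than the 8 MB to 80 MB range mentioned in the original report.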

@geoffrob345

I am tracking the same problem. After upgrading from 2.7 to 2.8, we began seeing a large allocation (not the first or the largest, about 83 MB) cause an abort, with errno set to 2.

These crashes are intermittent but consistently occur in the same code. They started directly after the upgrade and stop if I revert to 2.7.

@ghost
Author

ghost commented Oct 14, 2020

In order to alleviate this problem, I made the following changes in src/common.h:
`static const size_t kMaxThreadCacheSize = 4 << 28;

static const size_t kPageSize = 1 << 14;
static const size_t kMaxSize = 512 * 1024;
// For all span-lengths <= kMaxPages we keep an exact-size list in PageHeap.
static const size_t kMaxPages = 1 << 14;`
I know this may not be the best way, but it does alleviate the problem.
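
For readers who want the values above in familiar units, here is a small sketch that evaluates them. The constant names and values are copied from the comment; kKiB, kMiB and MaxExactListSpanBytes are hypothetical helpers added only for illustration. Reading kMaxPages times kPageSize as the largest span that still gets an exact-size list follows the quoted source comment, but whether that is why the change helps is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Values as posted above for src/common.h.
constexpr std::size_t kMaxThreadCacheSize = 4 << 28;
constexpr std::size_t kPageSize = 1 << 14;
constexpr std::size_t kMaxSize = 512 * 1024;
constexpr std::size_t kMaxPages = 1 << 14;

// Hypothetical unit helpers, not part of gperftools.
constexpr std::size_t kKiB = 1024;
constexpr std::size_t kMiB = 1024 * kKiB;

// Hypothetical helper: byte size of the longest span with an exact-size list.
constexpr std::size_t MaxExactListSpanBytes() { return kMaxPages * kPageSize; }

static_assert(kMaxThreadCacheSize == 1024 * kMiB, "thread cache cap is 1 GiB");
static_assert(kPageSize == 16 * kKiB, "pages are 16 KiB");
static_assert(kMaxSize == 512 * kKiB, "small-object limit is 512 KiB");
static_assert(MaxExactListSpanBytes() == 256 * kMiB, "exact-size lists up to 256 MiB");
```

With these settings, spans up to 256 MiB would fall under the exact-size lists, which covers the 8 MB to 80 MB requests described at the top of the thread.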

@robryk

robryk commented Oct 17, 2020

#1226 might be a duplicate of this issue. We've bisected the issue down to be3da70 there.

@vladr

vladr commented Nov 16, 2020

Same issue here after upgrading from 2.7 to 2.8.

@geoffrob345

geoffrob345 commented Nov 16, 2020

I concur with robryk.

This issue was occurring multiple times daily for our product.

I found it was specific to the 2.8 decision to unlock the pageheap during system memory release.

I added the following patch as a temporary fix.

The system has been running steady, meeting load requirements and passing all tests for the past month since this fix:

  | - Static::pageheap_lock()->Unlock();
  | + // gperftools version 2.8 intentionally unlocks during a span release, as this
  | + // can be a slow operation. However the product was showing instability caused
  | + // by the removal of the page lock, so (for now) we are adding the lock back in.
  | +
  | + // Static::pageheap_lock()->Unlock();
  | bool rv = TCMalloc_SystemRelease(reinterpret_cast<void*>(span->start << kPageShift),
  | static_cast<size_t>(span->length << kPageShift));
  | - Static::pageheap_lock()->Lock();
  | + // Static::pageheap_lock()->Lock();

@alk
Contributor

alk commented Dec 20, 2020

For now I am reverting the problematic commit. There is indeed at least one bug, pointed out in issue #1227, and this is the only place where I can see races being possible too.

So my question to people on this ticket: are you all running with aggressive decommit? Also, can someone offer some kind of reproduction of this bug?

alk added a commit that referenced this issue Dec 20, 2020
This reverts commit be3da70.

There are reports of crashes and false-positive OOMs from this
patch. Crashes under aggressive decommit mode are understood, but I
have yet to get confirmations whether false-positive OOMs were seen
under aggressive decommit or not. Thus lets revert for now.

Updates issue #1227 and issue #1204.
asfgit pushed a commit to apache/kudu that referenced this issue Dec 23, 2020
gperftools 2.8.0 had a new feature that lead to crashes and corruption
which was reverted in 2.8.1. This patch upgrades to 2.8.1 to avoid any
issues.

One of the issues that is fixed via feature revert in 2.8.1 is
gperftools/gperftools#1204

Change-Id: I69f3405d14c4a853d8c224b8111fef5961ea34dc
Reviewed-on: http://gerrit.cloudera.org:8080/16897
Reviewed-by: Bankim Bhavsar <bankim@cloudera.com>
Reviewed-by: Alexey Serbin <aserbin@cloudera.com>
Tested-by: Kudu Jenkins
@alk
Contributor

alk commented Dec 23, 2020

OK, it was not aggressive decommit. It was a race in how we grow the heap. When PageHeap::New finds no suitable free chunk, it calls GrowHeap and then searches for a free chunk again, anticipating that the search will succeed since we have just grown the heap.

But once we added code to drop the lock while releasing memory, there is a small chance that GrowHeap's call to Delete (which places the newly added chunk of memory into the page heap) triggers IncrementalScavenge, which decides to release some smaller, unrelated span. While that release happens the lock is dropped, and another thread is able to "steal" the just-added chunk of memory. So GrowHeap succeeds, but its result is stolen by another thread, and the thread that grew the heap sees an OOM event.
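
The sequence described above can be sketched with a toy model. This is not gperftools code: ToyPageHeap, its members and RunScenario are hypothetical names, the "heap" is just a list of free chunk sizes, and the competing thread is simulated by a callback that runs in the window where the lock is dropped, so the outcome is deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

// Toy model only; none of these names are gperftools APIs.
struct ToyPageHeap {
  std::mutex lock;
  std::vector<std::size_t> free_chunks;
  bool unlock_during_release = true;   // models the behaviour added in 2.8
  std::function<void()> on_unlocked;   // stands in for another thread running

  // Caller must hold `lock`. Removes a chunk of at least n and returns true.
  bool TakeLocked(std::size_t n) {
    for (std::size_t i = 0; i < free_chunks.size(); ++i) {
      if (free_chunks[i] >= n) {
        free_chunks.erase(free_chunks.begin() + static_cast<std::ptrdiff_t>(i));
        return true;
      }
    }
    return false;
  }

  // Caller must hold `lock`. Models releasing an unrelated span to the OS.
  void ReleaseSomeSpanLocked() {
    if (unlock_during_release) {
      lock.unlock();
      if (on_unlocked) on_unlocked();  // window in which the chunk can be taken
      lock.lock();
    }
  }

  // Caller must hold `lock`. Adds fresh memory, then scavenges.
  void GrowHeapLocked(std::size_t n) {
    free_chunks.push_back(n);
    ReleaseSomeSpanLocked();
  }

  // Search, grow once, search again. A false result models the spurious OOM.
  bool New(std::size_t n) {
    std::lock_guard<std::mutex> guard(lock);
    if (TakeLocked(n)) return true;
    GrowHeapLocked(n);
    return TakeLocked(n);
  }
};

// Runs one allocation while a simulated competitor grabs a chunk of the same
// size whenever the lock is dropped. Returns whether the grower succeeded.
bool RunScenario(bool unlock_during_release) {
  ToyPageHeap heap;
  heap.unlock_during_release = unlock_during_release;
  heap.on_unlocked = [&heap] {
    std::lock_guard<std::mutex> guard(heap.lock);
    heap.TakeLocked(100);
  };
  return heap.New(100);
}
```

In this model RunScenario(true) returns false, because the chunk added by the grower is gone by the time it searches again, while RunScenario(false) returns true, because the lock is never dropped between growing and searching.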

I'll see how to safely re-enable the original feature, adding more careful tests. For now I am satisfied with:

a) revert

b) 2 bugs that we found.

Notably, the second bug is also present in "abseil" tcmalloc. It is just that tcmalloc defaults to a different page heap implementation. So I'll be fixing it over there too.

@alk alk closed this as completed Dec 23, 2020
@alk
Contributor

alk commented Dec 23, 2020

https://gist.github.com/alk/e46cce07da5a5182dbc092815e2db546 is the test program that helped find the second bug

mbautin added a commit to yugabyte/yugabyte-db-thirdparty that referenced this issue May 17, 2021
mbautin added a commit to yugabyte/yugabyte-db that referenced this issue May 18, 2021
Summary:
Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

Test Plan: Jenkins

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11633
mbautin added a commit to mbautin/yugabyte-db that referenced this issue May 18, 2021
Summary:
Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

Test Plan: Jenkins

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase
mbautin added a commit to yugabyte/yugabyte-db that referenced this issue May 19, 2021
Summary:
Original differential revision: https://phabricator.dev.yugabyte.com/D11633
Original commit: e5d4a27

Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

Test Plan: Jenkins: urgent, rebase: 2.4

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11655
mbautin added a commit to yugabyte/yugabyte-db that referenced this issue May 20, 2021
Summary:
Original revision: https://phabricator.dev.yugabyte.com/D11633
Original commit: e5d4a27

Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

Note that we are using the updated third-party URL built for the 2.4 branch here, because the previous yugabyte-db-thirdparty commit we used in the 2.5.3 branch was yugabyte/yugabyte-db-thirdparty@45c97f4, which was also used in the 2.4 branch, and had the problematic gperftools version 2.8.0. The new commit we are using is https://github.com/yugabyte/yugabyte-db-thirdparty/commits/07aad696773b3db7976568a3c827e96d8c3d24c9, with gperftools downgraded to 2.7.0.

Test Plan: Jenkins: urgent, rebase: 2.5.3

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11665
mbautin added a commit to yugabyte/yugabyte-db that referenced this issue May 21, 2021
Summary:
Original revision: https://phabricator.dev.yugabyte.com/D11633
Original commit: e5d4a27

Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

Test Plan: Jenkins: rebase: 2.7.1

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11680
mbautin added a commit to yugabyte/yugabyte-db that referenced this issue May 21, 2021
Summary:
Original differential revision: https://phabricator.dev.yugabyte.com/D11633
Original commit: e5d4a27

Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

We are using the updated third-party URL specifically built for the 2.6 branch here. The previous yugabyte-db-thirdparty commit we used in the 2.6 branch was yugabyte/yugabyte-db-thirdparty@ee4e2e4, and it had the problematic gperftools version 2.8. The new commit we are using is https://github.com/yugabyte/yugabyte-db-thirdparty/commits/d83a2e241523b48e9cd8b7bd5dd248e74bf0132c, with gperftools downgraded to 2.7.

Test Plan: Jenkins: rebase: 2.6

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11682
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary:
Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7.

We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8.

Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account.

Test Plan: Jenkins

Reviewers: bogdan, tvesely, steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11633