tcmalloc for large object return null always #1204
Comments
I'm experiencing similar issues. After updating to gperftools-2.8 (with libunwind-1.4.0), our applications started crashing with null pointer dereferences and std::bad_alloc exceptions. Rolling back to gperftools-2.7 (with libunwind-1.2.1) resolved the issue. These regressions run on a server farm with ~500 GB of memory.
Here's an example backtrace:

I am tracking the same problem. After upgrading from 2.7 to 2.8, we began seeing a large allocation (not the first or the largest, about 83 MB) cause an abort with errno set to 2. The crashes are intermittent but consistently occur in the same code. They started directly after the upgrade and stop if I revert to 2.7.
To alleviate this problem, I made the following change in src/common.h: static const size_t kPageSize = 1 << 14;
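The thread does not explain why a larger page size helped, so the following is only a rough illustration of what the constant affects: the number of pages a large request spans. It assumes the stock build uses 8 KiB pages (a shift of 13), which is not stated in this thread, and `PagesFor` is a hypothetical helper, not a gperftools function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of gperftools): number of pages needed to
// cover `bytes` when the page size is 1 << page_shift, rounding up.
constexpr std::size_t PagesFor(std::size_t bytes, std::size_t page_shift) {
  return (bytes + (std::size_t{1} << page_shift) - 1) >> page_shift;
}

constexpr std::size_t kMiB = std::size_t{1} << 20;

// Compare an assumed default shift of 13 (8 KiB pages) with the
// commenter's 1 << 14 (16 KiB pages) for an 80 MiB request.
inline void CheckPageCounts() {
  assert(PagesFor(80 * kMiB, 13) == 10240);
  assert(PagesFor(80 * kMiB, 14) == 5120);
}
```

Doubling the page size halves the page count of every large span. Whether that reduces how often the failure is hit is not established here.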
Same issue here after upgrading from 2.7 to 2.8.
I concur with robryk. This issue was occurring multiple times daily for our product. I found it was specific to the 2.8 decision to unlock the pageheap during system memory release. I added the following patch as a temporary fix. The system has been running steadily, meeting load requirements and passing all tests, for the past month since this fix:
- Static::pageheap_lock()->Unlock();
For now I am reverting the problematic commit. There is indeed at least one bug, pointed out in issue #1227, and this is the only place where I can see races being possible. So my question to people on this ticket: are you all running with aggressive decommit? Also, can someone offer some kind of reproduction of this bug?
This reverts commit be3da70. There are reports of crashes and false-positive OOMs from this patch. Crashes under aggressive decommit mode are understood, but I have yet to get confirmation of whether the false-positive OOMs were seen under aggressive decommit. Thus, let's revert for now. Updates issue #1227 and issue #1204.
gperftools 2.8.0 had a new feature that led to crashes and corruption, which was reverted in 2.8.1. This patch upgrades to 2.8.1 to avoid any issues. One of the issues fixed by the feature revert in 2.8.1 is gperftools/gperftools#1204.
Change-Id: I69f3405d14c4a853d8c224b8111fef5961ea34dc
Reviewed-on: http://gerrit.cloudera.org:8080/16897
Reviewed-by: Bankim Bhavsar <bankim@cloudera.com>
Reviewed-by: Alexey Serbin <aserbin@cloudera.com>
Tested-by: Kudu Jenkins
OK, it was not aggressive decommit. It was a race in how we grow the heap. When PageHeap::New finds no suitable free chunk, it calls GrowHeap and then searches for a free chunk again, anticipating that the search will succeed since we just grew the heap. But once we added code to drop the lock while releasing memory, there is a small chance that GrowHeap's call to Delete (which places the newly added chunk of memory into the page heap) triggers IncrementalScavenge, which decides to release some smaller, unrelated span. While that happens the lock is released, and another thread is able to "steal" the just-added chunk of memory. So GrowHeap succeeds, but its success is stolen by another thread, and the thread that grew the heap sees an OOM event.
I'll see how to safely re-enable the original feature, adding more careful tests. For now I am satisfied with: a) the revert, and b) the 2 bugs that we found. Notably, the second bug is also present in "abseil" tcmalloc. It is just that tcmalloc defaults to a different page heap implementation. So I'll be fixing it over there too.
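The interleaving described above can be sketched with a small single-threaded model. This is not gperftools code: `ToyPageHeap`, `TakeFree`, `other_thread`, and `SimulateLargeAlloc` are hypothetical names, and the "other thread" is simulated by a callback that runs at the point where the lock would be dropped, so the outcome is deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy model of the race. All names are hypothetical, chosen for illustration.
struct ToyPageHeap {
  std::vector<std::size_t> free_spans;  // sizes (in pages) of free spans
  bool unlock_during_release = false;   // models the behaviour added in 2.8.0
  std::function<void()> other_thread;   // work that runs while the lock is dropped

  // Remove the first free span of at least n pages; report whether one existed.
  bool TakeFree(std::size_t n) {
    for (std::size_t i = 0; i < free_spans.size(); ++i) {
      if (free_spans[i] >= n) {
        free_spans.erase(free_spans.begin() + i);
        return true;
      }
    }
    return false;
  }

  // Models GrowHeap: the new span is added to the free list (the Delete
  // step), and a scavenge pass then releases some unrelated memory. If the
  // lock is dropped during that release, another thread gets to run.
  void GrowHeap(std::size_t n) {
    free_spans.push_back(n);
    if (unlock_during_release && other_thread) {
      std::function<void()> run = std::move(other_thread);
      other_thread = nullptr;
      run();  // simulated window in which the lock is not held
    }
  }

  // Models New: search, grow on a miss, then search again. The second
  // search is expected to succeed because the heap was just grown.
  bool New(std::size_t n) {
    if (TakeFree(n)) return true;
    GrowHeap(n);
    return TakeFree(n);
  }
};

// Returns true if the large allocation succeeds, false for a spurious OOM.
bool SimulateLargeAlloc(bool unlock_during_release) {
  ToyPageHeap heap;
  heap.unlock_during_release = unlock_during_release;
  heap.other_thread = [&heap] { heap.TakeFree(100); };  // the "stealing" thread
  return heap.New(100);
}
```

With the lock held throughout, `SimulateLargeAlloc(false)` finds the span it just added. With the unlock window modeled, `SimulateLargeAlloc(true)` reports failure even though the heap was grown successfully, which matches the false-positive OOM described in the comment.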
https://gist.github.com/alk/e46cce07da5a5182dbc092815e2db546 is the test program that helped find the second bug |
Related issues in gperftools: - gperftools/gperftools#1204 - gperftools/gperftools#1227
Summary: Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7. We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8. Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account. Test Plan: Jenkins Reviewers: bogdan, tvesely, steve.varnau Reviewed By: steve.varnau Subscribers: ybase Differential Revision: https://phabricator.dev.yugabyte.com/D11633
Summary: Original differential revision: https://phabricator.dev.yugabyte.com/D11633 Original commit: e5d4a27 Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7. We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8. Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account. Test Plan: Jenkins: urgent, rebase: 2.4 Reviewers: bogdan, tvesely, steve.varnau Reviewed By: steve.varnau Subscribers: ybase Differential Revision: https://phabricator.dev.yugabyte.com/D11655
Summary: Original revision: https://phabricator.dev.yugabyte.com/D11633 Original commit: e5d4a27 Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7. We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8. Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account. Note that we are using the updated third-party URL built for the 2.4 branch here, because the previous yugabyte-db-thirdparty commit we used in the 2.5.3 branch was yugabyte/yugabyte-db-thirdparty@45c97f4, which was also used in the 2.4 branch, and had the problematic gperftools version 2.8.0. The new commit we are using is https://github.com/yugabyte/yugabyte-db-thirdparty/commits/07aad696773b3db7976568a3c827e96d8c3d24c9, with gperftools downgraded to 2.7.0. Test Plan: Jenkins: urgent, rebase: 2.5.3 Reviewers: bogdan, tvesely, steve.varnau Reviewed By: steve.varnau Subscribers: ybase Differential Revision: https://phabricator.dev.yugabyte.com/D11665
Summary: Original revision: https://phabricator.dev.yugabyte.com/D11633 Original commit: e5d4a27 Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7. We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8. Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account. Test Plan: Jenkins: rebase: 2.7.1 Reviewers: bogdan, tvesely, steve.varnau Reviewed By: steve.varnau Subscribers: ybase Differential Revision: https://phabricator.dev.yugabyte.com/D11680
Summary: Original differential revision: https://phabricator.dev.yugabyte.com/D11633 Original commit: e5d4a27 Downgrade gperftools to 2.7 to avoid hitting tcmalloc bugs (gperftools/gperftools#1204, gperftools/gperftools#1227) and because we have already extensively tested tcmalloc from gperftools 2.7. We will still use the old Linuxbrew-based third-party archive for ASAN/TSAN builds because the new Linuxbrew-based GCC 5 third-party archive does not have a Clang 7 toolchain anymore (Linuxbrew is being removed from an increasing number of build types). But ASAN/TSAN builds are non-production and do not use tcmalloc anyway so it is OK to use an archive that has gperftools 2.8. Also fix find_or_download_thirdparty.sh to take BUILD_ROOT into account. We are using the updated third-party URL specifically built for the 2.6 branch here. The previous yugabyte-db-thirdparty commit we used in the 2.6 branch was yugabyte/yugabyte-db-thirdparty@ee4e2e4, and it had the problematic gperftools version 2.8. The new commit we are using is https://github.com/yugabyte/yugabyte-db-thirdparty/commits/d83a2e241523b48e9cd8b7bd5dd248e74bf0132c, with gperftools downgraded to 2.7. Test Plan: Jenkins: rebase: 2.6 Reviewers: bogdan, tvesely, steve.varnau Reviewed By: steve.varnau Subscribers: ybase Differential Revision: https://phabricator.dev.yugabyte.com/D11682
The server has 256 GB of memory, of which my application uses about 2%. Plenty of memory remains free, yet the application often gets null back from allocations. The sizes I request are probably between 8 MB and 80 MB.
Can someone help me analyze the cause of the problem?