[improvement](memory) simplify memory config related to tcmalloc and add gc (apache#1191) (apache#1193)

* [improvement](memory) simplify memory config related to tcmalloc

  There are several configs related to tcmalloc, but users do not know how to configure them. In practice users want just two modes, performance or compact: in performance mode, Doris should run queries and loads quickly, while in compact mode, Doris should run with less memory usage. To configure tcmalloc individually, use the environment variables that tcmalloc supports.

* [improvement](tcmalloc) add moderate mode and avoid oom with a lot of cache (apache#14374)

  ReleaseToSystem aggressively when there is little free memory.
luwei16 · Nov 30, 2022 · commit 23a144c
1 parent e4e281d
Showing 7 changed files with 132 additions and 81 deletions.
diff --git a/be/src/common/config.h b/be/src/common/config.h
@@ -46,26 +46,10 @@ CONF_Int32(single_replica_load_brpc_num_threads, "64");
 // If no ip match this rule, will choose one randomly.
 CONF_String(priority_networks, "");
 
-////
-//// tcmalloc gc parameter
-////
-// min memory for TCmalloc, when used memory is smaller than this, do not returned to OS
-CONF_mInt64(tc_use_memory_min, "10737418240");
-// free memory rate.[0-100]
-CONF_mInt64(tc_free_memory_rate, "20");
-// tcmallc aggressive_memory_decommit
-CONF_mBool(tc_enable_aggressive_memory_decommit, "false");
-
-// Bound on the total amount of bytes allocated to thread caches.
-// This bound is not strict, so it is possible for the cache to go over this bound
-// in certain circumstances. This value defaults to 1GB
-// If you suspect your application is not scaling to many threads due to lock contention in TCMalloc,
-// you can try increasing this value. This may improve performance, at a cost of extra memory
-// use by TCMalloc.
-// reference: https://gperftools.github.io/gperftools/tcmalloc.html: TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES
-//            https://github.com/gperftools/gperftools/issues/1111
-CONF_Int64(tc_max_total_thread_cache_bytes, "1073741824");
-
+// memory mode
+// moderate (default), performance, or compact
+CONF_String(memory_mode, "moderate");
+
 // process memory limit specified as number of bytes
 // ('<int>[bB]?'), megabytes ('<float>[mM]'), gigabytes ('<float>[gG]'),
 // or percentage of the physical memory ('<int>%').
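
The effect of `memory_mode` on the GC thread can be summarized as a small lookup. The sketch below is only an illustration of the values set in `Daemon::tcmalloc_gc_thread()` in this commit; the `TcmallocGcParams` struct and the `params_for_mode` function are hypothetical names that do not exist in Doris.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper type (not part of the commit): the per-mode knobs
// that tcmalloc_gc_thread() derives from config::memory_mode.
struct TcmallocGcParams {
    int64_t max_cache_percent;  // allowed tcmalloc cache, as a percent of used bytes
    int64_t pressure_limit;     // memory pressure (percent) that triggers aggressive release
    bool is_performance_mode;   // performance mode never toggles aggressive decommit
};

// Hypothetical function mirroring the if/else chain in the diff.
// Any value other than "performance" or "compact" keeps the moderate defaults.
TcmallocGcParams params_for_mode(const std::string& memory_mode) {
    if (memory_mode == "performance") {
        return {100, 90, true};
    }
    if (memory_mode == "compact") {
        return {20, 80, false};
    }
    return {60, 90, false};  // "moderate", the default
}
```

For example, `params_for_mode("compact")` yields a 20% cache allowance and an 80% pressure limit, which is why compact mode returns memory to the system earlier than the other modes.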

diff --git a/be/src/common/daemon.cpp b/be/src/common/daemon.cpp
@@ -72,23 +72,111 @@ void Daemon::tcmalloc_gc_thread() {
     // TODO All cache GC wish to be supported
 #if !defined(ADDRESS_SANITIZER) && !defined(LEAK_SANITIZER) && !defined(THREAD_SANITIZER) && \
         !defined(USE_JEMALLOC)
-    while (!_stop_background_threads_latch.wait_for(std::chrono::seconds(10))) {
-        size_t used_size = 0;
-        size_t free_size = 0;
 
+    // Limit size of tcmalloc cache via release_rate and max_cache_percent.
+    // We adjust release_rate according to memory_pressure, which is usage percent of memory.
+    int64_t max_cache_percent = 60;
+    double release_rates[10] = {1.0, 1.0, 1.0, 5.0, 5.0, 20.0, 50.0, 100.0, 500.0, 2000.0};
+    int64_t pressure_limit = 90;
+    bool is_performance_mode = false;
+    size_t physical_limit_bytes = std::min(MemInfo::hard_mem_limit(), MemInfo::mem_limit());
+
+    if (config::memory_mode == std::string("performance")) {
+        max_cache_percent = 100;
+        pressure_limit = 90;
+        is_performance_mode = true;
+        physical_limit_bytes = std::min(MemInfo::mem_limit(), MemInfo::physical_mem());
+    } else if (config::memory_mode == std::string("compact")) {
+        max_cache_percent = 20;
+        pressure_limit = 80;
+    }
+
+    int last_ms = 0;
+    const int kMaxLastMs = 30000;
+    const int kIntervalMs = 10;
+    size_t init_aggressive_decommit = 0;
+    size_t current_aggressive_decommit = 0;
+    size_t expected_aggressive_decommit = 0;
+    int64_t last_memory_pressure = 0;
+
+    MallocExtension::instance()->GetNumericProperty("tcmalloc.aggressive_memory_decommit",
+                                                    &init_aggressive_decommit);
+    current_aggressive_decommit = init_aggressive_decommit;
+
+    while (!_stop_background_threads_latch.wait_for(std::chrono::milliseconds(kIntervalMs))) {
+        size_t tc_used_bytes = 0;
+        size_t tc_alloc_bytes = 0;
+        size_t rss = PerfCounters::get_vm_rss();
+
+        MallocExtension::instance()->GetNumericProperty("generic.total_physical_bytes",
+                                                        &tc_alloc_bytes);
         MallocExtension::instance()->GetNumericProperty("generic.current_allocated_bytes",
-                                                        &used_size);
-        MallocExtension::instance()->GetNumericProperty("tcmalloc.pageheap_free_bytes", &free_size);
-        size_t alloc_size = used_size + free_size;
-        LOG(INFO) << "tcmalloc.pageheap_free_bytes " << free_size
-                  << ", generic.current_allocated_bytes " << used_size;
-
-        if (alloc_size > config::tc_use_memory_min) {
-            size_t max_free_size = alloc_size * config::tc_free_memory_rate / 100;
-            if (free_size > max_free_size) {
-                MallocExtension::instance()->ReleaseToSystem(free_size - max_free_size);
+                                                        &tc_used_bytes);
+        int64_t tc_cached_bytes = tc_alloc_bytes - tc_used_bytes;
+        int64_t to_free_bytes =
+                (int64_t)tc_cached_bytes - (tc_used_bytes * max_cache_percent / 100);
+
+        int64_t memory_pressure = 0;
+        int64_t alloc_bytes = std::max(rss, tc_alloc_bytes);
+        memory_pressure = alloc_bytes * 100 / physical_limit_bytes;
+
+        expected_aggressive_decommit = init_aggressive_decommit;
+        if (memory_pressure > pressure_limit) {
+            // We are reaching oom, so release cache aggressively.
+            // Ideally, we should reuse cache and not allocate from system any more,
+            // however, it is hard to set limit on cache of tcmalloc and doris
+            // use mmap in vectorized mode.
+            if (last_memory_pressure <= pressure_limit) {
+                int64_t min_free_bytes = alloc_bytes - physical_limit_bytes * 9 / 10;
+                to_free_bytes = std::max(to_free_bytes, min_free_bytes);
+                to_free_bytes = std::max(to_free_bytes, tc_cached_bytes * 30 / 100);
+                to_free_bytes = std::min(to_free_bytes, tc_cached_bytes);
+                expected_aggressive_decommit = 1;
+            } else {
+                // release rate is enough.
+                to_free_bytes = 0;
+            }
+            last_ms = kMaxLastMs;
+        } else if (memory_pressure > (pressure_limit - 10)) {
+            if (last_memory_pressure <= (pressure_limit - 10)) {
+                to_free_bytes = std::max(to_free_bytes, tc_cached_bytes * 10 / 100);
+            } else {
+                to_free_bytes = 0;
             }
         }
+
+        int release_rate_index = memory_pressure / 10;
+        double release_rate = 1.0;
+        if (release_rate_index >= static_cast<int>(sizeof(release_rates) / sizeof(release_rates[0]))) {
+            release_rate = 2000.0;
+        } else {
+            release_rate = release_rates[release_rate_index];
+        }
+        MallocExtension::instance()->SetMemoryReleaseRate(release_rate);
+
+        if ((current_aggressive_decommit != expected_aggressive_decommit) && !is_performance_mode) {
+            MallocExtension::instance()->SetNumericProperty("tcmalloc.aggressive_memory_decommit",
+                                                            expected_aggressive_decommit);
+            current_aggressive_decommit = expected_aggressive_decommit;
+        }
+
+        last_memory_pressure = memory_pressure;
+        if (to_free_bytes > 0) {
+            last_ms += kIntervalMs;
+            if (last_ms >= kMaxLastMs) {
+                LOG(INFO) << "generic.current_allocated_bytes " << tc_used_bytes
+                          << ", generic.total_physical_bytes " << tc_alloc_bytes << ", rss " << rss
+                          << ", max_cache_percent " << max_cache_percent << ", release_rate "
+                          << release_rate << ", memory_pressure " << memory_pressure
+                          << ", physical_limit_bytes " << physical_limit_bytes << ", to_free_bytes "
+                          << to_free_bytes << ", current_aggressive_decommit "
+                          << current_aggressive_decommit;
+                MallocExtension::instance()->ReleaseToSystem(to_free_bytes);
+                last_ms = 0;
+            }
+        } else {
+            last_ms = 0;
+        }
     }
 #endif
 }
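
Two pieces of arithmetic in the new loop are easy to check in isolation: the release rate chosen from the memory pressure decile, and the baseline number of cached bytes to release. The following is a minimal sketch under stated assumptions, not the Doris implementation. The function names `release_rate_for` and `baseline_to_free_bytes` are hypothetical, the clamp of negative pressure to zero is an added assumption, and the bounds check uses the element count of the table.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Release rates indexed by memory pressure decile (values taken from the diff).
constexpr double kReleaseRates[] = {1.0, 1.0, 1.0, 5.0, 5.0, 20.0, 50.0, 100.0, 500.0, 2000.0};
constexpr size_t kNumReleaseRates = sizeof(kReleaseRates) / sizeof(kReleaseRates[0]);

// Hypothetical helper: memory_pressure is usage as a percent of the limit.
// Pressure at or above 100% falls back to the most aggressive rate.
double release_rate_for(int64_t memory_pressure) {
    // Assumption: clamp negative input to 0 so the index is always valid.
    int64_t index = std::max<int64_t>(memory_pressure, 0) / 10;
    if (static_cast<size_t>(index) >= kNumReleaseRates) {
        return 2000.0;
    }
    return kReleaseRates[index];
}

// Hypothetical helper: cached bytes in excess of max_cache_percent of used bytes.
// A result <= 0 means the cache is within its allowance and nothing is released.
int64_t baseline_to_free_bytes(int64_t tc_alloc_bytes, int64_t tc_used_bytes,
                               int64_t max_cache_percent) {
    int64_t tc_cached_bytes = tc_alloc_bytes - tc_used_bytes;
    return tc_cached_bytes - tc_used_bytes * max_cache_percent / 100;
}
```

As a worked case in moderate mode (`max_cache_percent = 60`): with 200 bytes held by tcmalloc and 100 bytes in use, the cache is 100 bytes, the allowance is 60 bytes, and the baseline to free is 40 bytes.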

diff --git a/be/src/runtime/memory/mem_tracker_limiter.cpp b/be/src/runtime/memory/mem_tracker_limiter.cpp
@@ -30,6 +30,8 @@
 
 namespace doris {
 
+bool MemTrackerLimiter::_oom_avoidance {true};
+
 MemTrackerLimiter::MemTrackerLimiter(int64_t byte_limit, const std::string& label,
                                      const std::shared_ptr<MemTrackerLimiter>& parent,
                                      RuntimeProfile* profile) {

diff --git a/be/src/runtime/memory/mem_tracker_limiter.h b/be/src/runtime/memory/mem_tracker_limiter.h
@@ -66,6 +66,9 @@ class MemTrackerLimiter final : public MemTracker {
 
 public:
     static bool sys_mem_exceed_limit_check(int64_t bytes) {
+        if (!_oom_avoidance) {
+            return false;
+        }
         // Limit process memory usage using the actual physical memory of the process in `/proc/self/status`.
         // This is independent of the consumption value of the mem tracker, which counts the virtual memory
         // of the process malloc.
@@ -112,6 +115,8 @@ class MemTrackerLimiter final : public MemTracker {
     // Returns the lowest limit for this tracker limiter and its ancestors. Returns -1 if there is no limit.
     int64_t get_lowest_limit() const;
 
+    static void disable_oom_avoidance() { _oom_avoidance = false; }
+
 public:
     // up to (but not including) end_tracker.
     // This happens when we want to update tracking on a particular mem tracker but the consumption
@@ -259,6 +264,7 @@ class MemTrackerLimiter final : public MemTracker {
     // In some cases, in order to avoid the cumulative error of the upper global tracker,
     // the consumption of the current tracker is reset to zero.
     bool _reset_zero = false;
+    static bool _oom_avoidance;
 };
 
 inline void MemTrackerLimiter::consume(int64_t bytes) {
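
The `_oom_avoidance` change follows a simple pattern: a process-wide static flag that short-circuits the limit check once it has been switched off at startup. The sketch below shows that pattern with a hypothetical `OomGuard` class; the byte comparison is a placeholder assumption, since the real check reads the process memory from `/proc/self/status`.

```cpp
#include <cstdint>

// Hypothetical stand-in for MemTrackerLimiter, showing only the flag gate.
class OomGuard {
public:
    // Returns true when the request should be rejected.
    // Placeholder logic: the real check uses the process's physical memory.
    static bool exceeds_limit(int64_t used_bytes, int64_t request_bytes, int64_t limit_bytes) {
        if (!_oom_avoidance) {
            return false;  // checks disabled, e.g. in performance mode
        }
        return used_bytes + request_bytes > limit_bytes;
    }

    // Intended to be called once during startup, before worker threads run.
    static void disable_oom_avoidance() { _oom_avoidance = false; }

private:
    static bool _oom_avoidance;
};

// Out-of-class definition, as the commit does in mem_tracker_limiter.cpp.
bool OomGuard::_oom_avoidance = true;
```

Because the flag is a plain `bool`, this sketch assumes it is only written before concurrent readers exist; a flag toggled at runtime would more safely be a `std::atomic<bool>`.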

diff --git a/be/src/service/doris_main.cpp b/be/src/service/doris_main.cpp
@@ -325,20 +325,19 @@ int main(int argc, char** argv) {
 
 #if !defined(__SANITIZE_ADDRESS__) && !defined(ADDRESS_SANITIZER) && !defined(LEAK_SANITIZER) && \
         !defined(THREAD_SANITIZER) && !defined(USE_JEMALLOC)
-    // Aggressive decommit is required so that unused pages in the TCMalloc page heap are
-    // not backed by physical pages and do not contribute towards memory consumption.
-    if (doris::config::tc_enable_aggressive_memory_decommit) {
-        MallocExtension::instance()->SetNumericProperty("tcmalloc.aggressive_memory_decommit", 1);
-    }
     // Change the total TCMalloc thread cache size if necessary.
-    if (!MallocExtension::instance()->SetNumericProperty(
-                "tcmalloc.max_total_thread_cache_bytes",
-                doris::config::tc_max_total_thread_cache_bytes)) {
+    const size_t kDefaultTotalThreadCacheBytes = 1024 * 1024 * 1024;
+    if (!MallocExtension::instance()->SetNumericProperty("tcmalloc.max_total_thread_cache_bytes",
+                                                         kDefaultTotalThreadCacheBytes)) {
         fprintf(stderr, "Failed to change TCMalloc total thread cache size.\n");
         return -1;
     }
 #endif
 
+    if (doris::config::memory_mode == std::string("performance")) {
+        doris::MemTrackerLimiter::disable_oom_avoidance();
+    }
+
     std::vector<doris::StorePath> paths;
     auto olap_res = doris::parse_conf_store_paths(doris::config::storage_root_path, &paths);
     if (!olap_res) {

diff --git a/docs/en/docs/admin-manual/config/be-config.md b/docs/en/docs/admin-manual/config/be-config.md
@@ -838,6 +838,12 @@ The number of sliced tablets, plan the layout of the tablet, and avoid too many
 * Description: Limit the percentage of the server's maximum memory used by the BE process. It is used to prevent BE from occupying too much of the machine's memory. This parameter must be greater than 0. When the percentage is greater than 100%, the value will default to 100%.
 * Default value: 80%
 
+### `memory_mode`
+
+* Type: string
+* Description: Control gc of tcmalloc. In performance mode, doris releases memory of the tcmalloc cache when usage >= 90% * mem_limit; otherwise, doris releases memory of the tcmalloc cache when usage >= 50% * mem_limit.
+* Default value: moderate
+
 ### `memory_limitation_per_thread_for_schema_change`
 
 Default: 2 （G）
@@ -1357,26 +1363,6 @@ The RPC timeout for sending a Batch (1024 lines) during import. The default is 6
 
 When you encounter the '[E1011]The server is overcrowded' error, you can tune the configuration `brpc_socket_max_unwritten_bytes`, but it can't be modified at runtime. Set this item to `true` to temporarily avoid write failures. Note that it only affects `write`; other rpc requests will still check whether the server is overcrowded.
 
-### `tc_free_memory_rate`
-
-Default: 20   (%)
-
-Available memory, value range: [0-100]
-
-### `tc_max_total_thread_cache_bytes`
-
-* Type: int64
-* Description: Used to limit the total thread cache size in tcmalloc. This limit is not a hard limit, so the actual thread cache usage may exceed this limit. For details, please refer to [TCMALLOC\_MAX\_TOTAL\_THREAD\_CACHE\_BYTES](https://gperftools.github.io/gperftools/tcmalloc.html)
-* Default: 1073741824
-
-If the system is found to be in a high-stress scenario and a large number of threads are found in the tcmalloc lock competition phase through the BE thread stack, such as a large number of `SpinLock` related stacks, you can try increasing this parameter to improve system performance. [Reference](https://github.com/gperftools/gperftools/issues/1111)
-
-### `tc_use_memory_min`
-
-Default: 10737418240
-
-The minimum memory of TCmalloc, when the memory used is less than this, it is not returned to the operating system
-
 ### `thrift_client_retry_interval_ms`
 
 * Type: int64

diff --git a/docs/zh-CN/docs/admin-manual/config/be-config.md b/docs/zh-CN/docs/admin-manual/config/be-config.md
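@@ -839,6 +839,12 @@ The maximum number of txns per txn_partition_map in the txn manager; this is a kind of self-
 * Description: Limits the maximum percentage of the server's memory that the BE process may use. It prevents BE from taking up too much of the machine's memory. This parameter must be greater than 0; when the percentage exceeds 100%, the value defaults to 100%.
 * Default value: 80%
 
+### `memory_mode`
+
+* Type: string
+* Description: Controls tcmalloc reclamation. If set to performance, doris releases memory in the tcmalloc cache when memory usage exceeds 90% of mem_limit; if set to compact, doris releases memory in the tcmalloc cache when memory usage exceeds 50% of mem_limit.
+* Default value: moderate
+
 ### `memory_limitation_per_thread_for_schema_change`
 
 Default value: 2 (GB)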
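@@ -1380,26 +1386,6 @@ Update interval of the tablet state cache, in seconds
 
 When you encounter the '[E1011]The server is overcrowded' error, you can adjust the configuration item `brpc_socket_max_unwritten_bytes`, but that item cannot be changed at runtime. Setting this item to `true` therefore temporarily avoids write failures. Note that this item only affects the write path; other rpc requests will still check whether the server is overcrowded.
 
-### `tc_free_memory_rate`
-
-Default value: 20   (%)
-
-Available memory, value range: [0-100]
-
-### `tc_max_total_thread_cache_bytes`
-
-* Type: int64
-* Description: Used to limit the total thread cache size in tcmalloc. This limit is not a hard limit, so actual thread cache usage may exceed it. For details, see [TCMALLOC\_MAX\_TOTAL\_THREAD\_CACHE\_BYTES](https://gperftools.github.io/gperftools/tcmalloc.html)
-* Default value: 1073741824
-
-If, under high load, the BE thread stacks show a large number of threads contending for tcmalloc locks, such as many `SpinLock`-related stacks, you can try increasing this parameter to improve system performance. [Reference](https://github.com/gperftools/gperftools/issues/1111)
-
-### `tc_use_memory_min`
-
-Default value: 10737418240
-
-The minimum memory of TCmalloc; when the memory used is less than this, memory is not returned to the operating system
-
 ### `thrift_client_retry_interval_ms`
 
 * Type: int64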