
[CORE][VL] Add option to limit the memory Gluten can use for each task to N = (memory / task slots) #3101

Merged
merged 39 commits into apache:main on Sep 14, 2023

Conversation

zhztheplayer
Member

@zhztheplayer zhztheplayer commented Sep 11, 2023

Add new option spark.gluten.memory.isolation (by default false):

Description of the option:

Enable isolated memory mode. If true, Gluten limits the maximum off-heap memory each task can use to X, where X = executor memory / max task slots. Setting it to true is recommended when Gluten serves concurrent queries within a single session, since not all memory Gluten allocates is guaranteed to be spillable; in that case, the feature should be enabled to avoid OOMs.

The implementation inserts a complete memory management layer, TreeMemoryConsumer, between Spark's memory manager and Gluten. Once the task memory limit is hit, TreeMemoryConsumer first tries calling the child spillers inside its own scope, without notifying Spark. After freeing some space, TreeMemoryConsumer continues to acquire memory from Spark.

Users are expected to use this feature to get rid of OOMs caused by pinned, non-spillable Velox memory (#3030) held by an earlier task that is commanded to spill when other new tasks arrive. This typically happens when the session is shared by several concurrent queries.
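
For example, the option can be enabled through the Spark configuration. A minimal sketch, assuming a plain SparkConf setup; the off-heap settings and size below are illustrative, not recommendations:

import org.apache.spark.SparkConf;

public class IsolationConfigExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        // The new option added by this PR (default: false).
        .set("spark.gluten.memory.isolation", "true")
        // Gluten's native memory is off-heap; the size here is illustrative.
        .set("spark.memory.offHeap.enabled", "true")
        .set("spark.memory.offHeap.size", "8g");
  }
}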

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename the commit message and pull request title to the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}


@github-actions

Run Gluten Clickhouse CI

Comment on lines 23 to 27
// A decorator to a task memory target, to restrict memory usage of the delegated
// memory target to X, X = free executor memory / task slots.
// Using this to prevent OOMs if the delegated memory target could possibly
// hold large memory blocks that are not spillable.
// See https://github.com/oap-project/gluten/issues/3030
Contributor

How do we restrict a specific memory consumer's usage to X? Return zero from acquireMemory, or just OOM?

Member Author

More design work is needed here. But overall, the consumer with the decorator should behave like a consumer registered to a task memory manager with a fixed limit (which is X).
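
To illustrate the idea, here is a minimal sketch of such a decorator, assuming a simplified interface; FixedLimitTarget and its methods are hypothetical names, not Gluten's actual API:

// A hypothetical decorator that caps acquisitions at a fixed limit X.
final class FixedLimitTarget {
  private final long limit; // X = executor memory / task slots
  private long used = 0L;

  FixedLimitTarget(long limit) {
    this.limit = limit;
  }

  // Grant at most the bytes remaining under the fixed limit. A caller that
  // receives less than requested is expected to spill, just like a consumer
  // registered to a task memory manager whose limit is X.
  long acquire(long bytes) {
    long granted = Math.min(bytes, limit - used);
    used += granted;
    return granted;
  }

  long release(long bytes) {
    long freed = Math.min(bytes, used);
    used -= freed;
    return freed;
  }
}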

Contributor

Great. I hope the upcoming design answers my question: what behavior is expected when a consumer hits its limit?

@winningsix winningsix mentioned this pull request Sep 12, 2023
import org.apache.spark.memory.TaskMemoryManager;

// A decorator to a task memory target, to restrict memory usage of the delegated
// memory target to X, X = free executor memory / task slots.
Contributor

Does the task slot count equal the configured CPU cores of the current Spark executor?

Member Author

Yes. There is a calculation:

def getTaskSlots(conf: SparkConf): Int = {
  val executorCores = SparkResourceUtil.getExecutorCores(conf)
  val taskCores = conf.getInt("spark.task.cpus", 1)
  // e.g. 8 executor cores / 1 CPU per task = 8 task slots
  executorCores / taskCores
}


private final TaskMemoryTarget delegated;

public IsolatedByTaskSlot(TaskMemoryTarget delegated) {
Contributor

I didn't find any usage of this class. Inferring from the name, does this mean we want to introduce the concept of a per-task memory pool?

Member Author

The PR was not ready at that time; the class was just a placeholder for my initial thoughts on the design.

@github-actions

Run Gluten Clickhouse CI

@zhztheplayer zhztheplayer marked this pull request as ready for review September 13, 2023 01:39
@zhztheplayer zhztheplayer changed the title WIP: [CORE][VL] Add option to limit the memory Gluten can use for each task to N = (memory / task slots) [CORE][VL] Add option to limit the memory Gluten can use for each task to N = (memory / task slots) Sep 13, 2023
@github-actions

Run Gluten Clickhouse CI

Comment on lines +128 to +132
while (q.peek() != null && remainingBytes > 0) {
TreeMemoryConsumerNode head = q.remove();
long spilled = spillTree(head, remainingBytes);
remainingBytes -= spilled;
}
Contributor

This logic intends to invoke spill from the smallest consumer to the largest, right?

Member Author

It's from largest to smallest.

In the future we may want to follow vanilla Spark's rule, which picks the smallest consumer among those larger than the target size.

Contributor

Still not fully understanding...

We sort children in descending order and invoke spillTree on the peek element's children recursively, which means we pick the largest consumer and then pass smaller consumers into spillTree; when a node has no children, we spill it and return.
I think the spillTree loop goes from largest to smallest, but the actual spills happen from smallest to largest. Please correct me if I'm wrong, thanks!

Another thing I don't understand: doesn't sorting at the root sort all of its children? It seems we re-sort a node's children on every spillTree invocation.

Member Author

Hi, thanks for helping check this code. I haven't verified it carefully, but let's take a simple example:

a 200 (self 80)
|-b 70
|-c 50 (self 10)
   |- d 30
   |- e 10

The code implements a post-order traversal of this tree (children first, self last), which means the visiting order is supposed to be

b (70) -> d (30) -> e (10) -> c (10) -> a (80)

This seems to be aligned with my initial assumption: largest to smallest, but self last.
Is that the same as your thoughts?

Another thing I don't understand: doesn't sorting at the root sort all of its children? It seems we re-sort a node's children on every spillTree invocation.

Yes, the data structure is not efficient, since we only sort the children of the current node (see the code q.addAll(node.children().values());). We can probably move to a TreeMap/TreeSet later on.
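
To make the traversal concrete, here is a minimal sketch of the scheme discussed above. TreeNode, usedBytes() and spillSelf() are hypothetical names standing in for the actual TreeMemoryConsumer types:

import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class SpillTreeSketch {
  interface TreeNode {
    List<TreeNode> children();
    long usedBytes();
    long spillSelf(long bytes); // returns the bytes actually freed
  }

  // Post-order, largest-first spill: children first (largest used size
  // first), then the node itself.
  static long spillTree(TreeNode node, long bytes) {
    long remainingBytes = bytes;
    PriorityQueue<TreeNode> q =
        new PriorityQueue<>(Comparator.comparingLong(TreeNode::usedBytes).reversed());
    q.addAll(node.children());
    while (q.peek() != null && remainingBytes > 0) {
      remainingBytes -= spillTree(q.remove(), remainingBytes);
    }
    if (remainingBytes > 0) {
      remainingBytes -= node.spillSelf(remainingBytes);
    }
    return bytes - remainingBytes; // total bytes freed in this subtree
  }
}

On the example tree above, this yields exactly the order b -> d -> e -> c -> a.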

Contributor

This seems to be aligned with my initial assumption: largest to smallest, but self last.
Is that the same as your thoughts?

Thanks, it's the same.

Comment on lines +195 to +200
def conservativeOffHeapMemorySize: Long =
conf.getConf(COLUMNAR_CONSERVATIVE_OFFHEAP_SIZE_IN_BYTES)

def conservativeTaskOffHeapMemorySize: Long =
conf.getConf(COLUMNAR_CONSERVATIVE_TASK_OFFHEAP_SIZE_IN_BYTES)

Contributor

Some questions here:

  1. What's the difference between these two configs?
  2. What does "conservative" mean?
  3. It seems conservativeOffHeapMemorySize is not used?

Member Author

@zhztheplayer zhztheplayer Sep 13, 2023

"Conservative" means the max size Gluten can consider is "safe" to use. The two new options are set in GlutenPlugin.scala using the following code:

// Pessimistic off-heap sizes, with the assumption that all non-borrowable storage memory
// determined by spark.memory.storageFraction was used.
val fraction = 1.0d - conf.getDouble("spark.memory.storageFraction", 0.5d)
val conservativeOffHeapSize = (offHeapSize * fraction).toLong
conf.set(
  GlutenConfig.GLUTEN_CONSERVATIVE_OFFHEAP_SIZE_IN_BYTES_KEY,
  conservativeOffHeapSize.toString)
val conservativeOffHeapPerTask = conservativeOffHeapSize / taskSlots
conf.set(
  GlutenConfig.GLUTEN_CONSERVATIVE_TASK_OFFHEAP_SIZE_IN_BYTES_KEY,
  conservativeOffHeapPerTask.toString)

The difference from the "non-conservative" options is that the "conservative" ones take storage memory into account. Assuming Spark has used 30% of off-heap memory for the storage memory pool, that memory would not be evicted even when the execution memory pool requests to borrow it.

I think for stability we may need to use the "conservative" options by default, since they're safer in most cases. I left the "non-conservative" options unchanged for compatibility reasons.
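
As a worked example of the calculation above (all numbers are illustrative, not taken from this PR):

public class ConservativeSizeExample {
  public static void main(String[] args) {
    // Illustrative inputs: 8 GiB off-heap, default storageFraction, 4 task slots.
    long offHeapSize = 8L << 30;
    double storageFraction = 0.5; // spark.memory.storageFraction default
    int taskSlots = 4;

    double fraction = 1.0d - storageFraction;
    long conservativeOffHeapSize = (long) (offHeapSize * fraction);
    long conservativeOffHeapPerTask = conservativeOffHeapSize / taskSlots;

    System.out.println(conservativeOffHeapSize);    // 4294967296 (4 GiB)
    System.out.println(conservativeOffHeapPerTask); // 1073741824 (1 GiB)
  }
}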

Member Author

Besides, I don't want users to set these "auto-generated" options, but we haven't developed a general way to guard against that yet.

Member Author

It seems conservativeOffHeapMemorySize is not used?

Yes, but it's worth keeping for future use.

Contributor

Thanks for the detailed explanation! I now understand what "conservative" means.

we may need to use the "conservative" options by default, since they're safer in most cases.

Please keep a switch config for that. Spark already has maybeGrowExecutionPool to shrink the storage pool and grow the execution pool; I'd prefer to let Spark control this logic. If we use the "conservative" option by default, we may not fully utilize the unused storage memory.

Member Author

I meant to use the "conservative" options only in Gluten's own code that needs to read the off-heap size, for example the shuffle writer and partial aggregation. It's not a goal to touch vanilla Spark's memory management.

Member Author

It's worth noting that Spark's "storage region size", determined by spark.memory.storageFraction, is not evictable while in use. That's why we added the conservative options.

@github-actions

Run Gluten Clickhouse CI

@zhztheplayer
Member Author

/Benchmark Velox

@GlutenPerfBot
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

| query | log/native_3101_time.csv | log/native_master_09_12_2023_892535655_time.csv | difference | percentage |
|-------|--------------------------|--------------------------------------------------|------------|------------|
| q1 | 43.68 | 44.37 | 0.692 | 101.58% |
| q2 | 24.44 | 23.17 | -1.271 | 94.80% |
| q3 | 35.85 | 36.71 | 0.865 | 102.41% |
| q4 | 41.31 | 41.27 | -0.040 | 99.90% |
| q5 | 69.59 | 69.43 | -0.163 | 99.77% |
| q6 | 6.93 | 5.00 | -1.934 | 72.09% |
| q7 | 85.07 | 83.94 | -1.134 | 98.67% |
| q8 | 81.76 | 82.77 | 1.018 | 101.25% |
| q9 | 116.53 | 115.52 | -1.016 | 99.13% |
| q10 | 47.72 | 46.05 | -1.674 | 96.49% |
| q11 | 19.41 | 19.04 | -0.364 | 98.12% |
| q12 | 27.89 | 25.94 | -1.947 | 93.02% |
| q13 | 52.37 | 51.71 | -0.664 | 98.73% |
| q14 | 19.38 | 13.72 | -5.662 | 70.79% |
| q15 | 28.27 | 27.98 | -0.290 | 98.97% |
| q16 | 15.88 | 15.60 | -0.280 | 98.24% |
| q17 | 120.52 | 119.89 | -0.631 | 99.48% |
| q18 | 162.48 | 163.11 | 0.633 | 100.39% |
| q19 | 12.43 | 12.03 | -0.405 | 96.74% |
| q20 | 29.19 | 38.93 | 9.745 | 133.39% |
| q21 | 237.04 | 248.23 | 11.197 | 104.72% |
| q22 | 15.43 | 15.56 | 0.122 | 100.79% |
| total | 1293.19 | 1299.99 | 6.800 | 100.53% |

@github-actions

Run Gluten Clickhouse CI

Contributor

@zhouyuan zhouyuan left a comment

👍
Another big change to the memory component!

@zhztheplayer zhztheplayer merged commit 7311703 into apache:main Sep 14, 2023
28 checks passed
@GlutenPerfBot
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

| query | log/native_3101_time.csv | log/native_master_09_13_2023_1dc14126a_time.csv | difference | percentage |
|-------|--------------------------|--------------------------------------------------|------------|------------|
| q1 | 43.59 | 43.54 | -0.052 | 99.88% |
| q2 | 24.71 | 24.58 | -0.128 | 99.48% |
| q3 | 37.38 | 37.58 | 0.202 | 100.54% |
| q4 | 41.46 | 41.77 | 0.306 | 100.74% |
| q5 | 70.47 | 69.36 | -1.105 | 98.43% |
| q6 | 6.66 | 6.35 | -0.305 | 95.42% |
| q7 | 85.08 | 85.84 | 0.753 | 100.88% |
| q8 | 80.21 | 79.11 | -1.106 | 98.62% |
| q9 | 116.00 | 118.08 | 2.079 | 101.79% |
| q10 | 46.87 | 45.32 | -1.551 | 96.69% |
| q11 | 19.98 | 19.61 | -0.371 | 98.14% |
| q12 | 24.00 | 26.16 | 2.162 | 109.01% |
| q13 | 48.93 | 50.61 | 1.677 | 103.43% |
| q14 | 16.24 | 16.64 | 0.401 | 102.47% |
| q15 | 31.32 | 26.80 | -4.521 | 85.56% |
| q16 | 15.94 | 15.83 | -0.112 | 99.30% |
| q17 | 119.27 | 120.95 | 1.677 | 101.41% |
| q18 | 160.27 | 162.53 | 2.262 | 101.41% |
| q19 | 12.56 | 13.26 | 0.705 | 105.62% |
| q20 | 30.09 | 29.37 | -0.722 | 97.60% |
| q21 | 236.86 | 235.96 | -0.900 | 99.62% |
| q22 | 15.58 | 15.92 | 0.341 | 102.19% |
| total | 1283.46 | 1285.15 | 1.692 | 100.13% |
