
Conversation


@crazydemo crazydemo commented Jul 22, 2024

Tracking: Issue #158

  • Port GC_V1's alloc / free implementation to lib/gc/ExecutionEngine/CPURuntime/Parallel.cpp
  • Add gc_aligned_malloc / gc_aligned_free / gc_thread_aligned_malloc / gc_thread_aligned_free ops in the CPURuntime dialect (a rough sketch of the corresponding runtime entry points follows this list)
  • Add a memref.alloc / memref.free => cpuruntime.gc_alloc / cpuruntime.gc_free conversion pass right before ConvertSCFToOpenMPPass
  • Along with an analysis of whether the conversion is safe (BufferViewFlowAnalysis could be used to analyze multiple aliases; the current special case is a returned / yielded memref).
  • Lower cpuruntime.alloc / free to LLVM; check whether an existing mechanism can be reused.
  • Merge the thread allocator and the main allocator into one.
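
For context, a minimal sketch of what the two runtime entry points could look like, assuming the gcAlignedMalloc symbol name and the 64-byte default alignment that appear later in this PR. The gcAlignedFree name and the plain std::aligned_alloc backing are placeholders, not the PR's pool-backed implementation:

```cpp
#include <cstddef>
#include <cstdlib>

// 64-byte alignment, matching the defaultAlignment constant discussed below.
static constexpr size_t kAlignment = 64;

// Sketch only: the real entry points are backed by the GC_V1 memory pool;
// std::aligned_alloc is used here purely as a stand-in.
extern "C" void *gcAlignedMalloc(size_t sz) noexcept {
  if (sz == 0)
    return nullptr; // zero-size behavior is discussed in the review below
  // Round up so the size is a multiple of the alignment, as std::aligned_alloc
  // requires.
  size_t rounded = (sz + kAlignment - 1) & ~(kAlignment - 1);
  return std::aligned_alloc(kAlignment, rounded);
}

// The free symbol name is assumed by analogy with gcAlignedMalloc.
extern "C" void gcAlignedFree(void *ptr) noexcept { std::free(ptr); }
```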

@crazydemo crazydemo requested a review from Menooker July 22, 2024 08:46
@crazydemo
Author

Not sure whether we still need the thread_local_registry_t.

// For dynamic size memref
%memref = cpuruntime.alloc (%width) : memref<64x?xf32>

// For static size memref
%memref = cpuruntime.alloc () : memref<64x32xf32>


Just a discussion: 1) alloc or main_alloc? 2) Shall we merge alloc with thread_alloc and add an attr, like `cpuruntime.alloc () {main} : memref<64x32xf32>`? For reference: https://github.com/intel/mlir-extensions/blob/417a44959726f38b36ba494ff25e18c331c956bb/include/imex/Dialect/GPUX/IR/GPUXOps.td#L192

They have a shared attr for gpux.alloc.

I have no preference on merging alloc/thread_alloc or splitting them into two ops.

Author


I think we can merge alloc and thread_alloc to simplify the op list. Maybe I can do this refactor in the next PR, as we may need to add more allocators.

} // namespace mlir

extern "C" void *gcAlignedMalloc(size_t sz) noexcept {
if (sz == 0) {


Why do we need this? If it is really needed, you can wrap `sz == 0` with `unlikely(...)` for better performance.

Author


This aligns with the legacy implementation. Maybe we can delete this.
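
If the zero-size check is kept, the reviewer's suggestion could look roughly like the sketch below, assuming LLVM_UNLIKELY from llvm/Support/Compiler.h is available in this build; the pool-backed allocation path is elided:

```cpp
#include "llvm/Support/Compiler.h" // LLVM_UNLIKELY expands to __builtin_expect
#include <cstddef>

extern "C" void *gcAlignedMalloc(size_t sz) noexcept {
  // Mark the zero-size case as the cold path.
  if (LLVM_UNLIKELY(sz == 0))
    return nullptr;
  // ... normal pool-backed allocation (elided in this sketch) ...
  return nullptr; // placeholder so the sketch is self-contained
}
```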

@crazydemo crazydemo force-pushed the zhangyan/allocator branch from 5249e55 to e9ac38a on July 30, 2024 03:50
@crazydemo
Author

Use a UnitAttr to indicate whether to use the thread-local allocator. A performance check has been conducted; no performance regression was found.

@crazydemo crazydemo force-pushed the zhangyan/allocator branch from e9ac38a to 26148fb on July 30, 2024 03:56
@crazydemo crazydemo linked an issue Jul 30, 2024 that may be closed by this pull request
6 tasks
@crazydemo
Author

@Menooker @ciyongch Could you please help review the new changes?

let description = [{
The `cpuruntime.dealloc` operation frees the region of memory referenced by a
memref which was originally created by the `cpuruntime.alloc` operation.
It is similar to the `std.dealloc` op.


memref.dealloc?



Why don't we need the threadLocal attr?

Author


Updated the description, and added thread_local to dealloc as well. I think you are right; there is no need to attach the pool type in the runtime. memorypool.cpp is kept the same as before.

if (op->hasTrait<OpTrait::ReturnLike>()) {
  for (Value operand : op->getOperands()) {
    if (isa<MemRefType>(operand.getType())) {
      Value v = getViewBase(operand);


Do we still need this? I think alias analysis is enough.

Author


Yes, alias analysis is enough; removed the getViewBase.
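
As a rough illustration of the escape check being discussed (not the PR's exact code), an allocation could be treated as unsafe to convert when any alias of its buffer is an operand of a ReturnLike op. The BufferViewFlowAnalysis header path and API below are assumptions about the upstream MLIR version in use:

```cpp
#include "mlir/Dialect/Bufferization/Transforms/BufferViewFlowAnalysis.h"
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/Interfaces/ControlFlowInterfaces.h"

using namespace mlir;

// Returns true if `allocated` (the result of an allocation inside `funcOp`)
// may flow into a ReturnLike op such as func.return or scf.yield.
static bool escapesThroughReturnLike(Operation *funcOp, Value allocated) {
  BufferViewFlowAnalysis analysis(funcOp);
  auto aliases = analysis.resolve(allocated); // views derived from the buffer
  bool escapes = false;
  funcOp->walk([&](Operation *op) {
    if (!op->hasTrait<OpTrait::ReturnLike>())
      return;
    for (Value operand : op->getOperands())
      if (isa<MemRefType>(operand.getType()) && aliases.count(operand))
        escapes = true;
  });
  return escapes;
}
```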


func.func @doThreadAlloc() {
  scf.forall (%arg2) in (3) {
    %m0 = cpuruntime.alloc threadLocal () : memref<13xf32>


Seems that MLIR prefers the thread_local style in the IR? Need to confirm that...

Author


Confirmed the style; for attributes it is thread_local. Fixed.


ciyongch commented Aug 1, 2024

overall LGTM.

@ciyongch ciyongch left a comment


```mlir
// For dynamic size memref
%memref = cpuruntime.alloc (%width) : memref<64x?xf32>
```
Contributor


Are there defined semantics for memref<?x?xf32>?

Author


Yes, it also works for dynamic shapes like memref<?x?xf32>.

class CPURuntimeDialect;
}

class PassManager;
Contributor


Do you use the forward declaration somehow?

Author


This is redundant, removed.

Comment on lines +31 to +35
constexpr size_t threadlocalChunkSize = 4 * 1024 * 1024;
// 16MB
constexpr size_t mainChunkSize = 16 * 1024 * 1024;

static constexpr size_t defaultAlignment = 64;
Contributor


These could be converted to pass parameters or something similar. Not required, just nice-to-have.

Author


Thanks for the suggestion. These values are fixed and have been chosen based on their proven performance on ICX, SPR, and EMR machines with our GC_V1 system.

We can certainly consider making these parameters configurable in the future, which would allow us to fine-tune performance for additional platforms as we expand our testing.
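
Not what this PR does, and a different mechanism than the pass parameters suggested above: purely to illustrate making these sizes configurable, a self-contained environment-variable override could look like the sketch below (the variable names are hypothetical):

```cpp
#include <cstddef>
#include <cstdlib>

// Parse a positive size from an environment variable, falling back to the
// compiled-in default when the variable is unset or malformed.
static size_t sizeFromEnv(const char *name, size_t fallback) {
  if (const char *v = std::getenv(name)) {
    char *end = nullptr;
    unsigned long long parsed = std::strtoull(v, &end, 10);
    if (end && *end == '\0' && parsed != 0)
      return static_cast<size_t>(parsed);
  }
  return fallback;
}

// Defaults match the PR: 4MB thread-local chunks, 16MB main chunks.
static const size_t threadlocalChunkSize =
    sizeFromEnv("GC_THREADLOCAL_CHUNK_SIZE", 4 * 1024 * 1024);
static const size_t mainChunkSize =
    sizeFromEnv("GC_MAIN_CHUNK_SIZE", 16 * 1024 * 1024);
```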

Contributor


This looks like a very basic implementation of a heap manager. I think we should use UMF and not reinvent the wheel.

];
}

def ConvertMemRefToCPURuntime : Pass<"convert-memref-to-cpuruntime"> {
Contributor


Can't this be a func::FuncOp pass? It should just replace ops with appropriate calls.

Author


Made it a func::FuncOp pass.
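
Roughly, the func::FuncOp version of the pass could be shaped like the sketch below (not the PR's exact code; the cpuruntime op creation is elided because those op classes are generated from this project's TableGen definitions):

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Pass/Pass.h"

using namespace mlir;

namespace {
// Sketch of the conversion as a function-level pass: it only needs to walk a
// single func.func and rewrite allocation ops in place.
struct ConvertMemRefToCPURuntimeSketch
    : public PassWrapper<ConvertMemRefToCPURuntimeSketch,
                         OperationPass<func::FuncOp>> {
  void runOnOperation() override {
    func::FuncOp func = getOperation();
    func.walk([&](memref::AllocOp alloc) {
      // An OpBuilder anchored at `alloc` would create the cpuruntime alloc op
      // here (hypothetical generated class, so the call is elided), then the
      // original memref.alloc and its matching memref.dealloc would be erased,
      // skipping buffers that escape the function (see the ReturnLike /
      // alias-analysis discussion above).
      (void)alloc;
    });
  }
};
} // namespace
```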

@crazydemo
Author

Comparing the bf16 single-matmul performance of the UMF allocator vs. the GC runtime allocator: in geomean, the GC runtime allocator shows a near 3% perf gain over UMF. In particular, the cases with small batch size (1 / 16) see a 5-20% performance gain with the GC runtime allocator.

When comparing UMF with the default allocator, they show comparable performance.

The related UMF implementation can be found here.

| bs | hidden_size | tile | umf time (ms) | gc time (ms) | umf/gc |
|---:|---|---:|---:|---:|---:|
| 1 | 4096x4096 | 32 | 0.016119 | 0.014205 | 1.134748 |
| 16 | 4096x4096 | 32 | 0.024732 | 0.023336 | 1.059823 |
| 32 | 4096x4096 | 32 | 0.038178 | 0.03664 | 1.04198 |
| 64 | 4096x4096 | 32 | 0.077707 | 0.076976 | 1.009499 |
| 512 | 4096x4096 | 32 | 0.772379 | 0.774239 | 0.997598 |
| 1024 | 1024x1024 | 32 | 0.086527 | 0.085596 | 1.010887 |
| 2048 | 2048x2048 | 32 | 0.497962 | 0.467387 | 1.065415 |
| 4096 | 4096x4096 | 32 | 6.832565 | 6.882722 | 0.992713 |
| 1 | 4096x4096 | 64 | 0.015138 | 0.014422 | 1.049675 |
| 16 | 4096x4096 | 64 | 0.023275 | 0.023362 | 0.996289 |
| 32 | 4096x4096 | 64 | 0.040873 | 0.040978 | 0.997439 |
| 64 | 4096x4096 | 64 | 0.075946 | 0.073673 | 1.030847 |
| 512 | 4096x4096 | 64 | 0.625766 | 0.621129 | 1.007466 |
| 1024 | 1024x1024 | 64 | 0.073073 | 0.074128 | 0.985776 |
| 2048 | 2048x2048 | 64 | 0.447378 | 0.444295 | 1.006938 |
| 4096 | 4096x4096 | 64 | 4.765007 | 4.787503 | 0.995301 |
| 1 | 4096x4096 | 128 | 0.015481 | 0.014327 | 1.080501 |
| 16 | 4096x4096 | 128 | 0.026465 | 0.021078 | 1.255556 |
| 32 | 4096x4096 | 128 | 0.037734 | 0.039361 | 0.958676 |
| 64 | 4096x4096 | 128 | 0.082689 | 0.077784 | 1.063058 |
| 512 | 4096x4096 | 128 | 0.643699 | 0.647983 | 0.99339 |
| 1024 | 1024x1024 | 128 | 0.075029 | 0.075593 | 0.992529 |
| 2048 | 2048x2048 | 128 | 0.492153 | 0.487092 | 1.01039 |
| 4096 | 4096x4096 | 128 | 4.960785 | 4.915923 | 1.009126 |
| | | | | geomean | 1.029455 |

@ZhennanQin @ciyongch @Menooker @kurapov-peter

@ZhennanQin
Contributor

Another concern: should we introduce UMF as another dependency just for a simple interface? I don't see much value in using UMF for CPU, only an extra dependency and slower performance.


Successfully merging this pull request may close these issues.

Add runtime allocator
