
Conversation


@crazydemo crazydemo commented Jul 22, 2024

Tracking: Issue #158

  • Port GC_V1's alloc / free implementation to lib/gc/ExecutionEngine/CPURuntime/Parallel.cpp
  • Add gc_aligned_malloc / gc_aligned_free / gc_thread_aligned_malloc / gc_thread_aligned_free ops in the CPURuntime dialect (a rough sketch of the corresponding runtime entry points follows this list)
  • Add a memref.alloc / memref.free => cpuruntime.gc_alloc / cpuruntime.gc_free conversion pass right before ConvertSCFToOpenMPPass
  • Along with an analysis of whether the conversion is safe (BufferViewFlowAnalysis could be used to analyze multiple aliases; the current special case is a returned / yielded memref).
  • Lower cpuruntime.alloc / free to LLVM; check whether an existing mechanism can be reused.
  • Merge the thread allocator and the main allocator into one.
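
For context, a minimal sketch of what the two runtime entry points could look like, assuming the gcAlignedMalloc symbol name and the 64-byte default alignment that appear later in this PR. The gcAlignedFree name and the plain std::aligned_alloc backing are placeholders, not the PR's pool-backed implementation:

```cpp
#include <cstddef>
#include <cstdlib>

// 64-byte alignment, matching the defaultAlignment constant discussed below.
static constexpr size_t kAlignment = 64;

// Sketch only: the real entry points are backed by the GC_V1 memory pool;
// std::aligned_alloc is used here purely as a stand-in.
extern "C" void *gcAlignedMalloc(size_t sz) noexcept {
  if (sz == 0)
    return nullptr; // zero-size behavior is discussed in the review below
  // Round up so the size is a multiple of the alignment, as std::aligned_alloc
  // requires.
  size_t rounded = (sz + kAlignment - 1) & ~(kAlignment - 1);
  return std::aligned_alloc(kAlignment, rounded);
}

// The free symbol name is assumed by analogy with gcAlignedMalloc.
extern "C" void gcAlignedFree(void *ptr) noexcept { std::free(ptr); }
```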

@crazydemo crazydemo requested a review from Menooker July 22, 2024 08:46
@crazydemo
Author

Not sure whether we still need the thread_local_registry_t.

// For dynamic size memref
%memref = cpuruntime.alloc (%width) : memref<64x?xf32>

// For static size memref
%memref = cpuruntime.alloc () : memref<64x32xf32>


Just a discussion: 1) alloc or main_alloc? 2) Shall we merge alloc with thread_alloc and add an attr, like `cpuruntime.alloc () {main} : memref<64x32xf32>`? For reference: https://github.com/intel/mlir-extensions/blob/417a44959726f38b36ba494ff25e18c331c956bb/include/imex/Dialect/GPUX/IR/GPUXOps.td#L192

They have a shared attr for gpux.alloc.

I have no preference on merging alloc/thread_alloc or splitting them into two ops.

Author


I think we can merge alloc and thread_alloc to simplify the op list. Maybe I can do this refactor in the next PR, as we may need to add more allocators.

} // namespace mlir

extern "C" void *gcAlignedMalloc(size_t sz) noexcept {
if (sz == 0) {


Why do we need this? If it is really needed, you can wrap `sz == 0` with `unlikely(...)` for better performance.

Author


This aligns with the legacy implementation. Maybe we can delete this.
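
If the zero-size check is kept, the reviewer's suggestion could look roughly like the sketch below, assuming LLVM_UNLIKELY from llvm/Support/Compiler.h is available in this build; the pool-backed allocation path is elided:

```cpp
#include "llvm/Support/Compiler.h" // LLVM_UNLIKELY expands to __builtin_expect
#include <cstddef>

extern "C" void *gcAlignedMalloc(size_t sz) noexcept {
  // Mark the zero-size case as the cold path.
  if (LLVM_UNLIKELY(sz == 0))
    return nullptr;
  // ... normal pool-backed allocation (elided in this sketch) ...
  return nullptr; // placeholder so the sketch is self-contained
}
```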

@crazydemo crazydemo force-pushed the zhangyan/allocator branch from 5249e55 to e9ac38a on July 30, 2024 03:50
@crazydemo
Author

Use a UnitAttr to indicate whether to use the thread-local allocator. A performance check has been conducted; no performance regression was found.

@crazydemo crazydemo force-pushed the zhangyan/allocator branch from e9ac38a to 26148fb on July 30, 2024 03:56
@crazydemo crazydemo linked an issue Jul 30, 2024 that may be closed by this pull request
6 tasks
@crazydemo
Author

@Menooker @ciyongch Could you please help review the new changes?

let description = [{
The `cpuruntime.dealloc` operation frees the region of memory referenced by a
memref which was originally created by the `cpuruntime.alloc` operation.
It is similar to the `std.dealloc` op.


memref.dealloc?



Why don't we need the threadLocal attr?

Author


Updated the description, and added thread_local to dealloc as well. I think you are right; there is no need to attach the pool type in the runtime. memorypool.cpp is kept the same as before.

if (op->hasTrait<OpTrait::ReturnLike>()) {
  for (Value operand : op->getOperands()) {
    if (isa<MemRefType>(operand.getType())) {
      Value v = getViewBase(operand);


Do we still need this? I think alias analysis is enough.

Author


Yes, alias analysis is enough; removed the getViewBase.
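
As a rough illustration of the escape check being discussed (not the PR's exact code), an allocation could be treated as unsafe to convert when any alias of its buffer is an operand of a ReturnLike op. The BufferViewFlowAnalysis header path and API below are assumptions about the upstream MLIR version in use:

```cpp
#include "mlir/Dialect/Bufferization/Transforms/BufferViewFlowAnalysis.h"
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/Interfaces/ControlFlowInterfaces.h"

using namespace mlir;

// Returns true if `allocated` (the result of an allocation inside `funcOp`)
// may flow into a ReturnLike op such as func.return or scf.yield.
static bool escapesThroughReturnLike(Operation *funcOp, Value allocated) {
  BufferViewFlowAnalysis analysis(funcOp);
  auto aliases = analysis.resolve(allocated); // views derived from the buffer
  bool escapes = false;
  funcOp->walk([&](Operation *op) {
    if (!op->hasTrait<OpTrait::ReturnLike>())
      return;
    for (Value operand : op->getOperands())
      if (isa<MemRefType>(operand.getType()) && aliases.count(operand))
        escapes = true;
  });
  return escapes;
}
```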


func.func @doThreadAlloc() {
  scf.forall (%arg2) in (3) {
    %m0 = cpuruntime.alloc threadLocal () : memref<13xf32>


Seems that MLIR prefers the thread_local style in the IR? Need to confirm that...

Author


Confirmed the style; for attributes it is thread_local. Fixed.


ciyongch commented Aug 1, 2024

overall LGTM.

@ciyongch ciyongch left a comment


```mlir
// For dynamic size memref
%memref = cpuruntime.alloc (%width) : memref<64x?xf32>
```
Contributor


Are there defined semantics for memref<?x?xf32>?

Author


Yes, it also works for dynamic shapes like memref<?x?xf32>.

class CPURuntimeDialect;
}

class PassManager;
Contributor


Do you use the forward declaration somehow?

Author


This is redundant, removed.

Comment on lines +31 to +35
constexpr size_t threadlocalChunkSize = 4 * 1024 * 1024;
// 16MB
constexpr size_t mainChunkSize = 16 * 1024 * 1024;

static constexpr size_t defaultAlignment = 64;
Contributor


These could be converted to pass parameters or something similar. Not required, just nice-to-have.

Author


Thanks for the suggestion. These values are fixed and have been chosen based on their proven performance on ICX, SPR, and EMR machines with our GC_V1 system.

We can certainly consider making these parameters configurable in the future, which would allow us to fine-tune performance for additional platforms as we expand our testing.
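
Not what this PR does, and a different mechanism than the pass parameters suggested above: purely to illustrate making these sizes configurable, a self-contained environment-variable override could look like the sketch below (the variable names are hypothetical):

```cpp
#include <cstddef>
#include <cstdlib>

// Parse a positive size from an environment variable, falling back to the
// compiled-in default when the variable is unset or malformed.
static size_t sizeFromEnv(const char *name, size_t fallback) {
  if (const char *v = std::getenv(name)) {
    char *end = nullptr;
    unsigned long long parsed = std::strtoull(v, &end, 10);
    if (end && *end == '\0' && parsed != 0)
      return static_cast<size_t>(parsed);
  }
  return fallback;
}

// Defaults match the PR: 4MB thread-local chunks, 16MB main chunks.
static const size_t threadlocalChunkSize =
    sizeFromEnv("GC_THREADLOCAL_CHUNK_SIZE", 4 * 1024 * 1024);
static const size_t mainChunkSize =
    sizeFromEnv("GC_MAIN_CHUNK_SIZE", 16 * 1024 * 1024);
```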

Contributor


This looks like a very basic implementation of a heap manager. I think we should use UMF and not reinvent the wheel.

];
}

def ConvertMemRefToCPURuntime : Pass<"convert-memref-to-cpuruntime"> {
Contributor


Can't this be a func::FuncOp pass? It should just replace ops with appropriate calls.

Author


Made it a func::FuncOp pass.
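
Roughly, the func::FuncOp version of the pass could be shaped like the sketch below (not the PR's exact code; the cpuruntime op creation is elided because those op classes are generated from this project's TableGen definitions):

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Pass/Pass.h"

using namespace mlir;

namespace {
// Sketch of the conversion as a function-level pass: it only needs to walk a
// single func.func and rewrite allocation ops in place.
struct ConvertMemRefToCPURuntimeSketch
    : public PassWrapper<ConvertMemRefToCPURuntimeSketch,
                         OperationPass<func::FuncOp>> {
  void runOnOperation() override {
    func::FuncOp func = getOperation();
    func.walk([&](memref::AllocOp alloc) {
      // An OpBuilder anchored at `alloc` would create the cpuruntime alloc op
      // here (hypothetical generated class, so the call is elided), then the
      // original memref.alloc and its matching memref.dealloc would be erased,
      // skipping buffers that escape the function (see the ReturnLike /
      // alias-analysis discussion above).
      (void)alloc;
    });
  }
};
} // namespace
```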

@crazydemo
Author

Comparing the bf16 single-matmul performance of the UMF allocator vs. the GC runtime allocator: in geomean, the GC runtime allocator shows a near 3% perf gain over UMF. In particular, the cases with small batch size (1 / 16) see a 5-20% performance gain with the GC runtime allocator.

When comparing UMF with the default allocator, they show comparable performance.

The related UMF implementation can be found here.

| bs | hidden_size | tile | umf time (ms) | gc time (ms) | umf/gc |
|---:|---|---:|---:|---:|---:|
| 1 | 4096x4096 | 32 | 0.016119 | 0.014205 | 1.134748 |
| 16 | 4096x4096 | 32 | 0.024732 | 0.023336 | 1.059823 |
| 32 | 4096x4096 | 32 | 0.038178 | 0.03664 | 1.04198 |
| 64 | 4096x4096 | 32 | 0.077707 | 0.076976 | 1.009499 |
| 512 | 4096x4096 | 32 | 0.772379 | 0.774239 | 0.997598 |
| 1024 | 1024x1024 | 32 | 0.086527 | 0.085596 | 1.010887 |
| 2048 | 2048x2048 | 32 | 0.497962 | 0.467387 | 1.065415 |
| 4096 | 4096x4096 | 32 | 6.832565 | 6.882722 | 0.992713 |
| 1 | 4096x4096 | 64 | 0.015138 | 0.014422 | 1.049675 |
| 16 | 4096x4096 | 64 | 0.023275 | 0.023362 | 0.996289 |
| 32 | 4096x4096 | 64 | 0.040873 | 0.040978 | 0.997439 |
| 64 | 4096x4096 | 64 | 0.075946 | 0.073673 | 1.030847 |
| 512 | 4096x4096 | 64 | 0.625766 | 0.621129 | 1.007466 |
| 1024 | 1024x1024 | 64 | 0.073073 | 0.074128 | 0.985776 |
| 2048 | 2048x2048 | 64 | 0.447378 | 0.444295 | 1.006938 |
| 4096 | 4096x4096 | 64 | 4.765007 | 4.787503 | 0.995301 |
| 1 | 4096x4096 | 128 | 0.015481 | 0.014327 | 1.080501 |
| 16 | 4096x4096 | 128 | 0.026465 | 0.021078 | 1.255556 |
| 32 | 4096x4096 | 128 | 0.037734 | 0.039361 | 0.958676 |
| 64 | 4096x4096 | 128 | 0.082689 | 0.077784 | 1.063058 |
| 512 | 4096x4096 | 128 | 0.643699 | 0.647983 | 0.99339 |
| 1024 | 1024x1024 | 128 | 0.075029 | 0.075593 | 0.992529 |
| 2048 | 2048x2048 | 128 | 0.492153 | 0.487092 | 1.01039 |
| 4096 | 4096x4096 | 128 | 4.960785 | 4.915923 | 1.009126 |
| | | | | geomean | 1.029455 |

@ZhennanQin @ciyongch @Menooker @kurapov-peter

@ZhennanQin
Contributor

Another concern: should we introduce UMF as another dependency just for a simple interface? I don't see much value in using UMF for CPU, only an extra dependency and slower performance.


Successfully merging this pull request may close these issues.

Add runtime allocator
