-
Notifications
You must be signed in to change notification settings - Fork 16
Add runtime allocator #180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
not sure whether we still need the |
| %memref = cpuruntime.alloc (%width) : memref<64x?xf32> | ||
|
|
||
| // For static size memref | ||
| %memref = cpuruntime.alloc () : memref<64x32xf32> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a discussion: 1) alloc or main_alloc? 2) shall we merge alloc with thread_alloc, and add an attr like cpuruntime.alloc () {main} : memref<64x32xf32>. For Reference, https://github.com/intel/mlir-extensions/blob/417a44959726f38b36ba494ff25e18c331c956bb/include/imex/Dialect/GPUX/IR/GPUXOps.td#L192
They have a shared attr for gpux.alloc.
I have no preference on merging alloc/thread_alloc or splitting them to two ops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can merge alloc and thread_alloc to simplify the op list. Maybe I can do this refactor in the next PR, as we may need to add more allocators.
| } // namespace mlir | ||
|
|
||
| extern "C" void *gcAlignedMalloc(size_t sz) noexcept { | ||
| if (sz == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this? If it is really needed, you can wrap sz==0 with unlikely(...) for better performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This aligns with the legacy implementation. Maybe we can delete this.
5249e55 to
e9ac38a
Compare
|
use |
e9ac38a to
26148fb
Compare
| let description = [{ | ||
| The `cpuruntime.dealloc` operation frees the region of memory referenced by a | ||
| memref which was originally created by the `cpuruntime.alloc` operation. | ||
| It is similar to the `std.dealloc` op. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memref.dealloc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we need the threadLocal attr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated description, and add thread_local to dealloc as well. I think you are right, there's no need to attach pooltype in runtime. Keep the memorypool.cpp the same as before.
| if (op->hasTrait<OpTrait::ReturnLike>()) { | ||
| for (Value operand : op->getOperands()) { | ||
| if (isa<MemRefType>(operand.getType())) { | ||
| Value v = getViewBase(operand); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this? I think alias analysis is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, alias analysis is enough, removed the getViewBase
|
|
||
| func.func @doThreadAlloc() { | ||
| scf.forall (%arg2) in (3) { | ||
| %m0 = cpuruntime.alloc threadLocal () : memref<13xf32> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems that MLIR like thread_local style in the IR? Need to confirm that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
confirmed the style, for attributes is thread_local, fixed.
|
overall LGTM. |
ciyongch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall LGTM.
|
|
||
| ```mlir | ||
| // For dynamic size memref | ||
| %memref = cpuruntime.alloc (%width) : memref<64x?xf32> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there semantics for memref<?x?xf32>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, It also works for dynamic shape like memref<?x?xf32>
| class CPURuntimeDialect; | ||
| } | ||
|
|
||
| class PassManager; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you use the forward declaration somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is redundant, removed.
| constexpr size_t threadlocalChunkSize = 4 * 1024 * 1024; | ||
| // 16MB | ||
| constexpr size_t mainChunkSize = 16 * 1024 * 1024; | ||
|
|
||
| static constexpr size_t defaultAlignment = 64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can be converted to pass parameters or smth. Not required, just nice-to-have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. These values are fixed and have been chosen based on their proven performance on ICX, SPR, and EMR machines with our GC_V1 system.
We can certainly consider making these parameters configurable in the future, which would allow us to fine-tune performance for additional platforms as we expand our testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a very basic implementation of a heap manager. I think we should use UMF and not reinvent the wheel.
include/gc/Transforms/Passes.td
Outdated
| ]; | ||
| } | ||
|
|
||
| def ConvertMemRefToCPURuntime : Pass<"convert-memref-to-cpuruntime"> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this be a func::FuncOp pass? It should just replace ops with appropriate calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made it a func::FuncOp pass.
|
Compare the bf16 single matmul performance with When comparing The related
|
|
Another concern is whether we should introduce UMF as another dependency just because of a simple interface? I don't see much value in using UMF for CPU, but extra dependency and slow performance. |
Tracking: Issue#158
lib/gc/ExecutionEngine/CPURuntime/Parallel.cppgc_aligned_malloc / gc_aligned_free / gc_thread_aligned_malloc / gc_thread_aligned_freeops in CPURuntime dialectConvertSCFToOpenMPPassBufferViewFlowAnalysiscould be used for do analysis on multiple alias. Current special case, return / yield memref)