Privatize local memory for tensors #448

Open

newling opened this issue Jun 19, 2024 · 0 comments
newling commented Jun 19, 2024

With the pack-peel pipeline, a matmul (m=n=1024, k=512) followed by the addition of a bias (a 1-d vector with 1024 values) results in the following final allocations (I've renamed the SSA values for clarity):

%bias_local = memref.alloc() : memref<1x16x4xf32, 2 : i32>
%bias_shared = memref.alloc() : memref<2x64xf32, 1 : i32>
%B_local = memref.alloc() : memref<1x1x16x8x8x4xbf16, 2 : i32>
%A_local = memref.alloc() : memref<1x1x8x16x4x8xbf16, 2 : i32>
%B_shared = memref.alloc() : memref<1x2x64x64xbf16, 1 : i32>
%A_shared = memref.alloc() : memref<2x1x64x64xbf16, 1 : i32>
%C_local = memref.alloc() : memref<2x2x16x16x4x4xf32, 2 : i32>
%C_shared = memref.alloc() : memref<2x2x64x64xf32, 1 : i32> 

The above is for a design using a 2x2 array of AIE cores. The IR contains a loop over the 2x2 cores, indexing into these arrays as follows:

scf.forall (%arg2, %arg3) in (2, 2) {

  // Copy from slice of shared-memory A, to local-memory for A
  %subview_14 = memref.subview %A_shared[%arg2, 0, 0, 0] [1, 1, 64, 64] [1, 1, 1, 1]...
  iree_linalg_ext.pack %subview_14 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 8] into %A_local ...

  %subview_15 = memref.subview %B_shared[0, %arg3, 0, 0] [1, 1, 64, 64] [1, 1, 1, 1]...
  iree_linalg_ext.pack %subview_15 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [8, 4] into %B_local ...

  %subview_16 = memref.subview %C_local[%arg2, %arg3, 0, 0, 0, 0] [1, 1, 16, 16, 4, 4] [1, 1, 1, 1, 1, 1]...

  ...
}

For A and B, a view into shared memory is copied to the entire local buffer for A and B. For C, a slice of the local buffer is taken.
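To make this asymmetry concrete, here is a minimal NumPy sketch (shapes are collapsed to 2-d tiles for readability, and all names are illustrative, not taken from the IR):

```python
import numpy as np

# Illustrative sketch of the asymmetry described above: A is copied from a
# shared-buffer slice into a whole per-core buffer, while C is a slice of
# one big local buffer. All names and shapes here are hypothetical.
CORES = (2, 2)
TILE = (64, 64)

A_shared = np.ones((2, 1, *TILE))    # shared (L1-like) buffer for A
C_local = np.zeros((*CORES, *TILE))  # ONE local (L2-like) buffer covering all cores' C tiles

for i in range(CORES[0]):
    for j in range(CORES[1]):
        # A: a slice of the shared buffer is copied into a core's ENTIRE
        # local buffer (the role played by the pack from %subview_14).
        A_local = np.empty(TILE)
        A_local[...] = A_shared[i, 0]
        # C: each core takes only a SLICE of the one big local buffer
        # (the role played by %subview_16).
        C_tile = C_local[i, j]
        C_tile += A_local  # stand-in for the core's compute, in place
```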

I find this very confusing, and think it would be much better if C were already 'privatized' per core, so that instead of

%C_local = memref.alloc() : memref<2x2x16x16x4x4xf32, 2 : i32>

the allocation was

%C_local = memref.alloc() : memref<1x1x16x16x4x4xf32, 2 : i32>

and then it would effectively just be

%subview_16 = %C_local

This seems more in line with how the GPU abstraction works (I'm thinking of OpenCL kernels). There shouldn't ever be a single contiguous block of memory representing all cores' data memories, IMO.
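For comparison, the privatized form proposed above could be sketched like this, where the 'subview' of C degenerates to the buffer itself (again purely illustrative Python, not the actual IR):

```python
import numpy as np

CORES = (2, 2)
TILE = (64, 64)

# Privatized form: each core owns its own allocation (analogous to an
# OpenCL __local buffer); no single allocation spans all 2x2 cores'
# data memories. Names are hypothetical.
per_core_C = {}
for i in range(CORES[0]):
    for j in range(CORES[1]):
        C_local = np.zeros(TILE)  # one allocation PER core
        subview = C_local         # %subview_16 = %C_local: the view is the whole buffer
        subview += 1.0            # stand-in for the core's compute, in place
        per_core_C[(i, j)] = C_local
```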
