Privatize local memory for tensors #448

Open

newling opened this issue Jun 19, 2024 · 0 comments
newling commented Jun 19, 2024

With the pack-peel pipeline, a matmul (m=n=1024, k=512) followed by the addition of a bias (a 1-d vector with 1024 values) results in the following final allocations (I've renamed the SSA values for clarity):

%bias_local = memref.alloc() : memref<1x16x4xf32, 2 : i32>
%bias_shared = memref.alloc() : memref<2x64xf32, 1 : i32>
%B_local = memref.alloc() : memref<1x1x16x8x8x4xbf16, 2 : i32>
%A_local = memref.alloc() : memref<1x1x8x16x4x8xbf16, 2 : i32>
%B_shared = memref.alloc() : memref<1x2x64x64xbf16, 1 : i32>
%A_shared = memref.alloc() : memref<2x1x64x64xbf16, 1 : i32>
%C_local = memref.alloc() : memref<2x2x16x16x4x4xf32, 2 : i32>
%C_shared = memref.alloc() : memref<2x2x64x64xf32, 1 : i32> 

The above is for a design using a 2x2 array of AIE cores. The IR contains a loop over the 2x2 cores, indexing into these arrays as follows:

scf.forall (%arg2, %arg3) in (2, 2) {

  // Copy from slice of shared-memory A, to local-memory for A
  %subview_14 = memref.subview %A_shared[%arg2, 0, 0, 0] [1, 1, 64, 64] [1, 1, 1, 1]...
  iree_linalg_ext.pack %subview_14 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 8] into %A_local ...

  %subview_15 = memref.subview %B_shared[0, %arg3, 0, 0] [1, 1, 64, 64] [1, 1, 1, 1]...
  iree_linalg_ext.pack %subview_15 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [8, 4] into %B_local ...

  %subview_16 = memref.subview %C_local[%arg2, %arg3, 0, 0, 0, 0] [1, 1, 16, 16, 4, 4] [1, 1, 1, 1, 1, 1]...

  ...
}

For A and B, a view into shared memory is copied to the entire local buffer for A and B. For C, a slice of the local buffer is taken.
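To make this asymmetry concrete, here is a minimal NumPy sketch (shapes are collapsed to 2-d tiles for readability, and all names are illustrative, not taken from the IR):

```python
import numpy as np

# Illustrative sketch of the asymmetry described above: A is copied from a
# shared-buffer slice into a whole per-core buffer, while C is a slice of
# one big local buffer. All names and shapes here are hypothetical.
CORES = (2, 2)
TILE = (64, 64)

A_shared = np.ones((2, 1, *TILE))    # shared (L1-like) buffer for A
C_local = np.zeros((*CORES, *TILE))  # ONE local (L2-like) buffer covering all cores' C tiles

for i in range(CORES[0]):
    for j in range(CORES[1]):
        # A: a slice of the shared buffer is copied into a core's ENTIRE
        # local buffer (the role played by the pack from %subview_14).
        A_local = np.empty(TILE)
        A_local[...] = A_shared[i, 0]
        # C: each core takes only a SLICE of the one big local buffer
        # (the role played by %subview_16).
        C_tile = C_local[i, j]
        C_tile += A_local  # stand-in for the core's compute, in place
```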

I find this very confusing, and think it would be much better if C were already 'privatized' per core, so that instead of

%C_local = memref.alloc() : memref<2x2x16x16x4x4xf32, 2 : i32>

the allocation was

%C_local = memref.alloc() : memref<1x1x16x16x4x4xf32, 2 : i32>

and then it would effectively just be

%subview_16 = %C_local

This seems more in line with how the GPU abstraction works (I'm thinking of OpenCL kernels). There shouldn't ever be a single contiguous block of memory representing all cores' data memories, IMO.
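For comparison, the privatized form proposed above could be sketched like this, where the 'subview' of C degenerates to the buffer itself (again purely illustrative Python, not the actual IR):

```python
import numpy as np

CORES = (2, 2)
TILE = (64, 64)

# Privatized form: each core owns its own allocation (analogous to an
# OpenCL __local buffer); no single allocation spans all 2x2 cores'
# data memories. Names are hypothetical.
per_core_C = {}
for i in range(CORES[0]):
    for j in range(CORES[1]):
        C_local = np.zeros(TILE)  # one allocation PER core
        subview = C_local         # %subview_16 = %C_local: the view is the whole buffer
        subview += 1.0            # stand-in for the core's compute, in place
        per_core_C[(i, j)] = C_local
```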
