# TeAAL specifications on variants of Loops' SpMV implementations (Group-Mapped)

Description: Assign an equal number of work tiles to each group of threads (a warp or a block). Threads within each group process individual work items in parallel.

In this version, each `tile` represents a row of $A$, and each `atom` represents a nonzero entry of $A$. In other words, a thread block is assigned to each row of $A$, and its threads process that row's nonzero entries in parallel.
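The tile/atom mapping can be sketched in plain Python. This is a minimal sequential emulation, not the Loops implementation: the function name `group_mapped_spmv`, the `group_size` parameter, and the small CSR matrix are all made up for illustration. Each row (tile) is handled by one group, and each "thread" of the group strides over that row's nonzeros (atoms).

```python
def group_mapped_spmv(offsets, indices, values, x, group_size):
    """Sequentially emulate one thread group per row (tile) of a CSR matrix."""
    num_rows = len(offsets) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):              # one tile (row) per group
        start, end = offsets[row], offsets[row + 1]
        for lane in range(group_size):       # each thread in the group
            # Thread `lane` handles atoms start+lane, start+lane+group_size, ...
            for nz_idx in range(start + lane, end, group_size):
                # On the GPU this accumulation would be an atomicAdd.
                y[row] += values[nz_idx] * x[indices[nz_idx]]
    return y


# Hypothetical 3x4 matrix: [[1, 0, 2, 0], [0, 0, 0, 0], [0, 3, 4, 5]]
offsets = [0, 2, 2, 5]
indices = [0, 2, 1, 2, 3]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0]

y = group_mapped_spmv(offsets, indices, values, x, group_size=2)
print(y)  # [7.0, 0.0, 38.0]
```

Note that the empty row 1 still occupies a group even though it contributes no work.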

Template: https://github.com/gunrock/loops/blob/main/include/loops/algorithms/spmv/group_mapped.cuh

Scheduler (referenced as `config` below): https://github.com/gunrock/loops/blob/main/include/loops/schedule/group_mapped.hxx

GPU kernel template (for reference only; do not execute it):

In [None]:
__global__ void __launch_bounds__(threads_per_block, 2) __group_mapped(...) {
  // Initialize storage and schedule.
  // Using block_mapped: an entire tile of work is assigned to a thread block.
  using setup_t = schedule::block_mapped<threads_per_block, index_t, offset_t>;
  using storage_t = typename setup_t::storage_t;
  __shared__ storage_t temporary_storage;

  // Construct the schedule.
  setup_t config(temporary_storage, offsets, rows, nnz);
  auto p = config.partition();  // Assigns work tiles to each thread block.

  // Loop over the total work; each thread processes individual work items.
  for (auto virtual_atom : config.atom_accessor(p)) {
    auto virtual_tile = config.tile_accessor(virtual_atom, p);

    if (!(config.is_valid_accessor(virtual_tile, p)))
      continue;

    // Perform a binary search to find the tile index.
    auto row = config.tile_id(virtual_tile, p);

    auto nz_idx = config.atom_id(virtual_atom, row, virtual_tile, p);
    atomicAdd(&(y[row]), values[nz_idx] * x[indices[nz_idx]]);
  }
}
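
The kernel comment mentions a binary search for the tile index. The scheduler's actual implementation is not reproduced here; the sketch below only illustrates the idea on a CSR `offsets` array using Python's standard `bisect` module. The helper `tile_of_atom` is a hypothetical name, and the sketch ignores the virtual tile/atom indirection used by `config`.

```python
from bisect import bisect_right


def tile_of_atom(offsets, atom_idx):
    """Return the row (tile) whose nonzero range contains atom_idx.

    The returned row satisfies offsets[row] <= atom_idx < offsets[row + 1].
    """
    return bisect_right(offsets, atom_idx) - 1


# Hypothetical CSR offsets for 3 rows with 2, 0, and 3 nonzeros.
offsets = [0, 2, 2, 5]
rows = [tile_of_atom(offsets, atom) for atom in range(offsets[-1])]
print(rows)  # [0, 0, 2, 2, 2]
```

Using `bisect_right` rather than `bisect_left` means that empty rows (row 1 here) are skipped, so every atom maps to the row that actually owns it.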

## Imports

Import the necessary modules.

In [1]:
# HiFiber boilerplate

from fibertree_bootstrap import *

fibertree_bootstrap(style="tree", animation='movie')

# Compilation boilerplate

import os
import sys
sys.path.insert(0, "../..")

from src import utils

Running bootstrap
The fibertree module is already installed and available to import



## Initialization

Initialize the input tensors.

For simplicity, a thread warp and a thread block have the same size (`WARP_SIZE = BLOCK_SIZE`). We also assume that each GPU SM processes one warp/block per cycle.

In [2]:
I = 8
J = 8

# Hardware Specification
NUM_SM = 2 # Number of GPU SMs 
WARP_SIZE = 2 # Number of threads per warp
NUM_THREADS = NUM_SM * WARP_SIZE # Total number of threads

print(f"Hardware Specification\n  NUM_SM: {NUM_SM}, WARP_SIZE: {WARP_SIZE}, NUM_THREADS: {NUM_THREADS}")

seed = 1

A_IJ = Tensor.fromRandom(rank_ids=["I", "J"], shape=[I, J], seed=seed, density=[0.9, 0.6], name="A")
B_J = Tensor.fromRandom(rank_ids=["J"], shape=[J], seed=seed + 1, density=[1], name="B")

Hardware Specification
  NUM_SM: 2, WARP_SIZE: 2, NUM_THREADS: 4


## TeAAL Specifications

The rows of matrix $A$ are partitioned across the SMs, with one warp/block per row. A warp/block can be assigned to a row that contains only zeros.

Note that the current TeAAL specification does not allow specifying the rank to which `opt: slip` applies. As a result, the SMs are synchronized with each other.
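
To get a feel for what this synchronization may cost, here is a rough cycle estimate in plain Python. It is only a sketch under stated assumptions, not TeAAL's or HiFiber's timing model: it assumes one warp-sized chunk of nonzeros per SM per cycle, and it assumes that "synchronized" means every SM waits for the slowest row in its group of `num_sm` rows. The function `estimate_cycles` and the nonzero counts are hypothetical.

```python
from math import ceil


def estimate_cycles(row_nnz, num_sm, warp_size, synchronized=True):
    """Estimate cycles to process rows with the given nonzero counts."""
    # Each row needs ceil(nnz / warp_size) warp-wide steps.
    steps = [ceil(n / warp_size) for n in row_nnz]
    if synchronized:
        # Assumption: all SMs wait for the slowest row of each group.
        groups = [steps[k:k + num_sm] for k in range(0, len(steps), num_sm)]
        return sum(max(group) for group in groups)
    # Assumption: SM s processes rows s, s + num_sm, ... back to back.
    return max(sum(steps[s::num_sm]) for s in range(num_sm))


row_nnz = [4, 1, 0, 6]  # hypothetical nonzeros per row
print(estimate_cycles(row_nnz, num_sm=2, warp_size=2))                      # 5
print(estimate_cycles(row_nnz, num_sm=2, warp_size=2, synchronized=False))  # 4
```

Under these assumptions, the gap between the two numbers grows with the imbalance between rows that share a group.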

In [3]:
yaml = """
einsum:
  declaration:
    A: [I, J]
    B: [J]
    Z: [I]
  expressions:
    - Z[i] = A[i, j] * B[j]
mapping:
  rank-order:
    A: [I, J]
    B: [J]
    Z: [I]
  partitioning:
    Z:
      I: [uniform_shape(NUM_SM)]
      J: [uniform_occupancy(A.WARP_SIZE)]
  loop-order:
    Z: [I1, I0, J1, J0] 
    # I1: Iterates over the row partitions
    # I0: Rows within one partition (partition size = NUM_SM)
    # J1: Iterates over the nonzero partitions of a given row
    # J0: Nonzeros within one partition (size = WARP_SIZE; the last partition of a row can hold fewer)
  spacetime:
    Z:
      space: [I0, J0]
      time: [I1, J1]
      #opt: slip # Currently not working as intended. Refer to the note above.
"""

utils.compile(yaml)
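
The two partitioning directives in the specification above can be illustrated without fibertree. The sketch below represents a sparse matrix as a dict of rows and uses two hypothetical helpers, `split_uniform_shape` and `split_uniform_occupancy`, which are not the fibertree or TeAAL API. It assumes that `uniform_shape` splits the coordinate space into equal ranges, and that `uniform_occupancy` splits by the number of nonzeros.

```python
def split_uniform_shape(rows, size):
    """Group rows into partitions that each cover `size` consecutive coordinates."""
    groups = {}
    for i, fiber in rows.items():
        # Key each partition by the first coordinate of its range.
        groups.setdefault(i // size * size, {})[i] = fiber
    return groups


def split_uniform_occupancy(fiber, size):
    """Split one row's nonzeros into chunks holding at most `size` elements."""
    items = sorted(fiber.items())
    return [dict(items[k:k + size]) for k in range(0, len(items), size)]


# Hypothetical sparse matrix: {row: {column: value}}; row 2 is empty.
A = {0: {1: 5, 3: 2, 4: 7}, 1: {0: 1}, 3: {2: 4, 6: 9}}

by_rows = split_uniform_shape(A, size=2)         # analogous to I -> I1, I0
chunks = split_uniform_occupancy(A[0], size=2)   # analogous to J -> J1, J0
print(list(by_rows))  # [0, 2]
print(chunks)         # [{1: 5, 3: 2}, {4: 7}]
```

The second row partition contains only row 3, because shape-based splitting counts coordinates rather than nonzeros. The last chunk of row 0 holds a single element, which matches the remark that `J0` can be smaller than `WARP_SIZE`.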

In [4]:
# Autogenerated HiFiber

Z_I1I0 = Tensor(rank_ids=["I1", "I0"], name="Z")
tmp0 = A_IJ
tmp1 = tmp0.splitUniform(NUM_SM, depth=0)
A_I1I0J = tmp1
A_I1I0J.setRankIds(rank_ids=["I1", "I0", "J"])
z_i1 = Z_I1I0.getRoot()
b_j = B_J.getRoot()
a_i1 = A_I1I0J.getRoot()
canvas = createCanvas(A_I1I0J, B_J, Z_I1I0)
timestamps = {}
B_J = Tensor.fromFiber(rank_ids=["J"], fiber=b_j, name="B")
for i1, (z_i0, a_i0) in z_i1 << a_i1:
    for i0_pos, (i0, (z_ref, a_j)) in enumerate(z_i0 << a_i0):
        A_J = Tensor.fromFiber(rank_ids=["J"], fiber=a_j, name="A")
        tmp2 = A_J
        tmp3 = tmp2.splitEqual(WARP_SIZE)
        A_J1J0 = tmp3
        A_J1J0.setRankIds(rank_ids=["J1", "J0"])
        a_j1 = A_J1J0.getRoot()
        tmp4 = B_J
        tmp5 = tmp4.splitNonUniform(a_j1)
        B_J1J0 = tmp5
        B_J1J0.setRankIds(rank_ids=["J1", "J0"])
        b_j1 = B_J1J0.getRoot()
        for j1, (a_j0, b_j0) in a_j1 & b_j1:
            for j0_pos, (j0, (a_val, b_val)) in enumerate(a_j0 & b_j0):
                z_ref += a_val * b_val
                if (i0_pos, j0_pos) in timestamps.keys():
                    timestamps[(i0_pos, j0_pos)] += 1
                else:
                    timestamps[(i0_pos, j0_pos)] = 1
                canvas.addActivity((i1, i0, j0), (j0,), (i1, i0), spacetime=((i0_pos, j0_pos), (timestamps[(i0_pos, j0_pos)] - 1,)))
tmp6 = Z_I1I0
tmp7 = tmp6.mergeRanks(depth=0, levels=1, coord_style="absolute")
tmp7.setRankIds(rank_ids=["I"])
Z_I = tmp7
displayCanvas(canvas)

Starting simulation
Finished simulation



## Check Results

Check that the generated code computes the correct result.

**Note**: Run this cell only after compiling and running the kernel in the cell above.

In [6]:
utils.check_matrix_vector_mul(A_IJ, B_J, Z_I)

Result correct? True
