# TeAAL specifications on variants of Loops' SpMV implementations (Group-Mapped)

Warning: The TeAAL specs are incomplete. Each row partition still needs to be made independent of the others, so that a group can move on to its next assigned row as soon as it finishes the current one.

Description: Assign an equal number of work tiles to each group of threads (a warp or a block). Threads within each group process individual work items in parallel.

In this version, each `tile` represents a row of $A$, and each `atom` represents a nonzero entry of $A$. In other words, a thread block is assigned to each row of $A$.
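
The tile/atom mapping described above can be sketched in plain Python. This is a minimal, sequential simulation and not the Loops scheduler: the function name `group_mapped_spmv` and the parameter `group_size` are hypothetical, the matrix is assumed to be stored in CSR form (`offsets`, `indices`, `values`), and the `+=` stands in for the kernel's `atomicAdd`.

```python
def group_mapped_spmv(offsets, indices, values, x, group_size):
    """Simulate group-mapped SpMV on a CSR matrix (illustrative sketch only)."""
    rows = len(offsets) - 1
    y = [0.0] * rows
    for row in range(rows):  # one tile (row) per group of threads
        start, end = offsets[row], offsets[row + 1]
        for lane in range(group_size):  # the threads within the group
            # Each lane strides over the row's atoms (nonzero entries).
            for nz_idx in range(start + lane, end, group_size):
                # Stands in for atomicAdd(&y[row], ...) in the kernel.
                y[row] += values[nz_idx] * x[indices[nz_idx]]
    return y


# 3x3 example matrix: [[1, 0, 2], [0, 3, 0], [4, 5, 6]]
offsets = [0, 2, 3, 6]
indices = [0, 2, 1, 0, 1, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 1.0, 1.0]

y = group_mapped_spmv(offsets, indices, values, x, group_size=2)
print(y)
```

The result does not depend on `group_size`; only the order in which the atoms of a row are visited changes.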

Template: https://github.com/gunrock/loops/blob/main/include/loops/algorithms/spmv/group_mapped.cuh

Scheduler (referenced as `config` below): https://github.com/gunrock/loops/blob/main/include/loops/schedule/group_mapped.hxx

GPU kernel template (for reference only; do not execute it):

In [None]:
'''
template <std::size_t threads_per_block,
          typename index_t,
          typename offset_t,
          typename type_t>
__global__ void __launch_bounds__(threads_per_block, 2)
    __group_mapped(std::size_t rows,
                   std::size_t cols,
                   std::size_t nnz,
                   offset_t* offsets,
                   index_t* indices,
                   const type_t* values,
                   const type_t* x,
                   type_t* y) {
  using setup_t = schedule::block_mapped<threads_per_block, index_t, offset_t>;

  /// Allocate temporary storage for the schedule.
  using storage_t = typename setup_t::storage_t;
  __shared__ storage_t temporary_storage;

  /// Construct the schedule.
  setup_t config(temporary_storage, offsets, rows, nnz);
  auto p = config.partition();

  for (auto virtual_atom : config.atom_accessor(p)) {
    auto virtual_tile = config.tile_accessor(virtual_atom, p);

    if (!(config.is_valid_accessor(virtual_tile, p)))
      continue;

    auto row = config.tile_id(virtual_tile, p);

    auto nz_idx = config.atom_id(virtual_atom, row, virtual_tile, p);
    atomicAdd(&(y[row]), values[nz_idx] * x[indices[nz_idx]]);
  }
}
'''

## Imports

Import the necessary modules.

In [None]:
# HiFiber boilerplate

from fibertree_bootstrap import *

fibertree_bootstrap(style="tree", animation='movie')

# Compilation boilerplate

import os
import sys
sys.path.insert(0, "../..")

from src import utils

## Initialization

Initialize the input tensors.

For simplicity, suppose that each GPU SM processes one warp or block of `BLOCK_SIZE` threads per cycle.
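
One possible reading of this simplification is sketched below as a rough cycle estimate. It is an assumption for illustration, not the model TeAAL generates: the helper `lockstep_cycles` is hypothetical, rows are assumed to be handed out to the SMs in consecutive chunks of `num_sm`, and all SMs are assumed to wait for the slowest row in a chunk before moving on.

```python
from math import ceil


def lockstep_cycles(row_nnz, num_sm, block_size):
    """Estimate cycles when SMs process chunks of rows in lockstep (sketch)."""
    total = 0
    for start in range(0, len(row_nnz), num_sm):
        chunk = row_nnz[start:start + num_sm]  # rows processed side by side
        # A row with n nonzeros needs ceil(n / block_size) cycles on its SM;
        # the chunk finishes when its slowest row finishes.
        total += max(ceil(n / block_size) for n in chunk)
    return total


row_nnz = [3, 1, 0, 4]  # hypothetical nonzero counts per row
cycles = lockstep_cycles(row_nnz, num_sm=2, block_size=2)
print(cycles)
```

Under these assumptions a short row leaves its SM idle while a longer row in the same chunk finishes, which is the kind of coupling between rows that a fully independent schedule would avoid.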

In [None]:
I = 4
J = 4

NUM_SM = 2 # Number of GPU SMs 
BLOCK_SIZE = 2 # Number of threads per block 
NUM_THREADS_PER_CYCLE = BLOCK_SIZE * NUM_SM # Total number of threads processed per cycle

print(f"NUM_SM: {NUM_SM}, BLOCK_SIZE: {BLOCK_SIZE}, NUM_THREADS_PER_CYCLE: {NUM_THREADS_PER_CYCLE}")
seed = 1

A_IJ = Tensor.fromRandom(rank_ids=["I", "J"], shape=[I, J], seed=seed, density=[0.9, 0.6], name="A")
#A_IJ = Tensor.fromUncompressed(rank_ids=["I", "J"], shape=[I, J], root=[[1],[2],[0],[4]], name="A")
B_J = Tensor.fromRandom(rank_ids=["J"], shape=[J], seed=seed + 1, density=[1], name="B")

TeAAL Specifications:

In [None]:
yaml = """
einsum:
  declaration:
    A: [I, J]
    B: [J]
    Z: [I]
  expressions:
    - Z[i] = A[i, j] * B[j]
mapping:
  rank-order:
    A: [I, J]
    B: [J]
    Z: [I]
  partitioning:
    Z:
      I: [uniform_shape(NUM_SM)]
      J: [uniform_occupancy(A.BLOCK_SIZE)]
  loop-order:
    Z: [I1, I0, J1, J0] 
  spacetime:
    Z:
      space: [I0, J0]
      time: [I1, J1]
      opt: slip
"""

utils.compile(yaml)
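
The two partitioning directives in the spec above can be illustrated on plain coordinate lists. This is a minimal sketch under stated assumptions, not the TeAAL or HiFiber implementation: the helper functions are hypothetical stand-ins named after the directives, each chunk is keyed by its first coordinate, and the example coordinates are made up.

```python
def uniform_shape(coords, shape):
    """Group coordinates into chunks that each span `shape` coordinate values."""
    chunks = {}
    for c in coords:
        chunks.setdefault((c // shape) * shape, []).append(c)
    return chunks


def uniform_occupancy(coords, occupancy):
    """Group coordinates into chunks that each hold `occupancy` present elements."""
    return {
        coords[i]: coords[i:i + occupancy]
        for i in range(0, len(coords), occupancy)
    }


NUM_SM, BLOCK_SIZE = 2, 2
nonempty_rows = [0, 1, 3]  # hypothetical I coordinates present in A
row_cols = [0, 2, 3]       # hypothetical J coordinates of one row's nonzeros

i_chunks = uniform_shape(nonempty_rows, NUM_SM)
j_chunks = uniform_occupancy(row_cols, BLOCK_SIZE)
print(i_chunks)
print(j_chunks)
```

In this sketch the `I` split depends only on coordinate values, so a chunk can hold fewer than `NUM_SM` rows, while the `J` split depends on where the nonzeros are, so every chunk except possibly the last holds exactly `BLOCK_SIZE` nonzeros.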

## Check Results

Check that the generated code computes the correct result.

**Note**: Run this cell only after compiling and running the kernel in the cell above.

In [None]:
utils.check_matrix_vector_mul(A_IJ, B_J, Z_I)