# TeAAL specifications on variants of Loops' SpMV implementations (Thread-Mapped)

Description: Each thread is assigned a fixed number of work tiles. The work items within each tile are processed sequentially by that thread.

In this version, each `tile` represents a row of $A$, and each `atom` represents a nonzero entry of $A$. In other words, one thread is assigned to each row of $A$. The TeAAL specification is therefore identical to the original.

Template: https://github.com/gunrock/loops/blob/main/include/loops/algorithms/spmv/thread_mapped.cuh

Scheduler (referenced as `config` below): https://github.com/gunrock/loops/blob/main/include/loops/schedule/thread_mapped.hxx

GPU Kernel Template (for reference only; do not execute it):

In [None]:
'''
template <typename setup_t,
          typename index_t,
          typename offset_t,
          typename type_t>
__global__ void __thread_mapped(setup_t config,
                                const std::size_t rows,
                                const std::size_t cols,
                                const std::size_t nnz,
                                const offset_t* offsets,
                                const index_t* indices,
                                const type_t* values,
                                const type_t* x,
                                type_t* y) {
  /// Equivalent to:
  /// row = blockIdx.x * blockDim.x + threadIdx.x; (init)
  /// row < rows; (boundary condition)
  /// row += gridDim.x * blockDim.x. (step)
  for (auto row : config.tiles()) {
    type_t sum = 0;

    /// Equivalent to:
    /// for (offset_t nz = offset; nz < end; ++nz)
    for (auto nz : config.atoms(row)) {
      sum += values[nz] * x[indices[nz]];
    }

    // Output
    y[row] = sum;
  }
}
'''
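
To make the schedule in the template concrete, the following is a minimal CPU sketch of the same loop structure in plain Python. It is not part of Loops or TeAAL: the function name `thread_mapped_spmv`, the `num_threads` parameter, and the small CSR example are assumptions made for illustration, and the "threads" are simply visited one after another.

```python
def thread_mapped_spmv(offsets, indices, values, x, num_threads):
    """Emulate the thread-mapped schedule sequentially on the CPU (illustrative only)."""
    rows = len(offsets) - 1
    y = [0.0] * rows
    # tid stands in for blockIdx.x * blockDim.x + threadIdx.x
    for tid in range(num_threads):
        # Grid-stride loop over tiles (rows): start at tid, step by the total thread count
        for row in range(tid, rows, num_threads):
            total = 0.0
            # Atoms of this tile: the nonzeros of the row in CSR order
            for nz in range(offsets[row], offsets[row + 1]):
                total += values[nz] * x[indices[nz]]
            y[row] = total
    return y


# Hypothetical 3x3 matrix [[1, 0, 2], [0, 0, 0], [0, 3, 4]] in CSR form
offsets = [0, 2, 2, 4]
indices = [0, 2, 1, 2]
values = [1.0, 2.0, 3.0, 4.0]
x = [1.0, 1.0, 1.0]

y = thread_mapped_spmv(offsets, indices, values, x, num_threads=2)
print(y)
```

With two emulated threads, thread 0 handles rows 0 and 2 and thread 1 handles row 1, which mirrors the init, boundary, and step comments in the kernel.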

## Imports

Import the necessary modules.

In [None]:
# HiFiber boilerplate

from fibertree_bootstrap import *

fibertree_bootstrap(style="tree", animation='movie')

# Compilation boilerplate

import os
import sys
sys.path.insert(0, "../../")

from src import utils

## Initialization

Initialize the input tensors.

For simplicity, assume that each GPU SM processes one thread block (or warp) of `BLOCK_SIZE` threads per cycle, so `NUM_SM * BLOCK_SIZE` threads are processed per cycle in total.

In [None]:
I = 8
J = 8

NUM_SM = 2 # Number of GPU SMs 
BLOCK_SIZE = 2 # Number of threads per block 
NUM_THREADS_PER_CYCLE = BLOCK_SIZE * NUM_SM # Total number of threads processed per cycle

print(f"NUM_SM: {NUM_SM}, BLOCK_SIZE: {BLOCK_SIZE}, NUM_THREADS_PER_CYCLE: {NUM_THREADS_PER_CYCLE}")
seed = 1

A_IJ = Tensor.fromRandom(rank_ids=["I", "J"], shape=[I, J], seed=seed, density=[0.9, 0.6], name="A")
B_J = Tensor.fromRandom(rank_ids=["J"], shape=[J], seed=seed + 1, density=[1], name="B")

TeAAL Specifications:

In [6]:
yaml = """
einsum:
  declaration:
    A: [I, J]
    B: [J]
    Z: [I]
  expressions:
    - Z[i] = A[i, j] * B[j]
mapping:
  rank-order:
    A: [I, J]
    B: [J]
    Z: [I]
  partitioning:
    Z:
      I: [uniform_shape(NUM_THREADS_PER_CYCLE)]
  loop-order:
    Z: [I1, I0, J]
  spacetime:
    Z:
      space: [I0]
      time: [I1, J]
"""

utils.compile(yaml)
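
The partitioning and spacetime entries above can be read as splitting each row index into a time step and a thread slot. The sketch below illustrates that reading with plain integer arithmetic. It assumes a dense row range and uses relative positions for `I1` and `I0`; the coordinate labels in the generated fibertree may differ, and the helper `split_row` is hypothetical.

```python
I = 8
NUM_SM = 2
BLOCK_SIZE = 2
NUM_THREADS_PER_CYCLE = BLOCK_SIZE * NUM_SM


def split_row(i, tile=NUM_THREADS_PER_CYCLE):
    """Split row i into (i1, i0): i1 is the time step, i0 the thread slot (illustrative)."""
    return i // tile, i % tile


# Group rows by time step; rows in the same step are processed by different threads
schedule = {}
for i in range(I):
    i1, i0 = split_row(i)
    schedule.setdefault(i1, []).append((i0, i))

for step, lanes in sorted(schedule.items()):
    mapping = ", ".join(f"thread {i0} -> row {i}" for i0, i in lanes)
    print(f"step I1={step}: {mapping}")
```

Under these assumptions, `I0` is the rank mapped to space and `I1` and `J` are mapped to time, so rows 0 to 3 are processed in the first step and rows 4 to 7 in the second, with each thread walking its row's nonzeros sequentially.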

## Check Results

Check that generated code computes the correct result.

**Note**: Run this cell only after the cell above has compiled and run the kernel.

In [None]:
utils.check_matrix_vector_mul(A_IJ, B_J, Z_I)
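
For intuition about what a correctness check compares against, here is a small reference computation of $Z_i = \sum_j A_{i,j} B_j$ in plain Python. It is not the implementation of `utils.check_matrix_vector_mul`; the dict-based sparse layout and the name `reference_spmv` are assumptions chosen for illustration.

```python
def reference_spmv(a, b):
    """Compute Z[i] = sum_j A[i, j] * B[j] for sparse A stored as {i: {j: value}}."""
    z = {}
    for i, row in a.items():
        acc = sum(v * b.get(j, 0) for j, v in row.items())
        if acc != 0:
            # Keep only nonzero outputs, matching a sparse result
            z[i] = acc
    return z


# Hypothetical inputs: row 1 of A is empty, so it produces no output entry
a = {0: {0: 1, 2: 2}, 2: {1: 3}}
b = {0: 5, 1: 7, 2: 11}

expected = reference_spmv(a, b)
print(expected)
```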