# TeAAL specifications on variants of Loops' SpMV implementations (Thread-Mapped)

Description: Assign a fixed number of work tiles to each thread. The work items (atoms) of each tile are processed sequentially within that thread.

In this version, each `tile` represents a row of $A$ and each `atom` represents a nonzero entry of $A$; in other words, each row of $A$ is assigned to a thread. The TeAAL specification is therefore identical to the original.

Template: https://github.com/gunrock/loops/blob/main/include/loops/algorithms/spmv/thread_mapped.cuh

Scheduler (referenced as `config` below): https://github.com/gunrock/loops/blob/main/include/loops/schedule/thread_mapped.hxx

GPU Kernel Template (Use it as a reference, don't execute it):

In [None]:
__global__ void __thread_mapped(...) {
  for (auto A_row : config.tiles()) {  // Loop over the rows assigned to this thread.

    type_t sum = 0;
    for (auto A_nz_idx : config.atoms(A_row)) {  // Loop over the nonzero entries of the current row.
      sum += A_values[A_nz_idx] * B[indices[A_nz_idx]];
    }

    // Output
    Z[A_row] = sum;
  }
}
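
The same schedule can be sketched in plain Python to make the tile/atom loops concrete. This is only an illustration, not the Loops implementation: the function name `spmv_thread_mapped`, the CSR array names, and the assumption that rows are handed out to threads with a stride of `num_threads` are all hypothetical, and the threads are emulated one after another rather than in parallel.

```python
def spmv_thread_mapped(row_ptr, col_idx, values, b, num_threads):
    """Emulate a thread-mapped SpMV over a CSR matrix, one thread at a time."""
    num_rows = len(row_ptr) - 1
    z = [0] * num_rows
    for tid in range(num_threads):  # each iteration plays the role of one thread
        # Assumption: thread `tid` owns rows tid, tid + num_threads, ...
        for row in range(tid, num_rows, num_threads):  # tiles
            total = 0
            for nz in range(row_ptr[row], row_ptr[row + 1]):  # atoms
                total += values[nz] * b[col_idx[nz]]
            z[row] = total
    return z


# 3x3 example matrix [[1, 0, 2], [0, 0, 0], [0, 3, 4]] in CSR form.
row_ptr = [0, 2, 2, 4]
col_idx = [0, 2, 1, 2]
values = [1, 2, 3, 4]
b = [1, 1, 1]

z = spmv_thread_mapped(row_ptr, col_idx, values, b, num_threads=2)
print(z)
```

Row 1 is empty, so the thread that owns it skips the inner loop and writes a zero.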

## Imports

Import the necessary modules.

In [None]:
# HiFiber boilerplate

from fibertree_bootstrap import *

fibertree_bootstrap(style="tree", animation='movie')

# Compilation boilerplate

import os
import sys
sys.path.insert(0, "../../")

from src import utils

## Initialization

Initialize the input tensors.

For simplicity, a thread warp and a thread block have the same size (`WARP_SIZE = BLOCK_SIZE`), and each GPU SM is assumed to process one warp (equivalently, one block) per cycle.

In [None]:
M = 8
K = 8

# Hardware Specification
NUM_SM = 2 # Number of GPU SMs 
WARP_SIZE = 2 # Number of threads per warp
NUM_THREADS = NUM_SM * WARP_SIZE # Total number of threads

print(f"Hardware Specification\n  NUM_SM: {NUM_SM}, WARP_SIZE: {WARP_SIZE}, NUM_THREADS: {NUM_THREADS}")

seed = 1

A_MK = Tensor.fromRandom(rank_ids=["M", "K"], shape=[M, K], seed=seed, density=[0.9, 0.6], name="A")
B_K = Tensor.fromRandom(rank_ids=["K"], shape=[K], seed=seed + 1, density=[1], name="B")

## TeAAL Specifications

Rows of matrix $A$ are partitioned across the SMs' threads. A thread may be assigned a row that contains only zeros.

Note that the current TeAAL specification does not allow specifying a rank for `opt: slip`. As a result, the model implies a synchronization across the SMs.
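
As a rough illustration of the row partitioning, the sketch below groups the `M` row coordinates into consecutive partitions of `NUM_THREADS` rows. It is plain Python, not TeAAL or HiFiber code; `split_rows` is a hypothetical helper, and the constants simply mirror the initialization cell.

```python
M = 8
NUM_SM = 2
WARP_SIZE = 2
NUM_THREADS = NUM_SM * WARP_SIZE


def split_rows(num_rows, shape):
    """Group row coordinates into consecutive partitions of `shape` rows."""
    return [list(range(start, min(start + shape, num_rows)))
            for start in range(0, num_rows, shape)]


partitions = split_rows(M, NUM_THREADS)
for index, rows in enumerate(partitions):
    # Assumption: inside a partition, each row is handled by a different thread.
    print(f"partition {index}: rows {rows}")
```

With `M = 8` and `NUM_THREADS = 4` this yields two partitions of four rows each, so every thread handles two rows over the course of the computation.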

In [None]:
yaml = """
einsum:
  declaration:
    A: [M, K]
    B: [K]
    Z: [M]
  expressions:
    - Z[m] = A[m, k] * B[k]
mapping:
  rank-order:
    A: [M, K]
    B: [K]
    Z: [M]
  partitioning:
    Z:
      M: [uniform_shape(NUM_THREADS)]
  loop-order:
    Z: [M1, M0, K]
    # M1: Partitions of A's rows (one per group of NUM_THREADS rows)
    # M0: Rows within a partition (partition size = NUM_THREADS)
  spacetime:
    Z:
      space: [M0] # Parallelize over NUM_THREADS
      time: [M1, K] 
"""

utils.compile(yaml)

## Check Results (Correctness)

Check that the generated code computes the correct result.

**Note**: Run this cell only after compiling and running the kernel (the cell above).

In [None]:
utils.check_matvecmul(A_MK, B_K, Z_M)

## Performance on GPU

Load Balance: Poor, because the number of nonzeros (NNZ) per row of $A$ varies. Threads assigned to rows with few nonzeros sit idle while threads assigned to denser rows finish.
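
A back-of-the-envelope estimate of this imbalance can be sketched as follows. The model is an assumption made for illustration: rows are processed in groups of `num_threads`, one row per thread, and every group lasts as long as its longest row. The helper `thread_utilization` and the NNZ counts are made up and are not taken from the tensors above.

```python
def thread_utilization(nnz_per_row, num_threads):
    """Return (busy_slots, total_slots) under a lockstep, one-row-per-thread model."""
    busy = 0
    total = 0
    for start in range(0, len(nnz_per_row), num_threads):
        group = nnz_per_row[start:start + num_threads]
        busy += sum(group)                  # slots spent on useful multiply-adds
        total += max(group) * num_threads   # slots reserved until the longest row ends
    return busy, total


nnz_per_row = [6, 1, 0, 5, 2, 2, 7, 1]  # hypothetical NNZ counts for 8 rows
busy, total = thread_utilization(nnz_per_row, num_threads=4)
print(f"busy slots: {busy}, total slots: {total}, utilization: {busy / total:.2f}")
```

In this made-up case fewer than half of the thread slots do useful work, because each group waits for its densest row.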

Assuming that $A$ is stored in CSR format and that $B$ and $Z$ are stored as uncompressed vectors, the memory access pattern is:
- $A$: Uncoalesced access, threads in a warp are accessing different rows of $A$.
- $B$: Depends on the column indices of each nonzero entry of $A$. The more irregular the sparsity pattern that $A$ has, the more random the column indices of $A$'s nonzero entries will be. This should result in more uncoalesced accesses to $B$.
- $Z$: Coalesced access, threads in a warp are writing to adjacent memory locations.
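
The access patterns for $A$ and $Z$ can be made concrete with a small address trace. This is a sketch under assumptions: the CSR row offsets are made up, thread `t` is assumed to own row `t`, and all threads are assumed to advance one nonzero per step.

```python
row_ptr = [0, 3, 5, 9, 10]  # hypothetical CSR row offsets for 4 rows
num_threads = 4


def a_reads(step):
    """Indices into A's value array read at a given step, one per active thread."""
    return [row_ptr[t] + step
            for t in range(num_threads)
            if row_ptr[t] + step < row_ptr[t + 1]]


# Each thread writes its own row's result, so the Z indices are contiguous.
z_writes = list(range(num_threads))

first_step = a_reads(0)
second_step = a_reads(1)
print("A reads, step 0:", first_step)
print("A reads, step 1:", second_step)
print("Z writes:", z_writes)
```

The reads into $A$ are separated by the row lengths, so they are scattered, while the writes to $Z$ fall on adjacent locations.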

## TeAAL Specifications (Version 2: Rank Order Swap)

Swapping the ranks `M` and `K` enables concordant traversal of matrix $A$.
A traversal is concordant when the loop nest visits a fibertree's ranks in the order in which they appear, i.e., it traverses each fiber sequentially and in a depth-first manner.

Consequently, matrix $A$ is transformed into CSC format, allowing the SMs' threads (previously partitioned by row) to achieve coalesced memory access.
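
A column-first traversal can be sketched in plain Python as follows. This is an illustration only, not the code TeAAL generates: `spmv_csc` and the CSC array names are hypothetical, and the loop runs sequentially rather than across threads.

```python
def spmv_csc(col_ptr, row_idx, values, b, num_rows):
    """SpMV with K as the outer loop over a CSC matrix: z[m] += A[m, k] * b[k]."""
    z = [0] * num_rows
    for k in range(len(col_ptr) - 1):                 # outer loop over columns
        for nz in range(col_ptr[k], col_ptr[k + 1]):  # nonzeros of column k
            z[row_idx[nz]] += values[nz] * b[k]
    return z


# 3x3 example matrix [[1, 0, 2], [0, 0, 0], [0, 3, 4]] in CSC form.
col_ptr = [0, 1, 2, 4]
row_idx = [0, 2, 0, 2]
values = [1, 3, 2, 4]
b = [1, 2, 3]

z = spmv_csc(col_ptr, row_idx, values, b, num_rows=3)
print(z)
```

Here the stored nonzeros are visited in memory order, and each output entry accumulates partial sums across several iterations of the outer loop instead of being written once.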

In [None]:
yaml = """
einsum:
  declaration:
    A: [M, K]
    B: [K]
    Z: [M]
  expressions:
    - Z[m] = A[m, k] * B[k]
mapping:
  rank-order:
    A: [M, K]
    B: [K]
    Z: [M]
  partitioning:
    Z:
      M: [uniform_shape(NUM_THREADS)]
  loop-order:
    Z: [K, M1, M0]
    # M1: Partitions of A's rows (one per group of NUM_THREADS rows)
    # M0: Rows within a partition (partition size = NUM_THREADS)
  spacetime:
    Z:
      space: [M0] # Parallelize over NUM_THREADS
      time: [M1, K] 
"""

utils.compile(yaml)

## Check Results

Check that the generated code computes the correct result.

**Note**: Run this cell only after compiling and running the kernel (the cell above).

In [None]:
utils.check_matvecmul(A_MK, B_K, Z_M)