# TeAAL specifications on variants of Loops' SpMV implementations (Original)

Description: Each thread is responsible for computing one row of $A$ at a time. 
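
As a minimal sketch of this thread-mapped schedule, the plain-Python function below emulates one GPU thread per row over a CSR matrix. The CSR array names (`row_offsets`, `col_indices`, `values`), the function name `spmv_thread_mapped`, and the grid-stride loop over rows are assumptions made for illustration; they are not taken from the Loops template.

```python
def spmv_thread_mapped(row_offsets, col_indices, values, x, num_threads):
    """Emulate thread-mapped SpMV: each simulated thread owns whole rows."""
    num_rows = len(row_offsets) - 1
    y = [0.0] * num_rows
    for tid in range(num_threads):  # one iteration per simulated thread
        # Assumed grid-stride loop: thread tid handles rows
        # tid, tid + num_threads, tid + 2 * num_threads, ...
        for row in range(tid, num_rows, num_threads):
            acc = 0.0
            # Walk the nonzeros of this row only.
            for k in range(row_offsets[row], row_offsets[row + 1]):
                acc += values[k] * x[col_indices[k]]
            y[row] = acc
    return y


# 3x3 example matrix: [[1, 0, 2], [0, 0, 0], [0, 3, 4]]
row_offsets = [0, 2, 2, 4]
col_indices = [0, 2, 1, 2]
values = [1.0, 2.0, 3.0, 4.0]
x = [1.0, 1.0, 2.0]
y = spmv_thread_mapped(row_offsets, col_indices, values, x, num_threads=2)
```

Because each row is handled by exactly one thread, no two threads write the same output element, so the sketch needs no atomics or reductions across threads.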

Template: https://github.com/gunrock/loops/blob/main/include/loops/algorithms/spmv/original.cuh

GPU Kernel Template (Use it as a reference, don't execute it):

## Imports

Import the necessary modules.

In [3]:
# HiFiber boilerplate

from fibertree_bootstrap import *

fibertree_bootstrap(style="tree", animation='movie')

# Compilation boilerplate

import os
import sys
sys.path.insert(0, "../..")

from src import utils


## Initialization

Initialize the input tensors.

For simplicity, assume that each GPU SM processes one thread block (equivalently, one warp) of `BLOCK_SIZE` threads per cycle.

In [45]:
I = 4
J = 4

NUM_SM = 2 # Number of GPU SMs 
BLOCK_SIZE = 2 # Number of threads per block 
NUM_THREADS_PER_CYCLE = BLOCK_SIZE * NUM_SM # Total number of threads processed per cycle

print(f"NUM_SM: {NUM_SM}, BLOCK_SIZE: {BLOCK_SIZE}, NUM_THREADS_PER_CYCLE: {NUM_THREADS_PER_CYCLE}")
seed = 1

A_IJ = Tensor.fromRandom(rank_ids=["I", "J"], shape=[I, J], seed=seed, density=[0.9, 0.6], name="A")
B_J = Tensor.fromRandom(rank_ids=["J"], shape=[J], seed=seed + 1, density=[1], name="B")

NUM_SM: 2, BLOCK_SIZE: 2, NUM_THREADS_PER_CYCLE: 4
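
One way to read the simplifying assumption is as a mapping from a row index to a hardware slot. The sketch below is illustrative only: the helper `row_to_slot` and the names `wave`, `sm`, and `lane` are hypothetical and do not appear in TeAAL or Loops. It assumes rows are assigned to threads in order, with `NUM_SM * BLOCK_SIZE` rows handled together.

```python
def row_to_slot(i, num_sm, block_size):
    """Map row index i to (wave, sm, lane), assuming in-order row assignment."""
    threads_per_wave = num_sm * block_size
    # Which group of threads_per_wave rows does row i fall in, and where?
    wave, slot = divmod(i, threads_per_wave)
    # Within the group, consecutive block_size rows share one SM.
    sm, lane = divmod(slot, block_size)
    return wave, sm, lane


# With NUM_SM = 2 and BLOCK_SIZE = 2, the four rows of A fill one group.
slots = [row_to_slot(i, num_sm=2, block_size=2) for i in range(4)]
```

With these parameters, rows 0 and 1 land on the first SM and rows 2 and 3 on the second; a fifth row would start a new group.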


TeAAL Specifications:

In [50]:
yaml = """
einsum:
  declaration:
    A: [I, J]
    B: [J]
    Z: [I]
  expressions:
    - Z[i] = A[i, j] * B[j]
mapping:
  rank-order:
    A: [I, J]
    B: [J]
    Z: [I]
  partitioning:
    Z:
      I: [uniform_shape(NUM_THREADS_PER_CYCLE)]
  loop-order:
    Z: [I1, I0, J]
  spacetime:
    Z:
      space: [I1, I0]
      time: [J]
"""

utils.compile(yaml)
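
To illustrate what the `uniform_shape` partitioning of rank `I` does, here is a plain-Python sketch that groups row coordinates into chunks of a fixed coordinate range. The helper `split_uniform` is hypothetical and only assumed to mirror the behavior of the partitioning: the upper rank (`I1`) is keyed by the first coordinate of each chunk, and the lower rank (`I0`) keeps the original row coordinates.

```python
def split_uniform(coords, shape):
    """Group coordinates into chunks covering `shape` consecutive values.

    Returns {chunk_start: [coords in that chunk]}; empty chunks are omitted.
    """
    chunks = {}
    for c in sorted(coords):
        start = (c // shape) * shape  # first coordinate of the chunk holding c
        chunks.setdefault(start, []).append(c)
    return chunks


# Nonempty rows of a hypothetical 8-row matrix, partitioned with shape 4.
rows = [0, 1, 3, 4, 6]
parts = split_uniform(rows, shape=4)

# Visiting order implied by the loop order [I1, I0, ...]:
order = [(i1, i0) for i1, i0s in parts.items() for i0 in i0s]
```

Since the lower rank keeps absolute coordinates in this sketch, flattening the two ranks back into a single `I` rank only requires dropping the chunk key.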

In [51]:
# Autogenerated HiFiber

Z_I1I0 = Tensor(rank_ids=["I1", "I0"], name="Z")
tmp0 = A_IJ
tmp1 = tmp0.splitUniform(NUM_THREADS_PER_CYCLE, depth=0)
A_I1I0J = tmp1
A_I1I0J.setRankIds(rank_ids=["I1", "I0", "J"])
z_i1 = Z_I1I0.getRoot()
b_j = B_J.getRoot()
a_i1 = A_I1I0J.getRoot()
canvas = createCanvas(A_I1I0J, B_J, Z_I1I0)
for i1_pos, (i1, (z_i0, a_i0)) in enumerate(z_i1 << a_i1):
    for i0_pos, (i0, (z_ref, a_j)) in enumerate(z_i0 << a_i0):
        for j_pos, (j, (a_val, b_val)) in enumerate(a_j & b_j):
            z_ref += a_val * b_val
            canvas.addActivity((i1, i0, j), (j,), (i1, i0), spacetime=((i1_pos, i0_pos), (j_pos,)))
tmp2 = Z_I1I0
tmp3 = tmp2.mergeRanks(depth=0, levels=1, coord_style="absolute")
tmp3.setRankIds(rank_ids=["I"])
Z_I = tmp3
displayCanvas(canvas)

Starting simulation
Finished simulation



## Check Results

Check that the generated code computes the correct result.

**Note**: Run this cell only after compiling and running the kernel in the cell above.

In [18]:
utils.check_matrix_vector_mul(A_IJ, B_J, Z_I)

Result correct? True
