# NV-STC

This notebook reproduces the salient characteristics of the [NV-STC](https://arxiv.org/pdf/2104.08378) accelerator.

## Imports

Import the necessary modules.

In [None]:
# HiFiber boilerplate

from fibertree_bootstrap import *

fibertree_bootstrap(style="tree", animation='movie')

# Compilation boilerplate

import os
import sys
sys.path.insert(0, "..")

from src import utils

## Initialization

Initialize the input tensors. Tensor shapes and densities can be modified below.

**Warning:** Large tensors will overwhelm the video generation. Either:
1. Use small tensors; as a rule of thumb, fewer than 60 computes (e.g., multiplications) should be required.
2. Do not generate a video; remove the `spacetime` specification from the `mapping` before compiling.
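To judge whether a pair of inputs stays under the rule of thumb, the number of computes can be estimated before compiling. The sketch below is a plain-Python illustration, not part of the notebook's tooling: `count_multiplies` is a hypothetical helper, it works on nested lists rather than fibertree `Tensor` objects, and it assumes that a "compute" means a multiplication in which both operands are nonzero.

```python
def count_multiplies(a_mk, b_kn):
    """Count multiplications A[m][k] * B[k][n] where both operands are nonzero.

    a_mk is an M x K nested list and b_kn is a K x N nested list.
    For each k, every nonzero in column k of A meets every nonzero in row k of B.
    """
    total = 0
    for k in range(len(b_kn)):
        nnz_a = sum(1 for row in a_mk if row[k] != 0)   # nonzeros in column k of A
        nnz_b = sum(1 for v in b_kn[k] if v != 0)       # nonzeros in row k of B
        total += nnz_a * nnz_b
    return total


# Small illustrative inputs (values are made up for the example)
a_example = [[1, 0],
             [2, 3]]
b_example = [[1, 1],
             [0, 4]]
num_computes = count_multiplies(a_example, b_example)
```

If the estimate is well above 60, shrinking the shapes or lowering the densities is likely the simpler fix; otherwise drop the `spacetime` specification as described above.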

In [22]:
K = 8
M = 8
N = 8

M0 = 2
N0 = 2
K0 = 2
M1 = 4
N1 = 4
K1 = 4

density = [0.9, 0.5]
seed = 0
A_MK = Tensor.fromRandom(rank_ids=["M", "K"], shape=[M, K], seed=seed, density=density, name="A")
B_KN = Tensor.fromRandom(rank_ids=["K", "N"], shape=[K, N], seed=seed + 1, density=density, name="B")
B_NK = B_KN.swizzleRanks(["N", "K"])

## Compile and Run

Below is the TeAAL specification for NV-STC. To simulate the accelerator:
1. Compile it to HiFiber by running the cell, which inserts a new cell
2. Run the new cell, which will
    - Execute the kernel, multiplying the matrices defined above
    - Generate visualizations of the actions of the kernel

#### Notes

- Small tensors are required for video generation. If you are using large tensors, remove the spacetime specification to generate a kernel that does not produce videos. Outputs can still be checked below.
- The partition shapes set above are reduced for visualization purposes. According to the [NVIDIA PTX ISA documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-shape), the real NV-STC uses `M1 = 16`, `M0 = 8`, `K1 = 32`, `K0 = 4`, `N1 = 8`, and `N0 = 8`. In contrast, [RM-STC](https://dl.acm.org/doi/abs/10.1145/3613424.3623775) (Table 2 of the paper) reports `M1 = 16`, `M0 = 8`, `K1 = 32`, `K0 = 4`, `N1 = 16`, and `N0 = 4`.
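The specification partitions `K` in two different ways, which the following plain-Python sketch illustrates. It assumes that `uniform_shape` splits a rank into fixed-size coordinate ranges and that `uniform_occupancy` splits it so that each partition holds a fixed number of nonzeros of the leader tensor (here `A`). The two functions are hypothetical stand-ins written for this example, not the TeAAL or HiFiber implementations, and the coordinates are made up.

```python
def uniform_shape(coords, shape):
    """Group coordinates into ranges [0, shape), [shape, 2 * shape), ...

    Returns a dict keyed by the first coordinate of each range.
    """
    parts = {}
    for c in sorted(coords):
        parts.setdefault((c // shape) * shape, []).append(c)
    return parts


def uniform_occupancy(coords, occupancy):
    """Group sorted coordinates so each partition holds at most `occupancy` of them."""
    coords = sorted(coords)
    return [coords[i:i + occupancy] for i in range(0, len(coords), occupancy)]


# Nonzero K coordinates of one row of A (illustrative values)
k_coords = [0, 1, 3, 4, 6, 7]
K1, K0 = 4, 2

# Outer split by coordinate range, as in uniform_shape(K1)
outer = uniform_shape(k_coords, K1)

# Inner split by number of nonzeros, as in uniform_occupancy(A.K0)
inner = {k1: uniform_occupancy(cs, K0) for k1, cs in outer.items()}
```

With these inputs, `outer` holds two ranges starting at coordinates 0 and 4, and each range is further divided into groups of at most `K0 = 2` nonzeros. The shape-based split depends only on coordinates, while the occupancy-based split depends on where the nonzeros of `A` actually fall.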

In [23]:
yaml = """
einsum:
  declaration:
    A: [K, M]
    B: [K, N]
    Z: [M, N]
  expressions:
    - Z[m, n] = A[k, m] * B[k, n]
mapping:
  rank-order:
    A: [M, K]
    B: [N, K]
    Z: [M, N]
  partitioning:
    Z:
      M: [uniform_shape(M1), uniform_shape(M0)]
      N: [uniform_shape(N1), uniform_shape(N0)]
      K: [uniform_shape(K1), uniform_occupancy(A.K0)]
  loop-order:
    Z: [M2, N2, K2, M1, N1, K1, M0, N0, K0]
  spacetime:
    Z:
      space: [M0, N0, K0]
      time: [M2, N2, K2, M1, N1, K1]
"""

utils.compile(yaml)

## Check Results

Check that the generated code computes the correct result.

**Note**: Run this cell only after compiling and running the kernel in the cell above, since it relies on the output tensor `Z_MN` that the kernel produces.
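For intuition about what such a check involves, the sketch below compares a result against a dense reference product. It is an independent illustration on nested lists, not the implementation of `utils.check_matmul`; the names `dense_matmul` and `matches_reference` are hypothetical.

```python
def dense_matmul(a_mk, b_kn):
    """Reference product Z[m][n] = sum over k of A[m][k] * B[k][n] on nested lists."""
    M, K, N = len(a_mk), len(b_kn), len(b_kn[0])
    return [
        [sum(a_mk[m][k] * b_kn[k][n] for k in range(K)) for n in range(N)]
        for m in range(M)
    ]


def matches_reference(a_mk, b_kn, z_mn):
    """Return True if z_mn equals the dense reference product of a_mk and b_kn."""
    return dense_matmul(a_mk, b_kn) == z_mn


# Illustrative inputs (values are made up for the example)
a_ref = [[1, 2],
         [3, 4]]
b_ref = [[5, 6],
         [7, 8]]
z_ref = dense_matmul(a_ref, b_ref)
```

A sparse kernel that skips zero operands should produce the same values as this dense reference, because the skipped products contribute nothing to the sum.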

In [None]:
utils.check_matmul(A_MK, B_KN, Z_MN)