# Printing with CuTe DSL

This notebook demonstrates the different ways to print values in CuTe and explains the important distinction between static (compile-time) and dynamic (runtime) values.

## Key Concepts
- Static values: Known at compile time
- Dynamic values: Only known at runtime
- Different printing methods for different scenarios
- Layout representation in CuTe
- Tensor visualization and formatting

In [1]:
import cutlass
import cutlass.cute as cute
import numpy as np

## Print Example Function

The `print_example` function demonstrates several important concepts:

### 1. Python's `print` vs CuTe's `cute.printf`
- `print`: Can only show static values at compile time
- `cute.printf`: Can display both static and dynamic values at runtime

### 2. Value Types
- `a`: Dynamic `Int32` value (runtime)
- `b`: Static `Constexpr[int]` value (compile-time)

### 3. Layout Printing
Shows how layouts are represented differently in static vs dynamic contexts:
- Static context: Unknown values shown as `?`
- Dynamic context: Actual values displayed
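
The `?` placeholders follow from how a layout's default strides are derived from its shape. The following is a minimal plain-Python sketch, not CuTe code. It assumes that `cute.make_layout` without an explicit stride produces compact column-major strides (each stride is the product of the extents before it), and it models a value that is unknown at compile time as `None`. The helper names `compact_strides` and `show` are hypothetical.

```python
def compact_strides(shape):
    """Column-major strides; None marks a value unknown at compile time."""
    strides = []
    running = 1  # product of all extents seen so far
    for extent in shape:
        strides.append(running)
        # Once an unknown extent enters the product, the product is unknown too.
        running = None if running is None or extent is None else running * extent
    return tuple(strides)


def show(shape):
    """Render shape:stride with ? standing in for unknown values."""
    def fmt(values):
        return "(" + ",".join("?" if v is None else str(v) for v in values) + ")"
    return f"{fmt(shape)}:{fmt(compact_strides(shape))}"


print(show((None, 2)))  # static view, a unknown: (?,2):(1,?)
print(show((8, 2)))     # dynamic view, a = 8:    (8,2):(1,8)
```

The first stride is always 1 and is therefore known statically, while the second stride equals `a` and stays unknown until runtime.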

## Compile and Run

**Direct Compilation and Run**
  - `print_example(cutlass.Int32(8), 2)`
  - Compiling and running in one step executes both the static and the dynamic prints:
    * `>>>` stands for static print
    * `>??` stands for dynamic print

In [None]:
@cute.jit
def print_tiled_copy():
    thr_layout = cute.make_layout(shape=(4, 32), stride=(32, 1))  # (4,32) -> thr_idx
    val_layout = cute.make_layout(shape=(1,8))
    layout_mn = cute.raked_product(thr_layout, val_layout)
    right_layout = cute.right_inverse(layout_mn)
    layout_tv = cute.composition(layout_mn, cute.make_layout(shape=(cute.size(thr_layout), cute.size(val_layout))))

    cute.printf("thr_layout {}", thr_layout)
    cute.printf("val layout {}", val_layout)
    cute.printf("layout_mn {}", layout_mn)
    cute.printf("right layout {}", right_layout)
    cute.printf("layout_tv {}", layout_tv)



<function __main__.print_tiled_copy()>

In [3]:
print_tiled_copy()

DSLRuntimeError: DSLRuntimeError: 💥💥💥 Error during runtime code generation for function `print_tiled_copy` 💥💥💥

## Compile Function

When the function is compiled with `cute.compile(print_example, cutlass.Int32(8), 2)`, the Python interpreter 
traces the code, evaluates only the static expressions, and prints only the static information.

In [4]:
print_example_compiled = cute.compile(print_example, cutlass.Int32(8), 2)

>>> 2
>>> ?
>>> Int32
>>> <class 'int'>
>>> (?,2):(1,?)


## Call compiled function

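Calling the compiled function prints only the runtime information. The `Constexpr` argument `b` was fixed at compile time, so only the dynamic argument `a` is passed.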

In [5]:
print_example_compiled(cutlass.Int32(8))

>?? 8
>?? 2
>?? (8,2):(1,8)


## Format String Example

The `format_string_example` function shows an important limitation:
- F-strings in CuTe are evaluated at compile time
- This means dynamic values won't show their runtime values in f-strings
- Use `cute.printf` when you need to see runtime values
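
One way to see why this happens: a Python f-string is formatted the moment the line executes, which for a `@cute.jit` function is while the function is being traced. The toy model below is plain Python and is not how CuTe is implemented. `DynamicValue`, `trace`, and `run` are hypothetical names; `DynamicValue` stands in for a runtime value whose content is not yet available during tracing.

```python
class DynamicValue:
    """Hypothetical stand-in for a value that only exists at runtime."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        # While tracing, there is no concrete value to show yet.
        return "?"


def trace(a, b):
    """Model of tracing: return the eagerly formatted text and a deferred record."""
    eager = f"a: {a}, b: {b}"            # formatted immediately, like print(f"...")
    deferred = ("a: {}, b: {}", (a, b))  # format string and arguments kept for later
    return eager, deferred


def run(deferred, runtime_values):
    """Model of execution: substitute the dynamic arguments, then format."""
    template, args = deferred
    concrete = [
        runtime_values[x.name] if isinstance(x, DynamicValue) else x for x in args
    ]
    return template.format(*concrete)


eager, deferred = trace(DynamicValue("a"), 2)
print(eager)                    # a: ?, b: 2
print(run(deferred, {"a": 8}))  # a: 8, b: 2
```

The deferred record plays the role that `cute.printf` plays in the notebook: formatting is postponed until the value of `a` is known.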

In [6]:
@cute.jit
def format_string_example(a: cutlass.Int32, b: cutlass.Constexpr[int]):
    """
    Format string is evaluated at compile time.
    """
    print(f"a: {a}, b: {b}")

    layout = cute.make_layout((a, b))
    print(f"layout: {layout}")

print("Direct run output:")
format_string_example(cutlass.Int32(8), 2)

Direct run output:
a: ?, b: 2
layout: (?,2):(1,?)


## Printing Tensor Examples

CuTe provides specialized functionality for printing tensors through the `cute.print_tensor` operation, which takes the following parameters:
- `Tensor` (required): The CuTe tensor object to print. The tensor must support load and store operations.
- `verbose` (optional, default=False): A boolean flag that controls the level of detail in the output. When set to True, the coordinate of each element is printed alongside its value.

The example code below shows the difference between verbose mode on and off, and how to print a sub-range of a given tensor.

In [7]:
from cutlass.cute.runtime import from_dlpack

@cute.jit
def print_tensor_basic(x : cute.Tensor):
    # Print the tensor
    print("Basic output:")
    cute.print_tensor(x)
    
@cute.jit
def print_tensor_verbose(x : cute.Tensor):
    # Print the tensor with verbose mode
    print("Verbose output:")
    cute.print_tensor(x, verbose=True)

@cute.jit
def print_tensor_slice(x : cute.Tensor, coord : tuple):
    # Slice the tensor at `coord`; modes given as None are kept
    sliced_data = cute.slice_(x, coord)
    y = cute.make_fragment(sliced_data.layout, sliced_data.element_type)
    # Load the sliced data as a TensorSSA value and store it into the register fragment
    y.store(sliced_data.load())
    print("Slice output:")
    cute.print_tensor(y)

By default, `cute.print_tensor` prints the tensor's data type, memory space, and CuTe layout, followed by the data in a torch-style format.

In [8]:
def tensor_print_example1():
    shape = (4, 3, 2)
    
    # Create [0, ..., 23] and reshape to (4, 3, 2)
    data = np.arange(24, dtype=np.float32).reshape(*shape) 
      
    print_tensor_basic(from_dlpack(data))

tensor_print_example1()

Basic output:
tensor(raw_ptr(0x000000000a5f1d50: f32, generic, align<4>) o (4,3,2):(6,2,1), data=
       [[[ 0.000000,  2.000000,  4.000000, ],
         [ 6.000000,  8.000000,  10.000000, ],
         [ 12.000000,  14.000000,  16.000000, ],
         [ 18.000000,  20.000000,  22.000000, ]],

        [[ 1.000000,  3.000000,  5.000000, ],
         [ 7.000000,  9.000000,  11.000000, ],
         [ 13.000000,  15.000000,  17.000000, ],
         [ 19.000000,  21.000000,  23.000000, ]]])
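
Reading the output above against the layout `(4,3,2):(6,2,1)`: the element at coordinate `(i, j, k)` sits at offset `6*i + 2*j + k`, and the two printed 4x3 blocks appear to correspond to `k = 0` and `k = 1`. The plain-Python sketch below rebuilds that arrangement from the layout alone. It is an interpretation of the printed output, not a description of how `cute.print_tensor` is implemented, and `offset` and `blocks` are names made up for the illustration.

```python
shape = (4, 3, 2)
stride = (6, 2, 1)


def offset(coord):
    """Layout function: inner product of coordinate and stride."""
    return sum(c * s for c, s in zip(coord, stride))


# np.arange(24) stores the value n at offset n, so the offset is also the value.
# The last mode (k) selects the block, mode 0 (i) the row, mode 1 (j) the column.
blocks = [
    [[float(offset((i, j, k))) for j in range(shape[1])] for i in range(shape[0])]
    for k in range(shape[2])
]

for block in blocks:
    for row in block:
        print(row)
    print()
```

The first block starts with the row `[0.0, 2.0, 4.0]` and the second with `[1.0, 3.0, 5.0]`, matching the printed data.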


Verbose printing shows the coordinate of each element in the tensor. The example below shows how elements are indexed in a 2D 4x3 tensor.

In [9]:
def tensor_print_example2():
    shape = (4, 3)
    
    # Create [0, ..., 11] and reshape to (4, 3)
    data = np.arange(12, dtype=np.float32).reshape(*shape) 
      
    print_tensor_verbose(from_dlpack(data))

tensor_print_example2()

Verbose output:
tensor(raw_ptr(0x000000000a814cc0: f32, generic, align<4>) o (4,3):(3,1), data= (
	(0,0)= 0.000000
	(0,1)= 1.000000
	(0,2)= 2.000000
	(1,0)= 3.000000
	(1,1)= 4.000000
	(1,2)= 5.000000
	(2,0)= 6.000000
	(2,1)= 7.000000
	(2,2)= 8.000000
	(3,0)= 9.000000
	(3,1)= 10.000000
	(3,2)= 11.000000
)


To print a subset of the elements in a given tensor, we can use `cute.slice_` to select a range of the tensor, load the selected elements into registers, and then print the values with `cute.print_tensor`.

In [10]:
def tensor_print_example3():
    shape = (4, 3)
    
    # Create [0, ..., 11] and reshape to (4, 3)
    data = np.arange(12, dtype=np.float32).reshape(*shape) 
      
    print_tensor_slice(from_dlpack(data), (None, 0))
    print_tensor_slice(from_dlpack(data), (1, None))

tensor_print_example3()

Slice output:
tensor(raw_ptr(0x00007ffeeae1fc60: f32, rmem, align<32>) o (4):(3), data=
       [ 0.000000, ],
       [ 3.000000, ],
       [ 6.000000, ],
       [ 9.000000, ])
Slice output:
tensor(raw_ptr(0x00007ffeeae1fc60: f32, rmem, align<32>) o (3):(1), data=
       [ 3.000000, ],
       [ 4.000000, ],
       [ 5.000000, ])
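
The two results above can be checked by hand. Slicing with a coordinate such as `(None, 0)` keeps the modes marked `None` and fixes the others, and the fixed coordinates contribute a base offset. The sketch below models this in plain Python on the `(4,3):(3,1)` layout. `slice_layout` and `values` are hypothetical helpers written for this illustration, not the `cute.slice_` implementation.

```python
def slice_layout(shape, stride, coord):
    """Return (sub_shape, sub_stride, base_offset) for a coordinate with None wildcards."""
    sub_shape, sub_stride, base = [], [], 0
    for extent, step, c in zip(shape, stride, coord):
        if c is None:
            # Wildcard: this mode survives in the sliced layout.
            sub_shape.append(extent)
            sub_stride.append(step)
        else:
            # Fixed coordinate: fold it into the starting offset.
            base += c * step
    return tuple(sub_shape), tuple(sub_stride), base


def values(shape, stride, coord):
    """Values seen through a 1D slice when the buffer holds 0, 1, 2, ... in order."""
    (n,), (step,), base = slice_layout(shape, stride, coord)
    return [float(base + i * step) for i in range(n)]


print(slice_layout((4, 3), (3, 1), (None, 0)), values((4, 3), (3, 1), (None, 0)))
print(slice_layout((4, 3), (3, 1), (1, None)), values((4, 3), (3, 1), (1, None)))
```

The first call yields shape `(4,)` with stride `(3,)`, the second shape `(3,)` with stride `(1,)`, which agrees with the `(4):(3)` and `(3):(1)` layouts in the printed output.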
