## Systolic Mul local memory analysis

Given a local storage capacity of **40KB** (i.e., **40960 bytes**) and assuming **Mt = Kt = Nt = M**, we want to determine the maximum value of **Mt** that can be supported.

The total storage requirement is given by the formula:


total storage = Nt × (Kt + Mt) × 4 bytes


Since **Mt = Kt = Nt = M**, we simplify this to:


total storage = M × (M + M) × 4 = 8M² bytes


The tile fits as long as 8M² ≤ **40960 bytes**, so the largest **M** is found from the boundary case:


total storage = 8M² = 40960


Solving for **M**:


M² = 40960 / 8 = 5120

M = √5120 ≈ 71.55


Thus, the maximum possible integer value for **Mt** is **71**: 8 × 71² = 40328 bytes fits in 40960 bytes, while 8 × 72² = 41472 bytes does not.
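The same bound can be checked numerically. The following is a minimal sketch that assumes the storage formula above; the function names `max_square_tile` and `systolic_storage_bytes` are hypothetical helpers introduced only for this illustration.

```python
import math

def systolic_storage_bytes(m: int, bytes_per_element: int = 4) -> int:
    """Storage from the formula Nt * (Kt + Mt) * bytes, with Mt = Kt = Nt = m."""
    return m * (m + m) * bytes_per_element

def max_square_tile(capacity_bytes: int, bytes_per_element: int = 4) -> int:
    """Largest integer M with M * (M + M) * bytes_per_element <= capacity_bytes."""
    # M * 2M * b <= C  is equivalent to  M^2 <= C / (2 * b).
    # Floor division is safe here because M^2 is an integer.
    return math.isqrt(capacity_bytes // (2 * bytes_per_element))

capacity = 40 * 1024  # 40 KB = 40960 bytes
m = max_square_tile(capacity)
print(m, systolic_storage_bytes(m), systolic_storage_bytes(m + 1))
```

Using an integer square root avoids relying on a rounded floating-point value when the result is close to an integer boundary.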

## SUMMA local memory analysis

I am running SUMMA on a 2D mesh of PEs on Cerebras, a dataflow accelerator architecture. Each PE has 48 KB of local memory, which holds the local tiles of A and B, the temporary C tile, and the buffered A and B tiles used for broadcasting.

Here is the CSL code, the compile setup, and the problems I am facing.

Currently the problem seems to be related to numerical stability. When P is set to 400, the numerical error rate increases with the size of the matrix, as shown in problem description 1.

When the local tile size is restricted to Mt = 47, the numerical error increases with the size of the layout. With P = 100 or P = 200, the result is correct. With P = 300, there is 1 error. As P increases further, the error rate grows. The overall error rate is still small, so the algorithm passes, but the numerical stability problem of SUMMA has not been overcome.

I need you to help me judge whether I am right, find the cause of the error, and finally give me some correct advice.
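One candidate explanation, stated here as an assumption rather than a confirmed diagnosis, is float32 accumulation error: in SUMMA on a P × P layout, each element of C is a sum of P × Kt products, so the reduction length grows with P. The sketch below emulates float32 rounding in plain Python and compares a sequentially accumulated dot product against a float64 reference. The helpers `to_f32`, `dot_f32`, and `dot_error`, the uniform random inputs, and the sequential summation order are all illustrative assumptions and are not taken from the CSL code.

```python
import math
import random
import struct

def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_f32(a, b) -> float:
    """Dot product with every multiply and add rounded to float32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc

def dot_error(P: int, Kt: int = 47, seed: int = 0):
    """Return (absolute error, first-order worst-case bound) for length P * Kt."""
    rng = random.Random(seed)
    K = P * Kt  # reduction length of one C element (assumed P x P layout)
    a = [to_f32(rng.uniform(-1.0, 1.0)) for _ in range(K)]
    b = [to_f32(rng.uniform(-1.0, 1.0)) for _ in range(K)]
    reference = math.fsum(x * y for x, y in zip(a, b))  # float64 reference
    abs_sum = math.fsum(abs(x * y) for x, y in zip(a, b))
    unit_roundoff = 2.0 ** -24  # float32 unit roundoff
    bound = K * unit_roundoff * abs_sum  # first-order bound, grows with K
    return abs(dot_f32(a, b) - reference), bound

for P in (100, 200, 300, 400):
    err, bound = dot_error(P)
    print(f"P={P:4d}  K={P * 47:6d}  abs error={err:.3e}  bound={bound:.3e}")
```

If the observed mismatches behave like this, comparing against the reference with a tolerance that scales with P × Kt, instead of a fixed threshold, would be one way to test the hypothesis.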


In [None]:
# Data structures on each PE
'''
1. A_tile:   Mt * Kt elements
2. B_tile:   Kt * Nt elements
3. C_tile:   Mt * Nt elements
4. A_buffer: Mt * Kt elements (A tile buffered for broadcasting)
5. B_buffer: Kt * Nt elements (B tile buffered for broadcasting)
'''

def calculate_single_pe_memory(Mt, Kt, Nt):
    """
    Calculates the required memory for a single PE given the values of Mt, Kt, and Nt.
    
    Parameters:
    Mt (int): Rows of the local A and C tiles
    Kt (int): Shared (inner) dimension of the local A and B tiles
    Nt (int): Columns of the local B and C tiles
    
    Returns:
    float: Required PE memory in KB
    """
    data_type_size = 4  # float32 is 4 bytes per element
    
    # Calculate the total number of elements
    total_elements = (Mt * Kt) + (Kt * Nt) + (Mt * Nt) + (Mt * Kt) + (Kt * Nt)
    
    # Calculate the total bytes
    total_bytes = total_elements * data_type_size
    
    # Calculate the required PE memory in KB
    required_pe_memory = total_bytes / 1024
    
    return required_pe_memory

# Example usage
Mt = Kt = Nt = 47
print(f"Required PE memory: {calculate_single_pe_memory(Mt, Kt, Nt):.2f} KB")

Required PE memory: 43.14 KB
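For comparison with the 48 KB budget, the largest square tile that the five data structures allow can be computed directly. This is a sketch under the assumption that the full 48 KB is available for data, which may be optimistic if code or other buffers share the same memory; `summa_bytes` and `max_square_summa_tile` are hypothetical helper names.

```python
import math

def summa_bytes(mt: int, kt: int, nt: int, bytes_per_element: int = 4) -> int:
    """Bytes for A_tile, A_buffer, B_tile, B_buffer, and C_tile on one PE."""
    elements = 2 * (mt * kt) + 2 * (kt * nt) + (mt * nt)
    return elements * bytes_per_element

def max_square_summa_tile(capacity_bytes: int, bytes_per_element: int = 4) -> int:
    """Largest M with 5 * M^2 * bytes_per_element <= capacity_bytes (Mt = Kt = Nt = M)."""
    return math.isqrt(capacity_bytes // (5 * bytes_per_element))

capacity = 48 * 1024  # 48 KB = 49152 bytes, assumed fully usable for data
m_max = max_square_summa_tile(capacity)
headroom = capacity - summa_bytes(47, 47, 47)
print(f"Largest square tile: {m_max}")
print(f"Headroom at Mt = Kt = Nt = 47: {headroom} bytes")
```

Under this assumption the limit is M = 49, and the Mt = 47 configuration leaves 4972 bytes free.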
