# Cerebras System: remarkable hardware accelerator and its programming model

Yuri Takigawa

June 26, 2025

The University of Tokyo, EEIC, Taura Lab

#### Contents

Installation and Setup

A Conceptual View: Hardware organization

PE (Processing Elements)

A Conceptual View: Programming model

Programming Language

Communication

Code Execution Unit: Task

Another use of wavelet arrival: Data Struct

layout and host code

**Installation and Setup** 

## How to get access to the SDK

Fill out the form at this link.

- Human operators reply to requests, so it takes more than half a day.
- The reply includes a **Dropbox link** to the **SDK files**.

Most of the details about **installation and setup** are <u>here</u>.

- The most important point is that this SDK targets the amd64 architecture.

# Overview of installation and setup

Some of the (critical) points below are **NOT** explicitly written in the guide.

1. An Azure VM is highly recommended as the environment (Miyabi is not available here...)
   - The recommended instance type is Standard B4ms (4 vCPUs, 16 GiB memory) with a 64 GiB disk.
2. When you connect via ssh, use a command such as:

```
ssh -i ~/.ssh/YOUR_PRIVATE_KEY -Y -L 8000:localhost:8000 YOUR_USERNAME@PUBLICIP
```

- -L 8000:localhost:8000 forwards local port 8000 to port 8000 on the remote machine.
- YOUR\_USERNAME is **NOT** a resource name.
- PUBLICIP is shown on the resource page, reachable from the Azure home.
- You can also refer to the past spring-training material by gotonao.

3. You can follow the installation/setup guide, but note that some filenames have changed.
4. Finally, you can try remote GUI debugging (step 7 of the guide).
   - After running the test, run sdk\_debug\_shell visualize, then open http://localhost:8000/sdk-gui in your local browser (e.g., Chrome).
   - Version conflicts sometimes occur and cause buggy behavior.



**A Conceptual View: Hardware organization**

# Wafer Scale Engine

Cerebras refers to its hardware accelerator as the WSE (Wafer Scale Engine).

- The WSE consists of hundreds of thousands of independent PEs (processing elements, roughly equivalent to cores).<sup>1</sup>
- The PEs are interconnected by communication links and form a two-dimensional rectangular mesh on a single silicon wafer.

<sup>1</sup>Uses the TSMC 5 nm process node.

# Characteristics of a PE: memory

- Each PE has its own physically local SRAM (called local PE memory)<sup>a</sup> with single-cycle access.
  - It is 48 kB in total, consists of 8 banks, and provides full datapath bandwidth: two 64-bit (128-bit) reads plus one 64-bit (128-bit) write per cycle.
  - Each bank is 6 kB, 32 bits wide, and single-ported.
- All the code and data used for execution on the PE are stored in this memory.
- This physically local memory is logically local as well (i.e., no other PE can access it directly).



WSE-3 Core

| Block          | Contents                              |
|----------------|---------------------------------------|
| Fabric         |                                       |
| Memory         | SRAM 48kB, Cache 512B                 |
| Registers      | 16 General Purpose, 48 Data Structure |
| 8-way 16b SIMD |                                       |

<sup>a</sup>200x normalized memory BW vs. GPU

#### Characteristics of a PE: processor

- Each PE has a processor called the CE (Compute Engine).
  - 16 general-purpose registers and 48 data structure registers
  - Compact 6-stage pipeline
  - Flexible general ops (e.g., arithmetic, logical, load/store, compare, branch) for control processing
- The CE has a **SIMD** compute unit.
  - In the WSE2: 4-way 16-bit SIMD<sup>a</sup>; each way has an ALU for FADD, FMUL, FMAC<sup>b</sup>, etc.
  - In the WSE3: 8-way 16-bit SIMD and 16-way 8-bit SIMD



<sup>a</sup>That is, a single instruction is executed on 4 different data elements simultaneously.

<sup>b</sup>Fused Multiply-Accumulate
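To make the SIMD footnotes concrete, here is a minimal software sketch of a 4-way multiply-accumulate. It is only a model of the idea, not Cerebras code: `LANES`, `simd_fmac`, and `vector_fmac` are hypothetical names introduced for illustration, and real hardware works on 16-bit values rather than Python floats.

```python
LANES = 4  # 4-way SIMD as described for the WSE2 (hypothetical constant)

def simd_fmac(acc, a, b):
    """Model one SIMD instruction: acc[i] + a[i] * b[i] on all lanes at once."""
    assert len(acc) == len(a) == len(b) == LANES
    return [acc_i + a_i * b_i for acc_i, a_i, b_i in zip(acc, a, b)]

def vector_fmac(acc, a, b):
    """Process a longer vector LANES elements per modeled instruction."""
    assert len(acc) == len(a) == len(b) and len(a) % LANES == 0
    out, issued = [], 0
    for i in range(0, len(a), LANES):
        out += simd_fmac(acc[i:i + LANES], a[i:i + LANES], b[i:i + LANES])
        issued += 1  # one instruction handles LANES elements
    return out, issued

result, issued = vector_fmac([0.0] * 8,
                             [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                             [2.0] * 8)
print(result, issued)
```

With 8 elements and 4 lanes, the model issues 2 instructions instead of 8 scalar ones.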

# Characteristics of a PE: processor

- Each PE has its own independent PC (program counter).
  - Thus, by default each PE executes its code asynchronously with respect to the other PEs.



#### Characteristics of a PE: router

- Each PE has a hardware unit for communication (send, receive) called the **Router**.
- A Router is directly connected to its own CE via a bidirectional link called the RAMP (Router ALU Messaging Path).<sup>2</sup>
- A Router is directly connected to the routers of the four nearest neighboring PEs (north, south, east, west).
  - Thus, a router has 4 ports.
- A Router has 8 input queues and output queues inside the PE.
  - An **input queue** is a hardware buffer where data is temporarily stored before entering the CE (explained later).

<sup>2</sup>You may have heard this word in the context of a highway interchange.

**A Conceptual View: Programming model**

# **Programming Language: CSL**

- To develop code for the WSE, write *device code* in **CSL** (**Cerebras Software Language**) and *host code* in *Python*.
  - The host code is responsible for copying data to and from the device.
  - CSL gives programmers full control of the WSE.
- Then, compile the device code with cslc, and run your program on either the Cerebras fabric simulator or the actual network-attached device.
  - The usage is <u>here</u>.

### **Programs and Tasks**

A **CSL program** consists of one or more *subprograms*.

There are two kinds of subprogram declarations.

- function: callable
- task: a procedure that cannot be called from other code
  - A task is something like an atomic (i.e., unsplittable) code block managed by hardware.
  - Tasks are managed by a specialized hardware unit of the CE, similar to an enhanced NPC generator.
  - A task can be activated (i.e., made ready to run) by a hardware trigger (imagine a flag bit being flipped).
  - Tasks are started by PE hardware (specifically, the NPC generator of the CE).
  - Only one task can execute at a time on the CE.
  - Once a task is started by hardware, it runs until it completes. Then, the NPC generator chooses a new task to run.
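The run-to-completion behavior described above can be pictured with a small Python model. This is a sketch under stated assumptions, not CSL: `ComputeEngine`, `bind`, `activate`, and `run` are hypothetical names, and picking tasks in FIFO order is an assumption (the slides do not specify how the hardware chooses among ready tasks).

```python
from collections import deque

class ComputeEngine:
    """Toy model of a CE: tasks are never called directly, only activated."""

    def __init__(self):
        self.tasks = {}        # task_id -> Python callable standing in for a task body
        self.active = deque()  # activated task IDs waiting to be picked
        self.log = []

    def bind(self, task_id, body):
        self.tasks[task_id] = body

    def activate(self, task_id):
        # Models a hardware trigger flipping the task's "active" flag.
        if task_id not in self.active:
            self.active.append(task_id)

    def run(self):
        # Only one task runs at a time, and each one runs to completion
        # before the next task is picked.
        while self.active:
            task_id = self.active.popleft()
            self.log.append(("start", task_id))
            self.tasks[task_id](self)
            self.log.append(("end", task_id))

def producer(ce):
    ce.activate(2)  # activating task 2 does not interrupt the running task

def consumer(ce):
    pass

ce = ComputeEngine()
ce.bind(1, producer)
ce.bind(2, consumer)
ce.activate(1)
ce.run()
print(ce.log)
```

Task 2 is activated while task 1 is running, but it starts only after task 1 has ended.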

#### The unit of communication between PEs: wavelet

32-bit messages (~ packets), called **wavelets**, can be sent to or received from neighboring PEs in a single clock cycle.

- The arrival of a wavelet triggers one of the following inside the PE:
  - Task activation
  - Storage into a data structure managed through a fabric DSD
- Large data (such as an array or tensor) is split into multiple wavelets for transfer, and the wavelets that arrive at the destination PE first are buffered in the **input queue** until all wavelets have arrived.
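As an illustration of the splitting and buffering just described, the sketch below packs f32 values into 32-bit payloads and collects them in a toy queue. It assumes one f32 element per wavelet and an unbounded buffer; `to_wavelets`, `from_wavelets`, and `InputQueue` are hypothetical helpers, not SDK APIs.

```python
import struct

def to_wavelets(values):
    """Split a list of f32 values into 32-bit payloads, one per value."""
    return [struct.unpack("<I", struct.pack("<f", v))[0] for v in values]

def from_wavelets(payloads):
    """Reinterpret 32-bit payloads as f32 values."""
    return [struct.unpack("<f", struct.pack("<I", p))[0] for p in payloads]

class InputQueue:
    """Toy input queue: buffers wavelets until the expected count has arrived."""

    def __init__(self, expected):
        self.expected = expected
        self.buffer = []

    def push(self, payload):
        self.buffer.append(payload)
        return len(self.buffer) == self.expected  # True once the transfer is complete

data = [1.0, 2.5, -3.0, 4.25]
queue = InputQueue(expected=len(data))
complete = [queue.push(w) for w in to_wavelets(data)]
received = from_wavelets(queue.buffer)
print(complete, received)
```

Only the last `push` reports completion, which mirrors the idea that early wavelets wait in the queue.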



#### Virtual Communication Channel: Color

The virtual communication path (channel) through which **wavelets** (packets) travel is called a **color**.

- There are 24 virtual channels used by the hardware.
- All colors transfer data over a single physical channel.
  - If multiple colors have wavelets to send over the physical path from PE X to PE Y, those colors (not individual wavelets) are scheduled by a hardware arbiter on the router.
  - IMPORTANT: Congestion on one color does NOT block traffic on another color; fairness is enforced between colors (not wavelets).
  - For example, a round-robin (RR) algorithm could be used as the arbitration policy.
- Each wavelet has a 5-bit tag that encodes its color.
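The following sketch shows what round-robin arbitration between colors could look like on one physical link. It is a software illustration only: the slides mention RR merely as a possible policy, and `round_robin` and the sample queues are hypothetical.

```python
from collections import deque

NUM_COLORS = 24  # from the text
TAG_BITS = 5     # from the text: 2**5 = 32 tags are enough for 24 colors

def round_robin(queues, start=0):
    """Yield (color, wavelet) pairs, visiting colors in round-robin order.

    `queues` maps color -> deque of pending wavelets for one physical link.
    Colors with nothing to send are skipped, so an idle or stalled color
    never holds up the others.
    """
    color = start
    while any(queues.values()):
        q = queues.get(color)
        if q:
            yield color, q.popleft()
        color = (color + 1) % NUM_COLORS

pending = {
    0: deque(["a0", "a1", "a2"]),
    3: deque(["b0"]),
    7: deque(["c0", "c1"]),
}
order = list(round_robin(pending))
print(order)
```

Even though color 0 has the most traffic, colors 3 and 7 each get a turn before color 0 sends its second wavelet.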

# Task IDs and Types of Tasks

Each task can be associated with a task ID from 0 to 63.

- Data Task: the arrival of a wavelet triggers its activation; its ID is associated with an input queue on the router.<sup>3</sup>
- Local Task: a call to @activate(task\_id) elsewhere in the code on the same PE triggers its activation.
- Control Task: controls other tasks on the same PE as follows; its ID can take any value from 0 to 63.
  - unblock another data task
  - conditionally launch a local task

<sup>3</sup>In the WSE2, the task ID is directly associated with the **color**, with an implicit linkage between the **input queue** and the task ID.

# The conditions to be ready for execution

A task must satisfy two conditions to be scheduled by the task picker (hardware selector).

#### Activated

- Every task is inactive by default.
- Programmers can activate a task within the same PE with @activate(task\_id).
- Programmers can activate a task in another PE with @send\_to\_color(output\_queue\_id).

#### Unblocked

- Every task is unblocked by default.
- However, programmers can block the ID of a task at compile time with @block(task\_id).
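The two conditions combine into a simple predicate: a task is eligible only when it is activated and not blocked. The sketch below models that rule in Python; `TaskState` and the task IDs 10 and 11 are hypothetical, and the comments only indicate which CSL builtin each assignment is meant to mimic.

```python
class TaskState:
    """Per-task flags as described above (toy model, not hardware-accurate)."""

    def __init__(self):
        self.activated = False  # every task is inactive by default
        self.blocked = False    # every task is unblocked by default

    def ready(self):
        return self.activated and not self.blocked

tasks = {task_id: TaskState() for task_id in (10, 11)}

tasks[10].blocked = True     # mimics @block(10)
tasks[10].activated = True   # mimics a wavelet arrival or @activate(10)
tasks[11].activated = True   # mimics @activate(11)

ready_before = sorted(t for t, s in tasks.items() if s.ready())
tasks[10].blocked = False    # mimics a later unblock, e.g. by a control task
ready_after = sorted(t for t, s in tasks.items() if s.ready())
print(ready_before, ready_after)
```

Task 10 is activated from the start, yet it becomes eligible only after it is unblocked.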

# Pseudo image of the hardware that manages Data Tasks



# Communication via router: Task activation


# Code Template: link computation and communication using Task

#### Preliminaries: hardware instructions for tensors

#### CSL supports several tensor ops

- Every tensor op accepts options.
  - Asynchronous execution means relaxing software serialization,
  - i.e., an op may start executing before the previous op has completed, depending on whether hazards (data, structural) exist.

```
.{
  .async = true,         // asynchronous execution flag
  .activate = task_id,   // ID of the task activated when this op completes
  .element_type = f32,   // element type
  // other options specific to each op
}
```

- The formats of these ops are as follows (here):

```
// ARITHMETIC
// dst = src1 * src2 + acc
@fmacs(dst, acc, src1, src2 [, options])
// dst = src1 {+, -, *, /} src2
@fadds(dst, src1, src2 [, options])
@fsubs(dst, src1, src2 [, options])
@fmuls(dst, src1, src2 [, options])
@fdivs(dst, src1, src2 [, options])

// COPY
// dst = src
@fmovs(dst, src [, options])
// dst = src(size)
@copy(dst, src, size_in_bytes)

// ELEMENT-WISE OPS
@fexps(dst, src [, options])   // exp
@flogs(dst, src [, options])   // ln
@frelus(dst, src [, options])  // ReLU

// REDUCTION OPS
@fsum(dst, src [, options])    // sum of items
@fmax(dst, src [, options])    // dst is scalar
```
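To show how the .async and .activate options interact, here is a toy Python model in which an asynchronous op is only issued at call time and activates a task when it later completes. Everything here is hypothetical (`ToyPE`, `is_async`, `complete_all`); real hardware completes ops on its own schedule rather than through an explicit call.

```python
class ToyPE:
    """Toy model: async ops are queued at issue time and finish later."""

    def __init__(self):
        self.in_flight = []  # async ops issued but not yet completed
        self.activated = []  # task IDs activated by completed ops

    def fadds(self, dst, src1, src2, is_async=False, activate=None):
        def work():
            for i in range(len(dst)):
                dst[i] = src1[i] + src2[i]
            if activate is not None:
                self.activated.append(activate)  # mimics .activate = task_id
        if is_async:
            self.in_flight.append(work)  # control returns before the op has finished
        else:
            work()

    def complete_all(self):
        # Stand-in for the hardware finishing all outstanding ops.
        for work in self.in_flight:
            work()
        self.in_flight.clear()

pe = ToyPE()
out = [0.0, 0.0]
pe.fadds(out, [1.0, 2.0], [10.0, 20.0], is_async=True, activate=5)
before = list(out)  # the op has only been issued at this point
pe.complete_all()
print(before, out, pe.activated)
```

The point of the design is that the issuing code does not wait: follow-up work is placed in the activated task instead.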

## DSD: an abstraction for data consisting of multiple values, such as a tensor

#### Data Structure Descriptors (DSDs) [More details are available here.]

- are a compact representation of
  - a chunk of memory, or
  - a sequence of incoming or outgoing wavelets
- enable various repeated operations to be expressed using just one hardware instruction
- are software objects which consist of
  - dsd\_type: one of mem1d\_dsd, mem4d\_dsd, fabin\_dsd, or fabout\_dsd
  - properties: each dsd\_type has different properties
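One way to picture a memory DSD is as a small descriptor (start, length, stride) that tells a single instruction which elements to touch. The Python sketch below follows that picture; `Mem1dDsd` and its fields are hypothetical stand-ins chosen for illustration and do not reproduce the actual CSL properties of mem1d\_dsd.

```python
from dataclasses import dataclass

@dataclass
class Mem1dDsd:
    """Toy descriptor: which elements of a flat memory an op should touch."""
    base: int
    length: int
    stride: int = 1

    def indices(self):
        return [self.base + i * self.stride for i in range(self.length)]

def fadds(memory, dst, src1, src2):
    """One modeled instruction that adds every element the DSDs describe."""
    assert dst.length == src1.length == src2.length
    for d, a, b in zip(dst.indices(), src1.indices(), src2.indices()):
        memory[d] = memory[a] + memory[b]

memory = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0, 0.0, 0.0]
evens = Mem1dDsd(base=0, length=2, stride=2)  # elements 0 and 2
tens = Mem1dDsd(base=4, length=2)             # elements 4 and 5
out = Mem1dDsd(base=8, length=2)              # elements 8 and 9
fadds(memory, out, evens, tens)
print(memory[8:])
```

The loop lives inside the "instruction", so the caller expresses the whole repeated operation with one call.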

#### **ADVANCED:** hardware microthread

A hardware **microthread** is a mechanism for distributing and managing hardware resources to enable asynchronous execution [Details are in Microthread IDs, Async DSD Ops]

- It arbitrates hardware resources such as the ALU, memory access unit, and router interface (~ scheduling).
- An asynchronous DSD operation can be assigned a microthread ID through .ut\_id = @get\_ut\_id(n)
  - The microthread ID can differ from the input/output queue ID.<sup>4</sup>
  - If multiple DSR/DSD operands have the .ut\_id setting specified, the hardware picks one of them according to the order dst > src1 > src2.
- Programmers can attach a priority to each thread (including the main thread).
- **IMPORTANT:** The programmer is responsible for ensuring that no two concurrent DSD operations share a microthread.

<sup>4</sup>In WSE2, it is the same as the output queue ID when using fabout\_dsd, and otherwise the same as the input queue ID.

# The hardware mechanism that manages DSDs: DSR

There has to be a hardware mechanism that makes use of DSDs: **Data Structure Registers (DSRs)**

- DSRs are physical registers that store DSD values.
- All DSD operations actually operate on DSRs behind the scenes, so every DSD operand of a DSD operation must be loaded into a DSR before execution.
- Each DSR belongs to one of three DSR files, namely the dest, src0, and src1 DSR files (i.e., they are physically distinct).
  - dsr\_dest: a DSR number (register address) that can only be used to store the destination operand DSD of a DSD operation
  - dsr\_src0: a DSR number that can be used to store a source operand DSD as well as a destination operand DSD
  - dsr\_src1: a DSR number that can only be used to store a source operand DSD
- Normally, the compiler allocates DSRs (and extra DSRs) automatically, but programmers can use them directly with dsr = @get\_dsr and @load\_to\_dsr(dsr, dsd [, option]).
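The operand rules of the three DSR files fit in a small lookup table. The sketch below encodes exactly the rules listed above; the role labels "dst" and "src" and the helper names are hypothetical.

```python
# Which operand roles each DSR file may hold, as listed above.
DSR_FILES = {
    "dsr_dest": {"dst"},
    "dsr_src0": {"dst", "src"},
    "dsr_src1": {"src"},
}

def can_hold(dsr_file, operand_role):
    """True if a DSR from `dsr_file` may store an operand with this role."""
    return operand_role in DSR_FILES[dsr_file]

def files_for(operand_role):
    """All DSR files that may store an operand with this role."""
    return sorted(f for f, roles in DSR_FILES.items() if operand_role in roles)

print(files_for("dst"))
print(files_for("src"))
```

The table makes the asymmetry visible: only dsr\_src0 can serve both roles.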

#### Communication via router: DSD

#### sender

1. Set the colors used to send/recv data between PEs.
2. Assign the task ID used by a local task to unblock the cmd stream.

# Code Template: link computation and communication using DSD

# Writing the top-level CSL file

There are a few things that programmers need in order for the device code to form a complete program.

- Initialization of the memcpy library infrastructure with @import\_module
  - This allows the host to launch kernels and copy data to and from the device.
  - The width and height parameters, which correspond to the dimensions of the program rectangle, have to be specified.
- A top-level "layout" file, which should
  - define the program rectangle on which the kernel will run, with @set\_rectangle(columns\_dim, rows\_dim)
  - assign a code file to the single PE in the rectangle, with @set\_tile\_code(column\_idx, row\_idx, "pe\_program.csl" [, parameters])
  - pass the memcpy parameters, which are parameterized by the PE's column number (idx), as a parameter with .{ .memcpy\_params = memcpy.get\_params(column\_idx)}

## Code template of the top-level CSL file

```
// Import memcpy layout module for 1 x 1 grid of PEs
const memcpy = @import_module("<memcpy/get_params>", .{ .width = 1, .height = 1 });

layout {
  // Use just one PE (columns=1, rows=1)
  @set_rectangle(1, 1);

  // The lone PE in this program should execute the code in "pe_program.csl"
  @set_tile_code(0, 0, "pe_program.csl", .{ .memcpy_params = memcpy.get_params(0) });

  // Export device symbol for array "y"
  // Last argument is mutability: host can read y, but not write to it
  @export_name("y", [*]f32, false);

  // Export host-callable device function
  @export_name("init_and_compute", fn()void);
}
```

# Writing the host code

#### 1. **Import the required libraries**

- SdkRuntime is the library containing the functionality necessary for loading and running the device code, as well as copying data on and off the wafer.
- MemcpyDataType and MemcpyOrder are enums containing types for use with memcpy calls.

#### 2. Instantiate (= construct) a runner object, for example:

```
runner = SdkRuntime(args.name, cmaddr=args.cmaddr)
```

- name: specifies the directory containing the compilation output
- cmaddr: specifies the IP address and port of the target accelerator, obtained from the command line, e.g. --cmaddr \$CS\_IP\_ADDR:9000<sup>5</sup>

#### 3. Load and run the device kernel (named init\_and\_compute here)

- Before loading the program, get the symbol used for copying the y result off the device.
- runner.load() → runner.run() → runner.launch('init\_and\_compute' [, option])

<sup>5</sup>The CS system uses port 9000 for connecting to the system and launching the program.

# Writing the host code

#### 4. Copy back the result (with many arguments, as follows)

- Before copying, allocate space on the host to hold the result.
- 4.1 The array on the host that holds the result (y\_result) and the symbol on the device that points to the y array (y\_symbol)
- 4.2 The location of the rectangle of PEs from which to copy (called the ROI), given by the northwest corner of the ROI (0, 0) and the width and height of the ROI (1, 1)
- 4.3 How many elements to copy back from each PE in the ROI: M (for a 2D array, M\*N)
- 4.4 ROW\_MAJOR specifies that the data is ordered by (height, width, element).
- 4.5 The data\_type keyword specifies the width of the data copied back.
- 4.6 nonblock=False specifies that this call will not return control to the host until the copy into y\_result has finished.
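Steps 1 to 4 can be put together roughly as follows. This is a sketch based on the usual pattern in the SDK tutorials, not a verified program: the import path, `get_id`, `memcpy_d2h`, `stop`, `streaming=False`, and `MEMCPY_32BIT` do not appear in these slides and should be checked against your SDK version, and `M = 4`, `result_length`, and `run_on_device` are hypothetical. The SDK and NumPy are imported inside the function so that the file can be read without them installed.

```python
def result_length(roi_width, roi_height, elems_per_pe):
    """Number of elements the host buffer must hold for a device-to-host copy."""
    return roi_width * roi_height * elems_per_pe

def run_on_device(name, cmaddr=None, M=4):
    # Step 1: import the libraries (assumed import path; requires the SDK).
    import numpy as np
    from cerebras.sdk.runtime.sdkruntimepybind import (
        SdkRuntime, MemcpyDataType, MemcpyOrder)

    # Step 2: construct the runner from the compilation output directory.
    runner = SdkRuntime(name, cmaddr=cmaddr)

    # Step 3: get the symbol for y, then load, run, and launch the kernel.
    y_symbol = runner.get_id('y')
    runner.load()
    runner.run()
    runner.launch('init_and_compute', nonblock=False)

    # Step 4: allocate host space, then copy from a 1 x 1 ROI whose
    # northwest corner is (0, 0), M elements per PE.
    y_result = np.zeros([result_length(1, 1, M)], dtype=np.float32)
    runner.memcpy_d2h(y_result, y_symbol, 0, 0, 1, 1, M, streaming=False,
                      order=MemcpyOrder.ROW_MAJOR,
                      data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

    runner.stop()
    return y_result

# Example call (needs the SDK and a compiled program in the directory "out"):
# y = run_on_device("out", cmaddr=None)
```

Leaving cmaddr as None is meant to select the simulator, while passing an address such as the --cmaddr value above targets a real system.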

# [Appendix] ROW MAJOR and COLUMN MAJOR

"Learn by example" is the best way to understand the concept.

Here is an example of copying from multiple PEs.

- Configuration
  - Size of ROI: height = 2, width = 3
  - Data computed in each PE:

```
PE0 = [[1, 2, 3],
       [4, 5, 6]]

PE1 = [[7, 8, 9],
       [10, 11, 12]]
```

- ROW\_MAJOR case
  - The data of each PE is copied contiguously.
  - Copied result (1d): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
- COLUMN\_MAJOR case
  - Elements with the same index in each PE are copied contiguously.
  - Copied result (1d): [1, 7, 2, 8, 3, 9, 4, 10, 5, 11, 6, 12]
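The two orderings can be reproduced with plain Python lists. This sketch only models the ordering for the two PEs listed above; `row_major` and `column_major` are hypothetical helpers, not SDK functions.

```python
# Data of the two PEs from the example, flattened per PE.
pe_data = [
    [1, 2, 3, 4, 5, 6],      # PE0
    [7, 8, 9, 10, 11, 12],   # PE1
]

def row_major(per_pe):
    # The PE index varies slowest: each PE's elements stay contiguous.
    return [x for pe in per_pe for x in pe]

def column_major(per_pe):
    # The element index varies slowest: element k of every PE is grouped together.
    return [pe[k] for k in range(len(per_pe[0])) for pe in per_pe]

print(row_major(pe_data))
print(column_major(pe_data))
```

Both functions return the same 12 values; only the interleaving between PEs differs.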

#### References

1. Cerebras SDK Documentation (1.4.0)
2. Cerebras AI Day Deck: A closer look at the world's fastest AI Chip
3. Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning