# Systolic Array for Applying Matrix Multiplication

# Table of contents

- 1. Introduction
- 2. Architecture
  - 2.1 PE (processing element) module
  - 2.2 Systolic\_array module
  - 2.3 systolic\_array\_top module
    - 2.3.1 Input data block
    - 2.3.2 Feed data block
    - 2.3.3 Control unit block
    - 2.3.4 Output block
- 3. Simulation results

# Table of figures

- Figure 1: input matrix\_A method
- Figure 2: input matrix\_B method
- Figure 3: Design architecture
- Figure 4: PE code snapshot
- Figure 5: Structural\_imp snapshot
- Figure 6: Structural\_imp snapshot
- Figure 7: input block in top module
- Figure 8: feed data block
- Figure 9: Control unit block
- Figure 10: output block
- Figure 11: waveform
- Figure 12: log file output
- Figure 13: log file output

# 1. Introduction

This section explains the method used in the design.

The design relies on a skewed array, in which the inputs are arranged diagonally, so that each PE (processing element) receives the right operands in the right sequence and at the right time.



Figure 1: input matrix\_A method

While the input is being received and assigned to a 2-D array as shown in the figure, the first column of this array is sent to the structural\_imp module, the second column is sent in the next cycle, and so on. Sending columns while the input is still arriving keeps the design parallel.



Figure 2: input matrix\_B method

Matrix B uses the same method, but it is sent row by row instead of column by column.
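The skewing described above can be sketched as a small Python behavioral model (not the RTL). The layout is an assumption inferred from the way the feed data block indexes `full_matrix_a[j][count_cycle]` and `full_matrix_b[count_cycle][j]`: row i of A is shifted right by i positions, column j of B is shifted down by j positions, and every other entry is zero. The function names `skew_a` and `skew_b` are hypothetical.

```python
def skew_a(a):
    """Return an N x (2N-1) array in which row i of A is shifted right by i."""
    n = len(a)
    skewed = [[0] * (2 * n - 1) for _ in range(n)]
    for i in range(n):
        for k in range(n):
            skewed[i][i + k] = a[i][k]
    return skewed


def skew_b(b):
    """Return a (2N-1) x N array in which column j of B is shifted down by j."""
    n = len(b)
    skewed = [[0] * n for _ in range(2 * n - 1)]
    for k in range(n):
        for j in range(n):
            skewed[k + j][j] = b[k][j]
    return skewed


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for row in skew_a(matrix):
    print(row)
# [1, 2, 3, 0, 0]
# [0, 4, 5, 6, 0]
# [0, 0, 7, 8, 9]
```

Under this layout, column t of the skewed A and row t of the skewed B are the operands sent in cycle t, for t from 0 to 2N-2.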

# 2. Architecture



Figure 3: Design architecture

The design consists of three main modules:

- PE (processing element) module
- Systolic\_array module
- Systolic\_array\_top module

## 2.1 PE (processing element) module

This module is the building block of the design. On every clock cycle it multiplies its two inputs, accumulates the product onto the previous result, and passes both inputs on to the next PE.

```
always_ff @(posedge clk, negedge rst_n) begin
    if (!rst_n) begin
        a_out <= 0;
        b_out <= 0;
        c <= 0;
    end
    else begin
        c <= c + (a * b); // multiply and accumulate
        a_out <= a; // pass data to next PE
        b_out <= b; // pass data to next PE
    end
end
```

Figure 4: PE code snapshot
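As a cross-check of the multiply-accumulate behavior, the PE can be mirrored by a minimal Python model. This is a sketch, not the RTL: the class name `PEModel` and the method name `clock` are hypothetical, and one call to `clock` stands for one rising clock edge with reset released.

```python
class PEModel:
    """Behavioral model of one processing element."""

    def __init__(self):
        # Values held by the registers after an active-low reset
        self.a_out = 0
        self.b_out = 0
        self.c = 0

    def clock(self, a, b):
        """Apply one rising clock edge with inputs a and b."""
        self.c += a * b   # multiply and accumulate
        self.a_out = a    # pass data to the next PE
        self.b_out = b    # pass data to the next PE


pe = PEModel()
pe.clock(2, 3)
pe.clock(4, 5)
print(pe.c, pe.a_out, pe.b_out)  # 26 4 5
```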

## 2.2 Systolic\_array module

In this module we connect the PEs, assuming the input has already been arranged and is ready to be fed into them.

Figure 5: Structural\_imp snapshot

First, the column of A is placed in the first column of row\_wire, and the row of B is placed in the first row of col\_wire.

Figure 6: Structural\_imp snapshot

The PEs are then instantiated, using row\_wire and col\_wire as the internal connections.
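The wiring can be illustrated with a Python behavioral model of the grid. It is a sketch under assumptions: the class name `SystolicArrayModel` is hypothetical, the grid is N x N, and each PE takes its A operand from its left neighbor and its B operand from the neighbor above, which is how the row and column wires are described here. All registers update together, as nonblocking assignments would.

```python
class SystolicArrayModel:
    """Behavioral model of an N x N grid of multiply-accumulate PEs."""

    def __init__(self, n):
        self.n = n
        # a_out and b_out registers of every PE, plus the accumulators c
        self.a_out = [[0] * n for _ in range(n)]
        self.b_out = [[0] * n for _ in range(n)]
        self.c = [[0] * n for _ in range(n)]

    def clock(self, a_col, b_row):
        """Apply one rising clock edge with a column of A and a row of B."""
        n = self.n
        # Inputs seen by PE(i, j) before the edge: the left edge takes a_col,
        # the top edge takes b_row, every other PE takes a neighbor's output.
        a_in = [[a_col[i] if j == 0 else self.a_out[i][j - 1]
                 for j in range(n)] for i in range(n)]
        b_in = [[b_row[j] if i == 0 else self.b_out[i - 1][j]
                 for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                self.c[i][j] += a_in[i][j] * b_in[i][j]
        # All registers update together on the same edge
        self.a_out, self.b_out = a_in, b_in


array = SystolicArrayModel(2)
# Skewed feed for A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]],
# followed by one all-zero cycle so that the last PE can finish.
feed = [([1, 0], [5, 0]), ([2, 3], [7, 6]), ([0, 4], [0, 8]), ([0, 0], [0, 0])]
for a_col, b_row in feed:
    array.clock(a_col, b_row)
print(array.c)  # [[19, 22], [43, 50]]
```

The model covers only the PE grid; any register stage in front of the grid shifts the timing by a constant number of cycles without changing the result.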

## 2.3 systolic\_array\_top module

This is the top module, which takes the input and arranges it so that the PEs receive their data in the proper sequence and with the proper timing.

As mentioned in the introduction, the feed process is based on a skewed array, so this module consists of several blocks:

- Input data block (where the skewed array is generated).
- Feed data block (where the appropriate data is sent to the Structural\_imp module in each cycle).
- Control unit block (where the internal signals and the counters are controlled).
- Output block (where matrix C and valid\_out are driven).

### 2.3.1 Input data block

This is where the input is arranged diagonally into the 2-D skewed matrix.

Figure 7: input block in top module

The input count serves two purposes: it tracks the N cycles over which the input data is assigned, and it determines the number of shifts needed.

### 2.3.2 Feed data block

This is where the data is fed into the structural\_imp module.

```
always_ff @(posedge clk, negedge rst_n) begin : FEED_to_PE_BLOCK
    if (!rst_n) begin
        a_feed_col <= '{default: '0};
        b_feed_row <= '{default: '0};
    end
    else if (computation_started && count_cycle < (2*N_SIZE)-1) begin
        // Send one skewed column of A and one skewed row of B per cycle
        for (int j = 0; j < N_SIZE; j++) begin
            a_feed_col[j] <= full_matrix_a[j][count_cycle];
            b_feed_row[j] <= full_matrix_b[count_cycle][j];
        end
    end
    else begin
        // Feed zeros when idle or after the last skewed column has been sent
        a_feed_col <= '{default: '0};
        b_feed_row <= '{default: '0};
    end
end : FEED_to_PE_BLOCK
```

Figure 8: feed data block

While the input is still being received, the data is fed immediately to the PEs.
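The indexing in the feed block can be traced with a short Python model. This is a sketch under assumptions: the function name `feed` is hypothetical, and the hard-coded `full_matrix_a` and `full_matrix_b` are the skewed forms that the 3 x 3 matrix 1..9 would take if row i of A is shifted right by i and column j of B is shifted down by j.

```python
N_SIZE = 3

# Assumed skewed contents for A = B = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
full_matrix_a = [[1, 2, 3, 0, 0],
                 [0, 4, 5, 6, 0],
                 [0, 0, 7, 8, 9]]
full_matrix_b = [[1, 0, 0],
                 [4, 2, 0],
                 [7, 5, 3],
                 [0, 8, 6],
                 [0, 0, 9]]


def feed(count_cycle, computation_started=True):
    """Return (a_feed_col, b_feed_row) as registered for the given cycle count."""
    if computation_started and count_cycle < 2 * N_SIZE - 1:
        a_feed_col = [full_matrix_a[j][count_cycle] for j in range(N_SIZE)]
        b_feed_row = [full_matrix_b[count_cycle][j] for j in range(N_SIZE)]
    else:
        # Zeros are fed when idle or once all 2N-1 skewed columns are sent
        a_feed_col = [0] * N_SIZE
        b_feed_row = [0] * N_SIZE
    return a_feed_col, b_feed_row


for cycle in range(2 * N_SIZE):
    print(cycle, feed(cycle))
```

The trace shows 2N-1 non-idle feed cycles (count values 0 to 2N-2), followed by zeros.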

### 2.3.3 Control unit block

This is where the clock-cycle counter and the internal signals are handled.

```
always_ff @(posedge clk, negedge rst_n) begin : CONTROL_UNIT_BLOCK
    if (!rst_n) begin
        computation_started <= 0;
        count_cycle <= 0;
        computation_done <= 0;
    end
    else begin
        // Start computation when valid_in is first asserted
        if (valid_in && !computation_started) begin
            computation_started <= 1;
            count_cycle <= 0; // Start counting from 0
        end
        // Continue counting once started
        else if (computation_started) begin
            if (count_cycle < (2*N_SIZE)-1) begin
                count_cycle <= count_cycle + 1;
            end
            else if (!computation_done) begin
                computation_done <= 1;
            end
        end
    end
end : CONTROL_UNIT_BLOCK
```

Figure 9: Control unit block

The computation\_started signal triggers and initializes the counter, which counts clock cycles from the moment the PEs start working.

Note that the PEs work in parallel with the input feeding, which improves efficiency.

The computation\_done signal marks the end of the processing and triggers the output of the result matrix.
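The control sequence can be stepped through with a small Python model of the three registers. This is a sketch, not the RTL: the function names `control_step` and `edges_until_done` are hypothetical, the state is the tuple (computation_started, count_cycle, computation_done), and `valid_in` is assumed to be held high from the first clock edge after reset.

```python
def control_step(state, valid_in, n_size):
    """Return the register values after one rising clock edge."""
    started, count_cycle, done = state
    if valid_in and not started:
        # Start the computation and begin counting from 0
        return (True, 0, done)
    if started:
        if count_cycle < 2 * n_size - 1:
            return (True, count_cycle + 1, done)
        if not done:
            return (True, count_cycle, True)
    return state


def edges_until_done(n_size):
    """Count the clock edges from reset release until computation_done is set."""
    state = (False, 0, False)
    edges = 0
    while not state[2]:
        state = control_step(state, valid_in=True, n_size=n_size)
        edges += 1
    return edges


print(edges_until_done(3))  # 7
```

In this model, computation_done is set 2N+1 clock edges after reset release: one edge to start, 2N-1 edges of counting, and one edge to set the flag.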

### 2.3.4 Output block

```
always_ff @(posedge clk, negedge rst_n) begin : OUTPUT_BLOCK
    if (!rst_n) begin
        valid_out <= 0;
        matrix_c_out <= '{default: '0};
        count_out <= 0;
    end
    else begin
        // Stream one row of the result per cycle once the computation is done
        if (computation_done && count_out < N_SIZE) begin
            valid_out <= 1;
            for (int k = 0; k < N_SIZE; k++) begin
                matrix_c_out[k] <= output_full_matrix[count_out][k];
            end
            count_out <= count_out + 1;
        end
        else if (count_out >= N_SIZE) begin
            valid_out <= 0;
        end
    end
end : OUTPUT_BLOCK
```

Figure 10: output block

After the processing ends, the module takes N cycles to output the rows of the result matrix, one row per cycle.
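The row-per-cycle output can be traced with a Python model of the output registers. This is a sketch under assumptions: the function name `output_step` is hypothetical, the state is the tuple (valid_out, matrix_c_out, count_out), and computation_done is taken to be already high.

```python
def output_step(state, computation_done, output_full_matrix, n_size):
    """Return the output registers after one rising clock edge."""
    valid_out, matrix_c_out, count_out = state
    if computation_done and count_out < n_size:
        # Drive the next row of the result and advance the row counter
        return (1, list(output_full_matrix[count_out]), count_out + 1)
    if count_out >= n_size:
        # All rows have been sent, so deassert valid_out
        return (0, matrix_c_out, count_out)
    return state


result = [[30, 36, 42], [66, 81, 96], [102, 126, 150]]
state = (0, [0, 0, 0], 0)
streamed = []
for _ in range(4):
    state = output_step(state, True, result, 3)
    if state[0]:
        streamed.append(state[1])
print(streamed)  # [[30, 36, 42], [66, 81, 96], [102, 126, 150]]
```

For N = 3 the model holds valid_out high for three edges and drops it on the fourth.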

# 3. Simulation results



Figure 11: waveform

```
=== First Matrix Multiplication ===
Matrix A (fed column-wise):
1 2 3
4 5 6
7 8 9

Matrix B (fed row-wise):
1 2 3
4 5 6
7 8 9

Result Matrix C = A × B:
30 36 42
66 81 96
102 126 150
```

Figure 12: log file output

```
=== Second Matrix Multiplication ===

Matrix A (fed column-wise):
2 1 3
0 4 2
1 3 5

Matrix B (fed row-wise):
1 0 2
3 1 4
2 2 1

Result Matrix C = A × B:
11 7 11
16 8 18
20 13 19

$finish called at time : 265 ns : File "C:/STM_assesment/systolic_array_MUL/simu/systolic_array_tb.sv" Line 222
```

Figure 13: log file output
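The two logged results can be checked independently against a plain software matrix multiplication. The Python reference below is a sketch for verification only. The helper name `matmul` is hypothetical, and the matrices are copied from the log output above.

```python
def matmul(a, b):
    """Reference product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


first = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
second_a = [[2, 1, 3], [0, 4, 2], [1, 3, 5]]
second_b = [[1, 0, 2], [3, 1, 4], [2, 2, 1]]

print(matmul(first, first))        # [[30, 36, 42], [66, 81, 96], [102, 126, 150]]
print(matmul(second_a, second_b))  # [[11, 7, 11], [16, 8, 18], [20, 13, 19]]
```

Both products match the result matrices reported in the log.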