Architetture dei Sistemi di Elaborazione 02GOLOV

Computer Architectures 02LSEYG

Laboratory 0x01

Expected delivery of lab\_01.zip including:

- program\_3.s

- lab\_01.pdf (fill in this file and export it to PDF)

## **Delivery date 16/10/2025**

# General procedure for simulating a program with gem5 and visualizing the pipeline behaviour

1. If you are working on a <u>LABINF PC</u>, remember to boot Ubuntu, NOT WINDOWS.
   The next labs need to be carried out on Ubuntu.
   If you are using the Ubuntu <u>Virtual Machine</u>, the password for the default user, if needed, is "0000".
2. Once you have booted Ubuntu, open a Terminal (Ctrl+Alt+T) and create a folder to work in:

mkdir lab1 && cd lab1

3. If you are working on a **LABINF PC**, execute the following command:

export PYTHONPATH=\${PYTHONPATH}:/opt/gem5-22.1-ASE/gem5/configs/

If you are using the **Virtual Machine**:

export PYTHONPATH=\${PYTHONPATH}:/home/vboxuser/gem5/configs/

**NOTE:** the export command needs to be run every time you open a new shell.

4. Write your RISC-V assembly program and save it in program1.s (any file name is ok, as long as it has the .s extension). You can open it by running the following command:

gedit program1.s

The assembly program **MUST HAVE THE FOLLOWING STRUCTURE**:

```
# Data section
.section .data
# Place here your program data.
# In this example, two vectors of floats,
# a vector of ints and a single int are defined
V1: .float 1.0, 2.0, 3.0, 4.0
V2: .float 5.0, 6.0, 7.0, 8.0
V3: .word 0, 1, 2, 3
T0: .word 0x0BADC0DE
# Code section
.section .text
# The _start label signals the entry point of your program
# DO NOT CHANGE ITS NAME.
# It must be "_start", not "start", not "main", not "start_".
# It's "_start" with a leading underscore and all lowercase letters
.globl _start
_start:
     # In the start area, load the first byte/word of each of
     # the areas declared in the .data section
     # This is needed to load data in the cache and avoid
     # pipeline stalls later
     la x1, V1
     flw fs1, 0(x1)
     la x1, V2
     flw fs1, 0(x1)
     la x1, V3
     lw x2, 0(x1)
     la x1, T0
     lw x2, 0(x1)
Main:
     # Your code goes here
     addi x1, x0, 0
End:
     # exit() syscall. This is needed to end the simulation
     # gracefully
     li a0, 0
     li a7, 93
     ecall
```

5. Compile using riscv\_compile, passing program1.s as a parameter. <u>If you are working on a LABINF PC</u>, you will find riscv\_compile under /opt/gem5-22.1-ASE/gem5\_example. The complete command will be:

```
/opt/gem5-22.1-ASE/gem5_example/riscv_compile program1.s
```

The compiled program will have the same name as the assembly file and no extension, e.g. program1.s -> program1

6. Copy the gem5 configuration file gem5\_config.py from either the "Visualizer Example" folder on your Desktop (if you are using the VM) or from /opt/gem5-22.1-ASE/gem5\_example (if you are using a LABINF PC).

```
cp /opt/gem5-22.1-ASE/gem5_example/gem5_config.py .
```

7. Simulate your program using gem5. The simulation can be carried out using the gem5\_run command on LABINF PCs:

```
/opt/gem5-22.1-ASE/gem5_example/gem5_run gem5_config.py program1 program1.log
```

On the Virtual Machine:

```
gem5_run gem5_config.py program1 program1.log
```

- The gem5\_config.py file contains the pipeline configuration. Here you can modify the operation latency and the issue latency of integer and floating point functional units (ALU, Multiplier, Divider):

```
INTEGER_ALU_LATENCY = 1
INTEGER_MUL_LATENCY = 1
INTEGER_DIV_LATENCY = 1
FLOAT_ALU_LATENCY = 3
FLOAT_MUL_LATENCY = 5
FLOAT_DIV_LATENCY = 5

INTEGER_ALU_ISSUE_LATENCY = 0
INTEGER_MUL_ISSUE_LATENCY = 0
INTEGER_DIV_ISSUE_LATENCY = 0
FLOAT_ALU_ISSUE_LATENCY = 0
FLOAT_MUL_ISSUE_LATENCY = 0
FLOAT_DIV_ISSUE_LATENCY = FLOAT_DIV_LATENCY
...
```

If the simulation throws an error because it cannot find the "common" module, copy the "common" folder from the folder where you found gem5\_config.py into the folder that contains your copy of gem5\_config.py (see step 6).

- program1 is the program you have compiled before. If it is in a different path than the folder you are currently in, you need to provide the full path.
- program1.log is the output trace produced by gem5. Remember the folder you are working in, since the log file is needed for the next step.
8. Open the Pipeline Visualizer. On <u>LABINF PCs</u>, the visualizer is available under /opt/gem5-22.1-ASE:

```
/opt/gem5-22.1-ASE/Gem5_Pipeline_Visualizer-x86_64.AppImage
```

If you are using the <u>VM</u>, you can use the shortcut on your Desktop.

9. Click on File -> Open and open program1.log



- Use the navigation buttons to advance/rewind the pipeline.
  - You can fast-forward to the end, fast-rewind to the beginning, or advance/rewind by a single clock cycle.
  - You can also choose a specific clock cycle to jump to.



- Hovering your mouse over an instruction highlights its path through the pipeline.
- Hovering your mouse over a pipeline stage shows the clock cycle and highlights the instruction it belongs to.



- Use the dropdown menu in the register section to choose the representation of register contents.



# Exercise 1

Using the same flow described before, write and run a program for calculating the Fibonacci sequence and save it in program2.s (any file name is ok, as long as it has the .s extension). The assembly program is the following:

```
# Data section
.section .data
# Place here your program data.
# In this example, a single int is defined
T0: .word 0x0BADC0DE
# Code section
.section .text
# The _start label signals the entry point of your program
# DO NOT CHANGE ITS NAME. It must be "_start", not "start",
# not "main", not "start_".
# It's "_start" with a leading underscore and
# all lowercase letters
.globl _start
_start:
# In the start area, load the first byte/word of each of
# the areas declared in the .data section
# This is needed to load data in the cache and avoid
# pipeline stalls later
Main:
   # Initialize Fibonacci variables
   li x1, 0        # x1 = a = first Fibonacci number (0)
   li x2, 1        # x2 = b = second Fibonacci number (1)
   li x3, 21       # x3 = count = number of terms to generate
   li x4, 0        # x4 = i = loop counter
   # Loop to generate the remaining 20 numbers
   addi x4, x4, 1  # i = 1 (start from second iteration)
fib_loop:
   beq x4, x3, End # if i == count, exit loop
   # Calculate next Fibonacci number
   add x5, x1, x2  # x5 = next = a + b
   # Update variables for next iteration
   mv x1, x2       # a = b (previous second becomes first)
   mv x2, x5       # b = next (calculated next becomes second)
   # Increment counter and continue loop
   addi x4, x4, 1  # i++
   j fib_loop      # Jump back to loop start
End:
   # exit() syscall. This is needed to end the simulation
   # gracefully
   li a0, 0
   li a7, 93
   ecall
```

The above assembly code implements the following C code for calculating the Fibonacci sequence:
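The C listing itself is not reproduced in this handout. The block below is only a minimal sketch of what an equivalent C routine could look like, reconstructed from the comments in the assembly. The function name `fib_last` and the choice to return the final value of `b` are assumptions made for illustration, and any printing the original C code may have done is left out.

```c
#include <stdint.h>

/* Hypothetical C counterpart of the Fibonacci loop in program2.s.
 * Register mapping taken from the assembly comments:
 *   a -> x1, b -> x2, count -> x3, i -> x4, next -> x5.
 * Returns the value left in b (x2) when the loop exits.
 * Like the assembly, it expects count >= 1. */
uint32_t fib_last(uint32_t count)
{
    uint32_t a = 0;          /* first Fibonacci number */
    uint32_t b = 1;          /* second Fibonacci number */
    uint32_t i = 0;          /* loop counter */

    i = i + 1;               /* start from the second iteration */
    while (i != count) {     /* beq x4, x3, End */
        uint32_t next = a + b;
        a = b;               /* previous second becomes first */
        b = next;            /* calculated next becomes second */
        i++;
    }
    return b;
}
```

With `count = 21`, as in the assembly, the loop body runs 20 times, so the value returned here should match the content of x2 shown by the visualizer at the end of the simulation.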

# Exercise 2

Using the same flow described before, write and run an assembly program called **program\_3.s** (to be delivered) for the *RISC-V* architecture.

The program must:

1. Given two arrays of ten 8-bit signed integers (v1, v2), check whether each element of v1 appears at least once in v2. Save every matching value in a third vector (v3).

For example:

```
v1: .byte 2, 6, -3, 11, 9, 18, -13, 16, 5, 1
v2: .byte 4, 2, -13, 3, 9, 9, 7, 16, 4, 7
```

The third vector will be composed as follows.

```
v3: .byte 2, 9, -13, 16
```

2. Set three flags (flag1, flag2, flag3) to indicate three conditions:
   - a. The third vector (v3) is empty. Use one 8-bit unsigned variable (flag1) to flag the condition. The variable will be equal to 1 if v3 is empty, 0 otherwise.
   - b. The third vector (v3) is not empty, and each element is greater than the previous one (v3[i+1]>v3[i]). In this case, use one 8-bit unsigned variable (flag2) to flag the condition. The variable will be equal to 1 if the condition is satisfied, 0 otherwise.
   - c. The third vector (v3) is not empty, and each element is smaller than the previous one (v3[i+1]<v3[i]). In this case, use one 8-bit unsigned variable (flag3) to flag the condition. The variable will be equal to 1 if the condition is satisfied, 0 otherwise.
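Before writing the assembly, it can help to have a small C reference model to compare the register and memory contents against. The sketch below is one possible reading of the specification, not an official solution. The function names `intersect` and `set_flags` are hypothetical. It assumes that matches are stored in the order in which they occur in v1, that a value repeated in v1 is stored once per occurrence, and that a v3 with a single element satisfies both ordering conditions; the text does not settle these corner cases.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies into v3 every element of v1 that also occurs in v2,
 * keeping the order of v1 (assumption). Returns the length of v3.
 * v3 must have room for n elements. */
size_t intersect(const int8_t *v1, const int8_t *v2, size_t n, int8_t *v3)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (v1[i] == v2[j]) {
                v3[len++] = v1[i];
                break;              /* one match in v2 is enough */
            }
        }
    }
    return len;
}

/* flag1: v3 is empty.
 * flag2: v3 is not empty and strictly increasing.
 * flag3: v3 is not empty and strictly decreasing.
 * Assumption: a one-element v3 sets both flag2 and flag3. */
void set_flags(const int8_t *v3, size_t len,
               uint8_t *flag1, uint8_t *flag2, uint8_t *flag3)
{
    *flag1 = (len == 0);
    *flag2 = (len != 0);
    *flag3 = (len != 0);
    for (size_t i = 0; i + 1 < len; i++) {
        if (v3[i + 1] <= v3[i]) *flag2 = 0;   /* not increasing */
        if (v3[i + 1] >= v3[i]) *flag3 = 0;   /* not decreasing */
    }
}
```

For the example vectors above, this model yields v3 = 2, 9, -13, 16 and all three flags equal to 0, since v3 is neither empty nor monotonic.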

If you see stalls when loading data at the beginning of the program, can you explain why that happens?

The initial stalls appear because the first lb instructions read data that is not yet in the D-cache: this is a cold miss, so the MEM unit must wait for the line to arrive from memory and the pipeline is held by an interlock. IF and ID stay frozen, and in the visualizer the "S" marks appear on the auipc row only because the front end is frozen while the lb waits. If the configuration uses single-ported memory, the data access in MEM also prevents the fetch in the same cycle and adds structural stalls.

Remember that, after the declaration of the vectors, you were instructed to write (and adapt) a few lines, as described here.

Collect the clock cycles to fill the following table.

Table 1: Program performance for the specific processor configurations

| Program   | Clock cycles | Number of instructions | Clocks per instruction (CPI) | Instructions per clock (IPC) |
|-----------|--------------|------------------------|------------------------------|------------------------------|
| program_1 | 23           | 13                        | 1.769                        | 0.565                              |
| program_2 | 156          | 126                       | 1.238                        | 0.808                              |
| program_3 | 793          | 594                       | 1.335                        | 0.749                              |
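The last two columns of Table 1 follow directly from the first two: CPI is the clock cycle count divided by the instruction count, and IPC is its reciprocal. As a sketch, the two helpers below (the names `cpi` and `ipc` are chosen here for illustration) reproduce that arithmetic.

```c
/* Clocks per instruction: total clock cycles / executed instructions. */
double cpi(unsigned long cycles, unsigned long instructions)
{
    return (double)cycles / (double)instructions;
}

/* Instructions per clock: the reciprocal of CPI. */
double ipc(unsigned long cycles, unsigned long instructions)
{
    return (double)instructions / (double)cycles;
}
```

For program\_1, `cpi(23, 13)` is about 1.769 and `ipc(23, 13)` is about 0.565, which agrees with the first row of the table.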

# Exercise 3

Perform execution and CPI measurements of some benchmark programs. The folder of this laboratory includes the following two programs:

- a) calculate\_pi.s
- b) insertion\_sort.s

Do the same with the programs of the previous exercises:

- a) program\_1.s
- b) program\_2.s
- c) program\_3.s

In the initial scenario, every program is assumed to have the same weight (20%). Assume a processor frequency of 1.75 kHz (a very old technology node).

Fill the following table assuming different scenarios:

- Scenario 1:
  - program\_1.s weighs 1%
  - program\_2.s weighs 50%
  - program\_3.s weighs 13%
  - calculate\_pi.s weighs 25%
  - insertion\_sort.s weighs 11%
- Scenario 2:
  - program\_1.s weighs 10%
  - program\_2.s weighs 5%
  - program\_3.s weighs 50%
  - calculate\_pi.s weighs 10%
  - insertion\_sort.s weighs 25%
- Scenario 3:
  - program\_1.s weighs 20%
  - program\_2.s weighs 30%
  - program\_3.s weighs 1.9%
  - calculate\_pi.s weighs 31.4%
  - insertion\_sort.s weighs 16.7%

Table 2: Processor performance for different weighted programs

| Program                 | Initial scenario (s) | Scenario 1 (s) | Scenario 2 (s) | Scenario 3 (s) |
|-------------------------|----------------------|----------------|----------------|----------------|
| calculate_pi.s          | 1.731                | 0.4328         | 0.1731         | 0.5435         |
| insertion_sort.s        | 7.743                | 0.8517         | 1.9358         | 1.2931         |
| program_1.s             | 0.013                | 0.0001         | 0.0013         | 0.0026         |
| program_2.s             | 0.089                | 0.0445         | 0.0045         | 0.0267         |
| program_3.s             | 0.453                | 0.0589         | 0.2265         | 0.0086         |
| TOTAL time (@ 1.75 kHz) | 10.029               | 1.3880         | 2.3412         | 1.8745         |
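The scenario columns can be cross-checked with a few lines of C. The sketch below assumes that the execution time of a program is its clock cycle count divided by the clock frequency, and that the weighted contribution is that time multiplied by the weight. The function names are hypothetical, and the cycle counts for calculate\_pi.s and insertion\_sort.s have to be read from your own simulations, since they are not listed here.

```c
#include <stddef.h>

/* Execution time in seconds: clock cycles / clock frequency in Hz. */
double exec_time_s(unsigned long cycles, double freq_hz)
{
    return (double)cycles / freq_hz;
}

/* Weighted contribution of one program, weight given in percent. */
double weighted_time_s(unsigned long cycles, double freq_hz,
                       double weight_percent)
{
    return (weight_percent / 100.0) * exec_time_s(cycles, freq_hz);
}

/* Sum of the weighted contributions of n programs. */
double total_weighted_time_s(const unsigned long *cycles,
                             const double *weights_percent,
                             size_t n, double freq_hz)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += weighted_time_s(cycles[i], freq_hz, weights_percent[i]);
    }
    return total;
}
```

For instance, `exec_time_s(793, 1750.0)` gives about 0.453 s for program\_3.s, and `weighted_time_s(156, 1750.0, 50.0)` gives about 0.0446 s for program\_2.s in Scenario 1. Results computed from exact cycle counts may differ in the last digit from values obtained by weighting times that were already rounded.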