Architetture dei Sistemi di Elaborazione O2GOLOV

## Delivery date: until 2 AM of October 30th 2024

Laboratory 4

Expected delivery of **lab\_4.zip** must include:

- Each configuration of the custom architecture (riscv\_o3\_custom.py) that you modify.
- This document, with all fields filled in, in PDF format.

# **Introduction and Background**

Simulating an Out-of-Order (OoO) CPU (O3CPU)



In this laboratory, you will configure an OoO CPU using a script called riscv\_o3\_custom.py. In short, the script configures an <u>Out-of-Order (O3) processor</u> based on the *DerivO3CPU*, a superscalar processor with a reduced number of features.

### **Pipeline**

The processor pipeline stages can be summarized as:

- **Fetch stage:** instructions are fetched from the instruction cache. The fetchWidth parameter sets the number of fetched instructions. This stage does branch prediction and branch target prediction.
- **Decode stage:** This stage decodes instructions and handles the execution of unconditional branches. The decodeWidth parameter sets the maximum number of instructions processed per clock cycle.
- **Rename stage:** As suggested by the name, registers are renamed, and the instruction is pushed to the IEW (Issue/Execute/Write Back) stage. It checks that the *Instruction Queue* (**IQ**)/*Load and Store Queue* (**LSQ**) can hold the new instruction. The maximum number of instructions processed per clock cycle is set by the renameWidth parameter.



Figure 1: Understanding configurable OoO CPU parameters.

- **Dispatch stage**: instructions whose renamed operands are available are dispatched to functional units (**FU**). Loads and stores are dispatched to the Load/Store Queue (**LSQ**). The maximum number of instructions processed per clock cycle is set by the dispatchWidth parameter.
- **Issue stage**: The simulated processor has a single instruction queue from which all instructions are issued. Ordinarily, <u>instructions are taken in-order from this queue</u>. An instruction is issued if it does not have any dependency.
- **Execute stage**: each functional unit (FU) processes its instruction. Each functional unit can be configured with a different latency. Conditional branch <u>mispredictions are identified here</u>. The maximum number of instructions processed per clock cycle depends on the different functional units configured and their latencies.
- **Writeback stage**: it sends the result of the instruction to the reorder buffer (ROB). The maximum number of instructions processed per clock cycle is set by the wbWidth parameter.
- **Commit stage**: it processes the reorder buffer, freeing up reorder buffer entries. The maximum number of instructions processed per clock cycle is set by the commitWidth parameter. Commit is done in order.
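The per-stage width parameters listed above bound how many instructions can move through the pipeline each cycle. The following is a minimal sketch, not taken from riscv\_o3\_custom.py: the `StageWidths` class, its default values, and the helper functions are hypothetical, and the model ignores functional-unit latencies, cache misses, and squashes. It only shows that the narrowest stage sets a lower bound on the achievable CPI.

```python
from dataclasses import dataclass, fields
from typing import Tuple


@dataclass
class StageWidths:
    """Per-cycle instruction limits of each stage (illustrative values only)."""
    fetchWidth: int = 2
    decodeWidth: int = 2
    renameWidth: int = 2
    dispatchWidth: int = 2
    wbWidth: int = 2
    commitWidth: int = 2


def bottleneck(widths: StageWidths) -> Tuple[str, int]:
    """Return the name and value of the narrowest stage (first one on ties)."""
    name = min((f.name for f in fields(widths)),
               key=lambda n: getattr(widths, n))
    return name, getattr(widths, name)


def best_case_cpi(widths: StageWidths) -> float:
    """Lower bound on CPI: no stage sustains more than its width per cycle."""
    _, width = bottleneck(widths)
    return 1.0 / width


if __name__ == "__main__":
    # Hypothetical configuration: every stage widened to 4 except rename.
    w = StageWidths(fetchWidth=4, decodeWidth=4, renameWidth=2,
                    dispatchWidth=4, wbWidth=4, commitWidth=4)
    print(bottleneck(w), best_case_cpi(w))
```

In this toy model, widening a single stage helps only until another stage becomes the narrowest one, which is why the widths are usually tuned together.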

In the event of a **branch misprediction**, trap, or other speculative execution event, "squashing" can occur at all stages of this pipeline. When a pending instruction is squashed, it is removed from the instruction queues, reorder buffers, requests to the instruction cache, etc.
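Squashing and in-order commit can be illustrated with a toy reorder buffer. This is a simplified sketch under stated assumptions, not gem5's implementation: the `ReorderBuffer` class and its method names are hypothetical, instructions are identified only by a sequence number, and a squash removes every instruction younger than the mispredicted branch.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: entries are kept in program order, oldest on the left."""

    def __init__(self):
        self.entries = deque()

    def dispatch(self, seq):
        """Allocate an entry for a newly dispatched instruction."""
        self.entries.append({"seq": seq, "done": False})

    def complete(self, seq):
        """Mark an instruction as executed (possibly out of order)."""
        for entry in self.entries:
            if entry["seq"] == seq:
                entry["done"] = True

    def squash_younger_than(self, seq):
        """Drop every entry younger than `seq`, e.g. a mispredicted branch."""
        squashed = [e["seq"] for e in self.entries if e["seq"] > seq]
        self.entries = deque(e for e in self.entries if e["seq"] <= seq)
        return squashed

    def commit(self, width):
        """Retire up to `width` completed instructions from the head, in order."""
        committed = []
        while (self.entries and self.entries[0]["done"]
               and len(committed) < width):
            committed.append(self.entries.popleft()["seq"])
        return committed


if __name__ == "__main__":
    rob = ReorderBuffer()
    for seq in range(1, 6):
        rob.dispatch(seq)
    rob.complete(3)                      # finishes early, out of order
    print(rob.commit(2))                 # head (1) not done: nothing commits
    rob.complete(2)                      # suppose 2 is a mispredicted branch
    print(rob.squash_younger_than(2))    # wrong-path instructions removed
    rob.complete(1)
    print(rob.commit(2))                 # 1 and 2 retire in the same cycle
```

Note that instruction 3 completed execution but never committed, because it was on the wrong path and was squashed first.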



Figure 2: Example of a branch **misprediction** (transparent rows)

### **Pipeline Resources**

Additionally, it has the following structures:

- Branch predictor (BP)
  - Allows for selection between several branch predictors, including a local predictor, a
    global predictor, and a tournament predictor. Also has a branch target buffer (BTB)
    and a return address stack (RAS).
- Reorder buffer (ROB)
  - Holds instructions that have reached the back end. Handles squashing instructions and keeps instructions in program order.
- Instruction queue (IQ)
  - Handles dependencies between instructions and scheduling ready instructions. Uses the **memory dependence predictor** to tell when memory operations are ready.
- Load-store queue (LSQ)
  - Holds loads and stores that have reached the back end. It hooks up to the d-cache and initiates accesses to the memory system once memory operations have been issued and executed. Also handles forwarding from stores to loads, replaying memory operations if the memory system is blocked, and detecting memory ordering violations.
- Functional units (FU)
  - Provide timing for instruction execution. They determine the latency of an executing instruction, as well as which instructions can issue each cycle.
  - **Floating point units, floating point registers,** and respective instructions are supported.
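How the number of functional units and their latency shape execution time can be approximated with a small model. The sketch below is an assumption-laden illustration, not gem5's scheduler: the function name is hypothetical, all operations are assumed independent with operands ready, and no other pipeline limit applies. The arguments `count` and `op_lat` stand for the number of units of one type and the latency of one operation.

```python
import math


def cycles_for_independent_ops(n_ops: int, count: int, op_lat: int,
                               pipelined: bool = True) -> int:
    """Cycles needed to finish n_ops independent operations of one type.

    count  -- number of functional units of that type
    op_lat -- latency of one operation, in cycles
    """
    if n_ops < 0 or count < 1 or op_lat < 1:
        raise ValueError("n_ops must be >= 0, count and op_lat must be >= 1")
    if n_ops == 0:
        return 0
    batches = math.ceil(n_ops / count)   # groups of ops started together
    if pipelined:
        # A new batch starts every cycle; the last one ends op_lat cycles later.
        return batches + op_lat - 1
    # Each unit stays busy for op_lat cycles before accepting the next op.
    return batches * op_lat


if __name__ == "__main__":
    print(cycles_for_independent_ops(8, count=2, op_lat=4))
    print(cycles_for_independent_ops(8, count=4, op_lat=4))
    print(cycles_for_independent_ops(8, count=2, op_lat=4, pipelined=False))
```

Under these assumptions, 8 operations on 2 pipelined units with a latency of 4 take 7 cycles; doubling the number of units gives 5 cycles, and so does halving the latency.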

| 560: s561 (t0: r160): 0x00010106: fmv_w_x fa5, zero  | F | Dc | Rn | 1  | Is | 1  | 2  | 3  | Cm | 1 |   |
|------------------------------------------------------|---|----|----|----|----|----|----|----|----|---|---|
| 561: s562 (t0: r161): 0x0001010a: c_addi16sp sp, -64 | F | Dc | Rn | 1  | Is | Cm | 1  | 2  | 3  | 4 |   |
| 562: s563 (t0: r162): 0x0001010c: c_fsdsp fs0, 8(sp) | F | 1  | Dc | Rn | 1  | Is | Mc | 1  | 2  | 3 | 4 |
| 563: s564 (t0: r163): 0x0001010e: c_fsdsp fs1, 0(sp) | F | 1  | Dc | Rn | 1  | 2  | 3  | Is | Mc | 1 | 2 |

Figure 3: Pipeline example of FP instructions and FP registers

# **Laboratory: hands-on**

## All the needed resources are at a GitHub repository:

https://github.com/cad-polito-it/ase\_riscv\_gem5\_sim

To create your simulation environment:

For HTTPS clone:

~/my\_gem5Dir\$ git clone https://github.com/cad-polito-it/ase\_riscv\_gem5\_sim.git

For SSH clone:

~/my\_gem5Dir\$ git clone git@github.com:cad-polito-it/ase\_riscv\_gem5\_sim.git

The environment is configured to be executed on the LABINF MACHINES.

Follow the HOWTO instructions available on the GitHub Repository for simulating a program.

## **Exercise 1:**

Simulate the benchmark my\_c\_benchmark\_2 (main.c) using the gem5 simulator to obtain the trace.out file. Then, you can visualize the pipeline (i.e., load the trace.out file in Konata).

Based on the CPU architecture described in riscv\_o3\_custom.py, inspect the pipeline in Konata to find the following conditions:

1. Out-of-order execution (issue), in-order commit (commit)
2. Two commits in the same clock cycle
3. Flush of the pipeline
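The first two conditions can also be checked on per-instruction timing data once issue and commit cycles are known. The sketch below is only an illustration: the `Timing` record, the function names, and the sample cycle numbers are hypothetical and are not read from trace.out. It does not cover the pipeline flush, which shows up as instructions that never commit.

```python
from collections import Counter
from typing import List, NamedTuple, Tuple


class Timing(NamedTuple):
    seq: int      # program-order sequence number
    issue: int    # cycle in which the instruction was issued
    commit: int   # cycle in which the instruction was committed


def out_of_order_issues(rows: List[Timing]) -> List[Tuple[int, int]]:
    """Adjacent pairs (older, younger) where the younger one issued first."""
    ordered = sorted(rows, key=lambda r: r.seq)
    return [(a.seq, b.seq) for a, b in zip(ordered, ordered[1:])
            if b.issue < a.issue]


def commits_in_order(rows: List[Timing]) -> bool:
    """True if no instruction commits before an older one."""
    ordered = sorted(rows, key=lambda r: r.seq)
    return all(a.commit <= b.commit for a, b in zip(ordered, ordered[1:]))


def shared_commit_cycles(rows: List[Timing]) -> List[int]:
    """Cycles in which two or more instructions commit."""
    counts = Counter(r.commit for r in rows)
    return sorted(cycle for cycle, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical timings, chosen by hand for illustration.
    rows = [Timing(1, 5, 9), Timing(2, 4, 9), Timing(3, 6, 10)]
    print(out_of_order_issues(rows))    # instruction 2 issued before 1
    print(commits_in_order(rows))
    print(shared_commit_cycles(rows))   # 1 and 2 commit together
```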

For every condition, fill in the following tables.

| Condition | Out-of-order execution, in-order commit |
|---|---|
| Screenshot from Konata | [Konata screenshot of trace.out, instructions 5963 to 5978] |
| Explain the reason behind the condition | An instruction is issued as soon as its operands are ready, even if earlier instructions are still pending. This maximizes resource utilization and reduces stalls. To keep the program state consistent, the instructions are nevertheless committed in program order. |
| Briefly explain the advantages of the OoO execution in a CPU | It allows the CPU to process instructions as soon as their inputs are ready, instead of waiting for previous instructions to complete. This reduces idle time and increases resource utilization. |

| Condition | Two or more commits in the same clock cycle |
|---|---|
| Screenshot from Konata | [Konata screenshot of trace.out, instructions around 5906 to 5921] |
| Explain the reason behind the condition | Several instructions commit in the same clock cycle when the oldest instructions in the reorder buffer have all completed execution and the commit width allows more than one commit per cycle. Committing more than one instruction per cycle ensures that instructions ready for commit do not wait unnecessarily. |
| Briefly explain the Commit functioning | The commit stage finalizes executed instructions in program order, ensuring consistency and correct program flow. In an OoO CPU, even if instructions are executed out of order, they are committed in the original sequence to maintain a correct program state, which is essential for predictable program behavior. |

| Condition | Flush of the pipeline |
|---|---|
| Screenshot from Konata | [Konata screenshot] |

## Explain reason behind the condition

A pipeline flush occurs when the CPU discovers that instructions it has speculatively fetched and executed must not complete, typically because of a branch misprediction or an exception. The instructions younger than the offending one were fetched along the wrong path (or after the faulting instruction), so the CPU squashes them before they can commit incorrect results and restarts fetching from the correct address.

# **Exercise 2:**

Given your benchmark (main.c in my\_c\_benchmark\_2), optimize the CPU architecture (i.e., modify the riscv\_o3\_custom.py file) and write down the improvements in terms of CPI and speedup.

To optimize the CPU architecture, open the configuration file of the CPU (i.e., riscv\_o3\_custom.py) and tune specific hardware-related parameters.

You have to change specific values in **one or more** stages of the pipeline:

- # FETCH STAGE
  - Tune parameters such as fetchWidth, fetchBufferSize and so on, and see the effects on your system.
- # DECODE STAGE
- # RENAME STAGE
  - Try changing some values, but don't touch the "Phys" ones.
- # DISPATCH/ISSUE STAGE
- # EXECUTE STAGE
  - Here you can optimize the functional units of your CPU, such as the INT ALU, the FP ALU, the FP Multiplier/Divider and so on.
  - Tune the number of units (count) that you have in the system, as well as their latency (opLat), to see how this affects the execution of your program.
- You can create a different branch predictor (they are defined in create\_predictor.py).
- You can also try to change the parameters of the L1 Cache. Look for "class L1Cache" in the riscv\_o3\_custom.py file. The L1 cache, also referred to as the primary cache, is the smallest and fastest level of memory. It is located directly on the processor and stores data that the CPU accesses frequently. In this way, the CPU saves time compared with accessing main memory.
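The kind of edit described in the list above can be sketched as follows. This is only an illustration under stated assumptions: `apply_overrides` is a hypothetical helper, not part of gem5 or of the lab script, and a plain `SimpleNamespace` stands in for the CPU object so that the snippet runs without gem5. The parameter names are the ones mentioned in this document, and the values are purely illustrative.

```python
from types import SimpleNamespace


def apply_overrides(cpu, overrides):
    """Set each parameter on `cpu`, refusing the "Phys" register-file parameters.

    Hypothetical helper: it only mirrors the lab guidance of leaving the
    "Phys" values untouched; it is not a gem5 API.
    """
    # Validate everything first so that a rejected name leaves `cpu` unchanged.
    for name in overrides:
        if "Phys" in name:
            raise ValueError(f"{name}: the lab text asks not to touch 'Phys' parameters")
    for name, value in overrides.items():
        setattr(cpu, name, value)
    return cpu


# Stand-in for the CPU object configured in riscv_o3_custom.py (assumption:
# the real script exposes the parameters as attributes of one CPU object).
the_cpu = SimpleNamespace()

apply_overrides(the_cpu, {
    "fetchWidth": 8,        # illustrative value
    "fetchBufferSize": 32,  # illustrative value
    "renameWidth": 4,       # illustrative value
})
print(vars(the_cpu))
```

Collecting the changes of one configuration in a single dictionary makes it easier to record exactly which parameters differ from the original architecture when filling in the tables.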

**HINT:** To find the most effective hardware optimization and understand which parameters to change, analyse the stats.txt file (in ase\_riscv\_gem5\_sim/results/my\_c\_benchmark\_2).

Find information regarding the workload profiling. In other words, look for lines such as "system.cpu.commitStats0.committedInstType::IntAlu", and the following ones to understand which kind of instructions are executed the most. In this way, you can target a specific functional unit and modify its specifications.
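A minimal sketch of this profiling step, assuming that each relevant line of stats.txt has the form `name  count  percentage ...`. The sample text and its counts are made up for illustration, and `instruction_mix` is a hypothetical helper name. In practice you would read the real stats.txt into a string instead of using `SAMPLE_STATS`.

```python
import re

# Hypothetical excerpt in the style of stats.txt; the counts are invented.
SAMPLE_STATS = """\
system.cpu.commitStats0.committedInstType::IntAlu        6000   60.00%   60.00% # Class of committed instruction (Count)
system.cpu.commitStats0.committedInstType::FloatAdd      1500   15.00%   75.00% # Class of committed instruction (Count)
system.cpu.commitStats0.committedInstType::MemRead       2000   20.00%   95.00% # Class of committed instruction (Count)
system.cpu.commitStats0.committedInstType::MemWrite       500    5.00%  100.00% # Class of committed instruction (Count)
system.cpu.commitStats0.committedInstType::total        10000                    # Class of committed instruction (Count)
"""

# Captures the instruction class and its committed count.
PATTERN = re.compile(r"committedInstType::(\w+)\s+(\d+)")


def instruction_mix(stats_text):
    """Return (class, count, share) tuples sorted by descending count."""
    counts = {}
    for match in PATTERN.finditer(stats_text):
        name, count = match.group(1), int(match.group(2))
        if name != "total":  # skip the summary line
            counts[name] = counts.get(name, 0) + count
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # Classes with a zero count are dropped, which also avoids dividing by zero.
    return [(name, count, count / total) for name, count in ranked if count > 0]


for name, count, share in instruction_mix(SAMPLE_STATS):
    print(f"{name:10s} {count:8d} {share:7.2%}")
```

The class at the top of the ranking is the natural first candidate when deciding which functional unit to tune.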

Fill the following Tables with the CPI that you obtain with the old and the new architectures. Compute also the equivalent speedup that you obtain.

HINT: You can get the CPI and other useful information from the stats.txt file.
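As a rough sketch, the CPI can be pulled out of stats.txt with a small script. The statistic name `system.cpu.cpi` and the line layout are assumptions about the file format, so check them against your own stats.txt. The sample line reuses a CPI value reported in this document, and `read_cpi` is a hypothetical helper name.

```python
import re

# Hypothetical one-line excerpt; the statistic name and layout are assumptions.
SAMPLE_STATS = "system.cpu.cpi    2.190180    # CPI: cycles per instruction\n"


def read_cpi(stats_text, stat_name="system.cpu.cpi"):
    """Return the first value of `stat_name` found in the text, or None."""
    pattern = re.compile(rf"^{re.escape(stat_name)}\s+([0-9.]+)", re.MULTILINE)
    match = pattern.search(stats_text)
    return float(match.group(1)) if match else None


print(read_cpi(SAMPLE_STATS))
```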

| Parameters               | Configuration 1             | Configuration 2             | Configuration 4              | Configuration 5                                                |
|--------------------------|-----------------------------|-----------------------------|------------------------------|----------------------------------------------------------------|
| First changed parameter  | the_cpu.fetchWidth = 0xc1a0 | the_cpu.issueWidth = 2      | the_cpu.fetchWidth = 8       | the_cpu.dispatchWidth = 3                                      |
| Second changed parameter | the_cpu.dispatchWidth = 1   | the_cpu.numPhysIntRegs = 70 | the_cpu.fetchBufferSize = 32 | the_cpu.smtROBThreshold = 90                                   |
| ...                      |                             | the_cpu.fetchQueueSize = 16 | the_cpu.renameWidth = 4      | the_cpu.numPhysVecPredRegs = 64<br>the_cpu.fetchQueueSize = 64 |

## Original CPI (no hardware optimization):

|                               | Configuration 1 | Configuration 2 | Configuration 4 | Configuration 5 |
|-------------------------------|-----------------|-----------------|-----------------|-----------------|
| CPI                           | 2.190180        | 2.204664        | 2.197422        | 2.186703        |
| Speedup (wrt<br>Original CPI) | 1               | 0.9934          | 0.9967          | 1.0016          |
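The speedup row can be checked with a few lines of Python. This sketch assumes that the instruction count and clock frequency are the same in every run, so that speedup reduces to CPI_old / CPI_new. It also assumes that the baseline CPI equals that of Configuration 1, which is what its speedup of 1 implies.

```python
# CPI values copied from the table above.
CPI = {
    "Configuration 1": 2.190180,
    "Configuration 2": 2.204664,
    "Configuration 4": 2.197422,
    "Configuration 5": 2.186703,
}

# Assumption: the baseline CPI is that of Configuration 1 (speedup = 1).
BASELINE_CPI = CPI["Configuration 1"]


def speedup(cpi_old, cpi_new):
    """Speedup of the new configuration, for equal instruction count and clock."""
    return cpi_old / cpi_new


for name, cpi in CPI.items():
    print(f"{name}: speedup = {speedup(BASELINE_CPI, cpi):.4f}")
```

Rounded to four decimals, these ratios reproduce the values in the table.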

Which is the best optimization in terms of CPI and speedup, why?

### Your answer:

The best optimization in terms of CPI and speedup is Configuration 5: it achieves the lowest CPI (2.186703) and a speedup of 1.0016 over the original configuration, although the gain is marginal (about 0.16%).

the_cpu.dispatchWidth = 3: increasing the dispatch width lets the processor dispatch a greater number of instructions in each cycle.

the_cpu.smtROBThreshold = 90: a higher Reorder Buffer (ROB) threshold can allow more instructions to be reordered before committing, although this parameter mainly controls ROB sharing among SMT threads, so its effect on a single-threaded run is probably limited.

the_cpu.numPhysVecPredRegs = 64: increasing the number of physical vector predicate registers can also help when register availability limits renaming.

the_cpu.fetchQueueSize = 64: a larger fetch queue allows more fetched instructions to be buffered before they move on to the later pipeline stages.