## Design of the frontend for LEN5, a RISC-V Out-of-Order processor

Candidate: Marco Andorno Supervisor: prof. Maurizio Martina



#### POLITECNICO DI TORINO

Master's thesis in Electronic Engineering
Academic year 2018-2019

## LEN5 overview



• Open source ISA, which allows for open source hardware.



- · Open source ISA, which allows for open source hardware.
- Modular ISA, that provides extensions to tailor the architecture to the design needs.

Used in every modern high-performance processor, because it allows the best exploitation of ILP.

Used in every modern high-performance processor, because it allows the best exploitation of ILP.

Based on three pillars:

1. Dynamic pipeline scheduling

Used in every modern high-performance processor, because it allows the best exploitation of ILP.

Based on three pillars:

- 1. Dynamic pipeline scheduling
- 2. Branch prediction

Used in every modern high-performance processor, because it allows the best exploitation of ILP.

Based on three pillars:

- 1. Dynamic pipeline scheduling
- 2. Branch prediction
- 3. Speculative execution

Used in every modern high-performance processor, because it allows the best **exploitation of ILP**.

Based on three pillars:

- 1. Dynamic pipeline scheduling
- 2. Branch prediction
- 3. Speculative execution

LEN5 uses **Tomasulo's algorithm**, with its distributed control approach.

Frontend design

# Frontend overview I\$



#### PC gen stage



#### Instruction Fetch Unit (IFU)



Branch management

#### Branch Prediction Unit (BPU)



#### Gshare branch predictor



#### Branch Target Buffer (BTB)



#### Branch unit



### BPU update actions

| Prediction | Resolution | Target | Action                    |
|------------|------------|--------|---------------------------|
| Taken      | Taken      | ~      | Increment 2-bit counter   |
|            |            | ×      | Increment 2-bit counter   |
|            |            |        | Update BTB entry          |
|            |            |        | Flush, go to right target |
|            | Not taken  | -      | Decrement 2-bit counter   |
|            |            |        | Remove BTB entry          |
|            |            |        | Flush, go to branch PC+4  |
| Not taken  | Not taken  | -      | Decrement 2-bit counter   |
|            | Taken      | -      | Increment 2-bit counter   |
|            |            |        | Add BTB entry             |
|            |            |        | Flush, go to right target |

#### Results

#### Gshare accuracy vs. History bits



#### Misprediction penalty vs. BTB size



#### Synthesis area results

Total cell area grows **exponentially** with the length of the **global history** and of the **BTB index** (*z* axis in log scale).



#### Synthesis frequency results



- · Frequency depends only on the size of the BTB
- The BTB address decoding network is the critical path of the whole design

Wrap-up

#### Frontend design:

· Address generation

#### Frontend design:

- · Address generation
- · Instruction fetch unit

#### Frontend design:

- · Address generation
- · Instruction fetch unit
- Branch prediction unit

#### Frontend design:

- · Address generation
- · Instruction fetch unit
- Branch prediction unit

Insightful exploratory work, with several applications:

#### Frontend design:

- · Address generation
- · Instruction fetch unit
- Branch prediction unit

Insightful **exploratory work**, with several applications:

Teaching and research

#### Frontend design:

- · Address generation
- · Instruction fetch unit
- · Branch prediction unit

#### Insightful exploratory work, with several applications:

- Teaching and research
- High-performance computing

#### Future work

#### Possible improvements:

More advanced branch prediction

#### Future work

#### Possible improvements:

- More advanced branch prediction
- · SRAM data structures

#### Future work

#### Possible improvements:

- · More advanced branch prediction
- SRAM data structures
- ISA extensions

Thank you for your attention!