# Digital Design & Computer Architecture Sarah Harris & David Harris

# Chapter 7: Microarchitecture

# Chapter 7 :: Topics

- Introduction
- Performance Analysis
- Single-Cycle Processor
- Multicycle Processor
- Pipelined Processor
- Advanced Microarchitecture



#### Introduction

- Microarchitecture: how to implement an architecture in hardware
- Processor:
  - Datapath: functional blocks
  - Control: control signals



#### Microarchitecture

- Multiple implementations for a single architecture:
  - Single-cycle: Each instruction executes in a single cycle
  - Multicycle: Each instruction is broken up into series of shorter steps
  - Pipelined: Each instruction broken up into series of steps & multiple instructions execute at once

#### **Processor Performance**

#### Program execution time

**Execution Time = (#instructions)(cycles/instruction)(seconds/cycle)** 

#### Definitions:

- CPI: Cycles/instruction
- clock period: seconds/cycle
- IPC: instructions/cycle = IPC

#### Challenge is to satisfy constraints of:

- Cost
- Power
- Performance

### RISC-V Processor

- Consider subset of RISC-V instructions:
  - R-type ALU instructions:
    - add, sub, and, or, slt
  - Memory instructions:
    - lw, sw
  - Branch instructions:
    - beq

#### **Architectural State Elements**

#### Determines everything about a processor:

- Architectural state:
  - 32 registers
  - PC
  - Memory

### RISC-V Architectural State Elements



# Chapter 7: Microarchitecture

# Single-Cycle RISC-V Processor

# Single-Cycle RISC-V Processor

- Datapath
- Control

# Example Program

- Design datapath
- View example program executing

#### **Example Program:**

| Address    | Instruction   | Type       |                                        | Fiel                     | ds               |                                       | Ma                   | chine Language |
|------------|---------------|------------|----------------------------------------|--------------------------|------------------|---------------------------------------|----------------------|----------------|
| 0x1000 L7: | lw x6, -4(x9) | I          | <b>imm<sub>11:0</sub></b><br>111111111 | rs1                      | <b>f3</b><br>010 | <b>rd</b><br>00110                    | <b>op</b><br>0000011 | FFC4A303       |
| 0x1004     | sw x6, 8(x9)  | S          | <b>imm</b> <sub>11:5</sub> <b>rs</b>   | <b>s2 rs1</b> 0110 01001 | <b>f3</b><br>010 | <b>imm</b> <sub>4:0</sub> 01000       | <b>op</b><br>0100011 | 0064A423       |
| 0x1008     | or x4, x5, x6 | 5 <b>R</b> | <b>funct7</b> rs                       | <b>s2 rs1</b> 0110 00101 | <b>f3</b><br>110 | <b>rd</b><br>00100                    | <b>op</b><br>0110011 | 0062E233       |
| 0x100C     | beq x4, x4, L | 7 <b>B</b> |                                        | <b>s2 rs1</b> 0100 00100 | <b>f3</b><br>000 | <b>imm</b> <sub>4:1,11</sub><br>10101 | <b>op</b><br>1100011 | FE420AE3       |

# Single-Cycle RISC-V Processor

Datapath: start with 1w instruction

• Example: 
$$lw x6, -4(x9)$$
  
 $lw rd, imm(rs1)$ 

### **I-Type**

| 31:20               | 19:15  | 14:12  | 11:7   | 6:0    |
|---------------------|--------|--------|--------|--------|
| imm <sub>11:0</sub> | rs1    | funct3 | rd     | op     |
| 12 bits             | 5 bits | 3 bits | 5 bits | 7 bits |

# Single-Cycle Datapath: 1w fetch

#### **STEP 1:** Fetch instruction



CLK

CLK

# Single-Cycle Datapath: 1w Reg Read

#### **STEP 2:** Read source operand (**rs1**) from RF

CLK



CLK

# Single-Cycle Datapath: 1w Immediate

#### **STEP 3:** Extend the immediate





| Address    | Instruction  | Type | Fields                                    |                     |                  | Ma                 | chine Language       |          |
|------------|--------------|------|-------------------------------------------|---------------------|------------------|--------------------|----------------------|----------|
| 0x1000 L7: | lw x6, -4(x9 | I    | <b>imm<sub>11:0</sub></b><br>111111111100 | <b>rs1</b><br>01001 | <b>f3</b><br>010 | <b>rd</b><br>00110 | <b>op</b><br>0000011 | FFC4A303 |

# Single-Cycle Datapath: 1w Address

**STEP 4:** Compute the memory address





| Address    | Instruction   | Type | Fields                                    |                     |           |                    | Ma                   | chine Language |
|------------|---------------|------|-------------------------------------------|---------------------|-----------|--------------------|----------------------|----------------|
| 0x1000 L7: | lw x6, -4(x9) | Ţ    | <b>imm<sub>11:0</sub></b><br>111111111100 | <b>rs1</b><br>01001 | <b>f3</b> | <b>rd</b><br>00110 | <b>op</b><br>0000011 | FFC4A303       |

# Single-Cycle Datapath: 1w Mem Read

# **STEP 5:** Read data from memory and write it back to register file





# Single-Cycle Datapath: PC Increment

#### **STEP 6:** Determine address of next instruction



x6, -4(x9)

rd

00110

**op** 0000011

FFC4A303

0x1000

# Chapter 7: Microarchitecture

# Single-Cycle Datapath: Other Instructions

# Single-Cycle Datapath: sw

- Immediate: now in {instr[31:25], instr[11:7]}
- Add control signals: ImmSrc, MemWrite



| Address | Instruction  | Type |                              | Field | ls               |                                | Mad                  | chine Language |
|---------|--------------|------|------------------------------|-------|------------------|--------------------------------|----------------------|----------------|
| 0x1004  | sw x6, 8(x9) | S    | <b>imm</b> <sub>11:5</sub> 1 |       | <b>f3</b><br>010 | <b>imm<sub>4:0</sub></b> 01000 | <b>op</b><br>0100011 | 0064A423       |

# Single-Cycle Datapath: Immediate

| ImmSrc | ImmExt                                       | <b>Instruction Type</b> |
|--------|----------------------------------------------|-------------------------|
| 0      | {{20{instr[31]}}, instr[31:20]}              | I-Type                  |
| 1      | {{20{instr[31]}}, instr[31:25], instr[11:7]} | S-Type                  |





# Single-Cycle Datapath: R-type

- Read from rs1 and rs2 (instead of imm)
- Write ALUResult to rd



| Address | Instruction | Type           |               |                     | Field               | ls               |                    | Mad                  | chine Language |
|---------|-------------|----------------|---------------|---------------------|---------------------|------------------|--------------------|----------------------|----------------|
| 0x1008  | or x4, x5   | 5, x6 <b>R</b> | <b>funct7</b> | <b>rs2</b><br>00110 | <b>rs1</b><br>00101 | <b>f3</b><br>110 | <b>rd</b><br>00100 | <b>op</b><br>0110011 | 0062E233       |

# Single-Cycle Datapath: beq

Calculate target address: PCTarget = PC + imm



00100

00100

beg x4, x4, L7

10101

FE420AE3

000

0x100C

# Single-Cycle Datapath: ImmExt

| ImmSrc <sub>1:0</sub> | ImmExt                                                                  | <b>Instruction Type</b> |
|-----------------------|-------------------------------------------------------------------------|-------------------------|
| 00                    | {{20{instr[31]}}, instr[31:20]}                                         | I-Type                  |
| 01                    | {{20{instr[31]}}, instr[31:25], instr[11:7]}                            | S-Type                  |
| 10                    | {{19{instr[31]}}, instr[31], instr[7], instr[30:25], instr[11:8], 1'b0} | B-Type                  |



# Single-Cycle RISC-V Processor



# Chapter 7: Microarchitecture

# Single-Cycle Control

# Single-Cycle Control

#### **High-Level View**

#### **PCSrc** Control ResultSrc Unit **MemWrite** Instr 6:0 op ALUControl<sub>2:0</sub> 14:12 funct3 **ALUSrc** 30 funct7<sub>5</sub> ImmSrc<sub>1:0</sub> Zero Zero RegWrite

#### **Low-Level View**



# Single-Cycle Control: Main Decoder

| ор | Instr. | RegWrite | ImmSrc | ALUSrc | MemWrite | ResultSrc | Branch | ALUOp |
|----|--------|----------|--------|--------|----------|-----------|--------|-------|
| 3  | lw     | 1        | 00     | 1      | 0        | 1         | X      |       |
| 35 | sw     | 0        | 01     | 1      | 1        | 1         | X      |       |
| 51 | R-type | 1        | X      | 0      | 0        | 0         | X      |       |
| 99 | beq    | 0        | 10     | 0      | 0        | X         | 1      |       |



### Review: ALU

| ALUControl <sub>2:0</sub> | Function |
|---------------------------|----------|
| 000                       | add      |
| 001                       | subtract |
| 010                       | and      |
| 011                       | or       |
| 101                       | SLT      |



# Review: ALU

| ALUControl <sub>2:0</sub> | Function |
|---------------------------|----------|
| 000                       | add      |
| 001                       | subtract |
| 010                       | and      |
| 011                       | or       |
| 101                       | SLT      |



# Single-Cycle Control: ALU Decoder



# Single-Cycle Control: ALU Decoder

| ALUOp | funct3 | op <sub>5</sub> , funct7 <sub>5</sub> | Instruction | ALUControl <sub>2:0</sub> |
|-------|--------|---------------------------------------|-------------|---------------------------|
| 00    | х      | х                                     | lw, sw      | 000 (add)                 |
| 01    | X      | x                                     | beq         | 001 (subtract)            |
| 10    | 000    | 00, 01, 10                            | add         | 000 (add)                 |
|       | 000    | 11                                    | sub         | 001 (subtract)            |
|       | 010    | х                                     | slt         | 101 (set less than)       |
|       | 110    | х                                     | or          | 011 (or)                  |
|       | 111    | x                                     | and         | 010 (and)                 |



# Example: and

| ор | Instruct | RegWrite | ImmSrc | ALUSrc | MemWrite | ResultSrc | Branch | ALUOp |
|----|----------|----------|--------|--------|----------|-----------|--------|-------|
| 51 | R-type   | 1        | XX     | 0      | 0        | 0         | 0      | 10    |



and x5, x6, x7

# Chapter 7: Microarchitecture

# Extending the Single-Cycle Processor

# Extended Functionality: I-Type ALU

Enhance the single-cycle processor to handle I-Type ALU instructions: addi, andi, ori, and slti

- Similar to R-type instructions
- But second source comes from immediate
- Change ALUSrc to select the immediate
- And *ImmSrc* to pick the correct immediate

# **Extended Functionality: I-Type ALU**

| ор | Instruct. | RegWrite | ImmSrc | ALUSrc | MemWrite | ResultSrc | Branch | ALUOp |
|----|-----------|----------|--------|--------|----------|-----------|--------|-------|
| 3  | lw        | 1        | 00     | 1      | 0        | 1         | 0      | 00    |
| 35 | sw        | 0        | 01     | 1      | 1        | Х         | 0      | 00    |
| 51 | R-type    | 1        | XX     | 0      | 0        | 0         | 0      | 10    |
| 99 | beq       | 0        | 10     | 0      | 0        | X         | 1      | 01    |
| 19 | I-type    | 1        | 00     | 1      | 0        | 0         | 0      | 10    |

## Extended Functionality: addi

| ор | Instruct. | RegWrite | ImmSrc | ALUSrc | MemWrite | ResultSrc | Branch | ALUOp |
|----|-----------|----------|--------|--------|----------|-----------|--------|-------|
| 19 | I-type    | 1        | 00     | 1      | 0        | 0         | 0      | 10    |



#### Extended Functionality: jal jump and link

Enhance the single-cycle processor to handle jal

- Similar to beq
- But jump is always taken
  - PCSrc should be 1
- Immediate format is different
  - Need a new *ImmSrc* of 11
- And jal must compute PC+4 and store in rd
  - Take PC+4 from adder through ResultMux

## Extended Functionality: jal



## Extended Functionality: ImmExt

| ImmSrc <sub>1:0</sub> | ImmExt                                                                  | Instruction Type |
|-----------------------|-------------------------------------------------------------------------|------------------|
| 00                    | {{20{instr[31]}}, instr[31:20]}                                         | I-Type           |
| 01                    | {{20{instr[31]}}, instr[31:25], instr[11:7]}                            | S-Type           |
| 10                    | {{19{instr[31]}}, instr[31], instr[7], instr[30:25], instr[11:8], 1'b0} | В-Туре           |
| 11                    | {{12{instr[31]}}, instr[19:12], instr[20], instr[30:21], 1'b0}          | J-Type           |

#### **I-Type**

| 31:20               | 19:15  | 14:12  | 11:7   | 6:0    |
|---------------------|--------|--------|--------|--------|
| imm <sub>11:0</sub> | rs1    | funct3 | rd     | op     |
| 12 bits             | 5 bits | 3 bits | 5 bits | 7 bits |

#### **B-Type**

| 31:25                  | 24:20  | 19:15  | 14:12  | 11:7                  | 6:0    |
|------------------------|--------|--------|--------|-----------------------|--------|
| imm <sub>12,10:5</sub> | rs2    | rs1    | funct3 | imm <sub>4:1,11</sub> | op     |
| 7 bits                 | 5 bits | 5 bits | 3 bits | 5 bits                | 7 bits |

#### S-Type

| 31:25               | 24:20  | 19:15  | 14:12  | 11:7               | 6:0    |
|---------------------|--------|--------|--------|--------------------|--------|
| imm <sub>11:5</sub> | rs2    | rs1    | funct3 | imm <sub>4:0</sub> | ор     |
| 7 bits              | 5 bits | 5 bits | 3 bits | 5 bits             | 7 bits |

#### **J-Type**

| 31:12                           | 11:7   | 6:0    |  |
|---------------------------------|--------|--------|--|
| imm <sub>20,10:1,11,19:12</sub> | rd     | op     |  |
| 20 bits                         | 5 bits | 7 bits |  |

## Extended Functionality: jal

| ор  | Instruct. | RegWrite | ImmSrc | ALUSrc | MemWrite | ResultSrc | Branch | ALUOp | Jump |
|-----|-----------|----------|--------|--------|----------|-----------|--------|-------|------|
| 3   | lw        | 1        | 00     | 1      | 0        | 01        | 0      | 00    | 0    |
| 35  | sw        | 0        | 01     | 1      | 1        | XX        | 0      | 00    | 0    |
| 51  | R-type    | 1        | XX     | 0      | 0        | 00        | 0      | 10    | 0    |
| 99  | beq       | 0        | 10     | 0      | 0        | XX        | 1      | 01    | 0    |
| 19  | I-type    | 1        | 00     | 1      | 0        | 00        | 0      | 10    | 0    |
| 111 | jal       | 1        | 11     | X      | 0        | 10        | 0      | XX    | 1    |

## Extended Functionality: jal

| ор  | Instruct. | RegWrite | ImmSrc | ALUSrc | MemWrite | ResultSrc | Branch | ALUOp | Jump |
|-----|-----------|----------|--------|--------|----------|-----------|--------|-------|------|
| 111 | jal       | 1        | 11     | X      | 0        | 10        | 0      | XX    | 1    |



## Chapter 7: Microarchitecture

# Single-Cycle Performance

#### Processor Performance

#### **Program Execution Time**

- = (#instructions)(cycles/instruction)(seconds/cycle)
- = # instructions x CPI x  $T_C$

## Single-Cycle Processor Performance



 $T_c$  limited by critical path (1w)

## Single-Cycle Processor Performance

#### Single-cycle critical path:

$$T_{c\_single} = t_{pcq\_PC} + t_{\text{mem}} + \max[t_{RF\text{read}}, t_{dec} + t_{ext} + t_{\text{mux}}] + t_{\text{ALU}} + t_{\text{mem}} + t_{\text{mux}} + t_{RF\text{setup}}]$$

#### Typically, limiting paths are:

- memory, ALU, register file

- So, 
$$T_{c\_single} = t_{pcq\_PC} + t_{mem} + t_{RFread} + t_{ALU} + t_{mem} + t_{mux} + t_{RFsetup}$$
  
=  $t_{pcq\_PC} + 2t_{mem} + t_{RFread} + t_{ALU} + t_{mux} + t_{RFsetup}$ 

## Single-Cycle Performance Example

| Element                | Parameter       | Delay (ps) |
|------------------------|-----------------|------------|
| Register clock-to-Q    | $t_{pcq\_PC}$   | 40         |
| Register setup         | $t_{ m setup}$  | 50         |
| Multiplexer            | $t_{ m mux}$    | 30         |
| AND-OR gate            | $t_{ m AND-OR}$ | 20         |
| ALU                    | $t_{ m ALU}$    | 120        |
| Decoder (Control Unit) | $t_{ m dec}$    | 25         |
| Extend unit            | $t_{ m ext}$    | 35         |
| Memory read            | $t_{ m mem}$    | 200        |
| Register file read     | $t_{RF}$ read   | 100        |
| Register file setup    | $t_{RF}$ setup  | 60         |

$$T_{c\_single} = t_{pcq\_PC} + 2t_{mem} + t_{RFread} + t_{ALU} + t_{mux} + t_{RFsetup}$$
  
= 750 ps

## Single-Cycle Performance Example

Program with 100 billion instructions:

**Execution Time** = # instructions x CPI x  $T_C$ 

## Chapter 7: Microarchitecture

## Multicycle RISC-V Processor

## Single- vs. Multicycle Processor

- Single-cycle:
  - + simple
  - cycle time limited by longest instruction (lw)
  - separate memories for instruction and data
  - 3 adders/ALUs
- Multicycle processor addresses these issues by breaking instruction into shorter steps
  - shorter instructions take fewer steps
  - o can re-use hardware
  - o cycle time is faster

## Single- vs. Multicycle Processor

#### Single-cycle:

- + simple
- cycle time limited by longest instruction (lw)
- separate memories for instruction and data
- 3 adders/ALUs

#### Multicycle:

- + higher clock speed
- + simpler instructions run faster
- + reuse expensive hardware on multiple cycles
- sequencing overhead paid many times

## Same design steps as single-cycle:

- first datapath
- then control

#### Multicycle State Elements

Replace separate Instruction and Data memories with a single unified memory – more realistic





#### Multicycle Datapath: Instruction Fetch

#### **STEP 1:** Fetch instruction





#### Multicycle Datapath: 1w Get Sources

**STEP 2:** Read source operand from RF and extend immediate



## Multicycle Datapath: 1w Address

#### **STEP 3:** Compute the memory address



#### Multicycle Datapath: 1w Memory Read

#### **STEP 4:** Read data from memory



#### Multicycle Datapath: 1w Write Register

#### **STEP 5:** Write data back to register file



#### Multicycle Datapath: Increment PC

**STEP 6:** Increment PC: PC = PC+4



#### Chapter 7: Microarchitecture

# Multicycle Datapath: Other Instructions

## Multicycle Datapath: sw

#### Write data in rs2 to memory



## Multicycle Datapath: beq

#### Calculate branch target address: BTA = PC + imm



PC is updated in Fetch stage, so need to save old (current) PC

#### Multicycle RISC-V Processor



## Chapter 7: Microarchitecture

## Multicycle Control

## Multicycle Control



#### Multicycle Control: Instruction Decoder



| ор | Instruction | ImmSrc |
|----|-------------|--------|
| 3  | lw          | 00     |
| 35 | sw          | 01     |
| 51 | R-type      | XX     |
| 99 | beq         | 10     |

#### Multicycle Control: Main FSM



#### To declutter FSM:

- Write enable
   signals (RegWrite,
   MemWrite,
   IRWrite,
   PCUpdate, and
   Branch) are 0 if not
   listed in a state.
- Other signals are don't care if not listed in a state

#### Main FSM: Fetch



#### Main FSM: Decode



#### Main FSM: Address



## Main FSM: Read Memory



## Main FSM: Read Memory Datapath



#### Main FSM: Write RF



### Main FSM: Write RF Datapath



#### Main FSM: Fetch Revisited

Calculate PC+4
during Fetch stage
(ALU isn't being used)



#### Main FSM: Fetch (PC+4) Datapath



#### Chapter 7: Microarchitecture

# Multicycle Control: Other Instructions

#### Main FSM: sw



#### Main FSM: sw Datapath



#### Main FSM: R-Type Execute



#### Main FSM: R-Type Execute Datapath



#### Main FSM: R-Type ALU Write Back



#### Main FSM: R-Type ALU Write Back



#### Main FSM: beq

- Need to calculate:
  - Branch Target Address
  - rs1 rs2 (to see if equal)
- ALU isn't being used in Decode stage
  - Use it to calculate Target Address (PC + imm)

#### Main FSM: Decode Revisited



### Main FSM: Decode (Target Address)



### Main FSM: beq



#### Main FSM: beq Datapath



#### Chapter 7: Microarchitecture

# Extending the RISC-V Multicycle Processor

#### Main FSM: I-Type ALU Execute



#### Main FSM: I-Type ALU Exec. Datapath



#### Main FSM: jal



#### Main FSM: jal Datapath



#### Main FSM: jal



#### Multicycle Processor Main FSM

State Datapath µOp **Fetch** Instr ←Mem[PC]: PC ← PC+4 Decode ALUOut ← PCTarget Reset MemAdr ALUOut ← rs1 + imm Data ← Mem[ALUOut] MemRead S0: Fetch AdrSrc = 0S1: Decode **MemWB** rd ← Data **IRWrite** ALUSrcA = 01 MemWrite Mem[ALUOut] ← rd ALUSrcA = 00ALUSrcB = 01 **ExecuteR** ALUOut ← rs1 op rs2 ALUSrcB =10 ALUOp = 00ALUOp = 00**Executel** ALUOut ← rs1 op imm ResultSrc = 10 **ALUWB** rd ← ALUOut PCUpdate BEQ ALUResult = rs1-rs2: if Zero. PC ← ALUOut op = 0000011 (1w)op = = qo = go JAL PC ← ALUOut: ALUOut ← PC+4 = qo0110011 0010011 1101111 1100011 (R-type) (I-type ALU) op = 0100011 (sw)(jal) (beq) S2: MemAdr S6: ExecuteR S8: Executel S9: JAL **S10: BEQ** ALUSrcA = 10 ALUSrcA = 10 ALUSrcA = 10ALUSrcA = 01 ALUSrcA = 10ALUSrcB = 01 ALUSrcB = 00ALUSrcB = 01ALUSrcB = 10 ALUSrcB = 00ALUOp = 00ALUOp = 10ALUOp = 10ALUOp = 00ALUOp = 01ResultSrc = 00 ResultSrc = 00 **PCUpdate** Branch op = op =0000011 0100011 (lw) (sw) S3: MemRead S5: MemWrite S7: ALUWB ResultSrc = 00ResultSrc = 00 ResultSrc = 00 AdrSrc = 1 AdrSrc = 1RegWrite MemWrite S4: MemWB ResultSrc = 01 RegWrite

### Chapter 7: Microarchitecture

## Multicycle Performance

#### Multicycle Processor Performance

- Instructions take different number of cycles:
  - 3 cycles: beq
  - 4 cycles: R-type, addi, sw, jal
  - − 5 cycles: lw
- CPI is weighted average
- SPECINT2000 benchmark:
  - **25**% loads
  - 10% stores
  - 13% branches
  - 52% R-type

Average CPI = (0.13)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12

#### Multicycle Critical Path



#### Multicycle Processor Performance

#### Multicycle critical path:

- Assumptions:
  - RF is faster than memory
  - Writing memory is faster than reading memory

$$T_{c\_multi} = t_{pcq} + t_{dec} + 2t_{mux} + \max(t_{ALU}, t_{mem}) + t_{setup}$$

#### Multicycle Performance Example

| Element                | Parameter       | Delay (ps) |
|------------------------|-----------------|------------|
| Register clock-to-Q    | $t_{pcq\_PC}$   | 40         |
| Register setup         | $t_{ m setup}$  | 50         |
| Multiplexer            | $t_{ m mux}$    | 30         |
| AND-OR gate            | $t_{ m AND-OR}$ | 20         |
| ALU                    | $t_{ m ALU}$    | 120        |
| Decoder (Control Unit) | $t_{ m dec}$    | 25         |
| Extend unit            | $t_{ m dec}$    | 35         |
| Memory read            | $t_{ m mem}$    | 200        |
| Register file read     | $t_{RF}$ read   | 100        |
| Register file setup    | $t_{RF}$ setup  | 60         |

$$T_{c\_multi} = t_{pcq} + t_{dec} + 2t_{mux} + \max(t_{ALU}, t_{mem}) + t_{setup}$$

#### Multicycle Performance Example

For a program with **100 billion** instructions executing on a **multicycle** RISC-V processor

- CPI = 4.12 cycles/instruction
- Clock cycle time:  $T_{c\_multi}$  = 375 ps

**Execution Time = (#** instructions)  $\times$  CPI  $\times$   $T_c$ 

#### Chapter 7: Microarchitecture

# Pipelined RISC-V Processor

#### Pipelined RISC-V Processor

- Temporal parallelism
- Divide single-cycle processor into 5 stages:
  - Fetch
  - Decode
  - Execute
  - Memory
  - Writeback
- Add pipeline registers between stages

#### Single-Cycle vs. Pipelined Processor

#### Single-Cycle



#### **Pipelined**



#### Pipelined Processor Abstraction



### Single-Cycle & Pipelined Datapaths



Signals in Pipelined Processor are appended with first letter of stage (i.e., PCF, PCD, PCE).



### Corrected Pipelined Datapath



- Rd must arrive at same time as Result
- Register file written on falling edge of CLK

#### Pipelined Processor with Control



- Same control unit as single-cycle processor
- Control signals travel with the instruction (drop off when used)

### Chapter 7: Microarchitecture

# Pipelined Processor Hazards

# Pipelined Hazards

- When an instruction depends on result from instruction that hasn't completed
- Types:
  - Data hazard: register value not yet written back to register file
  - Control hazard: next instruction not decided yet (caused by branch)

## Data Hazard



# Handling Data Hazards

## Handling Data Hazards

- Insert enough nops for result to be ready
- Or move independent useful instructions forward



## Data Forwarding

- Data is available on internal busses before it is written back to the register file (RF).
- Forward data from internal busses to Execute stage.



## Data Forwarding

- Check if source register in Execute stage matches destination register of instruction in Memory or Writeback stage.
- If so, forward result.



# Data Forwarding: Hazard Unit



## Data Forwarding

- Case 1: Execute stage Rs1 or Rs2 matches Memory stage Rd?
   Forward from Memory stage
- Case 2: Execute stage Rs1 or Rs2 matches Writeback stage Rd?
   Forward from Writeback stage
- Case 3: Otherwise use value read from register file (as usual)

#### Equations for Rs1:

```
if ((Rs1E == RdM) \text{ AND } RegWriteM) // Case 1

ForwardAE = 10

else if ((Rs1E == RdW) \text{ AND } RegWriteW) // Case 2

ForwardAE = 01

else ForwardAE = 00 // Case 3
```

ForwardBE equations are similar (replace Rs1E with Rs2E)

## Data Forwarding

- Case 1: Execute stage Rs1 or Rs2 matches Memory stage Rd?
   Forward from Memory stage
- Case 2: Execute stage Rs1 or Rs2 matches Writeback stage Rd?
   Forward from Writeback stage
- Case 3: Otherwise use value read from register file (as usual)

#### Equations for Rs1:

```
if ((Rs1E == RdM) \text{ AND } RegWriteM) \text{ AND } (Rs1E != 0) // \text{ Case 1}
ForwardAE = 10
else if ((Rs1E == RdW) \text{ AND } RegWriteW) \text{ AND } (Rs1E != 0) // \text{ Case 2}
ForwardAE = 01
else ForwardAE = 00 	 // \text{ Case 3}
```

ForwardBE equations are similar (replace Rs1E with Rs2E)

## Data Hazard due to 1w Dependency



## Stalling to solve 1w Data Dependency



# Stalling Logic

 Is either source register in the Decode stage the same as the destination register in the Execute stage?

#### **AND**

Is the instruction in the Execute stage a lw?

```
IwStall = ((Rs1D == RdE) OR (Rs2D == RdE)) AND ResultSrcE<sub>0</sub>
StallF = StallD = FlushE = IwStall
```

(Stall the Fetch and Decode stages, and flush the Execute stage.)

# Stalling Hardware



# Chapter 7: Microarchitecture

# Pipelined Processor Control Hazards

## **Control Hazards**

## beq:

- Branch not determined until the Execute stage of pipeline
- Instructions after branch fetched before branch occurs
- These 2 instructions must be flushed if branch happens

## **Control Hazards**



## **Branch misprediction penalty:**

The number of instructions flushed when a branch is taken (in this case, 2 instructions)

## Control Hazards: Flushing Logic

- If branch is taken in execute stage, need to flush the instructions in the Fetch and Decode stages
  - Do this by clearing Decode and Execute Pipeline registers using FlushD and FlushE

### Equations:

```
FlushD = PCSrcE
FlushE = IwStall OR PCSrcE
```

# Control Hazards: Flushing Hardware



## RISC-V Pipelined Processor with Hazard Unit



## Summary of Hazard Logic

#### Data hazard logic (shown for SrcA of ALU):

```
((Rs1E == RdM) AND RegWriteM) AND (Rs1E != 0) // Case 1
                ForwardAF = 10
else if ((Rs1E == RdW) \text{ AND } RegWriteW) \text{ AND } (Rs1E != 0) // Case 2
                ForwardAE = 01
else
                ForwardAE = 00
                                                            // Case 3
```

#### **Load word stall logic:**

```
IwStall = ((Rs1D == RdE)) OR (Rs2D == RdE)) AND ResultSrcE_0
StallF = StallD = lwStall
```

#### **Control hazard flush:**

```
FlushD = PCSrcE
FlushE = lwStall OR PCSrcE
```

# Chapter 7: Microarchitecture

# Pipelined Performance

## Pipelined Processor Performance Example

#### SPECINT2000 benchmark:

- 25% loads
- 10% stores
- 13% branches
- 52% R-type

#### Suppose:

- 40% of loads used by next instruction
- 50% of branches mispredicted
- What is the average CPI? (Ideally it's 1, but...)

## Pipelined Processor Performance Example

### Pipelined processor critical path:

```
T_{c\_pipelined} = \max \text{ of}
t_{pcq} + t_{mem} + t_{setup} \qquad \qquad \text{Fetch}
2(t_{RFread} + t_{setup}) \qquad \qquad \text{Decode}
t_{pcq} + 4t_{mux} + t_{ALU} + t_{AND-OR} + t_{setup} \qquad \qquad \text{Execute}
t_{pcq} + t_{mem} + t_{setup} \qquad \qquad \text{Memory}
2(t_{pcq} + t_{mux} + t_{RFwrite}) \qquad \qquad \text{Writeback}
```

- Decode and Writeback stages both use the register file in each cycle
- So each stage gets half of the cycle time  $(T_c/2)$  to do their work
- Or, stated a different way, 2x of their work must fit in a cycle (T<sub>c</sub>)

# Pipelined Critical Path: Execute Stage



## Pipelined Performance Example

| Element                | Parameter       | Delay (ps) |
|------------------------|-----------------|------------|
| Register clock-to-Q    | $t_{pcq\_PC}$   | 40         |
| Register setup         | $t_{ m setup}$  | 50         |
| Multiplexer            | $t_{ m mux}$    | 30         |
| AND-OR gate            | $t_{ m AND-OR}$ | 20         |
| ALU                    | $t_{ m ALU}$    | 120        |
| Decoder (Control Unit) | $t_{ m dec}$    | 25         |
| Extend unit            | $t_{ m dec}$    | 35         |
| Memory read            | $t_{ m mem}$    | 200        |
| Register file read     | $t_{RF}$ read   | 100        |
| Register file setup    | $t_{RF}$ setup  | 60         |

$$T_{c\_pipelined} = t_{pcq} + 4t_{mux} + t_{ALU} + t_{AND-OR} + t_{setup}$$

## Pipelined Performance Example

Program with 100 billion instructions

```
Execution Time = (# instructions) × CPI × T_c
= (100 \times 10^9)(1.23)(350 \times 10^{-12})
= 43 seconds
```

# Processor Performance Comparison

| Processor    | Execution Time (seconds) | Speedup (single-cycle as baseline) |
|--------------|--------------------------|------------------------------------|
| Single-cycle | 75                       | 1                                  |
| Multicycle   | 155                      | 0.5                                |
| Pipelined    | 43                       | 1.7                                |

# Chapter 7: Microarchitecture

# Advanced Microarchitecture

## Advanced Microarchitecture

- Deep Pipelining
- Micro-operations
- Branch Prediction
- Superscalar Processors
- Out of Order Processors
- Register Renaming
- SIMD
- Multithreading
- Multiprocessors

# Deep Pipelining

- 10-20 stages typical
- Number of stages limited by:
  - Pipeline hazards
  - Sequencing overhead
  - Power
  - Cost



## Micro-operations

- Decompose complex instructions into series of simple instructions called *micro-operations* (*micro-ops* or μ-ops)
- At run-time, complex instructions are decoded into one or more micro-ops
- Used heavily in CISC (complex instruction set computer) architectures (e.g., x86)

#### **Complex Op**

lw s1, 0(s2), postincr 4

#### **Micro-op Sequence**

lw s1, 0(s2) addi s2, s2, 4

Without  $\mu$ -ops, would need 2nd write port on the register file

## **Branch Prediction**

- Guess whether branch will be taken
  - Backward branches are usually taken (loops)
  - Consider history to improve guess
- Good prediction reduces fraction of branches requiring a flush

## **Branch Prediction**

- Ideal pipelined processor: CPI = 1
- Branch misprediction increases CPI
- Static branch prediction:
  - Check direction of branch (forward or backward)
  - If backward, predict taken
  - Else, predict not taken
- Dynamic branch prediction:
  - Keep history of last several hundred (or thousand)
     branches in branch target buffer, record:
    - Branch destination
    - Whether branch was taken

## Dynamic Branch Prediction

- 1-bit branch predictor
- 2-bit branch predictor

## Branch Prediction Example

```
addi s1, zero, 0 # s1 = sum
 addi s0, zero, 0 \# s0 = i
 addi t0, zero, 10 \# t0 = 10
For:
                     # for (i=0; i<10; i=i+1)
 bge s0, t0, Done
 add s1, s1, s0 # sum = sum + i
 addi s0, s0, 1 # i = i + 1
 j For
```

Done:

## 1-Bit Branch Predictor

- Remembers whether branch was taken the last time and does the same thing
- Mispredicts first and last branch of loop

Done:

#### 2-Bit Branch Predictor



Done:

#### Only mispredicts last branch of loop

### Chapter 7: Microarchitecture

# Superscalar & Out of Order Processors

# Superscalar Processors

- Multiple copies of datapath execute multiple instructions at once
- Dependencies make it tricky to issue multiple instructions at once



## Superscalar Example

Ideal IPC: 2

**Actual IPC: 2** 

lw s7, 40(s0)
add s8, t1, t2

sub s9, s1, s3
and s10, s3, t4

or s11, s1, t5
sw s5, 80(s2)



## Superscalar with Dependencies

Ideal IPC: 2

**Actual IPC:** 6/5 = 1.2



149

## Out of Order (OOO) Processor

- Looks ahead across multiple instructions
- Issues as many instructions as possible at once
- Issues instructions out of order (as long as no dependencies)

#### Dependencies:

- RAW (read after write): one instruction writes, later instruction reads a register
- WAR (write after read): one instruction reads, later instruction writes a register
- WAW (write after write): one instruction writes, later instruction writes a register

## Out of Order (OOO) Processor

- Instruction level parallelism (ILP): number of instruction that can be issued simultaneously (average < 3)</li>
- Scoreboard: table that keeps track of:
  - Instructions waiting to issue
  - Available functional units
  - Dependencies

#### Out of Order Processor Example

Ideal IPC: 2

**Actual IPC:** 6/4 = 1.5



## Register Renaming

Ideal IPC: 2

Actual IPC: 6/3 = 2



#### SIMD

- Single Instruction Multiple Data (SIMD)
  - Single instruction acts on multiple pieces of data at once
  - Common application: graphics
  - Can apply to short arithmetic operations (also called packed arithmetic)
- For example, add eight 8-bit elements



# Chapter 7: Microarchitecture

# Multithreading & Multiprocessors

## Advanced Architecture Techniques

#### Multithreading

Wordprocessor: thread for typing, spell checking, printing

#### Multiprocessors

Multiple processors (cores) on a single chip

# Threading: Definitions

- Process: program running on a computer
  - Multiple processes can run at once: e.g., surfing
     Web, playing music, writing a paper
- Thread: part of a program
  - Each process has multiple threads: e.g., a word processor may have threads for typing, spell checking, printing

#### Threads in a Conventional Processor

#### Single-core system:

- One thread runs at once
- When one thread stalls (for example, waiting for memory):
  - Architectural state of that thread stored
  - Architectural state of waiting thread loaded into processor and it runs
  - Called context switching
- Appears to user like all threads running simultaneously

#### Multithreading

- Multiple copies of architectural state
- Multiple threads active at once:
  - When one thread stalls, another runs immediately
  - If one thread can't keep all execution units busy, another thread can use them
- Does not increase instruction-level parallelism (ILP) of single thread, but increases throughput

Intel calls this "hyperthreading"

#### Multiprocessors

- Multiple processors (cores) with a method of communication between them
- Types:
  - Homogeneous: multiple cores with shared main memory
  - Heterogeneous: separate cores for different tasks (for example, DSP and CPU in cell phone)
  - Clusters: each core has own memory system

#### **About these Notes**

**Digital Design and Computer Architecture Lecture Notes** 

© 2021 Sarah Harris and David Harris

These notes may be used and modified for educational and/or non-commercial purposes so long as the source is attributed.