# CSC 631: High-Performance Computer Architecture

Fall 2022 Lecture 4: Instruction Level Parallelism

# **Instruction Level Parallelism (ILP)**

## ■ Basic idea:

- Execute several instructions in parallel
- We already do pipelining...
  - But it can only push through at most 1 instruction/cycle

# ■ Is this Legal?!?

- ISA defines instruction execution one by one
  - I1: ADD R1 = R2 + R3
    - fetch the instruction
    - read R2 and R3
    - do the addition
    - write R1
    - increment PC
- How about pipelining?
  - already breaks the "rules"
  - we fetch I2 before I1 has finished

# It's legal if program executes correctly...

- Parallelism exists in that we perform different operations (fetch, decode, ...) on several different instructions in parallel
  - as mentioned, limit of 1 IPC

# **Example: Toll Booth**








Caravanning on a trip, must stay in order to prevent losing anyone



When we get to the toll, everyone gets in the same lane to stay in order

This works... but it's slow. Everyone has to wait for D to get through the toll booth





Lane 1, Lane 2: go through two at a time (in parallel)

## Back to ILP... But how?

## Simple ILP recipe

- Read and decode a few instructions each cycle
  - can't execute > 1 IPC if we're not fetching > 1 IPC
- If instructions are independent, do them at the same time
- If not, do them one at a time

# **Example**

A: ADD R1 = R2 + R3

B: SUB R4 = R1 - R5

C: XOR R6 = R7 ^ R8

D: Store R6 → 0[R4]

E: MUL R3 = R5 \* R9

F: ADD R7 = R1 + R6

G: SHL R8 = R7 << R4

# **Ex. Original Pentium**



# Repeat Example for Pentium-like CPU

A: ADD R1 = R2 + R3

B: SUB R4 = R1 - R5

C: XOR R6 = R7 ^ R8

D: Store R6 → 0[R4]

E: MUL R3 = R5 \* R9

F: ADD R7 = R1 + R6

G: SHL R8 = R7 << R4

# This is "Superscalar"

- "Scalar" CPU executes one inst at a time
  - includes pipelined processors
- "Vector" CPU executes one inst at a time, but on vector data
  - X[0:7] + Y[0:7] is one instruction, whereas on a scalar processor, you would need eight
- "Superscalar" can execute more than one unrelated instruction at a time
  - ADD X + Y, MUL W \* Z

# **Scheduling**

- Central problem to ILP processing
  - need to determine when parallelism (independent instructions) exists
  - in Pentium example, decode stage checks for multiple conditions:
    - is there a data dependency?
      - does one instruction generate a value needed by the other?
      - do both instructions write to the same register?
    - is there a structural dependency?
      - most CPUs only have one divider, so two divides cannot execute at the same time

# **Scheduling**

# • How many instructions are we looking for?

- 3-6 is typical today
- At peak execution bandwidth, a CPU that can ideally execute N instructions per cycle is called "N-way superscalar", "N-issue superscalar", or simply "N-way", "N-issue" or "N-wide"
  - This "N" is also called the "issue width"

## **ILP**

- Arrange instructions based on dependencies
- ILP = Number of instructions / Longest Path

```
I1: R2 = 17

I2: R1 = 49

I3: R3 = -8

I4: R5 = LOAD 0[R3]

I5: R4 = R1 + R2

I6: R7 = R4 - R3

I7: R6 = R4 * R5
```
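As a rough check of the formula, the sketch below computes the ILP of the seven-instruction listing above. The dependence lists are read off the listing by hand (RAW dependences only), and the helper name `longest_path` is made up for this illustration.

```python
# Each instruction maps to the earlier instructions whose results it reads
# (RAW dependences read off the listing by hand).
deps = {
    "I1": [],            # R2 = 17
    "I2": [],            # R1 = 49
    "I3": [],            # R3 = -8
    "I4": ["I3"],        # R5 = LOAD 0[R3]
    "I5": ["I1", "I2"],  # R4 = R1 + R2
    "I6": ["I5", "I3"],  # R7 = R4 - R3
    "I7": ["I5", "I4"],  # R6 = R4 * R5
}

def longest_path(deps):
    """Length, in instructions, of the longest dependence chain."""
    depth = {}
    for name in deps:  # program order, so producers are visited first
        depth[name] = 1 + max((depth[d] for d in deps[name]), default=0)
    return max(depth.values())

path = longest_path(deps)
ilp = len(deps) / path
print(f"longest path = {path}, ILP = {ilp:.2f}")  # longest path = 3, ILP = 2.33
```

The longest chain is three instructions long (for example I1 → I5 → I6), so ILP = 7 / 3 ≈ 2.33.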

# **Dynamic (Out-of-Order) Scheduling**

## Cycle 1

- Operands ready? I1, I5.
- Start I1, I5.

## ■ Cycle 2

- Operands ready? I2, I3.
- Start I2, I3.

## Program code

I1: ADD R1, R2, R3
I2: SUB R4, R1, R5
I3: AND R6, R1, R7
I4: OR R8, R2, R6
I5: XOR R10, R2, R11

- Window size (W): how many instructions ahead do we look.
  - Do not confuse with "issue width" (N).
  - E.g. a 4-issue out-of-order processor can have a 128-entry window (it can look at up to 128 instructions at a time).
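The cycle-by-cycle picture above can be reproduced with a small simulation. This is only a sketch under simplifying assumptions that are not on the slides: every instruction takes one cycle, the issue width and window are unlimited, and only RAW dependences are honored. The function name `schedule` is made up for this illustration.

```python
program = [
    ("I1", "R1",  ["R2", "R3"]),   # ADD R1, R2, R3
    ("I2", "R4",  ["R1", "R5"]),   # SUB R4, R1, R5
    ("I3", "R6",  ["R1", "R7"]),   # AND R6, R1, R7
    ("I4", "R8",  ["R2", "R6"]),   # OR  R8, R2, R6
    ("I5", "R10", ["R2", "R11"]),  # XOR R10, R2, R11
]

def schedule(program):
    """Return a list of cycles; each cycle lists the instructions started."""
    # For each instruction, find the earlier instructions it must wait for.
    last_writer, producers = {}, {}
    for name, dest, srcs in program:
        producers[name] = {last_writer[s] for s in srcs if s in last_writer}
        last_writer[dest] = name

    done, cycles = set(), []
    waiting = [name for name, _, _ in program]
    while waiting:
        # Ready = every producer finished in an earlier cycle.
        ready = [n for n in waiting if producers[n] <= done]
        cycles.append(ready)
        done |= set(ready)
        waiting = [n for n in waiting if n not in done]
    return cycles

cycles = schedule(program)
for i, started in enumerate(cycles, start=1):
    print(f"Cycle {i}: start {', '.join(started)}")
```

Under these assumptions the output is I1 and I5 in cycle 1, I2 and I3 in cycle 2, and I4 in cycle 3.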

# **Ordering?**

- In previous example, I5 executed before I2, I3 and I4!
- How to maintain the illusion of sequentiality?



#### ILP != IPC

# ■ ILP is an attribute of the program

- also dependent on the ISA, compiler
  - ex. SIMD, FMAC, etc. can change inst count and shape of dataflow graph

# IPC depends on the actual machine implementation

- ILP is an upper bound on IPC
  - achievable IPC depends on instruction latencies, cache hit rates, branch prediction rates, structural conflicts, instruction window size, etc., etc., etc.

**Dependences and Register Renaming** 

## **ILP** is Bounded

- For any sequence of instructions, the available parallelism is limited
- Hazards/Dependencies are what limit the ILP
  - Data dependencies
  - Control dependencies
  - Memory dependencies

# **Types of Data Dependencies**

(Assume A comes before B in program order)

- RAW (Read-After-Write)
  - A writes to a location, B reads from the location, therefore B has a RAW dependency on A
  - Also called a "true dependency"

# Data Dep's (cont'd)

## ■ WAR (Write-After-Read)

- A reads from a location, B writes to the location, therefore B has a WAR dependency on A
- If B executes before A has read its operand, then the operand will be lost
- Also called an anti-dependence

# Data Dep's (cont'd)

## ■ WAW (Write-After-Write)

- A writes to a location, B writes to the same location
- If B writes first, then A writes, the location will end up with the wrong value
- Also called an output-dependence
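The three definitions can be summarized in a few lines of code. This is a minimal sketch: each instruction is modeled as a `(dest, sources)` pair of register names, the function name `classify` is made up for this illustration, and the third example instruction is hypothetical.

```python
def classify(a, b):
    """Data dependences of B on A, where A precedes B in program order.

    Each instruction is a (dest, sources) pair of register names.
    """
    a_dest, a_srcs = a
    b_dest, b_srcs = b
    found = []
    if a_dest in b_srcs:
        found.append("RAW")   # B reads what A writes
    if b_dest in a_srcs:
        found.append("WAR")   # B overwrites what A reads
    if a_dest == b_dest:
        found.append("WAW")   # both write the same location
    return found

add = ("R1", ["R2", "R3"])   # ADD R1 = R2 + R3
sub = ("R4", ["R1", "R5"])   # SUB R4 = R1 - R5
mul = ("R3", ["R5", "R9"])   # MUL R3 = R5 * R9

print(classify(add, sub))                   # ['RAW']
print(classify(add, mul))                   # ['WAR']
print(classify(add, ("R1", ["R7", "R8"])))  # ['WAW'] (hypothetical instruction)
```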

# **Control Dependencies**

- If we have a conditional branch, until we actually know the outcome, all later instructions must wait
  - That is, all instructions are control dependent on all earlier branches
  - This is true for unconditional branches as well (e.g., can't return from a function until we've loaded the return address)

# **Memory Dependencies**

- Basically similar to regular (register) data dependencies: RAW, WAR, WAW
- However, the exact location is not known:
  - A: STORE R1, 0[R2]
  - B: LOAD R5, 24[R8]
  - C: STORE R3, -8[R9]
  - RAW exists if (R2+0) == (R8+24)
  - WAR exists if (R8+24) == (R9-8)
  - WAW exists if (R2+0) == (R9-8)
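The point that memory dependences depend on run-time register values can be illustrated with a short sketch. The register values below are hypothetical, the function name `memory_deps` is made up for this illustration, and accesses are treated as conflicting only when their addresses are exactly equal (access sizes and partial overlap are ignored).

```python
def memory_deps(regs):
    """Which memory dependences exist among A, B, C for given register values."""
    addr_a = regs["R2"] + 0    # A: STORE R1, 0[R2]
    addr_b = regs["R8"] + 24   # B: LOAD  R5, 24[R8]
    addr_c = regs["R9"] - 8    # C: STORE R3, -8[R9]
    return {
        "RAW (A -> B)": addr_a == addr_b,
        "WAR (B -> C)": addr_b == addr_c,
        "WAW (A -> C)": addr_a == addr_c,
    }

# All three accesses hit address 124: every dependence exists.
print(memory_deps({"R2": 124, "R8": 100, "R9": 132}))
# Addresses 0, 124 and 192: no dependence at all.
print(memory_deps({"R2": 0, "R8": 100, "R9": 200}))
```

The same three instructions are fully dependent or fully independent depending on values that are only known once the address registers have been computed.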

# **Impact of Ignoring Dependencies**



# **Eliminating WAR Dependencies**

# WAR dependencies are from reusing registers



# **Eliminating WAW Dependencies**

# WAW dependencies are also from reusing registers



# So Why Do False Dep's Exist?

- Finite number of registers
  - At some point, you're forced to overwrite somewhere
  - Most RISC: 32 registers, x86: only 8, x86-64: 16
  - Hence WAR and WAW also called "name dependencies" (i.e. the "names" of the registers)
- So why not just add more registers?
- Thought exercise: what if you had infinite regs?

#### Reuse is Inevitable

## Loops, Code Reuse

- If you write a value to R₁ in a loop body, then R₁ will be reused every iteration → induces many false dep's
  - Loop unrolling can help a little
    - Will run out of registers at some point anyway
    - Trade off with code bloat
- Function calls result in similar register reuse
  - If printf writes to R1, then every call will result in a reuse of R1
  - Inlining can help a little for short functions
    - Same caveats

# **Obvious Solution: More Registers**

# Add more registers to the ISA?

- Changing the ISA can break binary compatibility BAD!!!
- All code must be recompiled
- Does not address register overwriting due to code reuse from loops and function calls
- Not a scalable solution

BAD? x86-64 adds registers...

... but it does so in a mostly backwards compatible fashion

# **Better Solution: HW Register Renaming**

## Give processor more registers than specified by the ISA

- Temporarily map ISA registers ("logical" or "architected" registers) to the *physical* registers to avoid overwrites

## Components:

- mapping mechanism
- physical registers
  - allocated vs. free registers
  - allocation/deallocation mechanism

# **Register Renaming**

## Example

- I3 cannot execute before I2 because I3 will overwrite R6
- I5 cannot go before I2 because I2, when it goes, will overwrite R2 with a stale value

## Program code

```
I1: ADD R1, R2, R3
I2: SUB R2, R1, R6
I3: AND R6, R11, R7
I4: OR R8, R5, R2
I5: XOR R2, R4, R11
```



# **Register Renaming**

- Solution: Let's give I2 a temporary name/location (e.g., S) for the value it produces.
- But I4 uses that value, so we must also change that to S...
- In fact, all uses of R2 from I3 to the next instruction that writes to R2 again must now be changed to S!
- We remove WAW deps in the same way: change R2 in I5 (and subsequent instrs) to T.

# **Register Renaming**

# Implementation

- Space for S, T, U etc.
- How do we know when to rename a register?

# Simple Solution

- Do renaming for every instruction
- Change the name of a register each time we decode an instruction that will write to it.
- Remember what name we gave it :)

## Program code


```
I1: ADD R1, R2, R3
I2: SUB S, R1, R6
I3: AND U, R11, R7
I4: OR R8, R5, S
I5: XOR T, R4, R11
```

# **Register File Organization**

We need some physical structure to store the register values



# **Putting it all Together**

## top:

R1 = R2 + R3

R2 = R4 - R1

R2 = R1 + R2

BNEZ R3, top

#### Free pool:

X9, X11, X7, X2, X13, X4, X8, X12, X3, X5...

| Structure | Entries |
|-----------|---------|
| ARF (architected register file) | R1, R2, R3, R4, R5, R6 |
| RAT (register alias table) | one entry per architected register: R1, R2, R3, R4, R5, R6 |
| PRF (physical register file) | X1 through X16 |

# **Renaming in action**

| Original code | Renamed code |
|---------------|--------------|
| R1 = R2 + R3 | X9 = R2 + R3 |
| R2 = R4 - R1 | X11 = R4 - X9 |
| R1 = R3 * R6 | X7 = R3 * R6 |
| R2 = R1 + R2 | X2 = X7 + X11 |
| R3 = R1 >> 1 | X13 = X7 >> 1 |
| BNEZ R3, top | BNEZ X13, top |
| R1 = R2 + R3 | X4 = X2 + X13 |
| R2 = R4 - R1 | X8 = R4 - X4 |
| R1 = R3 * R6 | X12 = X13 * R6 |
| R2 = R1 + R2 | X3 = X12 + X8 |
| R3 = R1 >> 1 | X5 = X12 >> 1 |
| BNEZ R3, top | BNEZ X5, top |
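The renaming walk-through can be sketched in code: look up each source in the RAT, take a fresh physical register from the free pool for the destination, and record the new mapping. This is a simplified illustration, not a hardware description. It assumes the free-pool order listed on the "Putting it all Together" slide, that an architected register with no RAT entry is read directly from the ARF, and that physical registers are never recycled. The function name `rename` is made up for this illustration.

```python
from collections import deque

loop_body = [
    ("R1", "+",  "R2", "R3"),
    ("R2", "-",  "R4", "R1"),
    ("R1", "*",  "R3", "R6"),
    ("R2", "+",  "R1", "R2"),
    ("R3", ">>", "R1", "1"),
]

def rename(instructions, free_pool, rat):
    """Rename one pass over the loop body, updating free_pool and rat."""
    out = []
    for dest, op, a, b in instructions:
        # Look up sources first so they see the mapping before this write.
        a, b = rat.get(a, a), rat.get(b, b)
        phys = free_pool.popleft()   # allocate a fresh physical register
        rat[dest] = phys             # later readers of dest now use phys
        out.append(f"{phys} = {a} {op} {b}")
    # The loop branch tests R3, so it reads whatever R3 currently maps to.
    out.append(f"BNEZ {rat.get('R3', 'R3')}, top")
    return out

free_pool = deque(["X9", "X11", "X7", "X2", "X13", "X4", "X8", "X12", "X3", "X5"])
rat = {}
renamed = []
for iteration in range(2):           # two trips around the loop
    renamed += rename(loop_body, free_pool, rat)
print("\n".join(renamed))
```

Because every write gets a new physical register, the second iteration no longer has WAR or WAW conflicts with the first; only the true (RAW) dependences remain.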



# **Even Physical Registers are Limited**

- We keep using new physical registers
  - What happens when we run out?
- There must be a way to "recycle"
- When can we recycle?
  - When we have given its value to all instructions that use it as a source operand!
  - This is not as easy as it sounds

# **Instruction Commit (leaving the pipe)**



Architected register file contains the "official" processor state

When an instruction leaves the pipeline, it makes its result "official" by updating the ARF

The ARF now contains the correct value; update the RAT

T42 is no longer needed; return it to the physical register free pool

# **Careful with the RAT Update!**



Update ARF as usual

Deallocate physical register

Don't touch that RAT!

(Someone else is the most
recent writer to R3)

At some point in the future,
the newer writer of R3 exits

This instruction was the most
recent writer, now update the RAT

Deallocate physical register
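The commit rule described above can be sketched as follows. This is an illustration under assumptions that are not on the slides: the RAT is a dictionary in which a missing entry means "read the value from the ARF", and the name T43 for the newer writer's physical register is hypothetical (T42 is taken from the earlier slide).

```python
def commit(arch_reg, phys_reg, arf, prf, rat, free_pool):
    """Retire one instruction that renamed arch_reg to phys_reg."""
    arf[arch_reg] = prf[phys_reg]    # the result becomes "official"
    if rat.get(arch_reg) == phys_reg:
        # Still the most recent writer: future readers use the ARF.
        del rat[arch_reg]
    # Otherwise a newer writer owns the RAT entry, so leave it alone.
    free_pool.append(phys_reg)       # deallocate the physical register

arf = {"R3": 0}
prf = {"T42": 7, "T43": 9}
rat = {"R3": "T43"}                  # a newer instruction renamed R3 to T43
free_pool = []

commit("R3", "T42", arf, prf, rat, free_pool)   # older writer exits
print(arf, rat, free_pool)   # {'R3': 7} {'R3': 'T43'} ['T42']

commit("R3", "T43", arf, prf, rat, free_pool)   # most recent writer exits
print(arf, rat, free_pool)   # {'R3': 9} {} ['T42', 'T43']
```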

#### **Instruction Commit: a Problem**



Decode I1 (rename R3 to T42)

Decode I2 (uses T42 instead of R3)

Execute I1 (Write result to T42)

I2 can't execute (e.g. R5 not ready)

Commit I1 (T42->R3, free T42)

Decode I3 (uses T42 instead of R6)

Execute I3 (writes result to T42)

R5 finally becomes ready

Execute I2 (read from T42)

We read the wrong value!!

# CSC 631: High-Performance Computer Architecture

Fall 2022 Lecture 5: Tomasulo's Algorithm

# **Implementing Dynamic Scheduling**

# ■ Tomasulo's Algorithm

- Used in IBM 360/91 (in the 60s)
- Tracks when operands are available to satisfy data dependences
- Removes name dependences through register renaming
- Almost all modern high-performance processors use a derivative of Tomasulo's... much of the terminology survives to today.

# Tomasulo's Algorithm: The Picture



# Issue (1)

- Get next instruction from instruction queue.
- Find a free reservation station for it (if none are free, stall until one is)
- Read operands that are in the registers
- If the operand is not in the register, find which reservation station will produce it
- In effect, this step renames registers (reservation station IDs are "temporary" names)

# Issue (2)



# Execute (1)

- Monitor results as they are produced
- Put a result into all reservation stations waiting for it (missing source operand)
- When all operands available for an instruction, it is ready (we can actually execute it)
- Several ready instrs for one functional unit?
  - Pick one.
  - Except for loads/stores: they must be done in the proper order to avoid hazards through memory (more on loads/stores in a later lecture)

# Execute (2)



# Execute (3) More than one ready inst for the same unit



# Write Result (1)

- When result is computed, make it available on the "common data bus" (CDB), where waiting reservation stations can pick it up
- Stores write to memory
- Result stored in the register file
- This step frees the reservation station
- For our register renaming, this recycles the temporary name (future instructions can again find the value in the actual register, until it is renamed again)

# Write Result (2)



# Tomasulo's Algorithm: Load/Store

- The reservation stations take care of dependences through registers.
- Dependences also possible through memory
  - Loads and stores are not reordered in the original IBM 360/91
  - We'll talk about how to do load-store reordering later

# **Detailed Example**

Assume

R2 is 100

R3 is 200

F4 is 2.5

Load: 2 cycles

Add: 2 cycles

Mult: 10 cycles

Divide: 40 cycles


#### **Reservation Stations**

Cycle: 8

Instruction status:

| Instruction | Operands    | Is | Ex | W |
|-------------|-------------|----|----|---|
| 1. L.D      | F6, 34(R2)  | 1  | 2  | 4 |
| 2. L.D      | F2, 45(R3)  | 2  | 3  | 5 |
| 3. MUL.D    | F0, F2, F4  | 3  | 6  |   |
| 4. SUB.D    | F8, F2, F6  | 4  | 6  | 8 |
| 5. DIV.D    | F10, F0, F6 | 5  |    |   |
| 6. ADD.D    | F6, F8, F2  | 6  |    |   |

Reservation stations (1 → 0: station freed in this cycle):

| Name | Busy  | Op    | Vj  | Vk  | Qj  | Qk | A |
|------|-------|-------|-----|-----|-----|----|---|
| LD1  | 0     |       |     |     |     |    |   |
| LD2  | 0     |       |     |     |     |    |   |
| AD1  | 1 → 0 | SUB.D | 1.5 | 0.5 |     |    |   |
| AD2  | 1     | ADD.D | 1.0 | 2.5 |     |    |   |
| AD3  |       |       |     |     |     |    |   |
| ML1  | 1     | MUL.D | 1.5 | 2.5 |     |    |   |
| ML2  | 1     | DIV.D |     | 0.5 | ML1 |    |   |

Register status:

| F0  | F2 | F4 | F6  | F8 | F10 | F12 |
|-----|----|----|-----|----|-----|-----|
| ML1 |    |    | AD2 |    | ML2 |     |

# **Detailed Example**


#### **Reservation Stations**

Cycle: 11

Instruction status:

| Instruction | Operands    | Is | Ex | W  |
|-------------|-------------|----|----|----|
| 1. L.D      | F6, 34(R2)  | 1  | 2  | 4  |
| 2. L.D      | F2, 45(R3)  | 2  | 3  | 5  |
| 3. MUL.D    | F0, F2, F4  | 3  | 6  |    |
| 4. SUB.D    | F8, F2, F6  | 4  | 6  | 8  |
| 5. DIV.D    | F10, F0, F6 | 5  |    |    |
| 6. ADD.D    | F6, F8, F2  | 6  | 9  | 11 |

Reservation stations (1 → 0: station freed in this cycle):

| Name | Busy  | Op    | Vj  | Vk  | Qj  | Qk | A |
|------|-------|-------|-----|-----|-----|----|---|
| LD1  | 0     |       |     |     |     |    |   |
| LD2  | 0     |       |     |     |     |    |   |
| AD1  | 0     |       |     |     |     |    |   |
| AD2  | 1 → 0 | ADD.D | 1.0 | 2.5 |     |    |   |
| AD3  |       |       |     |     |     |    |   |
| ML1  | 1     | MUL.D | 1.5 | 2.5 |     |    |   |
| ML2  | 1     | DIV.D |     | 0.5 | ML1 |    |   |

Register status:

| F0  | F2 | F4 | F6 | F8 | F10 | F12 |
|-----|----|----|----|----|-----|-----|
| ML1 |    |    |    |    | ML2 |     |

# **Detailed Example**


#### **Reservation Stations**

Cycle: 17

Instruction status:

| Instruction | Operands    | Is | Ex | W  |
|-------------|-------------|----|----|----|
| 1. L.D      | F6, 34(R2)  | 1  | 2  | 4  |
| 2. L.D      | F2, 45(R3)  | 2  | 3  | 5  |
| 3. MUL.D    | F0, F2, F4  | 3  | 6  | 16 |
| 4. SUB.D    | F8, F2, F6  | 4  | 6  | 8  |
| 5. DIV.D    | F10, F0, F6 | 5  | 17 |    |
| 6. ADD.D    | F6, F8, F2  | 6  | 9  | 11 |

Reservation stations:

| Name | Busy | Op    | Vj   | Vk  | Qj | Qk | A |
|------|------|-------|------|-----|----|----|---|
| LD1  | 0    |       |      |     |    |    |   |
| LD2  | 0    |       |      |     |    |    |   |
| AD1  | 0    |       |      |     |    |    |   |
| AD2  | 0    |       |      |     |    |    |   |
| AD3  |      |       |      |     |    |    |   |
| ML1  | 0    |       |      |     |    |    |   |
| ML2  | 1    | DIV.D | 3.75 | 0.5 |    |    |   |

Register status:

| F0 | F2 | F4 | F6 | F8 | F10 | F12 |
|----|----|----|----|----|-----|-----|
|    |    |    |    |    | ML2 |     |

# **Detailed Example**


#### **Reservation Stations**

Cycle: 57

Instruction status:

| Instruction | Operands    | Is | Ex | W  |
|-------------|-------------|----|----|----|
| 1. L.D      | F6, 34(R2)  | 1  | 2  | 4  |
| 2. L.D      | F2, 45(R3)  | 2  | 3  | 5  |
| 3. MUL.D    | F0, F2, F4  | 3  | 6  | 16 |
| 4. SUB.D    | F8, F2, F6  | 4  | 6  | 8  |
| 5. DIV.D    | F10, F0, F6 | 5  | 17 | 57 |
| 6. ADD.D    | F6, F8, F2  | 6  | 9  | 11 |

Reservation stations (1 → 0: station freed in this cycle):

| Name | Busy  | Op    | Vj   | Vk  | Qj | Qk | A |
|------|-------|-------|------|-----|----|----|---|
| LD1  | 0     |       |      |     |    |    |   |
| LD2  | 0     |       |      |     |    |    |   |
| AD1  | 0     |       |      |     |    |    |   |
| AD2  | 0     |       |      |     |    |    |   |
| AD3  |       |       |      |     |    |    |   |
| ML1  | 0     |       |      |     |    |    |   |
| ML2  | 1 → 0 | DIV.D | 3.75 | 0.5 |    |    |   |

Register status:

| F0 | F2 | F4 | F6 | F8 | F10 | F12 |
|----|----|----|----|----|-----|-----|
|    |    |    |    |    |     |     |

# **Timing Example**

Kind of hard to keep track with previous table-based approach

Simplified version to track timing only

Load: 2 cycles

Add: 2 cycles

Mult: 10 cycles

Divide: 40 cycles

| Inst  | Operands   | Is | Exec | Wr | Comments |
|-------|------------|----|------|----|----------|
| L.D   | F6, 34(R2) | 1  | 2    | 4  |          |
| L.D   | F2, 45(R3) | 2  | 3    | 5  |          |
| MUL.D | F0, F2, F4 | 3  | 6    | 16 |          |
| SUB.D | F8, F2, F6 | 4  | 6    | 8  |          |
| DIV.D | F10, F0, F6 | 5 | 17   | 57 |          |
| ADD.D | F6, F8, F2 | 6  | 9    | 11 |          |
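The timing-only view can be reproduced with a few lines of code. This is a sketch under assumptions inferred from the table rather than stated on the slides: one instruction issues per cycle, execution starts the cycle after both issue and the last operand write, the result is written `latency` cycles after execution starts, SUB.D uses the add latency, and the load base registers are ready. Structural hazards (reservation station, functional unit and CDB contention) are ignored. The function name `timing` is made up for this illustration.

```python
LATENCY = {"L.D": 2, "ADD.D": 2, "SUB.D": 2, "MUL.D": 10, "DIV.D": 40}

program = [
    ("L.D",   "F6",  []),            # L.D   F6, 34(R2)
    ("L.D",   "F2",  []),            # L.D   F2, 45(R3)
    ("MUL.D", "F0",  ["F2", "F4"]),  # MUL.D F0, F2, F4
    ("SUB.D", "F8",  ["F2", "F6"]),  # SUB.D F8, F2, F6
    ("DIV.D", "F10", ["F0", "F6"]),  # DIV.D F10, F0, F6
    ("ADD.D", "F6",  ["F8", "F2"]),  # ADD.D F6, F8, F2
]

def timing(program):
    """Return (op, dest, issue, exec_start, write) for each instruction."""
    written = {}   # register -> cycle in which its latest producer writes
    rows = []
    for issue, (op, dest, srcs) in enumerate(program, start=1):
        # Operands are captured at issue, so only earlier producers matter.
        operands_ready = max((written.get(s, 0) for s in srcs), default=0)
        start = max(issue, operands_ready) + 1
        write = start + LATENCY[op]
        written[dest] = write
        rows.append((op, dest, issue, start, write))
    return rows

rows = timing(program)
for op, dest, issue, start, write in rows:
    print(f"{op:6} {dest:4} Is={issue:2} Exec={start:2} Wr={write:2}")
```

Under these assumptions the computed cycles match the Is, Exec and Wr columns of the table, including DIV.D waiting until cycle 17 for the MUL.D result written in cycle 16.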