# CS425 Computer Systems Architecture

**Fall 2020** 

**Static Instruction Scheduling** 

#### Techniques to reduce stalls

CPI = Ideal CPI + Structural stalls per instruction + RAW stalls per instruction + WAR stalls per instruction + WAW stalls per instruction

We will study two types of techniques:

| Dynamic instruction scheduling                                            | Static instruction scheduling (SW/compiler) |
|---------------------------------------------------------------------------|---------------------------------------------|
| Scoreboard (reduce RAW stalls)                                            | Loop Unrolling                              |
| Register Renaming (reduce WAR & WAW stalls)  • Tomasulo  • Reorder buffer | SW pipelining                               |
| Branch Prediction (reduce control stalls)                                 | Trace Scheduling                            |

## Scoreboard Architecture (CDC 6600)



Registers

# **Tomasulo Organization**



#### Dependencies between Instructions

- What are the sources of stalls/bubbles?
  - instructions that use the same registers
- Parallel instructions can execute without imposing any stalls (if we ignore structural hazards)

```
DIV.D F0, F2, F4ADD.D F10, F1, F3
```

Dependencies between instructions may lead to stalls

```
    DIV.D F0, F2, F4
    RAW must enter the execution stage in order
    ADD.D F10, F0, F3
```

• The dependencies between instructions limit the order of execution of these instructions (impose in order execution). In the 2<sup>nd</sup> example ADD.D **must** execute after DIV.D has completed. On the other hand, parallel instructions **may** execute in the any order (out-of-order execution). In the 1<sup>st</sup> example ADD.D can execute before DIV.D.

#### Dependencies between Instructions

- Data Dependences: instructions are data dependent when there is a chain of RAW hazards between them.
- Name Dependences: instructions are name dependent when there
  is a WAR (anti-dependence) or WAW (output-dependence) hazard
  between them.

```
L.D F0, 0(R1)
ADD.D F4, F0, F2
L.D F0, 0(R2)
```

Control Dependences: Instructions dependent via branches.
 if p1 { S1; }

#### **R4000 Performance**

- Non-Ideal CPI:
  - Load stalls:1 ή 2 clock cycles
  - Branch stalls:2 cycles + unfilledslots
  - FP result stalls:RAW data hazard (latency)
  - FP structural stalls:
     Not enough FP hardware (parallelism)



#### Instruction Level Parallelism (ILP)

- ILP: parallel execution of unrelated (independent) instructions
- gcc 17% control transfer instructions
  - -5 instructions + 1 branch
  - need to look beyond a code block to find more instruction level parallelism
- Loop level parallelism one opportunity
  - First SW, then HW approaches

#### FP Loop: where are the hazards?

```
while (R1 > 0) { M[R1] = M[R1] + F2; R1 -= 8 }
Loop: L.D    F0,0(R1);F0=vector element
ADD.D    F4,F0,F2;add scalar from F2
S.D         0(R1),F4;store result
SUBI    R1,R1,8;decrement pointer 8B (DW)
BNEZ    R1,Loop;branch R1!=zero
NOP    ;branch delay slot
```

| Stalls?                        |
|--------------------------------|
| Assume 5-stage DLX (in order)  |
| Assume a stage DEX (iii order) |
|                                |
|                                |
|                                |

#### **FP Loop Showing Stalls**

```
1 Loop: L.D FO, 0 (R1)
                       ;F0=vector element
       stall
2
       ADD.D F4, F0, F2; add scalar in F2
       stall
       stall
       S.D
             0(R1), F4 ;store result
       SUBI
             R1,R1,8
                       ;decrement pointer 8B (DW)
                       ;branch R1!=zero
       BNEZ
             R1,Loop
       stall
                        ;branch delay slot
```

| Instruction producing result | Instruction using result | Latency in clock cycles | 9 cycles:           |
|------------------------------|--------------------------|-------------------------|---------------------|
| FP ALU op                    | Another FP ALU op        | 3                       | Rewrite the code to |
| FP ALU op                    | Store double             | 2                       | minimize stalls!    |
| I oad double                 | FP ALU on                | 1                       |                     |

#### Scheduled code for FP Loop

```
1 Loop: L.D F0,0(R1)
2     stall
3     ADD.D F4,F0,F2
4     SUBI R1,R1,8
5     BNEZ R1,Loop ;delayed branch
6     S.D 8(R1),F4 ;altered when move past SUBI
```

#### Move SD past BNEZ by modifying the address offset of SD

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |

6 clocks: Unroll loop 4 times to make code faster?

#### **Loop Unrolling**

```
while (R1 > 0) { M[R1] = M[R1] + F2; R1 -= 8 }
while (R1 >= 4*8) {
 M[R1] = M[R1] + F2;
                                   Independent instructions
 M[R1-8] = M[R1-8] + F2;
                                   inside the loop. Good
 M[R1-16] = M[R1-16] + F2;
                                   opportunities for scheduling.
 M[R1-24] = M[R1-24] + F2;
 R1 -= 4*8
while (R1 > 0) { M[R1] = M[R1] + F2; R1 -= 8 }
```

## Unroll Loop 4 times: name dependencies?

```
1 Loop:L.D
           F0,0(R1)
2
      ADD.D F4,F0,F2
      S.D \quad O(R1), F4
                          ;drop SUBI & BNEZ
      L.D F0, -8 (R1)
      ADD.D F4,F0,F2
6
      S.D -8(R1), F4
                         ;drop SUBI & BNEZ
      L.D F0, -16(R1)
8
      ADD.D F4,F0,F2
9
      S.D -16(R1), F4
                         ;drop SUBI & BNEZ
10
      L.D F0, -24(R1)
11
      ADD.D F4,F0,F2
12
      S.D
            -24(R1), F4
13
       SUBI R1,R1,#32
                         ;alter to 4*8
14
      BNEZ
             R1,LOOP
15
      NOP
```

## Unroll Loop 4 times: name dependencies?

```
1 Loop: L.D
               F0,0(R1)
2
       ADD, D
               F4,F0,F2
       S.D
               (1), F4
                             ;drop SUBI & BNEZ
               F(0, -8)
       L.D
       ADD.D
               F4,F0,F2
                (R1), F4
6
       S.D
                             ;drop SUBI & BNEZ
       L.D
               F(1, -16)
8
               F4,F0,F2
       ADD, D
9
       S.D
                 6 (R1), F4
                             ;drop SUBI & BNEZ
10
               F(0, -24 (R1))
       L.D
              F4,F0,F2
11
       ADD.D
12
       S.D
              -24(R1), F4
13
              R1,R1,#32
       SUBI
                            ;alter to 4*8
14
       BNEZ
              R1,LOOP
15
       NOP
```

#### How to deal with these?

#### No name dependencies now!

```
1 Loop:L.D
           F0,0(R1)
2
      ADD.D F4,F0,F2
      S.D 0(R1), F4
                         ;drop SUBI & BNEZ
      L.D F6, -8(R1)
      ADD.D F8, F6, F2
6
      S.D -8 (R1), F8 ; drop SUBI & BNEZ
      L.D F10, -16(R1)
8
      ADD.D F12,F10,F2
9
      S.D -16(R1), F12
                         ;drop SUBI & BNEZ
10
      L.D F14, -24(R1)
11
      ADD.D F16,F14,F2
12
      S.D -24(R1), F16
13
      SUBI R1,R1,#32 ;alter to 4*8
14
      BNEZ
             R1,LOOP
15
      NOP
```

<sup>&</sup>quot;register renaming" removed WAR/WAW stalls

#### **Unroll Loop 4 times**

```
1 cycle stall
              F0,0(R1)
 Loop: L.D
                                  2 cycles stall
2
       ADD.D F4,F0,F2
       S.D \quad O(R1), F4
                            ;drop SUBI & BNEZ
       L.D F6, -8(R1)
       ADD.D F8, F6, F2
6
       S.D -8 (R1), F8
                            ;drop SUBI & BNEZ
       L.D F10, -16(R1)
8
       ADD.D F12,F10,F2
9
       S.D -16(R1), F12
                            ;drop SUBI & BNEZ
10
       L.D F14, -24(R1)
11
       ADD.D F16,F14,F2
12
       S.D
            -24(R1),F16
13
            R1,R1,#32
                           ;alter to 4*8
       SUBI
14
              R1,LOOP
       BNEZ
15
       NOP
```

eliminates overhead instructions, but increases code size

Rewrite loop to minimize stalls?

 $15 + 4 \times (1+2) = 27$  clock cycles, or 6.8 per iteration Assumes R1 is multiple of 4

#### **Schedule Unrolled Loop**

```
1 Loop:L.D
            F0,0(R1)
2
      L.D F6, -8(R1)
      L.D F10, -16(R1)
      L.D F14,-24(R1)
      ADD.D F4,F0,F2
6
      ADD.D F8, F6, F2
      ADD.D F12,F10,F2
      ADD.D F16,F14,F2
      S.D
            0(R1),F4
10
      S.D -8 (R1), F8
      S.D -16(R1),F12
11
12
           R1,R1,#32
      SUBI
13
             R1,LOOP
      BNEZ
14
             8 (R1), F16 ; 8-32 = -24
      S.D
```

14 clock cycles, or 3.5 per iteration

- What kind of assumptions did we use to reorder and move the instructions?
  - OK to move store past
     SUBI even though changes register
  - OK to move loads before stores/add: get right data?
  - When is it safe for compiler to do such changes?

## **Compiler Perspectives on Code Movement**

- Name Dependencies are hard to identify for Memory Accesses
  - -100(R4) == 20(R6)?
  - for different iterations of the loop, is 20(R6) == 20(R6)?
- In our example the compiler must understand that when R1 does not change then:

```
0 (R1) \neq -8 (R1) \neq -16 (R1) \neq -24 (R1)
```

 There were no dependencies between some loads and stores so they could be moved by each other

#### When is it safe to unroll a loop?

Example: Are there dependencies? (A,B,C distinct & non-overlapping)

```
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];     /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}</pre>
```

- S2 uses the value, A[i+1], computed by S1 in the same iteration.
- S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This form of dependence (across iterations) is called loop-carried dependence
- In our prior example, each iteration was distinct. Dependences in the above example force successive iterations of this loop to execute in series.
- Implies that iterations can't be executed in parallel, right?

#### Loop-carried dependence: No parallelism?

Example:

```
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i]; /* S1 */
    B[i+1] = C[i] + D[i]; /* S2 */
}</pre>
```

- S1 uses the value of B[i] which is produced by a previous iteration (loop-carried depedence).
- There is no other dependency. Hence, this dependence is not circular. We can conclude that the loop can be parallel.

```
A[0] = A[0] + B[0]

B[1] = C[0] + D[0]

A[1] = A[1] + B[1]

B[2] = C[1] + D[1]

A[2] = A[2] + B[2]

B[3] = C[2] + D[2]
```



```
A[0] = A[0] + B[0]
B[1] = C[0] + D[0]
A[1] = A[1] + B[1]
B[2] = C[1] + D[1]
A[2] = A[2] + B[2]
B[3] = C[2] + D[2]
```

. . .

#### Loop-carried dependence: No parallelism?

• Example:

```
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];    /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}</pre>
```



#### Recurrence – Depedence Distance

• Example:
 for (i=1; i< 100; i=i+1) {
 Y[i] = Y[i-1] + Y[i];
}</pre>

loop-carried dependence in recurrence form.

• Example:

```
for (i=5; i< 100; i=i+1) {
    Y[i] = Y[i-5] + Y[i];
}</pre>
```

Iteration i depends on iteration i-5, thus it has a dependence distance of 5. The longer the dependence distance the more potential to extract parallelism.

## **Alternative: Software Pipelining**

- Observation: If the iterations of the loop are independent, then we can exploit more ILP by executing instructions from <u>different</u> <u>iterations</u> of the loop.
- Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop without loop unrolling (Tomasulo in SW)



| Iterat | Iteration 0 Iteration 1 |      | Itera         | Iteration 2 Iterat |            | Iteration 3 |              | Iteration 4 |                     |
|--------|-------------------------|------|---------------|--------------------|------------|-------------|--------------|-------------|---------------------|
| LD     | F0,0(R1)                |      | start-up code |                    |            |             |              |             |                     |
| ADDD   | F4,F0,F2                | LD   | F0,-8(R1)     |                    |            |             |              |             |                     |
| SD     | 0(R1),F4                | ADDE | F4,F0,F2      | LD                 | F0,-16(R1) |             |              |             |                     |
|        |                         | SD   | -8 (R1), F4   | ADDD               | F4,F0,F2   | LD          | F0,-24(R1)   |             |                     |
|        |                         |      |               | SD                 | -16(R1),F4 | ADDI        | )F4,F0,F2    | LD          | F0,-32(R1)          |
| •      |                         |      |               |                    |            | SD          | -24 (R1) ,F4 | ADDD        | F4,F0,F2            |
|        |                         |      |               |                    |            | finish      | -up code     | SD          | -32 (R1), <b>F4</b> |

|          | Itera | tion 0    | Itera | tion 1        | Itera | ation 2    | Iteration 3 |              | Ite  | eration 4    |  |
|----------|-------|-----------|-------|---------------|-------|------------|-------------|--------------|------|--------------|--|
|          | LD    | F0,0(R1)  |       | start-up code |       |            |             |              |      |              |  |
|          | ADDI  | F4,F0,F2  | LD    | F0,-8(R1)     |       |            |             |              |      |              |  |
|          | SD    | 0 (R1),F4 | ADDI  | F4,F0,F2      | LD    | F0,-16(R1) |             |              |      |              |  |
|          |       |           | SD    | -8 (R1) , F4  | ADDI  | F4,F0,F2   | LD          | F0,-24(R1)   |      |              |  |
|          |       |           |       |               | SD    | -16(R1),F4 | ADDE        | )F4,F0,F2    | LD   | F0,-32(R1)   |  |
| <b>\</b> |       |           |       |               |       |            | SD          | -24 (R1) ,F4 | ADDE | )F4,F0,F2    |  |
|          |       |           |       |               |       |            | finish      | -up code     | SD   | -32 (R1), F4 |  |

| Iteration 0     | Iteration 1     | Iteration 2     | Iteration 3     | Iteration 4            |
|-----------------|-----------------|-----------------|-----------------|------------------------|
| LD F0,0(R1)     | start-up code   |                 |                 |                        |
| ADDD F4, F0, F2 | LD F0,-8(R1)    |                 |                 |                        |
| SD 0(R1),F4     | ADDD F4, F0, F2 | LD F0,-16(R1)   |                 |                        |
|                 | SD -8 (R1), F4  | ADDD F4, F0, F2 | LD F0,-24(R1)   |                        |
|                 |                 | SD -16(R1),F4   | ADDD F4, F0, F2 | LD F0,-32(R1)          |
|                 |                 |                 | SD -24 (R1), F4 | ADDD F4, F0, F2        |
|                 |                 |                 | finish-up code  | SD -32 (R1), <b>F4</b> |

|          | Iteration 0 Iteration 1 |          | Itera | Iteration 2   |      | Iteration 3 |        | Iteration 4  |      |               |
|----------|-------------------------|----------|-------|---------------|------|-------------|--------|--------------|------|---------------|
|          | LD                      | F0,0(R1) |       | start-up code |      |             |        |              |      |               |
|          | ADDE                    | F4,F0,F2 | LD    | F0,-8(R1)     |      |             |        |              |      |               |
|          | SD                      | 0(R1),F4 | ADDI  | F4,F0,F2      | LD   | F0,-16(R1)  |        |              |      |               |
|          |                         |          | SD    | -8 (R1), F4   | ADDI | F4,F0,F2    | LD     | F0,-24(R1)   |      |               |
|          |                         |          |       |               | SD   | -16(R1),F4  | ADDD   | F4,F0,F2     | LD   | F0,-32(R1)    |
| <b>\</b> |                         |          |       |               |      |             | SD     | -24 (R1) ,F4 | ADDI | )F4,F0,F2     |
|          |                         |          |       |               |      |             | finish | -up code     | SD   | -32 (R1) , F4 |

| Iterat | ion 0    | Itera | tion 1             | Itera | ation 2    | Itera  | ation 3      | Ite  | ration 4     |
|--------|----------|-------|--------------------|-------|------------|--------|--------------|------|--------------|
| LD     | F0,0(R1) |       | start-up code      |       |            |        |              |      |              |
| ADDD   | F4,F0,F2 | LD    | F0,-8(R1)          |       |            |        |              |      |              |
| SD     | 0(R1),F4 | ADDE  | F4,F0,F2           | LD    | F0,-16(R1) |        |              |      |              |
|        |          | SD    | -8 (R1), <b>F4</b> | ADDI  | DF4,F0,F2  | LD     | F0,-24(R1)   |      |              |
|        |          |       |                    | SD    | -16(R1),F4 | ADDE   | F4,F0,F2     | LD   | F0,-32(R1)   |
|        |          |       |                    |       |            | SD     | -24 (R1) ,F4 | ADDI | )F4,F0,F2    |
|        |          |       |                    |       |            | finish | -up code     | SD   | -32 (R1), F4 |

|   | Itera | tion 0   | Itera | tion 1             | Itera | ation 2    | Itera  | ation 3      | lte  | ration 4            |
|---|-------|----------|-------|--------------------|-------|------------|--------|--------------|------|---------------------|
|   | LD    | F0,0(R1) |       | start-up code      |       |            |        |              |      |                     |
|   | ADDI  | F4,F0,F2 | LD    | F0,-8(R1)          |       |            |        |              |      |                     |
|   | SD    | 0(R1),F4 | ADDI  | )F4,F0,F2          | LD    | F0,-16(R1) |        |              |      |                     |
|   |       |          | SD    | -8 (R1), <b>F4</b> | ADDD  | F4,F0,F2   | LD     | F0,-24(R1)   |      |                     |
|   |       |          |       |                    | SD    | -16(R1),F4 | ADDI   | )F4,F0,F2    | LD   | F0,-32(R1)          |
|   |       |          |       |                    |       |            | SD     | -24 (R1) ,F4 | ADDI | F4,F0,F2            |
| • |       |          |       |                    |       |            | finish | -up code     | SD   | -32 (R1), <b>F4</b> |

|          | Itera | tion 0   | Itera | tion 1                                | Itera | ation 2    | Itera  | ation 3      | lte  | ration 4      |
|----------|-------|----------|-------|---------------------------------------|-------|------------|--------|--------------|------|---------------|
|          | LD    | F0,0(R1) |       | start-up code                         |       |            |        |              |      |               |
|          | ADDI  | F4,F0,F2 | LD    | F0,-8(R1)                             |       |            |        |              |      |               |
|          | SD    | 0(R1),F4 | ADDE  | )F4,F0,F2                             | LD    | F0,-16(R1) |        |              |      |               |
|          |       |          | SD    | -8 (R1), F4                           | ADDI  | F4,F0,F2   | LD     | F0,-24(R1)   |      |               |
|          |       |          |       |                                       | SD    | -16(R1),F4 | ADDE   | )F4,F0,F2    | LD   | F0,-32(R1)    |
| <b>↓</b> |       |          |       |                                       |       |            | SD     | -24 (R1) ,F4 | ADDI | )F4,F0,F2     |
|          |       |          |       |                                       |       |            | finish | -up code     | SD   | -32 (R1) , F4 |
|          |       |          |       | · · · · · · · · · · · · · · · · · · · |       |            |        |              |      |               |

```
Before: Unroll 3 times
                             After: Software Pipelined
        F0,0(R1)
   L.D
                                S.D
                                      0(R1),F4 ; Stores M[i]
   ADD.D F4, F0, F2
                                ADD.DF4,F0,F2; Adds to M[i-1]
   S.D 0(R1), F4
                                L.D F0,-16(R1); Loads M[i-2]
   L.D F6, -8(R1)
                                SUBI R1,R1,\#8 ; i = i - 1
   ADD.D F8, F6, F2
                                BNEZ R1,LOOP
   S.D -8 (R1), F8
  L.D F10, -16(R1)
  ADD.D F12, F10, F2
  S.D -16(R1), F12
                             5 cycles per iteration
10 SUBI R1,R1,#24
   BNEZ R1,LOOP
```

RAW hazards convert to WAR hazards.

#### Software Pipelining vs Loop Unrolling

#### Symbolic Loop Unrolling

- Maximize result-use distance
- > Less code space than unrolling

#### But..

- Harder to implement
- Execution of SUB & BNEZ in every iteration