# Microcoded and VLIW Processors

Daniel Sanchez
Computer Science & Artificial Intelligence Lab
M.I.T.

## Hardwired vs Microcoded Processors

- All processors we have seen so far are hardwired: The microarchitecture directly implements all the instructions in the ISA
- Microcoded processors add a layer of interpretation: Each ISA instruction is executed as a sequence of simpler microinstructions
  - Simpler implementation
  - Lower performance than hardwired (CPI > 1)
- Microcoding common until the 80s, still in use today (e.g., complex x86 instructions are decoded into multiple "micro-ops")

## Microcontrol Unit

[Maurice Wilkes, 1954]

Embed the control logic state table in a read-only memory array



April 29, 2021 MIT 6.823 Spring 2021 L19-3

## Microcoded Microarchitecture



## A Bus-based Datapath for MIPS



RegSel = PC; enReg=yes; MA IdMA= yes  $\leftarrow PC$ means

B  $\leftarrow \text{Reg[rt]}$ means

## Memory Module



 Assumption: Memory operates asynchronously and is slow compared to Reg-to-Reg transfers

## Microcode Controller



April 29, 2021

MIT 6.823 Spring 2021

## Jump Logic

```
\muPCSrc = Case \muJumpTypes
```

```
next \Rightarrow \muPC+1

spin \Rightarrow if (busy) then \muPC else \muPC+1

fetch \Rightarrow absolute

dispatch \Rightarrow op-group

feqz \Rightarrow if (zero) then absolute else \muPC+1

fnez \Rightarrow if (zero) then \muPC+1 else absolute
```

### Instruction Execution

#### Execution of a MIPS instruction involves

- 1. instruction fetch
- 2. decode and register fetch
- 3. ALU operation
- 4. memory operation (optional)
- 5. write back to register file (optional)
  - + the computation of the next instruction address

## Instruction Fetch

```
State
            Control points
                                                next-state
fetch<sub>0</sub> MA \leftarrow PC
fetch<sub>1</sub> IR \leftarrow Memory
fetch<sub>2</sub> A \leftarrow PC
fetch<sub>3</sub> PC \leftarrow A + 4
ALU_0 A \leftarrow Reg[rs]
ALU_1 B \leftarrow Reg[rt]
ALU_2 Reg[rd]\leftarrowfunc(A,B)
                                                                Opcode
                                                                                                               busy
                                                                        OpSel
                                                                                    ldB
                                                                         2/
ALUi_0 A \leftarrow Reg[rs]
ALUi<sub>1</sub> B \leftarrow sExt_{16}(Imm)
ALUi_2 Reg[rd] \leftarrow Op(A,B)
                                                                                          32 GPRs
                                                                                          + PC ...
                                                            ExtSel
                                                                                                           Memory
                                                                                                                   MemWrt
                                                                  Imm
                                                                                                   RegWrt
                                                                        control
                                                                                         32-bit Reg
                                                                                                   enRea
                                                                                                                   enMem
                                                           enImm
                                                                          enALU
                                                                                           data
                                                                                                             data
```

## Load & Store

| State                                                                           | Control points                                                                                       | next-state                            |
|---------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|---------------------------------------|
| LW <sub>0</sub> LW <sub>1</sub> LW <sub>2</sub> LW <sub>3</sub> LW <sub>4</sub> | $A \leftarrow Reg[rs]$ $B \leftarrow sExt_{16}(Imm)$ $MA \leftarrow A+B$ $Reg[rt] \leftarrow Memory$ | next<br>next<br>next<br>spin<br>fetch |
| $SW_0$<br>$SW_1$<br>$SW_2$<br>$SW_3$<br>$SW_4$                                  | $A \leftarrow Reg[rs]$ $B \leftarrow sExt_{16}(Imm)$ $MA \leftarrow A+B$ $Memory \leftarrow Reg[rt]$ | next<br>next<br>next<br>spin<br>fetch |

## Branches

| State                                  | Control points                                                              | next-state                   |
|----------------------------------------|-----------------------------------------------------------------------------|------------------------------|
| $BEQZ_0$ $BEQZ_1$ $BEQZ_2$ $BEQZ_3$    | $A \leftarrow Reg[rs]$ $A \leftarrow PC$ $B \leftarrow sExt_{16}(Imm << 2)$ | next<br>fnez<br>next<br>next |
| $BEQZ_4$                               | $PC \leftarrow A + B$                                                       | fetch                        |
| BNEZ <sub>0</sub><br>BNEZ <sub>1</sub> | A ← Reg[rs]                                                                 | next<br>feqz                 |
| $BNEZ_2$                               | $A \leftarrow PC$                                                           | next                         |
| $BNEZ_3$                               | $B \leftarrow sExt_{16}(Imm << 2)$                                          | next                         |
| BNEZ <sub>4</sub>                      | $PC \leftarrow A + B$                                                       | fetch                        |

# Jumps

| State                                                                            | Control points                                                                                         | next-state                    |
|----------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|-------------------------------|
| J <sub>0</sub> J <sub>1</sub> J <sub>2</sub>                                     | $\begin{array}{l} A & \leftarrow PC \\ B & \leftarrow IR \\ PC & \leftarrow JumpTarg(A,B) \end{array}$ | next<br>next<br>fetch         |
| $JR_0$ $JR_1$                                                                    | $A \leftarrow Reg[rs]$<br>PC $\leftarrow A$                                                            | next<br>fetch                 |
| JAL <sub>0</sub> JAL <sub>1</sub> JAL <sub>2</sub> JAL <sub>3</sub>              | $A \leftarrow PC$ $Reg[31] \leftarrow A$ $B \leftarrow IR$ $PC \leftarrow JumpTarg(A,B)$               | next<br>next<br>next<br>fetch |
| JALR <sub>0</sub><br>JALR <sub>1</sub><br>JALR <sub>2</sub><br>JALR <sub>3</sub> | $A \leftarrow PC$ $B \leftarrow Reg[rs]$ $Reg[31] \leftarrow A$ $PC \leftarrow B$                      | next<br>next<br>next<br>fetch |

## VAX 11-780 Microcode (1978)

```
, P1WFUD.1 (600,1205)
                                               26-May-81 14:58:1
                             MICRO2 1F(12)
                                                                       VAX11/780 Microcode : PCS 01, FPLA 0D, WCS122
                                                                                                                         Jage 771
; CALL2 .MIC [600,1205]
                            Procedure call
                                                  : CALLG, CALLS
                                               129744 THERE FOR CALLS OR CALLS, AFTER PROBING THE EXTENT OF THE STACK
                                               129745
                                               :29746
                                                                                        ---- CALL SITE FOR MPUSH
                                               :29747
                                                       CALL. 7: D_Q. AND. RC[T2].
                                                                                              ISTRIP MASK TO BITS 11-0
6557K
        0 U 11F4, 0811,2035,0180,F910,0000,0CD8
                                                       129748
                                                                      CALL, J/MPUSH
                                                                                                      PUSH REGISTERS
                                               129749
                                               129750
                                                               ;----; RETURN FROM MPUSH
                                               129751
                                                               CACHE_D[LONG] ,
6557K 7763K U 11F5, 0000,003C,0180,3270,0000,134A
                                                       129752
                                                                                                      ; BY SP
                                               129753
                                               :29754
6856K
           U 134A, 0018,0000,0180,FAF0,0200,134C
                                                       129755 CALL.8: R[SP]&VA_LA-K[.8]
                                                                                                      SUPDATE SP FOR PUSH OF PC &
                                               129756
                                               129757
6856K
         0 U 134C, 0800,003C,0180,FA68,0000,11F8
                                                       129758
                                                                      D_R[FP]
                                                                                                      FREADY TO PUSH FRAME POINTER
                                               129759
                                               :29750
                                                               !-----CALL SITE FOR PSHSP
                                               129761
                                                               CACHE_D[LONG],
                                                                                              ISTORE FP.
                                               129762
                                                               LAB_R[SP].
                                                                                              ; GET SP AGAIN
                                               :29763
                                                               SC_K[.FFF0].
                                                                                              1-16 TO SC
       21M U 11F8, 0000,003D,6D80,3270,0084,6CD9
                                                       129764
                                                                      CALL, J/PSHSP
                                               129765
                                               :29766
                                                               D_R[AP],
                                               129767
                                                                                              READY TO PUSH AP
6856K
        0 U 11F9, 0800,003C,3DF0,2E60,0000,134D
                                                       129768
                                                                      Q_ID[PSL]
                                                                                                     ; AND GET PSW FOR COMBINATIO
                                               :29769
                                               129770
                                                               129771
                                                               CACHE_D[LONG] .
                                                                                              STORE OLD AP
                                               129772
                                                               Q_Q.ANDNOT.K[.1F],
                                                                                              CLEAR PSW<T,N,Z,V,C>
6856K
       21M U 134D, 0019,2024,8DC0,3270,0000,134E
                                                       129773
                                                                                                      GET SP INTO LATCHES AGAIN
                                               129774
                                               :29775
6856K
        0 U 134E, 2010,0038,0180,F909,4200,1350
                                                       129776
                                                                      PC&VA_RC[T1], FLUSH.IB
                                                                                                      ! LOAD NEW PC AND CLEAR OUT
                                               129777
                                               129778
                                               :29779
                                                               D_DAL.SC.
                                                                                              1PSW TO D<31:16>
                                               129780
                                                               Q_RC[T2],
                                                                                              RECOVER MASK
                                                               SC-SC+K[.3],
                                               :29781
                                                                                              PUT -13 IN SC
6856K
        0 U 1350, OD10,0038,ODC0,6114,0084,9351
                                                      129782
                                                                      LOAD, IB, PC_PC+1
                                                                                                      ISTART FETCHING SUBROUTINE I
                                               129783
                                               129784
                                               129785
                                                              D_DAL.SC.
                                                                                              MASK AND PSW IN D<31:03>
                                               129786
                                                              O_PC[T4],
                                                                                              GET LOW BITS OF OLD SP TO Q<1:0>
6856K
        0 U 1351, 0D10,0038,F5C0,F920,0084,9352
                                                      129787
                                                                      SC_SC+K[.A]
                                                                                                      PUT -3 IN SC
                                               129788
```

# Very Long Instruction Word (VLIW) Processors

## Sequential ISA Bottleneck



## VLIW: Very Long Instruction Word



- Multiple operations packed into one instruction
- Each operation slot is for a fixed function
- Constant operation latencies are specified

## VLIW Design Principles

#### The architecture:

- Allows operation parallelism within an instruction
  - No cross-operation RAW check
- Provides deterministic latency for all operations
  - Latency measured in 'instructions'
  - No data use allowed before specified latency with no data interlocks

#### The compiler:

- Schedules (reorders) to maximize parallel execution
- Guarantees intra-instruction parallelism
- Schedules to avoid data hazards (no interlocks)
  - Typically separates operations with explicit NOPs

## Early VLIW Machines

#### FPS AP120B (1976)

- scientific attached array processor
- first commercial wide instruction machine
- hand-coded vector math libraries using software pipelining and loop unrolling

#### Multiflow Trace (1987)

- commercialization of ideas from Fisher's Yale group including "trace scheduling"
- available in configurations with 7, 14, or 28 operations/instruction
- 28 operations packed into a 1024-bit instruction word

#### Cydrome Cydra-5 (1987)

- 7 operations encoded in 256-bit instruction word
- rotating register file

## Loop Execution



How many FP ops/cycle?

# Loop Unrolling

```
for (i=0; i<N; i++) B[i] = A[i] + C; B[i] = A[i] + C; B[i] = A[i] + C; B[i+1] = A[i+1] + C; B[i+2] = A[i+2] + C; B[i+3] = A[i+3] + C;
```

Is this code correct?

# Scheduling Loop Unrolled Code



How many FLOPS/cycle?

# Software Pipelining



# Software Pipelining vs. Unrolling





Software pipelining pays startup/wind-down costs only once per loop, not once per iteration

# What if there are no loops?



- Branches limit basic block size in control-flow intensive irregular code
- Difficult to find ILP in individual basic blocks

# Trace Scheduling

[Fisher, Ellis]



- Pick string of basic blocks, a trace, that represents most frequent branch path
- Schedule whole "trace" at once
- Add fixup code to cope with branches jumping out of trace

How do we know which trace to pick?

## Problems with "Classic" VLIW

- Knowing branch probabilities
  - Profiling requires an significant extra step in build process
- Scheduling for statically unpredictable branches
  - Optimal schedule varies with branch path
- Object code size
  - Instruction padding wastes instruction memory/cache
  - Loop unrolling/software pipelining replicates code
- Scheduling memory operations
  - Caches and/or memory bank conflicts impose statically unpredictable variability
  - Uncertainty about addresses limit code reordering
- Object-code compatibility
  - Have to recompile all code for every machine, even for two machines in same generation

## **VLIW Instruction Encoding**

- Schemes to reduce effect of unused fields
  - Compressed format in memory, expand on I-cache refill
    - used in Multiflow Trace
    - introduces instruction addressing challenge
  - Provide a single-op VLIW instruction
    - Cydra-5 UniOp instructions
  - Mark parallel groups
    - used in TMS320C6x DSPs, Intel IA-64



## Cydra-5: Memory Latency Register (MLR)

- Problem: Loads have variable latency
- Solution: Let software choose desired memory latency
- Compiler schedules code for maximum load-use distance
- Software sets MLR to latency that matches code schedule
- Hardware ensures that loads take exactly MLR cycles to return values into processor pipeline
  - Hardware buffers loads that return early
  - Hardware stalls processor if loads return late

### IA-64 Predicated Execution

Problem: Mispredicted branches limit ILP

Solution: Eliminate hard-to-predict branches with predicated execution

- Almost all IA-64 instructions can be executed conditionally under predicate
- Instruction becomes NOP if predicate register false



April 29, 2021

# Fully Bypassed Datapath



Where does predication fit in?

L19-31

## IA-64 Speculative Execution

Problem: Branches restrict compiler code motion

Solution: Speculative operations that don't cause exceptions



Can't move load above branch because might cause spurious exception



Particularly useful for scheduling long latency loads early

MIT 6.823 Spring 2021

## IA-64 Data Speculation

Problem: Possible memory hazards limit code scheduling

Solution: Instruction-based speculation with hardware

monitor to check for pointer hazards



Can't move load above store because store might be to same address



Requires associative hardware in address check table

## Clustered VLIW



Cache/Memory Banks

- Divide machine into clusters of local register files and local functional units
- Lower bandwidth/higher latency interconnect between clusters
- Software responsible for mapping computations to minimize communication overhead
- Common in commercial embedded processors, examples include TI C6x series DSPs, and HP Lx processor
- Exists in some superscalar processors, e.g., Alpha 21264

## Limits of Static Scheduling

- Unpredictable branches
- Unpredictable memory behavior (cache misses and dependencies)
- Code size explosion
- Compiler complexity

#### Question:

How applicable are the VLIW-inspired techniques to traditional RISC/CISC processor architectures?

# Thank you!

Next Lecture: Vector Processors