# Lab 2: Pipelined CPU Using Verilog

### Announcement

- Individual Lab
- Lab Deadline: 5/16 (Tue.) 23:59

- Demo:
  - Time slot: TBD
  - Show the execution of your program to TA and answer a few questions

### Instructions

- Required Instruction Set
  - and, xor, sll, add, sub, mul, addi, srai
  - lw
  - sw
  - beq

### Hardware Specification

- Register File: 32 registers, 32 bits each (written at the rising edge of the clock)
- Instruction Memory: 1 KB
- Data Memory: 4 KB
- 5-stage pipeline (IF, ID, EX, MEM, WB)
- Hazard handling
  - Data hazard
    - Implement the forwarding unit to reduce or avoid the stall cycles
    - An instruction that depends on the result of the immediately preceding lw must stall for 1 cycle
    - No need to forward to ID stage
  - Control hazard
    - The instruction following a beq instruction may need to stall for 1 cycle
    - Pipeline Flush

### Control Signals

| Instruction | ALUOp | Operation        | Funct7 field | Funct3 field | Desired ALU action | ALU control input |
|-------------|-------|------------------|--------------|--------------|--------------------|-------------------|
| ld          | 00    | load doubleword  | XXXXXXX      | XXX          | add                | 0010              |
| sd          | 00    | store doubleword | XXXXXXX      | XXX          | add                | 0010              |
| beq         | 01    | branch if equal  | XXXXXXX      | XXX          | subtract           | 0110              |
| R-type      | 10    | add              | 0000000      | 000          | add                | 0010              |
| R-type      | 10    | sub              | 0100000      | 000          | subtract           | 0110              |
| R-type      | 10    | and              | 0000000      | 111          | AND                | 0000              |
| R-type      | 10    | or               | 0000000      | 110          | OR                 | 0001              |

**FIGURE 4.45** A copy of Figure 4.12. This figure shows how the ALU control bits are set depending on the ALUOp control bits and the different opcodes for the R-type instruction.
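
The table above can be read as a small decode function. The following is a minimal Python sketch of that mapping, meant only as a behavioral reference to compare against a Verilog ALU control module. The function name `alu_control` is hypothetical. It covers only the rows listed in the table; encodings for the other required instructions (xor, sll, mul, addi, srai) are not given there and are left to your own design.

```python
def alu_control(alu_op: int, funct7: int = 0, funct3: int = 0) -> int:
    """Return the 4-bit ALU control input for the rows of the table above."""
    if alu_op == 0b00:
        # ld / sd: the ALU adds base and offset; funct fields are don't-care
        return 0b0010
    if alu_op == 0b01:
        # beq: the ALU subtracts so that equality can be tested
        return 0b0110
    if alu_op == 0b10:
        # R-type: the operation is selected by funct7 and funct3
        r_type = {
            (0b0000000, 0b000): 0b0010,  # add
            (0b0100000, 0b000): 0b0110,  # sub
            (0b0000000, 0b111): 0b0000,  # and
            (0b0000000, 0b110): 0b0001,  # or
        }
        key = (funct7, funct3)
        if key in r_type:
            return r_type[key]
    # Anything else is outside the table and must be defined by the designer
    raise ValueError("combination not covered by the table")
```

For example, `alu_control(0b10, 0b0100000, 0b000)` selects subtraction for sub, while `alu_control(0b00)` selects addition for address calculation regardless of the funct fields.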

## Control Signals

| Instruction | ALUOp (EX) | ALUSrc (EX) | Branch (MEM) | MemRead (MEM) | MemWrite (MEM) | RegWrite (WB) | MemtoReg (WB) |
|-------------|------------|-------------|--------------|---------------|----------------|---------------|---------------|
| R-format    | 10         | 0           | 0            | 0             | 0              | 1             | 0             |
| ld          | 00         | 1           | 0            | 1             | 0              | 1             | 1             |
| sd          | 00         | 1           | 0            | 0             | 1              | 0             | X             |
| beq         | 01         | 0           | 1            | 0             | 0              | 0             | X             |

EX = execution/address calculation stage control lines, MEM = memory access stage control lines, WB = write-back stage control lines.

**FIGURE 4.47** The values of the control lines are the same as in Figure 4.18, but they have been shuffled into three groups corresponding to the last three pipeline stages.
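
One way to picture the grouping is that each pipeline register only carries the control groups that later stages still need. The Python sketch below encodes the table and that idea. The names `CONTROL`, `CARRIED`, and `carried_signals` are hypothetical and only illustrate the bookkeeping; they are not part of the provided lab files.

```python
X = None  # don't-care entry from the table

# Control values from the table, grouped by the stage that consumes them
CONTROL = {
    "R-format": {"EX": {"ALUOp": 0b10, "ALUSrc": 0},
                 "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "ld":       {"EX": {"ALUOp": 0b00, "ALUSrc": 1},
                 "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sd":       {"EX": {"ALUOp": 0b00, "ALUSrc": 1},
                 "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
    "beq":      {"EX": {"ALUOp": 0b01, "ALUSrc": 0},
                 "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
}

# Groups that each pipeline register still has to pass along
CARRIED = {
    "ID/EX": ("EX", "MEM", "WB"),
    "EX/MEM": ("MEM", "WB"),
    "MEM/WB": ("WB",),
}


def carried_signals(pipeline_reg: str, instr: str) -> dict:
    """Return the control groups stored in `pipeline_reg` for `instr`."""
    return {group: CONTROL[instr][group] for group in CARRIED[pipeline_reg]}
```

For instance, `carried_signals("MEM/WB", "ld")` contains only the WB group, because the EX and MEM signals have already been used by the time the load reaches write-back.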

### Machine Code

| funct7       | rs2      | rs1 | funct3 | rd              | opcode  | function |
|--------------|----------|-----|--------|-----------------|---------|----------|
| 0000000      | rs2      | rs1 | 111    | rd              | 0110011 | and      |
| 0000000      | rs2      | rs1 | 100    | rd              | 0110011 | xor      |
| 0000000      | rs2      | rs1 | 001    | rd              | 0110011 | sll      |
| 0000000      | rs2      | rs1 | 000    | rd              | 0110011 | add      |
| 0100000      | rs2      | rs1 | 000    | rd              | 0110011 | sub      |
| 0000001      | rs2      | rs1 | 000    | rd              | 0110011 | mul      |
| imm[11:0]    |          | rs1 | 000    | rd              | 0010011 | addi     |
| 0100000      | imm[4:0] | rs1 | 101    | rd              | 0010011 | srai     |
| imm[11:0]    |          | rs1 | 010    | rd              | 0000011 | lw       |
| imm[11:5]    | rs2      | rs1 | 010    | imm[4:0]        | 0100011 | sw       |
| imm[12,10:5] | rs2      | rs1 | 000    | imm[4:1,11]     | 1100011 | beq      |
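
To check hand-assembled test instructions against the table, the R-type rows can be packed into 32-bit words as follows. This is a small Python sketch under the field layout shown above; `R_TYPE` and `encode_r` are hypothetical helper names, not part of the lab files.

```python
# (funct7, funct3) pairs for the R-type rows of the Machine Code table
R_TYPE = {
    "and": (0b0000000, 0b111),
    "xor": (0b0000000, 0b100),
    "sll": (0b0000000, 0b001),
    "add": (0b0000000, 0b000),
    "sub": (0b0100000, 0b000),
    "mul": (0b0000001, 0b000),
}
OPCODE_R = 0b0110011


def encode_r(name: str, rd: int, rs1: int, rs2: int) -> int:
    """Pack an R-type instruction as funct7 | rs2 | rs1 | funct3 | rd | opcode."""
    funct7, funct3 = R_TYPE[name]
    for reg in (rd, rs1, rs2):
        if not 0 <= reg < 32:
            raise ValueError("register index must be in 0..31")
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | OPCODE_R)


# add x1, x2, x3 as a 32-character binary string
word = format(encode_r("add", 1, 2, 3), "032b")
```

Here `encode_r("add", 1, 2, 3)` yields `0x003100B3`, and swapping in `"sub"` changes only the funct7 field.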

### Branch Address

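`bne x10, x11, 2000` // if x10 != x11, go to location $2000_{ten} = 0111\ 1101\ 0000$

| imm[12] | imm[10:5] | rs2   | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
|---------|-----------|-------|-------|--------|----------|---------|---------|
| 0       | 111110    | 01011 | 01010 | 001    | 1000     | 0       | 1100111 |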

(The opcode in textbook conflicts with RISC-V spec. Please follow this or

next PC = current PC + Branch offset
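
The scattered immediate bits have to be reassembled before the offset can be added to the PC. The Python sketch below shows one way to do that for the B-type layout, using the bne example (offset 2000) as a check. It is a behavioral model only; `b_type_imm` and `next_pc` are hypothetical names. The example word uses opcode 1100011, as listed for beq in the Machine Code table, rather than the 1100111 printed in the worked encoding.

```python
def b_type_imm(instr: int) -> int:
    """Rebuild the signed branch offset from a 32-bit B-type instruction."""
    imm12 = (instr >> 31) & 0x1     # instruction bit 31
    imm10_5 = (instr >> 25) & 0x3F  # instruction bits 30:25
    imm4_1 = (instr >> 8) & 0xF     # instruction bits 11:8
    imm11 = (instr >> 7) & 0x1      # instruction bit 7
    # Bit 0 of the offset is always 0, so the fields start at bit 1
    imm = (imm12 << 12) | (imm11 << 11) | (imm10_5 << 5) | (imm4_1 << 1)
    if imm12:
        imm -= 1 << 13              # sign-extend the 13-bit value
    return imm


def next_pc(pc: int, instr: int, taken: bool) -> int:
    """next PC = current PC + branch offset if taken, else PC + 4."""
    step = b_type_imm(instr) if taken else 4
    return (pc + step) & 0xFFFFFFFF


# bne x10, x11, 2000 with the fields from the example, opcode 1100011
BNE_2000 = int("0" "111110" "01011" "01010" "001" "1000" "0" "1100011", 2)
```

With this word, `b_type_imm(BNE_2000)` gives 2000, so a taken branch at PC 100 goes to 2100.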

## Adding Data Memory



### Pipeline Register



**FIGURE 4.33** The pipelined version of the datapath in Figure 4.31. The pipeline registers, in color, separate each pipeline stage. They are labeled by the stages that they separate; for example, the first is labeled *IF/ID* because it separates the instruction fetch and instruction decode stages. The registers must be wide enough to store all the data corresponding to the lines that go through them. For example, the IF/ID register must be 96 bits wide, because it must hold both the 32-bit instruction fetched from memory and the incremented 64-bit PC address. We will expand these registers over the course of this chapter, but for now the other three pipeline registers contain 256, 193, and 128 bits, respectively.

## Adding Pipeline Register



## Data Hazard and Forwarding

**FIGURE 4.56** A pipelined sequence of instructions. Since the dependence between the load and the following instruction (and) goes backward in time, this hazard cannot be solved by forwarding. Hence, this combination must result in a stall by the hazard detection unit.

## Forwarding Unit



**FIGURE 4.55** A close-up of the datapath in Figure 4.52 shows a 2:1 multiplexor, which has been added to select the signed immediate as an ALU input.

| Mux control   | Source | Explanation                                                                    |
|---------------|--------|--------------------------------------------------------------------------------|
| ForwardA = 00 | ID/EX  | The first ALU operand comes from the register file.                            |
| ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result.                  |
| ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result.  |
| ForwardB = 00 | ID/EX  | The second ALU operand comes from the register file.                           |
| ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result.                 |
| ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result. |

**FIGURE 4.53** The control values for the forwarding multiplexors in Figure 4.52. The signed immediate that is another input to the ALU is described in the *Elaboration* at the end of this section.

```
1. EX hazard:
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd != 0)
    and (EX/MEM.RegisterRd == ID/EX.RegisterRs1)) ForwardA = 10
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd != 0)
    and (EX/MEM.RegisterRd == ID/EX.RegisterRs2)) ForwardB = 10
2. MEM hazard:
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd != 0)
    and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == ID/EX.RegisterRs1))
    and (MEM/WB.RegisterRd == ID/EX.RegisterRs1)) ForwardA = 01
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd != 0)
    and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == ID/EX.RegisterRs2))
    and (MEM/WB.RegisterRd == ID/EX.RegisterRs2)) ForwardB = 01
```
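
The conditions above can be collapsed into one priority check per ALU operand: the EX/MEM match is tested first, which makes the explicit `not(...)` term of the MEM hazard unnecessary. The following Python sketch is a behavioral reference under that reading, not a Verilog implementation. `forward_select` is a hypothetical name; call it once with Rs1 for ForwardA and once with Rs2 for ForwardB.

```python
def forward_select(id_ex_rs: int,
                   ex_mem_reg_write: bool, ex_mem_rd: int,
                   mem_wb_reg_write: bool, mem_wb_rd: int) -> int:
    """Return the 2-bit forwarding mux control for one ALU source register."""
    # EX hazard: the most recent result wins, so it is checked first
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return 0b10
    # MEM hazard: only reached when there is no EX hazard on this operand
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return 0b01
    # No hazard: use the value read from the register file
    return 0b00
```

When both EX/MEM and MEM/WB write the same source register, the function returns 10, so the newer value from EX/MEM is forwarded. Writes to x0 are never forwarded because x0 always reads as zero.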



### Hazard Detection and Stall



**FIGURE 4.57** The way stalls are really inserted into the pipeline. A bubble is inserted beginning in clock cycle 4, by changing the and instruction to a nop. Note that the and instruction is really fetched and decoded in clock cycles 2 and 3, but its EX stage is delayed until clock cycle 5 (versus the unstalled position in clock cycle 4). Likewise, the or instruction is fetched in clock cycle 3, but its ID stage is delayed until clock cycle 5 (versus the unstalled clock cycle 4 position). After insertion of the bubble, all the dependences go forward in time and no further hazards occur.
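
The load-use case is commonly detected in the ID stage by comparing the destination of the load in EX with the source registers of the instruction being decoded. The Python sketch below models that check. The function name and the output names (`pc_write`, `if_id_write`, `insert_bubble`) are hypothetical; your Verilog module may use different signals.

```python
def hazard_detection(id_ex_mem_read: bool, id_ex_rd: int,
                     if_id_rs1: int, if_id_rs2: int) -> dict:
    """Decide whether the instruction in ID must stall behind a load in EX."""
    # Stall when the instruction in EX is a load whose destination register
    # is one of the source registers of the instruction in ID
    stall = bool(id_ex_mem_read) and id_ex_rd in (if_id_rs1, if_id_rs2)
    return {
        "stall": stall,
        "pc_write": not stall,      # hold the PC during a stall
        "if_id_write": not stall,   # hold the IF/ID register during a stall
        "insert_bubble": stall,     # zero the control signals entering ID/EX
    }
```

For a load into x2 followed directly by `and x4, x2, x5`, `hazard_detection(True, 2, 2, 5)` reports a stall; one cycle later the load has moved to MEM and the MEM/WB forwarding path can supply the value.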

### Stall & Flush

- Counted in testbench.v
- Can be changed depending on your own design

## Handling Branch and



### testbench.v

- Initialize registers in all modules
- Load instruction.txt into instruction memory
- Create clock signal
- Dump the register file and data memory in each cycle
- Count the number of flushes
- Do not modify any value initialization or message printing
- Print result to output.txt

### Execution results

```
cycle = 0, Start = 1, Stall = 0, Flush = 0
PC =
Registers
x0 = 0, x8  = 0, x16 = 0, x24 = -24
x1 = 0, x9  = 0, x17 = 0, x25 = -25
x2 = 0, x10 = 0, x18 = 0, x26 = -26
x3 = 0, x11 = 0, x19 = 0, x27 = -27
x4 = 0, x12 = 0, x20 = 0, x28 = 56
x5 = 0, x13 = 0, x21 = 0, x29 = 58
x6 = 0, x14 = 0, x22 = 0, x30 = 60
x7 = 0, x15 = 0, x23 = 0, x31 = 62
Data Memory: 0x00 =
Data Memory: 0x04 =
Data Memory: 0x08 = 10
Data Memory: 0x0C = 18
Data Memory: 0x10 = 29
Data Memory: 0x14 =
Data Memory: 0x18 =
Data Memory: 0x1C =
```

### Final Datapath



### Grading Policy

- (80%) Programming
  - You will get 0 points if your code cannot be compiled
  - Graded at the demo. You must answer several questions about your implementation. You may get 0 points on this part if you cannot answer the questions clearly (this will be regarded as plagiarism)
- (20%) Report
  - Implementation of each module
  - Difficulties encountered and solutions in this lab
  - Development environment
- Late policy: 10 points per day

### Deadline

- 5/16 (Tue.) 23:59
- Late policy: 10 points per day

### Submission Rules

- studentID\_lab2 (dir)
  - studentID\_lab2/codes/\*.v
  - studentID\_lab2/studentID\_lab2\_report.pdf
- studentID should be ASCII-printable characters

#### MUST REMOVE

- Data\_Memory.v
- Instruction\_Memory.v
- Registers.v
- PC.v
- testdata/\*