

# MULTI CYCLE PIPELINED RISC-V BASED PROCESSOR PROJECT PLAN

| Date           | Document Version | Remarks                                  | Drafted by |  |
|----------------|------------------|------------------------------------------|------------|--|
| 11- Sept- 2024 | Version 1.0      | Design Overview & Micro-<br>architecture | Team 7     |  |
| 12- Sept- 2024 | Version 2.0      | Design Features & Results                | Team 7     |  |

# **VERIFICATION DOCUMENT- ALU design**

| CHAPTER 1 – DESIGN OVERVIEW 4                                     |          |
|-------------------------------------------------------------------|----------|
| 1.1 Multi Cycle Pipelined RISC-V Based Processor                  | 4        |
| 1.2 Advantages                                                    | 4        |
| 1.3 Disadvantages                                                 | 4        |
| 1.4 Applications                                                  | 5        |
| 1.5 Project Overview:                                             | 5        |
| 1.5.1 Objectives:                                                 | 5        |
| 1.5.2 Instruction Format:                                         | 6        |
| 1.5.3 Pipelining Stages:                                          | 7        |
| 1.6 Design Features:                                              | 7        |
| 1.7 Design Limitations:                                           | .1       |
| 1.8 Data and Control Paths:                                       | .2       |
| CHAPTER 2 – ARCHITECTURE                                          | .4       |
| 2.1 Micro-architecture of Multi-Cycle Pipelined RISC-V Processor: | .4       |
| 2.2 RTL Schematic of Multi-Cycle Pipelined RISC-V Processor:      | .5       |
| CHAPTER 3 - TESTBENCH IMPLEMENTATION                              | .6       |
| 3.1 Program counter:                                              | .6       |
| 3.2 Instruction decoder:                                          | .6       |
| 3.3 Control unit:                                                 | . 7      |
| 3.4 Register set:                                                 | . 7      |
| 3.5 ALU:                                                          | 3.       |
| 3.6 Pipelined Register (Dff):1                                    | 3.       |
| 3.7 Mux_tb:1                                                      | <u>.</u> |
| <b>CHAPTER 4 - RESULTS</b>                                        | Ċ        |

# CHAPTER 1 – DESIGN OVERVIEW

# 1.1 Multi Cycle Pipelined RISC-V Based Processor

In the RISC-V processor, "RISC" refers to "Reduced Instruction Set Computer," indicating it operates with a streamlined set of instructions, while "V" signifies that it represents the fifth generation. RISC-V is an open-source hardware instruction set architecture (ISA) grounded in the RISC principles.

# 1.2 Advantages

- The design allows for a reduction in clock cycle duration, improving operational speed.
- The design can eliminate the need for multiple adders, reducing the overall complexity of arithmetic operations.
- Different instructions can be executed in varying numbers of cycles, allowing for more flexibility in instruction processing.

#### 1.3 Disadvantages

- The pipeline needs special handling to manage different instruction stages, making the design more complex and harder to maintain.
- When instructions depend on each other or need the same resources, the pipeline may have to stall, causing delays.
- Instructions that require data from previous instructions may have to wait, which slows down execution.
- Increased pipeline activity leads to higher power consumption.
- Debugging becomes more difficult due to the concurrent execution of instructions at different stages.

# **1.4 Applications**

- High-Performance Computing: Pipelined processors are used in supercomputers and data centers for parallel processing tasks, improving throughput and efficiency in intensive workloads.
- Embedded Systems: Many embedded devices use pipelined RISC-V processors for optimized performance in real-time applications such as automotive systems and IoT devices.
- Networking Devices: Routers, switches, and firewalls benefit from the increased throughput of pipelined processors, enabling fast data packet processing and routing.
- Mobile Devices: Smartphones and tablets use pipelined processors to balance power efficiency with performance, providing smooth multitasking and faster app execution.
- Gaming Consoles: Pipelined processors help improve the frame rate and graphical performance, enhancing user experience in gaming consoles by processing multiple instructions simultaneously.

#### **1.5 Project Overview:**

This project focuses on the design and implementation of a multi-cycle, 5-stage pipelined RISC-V-based processor. The goal is to develop a processor that supports core ALU operations and verify its functionality through simulation and testing. This project aims to ensure the correct implementation and performance of pipelined stages, adhering to the RISC-V architecture.

#### 1.5.1 Objectives:

- Develop a 5-stage pipelined processor with multi-cycle functionality.
- Implement and verify the following ALU operations:
  - ADD
  - SUB
  - AND
  - OR
  - XOR
  - NOP
- Implement the corresponding RISC-V instructions for each operation.
- Create a testbench to verify the processor's functionality.

#### 1.5.2 Instruction Format:

In the RISC-V instruction set architecture, the instruction format is divided into several fields that specify the operation to be performed and the registers involved. The provided instruction formats for operations like ADD, SUB, AND, OR, and XOR follow the R-type format of the RISC-V architecture. The R-type format is used for register-to-register arithmetic and logic operations.

| OPERATION | FUNC7<br>(7BITS) | RS2<br>(5BITS) | RS1<br>(5BITS) | FUNC3<br>(3BITS) | RD<br>(5BITS) | OPCODE<br>(7 BITS) |
|-----------|------------------|----------------|----------------|------------------|---------------|--------------------|
| ADD       | 0000000          | 01001          | 01000          | 000              | 00110         | 0110011            |
| SUB       | 0100000          | 10010          | 10011          | 000              | 00111         | 0110011            |
| AND       | 0000000          | 01111          | 01110          | 110              | 10001         | 0110011            |
| OR        | 0000000          | 01101          | 01100          | 111              | 11111         | 0110011            |
| XOR       | 0000000          | 10111          | 10110          | 100              | 11100         | 0110011            |

Table 1 Instructions Format

#### It consists of six fields:

# 1. Opcode (7 bits):

Specifies the type of operation. In the above table, considering ADD operation as an example, 0110011 is the opcode for arithmetic add operation.

# 2. RD (5 bits):

The destination register where the result of the operation will be stored. In the above ADD operation, RD = 00110 corresponds to register R6.

#### 3. Func3 (3 bits):

Determines the specific operation to be performed within the general type specified by the opcode. Each arithmetic/logical operation has a unique func3 value:

ADD: 000
SUB: 000
AND: 110
OR: 111
XOR: 100

#### 4. RS1 (5 bits):

The first source register for the operation. This is the register from which one operand is read. In the above ADD operation, RS1 = 01000 corresponds to register R8.

# 5. **RS2 (5 bits)**:

The second source register for the operation. This register provides the second operand. In the above ADD operation, RS2 = 01001 corresponds to register R9.

# 6. Func7 (7 bits):

Provides additional function information, particularly to distinguish between different operations that share the same func3 or func7. For example, ADD and SUB has same func3 values but have different func7 values as mentioned below.

ADD: 0000000SUB: 0100000

#### 1.5.3 Pipelining Stages:

Pipelining is a technique used in processor design to improve instruction throughput by dividing the execution process into multiple stages. Each stage performs a part of the instruction execution, and multiple instructions are processed simultaneously, with each instruction in a different stage of the pipeline.

The processor pipeline is divided into four stages:

- 1. Instruction Fetch (IF): Fetches instructions from memory.
- 2. Instruction Decode (ID): Decodes the instruction and reads the required registers.
- 3. Execute (EX): Executes the ALU operations.
- 4. Write Back (WB): Writes the results back to the register file.

#### 1.6 Design Features:

The design implements a multi-cycle, pipelined RISC-V based processor with different components that work together to fetch, decode, execute, and write back instructions. Below is a breakdown of each block used in the design:

#### 1. Top Module:

The top module connects all the other modules. It takes a clock signal (clk) and a reset signal (reset). Inside this module, all the essential components for the processor are instantiated,

including the program counter, program memory, instruction decoder, control unit, ALU, register set, and pipeline registers.

#### 2. Program Counter:

The program counter (PC) holds the address of the current instruction and increments it to point to the next instruction. On every clock cycle, the PC is incremented by 1. Upon reset, the PC is set to 0 which points to the first memory location of the program memory.



Fig 1.6.1 Program Counter Block

# 3. Program Memory:

Program memory stores the instructions to be executed by the processor. On reset, specific RISC-V instructions are loaded into memory. The processor fetches instructions from this memory during the instruction Fetch stage. The memory array contains instructions that implement various RISC-V operations, like Load Immediate values, Arithmetic operations (ADD, SUB, AND, OR, XOR), and NOP (No Operation).



Fig 1.6.2 Program Memory Block

#### 4. Instruction Decoder:

This decodes the instruction fetched from program memory into its respective fields such as opcode, function codes (func3 and func7), and register addresses (Destination, Source1 and Source2).

- opcode: Determines the type of instruction.
- func3, func7: Used for specifying the ALU operation.
- ReadAdd1, ReadAdd2: Source registers address.
- WriteAdd: Destination register address.



Fig 1.6.3 Instruction Decoder Block

#### 5. Control Unit:

The control unit generates control signals that direct the operation of the processor based on the opcode and function codes. The control unit uses the opcode and function codes (func3, func7) to determine:

- The ALU operation (alu opcode).
- Whether the instruction is a load operation (is\_load).
- Whether a register write should occur (write en).
- Immediate data extension (imm\_data\_ext), where the 20-bit immediate data is extended to 32 bits.

• In the control unit, during a NOP instruction, the write\_en signal is set to 0 to prevent any unintended writes to the registers.



Fig 1.6.4 Control Unit Block

# 6. ALU (Arithmetic Logic Unit):

The ALU performs arithmetic and logic operations such as addition, subtraction, AND, OR, and XOR. Based on the alu\_opcode provided by the control unit, the ALU takes two data inputs (Data1, Data2) from the register set and performs the required operation, generating the result (alu\_out).



Fig 1.6.5 ALU Block

# 7. Register Set:

The register set stores and provides access to general-purpose registers used by the processor during instruction execution. The register set contains 32 registers, each 32 bits wide. The ReadAdd1 and ReadAdd2 inputs are used to fetch the data from source registers, and WriteAdd is used to write data to the destination register when write\_en is high.



Fig 1.6.6 Register Set Block

# 8. Pipeline Registers:

Pipeline registers are used to store intermediate values and signals as the processor moves between stages. There are several pipeline registers to hold data and control signals between pipeline stages:

- Pipeline for ALU output (pipelined\_alu\_out).
- Pipeline for immediate data extension (pipelined imm data ext).
- Pipeline for the write en signal (pipelined write en).
- Pipeline for is load signal (pipelined is load).
- Pipeline for destination register address (pipelined WriteAdd).

# 1.7 Design Limitations:

- 1. Complex Design: Implementing pipelining adds complexity to the processor design. Special handling is required for managing different instruction stages, making the design harder to create and maintain.
- 2. Data Hazards: Instructions that need data from previous instructions may experience delays while waiting for data to be available, affecting overall performance.
- 3. Debugging Challenges: The simultaneous execution of multiple instructions in different pipeline stages can make it challenging to detect and fix bugs.
- 4. Limited Instruction Set: The processor implementation may be limited to a specific subset of the RISC-V instruction set, depending on how the ALU and other components are designed.
- 5. No Branch or Jump instruction Handling: There is no control flow change logic such as branch or jump instructions. This means the program counter (PC) increments sequentially without the

- ability to branch or jump based on conditions. The PC only supports normal incrementing behaviour.
- 6. No Hazard Handling: The design does not address potential hazards such as control hazards, data hazards, or structural hazards that can affect the pipeline's efficiency.
- 7. Absence of Halt Operation: The halt functionality is not implemented, and the program counter continues incrementing indefinitely. Implementing this would allow for a proper termination of program execution.
- **8.** Preloaded instructions: The instructions are hardcoded and cannot be dynamically modified during execution. This approach restricts flexibility, as any changes to the program logic or instructions require recompiling the design and reloading the updated memory contents

#### 1.8 Data and Control Paths:

Datapath {Red} & Control Path {Green} - During Load Operation



Fig 1.7.1 Load Path

# Datapath {Red} & Control Path {Green} -During ALU Operation



Fig 1.7.2 Control Path

# CHAPTER 2 - ARCHITECTURE

# **2.1 Micro-architecture** of Multi-Cycle Pipelined RISC-V Processor:



Fig 2. 1 Micro architecture of Multi-Cycle Pipelined RISC-V Processor

The multi-cycle pipelined RISC-V processor consists of key blocks such as the Program Counter (PC), Program Memory, Instruction Decoder, Control Unit, ALU (Arithmetic Logic Unit), Register set and pipeline registers. The processor is organized into five pipelined stages: Instruction Fetch, Instruction Decode, Execute, Memory Access, and Write Back.

# 2.2 RTL Schematic of Multi-Cycle Pipelined RISC-V Processor:



Fig 2. 2 RTL Schematic of Multi-Cycle Pipelined RISC-V Processor

This schematic shows the main modules of the processor, such as the Program Counter, ALU, Control Unit, and Register File. It illustrates how these modules are connected to each other, providing a clear overview of the processor's structure. This diagram helps in understanding the organization of the processor's design.

# **CHAPTER 3 - TESTBENCH IMPLEMENTATION**

# 3.1 Program counter:

The program counter testbench verifies that the PC correctly resets to address 0 and increments sequentially with each clock cycle. The simulation passes all checks, confirming proper PC operation with no errors by the end at 170 ns.

```
Program counter reset condition passed!
# PC increment operation : Pass!
# PC increment operation :Pass!
# Present Address = 2, Previous Adrress = 1
# PC increment operation :Pass!
  Present Address = 3, Previous Adrress = 2
# PC increment operation :Pass!
# Present Address = 4, Previous Adrress = 3
 PC increment operation : Pass!
# Present Address = 5, Previous Adrress = 4
# Program counter reset condition passed!
 reset = 1, address = 0
# PC increment operation : Pass!
# Present Address = 1, Previous Adrress = 0
 PC increment operation : Pass!
# PC increment operation :Pass!
# Present Address = 3, Previous Adrress = 2
# PC increment operation :Pass!
 Present Address = 4, Previous Adrress = 3
# PC increment operation :Pass!
# Present Address = 5, Previous Adrress = 4
# PC increment operation :Pass!
 Present Address = 6, Previous Adrress = 5
# Present Address = 7, Previous Adrress = 6
# PC increment operation :Pass!
  Present Address = 8, Previous Adrress = 7
  PC increment operation : Pass!
# PC Increment operation : Pass!
# Present Address = 9, Previous Address = 8
# PC increment operation : Pass!
# Present Address = 10, Previous Address = 9
# ** Note: $finish : program_counter_tb.sv(21)
# Time: 170 ns Iteration: 0 Instance: /pc_tb
# End time: 11:22:47 on Oct 03,2024, Elapsed time: 0:00:05
   Errors: 0, Warnings: 0
```

Figure 3.1.1 Program counter simulation result

# 3.2 Instruction decoder:

```
VSIM 1> run -all
# Test 1 PASSED.
# Test 2 PASSED.
# Test 3 PASSED.
# Test 4 PASSED.
# Test 5 PASSED.
# ** Note: $finish : instruction_decoder_tb.sv(90)
# Time: 50 ns Iteration: 0 Instance: /instruction_decoder_tb
# End time: 11:44:36 on Oct 03,2024, Elapsed time: 0:00:04
# Errors: 0, Warnings: 0
```

Figure 3.1.2 Instruction decoder simulation result

The instruction decoder testbench successfully executed all five tests within 50 nanoseconds, confirming accurate functionality. The rapid execution highlights the efficiency of the simulation environment and the optimized design of the decoder.

#### 3.3 Control unit:

```
VSIM 1> run -all

# Test 1 PASSED.

# Test 2 PASSED.

# Test 3 PASSED.

# Test 4 PASSED.

# Test 5 PASSED.

# ** Note: $finish : control_unit_tb.sv(76)

# Time: 50 ns Iteration: 0 Instance: /control_unit_tb

# End time: 12:15:15 on Oct 03,2024, Elapsed time: 0:00:05

# Errors: 0, Warnings: 0
```

Figure 31.3 Control unit simulation result

The control unit testbench executed all five tests successfully in just 50 nanoseconds, verifying its functionality with precision. The quick execution reflects the optimized simulation environment and efficient control unit design.

#### 3.4 Register set:

Figure 3.1.4 Register set siimulation result

The simulation results indicate that the register write and read operations are functioning correctly for all address values and data. The simulation successfully completes all tests without any errors or warnings.

#### 3.5 ALU:

Figure 3.1.5 ALU simulation result

The simulation results indicate that the ALU operations are functioning correctly for all opcode values and input data. The simulation successfully completes all tests without any errors or warnings.

#### 3.6 Pipelined Register (Dff):

```
Time: 0 | clk: 0 | in: 00000000 | pipelined out: xxxxxxxx
  Time: 5 | clk: 1 | in: 00000000 | pipelined_out: 00000000
# Time: 10 | clk: 0 | in: ffffffff | pipelined_out: 00000000
  Time: 15 | clk: 1 | in: ffffffff | pipelined out: ffffffff
 Time: 20 | clk: 0 | in: ffffffff | pipelined out: ffffffff
Time: 25 | clk: 1 | in: ffffffff | pipelined out: ffffffff
 Time: 30 | clk: 0 | in: 00000000 | pipelined_out: ffffffff
 Time: 35 | clk: 1 | in: 00000000 | pipelined out: 00000000
Time: 40 | clk: 0 | in: 00000000 | pipelined_out: 00000000
 Time: 45 | clk: 1 | in: 00000000 | pipelined_out: 00000000
 Time: 50
             | clk: 0 | in: a5a5a5a5 | pipelined_out: 00000000
 Time: 55 | clk: 1 | in: a5a5a5a5 | pipelined_out: a5a5a5a5
 Time: 60 | clk: 0 | in: a5a5a5a5 | pipelined out: a5a5a5a5
Time: 65 | clk: 1 | in: a5a5a5a5 | pipelined_out: a5a5a5a5
 Time: 70 | clk: 0 | in: 5a5a5a5a | pipelined out: a5a5a5a5
  Time: 75 | clk: 1 | in: 5a5a5a5a | pipelined_out: 5a5a5a5a
# Time: 80 | clk: 0 | in: 5a5a5a5a | pipelined out: 5a5a5a5a
 Time: 85 | clk: 1 | in: 5a5a5a5a | pipelined_out: 5a5a5a5a
 Time: 90
             | clk: 0 | in: 12345678 | pipelined out: 5a5a5a5a
  Time: 95 | clk: 1 | in: 12345678 | pipelined out: 12345678
 Time: 100 | clk: 0 | in: 12345678 | pipelined_out: 12345678
Time: 105 | clk: 1 | in: 12345678 | pipelined_out: 12345678
   ** Note: $finish
                         : dff_tb.sv(60)
     Time: 110 ns Iteration: 0 Instance: /dff tb
  End time: 11:01:32 on Oct 03,2024, Elapsed time: 0:00:05
  Errors: 0, Warnings: 0
```

Figure 3.1.6 Pipelined register simulation result

The D-type flip-flop (DFF) testbench confirms correct functionality, with all tests passing without errors or warnings. The DFF captures and holds the input on the rising clock edge, behaving as expected for a DFF.

# 3.7 Mux tb:

```
# Monitor: Time = 0 | in0 = 62c30bc5 | in1 = 05fecf0b | sel = 0 | out = 62c30bc5  
# PASS: Time = 5 | in0 = 62c30bc5 | in1 = 05fecf0b | sel = 0 | out = 62c30bc5  
# Monitor: Time = 15 | in0 = db196cb6 | in1 = 28f70351 | sel = 0 | out = db196cb6  
# PASS: Time = 20 | in0 = db196cb6 | in1 = 28f70351 | sel = 0 | out = db196cb6  
# Monitor: Time = 30 | in0 = d043c0a0 | in1 = f6bb32ed | sel = 1 | out = f6bb32ed  
# Monitor: Time = 35 | in0 = d043c0a0 | in1 = f6bb32ed | sel = 1 | out = f6bb32ed  
# Monitor: Time = 45 | in0 = f60be8ec | in1 = 8950b012 | sel = 1 | out = 8950b012 | # ASS: Time = 50 | in0 = f60be8ec | in1 = 8950b012 | sel = 1 | out = 8950b012 | # ASS: Time = 60 | in0 = baclda75 | in1 = 3e04c37c | sel = 1 | out = 3e04c37c | # Monitor: Time = 65 | in0 = baclda75 | in1 = 3e04c37c | sel = 1 | out = 3e04c37c | # Monitor: Time = 75 | in0 = a96ebc52 | in1 = 0781130f | sel = 0 | out = a96ebc52 | # ASS: Time = 80 | in0 = a96ebc52 | in1 = 0781130f | sel = 0 | out = a96ebc52 | # Monitor: Time = 90 | in0 = c06c0480 | in1 = 9499b629 | sel = 1 | out = 9499b629 | # ASS: Time = 95 | in0 = c06c0480 | in1 = 9499b629 | sel = 1 | out = 9499b629 | # ASS: Time = 100 | in0 = c06c0480 | in1 = a19e0043 | sel = 0 | out = 079c970f | # Monitor: Time = 105 | in0 = 5fa097bf | in1 = a19e0043 | sel = 0 | out = 51210ba2 | # ASS: Time = 120 | in0 = 5fa097bf | in1 = 51210ba2 | sel = 1 | out = 51210ba2 | # ASS: Time = 125 | in0 = 5fa097bf | in1 = 51210ba2 | sel = 1 | out = 53de47a7 | # ASS: Time = 140 | in0 = 53de47a7 | in1 = b7d5bc6f | sel = 0 | out = 53de47a7 | # ASS: Time = 140 | in0 = 53de47a7 | in1 = b7d5bc6f | sel = 0 | out = 53de47a7 | # ASS: Time = 150 | sel = 1 | sel = 0 | out = 53de47a7 | # ASS: Time = 150 | sel = 0 | out = 53de47a7 | # ASS: Time = 150 | sel = 0 | out = 53de47a7 | # ASS: Time = 150 | sel = 0 | out = 53de47a7 | # ASS: Time = 150 | sel = 0 | out = 53de47a7 | # ASS: Time = 150 | sel = 0 | out = 53de47a7 | # ASS: Time = 150 | sel = 0 | out = 53de47a7 | # ASS: Time = 150 | sel = 0 | out = 53de47a7 | # ASS: Time = 150
```

Figure 3.1.7 Mux tb simulation result

The 2x1 multiplexer testbench confirms correct functionality, with all tests passing without errors or warnings. The multiplexer outputs the correct input based on the select signal, performing as expected.

# **CHAPTER 4 - RESULTS**

The implementation of the multi-cycle pipelined RISC-V processor successfully achieved all design objectives and features outlined in the project. The processor architecture, including key modules such as the Program Counter, ALU, Control Unit, and Register set, was developed and integrated as specified.

The design effectively supports all specified ALU operations: ADD, SUB, AND, OR, and XOR. Each operation was implemented and tested, confirming that the processor executes these functions correctly. The functionality of the processor was thoroughly validated using preloaded instructions in the program memory. This validation process ensured that all modules operate as intended, with the processor performing all tasks in accordance with the design requirements.

# 4.1 Load Operation Result:

Preloaded instructions for load operation:

memory[0]<=32'b0000000\_00000\_00000\_100\_01000\_0000011;//load immediate R8=4 memory[1]<=32'b0000000\_00000\_00000\_011\_01001\_0000011;//load immediate R9=3 memory[2]<=32'b0000000\_00000\_00000\_0000\_00000\_000000;//nop



Fig 4. 1 Waveforms of Load Operations

#### 4.2 ALU Operations Result:

Preloaded instructions for load operation:

memory[3]<=32'b0000000\_01001\_01000\_000\_00001\_0110011;//add R1=R8+R9 memory[4]<=32'b0100000\_01001\_01000\_000\_0010\_0110011;//subtract R2=R8-R9 memory[5]<=32'b0000000\_01001\_01000\_110\_00011\_0110011;//and R3=R8&R9 memory[6]<=32'b0000000\_01001\_01000\_111\_00100\_0110011;//or R4=R8|R9

memory[7]<=32'b0000000\_01001\_01000\_100\_00101\_0110011;//xor R5=R8^R9 memory[8]<=32'b0000000\_00010\_00011\_000\_00001\_0110011;//add R1=R2+R3 memory[9]<=32'b0000000\_00000\_00000\_00000\_000000;//nop memory[10]<=32'b0000000\_00001\_00100\_100\_00110\_0110011;//xor R6=R1^R4 memory[11]<=32'b0000000\_00000\_00000\_00000\_00000\_000000;//nop



Fig 4. 2 Waveforms of ALU Operations