# **BYOC** course

**Assignment #6** 

**HW6 MIPS CPU** 

# 1. The HW6 MIPS CPU

In this assignment we add jal, jr, lui, ori instructions to the MIPS CPU we designed in HW5. Thus, we will have a CPU supporting Rtype (add, sub, and, or, xor, slt), addi, lui, ori, beq, bne, lw ,sw, j, jal and jr instructions. Besides adding these instructions we would like to add a forwarding mechanism to enhance the CPU performance.

It is highly recommended to watch the lecture in: <a href="http://youtu.be/Yu6FFVhI4D4">http://youtu.be/Yu6FFVhI4D4</a> and the first 11 minutes of: <a href="http://youtu.be/-fylybz8p">http://youtu.be/-fylybz8p</a> M

Below we remind you of the HW5 MIPS CPU we designed in HW5. It is almost the same as the HW6 MIPS of this assignment



Fig. 1 - The HW5 MIPS CPU

## a. HW6 outline

This assignment has 3 parts. The first is to add the new instructions and it is described in section **b** below. It is recommended to fill up the table in **Appendix A** before starting to add the new instructions. We will not implement that design, just run the simulation. After a successful simulation of this part you should add the forwarding mechanism. This is done in two parts, data forwarding (which is described in section **c**) and branch forwarding (described in section **d**). Thus this HW has 3 parts: i) Add the new instructions ii) Add data forwarding iii) Add branch forwarding.

## b. PART I - Adding the new instructions

- i. LUI The simplest way to add the lui insruction is to change the sign extension circuit so that when we have a lui instruction, it shifts the imm left for 16 times. We should make sure that the rest of the circuit will behave in a similar manner to addi instruction. For example, the ALU will add it's a input value to the sext\_imm\_reg value that appears in its B input. Thus, we should make sure that the A\_reg value is 0. This can be done in several ways. The simplest way is to make sure that the Assembler always translate lui instruction so that Rs=0. Another way is to force Rs to be 0 (b"00000") when we decode a lui instruction.
- ii. ORI This instruction is almost the same as addi one. There are two differences. The first is that in ori instruction we should prevent sign extension of the imm. This is easily done by an additional change in the sign extension circuit. The other difference is forcing the ALU to perform a OR operation instead of an ADD one. The simplest way to do that is to use the 4<sup>th</sup> combination of the ALUOP vector signal. While b"00" means ADD, b"01" means SUB and b"10" means use the FUNCTION field to determine the ALU operation, we will add the combination b"11" and will change the MIPS\_ALU so that ALUOP="b11" will result with an OR operation.
  - Thus for supporting ORI, we should fix the sext\_imm circuit, force ALUOP control signal to be b"11" and change the MIPS\_ALU to support this combination.
  - [The expected behavior of the ALUOP signal is: "10" in Rtype instructions, "01" in beq & bne, "11" in ori, "00" in all other instructions]
- iii. JR Supporting this instruction is pretty easy. We should direct the Rs content value (GPR\_rd\_data1) back into the Fetch\_Unit so that the jr\_adrs signal inside the Fetch Unit will get the GPR\_rd\_data1 instead of the constant x"00400004" we had so far. This means we need to add a input signal to the Fetch\_Unit entity. This new 32 bit input signal is called jr\_adrs\_in.
- iv. JAL Supporting this instruction is a little more involved. The jal should behave exactly as the j instruction in the Fetch Unit so that when a j instruction or jal instruction appear in the IR\_reg, the PC\_source will be "11" and the PC\_reg will get the "jump\_adrs" signal at its input. This makes sure that we jump properly in both cases. In jal we should also write the PC\_plus\_4 of the instruction to \$ra, i.e., to register \$31 in the GPR File. How do we do that? We "propagate" the PC\_plus\_4 value till the WB phase and there, add it as an additional input to the MemToReg mux. We need to output the PC\_plus\_4\_plD from the Fetch\_Unit (this means a change in the i/o pins of the Fetch\_Unit). This signal needs to "propagate" till it becomes be PC\_plus\_4\_reg\_pWB. We need to make sure we issue RegWrite='1' in jal and we should force "Rd" to be 31. Since the rule for RegDst mux is

that RegDst='1' only in Rtype instructions, it means that in jal instruction it is '0' and the RegDst mux choose Rd\_pMEM to be Rt\_pEX, it means that in jal instruction we should force Rt to be 31 (b"11111").

To summarize, we need support jal in the Fetch Unit the same as we do for j instruction, we need to output PC\_plus\_4\_plD from the Fetch\_Unit and delay it till the WB phase, we need to issue a RegWrite='1', we need to expand the MemToReg mux to write the PC\_plus\_4 in the WB phase of jal instruction and we need to force Rt to be 31 in jal instruction.

See more in section **e** below.

## c. PART II - Data forwarding

In a pipelined implementation of a CPU we encounter an inherent latency problem. The result of an add instruction (we will use add instruction as an example, but the analysis is applicable also for all instructions writing back into the GPR File except lw and jal, i.e., Rtype, addi, lui and ori instructions) is available for a later instruction that uses it only after the WB phase of the add instruction is completed. The instruction using that result "reads" it from the GPR File in its' ID phase. Thus, we need to wait 3 time slots before "using" the add result in a new instruction. This is depicted in Fig. 2 below. The updated value of \$3 is written into the GPR File in the rising edge of the clock ending the WB phase of the "add \$3,\$5,\$8" instruction (marked by the red line).

Thus, the ID phase of the "add \$y,\$3,\$x" instruction which uses that value, can occur to the right of the red line. We see that the inherent 5 CKs latency of the pipelined implementation results with "wasted" time slots.



Fig. 2 – The pipelined MIPS latency

We can use these time slots for other instructions that do not write to \$3 or \$x (those who are used by the "add \$y,\$3,\$x" instruction). A smart C compiler can therefore improve the situation. However, it is easy to overcome this problem and improve the situation dramatically by "Data Forwarding".

Data Forwarding means using the updated value to be written into the GPR File even before it is written into the GPR File. This is possible since that data already exists inside the pipeline – in most cases. We read data from the GPR File in the ID phase of an instruction in order to use it in the EX phase of the instruction. This means that the forwarding should occur in the EX phase or before it, in the ID phase of the instruction we want to forward the data to.

We have 3 cases of Data Forwarding.

- 1. Case I: Forward data from previous instruction in the EX phase of the current instruction if the Rs or Rt of the current instruction is written into by the previous instruction.
  - I.e., if RegWrite\_pMEM='1' and Rd\_pMEM=Rs\_pEX, we should use ALUout\_reg value instead of A reg value.
  - Similarly, if RegWrite\_pMEM='1' and Rd\_pMEM=Rt\_pEX, we should use ALUout\_reg value instead of B\_reg value.
  - This is described by the arrow from the MEM phase of the 1<sup>st</sup> instruction (the top one) in Fig. 3, to the EX phase of the 2<sup>nd</sup> instruction.
- 2. Case II: Forward data from the instruction that was done 2 clocks ago in the EX phase of the current instruction if the Rs or Rt of the current instruction is written into by the instruction from 2 clocks ago.
  - I.e., .if RegWrite\_pWB='1' and Rd\_pWB=Rs\_pEX, we should use MemToReg mux output value instead of A reg value.
  - Similarly, if RegWrite\_pWB='1' and Rd\_pWB=Rt\_pEX, we should use MemToReg mux output value instead of B\_reg value.
  - This is described by the arrow from the WB phase of the 1<sup>st</sup> instruction in Fig. 3, to the EX phase of the 3<sup>rd</sup> instruction.
- 3. Case III: Forward data from the instruction that was done 3 clocks ago. This is done in the ID phase of the current instruction (through a "transparent GPR") if the Rs or Rt of the current instruction is written into by the instruction from 3 clocks ago.
  - This means that inside the GPR, if rd\_reg1=wr\_reg and Reg\_Write='1', then we should bypass the GPR file and output the wr\_data instead of the "regular" rd\_data1. Similarly to rd reg2 and rd data2.

This is described by the arrow from the MEM phase of the 1<sup>st</sup> instruction in Fig. 3, to the ID phase of the 4<sup>th</sup> instruction.



Fig. 3 – Data Forwarding timing diagram (from the 1<sup>st</sup> instruction to future instructions)



Fig. 3B – The 3 Data Forwarding options to an instruction (to the 4<sup>th</sup> instruction from previous instructions)

Fig. 3B shows we see that the 1<sup>st</sup> instruction writes to register \$3, the 2<sup>nd</sup> instruction writes to register \$2 and the 3<sup>rd</sup> instruction writes to register \$3. We see the 3 forwarding mechanisms working to supply updated data to the 4<sup>th</sup> instruction. In the ID phase of the 4<sup>th</sup> instruction we read the result of the 1<sup>st</sup> instruction via the "transparent GPR" mechanism supporting forwarding from 3 instructions ago. In the EX phase of the 4<sup>th</sup> instructions we see forwarding of Rs from the previous instruction (in red) and from 2 instructions ago (in magenta).

In Fig. 4 and Fig. 5 below we see the MIPS data path without and with Data Forwarding. The changes are drawn in red. The connections shown in the MIPS data path in Fig. 5 support forwarding from previous instruction (case I) and from instruction before the previous one (case II). The forwarding through "transparent" GPR File (case III) is not shown in Fig. 5. It is described in Fig. 6 further below with the changes inside the GPR File also drawn in red.



Fig. 4 – MIPS data path (part) with no forwarding



Fig. 5 – MIPS Data Path with Data Forwarding



Fig. 6 – MIPS Data Path with Data Forwarding

The only two signals we added to the HW6 MIPS CPU to support Data Forwarding are A\_reg\_wt\_fwd and B\_reg\_wt\_fwd that are the outputs of two new muxes at the A and B inputs of the ALU. Actually we also need to keep the value of Rs till the EX phase so that we can use to check whether forwarding data to the A input of the ALU is required. Thus, we also added the RS pEX register.

#### Important notes:

- 1) No forwarding should be done if we read from register \$0 [see how it is handled in Fig. 6].
- 2) We need to make sure that we handle the situation properly also in cases where we have 2 or 3 previous instructions writing to the same register we are reading from in the current instruction.
- 3) We need to make sure that we use the correct data also in sw instruction.
- 4) This forwarding does not apply to lw instruction (if it is the previous instruction) since a lw instruction has valid write data only at the WB phase after the MEM phase, while other instructions such as Rtype, addi, ori & lui that write to the GPR File have their valid data after the EX phase from MEM phase and on.
- 5) Similarly, jal data path is different than the regular instructions and data is available for forwarding only at the WB of the jal instruction.

After adding the data forwarding muxes, we should use A\_reg\_wt\_fwd instead of A\_reg wherever the A\_reg data was used and similarly, use B\_reg\_wt\_fwd instead of B\_reg wherever the B\_reg data was used.

See more in section e below

## d. PART III - Branch forwarding

In a similar manner, the pipeline inherent latency also creates problems when we perform a branch instruction. In order to decide whether to branch or not, we compared the data values read from both outputs of the GPR file. This means we have to wait until the data inside the GPR file is updated before we can branch. As in the data case, we would like to build a forwarding mechanism allowing us to compare the right values as soon they are available.



Fig. 7 – Branch Forwarding timing diagram

In Fig. 7 we see that an add instruction writes to register \$3. If we want to compare \$3 in our branch instruction, we must wait at least 1 time slot. This is so since the result of the add instruction is only available after the EX phase, i.e., from MEM phase and on. Since the branch comparison is done it its' ID phase, the branch instruction' ID phase cannot be performed before the MEM phase of the add instruction. This is not enough. We need a Branch Forwarding mechanism that will bring the updated MEM phase value to the Rs\_equals\_Rt comparator. Usually this comparator compares GPR\_rd\_data1 to GPR\_rd\_data2. Only when we compare a register that was written into (actually, will be written into) by the instruction before the previous (2 instructions ago) we need to forward the MEM phase data (which is the ALUOUT\_reg data). Note that we do not need to handle Branch Forwarding from earlier instructions since from 3 instructions ago, the "transparent GPR" of the data forwarding does that for us, and from 4 instructions ago there is no forwarding problem since the GPR File is updated on time.

You should add this mechanism as depicted in Fig. 8 below.



Fig. 8 – MIPS Data Path with Data and Branch Forwarding

The only two signals we added to the HW6 MIPS CPU to support Branch Forwarding are GPR\_rd\_data1\_wt\_fwd and GPR\_rd\_data2\_wt\_fwd that are the outputs of two new muxes at the inputs of the Rs\_equals\_Rt comparator.

Note that this mechanism should also be used by the jr instruction. In the jr instruction we have a similar latency issue. Instead of sending back the GPR\_rd\_data1 to the Fetch Unit, you need now to use the GPR\_rd\_data1\_wt\_fwd vector signal.

See more in section e below.

## e. PART II - Names & definition of signals inside the HW6\_top MIPS CPU

In your design, for the **HW6\_top** entity, you should use the exact signal names as were used in the HW5 design and <u>add</u> the following signals using the exact signal names shown below. These are already defined in the **HW6\_top\_4sim.empty** and the **HW6\_top.empty** files:

#### ID additional signals

- jr\_address 32 bit vector signal that has the Rs data value (usually from GPR\_rd\_data1) to be loaded into the PC\_reg in jr instruction
- 2. GPR\_rd\_data1\_wt\_fwd 32 bit vector signal the output of the Branch Forwarding mux of Rs value
- 3. GPR\_rd\_data2\_wt\_fwd 32 bit vector signal the output of the Branch Forwarding mux of Rt value
- 4. JAL '1' when we have jal instruction in the IR\_reg (decoded from the opcode)

#### EX phase signals

- 5. PC\_plus\_4\_pEX PC\_plus\_4\_pID (or PC\_plus\_4) delayed by 1 clock cycle.
- 6. A\_reg\_wt\_fwd 32 bit vector signal the output of the Data Forwarding mux of Rs value
- 7. B\_reg\_wt\_fwd 32 bit vector signal the output of the Data Forwarding mux of Rt value
- 8. Rs\_pEX Rs delayed by 1 clock cycle.
- 9. JAL\_pEX JAL delayed by 1 clock cycle.

#### MEM phase signals

- 10. PC plus 4 pMEM PC plus 4 pEX delayed by 1 clock cycle
- 11. JAL\_pMEM JAL\_pEX delayed by 1 clock cycle

## WB phase signals

- 12. PC\_plus\_4\_pWB PC\_plus\_4\_pMEM delayed by 1 clock cycle
- 13. JAL\_pWB JAL\_pMEM delayed by 1 clock cycle

Add more signals according to your needs. Consult the teacher just to be on the safe side.

#### Additional input in the Fetch Unit

Jr\_adrs\_in - 32 bit vector signal of the GPR\_rd\_data1 value to be load into the PC in jr instruction (should be changed to GPR\_rd\_data1\_wt\_fwd to support Branch Forwarding). To simplify things we connected it to the jr\_address signal of the HW6\_MIPS (1<sup>st</sup> on the list in this page).

## f. Names and definitions of output signals from HW6 top CPU to the TB

You need to define all output signals coming out of the **HW6\_top** entity to be tested by the HW6 TB. These signals are the same as we used in **HW5 top**:

- 1) CK\_out\_to\_TB a signal identical to the MIPS\_ck "internal" signal
- 2) RESET\_out\_to\_TB a signal identical to the MIPS\_reset "internal" signal
- 3) HOLD\_out\_to\_TB a signal identical to the MIPS\_hold "internal" signal
- 4) rdbk0\_out\_to\_TB to rdbk15\_out\_to\_TB 16 vector signals, 32 bit each that will have the data we want to check as detailed below.

In your **HW6** top design you should connect the rdbk signals as follows:

```
ID signals:
```

rdbk0 => PC\_reg

rdbk1 => IR\_reg,

rdbk2 => sext\_imm (in ID phase)

rdbk3 => Rs, Rt, Rd, Funct (Rs= bits 28:24, Rt= bits 20:16, Rd= bits 12:8, Funct= bits 5:0)

rdbk4 => RegWrite, Rs\_eq\_Rt, MemWrite

(Reg Write= bit 28, MemWrite= bit 24, Rs eq Rt=bit0)

EX signals:

rdbk5 => ALUsrcB\_pEX, ALUOP, Funct\_pEX

(ALUsrcB=bit 28, ALUOP= bits 9:8, Funct\_pEX = bits5:0)

 $rdbk6 => A_reg,$ 

rdbk7 => B\_reg,

rdbk8 => sext\_imm\_reg, rdbk9 => ALU output,

MEM signals:

rdbk10 => ALUOUT\_reg rdbk11 => B\_reg\_pMEM

## MEM & WB control signals

rdbk12 => MemWrite\_pMEM (bit31), MemToReg\_pMEM (bit28), RegWrite\_pMEM (bit24),

Rd pMEM (bits 20:16),

MemToReg\_pWB (bit12), RegWrite\_pWB (bit8), Rd\_pWB (bits 4:0)

#### WB signals:

rdbk13 => MDR reg

rdbk14 => ALUOUT\_reg\_pW rdbk15 => GPR\_wr\_data

By the way, These are the exact same signals we used in HW5 TB. As ususal, the TB will compare the expected data of these signals to the actual result of the simulation and will tell you where the errors are. It uses a file called SIM\_HW6\_TB\_data.dat containing the expected results. Since you will use these signals in the implementation phase, you should also connect them to the Host Interface [all connections are already done for you in the HW6\_top\_4sim.empty file].

## g. Simulation of the 3 designs – describing the 3 projects

In all the 3 design you should run simulations. We will give you an "empty" **HW6\_top\_4sim.vhd** file that has all of the additional new signals for the 3 parts already inside. The name and definitions of the new signals appear in section **e** below. The names and definition of the TB signal appear in section **f**. They are similar to those used in HW5. The required simulation reports are described in section **2**. Then we have the implementation instructions and the Implementation report instructions.

The files we will use to run the simulation are:

#### The design files:

- 1) HW6\_top\_4sim.vhd This is your design of HW6. It uses your designs of the Fetch\_Unit, GPR, MIPS\_ALU and the pre-prepared BYOC\_Host\_Intf\_4sim. We give you an "empty" version that has all of the additional signals for the 3 parts already inside. You should take your HW5\_top.vhd design and copy the VHDL code that you wrote into the appropriate location in the Architecture part (after the general signals and the components connections) of the HW6\_top\_4sim.vhd. You should then add the necessary changes to form the HW6\_top MIPS as explained in sections b, c and d.
- 2) **Fetch\_Unit.vhd** The Fetch Unit you designed in HW2 after the modifications of HW4. **Additional changes are required in this unit in HW6!!**
- 3) **GPR.vhd** your GPR File design you designed in HW3.
- 4) **dual port memory.vhd** part of the GPR File you designed in HW3.
- 5) MIPS\_ALU.vhd your MIPS\_ALU design prepared in HW3. Changes are required in this unit in HW6!!

#### BYOC infrastructure files:

- 6) **BYOC\_Clock\_driver\_4sim.vhd** the CK divider & driver we use for simulation (also good for the Modelsim simulator)
- 7) **BYOC\_Host\_Intf\_4sim.vhd** The pre-prepared component including the IMem, DMem and "pre-loaded" program and data. This component also creates the reset & hold signals to the rest of the HW6 MIPS CPU.

#### Simulation files:

- 8) **SIM\_HW6\_TB.vhd** The TB vhd file prepared in advance.
- 9) **SIM\_HW6\_program.dat** The program file for simulation
- 10) **SIM\_HW6\_TB\_data.dat** this is a data file prepared in advance that has the expected TB values. It is read by the HW6\_TB and used to compare the actual simulation results to the expected ones. You will have 3 versions of this file:
  - a. SIM HW6 TB data no fwding.dat
  - b. SIM\_HW6\_TB\_data\_wt\_data\_fwding.dat
  - c. SIM\_HW6\_TB\_data\_wt\_data\_and\_branch\_fwding.dat.
- 11) SIM HW6 filenames.vhd The actual path information of the dat files used by the TB.

**NOTE**: You should update that file according to your simulation project actual path and the name of the TB\_data file you want to use.

I.e., if you want to use SIM\_HW6\_TB\_data\_no\_fwding.dat, this should be the name appearing in the SIM HW6 filenames.vhd.

# 2) HW6 Simulation report

You should submit a single zip file for the Simulation and implementation phases. It should have four directories/folders. The first is called **Simulation1**, the 2<sup>nd</sup> is called **Simulation2**, the 3<sup>rd</sup> is called **Simulation 3**, the 4<sup>th</sup> is called **Implementation**. In the Simulation folders you will have 3 sub-folder of:

- Src\_4sim here you put all of the \*.vhd sources for simulation
- Sim here you should have the HW6\_4sim project created by the simulator you used
   Docs Here you put your simulation report. The first few lines in the report will have your ID
   numbers (names are optional). See the instructions below for the rest of the simulation report.

   Note that the doc file should be a WORD file and not a PDF file- to allow addition of
   remarks while grading the report.

**Simulation1** will have the "no forwarding" design of part I where you add the lui, ori, jr and jal instructions. You should answer the questions in **Appendix A** and insert these to your report of **Simulation1**. You should use the **SIM HW6 TB data no fwding.dat** file for the simulation.

**Simulation2** will have the "data forwarding" design of part II where you add the data forwarding to the design of **Simulation1**. You should answer the questions in **Appendix B** and insert these to your report of **Simulation2**. Use the **SIM\_HW6\_TB\_data\_wt\_data\_fwding.dat** file for the simulation.

**Simulation3** will have the "branch forwarding" design of part III where you add Branch Forwarding to the design of **Simulation2**. You should answer the questions in **Appendix C** and insert these to your report of **Simulation3**.

Use SIM HW6 TB data wt data and branch fwding.dat for this simulation.

Later, in the Implementation phase you will add 2 sub-folders to the **Implementation** folder. These will be:

- Src\_4ISE here you put all of the \*.vhd sources and the \*.ucf file (and no TB file)
- ISE here you should have the HW6\_MIPS project created by the Xilinx ISE SW.

# 3) HW6 MIPS CPU – implementation

After a successful simulation we want to implement the design on the Nexys2 board. The same change done in HW5 to go from the simulation version to the implementation version are required here too.

We can rename the HW6\_top\_4sim.vhd to HW6\_top.vhd and then, remove all the signals that were outputted to the TB (see 1f above). These are not required anymore. Another way is to use the pre-prepared HW6\_top.empty file, rename it to HW6\_top.vhd, and copy your VHDL code from your HW6\_top\_4sim.vhd file into the appropriate location inside the HW6\_top.vhd file.

The files we will use to implement the design on the Nexys2 board are:

- 1) **BYOC.ucf** The file listing which signal are connected to which FPGA pins in the Nexys2 board.
- 2) **HW6\_top.vhd** This is your design of HW6. It uses the GPR, MIPS\_ALU and the Fetch\_Unit. This should be based on the 3<sup>rd</sup> simulation version that supports the new istructions (ori, lui, jr, jal) and data and branch forwarding.
- 3) **GPR.vhd** your GPR File design you prepared in HW3 and modified during the HW6 simulation.
- 4) dual\_port\_memory.vhd part of the GPR File design you prepared in HW3.
- 5) MIPS\_ALU.vhd your MIPS\_ALU design you prepared in HW3 and modified during the HW6 simulation.
- 6) **Fetch\_Unit.vhd** The Fetch Unit you prepared in HW2 after the modifications of HW4, HW5 and HW6.
- 7) **BYOC\_Host\_Intf.ngc** This prepared component includes the infrastructure interfacing to the PC allowing us to load programs into IMem and data into DMenm and read feedback signals in single clock mode. It also has the VGA controller and the KBD and Flash interfaces. We give that component in the form of a single netlist file, which is an already compiled version of the vhd files forming the BYOC\_Hos\_interface.
- 8) **BYOC\_Clock\_driver.vhd** the CK divider & driver we use for implementation as of HW2.

To allow an easy debugging you should connect the rdbk signals (eventually connected to the BYOC\_Host\_intf) as follows [same as in the simulation & implementation part and in HW5 – again, these are all connected for you in the **HW6\_top.empty** file]:

#### ID signals:

rdbk0 => PC\_reg rdbk1 => IR\_reg,

rdbk2 => sext\_imm (in ID phase)

rdbk3 => Rs, Rt, Rd, Funct (Rs= bits 28:24, Rt= bits 20:16, Rd= bits 12:8, Funct= bits 5:0)

rdbk4 => RegWrite, Rs eq Rt, MemWrite

(Reg Write= bit 28, MemWrite= bit 24, Rs\_eq\_Rt=bit0)

#### EX signals:

rdbk5 => ALUsrcB\_pEX, ALUOP, Funct\_pEX

(ALUsrcB=bit 28, ALUOP= bits 9:8, Funct\_pEX = bits5:0)

 $rdbk6 => A_reg,$  $rdbk7 => B_reg,$ 

rdbk8 => sext\_imm\_reg, rdbk9 => ALU\_output,

#### MEM signals:

rdbk10 => ALUOUT\_reg rdbk11 => B\_reg\_pMEM

#### MEM & WB control signals

rdbk12 => MemWrite pMEM (bit31), MemToReg pMEM (bit28), RegWrite pMEM (bit24),

Rd\_pMEM (bits 20:16),

MemToReg\_pWB (bit12), RegWrite\_pWB (bit8), Rd\_pWB (bits 4:0)

#### WB signals:

rdbk13 => MDR\_reg

rdbk14 => ALUOUT\_reg\_pW rdbk15 => GPR\_wr\_data

So we'll run that the BYOCInterface SW and load the IMem. Then run the circuit in a single ck mode and check that the reading we see at the points we "hooked" to the rdbk signals are as what we expect.

The file we want to load into the IMem is called "Pong1\_v32.txt". Following the loading, we can run in single ck mode and see the readback values on the PC screen after each clock. If you press the run button, you should get the Pong game running on the VGA screen.

If it works, GREAT!!!

# You Built Your Own Computer!!

# 4) Implemetation report

You should submit a single zip file that has the simulations and implementation projects in 4 folders as described in section 2 above. In the **Implementation** folder you will include your entire ISE implementation project in two sub-folder that will be as follows:

- Src\_4ISE here you put all of the \*.vhd sources and the \*.ucf file (and no TB file)
- ISE here you should have the HW6 MIPS project created by the Xilinx ISE SW.

Including a copy of the bit file you created at the **Implementation** directory.

As part of completing this part of the course you will have to show me how you run the design on the Nexys2 board in the lab. And answer some questions.

# **Enjoy the assignment !!**

At the end of this assignment you will have our final CPU and run a simple computer game on the CPU that you yourself built. That is great! I hope you enjoyed the course.

# Appendix A

A.1) Fill up the following table describing what happens in each CK cycle in all instructions. You should specify the specific operations that are required for the execution of the instruction. We filled in the Rtype and j instructions – as examples. We also gave the list of required registers & signals to be mentioned in the table, in the ori instruction line.

| phase       | IF                                                          | ID                                                                                                                             | EX                                                                 | МЕМ                                                                                                | WB                          |
|-------------|-------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|-----------------------------|
| Instruction |                                                             |                                                                                                                                |                                                                    |                                                                                                    |                             |
| Rtype       | IR=IMem[PC]<br>PC= PC+4                                     | A=GPR[Rs] B=GPR[Rt]  Active signals: RegDst='1' RegWrite='1' ALUOP="10" MemToReg='0'                                           | ALUOUT = A op B  Rd is chosen: Rd_pMEM=Rd_pEX                      | ALUOUT_pWB=<br>ALUOUT<br>(ALUOUT is delayed 1ck)                                                   | GPR[Rd_pWB]<br>= ALUOUT_pWB |
| addi        |                                                             |                                                                                                                                |                                                                    |                                                                                                    |                             |
| ori         | Need to tell what is loaded to IR & PC – the relevant regs. | Again, all regs that are relevant (A, B, sext_imm, PC in j & branch)  Also – all <u>active</u> signals created at the ID phase | All regs that are relevant (ALUOUT, B_reg_pMEM, Rd_pMEM, sext_imm) | All regs that are relevant (ALUOUT_reg_bWB, Rd_pWB, MDR)  MDR= DMem[adrs] or DMem[adrs]=B_reg_pMEM | GPR[Rd_pWB]<br>= ALUOUT_pWB |
| lui         |                                                             |                                                                                                                                |                                                                    |                                                                                                    |                             |
| beq         |                                                             |                                                                                                                                |                                                                    |                                                                                                    |                             |

| bne |                        |               |         |         |         |
|-----|------------------------|---------------|---------|---------|---------|
|     |                        |               |         |         |         |
| lw  |                        |               |         |         |         |
|     |                        |               |         |         |         |
| sw  |                        |               |         |         |         |
|     |                        |               |         |         |         |
| j   | IR=IMem[PC]<br>PC=PC+4 | PC= jump adrs | nothing | nothing | nothing |
|     |                        |               |         |         |         |
| jal |                        |               |         |         |         |
|     |                        |               |         |         |         |
| jr  |                        |               |         |         |         |
|     |                        |               |         |         |         |

Answer the following questions.

- A.2) Describe the changes done in order to support the ORI instruction.
- A.3) Describe the changes done in order to support the LUI instruction.
- A.4) Describe the changes done in order to support the JR instruction.
- A.5) Describe the changes done in order to support the JAL instruction.

In your answers, besides stating the reasoning in detail, show the relevant VHDL code sections to better explain your answers.

# Appendix B

Answer the following questions.

B.1) What are the limitations due to the pipeline latency of the following combinations:

- Iw after add where the add Rd is the Iw Rs
- Iw after add where the add Rd is the Iw Rt
- add after lw where the lw Rt is the add Rt
- beq after lw where the lw Rt is the beq Rs

Use a similar figure to Fig.2 and Fig. 3 to demonstrate your answers. Explain your answer!

B.1.a - Iw after add where the add Rd is the Iw Rs [e.g., Iw \$4,16(\$3)]



## B.1.b - lw after add where the add Rd is the lw Rt



B.1.c - add after lw where the lw Rt is the add Rt



B.1.d - beq after lw where the lw Rt is the beq Rs



B.2) What are the limitations of all cases of B.1 after you add he Data Forwarding? .  $\underline{\text{Explain your}}$  answer!

B.2.a - lw after add where the add Rd is the lw Rs



B.2.b - lw after add where the add Rd is the lw Rt



B.2.c - add after lw where the lw Rt is the add Rt



B.2.d - beq after lw where the lw Rt is the beq Rs



B.3) How many times do we perform the instruction following a jal instruction? Explain in detail. What are the implications? If this is a problem, what do you suggest in order to solve it?

B.4) How soon after jal instruction can we issue a jr \$31 instruction in order to return to the right location in the code? Give the answer before data forwarding is added and then after the data forwarding is added. . <u>Explain your answer!</u>

No data forwarding:



# With data forwarding:



# **Appendix C**

Answer the following questions.

C.1) What are the limitations due to the pipeline latency of the following combinations (assume Data Forwarding already exists):

- beq after add where the add Rd is the beq Rt
- beq after lw where the lw Rt is the beq Rs

Use a similar figure to Fig.2 and Fig. 3 to demonstrate your answers. Explain your answers!

C.1.a - beq after add where the add Rd is the beq Rt



C.1.b - beq after lw where the lw Rt is the beq Rs



C.2) What are the limitations of all cases of C.1 after you add the Branch Forwarding? .  $\underline{\text{Explain}}$  your answers!

C.2.a - beq after add where the add Rd is the beq Rt



C.2.b - beq after lw where the lw Rt is the beq Rs



| C.3) Why can't we check the result of the previous instruction (time slot n-1) by a beq instruction following it (time slot n)?                                           |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| C.4) List all of the limitations for Assembly programmer you can think of that still exist after adding the Data & Branch Forwarding circuits <u>Explain your answer!</u> |
| C5) What is the shortest loop code possible (not an infinite loop)? Any limitations? Explain in detail                                                                    |
|                                                                                                                                                                           |