# Lecture 9: The Register File Design

#### Acknowledgements

All class materials (lectures, assignments, etc.) are based on material prepared by Prof. Visvesh S. Sathe and reproduced with his permission.



Visvesh S. Sathe
Associate Professor
Georgia Institute of Technology
https://psylab.ece.gatech.edu

UW (2013-2022) GaTech (2022-present)




#### Example Microprocessor Layout



- Datapath laid out in a regular structure
- "Bitslice" is an important concept/metric
  - Aligning modules and wires along each of the 16 bits of your design will help
  - Data flow from regfile -> shift/alu/pc and from shift/alu -> regfile occurs naturally along bit boundaries: a COMMON BITSLICE!
  - E.g. regfile-out lines up with alu/shift/pc IN
  - Regfile-in lines up with alu/shift OUT



#### **Layout Design Conventions**



- Spacing: Distance between object edges
- Pitch: Distance between object centerlines
- Track: Centerline of region designated for metal route
- Cell-height: Height of cell from Vdd-centerline to Vss-centerline
- Bit-slice width: **Width** of design corresponding to 1 bit
  - Important to maintain uniform bit-slice width across your structures

- Bit-slice: Unit section of your datapath corresponding to 1 bit
- Maintain uniform bit-slice across your modules
  - If a structure is narrower than the bit-slice, insert space or shorten its height
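A small arithmetic sketch of the padding rule follows. It assumes a hypothetical 3.2 um bit-slice and made-up module widths (none of these numbers come from the lecture), and the function names are illustrative. It only models the "insert space" option; shortening a structure's height is a layout change the sketch does not capture.

```python
# Hypothetical numbers: the lecture does not specify a bit-slice width.
N_BITS = 16  # 16-bit datapath

def bitslice_padding_um(module_widths_um, slice_um):
    """Return the filler width each module needs to match the common bit-slice."""
    padding = {}
    for name, width in module_widths_um.items():
        if width > slice_um:
            # A module wider than the slice would break bit alignment for every bit.
            raise ValueError(f"{name} is {width} um wide, wider than the {slice_um} um slice")
        padding[name] = round(slice_um - width, 3)
    return padding

def datapath_width_um(slice_um, n_bits=N_BITS):
    """Total datapath width when every module uses the same slice pitch."""
    return round(slice_um * n_bits, 3)

widths = {"regfile": 3.2, "alu": 2.8, "shifter": 2.4}
print(bitslice_padding_um(widths, slice_um=3.2))  # {'regfile': 0.0, 'alu': 0.4, 'shifter': 0.8}
print(datapath_width_um(3.2))                     # 51.2
```

The widest module sets the slice; everything narrower is padded so that regfile outputs land directly on the ALU and shifter inputs for each bit.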




#### Managing Power Trunks and Tracks



- For example: Min-width = 0.1um, Min-spacing = 0.1um
- Track pitch = Min-width + Min-spacing = 0.2um
- Recommended: Cell boundary 0.5 track pitches away from track location →
   Cell width is a multiple of track pitch (0.2um in this example), and abutted cells share one continuous track grid
- Route on track locations. Number/plan your tracks
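The track arithmetic above can be checked with a few lines of Python. This sketch works in integer nanometres to avoid floating-point grid errors and uses the example's 0.1 um rules; the 1.1 um requested cell width and the function names are hypothetical.

```python
# Integer nanometres avoid floating-point rounding on grid math.
def track_pitch_nm(min_width_nm, min_spacing_nm):
    """Centerline-to-centerline distance of min-width wires at min spacing."""
    return min_width_nm + min_spacing_nm

def snap_cell_width_nm(width_nm, pitch_nm):
    """Round a cell width up to the next multiple of the track pitch."""
    return -(-width_nm // pitch_nm) * pitch_nm

def track_centers_nm(cell_width_nm, pitch_nm):
    """Track centerlines measured from the cell's left boundary.

    The first track sits half a pitch inside the boundary, so two abutted
    cells continue the same track grid with no gap or overlap.
    """
    assert cell_width_nm % pitch_nm == 0, "snap the cell width first"
    return [pitch_nm // 2 + k * pitch_nm for k in range(cell_width_nm // pitch_nm)]

pitch = track_pitch_nm(100, 100)          # 200 nm = 0.2 um
width = snap_cell_width_nm(1100, pitch)   # 1200 nm
print(track_centers_nm(width, pitch))     # [100, 300, 500, 700, 900, 1100]
```

Because the first track is half a pitch from the boundary, a cell abutted on the right starts its tracks at 1300 nm, exactly one pitch after 1100 nm.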



#### The Register File (Regfile)

- Store register values (operands for processor instructions)
  - 13-entry (R0-R12), 16-bit
  - 1 write port, 2 read ports
  - Write-before-read regfile operation
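A behavioral sketch of this register file can help pin down the intended semantics before designing circuits. It assumes that one call to `cycle` represents one clock cycle and that "write-before-read" means the cycle's write is committed before both reads are evaluated; the class and method names are hypothetical, while the port names follow the port list.

```python
class RegFile:
    """Behavioral sketch: 13 x 16-bit, 1 write port, 2 read ports."""
    N_ENTRIES = 13            # R0-R12
    WIDTH = 16
    MASK = (1 << WIDTH) - 1

    def __init__(self):
        self.regs = [0] * self.N_ENTRIES

    def _check(self, addr):
        if not 0 <= addr < self.N_ENTRIES:
            raise IndexError(f"address {addr} is outside R0-R{self.N_ENTRIES - 1}")

    def cycle(self, wr_en, wr_addr, wr_data, read_addr_0, read_addr_1):
        """Model one clock cycle; returns (regfile_data0, regfile_data1)."""
        # Write-before-read: commit the write first, so a read of the
        # register being written returns the new value in the same cycle.
        if wr_en:
            self._check(wr_addr)
            self.regs[wr_addr] = wr_data & self.MASK
        self._check(read_addr_0)
        self._check(read_addr_1)
        return self.regs[read_addr_0], self.regs[read_addr_1]

rf = RegFile()
print([hex(v) for v in rf.cycle(1, 3, 0xBEEF, 3, 0)])   # ['0xbeef', '0x0']
print([hex(v) for v in rf.cycle(1, 5, 0x12345, 5, 3)])  # ['0x2345', '0xbeef']
```

The second call also shows that write data wider than 16 bits is truncated to the register width.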

#### Ports

- VDD!
- GND!
- clk (synchronizes the write operation)
- read\_addr\_0 (Read address for port 0, 4 bits)
- read\_addr\_1 (Read address for port 1, 4 bits)
- wr\_en (Write enable)
- wr\_addr (Write address, 4 bits)
- wr\_data (Write data, 16 bits)
- regfile\_data0 (Data from read port 0, 16 bits)
- regfile\_data1 (Data from read port 1, 16 bits)
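With 13 entries, a 4-bit address leaves three unused codes (13-15). A minimal sketch of the write-address decode, assuming one-hot per-register write enables qualified by wr\_en and assuming unused codes simply select nothing (the function name is hypothetical):

```python
N_ENTRIES = 13      # R0-R12
ADDR_BITS = 4       # wr_addr width

def decode_write_enables(wr_en, wr_addr):
    """One-hot per-register write enables (index i -> register Ri)."""
    if not 0 <= wr_addr < (1 << ADDR_BITS):
        raise ValueError("wr_addr must fit in 4 bits")
    # Codes 13-15 are unused: they assert no enable.
    return [1 if (wr_en and wr_addr == i) else 0 for i in range(N_ENTRIES)]

print(decode_write_enables(1, 4))   # a single 1, at index 4
print(decode_write_enables(1, 14))  # all zeros: unused code
print(decode_write_enables(0, 4))   # all zeros: write disabled
```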



#### **Basic Structure**



#### Timing Diagram Walkthrough










#### Naïve Implementation

- Array of flip-flops
- BUT... all master latches are doing the same job: every register's master latch samples the same wr\_data!
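Counting latches makes the redundancy concrete. This sketch assumes the more compact alternative shares one master latch per wr\_data bit across all registers and keeps one slave latch per stored bit (consistent with the slave-latch delay term in the worst-case delay exercise); function names are hypothetical.

```python
N_ENTRIES, WIDTH = 13, 16

def latches_flop_array(n_entries=N_ENTRIES, width=WIDTH):
    # Each stored bit is a master-slave flip-flop: two latches per bit.
    return 2 * n_entries * width

def latches_shared_master(n_entries=N_ENTRIES, width=WIDTH):
    # One master latch per wr_data bit, shared by all registers,
    # plus one slave latch per stored bit.
    return width + n_entries * width

print(latches_flop_array())     # 416
print(latches_shared_master())  # 224
```

Sharing the master removes 12 of the 13 master latches in every bit-slice, nearly halving the latch count.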




#### **More Compact Alternative**





#### Basic Regfile Cell



- 12T cell
- Fully static operation

#### Supporting Multiple Bitlines



- Each read bitline is a muxed output shared by all registers in the column
- What is the capacitance on each bitline?
- How does it impact the size of the drivers?
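A first-order sketch of the two questions above: every register in the column contributes output-driver diffusion capacitance to the bitline, so upsizing the drivers also increases the load they drive. All component values are hypothetical placeholders (not process data), all 13 drivers are assumed to share the same width `w`, and the delay uses a simple 0.69·R·C estimate.

```python
# All values are hypothetical placeholders, not process data.
N_REGS = 13           # drivers hanging on each read bitline
R_UNIT_KOHM = 10.0    # on-resistance of a unit-width driver
C_DIFF_FF = 1.0       # diffusion cap per unit driver width
C_WIRE_FF = 50.0      # bitline wire cap

def bitline_cap_fF(w, n_regs=N_REGS, c_diff=C_DIFF_FF, c_wire=C_WIRE_FF):
    """Every register's output driver (width w) adds diffusion cap to the bitline."""
    return n_regs * w * c_diff + c_wire

def bitline_delay_ps(w, r_unit=R_UNIT_KOHM):
    """First-order RC estimate: 0.69 * R * C, with R = r_unit / w (kOhm * fF = ps)."""
    return 0.69 * (r_unit / w) * bitline_cap_fF(w)

# Self-loading floor: as w grows, delay approaches 0.69 * r_unit * n_regs * c_diff.
floor_ps = 0.69 * R_UNIT_KOHM * N_REGS * C_DIFF_FF
for w in (1, 2, 4, 8):
    print(w, round(bitline_cap_fF(w), 1), round(bitline_delay_ps(w), 1))
print("floor", round(floor_ps, 1))
```

Upsizing helps against the wire capacitance, but the diffusion term grows with `w`, so returns diminish toward the floor set by the number of registers sharing the bitline.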

#### **Transmission Gate Alternative**



- Advantage: Transmission gate offers lower delay for large loads
- Disadvantage: Critical path loading!
- Is there a third approach to avoid this disadvantage?



$$T_{read} = T_{clk-q} + T_{bl-drive}$$

- Delay depends on bitcell topology
- Coupling on bitline is important
- Exercise: Order the following delays
  - TX gate output. Bitlines read: (a) same reg and (b) different reg
  - Tri-state output. Bitlines read: (a) same reg and (b) different reg
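One standard way to reason about bitline coupling is the Miller coupling factor: the coupling capacitance to a neighboring wire counts 0×, 1× or 2× depending on whether the neighbor switches in the same direction, stays quiet, or switches in the opposite direction. The sketch assumes the two read bitlines of a bit-slice run adjacent to each other; the capacitance values are hypothetical, and which case applies to each read scenario is left to the exercise.

```python
# Miller coupling factor k: how much of the coupling cap the aggressor adds.
MILLER_K = {
    "same": 0.0,      # neighbor switches in the same direction
    "quiet": 1.0,     # neighbor holds its value
    "opposite": 2.0,  # neighbor switches in the opposite direction
}

def effective_cap_fF(c_ground_fF, c_couple_fF, neighbor):
    """Effective switched cap of a bitline given its neighbor's activity."""
    return c_ground_fF + MILLER_K[neighbor] * c_couple_fF

for case in MILLER_K:
    print(case, effective_cap_fF(40.0, 10.0, case))
```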

















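Exercise: What is the worst-case delay that will be faced by this register file? (Hint: Data written in the previous cycle may be read in the current cycle)

$$T_{clk-out} = T_{clk-wr\_addr} + T_{wr-addr-wl} + T_{DQ-slave} + T_{bl-drive}$$

- $T_{clk-wr\_addr}$: clock edge to valid write address
- $T_{wr-addr-wl}$: write-address decode to write wordline
- $T_{DQ-slave}$: D-to-Q delay through the slave latch being written
- $T_{bl-drive}$: driving the read bitline to the output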
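Because the four terms are in series, the worst-case clock-to-output delay is just their sum. The numbers below are hypothetical placeholders, meant to be replaced with values from your own extracted simulations; the helper names are illustrative.

```python
# Hypothetical component delays in ps; substitute your own extracted numbers.
components_ps = {
    "T_clk-wr_addr": 40,   # clock edge to write address valid
    "T_wr-addr-wl": 60,    # write-address decode to write wordline
    "T_DQ-slave": 30,      # slave latch D-to-Q
    "T_bl-drive": 90,      # read bitline drive to the output
}

def clk_to_out_ps(parts):
    """Worst-case clock-to-output delay: the terms add in series."""
    return sum(parts.values())

def largest_term(parts):
    """Name of the term that dominates the path."""
    return max(parts, key=parts.get)

print(clk_to_out_ps(components_ps))   # 220
print(largest_term(components_ps))    # T_bl-drive
```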





1. What happens when the slave clock arrives after the master clock?
2. Writing to multiple slaves due to glitches
   - A common problem if the enabling logic is not designed properly
- Run timing verification with the right state transitions to find the worst-case delay and slew
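Glitch-induced multiple writes are easy to reproduce on paper: when several address bits change with different arrival times, the decoder briefly sees intermediate codes. This sketch assumes hypothetical per-bit arrival times and a transition of wr\_addr from R7 to R8; if the write wordlines are enabled during this window, the slaves of the intermediate registers can be written too. The function name is illustrative.

```python
def transient_addresses(old, new, arrival_ps):
    """Address values seen by the decoder while bits switch at different times.

    arrival_ps[b] is the (hypothetical) time at which bit b settles.
    """
    changed = [b for b in range(len(arrival_ps)) if (old ^ new) >> b & 1]
    seq, cur = [old], old
    for t in sorted({arrival_ps[b] for b in changed}):
        for b in changed:
            if arrival_ps[b] == t:
                cur ^= 1 << b          # flip every bit that settles at time t
        seq.append(cur)
    return seq

# 4-bit wr_addr moves from R7 (0111) to R8 (1000); bit 2 is fastest, bit 3 slowest.
seq = transient_addresses(0b0111, 0b1000, arrival_ps=[30, 20, 10, 40])
print(seq)        # [7, 3, 1, 0, 8]
spurious = [a for a in seq[1:-1] if a < 13]   # valid R0-R12 codes briefly decoded
print(spurious)   # [3, 1, 0]
```

Timing verification should therefore exercise address transitions like this one, not just a static address.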






#### Related Basics: Clock-Gating







- Conditionally capture state into a timing element (flop/latch)
  - Reduce power
  - Potential area savings (if multiple flops are involved)
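A behavioral sketch contrasting two ways to conditionally capture state: a flop clocked every cycle with a recirculating 2:1 mux, and a flop whose clock is gated by en. Both hold the same values, but the gated flop sees clock edges only in enabled cycles, which is where the power saving comes from. When many flops share one enable, a single gater can also replace their per-flop muxes, which is one source of the area saving. Function names and the stimulus are hypothetical.

```python
def enable_flop_mux(d_seq, en_seq, q0=0):
    """Flop clocked every cycle; a 2:1 mux recirculates Q when en = 0."""
    q, clock_edges, trace = q0, 0, []
    for d, en in zip(d_seq, en_seq):
        clock_edges += 1          # the clock pin toggles every cycle
        q = d if en else q        # mux selects new data or the old value
        trace.append(q)
    return trace, clock_edges

def enable_flop_gated(d_seq, en_seq, q0=0):
    """Same behavior, but the flop only receives a clock edge when en = 1."""
    q, clock_edges, trace = q0, 0, []
    for d, en in zip(d_seq, en_seq):
        if en:
            clock_edges += 1      # gated clock pulses only in enabled cycles
            q = d
        trace.append(q)
    return trace, clock_edges

d, en = [5, 6, 7, 8], [1, 0, 0, 1]
print(enable_flop_mux(d, en))    # ([5, 5, 5, 8], 4)
print(enable_flop_gated(d, en))  # ([5, 5, 5, 8], 2)
```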



#### System-context

- A state machine generates the enable ("en")
- clk, or clk\_pre, is gated by en
  - Which one depends on context; you will end up using clk
- en must settle before clk\_pre → 1 (why?)
  - Gater setup
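The gater setup requirement can be written as an ordinary setup check on en. This sketch assumes en is launched by the state machine at one rising clk edge, must be stable a gater setup time before the next rising edge of clk\_pre, and treats clk and clk\_pre as aligned; all numbers and names are hypothetical.

```python
def gater_setup_slack_ps(period_ps, t_clk_q_ps, t_en_logic_max_ps, t_gater_setup_ps):
    """Slack on en at the clock gater (positive = en settles in time).

    Assumes en is launched at one rising clk edge and must be stable
    t_gater_setup before the next rising edge of clk_pre (clk_pre aligned with clk).
    """
    arrival = t_clk_q_ps + t_en_logic_max_ps
    required = period_ps - t_gater_setup_ps
    return required - arrival

print(gater_setup_slack_ps(1000, 50, 850, 40))   # 60
```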





- The en logic cone sees a spread of path delays
- Early "en" evaluation causes runt pulses
  - Early de-assertion causes a runt pulse
  - Early assertion causes a premature clock pulse or a premature runt pulse
- Inserting a T/2 delay guarantees that the enable for the current cycle will not transition until clk goes low
- Strictly speaking, correct BUT
  - Adds to the critical path!!
  - Dissipative, variable, and inefficient
  - (Technically, you could design for min\_delay > T/2 instead.)
- Instead, use a latch!!
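A minimal discrete-time sketch (samples instead of real delays) of why the latch helps. The enable is asserted in time but, because of a short path, de-asserts early while clk is high. A plain AND gater passes that wiggle as a runt pulse; a latch that is transparent while clk is low freezes en during the high phase, so the pulse stays full width. The waveform, period, and function names are hypothetical.

```python
def square_clock(n_samples, period=8):
    """clk: low for the first half of each period, high for the second half."""
    return [1 if (t % period) >= period // 2 else 0 for t in range(n_samples)]

def and_gater(clk, en):
    """Plain AND gate: gclk follows every wiggle on en while clk is high."""
    return [c & e for c, e in zip(clk, en)]

def latch_gater(clk, en):
    """Latch-based gater: en passes while clk is low and is frozen while clk is high."""
    q, gclk = 0, []
    for c, e in zip(clk, en):
        if c == 0:
            q = e            # latch transparent during the low phase
        gclk.append(c & q)   # q cannot change during the high phase
    return gclk

def pulse_widths(wave):
    """Lengths of each run of 1s, in samples."""
    widths, run = [], 0
    for v in wave:
        if v:
            run += 1
        elif run:
            widths.append(run)
            run = 0
    if run:
        widths.append(run)
    return widths

clk = square_clock(24)                               # high at t = 4-7, 12-15, 20-23
en = [1 if 10 <= t < 13 else 0 for t in range(24)]   # de-asserts early, at t = 13
print(pulse_widths(and_gater(clk, en)))    # [1]  -> runt pulse
print(pulse_widths(latch_gater(clk, en)))  # [4]  -> one full clock pulse
```

The latch only requires en to settle before clk rises (the gater setup), which is far cheaper than delaying en by T/2.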



#### One more thing...

- Clock skew is often reported for global clocks
  - A good way to control skew, BUT most timing elements today are driven by gclk
  - gclk skew ultimately matters!
  - Skew between clk and gclk is important when exactly one of launch or capture is ungated
    - E.g., if the capture element is a flip-flop
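A sketch of why clk-to-gclk skew matters, using the standard first-order setup and hold slack expressions for a single-cycle path. It assumes the launch flop sits on the ungated clk and the capture flop on a gclk that lags by a hypothetical 25 ps of gater delay; all numbers are placeholders. A late capture clock helps setup but eats directly into hold margin.

```python
def setup_slack_ps(period, t_cq_max, t_logic_max, t_setup, skew):
    """skew = capture clock arrival - launch clock arrival (positive = capture late)."""
    return period + skew - (t_cq_max + t_logic_max + t_setup)

def hold_slack_ps(t_cq_min, t_logic_min, t_hold, skew):
    """A late capture clock subtracts directly from hold margin."""
    return (t_cq_min + t_logic_min) - (t_hold + skew)

# Hypothetical: launch flop on ungated clk, capture flop on gclk lagging by 25 ps.
for skew in (0, 25):
    print(skew, setup_slack_ps(1000, 50, 800, 30, skew), hold_slack_ps(30, 10, 20, skew))
```

With these numbers, the 25 ps of gater delay turns a 20 ps hold margin into a 5 ps violation.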





#### **Breakout Session:**

Problem: Construct a clock-gating methodology for gating a state machine triggered by the negative edge of the clock.