## Agile Hardware Design
***
# Optimizing for Delay

<img src="./images/chisel_logo.svg" alt="agile hardware design logo" style="width: 20%; float:right"/>

Peter Hanping Chen, based on

1. UCB Bootcamp (configuration file `load-ivy.sc`): https://github.com/freechipsproject/chisel-bootcamp/tree/master/source
2. Prof. Scott Beamer, sbeamer@ucsc.edu, CSE 228A: https://classes.soe.ucsc.edu/cse228a/Winter24/


## Plan for Today

* Sources of logic delay
* Techniques to reduce delay

## Loading The Chisel Library Into a Notebook

```scala
//interp.load.module(os.Path(s"${System.getProperty("user.dir")}/../resource/chisel_deps.sc"))
val path = System.getProperty("user.dir") + "/source/load-ivy.sc"
//val path = System.getProperty("user.dir") + "/source/chisel_deps.sc"
println("path: "+path)
interp.load.module(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath(path)))
```

path: /home/peter/AIU/AIU_CS800_Chisel/500_UCSC_HWD/019_Delay/001_Code/source/load-ivy.sc

```scala
import chisel3._
import chisel3.util._
import chiseltest._
import chiseltest.RawTester.test
```

## Sources of Delay

* _**Gate Delay**_ - time it takes for gates to compute result
  * More complicated gates or more inputs (_fan-in_) can increase delay

* _**Wire Delay**_ - time to transmit signals between gates
  * Can be worsened by _fan-out_ (broadcasting to multiple outputs)
  * Can be much more pronounced on FPGAs

<img src="./images/delay.svg" alt="sources of delay" style="width:50%; align:left" />

## Units for Delay

* Typical time units
  * _nanosecond (ns)_ = $10^{-9}$ seconds
  * _picosecond (ps)_ = $10^{-12}$ seconds

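* _Fanout of 4_ (**FO4**)
  * Delay of one inverter driving four identical inverters
  * Roughly agnostic to process technology (or even operating voltage), so delays quoted in FO4 can be compared across designs
  * Example: Intel Pentium 4 @ 3.4 GHz had a cycle time of 16.3 FO4 => FO4 delay ~ 18 ps
  * source: https://en.wikipedia.org/wiki/FO4#cite_note-4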

<img src="./images/fo4.svg" alt="FO4" style="width:25%; align:left" />
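
The FO4 delay in the Pentium 4 example above follows from simple arithmetic: divide the clock period by the number of FO4 delays per cycle. A minimal sketch in plain Scala (the object and value names are illustrative, not from the course material):

```scala
object Fo4Example {
  val clockHz: Double     = 3.4e9               // 3.4 GHz clock frequency
  val periodPs: Double    = 1e12 / clockHz      // clock period in picoseconds
  val fo4PerCycle: Double = 16.3                // cycle time expressed in FO4 delays
  val fo4DelayPs: Double  = periodPs / fo4PerCycle
}
```

Evaluating `Fo4Example.periodPs` gives a period of roughly 294 ps, and `Fo4Example.fo4DelayPs` comes out at roughly 18 ps, matching the figure quoted above.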

## Critical Path

* Longest _delay_ path through design (under worst case conditions)
* Clock period must be longer than critical path delay
  * Paths "shorter" (less delay) than critical path do not affect clock frequency
* Reducing critical path delay helps in two ways:
  * 1 - can increase clock frequency (improve performance)
  * 2 - can instead keep the clock frequency and reduce supply voltage (reduce power)

<img src="./images/critical.svg" alt="critical path" style="width:50%; align: Left" />
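
To make the relationship between path delays and clock frequency concrete, here is a small plain-Scala model. It is a simplification under stated assumptions: path delays are given in picoseconds, and register setup time and clock skew are ignored. `ClockModel` and its method names are hypothetical.

```scala
object ClockModel {
  // The critical path is simply the largest path delay in the design (ps).
  def criticalPathPs(pathDelaysPs: Seq[Double]): Double = pathDelaysPs.max

  // Highest clock frequency (GHz) whose period still covers the critical path.
  // 1000 ps = 1 ns, so a 1000 ps critical path allows 1 GHz.
  def maxFreqGHz(pathDelaysPs: Seq[Double]): Double =
    1000.0 / criticalPathPs(pathDelaysPs)
}
```

With path delays of 120, 500, and 310 ps, the critical path is 500 ps and the model allows 2 GHz. Shortening the 120 ps path to 60 ps changes nothing, which is the point of the "shorter paths do not affect clock frequency" bullet.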

## Static Timing Analysis (STA)

* Process to analyze a design and determine its timing behavior
* Models performance of gates and wires
* Usually concerned with worst case
* Can be run right after synthesis (doesn't consider wire delays) or again after place & route (includes wire delays)

<img src="./images/sta.svg" alt="STA example" style="width:50%; align:left" />
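
The core of the worst-case analysis can be sketched as a longest-path computation over the gate netlist: the arrival time at a gate's output is its own delay plus the latest arrival among its inputs. The following plain-Scala toy is not how a production STA tool is built; it assumes an acyclic netlist, fixed per-gate delays, and no wire delay, and `TinySta`, `Gate`, and the example netlist are all made up for illustration.

```scala
object TinySta {
  // A gate with its own delay (ps) and the names of the gates driving its inputs.
  // Gates with no fan-in are fed directly by primary inputs or registers.
  case class Gate(delayPs: Int, fanIn: Seq[String] = Seq.empty)

  // Worst-case arrival time at every gate output.
  def arrivalTimes(netlist: Map[String, Gate]): Map[String, Int] = {
    val memo = scala.collection.mutable.Map.empty[String, Int]
    def arrival(name: String): Int = memo.get(name) match {
      case Some(t) => t
      case None =>
        val g = netlist(name)
        // Own delay plus the latest input arrival (0 if there is no fan-in).
        val t = g.delayPs + g.fanIn.map(arrival).foldLeft(0)(_ max _)
        memo(name) = t
        t
    }
    netlist.keys.foreach(arrival)
    memo.toMap
  }

  // Critical path delay = latest arrival anywhere in the netlist.
  def criticalDelayPs(netlist: Map[String, Gate]): Int =
    arrivalTimes(netlist).values.max

  // Hypothetical five-gate netlist used only as an example.
  val example: Map[String, Gate] = Map(
    "a" -> Gate(20),
    "b" -> Gate(30),
    "c" -> Gate(25, Seq("a", "b")),
    "d" -> Gate(40, Seq("c")),
    "e" -> Gate(10, Seq("a"))
  )
}
```

For the example netlist, the longest path is b, c, d with a delay of 30 + 25 + 40 = 95 ps, while the path through `e` only reaches 30 ps.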

## Fixing Critical Paths

* Even though most paths are "short," clock period set by _critical path_
* Want to decrease clock period to increase throughput (assuming no hazards or bubbles)
* _Process:_ optimize longest (delay) path by reducing delay, then do next longest, repeat
  * Can initially be done by tools, but usually humans needed for large interventions
  * Can be very time consuming

<img src="./images/histogram.svg" alt="" style="width:70%; align:left" />

## Who Fixes Delay?

* To let _the tools do the work_, designers need to appreciate what the tools can do, so that designer effort is complementary and not redundant
* _What the **tools** do best_
    * Decades of research & development have gone into logic optimization
    * Tools can reduce logic to reduce cost as well as restructure it to reduce delay
    * Tools can also choose to use faster components at the cost of area or power
    * Can do most things that _do not change semantics of design_
* _What **designer** does best_
    * _Can change the design_ (semantics)
    * Consider major architectural changes
    * Make changes to enable more optimization from tools

## Pipelining

* Break up long paths by inserting registers
  * Data still travels over long path, but now over multiple cycles
  * Requires _parallelism_, as now multiple elements in flight
* Where to put registers?
  * Want to balance delay
  * Sometimes very semantically clear, but that may not always be best
  * Manually moving logic back and forth across registers can be labor intensive

<img src="./images/pipeline.svg" alt="pipeline" style="width:70%; align: left"/>
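
The effect of pipelining, and of how well the stages are balanced, can be estimated with a very small model. This sketch assumes each stage is described only by its combinational delay in picoseconds and ignores register overhead (setup and clock-to-output delay); `PipelineModel` is a hypothetical name.

```scala
object PipelineModel {
  // The clock period is set by the slowest stage.
  def clockPeriodPs(stageDelaysPs: Seq[Int]): Int = stageDelaysPs.max

  // One element spends a full clock period in every stage.
  def latencyPs(stageDelaysPs: Seq[Int]): Int =
    stageDelaysPs.size * clockPeriodPs(stageDelaysPs)
}
```

A single 900 ps path has a 900 ps period and 900 ps latency. Split evenly into three 300 ps stages, the period drops to 300 ps (three times the throughput when the pipeline is kept full) while latency stays at 900 ps. An unbalanced split of 500, 200, and 200 ps only reaches a 500 ps period and stretches latency to 1500 ps, which is why balancing delay matters.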

## Retiming

* Automated way tools can move registers to balance path lengths
* Can't always move a register, such as if it has feedback
* Some tools have varying levels of sophistication or flexibility
  * e.g. can move registers only forward, only backward, or only in some cases
* Can sometimes complicate verification

<img src="./images/retimed.svg" alt="retiming" style="width:70%; align: left" />
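
What retiming is trying to achieve can be illustrated on a straight chain of gates: choose the register positions so that the slowest stage is as fast as possible. The sketch below is a toy model of that goal, not the algorithm real tools use. It assumes a linear chain without feedback, delays in picoseconds, and an exhaustive search that is only practical for short chains; `RetimingModel` and `bestPeriodPs` are hypothetical names.

```scala
object RetimingModel {
  // Split the chain into `stages` contiguous groups (registers sit between groups)
  // and return the smallest achievable delay of the slowest group.
  def bestPeriodPs(gateDelaysPs: Seq[Int], stages: Int): Int = {
    require(stages >= 1, "need at least one stage")
    if (stages == 1 || gateDelaysPs.isEmpty) gateDelaysPs.sum
    else
      // Try every position for the first register boundary, recurse on the rest.
      (1 to gateDelaysPs.size).map { cut =>
        val (first, rest) = gateDelaysPs.splitAt(cut)
        math.max(first.sum, bestPeriodPs(rest, stages - 1))
      }.min
  }
}
```

For gate delays of 100, 200, 300, 100, and 200 ps, one stage gives 900 ps, the best two-stage split gives 600 ps, and the best three-stage split gives 300 ps (100+200 | 300 | 100+200). A register placed at a "semantically obvious" spot, such as after the first gate, would leave an 800 ps stage in the two-stage case.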

## Coding for Retiming in Chisel

* Add ability for a component to be pipelined, but _parameterize_ depth
* Let the tools do the retiming work to spread the registers out appropriately
* Chisel's `Pipe` object (from `chisel3.util`) is a sequence of (shift) registers that carries a valid bit along with the data
* Example below places additional registers at end of combinational logic block
  * Some tools may prefer registers in front
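```scala
class PipelinedModule(pipelineDepth: Int) extends Module {
    val io = IO(new Bundle {   // placeholder ports, for illustration only
        val in  = Input(UInt(8.W))
        val out = Output(UInt(8.W))
    })
    // combinational logic produces: result (placeholder: increment)
    val result = io.in + 1.U
    // Pipe wraps the data in a Valid bundle; .bits extracts the delayed payload
    io.out := Pipe(true.B, result, pipelineDepth).bits
}
```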

## Reduce Depth of Structures

* Logic optimization in CAD tools can solve many inefficiencies, but still may need help
  * Tools aren't allowed to change observable behavior, so you will need to change design
  * May have (inadvertently) constructed a highly unusual corner case that tools can't optimize
* Be wary of logic depth for things that grow linearly
  * Consider pipelining
  * Consider a tree (sometimes trades area for delay)
  * Be sure to confirm it is on critical path first before optimizing
    * Even if linear, may still not be the critical path
    * Tools may be optimizing it just fine on their own

## Example Depth Reduction (for a Reduction) - 1/3

<img src="./images/reduction.svg" alt="reduction depth example" style="width:70%; align:left" />

## Example Depth Reduction (for a Reduction) - 2/3

```scala
// Linear reduction: every extra input adds one more adder to a single chain
def linearPopCount(l: Seq[Bool]): UInt = {
    if (l.isEmpty) 0.U
    else l.head +& linearPopCount(l.tail)
}

// PopCount (edited) from chisel3/SeqUtils.scala
// Tree reduction: split the inputs in half and add the two partial counts
def treePopCount(l: Seq[Bool]): UInt = l.size match {
    case 0 => 0.U
    case 1 => l.head
    case n => treePopCount(l take n/2) +& treePopCount(l drop n/2)
}

class CountOnes(n: Int) extends Module { // PopCount
    val io = IO(new Bundle {
        val in = Input(Vec(n, Bool()))
        val out = Output(UInt())   // width is inferred from the reduction
    })
    require(n > 0)
    // Swap in one of the commented-out lines to compare the generated hardware
    io.out := linearPopCount(io.in)
//     io.out := treePopCount(io.in)
//     io.out := PopCount(io.in)    // from chisel3.util
}
```

defined function linearPopCount
defined function treePopCount
defined class CountOnes

## Example Depth Reduction (for a Reduction) - 3/3

```scala
//printVerilog(new CountOnes(4))
println(getVerilog(new CountOnes(4)))
```

```
Elaborating design...
Done elaborating.
module CountOnes(
  input        clock,
  input        reset,
  input        io_in_0,
  input        io_in_1,
  input        io_in_2,
  input        io_in_3,
  output [4:0] io_out
);
  wire [1:0] _T = {{1'd0}, io_in_3}; // @[cmd4.sc 3:17]
  wire [1:0] _GEN_0 = {{1'd0}, io_in_2}; // @[cmd4.sc 3:17]
  wire [2:0] _T_1 = _GEN_0 + _T; // @[cmd4.sc 3:17]
  wire [2:0] _GEN_1 = {{2'd0}, io_in_1}; // @[cmd4.sc 3:17]
  wire [3:0] _T_2 = _GEN_1 + _T_1; // @[cmd4.sc 3:17]
  wire [3:0] _GEN_2 = {{3'd0}, io_in_0}; // @[cmd4.sc 3:17]
  assign io_out = _GEN_2 + _T_2; // @[cmd4.sc 3:17]
endmodule
```
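
The Verilog above shows three chained additions and a 5-bit `io_out` for only four inputs, even though a count of 0 to 4 fits in 3 bits. This follows from width inference: the expanding add `+&` produces a result one bit wider than its wider operand, so the linear chain gains a bit per input. The plain-Scala sketch below mirrors the two recursive functions and tracks only inferred widths. It assumes the literal `0.U` is inferred as 1 bit wide; `PopCountWidths` is a hypothetical name.

```scala
object PopCountWidths {
  // Width rule for Chisel's expanding add (+&): max of operand widths, plus one.
  private def addWidth(a: Int, b: Int): Int = math.max(a, b) + 1

  // Mirrors linearPopCount: a 1-bit Bool added to the rest of the chain.
  def linearWidth(n: Int): Int =
    if (n == 0) 1 else addWidth(1, linearWidth(n - 1))

  // Mirrors treePopCount: two half-size partial counts added together.
  def treeWidth(n: Int): Int = n match {
    case 0 | 1 => 1
    case _     => addWidth(treeWidth(n / 2), treeWidth(n - n / 2))
  }
}
```

`PopCountWidths.linearWidth(4)` is 5, matching `output [4:0] io_out`, while `PopCountWidths.treeWidth(4)` is 3. Under this model the tree version should therefore reduce adder width as well as depth.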

