doc: Add HPDCACHE.md doc and waveform
paulsc96 committed Sep 10, 2024
1 parent 99143d2 commit 979a21b
Showing 2 changed files with 125 additions and 0 deletions.
61 changes: 61 additions & 0 deletions HPDACHE.md
@@ -0,0 +1,61 @@
# HPDCache Evaluation Findings

The HPDCache has been integrated preliminarily into the [PULP fork of CVA6](https://github.com/pulp-platform/cva6/tree/paulsc/chs-hpdcache). This Cheshire branch tests and evaluates the integration and attempts to improve external bandwidth.

## BW Improvement Changes/Attempts So Far

So far, two changes have been made in an attempt to improve NoC stream bandwidth:

* Replace `hpdcache_mem_req_write_arbiter` with a version that does not lockstep-couple the AW and W channels, i.e. does not block progress on receiving an AW until its W has been routed. In practice, there are usually two cycles of delay between AW and W, thus limiting peak possible writeout BW to 33%. See `hw/future/hpdcache_mem_req_write_arbiter_smart.sv` for a draft alternative design decoupling AW and W with a multiplexing decision FIFO.
* Integrate an autonomous stride-based prefetcher in the HPDCache, replacing the existing SW-controlled prefetcher. So far, it is only a simple stub capable of accelerating sequential streaming reads; see `hw/future/hwpf_stride_snoop_wrapper.sv`.
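The 33% figure follows from simple cycle accounting: if each W beat trails its accepted AW by two cycles, and a lockstep arbiter blocks the next AW until the W has been routed, every beat occupies three cycles. A minimal Python sketch of this accounting (illustrative only, not derived from the RTL):

```python
# Illustrative cycle-accounting model of the lockstep AW/W coupling.
# Assumption: each W beat is routed AW_TO_W_DELAY cycles after its AW is
# accepted, and a lockstep arbiter blocks the next AW until then.

def lockstep_write_bw(aw_to_w_delay: int) -> float:
    """Fraction of peak W-channel bandwidth with AW/W lockstep coupling."""
    cycles_per_beat = 1 + aw_to_w_delay  # AW cycle + delay until its W routes
    return 1.0 / cycles_per_beat

def decoupled_write_bw() -> float:
    """With AW and W decoupled (decision FIFO), W beats can go back-to-back."""
    return 1.0

print(f"lockstep : {lockstep_write_bw(2):.0%}")   # 2-cycle delay -> ~33% peak
print(f"decoupled: {decoupled_write_bw():.0%}")
```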

The effect of these improvements is small, as the HPDCache is currently bottlenecked by its downstream (refill and writeback) bandwidth; in particular, the prefetcher is almost ineffectual, as it is heavily stalled by refills on sequential reads.

Overall, the HPDCache achieves only a fraction of the available NoC BW on streaming reads and writes, even with the fixes above (see "Test Results").

## Run Testbench

We prepared a software test, `corebw.spm.elf`, that performs a streaming write followed by a streaming read (for now, from the internal scratchpad, to avoid latency-related issues). The goal is to maximize bandwidth, emulating the strongly memory-bound sections common in OSes and memory-management tasks, e.g. anything involving `memcpy` and `memset`. To run this test in QuestaSim, first build the software:

```sh
make all
```

then change to `target/sim/vsim` and launch QuestaSim (version `>=2022.3`). In the QuestaSim shell:

```tcl
source compile.cheshire_soc.tcl
set BINARY ../../../sw/tests/corebw.spm.elf
set PRELMODE 1 ;# Preload using fastest method
source start.cheshire_soc.tcl
do hpdc.do ;# Helpful wave file
run -all
```

The simulation should run for around 2.5 minutes of wall-clock time, covering a simulated time of ~3303us.

## Test Results

The test reports its results over UART (raw values printed in per mille):

* A write throughput of 38.7%
* A read throughput of 23.3%

These figures correspond to the improved write arbiter and the prefetcher working as expected, respectively.

You should see helpful waveforms for the CVA6 AXI4 NoC port, the HPDCache core-side requests, and the HPDCache control PE, illustrating the bottleneck symptoms:

* The streaming write should happen from ~948us to ~1053us
* The streaming read should happen from ~1053us to ~1231us

In both cases, the core-side interface (which includes both the core and the prefetcher) is ready to issue accesses, but is stalled by the cache itself.

## Interpretation (WIP, may be wrong)

I have superficially examined the control PE's behavior to understand the bottlenecks.

During the write, the core-side accesses seem to be slowed by the replay table (rtab). I am honestly not sure if this is avoidable through decoupled dataflow or caused by a fundamental underlying limitation (SRAM R/W exclusion? Would longer lines help?); I need to understand more about this.

During the read, the core is stalled by refills. I think these stalls could at least be reduced, as more core-side stall cycles are incurred than even worst-case SRAM R/W exclusion would impose.

The big question is why exactly these bottlenecks exist (limitation or oversight?) and if the associated coupling in the control PE can be relaxed somehow (with reasonable effort and PPA cost).
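One way to reason about the refill bottleneck is a simple duty-cycle model: if every line refill blocks the core-side pipeline instead of overlapping with it, read throughput is bounded by the ratio of useful beats to total cycles per line. A rough sketch, with all parameters chosen purely for illustration (not taken from the actual HPDCache configuration):

```python
# Rough duty-cycle model of read throughput under blocking refills.
# Assumptions (illustrative, not from the actual HPDCache configuration):
#  - beats_per_line: data beats the core can consume per cache line
#  - refill_stall:   core-side stall cycles charged to each line refill

def read_throughput(beats_per_line: int, refill_stall: int) -> float:
    """Fraction of peak core-side read bandwidth with blocking refills."""
    return beats_per_line / (beats_per_line + refill_stall)

# If refills fully overlapped with core accesses, refill_stall -> 0 and
# throughput approaches 100%; every extra blocking cycle per line costs BW.
for stall in (0, 8, 24):
    print(f"stall={stall:2d}: {read_throughput(8, stall):.1%}")
```

The point of the model is only to show why relaxing the coupling in the control PE should pay off roughly linearly in recovered stall cycles.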
64 changes: 64 additions & 0 deletions target/sim/vsim/hpdc.do
@@ -0,0 +1,64 @@
onerror {resume}

quietly WaveActivateNextPane {} 0
add wave -noupdate -divider {CVA6 NoC ports}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_req_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_resp_i}
add wave -noupdate -divider {CVA6 R channel}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_resp_i.r}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_resp_i.r_valid}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_req_o.r_ready}
add wave -noupdate -divider {CVA6 W channel}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_req_o.w}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_req_o.w_valid}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_resp_i.w_ready}
add wave -noupdate -divider <NULL>
add wave -noupdate -divider <NULL>
add wave -noupdate -divider {HPDCache core inputs}
add wave -noupdate -expand {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_valid_i}
add wave -noupdate -expand {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_ready_o}
add wave -noupdate -subitemconfig {{/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_i[1]} -expand} {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_abort_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_tag_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_pma_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_rsp_valid_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_rsp_o}
add wave -noupdate -divider <NULL>
add wave -noupdate -divider <NULL>
add wave -noupdate -divider {HPDCache control PE}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/core_req_valid_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/core_req_ready_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/rtab_req_valid_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/rtab_req_ready_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/refill_req_valid_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/refill_req_ready_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/inval_req_valid_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/inval_req_ready_o}
add wave -noupdate -divider <NULL>
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/refill_busy_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/rtab_full_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/cmo_busy_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/uc_busy_i}
add wave -noupdate -divider <NULL>
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/evt_read_req_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/evt_prefetch_req_o}
add wave -noupdate -divider <NULL>
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/evt_stall_refill_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/evt_stall_o}
TreeUpdate [SetDefaultTree]
WaveRestoreCursors {{Cursor 4} {1986284621 ps} 0}
quietly wave cursor active 1
configure wave -namecolwidth 233
configure wave -valuecolwidth 40
configure wave -justifyvalue left
configure wave -signalnamewidth 1
configure wave -snapdistance 10
configure wave -datasetprefix 0
configure wave -rowmargin 4
configure wave -childrowmargin 2
configure wave -gridoffset 0
configure wave -gridperiod 1
configure wave -griddelta 40
configure wave -timeline 0
configure wave -timelineunits ps
update
WaveRestoreZoom {1986037301 ps} {1986709258 ps}
