Commit: doc: Add HPDCACHE.md doc and waveform

# HPDCache Evaluation Findings

The HPDCache has been preliminarily integrated into the [PULP fork of CVA6](https://github.com/pulp-platform/cva6/tree/paulsc/chs-hpdcache). This Cheshire branch tests and evaluates the integration and attempts to improve external bandwidth.

## Bandwidth Improvement Attempts So Far

Two changes were made so far in an attempt to improve NoC stream bandwidth:

* Replace `hpdcache_mem_req_write_arbiter` with a version that does not lockstep-couple the AW and W channels, i.e. does not block progress on receiving an AW until its W has been routed. In practice, there are usually two cycles of delay between AW and W, limiting the peak possible writeout BW to 33%. See `hw/future/hpdcache_mem_req_write_arbiter_smart.sv` for a draft alternative design that decouples AW and W with a multiplexing decision FIFO.
* Integrate an autonomous stride-based prefetcher into the HPDCache, replacing the existing SW-controlled prefetcher. So far, it is only a simple stub capable of accelerating sequential streaming reads; see `hw/future/hwpf_stride_snoop_wrapper.sv`.

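To see where the 33% figure comes from, the following toy cycle model (an illustration of the reasoning above, not the RTL) compares the lockstep-coupled and decoupled arbiters; single-beat writes, a fixed two-cycle AW-to-W delay, and a perfectly ready downstream are simplifying assumptions:

```python
def writeout_utilization(coupled: bool, aw_to_w_delay: int = 2,
                         n_txns: int = 1000) -> float:
    """Fraction of cycles the W channel carries a beat in this toy model."""
    cycle = 0       # cycle on which the next AW is issued
    last_w = 0      # cycle on which the most recent W beat is routed
    for _ in range(n_txns):
        last_w = cycle + aw_to_w_delay
        if coupled:
            # Lockstep arbiter: the next AW waits until this W has been routed,
            # so each single-beat write occupies (aw_to_w_delay + 1) cycles.
            cycle = last_w + 1
        else:
            # Decoupled arbiter: the next AW can issue on the very next cycle;
            # once the pipeline fills, W beats stream back-to-back.
            cycle += 1
    return n_txns / (last_w + 1)

print(round(writeout_utilization(coupled=True), 2))   # 0.33
print(round(writeout_utilization(coupled=False), 2))  # 1.0
```

With coupling, the W channel is busy one cycle out of every three, matching the 33% peak writeout bandwidth stated above.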
The effects of these improvements are small, as the HPDCache is currently bottlenecked by its downstream (refill and writeback) bandwidth; the prefetcher in particular is almost ineffectual, as it is heavily stalled by refills on sequential reads.

Overall, the HPDCache achieves only a fraction of the available NoC BW on streaming reads and writes, even with the fixes above (see "Test Results").

## Running the Testbench

We prepared a software test, `corebw.spm.elf`, that does a streaming write followed by a streaming read (for now from the internal scratchpad, to avoid latency-related issues). The goal is to maximize bandwidth so as to emulate strongly memory-bound sections common in OSes and memory-management tasks, e.g. those involving `memcpy` and `memset`. To run this test in QuestaSim, first:

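The access pattern such a test exercises can be sketched as follows (an illustrative Python model only; the real `corebw.spm.elf` is bare-metal code, and the buffer size here is an assumption):

```python
# Model of a streaming-write-then-streaming-read benchmark: a sequential,
# memset-like write pass followed by a sequential, checksum-like read pass.
def stream_write(buf: bytearray, value: int = 0) -> None:
    for i in range(len(buf)):     # sequential write over the whole buffer
        buf[i] = value

def stream_read(buf: bytearray) -> int:
    total = 0
    for b in buf:                 # sequential read over the whole buffer
        total = (total + b) & 0xFFFFFFFF
    return total

buf = bytearray(4096)             # stand-in for a scratchpad region
stream_write(buf, 0xA5)
print(stream_read(buf))           # 4096 * 0xA5 = 675840
```

Both passes touch every byte exactly once in address order, which is what makes the test strongly bandwidth-bound rather than latency-bound.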
```
make all
```

then go to `target/sim/vsim` and start QuestaSim (version `>=2022.3`). In QuestaSim:

```
source compile.cheshire_soc.tcl
set BINARY ../../../sw/tests/corebw.spm.elf
set PRELMODE 1 ;# Preload using fastest method
source start.cheshire_soc.tcl
do hpdc.do ;# Helpful wave file
run -all
```

The simulation should run for around 2.5 minutes of real time and a simulated time of ~3303 us.

## Test Results

The test reports its results over UART (values are printed in per mille):

* A write throughput of 38.7%
* A read throughput of 23.3%

These figures rely on the improved write adapter and on the prefetcher working as expected, respectively.

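A per-mille figure like the ones above can be derived from a cycle count roughly as follows (a hedged sketch: the function name and the bus width of 8 bytes/cycle are illustrative assumptions, not taken from the test source):

```python
def throughput_permille(bytes_moved: int, cycles: int,
                        bus_bytes_per_cycle: int = 8) -> int:
    """Per mille of peak bus bandwidth achieved, as an integer
    (suitable for printing over a UART without float support)."""
    peak_bytes = cycles * bus_bytes_per_cycle
    return (bytes_moved * 1000) // peak_bytes

# A result of 387 per mille corresponds to the 38.7% write throughput above.
print(throughput_permille(bytes_moved=387 * 8, cycles=1000))  # 387
```

Integer per-mille arithmetic avoids any floating-point formatting in bare-metal test code.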
You should see some helpful waveforms on the CVA6 AXI4 NoC port, the HPDCache core-side requests, and the HPDCache control PE, illustrating the bottleneck symptoms:

* The streaming write should happen from ~948 us to ~1053 us
* The streaming read should happen from ~1053 us to ~1231 us

In both cases, the core-side interface (which includes both the core and the prefetcher) is ready to issue accesses but is stalled by the cache itself.

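As a quick sanity check, the waveform windows above are consistent with the reported throughputs, assuming both phases move the same amount of data (an assumption; the test source is not shown here):

```python
# Window durations taken from the timestamps listed above.
write_window_us = 1053 - 948   # 105 us
read_window_us = 1231 - 1053   # 178 us

# Data moved is proportional to (window duration) * (achieved throughput);
# if both phases move the same data, the two products should roughly match.
write_work = write_window_us * 0.387
read_work = read_window_us * 0.233
print(round(write_work, 1), round(read_work, 1))  # 40.6 41.5
```

The two products agree to within a few percent, so the longer read window is explained almost entirely by the lower read throughput.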
## Interpretation (WIP, may be wrong)

I have (superficially) looked at the control PE's behavior to understand the bottlenecks.

During the write, core-side accesses seem to be slowed by the replay table (rtab). I am honestly not sure whether this is avoidable through decoupled dataflow or caused by a fundamental underlying limitation (SRAM R/W exclusion? Would longer cache lines help?); I need to understand this better.

During the read, the core is stalled by refills. I think these stalls could at least be reduced, as more core-side stall cycles are incurred than even worst-case SRAM R/W exclusion would impose.

The big question is why exactly these bottlenecks exist (limitation or oversight?) and whether the associated coupling in the control PE can be relaxed somehow (with reasonable effort and PPA cost).
---

## Wave File: `target/sim/vsim/hpdc.do`

onerror {resume}
quietly WaveActivateNextPane {} 0
add wave -noupdate -divider {CVA6 NoC ports}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_req_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_resp_i}
add wave -noupdate -divider {CVA6 R channel}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_resp_i.r}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_resp_i.r_valid}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_req_o.r_ready}
add wave -noupdate -divider {CVA6 W channel}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_req_o.w}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_req_o.w_valid}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/noc_resp_i.w_ready}
add wave -noupdate -divider <NULL>
add wave -noupdate -divider <NULL>
add wave -noupdate -divider {HPDCache core inputs}
add wave -noupdate -expand {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_valid_i}
add wave -noupdate -expand {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_ready_o}
add wave -noupdate -subitemconfig {{/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_i[1]} -expand} {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_abort_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_tag_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_req_pma_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_rsp_valid_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/core_rsp_o}
add wave -noupdate -divider <NULL>
add wave -noupdate -divider <NULL>
add wave -noupdate -divider {HPDCache control PE}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/core_req_valid_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/core_req_ready_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/rtab_req_valid_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/rtab_req_ready_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/refill_req_valid_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/refill_req_ready_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/inval_req_valid_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/inval_req_ready_o}
add wave -noupdate -divider <NULL>
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/refill_busy_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/rtab_full_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/cmo_busy_i}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/uc_busy_i}
add wave -noupdate -divider <NULL>
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/evt_read_req_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/evt_prefetch_req_o}
add wave -noupdate -divider <NULL>
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/evt_stall_refill_o}
add wave -noupdate {/tb_cheshire_soc/fix/dut/gen_cva6_cores[0]/i_core_cva6/gen_cache_hpd/i_cache_subsystem/i_hpdcache/hpdcache_ctrl_i/hpdcache_ctrl_pe_i/evt_stall_o}
TreeUpdate [SetDefaultTree]
WaveRestoreCursors {{Cursor 4} {1986284621 ps} 0}
quietly wave cursor active 1
configure wave -namecolwidth 233
configure wave -valuecolwidth 40
configure wave -justifyvalue left
configure wave -signalnamewidth 1
configure wave -snapdistance 10
configure wave -datasetprefix 0
configure wave -rowmargin 4
configure wave -childrowmargin 2
configure wave -gridoffset 0
configure wave -gridperiod 1
configure wave -griddelta 40
configure wave -timeline 0
configure wave -timelineunits ps
update
WaveRestoreZoom {1986037301 ps} {1986709258 ps}