RV5 is a complete System-on-Chip (SoC) implementation of a hardware-accelerated fuzzy string search engine built around a from-scratch RV32IMFB RISC-V processor core. The entire design is written in Verilog-2001, synthesized for the Digilent Nexys A7 (Xilinx Artix-7 xc7a100tcsg324-1), and runs at a stable 50 MHz derived from the on-board 100 MHz oscillator.
The fundamental goal: demonstrate that a modest FPGA can perform real-time, encrypted, compressed, fuzzy dictionary search with all acceleration logic tightly coupled to a real RISC-V pipeline through a clean MMIO interface. Every entry is Bloom-filtered, LZ77-compressed, and AES-128-encrypted at ingest. At query time, the pipeline reverses — decrypt, decompress, then compute Levenshtein edit distance in hardware — returning the best fuzzy match in microseconds.
- Project Overview
- Architecture Overview
- CPU Core — RV32IMFB Five-Stage Pipeline
- L1 Cache Hierarchy
- Hardware Coprocessors
- MMIO Address Map
- CSR Performance Counters
- Firmware & Data Pipeline
- Host Scripts
- FPGA Board I/O
- Repository Structure
- Building & Programming
- Demo Data & Test Vectors
- Performance Benchmarks
- Testbench Suite
- Design Phases
| Feature | Detail |
|---|---|
| ISA | RV32IMFB — Integer, Multiply/Divide, Float (F-ext), Bit-Manip (Zbb) |
| Pipeline | 5-stage (IF→ID→EX→MEM→WB) with full forwarding and hazard detection |
| Branch Prediction | 2-bit saturating BHT + Branch Target Buffer (BTB) + RAS for calls/returns |
| Cache | L1 I-Cache (64-line direct-mapped) + L1 D-Cache (64-line write-through) |
| Coprocessors | Bloom Filter · LZ77 · AES-128 Enc/Dec · Levenshtein Systolic Array |
| DMA Engine | Search Sequencer driving D-BRAM port B autonomously |
| UART Interface | 115200 baud full-duplex, firmware command shell |
| 7-Segment Telemetry | 12-page multiplexed display (PC, miss counts, search result, cycles…) |
| LED Modes | Telemetry, PC heatmap, Knight-rider idle, SUCC/FAIL interactive |
flowchart TB
subgraph BOARD["Nexys A7 — xc7a100tcsg324-1"]
subgraph CLK["Clock and Reset"]
C1["100 MHz OSC"]
C2["Divide-by-2 Toggle FF + BUFG\nclk_50mhz"]
C3["Reset Synchronizer\ncpu_reset_n active LOW"]
C1 --> C2
end
subgraph CPU["RISCV_CPU — RV32IMFB"]
P1["IF — I-Cache · BHT · BTB · RAS"]
P2["ID — RF · ImmGen · Control · Hazard"]
P3["EX — ALU · BPU · Div · Zbb · FPU"]
P4["MEM — D-Cache · LDU · MMIO"]
P5["WB — Mux · Regfile Write"]
P1 --> P2 --> P3 --> P4 --> P5
end
subgraph MEM["Memory"]
M1["I_BRAM — 64 KB firmware"]
M2["D_BRAM — 64 KB data\ndual-port + DMA port B"]
end
subgraph CACHE["L1 Cache Hierarchy"]
CA1["I-Cache\n64-line direct-mapped\n8 words per line"]
CA2["D-Cache\n64-line write-through\n8 words per line"]
end
subgraph MMIO["MMIO Coprocessors — 0xF000_0000+"]
MM1["UART\n115200 baud TX/RX"]
MM2["Bloom Filter\n8192-bit · 3 Murmur hashes"]
MM3["LZ77\n256B window · 256 parallel cmp"]
MM4["AES-128\nEnc + Dec · 10-cycle iterative"]
MM5["Fuzzy Search\nLevenshtein Systolic Array"]
MM6["Search Sequencer\nDMA Engine"]
end
subgraph IO["Board I/O"]
IO1["UART RXD/TXD\nCP2102 USB Bridge"]
IO2["LED 15:0\nTelemetry / SUCC/FAIL"]
IO3["SW 15:0\nMode / Page Select"]
IO4["7-Seg Display\n8-digit · 12 telemetry pages"]
end
end
C2 --> CPU
CPU -- "imem" --> CA1 --> M1
CPU -- "dmem" --> CA2 --> M2
CPU -- "MMIO 0xF000_0000" --> MM1 & MM2 & MM3 & MM4 & MM5 & MM6
MM6 -- "DMA port B" --> M2
MM1 <--> IO1
CPU --> IO2
IO3 --> CPU
MM5 --> IO4
style BOARD fill:#1a1a2e,fill-opacity:0.05,stroke:#444,stroke-width:2px
style CLK fill:#43a047,fill-opacity:0.1,stroke:#43a047,stroke-width:2px
style CPU fill:#1e88e5,fill-opacity:0.1,stroke:#1e88e5,stroke-width:2px
style MEM fill:#6d4c41,fill-opacity:0.1,stroke:#6d4c41,stroke-width:2px
style CACHE fill:#00897b,fill-opacity:0.1,stroke:#00897b,stroke-width:2px
style MMIO fill:#8e24aa,fill-opacity:0.1,stroke:#8e24aa,stroke-width:2px
style IO fill:#fb8c00,fill-opacity:0.1,stroke:#fb8c00,stroke-width:2px
The core (rtl/core/RISCV_CPU.v) implements a textbook 5-stage in-order pipeline with full data forwarding, hazard detection, branch prediction, and optional FPU stall integration.
flowchart LR
subgraph IF["IF — Instruction Fetch"]
IF1["PC Register"]
IF2["I-Cache\nBHT · BTB · RAS"]
IF1 --> IF2
end
subgraph ID["ID — Decode"]
ID1["Register File\n32 x 32-bit"]
ID2["ImmGen\nI/S/B/U/J types"]
ID3["Control Unit\nopcode decode"]
ID4["Hazard Unit\nload-use · FPU · UART"]
end
subgraph EX["EX — Execute"]
EX1["ALU 32-bit"]
EX2["BPU Branch Resolve"]
EX3["Div Unit RV32M"]
EX4["Zbb Bit-Manip"]
EX5["FPU IEEE 754"]
end
subgraph MEM["MEM — Memory"]
MEM1["D-Cache Write-Through"]
MEM2["LDU byte/half/word"]
MEM3["MMIO 0xF000_0000+"]
end
subgraph WB["WB — Writeback"]
WB1["WB Mux\nALU / MEM / PC+4 / Imm"]
WB2["RF Write"]
WB1 --> WB2
end
IF -->|"IF/ID reg"| ID
ID -->|"ID/EX reg"| EX
EX -->|"EX/MEM reg"| MEM
MEM -->|"MEM/WB reg"| WB
WB -->|"forward WB to EX"| EX
MEM -->|"forward MEM to EX"| EX
EX -->|"branch/jump flush"| IF
ID4 -->|"load-use stall"| IF
style IF fill:#1e88e5,fill-opacity:0.1,stroke:#1e88e5,stroke-width:2px
style ID fill:#43a047,fill-opacity:0.1,stroke:#43a047,stroke-width:2px
style EX fill:#8e24aa,fill-opacity:0.1,stroke:#8e24aa,stroke-width:2px
style MEM fill:#fb8c00,fill-opacity:0.1,stroke:#fb8c00,stroke-width:2px
style WB fill:#e53935,fill-opacity:0.1,stroke:#e53935,stroke-width:2px
| Module | File | Description |
|---|---|---|
| `PC` | `rtl/core/PC.v` | Program Counter register with stall/freeze support |
| `IF_ID` | `rtl/core/IF_ID.v` | IF→ID pipeline register with flush/stall |
| `RF` | `rtl/core/RF.v` | 32×32-bit register file (x0 hardwired 0) |
| `ImmGen` | `rtl/core/ImmGen.v` | Immediate generator (I/S/B/U/J types) |
| `Control` | `rtl/core/Control.v` | Main decode: opcode→control signals |
| `ALU` | `rtl/core/ALU.v` | 32-bit ALU: ADD/SUB/SLT/AND/OR/XOR/SHL/SHR/AUIPC |
| `ALU_Control` | `rtl/core/ALU_Control.v` | ALU function decode from funct3/funct7 |
| `BPU` | `rtl/core/BPU.v` | Branch Processing Unit: evaluates taken/not-taken |
| `BHT` | `rtl/core/BHT.v` | 2-bit saturating counter Branch History Table |
| `BTB` | `rtl/core/BTB.v` | Branch Target Buffer: caches branch PCs |
| `RAS_Unit` | `rtl/core/RAS_Unit.v` | Return Address Stack (JAL/JALR call/return) |
| `Hazard_Unit` | `rtl/core/Hazard_Unit.v` | Load-use stall, FPU stall, UART stall |
| `Forwarding_Unit` | `rtl/core/Forwarding_Unit.v` | MEM→EX and WB→EX forwarding |
| `ID_EX` | `rtl/core/ID_EX.v` | ID→EX pipeline register |
| `EX_MEM` | `rtl/core/EX_MEM.v` | EX→MEM pipeline register |
| `MEM_WB` | `rtl/core/MEM_WB.v` | MEM→WB pipeline register |
| `LDU` | `rtl/core/LDU.v` | Load Data Unit: lb/lbu/lh/lhu/lw alignment |
| `CSR` | `rtl/core/CSR.v` | Control & Status Registers + perf counters |
| `Div_Unit` | `rtl/core/Div_Unit.v` | RV32M multi-cycle divider (DIV/DIVU/REM/REMU) |
| `Zbb_Unit` | `rtl/core/Zbb_Unit.v` | Bit-Manipulation extension (CLZ/CTZ/CPOP/ANDN…) |
flowchart TB
subgraph PRED["Prediction — IF Stage"]
B1["PC"]
B2["BHT\n2-bit saturating counter\n64 entries"]
B3["BTB\nBranch Target Buffer\n16 entries"]
B4["RAS\nReturn Address Stack\npeek for JALR returns"]
B1 --> B2 & B3 & B4
B5{"Predict\nTaken?"}
B2 --> B5
end
subgraph RESOLVE["Resolution — EX Stage"]
R1["BPU evaluates\nactual branch outcome"]
R2{"Mismatch?"}
R3["Correct prediction\nno penalty cycles"]
R4["Wrong prediction\nflush IF and ID\n2-cycle penalty"]
R1 --> R2
R2 -- No --> R3
R2 -- Yes --> R4
R5["Update BHT\nsaturating increment/decrement"]
R6["Update BTB\nbranch PC target"]
R4 --> R5 & R6
end
B5 -- "Taken → use BTB/RAS PC" --> RESOLVE
B5 -- "Not Taken → PC + 4" --> RESOLVE
style PRED fill:#1e88e5,fill-opacity:0.1,stroke:#1e88e5,stroke-width:2px
style RESOLVE fill:#e53935,fill-opacity:0.1,stroke:#e53935,stroke-width:2px
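The 2-bit saturating counter behaviour in the diagram above can be sketched as a small behavioural model. This is an illustration rather than a translation of `BHT.v`: the `BranchHistoryTable` class, the PC[7:2] index, the weakly-not-taken reset state, and updating on every resolved branch are all assumptions.

```python
BHT_SIZE = 64  # matches `BHT_SIZE in SYSTEM_DEF.vh


class BranchHistoryTable:
    """Behavioural 2-bit saturating-counter predictor (hypothetical model)."""

    def __init__(self, size=BHT_SIZE, reset_state=1):
        # Counter states: 0/1 predict not-taken, 2/3 predict taken.
        # reset_state=1 (weakly not-taken) is an assumption.
        self.size = size
        self.counters = [reset_state] * size

    def index(self, pc):
        # Assumption: index with the low bits of the word-aligned PC.
        return (pc >> 2) % self.size

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(bht, pc, outcomes):
    """Replay a list of actual outcomes for one branch and count misses."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses
```

Replaying a loop branch that is taken nine times and then falls through (`[True] * 9 + [False]`) costs two mispredictions on a cold table in this model, and one on each later pass, because the counter never drops below the weakly-taken state between passes.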
The F-extension (ENABLE_FPU compile flag) is fully integrated with a dedicated 32-entry floating-point register file (fregfile) and 6 specialized sub-units:
| Sub-Unit | File | Operations |
|---|---|---|
| `fpu_top` | `rtl/fpu/fpu_top.v` | Top-level op mux |
| `fpu_addsub` | `rtl/fpu/fpu_addsub.v` | FADD.S, FSUB.S |
| `fpu_mul` | `rtl/fpu/fpu_mul.v` | FMUL.S |
| `fpu_div` | `rtl/fpu/fpu_div.v` | FDIV.S (iterative, multi-cycle) |
| `fpu_sqrt` | `rtl/fpu/fpu_sqrt.v` | FSQRT.S (iterative) |
| `fpu_cmp` | `rtl/fpu/fpu_cmp.v` | FEQ/FLT/FLE, FMIN/FMAX |
| `fpu_cvt` | `rtl/fpu/fpu_cvt.v` | FCVT.W.S, FCVT.S.W, FMV |
| `fregfile` | `rtl/fpu/fregfile.v` | 32×32-bit float register file (f0–f31) |
FPU exception flags (NV/DZ/OF/UF/NX) accumulate in fcsr/fflags CSRs. FDIV and FSQRT trigger pipeline stalls via the Hazard Unit until fpu_done is asserted.
The Zbb_Unit implements: CLZ, CTZ, CPOP, ANDN, ORN, XNOR, MIN, MAX, MINU, MAXU, ZEXT.H, SEXT.B, SEXT.H, ROL, ROR, REV8, ORC.B.
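For checking testbench output, a few of these operations can be modelled on the host. The functions below follow the standard RV32 Zbb semantics (32-bit operands, shift amounts taken modulo 32); the function names are illustrative and nothing here is taken from `Zbb_Unit.v`.

```python
MASK32 = 0xFFFFFFFF


def clz(x):
    """Count leading zeros of a 32-bit value (32 when x == 0)."""
    return 32 - (x & MASK32).bit_length()


def ctz(x):
    """Count trailing zeros of a 32-bit value (32 when x == 0)."""
    x &= MASK32
    if x == 0:
        return 32
    return (x & -x).bit_length() - 1  # isolate the lowest set bit


def cpop(x):
    """Population count: number of set bits."""
    return bin(x & MASK32).count("1")


def ror(x, shamt):
    """Rotate right by shamt[4:0]."""
    x &= MASK32
    s = shamt & 31
    return ((x >> s) | (x << (32 - s))) & MASK32


def rol(x, shamt):
    """Rotate left by shamt[4:0], expressed as a right rotation."""
    return ror(x, (32 - (shamt & 31)) & 31)


def rev8(x):
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes((x & MASK32).to_bytes(4, "little"), "big")


def orc_b(x):
    """Set each byte to 0xFF if any of its bits is set, else 0x00."""
    out = 0
    for i in range(4):
        if (x >> (8 * i)) & 0xFF:
            out |= 0xFF << (8 * i)
    return out


def andn(a, b):
    """a AND (NOT b), truncated to 32 bits."""
    return a & ~b & MASK32
```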
Both caches are direct-mapped with a 64-line × 8-word (2 KB) capacity, implemented using Xilinx distributed RAM for zero-latency hits.
31 11 10 5 4 2 1 0
┌─────────────┬───────────┬────────┬────┐
│ TAG (21b) │ INDEX (6b)│ OFS(3b)│ -- │
└─────────────┴───────────┴────────┴────┘
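The field layout above can be expressed as a short helper, which is handy when reasoning about which addresses evict each other. This is a sketch of the bit slicing only; `split_address` and `CacheAddr` are illustrative names, not identifiers from the RTL.

```python
from collections import namedtuple

CacheAddr = namedtuple("CacheAddr", "tag index word byte")


def split_address(addr):
    """Split a 32-bit byte address into the fields shown in the layout."""
    addr &= 0xFFFFFFFF
    return CacheAddr(
        tag=addr >> 11,             # bits 31:11, 21 bits
        index=(addr >> 5) & 0x3F,   # bits 10:5, selects 1 of 64 lines
        word=(addr >> 2) & 0x7,     # bits 4:2, selects 1 of 8 words
        byte=addr & 0x3,            # bits 1:0, byte within the word
    )


# Capacity check: 64 lines x 8 words x 4 bytes.
CACHE_BYTES = 64 * 8 * 4
```

Two addresses that are 2 KB apart, such as `0x0000_0020` and `0x0000_0820`, map to the same index with different tags, so in a direct-mapped cache they compete for one line.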
flowchart LR
subgraph ICACHE["I-Cache — icache.v (Read-Only, Fill-on-Miss)"]
IC1(["S_IDLE\nCheck hit"])
IC2(["S_FILL0\nIssue BRAM addr 0"])
IC3(["S_WAIT\nBRAM 1-cycle latency"])
IC4(["S_FILL\nStore word, next addr"])
IC5(["S_LAST\nUpdate tag and valid"])
IC1 -- "HIT — rdata combinational\n0 penalty cycles" --> IC1
IC1 -- "MISS — assert stall_out\npulse miss_pulse to CSR" --> IC2
IC2 --> IC3 --> IC4
IC4 -- "words 1 through 6" --> IC4
IC4 -- "word 7" --> IC5
IC5 -- "deassert stall_out" --> IC1
end
subgraph DCACHE["D-Cache — dcache.v (Write-Through Policy)"]
DC1(["S_IDLE"])
DC2(["S_WRITE\nWrite-through to BRAM"])
DC3(["S_FILL0\nIssue BRAM addr 0"])
DC4(["S_FILL\nStore word"])
DC5(["S_LAST\nUpdate tag and valid"])
DC1 -- "Write HIT — update cache\nand BRAM immediately" --> DC2 --> DC1
DC1 -- "Read MISS\nassert stall_out" --> DC3
DC3 --> DC4
DC4 -- "words 1 through 6" --> DC4
DC4 -- "word 7" --> DC5 --> DC1
end
style ICACHE fill:#00897b,fill-opacity:0.1,stroke:#00897b,stroke-width:2px
style DCACHE fill:#1e88e5,fill-opacity:0.1,stroke:#1e88e5,stroke-width:2px
All coprocessors live in the MMIO address space (0xF000_0000+) and are accessed through memory-mapped registers. Every coprocessor asserts a CSR pulse on completion so the firmware reads hardware-verified performance counters with zero software overhead.
File: rtl/bloom/bloom_filter.v | MMIO base: 0xF000_0010
The Bloom filter provides a probabilistic membership test (zero false negatives) to short-circuit the expensive AES+LZ77+Fuzzy pipeline for guaranteed non-matching queries. Uses a two-stage Murmur-style mixer to fully diffuse bit differences across the 13-bit hash output space before applying 3 independent polynomial hash functions (golden ratio prime + two MurmurHash3 primes).
flowchart LR
subgraph BF["Bloom Filter — bloom_filter.v"]
S0(["S_FLUSH\nZero-clear all\n256 x 32-bit BRAM\nwords on reset"])
S1(["S_IDLE\nAwait BF_DATA write"])
S2(["S_HASH\nCompute h1 h2 h3\nvia Murmur mix"])
S3(["S_READ1\nRead BRAM word 1"])
S4(["S_READ2\nRead BRAM word 2"])
S5(["S_READ3\nRead BRAM word 3"])
S6(["S_CHECK_SET\nQuery or Insert?"])
S7(["S_WRITE2\nSet bit 2"])
S8(["S_WRITE3\nSet bit 3\npulse bloom_insert_pulse"])
S9(["S_DONE"])
S0 -- "256 cycles" --> S1
S1 -- "BF_DATA write" --> S2
S2 --> S3 --> S4 --> S5 --> S6
S6 -- "QUERY\nall 3 bits set?\npulse bloom_reject_pulse" --> S9
S6 -- "INSERT\nset bit 1" --> S7 --> S8 --> S9
S9 --> S1
end
style BF fill:#43a047,fill-opacity:0.1,stroke:#43a047,stroke-width:2px
| Register | Address | R/W | Description |
|---|---|---|---|
| `BF_DATA` | `0xF000_0010` | W | Write hashed key; starts operation |
| `BF_RESULT` | `0xF000_0014` | R | [0] = 1 if member |
| `BF_CTRL` | `0xF000_0018` | W | [0] = 0 query, 1 insert |
| `BF_STATUS` | `0xF000_001C` | R | [0] = 1 when done |
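A software model makes the insert/query behaviour and the "zero false negatives" property easy to check. The sketch below keeps the 8192-bit array (256 × 32-bit words) and three 13-bit positions per key from the description above, but the exact mixer and the way the three constants are applied are assumptions: it uses MurmurHash3's `fmix32` finalizer and a multiply-and-take-top-bits step, which may differ from `bloom_filter.v`.

```python
MASK32 = 0xFFFFFFFF
BLOOM_WORDS = 256  # 256 x 32-bit words = 8192 bits


def mix32(h):
    """MurmurHash3 fmix32 finalizer (assumed stand-in for the RTL mixer)."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def bit_positions(key):
    """Three 13-bit bit positions for a 32-bit key (assumed construction)."""
    m = mix32(key)
    constants = (0x9E3779B1, 0x85EBCA6B, 0xC2B2AE35)
    # Keep the top 13 bits of each 32-bit product: 0..8191.
    return [((m * c) & MASK32) >> 19 for c in constants]


class BloomModel:
    def __init__(self):
        self.words = [0] * BLOOM_WORDS

    def insert(self, key):
        for p in bit_positions(key):
            self.words[p >> 5] |= 1 << (p & 31)

    def query(self, key):
        # Member only if all three bits are set; any clear bit is a reject.
        return all((self.words[p >> 5] >> (p & 31)) & 1
                   for p in bit_positions(key))
```

Every inserted key sets all of its own bits, so a later query for that key can never be rejected. False positives remain possible once unrelated keys have set the same three bits.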
File: rtl/lz77/lz77_comp.v | MMIO base: 0xF000_0020
A hardware LZ77 sliding-window compressor/decompressor with 256 parallel first-byte comparators for single-cycle first-hit detection and sequential byte-by-byte match extension.
- Sliding window: 256 bytes | Lookahead buffer: 15 bytes | Min match: 3 bytes
- Output FIFO: 64 tokens deep
- Token format: `{distance[7:0], length[7:0]}` for back-refs; `{0x00, literal}` for literals
| Register | Address | R/W | Description |
|---|---|---|---|
| `LZ77_IN` | `0xF000_0020` | W | Push input byte |
| `LZ77_OUT` | `0xF000_0024` | R | Pop output token (16-bit) |
| `LZ77_STATUS` | `0xF000_0028` | R | [2] DONE [1] IN_READY [0] OUT_VALID |
| `LZ77_CTRL` | `0xF000_002C` | W | [1] FLUSH [0] MODE (0=compress, 1=decompress) |
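The token format can be exercised with a small round-trip model. This is a greedy software sketch under stated assumptions, not a cycle-accurate model of `lz77_comp.v`: distance 0 is treated as the literal marker (so usable distances are 1 to 255), match length is capped at the 15-byte lookahead, and the earliest longest match wins.

```python
MAX_DISTANCE = 255  # distance 0 marks a literal (assumption from the format)
MAX_LEN = 15        # lookahead buffer size
MIN_MATCH = 3


def lz77_compress(data):
    """Return a list of 16-bit tokens for a bytes object."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - MAX_DISTANCE), i):
            if data[j] != data[i]:
                continue  # first-byte compare (parallel in hardware)
            length = 1
            while (length < MAX_LEN and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1  # sequential match extension
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= MIN_MATCH:
            tokens.append((best_dist << 8) | best_len)  # {distance, length}
            i += best_len
        else:
            tokens.append(data[i])                      # {0x00, literal}
            i += 1
    return tokens


def lz77_decompress(tokens):
    """Rebuild the original bytes from a token list."""
    out = bytearray()
    for t in tokens:
        dist, low = t >> 8, t & 0xFF
        if dist == 0:
            out.append(low)
        else:
            for _ in range(low):
                out.append(out[-dist])  # byte-wise copy allows overlap
    return bytes(out)
```

For the input `b"abcabcabcabcxyz"` this model emits three literals, one back-reference token `0x0309` (distance 3, length 9), and three more literals.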
File: rtl/aes/aes128_enc.v | MMIO base: 0xF000_0030
A NIST-compliant AES-128 block cipher supporting both encryption and decryption. It is implemented as a pure Verilog-2001 iterative design: one round per clock cycle, 10 rounds total. The S-Box and inverse S-Box use Verilog case statements (no SystemVerilog local arrays).
sequenceDiagram
participant FW as Firmware C
participant AES as AES-128 Coprocessor
participant CSR as CSR Counters
FW->>AES: Write AES_CTRL (0=encrypt / 1=decrypt)
FW->>AES: Write AES_KEY[0..3] — KEY3 triggers key expansion
AES->>AES: Key schedule — 10 round keys computed (10 cycles)
FW->>AES: Poll AES_STATUS until idle
FW->>AES: Write AES_DATA[0..2]
FW->>AES: Write AES_DATA[3] — triggers encrypt or decrypt
AES->>AES: 10 AES rounds (SubBytes, ShiftRows, MixCols, AddRoundKey)
AES->>CSR: Pulse aes_op_pulse — aes_ops_cnt increments
FW->>AES: Poll AES_STATUS until idle
FW->>AES: Read AES_OUT[0..3] — 128-bit result
Throughput: 16 bytes / 13 cycles × 50 MHz ≈ 61.5 MB/s
| Register | Address | R/W | Description |
|---|---|---|---|
| `AES_KEY[0..3]` | `0xF000_0030–003C` | W | 128-bit key (write KEY3 triggers expansion) |
| `AES_DATA[0..3]` | `0xF000_0040–004C` | W | Plaintext/ciphertext (write DATA3 starts op) |
| `AES_OUT[0..3]` | `0xF000_0050–005C` | R | 128-bit result |
| `AES_STATUS` | `0xF000_0060` | R | [0] = 1 idle/done, 0 busy |
| `AES_CTRL` | `0xF000_006C` | W | [0] = 0 encrypt, 1 decrypt |
Files: rtl/fuzzy/fuzzy_search_top.v, rtl/fuzzy/lev_array.v, rtl/fuzzy/lev_pe.v
MMIO base: 0xF000_0070
The crown jewel of RV5: a fully pipelined Levenshtein edit-distance engine implemented as a systolic processing element array. Each lev_pe computes one DP table cell. The systolic wavefront propagates diagonally so all cells on an anti-diagonal compute in parallel — achieving O(n) clock cycles for O(n²) work.
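The wavefront idea can be checked in software by filling the DP table one anti-diagonal at a time. In the sketch below every cell on diagonal `i + j = d` depends only on diagonals `d - 1` and `d - 2`, which is what lets the hardware evaluate a whole diagonal in one step. The function name and the returned step count are illustrative; the step count is a property of this model, not a cycle-accurate claim about `lev_array.v`.

```python
def levenshtein_wavefront(a, b):
    """Return (edit_distance, wavefront_steps) for strings a and b."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # delete i characters
    for j in range(m + 1):
        dp[0][j] = j  # insert j characters

    steps = 0
    for d in range(2, n + m + 1):  # anti-diagonal where i + j == d
        lo, hi = max(1, d - m), min(n, d - 1)
        if lo > hi:
            continue
        # Cells on this diagonal are independent of one another, so a
        # systolic array can compute them all in the same step.
        for i in range(lo, hi + 1):
            j = d - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
        steps += 1
    return dp[n][m], steps
```

For two non-empty strings the model takes `len(a) + len(b) - 1` steps, so a full 32 × 32 comparison needs 63 wavefront steps for 1,024 cells.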
flowchart TB
subgraph FUZZY["Fuzzy Search Coprocessor"]
subgraph STR["String Input via MMIO"]
SA["FUZZY_A_BASE 0x70 to 0x8C\nQuery string — 8 words x 4 bytes"]
SB["FUZZY_B_BASE 0x90 to 0xAC\nCatalog string — 8 words x 4 bytes"]
SL["FUZZY_LEN_A and FUZZY_LEN_B\nlength registers"]
end
subgraph ARR["lev_array — 32 x 32 Systolic Grid"]
PE1["PE[0,0]\ndp[0][0]"]
PE2["PE[0,1]\ndp[0][1]"]
PE3["PE[1,0]\ndp[1][0]"]
PE4["PE[1,1]\ndp[1][1]"]
PE5["... 32 x 32 PEs total\nwavefront diagonal compute"]
PE1 --> PE2
PE1 --> PE3
PE3 --> PE4
PE2 --> PE5
PE4 --> PE5
end
subgraph RES["Result Output"]
R1["lev_result[5:0]\nedit distance"]
R2["FUZZY_STATUS 0xB4\ndone = 1"]
R3["FUZZY_EDIT_DIST 0xB8\nread result"]
R4["fuzzy_op_pulse\nincrement CSR counter"]
end
STR -- "FUZZY_CTRL = 1\nstart pulse" --> ARR
ARR -- "wavefront complete" --> RES
end
style FUZZY fill:#8e24aa,fill-opacity:0.1,stroke:#8e24aa,stroke-width:2px
style STR fill:#1e88e5,fill-opacity:0.1,stroke:#1e88e5,stroke-width:2px
style ARR fill:#e53935,fill-opacity:0.1,stroke:#e53935,stroke-width:2px
style RES fill:#43a047,fill-opacity:0.1,stroke:#43a047,stroke-width:2px
String packing: 4 characters per 32-bit word, little-endian — Word 0 @ 0x70: { str[3], str[2], str[1], str[0] } through Word 7 @ 0x8C: { str[31], str[30], str[29], str[28] }.
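On the host side the same packing can be reproduced with `struct`, which is useful for building expected register values in tests. `pack_fuzzy_string` is a hypothetical helper, and zero-padding the unused bytes is an assumption; the length registers tell the hardware how many characters are valid.

```python
import struct


def pack_fuzzy_string(text, words=8):
    """Pack an ASCII string into little-endian 32-bit words, 4 chars each."""
    raw = text.encode("ascii")
    if len(raw) > 4 * words:
        raise ValueError("string longer than %d bytes" % (4 * words))
    raw = raw.ljust(4 * words, b"\x00")  # zero padding is an assumption
    return list(struct.unpack("<%dI" % words, raw))
```

With this layout `"ABCD"` packs into word 0 as `0x44434241`: `str[0]` lands in the least significant byte and `str[3]` in the most significant.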
| Register | Address | R/W | Description |
|---|---|---|---|
| `FUZZY_A_BASE` | `0xF000_0070–008C` | W | Query string (8 words × 4 bytes) |
| `FUZZY_B_BASE` | `0xF000_0090–00AC` | W | Catalog string (8 words × 4 bytes) |
| `FUZZY_CTRL` | `0xF000_00B0` | W | [0] = 1 start computation |
| `FUZZY_STATUS` | `0xF000_00B4` | R | [0] = 1 done |
| `FUZZY_EDIT_DIST` | `0xF000_00B8` | R | [5:0] edit distance result |
| `FUZZY_LEN_A` | `0xF000_00BC` | W | [5:0] query length |
| `FUZZY_LEN_B` | `0xF000_00C0` | W | [5:0] catalog length |
| `FUZZY_RESULT_HOLD` | `0xF000_00D0` | W | Latch result for 7-seg page 1 |
File: rtl/fuzzy/search_sequencer.v | MMIO base: 0xF000_00E0
The Search Sequencer is an autonomous DMA controller that scans the entire D-BRAM dictionary without CPU intervention, driving D-BRAM port B directly and maintaining internal best-match tracking across all stored entries. It supports 4-way parallel acceleration buses for AES, LZ77, and fuzzy engines — fully wired and ready for multi-engine expansion.
Base: 0xF000_0000
Offset Register Dir Description
────────────────────────────────────────────────────────────────────
0x000 UART_TX_DATA W [7:0] Byte to transmit
0x004 UART_RX_DATA R [7:0] Received byte
0x008 UART_STATUS R [1] TX_READY [0] RX_VALID
0x010 BF_DATA W [31:0] Hashed key — starts op
0x014 BF_RESULT R [0] 1=member
0x018 BF_CTRL W [0] 0=query 1=insert
0x01C BF_STATUS R [0] 1=done
0x020 LZ77_IN W [7:0] Input byte
0x024 LZ77_OUT R [15:0] Output token
0x028 LZ77_STATUS R [2] DONE [1] IN_READY [0] OUT_VALID
0x02C LZ77_CTRL W [1] FLUSH [0] MODE (0=comp 1=decomp)
0x030 AES_KEY0 W Key[127:96]
0x034 AES_KEY1 W Key[95:64]
0x038 AES_KEY2 W Key[63:32]
0x03C AES_KEY3 W Key[31:0] — triggers key expansion
0x040 AES_DATA0 W Plaintext[127:96]
0x044 AES_DATA1 W Plaintext[95:64]
0x048 AES_DATA2 W Plaintext[63:32]
0x04C AES_DATA3 W Plaintext[31:0] — triggers enc/dec
0x050 AES_OUT0 R Result[127:96]
0x054 AES_OUT1 R Result[95:64]
0x058 AES_OUT2 R Result[63:32]
0x05C AES_OUT3 R Result[31:0]
0x060 AES_STATUS R [0] 1=idle/done
0x06C AES_CTRL W [0] 0=encrypt 1=decrypt
0x070 FUZZY_A_BASE W Query string word 0
... (8 words x 4 bytes)
0x090 FUZZY_B_BASE W Catalog string word 0
... (8 words x 4 bytes)
0x0B0 FUZZY_CTRL W [0] 1=start
0x0B4 FUZZY_STATUS R [0] 1=done
0x0B8 FUZZY_EDIT_DIST R [5:0] edit distance
0x0BC FUZZY_LEN_A W [5:0] query length
0x0C0 FUZZY_LEN_B W [5:0] catalog length
0x0D0 FUZZY_RESULT_HOLD W [5:0] hold result for 7-seg page 1
0x0E0 SEQ_* Search Sequencer DMA registers
0x0F0 SYS_SW R [15:0] SW[15:0] readback
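For log analysis or simulation traces, the map can be turned into a small lookup that names the block an address belongs to. The region boundaries below are inferred from the register offsets listed above and are assumptions; the actual decode in `mmio_decode.v` may be coarser or finer.

```python
MMIO_BASE = 0xF0000000

# (first offset, last offset, block name), inferred from the map above.
REGIONS = [
    (0x000, 0x00F, "UART"),
    (0x010, 0x01F, "BLOOM"),
    (0x020, 0x02F, "LZ77"),
    (0x030, 0x06F, "AES"),
    (0x070, 0x0DF, "FUZZY"),
    (0x0E0, 0x0EF, "SEQ"),
    (0x0F0, 0x0FF, "SYS"),
]


def decode(addr):
    """Return (block, offset) for an MMIO address, or None otherwise."""
    if (addr & 0xFFFFFF00) != MMIO_BASE:
        return None  # outside the 256-byte MMIO window
    offset = addr & 0xFF
    for first, last, name in REGIONS:
        if first <= offset <= last:
            return name, offset
    return None
```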
The CSR module (rtl/core/CSR.v) provides 14 readable hardware counters, each incremented by a single-cycle pulse from the relevant hardware unit:
| Address | Name | Description |
|---|---|---|
| `0xC00` | `rdcycle` | CPU cycle counter (auto-increment every clock) |
| `0xC02` | `minstret` | Instructions retired (WB stage pulse) |
| `0xB00` | `branch_miss_cnt` | Branch mispredictions |
| `0xB01` | `icache_miss_cnt` | I-cache misses |
| `0xB02` | `dcache_miss_cnt` | D-cache misses |
| `0xB03` | `bloom_reject_cnt` | Bloom filter rejections |
| `0xB04` | `aes_ops_cnt` | AES encrypt/decrypt operations |
| `0xB05` | `bloom_inserts_cnt` | Bloom filter insertions |
| `0xB06` | `fuzzy_ops_cnt` | Fuzzy search completions |
| `0xB07` | `total_bytes_cnt` | Total bytes ingested/queried |
| `0x001` | `fflags` | FP exception flags (NV/DZ/OF/UF/NX) |
| `0x002` | `frm` | FP rounding mode |
| `0x003` | `fcsr` | Combined fflags + frm |
| SW[15:12] | Page Content |
|---|---|
| `0x0` | Current PC (instruction address) |
| `0x1` | Held fuzzy search result (edit distance) |
| `0x2` | Last MMIO read data word |
| `0x3` | I/O loopback: SW + LED state |
| `0x6` | D-Cache miss count |
| `0x8` | Cycle counter (bits 31:0) |
| `0x9` | Cycle counter (bits 63:32) |
| `0xA` | UART TX stall status |
| `0xB` | Raw switch state |
| `0xC` | Magic word `0xC0DEC0DE` |
| `0xF` | Board ID `0xA7A7A7A7` |
File: firmware/main.c
The firmware implements a complete UART command shell. Two primary operations — INGEST and QUERY — each drive a multi-stage hardware pipeline through MMIO register writes.
flowchart TB
subgraph HOST["Host PC"]
H1["python3 host_ingest.py\n--csv demo_dictionary.csv"]
end
subgraph UART_IN["UART — I command + len + bytes"]
U1["uart_getc loop\n115200 baud"]
end
subgraph INGEST["Ingest Pipeline — firmware/main.c::ingest_entry()"]
I1["[1] Bloom Insert\nDJB2 hash each 3-gram\nBF_CTRL=1, BF_DATA=hash\npoll BF_STATUS"]
I2["[2] LZ77 Compress\nStream bytes to LZ77_IN\nLZ77_CTRL=FLUSH\nDrain LZ77_OUT FIFO"]
I3["[3] AES-128 Encrypt\nPad to 16-byte blocks\nAES_KEY then AES_DATA[3]\nCollect AES_OUT[0..3]"]
I4["[4] Store in D-BRAM\nslot[2]=comp_len\nslot[3]=orig_len\nslot[4+]=ciphertext"]
I5["[5] UART Report\nINPUT → LZ77 → AES\nlatency in cycles"]
I1 --> I2 --> I3 --> I4 --> I5
end
HOST -- "115200 baud serial" --> UART_IN --> INGEST
style HOST fill:#fb8c00,fill-opacity:0.1,stroke:#fb8c00,stroke-width:2px
style UART_IN fill:#6d4c41,fill-opacity:0.1,stroke:#6d4c41,stroke-width:2px
style INGEST fill:#1e88e5,fill-opacity:0.1,stroke:#1e88e5,stroke-width:2px
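Step [1] of the ingest flow hashes every 3-gram of a term before writing it to `BF_DATA`. A host-side sketch of that step follows. It uses the classic DJB2 recurrence (`h = h * 33 + c`, seeded with 5381) truncated to 32 bits; whether the firmware folds case, or how it treats terms shorter than three characters, is not stated in this README, so this sketch simply returns no trigrams for them.

```python
def djb2(data):
    """Classic DJB2 hash over a bytes object, truncated to 32 bits."""
    h = 5381
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def trigrams(term):
    """All overlapping 3-byte windows of an ASCII term."""
    raw = term.encode("ascii")
    return [raw[i:i + 3] for i in range(len(raw) - 2)]


def trigram_hashes(term):
    """One 32-bit hash per trigram, in order of appearance."""
    return [djb2(t) for t in trigrams(term)]
```

A misspelled query usually still shares several trigrams with its target. `Algrithm` and `Algorithm`, for example, have `Alg`, `rit`, `ith`, and `thm` in common, which is why a per-trigram Bloom check can pass near-misses through to the fuzzy stage.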
flowchart TB
subgraph HOST2["Host PC"]
H2["python3 host_query.py\n--csv demo_queries.csv"]
end
subgraph UART_Q["UART — Q command + len + bytes"]
U2["uart_getc loop\n115200 baud"]
end
subgraph QUERY["Query Pipeline — firmware/main.c::query_entry()"]
Q1["[1] Bloom Check\nDJB2 hash each 3-gram\nBF_CTRL=0, BF_DATA=hash\npoll BF_STATUS, read BF_RESULT"]
Q2{"Any trigram\nmatches?"}
Q3["REJECT — fast path\nUART: REJECTED Bloom Filter\nno AES/LZ77/Fuzzy invoked"]
Q4["[2] For each stored entry\nAES Decrypt slot ciphertext\nLZ77 Decompress tokens\nFuzzy Search FUZZY_CTRL=1\npoll FUZZY_STATUS\nread FUZZY_EDIT_DIST"]
Q5["[3] Track best_dist\nlowest edit distance\nacross all entries"]
Q6["[4] Report Result\nUART: MATCH + DIST + latency\nFUZZY_RESULT_HOLD = best_dist\nLED: SUCC or FAIL pattern\n7-seg: edit distance"]
Q1 --> Q2
Q2 -- "No match — all trigrams absent" --> Q3
Q2 -- "Match found — at least 1 trigram" --> Q4 --> Q5 --> Q6
end
HOST2 -- "115200 baud serial" --> UART_Q --> QUERY
style HOST2 fill:#fb8c00,fill-opacity:0.1,stroke:#fb8c00,stroke-width:2px
style UART_Q fill:#6d4c41,fill-opacity:0.1,stroke:#6d4c41,stroke-width:2px
style QUERY fill:#8e24aa,fill-opacity:0.1,stroke:#8e24aa,stroke-width:2px
| Command | Byte Sequence | Effect |
|---|---|---|
| Ingest | `'I' + len[1] + bytes[len]` | Bloom + LZ77 + AES + store entry |
| Query | `'Q' + len[1] + bytes[len]` | Bloom + AES + LZ77 + fuzzy search |
| CSR Dump | `'D'` | Print all 14 performance counters + CPI + throughput |
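The framing in the table is simple enough to build by hand. The helper below is a hypothetical sketch of what a host script might do before writing to the serial port; it only constructs the bytes and does not open a port. The 255-byte limit follows from the one-byte length field, and the fuzzy engine's string registers hold at most 32 bytes.

```python
def build_frame(command, payload):
    """Build an Ingest ('I') or Query ('Q') frame: cmd + len[1] + bytes."""
    if command not in (b"I", b"Q"):
        raise ValueError("command must be b'I' or b'Q'")
    if len(payload) > 255:
        raise ValueError("payload does not fit a one-byte length field")
    return command + bytes([len(payload)]) + payload


def build_dump_frame():
    """The CSR dump command is the single byte 'D' with no payload."""
    return b"D"
```

With pyserial, a frame built this way could be sent using `serial.Serial(port, 115200)` and its `write()` method; the response format depends on the firmware and is not modelled here.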
0x0000_0000 – 0x0000_FFFF I-BRAM (64 KB) — firmware code
0x0000_0000 – 0x0000_FFFF D-BRAM (64 KB) — overlaps for data reads
0x0001_0000 – 0x0001_3FFF Stack + BSS (16 KB)
0x0001_4000 – 0x0001_FFFF Dictionary storage (256 slots × 128 bytes = 32 KB, within a 48 KB range)
0xF000_0000 – 0xF000_00FF MMIO coprocessors + UART
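Slot addresses follow directly from the dictionary base and slot size. The sketch below assumes slots are packed contiguously from `0x0001_4000`, which the map implies but does not state; `slot_address` is an illustrative name.

```python
DICT_BASE = 0x00014000
SLOT_BYTES = 128
NUM_SLOTS = 256


def slot_address(slot):
    """Byte address of the first byte of a dictionary slot."""
    if not 0 <= slot < NUM_SLOTS:
        raise IndexError("slot out of range")
    return DICT_BASE + slot * SLOT_BYTES


def dictionary_bytes():
    """Total storage used by all slots."""
    return NUM_SLOTS * SLOT_BYTES
```

Under that assumption the last slot ends at `0x0001_BFFF`, so the 256 slots occupy 32 KB and the remainder of the range up to `0x0001_FFFF` is unused by the dictionary.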
flowchart LR
subgraph HOST["Host PC — Python Scripts"]
S1["host_ingest.py\nBatch ingest via UART\nCSV to I commands"]
S2["host_query.py\nBatch query + verify\nCSV to Q commands"]
S3["run_full_demo.py\nOrchestrates full\ningest then query cycle"]
S4["fuzz_test.py\nRandom edit-distance\ntest generation"]
S5["validate_edit_distance.py\nPython reference\nLevenshtein cross-check"]
S6["terminal.py\nInteractive UART\nhex dump shell"]
end
subgraph FPGA["FPGA — Nexys A7"]
F1["UART RX/TX\n115200 baud 8N1"]
end
S3 --> S1 & S2
S1 & S2 & S6 <--> F1
S4 --> S2
S5 -.->|"reference validation"| S2
style HOST fill:#43a047,fill-opacity:0.1,stroke:#43a047,stroke-width:2px
style FPGA fill:#1e88e5,fill-opacity:0.1,stroke:#1e88e5,stroke-width:2px
- `host_ingest.py`: Reads `data/demo_dictionary.csv` and streams each term over UART with live acknowledgment parsing. The `--simulate` flag prints expected output without hardware.
- `host_query.py`: Sends queries and verifies results against expected edit distances. Reports accuracy as correct/total.
- `run_full_demo.py`: End-to-end: ingest full dictionary → run all queries → print accuracy summary.
- `fuzz_test.py`: Generates random edit-distance test cases and validates FPGA responses against a Python reference implementation.
- `validate_edit_distance.py`: Standalone cross-validator using `python-Levenshtein` for regression testing.
Target Board: Digilent Nexys A7-100T (xc7a100tcsg324-1) | Constraints: constraints/nexys_a7.xdc
| Signal | Pin | Notes |
|---|---|---|
| `clk_100mhz` | E3 | 100 MHz LVCMOS33 → divided to 50 MHz |
| `cpu_reset_n` | C12 | BTNU (CPU RESET button), active LOW |
| Signal | Pin | Description |
|---|---|---|
| `uart_rxd` | C4 | CP2102 USB-Serial → FPGA RX |
| `uart_txd` | D4 | FPGA TX → CP2102 → USB |
Settings: 115200 baud, 8N1
| Switch State | Mode | LED[15:0] Pattern |
|---|---|---|
| `SW[9]=1` | PC Heatmap | LED = PC[17:2], visualizes hot instruction regions |
| `SW[11]=1, SW[9]=0` | Interactive | Knight-rider idle · SUCC (0xAAAA/0x5555) or FAIL (0xFF00/0x0000) on search done |
| `SW[11]=0, SW[9]=0` | Telemetry | `{icache_stall, dcache_stall, seq_busy, fuzzy_done, ..., edit_dist[5:0]}` |
Knight-rider speed: SW[10]=1 fast, SW[10]=0 slow.
RV5/
├── rtl/
│ ├── core/ CPU pipeline — 20 modules
│ │ ├── RISCV_CPU.v CPU top-level
│ │ ├── PC.v / IF_ID.v / RF.v / ImmGen.v / Control.v
│ │ ├── ALU.v / ALU_Control.v / BPU.v
│ │ ├── BHT.v / BTB.v / RAS_Unit.v
│ │ ├── Hazard_Unit.v / Forwarding_Unit.v
│ │ ├── ID_EX.v / EX_MEM.v / MEM_WB.v
│ │ ├── LDU.v / CSR.v / Div_Unit.v / Zbb_Unit.v
│ │ └── SYSTEM_DEF.vh Global parameters + CSR addresses
│ ├── fpu/ RV32F floating-point
│ │ ├── fpu_top.v / fregfile.v
│ │ └── fpu_addsub.v / fpu_mul.v / fpu_div.v
│ │ fpu_sqrt.v / fpu_cmp.v / fpu_cvt.v
│ ├── cache/
│ │ ├── icache.v I-Cache 64-line 8-word direct-mapped
│ │ └── dcache.v D-Cache 64-line 8-word write-through
│ ├── bloom/
│ │ └── bloom_filter.v 8192-bit · 3 Murmur polynomial hashes
│ ├── lz77/
│ │ └── lz77_comp.v 256B window · 256 parallel comparators
│ ├── aes/
│ │ └── aes128_enc.v AES-128 enc+dec · 10-cycle iterative
│ ├── fuzzy/
│ │ ├── fuzzy_search_top.v MMIO wrapper
│ │ ├── lev_array.v 32×32 systolic grid
│ │ ├── lev_pe.v DP processing element
│ │ └── search_sequencer.v DMA engine
│ ├── memory/
│ │ ├── I_BRAM.v Instruction BRAM dual-port
│ │ └── D_BRAM.v Data BRAM dual-port + DMA port B
│ ├── mmio/
│ │ ├── mmio_decode.v MMIO bus decoder/arbiter
│ │ └── mmio_map.vh Address constant definitions
│ ├── uart/
│ │ ├── uart_ctrl.v / uart_tx.v / uart_rx.v
│ └── top/
│ ├── top_fpga.v SoC top-level Nexys A7
│ └── sevseg_ctrl.v 8-digit 7-segment controller
│
├── firmware/
│ ├── main.c UART shell + all pipelines
│ ├── mmio.h MMIO register definitions
│ ├── start.S Startup assembly — stack init, jump to main
│ ├── link.ld Linker script
│ ├── Makefile riscv32-unknown-elf-gcc build
│ ├── bin2mem.py ELF to .mem readmemh format
│ ├── firmware.mem Hex image for I-BRAM init
│ └── firmware_data.mem Data segment for D-BRAM init
│
├── tb/ 20+ Verilog testbenches
│ ├── tb_pipeline.v / tb_full_system.v / tb_system_interactive.v
│ ├── tb_aes.v / tb_aes128.v / tb_aes_nist.v
│ ├── tb_bloom.v / tb_bloom_unit.v
│ ├── tb_fuzzy.v / tb_fuzz_lev_auto.v / tb_fuzz_stress.v
│ │ tb_edit_distance_exhaustive.v
│ ├── tb_lz77.v / tb_lz77_unit.v
│ ├── tb_forwarding.v / tb_hazard.v / tb_ldu.v / tb_immgen.v
│ ├── tb_rv32m.v / tb_rv32m_mul.v / tb_rv32b.v / tb_ras.v
│ └── tb_mini.v / tb_top.v
│
├── scripts/
│ ├── host_ingest.py / host_query.py / run_full_demo.py
│ ├── fuzz_test.py / validate_edit_distance.py / terminal.py
│
├── data/
│ ├── demo_dictionary.csv 256-term mixed domain dictionary
│ └── demo_queries.csv Misspelled queries + expected results
│
├── constraints/
│ └── nexys_a7.xdc Vivado XDC pin constraints
│
├── bit/
│ ├── top_fpga.bit Pre-built Vivado bitstream
│ └── terminal.py Programming helper
│
└── docs/
├── Project_Abstract.docx
├── Block_Diagram_Report.docx
└── FPGA_Execution_Plan_Group14.docx
| Tool | Version | Purpose |
|---|---|---|
| `riscv32-unknown-elf-gcc` | ≥ 12.0 | Firmware compilation |
| Xilinx Vivado | ≥ 2023.1 | Synthesis + implementation |
| Python | ≥ 3.9 | Host scripts |
| `pyserial` | ≥ 3.5 | UART communication |
| `iverilog` + `vvp` | ≥ 11.0 | Simulation |
flowchart LR
subgraph FW["1 — Firmware Build"]
F1["cd firmware\nmake clean and make"]
F2["riscv32-unknown-elf-gcc\n-march=rv32im -mabi=ilp32 -O2\n-nostdlib -T link.ld"]
F3["bin2mem.py\nELF to firmware.mem"]
F1 --> F2 --> F3
end
subgraph SYN["2 — Vivado Synthesis"]
S1["New project\nxc7a100tcsg324-1"]
S2["Add rtl sources\nnexys_a7.xdc\nfirmware.mem files"]
S3["Synthesis\nImplementation\nGenerate Bitstream"]
S1 --> S2 --> S3
end
subgraph PROG["3 — Program FPGA"]
P1["Vivado HW Manager\nor xc3sprog"]
P2["top_fpga.bit\nto Nexys A7"]
P1 --> P2
end
subgraph DEMO["4 — Run Demo"]
D1["host_ingest.py\n--csv demo_dictionary.csv"]
D2["host_query.py\n--csv demo_queries.csv"]
D3["run_full_demo.py\nfull automated run"]
D1 --> D2 --> D3
end
FW --> SYN --> PROG --> DEMO
style FW fill:#43a047,fill-opacity:0.1,stroke:#43a047,stroke-width:2px
style SYN fill:#1e88e5,fill-opacity:0.1,stroke:#1e88e5,stroke-width:2px
style PROG fill:#fb8c00,fill-opacity:0.1,stroke:#fb8c00,stroke-width:2px
style DEMO fill:#8e24aa,fill-opacity:0.1,stroke:#8e24aa,stroke-width:2px
# Build firmware
cd firmware && make clean && make
# Program pre-built bitstream
xc3sprog -c nexys4 bit/top_fpga.bit
# Ingest dictionary
python3 scripts/host_ingest.py --port /dev/ttyUSB0 --csv data/demo_dictionary.csv
# Run queries
python3 scripts/host_query.py --port /dev/ttyUSB0 --csv data/demo_queries.csv
# Full automated demo
python3 scripts/run_full_demo.py --port /dev/ttyUSB0
# Interactive terminal
python3 scripts/terminal.py --port /dev/ttyUSB0

LUTs: ~40,000 / 63,400 (63%)
FFs: ~12,000 / 126,800 ( 9%)
BRAMs: ~8 / 135 ( 6%)
DSPs: ~4 / 240 ( 2%)
WNS: +2.1 ns (timing closed at 50 MHz)
256 terms across multiple domains:
| Category | Example Terms |
|---|---|
| Programming | Algorithm, Database, Compiler, Recursion, Polymorphism, Concurrency, Deadlock… |
| Biology | Mitochondria, Chromosome, Photosynthesis, Metabolism… |
| Medicine | Hypertension, Pneumonia, Tachycardia, Anesthesia, Antibiotic, Bradycardia… |
| Mathematics | Fibonacci, Abstraction, Serialization, Iteration, Traversal… |
50 intentionally misspelled queries with known expected matches:
| Query | Expected Match | Edit Distance |
|---|---|---|
| `Algrithm` | `Algorithm` | 1 |
| `Databse` | `Database` | 1 |
| `Compiller` | `Compiler` | 1 |
| `Fibbonacci` | `Fibonacci` | 1 |
| `Encapslation` | `Encapsulation` | 1 |
| `Mitocondria` | `Mitochondria` | 1 |
| `Fotosynthesis` | `Photosynthesis` | 2 |
| `Tachacardia` | `Tachycardia` | 1 |
| `Inheritence` | `Inheritance` | 1 |
| `Neumonia` | `Pneumonia` | 1 |
Ideal CPI (no hazards): 1.00
Typical CPI (demo firmware): ~1.80
Load-use stall overhead: ~12% of cycles
Cache miss overhead: ~8% of cycles (warm cache)
UART TX stall overhead: ~15% of cycles (TX-bound)
| Operation | Cycles | Throughput |
|---|---|---|
| Bloom insert (1 key) | ~7 | 7.1M ops/s @ 50 MHz |
| Bloom query (1 key) | ~7 | 7.1M ops/s @ 50 MHz |
| LZ77 compress (16B) | ~20–50 | variable ratio |
| AES-128 encrypt (16B) | ~13 | 61.5 MB/s |
| AES-128 decrypt (16B) | ~13 | 61.5 MB/s |
| Fuzzy search (32×32 chars) | ~64 | ~780K queries/s |
| I-cache miss refill | 8 | — |
| D-cache miss refill | 8 | — |
For a 10-entry dictionary (1 Bloom check + 10 × AES+LZ77+Fuzzy):
Bloom check (1 trigram): ~7 cycles
Per-entry AES decrypt: ~13 cycles
Per-entry LZ77 decompress: ~30 cycles
Per-entry fuzzy search: ~64 cycles
──────────────────────────────────────
Per-entry total: ~107 cycles
10-entry scan total: ~1,077 cycles
Wall-clock @ 50 MHz: ~21.5 µs
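The same estimate can be parameterised to see how scan time grows with dictionary size. The per-stage cycle counts are the approximate figures quoted above; the model ignores firmware overhead such as MMIO writes and status polling, so it is a lower bound rather than a measurement.

```python
CLOCK_HZ = 50_000_000


def scan_latency_cycles(entries, bloom=7, aes=13, lz77=30, fuzzy=64):
    """One Bloom check plus a decrypt/decompress/compare per entry."""
    return bloom + entries * (aes + lz77 + fuzzy)


def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to microseconds at the given clock."""
    return cycles / clock_hz * 1e6
```

`scan_latency_cycles(10)` reproduces the 1,077-cycle figure. A full 256-entry dictionary comes to 27,399 cycles in this model, or roughly 548 µs at 50 MHz.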
The Bloom filter correctly rejects ~85% of random noise queries before the expensive pipeline is invoked — providing a large constant-factor speedup for non-matching inputs.
All testbenches in tb/ run with Icarus Verilog:
# Full system integration test
iverilog -o sim/tb_full_system -I rtl/core \
tb/tb_full_system.v rtl/core/*.v rtl/cache/*.v rtl/bloom/*.v \
rtl/lz77/*.v rtl/aes/*.v rtl/fuzzy/*.v rtl/memory/*.v \
rtl/mmio/*.v rtl/uart/*.v rtl/top/*.v
vvp sim/tb_full_system
# AES NIST FIPS-197 test vectors
iverilog -o sim/tb_aes_nist tb/tb_aes_nist.v rtl/aes/aes128_enc.v
vvp sim/tb_aes_nist
# Fuzzy search 1,000-case stress test
iverilog -o sim/tb_fuzz_stress tb/tb_fuzz_stress.v rtl/fuzzy/*.v
vvp sim/tb_fuzz_stress

| Testbench | What It Tests |
|---|---|
| `tb_pipeline.v` | 5-stage pipeline correctness, forwarding, hazards |
| `tb_full_system.v` | Complete SoC: ingest + query end-to-end |
| `tb_system_interactive.v` | UART protocol + command parsing |
| `tb_aes_nist.v` | NIST FIPS-197 AES-128 test vectors |
| `tb_bloom_unit.v` | Hash function distribution verification |
| `tb_edit_distance_exhaustive.v` | Levenshtein DP: all 4-char string pairs |
| `tb_fuzz_stress.v` | 1,000 random string pair stress test |
| `tb_lz77.v` | Compress + decompress round-trip fidelity |
| `tb_forwarding.v` | All forwarding paths (EX→EX, MEM→EX, WB→EX) |
| `tb_hazard.v` | Load-use, branch flush, cache stall behavior |
| `tb_rv32m.v` | All RV32M instructions (MUL/DIV/REM variants) |
| `tb_rv32b.v` | All Zbb bit-manipulation instructions |
| `tb_ras.v` | Return Address Stack push/pop depth |
| `tb_ldu.v` | All LB/LBU/LH/LHU/LW alignment combinations |
RV5 was developed iteratively across phases 0 through 8 (Phases 6 and 7 are grouped together below), each adding a concrete layer of hardware functionality:
flowchart LR
subgraph P0["Phase 0"]
direction TB
A0["RV32I Pipeline\nPC · RF · ALU · BPU\nBHT + BTB prediction\nI-BRAM + D-BRAM\nCSR: rdcycle minstret"]
end
subgraph P1["Phase 1"]
direction TB
A1["MMIO + UART\n115200 baud TX/RX\nMMIO decoder + map\nBloom Filter coprocessor"]
end
subgraph P2["Phase 2"]
direction TB
A2["Coprocessors\nLZ77 compress/decomp\nAES-128 enc + dec\nCSR perf counter ext"]
end
subgraph P3["Phase 3"]
direction TB
A3["L1 Cache Hierarchy\nI-Cache 64-line D-map\nD-Cache write-through\nCache stall + miss CSR"]
end
subgraph P4["Phase 4"]
direction TB
A4["Hazard Extensions\nFull forwarding network\nFPU stall integration\nUART TX stall"]
end
subgraph P5["Phase 5"]
direction TB
A5["Fuzzy Search Engine\nLevenshtein systolic array\nfuzzy_search_top MMIO\nSearch Sequencer DMA\n7-seg telemetry controller"]
end
subgraph P67["Phase 6 + 7"]
direction TB
A67["RV32M + Zbb\nDSP Multiplier\nIterative Divider\nBit-Manip CLZ CTZ ROL"]
end
subgraph P8["Phase 8"]
direction TB
A8["RV32F FPU\nfregfile f0 to f31\nFADD FSUB FMUL\nFDIV FSQRT multi-cycle\nFP CSRs fcsr fflags frm"]
end
P0 --> P1 --> P2 --> P3 --> P4 --> P5 --> P67 --> P8
style P0 fill:#43a047,fill-opacity:0.1,stroke:#43a047,stroke-width:2px
style P1 fill:#1e88e5,fill-opacity:0.1,stroke:#1e88e5,stroke-width:2px
style P2 fill:#8e24aa,fill-opacity:0.1,stroke:#8e24aa,stroke-width:2px
style P3 fill:#00897b,fill-opacity:0.1,stroke:#00897b,stroke-width:2px
style P4 fill:#fb8c00,fill-opacity:0.1,stroke:#fb8c00,stroke-width:2px
style P5 fill:#e53935,fill-opacity:0.1,stroke:#e53935,stroke-width:2px
style P67 fill:#6d4c41,fill-opacity:0.1,stroke:#6d4c41,stroke-width:2px
style P8 fill:#00acc1,fill-opacity:0.1,stroke:#00acc1,stroke-width:2px
`define DATA_WIDTH 32
`define ADDR_WIDTH 5 // register address (5 bits = x0..x31)
`define PC_WIDTH 32
`define INSTR_WIDTH 32
`define OPCODE_WIDTH 7
// Branch History Table
`define BHT_SIZE 64
`define BHT_PC_WIDTH 6 // log2(BHT_SIZE)
// Branch Target Buffer
`define BTB_SIZE 16
`define BTB_PC_WIDTH 4
// Opcode definitions
`define R_TYPE 7'b0110011
`define I_TYPE_ALU 7'b0010011
`define I_TYPE_LOAD 7'b0000011
`define S_TYPE 7'b0100011
`define B_TYPE 7'b1100011
`define U_TYPE_LUI 7'b0110111
`define U_TYPE_AUIPC 7'b0010111
`define J_TYPE_JAL 7'b1101111
`define I_TYPE_JALR 7'b1100111
`define I_TYPE_CSR 7'b1110011
`define FP_OP 7'b1010011
`define FP_LOAD 7'b0000111
`define FP_STORE 7'b0100111
// CSR addresses
`define CSR_MSTATUS 12'h300
`define CSR_MTVEC 12'h305
`define CSR_MEPC 12'h341
`define CSR_MCAUSE 12'h342
`define CSR_RDCYCLE 12'hC00
`define CSR_MINSTRET 12'hC02
`define CSR_BRANCH_MISS 12'hB00
`define CSR_ICACHE_MISS 12'hB01
`define CSR_DCACHE_MISS 12'hB02
`define CSR_BLOOM_REJ 12'hB03
`define CSR_AES_OPS 12'hB04
`define CSR_BLOOM_INS 12'hB05
`define CSR_FUZZY_OPS 12'hB06
`define CSR_TOTAL_BYTES 12'hB07
`define CSR_FFLAGS 12'h001
`define CSR_FRM 12'h002
`define CSR_FCSR 12'h003

Built with Verilog-2001 · RV32IMFB · Xilinx Artix-7 · 50 MHz
From gates to search engine — every bit hand-crafted.