[FIRRTL] Lower non-trivial memory latencies to RTL #585

seldridge · 2021-02-12T06:38:26Z

Fixes #477.

Add lowering of memories with non-zero read latency or non-unit
write latency, using the strategy of the Scala FIRRTL Compiler
(SFC). If a memory has read or write ports with such latencies, add
delay pipes that delay the read or write by the expected number of
cycles. This deviates slightly from the SFC output: each pipe is
created as one aggregate (array) register rather than as one
register per pipe stage.
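
As a rough illustration of the paragraph above, the following Python sketch computes how many pipe stages a memory port would need and contrasts the two naming schemes. The function names are hypothetical and are not part of CIRCT or the SFC; the stage counts are an assumption inferred from the example later in this PR (read latency 2 gives two read stages, write latency 2 gives one write stage).

```python
def pipe_depths(read_latency, write_latency):
    """Return (read pipe stages, write pipe stages) for one memory.

    Assumption inferred from the example in this PR: a read needs one
    stage per cycle of read latency, while a write needs one stage fewer
    than its latency because the array write itself takes a cycle.
    """
    if read_latency < 0 or write_latency < 1:
        raise ValueError("expected read_latency >= 0 and write_latency >= 1")
    return read_latency, write_latency - 1


def per_stage_names(signal, stages):
    """SFC-style naming: one scalar register per pipe stage."""
    return [f"{signal}_pipe_{i}" for i in range(stages)]


def aggregate_decl(signal, width, stages):
    """Aggregate-style declaration: a single array register for the pipe."""
    if stages < 1:
        raise ValueError("a pipe needs at least one stage")
    packed = f"[{width - 1}:0] " if width > 1 else ""
    return f"reg {packed}{signal}_pipe[{stages - 1}:0];"


if __name__ == "__main__":
    read_stages, write_stages = pipe_depths(read_latency=2, write_latency=2)
    print(per_stage_names("memory_r_addr", read_stages))
    print(aggregate_decl("memory_r_addr", 4, read_stages))
    print(aggregate_decl("memory_w_data", 8, write_stages))
```

Under the same assumption, a standard sequential-read memory (read latency 1, write latency 1) would need one read stage and no write stages.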

Remove incomplete support for aggregate memory lowering. Before this
commit, memories would be split, but there was no logic to actually
handle the split memories. Aggregate memories are now an error during
lowering, and the error tells the user how to run FIRRTL type lowering.

The existing register initialization logic is extended to support
arrays (as this is required for the pipeline registers).

Update tests and add a new test to check the pipeline register
behavior.

While large read and write latencies are arguably something that we could choose not to support, the standard sequential-read memory (read latency 1, write latency 1) needs some lowering if memory blackboxing (#493) isn't run. This PR solves the general case and aligns the behavior with SFC lowering.

I believe that removing the vestigial, incomplete memory lowering during FIRRTL to RTL conversion is justified because: (1) we should aim to do things correctly once and not have multiple different paths to produce the same effects and (2) FIRRTL to RTL conversion should eventually lower memories as aggregates anyway (structs/vectors should be preserved here). I don't think we're losing anything by ripping out the vestigial support and pushing people towards FIRRTL type lowering.

This PR complements memory blackboxing (#493) and the addition of generator ops (#547). It is the fallback path if memories aren't blackboxed, and its logic can be reused for an eventual default memory lowering/memory generator expansion pass.

Example

Consider the following FIRRTL IR (this is the example added in the tests). This has read and write latencies of 2:

circuit Foo:
  module Foo:
    input clock: Clock
    input rAddr: UInt<4>
    input rEn: UInt<1>
    output rData: UInt<8>
    input wAddr: UInt<4>
    input wEn: UInt<1>
    input wMask: UInt<1>
    input wData: UInt<8>

    mem memory:
      data-type => UInt<8>
      depth => 16
      reader => r
      writer => w
      read-latency => 2
      write-latency => 2
      read-under-write => undefined

    memory.r.clk <= clock
    memory.r.en <= rEn
    memory.r.addr <= rAddr
    rData <= memory.r.data

    memory.w.clk <= clock
    memory.w.en <= wEn
    memory.w.addr <= wAddr
    memory.w.mask <= wMask
    memory.w.data <= wData

If you compile this with the SFC, the read is delayed by two cycles and the write is delayed by one cycle (the write into the memory array accounts for the remaining cycle of write latency):

module Foo(
  input        clock,
  input  [3:0] rAddr,
  input        rEn,
  output [7:0] rData,
  input  [3:0] wAddr,
  input        wEn,
  input        wMask,
  input  [7:0] wData
);
`ifdef RANDOMIZE_MEM_INIT
  reg [31:0] _RAND_0;
`endif // RANDOMIZE_MEM_INIT
`ifdef RANDOMIZE_REG_INIT
  reg [31:0] _RAND_1;
  reg [31:0] _RAND_2;
  reg [31:0] _RAND_3;
  reg [31:0] _RAND_4;
  reg [31:0] _RAND_5;
  reg [31:0] _RAND_6;
  reg [31:0] _RAND_7;
  reg [31:0] _RAND_8;
`endif // RANDOMIZE_REG_INIT
  reg [7:0] memory [0:15];
  wire [7:0] memory_r_data;
  wire [3:0] memory_r_addr;
  wire [7:0] memory_w_data;
  wire [3:0] memory_w_addr;
  wire  memory_w_mask;
  wire  memory_w_en;
  reg  memory_r_en_pipe_0;
  reg [3:0] memory_r_addr_pipe_0;
  reg  memory_r_en_pipe_1;
  reg [3:0] memory_r_addr_pipe_1;
  reg  memory_w_en_pipe_0;
  reg [3:0] memory_w_addr_pipe_0;
  reg  memory_w_mask_pipe_0;
  reg [7:0] memory_w_data_pipe_0;
  assign memory_r_addr = memory_r_addr_pipe_1;
  assign memory_r_data = memory[memory_r_addr];
  assign memory_w_data = memory_w_data_pipe_0;
  assign memory_w_addr = memory_w_addr_pipe_0;
  assign memory_w_mask = memory_w_mask_pipe_0;
  assign memory_w_en = memory_w_en_pipe_0;
  assign rData = memory_r_data;
  always @(posedge clock) begin
    if(memory_w_en & memory_w_mask) begin
      memory[memory_w_addr] <= memory_w_data;
    end
    memory_r_en_pipe_0 <= rEn;
    if (rEn) begin
      memory_r_addr_pipe_0 <= rAddr;
    end
    memory_r_en_pipe_1 <= memory_r_en_pipe_0;
    if (memory_r_en_pipe_0) begin
      memory_r_addr_pipe_1 <= memory_r_addr_pipe_0;
    end
    memory_w_en_pipe_0 <= wEn;
    if (wEn) begin
      memory_w_addr_pipe_0 <= wAddr;
    end
    if (wEn) begin
      memory_w_mask_pipe_0 <= wMask;
    end
    if (wEn) begin
      memory_w_data_pipe_0 <= wData;
    end
  end
`ifndef SYNTHESIS
initial begin
  `ifdef RANDOMIZE
    `ifdef INIT_RANDOM
      `INIT_RANDOM
    `endif
    `ifndef VERILATOR
      `ifdef RANDOMIZE_DELAY
        #`RANDOMIZE_DELAY begin end
      `else
        #0.002 begin end
      `endif
    `endif
`ifdef RANDOMIZE_MEM_INIT
  _RAND_0 = {1{`RANDOM}};
  for (initvar = 0; initvar < 16; initvar = initvar+1)
    memory[initvar] = _RAND_0[7:0];
`endif // RANDOMIZE_MEM_INIT
`ifdef RANDOMIZE_REG_INIT
  _RAND_1 = {1{`RANDOM}};
  memory_r_en_pipe_0 = _RAND_1[0:0];
  _RAND_2 = {1{`RANDOM}};
  memory_r_addr_pipe_0 = _RAND_2[3:0];
  _RAND_3 = {1{`RANDOM}};
  memory_r_en_pipe_1 = _RAND_3[0:0];
  _RAND_4 = {1{`RANDOM}};
  memory_r_addr_pipe_1 = _RAND_4[3:0];
  _RAND_5 = {1{`RANDOM}};
  memory_w_en_pipe_0 = _RAND_5[0:0];
  _RAND_6 = {1{`RANDOM}};
  memory_w_addr_pipe_0 = _RAND_6[3:0];
  _RAND_7 = {1{`RANDOM}};
  memory_w_mask_pipe_0 = _RAND_7[0:0];
  _RAND_8 = {1{`RANDOM}};
  memory_w_data_pipe_0 = _RAND_8[7:0];
`endif // RANDOMIZE_REG_INIT
  `endif // RANDOMIZE
end // initial
`ifdef FIRRTL_AFTER_INITIAL
`FIRRTL_AFTER_INITIAL
`endif
`endif // SYNTHESIS
endmodule

With this PR, you get the following from CIRCT (firtool -enable-lower-types -lower-to-rtl -verilog):

module Foo(
  input        clock,
  input  [3:0] rAddr,
  input        rEn,
  input  [3:0] wAddr,
  input        wEn, wMask,
  input  [7:0] wData,
  output [7:0] rData);

  reg  [7:0] memory[15:0];	// Foo.fir:12:5
  wire [3:0] memory_r_addr;	// Foo.fir:12:5
  wire       memory_r_en;	// Foo.fir:12:5
  wire       memory_r_clk;	// Foo.fir:12:5
  wire [7:0] memory_r_data;	// Foo.fir:12:5
  reg        memory_r_en_pipe[1:0];	// Foo.fir:12:5
  reg  [3:0] memory_r_addr_pipe[1:0];	// Foo.fir:12:5
  wire [3:0] memory_w_addr;	// Foo.fir:12:5
  wire       memory_w_en;	// Foo.fir:12:5
  wire       memory_w_clk;	// Foo.fir:12:5
  wire [7:0] memory_w_data;	// Foo.fir:12:5
  wire       memory_w_mask;	// Foo.fir:12:5
  reg        memory_w_en_pipe[0:0];	// Foo.fir:12:5
  reg  [3:0] memory_w_addr_pipe[0:0];	// Foo.fir:12:5
  reg        memory_w_mask_pipe[0:0];	// Foo.fir:12:5
  reg  [7:0] memory_w_data_pipe[0:0];	// Foo.fir:12:5
  wire [3:0] memory_r_addr_5;	// Foo.fir:12:5
  wire       memory_r_en_6;	// Foo.fir:12:5
  wire       memory_r_clk_7;	// Foo.fir:12:5
  wire [3:0] memory_w_addr_8;	// Foo.fir:12:5
  wire       memory_w_en_9;	// Foo.fir:12:5
  wire       memory_w_clk_10;	// Foo.fir:12:5

  wire _T = memory_r_clk;	// Foo.fir:12:5
  always_ff @(posedge _T) begin	// Foo.fir:12:5
    logic _T_0 = memory_r_en;	// Foo.fir:12:5
    memory_r_en_pipe[1'h0] <= _T_0;	// Foo.fir:12:5
    if (_T_0) begin	// Foo.fir:12:5
      memory_r_addr_pipe[1'h0] <= memory_r_addr;	// Foo.fir:12:5
    end
    memory_r_en_pipe[1'h1] <= memory_r_en_pipe[1'h0];	// Foo.fir:12:5
    if (memory_r_en_pipe[1'h0]) begin	// Foo.fir:12:5
      memory_r_addr_pipe[1'h1] <= memory_r_addr_pipe[1'h0];	// Foo.fir:12:5
    end
  end // always_ff @(posedge)
  assign memory_r_data = memory[memory_r_addr_pipe[1'h1]];	// Foo.fir:12:5
  wire _T_1 = memory_w_clk;	// Foo.fir:12:5
  always_ff @(posedge _T_1) begin	// Foo.fir:12:5
    logic _T_2 = memory_w_en;	// Foo.fir:12:5
    memory_w_en_pipe[1'h0] <= _T_2;	// Foo.fir:12:5
    if (_T_2) begin	// Foo.fir:12:5
      memory_w_addr_pipe[1'h0] <= memory_w_addr;	// Foo.fir:12:5
      memory_w_mask_pipe[1'h0] <= memory_w_mask;	// Foo.fir:12:5
      memory_w_data_pipe[1'h0] <= memory_w_data;	// Foo.fir:12:5
    end
    if (memory_w_en_pipe[1'h0] & memory_w_mask_pipe[1'h0]) begin	// Foo.fir:12:5
      memory[memory_w_addr_pipe[1'h0]] <= memory_w_data_pipe[1'h0];	// Foo.fir:12:5
    end
  end // always_ff @(posedge)
  `ifndef SYNTHESIS	// Foo.fir:12:5
    initial begin	// Foo.fir:12:5
      `INIT_RANDOM_PROLOG_	// Foo.fir:12:5
      `ifdef RANDOMIZE_MEM_INIT	// Foo.fir:12:5
        integer memory_initvar;
        for (memory_initvar = 0; memory_initvar < 16; memory_initvar = memory_initvar+1)
          memory[memory_initvar] = `RANDOM;	// Foo.fir:12:5
      `endif
      `ifdef RANDOMIZE_REG_INIT	// Foo.fir:12:5
        logic _T_3 = `RANDOM;	// Foo.fir:12:5
        memory_r_en_pipe[1'h0] = _T_3;	// Foo.fir:12:5
        memory_r_en_pipe[1'h1] = _T_3;	// Foo.fir:12:5
      `endif
      `ifdef RANDOMIZE_REG_INIT	// Foo.fir:12:5
        logic [3:0] _T_4 = `RANDOM;	// Foo.fir:12:5
        memory_r_addr_pipe[1'h0] = _T_4;	// Foo.fir:12:5
        memory_r_addr_pipe[1'h1] = _T_4;	// Foo.fir:12:5
      `endif
      `ifdef RANDOMIZE_REG_INIT	// Foo.fir:12:5
        memory_w_en_pipe[1'h0] = `RANDOM;	// Foo.fir:12:5
      `endif
      `ifdef RANDOMIZE_REG_INIT	// Foo.fir:12:5
        memory_w_addr_pipe[1'h0] = `RANDOM;	// Foo.fir:12:5
      `endif
      `ifdef RANDOMIZE_REG_INIT	// Foo.fir:12:5
        memory_w_mask_pipe[1'h0] = `RANDOM;	// Foo.fir:12:5
      `endif
      `ifdef RANDOMIZE_REG_INIT	// Foo.fir:12:5
        memory_w_data_pipe[1'h0] = `RANDOM;	// Foo.fir:12:5
      `endif
    end // initial
  `endif
  assign memory_r_addr = memory_r_addr_5;	// Foo.fir:12:5
  assign memory_r_en = memory_r_en_6;	// Foo.fir:12:5
  assign memory_r_clk = memory_r_clk_7;	// Foo.fir:12:5
  assign memory_w_addr = memory_w_addr_8;	// Foo.fir:12:5
  assign memory_w_en = memory_w_en_9;	// Foo.fir:12:5
  assign memory_w_clk = memory_w_clk_10;	// Foo.fir:12:5
  assign memory_r_clk_7 = clock;	// Foo.fir:21:18
  assign memory_r_en_6 = rEn;	// Foo.fir:22:17
  assign memory_r_addr_5 = rAddr;	// Foo.fir:23:19
  assign memory_w_clk_10 = clock;	// Foo.fir:26:18
  assign memory_w_en_9 = wEn;	// Foo.fir:27:17
  assign memory_w_addr_8 = wAddr;	// Foo.fir:28:19
  assign memory_w_mask = wMask;	// Foo.fir:29:19
  assign memory_w_data = wData;	// Foo.fir:30:19
  assign rData = memory_r_data;	// Foo.fir:2:3
endmodule
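
To make the timing of the two listings easier to follow, here is a small behavioral model in Python. It is only a sketch: the class and method names are hypothetical, it assumes a single clock for both ports, and it assumes read latency of at least 1 and write latency of at least 2 so that both pipes have at least one stage. Each call to `step` models one rising clock edge with non-blocking assignment semantics and returns `rData` as seen after that edge.

```python
class PipelinedMemory:
    """Behavioral model of a memory with enable-gated delay pipes (sketch)."""

    def __init__(self, depth=16, read_latency=2, write_latency=2):
        if read_latency < 1 or write_latency < 2:
            raise ValueError("this sketch assumes at least one stage per pipe")
        self.mem = [0] * depth
        # Read pipes: one stage per cycle of read latency.
        self.r_en = [0] * read_latency
        self.r_addr = [0] * read_latency
        # Write pipes: one stage fewer than the write latency.
        stages = write_latency - 1
        self.w_en = [0] * stages
        self.w_addr = [0] * stages
        self.w_mask = [0] * stages
        self.w_data = [0] * stages

    def step(self, r_en, r_addr, w_en, w_addr, w_mask, w_data):
        """Advance one clock edge and return rData after the edge."""
        # Commit the write held in the last write stage (pre-edge values).
        if self.w_en[-1] and self.w_mask[-1]:
            self.mem[self.w_addr[-1]] = self.w_data[-1]

        # Shift the read pipes from the last stage down to the first, so
        # every stage samples the pre-edge value of its predecessor.
        for i in range(len(self.r_en) - 1, 0, -1):
            if self.r_en[i - 1]:
                self.r_addr[i] = self.r_addr[i - 1]
            self.r_en[i] = self.r_en[i - 1]
        if r_en:
            self.r_addr[0] = r_addr
        self.r_en[0] = r_en

        # Shift the write pipes the same way, gated by the enable.
        for i in range(len(self.w_en) - 1, 0, -1):
            if self.w_en[i - 1]:
                self.w_addr[i] = self.w_addr[i - 1]
                self.w_mask[i] = self.w_mask[i - 1]
                self.w_data[i] = self.w_data[i - 1]
            self.w_en[i] = self.w_en[i - 1]
        if w_en:
            self.w_addr[0] = w_addr
            self.w_mask[0] = w_mask
            self.w_data[0] = w_data
        self.w_en[0] = w_en

        # Combinational read through the last stage of the address pipe.
        return self.mem[self.r_addr[-1]]


if __name__ == "__main__":
    m = PipelinedMemory()
    first = m.step(r_en=1, r_addr=3, w_en=1, w_addr=3, w_mask=1, w_data=0xAB)
    second = m.step(r_en=0, r_addr=0, w_en=0, w_addr=0, w_mask=0, w_data=0)
    print(first, second)
```

In this model, a write presented at one edge reaches the array at the next edge, and a read address presented at one edge drives the array read after the following edge, which matches the pipe depths in the listings above.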

mikeurbach

I have a broader comment that's outside the scope of this PR. This isn't the first time a CIRCT component has needed to add pipeline registers like this. Around memories, Handshake has to do exactly this, in the control network. Maybe there could be some shared utility if this is a common building block that CIRCT developers reach for.

seldridge · 2021-02-12T16:23:40Z

@mikeurbach wrote:

I have a broader comment that's outside the scope of this PR. This isn't the first time a CIRCT component has needed to add pipeline registers like this. Around memories, Handshake has to do exactly this, in the control network. Maybe there could be some shared utility if this is a common building block that CIRCT developers reach for.

I agree. If the utility is purely for generating pipes in RTL (or comb + sequential) / SV, then I think that's pretty straightforward to generalize. If it's a more general utility that's abstract in the dialect that it targets, that sounds even more useful, but I'm not 100% sure how to go about it. Could this be implemented as a Dialect Interface?

mikeurbach · 2021-02-12T17:53:20Z

I was imagining something that just targets RTL/Comb/Seq/SV directly right now. I will open an issue about this.

Signed-off-by: Schuyler Eldridge <schuyler.eldridge@sifive.com>

seldridge force-pushed the dev/seldridge/issue-477 branch from 4be8e6e to 4d4d72a Compare February 12, 2021 07:06

mikeurbach reviewed Feb 12, 2021

seldridge force-pushed the dev/seldridge/issue-477 branch from 4d4d72a to 1c15cde Compare February 12, 2021 16:16

mikeurbach mentioned this pull request Feb 12, 2021

[Seq] Pipeline generator #587

Closed

seldridge force-pushed the dev/seldridge/issue-477 branch 5 times, most recently from e092e1f to a22c777 Compare February 18, 2021 20:08

darthscsi approved these changes Feb 22, 2021

seldridge force-pushed the dev/seldridge/issue-477 branch from a22c777 to 7ee5aff Compare February 22, 2021 19:10

seldridge merged commit 2ab3838 into main Feb 22, 2021

seldridge deleted the dev/seldridge/issue-477 branch February 22, 2021 20:20

mikeurbach mentioned this pull request Feb 24, 2021

[HandshakeToFIRRTL] Issues with Verilog generated from Standard involving memories #543

Closed

JuanEsco063 mentioned this pull request Feb 25, 2021

[FIRRTL][RTL][ExportedVerilog] Missing name in black box memory 'rtl.instance' op requires attribute 'instanceName' #670

Closed

seldridge mentioned this pull request Feb 27, 2021

RFC: Emit sync-read mems intact, with readwrite ports if applicable chipsalliance/firrtl#2092

Closed

seldridge mentioned this pull request Mar 11, 2021

RTL array_index_inout read vs. write (FIRRTL Memory Lowering Bug) #750

Closed

drom added this to the SiFive-1 milestone Mar 23, 2021