[FIRRTL] Lower non-trivial memory latencies to RTL #585

Merged Feb 22, 2021 (1 commit)

@seldridge (Member) commented Feb 12, 2021

Fixes #477.

Add lowering of memories with non-zero read latency or a write
latency greater than one, using the strategy of the Scala FIRRTL
Compiler (SFC). If a memory has read or write ports with such a
latency, add delay pipes that delay the read or write for the
expected number of cycles. This deviates slightly from the SFC
behavior: each pipe is created as one aggregate (array) register
rather than as one scalar register for each stage of the pipe.
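
The relationship between latency and pipe depth can be sketched as follows. This is an illustrative Python helper, not code from this PR; the names `PipeDepths` and `pipe_depths` are hypothetical. It assumes what the example further down shows: a read latency of 2 gives two read pipe stages, while a write latency of 2 gives one write pipe stage, because the clocked write itself accounts for one cycle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipeDepths:
    read_stages: int   # registers on the read port's en/addr
    write_stages: int  # registers on the write port's en/addr/mask/data


def pipe_depths(read_latency: int, write_latency: int) -> PipeDepths:
    """Number of delay-pipe stages implied by a memory's latencies.

    A read with latency 0 is combinational, so every cycle of read latency
    becomes one stage. The write happens on a clock edge and accounts for
    one cycle by itself, so only write_latency - 1 stages are added.
    """
    if read_latency < 0:
        raise ValueError("read latency must be >= 0")
    if write_latency < 1:
        raise ValueError("write latency must be >= 1")
    return PipeDepths(read_latency, write_latency - 1)


# The memory in the example below: read-latency 2, write-latency 2.
example = pipe_depths(read_latency=2, write_latency=2)
```

With read latency 0 and write latency 1 both depths are zero, which is the case that needed no pipes before this change.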

Remove incomplete support for aggregate memory lowering. Before this
commit, memories would be split, but there was no logic to actually
handle the split memories. Change this so that aggregate memories are
an error during lowering, and tell the user how to run FIRRTL type lowering.

The existing register initialization logic is extended to support
arrays (as this is required for the pipeline registers).

Update tests and add a new test to check the pipeline register
behavior.

While large read and write latencies are arguably something that we could choose not to support, the standard sequential read memory (read latency 1, write latency 1) still needs some lowering when the memory blackboxing of #493 isn't run. This PR just solves the general case and aligns the behavior with SFC lowering.

I believe that removing the vestigial, incomplete memory lowering during FIRRTL to RTL conversion is justified because: (1) we should aim to do things correctly once and not have multiple different paths to produce the same effects and (2) FIRRTL to RTL conversion should eventually lower memories as aggregates anyway (structs/vectors should be preserved here). I don't think we're losing anything by ripping out the vestigial support and pushing people towards FIRRTL type lowering.

This PR complements memory blackboxing (#493) and the addition of generator ops (#547). It is the fallback path if memories aren't blackboxed. Alternatively, its logic can be reused for an eventual default memory lowering/memory generator expansion pass.

Example

Consider the following FIRRTL IR (this is the example added in the tests). This has read and write latencies of 2:

circuit Foo:
  module Foo:
    input clock: Clock
    input rAddr: UInt<4>
    input rEn: UInt<1>
    output rData: UInt<8>
    input wAddr: UInt<4>
    input wEn: UInt<1>
    input wMask: UInt<1>
    input wData: UInt<8>

    mem memory:
      data-type => UInt<8>
      depth => 16
      reader => r
      writer => w
      read-latency => 2
      write-latency => 2
      read-under-write => undefined

    memory.r.clk <= clock
    memory.r.en <= rEn
    memory.r.addr <= rAddr
    rData <= memory.r.data

    memory.w.clk <= clock
    memory.w.en <= wEn
    memory.w.addr <= wAddr
    memory.w.mask <= wMask
    memory.w.data <= wData

If you compile this with the SFC, you get the read delayed by two cycles and the write delayed by one cycle (the clocked write to the memory itself accounts for the second cycle of write latency):

module Foo(
  input        clock,
  input  [3:0] rAddr,
  input        rEn,
  output [7:0] rData,
  input  [3:0] wAddr,
  input        wEn,
  input        wMask,
  input  [7:0] wData
);
`ifdef RANDOMIZE_MEM_INIT
  reg [31:0] _RAND_0;
`endif // RANDOMIZE_MEM_INIT
`ifdef RANDOMIZE_REG_INIT
  reg [31:0] _RAND_1;
  reg [31:0] _RAND_2;
  reg [31:0] _RAND_3;
  reg [31:0] _RAND_4;
  reg [31:0] _RAND_5;
  reg [31:0] _RAND_6;
  reg [31:0] _RAND_7;
  reg [31:0] _RAND_8;
`endif // RANDOMIZE_REG_INIT
  reg [7:0] memory [0:15];
  wire [7:0] memory_r_data;
  wire [3:0] memory_r_addr;
  wire [7:0] memory_w_data;
  wire [3:0] memory_w_addr;
  wire  memory_w_mask;
  wire  memory_w_en;
  reg  memory_r_en_pipe_0;
  reg [3:0] memory_r_addr_pipe_0;
  reg  memory_r_en_pipe_1;
  reg [3:0] memory_r_addr_pipe_1;
  reg  memory_w_en_pipe_0;
  reg [3:0] memory_w_addr_pipe_0;
  reg  memory_w_mask_pipe_0;
  reg [7:0] memory_w_data_pipe_0;
  assign memory_r_addr = memory_r_addr_pipe_1;
  assign memory_r_data = memory[memory_r_addr];
  assign memory_w_data = memory_w_data_pipe_0;
  assign memory_w_addr = memory_w_addr_pipe_0;
  assign memory_w_mask = memory_w_mask_pipe_0;
  assign memory_w_en = memory_w_en_pipe_0;
  assign rData = memory_r_data;
  always @(posedge clock) begin
    if(memory_w_en & memory_w_mask) begin
      memory[memory_w_addr] <= memory_w_data;
    end
    memory_r_en_pipe_0 <= rEn;
    if (rEn) begin
      memory_r_addr_pipe_0 <= rAddr;
    end
    memory_r_en_pipe_1 <= memory_r_en_pipe_0;
    if (memory_r_en_pipe_0) begin
      memory_r_addr_pipe_1 <= memory_r_addr_pipe_0;
    end
    memory_w_en_pipe_0 <= wEn;
    if (wEn) begin
      memory_w_addr_pipe_0 <= wAddr;
    end
    if (wEn) begin
      memory_w_mask_pipe_0 <= wMask;
    end
    if (wEn) begin
      memory_w_data_pipe_0 <= wData;
    end
  end
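`ifndef SYNTHESIS
`ifdef RANDOMIZE_MEM_INIT
  integer initvar; // loop index for the memory initialization below
`endif
initial begin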
  `ifdef RANDOMIZE
    `ifdef INIT_RANDOM
      `INIT_RANDOM
    `endif
    `ifndef VERILATOR
      `ifdef RANDOMIZE_DELAY
        #`RANDOMIZE_DELAY begin end
      `else
        #0.002 begin end
      `endif
    `endif
`ifdef RANDOMIZE_MEM_INIT
  _RAND_0 = {1{`RANDOM}};
  for (initvar = 0; initvar < 16; initvar = initvar+1)
    memory[initvar] = _RAND_0[7:0];
`endif // RANDOMIZE_MEM_INIT
`ifdef RANDOMIZE_REG_INIT
  _RAND_1 = {1{`RANDOM}};
  memory_r_en_pipe_0 = _RAND_1[0:0];
  _RAND_2 = {1{`RANDOM}};
  memory_r_addr_pipe_0 = _RAND_2[3:0];
  _RAND_3 = {1{`RANDOM}};
  memory_r_en_pipe_1 = _RAND_3[0:0];
  _RAND_4 = {1{`RANDOM}};
  memory_r_addr_pipe_1 = _RAND_4[3:0];
  _RAND_5 = {1{`RANDOM}};
  memory_w_en_pipe_0 = _RAND_5[0:0];
  _RAND_6 = {1{`RANDOM}};
  memory_w_addr_pipe_0 = _RAND_6[3:0];
  _RAND_7 = {1{`RANDOM}};
  memory_w_mask_pipe_0 = _RAND_7[0:0];
  _RAND_8 = {1{`RANDOM}};
  memory_w_data_pipe_0 = _RAND_8[7:0];
`endif // RANDOMIZE_REG_INIT
  `endif // RANDOMIZE
end // initial
`ifdef FIRRTL_AFTER_INITIAL
`FIRRTL_AFTER_INITIAL
`endif
`endif // SYNTHESIS
endmodule

With this PR, you get the following from CIRCT (firtool -enable-lower-types -lower-to-rtl -verilog):

module Foo(
  input        clock,
  input  [3:0] rAddr,
  input        rEn,
  input  [3:0] wAddr,
  input        wEn, wMask,
  input  [7:0] wData,
  output [7:0] rData);

  reg  [7:0] memory[15:0];	// Foo.fir:12:5
  wire [3:0] memory_r_addr;	// Foo.fir:12:5
  wire       memory_r_en;	// Foo.fir:12:5
  wire       memory_r_clk;	// Foo.fir:12:5
  wire [7:0] memory_r_data;	// Foo.fir:12:5
  reg        memory_r_en_pipe[1:0];	// Foo.fir:12:5
  reg  [3:0] memory_r_addr_pipe[1:0];	// Foo.fir:12:5
  wire [3:0] memory_w_addr;	// Foo.fir:12:5
  wire       memory_w_en;	// Foo.fir:12:5
  wire       memory_w_clk;	// Foo.fir:12:5
  wire [7:0] memory_w_data;	// Foo.fir:12:5
  wire       memory_w_mask;	// Foo.fir:12:5
  reg        memory_w_en_pipe[0:0];	// Foo.fir:12:5
  reg  [3:0] memory_w_addr_pipe[0:0];	// Foo.fir:12:5
  reg        memory_w_mask_pipe[0:0];	// Foo.fir:12:5
  reg  [7:0] memory_w_data_pipe[0:0];	// Foo.fir:12:5
  wire [3:0] memory_r_addr_5;	// Foo.fir:12:5
  wire       memory_r_en_6;	// Foo.fir:12:5
  wire       memory_r_clk_7;	// Foo.fir:12:5
  wire [3:0] memory_w_addr_8;	// Foo.fir:12:5
  wire       memory_w_en_9;	// Foo.fir:12:5
  wire       memory_w_clk_10;	// Foo.fir:12:5

  wire _T = memory_r_clk;	// Foo.fir:12:5
  always_ff @(posedge _T) begin	// Foo.fir:12:5
    logic _T_0 = memory_r_en;	// Foo.fir:12:5
    memory_r_en_pipe[1'h0] <= _T_0;	// Foo.fir:12:5
    if (_T_0) begin	// Foo.fir:12:5
      memory_r_addr_pipe[1'h0] <= memory_r_addr;	// Foo.fir:12:5
    end
    memory_r_en_pipe[1'h1] <= memory_r_en_pipe[1'h0];	// Foo.fir:12:5
    if (memory_r_en_pipe[1'h0]) begin	// Foo.fir:12:5
      memory_r_addr_pipe[1'h1] <= memory_r_addr_pipe[1'h0];	// Foo.fir:12:5
    end
  end // always_ff @(posedge)
  assign memory_r_data = memory[memory_r_addr_pipe[1'h1]];	// Foo.fir:12:5
  wire _T_1 = memory_w_clk;	// Foo.fir:12:5
  always_ff @(posedge _T_1) begin	// Foo.fir:12:5
    logic _T_2 = memory_w_en;	// Foo.fir:12:5
    memory_w_en_pipe[1'h0] <= _T_2;	// Foo.fir:12:5
    if (_T_2) begin	// Foo.fir:12:5
      memory_w_addr_pipe[1'h0] <= memory_w_addr;	// Foo.fir:12:5
      memory_w_mask_pipe[1'h0] <= memory_w_mask;	// Foo.fir:12:5
      memory_w_data_pipe[1'h0] <= memory_w_data;	// Foo.fir:12:5
    end
    if (memory_w_en_pipe[1'h0] & memory_w_mask_pipe[1'h0]) begin	// Foo.fir:12:5
      memory[memory_w_addr_pipe[1'h0]] <= memory_w_data_pipe[1'h0];	// Foo.fir:12:5
    end
  end // always_ff @(posedge)
  `ifndef SYNTHESIS	// Foo.fir:12:5
    initial begin	// Foo.fir:12:5
      `INIT_RANDOM_PROLOG_	// Foo.fir:12:5
      `ifdef RANDOMIZE_MEM_INIT	// Foo.fir:12:5
        integer memory_initvar;
        for (memory_initvar = 0; memory_initvar < 16; memory_initvar = memory_initvar+1)
          memory[memory_initvar] = `RANDOM;	// Foo.fir:12:5
      `endif
      `ifdef RANDOMIZE_REG_INIT	// Foo.fir:12:5
        logic _T_3 = `RANDOM;	// Foo.fir:12:5
        memory_r_en_pipe[1'h0] = _T_3;	// Foo.fir:12:5
        memory_r_en_pipe[1'h1] = _T_3;	// Foo.fir:12:5
      `endif
      `ifdef RANDOMIZE_REG_INIT	// Foo.fir:12:5
        logic [3:0] _T_4 = `RANDOM;	// Foo.fir:12:5
        memory_r_addr_pipe[1'h0] = _T_4;	// Foo.fir:12:5
        memory_r_addr_pipe[1'h1] = _T_4;	// Foo.fir:12:5
      `endif
      `ifdef RANDOMIZE_REG_INIT	// Foo.fir:12:5
        memory_w_en_pipe[1'h0] = `RANDOM;	// Foo.fir:12:5
      `endif
      `ifdef RANDOMIZE_REG_INIT	// Foo.fir:12:5
        memory_w_addr_pipe[1'h0] = `RANDOM;	// Foo.fir:12:5
      `endif
      `ifdef RANDOMIZE_REG_INIT	// Foo.fir:12:5
        memory_w_mask_pipe[1'h0] = `RANDOM;	// Foo.fir:12:5
      `endif
      `ifdef RANDOMIZE_REG_INIT	// Foo.fir:12:5
        memory_w_data_pipe[1'h0] = `RANDOM;	// Foo.fir:12:5
      `endif
    end // initial
  `endif
  assign memory_r_addr = memory_r_addr_5;	// Foo.fir:12:5
  assign memory_r_en = memory_r_en_6;	// Foo.fir:12:5
  assign memory_r_clk = memory_r_clk_7;	// Foo.fir:12:5
  assign memory_w_addr = memory_w_addr_8;	// Foo.fir:12:5
  assign memory_w_en = memory_w_en_9;	// Foo.fir:12:5
  assign memory_w_clk = memory_w_clk_10;	// Foo.fir:12:5
  assign memory_r_clk_7 = clock;	// Foo.fir:21:18
  assign memory_r_en_6 = rEn;	// Foo.fir:22:17
  assign memory_r_addr_5 = rAddr;	// Foo.fir:23:19
  assign memory_w_clk_10 = clock;	// Foo.fir:26:18
  assign memory_w_en_9 = wEn;	// Foo.fir:27:17
  assign memory_w_addr_8 = wAddr;	// Foo.fir:28:19
  assign memory_w_mask = wMask;	// Foo.fir:29:19
  assign memory_w_data = wData;	// Foo.fir:30:19
  assign rData = memory_r_data;	// Foo.fir:2:3
endmodule
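
The timing that both Verilog listings above implement can be checked with a small cycle-level model. The following is a minimal Python sketch, not part of CIRCT or the SFC: the class `DelayedMemory` and its methods are hypothetical, registers start at zero instead of random values, and the memory is assumed to have exactly one read port and one write port. Each `step` call is one rising clock edge, and every register loads from pre-edge values, as with the nonblocking assignments above.

```python
class DelayedMemory:
    """Cycle-level model of a one-read, one-write memory with delay pipes."""

    def __init__(self, depth, read_latency, write_latency):
        if read_latency < 0 or write_latency < 1:
            raise ValueError("need read_latency >= 0 and write_latency >= 1")
        self.mem = [0] * depth
        # Read stages hold (en, addr); write stages hold (en, addr, mask, data).
        self.r_pipe = [(0, 0)] * read_latency
        self.w_pipe = [(0, 0, 0, 0)] * (write_latency - 1)

    @staticmethod
    def _shift(pipe, inputs):
        """Next state of a pipe: stage i loads from stage i-1 (or the inputs)."""
        new_pipe, prev = [], inputs
        for cur in pipe:
            en = prev[0]
            # The enable always advances; the payload only advances when
            # the upstream enable is set, otherwise the stage holds.
            payload = prev[1:] if en else cur[1:]
            new_pipe.append((en,) + payload)
            prev = cur  # pre-edge value of this stage feeds the next one
        return new_pipe

    def step(self, r_en, r_addr, w_en, w_addr, w_mask, w_data):
        """Apply one rising clock edge and return rData after the edge."""
        w_in = (w_en, w_addr, w_mask, w_data)
        # The write uses the last write stage, or the inputs if there is none.
        en, addr, mask, data = self.w_pipe[-1] if self.w_pipe else w_in
        next_r = self._shift(self.r_pipe, (r_en, r_addr))
        next_w = self._shift(self.w_pipe, w_in)
        if en and mask:
            self.mem[addr] = data
        self.r_pipe, self.w_pipe = next_r, next_w
        # The read itself is combinational on the last read-stage address.
        read_addr = self.r_pipe[-1][1] if self.r_pipe else r_addr
        return self.mem[read_addr]


m = DelayedMemory(depth=16, read_latency=2, write_latency=2)
# Edge 1: the write enters the write pipe; the memory is still unchanged.
m.step(r_en=0, r_addr=0, w_en=1, w_addr=3, w_mask=1, w_data=0xAB)
# Edge 2: the write lands in the memory; a read of address 3 enters the read pipe.
m.step(r_en=1, r_addr=3, w_en=0, w_addr=0, w_mask=0, w_data=0)
# Edge 3: the read address reaches the last stage, so rData shows the written byte.
r_data = m.step(r_en=0, r_addr=0, w_en=0, w_addr=0, w_mask=0, w_data=0)
```

In this run the write is visible in the memory after two edges and the read data after two edges from when the read was issued, matching the latencies of 2 in the FIRRTL example.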

@mikeurbach (Contributor) left a comment

I have a broader comment that's outside the scope of this PR. This isn't the first time a CIRCT component has needed to add pipeline registers like this. Around memories, Handshake has to do exactly this, in the control network. Maybe there could be some shared utility if this is a common building block that CIRCT developers reach for.

@seldridge (Member, Author) commented

@mikeurbach wrote:

I have a broader comment that's outside the scope of this PR. This isn't the first time a CIRCT component has needed to add pipeline registers like this. Around memories, Handshake has to do exactly this, in the control network. Maybe there could be some shared utility if this is a common building block that CIRCT developers reach for.

I agree. If the utility is purely for generating pipes in RTL (or comb + sequential) / SV, then I think that's pretty straightforward to generalize. If it's a more general utility that's abstract in the dialect that it targets, that sounds even more useful, but I'm not 100% sure how to go about it. Could this be implemented as a Dialect Interface?

@mikeurbach (Contributor) commented

I was imagining something that just targets RTL/Comb/Seq/SV directly right now. I will open an issue about this.
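
To make the idea of a shared pipe-building utility concrete, here is a rough Python sketch that emits Verilog text for an enable-gated delay pipe in the shape of the listings above. It is only an illustration: `emit_delay_pipe` and its parameters are hypothetical, it produces strings rather than RTL/Comb/Seq/SV operations, and a real CIRCT utility would be written in C++ against those dialects.

```python
def emit_delay_pipe(data: str, enable: str, width: int, depth: int,
                    clock: str = "clock") -> str:
    """Return Verilog text for a `depth`-stage, enable-gated delay pipe.

    Stage 0 loads from the `data`/`enable` signals; stage i loads from
    stage i-1. Data only advances when the upstream enable is set.
    """
    if depth < 1:
        raise ValueError("depth must be >= 1")
    rng = f"[{width - 1}:0] " if width > 1 else ""
    lines = [
        f"reg {enable}_pipe[0:{depth - 1}];",
        f"reg {rng}{data}_pipe[0:{depth - 1}];",
        f"always @(posedge {clock}) begin",
    ]
    prev_en, prev_data = enable, data
    for i in range(depth):
        lines.append(f"  {enable}_pipe[{i}] <= {prev_en};")
        lines.append(f"  if ({prev_en}) {data}_pipe[{i}] <= {prev_data};")
        prev_en, prev_data = f"{enable}_pipe[{i}]", f"{data}_pipe[{i}]"
    lines.append("end")
    return "\n".join(lines)


# Two-stage pipe for the read address of the example memory.
print(emit_delay_pipe("memory_r_addr", "memory_r_en", width=4, depth=2))
```

Keeping the enable and data pipes behind one entry point is the design choice being illustrated: a caller such as a memory lowering or the Handshake control network would only state the signal, its enable, and the depth.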

@seldridge force-pushed the dev/seldridge/issue-477 branch 5 times, most recently from e092e1f to a22c777, on February 18, 2021 20:08
Signed-off-by: Schuyler Eldridge <schuyler.eldridge@sifive.com>
Development

Successfully merging this pull request may close these issues.

[FIRRTL] Lower to RTL for Memories of Read Latency > 0, Write Latency > 1
4 participants