# VLSI Lab 5

Nicky Advokaat - 0740567 - n.advokaat@student.tue.nl Marcel Moreaux - 0499480 - m.l.moreaux@student.tue.nl

 $4^{\rm th}$  quartile, 2014

#### Abstract

This report contains solutions for the problems described in Assignment L5 for the course VLSI Programming.

### Contents

- 1 Problem Specification and Requirements
- 2 Solution
  - 2.1 Idea
  - 2.2 Implementation
- 3 Results
  - 3.1 Resource Usage
  - 3.2 Properties
  - 3.3 Analysis of Filter Output
- 4 Appendix A: Answers to inline questions
  - 4.1 Question 1
  - 4.2 Question 2
  - 4.3 Question 3
- 5 Appendix B: Verilog source code

### 1 Problem Specification and Requirements

We need to implement an upscaler that can process $n$ streams at once, where all streams have the same sample rate and the same upscaling factor. This means we can reuse most of our code from L4, in which we created a single-stream upscaler. The calculations performed for each stream are the same as for a single stream, so we only need to store more inputs and interleave the filtering. The coefficients and the way they are stored are unchanged and can be copied from assignment L4 as well. The upscaler has the following requirements:

- The system must run at 100 MHz.
- The system must handle at least 128 streams, but preferably more. We test the filter with an incrementally increasing number of input streams.
- All streams must be correctly upscaled from 44.1 kHz to 48 kHz and output in the correct order.
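As a quick check of the last requirement, the ratio between the output and input sample rates reduces to the values used for the `L` and `M` parameters in Appendix B:

$$\frac{48\,000}{44\,100} = \frac{160}{147} = \frac{L}{M}$$

So for every $L = 160$ output samples of a stream, exactly $M = 147$ input samples are consumed.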

### 2 Solution

In this section we describe the key ideas behind our design, and the decisions we made during the design process.

#### 2.1 Idea

Figure 1 shows an architecture diagram of our design.

To store the last 4 input values for each of the streams, we allocate 4 arrays of size $n$ in BRAM. When an input for stream $j$ is ready, we move entry $j$ of the $i^{th}$ array to entry $j$ of the $(i+1)^{th}$ array for $0 \le i < 3$, and store the new input value in entry $j$ of array 0.

The filtering happens as follows. For each stream $0 \le i < n$, we load the $i^{th}$ value from each of the input buffer arrays $0 \le j < 4$. These 4 values are used as input values for the FIR, together with the corresponding coefficients according to the direct equation for the filter. The values are requested from the arrays one clock cycle before they are needed in the computation, since reads from the BRAM are performed synchronously.

As seen in the diagram, we use 4 multipliers in parallel. Alternatively, we could have used one multiplier doing 4 sequential operations for each sample, but we decided to optimize for throughput rather than for hardware usage.
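The one-cycle read latency comes from the registered output of block RAM. A minimal sketch of a synchronous-read memory for one of the four sample arrays could look as follows. The module name `sample_ram`, its port names and the `[N-1:0]` bit ordering are assumptions made purely for illustration; they do not appear in our design, which indexes the arrays directly (see Appendix B).

```verilog
// Hypothetical illustration only; not part of filter.v.
module sample_ram
    #(parameter DWIDTH = 16,
      parameter AWIDTH = 10)              // 2^10 = 1024 entries, one per stream
    (input                   clk,
     input                   we,          // write enable
     input      [AWIDTH-1:0] addr,        // stream index
     input      [DWIDTH-1:0] din,
     output reg [DWIDTH-1:0] dout);

    reg [DWIDTH-1:0] mem [0:(1<<AWIDTH)-1];

    always @(posedge clk)
    begin
        if (we)
            mem[addr] <= din;
        // Registered read: dout holds the old contents of mem[addr]
        // after the clock edge on which addr was applied.
        dout <= mem[addr];
    end
endmodule
```

Because `dout` is a register, an address has to be presented one clock cycle before the corresponding value can be used in the FIR computation.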

The coefficients $h[\cdot]$ are the same as in lab L4. Because they are symmetric around $h[2L]$, we store only the first $2L + 1$ of them.
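Written out, the `sum` expression in Appendix B corresponds to the following. The symbols $x_0, \dots, x_3$ are introduced here for illustration and stand for the four buffered samples `in0` to `in3` of one stream; $m$ is the delayed state counter and $y$ the output sample.

$$y = \Big(\sum_{k=0}^{3} x_k \, h[m + kL]\Big) \gg 15, \qquad h[i] = h[4L - i] \ \text{for} \ 2L < i < 4L$$

Applying the symmetry to the last two terms gives $h[m + 2L] = h[2L - m]$ and $h[m + 3L] = h[L - m]$, so only the stored indices $0$ to $2L$ are ever read.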

The input storing and FIR processing for each stream happens in the same clock cycle, so that we could theoretically produce one output per clock cycle.



Figure 1: Architecture diagram of the system.

#### 2.2 Implementation

### 3 Results

#### 3.1 Resource Usage

#### 3.2 Properties

ISE reports the following timing statistics for our design with $n = 1024$.

- Synthesis report
  - Minimum period: 14.714 ns
  - Maximum frequency: 67.961 MHz
- Post-PAR static timing report
  - Minimum period: 16.155 ns
  - Maximum frequency: 61.900 MHz

This leaves us with a maximum frequency below 100 MHz. However, because we produce an output every clock cycle, we can still process 1024 streams at once. The 100 MHz requirement therefore does not seem very important for throughput. It does mean, however, that we cannot compile our design on the ngrid server.
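To make this concrete, assume one output sample per clock cycle (ignoring handshake stalls) and an output rate of 48 kHz per stream. The post-PAR frequency then bounds the number of streams by

$$\left\lfloor \frac{61.900 \cdot 10^{6}}{48 \cdot 10^{3}} \right\rfloor = 1289 \ge 1024$$

Conversely, 1024 streams need an output rate of $1024 \cdot 48\ \text{kHz} = 49.152\ \text{MHz}$, which is below the reported maximum frequency.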

#### 3.3 Analysis of Filter Output

In this section we show the correctness of the upscaler by analyzing its input and output. Figure 2 shows the first part of the input and output signals. There is a finite startup transient, visible as the first part of the output signal being zero. The plot also shows the latency of the system: the output signal is delayed compared to the input signal. Apart from those differences, the signals appear identical. Figure 3 shows another plot of the input and output signals. In this plot we have shifted the output signal by $-3$ samples and marked the individual samples with dots. We can now see that the output signal has a higher sample frequency than the input signal.



Figure 2: Plot of the first part of the original signal (red) and the signal after filtering (blue).



Figure 3: Plot of part of the original signal (red) and the signal after filtering (blue) with dots indicating the samples.

Figure 4 shows the signals in the frequency domain. There is only a minimal difference between them. The output spectrum extends to a higher maximum frequency because the output has a higher sample rate. Finally, Figure 5 displays some waveforms of the design. For this image we have used $n = 2$. We can see this in the input and output streams, in which <0000> is interleaved with nonzero values. Looking at clk, we can see that we do indeed produce an output every clock cycle. We can also see a period in which filter\_in\_ack and filter\_req\_ack are zero, during which input and output remain stable.



Figure 4: Plot in the frequency domain of the original signal (top) and the signal after filtering (bottom).



Figure 5: Part of the waveforms of our design.

### 4 Appendix A: Answers to inline questions

#### 4.1 Question 1

We produce an output every clock cycle. At 100 MHz and an output rate of 48 kHz per stream we could therefore process  $\frac{100 \cdot 10^6}{48 \cdot 10^3} \approx 2083$  streams.

#### 4.2 Question 2

Our design can process 1024 input streams. It does not, however, run at 100 MHz; we achieve this number of streams by producing an output every clock cycle. The downside is that the ngrid server does not compile designs that do not run at 100 MHz or more, so we could not test it on the Xilinx board.

#### 4.3 Question 3

Section 3.3 contains an analysis of the input and output samples. One stream represents an audio file; the others are zero. We have checked that the output audio is correct and has a sample frequency of 48 kHz. Our design makes no distinction between the streams: they are all processed in the same way, and we do not use the number of the stream that contains actual data. It is therefore reasonable to assume that all streams are processed correctly and have an output sample frequency of 48 kHz.

### 5 Appendix B: Verilog source code

This appendix includes Verilog source code for the filter.v file in the ISE project.

```verilog
`timescale 1ns / 1ps
module filter
    #(parameter DWIDTH = 16,
      parameter DDWIDTH = 2*DWIDTH,
      parameter L = 160,
      parameter L_LOG = 8,
      parameter M = 147,
      parameter M_LOG = 8,
      parameter CWIDTH = 4*L,
      parameter NR_STREAMS = 1024,
      parameter NR_STREAMS_LOG = 10)
    (input clk,
     input rst,
     output req_in,
     input ack_in,
     input [0:DWIDTH-1] data_in,
     output req_out,
     input ack_out,
     output [0:DWIDTH-1] data_out);

    // Output request register
    reg req_out_buf;
    assign req_out = req_out_buf;

    // Input request register
    reg req_in_buf;
    assign req_in = req_in_buf;

    // Index of the stream currently being processed
    reg [0:NR_STREAMS_LOG-1] stream;

    // State counter. l = nM mod L (calculated efficiently using
    // conditionals and addition/subtraction)
    reg [0:L_LOG-1] l;

    // Delayed state counter, to compensate for the fact that
    // I/O and computation happen simultaneously now.
    reg [0:L_LOG-1] m;

    // The last 4 input samples used in the FIR (excluding
    // the newest, which will be in data_in)
    reg signed [0:DWIDTH-1] in0 [0:NR_STREAMS-1];
    reg signed [0:DWIDTH-1] in1 [0:NR_STREAMS-1];
    reg signed [0:DWIDTH-1] in2 [0:NR_STREAMS-1];
    reg signed [0:DWIDTH-1] in3 [0:NR_STREAMS-1];

    // The FIR coefficients (lots of them)
    // Naively, we'd need [0:4L-1], but since the coefficients are
    // symmetric around h[2L], we can just reuse h[2L-1:1] for
    // h[2L+1:4L-1]
    reg signed [0:DWIDTH-1] h [0:2*L];

    // Accumulator (shifted right by 15 bits and assigned to the output port)
    reg signed [0:DDWIDTH-1] sum;
    assign data_out = sum >> 15;

    initial
    begin
        // Initialize the ROM with coefficients from file
        $readmemh("coefficients.txt", h);
    end

    always @(posedge clk)
    begin
        // Reset => initialize
        if (rst)
        begin
            req_in_buf <= 0;
            req_out_buf <= 0;
            stream <= 0;
            l <= 0;
        end
        // !Reset => run
        else
        begin
            // Read handshake complete
            if (req_in && ack_in)
            begin
                in0[stream] <= data_in;
                in1[stream] <= in0[stream];
                in2[stream] <= in1[stream];
                in3[stream] <= in2[stream];
                //sum <= (data_in >> 1) | (data_in & 32768);
                req_out_buf <= 1;
            end
            // Read handshake is pending, so stop producing output
            if (req_in && !ack_in)
            begin
                req_out_buf <= 0;
            end
            // Write handshake complete
            if (req_out && ack_out)
            begin
                if (stream == 0)
                begin
                    // l <= (l + M) mod L, implemented with conditionals
                    // No-overflow case
                    if (l < L - M)
                    begin
                        l <= l + M;
                        req_in_buf <= 0;
                    end
                    // Overflow case. Overflow also means we need a new input!
                    else
                    begin
                        l <= l - (L - M);
                        req_in_buf <= 1;
                    end
                    m <= l;
                end
                sum <= in0[stream] * h[m] + in1[stream] * h[m+L]
                     + in2[stream] * h[L*2-m] + in3[stream] * h[L-m];
                stream <= (stream + 1) & (NR_STREAMS-1);
            end
            // Write handshake is pending, so stop requesting input
            if (req_out && !ack_out)
            begin
                req_in_buf <= 0;
            end
            // Idle state
            if (!req_in && !ack_in && !req_out && !ack_out)
            begin
                req_in_buf <= 1;
            end
        end
    end

endmodule
```