# RTDSP Lab 4

# Yong Wen Chua (ywc110) & Ryan Savitski (rs5010)

# Declaration

Declaration: We confirm that this submission is our own work. In it, we give references and citations whenever we refer to or use the published, or unpublished, work of others. We are aware that this course is bound by penalties as set out in the College examination offenses policy.

Signed: Yong Wen Chua & Ryan Savitski

# **Contents**

- 1 Matlab Filter Design
  - 1.1 Coefficients
  - 1.2 Frequency Response
- 2 Non-Circular Buffer FIR Filter
  - 2.1 Code Description
  - 2.2 Oscilloscope Traces
  - 2.3 Code Performance
- 3 Circular Buffer FIR Filter
  - 3.1 Naive Implementation
    - 3.1.1 Code Description
    - 3.1.2 Code Performance
  - 3.2 Optimised Implementation
    - 3.2.1 Code Operation
    - 3.2.2 Code Performance
    - 3.2.3 Spectrum Analyser Output
- 4 Assembly Implementation
  - 4.1 Linear Implementation
    - 4.1.1 Code Operation
    - 4.1.2 Code Performance
  - 4.2 Optimised Implementation
    - 4.2.1 Optimisation Techniques
    - 4.2.2 Code Operation
    - 4.2.3 Code Performance
    - 4.2.4 Spectrum Analyser Traces
- 5 Compiler Optimisation
  - 5.1 Level 0 Optimisation
  - 5.2 Level 2 Optimisation
- A Code Listings
  - A.1 Matlab Code for Filter Generation
  - A.2 Non-Circular Buffer
  - A.3 Naive Implementation for a Circular Buffer
  - A.4 Optimised Circular Buffer Implementation
  - A.5 Assembly Implementation
    - A.5.1 C File
    - A.5.2 Linear Assembly Implementation
    - A.5.3 Optimised Assembly Implementation

# 1 Matlab Filter Design

The transition bands used in this lab are 260 Hz to 450 Hz and 2250 Hz to 2500 Hz (the choice was made before the spec was changed). The Matlab code used to generate the listing is given in section A.1.

## 1.1 Coefficients

The coefficients generated by the Order 87 filter (with 88 coefficients) are given below. Note that the filter is linear phase and the coefficients are thus symmetric.

```
-5.6238234861581632e-03  -4.8142851508671362e-03   3.2377476053097676e-03   5.2623077366623777e-03
 3.8327678023773130e-04   2.5228524080704710e-03   6.6427594305550220e-03   2.0191540917237553e-03
 6.0838154216067970e-04   6.0195074513261972e-03   2.2588854699456557e-03  -4.0581656174741142e-03
 1.1053698037032480e-03   8.7570330682306904e-04  -9.4389095569342232e-03  -6.7133371831993478e-03
-4.4912094561377273e-04  -1.0781141008779919e-02  -1.2833025044814740e-02   7.4357338553088497e-04
-3.6475744956566657e-03  -1.2285472406529016e-02   4.9069216133462504e-03   1.1791976942964414e-02
-3.5124996299853682e-03   8.4328459566069963e-03   2.8242140033990469e-02   9.5414427887197482e-03
 4.6527705138212187e-03   3.3181195014207174e-02   1.9161471979520985e-02  -1.1640938381470692e-02
 1.4905816953372706e-02   1.8640626747436755e-02  -3.8390515525090867e-02  -3.0977635742666779e-02
 6.9891233521773809e-03  -6.2265514731766294e-02  -1.0105444362367744e-01  -9.6437225383029998e-03
-5.1295032155504995e-02  -2.1091838197686907e-01  -2.1562212120427096e-02   4.2133153775698379e-01
 4.2133153775698379e-01  -2.1562212120427096e-02  -2.1091838197686907e-01  -5.1295032155504995e-02
-9.6437225383029998e-03  -1.0105444362367744e-01  -6.2265514731766294e-02   6.9891233521773809e-03
-3.0977635742666779e-02  -3.8390515525090867e-02   1.8640626747436755e-02   1.4905816953372706e-02
-1.1640938381470692e-02   1.9161471979520985e-02   3.3181195014207174e-02   4.6527705138212187e-03
 9.5414427887197482e-03   2.8242140033990469e-02   8.4328459566069963e-03  -3.5124996299853682e-03
 1.1791976942964414e-02   4.9069216133462504e-03  -1.2285472406529016e-02  -3.6475744956566657e-03
 7.4357338553088497e-04  -1.2833025044814740e-02  -1.0781141008779919e-02  -4.4912094561377273e-04
-6.7133371831993478e-03  -9.4389095569342232e-03   8.7570330682306904e-04   1.1053698037032480e-03
-4.0581656174741142e-03   2.2588854699456557e-03   6.0195074513261972e-03   6.0838154216067970e-04
 2.0191540917237553e-03   6.6427594305550220e-03   2.5228524080704710e-03   3.8327678023773130e-04
 5.2623077366623777e-03   3.2377476053097676e-03  -4.8142851508671362e-03  -5.6238234861581632e-03
```

## 1.2 Frequency Response

The frequency response of the generated filter is given on the following page.







# 2 Non-Circular Buffer FIR Filter

The code for the non-circular buffer FIR filter is given in section A.2.

## 2.1 Code Description

The coefficients for the filter are kept in a global double array named b. An array of size 88, buffer, stores the previous inputs required for the convolution. The code is inside the ISR and is presented below. At the start of every ISR, the buffer's contents are shifted:

```
int i;
double output = 0;

// shift buffer
for (i = N-1; i != 0; --i)
    buffer[i] = buffer[i-1];
buffer[0] = mono_read_16Bit(); // new sample

// mac loop
for (i = 0; i < N; ++i)
    output += b[i] * buffer[i];

mono_write_16Bit(output); // write
```

The code starts by shifting every sample in the buffer one position towards the end (buffer[i] = buffer[i-1]), which discards the oldest sample. It then reads a new sample and writes it into the start of the buffer. The convolution is then performed with a MAC loop, according to the following equation:

$$output = \sum_{i=0}^{87} b[i] \times buffer[i]$$

Finally, the sample is written to the output codec.
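As an illustration of the shift-and-MAC step described above, the following is a minimal sketch that can be run off the board. It assumes the codec calls are replaced by a function argument and a return value, and the function name `fir_shift_step` and the length parameter `n` are ours rather than part of the listing in section A.2.

```c
#include <stddef.h>

/* One FIR step on a shift-register delay line of length n (n >= 1).
 * buffer[0] holds the newest sample, buffer[n-1] the oldest.
 * Hypothetical helper: the sample x stands in for mono_read_16Bit()
 * and the return value stands in for mono_write_16Bit(). */
static double fir_shift_step(double *buffer, const double *b, size_t n, double x)
{
    size_t i;
    double output = 0.0;

    /* shift: every sample moves one slot towards the end, dropping the oldest */
    for (i = n - 1; i != 0; --i)
        buffer[i] = buffer[i - 1];
    buffer[0] = x; /* new sample */

    /* MAC loop: output = sum of b[i] * buffer[i] */
    for (i = 0; i < n; ++i)
        output += b[i] * buffer[i];

    return output;
}
```

Feeding a unit impulse followed by zeros should return the coefficients in order, which is a quick way to check the indexing.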

## 2.2 Oscilloscope Traces

The oscilloscope traces of the filter implemented on the DSP behave as expected, with the output amplitude changing with the input frequency.



Figure 2.1: 200 Hz input, with almost zero output. This is in the stop-band.



Figure 2.2: 400 Hz input, with increasing output amplitude. This is in the first transition band.



Figure 2.3: 500 Hz input, with maximum output amplitude. This is within the passband.



Figure 2.4: 1500 Hz input, with maximum amplitude. This is within the passband.



Figure 2.5: 2400 Hz, with decreasing amplitude. This is within the second transition band.



Figure 2.6: 3000 Hz input, with zero output. This is within the second stop-band.

## 2.3 Code Performance

The cycle counts for the simple non-circular FIR, including the overhead of the sample read/write calls, are given below. Note that the actual overhead cycle counts depend on both the optimisation level and how (or whether) the compiler inlines the calls. In addition, the breakpoint insertion offsets make it hard to time the overhead of just the sample input/output operations. It is therefore not sensible to present the results without the overheads included. The estimated overhead at o2/o3 is 130 cycles.

| Optimisation Level | Cycle count with read/write sample overhead |
|--------------------|---------------------------------------------|
| None               | 5766                                        |
| o0                 | 4666                                        |
| o1                 | 3855                                        |
| o2/o3              | 1584                                        |

Note that, as this is a DSP board, the compiler toolchain is written to recognise MAC-like loops and buffer operations, inserting templated assembly where needed. For example, for the above code the compiler generated a shift loop with three branches in flight, which is much faster than naive serial shifting. We are aware that simple MAC loop code such as the above can also trigger a dual-unit pipelined template from the compiler at o2/o3 that would push the cycle count towards 800, but we were unsuccessful in triggering that pattern. From our observations during lab work, the compiler's pattern matcher is fairly unreliable.

# 3 Circular Buffer FIR Filter

## 3.1 Naive Implementation

A simple version of the circular buffer was first implemented to ensure that it worked correctly. The code listing can be found in section A.3. Its operations are explained in the next section.

### 3.1.1 Code Description

A variable "index" is used to indicate the position in the array at which the "current" sample should reside. This index is incremented after every new sample is obtained, eventually wrapping around to the front of the array. Thus, if the current index has value i, the nth previous sample is found at index [(i-n)+N]%N, where N=88 is the total number of coefficients. The array and index are defined by

```
Int16 buffer[N] = {0}; // initialise everything to zero
int index = 0;
```

The newly retrieved sample will first be written to the buffer.

```
*(buffer + index) = input; // equivalent to, and no faster than, writing buffer[index] = input
```

A loop is then started to perform the Multiply and Accumulate (MAC) operation and the result is stored in result. Proper circular offset buffering is calculated using the method described earlier in this section.

The index is then incremented. The mod operator ensures that proper wrapping around occurs.

```
index = (index + 1) % N;
```
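The steps of this section can be put together in a minimal sketch. This is our reconstruction from the index formula [(i-n)+N]%N, not the listing in section A.3: the struct `circ_fir_t`, the function name, the small `NTAPS` value and the use of `double` samples (the report declares the buffer as `Int16`) are all assumptions made to keep the example self-contained.

```c
#define NTAPS 4 /* illustrative length; the lab filter uses N = 88 */

/* Hypothetical state wrapper: the report keeps these as globals. */
typedef struct {
    double buffer[NTAPS];
    int index; /* slot where the current sample is written */
} circ_fir_t;

static double fir_circ_naive_step(circ_fir_t *s, const double *b, double x)
{
    double result = 0.0;
    int n;

    s->buffer[s->index] = x; /* write the new sample */

    /* MAC loop: the nth previous sample lives at (index - n + NTAPS) % NTAPS */
    for (n = 0; n < NTAPS; ++n)
        result += b[n] * s->buffer[(s->index - n + NTAPS) % NTAPS];

    s->index = (s->index + 1) % NTAPS; /* advance and wrap */
    return result;
}
```

Note that the modulo is evaluated on every MAC iteration in this form, which is one plausible reason the loop is hard for the compiler to recognise as a simple MAC pattern.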

### 3.1.2 Code Performance

The number of cycles taken between the start and the end of the ISR is given in the table below. In general, this implementation of the buffer performed worse than the non-circular buffer version described in section 2. This is because the compiler no longer sees the trigger patterns for optimised template loops.

| Optimisation Level | Cycle count with read/write overheads |
|--------------------|---------------------------------------|
| None               | 7377                                  |
| o0                 | 5830                                  |
| o2                 | 3934                                  |
## 3.2 Optimised Implementation

The code listing for the optimised implementation can be found in section A.4.

### 3.2.1 Code Operation

Similar to the code operation described in section 3.1.1, a variable "index" is used to indicate the position in the array at which the "current" sample should reside. This index is decremented after every new sample is obtained, eventually wrapping around.

The optimised circular buffer code starts with the following pointer declarations for buffer indexing/walks:

It is important to note that all of these values are compile time constants and are thus "free" in terms of performance.

The code then declares a pointer to the oldest sample (which will be overwritten by the new sample) and calculates the number of entries between the current sample and the end of the sample buffer. The MAC is performed in two parts, the first of which walks the sample buffer forwards, so this count is the loop counter for the forward walk.

```
double* sampleptr = buffer + index; // point to oldest sample initially

int loopcnt = bufferEndptr - sampleptr; // how many iterations are needed for a single-step loop
```

As the MAC loops will be manually unrolled 4 times in C (reasoning below), we also need to calculate how much misalignment there is for MAC loop iterations, to be handled separately after the first MAC loop. In addition, four accumulators are declared.

```
char modunroll = loopcnt % 4; // non-integral leftover of an unrolled loop

// accumulators
double result = 0;
double result2 = 0;
double result3 = 0;
double result4 = 0;
```

Then, the sample is read into the correct part of the circular buffer:

```
*sampleptr = mono_read_16Bit(); // read sample into buffer
```

Then the code drops into a 4-unrolled MAC loop that steps the buffers in the forwards direction until it hits the end of the sample buffer (can be a range of values since this is a circular buffer). Afterwards, the non-integral case is handled for one, two or three leftover values. Note that the loop count is known at the start of the ISR and thus the compiler will turn the condition check into a simple loop counter, which is the optimal solution.

```
// process samples until the end of the sample buffer is hit
while(sampleptr < bufferEndptr-3)

{
    result += (*coeffptr++) * (*sampleptr++);
    result2 += (*coeffptr++) * (*sampleptr++);
    result3 += (*coeffptr++) * (*sampleptr++);
    result4 += (*coeffptr++) * (*sampleptr++);
}

// take care of non-integral leftover iterations
if (modunroll>0) result += (*coeffptr++) * (*sampleptr++);
if (modunroll>1) result2 += (*coeffptr++) * (*sampleptr++);
if (modunroll>2) result3 += (*coeffptr++) * (*sampleptr++);
```

Then we reset the sample pointer to the start of the sample buffer, effectively wrapping it around and do another pass of the MAC for the remainder of the FIR operation. Note that the difference between the current coefficient pointer and the end of the coefficient buffer is the required amount of iterations. This is also turned into a simple loop counter by the compiler.

```
sampleptr = buffer; // wrap pointer to beginning of the buffer

// pass the remainder of the buffer (amount of iterations = how many coefficients there are left to process)
while (coeffptr < coeffEndptr - 3)
{
    result += (*coeffptr++) * (*sampleptr++);
    result2 += (*coeffptr++) * (*sampleptr++);
    result3 += (*coeffptr++) * (*sampleptr++);
    result4 += (*coeffptr++) * (*sampleptr++);
}

// take care of non-integral leftover iterations
if (modunroll==1) result += (*coeffptr++) * (*sampleptr++);
if (modunroll==1 || modunroll==2) result2 += (*coeffptr++) * (*sampleptr++);
if (modunroll==1 || modunroll==2 || modunroll==3) result3 += (*coeffptr++) * (*sampleptr++);
```

The wrapup part of the code does the final accumulation of partial accumulators, stepping the index and writing the resulting sample to the codec.

```
// sum the accumulators
result = result + result2 + result3 + result4;

// advance index into circular buffer
index = (index == 0) ? N-1 : index -1;

mono_write_16Bit(result); // output sample
```

The reason for unrolling the loop four times is as follows (all with o2/o3 optimisation settings):

- with one accumulator, the compiler did not notice the possible software pipelining of the results and was limited by serial writes to the single accumulator.
- with two accumulators, the compiler notices that it can use both sides of the processor, but there are still several nop cycles.
- with four accumulators, the compiler is able to unroll its generated loop an extra time, which removes the remaining nop cycles.
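The two-pass, four-accumulator walk described in this section can be assembled into a self-contained sketch as follows. The pointer declarations (`coeffptr`, `coeffEndptr`, `bufferEndptr`) are assumed from the names used in the snippets, since their declarations are not reproduced above. The state struct, the function name and the small `NTAPS` value are ours, and the codec read/write calls are replaced by an argument and a return value. The leftover handling relies on the number of taps being a multiple of four, as it is for the 88-tap filter.

```c
#define NTAPS 8 /* illustrative; must be a multiple of 4 (the lab filter uses 88) */

/* Hypothetical state wrapper: the report keeps these as globals. */
typedef struct {
    double buffer[NTAPS];
    int index; /* slot holding the oldest sample, overwritten next */
} fir_state_t;

static double fir_circ_unrolled_step(fir_state_t *s, const double *b, double x)
{
    /* assumed declarations, named after the snippets in the text */
    const double *coeffptr = b;
    const double *coeffEndptr = b + NTAPS;
    double *bufferEndptr = s->buffer + NTAPS;

    double *sampleptr = s->buffer + s->index;
    int loopcnt = (int)(bufferEndptr - sampleptr);
    int modunroll = loopcnt % 4; /* leftover of the first unrolled pass */
    double result = 0, result2 = 0, result3 = 0, result4 = 0;

    *sampleptr = x; /* new sample replaces the oldest one */

    /* pass 1: walk forwards to the end of the sample buffer */
    while (sampleptr < bufferEndptr - 3) {
        result  += (*coeffptr++) * (*sampleptr++);
        result2 += (*coeffptr++) * (*sampleptr++);
        result3 += (*coeffptr++) * (*sampleptr++);
        result4 += (*coeffptr++) * (*sampleptr++);
    }
    if (modunroll > 0) result  += (*coeffptr++) * (*sampleptr++);
    if (modunroll > 1) result2 += (*coeffptr++) * (*sampleptr++);
    if (modunroll > 2) result3 += (*coeffptr++) * (*sampleptr++);

    /* pass 2: wrap to the start and consume the remaining coefficients */
    sampleptr = s->buffer;
    while (coeffptr < coeffEndptr - 3) {
        result  += (*coeffptr++) * (*sampleptr++);
        result2 += (*coeffptr++) * (*sampleptr++);
        result3 += (*coeffptr++) * (*sampleptr++);
        result4 += (*coeffptr++) * (*sampleptr++);
    }
    /* pass 2 has (4 - modunroll) % 4 leftover terms */
    if (modunroll == 1) result  += (*coeffptr++) * (*sampleptr++);
    if (modunroll == 1 || modunroll == 2) result2 += (*coeffptr++) * (*sampleptr++);
    if (modunroll != 0) result3 += (*coeffptr++) * (*sampleptr++);

    /* step the index backwards so older samples sit at higher addresses */
    s->index = (s->index == 0) ? NTAPS - 1 : s->index - 1;

    return result + result2 + result3 + result4;
}
```

Because the total number of taps is a multiple of four, a first pass with a leftover of m terms always leaves (4 - m) % 4 leftover terms for the second pass, which is why the second set of conditions mirrors the first.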

### 3.2.2 Code Performance

With four manual loop unrolls to use as hints for the compiler and o2/o3 compiler setting, the C version can be pushed to 422 cycles for the entire ISR. From our measurements, we estimate the read/write overheads for the samples to be around 130 cycles at o2/o3, therefore we estimate that the actual optimised C version FIR pass is 192 cycles.

| Optimisation Level | Est. Cycle count without sample overheads | Cycle count with read/write overheads |
|--------------------|-------------------------------------------|---------------------------------------|
| None               | 3042                                      | 3172                                  |
| o1                 | 1842                                      | 1972                                  |
| o2/o3              | 192                                       | 422                                   |

Note that this code's performance at a high level of optimisation is very impressive. This is because the C code is structured to take advantage of the compiler's pattern matcher, so that it generates tight MAC loops and unrolls automatically.

Potentially, we could improve the code further by taking advantage of the coefficient symmetry to reduce the number of loads by a quarter and halve the number of multiplications. However, at C level, the compiler could not pick up on the possible software pipelining in practice. Software pipelining will be discussed in section 4.2.1.
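To make the symmetry argument concrete: since b[i] = b[87-i], the sum can be folded as $y = \sum_{i=0}^{43} b[i]\,(x[i] + x[87-i])$, where x[i] denotes the ith most recent sample (our notation). This needs 44 multiplications instead of 88, and 44 coefficient loads plus 88 sample loads (132 in total) instead of 176. The sketch below is ours and was not part of the submitted code. For simplicity it assumes a linear (non-circular) delay line and an even number of taps.

```c
/* Folded MAC for a symmetric (linear-phase) filter with an even number
 * of taps n. Only the first n/2 coefficients are read.
 * Hypothetical helper, not taken from the report's listings. */
static double fir_symmetric(const double *b, const double *x, int n)
{
    double acc = 0.0;
    int i;

    for (i = 0; i < n / 2; ++i)
        acc += b[i] * (x[i] + x[n - 1 - i]); /* one multiply per coefficient pair */

    return acc;
}
```

Each iteration replaces one multiplication with an addition, so any gain depends on how well the additions can be scheduled alongside the multiplies.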

Another possible optimisation path would be to change the filter, while keeping it conformant to the spec, to have a significant amount of zero coefficients, which could be handled very quickly by the code.

### 3.2.3 Spectrum Analyser Output

The output for the spectrum analyser is given in the figures below. Due to the input being fed to only one channel on the DSP, along with the potential divider in the circuitry, the value "seen" by the DSP will be one-fourth of what was provided by the analyser. This leads to an approximate -12 dB gain for the output in the frequency response. The figures given below have the necessary offset to reflect this.
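For reference, the -12 dB figure follows directly from the amplitude ratio of one quarter:

$$20\log_{10}\left(\tfrac{1}{4}\right) = -20\log_{10}(4) \approx -12.04\ \text{dB}$$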

The phase is observed to still be linear on hardware, but it is important to note that the overall change in the roughly 300 Hz to 2500 Hz range is 140 radians. This is more than the 70 radians change we observed in the Matlab plots, meaning that the slope of the phase response is steeper.

The relationship between the phase response $\phi(\omega)$ and the group delay $\tau_g(\omega)$ is given by

$$\tau_g(\omega) = -\frac{d\phi(\omega)}{d\omega}$$

Thus, a higher group delay will contribute to a steeper phase response, which was observed. This means that the hardware filter has a higher group delay, which can be attributed to the delay in the codec buffers.
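As a rough worked estimate (ours, using only the phase changes quoted above and treating the phase as linear over the band), the group delay is approximately the total phase change divided by the change in angular frequency:

$$\tau_g \approx \frac{|\Delta\phi|}{\Delta\omega} = \frac{|\Delta\phi|}{2\pi\,\Delta f}, \qquad \Delta f = 2500 - 300 = 2200\ \text{Hz}$$

$$\tau_{g,\text{hardware}} \approx \frac{140}{2\pi \times 2200} \approx 10.1\ \text{ms}, \qquad \tau_{g,\text{Matlab}} \approx \frac{70}{2\pi \times 2200} \approx 5.1\ \text{ms}$$

The hardware path therefore appears to add roughly 5 ms of delay on top of the filter itself. As a sanity check, a linear-phase FIR filter with 88 coefficients has a group delay of $(88-1)/2 = 43.5$ samples. If the sampling rate were 8 kHz (an assumption, as the rate is not stated in this section), that would be about 5.4 ms, which is close to the Matlab figure.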



# 4 Assembly Implementation

An implementation of the MAC operation was done in assembly. The code for the C file that calls the assembly function is given in section A.5.1. The ISR simply reads the sample from the input, calls the assembly function and writes the output. A buffer size of 1024 bytes was used: the buffer holds 88 entries, which require $88 \times \frac{64}{8} = 704$ bytes of space, and rounding up to the nearest power of two gives 1024.
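The sizing rule can be expressed as a small helper. The sketch below is illustrative only, and the function names are ours rather than anything in the listing in section A.5.1.

```c
#include <stddef.h>

/* Smallest power of two that is >= v (v >= 1). Hypothetical helper. */
static size_t next_pow2(size_t v)
{
    size_t p = 1;
    while (p < v)
        p <<= 1;
    return p;
}

/* Bytes to reserve for a circular buffer of `taps` samples,
 * each `bytes_per_sample` wide, rounded up to a power of two. */
static size_t circ_block_bytes(size_t taps, size_t bytes_per_sample)
{
    return next_pow2(taps * bytes_per_sample);
}
```

For 88 double-precision samples, `circ_block_bytes(88, 8)` rounds 704 up to 1024.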

Two versions of the assembly function were implemented, and will be detailed later in this section.

## 4.1 Linear Implementation

An assembly implementation of the MAC operation without any parallelism was implemented to test the output.

The code listing can be found in section A.5.2. In the comments to the code, the numbers in brackets after the code indicate the number of delay slots required after the instruction is sent to E1 stage of the pipeline before its results can be used. For floating point instructions, a second number will indicate the number of latency cycles after the E1 stage of the pipeline before the functional unit can execute another instruction.

### 4.1.1 Code Operation

The structure of the code before, and after the MAC loop is generally the same as the assembly code provided. The AMR register is set to have a value of 0x90004, which sets the register A5 to use circular buffering with a block size of 1024 bytes. The MAC loop then simply consists of code to load the sample data and the coefficients, multiply them together, and finally add them to an accumulator. The straightforward loop code is given below:

In the first execute packet of the loop, the coefficient and the sample are loaded into their respective registers (A11:A10, and B11:B10) in parallel using the D units on both sides. 4 delay slots are required before the results can be used. The values are then multiplied using the MPYDP instruction, which uses the M1 unit, and utilises the cross path (thus the .M1X). 9 delay slots are required before the results are added using ADDDP. Then, a further six delay slots are required before the loop begins again.

### 4.1.2 Code Performance

The number of cycles taken between the start and the end of the ISR is given in the table below. The number given is the lowest number of clock cycles observed. The cycle count in this case does not change much across the optimisation levels. This is because the compiler does not optimise the assembly code, and the assembly code has a constant number of clock cycles (including the five NOPs after the branch back to C). This linear, straightforward implementation of the MAC operation in assembly actually performs worse than the non-circular buffer implemented in C at higher levels of optimisation, because at those levels the compiler attempts to optimise using techniques such as software pipelining. This will be discussed further in section 5.

| Optimisation Level | Number of Clock Cycles (ISR) | Assembly Code Clock Cycles |
|--------------------|------------------------------|----------------------------|
| None               | 2736                         | 2594                       |
| o0                 | 2736                         | 2594                       |
| o2                 | 2730                         | 2594                       |

## 4.2 Optimised Implementation

Various techniques can be employed to optimise the assembly code and reduce the number of cycles required by a factor of five. The techniques will be described in this section. The code listing can be found in section A.5.3.

### 4.2.1 Optimisation Techniques

There are various techniques that can be employed to take advantage of the VLIW architecture of the DSP hardware. These mostly involve scheduling multiple instructions that use different functional units to execute in parallel, and understanding how the pipeline works for the various instructions so that they can be interleaved. Some of the techniques used by the compiler (described in section 5) are also used.

Double precision (DP) instructions are the first area for optimisation. The delay slots between two consecutive DP instructions where the second instruction makes use of the result from the first instruction could be reduced by one (for example MPYDP followed by ADDDP). This is because the DP instructions write the lower half of the results to the register first, before writing the upper half of the results to the register in the final delay slot. DP instructions that read the lower half results first in E1, followed by the upper half in E2 can be scheduled to start executing in the final delay slot of the previous DP instruction. Thus, the number of delay slots between MPYDP followed by ADDDP can be reduced from 9 to 8.

Utilising multiple functional units on both sides is the second area for optimisation. This works so long as the operations do not write to the same registers in the same cycle. Care must also be taken not to read more than four registers in the same register file in the same execute packet. Thus, two MPYDP and ADDDP operations can take place in parallel, using the functional units on both sides. This can roughly halve the number of cycles required for the code to run, but it requires twice the number of registers.

Software pipelining for loops is the third area for optimisation. Software pipelining is analogous to hardware pipelining: multiple instructions are interleaved so that the functional units can be maximally utilised during their delay slots, subject to their latencies, if any. Software pipelining and loop unrolling are both techniques used by compilers to optimise code. In software pipelining, the pipeline is first primed using a pipeline prologue. The main loop kernel is then executed the required number of times, with several loop iterations unrolled and interleaved. Then, the loop epilogue finishes any outstanding work. This technique can reduce the number of cycles by a factor roughly equal to the number of times the loop is unrolled, but requires proper planning and tracking.
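The prologue, kernel and epilogue structure can be illustrated at C level with a two-stage sketch, in which the loads for one iteration are issued while the multiply for the previous iteration completes. This is a conceptual model written by us and is not how the assembly in section A.5.3 is structured. The function name is hypothetical and the sketch assumes at least one tap.

```c
/* Two-stage software-pipelined MAC over n >= 1 taps.
 * Stage 1: load operands. Stage 2: multiply and accumulate.
 * Hypothetical illustration, not taken from the report's listings. */
static double mac_pipelined(const double *b, const double *x, int n)
{
    double acc = 0.0;
    double c, s;
    int i;

    /* prologue: prime the pipeline with the first pair of operands */
    c = b[0];
    s = x[0];

    /* kernel: multiply for iteration i-1 overlaps the loads for iteration i */
    for (i = 1; i < n; ++i) {
        double prod = c * s; /* stage 2 of iteration i-1 */
        c = b[i];            /* stage 1 of iteration i (independent of prod) */
        s = x[i];
        acc += prod;
    }

    /* epilogue: drain the last pair of operands */
    acc += c * s;

    return acc;
}
```

In C the two stages still execute in order. On the VLIW hardware, the independent statements in the kernel are what allow the loads and the multiply to be placed in the same execute packet or in each other's delay slots.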

Finally, taking advantage of the branch delay slots can also reduce the number of cycles in a non-trivial manner. The branch instruction requires five delay slots afterwards, whether the branch is taken or not. The execute packets in those five delay slots are guaranteed to execute, so useful code can be placed in them.

These techniques are employed in the code implementation, to be explained later on in this section.

### 4.2.2 Code Operation

The register usage is described in the comments in the listing in section A.5.3. Where possible (restricted by the number of functional units available, and the avoidance of hazards due to dependencies), operations are run in parallel.

#### Code Setup

The assembly function starts by setting the AMR register to 0x90004, which sets the register A5 to use circular buffering with a block size of 1024 bytes. At the same time, some of the values in registers that will be used later are saved onto the stack using the stack pointer B15. The address pointer for the sample that was just read is dereferenced, along with the address of the circular buffer.

```
        MVC   .S2   AMR, B13        ; (0) Save contents of AMR reg to B13
        STW   .D2   B3, *++B15      ; (0) save return to C to stack
        LDDW  .D1   *A6, A11:A10    ; (4) Get the 32 bit data for read samp put it in A11:A10
        STW   .D2   B6, *++B15      ; (0) save &filtered samp to stack
        MVK   .S2   4H, B2          ; (0) Set AMR to allow A5 to be used for circular addressing with BK0
        LDW   .D1   *A4, A5         ; (4) Get the address of the circ ptr, dereference then place in A5
        MVKLH .S2   9H, B2          ; (0) Set BK0 to allow for 1024 bytes addressing
        MVC   .S2   B2, AMR         ; (0) set AMR reg
        NOP   2                     ; A5 now holds address pointing into delay circ

        ; set circular mode using the AMR
        MVC   .S2   AMR, B13        ; (0) Save contents of AMR reg to B13
        MVK   .S2   4H, B2          ; (0) Lower half, set A5 to be circular buffering addressing mode using BK0
        MVKLH .S2   9H, B2          ; (0) Upper half. Set BK0 to work for 1024 bytes
        MVC   .S2   B2, AMR         ; (0) set AMR reg
        NOP   2                     ; A5 now holds address pointing into delay circ
```
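The AMR value 0x90004 can be decomposed as follows. This is based on our reading of the C6000 AMR layout: a two-bit mode field per register (A4 at bits 1:0, A5 at bits 3:2), the BK0 block-size field starting at bit 16, and a block size of 2^(BK0+1) bytes. The enum and function names in the sketch are ours.

```c
#include <stdint.h>

/* Addressing modes for one AMR mode field (two bits per register). */
enum {
    AMR_MODE_LINEAR   = 0,
    AMR_MODE_CIRC_BK0 = 1, /* circular, block size taken from BK0 */
    AMR_MODE_CIRC_BK1 = 2  /* circular, block size taken from BK1 */
};

#define AMR_A5_MODE_SHIFT 2u  /* A5 mode field occupies bits 3:2 */
#define AMR_BK0_SHIFT     16u /* BK0 field starts at bit 16 */

/* AMR value that puts A5 in circular mode with a block of block_bytes
 * (a power of two, at least 2). Hypothetical helper. */
static uint32_t amr_a5_circular_bk0(uint32_t block_bytes)
{
    uint32_t bk0 = 0;

    /* find bk0 such that 2^(bk0 + 1) == block_bytes */
    while ((2u << bk0) < block_bytes)
        ++bk0;

    return (bk0 << AMR_BK0_SHIFT) | ((uint32_t)AMR_MODE_CIRC_BK0 << AMR_A5_MODE_SHIFT);
}
```

For a 1024-byte block this gives BK0 = 9 in the upper half-word and 4 in the lower half-word, matching the `MVKLH 9H` and `MVK 4H` instructions above.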

Next, the sample that was just read is written into the appropriate address in memory (the circular buffer), and the registers used for accumulation are zeroed. The address to which the next call of the convolution function will write its new sample is written back to memory.

```
        STW   .D1   A11, *--A5      ; (0) Store new input sample (MSB) to delay circ array
        ZERO  .S1   A1              ; (0) zero accumulator LSB
        ZERO  .S2   B3
        STW   .D1   A10, *--A5      ; (0) Store new input sample (LSB) to delay circ array
        ZERO  .S1   A0              ; (0) zero accumulator MSB
        ZERO  .S2   B2
        STW   .D1   A5, *A4         ; (0) write back the decremented pointer to circ_ptr
                                    ; this points to the end of the MSB of where the next sample
                                    ; will be stored on the next call to this function
```
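The wrap-around behaviour that the AMR provides can be modelled in C. The sketch below is only an illustrative model, not the assembly itself: with BK0 set to 9 the block size is 2^(9+1) = 1024 bytes, so a pre-decrement such as `*--A5` changes only the low ten address bits while the upper bits stay fixed. The function name `circ_predecrement` and the addresses used with it are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Block size selected by BK0 = 9: 2^(9+1) = 1024 bytes. */
#define CIRC_BLOCK_BYTES 1024u

/* Hypothetical model of a pre-decrement (*--A5) in circular mode.
 * The offset inside the block wraps modulo the block size; the
 * block base (the upper address bits) is left unchanged. */
static uint32_t circ_predecrement(uint32_t addr, uint32_t step_bytes)
{
    uint32_t base;
    uint32_t offset;

    assert(step_bytes < CIRC_BLOCK_BYTES);

    base   = addr & ~(CIRC_BLOCK_BYTES - 1u);                /* aligned block start  */
    offset = (addr - step_bytes) & (CIRC_BLOCK_BYTES - 1u);  /* wrapped block offset */
    return base | offset;
}
```

For example, stepping back by one 4-byte word from the first address of a block that starts at 0x1000 lands on the last word of the same block (0x13FC) rather than leaving the block.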

### Pipeline Description

The pipeline used in the implementation is complicated and warrants explanation. It can be summarised in figure 4.1.

### **Branch Considerations**

The loop for the MAC operations takes four cycles. Thus, the branch that ends the ith loop iteration must be issued during the (i-1)th loop iteration. In addition, the branch for the end of the first loop iteration must be issued 2 cycles before the start of the first loop iteration. This also results in a spurious loop iteration being executed at the end. Suppose that N loop iterations are expected. At loop iteration N-1, the branch back to the start of the loop for loop iteration N is issued. In normal operation, the loop does not branch back to the beginning of the loop in the final loop iteration.

### Instructions Properties

The way the various instructions read and write registers, and their respective latencies and delay slots, play an important role in the operation of the pipeline. Without them, the pipeline would fail. Consider the following instructions being issued to the E1 stage of the pipeline at cycle i:

- LDDW writes the results to the registers at i+4
- MPYDP reads the input registers from i to i+3, and outputs the lower half of the result at i+8 and the upper half at i+9
- ADDDP reads the lower half input at i and the upper half at i+1, and outputs the lower half of the result at i+3 and the upper half at i+4

Each iteration of a MAC operation involves all three of these instructions, in this order.

### **Pipeline Operation**

Refer to figure 4.1. It can be seen from the lower diagram that the pipeline only fills up fully at the fifth loop iteration. Thus, we need a separate counter (register B1) so that accumulation only starts at the fifth loop iteration. This also means that we have to add 5 more iterations to the loop.

Both sides of the architecture are used for the MAC operation. This results in side A and side B (see figure 4.1) being multiplied and accumulated separately. The accumulated results from both sides need to be added together at the end.

Because each ADDDP instruction is issued before the result of the previous one is complete, each side of the functional units ends up holding two separate accumulated sums, with every other set of coefficients and samples being swapped in and out of the registers. Thus, there are virtually four data paths in operation at once. At the end of the loop, the accumulated result from one of the virtual data paths must be prevented from being overwritten by the other virtual data path.

At the end of the loop, because of the loop iterations added to fill the pipeline, spurious operations will be performed, as described in the diagram. We must be careful not to use the registers that can be filled with spurious data.
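The effect of the four virtual data paths can be sketched in C. This is a hypothetical model rather than a translation of the assembly: it assumes that consecutive sample/coefficient pairs alternate between the B and A sides, and that each side alternates between its two partial sums. The function name `mac_four_paths` is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the four virtual data paths.
 * acc[0], acc[2]: the two partial sums assumed to live on side B.
 * acc[1], acc[3]: the two partial sums assumed to live on side A. */
static double mac_four_paths(const double *coef, const double *samp, size_t n)
{
    double acc[4] = {0.0, 0.0, 0.0, 0.0};
    size_t k;

    assert(n % 2 == 0);  /* one pair per side in every loop iteration */

    for (k = 0; k < n; ++k)
        acc[k % 4] += coef[k] * samp[k];

    /* Each side first combines its own two paths, then the sides are added. */
    return (acc[0] + acc[2]) + (acc[1] + acc[3]);
}
```

Splitting the sum changes the order of the floating-point additions, so the result may differ from a single running sum in the last few bits.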

### Loop Code

First, the extra counter B1 required to kickstart the ADDDP instructions at the fifth iteration is initialised to 10. This counter is added to the loop counter B0. B1 is initialised to 10 because the loop counter is decremented by two each time (two coefficients/samples are loaded to each side) and thus the count held in B1 has to be multiplied by 2. The branch necessary for the first loop iteration is also issued.

[Figure 4.1, upper diagram: cycle-by-cycle schedule with columns Cycle, B1 (counter), Loop, A1, A2, B1 and B2, showing the load (L), multiply (x) and add (+) operations on each side.]

Legend:

- Numbers: the ith MAC iteration involving the ith sample and coefficient.
- A & B: the operations involving the M and L units on side A and B respectively. Note that loads use both sides.
- L: load operation; x: multiplication operation; +: addition operation.
- So +B2 refers to the second sample being accumulated (added) on the B side.

Explanation:

- We will ignore the fact that sides A and B are both used at the same time and just consider the iterations. Consider the entirety of each iteration of the MAC operation as a 6-stage pipeline (the columns).
- Each iteration of the loop will move the pipeline stages forward by one.
- The loading operation (L) takes 2 pipeline stages. The multiply operation (x) takes 3 pipeline stages, but the results can be used in the same iteration.
- The addition operation (+) takes 2 pipeline stages.
- The number indicates the ith iteration of the MAC operation.
- Red operations are spurious operations that cannot be prevented.

| Iteration | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | Stage 6 |
|-----------|---------|---------|---------|---------|---------|---------|
| 1         | L1      |         |         |         |         |         |
| 2         | L2      | L1      |         |         |         |         |
| 3         | L3      | L2      | x1      |         |         |         |
| 4         | L4      | L3      | x2      | x1      |         |         |
| 5         | L5      | L4      | x3      | x2      | x1      | +1      |
| 6         | L6      | L5      | x4      | x3      | x2      | +2      |
| ...       |         |         |         |         |         |         |
| 45        | L45     | L44     | x43     | x42     | x41     | +41     |
| 46        | L46     | L45     | x44     | x43     | x42     | +42     |
| 47        | L47     | L46     | x45     | x44     | x43     | +43     |
| 48        | L48     | L47     | x46     | x45     | x44     | +44     |
| 49        |         | L48     |         | x46     | x45     | +45     |
| 50        |         |         |         |         | x46     | +46     |

Figure 4.1: Pipeline description diagram.

The loop proper begins. Sides A and B perform the MAC accordingly. This code results in the pipeline described in figure 4.1. The conditional execution prevents some of the spurious operations from occurring at the end of the loop. As described above, to prevent one virtual data path on side B from overwriting the other virtual data path, we must move the registers that store the result of the side B accumulation out of the way. This happens at the end of the loop.

```
loop:
   [B0]  SUB   .S2   B0, 2, B0                 ; (0) Decrement loop counter by 2, because we are doing two
                                               ; calculations together
   [B1]  SUB   .D2   B1, 2, B1                 ; (0) countdown to allow start of addition. Countdown is done by two
                                               ; because the loop counter is decremented by two. And since we added B0
                                               ; and B1 together before the loop, we must also double B1's value and subtract
                                               ; by 2 each time
   [B0]  B     .S2   loop                      ; (5) for current iteration i, kickstart the branch back for iteration i+1
   [B0]  LDDW  .D1   *A5++, A11:A10            ; (4) B - Load delayed sample
   [B0]  LDDW  .D2   *B4++, B11:B10            ; (4) B - Load coefficient
   [B0]  MPYDP .M2X  B11:B10, A11:A10, B3:B2   ; (9,4) B - Multiply
|| [!B1] ADDDP .L2   B7:B6, B3:B2, B7:B6       ; (6,2) B - Accumulate
   [B0]  LDDW  .D1   *A5++, A9:A8              ; (4) A - Load delayed sample
   [B0]  LDDW  .D2   *B4++, B9:B8              ; (4) A - Load coefficient
   [B0]  MPYDP .M1X  A9:A8, B9:B8, A3:A2       ; (9,4) A - Multiply
|| [!B1] ADDDP .L1   A1:A0, A3:A2, A1:A0       ; (6,2) A - Accumulate
|| [!B0] MV    .S2   B6, B12                   ; (0) for the final iteration this cycle, the LH result for B-Addition-44 is
                                               ; written on this cycle. We move the LH result for B-Addition-43 out of the
                                               ; way to prevent losing them
```

We then continue the moves for the upper half of the B-side and do the same for the A-side. The A-side move is done one cycle later because its MAC operations are one cycle behind those of the B-side. Both virtual data paths are then added.

```
        MV    .D1   A0, A12                    ; (0) the UH result for A-Addition-44 is
                                               ; written on this cycle. We move the UH result for A-Addition-43 out of the
                                               ; way to prevent losing them
        MV    .S2   B7, B13                    ; (0) the UH result for B-Addition-44 is
                                               ; written on this cycle. We move the UH result for B-Addition-43 out of the
                                               ; way to prevent losing them
        MV    .D1   A1, A13                    ; (0) the LH result for A-Addition-44 is
                                               ; written on this cycle. We move the LH result for A-Addition-43 out of the
                                               ; way to prevent losing them
        ADDDP .L2   B7:B6, B13:B12, B13:B12    ; (6,2) the spurious B-Addition-45 will write the LH result in
                                               ; 2 cycles after this. Better start adding result of B-Addition-44
```

```
        ADDDP .L1   A1:A0, A13:A12, A13:A12    ; (6,2) the spurious A-Addition-45 will write the LH result in
                                               ; 2 cycles after this. Better start adding result of A-Addition-44
```

### Code Cleanup

The registers saved to the stack are restored, and the accumulated values from the A-side and B-side are added together. The result is then written to the address of the filtered sample and the code returns to C.

```
        LDW   .D2                              ; (4) get &filtered samp from stack
        LDW   .D2   *B15--, B0                 ; (4) get return to C from stack
        NOP   3
        ADDDP .L1X  A13:A12, B13:B12, A13:A12  ; (6,2) Add the results of Side A and B together
        NOP

        ; return to C code
lend:
        B     .S2   B0                         ; (5) branch to b3 (register b3 holds the return address)
        NOP   3

        ; send the result of MAC back to C
        STW   .D2   A12, *B6                   ; (0) Write accumulator (LSB) into filtered samp
        STW   .D2   A13, *+B6[1]               ; (0) Write accumulator (MSB) into filtered samp
        MVC   .S2   B1, AMR                    ; (0) restore AMR reg to previous contents

        .end
```

It should be noted that with this optimisation, the number of coefficients, N, **MUST** be a multiple of two. If the number of coefficients is not a multiple of two, additional coefficients with values of zero should be added to make N a multiple of two. Otherwise, the code will compute the result wrongly.
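The zero padding described above can be sketched in C as follows. This is a minimal illustration under stated assumptions: the helper name `pad_coeffs_even` and its parameters are hypothetical and are not part of the lab code, and the delay buffer is assumed to be sized for the padded length as well.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies n coefficients into dst (capacity cap)
 * and appends a single zero tap when n is odd.
 * Returns the padded length, which is always even. */
static size_t pad_coeffs_even(const double *src, size_t n, double *dst, size_t cap)
{
    size_t padded = n + (n % 2);

    assert(padded <= cap);

    memcpy(dst, src, n * sizeof *src);
    if (padded != n)
        dst[n] = 0.0;  /* the extra tap contributes nothing to the sum */
    return padded;
}
```

Because the extra coefficient is zero, the filter output is unchanged; only the loop count becomes even.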

## 4.2.3 Code Performance

The number of cycles taken between the start and the end of the ISR is given in the table below. The number given is the lowest number of clock cycles observed; it might vary due to cache hits and misses. The cycle count in this case does not change much across the various optimisation levels. This is because the compiler does not optimise the assembly code, and the assembly code takes a constant number of clock cycles. The optimisation techniques discussed in section 4.2.1 provide a massive improvement in code performance, of about five-fold.

| Optimisation Level | Number of Clock Cycles | Assembly Code (Clock Cycles) |
|--------------------|------------------------|------------------------------|
| None               | 378                    | 239                          |
| O0                 | 378                    | 239                          |
| O2                 | 378                    | 239                          |

# 4.2.4 Spectrum Analyser Traces

The output for the spectrum analyser is given below. As before, the -12 dB offset that occurs has been corrected in the trace below. The group delay that can be observed from the phase response is explained as before, from section 3.2.3.



# 5 Compiler Optimisation

The various optimisations performed by the compiler are described in the SPRU187O<sup>1</sup> document. Optimisation might result in a larger code size.

When no optimisation is requested, the compiler generally generates assembly code "as-is". This usually results in the fastest compilation time and the code that is easiest to debug, at the cost of performance. This section examines the optimisations performed by the compiler at the various levels and how they could contribute to the increase in performance.

# 5.1 Level 0 Optimisation

The compiler will attempt to simplify the control-flow graph (i.e. if/else, for, switch etc. statements). Neither implementation uses many of these control statements, and thus not much improvement will arise from this. The compiler will also attempt to eliminate unused code, which is not present in either implementation. Next, the compiler will attempt to simplify statements and expressions. The implementations do not generally contain overly complicated expressions and statements. However, the following statements contain expressions that always evaluate to the same constant values, and the compiler might attempt to "collapse" them into constant values at compile time, rather than have them computed at run time.

```
// from non-circular buffer
i = N-1;

// from circular buffer
double* coeffptr = b;
double* coeffEndptr = b + N; // points to the element AFTER the coefficient array
```

The compiler will also attempt to inline functions marked with the keyword inline. However, the keyword was not used in either implementation. The compiler will assign variables to registers, reducing the number of memory accesses. This might have contributed a significant performance improvement to both implementations. It is likely that the circular buffer implementation benefited more from this optimisation, due to its use of pointers, which would otherwise have resulted in unnecessary dereferencing of pointers to pointers.

Finally, the compiler will attempt to perform loop rotation (or loop inversion)<sup>2</sup>. Consider the following for loop in C, which without this optimisation is translated (essentially) into an equivalent while loop in assembly. This results in two branches being executed on every pass through the loop, and branches, whether taken or not, could lead to pipeline stalls (or, in this architecture, additional NOPs being inserted, which are wasteful if not optimised properly).

```
for (i = 0; i < N; ++i) doSomething();

// transformed into
i = 0;

while (i < N) { // conditional branch to after end of loop if i >= N
    doSomething();
    ++i;
} // unconditional branch to start of loop
```

Loop rotation replaces the whole while block with an if block containing a do..while loop, which reduces the number of branches in the loop to one.

<sup>1</sup> http://www.ti.com/general/docs/lit/getliterature.tsp?literatureNumber=spru187o&fileType=pdf

<sup>2</sup> See http://llvm.org/devmtg/2009-10/ScalarEvolutionAndLoopOptimization.pdf and http://en.wikipedia.org/wiki/Loop_inversion

```
i = 0;
if (i < N) { // conditional branch to after end of loop if i >= N OUTSIDE the loop
    do {
        doSomething();
        i++;
    } while (i < N); // conditional branch to beginning of loop if i < N
}
```

This technique contributes a significant improvement in both implementations. It also enables loop-invariant code to be moved out of the loop<sup>3</sup>.

# 5.2 Level 2 Optimisation

Level 2 optimisation performs all the optimisations in Levels 0 and 1. The optimisations performed in Level 1 (local copy/constant propagation, removal of unused assignments, and elimination of local common expressions) are not applicable to the implementations.

The compiler attempts to perform various loop optimisations such as software pipelining and loop unrolling, as described in section 4.2.1. These optimisations contribute the most to the improvement in performance, since most of the execution time is spent in loops.

The compiler also attempts to convert array references in loops to incremented pointer form, which is what was already done in the circular buffer implementation. In this case, it is the fact that the circular buffer implementation loops over the values only once, rather than twice as in the non-circular buffer implementation, that gives it the performance advantage.
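The conversion from array references to incremented pointer form can be illustrated with a short C sketch. Both functions compute the same multiply-accumulate; the names `mac_indexed` and `mac_pointer` are hypothetical and are used only to contrast the two forms, they are not taken from the lab code.

```c
#include <assert.h>
#include <stddef.h>

/* Array-reference form: every access computes base + index. */
static double mac_indexed(const double *b, const short *x, size_t n)
{
    double acc = 0.0;
    size_t i;

    for (i = 0; i < n; ++i)
        acc += b[i] * x[i];
    return acc;
}

/* Incremented-pointer form: the addresses are carried from one
 * pass to the next and simply stepped forward. */
static double mac_pointer(const double *b, const short *x, size_t n)
{
    const double *bp   = b;
    const double *bend = b + n;  /* one past the last coefficient */
    double acc = 0.0;

    assert(b != NULL && x != NULL);

    while (bp != bend)
        acc += *bp++ * *x++;
    return acc;
}
```

The pointer form maps naturally onto post-increment addressing such as `*A5++`, which is why hand-written pointer loops leave the compiler little to gain from this particular transformation.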

The various global optimisations done by the compiler are not relevant.

<sup>3</sup> See http://en.wikipedia.org/wiki/Loop-invariant_code_motion

# A Code Listings

### A.1 Matlab Code for Filter Generation

Based on the specification given, the following Matlab code was used to generate the filter:

```
clear;

rp = 0.4;                           % passband ripple
rs = 50;                            % stopband ripple
f = [0.065 0.1125 0.5625 0.625];    % Normalised frequencies
a = [0 1 0];                        % amplitude
fs = 8000;                          % sampling frequency

% calculate deviation
dev = [10^(-rs/20) (10^(rp/20)-1)/(10^(rp/20)+1) 10^(-rs/20)];

% determine the order
[n, fo, ao, w] = firpmord(f, a, dev);

b = firpm(n+3, fo, ao, w);

% time to plot
figure

% linear gain plot
subplot(2,2,[1 3]);
% [h,f] = freqz(b,a,n,fs)
[h, omega] = freqz(b, 1, 2048, fs);
plot(fo.*(fs/2), ao, omega, abs(h));
legend('Ideal', 'Design');
grid minor;
xlabel('Frequency (Hz)');
ylabel('Gain');

% magnitude bode plot
subplot(2,2,2)
%semilogx(omega, mag2db(abs(h)));
plot(omega, mag2db(abs(h)));
xlim([10 fs/2]);
grid minor;
xlabel('Frequency (Hz)');
ylabel('Gain (dB)');

% phase bode plot
subplot(2,2,4)
%semilogx(omega, unwrap(angle(h)));
plot(omega, unwrap(angle(h)));
xlim([10 fs/2]);
grid minor;
xlabel('Frequency (Hz)');
ylabel('Phase (radians)');

% write to file
format long e
save('fir_coef.txt', 'b', '-ascii', '-double', '-tabs');
save('fir_coef_float.txt', 'b', '-ascii', '-tabs');
```
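As a worked example of the deviation calculation in the listing above, substituting the values rp = 0.4 and rs = 50 (both in dB) gives approximately the following. The symbols $\delta_s$ and $\delta_p$ are introduced here only for illustration; they correspond to the stopband and passband entries of `dev`.

$$
\delta_s = 10^{-r_s/20} = 10^{-2.5} \approx 3.16 \times 10^{-3},
\qquad
\delta_p = \frac{10^{r_p/20} - 1}{10^{r_p/20} + 1} = \frac{10^{0.02} - 1}{10^{0.02} + 1} \approx 0.0230
$$

So `dev` is approximately `[0.00316 0.0230 0.00316]`, one entry per band of the band-pass specification.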

### A.2 Non-Circular Buffer

```
/*************************************************************************************
            DEPARTMENT OF ELECTRICAL AND ELECTRONIC ENGINEERING
                        IMPERIAL COLLEGE LONDON

               EE 3.19: Real Time Digital Signal Processing
                   Dr Paul Mitcheson and Daniel Harvey

                        LAB 4 - Non-circular FIR
 *************************************************************************************/

#include <stdlib.h>
#include <stdio.h>

// Included so program can make use of DSP/BIOS configuration tool.
#include "dsp_bios_cfg.h"

/* The file dsk6713.h must be included in every program that uses the BSL. This
   example also includes dsk6713_aic23.h because it uses the
   AIC23 codec module (audio interface). */
#include "dsk6713.h"
#include "dsk6713_aic23.h"

// math library (trig functions)
#include <math.h>

// Some functions to help with writing/reading the audio ports when using interrupts.
#include <helper_functions_ISR.h>

/* Audio port configuration settings: these values set registers in the AIC23 audio
   interface to configure it. See TI doc SLWS106D 3-3 to 3-10 for more info. */
DSK6713_AIC23_Config Config = {
    /* REGISTER          FUNCTION                         SETTINGS              */
    0x0017, /* 0 LEFTINVOL   Left line input channel volume   0dB                   */
    0x0017, /* 1 RIGHTINVOL  Right line input channel volume  0dB                   */
    0x01f9, /* 2 LEFTHPVOL   Left channel headphone volume    0dB                   */
    0x01f9, /* 3 RIGHTHPVOL  Right channel headphone volume   0dB                   */
    0x0011, /* 4 ANAPATH     Analog audio path control        DAC on, Mic boost 20dB */
    0x0000, /* 5 DIGPATH     Digital audio path control       All Filters off       */
    0x0000, /* 6 DPOWERDOWN  Power down control               All Hardware on       */
    0x0043, /* 7 DIGIF       Digital audio interface format   16 bit                */
    0x008d, /* 8 SAMPLERATE  Sample rate control              8 KHZ                 */
    0x0001  /* 9 DIGACT      Digital interface activation     On                    */
};

// Codec handle:- a variable used to identify audio interface
DSK6713_AIC23_CodecHandle H_Codec;
```

```
// The order of the FIR filter +1
#define N 88

// include the coefficients
#include "fir_coef.txt"

// define the buffer
Int16 buffer[N] = {0};

void init_hardware(void);
void init_HWI(void);
void ISR_AIC(void);
Int16 convoluteNonCircular(void);

void main(){

    // initialize board and the audio port
    init_hardware();

    /* initialize hardware interrupts */
    init_HWI();

    /* loop indefinitely, waiting for interrupts */
    while(1)
    {};
}

void init_hardware()
{
    // Initialize the board support library, must be called first
    DSK6713_init();

    // Start the AIC23 codec using the settings defined above in config
    H_Codec = DSK6713_AIC23_openCodec(0, &Config);

    /* Function below sets the number of bits in word used by McBSP (serial port) for
    receives from AIC23 (audio port). We are using a 32 bit packet containing two
    16 bit numbers hence 32BIT is set for receive */
    MCBSP_FSETS(RCR1, RWDLEN1, 32BIT);

    /* Configures interrupt to activate on each consecutive available 32 bits
    from Audio port hence an interrupt is generated for each L & R sample pair */
    MCBSP_FSETS(SPCR1, RINTM, FRM);

    /* These commands do the same thing as above but applied to data transfers to
    the audio port */
    MCBSP_FSETS(XCR1, XWDLEN1, 32BIT);
    MCBSP_FSETS(SPCR1, XINTM, FRM);
}
```

```
void init_HWI(void)
{
    IRQ_globalDisable();           // Globally disables interrupts
    IRQ_nmiEnable();               // Enables the NMI interrupt (used by the debugger)
    IRQ_map(IRQ_EVT_RINT1, 4);     // Maps an event to a physical interrupt
    IRQ_enable(IRQ_EVT_RINT1);     // Enables the event
    IRQ_globalEnable();            // Globally enables interrupts
}

/******************** WRITE YOUR INTERRUPT SERVICE ROUTINE HERE ********************/

void ISR_AIC(void)
{
    int i;
    double output = 0;

    // shift buffer
    for (i = N-1; i != 0; --i)
        buffer[i] = buffer[i-1];

    // new sample
    buffer[0] = mono_read_16Bit();

    // mac loop
    for (i = 0; i < N; ++i)
        output += b[i] * buffer[i];

    mono_write_16Bit(output); // write
}
```

### A.3 Naive Implementation for a Circular Buffer

```
/*************************************************************************************
            DEPARTMENT OF ELECTRICAL AND ELECTRONIC ENGINEERING
                        IMPERIAL COLLEGE LONDON

               EE 3.19: Real Time Digital Signal Processing
                   Dr Paul Mitcheson and Daniel Harvey

                        LAB 4 - Naive Circular FIR
 *************************************************************************************/

#include <stdlib.h>
#include <stdio.h>

// Included so program can make use of DSP/BIOS configuration tool.
#include "dsp_bios_cfg.h"

/* The file dsk6713.h must be included in every program that uses the BSL. This
   example also includes dsk6713_aic23.h because it uses the
   AIC23 codec module (audio interface). */
#include "dsk6713.h"
#include "dsk6713_aic23.h"
```

```
// math library (trig functions)
#include <math.h>

// Some functions to help with writing/reading the audio ports when using interrupts.
#include <helper_functions_ISR.h>

/* Audio port configuration settings: these values set registers in the AIC23 audio
   interface to configure it. See TI doc SLWS106D 3-3 to 3-10 for more info. */
DSK6713_AIC23_Config Config = {
    /* REGISTER          FUNCTION                         SETTINGS              */
    0x0017, /* 0 LEFTINVOL   Left line input channel volume   0dB                   */
    0x0017, /* 1 RIGHTINVOL  Right line input channel volume  0dB                   */
    0x01f9, /* 2 LEFTHPVOL   Left channel headphone volume    0dB                   */
    0x01f9, /* 3 RIGHTHPVOL  Right channel headphone volume   0dB                   */
    0x0011, /* 4 ANAPATH     Analog audio path control        DAC on, Mic boost 20dB */
    0x0000, /* 5 DIGPATH     Digital audio path control       All Filters off       */
    0x0000, /* 6 DPOWERDOWN  Power down control               All Hardware on       */
    0x0043, /* 7 DIGIF       Digital audio interface format   16 bit                */
    0x008d, /* 8 SAMPLERATE  Sample rate control              8 KHZ                 */
    0x0001  /* 9 DIGACT      Digital interface activation     On                    */
};

// Codec handle:- a variable used to identify audio interface
DSK6713_AIC23_CodecHandle H_Codec;

// The order of the FIR filter +1
#define N 88

// include the coefficients
#include "fir_coef.txt"

// define the buffer
Int16 buffer[N] = {0};

// index of the current "current" (zero) sample
int index = 0;

void init_hardware(void);
void init_HWI(void);
void ISR_AIC(void);
Int16 convolute(Int16 input);

void main(){

    // initialize board and the audio port
    init_hardware();

    /* initialize hardware interrupts */
    init_HWI();
```

```
81
     /st loop indefinitely, waiting for interrupts st/
82
     w hile (1)
83
     {};
85
87
   88
void init_hardware()
{
    // Initialize the board support library, must be called first
    DSK6713_init();

    // Start the AIC23 codec using the settings defined above in config
    H_Codec = DSK6713_AIC23_openCodec(0, &Config);

    /* Function below sets the number of bits in word used by McBSP (serial port) for
    receives from AIC23 (audio port). We are using a 32 bit packet containing two
    16 bit numbers hence 32BIT is set for receive */
    MCBSP_FSETS(RCR1, RWDLEN1, 32BIT);

    /* Configures interrupt to activate on each consecutive available 32 bits
    from Audio port hence an interrupt is generated for each L & R sample pair */
    MCBSP_FSETS(SPCR1, RINTM, FRM);

    /* These commands do the same thing as above but applied to data transfers to
    the audio port */
    MCBSP_FSETS(XCR1, XWDLEN1, 32BIT);
    MCBSP_FSETS(SPCR1, XINTM, FRM);
}

void init_HWI(void)
{
    IRQ_globalDisable();          // Globally disables interrupts
    IRQ_nmiEnable();              // Enables the NMI interrupt (used by the debugger)
    IRQ_map(IRQ_EVT_RINT1,4);     // Maps an event to a physical interrupt
    IRQ_enable(IRQ_EVT_RINT1);    // Enables the event
    IRQ_globalEnable();           // Globally enables interrupts
}

void ISR_AIC(void){
    Int16 sample = mono_read_16Bit(); // read
    sample = convolute(sample); // convolute
    mono_write_16Bit(sample); // write
}

// Perform convolution
Int16 convolute(Int16 input){
    int i;
    double result = 0;

    // write to current "zero" sample
    *(buffer + index) = input;

    // b[i] multiplies the sample that arrived i interrupts ago
    for (i = 0; i < N; i++)
        result += b[i] * buffer[((index - i) + N) % N];

    // advance index
    index = (index + 1) % N;

    return (Int16) round(result);
}
```
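The modulo indexing in `convolute()` can be checked off-target. The following is a minimal host-side sketch, not the code that ran on the DSK: `DemoFir`, `demo_convolute` and `DEMO_TAPS` are hypothetical names introduced here, the tap count is reduced to 4, and the result is returned as a `double` rather than rounded to `Int16`.

```c
#include <assert.h>

#define DEMO_TAPS 4   /* hypothetical small tap count, not the lab's N = 88 */

typedef struct {
    double samples[DEMO_TAPS]; /* circular delay line */
    int index;                 /* slot that receives the next sample */
} DemoFir;

/* Same indexing as convolute(): write at index, read backwards modulo the
   tap count, then advance index by one with wrap-around. */
double demo_convolute(DemoFir *f, const double *coef, double input)
{
    double result = 0.0;
    int i;

    assert(f->index >= 0 && f->index < DEMO_TAPS);
    f->samples[f->index] = input;

    for (i = 0; i < DEMO_TAPS; i++)
        result += coef[i] * f->samples[((f->index - i) + DEMO_TAPS) % DEMO_TAPS];

    f->index = (f->index + 1) % DEMO_TAPS;
    return result;
}
```

Feeding a unit impulse followed by zeros should return the coefficients one by one, which is a quick way to confirm that the read index walks backwards through the buffer as intended.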

# A.4 Optimised Circular Buffer Implementation

```
/*************************************************************************************
                DEPARTMENT OF ELECTRICAL AND ELECTRONIC ENGINEERING
                            IMPERIAL COLLEGE LONDON

                EE 3.19: Real Time Digital Signal Processing
                    Dr Paul Mitcheson and Daniel Harvey

                            LAB 4 - Circular FIR
 *************************************************************************************/

#include <stdlib.h>
#include <stdio.h>

// Included so program can make use of DSP/BIOS configuration tool.
#include "dsp_bios_cfg.h"

/* The file dsk6713.h must be included in every program that uses the BSL. This
   example also includes dsk6713_aic23.h because it uses the
   AIC23 codec module (audio interface). */
#include "dsk6713.h"
#include "dsk6713_aic23.h"

// math library (trig functions)
#include <math.h>

// Some functions to help with writing/reading the audio ports when using interrupts.
#include <helper_functions_ISR.h>

/* Audio port configuration settings: these values set registers in the AIC23 audio
   interface to configure it. See TI doc SLWS106D 3-3 to 3-10 for more info. */
DSK6713_AIC23_Config Config = { \
    /* REGISTER              FUNCTION                        SETTINGS     */
    0x0017,  /* 0 LEFTINVOL  Left line input channel volume  0dB          */\
    0x0017,  /* 1 RIGHTINVOL Right line input channel volume 0dB          */\
    0x01f9,  /* 2 LEFTHPVOL  Left channel headphone volume   0dB          */\
    0x01f9,  /* 3 RIGHTHPVOL Right channel headphone volume  0dB          */\
    0x0011,  /* 4 ANAPATH    Analog audio path control       DAC on, Mic boost 20dB*/\
    0x0000,  /* 5 DIGPATH    Digital audio path control      All Filters off */\
    0x0000,  /* 6 DPOWERDOWN Power down control              All Hardware on */\
    0x0043,  /* 7 DIGIF      Digital audio interface format  16 bit       */\
    0x008d,  /* 8 SAMPLERATE Sample rate control             8 KHZ        */\
    0x0001   /* 9 DIGACT     Digital interface activation    On           */\
};

// Codec handle: a variable used to identify audio interface
DSK6713_AIC23_CodecHandle H_Codec;

// The order of the FIR filter + 1
#define N 88

// include the coefficients
#include "fir_coef.txt"

// define the buffer
double buffer[N] = {0};

// index of the current (zero-delay) sample
int index = 0;

// macro that, based on the index of the current zero sample,
// calculates the index of the array to read,
// including handling wrap-arounds
//#define GET_INDEX(index, offset) (index + offset)%N

void init_hardware(void);
void init_HWI(void);
void ISR_AIC(void);

void main(){
    // initialize board and the audio port
    init_hardware();

    /* initialize hardware interrupts */
    init_HWI();

    /* loop indefinitely, waiting for interrupts */
    while (1)
    {};
}

void init_hardware()
{
    // Initialize the board support library, must be called first
    DSK6713_init();

    // Start the AIC23 codec using the settings defined above in config
    H_Codec = DSK6713_AIC23_openCodec(0, &Config);

    /* Function below sets the number of bits in word used by McBSP (serial port) for
    receives from AIC23 (audio port). We are using a 32 bit packet containing two
    16 bit numbers hence 32BIT is set for receive */
    MCBSP_FSETS(RCR1, RWDLEN1, 32BIT);

    /* Configures interrupt to activate on each consecutive available 32 bits
    from Audio port hence an interrupt is generated for each L & R sample pair */
    MCBSP_FSETS(SPCR1, RINTM, FRM);

    /* These commands do the same thing as above but applied to data transfers to
    the audio port */
    MCBSP_FSETS(XCR1, XWDLEN1, 32BIT);
    MCBSP_FSETS(SPCR1, XINTM, FRM);
}

void init_HWI(void)
{
    IRQ_globalDisable();          // Globally disables interrupts
    IRQ_nmiEnable();              // Enables the NMI interrupt (used by the debugger)
    IRQ_map(IRQ_EVT_RINT1,4);     // Maps an event to a physical interrupt
    IRQ_enable(IRQ_EVT_RINT1);    // Enables the event
    IRQ_globalEnable();           // Globally enables interrupts
}

/******************* WRITE YOUR INTERRUPT SERVICE ROUTINE HERE ********************/

void ISR_AIC(void)
{
    // FIR filter
    // operation principle: do a forward pass of the sample buffer until the end is hit,
    // then wrap the sample pointer to the start of the buffer and do the remainder of
    // the iterations, whose count follows from the number of coefficients left to
    // process (with pointer arithmetic).
    double* coeffptr = b;
    double* coeffEndptr = b + N;        // points to the element AFTER the coefficient array
    double* sampleptr = buffer + index; // point to oldest sample initially
    double* bufferEndptr = buffer + N;  // one after last element

    int loopcnt = bufferEndptr - sampleptr; // how many iterations are needed for a single-step loop
    char modunroll = loopcnt % 4;           // non-integral leftover of an unrolled loop

    // accumulators
    double result = 0;
    double result2 = 0;
    double result3 = 0;
    double result4 = 0;

    *sampleptr = mono_read_16Bit(); // read sample into buffer (overwrites the oldest sample)

    // process samples until the end of the sample buffer is hit
    while (sampleptr < bufferEndptr - 3)
    {
        result  += (*coeffptr++) * (*sampleptr++);
        result2 += (*coeffptr++) * (*sampleptr++);
        result3 += (*coeffptr++) * (*sampleptr++);
        result4 += (*coeffptr++) * (*sampleptr++);
    }

    // take care of non-integral leftover iterations
    if (modunroll > 0) result  += (*coeffptr++) * (*sampleptr++);
    if (modunroll > 1) result2 += (*coeffptr++) * (*sampleptr++);
    if (modunroll > 2) result3 += (*coeffptr++) * (*sampleptr++);

    sampleptr = buffer; // wrap pointer to beginning of the buffer

    // pass the remainder of the buffer (amount of iterations = how many coefficients
    // there are left to process)
    while (coeffptr < coeffEndptr - 3)
    {
        result  += (*coeffptr++) * (*sampleptr++);
        result2 += (*coeffptr++) * (*sampleptr++);
        result3 += (*coeffptr++) * (*sampleptr++);
        result4 += (*coeffptr++) * (*sampleptr++);
    }

    // take care of non-integral leftover iterations
    // (N is a multiple of 4, so the leftover here is (4 - modunroll) % 4)
    if (modunroll==1) result  += (*coeffptr++) * (*sampleptr++);
    if (modunroll==1 || modunroll==2) result2 += (*coeffptr++) * (*sampleptr++);
    if (modunroll==1 || modunroll==2 || modunroll==3) result3 += (*coeffptr++) * (*sampleptr++);

    // sum the accumulators
    result = result + result2 + result3 + result4;

    // advance index into circular buffer
    index = (index == 0) ? N-1 : index-1;

    mono_write_16Bit(result); // output sample
}
```
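The two-pass idea in the listing above (run to the end of the buffer, wrap once, finish from the start) can be compared against plain modulo indexing on a PC. This is a sketch under stated assumptions rather than the submitted code: the names `demo_fir_modulo`, `demo_dot4` and `demo_fir_two_pass` are hypothetical, and the leftover iterations use a generic tail loop instead of the `modunroll` tests, so it does not rely on the tap count being a multiple of 4.

```c
#include <assert.h>

/* Reference: coef[k] pairs with buf[(index + k) % n], one modulo per tap. */
double demo_fir_modulo(const double *buf, const double *coef, int n, int index)
{
    double acc = 0.0;
    int k;

    assert(n > 0 && index >= 0 && index < n);
    for (k = 0; k < n; k++)
        acc += coef[k] * buf[(index + k) % n];
    return acc;
}

/* Dot product unrolled by four with independent accumulators. */
static double demo_dot4(const double *c, const double *s, int len)
{
    double r1 = 0.0, r2 = 0.0, r3 = 0.0, r4 = 0.0;
    int k = 0;

    for (; k + 3 < len; k += 4) {
        r1 += c[k]     * s[k];
        r2 += c[k + 1] * s[k + 1];
        r3 += c[k + 2] * s[k + 2];
        r4 += c[k + 3] * s[k + 3];
    }
    for (; k < len; k++)      /* leftover taps when len is not a multiple of 4 */
        r1 += c[k] * s[k];
    return r1 + r2 + r3 + r4;
}

/* Two linear passes: buf[index..n-1] first, then buf[0..index-1]. */
double demo_fir_two_pass(const double *buf, const double *coef, int n, int index)
{
    int first;

    assert(n > 0 && index >= 0 && index < n);
    first = n - index;        /* taps handled before the wrap */
    return demo_dot4(coef, buf + index, first)
         + demo_dot4(coef + first, buf, index);
}
```

With integer-valued test data both functions return identical results. With real filter coefficients the sums may differ in the last bits, because the four accumulators change the order of the floating-point additions.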

### A.5 Assembly Implementation

#### A.5.1 C File

```
/* The file dsk6713.h must be included in every program that uses the BSL. This
   example also includes dsk6713_aic23.h because it uses the
   AIC23 codec module (audio interface). */
#include "dsk6713.h"
#include "dsk6713_aic23.h"

// math library (trig functions)
#include <math.h>

// Some functions to help with writing/reading the audio ports when using interrupts.
#include <helper_functions_ISR.h>

/* Audio port configuration settings: these values set registers in the AIC23 audio
   interface to configure it. See TI doc SLWS106D 3-3 to 3-10 for more info. */
DSK6713_AIC23_Config Config = { \
    /* REGISTER              FUNCTION                        SETTINGS     */
    0x0017,  /* 0 LEFTINVOL  Left line input channel volume  0dB          */\
    0x0017,  /* 1 RIGHTINVOL Right line input channel volume 0dB          */\
    0x01f9,  /* 2 LEFTHPVOL  Left channel headphone volume   0dB          */\
    0x01f9,  /* 3 RIGHTHPVOL Right channel headphone volume  0dB          */\
    0x0011,  /* 4 ANAPATH    Analog audio path control       DAC on, Mic boost 20dB*/\
    0x0000,  /* 5 DIGPATH    Digital audio path control      All Filters off */\
    0x0000,  /* 6 DPOWERDOWN Power down control              All Hardware on */\
    0x0043,  /* 7 DIGIF      Digital audio interface format  16 bit       */\
    0x008d,  /* 8 SAMPLERATE Sample rate control             8 KHZ        */\
    0x0001   /* 9 DIGACT     Digital interface activation    On           */\
};

// Codec handle: a variable used to identify audio interface
DSK6713_AIC23_CodecHandle H_Codec;

// The order of the FIR filter + 1
#define N 88

// The size, in bytes, of the buffer
#define BUFFER_BYTE_SIZE 1024

// the buffer
double x_buffer[BUFFER_BYTE_SIZE/8] = {0};

// Byte align
#pragma DATA_ALIGN(x_buffer, BUFFER_BYTE_SIZE)

// pointer to first element
double *X_PTR = x_buffer;

// include the coefficients
#include "fir_coef.txt"

// index of the current (zero-delay) sample
int index = 0;

// Assembly circular FIR
extern void circ_FIR_DP(double **ptr, double *coef, double *input_samp, double *filtered_samp,
    unsigned int numCoefs);

void init_hardware(void);
void init_HWI(void);
void ISR_AIC(void);

void main(){
    // initialize board and the audio port
    init_hardware();

    /* initialize hardware interrupts */
    init_HWI();

    /* loop indefinitely, waiting for interrupts */
    while (1)
    {};
}

/********************************** init_hardware() **********************************/
void init_hardware()
{
    // Initialize the board support library, must be called first
    DSK6713_init();

    // Start the AIC23 codec using the settings defined above in config
    H_Codec = DSK6713_AIC23_openCodec(0, &Config);

    /* Function below sets the number of bits in word used by McBSP (serial port) for
    receives from AIC23 (audio port). We are using a 32 bit packet containing two
    16 bit numbers hence 32BIT is set for receive */
    MCBSP_FSETS(RCR1, RWDLEN1, 32BIT);

    /* Configures interrupt to activate on each consecutive available 32 bits
    from Audio port hence an interrupt is generated for each L & R sample pair */
    MCBSP_FSETS(SPCR1, RINTM, FRM);

    /* These commands do the same thing as above but applied to data transfers to
    the audio port */
    MCBSP_FSETS(XCR1, XWDLEN1, 32BIT);
    MCBSP_FSETS(SPCR1, XINTM, FRM);
}

void init_HWI(void)
{
    IRQ_globalDisable();          // Globally disables interrupts
    IRQ_nmiEnable();              // Enables the NMI interrupt (used by the debugger)
    IRQ_map(IRQ_EVT_RINT1,4);     // Maps an event to a physical interrupt
    IRQ_enable(IRQ_EVT_RINT1);    // Enables the event
    IRQ_globalEnable();           // Globally enables interrupts
}

void ISR_AIC(void){
    double sample = 0, output = 0;
    sample = mono_read_16Bit(); // read
    circ_FIR_DP(&X_PTR, b, &sample, &output, N); // filter in assembly
    mono_write_16Bit((Int16) output); // write
}
```
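The buffer size of 1024 bytes is tied to the constants that the assembly routines load into the AMR (`MVK 4H` for the lower half, `MVKLH 9H` for the upper half). The sketch below reproduces that arithmetic on a host machine. It assumes the documented C6000 AMR layout (A5 mode in bits 3:2, BK0 in bits 20:16, block size of 2^(BK0 + 1) bytes); the helper names `demo_block_bytes` and `demo_amr_a5_bk0` are hypothetical and are not part of the lab code.

```c
#include <assert.h>
#include <stdint.h>

/* Block size in bytes selected by a 5-bit BK field value: 2^(bk + 1). */
uint64_t demo_block_bytes(unsigned bk)
{
    assert(bk < 32u);
    return (uint64_t)1 << (bk + 1u);
}

/* AMR word that puts A5 in circular mode using BK0 (mode bits 3:2 = 01)
   and stores the block-size code in BK0 (bits 20:16). */
uint32_t demo_amr_a5_bk0(unsigned bk0)
{
    assert(bk0 < 32u);
    return ((uint32_t)bk0 << 16) | (1u << 2);
}
```

For `bk0 = 9` this gives a block of 1024 bytes, i.e. 128 doubles, which is comfortably larger than the 88 coefficients, and an AMR word of `0x00090004`, matching the two halves written by `MVK` and `MVKLH`.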

#### A.5.2 Linear Assembly Implementation

```
; ***********************************************************************************
;               DEPARTMENT OF ELECTRICAL AND ELECTRONIC ENGINEERING
;                           IMPERIAL COLLEGE LONDON
;
;               EE 3.19: Real Time Digital Signal Processing
;                       Course by: Dr Paul Mitcheson
;
;           LAB 4: Double precision FIR using Circular Buffer Hardware
;
;                   ********* circ_FIR_DP.ASM **********
;
;                   Written by D. Harvey: 18 Jan 2010
; ***********************************************************************************

        .global _circ_FIR_DP

        .text

; *************************** _circ_FIR_DP description ******************************
;
; The input delay buffer has a data length of (size in bytes)/(data type length).
; The buffer you create must have a power of 2 size in bytes,
; i.e. its length in bytes must equal 2^X bytes (where X is an integer between 1 and 32).
;
; Also ensure that its data length (size in bytes/8) is longer than the
; coefficient array data length. The buffer will need to be data aligned
; using #pragma DATA_ALIGN(delay_buff_name, B) before it is defined,
; where B is your chosen delay buffer size in bytes.
;
; circ_FIR_DP function call in C:
;
; circ_FIR_DP(&circ_ptr, &coef[0], &read_samp, &filtered_samp, N);
;
; *************************** Register Assignments **********************************
;
; A0                                        B0  Loop Counter
; A1                                        B1
; A2                                        B2  Used to set AMR to circular mode
; A3                                        B3  Return to C Address
; A4  &circ_ptr                             B4  &coef[k]
; A5  circ_ptr                              B5
; A6  &read_samp                            B6  &filtered_samp
; A7                                        B7
; A8  Number of Coefs (N)                   B8
; A9                                        B9
; A10 LSB delay_circ[j], then product       B10 LSB coef[k]
; A11 MSB "                                 B11 MSB "
; A12                                       B12
; A13                                       B13 Temp Store for previous AMR register value
; A14 LSB Accumulator                       B14
; A15 MSB "                                 B15
;
; See Real Time Digital Signal Processing by Nasser Kehtarnavaz (page 146) for more
; info on mixing C and Assembly.
; ***********************************************************************************

_circ_FIR_DP:

        ; set circular mode using the AMR
        MVC .S2     AMR,B13         ; (0) Save contents of AMR reg to B13
        MVK .S2     4H,B2           ; (0) Lower half, set A5 to be circular buffering addressing mode using BK0
        MVKLH .S2   9H,B2           ; (0) Upper half. Set BK0 to work for 1024 bytes
        MVC .S2     B2,AMR          ; (0) set AMR reg

        ; get the data passed from C
        LDDW .D1    *A6,A11:A10     ; (4) Get the 64 bit data for read_samp, put it in A11:A10
        LDW .D1     *A4,A5          ; (4) Get the address of the circ_ptr, dereference then place in A5
        NOP 4                       ; A5 now holds address pointing into delay_circ

        STW .D1     A11,*--A5       ; (0) Store new input sample (MSB) to delay_circ array
     || ZERO .S1    A14             ; (0) zero accumulator LSB
        STW .D1     A10,*--A5       ; (0) Store new input sample (LSB) to delay_circ array
     || ZERO .S1    A15             ; (0) zero accumulator MSB

        STW .D1     A5,*A4          ; (0) write back the decremented pointer to circ_ptr
                                    ; this points to the end of the MSB of where the next sample
                                    ; will be stored on the next call to this function
     || MV .S2X     A8,B0           ; (0) move parameter (numCoefs) passed from C into b0

        ; **************************** loop begin **********************
loop:
        LDDW .D1    *A5++,A11:A10   ; (4) loads the (delayed) sample into A11:A10, and post increment pointer
     || LDDW .D2    *B4++,B11:B10   ; (4) load the coefficient into B11:B10, and post increment pointer
        NOP 4
        MPYDP .M1X  A11:A10,B11:B10,A11:A10 ; (9,4) DP multiply
        NOP 9
        ADDDP .L1   A15:A14,A11:A10,A15:A14 ; (6,2) DP ADD
        NOP 6

        ; MAC must use 64 bit IEEE double floating point data obtained from arrays defined in C

        ; manage loop
        SUB .D2     B0,1,B0         ; (0) b0 - 1 -> b0
   [B0] B .S2       loop            ; (5) loop back if b0 is not zero
        NOP 5
        ; **************************** loop end ************************

        ; send the result of MAC back to C
        STW .D2     A14,*B6         ; (0) Write accumulator (LSB) into filtered_samp
        STW .D2     A15,*+B6[1]     ; (0) Write accumulator (MSB) into filtered_samp

        ; restore previous buffering mode
     || MVC .S2     B13,AMR         ; (0) restore AMR reg to previous contents

        ; return to C code
        B .S2       B3              ; (5) branch to b3 (register b3 holds the return address)
        NOP 5

        .end
```
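As a cross-check of the data flow in the listing above, the following host-side C model imitates what the routine does with its circular pointer: step back one slot with wrap-around, store the new sample, then multiply-accumulate forwards over the coefficients. It is only a sketch under stated assumptions. `DemoCirc`, `demo_circ_fir` and `DEMO_SLOTS` are hypothetical names, the buffer is shortened to 8 doubles instead of 128, and the power-of-two mask stands in for the hardware circular addressing selected through the AMR.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SLOTS 8  /* hypothetical power-of-two length; the lab buffer holds 128 doubles */

typedef struct {
    double delay[DEMO_SLOTS]; /* delay line, a power-of-two number of slots */
    size_t pos;               /* slot of the newest sample (plays the role of circ_ptr) */
} DemoCirc;

double demo_circ_fir(DemoCirc *c, const double *coef, size_t ntaps, double input)
{
    double acc = 0.0;
    size_t k, rd;

    assert(ntaps <= DEMO_SLOTS);

    /* like the pre-decrement stores through A5: step back one slot, wrapping */
    c->pos = (c->pos - 1) & (DEMO_SLOTS - 1);
    c->delay[c->pos] = input;

    /* like the post-increment loads in the loop: newest sample first */
    rd = c->pos;
    for (k = 0; k < ntaps; k++) {
        acc += coef[k] * c->delay[rd];
        rd = (rd + 1) & (DEMO_SLOTS - 1);
    }
    return acc;
}
```

As in the assembly routine, only the decremented position is kept between calls; the read position used inside the loop is discarded when the function returns.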

#### A.5.3 Optimised Assembly Implementation

```
; ***********************************************************************************
;               DEPARTMENT OF ELECTRICAL AND ELECTRONIC ENGINEERING
;                           IMPERIAL COLLEGE LONDON
;
;               EE 3.19: Real Time Digital Signal Processing
;                       Course by: Dr Paul Mitcheson
;
;           LAB 4: Double precision FIR using Circular Buffer Hardware
;
;                   ********* circ_FIR_DP.ASM **********
;
;                   Written by D. Harvey: 18 Jan 2010
; ***********************************************************************************

        .global _circ_FIR_DP

        .text

; *************************** _circ_FIR_DP description ******************************
;
; The input delay buffer has a data length of (size in bytes)/(data type length).
; The buffer you create must have a power of 2 size in bytes,
; i.e. its length in bytes must equal 2^X bytes (where X is an integer between 1 and 32).
;
; Also ensure that its data length (size in bytes/8) is longer than the
; coefficient array data length. The buffer will need to be data aligned
; using #pragma DATA_ALIGN(delay_buff_name, B) before it is defined,
; where B is your chosen delay buffer size in bytes.
;
; circ_FIR_DP function call in C:
;
; circ_FIR_DP(&circ_ptr, &coef[0], &read_samp, &filtered_samp, N);
;
; *************************** Register Assignments **********************************
;
; A0  LSB Accumulator A                     B0  Loop Counter
; A1  MSB "                                 B1
; A2  LSB Multiplied result A               B2  Used to set AMR to circular mode - then reused LSB Multiplied Result B
; A3  MSB "                                 B3  Return to C Address (original) - then reused MSB "
; A4  &circ_ptr - possible reuse            B4  &coef[k] - don't use for calc
; A5  circ_ptr - don't use for calc         B5
; A6  &read_samp - possible reuse           B6  &filtered_samp (original) - then reused LSB Accumulator B
; A7                                        B7  MSB "
; A8  N, then LSB delay_circ[j] A           B8  LSB coef[k] A
; A9  MSB "                                 B9  MSB "
; A10 LSB delay_circ[j] B                   B10 LSB coef[k] B
; A11 MSB "                                 B11 MSB "
; A12                                       B12
; A13                                       B13
; A14                                       B14 Data pointer (DO NOT USE)
; A15                                       B15 Stack Pointer (DO NOT USE)
; ***********************************************************************************

_circ_FIR_DP:
        MVC .S2     AMR,B13         ; (0) Save contents of AMR reg to B13
     || STW .D2     B3,*++B15       ; (0) save return to C to stack
     || LDDW .D1    *A6,A11:A10     ; (4) Get the 64 bit data for read_samp, put it in A11:A10

        STW .D2     B6,*++B15       ; (0) save &filtered_samp to stack
     || MVK .S2     4H,B2           ; (0) Set AMR to allow A5 to be used for circular addressing with BK0
     || LDW .D1     *A4,A5          ; (4) Get the address of the circ_ptr, dereference then place in A5
        MVKLH .S2   9H,B2           ; (0) Set BK0 to allow for 1024 bytes addressing
        MVC .S2     B2,AMR          ; (0) set AMR reg

        NOP 2                       ; A5 now holds address pointing into delay_circ

        STW .D1     A11,*--A5       ; (0) Store new input sample (MSB) to delay_circ array
     || ZERO .S1    A1              ; (0) zero accumulator MSB
     || ZERO .S2    B3

        STW .D1     A10,*--A5       ; (0) Store new input sample (LSB) to delay_circ array
     || ZERO .S1    A0              ; (0) zero accumulator LSB
     || ZERO .S2    B2

        STW .D1     A5,*A4          ; (0) write back the decremented pointer to circ_ptr
                                    ; this points to the end of the MSB of where the next sample
                                    ; will be stored on the next call to this function
     || MV .L2X     A8,B0           ; (0) move parameter (numCoefs) passed from C into b0
     || MVK .S2     10,B1           ; (0) setup countdown to start addition

        ADD .L2     B0,B1,B0        ; (0) Branch needs 10/2=5 iterations to set up and get running
     || B .S2       loop            ; (0) Loop is only 4 cycles,
                                    ; so we need to kickstart the branch back for loop iteration 1

        NOP                         ; NOP to allow the branch for loop iteration 1 to happen right at the end

        ; **************************** loop begin **********************
loop:
   [B0] SUB .S2     B0,2,B0         ; (0) Decrement loop counter by 2, because we are doing two calculations together

   [B1] SUB .D2     B1,2,B1         ; (0) countdown to allow start of addition. Countdown is done by two
                                    ; because the loop counter is decremented by two. And since we added B0
                                    ; and B1 together before the loop, we must also double B1's value and
                                    ; subtract by 2 each time

   [B0] B .S2       loop            ; (5) for current iteration i, kickstart the branch back for iteration i+1
|| [B0] LDDW .D1    *A5++,A11:A10   ; (4) B - Load delayed sample
|| [B0] LDDW .D2    *B4++,B11:B10   ; (4) B - Load coefficient
|| [B0] MPYDP .M2X  B11:B10,A11:A10,B3:B2 ; (9,4) B - Multiply
|| [!B1] ADDDP .L2  B7:B6,B3:B2,B7:B6     ; (6,2) B - Accumulate

   [B0] LDDW .D1    *A5++,A9:A8     ; (4) A - Load delayed sample
|| [B0] LDDW .D2    *B4++,B9:B8     ; (4) A - Load coefficient
|| [B0] MPYDP .M1X  A9:A8,B9:B8,A3:A2     ; (9,4) A - Multiply
|| [!B1] ADDDP .L1  A1:A0,A3:A2,A1:A0     ; (6,2) A - Accumulate
|| [!B0] MV .S2     B6,B12          ; (0) for the final iteration this cycle, the LH result for B-Addition-44 is
                                    ; written on this cycle. We move the LH result for B-Addition-43 out of the
                                    ; way to prevent losing them
        ; **************************** loop end ************************

        MV .D1      A0,A12          ; (0) the UH result for A-Addition-44 is
                                    ; written on this cycle. We move the UH result for A-Addition-43 out of the
                                    ; way to prevent losing them
     || MV .S2      B7,B13          ; (0) the UH result for B-Addition-44 is
                                    ; written on this cycle. We move the UH result for B-Addition-43 out of the
                                    ; way to prevent losing them

        MV .D1      A1,A13          ; (0) the LH result for A-Addition-44 is
                                    ; written on this cycle. We move the LH result for A-Addition-43 out of the
                                    ; way to prevent losing them
     || ADDDP .L2   B7:B6,B13:B12,B13:B12 ; (6,2) the spurious B-Addition-45 will write the LH result
                                    ; 2 cycles after this. Better start adding the result of B-Addition-44

        ADDDP .L1   A1:A0,A13:A12,A13:A12 ; (6,2) the spurious A-Addition-45 will write the LH result
                                    ; 2 cycles after this. Better start adding the result of A-Addition-44

        LDW .D2     *B15--,B6       ; (4) get &filtered_samp from stack
        LDW .D2     *B15--,B0       ; (4) get return to C from stack

        NOP 3

        ADDDP .L1X  A13:A12,B13:B12,A13:A12 ; (6,2) Add the results of Side A and B together
        NOP

        ; return to C code
lend:
        B .S2       B0              ; (5) branch to b0 (holds the return address restored from the stack)
        NOP 3

        ; send the result of MAC back to C (these stores fill the remaining branch delay slots)
        STW .D2     A12,*B6         ; (0) Write accumulator (LSB) into filtered_samp

        STW .D2     A13,*+B6[1]     ; (0) Write accumulator (MSB) into filtered_samp
     || MVC .S2     B1,AMR          ; (0) reset AMR reg (B1 has counted down to 0, which selects linear addressing)

        .end
```