# AFSK Demodulator
## Step 5: Digital PLL

-----

This notebook will outline the steps necessary to move the Digital PLL to FPGA.  This will be our largest and most complex project yet.

This code is part of the [AFSK Demodulator on Pynq](afsk-demodulator-fpga.ipynb) project.

The purpose of this code is to continue our migration of the Python demodulator code to FPGA.  We will be streaming audio data into the FPGA and streaming processed data out from the FPGA.

This is the third step of moving a demodulator processing step into the FPGA. At this point demodulation is being done in FPGA.  We are left with clock recovery and HDLC framing.  Here we address clock recovery.

At this point we must diverge from the design pattern we have been following.  No longer are we simply streaming data in and out.  The PLL has to indicate *lock* status and it has to output a *sample* indicator.  It may also need to output information about *jitter* for diagnostic purposes.

The Digital PLL in Python provides all of these interfaces.  However, we can change the interface.  We only need to provide two outputs from the PLL: a stream of sampled bits, and a lock indicator.  Audio data will be clocked in via a stream interface.  This will be demodulated to a bitstream and processed by the digital PLL.  The demodulator will clock out 3 bits for each audio sample: the demodulated bit, a lock flag, and a sample indicator.  The sample indicator will never go high if the lock flag is low.
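The rule that the sample indicator never goes high while the lock flag is low is easy to check mechanically.  The following is a minimal sketch, not part of the project code: the helper name `check_sample_implies_lock` is hypothetical, and it assumes each output is available as a `(bit, sample, locked)` triple, the order used when the test bench data is printed later in this notebook.

```python
def check_sample_implies_lock(triples):
    """Return the indices where sample is high while lock is low.

    Each element of `triples` is assumed to be (bit, sample, locked).
    An empty result means the invariant holds for the whole sequence.
    """
    return [i for i, (bit, sample, locked) in enumerate(triples)
            if sample and not locked]


# Hypothetical data, made up for illustration only.
good = [(1, 0, 0), (1, 0, 1), (0, 1, 1)]
bad = [(1, 1, 0), (0, 1, 1)]

print(check_sample_implies_lock(good))  # []
print(check_sample_implies_lock(bad))   # [0]
```

A check like this can be run against both the Python reference output and the data read back from the FPGA.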

Recall from the Python implementation of the PLL that we need an IIR filter and a hysteresis module.  We will build and test these independently.  The Python PLL, IIR filter and hysteresis code all use floating point math.  We will change that to fixed point.
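As a reminder of what a hysteresis module does, here is a generic sketch of a two-threshold comparator.  It is not the code from `DigitalPLL`; the class name, the integer thresholds and the initial state are assumptions chosen only to illustrate the idea with integer (fixed-point friendly) values.

```python
class Hysteresis:
    """Generic two-threshold comparator (illustrative, not the project's code).

    The output switches to 1 when the input reaches `high`, switches to 0
    when the input falls to `low`, and otherwise keeps its previous state.
    """

    def __init__(self, low, high, initial=0):
        if low >= high:
            raise ValueError("low threshold must be below high threshold")
        self.low = low
        self.high = high
        self.state = initial

    def __call__(self, value):
        if value >= self.high:
            self.state = 1
        elif value <= self.low:
            self.state = 0
        return self.state


h = Hysteresis(-10, 10)
out = [h(v) for v in [0, 5, 12, 3, -3, -11, 4]]
print(out)  # [0, 0, 1, 1, 1, 0, 0]
```

Small excursions between the two thresholds do not change the output, which is what keeps a noisy input from toggling the state.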

## Prerequisites

At this point you are expected to have:

 * A configured PYNQ environment.
 * Vivado installed on your computer and configured for your board.
 * Experience working through the tutorials at https://pynq.readthedocs.io/.
 * Familiarized yourself with the AFSK demodulator implementation in Python.
 * Completed the first four steps of the tutorial to familiarize yourself with the process of creating a streaming interface.

## Outline

We are going to modify the FPGA IP we created in the previous tutorial to add the digital PLL, which takes the demodulated bitstream we are now generating and produces sampled bits along with the lock and sample indicators.

We will perform the following steps in this section:

 1. Create a C++ file that accepts a block of 16-bit data, performs the FIR, correlator, low-pass filter and digital PLL operations, and sends the resulting bits back.
 1. Create a C++ test case for the above file.
 1. Generate an IP package from the code that can be used in Vivado.
 1. Create a Zynq project in Vivado that uses the IP.
 1. Export the bitstream for our project from Vivado.
 1. Use Python running on the PS to load the bitstream to the PL, and verify that it works.
 1. Integrate the FPGA module with the existing demodulator code, replacing the existing Python code.

First we are going to generate the FIR filter coefficients.  Then we are going to generate some sample data for our test bench. 

## Filter Coefficients

We continue to generate the filter coefficients, because we still need to test against the Python implementation.  But we no longer need to print them out.  Our work with filters is complete.  We now focus on the digital PLL.

```python
import numpy as np
from scipy.signal import lfiltic, lfilter, firwin
from scipy.io.wavfile import read

audio_file = read('../base/TNC_Test_Ver-1.102-26400-1sec.wav')
sample_rate = audio_file[0]
audio_data = audio_file[1]

bpf_coeffs = np.array(firwin(141, [1100.0/(sample_rate/2), 2300.0/(sample_rate/2)], width = None,
        pass_zero = False, scale = True, window='hann') * 32768, dtype=int)

lpf_coeffs = np.array(firwin(101, [760.0/(sample_rate/2)], width = None,
        pass_zero = True, scale = True, window='hann') * 32768, dtype=int)
```
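The HLS source later in this notebook stores these coefficients in `ap_int<13>` and `ap_int<12>` arrays.  The sketch below shows one way to confirm that a set of integer coefficients fits a given signed width.  The helper name `signed_bits` is hypothetical, and the example values are the largest and smallest entries of the coefficient tables in that listing.

```python
def signed_bits(values):
    """Smallest two's-complement width (in bits) that holds every value."""
    def width(v):
        v = int(v)
        # Non-negative v needs bit_length(v) magnitude bits plus a sign bit.
        # Negative v needs bit_length(-v - 1) magnitude bits plus a sign bit.
        magnitude = v.bit_length() if v >= 0 else (-v - 1).bit_length()
        return magnitude + 1
    return max(width(v) for v in values)


# Extremes taken from the coefficient tables in the HLS listing.
bpf_extremes = [2988, -2336]
lpf_extremes = [1893, -217]

print(signed_bits(bpf_extremes))  # 13
print(signed_bits(lpf_extremes))  # 12
```

With the arrays from the cell above, the same check would be `signed_bits(bpf_coeffs)` and `signed_bits(lpf_coeffs)`.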


## Test Bench Data

We will now generate the input and output data for our test bench.  We will again use our working Python model to generate data as a baseline.  We need to generate PLL output data.  This is going to be a bit different from the data currently provided because we are changing the interface slightly.  For each audio sample we need three numbers (bits) from the PLL: input, sample, locked.

```python
import sys
sys.path.append('../base')
from DigitalPLL import DigitalPLL

pll = DigitalPLL(sample_rate, 1200.0)

class fir_filter(object):
    def __init__(self, coeffs):
        self.coeffs = coeffs
        self.zl = lfiltic(self.coeffs, 32768, [], [])
    def __call__(self, data):
        result, self.zl = lfilter(self.coeffs, 32768, data, -1, self.zl)
        return result

bpf = fir_filter(bpf_coeffs)
lpf = fir_filter(lpf_coeffs)

delay = 12

f = bpf(audio_data[:264])
c = np.array([int(x >= 0) for x in f])
# Delay the data
d = np.append(np.zeros(delay, dtype=int), np.array(c[:0-delay], dtype=int))
# XOR the digitized data with the delayed version
x = np.logical_xor(c, d)
l = lpf(x * 2 - 1)
comp = np.array([int(x >= 0) for x in l])

locked = np.zeros(len(comp), dtype=int)
sample = np.zeros(len(comp), dtype=int)

for i in range(len(comp)):
    sample[i] = pll(comp[i])
    locked[i] = pll.locked()


print(audio_data[:264])
print([[x,y,z] for (x,y,z) in zip(comp, sample, locked)])
```

```
[  719   748   468   487   533   880  1187  1717  2124  2262  2417  2371
  2106  1794  1275   690     3  -721 -1382 -1855 -2227 -2378 -2383 -2243
 -1953 -1510  -958  -291   214   497   833   909   818   620   290  -207
  -787 -1396 -2019 -2434 -2756 -2914 -2901 -2762 -2424 -1954 -1371  -667
   -66   270   638   762   762   682   490   235   100   161   280   583
   913  1391  1576  1634  1685  1398  1093   658   255    94     2   105
   349   761  1288  1898  2303  2564  2793  2744  2612  2264  1851  1280
   586  -143  -830 -1336 -1795 -1993 -2038 -1917 -1622 -1209  -646    28
   598   929  1265  1382  1330  1190   843   387  -157  -776 -1420 -1866
 -2227 -2379 -2346 -2193 -1868 -1409  -796  -111   557   949  1380  1636
  1604  1550  1310   946   449  -113  -744 -1260 -1629 -1888 -1907 -1800
 -1579 -1171  -623    23   707  1176  1579  1826  1836  1802  1550  1144
   641    30  -639 -1236 -1742 -2039 -2141 -2132 -1915 -1584 -1074  -460
   237   790  1137  1509  1588  1497  1286   937
```



The data above represents the PLL output from the same 10ms of data we have been testing with during this development process.  The values represent the input, sample, lock.
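The C++ test bench needs this baseline data in a form it can compile.  One possible approach, sketched below, is to format a Python sequence as a C++ array initializer and paste the result into the test bench.  The helper `to_c_array` and the array name `expected` are hypothetical and are not part of the project files.

```python
def to_c_array(name, values, ctype="int", per_line=12):
    """Format a sequence of integers as a C/C++ array definition."""
    values = [int(v) for v in values]
    rows = [", ".join(str(v) for v in values[i:i + per_line])
            for i in range(0, len(values), per_line)]
    body = ",\n    ".join(rows)
    return f"const {ctype} {name}[{len(values)}] = {{\n    {body}\n}};"


# Made-up values, two per line, to show the layout.
text = to_c_array("expected", [1, 0, 1, 5], per_line=2)
print(text)
```

With the arrays from the cell above, a call such as `to_c_array("audio", audio_data[:264], ctype="int16_t")` would produce the input vector.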

## Vivado HLS

We are going to make the biggest additions to the code since we started.  We will continue to use the core pieces we created earlier, but we now add the digital PLL.  This requires two additional components: an IIR filter and hysteresis.  In Python these components are implemented using floating point types; here we are going to switch to 18-bit fixed point.  Why 18 bits?  Because that is the width of the narrower multiplier input on the DSP48E1 blocks in the Zynq (the multiplier is 25 × 18 bits), so an 18-bit operand fits in a single DSP block.  Initial results show that this precision works.

If you would like to learn more about the capabilities of the DSP blocks in Zynq, the DSP48 User Guide from Xilinx is very detailed: https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf
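To get a feel for what 18-bit fixed point means for a value that was a float in Python, here is a small sketch.  The split of 15 fractional bits is an assumption made for illustration; the actual format used in the HLS headers may differ.  The sketch rounds to nearest and saturates on overflow, which is not the default behavior of the HLS `ap_fixed` type (it truncates and wraps unless configured otherwise).

```python
TOTAL_BITS = 18   # word width discussed above
FRAC_BITS = 15    # assumed number of fractional bits (illustrative only)


def to_fixed(x, frac_bits=FRAC_BITS, total_bits=TOTAL_BITS):
    """Quantize a float to a signed fixed-point integer, saturating on overflow."""
    scaled = int(round(x * (1 << frac_bits)))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def to_float(q, frac_bits=FRAC_BITS):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)


q = to_fixed(0.1)
print(q)               # 3277
print(to_float(q))     # close to 0.1, off by less than 2**-16
print(to_fixed(10.0))  # 131071, the largest 18-bit signed value
```

Comparing `to_float(to_fixed(x))` with `x` for the constants used in the Python PLL is a quick way to see how much precision is lost for a given choice of fractional bits.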

 1. Start Vivado HLS.
    ```bash
    vivado_hls
    ```
 1. Create a new project under the project_05 directory called HLS.
 1. Create a top-level function called demodulate5.
 1. Create 5 new files:
    * [demodulate.hpp](HLS/demodulate.hpp)
    * [demodulate.cpp](HLS/demodulate.cpp)
    * [hysteresis.hpp](HLS/hysteresis.hpp)
    * [iir_filter.hpp](HLS/iir_filter.hpp)
    * [digital_pll.hpp](HLS/digital_pll.hpp)    
 1. Create a new test bench:
    * [demodulate_test.cpp](HLS/demodulate_test.cpp)
 
The important part of this module is the addition of the three new header files which implement the digital PLL.  These work exactly the same as the digital PLL from the Python implementation.  The bulk of the code was copied from the [Mobilinkd TNC3 firmware](https://github.com/mobilinkd/tnc3-firmware) and modified slightly for fixed-point math.

-----

This is the header:

```c++
#include <ap_axi_sdata.h>
#include <hls_stream.h>
#include <stdint.h>

#define BPF_COEFF_LEN 141

typedef ap_axis<16,1,1,1> idata_type;
typedef ap_axis<1,1,1,1> odata_type;

void demodulate5(idata_type& input, odata_type& output);

```

The only change we needed to make here is to change the top-level function name.

And this is the source:

```c++
#include "demodulate.hpp"
#include "digital_pll.hpp"

#include "ap_shift_reg.h"

const ap_int<13> bpf_coeffs[] =
{    0,     0,     0,     0,     0,     0,     1,     3,     5,     8,     8,     5,
    -2,   -13,   -27,   -40,   -46,   -44,   -32,   -12,    11,    32,    44,    44,
    32,    14,     0,    -2,    13,    49,    97,   143,   170,   160,   104,     6,
  -118,  -244,  -340,  -381,  -352,  -258,  -120,    24,   138,   192,   173,    97,
     0,   -67,   -56,    62,   287,   575,   850,  1021,  1001,   737,   228,  -462,
 -1216, -1879, -2293, -2336, -1956, -1182,  -133,  1008,  2030,  2736,  2988,  2736,
  2030,  1008,  -133, -1182, -1956, -2336, -2293, -1879, -1216,  -462,   228,   737,
  1001,  1021,   850,   575,   287,    62,   -56,   -67,     0,    97,   173,   192,
   138,    24,  -120,  -258,  -352,  -381,  -340,  -244,  -118,     6,   104,   160,
   170,   143,    97,    49,    13,    -2,     0,    14,    32,    44,    44,    32,
    11,   -12,   -32,   -44,   -46,   -40,   -27,   -13,    -2,     5,     8,     8,
     5,     3,     1,     0,     0,     0,     0,     0,     0,
};

const ap_int<12> lpf_coeffs[] =
{
    0,    0,    0,    1,    3,    5,    8,   11,   14,   17,   20,   21,   20,   17,
   11,    2,   -9,  -25,  -44,  -66,  -91, -116, -142, -167, -188, -205, -215, -217,
 -209, -190, -156, -109,  -47,   30,  123,  230,  350,  481,  622,  769,  919, 1070,
 1217, 1358, 1488, 1605, 1704, 1785, 1844, 1880, 1893, 1880, 1844, 1785, 1704, 1605,
 1488, 1358, 1217, 1070,  919,  769,  622,  481,  350,  230,  123,   30,  -47, -109,
 -156, -190, -209, -217, -215, -205, -188, -167, -142, -116,  -91,  -66,  -44,  -25,
   -9,    2,   11,   17,   20,   21,   20,   17,   14,   11,    8,    5,    3,    1,
	0,    0,    0,
};

template <typename InOut, typename Filter, size_t N>
InOut fir_filter(InOut x, Filter (&coeff)[N])
{
    static InOut shift_reg[N];

    int32_t accum = 0;
    filter_loop: for (size_t i = N-1 ; i != 0; i--)
    {
#pragma HLS unroll factor=20
        shift_reg[i] = shift_reg[i-1];
        accum += shift_reg[i] * coeff[i];
    }

    shift_reg[0] = x;
    accum += shift_reg[0] * coeff[0];

    return static_cast<InOut>(accum >> 15);
}

ap_shift_reg<bool, 12> delay_line;
DigitalPLL<> dpll(26400, 1200);

void demodulate5(idata_type& input, odata_type& output)
{
#pragma HLS INTERFACE axis port=input
#pragma HLS INTERFACE axis port=output
#pragma HLS interface ap_ctrl_none port=return

	ap_int<16> bpfiltered, lpfiltered;
	ap_int<1> comp, delayed, comp2;
	ap_int<2> corr;

	bpfiltered = fir_filter(input.data, bpf_coeffs);
	comp = bpfiltered >= 0 ? 1 : 0;
	delayed = delay_line.shift(comp);
	corr = comp ^ delayed;
	corr <<= 1;
	corr -= 1;
	lpfiltered = fir_filter(corr, lpf_coeffs);
	comp2 = lpfiltered >= 0 ? 1 : 0;
	typename DigitalPLL<>::result_type result = dpll(comp2 != 0);

	ap_int<3> tmp = (std::get<0>(result) << 2) |
			(std::get<1>(result) << 1) | std::get<2>(result);
	output.data = tmp;
    output.dest = input.dest;
    output.id = input.id;
    output.keep = input.keep;
    output.last = input.last;
    output.strb = input.strb;
    output.user = input.user;
}
```
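When comparing FPGA results against Python it can help to have an integer-only model of the `fir_filter` template above.  The sketch below follows the same steps: shift the new sample into a delay line, multiply-accumulate against the integer coefficients, and shift the sum right by 15 bits.  The class name `FixedPointFir` is hypothetical, and the model does not reproduce the final cast to the narrower `InOut` type or the 32-bit accumulator limit.

```python
class FixedPointFir:
    """Integer model of a FIR filter with coefficients scaled by 2**shift."""

    def __init__(self, coeffs, shift=15):
        self.coeffs = [int(c) for c in coeffs]
        self.shift = shift
        self.taps = [0] * len(self.coeffs)  # delay line, newest sample first

    def __call__(self, x):
        # Shift the delay line by one and insert the new sample at the front.
        self.taps = [int(x)] + self.taps[:-1]
        accum = sum(t * c for t, c in zip(self.taps, self.coeffs))
        # Python's >> on a negative int rounds toward minus infinity
        # (an arithmetic shift).
        return accum >> self.shift


# Two-tap averaging filter: both coefficients are 0.5 in Q15.
fir = FixedPointFir([16384, 16384])
out = [fir(v) for v in [100, 100, -100, -100]]
print(out)  # [50, 100, 0, -100]
```

Constructing it as `FixedPointFir(bpf_coeffs)` gives a sample-by-sample reference for the band-pass stage.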


### C++11

As before, we need to add a configuration setting to control the timing constraints.  In Vivado HLS, right click on the "solution1" window and select "Solution Settings...".  In the *Solution Settings* window, in the *General* tab, click the *Add* button.  Add a "config_core" setting for core "DSP48" with a latency of 3.  This is required to meet timing constraints with the new code.

We also use some new C++11 features -- specifically tuples.  For this we need to add compilation flags for use during simulation and synthesis.  Right click on the "HLS" project name in the Explorer window on the left side of the Vivado HLS UI and select "Project Settings...".  In the *Project Settings* window, select the *Simulation* tab.  Then select the "demodulate_test.cpp" file.  Click the *Edit CFLAGS* button and add "-std=c++11" to the flags.  Go to the *Synthesis* tab, highlight the "demodulate.cpp" file and make the same change.

-----

Once the code and test bench are written, we need to run the C simulation, C synthesis, C/RTL co-simulation, then package the IP.  The two simulation steps run our test bench.  This verifies that the code will synthesize properly and that it functions properly.  For a software engineer, this is the same as compiling and running unit tests.

Once the IP is packaged, we are done in HLS.

## Vivado

We will now switch over to Vivado and create a block design.  These steps should start to feel very familiar to you by now.

 1. Start Vivado and create a new project.
 1. Give it a path -- in our case `afsk-demodulator-pynq/project_05` and the name `Vivado`.
 1. Select the `RTL Project` project type.
 1. In the "Default Part" screen, switch to the "Boards" tab. Select your board from the list.
 1. Click "Finish".
 
With the new project open in Vivado, we need to create a block design.  We are going to follow the exact same procedure we followed in the previous projects.

 1. On the left side, in the Flow Navigator, select *Create Block Design*.
 1. Use the default name, design_1.
 1. Go into Tools|Settings.
    1. In the settings dialog, choose IP|Repository.
    1. Select "+" to add a repository.
    1. Add `project_05/HLS` as a repository.  You should see that it has 1 IP called `demodulate5` in there.
    1. When done, click "OK".
 1. In the Diagram view (main window) select "+" to add IP.
 1. Add the Zynq processing system and run block automation.
 1. When done, double-click the Zynq block and find the *High-performance AXI Slave Ports*.
 1. Click on the High-performance AXI Slave Ports.
 1. Enable the *S AXI HP0 interface*, then click OK.
 1. Add an AXI Stream Interconnect, AXI Direct Memory Access and the demodulator IP.
 1. Open the AXI Direct Memory Access, disable scatter/gather, and set the stream widths to 16 bits.
 1. Wire up the demodulator to the AXI Direct Memory Access and run connection automation.
    * A few additional modules are added: AXI SmartConnect, AXI Interconnect, and Processor System Reset
![BlockDiagram](BlockDiagram.png)
 1. Rename the demodulator block to "demodulate" and the DMA block to "dma".
 1. Combine the demodulate and dma blocks into a hierarchy called "demodulator".
 1. Generate the HDL wrapper by clicking on the design in the Sources box, right clicking, and selecting "Generate HDL Wrapper".
 1. Generate the bitstream. Again, this will take some time.
 1. Export the block design (File|Export|Export Block Design...)
 1. Collect the following files:
    - Vivado.srcs/sources_1/bd/design_1/hw_handoff/design_1.hwh
    - Vivado.runs/impl_1/design_1_wrapper.bit
    - design_1.tcl
    * Rename these files to "project_05.{ext}" so that you have project_05.bit, project_05.tcl and project_05.hwh.
 1. On the mounted Pynq filesystem, copy these files to `pynq/overlays/afsk_demodulator/`.
    ```bash
    cp project_05.{tcl,bit,hwh} /var/run/media/${USER}/PYNQ/pynq/overlays/afsk_demodulator/
    ```
 1. You can now jump to the Jupyter notebook on the Pynq device.
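
On the Python side, each output word has to be split back into its three flags.  The sketch below mirrors the shift-and-OR expression in `demodulate5`, where the first tuple element lands in bit 2, the second in bit 1 and the third in bit 0.  The helper name `unpack_word` is hypothetical, and which flag sits in which tuple position is defined by `digital_pll.hpp`, which is not shown here, so the fields are returned in generic order.

```python
def unpack_word(word):
    """Split a packed 3-bit word into (bit2, bit1, bit0).

    Mirrors (get<0> << 2) | (get<1> << 1) | get<2> from demodulate5.
    Any bits above the lowest three are ignored.
    """
    word = int(word) & 0x7
    return (word >> 2) & 1, (word >> 1) & 1, word & 1


print(unpack_word(0b101))  # (1, 0, 1)
print(unpack_word(0b110))  # (1, 1, 0)
```

Applying this to every element of the buffer returned by the DMA yields triples that can be compared against the Python baseline.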