## PYNQ
This notebook is meant to run on the PYNQ board. You'll need the bitfile and the "hardware handoff" file from part 1.
The files should be named `hls4ml_demo.bit` and `hls4ml_demo.hwh`. In principle they can be named anything you like, but the two files do need to have the same name apart from the extension.
You should have the files:
- `hls4ml_demo.bit`
- `hls4ml_demo.hwh`
- `X_test.npy`
- `y_hls.npy`
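
Before loading anything, it can help to confirm that all four files made it onto the board. The following is a minimal sketch, assuming the files sit in the notebook's working directory; `missing_files` and `REQUIRED` are hypothetical names used for illustration and are not part of PYNQ.

```python
from pathlib import Path

# File names taken from the list above; adjust them if you renamed the bitfile.
REQUIRED = ["hls4ml_demo.bit", "hls4ml_demo.hwh", "X_test.npy", "y_hls.npy"]

def missing_files(directory=".", names=REQUIRED):
    """Return the entries of `names` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]

missing = missing_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files found.")
```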

In [None]:
from pynq import Overlay
from pynq import MMIO
import numpy as np
import struct
from datetime import datetime

## Load bitfile (overlay)
We will load the bitfile we generated onto the PL of the PYNQ SoC. The `.hwh` is used to define the Python interface for us, nice!
https://pynq.readthedocs.io/en/latest/overlay_design_methodology/python_overlay_api.html

In [None]:
overlay = Overlay("./hls4ml_demo.bit")

## Check
This should return `True` if the overlay loaded; otherwise something went wrong.

In [None]:
overlay.is_loaded()

## IP
This is the `hls4ml` generated IP (our NN).

In [None]:
ip = overlay.myproject_axi_0

## Register map
These are the registers of our IP which we can read/write.
There should be one for NN inputs and one for its outputs.

In [None]:
print(ip.register_map)
print("Memory_in_V.address: " + str(ip.register_map.Memory_in_V.address))
print("Memory_out_V.address: " + str(ip.register_map.Memory_out_V.address))

## MMIO
We used the `s_axilite` interface, so we will communicate using the MMIO.
In the HLS top level you would see, for example:
``` 
    #pragma HLS INTERFACE ap_ctrl_none port=return
    #pragma HLS INTERFACE s_axilite port=in
    #pragma HLS INTERFACE s_axilite port=out
```
This is the simplest interface, but also the slowest. One of the future tasks is to use a higher-performance connection such as DMA.
https://pynq.readthedocs.io/en/latest/overlay_design_methodology/pspl_interface.html

We need to specify the start address and the length (in bytes) of the MMIO address space for each interface.
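
The byte lengths follow from how many 16-bit values are packed into 32-bit words. Here is a rough sketch of that arithmetic, assuming 16 inputs and 5 outputs (inferred from the loops later in this notebook); `mmio_length_bytes` is a hypothetical helper, not a PYNQ function.

```python
WORD_BITS = 32    # AXI-Lite data width used by this design
VALUE_BITS = 16   # total width of ap_fixed<16,6>

def mmio_length_bytes(n_values, value_bits=VALUE_BITS, word_bits=WORD_BITS):
    """Bytes needed to hold n_values when packed into whole words."""
    per_word = word_bits // value_bits
    n_words = -(-n_values // per_word)  # ceiling division
    return n_words * (word_bits // 8)

print(mmio_length_bytes(16))  # 16 inputs -> 8 words -> 32 bytes
print(mmio_length_bytes(5))   # 5 outputs -> 3 words -> 12 bytes
```

The odd output count is why the last output word carries one unused 16-bit slot.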

In [None]:
in_mmio = MMIO(ip.mmio.base_addr + ip.register_map.Memory_in_V.address, 8 * 4)
ou_mmio = MMIO(ip.mmio.base_addr + ip.register_map.Memory_out_V.address, 3 * 4)

## Data
Load the jet tagging dataset that we saved on the host earlier.

In [None]:
X = np.load('./X_test.npy').astype(np.float32)
X = X[:1000]
y = []

## Driver / Encoding, Decoding
Our hls4ml NN used `ap_fixed<16,6>` for the input and output data types. Our dataset in the `X_test.npy` file contains `float` values. We need to make a few transformations to write to the NN.
- Cast to `int`. We need to 'shift' our `float`s up by 10 bits (the number of fractional bits of the `<16,6>` type) to 'align' the bits properly. This is `encode`.
- Pack a pair of values. The AXI interface here uses 32-bit data, but our values are 16 bits wide, so we pack two 16-bit values into one 32-bit integer. This is `encode_pair`.

At the output of the NN we need to do the reverse:

- Slice each 32-bit integer into two 16-bit values (the lower and upper 16 bits). This uses bit-masking (the `yab & 0x0000ffff` and `yab & 0xffff0000` in `decode_pair`)
- Shift back down to the physical range by 10 bits

In future this might become a 'driver': https://pynq.readthedocs.io/en/latest/overlay_design_methodology/python_overlay_api.html#customising-drivers
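
To make the bit layout concrete, here is a small worked round trip that can be run without the board. It is a sketch under the assumptions stated above (10 fractional bits, two's-complement 16-bit values, first value in the lower half of the word); the helper names `to_fixed16`, `from_fixed16`, `pack_pair` and `unpack_pair` are hypothetical and chosen only for this illustration.

```python
FRAC_BITS = 10    # ap_fixed<16,6> has 16 - 6 = 10 fractional bits
MASK16 = 0xFFFF

def to_fixed16(x):
    """Scale a float by 2**10 and keep the low 16 bits (two's complement)."""
    return int(round(x * 2**FRAC_BITS)) & MASK16

def from_fixed16(bits):
    """Interpret 16 raw bits as a signed value with 10 fractional bits."""
    signed = bits - 0x10000 if bits & 0x8000 else bits
    return signed * 2**-FRAC_BITS

def pack_pair(xa, xb):
    """Lower 16 bits hold xa, upper 16 bits hold xb."""
    return to_fixed16(xa) | (to_fixed16(xb) << 16)

def unpack_pair(word):
    """Split a 32-bit word back into its two decoded values."""
    return from_fixed16(word & MASK16), from_fixed16((word >> 16) & MASK16)

word = pack_pair(-1.5, 0.25)
print(hex(word))          # 0x100fa00
print(unpack_pair(word))  # (-1.5, 0.25)
```

Here -1.5 becomes -1536, which is `0xfa00` in 16-bit two's complement, and 0.25 becomes 256 (`0x0100`). Values outside the range [-32, 32) do not fit in `ap_fixed<16,6>` and wrap around in this sketch.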

In [None]:
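def encode(xi):
    # Scale by 2**10 (the 10 fractional bits of ap_fixed<16,6>) and round to an int
    return int(round(xi * 2**10))

def encode_pair(xa, xb):
    # Mask each value to 16 bits so a negative xa cannot borrow from xb's half
    return (encode(xa) & 0xffff) | ((encode(xb) & 0xffff) << 16)
    #return encode(xb) + encode(xa) * 2**16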

def decode(yi):
    return yi * 2**-10

def decode_pair(yab):
    ya = (yab & 0x0000ffff) * 2**-10
    ya = ya if ya < 32 else ya - 64
    yb = (yab & 0xffff0000) * 2**-26
    yb = yb if yb < 32 else yb - 64
    return ya, yb

def get_output(mmio):
    y = np.zeros(6)
    for i in range(3):
        yi = decode_pair(mmio.read(4 * i))
        y[2*i], y[2*i+1] = yi[0], yi[1]
    return y[:5]

## Run the inference!
Now we actually write the data to our hls4ml IP with `in_mmio.write` and read the output with `get_output(ou_mmio)`

In [None]:
timea = datetime.now()
for Xi in X:
    for i in range(8):
        xab = encode_pair(Xi[2*i], Xi[2*i+1])
        in_mmio.write(4 * i, xab)
    y.append(get_output(ou_mmio))
timeb = datetime.now()

## Time
How long did it take? You'll notice the time per inference is much higher than the IP latency or II. We're totally dominated by the IO and encoding/decoding.
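
As an alternative way to take this measurement, the standard-library `time.perf_counter` clock can wrap any per-sample function. This is only a sketch: `measure_rate` is a hypothetical helper, and the lambda below is a stand-in workload where the board version would do the write/read loop for one sample.

```python
import time

def measure_rate(process_one, samples):
    """Call process_one on each sample; return (count, seconds, samples per second)."""
    start = time.perf_counter()
    count = 0
    for sample in samples:
        process_one(sample)
        count += 1
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, elapsed, rate

# Stand-in workload, not the real inference call.
count, elapsed, rate = measure_rate(lambda s: s * 2, range(1000))
print("{} samples in {:.6f} s ({:.0f} / s)".format(count, elapsed, rate))
```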

In [None]:
def print_dt(timea, timeb, N):
    dt = timeb - timea
    dts = dt.total_seconds()
    # Use N (not the global X) so partial reports give the right rate
    rate = N / dts
    print("Classified {} samples in {} seconds ({} inferences / s)".format(N, dts, rate))
    
print_dt(timea, timeb, len(X))

## Compare
Load the `csim` predictions and print a few values. Hopefully they're basically the same! There could be small differences because the encoding and decoding used here to convert our `float`s to `ap_fixed<16,6>` may not match the conversion in `csim` exactly.
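
Beyond printing a few rows, the two sets of predictions can be summarised numerically. The sketch below uses a hypothetical helper, `compare_outputs`, which works on any pair of row sequences (lists or NumPy arrays); the sample rows are made up purely to show the call. One least-significant bit of `ap_fixed<16,6>` is `2**-10`, which gives a natural unit for judging the differences.

```python
def compare_outputs(y_board, y_ref):
    """Return (largest absolute difference, fraction of rows with the same argmax)."""
    max_diff = 0.0
    same_class = 0
    n_rows = 0
    for row_b, row_r in zip(y_board, y_ref):
        row_b, row_r = list(row_b), list(row_r)
        max_diff = max(max_diff, max(abs(b - r) for b, r in zip(row_b, row_r)))
        same_class += row_b.index(max(row_b)) == row_r.index(max(row_r))
        n_rows += 1
    agreement = same_class / n_rows if n_rows else 0.0
    return max_diff, agreement

# Made-up rows for illustration only; on the board you would pass y and y_hls.
board = [[0.125, 0.75, 0.125], [0.5, 0.25, 0.25]]
ref = [[0.125, 0.75, 0.125], [0.5, 0.25, 0.25 + 2**-10]]
max_diff, agreement = compare_outputs(board, ref)
print("max |diff| = {} LSB, argmax agreement = {}".format(max_diff / 2**-10, agreement))
```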

In [None]:
y_hls = np.load('./y_hls.npy')

In [None]:
print("Running on the board:")
for i in range(5):
    print(y[i])
print("Running on the CPU csim:")
for i in range(5):
    print(y_hls[i])

## More data
Now let's classify the whole dataset and save it.

In [None]:
X = np.load('./X_test.npy').astype(np.float32)
y = []
timea = datetime.now()
time0 = datetime.now()
for iXi, Xi in enumerate(X):
    for i in range(8):
        xab = encode_pair(Xi[2*i], Xi[2*i+1])
        in_mmio.write(4 * i, xab)
    y.append(get_output(ou_mmio))
    if (iXi + 1) % 5000 == 0:  # report after every 5000 completed samples
        time1 = datetime.now()
        print_dt(time0, time1, 5000)
        time0 = datetime.now()

timeb = datetime.now()
print_dt(timea, timeb, len(X))
np.save('y_pynq.npy', y)

## Continue to part 3
Copy `y_pynq.npy` back to the host where you ran the part 1 notebook, then make the final comparison in the part 3 notebook.