# First Steps with PYNQ

Import PYNQ as you would any other Python package by executing the next code cell:

```python
import pynq
```

## Downloading and inspecting a design

The PYNQ `Overlay` class represents a bitstream loaded onto the FPGA device of the accelerator card. When you create an `Overlay` instance, the specified xclbin file is automatically loaded onto the FPGA.

```python
ol = pynq.Overlay('intro.xclbin')
```

The `Overlay` class provides information about the design. One advantage of a high-level language is that we can inspect the object directly to see what our design contains. We can do this by using the built-in help system in Jupyter:

```python
ol?
```

```
Type:            Overlay
String form:     <pynq.overlay.Overlay object at 0x7f645db5a438>
File:            ~/anaconda3/lib/python3.7/site-packages/pynq/overlay.py
Docstring:      
Default documentation for overlay intro.xclbin. The following
attributes are available on this overlay:

IP Blocks
----------
vadd_1               : pynq.overlay.DefaultIP
vmult_1              : pynq.overlay.DefaultIP

Hierarchies
-----------
None

Interrupts
----------
None

GPIO Outputs
------------
None

Memories
------------
bank0                : Memory
Class docstring:
This class keeps track of a single bitstream's state and contents.

The overlay class holds the state of the bitstream and enables run-time
protection of bindings.

Our definition of overlay is: "post-bitstream configurable design".
Hence, this class must expose configurability through content discovery
and runtime protection.

The overlay class exposes the IP and hierarchies as attri
```

We can query the Overlay's `ip_dict` to get details of the hardware kernels in the design:

```python
ol.ip_dict
```

```
{'vadd_1': {'phys_addr': 0,
  'addr_range': 4096,
  'type': 'xilinx.com:hls:vadd:1.0',
  'hw_control_protocol': 'ap_ctrl_hs',
  'fullpath': 'vadd_1',
  'registers': {'CTRL': {'address_offset': 0,
    'access': 'read-write',
    'size': 4,
    'description': 'OpenCL Control Register',
    'type': 'unsigned int',
    'id': None,
    'fields': {'AP_START': {'access': 'read-write',
      'bit_offset': 0,
      'bit_width': 1,
      'description': 'Start the accelerator'},
     'AP_DONE': {'access': 'read-only',
      'bit_offset': 1,
      'bit_width': 1,
      'description': 'Accelerator has finished - cleared on read'},
     'AP_IDLE': {'access': 'read-only',
      'bit_offset': 2,
      'bit_width': 1,
      'description': 'Accelerator is idle'},
     'AP_READY': {'access': 'read-only',
      'bit_offset': 3,
      'bit_width': 1,
      'description': 'Accelerator is ready to start next computation'},
     'AUTO_RESTART': {'access': 'read-write',
      'bit_offset': 7,
      'bit_width'
```

In the output above, you can expand `vadd_1` and `vmult_1` to see their attributes, or search for an attribute in the *Filter* search box. Try expanding `vadd_1`, and then try filtering for **addr**.
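The `fields` entries in `ip_dict` give each control bit's `bit_offset` and `bit_width`, which is enough to decode a raw register value in software. The sketch below uses a trimmed, hand-copied subset of the `CTRL` fields shown above (the `AUTO_RESTART` entry is left out because its width is cut off in the output) together with a hypothetical raw register value; `decode_fields` is an illustrative helper, not part of PYNQ.

```python
# Trimmed, hand-copied subset of ol.ip_dict['vadd_1']['registers']['CTRL']['fields']
CTRL_FIELDS = {
    'AP_START': {'bit_offset': 0, 'bit_width': 1},
    'AP_DONE':  {'bit_offset': 1, 'bit_width': 1},
    'AP_IDLE':  {'bit_offset': 2, 'bit_width': 1},
    'AP_READY': {'bit_offset': 3, 'bit_width': 1},
}


def decode_fields(value, fields):
    """Hypothetical helper: extract each field's bits from a raw register value."""
    decoded = {}
    for name, info in fields.items():
        mask = (1 << info['bit_width']) - 1  # e.g. width 1 -> 0b1
        decoded[name] = (value >> info['bit_offset']) & mask
    return decoded


# Hypothetical raw CTRL value with bits 1 (AP_DONE) and 2 (AP_IDLE) set
print(decode_fields(0b0110, CTRL_FIELDS))
# {'AP_START': 0, 'AP_DONE': 1, 'AP_IDLE': 1, 'AP_READY': 0}
```

The same helper works for any register whose `fields` follow this layout, since it only relies on `bit_offset` and `bit_width`.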

Another way to inspect the design is by using tab-completion. In the next cell, try placing the cursor after the `ol.` and pressing the Tab key to see a list of attributes for the object, which include the hardware kernels and memory banks. (You **should not** execute this cell before editing it; it is only used to illustrate tab-completion.)

```python
ol.
```

## Managing Memory

Before exploring the hardware kernels, we need some buffers on the card for them to use. In PYNQ, all buffer management is performed by the `pynq.allocate` function. If you have used [NumPy](https://numpy.org/) before, the API should be familiar. `allocate` takes two main parameters: the shape of the array and the [data type to use](https://docs.scipy.org/doc/numpy/user/basics.types.html). NumPy can represent basic types as a short string in which the first character gives the kind of type and the digit that follows gives the size in bytes. For example, here are some common types:

| NumPy type string | Description |
|---:|---|
| `"u4"` | 32-bit unsigned integer |
| `"i8"` | 64-bit signed integer |
| `"f4"` | 32-bit (single-precision) floating point |
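Because the digit in each type string is the element size in bytes, the memory a buffer needs is simply the number of elements multiplied by that size. A small pure-Python sketch of the arithmetic follows; `buffer_bytes` is an illustrative helper that only handles simple strings like the ones in the table (with NumPy installed, `numpy.dtype("u4").itemsize` gives the same element size).

```python
def buffer_bytes(shape, type_string):
    """Illustrative helper: bytes needed for an array of the given shape.

    Assumes a simple NumPy-style type string such as "u4", "i8" or "f4",
    where everything after the first character is the size in bytes.
    """
    itemsize = int(type_string[1:])
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize


size = buffer_bytes((1024, 1024), "u4")
print(size, size // 2**20, "MiB")  # 4194304 4 MiB
print(3 * size // 2**20, "MiB for three such buffers")  # 12 MiB for three such buffers
```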

For our vector addition kernel, let's allocate three buffers of 1024×1024 32-bit unsigned integers:

```python
in_a = pynq.allocate((1024,1024), 'u4')
in_b = pynq.allocate((1024,1024), 'u4')
out_c = pynq.allocate((1024,1024), 'u4')
```

Because buffers in PYNQ are NumPy arrays, we can use standard NumPy operations to assign values to our input arrays. To keep things simple, we'll set our two input buffers to constant values. When copying data into the buffers, we must use some form of indexing operation so that we copy into the contents of the buffer rather than replacing the buffer object itself. For example, executing `in_a = 100` would rebind the variable `in_a` to the integer 100, which is not what we want.

```python
in_a[:] = 100
in_b[:] = 200
```
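The difference between writing into a buffer and rebinding its name is plain Python semantics, so it can be shown without a card. In this sketch a standard-library `array.array` stands in for a PYNQ buffer (an assumption made purely for illustration). Unlike NumPy, `array` does not broadcast a scalar, so the slice assignment needs a same-length sequence.

```python
from array import array

# Stand-in for a PYNQ buffer: 8 unsigned ints (typecode 'I' is C unsigned int)
buf = array('I', [0] * 8)
alias = buf  # a second reference to the same object, e.g. held elsewhere in the program

# In-place write through a slice: the object itself is modified,
# so every reference (including `alias`) sees the new contents
buf[:] = array('I', [100] * len(buf))
print(alias[0])  # 100

# Plain assignment only rebinds the name `buf` to an int;
# the array object that `alias` refers to is unchanged
buf = 100
print(type(buf).__name__, alias[0])  # int 100
```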

A PYNQ buffer is actually two buffers that mirror each other: one in the host's memory, and one on the target FPGA device (or in the memory attached to it). So far we have only updated the host copy of the buffer, so hardware kernels on the device will not see the new values. To update the device's copy we call the `.sync_to_device` method, which replicates the changes.

```python
in_a.sync_to_device()
in_b.sync_to_device()
```
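The two-copy behaviour can be modelled in a few lines of pure Python. `MirroredBuffer` below is a hypothetical toy, not how PYNQ implements its buffers; it only mimics the idea that host writes are invisible on the device until `sync_to_device`, and device writes are invisible on the host until `sync_from_device`.

```python
class MirroredBuffer:
    """Toy model (hypothetical): separate host and device copies of one buffer."""

    def __init__(self, n):
        self.host = [0] * n
        self.device = [0] * n

    def sync_to_device(self):
        # Copy the host contents element by element into the device copy
        self.device[:] = self.host

    def sync_from_device(self):
        # Copy the device contents element by element back into the host copy
        self.host[:] = self.device


buf = MirroredBuffer(4)
buf.host[:] = [100] * 4
print(buf.device)      # [0, 0, 0, 0]  (device copy is stale)
buf.sync_to_device()
print(buf.device)      # [100, 100, 100, 100]

buf.device[0] = 7      # e.g. a kernel writing its result on the device
print(buf.host[0])     # 100  (host copy is stale)
buf.sync_from_device()
print(buf.host[0])     # 7
```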

## Accelerator Objects

To interact with the hardware kernel we use the corresponding attribute of the Overlay class. For this example we're going to be looking at the `vadd_1` kernel which has been discussed in depth in the rest of the workshop.

```python
vadd = ol.vadd_1
```

The object representing the kernel has three main APIs:

 * `.call` to call (execute) the hardware kernel
 * `.signature` to get the Python function signature of the kernel; this matches how the kernel was originally written
 * `.args` to get a dictionary describing the arguments in more detail, for example which memory bank each argument is attached to

Starting with `.signature`, we can check that in this case the vector add function takes four arguments: two input arrays, an output array, and the number of elements in the arrays:

```python
vadd.signature
```

```
<Signature (in1: 'unsigned int const *', in2: 'unsigned int const *', out_r: 'unsigned int*', size: 'int')>
```
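The printed form `<Signature (...)>` matches the repr of Python's standard `inspect.Signature`, which suggests (an assumption, not something this notebook states) that `vadd.signature` is such an object. If so, its `bind` method can check an argument list before the kernel is called. The sketch rebuilds an equivalent signature by hand so it runs without a card; the strings `'a'`, `'b'`, `'c'` are placeholders for real buffers.

```python
import inspect

P = inspect.Parameter
# Hand-built copy of the signature printed above (annotations are C type strings)
sig = inspect.Signature([
    P('in1', P.POSITIONAL_OR_KEYWORD, annotation='unsigned int const *'),
    P('in2', P.POSITIONAL_OR_KEYWORD, annotation='unsigned int const *'),
    P('out_r', P.POSITIONAL_OR_KEYWORD, annotation='unsigned int*'),
    P('size', P.POSITIONAL_OR_KEYWORD, annotation='int'),
])

print(sig)

# bind() maps positional arguments to parameter names...
bound = sig.bind('a', 'b', 'c', 1024 * 1024)
print(dict(bound.arguments))  # {'in1': 'a', 'in2': 'b', 'out_r': 'c', 'size': 1048576}

# ...and raises TypeError when a required argument is missing
try:
    sig.bind('a', 'b', 'c')
except TypeError as exc:
    print('bad call:', exc)
```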

For more details we can refer to `.args`, which, among other things, also tells us which memory bank each argument should be placed in. For this example, only the default `bank0` is used.

```python
vadd.args
```

```
{'in1': XrtArgument(name='in1', index=1, type='unsigned int const *', mem='bank0'),
 'in2': XrtArgument(name='in2', index=2, type='unsigned int const *', mem='bank0'),
 'out_r': XrtArgument(name='out_r', index=3, type='unsigned int*', mem='bank0'),
 'size': XrtArgument(name='size', index=4, type='int', mem=None)}
```
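When a design uses more than one memory bank, it helps to list which buffers belong where. The sketch below groups arguments by their `mem` field, using a `namedtuple` stand-in that copies only the fields printed above (the real `XrtArgument` may carry more); `group_by_bank` is a hypothetical helper. Arguments with `mem=None`, such as the scalar `size`, are skipped.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same fields as the XrtArgument entries printed above
XrtArgument = namedtuple('XrtArgument', ['name', 'index', 'type', 'mem'])

args = {
    'in1': XrtArgument('in1', 1, 'unsigned int const *', 'bank0'),
    'in2': XrtArgument('in2', 2, 'unsigned int const *', 'bank0'),
    'out_r': XrtArgument('out_r', 3, 'unsigned int*', 'bank0'),
    'size': XrtArgument('size', 4, 'int', None),
}


def group_by_bank(args):
    """Hypothetical helper: map each memory bank to the argument names placed in it."""
    banks = defaultdict(list)
    for arg in sorted(args.values(), key=lambda a: a.index):
        if arg.mem is not None:  # scalar arguments such as `size` have no bank
            banks[arg.mem].append(arg.name)
    return dict(banks)


print(group_by_bank(args))  # {'bank0': ['in1', 'in2', 'out_r']}
```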

Finally we can call the hardware kernel using the buffers we allocated previously. The `.call` function will block until the computation is complete.

```python
vadd.call(in_a, in_b, out_c, 1024*1024)
```

Even though the computation has finished, the result is still only in the device buffer. To see the result in the notebook, we need to copy it back into host memory using `.sync_from_device`.

```python
out_c.sync_from_device()
```

We can now print the buffer and see what looks like a sea of constant values (as expected).

```python
out_c
```

```
PynqBuffer([[300, 300, 300, ..., 300, 300, 300],
            [300, 300, 300, ..., 300, 300, 300],
            [300, 300, 300, ..., 300, 300, 300],
            ...,
            [300, 300, 300, ..., 300, 300, 300],
            [300, 300, 300, ..., 300, 300, 300],
            [300, 300, 300, ..., 300, 300, 300]], dtype=uint32)
```

To be more rigorous we can use NumPy to verify in software that the output buffer is actually the result of adding the two input buffers.

```python
import numpy as np
np.array_equal(in_a + in_b, out_c)
```

```
True
```
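For checking edge cases without hardware, a pure-Python reference model of the kernel can be handy. This sketch assumes (it is not confirmed by this notebook) that `vadd` computes `out_r[i] = in1[i] + in2[i]` for `i < size` using C `unsigned int` arithmetic, which wraps modulo 2**32 for a 32-bit type; NumPy's element-wise `uint32` addition `in_a + in_b` used above wraps the same way. `vadd_reference` is a hypothetical name.

```python
UINT32_MASK = 0xFFFFFFFF  # C unsigned int assumed to be 32 bits


def vadd_reference(in1, in2, size):
    """Software model of the assumed kernel behaviour: element-wise uint32 sum."""
    return [(in1[i] + in2[i]) & UINT32_MASK for i in range(size)]


n = 8
print(vadd_reference([100] * n, [200] * n, n))  # [300, 300, 300, 300, 300, 300, 300, 300]

# Wrap-around at the top of the uint32 range
print(vadd_reference([0xFFFFFFFF], [2], 1))  # [1]
```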

## Exercises

The overlay that's currently loaded also contains a `vmult_1` kernel which performs vector multiplication.

 1. Call the `vmult_1` kernel and verify that it does what's expected
 2. Combine it with the `vadd` kernel to create a vector multiply-accumulate function
 3. (Optional) Have a look at NumPy's random number module (`numpy.random`) to create more interesting test data than just constants
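For exercises 2 and 3, a software model gives you something to compare the hardware results against. The sketch assumes `vmult_1` computes an element-wise product with the same 32-bit `unsigned int` wrap-around as above (an assumption; check `ol.vmult_1.signature` first) and uses the standard-library `random` module with a fixed seed. With NumPy you could draw the test data with `numpy.random.default_rng(seed).integers(...)` instead. `mac_reference` is a hypothetical helper name.

```python
import random

UINT32_MASK = 0xFFFFFFFF  # C unsigned int assumed to be 32 bits


def mac_reference(a, b, c):
    """Model of a multiply-accumulate: (a[i] * b[i] + c[i]) wrapped to uint32."""
    return [(x * y + z) & UINT32_MASK for x, y, z in zip(a, b, c)]


rng = random.Random(42)  # fixed seed so the test data is reproducible
n = 16
# Values below 2**16 keep a*b + c under 2**32, so no wrap-around occurs here
a = [rng.randrange(2**16) for _ in range(n)]
b = [rng.randrange(2**16) for _ in range(n)]
c = [rng.randrange(2**16) for _ in range(n)]

expected = mac_reference(a, b, c)
print(expected[:4])
```

The same lists could be copied into PYNQ buffers (for example `in_a[:] = ...` on a correctly shaped buffer) before running the kernels, and the hardware output compared against `expected`.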

New cells are automatically added to the notebook after you run each one. Once you're finished, shut down the kernel using the _Kernel_ menu to make sure the device is free for the other notebooks.

Copyright (C) 2020 Xilinx, Inc