![figure](../lab8/lab8_figures/politecnico_h-01.png)
# **Eletrónica Configurável / Configurable Electronics**
#### Mestrado em Engenharia Eletrotécnica / Master in Electrical and Electronic Engineering

## **LabWork8 - Transferring data between PS and PL**

__________

## Introduction ##
In this tutorial you will learn how to use custom overlays and transfer data between the PS and PL in PYNQ. Low and high performance data transfers will be observed using different overlays, with and without DMAs. This notebook can be uploaded to the PYNQ board and you can run it from there.


### Objectives ###
After completing this lab, you will be able to:

* Use the Zynq GPIO (PS, AXI and MMIO) from PYNQ;
* Use allocated buffers to transfer data from PS to PL;
* Use DMAs to interface AXI accelerators;


In the instructions below **{sources}** refers to `C:\Xilinx\MEE_EC\sources` and **{labs}** refers to `C:\Xilinx\MEE_EC\labs`.

This tutorial was inspired by the Xilinx [PYNQ Workshop](https://github.com/Xilinx/PYNQ_Workshop).

__________

## Step 1 - GPIO with PYNQ (PS, AXI and MMIO) ##


### Step 1.1 ###

The aim of this step is to show how to use the Zynq PS GPIO from PYNQ. The PS GPIO are simple wires from the PS, and don't need a controller in the programmable logic. Up to 64 PS GPIO are available, and they can be used to connect simple control and data signals to IP or peripherals in the PL.

This example uses a bitstream that connects PS GPIO to the LEDs, buttons, and switches. The overlay can be designed in Vivado with the simple Block Design shown in the figure below.

![Figure](../lab8/lab8_figures/fig1.png)


The ps_gpio.bit and ps_gpio.hwh files can be found in the bitstream directory local to this folder. 


<div class="alert alert-block alert-warning">
<b>Don't forget:</b> In the bitstream directory you can also find a ps_gpio.tcl file. This file can be used to rebuild the block diagram in Vivado. 
</div>


* Check that the files ps_gpio.bit and ps_gpio.hwh exist in the bitstream directory in PYNQ.

In [None]:
!dir ./bitstream/ps_gpio.*

* Download the bitstream by passing its relative path to the Overlay class. The PYNQ GPIO class will then be used to access the PS GPIO.


In [None]:
from pynq import Overlay
ps_gpio_design = Overlay("./bitstream/ps_gpio.bit")

* In the design, PS GPIO pins 0 to 3 are connected to the pushbuttons, and pins 4 to 5 are connected to the dip-switches on the PYNQ-Z2 board. In Python we should therefore use the same order.

In [None]:
from pynq import GPIO

button0 = GPIO(GPIO.get_gpio_pin(0), 'in')
button1 = GPIO(GPIO.get_gpio_pin(1), 'in')
button2 = GPIO(GPIO.get_gpio_pin(2), 'in')
button3 = GPIO(GPIO.get_gpio_pin(3), 'in')

switch0 = GPIO(GPIO.get_gpio_pin(4), 'in')
switch1 = GPIO(GPIO.get_gpio_pin(5), 'in')

* Try pressing the button BTN0 on the board and rerunning the cell below. The other buttons and switches can be read in a similar way.

In [None]:
button0.read()

* Try pressing different buttons (BTN1, BTN2, BTN3), and moving the switches (SW0, SW1) while executing the cell below. Interrupt the kernel when satisfied.

In [None]:
from time import sleep
while True:
    print(f"Button0: {button0.read()}")
    print(f"Button1: {button1.read()}")
    print(f"Button2: {button2.read()}")
    print(f"Button3: {button3.read()}")

    print("")
    print(f"Switch0: {switch0.read()}")
    print(f"Switch1: {switch1.read()}")
    sleep(2)

* The LEDs can be used in a similar way; the only difference is the direction passed to the GPIO class. The LEDs are connected to PS GPIO 6 to 9 in the design we are using. Run the cells below.

In [None]:
led0 = GPIO(GPIO.get_gpio_pin(6), 'out')
led0.write(1)

In [None]:
led1 = GPIO(GPIO.get_gpio_pin(7), 'out')
led2 = GPIO(GPIO.get_gpio_pin(8), 'out')
led3 = GPIO(GPIO.get_gpio_pin(9), 'out')

In [None]:
from time import sleep

led1.write(1)
sleep(1)
led2.write(1)
sleep(1)
led3.write(1)

* Finally, turn off the LEDs

In [None]:
led0.write(0)
led1.write(0)
led2.write(0)
led3.write(0)

* Run a loop to set the LEDs to the value of the pushbuttons. Before executing the next cell, make sure Switch 0 (SW0) is "on". While the loop is running, press a push-button and notice the corresponding LED turns on. To exit the loop, change Switch 0 to off.

In [None]:
while(switch0.read() == 1):
    led0.write(button0.read())
    led1.write(button1.read())
    led2.write(button2.read())
    led3.write(button3.read())  
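
The four per-pin reads in the loop above can also be folded into a single integer, which makes the PS GPIO result directly comparable with the value returned by the AXI GPIO driver in Step 1.2. The sketch below is a minimal example; `pins_to_mask` is a hypothetical helper name, not part of PYNQ, and it only assumes that each object offers a `read()` method returning 0 or 1, as the `GPIO` objects created above do.

```python
def pins_to_mask(pins):
    """Pack the values of several input pins into one integer.

    pins[0] becomes bit 0, pins[1] becomes bit 1, and so on.
    Each element only needs a read() method that returns 0 or 1.
    """
    mask = 0
    for position, pin in enumerate(pins):
        mask |= (pin.read() & 1) << position
    return mask
```

On the board, `pins_to_mask([button0, button1, button2, button3])` would give 5 (binary 0101) while BTN0 and BTN2 are held down, assuming a pressed button reads as 1.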

### Step 1.2 ###

The aim of this step is to show how to use AXI GPIO from PYNQ. Multiple AXI GPIO controllers can be implemented in the programmable logic and used to control internal or external GPIO signals.

This example uses a bitstream that connects three AXI GPIO controllers to the LEDs, buttons, and switches. Each AXI GPIO controller has 2 channels, so multiple peripherals could be controlled from one AXI GPIO IP, but for simplicity and demonstration purposes, separate AXI GPIO controllers are used.

![Figure](../lab8/lab8_figures/fig2.png)


The axi_gpio.bit and axi_gpio.hwh files can be found in the bitstream directory local to this folder. The bitstream can be downloaded by passing the relative path to the Overlay class.


* Check that the bitstream and .tcl files exist in the bitstream directory

In [None]:
!dir ./bitstream/axi_gpio.*

* Download the bitstream

In [None]:
from pynq import Overlay
axi_gpio_design = Overlay("./bitstream/axi_gpio.bit")

* Check the IP Dictionary for this design. 

<div class="alert alert-block alert-info">
<b>Note:</b> The IP dictionary lists AXI IP in the design, and for this example will list the AXI GPIO controllers for the buttons, LEDs, and switches. The Physical address, the address range and IP type will be listed. If any interrupts, or GPIO were connected to the PS, they would also be reported. 
</div>


In [None]:
axi_gpio_design.ip_dict

In [None]:
hex(axi_gpio_design.ip_dict["buttons"]["phys_addr"])
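
To get a compact overview instead of the full dictionary dump, the entries can be summarised in a few lines. This is a sketch only: `summarize_ips` is a hypothetical helper, and the sample dictionary uses made-up addresses so the code can also run off the board. It assumes each entry carries `phys_addr`, `addr_range` and `type` keys, corresponding to the physical address, address range and IP type mentioned in the note above.

```python
def summarize_ips(ip_dict):
    """Return one formatted line per IP: name, base address, range, type."""
    lines = []
    for name in sorted(ip_dict):
        entry = ip_dict[name]
        lines.append("{:<10} 0x{:08x}  range=0x{:x}  {}".format(
            name,
            entry["phys_addr"],
            entry.get("addr_range", 0),   # 0 if the key is missing
            entry.get("type", "?"),       # "?" if the key is missing
        ))
    return lines

# Made-up sample shaped like an ip_dict; the addresses are NOT from the real design.
sample = {
    "leds": {"phys_addr": 0x41210000, "addr_range": 0x10000,
             "type": "xilinx.com:ip:axi_gpio:2.0"},
    "buttons": {"phys_addr": 0x41200000, "addr_range": 0x10000,
                "type": "xilinx.com:ip:axi_gpio:2.0"},
}
for line in summarize_ips(sample):
    print(line)
```

On the board, the same function could be called as `summarize_ips(axi_gpio_design.ip_dict)`.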

* The PYNQ AxiGPIO class will be used to access the AXI GPIO controllers. The instances can be found and referenced from the IP dictionary.

In [None]:
from pynq.lib import AxiGPIO

buttons_instance = axi_gpio_design.ip_dict['buttons']
buttons = AxiGPIO(buttons_instance).channel1

In [None]:
buttons.read()

* The buttons controller is connected to all four user push-buttons on the board (BTN0 to BTN3). Try pressing any combination of the buttons and rerunning the cell above.


* The AXI GPIO controller for the switches can be used in a similar way:

In [None]:
switches_instance = axi_gpio_design.ip_dict['switches']
switches = AxiGPIO(switches_instance).channel1

In [None]:
print(f"Switches: {switches.read()}")
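
`read()` on a channel returns all inputs of that channel packed into one integer. A short sketch of how to split such a value into individual bits is given below; `unpack_bits` is a hypothetical helper, and the assumption that bit 0 corresponds to SW0 (or BTN0) follows the numbering used in this lab rather than anything guaranteed by the driver.

```python
def unpack_bits(value, width):
    """Split an integer into a list of bits, least significant bit first."""
    return [(value >> position) & 1 for position in range(width)]

print(unpack_bits(0b10, 2))    # [0, 1]
print(unpack_bits(0b0101, 4))  # [1, 0, 1, 0]
```

With the objects above, `unpack_bits(switches.read(), 2)` would give one entry per switch and `unpack_bits(buttons.read(), 4)` one entry per push-button.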

* The LEDs can be used in a similar way.

In [None]:
led_instance = axi_gpio_design.ip_dict['leds']
led = AxiGPIO(led_instance).channel1

* The outputs can be addressed using a slice.

In [None]:
led[0:4].write(0x1)

In [None]:
from time import sleep

led[0:4].write(0x3)
sleep(1)
led[0:4].write(0x7)
sleep(1)
led[0:4].write(0xf)

* Turn off the LEDs

In [None]:
led[0:4].off()

* Run a loop to set the LEDs to the value of the pushbuttons. Before executing the next cell, make sure Switch 0 (SW0) is "on". While the loop is running, press a push-button and notice the corresponding LED turns on. To exit the loop, change Switch 0 to off.

In [None]:
while((switches.read() & 0x1) == 1):
    led[0:4].write(buttons.read())

### Step 1.3 ###

The aim of this step is to show how to use the MMIO (Memory Mapped I/O) PYNQ class.

This example uses the same bitstream from the previous step, with three AXI GPIO controllers connected to the LEDs, buttons, and switches. PYNQ drivers are available to read and write the AXI GPIO LEDs, switches and buttons, but for demonstration purposes the AXI GPIO controllers will be accessed here through the PYNQ MMIO class.

<div class="alert alert-block alert-info">
<b>Note:</b> This step will seem very similar to the previous one. We will be exercising the buttons, switches and LEDs in a similar way, but you should note that we are now using the MMIO class directly, and that there are small differences in code. For the MMIO class, we will be specifying an offset address that we read from or write to. If you examine the driver code for the LED, switches, and buttons classes, you will notice that they use the PYNQ MMIO class.
</div>


* Download the axi_gpio.bit overlay

In [None]:
from pynq import Overlay
axi_gpio_design = Overlay("./bitstream/axi_gpio.bit")

MMIO can map arrays, or a range of addresses. A physical memory address and a range are required by the MMIO class.

In this example, the MMIO class will be used to directly access the register space of the AXI GPIO and control the IP.

An AXI GPIO controller has two channels. In the design, only 1 channel of each AXI controller is used (as described in the previous step). 

We will only use two registers: 1) The data register is mapped to offset 0x0; and 2) the tri-state register is mapped to offset 0x4. To use an AXI GPIO, the tri-state register must be set to configure the IP as input or output. The data register can be read or written to. 

For example, the AXI GPIO connected to the LEDs sets the tri-state register to configure the IP as an output. The LEDs will turn on or off corresponding to the value written to the data register. For the buttons, or switches, the IP is configured as input and the value in the data register will be the value corresponding to the position of the switches or buttons.
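
The register usage described above can be captured in a few small helpers. The following is a sketch under stated assumptions: the offsets are the two given in the text (data at 0x0, tri-state at 0x4), the helper names are hypothetical, and `mmio` can be any object with `read(offset)` and `write(offset, value)` methods, such as a PYNQ `MMIO` instance.

```python
DATA_OFFSET = 0x0  # data register
TRI_OFFSET = 0x4   # tri-state register (bit = 1: input, bit = 0: output)

def set_direction(mmio, as_input):
    """Configure all 32 bits of the channel as inputs or as outputs."""
    mmio.write(TRI_OFFSET, 0xFFFFFFFF if as_input else 0x0)

def write_outputs(mmio, value, width=4):
    """Drive the lowest `width` outputs; higher bits are masked off."""
    mmio.write(DATA_OFFSET, value & ((1 << width) - 1))

def read_inputs(mmio, width=4):
    """Return the lowest `width` input bits as an integer."""
    return mmio.read(DATA_OFFSET) & ((1 << width) - 1)
```

For an MMIO object mapped to the LED controller, `set_direction(leds, as_input=False)` followed by `write_outputs(leds, 0xF)` would correspond to turning on all four LEDs.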

In the following example, 3 MMIO instances will be created corresponding to each AXI GPIO.


* First assign the physical addresses of the controllers to python variables.

In [None]:
buttons_address = axi_gpio_design.ip_dict['buttons']['phys_addr']
switches_address = axi_gpio_design.ip_dict['switches']['phys_addr']
leds_address = axi_gpio_design.ip_dict['leds']['phys_addr']

print("Physical address of buttons:  0x" + format(buttons_address, '08x'))
print("Physical address of switches: 0x" + format(switches_address, '08x'))
print("Physical address of LEDs:     0x" + format(leds_address, '08x'))


An MMIO instance is created with an address and a range. The range specifies the range of addresses that can be accessed from the base address. Care must be taken that the addresses being read or written physically exist in the system. Reading or writing to a location that is not accessible can cause the system to hang.

In [None]:
from pynq import MMIO
RANGE = 8 # Number of bytes; 2x 32-bit locations which is all we need for this example
buttons = MMIO(buttons_address, RANGE) 

* Write 0xffffffff to the tri-state register at offset 0x4 to configure the IO as inputs.

In [None]:
buttons.write(0x4, 0xffffffff) 

In [None]:
print(f"Push-buttons: {buttons.read()}")

* As before, try pressing any combination of the push-buttons while re-running the cell above. 


* The AXI GPIO controller for the switches can be used in a similar way:

In [None]:
switches = MMIO(switches_address, RANGE)
switches.write(0x4, 0xffffffff) 

In [None]:
print(f"Switches: {switches.read()}")

* The LEDs can be used in a similar way, this time 0x0 is written to the tri-state register to configure the IO as output.

In [None]:
leds = MMIO(leds_address, RANGE)
leds.write(0x4, 0x0) # Write 0x0 to location 0x4; Set tri-state to output

In [None]:
leds.write(0x0, 0xF) # Write 0xf to location 0x0 (Turn on the LEDs)

* Similarly to the previous step, we will run a loop to set the LEDs to the value of the pushbuttons. Before executing the next cell, make sure Switch 0 (SW0) is "on". While the loop is running, press a push-button and notice the corresponding LED turns on. To exit the loop, change Switch 0 to off.

In [None]:
while((switches.read(0x0) & 0x1) == 1):
    leds.write(0x0, buttons.read(0x0))

___________________ 

## Step 2 - Transferring data from PS to PL ##

In the previous step we have seen how to use PS_GPIO and AXI_GPIO, and how to do memory-mapped reads and writes using PYNQ. In general, we will only use these means to transfer relatively small amounts of data between the PS and the PL. ZYNQ processors have higher-performance AXI slave interfaces for transferring larger amounts of data. These ports allow AXI masters in the PL to directly access the PS memory system.

Before a PL master can access the PS memory, it needs to know which memory addresses it can access. Remember the operating system running on the board is managing a virtualized memory system. So we need to allocate PS memory and pass the address of this memory buffer to the IP in the PL.

The PYNQ **allocate** function can be used to do this. It assigns a contiguous block of memory, which helps system performance because a contiguous block is easier and more efficient for the PL IP to access. Fragmented memory can also be accessed, but performance is lower and a DMA that supports it (scatter-gather) is needed, which uses more PL resources.

### Step 2.1 ###

In this step we will see how to **allocate** and how we can use the allocated memory from the PL. 

<div class="alert alert-block alert-info">
<b>Note:</b> The allocate() driver is overlay-agnostic, meaning it can be used no matter what overlay you are using.
</div>


* Let's start by doing a simple check on the available memory in the system.

In [None]:
def free_mem():
    mem = !cat /proc/meminfo | grep 'MemFree'
    print(mem)

* Check free memory

In [None]:
free_mem()
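
The `!` syntax used in `free_mem()` only works inside IPython or Jupyter. For a plain Python script, the same information can be read from `/proc/meminfo` directly. The sketch below is one way to do it; the function names are hypothetical and the sample text contains made-up numbers used only to show the parsing.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, separator, rest = line.partition(":")
        fields = rest.split()
        # Keep only lines of the form "Name:   <number> [unit]"
        if separator and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values

def mem_free_kb(path="/proc/meminfo"):
    """Return the MemFree value reported by the kernel (in kB)."""
    with open(path) as meminfo:
        return parse_meminfo(meminfo.read())["MemFree"]

# Made-up sample text, used only to demonstrate the parser.
sample = "MemTotal:  506780 kB\nMemFree:  123456 kB\n"
print(parse_meminfo(sample)["MemFree"])  # 123456
```

Calling `mem_free_kb()` before and after an allocation gives two integers that can be subtracted, which is easier to compare than two printed strings.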

* Next, import the **allocate** function and display its help

In [None]:
from pynq import allocate
allocate?

We can see that we need to pass a shape (size) for the amount of memory we want. The default type is 32-bit unsigned integer (**u4**), but we can specify a different data type using **numpy** types.

* Check memory again. We've probably used a little bit of memory with the code we've run above. This will make it a little bit easier to do a before and after comparison.

In [None]:
free_mem()

* Now create a memory buffer with 10 million floating-point 32-bit elements (~40 Mbytes).

In [None]:
import numpy as np 
buffer = allocate(shape=(10000000,), dtype=np.float32)

* This has created a contiguous array of 40 Mbytes. If we check the available memory again we should see this number goes down by approximately that amount. This is a live system, so other processes are running and therefore free memory may fluctuate a little (numbers may not match exactly).

In [None]:
free_mem()
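
The expected drop in free memory can be worked out by hand. The short calculation below is only illustrative; it uses the standard `struct` module to confirm that a 32-bit float occupies 4 bytes, and it converts the result to the kB unit that `/proc/meminfo` reports (1 kB is 1024 bytes there).

```python
import struct

elements = 10_000_000
bytes_per_element = struct.calcsize("<f")  # 32-bit float: 4 bytes

total_bytes = elements * bytes_per_element
print(total_bytes)                      # 40000000
print(total_bytes // 1024)              # 39062 (kB, the unit used by /proc/meminfo)
print(round(total_bytes / 1024**2, 1))  # 38.1 (MiB)
```

So the `MemFree` value should drop by roughly 39000 kB, give or take whatever other processes allocate or release in the meantime.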

* We can see allocate gives us contiguous memory and it also gives us the virtual and physical addresses for this memory. Check the memory buffer addresses by running the following cell. The **physical address** is what we need to pass to the IP in the PL.

In [None]:
print("Buffer pointer address (physical memory):")
print(hex(buffer.physical_address))
print("Buffer pointer address (virtual memory):")
print(hex(buffer.virtual_address))

* We should free the memory once we are finished. It is good practice to free contiguous memory after use, as this prevents memory leaks.

In [None]:
del buffer
free_mem()

<div class="alert alert-block alert-info">
<b>Note:</b> It is normal that the available memory may not be exactly the same as the previous number.
</div>

### Step 2.2 ###

In this step, the PYNQ allocate function will be used to allocate a memory buffer in the DDR memory. The physical address of the memory will be passed to the PL, in this case to an IOP in the base overlay. The IOP has a connection to the PS DRAM. An application will run on the IOP to modify the contents of the memory buffer in the PS DRAM.

In a similar way, another IP in the PL could use a physical memory pointer to access PS DRAM.

* Create a buffer with 1000 32-bit integer elements.

In [None]:
from pynq import allocate
import numpy as np 
py_buffer = allocate(shape=(1000,), dtype=np.int32)

The **virtual address** can be used by any application running in Linux. This could be a Python application, or a C/C++ or other application running in Linux. The **physical address** can be passed to an IP block in an overlay (in the PL).


* Check the memory buffer addresses

In [None]:
print("py_buffer physical address {}".format(hex(py_buffer.physical_address)))

* Download the base overlay

In [None]:
from pynq.overlays.base import BaseOverlay
base = BaseOverlay('base.bit')

The C code for a new function that will run on a MicroBlaze is provided in the next cell. The C function parameters are a physical address, a length (number of 32-bit elements), and a data value. The function modifies the first `length` elements of the buffer starting at `address`: it reads each element, adds the offset `data` to it, and writes the result back.

* Create a MicroBlaze program to run in the ARDUINO IOP from the base overlay.

In [None]:
%%microblaze base.ARDUINO
void my_function(unsigned int physical_address, unsigned int length, int data) {
    int i;
    int *mb_buffer;
    
    // in Microblaze, DDR is accessed through a GP port at offset 0x20000000
    mb_buffer = (int *)(physical_address|0x20000000); // Cast to pointer and convert to DDR offset address

    // Write memory buffer in DDR
    for(i=0; i<length; i++){
        mb_buffer[i]= mb_buffer[i] + data;
    }
}

* Initialize the buffer with some values:

In [None]:
length = 20 
for i in range(length):
    py_buffer[i] = i + 100

* Check the content of the buffer

In [None]:
py_buffer[0:length]

* Call the IOP function with the physical address returned from the allocate instance, along with a length and a data value (the offset to add). The IOP application will then modify the memory buffer.

In [None]:
data = -11
my_function(py_buffer.physical_address, length, data)

* Check the contents of the buffer after the IOP application has modified the buffer. The cell above can be re-run with different values of data and length.

In [None]:
py_buffer[0:length]
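
What the MicroBlaze function does can be modelled in plain Python, which is useful for predicting the buffer contents before checking them on the board. This is only a model of the C code above, not code that runs on the IOP; the function names are hypothetical and the physical address is made up.

```python
DDR_WINDOW = 0x20000000  # offset the C code applies to reach DDR from the MicroBlaze

def mb_view_of(physical_address):
    """Address the MicroBlaze uses for a given PS physical address."""
    return physical_address | DDR_WINDOW

def model_my_function(buffer, length, data):
    """Add `data` to the first `length` elements, as the C loop does."""
    for i in range(length):
        buffer[i] = buffer[i] + data
    return buffer

length = 20
values = [i + 100 for i in range(length)]       # same initial values as py_buffer
print(model_my_function(values, length, -11))   # the values 89 to 108
print(hex(mb_view_of(0x16800000)))              # 0x36800000 (made-up address)
```

Note that the bitwise OR in `mb_view_of` equals adding the offset only while bit 29 of the physical address is clear, that is, for addresses below 0x20000000.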

* Free the memory

In [None]:
del py_buffer

### Step 2.3 ###

This step will show how to use a DMA to stream data from PS memory to the PL. We will allocate buffers in the PS memory for the DMA to access.

We will use an overlay with two AXI_DMA IPs from the Vivado IP catalog and one AXI Stream FIFO (input and output AXI stream interfaces). The FIFO represents an accelerator, which is here just implementing a loopback between the DMA AXI stream ports. A single DMA could be used with a read and write channel enabled, but for demonstration purposes, two different DMAs will be used.

Both DMAs have master ports (**M_AXI_MM2S** | **M_AXI_S2MM**) connected to the ZYNQ high performance ports (**S_AXI_HP0** | **S_AXI_HP2**). DMAs also have streaming ports (**M_AXIS_MM2S** | **M_AXIS_S2MM**) for sending and receiving data from the FIFO IP. Note:

1. The first DMA with read channel enabled is connected from DDR to IP input stream (reading from DDR, and sending to AXI stream).

2. The second DMA has a write channel enabled and is connected to IP output stream to DDR (receiving from AXI stream, and writing to DDR memory).


![Figure](../lab8/lab8_figures/fig3.png)


* Download the overlay. The overlay can be downloaded automatically when instantiating an overlay class.

In [None]:
from pynq import Overlay
overlay = Overlay("./bitstream/dma_tutorial.bit")

* We can check the IPs in this overlay. Notice the DMAs **axi_dma_from_ps_to_pl** and **axi_dma_from_pl_to_ps**.

In [None]:
overlay.ip_dict

* Also check the overlay help. Note that the AXI_DMA IP blocks have been assigned to the PYNQ AXI DMA class (**pynq.lib.dma.DMA**).

In [None]:
overlay?

*  Using the labels for the DMAs listed in the dictionary, create two DMA objects.

In [None]:
import pynq.lib.dma

dma_send = overlay.axi_dma_from_ps_to_pl
dma_recv = overlay.axi_dma_from_pl_to_ps

We are now ready to read some data from memory and write it to the FIFO. The first step is to allocate the input buffer with 100 32-bit unsigned integers.

* Import **pynq.allocate** to allocate the buffer, and **NumPy** to specify the type of the buffer.

In [None]:
from pynq import allocate
import numpy as np

data_size = 100
input_buffer = allocate(shape=(data_size,), dtype=np.uint32)

* The array can be used like any other NumPy array. We can write some test data to the array. Later the data will be transferred by the DMA to the FIFO.

In [None]:
for i in range(data_size):
    input_buffer[i] = i + 0xcafe0000

* Let's check the contents of the array. The data in the following cell will be sent from PS (DDR memory) to PL (streaming FIFO). Print the first few values of the buffer.

In [None]:
for i in range(10):
    print(hex(input_buffer[i]))

* Now we are ready to carry out a DMA transfer from a memory block in DDR to the FIFO, using the **transfer** function. Note that we pass the memory buffer itself, which automatically includes the physical address, so we don't need to pass it manually.

In [None]:
dma_send.sendchannel.transfer(input_buffer)

* Let's read the data back from the FIFO stream and write it to memory-mapped (MM) memory. The steps are similar. We will prepare an empty array before reading data back from the FIFO.

In [None]:
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)

for i in range(10):
    print('0x' + format(output_buffer[i], '02x'))

In [None]:
dma_recv.recvchannel.transfer(output_buffer)

* The next cell will print out the data received from PL (streaming FIFO) to PS (DDR memory). This should be the same as the data we sent previously.

In [None]:
for i in range(10):
    print('0x' + format(output_buffer[i], '02x'))

* Verify that the arrays are equal (a more complete comparison to check that the data received is actually the data sent).

In [None]:
print("Arrays are equal: {}".format(np.array_equal(input_buffer, output_buffer)))
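
The transfers above are started without checking for completion. A common pattern is to start the receive channel first, then the send channel, and to call `wait()` on both before reading the output buffer. The sketch below wraps that pattern; `loopback` is a hypothetical helper, and it assumes only that the channel objects provide `transfer()` and `wait()`, as the `sendchannel` and `recvchannel` attributes of the PYNQ DMA driver do.

```python
def loopback(send_channel, recv_channel, source, destination):
    """Stream `source` through the PL and collect the result in `destination`.

    The receive side is started first so that it is ready when data arrives;
    each wait() call blocks until the corresponding transfer has finished.
    """
    recv_channel.transfer(destination)
    send_channel.transfer(source)
    send_channel.wait()
    recv_channel.wait()
    return destination
```

On the board this could be called as `loopback(dma_send.sendchannel, dma_recv.recvchannel, input_buffer, output_buffer)` while both buffers are still allocated.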

* Free all the memory buffers. Don't forget to free the memory buffers to avoid memory leaks!

In [None]:
del input_buffer, output_buffer

________________

## Step 3 - Resizing an image with HW accelerator ##

We will now look at the potential of having both a processor (PS) and FPGA fabric (PL) to implement computationally intensive algorithms. There are two notebooks that illustrate the resize operation. One notebook shows the image resizing done purely in software using the Python Imaging Library. The second notebook shows the resize operation being performed in the programmable logic using a resizer IP from the Xilinx xfopencv library.


### Step 3.1 ###

Before you start you need to install the PYNQ "Hello World" repository from Xilinx. This repository can be found [here](https://github.com/Xilinx/PYNQ-HelloWorld) but it is more convenient to download it to PYNQ.


* You need to open a terminal on your PYNQ board and run `sudo pip3 install pynq-helloworld --no-build-isolation` to download and install the repository. Alternatively, you can also use the next cell in this notebook:


In [None]:
!sudo pip3 install pynq-helloworld --no-build-isolation

And you should get a result similar to the figure below.

![Figure](../lab7/lab7_figures/fig6.png)


* If you get a warning suggesting an upgrade in pip version, you can update with:

In [None]:
!/usr/local/share/pynq-venv/bin/python3 -m pip install --upgrade pip

* To get the notebooks and install them in lab8 directory, make sure you are there (with `pwd`) and then run `get-notebooks` with the **-p** option to target the current (*lab8*) folder.

In [None]:
!pwd

In [None]:
!pynq get-notebooks pynq-helloworld -p .


* When the cell finishes executing check your **lab8** folder in PYNQ and note that you should already have a new **pynq-helloworld** folder. Inside this folder you will find:
    * Two notebooks with the resizing function performed in software (resizer_ps) and hardware (resizer_pl).
    * One folder with images used in notebooks and the image to be resized.
    * The **.bit** file, required to program the FPGA fabric (PL).
    * The **.hwh** file, that contains all the information regarding the Vivado Block Design (CPUs, Buses, IP and the ports and pins used in the system, such as interrupts and IOs), needed to build a platform for a user's target device.


![Figure](../lab8/lab8_figures/fig8.png)

### Step 3.2 ###


In this section you will run the notebooks provided in the "Hello World" repository. You will probably find some parts of the code to be much more complicated than what you have seen before but don't worry. The idea is just for you to have a quick notion of the possibilities. We will cover these topics in more detail in the next labs.


* Open notebook **resizer_ps.ipynb** and follow instructions. These notebooks have code cells embedded and can be run directly with **shift-enter**. Wait for a cell to finish before running the next cell.

<div class="alert alert-block alert-info">
<b>Info:</b> When a cell is running the cell number becomes an asterisk: **In[ * ]**. 
</div>

* Now that you have seen how to resize an image using the processor (**PS**), open notebook **resizer_pl.ipynb** to see how the **Resizer** overlay can be used to perform the same algorithm much faster. Follow instructions.


* Note that the processor took more than 1 second to perform the resizing operation, while the hardware function took only ~250ms.
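
The two figures translate into a speed-up factor as follows. The times used below are the approximate values quoted above, not new measurements, and `speedup` is a hypothetical helper.

```python
def speedup(software_seconds, hardware_seconds):
    """Return how many times faster the hardware run is."""
    return software_seconds / hardware_seconds

print(speedup(1.0, 0.250))  # 4.0
```

Since the software run took more than 1 second, the hardware resizer is at least about four times faster in this example.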

________________