# PYNQ‑Z2 Lab Protocol: YUV Filter with Streams + DMA

**Audience:** Embedded Systems students
**Board:** PYNQ‑Z2 (Zynq‑7020)
**Tools:** Vitis HLS, Vivado, PYNQ (Jupyter)
**Goal:** Implement a streaming RGB→YUV→filter→RGB accelerator, integrate it with AXI DMA, and measure speedup vs. a Python baseline.

---

## Why filter in YUV?

* **Perceptual separation:** YUV splits luminance (**Y**) from chrominance (**U, V**). Many visual tasks (brightness/contrast tweaks, denoising, edge enhancement) operate mainly on **Y** and leave color untouched.
* **Cleaner color handling:** Adjusting **Y** avoids RGB cross‑talk; color fidelity holds while brightness changes. You can also filter **U/V** (e.g., chroma noise reduction) independently.
* **Bandwidth & compression awareness:** Video systems commonly operate in YUV (e.g., 4:2:0). Thinking in YUV maps to real pipelines students will meet in the wild.

> In this lab we scale the **Y** channel, then convert back to RGB: simple, visible, and stream‑friendly.
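To make the three steps concrete, here is a minimal NumPy sketch of RGB→YUV, Y scaling, and YUV→RGB. It assumes the widely used integer approximation of limited‑range BT.601 (offsets 16/128); the provided HLS source and `yuv_filter_numpy` may use different coefficients or rounding, so treat the function names and constants here as illustrative rather than as the lab's reference.

```python
import numpy as np

def rgb2yuv(rgb):
    """HxWx3 uint8 RGB -> HxWx3 int32 limited-range YUV (assumed BT.601 integer approximation)."""
    r, g, b = (rgb[..., i].astype(np.int32) for i in range(3))
    y = ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16
    u = ((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128
    v = ((112 * r - 94 * g - 18 * b + 128) >> 8) + 128
    return np.stack([y, u, v], axis=-1)

def scale_y(yuv, scale):
    """Multiply only the Y channel by `scale`, clamping to 0..255; U and V pass through."""
    out = yuv.copy()
    out[..., 0] = np.clip((out[..., 0] * scale).astype(np.int32), 0, 255)
    return out

def yuv2rgb(yuv):
    """HxWx3 int32 limited-range YUV -> HxWx3 uint8 RGB."""
    c = yuv[..., 0] - 16
    d = yuv[..., 1] - 128
    e = yuv[..., 2] - 128
    r = (298 * c + 409 * e + 128) >> 8
    g = (298 * c - 100 * d - 208 * e + 128) >> 8
    b = (298 * c + 516 * d + 128) >> 8
    return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)

def yuv_filter_reference(rgb, scale):
    """Hypothetical helper chaining the three stages in software."""
    return yuv2rgb(scale_y(rgb2yuv(rgb), scale))
```

With `scale = 1.0`, pure white and pure black survive the round trip unchanged; other colors can shift by a count or two because of integer rounding.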

## Why `hls::stream`?

* **Throughput via pipelining:** Streams let stages run concurrently (source → `rgb2yuv` → `scale_y` → `yuv2rgb` → sink). With `#pragma HLS DATAFLOW`, HLS schedules them as a pipeline.
* **Backpressure built‑in:** AXI‑Stream handshakes (`TVALID/TREADY`) ensure no drops under bursty traffic.
* **Small buffers:** Process pixels as they arrive; avoid full‑frame BRAMs.
* **AXI‑Stream ready:** Streams map naturally to AXIS for DMA/video subsystems.
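As a rough software analogy for the points above, Python generators can model a stream: each stage pulls one beat at a time, holds no full‑frame buffer, and forwards a `last` flag that plays the role of `TLAST`. This is only a behavioral sketch with hypothetical function names; it does not model hardware concurrency or the `TVALID/TREADY` handshake.

```python
def source(pixels):
    """Yield (pixel, last) beats; `last` is True only on the final pixel, like TLAST."""
    n = len(pixels)
    for i, p in enumerate(pixels):
        yield p, i == n - 1

def scale_stage(beats, scale):
    """Scale the Y component of each (Y, U, V) beat and pass the last flag through."""
    for (y, u, v), last in beats:
        yield (min(255, int(y * scale)), u, v), last

def sink(beats):
    """Collect beats until the one marked last arrives."""
    out = []
    for p, last in beats:
        out.append(p)
        if last:
            break
    return out

frame = [(100, 128, 128), (200, 128, 128)]
result = sink(scale_stage(source(frame), 1.5))
```

Each stage consumes and produces exactly one beat per iteration, which is the property that lets HLS reach one pixel per clock in the real design.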

## Why a DMA engine?

* **Fast PS↔PL transfers:** AXI DMA moves large buffers between DDR and your accelerator without CPU copies.
* **MM2S/S2MM:** Memory‑to‑Stream feeds pixels into the IP; Stream‑to‑Memory collects the results.
* **Scales with image size:** Sustained throughput with minimal CPU involvement; clean timing measurements.
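On the PYNQ side, the two channels appear as `sendchannel` (MM2S) and `recvchannel` (S2MM) on the DMA driver object. The sketch below shows the usual call order, assuming a PYNQ `DMA` object and buffers created with `pynq.allocate`; the helper name `run_dma` and the instance name `axi_dma_0` are assumptions, so check them against your own design.

```python
def run_dma(dma, in_buf, out_buf):
    """Send in_buf through the accelerator and block until out_buf has been written."""
    dma.recvchannel.transfer(out_buf)  # arm S2MM first so results have a destination
    dma.sendchannel.transfer(in_buf)   # then start MM2S to feed the accelerator
    dma.sendchannel.wait()
    dma.recvchannel.wait()
    return out_buf

# Typical on-board setup (names are assumptions; the instance name comes from your block design):
# from pynq import Overlay, allocate
# import numpy as np
# ol = Overlay("yuv.bit")
# dma = ol.axi_dma_0
# in_buf = allocate(shape=(n_pixels,), dtype=np.uint32)
# out_buf = allocate(shape=(n_pixels,), dtype=np.uint32)
# run_dma(dma, in_buf, out_buf)
```

Passing the DMA object in as a parameter keeps the helper easy to time and to reuse across image sizes.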

---

## Provided HLS top (summary)

The top function uses AXI‑Stream I/O and a float parameter `scale_Y`:

```c
#pragma HLS INTERFACE axis      port=stream_in
#pragma HLS INTERFACE axis      port=stream_out
// Expose scale_Y via AXI‑Lite (verify/add if missing):
#pragma HLS INTERFACE s_axilite port=scale_Y bundle=CTRL
#pragma HLS INTERFACE s_axilite port=return  bundle=CTRL
// Optional to enable stage concurrency:
#pragma HLS DATAFLOW
```
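Because `scale_Y` is a `float` behind AXI‑Lite, the host has to write its IEEE‑754 bit pattern into the 32‑bit register rather than a Python float. A minimal sketch of that conversion follows; the register offset `0x10` and the instance name `yuv_filter_0` in the comments are assumptions, so confirm them in the driver header that HLS generates or in the `.hwh`.

```python
import struct

def float_to_u32(value):
    """Return the IEEE-754 single-precision bit pattern of `value` as an unsigned 32-bit int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def u32_to_float(word):
    """Inverse of float_to_u32, handy for reading the register back."""
    return struct.unpack("<f", struct.pack("<I", word))[0]

# On the board (assumed instance name and offsets):
# ip = ol.yuv_filter_0
# ip.write(0x10, float_to_u32(1.15))  # hypothetical offset of scale_Y
# ip.write(0x00, 0x01)                # set ap_start in the block-level control register
```

Reading the register back through `u32_to_float` is a quick way to verify that the value arrived as intended.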

> The internal stages: `rgb2yuv` → `scale_y` → `yuv2rgb` pass `ap_axis<24,...>` frames and preserve `TLAST`.
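On the host, each pixel has to be packed into one stream word before the DMA can send it. The sketch below assumes one pixel per 32‑bit word laid out as `0x00RRGGBB` (the width that the data width check in Part C assumes); the byte order is an assumption and must match what the HLS source expects.

```python
import numpy as np

def pack_rgb(rgb):
    """Pack an HxWx3 uint8 image into a flat uint32 array, one pixel per word (0x00RRGGBB)."""
    p = rgb.reshape(-1, 3).astype(np.uint32)
    return (p[:, 0] << 16) | (p[:, 1] << 8) | p[:, 2]

def unpack_rgb(words, height, width):
    """Inverse of pack_rgb: flat uint32 words back to an HxWx3 uint8 image."""
    words = np.asarray(words, dtype=np.uint32)
    r = (words >> 16) & 0xFF
    g = (words >> 8) & 0xFF
    b = words & 0xFF
    return np.stack([r, g, b], axis=-1).astype(np.uint8).reshape(height, width, 3)
```

If the hardware output shows swapped colors, this layout is the first thing to compare against the HLS code.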

---

## Part A — Software baseline (Python on PYNQ)

1. Copy `yuv_filter_soft.ipynb` and a test image (e.g., `input.png`) to the board’s Jupyter folder.
2. Run all cells of the notebook.
3. Record the **time per frame**; this is your CPU baseline.
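If the notebook does not already report timing, a small harness like the following can measure time per frame. It is a sketch: `time_per_frame` is a hypothetical helper, and the function you pass in would be whatever the notebook uses for the software filter.

```python
import time

def time_per_frame(fn, *args, repeats=5):
    """Call fn(*args) `repeats` times; return (best, mean) wall-clock seconds per call."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - t0)
    return min(samples), sum(samples) / len(samples)
```

Reporting both the best and the mean helps separate the filter's cost from one‑off effects such as first‑call imports or caching.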

---

## Part B — Build HLS IP in Vitis HLS

1. **Create HLS Component** (C/C++) named `yuv_filter_hls`. Set the **Clock** to **100 MHz** (10 ns period) to match FCLK0.
2. **Add source**: your provided `yuv_filter.cpp` (with the pragmas above).
   Ensure top function is named `yuv_filter` and ports are AXIS + AXI‑Lite (for `scale_Y`).
3. **Testbench**: Use the provided `yuv_filter_tb.cpp` to test functionality.
4. **C Simulation** → verify functional behavior.
5. **C Synthesis** → check latency/II; expect II≈1 per pixel after DATAFLOW.
6. **(Optional) C/RTL Co‑Sim** → confirm stream protocol passes.
7. **Export RTL** → *Package IP*. Note the **IP repository path**.
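The synthesis report can be sanity‑checked against a simple estimate: at II = 1, a frame needs roughly one clock cycle per pixel. The sketch below computes that ideal figure; it ignores pipeline fill, DMA setup, and memory stalls, so measured times will be higher.

```python
def frame_time_ms(width, height, clock_hz=100e6, ii=1, fill_cycles=0):
    """Ideal frame time in milliseconds for a pipeline accepting one pixel every `ii` cycles."""
    cycles = width * height * ii + fill_cycles
    return cycles / clock_hz * 1e3

print(f"{frame_time_ms(640, 480):.3f} ms")  # 3.072 ms
```

If the reported latency is several times larger than this estimate, one of the stages is probably not reaching II = 1.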

---

## Part C — Vivado Block Design (PS + DMA + HLS IP)

1. **New Project** targeting **PYNQ‑Z2** (board).
2. **Create Block Design** (`system`).
3. **Add IP:** `ZYNQ7 Processing System` → **Run Block Automation** (DDR, clocks, MIO).
4. **Add IP:** `AXI Direct Memory Access`, then double‑click it to configure: enable the **MM2S** (read) and **S2MM** (write) channels and disable **Scatter Gather** for simplicity.
   Also add **AXI SmartConnect/Interconnect** as prompted by automation.
5. **Add your HLS IP** from the repo (*Settings → IP → Repository → Add*).
6. **Hook up control (AXI‑Lite):**

   * Connect **DMA** control and **yuv_filter** control to **PS M_AXI_GP0** (Connection Automation helps).
   * Assign addresses in **Address Editor**.
7. **Hook up memory (HP port):**

   * Enable **S_AXI_HP0** on the Zynq PS.
   * Connect **DMA MM2S/S2MM** master ports to **S_AXI_HP0** via SmartConnect.
8. **Hook up streams:**

   * **MM2S M_AXIS** → the **yuv_filter** input stream (`stream_in`).
   * The **yuv_filter** output stream (`stream_out`) → **S2MM S_AXIS**.
9. **Data width check:**

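   * The lab assumes the HLS IP streams **32‑bit** words. Confirm the IP's TDATA width, then set the DMA's stream data width to match; if the widths cannot match, insert an AXI4‑Stream Data Width Converter.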
10. **Clocks/Resets:** Drive IP, DMA, and converters from **FCLK_CLK0 (100 MHz)** and a **Processor System Reset**.
11. **Validate Design** (green check) → fix any DRCs.
12. **Generate HDL Wrapper** (let Vivado manage it).
13. **Run** Synthesis → Implementation → **Generate Bitstream**.
14. Find artifacts:

    * Bitstream: `.../impl_1/system_wrapper.bit`
    * Handoff (.hwh): in the BD `system` folder (search for `*.hwh`).

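> **Overlay naming:** Rename both files to a shared base name (e.g., `yuv.bit`, `yuv.hwh`) and keep them in the same folder; PYNQ's `Overlay` looks for the `.hwh` next to the `.bit` under the same name.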
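The copy‑and‑rename step can be scripted so the two files never drift apart. This is a small sketch using only the standard library; `stage_overlay` is a hypothetical helper and the source paths are whatever your Vivado project produced.

```python
import shutil
from pathlib import Path

def stage_overlay(bit_path, hwh_path, dest_dir, base="yuv"):
    """Copy the bitstream and handoff file into dest_dir as <base>.bit and <base>.hwh."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    bit_out = dest / f"{base}.bit"
    hwh_out = dest / f"{base}.hwh"
    shutil.copyfile(bit_path, bit_out)
    shutil.copyfile(hwh_path, hwh_out)
    return bit_out, hwh_out
```

The returned paths can then be copied to the board together.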

---

## Part D — Run on PYNQ with DMA

1. Copy to the board: `yuv.bit`, `yuv.hwh`, `yuv_filter_hard.ipynb`, and a test image.
2. Run all cells of the notebook.
3. Compare to the CPU result (timing and a quick visual diff). For a number, compute MSE:

```python
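import numpy as np
from PIL import Image
from yuv_filter_numpy import yuv_filter_numpy

# CPU reference with the same image and scale factor as the hardware run
cpu = yuv_filter_numpy(np.array(Image.open('input.png').convert('RGB')), 1.15)
# out_rgb: HxWx3 uint8 result of the DMA run. Use int32, since int16 overflows when squaring differences above 181.
mse = np.mean((cpu.astype(np.int32) - out_rgb.astype(np.int32)) ** 2)
print('MSE vs CPU:', mse)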
```

---

## Deliverables

* **Diagram/Screenshot** of the Vivado Block Design (PS, DMA, DataWidthConv if used, HLS IP).
* **Timing table** (CPU vs. HW) for at least two `scale_Y` values and two image sizes.
* **Result images** (CPU and HW) for one case.
* **Notes**: stream width choice, TKEEP/TLAST handling, clocking, any issues + fixes.
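For the timing table, a short helper can turn measured times into Markdown rows with the speedup computed consistently. This is a sketch with a hypothetical function name; feed it your own measurements.

```python
def timing_table(rows):
    """rows: iterable of (label, cpu_seconds, hw_seconds). Returns a Markdown table as a string."""
    lines = ["| Case | CPU (ms) | HW (ms) | Speedup |", "|---|---|---|---|"]
    for label, cpu_s, hw_s in rows:
        lines.append(
            f"| {label} | {cpu_s * 1e3:.2f} | {hw_s * 1e3:.2f} | {cpu_s / hw_s:.1f}x |"
        )
    return "\n".join(lines)
```

Each row label should state both the image size and the `scale_Y` value so the four required cases stay distinguishable.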

---

## Troubleshooting

* **Can’t see IP in Vivado:** Add the correct repo folder (contains `component.xml`) and *Refresh*.
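* **DMA errors / hangs:** Ensure the AXI‑Lite control ports of both the DMA and `yuv_filter` are address‑mapped, clocks and resets are connected, the accelerator has been started (`ap_start`, since `port=return` is on AXI‑Lite), and `recv.transfer()` is called **before** `send.transfer()`; then wait on both channels.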
* **No `.hwh` found:** It’s generated alongside the block design; search the project tree for `*.hwh`.
* **Color looks off:** Remember the RGB↔YUV math uses limited‑range offsets (16/128). Check for extra scaling or incorrect channel order.
* **Throughput low:** Enable `#pragma HLS DATAFLOW` and ensure each stage reads/writes the stream once per pixel (II≈1). Keep PL at 100 MHz initially.
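One further cause of DMA errors with larger images is the AXI DMA's **Width of Buffer Length Register** setting, which caps the number of bytes in a single transfer. The sketch below computes the width a frame needs, assuming 4 bytes per pixel and one transfer per frame; the commonly cited default of 14 bits (maximum 26) is an assumption here and should be confirmed in the IP customization dialog.

```python
def min_buffer_length_bits(width, height, bytes_per_pixel=4):
    """Return (bytes per frame, smallest register width n such that 2**n - 1 >= bytes)."""
    n_bytes = width * height * bytes_per_pixel
    return n_bytes, n_bytes.bit_length()

n_bytes, bits = min_buffer_length_bits(640, 480)
print(n_bytes, bits)  # 1228800 21
```

If the required width exceeds the configured one, either raise the setting in Vivado and rebuild, or split the frame into several smaller transfers.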

---

## Optional extensions

* Filter **U/V** channels (e.g., chroma denoise) and compare artifacts.
* Swap to **VDMA** and operate on video frames/lines with `TLAST` per line.
