# PYNQ‑Z2 Lab Protocol: YUV Filter with Streams + DMA

**Audience:** Embedded Systems students
**Board:** PYNQ‑Z2 (Zynq‑7020)
**Tools:** Vitis HLS, Vivado, PYNQ (Jupyter)
**Goal:** Implement a streaming RGB→YUV→filter→RGB accelerator, integrate it with AXI DMA, and measure speedup vs. a Python baseline.

---

## Why filter in YUV?

* **Perceptual separation:** YUV splits luminance (**Y**) from chrominance (**U, V**). Many visual tasks—brightness/contrast tweaks, denoising, edge enhancement—primarily affect **Y** without shifting color.
* **Cleaner color handling:** Adjusting **Y** avoids RGB cross‑talk; color fidelity holds while brightness changes. You can also filter **U/V** (e.g., chroma noise reduction) independently.
* **Bandwidth & compression awareness:** Video systems commonly operate in YUV (e.g., 4:2:0). Thinking in YUV maps to real pipelines students will meet in the wild.

> In this lab we scale the **Y** channel, then convert back to RGB—simple, visible, and stream‑friendly.
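
As a rough software model of the three stages, the sketch below uses the widely used 8-bit integer approximation of BT.601 limited-range conversion. The coefficients and the function names `rgb2yuv`, `scale_y`, and `yuv2rgb` are assumptions for illustration; the provided lab code may use different constants or rounding.

```python
import numpy as np

def rgb2yuv(rgb):
    """uint8 (H, W, 3) RGB -> uint8 YUV, BT.601 limited range (assumed coefficients)."""
    r, g, b = (rgb[..., i].astype(np.int32) for i in range(3))
    y = ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16
    u = ((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128
    v = ((112 * r - 94 * g - 18 * b + 128) >> 8) + 128
    return np.clip(np.stack([y, u, v], axis=-1), 0, 255).astype(np.uint8)

def scale_y(yuv, scale):
    """Multiply only the Y channel by `scale`, saturating at 255; U and V pass through."""
    out = yuv.astype(np.float32)
    out[..., 0] *= scale
    return np.clip(out, 0, 255).astype(np.uint8)

def yuv2rgb(yuv):
    """Inverse of rgb2yuv under the same assumed coefficients."""
    c = yuv[..., 0].astype(np.int32) - 16
    d = yuv[..., 1].astype(np.int32) - 128
    e = yuv[..., 2].astype(np.int32) - 128
    r = (298 * c + 409 * e + 128) >> 8
    g = (298 * c - 100 * d - 208 * e + 128) >> 8
    b = (298 * c + 516 * d + 128) >> 8
    return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)

# Example: brighten a mid-gray image by 15 %
gray = np.full((2, 2, 3), 128, dtype=np.uint8)
brighter = yuv2rgb(scale_y(rgb2yuv(gray), 1.15))
```

Because only **Y** is multiplied, the **U** and **V** planes are identical before and after `scale_y`, which is the property the list above describes.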

## Why `hls::stream`?

* **Throughput via pipelining:** Streams let stages run concurrently (source → `rgb2yuv` → `scale_y` → `yuv2rgb` → sink).
* **Backpressure built‑in:** AXI‑Stream handshakes (`TVALID/TREADY`) ensure no drops under bursty traffic.
* **Small buffers:** Process pixels as they arrive; avoid full‑frame BRAMs.
* **AXI‑Stream ready:** Streams map naturally to AXIS for DMA/video subsystems.
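
The streaming idea can be mimicked in plain Python with generators: each stage pulls one beat, transforms it, and forwards it, so no stage ever holds a full frame. This is only an analogy. Python generators interleave lazily rather than running concurrently as hardware stages do, and the names `source`, `stage`, and `sink` are hypothetical.

```python
def source(pixels):
    """Yield (pixel, last) beats; `last` mimics TLAST on the final beat."""
    n = len(pixels)
    for i, p in enumerate(pixels):
        yield p, i == n - 1

def stage(fn, stream):
    """One pipeline stage: transform the data, forward the `last` flag unchanged."""
    for data, last in stream:
        yield fn(data), last

def sink(stream):
    """Collect beats until the one flagged `last` arrives."""
    out = []
    for data, last in stream:
        out.append(data)
        if last:
            break
    return out

# Toy stage standing in for scale_y: double the first component, saturate at 255
double_first = lambda p: (min(255, p[0] * 2), p[1], p[2])

pixels = [(10, 20, 30), (40, 50, 60), (200, 1, 2)]
result = sink(stage(double_first, source(pixels)))
```

Chaining more `stage(...)` calls models the `rgb2yuv` → `scale_y` → `yuv2rgb` sequence; the `last` flag travels through every stage untouched, which is what the hardware chain must also guarantee for `TLAST`.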

## Why a DMA engine?

* **Fast PS↔PL transfers:** AXI DMA moves large buffers between DDR and your accelerator without CPU copies.
* **MM2S/S2MM:** Memory‑to‑Stream feeds pixels into the IP; Stream‑to‑Memory collects the results.
* **Scales with image size:** Sustained throughput with minimal CPU involvement; clean timing measurements.

## Part A — Software baseline (Python on PYNQ)

1. Copy `yuv_filter_soft.ipynb` and a test image (e.g., `input.png`) to the board’s Jupyter folder.
2. Run all cells of the notebook.
3. Record the **time per frame**; this is your CPU baseline.
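
If the notebook does not already report a per-frame time, a small helper along these lines can produce one. The names `time_per_frame` and `speedup` are hypothetical, and the warm-up call is a design choice to keep one-off costs such as imports and cache misses out of the average.

```python
import time

def time_per_frame(fn, frame, repeats=10):
    """Return the mean wall-clock seconds per call of fn(frame)."""
    fn(frame)  # warm-up call, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        fn(frame)
    return (time.perf_counter() - start) / repeats

def speedup(cpu_seconds, hw_seconds):
    """Ratio of CPU time to hardware time; values above 1 mean the hardware is faster."""
    return cpu_seconds / hw_seconds
```

The same helper can later wrap the hardware path, so both numbers in the timing table are measured the same way.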

---

## Part B — Build three HLS IP cores in Vitis HLS

Do these steps **three times** (one per top function): `rgb2yuv_ip`, `scale_y_ip`, `yuv2rgb_ip`.

1. **Create HLS Component** (C/C++), clock **100 MHz**.
2. **Add source:** the provided code; set the top function accordingly.
3. **Interfaces:**

   * `#pragma HLS INTERFACE axis port=stream_in`
   * `#pragma HLS INTERFACE axis port=stream_out`
   * For `scale_y_ip`: `#pragma HLS INTERFACE s_axilite port=scale_Y bundle=CTRL` and `s_axilite port=return bundle=CTRL`.
4. **C Simulation:** Use a simple stream testbench that pushes N pixels and asserts `TLAST` on the last pixel.
5. **C Synthesis:** Aim for **II=1** per stage. Check utilization and timing @100 MHz.
6. *(Optional)* C/RTL Co‑Sim to validate AXIS handshake and TLAST propagation.
7. **Export RTL → Package IP**. Note the **repository folders** of all three IPs.

---

## Part C — Vivado Block Design: PS + DMA + 3× IP (AXIS chain)

1. **New Project** → select **PYNQ‑Z2** board.
2. **Create Block Design** (e.g., `system`).
3. **Add `ZYNQ7 Processing System`** → **Run Block Automation** (DDR, FCLK0=100 MHz).
4. **Add `AXI DMA`**:

   * Enable **MM2S** and **S2MM**; disable Scatter‑Gather for simplicity.
   * Connect DMA control (AXI‑Lite) to **PS M_AXI_GP0** (Connection Automation).
5. **Add the three HLS IPs** (add their repo paths in *Settings → IP → Repository*): `rgb2yuv_ip`, `scale_y_ip`, `yuv2rgb_ip`.
6. **Connect control of `scale_y_ip`** (AXI‑Lite) to PS M_AXI_GP0. Assign addresses in **Address Editor**.
7. **AXI High‑Performance port:** Enable **S_AXI_HP0** on Zynq PS; connect DMA MM2S/S2MM **M_AXI** to **S_AXI_HP0** via SmartConnect.
8. **AXI‑Stream chain:**

   * **MM2S M_AXIS** → `rgb2yuv_ip` **s_axis**
   * `rgb2yuv_ip` **m_axis** → `scale_y_ip` **s_axis**
   * `scale_y_ip` **m_axis** → `yuv2rgb_ip` **s_axis**
   * `yuv2rgb_ip` **m_axis** → **S2MM S_AXIS**
9. **Data width and TKEEP:**

   * If using **32‑bit** streams (as in the code), keep all AXIS links at 32 bits and set the DMA transfer length to **4 × pixels** bytes. TKEEP can stay at 0xF: all four bytes of each word are marked valid on the bus, even though the last byte carries no pixel data.
   * Alternatively, re‑synthesize all cores as **24‑bit** streams and insert **AXIS Data Width Converters** at DMA if needed.
10. **Clocks/Resets:** Drive DMA and all IPs from **FCLK_CLK0** and a **Processor System Reset**. Route resets to each AXIS block.
11. **Validate Design** (no DRCs), **Generate HDL Wrapper**, **Run Synthesis → Implementation → Generate Bitstream**.
12. Collect artifacts: `system_wrapper.bit` and `system.hwh` (search the project tree for `*.hwh`). Rename both to a shared base name, e.g., `yuv.bit` and `yuv.hwh`; PYNQ looks for the `.hwh` next to the `.bit` under the same base name.
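
For the 32‑bit stream option in step 9, the host side has to place one pixel in each 32‑bit word. A minimal sketch follows, assuming R in bits 7:0, G in bits 15:8, B in bits 23:16, and an unused top byte. The actual byte order is fixed by the provided HLS code, so check it before relying on this layout; `pack_rgb` and `unpack_rgb` are hypothetical helper names.

```python
import numpy as np

def pack_rgb(rgb):
    """(H, W, 3) uint8 -> flat uint32 array, one pixel per 32-bit word.

    Assumed layout: R in bits 7:0, G in 15:8, B in 23:16, top byte zero.
    """
    px = rgb.reshape(-1, 3).astype(np.uint32)
    return px[:, 0] | (px[:, 1] << 8) | (px[:, 2] << 16)

def unpack_rgb(words, height, width):
    """Inverse of pack_rgb: flat uint32 words -> (H, W, 3) uint8 image."""
    words = np.asarray(words, dtype=np.uint32)
    rgb = np.stack([words & 0xFF, (words >> 8) & 0xFF, (words >> 16) & 0xFF], axis=-1)
    return rgb.astype(np.uint8).reshape(height, width, 3)

# A 2x4 test image occupies 8 words, i.e. 4 * pixels = 32 bytes on the stream
img = np.arange(24, dtype=np.uint8).reshape(2, 4, 3)
words = pack_rgb(img)
transfer_bytes = words.nbytes
```

`transfer_bytes` is the length the DMA moves per frame, matching the 4 × pixels rule in step 9.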

---

## Part D — Run on PYNQ with DMA

1. Copy to the board: `yuv.bit`, `yuv.hwh`, `yuv_filter_hard.ipynb`, and a test image.
2. Run all cells of the notebook.
3. Compare to the CPU result (timing and a quick visual diff). For a number, compute MSE:

```python
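import numpy as np
from PIL import Image
from yuv_filter_numpy import yuv_filter_numpy

# CPU reference; out_rgb is the hardware result from the notebook
cpu = yuv_filter_numpy(np.array(Image.open('input.png').convert('RGB')), 1.15)
# Widen to int32: squared differences (up to 255**2) overflow int16
mse = np.mean((cpu.astype(np.int32) - out_rgb.astype(np.int32)) ** 2)
print('MSE vs CPU:', mse)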
```

---

## Deliverables

* **Diagram/Screenshot** of the Vivado Block Design (PS, DMA, DataWidthConv if used, HLS IP).
* **Timing table** (CPU vs. HW) for at least two `scale_Y` values and two image sizes.
* **Result images** (CPU and HW) for one case.
* **Notes**: stream width choice, TKEEP/TLAST handling, clocking, any issues + fixes.

---

## Troubleshooting

* **Can’t see IP in Vivado:** Add the correct repo folder (contains `component.xml`) and *Refresh*.
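* **DMA errors / hangs:** Ensure the AXI‑Lite control interfaces (DMA and `scale_y_ip`) are mapped in the Address Editor, clocks and resets are connected, and `recv.transfer()` is called **before** `send.transfer()`; wait on both. Also check that `scale_y_ip` has been started through its CTRL interface (`ap_start`), since one idle stage stalls the whole chain.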
* **No `.hwh` found:** It’s generated alongside the block design; search the project tree for `*.hwh`.
* **Color looks off:** Remember the RGB↔YUV math uses limited‑range offsets (16/128). Check for extra scaling or incorrect channel order.
* **Throughput low:** Enable `#pragma HLS DATAFLOW` and ensure each stage reads/writes the stream once per pixel (II≈1). Keep PL at 100 MHz initially.
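
For the DMA hang case above, the transfer ordering can be sketched as a small helper. It assumes `recv` and `send` in the notebook refer to the PYNQ DMA object's `recvchannel` and `sendchannel`; the function name `run_dma` and the instance name `axi_dma_0` in the comments are hypothetical and depend on your block design.

```python
def run_dma(dma, in_buf, out_buf):
    """Arm S2MM (receive) first, then start MM2S (send), then wait on both."""
    dma.recvchannel.transfer(out_buf)  # receive side ready before data flows
    dma.sendchannel.transfer(in_buf)   # now push the input frame
    dma.sendchannel.wait()
    dma.recvchannel.wait()
    return out_buf

# On the board this would be used roughly as follows (not runnable off-target):
#   from pynq import Overlay, allocate
#   import numpy as np
#   ol = Overlay('yuv.bit')
#   dma = ol.axi_dma_0                       # name depends on the block design
#   in_buf = allocate(shape=(n_pixels,), dtype=np.uint32)
#   out_buf = allocate(shape=(n_pixels,), dtype=np.uint32)
#   run_dma(dma, in_buf, out_buf)
```

Arming the receive channel first means the S2MM side can accept data as soon as the first output beat appears, so backpressure never stalls the chain waiting for a buffer.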

---

## Optional extensions

* Filter **U/V** channels (e.g., chroma denoise) and compare artifacts.
* Swap to **VDMA** and operate on video frames/lines with `TLAST` per line.
