# Game of Life — Streaming Shift‑Register Pipeline (2 Blocks)

**Audience:** Embedded Systems students
**Board:** PYNQ‑Z2 (Zynq‑7020)
**Tools:** Vitis HLS, Vivado, PYNQ (Jupyter)
**Goal:** Replace a naïve DDR‑centric GoL kernel (9 reads per output) with a streaming, line‑buffered design that reads **each input pixel once** and achieves **II≈1** through pipelining.

---

## 1) Problem with the previous implementation

* **Access pattern:** For each output cell, the kernel issued **9 separate DDR reads** (center + 8 neighbors) via `m_axi`.
* **Consequences:**

  * **Bandwidth ×9:** For N pixels, read volume ≈ 9·N (plus writes). This quickly saturates HP ports and AXI interconnect.
  * **Poor II:** The HLS schedule stalls on DDR read latency and burst overhead, so achieving II=1 is difficult.
  * **Cache/BRAM underuse:** Neighboring windows overlap but data are re‑fetched instead of reused on‑chip.

**Takeaway:** Game of Life has strong **spatial locality** (overlapping 3×3 windows). Reuse on‑chip; don’t refetch from DDR.
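
To make the ×9 factor concrete, the following back-of-the-envelope sketch compares read volume per generation for the two access patterns. It assumes 1 byte per pixel (the "simple" encoding) and a 1024×1024 frame; the function name is illustrative only.

```python
def ddr_read_bytes(width, height, reads_per_output, bytes_per_pixel=1):
    """Bytes read from DDR to compute one generation of a width x height frame."""
    return width * height * reads_per_output * bytes_per_pixel

W = H = 1024
naive = ddr_read_bytes(W, H, reads_per_output=9)      # center + 8 neighbors
streaming = ddr_read_bytes(W, H, reads_per_output=1)  # each pixel read once

print(f"naive: {naive / 2**20:.0f} MiB, streaming: {streaming / 2**20:.0f} MiB")
# prints "naive: 9 MiB, streaming: 1 MiB"
```

Write traffic is the same in both designs (one output pixel per cell), so the saving comes entirely from the read side.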

---

## 2) Solution: line buffers + shift register (sliding 3×3 window)

**Idea:** Stream pixels **once** from DMA. Keep the last **three image rows** in on‑chip line buffers, and for each row, a **3‑tap horizontal shift register** to form a 3×3 window every cycle.

![image info](./implementation.png "FIFO Architecture")

**Pipeline phases:**

1. **Fill:** After reading the first 2 rows and the first 3 pixels of the 3rd row (2·width + 3 pixels in total), the first full 3×3 window becomes available.
2. **Steady state:** Each new pixel shifts the registers and produces **one 3×3 window per cycle** (II≈1).
3. **Drain:** At row end, handle borders and line‑buffer rotation (LB0←LB1, LB1←LB2).

**Latency vs throughput:** Per‑pixel **latency** still includes the fill delay (about two image rows) plus the compute stages, but **throughput** improves dramatically: one window per clock once the buffers are filled.
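
The mechanism can be checked in software before writing any HLS. Below is a minimal behavioral model in Python, not the HLS implementation: it assumes raster-order input, models three line buffers and three 3‑tap shift registers, and emits only windows that lie fully inside the frame (no border policy). The function name `stream_windows` is hypothetical.

```python
def stream_windows(pixels, width, height):
    """Yield (center_row, center_col, window) for every full 3x3 window.

    `pixels` is any iterable that delivers the frame in raster order,
    one pixel at a time, as a DMA stream would.
    """
    lb = [[0] * width for _ in range(3)]  # LB0 (oldest), LB1, LB2 (current row)
    sr = [[0, 0, 0] for _ in range(3)]    # one 3-tap shift register per window row
    it = iter(pixels)
    for r in range(height):
        for c in range(width):
            lb[2][c] = next(it)           # each input pixel is read exactly once
            for k in range(3):            # shift the new column in from the right
                sr[k] = [sr[k][1], sr[k][2], lb[k][c]]
            if r >= 2 and c >= 2:         # window lies fully inside the frame
                yield r - 1, c - 1, [row[:] for row in sr]
        lb = [lb[1], lb[2], lb[0]]        # row end: LB0<-LB1, LB1<-LB2, recycle old LB0

frame = list(range(16))                   # 4x4 frame, pixel value = raster index
windows = list(stream_windows(frame, 4, 4))
print(windows[0])   # (1, 1, [[0, 1, 2], [4, 5, 6], [8, 9, 10]])
```

The first window is produced when pixel (row=2, col=2) arrives and is centered on (1, 1). In hardware the row-end rotation is usually done by pointer or index swapping rather than by copying data, which is what the list reordering above imitates.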

---

## 3) Architecture for the lab: two streaming blocks

We split the design into two HLS IPs to make the windowing and the rule clear and reusable.

### Block A — **Window Generator** (DMA → 3×3 windows)

* **Input (AXI‑Stream):** 1 pixel per beat (binary GoL state). Choose one encoding:

  * **Simple:** 8‑bit per pixel (`0` or `1`) — easier to debug.
  * **Packed:** 1‑bit pixels packed into 32 or 64‑bit words — higher throughput; more advanced.
* **Output (AXI‑Stream):** a **3×3 window** per beat, e.g.:

  * **Option S (simple):** 9×8‑bit = **72‑bit** TDATA (pad to 96/128 with a width converter if needed).
  * **Option P (packed):** 9×1‑bit packed into **16‑bit** TDATA.
* **Control:** optional AXI‑Lite for `width`, `height`, `border_mode`.
* **Behavior:** maintains **3 line buffers** (length = image width) and **3 horizontal shift registers**; emits one 3×3 window per pixel once filled. Propagates `TLAST` at **end‑of‑frame**.
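
The two output formats can be prototyped in Python to agree on a bit layout before synthesizing either block. This is a sketch under an assumed layout (row-major order, `win[0][0]` in the least significant position); the source does not fix the ordering, so treat it as one possible convention and the function names as hypothetical.

```python
def pack_window_simple(win):
    """Option S: nine 8-bit pixels -> one 72-bit integer (win[0][0] in the low byte)."""
    word = 0
    for i, px in enumerate(p for row in win for p in row):
        word |= (px & 0xFF) << (8 * i)
    return word

def pack_window_bits(win):
    """Option P: nine 1-bit pixels -> low 9 bits of a 16-bit word."""
    word = 0
    for i, px in enumerate(p for row in win for p in row):
        word |= (px & 1) << i
    return word

def unpack_window_bits(word):
    """Inverse of pack_window_bits: 16-bit word -> 3x3 list of 0/1 values."""
    return [[(word >> (3 * r + c)) & 1 for c in range(3)] for r in range(3)]

win = [[0, 1, 0],
       [1, 1, 0],
       [0, 0, 1]]
print(hex(pack_window_bits(win)))   # 0x11a (bits 1, 3, 4 and 8 set)
```

Whatever layout is chosen, Block B must unpack with exactly the inverse mapping, so it is worth keeping both directions in one shared reference.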

**Border handling (choose and document):**

* **Zero/Dead padding:** outside the image is dead cells.
* **Replicate:** edge pixels extend.
* **Toroidal wrap:** edges wrap around (classic GoL on a torus).

Pick one policy and keep it consistent in both blocks and the Python reference.
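
The three policies differ only in what a lookup returns when the coordinate falls outside the frame. A small Python sketch of that lookup, usable in the software reference, follows; the mode names `"zero"`, `"replicate"` and `"wrap"` are illustrative and not taken from any library.

```python
def neighbor(grid, r, c, mode):
    """Return the cell at (r, c), applying `mode` when it lies outside the grid."""
    h, w = len(grid), len(grid[0])
    if 0 <= r < h and 0 <= c < w:
        return grid[r][c]
    if mode == "zero":        # outside the image is dead
        return 0
    if mode == "replicate":   # clamp to the nearest edge pixel
        return grid[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]
    if mode == "wrap":        # toroidal: indices wrap around
        return grid[r % h][c % w]
    raise ValueError(f"unknown border mode: {mode}")

grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
for mode in ("zero", "replicate", "wrap"):
    print(mode, neighbor(grid, -1, -1, mode), neighbor(grid, 1, 3, mode))
# zero 0 0
# replicate 1 6
# wrap 9 4
```

The grid here holds distinct values only to make the three policies distinguishable; a real GoL grid contains 0 and 1.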

**HLS tips:**

* Use `#pragma HLS PIPELINE II=1` in the inner loop.
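* Store line buffers in BRAM (`#pragma HLS BIND_STORAGE variable=LB0 type=ram_s2p impl=bram`, one pragma per buffer; older Vivado HLS releases use `#pragma HLS RESOURCE variable=LB0 core=RAM_S2P_BRAM` instead).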
* Partition the 3×3 window (`#pragma HLS ARRAY_PARTITION variable=win complete dim=0`).
* Form windows with a **do/while** pattern: read → update buffers → produce window → write.

### Block B — **GoL Update** (3×3 window → center pixel)

* **Input (AXI‑Stream):** one 3×3 window per beat (match format from Block A).
* **Output (AXI‑Stream):** updated **center** pixel per beat (binary).
* **Rule:** sum the 8 neighbors; apply standard GoL:

  * live if (sum==3) or (sum==2 and center==1), else 0.
* **HLS tips:**

  * Fully unroll the summation of the 8 neighbors.
  * Keep II=1 with simple combinational logic; register outputs.
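
The rule above is small enough to write as a golden model for the Block B testbench. A minimal Python version, assuming the window is a 3×3 list of 0/1 values with the center at `win[1][1]`:

```python
def gol_update(win):
    """Return the next state (0 or 1) of the center cell of a 3x3 window."""
    center = win[1][1]
    # Sum all nine cells, then remove the center to get the 8-neighbor count.
    neighbors = sum(win[r][c] for r in range(3) for c in range(3)) - center
    return 1 if neighbors == 3 or (neighbors == 2 and center == 1) else 0

birth    = [[1, 1, 0], [0, 0, 0], [0, 1, 0]]   # dead center, 3 neighbors
survive  = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # live center, 2 neighbors
crowded  = [[1, 1, 0], [1, 1, 0], [0, 1, 0]]   # live center, 4 neighbors
print(gol_update(birth), gol_update(survive), gol_update(crowded))  # 1 1 0
```

In HLS the same sum becomes a fully unrolled adder tree over eight 1‑bit inputs, which is why it fits comfortably in one pipeline stage.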

**Streaming chain:**

```
AXI DMA (MM2S) → WindowGen (Block A) → GoL Update (Block B) → AXI DMA (S2MM)
```

---

## 4) Lab steps

### Part A — CPU baseline (optional refresher)

1. Run the pure‑Python GoL for N iterations and record **ms/iteration** for a chosen size (e.g., 1024×1024). Save the resulting image/array.
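
If no baseline from an earlier lab is at hand, a minimal pure-Python reference with timing could look like the sketch below. It assumes dead (zero) borders and a list-of-lists grid; swap in whichever border policy the hardware uses. The function names are illustrative.

```python
import time

def gol_step(grid):
    """One generation on a list-of-lists grid with dead (zero) borders."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            n = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < h and 0 <= cc < w:
                        n += grid[rr][cc]
            out[r][c] = 1 if n == 3 or (n == 2 and grid[r][c]) else 0
    return out

def time_iterations(grid, iterations):
    """Run `iterations` generations; return (final_grid, ms_per_iteration)."""
    start = time.perf_counter()
    for _ in range(iterations):
        grid = gol_step(grid)
    elapsed = time.perf_counter() - start
    return grid, 1e3 * elapsed / iterations

# Vertical blinker in a 5x5 grid: period 2, so two steps restore it.
blinker = [[0] * 5 for _ in range(5)]
for r in (1, 2, 3):
    blinker[r][2] = 1
final, ms = time_iterations(blinker, 2)
print(final == blinker, f"{ms:.3f} ms/iteration")
```

For the 1024×1024 measurement, expect the pure-Python loop to be slow; a handful of iterations is enough to get a stable ms/iteration figure.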

### Part B — HLS: Block A (Window Generator)

1. **Create HLS Component** `gol_window_gen` (clock 100 MHz).
2. **Interfaces:**

   * `#pragma HLS INTERFACE axis port=stream_in` (pixels)
   * `#pragma HLS INTERFACE axis port=stream_out` (3×3 windows)
   * `#pragma HLS INTERFACE s_axilite port=width  bundle=CTRL` (optional)
   * `#pragma HLS INTERFACE s_axilite port=height bundle=CTRL`
   * `#pragma HLS INTERFACE s_axilite port=return bundle=CTRL`
3. **C Simulation:** feed a small frame with a known pattern; check that the first valid window appears at (row=2,col=2) and that `TLAST` is at end‑of‑frame.
4. **C Synthesis:** confirm **II=1**; inspect BRAM usage (three line buffers).
5. *(Optional)* C/RTL Co‑Sim to verify AXIS handshake.
6. **Export RTL → Package IP**.

### Part C — HLS: Block B (GoL Update)

1. **Create HLS Component** `gol_update` (clock 100 MHz).
2. **Interfaces:**

   * `#pragma HLS INTERFACE axis port=stream_in`  (3×3 windows)
   * `#pragma HLS INTERFACE axis port=stream_out` (pixels)
   * `#pragma HLS INTERFACE s_axilite port=return bundle=CTRL`
3. **C Simulation:** drive a few windows to confirm the rule; unit‑test edge cases.
4. **C Synthesis:** aim for **II=1**; unroll neighbor sum.
5. **Export RTL → Package IP**.
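
As a sanity check on the II=1 target for both blocks, the ideal frame time can be estimated from the pixel count and clock. The sketch below assumes no DMA or back-pressure stalls and ignores pipeline depth; `fill_cycles` is an optional allowance for the line-buffer fill, and all names are illustrative.

```python
def ideal_frame_ms(width, height, clock_hz=100e6, ii=1, fill_cycles=0):
    """Best-case time to stream one frame, in milliseconds."""
    cycles = width * height * ii + fill_cycles
    return 1e3 * cycles / clock_hz

print(f"{ideal_frame_ms(1024, 1024):.2f} ms")                            # 10.49 ms
print(f"{ideal_frame_ms(1024, 1024, fill_cycles=2 * 1024 + 3):.2f} ms")  # 10.51 ms
```

A measured frame time far above this bound usually points to II>1 in one of the blocks or to stalls on the DMA side rather than to the window logic itself.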

### Part D — Vivado Block Design

1. New project → **PYNQ‑Z2** board → **Block Design** `system`.
2. Add **ZYNQ7 PS** → **Run Block Automation** (FCLK0=100 MHz).
3. Add **AXI DMA** (MM2S + S2MM, SG disabled). Map DMA control to **PS M_AXI_GP0**.
4. Add repos for **`gol_window_gen`** and **`gol_update`**. Insert both IPs.
5. **AXI‑Stream chain:**

   * **MM2S M_AXIS** → `gol_window_gen` **s_axis**
   * `gol_window_gen` **m_axis** → `gol_update` **s_axis**
   * `gol_update` **m_axis** → **S2MM S_AXIS**
6. **Widths:**

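   * Block A's **72‑bit** window stream connects only to Block B, so the two blocks simply have to agree on that width. Where the pixel streams meet the DMA (MM2S → Block A, Block B → S2MM), match the DMA stream width to the pixel width or insert **AXIS Data Width Converter(s)**, and adjust the DMA transfer length accordingly.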
   * If you used **16/32/64/128‑bit** packed windows, keep a consistent width throughout the chain.
7. **HP port:** enable **S_AXI_HP0**; connect DMA MM2S/S2MM **M_AXI** → **HP0** via SmartConnect.
8. **Clocks/Resets:** all AXIS IPs and DMA on **FCLK_CLK0** with **Processor System Reset**.
9. **Address Editor:** assign addresses for DMA (and optional control regs on Block A/B if used).
10. **Validate Design** → **Generate HDL Wrapper** → **Synthesis → Implementation → Bitstream**.
11. Rename the artifacts to a common base name (e.g., `gol_pipe.bit/.hwh`).

### Part E — PYNQ run & measurement

1. Copy `gol_pipe.bit` and `gol_pipe.hwh` to the board, along with a Jupyter notebook.
2. Allocate input/output DMA buffers; stream a single frame through the pipeline.
3. Confirm correctness against a software iteration (same border policy!).
4. Measure **ms/frame** (or Msamples/s). Compare to the old 9‑reads design.
5. Run multiple frames to observe the steady‑state throughput (pipeline filled).
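
The steps above might be scripted roughly as follows. This is a sketch, not a tested notebook: the instance names `axi_dma_0`, `gol_window_gen_0` and `gol_update_0` are assumptions that depend on your block design, pixels are assumed to be 8‑bit, and the pipeline is assumed to return one output pixel per input pixel (which depends on the border policy).

```python
import time

def msamples_per_s(width, height, seconds):
    """Throughput in millions of pixels per second for one frame."""
    return width * height / seconds / 1e6

def run_frame(bitfile, frame):
    """Stream one uint8 frame through the overlay; return (result, seconds).

    Runs only on the board: pynq and numpy are imported here so the
    helper above stays usable on a PC.
    """
    import numpy as np
    from pynq import Overlay, allocate

    ol = Overlay(bitfile)                   # expects the .hwh next to the .bit
    dma = ol.axi_dma_0                      # assumed DMA instance name

    # If the HLS blocks expose an AXI-Lite CTRL port they must be started.
    # 0x81 = ap_start | auto_restart in the HLS control register at offset 0.
    # Write width/height first if Block A has them (offsets: generated driver header).
    # Skip this loop if the blocks were built without a CTRL port.
    for name in ("gol_window_gen_0", "gol_update_0"):   # assumed instance names
        getattr(ol, name).write(0x00, 0x81)

    in_buf = allocate(shape=frame.shape, dtype=np.uint8)
    out_buf = allocate(shape=frame.shape, dtype=np.uint8)
    in_buf[:] = frame

    start = time.perf_counter()
    dma.recvchannel.transfer(out_buf)       # arm S2MM before sending
    dma.sendchannel.transfer(in_buf)
    dma.sendchannel.wait()
    dma.recvchannel.wait()
    seconds = time.perf_counter() - start

    result = np.array(out_buf)              # copy out before freeing the buffers
    in_buf.freebuffer()
    out_buf.freebuffer()
    return result, seconds
```

On the board, a call such as `result, s = run_frame("gol_pipe.bit", frame)` followed by `msamples_per_s(1024, 1024, s)` gives the figure for the timing table. For the multi-frame measurement, load the overlay once and time only the transfer loop, since loading the bitstream dominates a single call.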

---

## 5) Deliverables

* Screenshot of the **Block Design** with DMA → WindowGen → Update.
* Timing table (old DDR‑heavy vs. new streaming) for at least one image size.
* Short explanation: **How the shift register works**, and **why throughput improved while per‑pixel latency didn’t**.
* Border policy documented; prove correctness on edges with a small test pattern.

---

## 6) Troubleshooting

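* **No output / stalls:** Ensure `recv.transfer()` (PYNQ `dma.recvchannel`) is issued **before** `send.transfer()` (`dma.sendchannel`), then call `wait()` on both channels. If Block A/B were built with `s_axilite port=return`, they must also be started over AXI‑Lite (set `ap_start`) before streaming.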
* **AXIS width mismatch:** Use **AXIS Data Width Converter** or re‑synthesize to a common width.
* **Border artifacts:** Align the border policy between WindowGen and the Python reference; verify TLAST timing.
* **II>1:** Check you only read one pixel per cycle; pipeline the inner loop; avoid BRAM read conflicts by using dual‑port BRAM for line buffers.

---

## 7) Optional extensions

* Bit‑packed 1‑bit pixel path (32/64‑bit DMA) and a **9‑way popcount** in `gol_update`.
* Add an **AXIS FIFO** between blocks to visualize backpressure and measure decoupling.
* Multi‑iteration streaming: feed the output back as the next input without PS involvement.
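
For the popcount extension, counting all nine bits (center included) changes the thresholds slightly: the cell is live next generation if the 9‑bit count is 3, or if it is 4 and the center is currently live. A Python sketch that checks this against the 8‑neighbor rule for all 512 windows is shown below; the bit layout (row-major, bit 4 = center) is an assumption.

```python
def gol_update_packed(word):
    """Next center state from a 9-bit window (row-major, bit 4 = center)."""
    center = (word >> 4) & 1
    total = bin(word & 0x1FF).count("1")    # 9-way popcount, center included
    return 1 if total == 3 or (total == 4 and center) else 0

def gol_update_reference(word):
    """Same result using the 8-neighbor formulation of the rule."""
    center = (word >> 4) & 1
    neighbors = bin(word & 0x1EF).count("1")  # mask 0x1EF clears the center bit
    return 1 if neighbors == 3 or (neighbors == 2 and center) else 0

mismatches = [w for w in range(512)
              if gol_update_packed(w) != gol_update_reference(w)]
print(len(mismatches))   # 0
```

Because the input space is only 512 values, the same exhaustive comparison is practical in the HLS C testbench as well.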
