From 0c4441a70d727cf8ea7e5d5bf7166b02d57f720e Mon Sep 17 00:00:00 2001 From: jkinsky Date: Mon, 27 Feb 2023 14:30:34 -0600 Subject: [PATCH] Update Host-Device Streaming using USM readme MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Changed sample name in readme to match name in sample.json file. Restructured to match new template, with exceptions for FPGA structure. Moved images into “assets” subfolder. Corrected some formatting. Rewrote and restructured some sections for clarity. --- .../simple_host_streaming/README.md | 434 +++++++++--------- .../{ => assets}/kernel-relaunch.png | Bin .../{ => assets}/kernel-rtt.png | Bin .../{ => assets}/multi-kernel-pipeline.png | Bin ...ulti-kernel-producer-consumer-pipeline.png | Bin .../multi-kernel-producer-consumer.png | Bin .../{ => assets}/multi-kernel.png | Bin .../{ => assets}/single-kernel.png | Bin 8 files changed, 226 insertions(+), 208 deletions(-) rename DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/{ => assets}/kernel-relaunch.png (100%) rename DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/{ => assets}/kernel-rtt.png (100%) rename DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/{ => assets}/multi-kernel-pipeline.png (100%) rename DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/{ => assets}/multi-kernel-producer-consumer-pipeline.png (100%) rename DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/{ => assets}/multi-kernel-producer-consumer.png (100%) rename DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/{ => assets}/multi-kernel.png (100%) rename DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/{ => assets}/single-kernel.png (100%) diff --git a/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/README.md b/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/README.md index d7b4758e9b..dadfc3535e 100755 --- a/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/README.md +++ b/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/README.md @@ -1,35 +1,48 @@ -# Simple Host-Device Streaming -This tutorial demonstrates how to use SYCL* Universal Shared Memory (USM) to stream data between the host and FPGA device and achieve low latency while maintaining throughput. +# `Host-Device Streaming using USM` Sample +This sample demonstrates how to use SYCL* Universal Shared Memory (USM) to stream data between the host and FPGA device and achieve low latency while maintaining throughput. -| Optimized for | Description ---- |--- -| OS | Linux* Ubuntu* 18.04/20.04
RHEL*/CentOS* 8
SUSE* 15
Windows* 10 -| Hardware | Intel® Agilex®, Arria® 10, and Stratix® 10 FPGAs -| Software | Intel® oneAPI DPC++/C++ Compiler -| What you will learn | How to achieve low-latency host-device streaming while maintaining throughput -| Time to complete | 45 minutes +| Area | Description +|:-- |:-- +| What you will learn | How to achieve low-latency host-device streaming while maintaining throughput +| Time to complete | 45 minutes +| Category | Code Optimization -> **Note**: Even though the Intel DPC++/C++ OneAPI compiler is enough to compile for emulation, generating reports and generating RTL, there are extra software requirements for the simulation flow and FPGA compiles. +## Purpose + +The purpose of this tutorial is to show you how to take advantage of SYCL USM host allocations and zero-copy host memory to implement a streaming host-device design with low latency and high throughput. Before starting this tutorial, we recommend first reviewing the **Pipes** (pipes) and **Zero-Copy Data Transfer** (zero_copy_data_transfer) FPGA tutorials, which will teach you more about SYCL pipes and SYCL USM and zero-copy data transfers, respectively. + +This tutorial includes three designs: + +1. An offload design that maximizes throughput with no optimization for latency (`DoWorkOffload` in `simple_host_streaming.cpp`). +2. A single-kernel design that uses the methods described below to achieve a much lower latency while maintaining throughput (`DoWorkSingleKernel` in `simple_host_streaming.cpp` and `single_kernel.hpp`). +3. A multi-kernel design that uses the methods described below to achieve a much lower latency while maintaining throughput (`DoWorkMultiKernel` in `simple_host_streaming.cpp` and `multi_kernel.hpp`). + +>**Note**: This tutorial demonstrates an implementation of host streaming that will be supplanted by better techniques in a future release. See the [Drawbacks and Future Work](#drawbacks-and-future-work) section below. + +## Prerequisites + +| Optimized for | Description +|:--- |:--- +| OS | Ubuntu* 18.04/20.04
RHEL*/CentOS* 8
SUSE* 15
Windows* 10 +| Hardware | Intel® Agilex®, Arria® 10, and Stratix® 10 FPGAs +| Software | Intel® oneAPI DPC++/C++ Compiler + + +> **Note**: Even though the Intel® oneAPI DPC++/C++ Compiler is enough to compile for emulation, generating reports, generating RTL, there are extra software requirements for the simulation flow and FPGA compiles. > -> For using the simulator flow, Intel® Quartus® Prime Pro Edition and one of the following simulators must be installed and accessible through your PATH: +> For using the simulator flow, you must have Intel® Quartus® Prime Pro Edition and one of the following simulators installed and accessible through your PATH: > - Questa*-Intel® FPGA Edition > - Questa*-Intel® FPGA Starter Edition -> - ModelSim® SE +> - ModelSim SE > > When using the hardware compile flow, Intel® Quartus® Prime Pro Edition must be installed and accessible through your PATH. -> -> :warning: Make sure you add the device files associated with the FPGA that you are targeting to your Intel® Quartus® Prime installation. - -*Notice: SYCL USM host allocations, used in this tutorial, are only supported on FPGA boards that have a USM capable BSP (e.g. the Intel® FPGA PAC D5005 with Intel Stratix® 10 SX with USM support: intel_s10sx_pac:pac_s10_usm) or when targeting an FPGA family/part number. - -> **Notice**: This tutorial demonstrates an implementation of host streaming that will be supplanted by better techniques in a future release. See the [Drawbacks and Future Work](#drawbacks-and-future-work)* +> **Warning** Make sure you add the device files associated with the FPGA that you are targeting to your Intel® Quartus® Prime installation. -## Prerequisites +>**Notice**: SYCL USM host allocations, which are used in this sample, are only supported on FPGA boards that have a USM capable BSP (For example, the Intel® FPGA PAC D5005 with Intel Stratix® 10 SX with USM support: **intel_s10sx_pac:pac_s10_usm**) or when targeting an FPGA family/part number. -This sample is part of the FPGA code samples. -It is categorized as a Tier 3 sample that demonstrates a design pattern. +This sample is part of the FPGA code samples. It is categorized as a Tier 3 sample that demonstrates a design pattern. ```mermaid flowchart LR @@ -47,17 +60,12 @@ flowchart LR ``` Find more information about how to navigate this part of the code samples in the [FPGA top-level README.md](/DirectProgramming/C++SYCL_FPGA/README.md). -You can also find more information about [troubleshooting build errors](/DirectProgramming/C++SYCL_FPGA/README.md#troubleshooting), [running the sample on the Intel® DevCloud](/DirectProgramming/C++SYCL_FPGA/README.md#build-and-run-the-samples-on-intel-devcloud-optional), [using Visual Studio Code with the code samples](/DirectProgramming/C++SYCL_FPGA/README.md#use-visual-studio-code-vs-code-optional), [links to selected documentation](/DirectProgramming/C++SYCL_FPGA/README.md#documentation), etc. +You can also find more information about [troubleshooting build errors](/DirectProgramming/C++SYCL_FPGA/README.md#troubleshooting), [running the sample on the Intel® DevCloud](/DirectProgramming/C++SYCL_FPGA/README.md#build-and-run-the-samples-on-intel-devcloud-optional), [using Visual Studio Code with the code samples](/DirectProgramming/C++SYCL_FPGA/README.md#use-visual-studio-code-vs-code-optional), [links to selected documentation](/DirectProgramming/C++SYCL_FPGA/README.md#documentation), and more. 
-## Purpose
-The purpose of this tutorial is to show you how to take advantage of SYCL USM host allocations and zero-copy host memory to implement a streaming host-device design with low latency and high throughput. Before starting this tutorial, we recommend first reviewing the **Pipes** (pipes) and **Zero-Copy Data Transfer** (zero_copy_data_transfer) FPGA tutorials, which will teach you more about SYCL pipes and SYCL USM and zero-copy data transfers, respectively.
-
-This tutorial includes three designs:
-1. An offload design that maximizes throughput with no optimization for latency (`DoWorkOffload` in *simple_host_streaming.cpp*)
-2. A single-kernel design that uses the methods described below to achieve a much lower latency while maintaining throughput (`DoWorkSingleKernel` in *simple_host_streaming.cpp* and *single_kernel.hpp*)
-3. A multi-kernel design that uses the methods described below to achieve a much lower latency while maintaining throughput (`DoWorkMultiKernel` in *simple_host_streaming.cpp* and *multi_kernel.hpp*)
+## Key Implementation Details

 ### Offload Processing
+
 Typical SYCL designs perform _offload processing_. All of the input data is prepared by the CPU, and then transferred to the device (in our case, an FPGA). Kernels are started on the device to process the data. When the kernels finish, the CPU copies the output data from the FPGA back to its memory. Memory synchronization is achieved on the host after the device kernel signals its completion.

 Offload processing achieves excellent throughput when the memory transfers and kernel computation are performed on large data sets, as the CPU's kernel management overhead is minimized. Data transfer overhead can be concealed using *double buffering* or *n-way buffering* to maximize kernel throughput. However, a significant shortcoming of this design pattern is latency. The coarse-grain synchronization of waiting for the entire set of data to be processed results in a latency that is equal to the processing time of the entire data set.
@@ -65,63 +73,62 @@ Offload processing achieves excellent throughput when the memory transfers and k
 This tutorial will demonstrate a simple host-device streaming design that reduces latency and maintains throughput.

 ### Host-Device Streaming Processing
-The method for achieving lower latency between the host and device is to break data set into smaller chunks and, instead of enqueueing a single long-running kernel, launch a set of shorter-running kernels. Together, these shorter-running kernels process the data in smaller batches. As memory synchronization occurs upon kernel completion, this strategy makes the output data available to the CPU in a more granular way. This is illustrated in the figure below. The red lines show the time when the first set of data is available in the host.
-![](single-kernel.png)
+The method for achieving lower latency between the host and device is to break the data set into smaller chunks and, instead of enqueuing a single long-running kernel, launch a set of shorter-running kernels. Together, these shorter-running kernels process the data in smaller batches. As memory synchronization occurs upon kernel completion, this strategy makes the output data available to the CPU in a more granular way. This is illustrated in the figure below. The red lines show the time when the first set of data is available in the host.
+
+![](assets/single-kernel.png)

 In the streaming version, the first piece of data is available in the host earlier than in the offload version.
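+
+As a concrete, deliberately simplified illustration of this chunked-launch pattern, the sketch below launches one short `single_task` kernel per chunk over USM host allocations and waits on each chunk's event. It is hypothetical code, not the tutorial's implementation: the kernel name `ChunkKernel`, the helper `StreamInChunks`, and the doubling computation are placeholders, and `in`/`out` are assumed to point to memory returned by `sycl::malloc_host`.
+
+```c++
+#include <sycl/sycl.hpp>
+
+#include <cstddef>
+#include <vector>
+
+class ChunkKernel;  // hypothetical kernel name
+
+// Process 'total_size' elements in 'chunks' pieces: launch one short-running
+// kernel per chunk instead of a single long-running kernel over all the data.
+void StreamInChunks(sycl::queue &q, float *in, float *out, size_t total_size,
+                    size_t chunks) {
+  const size_t chunk_size = total_size / chunks;
+  std::vector<sycl::event> events(chunks);
+
+  for (size_t c = 0; c < chunks; c++) {
+    float *in_chunk = in + c * chunk_size;    // USM host pointers: the kernel
+    float *out_chunk = out + c * chunk_size;  // accesses host memory directly
+    events[c] = q.single_task<ChunkKernel>([=] {
+      for (size_t i = 0; i < chunk_size; i++) {
+        out_chunk[i] = in_chunk[i] * 2.0f;  // stand-in for the real computation
+      }
+    });
+  }
+
+  // The host can consume chunk 'c' as soon as events[c] completes, rather
+  // than waiting for the entire data set to be processed.
+  for (size_t c = 0; c < chunks; c++) {
+    events[c].wait();
+    // ... hand chunk 'c' of 'out' to the rest of the application here ...
+  }
+}
+```
+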
How much earlier? Say we have `total_size` elements of data to process and we break the computation into `chunks` chunks of size `chunk_size=total_size/chunks` (as is the case in the figure above). Then, in a perfect world, the streaming design will achieve a latency that is `chunks` times better than the offload version.

 #### Setting the `chunk_size`
+
 Why not set `chunk_size` to 1 (i.e. `chunks=total_size`) to minimize the latency? In the figure above, you may notice small gaps between the kernels in the streaming design (e.g. between K0 and K1). This is caused by the overhead of launching kernels and detecting kernel completion on the host. These gaps increase the total processing time and therefore decrease the throughput of the design (i.e. compared to the offload design, it takes more time to process the same amount of data). If these gaps are negligible, then the throughput is negligibly affected.

 In the streaming design, the choice of the `chunk_size` is thus a tradeoff between latency (a smaller chunk size results in a smaller latency) and throughput (a smaller chunk size increases the relevance of the inter-kernel latency).

-#### Lower bounds on latency
-Lowering the `chunk_size` can reduce latency, sometimes at the expense of throughput. However, even if you aren't concerned with throughput, there still exists a lower-bound on the latency of a kernel. In the figure below, tlaunch is the time for a kernel launch signal to go from the host to the device, tkernel is the time for the kernel to execute on the device, and tfinish is the time for the finished signal to go from the device to the host. Even if we set `chunk_size` to 0 (i.e. launch a kernel that does *nothing*) and therefore tkernel ~= 0, the latency is still tlaunch + tfinish. In other words, the lower bound on kernel latency is the time needed for the "start "signal to get from the host to the device and for the "finished" signal to get from the device to the host.
+#### Lower Bounds on Latency

-![](kernel-rtt.png)
+Lowering the `chunk_size` can reduce latency, sometimes at the expense of throughput. However, even if you aren't concerned with throughput, there still exists a lower bound on the latency of a kernel. In the figure below, t<sub>launch</sub> is the time for a kernel launch signal to go from the host to the device, t<sub>kernel</sub> is the time for the kernel to execute on the device, and t<sub>finish</sub> is the time for the finished signal to go from the device to the host. Even if we set `chunk_size` to 0 (i.e. launch a kernel that does *nothing*) and therefore t<sub>kernel</sub> ~= 0, the latency is still t<sub>launch</sub> + t<sub>finish</sub>. In other words, the lower bound on kernel latency is the time needed for the **start** signal to get from the host to the device and for the **finished** signal to get from the device to the host.

-In the previous section, we discussed how gaps between kernel invocations can degrade throughput. In the figure above, there appears to be a minimum tlaunch + tfinish gap between kernel invocations. This is illustrated as the *Naive Relaunch* timeline in the figure below. Fortunately, this overhead is circumvented by an automatic runtime kernel launch scheme that buffers kernel arguments on the device before the previous kernel finishes. This enables kernels queue **on the device** and to begin execution without waiting for the previous kernel's "finished" to propagate back to the host. We call this *Fast Kernel Relaunch* and it is also illustrated in the figure below.
While the details of fast kernel relaunch are beyond the scope of this tutorial, if suffices to understand that it reduces the gap between kernel invocations and allows you to achieve lower latency while maintaining throughput.
+![](assets/kernel-rtt.png)

-![](kernel-relaunch.png)
+In the previous section, we discussed how gaps between kernel invocations can degrade throughput. In the figure above, there appears to be a minimum t<sub>launch</sub> + t<sub>finish</sub> gap between kernel invocations. This is illustrated as the *Naive Relaunch* timeline in the figure below. Fortunately, this overhead is circumvented by an automatic runtime kernel launch scheme that buffers kernel arguments on the device before the previous kernel finishes. This enables kernels to queue **on the device** and begin execution without waiting for the previous kernel's "finished" signal to propagate back to the host. We call this *Fast Kernel Relaunch* and it is also illustrated in the figure below. While the details of fast kernel relaunch are beyond the scope of this tutorial, it is enough to understand that it reduces the gap between kernel invocations and allows you to achieve lower latency without reducing throughput.
+
+![](assets/kernel-relaunch.png)
+
+#### Multiple Kernel Pipeline

-#### Multiple kernel pipeline
 More complicated FPGA designs often instantiate multiple kernels connected by SYCL pipes (for examples, see the FPGA Reference Designs). Suppose you have a kernel system of `N` kernels connected by pipes, as in the figure below.

-![](multi-kernel-pipeline.png)
+![](assets/multi-kernel-pipeline.png)

 With the goal of achieving lower latency, you use the technique described in the previous section to launch multiple invocations of your `N` kernels to process chunks of data. This would give you a timeline like the figure below.

-![](multi-kernel.png)
+![](assets/multi-kernel.png)

 Notice the gaps between the start times of the `N` kernels for a single `chunk`. This is the t<sub>launch</sub> time discussed in the previous section. However, the multi-kernel design introduces a potential new lower bound on the latency for a single chunk because processing a single chunk of data requires launching `N` kernels, which takes `N` x t<sub>launch</sub>. If `N` (the number of kernels in your system) is sufficiently large, this will limit your achievable latency.

 For designs with `N > 2`, a different approach is recommended. The idea is to enqueue your system of `N` kernels **once**, and to introduce Producer (`P`) and Consumer (`C`) kernels to handle the production and consumption of data from and to the host, respectively. This method is illustrated in the figure below. The Producer streams data from the host and presents it to the kernel system through a SYCL pipe. The output of the kernel system is consumed by the Consumer and written back into host memory.

-![](multi-kernel-producer-consumer-pipeline.png)
+![](assets/multi-kernel-producer-consumer-pipeline.png)

 To achieve low latency, we still process the data in chunks, but instead of having to enqueue `N` kernels for each chunk, we only have to enqueue a single Producer and Consumer kernel per chunk. This enables us to reduce the lower bound on the latency to MAX(2 x t<sub>launch</sub>, t<sub>launch</sub> + t<sub>finish</sub>). Notice that this lower bound does not depend on the number of kernels in our system (`N`). **This method should only be used when `N > 2`**. FPGA area is sacrificed to implement the Producer and Consumer and their pipes in order to achieve lower overall processing latency.
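+
+The following sketch shows what the per-chunk Producer and Consumer might look like. It is hypothetical code, not the interface of `multi_kernel.hpp`: the pipe types, kernel names, and the `SubmitChunk` helper are illustrative, and the `N`-kernel system between `InPipe` and `OutPipe` is assumed to be enqueued once, elsewhere, for the entire run.
+
+```c++
+#include <sycl/sycl.hpp>
+#include <sycl/ext/intel/fpga_extensions.hpp>
+
+#include <cstddef>
+#include <utility>
+
+class ProducerKernel;  // hypothetical kernel names
+class ConsumerKernel;
+
+// Pipes into and out of the (separately enqueued) N-kernel system.
+using InPipe = sycl::ext::intel::pipe<class InPipeID, float>;
+using OutPipe = sycl::ext::intel::pipe<class OutPipeID, float>;
+
+// Submit a Producer and a Consumer for one chunk of 'chunk_size' elements.
+// 'in_chunk' and 'out_chunk' are USM host pointers for this chunk.
+std::pair<sycl::event, sycl::event> SubmitChunk(sycl::queue &q, float *in_chunk,
+                                                float *out_chunk,
+                                                size_t chunk_size) {
+  auto produce = q.single_task<ProducerKernel>([=] {
+    for (size_t i = 0; i < chunk_size; i++) {
+      InPipe::write(in_chunk[i]);  // host memory -> kernel system
+    }
+  });
+  auto consume = q.single_task<ConsumerKernel>([=] {
+    for (size_t i = 0; i < chunk_size; i++) {
+      out_chunk[i] = OutPipe::read();  // kernel system -> host memory
+    }
+  });
+  // Waiting on 'consume' tells the host that this chunk's output is ready.
+  return {produce, consume};
+}
+```
+
+Because the same two small kernels are relaunched for every chunk, the fast kernel relaunch behavior described earlier can hide most of the per-chunk launch overhead.
+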
-![](multi-kernel-producer-consumer.png) +![](assets/multi-kernel-producer-consumer.png) ### Drawbacks and Future Work + Fundamentally, the ability to stream data between the host and device is built around SYCL USM host allocations. The underlying problem is how to efficiently synchronize between the host and device to signal that _some_ data is ready to be processed, or has been processed. In other words, how does the host signal to the device that some data is ready to be processed? Conversely, how does the device signal to the host that some data is done being processed? One method to achieve this signaling is to use the start of a kernel to signal to the device that data is ready to be processed, and the end of a kernel to signal to the host that data has been processed. This is the approach taken in this tutorial. However, this method has two notable drawbacks. First, the latency to start and end kernels is high (as of now, roughly 50us). To maintain high throughput, we must size the `chunk_size` sufficiently large to hide the inter-kernel latency, resulting in a latency increase. Second, the programming model to achieve this performance is non-trivial as you must intelligently manage the SYCL device queue. We are currently working on an API and tutorial to address both of these drawbacks. This API will decrease the latency to synchronize between the host and device and therefore enable lower latency with maintained throughput. It will also dramatically improve the usability of the programming model to achieve this performance. -## Key Concepts -* Runtime kernel management. -* Host-device streaming designs. +## Build the `Host-Device Streaming using USM` Sample -## Building the `simple_host_streaming` Tutorial - -> **Note**: When working with the command-line interface (CLI), you should configure the oneAPI toolkits using environment variables. -> Set up your CLI environment by sourcing the `setvars` script located in the root of your oneAPI installation every time you open a new terminal window. -> This practice ensures that your compiler, libraries, and tools are ready for development. +>**Note**: When working with the command-line interface (CLI), you should configure the oneAPI toolkits using environment variables. Set up your CLI environment by sourcing the `setvars` script in the root of your oneAPI installation every time you open a new terminal window. This practice ensures that your compiler, libraries, and tools are ready for development. > > Linux*: > - For system wide installations: `. /opt/intel/oneapi/setvars.sh` @@ -134,171 +141,182 @@ We are currently working on an API and tutorial to address both of these drawbac > > For more information on configuring environment variables, see [Use the setvars Script with Linux* or macOS*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-linux-or-macos.html) or [Use the setvars Script with Windows*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-windows.html). -### On a Linux* System - -1. Generate the `Makefile` by running `cmake`. - ``` - mkdir build - cd build - ``` - To compile for the default target (the Agilex® device family), run `cmake` using the command: - ``` - cmake .. - ``` - - > **Note**: You can change the default target by using the command: - > ``` - > cmake .. 
-DFPGA_DEVICE= - > ``` - > - > Alternatively, you can target an explicit FPGA board variant and BSP by using the following command: - > ``` - > cmake .. -DFPGA_DEVICE=: - > ``` - > - > You will only be able to run an executable on the FPGA if you specified a BSP. - -2. Compile the design through the generated `Makefile`. The following build targets are provided, matching the recommended development flow: - - * Compile for emulation (fast compile time, targets emulated FPGA device): - ``` - make fpga_emu - ``` - * Compile for simulation (medium compile time, targets simulated FPGA device): - ``` - make fpga_sim - ``` - * Generate the optimization report: - ``` - make report - ``` - * Compile for FPGA hardware (longer compile time, targets FPGA device): - ``` - make fpga - ``` - -### On a Windows* System -1. Generate the `Makefile` by running `cmake`. - ``` - mkdir build - cd build - ``` - To compile for the default target (the Agilex® device family), run `cmake` using the command: - ``` - cmake -G "NMake Makefiles" .. - ``` - > **Note**: You can change the default target by using the command: - > ``` - > cmake -G "NMake Makefiles" .. -DFPGA_DEVICE= - > ``` - > - > Alternatively, you can target an explicit FPGA board variant and BSP by using the following command: - > ``` - > cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=: - > ``` - > - > You will only be able to run an executable on the FPGA if you specified a BSP. - -2. Compile the design through the generated `Makefile`. The following build targets are provided, matching the recommended development flow: - - * Compile for emulation (fast compile time, targets emulated FPGA device): - ``` - nmake fpga_emu - ``` - * Compile for simulation (medium compile time, targets simulated FPGA device): - ``` - nmake fpga_sim - ``` - * Generate the optimization report: - ``` - nmake report - ``` - * Compile for FPGA hardware (longer compile time, targets FPGA device): - ``` - nmake fpga - ``` + +### On Linux* + +1. Change to the sample directory. +2. Build the program for Intel® Agilex® device family, which is the default. + ``` + mkdir build + cd build + cmake .. + ``` + > **Note**: You can change the default target by using the command: + > ``` + > cmake .. -DFPGA_DEVICE= + > ``` + > + > Alternatively, you can target an explicit FPGA board variant and BSP by using the following command: + > ``` + > cmake .. -DFPGA_DEVICE=: + > ``` + > + > You will only be able to run an executable on the FPGA if you specified a BSP. + +3. Compile the design. (The provided targets match the recommended development flow.) + + 1. Compile for emulation (fast compile time, targets emulated FPGA device): + ``` + make fpga_emu + ``` + 2. Generate the optimization report: + ``` + make report + ``` + The report resides at `simple_host_streaming_report.prj/reports/report.html`. + + 3. Compile for simulation (fast compile time, targets simulated FPGA device, reduced data size): + ``` + make fpga_sim + ``` + 4. Compile for FPGA hardware (longer compile time, targets FPGA device): + ``` + make fpga + ``` + +### On Windows* + +1. Change to the sample directory. +2. Build the program for the Intel® Agilex® device family, which is the default. + ``` + mkdir build + cd build + cmake -G "NMake Makefiles" .. + ``` + > **Note**: You can change the default target by using the command: + > ``` + > cmake -G "NMake Makefiles" .. 
-DFPGA_DEVICE= + > ``` + > + > Alternatively, you can target an explicit FPGA board variant and BSP by using the following command: + > ``` + > cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=: + > ``` + > + > You will only be able to run an executable on the FPGA if you specified a BSP. + +3. Compile the design. (The provided targets match the recommended development flow.) + + 1. Compile for emulation (fast compile time, targets emulated FPGA device): + ``` + nmake fpga_emu + ``` + 2. Generate the optimization report: + ``` + nmake report + ``` + The report resides at `simple_host_streaming_report.prj.a/reports/report.html`. + + 3. Compile for simulation (fast compile time, targets simulated FPGA device, reduced data size): + ``` + nmake fpga_sim + ``` + 4. Compile for FPGA hardware (longer compile time, targets FPGA device): + ``` + nmake fpga + ``` > **Note**: If you encounter any issues with long paths when compiling under Windows*, you may have to create your ‘build’ directory in a shorter path, for example c:\samples\build. You can then run cmake from that directory, and provide cmake with the full path to your sample directory. -## Examining the Reports -Locate `report.html` in the `simple_host_streaming_report.prj/reports/` directory. Open the report in any of Chrome*, Firefox*, Edge*, or Internet Explorer*. - -## Running the Sample - -1. Run the sample on the FPGA emulator (the kernel executes on the CPU): - ``` - ./simple_host_streaming.fpga_emu (Linux) - simple_host_streaming.fpga_emu.exe (Windows) - ``` -2. Run the sample on the FPGA simulator: - * On Linux - ``` - CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 ./simple_host_streaming.fpga_sim - ``` - * On Windows - ``` - set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 - simple_host_streaming.fpga_sim.exe - set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA= - ``` -3. Run the sample on the FPGA device (only if you ran `cmake` with `-DFPGA_DEVICE=:`): - ``` - ./simple_host_streaming.fpga (Linux) - simple_host_streaming.fpga.exe (Windows) - ``` - -### Example of Output -You should see the following output in the console: - -1. When running on the FPGA emulator - ``` - # Chunks: 16 - Chunk count: 256 - Total count: 4096 - Iterations: 1 - - Running the basic offload kernel - Offload average latency: 0.1892 ms - Offload average throughput: 1385.3707 MB/s - - Running the latency optimized single-kernel design - Single-kernel average latency: 0.0447 ms - Single-kernel average throughput: 188.3596 MB/s - - Running the latency optimized multi-kernel design - Multi-kernel average latency: 0.2674 ms - Multi-kernel average throughput: 39.0021 MB/s - - PASSED - ``` - > **Note**: The FPGA emulator does not accurately represent the performance (throughput or latency) of the kernels. - -2. When running on the Intel® FPGA PAC D5005 with Intel Stratix® 10 SX with USM support: - ``` - # Chunks: 512 - Chunk count: 32768 - Total count: 16777216 - Iterations: 4 - - Running the basic offload kernel - Offload average latency: 99.6709 ms - Offload average throughput: 107772.8673 MB/s - - Running the latency optimized single-kernel design - Single-kernel average latency: 0.2109 ms - Single-kernel average throughput: 10689.9578 MB/s - - Running the latency optimized multi-kernel design - Multi-kernel average latency: 0.2431 ms - Multi-kernel average throughput: 10674.7123 MB/s - - PASSED - ``` - > **Note**: The experimentally measured bandwidth of the PCIe is ~11 GB/s (bi-directional, ~22 MB/s total). 
The FPGA device performance numbers above show that the offload, single-kernel, and multi-kernel designs are all able to saturate the PCIe bandwidth (since this design reads and writes over PCIe, a design throughput of 10.7 GB/s uses 10.7 x 2 = 21.4 GB/s of total PCIe bandwidth). However, the single-kernel and multi-kernel designs saturate the PCIe bandwidth with a latency that is ~473x lower than the offload kernel.
+## Run the `Host-Device Streaming using USM` Sample
+
+### On Linux
+
+1. Run the sample on the FPGA emulator (the kernel executes on the CPU).
+   ```
+   ./simple_host_streaming.fpga_emu
+   ```
+2. Run the sample on the FPGA simulator.
+   ```
+   CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 ./simple_host_streaming.fpga_sim
+   ```
+3. Run the sample on the FPGA device (only if you ran `cmake` with `-DFPGA_DEVICE=:`).
+   ```
+   ./simple_host_streaming.fpga
+   ```
+
+### On Windows
+
+1. Run the sample on the FPGA emulator (the kernel executes on the CPU).
+   ```
+   simple_host_streaming.fpga_emu.exe
+   ```
+2. Run the sample on the FPGA simulator.
+   ```
+   set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1
+   simple_host_streaming.fpga_sim.exe
+   set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=
+   ```
+3. Run the sample on the FPGA device (only if you ran `cmake` with `-DFPGA_DEVICE=:`).
+   ```
+   simple_host_streaming.fpga.exe
+   ```
+
+## Example Output
+
+### Example Output for FPGA Emulator
+
+```
+# Chunks: 16
+Chunk count: 256
+Total count: 4096
+Iterations: 1
+
+Running the basic offload kernel
+Offload average latency: 0.1892 ms
+Offload average throughput: 1385.3707 MB/s
+
+Running the latency optimized single-kernel design
+Single-kernel average latency: 0.0447 ms
+Single-kernel average throughput: 188.3596 MB/s
+
+Running the latency optimized multi-kernel design
+Multi-kernel average latency: 0.2674 ms
+Multi-kernel average throughput: 39.0021 MB/s
+
+PASSED
+```
+
+>**Note**: The FPGA emulator does not accurately represent the performance (throughput or latency) of the kernels.
+
+### Example Output for Intel® FPGA PAC D5005 with Intel Stratix® 10 SX with USM Support
+
+```
+# Chunks: 512
+Chunk count: 32768
+Total count: 16777216
+Iterations: 4
+
+Running the basic offload kernel
+Offload average latency: 99.6709 ms
+Offload average throughput: 107772.8673 MB/s
+
+Running the latency optimized single-kernel design
+Single-kernel average latency: 0.2109 ms
+Single-kernel average throughput: 10689.9578 MB/s
+
+Running the latency optimized multi-kernel design
+Multi-kernel average latency: 0.2431 ms
+Multi-kernel average throughput: 10674.7123 MB/s
+
+PASSED
+```
+
+>**Note**: The experimentally measured bandwidth of the PCIe link is ~11 GB/s in each direction (~22 GB/s total). The FPGA device performance numbers above show that the offload, single-kernel, and multi-kernel designs are all able to saturate the PCIe bandwidth (since this design reads and writes over PCIe, a design throughput of 10.7 GB/s uses 10.7 x 2 = 21.4 GB/s of total PCIe bandwidth). However, the single-kernel and multi-kernel designs saturate the PCIe bandwidth with a latency that is ~473x lower than the offload kernel.

 ## License
 Code samples are licensed under the MIT license. See [License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details.

-Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt).
+Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt). \ No newline at end of file diff --git a/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/kernel-relaunch.png b/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/kernel-relaunch.png similarity index 100% rename from DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/kernel-relaunch.png rename to DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/kernel-relaunch.png diff --git a/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/kernel-rtt.png b/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/kernel-rtt.png similarity index 100% rename from DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/kernel-rtt.png rename to DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/kernel-rtt.png diff --git a/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/multi-kernel-pipeline.png b/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/multi-kernel-pipeline.png similarity index 100% rename from DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/multi-kernel-pipeline.png rename to DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/multi-kernel-pipeline.png diff --git a/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/multi-kernel-producer-consumer-pipeline.png b/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/multi-kernel-producer-consumer-pipeline.png similarity index 100% rename from DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/multi-kernel-producer-consumer-pipeline.png rename to DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/multi-kernel-producer-consumer-pipeline.png diff --git a/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/multi-kernel-producer-consumer.png b/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/multi-kernel-producer-consumer.png similarity index 100% rename from DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/multi-kernel-producer-consumer.png rename to DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/multi-kernel-producer-consumer.png diff --git a/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/multi-kernel.png b/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/multi-kernel.png similarity index 100% rename from DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/multi-kernel.png rename to DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/multi-kernel.png diff --git a/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/single-kernel.png b/DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/single-kernel.png similarity index 100% rename from DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/single-kernel.png rename to 
DirectProgramming/C++SYCL_FPGA/Tutorials/DesignPatterns/simple_host_streaming/assets/single-kernel.png