From 2437799324ab0cb9a02923758aa286e21c4f727e Mon Sep 17 00:00:00 2001
From: Petr Kurapov
Date: Thu, 4 Jul 2024 13:47:47 +0000
Subject: [PATCH 1/4] Add GPU pipeline overview

---
 doc/GPUPipeline.md | 52 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)
 create mode 100644 doc/GPUPipeline.md

diff --git a/doc/GPUPipeline.md b/doc/GPUPipeline.md
new file mode 100644
index 000000000..07fc478ef
--- /dev/null
+++ b/doc/GPUPipeline.md
@@ -0,0 +1,52 @@
+# GPU Pipeline overview
+
+This is a living document for the GPU pipeline design. Its purpose is to keep the decision history and provide a guiding overview for development. We expect the design to change swiftly as we go, so this document mostly highlights guiding principles.
+
+## Initial state description
+
+The primary goal of the design is to ensure certain qualities of the final solution.
+The spirit of the design is to reuse existing parts, prefer upstream solutions, and target long-term support in conjunction with other devices.
+
+At the highest level, the pipeline can be split into three main stages:
+1. High-level platform-independent* transformations. These are to be shared with other flows (e.g., fusion).
+2. GPU-specific transformations. These are responsible for HW mapping and include everything until a SPIR-V module is emitted.
+3. Code generation. This is tailored to a particular platform and is performed by a backend.
+
+There are existing paths for each stage (sometimes multiple; the choice affects other parts). A short landscape description follows.
+
+### Landscape
+There are two primary ways of generating GPU target binary code, both going through IGC: the scalar and the vector paths.
+
+The scalar (aka SIMT) path relies on IGC's vectorization capabilities to map logical threads to SIMD lanes. Handling synchronization (e.g., cross-lane communication) is the main burden for this otherwise transformation-amenable representation.
+
+The vector (aka SIMD) path in IGC expects the IR to have a certain explicitly-vectorized form, primarily built via a set of intrinsics (VC-intrinsics). The main complexity of this approach for the pipeline is distributing data and compute across those vectors, and handling such a deviation from the lowering paths of other GPU types.
+
+Today, there are two main options to reach the low-level compiler:
+1. Lower to the SPIR-V dialect and serialize it (IMEX).
+2. Lower to LLVM IR and use the SPIR-V Translator (Triton).
+
+Both produce a SPIR-V module that IGC can consume.
+
+Going up the pipeline, the abstractions needed to express specific ISA semantics (e.g., DPAS and nd-load, required for an efficient contraction implementation) are covered by the XeGPU dialect. The dialect allows for both SIMT- and SIMD-style lowering.
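+
+As an illustration, a single contraction tile could look roughly like the sketch below. The op names follow the upstream XeGPU dialect; the shapes, types, and operand details are illustrative assumptions rather than authoritative syntax:
+
+```
+// Illustrative sketch only: load two tiles through nd-descriptors and
+// feed them to DPAS; shapes and syntax details are assumptions.
+%tA = xegpu.create_nd_tdesc %A[%i, %k] : memref<128x128xf16> -> !xegpu.tensor_desc<8x16xf16>
+%tB = xegpu.create_nd_tdesc %B[%k, %j] : memref<128x128xf16> -> !xegpu.tensor_desc<16x16xf16>
+%vA = xegpu.load_nd %tA : !xegpu.tensor_desc<8x16xf16> -> vector<8x16xf16>
+%vB = xegpu.load_nd %tB : !xegpu.tensor_desc<16x16xf16> -> vector<16x16xf16>
+%acc = xegpu.dpas %vA, %vB, %cAcc : vector<8x16xf16>, vector<16x16xf16>, vector<8x16xf32> -> vector<8x16xf32>
+```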
+
+TODO: gpu(x), linalg-to-scf, gpu-map-parallel-loops.
+
+### The path of least resistance
+The first milestone for the pipeline creation aims at taking what works now and putting it together.
+
+This includes:
+- Going through the XeGPU dialect
+- Using IMEX's XeGPU lowering
+- Adapting TPP's linalg-to-xegpu

From ec5a12e3133ae0afcfdda7677d2bf0702d14c00c Mon Sep 17 00:00:00 2001
From: Petr Kurapov
Date: Mon, 8 Jul 2024 12:04:44 +0000
Subject: [PATCH 2/4] Add integration section

---
 doc/GPUPipeline.md | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/doc/GPUPipeline.md b/doc/GPUPipeline.md
index 07fc478ef..247a18dd5 100644
--- a/doc/GPUPipeline.md
+++ b/doc/GPUPipeline.md
@@ -43,6 +43,35 @@ Going up the pipeline, the abstractions needed to express specific ISA semantics
 
 TODO: gpu(x), linalg-to-scf, gpu-map-parallel-loops.
 
+### Integration
+There are three major points of integration that affect the way the pipeline is built:
+1. Input representation.
+2. Memory management.
+3. Runtime interfaces.
+
+The primary input for our pipelines is linalg on tensors with named ops. These are pretty flexible (adding more named ops upstream is more-or-less straightforward) and cover a lot of ground.
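+
+For instance, a matmul enters the pipeline as a named op on tensor values; a minimal sketch (the function name and shapes here are arbitrary):
+
+```
+// A named linalg op on tensor values; shapes here are arbitrary.
+// Tiling, fusion, and device mapping all start from this level.
+func.func @matmul(%A: tensor<64x128xf16>, %B: tensor<128x256xf16>,
+                  %C: tensor<64x256xf16>) -> tensor<64x256xf16> {
+  %0 = linalg.matmul ins(%A, %B : tensor<64x128xf16>, tensor<128x256xf16>)
+                     outs(%C : tensor<64x256xf16>) -> tensor<64x256xf16>
+  return %0 : tensor<64x256xf16>
+}
+```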
+
+Memory management has to deal with weight caching, dynamic shapes, input/output handling, etc. Certain decisions on the compiler-user side lead to additional complications in the pipeline.
+For example, having to deal with 'logical' tensors for oneDNN imposes constraints on constant folding.
+
+The choice of the runtime interface defines how much additional logic should reside in the pipeline. For managed devices (such as a GPU) there are two distinct options:
+1. The compiler only emits a binary for the target device.
+2. The compiler emits a binary and a launch stub that interacts with an appropriate runtime.
+The latter provides more context, and thus potentially more opportunities for optimization. The former gives more control to the user and simplifies the pipeline.
+
 ### The path of least resistance
 The first milestone for the pipeline creation aims at taking what works now and putting it together.
 

From d34cd2155058e9cbe9c8e23fa7d85ac8924b15b8 Mon Sep 17 00:00:00 2001
From: Petr Kurapov
Date: Thu, 18 Jul 2024 10:41:51 +0000
Subject: [PATCH 3/4] Add the compilation-related decisions section

---
 doc/GPUPipeline.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/doc/GPUPipeline.md b/doc/GPUPipeline.md
index 247a18dd5..b509980b8 100644
--- a/doc/GPUPipeline.md
+++ b/doc/GPUPipeline.md
@@ -79,3 +79,8 @@ This includes:
 - Going through the XeGPU dialect
 - Using IMEX's XeGPU lowering
 - Adapting TPP's linalg-to-xegpu
+
+## Decisions
+
+### Compilation
+* Generate the code with kernel outlining. The motivation is that the compiler can take over some of the scheduling-related tasks. This implies that the interface with a framework needs to expose a synchronization mechanism (e.g., pass a GPU queue). This also affects kernel caching. JITed and non-JITed execution (the GPU module converted to serialized SPIR-V or to an actual target-specific binary) are similar cases from that point of view. Both will need to retrieve the artifact and pass it to the runtime call lowered from `gpu.launch`.

From ce66732d33c452247027d71affc3753b03c98997 Mon Sep 17 00:00:00 2001
From: Petr Kurapov
Date: Fri, 19 Jul 2024 11:16:09 +0000
Subject: [PATCH 4/4] Add GPU pipeline outlook from the kernel lowering perspective

---
 doc/GPUPipeline.md | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/doc/GPUPipeline.md b/doc/GPUPipeline.md
index b509980b8..b1d760fbe 100644
--- a/doc/GPUPipeline.md
+++ b/doc/GPUPipeline.md
@@ -84,3 +84,35 @@ This includes:
 
 ### Compilation
 * Generate the code with kernel outlining. The motivation is that the compiler can take over some of the scheduling-related tasks. This implies that the interface with a framework needs to expose a synchronization mechanism (e.g., pass a GPU queue). This also affects kernel caching. JITed and non-JITed execution (the GPU module converted to serialized SPIR-V or to an actual target-specific binary) are similar cases from that point of view. Both will need to retrieve the artifact and pass it to the runtime call lowered from `gpu.launch`.
+* To align with the future pipelines, the target representation for the GPU module is LLVM. The actual path to the binary will be hidden inside the `gpu-module-to-binary` implementation. From the kernel lowering perspective, the target pipeline looks like:
+
+```
+builtin.module(
+  gpu-kernel-outlining,
+  xe-attach-target{chip=xe_3 O=3},
+  gpu.module(convert-gpu-to-llvm-spv),
+  gpu-to-llvm,
+  gpu-module-to-binary
+)
+```
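+
+For reference, `gpu-kernel-outlining` moves the body of each `gpu.launch` into a `gpu.func` inside a `gpu.module` and rewrites the launch site into a `gpu.launch_func`; the resulting module is what `gpu-module-to-binary` serializes. A schematic sketch (operands and names are illustrative):
+
+```
+// Schematic only. Before outlining, the kernel body is a region of gpu.launch:
+gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
+           threads(%tx, %ty, %tz) in (%sx = %c16, %sy = %c1, %sz = %c1) {
+  // ... kernel body ...
+  gpu.terminator
+}
+
+// After outlining, the body becomes a kernel function referenced by symbol,
+// and the host side keeps only the launch call:
+gpu.module @kernels {
+  gpu.func @entry(%arg0: memref<?xf16>) kernel {
+    // ... kernel body ...
+    gpu.return
+  }
+}
+gpu.launch_func @kernels::@entry blocks in (%c1, %c1, %c1) threads in (%c16, %c1, %c1) args(%buf : memref<?xf16>)
+```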