openvinotoolkit · wangleis · May 17, 2024 · May 11, 2024
@@ -18,29 +18,36 @@ Execution via the heterogeneous mode can be divided into two independent steps:
 
 1. Setting hardware affinity to operations (`ov::Core::query_model <https://docs.openvino.ai/2024/api/c_cpp_api/classov_1_1_core.html#doxid-classov-1-1-core-1acdf8e64824fe4cf147c3b52ab32c1aab>`__ is used internally by the Hetero device).
 2. Compiling a model to the Heterogeneous device assumes splitting the model to parts, compiling them on the specified devices (via `ov::device::priorities <https://docs.openvino.ai/2024/api/c_cpp_api/structov_1_1device_1_1_priorities.html>`__), and executing them in the Heterogeneous mode. The model is split to subgraphs in accordance with the affinities, where a set of connected operations with the same affinity is to be a dedicated subgraph. Each subgraph is compiled on a dedicated device and multiple `ov::CompiledModel <https://docs.openvino.ai/2024/api/c_cpp_api/classov_1_1_compiled_model.html#doxid-classov-1-1-compiled-model>`__ objects are made, which are connected via automatically allocated intermediate tensors.
+
+   If you enable pipeline parallelism (via ``ov::hint::model_distribution_policy``), the model is split into multiple stages, and each stage is assigned to a different device. The output of one stage is fed as input to the next stage.
 
 These two steps are not interconnected and affinities can be set in one of two ways, used separately or in combination (as described below): in the ``manual`` or the ``automatic`` mode.
 
 Defining and Configuring the Hetero Device
-++++++++++++++++++++++++++++++++++++++++++
+##########################################
 
 Following the OpenVINO™ naming convention, the Hetero execution plugin is assigned the label of ``"HETERO".`` It may be defined with no additional parameters, resulting in defaults being used, or configured further with the following setup options:
 
 
-+-------------------------------+--------------------------------------------+-----------------------------------------------------------+
-| Parameter Name & C++ property | Property values                            | Description                                               |
-+===============================+============================================+===========================================================+
-| | "MULTI_DEVICE_PRIORITIES"   | | HETERO: <device names>                   | | Lists the devices available for selection.              |
-| | ``ov::device::priorities``  | | comma-separated, no spaces               | | The device sequence will be taken as priority           |
-| |                             | |                                          | | from high to low.                                       |
-+-------------------------------+--------------------------------------------+-----------------------------------------------------------+
-
++--------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------+
+| Parameter Name & C++ property              | Property values                                             | Description                                               |
++============================================+=============================================================+===========================================================+
+| | "MULTI_DEVICE_PRIORITIES"                | | ``HETERO: <device names>``                                | | Lists the devices available for selection.              |
+| | ``ov::device::priorities``               | |                                                           | | The device sequence will be taken as priority           |
+| |                                          | | comma-separated, no spaces                                | | from high to low.                                       |
++--------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------+
+| |                                          | | ``empty``                                                 | | Model distribution policy for inference with            |
+| | "MODEL_DISTRIBUTION_POLICY"              | | ``ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL``  | | multiple devices. Distribute model to multiple          |
+| |                                          | |                                                           | | devices during model compilation.                       |
+| | ``ov::hint::model_distribution_policy``  | | HETERO supports only PIPELINE_PARALLEL. The default value | |                                                         |
+| |                                          | | is empty                                                  | |                                                         |
++--------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------+
 
 Manual and Automatic modes for assigning affinities
-+++++++++++++++++++++++++++++++++++++++++++++++++++
+###################################################
 
 The Manual Mode
---------------------
+++++++++++++++++++
 
 It assumes setting affinities explicitly for all operations in the model using `ov::Node::get_rt_info <https://docs.openvino.ai/2024/api/c_cpp_api/classov_1_1_node.html#doxid-classov-1-1-node-1a6941c753af92828d842297b74df1c45a>`__ with the ``"affinity"`` key.
 
@@ -66,7 +73,10 @@ Randomly selecting operations and setting affinities may lead to decrease in mod
 
 
 The Automatic Mode
---------------------
+++++++++++++++++++
+
+Without pipeline parallelism
+-----------------------------
 
 It decides automatically which operation is assigned to which device according to the support from dedicated devices (``GPU``, ``CPU``, etc.) and query model step is called implicitly by Hetero device during model compilation.
 
@@ -90,9 +100,33 @@ It does not take into account device peculiarities such as the inability to infe
          :language: cpp
          :fragment: [compile_model]
 
+Pipeline parallelism
+------------------------
+
+Pipeline parallelism is set via ``ov::hint::model_distribution_policy``. This mode is an efficient technique for inferring large models on multiple devices. The model is split into multiple stages, and each stage is assigned to a different device (``dGPU``, ``iGPU``, ``CPU``, etc.). This mode assigns operations to devices so that the stages can be executed in sequence while minimizing the amount of data transferred between devices.
+
+For large models that do not fit on the first-priority device alone, pipeline parallelism places certain parts of the model on different devices so that each device has enough memory to infer its operations. Operations that do not fit on the first-priority device are assigned to the next-priority device.
+
+
+.. tab-set::
+
+   .. tab-item:: Python
+      :sync: py
+
+      .. doxygensnippet:: docs/snippets/ov_hetero.py
+         :language: Python
+         :fragment: [set_pipeline_parallelism]
+
+   .. tab-item:: C++
+      :sync: cpp
+
+      .. doxygensnippet:: docs/snippets/ov_hetero.cpp
+         :language: cpp
+         :fragment: [set_pipeline_parallelism]
+
 
 Using Manual and Automatic Modes in Combination
------------------------------------------------
++++++++++++++++++++++++++++++++++++++++++++++++
 
 In some cases you may need to consider manually adjusting affinities which were set automatically. It usually serves minimizing the number of total subgraphs to optimize memory transfers. To do it, you need to "fix" the automatically assigned affinities like so:
 
@@ -121,7 +155,7 @@ Importantly, the automatic mode will not work if any operation in a model has it
    `ov::Core::query_model <https://docs.openvino.ai/2024/api/c_cpp_api/classov_1_1_core.html#doxid-classov-1-1-core-1acdf8e64824fe4cf147c3b52ab32c1aab>`__ does not depend on affinities set by a user. Instead, it queries for an operation support based on device capabilities.
 
 Configure fallback devices
-++++++++++++++++++++++++++
+##########################
 
 If you want different devices in Hetero execution to have different device-specific configuration options, you can use the special helper property `ov::device::properties <https://docs.openvino.ai/2024/api/c_cpp_api/structov_1_1device_1_1_properties.html#doxid-group-ov-runtime-cpp-prop-api-1ga794d09f2bd8aad506508b2c53ef6a6fc>`__:
 
@@ -146,15 +180,15 @@ If you want different devices in Hetero execution to have different device-speci
 In the example above, the ``GPU`` device is configured to enable profiling data and uses the default execution precision, while ``CPU`` has the configuration property to perform inference in ``fp32``.
 
 Handling of Difficult Topologies
-++++++++++++++++++++++++++++++++
+################################
 
 Some topologies are not friendly to heterogeneous execution on some devices, even to the point of being unable to execute.
 For example, models having activation operations that are not supported on the primary device are split by Hetero into multiple sets of subgraphs which leads to suboptimal execution.
 If transmitting data from one subgraph to another part of the model in the heterogeneous mode takes more time than under normal execution, heterogeneous execution may be unsubstantiated.
 In such cases, you can define the heaviest part manually and set the affinity to avoid sending data back and forth many times during one inference.
 
 Analyzing Performance of Heterogeneous Execution
-++++++++++++++++++++++++++++++++++++++++++++++++
+################################################
 
 After enabling the ``OPENVINO_HETERO_VISUALIZE`` environment variable, you can dump GraphViz ``.dot`` files with annotations of operations per devices.
 
@@ -186,7 +220,7 @@ Here is an example of the output for Googlenet v1 running on HDDL (device no lon
 
 
 Sample Usage
-++++++++++++++++++++
+############
 
 OpenVINO™ sample programs can use the Heterogeneous execution used with the ``-d`` option:
 

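Illustrative note on the subgraph splitting described in step 2 of the documentation above ("a set of connected operations with the same affinity is to be a dedicated subgraph"). The sketch below is plain Python and does not call OpenVINO. It assumes a simplified model that is a linear chain of operations, whereas real models are graphs. The function `split_by_affinity` and the operation names are hypothetical and exist only for this example.

```python
from itertools import groupby


def split_by_affinity(ops):
    """Group a linear chain of (op_name, device) pairs into subgraphs.

    Consecutive operations that share a device form one subgraph.
    This is a simplification: it is not the Hetero plugin's implementation.
    """
    subgraphs = []
    for device, run in groupby(ops, key=lambda op: op[1]):
        subgraphs.append((device, [name for name, _ in run]))
    return subgraphs


# Hypothetical chain with a single operation in the middle that falls back to CPU.
chain = [
    ("conv1", "GPU"),
    ("relu1", "GPU"),
    ("custom_op", "CPU"),
    ("conv2", "GPU"),
    ("softmax", "GPU"),
]

subgraphs = split_by_affinity(chain)
for device, names in subgraphs:
    print(device, names)
```

In this toy chain, one operation with a different affinity turns a single GPU subgraph into three subgraphs with two device-to-device transfers, which is the situation the "Handling of Difficult Topologies" section warns about.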
diff --git a/docs/snippets/ov_hetero.cpp b/docs/snippets/ov_hetero.cpp
@@ -53,5 +53,13 @@ auto compiled_model = core.compile_model(model, "HETERO",
 );
 //! [configure_fallback_devices]
 }
+
+{
+//! [set_pipeline_parallelism]
+std::set<ov::hint::ModelDistributionPolicy> model_policy = {ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL};
+auto compiled_model =
+    core.compile_model(model, "HETERO:GPU.1,GPU.2", ov::hint::model_distribution_policy(model_policy));
+//! [set_pipeline_parallelism]
+}
 return 0;
 }
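Illustrative note on the pipeline-parallel description in the documentation change: operations are placed on the first-priority device while it has enough memory, and the rest go to the next-priority device, with stages executed in sequence. The sketch below is one possible greedy reading of that description in plain Python. It is an assumption made for illustration, not the heuristic the Hetero plugin actually uses. The function `assign_stages`, the memory sizes, and the budgets (in arbitrary units) are hypothetical.

```python
def assign_stages(op_sizes, device_budgets):
    """Assign operations, in execution order, to devices in priority order.

    op_sizes: list of (op_name, memory_needed), in execution order.
    device_budgets: list of (device_name, memory_available), highest priority first.
    Returns a list of (device_name, [op_name, ...]) stages.
    """
    stages = []
    remaining = list(op_sizes)
    for device, budget in device_budgets:
        stage, used = [], 0
        # Keep filling the current device until the next operation no longer fits.
        while remaining and used + remaining[0][1] <= budget:
            name, size = remaining.pop(0)
            stage.append(name)
            used += size
        if stage:
            stages.append((device, stage))
    if remaining:
        raise ValueError("model does not fit on the listed devices")
    return stages


# Hypothetical model and device budgets, in arbitrary memory units.
ops = [("embed", 4), ("block1", 6), ("block2", 6), ("head", 2)]
budgets = [("GPU.1", 10), ("GPU.2", 10)]

stages = assign_stages(ops, budgets)
for device, names in stages:
    print(device, names)
```

Because each stage is a contiguous run of operations, data crosses a device boundary only once per pair of adjacent stages, which matches the stated goal of minimizing data transfer between devices.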
diff --git a/docs/snippets/ov_hetero.py b/docs/snippets/ov_hetero.py
@@ -53,3 +53,15 @@ def main():
     core.set_property("CPU", {hints.inference_precision: ov.Type.f32})
     compiled_model = core.compile_model(model=model, device_name="HETERO")
     #! [configure_fallback_devices]
+
+    #! [set_pipeline_parallelism]
+    import openvino.properties.hint as hints
+
+    compiled_model = core.compile_model(
+        model,
+        device_name="HETERO:GPU.1,GPU.2",
+        config={
+            hints.model_distribution_policy:
+            "PIPELINE_PARALLEL"
+        })
+    #! [set_pipeline_parallelism]
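Illustrative note on the device string used in the snippets above. The property table describes the value as ``HETERO: <device names>``, comma-separated with no spaces, in priority order from high to low. The sketch below builds such a string and a matching configuration dictionary in plain Python without importing OpenVINO. The helper names `hetero_device_name` and `pipeline_parallel_config` are hypothetical. Using the string key `"MODEL_DISTRIBUTION_POLICY"` in place of `hints.model_distribution_policy` is an assumption based on the parameter name listed in the table.

```python
def hetero_device_name(devices):
    """Build a HETERO device string: comma-separated, no spaces, highest priority first."""
    if not devices:
        raise ValueError("at least one device is required")
    cleaned = [d.strip() for d in devices]
    # Reject names that would break the comma-separated, no-spaces format.
    if any(not d or "," in d or " " in d for d in cleaned):
        raise ValueError(f"invalid device name in {devices!r}")
    return "HETERO:" + ",".join(cleaned)


def pipeline_parallel_config():
    """Return a config dict that requests pipeline parallelism (assumed string key)."""
    return {"MODEL_DISTRIBUTION_POLICY": "PIPELINE_PARALLEL"}


device_name = hetero_device_name(["GPU.1", "GPU.2"])
config = pipeline_parallel_config()
print(device_name)
print(config)
```

The resulting values correspond to the `device_name` and `config` arguments passed to `core.compile_model` in the Python snippet above.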