From aea7a1b1a0dd86ecb7cc2328684c26e581c076f0 Mon Sep 17 00:00:00 2001
From: Wangwang Wang
Date: Sat, 11 May 2024 11:38:27 +0800
Subject: [PATCH 1/5] Update the doc to HETERO pipeline parallelism

---
 .../hetero-execution.rst                      | 66 ++++++++++++++-----
 docs/snippets/ov_hetero.cpp                   |  6 ++
 docs/snippets/ov_hetero.py                    | 12 ++++
 3 files changed, 67 insertions(+), 17 deletions(-)

diff --git a/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst
index 1842abe226447e..5e81880c0e55a9 100644
--- a/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst
+++ b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst
@@ -18,29 +18,34 @@ Execution via the heterogeneous mode can be divided into two independent steps:
 
 1. Setting hardware affinity to operations (`ov::Core::query_model `__ is used internally by the Hetero device).
 2. Compiling a model to the Heterogeneous device assumes splitting the model to parts, compiling them on the specified devices (via `ov::device::priorities `__), and executing them in the Heterogeneous mode. The model is split to subgraphs in accordance with the affinities, where a set of connected operations with the same affinity is to be a dedicated subgraph. Each subgraph is compiled on a dedicated device and multiple `ov::CompiledModel `__ objects are made, which are connected via automatically allocated intermediate tensors.
+
+   If you set pipeline parallelism (via ``ov::hint::model_distribution_policy``), the model is split into multiple stages, and each stage is assigned to a different device. The output of one stage is fed as input to the next stage.
 
 These two steps are not interconnected and affinities can be set in one of two ways, used separately or in combination (as described below): in the ``manual`` or the ``automatic`` mode.
 
 Defining and Configuring the Hetero Device
-++++++++++++++++++++++++++++++++++++++++++
+##########################################
 
 Following the OpenVINO™ naming convention, the Hetero execution plugin is assigned the label of ``"HETERO".`` It may be defined with no additional parameters, resulting in defaults being used, or configured further with the following setup options:
 
-+-------------------------------+--------------------------------------------+-----------------------------------------------------------+
-| Parameter Name & C++ property | Property values                            | Description                                               |
-+===============================+============================================+===========================================================+
-| | "MULTI_DEVICE_PRIORITIES"   | | HETERO: <device names>                   | | Lists the devices available for selection.              |
-| | ``ov::device::priorities``  | | comma-separated, no spaces               | | The device sequence will be taken as priority           |
-| |                             | |                                          | | from high to low.                                       |
-+-------------------------------+--------------------------------------------+-----------------------------------------------------------+
-
++--------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------+
+| Parameter Name & C++ property              | Property values                                         | Description                                               |
++============================================+=========================================================+===========================================================+
+| | "MULTI_DEVICE_PRIORITIES"                | | HETERO: <device names>                                | | Lists the devices available for selection.              |
+| | ``ov::device::priorities``               | | comma-separated, no spaces                            | | The device sequence will be taken as priority           |
+| |                                          | |                                                       | | from high to low.                                       |
++--------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------+
+| | "MODEL_DISTRIBUTION_POLICY"              | | ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL | | Model distribution policy for inference with            |
+| | ``ov::hint::model_distribution_policy``  | |                                                       | | multiple devices. Distribute model to multiple          |
+| |                                          | |                                                       | | devices during model compilation.                       |
++--------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------+
 
 Manual and Automatic modes for assigning affinities
-+++++++++++++++++++++++++++++++++++++++++++++++++++
+###################################################
 
 The Manual Mode
---------------------
++++++++++++++++++++
 
 It assumes setting affinities explicitly for all operations in the model using `ov::Node::get_rt_info `__ with the ``"affinity"`` key.
 
@@ -66,7 +71,10 @@ Randomly selecting operations and setting affinities may lead to decrease in mod
 
 The Automatic Mode
---------------------
++++++++++++++++++++
+
+Without pipeline parallelism
+-----------------------------
 
 It decides automatically which operation is assigned to which device according to the support from dedicated devices (``GPU``, ``CPU``, etc.) and query model step is called implicitly by Hetero device during model compilation.
 
@@ -90,9 +98,33 @@ It does not take into account device peculiarities such as the inability to infe
    :language: cpp
    :fragment: [compile_model]
 
+Pipeline parallelism
+------------------------
+
+The pipeline parallelism is set via ``ov::hint::model_distribution_policy``, This mode is an efficient technique to infer large models on multiple GPUs or devices. The model is split into multiple stages, and each stage is assigned to a different device (``dGPU``, ``iGPU``, ``CPU``, etc.). This mode will assign operations to different devices as reasonably as possible, ensuring that different stages can be executed in sequence and minimizing the amount of data transfer between different devices.
+
+For large models which don’t fit on a single first priority device, model pipeline parallelism is employed where certain parts of the model are placed on different devices to ensure that the device has enough memory to infer these operations, and assign other operations to next priority device.
+
+
+.. tab-set::
+
+   .. tab-item:: Python
+      :sync: py
+
+      .. doxygensnippet:: docs/snippets/ov_hetero.py
+         :language: Python
+         :fragment: [set_pipeline_parallelism]
+
+   .. tab-item:: C++
+      :sync: cpp
+
+      .. doxygensnippet:: docs/snippets/ov_hetero.cpp
+         :language: cpp
+         :fragment: [set_pipeline_parallelism]
+
 Using Manual and Automatic Modes in Combination
------------------------------------------------
++++++++++++++++++++++++++++++++++++++++++++++++
 
 In some cases you may need to consider manually adjusting affinities which were set automatically. It usually serves minimizing the number of total subgraphs to optimize memory transfers. To do it, you need to "fix" the automatically assigned affinities like so:
 
@@ -121,7 +153,7 @@ Importantly, the automatic mode will not work if any operation in a model has it
 `ov::Core::query_model `__ does not depend on affinities set by a user. Instead, it queries for an operation support based on device capabilities.
 
 Configure fallback devices
-++++++++++++++++++++++++++
+##########################
 
 If you want different devices in Hetero execution to have different device-specific configuration options, you can use the special helper property `ov::device::properties `__:
 
@@ -146,7 +178,7 @@ If you want different devices in Hetero execution to have different device-speci
 
 In the example above, the ``GPU`` device is configured to enable profiling data and uses the default execution precision, while ``CPU`` has the configuration property to perform inference in ``fp32``.
 
 Handling of Difficult Topologies
-++++++++++++++++++++++++++++++++
+################################
 
 Some topologies are not friendly to heterogeneous execution on some devices, even to the point of being unable to execute. For example, models having activation operations that are not supported on the primary device are split by Hetero into multiple sets of subgraphs which leads to suboptimal execution.
 
@@ -154,7 +186,7 @@ If transmitting data from one subgraph to another part of the model in the heter
 In such cases, you can define the heaviest part manually and set the affinity to avoid sending data back and forth many times during one inference.
 
 Analyzing Performance of Heterogeneous Execution
-++++++++++++++++++++++++++++++++++++++++++++++++
+################################################
 
 After enabling the ``OPENVINO_HETERO_VISUALIZE`` environment variable, you can dump GraphViz ``.dot`` files with annotations of operations per devices.
 
@@ -186,7 +218,7 @@ Here is an example of the output for Googlenet v1 running on HDDL (device no lon
 
 Sample Usage
-++++++++++++++++++++
+############
 
 OpenVINO™ sample programs can use the Heterogeneous execution used with the ``-d`` option:
 
diff --git a/docs/snippets/ov_hetero.cpp b/docs/snippets/ov_hetero.cpp
index 791340afff56ef..132c787236048a 100644
--- a/docs/snippets/ov_hetero.cpp
+++ b/docs/snippets/ov_hetero.cpp
@@ -53,5 +53,11 @@ auto compiled_model = core.compile_model(model, "HETERO",
 //! [configure_fallback_devices]
 }
+
+{
+//! [set_pipeline_parallelism]
+compiled_model = core.compile_model(model, "HETERO:GPU.1,GPU.2", ov::hint::model_distribution_policy(ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL));
+//! [set_pipeline_parallelism]
+}
 return 0;
 }
diff --git a/docs/snippets/ov_hetero.py b/docs/snippets/ov_hetero.py
index 7f338081f69c48..dc46dea0dbfb6a 100644
--- a/docs/snippets/ov_hetero.py
+++ b/docs/snippets/ov_hetero.py
@@ -53,3 +53,15 @@ def main():
     core.set_property("CPU", {hints.inference_precision: ov.Type.f32})
     compiled_model = core.compile_model(model=model, device_name="HETERO")
     #! [configure_fallback_devices]
+
+    #! [set_pipeline_parallelism]
+    import openvino.properties.hint as hints
+
+    compiled_model = core.compile_model(
+        model,
+        device_name="HETERO:GPU.1,GPU.2",
+        config={
+            hints.model_distribution_policy:
+                "PIPELINE_PARALLEL"
+        })
+    #! [set_pipeline_parallelism]

From 0a549eebb5c73f5729ba675d37d2671059db5747 Mon Sep 17 00:00:00 2001
From: Wangwang Wang
Date: Sat, 11 May 2024 12:03:10 +0800
Subject: [PATCH 2/5] Update the snippets of hetero code

---
 docs/snippets/ov_hetero.cpp | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/snippets/ov_hetero.cpp b/docs/snippets/ov_hetero.cpp
index 132c787236048a..2c17de269961bf 100644
--- a/docs/snippets/ov_hetero.cpp
+++ b/docs/snippets/ov_hetero.cpp
@@ -56,7 +56,9 @@ auto compiled_model = core.compile_model(model, "HETERO",
 
 {
 //! [set_pipeline_parallelism]
-compiled_model = core.compile_model(model, "HETERO:GPU.1,GPU.2", ov::hint::model_distribution_policy(ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL));
+std::set<ov::hint::ModelDistributionPolicy> model_policy = {ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL};
+auto compiled_model =
+    core.compile_model(model, "HETERO:GPU.1,GPU.2", ov::hint::model_distribution_policy(model_policy));
 //! [set_pipeline_parallelism]
 }
 return 0;

From 2722a49a6a65714855d91a8211a6d8894d2bdf66 Mon Sep 17 00:00:00 2001
From: Wangwang Wang
Date: Sat, 11 May 2024 15:06:43 +0800
Subject: [PATCH 3/5] Update the doc

---
 .../hetero-execution.rst | 26 ++++++++++----------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst
index 5e81880c0e55a9..6950ba1781892b 100644
--- a/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst
+++ b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst
@@ -29,17 +29,19 @@ Defining and Configuring the Hetero Device
 
 Following the OpenVINO™ naming convention, the Hetero execution plugin is assigned the label of ``"HETERO".`` It may be defined with no additional parameters, resulting in defaults being used, or configured further with the following setup options:
 
-+--------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------+
-| Parameter Name & C++ property              | Property values                                         | Description                                               |
-+============================================+=========================================================+===========================================================+
-| | "MULTI_DEVICE_PRIORITIES"                | | HETERO: <device names>                                | | Lists the devices available for selection.              |
-| | ``ov::device::priorities``               | | comma-separated, no spaces                            | | The device sequence will be taken as priority           |
-| |                                          | |                                                       | | from high to low.                                       |
-+--------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------+
-| | "MODEL_DISTRIBUTION_POLICY"              | | ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL | | Model distribution policy for inference with            |
-| | ``ov::hint::model_distribution_policy``  | |                                                       | | multiple devices. Distribute model to multiple          |
-| |                                          | |                                                       | | devices during model compilation.                       |
-+--------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------+
++--------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------+
+| Parameter Name & C++ property              | Property values                                             | Description                                               |
++============================================+=============================================================+===========================================================+
+| | "MULTI_DEVICE_PRIORITIES"                | | ``HETERO: <device names>``                                | | Lists the devices available for selection.              |
+| | ``ov::device::priorities``               | |                                                           | | The device sequence will be taken as priority           |
+| |                                          | | comma-separated, no spaces                                | | from high to low.                                       |
++--------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------+
+| |                                          | | ``empty``                                                 | | Model distribution policy for inference with            |
+| | "MODEL_DISTRIBUTION_POLICY"              | | ``ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL`` | | multiple devices. Distribute model to multiple           |
+| |                                          | |                                                           | | devices during model compilation.                       |
+| | ``ov::hint::model_distribution_policy``  | | HETERO only supports PIPELINE_PARALLEL; the default value | |                                                         |
+| |                                          | | is empty                                                  | |                                                         |
++--------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------+
 
 Manual and Automatic modes for assigning affinities
 ###################################################
 
@@ -101,7 +103,7 @@ It does not take into account device peculiarities such as the inability to infe
 Pipeline parallelism
 ------------------------
 
-The pipeline parallelism is set via ``ov::hint::model_distribution_policy``, This mode is an efficient technique to infer large models on multiple GPUs or devices. The model is split into multiple stages, and each stage is assigned to a different device (``dGPU``, ``iGPU``, ``CPU``, etc.). This mode will assign operations to different devices as reasonably as possible, ensuring that different stages can be executed in sequence and minimizing the amount of data transfer between different devices.
+The pipeline parallelism is set via ``ov::hint::model_distribution_policy``, This mode is an efficient technique to infer large models on multiple devices. The model is split into multiple stages, and each stage is assigned to a different device (``dGPU``, ``iGPU``, ``CPU``, etc.). This mode assign operations to different devices as reasonably as possible, ensuring that different stages can be executed in sequence and minimizing the amount of data transfer between different devices.
 
 For large models which don’t fit on a single first priority device, model pipeline parallelism is employed where certain parts of the model are placed on different devices to ensure that the device has enough memory to infer these operations, and assign other operations to next priority device.
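
Taken together, patches 2 and 3 describe one usage pattern: list the devices in priority order in the ``HETERO:`` string and request ``PIPELINE_PARALLEL`` at compile time. A minimal, self-contained C++ sketch of that pattern (the model path ``model.xml`` and the device names ``GPU.1`` and ``GPU.2`` are illustrative placeholders, as in the snippets above):

    #include <openvino/openvino.hpp>

    #include <memory>
    #include <set>

    int main() {
        ov::Core core;

        // "model.xml" is an illustrative path, not part of the patches.
        std::shared_ptr<ov::Model> model = core.read_model("model.xml");

        // Request pipeline parallelism; the property takes a set of policies.
        std::set<ov::hint::ModelDistributionPolicy> policy = {
            ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL};

        // Device priority is taken from the order of the comma-separated list,
        // from high to low, as the table above states.
        auto compiled_model = core.compile_model(
            model, "HETERO:GPU.1,GPU.2", ov::hint::model_distribution_policy(policy));

        ov::InferRequest request = compiled_model.create_infer_request();
        return 0;
    }
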
From 1dab6c7f238da8cc23c36a63b8d72cd245e82207 Mon Sep 17 00:00:00 2001
From: Wangwang Wang
Date: Fri, 17 May 2024 10:12:24 +0800
Subject: [PATCH 4/5] Update the doc

---
 .../inference-devices-and-modes/hetero-execution.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst
index 6950ba1781892b..ce5a5be13896c5 100644
--- a/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst
+++ b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst
@@ -103,9 +103,9 @@ It does not take into account device peculiarities such as the inability to infe
 Pipeline parallelism
 ------------------------
 
-The pipeline parallelism is set via ``ov::hint::model_distribution_policy``, This mode is an efficient technique to infer large models on multiple devices. The model is split into multiple stages, and each stage is assigned to a different device (``dGPU``, ``iGPU``, ``CPU``, etc.). This mode assign operations to different devices as reasonably as possible, ensuring that different stages can be executed in sequence and minimizing the amount of data transfer between different devices.
+Pipeline parallelism is set via ``ov::hint::model_distribution_policy``. This mode is an efficient technique for inferring large models on multiple devices. The model is split into multiple stages, and each stage is assigned to a different device (``dGPU``, ``iGPU``, ``CPU``, etc.). This mode assigns operations to different devices as reasonably as possible, ensuring that the stages can be executed in sequence while minimizing the amount of data transferred between devices.
 
-For large models which don’t fit on a single first priority device, model pipeline parallelism is employed where certain parts of the model are placed on different devices to ensure that the device has enough memory to infer these operations, and assign other operations to next priority device.
+For large models which do not fit on a single first-priority device, pipeline parallelism places parts of the model on different devices, so that each device has enough memory to infer its share of the operations.
 
 
 .. tab-set::

From 2617d6ca139244262abb2c621cf6dad7528cc1da Mon Sep 17 00:00:00 2001
From: Wangwang Wang
Date: Fri, 17 May 2024 10:39:44 +0800
Subject: [PATCH 5/5] Assets move

---
 .../inference-devices-and-modes/hetero-execution.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst
index b3dbe0ff81e5ca..300177e1c96994 100644
--- a/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst
+++ b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.rst
@@ -113,14 +113,14 @@ For large models which do not fit on a single first-priority device, pipeline p
 .. tab-set::
 
    .. tab-item:: Python
      :sync: py
 
-      .. doxygensnippet:: docs/snippets/ov_hetero.py
+      .. doxygensnippet:: docs/articles_en/assets/snippets/ov_hetero.py
        :language: Python
        :fragment: [set_pipeline_parallelism]
 
    .. tab-item:: C++
      :sync: cpp
 
-      .. doxygensnippet:: docs/snippets/ov_hetero.cpp
+      .. doxygensnippet:: docs/articles_en/assets/snippets/ov_hetero.cpp
        :language: cpp
        :fragment: [set_pipeline_parallelism]
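
The combined manual/automatic flow that the updated article also covers (query the Hetero device, then "fix" individual affinities through the ``"affinity"`` runtime-info key) can be sketched the same way. A minimal illustration under the same assumptions (hypothetical model path; ``GPU`` and ``CPU`` as the fallback pair; the default-to-CPU override is only an example):

    #include <openvino/openvino.hpp>

    #include <memory>
    #include <string>

    int main() {
        ov::Core core;
        std::shared_ptr<ov::Model> model = core.read_model("model.xml");

        // Automatic step: ask which device would take each operation.
        ov::SupportedOpsMap supported = core.query_model(model, "HETERO:GPU,CPU");

        // Manual step: copy (and optionally override) the proposed affinities
        // through the "affinity" runtime-info key the article documents.
        for (const auto& node : model->get_ops()) {
            auto it = supported.find(node->get_friendly_name());
            node->get_rt_info()["affinity"] =
                (it != supported.end()) ? it->second : std::string("CPU");
        }

        // With affinities fixed, compile on the Hetero device as usual.
        auto compiled_model = core.compile_model(model, "HETERO:GPU,CPU");
        return 0;
    }

As the article notes, once any operation carries an ``"affinity"`` entry, every operation must have one; mixing initialized and uninitialized affinities is not supported.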