[DOCS] Update the document of HETERO pipeline parallelism #24470

Merged
Changes from 3 commits
Expand Up @@ -18,29 +18,36 @@ Execution via the heterogeneous mode can be divided into two independent steps:

1. Setting hardware affinity to operations (`ov::Core::query_model <https://docs.openvino.ai/2024/api/c_cpp_api/classov_1_1_core.html#doxid-classov-1-1-core-1acdf8e64824fe4cf147c3b52ab32c1aab>`__ is used internally by the Hetero device).
2. Compiling a model to the Heterogeneous device, which involves splitting the model into parts, compiling them on the specified devices (via `ov::device::priorities <https://docs.openvino.ai/2024/api/c_cpp_api/structov_1_1device_1_1_priorities.html>`__), and executing them in the Heterogeneous mode. The model is split into subgraphs in accordance with the affinities, where a set of connected operations with the same affinity forms a dedicated subgraph. Each subgraph is compiled on a dedicated device and multiple `ov::CompiledModel <https://docs.openvino.ai/2024/api/c_cpp_api/classov_1_1_compiled_model.html#doxid-classov-1-1-compiled-model>`__ objects are made, which are connected via automatically allocated intermediate tensors.

If you enable pipeline parallelism (via ``ov::hint::model_distribution_policy``), the model is split into multiple stages, and each stage is assigned to a different device. The output of one stage is fed as input to the next stage.

These two steps are not interconnected and affinities can be set in one of two ways, used separately or in combination (as described below): in the ``manual`` or the ``automatic`` mode.
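
The splitting rule from step 2 (connected operations with the same affinity form one subgraph) can be pictured with a small, library-free sketch. This is only an illustration under simplifying assumptions: the graph, the operation names, and the helper ``split_by_affinity`` are hypothetical, and the sketch ignores any further constraints a real implementation may need to handle, such as avoiding cyclic dependencies between subgraphs.

```python
from collections import defaultdict


def split_by_affinity(edges, affinity):
    """Group operations into subgraphs of connected ops that share an affinity."""
    # Union-find over operations: only edges whose endpoints share a device merge.
    parent = {op: op for op in affinity}

    def find(op):
        while parent[op] != op:
            parent[op] = parent[parent[op]]  # path halving
            op = parent[op]
        return op

    for src, dst in edges:
        if affinity[src] == affinity[dst]:
            parent[find(src)] = find(dst)

    groups = defaultdict(list)
    for op in affinity:  # dict order keeps operations in declaration order
        groups[find(op)].append(op)
    return [(affinity[ops[0]], ops) for ops in groups.values()]


# Hypothetical five-operation chain; names and devices are illustrative only.
affinity = {"conv1": "GPU", "relu1": "GPU", "custom": "CPU",
            "conv2": "GPU", "softmax": "GPU"}
edges = [("conv1", "relu1"), ("relu1", "custom"),
         ("custom", "conv2"), ("conv2", "softmax")]

subgraphs = split_by_affinity(edges, affinity)
for device, ops in subgraphs:
    print(device, ops)
```

In this toy chain the single ``CPU`` operation cuts the ``GPU`` part in two, so three subgraphs result, each of which would be compiled separately and connected through intermediate tensors.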

Defining and Configuring the Hetero Device
++++++++++++++++++++++++++++++++++++++++++
##########################################

Following the OpenVINO™ naming convention, the Hetero execution plugin is assigned the label of ``"HETERO"``. It may be defined with no additional parameters, resulting in defaults being used, or configured further with the following setup options:


+-------------------------------+--------------------------------------------+-----------------------------------------------------------+
| Parameter Name & C++ property | Property values                            | Description                                               |
+===============================+============================================+===========================================================+
| | "MULTI_DEVICE_PRIORITIES"   | | HETERO: <device names>                   | | Lists the devices available for selection.              |
| | ``ov::device::priorities``  | | comma-separated, no spaces               | | The device sequence will be taken as priority           |
| |                             | |                                          | | from high to low.                                       |
+-------------------------------+--------------------------------------------+-----------------------------------------------------------+

+--------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------+
| Parameter Name & C++ property              | Property values                                             | Description                                               |
+============================================+=============================================================+===========================================================+
| | "MULTI_DEVICE_PRIORITIES"                | | ``HETERO: <device names>``                                | | Lists the devices available for selection.              |
| | ``ov::device::priorities``               | |                                                           | | The device sequence will be taken as priority           |
| |                                          | | comma-separated, no spaces                                | | from high to low.                                       |
+--------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------+
| |                                          | | ``empty``                                                 | | Model distribution policy for inference with            |
| | "MODEL_DISTRIBUTION_POLICY"              | | ``ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL``  | | multiple devices. Distributes the model across          |
| |                                          | |                                                           | | multiple devices during model compilation.              |
| | ``ov::hint::model_distribution_policy``  | | HETERO supports only PIPELINE_PARALLEL.                   | |                                                         |
| |                                          | | The default value is empty.                               | |                                                         |
+--------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------+
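
As a rough illustration of the value format in the table, the sketch below builds the ``HETERO:<device names>`` string from a priority-ordered list (highest priority first, comma-separated, no spaces) and pairs it with the string form of the distribution policy. The helper ``hetero_device_name`` is hypothetical and not part of OpenVINO; the property name and value are taken from the table above.

```python
def hetero_device_name(devices):
    """Build a "HETERO:<dev1>,<dev2>" string from a priority-ordered list."""
    if not devices:
        return "HETERO"  # no explicit priorities, so defaults are used
    for name in devices:
        # Device names must be non-empty and free of spaces and commas.
        if not name or "," in name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid device name: {name!r}")
    return "HETERO:" + ",".join(devices)


# Highest-priority device first.
device_name = hetero_device_name(["GPU.1", "GPU.2"])

# String form of the property from the table; HETERO supports only this policy.
config = {"MODEL_DISTRIBUTION_POLICY": "PIPELINE_PARALLEL"}

print(device_name)
print(config)
```

The resulting ``device_name`` and ``config`` have the same shape as the arguments passed to ``core.compile_model`` in the pipeline parallelism snippet of this PR.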

Manual and Automatic modes for assigning affinities
+++++++++++++++++++++++++++++++++++++++++++++++++++
###################################################

The Manual Mode
--------------------
++++++++++++++++++

The manual mode assumes that affinities are set explicitly for all operations in the model, using `ov::Node::get_rt_info <https://docs.openvino.ai/2024/api/c_cpp_api/classov_1_1_node.html#doxid-classov-1-1-node-1a6941c753af92828d842297b74df1c45a>`__ with the ``"affinity"`` key.
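
A minimal sketch of the manual mode follows. It assumes a model object that exposes ``get_ops()`` and operations that expose ``get_rt_info()`` returning a dict-like object, which is how the OpenVINO Python API is used here. ``FakeOp``, ``FakeModel``, and the helper ``set_affinity`` are hypothetical stand-ins so the sketch runs without OpenVINO installed; with a real model only the helper would be needed.

```python
def set_affinity(model, device, predicate=lambda op: True):
    """Write the "affinity" key into the runtime info of every matching op."""
    count = 0
    for op in model.get_ops():
        if predicate(op):
            op.get_rt_info()["affinity"] = device
            count += 1
    return count


# Hypothetical stand-ins that mimic only the two calls used above.
class FakeOp:
    def __init__(self, name):
        self.name = name
        self._rt_info = {}

    def get_rt_info(self):
        return self._rt_info


class FakeModel:
    def __init__(self, names):
        self._ops = [FakeOp(name) for name in names]

    def get_ops(self):
        return self._ops


model = FakeModel(["conv1", "relu1", "softmax"])
set_affinity(model, "GPU")                                   # default for all ops
set_affinity(model, "CPU", lambda op: op.name == "softmax")  # override one op

for op in model.get_ops():
    print(op.name, op.get_rt_info()["affinity"])
```

Setting a default first and then overriding selected operations keeps every operation covered, which matters because the manual mode expects an affinity on all of them.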

Expand All @@ -66,7 +73,10 @@ Randomly selecting operations and setting affinities may lead to decrease in mod


The Automatic Mode
--------------------
++++++++++++++++++

Without pipeline parallelism
-----------------------------

This mode decides automatically which operation is assigned to which device, according to the support reported by the dedicated devices (``GPU``, ``CPU``, etc.). The query model step is called implicitly by the Hetero device during model compilation.

Expand All @@ -90,9 +100,33 @@ It does not take into account device peculiarities such as the inability to infe
:language: cpp
:fragment: [compile_model]

Pipeline parallelism
------------------------

Pipeline parallelism is set via ``ov::hint::model_distribution_policy``. This mode is an efficient technique for inferring large models on multiple devices. The model is split into multiple stages, and each stage is assigned to a different device (``dGPU``, ``iGPU``, ``CPU``, etc.). This mode assigns operations to different devices as reasonably as possible, ensuring that the stages can be executed in sequence and minimizing the amount of data transferred between devices.
Contributor

Suggested change
The pipeline parallelism is set via ``ov::hint::model_distribution_policy``, This mode is an efficient technique to infer large models on multiple devices. The model is split into multiple stages, and each stage is assigned to a different device (``dGPU``, ``iGPU``, ``CPU``, etc.). This mode assign operations to different devices as reasonably as possible, ensuring that different stages can be executed in sequence and minimizing the amount of data transfer between different devices.
The pipeline parallelism is set via ``ov::hint::model_distribution_policy``. This mode is an efficient technique to infer large models on multiple devices. The model is split into multiple stages, and each stage is assigned to a different device (``dGPU``, ``iGPU``, ``CPU``, etc.). This mode assign operations to different devices as reasonably as possible, ensuring that different stages can be executed in sequence and minimizing the amount of data transfer between different devices.

Contributor Author

Updated, thanks.


For large models that do not fit on a single first-priority device, pipeline parallelism places part of the model on that device, so that it has enough memory to infer the operations assigned to it, and assigns the remaining operations to the next-priority device.

Contributor

maybe can re-organize this part? kind of confusing of different devices, and next priority device

Contributor Author

Yes, the description that may cause confusion has been removed.


.. tab-set::

.. tab-item:: Python
:sync: py

.. doxygensnippet:: docs/snippets/ov_hetero.py
:language: Python
:fragment: [set_pipeline_parallelism]

.. tab-item:: C++
:sync: cpp

.. doxygensnippet:: docs/snippets/ov_hetero.cpp
:language: cpp
:fragment: [set_pipeline_parallelism]
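
The idea of placing consecutive stages on devices in priority order can be sketched with a toy greedy placement. This is not the algorithm used by the Hetero plugin, which is not described in this document; the operation sizes, the memory budgets, and the helper ``assign_stages`` are all hypothetical and serve only to show why stages end up executing in sequence.

```python
def assign_stages(op_sizes, device_budgets):
    """Greedily place consecutive operations on devices in priority order."""
    placement = {}
    devices = iter(device_budgets.items())
    device, free = next(devices)
    for op, size in op_sizes:
        # Move on to the next-priority device once the current one is full.
        # Earlier devices are never revisited, so each device holds one stage.
        while size > free:
            try:
                device, free = next(devices)
            except StopIteration:
                raise MemoryError(f"no device has room for {op!r}") from None
        placement[op] = device
        free -= size
    return placement


# Hypothetical memory footprints (arbitrary units), in execution order.
op_sizes = [("embed", 4), ("block0", 6), ("block1", 6), ("head", 2)]
# Hypothetical budgets, highest-priority device first.
device_budgets = {"GPU.1": 10, "GPU.2": 10}

placement = assign_stages(op_sizes, device_budgets)
for op, device in placement.items():
    print(op, "->", device)
```

Because the placement only ever moves forward through the device list, data crosses a device boundary once per stage, which matches the stated goal of keeping transfers between devices low.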


Using Manual and Automatic Modes in Combination
-----------------------------------------------
+++++++++++++++++++++++++++++++++++++++++++++++

In some cases you may need to manually adjust affinities that were set automatically. This usually serves to minimize the total number of subgraphs and thus optimize memory transfers. To do so, you need to "fix" the automatically assigned affinities like so:

Expand Down Expand Up @@ -121,7 +155,7 @@ Importantly, the automatic mode will not work if any operation in a model has it
`ov::Core::query_model <https://docs.openvino.ai/2024/api/c_cpp_api/classov_1_1_core.html#doxid-classov-1-1-core-1acdf8e64824fe4cf147c3b52ab32c1aab>`__ does not depend on affinities set by a user. Instead, it queries for an operation support based on device capabilities.

Configure fallback devices
++++++++++++++++++++++++++
##########################

If you want different devices in Hetero execution to have different device-specific configuration options, you can use the special helper property `ov::device::properties <https://docs.openvino.ai/2024/api/c_cpp_api/structov_1_1device_1_1_properties.html#doxid-group-ov-runtime-cpp-prop-api-1ga794d09f2bd8aad506508b2c53ef6a6fc>`__:

Expand All @@ -146,15 +180,15 @@ If you want different devices in Hetero execution to have different device-speci
In the example above, the ``GPU`` device is configured to enable profiling data and uses the default execution precision, while ``CPU`` has the configuration property to perform inference in ``fp32``.

Handling of Difficult Topologies
++++++++++++++++++++++++++++++++
################################

Some topologies are not friendly to heterogeneous execution on some devices, even to the point of being unable to execute.
For example, models having activation operations that are not supported on the primary device are split by Hetero into multiple sets of subgraphs, which leads to suboptimal execution.
If transmitting data from one subgraph to another part of the model in the heterogeneous mode takes more time than under normal execution, heterogeneous execution may not be justified.
In such cases, you can define the heaviest part manually and set the affinity to avoid sending data back and forth many times during one inference.
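
One simple way to reason about this is to count how many graph edges cross a device boundary before and after a manual adjustment. The sketch below does that for a hypothetical chain in which the activations are not supported on ``GPU``; the operation names, the affinities, and the helper ``count_transfers`` are illustrative assumptions, and the count is only a proxy for the real transfer cost.

```python
def count_transfers(edges, affinity):
    """Count edges whose producer and consumer sit on different devices."""
    return sum(1 for src, dst in edges if affinity[src] != affinity[dst])


# Hypothetical chain: the activations are assumed to be unsupported on GPU.
edges = [("conv1", "act1"), ("act1", "conv2"),
         ("conv2", "act2"), ("act2", "conv3")]

# Affinities as an automatic assignment might produce them.
automatic = {"conv1": "GPU", "act1": "CPU", "conv2": "GPU",
             "act2": "CPU", "conv3": "GPU"}

# Manual adjustment: keep everything after the first activation on CPU.
manual = dict(automatic, conv2="CPU", conv3="CPU")

print("automatic:", count_transfers(edges, automatic))
print("manual:   ", count_transfers(edges, manual))
```

Fewer crossings does not guarantee faster inference, since the operations moved to ``CPU`` may run slower there, so a measurement on the target hardware is still needed.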

Analyzing Performance of Heterogeneous Execution
++++++++++++++++++++++++++++++++++++++++++++++++
################################################

After enabling the ``OPENVINO_HETERO_VISUALIZE`` environment variable, you can dump GraphViz ``.dot`` files with annotations of operations per device.

Expand Down Expand Up @@ -186,7 +220,7 @@ Here is an example of the output for Googlenet v1 running on HDDL (device no lon


Sample Usage
++++++++++++++++++++
############

OpenVINO™ sample programs can use Heterogeneous execution via the ``-d`` option:

8 changes: 8 additions & 0 deletions docs/snippets/ov_hetero.cpp
Expand Up @@ -53,5 +53,13 @@ auto compiled_model = core.compile_model(model, "HETERO",
);
//! [configure_fallback_devices]
}

{
//! [set_pipeline_parallelism]
std::set<ov::hint::ModelDistributionPolicy> model_policy = {ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL};
auto compiled_model =
core.compile_model(model, "HETERO:GPU.1,GPU.2", ov::hint::model_distribution_policy(model_policy));
//! [set_pipeline_parallelism]
}
return 0;
}
12 changes: 12 additions & 0 deletions docs/snippets/ov_hetero.py
Expand Up @@ -53,3 +53,15 @@ def main():
    core.set_property("CPU", {hints.inference_precision: ov.Type.f32})
    compiled_model = core.compile_model(model=model, device_name="HETERO")
    #! [configure_fallback_devices]

    #! [set_pipeline_parallelism]
    import openvino.properties.hint as hints

    compiled_model = core.compile_model(
        model,
        device_name="HETERO:GPU.1,GPU.2",
        config={
            hints.model_distribution_policy:
            "PIPELINE_PARALLEL"
        })
    #! [set_pipeline_parallelism]