Add GPU acceleration documentation #2384
Conversation
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
@@ -15,7 +15,7 @@ This page provides an overview of `opensearch.yml` settings that can be configur
### Setting

```
plugins.ml_commons.only_run_on_ml_node: false
```
Not for this line. These system settings seem to be missing:

plugins.ml_commons.max_ml_task_per_node: 10
- Description: How many ML tasks can run on each node. Setting this to 0 means no ML tasks are allowed to run.
- Default value: 10
- Value range: [0, 10000]

plugins.ml_commons.max_model_on_node: 10
- Description: How many models can be loaded onto one ML node. If set to 0, no models can be loaded.
- Default value: 10
- Value range: [0, 10000]

plugins.ml_commons.ml_task_timeout_in_seconds: 600 (new setting in 2.5)
- Description: How long an ML task lives. By default, it times out after 10 minutes, and the task is then set to FAILED.
- Default value: 600
- Value range: [1, 86400]

plugins.ml_commons.native_memory_threshold: 90 (new setting in 2.5)
- Description: 2.5 adds a new circuit breaker that checks total system memory usage (not just JVM heap usage) before running an ML task. If usage exceeds the threshold, an exception is thrown and the ML task does not run. Setting this to 0 means no ML tasks are allowed to run; setting it to 100 disables the system memory circuit breaker.
- Default value: 90
- Value range: [0, 100]
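For reference, the four settings described above could be sketched together as an `opensearch.yml` fragment (values shown are the documented defaults; confirm the exact setting names against your OpenSearch version):

```yml
# ML Commons task and memory settings (defaults shown; ranges per the comment above).
plugins.ml_commons.max_ml_task_per_node: 10        # [0, 10000]; 0 disallows ML tasks
plugins.ml_commons.max_model_on_node: 10           # [0, 10000]; 0 disallows model loading
plugins.ml_commons.ml_task_timeout_in_seconds: 600 # [1, 86400]; task set to FAILED on timeout
plugins.ml_commons.native_memory_threshold: 90     # [0, 100]; 100 disables the circuit breaker
```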
I'll address these in separate PRs:
- One for the new ML settings in 2.5.
- One to add the settings we missed in 2.4.
Great job! A few sections could use some work. I left detailed comments. Let me know if you have any questions!
Looks good!
My only comment is that I agree with breaking up the large code block into individual steps, as Jeff mentioned.
@Naarcha-AWS Please see my comments and changes and let me know if you have any questions. Thanks!
Co-authored-by: Nate Bower <nbower@amazon.com>
@@ -15,7 +15,7 @@ This page provides an overview of `opensearch.yml` settings that can be configur
### Setting

```
- plugins.ml_commons.only_run_on_ml_node: false
+ plugins.ml_commons.only_run_on_ml_node: true
```
Add a warning for users about this change?

If a user has a cluster older than 2.5, they can run ML tasks on data nodes by default. Starting in 2.5, the default value of this setting changed, so ML tasks can run only on ML nodes. If there are no ML nodes, an exception is thrown. Users can refer to https://opensearch.org/docs/latest/ml-commons-plugin/index/#ml-node for how to add an ML node.

If users still want to run ML tasks on data nodes, they can change this setting to `false`.

This was identified as experimental in the release highlights, so I'm adding the experimental tag for the release notes.
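If the warning is added, it could point users to the one-line override the reviewer describes. A minimal sketch (the exact location of `opensearch.yml` depends on the install):

```yml
# opensearch.yml — restore pre-2.5 behavior and allow ML tasks on data nodes
plugins.ml_commons.only_run_on_ml_node: false
```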
Co-authored-by: Yaliang Wu <ylwu@amazon.com>
4. Copy the Neuron library into OpenSearch. The following command uses a directory named `opensearch-2.5.0`:

```
OPENSEARCH_HOME=~/opensearch-2.5.0
```
This line is not enough to copy the Neuron library into OpenSearch. This step should be merged with step 2: users should copy the Neuron library into the OpenSearch `lib` folder first, and then set the `PYTORCH_EXTRA_LIBRARY_PATH` path.
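A rough sketch of the merged step the reviewer describes. The Neuron library location is an assumption (`/opt/aws/neuron/lib` here; the actual path depends on your Neuron SDK install), so both variables should be adjusted for the environment:

```shell
# Hypothetical paths; adjust both variables for your environment.
OPENSEARCH_HOME="${OPENSEARCH_HOME:-$HOME/opensearch-2.5.0}"
NEURON_LIB="${NEURON_LIB:-/opt/aws/neuron/lib}"

# Copy the Neuron libraries into the OpenSearch lib folder first...
mkdir -p "$OPENSEARCH_HOME/lib"
if [ -d "$NEURON_LIB" ]; then
  cp "$NEURON_LIB"/*.so "$OPENSEARCH_HOME/lib/"
fi

# ...then point PyTorch at the extra library path.
export PYTORCH_EXTRA_LIBRARY_PATH="$OPENSEARCH_HOME/lib"
```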
I see this comment has not been addressed yet. Any plan?
Moved the step to before the PyTorch step.
### Troubleshooting

Due to the amount of data required to work with ML models, you might encounter the following `max file descriptors` error when trying to run OpenSearch in a GPU-accelerated cluster:
This is a general error, not just for GPU-accelerated clusters.

Suggested change:
- Due to the amount of data required to work with ML models, you might encounter the following `max file descriptors` error when trying to run OpenSearch in a GPU-accelerated cluster:
+ Due to the amount of data required to work with ML models, you might encounter the following `max file descriptors` error when trying to run OpenSearch:
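A quick way to check whether the file descriptor limit is the cause is to inspect the current limits. This is a sketch; the `opensearch` user name and the 65535 value in the comment are the conventional recommendation, not taken from this PR:

```shell
# Print the current soft and hard limits on open file descriptors.
ulimit -Sn
ulimit -Hn

# To raise the limit persistently on Linux, an entry like the following
# (user name is an assumption) would go in /etc/security/limits.conf:
#   opensearch  -  nofile  65535
```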
Co-authored-by: Yaliang Wu <ylwu@amazon.com>
* Add GPU acceleration documentation
* Address tech feedback
* Address technical feedback
* Adjust model size sentence
* Add optional to neuron step
* Add Jeff's feedback
* Add copy and customize for Inferentia examples
* Update _ml-commons-plugin/gpu-acceleration.md (Co-authored-by: Nate Bower <nbower@amazon.com>)
* Update _ml-commons-plugin/gpu-acceleration.md (Co-authored-by: Nate Bower <nbower@amazon.com>)
* Apply suggestions from code review (Co-authored-by: Nate Bower <nbower@amazon.com>)
* Fix link
* Apply suggestions from code review (Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>)
* Apply suggestions from code review (Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>)
* Update _ml-commons-plugin/gpu-acceleration.md (Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>)
* Update _ml-commons-plugin/gpu-acceleration.md (Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>)
* Update _ml-commons-plugin/gpu-acceleration.md (Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>)
* Fix numbering in final section
* Add final tech feedback
* A couple more suggestions
* Apply suggestions from code review (Co-authored-by: Yaliang Wu <ylwu@amazon.com>)
* Fix Neural Search link
* Add experimental warning
* Update _ml-commons-plugin/gpu-acceleration.md (Co-authored-by: Yaliang Wu <ylwu@amazon.com>)
* Final tech feedback
* Move OpenSearch to step 2.

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Co-authored-by: Nate Bower <nbower@amazon.com>
Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>
Co-authored-by: Yaliang Wu <ylwu@amazon.com>
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Description
Fixes issue #2340
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.