Add GPU acceleration documentation #2384

Merged: 25 commits merged on Jan 18, 2023

Collaborator

@Naarcha-AWS Naarcha-AWS commented Jan 11, 2023

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

Description

Fixes issue #2340

Issues Resolved

List any issues this PR will resolve, e.g. Closes [...].

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developer Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
@Naarcha-AWS Naarcha-AWS requested a review from a team as a code owner January 11, 2023 17:51
@Naarcha-AWS Naarcha-AWS added the v2.5.0 'Issues and PRs related to version v2.5.0' label Jan 11, 2023
@Naarcha-AWS Naarcha-AWS self-assigned this Jan 11, 2023
@Naarcha-AWS Naarcha-AWS added the 3 - Tech Review PR: Tech review in progress label Jan 11, 2023
@@ -15,7 +15,7 @@ This page provides an overview of `opensearch.yml` settings that can be configur
### Setting

```
plugins.ml_commons.only_run_on_ml_node: false
```
Contributor

@ylwu-amzn ylwu-amzn Jan 11, 2023

This comment is not about this line, but the following system settings seem to be missing:

  1. plugins.ml_commons.max_ml_task_per_node: 10

Description: The maximum number of ML tasks that can run on each node. Setting it to 0 means no ML tasks are allowed to run.

Default value: 10
Value range: [0, 10000]

  2. plugins.ml_commons.max_model_on_node: 10

Description: The maximum number of models that can be loaded on one ML node. If set to 0, no models can be loaded.

Default value: 10
Value range: [0, 10000]

  3. (new setting in 2.5) plugins.ml_commons.ml_task_timeout_in_seconds: 600

Description: How long an ML task can live. By default, a task times out after 10 minutes (600 seconds) and is then marked as FAILED.

Default value: 600
Value range: [1, 86400]

  4. (new setting in 2.5) plugins.ml_commons.native_memory_threshold: 90

Description: Version 2.5 adds a new circuit breaker that checks total system memory usage (not just JVM heap usage) before running an ML task. If usage exceeds the threshold, an exception is thrown and the ML task is stopped. Setting it to 0 means no ML tasks are allowed to run; setting it to 100 disables the system memory circuit breaker.

Default value: 90
Value range: [0, 100]

Collaborator Author

I'll address these in separate PRs.

  1. For the new ML settings in 2.5.
  2. To add the ones we missed in 2.4.
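For readers following this thread, the four settings the reviewer lists can be summarized as a small lookup table. The sketch below is only an illustration: the setting names, defaults, and value ranges are copied from the review comment, while the `resolve_ml_settings` helper is hypothetical and is not part of ML Commons.

```python
# Defaults and ranges come from the review comment in this thread.
# The helper below is a hypothetical illustration, not plugin code.
ML_SETTINGS = {
    # name: (default, minimum, maximum)
    "plugins.ml_commons.max_ml_task_per_node": (10, 0, 10000),
    "plugins.ml_commons.max_model_on_node": (10, 0, 10000),
    "plugins.ml_commons.ml_task_timeout_in_seconds": (600, 1, 86400),
    "plugins.ml_commons.native_memory_threshold": (90, 0, 100),
}


def resolve_ml_settings(overrides=None):
    """Merge overrides onto the defaults and reject out-of-range values."""
    resolved = {name: default for name, (default, _, _) in ML_SETTINGS.items()}
    for name, value in (overrides or {}).items():
        if name not in ML_SETTINGS:
            raise KeyError(f"unknown setting: {name}")
        _, low, high = ML_SETTINGS[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        resolved[name] = value
    return resolved
```

For example, `resolve_ml_settings({"plugins.ml_commons.native_memory_threshold": 100})` keeps the other three defaults and, per the comment above, corresponds to turning the system memory circuit breaker off.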

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
@hdhalter hdhalter added the release-notes PR: Include this PR in the automated release notes label Jan 11, 2023
@Naarcha-AWS Naarcha-AWS added 4 - Doc Review PR: Doc review in progress and removed 3 - Tech Review PR: Tech review in progress labels Jan 12, 2023
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

@JeffHuss JeffHuss left a comment

Great job! A few sections could use some work. I left detailed comments. Let me know if you have any questions!

Contributor

@alicejw1 alicejw1 left a comment

Looks good!

My only comment is that I agree with Jeff's suggestion to break up the large code block into individual steps.

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
@Naarcha-AWS Naarcha-AWS added 5 - Final Editorial Review PR: Editorial Review in progress and removed 4 - Doc Review PR: Doc review in progress labels Jan 12, 2023
Collaborator

@natebower natebower left a comment

@Naarcha-AWS Please see my comments and changes and let me know if you have any questions. Thanks!

@natebower natebower removed the 5 - Final Editorial Review PR: Editorial Review in progress label Jan 13, 2023
Co-authored-by: Nate Bower <nbower@amazon.com>
@@ -15,7 +15,7 @@ This page provides an overview of `opensearch.yml` settings that can be configur
### Setting

```
-plugins.ml_commons.only_run_on_ml_node: false
+plugins.ml_commons.only_run_on_ml_node: true
```
Contributor

@ylwu-amzn ylwu-amzn Jan 13, 2023

Should we add a warning for users about this change?

On clusters running a version earlier than 2.5, users can run ML tasks on data nodes by default. Starting with 2.5, the default value of this setting changes, so ML tasks can run only on ML nodes. If the cluster has no ML nodes, an exception is thrown. Users can refer to https://opensearch.org/docs/latest/ml-commons-plugin/index/#ml-node for how to add an ML node.

Users who still want to run ML tasks on data nodes can change this setting to false.
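The behavior change described in this comment can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not the plugin's scheduling code: the `eligible_nodes` function, the node names, and the use of the role strings `"ml"` and `"data"` are all hypothetical, and the sketch assumes that with the setting set to false both ML and data nodes are eligible.

```python
def eligible_nodes(nodes, only_run_on_ml_node=True):
    """Return the names of nodes allowed to run ML tasks.

    `nodes` maps a node name to its set of roles, e.g. {"data"} or {"ml"}.
    The default of True mirrors the new 2.5 default discussed in the thread.
    """
    ml_nodes = [name for name, roles in nodes.items() if "ml" in roles]
    if only_run_on_ml_node:
        if not ml_nodes:
            # Per the comment: with no ML nodes, running a task raises an exception.
            raise RuntimeError("no ML nodes available to run the ML task")
        return ml_nodes
    # Assumption: with the setting disabled, data nodes are eligible as well.
    return [name for name, roles in nodes.items() if "ml" in roles or "data" in roles]
```

With a cluster made up only of data nodes, calling the function with the 2.5 default raises an error, whereas passing `only_run_on_ml_node=False` returns the data nodes, which is the upgrade scenario the suggested warning is meant to cover.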

@hdhalter
Collaborator

This was identified as experimental in the release highlights, so adding the experimental tag for release notes.

Naarcha-AWS and others added 5 commits January 17, 2023 10:07
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Co-authored-by: Yaliang Wu <ylwu@amazon.com>
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
4. Copy the Neuron library into OpenSearch. The following command uses a directory named `opensearch-2.5.0`:

```
OPENSEARCH_HOME=~/opensearch-2.5.0
```
Contributor

@ylwu-amzn ylwu-amzn Jan 17, 2023

This line alone is not enough to copy the Neuron library into OpenSearch. This step should be merged with step 2.

Users should copy the Neuron library to the OpenSearch lib folder first and then set the PYTORCH_EXTRA_LIBRARY_PATH path.

Contributor

I see that this comment has not been addressed yet. Is there a plan to address it?

Collaborator Author

Moved the step to before the PyTorch step.
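The ordering requested in this thread (copy the Neuron library into the OpenSearch lib folder first, then set `PYTORCH_EXTRA_LIBRARY_PATH`) might look roughly like the sketch below. Everything except the environment variable name and the `opensearch-2.5.0` directory is an assumption made for illustration: the `stage_neuron_library` helper, the `lib/torch_neuron` target folder, and the `library_name` argument are hypothetical, because the thread does not show the actual source path or library file name.

```python
import os
import shutil
from pathlib import Path


def stage_neuron_library(neuron_lib_dir, opensearch_home, library_name, env=None):
    """Copy the Neuron library into OpenSearch, then point PyTorch at it."""
    # Hypothetical target folder inside the OpenSearch lib directory.
    target = Path(opensearch_home).expanduser() / "lib" / "torch_neuron"
    # Step 1: copy the library files into the OpenSearch lib folder.
    shutil.copytree(Path(neuron_lib_dir).expanduser(), target, dirs_exist_ok=True)
    # Step 2: only after the copy, set PYTORCH_EXTRA_LIBRARY_PATH.
    # `library_name` is a placeholder supplied by the caller.
    env = dict(os.environ if env is None else env)
    env["PYTORCH_EXTRA_LIBRARY_PATH"] = str(target / library_name)
    return env
```

The function returns a copy of the environment instead of mutating `os.environ`, so the result can be passed to whatever process starts OpenSearch. It needs Python 3.8 or later for `dirs_exist_ok`.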


### Troubleshooting

Due to the amount of data required to work with ML models, you might encounter the following `max file descriptors` error when trying to run OpenSearch in a GPU-accelerated cluster:
Contributor

@ylwu-amzn ylwu-amzn Jan 17, 2023

This is a general error, not one specific to GPU-accelerated clusters.

Suggested change
Due to the amount of data required to work with ML models, you might encounter the following `max file descriptors` error when trying to run OpenSearch in a GPU-accelerated cluster:
Due to the amount of data required to work with ML models, you might encounter the following `max file descriptors` error when trying to run OpenSearch:
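Because the `max file descriptors` error is general, as the reviewer notes, one way to check for it before starting OpenSearch is to read the open-file limit of the current process. The sketch below is a minimal illustration that assumes a Unix-like system. The `nofile_limit_ok` helper is hypothetical, and the default `required` value of 65535 is an assumption, since this thread does not quote the error text or the exact threshold.

```python
import resource


def nofile_limit_ok(required=65535):
    """Return (soft_limit, ok) for the current process's open-file limit.

    `required` is an assumed threshold; pass the value from your own error message.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unlimited soft limit always satisfies the requirement.
    ok = soft == resource.RLIM_INFINITY or soft >= required
    return soft, ok
```

If `ok` is false, the soft limit for the user that runs OpenSearch needs to be raised before the node is started.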

Naarcha-AWS and others added 3 commits January 17, 2023 14:32
Co-authored-by: Yaliang Wu <ylwu@amazon.com>
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
@Naarcha-AWS Naarcha-AWS merged commit efa9f99 into main Jan 18, 2023
vagimeli pushed a commit that referenced this pull request Jan 19, 2023
* Add GPU acceleration documentation

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Address tech feedback

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Address technical feedback

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Adjust model size sentence

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Add optional to neuron step

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Add Jeff's feedback

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Add copy and customize for Inferentia examples

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Update _ml-commons-plugin/gpu-acceleration.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _ml-commons-plugin/gpu-acceleration.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Apply suggestions from code review

Co-authored-by: Nate Bower <nbower@amazon.com>

* Fix link

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Apply suggestions from code review

Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>

* Update _ml-commons-plugin/gpu-acceleration.md

Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>

* Update _ml-commons-plugin/gpu-acceleration.md

Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>

* Update _ml-commons-plugin/gpu-acceleration.md

Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>

* Fix numbering in final section

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Add final tech feedback

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* A couple more suggestions

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Apply suggestions from code review

Co-authored-by: Yaliang Wu <ylwu@amazon.com>

* Fix Neural Search link

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Add experimental warning

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Update _ml-commons-plugin/gpu-acceleration.md

Co-authored-by: Yaliang Wu <ylwu@amazon.com>

* Final tech feedback

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Move OpenSearch to step 2.

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Co-authored-by: Nate Bower <nbower@amazon.com>
Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>
Co-authored-by: Yaliang Wu <ylwu@amazon.com>
vagimeli pushed a commit that referenced this pull request Jan 25, 2023
vagimeli added a commit that referenced this pull request Jan 26, 2023
vagimeli added a commit that referenced this pull request Jan 30, 2023
@Naarcha-AWS Naarcha-AWS deleted the ml-gpu-support branch March 28, 2024 23:24
Labels
  • 6 - Done but waiting to merge (PR: The work is done and ready to merge)
  • experimental
  • release-notes (PR: Include this PR in the automated release notes)
  • v2.5.0 ('Issues and PRs related to version v2.5.0')

8 participants