Add GPU acceleration documentation #2384
Conversation
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
@@ -15,7 +15,7 @@ This page provides an overview of `opensearch.yml` settings that can be configur
### Setting

```
plugins.ml_commons.only_run_on_ml_node: false
```
Not for this line. These system settings seem to be missing:

plugins.ml_commons.max_ml_task_per_node: 10
- Description: How many ML tasks can run on each node. Setting this to 0 means no ML tasks are allowed to run.
- Default value: 10
- Value range: [0, 10000]

plugins.ml_commons.max_model_on_node: 10
- Description: How many models can be loaded onto one ML node. If set to 0, no models can be loaded.
- Default value: 10
- Value range: [0, 10000]

plugins.ml_commons.ml_task_timeout_in_seconds: 600 (new setting in 2.5)
- Description: How long an ML task lives. By default, it times out after 10 minutes, and the task is then set to FAILED.
- Default value: 600
- Value range: [1, 86400]

plugins.ml_commons.native_memory_threshold: 90 (new setting in 2.5)
- Description: 2.5 adds a new circuit breaker that checks total system memory usage (not just JVM heap usage) before running an ML task. If usage exceeds the threshold, an exception is thrown and the ML task does not run. Setting this to 0 means no ML tasks are allowed to run; setting it to 100 disables the system memory circuit breaker.
- Default value: 90
- Value range: [0, 100]
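For reference, the four settings described above could be sketched together as an `opensearch.yml` fragment (values shown are the documented defaults; confirm the exact setting names against your OpenSearch version):

```yml
# ML Commons task and memory settings (defaults shown; ranges per the comment above).
plugins.ml_commons.max_ml_task_per_node: 10        # [0, 10000]; 0 disallows ML tasks
plugins.ml_commons.max_model_on_node: 10           # [0, 10000]; 0 disallows model loading
plugins.ml_commons.ml_task_timeout_in_seconds: 600 # [1, 86400]; task set to FAILED on timeout
plugins.ml_commons.native_memory_threshold: 90     # [0, 100]; 100 disables the circuit breaker
```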
I'll address these in separate PRs:
- One for the new ML settings in 2.5.
- One to add the settings we missed in 2.4.
Great job! A few sections could use some work. I left detailed comments. Let me know if you have any questions!
Looks good!
My only comment is that I agree with breaking up the large code block into individual steps, as Jeff mentioned.
@Naarcha-AWS Please see my comments and changes and let me know if you have any questions. Thanks!
Co-authored-by: Nate Bower <nbower@amazon.com>
@@ -15,7 +15,7 @@ This page provides an overview of `opensearch.yml` settings that can be configur
### Setting

```
- plugins.ml_commons.only_run_on_ml_node: false
+ plugins.ml_commons.only_run_on_ml_node: true
```
Add a warning for users about this change?

If a user has a cluster older than 2.5, they can run ML tasks on data nodes by default. Starting in 2.5, the default value of this setting changed, so ML tasks can run only on ML nodes. If there are no ML nodes, an exception is thrown. Users can refer to https://opensearch.org/docs/latest/ml-commons-plugin/index/#ml-node for how to add an ML node.

If users still want to run ML tasks on data nodes, they can change this setting to `false`.

This was identified as experimental in the release highlights, so I'm adding the experimental tag for the release notes.
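If the warning is added, it could point users to the one-line override the reviewer describes. A minimal sketch (the exact location of `opensearch.yml` depends on the install):

```yml
# opensearch.yml — restore pre-2.5 behavior and allow ML tasks on data nodes
plugins.ml_commons.only_run_on_ml_node: false
```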
Co-authored-by: Yaliang Wu <ylwu@amazon.com>
4. Copy the Neuron library into OpenSearch. The following command uses a directory named `opensearch-2.5.0`:

```
OPENSEARCH_HOME=~/opensearch-2.5.0
```
This line is not enough to copy the Neuron library into OpenSearch. This step should be merged with step 2: users should copy the Neuron library into the OpenSearch `lib` folder first, and then set the `PYTORCH_EXTRA_LIBRARY_PATH` path.
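A rough sketch of the merged step the reviewer describes. The Neuron library location is an assumption (`/opt/aws/neuron/lib` here; the actual path depends on your Neuron SDK install), so both variables should be adjusted for the environment:

```shell
# Hypothetical paths; adjust both variables for your environment.
OPENSEARCH_HOME="${OPENSEARCH_HOME:-$HOME/opensearch-2.5.0}"
NEURON_LIB="${NEURON_LIB:-/opt/aws/neuron/lib}"

# Copy the Neuron libraries into the OpenSearch lib folder first...
mkdir -p "$OPENSEARCH_HOME/lib"
if [ -d "$NEURON_LIB" ]; then
  cp "$NEURON_LIB"/*.so "$OPENSEARCH_HOME/lib/"
fi

# ...then point PyTorch at the extra library path.
export PYTORCH_EXTRA_LIBRARY_PATH="$OPENSEARCH_HOME/lib"
```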
I see this comment has not been addressed yet. Any plan?
Moved the step to before the PyTorch step.
### Troubleshooting

Due to the amount of data required to work with ML models, you might encounter the following `max file descriptors` error when trying to run OpenSearch in a GPU-accelerated cluster:
This is a general error, not just for GPU-accelerated clusters.

Suggested change:
- Due to the amount of data required to work with ML models, you might encounter the following `max file descriptors` error when trying to run OpenSearch in a GPU-accelerated cluster:
+ Due to the amount of data required to work with ML models, you might encounter the following `max file descriptors` error when trying to run OpenSearch:
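A quick way to check whether the file descriptor limit is the cause is to inspect the current limits. This is a sketch; the `opensearch` user name and the 65535 value in the comment are the conventional recommendation, not taken from this PR:

```shell
# Print the current soft and hard limits on open file descriptors.
ulimit -Sn
ulimit -Hn

# To raise the limit persistently on Linux, an entry like the following
# (user name is an assumption) would go in /etc/security/limits.conf:
#   opensearch  -  nofile  65535
```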
Co-authored-by: Yaliang Wu <ylwu@amazon.com>
* Add GPU acceleration documentation
* Address tech feedback
* Address technical feedback
* Adjust model size sentence
* Add optional to neuron step
* Add Jeff's feedback
* Add copy and customize for Inferentia examples
* Update _ml-commons-plugin/gpu-acceleration.md (Co-authored-by: Nate Bower <nbower@amazon.com>)
* Update _ml-commons-plugin/gpu-acceleration.md (Co-authored-by: Nate Bower <nbower@amazon.com>)
* Apply suggestions from code review (Co-authored-by: Nate Bower <nbower@amazon.com>)
* Fix link
* Apply suggestions from code review (Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>)
* Apply suggestions from code review (Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>)
* Update _ml-commons-plugin/gpu-acceleration.md (Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>)
* Update _ml-commons-plugin/gpu-acceleration.md (Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>)
* Update _ml-commons-plugin/gpu-acceleration.md (Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>)
* Fix numbering in final section
* Add final tech feedback
* A couple more suggestions
* Apply suggestions from code review (Co-authored-by: Yaliang Wu <ylwu@amazon.com>)
* Fix Neural Search link
* Add experimental warning
* Update _ml-commons-plugin/gpu-acceleration.md (Co-authored-by: Yaliang Wu <ylwu@amazon.com>)
* Final tech feedback
* Move OpenSearch to step 2.

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Co-authored-by: Nate Bower <nbower@amazon.com>
Co-authored-by: Caroline <113052567+carolxob@users.noreply.github.com>
Co-authored-by: Yaliang Wu <ylwu@amazon.com>
Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Description
Fixes issue #2340
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.