Support multi trial jobs on same GPU #1109

SparkSnail · 2019-05-26T15:37:13Z

No description provided.


src/nni_manager/rest_server/restValidationSchemas.ts

QuanluZhang · 2019-05-27T05:48:08Z

src/nni_manager/training_service/local/localTrainingService.ts

    private designatedGpuIndices!: Set<number>;
    private log: Logger;
    private localTrailConfig?: TrialConfig;
    private localConfig?: LocalConfig;
    private isMultiPhase: boolean = false;
    private jobStreamMap: Map<string, ts.Stream>;
+    private maxTrialNumPerGPU: number = 1;
+    private useActiveGPU: boolean = false;


How about initializing them in the constructor?

Fixed; moved them to the constructor.
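The two new fields default to one trial per GPU and to skipping GPUs that already run another process. Below is a minimal Python sketch of the constructor-initialization the reviewer suggests. The class name `LocalGpuSettings` and the `local_config` dict shape are hypothetical. The keys `maxTrialNumPerGpu` and `useActiveGpu` follow the `Gpu` spelling used later in this thread.

```python
class LocalGpuSettings:
    """Hypothetical holder for the two new scheduling options.

    Defaults mirror the diff: at most one trial per GPU, and GPUs that
    already run another process are not used.
    """

    def __init__(self, local_config=None):
        local_config = local_config or {}
        # Set in the constructor rather than at the field declaration.
        self.max_trial_num_per_gpu = local_config.get("maxTrialNumPerGpu", 1)
        self.use_active_gpu = local_config.get("useActiveGpu", False)
```

Keeping the defaults in one place means a missing config key and a missing `localConfig` section behave the same way.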

QuanluZhang · 2019-05-27T06:08:13Z

src/nni_manager/training_service/remote_machine/gpuScheduler.ts

+        if(trialJobDetail === undefined) {
+            throw new Error(`could not get trialJobDetail by id ${trialJobId}`);
+        } 
+        if (trialJobDetail.rmMeta !== undefined && trialJobDetail.rmMeta.occupiedGpuIndexMap !== undefined && trialJobDetail.gpuIndices !== undefined && trialJobDetail.gpuIndices.length > 0) {


Better to split this line into four lines.

Fixed; split into four lines.
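The excerpt does not show what the guard protects. Assuming it precedes releasing a finished trial's GPU reservation, the release could look like the Python sketch below. `release_gpus` is a hypothetical helper, and a plain dict stands in for `rmMeta.occupiedGpuIndexMap`.

```python
def release_gpus(occupied, gpu_indices):
    """Decrement per-GPU trial counts when a trial finishes (sketch).

    occupied: dict mapping GPU index -> number of trials placed on it.
    gpu_indices: GPU indices that were assigned to the finished trial.
    """
    if occupied is None or not gpu_indices:
        # Mirrors the guard above: nothing recorded, nothing to release.
        return
    for index in gpu_indices:
        count = occupied.get(index, 0)
        if count <= 1:
            # Last trial on this GPU: drop the entry so it counts as unoccupied.
            occupied.pop(index, None)
        else:
            occupied[index] = count - 1
```

Removing the entry when its count reaches zero, instead of storing 0, keeps an idle GPU in the "no recorded trial" state. That state is the one the availability check treats differently.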

QuanluZhang · 2019-05-27T06:13:59Z

src/nni_manager/training_service/remote_machine/gpuScheduler.ts

+                        if(rmMeta.occupiedGpuIndexMap !== undefined) {
+                            let num = rmMeta.occupiedGpuIndexMap.get(gpuInfo.index);
+                            let maxTrialNumPerGPU: number = rmMeta.maxTrialNumPerGPU? rmMeta.maxTrialNumPerGPU: 1;
+                            if((num === undefined && (!rmMeta.useActiveGPU && gpuInfo.activeProcessNum === 0 || !rmMeta.useActiveGPU)) || (num !== undefined && num < maxTrialNumPerGPU)) {


Same here; better to use multiple lines.

I suggest two lines for this case.

Fixed; split into two lines.
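As quoted in the diff, `(!rmMeta.useActiveGPU && gpuInfo.activeProcessNum === 0 || !rmMeta.useActiveGPU)` reduces to `!rmMeta.useActiveGPU`. As a result, `activeProcessNum` has no effect, and setting `useActiveGPU` to true excludes every GPU that has no recorded trial. The Python sketch below assumes the intended rule is this: a GPU with no NNI trial is usable if it is idle or `useActiveGpu` is set, and a GPU already used by NNI is usable while below the per-GPU limit. `gpu_accepts_trial` is a hypothetical name.

```python
def gpu_accepts_trial(occupied_count, active_process_num, use_active_gpu,
                      max_trial_num_per_gpu=1):
    """Return True if one more trial may be placed on this GPU (sketch).

    occupied_count: trials NNI already placed here (None if none recorded).
    active_process_num: processes currently reported on the GPU.
    """
    if not max_trial_num_per_gpu:
        # Same fallback as `maxTrialNumPerGPU ? maxTrialNumPerGPU : 1`.
        max_trial_num_per_gpu = 1
    if occupied_count is None:
        # No NNI trial here yet: respect other processes unless told otherwise.
        return use_active_gpu or active_process_num == 0
    # NNI already uses this GPU: only the per-GPU limit matters.
    return occupied_count < max_trial_num_per_gpu
```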

QuanluZhang · 2019-05-27T06:38:43Z

src/nni_manager/training_service/remote_machine/gpuScheduler.ts

+                }
+                rmMeta.occupiedGpuIndexMap.set(gpuInfo.index, num + 1);
+            }else {
+                rmMeta.occupiedGpuIndexMap = new Map<number, number>();


Also, please reconsider this logic.

Fixed; the map is now initialized in the constructor.
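Creating the map once in the constructor means the increment path never encounters an undefined map, so the `else` branch that builds a new one is no longer needed. Below is a Python sketch of that shape. `GpuOccupancy` and `reserve` are hypothetical names, and a dict stands in for the TypeScript `Map<number, number>`.

```python
class GpuOccupancy:
    """Hypothetical per-machine record of how many trials use each GPU."""

    def __init__(self):
        # Created once here, so reserve() needs no "else: new map" branch.
        self.occupied_gpu_index_map = {}

    def reserve(self, gpu_index):
        """Count one more trial on gpu_index and return the new count."""
        num = self.occupied_gpu_index_map.get(gpu_index, 0)
        self.occupied_gpu_index_map[gpu_index] = num + 1
        return num + 1
```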


QuanluZhang · 2019-05-27T07:52:00Z

tools/nni_cmd/launcher.py

+        if request_data['local_config']:
+            if request_data['local_config'].get('gpuIndices') and isinstance(request_data['local_config'].get('gpuIndices'), int):
+                request_data['local_config']['gpuIndices'] = str(request_data['local_config'].get('gpuIndices'))
+            if request_data['local_config'].get('maxTrialNumOnEachGPU'):


QuanluZhang · 2019-05-27T07:52:15Z

tools/nni_cmd/launcher.py

+            if request_data['local_config'].get('maxTrialNumOnEachGPU'):
+                request_data['local_config']['maxTrialNumOnEachGPU'] = request_data['local_config'].get('maxTrialNumOnEachGPU')
+            if request_data['local_config'].get('useActiveGPU'):
+                request_data['local_config']['useActiveGPU'] = request_data['local_config'].get('useActiveGPU')


Use `Gpu` instead of `GPU` in these field names.
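Two details in the quoted launcher code are worth noting. First, the `maxTrialNumOnEachGPU` and `useActiveGPU` branches assign each value back to the same key, so they change nothing. Second, the `gpuIndices` check relies on truthiness, so an integer `0` is never converted to a string. Below is a minimal Python sketch of the conversion that avoids both, assuming `request_data` is a plain dict. `normalize_local_config` is a hypothetical helper, not the launcher's actual function.

```python
def normalize_local_config(request_data):
    """Convert an integer gpuIndices (e.g. 0) into its string form ("0").

    Unlike `if ...get('gpuIndices') and ...`, this also handles index 0,
    which is falsy in Python. Other keys are left untouched.
    """
    local_config = request_data.get("local_config")
    if not local_config:
        return request_data
    gpu_indices = local_config.get("gpuIndices")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(gpu_indices, int) and not isinstance(gpu_indices, bool):
        local_config["gpuIndices"] = str(gpu_indices)
    return request_data
```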

chicm-ms · 2019-05-27T08:55:05Z

docs/en_US/ExperimentConfig.md

+
+  * __useActiveGpu__
+
+    __useActiveGpu__ is used to specify whether to use a GPU that is already running another process. By default, NNI will use a GPU only if there is no other active process on it; if __useActiveGpu__ is set to true, NNI will use the GPU regardless of other processes.


Need to state that this option is not applicable to NNI on Windows.


leelaylay · 2019-05-27T16:59:21Z

docs/en_US/ExperimentConfig.md

+  * __useActiveGpu__
+
+    __useActiveGpu__ is used to specify whether to use a GPU that is already running another process. By default, NNI will use a GPU only if there is no other active process on it; if __useActiveGpu__ is set to true, NNI will use the GPU regardless of other processes. This field is not applicable to NNI on Windows.
+


Nice job!
I tested it on my local machine and it works.
From a quick look at the code, it seems to work in remote mode as well.

I suggest adding an example here, for instance:

localConfig:
  maxTrialNumPerGpu: 5

Good suggestion; I've added a configuration to cifar10-pytorch-example.
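With `maxTrialNumPerGpu: 5`, up to five trials can share each GPU. The Python sketch below illustrates the resulting capacity under three assumptions: every trial requests one GPU, all GPUs are otherwise idle, and placement is first-fit. NNI's actual placement order is not shown in this thread, and `place_trials` is a hypothetical name.

```python
def place_trials(num_trials, num_gpus, max_trial_num_per_gpu):
    """Assign single-GPU trials first-fit and return the GPU index of each
    trial that fits; stop once every GPU has reached its limit (sketch)."""
    occupied = {}
    placements = []
    for _ in range(num_trials):
        for gpu in range(num_gpus):
            if occupied.get(gpu, 0) < max_trial_num_per_gpu:
                occupied[gpu] = occupied.get(gpu, 0) + 1
                placements.append(gpu)
                break
        else:
            break  # no GPU has a free slot; remaining trials must wait
    return placements
```

For example, `place_trials(12, 2, 5)` places ten trials, five on each GPU, and leaves the remaining two waiting.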

Shinai Yang (FA TALENT) and others added 30 commits December 25, 2018 11:17

fix remote bug

d77a99c

Merge pull request #106 from Microsoft/master

695d866

fix remote bug (microsoft#523)

Merge pull request #107 from Microsoft/master

b7e9799

merge master

add document

7cb03f9

add document

44d1565

update

7ab7386

update

d9e1ea8

update

2c225a8

update

be23f55

Merge pull request #108 from Microsoft/master

6f760ab

merge master

fix remote issue

9161209

fix forEach

e661c55

Merge pull request #109 from Microsoft/master

4e5d836

merge master

fix conflict

f80e737

Merge branch 'Microsoft-master'

aefc219

update doc according to comments

4fec2cc

Merge pull request #111 from Microsoft/master

dc45661

merge master

update

11fec6f

update

a03a191

update

7c7832c

Merge pull request #112 from Microsoft/master

2c862dc

merge master

remove 'any more'

85c015d

Merge branch 'master' of https://github.com/SparkSnail/nni

85cb472

Merge pull request #113 from Microsoft/master

3784355

merge master

Merge pull request #114 from Microsoft/master

d91c980

merge master

Merge pull request #115 from Microsoft/master

9786650

merge master

Merge pull request #116 from Microsoft/master

ef176d2

merge master

Merge pull request #117 from Microsoft/master

1089e80

merge master

Merge pull request #119 from Microsoft/master

627e823

merge master

Merge pull request #120 from Microsoft/master

b633c26

merge master

Shinai Yang (FA TALENT) and others added 6 commits May 21, 2019 16:10

fix comment and variables

0fafebc

fix schema

c863243

fix comments

f826132

add useActiveGPU field

8b572ed

fix launcher.py

18911e3

Merge pull request #172 from microsoft/master

40bae6e

merge master

SparkSnail requested review from chicm-ms and QuanluZhang May 27, 2019 02:04

chicm-ms reviewed May 27, 2019


src/nni_manager/rest_server/restValidationSchemas.ts (outdated, resolved)

QuanluZhang reviewed May 27, 2019


Shinai Yang (FA TALENT) and others added 4 commits May 27, 2019 14:52

fix comments

a1141a9

fix comments

0d7dcc7

fix occupiedGpuMap

3b84b0f

Merge pull request #173 from microsoft/master

c5acd8c

merge master

QuanluZhang reviewed May 27, 2019


fix comments

aad3f77

chicm-ms reviewed May 27, 2019


Shinai Yang (FA TALENT) added 2 commits May 27, 2019 17:16

Merge branch 'master' of https://github.com/SparkSnail/nni into multi…

87b50b2

…TrialonGPU

fix comments

b445cb3

chicm-ms approved these changes May 27, 2019


QuanluZhang approved these changes May 27, 2019


leelaylay reviewed May 27, 2019


add example for maxTrialNumPerGPU

ad65af0

SparkSnail merged commit 252d35e into microsoft:master May 28, 2019

SparkSnail mentioned this pull request Jun 4, 2019

[dup #608] GPU can't be total free, anyway to run trials use only one GPU #1092

Closed
