Task is always in a waiting state in the remote machine #3905

Closed
guoxiaojie-schinper opened this issue Jul 6, 2021 · 20 comments

@guoxiaojie-schinper commented Jul 6, 2021

Describe the issue:

I used nnictl to create a task scheduled to a remote machine. The GPU resources on the remote machine are sufficient, but it keeps printing the following message.

[2021-07-06 17:09:47] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.

GPU information in the remote machine:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80 Driver Version: 460.80 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:1A:00.0 Off | N/A |
| 0% 47C P8 36W / 370W | 5MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3090 Off | 00000000:68:00.0 Off | N/A |
| 0% 46C P8 26W / 370W | 19MiB / 24265MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1085 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1085 G /usr/lib/xorg/Xorg 9MiB |
| 1 N/A N/A 1326 G /usr/bin/gnome-shell 8MiB |
+-----------------------------------------------------------------------------+

Environment:

  • NNI version: 2.3
  • Training service (local|remote|pai|aml|etc): remote
  • Client OS: Ubuntu20.04
  • Server OS (for remote mode only): Ubuntu20.04
  • Python version: 3.7.6
  • PyTorch/TensorFlow version: 1.7.1
  • Is conda/virtualenv/venv used?: virtualenv
  • Is running in Docker?: No

Configuration:

  • Experiment config (remember to remove secrets!):
  • Search space:

Log message:

  • nnimanager.log:
[2021-07-06 17:05:49] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:05:49] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:05:50] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:05:51] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:05:52] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:05:53] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:05:54] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:05:54] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:05:55] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:05:56] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:05:57] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:05:58] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:05:59] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:06:00] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:06:02] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:06:03] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:06:03] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:06:04] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:06:04] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:06:05] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:06:07] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-06 17:06:08] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one
  • dispatcher.log:
[2021-07-06 17:05:04] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2021-07-06 17:05:04] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.000968 seconds
[2021-07-06 17:05:04] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-07-06 17:06:09] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2021-07-06 17:06:10] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated
  • nnictl stdout and stderr:
-----------------------------------------------------------------------
                Experiment start time 2021-07-06 17:05:02
-----------------------------------------------------------------------
  • config file:
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialCodeDirectory: /data/george/code/schinper-nni/nni/examples/trials/mnist-pytorch
trialGpuNumber: 1
trialConcurrency: 4
maxTrialNumber: 20
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: remote
  machineList:
    - host: 172.18.18.204
      port: 22
      user: root
      password: schinper
      pythonPath: /opt/virtualenvs/nni/bin
      useActiveGpu: True
      maxTrialNumberPerGpu: 2
guoxiaojie-schinper changed the title from "Task is always in a waiting state in" to "Task is always in a waiting state in the remote machine" on Jul 6, 2021
@acured (Contributor) commented Jul 7, 2021

Could you try adding "nniManagerIp" to your config file and running it again?

@guoxiaojie-schinper (Author)

Could you try adding "nniManagerIp" to your config file and running it again?

Thanks for your reply. I have added "nniManagerIp" to my config file, but it doesn't work. The config file is as follows:

searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialCodeDirectory: /data/george/code/schinper-nni/nni/examples/trials/mnist-pytorch
trialGpuNumber: 1
trialConcurrency: 1
maxTrialNumber: 20
nniManagerIp: 172.18.18.206
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: remote
  machineList:
    - host: 172.18.18.204
      port: 22
      user: root
      password: schinper
      pythonPath: /opt/virtualenvs/nni/bin
      useActiveGpu: True
      maxTrialNumberPerGpu: 1

@acured (Contributor) commented Jul 7, 2021

Is your nnimanager machine reachable from your remote machine? You can run a quick test from the remote machine. This problem seems to be caused by the remote machine not being able to reach the local machine.
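
For example, a minimal sketch of such a test (not an NNI tool, just a quick check to run on the remote machine with Python): it tries to open a TCP connection back to the nniManager REST port, assuming the default port 8080; adjust the values if your setup differs.

import socket

NNI_MANAGER_IP = "172.18.18.206"  # the machine where nnictl runs
NNI_MANAGER_PORT = 8080           # assumed default REST / web UI port

# Try to open a TCP connection from the remote machine back to nniManager.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(5)
    try:
        sock.connect((NNI_MANAGER_IP, NNI_MANAGER_PORT))
        print("nniManager is reachable")
    except OSError as err:
        print("nniManager is NOT reachable:", err)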

@guoxiaojie-schinper (Author)

Is your nnimanager machine reachable from your remote machine? You can run a quick test from the remote machine. This problem seems to be caused by the remote machine not being able to reach the local machine.

I'm sure that the machines can reach each other, and the firewall is turned off on both machines.

(base) root@train06:~# ping 172.18.18.204
PING 172.18.18.204 (172.18.18.204) 56(84) bytes of data.
64 bytes from 172.18.18.204: icmp_seq=1 ttl=64 time=0.266 ms
64 bytes from 172.18.18.204: icmp_seq=2 ttl=64 time=0.211 ms
64 bytes from 172.18.18.204: icmp_seq=3 ttl=64 time=0.243 ms

(base) root@train04:~# ping 172.18.18.206
PING 172.18.18.206 (172.18.18.206) 56(84) bytes of data.
64 bytes from 172.18.18.206: icmp_seq=1 ttl=64 time=0.228 ms
64 bytes from 172.18.18.206: icmp_seq=2 ttl=64 time=0.249 ms
64 bytes from 172.18.18.206: icmp_seq=3 ttl=64 time=0.233 ms

@acured (Contributor) commented Jul 8, 2021

Thanks for your test. Could you run it again in debug mode and paste "nnimanager.log" here?

Set debug mode: nnictl create --config config.yml --debug

@guoxiaojie-schinper (Author) commented Jul 13, 2021

Thanks for your test. Could you run it again in debug mode and paste "nnimanager.log" here?

Set debug mode: nnictl create --config config.yml --debug

Please check the log from nnimanager with the --debug option added.

[2021-07-13 16:19:03] DEBUG (ShellExecutor) remoteExeCommand(6679): [export PATH=/opt/virtualenvs/nni/bin:$PATH && kill -0 `cat '/tmp/nni-experiments/WrRe913u/envs/mlCk1/pid'`]
[2021-07-13 16:19:03] DEBUG (ShellExecutor) remoteExeCommand(6679) exit(0)
stdout: 
stderr: 
[2021-07-13 16:19:03] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:03] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:04] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:04] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:05] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:05] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:06] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:06] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:07] DEBUG (IpcInterface) ipcInterface command type: [PI], content:[]
[2021-07-13 16:19:08] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:08] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:08] DEBUG (TrialDispatcher) TrialDispatcher: env mlCk1 received command GI.
[2021-07-13 16:19:08] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:08] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:08] DEBUG (ShellExecutor) remoteExeCommand(5727): [export PATH=/opt/virtualenvs/nni/bin:$PATH && test -e /tmp/nni-experiments/WrRe913u/envs/mlCk1/pid && echo True || echo False]
[2021-07-13 16:19:08] DEBUG (ShellExecutor) remoteExeCommand(5727) exit(0)
stdout: True

stderr: 
[2021-07-13 16:19:08] DEBUG (ShellExecutor) remoteExeCommand(3338): [export PATH=/opt/virtualenvs/nni/bin:$PATH && kill -0 `cat '/tmp/nni-experiments/WrRe913u/envs/mlCk1/pid'`]
[2021-07-13 16:19:08] DEBUG (ShellExecutor) remoteExeCommand(3338) exit(0)
stdout: 
stderr: 
[2021-07-13 16:19:08] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:08] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:08] DEBUG (NNIRestHandler) GET: /experiment: body: {}
[2021-07-13 16:19:08] DEBUG (NNIRestHandler) GET: /trial-jobs: body: {}
[2021-07-13 16:19:08] DEBUG (NNIDataStore) getTrialJobsByReplayEvents begin
[2021-07-13 16:19:08] DEBUG (NNIDataStore) getTrialJobsByReplayEvents done
[2021-07-13 16:19:08] DEBUG (NNIRestHandler) GET: /metric-data: body: {}
[2021-07-13 16:19:08] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2021-07-13 16:19:09] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:09] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:10] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:10] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:11] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:11] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:12] DEBUG (IpcInterface) ipcInterface command type: [PI], content:[]
[2021-07-13 16:19:13] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:13] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:13] DEBUG (ShellExecutor) remoteExeCommand(5797): [export PATH=/opt/virtualenvs/nni/bin:$PATH && test -e /tmp/nni-experiments/WrRe913u/envs/mlCk1/pid && echo True || echo False]
[2021-07-13 16:19:13] DEBUG (ShellExecutor) remoteExeCommand(5797) exit(0)
stdout: True

stderr: 
[2021-07-13 16:19:13] DEBUG (ShellExecutor) remoteExeCommand(3235): [export PATH=/opt/virtualenvs/nni/bin:$PATH && kill -0 `cat '/tmp/nni-experiments/WrRe913u/envs/mlCk1/pid'`]
[2021-07-13 16:19:13] DEBUG (ShellExecutor) remoteExeCommand(3235) exit(0)
stdout: 
stderr: 
[2021-07-13 16:19:13] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:13] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:14] DEBUG (TrialDispatcher) TrialDispatcher: env mlCk1 received command GI.
[2021-07-13 16:19:14] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:14] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:15] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:15] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:16] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:16] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:17] DEBUG (IpcInterface) ipcInterface command type: [PI], content:[]
[2021-07-13 16:19:17] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:17] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:18] DEBUG (ShellExecutor) remoteExeCommand(8030): [export PATH=/opt/virtualenvs/nni/bin:$PATH && test -e /tmp/nni-experiments/WrRe913u/envs/mlCk1/pid && echo True || echo False]
[2021-07-13 16:19:18] DEBUG (ShellExecutor) remoteExeCommand(8030) exit(0)
stdout: True

stderr: 
[2021-07-13 16:19:18] DEBUG (ShellExecutor) remoteExeCommand(9077): [export PATH=/opt/virtualenvs/nni/bin:$PATH && kill -0 `cat '/tmp/nni-experiments/WrRe913u/envs/mlCk1/pid'`]
[2021-07-13 16:19:18] DEBUG (ShellExecutor) remoteExeCommand(9077) exit(0)
stdout: 
stderr: 
[2021-07-13 16:19:18] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:18] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:18] DEBUG (NNIRestHandler) GET: /experiment: body: {}
[2021-07-13 16:19:18] DEBUG (NNIRestHandler) GET: /trial-jobs: body: {}
[2021-07-13 16:19:18] DEBUG (NNIDataStore) getTrialJobsByReplayEvents begin
[2021-07-13 16:19:18] DEBUG (NNIDataStore) getTrialJobsByReplayEvents done
[2021-07-13 16:19:18] DEBUG (NNIRestHandler) GET: /metric-data: body: {}
[2021-07-13 16:19:18] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2021-07-13 16:19:19] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:19] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2021-07-13 16:19:20] DEBUG (TrialDispatcher) TrialDispatcher: env mlCk1 received command GI.
[2021-07-13 16:19:20] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-13 16:19:20] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.

@albertogilramos

@acured @guoxiaojie-schinper : I can confirm the issue on a remote machine with nni 2.2 and 2.3 (and nni built from the latest master, 5b99b59), but it works fine on nni 2.1.

In particular, /tmp/nni-experiments/EXP_ID/scripts contains gpu_metrics and pid for 2.1, but not for 2.2 or 2.3.

For 2.2 and 2.3, in my case the first trial runs fine, but if I user-cancel it, the second trial just keeps waiting, and in the NNIManager log I see

[2021-07-14 07:52:29] DEBUG (TrialDispatcher) TrialDispatcher: env hIhRt received command GI.
[2021-07-14 07:52:29] TRACE (CommandChannel) CommandChannel: env hIhRt emit command: GI, [object Object].
[2021-07-14 07:52:29] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-14 07:52:29] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.

Is it possible to re-open this issue so that the problem can be addressed? Thank you very much.
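
For reference, a small check like the following sketch (assuming the default /tmp/nni-experiments location mentioned above) can be run on the remote machine to see which experiments actually got the gpu_metrics and pid files in their scripts directory:

import glob
import os

# List every experiment's scripts directory and report whether the GPU
# collector files are present.
for scripts_dir in glob.glob("/tmp/nni-experiments/*/scripts"):
    entries = set(os.listdir(scripts_dir))
    print(scripts_dir)
    print("  gpu_metrics present:", "gpu_metrics" in entries)
    print("  pid present:", "pid" in entries)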

@guoxiaojie-schinper (Author)

@acured @guoxiaojie-schinper : I can confirm the issue on a remote machine with nni 2.2 and 2.3 (and nni built from the latest master, 5b99b59), but it works fine on nni 2.1.

In particular, /tmp/nni-experiments/EXP_ID/scripts contains gpu_metrics and pid for 2.1, but not for 2.2 or 2.3.

For 2.2 and 2.3, in my case the first trial runs fine, but if I user-cancel it, the second trial just keeps waiting, and in the NNIManager log I see

[2021-07-14 07:52:29] DEBUG (TrialDispatcher) TrialDispatcher: env hIhRt received command GI.
[2021-07-14 07:52:29] TRACE (CommandChannel) CommandChannel: env hIhRt emit command: GI, [object Object].
[2021-07-14 07:52:29] DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU.
[2021-07-14 07:52:29] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.

Is it possible to re-open this issue so that the problem can be addressed? Thank you very much.

I closed the issue by mistake; I have re-opened it. Thanks.

@acured (Contributor) commented Jul 19, 2021

Hi @albertogilramos, thanks for your feedback, and thanks to @guoxiaojie-schinper for the debug log.

@albertogilramos, could you give me more information about how you user-cancel the trial? I cannot reproduce it at the moment.

@acured (Contributor) commented Jul 19, 2021

BTW, there is a related fix for the GPU release issue, #3941. It may solve this problem in the next nni version.

@albertogilramos

@acured : In NNI 2.3 remote mode, I'm using the following minimal PyTorch linear-regression example on just one machine (with one GPU) that acts as both the master and the slave for testing. In particular, note in config.yml that the <IP> in nniManagerIp and host are the same, and that I've also anonymized the user as <USER>. The issue is that while the first trial runs and finishes successfully, as can be seen in the attached picture, the second trial waits forever and never starts:

(Note I've not yet had the chance to build from master after #3941 to see if that fixes the issue. I'll try to do so this evening.)

config.yml:

debug: true
experimentName: Experiment Name
logLevel: trace
maxExperimentDuration: 99990h
maxTrialNumber: 99990
nniManagerIp: <IP>
searchSpaceFile: search_space.json
trainingService:
  machineList:
  - gpuIndices: '0'
    host: <IP>
    maxTrialNumberPerGpu: 1
    sshKeyFile: ~/.ssh/id_rsa
    useActiveGpu: true
    user: <USER>
  platform: remote
trialCodeDirectory: .
trialCommand: python3 main.py
trialConcurrency: 1
trialGpuNumber: 1
tuner:
  classArgs:
    optimize_mode: minimize
  name: TPE
useAnnotation: false

search_space.json:

{
  "lr": {
    "_type": "choice",
    "_value": [
      0.0001,
      0.001,
      0.01,
      0.1
    ]
  }
}

main.py:

# pylint: skip-file
import argparse
import itertools

import nni
import numpy as np
import torch


def get_fake_data_generator(*, w, b, scale, batch_size):
    size = (batch_size, 1)

    def fake_data_generator():
        while True:
            x = np.random.normal(size=size)
            y = w * x + b + np.random.normal(scale=scale, size=size)
            yield x.astype(np.float32), y.astype(np.float32)

    return fake_data_generator


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(1, 1))
        self.b = torch.nn.Parameter(torch.zeros(1, 1))

    def forward(self, x):
        return x @ self.w + self.b


def main(params):
    print(params)
    np.random.seed(params["seed"])
    torch.manual_seed(params["seed"])
    fake_data_generator = get_fake_data_generator(
        w=params["w_true"],
        b=params["b_true"],
        scale=params["scale"],
        batch_size=params["batch_size"],
    )
    model = Model()
    optim = torch.optim.SGD(model.parameters(), lr=params["lr"])

    for epoch in range(params["epochs"]):
        for x_true, y_true in itertools.islice(fake_data_generator(),
                                               params["steps"]):
            x_true = torch.from_numpy(x_true)
            y_true = torch.from_numpy(y_true)
            y_pred = model(x_true)
            loss = (y_true - y_pred)**2
            loss = loss.mean()
            optim.zero_grad()
            loss.backward()
            optim.step()
        result = {
            "default": float(loss.detach().numpy()),
            "w": float(model.w.detach().numpy()),
            "b": float(model.b.detach().numpy()),
        }
        nni.report_intermediate_result(result)
    result = {
        "default": float(loss.detach().numpy()),
        "w": float(model.w.detach().numpy()),
        "b": float(model.b.detach().numpy()),
    }
    nni.report_final_result(result)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="")
    parser.add_argument("--seed", default=1234, type=int)
    parser.add_argument("--w_true", default=1.0, type=float)
    parser.add_argument("--b_true", default=0.5, type=float)
    parser.add_argument("--scale", default=0.1, type=float)
    parser.add_argument("--batch_size", default=32, type=int)
    parser.add_argument("--lr", default=1e-2, type=float)
    parser.add_argument("--epochs", default=10, type=int)
    parser.add_argument("--steps", default=1000, type=int)
    args = parser.parse_args()
    params = args.__dict__
    tuned_params = nni.get_next_parameter()
    params.update(tuned_params)
    main(params)

[screenshot: nni_stall]

@albertogilramos

@acured: I can confirm my issue is solved in the latest master (442342c). Specifically, I installed the nni nightly build via

git clone https://github.com/Microsoft/nni.git
cd nni
git checkout 442342cb19eb810187f1f4e12983ddb3e6d8cacb
export NNI_RELEASE=2.0
python3 -m pip install --upgrade pip setuptools wheel
python3 setup.py clean --all
python3 setup.py build_ts
python3 setup.py bdist_wheel -p manylinux1_x86_64
python3 -m pip install dist/nni-2.0-py3-none-manylinux1_x86_64.whl

after which the trials no longer stay waiting forever, even if I user-cancel them, as can be seen in the picture:

[screenshot: nni_fixed]

@guoxiaojie-schinper : perhaps you also want to try this version of nni from master and see whether it solves your problem as well?

Thank you very much.

@guoxiaojie-schinper (Author)

@acured: I can confirm my issue is solved in the latest master (442342c). Specifically, I installed the nni nightly build via

git clone https://github.com/Microsoft/nni.git
cd nni
git checkout 442342cb19eb810187f1f4e12983ddb3e6d8cacb
export NNI_RELEASE=2.0
python3 -m pip install --upgrade pip setuptools wheel
python3 setup.py clean --all
python3 setup.py build_ts
python3 setup.py bdist_wheel -p manylinux1_x86_64
python3 -m pip install dist/nni-2.0-py3-none-manylinux1_x86_64.whl

after which the trials no longer stay waiting forever, even if I user-cancel them, as can be seen in the picture:

[screenshot: nni_fixed]

@guoxiaojie-schinper : perhaps you also want to try this version of nni from master and see whether it solves your problem as well?

Thank you very much.

I have updated nni and installed the latest code with python setup.py develop from the master branch, but it still doesn't work for me.

@albertogilramos

@guoxiaojie-schinper : in case it helps, your command (python setup.py develop) installs it in dev mode, whereas mine installed it in persistent mode via a wheel (see above).

See https://nni.readthedocs.io/en/stable/Tutorial/InstallationLinux.html#installation

Also, this needs to be done on both the master and slave machines.
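
One quick way to confirm which nni installation each machine actually picks up (just a sketch; run it with the same interpreter that pythonPath points to, on both the master and the slave):

import nni

# Print the installed version and the location it is imported from; a path
# inside the cloned repo indicates a dev-mode (setup.py develop) install.
print(nni.__version__)
print(nni.__file__)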

@guoxiaojie-schinper (Author)

@guoxiaojie-schinper : in case it helps, your command (python setup.py develop) installs it in dev mode, whereas mine installed it in persistent mode via a wheel (see above).

See https://nni.readthedocs.io/en/stable/Tutorial/InstallationLinux.html#installation

Also, this needs to be done on both the master and slave machines.

Thanks very much for your quick reply.

I have used the following command to update NNI on both the master and slave machines, just as you recommended.

python3 -m pip install --upgrade nni

But it still doesn't work for me. Also, the latest wheel on pypi.org was released on June 15, 2021, so I think it is not the newest version. Is there something wrong with my understanding?

@albertogilramos

@guoxiaojie-schinper : if you use

python3 -m pip install --upgrade nni

you'll get the latest wheel from pypi (https://pypi.org/project/nni/), which is 2.3 (https://pypi.org/project/nni/#history). But what you want is, rather than downloading from pypi, to build the wheel yourself from the latest master, for which there is no pypi package (nni doesn't release nightly versions on pypi). So if you want to reproduce what worked for me, you need to do the following on your master and slaves:

git clone https://github.com/Microsoft/nni.git
cd nni
git checkout 442342cb19eb810187f1f4e12983ddb3e6d8cacb
export NNI_RELEASE=2.0
python3 -m pip install --upgrade pip setuptools wheel
python3 setup.py clean --all
python3 setup.py build_ts
python3 setup.py bdist_wheel -p manylinux1_x86_64
python3 -m pip install dist/nni-2.0-py3-none-manylinux1_x86_64.whl

This will clone the repo, check out a commit that is ahead of 2.3 and worked for me, build the wheel package yourself, and finally install it.

Hope this helps.

@guoxiaojie-schinper (Author)

python3 setup.py bdist_wheel -p manylinux1_x86_64

Thanks for your reply, but it still doesn't work for me. I think this version only solves the "Bug in IP detection" problem, not my issue. The new version can correctly detect the IP address, whereas in the previous version, if I didn't set nniManagerIp in the config file, it would throw "Job management error: getIPV4Address() failed because os.networkInterfaces().eth0 is undefined".
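
For anyone else hitting the getIPV4Address() error: one way to find a value to set manually as nniManagerIp is to ask the OS which local address it uses for outbound traffic, for example with the small sketch below (the 8.8.8.8 target is arbitrary; connect() on a UDP socket does not actually send any packets):

import socket

# Determine the local IPv4 address used for outbound connections; this is
# usually the address the remote machine should use to reach nniManager.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.connect(("8.8.8.8", 80))
print(sock.getsockname()[0])
sock.close()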

@acured (Contributor) commented Jul 22, 2021

Thanks @albertogilramos, I'm glad this solves your problem.

Hi @guoxiaojie-schinper, this fix is not in the release build. If you want to try the latest code, you can install NNI from source; see here: https://nni.readthedocs.io/en/stable/Tutorial/InstallationLinux.html#install-nni-through-source-code

Or you can wait for the next NNI release.

@OuYaozhong

I am very happy to see that someone else has the same problem.

In brief, what happened to me is the same as what @guoxiaojie-schinper described.

I am running the demo from the nni repo, /example/trial/mnist-pytorch.

If I run config_remote.yml locally on the remote machine (with trainingService changed to local, of course), everything is normal.

But if the same config_remote.yml is run from my local machine (a MacBook Pro), with the slave worker being a workstation with an Nvidia GeForce 2080 GPU, it does not work, exactly as @guoxiaojie-schinper described.

In detail,

Environment: NNI on both the local and remote machines was installed with python3 -m pip install --upgrade nni in a conda environment.

config_remote.yml (if using remote mode):

searchSpaceFile: search_space.json
#trialCommand: nvidia-smi && which python3 && python3 mnist.py
trialCommand: python3 mnist.py
trialGpuNumber: 1
trialConcurrency: 4
maxTrialNumber: 20
nniManagerIp: 10.113.217.251
debug: true
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: remote
  machineList:
    - host: 10.113.217.230 
      user: root
      sshKeyFile: ~/.ssh/nni_docker
      port: 8145
      pythonPath: /opt/conda/envs/py38torch190cu111/bin
      useActiveGpu: true
      maxTrialNumberPerGpu: 8

config_remote.yml (if using local mode):

searchSpaceFile: search_space.json
#trialCommand: nvidia-smi && which python3 && python3 mnist.py
trialCommand: python3 mnist.py
trialGpuNumber: 1
trialConcurrency: 4
maxTrialNumber: 20
nniManagerIp: 10.113.217.230
debug: true
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
  useActiveGpu: true
  maxTrialNumberPerGpu: 8

Description:

  1. If I run the code and config (local mode, using the second yml file) on the remote machine locally, everything runs as expected. The number of tasks on the GPU matches trialConcurrency, the GPU is actually used by nni, and the output/waiting speed is also as expected.
  2. If I run the code and config (remote mode, using the first yml file) from my local machine (a MacBook Pro with the latest OS), connecting to the remote machine, several strange phenomena occur. I list them below.

-> 2.1 If I set trialGpuNumber = 1 and trialCommand = python3 mnist.py, the phenomenon is the same as @guoxiaojie-schinper reported. All the tasks show waiting status forever, and the NNIManager log shows: INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one. Using top and nvidia-smi on the remote machine confirms that the task is not actually running: CPU usage is low and there is no related GPU process. The waiting status can persist for several hours. Whether or not nvidia-smi is added to the trialCommand (as in the commented-out trialCommand), the phenomenon is the same.

-> 2.2 If I set trialGpuNumber > 1, nni tells me the limit is exceeded and none of the machines can satisfy it. My remote machine in fact has only one GPU, so this behavior is reasonable.

-> 2.3 If I set trialGpuNumber = 0 and trialCommand = python3 mnist.py, whether inside or outside docker, then even though trialConcurrency = 4, only one task runs and the other three keep waiting until the running one finishes. Unlike 2.1, where all the tasks wait forever, here the tasks run one by one, beyond the control of the trialConcurrency argument. The task runs on CPU (it takes about 4 minutes for this mnist demo from the nni repo to reach full 800% usage on an 8-core i7 CPU), so it just takes longer than on GPU, but it still runs instead of waiting forever. (I am not very confident about the following, because I don't remember it exactly and the situation shows up rarely:) maybe something will randomly use the GPU in the first task or the first several tasks.

-> 2.4 If I set trialGpuNumber = 0 again, but add nvidia-smi to the trialCommand, i.e. trialCommand: nvidia-smi && which python3 && python3 mnist.py, and run outside docker, the task will run on the GPU after about 4 minutes, which is much slower than the normal case in 2.5. I confirmed the GPU usage with nvidia-smi on the remote machine (a related process shows up in nvidia-smi) and from the output speed of nni. But it seems that the next task needs even more time to wait before it uses the GPU.

-> 2.5 If I set trialGpuNumber = 0 again, but add nvidia-smi to the trialCommand, i.e. trialCommand: nvidia-smi && which python3 && python3 mnist.py, and run inside docker, the task does run on the GPU, but still one by one. (A quick way to check GPU visibility from inside a trial is sketched below.)
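
A quick way to check GPU visibility from inside a trial (just a sketch; check_gpu.py is a hypothetical helper name, not part of my original setup) is to prepend a script like the following to the trial command, e.g. trialCommand: python3 check_gpu.py && python3 mnist.py:

import os

import torch

# Print what the trial process actually sees. NNI normally sets
# CUDA_VISIBLE_DEVICES when it assigns GPUs to a trial, so an empty or
# missing value here usually explains a CPU-only run.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.is_available() =", torch.cuda.is_available())
print("torch.cuda.device_count() =", torch.cuda.device_count())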

@scarlett2018 (Member)

Closing as fixed in #4035.
