Can't run more than n trials with trialConcurrency=n > 1 #5689

Open
studywolf opened this issue Sep 30, 2023 · 8 comments

@studywolf

Describe the issue:

When I set trialConcurrency > 1, NNI fails with

[2023-09-30 12:57:40] ERROR (nni.runtime.msg_dispatcher_base/Thread-1 (command_queue_worker)) 7
Traceback (most recent call last):
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
    self._handle_final_metric_data(data)
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
    self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
    params = self._running_params.pop(parameter_id)
KeyError: 7
[2023-09-30 12:57:41] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2023-09-30 12:57:44] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated

When trialConcurrency = n > 1, NNI runs n trials and then fails with this error. This happens for every n I've tried (2, 5, 10, 100). With trialConcurrency = 1, there are no problems.
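
From the traceback, the failing line pops the trial's parameter ID out of the tuner's _running_params dict, so it looks like a final result is arriving for an ID the tuner is no longer tracking. A toy sketch of the mechanism (illustrative only, not NNI's code; the duplicate or out-of-order delivery trigger is my guess):

    # Minimal sketch of the failure mode at tpe_tuner.py line 197.
    # _running_params maps parameter_id to the params generated for that trial.
    _running_params = {7: {"lr": 0.01}}

    def receive_trial_result(parameter_id):
        # pop() removes the entry, so a second result for the same ID,
        # or a result for an ID that was never recorded, raises KeyError.
        return _running_params.pop(parameter_id)

    receive_trial_result(7)  # fine the first time
    receive_trial_result(7)  # KeyError: 7, the same error as in the log above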

Environment:

  • NNI version: 3.0
  • Training service (local|remote|pai|aml|etc): local
  • Client OS: ubuntu
  • Server OS (for remote mode only):
  • Python version: 3.10.8
  • PyTorch/TensorFlow version: N/A
  • Is conda/virtualenv/venv used?: conda
  • Is running in Docker?: no

Configuration:

  • Experiment config (remember to remove secrets!):
{
 "params": {
   "experimentType": "hpo",
   "searchSpaceFile": "/home/wolf/Dropbox/code/mouse-arm/examples/nni_arm_parameters/search_space.json",
   "trialCommand": "python nni_sweep.py",
   "trialCodeDirectory": "/home/wolf/Dropbox/code/mouse-arm/examples/nni_arm_parameters",
   "trialConcurrency": 5,
   "useAnnotation": false,
   "debug": false,
   "logLevel": "info",
   "experimentWorkingDirectory": "/home/wolf/nni-experiments",
   "tuner": {
     "name": "TPE",
     "classArgs": {
       "optimize_mode": "minimize"
     }
   },
   "trainingService": {
     "platform": "local",
     "trialCommand": "python nni_sweep.py",
     "trialCodeDirectory": "/home/wolf/Dropbox/code/mouse-arm/examples/nni_arm_parameters",
     "debug": false,
     "maxTrialNumberPerGpu": 1,
     "reuseMode": false
   }
 },
 "execDuration": "13m 8s",
 "nextSequenceId": 14,
 "revision": 95
}

I haven't created a minimal reproducible example yet; I'm hoping someone might recognize this problem, as it seems pretty basic and may just be a version issue somewhere.
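
In case it helps anyone trying to reproduce, the setup is essentially the NNI HPO quickstart with the concurrency bumped up. A stand-in along these lines (dummy trial script and placeholder search space, not my actual code) should exercise the same path:

    # nni_sweep.py: a stand-in trial that fetches params and reports a dummy final metric.
    import random
    import nni

    params = nni.get_next_parameter()
    nni.report_final_result(random.random())

and the launcher:

    # launch.py: start a local HPO experiment with trialConcurrency > 1,
    # using the standard Experiment API from the quickstart.
    from nni.experiment import Experiment

    search_space = {"x": {"_type": "uniform", "_value": [0, 1]}}

    experiment = Experiment('local')
    experiment.config.trial_command = 'python nni_sweep.py'
    experiment.config.trial_code_directory = '.'
    experiment.config.search_space = search_space
    experiment.config.tuner.name = 'TPE'
    experiment.config.tuner.class_args['optimize_mode'] = 'minimize'
    experiment.config.trial_concurrency = 5
    experiment.config.max_trial_number = 50
    experiment.run(8080)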

studywolf changed the title from "Can't run more than n trials with trialConcurrency > 1" to "Can't run more than n trials with trialConcurrency=n > 1" on Sep 30, 2023

igodrr commented Oct 2, 2023

I encountered the same problem. Sometimes it stopped after about ten trials, and sometimes after more than 100. I haven't found what causes it.


cehw commented Oct 4, 2023

I also have a similar problem.


kv-42 commented Nov 13, 2023

I have the same issue as well and am looking forward to a solution.

Environment:
NNI version: 3.0
Training service (local|remote|pai|aml|etc): local
Client OS: ubuntu 22.04.3
Server OS (for remote mode only):
Python version: 3.10.13
PyTorch/TensorFlow version: PyTorch 2.1.0
Is conda/virtualenv/venv used?: virtualenv
Is running in Docker?: no


wby13 commented Nov 30, 2023

Same issue.

[2023-11-29 21:52:30] ERROR (nni.runtime.msg_dispatcher_base/Thread-1) 1
Traceback (most recent call last):
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
    self._handle_final_metric_data(data)
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
    self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
    params = self._running_params.pop(parameter_id)
KeyError: 1
[2023-11-29 21:52:31] DEBUG (websockets.client/NNI-WebSocketEventLoop) < TEXT '{"type":"EN","content":"{\"trial_job_id\":\"..._index\\\": 0}\"}"}' [402 bytes]
[2023-11-29 21:52:31] DEBUG (websockets.client/NNI-WebSocketEventLoop) < TEXT '{"type":"GE","content":"1"}' [27 bytes]
[2023-11-29 21:52:31] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) < PING '' [0 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) > PONG '' [0 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) % sending keepalive ping
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) > PING c8 af a3 c2 [binary, 4 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) < PONG c8 af a3 c2 [binary, 4 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) % received keepalive pong
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) > TEXT '{"type": "bye"}' [17 bytes]
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) = connection is CLOSING
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) > CLOSE 4000 (private use) client intentionally close [28 bytes]
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) < CLOSE 4000 (private use) client intentionally close [28 bytes]
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) = connection is CLOSED
[2023-11-29 21:52:34] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated

I set trial_concurrency=8 and it always stopped after 10–14 trials.

@XYxiyang

Same issue: I set trial_concurrency=16 and it stopped at ~20 trials, with the dispatcher terminated.


arvoelke commented Jan 20, 2024

Same issue here on the latest version of NNI. How many trials it gets through seems random each time. It always ends with

    params = self._running_params.pop(parameter_id)
KeyError: ...

in dispatcher.log.

I think the problem went away after downgrading to nni<3.
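
(For reference, that downgrade is just pip install 'nni<3' in the experiment's environment.)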

@datngo93

I faced the same problem; in my case, a stopgap solution was to use the "Anneal" tuner instead of the "TPE" tuner.
Hope it helps!
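
If you're using the Python API, the only change should be the tuner block (assuming the same Experiment object as in the quickstart sketch above; everything else stays the same):

    # Swap the tuner from TPE to Anneal as a workaround.
    experiment.config.tuner.name = 'Anneal'
    experiment.config.tuner.class_args['optimize_mode'] = 'minimize'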


studywolf commented Jan 25, 2024

I found that anything above version 2.5 gives me the problem; version 2.5 has been okay up to the hard-coded memory limit (roughly 45k trials).
