Failed to establish a new connection #3496

Open
Roy-Kid opened this issue Mar 29, 2021 · 33 comments

@Roy-Kid

Roy-Kid commented Mar 29, 2021

I am trying to use NNI on the HPC cluster at our school. The code works on my own computer. The HPC has many compute nodes, and we have to submit tasks from the manager node. But this error is raised:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=17513): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2b352d9c6f28>: Failed to establish a new connection: [Errno 111] Connection refused',))

I think it might be related to the URL. Maybe I should use nniManagerIP to fix this problem? What host should I specify?

@SparkSnail
Contributor

Hi @Roy-Kid, are you using remote mode to submit the job? Could you share the full content of your nniManager.log?

@Roy-Kid
Author

Roy-Kid commented Apr 1, 2021

> Hi @Roy-Kid, are you using remote mode to submit the job? Could you share the full content of your nniManager.log?

Hi, the experiment fails at the very beginning, so the log folder cannot be created. Here are the errors that are raised:

[2021-04-01 23:06:50] Timeout, retry...
[2021-04-01 23:06:51] Create experiment failed
Traceback (most recent call last):
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 170, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 200, in connect
    conn = self._new_conn()
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 182, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f3069207630>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=17513): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3069207630>: Failed to establish a new connection: [Errno 111] Connection refused',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "launch.py", line 32, in <module>
    experiment.run(17513)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/experiment.py", line 156, in run
    self.start(port, debug)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/experiment.py", line 112, in start
    self._proc = launcher.start_experiment(self.id, self.config, port, debug)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/launcher.py", line 51, in start_experiment
    raise e
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/launcher.py", line 38, in start_experiment
    _check_rest_server(port)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/launcher.py", line 145, in _check_rest_server
    rest.get(port, '/check-status')
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/rest.py", line 26, in get
    return request('get', port, api)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/rest.py", line 16, in request
    resp = requests.request(method, url, timeout=timeout)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=17513): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3069207630>: Failed to establish a new connection: [Errno 111] Connection refused',))
[2021-04-01 23:06:52] Stopping experiment, please wait...
[2021-04-01 23:06:52] Experiment stopped

PS:
After local mode failed, I tried remote mode to run the experiment on our school's HPC. This time the experiment was created successfully, but the trials stay in the Running state in the WebUI. I checked the job queue and found no unfinished jobs. Our tasks have to be submitted to the HPC through a bash script, so I set the trial command to "bsub < work.lsf", but no task was submitted. So, by the way, how should NNI be used in this situation?

@SparkSnail
Contributor

Hi @Roy-Kid, from the error information, it seems NNI fails to connect to the local service at localhost:17513. Could you please make sure port 17513 is available in your environment? You can use nnictl create --config {config_path} --port {port_number} to set a different port when creating a new experiment.
In remote mode, do you mean that NNI submits the job successfully but the trial status is stuck in the Running state? Could you start the experiment with nnictl create --config {config_path} --debug and provide the nniManager.log file here?
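
Before picking another port, it can help to check whether anything is listening on the one in question. The following is a minimal sketch using only the Python standard library. The helper name `port_in_use` is hypothetical and is not part of NNI; 17513 is the port from the traceback above.

```python
import socket


def port_in_use(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError ([Errno 111]) and timeouts.
        return False


if __name__ == "__main__":
    print("port 17513 in use:", port_in_use(17513))
```

If this returns True before NNI is launched, another process already holds the port. If it returns False right after the failure, nothing was listening, which matches the "Connection refused" error and suggests the NNI server process never came up.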

@kvartet
Contributor

kvartet commented Jun 10, 2021

Hello @Roy-Kid, could you follow up on this and update the status of the issue? Thank you!

> Hi @Roy-Kid, from the error information, it seems NNI fails to connect to the local service at localhost:17513. Could you please make sure port 17513 is available in your environment? You can use nnictl create --config {config_path} --port {port_number} to set a different port when creating a new experiment.
> In remote mode, do you mean that NNI submits the job successfully but the trial status is stuck in the Running state? Could you start the experiment with nnictl create --config {config_path} --debug and provide the nniManager.log file here?

@Roy-Kid
Author

Roy-Kid commented Jun 10, 2021

Hi, @SparkSnail @kvartet!
I have left the institute and no longer use the HPC, so I can hardly test the new version. Sorry about that. Once I have the chance, I will try it ASAP.

What confused me is that we submit tasks through a queue system like PBS, so I don't know how to write the script so that the trials run on the compute nodes rather than on the management node. If you have any ideas, please update the tutorial :-) It would be very helpful for those who are not familiar with Linux!

Thanks again for your selfless help!

@wuyong-hdu

We have the same problem.
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused'))

@wuyong-hdu

The details:
(pytorch) wy@Tiger:~/mnist-pytorch$ nnictl create --config config_windows.yml
[2022-06-09 13:32:46] Creating experiment, Experiment ID: k5doghe7
[2022-06-09 13:32:46] Starting web server...
[2022-06-09 13:32:47] WARNING: Timeout, retry...
[2022-06-09 13:32:48] WARNING: Timeout, retry...
[2022-06-09 13:32:49] ERROR: Create experiment failed
Traceback (most recent call last):
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
raise err
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1036, in _send_output
self.send(msg)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 976, in send
self.connect()
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
conn = self._new_conn()
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/adapters.py", line 450, in send
timeout=timeout
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 786, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/wy/.local/bin/nnictl", line 8, in <module>
sys.exit(parse_args())
File "/home/wy/.local/lib/python3.7/site-packages/nni/tools/nnictl/nnictl.py", line 497, in parse_args
args.func(args)
File "/home/wy/.local/lib/python3.7/site-packages/nni/tools/nnictl/launcher.py", line 92, in create_experiment
exp.start(port, debug, run_mode)
File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/experiment.py", line 117, in start
self._proc = launcher.start_experiment(self._action, self.id, config, port, debug, run_mode, self.url_prefix)
File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/launcher.py", line 119, in start_experiment
raise e
File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/launcher.py", line 97, in start_experiment
_check_rest_server(port, url_prefix=url_prefix)
File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/launcher.py", line 258, in _check_rest_server
rest.get(port, '/check-status', url_prefix)
File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/rest.py", line 43, in get
return request('get', port, api, prefix=prefix)
File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/rest.py", line 31, in request
resp = requests.request(method, url, timeout=timeout)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/sessions.py", line 529, in request
resp = self.send(prep, **send_kwargs)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/sessions.py", line 645, in send
r = adapter.send(request, **kwargs)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/adapters.py", line 519, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused'))

@xztcwang

xztcwang commented Jul 1, 2022

We have the same issue:

Reference: https://nni.readthedocs.io/en/stable/reference/experiment_config.html
[2022-06-30 21:45:45] Creating experiment, Experiment ID: in59ltr2
[2022-06-30 21:45:45] Starting web server...
[2022-06-30 21:45:46] WARNING: Timeout, retry...
[2022-06-30 21:45:47] WARNING: Timeout, retry...
[2022-06-30 21:45:48] ERROR: Create experiment failed
Traceback (most recent call last):
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
raise err
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 1036, in _send_output
self.send(msg)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 976, in send
self.connect()
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
conn = self._new_conn()
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f0353f50e10>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/adapters.py", line 499, in send
timeout=timeout,
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connectionpool.py", line 786, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=7008): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0353f50e10>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/anaconda3/envs/flowtorch_config/bin/nnictl", line 8, in <module>
sys.exit(parse_args())
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/tools/nnictl/nnictl.py", line 497, in parse_args
args.func(args)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/tools/nnictl/launcher.py", line 91, in create_experiment
exp.start(port, debug, RunMode.Detach)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/experiment.py", line 135, in start
self._start_impl(port, debug, run_mode, None, [])
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/experiment.py", line 104, in _start_impl
self.url_prefix, tuner_command_channel, tags)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/launcher.py", line 147, in start_experiment
raise e
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/launcher.py", line 125, in start_experiment
_check_rest_server(port, url_prefix=url_prefix)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/launcher.py", line 195, in _check_rest_server
rest.get(port, '/check-status', url_prefix)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/rest.py", line 43, in get
return request('get', port, api, prefix=prefix)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/rest.py", line 31, in request
resp = requests.request(method, url, timeout=timeout)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/adapters.py", line 565, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=7008): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0353f50e10>: Failed to establish a new connection: [Errno 111] Connection refused'))

@SparkSnail
Contributor

Add @liuzhe-lz for help.

SparkSnail removed their assignment Jul 4, 2022
@wmbai

wmbai commented Jul 9, 2022

Hi, everyone!
I hit the same problem when running my code with NNI v2.8.
However, the same code works with NNI v2.5.
Installing NNI v2.5 might be a workaround, and I hope someone can find out what is wrong in the newest version.

@chengpr

chengpr commented Aug 8, 2022

Rolling back to version 2.5 worked for me; the error went away.

@scarlett2018
Member

> Hi, everyone! I hit the same problem when running my code with NNI v2.8. However, the same code works with NNI v2.5. Installing NNI v2.5 might be a workaround, and I hope someone can find out what is wrong in the newest version.

Thanks @wmbai.

@liuzhe-lz - cc scrum master @ultmaster - this might be a regression in v2.8.

@xiangtaowong

xiangtaowong commented Sep 13, 2022

I got the same error in v2.9.

@Lijiaoa
Contributor

Lijiaoa commented Sep 19, 2022

Hi @xiangtaowong, it looks like the same error was solved in issue #5126, is that right?

@xiangtaowong

xiangtaowong commented Sep 19, 2022

> Hi @xiangtaowong, it looks like the same error was solved in issue #5126, is that right?

Yes, I got the same error. I followed the suggestion there and moved all the data and output paths to /home, off the remote disk, and sometimes it works.
But sometimes it still doesn't work. Another cause may be a change to the experimentWorkingDirectory item in config.yml. You could also look at @szhang963's HighEffiNNI for some possible results.

@JuliaWasala

Is there a solution to this? I'm not using a config.yml file; I set the configuration in the Python script (as in the Hello NAS example). A week or so ago I was able to start the web server on my institute's cluster, but now I keep getting the same error.

@ultmaster
Contributor

As of v2.10, this error generally means "NNI manager fails to start" (since NNI manager is running in another process, we have trouble displaying the real reason why it fails to start. We can only tell that we can't connect to that process.)

To see the real exception, please go to ~/nni-experiments/<experiment_id> and check the logs inside.
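
A small helper can make that check quicker. This is only a sketch: it assumes the default `~/nni-experiments` location mentioned above, searches recursively because the exact layout inside the experiment folder may differ between versions, and uses hypothetical helper names (`find_experiment_logs`, `tail`) that are not part of NNI.

```python
from pathlib import Path
from typing import List, Optional


def find_experiment_logs(experiment_id: str, root: Optional[Path] = None) -> List[Path]:
    """List every *.log file under <root>/<experiment_id>, sorted by path."""
    root = Path.home() / "nni-experiments" if root is None else Path(root)
    exp_dir = root / experiment_id
    if not exp_dir.is_dir():
        return []
    return sorted(exp_dir.rglob("*.log"))


def tail(path: Path, n: int = 20) -> str:
    """Return the last n lines of a text file."""
    lines = path.read_text(errors="replace").splitlines()
    return "\n".join(lines[-n:])


if __name__ == "__main__":
    # "k5doghe7" is an experiment ID taken from a log earlier in this thread.
    for log in find_experiment_logs("k5doghe7"):
        print(f"==== {log} ====")
        print(tail(log))
```

If the list is empty or contains only the Python-side log, the manager process probably exited before it could write anything.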

@JuliaWasala

JuliaWasala commented Jan 27, 2023 via email

I get the same error when I try to view a previous experiment with nnictl view. I have some experiment files from the couple of days when I was able to start the web server. The nnictl logs don't show much; the following was added to experiment.log:

[2023-01-18 15:17:39] INFO (nni.nas.experiment.pytorch) Stopping experiment, please wait...
[2023-01-27 10:28:09] INFO (nni.experiment) Creating experiment, Experiment ID: 8nfh3acj
[2023-01-27 10:28:09] INFO (nni.experiment) Starting web server...
[2023-01-27 10:28:10] WARNING (nni.experiment) Timeout, retry...
[2023-01-27 10:28:11] WARNING (nni.experiment) Timeout, retry...
[2023-01-27 10:28:12] ERROR (nni.experiment) Create experiment failed

If I try to start a fresh experiment, it only creates a log directory with a single experiment.log file, which contains the same output as above and nothing else. Is there another place I can look to find the real source of the error?

@ultmaster
Contributor

Can you find an nnimanager.log? experiment.log wasn't really helpful because it also comes from the Python side.

@JuliaWasala

None of the experiments that failed with the "failed to establish connection" error have a nnimanager.log; the only file in those experiment folders is the experiment.log. If I use nnictl view to view a previous experiment, nothing is added to the pre-existing nnimanager.log

@wangyanhao0517

The same issue "ConnectionRefusedError: [Errno 111] Connect call failed ('127.0.0.1', 8088)"

@Lijiaoa
Contributor

Lijiaoa commented Mar 13, 2023

v3.0 will fix this issue; please wait for the new release of NNI.

@LeiWang1999

@Lijiaoa when will v3.0 be released? I got the same issue..

@Lijiaoa
Contributor

Lijiaoa commented Mar 20, 2023

#5418 (comment)

@why-in-Shanghaitech

I have a simple fix for this issue: give it more retries.

_check_rest_server(port, url_prefix=url_prefix)

Change the line into the following:

_check_rest_server(port, retry=30, url_prefix=url_prefix)

Many people may work on a cluster without sufficient CPU resources. 3 seconds might be too strict to start a server.
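
The same idea, giving a slow machine more time before giving up, can be sketched independently of NNI's internals. The function below is not NNI's `_check_rest_server`; `wait_for_http` is a hypothetical helper built on the standard library, and the retry count and interval are illustrative values.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url: str, retries: int = 30, interval: float = 1.0) -> bool:
    """Poll url until it answers with HTTP 200, or give up after `retries` tries."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not up yet (for example "Connection refused"); retry.
            pass
        if attempt < retries - 1:
            time.sleep(interval)
    return False
```

For example, `wait_for_http("http://localhost:8080/api/v1/nni/check-status", retries=30)` polls the status endpoint seen in the error messages above for roughly 30 seconds instead of 3. More retries only help when the server is slow to start; if the manager process has crashed, the call still returns False.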

@LeiWang1999

thanks for sharing @why-in-Shanghaitech @Lijiaoa

@ShrutiSarikaChakraborty

Hi, I am facing the issue on the latest version. Any suggestions?
ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)

Lijiaoa mentioned this issue May 15, 2023
@lukelu312

> Hi, I am facing the issue on the latest version. Any suggestions? ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)

I got the same issue: the experiment stops unexpectedly because the connection is lost. Is there any solution? I hope to hear from you, @ShrutiSarikaChakraborty.

@ShrutiSarikaChakraborty

ShrutiSarikaChakraborty commented Jul 6, 2023 via email

Hello, I just switched to the legacy version. Thanks, Shruti

@lukelu312

Which legacy version are you using, v2.10.1 or a lower one? Thanks for your reply @ShrutiSarikaChakraborty

@xinlnix

xinlnix commented Aug 11, 2023

> [...] same code works successfully with nni (v. 2.5).
> It might be a solution to install nni v.2.5 and I also hope someone can find out what's wrong in the newest version

It doesn't work for my code. :(

@ranranrannervous

I found that version 3.0 still has this problem

@ChenchenHu007

The same issue occurs in the nni_hello_hpo of version 3.0
