Failed to connect to PAI http code:500 #1076

Closed
Crysple opened this issue May 14, 2019 · 2 comments
Labels
bug, nnidev

Comments

@Crysple
Contributor

Crysple commented May 14, 2019

Short summary about the issue/question:

Brief what process you are following:

How to reproduce it:

nni Environment:

  • nni version:
  • nni mode(local|pai|remote):
  • OS:
  • python version:
  • is conda or virtualenv used?:
  • is running in docker?:

need to update document(yes/no):

Anything else we need to know:

Error info:

[5/12/2019, 1:10:07 AM] ERROR [ 'PAI Training service: get job info for trial Y7TZR from PAI Cluster failed!' ]
[5/12/2019, 1:10:08 AM] ERROR [ 'Submit trial XPxTn failed, http code:500, http body: [object Object]' ]
[5/12/2019, 1:10:08 AM] ERROR [ 'Error: Submit trial XPxTn failed, http code:500, http body: [object Object]\n    at Request.request [as _callback] (/data/home/v-zejlin/.conda/envs/pynni/nni/training_service/pai/paiTrainingService.js:322:33)\n    at Request.self.callback (/data/home/v-zejlin/.conda/envs/pynni/nni/node_modules/request/request.js:185:22)\n    at Request.emit (events.js:182:13)\n    at Request.<anonymous> (/data/home/v-zejlin/.conda/envs/pynni/nni/node_modules/request/request.js:1161:10)\n    at Request.emit (events.js:182:13)\n    at IncomingMessage.<anonymous> (/data/home/v-zejlin/.conda/envs/pynni/nni/node_modules/request/request.js:1083:12)\n    at Object.onceWrapper (events.js:273:13)\n    at IncomingMessage.emit (events.js:187:15)\n    at endReadableNT (_stream_readable.js:1094:12)\n    at process._tickCallback (internal/process/next_tick.js:63:19)' ]
[5/12/2019, 1:10:08 AM] INFO [ 'Change NNIManager status from: TUNER_NO_MORE_TRIAL to: ERROR' ]

Two experiments hit the same error. After I decreased the PAI token update interval (from the original 2 hours to half an hour), the problem went away.

Root cause analysis:
There are two types of 500 errors that we know of so far: trial failures and experiment failures. For a trial failure, as in this issue, we will catch the failure, add an NNI error log, and fail the trial without failing the entire experiment. For an experiment failure, we will fail the experiment and add an NNI error log.
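
The handling described above might be sketched roughly as follows. This is only an illustration under assumptions, not the actual NNI implementation: `FailureScope`, `PaiFailure`, and `ExperimentState` are hypothetical names, and how a 500 response is classified as trial-level or experiment-level is left to the caller because the issue does not specify it.

```typescript
// Hypothetical types for illustration; none of these names come from the NNI code base.
type FailureScope = 'trial' | 'experiment';

interface PaiFailure {
    scope: FailureScope;   // assumed to be decided by the caller
    trialId?: string;      // set for trial-level failures
    httpCode: number;
    message: string;
}

class ExperimentState {
    public status: 'RUNNING' | 'ERROR' = 'RUNNING';
    public readonly failedTrials: string[] = [];
    public readonly errorLog: string[] = [];

    public handleFailure(failure: PaiFailure): void {
        // Both kinds of failure produce an error log entry.
        this.errorLog.push(`http code:${failure.httpCode}, ${failure.message}`);

        if (failure.scope === 'trial' && failure.trialId !== undefined) {
            // Trial failure: mark only this trial as failed; the experiment keeps running.
            this.failedTrials.push(failure.trialId);
            return;
        }

        // Experiment failure: the whole experiment moves to ERROR.
        this.status = 'ERROR';
    }
}
```

With this shape, a failed submit for one trial would leave `status` at `RUNNING` and only record the trial ID, while an experiment-level failure would flip `status` to `ERROR`.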

@Crysple
Contributor Author

Crysple commented May 15, 2019

Sorry, decreasing the interval did not help after all. It failed again.

scarlett2018 added this to the May 2019 Release milestone May 20, 2019
scarlett2018 added the bug label May 20, 2019
@scarlett2018
Member

It can sometimes be reproduced by running over 100 trials. We first need to expose more logs from PAI to indicate what the root cause is.
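
One reason the current log says so little is that `http body: [object Object]` is what JavaScript produces when a plain object is interpolated into a string. A minimal sketch of serializing the body before logging it is shown below. `formatHttpBody` and `submitFailureMessage` are hypothetical helpers, not functions from NNI, and the body in the usage comment is made up, since the real PAI response is not shown in this issue.

```typescript
// Hypothetical helper: render an HTTP response body as readable text for a log line.
function formatHttpBody(body: unknown): string {
    if (typeof body === 'string') {
        return body;
    }
    try {
        // JSON.stringify returns undefined for values such as `undefined` or functions.
        const text = JSON.stringify(body);
        return text === undefined ? String(body) : text;
    } catch {
        // JSON.stringify throws on circular structures; fall back to the default conversion.
        return String(body);
    }
}

// Hypothetical helper that mirrors the wording of the error message in the log above.
function submitFailureMessage(trialId: string, httpCode: number, body: unknown): string {
    return `Submit trial ${trialId} failed, http code:${httpCode}, http body: ${formatHttpBody(body)}`;
}

// Example with a made-up body:
// submitFailureMessage('XPxTn', 500, { code: 'ExampleError' })
// returns 'Submit trial XPxTn failed, http code:500, http body: {"code":"ExampleError"}'
```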
