ValueError: Out of range float values are not JSON compliant #1589

Closed
martsalz opened this issue Oct 8, 2019 · 6 comments · Fixed by #1958
Labels
bug Something isn't working NNI SDK user raised

Comments

@martsalz

martsalz commented Oct 8, 2019

Short summary about the issue/question:

When executing an experiment, the following error message appears for some trials:

ValueError: Out of range float values are not JSON compliant

What's the reason for this?

nni Environment:

  • nni version: v0.8-320-g421065b
  • nni mode(local|pai|remote): local
  • OS: CentOS 7
  • python version: Python 3.6
  • is conda or virtualenv used?: virtualenv
  • is running in docker?: no

Anything else we need to know:

stderr:

Using TensorFlow backend.
Traceback (most recent call last):
  File "test.py", line 164, in <module>
    train(ARGS, RECEIVED_PARAMS)
  File "test.py", line 136, in train
    validation_data=(x_test, y_test), callbacks=[SendMetrics(), TensorBoard(log_dir=TENSORBOARD_DIR)])
  File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/engine/training.py", line 1178, in fit
    validation_freq=validation_freq)
  File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/engine/training_arrays.py", line 224, in fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/callbacks.py", line 152, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "test.py", line 119, in on_epoch_end
    nni.report_intermediate_result(logs["val_loss"])
  File "/home/msalz/nni/build/nni/trial.py", line 81, in report_intermediate_result
    'value': metric
  File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/json_tricks/nonp.py", line 99, in dumps
    primitives=primitives, fallback_encoders=fallback_encoders, **jsonkwargs).encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
ValueError: Out of range float values are not JSON compliant

Experiment config:

authorName: default
experimentName: example_mnist
maxExecDuration: 2h
maxTrialNum: 1000
trialConcurrency: 2
localConfig:
    useActiveGpu: true
    maxTrialNumPerGpu: 2
#choice: local, remote, pai
trainingServicePlatform: local
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
advisor:
  #choice: Hyperband
  builtinAdvisorName: BOHB
  classArgs:
    min_budget: 4
    max_budget: 32
    #eta: proportion of discarded trials
    eta: 2
    #choice: maximize, minimize
    optimize_mode: minimize
trial:
  command: python3 test.py
  codeDir: .
  gpuNum: 1


@chicm-ms
Contributor

chicm-ms commented Oct 9, 2019

@martsalz Hi, can you log the value of logs["val_loss"]? I suspect that the value of logs["val_loss"] is out of range for the failed trial. Sometimes, if the learning rate is too large, the loss value can be 'nan'.
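
For context, the last frames of the traceback are in Python's standard `json` encoder, which raises exactly this `ValueError` when it is asked to encode `nan` or `inf` in strict mode. The sketch below reproduces the message with the standard library alone; it assumes that the encoder ends up running with `allow_nan=False`, which is inferred from the traceback rather than from the json_tricks source. `try_encode` is a hypothetical helper name.

```python
import json
import math


def try_encode(value):
    """Return the JSON text for value, or the error text if strict encoding fails."""
    try:
        # allow_nan=False makes the encoder reject nan/inf,
        # which the JSON specification cannot represent.
        return json.dumps({"value": value}, allow_nan=False)
    except ValueError as err:
        return "ValueError: " + str(err)


print(try_encode(0.25))          # an ordinary float encodes fine
print(try_encode(float("nan")))  # nan is rejected in strict mode
print(try_encode(math.inf))      # inf is rejected as well
```

So a `nan` metric is enough to trigger the error, regardless of which trial produced it.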

@suiguoxin suiguoxin added the bug Something isn't working label Oct 18, 2019
@ultmaster ultmaster added waiting user confirm and removed bug Something isn't working labels Oct 20, 2019
@martsalz
Author

Because the experiments take a few hours or days and the error occurs only sporadically, it is not easy to reproduce.

@martsalz
Author

@chicm-ms Yes, logs["val_loss"] returns the value nan:

[11/22/2019, 12:05:09 PM] PRINT - ETA: 0s - loss: nan
[11/22/2019, 12:05:09 PM] PRINT 499/500 [============================>.]
[11/22/2019, 12:05:09 PM] PRINT - ETA: 0s - loss: nan
[11/22/2019, 12:05:13 PM] PRINT 500/500 [==============================]
[11/22/2019, 12:05:13 PM] PRINT - 118s 236ms/step - loss: nan - val_loss: nan

[11/22/2019, 12:05:13 PM] ERROR (mnist_keras/MainThread) Out of range float values are not JSON compliant
Traceback (most recent call last):
  File "test.py", line 76, in <module>
    train(ARGS, RECEIVED_PARAMS)
  File "test.py", line 63, in train
    validation_steps=val_steps)
  File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/engine/training.py", line 1658, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/engine/training_generator.py", line 255, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/callbacks.py", line 152, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "test.py", line 41, in on_epoch_end
    nni.report_intermediate_result(logs["val_loss"])
  File "/home/msalz/nni/build/nni/trial.py", line 84, in report_intermediate_result
    'value': metric
  File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/json_tricks/nonp.py", line 99, in dumps
    primitives=primitives, fallback_encoders=fallback_encoders, **jsonkwargs).encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
ValueError: Out of range float values are not JSON compliant
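
Until a fix lands, one possible workaround on the trial side is to check the metric before reporting it. This is only a sketch, not the change made in the NNI fix: `report_if_finite` and its `fallback` parameter are hypothetical names, and `report` stands in for a reporting function such as `nni.report_intermediate_result` so that the example runs without NNI installed.

```python
import math


def report_if_finite(metric, report, fallback=None):
    """Call report(metric) only when metric is a finite number.

    report: stand-in for a reporting function (hypothetical parameter).
    fallback: optional placeholder sent instead of nan/inf; None means skip.
    Returns True if something was reported.
    """
    value = float(metric)
    if math.isfinite(value):
        report(value)
        return True
    if fallback is not None:
        report(fallback)
        return True
    return False


# Usage: collect the reported values in a list instead of sending them to NNI.
sent = []
report_if_finite(0.42, sent.append)                        # reported as-is
report_if_finite(float("nan"), sent.append)                # skipped
report_if_finite(float("inf"), sent.append, fallback=1e6)  # replaced by the fallback
print(sent)
```

Whether skipping the report or sending a large placeholder is the better choice depends on the tuner; with `optimize_mode: minimize`, a large placeholder would at least mark the trial as a poor one.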

@martsalz
Author

How can this bug be fixed quickly? In my experiment with 173 training runs, 65 failed due to this error. At 8 hours per model, that is very frustrating.

@chicm-ms
Contributor

chicm-ms commented Dec 5, 2019

We are trying to fix this with PR #1958.

A learning rate that is too large can lead to a nan loss value. A quick fix is to check your trial code / search space and set the learning rate to a smaller value.

Since the loss value of the failed jobs is nan, the hyperparameters of those jobs would not be the best even if the jobs had not failed.
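
To illustrate how an overly large learning rate ends in a nan loss, here is a toy sketch: plain gradient descent on f(x) = x^2, unrelated to the Keras model in this issue. `final_loss` is a hypothetical helper and the two learning rates are chosen only for the demonstration.

```python
import math


def final_loss(learning_rate, steps=2000):
    """Run gradient descent on f(x) = x^2 from x = 1 and return the last loss."""
    x = 1.0
    for _ in range(steps):
        grad = 2.0 * x                # derivative of x^2
        x = x - learning_rate * grad  # each step multiplies x by (1 - 2 * learning_rate)
    return x * x                      # float multiplication overflows to inf, it does not raise


print(math.isfinite(final_loss(0.1)))  # small step size: x shrinks towards 0
print(math.isnan(final_loss(1.5)))     # large step size: x overflows to inf, then inf - inf gives nan
```

With a step size of 1.5, the magnitude of x doubles on every step until it overflows, and the next update computes inf - inf, which is nan. Once that happens, every later loss value is nan as well.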

@chicm-ms
Contributor

Closing this issue since the problem is fixed in nni v1.4. @martsalz, you can check our latest nni version.

@scarlett2018 scarlett2018 added bug Something isn't working NNI SDK and removed waiting user confirm labels Apr 16, 2020