
Mismatched hyperparameters between web server display and their actual values #5726

Open
WenjieDu opened this issue Dec 27, 2023 · 3 comments



WenjieDu commented Dec 27, 2023

Describe the issue:

Environment:

  • NNI version: 3.0
  • Training service (local|remote|pai|aml|etc): local
  • Client OS: Ubuntu 20.04.4 LTS (GNU/Linux 5.13.0-30-generic x86_64)
  • Server OS (for remote mode only):
  • Python version: 3.11
  • PyTorch/TensorFlow version: 2.1.2
  • Is conda/virtualenv/venv used?: Conda
  • Is running in Docker?: No

Configuration:

  • Experiment config (remember to remove secrets!):
experimentName: MRNN hyper-param searching
authorName: WenjieDu
trialConcurrency: 1
trainingServicePlatform: local
searchSpacePath: MRNN_ETTm1_tuning_space.json
multiThread: true
useAnnotation: false
tuner:
    builtinTunerName: Random

trial:
    command: enable_tuning=1 pypots-cli tuning --model pypots.imputation.MRNN --train_set ../../data/ettm1/train.h5 --val_set ../../data/ettm1/val.h5
    codeDir: .
    gpuNum: 1

localConfig:
    useActiveGpu: true
    maxTrialNumPerGpu: 20
    gpuIndices: 3
  • Search space:
{
  "n_steps":  {"_type":"choice","_value":[60]},
  "n_features":  {"_type":"choice","_value":[7]},
  "patience":  {"_type":"choice","_value":[10]},
  "epochs":  {"_type":"choice","_value":[200]},
  "rnn_hidden_size":  {"_type":"choice","_value":[16,32,64,128,256,512]},
  "lr":{"_type":"loguniform","_value":[0.0001,0.01]}
}
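For context on the `lr` entry above: NNI's `loguniform` search type draws values uniformly on a log scale between the two bounds, so samples like 0.0008 and 0.005 are both plausible draws from `[0.0001, 0.01]`. A minimal sketch of that sampling (an illustration, not NNI's actual implementation; `sample_loguniform` is a hypothetical helper):

```python
import math
import random

def sample_loguniform(low: float, high: float, rng: random.Random) -> float:
    """Sample uniformly on a log scale, as NNI's loguniform search type does."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
lr = sample_loguniform(0.0001, 0.01, rng)
# Every draw stays within the configured bounds.
assert 0.0001 <= lr <= 0.01
```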

Log message:

  • nnimanager.log:
[2023-12-27 16:16:42] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 7,
  hyperParameters: {
    value: '{"parameter_id": 7, "parameter_source": "algorithm", "parameters": {"n_steps": 60, "n_features": 7, "patience": 10, "epochs": 200, "rnn_hidden_size": 32, "lr": 0.0008698020401037771}, "parameter_index": 0}',
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
[2023-12-27 16:16:42] INFO (LocalV3.local) Created trial XsB6F
  • dispatcher.log:
[2023-12-27 16:15:06] INFO (numexpr.utils/MainThread) Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2023-12-27 16:15:06] INFO (numexpr.utils/MainThread) Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
[2023-12-27 16:15:06] INFO (numexpr.utils/MainThread) NumExpr defaulting to 8 threads.
[2023-12-27 16:15:06] INFO (nni.tuner.random/MainThread) Using random seed 220808582
[2023-12-27 16:15:06] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2023-12-27 16:15:06] INFO (nni.runtime.msg_dispatcher/Thread-1 (command_queue_worker)) Initial search space: {'n_steps': {'_type': 'choice', '_value': [60]}, 'n_features': {'_type': 'choice', '_value': [7]}, 'patience': {'_type': 'choice', '_value': [10]}, 'epochs': {'_type': 'choice', '_value': [200]}, 'rnn_hidden_size': {'_type': 'choice', '_value': [16, 32, 64, 128, 256, 512]}, 'lr': {'_type': 'loguniform', '_value': [0.0001, 0.01]}}
  • nnictl stdout and stderr:
2023-12-27 16:16:44 [INFO]: Have set the random seed as 2204 for numpy and pytorch.
2023-12-27 16:16:44 [INFO]: The tunner assigns a new group of params: {'n_steps': 60, 'n_features': 7, 'patience': 10, 'epochs': 200, 'rnn_hidden_size': 256, 'lr': 0.0054442307300676335}
2023-12-27 16:16:45 [INFO]: No given device, using default device: cuda
2023-12-27 16:16:45 [WARNING]: ‼️ saving_path not given. Model files and tensorboard file will not be saved.
2023-12-27 16:16:48 [INFO]: MRNN initialized with the given hyperparameters, the number of trainable parameters: 401,619
2023-12-27 16:16:48 [INFO]: Option lazy_load is set as False, hence loading all data from file...
2023-12-27 16:16:52 [INFO]: Epoch 001 - training loss: 1.3847, validating loss: 1.3214

How to reproduce it?:

Note that in nnimanager.log, the lr of trial XsB6F is 0.0008698020401037771, and this is also the value displayed on the local web page. However, in the nnictl stdout log, the lr actually received by the model is 0.0054442307300676335, so the two are mismatched. This is not an isolated case: for some trials, the hyperparameters that nnimanager reports differ from the values the model actually receives, while other trials match and are fine.
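The mismatch can be checked mechanically by parsing the JSON payload that NNIManager logs under `hyperParameters.value` and diffing it against the params the trial process printed. In the sketch below, `parse_manager_params` is a hypothetical helper, and both dicts are copied verbatim from the logs in this report:

```python
import json

def parse_manager_params(value_field: str) -> dict:
    """Parse the hyperParameters.value JSON string from nnimanager.log."""
    return json.loads(value_field)["parameters"]

# The value string as it appears in nnimanager.log for trial XsB6F
manager_value = (
    '{"parameter_id": 7, "parameter_source": "algorithm", '
    '"parameters": {"n_steps": 60, "n_features": 7, "patience": 10, '
    '"epochs": 200, "rnn_hidden_size": 32, "lr": 0.0008698020401037771}, '
    '"parameter_index": 0}'
)

# The params printed by the trial process in nnictl stdout
trial_params = {
    "n_steps": 60, "n_features": 7, "patience": 10, "epochs": 200,
    "rnn_hidden_size": 256, "lr": 0.0054442307300676335,
}

manager_params = parse_manager_params(manager_value)
# Collect every key whose value differs between the two sources.
mismatched = {
    k: (manager_params[k], trial_params[k])
    for k in manager_params
    if manager_params[k] != trial_params[k]
}
print(mismatched)
```

Running this on the two log excerpts shows that both `rnn_hidden_size` and `lr` differ, which suggests the trial received an entirely different parameter set than the one the web UI displays for it.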


axinbme commented Jan 11, 2024

I had the same problem.

@void-echo

Plus one 🤣

@WenjieDu (Author)

Seriously? Is nobody looking into this high-risk issue?
