
Why training/test reward value fluctuate a lot #28

Closed
quito418 opened this issue Jan 16, 2018 · 2 comments
@quito418

Hello, Hongzimao

While training, in TensorBoard or in /sim/results/log_test, the average reward value fluctuates a lot.

Is it supposed to be like this? Below is the test result of models saved every 100 epochs. (I used 0.5 as the initial entropy weight and decayed it by a factor of 0.99998.)

[figure: test reward of models saved every 100 epochs]
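For reference, an exponentially decayed entropy weight like the one described (start at 0.5, multiply by 0.99998 each training step) can be sketched as below. This is a hedged illustration, not Pensieve's actual code; the names `ENTROPY_INIT`, `ENTROPY_DECAY`, and the `ENTROPY_MIN` floor are assumptions for the sketch:

```python
ENTROPY_INIT = 0.5       # initial entropy regularization weight (as in the comment above)
ENTROPY_DECAY = 0.99998  # multiplicative decay per training step (as in the comment above)
ENTROPY_MIN = 0.01       # assumed floor so exploration never vanishes entirely

def entropy_weight(step):
    """Entropy coefficient after `step` training updates."""
    return max(ENTROPY_INIT * ENTROPY_DECAY ** step, ENTROPY_MIN)
```

With this schedule the weight stays near 0.5 early on and decays smoothly; after roughly a million steps it would hit the assumed floor.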

Second, how did you select the best model? Was it based on how high the test result was?

Lastly, how did you come to the conclusion that the model generalizes well in the real world? I'm concerned that on a new real-world test set, the test result could decrease.

Thank you!

@hongzimao
Owner

hongzimao commented Jan 16, 2018

In our experience the learning curve climbs more steadily. You might already have converged to a good policy based on this figure; can you compare against our pre-trained model?
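When judging whether a noisy reward curve has in fact converged, it can help to plot a moving average over the raw per-epoch test rewards rather than the raw values. A minimal sketch (the window size is an arbitrary choice, and reading the actual log_test format is left out):

```python
def moving_average(values, window=50):
    """Smooth a noisy reward curve with a simple sliding-window mean.

    For the first few points the window grows from 1 up to `window`,
    so the output has the same length as the input.
    """
    smoothed = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]
        smoothed.append(total / min(i + 1, window))
    return smoothed

# stand-in for rewards parsed from the test log
rewards = [float(i % 7) for i in range(300)]
trend = moving_average(rewards, window=50)
```

If the smoothed trend is flat while the raw curve still oscillates, the fluctuation is likely just policy-gradient noise rather than a training problem.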

We select the model based on a validation set. You can divide the training traces into a training set and a validation set.
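One way to implement this selection: split the trace files, evaluate each saved checkpoint on the held-out validation traces, and keep the checkpoint with the highest mean reward. A hedged sketch; `evaluate` is a hypothetical stand-in for running the trained agent over one validation trace:

```python
import random

def split_traces(traces, val_fraction=0.2, seed=0):
    """Shuffle and split trace filenames into training and validation sets."""
    traces = list(traces)
    random.Random(seed).shuffle(traces)
    n_val = max(1, int(len(traces) * val_fraction))
    return traces[n_val:], traces[:n_val]

def select_best_checkpoint(checkpoints, val_traces, evaluate):
    """Return the checkpoint whose mean validation reward is highest.

    `evaluate(ckpt, trace)` must return a scalar reward; it is a
    placeholder for simulating the agent over one validation trace.
    """
    def mean_reward(ckpt):
        return sum(evaluate(ckpt, t) for t in val_traces) / len(val_traces)
    return max(checkpoints, key=mean_reward)
```

Selecting on validation rather than test reward avoids overfitting the model choice to the test traces.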

Please refer to Section 5.3 of the paper for generalization. Synthetic traces are also provided. But as you suggested, if the network characteristics change drastically from training to actual deployment, performance could degrade. It would also be interesting to do (refined) training online.

@quito418
Author

I tested with your pre-trained model and our trained model. I selected the trained model based on how well it performed on the test set. Both of them outperformed MPC.

I'm now clear about generalization!

Thank you
