Why is the result not better than MPC? #11
Comments
What you did was all correct. But as we stated in the paper, as well as in the README.md in /sim: you can easily achieve an automatic exploration decay for ENTROPY_WEIGHT. The reason we didn't explicitly do this is so that others can see the effect of this parameter, as you just discovered :). Hope this helps.
Did you load the trained model from the previous run when you decayed the factor? We (as well as others who reproduced it; there are some posts in the issues already) didn't do anything fancy; a plain decay once or twice should work.
I figured out what the problem was. As you said, I should stop the program, load the previously trained model, then re-run the python script. I got a good result this way. But at first, I just set a smaller ENTROPY_WEIGHT within the same training run, without stopping and re-running the script. Why does the "re-run" work differently from my method? Both methods keep the previously trained model, while "re-run" resets the optimizer. Is that the reason?
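(For concreteness, here is a minimal sketch of what "stop, load, re-run" amounts to in TensorFlow 1.x, which Pensieve uses. The checkpoint path, the decayed weight value, and the stand-in variable are assumptions for illustration, not the actual graph in `multi_agent.py`. Restoring a checkpoint brings back the trained network weights, while the optimizer created in the new process starts with fresh state.)

```python
import tensorflow as tf

ENTROPY_WEIGHT = 0.1  # decayed value for the second run (illustrative assumption)
NN_MODEL = './results/nn_model_ep_160000.ckpt'  # hypothetical checkpoint path

# Stand-in for the actor/critic networks that multi_agent.py would build.
actor_w = tf.get_variable('actor_w', shape=[6, 6])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()

    # Restoring keeps the trained network weights; the optimizer state
    # (e.g., RMSProp moving averages) created for this run starts fresh,
    # which is exactly what stopping and re-running the script does.
    if NN_MODEL is not None:
        saver.restore(sess, NN_MODEL)
```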
I'm glad you got good performance 👍 As for automatically decaying the exploration factor, I think any reasonable decay function should work (e.g., linear, step function, etc.). If you manage to get that working, could you post your result (maybe open another issue)? Although we have our internal implementation (we didn't post it because (1) it's fairly easy to implement and (2), more importantly, we intentionally want others to observe this effect), we would appreciate it a lot if someone could reproduce and improve on it. Thanks!
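(If it helps, here is a minimal sketch of the kind of automatic decay being discussed; the schedule constants are illustrative assumptions, not the authors' internal implementation. The returned value would take the place of the fixed `ENTROPY_WEIGHT` constant in the actor's entropy term.)

```python
def entropy_weight_linear(epoch, initial=1.0, final=0.1, decay_epochs=50000):
    """Linearly anneal the exploration (entropy) weight over decay_epochs.

    initial, final, and decay_epochs are illustrative values; pick them
    to match your own training budget.
    """
    if epoch >= decay_epochs:
        return final
    frac = epoch / float(decay_epochs)
    return initial + frac * (final - initial)


def entropy_weight_step(epoch, initial=1.0, floor=0.1, every=20000):
    """Step-function alternative: halve the weight every `every` epochs,
    never going below `floor`."""
    return max(floor, initial * (0.5 ** (epoch // every)))
```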
Sure. I'll try to use a decay function as you suggested.
I want to know why the values in the CDF plot are not smaller than 100. Is this correct?
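(One possible reading, not confirmed in this thread: the CDF is taken over the total reward of each test trace, i.e., per-chunk rewards summed over the whole trace, so values above 100 would be expected. Below is a rough sketch under that assumption; the log format and file patterns are hypothetical.)

```python
import glob

import numpy as np
import matplotlib.pyplot as plt


def total_rewards(log_files):
    """Sum the per-chunk rewards of each test log.

    Assumed (hypothetical) log format: one chunk per line, with the
    reward in the last whitespace-separated column.
    """
    totals = []
    for path in log_files:
        with open(path) as f:
            rewards = [float(line.split()[-1]) for line in f if line.strip()]
        totals.append(sum(rewards))
    return np.array(totals)


def plot_cdf(totals, label):
    """Plot the empirical CDF of per-trace total rewards."""
    x = np.sort(totals)
    y = np.arange(1, len(x) + 1) / float(len(x))
    plt.plot(x, y, label=label)


# Usage (file patterns are placeholders for the test result logs):
# plot_cdf(total_rewards(glob.glob('./results/log_sim_rl_*')), 'Pensieve')
# plot_cdf(total_rewards(glob.glob('./results/log_sim_mpc_*')), 'robustMPC')
# plt.xlabel('total reward per trace'); plt.ylabel('CDF'); plt.legend(); plt.show()
```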
Hi Hongzi,
I tried to reproduce the results of Pensieve. After several attempts, I failed to get an ideal result (better performance than MPC). Here is what I did. The code was downloaded from GitHub, and the trace files were obtained from Dropbox:

1. Run `python multi_agent.py` to train the model;
2. Run `python rl_no_training.py` in the test/ folder to test the model, using the trace files in test_sim_traces;
3. Run `python plot_results.py` to compare the results with the DP method & the MPC method.

I put two figures of total_reward and CDF here. We can see that the performance of Pensieve is not better than MPC.
Here is a figure from TensorBoard. The training step count is about 160,000.
I found that the result is not very stable after long training (more than 10,000 steps), so the trained models give different performance when testing. For example, the model at step 164,500 got a reward of 35.2, while the model at step 164,600 got a reward of 33.7.
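(One workaround, offered here as an assumption rather than something prescribed in this thread: evaluate several recent checkpoints with the test script and keep the one with the best test reward, instead of simply the last one. A sketch with hypothetical paths and log format:)

```python
import glob

import numpy as np


def mean_test_reward(result_dir):
    """Average the per-chunk rewards over every test log in a directory.

    Assumed (hypothetical) layout: one log file per trace, with the
    reward in the last whitespace-separated column of each line.
    """
    rewards = []
    for path in glob.glob(result_dir + '/log_*'):
        with open(path) as f:
            rewards.extend(float(line.split()[-1]) for line in f if line.strip())
    return np.mean(rewards)


# Run rl_no_training.py once per saved checkpoint, each writing into its
# own results directory (paths below are placeholders), then keep the
# checkpoint whose test reward is highest.
candidates = {
    'nn_model_ep_164500': './results_164500',
    'nn_model_ep_164600': './results_164600',
}
best = max(candidates, key=lambda name: mean_test_reward(candidates[name]))
print('best checkpoint:', best)
```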
Did I do something wrong that keeps me from getting the same result you described in the paper? The pretrain_linear_reward model performs well. How did you get it? Can you give me a hand with these questions? Any answer is highly appreciated.
Thanks!