Unstable performance, sometimes agent converges to no_op action #7

Thank you for the great release. I am trying to train an agent on CollectMineralShards, but cannot reproduce the reported performance. I have made several attempts, but only reach a reward of 75 at 100k steps. Are there any config parameters I should change? Thanks~

Comments
I don't recall using any special hyperparameters for this map; the defaults should work.
I use the feature/action space config from the readme, and have tried 32 agents and 24 agents. The steps I mentioned are the iterations shown in TensorBoard.
Try with 32 agents on the default feature/actions (simply don't specify the …)
OK, thank you very much!
By the way, for which maps do I need to change the feature/action space config?
I initially created it for the FindAndDefeatZerglings map, but I actually just used the default when I prepared the results for my thesis.
I'd like to add to the thanks for a proper piece of code :) I'd also like to ask you something. I am trying to replicate CollectMineralShards and so far have failed to climb close to a score of 100 within 200k observations. As far as I know, the only difference is that I use 8 workers rather than the default 32. However, that should only make my training longer, right? Since the graph's x-axis is measured in batches of 512. Thanks
In on-policy algorithms such as A2C, the agent count can significantly affect performance. It should eventually converge, but not necessarily within the same number of samples.
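To make that concrete, here is a rough sketch of how samples per update scale with agent count in a synchronous A2C setup (the 16-step rollout length is an assumption chosen only to match the batch of 512 mentioned above, not necessarily this repo's default):

```python
# Rough sketch: in synchronous A2C one update consumes n_envs * n_steps
# transitions, so fewer parallel agents means smaller, noisier batches
# (or more updates to see the same number of environment observations).
n_steps = 16  # assumed rollout length per agent

for n_envs in (32, 8):
    batch = n_envs * n_steps
    updates = 200_000 // batch
    print(f"{n_envs} agents -> batch of {batch} transitions, "
          f"~{updates} updates to reach 200k observations")
```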
Makes sense, thanks :)
@SarunasSS I've run some tests with 8 agents and discovered a subtle bug where the agent stops moving and "poisons" the gradients with a 0 episode reward. I guess with 32 agents it didn't matter, as on average it still improved, so I never noticed. I have an idea where it's coming from but can't give an ETA on the fix for now.
@inoryy what do you mean by "stops moving"? It could stop when it explores non-move actions, right (e.g. all the selects)? So it could be related to the exploration scheme.
@SarunasSS no, it looks like it just completely stops taking any actions for the rest of an episode (and can eventually lead to all agents producing 0 rewards for the rest of the run). It might be trying to make an invalid move; I can only know for sure after manually investigating, which is difficult since all of this happens at random even on the same seed.
Hi @inoryy, I also discovered this bug. I printed the actions and softmax probabilities. The agent stops moving because the action is no_op, whose index is 0 in the action list. Moreover, the probabilities of all actions are 0, because the selected actions are masked by the "available actions" in the config, and the renormalization doesn't work when all probabilities are 0. I tried adding a uniform distribution over the "available actions" when all of their probabilities are 0, to encourage exploration.
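For anyone hitting the same thing, here is a minimal sketch of the masking step and the uniform-distribution workaround described above (plain NumPy; the function name and the tiny 5-action toy space are illustrative, not this repo's actual code):

```python
import numpy as np

def masked_policy(logits, available, eps=1e-8):
    """Mask out unavailable actions and renormalize the policy.

    `available` is a 0/1 vector built from the pysc2 "available actions".
    If every available action ends up with (near-)zero probability, the
    plain renormalization divides by ~0 and sampling degenerates -- in
    practice the agent keeps emitting no_op (index 0). The workaround
    below falls back to a uniform distribution over the available actions.
    """
    probs = np.exp(logits - logits.max())      # numerically stable softmax
    probs /= probs.sum()
    probs *= available                         # zero out unavailable actions
    total = probs.sum()
    if total < eps:
        # all available actions collapsed to ~0 probability
        return available / available.sum()
    return probs / total

# Toy example: only actions 0 (no_op) and 3 are available, but the network
# put essentially all of its probability mass on the unavailable action 1.
logits = np.array([-20.0, 10.0, -20.0, -20.0, -20.0])
available = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
print(masked_policy(logits, available))        # uniform over {0, 3} instead of NaNs
```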
@inoryy I managed to replicate the issue as well. Indeed, as @CHENGY12 said, the problem is that no_op becomes the only action with p > 0. Depending on your reward structure this could be a local minimum. E.g. in DefeatRoaches, if the marines do not engage the roaches the score is 0, which is better than losing all marines (-9), so no_op can dominate. Any ideas how to resolve this?
@SarunasSS I'll look into it this weekend. It should be easy to find thanks to @CHENGY12's information.
I have been investigating this problem in depth. In most of the training runs the agents converge to using the no_op action and get stuck there no matter what exploration scheme I'm using (I've tried Boltzmann and epsilon-greedy). It is weird that it converges to no_op even though it has reached large rewards before. Do you have any idea what might be going wrong?
@SarunasSS sorry, I got side-tracked a bit. I'm almost certain the issue boils down to a case where all available actions' policy probabilities end up at 0, so re-normalization does nothing and results in very bad gradients. It should be an easy fix, but the bigger issue is finding the time / hardware to test it extensively.
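For anyone curious why this produces bad gradients rather than just a bad policy, the degenerate case is easy to reproduce in isolation (toy NumPy sketch, not the project's code):

```python
import numpy as np

# Toy reproduction of the degenerate case described above: every available
# action's probability is 0, so renormalization can't recover a distribution.
probs = np.zeros(4)
renormed = probs / probs.sum()   # 0/0 -> NaN (renormalization "does nothing")
log_pi = np.log(renormed)        # NaN log-probs feed straight into the
print(renormed, log_pi)          # policy-gradient loss -> "very bad gradients"
```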
A little update for people following this issue: I'm currently re-writing the project essentially from scratch, so in the interest of time I've decided not to investigate the issue in the legacy codebase. During the rewrite I'll of course make sure to avoid repeating the problem. The re-written project will include updated dependencies, a cleaner API, better flexibility, optimized performance and much more. ETA on initial release: end of August.
Fixed!