
Unstable performance, sometimes agent converges to no_op action #7

Closed
CHENGY12 opened this issue Jul 12, 2018 · 19 comments

@CHENGY12

Thank you for the great release. I'm trying to train an agent on CollectMineralShards, but cannot reproduce the reported performance. I've made several attempts, but only reach reward=75 at 100k steps. Are there any config parameters I should change? Thanks~

@inoryy
Owner

inoryy commented Jul 12, 2018

I don't recall using any special hyperparameters for this map; the defaults should work.
Just to be clear, by steps do you mean training steps with 512 samples each? Are you using the default feature/action space config (not the readme example)? How many agents do you use, 32?

@CHENGY12
Author

I use the feature/action space config from the readme and have tried both 32 and 24 agents. By steps I meant the iterations shown in TensorBoard.

@inoryy
Owner

inoryy commented Jul 12, 2018

Try with 32 agents on the default feature/action space (simply don't specify the cfg_path arg).

@CHENGY12
Author

OK~ Thank you very much!

@CHENGY12
Author

By the way, for which maps do I need to change the feature/action space config?

@inoryy
Owner

inoryy commented Jul 12, 2018

I initially created it for the FindAndDefeatZerglings map, but I actually just used the default when I prepared the results for my thesis.

@SarunasSS

I would like to add to the thanks for a proper piece of code :)

I would like to ask you something. I am trying to replicate the CollectMineralShards results and so far have failed to get close to a score of 100 within 200k obs. AFAIK, the only difference is that I use 8 workers rather than the default 32. However, that should only make my training take longer, right? Since the graph's x-axis is measured in batches of 512. Thanks

@inoryy
Owner

inoryy commented Jul 16, 2018

within 200k obs

By obs, do you mean the number of updates (n_updates in the console logs) or the number of samples (n_samples in the console logs)? The learning curve numbers show n_updates. My runs converge to a score of 100 at around 35k updates, which is about 18 million samples.

use 8 workers rather than the default 32. However, that should only make my training longer right?

In on-policy algorithms such as A2C, the agent count can significantly affect performance. The run should eventually converge, but it is not guaranteed to do so in the same number of samples.
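
For anyone mapping their own console logs to these numbers, here is a quick back-of-the-envelope sketch. The 512-sample batch and the even split across environments are assumptions taken from this thread, not from the repository code:

```python
# Rough conversion between updates and samples, assuming each update
# consumes a batch of 512 transitions gathered across all parallel envs
# (an assumption based on this thread, not the actual implementation).
BATCH_SIZE = 512

updates = 35_000
total_samples = updates * BATCH_SIZE   # 17,920,000, i.e. "about 18 million samples"
print(f"{updates:,} updates ~= {total_samples:,} samples")

# If the batch size stays fixed at 512, fewer envs means longer on-policy
# rollouts per env, which changes the data each update is computed from:
for n_envs in (32, 8):
    print(f"{n_envs} envs -> {BATCH_SIZE // n_envs} steps per env per update")
```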

@SarunasSS

Makes sense thanks :)

@inoryy
Owner

inoryy commented Jul 18, 2018

@SarunasSS I've run some tests with 8 agents and discovered a subtle bug where the agent stops moving and "poisons" the grads with a 0 episode reward. I guess with 32 agents it didn't matter, since on average the run still improved, so I never noticed. I have an idea of where it's coming from but can't give an ETA on a fix for now.

@SarunasSS

@inoryy what do you mean by "stops moving"? It could stop when it explores non-move actions, right (e.g. all the selects)? So it could be related to the exploration scheme.

@inoryy
Owner

inoryy commented Jul 18, 2018

@SarunasSS no, it looks like it just completely stops taking any actions for the rest of an episode (and this can eventually lead to all agents producing 0 rewards for the rest of the run). It might be trying to make an invalid move; I can only know for sure after investigating manually, which is difficult since all of this happens at random, even on the same seed.

@CHENGY12
Author

Hi @inoryy, I also ran into this bug. I printed the actions and softmax probabilities. The agent stops moving because the chosen action is no_op, whose index is 0 in the action list. Moreover, the probabilities of all actions are 0, because the selected actions are masked by the "available actions" in the config, and the normalization doesn't work when all probabilities are 0. To encourage exploration, I tried adding a uniform distribution over the "available actions" whenever all of their probabilities are 0.
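
For illustration, here is a minimal NumPy sketch of that workaround (not the project's actual TensorFlow code; the function and variable names are made up): mask the policy output by the available actions, renormalize, and fall back to a uniform distribution over the available actions when the masked probability mass is zero.

```python
import numpy as np

def mask_and_renormalize(probs, available_actions, eps=1e-10):
    """Zero out unavailable actions and renormalize the policy output.

    If every available action ends up with (near-)zero probability, fall
    back to a uniform distribution over the available actions instead of
    dividing by zero.
    """
    masked = probs * available_actions
    total = masked.sum()
    if total < eps:
        # Degenerate case: the policy puts all of its mass on unavailable actions.
        return available_actions / available_actions.sum()
    return masked / total

# Example: the policy collapses onto an action that is currently unavailable.
probs = np.array([0.0, 0.0, 0.0, 1.0])
available = np.array([1.0, 1.0, 1.0, 0.0])     # action 3 is masked out
print(mask_and_renormalize(probs, available))  # uniform over actions 0-2
```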

@SarunasSS

@inoryy I managed to replicate the issue as well. Indeed, as @CHENGY12 said, the problem is that no_op becomes the only action with p > 0. Depending on the reward structure this can be a local minimum. E.g. in DefeatRoaches, if the marines do not engage the roaches the score is 0, which is better than losing all marines (-9), so no_op can dominate.

Any ideas on how to resolve this?

@inoryy
Owner

inoryy commented Jul 24, 2018

@SarunasSS I'll look into it this weekend. It should be easy to find thanks to @CHENGY12's information.

@SarunasSS

I have been investigating this problem in depth. In most of my training runs the agents converge to the no_op action and get stuck there, no matter what exploration scheme I use (I've tried Boltzmann and epsilon-greedy).

It is weird that the agent converges to no_op even though it has reached large rewards before. Do you have any idea what might be going wrong?

@inoryy
Owner

inoryy commented Aug 2, 2018

@SarunasSS sorry, I got a bit side-tracked. I'm almost certain the issue boils down to a case where all available-action policy probabilities end up at 0, so re-normalization does nothing and results in very bad gradients. It should be an easy fix, but the bigger issue is finding the time / hardware to test it extensively.
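
To make the "very bad gradients" concrete: if the sampled action ends up with probability 0 after masking, the log pi(a|s) term in the policy-gradient loss becomes -inf, and the resulting inf/NaN values then propagate on the backward pass. A generic safeguard (a sketch only, not the fix that eventually went into the rewrite) is to clamp the probability before taking the log:

```python
import numpy as np

def policy_gradient_term(probs, action, advantage, eps=1e-8):
    """Single-sample A2C policy-gradient term: -log pi(a|s) * advantage.

    Without the clamp, a masked-out probability of exactly 0 gives
    log(0) = -inf, which turns the loss (and its gradients) into inf/NaN,
    i.e. the "poisoned" gradients described above.
    """
    p = max(probs[action], eps)   # clamp to avoid log(0)
    return -np.log(p) * advantage

probs = np.array([1.0, 0.0, 0.0, 0.0])  # policy has collapsed onto no_op
print(policy_gradient_term(probs, action=1, advantage=1.0))  # finite thanks to the clamp
```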

@inoryy
Owner

inoryy commented Aug 13, 2018

A little update for people following this issue: I'm currently re-writing the project essentially from scratch, so in the interest of time I've decided not to investigate the issue in the legacy codebase. During the rewrite I'll of course make sure to avoid repeating the problem.

The re-written project will include updated dependencies, a cleaner API, better flexibility, optimized performance and much more. ETA on the initial release: end of August.

@inoryy inoryy added the on hold label Aug 13, 2018
@inoryy inoryy changed the title repeat experiments on CollectMineralShards Unstable performance, sometimes agent converges to no_op action Aug 13, 2018
@inoryy inoryy closed this as completed Nov 25, 2018
@inoryy
Owner

inoryy commented Nov 25, 2018

Fixed!
