
Unstable performance, sometimes agent converges to no_op action #7

Closed
CHENGY12 opened this issue Jul 12, 2018 · 19 comments

@CHENGY12

Thank you for the great release. I'm trying to train an agent on CollectMineralShards, but cannot reproduce the reported performance. I've made several attempts, but only reach reward=75 at 100k steps. Are there any config parameters I should change? Thanks~

@inoryy
Owner

inoryy commented Jul 12, 2018

I don't recall using any special hyperparameters for this map; the defaults should work.
Just to be clear, by steps do you mean training steps with 512 samples each? Are you using the default feature/action space config (not the readme example)? How many agents do you use, 32?

@CHENGY12
Author

I use the feature/action space config from the readme and have tried both 32 and 24 agents. By steps I meant the iterations shown in TensorBoard.

@inoryy
Owner

inoryy commented Jul 12, 2018

Try with 32 agents on the default feature/action space (simply don't specify the cfg_path arg).

@CHENGY12
Author

OK~ Thank you very much!

@CHENGY12
Author

By the way, for which maps do I need to change the feature/action space config?

@inoryy
Owner

inoryy commented Jul 12, 2018

I initially created it for the FindAndDefeatZerglings map, but I actually just used the default when I prepared the results for my thesis.

@SarunasSS

I would like to add to the thanks for a proper piece of code :)

I would like to ask you something. I am trying to replicate the CollectMineralShards results and so far have failed to get close to a score of 100 within 200k obs. AFAIK, the only difference is that I use 8 workers rather than the default 32. However, that should only make my training take longer, right? Since the graph's x-axis is measured in batches of 512. Thanks

@inoryy
Owner

inoryy commented Jul 16, 2018

within 200k obs

By obs, do you mean the number of updates (n_updates in the console logs) or the number of samples (n_samples in the console logs)? The learning curve numbers show n_updates. My runs converge to a score of 100 at around 35k updates, which is about 18 million samples.

use 8 workers rather than the default 32. However, that should only make my training longer right?

In on-policy algorithms such as A2C, the agent count can significantly affect performance. The run should eventually converge, but it is not guaranteed to do so in the same number of samples.
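
For anyone mapping their own console logs to these numbers, here is a quick back-of-the-envelope sketch. The 512-sample batch and the even split across environments are assumptions taken from this thread, not from the repository code:

```python
# Rough conversion between updates and samples, assuming each update
# consumes a batch of 512 transitions gathered across all parallel envs
# (an assumption based on this thread, not the actual implementation).
BATCH_SIZE = 512

updates = 35_000
total_samples = updates * BATCH_SIZE   # 17,920,000, i.e. "about 18 million samples"
print(f"{updates:,} updates ~= {total_samples:,} samples")

# If the batch size stays fixed at 512, fewer envs means longer on-policy
# rollouts per env, which changes the data each update is computed from:
for n_envs in (32, 8):
    print(f"{n_envs} envs -> {BATCH_SIZE // n_envs} steps per env per update")
```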

@SarunasSS

Makes sense thanks :)

@inoryy
Owner

inoryy commented Jul 18, 2018

@SarunasSS I've run some tests with 8 agents and discovered a subtle bug where the agent stops moving and "poisons" the grads with a 0 episode reward. I guess with 32 agents it didn't matter, since on average the run still improved, so I never noticed. I have an idea of where it's coming from but can't give an ETA on a fix for now.

@SarunasSS

@inoryy what do you mean by "stops moving"? It could stop when it explores non-move actions, right (e.g. all the selects)? So it could be related to the exploration scheme.

@inoryy
Owner

inoryy commented Jul 18, 2018

@SarunasSS no, it looks like it just completely stops taking any actions for the rest of an episode (and this can eventually lead to all agents producing 0 rewards for the rest of the run). It might be trying to make an invalid move; I can only know for sure after investigating manually, which is difficult since all of this happens at random, even on the same seed.

@CHENGY12
Author

Hi @inoryy, I also ran into this bug. I printed the actions and softmax probabilities. The agent stops moving because the chosen action is no_op, whose index is 0 in the action list. Moreover, the probabilities of all actions are 0, because the selected actions are masked by the "available actions" in the config, and the normalization doesn't work when all probabilities are 0. To encourage exploration, I tried adding a uniform distribution over the "available actions" whenever all of their probabilities are 0.
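
For illustration, here is a minimal NumPy sketch of that workaround (not the project's actual TensorFlow code; the function and variable names are made up): mask the policy output by the available actions, renormalize, and fall back to a uniform distribution over the available actions when the masked probability mass is zero.

```python
import numpy as np

def mask_and_renormalize(probs, available_actions, eps=1e-10):
    """Zero out unavailable actions and renormalize the policy output.

    If every available action ends up with (near-)zero probability, fall
    back to a uniform distribution over the available actions instead of
    dividing by zero.
    """
    masked = probs * available_actions
    total = masked.sum()
    if total < eps:
        # Degenerate case: the policy puts all of its mass on unavailable actions.
        return available_actions / available_actions.sum()
    return masked / total

# Example: the policy collapses onto an action that is currently unavailable.
probs = np.array([0.0, 0.0, 0.0, 1.0])
available = np.array([1.0, 1.0, 1.0, 0.0])     # action 3 is masked out
print(mask_and_renormalize(probs, available))  # uniform over actions 0-2
```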

@SarunasSS

@inoryy I managed to replicate the issue as well. Indeed, as @CHENGY12 said, the problem is that no_op becomes the only action with p > 0. Depending on the reward structure this can be a local minimum. E.g. in DefeatRoaches, if the marines do not engage the roaches the score is 0, which is better than losing all marines (-9), so no_op can dominate.

Any ideas on how to resolve this?

@inoryy
Owner

inoryy commented Jul 24, 2018

@SarunasSS I'll look into it this weekend. It should be easy to find thanks to @CHENGY12's information.

@SarunasSS

I have been investigating this problem in depth. In most of my training runs the agents converge to the no_op action and get stuck there, no matter what exploration scheme I use (I've tried Boltzmann and epsilon-greedy).

It is weird that the agent converges to no_op even though it has reached large rewards before. Do you have any idea what might be going wrong?

@inoryy
Owner

inoryy commented Aug 2, 2018

@SarunasSS sorry, I got a bit side-tracked. I'm almost certain the issue boils down to a case where all available-action policy probabilities end up at 0, so re-normalization does nothing and results in very bad gradients. It should be an easy fix, but the bigger issue is finding the time / hardware to test it extensively.
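
To make the "very bad gradients" concrete: if the sampled action ends up with probability 0 after masking, the log pi(a|s) term in the policy-gradient loss becomes -inf, and the resulting inf/NaN values then propagate on the backward pass. A generic safeguard (a sketch only, not the fix that eventually went into the rewrite) is to clamp the probability before taking the log:

```python
import numpy as np

def policy_gradient_term(probs, action, advantage, eps=1e-8):
    """Single-sample A2C policy-gradient term: -log pi(a|s) * advantage.

    Without the clamp, a masked-out probability of exactly 0 gives
    log(0) = -inf, which turns the loss (and its gradients) into inf/NaN,
    i.e. the "poisoned" gradients described above.
    """
    p = max(probs[action], eps)   # clamp to avoid log(0)
    return -np.log(p) * advantage

probs = np.array([1.0, 0.0, 0.0, 0.0])  # policy has collapsed onto no_op
print(policy_gradient_term(probs, action=1, advantage=1.0))  # finite thanks to the clamp
```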

@inoryy
Owner

inoryy commented Aug 13, 2018

A little update for people following this issue: I'm currently re-writing the project essentially from scratch, so in the interest of time I've decided not to investigate the issue in the legacy codebase. During the rewrite I'll of course make sure to avoid repeating the problem.

The re-written project will include updated dependencies, a cleaner API, better flexibility, optimized performance and much more. ETA on the initial release: end of August.

@inoryy inoryy added the on hold label Aug 13, 2018
@inoryy inoryy changed the title repeat experiments on CollectMineralShards Unstable performance, sometimes agent converges to no_op action Aug 13, 2018
@inoryy inoryy closed this as completed Nov 25, 2018
@inoryy
Owner

inoryy commented Nov 25, 2018

Fixed!
