
Trying to compare this to universe-starter-agent (A3C) #22

Closed · nczempin opened this issue Mar 24, 2017 · 83 comments

@nczempin

Setting up openai/universe, I used the "universe starter agent" as a smoke test.

After adjusting the number of workers to better utilize my CPU, I saw the default PongDeterministic-v3 start winning after about 45 minutes.

Then I wanted to try GA3C on the same machine; given that you quote results of 6x or better, I expected it to perform at least as well as that result.

However, it turns out that with GA3C the agent only starts winning after roughly 90 minutes.

I'm assuming that either my first (few) run(s) on the starter agent were just lucky, or that my runs on GA3C were unlucky. Also I assume that the starter agent has other changes from the A3C that you compared GA3C against, at least in parameters, possibly in algorithm.

So, what can I (an experienced software engineer but with no background in ML), do to make the two methods more comparable on my machine? Is it just a matter of tweaking a few parameters? Is Pong not a good choice to make the comparison?

I have an i7-3930k, a GTX 1060 (6 GB) and 32 GB of RAM.

ashern commented Mar 24, 2017

The OpenAI agent uses an LSTM policy network & GAE for the loss function.

This repo has a far simpler implementation of A3C, using a vanilla feed-forward network for the policy, and I'm pretty sure a less recent loss function (though I haven't confirmed that last point recently).

While I personally had high hopes that this implementation would be useful for speeding things up, I've recently gone back to working with the OpenAI framework for my testing. I think some people have been working to get the LSTM policy working w/ GPU based A3C, but I haven't seen any working code that improves on the OpenAI type model....

I'd love to be corrected if I'm wrong on any of the above.

@nczempin
Author

ok, that explains it.

Is "get LSTM policy working with GA3C" an open research problem or merely a matter of implementation details?

@nczempin
Author

And does Pong happen to be particularly sensitive to LSTM or would it be no different in the other Atari games?

swtyree (Collaborator) commented Mar 24, 2017

I did a few tests with the universe starter agent when it was just released. Based on that limited experience, it seemed that the setup was a bit overfit to Pong--performance was reasonable for other games, but exceptionally fast for Pong. But as the previous commenter mentioned, it also uses an LSTM and GAE, which are helpful in some cases. If you run more extensive tests, I'd be curious to know how it performs on a wider suite of games.

nczempin commented Mar 24, 2017 via email

nczempin commented Mar 24, 2017 via email

ashern commented Mar 24, 2017

The appendix of the original A3C paper has a ton of comparisons across different games & models, which should help you avoid some testing.

LSTM A3C is widely implemented open-source - a quick search should turn up a few options. The Universe & Miyosuda implementations seem to be the most commonly used.

nczempin commented Mar 24, 2017

The appendix of the original A3C paper has a ton of comparisons across different games & models, which should help you avoid some testing.

Not sure what this refers to; are you saying I could have avoided wasting time on CoasterRacer by being more aware of the comparisons? My goal was just to "play around with openai universe" rather than get deep into testing. If anything, I'd be interested in adding an environment such as MAME or one of the other emulators, which is more obviously an engineering task.

LSTM A3C is widely implemented open-source - a quick search should turn up a few options. The Universe & Miyosuda implementations seem to be the most commonly used.

Is this a response to my question about GA3C with LSTM? If so, the implicit assumption is that there are no fundamental issues that would complicate such an effort, and that one could, for example, start by looking at the existing A3C implementations. Is that what you're saying? My understanding from the GA3C paper is that the authors consider it a general approach and that A3C just happened to perform best, so adding an LSTM should not be a big deal.

@nczempin
Author

Also, what would be a better venue for discussions such as this one? I don't really want to clutter up the project issues.

ashern commented Mar 24, 2017

I simply meant - there exists a readily available corpus of tests conducted by professional researchers. Use it as you wish.

Implementing an LSTM policy is simply an engineering issue, albeit a moderately difficult one in this case. Have at it - and please publish if you get good results! There are other issues open in this repo, I believe, where there are already discussions around LSTM/GPU.

4SkyNet commented Mar 24, 2017

@nczempin you should add GAE, since it's the most crucial part and easy to implement. An LSTM doesn't really affect your results that much (though it can help a bit with the dynamics of the game; it's more policy oriented).
PS> see the results in the original paper (last page)

nczempin commented Mar 24, 2017

I simply meant - there exists a readily available corpus of tests conducted by professional researchers. Use it as you wish.

Well, the "which ones should I try" was really offering my "services" to @swtyree: in case I make some more comparisons with my setup anyway it doesn't make a big difference to me which other roms I try, so if someone does have a preference, I might as well support that.

Implementing an LSTM policy is simply an engineering issue, albeit a moderately difficult one in this case. Have at it - and please publish if you get good results!

"Publish" sounds intimidating to me, but if I do get anything off the ground, I promise to put the source up on github; perhaps fork and PR here. I probably have to brush up my background in this area a little first (and I definitely have some things I'd like to do first, as mentioned before), so don't hold your breath.

There are other issues open in this repo, I believe, where there are already discussions around LSTM/GPU.

I saw an issue on the universe starter agent, asking about GPU. It doesn't seem to have gone anywhere.

mbz (Contributor) commented Mar 24, 2017

Please check out the pull requests section. GAE has already been implemented by @etienne87 in this pull request. He also implemented a specific pre-processing and configuration which provides a better comparison with the starter agent.

nczempin commented Mar 24, 2017

@nczempin you should add GAE, since it's the most crucial part and easy to implement. An LSTM doesn't really affect your results that much (though it can help a bit with the dynamics of the game).
PS> see the results in the original paper (last page)

GAE being? All I get is Google App Engine, and I don't find a reference to the term in the A3C paper.

Edit: Generalized Advantage Estimation.

Please check out the pull requests section. GAE has already been implemented by @etienne87 in this pull request. He also implemented a specific pre-processing and configuration which provides a better comparison with the starter agent.

I'll have a look at that. Should I use a different game from the purportedly overfitted Pong, or would it be fine? I guess we'd know the answer when/if I try...

mbz (Contributor) commented Mar 24, 2017

GAE stands for Generalized Advantage Estimation. It's always a good idea to start with Pong (since it's usually the fastest to converge), and as long as you avoid Pong-specific logic, things should generalize to other games as well.
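For reference, a minimal sketch of what GAE computes, assuming per-step rewards, per-step value estimates and a bootstrap value for the state after the last step; the function name and its wiring into GA3C's training queue are hypothetical, not the code from the PR mentioned above.

import numpy as np

# GAE sketch: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)
#             A_t     = delta_t + gamma*lambda*A_{t+1}
def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    rewards = np.asarray(rewards, dtype=np.float32)
    values = np.append(np.asarray(values, dtype=np.float32), bootstrap_value)
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # targets for the value loss
    return advantages, returns

With lam = 1.0 this reduces to the plain discounted return minus the value baseline, which matches the lambda = 1.0 default mentioned later in the thread.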

@nczempin
Author

Okay, I checked out the PR, but it breaks the dependencies on the vanilla openai-universe.

I'm willing to give it a whirl once the PR is in a usable state, more or less as-is.

4SkyNet commented Mar 24, 2017

If you look at the results from the original paper, there are some good environments for faster convergence, such as Amidar, Berzerk and Krull. But DeepMind trained all of these games with the same parameters, whereas the gamma (discount factor) could be chosen for each environment individually to get better results.

@nczempin
Author

So am I right in assessing that my issue #22 essentially boils down to issue #3?

@nczempin
Author

or should I rename it to something that specifically references GAE?

nczempin commented Mar 24, 2017 via email

4SkyNet commented Mar 24, 2017

@nczempin you can try Seaquest, Boxing or other similar games that favour a more policy-oriented approach over a value-oriented one (Breakout).
PS> I prefer Boxing because it's simple enough, but it takes a lot longer to see a difference from a random agent (8 million steps for me with an almost vanilla A3C) than Breakout, for example

nczempin commented Mar 24, 2017

So after around 24000 seconds (400 minutes, 6.6667 hours), here's what I get with GA3C with my 3930k, 32 GB and GTX 1060 (6GB):
[Time: 23999] [Episode: 20379 Score: 350.0000] [RScore: 268.7030 RPPS: 966] [PPS: 1258 TPS: 210] [NT: 16 NP: 2 NA: 9]

[Time: 23999] [Episode: 20380 Score: 317.0000] [RScore: 268.7230 RPPS: 966] [PPS: 1258 TPS: 210] [NT: 16 NP: 2 NA: 9]

[Time: 24001] [Episode: 20381 Score: 355.0000] [RScore: 268.8350 RPPS: 966] [PPS: 1258 TPS: 210] [NT: 16 NP: 2 NA: 9]

[Time: 24004] [Episode: 20382 Score: 295.0000] [RScore: 268.8870 RPPS: 966] [PPS: 1258 TPS: 210] [NT: 16 NP: 2 NA: 9]

[Time: 24008] [Episode: 20383 Score: 247.0000] [RScore: 268.8910 RPPS: 965] [PPS: 1258 TPS: 210] [NT: 16 NP: 3 NA: 8]

It seemed to make progress right from the start, unlike with Pong, where both algorithms seemed to be clueless for a while and then "suddenly got it" and no longer lost, followed by a very long period of very slow growth in average score (the points it conceded always seemed to be the very first few; once it had won a single point it seemed to go into very similar states).

GA3C on Amidar seems to be stuck just under 270; I will now see what I get on the same machine with universe-starter-agent.

ifrosio (Collaborator) commented Mar 24, 2017

Based on the latest version of our paper, we get more stable and faster convergence for TRAINING_MIN_BATCH_SIZE = 20 ... 40 in Config.py. If you haven't done it yet, you can try this.
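For anyone following along, this is a one-line change; a sketch of the relevant line (the value itself is the thing to experiment with; the surrounding Config.py contents are not reproduced here):

# In GA3C's Config.py: minimum batch size the trainer collects before a training step.
TRAINING_MIN_BATCH_SIZE = 20   # the comment above suggests trying values in the 20..40 range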

@nczempin
Author

Based on the latest version of our paper, we get more stable and faster convergence for TRAINING_MIN_BATCH_SIZE = 20 ... 40 in Config.py. If you haven't done it yet, you can try this.

On Pong again or on any of the other ones I'll try?

4SkyNet commented Mar 24, 2017

@nczempin DeepMind reaches almost 284 within 1 day (80 million). Your result isn't so bad, given that DeepMind selects the 5 best runs out of 50 and averages them. You can also run into saturation or exploration problems after some time. If you use RMSProp as the optimizer you can anneal the learning rate a bit more slowly.
PS> and, as you can see, DeepMind has some instability in training. It seems that Hogwild! can cause such issues, but they also occur with more synchronous approaches.

ifrosio (Collaborator) commented Mar 24, 2017

The improvement with TRAINING_MIN_BATCH_SIZE should be observed for all games (although we only tested a few of them).

nczempin commented Mar 25, 2017

Here's the situation with (the universe starter agent) python3 train.py --num-workers 6 --env-id Amidar-v0 --log-dir /tmp/amidar
after roughly 8 hours:

[image: training results screenshot]

@nczempin
Author

I picked 6 workers because that's how many cores my CPU has, but perhaps up to 12 could have helped, given Hyperthreading etc. But naive "analysis" suggests that GA3C still wins in this particular case, because it gets more than double the score.

It would be interesting to know how much the speedup is due to using the CPU cores more efficiently because of the dynamic load balancing vs. including the GPU.

Even just getting a dynamic number of threads, without any specific GPU improvements, is a big convenience over having to pick them yourself statically.

@nczempin
Author

GA3C PongDeterministic-v3 again with TRAINING_MIN_BATCH_SIZE=20:
[Time: 2702] [Episode: 3682 Score: -21.0000] [RScore: -20.2860 RPPS: 1513] [PPS: 1513 TPS: 51] [NT: 5 NP: 4 NA: 26]
...
[Time: 8993] [Episode: 7191 Score: -7.0000] [RScore: -14.3570 RPPS: 1514] [PPS: 1286 TPS: 44] [NT: 6 NP: 3 NA: 43]

@nczempin
Author

GA3C PongDeterministic-v3 again with TRAINING_MIN_BATCH_SIZE=40:

[Time: 2701] [Episode: 3988 Score: -21.0000] [RScore: -20.1950 RPPS: 1663] [PPS: 1637 TPS: 31] [NT: 7 NP: 3 NA: 44]

[Time: 5402] [Episode: 6053 Score: -13.0000] [RScore: -17.0820 RPPS: 1628] [PPS: 1512 TPS: 29] [NT: 11 NP: 2 NA: 35]

[Time: 8996] [Episode: 7551 Score: -10.0000] [RScore: -13.0080 RPPS: 1609] [PPS: 1494 TPS: 28] [NT: 15 NP: 4 NA: 32]

nczempin commented Mar 26, 2017

Universe starter agent with
python3 train.py --num-workers 12 --env-id BoxingDeterministic-v3 --log-dir /tmp/boxingd3_12 to see how much effect using 12 workers on my 6-core, 12-thread (hyperthreading) i7 would have. Wall-clock time was roughly 7.5 hours.
[image: training results screenshot]

4SkyNet commented Mar 26, 2017

Hum, if you flip the image you need to map actions as well, or we need label-safe augmentation techniques.

You are right @etienne87 >> actions should be remapped too where applicable --> and it's simple enough for left-right flipping, where we have to remap left to right && right to left (for Atari games)

Does that even have this parameter?

Yes, they call it lambda and it's equal to 1.0 by default

...using 12 workers on my 6-core, 12 threads

@nczempin if I see everything correctly --> there is no improvement from 6 to 12 agents in terms of time >> global_step/sec is a bit over 600 in both cases
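For the left-right flipping point above, a minimal sketch of what "remap left to right && right to left" means in practice; the action ids are placeholders, since the real mapping depends on the environment's action set:

import numpy as np

LEFT, RIGHT = 4, 3  # placeholder action ids; look up the actual env's action meanings

def flip_frame_and_action(frame, action):
    # Horizontally flip the observation and swap left/right actions so the
    # (frame, action) pair stays consistent after the augmentation.
    flipped = np.fliplr(frame)
    remap = {LEFT: RIGHT, RIGHT: LEFT}
    return flipped, remap.get(action, action)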

4SkyNet commented Mar 26, 2017

@nczempin hm, I've also noticed that the episode time becomes a bit shorter for your Boxing run during training --> it's a good sign, but it's hard to properly control this with the current A3C:
A typical episode lasts about 2 minutes, but when your agent reaches 100 points --> you get a KO and the game enters a terminal state to start a new one from scratch

nczempin commented Mar 26, 2017

@nczempin hm, I've also noticed that the episode time becomes a bit shorter for your Boxing run during training --> it's a good sign, but it's hard to properly control this with the current A3C:
A typical episode lasts about 2 minutes, but when your agent reaches 100 points --> you get a KO and the game enters a terminal state to start a new one from scratch

Are you saying either of the A3Cs has implicit assumptions about episode lengths being more uniform? Or just the original one?

In general, agents will eventually reach maximum scores. In ALE, for many of the games that roll the score over, this is not actually possible; an episode could potentially go on forever. It is an open question there how to handle score wrapping. Depending on how the Python interface is used, agents might be discouraged from wrapping the score. IMHO it's pointless to keep going once you can wrap the score (for an agent; not necessarily for a human).

@nczempin
Author

@nczempin if I see everything correctly --> there is no improvement from 6 to 12 agents in terms of time >> global_step/sec is a bit over 600 in both cases

I thought I saw a slight improvement in wall-clock time, but I didn't look at it in detail. I guess I should have included the fps images as well.

4SkyNet commented Mar 26, 2017

Are you saying either of the A3Cs has implicit assumptions about episode lengths being more uniform? Or just the original one?

No, you just control the training quality through the acquired (discounted) reward.
There is no information about your lives or KOs (in the Boxing case) for the current A3C

@nczempin
Author

Okay, I think I need to clarify what I was referring to:

I don't understand this statement:

but it's hard to properly control this with the current A3C:

It sounded to me like you're saying that episodes terminating early is somehow a problem, because some trait of the "current A3C" optimizes for episodes that don't terminate early.

4SkyNet commented Mar 27, 2017

It sounded to me like you're saying that episodes terminating early is somehow a problem, because some trait of "current A3C" somehow optimizes for episodes that don't terminate early.

A3C sets the last terminal discounted reward to 0, which is not so good for Boxing.
It would be better to set that value to a relatively big number to encourage the agent to reach KOs as soon as possible (we clip all rewards to -1..1, so we could only set it to 1 for the current algorithm).
On the other hand, if we have lives and we lose a life, we should set the reward signal to some negative value --> but we don't handle such cases in A3C
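To make the terminal handling and clipping described above concrete, here is a minimal sketch of the return computation, assuming a rollout of rewards plus a value estimate for the last state; the helper name is illustrative, not code from this repo:

import numpy as np

def n_step_returns(rewards, last_value, terminal, gamma=0.99):
    # Rewards are clipped to [-1, 1], as described above.
    clipped = np.clip(np.asarray(rewards, dtype=np.float32), -1.0, 1.0)
    # At a terminal state there is no future to bootstrap from, so the tail value
    # is 0; otherwise we bootstrap with the critic's estimate V(s_last).
    running = 0.0 if terminal else float(last_value)
    returns = np.zeros_like(clipped)
    for t in reversed(range(len(clipped))):
        running = clipped[t] + gamma * running
        returns[t] = running
    return returns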

@nczempin
Author

A3C sets the last terminal discounted reward to 0

Really? What is the motivation behind this?
I can see how it would make a difference if you had just one final reward at the end and that was set to 0 (although I'm not sure I understand this correctly), but in Boxing you simply terminate early when one of the players reaches 100; it is not like one big punch gets you from 0 to 100.

we clip all rewards to -1..1, so we could only set it to 1 for the current algorithm

Wouldn't normalizing be better than clipping? And for both options, wouldn't knowing the max score be helpful? I can't even imagine how you'd clip or normalize without knowing the max (other than the max-seen-so-far), to be honest.

Once (if this gets implemented) some of the Atari games get maximum scores in ALE, e.g. Boxing has 100 and Pong 21, perhaps there could/should be some way to take advantage of this in algorithms; but wouldn't that count as domain knowledge?

if we have lives and we lose a life, we should set the reward signal to some negative value --> but we don't handle such cases in A3C

I also thought about including other reward signals in ALE, but in the end the number of lives is just part of the state, and behaviour that minimizes losing lives would presumably also maximize the score; or maximizing the rewards will turn out to involve avoiding losing lives.
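To illustrate the clipping-versus-normalizing contrast raised above (a sketch only; the max_score argument assumes a known per-game maximum, which is exactly the domain knowledge being questioned):

import numpy as np

def clip_reward(r):
    # What the A3C-style setups discussed here do: squash into [-1, 1].
    return float(np.clip(r, -1.0, 1.0))

def normalize_reward(r, max_score):
    # The alternative being asked about: scale by a known per-game maximum,
    # e.g. 100 for Boxing or 21 for Pong.
    return float(r) / float(max_score)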

@nczempin
Author

Universe Starter Agent:
python3 train.py --num-workers 12 --env-id SeaquestDeterministic-v3 --log-dir /tmp/seaquestd3_12
Wall clock time: just over 10 hours.
[image: training results screenshot]

@nczempin
Author

Regular GA3C stuck near 1700 points on Seaquest after 12 hours (but still better than universe starter agent):

[Time:     3600] [Episode:     4009 Score:   520.0000] [RScore:   573.0000 RPPS:  1647] [PPS:  1636 TPS:    31] [NT: 10 NP:  2 NA: 32]

[Time:     7201] [Episode:     6712 Score:   560.0000] [RScore:   650.4000 RPPS:  1589] [PPS:  1625 TPS:    31] [NT: 15 NP:  4 NA: 34]

[Time:     9271] [Episode:     8357 Score:  1220.0000] [RScore:  1000.7400 RPPS:  1614] [PPS:  1626 TPS:    31] [NT: 15 NP:  2 NA: 28]

[Time:    16630] [Episode:    13066 Score:  1720.0000] [RScore:  1700.0000 RPPS:  1648] [PPS:  1641 TPS:    31] [NT: 10 NP:  6 NA: 41]

[Time:    27096] [Episode:    19523 Score:  1760.0000] [RScore:  1700.1600 RPPS:  1639] [PPS:  1647 TPS:    31] [NT: 17 NP:  3 NA: 36]

[Time:    45286] [Episode:    30468 Score:  1680.0000] [RScore:  1705.9800 RPPS:  1620] [PPS:  1636 TPS:    31] [NT:  8 NP:  4 NA: 34]

@nczempin
Author

After a promising start, the GAE version gets stuck near 212 points:

[Time: 19148] [Episode: 36972 Score: 200.0000] [RScore: 212.2800 RPPS: 1936] [PPS: 1996 TPS: 38] [NT: 9 NP: 2 NA: 38]

@nczempin
Author

Montezuma's Revenge, basic GA3C (one can hope):

[Time: 36385] [Episode: 77507 Score: 0.0000] [RScore: 0.3000 RPPS: 1219] [PPS: 1473 TPS: 28] [NT: 35 NP: 4 NA: 35]

nczempin commented Mar 28, 2017

I've just added Adventure to ALE; that might be even harder than Montezuma's Revenge with current algorithms.

I wonder if intrinsic motivation would help it, like it did Montezuma's (a little bit; Adventure is not quite as dangerous as Montezuma's, but the rewards are even sparser).

@nczempin
Author

I'm trying to move my changes to ALE into gym; it's quite tedious because they have diverged, and it's not immediately obvious in what way.

@nczempin
Author

Okay, I seem to have managed to get it to work; here's Adventure running:

[Time: 34] [Episode: 1 Score: -1.0000] [RScore: -1.0000 RPPS: 43] [PPS: 44 TPS: 26] [NT: 1 NP: 2 NA: 33]

Really wondering if it will ever get a +1.

Any tips on which implementation I should pick to make this more likely would be appreciated.

Would it be of any help to step in and control manually (sorta off-policy learning)?

Change to openai/atari-py
Change to openai/gym

@nczempin
Author

Huh, the agent reached a score of 0. That's only possible by timing out of the episode. I hope it doesn't learn to sit idly at home forever...

[Time: 2369] [Episode: 182 Score: 0.0000] [RScore: -0.9945 RPPS: 921] [PPS: 922 TPS: 30] [NT: 2 NP: 4 NA: 31]

Which parameters do I need to set so it will eventually explore to bringing the chalice back to the golden castle?

I'm guessing there's no hope yet; it may require custom rewards to encourage exploration, opening castle doors, etc.

I'm currently looking into providing custom rewarders for openai/universe. The docs are sparse...
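In case it helps anyone experimenting along these lines, here is a rough sketch of a gym-style wrapper that adds a small exploration bonus for previously unseen (coarsely downsampled) screens; it is not the openai/universe rewarder API, and the class and parameter names are made up:

import gym

class ExplorationBonusWrapper(gym.Wrapper):
    # Adds a small bonus the first time a coarsely downsampled screen is seen.
    def __init__(self, env, bonus=0.01):
        super(ExplorationBonusWrapper, self).__init__(env)
        self.bonus = bonus
        self.seen = set()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        key = obs[::16, ::16].tobytes()  # very coarse "hash" of the screen
        if key not in self.seen:
            self.seen.add(key)
            reward += self.bonus
        return obs, reward, done, info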

@nczempin
Author

Unsurprisingly, the GA3C agent got nowhere on Adventure. Perhaps "distance of chalice to yellow castle" and "have discovered chalice" should somehow be added as rewards, but the values would be somewhat arbitrary.

[Time: 50487] [Episode: 2815 Score: 0.0000] [RScore: -0.9160 RPPS: 1425] [PPS: 1456 TPS: 28] [NT: 6 NP: 2 NA: 22]

nczempin commented Mar 30, 2017

In any case, I think I've answered my original question about GA3C vs. universe-starter-agent and will close this huge thread now.

@etienne87

So ... what are the conclusions?

@nczempin
Author

GA3C indeed makes better use of the available resources; GAE can help a lot but needs some parameter tweaking that I'm not ready for.

But mainly the conclusion is that I need to spend more time learning about how all of these things work before I can actually give a conclusion.

So I'll continue to try and help in areas where I can bring in my skills (e. g. adding more games to ALE, perhaps add an environment to Gym, other engineering-focused tasks that will satisfy my curiosity and perhaps help the researchers or others), and go right back to Supervised Learning, with maybe a little bit of generic (not general) AI thrown in, plus work my way through Tensorflow tutorials (possibly look at the other libraries), maybe implement some of the classic algorithms from scratch myself, etc.

4SkyNet commented Mar 30, 2017

A3C sets the last terminal discounted reward to 0

Really? What is the motivation behind this?

Indeed - I think the main motivation is that reaching a terminal state isn't good (for most Atari games, especially since we use the same set of parameters for all games), and we also can't do any estimation with the value "function" there.

we clip all rewards to -1..1, so we could only set it to 1 for the current algorithm

Wouldn't normalizing be better than clipping?

ALE, e. g. Boxing has 100, Pong 21, perhaps there could/should be some way to take advantage of this in algorithms;

Perhaps, but we would have to investigate the pros and cons of changing the rewards.
There are some good thoughts on this from Hado van Hasselt.

I also thought about including other reward signals in ALE

It would be good to do some reasonable investigation in this direction.

the number of lives is just part of the state

Hm, I don't think so, since we just get the raw image as the state.
But gym has some access to the lives; I'm not sure about ALE, but I think it does too.

maximizing the rewards will turn out to involve avoiding losing lives

Mostly yes, especially if losing a life slows down the reward gain. But in some games an AI will also try to lose lives if it can gain some advantage from it.

I wonder if intrinsic motivation would help it, like it did Montezuma's (a little bit; Adventure is not quite as dangerous as Montezuma's, but the rewards are even sparser).

You are right. Intrinsic motivation helps enormously in these types of games > you can add it to an A3C implementation (as an exploration bonus)

Any tips on which implementation I should pick to make this more likely would be appreciated.

So I'll continue to try and help in areas where I can bring in my skills (e. g. adding more games to ALE, perhaps add an environment to Gym, other engineering-focused tasks that will satisfy my curiosity and perhaps help the researchers or others)

That's great to hear. I also recommend looking at the Retro-Learning-Environment (RLE) - it covers not only Atari but also SNES and perhaps even Sega.

I'm guessing there's no hope yet; it may require custom rewards to encourage exploration, opening castle doors, etc.

Yup, or do something like this

GA3C indeed makes better use of available resources;

Yeah - that's the main reason: organizing your data workflow in a more efficient way.
It's vanilla A3C, but we can add some reasonable improvements to it.

@nczempin
Author

the number of lives is just part of the state

Hm, I don't think so, since we just get the raw image as the state.
But gym has some access to the lives; I'm not sure about ALE, but I think it does too.

Okay, I have to be careful here; what I meant was that the number of lives is part of the internal state of the game (in many cases ALE does indeed allow you to query it, but it's not strictly necessary), not necessarily part of the state that's observed by the agent (which is just pixels).
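As a concrete illustration (assuming the gym/ALE versions from this era of the thread; the info key may differ elsewhere): the observation is just pixels, while the life counter lives in the emulator's internal state and has to be queried separately.

import gym

env = gym.make("BoxingDeterministic-v3")  # environment id used earlier in this thread
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
# obs is only the screen; the internal life counter is surfaced via the info dict.
print(obs.shape, info.get("ale.lives"))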

@nczempin
Author

A3C sets the last terminal discounted reward to 0

Really? What is the motivation behind this?

Indeed - I think the main motivation is that reaching a terminal state isn't good (for most Atari games, especially since we use the same set of parameters for all games), and we also can't do any estimation with the value "function" there.

But then how does GA3C even see the -1 on Adventure? Or are you saying the original A3C does it, while GA3C doesn't?

Adventure provides a good reason not simply to use 0 upon "fail", because there is a difference between "failing because eaten" and "failing because we timed out".

Hm, or maybe there isn't.

4SkyNet commented Mar 30, 2017

But then how does GA3C even see the -1 on Adventure? Or are you saying the original A3C does it, while GA3C doesn't?

No, for both of them all rewards are represented (because of clipping) as -1, 1 or 0, if we're talking about vanilla A3C.
But we don't use these raw rewards --> we use their discounted sum wrt gamma.
It looks as follows:

import numpy as np

def discounted_reward(real_rewards, gamma):
    # Walk backwards through the rollout, accumulating gamma-discounted returns.
    discounted_r = np.zeros_like(real_rewards, dtype=np.float32)
    running_add = 0
    for t in reversed(range(0, discounted_r.size)):
        running_add = running_add * gamma + real_rewards[t]
        discounted_r[t] = running_add
    return discounted_r

rew_test = [1, -1, 0, 1, 0]
print(discounted_reward(rew_test, .95))

[ 0.907375 -0.0975    0.95      1.        0.      ]

And if a terminal state is reached we just set running_add to 0 for A3C;
if not --> we estimate it with the value "function".
I may have stated this a bit imprecisely, since it sounds like it is strictly zero.
We just skip the future estimate, because the episode ends, but we keep the last received reward.
Anyway, this reward could be higher for some games, but we lose some
generality if we do that.

@nczempin
Author

Still not sure I understand: Are you saying shooting a regular space invader gives as much reward as shooting a mothership?

The discounting is orthogonal to that question.

Apart from that, I'm not really in a position to argue about any of this. When I have a better understanding of actor-critic and all that I need to know before that, I may revisit this. So far I've watched the David Silver lectures, and I have some catching up to do.

With my engineer's mindset I'd also like to implement all these different techniques, like TD(lambda) etc., and obviously there is a lot I have to do before I even get to regular actor-critic.

4SkyNet commented Mar 30, 2017

Are you saying shooting a regular space invader gives as much reward as shooting a mothership?
The discounting is orthogonal to that question.

Yes, but I don't really know the Space Invaders scores > all of them are clipped to -1..1 for generalization, and that's not a good way if we want to get more out of specific environments (perhaps; we would have to do some investigation, because "blind" changes can also have a negative effect).

The discounted reward is just a technique related to the horizon over which we view our received rewards. It could be more optimistic, for example, if we hit "motherships". And it also affects behaviour; for example again:
gamma=0.95 is preferable to gamma=0.99 for PacMan, since in the latter case the agent is more afraid to do some things and sits in the corner.

@nczempin
Author

Not sure I follow.

The rewards come from the environment; the algorithm is trying to figure out how to get those rewards, and algorithms are ranked based on how well they score compared to other algorithms (or humans). The distinction between a regular space invader and a mothership is only indirect: your algorithm knows nothing about the different types, it just knows that in state s1 it was better to move slightly to the right and then shoot, to get the (points for the) mothership, rather than to move left and get the points for the regular invader.

That is completely general as long as the environment gives out rewards.

As I said, I know what discounting reward is and what it is for; like in finance, getting a reward now is better than getting a reward tomorrow, and how much better is determined by the discount factor, which is usually a measure of uncertainty; in finance it is dynamic, based on risk vs. reward.

But the discount factor doesn't have anything to do with the reward coming from a mothership or not, unless your algorithm takes into account that to get the higher score it also risks dying more often.

And when you have very sparse rewards, it makes sense to discount only slightly (a high gamma), because otherwise the reward might effectively disappear after enough steps due to rounding. Although technically, if the rewards are really sparse (like in Adventure, only +1 right at the end), it shouldn't make any difference as long as you don't round to 0; the 1 will always be more than any other reward.

I guess in that case (rounding it away) it may even make sense to dynamically adjust the gamma: If you keep finishing episodes without getting any rewards, gamma should eventually be increased so that later rewards eventually get counted in the present.
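As a quick numeric illustration of that "rounding away" effect, this just prints gamma**k, i.e. the weight a +1 reward gets when it arrives k steps in the future:

# Weight of a reward k steps away under two common gammas.
for gamma in (0.95, 0.99):
    for k in (50, 200, 1000):
        print("gamma=%.2f, k=%4d: %.3e" % (gamma, k, gamma ** k))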

4SkyNet commented Mar 30, 2017

@nczempin you are definitely right.
And it's not so simple with gammas, but you can try.
PS> for chess we always get +1, -1 or 0 at the end, and hand-crafting intermediate rewards often hurts more than it helps, because it is really hard for a human to estimate such things.
