Add terminal reward. #1370

Merged
merged 13 commits into mlpack:master from mirraaj:AddTerminaleward on Apr 28, 2018

@mirraaj
Contributor

mirraaj commented Apr 16, 2018

The previous implementation did not account for the terminal condition and its reward. Hence I am adding a terminal condition to the environment to take care of terminal states.
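
To illustrate the idea, here is a simplified sketch (not mlpack's actual environment API; the names SimpleEnv, doneReward, Transition, and StepReward are placeholders): the environment's step function returns a dedicated terminal reward when the episode ends with the current transition.

#include <cstddef>

class SimpleEnv
{
 public:
  explicit SimpleEnv(const double doneReward = 0.0) : doneReward(doneReward) { }

  // Take one step: return the usual per-step reward, or the terminal
  // reward if this transition ends the episode.
  double Sample(const double state, const size_t action, double& nextState)
  {
    nextState = Transition(state, action);
    return IsTerminal(nextState) ? doneReward : StepReward(state, action);
  }

  // Placeholder terminal check; a real environment would inspect the state.
  bool IsTerminal(const double state) const { return state > 1.0; }

 private:
  // Placeholder dynamics and per-step reward.
  double Transition(const double state, const size_t /* action */) const
  { return state + 0.1; }

  double StepReward(const double /* state */, const size_t /* action */) const
  { return -1.0; }

  double doneReward;
};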

mirraaj added some commits Apr 16, 2018

Added terminal case to Q-Learning.
Previous implementation did not handle the terminal case.
@mirraaj

Contributor

mirraaj commented Apr 16, 2018

Dear @zoq, I have added the terminal case to the Q-Learning algorithm. This was overlooked previously. Please let me know what you think about this.

@mirraaj

Contributor

mirraaj commented Apr 17, 2018

Style: #1372
Static Code Analysis: I am not sure why this error has occurred. I guess all the previous analyses of this file were successful.

@zoq, please let me know if any explanation is required.

@sourabhvarshney111

Contributor

sourabhvarshney111 commented Apr 17, 2018

@luffy1996 I don't think this is an error you introduced. If you change a file, every line of that file is checked by the static code analysis. See the static checks for #1309; you can see that those flags might not be actual errors either, just a different way of using the statement.

@zoq

Member

zoq commented Apr 17, 2018

Don't worry about the Static Code Analysis issue; in this case it's just fine to use action inside the condition.

@mirraaj

Contributor

mirraaj commented Apr 18, 2018

@sourabhvarshney111 I looked at the PR, and it seems that this error has nothing to do with my changing a few lines of the code. Thanks for pointing it out.

@mirraaj

Contributor

mirraaj commented Apr 18, 2018

@zoq, do you think this implementation should be merged?

@zoq

Member

zoq commented Apr 19, 2018

The issue I see is that this bias could influence the learning strategy. Let's say we have two runs with rewards of -10 and -11, and the -11 run also ends in a terminal state. With your code we end up with -10 and 90 as the rewards, which could be fine, but this would imply that the second run (90) is way better in terms of reward. Let me know what you think. I guess, if someone used this before, we could make the done reward configurable.
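
As a concrete illustration of the numbers above (a minimal sketch; the fixed +100 terminal reward is the value discussed in this thread, not an mlpack default):

#include <iostream>

int main()
{
  const double doneReward = 100.0;

  // Run 1: ten steps with reward -1 each, no terminal bonus.
  const double run1 = 10 * -1.0;                // -10

  // Run 2: eleven steps, but the final -1 is replaced by the terminal
  // reward, so -11 becomes 90 and looks far better than run 1.
  const double run2 = 10 * -1.0 + doneReward;   // 90

  std::cout << "run 1: " << run1 << ", run 2: " << run2 << std::endl;
  return 0;
}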

@mirraaj

Contributor

mirraaj commented Apr 19, 2018

I do agree with you that we should make the done reward configurable. Do you think we can merge this after adding that option?

mirraaj added some commits Apr 19, 2018

@mirraaj

Contributor

mirraaj commented Apr 20, 2018

@zoq, the terminal reward makes sure that the agent learns the objective of the game. Let me consider two cases and explain how this helps learning.

a) CartPole: For the CartPole game, the objective is to keep the pole up. If it falls, we lose the game. The environment should punish the agent for the step that ends the episode. Here, we give a terminal reward of 0 to handle this case.

b) Acrobot: For this game, the purpose is to reach the minimum height in a minimum number of steps. Now, if we give a reward of -1 at the terminal step, like the previous implementation did, the agent doesn't learn the dynamics of the game. Giving a positive reward helps the agent learn that it took a correct step.

Now let me discuss your example. The +100 reward signals that the agent took the correct step. Since it reaches a terminal state, it will increase the value of Q(s, a) for that state-action pair. Let me know what you think about this.
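
To make the point about Q(s, a) concrete, here is a tabular sketch (illustrative only; mlpack's Q-learning is a batched, network-based implementation, and the names QUpdate, alpha, and gamma are placeholders): at a terminal transition the bootstrap term is dropped, so the terminal reward flows directly into Q(s, a).

#include <armadillo>

// One tabular Q-learning update for the transition
// (state, action, reward, nextState).  The table is laid out as Q(action, state).
void QUpdate(arma::mat& Q,
             const size_t state,
             const size_t action,
             const double reward,
             const size_t nextState,
             const bool terminal,
             const double alpha = 0.1,   // learning rate
             const double gamma = 0.99)  // discount factor
{
  // Drop the bootstrap term when the transition is terminal, so the
  // terminal reward flows directly into Q(action, state).
  const double bootstrap = terminal ? 0.0 : Q.col(nextState).max();
  const double target = reward + gamma * bootstrap;
  Q(action, state) += alpha * (target - Q(action, state));
}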

@zoq

Member

zoq commented Apr 23, 2018

> For this game, the purpose is to reach the minimum height in a minimum number of steps. Now, if we give a reward of -1 at the terminal step, like the previous implementation did, the agent doesn't learn the dynamics of the game.

I don't think this is true; it could learn the task, since the agent might be able to correlate the reward over time. However, I agree that returning another reward might be beneficial, but I'm not sure 100 is a good default value. Why not just go with 0, as done for the CartPole task?

@mirraaj

Contributor

mirraaj commented Apr 24, 2018

0 could also be given to the agent as the default value. I am going ahead with the default values used by the OpenAI Gym environments.

@zoq

If you could handle the issue I pointed out above, I think this is ready. Thanks for looking into the issue.

for (size_t i = 0; i < sampledNextStates.n_cols; ++i)
{
  target(sampledActions[i], i) = sampledRewards[i] + config.Discount() *
      (isTerminal[i] ? 0.0 : nextActionValues(bestActions[i], i));
  if (isTerminal[i])

@zoq

zoq Apr 24, 2018

Member

The current code already covers that:

config.Discount() * (isTerminal[i] ? 0.0 : nextActionValues(bestActions[i], i))

If isTerminal[i] is true, we multiply config.Discount() by 0.0.
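
Spelled out for a single transition (a small helper for illustration only; TDTarget is not part of mlpack), the ternary computes:

// TD target for one sampled transition; equivalent to the expression above.
double TDTarget(const double reward,
                const double discount,
                const double bestNextActionValue,
                const bool terminal)
{
  // Terminal: the discounted bootstrap term is multiplied by 0, so the
  // target reduces to the reward alone and no extra branch is needed.
  if (terminal)
    return reward;

  // Non-terminal: reward plus the discounted value of the best next action.
  return reward + discount * bestNextActionValue;
}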

@zoq

zoq approved these changes Apr 24, 2018

No more comments from my side, thanks. I'll go ahead and merge this in 3 days to leave time for any other comments.

@rcurtin

Looks good to me, thanks for the contribution! It might be better to choose a larger default doneReward, but I am not sure I understand the issue fully, so ignore my comment if it doesn't make sense. It seems that with the default values there is actually no reward at all when the task is done. So if we would generally want to specify a larger reward, it might be useful to change the default, or to add some documentation pointing out that doneReward should not be left at the default.

@mirraaj

Contributor

mirraaj commented Apr 27, 2018

@rcurtin I do understand your point. The choice of doneReward is usually user-dependent. I was planning on a higher reward, but @zoq pointed out that we should keep it lower. I guess he was correct here, because we should not introduce a large bias on our end. Let the user determine the magnitude of doneReward for the game.
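
For example, a hypothetical usage sketch (the class name Env and its constructor signature are placeholders, not mlpack's API) of the user picking the terminal reward that matches their task:

#include <iostream>

// Minimal stand-in for an environment with a configurable terminal reward.
class Env
{
 public:
  explicit Env(const double doneReward = 0.0) : doneReward(doneReward) { }
  double DoneReward() const { return doneReward; }

 private:
  double doneReward;
};

int main()
{
  Env neutralTask;        // Default: no extra reward at the terminal state.
  Env shapedTask(100.0);  // User explicitly opts into a large terminal bonus.

  std::cout << neutralTask.DoneReward() << " "
            << shapedTask.DoneReward() << std::endl;
  return 0;
}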

@mirraaj

Contributor

mirraaj commented Apr 28, 2018

@zoq, I think you can go ahead and add this. :)

@zoq zoq merged commit d390433 into mlpack:master Apr 28, 2018

4 of 5 checks passed

Static Code Analysis Checks: Build finished.
Memory Checks
Style Checks: Build finished.
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@zoq

Member

zoq commented Apr 28, 2018

Thanks, happy to use this in some future tests.

@mirraaj mirraaj deleted the mirraaj:AddTerminaleward branch Apr 29, 2018

@mirraaj mirraaj restored the mirraaj:AddTerminaleward branch Apr 29, 2018
