Wrong A3C implementation #40

jkulhanek · 2019-02-13T12:10:38Z

I believe there is a bug in the A3C implementation, in the file "ProcessAgent.py" on line 107. The sub-episode return should be the value of the next state, not the previous one.

I suggest replacing:

            prediction, value = self.predict(self.env.current_state)

            ...
            if done or time_count == Config.TIME_MAX:
                terminal_reward = 0 if done else value

with:

            prediction, value = self.predict(self.env.current_state)

            ...
            if done or time_count == Config.TIME_MAX:
                terminal_reward = 0
                if not done:
                    _, terminal_reward = self.predict(self.env.current_state)
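
For reference, here is a minimal sketch of the bootstrapped return this suggestion is about. It is not code from ProcessAgent.py: the function name `bootstrapped_returns`, its parameters, and the default discount are assumptions made for illustration. The recursion R_t = r_t + discount * R_{t+1} starts from 0 when the episode has ended, and otherwise from the critic's estimate of the state reached after the last reward.

```python
from typing import List


def bootstrapped_returns(
    rewards: List[float],
    bootstrap_value: float,
    done: bool,
    discount: float = 0.99,
) -> List[float]:
    """Discounted returns for one sub-episode (hypothetical helper).

    rewards[t] is the reward received after acting in state s_t.
    bootstrap_value is the critic's estimate V(s_T) of the state that
    follows the last reward; it is ignored when the episode is done.
    """
    # Start the backward recursion from 0 (terminal) or V(s_T) (truncated).
    running = 0.0 if done else bootstrap_value
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns


# Two rewards of 1.0, discount 0.5, truncated with V(s_T) = 10.0.
truncated = bootstrapped_returns([1.0, 1.0], 10.0, done=False, discount=0.5)
# Same rewards, but the episode actually ended, so no bootstrap is used.
finished = bootstrapped_returns([1.0, 1.0], 10.0, done=True, discount=0.5)
```

The point of the sketch is only that `bootstrap_value` has to belong to the state after the last stored reward. Which state that is in the repository depends on how the experiences are batched.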

wgeul · 2020-02-25T22:27:33Z

I think the overall design follows a different approach:

            prediction, value = self.predict(self.env.current_state)
            action = self.select_action(prediction)
            reward, done = self.env.step(action)
            reward_sum += reward
            exp = Experience(self.env.previous_state, action, prediction, reward, done)

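The last line shows that each experience is backward-looking: it stores self.env.previous_state, the state the action was taken in, rather than the state reached after the step. I assume that is why setting terminal_reward to value is consistent when done is False?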
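
To make the timing concrete, here is a toy sketch. `ToyEnv`, `toy_value`, and `step_once` are hypothetical stand-ins, not the repository's classes, and the sketch assumes that `step` copies the old `current_state` into `previous_state`. A value predicted before the step then describes the state that ends up in `previous_state`, not the new `current_state`.

```python
class ToyEnv:
    """Hypothetical environment whose state is an integer step counter."""

    def __init__(self):
        self.previous_state = None
        self.current_state = 0

    def step(self, action):
        # Assumed behaviour: remember the old state, then advance.
        self.previous_state = self.current_state
        self.current_state += 1
        reward, done = 1.0, False
        return reward, done


def toy_value(state):
    """Stand-in critic: any deterministic function of the state works."""
    return 10.0 * state


def step_once(env):
    # Same ordering as the loop quoted above: predict first, then step.
    value = toy_value(env.current_state)
    reward, done = env.step(action=0)
    return value, reward, done


env = ToyEnv()
value, reward, done = step_once(env)

# After the step, the earlier prediction matches previous_state only.
value_matches_previous = value == toy_value(env.previous_state)
value_matches_current = value == toy_value(env.current_state)
```

Whether using that value as terminal_reward is correct then depends on which experiences the later reward accumulation keeps, and that part is not shown in this thread.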

jkulhanek closed this as completed Mar 6, 2020
