# Observe and Look Further: Achieving Consistent Performance on Atari

## Summary

- despite advances in deep RL (DRL), agents still fail to learn human-level policies consistently across a diverse set of tasks
- Three challenges and our approaches
  - diverse reward distributions result in large variance
    - transformed Bellman operator to reduce variance
  - limits on reasoning over long time horizons => discount factor practically capped at 0.99
    - temporal consistency loss => stable training up to discount factor 0.999
  - limited exploration
    - guide exploration by using human demonstrations


## Introduction

- Learning human-level policies consistently across the entire set of games remains an open problem

### Three challenges

#### diversity of reward distribution across games 

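- rewards differ in density and in scale
- this leads to large variance and unstable learning
- DQN (Mnih et al., 2015) clips rewards to [-1.0, 1.0]
  - limited: clipping hides reward magnitudes, e.g. in BOWLING the agent cannot tell how many pins (9 or fewer) it knocked down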

#### reason over long time horizons

- important when rewards are very sparse
  - e.g. MONTEZUMA'S REVENGE: rewards can be several hundred time steps apart
- the algorithm should be able to handle discount factors close to 1.0
- in practice, the discount factor has been limited to 0.99
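
To make the effect of the discount factor concrete, here is a small sketch. It uses the common rule of thumb that the effective horizon is about `1 / (1 - gamma)`; the 500-step delay is an illustrative number chosen to stand in for "several hundred time steps", not a value from the paper.

```python
def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb number of future steps that matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def discounted_weight(gamma: float, steps: int) -> float:
    """Weight gamma**steps that a reward `steps` time steps away receives."""
    return gamma ** steps


# Illustrative delay (assumption): a reward that arrives 500 steps later.
DELAY = 500

for gamma in (0.99, 0.999):
    print(
        f"gamma={gamma}: horizon ~{effective_horizon(gamma):.0f} steps, "
        f"weight of a reward {DELAY} steps away = {discounted_weight(gamma, DELAY):.4f}"
    )
```

With a discount factor of 0.99 a reward 500 steps away is almost invisible to the agent, while with 0.999 it still carries most of its weight.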

#### efficient exploration of MDP

- even with very sparse rewards, the algorithm should discover long trajectories with a high cumulative reward in a reasonable amount of time


### Our contributions

None of the existing deep RL algorithms has been able to address these three challenges at once.


## Algorithm

- transformed Bellman operator
- temporal consistency loss
- Combining Ape-X DQN and DQfD

### DQN Background

![](./images/obs_look_further/01.png)

![](./images/obs_look_further/02.png)

![](./images/obs_look_further/03.png)


#### From Udacity RL

![](./images/obs_look_further/04.png)
![](./images/obs_look_further/05.png)
![](./images/obs_look_further/06.png)
![](./images/obs_look_further/07.png)
![](./images/obs_look_further/08.png)
![](./images/obs_look_further/09.png)
![](./images/obs_look_further/10.png)
![](./images/obs_look_further/11.png)
![](./images/obs_look_further/12.png)




### Transformed Bellman Operator

- the algorithm can diverge if the variance of the optimization target is too high
- to reduce the variance, Mnih et al. clip the rewards to the interval [-1, 1]
- our method: rescale the action-value function instead of the rewards

![](./images/obs_look_further/13.png)

- Is it still a contraction? Yes.
- Does it reduce the scale of the action values? Yes.

![](./images/obs_look_further/14.png)
![](./images/obs_look_further/15.png)
![](./images/obs_look_further/16.png)
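
The idea of rescaling action values can be sketched in a few lines. This sketch assumes the squashing function proposed in the paper, `h(z) = sign(z) * (sqrt(|z| + 1) - 1) + eps * z`, with `eps = 1e-2` taken as an assumed default; the function names and the `done` flag are hypothetical and only serve the illustration.

```python
import math

EPS = 1e-2  # assumed coefficient of the linear term that keeps h invertible


def h(z: float, eps: float = EPS) -> float:
    """Squash a value: roughly sqrt-shaped for large |z|, near-linear around 0."""
    return math.copysign(math.sqrt(abs(z) + 1.0) - 1.0, z) + eps * z


def h_inv(x: float, eps: float = EPS) -> float:
    """Closed-form inverse of h, obtained by solving a quadratic in sqrt(|z| + 1)."""
    u = (math.sqrt(1.0 + 4.0 * eps * (abs(x) + 1.0 + eps)) - 1.0) / (2.0 * eps)
    return math.copysign(u * u - 1.0, x)


def transformed_target(reward, gamma, next_q_values, done=False):
    """One-step target h(r + gamma * h_inv(max_a' Q(s', a'))).

    The network stores squashed values, so the bootstrap value is un-squashed
    before the reward is added and the sum is squashed again.
    """
    bootstrap = 0.0 if done else h_inv(max(next_q_values))
    return h(reward + gamma * bootstrap)


# Unclipped rewards of very different scale map to targets of moderate size.
for r in (1.0, 100.0, 10000.0):
    print(r, round(transformed_target(r, 0.999, [0.0, 0.0]), 3))
```

Unlike reward clipping, the mapping stays strictly increasing, so larger rewards still yield larger targets.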

- Definition of the TD loss

![](./images/obs_look_further/17.png)


## Temporal consistency (TC) loss

- instability can still occur as the discount factor approaches 1
- Increasing the discount factor decreases the temporal difference in value between non-rewarding states. 
- In particular, unwanted generalization of the neural network to the next state (due to the similarity of temporally adjacent target values) can result in catastrophic TD backups.

- TC loss: define a TD-style loss on the one-step (next-state) prediction as well
  - This makes sure that the updated estimates adhere to the operator and thus are consistent over time
 
![](./images/obs_look_further/18.png)
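
A minimal sketch of the TC idea follows: penalize how much the prediction for the next state changes between the previous and the updated parameters. It assumes a Huber loss as the penalty and leaves the choice of the next action to the caller; all function and argument names are hypothetical.

```python
def huber(delta: float, kappa: float = 1.0) -> float:
    """Huber loss: quadratic for |delta| <= kappa, linear beyond that."""
    abs_delta = abs(delta)
    if abs_delta <= kappa:
        return 0.5 * delta * delta
    return kappa * (abs_delta - 0.5 * kappa)


def tc_loss(next_q_current, next_q_previous, next_action: int) -> float:
    """Penalty on the drift of the next-state estimate Q(s', a').

    next_q_current:  Q-values for s' under the parameters being updated.
    next_q_previous: Q-values for s' under the previous parameters.
    next_action:     the action a' used in the TD backup (chosen by the caller).
    """
    return huber(next_q_current[next_action] - next_q_previous[next_action])


# If an update to Q(s, a) also shifts Q(s', a') by 0.5, the TC loss pushes back.
print(tc_loss([1.0, 2.5], [1.0, 2.0], next_action=1))
```

The loss is zero when the next-state estimate is unchanged, so it only acts against the unwanted generalization described above.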

## Ape-X DQfD

Combines:
- the DQfD algorithm
- distributed prioritized experience replay

### Deep Q-learning from Demonstrations

- https://arxiv.org/abs/1704.03732

- pre-train the Q-network from a replay buffer that contains only human demonstration data
- then run normal Q-learning with the agent's own replay buffer


### Distributed prioritized experience replay

- https://arxiv.org/abs/1803.00933

- prioritized experience replay (Schaul et al., 2016)
  - previously, experiences were sampled uniformly
  - priority is high when the TD error of the experience was high
    - such experiences are more valuable because they correct the policy more
  - priorities can over- or under-value some experiences,
    - so sample somewhere between pure greedy prioritization and uniform random sampling
- distributed experience replay (Horgan et al., 2018)
  - separates acting from learning
  - actors provide experiences to the central learner
  - the learner learns from prioritized experiences
  - called the Ape-X architecture
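
The compromise between greedy prioritization and uniform sampling can be sketched with proportional prioritization, where transition `i` is drawn with probability `p_i**alpha / sum_k p_k**alpha`. The default `alpha = 0.6`, the small constant `eps`, and the function names are assumptions for illustration; importance-sampling corrections are left out to keep the sketch short.

```python
import random


def sampling_probabilities(td_errors, alpha: float = 0.6, eps: float = 1e-6):
    """Turn absolute TD errors into sampling probabilities.

    alpha = 0 gives uniform sampling; larger alpha moves toward greedy
    prioritization. eps keeps zero-error transitions sampleable.
    """
    priorities = [(abs(delta) + eps) ** alpha for delta in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, batch_size: int, alpha: float = 0.6, rng=None):
    """Draw a batch of buffer indices (with replacement) by priority."""
    rng = rng or random.Random()
    probs = sampling_probabilities(td_errors, alpha)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)


td_errors = [0.1, 2.0, 0.5, 0.0]
print(sampling_probabilities(td_errors, alpha=0.0))  # uniform
print(sampling_probabilities(td_errors, alpha=0.6))  # favors index 1
print(sample_indices(td_errors, batch_size=8, rng=random.Random(0)))
```

In a distributed setup along the lines described above, the actors would supply transitions with initial priorities and the learner would sample from the shared buffer in this way.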
  
![](./images/obs_look_further/19.png)

### Ape-X DQfD

- uses distributed prioritized experience replay
- mixes human demonstration data into the RL agent's experiences


![](./images/obs_look_further/20.png)

- trained with TD, TC, and imitation (IM) losses

![](./images/obs_look_further/21.png)
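
As a rough sketch of how an imitation term can sit next to the TD and TC terms, the example below assumes a DQfD-style large-margin loss that is applied only to expert transitions, and an unweighted sum of the three terms. The margin value, the equal weighting, and all names are assumptions, not values taken from the paper.

```python
def imitation_loss(q_values, expert_action: int, margin: float = 0.8) -> float:
    """Large-margin loss: max_a [Q(s, a) + margin * (a != a_E)] - Q(s, a_E).

    It is zero once the expert action beats every other action by `margin`.
    """
    augmented = [
        q + (0.0 if action == expert_action else margin)
        for action, q in enumerate(q_values)
    ]
    return max(augmented) - q_values[expert_action]


def combined_loss(td: float, tc: float, im: float, is_expert: bool) -> float:
    """Unweighted sum (assumption); the IM term counts only for expert data."""
    return td + tc + (im if is_expert else 0.0)


q = [1.0, 3.0, 2.0]
print(imitation_loss(q, expert_action=1))  # expert action already best by a margin
print(imitation_loss(q, expert_action=0))  # expert action undervalued
print(combined_loss(0.2, 0.1, imitation_loss(q, 0), is_expert=True))
```

The margin forces the network to rank the demonstrated action clearly above the alternatives instead of merely tying with them.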

## Experiments

![](./images/obs_look_further/22.png)
![](./images/obs_look_further/23.png)