# Reinforcement Learning with Unsupervised Auxiliary Tasks

* Psygrammer / DGM: Part 1 - DeepMind paper review [1]
* 김무성

# Contents
* 1 RELATED WORK
* 2 BACKGROUND
* 3 AUXILIARY TASKS FOR REINFORCEMENT LEARNING
    - 3.1 AUXILIARY CONTROL TASKS
    - 3.2 AUXILIARY REWARD TASKS
    - 3.3 EXPERIENCE REPLAY
    - 3.4 UNREAL AGENT
* 4 EXPERIMENTS
    - 4.1 LABYRINTH RESULTS
    - 4.2 ATARI
* 5 CONCLUSION

# Abstract
* Deep reinforcement learning agents have achieved state-of-the-art results by 
    - <font color="red">directly maximising cumulative reward</font>. 
* However, <font color="blue">environments contain</font> 
    - a much wider variety of 
        - <font color="red">possible training signals</font>. 
* In this paper, we introduce 
    - an agent that also maximises 
        - <font color="red">many other pseudo-reward</font> functions 
            - <font color="blue">simultaneously</font> by reinforcement learning. 
        - All of these tasks 
            - <font color="red">share a common representation</font> that, like 
                - <font color="blue">unsupervised learning</font>, 
                - continues to develop in 
                    - the <font color="blue">absence of extrinsic rewards</font>. 
* We also introduce 
    - a novel mechanism for 
        - <font color="red">focusing this representation upon extrinsic rewards</font>, 
        - so that learning <font color="red">can rapidly adapt</font> 
            - to the most <font color="blue">relevant aspects of the actual task</font>.

#### See also
* [2] (Video) DeepMind - Reinforcement Learning with Unsupervised Auxiliary Tasks - https://youtu.be/Uz-zGYrYEjA

<img src="https://storage.googleapis.com/deepmind-live-cms/documents/iclrgif.gif" width=600 />

<img src="figures/cap1.png" width=600 />

#### stream of sensorimotor data

* Natural and artificial agents <font color="red">live in a stream of sensorimotor data</font>. 
* The actions an agent takes <font color="red">influence the future course of the sensorimotor stream</font>. 
* In this paper we develop agents that <font color="red">learn to predict and control this stream</font>, by solving a host of reinforcement learning problems, <font color="red">each focusing on a distinct feature</font> of <font color="blue">the sensorimotor stream</font>. 
*  <font color="red">Our hypothesis</font> is that an agent that  <font color="red">can flexibly control its future experiences</font> will also be able to  <font color="red">achieve any goal with which it is presented</font>, such as maximising its future rewards.

#### pseudo-rewards

* The classic reinforcement learning paradigm focuses on the maximisation of extrinsic reward. 
* However, in many interesting domains, <font color="red">extrinsic rewards are only rarely observed</font>.
* Even if extrinsic rewards are frequent, the sensorimotor stream <font color="red">contains an abundance of other possible learning targets</font>.
* <font color="blue">Traditionally, unsupervised learning</font> attempts to <font color="blue">reconstruct these targets</font>, such as the pixels in the current or subsequent frame. It is typically <font color="blue">used to accelerate the acquisition of a useful representation</font>.
* In contrast, our learning objective is to predict and control features of the sensorimotor stream, by treating them as pseudo-rewards for reinforcement learning. Intuitively, this set of tasks is more closely matched with the agent’s long-term goals, potentially leading to more useful representations.

#### experience replay mechanism

##### See also
* [4] Experience Replay - http://unpredictablepattern.blogspot.kr/2016/02/experience-replay.html
* [5] The future of memory: remembering, imagining, and the brain - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3815616/

Our architecture uses reinforcement learning to approximate both the optimal policy and optimal value function for many different pseudo-rewards.
* These include the long-term goal of predicting cumulative extrinsic reward as well as short-term predictions of extrinsic reward.
* To learn more efficiently, <font color="red">our agents use an experience replay mechanism</font> to provide additional updates to the critics.
    - Just as animals dream about positively or negatively rewarding events more frequently (Schacter et al., 2012), our agents preferentially replay sequences containing rewarding events.

#### jointly learned representation

* Importantly, both the auxiliary control and auxiliary prediction tasks
    - share 
        - the convolutional neural network and 
        - LSTM 
    - that the base agent uses to act. 
* By using this jointly learned representation, 
    - the base agent learns to optimise
        - extrinsic reward much faster and, 
        - in many cases, 
            - achieves better policies at the end of training.

#### Asynchronous Advantage Actor-Critic (A3C) framework

##### See also
* [6] "If you have installed TensorFlow, gone through the tutorials, and written some basic examples", TensorFlow KR Meetup 2016 - http://www.slideshare.net/carpedm20/ss-63116251

This paper brings together the state-of-the-art Asynchronous Advantage Actor-Critic (A3C) framework (Mnih et al., 2016), outlined in Section 2,
* with 
    - auxiliary control tasks and 
    - auxiliary reward tasks, 
        - defined in Section 3.1 and Section 3.2 respectively.

#### auxiliary tasks
* These auxiliary tasks 
    - do not require 
        - any extra supervision or 
        - signals from the environment 
            - than the vanilla A3C agent. 
* The result is our UNsupervised REinforcement and Auxiliary Learning (UNREAL) agent (Section 3.4).

#### UNREAL agent
* In Section 4 we apply our UNREAL agent 
    - to a challenging set of 
        - 3D-vision based domains known as the Labyrinth (Mnih et al., 2016), 
            - learning solely from the raw RGB pixels of a first-person view. 
* Our agent significantly outperforms the baseline agent using vanilla A3C, even when the baseline was augmented with an unsupervised reconstruction loss, in terms of speed of learning, robustness to hyperparameters, and final performance. 
* The result is an agent 
    - which on average achieves 87% of expert human-normalised score,
    - compared to 54% with A3C, 
    - and learns on average 10× faster than A3C. 
* Our UNREAL agent also significantly outperforms 
    - the previous state-of-the-art in the Atari domain.

# 1 RELATED WORK

* A variety of reinforcement learning architectures have focused on learning temporal abstractions, such as options (Sutton et al., 1999b), with policies that may maximise pseudo-rewards (Konidaris & Barreto, 2009; Silver & Ciosek, 2012).
    - The emphasis here has typically been on the development of temporal abstractions that facilitate high-level learning and planning.
    - In contrast, our agents do not make any direct use of the pseudo-reward maximising policies that they learn (although this is an interesting direction for future research). 
    - Instead, they are used solely as auxiliary objectives for developing a more effective representation.
* The Horde architecture (Sutton et al., 2011) also applied reinforcement learning to identify value functions for a multitude of distinct pseudo-rewards.
    - However, this architecture was not used for representation learning; instead each value function was trained separately using distinct weights.
* The UVFA architecture (Schaul et al., 2015a) is a factored representation of a continuous set of optimal value functions, combining features of the state with an embedding of the pseudo-reward function. 
* Similarly, the successor representation (Dayan, 1993; Barreto et al., 2016; Kulkarni et al., 2016) factors a continuous set of expected value functions for a fixed policy, by combining an expectation over features of the state with an embedding of the pseudo-reward function.
* Another, related line of work involves learning models of the environment (Schmidhuber, 2010; Xie et al., 2015; Oh et al., 2015). Although learning environment models as auxiliary tasks could improve RL agents (e.g. Lin & Mitchell (1992); Li et al. (2015)), this has not yet been shown to work in rich visual environments.
* More recently, auxiliary prediction tasks have been studied in 3D reinforcement learning environments.

# 2 BACKGROUND

<img src="figures/cap1.png" width=600 />

# 3 AUXILIARY TASKS FOR REINFORCEMENT LEARNING
* 3.1 AUXILIARY CONTROL TASKS
* 3.2 AUXILIARY REWARD TASKS
* 3.3 EXPERIENCE REPLAY
* 3.4 UNREAL AGENT

In this section we incorporate auxiliary tasks into the reinforcement learning framework in order to promote faster training, more robust learning, and ultimately higher performance for our agents.

## 3.1 AUXILIARY CONTROL TASKS

#### auxiliary control tasks

* The auxiliary control tasks 
    - we consider are defined 
        - as additional pseudo-reward functions 
            - in the environment the agent is interacting with. 
* We formally define 
    - an auxiliary control task c 
        - by a reward function $r^{(c)}$ : S × A → R, 
            - where 
                - S is the space of possible states and 
                - A is the space of available actions.
* The underlying state space S 
    - includes both 
        - the history of observations and 
        - rewards as well as 
            - the state of the agent itself, 
                - i.e. the activations of the hidden units of the network.

#### policy

* Given a set of auxiliary control tasks C, 
    - let $π^{(c)}$ be the agent’s policy 
        - for each auxiliary task c ∈ C and 
        - let π be the agent’s policy on the base task. 

#### overall objective

The overall objective is to maximise total performance across all these auxiliary tasks,

<img src="figures/cap2.png" width=600 />
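
Written out in the notation of this section, the objective can be sketched as below. This is a transcription under assumptions rather than a copy of the figure: $λ_c$ is taken to be a single weight on the auxiliary tasks, $R_{1:∞}$ is the discounted return on the base task, $R^{(c)}_{1:∞}$ is the discounted return for pseudo-reward $r^{(c)}$, and θ is the set of parameters shared by π and every $π^{(c)}$.

$$
\arg\max_{\theta} \; \mathbb{E}_{\pi}\left[R_{1:\infty}\right] + \lambda_c \sum_{c \in C} \mathbb{E}_{\pi^{(c)}}\left[R^{(c)}_{1:\infty}\right]
$$

Because θ is shared, improving any auxiliary policy also changes the representation that the base policy acts on, so the agent has to balance base-task and auxiliary performance.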

#### loss function

* In principle, any reinforcement learning method could be applied to maximise these objectives. 
* However, to efficiently learn to maximise many different pseudo-rewards simultaneously in parallel from a single stream of experience, it is necessary to use off-policy reinforcement learning. 
* We focus on value-based RL methods that approximate the optimal action-values by Q-learning. 
* Specifically, for each control task c we optimise an 
    - n-step Q-learning loss 
    
    $L^{(c)}_Q = \mathbb{E}\left[\left(R_{t:t+n} + \gamma^n \max_{a'} Q^{(c)}(s', a', \theta^-) - Q^{(c)}(s, a, \theta)\right)^2\right]$
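
The n-step Q-learning loss can be made concrete with a small numeric sketch for a single sampled transition. The function names, the scalar inputs, and the convention that the first pseudo-reward is undiscounted are assumptions made for illustration; in the agent the Q-values come from the network with parameters θ and target parameters θ⁻.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of n pseudo-rewards (assumed convention: first reward undiscounted)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def n_step_q_loss(
    rewards: Sequence[float],
    gamma: float,
    q_sa: float,
    target_q_next: Sequence[float],
) -> float:
    """Squared n-step Q-learning error for one transition.

    rewards:        the n pseudo-rewards observed after taking action a in state s
    q_sa:           Q(s, a; theta), the online estimate being trained
    target_q_next:  Q(s', a'; theta^-) for every action a', from the target parameters
    """
    n = len(rewards)
    # Bootstrap from the best action in the state reached after n steps.
    target = n_step_return(rewards, gamma) + (gamma ** n) * max(target_q_next)
    return (target - q_sa) ** 2


# Hypothetical numbers: 3-step return 1.5, bootstrap 0.125 * 4.0 = 0.5, target 2.0.
loss = n_step_q_loss(rewards=[1.0, 0.0, 2.0], gamma=0.5, q_sa=1.0, target_q_next=[0.0, 4.0])
```

The expectation in the loss is then approximated by averaging this quantity over sampled transitions, once per control task c.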

#### auxiliary reward functions
* Pixel changes 
    - Changes in the perceptual stream often correspond to important events in an environment. 
    - We train agents that 
        - learn a separate policy 
            - for maximally changing the pixels 
                - in each cell 
                    - of an n × n non-overlapping grid 
                        - placed over the input image. 
    - We refer to these auxiliary tasks as <font color="red">pixel control</font>. 
* Network features 
    - Since the policy or value networks of an agent learn to extract task-relevant high-level features of the environment (Mnih et al., 2015; Zahavy et al., 2016; Silver et al., 2016) 
        - they can be useful quantities for the agent to learn to control. 
    - Hence, the activation of <font color="blue">any hidden unit of the agent’s neural network can itself be an auxiliary reward</font>. 
    - We train agents that 
        - learn a separate policy 
            - for maximally activating 
                - each of the units 
                    - in a specific hidden layer. 
    - We refer to these tasks as <font color="red">feature control</font>.
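
As a rough illustration of the pixel control pseudo-reward, the sketch below computes one reward per grid cell from two consecutive frames. It assumes grayscale frames stored as nested lists and uses the mean absolute intensity change within each cell as the reward; the function name, the frame format, and the choice of averaging are assumptions for illustration, not details given in these notes.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame: rows of pixel intensities (assumed format)


def pixel_change_rewards(prev: Frame, curr: Frame, cell: int) -> List[List[float]]:
    """Return one pseudo-reward per non-overlapping cell x cell block of the frame."""
    height, width = len(prev), len(prev[0])
    rows, cols = height // cell, width // cell
    rewards = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for y in range(i * cell, (i + 1) * cell):
                for x in range(j * cell, (j + 1) * cell):
                    total += abs(curr[y][x] - prev[y][x])
            # Mean absolute change inside cell (i, j).
            rewards[i][j] = total / (cell * cell)
    return rewards


# Hypothetical 4x4 frames split into a 2x2 grid of 2x2-pixel cells.
prev_frame = [[0.0] * 4 for _ in range(4)]
curr_frame = [[0.0] * 4 for _ in range(4)]
curr_frame[0][0] = 1.0  # a single pixel changes, inside the top-left cell
grid = pixel_change_rewards(prev_frame, curr_frame, cell=2)
```

Each entry of `grid` plays the role of the pseudo-reward for one auxiliary control task, so an n × n grid defines n × n tasks that are learned in parallel.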

<img src="figures/cap1.png" width=600 />

* Figure 1 (b) shows 
    - an A3C agent architecture 
        - augmented with 
            - a set of auxiliary pixel control tasks. 
    - In this case, 
        - the base policy π shares 
            - both 
                - the convolutional visual stream and 
                - the LSTM with the auxiliary policies.
* The output of the auxiliary network head is 
    - an $N_{act}$ × n × n tensor $Q^{aux}$ 
        - where $Q^{aux}$(a,i,j) 
            - represents the network’s current estimate of 
                - the optimal discounted expected change 
                - in cell (i, j) of the input after taking action a. 
* We exploit 
    - the spatial nature of 
        - the auxiliary tasks 
            - by using a deconvolutional neural network 
    - to produce the auxiliary values $Q^{aux}$.

## 3.2 AUXILIARY REWARD TASKS

<img src="figures/cap3.png" width=600 />

* In addition to learning generally about the dynamics of the environment, an agent must learn to maximise the global reward stream. 
* To learn a policy to maximise rewards, an agent requires features that recognise states that lead to high reward and value. 
* <font color="red">An agent with a good representation of rewarding states will allow the learning of good value functions, and in turn should allow the easy learning of a policy</font>.

* <font color="red">However, in many interesting environments reward is encountered very sparsely</font>, meaning that it can take a long time to train feature extractors adept at recognising states which signify the onset of reward. 
* <font color="red">We want to remove the perceptual sparsity of rewards</font> and rewarding states to aid the training of an agent, but to do so in a way which does not introduce bias to the agent’s policy.

#### reward prediction

* To do this, we introduce the auxiliary task of reward prediction 
    - that of predicting the onset of immediate reward given some historical context. 
    - This task consists of processing a sequence of consecutive observations, and requiring <font color="red">the agent to predict the reward picked up in the subsequent unseen frame</font>. 
    - This is similar to value learning focused on immediate reward (γ = 0).
* We train the reward prediction task on sequences 
    - $S_τ$ = ($s_{τ−k}$,$s_{τ−k+1}$,...,$s_{τ−1}$) 
        - to predict the reward $r_τ$ , and 
        - sample $S_τ$ from the experience of our policy π 
            - in a skewed manner 
    - so as to overrepresent rewarding events 
        - (presuming rewards are sparse within the environment).
* Specifically, 
    - we sample such that
        - zero rewards and non-zero rewards 
            - are equally represented, 
            - i.e. the predicted probability of 
                - a non-zero reward is P($r_τ$≠0)=0.5.
* The reward prediction is trained to minimise a loss $L_{RP}$. 
* In our experiments 
    - we use a multiclass cross-entropy classification loss across 
        - three classes (zero, positive, or negative reward), 
            - although a mean-squared error loss is also feasible.
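
The three-class formulation can be sketched as an ordinary softmax cross-entropy on the sign of the reward. The class ordering (0 = zero, 1 = positive, 2 = negative), the function names, and the raw `logits` input are assumptions for illustration; in the agent the logits would come from the reward prediction head.

```python
import math
from typing import List, Sequence


def reward_class(reward: float) -> int:
    """Map a scalar reward to a class index: 0 = zero, 1 = positive, 2 = negative."""
    if reward > 0:
        return 1
    if reward < 0:
        return 2
    return 0


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_prediction_loss(logits: Sequence[float], reward: float) -> float:
    """Cross-entropy between the predicted class distribution and the true reward class."""
    probs = softmax(logits)
    return -math.log(probs[reward_class(reward)])


# With uninformative logits, every class gets probability 1/3.
uniform_loss = reward_prediction_loss([0.0, 0.0, 0.0], reward=1.0)
# A confident, correct prediction of "positive" gives a much smaller loss.
confident_loss = reward_prediction_loss([0.0, 5.0, 0.0], reward=1.0)
```

Classifying the sign rather than regressing the value keeps the task about recognising that a reward is about to occur, which is the feature-shaping effect the auxiliary task is after.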

<img src="figures/cap1.png" width=600 />

* The auxiliary reward predictions may use a different architecture to the agent’s main policy. 
* Rather than simply “hanging” the auxiliary predictions off the LSTM,
    - we use a simpler feedforward network 
        - that concatenates a stack of states $S_τ$
            - after being encoded by the agent’s CNN (see Figure 1 (c)).
    - The idea is to simplify 
        - the temporal aspects of 
            - the prediction task 
        - in both 
            - the future direction 
                - (focusing only on immediate reward prediction rather than long-term returns) and 
            - past direction 
                - (focusing only on immediate predecessor states rather than the complete history)

## 3.3 EXPERIENCE REPLAY

* Experience replay has proven to be an effective mechanism for improving both the data efficiency and stability of deep reinforcement learning algorithms (Mnih et al., 2015). 
* <font color="red">The main idea is to store transitions in a replay buffer, and then apply learning updates to sampled transitions from this buffer</font>.

<img src="figures/cap1.png" width=600 />

#### Skewed sampling

* Experience replay provides a natural mechanism for skewing the distribution of reward prediction samples towards rewarding events: 
    - we simply split the replay buffer into 
        - rewarding and 
        - non-rewarding subsets, 
    - and replay equally from both subsets. 
* <font color="red">The skewed sampling of transitions from a replay buffer means</font> 
    - that <font color="red">rare rewarding states will be oversampled</font>, and 
    - learnt from far more frequently than 
        - if we sampled sequences directly from the behaviour policy.
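
A minimal sketch of the skewed sampling idea follows. It keeps rewarding and non-rewarding sequences in two separate bounded queues and picks between them with equal probability; the class name, the per-subset capacity, and the fallback when one subset is empty are assumptions for illustration, not the paper's implementation.

```python
import random
from collections import deque
from typing import Any, Optional, Tuple


class SkewedReplayBuffer:
    """Replay buffer split into rewarding and non-rewarding subsets (illustrative)."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        # Assumption: each subset has its own capacity and drops its oldest entry when full.
        self.rewarding: deque = deque(maxlen=capacity)
        self.non_rewarding: deque = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, sequence: Any, reward: float) -> None:
        """Store a sequence together with the reward that follows it."""
        target = self.rewarding if reward != 0 else self.non_rewarding
        target.append((sequence, reward))

    def sample(self) -> Tuple[Any, float]:
        """Draw from each subset with probability 0.5 (raises IndexError if both are empty)."""
        if self.rewarding and self.non_rewarding:
            pool = self.rewarding if self.rng.random() < 0.5 else self.non_rewarding
        else:
            pool = self.rewarding or self.non_rewarding
        return self.rng.choice(pool)


# Hypothetical sparse-reward experience: 1 rewarding sequence out of 100.
buffer = SkewedReplayBuffer(capacity=1000, seed=0)
for step in range(99):
    buffer.add(sequence=("frames", step), reward=0.0)
buffer.add(sequence=("frames", 99), reward=1.0)

draws = [buffer.sample() for _ in range(2000)]
rewarding_fraction = sum(1 for _, r in draws if r != 0) / len(draws)
```

Although only 1% of the stored sequences are rewarding, roughly half of the sampled ones are, which matches the target P($r_τ$≠0)=0.5 for the reward prediction task.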

#### value function replay

* In addition to reward prediction, we also use the replay buffer to perform value function replay. 
* This amounts to 
    - resampling recent historical sequences 
        - from the behaviour policy distribution and 
    - performing extra value function regression 
        - in addition to the on-policy value function regression in A3C. 
* By resampling 
    - previous experience, and 
    - randomly varying the temporal position 
        - of the truncation window over 
            - which the n-step return is computed, 
    - <font color="red">value function replay performs</font> 
        - <font color="blue">value iteration</font> and 
        - <font color="blue">exploits newly discovered features</font> 
            - shaped by reward prediction.
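
Value function replay can be sketched as follows: pick a random n-step window in a stored sequence, compute bootstrapped n-step return targets for it, and regress the value estimates towards those targets. The function names, the uniform choice of window start, and the mean-squared-error form are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple


def sample_window(num_steps: int, n: int, rng: random.Random) -> Tuple[int, int]:
    """Pick a random n-step truncation window [start, end) inside a stored sequence."""
    start = rng.randrange(0, num_steps - n + 1)
    return start, start + n


def n_step_value_targets(
    rewards: Sequence[float], bootstrap_value: float, gamma: float
) -> List[float]:
    """Targets R_t = r_t + gamma * R_{t+1}, bootstrapped from the value after the window."""
    targets: List[float] = []
    ret = bootstrap_value
    for r in reversed(rewards):
        ret = r + gamma * ret
        targets.append(ret)
    targets.reverse()
    return targets


def value_replay_loss(values: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between value estimates and n-step return targets."""
    return sum((t - v) ** 2 for v, t in zip(values, targets)) / len(targets)


rng = random.Random(0)
start, end = sample_window(num_steps=10, n=3, rng=rng)

# Hypothetical rewards inside the window and value estimate just after it.
targets = n_step_value_targets([0.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.5)
loss = value_replay_loss(values=[0.5, 1.0, 1.0], targets=targets)
```

Moving the window changes which states are bootstrapped from, so the same stored experience yields different regression targets on different replays.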

<font color="red">Experience replay is also used to increase the efficiency and stability of the auxiliary control tasks. Q-learning updates are applied to sampled experiences that are drawn from the replay buffer, allowing features to be developed extremely efficiently.</font>

## 3.4 UNREAL AGENT

The UNREAL algorithm combines the benefits of two separate, state-of-the-art approaches to deep reinforcement learning. 
* The primary policy is trained with A3C (Mnih et al., 2016): 
    - it learns from parallel streams of experience to gain efficiency and stability; 
    - it is updated online using policy gradient methods; 
    - and it uses a recurrent neural network 
        - to encode the complete history of experience. 
* This allows the agent to learn effectively in partially observed environments.


The auxiliary tasks are trained on very recent sequences of experience that are stored and randomly sampled; 
* these sequences may be prioritised (in our case according to immediate rewards) (Schaul et al., 2015b); 
* these targets are trained off-policy by Q-learning; 
* and they may use simpler feedforward architectures. 
* This allows the representation to be trained with maximum efficiency.

#### loss function

The UNREAL algorithm optimises a single combined loss function with respect to the joint parameters of the agent, θ, 
* that combines 
    - the A3C loss $L_{A3C}$ together with 
    - an auxiliary control loss $L_{PC}$, 
    - auxiliary reward prediction loss $L_{RP}$ and 
    - replayed value loss $L_{VR}$,

<img src="figures/cap4.png" width=600 />
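
The combination of the four loss terms can be sketched as a weighted sum. The weight names `lambda_vr`, `lambda_pc`, `lambda_rp`, their default values, and the function name are assumptions for illustration; the individual loss values are taken as already computed for the current batch.

```python
from typing import Sequence


def unreal_loss(
    a3c_loss: float,
    value_replay_loss: float,
    control_losses: Sequence[float],
    reward_prediction_loss: float,
    lambda_vr: float = 1.0,
    lambda_pc: float = 1.0,
    lambda_rp: float = 1.0,
) -> float:
    """Weighted sum of the base A3C loss and the three auxiliary losses.

    control_losses holds one Q-learning loss L_Q^(c) per auxiliary control task c.
    """
    return (
        a3c_loss
        + lambda_vr * value_replay_loss
        + lambda_pc * sum(control_losses)
        + lambda_rp * reward_prediction_loss
    )


# Hypothetical per-batch loss values and weights.
total = unreal_loss(
    a3c_loss=1.0,
    value_replay_loss=2.0,
    control_losses=[0.5, 0.5],
    reward_prediction_loss=4.0,
    lambda_vr=1.0,
    lambda_pc=0.5,
    lambda_rp=0.25,
)
```

Because all terms are differentiated with respect to the same joint parameters θ, a single optimiser step on this sum updates the shared convolutional and LSTM layers from every task at once.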

# 4 EXPERIMENTS
* 4.1 LABYRINTH RESULTS
* 4.2 ATARI

## 4.1 LABYRINTH RESULTS
* 4.1.1 RESULTS

<img src="figures/cap5.png" width=600 />

### 4.1.1 RESULTS

#### Unsupervised Reinforcement Learning

<img src="figures/cap6.png" width=600 />

## 4.2 ATARI

<img src="figures/cap7.png" width=600 />

# 5 CONCLUSION

<img src="figures/cap8.png" width=600 />
<img src="figures/cap9.png" width=600 />
<img src="figures/cap10.png" width=600 />
<img src="figures/cap11.png" width=600 />
<img src="figures/cap12.png" width=600 />
<img src="figures/cap13.png" width=600 />

# References
* [1] (paper) REINFORCEMENT LEARNING WITH UNSUPERVISED AUXILIARY TASKS - https://arxiv.org/pdf/1611.05397.pdf
* [2] (Video) DeepMind - Reinforcement Learning with Unsupervised Auxiliary Tasks - https://youtu.be/Uz-zGYrYEjA
* [3] (blog) Reinforcement learning with unsupervised auxiliary tasks - https://deepmind.com/blog/reinforcement-learning-unsupervised-auxiliary-tasks/
* [4] Experience Replay - http://unpredictablepattern.blogspot.kr/2016/02/experience-replay.html
* [5] The future of memory: remembering, imagining, and the brain - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3815616/
* [6] "If you have installed TensorFlow, gone through the tutorials, and written some basic examples", TensorFlow KR Meetup 2016 - http://www.slideshare.net/carpedm20/ss-63116251