> - author: Dongwan Kim
- date: 2019-02-26

# Imagination-Augmented Agents for Deep Reinforcement Learning

# Abstract

- I2A: architecture <u>combining</u> **model-free** and **model-based** aspects

- Existing model-based RL: prescribes how a model should be used to arrive at a policy.
 - Typical model-based RL methods define explicit rules for updating the policy or value function from the output of the environment model. For example, Dyna-Q uses the update rule below.
\begin{align}
R, S' &\leftarrow \text{Model}(S, A) \\
Q(S, A) &\leftarrow Q(S, A) + \alpha [R + \gamma \max_a Q(S', a) - Q(S, A)]
\end{align}

- I2A: **learns to interpret predictions** from a learned environment model to construct implicit plans in arbitrary ways.
 - I2A feeds the encoded output of the environment model to the policy and value network as additional context, so how the predicted trajectories affect the policy or value is learned rather than explicitly specified.

- I2As show improved data efficiency, performance, and robustness to model misspecification compared to several baselines.
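
To make the contrast concrete, the Dyna-Q planning update quoted above can be sketched in a few lines of Python. This is a minimal sketch, not the paper's code: the tabular `Q`, the dictionary-based `model`, the discount factor `gamma`, and the toy transition are all assumptions made for illustration.

```python
from collections import defaultdict

def dyna_q_planning_step(Q, model, state, action, actions, alpha=0.1, gamma=0.95):
    """One planning update: query the learned model, then apply the Q-learning rule."""
    reward, next_state = model[(state, action)]            # R, S' <- Model(S, A)
    best_next = max(Q[(next_state, a)] for a in actions)   # max_a Q(S', a)
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return Q[(state, action)]

# Hypothetical toy model: taking "right" in state 0 yields reward 1 and leads to state 1.
Q = defaultdict(float)
model = {(0, "right"): (1.0, 1)}
new_value = dyna_q_planning_step(Q, model, 0, "right", actions=["left", "right"])
```

The point of the sketch is that the way the model output enters the value update is fixed by hand, which is exactly what I2A avoids.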

# 1. Introduction

- Model-free RL usually requires a large amount of training data, and the resulting policies do not readily generalize to novel tasks.

- Model-based RL aims to address these shortcomings, but **suffers from model errors** resulting from function approximation. These errors compound during planning, causing over-optimism and poor agent performance.

- I2A is **robust to model imperfections**: it uses approximate environment models by "learning to interpret" their imperfect predictions.

# 2. The I2A architecture

![image1.png](attachment:image1.png)

![image2.png](attachment:image2.png)

###### Imagination core

- The environment model together with the rollout policy $\hat{\pi}$ constitutes the imagination core (IC) module, which predicts the next time step.

> - $\hat{o}_{t+1}, \hat{r}_{t+1} = \text{IC}(o_t)$
 - $\hat{a}_{t} = \text{PolicyNet}(o_t)$, $~~\hat{\pi}$: rollout policy, which is determined by $\text{PolicyNet}$
 - $\hat{o}_{t+1}, \hat{r}_{t+1} = \text{EnvModel}(o_t, \hat{a}_t)$

> What are the architectures of PolicyNet and EnvModel?
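
The two-stage structure of the imagination core can be sketched as plain function composition. This is only an illustration: `toy_policy_net` and `toy_env_model` are hypothetical stand-ins (a walk along a line with a goal at position 3), not the networks used in the paper.

```python
def imagination_core(obs, policy_net, env_model):
    """One imagined step: choose an action with the rollout policy, then query the model."""
    action = policy_net(obs)                    # a_hat_t = PolicyNet(o_t)
    next_obs, reward = env_model(obs, action)   # o_hat_{t+1}, r_hat_{t+1} = EnvModel(o_t, a_hat_t)
    return next_obs, reward

# Hypothetical stand-ins: the observation is a position on a line, the goal is position 3.
def toy_policy_net(obs):
    return 1 if obs < 3 else 0

def toy_env_model(obs, action):
    next_obs = obs + action
    reward = 1.0 if next_obs == 3 else 0.0
    return next_obs, reward

next_obs, reward = imagination_core(2, toy_policy_net, toy_env_model)
```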

###### Single Imagination rollout

- The imagination core is used to produce $n$ trajectories $\hat{\mathcal{T}}_1, \dots, \hat{\mathcal{T}}_n$. Each imagined trajectory $\hat{\mathcal{T}}$ is a sequence of features $(\hat{f}_{t+1}, \dots, \hat{f}_{t+\tau})$, where $t$ is the current time, $\tau$ the length of the rollout, and $\hat{f}_{t+i}$ the output of the env model.

\begin{align} \hat{f}_{t+i} &= [\hat{o}_{t+i}, \hat{r}_{t+i} ] = \text{IC}(\hat{o}_{t+i-1}) \\
\hat{\mathcal{T}}_i &= (\hat{f}_{t+1}, \dots, \hat{f}_{t+\tau})
\end{align}

- Each rollout $\hat{\mathcal{T}}_i$ is encoded as rollout embedding $e_i$, and then embeddings $e_1, \dots, e_n$ are aggregated as $c_{\text{ia}}$

$$\begin{aligned}
e_i &= \mathcal{E}(\hat{\mathcal{T}}_i) \\
c_{\text{ia}} &= \mathcal{A}(e_1, \dots, e_n)
\end{aligned}$$
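
The rollout, encoding, and aggregation steps above can be sketched end to end as follows. Everything here is an assumption for illustration: the toy cores, the hand-written `encode` (final observation plus total reward), and an `aggregate` that simply concatenates the embeddings stand in for the learned components.

```python
def imagine_rollout(obs, core, length):
    """Unroll an imagination core for `length` steps; returns [(o_hat, r_hat), ...]."""
    trajectory = []
    for _ in range(length):
        obs, reward = core(obs)
        trajectory.append((obs, reward))   # f_hat_{t+i} = [o_hat_{t+i}, r_hat_{t+i}]
    return trajectory

def encode(trajectory):
    """Stand-in for the encoder E: [final observation, total imagined reward]."""
    final_obs = trajectory[-1][0]
    total_reward = sum(r for _, r in trajectory)
    return [float(final_obs), total_reward]

def aggregate(embeddings):
    """Stand-in for the aggregator A: concatenate the rollout embeddings into c_ia."""
    return [x for e in embeddings for x in e]

def make_toy_core(step):
    """Hypothetical core that always moves by `step` and rewards reaching position 3."""
    def core(obs):
        next_obs = obs + step
        return next_obs, (1.0 if next_obs == 3 else 0.0)
    return core

# Two imagined rollouts (n = 2) of length tau = 3, starting from observation 1.
rollouts = [imagine_rollout(1, make_toy_core(step), length=3) for step in (1, -1)]
c_ia = aggregate([encode(t) for t in rollouts])
```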

###### Full I2A Architecture

- The final component of the I2A is the policy module, a network that takes the information $c_{\text{ia}}$ from the model-based predictions as well as the output $c_{\text{mf}}$ of a model-free path. The I2A thereby learns to combine information from its model-free and imagination-augmented paths.

$$\pi, V = \text{FC}(c_{\text{ia}}, c_{\text{mf}})$$
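
As a rough sketch of the policy module, the two context vectors can be concatenated and passed through one linear layer per head. The single-layer heads, the hand-picked weights, and the helper names are assumptions; the notes only state that a fully connected module maps $(c_{\text{ia}}, c_{\text{mf}})$ to $\pi$ and $V$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def policy_module(c_ia, c_mf, policy_w, policy_b, value_w, value_b):
    """Combine both paths: returns (action probabilities pi, scalar value V)."""
    features = c_ia + c_mf                       # list concatenation of the two contexts
    pi = softmax(linear(policy_w, policy_b, features))
    value = linear([value_w], [value_b], features)[0]
    return pi, value

# Hypothetical weights for a 2-action problem with a 3-dimensional combined context.
pi, value = policy_module(
    c_ia=[1.0, 0.0], c_mf=[0.5],
    policy_w=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], policy_b=[0.0, 0.0],
    value_w=[0.0, 0.0, 2.0], value_b=0.5,
)
```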

# 3. Architectural choices and experimental setup

### 3.1 Rollout strategy

- In these experiments, one rollout is performed for each possible action in the environment. The first action in the $i^{\text{th}}$ rollout is the $i^{\text{th}}$ action of the action set $\mathcal{A}$, and subsequent actions for all rollouts are produced by a shared rollout policy $\hat{\pi}$.

- Training rollout policy $\hat{\pi}$
 - by adding to the total loss **a cross entropy auxiliary loss** between the imagination-augmented policy $\pi$ and the policy $\hat{\pi}$, both for the current observation.
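
Both ideas in this subsection can be sketched briefly: a rollout whose first action is forced, and the cross-entropy term that pulls $\hat{\pi}$ toward $\pi$. The toy model and policy are hypothetical, and treating $\pi$ as the target distribution of the cross entropy is an assumption about the direction of the distillation.

```python
import math

def rollout_for_action(obs, first_action, rollout_policy, env_model, length):
    """Imagined rollout: the first action is forced, later ones come from the rollout policy."""
    trajectory = []
    action = first_action
    for _ in range(length):
        obs, reward = env_model(obs, action)
        trajectory.append((obs, reward))
        action = rollout_policy(obs)
    return trajectory

def distillation_loss(pi, pi_hat, eps=1e-12):
    """Cross entropy -sum_a pi(a) * log pi_hat(a) for a single observation."""
    return -sum(p * math.log(q + eps) for p, q in zip(pi, pi_hat))

# Hypothetical toy setup: positions on a line, reward for reaching position 3.
ACTIONS = [-1, 1]

def toy_env_model(obs, action):
    next_obs = obs + action
    return next_obs, (1.0 if next_obs == 3 else 0.0)

def toy_rollout_policy(obs):
    return 1

# One rollout per action, all sharing the same rollout policy after the first step.
rollouts = [rollout_for_action(2, a, toy_rollout_policy, toy_env_model, length=2)
            for a in ACTIONS]

loss_match = distillation_loss([0.9, 0.1], [0.9, 0.1])
loss_mismatch = distillation_loss([0.9, 0.1], [0.1, 0.9])
```

The loss is smallest when $\hat{\pi}$ matches $\pi$, so minimizing it makes the imagined rollouts follow actions the agent itself would likely take.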

### 3.2 I2A components and environment models

- In these experiments, the encoder is an LSTM with convolutional layers that sequentially processes a trajectory $\hat{\mathcal{T}}$. The features are fed to the LSTM in reverse order, from $\hat{f}_{t+\tau}$ to $\hat{f}_{t+1}$, to mimic Bellman-type backup operations (although the choice of forward, backward, or bi-directional processing seems to have little impact on performance).
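
The effect of the processing order can be illustrated with a deliberately tiny recurrent cell. This is not an LSTM: the scalar `tanh` update and the weights `w_h`, `w_o`, `w_r` are assumptions chosen only to show that reverse-order processing ends on the step closest to the present, much like a backup from the end of the rollout.

```python
import math

def encode_rollout(trajectory, w_h=0.5, w_o=0.1, w_r=1.0, reverse=True):
    """Fold a rollout [(o_hat, r_hat), ...] into one scalar with a toy recurrent cell."""
    steps = reversed(trajectory) if reverse else trajectory
    h = 0.0
    for obs, reward in steps:
        # The most recently processed feature has the largest influence on h.
        h = math.tanh(w_h * h + w_o * obs + w_r * reward)
    return h

# Hypothetical rollout in which the reward arrives only at the last imagined step.
trajectory = [(1, 0.0), (2, 0.0), (3, 1.0)]
backward = encode_rollout(trajectory, reverse=True)
forward = encode_rollout(trajectory, reverse=False)
```

With backward processing the late reward is seen first and is then discounted by the recurrence as earlier steps are folded in, so `backward` comes out smaller than `forward` here.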

- Training the environment model: two options
 - pretrain it and then freeze it (this led to a faster runtime of the I2A architecture than joint training)
 - train it jointly with the full network by adding $l_{\text{model}}$ to the total loss.


- For all environments, training data for this environment model was generated from trajectories of a partially trained standard model-free agent.

### 3.3 Agent training and baseline agents

- Using a fixed pretrained env model, the remaining I2A parameters are trained with A3C.

- An entropy regularizer on the policy $\pi$ is added to encourage exploration, together with the auxiliary loss that distills $\pi$ into the rollout policy $\hat{\pi}$.

- Baseline agents
 - standard: A3C (I2A without the model-based part)
 - standard (large): A3C with more parameters (slightly more than I2A)
 - copy-model I2A: the env model in the I2A is replaced with a 'copy' model that simply returns the input observation (same number of parameters and same architecture as I2A)

# 4. Sokoban experiments

![image3.png](attachment:image3.png)

- Sokoban
 - boxes can only be pushed $\rightarrow$ a mistake can make the puzzle unsolvable, so some form of planning could be helpful.
 - a new level is procedurally generated each episode $\rightarrow$ the agent cannot memorize specific puzzles
 - 10x10 grid world $\rightarrow$ a grid encoding of the environment could be helpful, but the RGB image is used directly in this experiment.

### 4.1 I2A performance vs. baselines on Sokoban

![image4.png](attachment:image4.png)

- Figure (left): learning curves of I2A and various baselines
 - I2A > standard (large) A3C > standard A3C > no-reward I2A > copy-model I2A

- Figure (right): how the length of individual rollouts affects performance
 - Longer rollouts perform better, but with diminishing returns.

### 4.2 Learning with imperfect models

![image5.png](attachment:image5.png)

- I2A is able to handle an imperfect env model.
- A poor env model was created by using a smaller number of parameters, as shown in the figure (left).
- The I2A agent with the poor model ended up outperforming the I2A with the good model (this requires further investigation).
- @

### 4.3 Further insights into the workings of the I2A architecture

- I2A with the copy model performs worse $\rightarrow$ the env model is crucial.
- I2A with an env model that does not predict rewards performs worse at first, but recovers performance after much longer training $\rightarrow$ reward prediction is not absolutely necessary.

### 4.4 Imagination efficiency and comparison with perfect-model planning methods

![image6.png](attachment:image6.png)

- I2A requires far fewer simulation steps than MCTS (Monte Carlo Tree Search).
- @

### 4.5 Generalization experiments

- (Table 2, right) I2A generalizes well to levels with different numbers of boxes (the agent was trained on levels with 4 boxes).

# 5. Learning one model for many tasks in MiniPacman

![image7.png](attachment:image7.png)

- green: player
- red: dangerous ghosts
- dark blue: food
- black: empty corridors
- cyan: power pills - when eaten, for a fixed number of steps, the player moves faster, and the ghosts run away and can be eaten.

- There are 5 events: moving, eating food, eating a power pill, eating a ghost, and being eaten by a ghost. Different reward settings can be assigned to these events, and these reward schemes lead to very different policies.

- A single env model was trained and used for many tasks (each with a different reward setting), and the resulting I2A outperformed the standard agent in all tasks.
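
One way to picture how a single model can serve several tasks is to score the same events under different reward schemes. The task names and every reward value below are hypothetical placeholders, not the settings used in the paper.

```python
EVENTS = ["move", "eat_food", "eat_power_pill", "eat_ghost", "eaten_by_ghost"]

# Hypothetical reward schemes: one reward per event for each task.
REWARD_SCHEMES = {
    "collect_food": {"move": 0, "eat_food": 1, "eat_power_pill": 2,
                     "eat_ghost": 0, "eaten_by_ghost": 0},
    "hunt_ghosts": {"move": 0, "eat_food": 0, "eat_power_pill": 1,
                    "eat_ghost": 10, "eaten_by_ghost": -20},
}

def episode_return(event_counts, reward_scheme):
    """Total reward of an episode, given how often each event occurred."""
    return sum(reward_scheme[event] * event_counts.get(event, 0) for event in EVENTS)

# The same (imagined or real) episode is valued differently under each task.
event_counts = {"move": 10, "eat_food": 4, "eat_power_pill": 1, "eat_ghost": 1}
returns = {task: episode_return(event_counts, scheme)
           for task, scheme in REWARD_SCHEMES.items()}
```

Because only the event-to-reward mapping changes between tasks, predictions about the dynamics can in principle be shared while the resulting policies differ.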

# 6. Related work

- @

# 7. Discussion

- Presented I2A, an approach combining model-free and model-based ideas, which outperforms model-free baselines.
- I2As trade environment interactions for computation by pondering before acting.
 - In these experiments, the I2A was always less than an order of magnitude slower per interaction than the model-free baselines.
 - Further research on abstract environment models is required to reduce the computational cost.
- I2As require far fewer function calls to the model than MCTS.
- @