# Abstract

I2A: architecture combining model-free and model-based aspects

Existing model-based RL: prescribes how a model should be used to arrive at a policy.
 - In typical model-based RL, the way the model's output ($R, S' \leftarrow Model(S, A)$) is used to update the policy or value is explicitly specified.

I2A: learns to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways.
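 - In I2A, the model's output is encoded and used as an additional input to the neural network, so there is no way to know in advance how it updates the policy or value. Instead, this process is learned.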

I2As show improved data efficiency, performance, and robustness to model misspecification compared to several baselines.

# 1. Introduction

Model-free RL usually requires large amounts of training data, and the resulting policies do not readily generalize to novel tasks.

Model-based RL aims to address these shortcomings, but suffers from model errors resulting from function approximation. These errors compound during planning, causing over-optimism and poor agent performance.

I2As use approximate environment models by "learning to interpret" their imperfect predictions, which makes them robust to model imperfections.

# 2. The I2A architecture

###### Imagination core

The environment model, together with the rollout policy $\hat{\pi}$, constitutes the imagination core (IC) module, which predicts the next time step.

- $\hat{o}_{t+1}, \hat{r}_{t+1} = \text{IC}(o_t)$
 - $\hat{a}_{t} = \text{PolicyNet}(o_t)$
 - $\hat{o}_{t+1}, \hat{r}_{t+1} = \text{EnvModel}(o_t, \hat{a}_t)$

- $\hat{\pi}$: rollout policy, which is determined by $\text{PolicyNet}$
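The composition of the two parts can be sketched in a few lines of plain Python. This is only an illustration: `make_imagination_core`, `toy_policy`, and `toy_env_model` are hypothetical names, and the 1-D toy environment is an assumption, not the networks used in the paper.

```python
from typing import Callable, Tuple

Obs = Tuple[int, ...]  # hypothetical observation type for this toy example


def make_imagination_core(
    policy_net: Callable[[Obs], int],
    env_model: Callable[[Obs, int], Tuple[Obs, float]],
) -> Callable[[Obs], Tuple[Obs, float]]:
    """Compose a rollout policy and an environment model into IC(o_t)."""

    def imagination_core(obs: Obs) -> Tuple[Obs, float]:
        action = policy_net(obs)       # a_hat_t = PolicyNet(o_t)
        return env_model(obs, action)  # (o_hat_{t+1}, r_hat_{t+1})

    return imagination_core


# Toy stand-ins (assumptions): a 1-D position, actions 0 = left, 1 = right.
def toy_policy(obs: Obs) -> int:
    return 1  # always move right


def toy_env_model(obs: Obs, action: int) -> Tuple[Obs, float]:
    step = 1 if action == 1 else -1
    next_pos = obs[0] + step
    reward = 1.0 if next_pos == 3 else 0.0  # reward only at position 3
    return (next_pos,), reward


ic = make_imagination_core(toy_policy, toy_env_model)
next_obs, reward = ic((2,))
```

In a real agent both callables would be neural networks; the point here is only that the IC is a function of the observation alone, because the action is chosen internally by the rollout policy.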

###### Single Imagination rollout

The imagination core is used to produce $n$ trajectories $\hat{\mathcal{T}}_1, \dots, \hat{\mathcal{T}}_n$. Each imagined trajectory $\hat{\mathcal{T}}$ is a sequence of features $(\hat{f}_{t+1}, \dots, \hat{f}_{t+\tau})$, where $t$ is the current time, $\tau$ the length of the rollout.

\begin{align} \hat{f}_{t+k} &= [\hat{o}_{t+k}, \hat{r}_{t+k} ] = \text{IC}(\hat{o}_{t+k-1}), \quad k = 1, \dots, \tau, \quad \hat{o}_t = o_t \\
\hat{\mathcal{T}}_i &= (\hat{f}_{t+1}, \dots, \hat{f}_{t+\tau})
\end{align}

Each rollout $\hat{\mathcal{T}}_i$ is encoded as a rollout embedding $e_i$, and the embeddings $e_1, \dots, e_n$ are then aggregated into a single imagination code $c_{\text{ia}}$.

$$\begin{aligned}
e_i &= \mathcal{E}(\hat{\mathcal{T}}_i) \\
c_{\text{ia}} &= \mathcal{A}(e_1, \dots, e_n)
\end{aligned}$$
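The rollout, encoding, and aggregation steps above might look as follows in a toy setting. Everything here is an assumption for illustration: the encoder is a hand-written summary rather than a learned network, the aggregator is taken to be plain concatenation, and the two imagination cores `drift_right` and `drift_left` are hypothetical stand-ins that make the rollouts differ.

```python
from typing import Callable, List, Tuple

Feature = Tuple[float, float]  # (o_hat, r_hat); observations are scalars in this toy
ImaginationCore = Callable[[float], Feature]


def imagine_rollout(ic: ImaginationCore, obs: float, length: int) -> List[Feature]:
    """Apply the imagination core `length` times, feeding predictions back in."""
    trajectory: List[Feature] = []
    for _ in range(length):
        obs, reward = ic(obs)
        trajectory.append((obs, reward))
    return trajectory


def encode(trajectory: List[Feature]) -> List[float]:
    """Toy encoder E: [final predicted observation, sum of predicted rewards]."""
    return [trajectory[-1][0], sum(r for _, r in trajectory)]


def aggregate(embeddings: List[List[float]]) -> List[float]:
    """Toy aggregator A: concatenate the rollout embeddings (an assumption)."""
    return [x for e in embeddings for x in e]


# Hypothetical imagination cores so that the rollouts differ.
def drift_right(obs: float) -> Feature:
    nxt = obs + 1.0
    return nxt, (1.0 if nxt >= 2.0 else 0.0)


def drift_left(obs: float) -> Feature:
    return obs - 1.0, 0.0


rollouts = [imagine_rollout(ic, 0.0, 3) for ic in (drift_right, drift_left)]
embeddings = [encode(t) for t in rollouts]
c_ia = aggregate(embeddings)
```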

###### Full I2A Architecture

The final component of the I2A is the policy module, a network that takes the information $c_{\text{ia}}$ from the model-based predictions as well as the output $c_{\text{mf}}$ of a model-free path. The I2A thus learns to combine information from its model-free and imagination-augmented paths.

$$\pi, V = \text{FC}(c_{\text{ia}}, c_{\text{mf}})$$
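As a rough sketch of what such a policy module computes, the two codes can be concatenated and passed through linear heads for the policy and the value. The single linear layer, the softmax policy head, and all weight names (`W_pi`, `b_pi`, `w_v`, `b_v`) are assumptions made for this example; the notes only state that a fully connected network maps $(c_{\text{ia}}, c_{\text{mf}})$ to $\pi$ and $V$.

```python
import math
from typing import List, Sequence, Tuple


def linear(x: Sequence[float], W: Sequence[Sequence[float]],
           b: Sequence[float]) -> List[float]:
    """Dense layer: one output per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z: Sequence[float]) -> List[float]:
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def policy_module(c_ia, c_mf, W_pi, b_pi, w_v, b_v) -> Tuple[List[float], float]:
    """Combine the imagination code and the model-free code into (pi, V)."""
    h = list(c_ia) + list(c_mf)           # concatenate both paths
    pi = softmax(linear(h, W_pi, b_pi))   # action probabilities
    value = linear(h, [w_v], [b_v])[0]    # scalar state value
    return pi, value


# Toy usage with two actions and hand-picked weights.
pi, value = policy_module(
    c_ia=[1.0, 0.0], c_mf=[0.5],
    W_pi=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], b_pi=[0.0, 0.0],
    w_v=[1.0, 1.0, 2.0], b_v=0.0,
)
```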

# 3. Architectural choices and experimental setup

### 3.1 Rollout strategy

For these experiments, we perform one rollout for each possible action in the environment. The first action in the $i^{\text{th}}$ rollout is the $i^{\text{th}}$ action of the action set $\mathcal{A}$, and subsequent actions for all rollouts are produced by a shared rollout policy $\hat{\pi}$.

Training the rollout policy $\hat{\pi}$:
- A cross-entropy auxiliary loss between the imagination-augmented policy $\pi$ and the rollout policy $\hat{\pi}$, both evaluated on the current observation, is added to the total loss.
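The rollout strategy and the auxiliary loss can be sketched as below. The helper names (`imagine_per_action`, `distillation_loss`), the toy environment model, and the deterministic rollout policy are hypothetical. The direction of the cross entropy, with $\pi$ as the target and $\hat{\pi}$ as the prediction, is also an assumption of this sketch.

```python
import math
from typing import Callable, List, Sequence, Tuple


def imagine_per_action(
    obs: int,
    actions: Sequence[int],
    env_model: Callable[[int, int], Tuple[int, float]],
    rollout_policy: Callable[[int], int],
    length: int,
) -> List[List[Tuple[int, float]]]:
    """One rollout per action: rollout i starts with action i, then follows pi_hat."""
    rollouts = []
    for first_action in actions:
        o, action, trajectory = obs, first_action, []
        for _ in range(length):
            o, r = env_model(o, action)
            trajectory.append((o, r))
            action = rollout_policy(o)  # shared rollout policy after step 1
        rollouts.append(trajectory)
    return rollouts


def distillation_loss(pi: Sequence[float], pi_hat: Sequence[float]) -> float:
    """Cross entropy -sum_a pi(a) * log pi_hat(a) on the current observation."""
    return -sum(p * math.log(q) for p, q in zip(pi, pi_hat))


# Toy stand-ins (assumptions): 1-D position, actions 0 = left, 1 = right.
def toy_env_model(obs: int, action: int) -> Tuple[int, float]:
    nxt = obs + (1 if action == 1 else -1)
    return nxt, (1.0 if nxt == 2 else 0.0)


def always_right(obs: int) -> int:
    return 1


rollouts = imagine_per_action(0, [0, 1], toy_env_model, always_right, 2)
matched = distillation_loss([0.5, 0.5], [0.5, 0.5])
mismatched = distillation_loss([0.5, 0.5], [0.9, 0.1])
```

The loss is smallest when $\hat{\pi}$ matches $\pi$, which is why minimizing it pulls the rollout policy toward the agent's own policy.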

### 3.2 I2A components and environment models

In these experiments, the encoder is an LSTM with convolutional layers that sequentially processes a trajectory $\hat{\mathcal{T}}$. The features $\hat{f}_t$ are fed to the LSTM in reverse order, from $\hat{f}_{t+\tau}$ to $\hat{f}_{t+1}$, to mimic Bellman-type backup operations. (The choice of forward, backward, or bi-directional processing seems to have little impact on performance.)

Training the environment model (two options):
- Pretrain and freeze. This led to faster runtime of the I2A architecture compared to training jointly.
- Train jointly with the full network by adding $l_{\text{model}}$ to the total loss.

The experiments use the pretrained environment model.

For all environments, training data for this environment model was generated from trajectories of a partially trained standard model-free agent.

### 3.3 Agent training and baseline agents

Using a fixed pretrained environment model, the remaining I2A parameters are trained with A3C.

An entropy regularizer on the policy $\pi$ is added to encourage exploration, along with the auxiliary loss that distills $\pi$ into the rollout policy $\hat{\pi}$.
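One way to picture how these terms combine into a single training objective is sketched below. The additive form, the function name `total_loss`, and the coefficient values are assumptions for illustration only; the notes do not give the weights, and the actor and critic losses are passed in as precomputed scalars.

```python
import math
from typing import Sequence


def entropy(pi: Sequence[float]) -> float:
    """Shannon entropy of a discrete policy (terms with p = 0 contribute 0)."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)


def total_loss(
    policy_loss: float,          # A3C actor loss (precomputed scalar)
    value_loss: float,           # A3C critic loss (precomputed scalar)
    pi: Sequence[float],         # imagination-augmented policy at o_t
    distill_loss: float,         # auxiliary loss distilling pi into pi_hat
    value_coef: float = 0.5,     # hypothetical weight
    entropy_coef: float = 0.01,  # hypothetical weight
    distill_coef: float = 1.0,   # hypothetical weight
) -> float:
    """Loss to minimize: higher policy entropy lowers the loss."""
    return (
        policy_loss
        + value_coef * value_loss
        - entropy_coef * entropy(pi)
        + distill_coef * distill_loss
    )


loss = total_loss(policy_loss=1.0, value_loss=2.0, pi=[0.5, 0.5], distill_loss=0.5)
```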

Baseline agents:
- standard: A3C (I2A without the model-based part)
- standard (large): A3C with more parameters (slightly more than I2A)
- copy-model I2A: the environment model in the I2A is replaced with a 'copy' model that simply returns the input observation (same number of parameters and same architecture)
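To see why the copy-model baseline isolates the contribution of imagination, consider this small sketch. The reward value returned by `copy_model` and the helper `rollout_with_first_action` are assumptions of the example; the notes only say that the model returns its input observation.

```python
from typing import List, Tuple

Obs = Tuple[int, ...]


def copy_model(obs: Obs, action: int) -> Tuple[Obs, float]:
    """'Copy' environment model: ignores the action, returns the observation.

    The predicted reward of 0.0 is an assumption made for this sketch.
    """
    return obs, 0.0


def rollout_with_first_action(obs: Obs, first_action: int,
                              length: int) -> List[Tuple[Obs, float]]:
    """Imagine `length` steps; later actions are fixed to 0 for simplicity."""
    trajectory = []
    action = first_action
    for _ in range(length):
        obs, reward = copy_model(obs, action)
        trajectory.append((obs, reward))
        action = 0
    return trajectory


start = (4, 2)
rollouts = [rollout_with_first_action(start, a, 3) for a in range(4)]
```

Every rollout is identical regardless of its first action, so the imagination path has the same architecture and parameter count but carries no information about the consequences of actions.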

# 4. Sokoban experiments

### 4.1 I2A performance vs. baselines on Sokoban

### 4.2 Learning with imperfect models

### 4.3 Further insights into the workings of the I2A architecture

### 4.4 Imagination efficiency and comparison with perfect-model planning methods

### 4.5 Generalization experiments

# 5. Learning one model for many tasks in MiniPacman

# 6. Related work

# 7. Discussion