# $$\text{Attentive World Models}$$

$$\text{Aaron Dharna       | Michael Tynes}$$
$$\text{Fordham University | Fordham University}$$
$$\text{adharna@fordham.edu| mtynes@fordham.edu}$$

In [2]:
from IPython.display import HTML
from IPython.display import display

# Taken from https://stackoverflow.com/questions/31517194/how-to-hide- \
# one-specific-cell-input-or-output-in-ipython-notebook

tag = HTML('''<script>
code_show=true; 
function code_toggle() {
    if (code_show){
        $('div.cell.code_cell.rendered.selected div.input').hide();
    } else {
        $('div.cell.code_cell.rendered.selected div.input').show();
    }
    code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
To show/hide this cell's raw code input, click <a href="javascript:code_toggle()">here</a>.''')
display(tag)

import os
import numpy as np

from models import *
from utils import *

import matplotlib.pyplot as plt

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# NOTE: I have started putting images for us to load into `AWM/images`. 

## Introduction
### Control in MDPs

Markov decision processes allow us to model control tasks in a concise way. We have an agent interacting with an environment that consists of a set of states $\mathcal S$, a set of actions that can be performed in those states $\mathcal A$, and a scalar reward associated with each state $R(s)$.

We assume that there exists a function $p(s_{t+1} | s_t, a_t)$ that fully describes the dynamics of the environment. Because the next state depends only on the current state and action, the process is Markovian. 

| ![](images/MDP.png)|
|:----:|
|Sutton and Barto, 2018|

An environment is 'solved' by an agent when it finds a policy $\pi(a_t | s_t)$ that maximizes the expected long-term reward, which is defined as  

$$G_t = R_t + \gamma R_{t+1} + \gamma^2R_{t+2} + \cdots$$
for $\gamma \in [0, 1]$
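
As a small illustration of the return defined above, the sum can be accumulated backwards through a finite list of rewards. This is only a sketch: `discounted_return` and the example rewards are hypothetical names and values, not part of our code base.

```python
def discounted_return(rewards, gamma=0.99):
    """Return G_0 = sum_k gamma**k * rewards[k] for a finite episode."""
    g = 0.0
    # Work backwards so each step applies G_t = R_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```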

Thus our optimal policy $\pi_*$ is defined as 
$$\pi_* = \arg \max_\pi  \mathbb E_\pi [G_t | s_t = s] \quad \forall s \in \mathcal S$$

----

## $$\text{Models}$$
### VAE

To accomplish the vision task needed for CarRacing-v0, we (like Ha and Schmidhuber) used a convolutional Variational Auto-encoder (VAE). The VAE compresses our image-state representations into a 32-dimensional vector, z. This latent vector should be representative of the input image. If this is the case, then we have managed to take a 64x64x3 image and embed it into a 32-dimensional space without losing relevant information about the input. In fact, we can check this by reconstructing an image from the z-space and seeing that the important information has been preserved. 
    
|![](images/vae_recon_input.png) | ![](images/vae_recon_output.png)|
|:----:|:----:|
| original   | reconstruction   |
    
Furthermore, due to the VAE being a generative model, we can dream up completely new images by simply injecting some noise into the bottleneck layer. 

|![](images/vae_dream_image.png) |
|:----:|
| dream image|
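
The sampling at the bottleneck can be sketched as follows. This assumes the standard VAE setup in which the encoder outputs a mean and a log-variance per latent dimension; `sample_latent` and the zero-valued encoder outputs are hypothetical stand-ins, not our trained model. Setting the mean to zero and the variance to one corresponds to drawing a "dream" latent from the prior.

```python
import numpy as np

Z_DIM = 32  # latent size used in this notebook


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


rng = np.random.default_rng(0)
# Hypothetical encoder outputs; mu = 0 and log_var = 0 is the N(0, I) prior.
mu = np.zeros(Z_DIM)
log_var = np.zeros(Z_DIM)
z = sample_latent(mu, log_var, rng)
print(z.shape)  # (32,)
```

Passing such a z through the decoder is what produces a new image.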
    
In this manner we have managed to compress the spatial information of CarRacing. However, that is not all that is required to solve this task. We must also learn temporal information. 
    
### MDN-RNN

The RNN models the dynamics of the world, learning the mapping from (compressed state, action) pairs to next states.
Assuming the VAE does its job of compressing the states well, we essentially learn the function
$$f: s_t, a_t \rightarrow s_{t+1}$$ which is an essential property of MDPs. But MDPs are, in general, stochastic, which is where the MDN proves useful: it lets us learn a distribution over next states given a state-action pair, that is: 
$$p(s_{t+1} | s_t, a_t)$$
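
To make the idea of a distribution over next states concrete, here is a minimal sketch of sampling from a mixture of diagonal Gaussians. The function `sample_mdn`, the number of components `K`, and the random parameters are all assumptions for illustration; in practice the MDN head would output `pi`, `mu`, and `sigma`, and its exact parameterization may differ from this.

```python
import numpy as np


def sample_mdn(pi, mu, sigma, rng):
    """Sample one next latent from a mixture of K diagonal Gaussians.

    pi:    (K,)   mixture weights, summing to 1
    mu:    (K, D) component means
    sigma: (K, D) component standard deviations
    """
    k = rng.choice(len(pi), p=pi)  # pick a mixture component
    return mu[k] + sigma[k] * rng.standard_normal(mu.shape[1])


rng = np.random.default_rng(0)
K, D = 5, 32  # K is a hypothetical choice; D matches the latent size
pi = np.full(K, 1.0 / K)
mu = rng.standard_normal((K, D))
sigma = np.full((K, D), 0.1)
z_next = sample_mdn(pi, mu, sigma, rng)
print(z_next.shape)  # (32,)
```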
    
    
    
    
    
### Controller

The controller, C, interacts with the environment. Since we have gone through all of the work of learning V and M, we can use a simple function approximator to handle the control. In fact, C is a linear transformation of z (the compressed image) and h (the compressed temporal information). 
    
    

In [6]:
C = Controller(256+32, 3)
C.shape

(288, 3)
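
The shape above follows from concatenating z (32 values) and h (256 values) and mapping them to 3 action values. The sketch below illustrates such a linear map with NumPy. It is not our `Controller` class: `controller_action`, the bias term, and the `tanh` squashing are assumptions made for the example.

```python
import numpy as np

Z_DIM, H_DIM, A_DIM = 32, 256, 3


def controller_action(W, b, z, h):
    """Linear controller: a = tanh([z, h] @ W + b). The tanh is an assumption."""
    x = np.concatenate([z, h])  # (288,)
    return np.tanh(x @ W + b)   # (3,)


rng = np.random.default_rng(0)
W = rng.standard_normal((Z_DIM + H_DIM, A_DIM)) * 0.1  # hypothetical weights
b = np.zeros(A_DIM)
a = controller_action(W, b, np.zeros(Z_DIM), np.zeros(H_DIM))
print(W.shape)          # (288, 3)
print(W.size + b.size)  # 867 parameters in total
```

Keeping C this small is what makes a gradient-free search over its parameters practical.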

## OpenAI Gym  
OpenAI Gym is a simulation API for many different (and difficult) control tasks. We chose to work in the `CarRacing-v0` environment, as that is the same environment that Ha uses. Ha's World Models solution was the first automated system to successfully solve this task. Through this API, we have access to our environment, receive rewards and states, and can pass in new actions for the agent to take. 

```python
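def rollout(controller):
    """Run one episode and return its cumulative reward.

    env, rnn, and vae are global variables.
    """
    obs = env.reset()
    h = rnn.initial_state()
    done = False
    cumulative_reward = 0
    while not done:
        z = vae.encode(obs)                 # compress the frame to a latent z
        a = controller.action([z, h])       # act on z and the RNN hidden state
        obs, reward, done, _ = env.step(a)  # classic Gym returns (obs, reward, done, info)
        cumulative_reward += reward
        h = rnn.forward([a, z, h])          # advance the world model's hidden state
    return cumulative_reward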
```

| ![](images/CarRacing.gif)|
|:----:|
|Ha and Schmidhuber 2018|

CarRacing-v0 defines “solving” as getting an average reward of 900 over 100 consecutive trials, which means the agent can afford very few driving mistakes ([OpenAI](https://github.com/openai/gym/blob/master/gym/envs/box2d/car_racing.py)).
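
The solving criterion can be checked with a few lines of code. This is a sketch only: `is_solved` is a hypothetical helper, and the episode rewards would come from repeated calls to a rollout function like the one above.

```python
import numpy as np


def is_solved(episode_rewards, threshold=900.0, window=100):
    """True if the mean reward over the last `window` episodes meets the threshold."""
    if len(episode_rewards) < window:
        return False  # not enough consecutive trials yet
    return float(np.mean(episode_rewards[-window:])) >= threshold


print(is_solved([905.0] * 100))  # True
print(is_solved([905.0] * 99))   # False
```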

----  

## Methodology
### Data collection
Our dataset, $\mathcal{D}$, consists of play-traces of the CarRacing-v0 environment in OpenAI Gym. These traces combine states and actions. (If we were using an RL algorithm to teach C, then we would also record reward information at each step.) We recorded 500 playthroughs of the CarRacing-v0 environment, where each playthrough was 384 steps long. This led to a total of 192,000 recorded frames. Each frame of $\mathcal{D}$ was resized from `400x600x3` to a `64x64x3` image. After training $\mathcal{V}$ with $\mathcal{D}$, we applied $\mathcal{V}$ to $\mathcal{D}$ so that we also recorded `z_states`. These z_states are 32-dimensional latent representations of the images, extracted from V's bottleneck. Combining $\mathcal{V}(\mathcal{D})$ with $\mathcal{D}$ gave us a new dataset, $\mathcal{D'}$, which is then used to train $\mathcal{M}$ and $\mathcal{C}$.
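
One way to turn a trace from $\mathcal{D'}$ into training examples for $\mathcal{M}$ is to pair each (z, action) with the following z. The sketch below assumes 3-dimensional actions and uses zero-filled placeholder arrays. `make_transition_pairs` is a hypothetical helper, not necessarily how our data loader is organized.

```python
import numpy as np


def make_transition_pairs(z, a):
    """Turn one trace into ((z_t, a_t) -> z_{t+1}) training pairs.

    z: (T, 32) latent states, a: (T, 3) actions.
    """
    inputs = np.concatenate([z[:-1], a[:-1]], axis=1)  # (T-1, 35)
    targets = z[1:]                                    # (T-1, 32)
    return inputs, targets


T = 384                # steps per playthrough
z = np.zeros((T, 32))  # placeholder latents
a = np.zeros((T, 3))   # placeholder actions
x, y = make_transition_pairs(z, a)
print(x.shape, y.shape)  # (383, 35) (383, 32)
print(500 * T)           # 192000 frames in total
```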

It was surprisingly difficult to get good traces to learn from. 
    
### Training
Each model was trained individually.
$\mathcal{V}$ was trained to maximize the evidence lower bound (equivalently, to minimize the negative ELBO).
$\mathcal{M}$ was trained to minimize the negative log-likelihood of the $\mathcal{V}$-encoded next states under the mixture of factored Gaussians whose parameters are output by the MDN head. 
$\mathcal{C}$ was trained using an evolutionary algorithm, a gradient-free optimization method that maximizes the fitness of a population of linear controllers.
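
The two gradient-based objectives can be written out in NumPy for a single example. This is a sketch under stated assumptions: the squared-error reconstruction term, the function names, and the single-mixture-per-vector parameterization are illustrative choices and may differ from our training code.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)


def negative_elbo(x, x_recon, mu, log_var):
    """Negative ELBO with a squared-error reconstruction term (an assumption)."""
    recon = np.sum((x - x_recon) ** 2)
    return recon + kl_to_standard_normal(mu, log_var)


def mdn_nll(z_next, pi, mu, sigma):
    """NLL of z_next under a mixture of K diagonal Gaussians.

    z_next: (D,), pi: (K,), mu: (K, D), sigma: (K, D)
    """
    log_comp = -0.5 * np.sum(
        ((z_next - mu) / sigma) ** 2 + 2.0 * np.log(sigma) + np.log(2.0 * np.pi),
        axis=1,
    )  # (K,) log-density of each component
    log_mix = np.log(pi) + log_comp
    m = np.max(log_mix)  # log-sum-exp for numerical stability
    return -(m + np.log(np.sum(np.exp(log_mix - m))))
```

For a single standard-normal component evaluated at its mean, `mdn_nll` reduces to $\tfrac{1}{2}\log 2\pi$ per dimension, which is a quick sanity check.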
    
    
### Attention
### Evolution

Evolution is an alternative approach to reinforcement learning when learning a control task. Evolution as an optimization paradigm draws inspiration from nature. In 'normal' optimization, we have a single function approximator, $\mathcal F$, whose parameters we alter until it learns to fit some given dataset. In gradient-based optimization, we usually search for an optimum by taking the gradient of a loss function with respect to the parameters of $\mathcal F$ and then repeatedly updating $\mathcal F$'s parameters in the direction of steepest descent. In an evolutionary paradigm we do not do this. 

First, we build a population of candidate solutions $\mathcal P$. We then evaluate each $p \in \mathcal P$ on the task at hand, receiving a score of $p$'s performance. After ranking each $p \in \mathcal P$, we pick the best-performing individuals and create new candidate solutions from them. Since the next generation is created from the best-performing individuals of the previous generation, the behavior of the individuals should keep improving. 
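
The loop described above can be sketched on a toy problem. This is a deliberately simple elitist scheme with Gaussian mutation, not necessarily the evolutionary algorithm we used for $\mathcal{C}$. The `evolve` function, its hyperparameters, and the toy fitness are all hypothetical.

```python
import numpy as np


def evolve(fitness, dim, pop_size=32, n_elite=8, sigma=0.1, generations=50, seed=0):
    """Elitist evolution: keep the best individuals, mutate them to refill the population."""
    rng = np.random.default_rng(seed)
    population = rng.standard_normal((pop_size, dim))
    for _ in range(generations):
        scores = np.array([fitness(p) for p in population])
        elite = population[np.argsort(scores)[-n_elite:]]  # top performers
        parents = elite[rng.integers(n_elite, size=pop_size - n_elite)]
        children = parents + sigma * rng.standard_normal(parents.shape)
        population = np.concatenate([elite, children])     # elites survive unchanged
    scores = np.array([fitness(p) for p in population])
    return population[np.argmax(scores)], float(np.max(scores))


# Toy fitness: higher is better, with a maximum of 0 at the origin.
best, score = evolve(lambda p: -np.sum(p**2), dim=5)
print(score)
```

For the controller, the fitness of an individual would be the cumulative reward of a rollout using that individual's parameters.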

----

In [1]:
ls images

[0m[01;35mattn_mdn_rnn_learning_curves.png[0m  [01;35mrollout.png[0m           [01;35mvae_training_loss.png[0m
[01;35mCarRacing.gif[0m                     [01;35mvae_dream_image.png[0m   [01;35mvae_val_loss.png[0m
[01;35mmdn_rnn_learning_curves.png[0m       [01;35mvae_recon_input.png[0m   [01;35mworld_model.png[0m
[01;35mMDP.png[0m                           [01;35mvae_recon_output.png[0m


## $$\text{Results}$$
### VAE
VAE training was successful. See `vae_training.ipynb` (and the associated HTML) for more. 

|![](images/vae_training_loss.png)|![](images/vae_val_loss.png)|
|:---:|:---:|
|Train loss| Val loss|

### MDN-RNN
Training for the non-attentive version of the MDN-RNN appears to have also been successful. Interestingly, the val and train loss track one another almost perfectly. This may be due to strong performance in modelling the world dynamics. It does look good enough to arouse some suspicion, but we haven't been able to find anything wrong with the code...  

|![](images/mdn_rnn_learning_curves.png)|
|:---:|
|MDN-RNN learning curves|

### ATTN-MDN

The attentive MDN-RNN learning curves are a little more erratic, although they still seem to converge. A random hyperparameter search might smooth out the training dynamics.

|![](images/attn_mdn_rnn_learning_curves.png)|
|:---:|
|ATTN-MDN-RNN learning curves|


#### CONTROLLER

----  

## Discussion
### Where did this go wrong?
### What's next?
### Profit.

## Citations:

(1) Ha and Schmidhuber, **World Models**, 2018, worldmodels.github.io  
(2) Xu, K., Ba, J., Kiros, R., et al. **Show, Attend, and Tell**, 2015, arXiv e-prints, arXiv:1502.03044



