# $$\text{Attentive World Models}$$

$$\text{Aaron Dharna       | Michael Tynes}$$
$$\text{Fordham University | Fordham University}$$
$$\text{adharna@fordham.edu| mtynes@fordham.edu}$$

## Introduction
### Control in MDPs

Markov decision processes allow us to model control tasks in a concise way. We have an agent interacting with an environment that consists of a set of states $\mathcal S$, a set of actions that can be performed in those states $\mathcal A$, and a scalar reward associated with each state $R(s)$.

We assume that there exists a function $p(s_{t+1} | s_t, a_t)$ that fully describes the dynamics of the environment. The environment is therefore Markovian: the next state depends only on the current state and action. 

| ![](images/MDP.png)|
|:----:|
|Sutton and Barto, 2018|

An environment is 'solved' by an agent when it finds a policy $\pi(a_t | s_t)$ that maximizes the expected long-term reward, which is defined as  

$$G_t = R_t + \gamma R_{t+1} + \gamma^2R_{t+2} + \cdots$$
for $\gamma \in [0, 1]$

Thus our optimal policy $\pi_*$ is defined as 
$$\pi_* = \arg \max_\pi  \mathbb E_\pi [G_t | s_t = s] \quad \forall s \in \mathcal S$$
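
As a small illustration of the return $G_t$ defined above, the sketch below computes the discounted sum for a finite list of rewards. The function name `discounted_return` and the example rewards are ours, chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for a finite reward list."""
    g = 0.0
    # Accumulate from the last reward backwards using G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative rewards (not taken from the experiments).
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example_return)  # 1 + 0.5 + 0.25 = 1.75
```

With $\gamma$ close to 0 the agent is short-sighted, and with $\gamma = 1$ every future reward counts equally.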

----

## $$\text{Models}$$
### VAE

To accomplish the vision task needed for CarRacing-v0, we (like Ha and Schmidhuber) used a convolutional variational autoencoder (VAE). The VAE compresses our image-state representations into a 32-dimensional vector, z. This latent vector should be representative of the input image. If so, we have managed to take a 64x64x3 image and embed it into a 32-dimensional space without losing relevant information about the input. We can check this by reconstructing an image from z and confirming that the important information has been preserved. 
    
|![](images/vae_recon_input.png) | ![](images/vae_recon_output.png)|
|:----:|:----:|
| original   | reconstruction   |
    
Furthermore, due to the VAE being a generative model, we can dream up completely new images by simply injecting some noise into the bottleneck layer. 

|![](images/vae_dream_image.png) |
|:----:|
| dream image|
    
In this manner we have managed to compress the spatial information of CarRacing. However, that is not all that is required to solve this task. We must also learn temporal information. 
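
The sampling step at the VAE bottleneck can be sketched in a few lines. This is a minimal standard-library sketch, not our training code: `sample_latent` shows the usual reparameterization $z = \mu + \sigma \epsilon$, and `dream_latent` shows one common way to "dream", namely drawing z from the standard normal prior before decoding. The encoder outputs below are hypothetical placeholders, and the decoder is omitted.

```python
import math
import random

LATENT_DIM = 32  # size of z stated in the text


def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def dream_latent(rng):
    """Draw z from the standard normal prior; decoding it would give a 'dream' image."""
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]


rng = random.Random(0)
mu = [0.0] * LATENT_DIM          # hypothetical encoder means
log_var = [-20.0] * LATENT_DIM   # hypothetical log-variances (almost no noise)
z = sample_latent(mu, log_var, rng)
z_dream = dream_latent(rng)
```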
    
### MDN-RNN

The recurrent neural network (RNN) models the dynamics of the world, learning the mapping from ((compressed) state, action) pairs to next states.
Assuming the VAE does its job of compressing the states well, we essentially learn the function
$$f: s_t, a_t \rightarrow s_{t+1}$$
which is an essential property of MDPs. But MDPs are stochastic in general, which is where the mixture density network (MDN) proves useful: it allows us to learn a distribution over next states given a (state, action) pair, that is: 
$$p(s_{t+1} | s_t, a_t)$$
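
To make the idea of predicting a distribution concrete, here is a minimal sketch of sampling one latent dimension from a mixture of Gaussians, which is the kind of output an MDN head produces. The mixture weights, means and standard deviations are made-up values, and `sample_mixture` is an illustrative helper rather than a function from our code.

```python
import random


def sample_mixture(weights, means, stds, rng):
    """Sample one value from a 1-D mixture of Gaussians."""
    # Pick a component according to the mixture weights, then sample from it.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(means[k], stds[k])


rng = random.Random(0)
# Hypothetical MDN output for one latent dimension: two plausible futures.
weights = [0.7, 0.3]
means = [-1.0, 2.0]
stds = [0.1, 0.1]
samples = [sample_mixture(weights, means, stds, rng) for _ in range(1000)]
```

A single Gaussian would have to average the two futures, while the mixture keeps them as separate modes.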
    
    
### Controller

The controller, C, interacts with the environment. Since we have gone through all of the work of learning our VAE and MDN-RNN, we can use a simple function approximator to handle the control -- in fact, C is a linear transformation of z (the compressed image) and h (the compressed temporal information).
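
A linear controller of this kind can be sketched as follows. The hidden size `H_DIM`, the tanh squashing into the action ranges, and the dictionary output are assumptions made for this illustration; only the 32-dimensional z and the linear map come from the text.

```python
import math

Z_DIM = 32    # latent size stated in the text
H_DIM = 256   # hypothetical RNN hidden size; the text does not give one


def controller_action(z, h, weights, bias):
    """Linear controller a = W [z; h] + b, squashed into the action ranges (assumed)."""
    x = list(z) + list(h)
    raw = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return {
        "steer": math.tanh(raw[0]),               # in [-1, 1]
        "gas": (math.tanh(raw[1]) + 1.0) / 2.0,   # in [0, 1]
        "brake": (math.tanh(raw[2]) + 1.0) / 2.0, # in [0, 1]
    }


# All-zero parameters give the neutral action.
weights = [[0.0] * (Z_DIM + H_DIM) for _ in range(3)]
bias = [0.0, 0.0, 0.0]
action = controller_action([0.1] * Z_DIM, [0.0] * H_DIM, weights, bias)
```

Because C has so few parameters, it is a natural fit for gradient-free optimization.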


### Putting it all together
|![](./images/world_model.png)| ![](images/rollout.png)|
|:---:|:----:|
|Ha and Schmidhuber, 2018|Interacting with the WM|


## OpenAI Gym  
OpenAI Gym is a simulation API for many different (and difficult) control tasks. We chose to work in the `CarRacing-v0` environment because that is the environment Ha uses. Ha's World Models solution was reported as the first automated system to successfully solve this task. Through this API we have access to our environment, receive rewards and states, and can pass in new actions for the agent to take. 

| ![](images/CarRacing.gif)|
|:----:|
|Ha and Schmidhuber 2018|

CarRacing-v0 defines “solving” as getting an average reward of 900 over 100 consecutive trials, which means the agent can only afford very few driving mistakes. [OpenAI](https://github.com/openai/gym/blob/master/gym/envs/box2d/car_racing.py)
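
The solving criterion can be written as a sliding-window check over episode rewards. The helper below is an illustrative sketch (the name `is_solved` is ours, not part of Gym) that returns true if any 100 consecutive episodes average at least 900.

```python
def is_solved(episode_rewards, threshold=900.0, window=100):
    """True if any `window` consecutive episodes average at least `threshold`."""
    if len(episode_rewards) < window:
        return False
    running = sum(episode_rewards[:window])
    if running / window >= threshold:
        return True
    # Slide the window one episode at a time, updating the running sum.
    for i in range(window, len(episode_rewards)):
        running += episode_rewards[i] - episode_rewards[i - window]
        if running / window >= threshold:
            return True
    return False
```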

----  

## CarRacing Env
```
        States: default states are numpy array images of size 96x96x3  
        Reward: R = 1000 - 0.1*|F| for a completed track, where |F| is the number of frames in this rollout.  
        Action space: a 3-element vector [steer, gas, brake] with ranges, respectively: [-1,1], [0,1], [0,1]  

```
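
As a quick worked example of the reward formula and action ranges in the block above, the sketch below evaluates the reward for a given frame count and clamps an out-of-range action. Both helpers are illustrative; naming the action components in a dictionary is our choice here, not how Gym represents actions.

```python
ACTION_RANGES = {"steer": (-1.0, 1.0), "gas": (0.0, 1.0), "brake": (0.0, 1.0)}


def rollout_reward(frames):
    """R = 1000 - 0.1 * |F|, as stated in the environment summary."""
    return 1000.0 - 0.1 * frames


def clip_action(action):
    """Clamp each named action component into its allowed range."""
    return {name: min(max(action[name], lo), hi) for name, (lo, hi) in ACTION_RANGES.items()}


reward = rollout_reward(1000)  # 1000 - 0.1 * 1000 = 900, up to float rounding
clipped = clip_action({"steer": 1.7, "gas": -0.2, "brake": 0.5})
```

Under this formula, a rollout has to finish in roughly 1000 frames or fewer to reach the 900-point level used in the solving criterion.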


## $$\text{Methodology}$$
### Data collection

Our dataset, $\mathcal{D}$, consists of play-traces of the `CarRacing-v0` environment in OpenAI Gym. These traces combine states and actions. (If we were using an RL algorithm to teach C, then we would also record reward information at each step.) We recorded 500 playthroughs of the environment, each 384 steps long, for a total of 192000 frames (500 x 384). Each frame of $\mathcal{D}$ was resized from `400x600x3` to a `64x64x3` image. After training $\mathcal{V}$ on $\mathcal{D}$, we applied $\mathcal{V}$ to every frame to obtain $\mathcal{V}(\mathcal{D})$, the `z_states`: 32-dimensional latent representations of the images extracted from $\mathcal{V}$'s bottleneck. Combining $\mathcal{V}(\mathcal{D})$ with $\mathcal{D}$ gives a new dataset, $\mathcal{D'}$, which is then used to train $\mathcal{M}$ and $\mathcal{C}$.

We had a surprising amount of difficulty getting good traces to learn from. 
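
The dataset bookkeeping and the resize step can be sketched as below. This uses plain nested lists and nearest-neighbour sampling purely for illustration; the interpolation method actually used is not specified here, and `resize_nearest` is a hypothetical helper.

```python
PLAYTHROUGHS = 500
STEPS_PER_PLAYTHROUGH = 384
TOTAL_FRAMES = PLAYTHROUGHS * STEPS_PER_PLAYTHROUGH  # 192000


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of an image stored as rows of (r, g, b) pixels."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# A synthetic 400x600 frame whose pixel values encode their own coordinates.
frame = [[(r, c, 0) for c in range(600)] for r in range(400)]
small = resize_nearest(frame, 64, 64)
```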
    
### Training
Each model was trained individually.  

$\mathcal{V}$ was trained to maximize the evidence lower bound (equivalently, to minimize its negative: reconstruction error plus a KL-divergence term).  
$\mathcal{M}$ was trained to minimize the negative log-likelihood of the $\mathcal{V}$-encoded next states under the mixture of factored Gaussians whose parameters are output by the MDN head.  
$\mathcal{C}$ was trained using an evolutionary algorithm that maximizes the fitness of a population of linear controllers using a non-gradient optimization method.  
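
For the MDN objective, a minimal one-dimensional sketch of the negative log-likelihood is shown below. It is a standard-library illustration with made-up mixture parameters, not our training loss. For a factored Gaussian over the 32 latent dimensions, the per-dimension terms would be summed.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a 1-D Gaussian."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def mdn_nll(target, weights, means, stds):
    """Negative log-likelihood of `target` under a 1-D Gaussian mixture."""
    log_terms = [
        math.log(w) + gaussian_log_pdf(target, m, s)
        for w, m, s in zip(weights, means, stds)
    ]
    # Log-sum-exp for numerical stability.
    top = max(log_terms)
    log_likelihood = top + math.log(sum(math.exp(t - top) for t in log_terms))
    return -log_likelihood


# Hypothetical mixture: the loss is low near a mode and high between modes.
near_mode = mdn_nll(2.0, [0.7, 0.3], [-1.0, 2.0], [0.1, 0.1])
between_modes = mdn_nll(0.5, [0.7, 0.3], [-1.0, 2.0], [0.1, 0.1])
```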
       
#### Dream Training
Dream training bypasses the real environment and uses M to predict the next state given the current state and the action chosen by the controller. The graphic below illustrates how the controller interacts with the World Model. Recall that we can do this because M models the environment dynamics as:

$$p(s_{t+1} | s_t, a_t)$$

|![](./images/dream_training.png)|
|:---:|

While not fully faithful to the real environment, this can reveal interesting properties of the world model, as discussed in Ha and Schmidhuber 2018. Additionally, it is much faster, since a forward pass through the neural nets runs faster than a step in OpenAI Gym. 
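
The dream-training loop described above can be sketched as follows. The controller and world model here are toy stand-ins on a scalar latent, defined only so the example runs. In the real setting they would be C and the MDN-RNN M, and z would be the 32-dimensional latent.

```python
import random


def dream_rollout(controller, world_model, z0, h0, steps):
    """Roll the controller forward inside the learned model, never touching the real environment."""
    z, h = z0, h0
    trajectory = [z]
    for _ in range(steps):
        action = controller(z, h)
        z, h = world_model(z, h, action)  # M samples the next latent state
        trajectory.append(z)
    return trajectory


rng = random.Random(0)


def toy_controller(z, h):
    """Hypothetical controller: push the latent toward zero."""
    return -0.5 * z


def toy_world_model(z, h, action):
    """Hypothetical stochastic dynamics standing in for p(s_{t+1} | s_t, a_t)."""
    return z + action + rng.gauss(0.0, 0.01), h


trajectory = dream_rollout(toy_controller, toy_world_model, z0=1.0, h0=0.0, steps=20)
```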
    
### Attention

Attention layers add masks on top of feature representations in order to highlight important features involved in long-term dependencies. Attention has its roots in machine translation, where it is useful for preserving context information over long sequence lengths, e.g. preserving semantic information across long sentences when the syntax of the two languages may be very different. Attention allows networks to better find global correlations over time in the case of RNNs and over space in the case of CNNs. Whereas convolutions find local features by their nature as small filters slid over images, attention maps, which are just filters on top of conv-volumes, allow networks to mask out local noise and detect global features. Source: long conversations with a grad student in Aaron's NYU lab. 

We used additive attention, rather than self- or multiplicative attention. Additive attention is written up in [2]. We implemented additive attention as a class which is stored in the `models.py` file and is used in the attn-mdn-rnn class. You can find this in `attentive-mdn-rnn.ipynb`.  
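
For readers unfamiliar with additive attention, the general form can be sketched with the standard library alone. This is not the class from `models.py`: the names `w_q`, `w_k`, `v` and the tiny example inputs are assumptions for illustration. Each key gets a score $v^\top \tanh(W_q q + W_k k_i)$, the scores go through a softmax, and the context is the weighted sum of the keys.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def additive_attention(query, keys, w_q, w_k, v):
    """Return (context, weights) with score_i = v . tanh(W_q q + W_k k_i)."""
    q_proj = matvec(w_q, query)
    scores = []
    for key in keys:
        k_proj = matvec(w_k, key)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return context, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
context, weights = additive_attention([1.0, 0.0], keys, identity, identity, v=[1.0, 0.0])
```

In the multiplicative variant, the score would instead be a dot product between the query and each key.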

### Evolution

Evolution is an alternative approach to reinforcement learning when learning a control task. Evolution as an optimization paradigm draws inspiration from nature. In 'normal' optimization, we have a single function approximator, $\mathcal F$, whose parameters we alter until it fits some given dataset. In gradient-based optimization, we usually search for an optimum by taking the derivative of a loss function with respect to the parameters of $\mathcal F$ and repeatedly updating those parameters in the direction of steepest descent. In an evolutionary paradigm we do not do this. 

First, we build a population of potential candidate solutions $\mathcal P$. We then evaluate each $p \in \mathcal P$ on the task at hand, receiving a score of p's performance. After ranking each $p \in \mathcal P$, we pick the best performing individuals and create new candidate solutions from them. Since the next generation is created from the best performing individuals of the previous generation, the behavior of the individuals should tend to improve. 
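
The loop described above can be sketched as a simple elitist evolutionary search. This is one minimal variant (truncation selection plus Gaussian mutation) chosen for illustration and is not necessarily the algorithm we ran. The toy fitness function stands in for the rollout reward a controller would earn.

```python
import random


def evolve(fitness, dim, population_size=10, generations=50, n_elite=3, sigma=0.1, seed=0):
    """Keep the best `n_elite` candidates each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elites = ranked[:n_elite]
        # Elites survive unchanged, so the best fitness never decreases.
        population = [list(e) for e in elites]
        while len(population) < population_size:
            parent = rng.choice(elites)
            population.append([w + rng.gauss(0.0, sigma) for w in parent])
    return max(population, key=fitness)


def toy_fitness(params):
    """Hypothetical fitness: higher is better, maximized at params = (3, 3, ...)."""
    return -sum((w - 3.0) ** 2 for w in params)


best = evolve(toy_fitness, dim=2)
```

In the real task the fitness is noisy, because each evaluation is a rollout on a randomly generated track, so a candidate's score is typically averaged over several rollouts.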

----

## $$\text{Results}$$
### VAE
VAE training was successful. See `vae_training.ipynb` (and the associated HTML) for more. 

|![](images/vae_training_loss.png)|![](images/vae_val_loss.png)|
|:---:|:---:|
|Train loss| Val loss|

### MDN-RNN
Training for the non-attentive version of the MDN-RNN appears to have also been successful. Interestingly, the val and train loss track one another almost perfectly. This may be due to strong performance in modelling the world dynamics. It does look good enough to arouse some suspicion, but we haven't been able to find anything wrong with the code...  

|![](images/mdn_rnn_learning_curves.png)|
|:---:|
|MDN-RNN learning curves|

### ATTN-MDN

The attentive MDN-RNN learning curves are a little more erratic, although they still seem to converge. A random hyperparameter search might find settings that smooth out the loss curves.

|![](images/attn_mdn_rnn_learning_curves.png)|
|:---:|
|ATTN-MDN-RNN learning curves|


### Controller


|![](images/controller_training.png)|
|:---:|
|Controller learning curve|
|X-axis: generation, Y-axis: best score for generation|
|Training time: 11hrs|

Evolution Hyperparameters Used:  
```       
        Population size: 10  
        Generations: 50
```
This is a very small population size and number of generations. Ideally we would use something more like the following, but due to computational resource and time constraints we had to cut back:  

```     
        Population size: 100  
        Generations: 4000  
```

Additionally: this only shows the performance of the controller that got inputs from the standard (non-attentive) MDN-RNN. The job training the attention version died mid-evolution. We have not had time to re-run it. This will go in future work. 

----  

## Discussion

### Where did this go wrong?

This project was pretty ambitious, especially since we needed to use VAEs and RNNs but Mike only learned those in class, which left us with two weeks in which both of us knew what we were doing. Considering that, our results are pretty darn good. 

Notably, the performance of the VAE and standard RNN look good, partially because we were able to use the same hyperparameters that Ha and Schmidhuber published on their GitHub. The attentive RNN looks worse, possibly because we did not have time to conduct a good hyperparameter search. 

Additionally, we simply did not have the time or parallel compute resources to conduct the evolutionary search for a good controller, which is likely why the controller performance is so bad. The evolutionary algorithm has to perform hundreds of thousands of rollouts in the OpenAI Gym `CarRacing-v0` environment, each of which takes 30 seconds. Without access to parallel compute resources to train on, this is time prohibitive. 
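
A back-of-the-envelope estimate makes the cost concrete. The sketch below assumes exactly one rollout per candidate per generation and no overhead beyond the 30 seconds per rollout quoted above, so it gives a lower bound on serial wall-clock time. The `rollouts_per_candidate` parameter is our addition for illustration.

```python
SECONDS_PER_ROLLOUT = 30  # figure quoted in the text


def serial_cost(population_size, generations, rollouts_per_candidate=1):
    """Return (number of rollouts, serial wall-clock hours) for an evolutionary run."""
    rollouts = population_size * generations * rollouts_per_candidate
    return rollouts, rollouts * SECONDS_PER_ROLLOUT / 3600.0


small_rollouts, small_hours = serial_cost(10, 50)      # the run we could afford
ideal_rollouts, ideal_hours = serial_cost(100, 4000)   # the run we would have liked
ideal_days = ideal_hours / 24.0
```

Under these assumptions the small run needs 500 rollouts (about 4.2 hours) and the ideal run needs 400,000 rollouts (about 139 days) when executed serially. The 11 hours we observed is above that lower bound, which would be consistent with additional per-step model overhead or more than one rollout per candidate.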


### What's next

Aside from checking our code for bugs and getting access to parallel computing hardware/software, we can make the following improvements: 
 
* Try attention with dream training: 
    * This does not hit the OpenAI Gym bottleneck: it can go as fast as matrix multiplication. 
* Add self-attention to the VAE
* Longer training time for C
* Cut off bottom part of the image
    * This should make mode-collapse less likely


### Profit.

# $$\text{Citations}$$

(1) Ha and Schmidhuber, 2018, worldmodels.github.io  
(2) Xu, K., Ba, J., Kiros, R., et al. **Show, Attend and Tell**, 2015, arXiv e-prints, arXiv:1502.03044