# Learning to estimate world state

## Introduction


Planning is a central technique in AI and robotics. Many tasks, such as efficient navigation or chess playing, rely on an agent's ability to simulate possible futures in order to determine the most favourable actions. However, in order to predict reliably, agents require robust models of their environments. While the engineer often has the relevant knowledge, it may be difficult to transfer it to the agent:
1. It may be difficult to express one's understanding algorithmically, e.g. how does one assess the goodness of a Go move?
2. The mismatch between abstract mathematical models and complex, unstructured reality often means that the predictions are far from useful.

When the knowledge transfer problem is challenging, one may attempt to create an agent that gains the relevant understanding from its own experience. This work focuses on the construction of deployable predictive models from scratch. The goal is to understand which aspects of environments make learning more difficult and which subproblems need to be solved.

The idea of agents building their own models is not novel: an early example is [Dyna](https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a-html/node29.html). However, recent advances in deep learning have enabled automated understanding of truly complex relationships and the formation of rich representations.

Impressive recent examples where agents learnt to successfully predict environments include:
1. [Recurrent Environment Simulators](https://arxiv.org/abs/1704.02254) and [Action-Conditional Video Prediction using Deep Networks in Atari Games](https://arxiv.org/abs/1507.08750), where a deep recurrent network learns to predict future screen frames based on past percepts and joystick data. See videos [here](https://drive.google.com/drive/folders/0B_L2b7VHvBW2SEllTmlEX1l0RGc).
2. [Prediction Under Uncertainty with Error-Encoding Networks](https://arxiv.org/abs/1711.04994), where a network learns to sample possible futures by generating likely future percepts, in effect learning a stochastic model of the environment.

Still, even these cutting-edge algorithms struggle in more challenging environments. The first approach is not very good at modelling stochastic state transitions. The second has not been shown to produce long-term predictions.

In this work, we attempt to delineate the different challenges in learning an environment model from percepts. To achieve this, a configurable simulator of billiard balls was implemented. Then, a deep neural architecture was trained to predict the outputs of the simulator. Since the underlying model was available, the predictions can be compared to the near-optimal performance of sequential Monte Carlo methods (also known as particle filter state estimation). By considering the information available to the network and the loss functions which shape its learning, one can make an educated guess about the performance of the architecture in different learning scenarios.

This document is structured as follows:

**Section 2** describes, from an information/computation flow perspective, the problem faced by agents learning about the world. It then describes variations of the problem.

**Section 3** 


## Problem statement



### General problem

Learning agents are spawned in various environments and are given different tasks to complete. Let us start by considering an abstraction of a time-aware agent learning about its world (Figure xxx). The arrows in the graph represent computations; however, they can also be viewed as flows of information. The vertical arrows coincide with the direction of the flow of time (downwards).

The left-hand side of the diagram represents calculations "carried out" in the environment: the state of the world evolves according to some rules.

The right-hand side refers to computations in the agent's mind. The agent can try to learn the state of the world thanks to information flowing through the observation channel. How much of the world's state the agent can directly observe depends on the problem.

How information is processed on the right-hand side of the graph depends on the task at hand.

<img src="images/general_problem.png" width="500" align="center">
<i>Figure xxx: Graph representing computations and flow of information in the world and through the agent's mind.</i>

Consider the variables from the diagram:
* $s_t$ is the state of the environment at time $t$
* $T_W$ is a computation that takes the environment from state $s_t$ to $s_{t+1}$. That is, $T_W : s_t \mapsto s_{t+1}$
* $o_t$ is the observation at time $t$, produced by a particular sensor based on the underlying world state $s_t$
* $P_S$ is a computation that generates an observation $o_t$ from state $s_t$. That is, $P_S : s_t \mapsto o_t$
* $bs_t$ is a representation of the agent's knowledge at time $t$
* $T_B$ is a computation that propagates the agent's beliefs forward in time
* $P^{-1}_{B}$ is a computation that updates the agent's belief using the observation
* $a_t$ is the part of the world state that is under direct control of the agent. The agent may influence the trajectory of future world states by controlling this variable (this does not apply to passive agents).
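
To make the roles of these variables concrete, the following is a minimal sketch of the loop for a passive agent (no $a_t$). The function name `rollout`, its argument names and the ordering of the steps within one time step are assumptions made for illustration; they are not taken from the diagram.

```python
def rollout(s0, b0, T_W, P_S, T_B, P_B_inv, steps):
    """Run world and agent side by side for a number of time steps.

    s0, b0  -- initial world state and initial agent belief
    T_W     -- world transition, s_t -> s_{t+1}
    P_S     -- sensor, s_t -> o_t
    T_B     -- belief propagation, bs_t -> bs_{t+1}
    P_B_inv -- belief update from an observation, (bs_t, o_t) -> bs_t
    """
    s, b = s0, b0
    history = []
    for _ in range(steps):
        o = P_S(s)           # the world emits an observation
        b = P_B_inv(b, o)    # the agent corrects its belief
        history.append((s, o, b))
        s = T_W(s)           # the world moves on (left-hand side)
        b = T_B(b)           # the agent propagates its belief (right-hand side)
    return history
```

For example, with a counter as the world (`T_W=lambda s: s + 1`), a fully revealing sensor (`P_S=lambda s: s`) and an agent that simply copies the observation, the belief recorded at every step equals the true state. The interesting cases are those in which `P_S` hides part of the state or `T_W` is stochastic.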

Representing the problem of an agent learning about its environment in this fashion is useful because it lets us clearly see the different challenges that arise during learning. We will see how $T_W$ and $P_S$ vary between environments such as Atari games and colliding billiard balls, and what this implies about the difficulty of learning about those environments. Lastly, this view translates well into the design of modular neural networks.

### Examples of environments and tasks

Let us consider examples of environments, along with tasks performed by various agents, to understand how they map onto the proposed abstraction.

#### State estimation

In applications such as target tracking or robot navigation, the goal is to continually estimate the most likely state of the world given noisy observations from sensors. The variable of interest might be the $(x, y)$ position of an aircraft, and the observations might be, for example, radar measurements.

Well-known examples of algorithms for this class of problems include the [Kalman filter](https://en.wikipedia.org/wiki/Kalman_filter) (KF) and the [Particle filter](https://en.wikipedia.org/wiki/Particle_filter) (PF, also known as Sequential Monte Carlo).

For these algorithms, the representation of the agent's knowledge ($bs$), the method for propagating beliefs into the future ($T_B$) and the probabilistic update based on measurements ($P^{-1}_B$) are all fixed and defined explicitly by the programmer.

In the Kalman filter, the knowledge is represented by a multivariate Gaussian distribution over the variables of interest, e.g. for a target in 2D space those might be position and velocity:

<center>$s = (x, y, \dot{x}, \dot{y})^T$</center>
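
As an illustration of how $T_B$ and $P^{-1}_B$ are written out by hand in a Kalman filter, here is a minimal sketch for a simplified one-dimensional version of the state above: position and velocity along a single axis, with only the position being measured. The constant-velocity model, the function names and the noise parameters `q` and `r` are assumptions chosen for this example.

```python
def kf_predict(x, P, dt, q):
    """T_B: propagate mean x = (position, velocity) and 2x2 covariance P."""
    p, v = x
    (p00, p01), (p10, p11) = P
    x_new = (p + dt * v, v)
    # P' = F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I
    n00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
    n01 = p01 + dt * p11
    n10 = p10 + dt * p11
    n11 = p11 + q
    return x_new, ((n00, n01), (n10, n11))


def kf_update(x, P, z, r):
    """P_B^{-1}: correct the belief with a noisy position measurement z."""
    p, v = x
    (p00, p01), (p10, p11) = P
    s = p00 + r                  # innovation variance, with H = [1, 0]
    k0, k1 = p00 / s, p10 / s    # Kalman gain
    y = z - p                    # innovation (measurement residual)
    x_new = (p + k0 * y, v + k1 * y)
    # P' = (I - K H) P
    P_new = (((1 - k0) * p00, (1 - k0) * p01),
             (p10 - k1 * p00, p11 - k1 * p01))
    return x_new, P_new
```

Calling `kf_predict` followed by `kf_update` once per time step tracks the target. Note that the velocity is never measured directly; the filter infers it through the correlation between position and velocity stored in `P`.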

In a particle filter, the distribution over states is represented by a collection of copies of the state vector (particles), each corresponding to a different hypothesis about the state. Because of this, a PF can represent multimodal distributions and model non-linear processes, which comes at an increased computational cost relative to the KF.
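
A minimal sketch of one step of a bootstrap particle filter is given below. It is one common variant, not necessarily the one used as the baseline later in this document. The helper names `random_walk` and `gaussian_likelihood` and their noise levels are assumptions for illustration only.

```python
import math
import random


def pf_step(particles, observation, transition, likelihood, rng):
    """Propagate, weight and resample a list of particles."""
    # T_B: push every hypothesis through the (possibly stochastic) world model
    proposed = [transition(p, rng) for p in particles]
    # P_B^{-1}: weight each hypothesis by how well it explains the observation
    weights = [likelihood(observation, p) for p in proposed]
    if sum(weights) == 0.0:
        # No particle explains the observation: fall back to uniform weights
        weights = [1.0] * len(proposed)
    # Resample with replacement in proportion to the weights
    return rng.choices(proposed, weights=weights, k=len(proposed))


def random_walk(p, rng, sigma=0.1):
    """Assumed world model: the state drifts by Gaussian noise."""
    return p + rng.gauss(0.0, sigma)


def gaussian_likelihood(o, p, sigma=0.5):
    """Assumed sensor model: the observation is the state plus Gaussian noise."""
    return math.exp(-0.5 * ((o - p) / sigma) ** 2)
```

The mean of the returned particles serves as a point estimate of the state. Both `transition` and `likelihood` can be arbitrary functions, which is what gives the method its flexibility, and the cost grows with the number of particles.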

The core weakness of this family of methods is that the model of the world ($T_W$) and the model of the sensor ($P_S$) must be at least approximately known. Any mismatch will result in lower performance of the estimators. Additionally, the solutions are applicable only to low-dimensional problems.

Pure state estimators are passive observers and do not perform actions $a$. However, in the closely related field of active sensing, agents can control their sensors (e.g. rotate a camera) to allow for more efficient estimation of the state.

#### Playing Atari games

The Atari console has recently become a popular testbed for a host of deep learning algorithms. The most common goal in this environment is learning to play (i.e. to select joystick commands) in a way that maximises the game score. The problem was popularised by the seminal reinforcement learning paper [Human-level control through deep reinforcement learning](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf). State-of-the-art deep learning algorithms perform at a superhuman level after roughly a day of exposure to the environment.

In reinforcement learning, the agent interacts with the environment by selecting actions ($a_t$) and tracks the state of the environment ($s_t$) through observations ($o_t$; see the game video below). The agent is also provided with a reward signal ($r_t$), which can be viewed as another kind of observation. The goal is to learn how to act in a given situation so that the observed rewards are maximised.
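
The interaction loop can be sketched as follows. `ToyEnv` is a hypothetical stand-in for a game, invented for this example; it is not an Atari emulator interface.

```python
class ToyEnv:
    """Hypothetical one-dimensional game that rewards stepping to the right."""

    def __init__(self):
        self.position = 0                    # hidden state s_t

    def step(self, action):
        self.position += action              # T_W: action is -1 or +1
        observation = self.position          # P_S: fully observable here
        reward = 1.0 if action == 1 else 0.0
        return observation, reward


def run_episode(env, policy, steps):
    """Let a policy (a mapping o_t -> a_t) act and accumulate its reward."""
    observation, total_reward = 0, 0.0
    for _ in range(steps):
        action = policy(observation)
        observation, reward = env.step(action)
        total_reward += reward
    return total_reward
```

A learning algorithm would adjust `policy` between episodes so that the value returned by `run_episode` increases; that part is omitted here.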

The environment is popular because it lies at the intersection of being sufficiently:
* difficult to be interesting for humans to play
* simple for early deep reinforcement learning to be successful
* accessible so that large amounts of data can be gathered for training 

<img src="images/atari.gif" width="500" align="center">
<i>Figure xxx: A series of observations from Space Invaders on Atari 2600.</i>

Let us map the Atari game world onto the abstraction developed earlier:
* $s_t$ is the memory state of the console engine at time $t$. It is not human-readable, but can be understood by reading the game's code.
* $T_W$ is the code that performs operations on the memory and reads the joystick. The transition function is completely deterministic given the random seed. When the seed is unknown, the transitions appear mildly stochastic for most games (e.g. which alien fires a projectile at a given time), but the stochasticity can be significant for some (in Asteroids the player can be teleported to a random location).
* $P_S$ covers two sensory modalities:
    * the game state is projected onto the screen
        * this is fully deterministic (i.e. a given state always leads to the same image)
        * the state is almost fully observable; still, there is usually some information that cannot be inferred from a single screen, e.g. the random seed or the direction of movement of enemies
    * the score is extracted from the game state; this is fully deterministic, and the signal is used as the reward in reinforcement learning

<img src="images/dqn.png" width="800" align="center">
<i>Figure xxx: A deep convolutional neural network learns a mapping from observations ($o_t$) to actions ($a_t$) that leads to highly rewarded behaviour. The layers between input and output can be viewed as agent's representation of knowledge ($bs_t$) about the aspects of game state ($s_t$) relevant for the problem.</i>

Interestingly, most Atari-playing agents (like [DQN](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf)) do not implement the $T_B$ computation: the time link is left out. Instead, multiple consecutive frames are concatenated and then interpreted by a convolutional neural network, which implements $P^{-1}_B$. This is possible because $s$ is almost fully observable via $o$: multiple frames tell the agent nearly everything there is to know about the state of the game.
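
The frame concatenation can be sketched as a small buffer that always returns the most recent `k` frames (DQN uses the four most recent frames). The class name `FrameStack` and the choice to pad with copies of the first frame are assumptions made for this example.

```python
from collections import deque


class FrameStack:
    """Keep the k most recent frames as a stand-in for the missing time link."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def push(self, frame):
        if not self.frames:
            # At the start of an episode, pad with copies of the first frame
            self.frames.extend([frame] * self.k)
        else:
            # The deque silently drops the oldest frame once it is full
            self.frames.append(frame)
        return tuple(self.frames)
```

The tuple returned by `push` is what the network receives as input. Quantities that cannot be read from a single frame, such as the direction in which an enemy moves, become visible as differences between stacked frames.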

In this setting, the agent constructs its knowledge representation on its own. The representation is shaped by stochastic gradient descent with the goal of finding a good mapping between the current observation and the action that maximises expected reward. Under such incentives, one can expect that parts of the console state that are not relevant to this problem will not be represented by the agent (which is reasonable and efficient for the task at hand).

#### Predicting Atari games

The Atari testbed is also used for testing the predictive powers of neural networks. In this setting, the agent is tasked with predicting future percepts (up to hundreds of time steps ahead). Video examples of predictions can be found [here](https://drive.google.com/drive/folders/0B_L2b7VHvBW2SEllTmlEX1l0RGc).

<img src="images/env_sims.png" width="700" align="center">
<i>Figure xxx: Graphical representation of computations in neural architecture for Atari prediction in [Recurrent Environment Simulators](https://arxiv.org/abs/1704.02254). $x_t$ and $\hat{x}_t$ are game frame and its reconstruction at time $t$, $s$ corresponds to belief about game state, $a$ corresponds to joystick action.</i>

The only information supplied to the agent consists of observations and joystick actions (which can be viewed as observations as well). In the process of predicting future percepts, the agent:
1. updates its belief $bs_t$ about the state of the environment $s_t$ using the observation $o_t$ (convolutional network)
2. propagates its belief forward in time to $bs_{t+1}$ (recurrent network)
3. maps the belief $bs_{t+1}$ onto the observation space, producing $\hat{o}_{t+1}$ (deconvolutional network)

The procedure is improved iteratively via gradient descent on the reconstruction error. Once the network has converged to weights that minimise the error in the reconstruction of future percepts, the agent has learnt $T_B$ and $P^{-1}_B$. Consider the modified diagram in Figure xxx, which matches this task better.
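
The update, propagate and reconstruct steps can be sketched as a loop in which the three learned networks are replaced by hand-written toy functions. All function names, and the toy belief consisting of a position and a velocity, are assumptions made for illustration. An observation of `None` marks a time step at which the agent receives no percept and has to rely on its own belief.

```python
def predict_rollout(belief, observations, update, propagate, decode):
    """Produce a one-step-ahead reconstruction for every time step."""
    predictions = []
    for o in observations:
        if o is not None:
            belief = update(belief, o)       # P_B^{-1} (convolutional network)
        belief = propagate(belief)           # T_B (recurrent network)
        predictions.append(decode(belief))   # reconstruction (deconvolutional network)
    return predictions


# Toy stand-ins: the belief is a (position, velocity) pair
def toy_update(belief, o):
    pos, vel = belief
    # Attribute the prediction error to a wrong velocity estimate
    return (o, vel + (o - pos))


def toy_propagate(belief):
    pos, vel = belief
    return (pos + vel, vel)


def toy_decode(belief):
    return belief[0]
```

Feeding positions of an object moving at constant speed and then withholding the observations, e.g. `predict_rollout((0, 0), [0, 1, 2, None, None, None], toy_update, toy_propagate, toy_decode)`, lets the toy agent keep extrapolating from its belief alone. In the learned setting, the error between each prediction and the observation that arrives next is what drives gradient descent.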

<img src="images/special_problem.png" width="700" align="center">
<i>Figure xxx: In this task the agent produces reconstructions of observations $\hat{o}_t$. Additionally, the access to observations may be restricted for many time steps, so that the agent learns to propagate information forward in time.</i>


### Axes of variation

### Is predicting worth it?

## Learning to represent and track world state

### Simulated environment

### Baseline state estimator -- Sequential Monte Carlo

### Predictive Autoencoder