<a href="https://colab.research.google.com/github/jarrydmartinx/deep-rl/blob/master/unifying_count_based_exploration_intrinsic_motivation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unifying Count-Based Exploration and Intrinsic Motivation (Bellemare et al., 2016)

@(Exploration)[Papers, Bellemare] 
## Overview
### Problem
Generalising uncertainty across states

### Context 
Non-tabular RL (with function approximation)

### Approaches
*   Use (visit-)density models to measure uncertainty
*   A method for deriving a pseudo-count from any density model. These pseudo-counts are best thought of as "function approximation for exploration"
*   Allows generalisation of count-based exploration algorithms to the non-tabular case
*   Achieves sensible pseudo-counts from raw pixels
*   Transforms pseudo-counts into exploration bonuses
*   Shows that intrinsic motivation and count-based exploration are two sides of the same coin

## Notes
### I Introduction
* The uncertainty in question is over the reward and transition functions
	* In the tabular setting, it can be quantified using counts and Chernoff bounds, or inferred from a posterior over the environment parameters.
* Both the confidence intervals and the posterior shrink as the inverse square root of the state-action visit count $N(x,a)$
* This is fundamental to most theoretical results on exploration
#### MBIE-EB
* Adds exploration bonus of $N(x,a)^{-\frac{1}{2}}$
* Accounts for uncertainties in both T and R
* Enables a finite time bound on the agent's suboptimality
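
The inverse-square-root bonus above can be sketched for the tabular case as follows. This is only an illustration: the class name `CountBonus`, the scale `beta`, and the small constant `eps` (which keeps the bonus finite for unvisited pairs) are assumptions of this sketch, not part of the notes.

```python
import math
from collections import defaultdict


class CountBonus:
    """Tabular visit counts with an inverse-square-root exploration bonus."""

    def __init__(self, beta=0.05, eps=0.01):
        self.beta = beta   # bonus scale (hypothetical default)
        self.eps = eps     # keeps the bonus finite when N(x, a) = 0
        self.counts = defaultdict(int)

    def update(self, state, action):
        """Record one visit to the (state, action) pair."""
        self.counts[(state, action)] += 1

    def bonus(self, state, action):
        """Return beta * (N(x, a) + eps)^(-1/2)."""
        n = self.counts[(state, action)]
        return self.beta / math.sqrt(n + self.eps)


def augmented_reward(reward, bonus):
    """Reward the agent learns from: environment reward plus bonus."""
    return reward + bonus
```

The bonus shrinks as a pair is revisited, so the agent is pushed toward state-action pairs it has tried least.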
#### Large Domains
* States are rarely visited more than once $\rightarrow$ raw counts are useless
#### Intrinsic Motivation
* Provides qualitative guidance for exploration
* Explore what surprises you, what's new, get curious
	* *Prediction error*, or 
	* *Learning Progress* ($\Delta$ pred. err.): if $e_t(A)$ is the error made by the agent at time $t$ over some event $A$, and $e_{t+1}(A)$ the same error after observing a new piece of information, then learning progress is: $$e_t(A) - e_{t+1}(A)$$
	* *Compression Progress*
	* *Information Gain*
* IM methods are attractive because they (seem to) remain applicable even in the absence of the Markov Property (and in absence of tabular representation), both of which are required for count-based algorithms
* Yet the theoretical foundations of IM remain basically absent

### II Notation and Concepts
* A *density model* is any model that assumes the states are independently distributed (but not necessarily identically distributed)


### III From Densities to Pseudo-counts
* To do IM exploration, we need to be able to answer the question: "How novel is this (possibly unseen) state"? 
	* Empirical counts can't do it.
	* Nor is the problem solved by a Bayesian approach: even variable alphabet models (e.g. Hutter et al., 2013), can only assign a small diminishing probability to unseen states.
* Approach: Extract pseudo-counts from a simple density model and use them within a variant of MBIE-EB
* Density model used: a simplified, pixel-level version of the CTS model for Atari 2600 frames (Bellemare et al., 2014). It is impoverished compared to state-of-the-art density models for images but, being count-based, it is very fast.

#### Pseudo-Counts and the Recoding Probability
Suppose we have a density model $\rho$ over $\mathcal{X}$. 
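* The model may be approximate, biased, or even inconsistent.

Define the *recoding probability* of a state $x$ as the probability the model assigns to $x$ after observing one more occurrence of it: $$\rho'_n(x) := \rho(x ; x_{1:n}x)$$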
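
As a minimal sketch of how a pseudo-count can be read off a density model: write $\rho_n(x)$ for the probability the model assigns to $x$ before observing it, and $\rho'_n(x)$ for the recoding probability (the probability assigned after one additional observation of $x$). The paper's pseudo-count is then $\hat N_n(x) = \frac{\rho_n(x)\,(1 - \rho'_n(x))}{\rho'_n(x) - \rho_n(x)}$. The names `EmpiricalDensity` and `pseudo_count` are hypothetical. The empirical model is used here instead of CTS only because its pseudo-count should recover the true visit count, which makes the formula easy to check.

```python
from collections import Counter


class EmpiricalDensity:
    """Empirical distribution rho_n(x) = N_n(x) / n over hashable states."""

    def __init__(self):
        self.counts = Counter()
        self.n = 0

    def prob(self, x):
        """Probability assigned to x given the observations so far."""
        return self.counts[x] / self.n if self.n else 0.0

    def update(self, x):
        """Observe one occurrence of x."""
        self.counts[x] += 1
        self.n += 1


def pseudo_count(model, x):
    """Observe x once and return its pseudo-count before that observation.

    Works with any object exposing prob(x) and update(x).
    """
    rho = model.prob(x)          # rho_n(x): probability before seeing x
    model.update(x)
    rho_prime = model.prob(x)    # rho'_n(x): recoding probability
    if rho_prime <= rho:
        # The formula is undefined unless the model's probability of x
        # increases after observing x; treating x as fully known in that
        # case is an assumption of this sketch.
        return float("inf")
    return rho * (1.0 - rho_prime) / (rho_prime - rho)
```

Because `pseudo_count` only relies on `prob` and `update`, a different density model with the same two methods could be swapped in without changing the helper.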

### IV The Connection to Intrinsic Motivation
* Pseudo-counts are closely related to Information Gain
* IG is commonly used to quantify uncertainty or novelty
* IG is defined w.r.t. a *mixture model* $\xi$ over a class of density models $\mathcal{M}$ 
	* The model makes predictions according to a weighted combination from $\mathcal{M}$, with $w_n(\rho)$ the posterior weight on $\rho$. $$ \xi_n(x) := \xi(x ; x_{1:n}) := \int_{\rho\in\mathcal{M}} w_n(\rho)\rho(x ; x_{1:n})\ d\rho$$
	* The posterior weights are defined recursively, starting from a prior distribution $$w_{n+1}(\rho) := w_n(\rho, x_{n+1})\qquad \qquad w_n(\rho, x):= \frac{w_n(\rho)\rho(x ; x_{1:n})}{\xi_n(x)}$$
	* Information gain is then the KL divergence from prior to posterior that results from observing $x$. $$IG_n(x):= IG(x ; x_{1:n}) := KL(w_n(\cdot , x)\ \Vert\ w_n )$$

* Note that this means that something is really only novel with respect to your model, or with respect to your predictions. If it violates your predictions in some broad sense, or changes the way you would predict, then it is novel.
* Note also the point from Algorithmic Information Theory that an incompressible string is not interesting. Or is this a failing of AIT? **TODO**
* Or rather Schmidhuber's notion that if it doesn't improve your ability to predict, it's not interesting.
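
The mixture prediction, posterior-weight update, and information gain defined above can be sketched for a finite model class, where the integral over $\mathcal{M}$ becomes a sum. Restricting to a finite class and the function names below are assumptions of this sketch. `probs[i]` stands for $\rho_i(x ; x_{1:n})$ and `weights[i]` for $w_n(\rho_i)$.

```python
import math


def mixture_prob(weights, probs):
    """xi_n(x): posterior-weighted average of the models' predictions."""
    return sum(w * p for w, p in zip(weights, probs))


def posterior_weights(weights, probs):
    """w_n(rho, x): Bayes update of the weights after observing x."""
    xi = mixture_prob(weights, probs)  # must be > 0 for the update to exist
    return [w * p / xi for w, p in zip(weights, probs)]


def information_gain(weights, probs):
    """IG_n(x) = KL(w_n(., x) || w_n), in nats."""
    post = posterior_weights(weights, probs)
    # Terms with zero posterior mass contribute nothing to the KL sum.
    return sum(q * math.log(q / w) for q, w in zip(post, weights) if q > 0)


# Two models that disagree about x: observing x shifts the weights.
prior = [0.5, 0.5]
ig_surprising = information_gain(prior, [0.9, 0.1])
# Two models that agree about x: the weights do not move.
ig_boring = information_gain(prior, [0.3, 0.3])
```

In the second case the observation tells the agent nothing about which model is right, which matches the note above that novelty is always relative to the model's predictions.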


### Experiments
* Experiments use Atari 2600 games from the ALE, focusing on games where myopic exploration fails. Large improvement over the then state of the art on Montezuma's Revenge.
* The pseudo-count bonuses are applied in both an experience-replay setting and an actor-critic setting, improving performance in both cases.



