# CP in an Observational Setting, ver. 1.1

## Changelog
### Ver 1.1
1. The representation network is split into a convolutional encoder and a recurrent part. This addresses an issue where the recurrent part copies elements of the previous state into the new state and the conscious network then chooses them as prediction targets. See [this comment](https://theconsciousnessprior.slack.com/archives/C87HW5Q9X/p1517188030000148) for details.
2. An $R^2$-like approximation to the mutual information loss is proposed.

### Ver 1.0
Initial version: a formalization of a diagram posted by William Fedus.

## Notation
1. Denote the set $\{1, \ldots, m\}$ by $\mathbb N_m$ for any natural $m$.
2. For a time-dependent vector $y_t \in \mathbb R^m$, denote its $k$-th component by $y_t[k]$. For $K = (k_1, \ldots, k_n) \in \mathbb N_m^n$ (a vector of indexes), denote the vector of the corresponding values $(y_t[k_1],\ldots, y_t[k_n]) \in \mathbb R^n$ by $y_t[K]$.
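
The indexing notation can be illustrated with a small sketch. The helper name `select` and the sample values are hypothetical; the only point is that indexes in $\mathbb N_m$ are 1-based, while Python lists are 0-based.

```python
def select(y, K):
    """Return y[K] = (y[k_1], ..., y[k_n]) for a vector K of 1-based indexes."""
    m = len(y)
    for k in K:
        if not 1 <= k <= m:
            raise IndexError(f"index {k} is outside N_{m}")
    # Python lists are 0-based, so shift every index by one.
    return [y[k - 1] for k in K]


y_t = [0.5, -1.0, 2.0, 3.5]  # y_t in R^4
K = (3, 1, 3)                # K in N_4^3; a vector of indexes may repeat entries
selected = select(y_t, K)    # the vector y_t[K] in R^3
```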

## Architecture
### Variables
**Input**: a sequence of multidimensional vectors $x_t$, where each $x_t \in \mathbb R^N$ and $t\in \mathbb N$ (or $\mathbb Z$). $N$ may be quite large (e.g. the pixels of a video frame).

**Encoded state** at the moment $t$ is $e_t \in \mathbb R^u$, where $u$ can be quite large.

**Representation state** at the moment $t$ is $h_t \in \mathbb R^r$, where $r$ is the dimension of the representation space (it may be quite large).

**Conscious state** at the moment $t$ is $c_t=(B_t, b_t, A_t)$[^1], where $B_t\in \mathbb N_r^s$ is a vector of indexes of $h_t$ we are interested in at a particular moment, $b_t\in \mathbb R^s$ is the vector of corresponding values, i.e. $b_t=h_t[B_t]$, and $A_t=(A_{1, t}, \ldots, A_{p, t})\in \mathbb N_u^p$ is a vector of indexes of $e_{t+1}$ we are going to predict.

[^1]: This notation is a bit different from William's but consistent with Bengio's paper: $c_t$ is the full conscious state, including $A_t$. In William's diagram $c_t=(B_t, b_t)$.

### Networks
**Encoder network** $E$ is a convolutional encoder that has access to a fixed number $d+1$ of the most recent frames:

$$e_t = E(x_t, x_{t-1}, \ldots, x_{t-d}).$$

For example, with $d=0$ this is a convolutional embedding of an image that has access only to the current frame.

**Recurrent representation network** $R$ is a function (represented by an RNN):
$$
h_t=R(e_t, h_{t-1}).
$$

The composition of the encoder network and the recurrent representation network is called simply the **representation network**. See [this comment](https://theconsciousnessprior.slack.com/archives/C87HW5Q9X/p1517188030000148) for details on why the encoder and the recurrent representation network are split.

**Conscious network** is a function (represented by another RNN):
$$
c_{t} = C(h_t, c_{t-1}, z_t),
$$
where $z_t$ is a random noise source. Actually, $C$ only has to define the $B_t$ and $A_t$ parts of $c_t$, since $b_t$ is determined by the relation $b_t=h_t[B_t]$. So the codomain of $C$ is $\mathbb N_r^s\times\mathbb N_u^p$.

**Generator network** is a function
$$
  \widehat{a}_t = G(c_t).
$$
The objective of the generator is to predict the value of (some part of) the encoded state vector $e_{t+1}$ in the next moment at indexes $A_t$. 
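
To make the data flow between the four networks concrete, here is a minimal sketch of a single time step. Every function body is a hypothetical stand-in (elementwise averaging, a fixed 50/50 mix, random index selection, a mean predictor) and not a proposed design; only the signatures follow the definitions above. For simplicity the toy assumes $N = u = r$.

```python
import random


def encoder(frames):
    # Stand-in for E: elementwise average of the d+1 available frames.
    n = len(frames)
    return [sum(vals) / n for vals in zip(*frames)]


def recurrent(e_t, h_prev):
    # Stand-in for R: fixed mix of the new encoding and the previous state.
    return [0.5 * e + 0.5 * h for e, h in zip(e_t, h_prev)]


def conscious(h_t, c_prev, z_t, s, p, u):
    # Stand-in for C: picks only the 1-based index vectors B_t and A_t,
    # here uniformly at random from the noise source z_t.
    # This toy ignores c_prev; b_t is fixed by the relation b_t = h_t[B_t].
    r = len(h_t)
    B_t = [z_t.randint(1, r) for _ in range(s)]
    A_t = [z_t.randint(1, u) for _ in range(p)]
    b_t = [h_t[k - 1] for k in B_t]
    return (B_t, b_t, A_t)


def generator(c_t):
    # Stand-in for G: predicts the mean of b_t for every index in A_t.
    B_t, b_t, A_t = c_t
    guess = sum(b_t) / len(b_t)
    return [guess for _ in A_t]


def step(frames, h_prev, c_prev, z_t, s=2, p=1):
    e_t = encoder(frames)                                  # e_t = E(x_t, ..., x_{t-d})
    h_t = recurrent(e_t, h_prev)                           # h_t = R(e_t, h_{t-1})
    c_t = conscious(h_t, c_prev, z_t, s, p, u=len(e_t))    # c_t = C(h_t, c_{t-1}, z_t)
    a_hat_t = generator(c_t)                               # prediction of e_{t+1}[A_t]
    return e_t, h_t, c_t, a_hat_t


rng = random.Random(0)
frames = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]  # x_t and x_{t-1}, so d = 1 and N = 3
e_t, h_t, c_t, a_hat_t = step(frames, h_prev=[0.0, 0.0, 0.0], c_prev=None, z_t=rng)
```

Note that `a_hat_t` has one entry per index in $A_t$, so its entries are aligned with the positions of $A_t$ rather than with the components of $e_{t+1}$ directly.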

### Loss
The main objective of the conscious network is to select features that can be used to predict (parts of) the future encoded state. However, this objective can be satisfied trivially: the representation network can ignore $x_t$ and produce constant values (e.g. $h_t=0$ for all $t$), which will be perfectly predictable from their own previous values. To avoid this failure mode, we want to maximize the mutual information between $b_t = h_{t}[B_t]$ and $e_{t+1}[A_t]$. (To consider mutual information, we need random variables; to make $b_t$ and $e_{t+1}[A_t]$ random variables we just need to pick a random $t$, which defines the probability space we are working on.)

Estimation of mutual information is a non-trivial problem that is discussed in the literature (see e.g. [here](https://arxiv.org/abs/1801.04062)).

However, to begin, we can consider a simplified version of the mutual information objective that uses variance instead of entropy. This objective is a variant of $R^2$ maximization.

For every index $i \in \mathbb N_u$, denote by $T_i$ the set of all time moments $t$ such that $i \in A_{t-1}$ (i.e. on step $t-1$ the conscious network selected $i$ as one of the elements we are going to predict).

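Now let us define $RSS_i$ (the residual sum of squares), which measures our error in predicting the encoded state at index $i$ given the previous conscious state. Since $t \in T_i$ means that index $i$ was selected at step $t-1$, the prediction $\widehat{a}_{t-1}$ is compared with $e_t$:

$$
RSS_i=\sum_{t \in T_i}(\widehat{a}_{t-1}[i]-e_{t}[i])^2,
$$

where $\widehat{a}_{t-1}[i]$ denotes the component of $\widehat{a}_{t-1}$ that predicts index $i$.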

We also define $TSS_i$ (the total sum of squares) over the same time moments:

$$\overline{e}[i]=\frac{1}{|T_i|}\sum_{t\in T_i} e_t[i]$$

$$TSS_i=\sum_{t \in T_i} (e_t[i]-\overline{e}[i])^2$$

Now our objective is

$$\tag{1}
\sum_{i=1}^u \log RSS_i - \sum_{i=1}^u \log TSS_i \to \min$$
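
Objective (1) can be computed from recorded trajectories roughly as follows. This is a sketch under several assumptions that are not part of the text above: the function name `r2_like_objective` and its data layout are hypothetical, the prediction made at step $t-1$ is paired with the target $e_t[i]$ for $t \in T_i$, indexes with fewer than two targets are skipped, and a small `eps` is added inside the logarithms to avoid $\log 0$.

```python
import math


def r2_like_objective(e, A, a_hat, eps=1e-12):
    """Sum over i of log RSS_i - log TSS_i.

    e[t]     : encoded state e_t, a list of length u
    A[t]     : 1-based indexes selected at step t (targets live in e[t+1])
    a_hat[t] : predictions made at step t, aligned position by position with A[t]
    """
    u = len(e[0])
    total = 0.0
    for i in range(1, u + 1):
        # T_i: all t such that index i was selected at step t-1.
        pairs = []
        for t in range(1, len(e)):
            if i in A[t - 1]:
                pos = A[t - 1].index(i)
                pairs.append((a_hat[t - 1][pos], e[t][i - 1]))
        if len(pairs) < 2:
            continue  # assumption: TSS_i is not informative for fewer than two targets
        targets = [y for _, y in pairs]
        mean = sum(targets) / len(targets)
        rss = sum((pred - y) ** 2 for pred, y in pairs)
        tss = sum((y - mean) ** 2 for y in targets)
        total += math.log(rss + eps) - math.log(tss + eps)
    return total


# Toy data: u = 2, four time steps, index 1 is selected at steps 0, 1 and 2.
e = [[0.0, 1.0], [1.0, 1.5], [2.0, 0.5], [3.0, 2.0]]
A = [[1], [1], [1]]
a_hat = [[1.5], [2.5], [2.5]]
value = r2_like_objective(e, A, a_hat)

# Rescaling states and predictions by the same factor leaves the objective unchanged.
e10 = [[10 * v for v in row] for row in e]
a10 = [[10 * v for v in row] for row in a_hat]
value_scaled = r2_like_objective(e10, A, a10)
```

In the toy data $RSS_1 = 0.75$ and $TSS_1 = 2$, so the objective equals $\log(RSS_1/TSS_1) = \log(1 - R_1^2)$ with $R_1^2 = 0.625$; index 2 is never selected and contributes nothing.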

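**Remark.** The proposed objective is scale invariant: multiplying the encoded states and the predictions by a constant $\lambda$ multiplies both $RSS_i$ and $TSS_i$ by $\lambda^2$, so the difference of the logarithms does not change. Moreover, it is closely related to the real mutual information. Indeed, for any random variables, $I(X, Y) = H(Y) - H(Y|X)$, and for differential entropy $H(\lambda X) = H(X) + \log |\lambda|$. It means that e.g. for Gaussian $X$, the entropy equals $\log SD(X) = \frac{1}{2}\log Var(X)$ up to an additive constant. The $TSS$ term in (1) is related to the entropy of the predicted components of the encoded state, and the $RSS$ term is related to their entropy conditioned on the conscious state used to predict them, so minimizing (1) corresponds to maximizing this approximation of the mutual information.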
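
The connection can be spelled out as a rough sketch. Assume (this is an additional assumption, not something derived here) that for a fixed index $i$ the target $Y = e_t[i]$ is Gaussian, that the prediction residual is Gaussian and independent of the conditioning variable $X$, and that $TSS_i/|T_i|$ and $RSS_i/|T_i|$ serve as plug-in variance estimates. Then

$$H(Y) = \frac{1}{2}\log\bigl(2\pi e \, Var(Y)\bigr) \approx \frac{1}{2}\log\frac{TSS_i}{|T_i|} + \frac{1}{2}\log(2\pi e),$$

$$H(Y|X) \approx \frac{1}{2}\log\frac{RSS_i}{|T_i|} + \frac{1}{2}\log(2\pi e),$$

$$I(X, Y) = H(Y) - H(Y|X) \approx -\frac{1}{2}\bigl(\log RSS_i - \log TSS_i\bigr).$$

Under these assumptions each summand of (1) is approximately $-2\,I(X, Y)$ for the corresponding index, and since $RSS_i/TSS_i = 1 - R_i^2$, it can also be written as $\log(1 - R_i^2)$.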