Idea: memorize the $Q$ function in the style of [Model-Free Episodic Control](https://arxiv.org/abs/1606.04460).

# CMT API

From the paper we have:

1. $(u, z) \leftarrow \text{Query}(x)$ where $z = \{ (x_n, \omega_n) \}$ is an ordered set of retrieved key-value pairs.
1. $\text{Update}(x, (x_n, \omega_n), r, u)$ provides feedback reward $r$ for retrieval of $(x_n, \omega_n)$ for query $x$.
   1. Must be compatible with self-consistency; otherwise supervised and unsupervised updates conflict.
1. $\text{Insert}(x, \omega)$ creates a new memory.
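
To make the three calls concrete, here is a minimal sketch of a stand-in with the same interface. `ToyCMT` is a hypothetical name, not the paper's data structure: it scans a flat list instead of routing through a tree, treats $u$ as an opaque token that `Query` returns and `Update` receives back, and only logs feedback instead of learning from it.

```python
from dataclasses import dataclass, field
from typing import Any, Hashable, List, Tuple

Memory = Tuple[Hashable, Any]  # (key x_n, value omega_n)


@dataclass
class ToyCMT:
    """Hypothetical stand-in for a CMT: a flat list scanned on every query."""
    memories: List[Memory] = field(default_factory=list)
    feedback: List[Tuple[Hashable, Memory, float]] = field(default_factory=list)
    k: int = 1  # number of memories returned per query (assumption)

    def Query(self, key):
        # Rank stored memories: exact key matches first, then most recent first.
        ranked = sorted(
            enumerate(self.memories),
            key=lambda im: (im[1][0] != key, -im[0]),
        )
        z = [m for _, m in ranked[: self.k]]
        u = None  # opaque token; this toy has nothing to put in it
        return u, z

    def Update(self, key, retrieved, feedbackreward, u):
        # A learned memory would adjust its retrieval here; the toy only logs.
        self.feedback.append((key, retrieved, feedbackreward))

    def Insert(self, key, value):
        self.memories.append((key, value))
```

With this stand-in, `Query` on an empty memory returns `(None, [])`, which is the case the pseudocode has to handle before anything has been inserted.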

# Memorized Q pseudocode -- Attempt 2

Basic idea:
* Estimate the value of $a$ in context $x$ by the stored value associated with the first memory retrieved when the CMT is queried with $(x, a)$.
* Play $\epsilon$-greedy, where the greedy action is the one with the maximum estimated value.
* Play action $a$ in context $x$ and observe reward $r$.
* For each action $a'$, update the memory retrieved with query $(x, a')$ $\ldots$

# Memorized Q pseudocode

Basic idea:
* Estimate the value of $a$ in context $x$ by the stored value associated with the first memory retrieved when the CMT is queried with $(x, a)$.
* Play $\epsilon$-greedy, where the greedy action is the one with the maximum estimated value.
* After playing action $a$ in context $x$ and observing reward $r$:
   * Update the memory retrieved with query $(x, a)$ using feedback reward of $r$.
   * Store memory with key $(x, a)$ and value $r$.

In [5]:
def MemorizedQ():
    mem = CMT()
    env = Environment()  # distribution over (x, r) pairs.

    Actions = set(...)   # fixed set of actions for now
    epsilon = ...        # epsilon-greedy exploration

    while True:
        x = env.Observe()
        # For each action, keep the first memory retrieved for the key (x, a), if any.
        querySet = { a: (u, ((xprime, aprime), rprime))
                     for a in Actions
                     for (u, z) in [ mem.Query(key=(x, a)) ]
                     if len(z) > 0
                     for ((xprime, aprime), rprime) in [ z[0] ]
                   }
        if len(querySet) > 0:
            greedy, _ = max(querySet.items(), key=lambda kv: kv[1][1][1]) # action with largest first retrieved reward
        else:
            greedy = next(iter(Actions))                                   # if nothing was retrieved, play an arbitrary action

        pa = (1 - epsilon) * IndicatorDistribution(greedy) + epsilon * UniformDistribution(Actions)
        a = pa.sample()
        r = env.ObserveReward(a)

        if a in querySet:
            # question: what's the feedback reward?
            # question: do we only do this when we take the greedy action?

            u, ((xprime, aprime), rprime) = querySet[a]
            mem.Update(key=(x, a), retrieved=((xprime, aprime), rprime), feedbackreward=None, u=u)

        mem.Insert(key=(x, a), value=r)
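
For experimenting with the loop, here is a runnable sketch under several assumptions that are not in the notes above. `ExactMatchMemory` is a hypothetical stand-in that retrieves only exact key matches. The environment is replaced by a list of contexts and a `reward_fn`. The feedback reward is set to the observed $r$, following the bullet list rather than resolving the open question in the code comments.

```python
import random


class ExactMatchMemory:
    """Hypothetical CMT stand-in: retrieves only a memory whose key equals the query."""

    def __init__(self):
        self.store = {}
        self.updates = []

    def Query(self, key):
        z = [(key, self.store[key])] if key in self.store else []
        return None, z  # u is an unused placeholder in this toy

    def Update(self, key, retrieved, feedbackreward, u):
        self.updates.append((key, retrieved, feedbackreward))  # log only

    def Insert(self, key, value):
        self.store[key] = value  # the latest reward overwrites the old one


def memorized_q(mem, contexts, reward_fn, actions, epsilon, steps, rng):
    """Run the loop for a fixed number of steps and return the average reward."""
    total = 0.0
    for _ in range(steps):
        x = rng.choice(contexts)

        # First retrieved memory per action, skipping actions with no retrieval.
        estimates = {}
        for a in actions:
            u, z = mem.Query(key=(x, a))
            if z:
                estimates[a] = (u, z[0])

        if estimates:
            greedy = max(estimates, key=lambda a: estimates[a][1][1])
        else:
            greedy = actions[0]  # nothing retrieved: fall back to the first action

        # Epsilon-greedy: uniform over all actions with probability epsilon.
        a = rng.choice(actions) if rng.random() < epsilon else greedy
        r = reward_fn(x, a)

        if a in estimates:
            u, retrieved = estimates[a]
            mem.Update(key=(x, a), retrieved=retrieved, feedbackreward=r, u=u)
        mem.Insert(key=(x, a), value=r)
        total += r
    return total / steps


mem = ExactMatchMemory()
avg = memorized_q(
    mem,
    contexts=[0, 1],
    reward_fn=lambda x, a: 1.0 if a == x else 0.0,  # noiseless toy reward
    actions=[0, 1],
    epsilon=0.1,
    steps=2000,
    rng=random.Random(0),
)
```

Because retrieval here is exact-match and the reward is noiseless, every logged update pairs a stored value with an identical feedback reward. The self-consistency concern only shows up once a query can retrieve a memory stored under a different key.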

### Is this compatible with self-consistency?

I'm not sure. Suppose there is no reward variance, so we are dealing only with the partial feedback issue.
* Any memory retrieved when querying on $(x, a)$ will be updated with feedback reward $r$.
* Conditional on calling `Update()`, feedback reward is constant.
* Except that some retrieved memories will "win the argmax" and some will lose, changing frequency of `Update()`.
* Consider the memory retrieved by `Query(key=(x, a))`.
   * Possibly, the inserted $((x, a), r)$ pair will lose the argmax after additional inserts.
   * This could be appropriate, as another action $a'$ might be better in a neighborhood of $x$ but has not been observed yet.
   * However, retrieving $((x'', a''), r'')$ with $r'' > r$ would win the argmax and receive reward $r$.
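
The frequency effect can be made concrete with a small sketch. It assumes the first retrieved memory for each action is held fixed for one round and that play is $\epsilon$-greedy over `actions`. `update_probabilities` is a hypothetical helper, and `estimates` maps each action that retrieved a memory to that memory's stored value.

```python
def update_probabilities(estimates, actions, epsilon):
    """Probability that Update() is called on the memory retrieved for each action.

    Under epsilon-greedy play, an action is chosen with probability
    epsilon / |actions|, plus (1 - epsilon) if it is the greedy action.
    Only actions that retrieved a memory (keys of `estimates`) can be updated.
    """
    greedy = max(estimates, key=estimates.get)
    n = len(actions)
    return {
        a: (1 - epsilon) * (a == greedy) + epsilon / n
        for a in estimates
    }


# Two actions retrieved a memory; the third retrieved nothing.
probs = update_probabilities({"a": 0.2, "b": 0.9}, ["a", "b", "c"], epsilon=0.3)
```

In this example the memory retrieved for `"b"` wins the argmax and is updated with probability $0.7 + 0.1 = 0.8$, while the one retrieved for `"a"` is updated with probability $0.1$, even though the feedback reward is the same constant whenever `Update()` is called.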
   
**Idea**: this could be self-consistent if we update all the actions.