# Summary

Deep Meta Reinforcement Learning is about training an agent on a distribution of similar tasks, i.e., interrelated Markov Decision Processes (MDPs), so that it performs well on average across those MDPs.

To do so, [Wang et al., 2016](https://arxiv.org/pdf/1611.05763v1.pdf) proposed an architecture in which a recurrent neural network, a Long Short-Term Memory (LSTM), is trained on the experience gathered by an agent acting in an environment, using an Actor-Critic Reinforcement Learning algorithm (A2C and A3C). They ran a total of 7 experiments to analyze the behaviour of their method in different contexts. Their main purpose was to establish whether meta-RL could:
- Deal with the exploration/exploitation trade-off,
- Understand the abstract structure of a task.
 

### The Two-Step Task

The task we reproduced is a variation of the [two-step task](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3077926/). It is an MDP with 3 states: one first-stage state $S_1$, from which the agent can take one of two actions ($A_1$, $A_2$), each leading to one of the two second-stage states ($S_2$ or $S_3$) through a common or a rare transition (see the figure below). Once in a second-stage state, the agent gets a reward of $1$ with probability $p$ in $S_2$ and $1-p$ in $S_3$, with $p \in \{0.1, 0.9\}$. This reward probability is randomly reset at each episode.

![Two-Step Task](ressources/two_step_task.png)
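
As a rough illustration of the task described above, the following Python sketch simulates single trials. It is not the environment used in our experiments: the class name `TwoStepTask`, its method names, and the choice that $A_1$ commonly leads to $S_2$ (and $A_2$ to $S_3$) are assumptions made for the example. The 75%/25% split is the common/rare transition probability of the task.

```python
import random

COMMON_PROB = 0.75  # probability of the common transition


class TwoStepTask:
    """Hypothetical minimal two-step task; names are illustrative only."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # The reward probability p is redrawn at the start of every episode.
        self.p = self.rng.choice([0.1, 0.9])
        return "S1"

    def step(self, action):
        # Assumption: A1 (action 0) commonly leads to S2, A2 (action 1) to S3.
        common_state, rare_state = ("S2", "S3") if action == 0 else ("S3", "S2")
        is_common = self.rng.random() < COMMON_PROB
        state = common_state if is_common else rare_state
        # Reward of 1 with probability p in S2 and 1 - p in S3.
        reward_prob = self.p if state == "S2" else 1.0 - self.p
        reward = 1 if self.rng.random() < reward_prob else 0
        return state, reward, is_common
```

Calling `reset()` at the start of each episode redraws $p$, so the agent has to work out again which second-stage state is the rewarding one.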

The two-step task was chosen because it reveals which kind of strategy an RL agent follows: it dissociates model-based algorithms from model-free ones.

#### Model-based Vs Model-free

When an agent follows a model-based algorithm, it learns an internal model of the environment it is in (transition and reward functions) and relies on it when planning its next move. Model-free algorithms, on the other hand, do not store any information about the model; they map values to actions in states through local updates, relying on the prediction error (i.e., the difference between the predicted value for a state and the actual outcome). Model-based behaviour is said to be goal-oriented, while model-free behaviour is closer to a habit.


The two-step task makes it easier to distinguish between these two methods because an agent following a model-based strategy takes the transition probabilities of the actions into account, whereas an agent following a model-free algorithm focuses on the reward only. For instance, if a model-free RL agent was rewarded on its last trial after performing $A_1$, it will repeat $A_1$ on the current trial regardless of whether the transition it went through was common ($75\%$) or rare ($25\%$). A model-based RL agent, on the other hand, will repeat its last action only if the transition was common.
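
The rare-but-rewarded case makes this contrast concrete. The sketch below is a simplified illustration, not the algorithm of either paper: `model_free_update` and `model_based_values` are hypothetical helpers, the learning rate `alpha` and the initial values of 0.5 are arbitrary, and $A_1$ is assumed to lead commonly to $S_2$.

```python
COMMON_PROB = 0.75  # probability of the common transition


def model_free_update(q, action, reward, alpha=0.5):
    """Move only the chosen action's value toward the reward (prediction error)."""
    q = list(q)
    q[action] += alpha * (reward - q[action])
    return q


def model_based_values(state_values, common_prob=COMMON_PROB):
    """Plan through the transition model: Q(a) = sum_s P(s | a) * V(s)."""
    v_s2, v_s3 = state_values
    q_a1 = common_prob * v_s2 + (1 - common_prob) * v_s3
    q_a2 = (1 - common_prob) * v_s2 + common_prob * v_s3
    return [q_a1, q_a2]


# Trial: the agent chose A1 (index 0), took the rare transition to S3, and was rewarded.
q_mf = model_free_update([0.5, 0.5], action=0, reward=1)
q_mb = model_based_values(state_values=(0.5, 0.75))  # V(S3) raised by the reward
print(q_mf)  # [0.75, 0.5]      -> model-free prefers A1 again (stay)
print(q_mb)  # [0.5625, 0.6875] -> model-based prefers A2 (switch)
```

The model-free agent credits the action it took, whereas the model-based agent credits the state it reached and then asks which action most often leads there.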

A common method used to evaluate this behaviour is the stay probability.


#### The stay probability

The stay probability is the probability of repeating the action performed in the previous trial, conditioned on whether the previous transition was common (the most likely transition for the chosen action) or rare, and on whether that trial was rewarded. An agent following a model-free algorithm repeats its last action whenever it was rewarded, regardless of the type of transition (common or rare), so the plot shows two equally high bars for rewarded trials. A model-based agent, on the other hand, repeats a rewarded action more often when the transition was common, so among rewarded trials the bar for common transitions is higher than the bar for rare ones.
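
A minimal sketch of how stay probabilities can be estimated from a logged episode is given below. The function name `stay_probabilities` and the list-based input format are assumptions for illustration and do not come from the code we ran.

```python
def stay_probabilities(actions, rewards, commons):
    """Estimate P(stay) for each (rewarded, common) condition of the previous trial.

    actions[i], rewards[i] and commons[i] describe trial i of one episode:
    the first-stage action, the reward (0 or 1), and whether the transition
    was common.
    """
    counts = {}  # (rewarded, common) -> [number of stays, number of trials]
    for prev in range(len(actions) - 1):
        key = (bool(rewards[prev]), bool(commons[prev]))
        stays_total = counts.setdefault(key, [0, 0])
        stays_total[0] += int(actions[prev + 1] == actions[prev])
        stays_total[1] += 1
    return {key: stays / total for key, (stays, total) in counts.items()}


probs = stay_probabilities(
    actions=[0, 0, 1, 1, 0],
    rewards=[1, 0, 1, 1, 0],
    commons=[1, 1, 0, 1, 1],
)
# probs[(True, True)]  == 0.5  (rewarded, common: stayed once out of twice)
# probs[(False, True)] == 0.0  (unrewarded, common: switched)
# probs[(True, False)] == 1.0  (rewarded, rare: stayed)
```

Conditions that never occur in the log are simply absent from the returned dictionary, so a real analysis would aggregate over many episodes before plotting the four bars.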


![Model-based Vs Model-free plot](ressources/stayprob_plots.png)
 <center>Canonical pattern of behavior for model-free (left) and model-based (right) learning from [here](https://www.biorxiv.org/content/biorxiv/early/2018/04/13/295964.full.pdf) </center>


### Tests

#### Network

The authors chose the Advantage Actor-Critic (A2C) algorithm, in which a recurrent neural network is trained on the experience gathered by multiple workers learning on their own. The network takes 4 inputs:
- $t$ : current timestep, with $t_{max} = 200$.
- $x_t$ : observation at timestep $t$.
- $r_{t-1}$ : last reward.
- $a_{t-1}$ : last action.

and returns two outputs: 
- $V$ is the value of the current state,
- $\pi$ is the softmax policy distribution.

![Architecture of the network](ressources/architecture.png)
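
One plausible way to assemble the four inputs into a single vector is sketched below. The one-hot encodings, the ordering of the components, the unnormalized timestep, and the helper names are all assumptions for illustration; the papers and the implementation we used may encode the inputs differently.

```python
N_STATES = 3   # S1, S2, S3
N_ACTIONS = 2  # A1, A2


def one_hot(index, size):
    """Return a one-hot list; an index of None gives an all-zero vector."""
    vec = [0.0] * size
    if index is not None:
        vec[index] = 1.0
    return vec


def encode_input(t, state, prev_reward, prev_action):
    """Concatenate [one-hot x_t, one-hot a_{t-1}, r_{t-1}, t] into one list."""
    return (
        one_hot(state, N_STATES)
        + one_hot(prev_action, N_ACTIONS)
        + [float(prev_reward), float(t)]
    )


# Timestep 3, observing S2 (index 1), after taking A1 (index 0) and being rewarded.
print(encode_input(t=3, state=1, prev_reward=1, prev_action=0))
# [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 3.0]
```

Feeding the previous action and reward back into the recurrent network is what lets its hidden state accumulate information about the current task over the course of an episode.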

#### Training process

We used the Asynchronous Advantage Actor-Critic (A3C) implemented by Juliani, and launched it with only one worker. Since A2C is the synchronous version of A3C, running a single worker (thread) gives the same result with both algorithms.
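
For readers unfamiliar with the "advantage" in Advantage Actor-Critic, the sketch below shows one common way to compute it: discounted returns minus the critic's value estimates. This is a generic textbook form and not necessarily the exact estimator of the implementation we ran; the function names are hypothetical and the discount factor in the example is an arbitrary illustrative value.

```python
def discounted_returns(rewards, bootstrap_value, gamma):
    """Compute G_t = r_t + gamma * G_{t+1}, starting from a bootstrap value."""
    returns = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, bootstrap_value, gamma):
    """Advantage A_t = G_t - V(s_t): how much better the outcome was than predicted."""
    returns = discounted_returns(rewards, bootstrap_value, gamma)
    return [g - v for g, v in zip(returns, values)]


# Illustrative numbers only (gamma = 0.5 is not the value used in the papers).
print(discounted_returns([0, 1, 1], bootstrap_value=0.0, gamma=0.5))
# [0.75, 1.5, 1.0]
print(advantages([0, 1, 1], values=[0.5, 0.5, 0.5], bootstrap_value=0.0, gamma=0.5))
# [0.25, 1.0, 0.5]
```

The actor is pushed toward actions with positive advantage, while the critic is trained to bring its value estimates closer to the returns.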

Wang et al. published two papers describing their method. The main difference is that their [first paper](https://arxiv.org/pdf/1611.05763v1.pdf) ran 10 trials per episode, with the two-step task being reset at each episode (the reward probabilities are fixed during an episode), while the [second](https://www.biorxiv.org/content/biorxiv/early/2018/04/13/295964.full.pdf) ran episodes of 200 trials, with the reward probabilities switching with a probability of $2.5\%$. This second paper contains more information about the parameters used in training, such as the discount factor.
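
The second setting can be sketched as follows. The sketch assumes that the switch is drawn independently after every trial and that $p$ alternates between $0.1$ and $0.9$; the function name `reward_probability_schedule` is hypothetical.

```python
import random

SWITCH_PROB = 0.025  # 2.5% chance that the reward probabilities switch


def reward_probability_schedule(n_trials=200, seed=None):
    """Return the value of p in effect on each trial of one episode."""
    rng = random.Random(seed)
    p = rng.choice([0.1, 0.9])
    schedule = []
    for _ in range(n_trials):
        schedule.append(p)
        # Assumption: the switch is drawn independently after every trial.
        if rng.random() < SWITCH_PROB:
            p = 0.1 if p == 0.9 else 0.9
    return schedule
```

Under these assumptions a 200-trial episode contains about five switches on average, so the agent has to keep tracking which second-stage state is currently the rewarding one.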

Training took longer than reported in the [second paper](https://www.biorxiv.org/content/biorxiv/early/2018/04/13/295964.full.pdf), whose authors obtained the model-based-like plot after 10,000 episodes. The results we show were obtained after 40,000 training episodes for the first version and 20,000 for the second.

According to our results, which are similar to those shown by Wang et al., the learned LSTM exhibited model-based behaviour even though it was trained with a single-threaded A3C, which is a model-free procedure.

![gif of evolution of training](results/arxiv/arxiv_40k/train/training_40k_gif.gif)

### Neuroscience

In the [other version](https://www.biorxiv.org/content/biorxiv/early/2018/04/13/295964.full.pdf) of their paper, which is more neuroscience-oriented, they present a new model of the learning process in animals. In this model, the prefrontal cortex, together with the basal ganglia and the thalamic nuclei, is represented by a recurrent neural network whose synaptic weights are adjusted by phasic dopamine release, which drives, in parallel, a model-free-like Reinforcement Learning process based on stimulus-response associations.