# Summary

Deep Meta Reinforcement Learning is about training an agent on a distribution of similar tasks, i.e., interrelated Markov Decision Processes (MDPs), so that it performs well on average across those MDPs.

To do so, [Wang et al., 2016](https://arxiv.org/pdf/1611.05763v1.pdf) proposed an architecture in which a recurrent neural network, a Long Short-Term Memory (LSTM) network, is trained on the experience gathered by an agent acting in an environment. The training itself is driven by a separate Reinforcement Learning algorithm of the Actor-Critic family (A2C and A3C). They ran a total of 7 experiments to analyze the behaviour of their method in different contexts. Their main purpose was to establish whether meta-RL could:
- Deal with the exploration/exploitation tradeoff,
- Understand the abstract structure of a task.
 

### The Two-Step Task
The task we reproduced is a variation of the [two-step task](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3077926/). It is an MDP with 3 states: one first-stage state $S_1$, from which the agent takes one of two actions ($A_1$, $A_2$), and two second-stage states ($S_2$ and $S_3$). Each action can lead to either second-stage state, with transition probabilities that depend on the action chosen (see the figure below). Once in a second-stage state, the agent receives a reward of $1$ with probability $p$ in $S_2$ and $1-p$ in $S_3$, with $p \in \{0.1, 0.9\}$. The value of $p$ is reset at each episode.

![Two-Step Task](ressources/two_step_task.png)
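
As a rough illustration of the task described above, here is a minimal sketch of the environment in plain Python. The class name `TwoStepTask`, the `reset`/`step` interface, and the common-transition probability `p_common = 0.8` are assumptions made for this example (the transition probabilities are only given in the figure), as is the choice that $A_1$ most often leads to $S_2$.

```python
import random


class TwoStepTask:
    """Hypothetical sketch of the two-step task (not the repository's code)."""

    def __init__(self, p_common=0.8, seed=None):
        # p_common is an assumed value: probability that an action leads
        # to its "usual" second-stage state.
        self.p_common = p_common
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # The reward probability p is redrawn from {0.1, 0.9} at each episode.
        p = self.rng.choice([0.1, 0.9])
        self.reward_prob = {"S2": p, "S3": 1 - p}
        return "S1"

    def step(self, action):
        # Assumed mapping: A1 usually leads to S2, A2 usually leads to S3.
        common, rare = ("S2", "S3") if action == "A1" else ("S3", "S2")
        state = common if self.rng.random() < self.p_common else rare
        reward = 1 if self.rng.random() < self.reward_prob[state] else 0
        return state, reward


env = TwoStepTask(seed=0)
first_state = env.reset()
second_state, reward = env.step("A1")
```

Here one call to `step` plays a full trial (first-stage action, transition, reward); an episode would then consist of several such trials between two calls to `reset`.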


This task reveals which kind of strategy an RL algorithm follows, because model-based and model-free algorithms behave differently on it.

When an agent follows a model-based algorithm, it learns an internal model of the environment it is in (transition and reward functions) and relies on it when planning its next move. Model-free algorithms such as Q-Learning, on the other hand, do not store any model of the environment; they map values to actions in states through local updates.

The two-step task makes it easier to distinguish between these two approaches, because an agent following a model-based strategy takes the transition probabilities of its actions into account, whereas an agent following a model-free algorithm relies on the reward only.
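
To make the distinction concrete, the sketch below compares the two kinds of update after a single trial in which the agent chose $A_1$, landed in $S_3$ through a rare transition, and was rewarded. The transition table, the value `P_COMMON = 0.8`, the learning rate, and the function names are all assumptions chosen for illustration.

```python
# Assumed common-transition probability (not stated in the text above).
P_COMMON = 0.8

# TRANSITIONS[action][state] = probability of reaching that state.
TRANSITIONS = {
    "A1": {"S2": P_COMMON, "S3": 1 - P_COMMON},
    "A2": {"S2": 1 - P_COMMON, "S3": P_COMMON},
}


def model_based_values(second_stage_values):
    """Plan through the transition model: Q(a) = sum_s P(s | a) * V(s)."""
    return {
        action: sum(prob * second_stage_values[state]
                    for state, prob in probs.items())
        for action, probs in TRANSITIONS.items()
    }


def model_free_update(q_values, action, reward, learning_rate=0.5):
    """Move only the chosen action's value towards the obtained reward."""
    updated = dict(q_values)
    updated[action] += learning_rate * (reward - updated[action])
    return updated


# The trial: A1 was chosen, a rare transition led to S3, and S3 was rewarded.
q_model_free = model_free_update({"A1": 0.5, "A2": 0.5}, "A1", reward=1.0)
q_model_based = model_based_values({"S2": 0.5, "S3": 1.0})

# The model-free agent now prefers A1 (the action it just took), while the
# model-based agent prefers A2 (the action most likely to reach S3).
best_model_free = max(q_model_free, key=q_model_free.get)
best_model_based = max(q_model_based, key=q_model_based.get)
```

The model-free update credits the action that was taken, whereas the model-based computation credits the action that most reliably leads to the rewarded state.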

### Network

The authors chose the Advantage Actor-Critic algorithm because its structure (multiple workers learning on their own and periodically updating the master recurrent network) already fits this setup. The network takes 4 inputs:
- $t$ : current timestep, with $t_{max} = 200$.
- $x_t$ : observation at timestep $t$.
- $r_{t-1}$ : last reward.
- $a_{t-1}$ : last action.

and gives two outputs: 
- $V$ is the value of the current state,
- $\pi$ is the softmax policy distribution.
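
One possible way to assemble these inputs into a single vector, and to turn the policy output into a distribution, is sketched below. The one-hot encoding of $x_t$ and $a_{t-1}$, the normalization of $t$ by $t_{max}$, and the helper names are assumptions made for this example, not a description of the actual network code.

```python
import math

N_STATES = 3   # S1, S2, S3
N_ACTIONS = 2  # A1, A2
T_MAX = 200    # maximum timestep, as stated above


def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def encode_input(t, observation, last_reward, last_action):
    """Concatenate x_t, r_{t-1}, a_{t-1} and t into one flat input vector."""
    return (
        one_hot(observation, N_STATES)     # x_t (assumed one-hot)
        + [float(last_reward)]             # r_{t-1}
        + one_hot(last_action, N_ACTIONS)  # a_{t-1} (assumed one-hot)
        + [t / T_MAX]                      # t (assumed scaled to [0, 1])
    )


def softmax(logits):
    """Numerically stable softmax, as used for the policy output pi."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Timestep 10, observing S1 (index 0), after a reward of 1 for A2 (index 1).
features = encode_input(t=10, observation=0, last_reward=1, last_action=1)
policy = softmax([0.2, -0.1])
```

With this encoding the LSTM would receive a vector of length 7 at every step; other encodings would work as long as all four quantities are present.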

![Architecture of the network](ressources/architecture.png)

### Test

Talk about the stay probability.

Actor-Critic algorithms are model-free, but the learned LSTM exhibits model-based behaviour.
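
In analyses of the two-step task, the stay probability is the probability of repeating the previous first-stage action, split by whether the previous trial was rewarded and whether its transition was common or rare. A minimal sketch of how it could be computed from logged trials follows; the trial format `(action, transition, reward)` and the function name are hypothetical.

```python
from collections import defaultdict


def stay_probabilities(trials):
    """Estimate P(repeat previous action | previous transition, rewarded).

    `trials` is a list of (action, transition, reward) tuples, where
    `transition` is "common" or "rare" (assumed logging format).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [stays, total]
    for previous, current in zip(trials, trials[1:]):
        key = (previous[1], previous[2] > 0)
        counts[key][0] += int(current[0] == previous[0])
        counts[key][1] += 1
    return {key: stays / total for key, (stays, total) in counts.items()}


trials = [
    ("A1", "common", 1),
    ("A1", "rare", 1),
    ("A2", "common", 0),
    ("A2", "common", 1),
]
stay = stay_probabilities(trials)
```

A model-free agent is expected to stay more after any rewarded trial, whereas a model-based agent is expected to stay after rewarded common transitions but switch after rewarded rare ones.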


### Neuroscience

Talk about PFC, dopamine, rats...