# n-step returns
The next improvement from the Rainbow paper [1] that I implemented is n-step returns. 
A thorough introduction to n-step returns can be found in Chapter 7 of Sutton & Barto's RL book [2].
The idea is to replace the one-step reward and the next state's Q-value with the discounted return over n steps, and to bootstrap from the Q-value of the n-th state when computing the target of the update step:

$$ y_t = r_{t+1} + \dots + \gamma^{n-1} r_{t+n} + \gamma^{n} \cdot Q(s_{t+n}, \text{argmax}_{a} Q(s_{t+n}, a, \Theta), \Theta^{-})$$
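
To make the target above concrete, here is a minimal NumPy sketch for a single transition. The function name `n_step_double_dqn_target` and its arguments are hypothetical and not taken from my agent code; the handling of `done` (no bootstrapping when the episode ends within the n steps) is the usual convention and an assumption here.

```python
import numpy as np


def n_step_double_dqn_target(rewards, gamma, q_online_next, q_target_next, done):
    """Sketch of the n-step Double DQN target for one transition.

    rewards:        [r_{t+1}, ..., r_{t+n}]
    q_online_next:  Q(s_{t+n}, ., Theta), one value per action (online network)
    q_target_next:  Q(s_{t+n}, ., Theta^-), one value per action (target network)
    done:           True if the episode terminated within the n steps (assumption)
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    n = len(rewards)

    # Discounted sum r_{t+1} + gamma * r_{t+2} + ... + gamma^(n-1) * r_{t+n}
    discounts = gamma ** np.arange(n)
    n_step_return = float(np.sum(discounts * rewards))

    if done:
        # Terminal state: nothing to bootstrap from.
        return n_step_return

    # Double DQN: the online network selects the action,
    # the target network evaluates it.
    best_action = int(np.argmax(q_online_next))
    return n_step_return + gamma ** n * float(q_target_next[best_action])


target = n_step_double_dqn_target(
    rewards=[1.0, 1.0, 1.0, 1.0],
    gamma=0.5,
    q_online_next=np.array([0.0, 2.0]),
    q_target_next=np.array([10.0, 4.0]),
    done=False,
)
print(target)  # 1.875 + 0.5**4 * 4.0 = 2.125
```

For n = 1 the sum contains a single reward and the expression reduces to the ordinary one-step Double DQN target.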

Using n-step returns has two advantages. First, it can speed up the learning of Q-values, because each reward is propagated back over n steps in a single update. Second, the n-step target reduces the bias of the method, because it relies less on the bootstrapped Q-value estimate, at the cost of increased variance. I am almost certain that Sutton & Barto discuss the bias/variance trade-off of n-step methods, but I was not able to find the passage after briefly looking through the book; a web search turned up [3], which has an excellent discussion of the topic.

## Results
I implemented n-step returns by adding a second deque as a buffer to the ReplayMemory. This buffer holds up to n samples; once it is full, it starts filling the main ReplayMemory with samples that contain the n-step reward, i.e. the discounted sum of the n buffered rewards.
All figures were again plotted with the help of the BSuite analysis Jupyter notebook [4].
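
The two-deque idea described above could look roughly like the following sketch. It is not the actual ReplayMemory from my code: the names `NStepReplayMemory`, `Transition` and `push` are hypothetical, and flushing shortened returns at the end of an episode is an assumption about how terminal states are handled.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class NStepReplayMemory:
    """Replay memory that stores n-step transitions (illustrative sketch)."""

    def __init__(self, capacity, n_steps, gamma):
        self.memory = deque(maxlen=capacity)        # main replay memory
        self.n_step_buffer = deque(maxlen=n_steps)  # last n raw transitions
        self.n_steps = n_steps
        self.gamma = gamma

    def _n_step_transition(self):
        # Start state/action come from the oldest buffered sample,
        # next_state/done from the newest one.
        first, last = self.n_step_buffer[0], self.n_step_buffer[-1]
        reward = sum(
            self.gamma ** k * t.reward for k, t in enumerate(self.n_step_buffer)
        )
        return Transition(first.state, first.action, reward,
                          last.next_state, last.done)

    def push(self, state, action, reward, next_state, done):
        self.n_step_buffer.append(
            Transition(state, action, reward, next_state, done)
        )
        if done:
            # Assumption: at the end of an episode, every remaining start
            # state gets a (possibly shorter) return, then the buffer is empty.
            while self.n_step_buffer:
                self.memory.append(self._n_step_transition())
                self.n_step_buffer.popleft()
        elif len(self.n_step_buffer) == self.n_steps:
            # Buffer is full: emit one n-step sample. The next append drops
            # the oldest entry automatically because of maxlen.
            self.memory.append(self._n_step_transition())

    def sample(self, batch_size):
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```

Because the shortened transitions at the end of an episode are stored with `done=True`, the target for them contains no bootstrapped Q-value, so the mismatch between their actual length and n does not matter. In an environment whose episodes end after a single step, every stored sample is then a plain 1-step transition.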

The comparison of the DDQN agent with 1-step returns (run014) and with 4-step returns (run016) is shown below. The agent with 4-step returns shows slightly better results. Increasing n further, up to 8, did not lead to better performance. 

**Edit:** The original results of run016 were obtained with DDQN switched off. Run18 shows the results with 4-step returns and DDQN switched on. The results differ slightly, but there is no big difference.

![Radar plot with a comparison between the different settings.](./figures/n_step_radar.png)

The bar plot shows that the main performance increase comes from the catch, cartpole and mountain car experiments and their noisy / scaled variants. Not too surprisingly, n-step returns had no impact on the bandit experiment (since an episode ends after one step, only 1-step returns are used there) or on the exploration and memory experiments. 

![Bar plot with a comparison between the different settings.](./figures/n_step_barplot.png)

As one example, the learning curves for the cartpole experiment show that using 4-step returns substantially increases and stabilizes the performance.

![Learning curves for the cartpole experiment. DQN with 1-step rewards (run014); DQN with 4-step rewards (run016) and DDQN with 4-step rewards](./figures/n_step_cartpole.png)


## References
[1] Hessel, Matteo, et al. Rainbow: Combining improvements in deep reinforcement learning. In: Thirty-Second AAAI Conference on Artificial Intelligence. 2018.  
[2] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.  
[3] https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/, last accessed: 2020-03-28.  
[4] Osband, Ian, et al. "Behaviour Suite for Reinforcement Learning." arXiv preprint arXiv:1908.03568, 2019.  