# Navigation

---
## Project Navigation: REPORT

<img src="https://camo.githubusercontent.com/7ad5cdff66f7229c4e9822882b3c8e57960dca4e/68747470733a2f2f73332e616d617a6f6e6177732e636f6d2f766964656f2e756461636974792d646174612e636f6d2f746f706865722f323031382f4a756e652f35623165613737385f726561636865722f726561636865722e676966">

### Introduction

This report describes the implementation for Project 2 of the Deep Reinforcement Learning Nanodegree, in which I trained an agent to perform continuous control of a double-jointed robot arm so that it reaches and tracks a goal. Please refer to the [README.md]() in this repository for more information.

### Learning Algorithm 

1. ### The Agent

    The agent architecture can be found in "**ddpg_agent.py**". This file implements an "Agent" class that holds:

    * args (class defined in the notebook): a set of parameters that define the agent's hyperparameters
    * state_size (int): dimension of each state
    * action_size (int): dimension of each action
    
    The agent uses the actor-critic DDPG algorithm. Some highlights of this algorithm:
    * Actor-critic network architecture
    * Off-policy (slower convergence but more stable)
    * Supports continuous observation spaces and continuous action spaces
    * Uses random (uniformly sampled) experience replay
    * Uses multiple environments to populate the replay buffer in parallel and thus train faster
    * Uses two networks each for the actor and the critic: a local network that is trained and a target network. A soft update
      slowly blends the local weights into the target network.
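
The replay and soft-update bullets above can be sketched in a few lines of plain Python. This is a minimal illustration and not the code in `ddpg_agent.py`: the class name `ReplayBuffer`, the method `add_parallel`, the function `soft_update`, the toy state and action shapes, and the count of 20 parallel environments are all assumptions made for the example. Network parameters are represented as flat lists of floats so the sketch runs without any deep learning library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest transitions drop out first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add_parallel(self, states, actions, rewards, next_states, dones):
        """Store one transition per parallel environment for a single time step."""
        for transition in zip(states, actions, rewards, next_states, dones):
            self.memory.append(transition)

    def sample(self):
        """Draw a uniformly random minibatch (no prioritization)."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


def soft_update(local_params, target_params, tau):
    """Return tau * local + (1 - tau) * target for each parameter."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


buffer = ReplayBuffer(buffer_size=100_000, batch_size=128)
num_envs = 20  # assumed number of parallel environments
for _ in range(10):  # ten time steps, each adding num_envs transitions
    buffer.add_parallel(
        states=[[0.0, 0.0]] * num_envs,       # toy 2-d states
        actions=[[0.0]] * num_envs,           # toy 1-d actions
        rewards=[0.0] * num_envs,
        next_states=[[0.0, 0.0]] * num_envs,
        dones=[False] * num_envs,
    )

batch = buffer.sample()
target = soft_update(local_params=[1.0, 2.0], target_params=[0.0, 0.0], tau=1e-3)
```

With a small `tau`, each call moves the target parameters only a fraction of the way toward the local ones, which is what keeps the learning targets from shifting abruptly.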
    



 2. ### The Policy network 
 
 The network architectures are also described in the README.md file.
 
  **Actor**

  - Hidden: (input, 256) - ReLU
  - Hidden: (256, 128) - ReLU
  - Output: (128, 4) - Tanh

  **Critic**

  - Hidden: (input, 256) - ReLU
  - Hidden: (256 + action_size, 128) - ReLU
  - Hidden: (128, 128) - ReLU
  - Output: (128, 1) - Linear
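
To make the layer shapes concrete, the sketch below walks through both networks and counts their trainable parameters, assuming plain fully connected layers with biases. The helper names (`actor_layers`, `critic_layers`, `count_parameters`) are hypothetical, and the state size of 33 is an assumption about the Reacher observation vector; the report itself only says "input".

```python
def actor_layers(state_size, action_size=4):
    """(in_features, out_features, activation) for each actor layer."""
    return [
        (state_size, 256, "relu"),
        (256, 128, "relu"),
        (128, action_size, "tanh"),  # tanh keeps each action in [-1, 1]
    ]


def critic_layers(state_size, action_size=4):
    """(in_features, out_features, activation) for each critic layer."""
    return [
        (state_size, 256, "relu"),
        (256 + action_size, 128, "relu"),  # action is concatenated to the first hidden output
        (128, 128, "relu"),
        (128, 1, "linear"),  # single Q-value, no squashing
    ]


def count_parameters(layers):
    """Weights plus biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out, _ in layers)


STATE_SIZE = 33  # assumption, not stated in this report

print(count_parameters(actor_layers(STATE_SIZE)))   # 42116
print(count_parameters(critic_layers(STATE_SIZE)))  # 58753
```

The `256 + action_size` input width of the second critic layer reflects that the critic sees the state first and only receives the action after the first hidden layer.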

3. ### Hyper-parameters

    As described in more detail in the README file, the following hyperparameters are used in this algorithm.

  - Learning rate: 1e-4 (for both the actor and the critic networks)
  - Batch size: 128 (minibatch size)
  - Replay buffer size: 1e5
  - Gamma: 0.99 (discount factor)
  - Tau: 1e-3 (for soft update of target parameters)
  - Ornstein-Uhlenbeck noise parameters: theta = 0.15, sigma = 0.2 (noise added to the actions so the agent explores more)
  - n_episodes: 1000 (maximum number of episodes for which training will proceed)
  - checkpoint_score: 30 (if the score is greater than this threshold, the networks are checkpointed and training is finished)
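
As an illustration of how the theta and sigma values above shape the exploration noise, here is a minimal Ornstein-Uhlenbeck sketch in plain Python. It is not taken from `ddpg_agent.py`: the class name `OUNoise`, the helper `noisy_action`, the zero mean `mu`, the Gaussian increments, and the clipping to [-1, 1] are assumptions chosen for the example.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting noise."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = [mu] * size      # long-run mean the noise is pulled toward (assumed 0)
        self.theta = theta         # strength of the pull back toward mu
        self.sigma = sigma         # scale of the random increments
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = list(self.mu)

    def sample(self):
        """Advance one step: x <- x + theta * (mu - x) + sigma * N(0, 1)."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return self.state


def noisy_action(action, noise):
    """Add exploration noise, then clip to the Tanh output range [-1, 1]."""
    return [max(-1.0, min(1.0, a + n)) for a, n in zip(action, noise.sample())]


noise = OUNoise(size=4)
action = noisy_action([0.9, -0.9, 0.0, 0.0], noise)
```

Because each sample starts from the previous one, consecutive noise values are correlated, which tends to produce smoother exploratory motion than independent noise at every step.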


### Results

Three different agents have been trained. Please check the Navigation.ipynb notebook for more details.

**DDPG agent with random replay buffer, v2 of the Reacher environment**

With parallel environments, convergence is much faster.

<img src="ddpg1.png">

The actor and critic models are checkpointed in<br>
./MultiEnvCheckPt/Episode119_actor.pth<br>
./MultiEnvCheckPt/Episode119_critic.pth<br>
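
The stopping rule behind these checkpoint files can be sketched as follows, assuming the score is compared against `checkpoint_score` once per episode and that episodes are numbered from 1. The function name `first_checkpoint` is hypothetical, the file name pattern is inferred from the two paths above, and the report does not state whether the score is a per-episode value or a moving average.

```python
def first_checkpoint(episode_scores, checkpoint_score=30.0, n_episodes=1000):
    """Return (episode, actor_path, critic_path) for the first score above
    the threshold, or None if the threshold is never exceeded."""
    for episode, score in enumerate(episode_scores[:n_episodes], start=1):
        if score > checkpoint_score:
            # File name pattern inferred from the checkpoint paths in this report.
            actor_path = f"./MultiEnvCheckPt/Episode{episode}_actor.pth"
            critic_path = f"./MultiEnvCheckPt/Episode{episode}_critic.pth"
            return episode, actor_path, critic_path
    return None


# Toy scores that cross the threshold at the third episode.
result = first_checkpoint([5.0, 18.0, 31.2, 35.0])
```

In the real training loop the model weights would be saved at this point; the sketch only shows when training stops and how the checkpoint names are formed.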

I initially tried v1 of the environment (a single agent). Convergence was so much slower (almost 1/5 of the speed) that I decided to try this parallel-environment version.



### Future Work

These results and conclusions suggest a series of changes to improve the agent's performance and to reduce its instability. Future work will include:

* Implement the D4PG algorithm [5]
* Implement policy-based algorithms
* Apply a distributional perspective on reinforcement learning [7][8]
* Study different and more complex NN architectures applicable to the problem
* Auto-tune hyperparameters and NN architectures


### References

* [1] DPG: Deterministic Policy Gradient Algorithms (http://proceedings.mlr.press/v32/silver14.pdf)
* [2] DQN: Human-level control through deep reinforcement learning (https://www.nature.com/articles/nature14236)
* [3] Deep Reinforcement Learning with Double Q-learning (https://arxiv.org/abs/1509.06461)
* [4] Prioritized Experience Replay (https://arxiv.org/abs/1511.05952)
* [5] D4PG: Distributed Distributional Deterministic Policy Gradients (https://arxiv.org/pdf/1804.08617.pdf)
* [6] Reinforcement Learning: An Introduction (https://s3-us-west-1.amazonaws.com/udacity-drlnd/bookdraft2018.pdf)
* [7] A Distributional Perspective on Reinforcement Learning (https://arxiv.org/abs/1707.06887)
* [8] Ray RLlib: A distributed framework for RL and hyperparameter tuning (https://ray.readthedocs.io/en/latest/rllib.html)
