## Algorithm

The algorithm used to solve the environment is Deep Deterministic Policy Gradients (DDPG) as described [here](https://arxiv.org/pdf/1509.02971.pdf).

The algorithm adapts the actor-critic approach to the continuous action domain. It is an off-policy algorithm that combines ideas from [Deep Q learning](https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf) and [deterministic policy gradients](http://proceedings.mlr.press/v32/silver14.pdf) to learn policies in environments that have continuous action spaces.

DDPG uses two networks. The actor network represents the current policy and maps a state to an action. The critic network takes a state-action pair and outputs its value (the state-action value). Because DDPG uses deep neural networks both to approximate the state-action value and to represent the policy, it relies on a replay memory and target networks to stabilize training. To help with exploration of the action space, noise sampled from a noise process is added to the output of the actor network.
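The noise process is not specified here. As a minimal sketch, the block below assumes an Ornstein-Uhlenbeck process, which is the choice made in the DDPG paper; the class name `OUNoise` and the `theta`/`sigma` defaults are illustrative assumptions, not taken from this repository.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck noise (assumed noise process; hypothetical helper)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.size = size
        self.mu = mu          # long-run mean the noise is pulled towards
        self.theta = theta    # strength of the pull towards mu
        self.sigma = sigma    # scale of the random perturbation
        self.rng = random.Random(seed)
        self.state = [mu] * size

    def reset(self):
        """Reset the internal state to the mean, e.g. at the start of an episode."""
        self.state = [self.mu] * self.size

    def sample(self):
        """Advance the process by one step and return the new noise vector."""
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)
```

Each call to `sample()` returns noise correlated with the previous call, which tends to produce smoother exploration than independent Gaussian noise.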

To solve the Reacher environment (single agent), the basic DDPG algorithm is extended with the following:
- Replay memory with priority-based sampling (Priority Replay Buffer). It is implemented using a segment tree data structure.
- Multi-step returns instead of single-step returns.

Both additions are used because rewards in the Reacher environment are sparse; a Priority Replay Buffer combined with multi-step returns may help the agent converge, and converge faster.
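To illustrate how a segment tree supports priority-based sampling, here is a minimal sketch of a sum tree. It is not the implementation used in this repository: the class name `SumTree`, its methods, and the power-of-two capacity restriction are assumptions made to keep the example short.

```python
import random


class SumTree:
    """Minimal sum segment tree over a fixed number of priorities (hypothetical)."""

    def __init__(self, capacity):
        # Simplification for this sketch: capacity must be a power of two.
        assert capacity > 0 and capacity & (capacity - 1) == 0
        self.capacity = capacity
        # Node 1 is the root; leaves live at indices [capacity, 2 * capacity).
        self.tree = [0.0] * (2 * capacity)

    def update(self, index, priority):
        """Set the priority of one leaf and refresh the sums on the path to the root."""
        pos = index + self.capacity
        self.tree[pos] = priority
        pos //= 2
        while pos >= 1:
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def total(self):
        """Sum of all priorities."""
        return self.tree[1]

    def find_prefix_index(self, mass):
        """Return the leaf whose cumulative priority interval contains `mass`."""
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if mass < self.tree[left]:
                pos = left
            else:
                mass -= self.tree[left]
                pos = left + 1
        return pos - self.capacity

    def sample(self, rng=random):
        """Draw one index with probability proportional to its priority."""
        return self.find_prefix_index(rng.random() * self.total())
```

Both `update` and `sample` take time logarithmic in the buffer size, which is why a segment tree is a common choice for a prioritized buffer with up to $10^{6}$ entries.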

### Algorithm Implemented

1. `Initialize an actor network and a critic network`.
  - `An actor has two networks: Local`$\mu^{\theta}$ `and target`$\mu^{\theta^{'}}$ `network with parameters` $\theta$ `and` $\theta^{'}$. `Initially` $\theta = \theta^{'}$
  - `A critic has two networks: Local` $Q^{\phi}$ `and target` $Q^{\phi^{'}}$ `network with parameters` $\phi$ `and` $\phi^{'}$. `Initially` $\phi = \phi^{'}$
  
2. `Initialize Priority Replay Buffer` $\mathcal{D}$.
3. `Initialize noise process` $\mathcal{N}$.
4. `for episode = 1 to M:`
  - `Reset environment and get initial state` $s$
  - `Until the episode is complete do:`
    - `From state` $s$ `obtain rewards` $r=(r_{1},r_{2},r_{3},r_{4},r_{5})$ `for 5-step returns and state` $s^{'}$ `after the final step. An agent chooses an action as follows:`
      - `if episode <= 10:`$a\sim StandardNormal(0,1)$  `(random action)`   
      `Otherwise: ` $a = \mu^{\theta}(s) + \mathcal{N}$
    - `Get initial priority for `$(s,a,r,s^{'})$ ` and Store` $(s,a,r,s^{'},priority)$ `in` $\mathcal{D}$
    - $s = s^{'}$
    - `if len`$(\mathcal{D})$ ` > UPDATE_AFTER`  
      - `Sample a mini-batch on priority basis` $\mathcal{B}$ `from` $\mathcal{D}$.   $\mathcal{B}= (s^{j},a^{j},r^{j},s^{'j},w^{j})$    
      `where,` $s^{j}$: `states`    
      $a^{j}$: `actions`    
      $r^{j}$: `set of 5 rewards(5 step return)`    
      $s^{'j}$: `next states`    
      $w^{j}$: `Importance sampling weights`
      - `Set` $target= r^{j}_{1} + \gamma r^{j}_{2} + \gamma^{2} r^{j}_{3} + \gamma^{3} r^{j}_{4} + \gamma^{4} r^{j}_{5} + \gamma^{5} Q^{\phi^{'}}(s^{'j},a^{'j})$    
      `where, `$a^{'j} = \mu^{\theta^{'}}(s^{'j})$
      - `Update critic by minimizing the following loss:` $w^{j}\times(target - Q^{\phi}(s^{j},a^{j}))^{2}$
      - `Update the actor using sampled policy gradient: ` $\nabla_{\theta}J \approx \frac{1}{B} \sum_{j} \nabla_{\theta}\mu^{\theta}(s^{j})\nabla_{a}Q^{\phi}(s^{j},a^{j})\mid_{a^{j}=\mu^{\theta}(s^{j})}$
    - `Update target network parameters:`
      - $\theta^{'} = \tau \theta + (1-\tau)\theta^{'}$
      - $\phi^{'} = \tau \phi + (1-\tau)\phi^{'}$
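The 5-step target and the soft update of the target networks from the steps above can be sketched in plain Python. The function names `n_step_target` and `soft_update` are hypothetical, and parameters are represented as flat lists of floats purely for illustration.

```python
def n_step_target(rewards, gamma, bootstrap_value):
    """Discounted sum of the n rewards plus the discounted target-critic value.

    rewards:         (r_1, ..., r_n) collected over n consecutive steps
    bootstrap_value: target critic's value for (s', a'), with a' from the target actor
    """
    target = 0.0
    for k, reward in enumerate(rewards):
        target += (gamma ** k) * reward
    return target + (gamma ** len(rewards)) * bootstrap_value


def soft_update(local_params, target_params, tau):
    """Return the new target parameters: tau * local + (1 - tau) * target."""
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_params, target_params)
    ]
```

With `TAU = 1e-3`, each update moves the target parameters only 0.1% of the way towards the local parameters, so the targets used by the critic loss change slowly.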


## Hyper-parameters

- BUFFER_SIZE = $10^{6}$
   - Size of the replay buffer
- BATCH_SIZE = 128 
   - number of instances sampled in a batch
- GAMMA = 0.99
   - Discount factor
- TAU = $10^{-3}$ 
   - Parameter for soft update of the target network parameters
- LR_ACTOR = $10^{-4}$
   - Learning rate for actor network 
- LR_QNET = $3\times10^{-4}$
   - Learning rate for critic network
- WEIGHT_DECAY = $10^{-4}$
   - L2 weight decay parameters for network layers
- NOISE_SCALE = 0.1
   - scaling parameter of the noise added to actions
- UPDATE_AFTER = 1000
   - Minimum number of samples in the buffer after which learning can start
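- ALPHA = 0.6
   - Factor controlling the amount of prioritization (0 gives uniform sampling)
- REPLAY_EPS = 1e-6
   - Small constant added to each priority so that every transition keeps a nonzero chance of being sampled
- BETA = 1
   - Factor controlling the strength of the importance sampling correction (1 gives full correction)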
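As a hedged illustration of how `ALPHA`, `REPLAY_EPS` and `BETA` interact, the sketch below follows the proportional variant of prioritized experience replay. It assumes priorities are derived from absolute TD errors, which this document does not state; the function names are hypothetical.

```python
def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) = p_i^alpha / sum_k p_k^alpha, with p_i = |td_error_i| + eps (assumed)."""
    priorities = [(abs(error) + eps) ** alpha for error in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=1.0):
    """w_i = (N * P(i))^(-beta), normalized by the largest weight for stability."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_weight = max(weights)
    return [w / max_weight for w in weights]
```

For example, with `alpha=1`, `eps=0` and TD errors `[1.0, 3.0]`, the second transition is sampled three times as often as the first, and its importance sampling weight is one third of the first transition's weight to compensate.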


### Critic Architecture

- The critic network takes the state of an agent and the predicted action as its input.
- An individual state is of size 33, so a tensor of shape (33,) is the input to the first linear layer, which is of size 256.
- The input to the second linear layer is a concatenated tensor: the action tensor from the input is concatenated with the output of the first linear layer + Leaky ReLU block. The second layer is of size 256 and is followed by a leaky ReLU layer.
- The third layer is of size 128 and is followed by a leaky ReLU layer.
- The final layer is of size 1.

The detailed architecture is given below:

1. **linear1**: Linear(in_features=33, out_features=256, bias=True)
2. **Leaky ReLU** layer
3. **Concatenation**: Concatenate output from `linear1 + Leaky_ReLU` layer to action tensor of size (4, ).
4. **linear2**: Linear(in_features=260, out_features=256, bias=True)
5. **Leaky ReLU** layer
6. **linear3**: Linear(in_features=256, out_features=128, bias=True)
7. **Leaky ReLU** layer
8. **linear4**: Linear(in_features=128, out_features=1, bias=True)

![CriticArchitecture](./critic-architecture.png)
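The layer list above can be written as a PyTorch module roughly as follows. This is a sketch reconstructed from the listed layer sizes, not the repository's code; the class name `Critic`, the constructor arguments, and the default leaky ReLU negative slope (0.01) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar state-action value."""

    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.linear1 = nn.Linear(state_size, 256)
        # The action is concatenated after the first layer: 256 + 4 = 260 inputs.
        self.linear2 = nn.Linear(256 + action_size, 256)
        self.linear3 = nn.Linear(256, 128)
        self.linear4 = nn.Linear(128, 1)

    def forward(self, state, action):
        x = F.leaky_relu(self.linear1(state))  # negative slope 0.01 (assumed default)
        x = torch.cat([x, action], dim=1)      # join features and action per sample
        x = F.leaky_relu(self.linear2(x))
        x = F.leaky_relu(self.linear3(x))
        return self.linear4(x)                 # no activation on the value output
```

Calling `Critic()(states, actions)` with `states` of shape `(batch, 33)` and `actions` of shape `(batch, 4)` returns a tensor of shape `(batch, 1)`.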

### Actor Architecture

The actor takes a state as input (a tensor of size 33) and outputs an action tensor of size (4,).

Detailed architecture is given below:

1. **linear1**: Linear(in_features=33, out_features=256, bias=True)
2. **ReLU** layer
3. **linear2**: Linear(in_features=256, out_features=4, bias=True)
4. **activation**: Tanh()

![ActorArchitecture](./actor-architecture.png)
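A matching PyTorch sketch of the actor, again reconstructed from the layer list rather than copied from the repository; the class name `Actor` and the constructor arguments are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Maps a state to a deterministic action with components in [-1, 1]."""

    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.linear1 = nn.Linear(state_size, 256)
        self.linear2 = nn.Linear(256, action_size)

    def forward(self, state):
        x = F.relu(self.linear1(state))
        # tanh bounds every action component to the interval [-1, 1].
        return torch.tanh(self.linear2(x))
```

Because of the final `tanh`, any exploration noise added to the output can push an action outside `[-1, 1]`, so clipping the noisy action before sending it to the environment is a common precaution.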

### Plot of the Rewards

**The environment is solved in 1300 episodes**

![Result](./result.png)

### Future Work

- **Tuning Hyperparameters**: Trying different hyper-parameter settings and reporting how they affect learning time, or whether the model diverges for some settings.
- **Multi-Agent Environment**: Applying the same algorithm to the multi-agent Reacher environment, with a common replay memory for training multiple agents simultaneously.
- **Other Algorithms**: Trying PPO and A3C to solve the environment and comparing the results.

### Trained agents playing
![AgentPlaying](./sample_play.gif)

[//]: # (Image References)

[image1]: ./sample_play.gif "Trained Agent"
