## Algorithm

The environment is solved with Multi-Agent Deep Deterministic Policy Gradient (MADDPG), as described in [this paper](https://arxiv.org/pdf/1706.02275.pdf).

The algorithm adapts the actor-critic architecture of Deep Deterministic Policy Gradient (DDPG) to environments with more than one agent. The adaptation applies to various multi-agent settings, such as competitive or cooperative ones.

The adaptation uses a framework of centralized training with decentralized execution. Centralized training means that, during training, the critic can use extra information (such as the actions taken by other agents) to ease the training process. Decentralized execution means that, at test time, each actor uses only its local observation to predict actions.
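
The difference in available information can be illustrated with a minimal sketch. The sizes come from the architecture sections of this document; the function names `actor_input` and `critic_input` are hypothetical and only show which data each network is allowed to see. The flat concatenation is for illustration only: the critic described later injects the actions at later layers rather than at the input.

```python
from typing import List, Sequence

STATE_SIZE = 24   # per-agent observation size (from the architecture sections)
ACTION_SIZE = 2   # per-agent action size
NUM_AGENTS = 2


def actor_input(observations: Sequence[Sequence[float]], agent: int) -> List[float]:
    """Decentralized execution: an actor sees only its own observation."""
    return list(observations[agent])


def critic_input(observations: Sequence[Sequence[float]],
                 actions: Sequence[Sequence[float]]) -> List[float]:
    """Centralized training: the critic sees every observation and every action."""
    flat: List[float] = []
    for obs in observations:
        flat.extend(obs)
    for act in actions:
        flat.extend(act)
    return flat


# Dummy data with the right shapes, one entry per agent.
observations = [[0.0] * STATE_SIZE for _ in range(NUM_AGENTS)]
actions = [[0.0] * ACTION_SIZE for _ in range(NUM_AGENTS)]
```

With two agents, `actor_input` yields 24 values while `critic_input` yields 48 state values plus 4 action values.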

![OverviewArchitecture](./OverviewArchitecture.png)

### Algorithm Implemented

1. `Initialize two actor-critic pairs for the two agents`.
  - `Each actor has two networks: Local`$\mu^{\theta}$ `and target`$\mu^{\theta^{'}}$ `network with parameters` $\theta$ `and` $\theta^{'}$.
  - `Each critic has two networks: Local` $Q^{\phi}$ `and target` $Q^{\phi^{'}}$ `network with parameters` $\phi$ `and` $\phi^{'}$.
  
2. `Initialize Replay Memory` $\mathcal{D}$.
3. `Initialize noise process` $\mathcal{N}_{t}$ `for each agent`.
4. `for episode = 1 to M:`
  - `Reset environment and get initial state` $s$
  - `Until the episode is complete do:`
    - `if episode <= 500 select random actions for agents:` $a_{t}\sim StandardNormal(0,1)$ `Otherwise:` $a_{t} = \mu^{\theta}(o_{t}) + \mathcal{N}_{t}$ `where` $o_{t}$ `is the local observation of agent t`
    - `Execute actions` $a=(a_{1},a_{2})$ `and observe reward` $r$ `and new state` $s^{'}$
    - `Store` $(s,a,r,s^{'})$ `in` $\mathcal{D}$
    - $s = s^{'}$
    - `if len`$(\mathcal{D})$ ` > UPDATE_AFTER then for agent = 1 to 2 do:` 
      - `Sample a uniformly random mini-batch` $\mathcal{B}$ `from` $\mathcal{D}$.   $\mathcal{B}= (s^{j},a^{j},r^{j},s^{'j})$  
      - `Set` $target_{t} = r^{j}_{t} + \gamma Q^{\phi^{'}}(s^{'j},a_{1}^{'},a_{2}^{'})\mid_{a_{t}^{'}=\mu^{\theta^{'}}(o_{t}^{'j})}$ `where` $o_{t}^{'j}$ `is the local observation of agent t in the next state` $s^{'j}$
      - `Update critic by minimizing the Huber loss:` $HuberLoss(target_{t},Q^{\phi}(s^{j},a_{1}^{j},a_{2}^{j}) )$
      - `Update the actor using sampled policy gradient:` $\nabla_{\theta_{t}}J \approx \frac{1}{B} \sum_{j} \nabla_{\theta_{t}}\mu_{t}(o_{t}^{j})\nabla_{a_{t}}Q^{\phi}(s^{j},a_{1}^{j},a_{2}^{j})\mid_{a_{t}^{j}=\mu^{\theta}(o_{t}^{j})}$
    - `Update target network parameters:`
      - $\theta^{'} = \tau \theta + (1-\tau)\theta^{'}$
      - $\phi^{'} = \tau \phi + (1-\tau)\phi^{'}$
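
The arithmetic of one update step can be sketched in plain Python on scalars and lists. This is only an illustration of the formulas above, not the project's implementation: the function names are hypothetical, the Huber threshold `delta = 1.0` is an assumption (the source does not state it), and no terminal-state mask is included because the pseudocode has none.

```python
from typing import List

GAMMA = 0.99  # discount factor, from the hyper-parameter list
TAU = 1e-3    # soft-update rate, from the hyper-parameter list


def td_target(reward: float, next_q: float, gamma: float = GAMMA) -> float:
    """Target for one transition: r + gamma * Q'(s', a1', a2')."""
    return reward + gamma * next_q


def huber_loss(target: float, prediction: float, delta: float = 1.0) -> float:
    """Quadratic for small errors, linear for large ones (delta is assumed)."""
    error = abs(target - prediction)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)


def soft_update(local: List[float], target: List[float],
                tau: float = TAU) -> List[float]:
    """Return tau * local + (1 - tau) * target, element by element."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local, target)]
```

Because `TAU` is small, each call to `soft_update` moves the target parameters only 0.1% of the way toward the local parameters, which keeps the targets used by the critic slowly changing.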


## Hyper-parameters

- BUFFER_SIZE = $10^{6}$
   - Size of the replay buffer
- BATCH_SIZE = 128 
   - Number of transitions sampled in each mini-batch
- GAMMA = 0.99
   - Discount factor
- TAU = $10^{-3}$ 
   - Parameter for soft update of the target network parameters
- LR_ACTOR = $10^{-3}$
   - Learning rate for actor network 
- LR_QNET = $10^{-4}$
   - Learning rate for critic network
- WEIGHT_DECAY = 0
   - L2 weight decay parameters for network layers
- NOISE_SCALE = 0.1
   - Scaling parameter of the noise added to actions
- UPDATE_AFTER = 1000
   - Minimum number of samples in the buffer after which learning can start
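
To show how `BUFFER_SIZE`, `BATCH_SIZE` and `UPDATE_AFTER` interact, here is a minimal replay-buffer sketch using only the standard library. The class `ReplayBuffer`, its method names and the `seed` argument are hypothetical and are not taken from the project's code.

```python
import random
from collections import deque, namedtuple

BUFFER_SIZE = 10 ** 6
BATCH_SIZE = 128
UPDATE_AFTER = 1000

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size buffer; the oldest transitions are dropped once it is full."""

    def __init__(self, capacity=BUFFER_SIZE, seed=0):
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.memory.append(Transition(state, action, reward, next_state))

    def ready(self, update_after=UPDATE_AFTER):
        """Learning starts only once more than update_after samples are stored."""
        return len(self.memory) > update_after

    def sample(self, batch_size=BATCH_SIZE):
        """Uniformly sample a mini-batch without replacement."""
        return self.rng.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```

A training loop would call `add` after every environment step and call `sample` only when `ready()` returns true, mirroring the `len(D) > UPDATE_AFTER` check in the algorithm.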

### Critic Architecture

- The critic network takes as input the states of the two agents and the actions of the two agents (for the target value, the actions predicted by the two target actors).
- Each individual state has size 24, so the combined state of the two agents is a tensor of size (48,), which is the input to the first linear layer.
- The input to the second linear layer is a concatenated tensor: the action tensor from the first actor is concatenated with the output of the first linear layer + ReLU block.
- Similarly, the input to the third linear layer is a concatenated tensor: the action tensor from the second actor is concatenated with the output of the second linear layer + ReLU block.
- The output is a single value.

The detailed architecture is given below:

1. **linear1**: Linear(in_features=48, out_features=512, bias=True)
2. **ReLU** layer
3. **Concatenation**: Concatenate the output of the `linear1 + ReLU` block with an action tensor of size (2,). This action comes from the first actor.
4. **linear2**: Linear(in_features=514, out_features=256, bias=True)
5. **ReLU** layer
6. **Concatenation**: Concatenate the output of the `linear2 + ReLU` block with an action tensor of size (2,). This action comes from the second actor.
7. **linear3**: Linear(in_features=258, out_features=256, bias=True)
8. **ReLU** layer
9. **linear4**: Linear(in_features=256, out_features=128, bias=True)
10. **ReLU** layer
11. **linear5**: Linear(in_features=128, out_features=1, bias=True)
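
The layer list above maps onto a PyTorch module roughly as follows. This is a sketch that assumes PyTorch (suggested by the `Linear(in_features=..., bias=True)` notation); the class name `Critic` and the argument names are illustrative, and weight initialization is left at PyTorch defaults because the source does not describe it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Critic(nn.Module):
    """Centralized critic: joint state in, actions injected at layers 2 and 3."""

    def __init__(self, state_size=48, action_size=2):
        super().__init__()
        self.linear1 = nn.Linear(state_size, 512)
        self.linear2 = nn.Linear(512 + action_size, 256)  # 514 inputs
        self.linear3 = nn.Linear(256 + action_size, 256)  # 258 inputs
        self.linear4 = nn.Linear(256, 128)
        self.linear5 = nn.Linear(128, 1)

    def forward(self, states, action1, action2):
        x = F.relu(self.linear1(states))
        x = torch.cat([x, action1], dim=1)  # append the first actor's action
        x = F.relu(self.linear2(x))
        x = torch.cat([x, action2], dim=1)  # append the second actor's action
        x = F.relu(self.linear3(x))
        x = F.relu(self.linear4(x))
        return self.linear5(x)              # one Q-value per batch row
```

For a batch of `B` transitions, `states` has shape `(B, 48)`, each action has shape `(B, 2)`, and the output has shape `(B, 1)`.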

![CriticArchitecture](./critic-architecture.png)

### Actor Architecture

An actor takes its local state as input (a tensor of size (24,)) and outputs a tensor of size (2,).

Detailed architecture is given below:

1. **linear1**: Linear(in_features=24, out_features=128, bias=True)
2. **ReLU** layer
3. **linear2**: Linear(in_features=128, out_features=2, bias=True)
4. **activation**: Tanh()
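
The actor layers above could be written in PyTorch along these lines. As with any sketch, the class name `Actor` and the argument names are assumptions, and default PyTorch initialization is used because the source does not specify one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Decentralized actor: local observation in, bounded action out."""

    def __init__(self, state_size=24, action_size=2, hidden_size=128):
        super().__init__()
        self.linear1 = nn.Linear(state_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, action_size)

    def forward(self, observation):
        x = F.relu(self.linear1(observation))
        # Tanh keeps every action component inside [-1, 1].
        return torch.tanh(self.linear2(x))
```

The `Tanh` output bounds each action component to the interval [-1, 1] before any exploration noise is added.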

![ActorArchitecture](./actor-architecture.png)

### Plot of the Rewards

**The environment is solved in 1600 episodes**

![Result](./result.png)

### Future Work

- **Priority Buffer**: Incorporate a replay memory with priority-based sampling instead of uniform sampling. This might speed up learning.
- **Tuning Hyperparameters**: Try different hyper-parameter settings and report how they affect learning time, or whether the model diverges for some settings.
- **Ensemble of actors**: As mentioned in the [paper](https://arxiv.org/pdf/1706.02275.pdf), a single policy might overfit to the behaviour of the other agent, so an ensemble of policies might produce better results.
- **Inferring Policies of other agent**: Also mentioned in the [paper](https://arxiv.org/pdf/1706.02275.pdf): instead of using the actions predicted by the other agents, each agent keeps an approximation of the other agents' policies and learns this approximation.

### Trained agents playing
![AgentPlaying](./sample_play.gif)