# Learning algorithm

The proposed solution uses the Multi-Agent Deep Deterministic Policy Gradient ([MADDPG](https://arxiv.org/abs/1706.02275)) algorithm to solve the Tennis environment. MADDPG extends the DDPG algorithm to environments with multiple agents.

As a recap, the DDPG algorithm combines Value-based and Policy-based methods by using 2 different networks:

- Actor: models the policy. That is, it returns an action for a given state
- Critic: models the Q-values. That is, it returns the expected return for a given state-action pair

MADDPG proposes a centralized approach for the Critic network and a decentralized approach for the Actor network. This means that each agent only uses local information to select its actions based on the policy, while the critic uses information from all agents. This extra information makes training easier and allows for centralized training with decentralized execution, i.e. each agent takes actions based only on its own observations of the environment.
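
The split between local and global information can be illustrated with a minimal sketch. The function names and the toy values below are hypothetical and not taken from the project code; the only point is which inputs each network sees. How the critic combines states and actions internally may differ from this flat concatenation.

```python
from typing import List, Sequence


def actor_input(observations: Sequence[Sequence[float]], agent_idx: int) -> List[float]:
    """Decentralized actor: agent `agent_idx` only sees its own observation."""
    return list(observations[agent_idx])


def critic_input(observations: Sequence[Sequence[float]],
                 actions: Sequence[Sequence[float]]) -> List[float]:
    """Centralized critic: observations and actions of all agents, flattened."""
    flat_obs = [x for obs in observations for x in obs]
    flat_act = [a for act in actions for a in act]
    return flat_obs + flat_act


# Toy example with 2 agents, 3-dimensional observations and 2-dimensional actions.
observations = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
actions = [[0.5, -0.5], [-1.0, 1.0]]

print(actor_input(observations, 0))              # what agent 0's actor sees
print(len(critic_input(observations, actions)))  # 2 * 3 + 2 * 2 = 10
```

At execution time only `actor_input` is needed, which is why the trained agents can act without access to each other's observations.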

![maddpg](./maddpg.png)

This version of the algorithm uses Experience replay, target networks and noise decay.

## High level description

### Neural networks architecture

Each agent has 2 NNs: Actor and Critic. Each one has an associated target network, which helps reduce the correlation between the target values and the weights being updated. Multiple NN architectures were tried, with these final structures:

- Actor: 3 dense layers with decreasing dimensionality. Inputs are tensors of size 24 (states), and the layer dimensions are 256-128-2, so the final output size matches the action space size. Each dense layer is preceded by a batch normalization layer, which should reduce training time. The hidden activation functions are leaky ReLUs, which allow outputs smaller than 0. The final activation is a tanh function since the action space is continuous; it bounds each action component between -1 and 1.

- Critic: 3 dense layers with decreasing dimensionality. The first block is a batch normalization layer followed by a dense layer that accepts the states and outputs tensors of dimension 256. These tensors are then concatenated with the actions and passed through 2 more dense layers, with dimensions 128-1. All activation functions are leaky ReLUs. It is important to note that this network receives states and actions from all agents.
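
To make the layer sizes concrete, the following sketch tabulates the dense layer shapes described above. It assumes that the critic receives the states and actions of both agents as simple concatenations (48 state values and 4 action values); that layout is an assumption, not confirmed by the text. Batch normalization parameters are left out of the counts.

```python
STATE_SIZE = 24   # per-agent observation size (from the text)
ACTION_SIZE = 2   # per-agent action size (from the text)
NUM_AGENTS = 2    # the Tennis environment has two agents


def dense_params(n_in: int, n_out: int) -> int:
    """Number of weights plus biases in a fully connected layer."""
    return n_in * n_out + n_out


# Actor: 24 -> 256 -> 128 -> 2, fed with one agent's observation.
actor_layers = [(STATE_SIZE, 256), (256, 128), (128, ACTION_SIZE)]

# Critic (assumed layout): all states -> 256, then [256 + all actions] -> 128 -> 1.
joint_states = NUM_AGENTS * STATE_SIZE     # 48
joint_actions = NUM_AGENTS * ACTION_SIZE   # 4
critic_layers = [(joint_states, 256), (256 + joint_actions, 128), (128, 1)]

for name, layers in (("actor", actor_layers), ("critic", critic_layers)):
    total = sum(dense_params(n_in, n_out) for n_in, n_out in layers)
    print(name, layers, "dense parameters:", total)
```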

### Algorithm

- 2 different agents are initialized
- For each agent, the Actor and Critic networks (local & target, 4 in total) are initialised with random weights
- An initial state is drawn from the environment. We have to keep in mind that this environment has 2 agents, so there is one state and one action per agent
- For each episode, an initial state is taken, and the algorithm alternates between 2 operations: Sample and Learn
- When sampling, the algorithm chooses an action using the Actor networks. Each agent receives its own state and outputs an action. The actions are concatenated and passed to the environment, which returns the rewards and the following state. This 'experience' is stored in a replay memory.
- When selecting actions, Ornstein-Uhlenbeck (OU) noise is added to encourage exploration. A simple decay process is applied to the noise.
- Each agent learns separately every N steps, with multiple learning iterations per learning step. Each agent samples a batch of experiences from the replay memory, which includes states and actions from all agents. Then, it updates the Critic & Actor networks using the local & target networks, in a very similar way to the DDPG algorithm. The only difference is that the Critic uses information from all agents (states & actions)
- The target networks are updated using a soft update approach.
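
The sampling and learning steps above can be sketched as follows. This is a simplified illustration, not the project's implementation: the class and function names, the buffer capacity and the uniform sampling are assumptions, and the critic target is shown for a single scalar value instead of a batch of tensors.

```python
import random
from collections import deque, namedtuple

# One stored transition holds the observations and actions of all agents.
Experience = namedtuple(
    "Experience", ["states", "actions", "rewards", "next_states", "dones"]
)


class ReplayMemory:
    """Fixed-size buffer of joint experiences, sampled uniformly at random."""

    def __init__(self, capacity: int, batch_size: int, seed: int = 0):
        self.memory = deque(maxlen=capacity)  # oldest entries are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, states, actions, rewards, next_states, dones):
        self.memory.append(Experience(states, actions, rewards, next_states, dones))

    def sample(self):
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


def should_learn(step: int, memory_size: int, batch_size: int, learn_every: int) -> bool:
    """Learn every `learn_every` steps, once a full batch is available."""
    return step % learn_every == 0 and memory_size >= batch_size


def td_target(reward: float, next_q: float, done: bool, gamma: float = 0.99) -> float:
    """Critic target: r + gamma * Q_target(next joint state, next joint action)."""
    return reward + gamma * next_q * (1.0 - float(done))


memory = ReplayMemory(capacity=1000, batch_size=4)
for step in range(1, 11):
    # Dummy joint transition for 2 agents (values are placeholders).
    memory.add([[0.0], [0.0]], [[0.1], [-0.1]], [0.0, 0.1], [[0.0], [0.0]], [False, False])
    if should_learn(step, len(memory), memory.batch_size, learn_every=1):
        batch = memory.sample()  # each agent would draw its own batch here
```

In `td_target`, `next_q` stands for the target critic's output when it is fed the next states of all agents and the next actions proposed by the target actors; the local critic is then regressed towards this value.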

# Experiments

Before arriving at the final solution, multiple tests were carried out.

1. First, the DDPG agent implemented in the [second project](https://github.com/gscharly/drl_p2_continous_control) was used with a couple of modifications. The same Actor & Critic networks are used, and the experiences of each agent are used to update them. The environment was solved in around 2500 episodes, suggesting that the training process could probably be faster.
2. Therefore, the MADDPG approach was adopted to better suit the multi-agent environment, where both agents need to collaborate and compete. The environment was solved in fewer than 1000 episodes.
3. Different NN architectures were tried out. Using batch normalization seems to reduce training time, and the best results were achieved using leaky ReLU activation functions.

The DDPG agent's weights & results can be found under weights/ddpg. The MADDPG agents' artifacts can be found under weights/maddpg.

# Hyperparameters

- Actor learning rate: 1e-3
- Critic learning rate: 1e-3
- Discount factor: 0.99
- Tau (soft update): 1e-3
- Batch size: 128
- Learn every step, with 5 learning iterations in each step
- Noise decay: 0.999
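
As a rough illustration of how tau and the noise decay act, here is a minimal sketch of decayed Ornstein-Uhlenbeck noise and of the soft update. The `theta` and `sigma` values, the Gaussian increments and the choice to decay the noise scale once per step are assumptions made for the example; the text does not specify them.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise that reverts to `mu`."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.rng = random.Random(seed)
        self.state = [mu] * size

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1), applied per action dimension
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


def soft_update(local, target, tau=1e-3):
    """target <- tau * local + (1 - tau) * target, element by element."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local, target)]


NOISE_DECAY = 0.999
noise = OUNoise(size=2)
scale = 1.0
for _ in range(1000):
    perturbation = [scale * n for n in noise.sample()]  # would be added to the action
    scale *= NOISE_DECAY

print(round(scale, 3))                      # 0.999 ** 1000, roughly 0.37
print(soft_update([1.0, 1.0], [0.0, 0.0]))  # [0.001, 0.001]
```

Under these assumptions, the exploration noise drops to about a third of its initial scale after 1000 decay steps, while each soft update moves the target weights only 0.1% of the way towards the local weights.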

# Rewards plot

Both DDPG and MADDPG rewards plots are included. DDPG was trained for 3000 episodes and MADDPG for 1000 episodes.

- Number of episodes required to solve the problem using DDPG: 2500.
- Number of episodes required to solve the problem using MADDPG: 837.
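
The episode counts above depend on how "solved" is defined. The sketch below assumes the usual criterion for this environment, an average score of at least 0.5 over 100 consecutive episodes; the threshold, the window and the function name are assumptions, since the text does not state them, and the scores are synthetic.

```python
from collections import deque


def first_solved_episode(episode_scores, target=0.5, window=100):
    """Return the 1-based episode at which the moving average first reaches `target`.

    Returns None if the target is never reached.
    """
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Synthetic scores for illustration only (not the project's results).
scores = [0.0] * 150 + [1.0] * 100
print(first_solved_episode(scores))  # 200
```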

## DDPG
![ddpg](../weights/ddpg/scores.png)

## MADDPG
![maddpg](../weights/maddpg/scores.png)

# Agents playing!

![agents](agent.gif "agent")

# Ideas for future work

- Implement [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952). This can improve learning by increasing the probability of sampling important experiences.
- Implement [Adaptive noise scaling](https://soeren-kirchner.medium.com/deep-deterministic-policy-gradient-ddpg-with-and-without-ornstein-uhlenbeck-process-e6d272adfc3). Instead of adding noise to the action, noise is added to the Actor's weights, which can lead to more consistent exploration and a richer set of behaviors. It is adaptive because the noise is increased or decreased based on the distance between the original action and the action selected by the perturbed Actor.
- Further hyperparameter tuning: learning rates, noise decay...
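
As a starting point for the first idea, the sketch below shows the proportional sampling rule from the Prioritized Experience Replay paper, where experience `i` is drawn with probability `p_i^alpha / sum_k p_k^alpha`. The priorities and the `alpha` value are illustrative assumptions, and the importance-sampling correction that the paper also uses is left out.

```python
import random


def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha (proportional prioritization)."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical priorities, e.g. absolute TD errors plus a small constant.
priorities = [0.1, 0.1, 2.0, 0.5]
probs = sampling_probabilities(priorities)

rng = random.Random(0)
batch_indices = rng.choices(range(len(priorities)), weights=probs, k=2)
print([round(p, 3) for p in probs])
print(batch_indices)
```

Setting `alpha=0` recovers the uniform sampling of a standard replay memory, so the same code path can be used to compare both strategies.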