# Learning algorithm

The proposed solution uses the Deep Deterministic Policy Gradient (DDPG) algorithm to solve the multi-agent version of the Reacher environment. DDPG is suited to problems where the state and action spaces are continuous and high-dimensional. It combines value-based and policy-based methods by using 2 different neural networks:

- Actor: models the policy. That is, it returns an action for a given state
- Critic: models the Q-values. That is, it returns the expected return for a given state-action pair

This version of the algorithm uses Experience replay, target networks and adaptive noise scaling.

## High level description

### Neural networks architecture

The agent has 2 NNs: Actor and Critic. Each one has an associated target network, which stabilizes training by decoupling the learning targets from the weights that are being updated. Several NN architectures were tried. The final ones are:

- Actor: 3 dense layers with decreasing dimensionality. Inputs are tensors of size 33 (states), and the layer dimensions are 256-128-4. The final output size matches the action space size. Each dense layer is preceded by a batch normalization layer, which should reduce the training time. The hidden activation functions are leaky ReLUs, which allow outputs smaller than 0. The final activation is a tanh function, since the action space is continuous and tanh keeps each output in the range (-1, 1).

- Critic: 4 dense layers with decreasing dimensionality. The first block consists of a batch normalization layer and a dense layer that takes the states and outputs tensors of dimension 256. These tensors are then concatenated with the actions and passed through 3 more dense layers, with dimensions 128-64-1. All activation functions are leaky ReLUs.
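
As a quick sanity check on the layer sizes above, the number of trainable parameters in the dense layers can be worked out by hand. The snippet below is only a sketch: it assumes every dense layer has a bias term and it ignores the batch normalization parameters, since their exact placement is not fully specified here. The helper name `dense_params` is hypothetical.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer (bias assumed)."""
    return n_in * n_out + n_out


STATE_SIZE, ACTION_SIZE = 33, 4

# Actor: state -> 256 -> 128 -> action
actor_layers = [(STATE_SIZE, 256), (256, 128), (128, ACTION_SIZE)]

# Critic: state -> 256, then [256 features + 4 action values] -> 128 -> 64 -> 1
critic_layers = [
    (STATE_SIZE, 256),
    (256 + ACTION_SIZE, 128),  # concatenation widens the input to 260
    (128, 64),
    (64, 1),
]

actor_total = sum(dense_params(i, o) for i, o in actor_layers)
critic_total = sum(dense_params(i, o) for i, o in critic_layers)

print("Actor dense parameters:", actor_total)    # 42116
print("Critic dense parameters:", critic_total)  # 50433
```

Note that the second Critic layer receives 260 inputs rather than 256, because the 4 action values are concatenated with the state features.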

### Algorithm
- The Actor and Critic networks (local & target, 4 in total) are initialised with random weights
- An initial state is drawn from the environment. Since this environment has 20 agents, every step yields one state and one action per agent
- Each episode runs for T steps. At each step, the algorithm performs 2 operations: Sample and Learn
- When sampling, the algorithm chooses an action using the Actor network and receives the reward and the next state from the environment. This 'experience' is stored in a replay memory.
- When selecting the action, Adaptive noise scaling is used, following this [blog](https://soeren-kirchner.medium.com/deep-deterministic-policy-gradient-ddpg-with-and-without-ornstein-uhlenbeck-process-e6d272adfc3). In a nutshell, instead of adding noise to the action, noise is added to the Actor's weights, which can lead to more consistent exploration and a richer set of behaviors. It is adaptive because the noise is increased or decreased based on the distance between the original action and the action selected by the perturbed Actor.
- When learning, the algorithm samples a batch of experiences from the replay memory. This helps decorrelate consecutive steps. Then, it updates the Critic network using the local & target networks, in much the same way as the Deep Q-learning algorithm from the Value-based methods section. The Actor is updated using as its loss the negated Q-values that the Critic generates for the current state and the action predicted by the current Actor, so that minimizing the loss maximizes the expected return.
- The target networks are updated using a soft update approach: at each update, a small fraction (tau) of the local weights is blended into the target weights.
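
The learning step can be illustrated without any deep learning framework. The following is a minimal sketch in plain Python, not the project's actual code: `ReplayBuffer`, `td_target` and `soft_update` are hypothetical names, the weights are represented as flat lists of floats, and the buffer capacity is an arbitrary value because this report does not state one. The default discount factor and tau are the ones listed in the hyperparameters section.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", "state action reward next_state done")


class ReplayBuffer:
    """Fixed-size memory; the oldest experiences are dropped first."""

    def __init__(self, capacity: int, seed: int = 0):
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done) -> None:
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self) -> int:
        return len(self.memory)


def td_target(reward: float, next_q: float, done: bool, gamma: float = 0.99) -> float:
    """Critic regression target: r + gamma * Q_target(s', a') unless the episode ended."""
    return reward + gamma * next_q * (0.0 if done else 1.0)


def soft_update(local_weights, target_weights, tau: float = 1e-3):
    """Move each target weight a fraction tau towards the local weight."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_weights, target_weights)]


# Usage with toy values (capacity of 1000 is an assumption).
buffer = ReplayBuffer(capacity=1000)
for step in range(10):
    buffer.add(state=step, action=0.1, reward=0.5, next_state=step + 1, done=False)
batch = buffer.sample(4)
targets = [td_target(e.reward, next_q=1.0, done=e.done) for e in batch]
new_target_weights = soft_update([1.0, 2.0], [0.0, 0.0])
```

In a real implementation, `next_q` would come from the target Critic evaluated on the next state and the action proposed by the target Actor.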

# Experiments
Before arriving at the final solution, several experiments were carried out.

1. First, the single-agent environment was used to train a simple agent. The DDPG implementation from the course's Bipedal environment [solution](https://github.com/udacity/deep-reinforcement-learning/blob/master/ddpg-bipedal/ddpg_agent.py) was borrowed as a baseline. The actions were perturbed using OU noise. The agent was unable to learn in the first 300 episodes, so we decided to move on to the multi-agent environment to see if gathering more experiences improved the learning process.
2. The same agent was tested with the multi-agent environment. At first it appeared to learn, but after episode 150 the performance started to decrease.
3. The neural network structure was changed by adding more layers, switching the activation function to leaky ReLU (SELU was also tried) and adding batch normalization layers. This seemed to do the trick: the agent reached a score of 30 after 60 episodes and solved the environment within the first 150 episodes.
4. Finally, Adaptive noise scaling was introduced with promising results. The agent reached a score of 30 in episode 21, and solved the environment in 102 episodes.

# Hyperparameters

- Actor learning rate: 1e-4
- Critic learning rate: 1e-3
- Discount factor: 0.99
- Tau (soft update): 1e-3
- Batch size: 256
- Adaptive noise scalar: 0.05
- Adaptive noise distance: 0.7
- Adaptive noise decay: 0.99
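
One plausible reading of the three adaptive noise values is: the starting scale of the weight noise (0.05), the desired distance between the clean and the perturbed action (0.7), and the factor used to shrink or grow the scale (0.99). The sketch below illustrates that reading only. The function names are hypothetical, the root-mean-square distance is an assumption, and the exact rule in the referenced blog or in this project's code may differ.

```python
import math


def action_distance(action, noisy_action) -> float:
    """Root-mean-square difference between the clean and the perturbed action."""
    squared = [(a - b) ** 2 for a, b in zip(action, noisy_action)]
    return math.sqrt(sum(squared) / len(squared))


def adapt_noise_scalar(scalar: float, distance: float,
                       desired_distance: float = 0.7, decay: float = 0.99) -> float:
    """Shrink the noise when the perturbed policy strays too far, grow it otherwise."""
    if distance > desired_distance:
        return scalar * decay
    return scalar / decay


scalar = 0.05
clean = [0.2, -0.4, 0.1, 0.9]    # action from the unperturbed Actor (toy values)
noisy = [0.25, -0.3, 0.0, 0.85]  # action from the Actor with noisy weights (toy values)
scalar = adapt_noise_scalar(scalar, action_distance(clean, noisy))
print(scalar)  # slightly larger than 0.05, since the two actions are close
```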

# Rewards plot

Number of episodes required to solve the environment: 102. The agent was trained for 200 episodes and the weights of the NNs can be found under ../weights
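
For reference, the "solved" check can be sketched as a rolling average over recent episode scores. This sketch assumes the usual criterion for this environment, an average score of at least 30 over 100 consecutive episodes, with each episode score already averaged over the 20 agents. The function name `first_solved_episode` is hypothetical.

```python
from collections import deque


def first_solved_episode(episode_scores, target: float = 30.0, window: int = 100):
    """Return the first episode (1-indexed) whose trailing average reaches the target."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None  # target never reached


# Toy data: 50 episodes at score 0, then 100 episodes at score 40.
print(first_solved_episode([0.0] * 50 + [40.0] * 100))  # 125
```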

The rewards plot compares the last 2 experiments. The agent with Adaptive noise scaling reaches the target faster and holds its average score at the target level for the following 100 episodes.


![title](./rewards_comparison.png)

# Agent playing!

![SegmentLocal](agent.gif "segment")

# Ideas for future work

- Try out different neural network architectures: tune the layer dimensions, add or remove layers, and change the activation functions
- Implement other algorithms that might work in this environment, such as Proximal Policy Optimization (PPO) or Distributed Distributional Deterministic Policy Gradients (D4PG)
- Try to update the NN weights less frequently instead of at every step. That might stabilize learning and avoid the sudden drops in the scores
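
The last idea could be prototyped with a simple gate around the learning step. This is a sketch only: `should_learn` is a hypothetical helper and the interval of 20 steps is an arbitrary illustrative value, not a tuned setting. The batch size of 256 is the one used in this report.

```python
def should_learn(step: int, buffer_size: int,
                 batch_size: int = 256, learn_every: int = 20) -> bool:
    """Allow a learning update only every `learn_every` steps, once a full batch exists."""
    return buffer_size >= batch_size and step % learn_every == 0


# Count the updates in a hypothetical 1000-step episode with a well-filled buffer.
updates = sum(should_learn(t, buffer_size=10_000) for t in range(1, 1001))
print(updates)  # 50 updates instead of 1000
```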