## Problem

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own, local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

The task is episodic, and in order to solve the environment, your agents must get an **average score of +0.5** (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,

- After each episode, we add up the rewards that each agent received (without discounting) to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
- This yields a single score for each episode.
- The environment is considered solved when the average (over 100 episodes) of those scores is at least +0.5.
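
The scoring rule described above can be sketched in a few lines of Python. The names `episode_score` and `SolveTracker` are hypothetical and are not taken from the project code; this is only an illustration of the max-over-agents and 100-episode averaging steps.

```python
from collections import deque


def episode_score(rewards_per_agent):
    """Sum each agent's undiscounted rewards and keep the larger total."""
    return max(sum(rewards) for rewards in rewards_per_agent)


class SolveTracker:
    """Keeps the last `window` episode scores and checks the solve criterion."""

    def __init__(self, window=100, target=0.5):
        self.scores = deque(maxlen=window)  # old scores drop out automatically
        self.target = target

    def add(self, score):
        self.scores.append(score)

    def average(self):
        return sum(self.scores) / len(self.scores)

    def solved(self):
        # Only a full window of consecutive episodes counts.
        full = len(self.scores) == self.scores.maxlen
        return full and self.average() >= self.target
```

A training loop would call `tracker.add(episode_score(rewards))` after each episode and stop once `tracker.solved()` returns `True`.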

## Learning Algorithm

The implemented algorithm is Multi-Agent DDPG (MADDPG). Like the Vanilla Policy Gradient (or REINFORCE), DDPG is a policy gradient method, but it differs in some respects. The Actor network produces a deterministic policy (hence the name of the algorithm), and its action(s) are fed into the Critic network as an input in order to evaluate the quality of that state/action pair. In this way, the Actor-Critic architecture is extended to the continuous action domain.

The multi-agent version is slightly more complicated.

![MADDPG architecture](imgs/model.png)

source: https://arxiv.org/pdf/1706.02275.pdf

Each Actor receives only the observation of its own agent and outputs that agent's action. This makes it possible, in the execution phase, for each agent to act based solely on its own observations. To be aware of the rest of the environment, however, the Critic network takes the states and the actions of ALL the agents. This is a source of instability in training, but it can lead to much better performance, since during the training phase each agent can know what the other agents are *thinking* and doing, and can adapt its behaviour to maximize the reward.

The Q-Networks for each agent are different, but they both have the same inputs.
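
The difference between the local Actor input and the shared Critic input can be illustrated with plain Python lists. This is a minimal sketch: the helper names and the ordering of the joint vector (all observations first, then all actions) are assumptions, not details confirmed by the project code.

```python
STATE_SIZE = 24   # per-agent observation length, taken from the Mu input shape
ACTION_SIZE = 2   # movement toward/away from the net, and jumping
NUM_AGENTS = 2


def actor_input(observations, agent_index):
    """An actor only sees the observation of its own agent."""
    return list(observations[agent_index])


def critic_input(observations, actions):
    """Every critic sees the observations and actions of all agents.

    Assumed ordering: all observations first, then all actions.
    """
    joint = []
    for obs in observations:
        joint.extend(obs)
    for act in actions:
        joint.extend(act)
    return joint


# Dummy data standing in for one environment step.
observations = [[0.0] * STATE_SIZE for _ in range(NUM_AGENTS)]
actions = [[0.0] * ACTION_SIZE for _ in range(NUM_AGENTS)]
```

Both critics would receive the same vector from `critic_input(observations, actions)`, while each actor only receives its own slice.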

The networks have the following architecture:
    
**MU NETWORK**

    Input shape: (24, )
    Dense_1 : 256 neurons, ReLU activation
    Dense_2 : 128 neurons, ReLU activation
    Output: 2 actions (TanH activation)

**Q NETWORK**

    Input shape: (state_size+action_size) * num_agents
    Dense_1 : 256 neurons, ReLU activation
    Dense_2 : 128 neurons, ReLU activation
    Output: 1 value
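
As a quick sanity check on the layer sizes listed above, the following sketch computes the input width of the Q network and the number of trainable parameters of both networks. It assumes plain fully connected layers with biases, `state_size = 24` (the Mu input shape), `action_size = 2`, and `num_agents = 2`; the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes):
    """Total parameters of a stack of dense layers with the given widths."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


state_size, action_size, num_agents = 24, 2, 2

# Layer widths from input to output, as listed in the architecture above.
mu_layers = [state_size, 256, 128, action_size]
q_layers = [(state_size + action_size) * num_agents, 256, 128, 1]

mu_total = mlp_params(mu_layers)
q_total = mlp_params(q_layers)
```

Under these assumptions the Q network input has (24 + 2) * 2 = 52 entries.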


A soft update of the target Mu and Q network weights is made with a blending factor TAU=0.001 at every timestep.
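
A soft update blends a small fraction of the local weights into the target weights. The sketch below works on plain lists of floats to keep it dependency-free; the function name and the list representation are assumptions for illustration, not the project's actual implementation.

```python
def soft_update(target_weights, local_weights, tau=0.001):
    """Return target <- tau * local + (1 - tau) * target, element by element."""
    return [
        tau * w_local + (1.0 - tau) * w_target
        for w_target, w_local in zip(target_weights, local_weights)
    ]


# With TAU=0.001 the target moves only slightly toward the local weights.
target = [0.0, 1.0]
local = [1.0, 1.0]
target = soft_update(target, local)
```

A small TAU keeps the target networks changing slowly, which is the usual motivation for soft updates in DDPG-style methods.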

To train, a batch size of 128 is used, along with a Replay Buffer of capacity 100000. The optimizer is Adam with a learning rate (LR) of 10e-3 for both networks.
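
A replay buffer with these settings could look roughly like the following. This is a minimal sketch using only the standard library; the class name, the field names of `Experience`, and the uniform sampling strategy are assumptions rather than the project's actual code.

```python
import random
from collections import deque, namedtuple

# One stored transition for all agents (field names are illustrative).
Experience = namedtuple(
    "Experience", ["states", "actions", "rewards", "next_states", "dones"]
)


class ReplayBuffer:
    """Fixed-capacity buffer that samples uniformly at random."""

    def __init__(self, capacity=100_000, batch_size=128, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest entries are discarded
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, states, actions, rewards, next_states, dones):
        self.memory.append(Experience(states, actions, rewards, next_states, dones))

    def ready(self):
        """True once there are enough experiences for one batch."""
        return len(self.memory) >= self.batch_size

    def sample(self):
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)
```

Learning steps would typically be skipped until `ready()` returns `True`.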

A discount factor GAMMA of 0.99 is used.
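
The discount factor enters training through the Critic's learning target. The sketch below assumes the usual one-step TD target used by DDPG-style critics; the source only states the value of GAMMA, so the function name and exact form are assumptions.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """One-step TD target: r + gamma * Q'(s', a') for non-terminal steps.

    `next_q` stands for the target critic's estimate at the next state,
    and `done` marks the end of an episode (no bootstrapping in that case).
    """
    return reward + gamma * next_q * (1.0 - float(done))
```

For a terminal transition the target reduces to the immediate reward.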


The number of episodes is set to 1000. Each episode runs until it is done.

The model weights are saved in '*mu.pth*' and '*q.pth*'.

## Plot of reward

This is the plot of reward over time. Training is finished when the average reward is over 0.5 (red line).

![reward plot](imgs/scores.png)


## Ideas for Future Work

To improve performance or speed up training, several ideas are proposed:

- Try different hyperparameters.
- Try a noise scheduler. To reduce exploration as training progresses, we could scale the noise in an epsilon-greedy fashion, reducing it in the later stages of training. This might not work and could prevent further learning, but it is definitely a good idea to try.
- Implement Prioritized Experience Replay.
- Implement a distributed version.
- Try Hill Climbing for the neural network weights.
- Analyze the stability of the training. Since the environment is non-stationary, we could try freezing one network and training the other. This could be a good exercise.
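
The noise scheduler idea from the list above could be sketched as an exponentially decaying scale factor applied to the exploration noise. The function name and the values of `start`, `end`, and `decay` are illustrative assumptions, not tuned settings.

```python
def noise_scale(episode, start=1.0, end=0.05, decay=0.995):
    """Exponentially decay the noise scale per episode, never going below `end`."""
    return max(end, start * decay ** episode)


def scaled_action(action, noise, episode):
    """Add scaled noise to each action component and clip to the TanH range."""
    scale = noise_scale(episode)
    return [max(-1.0, min(1.0, a + scale * n)) for a, n in zip(action, noise)]
```

Keeping a floor (`end`) preserves a small amount of exploration late in training, which addresses the concern that removing noise entirely could prevent further learning.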