Student: Steven Hooker

Project: Collaboration and Competition - Tennis

Course: Deep Reinforcement Learning

### Environment

The main goal of this project is to train two agents to solve the "Tennis" environment. Within this environment, the agents need to collaboratively hit the ball so that it stays in play.

The following is the given description of the challenge.

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own, local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.


The task is episodic, and in order to solve the environment, your agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,

* After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
* This yields a single score for each episode.
* The environment is considered solved when the average (over 100 episodes) of those scores is at least +0.5.
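
The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the description, not code from the project; the function names `episode_score` and `is_solved` are made up for this example, and each episode is assumed to contain at least one step.

```python
def episode_score(rewards_per_step):
    """Score of one episode.

    rewards_per_step: list of (reward_agent_0, reward_agent_1) tuples,
    one tuple per time step (assumed non-empty).
    """
    # Sum the undiscounted rewards separately for each agent ...
    totals = [sum(agent_rewards) for agent_rewards in zip(*rewards_per_step)]
    # ... and keep the larger of the two totals as the episode score.
    return max(totals)


def is_solved(episode_scores, window=100, target=0.5):
    """True once the mean of the last `window` episode scores reaches `target`."""
    if len(episode_scores) < window:
        return False
    recent = list(episode_scores)[-window:]
    return sum(recent) / window >= target


# Agent 0 hits the ball over the net twice, agent 1 lets it drop once.
score = episode_score([(0.1, 0.0), (0.0, -0.01), (0.1, 0.0)])
```

In a training loop, `episode_score` would be called once per episode and the result appended to a list that `is_solved` checks.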

### Agent & Model

To solve the challenge, the MADDPG algorithm was implemented ([MADDPG paper](https://arxiv.org/pdf/1706.02275.pdf)).


* MADDPG is an adaptation of the DDPG method for multi-agent problems.
* Information specific to DDPG can be found in the report of the previous project: [p2_continuous control Report](https://github.com/luctrate/p2_continuous-control/blob/master/Report.ipynb)
* To make use of collective experience, the critic uses the combined state space of both agents to judge situations. Each agent acts through its actor using only its local observation, and receives feedback on collaborative opportunities through the critic. This is called centralized training with decentralized execution.
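
The split between local actors and a centralized critic can be illustrated with a small, dependency-free sketch. It is not the project's code: `toy_actor` is a hypothetical stand-in that returns random actions, and the sizes (24 inputs and 2 outputs per actor) are taken from the network summaries in the Model section.

```python
import random

OBS_SIZE = 24     # per-agent observation width (from the actor summary)
ACTION_SIZE = 2   # per-agent action width (from the actor summary)
NUM_AGENTS = 2


def toy_actor(local_obs):
    """Hypothetical stand-in for a trained actor.

    It receives ONE agent's local observation and returns that agent's action.
    """
    assert len(local_obs) == OBS_SIZE
    return [random.uniform(-1.0, 1.0) for _ in range(ACTION_SIZE)]


def critic_input(all_obs, all_actions):
    """Flatten every agent's observation and action into the joint critic input."""
    joint_obs = [x for obs in all_obs for x in obs]
    joint_actions = [a for act in all_actions for a in act]
    return joint_obs, joint_actions


# Decentralized execution: each actor only looks at its own observation.
observations = [[0.0] * OBS_SIZE for _ in range(NUM_AGENTS)]
actions = [toy_actor(obs) for obs in observations]

# Centralized training: the critic is fed the information of both agents.
joint_obs, joint_actions = critic_input(observations, actions)
```

At execution time only the actors are needed, so each agent can act without access to the other agent's observation.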


#### Model 


    Actor network parameters
  
    ----------------------------------------------------------------
            Layer (type)               Output Shape         Param #
    ================================================================
           BatchNorm1d-1                   [-1, 24]              48
                Linear-2                  [-1, 400]          10,000
                Linear-3                  [-1, 300]         120,300
                Linear-4                    [-1, 2]             602
    ================================================================
    Total params: 130,950
    Trainable params: 130,950
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.00
    Forward/backward pass size (MB): 0.01
    Params size (MB): 0.50
    Estimated Total Size (MB): 0.51
    ----------------------------------------------------------------

    Critic network parameters
 
    ----------------------------------------------------------------
            Layer (type)               Output Shape         Param #
    ================================================================
           BatchNorm1d-1                   [-1, 48]              96
                Linear-2                  [-1, 400]          19,600
                Linear-3                  [-1, 300]         121,500
                Linear-4                    [-1, 1]             301
    ================================================================
    Total params: 141,497
    Trainable params: 141,497
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.00
    Forward/backward pass size (MB): 0.01
    Params size (MB): 0.54
    Estimated Total Size (MB): 0.55
    ----------------------------------------------------------------
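
The parameter counts in the two summaries can be reproduced by hand, which also hints at how the layers are wired. The sketch below is an inference from the numbers, not a statement about the actual source code: it assumes that a linear layer has `n_in * n_out + n_out` parameters, that batch normalization has one scale and one shift per feature, and that the critic concatenates the 4 action values (2 per agent) to the 400 hidden units before its second linear layer. The 24-wide actor input is presumably three stacked 8-variable observations, which is also an assumption.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """One learnable scale and one learnable shift per feature."""
    return 2 * n_features


actor_total = (
    batchnorm_params(24)
    + linear_params(24, 400)
    + linear_params(400, 300)
    + linear_params(300, 2)
)

critic_total = (
    batchnorm_params(48)
    + linear_params(48, 400)
    + linear_params(400 + 4, 300)   # assumed: joint actions enter at this layer
    + linear_params(300, 1)
)

print(actor_total)   # 130950
print(critic_total)  # 141497
```

Both totals match the summaries, so the assumed wiring is at least consistent with the reported numbers.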


#### Hyperparameters

The following hyperparameters were used to solve this challenge.
```python
BUFFER_SIZE = int(1e5)       # replay buffer size
BATCH_SIZE = 64              # minibatch size
GAMMA = 0.99                 # discount factor
TAU = 1e-3                   # for soft update of target parameters
lrn_rate_actor = 1e-4        # learning rate actor
lrn_rate_critic = 1e-3       # learning rate critic
initial_noise_scale = 1.0    # initial noise factor
noise_reduction = 0.999998   # noise reduction factor to reduce noise over time
```
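
To give a feeling for what `TAU` and `noise_reduction` do, here is a minimal sketch using plain Python lists instead of network tensors. The helper names are made up for this example, and it assumes that the noise scale is multiplied by `noise_reduction` once per decay step; how often the project applies the decay is not stated here.

```python
import math

TAU = 1e-3
initial_noise_scale = 1.0
noise_reduction = 0.999998


def soft_update(local_params, target_params, tau=TAU):
    """Move each target parameter a fraction `tau` toward its local counterpart."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


def noise_scale_after(steps, scale=initial_noise_scale, factor=noise_reduction):
    """Noise scale after `steps` multiplicative decays (one decay per step assumed)."""
    return scale * factor ** steps


# One soft update barely moves the target network.
new_target = soft_update([1.0, -1.0], [0.0, 0.0])

# Number of decay steps until the noise scale has halved.
half_life = math.log(0.5) / math.log(noise_reduction)
```

With these values the target networks change very slowly, and the exploration noise needs a few hundred thousand decay steps to halve, so exploration stays strong for a long time.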

The code, along with instructions on how to get started, use the trained agent, and train the agent from scratch, can be found here:
https://github.com/luctrate/p3_collab-compet

#### Results

The following plot shows the score per episode. The challenge was solved after 1227 episodes.

![results.png](./assets/results.png)

### Enhancements
Unfortunately, due to the current situation I have less time allocated for the course. I would like to implement prioritized experience replay to see how it influences training time, as I can imagine there are a lot of 'useless' experience tuples which get picked over and over again. This shows itself in the long time it took the agent to make any progress: after 1265 episodes the average score was still 0.15, and it reached 0.50 only 62 episodes later.
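
As a starting point for that enhancement, the following is a minimal sketch of proportional prioritized replay. It is not part of this project: the class name, the default values of `alpha`, `beta` and `eps`, and the linear-time sampling are all choices made for this illustration. A practical version for a buffer of 1e5 entries would normally use a sum-tree instead of recomputing all probabilities on every sample.

```python
import random


class PrioritizedReplayBuffer:
    """Hypothetical proportional prioritized replay buffer (illustrative only)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha      # 0 = uniform sampling, 1 = fully prioritized
        self.eps = eps          # keeps every priority strictly positive
        self.data = []
        self.priorities = []
        self.pos = 0            # next slot to overwrite once the buffer is full

    def add(self, experience):
        # New experiences get the current maximum priority so they are seen soon.
        max_p = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(experience)
            self.priorities.append(max_p)
        else:
            self.data[self.pos] = experience
            self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        idxs = random.choices(range(len(self.data)), weights=probs, k=batch_size)
        # Importance-sampling weights, normalized by the largest weight in the batch.
        n = len(self.data)
        weights = [(n * probs[i]) ** (-beta) for i in idxs]
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        return idxs, [self.data[i] for i in idxs], weights

    def update_priorities(self, idxs, td_errors):
        # Larger absolute TD error means the tuple is replayed more often.
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(err) + self.eps
```

In a DDPG-style update, the returned weights would scale each sample's critic loss, and `update_priorities` would be called with the new TD errors after every learning step.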