

# Project report: Collaboration and Competition

## 1: Learning algorithm

### 1.1: Description of algorithm

In this project, as in the previous one, we apply the Deep Deterministic Policy Gradient (DDPG) algorithm to solve the problem.

For the details regarding the algorithm, please see the [REPORT](https://github.com/oelvetun/Deep-Reinforcement-Learning/blob/master/Project%202%20-%20Continuous%20Control/REPORT.ipynb) from that project.

The only essential difference is that we have here applied a Prioritized Experience Replay buffer, which we previously explained in the [REPORT](https://github.com/oelvetun/Deep-Reinforcement-Learning/blob/master/Project%201%20-%20Navigation/REPORT.ipynb) from the first project.

The main parts of the algorithm can be summarized as follows:
* A shared (prioritized) replay buffer for both of the agents
* Shared network weights for both agents (and also a common critic)
* Each agent observes its own local environment and adds the experience to the replay buffer
* At each time step, both agents pick an action
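
The data flow in the list above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `SharedReplayBuffer` and `store_step` are hypothetical, and prioritization is left out so that only the sharing of one buffer between the agents is shown.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class SharedReplayBuffer:
    """Uniform replay buffer shared by all agents (prioritization omitted)."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop out first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Draw a batch without replacement from everything stored so far.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


def store_step(buffer, states, actions, rewards, next_states, dones):
    """Add one experience per agent, built from that agent's local observation."""
    for s, a, r, s2, d in zip(states, actions, rewards, next_states, dones):
        buffer.add(s, a, r, s2, d)


# One time step with two agents (toy one-dimensional observations and actions).
buffer = SharedReplayBuffer(buffer_size=100_000, batch_size=2)
store_step(
    buffer,
    states=[[0.1], [0.2]],
    actions=[[0.5], [-0.5]],
    rewards=[0.0, 0.1],
    next_states=[[0.11], [0.21]],
    dones=[False, False],
)
batch = buffer.sample()
```

Because both agents write to the same buffer and share network weights, every learning step can draw on experience collected from either side of the net.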

### 1.2: Chosen Hyperparameters

* ``BUFFER_SIZE = int(1e5)``:  Chosen size of replay buffer
* ``BATCH_SIZE = 128``:        Chosen batch size of learning examples 
* ``GAMMA = 0.99``:            Discount factor
* ``TAU = 1e-3``:              Interpolation factor for the soft update of the target network weights
* ``LR_ACTOR = 2e-4``:         Learning rate of actor in optimization algorithm
* ``LR_CRITIC = 1e-4``:        Learning rate of critic in optimization algorithm
* ``UPDATE_EVERY = 2``:        Number of time steps between each learning step
* ``TIMES_UPDATE = 1``:        Number of batches to learn from at each learning step
* ``EPSILON = 1``:             Starting point for noise decline
* ``EPSILON_DECAY = 1e-4``:    Noise decay for each episode
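
To illustrate how `TAU`, `UPDATE_EVERY` and `TIMES_UPDATE` might interact, here is a minimal sketch using plain Python lists in place of network weights. The helper names `soft_update` and `should_learn` are hypothetical, and the condition that the buffer must hold at least one full batch before learning starts is an assumption, not something stated in this report.

```python
TAU = 1e-3
UPDATE_EVERY = 2
TIMES_UPDATE = 1


def soft_update(local_params, target_params, tau=TAU):
    """Return target weights moved a fraction tau toward the local weights."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


def should_learn(t_step, buffer_len, batch_size, update_every=UPDATE_EVERY):
    """Learn every `update_every` time steps once a full batch is available (assumed rule)."""
    return t_step % update_every == 0 and buffer_len >= batch_size


# Toy weights: the target drifts slowly toward the local network.
local = [1.0, -2.0]
target = [0.0, 0.0]
for t_step in range(1, 5):
    if should_learn(t_step, buffer_len=256, batch_size=128):
        for _ in range(TIMES_UPDATE):
            # (a gradient step on a sampled batch would happen here)
            target = soft_update(local, target)
```

With a small `TAU`, the target networks change slowly, which keeps the targets used by the critic from shifting abruptly between learning steps.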

Parameters of the Ornstein-Uhlenbeck process:
* ``MU = 0.0``
* ``THETA = 0.15`` 
* ``SIGMA = 0.2``
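
For reference, a discretized Ornstein-Uhlenbeck process with these parameters can be sketched as below. The class name `OUNoise` and the use of Gaussian increments are assumptions for illustration; the project's own noise class may differ in detail.

```python
import random


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process: x <- x + theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean, e.g. at the start of an episode."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process one step and return the new state as the noise sample."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


noise = OUNoise(size=2)
samples = [noise.sample() for _ in range(3)]  # temporally correlated noise vectors
```

`THETA` pulls the state back toward `MU`, while `SIGMA` scales the random kick at each step, so consecutive samples are correlated rather than independent.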

Parameters for Prioritized Experience Replay:
* ``ALPHA = 0.5``:             Prioritization exponent, controlling how much randomness remains in the priority-based sampling
* ``BETA_START = 0.3``:        Starting value of beta for importance sampling
* ``BETA_INCREASE = 1e-4``:    Increment applied to beta for importance sampling
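
The role of these three parameters can be sketched with the standard prioritized replay formulas, P(i) = p_i^alpha / sum_k p_k^alpha and w_i = (N * P(i))^(-beta). The function names below are hypothetical, and two details are assumptions: the weights are normalized by the largest weight in the batch, and beta is capped at 1.

```python
import random

ALPHA = 0.5
BETA_START = 0.3
BETA_INCREASE = 1e-4


def sampling_probabilities(priorities, alpha=ALPHA):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(probs, batch_size, rng):
    """Draw buffer indices according to their probabilities (with replacement)."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


def importance_weights(probs, indices, beta):
    """w_i = (N * P(i))^(-beta), normalized by the largest weight in the batch."""
    n = len(probs)
    weights = [(n * probs[i]) ** (-beta) for i in indices]
    max_w = max(weights)
    return [w / max_w for w in weights]


def anneal_beta(beta, increase=BETA_INCREASE):
    """Move beta toward 1, where the importance-sampling correction is complete."""
    return min(1.0, beta + increase)


rng = random.Random(0)
probs = sampling_probabilities([1.0, 4.0, 0.25])
indices = sample_indices(probs, batch_size=2, rng=rng)
weights = importance_weights(probs, indices, beta=BETA_START)
beta = anneal_beta(BETA_START)
```

Experiences with a large priority are sampled more often, and the importance weights scale down their contribution to the loss to compensate for that bias.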

### 1.3: Neural network

##### Actor

The neural network we use for the *Actor* is a simple feed-forward network with the following layers:

* BatchNorm 1
* Layer 1: (state_size, 128)
* ReLU 1
* BatchNorm 2
* Layer 2: (128, 128)
* ReLU 2
* BatchNorm 3
* Layer 3: (128, action_size)
* Tanh

where state_size = 24 and action_size = 2.

##### Critic

The neural network we use for the *Critic* is a simple feed-forward network with the following layers:

* Layer 1: (state_size, 256)
* ReLU 1
* BatchNorm
* Layer 2: (cat(128, action_size), 128)
* ReLU 2
* Layer 3: (128, action_size)

## 3: Plot of Rewards

The algorithm needed 8330 episodes to solve the problem. The plot below shows the reward received in each episode. After approximately 2500 episodes, the problem was almost solved, but performance then deteriorated and only recovered after more than 8000 episodes.

<img src="tennis_rewards.png" width="500">

## 4: Ideas for Future Work

There are several ways to improve the performance of the agent. Specifically, one could 

* Try to add noise directly to the network parameters instead of to the action. This has often been shown to give superior performance. The approach is described [here](https://arxiv.org/abs/1706.01905).
* Spend much more effort on tuning the hyperparameters. What would be the optimal choice of network architecture? Could we change the learning rates or the importance sampling parameters to improve the learning? With more time on our hands, we could spend a lot of time on this. That being said, I have already spent quite some time on this, so we have already come a long way. Be aware that reinforcement learning algorithms can easily diverge with the wrong hyperparameters. In particular, the DDPG method applied here is relatively unstable.
* Consequently, it would be interesting to implement more stable and advanced algorithms such as MADDPG, which is specifically designed for such multi-agent problems.
* With more exploration of the parameters, it might be possible to mirror the states and actions such that the network can learn from both sides simultaneously. That is, a good action on one side could be mirrored so that the agent knows the same action would also be good on the other side of the net. (It might be that the states are already formulated in this manner, but this could be worth investigating.)
* Make the algorithm more realistic by using raw pixel data as input, instead of the sensor readings of position and velocity. This would make its observations closer to human perception.