# Collaboration

Project #3 for the Udacity Reinforcement Learning Nanodegree.  
Eduardo Peynetti
__________

<img src="images/tennis.png" width="600"  />

## Introduction

For this project, we will work with the [Tennis](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#tennis) environment.

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own, local observation.  Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping. 

The task is episodic, and in order to solve the environment, your agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,

- After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
- This yields a single **score** for each episode.

The environment is considered solved when the average (over 100 episodes) of those **scores** is at least +0.5.
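
The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the function names `episode_score` and `solved_episode` are hypothetical and are not taken from the project code.

```python
from collections import deque


def episode_score(rewards_per_step):
    """Score of one episode.

    rewards_per_step is a list of (reward_agent_0, reward_agent_1) tuples,
    one tuple per environment step.
    """
    # Undiscounted sum of rewards for each agent, then the maximum of the two.
    totals = [sum(agent_rewards) for agent_rewards in zip(*rewards_per_step)]
    return max(totals)


def solved_episode(episode_scores, target=0.5, window=100):
    """Return the first episode at which the 100-episode average reaches
    the target, or None if that never happens."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

For example, an episode in which the first agent hits the ball over the net twice and the second agent then lets it drop gives per-agent totals of roughly 0.2 and -0.01, so the episode score is the first agent's total.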



## Model

The problem is solved using a Multi-Agent Deep Deterministic Policy Gradient (MADDPG) architecture, based on the paper [Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments](https://papers.nips.cc/paper/7217-multi-agent-actor-critic-for-mixed-cooperative-competitive-environments.pdf).

The code for the agents can be found in [drltools/agent.py](https://github.com/lalopey/drl/blob/master/drltools/agent/agent.py), under the MADDPGAgent class.  
The code for the PyTorch models for the Critic and the Actor can be found in [drltools/model.py](https://github.com/lalopey/drl/blob/master/drltools/model/model.py)

### DDPG

Each player is modeled through a DDPG architecture. An explanation of this model can be found in the [report for Project 2](https://github.com/lalopey/drl/blob/master/2%20-%20Continuous%20Control%20-DDPG/Report.ipynb).

### Multi Agent Manager

The MADDPGAgent class is a multi-agent manager. It holds one DDPGAgent instance for each of the 2 players, owns a memory buffer from which both players sample individually, and controls the order in which the players learn.

Unlike the Reacher20 problem from Project 2, where we had multiple agents with the exact same behavior that did not interact with each other, here the actions of one player influence the actions of the other. The players are also motivated to cooperate to earn more joint reward.

In the Reacher20 problem, we could sample from each of the 20 agents, store their experiences in a buffer, and then train one model for all agents from a joint experience buffer.

In the Tennis problem, we need a different model for each agent, because the actions of one influence the actions of the other. Hence the need for a multi-agent manager.

Unlike in the other problems, we also learn after a fixed number of episodes instead of after a fixed number of steps. This gives both agents the opportunity to have a full sample of shared experiences.
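
The manager idea described above can be sketched as follows. This is a minimal illustration and not the code in `MADDPGAgent`: the class names `SharedReplayBuffer`, `CountingLearner` and `MultiAgentManager` are hypothetical, the learner is a stand-in that only counts the batches it receives, and the defaults of 5 episodes and 15 updates assume that every player draws its own batch on each update.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["states", "actions", "rewards", "next_states", "dones"]
)


class SharedReplayBuffer:
    """Fixed-size buffer holding joint transitions for both players."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, states, actions, rewards, next_states, dones):
        self.memory.append(Experience(states, actions, rewards, next_states, dones))

    def sample(self):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


class CountingLearner:
    """Stand-in for a DDPG learner; it only counts the batches it receives."""

    def __init__(self):
        self.updates = 0

    def learn(self, batch):
        self.updates += 1


class MultiAgentManager:
    """One learner per player, one shared buffer, learning between episodes."""

    def __init__(self, learners, buffer, learn_every=5, updates_per_learn=15):
        self.learners = learners
        self.buffer = buffer
        self.learn_every = learn_every
        self.updates_per_learn = updates_per_learn
        self.episodes_seen = 0

    def step(self, states, actions, rewards, next_states, dones):
        # Store the joint transition; no learning happens inside an episode.
        self.buffer.add(states, actions, rewards, next_states, dones)

    def end_episode(self):
        """Learn every `learn_every` episodes; return the number of updates run."""
        self.episodes_seen += 1
        enough_data = len(self.buffer) >= self.buffer.batch_size
        if self.episodes_seen % self.learn_every != 0 or not enough_data:
            return 0
        updates = 0
        for _ in range(self.updates_per_learn):
            for learner in self.learners:
                # Each player samples its own batch from the shared buffer.
                learner.learn(self.buffer.sample())
                updates += 1
        return updates
```

In a real training loop, `step` would be called once per environment step and `end_episode` once when the episode terminates, with actual DDPG learners in place of `CountingLearner`.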


## Results

### Agent Training and Model Architecture

The model architecture used to solve the problem is as follows:

For each player:  

- A target and local neural network for the Critic, both with the same architecture:
    - One layer with 512 nodes that takes the states concatenated with the actions as input. The layer is ReLU activated and batch normalized.
    - A second layer with 256 nodes, ReLU activated and batch normalized.
    - A final layer with a single output, the estimated Q-value of the state-action pair.
    
- A target and local neural network for the Actor, both with the same architecture:
    - One layer with 512 nodes that takes the states as input. The layer is ReLU activated and batch normalized.
    - A second layer with 256 nodes, ReLU activated.
    - A final layer that maps into action space, activated with tanh so that the output lies between -1 and 1, the valid range for the actions.
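
The layer layout listed above could look roughly like this in PyTorch. This is a sketch and not the code in `drltools`: the class and argument names are illustrative, the default `state_size=8` and `action_size=2` follow the environment description in the introduction, and applying batch normalization after the ReLU is an assumption about the order.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Maps states to actions in [-1, 1]."""

    def __init__(self, state_size=8, action_size=2, fc1_units=512, fc2_units=256):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = self.bn1(F.relu(self.fc1(state)))  # ReLU, then batch norm (assumed order)
        x = F.relu(self.fc2(x))                # second layer has no batch norm
        return torch.tanh(self.fc3(x))         # squash actions into [-1, 1]


class Critic(nn.Module):
    """Maps (state, action) pairs to a single Q-value."""

    def __init__(self, state_size=8, action_size=2, fc1_units=512, fc2_units=256):
        super().__init__()
        self.fc1 = nn.Linear(state_size + action_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.bn2 = nn.BatchNorm1d(fc2_units)
        self.fc3 = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        x = torch.cat((state, action), dim=1)  # states concatenated with actions
        x = self.bn1(F.relu(self.fc1(x)))
        x = self.bn2(F.relu(self.fc2(x)))
        return self.fc3(x)                     # unbounded Q-value estimate
```

Because of the batch normalization layers, a forward pass in training mode needs a batch with more than one sample; call `.eval()` on the network before acting on a single observation.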
    

- A batch size of 256 is used.
- The local networks are optimized with Adam, with a learning rate of 1e-4 and 5e-4 for the actor and critic, respectively.
- The $\gamma$ in the TD update is 0.995.
- The weights of the target network are updated through a soft update $\theta_{target} = \tau\,\theta_{local} + (1 - \tau)\,\theta_{target}$, with $\tau$ = 1e-3.
- A buffer size of 1e5 for experience replay.
- An Ornstein-Uhlenbeck noise process with parameters $\mu=0$, $\theta=0.15$, $\sigma=0.1$ is used
- The buffer is sampled 15 times every 5 episodes.
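
Two of the items above, the soft update and the Ornstein-Uhlenbeck noise, are easy to show in isolation. The sketch below uses plain Python lists instead of network parameters, and the names `soft_update` and `OUNoise` are illustrative. The noise uses Gaussian increments with a unit time step, which is an assumption and may differ from the project's implementation.

```python
import random


def soft_update(local_params, target_params, tau=1e-3):
    """Blend target parameters toward the local ones:
    theta_target = tau * theta_local + (1 - tau) * theta_target
    """
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


class OUNoise:
    """Ornstein-Uhlenbeck process (assumes Gaussian increments, unit time step)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.1, seed=0):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Restart the process at its long-run mean.
        self.state = list(self.mu)

    def sample(self):
        # x <- x + theta * (mu - x) + sigma * N(0, 1)
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)
```

With $\tau$ = 1e-3, each soft update moves the target weights only 0.1% of the way toward the local weights, which keeps the TD targets slowly varying. The noise sample would be added to the actor's output and the result clipped to [-1, 1] before being sent to the environment.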


### Multi Agent DDPG

The environment is solved in 1307 episodes. It takes over a thousand episodes for the first player to hit the ball on the first round and for consistent return hits to appear. Once the players learn to hit the ball twice reliably, training speeds up considerably.

Episode 100 Average Score: 0.01  
Episode 200 Average Score: 0.01  
Episode 300 Average Score: 0.04  
Episode 400 Average Score: 0.04  
Episode 500 Average Score: 0.07  
Episode 600 Average Score: 0.09  
Episode 700 Average Score: 0.10  
Episode 800 Average Score: 0.09  
Episode 900 Average Score: 0.05  
Episode 1000 Average Score: 0.08  
Episode 1100 Average Score: 0.13  
Episode 1200 Average Score: 0.29  
Episode 1300 Average Score: 0.43  
Episode 1307 Average Score: 0.50   
Environment solved in 1307 episodes with an Average Score of 0.50   

<img src="images/maddpg_tennis.png" width="450"  />


## Future improvements

In this project we leveraged what we had learned about DDPG in the last project, and introduced a multi-agent manager for shared memory and controlled learning from interacting agents.

Multi-agent competitive/cooperative reinforcement learning is an active area of research. In the MADDPG solution used here, each player learns individually from common experiences. It is also possible to have a model that learns jointly for all the players, so that each player can use both individual and shared knowledge.

Some attempts at more complex multi-agent models are:

[QMIX](https://arxiv.org/abs/1803.11485)  
[COMA](https://arxiv.org/abs/1705.08926)  
[VDN](https://arxiv.org/abs/1706.05296)  

We could also improve the way the replay buffer is sampled, through [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952) and [Hindsight Experience Replay](https://arxiv.org/abs/1707.01495).

