# Collaboration and Competition - Solving the Tennis Environment

---
Below we use Deep Deterministic Policy Gradient (DDPG) and Population Based Training of Neural Networks to solve a Unity-based tennis environment.

## 1. Prerequisites

To repeat the experiment in this notebook, first install the prerequisites as described in the README. Then adjust the file name of the Unity environment below to match your platform.

First, we start the environment.

In [83]:
from unityagents import UnityEnvironment
import numpy as np

env = UnityEnvironment(file_name="Tennis_Windows_x86_64/Tennis.exe")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [84]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

## 2. The State and Action Spaces

Let us run the code cell below to print some information about the environment, allowing us to verify that the environment matches the expectations we have outlined in the readme.

In [85]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]
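The state length of 24 follows from the brain description printed earlier: each observation has 8 values and 3 observations are stacked. As a small illustration, the printed state can be split into its frames. This is only a sketch: it assumes the frames are stored one after another with the most recent frame last, which the leading zeros in the printed state suggest but which is not confirmed here. `split_into_frames` is a hypothetical helper, not part of the project code.

```python
# State of the first agent, copied from the output above.
first_state = [
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
    -6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0,
]

OBSERVATION_SIZE = 8   # "Vector Observation space size (per agent)"
NUM_STACKED = 3        # "Number of stacked Vector Observation"


def split_into_frames(state, frame_size):
    """Split a flat stacked state into consecutive frames of frame_size values."""
    if len(state) % frame_size != 0:
        raise ValueError("state length is not a multiple of the frame size")
    return [state[i:i + frame_size] for i in range(0, len(state), frame_size)]


frames = split_into_frames(first_state, OBSERVATION_SIZE)
for index, frame in enumerate(frames):
    print("frame", index, frame)
```

Right after a reset only the last frame is populated, which is consistent with the first 16 entries of the printed state being zero.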


## 3. Take Random Actions in the Environment

Before we move into training an agent based on DDPG, we can check that we know how to interact with the environment by running the code below. It uses the Python API to control the agent and receive feedback from the environment, at this point selecting entirely random actions.

In [8]:
for i in range(5):                                         # play game for 5 episodes
    env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    while True:
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break
    print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.04500000085681677
Total score (averaged over agents) this episode: -0.004999999888241291
Total score (averaged over agents) this episode: 0.04500000085681677
Total score (averaged over agents) this episode: 0.04500000085681677
Total score (averaged over agents) this episode: -0.004999999888241291


## 4. Learning to Act in the Tennis Environment with DDPG and Population Based Training

We will use the [Deep Deterministic Policy Gradient (DDPG) algorithm](https://arxiv.org/abs/1509.02971) (Lillicrap et al., 2016) to attempt to learn to act in the environment described in the `README.md`. The algorithm is the same one used in a previous project in the Reacher environment, but its application to a two-agent system involves additional choices.

DDPG is an actor-critic algorithm. The actor network is an artificial neural network that implements the deterministic policy function $\mu(s)$, which attempts to approximate the optimal policy. It is updated using a deterministic policy gradient derived from the critic network's evaluation of the actor's actions.
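For reference, the actor update can be written as in Lillicrap et al. (2016). The notation is the paper's, not that of this notebook's code: $\theta^\mu$ and $\theta^Q$ are the actor and critic parameters, and $\rho^\beta$ is the state distribution under the behaviour policy.

$$
\nabla_{\theta^\mu} J \approx \mathbb{E}_{s_t \sim \rho^\beta}\left[ \nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_t,\, a=\mu(s_t)} \; \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s=s_t} \right]
$$

In words: the critic reports how its value estimate changes with the action, and the chain rule carries that signal back to the actor's parameters.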

This way of using a critic that approximates the action-value function $Q(s,a)$ to provide a policy gradient for a parameterised policy was derived in the [Deterministic Policy Gradient](http://proceedings.mlr.press/v32/silver14.pdf) paper (Silver et al., 2014). DDPG improves on the earlier algorithm by using an experience replay buffer and by using separate target networks for both the critic and the actor to stabilise learning. The target networks are continuously updated towards the live networks via a soft-update process.
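The soft update moves each target parameter a small fraction `tau` of the way towards the corresponding live parameter. The following is a minimal sketch in plain Python, using lists of floats in place of network weights. The function name `soft_update` and the example weights are illustrative assumptions and are not taken from `agent.py`.

```python
def soft_update(live_params, target_params, tau):
    """Move each target parameter a fraction tau of the way towards its live value."""
    return [tau * live + (1.0 - tau) * target
            for live, target in zip(live_params, target_params)]


live = [1.0, -2.0, 0.5]      # hypothetical live-network weights
target = [0.0, 0.0, 0.0]     # hypothetical target-network weights

# With a small tau the target follows the live weights only slowly.
for _ in range(1000):
    target = soft_update(live, target, tau=3e-4)

print(target)
```

A smaller `tau` makes the targets used in the critic update change more slowly, which is why `tau` is one of the hyperparameters varied across the population later in this notebook.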

The pseudocode for the whole algorithm is shown below (excerpt from [Lillicrap et al, 2016](https://arxiv.org/abs/1509.02971)).

![DDPG pseudocode, from (Lillicrap et al 2016)](DDPG-algorithm.png)
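The critic step of the pseudocode regresses $Q(s_i, a_i)$ towards the target $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}))$. Here is a minimal sketch of that target computation in plain Python. The function name `td_targets`, the example numbers, and the handling of terminal transitions (dropping the bootstrap term when an episode has ended) are assumptions made for illustration. The last of these is a common implementation choice and is not part of the pseudocode.

```python
def td_targets(rewards, next_q_values, dones, gamma):
    """Compute y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) for a minibatch.

    next_q_values holds the target critic's value of the target actor's action
    in each next state. For terminal transitions the bootstrap term is dropped
    (an assumption of this sketch, see the text above).
    """
    return [r + gamma * q * (0.0 if done else 1.0)
            for r, q, done in zip(rewards, next_q_values, dones)]


# Hypothetical minibatch of three transitions.
rewards = [0.1, -0.01, 0.0]
next_q_values = [0.5, 0.2, 0.3]
dones = [False, True, False]

targets = td_targets(rewards, next_q_values, dones, gamma=0.99)
print(targets)
```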

DDPG as described in the paper is a single-agent reinforcement learning algorithm, whereas here we have a cooperative two-agent environment. We take the approach of treating each agent independently. Additionally, we use a separate agent with its own weights for each of the two players, as opposed to having a single agent model play against itself.

The first choice means that the environment is not stationary from the point of view of an individual agent: the learning done by the other agent continuously changes the dynamics of the environment. Because the other agent is a genuinely separate agent, this issue is more pronounced. This can make training an agent very unstable, and methods to mitigate this have been proposed, e.g. using a centralized critic that is aware of the actions of all agents in the environment ([Lowe et al., 2018](https://arxiv.org/abs/1706.02275)). Here we instead use plain DDPG in the context of Population Based Training of Neural Networks ([Jaderberg et al., 2017](https://arxiv.org/abs/1711.09846)), hoping that using a population is enough to stabilise the training process.
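Population Based Training alternates ordinary training with an exploit step, in which a poorly performing member copies a better one, and an explore step, in which the copied hyperparameters are perturbed. The sketch below shows one way such a step could look. It is not the code in `population_based_training.py`. The function names, the significance threshold `alpha`, the normal approximation to the p-value, and the perturbation factors 0.8 and 1.2 are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean, variance


def welch_p_value(returns_a, returns_b):
    """Two-sided p-value for a difference in means (Welch statistic, normal approximation)."""
    standard_error = (variance(returns_a) / len(returns_a)
                      + variance(returns_b) / len(returns_b)) ** 0.5
    if standard_error == 0.0:
        return 1.0 if mean(returns_a) == mean(returns_b) else 0.0
    z = (mean(returns_a) - mean(returns_b)) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def exploit_and_explore(member, rival, rng, alpha=0.05, factors=(0.8, 1.2)):
    """Return the hyperparameters member should continue with after meeting rival.

    member and rival are dicts with 'returns' (recent episode returns) and
    'hyperparams'. Copying the network weights is omitted in this sketch.
    """
    worse = mean(member["returns"]) < mean(rival["returns"])
    if worse and welch_p_value(member["returns"], rival["returns"]) < alpha:
        # Exploit: inherit the rival's hyperparameters. Explore: perturb each one.
        return {name: value * rng.choice(factors)
                for name, value in rival["hyperparams"].items()}
    return dict(member["hyperparams"])


rng = random.Random(0)
weak = {"returns": [0.0, -0.01, 0.0, -0.01, 0.0, -0.01],
        "hyperparams": {"actor_lr": 1e-3, "tau": 3e-4}}
strong = {"returns": [0.5, 0.6, 0.4, 0.55, 0.45, 0.5],
          "hyperparams": {"actor_lr": 3e-4, "tau": 1e-4}}

new_hyperparams = exploit_and_explore(weak, strong, rng)
print(new_hyperparams)
```

A multiplicative perturbation suits learning rates and `tau`. A discount factor such as `gamma` would additionally need to be kept below 1, which this sketch does not handle.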

The implementation of the training algorithm is split into the following files:

`agent.py`
:  The core DDPG implementation in a `DdpgAgent` class and a `ddpg` function that uses it to implement the algorithm.

`actor.py`
:  The actor neural network

`critic.py`
:  The critic neural network

`ornstein_uhlenbeck_noise.py`
:  The noise process that is used to make the agent explore the environment in DDPG

`replaybuffer.py`
:  The replay buffer used by the `DdpgAgent`

`utils.py`
:  Utilities used by the other modules

`population_based_training.py`
:  The Population Based Training algorithm
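To give an idea of what the exploration noise module does, here is a minimal sketch of a discretised Ornstein-Uhlenbeck process in plain Python. It is not the code in `ornstein_uhlenbeck_noise.py`. The class name, its interface, and the defaults are assumptions, with `theta=0.15` and `sigma=0.2` following the values reported by Lillicrap et al.

```python
import random


class OrnsteinUhlenbeckNoise:
    """Discretised Ornstein-Uhlenbeck process: x += theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=None):
        self.size = size
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its long-run mean, e.g. at the start of an episode."""
        self.state = [self.mu] * self.size

    def sample(self):
        """Advance the process by one step and return the new noise vector."""
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


# Add noise to a hypothetical two-dimensional action and clip it to [-1, 1].
noise = OrnsteinUhlenbeckNoise(size=2, seed=0)
action = [0.3, -0.7]
noisy_action = [max(-1.0, min(1.0, a + n)) for a, n in zip(action, noise.sample())]
print(noisy_action)
```

Unlike independent Gaussian noise, successive samples are correlated, so the perturbation drifts in one direction for several steps before reverting towards `mu`.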


In [86]:
import replaybuffer
import ornstein_uhlenbeck_noise
import utils
import torch
import pandas as pd
import agent
import population_based_training as pbt

import importlib



In [49]:
agent.ddpg()

TypeError: ddpg() missing 3 required positional arguments: 'agent', 'env', and 'brain_name'

In [38]:
starting_hyperparams = [{
        'actor_lr': actor_lr,
        'critic_lr': critic_lr,
        'gamma': gamma,
        'tau': tau
    }
    for actor_lr in [1e-3, 3e-4]
    for critic_lr in [1e-3, 3e-4]
    for gamma in [0.95, 0.99]
    for tau in [3e-4, 1e-4]
]
starting_hyperparams

[{'actor_lr': 0.001, 'critic_lr': 0.001, 'gamma': 0.95, 'tau': 0.0003},
 {'actor_lr': 0.001, 'critic_lr': 0.001, 'gamma': 0.95, 'tau': 0.0001},
 {'actor_lr': 0.001, 'critic_lr': 0.001, 'gamma': 0.99, 'tau': 0.0003},
 {'actor_lr': 0.001, 'critic_lr': 0.001, 'gamma': 0.99, 'tau': 0.0001},
 {'actor_lr': 0.001, 'critic_lr': 0.0003, 'gamma': 0.95, 'tau': 0.0003},
 {'actor_lr': 0.001, 'critic_lr': 0.0003, 'gamma': 0.95, 'tau': 0.0001},
 {'actor_lr': 0.001, 'critic_lr': 0.0003, 'gamma': 0.99, 'tau': 0.0003},
 {'actor_lr': 0.001, 'critic_lr': 0.0003, 'gamma': 0.99, 'tau': 0.0001},
 {'actor_lr': 0.0003, 'critic_lr': 0.001, 'gamma': 0.95, 'tau': 0.0003},
 {'actor_lr': 0.0003, 'critic_lr': 0.001, 'gamma': 0.95, 'tau': 0.0001},
 {'actor_lr': 0.0003, 'critic_lr': 0.001, 'gamma': 0.99, 'tau': 0.0003},
 {'actor_lr': 0.0003, 'critic_lr': 0.001, 'gamma': 0.99, 'tau': 0.0001},
 {'actor_lr': 0.0003, 'critic_lr': 0.0003, 'gamma': 0.95, 'tau': 0.0003},
 {'actor_lr': 0.0003, 'critic_lr': 0.0003, 'gamma': 0.

In [153]:
importlib.reload(agent)
importlib.reload(pbt)
starting_agents = [
    agent.DdpgAgent(actor_lr=p['actor_lr'],
              critic_lr=p['critic_lr'],
              tau=p['tau'],
              gamma=p['gamma'],
              name="agent-{}".format(i))
    for i, p in enumerate(starting_hyperparams)]
agents = starting_agents


In [None]:
pbt.population_based_training(agents, env, brain_name, episodes_between_exploit=200, max_episode=10000)

Paired agent-4 with agent-14
Episode 200	Average Score: -0.01
Episode 200	Average Score: -0.00

Agent: agent-4, mean return: -0.01

Agent: agent-14, mean return: -0.00
Paired agent-15 with agent-12
Episode 200	Average Score: -0.00
Episode 200	Average Score: -0.00

Agent: agent-15, mean return: -0.00

Agent: agent-12, mean return: -0.00
Paired agent-8 with agent-13
Episode 200	Average Score: -0.00
Episode 200	Average Score: -0.01

Agent: agent-8, mean return: -0.00

Agent: agent-13, mean return: -0.01
Paired agent-2 with agent-11
Episode 200	Average Score: -0.01
Episode 200	Average Score: -0.00

Agent: agent-2, mean return: -0.01

Agent: agent-11, mean return: -0.00
Paired agent-6 with agent-1
Episode 200	Average Score: -0.01
Episode 200	Average Score: -0.00

Agent: agent-6, mean return: -0.01

Agent: agent-1, mean return: -0.00
Paired agent-5 with agent-10
Episode 200	Average Score: -0.00
Episode 200	Average Score: -0.01

Agent: agent-5, mean return: -0.00

Agent: agent-10, mean return

Episode 1000	Average Score: -0.01
Episode 1000	Average Score: -0.00

Agent: agent-5, mean return: -0.01

Agent: agent-15, mean return: -0.00
Paired agent-2 with agent-8
Episode 1000	Average Score: -0.01
Episode 1000	Average Score: -0.00

Agent: agent-2, mean return: -0.01

Agent: agent-8, mean return: -0.00
Paired agent-14 with agent-6
Episode 1000	Average Score: 0.01
Episode 1000	Average Score: -0.01

Agent: agent-14, mean return: 0.01

Agent: agent-6, mean return: -0.01
agent-0 performed worse than agent-4 but it wasn't significant, p = 0.50
agent-1 performed worse than agent-7 but it wasn't significant, p = 0.06
agent-5 performed worse than agent-15 but it wasn't significant, p = 0.11
Overwriting agent-6 with agent-7, p = 0.00
agent-10 performed worse than agent-7 but it wasn't significant, p = 0.11
agent-11 performed worse than agent-8 but it wasn't significant, p = 0.19
Overwriting agent-12 with agent-14, p = 0.00
agent-13 performed worse than agent-4 but it wasn't significant, p 

agent-11 performed worse than agent-15 but it wasn't significant, p = 0.51
Overwriting agent-13 with agent-0, p = 0.00
Overwriting agent-14 with agent-0, p = 0.00
Overwriting agent-15 with agent-3, p = 0.00
Paired agent-10 with agent-3
Episode 2000	Average Score: -0.00
Episode 2000	Average Score: 0.02

Agent: agent-10, mean return: -0.00

Agent: agent-3, mean return: 0.02
Paired agent-5 with agent-9
Episode 2000	Average Score: 0.03
Episode 2000	Average Score: -0.01

Agent: agent-5, mean return: 0.03

Agent: agent-9, mean return: -0.01
Paired agent-6 with agent-14
Episode 2000	Average Score: 0.03
Episode 2000	Average Score: 0.01

Agent: agent-6, mean return: 0.03

Agent: agent-14, mean return: 0.01
Paired agent-15 with agent-4
Episode 2000	Average Score: 0.02
Episode 2000	Average Score: -0.01

Agent: agent-15, mean return: 0.02

Agent: agent-4, mean return: -0.01
Paired agent-1 with agent-11
Episode 2000	Average Score: -0.00
Episode 2000	Average Score: -0.01

Agent: agent-1, mean return

Overwriting agent-5 with agent-13, p = 0.00
Overwriting agent-6 with agent-7, p = 0.01
Overwriting agent-8 with agent-15, p = 0.00
Overwriting agent-9 with agent-1, p = 0.00
Overwriting agent-10 with agent-0, p = 0.00
Overwriting agent-12 with agent-3, p = 0.00
Overwriting agent-14 with agent-3, p = 0.00
Paired agent-0 with agent-3
Episode 3000	Average Score: 0.17
Episode 3000	Average Score: 0.18

Agent: agent-0, mean return: 0.17

Agent: agent-3, mean return: 0.18
Paired agent-7 with agent-1
Episode 3000	Average Score: 0.08
Episode 3000	Average Score: 0.07

Agent: agent-7, mean return: 0.08

Agent: agent-1, mean return: 0.07
Paired agent-2 with agent-14
Episode 3000	Average Score: 0.15
Episode 3000	Average Score: 0.15

Agent: agent-2, mean return: 0.15

Agent: agent-14, mean return: 0.15
Paired agent-5 with agent-6
Episode 3000	Average Score: 0.06
Episode 3000	Average Score: 0.06

Agent: agent-5, mean return: 0.06

Agent: agent-6, mean return: 0.06
Paired agent-11 with agent-8
Episode


Agent: agent-13, mean return: 0.27

Agent: agent-15, mean return: 0.27
Paired agent-12 with agent-4
Episode 3800	Average Score: 0.53
Episode 3800	Average Score: 0.53

Agent: agent-12, mean return: 0.53

Agent: agent-4, mean return: 0.53
Paired agent-10 with agent-8
Episode 3800	Average Score: 0.27
Episode 3800	Average Score: 0.26

Agent: agent-10, mean return: 0.27

Agent: agent-8, mean return: 0.26
Paired agent-14 with agent-6
Episode 3800	Average Score: 0.42
Episode 3800	Average Score: 0.41

Agent: agent-14, mean return: 0.42

Agent: agent-6, mean return: 0.41
Overwriting agent-3 with agent-12, p = 0.00
Overwriting agent-8 with agent-12, p = 0.00
Overwriting agent-10 with agent-14, p = 0.00
agent-11 performed worse than agent-4 but it wasn't significant, p = 0.22
Overwriting agent-13 with agent-5, p = 0.00
agent-14 performed worse than agent-2 but it wasn't significant, p = 0.25
Overwriting agent-15 with agent-7, p = 0.02
Paired agent-2 with agent-14
Episode 4000	Average Score: 0.43

Episode 4800	Average Score: 0.32
Episode 4800	Average Score: 0.34

Agent: agent-11, mean return: 0.32

Agent: agent-4, mean return: 0.34
Paired agent-0 with agent-15
Episode 4800	Average Score: 0.30
Episode 4800	Average Score: 0.31

Agent: agent-0, mean return: 0.30

Agent: agent-15, mean return: 0.31
Paired agent-14 with agent-8
Episode 4800	Average Score: 0.55
Episode 4800	Average Score: 0.53

Agent: agent-14, mean return: 0.55

Agent: agent-8, mean return: 0.53
Paired agent-6 with agent-13
Episode 4800	Average Score: 0.53
Episode 4800	Average Score: 0.52

Agent: agent-6, mean return: 0.53

Agent: agent-13, mean return: 0.52
Paired agent-1 with agent-10
Episode 4800	Average Score: 0.51
Episode 4800	Average Score: 0.52

Agent: agent-1, mean return: 0.51

Agent: agent-10, mean return: 0.52
Paired agent-12 with agent-2
Episode 4800	Average Score: 0.62
Episode 4800	Average Score: 0.59

Agent: agent-12, mean return: 0.62

Agent: agent-2, mean return: 0.59
Paired agent-9 with agent-3
Episo

Episode 5600	Average Score: 0.87
Episode 5600	Average Score: 0.86

Agent: agent-8, mean return: 0.87

Agent: agent-1, mean return: 0.86
Overwriting agent-0 with agent-8, p = 0.00
Overwriting agent-5 with agent-13, p = 0.01
agent-6 performed worse than agent-7 but it wasn't significant, p = 0.91
Overwriting agent-7 with agent-8, p = 0.00
Overwriting agent-9 with agent-10, p = 0.00
Overwriting agent-11 with agent-10, p = 0.01
agent-14 performed worse than agent-4 but it wasn't significant, p = 0.64
agent-15 performed worse than agent-11 but it wasn't significant, p = 0.29
Paired agent-11 with agent-3
Episode 5800	Average Score: 0.40
Episode 5800	Average Score: 0.42

Agent: agent-11, mean return: 0.40

Agent: agent-3, mean return: 0.42
Paired agent-7 with agent-12
Episode 5800	Average Score: 0.53
Episode 5800	Average Score: 0.53

Agent: agent-7, mean return: 0.53

Agent: agent-12, mean return: 0.53
Paired agent-10 with agent-5
Episode 5800	Average Score: 0.51
Episode 5800	Average Score: 0

Episode 6600	Average Score: 0.33
Episode 6600	Average Score: 0.31

Agent: agent-1, mean return: 0.33

Agent: agent-12, mean return: 0.31
Paired agent-10 with agent-8
Episode 6600	Average Score: 0.24
Episode 6600	Average Score: 0.23

Agent: agent-10, mean return: 0.24

Agent: agent-8, mean return: 0.23
Overwriting agent-0 with agent-15, p = 0.00
Overwriting agent-1 with agent-4, p = 0.00
Overwriting agent-6 with agent-1, p = 0.00
Overwriting agent-9 with agent-8, p = 0.00
Overwriting agent-10 with agent-12, p = 0.04
agent-12 performed worse than agent-13 but it wasn't significant, p = 0.30
Overwriting agent-13 with agent-11, p = 0.00
Paired agent-12 with agent-1
Episode 6800	Average Score: 0.48
Episode 6800	Average Score: 0.49

Agent: agent-12, mean return: 0.48

Agent: agent-1, mean return: 0.49
Paired agent-14 with agent-0
Episode 6800	Average Score: 0.27
Episode 6800	Average Score: 0.27

Agent: agent-14, mean return: 0.27

Agent: agent-0, mean return: 0.27
Paired agent-2 with agent-1

In [130]:
agents[14].history

Unnamed: 0,episode,return,actor_lr,critic_lr,tau,gamma
0,1,-0.01,0.0003,0.0003,0.0003,0.99
0,2,-0.01,0.0003,0.0003,0.0003,0.99
0,3,-0.01,0.0003,0.0003,0.0003,0.99
0,4,0.0,0.0003,0.0003,0.0003,0.99
0,5,-0.01,0.0003,0.0003,0.0003,0.99
0,6,0.0,0.0003,0.0003,0.0003,0.99
0,7,-0.01,0.0003,0.0003,0.0003,0.99
0,8,-0.01,0.0003,0.0003,0.0003,0.99
0,9,-0.01,0.0003,0.0003,0.0003,0.99
0,10,-0.01,0.0003,0.0003,0.0003,0.99


In [82]:
env.close()

## 5. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  A few **important notes**:
- When training the agents, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
- To structure your work, you're welcome to work directly in this Jupyter notebook, or you might like to start over with a new file!  You can see the list of files in the workspace by clicking on **_Jupyter_** in the top left corner of the notebook.
- In this coding environment, you will not be able to watch the agents while they are training.  However, **_after training the agents_**, you can download the saved model weights to watch the agents on your own machine! 