# Project 3: Collaboration and Competition

The goal of this project is to train two agents to play tennis against each other in the Tennis environment from the Unity ML-Agents toolkit.

![Trained agent](./assets/tennis.gif)
### Environment

The environment has two agents playing tennis against each other. If an agent misses the ball, the other agent gets a point. The goal of each agent is to hit the ball as many times as possible. At every step each agent has access only to its own local observation. The state vector for each agent has 24 entries: 8 variables describing the position and velocity of the ball and racket, stacked over 3 consecutive observations.

Each agent takes two continuous actions: one moves the racket toward or away from the net, and the other makes it jump. The action space for each agent therefore has size 2. An agent receives a reward of +0.1 when it hits the ball with its racket and a reward of -0.01 when it misses the ball. The task is episodic, and in order to solve the environment, the agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,
   * After each episode, we add up the rewards that each agent received (without discounting) to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
   * This yields a single score for each episode.
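
The scoring rule above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `episode_score` and `is_solved` and the toy reward lists are made up for this example and do not come from the project code.

```python
from collections import deque


def episode_score(agent_rewards):
    """Sum each agent's undiscounted rewards, then keep the larger total."""
    return max(sum(rewards) for rewards in agent_rewards)


def is_solved(episode_scores, window=100, target=0.5):
    """True once the mean of the last `window` episode scores reaches `target`."""
    recent = deque(episode_scores, maxlen=window)  # keeps only the newest scores
    return len(recent) == window and sum(recent) / window >= target


# Toy episode (hypothetical numbers): agent 0 gets two hits and one miss,
# agent 1 gets a single hit.
rewards_per_agent = [[0.1, 0.1, -0.01], [0.1, 0.0, 0.0]]
score = episode_score(rewards_per_agent)
print("episode score:", round(score, 2))
```

Because only the maximum over the two agents is kept, one strong agent is enough to produce a high episode score.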

In [1]:
from env import UnityEnvWrapper
import numpy as np
from collections import deque
from maddpg import MADDPG

In [2]:
from bokeh.plotting import figure 
from bokeh.models import Legend
from bokeh.layouts import column
from bokeh.io import output_notebook, show
output_notebook()

In [3]:
env  = UnityEnvWrapper('Tennis.app')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


In [4]:
# reset the environment
env_info = env._env.reset(train_mode=True)[env.brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = env.brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]


### Approach

The agents are trained with Multi-Agent Deep Deterministic Policy Gradient (MADDPG), a deep reinforcement learning algorithm. The next sections describe DDPG and then MADDPG.

### Deep Deterministic Policy Gradient
DDPG adapts the ideas underlying the success of Deep Q-Learning to the continuous action domain. The DDPG algorithm consists of a parameterized actor function $\mu(s|\theta^\mu)$ and a critic network $Q(s,a|\theta^Q)$. Like DQN, it stores transitions in a replay buffer and samples random mini-batches of them for training. Instead of periodically performing a hard copy of the critic network's weights to the target critic network (as DQN does for its Q-network), a soft update is performed after every training step. DDPG also keeps a target actor network and soft-updates it in the same way. Batch normalization is applied on several layers of the actor and critic networks. To enable exploration, noise sampled from a noise process $N$ is added to the actions generated by the actor network:

\begin{equation}
\mu^t(s_{t}) = \mu(s_{t}|\theta^\mu) + N
\label{ae}
\end{equation}
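
As a rough illustration of the equation above, the sketch below adds noise to an actor's output and clips the result. Several things here are assumptions and not taken from this project: zero-mean Gaussian noise stands in for the noise process $N$ (the DDPG paper uses an Ornstein-Uhlenbeck process), and the function name `noisy_action`, the value of `sigma`, and the clipping range of [-1, 1] are chosen only for the example.

```python
import random


def noisy_action(actor_output, rng, sigma=0.1, low=-1.0, high=1.0):
    """Add zero-mean Gaussian noise to each action component, then clip.

    Gaussian noise is an assumption standing in for the noise process N.
    """
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in actor_output]


rng = random.Random(0)  # seeded so the example is repeatable
action = noisy_action([0.3, -0.9], rng)
print("noisy action:", action)
```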

<img src="assets/ddpg_algo.png" style="width:600px;height:500px;"> 

The figure above describes the DDPG algorithm.
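
The soft target update used by DDPG blends a small fraction $\tau$ of the local network's weights into the target network after every training step. Below is a minimal sketch under simplifying assumptions: plain Python lists stand in for the weight tensors, and `soft_update` is a hypothetical helper, not the function used in `maddpg.py`. The default `tau=0.01` matches the value listed in the hyperparameter table.

```python
def soft_update(local_params, target_params, tau=0.01):
    """Return tau * local + (1 - tau) * target, element by element."""
    return [
        tau * w_local + (1.0 - tau) * w_target
        for w_local, w_target in zip(local_params, target_params)
    ]


local = [1.0, 1.0]   # weights of the network being trained
target = [0.0, 1.0]  # slowly moving copy used to compute learning targets
target = soft_update(local, target)
print("updated target weights:", target)
```

With a small $\tau$ the target network trails the local network slowly, which keeps the learning targets from shifting abruptly between updates.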

### Multi Agent Deep Deterministic Policy Gradient

<img src="assets/maddpg.png" style="width:300px;height:250px;"> 

MADDPG uses DDPG for each agent with a slight modification: decentralized actors with centralized critics. Each critic sees the observations of every agent, while each actor sees only the observations of its own agent. The figure above shows this: the actors receive only their own agent's state, while the critics receive the states of all agents.
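
To make the difference in visibility concrete, the sketch below builds the input an actor would see and the input a centralized critic would see, following the formulation in the MADDPG paper (all observations plus all actions). The helper names `actor_input` and `critic_input` are hypothetical, and the way `maddpg.py` actually assembles its critic input is not shown in this report, so treat the resulting sizes as illustrative only.

```python
NUM_AGENTS, STATE_SIZE, ACTION_SIZE = 2, 24, 2


def actor_input(observations, agent_index):
    """An actor sees only the observation of its own agent."""
    return list(observations[agent_index])


def critic_input(observations, actions):
    """A centralized critic sees every agent's observation and action."""
    joint = []
    for obs in observations:
        joint.extend(obs)
    for act in actions:
        joint.extend(act)
    return joint


# Placeholder observations and actions filled with zeros.
observations = [[0.0] * STATE_SIZE for _ in range(NUM_AGENTS)]
actions = [[0.0] * ACTION_SIZE for _ in range(NUM_AGENTS)]

print("actor input size:", len(actor_input(observations, 0)))
print("critic input size:", len(critic_input(observations, actions)))
```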

### Hyperparameters

|Parameter|Description|Value|
|:---:|:---:|:---:|
|Weight decay| L2 weight decay| 0|
|$\gamma$|Discount factor|0.99|
|$\alpha_{actor}$| Actor network optimizer learning rate| 0.0002|
|$\alpha_{critic}$| Critic network optimizer learning rate| 0.0002|
|Buffer size| Size of replay buffer| 10000|
|Batch size| Mini-batch size| 128|
|$\tau$| Target parameters soft update rate| 0.01|
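
As an illustration of how the buffer size and batch size in the table interact, here is a minimal replay buffer sketch. The `ReplayBuffer` class, the `Transition` fields, and the `can_sample` helper are hypothetical names chosen for this example and are not taken from `maddpg.py`.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple(
    "Transition", ["states", "actions", "rewards", "next_states", "dones"]
)


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, buffer_size=10000, batch_size=128, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop out first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, states, actions, rewards, next_states, dones):
        self.memory.append(Transition(states, actions, rewards, next_states, dones))

    def can_sample(self):
        """Training can start once one full mini-batch is available."""
        return len(self.memory) >= self.batch_size

    def sample(self):
        """Draw batch_size distinct transitions uniformly at random."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


# Small sizes so the behaviour is easy to see.
buffer = ReplayBuffer(buffer_size=5, batch_size=3)
for step in range(8):
    buffer.add([step], [0.0, 0.0], [0.0, 0.0], [step + 1], [False, False])
batch = buffer.sample()
print(len(buffer), "stored,", len(batch), "sampled")
```

Sampling uniformly from a large buffer breaks the correlation between consecutive transitions, which is the reason DQN and DDPG use one.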


### Training

The agents solved the environment by reaching an average episodic reward of 0.5 in 429 episodes, which took around 22 minutes. The figure below shows the episodic reward and the rolling mean of the episodic rewards over the training episodes.

<img src="assets/training_plot.png">

You can see the actual plot in the Tennis.ipynb notebook. To run the training yourself, open Tennis.ipynb and run all the cells. Make sure you have downloaded the appropriate Unity ML-Agents Tennis environment using the links in the README.

In [5]:
trained_agent = MADDPG(num_agents, state_size, action_size, 10)
trained_agent.loadCheckPoints()

### Network Architecture

In [11]:
print(trained_agent.agents[0].actor_local)
print(trained_agent.agents[0].critic_local)

Actor(
  (fc1): Linear(in_features=24, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=2, bias=True)
  (bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
Critic(
  (fcs1): Linear(in_features=24, out_features=128, bias=True)
  (fc2): Linear(in_features=130, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=1, bias=True)
  (bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
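
As a quick sanity check on the printed layers, the sketch below counts the trainable parameters implied by the shapes above. The helper functions are hypothetical and only do arithmetic on the printed sizes. Note that the critic's `fc2` has 130 = 128 + 2 input features, which is consistent with the action being concatenated to the output of the first layer (as in the Udacity DDPG example), although the `forward` method is not shown here.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def batchnorm_params(num_features):
    """With affine=True, BatchNorm1d learns one scale and one shift per feature."""
    return 2 * num_features


actor_total = (
    linear_params(24, 128)
    + linear_params(128, 128)
    + linear_params(128, 2)
    + 2 * batchnorm_params(128)
)
critic_total = (
    linear_params(24, 128)
    + linear_params(130, 128)
    + linear_params(128, 1)
    + 2 * batchnorm_params(128)
)
print("actor parameters:", actor_total)    # 20482
print("critic parameters:", critic_total)  # 20609
```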


In [6]:
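def test(n_episodes=10):
    """Run the trained agents for n_episodes; return episode scores and their rolling mean."""
    current_score = []                # max-over-agents score of each episode
    running_mean = []                 # mean of the most recent (up to 100) scores
    scores_deque = deque(maxlen=100)
    for i_episode in range(1, n_episodes+1):
        # train_mode=False runs the environment at normal speed for viewing
        env_info = env._env.reset(train_mode=False)[env.brain_name]
        states = env_info.vector_observations
        scores = np.zeros(num_agents)  # undiscounted return of each agent
        while True:
            actions = trained_agent.act(states)
            next_states, rewards, dones = env.step(actions)
            scores += rewards
            states = next_states
            # the episode ends as soon as either agent reports done
            if np.any(dones):
                break
        # the episode score is the maximum over both agents
        scores_deque.append(np.max(scores))
        current_score.append(np.max(scores))
        running_mean.append(np.mean(scores_deque))
    
        print('\rEpisode {}\tAverage Score: {:.2f}\tCurrent Score: {:.2f}'.format(i_episode, np.mean(scores_deque),np.max(scores)), end="")
    return current_score, running_mean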

In [7]:
scores, running_mean = test(100)

Episode 100	Average Score: 1.92	Current Score: 0.00

In [8]:
PLOT_WIDTH = 900
PLOT_HEIGHT = 300
LINE_WIDTH = 2


def get_figure(args, x_axis_label, y_axis_label):
    
    fig = figure(
        plot_width=PLOT_WIDTH,
        plot_height=PLOT_HEIGHT,
        y_axis_label=y_axis_label,
        x_axis_label=x_axis_label
    )
    for data, x_axis_label, y_axis_label, color in args:
        fig.line(range(len(data)), data, legend=y_axis_label ,line_width=LINE_WIDTH, color=color)
    return fig
plots = []
plots.append(get_figure(
    [
        (scores, 'Episodes', 'Episodic reward', 'skyblue'),
        (running_mean, 'Episodes', 'Rolling mean of Episodic reward', 'slateblue'),
        ([0.5 for i in range(len(scores))], 'Episodes', 'Project completion threshold', 'tomato')
    ],
    'Episodes', ''
))


main_row = column(*plots)
show(main_row)

The figure above shows the episodic rewards and the rolling mean of episodic rewards received by the trained MADDPG agents over 100 consecutive episodes. The rolling mean of the episodic reward stayed roughly in the range [1.5, 2.6], which is clearly above the environment completion threshold of 0.5.

In [9]:
env._env.close()

#### Future work
- I would like to extend this to work with new environments.
- I would like to investigate and implement a PPO-based variant of MADDPG and explore whether it works.

#### References
- [Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments](https://arxiv.org/pdf/1706.02275.pdf)
- [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971)
- [Udacity DRLND DDPG example](https://github.com/udacity/deep-reinforcement-learning/blob/master/ddpg-pendulum/ddpg_agent.py)
- [Medium blog on MADDPG](https://medium.com/@amitpatel.gt/maddpg-91caa221d75e)