# Collaboration and Competition

### 1. Install dependencies
Most importantly, install [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md), PyTorch, and NumPy.

In [1]:
from unityagents import UnityEnvironment
import numpy as np
import torch
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

Make sure you have downloaded the Unity environment, then change `file_name` to the path of the environment file on your machine.

In [2]:
env = UnityEnvironment(file_name="Tennis.app")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **brains** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces
In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 3 stacked instances of 8 variables corresponding to the position and velocity of the ball and racket. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping. 
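
As a rough illustration of the stacking described above, the sketch below splits a 24-value observation into three 8-value frames. The helper name `split_stacked_observation` and the example data are made up for illustration, and the notebook does not state in which order Unity lays out the stacked frames, so treat the frame order as an assumption.

```python
def split_stacked_observation(observation, num_stacked=3):
    """Split a flat observation into `num_stacked` equally sized frames."""
    if len(observation) % num_stacked != 0:
        raise ValueError("observation length must be divisible by num_stacked")
    frame_size = len(observation) // num_stacked
    return [
        list(observation[i * frame_size:(i + 1) * frame_size])
        for i in range(num_stacked)
    ]


# Hypothetical 24-value observation standing in for one agent's state.
example_observation = [float(i) for i in range(24)]
frames = split_stacked_observation(example_observation)
print(len(frames), len(frames[0]))  # prints "3 8"
```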

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24


### 3. Instantiate and initialize the agent
The learning agents are created by the `MADDPG` class, which is imported from a separate file and configured through a `Config` object that holds `state_size`, `action_size`, a `seed`, and the training hyperparameters.

A few highlights of the agents:
- Each agent selects its actions with the policy given by its actor network
- The agents use a shared buffer to store recent `(state, action, reward, next_state, done)` tuples and replay them during learning
- The agents learn to maximize reward with an actor-critic network
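
To make the shared buffer idea concrete, here is a minimal sketch of an experience replay buffer in plain Python. `SimpleReplayBuffer` and `Experience` are hypothetical names for illustration; the `ReplayBuffer` class that the notebook imports is not shown, and it may differ (for example, it probably returns tensors on `config.device`).

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one transition; field names follow the tuple above.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class SimpleReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = SimpleReplayBuffer(buffer_size=5, batch_size=2, seed=0)
for step in range(8):  # made-up transitions; only the last 5 are kept
    buffer.add(state=step, action=0.0, reward=0.1, next_state=step + 1, done=False)
batch = buffer.sample()
print(len(buffer), len(batch))  # prints "5 2"
```

Sampling uniformly from a large buffer breaks up the correlation between consecutive time steps, which is the usual motivation for replaying stored experience.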

In [5]:
from config import Config
from buffer import ReplayBuffer
from maddpg import MADDPG

In [6]:
# setup configuration                
config = Config()

# general config
config.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
config.seed = 0

# environment related config
config.state_size = env_info.vector_observations.shape[1]    # size of the observation space (state space)
config.action_size = brain.vector_action_space_size          # size of the action space
config.num_agents = len(env_info.agents)                     # number of agents

# Experience replay memory related config
config.buffer_size = int(1e6)                                # size of the memory buffer
config.batch_size = 256                                      # sample minibatch size
config.memory = lambda: ReplayBuffer(config.action_size, config.buffer_size, config.batch_size, config.seed, config.device)

# agent related info
config.gamma = 0.99                                          # discount rate for future rewards
config.tau = 0.02                                            # interpolation factor for soft update of target network
config.lr_actor = 1e-4                                       # learning rate of Actor
config.lr_critic = 3e-4                                      # learning rate of Critic
config.weight_decay = 0                                      # L2 weight decay
config.update_every = 4                                      # update the networks every 4 time steps
config.num_updates = 1                                       # number of updates to the network

In [7]:
maddpg = MADDPG(config)

### 4. Test randomly selected actions (untrained agents)
Run randomly selected actions in the environment to see what happens to the score. This is how **untrained** agents behave.

Once this cell is executed, a window should pop up that lets you watch the agents as they select a random action at each time step.

In [None]:
for i in range(1, 6):                                      # play game for 5 episodes
    env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    while True:
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break
    print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))

### 5. Train multiple agents with Deep Deterministic Policy Gradient (DDPG)
The agents run on an underlying actor-critic network. This is preferable to a typical deep Q-network (DQN) here, not only because the environment's state space is large at 24 variables, but also because the action space consists of 2 continuous variables, which a DQN, built for discrete actions, does not handle directly.

The setup is similar to the DQN with a local and target network; however, there are now two separate networks: the **actor** network, which learns the optimal policy, and the **critic** network, which evaluates the selected action.
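
The target networks are typically kept close to the local networks with a soft update controlled by `config.tau`. The sketch below shows the blending rule on plain Python lists that stand in for network weights. The function name `soft_update` and the example weights are assumptions for illustration, since the update code inside the `MADDPG` class is not shown in this notebook.

```python
def soft_update(local_params, target_params, tau):
    """Blend target parameters toward the local ones:
    target <- tau * local + (1 - tau) * target
    """
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_params, target_params)
    ]


local_weights = [1.0, 2.0, 3.0]    # made-up "local network" weights
target_weights = [0.0, 0.0, 0.0]   # made-up "target network" weights
for _ in range(3):                 # three consecutive soft updates
    target_weights = soft_update(local_weights, target_weights, tau=0.02)
print(target_weights)
```

With a small `tau` such as 0.02, the target weights move only a little toward the local weights on each update, which keeps the learning targets stable.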

Let's train the agents until the maximum score over both agents, averaged over 100 consecutive episodes, reaches +0.5.
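
The stopping rule can be sketched in plain Python as follows. The helper name `first_solved_episode` and the score data are made up for illustration. Unlike the training loop below, this sketch waits until the 100-episode window is full before it tests the threshold.

```python
from collections import deque


def first_solved_episode(episode_scores, target=0.5, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    per-episode max scores first reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)
    for episode, agent_scores in enumerate(episode_scores, start=1):
        recent.append(max(agent_scores))  # score of an episode = max over agents
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Made-up scores: 100 weak episodes followed by 100 strong ones.
fake_scores = [(0.0, 0.1)] * 100 + [(1.0, 0.9)] * 100
print(first_solved_episode(fake_scores))  # prints 145
```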

In [None]:
def ddpg_multi(n_episodes=3000, max_t=1000):
    '''
    -------------------------------------------
    Parameters
    
    n_episodes: # of episodes that the agent is training for
    max_t:      # of time steps (max) the agent is taking per episode
    -------------------------------------------
    '''
    scores_deque = deque(maxlen=100)
    scores = []
    max_score = -np.inf
    for i_episode in range(1, n_episodes+1):
        env_info = env.reset(train_mode=True)[brain_name]             # turn on train mode of the environment
        states = env_info.vector_observations                         # get the current state for each agent
        maddpg.reset()                                                # reset the OU noise parameter 
        ep_scores = np.zeros(num_agents)                              # initialize the score for each agent
        for t in range(max_t):
            actions = maddpg.act(states)                              # select an action for each agent 
            env_info = env.step(actions)[brain_name]                  # send all actions to the environment
            next_states = env_info.vector_observations                # get next state for each agent
            rewards = env_info.rewards                                # get reward for each agent
            dones = env_info.local_done                               # check if episode finished
            maddpg.step(states, actions, rewards, next_states, dones) # agents record environment response in recent step
            states = next_states                                      # set the state as the next state for the following step for each agent
            ep_scores += rewards                                      # update the total score
            if np.any(dones):                                         # exit loop if episode for any agent finished
                break 
                
        scores_deque.append(np.max(ep_scores))
        scores.append(ep_scores)
        
        # print the max episode score and the average max score over the last 100 episodes
        print('\rEpisode {} \tMax Score: {:.2f} \tAverage Max Score: {:.2f}'.format(i_episode, np.max(ep_scores), np.mean(scores_deque)), end="")  
        
        # save actor and critic weights once the average max score over the last 100 episodes reaches +0.5
        if np.mean(scores_deque) >= 0.5:
            for i in range(config.num_agents):
                torch.save(maddpg.maddpg_agents[i].actor_local.state_dict(), 'checkpoint_actor_{}_final.pth'.format(i))
                torch.save(maddpg.maddpg_agents[i].critic_local.state_dict(), 'checkpoint_critic_{}_final.pth'.format(i))
            print('\nEnvironment solved in {:d} episodes!\tAverage Max Score: {:.2f}'.format(i_episode-100, np.mean(scores_deque)))
            break
    return scores

scores = ddpg_multi()

### Visualize the scores
Plot the scores against the episode number. We can see the scores increase gradually as training progresses.

- threshold score of +0.5 as a green dashed line
- maximum score over the agents per episode in Matplotlib's default line color
- average maximum score over the next 100 episodes in red
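
The plotting cell below averages over the 100 episodes that follow each episode. If you would rather plot the same trailing average that the training loop prints, a sketch along these lines could be used instead. The helper name `trailing_mean` and the sample values are hypothetical, and the window is shortened to 2 only to keep the example small.

```python
def trailing_mean(values, window=100):
    """Mean of the last `window` values at each position.
    Positions near the start use however many values exist so far."""
    means = []
    running_sum = 0.0
    for index, value in enumerate(values):
        running_sum += value
        if index >= window:
            running_sum -= values[index - window]  # drop the value leaving the window
        means.append(running_sum / min(index + 1, window))
    return means


max_scores = [0.0, 0.1, 0.2, 0.3]  # made-up per-episode max scores
smoothed = trailing_mean(max_scores, window=2)
print(smoothed)
```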

In [None]:
fig = plt.figure()

# average of the max scores over the next 100 episodes, starting at each episode
max_scores = [np.max(s_ep) for s_ep in scores]
t = np.arange(1, len(scores)+1)
s = [np.mean(max_scores[i:i+100]) for i in range(len(max_scores))]

# plot max score/episode
plt.plot(np.arange(1, len(scores)+1), [np.max(s_ep) for s_ep in scores])
# plot average of max per next 100 episodes
plt.plot(t, s, c='r', linewidth=2)
# plot threshold line at +0.5
plt.hlines(0.5, 0, len(scores), colors='g', linestyles='dashed')
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

In [None]:
env.close()

### 6. Test the trained agents
Run a pair of **trained** agents for 5 episodes to see what happens to the score. Compare this with the score of the untrained agents from section 4.

In [8]:
# Load policy network weights saved from training
maddpg.maddpg_agents[0].actor_local.load_state_dict(torch.load('checkpoint_actor_0_final.pth'))
maddpg.maddpg_agents[1].actor_local.load_state_dict(torch.load('checkpoint_actor_1_final.pth'))

for i in range(1, 6):                                      # play game for 5 episodes
    env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    while True:
        actions = maddpg.act(states)                       # select an action for each agent
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break
    print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))

Score (max over agents) from episode 1: 0.30000000447034836
Score (max over agents) from episode 2: 0.7000000104308128
Score (max over agents) from episode 3: 0.20000000298023224
Score (max over agents) from episode 4: 0.30000000447034836
Score (max over agents) from episode 5: 1.700000025331974


In [9]:
env.close()