# Tiki-taka time! 

---

This notebook uses the Unity SoccerTwos environment, where two teams of two players each contend against each other, and the Actor Critic framework with Proximal Policy Optimization to teach the agents how to win a soccer game! 


### Environment setup

First we import the necessary packages:

In [1]:
# We use PyTorch to implement the neural networks
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
import torch.distributions as distributions
from torch.utils.data.sampler import BatchSampler, SubsetRandomSampler
# We import the UnityEnvironment
from unityagents import UnityEnvironment
# and some other relevant packages
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
import datetime
import pytz
import os

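Next we define a helper that returns the current time in the US Pacific time zone as a formatted string:

In [2]:
def get_time(fmt):
    # Take the current UTC time, make it timezone-aware, then convert it to Pacific time
    utc_now = pytz.utc.localize(datetime.datetime.utcnow())
    pst_now = utc_now.astimezone(pytz.timezone("America/Los_Angeles"))
    return pst_now.strftime(fmt)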

### Exploring UnityEnvironments: Soccer

Next, we will start the environment! Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we obtain separate brains for the striker and the goalie agents.

In [None]:
# Start the environment
env = UnityEnvironment(file_name="Soccer.app")
# Print the brain names
print(env.brain_names)

In [None]:
# There are two brains:
# 1. set the goalie brain
g_brain_name = env.brain_names[0]
g_brain = env.brains[g_brain_name]
# 2. set the striker brain
s_brain_name = env.brain_names[1]
s_brain = env.brains[s_brain_name]

Next, we shed more light on the environment by printing some relevant info. 

In [None]:
# Reset the environment
env_info = env.reset(train_mode=True)
# Goalie info
num_g_agents = len(env_info[g_brain_name].agents)
print('Number of goalie agents:', num_g_agents)
g_action_size = g_brain.vector_action_space_size
print('Number of goalie actions:', g_action_size)
g_states = env_info[g_brain_name].vector_observations
g_state_size = g_states.shape[1]
print('There are {} goalie agents. Each receives a state with length: {}'.format(g_states.shape[0], g_state_size))
# Striker info
num_s_agents = len(env_info[s_brain_name].agents)
print('Number of striker agents:', num_s_agents)
s_action_size = s_brain.vector_action_space_size
print('Number of striker actions:', s_action_size)
s_states = env_info[s_brain_name].vector_observations
s_state_size = s_states.shape[1]
print('There are {} striker agents. Each receives a state with length: {}'.format(s_states.shape[0], s_state_size))

# Close the environment
env.close()

### Small simulation: sample actions in the environment

We use the Python API to control the agents and receive feedback from the environment. We will watch the agents' performance, as they select actions at random with each time step.  

In [None]:
# Start the environment
env = UnityEnvironment(file_name="Soccer.app")

for i in range(2):                                         # play game for 2 episodes
    env_info = env.reset(train_mode=False)                 # reset the environment    
    g_states = env_info[g_brain_name].vector_observations  # get initial state (goalies)
    s_states = env_info[s_brain_name].vector_observations  # get initial state (strikers)
    g_scores = np.zeros(num_g_agents)                      # initialize the score (goalies)
    s_scores = np.zeros(num_s_agents)                      # initialize the score (strikers)
    while True:
        # select actions and send to environment
        g_actions = np.random.randint(g_action_size, size=num_g_agents)
        s_actions = np.random.randint(s_action_size, size=num_s_agents)
        actions = dict(zip([g_brain_name, s_brain_name], [g_actions, s_actions]))
        env_info = env.step(actions)
        
        # get next states
        g_next_states = env_info[g_brain_name].vector_observations
        s_next_states = env_info[s_brain_name].vector_observations
        
        # get reward and update scores
        g_rewards = env_info[g_brain_name].rewards
        s_rewards = env_info[s_brain_name].rewards
        g_scores += g_rewards
        s_scores += s_rewards
        
        # check if episode finished
        done = np.any(env_info[g_brain_name].local_done)
        
        # roll over states to next time step
        g_states = g_next_states
        s_states = s_next_states
        
        # exit loop if episode finished
        if done:
            break
    print('Scores from episode {}: {} (goalies), {} (strikers)'.format(i+1, g_scores, s_scores))

# Close the environment
env.close()

### Time for training!

##### Set up the Actor-Critic networks

The *Actor* receives its *own state* (its local observation) and outputs:
* an action,
* the log probability of that action (used later to compute the probability ratio between the new and the old policy), and
* the entropy of the probability distribution (higher entropy means more uncertainty).

The entropy enters the *loss function* as a bonus term. Intuitively, it urges the agent to try more varied actions initially, so that it does not lock onto an action that fares well in the short term but is not optimal in the long term. That is, it helps avoid poor local optima.
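
To make the entropy term concrete, here is a minimal sketch in plain Python. The helper name `categorical_entropy` and the two example policies are illustrative assumptions, not part of the notebook. It shows that a uniform policy over four actions has the highest possible entropy, while an almost deterministic policy has a much lower one:

```python
import math

def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution; hypothetical helper."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain policy over 4 actions
peaked = [0.97, 0.01, 0.01, 0.01]    # almost deterministic policy

h_uniform = categorical_entropy(uniform)
h_peaked = categorical_entropy(peaked)
print(h_uniform, h_peaked)
```

Subtracting a small multiple of this quantity from the loss rewards the optimizer for keeping the policy spread out, which is what encourages exploration.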

The *Critic* receives the *combined states of all agents* on the field and outputs a single value: the expected total (discounted) reward from that state onward. 
This value is compared to the actual discounted reward that followed the actor's action; the difference tells us how much better or worse the chosen action turned out than the critic expected. This is called the *advantage*. 
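
As a small worked example of the advantage, the sketch below uses made-up numbers (the returns and critic values are assumptions chosen only for illustration) and plain Python. It also normalizes the advantages to zero mean and unit standard deviation, a common stabilizing step:

```python
import statistics

# Illustrative numbers only: discounted returns observed in an episode
# and the critic's value estimates for the same time steps.
returns = [1.0, 0.8, 0.5, -0.2]
values = [0.6, 0.7, 0.6, 0.1]

# Advantage: how much better (or worse) the outcome was than the critic expected.
advantages = [g - v for g, v in zip(returns, values)]

# Normalize to zero mean and unit (sample) standard deviation.
mean = statistics.mean(advantages)
std = statistics.stdev(advantages)
normalized = [(a - mean) / (std + 1e-10) for a in advantages]

print(advantages)
print(normalized)
```

A positive entry means the action did better than the critic predicted, so its probability should go up; a negative entry pushes it down.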

A note on the distributions function:

It is not enough to have the actor output a softmax distribution of action probabilities and then pick an action with an arbitrary random sampler, because backpropagation cannot pass through the sampling step itself. PyTorch and Tensorflow offer [distribution classes](https://pytorch.org/docs/stable/distributions.html) to solve this. The actor wraps its softmax output in a `Categorical` distribution, samples an action from it, and evaluates the log probability of that action. The sample is not differentiable, but the log probability of the sampled action is a differentiable function of the network output, and that is the quantity the loss uses for backpropagation.
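
The following sketch illustrates why the log probability is the useful quantity. It is written in plain Python rather than PyTorch, and the helper names and logit values are assumptions for illustration. For a softmax policy, the gradient of `log pi(action)` with respect to the logits is `one_hot(action) - probs`; the code checks that closed form against finite differences:

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob(logits, action):
    return math.log(softmax(logits)[action])

logits = [0.5, -1.0, 2.0]
action = 1  # pretend this index was sampled from the policy

# Analytic gradient of log pi(action) w.r.t. each logit: one_hot(action) - probs
probs = softmax(logits)
analytic = [(1.0 if j == action else 0.0) - probs[j] for j in range(len(logits))]

# Central finite differences: log_prob is a smooth function of the logits.
eps = 1e-6
numeric = []
for j in range(len(logits)):
    up = list(logits)
    up[j] += eps
    down = list(logits)
    down[j] -= eps
    numeric.append((log_prob(up, action) - log_prob(down, action)) / (2 * eps))

print(analytic)
print(numeric)
```

The sampled index itself has no gradient, but once it is fixed, `log_prob` depends smoothly on the network output, which is what an automatic differentiation library exploits.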

Next we provide the definitions for the actor and the critic networks.

In [6]:
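def initialize_fc_layer(layer, weight_scale=1.0, bias=0.0):
    """
    Initializes a fully connected layer. 

    Arguments:
      layer - torch.nn.<layer> module
      weight_scale - float, gain applied to the orthogonal weight matrix
      bias - float, constant value used to fill the bias vector
    Outputs:
      Initialized layer
    """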
    # 'nn.init.orthogonal_' fills the 'layer.weight.data' (Tensor) with a (semi) orthogonal matrix, 
    # see Exact solutions to the nonlinear dynamics of learning in deep linear neural networks - Saxe, A. et al. (2013). 
    nn.init.orthogonal_(layer.weight.data, gain=weight_scale)
    # layer.weight.data.mul_(weight_scale)
    # 'nn.init.constant_' fills the 'layer.bias.data' (Tensor) with the 'bias' value.
    nn.init.constant_(layer.bias.data, bias)
    return layer

class Actor(nn.Module):
    # Chosen architecture:
    # input -> fc1 -> relu -> fc2 -> relu -> fc3 -> softmax -> output

    # Initialize
    def __init__(self, state_size, action_size, hidden_0=256, hidden_1=128):
        super(Actor, self).__init__()
        # We initialize 3 fully connected layers. 
        self.fc1 = initialize_fc_layer(nn.Linear(state_size, hidden_0))
        self.fc2 = initialize_fc_layer(nn.Linear(hidden_0, hidden_1))
        self.fc3 = initialize_fc_layer(nn.Linear(hidden_1, action_size))
    
    # Forward propagation
    def forward(self, x, action=None):
        # Input x
        # -> fc1 -> relu
        x = F.relu(self.fc1(x))
        # -> fc2 -> relu
        x = F.relu(self.fc2(x))
        # -> fc3 -> softmax
        probs = F.softmax(self.fc3(x), dim=1)
        
        # Create Categorical distribution based on the
        # probabilities out of the softmax. 
        dist = distributions.Categorical(probs)

        # If no action provided, sample randomly
        # based on the distribution from the nn output.
        if action is None:
            action = dist.sample()
        
        # Compute the log-probability density/mass function 
        # evaluated at the 'action' value.
        log_prob = dist.log_prob(action)

        return action, log_prob, dist.entropy()

    # Load from checkpoint
    def load(self, checkpoint):        
        if os.path.isfile(checkpoint):
            self.load_state_dict(torch.load(checkpoint))

    # Save to checkpoint
    def checkpoint(self, checkpoint):
        torch.save(self.state_dict(), checkpoint)
    
class Critic(nn.Module):
    # Chosen architecture:
    # input -> fc1 -> relu -> fc2 -> relu -> fc3 -> output

    # Initialize
    def __init__(self, state_size, hidden_0=256, hidden_1=128):
        super(Critic, self).__init__()

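        # 'state_size' is already the length of the combined (concatenated) state of all agents,
        # so it must not be multiplied again here.
        self.fc1 = initialize_fc_layer(nn.Linear(state_size, hidden_0))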
        self.fc2 = initialize_fc_layer(nn.Linear(hidden_0, hidden_1))
        self.fc3 = initialize_fc_layer(nn.Linear(hidden_1, 1))

    # Forward propagation
    def forward(self, x):
        # Input x
        # -> fc1 -> relu
        x = F.relu(self.fc1(x))
        # -> fc2 -> relu
        x = F.relu(self.fc2(x))
        # -> fc3
        value = self.fc3(x)
        return value

    # Load from checkpoint
    def load(self, checkpoint):        
        if os.path.isfile(checkpoint):
            self.load_state_dict(torch.load(checkpoint))

    # Save to checkpoint
    def checkpoint(self, checkpoint):
        torch.save(self.state_dict(), checkpoint)
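
To see what the critic's input looks like, here is a minimal sketch in plain Python. The observation sizes and values are hypothetical and chosen only to keep the example small. The centralized critic receives one flat vector built from the observations of all four agents, so its first layer has to accept exactly that many inputs:

```python
# Hypothetical per-agent observation sizes, chosen only for illustration.
g_state_size = 3   # goalie observation length
s_state_size = 2   # striker observation length

goalie_0 = [0.1] * g_state_size
striker_0 = [0.2] * s_state_size
goalie_1 = [0.3] * g_state_size
striker_1 = [0.4] * s_state_size

# The centralized critic sees one flat vector built from all four agents.
critic_state = goalie_0 + striker_0 + goalie_1 + striker_1

# Its first layer must therefore accept exactly this many inputs.
critic_input_size = 2 * g_state_size + 2 * s_state_size
assert len(critic_state) == critic_input_size
print(critic_input_size)
```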

##### Policy improvement

The learning process uses the experiences the agents collect while playing one full soccer game. A game ends either when a goal is scored or when 600 time steps have passed. 
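
Once a game has ended, each stored reward is turned into a discounted reward-to-go. The sketch below uses the textbook recursion `G_t = r_t + gamma * G_{t+1}` in plain Python; the function name and the toy episode are assumptions for illustration. The `Optimizer.learn` method further down uses a closely related variant in which each return is additionally scaled by `gamma**t`, that is, discounted relative to the start of the episode.

```python
def discounted_returns(rewards, gamma):
    """Reward-to-go G_t = r_t + gamma * G_{t+1}, computed back to front."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

rewards = [0.0, 0.0, 1.0]                       # e.g. a goal scored on the last step
print(discounted_returns(rewards, gamma=0.5))   # [0.25, 0.5, 1.0]
```

Steps closer to the goal receive more credit, while earlier steps still receive a share of it.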

##### Setting up classes for training
Learning to be better programmers :) we use classes for cleaner coding.

`class Memory` contains the stored experiences for training/evaluation:

In [2]:
from collections import namedtuple

class Memory:
    def __init__(self):
        self.memory = []
        self.experience = namedtuple('Experience', field_names=['actor_state', 'critic_state', 'action', 'log_prob', 'reward'])

    def add(self, actor_state, critic_state, action, log_prob, reward):
        """Add a new experience to memory."""        
        exp = self.experience(actor_state, critic_state, action, log_prob, reward)
        self.memory.append(exp)

    def experiences(self, clear=True):
        """Return experiences stored in memory"""
        # Number of experiences is the length of self.memory. 
        n_exp = len(self.memory)
        # For each exp in self.memory, stack the
        # (actor_states, critic_states, actions, log_probabilities, rewards)
        actor_states = np.vstack([exp.actor_state for exp in self.memory if exp is not None])
        critic_states = np.vstack([exp.critic_state for exp in self.memory if exp is not None])
        actions = np.vstack([exp.action for exp in self.memory if exp is not None])
        log_probs = np.vstack([exp.log_prob for exp in self.memory if exp is not None])
        rewards = np.vstack([exp.reward for exp in self.memory if exp is not None])

        # Clear memory after returning experiences.
        if clear:
            self.memory.clear()

        return actor_states, critic_states, actions, log_probs, rewards, n_exp
    
    def delete(self, i):
        del self.memory[i]
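
The core of `experiences()` is turning a list of per-step records into one array per field. A stripped-down sketch of that idea in plain Python follows; it assumes a reduced, hypothetical set of fields and uses `zip` where the class above uses `np.vstack`:

```python
from collections import namedtuple

# Reduced field set, for illustration only.
Experience = namedtuple('Experience', ['state', 'action', 'reward'])

memory = [
    Experience(state=[0.0, 1.0], action=2, reward=0.0),
    Experience(state=[1.0, 0.0], action=0, reward=1.0),
]

# Transpose the list of per-step records into one sequence per field.
states, actions, rewards = (list(column) for column in zip(*memory))

print(states)
print(actions)
print(rewards)
```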


`class Agent` is a wrapper for each agent:

In [3]:
# from memory import Memory

class Agent:

    def __init__(self, device, key, actor_model, n_step):
        # Set device
        self.device = device
        # Set key
        self.KEY = key
        # Set neural model
        self.actor_model = actor_model   
        # MEMORY
        self.memory = Memory()
        # Set number of steps
        self.N_STEP = n_step


    #get an action from the actor for each step of game play (inference/eval only)
    def act(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
        self.actor_model.eval()
        with torch.no_grad():
            action, log_prob, _ = self.actor_model(state)
        self.actor_model.train()
        action = action.cpu().detach().numpy().item()
        log_prob = log_prob.cpu().detach().numpy().item()
        return action, log_prob

    def step(self, actor_state, critic_state, action, log_prob, reward):
        self.memory.add(actor_state, critic_state, action, log_prob, reward)

Finally, we create another class for our `Optimizer`:

In [4]:
class Optimizer:

    def __init__(self, device, actor_model, critic_model, optimizer, n_step, batch_size, gamma, epsilon, entropy_weight, gradient_clip, epochs=1):
        # Set device
        self.device = device 

        # Set neural nets
        self.actor_model = actor_model
        self.critic_model = critic_model
        self.optimizer = optimizer

        # Set hyperparameters
        # ('epochs' is optional and comes last so that the positional calls further down match this signature)
        self.epochs = epochs
        self.n_step = n_step
        self.batch_size = batch_size
        self.gamma = gamma
        self.epsilon = epsilon
        self.entropy_weight = entropy_weight
        self.gradient_clip = gradient_clip  

    def learn(self, memory):
        # Extract experiences from memory:
        actor_states, critic_states, actions, log_probs, rewards, n_exp = memory.experiences()

        # Discounts: 1, gamma, gamma^2, ... (one factor per time step)
        discounts = self.gamma**np.arange(n_exp)
        # Discount each reward by its time step within the episode
        discounted_rewards = rewards.squeeze(1) * discounts
        # Reverse cumulative sum: discounted reward-to-go from each time step onward,
        # measured from the start of the episode
        rewards_future = discounted_rewards[::-1].cumsum(axis=0)[::-1]
        
        # Setup torch tensors
        actor_states = torch.from_numpy(actor_states).float().to(self.device)
        critic_states = torch.from_numpy(critic_states).float().to(self.device)
        actions = torch.from_numpy(actions).long().to(self.device).squeeze(1)
        log_probs = torch.from_numpy(log_probs).float().to(self.device).squeeze(1)
        rewards = torch.from_numpy(rewards_future.copy()).float().to(self.device)
        
        """
        We want the agent to take actions which achieve the greatest reward compared to the average expected reward 
        for that state (as estimated by our critic). We compute the advantage function
        below and normalize it to improve training.
        """
        # Get critic values detached from the training process (eval/inference only)
        self.critic_model.eval()
        with torch.no_grad():
            values = self.critic_model(critic_states).detach()
        self.critic_model.train()

        # Get advantages
        advantages = (rewards - values.squeeze(1)).detach()
        advantages_normalized = (advantages - advantages.mean()) / (advantages.std() + 1.0e-10)
        # This is already a tensor on the right device, so there is no need to wrap it in torch.tensor again
        advantages_normalized = advantages_normalized.float()

        """
        Each episode yields a set of experiences (n_exp). 
        We shuffle them and split them into mini-batches of size batch_size; each mini-batch gives one gradient step.
        """
        batches = BatchSampler(SubsetRandomSampler(range(0, n_exp)), self.batch_size, drop_last=False)
        losses = []
        for batch_indices in batches:
            batch_indices = torch.tensor(batch_indices).long().to(self.device)

            # Get data from the batch
            sampled_actor_states = actor_states[batch_indices]
            sampled_critic_states = critic_states[batch_indices]
            sampled_actions = actions[batch_indices]
            sampled_log_probs = log_probs[batch_indices]
            sampled_rewards = rewards[batch_indices]
            sampled_advantages = advantages_normalized[batch_indices]

            # Get new probability of each action given the state and latest actor policy
            _, new_log_probs, entropies = self.actor_model(sampled_actor_states, sampled_actions)

            # Compute ratio of how much more likely is the new action choice vs. old choice 
            # according to the updated actor
            ratio = (new_log_probs - sampled_log_probs).exp()

            # Compute PPO loss
            """
            The clipping function makes sure that we don't update our weights too much when an action suddenly 
            looks much better, so that a single noisy estimate cannot drag the policy far off course. 
            This is the key idea of Proximal Policy Optimization (PPO). 
            """
            clip = torch.clamp(ratio, 1 - self.epsilon, 1 + self.epsilon)
            policy_loss = torch.min( ratio * sampled_advantages, clip * sampled_advantages )
            policy_loss = - torch.mean( policy_loss )

            """
            Entropy regularization term steers the new policy towards equal probability of all actions, encouraging
            exploration early on, but decreasing in importance over time. 
            """
            entropy = torch.mean(entropies)
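            # Get the critic's value predictions (with gradients this time) to improve the critic's estimates
            values = self.critic_model(sampled_critic_states) 
            # mse_loss expects (input, target); squeeze(1) keeps the batch dimension even for a batch of one
            value_loss = F.mse_loss(values.squeeze(1), sampled_rewards)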

            """
            The loss function combines the policy loss with value loss and adds the entropy term. PyTorch will
            backpropagate the respective losses through to each network's parameters and optimize over time.
            """
            loss = policy_loss + (0.5 * value_loss) - (entropy * self.entropy_weight)  

            self.optimizer.zero_grad()                  
            loss.backward()
            # nn.utils.clip_grad_norm_( self.actor_model.parameters(), self.gradient_clip )
            # nn.utils.clip_grad_norm_( self.critic_model.parameters(), self.gradient_clip )
            self.optimizer.step()


            # Store a plain Python float so that np.average also works when training on the GPU
            losses.append(loss.item())

            # # some reporting to check performance
            # if self.actor_model == striker_0_actor:
            #     episode_loss.append(policy_loss.cpu().detach().numpy().squeeze().item())
            #     policy_loss_value.append(policy_loss.cpu().detach().numpy().squeeze().item())
            #     value_loss_value.append(value_loss.cpu().detach().numpy().squeeze().item())
            #     entropy_value.append(torch.mean(entropy))
        
            # if self.actor_model == goalie_0_actor:
            #     episode_loss.append(policy_loss.cpu().detach().numpy().squeeze().item())
            #     policy_loss_value_g.append(policy_loss.cpu().detach().numpy().squeeze().item())
            #     value_loss_value_g.append(value_loss.cpu().detach().numpy().squeeze().item())
            #     entropy_value_g.append(torch.mean(entropy))

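        # Schedules applied after every call to learn():
        # the clipping range stays constant, the entropy weight shrinks by 0.5%
        self.epsilon *= 1
        self.entropy_weight *= 0.995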

        return np.average(losses)
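
The clipped objective at the heart of `learn` can be illustrated for a single sample in plain Python. This is a sketch under assumptions: the function name `ppo_clip_objective` and the probabilities are made up for the example, and the real code works on whole mini-batches of tensors.

```python
import math

def ppo_clip_objective(new_log_prob, old_log_prob, advantage, epsilon):
    """Clipped surrogate for one sample (a quantity to be maximized)."""
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped * advantage)

eps = 0.1
old = math.log(0.2)   # the old policy gave the action probability 0.2

# Positive advantage, probability doubled: the gain is capped at (1 + eps) * A.
capped = ppo_clip_objective(math.log(0.4), old, advantage=1.0, epsilon=eps)

# Negative advantage, probability doubled: the penalty is not clipped.
penalized = ppo_clip_objective(math.log(0.4), old, advantage=-1.0, epsilon=eps)

print(capped, penalized)
```

Taking the minimum makes the objective pessimistic: the policy gains nothing from moving further than `epsilon` in a direction that looks good, but it is still fully penalized for moving in a direction that looks bad.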

##### It's training time

Putting all the above together:

In [None]:
# Set torch.device
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Start the environment
env = UnityEnvironment(file_name="Soccer.app")  # optionally pass no_graphics=True to run without a window
# Get info
# set the goalie brain
g_brain_name = env.brain_names[0]
g_brain = env.brains[g_brain_name]

# set the striker brain
s_brain_name = env.brain_names[1]
s_brain = env.brains[s_brain_name]


# Reset the environment
env_info = env.reset(train_mode=True)
# Goalie info
num_g_agents = len(env_info[g_brain_name].agents)
g_action_size = g_brain.vector_action_space_size
g_states = env_info[g_brain_name].vector_observations
g_state_size = g_states.shape[1]
# Striker info
num_s_agents = len(env_info[s_brain_name].agents)
s_action_size = s_brain.vector_action_space_size
s_states = env_info[s_brain_name].vector_observations
s_state_size = s_states.shape[1]


# Set hyperparameters
N_STEP = 8
BATCH_SIZE = 32
GAMMA = 0.995
EPSILON = 0.1
ENTROPY_WEIGHT = 0.001
GRADIENT_CLIP = 0.5
GOALIE_LR = 8e-5
STRIKER_LR = 1e-4

# Set checkpoints to save trained models
CHECKPOINT_GOALIE_ACTOR = './checkpoint_goalie_actor.pth'
CHECKPOINT_GOALIE_CRITIC = './checkpoint_goalie_critic.pth'
CHECKPOINT_STRIKER_ACTOR = './checkpoint_striker_actor.pth'
CHECKPOINT_STRIKER_CRITIC = './checkpoint_striker_critic.pth'

# Actors and Critics
GOALIE_0_KEY = 0
STRIKER_0_KEY = 0
GOALIE_1_KEY = 1
STRIKER_1_KEY = 1

# Goalie Actor-Critic
goalie_actor_model = Actor(g_state_size, g_action_size).to(DEVICE)
goalie_critic_model = Critic(g_state_size + s_state_size + g_state_size + s_state_size).to(DEVICE)
goalie_optim = optim.Adam(list(goalie_actor_model.parameters()) + list(goalie_critic_model.parameters()), lr=GOALIE_LR )
# self.optim = optim.RMSprop( list( self.actor_model.parameters() ) + list( self.critic_model.parameters() ), lr=lr, alpha=0.99, eps=1e-5 )
goalie_actor_model.load(CHECKPOINT_GOALIE_ACTOR)
goalie_critic_model.load(CHECKPOINT_GOALIE_CRITIC)

# Striker Actor-Critic
striker_actor_model = Actor( s_state_size, s_action_size ).to(DEVICE)
striker_critic_model = Critic(s_state_size + g_state_size + s_state_size + g_state_size).to(DEVICE)
striker_optim = optim.Adam(list(striker_actor_model.parameters()) + list(striker_critic_model.parameters()), lr=STRIKER_LR )
# self.optim = optim.RMSprop( list( self.actor_model.parameters() ) + list( self.critic_model.parameters() ), lr=lr, alpha=0.99, eps=1e-5 )
striker_actor_model.load(CHECKPOINT_STRIKER_ACTOR)
striker_critic_model.load(CHECKPOINT_STRIKER_CRITIC)


# Agents
goalie_0 = Agent(DEVICE, GOALIE_0_KEY, goalie_actor_model, N_STEP)
goalie_optimizer = Optimizer(DEVICE, goalie_actor_model, goalie_critic_model, goalie_optim, N_STEP, BATCH_SIZE, GAMMA, EPSILON, ENTROPY_WEIGHT, GRADIENT_CLIP)

striker_0 = Agent(DEVICE, STRIKER_0_KEY, striker_actor_model, N_STEP)
striker_optimizer = Optimizer(DEVICE, striker_actor_model, striker_critic_model, striker_optim, N_STEP, BATCH_SIZE, GAMMA, EPSILON, ENTROPY_WEIGHT, GRADIENT_CLIP)

def ppo_train():
    n_episodes = 5000
    team_0_window_score = deque(maxlen=100)
    team_0_window_score_wins = deque(maxlen=100)

    team_1_window_score = deque(maxlen=100)
    team_1_window_score_wins = deque(maxlen=100)

    draws = deque(maxlen=100)

    for episode in range(n_episodes):
        # Reset the environment
        env_info = env.reset(train_mode=True)
        # Get initial states
        g_states = env_info[g_brain_name].vector_observations
        s_states = env_info[s_brain_name].vector_observations
        # Initialize scores
        g_scores = np.zeros(num_g_agents)
        s_scores = np.zeros(num_s_agents)

        steps = 0
        while True:
            # Select actions and send to environment
            action_goalie_0, log_prob_goalie_0 = goalie_0.act(g_states[goalie_0.KEY])
            action_striker_0, log_prob_striker_0 = striker_0.act(s_states[striker_0.KEY])

            # action_goalie_1, log_prob_goalie_1 = goalie_1.act( g_states[goalie_1.KEY] )
            # action_striker_1, log_prob_striker_1 = striker_1.act( s_states[striker_1.KEY] )
            
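            # The opponent team (team 1) acts uniformly at random.
            # Plain integers are used so that they stack cleanly with the scalar actions of team 0.
            action_goalie_1 = np.random.randint(g_action_size)
            action_striker_1 = np.random.randint(s_action_size)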


            actions_goalies = np.array( (action_goalie_0, action_goalie_1) )                                    
            actions_strikers = np.array( (action_striker_0, action_striker_1) )

            actions = dict( zip( [g_brain_name, s_brain_name], [actions_goalies, actions_strikers] ) )

        
            env_info = env.step(actions)                                                
            # get next states
            goalies_next_states = env_info[g_brain_name].vector_observations         
            strikers_next_states = env_info[s_brain_name].vector_observations
            
            # get reward and update scores
            goalies_rewards = env_info[g_brain_name].rewards  
            strikers_rewards = env_info[s_brain_name].rewards
            g_scores += goalies_rewards
            s_scores += strikers_rewards
                        
            # check if episode finished
            done = np.any(env_info[g_brain_name].local_done)

            # store experiences
            goalie_0_reward = goalies_rewards[goalie_0.KEY]
            goalie_0.step( 
                g_states[goalie_0.KEY],
                np.concatenate( 
                    (
                        g_states[goalie_0.KEY],
                        s_states[striker_0.KEY],
                        g_states[GOALIE_1_KEY],
                        s_states[STRIKER_1_KEY],
                    ), axis=0 ),
                action_goalie_0,
                log_prob_goalie_0,
                goalie_0_reward 
            )


            striker_0_reward = strikers_rewards[striker_0.KEY]
            striker_0.step(                 
                s_states[striker_0.KEY],
                np.concatenate( 
                    (
                        s_states[striker_0.KEY],
                        g_states[goalie_0.KEY],                        
                        s_states[STRIKER_1_KEY],                 
                        g_states[GOALIE_1_KEY]                        
                    ), axis=0 ),               
                action_striker_0,
                log_prob_striker_0,
                striker_0_reward
            )


            # exit loop if episode finished
            if done:
                break  

            # roll over states to next time step
            g_states = goalies_next_states
            s_states = strikers_next_states

            steps += 1

        # learn
        goalie_loss = goalie_optimizer.learn(goalie_0.memory)
        striker_loss = striker_optimizer.learn(striker_0.memory)        

        goalie_actor_model.checkpoint( CHECKPOINT_GOALIE_ACTOR )   
        goalie_critic_model.checkpoint( CHECKPOINT_GOALIE_CRITIC )    
        striker_actor_model.checkpoint( CHECKPOINT_STRIKER_ACTOR )    
        striker_critic_model.checkpoint( CHECKPOINT_STRIKER_CRITIC )

        team_0_score = g_scores[goalie_0.KEY] + s_scores[striker_0.KEY]
        team_0_window_score.append( team_0_score )
        team_0_window_score_wins.append( 1 if team_0_score > 0 else 0)        

        team_1_score = g_scores[GOALIE_1_KEY] + s_scores[STRIKER_1_KEY]
        team_1_window_score.append( team_1_score )
        team_1_window_score_wins.append( 1 if team_1_score > 0 else 0 )

        draws.append( team_0_score == team_1_score )
        
        print('Episode: {} \tSteps: \t{} \tGoalie Loss: \t {:.10f} \tStriker Loss: \t {:.10f}'.format( episode + 1, steps, goalie_loss, striker_loss ))
        print('\tRed Wins: \t{} \tScore: \t{:.5f} \tAvg: \t{:.2f}'.format( np.count_nonzero(team_0_window_score_wins), team_0_score, np.mean(team_0_window_score) ))
        print('\tBlue Wins: \t{} \tScore: \t{:.5f} \tAvg: \t{:.2f}'.format( np.count_nonzero(team_1_window_score_wins), team_1_score, np.mean(team_1_window_score) ))
        print('\tDraws: \t{}'.format( np.count_nonzero(draws) ))

        if np.count_nonzero( team_0_window_score_wins ) >= 95:
            break
    

# train the agent
# ppo_train()

# test the trained agents
team_0_window_score = deque(maxlen=100)
team_0_window_score_wins = deque(maxlen=100)

team_1_window_score = deque(maxlen=100)
team_1_window_score_wins = deque(maxlen=100)

draws = deque(maxlen=100)

for episode in range(50):                                               # play game for n episodes
    env_info = env.reset(train_mode=False)                              # reset the environment    
    g_states = env_info[g_brain_name].vector_observations         # get initial state (goalies)
    s_states = env_info[s_brain_name].vector_observations        # get initial state (strikers)

    g_scores = np.zeros(num_g_agents)                          # initialize the score (goalies)
    s_scores = np.zeros(num_s_agents)                        # initialize the score (strikers)

    steps = 0

    while True:
        # select actions and send to environment
        action_goalie_0, log_prob_goalie_0 = goalie_0.act( g_states[goalie_0.KEY] )
        action_striker_0, log_prob_striker_0 = striker_0.act( s_states[striker_0.KEY] )

        # action_goalie_1, log_prob_goalie_1 = goalie_1.act( g_states[goalie_1.KEY] )
        # action_striker_1, log_prob_striker_1 = striker_1.act( s_states[striker_1.KEY] )
        
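        # The opponent team (team 1) acts uniformly at random.
        # Plain integers are used so that they stack cleanly with the scalar actions of team 0.
        action_goalie_1 = np.random.randint(g_action_size)
        action_striker_1 = np.random.randint(s_action_size)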


        actions_goalies = np.array( (action_goalie_0, action_goalie_1) )                                    
        actions_strikers = np.array( (action_striker_0, action_striker_1) )

        actions = dict( zip( [g_brain_name, s_brain_name], [actions_goalies, actions_strikers] ) )

    
        env_info = env.step(actions)                                                
        # get next states
        goalies_next_states = env_info[g_brain_name].vector_observations         
        strikers_next_states = env_info[s_brain_name].vector_observations
        
        # get reward and update scores
        goalies_rewards = env_info[g_brain_name].rewards  
        strikers_rewards = env_info[s_brain_name].rewards
        g_scores += goalies_rewards
        s_scores += strikers_rewards
                    
        # check if episode finished
        done = np.any(env_info[g_brain_name].local_done)

        # exit loop if episode finished
        if done:
            break  

        # roll over states to next time step
        g_states = goalies_next_states
        s_states = strikers_next_states

        steps += 1
        
    team_0_score = g_scores[goalie_0.KEY] + s_scores[striker_0.KEY]
    team_0_window_score.append( team_0_score )
    team_0_window_score_wins.append( 1 if team_0_score > 0 else 0)        

    team_1_score = g_scores[GOALIE_1_KEY] + s_scores[STRIKER_1_KEY]
    team_1_window_score.append( team_1_score )
    team_1_window_score_wins.append( 1 if team_1_score > 0 else 0 )

    draws.append( team_0_score == team_1_score )
    
    print('Episode {}'.format( episode + 1 ))
    print('\tRed Wins: \t{} \tScore: \t{:.5f} \tAvg: \t{:.2f}'.format( np.count_nonzero(team_0_window_score_wins), team_0_score, np.mean(team_0_window_score) ))
    print('\tBlue Wins: \t{} \tScore: \t{:.5f} \tAvg: \t{:.2f}'.format( np.count_nonzero(team_1_window_score_wins), team_1_score, np.mean(team_1_window_score) ))
    print('\tDraws: \t{}'.format( np.count_nonzero( draws ) ))

env.close()
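
The win counters above rely on `deque(maxlen=...)` acting as a sliding window: once the window is full, every new result pushes out the oldest one. A minimal sketch of that behaviour follows, with a window of 3 instead of 100 and made-up results to keep it short:

```python
from collections import deque

window = deque(maxlen=3)       # the notebook uses maxlen=100; 3 keeps the demo short
for won in [1, 0, 1, 1, 1]:
    window.append(won)         # the oldest entry is dropped once the window is full

wins_in_window = sum(window)
print(list(window), wins_in_window)
```

This is why the stopping rule in `ppo_train` counts wins over the most recent 100 games only, rather than over the whole training run.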