# Tiki-taka time! 

---

This notebook uses the Unity SoccerTwos environment, in which two teams of two players compete against each other, together with an actor-critic framework trained with Proximal Policy Optimization (PPO), to teach the agents how to win a soccer game! 

### Getting familiar with UnityEnvironments: Soccer


##### Environment setup

First we import the necessary packages:

In [None]:
import numpy as np
import torch
import torch.optim as optim
from unityagents import UnityEnvironment

##### Exploring UnityEnvironments: Soccer

Next, we start the environment! Environments contain **_brains_**, which are responsible for deciding the actions of their associated agents. Here we obtain separate brains for the striker and the goalie agents.

In [None]:
# Start the environment
env = UnityEnvironment(file_name="Soccer.app")
# Print the brain names
print(env.brain_names)

In [None]:
# There are two brains:
# 1. set the goalie brain
g_brain_name = env.brain_names[0]
g_brain = env.brains[g_brain_name]
# 2. set the striker brain
s_brain_name = env.brain_names[1]
s_brain = env.brains[s_brain_name]

Next, we shed more light on the environment by printing some relevant info. 

In [None]:
# Reset the environment
env_info = env.reset(train_mode=True)
# Goalie info
num_g_agents = len(env_info[g_brain_name].agents)
print('Number of goalie agents:', num_g_agents)
g_action_size = g_brain.vector_action_space_size
print('Number of goalie actions:', g_action_size)
g_states = env_info[g_brain_name].vector_observations
g_state_size = g_states.shape[1]
print('There are {} goalie agents. Each receives a state with length: {}'.format(g_states.shape[0], g_state_size))
# Striker info
num_s_agents = len(env_info[s_brain_name].agents)
print('Number of striker agents:', num_s_agents)
s_action_size = s_brain.vector_action_space_size
print('Number of striker actions:', s_action_size)
s_states = env_info[s_brain_name].vector_observations
s_state_size = s_states.shape[1]
print('There are {} striker agents. Each receives a state with length: {}'.format(s_states.shape[0], s_state_size))

# Close the environment
env.close()

##### Small simulation: sample actions in the environment

We use the Python API to control the agents and receive feedback from the environment. We will watch how the agents perform when they select an action at random at each time step.

In [None]:
# Start the environment
env = UnityEnvironment(file_name="Soccer.app")

for i in range(2):                                         # play game for 2 episodes
    env_info = env.reset(train_mode=False)                 # reset the environment    
    g_states = env_info[g_brain_name].vector_observations  # get initial state (goalies)
    s_states = env_info[s_brain_name].vector_observations  # get initial state (strikers)
    g_scores = np.zeros(num_g_agents)                      # initialize the score (goalies)
    s_scores = np.zeros(num_s_agents)                      # initialize the score (strikers)
    while True:
        # select actions and send to environment
        g_actions = np.random.randint(g_action_size, size=num_g_agents)
        s_actions = np.random.randint(s_action_size, size=num_s_agents)
        actions = dict(zip([g_brain_name, s_brain_name], [g_actions, s_actions]))
        env_info = env.step(actions)
        
        # get next states
        g_next_states = env_info[g_brain_name].vector_observations
        s_next_states = env_info[s_brain_name].vector_observations
        
        # get reward and update scores
        g_rewards = env_info[g_brain_name].rewards
        s_rewards = env_info[s_brain_name].rewards
        g_scores += g_rewards
        s_scores += s_rewards
        
        # check if episode finished
        done = np.any(env_info[g_brain_name].local_done)
        
        # roll over states to the next time step
        g_states = g_next_states
        s_states = s_next_states
        
        # exit loop if episode finished
        if done:
            break
    print('Scores from episode {}: {} (goalies), {} (strikers)'.format(i+1, g_scores, s_scores))

# Close the environment
env.close()

### Time for training!

In the spirit of becoming better programmers :) we use classes for cleaner code, so all the magic lives in the corresponding Python files. Here we provide a short description of each.

##### Actor Critic Networks

The *Actor* receives its *own state* and outputs:
* an action, 
* the log probability of that action (used later to compute the PPO probability ratio between the new and the old policy), and 
* the entropy of the probability distribution (higher entropy means more uncertainty). 

The entropy enters the *loss function* as a bonus term. Intuitively, it urges the agent to keep trying varied actions early on, so that it does not get stuck on an action that fares well in the short term but is not optimal in the long term. That is, it helps avoid poor local optima. 

The *Critic* receives the *combined states of all agents* on the field and outputs the expected total reward (the value) of that combined state. 
This value is compared to the actual total reward that followed an actor's action; the difference tells us how much better the chosen action turned out than the critic expected on average. This is called the *advantage*. 
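
To make the advantage concrete, here is a minimal sketch that computes discounted returns for one short rollout and subtracts the critic's estimates. The helper `discounted_returns`, the reward and value numbers, and `gamma=0.5` are illustrative assumptions chosen so the arithmetic can be checked by hand (the notebook itself uses `GAMMA = 0.995`), and the code in `optimizer.py` may compute its targets differently.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one rollout."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [0.0, 0.0, 1.0]            # hypothetical rewards for a 3-step rollout
values = np.array([0.5, 0.6, 0.7])   # hypothetical critic estimates for those steps
returns = discounted_returns(rewards, gamma=0.5)
advantages = returns - values        # positive: the outcome beat the critic's estimate
print(returns)
print(advantages)
```

With these numbers the returns are 0.25, 0.5 and 1.0, so only the last step has a positive advantage: the critic expected 0.7 and the agent actually collected 1.0.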

A note on the distribution object:

The actor cannot simply output softmax action probabilities, sample an action from them, and backpropagate through that sample, because sampling a discrete action is not a differentiable operation. PyTorch and TensorFlow provide [distribution classes](https://pytorch.org/docs/stable/distributions.html) for this situation. The actor wraps its softmax output in a distribution object (for discrete actions, typically a categorical one), samples the action from it, and asks the distribution for the log probability of that action. The sample itself carries no gradient, but the log probability is differentiable with respect to the network's weights, and that is what the loss backpropagates through.
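
The following sketch shows this with `torch.distributions.Categorical`. The tiny linear `actor_head`, its sizes, and the random `state` are placeholders invented for the example; they are not the `Actor` class from `actor_critic_models.py`.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical actor head: 4 state features -> 3 action logits
actor_head = torch.nn.Linear(4, 3)
state = torch.randn(1, 4)

probs = torch.softmax(actor_head(state), dim=-1)  # action probabilities
dist = Categorical(probs=probs)

action = dist.sample()            # sampled action index; carries no gradient
log_prob = dist.log_prob(action)  # differentiable w.r.t. the actor's weights
entropy = dist.entropy()          # differentiable as well

# Gradients reach the actor through log_prob, not through the sample
log_prob.sum().backward()
print(action.requires_grad, actor_head.weight.grad is not None)
```

The sampled `action` is a plain integer tensor, while `log_prob` and `entropy` stay attached to the computation graph, which is why they are the quantities that appear in the loss.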

The definitions are found in the [actor_critic_models.py](./utils/actor_critic_models.py) under the classes `Actor` and `Critic`. Each class has typical neural network functions `forward`, `load`, and `checkpoint`.

##### Memory

The `Memory` class holds the experiences used for training. For every step of an episode it stores `[actor_states, critic_states, actions, log_probabilities, rewards]`. The definitions are found in [memory.py](./utils/memory.py). 
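
As a rough illustration of such a buffer, the sketch below stores one record per step and returns the fields column by column. `RolloutMemory`, `Experience`, and the method names `add`, `unzip`, and `clear` are hypothetical stand-ins, not the interface of the notebook's `Memory` class.

```python
from collections import namedtuple

# Field names mirror the list described above
Experience = namedtuple(
    "Experience",
    ["actor_state", "critic_state", "action", "log_prob", "reward"],
)

class RolloutMemory:
    """Hypothetical stand-in for the notebook's Memory class."""

    def __init__(self):
        self.experiences = []

    def add(self, actor_state, critic_state, action, log_prob, reward):
        self.experiences.append(
            Experience(actor_state, critic_state, action, log_prob, reward)
        )

    def unzip(self):
        """Return one tuple per field, in the order the steps were added."""
        if not self.experiences:
            return Experience((), (), (), (), ())
        return Experience(*zip(*self.experiences))

    def clear(self):
        self.experiences = []

    def __len__(self):
        return len(self.experiences)

memory = RolloutMemory()
memory.add([0.1, 0.2], [0.1, 0.2, 0.3, 0.4], 2, -1.1, 0.0)
memory.add([0.3, 0.4], [0.3, 0.4, 0.5, 0.6], 0, -0.7, 1.0)
batch = memory.unzip()
print(batch.action, batch.reward)  # (2, 0) (0.0, 1.0)
```

Keeping the log probabilities recorded at acting time matters for PPO, because they serve as the "old policy" reference when the policy is later updated.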

##### Agents

The `Agent` class is a wrapper for each agent: it associates an actor model with the experiences that actor collects during the running episodes. The definitions are found in [agent.py](./utils/agent.py). 

##### Optimizer

The `Optimizer` class is used by each agent. It holds the actor model and the critic model, and performs the actual learning step based on the PPO loss function. The definitions are found in [optimizer.py](./utils/optimizer.py). 
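
For orientation, here is a minimal sketch of a PPO-style loss: the clipped surrogate objective for the actor, a mean-squared error for the critic, and an entropy bonus. The function `ppo_loss`, the equal weighting of actor and critic terms, and the two-step mini-batch are assumptions made for the example; the loss in `optimizer.py` may be assembled differently. The defaults mirror the `EPSILON` and `ENTROPY_WEIGHT` values used later in this notebook.

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, epsilon=0.1, entropy_weight=0.001):
    """Clipped surrogate actor loss + critic MSE - entropy bonus (a sketch)."""
    # Probability ratio between the updated policy and the one that acted
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Pessimistic (minimum) objective, negated because optimizers minimize
    actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    critic_loss = (returns - values).pow(2).mean()
    return actor_loss + critic_loss - entropy_weight * entropy.mean()

# Hypothetical mini-batch of two steps
old_lp = torch.log(torch.tensor([0.5, 0.5]))
new_lp = torch.log(torch.tensor([0.75, 0.25]))  # ratios 1.5 and 0.5
adv = torch.tensor([1.0, -1.0])
zeros = torch.zeros(2)                          # switch off critic and entropy terms
loss = ppo_loss(new_lp, old_lp, adv, zeros, zeros, zeros)
print(loss.item())
```

In this mini-batch both ratios fall outside the interval [0.9, 1.1]. The first step's objective is capped at 1.1 instead of 1.5, and for the second step the minimum selects the more pessimistic clipped value (-0.9 rather than -0.5), so the loss comes out at about -0.1. In both cases the selected term is the clipped constant, so these samples contribute no gradient, which is how PPO keeps a single update from moving the policy too far from the one that collected the data.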

##### Trainer

Finally, the `Trainer` class is a wrapper that trains and tests all agents on the whole environment. The definitions are found in [trainer.py](./utils/trainer.py). 

### It's game time!
Putting all the above together we start the environment, create the agents and optimizers, and call our coach to train and test!

In [None]:
import numpy as np
import torch
import torch.optim as optim
from unityagents import UnityEnvironment

from utils.agent import Agent
from utils.actor_critic_models import Actor, Critic
from utils.optimizer import Optimizer
from utils.trainer import Trainer

############ ENVIRONMENT ############ 
# Start the environment
env = UnityEnvironment(file_name="Soccer.app", no_graphics=True)
# Get info: set the brains
g_brain_name = env.brain_names[0]
g_brain = env.brains[g_brain_name]
s_brain_name = env.brain_names[1]
s_brain = env.brains[s_brain_name]
# Reset the environment
env_info = env.reset(train_mode=True)
# Goalie info
num_g_agents = len(env_info[g_brain_name].agents)
g_action_size = g_brain.vector_action_space_size
g_states = env_info[g_brain_name].vector_observations
g_state_size = g_states.shape[1]
# Striker info
num_s_agents = len(env_info[s_brain_name].agents)
s_action_size = s_brain.vector_action_space_size
s_states = env_info[s_brain_name].vector_observations
s_state_size = s_states.shape[1]

############ HYPERPARAMETERS ############
# Set hyperparameters
N_EPOCHS = 6000
N_STEP = 8
BATCH_SIZE = 32
GAMMA = 0.995
EPSILON = 0.1
ENTROPY_WEIGHT = 0.001
GRADIENT_CLIP = 0.5
GOALIE_LR = 8e-5
STRIKER_LR = 8e-5

# Set torch.device
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

############ ACTOR-CRITIC MODELS ############
# Set checkpoints for models
CHECKPOINT_GOALIE_ACTOR = './checkpoint_g_actor.pth'
CHECKPOINT_GOALIE_CRITIC = './checkpoint_g_critic.pth'
CHECKPOINT_STRIKER_ACTOR = './checkpoint_s_actor.pth'
CHECKPOINT_STRIKER_CRITIC = './checkpoint_s_critic.pth'
# Actors and Critics
GOALIE_0_KEY = 0
STRIKER_0_KEY = 0
GOALIE_1_KEY = 1
STRIKER_1_KEY = 1
# Goalie Actor-Critic
goalie_actor_model = Actor(g_state_size, g_action_size, CHECKPOINT_GOALIE_ACTOR).to(DEVICE)
goalie_critic_model = Critic(g_state_size + s_state_size + g_state_size + s_state_size, CHECKPOINT_GOALIE_CRITIC).to(DEVICE)
goalie_optim = optim.Adam(list(goalie_actor_model.parameters()) + list(goalie_critic_model.parameters()), lr=GOALIE_LR)
goalie_actor_model.load()
goalie_critic_model.load()
# Striker Actor-Critic
striker_actor_model = Actor(s_state_size, s_action_size, CHECKPOINT_STRIKER_ACTOR).to(DEVICE)
striker_critic_model = Critic(s_state_size + g_state_size + s_state_size + g_state_size, CHECKPOINT_STRIKER_CRITIC).to(DEVICE)
striker_optim = optim.Adam(list(striker_actor_model.parameters()) + list(striker_critic_model.parameters()), lr=STRIKER_LR)
striker_actor_model.load()
striker_critic_model.load()

############ AGENTS ############
goalie_0 = Agent(DEVICE, GOALIE_0_KEY, goalie_actor_model, N_STEP)
goalie_optimizer = Optimizer(DEVICE, goalie_actor_model, goalie_critic_model, goalie_optim, N_STEP, BATCH_SIZE, GAMMA, EPSILON, ENTROPY_WEIGHT, GRADIENT_CLIP)
striker_0 = Agent(DEVICE, STRIKER_0_KEY, striker_actor_model, N_STEP)
striker_optimizer = Optimizer(DEVICE, striker_actor_model, striker_critic_model, striker_optim, N_STEP, BATCH_SIZE, GAMMA, EPSILON, ENTROPY_WEIGHT, GRADIENT_CLIP)

############ TRAINING ############
# Get the coach!
coach = Trainer(env, DEVICE, N_EPOCHS, goalie_0, goalie_optimizer, striker_0, striker_optimizer)
# Train the team
coach.ppo_train(GOALIE_1_KEY, STRIKER_1_KEY)
# Test the team
coach.test(GOALIE_1_KEY, STRIKER_1_KEY)

############ THE END ############
# Close the environment
env.close()