# Collaboration and Competition

---

### Start the Environment

We begin by importing the necessary packages. If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/). The cells below also import PyTorch and the project-local modules `dqn`, `model`, and `agent`.

In [1]:
import sys
sys.path.insert(0, "python/")

from unityagents import UnityEnvironment
import numpy as np
import torch
from dqn.dqn_agent_v2 import Agent
from collections import deque

In [2]:
from model import ActorModel, CriticModel
from agent import Agent as PpoAgent

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Soccer.app"`
- **Windows** (x86): `"path/to/Soccer_Windows_x86/Soccer.exe"`
- **Windows** (x86_64): `"path/to/Soccer_Windows_x86_64/Soccer.exe"`
- **Linux** (x86): `"path/to/Soccer_Linux/Soccer.x86"`
- **Linux** (x86_64): `"path/to/Soccer_Linux/Soccer.x86_64"`
- **Linux** (x86, headless): `"path/to/Soccer_Linux_NoVis/Soccer.x86"`
- **Linux** (x86_64, headless): `"path/to/Soccer_Linux_NoVis/Soccer.x86_64"`

For instance, if you are using a Mac, then you downloaded `Soccer.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Soccer.app")
```
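
If the notebook is shared across machines, the list above can be turned into a small lookup. The sketch below is only an illustration: `default_soccer_path` is a hypothetical helper, not part of Unity ML-Agents, and the strings it returns are the placeholder paths from the list, which still need to be adapted to where the environment was actually downloaded.

```python
import platform
import sys

def default_soccer_path(headless=False):
    """Return the placeholder path from the list above for the current OS."""
    system = platform.system()          # 'Darwin', 'Windows' or 'Linux'
    is_64bit = sys.maxsize > 2**32      # True on a 64-bit Python build
    if system == "Darwin":
        return "path/to/Soccer.app"
    if system == "Windows":
        folder = "Soccer_Windows_x86_64" if is_64bit else "Soccer_Windows_x86"
        return "path/to/{}/Soccer.exe".format(folder)
    # Assumption: anything else is treated as Linux
    folder = "Soccer_Linux_NoVis" if headless else "Soccer_Linux"
    suffix = "x86_64" if is_64bit else "x86"
    return "path/to/{}/Soccer.{}".format(folder, suffix)

print(default_soccer_path())
```

The result can then be passed as `file_name` to `UnityEnvironment` once the `path/to` prefix has been replaced.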

In [3]:
env = UnityEnvironment(file_name="Soccer_Env/Soccer.exe")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 2
        Number of External Brains : 2
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: GoalieBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 112
        Number of stacked Vector Observation: 3
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 
Unity brain name: StrikerBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 112
        Number of stacked Vector Observation: 3
        Vector Action space type: discrete
        Vector Action space size (per agent): 6
        Vector Action descriptions: , , , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we obtain separate brains for the striker and goalie agents.

In [4]:
# print the brain names
print(env.brain_names)

# set the goalie brain
g_brain_name = env.brain_names[0]
g_brain = env.brains[g_brain_name]

# set the striker brain
s_brain_name = env.brain_names[1]
s_brain = env.brains[s_brain_name]

['GoalieBrain', 'StrikerBrain']
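
Each brain controls one agent per team, so the environment is later given one array of actions per brain, keyed by brain name. A minimal sketch of that grouping, assuming index 0 is the red team and index 1 the blue team (as the comments in the evaluation loop later in this notebook suggest); `build_actions` is a hypothetical helper and the action values are made up:

```python
import numpy as np

def build_actions(goalie_actions, striker_actions,
                  g_brain_name="GoalieBrain", s_brain_name="StrikerBrain"):
    """Group per-team actions by brain name: index 0 = red, index 1 = blue."""
    return {
        g_brain_name: np.array(goalie_actions),
        s_brain_name: np.array(striker_actions),
    }

# Red goalie picks action 2, blue goalie 0; red striker 5, blue striker 1.
actions = build_actions([2, 0], [5, 1])
print(actions["GoalieBrain"], actions["StrikerBrain"])
```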


### Obtain the State and Action Spaces

In [5]:
# reset the environment
env_info = env.reset(train_mode=True)

# number of agents 
num_g_agents = len(env_info[g_brain_name].agents)
print('Number of goalie agents:', num_g_agents)
num_s_agents = len(env_info[s_brain_name].agents)
print('Number of striker agents:', num_s_agents)

# number of actions
g_action_size = g_brain.vector_action_space_size
print('Number of goalie actions:', g_action_size)
s_action_size = s_brain.vector_action_space_size
print('Number of striker actions:', s_action_size)

# examine the state space 
g_states = env_info[g_brain_name].vector_observations
g_state_size = g_states.shape[1]
print('There are {} goalie agents. Each receives a state with length: {}'.format(g_states.shape[0], g_state_size))
s_states = env_info[s_brain_name].vector_observations
s_state_size = s_states.shape[1]
print('There are {} striker agents. Each receives a state with length: {}'.format(s_states.shape[0], s_state_size))

Number of goalie agents: 2
Number of striker agents: 2
Number of goalie actions: 4
Number of striker actions: 6
There are 2 goalie agents. Each receives a state with length: 336
There are 2 striker agents. Each receives a state with length: 336
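
The reported state length of 336 is consistent with the brain description printed earlier: 112 values per observation times 3 stacked observations. Below is a small sketch of splitting one flat state back into its frames. The ordering of frames within the stack is not established in this notebook, so the code makes no claim about which row is the most recent; `split_frames` and the dummy state are illustrative only.

```python
import numpy as np

OBS_SIZE = 112      # "Vector Observation space size (per agent)"
NUM_STACKED = 3     # "Number of stacked Vector Observation"

def split_frames(state):
    """Reshape a flat stacked state into (NUM_STACKED, OBS_SIZE)."""
    state = np.asarray(state)
    assert state.size == OBS_SIZE * NUM_STACKED
    return state.reshape(NUM_STACKED, OBS_SIZE)

dummy_state = np.arange(336, dtype=float)   # stand-in for g_states[0]
frames = split_frames(dummy_state)
print(frames.shape)
```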


### DQN Agent

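As a first attempt, we train DQN agents, adapting the implementation from the Udacity Deep Reinforcement Learning [repo](https://github.com/udacity/deep-reinforcement-learning). One agent is created for the goalie and one for the striker.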

In [5]:
g_agent = Agent(state_size=g_state_size, action_size=g_action_size, seed=0)
s_agent = Agent(state_size=s_state_size, action_size=s_action_size, seed=0)
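
During training, `act(state, eps)` is called with an exploration rate `eps`. The agent's internals are not shown in this notebook, so the following is only a generic sketch of epsilon-greedy selection over a vector of Q-values. The function `epsilon_greedy` and the sample Q-values are illustrative and are not taken from the `Agent` class.

```python
import numpy as np

def epsilon_greedy(q_values, eps, rng):
    """With probability eps pick a random action, otherwise the greedy one."""
    q_values = np.asarray(q_values)
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

rng = np.random.default_rng(0)
q = [0.1, 0.7, 0.2, 0.0]                 # made-up Q-values for 4 goalie actions
greedy_action = epsilon_greedy(q, eps=0.0, rng=rng)
random_action = epsilon_greedy(q, eps=1.0, rng=rng)
print(greedy_action, random_action)
```

With `eps=0.0` the call always returns the index of the largest Q-value; with `eps=1.0` it returns a uniformly random valid action.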

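We also attempt to shape the striker's reward so that it is incentivized to go after the ball: a small penalty when no ray sees the ball, and a bonus when the ball is very close.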

In [6]:
def ball_reward(state):
    """Shaped reward that encourages the striker to find and approach the ball.

    Params
    ======
        state : current state of striker, 3 stacked 112-element vectors.
            Each ray contributes 8 consecutive values (0-based offsets):
            0: ball
            1: opponent's goal
            2: own goal
            3: wall
            4: teammate
            5: opponent
            6: not used here (assumed to be a "nothing hit" flag)
            7: distance
    """
    reward = 0.0
    # Penalize if no ray sees the ball
    if not np.any(state[0::8]):
        reward = -0.03
    # Reward for being close enough to the ball to kick it
    else:
        idx = np.where(state[0::8])[0] # check which rays see the ball
        distance = state[idx*8 + 7] # get the corresponding distances to the ball
        if (np.amin(distance) <= 0.03): # Just picking some thresholds for now.
            reward = 0.3

    return reward
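
The indexing in `ball_reward` can be easier to follow with the state reshaped into one row per ray. The sketch below assumes each ray contributes 8 consecutive values, with the ball flag at offset 0 and the distance at offset 7, as the stride in `ball_reward` implies. `nearest_ball_distance` is a hypothetical helper and the test state is synthetic.

```python
import numpy as np

VALUES_PER_RAY = 8   # assumption taken from the stride used in ball_reward
BALL_FLAG = 0        # offset of the "ray sees the ball" flag
DISTANCE = 7         # offset of the distance value

def nearest_ball_distance(state):
    """Return the smallest distance among rays that see the ball, or None."""
    rays = np.asarray(state, dtype=float).reshape(-1, VALUES_PER_RAY)
    sees_ball = rays[:, BALL_FLAG] != 0
    if not sees_ball.any():
        return None
    return float(rays[sees_ball, DISTANCE].min())

state = np.zeros(336)                      # synthetic state: nothing detected
no_ball = nearest_ball_distance(state)

state[2 * VALUES_PER_RAY + BALL_FLAG] = 1.0    # third ray sees the ball...
state[2 * VALUES_PER_RAY + DISTANCE] = 0.02    # ...at distance 0.02
close_ball = nearest_ball_distance(state)
print(no_ball, close_ball)
```

Under these assumptions, the first state would receive the -0.03 penalty and the second the 0.3 bonus, since 0.02 is below the 0.03 threshold.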

In [14]:
def dqn(n_episodes=2000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Deep Q-Learning.
    
    Params
    ======
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        eps_start (float): starting value of epsilon, for epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon
    """
    scores = []                        # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    eps = eps_start                    # initialize epsilon
    g_losses = []
    g_losses_window = deque(maxlen=100)
    s_losses = []
    s_losses_window = deque(maxlen=100)
    for i_episode in range(1, n_episodes+1):
        env_info  = env.reset(train_mode=True)
        score = 0
        ball_reward_val = 0.0
        
        g_states = env_info[g_brain_name].vector_observations       # get initial state (goalies)
        s_states = env_info[s_brain_name].vector_observations       # get initial state (strikers)

        g_scores = np.zeros(num_g_agents)                           # initialize the score (goalies)
        s_scores = np.zeros(num_s_agents)                           # initialize the score (strikers)  
        
        for t in range(max_t):
            action_g_0 = g_agent.act(g_states[0], eps)        # always pick state index 0
            action_s_0 = s_agent.act(s_states[0], eps)  
            
            # Set other team to random
            action_g_1 = np.asarray( [np.random.choice(g_action_size)] ) 
            action_s_1 = np.asarray( [np.random.choice(s_action_size)] )
            # Train simultaneously
            #action_g_1 = g_agent.act(g_states[1], eps)        # always pick state index 1
            #action_s_1 = s_agent.act(s_states[1], eps) 
            
            # Combine actions
            actions_g = np.array( (action_g_0, action_g_1) )                                    
            actions_s = np.array( (action_s_0, action_s_1) )
            actions = dict( zip( [g_brain_name, s_brain_name], [actions_g, actions_s] ) )
            
            env_info = env.step(actions)                                                
            # get next states
            g_next_states = env_info[g_brain_name].vector_observations         
            s_next_states = env_info[s_brain_name].vector_observations
            
            # get reward and update scores
            g_rewards = env_info[g_brain_name].rewards
            #print(g_rewards)
            s_rewards = env_info[s_brain_name].rewards
            #print(s_rewards)
            g_scores += g_rewards
            s_scores += s_rewards
            
            # shaped reward for the controlled striker (computed once per step)
            step_ball_reward = ball_reward(s_states[0])
            ball_reward_val += step_ball_reward

            # check if episode finished
            done = np.any(env_info[g_brain_name].local_done)

            # store experiences
            g_agent.step(g_states[0], action_g_0, g_rewards[0], 
                         g_next_states[0], done)
            s_agent.step(s_states[0], action_s_0, s_rewards[0] + step_ball_reward, # adding ball reward
                         s_next_states[0], done)

            if done:
                break
                
            g_states = g_next_states
            s_states = s_next_states
                
        # learn
        goalie_loss = g_agent.learn(g_agent.memory.sample(), 0.99) # discount = 0.99
        striker_loss = s_agent.learn(s_agent.memory.sample(), 0.99) # discount = 0.99 
        
        g_losses.append(goalie_loss.item())
        g_losses_window.append(goalie_loss.item())
        s_losses.append(striker_loss.item())
        s_losses_window.append(striker_loss.item())
        
        score = g_scores[0] + s_scores[0]
        scores_window.append(score)       # save most recent score
        scores.append(score)              # save most recent score
        
        eps = max(eps_end, eps_decay*eps) # decrease epsilon
        print('\rEpisode {}\tAverage Score: {:.2f}\t Goalie Loss:' \
                  '{:.5f}\t Striker Loss: {:.5f}' \
                  '\t Ball Reward: {:.2f}'.format(i_episode, \
                                                  np.mean(scores_window), \
                                                  np.mean(g_losses_window), \
                                                  np.mean(s_losses_window), \
                                                  ball_reward_val), end="")
        #print(s_states[0][0:56])
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}\t Goalie Loss:' \
                  '{:.5f}\t Striker Loss: {:.5f}\n' \
                  '\t Ball Reward: {:.2f}'.format(i_episode, \
                                                  np.mean(scores_window), \
                                                  np.mean(g_losses_window), \
                                                  np.mean(s_losses_window), \
                                                  ball_reward_val))
            torch.save(g_agent.qnetwork_local.state_dict(), 'checkpoint_goalie.pth')
            torch.save(s_agent.qnetwork_local.state_dict(), 'checkpoint_striker.pth')
    return scores
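
The exploration schedule in `dqn` multiplies epsilon by `eps_decay` once per episode and clips it at `eps_end`. To get a feel for how long exploration lasts, the sketch below counts the episodes needed to reach the floor. `episodes_until_floor` is a hypothetical helper that only mirrors the update `eps = max(eps_end, eps_decay*eps)`.

```python
def episodes_until_floor(eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Count per-episode decays until epsilon first reaches eps_end."""
    if not 0.0 < eps_decay < 1.0:
        raise ValueError("eps_decay must be in (0, 1) for epsilon to shrink")
    eps, episodes = eps_start, 0
    while eps > eps_end:
        eps = max(eps_end, eps_decay * eps)   # same update as in dqn()
        episodes += 1
    return episodes

default_schedule = episodes_until_floor()                          # defaults of dqn()
fixed_schedule = episodes_until_floor(eps_start=0.1, eps_end=0.1)  # already at the floor
print(default_schedule, fixed_schedule)
```

When `eps_start` equals `eps_end`, the count is zero and epsilon stays constant for the whole run.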


Set the hyperparameters, resume from saved checkpoints, and train the DQN agents.

In [None]:
n_episodes = 5000
n_episodes = 5      # short run for a quick check; overrides the value above
max_t = 100000
eps_start = 1.0
eps_end = 0.1
eps_decay = 0.9995

# Pick up where we left off
#GOALIE = './trained_models/goalie_dqn_run1.pth'
#STRIKER = './trained_models/striker_dqn_run1.pth'
#g_agent.qnetwork_local.load (GOALIE )
#s_agent.qnetwork_local.load( STRIKER )
GOALIE = './trained_models/goalie_dqn_ballreward2.pth'
STRIKER = './trained_models/striker_dqn_ballreward2.pth'
# `load` is assumed to be a helper defined on the project's Q-network class
g_agent.qnetwork_local.load(GOALIE)
s_agent.qnetwork_local.load(STRIKER)

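# Train
# Resuming from trained weights, so epsilon is held fixed at 0.1
# (eps_start == eps_end; this overrides eps_start = 1.0 set above)
eps_start = 0.1
eps_end = 0.1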
scores = dqn(n_episodes, max_t, eps_start, eps_end, eps_decay)
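
`dqn` returns one score per episode, which tends to be noisy. A simple way to inspect the trend is a moving average, similar in spirit to the 100-episode `scores_window` used for logging. The sketch below is a generic NumPy helper (`moving_average` is a hypothetical name) and the example values are made up, not training results. Unlike the deque in `dqn`, it only produces values once a full window is available.

```python
import numpy as np

def moving_average(scores, window=100):
    """Mean over each full sliding window of `window` consecutive scores."""
    scores = np.asarray(scores, dtype=float)
    if scores.size < window:
        return np.array([])                   # not enough episodes yet
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="valid")

example_scores = [0.0, 1.0, 2.0, 3.0, 4.0]    # made-up values
smoothed = moving_average(example_scores, window=2)
print(smoothed)
```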

### Test Trained Agent

In [6]:
# Load trained agents and run
# DQN
# g_agent_red = Agent(state_size=g_state_size, action_size=g_action_size, seed=0)
# s_agent_red = Agent(state_size=s_state_size, action_size=s_action_size, seed=0)
g_agent_blue = Agent(state_size=g_state_size, action_size=g_action_size, seed=0)
s_agent_blue = Agent(state_size=s_state_size, action_size=s_action_size, seed=0)
# PPO
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
goalie_actor_model = ActorModel( g_state_size, g_action_size ).to(DEVICE)
striker_actor_model = ActorModel( s_state_size, s_action_size ).to(DEVICE)
N_STEP = 8



# RED TEAM -------------------------------------
# DQN_base ----
GOALIE_red = './trained_models/goalie_dqn_run1.pth'
STRIKER_red = './trained_models/striker_dqn_run1.pth'
# DQN_1 ----
# GOALIE_red = './trained_models/goalie_dqn_V1.pth'
# STRIKER_red = './trained_models/striker_dqn_V1.pth'
# DQN_1_mod ----
# GOALIE_red = './trained_models/goalie_dqn_V1_modified.pth'
# STRIKER_red = './trained_models/striker_dqn_V1_modified.pth'
# DQN_2 ----
# GOALIE_red = './trained_models/goalie_dqn_V2.pth'
# STRIKER_red = './trained_models/striker_dqn_V2.pth'
# DQN_2_mod ----
# GOALIE_red = './trained_models/goalie_dqn_V2_mod.pth'
# STRIKER_red = './trained_models/striker_dqn_V2_mod.pth'

# g_agent_red.qnetwork_local.load (GOALIE_red )
# s_agent_red.qnetwork_local.load( STRIKER_red )
# PPO_1 ----
# goalie_actor_model.load( './trained_models/checkpoint_goalie_actor_v1.pth' )
# striker_actor_model.load( './trained_models/checkpoint_striker_actor_v1.pth' )
g_agent_red = PpoAgent( DEVICE, 0, goalie_actor_model, N_STEP )
s_agent_red = PpoAgent( DEVICE, 0, striker_actor_model, N_STEP )

# BLUE TEAM -------------------------------------
# DQN_base
# GOALIE_blue = './trained_models/goalie_dqn_run1.pth'
# STRIKER_blue = './trained_models/striker_dqn_run1.pth'
# DQN_1 ----
# GOALIE_blue = './trained_models/goalie_dqn_V1.pth'
# STRIKER_blue = './trained_models/striker_dqn_V1.pth'
# DQN_1_mod ----
# GOALIE_blue = './trained_models/goalie_dqn_V1_modified.pth'
# STRIKER_blue = './trained_models/striker_dqn_V1_modified.pth'
# DQN_2 ----
# GOALIE_blue = './trained_models/goalie_dqn_V2.pth'
# STRIKER_blue = './trained_models/striker_dqn_V2.pth'
# DQN_2_mod ----
GOALIE_blue = './trained_models/goalie_dqn_V2_mod.pth'
STRIKER_blue = './trained_models/striker_dqn_V2_mod.pth'

g_agent_blue.qnetwork_local.load(GOALIE_blue)
s_agent_blue.qnetwork_local.load(STRIKER_blue)

In [None]:
team_red_window_score = []
team_red_window_score_wins = []

team_blue_window_score = []
team_blue_window_score_wins = []

draws = []

for i in range(500):                                       # play 500 evaluation episodes
    env_info = env.reset(train_mode=True)                  # reset the environment    
    g_states = env_info[g_brain_name].vector_observations  # get initial state (goalies)
    s_states = env_info[s_brain_name].vector_observations  # get initial state (strikers)
    g_scores = np.zeros(num_g_agents)                      # initialize the score (goalies)
    s_scores = np.zeros(num_s_agents)                      # initialize the score (strikers)
    while True:
        # RED TEAM actions
#         action_g_0 = g_agent_red.act(g_states[0], 0)       # always pick state index 0 for red
#         action_s_0 = s_agent_red.act(s_states[0], 0)  
        # Get action for PPO agent
        action_g_0, _ = g_agent_red.act( g_states[0] )
        action_s_0, _ = s_agent_red.act( s_states[0] )

        # BLUE TEAM actions
        # ----- RANDOM -----
        action_g_1 = np.asarray( [np.random.choice(g_action_size)] ) 
        action_s_1 = np.asarray( [np.random.choice(s_action_size)] )
        # ----- Trained -----
#         action_g_1 = g_agent_blue.act(g_states[1], 0)      # always pick state index 1 for blue
#         action_s_1 = s_agent_blue.act(s_states[1], 0)

        # Combine actions
        actions_g = np.array( (action_g_0, action_g_1) )                                    
        actions_s = np.array( (action_s_0, action_s_1) )
        actions = dict( zip( [g_brain_name, s_brain_name], [actions_g, actions_s] ) )

        env_info = env.step(actions)                       
        
        # get next states
        g_next_states = env_info[g_brain_name].vector_observations         
        s_next_states = env_info[s_brain_name].vector_observations
        
        # get reward and update scores
        g_rewards = env_info[g_brain_name].rewards  
        s_rewards = env_info[s_brain_name].rewards
        g_scores += g_rewards
        s_scores += s_rewards
        
        # check if episode finished
        done = np.any(env_info[g_brain_name].local_done)  
        
        # roll over states to next time step
        g_states = g_next_states
        s_states = s_next_states
        
        # exit loop if episode finished
        if done:                                           
            break
    team_red_score = g_scores[0] + s_scores[0]
    team_red_window_score.append( team_red_score )
    team_red_window_score_wins.append( 1 if team_red_score > 0 else 0)        

    team_blue_score = g_scores[1] + s_scores[1]
    team_blue_window_score.append( team_blue_score )
    team_blue_window_score_wins.append( 1 if team_blue_score > 0 else 0 )

    draws.append( team_red_score == team_blue_score )
    print('Scores from episode {}: {} (goalies), {} (strikers)'.format(i+1, g_scores, s_scores))

print('Red Wins: \t{} \tScore: \t{:.5f} \tAvg: \t{:.2f} \tDraws: \t{}'.format( \
                  np.count_nonzero(team_red_window_score_wins), np.sum(team_red_window_score), \
                  np.mean(team_red_window_score), np.count_nonzero(draws) ))

env.close()

Scores from episode 1: [-0.52000005  0.58000001] (goalies), [-0.58000001  0.52000005] (strikers)
Scores from episode 2: [-0.60166666  0.49833334] (goalies), [-0.49833334  0.60166666] (strikers)
Scores from episode 3: [ 0.46666668 -0.63333333] (goalies), [ 0.63333333 -0.46666668] (strikers)
Scores from episode 4: [-0.07333343  1.02666668] (goalies), [-1.02666668  0.07333343] (strikers)
Scores from episode 5: [-0.78833345  0.31166666] (goalies), [-0.31166666  0.78833345] (strikers)
Scores from episode 6: [ 1.07000002 -0.03000004] (goalies), [ 0.03000004 -1.07000002] (strikers)
Scores from episode 7: [-0.70333333  0.39666667] (goalies), [-0.39666667  0.70333333] (strikers)
Scores from episode 8: [ 0.56166668 -0.53833338] (goalies), [ 0.53833338 -0.56166668] (strikers)
Scores from episode 9: [-0.56499999  0.53500001] (goalies), [-0.53500001  0.56499999] (strikers)
Scores from episode 10: [-0.87833333  0.22166667] (goalies), [-0.22166667  0.87833333] (strikers)
Scores from episode 11: [-0.9

Scores from episode 85: [-0.96833339  0.13166667] (goalies), [-0.13166667  0.96833339] (strikers)
Scores from episode 86: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 87: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 88: [-0.71999999  0.38000001] (goalies), [-0.38000001  0.71999999] (strikers)
Scores from episode 89: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 90: [-0.92500012  0.17499999] (goalies), [-0.17499999  0.92500012] (strikers)
Scores from episode 91: [ 0.61666668 -0.48333332] (goalies), [ 0.48333332 -0.61666668] (strikers)
Scores from episode 92: [ 0.75666667 -0.34333338] (goalies), [ 0.34333338 -0.75666667] (strikers)
Scores from episode 93: [ 0.45666668 -0.64333333] (goalies), [ 0.64333333 -0.45666668] (strikers)
Scores from episode 94: [-0.55999999  0.54000001] (goalies), [-0.54000001  0.55999999] (strikers)
Scores from episode 95: [-

Scores from episode 169: [ 0.24666667 -0.85333333] (goalies), [ 0.85333333 -0.24666667] (strikers)
Scores from episode 170: [-0.88666667  0.21333334] (goalies), [-0.21333334  0.88666667] (strikers)
Scores from episode 171: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 172: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 173: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 174: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 175: [ 0.235      -0.86500006] (goalies), [ 0.86500006 -0.235     ] (strikers)
Scores from episode 176: [-0.22666665  0.87333335] (goalies), [-0.87333335  0.22666665] (strikers)
Scores from episode 177: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 178: [ 0.68000001 -0.42000005] (goalies), [ 0.42000005 -0.68000001] (strikers)
Scores from episode 

Scores from episode 253: [-0.52999999  0.57000001] (goalies), [-0.57000001  0.52999999] (strikers)
Scores from episode 254: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 255: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 256: [ 0.14666667 -0.95333339] (goalies), [ 0.95333339 -0.14666667] (strikers)
Scores from episode 257: [-0.51833332  0.58166668] (goalies), [-0.58166668  0.51833332] (strikers)
Scores from episode 258: [ 0.75500002 -0.34500005] (goalies), [ 0.34500005 -0.75500002] (strikers)
Scores from episode 259: [-0.29333338  0.80666668] (goalies), [-0.80666668  0.29333338] (strikers)
Scores from episode 260: [ 0.33333333 -0.76666672] (goalies), [ 0.76666672 -0.33333333] (strikers)
Scores from episode 261: [ 0.44500001 -0.65499999] (goalies), [ 0.65499999 -0.44500001] (strikers)
Scores from episode 262: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from epis

In [16]:
env.close()
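
The win and draw bookkeeping in the evaluation loop can be checked offline on a few logged episodes. The sketch below reuses the same criteria as the loop (a team wins when its combined goalie and striker score is positive; a draw is when both team scores are equal) and applies them to episodes 1, 3, and 86 from the output above. `tally` is a hypothetical helper, not part of the notebook's agents.

```python
import numpy as np

def tally(goalie_scores, striker_scores):
    """Summarize episodes; each row is one episode, columns are [red, blue]."""
    team = np.asarray(goalie_scores) + np.asarray(striker_scores)
    red, blue = team[:, 0], team[:, 1]
    return {
        "red_wins": int(np.count_nonzero(red > 0)),
        "blue_wins": int(np.count_nonzero(blue > 0)),
        "draws": int(np.count_nonzero(red == blue)),
        "red_mean": float(red.mean()),
    }

# Episodes 1, 3 and 86 from the logged output
goalies = [[-0.52000005, 0.58000001],
           [0.46666668, -0.63333333],
           [1.00166669, 1.00166669]]
strikers = [[-0.58000001, 0.52000005],
            [0.63333333, -0.46666668],
            [-1.00166669, -1.00166669]]
summary = tally(goalies, strikers)
print(summary)
```

In episode 86 the goalie and striker scores cancel for both teams, so it counts as a draw rather than a win for either side.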