# Collaboration and Competition

---

### Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
import sys
sys.path.insert(0, "python/")

from unityagents import UnityEnvironment
import numpy as np
import torch
from dqn.dqn_agent_v2 import Agent
from collections import deque

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Soccer.app"`
- **Windows** (x86): `"path/to/Soccer_Windows_x86/Soccer.exe"`
- **Windows** (x86_64): `"path/to/Soccer_Windows_x86_64/Soccer.exe"`
- **Linux** (x86): `"path/to/Soccer_Linux/Soccer.x86"`
- **Linux** (x86_64): `"path/to/Soccer_Linux/Soccer.x86_64"`
- **Linux** (x86, headless): `"path/to/Soccer_Linux_NoVis/Soccer.x86"`
- **Linux** (x86_64, headless): `"path/to/Soccer_Linux_NoVis/Soccer.x86_64"`

For instance, if you are using a Mac, then you downloaded `Soccer.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Soccer.app")
```

In [2]:
env = UnityEnvironment(file_name="Soccer_Env/Soccer.exe")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 2
        Number of External Brains : 2
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: GoalieBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 112
        Number of stacked Vector Observation: 3
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 
Unity brain name: StrikerBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 112
        Number of stacked Vector Observation: 3
        Vector Action space type: discrete
        Vector Action space size (per agent): 6
        Vector Action descriptions: , , , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we obtain separate brains for the striker and goalie agents.

In [3]:
# print the brain names
print(env.brain_names)

# set the goalie brain
g_brain_name = env.brain_names[0]
g_brain = env.brains[g_brain_name]

# set the striker brain
s_brain_name = env.brain_names[1]
s_brain = env.brains[s_brain_name]

['GoalieBrain', 'StrikerBrain']


### Obtain the State and Action Spaces

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)

# number of agents 
num_g_agents = len(env_info[g_brain_name].agents)
print('Number of goalie agents:', num_g_agents)
num_s_agents = len(env_info[s_brain_name].agents)
print('Number of striker agents:', num_s_agents)

# number of actions
g_action_size = g_brain.vector_action_space_size
print('Number of goalie actions:', g_action_size)
s_action_size = s_brain.vector_action_space_size
print('Number of striker actions:', s_action_size)

# examine the state space 
g_states = env_info[g_brain_name].vector_observations
g_state_size = g_states.shape[1]
print('There are {} goalie agents. Each receives a state with length: {}'.format(g_states.shape[0], g_state_size))
s_states = env_info[s_brain_name].vector_observations
s_state_size = s_states.shape[1]
print('There are {} striker agents. Each receives a state with length: {}'.format(s_states.shape[0], s_state_size))

Number of goalie agents: 2
Number of striker agents: 2
Number of goalie actions: 4
Number of striker actions: 6
There are 2 goalie agents. Each receives a state with length: 336
There are 2 striker agents. Each receives a state with length: 336


### DQN Agent

Attempt to implement DQN, leveraging Udacity Deep Reinforcement Learning [repo](https://github.com/udacity/deep-reinforcement-learning)

In [5]:
g_agent = Agent(state_size=g_state_size, action_size=g_action_size, seed=0)
s_agent = Agent(state_size=s_state_size, action_size=s_action_size, seed=0)

Attempt to modify reward structure to incentivize going after the ball

In [6]:
def ball_reward(state):
    """
    Params
    ======
        state : current state of striker, 3 stacked 112 element vector
            1: ball
            2: opponent's goal
            3: own goal
            4: wall
            5: teammate
            6: opponent
            7: distance
    """
    reward = 0.0
    # Penalize if ball is not in view
    if not any(state[0::8]):
        reward = -0.03
    # Reward for kicking the ball
    else:
        idx = np.where(state[0::8])[0] # check which ray sees the ball
        distance = state[idx*8 + 7] # get the corresponding distance to ball
        if (np.amin(distance) <= 0.03): # Just picking some thresholds for now.
            reward = 0.3

    return reward

In [14]:
def dqn(n_episodes=2000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Deep Q-Learning.
    
    Params
    ======
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        eps_start (float): starting value of epsilon, for epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon
    """
    scores = []                        # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    eps = eps_start                    # initialize epsilon
    g_losses = []
    g_losses_window = deque(maxlen=100)
    s_losses = []
    s_losses_window = deque(maxlen=100)
    for i_episode in range(1, n_episodes+1):
        env_info  = env.reset(train_mode=True)
        score = 0
        ball_reward_val = 0.0
        
        g_states = env_info[g_brain_name].vector_observations       # get initial state (goalies)
        s_states = env_info[s_brain_name].vector_observations       # get initial state (strikers)

        g_scores = np.zeros(num_g_agents)                           # initialize the score (goalies)
        s_scores = np.zeros(num_s_agents)                           # initialize the score (strikers)  
        
        for t in range(max_t):
            action_g_0 = g_agent.act(g_states[0], eps)        # always pick state index 0
            action_s_0 = s_agent.act(s_states[0], eps)  
            
            # Set other team to random
            action_g_1 = np.asarray( [np.random.choice(g_action_size)] ) 
            action_s_1 = np.asarray( [np.random.choice(s_action_size)] )
            # Train simultaneously
            #action_g_1 = g_agent.act(g_states[1], eps)        # always pick state index 1
            #action_s_1 = s_agent.act(s_states[1], eps) 
            
            # Combine actions
            actions_g = np.array( (action_g_0, action_g_1) )                                    
            actions_s = np.array( (action_s_0, action_s_1) )
            actions = dict( zip( [g_brain_name, s_brain_name], [actions_g, actions_s] ) )
            
            env_info = env.step(actions)                                                
            # get next states
            g_next_states = env_info[g_brain_name].vector_observations         
            s_next_states = env_info[s_brain_name].vector_observations
            
            # get reward and update scores
            g_rewards = env_info[g_brain_name].rewards
            #print(g_rewards)
            s_rewards = env_info[s_brain_name].rewards
            #print(s_rewards)
            g_scores += g_rewards
            s_scores += s_rewards
            
            ball_reward_val += ball_reward(s_states[0])
            
            # check if episode finished
            done = np.any(env_info[g_brain_name].local_done)
            print(env_info[g_brain_name].text_observations)
            print(env_info[g_brain_name].visual_observations)
            
            # store experiences
            g_agent.step(g_states[0], action_g_0, g_rewards[0], 
                         g_next_states[0], done)
            s_agent.step(s_states[0], action_s_0, s_rewards[0] + ball_reward(s_states[0]), # adding ball reward
                         s_next_states[0], done)

            if done:
                break
                
            g_states = g_next_states
            s_states = s_next_states
                
        # learn
        goalie_loss = g_agent.learn(g_agent.memory.sample(), 0.99) # discount = 0.99
        striker_loss = s_agent.learn(s_agent.memory.sample(), 0.99) # discount = 0.99 
        
        g_losses.append(goalie_loss.item())
        g_losses_window.append(goalie_loss.item())
        s_losses.append(striker_loss.item())
        s_losses_window.append(striker_loss.item())
        
        score = g_scores[0] + s_scores[0]
        scores_window.append(score)       # save most recent score
        scores.append(score)              # save most recent score
        
        eps = max(eps_end, eps_decay*eps) # decrease epsilon
        print('\rEpisode {}\tAverage Score: {:.2f}\t Goalie Loss:' \
                  '{:.5f}\t Striker Loss: {:.5f}' \
                  '\t Ball Reward: {:.2f}'.format(i_episode, \
                                                  np.mean(scores_window), \
                                                  np.mean(g_losses_window), \
                                                  np.mean(s_losses_window), \
                                                  ball_reward_val), end="")
        #print(s_states[0][0:56])
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}\t Goalie Loss:' \
                  '{:.5f}\t Striker Loss: {:.5f}\n' \
                  '\t Ball Reward: {:.2f}'.format(i_episode, \
                                                  np.mean(scores_window), \
                                                  np.mean(g_losses_window), \
                                                  np.mean(s_losses_window), \
                                                  ball_reward_val))
            torch.save(g_agent.qnetwork_local.state_dict(), 'checkpoint_goalie.pth')
            torch.save(s_agent.qnetwork_local.state_dict(), 'checkpoint_striker.pth')
    return scores


Set hyperparameters and train DQN

In [15]:
n_episodes = 5000
n_episodes = 5
max_t = 100000
eps_start = 1.0
eps_end = 0.1
eps_decay = 0.9995

# Pick up where we left off
#GOALIE = './trained_models/goalie_dqn_run1.pth'
#STRIKER = './trained_models/striker_dqn_run1.pth'
#g_agent.qnetwork_local.load (GOALIE )
#s_agent.qnetwork_local.load( STRIKER )
GOALIE = './trained_models/goalie_dqn_ballreward2.pth'
STRIKER = './trained_models/striker_dqn_ballreward2.pth'
g_agent.qnetwork_local.load (GOALIE )
s_agent.qnetwork_local.load( STRIKER )

# Train
eps_start = 0.1
eps_end = 0.1
scores = dqn(n_episodes, max_t, eps_start, eps_end, eps_decay)

['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['', '']
[]
['',

### Test Trained Agent

In [5]:
# Load trained agents and run
g_agent_red = Agent(state_size=g_state_size, action_size=g_action_size, seed=0)
s_agent_red = Agent(state_size=s_state_size, action_size=s_action_size, seed=0)
g_agent_blue = Agent(state_size=g_state_size, action_size=g_action_size, seed=0)
s_agent_blue = Agent(state_size=s_state_size, action_size=s_action_size, seed=0)

# RED TEAM -------------------------------------
# DQN_base ----
# GOALIE_red = './trained_models/goalie_dqn_run1.pth'
# STRIKER_red = './trained_models/striker_dqn_run1.pth'
# DQN_1 ----
# GOALIE_red = './trained_models/goalie_dqn_ballreward2.pth'
# STRIKER_red = './trained_models/striker_dqn_ballreward2.pth'
# GOALIE_red = './trained_models/checkpoint_goalie_10K.pth'
# STRIKER_red = './trained_models/checkpoint_striker_10K.pth'
# DQN_1_mod ----
GOALIE_red = './trained_models/goalie_dqn_V1_modified.pth'
STRIKER_red = './trained_models/striker_dqn_V1_modified.pth'

g_agent_red.qnetwork_local.load (GOALIE_red )
s_agent_red.qnetwork_local.load( STRIKER_red )

# BLUE TEAM -------------------------------------
# DQN_base
GOALIE_blue = './trained_models/goalie_dqn_run1.pth'
STRIKER_blue = './trained_models/striker_dqn_run1.pth'
# DQN_1 ----
# GOALIE_blue = './trained_models/goalie_dqn_ballreward2.pth'
# STRIKER_blue = './trained_models/striker_dqn_ballreward2.pth'
# GOALIE_blue = './trained_models/checkpoint_goalie_10K.pth'
# STRIKER_blue = './trained_models/checkpoint_striker_10K.pth'
# DQN_1_mod ----
GOALIE_blue = './trained_models/goalie_dqn_V1_modified.pth'
STRIKER_blue = './trained_models/striker_dqn_V1_modified.pth'

g_agent_blue.qnetwork_local.load (GOALIE_blue )
s_agent_blue.qnetwork_local.load( STRIKER_blue )

In [6]:
team_red_window_score = []
team_red_window_score_wins = []

team_blue_window_score = []
team_blue_window_score_wins = []

draws = []

for i in range(500):                                       # play game for 2 episodes
    env_info = env.reset(train_mode=True)                  # reset the environment    
    g_states = env_info[g_brain_name].vector_observations  # get initial state (goalies)
    s_states = env_info[s_brain_name].vector_observations  # get initial state (strikers)
    g_scores = np.zeros(num_g_agents)                      # initialize the score (goalies)
    s_scores = np.zeros(num_s_agents)                      # initialize the score (strikers)
    while True:
        # RED TEAM actions
        action_g_0 = g_agent_red.act(g_states[0], 0)       # always pick state index 0 for red
        action_s_0 = s_agent_red.act(s_states[0], 0)  

        # BLUE TEAM actions
        # ----- RANDOM -----
#         action_g_1 = np.asarray( [np.random.choice(g_action_size)] ) 
#         action_s_1 = np.asarray( [np.random.choice(s_action_size)] )
        # ----- Trained -----
        action_g_1 = g_agent_blue.act(g_states[1], 0)      # always pick state index 1 for blue
        action_s_1 = s_agent_blue.act(s_states[1], 0)

        # Combine actions
        actions_g = np.array( (action_g_0, action_g_1) )                                    
        actions_s = np.array( (action_s_0, action_s_1) )
        actions = dict( zip( [g_brain_name, s_brain_name], [actions_g, actions_s] ) )

        env_info = env.step(actions)                       
        
        # get next states
        g_next_states = env_info[g_brain_name].vector_observations         
        s_next_states = env_info[s_brain_name].vector_observations
        
        # get reward and update scores
        g_rewards = env_info[g_brain_name].rewards  
        s_rewards = env_info[s_brain_name].rewards
        g_scores += g_rewards
        s_scores += s_rewards
        
        # check if episode finished
        done = np.any(env_info[g_brain_name].local_done)  
        
        # roll over states to next time step
        g_states = g_next_states
        s_states = s_next_states
        
        # exit loop if episode finished
        if done:                                           
            break
    team_red_score = g_scores[0] + s_scores[0]
    team_red_window_score.append( team_red_score )
    team_red_window_score_wins.append( 1 if team_red_score > 0 else 0)        

    team_blue_score = g_scores[1] + s_scores[1]
    team_blue_window_score.append( team_blue_score )
    team_blue_window_score_wins.append( 1 if team_blue_score > 0 else 0 )

    draws.append( team_red_score == team_blue_score )
    print('Scores from episode {}: {} (goalies), {} (strikers)'.format(i+1, g_scores, s_scores))

print('Red Wins: \t{} \tScore: \t{:.5f} \tAvg: \t{:.2f} \tDraws: \t{}'.format( \
                  np.count_nonzero(team_red_window_score_wins), team_red_score, \
                  np.sum(team_red_window_score), np.count_nonzero(draws) ))

env.close()

Scores from episode 1: [ 0.41833334 -0.68166666] (goalies), [ 0.68166666 -0.41833334] (strikers)
Scores from episode 2: [-0.86000006  0.24      ] (goalies), [-0.24        0.86000006] (strikers)
Scores from episode 3: [ 0.15499999 -0.94500006] (goalies), [ 0.94500006 -0.15499999] (strikers)
Scores from episode 4: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 5: [-0.87833345  0.22166666] (goalies), [-0.22166666  0.87833345] (strikers)
Scores from episode 6: [-0.65999999  0.44000001] (goalies), [-0.44000001  0.65999999] (strikers)
Scores from episode 7: [-0.96500012  0.13499999] (goalies), [-0.13499999  0.96500012] (strikers)
Scores from episode 8: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 9: [-0.975  0.125] (goalies), [-0.125  0.975] (strikers)
Scores from episode 10: [-0.91666667  0.18333334] (goalies), [-0.18333334  0.91666667] (strikers)
Scores from episode 11: [1.00166669 1.00166669] (goal

Scores from episode 88: [-0.70666678  0.39333333] (goalies), [-0.39333333  0.70666678] (strikers)
Scores from episode 89: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 90: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 91: [ 0.13666666 -0.96333345] (goalies), [ 0.96333345 -0.13666666] (strikers)
Scores from episode 92: [ 0.33       -0.77000012] (goalies), [ 0.77000012 -0.33      ] (strikers)
Scores from episode 93: [ 0.165 -0.935] (goalies), [ 0.935 -0.165] (strikers)
Scores from episode 94: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 95: [-0.88500012  0.215     ] (goalies), [-0.215       0.88500012] (strikers)
Scores from episode 96: [-0.88833345  0.21166666] (goalies), [-0.21166666  0.88833345] (strikers)
Scores from episode 97: [ 0.41       -0.69000005] (goalies), [ 0.69000005 -0.41      ] (strikers)
Scores from episode 98: [ 0.265      -0.835000

Scores from episode 173: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 174: [-0.97333333  0.12666667] (goalies), [-0.12666667  0.97333333] (strikers)
Scores from episode 175: [ 0.19833334 -0.90166667] (goalies), [ 0.90166667 -0.19833334] (strikers)
Scores from episode 176: [-0.96333333  0.13666667] (goalies), [-0.13666667  0.96333333] (strikers)
Scores from episode 177: [ 0.24333334 -0.85666666] (goalies), [ 0.85666666 -0.24333334] (strikers)
Scores from episode 178: [ 0.14333333 -0.95666673] (goalies), [ 0.95666673 -0.14333333] (strikers)
Scores from episode 179: [ 0.27166666 -0.82833339] (goalies), [ 0.82833339 -0.27166666] (strikers)
Scores from episode 180: [ 0.12166667 -0.97833339] (goalies), [ 0.97833339 -0.12166667] (strikers)
Scores from episode 181: [-0.97333339  0.12666666] (goalies), [-0.12666666  0.97333339] (strikers)
Scores from episode 182: [ 0.165      -0.93500006] (goalies), [ 0.93500006 -0.165     ] (strikers)
Scores from 

Scores from episode 258: [-0.81166666  0.28833334] (goalies), [-0.28833334  0.81166666] (strikers)
Scores from episode 259: [ 0.25       -0.85000012] (goalies), [ 0.85000012 -0.25      ] (strikers)
Scores from episode 260: [-0.73166666  0.36833334] (goalies), [-0.36833334  0.73166666] (strikers)
Scores from episode 261: [-0.83666666  0.26333334] (goalies), [-0.26333334  0.83666666] (strikers)
Scores from episode 262: [ 0.46166667 -0.63833339] (goalies), [ 0.63833339 -0.46166667] (strikers)
Scores from episode 263: [-0.97666673  0.12333333] (goalies), [-0.12333333  0.97666673] (strikers)
Scores from episode 264: [-0.58333338  0.51666668] (goalies), [-0.51666668  0.58333338] (strikers)
Scores from episode 265: [ 0.55666668 -0.54333332] (goalies), [ 0.54333332 -0.55666668] (strikers)
Scores from episode 266: [ 0.12333333 -0.97666667] (goalies), [ 0.97666667 -0.12333333] (strikers)
Scores from episode 267: [-0.85166672  0.24833333] (goalies), [-0.24833333  0.85166672] (strikers)
Scores fro

Scores from episode 343: [ 0.19666667 -0.90333339] (goalies), [ 0.90333339 -0.19666667] (strikers)
Scores from episode 344: [-0.93  0.17] (goalies), [-0.17  0.93] (strikers)
Scores from episode 345: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 346: [-0.46166666  0.63833335] (goalies), [-0.63833335  0.46166666] (strikers)
Scores from episode 347: [ 0.48666668 -0.61333333] (goalies), [ 0.61333333 -0.48666668] (strikers)
Scores from episode 348: [-0.94833345  0.15166666] (goalies), [-0.15166666  0.94833345] (strikers)
Scores from episode 349: [ 0.225      -0.87500006] (goalies), [ 0.87500006 -0.225     ] (strikers)
Scores from episode 350: [-0.75500006  0.34500001] (goalies), [-0.34500001  0.75500006] (strikers)
Scores from episode 351: [-0.905  0.195] (goalies), [-0.195  0.905] (strikers)
Scores from episode 352: [ 0.24166666 -0.85833339] (goalies), [ 0.85833339 -0.24166666] (strikers)
Scores from episode 353: [ 0.15833333 -0.94166679] (goal

Scores from episode 428: [-0.77666678  0.32333333] (goalies), [-0.32333333  0.77666678] (strikers)
Scores from episode 429: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 430: [-0.97666673  0.12333333] (goalies), [-0.12333333  0.97666673] (strikers)
Scores from episode 431: [1.00166669 1.00166669] (goalies), [-1.00166669 -1.00166669] (strikers)
Scores from episode 432: [-0.86833339  0.23166667] (goalies), [-0.23166667  0.86833339] (strikers)
Scores from episode 433: [ 0.25166666 -0.84833345] (goalies), [ 0.84833345 -0.25166666] (strikers)
Scores from episode 434: [ 0.24666667 -0.85333339] (goalies), [ 0.85333339 -0.24666667] (strikers)
Scores from episode 435: [ 0.175      -0.92500006] (goalies), [ 0.92500006 -0.175     ] (strikers)
Scores from episode 436: [ 0.19166667 -0.90833333] (goalies), [ 0.90833333 -0.19166667] (strikers)
Scores from episode 437: [-0.97333339  0.12666666] (goalies), [-0.12666666  0.97333339] (strikers)
Scores from ep

In [7]:
env.close()