# Collaboration and Competition

---

You are welcome to use this coding environment to train your agent for the project.  Follow the instructions below to get started!

### 1. Start the Environment

Run the next code cell to install a few packages.  This line will take a few minutes to run!

In [1]:
!pip -q install ./python

tensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.
ipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 3.0.5 which is incompatible.


The environment is already saved in the Workspace and can be accessed at the file path provided below. 

In [2]:
from unityagents import UnityEnvironment
import numpy as np

env = UnityEnvironment(file_name="/data/Tennis_Linux_NoVis/Tennis")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
print('brain_name:', brain_name)

brain_name: TennisBrain


### 2. Examine the State and Action Spaces

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents (num_agents):', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action (action_size):', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length (state_size): {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])
print('The state for the second agent looks like:', states[1])

Number of agents (num_agents): 2
Size of each action (action_size): 2
There are 2 agents. Each observes a state with length (state_size): 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.         -6.65278625 -1.5        -0.          0.
  6.83172083  6.         -0.          0.        ]
The state for the second agent looks like: [ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.         -6.4669857  -1.5         0.          0.
 -6.83172083  6.          0.          0.        ]
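
The brain summary above reports 8 vector observations per agent and 3 stacked observations, which accounts for the state length of 24. The sketch below splits a state into its frames. It assumes that the frames are stored contiguously with the most recent one last; that layout is not confirmed anywhere in this notebook, but it is consistent with the first 16 entries being zero right after a reset. `split_frames` and `demo_states` are hypothetical names used only for illustration.

```python
import numpy as np

def split_frames(states, num_frames=3):
    """Reshape (num_agents, num_frames * obs_size) into (num_agents, num_frames, obs_size)."""
    states = np.asarray(states)
    num_agents, total = states.shape
    assert total % num_frames == 0, "state length must be a multiple of num_frames"
    return states.reshape(num_agents, num_frames, total // num_frames)

# Stand-in for env_info.vector_observations right after a reset:
# two agents, 24 values each, only the last 8 values populated.
demo_states = np.zeros((2, 24))
demo_states[:, 16:] = 1.0

frames = split_frames(demo_states)
print(frames.shape)      # one (3, 8) block per agent
print(frames[0, -1])     # last 8 values of the first agent's state
```

With the real environment, `split_frames(env_info.vector_observations)` would give one `(3, 8)` block per agent under the same assumption.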


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Note that **in this coding environment, you will not be able to watch the agents while they are training**, and you should set `train_mode=True` to restart the environment.

In [5]:
for episode in range(1):                                   # play game for 1 episode
    env_info = env.reset(train_mode=True)[brain_name]      # reset the environment
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    #while True:
    for step in range(10):
        print("\n\nstates[0]: {}".format(states[0]))
        print("states[1]: {}".format(states[1]))
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        print("actions: {}".format(actions))
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        print("rewards: {}".format(rewards))
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        print("scores: {}".format(scores))
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break
        break                                              # debugging: stop after the first step
    print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))



states[0]: [ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.         -7.38993645 -1.5        -0.          0.
  6.83172083  5.99607611 -0.          0.        ]
states[1]: [ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.         -6.70024681 -1.5         0.          0.
 -6.83172083  5.99607611  0.          0.        ]
actions: [[ 0.08233786  1.        ]
 [-1.          0.01887031]]
rewards: [0.0, 0.0]
scores: [ 0.  0.]
Total score (averaged over agents) this episode: 0.0
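
The cell above draws actions from a standard normal distribution and clips them to [-1, 1]. Clipping moves every out-of-range sample onto the boundary, so a sizeable share of the actions end up at exactly -1 or +1. The sketch below compares that recipe with a uniform draw of the form `2.0 * np.random.rand(...) - 1.0`, which the random branch of `env_step` uses. The seed and sample count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the comparison is repeatable
n = 100000

clipped = np.clip(rng.randn(n), -1, 1)   # same recipe as the random-action cell
uniform = 2.0 * rng.rand(n) - 1.0        # uniform on [-1, 1)

# Fraction of samples sitting exactly on a boundary value.
frac_clipped_at_bounds = np.mean(np.abs(clipped) == 1.0)
frac_uniform_at_bounds = np.mean(np.abs(uniform) == 1.0)

print("clipped normal at +/-1: {:.3f}".format(frac_clipped_at_bounds))
print("uniform at +/-1:        {:.3f}".format(frac_uniform_at_bounds))
```

For a standard normal, about 32% of the mass lies outside [-1, 1], so roughly a third of the clipped actions are saturated, while the uniform draw spreads exploration evenly over the action range.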


In [6]:
# define networks
import torch
import torch.nn.functional as F
import torch.optim as optim
import copy
import random

def hidden_init(layer):
    fan_in = layer.weight.data.size()[1]  # nn.Linear weights are (out_features, in_features)
    lim = 1. / np.sqrt(fan_in)
    return (-lim, lim)

# actor - takes in a state and outputs a deterministic action, each component in [-1, 1]
class ActorModel(torch.nn.Module):
    def __init__(self, state_size, action_size):
        super(ActorModel, self).__init__()
        self.state_size   = state_size
        self.action_size = action_size

        self.fc1 = torch.nn.Linear(state_size, 256)
        self.bn1 = torch.nn.BatchNorm1d(256)
        self.fc2 = torch.nn.Linear(256, 256)
        self.bn2 = torch.nn.BatchNorm1d(256)
        self.out = torch.nn.Linear(256, action_size)
        self.reset_parameters()
        
    def forward(self, states):
        batch_size = states.size(0)
        x = self.fc1(states)
        x = F.relu(self.bn1(x))
        x = F.relu(self.bn2(self.fc2(x)))
        x = torch.tanh(self.out(x))  # F.tanh is deprecated in favor of torch.tanh
        return x
    
    def reset_parameters(self):
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.out.weight.data.uniform_(-3e-3, 3e-3)

# critic - takes in a state AND an action - outputs the action value Q(s, a)
class CriticModel(torch.nn.Module):
    def __init__(self, state_size, action_size):
        super(CriticModel, self).__init__()
        self.state_size   = state_size
        self.action_size = action_size

        self.fc1 = torch.nn.Linear(state_size, 256)
        self.bn1 = torch.nn.BatchNorm1d(256)
        self.fc2 = torch.nn.Linear(256+action_size, 256)
        self.bn2 = torch.nn.BatchNorm1d(256)
        self.fc3 = torch.nn.Linear(256, 128)
        self.bn3 = torch.nn.BatchNorm1d(128)
        self.out = torch.nn.Linear(128, 1)
        self.reset_parameters()
        
    def forward(self, states, actions):
        batch_size = states.size(0)
        xs = F.leaky_relu(self.bn1(self.fc1(states)))
        x = torch.cat((xs, actions), dim=1) #add in actions to the network
        x = F.leaky_relu(self.bn2(self.fc2(x)))
        x = F.leaky_relu(self.bn3(self.fc3(x)))
        x = self.out(x)
        return x
    
    def reset_parameters(self):
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(*hidden_init(self.fc3))
        self.out.weight.data.uniform_(-3e-3, 3e-3)


class StepInfo:
    def __init__(self, step_number, states, actions, rewards, next_states, dones):
        self.step_number = step_number
        self.states = states
        self.actions = actions
        self.rewards = rewards
        self.next_states = next_states
        self.dones = dones
        
    def __str__(self):
        return "step_number: {},  states: {},  actions: {},  rewards: {},  next_states: {}".format(self.step_number, self.states, self.actions, self.rewards, self.next_states)

def reset_game(in_env, brain_name):
    # **important note** When training the environment, set `train_mode=True`
    env_info = in_env.reset(train_mode=True)[brain_name]      # reset the environment    
    states = env_info.vector_observations
    return states

def env_step(in_env, brain_name, states, replay_buffer, actor_model_0, actor_model_1, epsilon, logging=False):
    # Take one environment step for both agents and append it to replay_buffer
    if (len(replay_buffer) > 0):
        step_number = replay_buffer[-1].step_number + 1
    else:
        step_number = 0

    state_tensor = torch.from_numpy(states).float().cuda()

    rand_num = random.uniform(0, 1)
    if rand_num > epsilon:
        actor_model_0.eval()
        actor_model_1.eval()
        #with torch.no_grad():
        #print("state_tensor[0].unsqueeze(0): {}".format(state_tensor[0].unsqueeze(0)))
        actions_tensor_0 = actor_model_0(state_tensor[0].unsqueeze(0))
        actions_tensor_1 = actor_model_1(state_tensor[1].unsqueeze(0))
        actor_model_0.train()
        actor_model_1.train()
        actions_np = np.zeros((num_agents, action_size))
        actions_np[0] = actions_tensor_0.detach().cpu().numpy()
        actions_np[1] = actions_tensor_1.detach().cpu().numpy()
        if logging:
            print("actions from models: {}".format(actions_np))
    else:
        actions_np =  (2.0 * np.random.rand(num_agents, action_size)) - 1.0
        if logging:
            print("random actions: {}".format(actions_np))

    env_info = in_env.step(actions_np)[brain_name]      # send all actions to the environment
    next_states = env_info.vector_observations          # get next state (for each agent)
    rewards = env_info.rewards                          # get reward (for each agent)
    dones = env_info.local_done                         # see if episode finished

    this_step_info = StepInfo(step_number, states, actions_np, rewards, next_states, dones)
    replay_buffer.append(this_step_info)

    return next_states

def soft_update_target(local_model, target_model, tau):
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)

def getBatch(replay_buffer, batch_size):
#     Number of agents (num_agents): 2
#     Size of each action (action_size): 2
#     There are 2 agents. Each observes a state with length (state_size): 24
    return_states = np.zeros((batch_size, num_agents, state_size))
    return_actions = np.zeros((batch_size, num_agents, action_size))
    return_rewards = np.zeros((batch_size, 2))
    return_next_states = np.zeros((batch_size, num_agents, state_size))
    return_next_actions = np.zeros((batch_size, num_agents, action_size))
    
#     print("replay_buffer[0].states.shape: {}".format(replay_buffer[0].states.shape))
#     print("replay_buffer[0].rewards[0]: {}".format(replay_buffer[0].rewards[0]))
#     print("replay_buffer[0].actions.shape: {}".format(replay_buffer[0].actions.shape))
#     print("replay_buffer[0].next_states.shape: {}".format(replay_buffer[0].next_states.shape))

    for i in range(batch_size):
        # the upper bound leaves room for the rand_frame_index+1 lookup below
        rand_frame_index = random.randint(0,len(replay_buffer)-2)
        # keep the data of both agents; indexing [0] here would broadcast agent 0's data to every agent
        return_states[i] = replay_buffer[rand_frame_index].states
        return_actions[i] = replay_buffer[rand_frame_index].actions
        return_rewards[i] = replay_buffer[rand_frame_index].rewards
        return_next_states[i] = replay_buffer[rand_frame_index].next_states
        return_next_actions[i] = replay_buffer[rand_frame_index+1].actions
        #### TODO - make sure "next" actions don't roll over onto the next  playthrough.
        
    return return_states, return_actions, return_rewards, return_next_states, return_next_actions
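
The TODO in `getBatch` notes that the "next" action at `rand_frame_index+1` can belong to a different playthrough when the sampled step is the last one of an episode. One possible way to avoid that, sketched below, is to sample only from steps that are not terminal, so the following buffer entry is part of the same episode. `Step` is a hypothetical stand-in for `StepInfo` that carries only the fields needed here, and `sample_index_within_episode` is an illustrative helper, not part of the notebook.

```python
import random
from collections import deque, namedtuple

# Hypothetical minimal stand-in for StepInfo.
Step = namedtuple("Step", ["step_number", "dones"])

def sample_index_within_episode(buffer, rng=random):
    """Pick an index i such that buffer[i + 1] continues the episode of buffer[i]."""
    candidates = [i for i in range(len(buffer) - 1) if not any(buffer[i].dones)]
    if not candidates:
        raise ValueError("no non-terminal step with a successor in the buffer")
    return rng.choice(candidates)

# Two short episodes: steps 0-2 (step 2 is terminal) and steps 3-5 (step 5 is terminal).
demo_buffer = deque(maxlen=100)
for n in range(6):
    done = n in (2, 5)
    demo_buffer.append(Step(step_number=n, dones=[done, done]))

picked = {sample_index_within_episode(demo_buffer) for _ in range(200)}
print(sorted(picked))   # never contains 2 or 5
```

Rebuilding the candidate list on every call is linear in the buffer size, so for a buffer of 100000 steps it would be cheaper to retry a random index until a non-terminal one is found.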


### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  A few **important notes**:
- When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
- To structure your work, you're welcome to work directly in this Jupyter notebook, or you might like to start over with a new file!  You can see the list of files in the workspace by clicking on **_Jupyter_** in the top left corner of the notebook.
- In this coding environment, you will not be able to watch the agents while they are training.  However, **_after training the agents_**, you can download the saved model weights to watch the agents on your own machine! 

In [18]:
from collections import deque

# instantiate objects that can be re-used
buffer_length = 100000
replay_buffer = deque(maxlen=buffer_length)

actor_model_local_0   = ActorModel(state_size, action_size).cuda()
actor_model_target_0  = ActorModel(state_size, action_size).cuda()
critic_model_local_0  = CriticModel(state_size, action_size).cuda()
critic_model_target_0 = CriticModel(state_size, action_size).cuda()

actor_model_local_1   = ActorModel(state_size, action_size).cuda()
actor_model_target_1  = ActorModel(state_size, action_size).cuda()
critic_model_local_1  = CriticModel(state_size, action_size).cuda()
critic_model_target_1 = CriticModel(state_size, action_size).cuda()

actor_model_locals   = (actor_model_local_0, actor_model_local_1)
actor_model_targets  = (actor_model_target_0, actor_model_target_1)
critic_model_locals  = (critic_model_local_0, critic_model_local_1)
critic_model_targets = (critic_model_target_0, critic_model_target_1)

if True:  # set to False if running for the first time or if fresh models are desired.
    print("Loading models from disk")
    actor_model_locals[0].load_state_dict(torch.load("actor_model_local_0.pt"))
    actor_model_targets[0].load_state_dict(torch.load("actor_model_target_0.pt"))
    critic_model_locals[0].load_state_dict(torch.load("critic_model_local_0.pt"))
    critic_model_targets[0].load_state_dict(torch.load("critic_model_target_0.pt"))
    
    actor_model_locals[1].load_state_dict(torch.load("actor_model_local_1.pt"))
    actor_model_targets[1].load_state_dict(torch.load("actor_model_target_1.pt"))
    critic_model_locals[1].load_state_dict(torch.load("critic_model_local_1.pt"))
    critic_model_targets[1].load_state_dict(torch.load("critic_model_target_1.pt"))

lr_actor = .0005 #learning rate of .001 is too high - losses going in wrong direction.
lr_critic = .0005
weight_decay = 0.0
actor_optimizer_0 = optim.Adam(actor_model_locals[0].parameters(), lr=lr_actor)
actor_optimizer_1 = optim.Adam(actor_model_locals[1].parameters(), lr=lr_actor)
critic_optimizer_0 = optim.Adam(critic_model_locals[0].parameters(), lr=lr_critic, weight_decay=weight_decay)
critic_optimizer_1 = optim.Adam(critic_model_locals[1].parameters(), lr=lr_critic, weight_decay=weight_decay)

actor_optimizers = (actor_optimizer_0, actor_optimizer_1)
critic_optimizers = (critic_optimizer_0, critic_optimizer_1)

Loading models from disk
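
Each agent has a local and a target copy of its actor and critic. The target copies are moved toward the local ones by `soft_update_target`, which applies `target = tau * local + (1 - tau) * target` to every parameter. The sketch below reproduces that rule on plain numpy arrays instead of network parameters, using `tau = 0.001` as in the training cell. `soft_update` and the demo arrays are illustrative names.

```python
import numpy as np

def soft_update(local_params, target_params, tau):
    """In-place update of each target array toward its local counterpart."""
    for local, target in zip(local_params, target_params):
        target[...] = tau * local + (1.0 - tau) * target

# Stand-ins for the parameters of a local and a target network.
local_params = [np.ones((2, 2)), np.ones(2)]
target_params = [np.zeros((2, 2)), np.zeros(2)]

for _ in range(1000):
    soft_update(local_params, target_params, tau=0.001)

# The gap to the local value shrinks by a factor (1 - tau) per update.
gap = 1.0 - target_params[0][0, 0]
print("remaining gap after 1000 updates: {:.3f}".format(gap))
```

With `tau = 1.0` the same rule reduces to a plain copy of the local values into the target.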


In [19]:
def train_player(player_idx, batch_size, gamma, tau):
    #do some learning
    states, actions, rewards, next_states, next_actions = getBatch(replay_buffer, batch_size)

#     print("batch states.shape: {}".format(states.shape)) #(128, 2, 24)
#     print("batch actions.shape: {}".format(actions.shape)) #(128, 2, 2)
#     print("batch rewards.shape: {}".format(rewards.shape)) #(128, 2)
#     print("batch next_states.shape: {}".format(next_states.shape)) #(128, 2, 24)
#     print("batch next_actions.shape: {}".format(next_actions.shape)) #(128, 2, 2)

    # convert to tensors for input into the models.
    rewards_tensor = torch.from_numpy(rewards[:,player_idx]).unsqueeze(1).float().cuda()
#     print("rewards_tensor.shape: {}".format(rewards_tensor.shape))
    
    next_actions_tensor = torch.from_numpy(next_actions[:,player_idx,:]).float().cuda()
#     print("next_actions_tensor.shape: {}".format(next_actions_tensor.shape))
    
    next_states_tensor = torch.from_numpy(next_states[:,player_idx,:]).float().cuda()
#     print("next_states_tensor.shape: {}".format(next_states_tensor.shape))
    
    states_tensor = torch.from_numpy(states[:,player_idx,:]).float().cuda()
#     print("states_tensor.shape: {}".format(states_tensor.shape))
    
    actions_tensor = torch.from_numpy(actions[:,player_idx,:]).float().cuda()
#     print("actions_tensor.shape: {}".format(actions_tensor.shape))

    # ---------------------------- update critic ---------------------------- #
    # Get predicted next-state actions and Q values from target models

    # Compute Q targets for current states (y_i)
    actor_model_targets[player_idx].eval()
    actor_model_locals[player_idx].eval()
    critic_model_targets[player_idx].eval()
    critic_model_locals[player_idx].eval()
    
    actions_next = actor_model_targets[player_idx](next_states_tensor)
    Q_targets_next = critic_model_targets[player_idx](next_states_tensor, actions_next)
    # note: terminal transitions are not masked here (there is no (1 - done) factor)
    Q_targets = rewards_tensor + (gamma * Q_targets_next)

    # Compute critic loss
    Q_expected = critic_model_locals[player_idx](states_tensor, actions_tensor)
    critic_loss = F.mse_loss(Q_expected, Q_targets)

    # Minimize the critic loss
    critic_model_locals[player_idx].train()
    critic_optimizers[player_idx].zero_grad()
    critic_loss.backward()
    torch.nn.utils.clip_grad_norm_(critic_model_locals[player_idx].parameters(), 1)
    critic_optimizers[player_idx].step()

    # ---------------------------- update actor ---------------------------- #
    # Compute actor loss
    actions_pred = actor_model_locals[player_idx](states_tensor)
    critic_model_locals[player_idx].eval()
    actor_loss = -critic_model_locals[player_idx](states_tensor, actions_pred).mean()
    
    critic_model_locals[player_idx].train()
    critic_model_targets[player_idx].train()
    actor_model_targets[player_idx].train()
    actor_model_locals[player_idx].train()
    
    # Minimize the actor loss
    actor_optimizers[player_idx].zero_grad()
    actor_loss.backward()
    actor_optimizers[player_idx].step()

    # ----------------------- update target networks ----------------------- #
    #use very small Tau and update with every step
    soft_update_target(critic_model_locals[player_idx], critic_model_targets[player_idx], tau)
    soft_update_target(actor_model_locals[player_idx], actor_model_targets[player_idx], tau)

    return critic_loss.item(), actor_loss.item()
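
The critic target in `train_player` is computed as `rewards_tensor + gamma * Q_targets_next`. A common variant in DDPG-style implementations multiplies the bootstrap term by `(1 - done)`, so that the last step of an episode uses the reward alone. The sketch below shows that arithmetic with numpy. `td_target` and its `dones` argument are illustrative and not part of the notebook, where `getBatch` does not return the `dones` flags.

```python
import numpy as np

def td_target(rewards, q_next, dones, gamma=0.99):
    """y = r + gamma * (1 - done) * Q_target(s', a')."""
    rewards = np.asarray(rewards, dtype=np.float64)
    q_next = np.asarray(q_next, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    return rewards + gamma * not_done * q_next

# First transition continues the episode, second one ends it.
y = td_target(rewards=[0.1, -0.01], q_next=[0.5, 0.5], dones=[False, True])
print(y)
```

For the first transition the target is the reward plus the discounted next value; for the terminal one it is just the reward.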


In [28]:
from workspace_utils import active_session

epochs = 10 # 5
steps_per_epoch = 1000
gamma = 0.99
tau = 0.001
batch_size = 128
epsilon_decay_steps = buffer_length
epsilon_max = 0.75
epsilon_min = 0.1
current_state = reset_game(env, brain_name)

with active_session():
    for epoch in range(epochs):
        total_actor_loss_0 = 0.0
        total_critic_loss_0 = 0.0    
        total_actor_loss_1 = 0.0
        total_critic_loss_1 = 0.0    
        total_rewards_0 = 0
        total_rewards_1 = 0
        epsilon = max(epsilon_min, epsilon_max - len(replay_buffer)/epsilon_decay_steps)

        for game_step in range(steps_per_epoch):
            #play a game
            #total_rewards = play_game(env, brain_name, replay_buffer, actor_model_local)
            current_state = env_step(env, brain_name, current_state, replay_buffer, actor_model_locals[0], actor_model_locals[1], epsilon)
            total_rewards_0 += np.sum(replay_buffer[-1].rewards[0])
            total_rewards_1 += np.sum(replay_buffer[-1].rewards[1])

            #if the game is done, reset and continue
            if np.any(replay_buffer[-1].dones):
                # new game
                current_state = reset_game(env, brain_name)
                current_state = env_step(env, brain_name, current_state, replay_buffer, actor_model_locals[0], actor_model_locals[1], epsilon)
                total_rewards_0 += np.sum(replay_buffer[-1].rewards[0])
                total_rewards_1 += np.sum(replay_buffer[-1].rewards[1])

            if len(replay_buffer) < 10000:
                continue  

            critic_loss_0, actor_loss_0  = train_player(0, batch_size, gamma, tau)
            critic_loss_1, actor_loss_1  = train_player(1, batch_size, gamma, tau)

            total_actor_loss_0 += actor_loss_0
            total_actor_loss_1 += actor_loss_1
            total_critic_loss_0 += critic_loss_0
            total_critic_loss_1 += critic_loss_1

        print("epoch: {} - epsilon: {:.3f}".format(epoch, epsilon))
        print("    total_rewards_0: {:.3f}, critic_loss_0: {:.8f}, actor_loss_0: {:.6f}".format(total_rewards_0, total_critic_loss_0/steps_per_epoch, total_actor_loss_0/steps_per_epoch))
        print("    total_rewards_1: {:.3f}, critic_loss_1: {:.8f}, actor_loss_1: {:.6f}".format(total_rewards_1, total_critic_loss_1/steps_per_epoch, total_actor_loss_1/steps_per_epoch))
        
        total_play_rewards_0 = 0
        total_play_rewards_1 = 0
        for game_step in range(steps_per_epoch):
            #play a game
            current_state = env_step(env, brain_name, current_state, replay_buffer, actor_model_locals[0], actor_model_locals[1], epsilon)
            total_play_rewards_0 += np.sum(replay_buffer[-1].rewards[0])
            total_play_rewards_1 += np.sum(replay_buffer[-1].rewards[1])

            #if the game is done, reset and continue
            if np.any(replay_buffer[-1].dones):
                # new game
                current_state = reset_game(env, brain_name)
                current_state = env_step(env, brain_name, current_state, replay_buffer, actor_model_locals[0], actor_model_locals[1], epsilon)
                total_play_rewards_0 += np.sum(replay_buffer[-1].rewards[0])
                total_play_rewards_1 += np.sum(replay_buffer[-1].rewards[1])
        
        print("    total_play_rewards_0: {:.3f}, total_play_rewards_1: {:.3f}".format(total_play_rewards_0, total_play_rewards_1))
        # copy all four networks of the better-scoring player over the other player's (tau=1.0 is a full copy)
        if total_play_rewards_0 > total_play_rewards_1:
            soft_update_target(actor_model_locals[0], actor_model_locals[1], 1.0)
            soft_update_target(actor_model_targets[0], actor_model_targets[1], 1.0)
            soft_update_target(critic_model_locals[0], critic_model_locals[1], 1.0)
            soft_update_target(critic_model_targets[0], critic_model_targets[1], 1.0)
        else:
            soft_update_target(actor_model_locals[1], actor_model_locals[0], 1.0)
            soft_update_target(actor_model_targets[1], actor_model_targets[0], 1.0)
            soft_update_target(critic_model_locals[1], critic_model_locals[0], 1.0)
            soft_update_target(critic_model_targets[1], critic_model_targets[0], 1.0)
            

    torch.save(actor_model_locals[0].state_dict(),  "actor_model_local_0.pt")
    torch.save(actor_model_targets[0].state_dict(), "actor_model_target_0.pt")
    torch.save(critic_model_locals[0].state_dict(), "critic_model_local_0.pt")
    torch.save(critic_model_targets[0].state_dict(),"critic_model_target_0.pt")

    torch.save(actor_model_locals[1].state_dict(),  "actor_model_local_1.pt")
    torch.save(actor_model_targets[1].state_dict(), "actor_model_target_1.pt")
    torch.save(critic_model_locals[1].state_dict(), "critic_model_local_1.pt")
    torch.save(critic_model_targets[1].state_dict(),"critic_model_target_1.pt")

epoch: 0 - epsilon: 0.100
    total_rewards_0: 2.160, critic_loss_0: 0.00002368, actor_loss_0: -0.340311
    total_rewards_1: 2.050, critic_loss_1: 0.00002250, actor_loss_1: -0.336133
total_play_rewards_0: 2.670, total_play_rewards_1: 2.260
epoch: 1 - epsilon: 0.100
    total_rewards_0: 2.190, critic_loss_0: 0.00002407, actor_loss_0: -0.339907
    total_rewards_1: 2.480, critic_loss_1: 0.00003630, actor_loss_1: -0.334128
total_play_rewards_0: 2.380, total_play_rewards_1: 2.110
epoch: 2 - epsilon: 0.100
    total_rewards_0: 2.480, critic_loss_0: 0.00002123, actor_loss_0: -0.339315
    total_rewards_1: 1.970, critic_loss_1: 0.00004168, actor_loss_1: -0.331701
total_play_rewards_0: 2.270, total_play_rewards_1: 2.090
epoch: 3 - epsilon: 0.100
    total_rewards_0: 3.400, critic_loss_0: 0.00002291, actor_loss_0: -0.338896
    total_rewards_1: 1.690, critic_loss_1: 0.00002849, actor_loss_1: -0.331835
total_play_rewards_0: 2.140, total_play_rewards_1: 2.060
epoch: 4 - epsilon: 0.100
    total_
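
The training cell recomputes `epsilon` once per epoch from the fill level of the replay buffer: `max(epsilon_min, epsilon_max - len(replay_buffer)/epsilon_decay_steps)`. Since `epsilon_decay_steps` equals `buffer_length`, the schedule depends only on how full the buffer is. The sketch below evaluates the same expression for a few fill levels; `epsilon_for` is a hypothetical helper and its defaults are copied from the values set in the training cell.

```python
def epsilon_for(buffer_len, decay_steps=100000, eps_max=0.75, eps_min=0.1):
    """Linear decay in the number of stored steps, floored at eps_min."""
    return max(eps_min, eps_max - buffer_len / float(decay_steps))

for filled in (0, 25000, 65000, 100000):
    print("buffer size {:>6}: epsilon {:.3f}".format(filled, epsilon_for(filled)))
```

With these values the floor of 0.1 is reached once the buffer holds about 65000 steps, which matches the `epsilon: 0.100` lines in the training output above if the buffer was already at least that full.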

In [30]:
epsilon = 0.05
total_game_count = 100
total_rewards_over_all_games_agent_0 = 0
total_rewards_over_all_games_agent_1 = 0
total_max_over_all_games = 0

for game_count in range(total_game_count):
    total_rewards_agent_0 = 0.0
    total_rewards_agent_1 = 0.0

    current_state = reset_game(env, brain_name)
    replay = []
    
    for i in range(2000):
        current_state = env_step(env, brain_name, current_state, replay, actor_model_locals[0], actor_model_locals[1], epsilon, logging=False)
        total_rewards_agent_0 += replay[-1].rewards[0]
        total_rewards_agent_1 += replay[-1].rewards[1]
        # note: this accumulates the per-step maximum, which is not the same as the larger of the two game totals
        total_max_over_all_games += max(replay[-1].rewards[0], replay[-1].rewards[1])

        if np.any(replay[-1].dones):
            break

    # add this game's totals once, after the game has finished
    total_rewards_over_all_games_agent_0 += total_rewards_agent_0
    total_rewards_over_all_games_agent_1 += total_rewards_agent_1
        
    print("game {}, total_rewards_agent_0: {:.3f}, total_rewards_agent_1: {:.3f}, game steps: {}".format(game_count+1, total_rewards_agent_0, total_rewards_agent_1, len(replay)))
    
print("Average max reward over {} games: {:.3f}".format(total_game_count, total_max_over_all_games/total_game_count))




game 1, total_rewards_agent_0: 0.800, total_rewards_agent_1: 0.790, game steps: 314
game 2, total_rewards_agent_0: 0.100, total_rewards_agent_1: 0.090, game steps: 33
game 3, total_rewards_agent_0: 0.100, total_rewards_agent_1: 0.090, game steps: 34
game 4, total_rewards_agent_0: 0.100, total_rewards_agent_1: 0.090, game steps: 37
game 5, total_rewards_agent_0: 0.100, total_rewards_agent_1: 0.090, game steps: 33
game 6, total_rewards_agent_0: 0.100, total_rewards_agent_1: 0.090, game steps: 34
game 7, total_rewards_agent_0: 0.100, total_rewards_agent_1: -0.010, game steps: 31
game 8, total_rewards_agent_0: 0.400, total_rewards_agent_1: 0.390, game steps: 155
game 9, total_rewards_agent_0: 0.100, total_rewards_agent_1: 0.090, game steps: 37
game 10, total_rewards_agent_0: 0.100, total_rewards_agent_1: 0.090, game steps: 34
game 11, total_rewards_agent_0: 1.990, total_rewards_agent_1: 1.900, game steps: 751
game 12, total_rewards_agent_0: 0.200, total_rewards_agent_1: 0.190, game steps: 

game 100, total_rewards_agent_0: 0.100, total_rewards_agent_1: 0.090, game steps: 34
Average max reward over 100 games: 0.740
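
The evaluation cell adds `max(rewards[0], rewards[1])` at every step and averages that sum over the games. A different convention, assumed here rather than taken from this notebook, scores a game by first summing each agent's rewards over the whole game and then taking the larger of the two totals. The two numbers differ whenever the agents take turns receiving reward, as the sketch below shows on made-up rewards. Both function names and the `demo` data are hypothetical.

```python
def episode_score(step_rewards):
    """Larger of the two per-agent reward totals for one game."""
    totals = [sum(agent) for agent in zip(*step_rewards)]
    return max(totals)

def sum_of_step_maxima(step_rewards):
    """What accumulating max(r0, r1) at every step yields for one game."""
    return sum(max(r) for r in step_rewards)

# Made-up per-step rewards (agent 0, agent 1): the agents alternate scoring.
demo = [(0.1, 0.0), (0.0, 0.1), (0.1, 0.0), (0.0, -0.01)]

print("max of totals:       {:.2f}".format(episode_score(demo)))
print("sum of step maxima:  {:.2f}".format(sum_of_step_maxima(demo)))
```

The sum of per-step maxima can never be smaller than the maximum of the per-agent totals, so the average printed by the evaluation cell is an upper bound on the score under this alternative convention.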


In [None]:
for i in range(10):
    stepInfo = replay_buffer[i]
    print("\nstep_number: {}".format(stepInfo.step_number))
    print("states: {}".format(stepInfo.states))
    print("actions: {}".format(stepInfo.actions))
    print("rewards: {}".format(stepInfo.rewards))
    print("next_states: {}".format(stepInfo.next_states))
    print("dones: {}".format(stepInfo.dones))
    
