# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [None]:
from IPython.core.display import display, HTML
display(HTML(
    '<style>'
        '#notebook { padding-top:0px !important; } ' 
        '.container { width:95% !important; } '
        '.end_space { min-height:0px !important; } '
    '</style>'
))

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```

In [2]:
#env = UnityEnvironment(file_name="Tennis_Linux/Tennis.x86_64")
env = UnityEnvironment(file_name="Tennis_Windows_x86_64/Tennis.exe")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping. 

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]


When finished, you can close the environment.

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [5]:
#from buffer import ReplayBuffer
from common.Memory import ReplayMemory
from maddpg import MADDPG
import torch
import numpy as np
from tensorboardX import SummaryWriter
import os
from utilities import transpose_list, transpose_to_tensor
from collections import deque

# keep training awake
#from workspace_utils import keep_awake

# for saving gif
#import imageio

%load_ext autoreload
%autoreload 2

In [6]:
seed = 0
np.random.seed(seed)
torch.manual_seed(seed)

# number of training episodes.
# change this to higher number to experiment. say 30000.
number_of_episodes = 60000
episode_length = 200
batchsize = 256
# how many episodes to save policy and gif
save_interval = 200

In [7]:
# amplitude of OU noise
# this slowly decreases to 0
#noise = 2
#noise_reduction = 0.999

log_path = os.getcwd()+"/log"
model_dir= os.getcwd()+"/model_dir"

os.makedirs(model_dir, exist_ok=True)

# keep 5000 episodes worth of replay
buffer = ReplayMemory(int(1e5))

# initialize policy and critic

in_actor = state_size
hidden_in_actor = 256
hidden_out_actor = 128
out_actor = 2
# critic input = obs from both agents + actions from both agents
in_critic = 2*state_size + 2*action_size
hidden_in_critic = 256
hidden_out_critic = 128

maddpg = MADDPG(in_actor, hidden_in_actor, hidden_out_actor, 
                out_actor, in_critic, hidden_in_critic, hidden_out_critic,
                lr_actor=1.0e-4, lr_critic=1.0e-3, discount_factor=0.99, tau=1.0e-3)

logger = SummaryWriter(log_dir=log_path)

In [8]:
# how many episodes before update
steps_per_update = 1
num_updates = 4
random_actions = 8000
update_start = 8000

agent0_reward = []
agent1_reward = []
average_score_log = []

In [10]:
# training loop
# show progressbar
#import progressbar as pb
#widget = ['episode: ', pb.Counter(),'/',str(number_of_episodes),' ', 
#          pb.Percentage(), ' ', pb.ETA(), ' ', pb.Bar(marker=pb.RotatingMarker()), ' ' ]

#timer = pb.ProgressBar(widgets=widget, maxval=number_of_episodes).start()

scores_deque = deque(np.zeros(100))
ep_len_deque = deque(np.zeros(100))
#rand = 1.0
t = 0

for episode in range(1, number_of_episodes+1):

    #timer.update(episode)

    reward_this_episode = np.zeros(num_agents)
    env_info = env.reset(train_mode=True)[brain_name]
    obs = env_info.vector_observations
    ep_len = 0

    #for calculating rewards for this particular episode - addition of all time steps

    # save info or not
    save_info = ((episode) % save_interval == 0 or episode==number_of_episodes)
    #frames = []
    maddpg.reset()

    #for episode_t in range(episode_length):
    while True:
        if t > random_actions:
            rand = 1.0
        else:
            rand = 0.0
        t += 1
        ep_len += 1

        # explore = only explore for a certain number of episodes
        # action input needs to be transposed
        actions = maddpg.act(torch.tensor(obs, dtype=torch.float), rand=rand, add_noise=True)
        #noise *= noise_reduction

        actions = torch.stack(actions).detach().numpy()

        # step forward one frame
        env_info = env.step(actions)[brain_name]
        next_obs = env_info.vector_observations            # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
 
        # add data to buffer
        transition = (obs, actions, rewards, next_obs, dones)
        buffer.push(*transition)
        
        # update once after every episode_per_update
        if t > update_start and t % steps_per_update == 0:
            for _ in range(steps_per_update * num_updates):  # train for x times per update
                samples = buffer.sample(batchsize)
                for a_i in range(num_agents):
                    #samples = buffer.sample(batchsize)
                    maddpg.update(samples, a_i, logger)
                maddpg.update_targets() #soft update the target network towards the actual networks

        reward_this_episode += rewards

        obs = next_obs
        
        if np.any(dones):
            break
    ep_len_deque.append(ep_len)
    avg_ep_len = np.mean(ep_len_deque)
    score = np.max(reward_this_episode)
    scores_deque.append(score)
    average_score = np.mean(scores_deque)
    average_score_log.append(average_score)
    '''
    # update once after every episode_per_update
    if len(buffer) > batchsize and episode % episode_per_update == 0:
        for _ in range(5):  # train for 5 times
            samples = buffer.sample(batchsize)
            for a_i in range(num_agents):
                #samples = buffer.sample(batchsize)
                maddpg.update(samples, a_i, logger)
            maddpg.update_targets() #soft update the target network towards the actual networks
    '''
    agent0_reward.append(reward_this_episode[0])
    agent1_reward.append(reward_this_episode[1])

    if episode % 100 == 0 or episode == number_of_episodes-1:
        avg_rewards = [np.mean(agent0_reward), np.mean(agent1_reward)]
        agent0_reward = []
        agent1_reward = []
        for a_i, avg_rew in enumerate(avg_rewards):
            logger.add_scalar('agent%i/mean_episode_rewards' % a_i, avg_rew, episode)
            
    print('\rEpisode {}\tAverage Score: {:.4f}\tScore: {:.4f}\tAvg Ep Len: {:.2f}'.format(episode, average_score, score, avg_ep_len), end="")
    if episode % 100 == 0:
        print('\rEpisode {}\tAverage Score: {:.4f}\tScore: {:.4f}\tAvg Ep Len: {:.2f}'.format(episode, average_score, score, avg_ep_len))
    if average_score >= 0.5:
        print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.4f}'.format(episode, average_score))
        break
    '''
    if episode %100 == 0:
        print('last 100 avg reward for episode ending {} is {}'.format(episode, np.mean(scores_deque)))
    '''
    #saving model
    
    if save_info:
        save_dict_list =[]
        for i in range(2):
            save_dict = {'actor_params' : maddpg.maddpg_agent[i].actor.state_dict(),
                         'actor_optim_params': maddpg.maddpg_agent[i].actor_optimizer.state_dict(),
                         'critic_params' : maddpg.maddpg_agent[i].critic.state_dict(),
                         'critic_optim_params' : maddpg.maddpg_agent[i].critic_optimizer.state_dict()}
            save_dict_list.append(save_dict)

        torch.save(save_dict_list, os.path.join(model_dir, 'episode-{}.pt'.format(episode)))

#timer.finish()

Episode 100	Average Score: 0.0190	Score: 0.0000	Avg Ep Len: 10.60
Episode 200	Average Score: 0.0274	Score: 0.0000	Avg Ep Len: 14.56
Episode 300	Average Score: 0.0325	Score: 0.0000	Avg Ep Len: 16.72
Episode 400	Average Score: 0.0332	Score: 0.2000	Avg Ep Len: 17.59
Episode 500	Average Score: 0.0317	Score: 0.0000	Avg Ep Len: 17.84
Episode 600	Average Score: 0.0300	Score: 0.0000	Avg Ep Len: 17.91
Episode 700	Average Score: 0.0289	Score: 0.0000	Avg Ep Len: 18.04
Episode 800	Average Score: 0.0266	Score: 0.0000	Avg Ep Len: 17.87
Episode 900	Average Score: 0.0259	Score: 0.0000	Avg Ep Len: 18.01
Episode 1000	Average Score: 0.0252	Score: 0.0000	Avg Ep Len: 17.98
Episode 1100	Average Score: 0.0250	Score: 0.0000	Avg Ep Len: 18.09
Episode 1200	Average Score: 0.0246	Score: 0.0000	Avg Ep Len: 18.12
Episode 1300	Average Score: 0.0240	Score: 0.0000	Avg Ep Len: 18.09
Episode 1400	Average Score: 0.0232	Score: 0.0000	Avg Ep Len: 17.96
Episode 1456	Average Score: 0.0230	Score: 0.0000	Avg Ep Len: 17.99

KeyboardInterrupt: 

In [None]:
import torch
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim

from unityagents import UnityEnvironment
import numpy as np
import random
import copy
from collections import namedtuple, deque
import os
import time
import sys

import matplotlib.pyplot as plt

device = torch.device("cpu")

In [None]:
class Buffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.
        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): seed
        """
        self.memory = deque(maxlen=buffer_size)  # internal memory (deque)
        self.batch_size = batch_size
        self.seed = seed
        self.experience = namedtuple("Experience", field_names=[
            "observation", "action", "reward", "next_observation", "done"])
    
    def add(self, observation, action, reward, next_observation, done):
        """Add a new experience to memory."""
        
        # Join a sequence of agents's states, next states and actions along columns
        e = self.experience(observation, action, reward, next_observation, done)
        self.memory.append(e)
    
    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        observations = torch.from_numpy(np.vstack([e.observation for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).float().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_observations = torch.from_numpy(np.vstack([e.next_observation for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)
        return (observations, actions, rewards, next_observations, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)

In [None]:
class RNoise:
    """uniformly distributed random noise process"""

    def __init__(self, shape, amplitude):
        """Initialize parameters and noise process
        Params
        ======
            shape (int): dimension of each action
            buffer_size (int): maximum size of buffer
            amplitude (int): size of each training batch
        """
        
        self.amplitude = amplitude
        self.shape = shape
        self.state = np.zeros(self.shape)
    def reset(self):
        """Reset the internal state (= noise) to zero."""
        self.state = np.zeros(self.shape)

    def sample(self):
        """Return a noise sample."""
        self.state = self.amplitude*(2.*np.random.rand(self.shape) - 1.)
        return self.state

class OUNoise:
    """Ornstein-Uhlenbeck process."""

    def __init__(self, size, seed, mu=0., theta=0.15, sigma=0.2):
        """Initialize parameters and noise process."""
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        np.random.seed(seed)
        self.seed = np.random.randint(0,100)
        self.size = size
        self.reset()  
        
    def reset(self):
        """Reset the internal state (= noise) to mean (mu)."""
        self.state = copy.copy(self.mu)

    def sample(self):
        """Update internal state and return it as a noise sample."""
        x = self.state
        dx = self.theta * (self.mu - x) + self.sigma * np.random.randn(self.size)
        self.state = x + dx
        return self.state


In [None]:
import torch.nn as nn

def hidden_init(layer):
    fan_in = layer.weight.data.size()[0]
    lim = 1. / np.sqrt(fan_in)
    return (-lim, lim)

class Actor(nn.Module):
    """Actor (Policy) Model."""

    def __init__(self, state_size, action_size, seed, fc1_units=256, fc2_units=128, percent_dropout = 0.1):    
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): seed
            fc1_units (int): Number of nodes in first hidden layer
            fc2_units (int): Number of nodes in second hidden layer
            percent_dropout (float): percentage of nodes being dropped out.
        """
        super(Actor, self).__init__()
        self.seed = torch.manual_seed(seed)
        
        self.layer_1 = nn.Sequential(nn.Linear(state_size, fc1_units),
                                    nn.ReLU(),
                                    nn.Dropout(percent_dropout)) 
        
        self.layer_2 = nn.Sequential(nn.Linear(fc1_units, fc2_units), 
                                    nn.ReLU(),
                                    nn.Dropout(percent_dropout))
        
        self.layer_3 = nn.Linear(fc2_units, action_size)
        self.reset_parameters()
        
    def reset_parameters(self):
        # Apply to layers the specified weight initialization
        self.layer_1[0].weight.data.uniform_(*hidden_init(self.layer_1[0]))
        self.layer_2[0].weight.data.uniform_(*hidden_init(self.layer_2[0]))
        self.layer_3.weight.data.uniform_(-3e-3, 3e-3)
        
    def forward(self, state):
        """Build an actor (policy) network that maps states -> actions."""           
        x = self.layer_1(state)
        x = self.layer_2(x)
        x = self.layer_3(x)
        return torch.tanh(x)

class Critic(nn.Module):
    """Critic (Value) Model."""

    def __init__(self, state_size, action_size, seed, fc1_units=256, fc2_units=128, percent_dropout = 0.1): 
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): seed
            num_agents (int): Total number of agents
            fc1_units (int): Number of nodes in the first hidden layer
            fc2_units (int): Number of nodes in the second hidden layer
        """
        super(Critic, self).__init__()
        self.seed = torch.manual_seed(seed)
        
        
        
        self.layer_1 = nn.Sequential(nn.Linear(state_size * NUM_AGENTS + action_size * NUM_AGENTS, fc1_units),
                                    nn.ReLU(),
                                    nn.Dropout(percent_dropout))
        
        self.layer_2 = nn.Sequential(nn.Linear(fc1_units, fc2_units),
                                    nn.ReLU(),
                                    nn.Dropout(percent_dropout))
               
        self.layer_3 = nn.Linear(fc2_units, 1)
        self.reset_parameters()       
        
    def reset_parameters(self):
        # Apply to layers the specified weight initialization
        self.layer_1[0].weight.data.uniform_(*hidden_init(self.layer_1[0]))
        self.layer_2[0].weight.data.uniform_(*hidden_init(self.layer_2[0]))
        self.layer_3.weight.data.uniform_(-3e-3, 3e-3)  
        
    def forward(self, state, action):
        """Build a critic (value) network that maps (state, action) pairs -> Q-value."""
        xs = torch.cat((state, action), dim = 1)
        x = self.layer_1(xs)
        x = self.layer_2(x)
        return self.layer_3(x)

In [None]:
class DDPG_Agent():
    """Interacts with and learns from the environment."""
    
    def __init__(self, agent_name, state_size, action_size, random_seed):
        """Initialize an Agent object.
        
        Params
        ======
            agent_name (str): name of the agent
            state_size (int): dimension of each state
            action_size (int): dimension of each action
            random_seed (int): random seed
        """
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random_seed
        self.agent_name = agent_name

        # Actor Network (w/ Target Network)
        self.actor_local = Actor(state_size, action_size, self.seed).to(device)
        self.actor_target = Actor(state_size, action_size, self.seed).to(device)
        self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=LR_ACTOR)

        # Critic Network (w/ Target Network)
        self.critic_local = Critic(state_size, action_size, self.seed).to(device)
        self.critic_target = Critic(state_size, action_size, self.seed).to(device)
        self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr=LR_CRITIC, weight_decay=WEIGHT_DECAY)
        
        self.epsilon = 1.
        self.epsilon_decay_rate = 0.999
        self.epsilon_min = 0.2

        # Noise process
#         self.noise = OUNoise(action_size, random_seed)
        self.noise = RNoise(action_size, 0.5)
    def epsilon_decay(self):
        self.epsilon = max(self.epsilon_decay_rate*self.epsilon, self.epsilon_min)

    def act(self, state, add_noise=True):
        """Returns actions for given state as per current policy."""
        if (type(state) != torch.Tensor):
            state = torch.from_numpy(state).float().to(device)
        self.actor_local.eval()
        with torch.no_grad():
            action = self.actor_local(state).cpu().data.numpy()
        self.actor_local.train()
        if add_noise:
            action += self.epsilon * self.noise.sample()
        return np.clip(action, -1, 1)

    def reset(self):
        self.noise.reset()

#     def learn(self, agent_name, experiences, gamma):
    def learn(self, agent_name, my_next_observation, other_next_action, next_observations,
              actions, observations, self_observation, other_pred_action, my_reward, my_done, gamma):
        """Update policy and value parameters using given batch of experience tuples.
        Q_targets = r + γ * critic_target(next_state, actor_target(next_state))
        where:
            actor_target(state) -> action
            critic_target(state, action) -> Q-value

        Params
        ======
            agent_name (str): name of the agent
            my_next_observation (torch.Tensor): current agent's own next observation
            other_next_action (torch.Tensor): other agents' actions **
            next_observations (torch.Tensor): god-view next observation
            actions (torch.Tensor): god-view observations
            observations (torch.Tensor): god-view observations
            self_observation (torch.Tensor): current agent's own observation
            other_pred_action (torch.Tensor): other agents' predicted actions
            gamma (float): discount factor
        """
        # ---------------------------- update critic ---------------------------- #
        # Get predicted next-state actions and Q values from target models
        
        next_action = self.actor_target(my_next_observation)
        if agent_name == 'agent_0':
            next_actions = torch.cat(
                           [next_action, other_next_action],
                           1).to(device)
        else:
            next_actions = torch.cat(
                           [other_next_action, next_action],
                           1).to(device)
        
        Q_targets_next = self.critic_target(next_observations, next_actions)
        # Compute Q targets for current states (y_i)
        Q_targets = my_reward + (gamma * Q_targets_next * (1 - my_done))
        # Compute critic loss
        Q_expected = self.critic_local(observations, actions)
        critic_loss = F.mse_loss(Q_expected, Q_targets)
        # Minimize the loss
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # ---------------------------- update actor ---------------------------- #
#         Compute actor loss
        pred_action = self.actor_local(self_observation)
        
        if agent_name == 'agent_0':
            pred_actions = torch.cat(
                           [pred_action, other_pred_action],
                           1).to(device)
        else:
            pred_actions = torch.cat(
                           [other_pred_action, pred_action],
                           1).to(device)
        
        actor_loss = -self.critic_local(observations, pred_actions).mean()

        # Minimize the loss
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # ----------------------- update target networks ----------------------- #
        self.soft_update(self.critic_local, self.critic_target, TAU)
        self.soft_update(self.actor_local, self.actor_target, TAU)                     

    def soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target

        Params
        ======
            local_model: PyTorch model (weights will be copied from)
            target_model: PyTorch model (weights will be copied to)
            tau (float): interpolation parameter 
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)


In [None]:
class MADDPG_Agent:

    def __init__(self, state_size, action_size, num_agents, random_seed = 0):
        """Initialize an Agent object.
        
        Params
        ======
            state_size (int): dimension of each state
            action_size (int): dimension of each action
            num_agents (int): how many agents to be trained
            random_seed (int): random seed
        """
        self.state_size = state_size
        self.action_size = action_size
        self.num_agents = num_agents
        self.seed = random_seed

        self.memory = Buffer(BUFFER_SIZE, BATCH_SIZE, self.seed)

        self.agent_names = []
        for i in range(num_agents):
            self.agent_names.append( 'agent_' + str(i) )

        self.agents = dict()
        for agent_name in self.agent_names:
            self.agents[agent_name] = DDPG_Agent(agent_name, state_size, action_size, self.seed)

    def act(self, observations, add_noise = True):
        '''get actions for both agents
        Params
        ======
            observations (np.array): current observation
            add_noise (bool): add noise or not
        '''
        actions = []
        for i, agent_name in enumerate(self.agent_names):
            actions.append(self.agents[agent_name].act(observations[i], add_noise))
        return np.array(actions)
    
    
    def reset(self):
        '''reset the noise class'''
        for agent_name in self.agent_names:
            self.agents[agent_name].reset()
    
    def epsilon_decay(self):
        '''Decay the noise amplitude if required'''
        for agent_name in self.agent_names:
            self.agents[agent_name].epsilon_decay()
            
    def step(self, observation, action, reward, next_observation, done, step):
        """Learning process, get past experience tuple in the replay buffer,
        
        Params
        ======
            observation (torch.Tensor): all agents' observations
            action (torch.Tensor): all agents' observations
            next_observations (torch.Tensor): all agents' next observation
            observations (torch.Tensor): all agents' observations
            done (torch.Tensor): all agents' dones
            step (int): current training step, use it for noise decay
        """
        self.memory.add(observation, action, reward, next_observation, done)
        if (len(self.memory) > BATCH_SIZE) and (step % TRAIN_EVERY) == 0:
            for _ in range(NUM_TRAINS) :
                experiences = self.memory.sample()
                observations, actions, rewards, next_observations, dones = experiences

                my_observation = torch.chunk(observations, NUM_AGENTS, dim = 1)
                my_next_observation = torch.chunk(next_observations, NUM_AGENTS, dim = 1)
                my_reward = torch.chunk(rewards, NUM_AGENTS, dim = 1)
                my_done = torch.chunk(dones, NUM_AGENTS, dim = 1)

                other_next_actions = []
                other_pred_actions = []
                # prepare next step actions for actor learning process. The date will be fed in critic_local
                for num_agent, agent_name in enumerate(self.agent_names):
                    other_next_actions.append( torch.Tensor(self.agents[agent_name].act(my_next_observation[num_agent])) )
                    other_pred_actions.append( torch.Tensor(self.agents[agent_name].act(my_observation[num_agent])) )

                self.agents['agent_0'].learn('agent_0', my_next_observation[0], other_next_actions[1], next_observations,
                  actions, observations, my_observation[0], other_pred_actions[1], my_reward[0], my_done[0], gamma = GAMMA)
                self.agents['agent_1'].learn('agent_1', my_next_observation[1], other_next_actions[0], next_observations,
                  actions, observations, my_observation[1], other_pred_actions[0], my_reward[1], my_done[1], gamma = GAMMA)

In [None]:
SEED = 0
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 256        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR_ACTOR = 1e-4         # learning rate of the actor
LR_CRITIC = 1e-3        # learning rate of the critic
NUM_AGENTS = 2          # number of agents
WEIGHT_DECAY = 0.       # L2 weight decay
TRAIN_EVERY = 1         # how often to train the network
NUM_TRAINS = 5          # number of trains per each train step

agent = MADDPG_Agent(state_size, action_size, num_agents, random_seed = SEED)

In [None]:
def maddpg(n_episodes=5000, score_lenth = 100 ):
    """Multi-Agent Deep Deterministic Policy Gradient for N agents
    
    Params
    ======
        n_episodes (int): maximum number of training episodes
    """
    scores_deque = deque(maxlen=score_lenth)                              
    scores = []                                                         
    average_scores = []
    for i_episode in range(1, n_episodes+1):                                    
        env_info = env.reset(train_mode=True)[brain_name]                   
        states = env_info.vector_observations                                          
        observation = states.reshape(1,NUM_AGENTS*state_size).squeeze(0)  # merge both agents' states as an observation
        score = np.zeros(num_agents)                                    
        agent.reset()
        step = 0
        while True:
            step += 1
            actions = agent.act(states)
            action = actions.reshape(1,NUM_AGENTS*action_size).squeeze(0) # merge both agents' actions as an action
            env_info = env.step(actions)[brain_name]                                  
            next_states = env_info.vector_observations                   
            next_observation = next_states.reshape(1,NUM_AGENTS*state_size).squeeze(0) # merge both agents' next states as the next observation
            rewards = env_info.rewards                                        
            dones = env_info.local_done                                 
            agent.step(observation, action, rewards, next_observation, dones, step)   
            states = next_states                                        
            observation = next_observation
            score += rewards                                             
            if any(dones):                                 
                break                                                   
#         agent.epsilon_decay()
        score = np.max(score)
        scores.append(score)
        scores_deque.append(score)
        average_score = np.mean(scores_deque)
        average_scores.append(average_score)
        print('\rEpisode {}\tAverage Score: {:.4f}\tScore: {:.4f}'.format(i_episode, average_score, score), end="")
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage score: {:.4f}'.format(i_episode , average_score))
        if average_score >= 0.5:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.4f}'.format(i_episode, average_score))
            break
    
    torch.save(agent.agents['agent_0'].actor_local.state_dict(), 'agent_one_checkpoint_actor.pth')
    torch.save(agent.agents['agent_0'].critic_local.state_dict(), 'agent_one_checkpoint_critic.pth')

    torch.save(agent.agents['agent_1'].actor_local.state_dict(), 'agent_two_checkpoint_actor.pth')
    torch.save(agent.agents['agent_1'].critic_local.state_dict(), 'agent_two_checkpoint_critic.pth')
    
    return scores, average_scores            

scores, average_scores = maddpg()

In [None]:
agent0_rewards = np.random.randn(300)
agent1_rewards = np.random.randn(300)

In [None]:
scores = np.max(np.stack((agent0_rewards, agent1_rewards)), axis=0)

In [None]:
scores[-1]

In [None]:
scores[299]

In [None]:
scores[:100]

In [None]:
len(average_scores)