# Collaboration and Competition

---

You are welcome to use this coding environment to train your agent for the project.  Follow the instructions below to get started!

### 1. Start the Environment

Run the next code cell to install a few packages.  This line will take a few minutes to run!

In [1]:
!pip -q install ./python

tensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.
ipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 2.0.9 which is incompatible.


The environment is already saved in the Workspace and can be accessed at the file path provided below. 

In [2]:
from unityagents import UnityEnvironment
import numpy as np

env = UnityEnvironment(file_name="/data/Tennis_Linux_NoVis/Tennis")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.         -6.65278625 -1.5        -0.          0.
  6.83172083  6.         -0.          0.        ]
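
The 24 values per agent follow from the brain summary printed earlier: 8 observation variables stacked over 3 frames. A minimal sketch of splitting one state vector back into frames, assuming the frames are stored one after another with the most recent one last (the layout is an assumption; the notebook does not state it):

```python
import numpy as np

# First agent's state as printed above: 16 zeros followed by 8 further values.
state = np.array([0.0] * 16 + [-6.65278625, -1.5, -0.0, 0.0,
                               6.83172083, 6.0, -0.0, 0.0])

obs_size, num_stacked = 8, 3          # taken from the brain summary
frames = state.reshape(num_stacked, obs_size)

print(frames.shape)                   # (3, 8)
print(frames[-1])                     # last block of 8 values
```

Under this assumed layout, the two leading all-zero frames would simply be padding at the start of an episode.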


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Note that **in this coding environment, you will not be able to watch the agents while they are training**, and you should set `train_mode=True` to restart the environment.

In [5]:
for i in range(1):                                         # play game for 1 episode
    env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    step = 0 
    while True:
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        step += 1
        if np.any(dones):                                  # exit loop if episode finished
            break
    print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: -0.004999999888241291
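
The printed score is the mean of the two agents' episode totals. A small sketch with made-up per-step rewards shows how the bookkeeping in the loop arrives at a value near -0.005. The reward sequence is hypothetical, chosen only so that one agent ends at about -0.01 and the other at 0:

```python
import numpy as np

num_agents = 2
scores = np.zeros(num_agents)

# Hypothetical rewards for three steps, one entry per agent.
step_rewards = [[0.0, 0.0], [0.0, 0.0], [0.0, -0.01]]
for rewards in step_rewards:
    scores += rewards                  # same update as in the loop above

mean_score = np.mean(scores)           # the quantity printed by the cell above
best_score = np.max(scores)            # the quantity tracked by the training loop later on
print(mean_score, best_score)
```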


## Test Model

In [11]:
import numpy as np
import math
import torch
import torch.nn as nn
import json
import torch.nn.functional as F
import torch.optim as optim
from collections import namedtuple, deque

def parse_params(params_dir):
    with open(params_dir) as fp:
        params = json.load(fp)
    return params

class Actor_critic_model(nn.Module):
    def __init__(self, params_dir, input_dim, act_size, num_agents):
        super().__init__()
        self.input_dim = input_dim
        self.act_size = act_size
        self.num_agents = num_agents
        self.params = parse_params(params_dir)
        self.mu = self.create_module_list(self.create_actor)
        self.std =  self.create_std()
        self.val = self.create_module_list(self.create_critic)
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        
    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, mode='fan_in')
            m.bias.data.fill_(0.01)

    def create_module_list(self, func):
        module_list = nn.ModuleList()
        for _ in range(self.num_agents):
            module_list.append(func())
        return module_list
    
    def create_std(self):
        param_list = nn.ParameterList([nn.Parameter(
            torch.ones(1, self.act_size)) for _ in range(self.num_agents)]
                        )
        return param_list

    def create_actor(self):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(self.input_dim, self.params['hidden_dim'])
        layer.add_module(f"fc_layer_1", fc)
        # layer.add_module(f"bn_layer_1",
                        # nn.BatchNorm1d(self.params['hidden_dim']))
        # layer.add_module(f"RELU_layer_1", nn.LeakyReLU())
        layer.add_module(f"RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        # module_list.apply(self._init_weights)
        self.add_hidden_layer(module_list, self.params['actor_h_num'],
                         self.params['hidden_dim'], self.params['hidden_dim'])
        module_list.append(nn.Sequential(nn.Linear(self.params['hidden_dim'],
                                          self.act_size)))
        # module_list.apply(self._init_weights)
        return module_list
    
    def create_critic(self):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(self.input_dim + self.act_size * 2, self.params['hidden_dim'])
        layer.add_module(f"fc_layer_1", fc)
        # layer.add_module(f"bn_layer_1",
                        # nn.BatchNorm1d(self.params['hidden_dim']))
        # layer.add_module(f"RELU_layer_1", nn.LeakyReLU())
        layer.add_module(f"RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        self.add_hidden_layer(module_list, self.params['critic_h_num'],
                         self.params['hidden_dim'], self.params['hidden_dim'])
        module_list.append(nn.Sequential(nn.Linear(self.params['hidden_dim'], 1)))
        # module_list.apply(self._init_weights)
        return module_list
    
    def add_hidden_layer(self, module_list, num_hidden_layer,
                         input_dim, output_dim):
        if num_hidden_layer == 0:
            return
        for i in range(1, num_hidden_layer+1):
            layer = nn.Sequential()
            fc = nn.Linear(input_dim, output_dim)
            layer.add_module(f"fc_layer_{i}", fc)
            # layer.add_module(f"bn_layer_{i}",
                          #    nn.BatchNorm1d(output_dim))
            # layer.add_module(f"RELU_layer_{i}", nn.LeakyReLU())
            layer.add_module(f"RELU_layer_{i}", nn.ReLU())
            module_list.append(layer)
            
    def forward(self, states, actor=True, train=True, index=None):
        '''
            If actor is True, output actions and log probabilities FloatTensor
            If Critic (actor = False), output state value FloatTensor
        '''
        x_ = torch.FloatTensor(states).to(self.device)
        if actor:
            # forward in actor path
            mus = torch.zeros(self.num_agents, self.act_size, dtype=torch.float)
            dists = []
            acts = torch.zeros(self.num_agents, self.act_size, dtype=torch.float)
            lps = torch.zeros(self.num_agents, self.act_size, dtype=torch.float)
            for i in range(self.num_agents):
                mu_ = x_[i]
                for m in self.mu[i]:
                    mu_ = m(mu_)
                mus[i] = mu_
                dists.append(torch.distributions.Normal(mus[i], self.std[i]))
                act_ = dists[i].sample()
                acts[i] = torch.clamp(act_, -1, 1)
                if train:
                    # log probabilities are only needed in the training phase
                    lps[i] = dists[i].log_prob(act_)
            # return only after every agent has been processed
            if train:
                return acts, lps
            # return only actions in the execution phase
            return acts
        # forward in value path
        for v in self.val[index]:
            x_ = v(x_)
        return x_
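
The actor path boils down to three operations per agent: sample an action from a normal distribution centred on the network output, record the log-probability of that sample, and clamp the action to [-1, 1] before it is sent to the environment. A dependency-free sketch of that step, with hypothetical means standing in for the network outputs and a standard deviation of 1 (the initial value produced by `create_std`):

```python
import math
import random

def normal_log_prob(x, mu, std):
    """Log-density of a univariate normal distribution at x."""
    return -((x - mu) ** 2) / (2 * std ** 2) - math.log(std) - 0.5 * math.log(2 * math.pi)

def sample_actions(mus, std=1.0, rng=random):
    """Return (clamped actions, log-probs) with one row per agent."""
    actions, log_probs = [], []
    for agent_mus in mus:                       # loop over every agent
        raw = [rng.gauss(m, std) for m in agent_mus]
        # log-probability of the unclamped sample, as in the notebook
        log_probs.append([normal_log_prob(a, m, std) for a, m in zip(raw, agent_mus)])
        actions.append([max(-1.0, min(1.0, a)) for a in raw])
    return actions, log_probs                   # returned once the loop has finished

# Hypothetical actor outputs for 2 agents with 2 action dimensions each.
mus = [[0.2, -0.4], [0.9, 0.0]]
actions, log_probs = sample_actions(mus, rng=random.Random(0))
print(actions)
print(log_probs)
```

The point of the sketch is the shape of the result: one row of actions and one row of log-probabilities for each agent.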

In [12]:
params_dir = f"./params.txt"
amodel =  Actor_critic_model(params_dir, 24, 2, 2)

In [13]:
list(amodel.parameters())


[Parameter containing:
 tensor([[-0.1985, -0.1288,  0.1789,  ..., -0.1654,  0.0376,  0.2027],
         [ 0.1717, -0.0791,  0.1173,  ..., -0.1954,  0.1795,  0.0756],
         [ 0.1393,  0.0430, -0.1515,  ...,  0.0009,  0.1954, -0.0147],
         ...,
         [-0.1946, -0.0024, -0.0642,  ...,  0.0591, -0.1154, -0.1974],
         [ 0.1854, -0.0100, -0.0588,  ..., -0.1981,  0.0050,  0.0286],
         [ 0.0320, -0.1381, -0.1619,  ...,  0.0542,  0.0346, -0.1930]]),
 Parameter containing:
 tensor([ 0.0354,  0.2000, -0.0518, -0.0361, -0.1515, -0.1081, -0.0692,
          0.0736,  0.0014,  0.1593, -0.1601,  0.0491,  0.1895, -0.0391,
          0.1125, -0.0936,  0.0335, -0.1701,  0.1948, -0.1872,  0.1438,
          0.1375, -0.1360,  0.0201, -0.0581, -0.1437, -0.0438,  0.1718,
         -0.1198, -0.1885, -0.0877, -0.0901, -0.0720,  0.0778, -0.0335,
          0.1936,  0.1831, -0.0003,  0.1432, -0.0970, -0.1702,  0.1719,
          0.0001,  0.1295, -0.1975,  0.1577, -0.1547, -0.0260,  0.0403,
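
The first critic layer is built with `input_dim + act_size * 2` inputs, which for this environment is 24 + 2 * 2 = 28. A sketch of how such an input can be assembled, assuming the extra entries are the agent's own action followed by the other agent's action (all values below are placeholders):

```python
import numpy as np

def joint_inputs(states, actions):
    """Concatenate each agent's state with its own and the other agent's action."""
    rows = []
    for i in range(len(states)):
        other = 1 - i                      # valid for exactly two agents
        rows.append(np.concatenate([states[i], actions[i], actions[other]]))
    return np.stack(rows)

states = np.zeros((2, 24))                 # placeholder observations
actions = np.array([[0.1, -0.2], [0.3, 0.4]])
critic_in = joint_inputs(states, actions)
print(critic_in.shape)                     # (2, 28)
```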
        

## Version 1

In [None]:
import numpy as np
import math
import torch
import torch.nn as nn
import json
import torch.nn.functional as F
import torch.optim as optim
from collections import namedtuple, deque


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
def get_extra_obs(states, actions):
    '''
        Return a list (len = num_agents) whose i-th entry concatenates
        agent i's state, agent i's action, and the other agent's action.
        states: array of states, one row per agent
        actions: array of actions, one row per agent
    '''
    extra_obs = []
    # print(f"actions : {actions}")
    for i in range(states.shape[0]):
        list_ = []
        # states
        list_.extend(states[i])
        # agent's action
        list_.extend(actions[i])
        # other agent's actions
        list_.extend(actions[np.arange(len(actions))!= i][0])
        extra_obs.append(list_)
    return extra_obs


class Actor_critic_model(nn.Module):
    def __init__(self, params_dir, input_dim, act_size, num_agents):
        super().__init__()
        self.input_dim = input_dim
        self.act_size = act_size
        self.num_agents = num_agents
        self.params = parse_params(params_dir)
        self.mu = self.create_module_list(self.create_actor)
        self.std =  self.create_std()
        self.val = self.create_module_list(self.create_critic)
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        
    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, mode='fan_in')
            m.bias.data.fill_(0.01)

    def create_module_list(self, func):
        module_list = nn.ModuleList()
        for _ in range(self.num_agents):
            module_list.append(func())
        return module_list
    
    def create_std(self):
        param_list = nn.ParameterList([nn.Parameter(
            torch.ones(1, self.act_size)) for _ in range(self.num_agents)]
                        )
        return param_list

    def create_actor(self):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(self.input_dim, self.params['hidden_dim'])
        layer.add_module(f"fc_layer_1", fc)
        # layer.add_module(f"bn_layer_1",
                        # nn.BatchNorm1d(self.params['hidden_dim']))
        # layer.add_module(f"RELU_layer_1", nn.LeakyReLU())
        layer.add_module(f"RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        # module_list.apply(self._init_weights)
        self.add_hidden_layer(module_list, self.params['actor_h_num'],
                         self.params['hidden_dim'], self.params['hidden_dim'])
        module_list.append(nn.Sequential(nn.Linear(self.params['hidden_dim'],
                                          self.act_size)))
        # module_list.apply(self._init_weights)
        return module_list
    
    def create_critic(self):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(self.input_dim + self.act_size * 2, self.params['hidden_dim'])
        layer.add_module(f"fc_layer_1", fc)
        # layer.add_module(f"bn_layer_1",
                        # nn.BatchNorm1d(self.params['hidden_dim']))
        # layer.add_module(f"RELU_layer_1", nn.LeakyReLU())
        layer.add_module(f"RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        self.add_hidden_layer(module_list, self.params['critic_h_num'],
                         self.params['hidden_dim'], self.params['hidden_dim'])
        module_list.append(nn.Sequential(nn.Linear(self.params['hidden_dim'], 1)))
        # module_list.apply(self._init_weights)
        return module_list
    
    def add_hidden_layer(self, module_list, num_hidden_layer,
                         input_dim, output_dim):
        if num_hidden_layer == 0:
            return
        for i in range(1, num_hidden_layer+1):
            layer = nn.Sequential()
            fc = nn.Linear(input_dim, output_dim)
            layer.add_module(f"fc_layer_{i}", fc)
            # layer.add_module(f"bn_layer_{i}",
                          #    nn.BatchNorm1d(output_dim))
            # layer.add_module(f"RELU_layer_{i}", nn.LeakyReLU())
            layer.add_module(f"RELU_layer_{i}", nn.ReLU())
            module_list.append(layer)
            
    def forward(self, states, actor=True, train=True, index=None):
        '''
            If actor is True, output actions and log probabilities FloatTensor
            If Critic (actor = False), output state value FloatTensor
        '''
        x_ = torch.FloatTensor(states).to(self.device)
        if actor:
            # forward in actor path
            mus = torch.zeros(self.num_agents, self.act_size, dtype=torch.float)
            dists = []
            acts = torch.zeros(self.num_agents, self.act_size, dtype=torch.float)
            lps = torch.zeros(self.num_agents, self.act_size, dtype=torch.float)
            for i in range(self.num_agents):
                mu_ = x_[i]
                for m in self.mu[i]:
                    mu_ = m(mu_)
                mus[i] = mu_
                dists.append(torch.distributions.Normal(mus[i], self.std[i]))
                act_ = dists[i].sample()
                acts[i] = torch.clamp(act_, -1, 1)
                if train:
                    # log probabilities are only needed in the training phase
                    lps[i] = dists[i].log_prob(act_)
            # return only after every agent has been processed
            if train:
                return acts, lps
            # return only actions in the execution phase
            return acts
        # forward in value path
        for v in self.val[index]:
            x_ = v(x_)
        return x_

def parse_params(params_dir):
    with open(params_dir) as fp:
        params = json.load(fp)
    return params


class Agent():
    def __init__(self, device, num_agents, params_dir, state_size, action_size):
        self.model = Actor_critic_model(params_dir, state_size, action_size, num_agents).to(device)
        
        # I should try a version without target, just like A2C
        self.target = Actor_critic_model(params_dir, state_size, action_size, num_agents).to(device)
        self.device = device
        self.num_agents = num_agents
        self.params = self.model.params
        self.optimizer = [optim.Adam(self.model.parameters(),
                                    lr=self.params['lr']) for _ in range(self.num_agents)]
                                    # lr=0.0001)

    def __call__(self, states):
        # mu, std, val, etp = self.model(states)
        actions, log_prob = self.model(states)
        return actions, log_prob

    def step(self, memories):
        '''
        second edition
            memories:
                list with one Experience buffer per agent; spit() returns
                one list per field, with one entry per stored step:
                    actions (num actions)
                    rewards (float)
                    log_probs (tensor: num actions)
                    not_dones (0 or 1)
                    states, next_states (list: state + own action
                                         + other agent's action)
        '''
        loss = [0.0] * self.num_agents
        for idx in range(self.num_agents):
            actions, rewards, log_probs, not_dones, states, next_states= memories[idx].spit()
            # print(f"state_values : {state_values}")
            rewards = torch.FloatTensor(rewards).view(-1, 1)
            #print(f"len(rewards[0]) - 1 : {len(rewards) - 1}")
            not_dones = torch.FloatTensor(not_dones).to(device).unsqueeze(1)
            # print(rewards)
            state_values = self.model(states, actor=False, index=idx)
            next_values = self.target(next_states, actor=False, index=idx)
            returns = rewards + self.params['gamma'] * not_dones * next_values.detach()
            advantage_  = rewards + self.params['gamma'] * not_dones * next_values.detach() - state_values.detach()
            # print(f"log_probs.shape : {log_probs.shape}")
            # print(f"advantage_.shape : {advantage_.shape}")
            # print(f"state_values[i].shape : {state_values[i].shape}")
            # print(f"return_.shape : {return_.shape}")
            # print(f"processed_experience : {processed_experience}")
            log_probs = torch.stack(log_probs)
            policy_loss = -(log_probs) * advantage_
            value_loss = (0.5 * (returns - state_values).pow(2))
            self.optimizer[idx].zero_grad()
            loss[idx] = ((policy_loss + value_loss.unsqueeze(1)).mean())
            # print(f"loss[idx] : {loss[idx]}")
            if torch.isnan(loss[idx]).any():
                print('Nan in loss function')
                pass
            # loss[idx].backward()
            if idx == (self.num_agents - 1):
                loss[idx].backward()
            else:
                loss[idx].backward(retain_graph=True)
            nn.utils.clip_grad_norm_(self.model.parameters(), self.model.params['grad_clip'])
            self.optimizer[idx].step()
        self.soft_udpate()
        
    def soft_udpate(self):
        for tp, lp in zip(self.target.parameters(),
                          self.model.parameters()):
            tp.data.copy_(self.params['TAU']*lp.data +
                          (1.0-self.params['TAU'])*tp.data)
            
    def hard_udpate(self):
        for tp, lp in zip(self.target.parameters(),
                          self.model.parameters()):
            tp.data.copy_(lp.data)


class Experience():
    def __init__(self):
        self.actions = []
        self.rewards = []
        #self.extra_into = []
        self.log_probs = []
        self.not_dones = []
        self.states = []
        self.next_states = []
        # self.etp = []

    def add(self, actions, rewards, log_probs, not_dones,
            states, next_states):
        self.actions.append(actions)
        self.rewards.append(rewards)
        #self.extra_into.append(extra_into)
        self.log_probs.append(log_probs)
        self.not_dones.append(not_dones)
        self.states.append(states)
        self.next_states.append(next_states)
        # self.etp.append(etp)

    def spit(self):
        return (self.actions[1:], self.rewards[1:], self.log_probs[1:],
                self.not_dones[1:],
                self.states[1:], self.next_states[1:])

    def __len__(self):
        return len(self.rewards)
    
params_dir = f"./params.txt"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
agent = Agent(device, num_agents, params_dir, state_size, action_size)

import matplotlib.pyplot as plt  # required by plot_scores; not imported above

def plot_scores(scores):
    fig, ax = plt.subplots(1, figsize=(8, 8))
    plt.plot(np.arange(len(scores)), scores)
    plt.ylabel('Score')
    plt.xlabel('Episode #')
    plt.show()


scores_window = deque(maxlen=100)
memories = [Experience() for _ in range(num_agents)]
learned_steps = 0
while learned_steps < 1000:
    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations  # start from the freshly reset observation
    scores = np.zeros(num_agents)
    done = [False] * num_agents
    actions_ = np.random.random([num_agents, action_size])
    log_prob_ = torch.rand(num_agents, action_size)
    rewards_ = [0] * num_agents
    states_plus = get_extra_obs(states, actions_)
    steps = 0
    # while not np.any(done):
    while steps < 20:
        actions_next, log_prob_next = agent(states)
        next_states_plus = get_extra_obs(states, actions_next.cpu().numpy())
        env_info = env.step(actions_next.detach().cpu().numpy())[brain_name]
        done = env_info.local_done
        not_done_ = (1 - np.array(done))
        for idx in range(num_agents):
            memories[idx].add(actions_[idx], rewards_[idx], log_prob_[idx],
                              not_done_[idx], states_plus[idx], next_states_plus[idx])
        steps += 1

        rewards_ = env_info.rewards
        states = env_info.vector_observations
        actions_ = actions_next
        log_prob_ = log_prob_next
        states_plus = next_states_plus
        scores += rewards_
        if (len(memories[0].actions) % 1000 == 0) and (len(memories[0].actions) >1):
            agent.step(memories)
            learned_steps += 1
            memories = [Experience() for _ in range(num_agents)]
            print(f"learned_steps {learned_steps}: {np.max(scores)}")
        if learned_steps % 1000 == 0:
            agent.hard_udpate()
        if np.any(done):
            # memories = Experience()
            # print(f" steps : {steps}")
            break
        # print(scores)

    scores_window.append(np.max(scores))
    if (len(scores_window)) == 100 and ((sum(scores_window) / len(scores_window)) > 0.5):
        torch.save(agent.model.state_dict(), agent.params['working_dir'])
        print(f"Environment solved after {learned_steps} learning steps!")
        print(f"Score: {scores_window}")
        break

learned_steps 1: 0.0
learned_steps 2: 0.0
learned_steps 3: 0.0
learned_steps 4: 0.0
learned_steps 5: 0.0
learned_steps 6: 0.0
learned_steps 7: 0.0
learned_steps 8: 0.0
learned_steps 9: 0.0
learned_steps 10: 0.0
learned_steps 11: 0.0
learned_steps 12: 0.0
learned_steps 13: 0.0
learned_steps 14: 0.0
learned_steps 15: 0.0
learned_steps 16: 0.0
learned_steps 17: 0.0
learned_steps 18: 0.10000000149011612
learned_steps 19: 0.0
learned_steps 20: 0.0
learned_steps 21: 0.0
learned_steps 22: 0.0
learned_steps 23: 0.0
learned_steps 24: 0.0
learned_steps 25: 0.0
learned_steps 26: 0.0
learned_steps 27: 0.0
learned_steps 28: 0.0
learned_steps 29: 0.0
learned_steps 30: 0.0
learned_steps 31: 0.0
learned_steps 32: 0.0
learned_steps 33: 0.0
learned_steps 34: 0.0
learned_steps 35: 0.0
learned_steps 36: 0.0
learned_steps 37: 0.0
learned_steps 38: 0.0
learned_steps 39: 0.0
learned_steps 40: 0.0
learned_steps 41: 0.0
learned_steps 42: 0.0
learned_steps 43: 0.0
learned_steps 44: 0.0
learned_steps 45: 0.0
lea

learned_steps 361: 0.0
learned_steps 362: 0.0
learned_steps 363: 0.0
learned_steps 364: 0.0
learned_steps 365: 0.0
learned_steps 366: 0.0
learned_steps 367: 0.0
learned_steps 368: 0.0
learned_steps 369: 0.0
learned_steps 370: 0.0
learned_steps 371: 0.0
learned_steps 372: 0.0
learned_steps 373: 0.0
learned_steps 374: 0.0
learned_steps 375: 0.0
learned_steps 376: 0.0
learned_steps 377: 0.0
learned_steps 378: 0.0
learned_steps 379: 0.0
learned_steps 380: 0.0
learned_steps 381: 0.0
learned_steps 382: 0.0
learned_steps 383: 0.0
learned_steps 384: 0.0
learned_steps 385: 0.0
learned_steps 386: 0.0
learned_steps 387: 0.0
learned_steps 388: 0.0
learned_steps 389: 0.0
learned_steps 390: 0.0
learned_steps 391: 0.0
learned_steps 392: 0.0
learned_steps 393: 0.0
learned_steps 394: 0.0
learned_steps 395: 0.0
learned_steps 396: 0.0
learned_steps 397: 0.0
learned_steps 398: 0.0
learned_steps 399: 0.0
learned_steps 400: 0.0
learned_steps 401: 0.0
learned_steps 402: 0.0
learned_steps 403: 0.0
learned_ste
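
The update in `Agent.step` rests on a one-step temporal-difference target, `returns = r + gamma * not_done * V_target(s')`, with the advantage obtained by subtracting the current value estimate. `soft_udpate` then blends the online weights into the target weights with factor `TAU`. A plain-Python sketch with made-up numbers (the rewards, values, `gamma` and `tau` below are illustrative and not taken from `params.txt`):

```python
def td_targets(rewards, not_dones, next_values, gamma):
    """One-step TD targets: r + gamma * not_done * V(s')."""
    return [r + gamma * nd * nv for r, nd, nv in zip(rewards, not_dones, next_values)]

def soft_update(target, online, tau):
    """Move each target weight a fraction tau towards the online weight."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]

rewards     = [0.0, 0.1, -0.01]
not_dones   = [1, 1, 0]            # 0 marks a terminal transition
values      = [0.5, 0.4, 0.2]      # V(s) from the online critic
next_values = [0.4, 0.2, 0.3]      # V(s') from the target critic

returns = td_targets(rewards, not_dones, next_values, gamma=0.99)
advantages = [g - v for g, v in zip(returns, values)]
print(returns)
print(advantages)

new_target = soft_update(target=[0.0, 1.0], online=[1.0, 1.0], tau=0.1)
print(new_target)
```

On the terminal transition the bootstrap term is masked out, so the target reduces to the reward alone.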

## Version 2

In [6]:
import numpy as np
import math
import torch
import torch.nn as nn
import json
import torch.nn.functional as F
import torch.optim as optim
from collections import namedtuple, deque


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
def get_extra_obs(states, actions):
    '''
        Return a list (len = num_agents) whose i-th entry concatenates
        agent i's state, agent i's action, and the other agent's action.
        states: array of states, one row per agent
        actions: array of actions, one row per agent
    '''
    extra_obs = []
    # print(f"actions : {actions}")
    for i in range(states.shape[0]):
        list_ = []
        # states
        list_.extend(states[i])
        # agent's action
        list_.extend(actions[i])
        # other agent's actions
        list_.extend(actions[np.arange(len(actions))!= i][0])
        extra_obs.append(list_)
    return extra_obs


class Actor_critic_model(nn.Module):
    def __init__(self, params_dir, input_dim, act_size, num_agents):
        super().__init__()
        self.input_dim = input_dim
        self.act_size = act_size
        self.num_agents = num_agents
        self.params = parse_params(params_dir)
        self.mu = self.create_module_list(self.create_actor)
        self.std =  self.create_std()
        self.val = self.create_module_list(self.create_critic)
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        
    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, mode='fan_in')
            m.bias.data.fill_(0.01)

    def create_module_list(self, func):
        module_list = nn.ModuleList()
        for _ in range(self.num_agents):
            module_list.append(func())
        return module_list
    
    def create_std(self):
        param_list = nn.ParameterList([nn.Parameter(
            torch.ones(1, self.act_size)) for _ in range(self.num_agents)]
                        )
        return param_list

    def create_actor(self):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(self.input_dim, self.params['hidden_dim'])
        layer.add_module(f"fc_layer_1", fc)
        # layer.add_module(f"bn_layer_1",
                        # nn.BatchNorm1d(self.params['hidden_dim']))
        # layer.add_module(f"RELU_layer_1", nn.LeakyReLU())
        layer.add_module(f"RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        # module_list.apply(self._init_weights)
        self.add_hidden_layer(module_list, self.params['actor_h_num'],
                         self.params['hidden_dim'], self.params['hidden_dim'])
        module_list.append(nn.Sequential(nn.Linear(self.params['hidden_dim'],
                                          self.act_size)))
        # module_list.apply(self._init_weights)
        return module_list
    
    def create_critic(self):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(self.input_dim + self.act_size * 2, self.params['hidden_dim'])
        layer.add_module(f"fc_layer_1", fc)
        # layer.add_module(f"bn_layer_1",
                        # nn.BatchNorm1d(self.params['hidden_dim']))
        # layer.add_module(f"RELU_layer_1", nn.LeakyReLU())
        layer.add_module(f"RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        self.add_hidden_layer(module_list, self.params['critic_h_num'],
                         self.params['hidden_dim'], self.params['hidden_dim'])
        module_list.append(nn.Sequential(nn.Linear(self.params['hidden_dim'], 1)))
        # module_list.apply(self._init_weights)
        return module_list
    
    def add_hidden_layer(self, module_list, num_hidden_layer,
                         input_dim, output_dim):
        if num_hidden_layer == 0:
            return
        for i in range(1, num_hidden_layer+1):
            layer = nn.Sequential()
            fc = nn.Linear(input_dim, output_dim)
            layer.add_module(f"fc_layer_{i}", fc)
            # layer.add_module(f"bn_layer_{i}",
                          #    nn.BatchNorm1d(output_dim))
            # layer.add_module(f"RELU_layer_{i}", nn.LeakyReLU())
            layer.add_module(f"RELU_layer_{i}", nn.ReLU())
            module_list.append(layer)
            
    def forward(self, states, actor=True, train=True, index=None):
        '''
            If actor is True, output actions and log probabilities FloatTensor
            If Critic (actor = False), output state value FloatTensor
        '''
        x_ = torch.FloatTensor(states).to(self.device)
        if actor:
            # forward in actor path
            mus = torch.zeros(self.num_agents, self.act_size, dtype=torch.float)
            dists = []
            acts = torch.zeros(self.num_agents, self.act_size, dtype=torch.float)
            lps = torch.zeros(self.num_agents, self.act_size, dtype=torch.float)
            for i in range(self.num_agents):
                mu_ = x_[i]
                for m in self.mu[i]:
                    mu_ = m(mu_)
                mus[i] = mu_
                dists.append(torch.distributions.Normal(mus[i], self.std[i]))
                act_ = dists[i].sample()
                acts[i] = torch.clamp(act_, -1, 1)
                if train:
                    # log probabilities are only needed in the training phase
                    lps[i] = dists[i].log_prob(act_)
            # return only after every agent has been processed
            if train:
                return acts, lps
            # return only actions in the execution phase
            return acts
        # forward in value path
        for v in self.val[index]:
            x_ = v(x_)
        return x_

def parse_params(params_dir):
    with open(params_dir) as fp:
        params = json.load(fp)
    return params


class Agent():
    def __init__(self, device, num_agents, params_dir, state_size, action_size):
        self.model = Actor_critic_model(params_dir, state_size, action_size, num_agents).to(device)
        
        # I should try a version without target, just like A2C
        self.target = Actor_critic_model(params_dir, state_size, action_size, num_agents).to(device)
        self.device = device
        self.num_agents = num_agents
        self.params = self.model.params
        self.optimizer = [optim.Adam(self.model.parameters(),
                                    lr=self.params['lr']) for _ in range(self.num_agents)]
                                    # lr=0.0001)

    def __call__(self, states):
        # mu, std, val, etp = self.model(states)
        actions, log_prob = self.model(states)
        return actions, log_prob

    def step(self, memories):
        '''
        Second edition.
            memories:
                list with one Experience per agent; memories[idx].spit() returns
                    actions (list of per-step actions, each of size num actions)
                    rewards (list of floats)
                    log_probs (list of tensors, each of size num actions)
                    not_dones (list of 0/1 flags)
                    states, next_states (lists of critic inputs from get_extra_obs)
        '''
        loss = [0.0] * self.num_agents
        for idx in range(self.num_agents):
            actions, rewards, log_probs, not_dones, states, next_states = memories[idx].spit()
            rewards = torch.FloatTensor(rewards).view(-1, 1).to(self.device)
            not_dones = torch.FloatTensor(not_dones).to(self.device).unsqueeze(1)
            state_values = self.model(states, actor=False, index=idx)
            next_values = self.target(next_states, actor=False, index=idx)
            # one-step TD target and advantage
            returns = rewards + self.params['gamma'] * not_dones * next_values.detach()
            advantage_ = returns - state_values.detach()
            log_probs = torch.stack(log_probs)
            policy_loss = -(log_probs) * advantage_
            value_loss = 0.5 * (returns - state_values).pow(2)
            self.optimizer[idx].zero_grad()
            # average each term separately; adding value_loss.unsqueeze(1) would
            # broadcast the (N, 1) critic loss against the (N, act_size) policy loss
            loss[idx] = policy_loss.mean() + value_loss.mean()
            if torch.isnan(loss[idx]).any():
                print('Nan in loss function')
            # the sampled log probabilities of all agents share one graph, so keep
            # it alive until the last agent has back-propagated
            if idx == (self.num_agents - 1):
                loss[idx].backward()
            else:
                loss[idx].backward(retain_graph=True)
            nn.utils.clip_grad_norm_(self.model.parameters(), self.model.params['grad_clip'])
            self.optimizer[idx].step()
        self.soft_udpate()
        
    def soft_udpate(self):
        for tp, lp in zip(self.target.parameters(),
                          self.model.parameters()):
            tp.data.copy_(self.params['TAU']*lp.data +
                          (1.0-self.params['TAU'])*tp.data)
            
    def hard_udpate(self):
        for tp, lp in zip(self.target.parameters(),
                          self.model.parameters()):
            tp.data.copy_(lp.data)


class Experience():
    def __init__(self):
        self.actions = []
        self.rewards = []
        #self.extra_into = []
        self.log_probs = []
        self.not_dones = []
        self.states = []
        self.next_states = []
        # self.etp = []

    def add(self, actions, rewards, log_probs, not_dones,
            states, next_states):
        self.actions.append(actions)
        self.rewards.append(rewards)
        #self.extra_into.append(extra_into)
        self.log_probs.append(log_probs)
        self.not_dones.append(not_dones)
        self.states.append(states)
        self.next_states.append(next_states)
        # self.etp.append(etp)

    def spit(self):
        return (self.actions[1:], self.rewards[1:], self.log_probs[1:],
                self.not_dones[1:],
                self.states[1:], self.next_states[1:])

    def __len__(self):
        return len(self.rewards)
    
params_dir = f"./params.txt"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
agent = Agent(device, num_agents, params_dir, state_size, action_size)

def plot_scores(scores):
    fig, ax = plt.subplots(1, figsize=(8, 8))
    plt.plot(np.arange(len(scores)), scores)
    plt.ylabel('Score')
    plt.xlabel('Episode #')
    plt.show()


scores_window = deque(maxlen=100)
memories = [Experience() for _ in range(num_agents)]
learned_steps = 0
episode = 0
# start with the target network equal to the local model
agent.hard_udpate()
while learned_steps < 1000:
    episode += 1
    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations
    scores = np.zeros(num_agents)
    done = [False] * num_agents
    # placeholder "previous step" values for the first transition of the episode
    actions_ = np.random.random([num_agents, action_size])
    log_prob_ = torch.rand(num_agents, action_size, device=device)
    rewards_ = [0] * num_agents
    states_plus = get_extra_obs(states, actions_)
    steps = 0
    # while not np.any(done):
    while steps < 20:
        actions_next, log_prob_next = agent(states)
        actions_np = actions_next.detach().cpu().numpy()
        next_states_plus = get_extra_obs(states, actions_np)
        env_info = env.step(actions_np)[brain_name]
        done = env_info.local_done
        not_done_ = (1 - np.array(done))
        for idx in range(num_agents):
            memories[idx].add(actions_[idx], rewards_[idx], log_prob_[idx],
                              not_done_[idx], states_plus[idx], next_states_plus[idx])
        steps += 1

        rewards_ = env_info.rewards
        states = env_info.vector_observations
        actions_ = actions_next
        log_prob_ = log_prob_next
        states_plus = next_states_plus
        scores += rewards_
        if (len(memories[0].actions) % 1000 == 0) and (len(memories[0].actions) > 1):
            agent.step(memories)
            learned_steps += 1
            memories = [Experience() for _ in range(num_agents)]
            print(f"learned_steps {learned_steps}: {np.max(scores)}")
            # hard-sync the target only right after a learning step
            if learned_steps % 1000 == 0:
                agent.hard_udpate()
        if np.any(done):
            break

    scores_window.append(np.max(scores))
    if (len(scores_window)) == 100 and ((sum(scores_window) / len(scores_window)) > 0.5):
        torch.save(agent.model.state_dict(), agent.params['working_dir'])
        print(f"Environment solved in episode {episode}!")
        print(f"Score: {scores_window}")
        break

array([[-0.76130287,  0.2669243 ],
       [-1.        , -0.35200716]])

In [14]:
np.arange(len(actions))!= 0

array([False,  True], dtype=bool)

In [6]:
def get_extra_obs(states, actions):
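    '''
        Return one list per agent (len = num_agents) holding that agent's state,
        its own action, and the other agent's action (only the first other
        agent is used).
        states: array of states, one row per agent
        actions: NumPy array of actions, one row per agent
    '''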
    extra_obs = []
    # print(f"actions : {actions}")
    for i in range(states.shape[0]):
        list_ = []
        # states
        list_.extend(states[i])
        # agent's action
        list_.extend(actions[i])
        # other agent's actions
        list_.extend(actions[np.arange(len(actions))!= i][0])
        extra_obs.append(list_)
    return extra_obs

In [7]:
len(get_extra_obs(states, actions)[0])

28
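The length of 28 follows from the layout that `get_extra_obs` builds for each agent: the agent's own 24-value observation (8 values times 3 stacked frames), its own 2-value action, then the other agent's 2-value action. A minimal sketch with dummy arrays, assuming two agents and NumPy inputs; the helper name `build_critic_input` and the sample values are illustrative and not part of the notebook:

```python
import numpy as np

def build_critic_input(states, actions):
    """Hypothetical helper mirroring get_extra_obs, returning an array."""
    rows = []
    for i in range(states.shape[0]):
        # boolean mask that is False only at row i, i.e. selects the other agents
        others = np.arange(len(actions)) != i
        row = np.concatenate([states[i], actions[i], actions[others].ravel()])
        rows.append(row)
    return np.stack(rows)

states = np.zeros((2, 24))                      # dummy observations
actions = np.array([[0.1, 0.2], [0.3, 0.4]])    # dummy actions
critic_in = build_critic_input(states, actions)
print(critic_in.shape)      # (2, 28)
print(critic_in[0, -4:])    # [0.1 0.2 0.3 0.4]
```

With exactly two agents, `actions[others].ravel()` and the notebook's `actions[...][0]` pick the same row; with more agents the sketch would append every other agent's action, whereas the notebook keeps only the first.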

In [12]:
import torch

In [10]:
a = get_extra_obs(states, actions)

In [15]:
torch.FloatTensor(a[0])

tensor([ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         0.0000,  0.0000, -7.4364, -1.5000, -0.0000,  0.0000,  6.6949,
         5.9608, -0.0000,  0.0000, -1.0000, -0.8501,  0.9570,  0.6319])

In [16]:
rewards

[0.0, -0.009999999776482582]
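The training loops in this notebook score an episode as the larger of the two agents' summed rewards and stop once the mean over the last 100 episodes exceeds 0.5. A small sketch of that bookkeeping, assuming made-up per-step rewards; the helper `episode_score` and the sample numbers are illustrative only:

```python
from collections import deque
import numpy as np

def episode_score(step_rewards):
    """Sum rewards per agent over the episode, then take the best agent."""
    return float(np.max(np.sum(np.asarray(step_rewards), axis=0)))

scores_window = deque(maxlen=100)  # keeps only the most recent 100 episodes

# Hypothetical episode: rows are time steps, columns are the two agents.
rewards_per_step = [[0.0, -0.01], [0.1, 0.0], [0.0, 0.0]]
scores_window.append(episode_score(rewards_per_step))

# Solved only once the window is full and its mean exceeds 0.5.
solved = len(scores_window) == 100 and np.mean(scores_window) > 0.5
print(scores_window[-1], solved)
```

Because the window holds at most 100 entries, older episodes drop out automatically and the average always reflects recent performance.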

In [64]:
np.random.random([2,3])

array([[ 0.91444326,  0.80984923,  0.56989909],
       [ 0.18852114,  0.22988881,  0.49607161]])
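`Agent.step` builds a one-step TD target, `returns = rewards + gamma * not_dones * next_values`, and subtracts the current value estimate to get the advantage. A NumPy sketch of that arithmetic on made-up numbers; `gamma` and every value below are illustrative assumptions, not the notebook's hyperparameters:

```python
import numpy as np

gamma = 0.99                                      # assumed discount factor
rewards      = np.array([[0.0], [0.1], [-0.01]])
not_dones    = np.array([[1.0], [1.0], [0.0]])    # 0 marks a terminal step
state_values = np.array([[0.5], [0.4], [0.2]])    # V(s_t) from the critic
next_values  = np.array([[0.4], [0.2], [0.3]])    # V(s_{t+1}) from the target

returns = rewards + gamma * not_dones * next_values   # bootstrapped TD target
advantage = returns - state_values                    # weights the log probabilities
value_loss = 0.5 * (returns - state_values) ** 2      # critic regression loss

print(returns.ravel())
print(advantage.ravel())
```

On the terminal step the `not_dones` factor removes the bootstrap term, so the target there is just the reward.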

## Version 3

In [6]:
import numpy as np
import math
import torch
import torch.nn as nn
import json
import torch.nn.functional as F
import torch.optim as optim
from collections import namedtuple, deque
import matplotlib.pyplot as plt  # used by plot_scores below


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
def get_extra_obs(states, actions):
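    '''
        Return one list per agent (len = num_agents) holding that agent's state,
        its own action, and the other agent's action (only the first other
        agent is used).
        states: array of states, one row per agent
        actions: NumPy array of actions, one row per agent
    '''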
    extra_obs = []
    # print(f"actions : {actions}")
    for i in range(states.shape[0]):
        list_ = []
        # states
        list_.extend(states[i])
        # agent's action
        list_.extend(actions[i])
        # other agent's actions
        list_.extend(actions[np.arange(len(actions))!= i][0])
        extra_obs.append(list_)
    return extra_obs


class Actor_critic_model(nn.Module):
    def __init__(self, params_dir, input_dim, act_size, num_agents):
        super().__init__()
        self.input_dim = input_dim
        self.act_size = act_size
        self.num_agents = num_agents
        self.params = parse_params(params_dir)
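        # nn.ModuleList / nn.ParameterList register the per-agent actors and stds,
        # so model.parameters() and .to(device) include them (a plain list does not)
        self.mu = nn.ModuleList([self.create_actor() for _ in range(self.num_agents)])
        self.std = nn.ParameterList([nn.Parameter(torch.ones(1, act_size)) for _ in range(self.num_agents)])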
        self.val = self.create_critic()
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        
    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, mode='fan_in')
            m.bias.data.fill_(0.01)

    def create_actor(self):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(self.input_dim, self.params['hidden_dim'])
        layer.add_module(f"fc_layer_1", fc)
        # layer.add_module(f"bn_layer_1",
                        # nn.BatchNorm1d(self.params['hidden_dim']))
        # layer.add_module(f"RELU_layer_1", nn.LeakyReLU())
        layer.add_module(f"RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        # module_list.apply(self._init_weights)
        self.add_hidden_layer(module_list, self.params['actor_h_num'],
                         self.params['hidden_dim'], self.params['hidden_dim'])
        module_list.append(nn.Sequential(nn.Linear(self.params['hidden_dim'],
                                          self.act_size)))
        # module_list.apply(self._init_weights)
        return module_list
    
    def create_critic(self):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(self.input_dim + self.act_size * 2, self.params['hidden_dim'])
        layer.add_module(f"fc_layer_1", fc)
        # layer.add_module(f"bn_layer_1",
                        # nn.BatchNorm1d(self.params['hidden_dim']))
        # layer.add_module(f"RELU_layer_1", nn.LeakyReLU())
        layer.add_module(f"RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        self.add_hidden_layer(module_list, self.params['critic_h_num'],
                         self.params['hidden_dim'], self.params['hidden_dim'])
        module_list.append(nn.Sequential(nn.Linear(self.params['hidden_dim'], 1)))
        # module_list.apply(self._init_weights)
        return module_list
    
    def add_hidden_layer(self, module_list, num_hidden_layer,
                         input_dim, output_dim):
        if num_hidden_layer == 0:
            return
        for i in range(1, num_hidden_layer+1):
            layer = nn.Sequential()
            fc = nn.Linear(input_dim, output_dim)
            layer.add_module(f"fc_layer_{i}", fc)
            # layer.add_module(f"bn_layer_{i}",
                          #    nn.BatchNorm1d(output_dim))
            # layer.add_module(f"RELU_layer_{i}", nn.LeakyReLU())
            layer.add_module(f"RELU_layer_{i}", nn.ReLU())
            module_list.append(layer)
            
    def forward(self, states, actor=True, train=True):
        '''
            If actor is True, output actions and log probabilities FloatTensor
            If Critic (actor = False), output state value FloatTensor
        '''
        x_ = torch.FloatTensor(states).to(self.device)
        if actor:
            # forward in actor path
            mus = torch.zeros(self.num_agents, self.act_size, dtype=torch.float, device=self.device)
            dists = []
            acts = torch.zeros(self.num_agents, self.act_size, dtype=torch.float, device=self.device)
            lps = torch.zeros(self.num_agents, self.act_size, dtype=torch.float, device=self.device)
            for i in range(self.num_agents):
                mu_ = x_[i]
                for m in self.mu[i]:
                    mu_ = m(mu_)
                mus[i] = mu_
                dists.append(torch.distributions.Normal(mus[i], self.std[i]))
                act_ = dists[i].sample()
                acts[i] = torch.clamp(act_, -1, 1)
                if train:
                    # log probabilities are only needed in the training phase
                    lps[i] = dists[i].log_prob(act_)
            # return only after every agent has been processed
            if train:
                return acts, lps
            # execution phase: return only the actions
            return acts
        # forward in value path
        for v in self.val:
            x_ = v(x_)
        return x_

def parse_params(params_dir):
    with open(params_dir) as fp:
        params = json.load(fp)
    return params


class Agent():
    def __init__(self, device, num_agents, params_dir, state_size, action_size):
        self.model = Actor_critic_model(params_dir, state_size, action_size, num_agents).to(device)
        
        # I should try a version without target, just like A2C
        self.target = Actor_critic_model(params_dir, state_size, action_size, num_agents).to(device)
        self.device = device
        self.num_agents = num_agents
        self.params = self.model.params
        self.optimizer = optim.Adam(self.model.parameters(),
                                    lr=self.params['lr'])
                                    # lr=0.0001)

    def __call__(self, states):
        # mu, std, val, etp = self.model(states)
        actions, log_prob = self.model(states)
        return actions, log_prob

    def step(self, memories):
        '''
        Second edition.
            memories:
                list with one Experience per agent; memories[idx].spit() returns
                    actions (list of per-step actions, each of size num actions)
                    rewards (list of floats)
                    log_probs (list of tensors, each of size num actions)
                    not_dones (list of 0/1 flags)
                    states, next_states (lists of critic inputs from get_extra_obs)
        '''
        loss = [0.0] * self.num_agents
        self.optimizer.zero_grad()
        for idx in range(self.num_agents):
            actions, rewards, log_probs, not_dones, states, next_states = memories[idx].spit()
            rewards = torch.FloatTensor(rewards).view(-1, 1).to(self.device)
            not_dones = torch.FloatTensor(not_dones).to(self.device).unsqueeze(1)
            state_values = self.model(states, actor=False)
            next_values = self.target(next_states, actor=False)
            # one-step TD target and advantage
            returns = rewards + self.params['gamma'] * not_dones * next_values.detach()
            advantage_ = returns - state_values.detach()
            log_probs = torch.stack(log_probs)
            policy_loss = -(log_probs) * advantage_
            value_loss = 0.5 * (returns - state_values).pow(2)
            # average each term separately; adding value_loss.unsqueeze(1) would
            # broadcast the (N, 1) critic loss against the (N, act_size) policy loss
            loss[idx] = policy_loss.mean() + value_loss.mean()
            if torch.isnan(loss[idx]).any():
                print('Nan in loss function')
        # The agents share one model and one optimizer, so back-propagate the summed
        # loss once instead of stepping the optimizer between retained backward passes.
        sum(loss).backward()
        nn.utils.clip_grad_norm_(self.model.parameters(), self.model.params['grad_clip'])
        self.optimizer.step()
        self.soft_udpate()
        
    def soft_udpate(self):
        for tp, lp in zip(self.target.parameters(),
                          self.model.parameters()):
            tp.data.copy_(self.params['TAU']*lp.data +
                          (1.0-self.params['TAU'])*tp.data)
            
    def hard_udpate(self):
        for tp, lp in zip(self.target.parameters(),
                          self.model.parameters()):
            tp.data.copy_(lp.data)


class Experience():
    def __init__(self):
        self.actions = []
        self.rewards = []
        #self.extra_into = []
        self.log_probs = []
        self.not_dones = []
        self.states = []
        self.next_states = []
        # self.etp = []

    def add(self, actions, rewards, log_probs, not_dones,
            states, next_states):
        self.actions.append(actions)
        self.rewards.append(rewards)
        #self.extra_into.append(extra_into)
        self.log_probs.append(log_probs)
        self.not_dones.append(not_dones)
        self.states.append(states)
        self.next_states.append(next_states)
        # self.etp.append(etp)

    def spit(self):
        return (self.actions[1:], self.rewards[1:], self.log_probs[1:],
                self.not_dones[1:],
                self.states[1:], self.next_states[1:])

    def __len__(self):
        return len(self.rewards)
    
params_dir = f"./params.txt"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
agent = Agent(device, num_agents, params_dir, state_size, action_size)

def plot_scores(scores):
    fig, ax = plt.subplots(1, figsize=(8, 8))
    plt.plot(np.arange(len(scores)), scores)
    plt.ylabel('Score')
    plt.xlabel('Episode #')
    plt.show()


scores_window = deque(maxlen=100)
memories = [Experience() for _ in range(num_agents)]
learned_steps = 0
episode = 0
# start with the target network equal to the local model
agent.hard_udpate()
while learned_steps < 5000:
    episode += 1
    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations
    scores = np.zeros(num_agents)
    done = [False] * num_agents
    # placeholder "previous step" values for the first transition of the episode
    actions_ = np.random.random([num_agents, action_size])
    log_prob_ = torch.rand(num_agents, action_size, device=device)
    rewards_ = [0] * num_agents
    states_plus = get_extra_obs(states, actions_)
    steps = 0
    # while not np.any(done):
    while steps < 20:
        actions_next, log_prob_next = agent(states)
        actions_np = actions_next.detach().cpu().numpy()
        next_states_plus = get_extra_obs(states, actions_np)
        env_info = env.step(actions_np)[brain_name]
        done = env_info.local_done
        not_done_ = (1 - np.array(done))
        for idx in range(num_agents):
            memories[idx].add(actions_[idx], rewards_[idx], log_prob_[idx],
                              not_done_[idx], states_plus[idx], next_states_plus[idx])
        steps += 1

        rewards_ = env_info.rewards
        states = env_info.vector_observations
        actions_ = actions_next
        log_prob_ = log_prob_next
        states_plus = next_states_plus
        scores += rewards_
        if (len(memories[0].actions) % 400 == 0) and (len(memories[0].actions) > 1):
            agent.step(memories)
            learned_steps += 1
            memories = [Experience() for _ in range(num_agents)]
            print(f"learned_steps {learned_steps}: {np.max(scores)}")
            # hard-sync the target only right after a learning step
            if learned_steps % 1000 == 0:
                agent.hard_udpate()
        if np.any(done):
            break

    scores_window.append(np.max(scores))
    if (len(scores_window)) == 100 and ((sum(scores_window) / len(scores_window)) > 0.5):
        torch.save(agent.model.state_dict(), agent.params['working_dir'])
        print(f"Environment solved in episode {episode}!")
        print(f"Score: {scores_window}")
        break

learned_steps 1: 0.0
learned_steps 2: 0.0
learned_steps 3: 0.0
learned_steps 4: 0.0
learned_steps 5: 0.0
learned_steps 6: 0.0
learned_steps 7: 0.0
learned_steps 8: 0.0
learned_steps 9: 0.0
learned_steps 10: 0.0
learned_steps 11: 0.0
learned_steps 12: 0.0
learned_steps 13: 0.0
learned_steps 14: 0.0
learned_steps 15: 0.0
learned_steps 16: 0.0
learned_steps 17: 0.10000000149011612
learned_steps 18: 0.0
learned_steps 19: 0.10000000149011612
learned_steps 20: 0.0
learned_steps 21: 0.0
learned_steps 22: 0.0
learned_steps 23: 0.0
learned_steps 24: 0.0
learned_steps 25: 0.0
learned_steps 26: 0.0
learned_steps 27: 0.0
learned_steps 28: 0.0
learned_steps 29: 0.0
learned_steps 30: 0.0
learned_steps 31: 0.0
learned_steps 32: 0.0
learned_steps 33: 0.0
learned_steps 34: 0.0
learned_steps 35: 0.0
learned_steps 36: 0.0
learned_steps 37: 0.0
learned_steps 38: 0.0
learned_steps 39: 0.0
learned_steps 40: 0.0
learned_steps 41: 0.0
learned_steps 42: 0.0
learned_steps 43: 0.0
learned_steps 44: 0.0
...

learned_steps 351: 0.0
learned_steps 352: 0.0
learned_steps 353: 0.0
learned_steps 354: 0.0
learned_steps 355: 0.0
learned_steps 356: 0.0
learned_steps 357: 0.0
learned_steps 358: 0.0
learned_steps 359: 0.0
learned_steps 360: 0.0
learned_steps 361: 0.0
learned_steps 362: 0.10000000149011612
learned_steps 363: 0.0
learned_steps 364: 0.0
learned_steps 365: 0.0
learned_steps 366: 0.10000000149011612
learned_steps 367: 0.0
learned_steps 368: 0.0
learned_steps 369: 0.0
learned_steps 370: 0.0
learned_steps 371: 0.0
learned_steps 372: 0.0
learned_steps 373: 0.0
learned_steps 374: 0.0
learned_steps 375: 0.0
learned_steps 376: 0.0
learned_steps 377: 0.0
learned_steps 378: 0.0
learned_steps 379: 0.0
learned_steps 380: 0.0
learned_steps 381: 0.0
learned_steps 382: 0.0
learned_steps 383: 0.0
learned_steps 384: 0.0
learned_steps 385: 0.0
learned_steps 386: 0.0
learned_steps 387: 0.0
learned_steps 388: 0.0
learned_steps 389: 0.0
learned_steps 390: 0.0
learned_steps 391: 0.0
learned_steps 392: 0.0
...

learned_steps 694: 0.0
learned_steps 695: 0.0
learned_steps 696: 0.0
learned_steps 697: 0.10000000149011612
learned_steps 698: 0.0
learned_steps 699: 0.0
learned_steps 700: 0.0
learned_steps 701: 0.0
learned_steps 702: 0.0
learned_steps 703: 0.0
learned_steps 704: 0.10000000149011612
learned_steps 705: 0.0
learned_steps 706: 0.0
learned_steps 707: 0.0
learned_steps 708: 0.0
learned_steps 709: 0.0
learned_steps 710: 0.0
learned_steps 711: 0.0
learned_steps 712: 0.0
learned_steps 713: 0.0
learned_steps 714: 0.0
learned_steps 715: 0.0
learned_steps 716: 0.0
learned_steps 717: 0.0
learned_steps 718: 0.10000000149011612
learned_steps 719: 0.0
learned_steps 720: 0.0
learned_steps 721: 0.10000000149011612
learned_steps 722: 0.0
learned_steps 723: 0.0
learned_steps 724: 0.0
learned_steps 725: 0.0
learned_steps 726: 0.0
learned_steps 727: 0.0
learned_steps 728: 0.0
learned_steps 729: 0.0
learned_steps 730: 0.0
learned_steps 731: 0.0
learned_steps 732: 0.0
learned_steps 733: 0.0
...

learned_steps 1032: 0.0
learned_steps 1033: 0.0
learned_steps 1034: 0.0
learned_steps 1035: 0.0
learned_steps 1036: 0.0
learned_steps 1037: 0.0
learned_steps 1038: 0.0
learned_steps 1039: 0.0
learned_steps 1040: 0.0
learned_steps 1041: 0.0
learned_steps 1042: 0.0
learned_steps 1043: 0.0
learned_steps 1044: 0.0
learned_steps 1045: 0.0
learned_steps 1046: 0.0
learned_steps 1047: 0.0
learned_steps 1048: 0.0
learned_steps 1049: 0.0
learned_steps 1050: 0.0
learned_steps 1051: 0.0
learned_steps 1052: 0.0
learned_steps 1053: 0.0
learned_steps 1054: 0.0
learned_steps 1055: 0.10000000149011612
learned_steps 1056: 0.0
learned_steps 1057: 0.0
learned_steps 1058: 0.0
learned_steps 1059: 0.0
learned_steps 1060: 0.0
learned_steps 1061: 0.0
learned_steps 1062: 0.0
learned_steps 1063: 0.0
learned_steps 1064: 0.10000000149011612
learned_steps 1065: 0.0
learned_steps 1066: 0.09000000357627869
learned_steps 1067: 0.0
learned_steps 1068: 0.0
learned_steps 1069: 0.0
learned_steps 1070: 0.0
...

learned_steps 1367: 0.0
learned_steps 1368: 0.0
learned_steps 1369: 0.0
learned_steps 1370: 0.0
learned_steps 1371: 0.0
learned_steps 1372: 0.0
learned_steps 1373: 0.0
learned_steps 1374: 0.0
learned_steps 1375: 0.10000000149011612
learned_steps 1376: 0.0
learned_steps 1377: 0.0
learned_steps 1378: 0.10000000149011612
learned_steps 1379: 0.0
learned_steps 1380: 0.0
learned_steps 1381: 0.0
learned_steps 1382: 0.10000000149011612
learned_steps 1383: 0.0
learned_steps 1384: 0.0
learned_steps 1385: 0.0
learned_steps 1386: 0.0
learned_steps 1387: 0.0
learned_steps 1388: 0.0
learned_steps 1389: 0.0
learned_steps 1390: 0.0
learned_steps 1391: 0.0
learned_steps 1392: 0.0
learned_steps 1393: 0.0
learned_steps 1394: 0.0
learned_steps 1395: 0.0
learned_steps 1396: 0.0
learned_steps 1397: 0.0
learned_steps 1398: 0.0
learned_steps 1399: 0.0
learned_steps 1400: 0.0
learned_steps 1401: 0.0
learned_steps 1402: 0.0
learned_steps 1403: 0.0
learned_steps 1404: 0.0
learned_steps 1405: 0.0
...

learned_steps 1700: 0.0
learned_steps 1701: 0.0
learned_steps 1702: 0.0
learned_steps 1703: 0.0
learned_steps 1704: 0.0
learned_steps 1705: 0.0
learned_steps 1706: 0.0
learned_steps 1707: 0.0
learned_steps 1708: 0.0
learned_steps 1709: 0.0
learned_steps 1710: 0.0
learned_steps 1711: 0.0
learned_steps 1712: 0.0
learned_steps 1713: 0.0
learned_steps 1714: 0.0
learned_steps 1715: 0.10000000149011612
learned_steps 1716: 0.0
learned_steps 1717: 0.0
learned_steps 1718: 0.10000000149011612
learned_steps 1719: 0.0
learned_steps 1720: 0.0
learned_steps 1721: 0.0
learned_steps 1722: 0.0
learned_steps 1723: 0.0
learned_steps 1724: 0.0
learned_steps 1725: 0.0
learned_steps 1726: 0.0
learned_steps 1727: 0.0
learned_steps 1728: 0.0
learned_steps 1729: 0.0
learned_steps 1730: 0.0
learned_steps 1731: 0.0
learned_steps 1732: 0.0
learned_steps 1733: 0.0
learned_steps 1734: 0.0
learned_steps 1735: 0.0
learned_steps 1736: 0.0
learned_steps 1737: 0.0
learned_steps 1738: 0.0
learned_steps 1739: 0.0
...

learned_steps 2031: 0.0
learned_steps 2032: 0.0
learned_steps 2033: 0.0
learned_steps 2034: 0.0
learned_steps 2035: 0.0
learned_steps 2036: 0.0
learned_steps 2037: 0.0
learned_steps 2038: 0.0
learned_steps 2039: 0.0
learned_steps 2040: 0.0
learned_steps 2041: 0.0
learned_steps 2042: 0.0
learned_steps 2043: 0.0
learned_steps 2044: 0.0
learned_steps 2045: 0.10000000149011612
learned_steps 2046: 0.0
learned_steps 2047: 0.0
learned_steps 2048: 0.0
learned_steps 2049: 0.0
learned_steps 2050: 0.0
learned_steps 2051: 0.0
learned_steps 2052: 0.0
learned_steps 2053: 0.0
learned_steps 2054: 0.0
learned_steps 2055: 0.0
learned_steps 2056: 0.0
learned_steps 2057: 0.0
learned_steps 2058: 0.10000000149011612
learned_steps 2059: 0.0
learned_steps 2060: 0.0
learned_steps 2061: 0.0
learned_steps 2062: 0.0
learned_steps 2063: 0.0
learned_steps 2064: 0.0
learned_steps 2065: 0.0
learned_steps 2066: 0.0
learned_steps 2067: 0.0
learned_steps 2068: 0.0
learned_steps 2069: 0.0
learned_steps 2070: 0.0
...

learned_steps 2361: 0.0
learned_steps 2362: 0.0
learned_steps 2363: 0.0
learned_steps 2364: 0.0
learned_steps 2365: 0.0
learned_steps 2366: 0.0
learned_steps 2367: 0.0
learned_steps 2368: 0.0
learned_steps 2369: 0.0
learned_steps 2370: 0.0
learned_steps 2371: 0.10000000149011612
learned_steps 2372: 0.0
learned_steps 2373: 0.0
learned_steps 2374: 0.0
learned_steps 2375: 0.0
learned_steps 2376: 0.0
learned_steps 2377: 0.0
learned_steps 2378: 0.0
learned_steps 2379: 0.10000000149011612
learned_steps 2380: 0.0
learned_steps 2381: 0.0
learned_steps 2382: 0.0
learned_steps 2383: 0.0
learned_steps 2384: 0.0
learned_steps 2385: 0.0
learned_steps 2386: 0.0
learned_steps 2387: 0.0
learned_steps 2388: 0.0
learned_steps 2389: 0.0
learned_steps 2390: 0.10000000149011612
learned_steps 2391: 0.0
learned_steps 2392: 0.0
learned_steps 2393: 0.0
learned_steps 2394: 0.0
learned_steps 2395: 0.0
learned_steps 2396: 0.0
learned_steps 2397: 0.0
learned_steps 2398: 0.0
learned_steps 2399: 0.0
...

learned_steps 2689: 0.0
learned_steps 2690: 0.0
learned_steps 2691: 0.0
learned_steps 2692: 0.0
learned_steps 2693: 0.0
learned_steps 2694: 0.0
learned_steps 2695: 0.0
learned_steps 2696: 0.0
learned_steps 2697: 0.0
learned_steps 2698: 0.0
learned_steps 2699: 0.0
learned_steps 2700: 0.0
learned_steps 2701: 0.0
learned_steps 2702: 0.0
learned_steps 2703: 0.0
learned_steps 2704: 0.0
learned_steps 2705: 0.0
learned_steps 2706: 0.0
learned_steps 2707: 0.0
learned_steps 2708: 0.0
learned_steps 2709: 0.0
learned_steps 2710: 0.0
learned_steps 2711: 0.0
learned_steps 2712: 0.0
learned_steps 2713: 0.0
learned_steps 2714: 0.0
learned_steps 2715: 0.0
learned_steps 2716: 0.0
learned_steps 2717: 0.0
learned_steps 2718: 0.0
learned_steps 2719: 0.0
learned_steps 2720: 0.0
learned_steps 2721: 0.0
learned_steps 2722: 0.0
learned_steps 2723: 0.0
learned_steps 2724: 0.0
learned_steps 2725: 0.0
learned_steps 2726: 0.0
learned_steps 2727: 0.0
learned_steps 2728: 0.10000000149011612
learned_steps 2729: 0.0


learned_steps 3021: 0.0
learned_steps 3022: 0.0
learned_steps 3023: 0.0
learned_steps 3024: 0.0
learned_steps 3025: 0.0
learned_steps 3026: 0.0
learned_steps 3027: 0.10000000149011612
learned_steps 3028: 0.10000000149011612
learned_steps 3029: 0.0
learned_steps 3030: 0.10000000149011612
learned_steps 3031: 0.0
learned_steps 3032: 0.0
learned_steps 3033: 0.0
learned_steps 3034: 0.10000000149011612
learned_steps 3035: 0.0
learned_steps 3036: 0.0
learned_steps 3037: 0.0
learned_steps 3038: 0.0
learned_steps 3039: 0.0
learned_steps 3040: 0.0
learned_steps 3041: 0.10000000149011612
learned_steps 3042: 0.0
learned_steps 3043: 0.0
learned_steps 3044: 0.0
learned_steps 3045: 0.0
learned_steps 3046: 0.0
learned_steps 3047: 0.10000000149011612
learned_steps 3048: 0.0
learned_steps 3049: 0.0
learned_steps 3050: 0.0
learned_steps 3051: 0.0
learned_steps 3052: 0.0
learned_steps 3053: 0.0
learned_steps 3054: 0.0
learned_steps 3055: 0.0
learned_steps 3056: 0.0
learned_steps 3057: 0.0
...

learned_steps 3346: 0.0
learned_steps 3347: 0.0
learned_steps 3348: 0.0
learned_steps 3349: 0.0
learned_steps 3350: 0.0
learned_steps 3351: 0.0
learned_steps 3352: 0.0
learned_steps 3353: 0.0
learned_steps 3354: 0.0
learned_steps 3355: 0.0
learned_steps 3356: 0.0
learned_steps 3357: 0.0
learned_steps 3358: 0.0
learned_steps 3359: 0.0
learned_steps 3360: 0.0
learned_steps 3361: 0.0
learned_steps 3362: 0.0
learned_steps 3363: 0.0
learned_steps 3364: 0.0
learned_steps 3365: 0.0
learned_steps 3366: 0.0
learned_steps 3367: 0.0
learned_steps 3368: 0.0
learned_steps 3369: 0.0
learned_steps 3370: 0.0
learned_steps 3371: 0.0
learned_steps 3372: 0.0
learned_steps 3373: 0.0
learned_steps 3374: 0.0
learned_steps 3375: 0.0
learned_steps 3376: 0.0
learned_steps 3377: 0.0
learned_steps 3378: 0.0
learned_steps 3379: 0.0
learned_steps 3380: 0.0
learned_steps 3381: 0.0
learned_steps 3382: 0.0
learned_steps 3383: 0.0
learned_steps 3384: 0.0
learned_steps 3385: 0.0
learned_steps 3386: 0.0
...

learned_steps 3673: 0.0
learned_steps 3674: 0.0
learned_steps 3675: 0.0
learned_steps 3676: 0.0
learned_steps 3677: 0.0
learned_steps 3678: 0.0
learned_steps 3679: 0.0
learned_steps 3680: 0.0
learned_steps 3681: 0.0
learned_steps 3682: 0.0
learned_steps 3683: 0.0
learned_steps 3684: 0.0
learned_steps 3685: 0.0
learned_steps 3686: 0.0
learned_steps 3687: 0.0
learned_steps 3688: 0.10000000149011612
learned_steps 3689: 0.0
learned_steps 3690: 0.0
learned_steps 3691: 0.0
learned_steps 3692: 0.0
learned_steps 3693: 0.0
learned_steps 3694: 0.0
learned_steps 3695: 0.0
learned_steps 3696: 0.0
learned_steps 3697: 0.0
learned_steps 3698: 0.0
learned_steps 3699: 0.0
learned_steps 3700: 0.0
learned_steps 3701: 0.0
learned_steps 3702: 0.0
learned_steps 3703: 0.0
learned_steps 3704: 0.0
learned_steps 3705: 0.0
learned_steps 3706: 0.0
learned_steps 3707: 0.0
learned_steps 3708: 0.0
learned_steps 3709: 0.0
learned_steps 3710: 0.0
learned_steps 3711: 0.0
learned_steps 3712: 0.0
learned_steps 3713: 0.0


learned_steps 4003: 0.0
learned_steps 4004: 0.0
learned_steps 4005: 0.0
learned_steps 4006: 0.0
learned_steps 4007: 0.0
learned_steps 4008: 0.0
learned_steps 4009: 0.0
learned_steps 4010: 0.0
learned_steps 4011: 0.0
learned_steps 4012: 0.0
learned_steps 4013: 0.10000000149011612
learned_steps 4014: 0.0
learned_steps 4015: 0.0
learned_steps 4016: 0.0
learned_steps 4017: 0.0
learned_steps 4018: 0.0
learned_steps 4019: 0.0
learned_steps 4020: 0.0
learned_steps 4021: 0.0
learned_steps 4022: 0.0
learned_steps 4023: 0.0
learned_steps 4024: 0.0
learned_steps 4025: 0.0
learned_steps 4026: 0.0
learned_steps 4027: 0.0
learned_steps 4028: 0.0
learned_steps 4029: 0.0
learned_steps 4030: 0.10000000149011612
learned_steps 4031: 0.0
learned_steps 4032: 0.0
learned_steps 4033: 0.0
learned_steps 4034: 0.0
learned_steps 4035: 0.0
learned_steps 4036: 0.0
learned_steps 4037: 0.10000000149011612
learned_steps 4038: 0.0
learned_steps 4039: 0.0
learned_steps 4040: 0.0
learned_steps 4041: 0.0
learned_steps 40

learned_steps 4331: 0.0
learned_steps 4332: 0.0
learned_steps 4333: 0.0
learned_steps 4334: 0.0
learned_steps 4335: 0.0
learned_steps 4336: 0.0
learned_steps 4337: 0.0
learned_steps 4338: 0.0
learned_steps 4339: 0.0
learned_steps 4340: 0.0
learned_steps 4341: 0.10000000149011612
learned_steps 4342: 0.0
learned_steps 4343: 0.0
learned_steps 4344: 0.0
learned_steps 4345: 0.0
learned_steps 4346: 0.0
learned_steps 4347: 0.0
learned_steps 4348: 0.10000000149011612
learned_steps 4349: 0.0
learned_steps 4350: 0.0
learned_steps 4351: 0.0
learned_steps 4352: 0.0
learned_steps 4353: 0.0
learned_steps 4354: 0.0
learned_steps 4355: 0.10000000149011612
learned_steps 4356: 0.10000000149011612
learned_steps 4357: 0.0
learned_steps 4358: 0.0
learned_steps 4359: 0.0
learned_steps 4360: 0.0
learned_steps 4361: 0.0
learned_steps 4362: 0.0
learned_steps 4363: 0.0
learned_steps 4364: 0.0
learned_steps 4365: 0.0
learned_steps 4366: 0.0
learned_steps 4367: 0.0
learned_steps 4368: 0.0
learned_steps 4369: 0.0


learned_steps 4659: 0.0
learned_steps 4660: 0.0
learned_steps 4661: 0.0
learned_steps 4662: 0.0
learned_steps 4663: 0.0
learned_steps 4664: 0.0
learned_steps 4665: 0.0
learned_steps 4666: 0.0
learned_steps 4667: 0.0
learned_steps 4668: 0.0
learned_steps 4669: 0.0
learned_steps 4670: 0.0
learned_steps 4671: 0.0
learned_steps 4672: 0.0
learned_steps 4673: 0.0
learned_steps 4674: 0.10000000149011612
learned_steps 4675: 0.0
learned_steps 4676: 0.0
learned_steps 4677: 0.0
learned_steps 4678: 0.0
learned_steps 4679: 0.10000000149011612
learned_steps 4680: 0.0
learned_steps 4681: 0.0
learned_steps 4682: 0.0
learned_steps 4683: 0.0
learned_steps 4684: 0.10000000149011612
learned_steps 4685: 0.0
learned_steps 4686: 0.10000000149011612
learned_steps 4687: 0.0
learned_steps 4688: 0.0
learned_steps 4689: 0.0
learned_steps 4690: 0.0
learned_steps 4691: 0.0
learned_steps 4692: 0.0
learned_steps 4693: 0.0
learned_steps 4694: 0.0
learned_steps 4695: 0.0
learned_steps 4696: 0.10000000149011612
learned_

learned_steps 4989: 0.0
learned_steps 4990: 0.0
learned_steps 4991: 0.0
learned_steps 4992: 0.0
learned_steps 4993: 0.10000000149011612
learned_steps 4994: 0.0
learned_steps 4995: 0.0
learned_steps 4996: 0.0
learned_steps 4997: 0.0
learned_steps 4998: 0.0
learned_steps 4999: 0.0
learned_steps 5000: 0.0


In [129]:
agent.model

Actor_critic_model(
  (val): ModuleList(
    (0): Sequential(
      (fc_layer_1): Linear(in_features=28, out_features=128, bias=True)
      (RELU_layer_1): ReLU()
    )
    (1): Sequential(
      (fc_layer_1): Linear(in_features=128, out_features=128, bias=True)
      (RELU_layer_1): ReLU()
    )
    (2): Sequential(
      (fc_layer_2): Linear(in_features=128, out_features=128, bias=True)
      (RELU_layer_2): ReLU()
    )
    (3): Sequential(
      (0): Linear(in_features=128, out_features=1, bias=True)
    )
  )
)

In [88]:
acts = torch.zeros(2, 2, dtype=torch.float)

In [107]:
log_probs = torch.stack(log_probs).to(self.device)

NameError: name 'log_probs' is not defined

In [110]:
b = torch.stack([torch.rand(2, 3),
                      torch.rand(2, 3)])
print(b.shape)

torch.Size([2, 2, 3])


In [97]:
torch.rand(2).unsqueeze(0).shape

torch.Size([1, 2])

In [75]:
states_plus = get_extra_obs(states, actions)

In [68]:
rewards

[0.0, -0.009999999776482582]

In [59]:
[0.0 for _ in range(2)]

[0.0, 0.0]

In [42]:
import numpy as np
import math
import torch
import torch.nn as nn
import json
import torch.nn.functional as F
import torch.optim as optim
from collections import namedtuple, deque


class Actor_critic_model(nn.Module):
    def __init__(self, params_dir, input_dim, act_size, num_agents):
        super().__init__()
        self.input_dim = input_dim
        self.act_size = act_size
        self.num_agents = num_agents
        self.params = parse_params(params_dir)
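        # nn.ModuleList / nn.ParameterList register the per-agent actors and
        # standard deviations, so model.parameters() and .to(device) include them
        self.mu = nn.ModuleList([self.create_actor() for _ in range(self.num_agents)])
        self.std = nn.ParameterList([nn.Parameter(torch.ones(1, act_size)) for _ in range(self.num_agents)])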
        self.val = self.create_critic()
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        
    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, mode='fan_in')
            m.bias.data.fill_(0.01)

    def create_actor(self):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(self.input_dim, self.params['hidden_dim'])
        layer.add_module(f"fc_layer_1", fc)
        # layer.add_module(f"bn_layer_1",
                        # nn.BatchNorm1d(self.params['hidden_dim']))
        # layer.add_module(f"RELU_layer_1", nn.LeakyReLU())
        layer.add_module(f"RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        # module_list.apply(self._init_weights)
        self.add_hidden_layer(module_list, self.params['actor_h_num'],
                         self.params['hidden_dim'], self.params['hidden_dim'])
        module_list.append(nn.Sequential(nn.Linear(self.params['hidden_dim'],
                                          self.act_size)))
        # module_list.apply(self._init_weights)
        return module_list
    
    def create_critic(self):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(self.input_dim + self.act_size * 2, self.params['hidden_dim'])
        layer.add_module(f"fc_layer_1", fc)
        # layer.add_module(f"bn_layer_1",
                        # nn.BatchNorm1d(self.params['hidden_dim']))
        # layer.add_module(f"RELU_layer_1", nn.LeakyReLU())
        layer.add_module(f"RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        self.add_hidden_layer(module_list, self.params['critic_h_num'],
                         self.params['hidden_dim'], self.params['hidden_dim'])
        module_list.append(nn.Sequential(nn.Linear(self.params['hidden_dim'], 1)))
        # module_list.apply(self._init_weights)
        return module_list
    
    def add_hidden_layer(self, module_list, num_hidden_layer,
                         input_dim, output_dim):
        if num_hidden_layer == 0:
            return
        for i in range(1, num_hidden_layer+1):
            layer = nn.Sequential()
            fc = nn.Linear(input_dim, output_dim)
            layer.add_module(f"fc_layer_{i}", fc)
            # layer.add_module(f"bn_layer_{i}",
                          #    nn.BatchNorm1d(output_dim))
            # layer.add_module(f"RELU_layer_{i}", nn.LeakyReLU())
            layer.add_module(f"RELU_layer_{i}", nn.ReLU())
            module_list.append(layer)
            
    def forward(self, states, actor=True, train=True):
        '''
            If actor is True, return the sampled actions (with their log
            probabilities when train is True) as FloatTensors.
            If actor is False (critic), return the state value FloatTensor.
        '''
        x_ = torch.FloatTensor(states).to(self.device)
        if actor:
            # forward in actor path
            mus = torch.zeros(self.num_agents, self.act_size)
            dists = []
            acts = torch.zeros(self.num_agents, self.act_size)
            lps = torch.zeros(self.num_agents, self.act_size)
            for i in range(self.num_agents):
                
                mu_ = x_[i]
                for m in self.mu[i]:
                    mu_ = m(mu_)
                mus[i] = (mu_)
                dists.append(torch.distributions.Normal(mus[i], self.std[i]))
                act_ = dists[i].sample()
                # keep the clamped action of every agent
                acts[i] = torch.clamp(act_, -1, 1)
                if train:
                    # log probabilities are only needed in the training phase
                    lps[i] = dists[i].log_prob(act_)
            # return after the loop so that all agents are processed
            if train:
                return acts, lps
            # return only actions in the execution phase
            return acts
        # forward in value path
        for v in self.val:
            x_ = v(x_)
        return x_

def parse_params(params_dir):
    with open(params_dir) as fp:
        params = json.load(fp)
    return params
        
params_dir = f"./params.txt"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_a = Actor_critic_model(params_dir, len(states[0]), 2, 2)


In [18]:
states

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -7.43639946, -1.5       , -0.        ,  0.        ,
         6.69487906,  5.96076012, -0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -7.73019552, -1.5       ,  0.        ,  0.        ,
        -6.69487906,  5.96076012,  0.        ,  0.        ]])

In [28]:
def get_extra_obs(states, actions):
    '''
        Return a list of length num_agents; entry i holds agent i's state,
        agent i's action, and the other agent's action.
        states: array of states, one row per agent
        actions: array of actions, one row per agent
    '''
    extra_obs = []
    # print(f"actions : {actions}")
    for i in range(states.shape[0]):
        list_ = []
        # states
        list_.extend(states[i])
        # agent's action
        list_.extend(actions[i])
        # other agent's actions
        list_.extend(actions[np.arange(len(actions))!= i][0])
        extra_obs.append(list_)
    return extra_obs
extra_obs =  get_extra_obs(states, actions)
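
As a rough illustration of what `get_extra_obs` above builds, here is a pure-Python sketch on made-up toy values. The helper name `build_critic_inputs` and the numbers are hypothetical and not part of the notebook; the sketch only assumes the row layout used above (own state, then own action, then the other agent's action).

```python
def build_critic_inputs(states, actions):
    """Return one row per agent: own state + own action + the other agents' actions."""
    rows = []
    for i, (state, action) in enumerate(zip(states, actions)):
        row = list(state) + list(action)
        # append the actions of every agent except agent i
        for j, other_action in enumerate(actions):
            if j != i:
                row.extend(other_action)
        rows.append(row)
    return rows


# toy data: 2 agents, 3-dimensional states, 2-dimensional actions (hypothetical values)
toy_states = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
toy_actions = [[1.0, -1.0], [0.5, 0.25]]

critic_inputs = build_critic_inputs(toy_states, toy_actions)
for row in critic_inputs:
    print(len(row), row)
```

With the Tennis observations (24 values per agent) and 2-dimensional actions, each row would have 24 + 2 * 2 = 28 entries, which agrees with `in_features=28` in the first critic layer printed by `agent.model`.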

In [47]:
torch.FloatTensor(torch.rand(28,10)).unsqueeze(1).shape
                

torch.Size([28, 1])

In [51]:
c = model_a(torch.FloatTensor(torch.rand(10, 28)),actor=False)

In [52]:
c.shape

torch.Size([10, 1])

In [20]:
a, b = model_a(states)

In [20]:
import numpy as np
import math
import torch
import torch.nn as nn
import json
import torch.nn.functional as F
import torch.optim as optim
from collections import namedtuple, deque
import random  # used by ReplayBuffer
import matplotlib.pyplot as plt  # used by plot_scores


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


class Actor_critic_model(nn.Module):
    def __init__(self, params_dir, input_dim, act_size, num_agents):
        super().__init__()
        self.input_dim = input_dim
        self.act_size = act_size
        self.num_agents = num_agents
        self.params = parse_params(params_dir)
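        # nn.ModuleList / nn.ParameterList register the per-agent actors and
        # standard deviations, so model.parameters() and .to(device) include them
        self.mu = nn.ModuleList([self.create_actor() for _ in range(self.num_agents)])
        self.std = nn.ParameterList([nn.Parameter(torch.ones(1, act_size)) for _ in range(self.num_agents)])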
        self.val = self.create_critic()
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        
    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, mode='fan_in')
            m.bias.data.fill_(0.01)
            
    def get_extra_obs(self, states, actions):
        '''
            Return a list of length num_agents; entry i holds the other
            agent's state and action.
            states: array of states, one row per agent
            actions: array of actions, one row per agent
        '''
        extra_obs = []
        # print(f"actions : {actions}")
        for i in range(states.shape[0]):
            list_ = []
            list_.extend(states[np.arange(len(states))!= i][0])
            list_.extend(actions[np.arange(len(actions))!= i][0])
            extra_obs.append(list_)
        return extra_obs

    def create_actor(self):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(self.input_dim, self.params['hidden_dim'])
        layer.add_module(f"fc_layer_1", fc)
        # layer.add_module(f"bn_layer_1",
                        # nn.BatchNorm1d(self.params['hidden_dim']))
        # layer.add_module(f"RELU_layer_1", nn.LeakyReLU())
        layer.add_module(f"RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        # module_list.apply(self._init_weights)
        self.add_hidden_layer(module_list, self.params['actor_h_num'],
                         self.params['hidden_dim'], self.params['hidden_dim'])
        module_list.append(nn.Sequential(nn.Linear(self.params['hidden_dim'],
                                          self.act_size)))
        # module_list.apply(self._init_weights)
        return module_list
    
    def create_critic(self):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(self.input_dim*2 + self.act_size, self.params['hidden_dim'])
        layer.add_module(f"fc_layer_1", fc)
        # layer.add_module(f"bn_layer_1",
                        # nn.BatchNorm1d(self.params['hidden_dim']))
        # layer.add_module(f"RELU_layer_1", nn.LeakyReLU())
        layer.add_module(f"RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        self.add_hidden_layer(module_list, self.params['critic_h_num'],
                         self.params['hidden_dim'], self.params['hidden_dim'])
        module_list.append(nn.Sequential(nn.Linear(self.params['hidden_dim'], 1)))
        # module_list.apply(self._init_weights)
        return module_list
    
    def add_hidden_layer(self, module_list, num_hidden_layer,
                         input_dim, output_dim):
        if num_hidden_layer == 0:
            return
        for i in range(1, num_hidden_layer+1):
            layer = nn.Sequential()
            fc = nn.Linear(input_dim, output_dim)
            layer.add_module(f"fc_layer_{i}", fc)
            # layer.add_module(f"bn_layer_{i}",
                          #    nn.BatchNorm1d(output_dim))
            # layer.add_module(f"RELU_layer_{i}", nn.LeakyReLU())
            layer.add_module(f"RELU_layer_{i}", nn.ReLU())
            module_list.append(layer)
            
    def forward(self, states):
        x_ = states.copy()
        x_ = torch.FloatTensor(x_).to(self.device)

        mus = torch.zeros(self.num_agents, self.act_size)
        dists = []
        acts = torch.zeros(self.num_agents, self.act_size)
        lps = torch.zeros(self.num_agents, self.act_size)
        actions = np.zeros((self.num_agents, self.act_size))
        for i in range(self.num_agents):
            mu_ = x_[i]
            for m in self.mu[i]:
                mu_ = m(mu_)
            mus[i] = (mu_)
            dists.append(torch.distributions.Normal(mus[i], self.std[i]))
            act_ = dists[i].sample()
            lps[i] = dists[i].log_prob(act_)
            actions[i]= torch.clamp(act_, -1, 1).numpy()
            acts[i] = torch.clamp(act_, -1, 1)
        
        # print(f"extra_obs: {extra_obs}")
        extra_obs = self.get_extra_obs(states, actions)
        combined_states_ = torch.FloatTensor(extra_obs).detach().requires_grad_()
        # print(f"combined_states: {combined_states_}")
        val_x = torch.cat([x_, combined_states_], dim=1)
        # print(f"val_x shape: {val_x.shape}")
        for v in self.val:
            val_x = v(val_x)
        return acts, lps, val_x

def parse_params(params_dir):
    with open(params_dir) as fp:
        params = json.load(fp)
    return params


class Agent():
    def __init__(self, device, num_agents, params_dir, state_size, action_size):
        self.model = Actor_critic_model(params_dir, state_size, action_size, num_agents).to(device)
        self.device = device
        self.num_agents = num_agents
        self.params = self.model.params
        self.optimizer = optim.Adam(self.model.parameters(),
                                    # lr=self.params['lr'])
                                    lr=0.0001)

    def __call__(self, states):
        # mu, std, val, etp = self.model(states)
        actions, log_prob, val = self.model(states)
        return actions, log_prob, val

    def step(self, memories):
        '''
        second edition
            experiences:
                list with n_steps_taken * [actions, rewards, log_probs,
                                           not_dones, state_values]:
                    actions (tensor: num agents * num actions)
                    rewards (list: size = num agents)
                    log_probs (tensor: num agents * num actions)
                    not_dones (np array: size = num agents)
                    state_values (list: size = num agents)
        '''
        loss = 0.0
        for idx in range(self.num_agents):
            actions, rewards, log_probs, not_dones, state_values = memories[idx].spit()
            # print(f"state_values : {state_values}")
            rewards = torch.FloatTensor(rewards).view(-1, 1)
            #print(f"len(rewards[0]) - 1 : {len(rewards) - 1}")

            processed_experience = [None] * (len(rewards) - 1)
            #print(f"processed_experience : {processed_experience}")
            # return_  = state_values[-1].detach()   
            return_  = state_values[-1].detach()
            not_dones = torch.FloatTensor(not_dones).to(self.device).unsqueeze(1)
            # print(rewards)
            for i in reversed(range(len(rewards)-1)):
                # print(f"pnot_dones : {not_dones}")
                # print(f"not_dones[i+1 : {not_dones[i+1]}")
                not_done_ = not_dones[i+1]
                reward_ = rewards[i]
                return_ = reward_ + self.params['gamma'] * not_done_ * return_
                next_value_ = state_values[i+1]
                advantage_  = reward_ + self.params['gamma'] * not_done_ * next_value_.detach() - state_values[i].detach()
                # print(f"log_probs[i].shape : {log_probs[i].shape}")
                # print(f"advantage_.shape : {advantage_.shape}")
                # print(f"state_values[i].shape : {state_values[i].shape}")
                # print(f"return_.shape : {return_.shape}")
                processed_experience[i] = [log_probs[i].unsqueeze(0), advantage_, state_values[i], return_]
            # print(f"processed_experience : {processed_experience}")
            log_probs, advantages, values, returns = map(lambda x: torch.cat(x, dim=0), zip(*processed_experience))
            policy_loss = (-log_probs * (advantages.unsqueeze(1)))
            value_loss = (0.5 * (returns - values).pow(2))
            self.optimizer.zero_grad()
            loss += ((policy_loss + value_loss.unsqueeze(1)).mean())
        if torch.isnan(loss).any():
            print('Nan in loss function')
            pass
        loss.backward()
        nn.utils.clip_grad_norm_(self.model.parameters(), self.model.params['grad_clip'])
        self.optimizer.step()

        


    
class Experience():
    def __init__(self):
        self.actions = []
        self.rewards = []
        #self.extra_into = []
        self.log_probs = []
        self.not_dones = []
        self.state_values = []
        # self.etp = []

    def add(self, actions, rewards, log_probs, not_dones, state_values):
        self.actions.append(actions)
        self.rewards.append(rewards)
        #self.extra_into.append(extra_into)
        self.log_probs.append(log_probs)
        self.not_dones.append(not_dones)
        self.state_values.append(state_values)
        # self.etp.append(etp)

    def spit(self):
        return (self.actions, self.rewards, self.log_probs, self.not_dones,
                self.state_values)
    

class ReplayBuffer():
    def __init__(self, buffer_size, action_size, batch_size, seed, device):
        """Initialize a ReplayBuffer object.
        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): random seed
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        # create a named tuple object to store training samples
        self.experience = namedtuple("Experience",
                                     field_names=['actions', "rewards", "log_probs",
                                                  "not_dones", "state_values"])
        self.batch_size = batch_size
        self.device = device
        self.seed = random.seed(seed)

    def add(self, actions, rewards, log_probs, not_dones, state_values):
        '''
            create a new namedtuple for each experience and append it to memory
            All inputs are in numpy format
        '''
        self.memory.append(self.experience(
                actions, rewards, log_probs, not_dones, state_values))

    def sample(self):
        sampled_exp = random.sample(self.memory, k=self.batch_size)
        actions = torch.from_numpy(
            np.vstack([e.actions for e in sampled_exp if e is not None])
            ).float().to(self.device)
        rewards = torch.from_numpy(
            np.vstack([e.rewards for e in sampled_exp if e is not None])
            ).float().to(self.device)
        # log probabilities and state values are real-valued, so keep them as floats
        log_probs = torch.from_numpy(
            np.vstack([e.log_probs for e in sampled_exp if e is not None])
            ).float().to(self.device)
        not_dones = torch.from_numpy(
            np.vstack([e.not_dones for e in sampled_exp if e is not None])
            ).float().to(self.device)
        state_values = torch.from_numpy(
            np.vstack([e.state_values for e in sampled_exp if e is not None])
            ).float().to(self.device)
        return (actions, rewards, log_probs, not_dones, state_values)

    def __len__(self):
        return len(self.memory)
    
params_dir = f"./params.txt"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
agent = Agent(device, num_agents, params_dir, state_size, action_size)

def plot_scores(scores):
    fig, ax = plt.subplots(1, figsize=(8, 8))
    plt.plot(np.arange(len(scores)), scores)
    plt.ylabel('Score')
    plt.xlabel('Episode #')
    plt.show()

    

    
scores_window = deque(maxlen=100)
batch_size = 100
seed = 99
# version 3

for i in range(1):
    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations
    scores = np.zeros(num_agents)
    memories = [ReplayBuffer(buffer_size, action_size, batch_size, seed, device) for _ in range(num_agents)]
    done = [False] * num_agents
    steps = 0
    # while not np.any(done):
    while steps < 99:
        actions_, log_prob_, state_values_ = agent(states)
        env_info = env.step(actions_.detach().cpu().numpy())[brain_name]
        next_states_ = env_info.vector_observations
        rewards_ = env_info.rewards
        done = env_info.local_done
        not_done_ = (1 - np.array(done))
        for idx in range(num_agents):
            memories[idx].add(actions_[idx], rewards_[idx], log_prob_[idx], not_done_[idx], state_values_[idx])
        steps += 1
        if np.any(done):
            memories = Experience()
            break
        if steps % 5 == 0:
            agent.step(memories)
            memories = [Experience() for _ in range(num_agents)]
        states = next_states_
        scores += rewards_
        # print(scores)
    print(f"Episode {i}: {np.max(scores)}")
    scores_window.append(np.max(scores))
    if (len(scores_window)) == 100 and ((sum(scores_window) / len(scores_window)) > 0.5):
        torch.save(agent.model.state_dict(), agent.params['working_dir'])
        print(f"Environment solved in episode {i}!")
        print(f"Score: {scores_window}")
        break

TypeError: data type not understood
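
The backward loop in `Agent.step` above can be sketched in plain Python to make the recursion easier to follow. This is a minimal sketch under assumptions, not the notebook's implementation: the function name `n_step_returns_and_advantages` and the toy numbers are hypothetical, and scalars stand in for the tensors used in the notebook. The last stored step only supplies the bootstrap value, so a rollout of length T yields T - 1 returns and advantages.

```python
def n_step_returns_and_advantages(rewards, values, not_dones, gamma):
    """Walk a short rollout backwards.

    rewards, values, not_dones: per-step lists of equal length T.
    Returns two lists of length T - 1 (discounted returns, one-step advantages).
    """
    num_steps = len(rewards) - 1
    returns = [0.0] * num_steps
    advantages = [0.0] * num_steps
    running_return = values[-1]  # bootstrap from the critic's last estimate
    for t in reversed(range(num_steps)):
        mask = not_dones[t + 1]  # 0.0 stops the bootstrap at an episode end
        running_return = rewards[t] + gamma * mask * running_return
        td_target = rewards[t] + gamma * mask * values[t + 1]
        returns[t] = running_return
        advantages[t] = td_target - values[t]
    return returns, advantages


# toy rollout of three steps (hypothetical values)
returns, advantages = n_step_returns_and_advantages(
    rewards=[0.0, 1.0, 0.0],
    values=[0.5, 0.5, 2.0],
    not_dones=[1.0, 1.0, 1.0],
    gamma=0.5,
)
print(returns)     # [1.0, 2.0]
print(advantages)  # [-0.25, 1.5]
```

In the notebook the returns are the regression targets for the value loss, while the advantages weight the log probabilities in the policy loss.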

In [31]:
brain_name

'TennisBrain'

In [None]:
for env_ in envss:
    print(env_.brain_names[0])
    env_info = env_.reset(train_mode=True)[env_.brain_names[0]]
    states_ = env_info.vector_observations
    print(states)

TennisBrain


In [5]:
envss = [UnityEnvironment(file_name="/data/Tennis_Linux_NoVis/Tennis") for _ in range(num_agents)]

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 
INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3


In [6]:
agent.model

NameError: name 'agent' is not defined

In [110]:
params_dir = f"./params.txt"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
agent = Agent(device, num_agents, params_dir, state_size, action_size)

In [12]:
print(state_values_)

tensor([[ 1.0401],
        [ 0.2980]])


In [14]:
experiences = memories.spit()

In [39]:
experiences[1]

[[0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, -0.009999999776482582]]

In [111]:
def plot_scores(scores):
    fig, ax = plt.subplots(1, figsize=(8, 8))
    plt.plot(np.arange(len(scores)), scores)
    plt.ylabel('Score')
    plt.xlabel('Episode #')
    plt.show()

    

    
scores_window = deque(maxlen=100)

for i in range(1):
    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations
    scores = np.zeros(num_agents)
    memories = [Experience() for _ in range(num_agents)]
    done = [False] * num_agents
    steps = 0
    # while not np.any(done):
    while True:
        actions_, log_prob_, state_values_ = agent(states)
        env_info = env.step(actions_.detach().cpu().numpy())[brain_name]
        next_states_ = env_info.vector_observations
        rewards_ = env_info.rewards
        done = env_info.local_done
        not_done_ = (1 - np.array(done))
        # use idx so the episode counter i is not overwritten
        for idx in range(num_agents):
            memories[idx].add(actions_[idx], rewards_[idx], log_prob_[idx], not_done_[idx], state_values_[idx])
        steps += 1
        if np.any(done):
            memories = Experience()
            break
        if steps % 5 == 0:
            agent.step(memories)
            memories = [Experience() for _ in range(num_agents)]
        states = next_states_
        scores += rewards_
        # print(scores)
    print(f"Episode {i}: {np.max(scores)}")
    scores_window.append(np.max(scores))
    if (len(scores_window)) == 100 and ((sum(scores_window) / len(scores_window)) > 0.5):
        torch.save(agent.model.state_dict(), agent.params['working_dir'])
        print(f"Environment solved in episode {i}!")
        print(f"Score: {scores_window}")
        break

RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)

In [105]:
def plot_scores(scores):
    fig, ax = plt.subplots(1, figsize=(8, 8))
    plt.plot(np.arange(len(scores)), scores)
    plt.ylabel('Score')
    plt.xlabel('Episode #')
    plt.show()

    

    
scores_window = deque(maxlen=100)

for i in range(1):
    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations
    scores = np.zeros(num_agents)
    memories = [Experience() for _ in range(num_agents)]
    done = [False] * num_agents
    steps = 0
    # while not np.any(done):
    while True:
        actions_, log_prob_, state_values_ = agent(states)
        env_info = env.step(actions_.detach().cpu().numpy())[brain_name]
        next_states_ = env_info.vector_observations
        rewards_ = env_info.rewards
        done = env_info.local_done
        not_done_ = (1 - np.array(done))
        # use idx so the episode counter i is not overwritten
        for idx in range(num_agents):
            memories[idx].add(actions_[idx], rewards_[idx], log_prob_[idx], not_done_[idx], state_values_[idx])
        steps += 1
        if np.any(done):
            memories = Experience()
            break
        if steps % 5 == 0:
            for idx in range(num_agents):
                experiences_ = memories[idx].spit()
                agent.step(experiences_, idx)
            # start a fresh per-agent rollout buffer after each update
            memories = [Experience() for _ in range(num_agents)]
        states = next_states_
        scores += rewards_
        # print(scores)
    print(f"Episode {i}: {np.max(scores)}")
    scores_window.append(np.max(scores))
    if (len(scores_window)) == 100 and ((sum(scores_window) / len(scores_window)) > 0.5):
        torch.save(agent.model.state_dict(), agent.params['working_dir'])
        print(f"Environment solved in episode {i}!")
        print(f"Score: {scores_window}")
        break

combined_states: tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000, -7.7612, -1.5000,  0.0000,  0.0000, -7.0182,
          4.1753,  0.0000,  0.0000, -0.7648,  1.0000],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000, -7.7354, -1.5000, -0.0000,  0.0000,  7.0182,
          4.1753, -0.0000,  0.0000, -1.0000,  0.5948]])
combined_states: tensor([[  0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,
           0.0000,   0.0000,  -7.0355,  -1.5000,   0.0000,   0.0000,
          -7.0182,   4.0537,   0.0000,   0.0000,  -9.4327,  -1.5589,
         -23.9720,  -0.9810,  -7.0182,   3.3866, -23.9720,  -0.9810,
          -0.7648,   1.0000],
        [  0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,
           0.0000,   0.0000,  -7.3583,  -1.5000,

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

In [212]:
rewards_ 

array([[ 0.],
       [ 0.]])

In [210]:
scores2 = np.zeros(num_agents)

In [213]:
scores2

array([ 0.,  0.])

In [10]:
import torch
a = torch.rand(2,37)

In [12]:
print(a)
print(a.shape)
# print(a.view(-1,5))

tensor([[ 0.1089,  0.9336,  0.3417,  0.7516,  0.8893,  0.0894,  0.7528,
          0.2575,  0.6720,  0.6967,  0.3517,  0.5943,  0.7501,  0.1009,
          0.9530,  0.5058,  0.3703,  0.1080,  0.2512,  0.2847,  0.7189,
          0.5767,  0.6629,  0.9613,  0.2417,  0.6889,  0.2270,  0.0936,
          0.9169,  0.2505,  0.8809,  0.3256,  0.9014,  0.5382,  0.3768,
          0.4793,  0.1952],
        [ 0.7707,  0.2375,  0.8081,  0.4530,  0.2677,  0.4489,  0.9745,
          0.5429,  0.0577,  0.6078,  0.6531,  0.8821,  0.8981,  0.0596,
          0.4699,  0.6121,  0.1488,  0.7653,  0.5284,  0.0629,  0.9455,
          0.6540,  0.9659,  0.5065,  0.9747,  0.0727,  0.9179,  0.9386,
          0.5085,  0.9132,  0.0913,  0.4783,  0.9041,  0.7388,  0.6664,
          0.7652,  0.9673]])
torch.Size([2, 37])


In [13]:
states_ = np.reshape(env_info.vector_observations, (1,num_agents*state_size))
print(states_)


[[ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.         -7.98782539 -1.5        -0.          0.
   6.14030886  5.99607611 -0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
  -7.28886175 -1.5         0.          0.         -6.14030886  5.99607611
   0.          0.        ]]


In [11]:
a = np.reshape(env_info.vector_observations, (1,num_agents*state_size))
print(a.shape)

(1, 48)


In [12]:
# `a` has shape (1, 48), so this loop runs once and yields the whole flattened row.
for i, j in enumerate(a):
    print(f"i : {i} j :{j}")

i : 0 j :[ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.         -7.98782539 -1.5        -0.          0.
  6.14030886  5.99607611 -0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
 -7.28886175 -1.5         0.          0.         -6.14030886  5.99607611
  0.          0.        ]
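
The cells above flatten the per-agent observations into a single row, which is why the loop runs only once. As a small sketch, assuming `num_agents = 2` and `state_size = 24` (inferred from the printed shape `(1, 48)`) and using a made-up array in place of `env_info.vector_observations`, the flattened and per-agent views compare as follows.

```python
import numpy as np

num_agents, state_size = 2, 24  # assumed: 2 * 24 = 48 matches the printed shape

# Made-up stand-in for env_info.vector_observations (one row per agent).
observations = np.arange(num_agents * state_size, dtype=float).reshape(
    num_agents, state_size
)

# Flatten both agents' observations into one joint row.
flat = np.reshape(observations, (1, num_agents * state_size))
print(flat.shape)

# Iterating over the flattened array yields a single row of length 48 ...
rows_flat = [row.shape for row in flat]
# ... while iterating over the original array yields one row per agent.
rows_per_agent = [row.shape for row in observations]
print(rows_flat, rows_per_agent)

# The per-agent layout can be recovered by reshaping back.
restored = flat.reshape(num_agents, state_size)
print(np.array_equal(restored, observations))
```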


In [19]:
print(np.reshape(env_info.vector_observations, (1,num_agents*state_size)))

[[ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.         -7.98782539 -1.5        -0.          0.
  -7.11741829  5.96076012 -0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
  -7.28886175 -1.5         0.          0.          7.11741829  5.96076012
   0.          0.        ]]


In [15]:
next_states = env_info.vector_observations

When finished, you can close the environment.

In [None]:
env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment! A few **important notes**:
- When training, set `train_mode=True` so that the line resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
- To structure your work, you can work directly in this Jupyter notebook or start over with a new file. To see the list of files in the workspace, click **_Jupyter_** in the top left corner of the notebook.
- In this coding environment, you cannot watch the agents while they train. However, **_after training the agents_**, you can download the saved model weights and watch the agents on your own machine.
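
One possible way to produce downloadable weights is to save each network's `state_dict` with `torch.save`. The sketch below is an assumption-laden example rather than the project's prescribed method: `build_actor`, its hidden size of 64, and the file name `checkpoint_actor.pth` are hypothetical. The 24 inputs and 2 outputs mirror the per-agent observation (8 values, 3 stacked) and action sizes reported when the environment started.

```python
import os
import tempfile

import torch
import torch.nn as nn

def build_actor():
    # Hypothetical actor: 24 observation values in, 2 continuous actions out.
    return nn.Sequential(
        nn.Linear(24, 64),
        nn.ReLU(),
        nn.Linear(64, 2),
        nn.Tanh(),  # keeps each action component in [-1, 1]
    )

actor = build_actor()

# Save only the parameters, not the whole module object.
checkpoint_path = os.path.join(tempfile.gettempdir(), "checkpoint_actor.pth")
torch.save(actor.state_dict(), checkpoint_path)

# Later (for example on your own machine), rebuild the same architecture
# and load the saved parameters into it.
restored = build_actor()
restored.load_state_dict(torch.load(checkpoint_path))
restored.eval()

# Both networks should now map the same observation to the same action.
sample_state = torch.zeros(1, 24)
with torch.no_grad():
    same_output = torch.allclose(actor(sample_state), restored(sample_state))
print(same_output)
```

Saving the `state_dict` rather than the full model keeps the checkpoint independent of the notebook's class definitions, as long as the same architecture is rebuilt before loading.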