# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
import urllib

In [2]:
import numpy as np
import random
import time

from collections import deque
from unityagents import UnityEnvironment
#from tensorboardX import SummaryWriter

In [3]:
from tensorboardX import SummaryWriter

In [4]:
writer = SummaryWriter('tensorboard_logs/run_{}'.format(time.ctime().replace(" ", "_").replace(":", "_")))

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```
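
If you switch between machines, the choice among the paths listed above can be sketched as a small helper. This is only an illustration: `tennis_file_name` is a hypothetical function, not part of the project or of `unityagents`, and the `"path/to"` prefix is the same placeholder used in the list.

```python
def tennis_file_name(system, machine, headless=False):
    """Hypothetical helper: pick one of the Tennis build paths listed above."""
    is_64bit = machine.endswith("64")  # e.g. "x86_64", "AMD64", "arm64"
    if system == "Darwin":
        return "path/to/Tennis.app"
    if system == "Windows":
        folder = "Tennis_Windows_x86_64" if is_64bit else "Tennis_Windows_x86"
        return "path/to/{}/Tennis.exe".format(folder)
    if system == "Linux":
        folder = "Tennis_Linux_NoVis" if headless else "Tennis_Linux"
        suffix = "x86_64" if is_64bit else "x86"
        return "path/to/{}/Tennis.{}".format(folder, suffix)
    raise ValueError("No Tennis build listed for platform: {}".format(system))


# Explicit arguments keep the example independent of the machine it runs on.
print(tennis_file_name("Linux", "x86_64", headless=True))
```

In a notebook, the two arguments could come from `platform.system()` and `platform.machine()` in the standard library.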

In [5]:
env = UnityEnvironment(file_name="./Tennis_Linux/Tennis.x86_64")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [6]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

Each observation consists of 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations, so the state vector each agent receives has length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping. 

Run the code cell below to print some information about the environment.

In [18]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.68944931 -1.5
 -0.          0.         -7.59082413  5.96076012 -0.          0.        ]
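
The state length of 24 printed above matches the brain summary (8 variables per observation, 3 stacked observations). A minimal sketch of splitting such a vector back into frames follows. The frame order is an assumption, inferred only from the zeros at the start of the printed state; it is not documented in this notebook.

```python
import numpy as np

# First agent's state, copied from the output above.
state = np.array([
    0., 0., 0., 0., 0., 0., 0., 0.,
    0., 0., 0., 0., 0., 0., 0., 0.,
    -6.68944931, -1.5, -0., 0., -7.59082413, 5.96076012, -0., 0.,
])

frames = state.reshape(3, 8)  # assumed layout: 3 stacked frames x 8 variables
latest = frames[-1]           # assumed: the most recent frame is stored last
print(frames.shape)
print(latest)
```

Right after a reset only the last frame carries values, which is why the first 16 entries are zero.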


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agents and receive feedback from the environment.

Once this cell is executed, you can watch how the agents perform when they select actions at random at each time step.  A window should pop up that allows you to observe the agents.

Of course, as part of the project, you'll have to change the code so that the agents are able to use their experiences to gradually choose better actions when interacting with the environment!

In [8]:
for i in range(1, 6):                                      # play game for 5 episodes
    env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    while True:
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break
    print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))

Score (max over agents) from episode 1: 0.0
Score (max over agents) from episode 2: 0.09000000171363354
Score (max over agents) from episode 3: 0.0
Score (max over agents) from episode 4: 0.0
Score (max over agents) from episode 5: 0.0


When finished, you can close the environment.

### 4. It's Your Turn!

Now it's your turn to train your own agents to solve the environment!  When training, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [9]:
from ou_noise import OUNoise
from utils import hard_update, soft_update
from replay_buffer import ReplayBuffer

import torch
import torch.nn as nn
import torch.nn.functional as F
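
The modules `ou_noise`, `utils` and `replay_buffer` are local files that are not shown in this notebook. As a rough guide, `hard_update` and `soft_update` are commonly written as below. This is a sketch under the assumption that they take `(target, source)` and `(target, source, tau)`, which is how they are called later on; the actual `utils.py` may differ.

```python
import torch
import torch.nn as nn


def hard_update(target, source):
    """Copy every parameter of `source` into `target`."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.data.copy_(s_param.data)


def soft_update(target, source, tau):
    """Polyak averaging: target <- tau * source + (1 - tau) * target."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.data.copy_(tau * s_param.data + (1.0 - tau) * t_param.data)


source, target = nn.Linear(4, 2), nn.Linear(4, 2)
hard_update(target, source)            # target now equals source
with torch.no_grad():
    source.weight.add_(1.0)            # move the source weights away by 1.0
soft_update(target, source, tau=1e-3)  # target closes 0.1% of that gap
gap = (source.weight - target.weight).abs().max().item()
print(gap)                             # close to 0.999
```

With a small `tau` the target networks trail the trained networks slowly, which is the usual reason for using a soft update in DDPG-style methods.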

In [10]:
class Actor(nn.Module):
    """Actor (Policy) Model."""
    def __init__(self, 
                 in_actor,
                 hidden_in_actor=256, 
                 hidden_out_actor=256,
                 out_actor=2,
                 leak=0.01, 
                 seed=42):
        """ Initialize parameters and build model.
        Params
        ======
            in_actor (int): Size of state tensor
            hidden_in_actor (int): Number of hidden units in FC layer 1
            hidden_out_actor (int): Number of hidden units in FC layer 2           
            out_actor (int): Size of action tensor
            seed (int): Random seed                    
            leak (float): the leak rate for leaky ReLU, i.e. the alpha in (x < 0) * alpha * x + (x >= 0) * x
        """
        super(Actor, self).__init__()
        self.leak = leak
        self.seed = torch.manual_seed(seed)

        self.fc1 = nn.Linear(in_actor, hidden_in_actor)
        self.fc2 = nn.Linear(hidden_in_actor, hidden_out_actor)
        self.fc3 = nn.Linear(hidden_out_actor, out_actor)

        self.bn = nn.BatchNorm1d(in_actor)
        self.reset_parameters()

    def reset_parameters(self):
        """Initialize FC layers followed by leaky ReLU using Kaiming He's (2015) approach.
        Source: https://arxiv.org/pdf/1502.01852v1.pdf
        For more info see here:
            https://www.jefkine.com/deep/2016/08/08/initialization-of-deep-networks-case-of-rectifiers/
        """
        torch.nn.init.kaiming_normal_(
            self.fc1.weight.data,
            a=self.leak, 
            nonlinearity='leaky_relu',
            mode='fan_in')
        torch.nn.init.kaiming_normal_(
            self.fc2.weight.data,
            a=self.leak,
            nonlinearity='leaky_relu',
            mode='fan_in')
        torch.nn.init.uniform_(self.fc3.weight.data,
                               -3e-3, 3e-3)

    def forward(self, state):
        """Build an actor (policy) network that maps states -> actions."""
#         state = self.bn(state)
        x = F.leaky_relu(self.fc1(state), negative_slope=self.leak)
        x = F.leaky_relu(self.fc2(x), negative_slope=self.leak)
        x = torch.tanh(self.fc3(x))
        return x
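
To make the initialization in `reset_parameters` concrete: for `mode='fan_in'` and `nonlinearity='leaky_relu'`, `kaiming_normal_` draws weights from a zero-mean normal distribution with standard deviation `sqrt(2 / (1 + a^2)) / sqrt(fan_in)`. The sketch below checks this numerically on a standalone layer; the layer sizes are picked for illustration and the seed is arbitrary.

```python
import math
import torch
import torch.nn as nn

leak, fan_in, fan_out = 0.01, 24, 256
torch.manual_seed(0)
layer = nn.Linear(fan_in, fan_out)
nn.init.kaiming_normal_(layer.weight, a=leak,
                        nonlinearity='leaky_relu', mode='fan_in')

gain = math.sqrt(2.0 / (1.0 + leak ** 2))  # gain for leaky ReLU with slope `leak`
expected_std = gain / math.sqrt(fan_in)    # about 0.2887 for fan_in = 24
empirical_std = layer.weight.std().item()  # sample estimate over 256 * 24 weights
print(expected_std, empirical_std)
```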

In [11]:
class Critic(nn.Module):
    """Critic (Value) Model."""
    def __init__(self,
                 in_critic,
                 hidden_in_critic=256,
                 hidden_out_critic=256,
                 out_critic=1, 
                 leak=0.01, 
                 seed=42):
        """Initialize parameters and build model.
        Params
        ======
            in_critic (int): Size of the critic input (all agents' observations and actions, concatenated)
            hidden_in_critic (int): Number of nodes in the first hidden layer
            hidden_out_critic (int): Number of nodes in the second hidden layer
            out_critic (int): Size of the output (1 for a scalar Q-value)
            leak (float): Leakiness of leaky ReLU
            seed (int): Random seed
        """
        super(Critic, self).__init__()
        self.leak = leak
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(in_critic, hidden_in_critic)
        self.fc2 = nn.Linear(hidden_in_critic, hidden_out_critic)
        self.fc3 = nn.Linear(hidden_out_critic, out_critic)
        self.reset_parameters()

    def reset_parameters(self):
        """Initialize FC layers followed by leaky ReLU using Kaiming He's (2015) approach.
        Source: https://arxiv.org/pdf/1502.01852v1.pdf
        For more info see here:
            https://www.jefkine.com/deep/2016/08/08/initialization-of-deep-networks-case-of-rectifiers/
        """
        torch.nn.init.kaiming_normal_(self.fc1.weight.data, a=self.leak,
                                      nonlinearity='leaky_relu', mode='fan_in')
        torch.nn.init.kaiming_normal_(self.fc2.weight.data, a=self.leak,
                                      nonlinearity='leaky_relu', mode='fan_in')
        torch.nn.init.uniform_(self.fc3.weight.data, -3e-3, 3e-3)

    def forward(self, state):
        """ Build a critic (value) network that maps (state, action) pairs -> Q-values."""
        x = F.leaky_relu(self.fc1(state), negative_slope=self.leak)
        x = F.leaky_relu(self.fc2(x), negative_slope=self.leak)
        x = self.fc3(x)
        return x

In [12]:
class DDPGAgent:
    def __init__(self, 
                 in_actor, 
                 hidden_in_actor,
                 hidden_out_actor,
                 out_actor, 
                 in_critic,
                 hidden_in_critic,
                 hidden_out_critic,
                 out_critic,
                 device,
                 relu_leak=1e-2,
                 seed=42,
                 lr_actor=1.0e-2, 
                 lr_critic=1.0e-2,
                 noise_mul=0.1):
        super(DDPGAgent, self).__init__()
        self.device=device
        self.actor = Actor(
            in_actor,
            hidden_in_actor, 
            hidden_out_actor, 
            out_actor, 
            relu_leak, 
            seed).to(device)
        self.critic = Critic(
            in_critic,
            hidden_in_critic,
            hidden_out_critic, 
            out_critic, 
            relu_leak, 
            seed).to(device)
        self.target_actor = Actor(
            in_actor, 
            hidden_in_actor,
            hidden_out_actor,
            out_actor, 
            relu_leak,
            seed).to(device)
        self.target_critic = Critic(
            in_critic, 
            hidden_in_critic,
            hidden_out_critic,
            out_critic, 
            relu_leak,
            seed).to(device)

        self.noise = OUNoise(out_actor, seed=seed, scale=1.0)
        
        # initialize targets same as original networks
        hard_update(self.target_actor, self.actor)
        hard_update(self.target_critic, self.critic)

        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=lr_actor)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=lr_critic, weight_decay=1.e-5)   
        
    def act(self, states, noise_mul=0.1, train=False):
        if not train:
            self.actor.eval()
        actions = self.actor(states).cpu().data.numpy()
        actions += self.noise.sample() * noise_mul
        self.actor.train()
        return np.clip(actions, -1, 1)        
    
    def target_act(self, states, noise_mul=0.1, train=False):
        if not train:
            self.target_actor.eval()
        actions = self.target_actor(states).cpu().data.numpy()
        actions += self.noise.sample() * noise_mul
        self.target_actor.train()
        return np.clip(actions, -1, 1)
    
    def reset(self):
        self.noise.reset()
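
`OUNoise` is imported from a local module that is not shown here. The sketch below shows what an Ornstein-Uhlenbeck noise process with the same call pattern (`OUNoise(size, seed=..., scale=...)`, `sample()`, `reset()`) might look like. The class name `OUNoiseSketch` and the values of `mu`, `theta` and `sigma` are assumptions for illustration, not the author's implementation.

```python
import numpy as np


class OUNoiseSketch:
    """Hypothetical stand-in for the imported OUNoise class."""

    def __init__(self, size, seed=42, scale=1.0, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(size)
        self.theta = theta    # strength of the pull back toward the mean
        self.sigma = sigma    # size of the random kick at each step
        self.scale = scale
        self.rng = np.random.RandomState(seed)
        self.reset()

    def reset(self):
        """Return the internal state to the mean."""
        self.state = self.mu.copy()

    def sample(self):
        """Advance the process one step and return the scaled state."""
        drift = self.theta * (self.mu - self.state)
        diffusion = self.sigma * self.rng.standard_normal(self.state.shape)
        self.state = self.state + drift + diffusion
        return self.state * self.scale


noise = OUNoiseSketch(2, seed=42)
first, second = noise.sample(), noise.sample()  # consecutive samples are correlated
noise.reset()                                   # state is back at the mean
```

Because each sample starts from the previous state, the noise is correlated over time, unlike independent Gaussian noise drawn afresh at every step.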

In [39]:
class MADDPGAgent:
    def __init__(
        self,
        num_agents,
        in_actor, 
        hidden_in_actor,
        hidden_out_actor,
        out_actor, 
        hidden_in_critic,
        hidden_out_critic,
        out_critic,
        device,
        relu_leak=1e-2,
        seed=42,
        lr_actor=1.0e-2, 
        lr_critic=1.0e-2,
        replay_buffer_size=int(1e6),
        batch_size=64,
        gamma=0.99,
        tau=1e-3,
        update_every=1,
        num_updates=1
    ):        
        # the critic sees every agent's observation and every agent's action
        in_critic = in_actor * num_agents + out_actor * num_agents
        self.agents = []
        self.num_agents = num_agents
        self.out_actor = out_actor
        self.replay_buffer = ReplayBuffer(num_agents * out_actor, replay_buffer_size, batch_size, seed, device)
        self.batch_size = batch_size
        self.gamma = gamma
        self.tau = tau
        self.device = device
        self.steps = 0
        self.update_every = update_every
        self.num_updates = num_updates
        self.noise_scale = 1.0
        self.learn_steps = 0
        for _ in range(num_agents):
            self.agents.append(
                DDPGAgent(
                    in_actor, hidden_in_actor, hidden_out_actor, out_actor,
                    in_critic, hidden_in_critic, hidden_out_critic, out_critic, 
                    device, relu_leak, seed, lr_actor, lr_critic)
            )          

    
    def act(self, states, noise_mul=1.0, train=False):
        """Returns actions for given state as per current policy."""
        actions = np.zeros((self.num_agents, self.out_actor))
        for agent_idx, state in enumerate(states):
            state = torch.from_numpy(state).float().to(self.device)
            if state.dim() < 2:
                state = state.unsqueeze(0)
            curr_agent = self.agents[agent_idx]
            action = curr_agent.act(state, noise_mul, train=train)
            actions[agent_idx, :] = action
        return actions
            
    def step(self, states, actions, rewards, next_states, dones):
        """Save experience in replay memory, and use random sample from buffer to learn."""
        self.steps += 1
        self.replay_buffer.add(states, actions, rewards, next_states, dones)

        if (len(self.replay_buffer) > self.batch_size) and (self.steps % self.update_every == 0):
            # reset steps so we don't overflow at some point
            self.steps = 0
            for _ in range(self.num_updates):
                experiences = self.replay_buffer.sample()
                self.learn(experiences, self.gamma)
  
    def learn(self, experiences, gamma):
        
        obs_full, actions_full, rewards_full, next_obs_full, dones_full = experiences
        
        for agent_idx, agent in enumerate(self.agents):                        
            
            agent.critic_optimizer.zero_grad()
            
            agent_torch_idx = torch.tensor([agent_idx]).to(self.device)
            other_agent_idx = (agent_idx + 1) % 2                  
            other_agent_torch_idx = torch.tensor([other_agent_idx]).to(self.device)
            
            # Select for current agent
            obs = torch.index_select(obs_full, 1, agent_torch_idx).squeeze()
            actions = torch.index_select(actions_full, 1, agent_torch_idx)
            rewards = torch.index_select(rewards_full, 1, agent_torch_idx)
            next_obs = torch.index_select(next_obs_full, 1, agent_torch_idx).squeeze()
            dones = torch.index_select(dones_full, 1, agent_torch_idx)            
            
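            # next action of the current agent from its target actor (NumPy array, noise included)
            target_actions = agent.target_act(next_obs, train=False)
            target_actions = target_actions.squeeze()
            
            target_actions = torch.from_numpy(target_actions).to(self.device)

            # NOTE: the opponent's next action is zero-filled here instead of
            # being computed by the opponent's target actor
            target_critic_input = torch.cat((
                next_obs_full.reshape(self.batch_size, -1), 
                target_actions,
                torch.zeros_like(target_actions)
            ), dim=1).to(self.device)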
                                    
            with torch.no_grad():
                q_next = agent.target_critic(target_critic_input)
                        
            y = rewards + self.gamma * q_next * (1 - dones)
                        
            actions = actions.squeeze()
            obs_full_reshaped = obs_full.reshape(self.batch_size, -1) 
            
            opponent_actions = torch.index_select(actions_full, 1, other_agent_torch_idx)
            opponent_actions = opponent_actions.squeeze()
            
            critic_input = torch.cat((
                obs_full_reshaped, 
                actions,
                opponent_actions), dim=1)
            
            q = agent.critic(critic_input)
            
            #cr_loss = torch.nn.SmoothL1Loss()
            cr_loss = F.mse_loss
            critic_loss = cr_loss(q, y.detach())
            critic_loss.backward()
            torch.nn.utils.clip_grad_norm_(agent.critic.parameters(), 1)
            agent.critic_optimizer.step()
            
            agent.actor_optimizer.zero_grad()

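            own_action = None
            other_actions = []
            
            for curr_agent in range(self.num_agents):
                curr_actor = self.agents[curr_agent].actor
                curr_idx = torch.tensor([curr_agent]).to(self.device)
                ob = torch.index_select(obs_full, 1, curr_idx).squeeze()
                result = curr_actor(ob)
                if curr_agent == agent_idx:
                    own_action = result                     # keep the graph so gradients reach this actor
                else:
                    other_actions.append(result.detach())   # other agents' actions act as constants
                
            # own action first, then the others: the same layout the critic was trained on above
            q_input = torch.cat([own_action] + other_actions, dim=1)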
            
            obs_full_reshaped = obs_full.reshape(self.batch_size, -1)
            q_input2 = torch.cat((obs_full_reshaped, q_input), dim=1)
                        
            actor_loss = -agent.critic(q_input2).mean()
            actor_loss.backward()
            torch.nn.utils.clip_grad_norm_(agent.actor.parameters(), 1)
            agent.actor_optimizer.step()
            
            soft_update(agent.target_actor, agent.actor, self.tau)
            soft_update(agent.target_critic, agent.critic, self.tau)

            al = actor_loss.cpu().detach().item()
            cl = critic_loss.cpu().detach().item()
                        
            writer.add_scalar("agent{}/actor_loss".format(agent_idx), al, self.learn_steps)
            writer.add_scalar("agent{}/critic_loss".format(agent_idx), cl, self.learn_steps)
            self.learn_steps += 1
            
        writer.file_writer.flush()
        if (self.learn_steps % 10 == 0):
            print("critic loss: {}".format(cl))
            print("actor loss: {}".format(al))
            
#             print("critic loss: {}, actor loss: {}".format(cl, al))
#             logger.add_scalars('agent%i/losses' % agent_number,
#                            {'critic loss': cl,
#                             'actor_loss': al},
#                            self.iter)            
        
    def reset(self):
        """Reset the exploration noise of every agent."""
        for agent in self.agents:
            agent.reset()
        
    def save_agents(self):
        raise NotImplementedError("Saving is not implemented yet.")
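
The critic target computed in `learn` is the one-step TD target `y = r + gamma * Q_target(next) * (1 - done)`. The toy example below works through it with NumPy; all numbers are made up for illustration and do not come from a training run.

```python
import numpy as np

gamma = 0.99
rewards = np.array([[0.1], [0.0], [-0.01]])  # made-up batch of 3 transitions
q_next = np.array([[0.5], [0.2], [0.3]])     # made-up target-critic outputs
dones = np.array([[0.0], [0.0], [1.0]])      # the last transition ends its episode

# Bootstrapping is switched off for terminal transitions by the (1 - dones) factor.
y = rewards + gamma * q_next * (1 - dones)
print(y.ravel())                             # approximately 0.595, 0.198, -0.01
```

For the terminal transition the target reduces to the reward alone, since there is no next state to bootstrap from.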


In [40]:
num_agents = 2

in_actor = 24
hidden_in_actor = 256
hidden_out_actor = 128
out_actor = 2

hidden_in_critic = 256
hidden_out_critic = 128

out_critic = 1

lr_actor = 1e-3
lr_critic = 1e-3
relu_leak = 1e-2
seed = 42

gpu_id = 0
use_gpu = True
device = 'cuda:{}'.format(gpu_id) if use_gpu else 'cpu'

batch_size = 64
replay_buffer_size = int(1e6)
gamma = 0.99
tau = 1e-3
update_every = 1
num_updates = 5

meta_agent = MADDPGAgent(  
    num_agents,
    in_actor, 
    hidden_in_actor,
    hidden_out_actor,
    out_actor, 
    hidden_in_critic,
    hidden_out_critic,
    out_critic,
    device,
    relu_leak,
    seed,
    lr_actor, 
    lr_critic,
    replay_buffer_size,
    batch_size,
    gamma,
    tau,
    update_every,
    num_updates)
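
With these settings, each critic takes an input of 52 values: two 24-dimensional observations plus two 2-dimensional actions. The shape check below uses NumPy placeholders. The `(batch, agent, feature)` layout of `obs_full` is an assumption inferred from the `index_select(..., 1, ...)` calls in `learn`, since `ReplayBuffer` is not shown.

```python
import numpy as np

num_agents, in_actor, out_actor, batch_size = 2, 24, 2, 64

obs_full = np.zeros((batch_size, num_agents, in_actor))  # all agents' observations
own_actions = np.zeros((batch_size, out_actor))
opponent_actions = np.zeros((batch_size, out_actor))

# Same concatenation order as in `learn`: observations, own action, opponent action.
critic_input = np.concatenate(
    (obs_full.reshape(batch_size, -1), own_actions, opponent_actions), axis=1)

in_critic = in_actor * num_agents + out_actor * num_agents  # 48 + 4
print(critic_input.shape, in_critic)
```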

In [41]:
env_info = env.reset(train_mode=False)[brain_name]
states = env_info.vector_observations

actions = meta_agent.act(states)
print(actions)

[[0.12225351 0.00236954]
 [0.18207023 0.33995774]]


In [43]:
num_episodes = 100

num_steps_max = 1000

print_every = 1

rolling_scores = deque(maxlen=100)
ctr = 0

noise_init = 5.0
noise_mul = noise_init
noise_decay = 1 - 5e-4
min_noise = 0.1

for i in range(num_episodes):               
    # NOTE: SWITCH TO train_mode = True
    env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)

    noise_mul *= noise_decay
    
    for j in range(num_steps_max):
        
        ctr += 1
        
        noise_mul = min(max(noise_mul, min_noise), noise_init)
        actions = meta_agent.act(states, noise_mul)
        
        env_info = env.step(actions)[brain_name]
        next_states = env_info.vector_observations
        rewards = env_info.rewards        
        dones = env_info.local_done
        actions = torch.from_numpy(actions).to(meta_agent.device)
        meta_agent.step(states, actions, rewards, next_states, dones)
        scores += env_info.rewards
        
        states = next_states
        if np.any(dones): 
            print("episode {} done after {} steps".format(i, j))
            break
            
    # record the episode score (max over agents) and average it over the last 100 episodes
    rolling_scores.append(float(np.max(scores)))
    rolling_score = np.array(list(rolling_scores)).mean()
    writer.add_scalar("meta_agent/rolling_score", rolling_score, i)
    writer.file_writer.flush()
    
#     if i % print_every == 0:        
#         print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))
#         print("Rolling score: {}".format(np.array(rolling_scores).mean()))

critic loss: 0.0003948748344555497
actor loss: 0.17601554095745087
critic loss: 0.00016305880853906274
actor loss: 0.1962447613477707
critic loss: 0.00019429261737968773
actor loss: 0.18311229348182678
critic loss: 0.0003059223818127066
actor loss: 0.19797825813293457
critic loss: 0.0003411299258004874
actor loss: 0.18570178747177124
critic loss: 0.0002789040154311806
actor loss: 0.1917210966348648
critic loss: 0.0002744097728282213
actor loss: 0.17613297700881958
critic loss: 0.0001510655420133844
actor loss: 0.19614408910274506
critic loss: 0.00015334202907979488
actor loss: 0.15852437913417816
critic loss: 0.00023845708346925676
actor loss: 0.16881579160690308
critic loss: 0.00014442391693592072
actor loss: 0.16906879842281342
critic loss: 0.000374025636119768
actor loss: 0.1851234883069992
critic loss: 0.00019573591998778284
actor loss: 0.20293346047401428
critic loss: 0.00030467082979157567
actor loss: 0.18125519156455994
critic loss: 8.80771258380264e-05
actor loss: 0.17902582883

critic loss: 0.000188870559213683
actor loss: 0.15785709023475647
critic loss: 0.0002847683208528906
actor loss: 0.1695864200592041
critic loss: 0.0001791198446881026
actor loss: 0.16136778891086578
critic loss: 0.0001926658587763086
actor loss: 0.16125835478305817
critic loss: 0.0001513996976427734
actor loss: 0.13828040659427643
critic loss: 0.00010944624955300242
actor loss: 0.156443253159523
critic loss: 0.0002018628001678735
actor loss: 0.16144651174545288
critic loss: 0.00017122634744737297
actor loss: 0.1713477522134781
critic loss: 0.00017726147780194879
actor loss: 0.14869549870491028
critic loss: 0.00016780704027041793
actor loss: 0.13969887793064117
critic loss: 0.00024329061852768064
actor loss: 0.14795929193496704
critic loss: 0.00022039099712856114
actor loss: 0.13274604082107544
critic loss: 0.0002405385603196919
actor loss: 0.13719965517520905
episode 7 done after 14 steps
critic loss: 0.00018539329175837338
actor loss: 0.14797386527061462
critic loss: 0.000131122578750

critic loss: 0.0001538825745228678
actor loss: 0.15593813359737396
critic loss: 0.0001301503216382116
actor loss: 0.1131114810705185
critic loss: 0.00020486232824623585
actor loss: 0.1543169915676117
critic loss: 0.00012586470984388143
actor loss: 0.14575016498565674
critic loss: 0.00014348502736538649
actor loss: 0.12475204467773438
critic loss: 0.00010462810314493254
actor loss: 0.13834914565086365
critic loss: 9.710924496175721e-05
actor loss: 0.14386843144893646
critic loss: 0.00019583833636716008
actor loss: 0.13530883193016052
critic loss: 0.00013976232730783522
actor loss: 0.14584560692310333
critic loss: 0.00012786993465851992
actor loss: 0.1471109539270401
critic loss: 0.00011674017150653526
actor loss: 0.13536106050014496
critic loss: 0.00012980884639546275
actor loss: 0.11891932040452957
critic loss: 0.0002508804900571704
actor loss: 0.1397174596786499
critic loss: 0.00017892196774482727
actor loss: 0.13415105640888214
critic loss: 0.0001895600580610335
actor loss: 0.1488755

critic loss: 0.0001398178283125162
actor loss: 0.13707982003688812
critic loss: 0.00018286969861947
actor loss: 0.13446803390979767
critic loss: 0.00014507968444377184
actor loss: 0.13479356467723846
critic loss: 0.00014352338621392846
actor loss: 0.13705973327159882
critic loss: 0.00012807358871214092
actor loss: 0.12261200696229935
critic loss: 0.00011550081399036571
actor loss: 0.12220420688390732
critic loss: 0.00017607638437766582
actor loss: 0.11814859509468079
critic loss: 6.098593439674005e-05
actor loss: 0.12063197791576385
critic loss: 0.00013275642413645983
actor loss: 0.12304946035146713
critic loss: 8.051653276197612e-05
actor loss: 0.12501898407936096
critic loss: 0.00025024000206030905
actor loss: 0.15326261520385742
critic loss: 0.00022409734083339572
actor loss: 0.14977942407131195
critic loss: 0.0001203506690217182
actor loss: 0.1245475634932518
critic loss: 0.00017999607371166348
actor loss: 0.11421804130077362
critic loss: 0.0002400296798441559
actor loss: 0.1262488

critic loss: 0.00021870213095098734
actor loss: 0.10750541090965271
critic loss: 0.0001536080235382542
actor loss: 0.12432628124952316
critic loss: 0.00021454872330650687
actor loss: 0.1384524255990982
critic loss: 0.00026877463096752763
actor loss: 0.1346672773361206
critic loss: 0.0001917902409331873
actor loss: 0.10824563354253769
critic loss: 0.00018469267524778843
actor loss: 0.1255706250667572
critic loss: 0.00011998452828265727
actor loss: 0.1183001697063446
critic loss: 0.00037764786975458264
actor loss: 0.12602555751800537
critic loss: 0.00018493487732484937
actor loss: 0.09649483859539032
critic loss: 0.0002344243839615956
actor loss: 0.11488483101129532
episode 20 done after 31 steps
critic loss: 0.00020416092593222857
actor loss: 0.1270141452550888
critic loss: 0.0001888412080006674
actor loss: 0.1267329603433609
critic loss: 0.00016397805302403867
actor loss: 0.13337337970733643
critic loss: 0.00013271538773551583
actor loss: 0.09470251202583313
critic loss: 0.000119920056

critic loss: 0.00028102443320676684
actor loss: 0.1080251932144165
critic loss: 0.00020862455130554736
actor loss: 0.09624979645013809
episode 27 done after 14 steps
critic loss: 0.0002451519249007106
actor loss: 0.12117061764001846
critic loss: 0.0003570900298655033
actor loss: 0.1019495502114296
critic loss: 0.00018514366820454597
actor loss: 0.11220484972000122
critic loss: 0.0002493095234967768
actor loss: 0.09498120099306107
critic loss: 0.0002492825733497739
actor loss: 0.11760053783655167
critic loss: 0.00027533777756616473
actor loss: 0.09518951177597046
critic loss: 0.00026139835244975984
actor loss: 0.09485061466693878
critic loss: 0.0002205948840128258
actor loss: 0.10931297391653061
critic loss: 0.0001811723632272333
actor loss: 0.0980139970779419
critic loss: 0.00021206219389569014
actor loss: 0.11489613354206085
critic loss: 0.00018966477364301682
actor loss: 0.1177511066198349
critic loss: 0.00022716252715326846
actor loss: 0.10519576072692871
critic loss: 0.000167883423

critic loss: 0.0002597654820419848
actor loss: 0.08267176151275635
critic loss: 0.0002066231973003596
actor loss: 0.0785430371761322
critic loss: 0.00014421901141759008
actor loss: 0.08910596370697021
critic loss: 0.0002874308847822249
actor loss: 0.07015802711248398
episode 34 done after 13 steps
critic loss: 0.00014720157196279615
actor loss: 0.08374684303998947
critic loss: 0.00021440713317133486
actor loss: 0.09907209873199463
critic loss: 0.00013889950059819967
actor loss: 0.09692233800888062
critic loss: 0.00015983061166480184
actor loss: 0.0908602774143219
critic loss: 0.00017698243027552962
actor loss: 0.10288182646036148
critic loss: 0.00013622443657368422
actor loss: 0.10678169131278992
critic loss: 0.00014292632113210857
actor loss: 0.10797367990016937
critic loss: 0.00025504580116830766
actor loss: 0.11427296698093414
critic loss: 0.00016748151392675936
actor loss: 0.09495282173156738
critic loss: 0.0002971976646222174
actor loss: 0.10555914044380188
critic loss: 0.00016282

critic loss: 0.00017892004689201713
actor loss: 0.09533675760030746
critic loss: 0.00026211992371827364
actor loss: 0.0877709835767746
critic loss: 0.00023012160090729594
actor loss: 0.09594765305519104
critic loss: 0.00037586261169053614
actor loss: 0.09786861389875412
critic loss: 0.0001577231741975993
actor loss: 0.09255985170602798
critic loss: 0.00020578299881890416
actor loss: 0.07646815478801727
critic loss: 0.00026505946880206466
actor loss: 0.09172934293746948
critic loss: 0.0003104378702118993
actor loss: 0.09485743194818497
critic loss: 0.0004101305385120213
actor loss: 0.08516786992549896
critic loss: 0.0002511079655960202
actor loss: 0.09195388853549957
critic loss: 0.00027458096155896783
actor loss: 0.09318219870328903
critic loss: 0.0003087125951424241
actor loss: 0.09500441700220108
critic loss: 0.0004512136511038989
actor loss: 0.0872746929526329
critic loss: 0.00023372797295451164
actor loss: 0.08542682230472565
episode 42 done after 14 steps
critic loss: 0.0003478734

critic loss: 0.00012731464812532067
actor loss: 0.07505988329648972
critic loss: 0.00017708326049614698
actor loss: 0.08167402446269989
critic loss: 0.00022549870482180268
actor loss: 0.09903690218925476
critic loss: 0.00018717529019340873
actor loss: 0.09396352618932724
critic loss: 0.00012994572171010077
actor loss: 0.07712428271770477
critic loss: 0.00011399259528843686
actor loss: 0.07772614061832428
critic loss: 0.00012383610010147095
actor loss: 0.08172354102134705
critic loss: 0.0002168363134842366
actor loss: 0.07747761905193329
critic loss: 0.00017793099686969072
actor loss: 0.08559850603342056
critic loss: 0.00021540610759984702
actor loss: 0.08635187894105911
episode 48 done after 14 steps
critic loss: 0.0002501429116819054
actor loss: 0.07973960041999817
critic loss: 0.00010269663471262902
actor loss: 0.09321127831935883
critic loss: 0.0002007565344683826
actor loss: 0.07111675292253494
critic loss: 0.00012632415746338665
actor loss: 0.07279004156589508
critic loss: 0.00012

critic loss: 0.00022933565196581185
actor loss: 0.06863363087177277
critic loss: 0.0001245766325155273
actor loss: 0.06844300031661987
critic loss: 0.00016899411275517195
actor loss: 0.06259658932685852
critic loss: 0.00012552467524074018
actor loss: 0.06313024461269379
critic loss: 0.0001761745661497116
actor loss: 0.08588402718305588
critic loss: 0.00010696657409425825
actor loss: 0.07365516573190689
critic loss: 0.00013125409896019846
actor loss: 0.07760680466890335
critic loss: 0.0001339998416369781
actor loss: 0.07832387834787369
critic loss: 0.00010753474634839222
actor loss: 0.0745341032743454
critic loss: 0.00012744862760882825
actor loss: 0.0659174695611
critic loss: 0.0001738365099299699
actor loss: 0.08383046835660934
episode 54 done after 13 steps
critic loss: 0.0002101580030284822
actor loss: 0.06185032054781914
critic loss: 0.0001606458390597254
actor loss: 0.07481791079044342
critic loss: 0.00018317249487154186
actor loss: 0.05988403037190437
critic loss: 0.0001904041419

critic loss: 0.000203562798560597
actor loss: 0.06903710216283798
critic loss: 0.0002550171047914773
actor loss: 0.07256561517715454
critic loss: 0.0001711098593659699
actor loss: 0.054491180926561356
critic loss: 0.00019456259906291962
actor loss: 0.07131167501211166
critic loss: 0.00020033202599734068
actor loss: 0.06940753757953644
episode 62 done after 13 steps
critic loss: 0.0002002097899094224
actor loss: 0.0761282816529274
critic loss: 0.0002092223148792982
actor loss: 0.08741673082113266
critic loss: 0.0002786357654258609
actor loss: 0.07483984529972076
critic loss: 0.00017351011047139764
actor loss: 0.07572261244058609
critic loss: 0.00020485519780777395
actor loss: 0.08408936858177185
critic loss: 0.00015087543579284102
actor loss: 0.06718505918979645
critic loss: 0.0002504692238289863
actor loss: 0.06358716636896133
critic loss: 0.00016984320245683193
actor loss: 0.05893830955028534
critic loss: 0.0001874939480330795
actor loss: 0.06635238230228424
critic loss: 0.00013772796

critic loss: 0.0001629850739846006
actor loss: 0.06822340190410614
critic loss: 9.43435006774962e-05
actor loss: 0.03946932032704353
critic loss: 0.0001322298194281757
actor loss: 0.05440302938222885
critic loss: 8.86861453182064e-05
actor loss: 0.07298639416694641
critic loss: 0.00015764673298690468
actor loss: 0.06617268919944763
critic loss: 0.0001345360797131434
actor loss: 0.064656101167202
critic loss: 0.00010487368126632646
actor loss: 0.05345048010349274
critic loss: 0.0001103555696317926
actor loss: 0.05688747763633728
critic loss: 0.00018004706362262368
actor loss: 0.06786436587572098
critic loss: 0.0002259434259030968
actor loss: 0.05855242535471916
critic loss: 0.00024210991978179663
actor loss: 0.048193953931331635
critic loss: 0.00021842055139131844
actor loss: 0.0642675831913948
episode 70 done after 13 steps
critic loss: 0.00012052714737365022
actor loss: 0.06562875211238861
critic loss: 0.0001405457587679848
actor loss: 0.07092341780662537
critic loss: 0.00013085313548

critic loss: 0.0001396035950165242
actor loss: 0.05745255574584007
critic loss: 0.00012873116065748036
actor loss: 0.06137796491384506
episode 75 done after 13 steps
critic loss: 0.00015260372310876846
actor loss: 0.06263932585716248
critic loss: 0.00015754441847093403
actor loss: 0.07184983044862747
critic loss: 0.00010416985605843365
actor loss: 0.053236931562423706
critic loss: 0.00018228495900984854
actor loss: 0.05790308862924576
critic loss: 0.00013311684597283602
actor loss: 0.06371191889047623
critic loss: 0.00018652771541383117
actor loss: 0.05260854214429855
critic loss: 0.00025843820185400546
actor loss: 0.06302569806575775
critic loss: 0.00016992168093565851
actor loss: 0.039723120629787445
critic loss: 0.00016408554802183062
actor loss: 0.06916431337594986
critic loss: 0.00017370792920701206
actor loss: 0.05811205506324768
critic loss: 0.00021245551761239767
actor loss: 0.06870909035205841
critic loss: 0.00010692713840398937
actor loss: 0.06662838906049728
critic loss: 0.0

critic loss: 9.114173008129e-05
actor loss: 0.049798332154750824
critic loss: 0.00017277160077355802
actor loss: 0.06249690800905228
critic loss: 9.279049118049443e-05
actor loss: 0.03716329485177994
critic loss: 0.00010428170207887888
actor loss: 0.056044284254312515
critic loss: 9.342782868770882e-05
actor loss: 0.05205903574824333
critic loss: 0.0001747303904267028
actor loss: 0.06718938052654266
critic loss: 0.00018722671666182578
actor loss: 0.06403263658285141
critic loss: 0.0001842434867285192
actor loss: 0.05840502306818962
critic loss: 0.0002055703371297568
actor loss: 0.060351159423589706
critic loss: 0.00011845908738905564
actor loss: 0.05784772336483002
critic loss: 0.00023525212600361556
actor loss: 0.04993409663438797
critic loss: 0.00011179919238202274
actor loss: 0.05359321087598801
critic loss: 0.00014676374848932028
actor loss: 0.05427370220422745
episode 84 done after 14 steps
critic loss: 0.00012880789290647954
actor loss: 0.047905176877975464
critic loss: 0.0001152

critic loss: 0.0001068560522980988
actor loss: 0.052250683307647705
critic loss: 0.00014633801765739918
actor loss: 0.047614939510822296
critic loss: 0.0001319244911428541
actor loss: 0.05728669837117195
critic loss: 0.00013345296611078084
actor loss: 0.06876377761363983
critic loss: 5.849140870850533e-05
actor loss: 0.04438583180308342
critic loss: 0.00018989457748830318
actor loss: 0.05185125395655632
critic loss: 0.00012914792750962079
actor loss: 0.05110098049044609
critic loss: 0.00011776699830079451
actor loss: 0.048488471657037735
critic loss: 0.0001982162648346275
actor loss: 0.050088804215192795
critic loss: 0.00011115175584563985
actor loss: 0.053366370499134064
critic loss: 0.0002175965637434274
actor loss: 0.048669058829545975
critic loss: 0.00016453237913083285
actor loss: 0.050219375640153885
critic loss: 0.0002915711374953389
actor loss: 0.06248272582888603
critic loss: 0.0002862960391212255
actor loss: 0.04587627574801445
critic loss: 0.00019235556828789413
actor loss: 

critic loss: 0.0001043229567585513
actor loss: 0.03807613626122475
critic loss: 9.63678103289567e-05
actor loss: 0.047788407653570175
critic loss: 7.343881588894874e-05
actor loss: 0.03846045956015587
critic loss: 0.0002396915660938248
actor loss: 0.04113345965743065
episode 98 done after 14 steps
critic loss: 8.275770233012736e-05
actor loss: 0.048858001828193665
critic loss: 0.00011975115921813995
actor loss: 0.03807682916522026
critic loss: 0.0001153893826995045
actor loss: 0.04428328573703766
critic loss: 0.00011958840332226828
actor loss: 0.04681527987122536
critic loss: 0.0001539028889965266
actor loss: 0.04085780307650566
critic loss: 8.227971557062119e-05
actor loss: 0.04930553212761879
critic loss: 9.687861165730283e-05
actor loss: 0.04298140108585358
critic loss: 0.00018740650557447225
actor loss: 0.050707217305898666
critic loss: 0.00011070293112425134
actor loss: 0.037234287708997726
critic loss: 0.00017578057304490358
actor loss: 0.04100637882947922
critic loss: 0.00011014

In [20]:
env_info = env.reset(train_mode=True)[brain_name]

In [21]:
states = env_info.vector_observations
state_size = states.shape[1]

In [22]:
state_size

24

In [24]:
states.shape

(2, 24)
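
The shape `(2, 24)` reads as one row per agent and 24 observation values per row, which matches `state_size` above. The sketch below illustrates that layout with a zero-filled stand-in array, since the real values come from `env_info.vector_observations`. The names `fake_states`, `num_agents`, `first_agent_obs`, and `joint_state` are hypothetical, and flattening both observations into one joint vector is only an assumed way a shared critic might consume them, not something this notebook confirms.

```python
import numpy as np

# Stand-in for env_info.vector_observations: 2 agents x 24 values each.
fake_states = np.zeros((2, 24))

# Rows index agents, columns index observation components.
num_agents, state_size = fake_states.shape

# Per-agent observation: a single row of the array.
first_agent_obs = fake_states[0]

# Assumed joint view: both agents' observations concatenated into one vector.
joint_state = fake_states.reshape(-1)

print(num_agents, state_size, first_agent_obs.shape, joint_state.shape)
```

Keeping the per-agent rows separate suits actors that act on local observations; the flattened view is only needed if some component has to see every agent at once.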

In [27]:
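# torch.stack() takes a sequence of tensors, so passing the single (2, 24) tensor raised:
#   TypeError: stack(): argument 'tensors' (position 1) must be tuple of Tensors, not Tensor
# Stack the per-agent rows instead; the result has shape (24, 2).
foo = torch.stack([torch.from_numpy(s) for s in states], dim=1)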
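
`torch.stack` joins a sequence of tensors along a new dimension, which is why it rejects a single tensor. The sketch below shows two ways to get from NumPy observations to tensors, using a random stand-in array in place of the Unity output. `fake_states`, `batch`, and `stacked` are hypothetical names, and the cast to `float32` is an assumption about what the downstream networks expect.

```python
import numpy as np
import torch

# Stand-in for env_info.vector_observations (2 agents, 24 values each).
fake_states = np.random.rand(2, 24)

# Direct conversion keeps the (agents, state_size) layout; .float() casts to float32.
batch = torch.from_numpy(fake_states).float()

# torch.unbind splits the batch into a tuple of per-agent tensors,
# which is the kind of argument torch.stack accepts.
per_agent = torch.unbind(batch, dim=0)

# Stacking the per-agent rows along dim=1 yields shape (24, 2),
# i.e. the transpose of the original batch.
stacked = torch.stack(per_agent, dim=1)

print(batch.shape, stacked.shape)
```

If the goal is simply a batch with one row per agent, the direct `torch.from_numpy(...)` conversion is probably enough and no stacking is needed.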