In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import tqdm
from replay_memory import ReplayMemory
import random
import pdb
import time
import math
from mlagents.envs.environment import UnityEnvironment

## Policy Network

### Playing Atari with Deep Reinforcement Learning
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

There are several possible ways of parameterizing Q using a neural network. Since Q maps history-action pairs to scalar estimates of their Q-value, the history and the action have been used as inputs to the neural network by some previous approaches [20, 12]. The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action, resulting in a cost that scales linearly with the number of actions. We instead use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual action for the input state. The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.

We now describe the exact architecture used for all seven Atari games. The input to the neural
network consists of an 84 × 84 × 4 image produced by φ. The first hidden layer convolves 16 8 × 8
filters with stride 4 with the input image and applies a rectifier nonlinearity [10, 18]. The second
hidden layer convolves 32 4 × 4 filters with stride 2, again followed by a rectifier nonlinearity. The
final hidden layer is fully-connected and consists of 256 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action. The number of valid actions varied
between 4 and 18 on the games we considered. We refer to convolutional networks trained with our
approach as Deep Q-Networks (DQN).
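
As a rough illustration of the architecture quoted above, here is a minimal PyTorch sketch. The class name `AtariDQN` and the channels-first input layout are assumptions made for this example; they do not come from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AtariDQN(nn.Module):
    """Sketch of the quoted architecture (illustrative name, not from the paper)."""

    def __init__(self, num_actions):
        super().__init__()
        # 4 stacked 84x84 frames in -> 16 feature maps of size 20x20
        self.conv1 = nn.Conv2d(4, 16, kernel_size=8, stride=4)
        # 16 maps of 20x20 in -> 32 feature maps of size 9x9
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc = nn.Linear(32 * 9 * 9, 256)
        # linear output layer: one Q-value per valid action
        self.out = nn.Linear(256, num_actions)

    def forward(self, frames):
        # frames: (batch, 4, 84, 84), channels first as PyTorch expects
        x = F.relu(self.conv1(frames))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc(x.flatten(start_dim=1)))
        return self.out(x)


# A single forward pass yields Q-values for every action at once.
q_values = AtariDQN(num_actions=4)(torch.zeros(1, 4, 84, 84))
```

The spatial sizes follow from `(84 - 8) / 4 + 1 = 20` and `(20 - 4) / 2 + 1 = 9`, which is where the `32 * 9 * 9` input size of the fully-connected layer comes from.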

### Our Networks

#### Vector Observations (PushBlock No-Stack Small)
* We have 7 ray angles per ray scan. Each ray angle contributes a length-5 sublist of the form [hit_block, hit_goal, hit_wall, hit_anything, distance_if_hit]. This is essentially a one-hot representation of the objects in the environment.
* We have 2 ray scans at different angles, so the agent observes a total of **70 data elements every timestep**.
* Vector observation space (size 70).
* Action space (size 7): 0, 1, ..., 6.
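
The size arithmetic in the list above can be checked with a small sketch. The helper `encode_ray` is hypothetical and not part of ML-Agents; setting `hit_anything` to 1 on a hit and leaving the distance at 0.0 on a miss are assumptions made only for this example.

```python
NUM_SCANS = 2
ANGLES_PER_SCAN = 7
FIELDS = ["hit_block", "hit_goal", "hit_wall", "hit_anything", "distance_if_hit"]


def encode_ray(hit=None, distance=0.0):
    """Encode one ray as a length-5 sublist (hypothetical helper)."""
    sublist = [0.0] * len(FIELDS)
    if hit is not None:
        sublist[FIELDS.index("hit_" + hit)] = 1.0
        sublist[FIELDS.index("hit_anything")] = 1.0
        sublist[FIELDS.index("distance_if_hit")] = distance
    return sublist


# One timestep: every ray misses except one that hits the block.
rays = [encode_ray() for _ in range(NUM_SCANS * ANGLES_PER_SCAN)]
rays[3] = encode_ray(hit="block", distance=0.4)

# Flatten the 14 sublists into the vector observation for this timestep.
observation = [value for ray in rays for value in ray]
```

With 2 scans of 7 angles and 5 values per angle, `len(observation)` is `2 * 7 * 5 = 70`.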

Based on the work of Mnih et al., our Q-network for vector observations (essentially pre-extracted features) has one hidden layer with 100 rectifier (ReLU) units and one output layer with 7 units, one for each action in the action space.

#### Visual Observations (PushBlock No-Stack Small)


In [2]:
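class PushBlockNoStackSmallNetwork(nn.Module):
    """Q-network for vector observations: one hidden layer, one Q-value per action."""

    def __init__(self):
        super().__init__()
        # NOTE: the input size here is 210 = 3 * 70, whereas the text above
        # describes 70 data elements per timestep.
        self.hidden = nn.Linear(210, 100)
        self.out = nn.Linear(100, 7)

    def forward(self, state):
        hidden = F.relu(self.hidden(state))
        return self.out(hidden)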

## Training

### Playing Atari with Deep Reinforcement Learning
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

In these experiments, we used the RMSProp algorithm with minibatches of size 32. The behavior
policy during training was epsilon-greedy with epsilon annealed linearly from 1 to 0.1 over the first million
frames, and fixed at 0.1 thereafter. We trained for a total of 10 million frames and used a replay
memory of one million most recent frames.
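
The annealing schedule quoted above can be written as a small function. This is a minimal sketch; the function name `epsilon_at` and its parameter names are made up for illustration, and the cells below use a fixed `exploration_rate` instead.

```python
def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    fraction = min(frame / anneal_frames, 1.0)
    return start + fraction * (end - start)


# Epsilon at the start, halfway through annealing, and well after it ends.
schedule = [epsilon_at(f) for f in (0, 500_000, 5_000_000)]
```

Halfway through the first million frames the exploration rate is the midpoint of 1 and 0.1, and after that point the `min` clamp keeps it at 0.1.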

Following previous approaches to playing Atari games, we also use a simple frame-skipping technique [3]. More precisely, the agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on skipped frames. Since running the emulator forward for one step requires much less computation than having the agent select an action, this technique allows the agent to play roughly k times more games without significantly increasing the runtime. We use k = 4 for all games except Space Invaders where we noticed that using k = 4 makes the lasers invisible because of the period at which they blink. We used k = 3 to make the lasers visible and this change was the only difference in hyperparameter values between any of the games.
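
Frame skipping can be sketched as a thin wrapper around the environment's step function. The toy `CountingEnv` below is a stand-in invented for this example, not a real emulator API, and summing the rewards of the skipped frames is an assumption (a common convention) rather than something the quoted passage specifies.

```python
class CountingEnv:
    """Toy stand-in for an emulator: reward 1 per frame, ends after 10 frames."""

    def __init__(self):
        self.frame = 0

    def step(self, action):
        self.frame += 1
        done = self.frame >= 10
        return self.frame, 1.0, done  # observation, reward, done


def skip_step(env, action, k=4):
    """Repeat `action` for up to k frames and accumulate the rewards."""
    total_reward = 0.0
    for _ in range(k):
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            break  # do not step past the end of the episode
    return observation, total_reward, done


toy_env = CountingEnv()
first = skip_step(toy_env, action=0)
```

The agent only chooses an action once per call to `skip_step`, so the expensive Q-network forward pass runs on every kth frame while the emulator still advances every frame.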

In [19]:
num_episodes = 1000
num_timesteps = 2000
discount = 0.99
exploration_rate = 0.3

In [20]:
def extract_vec_obs(brain_info, brain_name):
    """Extract vector observations from a BrainInfo object."""
    vec_obs = brain_info[brain_name].vector_observations.flatten()
    return torch.from_numpy(vec_obs).float()

def extract_reward(brain_info, brain_name, brain_num):
    """Extract reward from BrainInfo object."""
    return brain_info[brain_name].rewards[brain_num]

In [21]:
env = UnityEnvironment(file_name="environment-binaries/PushBlock-30.app")
brain_name = "PushBlock"

INFO:mlagents.envs:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Training Brains : 0
        Reset Parameters :
		block_scale -> 2.0
		static_friction -> 0.0
		dynamic_friction -> 0.0
		block_drag -> 0.5



In [22]:
action_space = [0, 1, 2, 3, 4, 5, 6]

In [27]:
qnets = {
    "PushBlock": PushBlockNoStackSmallNetwork()
}
# Plain SGD with learning rate 0.1 and an L2 penalty (weight_decay) of 0.98 on the weights.
optimizers = {k: optim.SGD(v.parameters(), lr=0.1, weight_decay=0.98) for k, v in qnets.items()}

In [None]:
for episode in range(num_episodes):
    # reset the environment to get the start state
    old_braininfos = env.reset(train_mode=True)
    for t in range(num_timesteps):
        # choose actions for all brains and agents
        actions = {}  # brain name -> list of actions, one per agent
        for brain in old_braininfos:
            vector_observations = old_braininfos[brain].vector_observations
            brain_actions = []
            for i in range(len(old_braininfos[brain].agents)):
                if random.random() < exploration_rate:
                    # explore: uniformly random action
                    action = random.choice(action_space)
                else:
                    # exploit: greedy action under the current Q-network
                    vector_observation = torch.from_numpy(vector_observations[i]).float()
                    with torch.no_grad():
                        q_values = qnets[brain](vector_observation)
                    action = torch.argmax(q_values).item()
                brain_actions.append(action)
            actions[brain] = brain_actions
        # execute actions
        new_braininfos = env.step(actions)
        # update weights for all policy networks
        for brain in old_braininfos:
            network = qnets[brain]
            optimizer = optimizers[brain]
            # take into account experiences from all agents using this network
            for i in range(len(old_braininfos[brain].agents)):
                old_state = torch.from_numpy(old_braininfos[brain].vector_observations[i]).float()
                new_state = torch.from_numpy(new_braininfos[brain].vector_observations[i]).float()
                action = actions[brain][i]
                reward = extract_reward(new_braininfos, brain, i)
                prediction = network(old_state)[action]
                # one-step Q-learning target, held constant during backprop
                with torch.no_grad():
                    target = reward + discount * network(new_state).max()
                loss = (target - prediction) ** 2
                # update network parameters
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        old_braininfos = new_braininfos
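
The per-transition update at the heart of the loop above can be exercised without Unity. This is a minimal sketch under stated assumptions: the transition values are made up, and a tiny `nn.Linear` stands in for the Q-network.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
network = nn.Linear(4, 3)  # stand-in Q-network: 4 features, 3 actions
optimizer = optim.SGD(network.parameters(), lr=0.1)
discount = 0.99

# One made-up transition (state, action, reward, next state).
old_state = torch.tensor([1.0, 0.0, 0.0, 0.0])
action = 2
reward = 1.0
new_state = torch.tensor([0.0, 1.0, 0.0, 0.0])

# The target is treated as a constant, so no gradient flows through it.
with torch.no_grad():
    target = reward + discount * network(new_state).max()

# Squared TD error for the action that was actually taken.
prediction = network(old_state)[action]
loss_before = (target - prediction) ** 2
optimizer.zero_grad()
loss_before.backward()
optimizer.step()

# Re-evaluate against the same fixed target after one gradient step.
loss_after = (target - network(old_state)[action]) ** 2
```

One SGD step moves the prediction for the chosen action toward the target, so `loss_after` is smaller than `loss_before` for this transition. Computing the target under `torch.no_grad()` keeps the bootstrap term out of the gradient.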



In [18]:
env.close()

INFO:mlagents.envs:Environment shut down with return code 0.


## Training Evaluation Metrics

### Playing Atari with Deep Reinforcement Learning
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

In supervised learning, one can easily track the performance of a model during training by evaluating
it on the training and validation sets. In reinforcement learning, however, accurately evaluating the
progress of an agent during training can be challenging. Since our evaluation metric, as suggested
by [3], is the total reward the agent collects in an episode or game averaged over a number of
games, we periodically compute it during training. **The average total reward metric tends to be very
noisy** because small changes to the weights of a policy can lead to large changes in the distribution of
states the policy visits. The leftmost two plots in figure 2 show how the average total reward evolves
during training on the games Seaquest and Breakout. Both averaged reward plots are indeed quite
noisy, giving one the impression that the learning algorithm is not making steady progress. **Another,
more stable, metric is the policy’s estimated action-value function Q**, which provides an estimate of
how much discounted reward the agent can obtain by following its policy from any given state. We
collect a fixed set of states by running a random policy before training starts and track the average
of the maximum predicted Q for these states. The two rightmost plots in figure 2 show that average
predicted Q increases much more smoothly than the average total reward obtained by the agent and
plotting the same metrics on the other five games produces similarly smooth curves. In addition
to seeing relatively smooth improvement to predicted Q during training we did not experience any
divergence issues in any of our experiments. This suggests that, despite lacking any theoretical
convergence guarantees, our method is able to train large neural networks using a reinforcement
learning signal and stochastic gradient descent in a stable manner.
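
The more stable metric described above, the average of the maximum predicted Q over a fixed set of states, can be sketched as follows. The function name `average_max_q`, the random placeholder states, and the small stand-in network are all assumptions made for illustration; in practice the states would be collected by running a random policy before training starts.

```python
import torch
import torch.nn as nn


def average_max_q(network, states):
    """Mean over states of the largest predicted Q-value in each state."""
    with torch.no_grad():
        q_values = network(states)  # shape: (num_states, num_actions)
        return q_values.max(dim=1).values.mean().item()


torch.manual_seed(0)
held_out_states = torch.rand(64, 8)  # placeholder for states from a random policy
stand_in_network = nn.Linear(8, 7)   # placeholder Q-network with 7 actions
metric = average_max_q(stand_in_network, held_out_states)
```

Calling `average_max_q` on the same held-out states at regular intervals during training gives one number per checkpoint that can be plotted alongside the average total reward.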