---

## Learning to play Gravitar with a Deep-Q Network

### Abstract
This is a Deep Convolutional Dueling Q Learning agent with experience replay memory and Random Network Distillation (RND).
Two deep convolutional Q networks (DCQNs) are used for stability: the primary, and the target.
The target network is updated to have the same weights as the primary one every *SAVE_EVERY* frames.
Each state consists of the sensory (image) observation and its three preceding observations.
The DCQN performs multiple convolutions on each of these states for visual pattern learning.
RND is used to give the agent reward for discovering new states (thus exploring the levels).
The fully connected layers (after convolutions) are split into dueling layers which are then re-combined using mean advantage.
Finally, for a slight improvement in training speed, new experiences are sent to and stored in the replay buffer on the GPU.

### Performance on other Games
With slightly different hyperparameter values (and using QNetwork instead of DuelingConvQNetwork), the agent is able to
play Cart-Pole to an average score of 440 after approximately 600 episodes.

### References
#### Code
- https://github.com/seungeunrho/minimalRL/blob/master/dqn.py, released under the MIT licence.
- https://github.com/astooke/rlpyt, released under the MIT licence.
- https://github.com/simoninithomas/Deep_reinforcement_learning_Course/tree/master/RND%20Montezuma's%20revenge%20PyTorch,
released under the MIT licence.
- https://github.com/noagarcia/visdom-tutorial/blob/master/utils.py, licence not provided, presented as open-source tutorial.

#### Papers
- Burda, Y., Edwards, H., Storkey, A. and Klimov, O., 2018. Exploration by random network distillation. arXiv preprint
arXiv:1810.12894.
- Schaul, T., Quan, J., Antonoglou, I. and Silver, D., 2015. Prioritized experience replay. arXiv preprint
arXiv:1511.05952.
- Hasselt, H., 2010. Double Q-learning. Advances in neural information processing systems, 23, pp.2613-2621.
- Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland,
A.K., Ostrovski, G. and Petersen, S., 2015. Human-level control through deep reinforcement learning.
nature, 518(7540), pp.529-533.
- Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M. and Freitas, N., 2016, June. Dueling network architectures
for deep reinforcement learning. In International conference on machine learning (pp. 1995-2003). PMLR.
- Parr, B., 2018. Deep In-GPU Experience Replay. arXiv preprint arXiv:1801.03138.

### Requirements

- OpenAI Gym with Atari Environments
- Numpy
- PyTorch
- Visdom for stat visualisation
- line_profiler for profiling

---

### Imports

In [1]:
# %load_ext line_profiler

import collections
import gym
from gym.wrappers import Monitor, AtariPreprocessing
import random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from visdom import Visdom

---

### Configuration and Hyperparameters

In [2]:
# Hyperparameters
LEARNING_RATE = 1e-4
GAMMA = 0.99

EPSILON_MAX = 0.50
EPSILON_MIN = 0.02
EPSILON_REDUCTION = 1.0 / 2000

SAVE_EVERY = 100
STEPS_PER_TRAIN = 10

BUFFER_SIZE_MAX = 50000
BUFFER_SIZE_MIN = 2000
BUFFER_BATCH_SIZE = 128

INTRINSIC_CLIP = 5.0
INTRINSIC_SAMPLE_SIZE = 1000
UPDATE_PROPORTION = 0.25
INTRINSIC_REWARD_FACTOR = 0.5

EX_SCORE_NORM = 1 / 500  # Extrinsic reward normalising factor
IN_SCORE_NORM = 1 / 500  # Intrinsic reward normalising factor

# Configuration
ENVIRONMENT = "Gravitar-v0"
# ENVIRONMENT = "Breakout-v0"
# ENVIRONMENT = "Gravitar-ram-v0"
# ENVIRONMENT = "CartPole-v1"

ATARI_IMAGE_ENV = ENVIRONMENT in ("Gravitar-v0", "Breakout-v0")

STATE_VALUE_FACTOR = (1.0 / 255.0) if ATARI_IMAGE_ENV else 1.0
STATE_STORAGE_TYPE = torch.uint8 if ATARI_IMAGE_ENV else torch.float

VIDEO_EVERY = 25
OUTPUT_EVERY = 5
MEAN_EVERY = 100

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

---

### Utilities and Helpers

In [3]:
class VisdomLinePlotter:
    """Plots to Visdom"""
    def __init__(self, env_name="main", label_x="epochs"):
        self.vis = Visdom()
        self.env = env_name
        self.plots = {}
        self.label_x = label_x

    def clear_envs(self):
        for env in self.vis.get_env_list():
            self.vis.delete_env(env)

    def plot(self, var_name, split_name, title_name, x, y):
        if var_name not in self.plots:
            self.plots[var_name] = self.vis.line(
                X=np.array([x, x]), Y=np.array([y, y]),
                env=self.env,
                opts=dict(
                    legend=[split_name],
                    title=title_name,
                    xlabel=self.label_x,
                    ylabel=var_name
                )
            )
        else:
            self.vis.line(
                X=np.array([x]), Y=np.array([y]),
                env=self.env,
                win=self.plots[var_name],
                name=split_name,
                update = "append"
            )


line_plotter = VisdomLinePlotter(label_x="Episodes")
line_plotter.clear_envs()

Setting up a new session...


---

### Reinforcement Learning Model

Includes Experience Replay Buffer, Deep-Q model and Agent model.

In [4]:
class ReplayBuffer(object):
    def __init__(self, observation_shape, size_limit: int, min_memory: int, batch_size: int, device):
        self.size_limit: int = size_limit
        self.min_memory: int = min_memory
        self.batch_size: int = batch_size

        self.count: int = 0

        self.states         = torch.zeros((size_limit, *observation_shape), device=device, dtype=STATE_STORAGE_TYPE)
        self.actions        = torch.zeros((size_limit, 1), device=device, dtype=torch.long)
        self.rewards        = torch.zeros((size_limit, 1), device=device, dtype=torch.float)
        self.states_        = torch.zeros((size_limit, *observation_shape), device=device, dtype=STATE_STORAGE_TYPE)
        self.terminal_masks = torch.zeros((size_limit, 1), device=device, dtype=torch.float)
        self.priorities     = torch.ones((size_limit,))

    def put(self, state, action, reward, state_, terminal_mask: float):
        idx = self.count % self.size_limit

        self.states[idx]         = state
        self.actions[idx]        = action
        self.rewards[idx]        = reward
        self.states_[idx]        = state_
        self.terminal_masks[idx] = terminal_mask

        # self.buffer_p[self.pos] = max(self.buffer_p[:len(self)], default=1)

        self.count += 1

    def sample(self):
        probabilities = self.priorities[:len(self)]
        sampled_idx = probabilities.multinomial(num_samples=self.batch_size, replacement=True)

        return \
            self.states[sampled_idx].float(), \
            self.actions[sampled_idx], \
            self.rewards[sampled_idx], \
            self.states_[sampled_idx].float(), \
            self.terminal_masks[sampled_idx]

    def __len__(self):
        return min(self.count, self.size_limit)

    def can_sample(self):
        return len(self) > self.min_memory

class QNetwork(nn.Module):
    def __init__(self, state_shape, n_actions):
        super().__init__()

        obs_length = np.array(state_shape).prod()
        self.model = nn.Sequential(
            nn.Linear(obs_length, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_actions)
        )

    def forward(self, x):
        return self.model(x * STATE_VALUE_FACTOR)

def conv_block(in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, (8, 8), (4, 4)), nn.ReLU(),
        nn.Conv2d(32, 64, (4, 4), (2, 2)), nn.ReLU(),
        nn.Conv2d(64, 64, (3, 3), (1, 1)), nn.ReLU()
    )


class DuelingConvQNetwork(nn.Module):
    def __init__(self, n_actions):
        super().__init__()

        self.conv = conv_block(4)

        self.fc = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_actions)
        )

        self.fc_dueling = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, 1)
        )

    def forward(self, x):
        x = self.conv(x * STATE_VALUE_FACTOR)
        x = x.view(x.shape[0], -1)
        advantages = self.fc(x)
        value = self.fc_dueling(x)
        return value + advantages - advantages.mean(dim=-1, keepdim=True)


class IntrinsicNetwork(nn.Module):
    def __init__(self, obs_shape, is_target):
        super().__init__()

        if ATARI_IMAGE_ENV:
            self.conv = conv_block(4)
            in_features = 64 * 7 * 7
        else:
            in_features = np.array(obs_shape).prod()

        self.fc = nn.Sequential(
            nn.Linear(in_features, 512), nn.ReLU(),
            # Add an additional two fully connected layers if this is the non-target
            *[block for _ in range(0 if is_target else 2) for block in [nn.ReLU(), nn.Linear(512, 512)]]
        )

        for p in self.modules():
            if isinstance(p, (nn.Conv2d, nn.Linear)):
                torch.nn.init.orthogonal_(p.weight, np.sqrt(2))
                p.bias.data.zero_()

        # Disable training for target network
        if is_target:
            for p in self.parameters():
                p.requires_grad = False

    def forward(self, x):
        if ATARI_IMAGE_ENV:
            x = self.conv(x)
        return self.fc(x.view(x.shape[0], -1))

In [5]:
class Agent(object):
    def __init__(
            self, observation_space, action_space, device,
            buffer_size, min_buffer_size, batch_size,
            update_proportion, intrinsic_reward_factor):
        self.observation_space = observation_space
        self.action_space = action_space
        self.device = device
        self.intrinsic_reward_factor = intrinsic_reward_factor

        state_shape = (4, 84, 84) if ATARI_IMAGE_ENV else observation_space.shape

        self.replay_buffer: ReplayBuffer = ReplayBuffer(
            state_shape, buffer_size, min_buffer_size, batch_size, DEVICE)

        self.q_net: nn.Module = \
            (DuelingConvQNetwork(action_space.n) if ATARI_IMAGE_ENV
             else QNetwork(state_shape, action_space.n)).to(device)
        self.q_target: nn.Module = \
            (DuelingConvQNetwork(action_space.n) if ATARI_IMAGE_ENV
             else QNetwork(state_shape, action_space.n)).to(device)
        self.q_target.load_state_dict(self.q_net.state_dict())

        self.f_predictor = IntrinsicNetwork(state_shape, False).to(device)
        self.f_target = IntrinsicNetwork(state_shape, True).to(device)
        self.mse_loss = nn.MSELoss(reduction="none")
        self.update_proportion = update_proportion

        self.optimiser = \
            optim.Adam(list(self.q_net.parameters()) + list(self.f_predictor.parameters()), lr=LEARNING_RATE)

    def remember(self, state, action, reward, state_, done: bool):
        self.replay_buffer.put(state, action, reward, state_, 0.0 if done else 1.0)

    def train_step(self, n_episode):
        if not self.replay_buffer.can_sample():
            return

        last_rnd_loss = 0.0
        last_loss = 0.0
        last_expected_reward = 0.0

        for i in range(STEPS_PER_TRAIN):
            states, actions, rewards, states_, done_masks = self.replay_buffer.sample()

            target_next_feature, predicted_next_feature = self.rnd(states_)

            # --------------- RND
            rnd_loss = self.mse_loss(predicted_next_feature, target_next_feature.detach()).mean(-1)
            mask = torch.rand(len(rnd_loss)).to(self.device)
            mask = mask.__lt__(self.update_proportion).type(torch.FloatTensor).to(self.device)
            rnd_loss = (rnd_loss * mask).sum() / torch.max(mask.sum(), torch.Tensor([1]).to(self.device))

            if i == STEPS_PER_TRAIN - 1:
                last_rnd_loss = rnd_loss.detach().item()
            # ---------------

            q_out = self.q_net(states)
            q_sa = q_out.gather(1, actions)

            max_q_prime = self.q_target(states_).max(1)[0].unsqueeze(1)
            target = rewards + GAMMA * max_q_prime * done_masks

            # print("FL:", forward_loss)
            loss = F.smooth_l1_loss(q_sa, target.detach()) + rnd_loss.detach()

            if i == STEPS_PER_TRAIN - 1:
                last_loss = loss.detach().item()
                last_expected_reward = q_sa.detach().mean().item()

            self.optimiser.zero_grad()
            loss.backward()
            self.optimiser.step()

        # print(last_loss, last_expected_reward)
        line_plotter.plot("Loss", "val", "RND Loss", n_episode, last_rnd_loss)
        line_plotter.plot("Loss", "val", "Network Loss", n_episode, last_loss)
        line_plotter.plot("Expected Reward", "reward", "Expected Return", n_episode, last_expected_reward)

    def rematch_networks(self):
        self.q_target.load_state_dict(self.q_net.state_dict())

    def sample_action(self, state, epsilon):
        # Exploitation vs Exploration
        return \
            self.q_net.forward(state).argmax().item() if random.random() > epsilon else self.action_space.sample()

    def rnd(self, state_):
        target_next_feature = self.f_target(state_ * STATE_VALUE_FACTOR)
        predicted_next_feature = self.f_predictor(state_)
        return target_next_feature, predicted_next_feature

    def intrinsic_reward(self, state_):
        target_next_feature, predicted_next_feature = self.rnd(state_)
        intrinsic_reward = (target_next_feature - predicted_next_feature).pow(2).sum(-1) * self.intrinsic_reward_factor
        return intrinsic_reward.cpu().detach().item()

---

### Training Loop

In [6]:
def main():
    with gym.make(ENVIRONMENT) as env:
        if ATARI_IMAGE_ENV:
            env = AtariPreprocessing(env, frame_skip=1)  # Gravitar-v0 already skips frames

        env = Monitor(
            env,
            "./video",
            video_callable=lambda episode_id: not episode_id % VIDEO_EVERY and episode_id,
            force=True
        )

        #region Random seed initialisation for reproducible environment - do not change
        seed = 742
        torch.manual_seed(seed)
        env.seed(seed)
        random.seed(seed)
        np.random.seed(seed)
        env.action_space.seed(seed)
        #endregion

        agent: Agent = Agent(
            env.observation_space, env.action_space, DEVICE,
            BUFFER_SIZE_MAX, BUFFER_SIZE_MIN, BUFFER_BATCH_SIZE,
            UPDATE_PROPORTION, INTRINSIC_REWARD_FACTOR
        )

        marking = []
        score_history =  collections.deque(maxlen=MEAN_EVERY)
        normalised_intrinsic =  collections.deque(maxlen=INTRINSIC_SAMPLE_SIZE)

        line_plotter.plot("Score", "max", "Agent Score", 0.0, 0.0)
        line_plotter.plot( "Score", "min", "Agent Score", 0.0, 0.0)
        line_plotter.plot("Score", "val", "Agent Score", 0.0, 0.0)
        line_plotter.plot("Score", "average", "Agent Score", 0.0, 0.0)

        best_v_score = 0
        best_v_ep = 0

        intrinsic_rewards =  collections.deque(maxlen=INTRINSIC_SAMPLE_SIZE)

        for n_episode in range(int(1e32)):
            epsilon = max(EPSILON_MIN, EPSILON_MAX - n_episode * EPSILON_REDUCTION)
            score = 0.0

            if ATARI_IMAGE_ENV:
                state = torch.cat([torch.as_tensor(env.reset(), dtype=torch.uint8, device=DEVICE).unsqueeze(0)] * 4)
            else:
                state = torch.as_tensor(env.reset(), dtype=torch.float, device=DEVICE)

            while True:
                # Sample an action from the agent and use it
                chosen_action = agent.sample_action(state.unsqueeze(0).float(), epsilon)
                obs_, extrinsic_reward, done, info = env.step(chosen_action)

                # Convert observation to state
                if ATARI_IMAGE_ENV:
                    state_ = torch.cat(
                        (state[1:], torch.as_tensor(obs_, dtype=torch.uint8, device=DEVICE).unsqueeze(0)))
                else:
                    state_ = torch.as_tensor(obs_, dtype=torch.float, device=DEVICE)

                # RND intrinsic reward calculation
                intrinsic_reward = agent.intrinsic_reward(state_.unsqueeze(0).float())
                intrinsic_rewards.append(intrinsic_reward)
                std = np.std(intrinsic_rewards)
                if std != 0.0:
                    intrinsic_reward = \
                        ((intrinsic_reward - np.mean(intrinsic_rewards)) / std) \
                        .clip(-INTRINSIC_CLIP, INTRINSIC_CLIP)
                    normalised_intrinsic.append(intrinsic_reward)

                # Combine intrinsic and extrinsic rewards
                reward = intrinsic_reward * IN_SCORE_NORM + extrinsic_reward * EX_SCORE_NORM

                # Store the experience
                agent.remember(state, chosen_action, reward, state_, done)

                state = state_

                score += extrinsic_reward
                if done:
                    break

            score_history.append(score)

            agent.train_step(n_episode)

            if not (n_episode % SAVE_EVERY) and n_episode:
                print("Matching network values")
                agent.rematch_networks()

            if n_episode:
                line_plotter.plot("Score", "val", "Agent Score", n_episode, score)
                line_plotter.plot("Reward", "val", "Average Intrinsic Reward", n_episode, np.mean(normalised_intrinsic))
                line_plotter.plot("Epsilon", "val", "Epsilon", n_episode, epsilon)

                if not n_episode % MEAN_EVERY:
                    line_plotter.plot("Score", "average", "Agent Score", n_episode, np.array(score_history).mean())
                    max_score = np.array(score_history).max(initial=0)
                    line_plotter.plot("Score", "max", "Agent Score", n_episode, max_score)
                    line_plotter.plot(
                        "Score", "min", "Agent Score", n_episode, np.array(score_history).min(initial=max_score))

                if not n_episode % VIDEO_EVERY:
                    if score > best_v_score:
                        best_v_score = score
                        best_v_ep = n_episode
                    print(f"video: {n_episode}, score: {score:.0f} (best {best_v_score} at episode {best_v_ep})")
                if not n_episode % OUTPUT_EVERY:
                    print(f"episode: {n_episode}, "
                          f"score: {score:.0f}, "
                          f"epsilon: {epsilon:.3f}" +
                          (f", buffer size: {len(agent.replay_buffer)}"
                           if len(agent.replay_buffer) != BUFFER_SIZE_MAX else ""))
                          # (f", loss: {mean_loss:.2g}" if mean_loss is not None else ""))

            #region Submission log marking - do not change
            marking.append(score)
            if n_episode % 100 == 0:
                print("marking, episode: {}, score: {:.1f}, mean_score: {:.2f}, std_score: {:.2f}".format(
                    n_episode, score, np.array(marking).mean(), np.array(marking).std()))
                marking = []
            #endregion

In [7]:
main()
# %lprun -f ReplayBuffer.put -f ReplayBuffer.sample -f Agent.train_step -f main main()

marking, episode: 0, score: 0.0, mean_score: 0.00, std_score: 0.00
episode: 5, score: 500, epsilon: 0.497, buffer size: 5156
episode: 10, score: 250, epsilon: 0.495, buffer size: 9122
episode: 15, score: 0, epsilon: 0.492, buffer size: 13876
episode: 20, score: 100, epsilon: 0.490, buffer size: 18807
video: 25, score: 700 (best 700.0 at episode 25)
episode: 25, score: 700, epsilon: 0.487, buffer size: 23248
episode: 30, score: 0, epsilon: 0.485, buffer size: 28278
episode: 35, score: 100, epsilon: 0.482, buffer size: 32323
episode: 40, score: 0, epsilon: 0.480, buffer size: 36880
episode: 45, score: 200, epsilon: 0.477, buffer size: 42170
video: 50, score: 0 (best 700.0 at episode 25)
episode: 50, score: 0, epsilon: 0.475, buffer size: 46389
episode: 55, score: 0, epsilon: 0.472
episode: 60, score: 0, epsilon: 0.470
episode: 65, score: 0, epsilon: 0.468
episode: 70, score: 0, epsilon: 0.465
video: 75, score: 0 (best 700.0 at episode 25)
episode: 75, score: 0, epsilon: 0.463
episode: 80

KeyboardInterrupt: 

---