# DQN implementation with PyTorch using PongNoFrameskip-v4 benchmark.

In this notebook, we implement a Deep Q-Network (DQN), a reinforcement learning algorithm, using `PyTorch`.  
The code is based on [jmichaux/dqn-pytorch](https://github.com/jmichaux/dqn-pytorch).

## Setup

In [0]:
!apt-get install -y cmake zlib1g-dev libjpeg-dev xvfb ffmpeg xorg-dev python-opengl libboost-all-dev libsdl2-dev swig freeglut3-dev
!pip install -U gym 'gym[atari]' 'pyglet==1.3.2' imageio pillow pyvirtualdisplay pyopengl scipy JSAnimation opencv-python h5py pyyaml hyperdash
!apt-get install xvfb

Reading package lists... Done
Building dependency tree       
Reading state information... Done
libjpeg-dev is already the newest version (8c-2ubuntu8).
libjpeg-dev set to manually installed.
zlib1g-dev is already the newest version (1:1.2.11.dfsg-0ubuntu2).
zlib1g-dev set to manually installed.
freeglut3-dev is already the newest version (2.8.1-3).
freeglut3-dev set to manually installed.
libboost-all-dev is already the newest version (1.65.1.0ubuntu1).
cmake is already the newest version (3.10.2-1ubuntu2.18.04.1).
ffmpeg is already the newest version (7:3.4.6-0ubuntu0.18.04.1).
The following additional packages will be installed:
  gir1.2-ibus-1.0 libcapnp-0.6.1 libdbus-1-dev libdmx-dev libdmx1
  libfontenc-dev libfs-dev libfs6 libibus-1.0-5 libibus-1.0-dev
  libmirclient-dev libmirclient9 libmircommon-dev libmircommon7
  libmircookie-dev libmircookie2 libmircore-dev libmircore1 libmirprotobuf3
  libpciaccess-dev libpixman-1-dev libprotobuf-dev libprotobuf-lite10
  libpulse-dev libpu

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
!cp /content/drive/My\ Drive/Colab\ Notebooks/MT/Utils/xdpyinfo /usr/bin/
!cp /content/drive/My\ Drive/Colab\ Notebooks/MT/Utils/libXxf86dga.* /usr/lib/x86_64-linux-gnu/
!chmod +x /usr/bin/xdpyinfo

In [0]:
!hyperdash signup --github

Opening browser, please wait. If something goes wrong, press CTRL+C to cancel.
SSH'd into a remote machine, or just don't have access to a browser? Open this link in any browser and then copy/paste the provided access token: https://hyperdash.io/oauth/github/start?state=client_cli_manual
Waiting for Github OAuth to complete.
If something goes wrong, press CTRL+C to cancel.
Access token: ktrC7LtWkkeiMvcg/FRs6hul5/35VNJDMHJW0EHI6Qo=
Successfully logged in! We also installed: zYfW8JXjr7jfEmSCNZouY368BmbAKOp1VUkGwXNX5Ck= as your default API key


## Package Import

In [0]:
import copy
from collections import namedtuple
from itertools import count
import math
import random
import numpy as np
import os
import time

import gym
from collections import deque
from hyperdash import Experiment
import cv2

import pyvirtualdisplay
import base64
import IPython

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T

## Hyper parameters

In [0]:
# Runtime settings
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")    
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))
cv2.ocl.setUseOpenCL(False)
time_stamp = str(int(time.time()))
random.seed(0)
np.random.seed(0)

# Hyper parameters
BATCH_SIZE = 32 # @param
GAMMA = 0.99 # @param
EPS_START = 1 # @param
EPS_END = 0.02 # @param
EPS_DECAY = 1000000 # @param
TARGET_UPDATE = 1000 # @param
DEFAULT_DURABILITY = 10 # @param
LEARNING_RATE = 1e-4 # @param
INITIAL_MEMORY = 10000 # @param
MEMORY_SIZE = 10 * INITIAL_MEMORY # @param
AGENT_N = 5 # @param

# Some settings
ENV_NAME = "PongNoFrameskip-v4" # @param
EXP_NAME = "PongNoFrameskip-v4_" + time_stamp # @param
TRAIN_LOG_FILE_NAME = ENV_NAME + "_train_" + time_stamp + ".log" # @param
TEST_LOG_FILE_NAME = ENV_NAME + "_test_" + time_stamp + ".log" # @param
RENDER = False # @param
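
With `EPS_DECAY = 1000000`, the exploration rate falls slowly. As a rough illustration, the sketch below evaluates the same exponential schedule that `select_action` uses later in the notebook. The helper name `epsilon_at` is hypothetical, and the constants are copied from the cell above so the snippet runs on its own.

```python
import math

# Constants copied from the hyperparameter cell above.
EPS_START = 1
EPS_END = 0.02
EPS_DECAY = 1000000


def epsilon_at(step):
    """Exploration rate after `step` environment steps (hypothetical helper)."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-1.0 * step / EPS_DECAY)


# Print the exploration rate at a few points in training.
for step in (0, 100_000, 1_000_000, 5_000_000):
    print(step, round(epsilon_at(step), 3))
```

Under these settings the agent still picks a random action about 38% of the time after one million steps, so a smaller `EPS_DECAY` may be worth trying for short runs.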

## Define the Replay memory

In [0]:
class ReplayMemory(object):
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0
        
    def push(self, *args):
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)
    
    def __len__(self):
        return len(self.memory)


class PrioritizedReplay(object):
    def __init__(self, capacity):
        pass
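
`ReplayMemory` is a fixed-size ring buffer: once `capacity` is reached, each new transition overwrites the oldest one. The sketch below reproduces that logic under hypothetical names (`RingBuffer`, `Step`) so it can run without the rest of the notebook, and uses integers in place of real frames.

```python
import random
from collections import namedtuple

# Hypothetical stand-in for the notebook's Transition tuple.
Step = namedtuple('Step', ('state', 'action', 'next_state', 'reward'))


class RingBuffer:
    """Same push/sample logic as ReplayMemory, under a different name."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, *args):
        # Grow the list until it reaches capacity, then overwrite in place.
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Step(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


buffer = RingBuffer(capacity=3)
for i in range(5):
    # Pushes 3 and 4 overwrite the slots that held states 0 and 1.
    buffer.push(i, 0, i + 1, 0.0)

stored_states = [t.state for t in buffer.memory]
batch = buffer.sample(2)
print(stored_states, len(batch))
```

Sampling is uniform and without replacement, so `batch_size` must not exceed the number of stored transitions.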

## Define the DQNs

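Now we define two variants of the DQN. `DQN` is a plain Q-network with three convolutional layers followed by two fully connected layers. `DQNbn` has the same architecture with batch normalization added after each convolutional layer.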

In [0]:
class DQNbn(nn.Module):
    def __init__(self, in_channels=4, n_actions=14):
        """
        Initialize Deep Q Network
        Args:
            in_channels (int): number of input channels
            n_actions (int): number of outputs
        """
        super(DQNbn, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=8, stride=4)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.fc4 = nn.Linear(7 * 7 * 64, 512)
        self.head = nn.Linear(512, n_actions)
        
    def forward(self, x):
        x = x.float() / 255
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = F.relu(self.fc4(x.view(x.size(0), -1)))
        return self.head(x)


class DQN(nn.Module):
    def __init__(self, in_channels=4, n_actions=14):
        """
        Initialize Deep Q Network
        Args:
            in_channels (int): number of input channels
            n_actions (int): number of outputs
        """
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=8, stride=4)
        # self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        # self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        # self.bn3 = nn.BatchNorm2d(64)
        self.fc4 = nn.Linear(7 * 7 * 64, 512)
        self.head = nn.Linear(512, n_actions)
        
    def forward(self, x):
        x = x.float() / 255
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.fc4(x.view(x.size(0), -1)))
        return self.head(x)
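
The `7 * 7 * 64` input size of `fc4` follows from the convolution arithmetic on 84x84 frames. A minimal sketch of that calculation, assuming no padding and no dilation (the `nn.Conv2d` defaults used above); the helper name `conv_out` is hypothetical:

```python
def conv_out(size, kernel_size, stride):
    """Output width of an unpadded convolution (floor division)."""
    return (size - kernel_size) // stride + 1


size = 84                                        # WarpFrame produces 84x84 frames
size = conv_out(size, kernel_size=8, stride=4)   # after conv1
size = conv_out(size, kernel_size=4, stride=2)   # after conv2
size = conv_out(size, kernel_size=3, stride=1)   # after conv3
flat_features = size * size * 64                 # conv3 has 64 output channels
print(size, flat_features)
```

If the input resolution or any kernel or stride changes, the first argument of `nn.Linear` in `fc4` has to change accordingly.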

## Define the Agent

In [0]:
class Agent:
    def __init__(self, policy_net, target_net, durability, optimizer):
        self.policy_net = policy_net
        self.target_net = target_net
        self.target_net.load_state_dict(policy_net.state_dict())
        self.durability = durability
        self.optimizer = optimizer
    

    def select_action(self, state):
        """Pick an action with an epsilon-greedy policy."""
        global steps_done
        sample = random.random()
        # Epsilon decays exponentially from EPS_START towards EPS_END.
        eps_threshold = EPS_END + (EPS_START - EPS_END) * \
            math.exp(-1. * steps_done / EPS_DECAY)
        steps_done += 1
        if sample > eps_threshold:
            with torch.no_grad():
                # Greedy action: index of the largest Q-value.
                return self.policy_net(state.to(device)).max(1)[1].view(1, 1)
        else:
            return torch.tensor([[random.randrange(4)]], device=device, dtype=torch.long)

    
    def optimize_model(self):
        """Run one gradient step on a minibatch from the global replay memory."""
        if len(memory) < BATCH_SIZE:
            return
        transitions = memory.sample(BATCH_SIZE)
        """
        zip(*transitions) unzips the transitions into
        Transition(*) creates new named tuple
        batch.state - tuple of all the states (each state is a tensor)
        batch.next_state - tuple of all the next states (each state is a tensor)
        batch.reward - tuple of all the rewards (each reward is a float)
        batch.action - tuple of all the actions (each action is an int)    
        """
        batch = Transition(*zip(*transitions))

        actions = tuple((map(lambda a: torch.tensor([[a]], device=device), batch.action)))
        rewards = tuple((map(lambda r: torch.tensor([r], device=device), batch.reward)))

        # Non-zero where the transition did not end the episode.
        non_final_mask = torch.tensor(
            tuple(map(lambda s: s is not None, batch.next_state)),
            device=device, dtype=torch.uint8)

        non_final_next_states = torch.cat([s for s in batch.next_state
                                        if s is not None]).to(device)

        state_batch = torch.cat(batch.state).to(device)
        action_batch = torch.cat(actions)
        reward_batch = torch.cat(rewards)

        # Q(s, a) for the actions that were actually taken.
        state_action_values = self.policy_net(state_batch).gather(1, action_batch)

        # max over a' of Q_target(s', a'); stays zero for terminal transitions.
        next_state_values = torch.zeros(BATCH_SIZE, device=device)
        next_state_values[non_final_mask] = self.target_net(non_final_next_states).max(1)[0].detach()
        expected_state_action_values = (next_state_values * GAMMA) + reward_batch

        loss = F.smooth_l1_loss(state_action_values, expected_state_action_values.unsqueeze(1))

        self.optimizer.zero_grad()
        loss.backward()
        # Clip each gradient element to [-1, 1].
        for param in self.policy_net.parameters():
            param.grad.data.clamp_(-1, 1)
        self.optimizer.step()


    def get_state(self, obs):
        """Convert an (H, W, C) observation into a (1, C, H, W) tensor."""
        state = np.array(obs)
        state = state.transpose((2, 0, 1))
        state = torch.from_numpy(state)
        return state.unsqueeze(0)


    def get_durability(self):
        return self.durability
    

    def reduce_durability(self, value):
        self.durability = self.durability - value
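
The target that `optimize_model` regresses towards is the one-step bootstrapped return, and the loss is the smooth L1 (Huber) loss. A scalar sketch in plain Python may make the tensor code easier to follow. The function names `td_target` and `huber` are hypothetical, and `beta = 1.0` is assumed to match the default behaviour of `F.smooth_l1_loss`.

```python
GAMMA = 0.99  # copied from the hyperparameter cell


def td_target(reward, next_q_values, gamma=GAMMA):
    """One-step target: r + gamma * max over a' of Q_target(s', a').

    `next_q_values` is None for a terminal transition, in which case the
    bootstrap term is zero (this mirrors `non_final_mask`).
    """
    bootstrap = 0.0 if next_q_values is None else max(next_q_values)
    return reward + gamma * bootstrap


def huber(prediction, target, beta=1.0):
    """Smooth L1 loss for one value: quadratic near zero, linear elsewhere."""
    diff = abs(prediction - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


target_mid = td_target(0.0, [0.5, 1.0, -0.2])  # non-terminal transition
target_end = td_target(-1.0, None)             # terminal transition
loss_small = huber(0.9, target_mid)            # quadratic branch
loss_large = huber(2.0, target_end)            # linear branch
print(target_mid, target_end, loss_small, loss_large)
```

The linear branch limits the size of the gradient for large errors, which complements the element-wise gradient clipping at the end of `optimize_model`.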

## Define the Environment

In [0]:
def make_env(env, stack_frames=True, episodic_life=True, clip_rewards=False, scale=False):
    if episodic_life:
        env = EpisodicLifeEnv(env)

    env = NoopResetEnv(env, noop_max=30)
    env = MaxAndSkipEnv(env, skip=4)
    if 'FIRE' in env.unwrapped.get_action_meanings():
        env = FireResetEnv(env)

    env = WarpFrame(env)
    if stack_frames:
        env = FrameStack(env, 4)
    if clip_rewards:
        env = ClipRewardEnv(env)
    return env

class RewardScaler(gym.RewardWrapper):

    def reward(self, reward):
        return reward * 0.1


class ClipRewardEnv(gym.RewardWrapper):
    def __init__(self, env):
        gym.RewardWrapper.__init__(self, env)

    def reward(self, reward):
        """Bin reward to {+1, 0, -1} by its sign."""
        return np.sign(reward)


class LazyFrames(object):
    def __init__(self, frames):
        """This object ensures that common frames between the observations are only stored once.
        It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay
        buffers.
        This object should only be converted to numpy array before being passed to the model.
        You'd not believe how complex the previous solution was."""
        self._frames = frames
        self._out = None

    def _force(self):
        if self._out is None:
            self._out = np.concatenate(self._frames, axis=2)
            self._frames = None
        return self._out

    def __array__(self, dtype=None):
        out = self._force()
        if dtype is not None:
            out = out.astype(dtype)
        return out

    def __len__(self):
        return len(self._force())

    def __getitem__(self, i):
        return self._force()[i]

class FrameStack(gym.Wrapper):
    def __init__(self, env, k):
        """Stack k last frames.
        Returns lazy array, which is much more memory efficient.
        See Also
        --------
        baselines.common.atari_wrappers.LazyFrames
        """
        gym.Wrapper.__init__(self, env)
        self.k = k
        self.frames = deque([], maxlen=k)
        shp = env.observation_space.shape
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=env.observation_space.dtype)

    def reset(self):
        ob = self.env.reset()
        for _ in range(self.k):
            self.frames.append(ob)
        return self._get_ob()

    def step(self, action):
        ob, reward, done, info = self.env.step(action)
        self.frames.append(ob)
        return self._get_ob(), reward, done, info

    def _get_ob(self):
        assert len(self.frames) == self.k
        return LazyFrames(list(self.frames))


class WarpFrame(gym.ObservationWrapper):
    def __init__(self, env):
        """Warp frames to 84x84 as done in the Nature paper and later work."""
        gym.ObservationWrapper.__init__(self, env)
        self.width = 84
        self.height = 84
        self.observation_space = gym.spaces.Box(low=0, high=255,
            shape=(self.height, self.width, 1), dtype=np.uint8)

    def observation(self, frame):
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
        return frame[:, :, None]


class FireResetEnv(gym.Wrapper):
    def __init__(self, env=None):
        """For environments where the user need to press FIRE for the game to start."""
        super(FireResetEnv, self).__init__(env)
        assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
        assert len(env.unwrapped.get_action_meanings()) >= 3

    def step(self, action):
        return self.env.step(action)

    def reset(self):
        self.env.reset()
        obs, _, done, _ = self.env.step(1)
        if done:
            self.env.reset()
        obs, _, done, _ = self.env.step(2)
        if done:
            self.env.reset()
        return obs


class EpisodicLifeEnv(gym.Wrapper):
    def __init__(self, env=None):
        """Make end-of-life == end-of-episode, but only reset on true game over.
        Done by DeepMind for the DQN and co. since it helps value estimation.
        """
        super(EpisodicLifeEnv, self).__init__(env)
        self.lives = 0
        self.was_real_done = True
        self.was_real_reset = False

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.was_real_done = done
        # check current lives, make loss of life terminal,
        # then update lives to handle bonus lives
        lives = self.env.unwrapped.ale.lives()
        if lives < self.lives and lives > 0:
            # for Qbert somtimes we stay in lives == 0 condtion for a few frames
            # so its important to keep lives > 0, so that we only reset once
            # the environment advertises done.
            done = True
        self.lives = lives
        return obs, reward, done, info

    def reset(self):
        """Reset only when lives are exhausted.
        This way all states are still reachable even though lives are episodic,
        and the learner need not know about any of this behind-the-scenes.
        """
        if self.was_real_done:
            obs = self.env.reset()
            self.was_real_reset = True
        else:
            # no-op step to advance from terminal/lost life state
            obs, _, _, _ = self.env.step(0)
            self.was_real_reset = False
        self.lives = self.env.unwrapped.ale.lives()
        return obs


class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env=None, skip=4):
        """Return only every `skip`-th frame"""
        super(MaxAndSkipEnv, self).__init__(env)
        # most recent raw observations (for max pooling across time steps)
        self._obs_buffer = deque(maxlen=2)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        done = None
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            self._obs_buffer.append(obs)
            total_reward += reward
            if done:
                break

        max_frame = np.max(np.stack(self._obs_buffer), axis=0)

        return max_frame, total_reward, done, info

    def reset(self):
        """Clear past frame buffer and init. to first obs. from inner env."""
        self._obs_buffer.clear()
        obs = self.env.reset()
        self._obs_buffer.append(obs)
        return obs

class NoopResetEnv(gym.Wrapper):
    def __init__(self, env=None, noop_max=30):
        """Sample initial states by taking random number of no-ops on reset.
        No-op is assumed to be action 0.
        """
        super(NoopResetEnv, self).__init__(env)
        self.noop_max = noop_max
        self.override_num_noops = None
        assert env.unwrapped.get_action_meanings()[0] == 'NOOP'

    def step(self, action):
        return self.env.step(action)

    def reset(self):
        """ Do no-op action for a number of steps in [1, noop_max]."""
        self.env.reset()
        if self.override_num_noops is not None:
            noops = self.override_num_noops
        else:
            noops = np.random.randint(1, self.noop_max + 1)
        assert noops > 0
        obs = None
        for _ in range(noops):
            obs, _, done, _ = self.env.step(0)
            if done:
                obs = self.env.reset()
        return obs
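
`FrameStack` keeps the last `k` frames in a `deque` with `maxlen=k`, so appending a new frame silently drops the oldest. The sketch below shows that behaviour with strings standing in for frames; the module-level `reset` and `step` functions are hypothetical simplifications of the wrapper methods.

```python
from collections import deque

K = 4  # number of stacked frames, as in FrameStack(env, 4)
frames = deque([], maxlen=K)


def reset(first_frame):
    """Fill the stack with K copies of the first frame."""
    for _ in range(K):
        frames.append(first_frame)
    return list(frames)


def step(new_frame):
    """Append the newest frame; the deque drops the oldest one."""
    frames.append(new_frame)
    return list(frames)


obs0 = reset('f0')
obs1 = step('f1')
obs2 = step('f2')
print(obs0, obs1, obs2)
```

In the real wrapper each frame has shape `(84, 84, 1)`, `LazyFrames` concatenates the four frames along the last axis into `(84, 84, 4)`, and `get_state` then transposes the result to channel-first order for the network.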

## Deprecated code

**These functions have moved to the `Agent` class.**

In [0]:
# Deprecated: superseded by Agent.select_action (train() still calls this version).
def select_action(state):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END)* \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            return policy_net(state.to(device)).max(1)[1].view(1,1)
    else:
        return torch.tensor([[random.randrange(4)]], device=device, dtype=torch.long)


# Deprecated: superseded by Agent.optimize_model (train() still calls this version).
def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    """
    zip(*transitions) unzips the transitions into
    Transition(*) creates new named tuple
    batch.state - tuple of all the states (each state is a tensor)
    batch.next_state - tuple of all the next states (each state is a tensor)
    batch.reward - tuple of all the rewards (each reward is a float)
    batch.action - tuple of all the actions (each action is an int)    
    """
    batch = Transition(*zip(*transitions))
    
    actions = tuple((map(lambda a: torch.tensor([[a]], device=device), batch.action)))
    rewards = tuple((map(lambda r: torch.tensor([r], device=device), batch.reward)))

    non_final_mask = torch.tensor(
        tuple(map(lambda s: s is not None, batch.next_state)),
        device=device, dtype=torch.uint8)

    non_final_next_states = torch.cat([s for s in batch.next_state
                                       if s is not None]).to(device)

    state_batch = torch.cat(batch.state).to(device)
    action_batch = torch.cat(actions)
    reward_batch = torch.cat(rewards)
    
    state_action_values = policy_net(state_batch).gather(1, action_batch)
    
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch
    
    loss = F.smooth_l1_loss(state_action_values, expected_state_action_values.unsqueeze(1))
    
    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()


# Deprecated: superseded by Agent.get_state (train() and test() still call this version).
def get_state(obs):
    state = np.array(obs)
    state = state.transpose((2, 0, 1))
    state = torch.from_numpy(state)
    return state.unsqueeze(0)

## Define the train steps

In my research, I am extending this code to be multi-agent (**Note**: multi-agent here means multiple independent agents sharing one task environment).

In [0]:
# TODO: replace the deprecated functions with the Agent class methods
def train(env, n_episodes, exp, render=False):
    for episode in range(n_episodes):
        obs = env.reset()
        state = get_state(obs)
        total_reward = 0.0
        for t in count():
            action = select_action(state)

            if render:
                env.render()

            obs, reward, done, info = env.step(action)

            total_reward += reward

            if not done:
                next_state = get_state(obs)
            else:
                next_state = None

            reward = torch.tensor([reward], device=device)

            memory.push(state, action.to('cpu'), next_state, reward.to('cpu'))
            state = next_state

            if steps_done > INITIAL_MEMORY:
                optimize_model()

                if steps_done % TARGET_UPDATE == 0:
                    target_net.load_state_dict(policy_net.state_dict())

            if done:
                break
        
        exp.metric("total_reword", total_reward)
        out_str = 'Total steps: {} \t Episode: {}/{} \t Total reward: {}'.format(
            steps_done, episode, t, total_reward)
        if episode % 20 == 0:
            print(out_str)
            out_str = str("\n" + out_str + "\n")
            exp.log(out_str)
        else:
            # print(out_str)
            exp.log(out_str)
        # Append so that earlier episodes are not overwritten.
        with open(TRAIN_LOG_FILE_NAME, 'at') as f:
            f.write(out_str + "\n")
    env.close()
    return
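
Two counters control when learning happens in `train`: optimization starts only after `INITIAL_MEMORY` steps have been collected, and the target network is synchronised every `TARGET_UPDATE` steps after that. The sketch below isolates that timing logic; the helper name `schedule` is hypothetical and the constants are copied from the hyperparameter cell.

```python
INITIAL_MEMORY = 10000  # copied from the hyperparameter cell
TARGET_UPDATE = 1000


def schedule(steps_done):
    """Return (optimize, sync_target) flags for a given global step count."""
    optimize = steps_done > INITIAL_MEMORY
    # The sync check is nested inside the warm-up check in train().
    sync_target = optimize and steps_done % TARGET_UPDATE == 0
    return optimize, sync_target


first_optimize = next(s for s in range(1, 20000) if schedule(s)[0])
first_sync = next(s for s in range(1, 20000) if schedule(s)[1])
print(first_optimize, first_sync)
```

Because the sync check sits inside the warm-up condition, the target network keeps its initial weights until the first multiple of `TARGET_UPDATE` after the warm-up.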

## Define the test steps

In [0]:
# TODO: replace the deprecated functions with the Agent class methods
def test(env, n_episodes, policy, exp, render=True):
    env = gym.wrappers.Monitor(env, './videos/' + 'dqn_pong_video')
    for episode in range(n_episodes):
        obs = env.reset()
        state = get_state(obs)
        total_reward = 0.0
        for t in count():
            action = policy(state.to(device)).max(1)[1].view(1,1)

            if render:
                env.render()
                time.sleep(0.02)

            obs, reward, done, info = env.step(action)

            total_reward += reward

            if not done:
                next_state = get_state(obs)
            else:
                next_state = None

            state = next_state

            if done:
                out_str = "Finished Episode {} with reward {}".format(
                    episode, total_reward)
                print(out_str)
                exp.log(out_str)
                # Append so that earlier episodes are not overwritten.
                with open(TEST_LOG_FILE_NAME, 'at') as f:
                    f.write(out_str + "\n")
                break

    env.close()
    return

## Main steps

In [0]:
# Create Agent
agents = []

policy_net_0 = DQN(n_actions=4).to(device)
target_net_0 = DQN(n_actions=4).to(device)
agents.append(Agent(policy_net_0, target_net_0, DEFAULT_DURABILITY,
                    optim.Adam(policy_net_0.parameters(), lr=LEARNING_RATE)))

AGENT_N = len(agents)

In [0]:
# create networks
policy_net = DQN(n_actions=4).to(device)
target_net = DQN(n_actions=4).to(device)
target_net.load_state_dict(policy_net.state_dict())

IncompatibleKeys(missing_keys=[], unexpected_keys=[])

In [0]:
# setup optimizer
optimizer = optim.Adam(policy_net.parameters(), lr=LEARNING_RATE)

steps_done = 0

# create environment
env = gym.make(ENV_NAME)
env = make_env(env)

# initialize replay memory
memory = ReplayMemory(MEMORY_SIZE)

# Hyperdash experiment
exp = Experiment(EXP_NAME, capture_io=False)
print("Learning rate:{}".format(LEARNING_RATE))
exp.param("Learning rate", LEARNING_RATE)
exp.param("Environment", ENV_NAME)
exp.param("Batch size", BATCH_SIZE)
exp.param("Gamma", GAMMA)
exp.param("Epsilon start", EPS_START)
exp.param("Epsilon end", EPS_END)
exp.param("Epsilon decay", EPS_DECAY)
exp.param("Target update", TARGET_UPDATE)
exp.param("Render", str(RENDER))
exp.param("Initial memory", INITIAL_MEMORY)
exp.param("Memory size", MEMORY_SIZE)

In [0]:
# train model
train(env, 400, exp)
exp.end()

torch.save(policy_net, "dqn_pong_model")

policy_net = torch.load("dqn_pong_model")
exp_test = Experiment(str(EXP_NAME + "_test_step"), capture_io=False)
test(env, 1, policy_net, exp_test, render=False)
exp_test.end()

| total_reword: -21.000000 |
Total steps: 821 	 Episode: 0/820 	 Total reward: -21.0

Total steps: 821 	 Episode: 0/820 	 Total reward: -21.0

| total_reword: -21.000000 |
Total steps: 1580 	 Episode: 1/758 	 Total reward: -21.0
Total steps: 1580 	 Episode: 1/758 	 Total reward: -21.0
| total_reword: -20.000000 |
Total steps: 2522 	 Episode: 2/941 	 Total reward: -20.0
Total steps: 2522 	 Episode: 2/941 	 Total reward: -20.0
| total_reword: -21.000000 |
Total steps: 3421 	 Episode: 3/898 	 Total reward: -21.0
Total steps: 3421 	 Episode: 3/898 	 Total reward: -21.0
| total_reword: -21.000000 |
Total steps: 4326 	 Episode: 4/904 	 Total reward: -21.0
Total steps: 4326 	 Episode: 4/904 	 Total reward: -21.0
| total_reword: -19.000000 |
Total steps: 5333 	 Episode: 5/1006 	 Total reward: -19.0
Total steps: 5333 	 Episode: 5/1006 	 Total reward: -19.0
| total_reword: -20.000000 |
Total steps: 6309 	 Episode: 6/975 	 Total reward: -20.0
Total steps: 6309 	 Episode: 6/975 	 Total reward: -20

  "type " + obj.__name__ + ". It won't be checked "


Finished Episode 0 with reward -1.0
Finished Episode 0 with reward -1.0
This run of Pong-test_1567682751_test_step ran for 0:00:19 and logs are available locally at: /root/.hyperdash/logs/pong-test-1567682751-test-step/pong-test-1567682751-test-step_2019-09-05t13-39-21-303112.log


## Video visualization

In [0]:
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()
os.environ["DISPLAY"] = ":" + str(display.display) + "." + str(display.screen)

In [0]:
def embed_mp4(filename):
    """Embeds an mp4 file in the notebook."""
    
    with open(filename, 'rb') as f:
        video = f.read()
    b64 = base64.b64encode(video)
    tag = '''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4">
    Your browser does not support the video tag.
    </video>'''.format(b64.decode())

    return IPython.display.HTML(tag)
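
The file name written by the video recorder appears to include a run-specific number (`0.122` in the call below), so a hard-coded path may not exist on another run. One possible workaround, sketched here with a hypothetical helper `find_videos`, is to list the `.mp4` files in the output directory and pass one of them to `embed_mp4`.

```python
import glob
import os


def find_videos(directory):
    """Return the .mp4 files in `directory`, sorted by name (hypothetical helper)."""
    return sorted(glob.glob(os.path.join(directory, '*.mp4')))


# Example: list recordings in the directory used by test(); may be empty.
videos = find_videos('./videos/dqn_pong_video')
print(videos)
```

If the list is non-empty, `embed_mp4(videos[0])` would show the first recording.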

In [0]:
embed_mp4("/content/videos/dqn_pong_video/openaigym.video.0.122.video000000.mp4")

In [0]:
# !mv /content/drive/My\ Drive/Colab\ Notebooks/MT/pong_videos /content/drive/My\ Drive/Colab\ Notebooks/MT/pong_videos_1567682751
# !mv /content/dqn_pong_model /content/drive/My\ Drive/Colab\ Notebooks/MT/pong_videos_1567682751/

In [0]:
!mkdir /content/drive/My\ Drive/Colab\ Notebooks/MT/pong_videos_1568005544

In [0]:
!mv ./PongNoFrameskip-v4_*.log /content/drive/My\ Drive/Colab\ Notebooks/MT/pong_videos_1568005544/
!mv ./dqn_pong_model /content/drive/My\ Drive/Colab\ Notebooks/MT/pong_videos_1568005544/
!mv ./videos /content/drive/My\ Drive/Colab\ Notebooks/MT/pong_videos_1568005544/