# Deep Q-Learning for Atari Games: Zaxxon Agent Implementation

### Setup & Environment Sanity Check
- Install Gymnasium with Atari support, ALE-py (emulator), and AutoROM (downloads ROMs).
- Accept ROM license and register ALE environments with Gymnasium.
- Create and reset `ALE/Zaxxon-v5` in RGB mode to verify everything is loaded.


In [1]:
!pip install "gymnasium[accept-rom-license]" "gymnasium[atari,accept-rom-license]" "ale-py" "autorom[accept-rom-license]" opencv-python moviepy tqdm pyyaml pandas==2.2.2 --force-reinstall
!AutoROM --accept-license


Collecting ale-py
  Downloading ale_py-0.11.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (9.0 kB)
Collecting opencv-python
  Downloading opencv_python-4.12.0.88-cp37-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (19 kB)
Collecting moviepy
  Downloading moviepy-2.2.1-py3-none-any.whl.metadata (6.9 kB)
Collecting tqdm
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting pyyaml
  Downloading pyyaml-6.0.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (2.4 kB)
Collecting pandas==2.2.2
  Downloading pandas-2.2.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Collecting gymnasium[accept-rom-license]
  Downloading gymnasium-1.2.1-py3-none-any.whl.metadata (10.0 kB)
Collecting autorom[accept-rom-license]
  Downloading AutoROM-0.6.1-py3-none-a

AutoROM will download the Atari 2600 ROMs.
They will be installed to:
	/usr/local/lib/python3.12/dist-packages/AutoROM/roms

Existing ROMs will be overwritten.


In [2]:
import gymnasium as gym, ale_py
gym.register_envs(ale_py)

env = gym.make("ALE/Zaxxon-v5", render_mode="rgb_array")
obs, info = env.reset(seed=0)
print("✅ Environment loaded successfully!")
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)
env.close()



✅ Environment loaded successfully!
Observation space: Box(0, 255, (210, 160, 3), uint8)
Action space: Discrete(18)


### Core DQN (convolutional) for Zaxxon
- **DQN**: 3 conv layers (Atari-style) + 2 fully-connected layers.
- **Input**: 4 stacked 84×84 grayscale frames (uint8 scaled to [0,1] inside the network).
- **ReplayBuffer**: stores (state_stack, action, reward, next_state_stack, done) tuples and samples random batches.
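- **Target network**: cloned from `policy_net`; synced at the end of every episode (the `TARGET_UPDATE` constant is defined but not used).
- **Epsilon-greedy**: starts at 1.0 and is multiplied by 0.995 after each episode, with a floor of 0.05; encourages exploration early, exploitation later.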
- **Loss**: MSE between Q(s,a) and bootstrap target r + γ max_a' Q_target(s', a').
- **Saves**: final weights to `dqn_zaxxon.pt`.
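The first fully-connected layer in the code below takes 3136 inputs. A minimal sketch of where that number comes from, assuming unpadded convolutions with the kernel sizes and strides used in this notebook (the helper names `conv_out` and `flattened_features` are illustrative, not part of the notebook):

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output width/height of a convolution with padding=0 and dilation=1."""
    return (size - kernel) // stride + 1


def flattened_features(size: int = 84) -> int:
    """Number of features after the three conv layers, for a square input."""
    # (out_channels, kernel, stride) for each conv layer in the DQN
    layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]
    channels = 4  # four stacked grayscale frames go in
    for channels, kernel, stride in layers:
        size = conv_out(size, kernel, stride)  # 84 -> 20 -> 9 -> 7
    return channels * size * size


print(flattened_features())  # 64 * 7 * 7 = 3136
```

If the input resolution or any stride changes, this value changes too, and `nn.Linear(3136, 512)` has to be updated to match.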


In [1]:
with open("dqn_zaxxon.py", "w") as f:
    f.write('''
import gymnasium as gym, ale_py
gym.register_envs(ale_py)
import torch, torch.nn as nn, torch.optim as optim
import numpy as np, cv2, random, os
from collections import deque

# ------------------ Hyperparameters ------------------
GAMMA = 0.99
LR = 1e-4
EPS_START = 1.0
EPS_END = 0.05
EPS_DECAY = 0.995
BATCH_SIZE = 32
MEM_SIZE = 10000
TARGET_UPDATE = 1000
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ------------------ Q-network ------------------
class DQN(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, 1), nn.ReLU()
        )
        self.fc = nn.Sequential(
            nn.Linear(3136, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )
    def forward(self, x):
        x = x / 255.0
        x = self.conv(x)
        return self.fc(x.view(x.size(0), -1))

# ------------------ Replay Buffer ------------------
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    def push(self, s, a, r, ns, d):
        self.buffer.append((s, a, r, ns, d))
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, ns, d = zip(*batch)
        return np.array(s), a, r, np.array(ns), d
    def __len__(self): return len(self.buffer)

# ------------------ Preprocess ------------------
def preprocess(frame):
    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    frame = cv2.resize(frame, (84, 84))
    return frame

# ------------------ Train DQN ------------------
def train_dqn(env_id="ALE/Zaxxon-v5", total_episodes=30):
    env = gym.make(env_id)
    n_actions = env.action_space.n
    policy_net = DQN(n_actions).to(device)
    target_net = DQN(n_actions).to(device)
    target_net.load_state_dict(policy_net.state_dict())
    optimizer = optim.Adam(policy_net.parameters(), lr=LR)
    memory = ReplayBuffer(MEM_SIZE)
    epsilon = EPS_START
    frame_stack = deque(maxlen=4)

    for episode in range(total_episodes):
        obs, _ = env.reset()
        state = preprocess(obs)
        frame_stack.clear()
        frame_stack.extend([state]*4)
        total_reward = 0
        done = False

        while not done:
            s_stack = np.array(frame_stack)
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    q_values = policy_net(torch.tensor(s_stack, dtype=torch.float32).unsqueeze(0).to(device))
                    action = q_values.argmax(1).item()

            next_obs, r, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            ns = preprocess(next_obs)
            frame_stack.append(ns)
            memory.push(s_stack, action, r, np.array(frame_stack), done)
            total_reward += r

            if len(memory) > BATCH_SIZE:
                s, a, r, ns, d = memory.sample(BATCH_SIZE)
                s = torch.tensor(s, dtype=torch.float32).to(device)
                a = torch.tensor(a).unsqueeze(1).to(device)
                r = torch.tensor(r, dtype=torch.float32).unsqueeze(1).to(device)
                ns = torch.tensor(ns, dtype=torch.float32).to(device)
                d = torch.tensor(d, dtype=torch.float32).unsqueeze(1).to(device)

                q_vals = policy_net(s).gather(1, a)
                next_q = target_net(ns).max(1)[0].detach().unsqueeze(1)
                target = r + GAMMA * next_q * (1 - d)
                loss = nn.MSELoss()(q_vals, target)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        epsilon = max(EPS_END, epsilon * EPS_DECAY)
        target_net.load_state_dict(policy_net.state_dict())
        print(f"Episode {episode+1}/{total_episodes} | Reward: {total_reward:.1f} | Epsilon: {epsilon:.3f}")

    torch.save(policy_net.state_dict(), "dqn_zaxxon.pt")
    print("✅ Training complete. Model saved as dqn_zaxxon.pt")
    env.close()
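    return policy_net, total_episodes

# Run training when the file is executed as a script (python dqn_zaxxon.py)
if __name__ == "__main__":
    train_dqn()
''')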


In [2]:
!python -u dqn_zaxxon.py


In [4]:
!pip install gymnasium[atari,accept-rom-license] ale-py torch torchvision torchaudio imageio




In [5]:
import gymnasium as gym
import ale_py
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random
import os
from gymnasium.wrappers import RecordVideo
from IPython.display import Video
import glob


### Quick Trainer (Notebook Version)
- Trains a DQN for `ALE/Zaxxon-v5` inside the notebook.
- Uses 4-frame stacks, replay memory, ε-greedy exploration, and MSE TD loss.
- Saves weights to `dqn_zaxxon.pt`.
- Prints episode reward and current epsilon.
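The trainer below multiplies ε by 0.995 after every episode and floors it at 0.1. A small sketch, in plain Python, of how slowly that schedule moves; the function names are illustrative and the defaults mirror the values in the training cell:

```python
import math


def epsilon_after(episodes: int, start: float = 1.0,
                  floor: float = 0.1, decay: float = 0.995) -> float:
    """Epsilon after a number of episodes of multiplicative decay."""
    return max(floor, start * decay ** episodes)


def episodes_to_floor(start: float = 1.0, floor: float = 0.1,
                      decay: float = 0.995) -> int:
    """Smallest episode count at which the decayed value reaches the floor."""
    return math.ceil(math.log(floor / start) / math.log(decay))


print(round(epsilon_after(200), 3))  # epsilon at the end of a 200-episode run
print(episodes_to_floor())           # episodes needed to hit the 0.1 floor
```

With the default of 200 episodes, ε is still about 0.37 at the end of training, so most of the run is spent exploring; reaching the floor takes 460 episodes.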


In [6]:
class DQN(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3136, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )

    def forward(self, x):
        return self.net(x / 255.0)


In [7]:
import cv2

def preprocess(obs):
    img = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)
    img = cv2.resize(img, (84, 84))
    return img


In [8]:
gym.register_envs(ale_py)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def train_dqn(env_id="ALE/Zaxxon-v5", total_episodes=200):
    env = gym.make(env_id)
    n_actions = env.action_space.n
    policy_net = DQN(n_actions).to(device)
    optimizer = optim.Adam(policy_net.parameters(), lr=1e-4)
    gamma = 0.99
    epsilon, eps_min, eps_decay = 1.0, 0.1, 0.995

    memory = deque(maxlen=50000)
    batch_size = 32

    for ep in range(total_episodes):
        obs, _ = env.reset()
        stack = deque([preprocess(obs)] * 4, maxlen=4)
        done, total_reward = False, 0

        while not done:
            s = np.array(stack)
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                with torch.no_grad():
                    qvals = policy_net(torch.tensor(s, dtype=torch.float32)
                                       .unsqueeze(0).to(device))
                    a = qvals.argmax(1).item()

            next_obs, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            stack.append(preprocess(next_obs))
            total_reward += r

            memory.append((s, a, r, np.array(stack), done))

            if len(memory) > batch_size:
                batch = random.sample(memory, batch_size)
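                s_batch = torch.tensor(np.array([b[0] for b in batch]), dtype=torch.float32).to(device)
                a_batch = torch.tensor([b[1] for b in batch], dtype=torch.int64).to(device)
                r_batch = torch.tensor([b[2] for b in batch], dtype=torch.float32).to(device)
                s2_batch = torch.tensor(np.array([b[3] for b in batch]), dtype=torch.float32).to(device)
                # Cast done flags to float: `1 - d_batch` fails on a bool tensor
                d_batch = torch.tensor([b[4] for b in batch], dtype=torch.float32).to(device)

                q_vals = policy_net(s_batch).gather(1, a_batch.unsqueeze(1)).squeeze(1)
                # No target network in this version, so block gradients through the bootstrap term
                with torch.no_grad():
                    next_q = policy_net(s2_batch).max(1)[0]
                target = r_batch + gamma * next_q * (1 - d_batch)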

                loss = nn.MSELoss()(q_vals, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        epsilon = max(eps_min, epsilon * eps_decay)
        print(f"Episode {ep+1}/{total_episodes} | Reward: {total_reward:.2f} | Epsilon: {epsilon:.3f}")

    torch.save(policy_net.state_dict(), "dqn_zaxxon.pt")
    print("✅ Model trained & saved as dqn_zaxxon.pt")
    env.close()
    return policy_net


In [11]:
import os, glob
print("cwd:", os.getcwd())
print(glob.glob("*"))
print(glob.glob("**/*.pt", recursive=True))


cwd: /content
['videos', 'dqn_zaxxon.py', '__pycache__', 'sample_data']
[]


### File & Drive Setup
- Print working directory and visible files (sanity check).
- Mount Google Drive for persistent storage of models and videos.


In [3]:
from google.colab import drive
drive.mount("/content/drive")


Mounted at /content/drive


In [4]:
!mkdir -p "/content/drive/MyDrive/rl"
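Once Drive is mounted, weights and videos can be copied into the `rl` folder created above so they survive a runtime reset. A minimal sketch, assuming the file to back up already exists in the working directory; `backup_to_dir` is a hypothetical helper, not something defined elsewhere in this notebook:

```python
import os
import shutil


def backup_to_dir(src: str, dst_dir: str) -> str:
    """Copy `src` into `dst_dir` (created if missing) and return the new path."""
    os.makedirs(dst_dir, exist_ok=True)
    return shutil.copy2(src, dst_dir)


# Example call (assumes Drive is mounted and dqn_zaxxon.pt has been saved):
# backup_to_dir("dqn_zaxxon.pt", "/content/drive/MyDrive/rl")
```

Calling it after each checkpoint save keeps the latest weights on Drive even if the Colab session disconnects mid-run.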

In [5]:
import gymnasium as gym, ale_py, torch, torch.nn as nn, torch.optim as optim
import numpy as np, random, os, glob
from collections import deque
from IPython.display import Video


In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class DQN(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, 1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512), nn.ReLU(),
            nn.Linear(512, n_actions)
        )
    def forward(self, x):
        return self.net(x / 255.0)

def preprocess(obs):
    import cv2
    gray = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)
    resized = cv2.resize(gray, (84, 84))
    return resized


In [7]:
import gymnasium as gym, ale_py, torch, torch.nn as nn, torch.optim as optim
import numpy as np, random
from collections import deque

# --- Register Atari environments ---
gym.register_envs(ale_py)

# --- DQN network ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class DQN(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, 1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(7*7*64, 512), nn.ReLU(),
            nn.Linear(512, n_actions)
        )
    def forward(self, x):
        return self.net(x / 255.0)

def preprocess(obs):
    import cv2
    gray = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84))

# --- Demo loop (rollouts only: no replay sampling or gradient updates) ---
def train_dqn_zaxxon(episodes=2, gamma=0.99, lr=1e-4, eps_decay=0.995):
    env = gym.make("ALE/Zaxxon-v5")
    n_actions = env.action_space.n
    policy = DQN(n_actions).to(device)
    optimizer = optim.Adam(policy.parameters(), lr=lr)

    memory = deque(maxlen=2000)
    epsilon, eps_min = 1.0, 0.1

    for ep in range(episodes):
        obs, _ = env.reset()
        stack = deque([preprocess(obs)] * 4, maxlen=4)
        done, total_reward = False, 0
        while not done:
            s = np.array(stack)
            action = env.action_space.sample() if np.random.rand() < epsilon else policy(
                torch.tensor(s, dtype=torch.float32).unsqueeze(0).to(device)
            ).argmax(1).item()
            nxt, r, term, trunc, _ = env.step(action)
            done = term or trunc
            stack.append(preprocess(nxt))
            total_reward += r
        epsilon = max(eps_min, epsilon * eps_decay)
        print(f"Episode {ep+1}/{episodes} | Reward: {total_reward:.1f} | Epsilon: {epsilon:.3f}")

    torch.save(policy.state_dict(), "dqn_zaxxon.pt")
    env.close()
    print("✅ Saved trained model as dqn_zaxxon.pt")


### Mini Demo Trainer (2 episodes)
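- Not for performance: the loop above only steps the environment and never samples the replay memory or updates the network, so the saved weights are the random initialization.
- Produces `dqn_zaxxon.pt` quickly so we can test recording/evaluation end-to-end.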


In [8]:
train_dqn_zaxxon(episodes=2)


Episode 1/2 | Reward: 0.0 | Epsilon: 0.995
Episode 2/2 | Reward: 0.0 | Epsilon: 0.990
✅ Saved trained model as dqn_zaxxon.pt


### Record One Episode (Video)
- Wrap environment with `RecordVideo`.
- Load `dqn_zaxxon.pt` and run a full episode (greedy actions).
- Save MP4 to `videos/` and display the newest recording.


In [9]:
from gymnasium.wrappers import RecordVideo
from collections import deque
from IPython.display import Video
import gymnasium as gym, ale_py, torch, numpy as np, glob, os

# import your net + preprocess + device from the definitions you ran earlier
# (if you're in a fresh runtime, re-run the cells that define DQN, preprocess, device)
gym.register_envs(ale_py)

# make a recorder env
env = RecordVideo(gym.make("ALE/Zaxxon-v5", render_mode="rgb_array"),
                  video_folder="videos", episode_trigger=lambda e: True)

# load the small demo model
n_actions = env.action_space.n
policy = DQN(n_actions).to(device)
policy.load_state_dict(torch.load("dqn_zaxxon.pt", map_location=device))
policy.eval()

# play one episode and record
import numpy as np
obs, _ = env.reset(seed=0)
stack = deque([preprocess(obs)]*4, maxlen=4)
done = False
while not done:
    s = np.array(stack)
    with torch.no_grad():
        a = policy(torch.tensor(s, dtype=torch.float32).unsqueeze(0).to(device)).argmax(1).item()
    obs, r, term, trunc, _ = env.step(a)
    done = term or trunc
    stack.append(preprocess(obs))

env.close()

# show the newest mp4
mp4 = sorted(glob.glob("videos/*.mp4"))[-1]
Video(mp4, embed=True)




In [10]:
env = RecordVideo(gym.make("ALE/Zaxxon-v5", render_mode="rgb_array"),
                  video_folder="videos", episode_trigger=lambda e: True)
obs, _ = env.reset(seed=0)

from collections import deque
stack = deque([preprocess(obs)]*4, maxlen=4)

eps_to_record = 3
ep = 0
done = False
while ep < eps_to_record:
    s = np.array(stack)
    with torch.no_grad():
        a = policy(torch.tensor(s, dtype=torch.float32).unsqueeze(0).to(device)).argmax(1).item()
    obs, r, term, trunc, _ = env.step(a)
    done = term or trunc
    stack.append(preprocess(obs))
    if done:
        ep += 1
        if ep < eps_to_record:
            obs, _ = env.reset(seed=ep)
            stack = deque([preprocess(obs)]*4, maxlen=4)
        done = False

env.close()

import glob
mp4 = sorted(glob.glob("videos/*.mp4"))[-1]
Video(mp4, embed=True)




### Evaluation: Average Return & Steps
- Load `dqn_zaxxon.pt`.
- Run `n_episodes` greedily (no ε) and compute:
  - Average episodic return (sum of rewards).
  - Average steps per episode.


In [11]:
import gymnasium as gym, ale_py, torch, numpy as np
from collections import deque

gym.register_envs(ale_py)

def eval_policy(weights_path="dqn_zaxxon.pt", env_id="ALE/Zaxxon-v5", n_episodes=5):
    env = gym.make(env_id)
    nA  = env.action_space.n
    net = DQN(nA).to(device)
    net.load_state_dict(torch.load(weights_path, map_location=device))
    net.eval()

    ret_list, step_list = [], []
    for ep in range(n_episodes):
        obs, _ = env.reset()
        stack = deque([preprocess(obs)]*4, maxlen=4)
        done, R, steps = False, 0.0, 0
        while not done:
            s = np.array(stack)
            with torch.no_grad():
                a = net(torch.tensor(s, dtype=torch.float32).unsqueeze(0).to(device)).argmax(1).item()
            obs, r, term, trunc, _ = env.step(a)
            done = term or trunc
            stack.append(preprocess(obs))
            R += r; steps += 1
        ret_list.append(R); step_list.append(steps)
    env.close()
    print(f"Avg return over {n_episodes}: {np.mean(ret_list):.2f}")
    print(f"Avg steps per episode: {np.mean(step_list):.1f}")

eval_policy(n_episodes=3)


Avg return over 3: 0.00
Avg steps per episode: 885.0


### Stronger Trainer (Closer to Atari DQN)
Key improvements:
- **Double DQN** target: action from policy net, value from target net (reduces overestimation).
- **Huber loss** + **grad clipping**: stabilizes updates.
- **Linear ε decay by steps**: smoother exploration schedule.
- **Target net update by steps**: sync every `target_update` steps (not per episode).
- **Replay warm-up**: start learning only once the replay buffer holds `start_learn` transitions.
- **Periodic checkpoints**: saves model and optimizer state to `checkpoints/` every 10 episodes.
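To make the Double DQN bullet concrete, here is a small sketch for a single transition, written with plain Python lists instead of tensors so it runs without PyTorch. The Q-values are made-up numbers chosen so the two networks disagree; the function names are illustrative:

```python
def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def dqn_target(r, done, q_target_next, gamma=0.99):
    """Vanilla DQN: the target net both picks and scores the next action."""
    return r + gamma * (1.0 - done) * max(q_target_next)


def double_dqn_target(r, done, q_policy_next, q_target_next, gamma=0.99):
    """Double DQN: the policy net picks the action, the target net scores it."""
    a_star = argmax(q_policy_next)
    return r + gamma * (1.0 - done) * q_target_next[a_star]


# Hypothetical next-state Q-values for three actions
q_policy_next = [1.0, 3.0, 2.0]   # policy net prefers action 1
q_target_next = [0.5, 1.5, 4.0]   # target net's largest value is on action 2

print(dqn_target(1.0, 0.0, q_target_next))
print(double_dqn_target(1.0, 0.0, q_policy_next, q_target_next))
```

Here the vanilla target is about 4.96 while the Double DQN target is about 2.485: the second form does not chase the target net's single largest (possibly overestimated) value. When `done` is 1.0, both reduce to the reward alone.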


In [2]:
# ====== Stronger DQN for Atari (Zaxxon) ======
import gymnasium as gym, ale_py
gym.register_envs(ale_py)

import torch, torch.nn as nn, torch.optim as optim
import numpy as np, random, os, time, glob
from collections import deque
import cv2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# --------- Model -----------
class DQN(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, 1), nn.ReLU()
        )
        self.head = nn.Sequential(
            nn.Linear(7*7*64, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )
    def forward(self, x):
        x = x/255.0
        x = self.conv(x)
        return self.head(x.view(x.size(0), -1))

def preprocess(obs):
    g = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)
    return cv2.resize(g, (84, 84))

# --------- Training ----------
def train_zaxxon(
    env_id="ALE/Zaxxon-v5",
    episodes=500,               # increase for better skills
    gamma=0.99,
    lr=1e-4,
    replay_size=100_000,
    batch_size=64,
    start_learn=10_000,         # warmup frames
    target_update=5_000,        # steps
    eps_start=1.0, eps_end=0.05, eps_decay_steps=500_000,
    clip_grad=10.0,
    reward_clip=True,
    ckpt_dir="checkpoints",
    resume=True                 # resume if ckpt exists
):
    os.makedirs(ckpt_dir, exist_ok=True)
    env = gym.make(env_id)
    nA  = env.action_space.n

    # networks
    policy = DQN(nA).to(device)
    target = DQN(nA).to(device)
    target.load_state_dict(policy.state_dict())
    opt = optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.SmoothL1Loss()   # Huber

    # resume if possible
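    # Sort by episode number; a plain string sort would rank "ep90" after "ep100"
    latest = sorted(
        glob.glob(os.path.join(ckpt_dir, "zaxxon_ep*.pt")),
        key=lambda p: int(os.path.basename(p)[len("zaxxon_ep"):-len(".pt")]),
    )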
    start_ep = 0
    total_steps = 0
    if resume and latest:
        path = latest[-1]
        sd = torch.load(path, map_location=device)
        policy.load_state_dict(sd["policy"])
        target.load_state_dict(sd["target"])
        opt.load_state_dict(sd["opt"])
        start_ep = sd["episode"] + 1
        total_steps = sd.get("total_steps", 0)
        print(f"🔁 Resumed from {path} (episode {start_ep}, total_steps {total_steps})")

    memory = deque(maxlen=replay_size)

    # epsilon schedule (linear)
    def eps_by_step(t):
        if t >= eps_decay_steps: return eps_end
        return eps_end + (eps_start - eps_end) * (1 - t/eps_decay_steps)

    frame_stack = deque(maxlen=4)
    last_target_update = total_steps

    def select_action(state, eps):
        if random.random() < eps:
            return env.action_space.sample()
        with torch.no_grad():
            q = policy(torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(device))
            return q.argmax(1).item()

    for ep in range(start_ep, episodes):
        obs, _ = env.reset()
        frame_stack.clear()
        f = preprocess(obs)
        frame_stack.extend([f]*4)

        done, ep_ret, ep_len = False, 0.0, 0
        while not done:
            s = np.array(frame_stack)
            eps = eps_by_step(total_steps)
            a = select_action(s, eps)

            nxt, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            r = float(np.clip(r, -1, 1)) if reward_clip else float(r)

            f2 = preprocess(nxt)
            frame_stack.append(f2)
            s2 = np.array(frame_stack)

            memory.append((s, a, r, s2, float(done)))
            ep_ret += r; ep_len += 1; total_steps += 1

            # learn
            if len(memory) >= max(batch_size, start_learn):
                batch = random.sample(memory, batch_size)
                # Stack frames into one ndarray first; building tensors from lists of arrays is slow
                S  = torch.tensor(np.array([b[0] for b in batch]), dtype=torch.float32).to(device)
                A  = torch.tensor([b[1] for b in batch]).unsqueeze(1).to(device)
                R  = torch.tensor([b[2] for b in batch], dtype=torch.float32).unsqueeze(1).to(device)
                S2 = torch.tensor(np.array([b[3] for b in batch]), dtype=torch.float32).to(device)
                D  = torch.tensor([b[4] for b in batch], dtype=torch.float32).unsqueeze(1).to(device)

                q_sa   = policy(S).gather(1, A)
                with torch.no_grad():
                    # Double DQN target
                    next_actions = policy(S2).argmax(1, keepdim=True)
                    next_q = target(S2).gather(1, next_actions)
                    y = R + gamma * (1 - D) * next_q
                loss = loss_fn(q_sa, y)

                opt.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(policy.parameters(), clip_grad)
                opt.step()

            # target update
            if total_steps - last_target_update >= target_update:
                target.load_state_dict(policy.state_dict())
                last_target_update = total_steps

        print(f"Ep {ep+1:4d}/{episodes} | return={ep_ret:6.1f} | len={ep_len:5d} | eps={eps_by_step(total_steps):.3f} | steps={total_steps}")

        # save ckpt every N episodes
        if (ep+1) % 10 == 0:
            path = os.path.join(ckpt_dir, f"zaxxon_ep{ep+1}.pt")
            torch.save({
                "episode": ep,
                "policy": policy.state_dict(),
                "target": target.state_dict(),
                "opt": opt.state_dict(),
                "total_steps": total_steps
            }, path)
            torch.save(policy.state_dict(), "dqn_zaxxon.pt")  # convenience latest
            print(f"💾 Saved {path}")

    # final save
    torch.save(policy.state_dict(), "dqn_zaxxon.pt")
    print("✅ Training complete. Latest weights -> dqn_zaxxon.pt")
    env.close()
    return policy


### Training Run (Resume from Checkpoint)
- Call `train_zaxxon`; with `resume=True` it reloads the latest checkpoint found in `checkpoints/` and continues from there.
- The log below picks up at episode 90 and runs through episode 100.


In [13]:
# first long stretch (you can start with 200–500; more is better for Zaxxon)
_ = train_zaxxon(episodes=100)


🔁 Resumed from checkpoints/zaxxon_ep90.pt (episode 90, total_steps 81913)
Ep   91/100 | return=   0.0 | len=  885 | eps=0.843 | steps=82798
Ep   92/100 | return=   0.0 | len=  885 | eps=0.841 | steps=83683
Ep   93/100 | return=   0.0 | len=  885 | eps=0.839 | steps=84568
Ep   94/100 | return=   0.0 | len=  885 | eps=0.838 | steps=85453
Ep   95/100 | return=   0.0 | len=  885 | eps=0.836 | steps=86338
Ep   96/100 | return=   0.0 | len=  885 | eps=0.834 | steps=87223
Ep   97/100 | return=   0.0 | len=  885 | eps=0.833 | steps=88108
Ep   98/100 | return=   0.0 | len=  885 | eps=0.831 | steps=88993
Ep   99/100 | return=   0.0 | len=  885 | eps=0.829 | steps=89878
Ep  100/100 | return=   0.0 | len=  885 | eps=0.828 | steps=90763
💾 Saved checkpoints/zaxxon_ep100.pt
✅ Training complete. Latest weights -> dqn_zaxxon.pt
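The `eps` column in the log above can be checked against the linear schedule. The sketch below copies `eps_by_step` out of `train_zaxxon` as a standalone function with the same default values; `episodes_until_floor` is an illustrative helper that assumes every episode lasts 885 steps, as in this log:

```python
import math


def eps_by_step(t, eps_start=1.0, eps_end=0.05, eps_decay_steps=500_000):
    """Linear decay from eps_start to eps_end over eps_decay_steps env steps."""
    if t >= eps_decay_steps:
        return eps_end
    return eps_end + (eps_start - eps_end) * (1 - t / eps_decay_steps)


def episodes_until_floor(steps_per_episode=885, eps_decay_steps=500_000):
    """Episodes needed to reach eps_end, assuming a fixed episode length."""
    return math.ceil(eps_decay_steps / steps_per_episode)


for steps in (82798, 90763):  # step counts after episodes 91 and 100 in the log
    print(steps, f"{eps_by_step(steps):.3f}")

print(episodes_until_floor())
```

The two printed values match the log (0.843 and 0.828). At 885 steps per episode the schedule needs about 565 episodes to reach its floor, so after 100 episodes the agent still acts randomly more than 80% of the time.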


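### Longplay Recording (60 FPS)
- Wrap the environment with `RecordVideo` and set `render_fps` to 60 for a viewer-friendly MP4.
- Load `dqn_zaxxon.pt`, record multiple episodes with greedy actions, and return the newest `.mp4` in `videos/`.
- Use after a longer training run for a more impressive demo.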


In [14]:
from gymnasium.wrappers import RecordVideo
from IPython.display import Video
import gymnasium as gym, ale_py, torch, numpy as np, glob, os
from collections import deque

gym.register_envs(ale_py)

def record_longplay(weights="dqn_zaxxon.pt", env_id="ALE/Zaxxon-v5",
                    episodes=5, seed=0, fps=60, out_dir="videos"):
    base = gym.make(env_id, render_mode="rgb_array")
    base.metadata["render_fps"] = fps
    env = RecordVideo(base, video_folder=out_dir,
                      name_prefix="zaxxon_longplay",
                      episode_trigger=lambda e: True)

    nA = env.action_space.n
    policy = DQN(nA).to(device)
    policy.load_state_dict(torch.load(weights, map_location=device))
    policy.eval()

    for ep in range(episodes):
        obs, _ = env.reset(seed=seed+ep)
        stack = deque([preprocess(obs)]*4, maxlen=4)
        done, ep_ret, steps = False, 0.0, 0
        while not done:
            s = np.array(stack)
            with torch.no_grad():
                a = policy(torch.tensor(s, dtype=torch.float32).unsqueeze(0).to(device)).argmax(1).item()
            obs, r, term, trunc, _ = env.step(a)
            done = term or trunc
            stack.append(preprocess(obs))
            ep_ret += r; steps += 1
        print(f"[Video] ep {ep+1}/{episodes} | return={ep_ret:.1f} | steps={steps}")

    env.close()
    mp4 = sorted(glob.glob(os.path.join(out_dir, "*.mp4")))[-1]
    return mp4

mp4 = record_longplay(episodes=3, fps=60)  # make this 5–10 for longer
Video(mp4, embed=True)


[Video] ep 1/3 | return=0.0 | steps=885




[Video] ep 2/3 | return=0.0 | steps=885
[Video] ep 3/3 | return=0.0 | steps=885


# **Code Attribution and Licensing**

All implementation work, including the Deep Q-Network architecture, replay buffer, training loop, preprocessing logic, and gameplay recording, was developed by Nithin Yash Menezes within a Google Colab environment as part of the INFO 7375: LLM Agents & Deep Q-Learning Assignment at Northeastern University.

## **The following open-source libraries and frameworks were used in this project:**

Gymnasium and AutoROM are licensed under the MIT License (Farama Foundation); ALE-py is licensed under the GPL-2.0 License.

*PyTorch* — BSD-style License (Meta AI).

*OpenCV*— Apache License 2.0.

*MoviePy* — MIT License.

*NumPy, Pandas, and TQDM* — BSD License.

#### **All libraries were used under their respective open-source licenses without modification.**

The code file dqn_zaxxon.py, along with the training, evaluation, and recording scripts, was written independently by Nithin Yash Menezes. Conceptual inspiration was drawn from:

Mnih et al. (2015), Human-level control through deep reinforcement learning, Nature.

Official documentation for Gymnasium Atari environments and PyTorch tutorials.

All original code contributions created for this assignment are released under the MIT License, allowing free educational reuse and modification with appropriate credit.

For reference, the full implementation and training process can be found in the Colab notebook: