## Reinforcement Learning Programming - CSCN 8020 - Assignment 3


Jun He (8903073)     


![Pong Environment](Pong.png)<br>
### Introduction
In this assignment you will work individually with the Pong environment and implement Deep Q-Learning. If the state were given by the positions of the right and left paddles, the state space would be continuous, which rules out a tabular solution and therefore tabular Q-Learning. To solve this problem, you will use a neural network to approximate the Q-function (a Deep Q-Network) and use the game frames as the observation. The way you interact with the environment will be very similar to the Ms. Pacman gym environment used in class.
### Action space

* 0: NOOP (No operation)
* 1: FIRE
* 2: Move right
* 3: Move left
* 4: Right Fire
* 5: Left Fire
### Observation Space

The environment has an observation space of size (210,160,3), which is an RGB image with values between 0 and 255.

Do not change the difficulty or mode; both default to 0. Also, ensure you are using the PongDeterministic-v4 version of the environment, which has the reduced set of actions.

### Helper Utility 
To assist in interacting with the environment, you are provided with a file (assignment3_utils.py) that has a few methods for pre-processing the image: cropping, converting to grayscale, and downsizing.
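
The three pre-processing steps can be sketched in plain NumPy. This is only an illustration and not the code in assignment3_utils.py: the function name `preprocess_frame` and the crop bounds (rows 30 to 198) are assumptions, chosen so that a 210x160 frame ends up as 84x80 after downsizing by a factor of 2.

```python
import numpy as np


def preprocess_frame(frame, crop_top=30, crop_bottom=198):
    """Crop, grayscale, and downsize one RGB frame of shape (210, 160, 3).

    crop_top / crop_bottom are hypothetical bounds, not the values used
    by assignment3_utils.py.
    """
    cropped = frame[crop_top:crop_bottom]      # (168, 160, 3): keep the middle rows
    gray = cropped.mean(axis=2)                # (168, 160): average the RGB channels
    small = gray[::2, ::2]                     # (84, 80): keep every second pixel
    return (small / 255.0).astype(np.float32)  # scale pixel values to [0, 1]


# Usage with a dummy frame in place of a real observation
dummy = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
processed = preprocess_frame(dummy)
```

Without the crop, the same downsizing would give 105x80, which matches the sizes mentioned in the tasks below.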

### DQN [100]

#### Problem Statement

Implement the DQN algorithm on the Pong environment from OpenAI Gym. Train an agent to play the game efficiently (and win!). Use the following hyperparameters:

* Mini-batch size: 8
* Update rate of target network: 10 episodes
* Discount Factor γ: 0.95
* Exploration:
    * Exploration Factor Initial Value ε_init: 1.0
    * Exploration Decay Rate (δ): 0.995
    * Exploration minimum value ε_min: 0.05

Calculate the exploration rate using:

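$$
\varepsilon =
\begin{cases}
\varepsilon \cdot \delta & \text{if } \varepsilon \ge \varepsilon_{\min} \\
\varepsilon_{\min} & \text{otherwise}
\end{cases}
$$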
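
A minimal sketch of this schedule, assuming the decay is applied once per episode and clamped at the minimum (the helper names `decay_epsilon` and `episodes_until_min` are illustrative):

```python
import math


def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.05):
    """Apply one multiplicative decay step, never going below epsilon_min."""
    return max(epsilon_min, epsilon * decay)


def episodes_until_min(epsilon_init=1.0, decay=0.995, epsilon_min=0.05):
    """Smallest n such that epsilon_init * decay**n <= epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon_init) / math.log(decay))


# Simulate 500 per-episode decay steps starting from the initial value
epsilon = 1.0
for _ in range(500):
    epsilon = decay_epsilon(epsilon)

n_min = episodes_until_min()
```

With these hyperparameters the minimum is reached only after roughly 600 decay steps, so a 500-episode run with per-episode decay still explores about 8% of the time at the end.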

## Tasks 
### 1.1 Implement the DQN algorithm and train an agent on the Pong environment.
* Change the input size of the CNN and use all 4 images as an input instead of blending them together.
* Note that if you apply the crop function in the utilities, you'll need to change the input state size to 84x80 instead of 105x80.
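
To see how the input size affects the network, the sketch below stacks 4 frames along the channel axis and computes the size of the feature map after three unpadded convolutions. The (kernel, stride) pairs are taken from the CNN defined later in this notebook; the helper names are illustrative.

```python
import numpy as np

# (kernel, stride) of the three conv layers used in this notebook's CNN
CONV_LAYERS = [(8, 4), (4, 2), (3, 1)]


def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1


def feature_map_hw(height, width, layers=CONV_LAYERS):
    """Height and width of the feature map after all conv layers."""
    for kernel, stride in layers:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return height, width


# Four pre-processed 84x80 frames become one (4, 84, 80) input tensor
frames = [np.zeros((84, 80), dtype=np.float32) for _ in range(4)]
stacked = np.stack(frames, axis=0)

h, w = feature_map_hw(*stacked.shape[1:])
flat_features = 64 * h * w  # 64 output channels in the last conv layer
```

For an 84x80 input this gives a 7x6 feature map, so the first fully connected layer receives 64 * 7 * 6 = 2688 features; an uncropped 105x80 input would give 9x6 instead.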

### 1.2 Report the following metrics during training against the number of steps:
1. Score per episode
2. Average Cumulative reward of the last 5 episodes
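
One possible way to collect both metrics is a small tracker that is updated at the end of every episode. This is a sketch, not part of the provided utilities; the class name `EpisodeMetrics` and its fields are hypothetical.

```python
from collections import deque


class EpisodeMetrics:
    """Track score per episode and the average over the last `window` episodes."""

    def __init__(self, window=5):
        self.recent = deque(maxlen=window)  # scores of the most recent episodes
        self.total_steps = 0                # cumulative environment steps
        self.steps = []                     # x-axis: steps at the end of each episode
        self.scores = []                    # metric 1: score per episode
        self.avg_last_n = []                # metric 2: average of the last `window` scores

    def record(self, score, episode_steps):
        self.total_steps += episode_steps
        self.recent.append(score)
        self.steps.append(self.total_steps)
        self.scores.append(score)
        self.avg_last_n.append(sum(self.recent) / len(self.recent))


# Usage with made-up scores and a fixed episode length of 100 steps
metrics = EpisodeMetrics()
for score in [-21, -21, -20, -19, -21, -18]:
    metrics.record(score, episode_steps=100)
```

Until 5 episodes have been played, the average is taken over the episodes seen so far.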

### 1.3 Change each of the following parameters SEPARATELY and plot a figure of the previous metrics for each change:
* Plot a figure for the mini-batch size: [8 (default), 16]
* Plot a figure for the update rate of the target network: every [3, 10 (default)] episodes
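
The two comparisons could be organized as below: one list of run configurations per swept parameter, with the other parameter held at its default, and one figure per sweep. The function names and the structure of `runs` are assumptions for illustration; training each configuration is left to the DQN code.

```python
def build_sweeps(default_batch_size=8, default_target_update=10):
    """Return one list of run configurations per swept parameter."""
    return {
        "batch_size": [
            {"batch_size": b, "target_update": default_target_update}
            for b in (default_batch_size, 16)
        ],
        "target_update": [
            {"batch_size": default_batch_size, "target_update": t}
            for t in (3, default_target_update)
        ],
    }


def plot_sweep(title, runs):
    """Plot both metrics against cumulative steps for every run in one sweep.

    `runs` maps a label to (steps, scores, avg_last_5), three equal-length lists.
    """
    import matplotlib.pyplot as plt  # imported here so build_sweeps needs no plotting backend

    fig, (ax_score, ax_avg) = plt.subplots(1, 2, figsize=(12, 4))
    for label, (steps, scores, avg_last_5) in runs.items():
        ax_score.plot(steps, scores, label=label)
        ax_avg.plot(steps, avg_last_5, label=label)
    ax_score.set(title=f"{title}: score per episode", xlabel="steps", ylabel="score")
    ax_avg.set(title=f"{title}: average of last 5 episodes", xlabel="steps", ylabel="reward")
    ax_score.legend()
    ax_avg.legend()
    fig.tight_layout()
    return fig


sweeps = build_sweeps()
```

Each sweep changes exactly one hyperparameter, so any difference between the curves in a figure can be attributed to that parameter alone.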


import gymnasium as gym
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
from collections import deque
import assignment3_utils as utils  # Helper functions for preprocessing

# Ensure required packages are installed
!pip install gymnasium[atari] ale-py

class DQN(nn.Module):
    """
    Convolutional Neural Network for Deep Q-Learning.
    """
    def __init__(self, input_shape, num_actions):
        super(DQN, self).__init__()
        # Number of input channels = number of stacked frames (input_shape[0])
        self.conv1 = nn.Conv2d(in_channels=input_shape[0], out_channels=32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1)
        
        # Calculate the flattened size after convolutions
        self._to_linear = self._get_conv_output(input_shape)
        
        self.fc1 = nn.Linear(self._to_linear, 512)
        self.fc2 = nn.Linear(512, num_actions)
    
    def _get_conv_output(self, shape):
        """Compute the size of the tensor after the convolutional layers."""
        with torch.no_grad():
            x = torch.zeros(1, *shape)
            x = self.conv1(x)
            x = self.conv2(x)
            x = self.conv3(x)
            return int(np.prod(x.shape[1:]))
    
    def forward(self, x):
        x = x.squeeze(2)  # Ensure 4D shape (batch, 4, 84, 80)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(x.size(0), -1)  # Flatten
        x = F.relu(self.fc1(x))
        return self.fc2(x)

class ReplayBuffer:
    """
    Experience replay buffer for storing past experiences.
    """
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
    
    def __len__(self):
        return len(self.buffer)

class DQNAgent:
    """
    DQN Agent with ε-greedy policy, experience replay, and target network.
    """
    def __init__(self, env, batch_size=8, gamma=0.95, lr=1e-4, epsilon=1.0, epsilon_min=0.05, epsilon_decay=0.995):
        self.env = env
        self.batch_size = batch_size
        self.gamma = gamma
        self.lr = lr
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        self.num_actions = env.action_space.n
        
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        
        # Initialize networks
        self.policy_net = DQN((4, 84, 80), self.num_actions).to(self.device)
        self.target_net = DQN((4, 84, 80), self.num_actions).to(self.device)
        self.target_net.load_state_dict(self.policy_net.state_dict())
        self.target_net.eval()
        
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=self.lr)
        self.memory = ReplayBuffer(10000)
    
    def select_action(self, state):
        if random.random() < self.epsilon:
            return self.env.action_space.sample()
        state = torch.FloatTensor(np.array(state)).unsqueeze(0).to(self.device)  # Ensure correct shape (1, 4, 84, 80)
        with torch.no_grad():
            return self.policy_net(state).argmax(dim=1).item()
    
    def train_step(self):
        if len(self.memory) < self.batch_size:
            return
        batch = self.memory.sample(self.batch_size)
        
        states, actions, rewards, next_states, dones = zip(*batch)
        states = torch.FloatTensor(np.stack(states)).to(self.device)  # Ensure 4D shape (batch, 4, 84, 80)
        states = states.squeeze(2)  # Remove incorrect singleton dimension
        actions = torch.LongTensor(actions).to(self.device)
        rewards = torch.FloatTensor(rewards).to(self.device)
        next_states = torch.FloatTensor(np.stack(next_states)).to(self.device)  # Ensure 4D shape
        next_states = next_states.squeeze(2)  # Remove incorrect singleton dimension
        dones = torch.FloatTensor(dones).to(self.device)
        
        q_values = self.policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        next_q_values = self.target_net(next_states).max(1)[0]
        target_q_values = rewards + (1 - dones) * self.gamma * next_q_values
        
        loss = F.mse_loss(q_values, target_q_values)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
    
    def update_target(self):
        self.target_net.load_state_dict(self.policy_net.state_dict())
    
    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

# Initialize Environment and Run Training
env = gym.make("PongDeterministic-v4")  # Environment version required by the assignment
agent = DQNAgent(env)

# Training Loop
num_episodes = 500
for episode in range(num_episodes):
    state, _ = env.reset()
    state = utils.process_frame(state, (84, 80))
    state = np.squeeze(state, axis=-1)  # Ensure correct shape (84, 80)
    state_stack = deque([state] * 4, maxlen=4)  # Stack last 4 frames
    total_reward = 0
    done = False
    
    while not done:
        stacked_state = np.stack(state_stack, axis=0)  # Ensure shape (4, 84, 80)
        action = agent.select_action(stacked_state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # End the episode on either signal
        next_state = utils.process_frame(next_state, (84, 80))
        next_state = np.squeeze(next_state, axis=-1)  # Ensure correct shape (84, 80)
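        state_stack.append(next_state)
        next_stacked_state = np.stack(state_stack, axis=0)  # Stack after the new frame is added
        # Store the state the action was chosen in, not the already-updated stack
        agent.memory.push(stacked_state, action, reward, next_stacked_state, done)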
        agent.train_step()
        total_reward += reward
    
    agent.decay_epsilon()
    if episode % 10 == 0:
        agent.update_target()
    
    print(f"Episode {episode}: Reward = {total_reward}")

# Save Model
torch.save(agent.policy_net.state_dict(), "dqn_pong_model.pth")  # Save model for later use




Episode 0: Reward = -21.0
Episode 1: Reward = -21.0
Episode 2: Reward = -21.0
Episode 3: Reward = -21.0
Episode 4: Reward = -20.0
Episode 5: Reward = -20.0
Episode 6: Reward = -21.0
Episode 7: Reward = -21.0
Episode 8: Reward = -21.0
Episode 9: Reward = -21.0
Episode 10: Reward = -21.0
Episode 11: Reward = -21.0
Episode 12: Reward = -20.0
Episode 13: Reward = -21.0
Episode 14: Reward = -21.0
Episode 15: Reward = -20.0
Episode 16: Reward = -20.0
Episode 17: Reward = -19.0
Episode 18: Reward = -21.0
Episode 19: Reward = -20.0
Episode 20: Reward = -19.0
Episode 21: Reward = -20.0
Episode 22: Reward = -19.0
Episode 23: Reward = -20.0
Episode 24: Reward = -21.0
Episode 25: Reward = -20.0
Episode 26: Reward = -18.0
Episode 27: Reward = -20.0
Episode 28: Reward = -19.0
Episode 29: Reward = -21.0
Episode 30: Reward = -21.0
Episode 31: Reward = -20.0
Episode 32: Reward = -21.0
Episode 33: Reward = -20.0
Episode 34: Reward = -21.0
Episode 35: Reward = -20.0
Episode 36: Reward = -21.0
Episode 37: