Name: Mahendran Jinachandran 

Student ID: 24088951

# Section 1: 
Why is Reinforcement Learning the ML paradigm of choice for this task?

- I have chosen the game "Ping Pong" for this task. Reinforcement Learning (RL) is the most suitable paradigm because the task requires control and decision-making. Unlike supervised learning methods, which rely on large labelled datasets, RL needs no dataset; it learns from experience, much as humans do.

- In the game of "Ping Pong", the agent (a software program in this case) must learn to move its paddle up or down to return the ball and score points to win the match. This is an example of sequential decision-making, where each move affects the outcome of the game.

- What separates RL from other ML techniques is that the agent starts with no prior experience of the game. It learns through trial and error: it explores different actions, receives feedback on those actions in the form of rewards or penalties, and gradually learns to make better choices that maximize its score. This makes RL a powerful approach for Atari games and similar problems, where the rules are fixed but the strategy must be learnt.

Just for fun: hopefully, the agent learns to make much better decisions than I do in real life. Let's dive into the game.

# Section 2:
## The Environment


In [1]:
# imports
import gym
import numpy as np
import cv2 
from collections import deque
import tensorflow as tf
from tensorflow.keras import layers, models
import random
from tqdm.notebook import tqdm
import time

In [2]:
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Num GPUs Available:  1


In [3]:
# Check for available GPU devices and use them
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Set memory growth for GPUs (avoid memory allocation errors)
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print("Error: " + str(e))

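a. The Atari game selected is *Pong-v5*. The run recorded below was made with `ALE/Pacman-v5` (see the next cell), so the observation shape, action list, and rewards shown come from that environment.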

In [4]:
#env = gym.make("ALE/Pong-v5", render_mode="human") # render_mode = 'human' for rendering
env = gym.make("ALE/Pacman-v5", render_mode="human") # render_mode = 'human' for rendering

A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]


In [5]:
env.metadata['render_fps'] = 60 

b. Inputs received from Gym Environment 

In [6]:
# env.reset() returns an (observation, info) tuple
observation, info = env.reset()
print("Observation shape:", observation.shape)

Observation shape: (250, 160, 3)


c. Control settings for the joystick

In [7]:
num_actions = env.action_space.n
print("Number of possible actions:", num_actions)
print("Available actions are: ")
print(env.unwrapped.get_action_meanings())

Number of possible actions: 5
Available actions are: 
['NOOP', 'UP', 'RIGHT', 'LEFT', 'DOWN']


# Section 3: DQN Implementation

a. Capture and Preprocessing of the Data (1 mark)

In [8]:
def preprocess_frame_color(frame):
    """
    Resize a raw RGB frame to 84x84 while keeping 3 color channels.
    """
    return cv2.resize(frame, (84, 84), interpolation=cv2.INTER_AREA)

def initialize_frame_stack(preprocessed_frame, stack_size=4):
    """
    Initialize a deque of stacked frames with the same frame repeated.
    """
    return deque([preprocessed_frame for _ in range(stack_size)], maxlen=stack_size)

def stack_frames(stacked_frames, new_frame, is_new_episode, stack_size=4):
    """
    Initialize the frame stack if it's a new episode, otherwise
    append the new frame to the existing stack.
    """
    if is_new_episode:
        stacked_frames = initialize_frame_stack(new_frame, stack_size)
    else:
        stacked_frames.append(new_frame)
        
    # Concatenate along the channel dimension: (84, 84, 12) for 4 RGB frames
    stacked_state = np.concatenate(stacked_frames, axis=2)
    return stacked_state, stacked_frames
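
To make the shapes concrete, here is a minimal sketch of the stacking step using only NumPy. `make_frame` is a hypothetical stand-in for a preprocessed frame (it is not part of the notebook), and the frames are assumed to be `uint8`, as Atari observations usually are.

```python
from collections import deque
import numpy as np

STACK_SIZE = 4

def make_frame(value):
    """Hypothetical stand-in for a preprocessed 84x84 RGB frame."""
    return np.full((84, 84, 3), value, dtype=np.uint8)

# New episode: the first frame is repeated to fill the stack.
frames = deque([make_frame(0)] * STACK_SIZE, maxlen=STACK_SIZE)
state = np.concatenate(frames, axis=2)

# Later step: appending a frame pushes the oldest one out of the deque.
frames.append(make_frame(7))
next_state = np.concatenate(frames, axis=2)

print(state.shape)            # (84, 84, 12)
print(next_state[0, 0, ::3])  # first channel of each stacked frame: [0 0 0 7]
```

The newest frame always occupies the last three channels, so the network sees the frames in time order along the channel axis.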

b. The Network Structure (2 marks)

In [9]:
def create_dqn_model(input_shape, num_actions):
    """
    Create a Convolutional Neural Network for DQN.
    Input shape = (84, 84, 12) for 4 stacked color frames.
    Output = Q-value for each action.
    """
    model = models.Sequential([
        layers.Input(shape=input_shape),

        layers.Conv2D(32, (8, 8), strides=2, activation='relu'),
        layers.BatchNormalization(),

        layers.Conv2D(64, (4, 4), strides=2, activation='relu'),
        layers.BatchNormalization(),

        layers.Conv2D(64, (3, 3), strides=1, activation='relu'),
        layers.BatchNormalization(),

        layers.Flatten(),

        layers.Dense(512, activation='relu'),
        layers.Dropout(0.5),  # Dropout to avoid overfitting
        
        layers.Dense(num_actions) 
    ])
    return model
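
As a sanity check on the architecture, the feature-map sizes can be worked out by hand. The sketch below assumes the Keras default `padding="valid"` for `Conv2D`, which is what the model above uses since no padding is specified; `conv_out` is a helper introduced here for illustration.

```python
def conv_out(size, kernel, stride):
    """Output width of a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1

# (kernel, stride, filters) for the three Conv2D layers above
conv_layers = [(8, 2, 32), (4, 2, 64), (3, 1, 64)]

size = 84
sizes = []
for kernel, stride, filters in conv_layers:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

flat_units = size * size * conv_layers[-1][2]
dense_params = flat_units * 512 + 512  # weights + biases of Dense(512)

print(sizes)         # [39, 18, 16]
print(flat_units)    # 16384
print(dense_params)  # 8389120
```

Most of the parameters therefore sit in the first dense layer. A larger stride in the first convolution (for example 4 instead of 2) would shrink the flattened vector considerably, at the cost of coarser spatial detail.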

c. Q-Learning Update (2 marks)

In [10]:
def update_model(main_model, target_model, optimizer, batch, gamma=0.99):
    """Run one DQN gradient step on a sampled batch and return the loss."""
    states, actions, rewards, next_states, dones = batch

    # Convert to tensors
    states = tf.convert_to_tensor(states, dtype=tf.float32)
    next_states = tf.convert_to_tensor(next_states, dtype=tf.float32)
    actions = tf.convert_to_tensor(actions, dtype=tf.int32)
    rewards = tf.convert_to_tensor(rewards, dtype=tf.float32)
    dones = tf.convert_to_tensor(dones, dtype=tf.float32)

    # Bellman target from the target model, computed outside the tape so it
    # is treated as a constant; (1 - dones) drops the bootstrap term at episode end
    next_q_values = target_model(next_states)
    max_next_q = tf.reduce_max(next_q_values, axis=1)
    target_q = rewards + gamma * max_next_q * (1.0 - dones)

    # Train step
    with tf.GradientTape() as tape:
        # Q(s, a) for the actions actually taken, using the main model
        q_values = main_model(states)
        q_action = tf.reduce_sum(q_values * tf.one_hot(actions, main_model.output_shape[-1]), axis=1)

        # Mean squared error between predicted Q and target Q
        loss = tf.keras.losses.MSE(target_q, q_action)

    # Apply gradients to the main model only
    grads = tape.gradient(loss, main_model.trainable_variables)
    optimizer.apply_gradients(zip(grads, main_model.trainable_variables))
    return loss.numpy()
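
The update uses the target y = r + gamma * max_a' Q_target(s', a') * (1 - done). The NumPy sketch below works through that arithmetic on a made-up batch of three transitions; all numbers are invented for illustration and do not come from the game.

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0, -1.0])
dones = np.array([0.0, 0.0, 1.0])  # the last transition ends the episode
actions = np.array([1, 0, 1])
q_values = np.array([[0.2, 1.5], [2.0, 0.3], [0.1, -0.5]])      # main model, Q(s, .)
next_q_values = np.array([[0.5, 1.0], [2.0, 0.0], [3.0, 4.0]])  # target model, Q(s', .)

# Q(s, a) for the action actually taken (same result as the one-hot trick)
q_action = q_values[np.arange(len(actions)), actions]

# Bellman target; (1 - done) removes the bootstrap term at terminal states
target_q = rewards + gamma * next_q_values.max(axis=1) * (1.0 - dones)

loss = np.mean((target_q - q_action) ** 2)

print(q_action)  # [ 1.5  2.  -0.5]
print(target_q)  # [ 1.99  1.98 -1.  ]
```

For the terminal transition the target is just the reward, because there is no next state to bootstrap from.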

d. Other Important Concepts (2 marks)

In [11]:
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
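
Two properties of this buffer are worth checking: how a full `deque` evicts old experiences, and roughly how much memory it needs. The sketch below is an estimate under stated assumptions: states are `uint8` arrays of shape (84, 84, 12), the capacity is the 100,000 used in this notebook, and each step adds one new stacked array (the next state of one transition is reused as the state of the following one).

```python
from collections import deque
import random

# Eviction: a deque with maxlen drops the oldest item once it is full.
buffer = deque(maxlen=3)
for step in range(5):
    buffer.append(("state", step))
kept_steps = [step for _, step in buffer]
print(kept_steps)  # [2, 3, 4]

# Uniform sampling without replacement, as in ReplayBuffer.sample
random.seed(0)
batch = random.sample(list(buffer), 2)

# Rough memory estimate: one new uint8 array of 84 * 84 * 12 bytes per step
bytes_per_state = 84 * 84 * 12
capacity = 100_000
total_gb = capacity * bytes_per_state / 1e9
print(bytes_per_state)     # 84672
print(round(total_gb, 1))  # 8.5
```

Under these assumptions a full buffer would need about 8.5 GB for the frames alone, so on a machine with limited RAM a smaller capacity or grayscale frames may be worth considering.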

In [None]:
# Hyperparameters
GAMMA = 0.99 # Discount factor
EPSILON = 1.0   # Initial exploration rate
EPSILON_MIN = 0.1 # Minimum exploration rate
EPSILON_DECAY = 0.995 # Multiplicative decay applied once per episode
LEARNING_RATE = 0.00025 # Learning rate
REPLAY_BUFFER_SIZE = 100000 # Maximum number of stored transitions
BATCH_SIZE = 32 # Batch size for training
TARGET_UPDATE_FREQ = 1000 # Environment steps between target model updates
STACK_SIZE = 4 # Number of frames to stack
TOTAL_EPISODES = 500 # Total episodes to train
MAX_STEPS = 5000 # Max steps per episode
input_shape = (84, 84, 3 * STACK_SIZE) # 4 stacked RGB frames -> 12 channels
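
Because epsilon is multiplied by `EPSILON_DECAY` once per episode, the schedule can be checked with a few lines of plain Python. This sketch assumes the decay rule used in the training loop, `max(EPSILON_MIN, EPSILON * EPSILON_DECAY)`; `epsilon_after` is a helper introduced here for illustration.

```python
EPSILON_MIN = 0.1
EPSILON_DECAY = 0.995

def epsilon_after(episodes, start=1.0):
    """Epsilon after a number of completed episodes (decayed once per episode)."""
    return max(EPSILON_MIN, start * EPSILON_DECAY ** episodes)

# Count the episodes needed before epsilon reaches the floor
eps, episodes_to_floor = 1.0, 0
while eps > EPSILON_MIN:
    eps *= EPSILON_DECAY
    episodes_to_floor += 1

print(f"{epsilon_after(10):.4f}")  # 0.9511
print(episodes_to_floor)           # 460
```

With these values the agent explores almost randomly for the first few dozen episodes and reaches the minimum exploration rate only around episode 460 of the planned 500.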

In [13]:
# Models and optimizer
main_model = create_dqn_model(input_shape, num_actions)
target_model = create_dqn_model(input_shape, num_actions)
target_model.set_weights(main_model.get_weights())
optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=LEARNING_RATE)

# Replay buffer
replay_buffer = ReplayBuffer(REPLAY_BUFFER_SIZE)

# Frame stack
stacked_frames = deque(maxlen=STACK_SIZE)

2025-04-13 17:38:04.514232: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M2
2025-04-13 17:38:04.514370: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 8.00 GB
2025-04-13 17:38:04.514374: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 2.67 GB
2025-04-13 17:38:04.514977: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-04-13 17:38:04.515233: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


In [14]:
def choose_action(stacked_state):
    # Epsilon-greedy action
    if np.random.rand() < EPSILON:
        action = env.action_space.sample()
    else:
        q_values = main_model(np.expand_dims(stacked_state, axis=0), training=False)
        action = np.argmax(q_values.numpy())
    return action
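
The same rule can be written as a pure function that takes the Q-values directly, which makes it easy to test without a model or an environment. This is a sketch with made-up Q-values; `epsilon_greedy` and its `rng` argument are illustrative names, not part of the notebook.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = np.array([0.1, 0.9, 0.3, 0.2, 0.0])  # made-up Q-values for 5 actions

greedy = epsilon_greedy(q, 0.0, rng)  # epsilon = 0 always exploits
counts = np.bincount(
    [epsilon_greedy(q, 0.1, rng) for _ in range(1000)], minlength=len(q)
)

print(greedy)           # 1
print(counts.argmax())  # 1
```

With epsilon at 0.1 the greedy action dominates, while the other actions are still tried occasionally.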

In [None]:
step_count = 0

for episode in tqdm(range(TOTAL_EPISODES), desc="Training Episodes"):
    start_time = time.time()

    state, info = env.reset()
    preprocessed_frame = preprocess_frame_color(state)
    stacked_state, stacked_frames = stack_frames(stacked_frames, preprocessed_frame, True)

    total_reward = 0
    done = False

    for step in tqdm(range(MAX_STEPS), desc=f"Episode {episode+1}", leave=False):
        step_count += 1
        # Epsilon-greedy action selection (see choose_action above)
        action = choose_action(stacked_state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Process next frame
        preprocessed_next = preprocess_frame_color(next_state)
        next_stacked_state, stacked_frames = stack_frames(stacked_frames, preprocessed_next, False)

        # Store experience
        replay_buffer.add((stacked_state, action, reward, next_stacked_state, float(done)))

        stacked_state = next_stacked_state
        total_reward += reward

        # Train once the buffer holds more than BATCH_SIZE transitions
        if len(replay_buffer) > BATCH_SIZE:
            batch = replay_buffer.sample(BATCH_SIZE)
            loss = update_model(main_model, target_model, optimizer, batch, gamma=GAMMA)

        # Update target network
        if step_count % TARGET_UPDATE_FREQ == 0:
            target_model.set_weights(main_model.get_weights())
            print(f"[Step {step_count}] 🔄 Target network updated.")

        if done:
            break

    # Decay epsilon
    EPSILON = max(EPSILON_MIN, EPSILON * EPSILON_DECAY)

    elapsed = time.time() - start_time
    tqdm.write(f"🎮 Episode {episode + 1}/{TOTAL_EPISODES} - Reward: {total_reward} - Epsilon: {EPSILON:.4f} - ⏱️ Time: {elapsed:.2f}s")

🎮 Episode 1/500 - Reward: 11.0 - Epsilon: 0.9950 - ⏱️ Time: 52.07s
🎮 Episode 2/500 - Reward: 12.0 - Epsilon: 0.9900 - ⏱️ Time: 77.00s
[Step 1000] 🔄 Target network updated.
🎮 Episode 3/500 - Reward: 23.0 - Epsilon: 0.9851 - ⏱️ Time: 80.91s
🎮 Episode 4/500 - Reward: 6.0 - Epsilon: 0.9801 - ⏱️ Time: 49.78s
[Step 2000] 🔄 Target network updated.
🎮 Episode 5/500 - Reward: 24.0 - Epsilon: 0.9752 - ⏱️ Time: 77.19s
🎮 Episode 6/500 - Reward: 19.0 - Epsilon: 0.9704 - ⏱️ Time: 112.50s
[Step 3000] 🔄 Target network updated.
🎮 Episode 7/500 - Reward: 17.0 - Epsilon: 0.9655 - ⏱️ Time: 64.72s
🎮 Episode 8/500 - Reward: 32.0 - Epsilon: 0.9607 - ⏱️ Time: 102.21s
[Step 4000] 🔄 Target network updated.
🎮 Episode 9/500 - Reward: 22.0 - Epsilon: 0.9559 - ⏱️ Time: 71.33s
🎮 Episode 10/500 - Reward: 9.0 - Epsilon: 0.9511 - ⏱️ Time: 53.90s

KeyboardInterrupt (during episode 11)
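
Single-episode rewards are noisy, so a moving average is a common way to see whether the agent is improving. The sketch below applies one to the ten episode rewards copied from the log above; `moving_average` and the window of 5 are choices made here for illustration.

```python
from collections import deque

# Episode rewards copied from the training log above
rewards = [11.0, 12.0, 23.0, 6.0, 24.0, 19.0, 17.0, 32.0, 22.0, 9.0]

def moving_average(values, window):
    """Mean of the last `window` values at each step (shorter at the start)."""
    recent, averages = deque(maxlen=window), []
    for value in values:
        recent.append(value)
        averages.append(sum(recent) / len(recent))
    return averages

smoothed = moving_average(rewards, window=5)

print(sum(rewards) / len(rewards))  # 17.5
print(smoothed[-1])                 # 19.8
```

With epsilon still above 0.95 in all ten episodes, these rewards mostly reflect random play, so no clear trend should be expected yet.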