# Pong Atari

## Importing Libraries

In [None]:
from agent0 import Agent0
from env2 import Env
from memory import ReplayMemory
import retro
import numpy as np
import torch
from tqdm import trange
import collections
import plotly.express as px
import pandas as pd
from test import test
import os

## Training Pong AI against Pong Bot

An AI is trained to play the Atari game Pong. Initially, the AI is trained with a Double Deep Q Network (Double DQN) taken from the Rainbow GitHub repository by Kaixhin. As the name suggests, two neural networks handle the learning process. The main DQN selects the next action, namely the one with the maximum value. A second DQN with the same structure, usually called the Target Network, evaluates that action. The agent is trained for 5 million steps.
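
To make the split between selection and evaluation concrete, here is a minimal sketch of a one-step Double DQN target for a single transition. It is a simplification for illustration only: the function name and the toy Q values are hypothetical, and the agent from the repository additionally uses n-step returns and a distributional output, which are not modelled here.

```python
def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """One-step Double DQN target for a single transition (illustrative sketch).

    next_q_online / next_q_target hold the Q values of the next state,
    one entry per action, from the main network and the target network.
    """
    if done:
        return reward  # no bootstrapping from a terminal state
    # Selection: the main network picks the greedy action.
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # Evaluation: the target network scores that action.
    return reward + gamma * next_q_target[best_action]


# Toy numbers (made up): the main network prefers action 1.
online = [1.0, 2.0, 0.5]
target = [0.3, 0.1, 0.9]

y_double = double_dqn_target(1.0, False, online, target, gamma=0.5)
# A plain DQN target would take the maximum of the target network instead.
y_single = 1.0 + 0.5 * max(target)
print(y_double, y_single)
```

In this toy case the Double DQN target is lower than the plain DQN target, because the action chosen by the main network is not the one the target network rates highest.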

### Loading variables, parameters and environment

The Environment Wrapper (Env) from env2 modifies the environment to suit our training process:
- It simplifies the control mapping needed to interact with the environment.
- It outputs a stack of 4 states instead of a single state for the neural network model to process.
- A random initial state is used as the starting state when the environment is reset, so that the agent does not overfit to the original starting state.
- For every state the model receives from the environment, the Pong game advances by 4 frames. That state is obtained by max pooling over the last 2 frames, and its reward is the sum of the rewards collected over the 4 frames.
- All of the modifications above also apply to the 2-Player Pong environment.
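
The frame-skipping step in the list above can be sketched as follows. This is not the code from env2: `ToyEnv` and `skip_step` are hypothetical names, and frames are plain lists of pixel values so that the example runs without an emulator.

```python
class ToyEnv:
    """Hypothetical stand-in for the emulator: each step returns a frame and a reward."""

    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        frame = [self.t, 10 - self.t, 0]  # three fake pixel values
        reward = 1 if self.t == 3 else 0  # a point is scored on the third frame
        done = False
        return frame, reward, done


def skip_step(env, action, skip=4):
    """Repeat `action` for `skip` frames, sum the rewards, max-pool the last 2 frames."""
    total_reward = 0
    last_two = []
    done = False
    for _ in range(skip):
        frame, reward, done = env.step(action)
        total_reward += reward
        last_two = (last_two + [frame])[-2:]  # keep only the two most recent frames
        if done:
            break
    # Element-wise maximum over the kept frames.
    pooled = [max(pixels) for pixels in zip(*last_two)]
    return pooled, total_reward, done


observation, reward, done = skip_step(ToyEnv(), action=0)
print(observation, reward, done)
```

Summing the rewards means that a point scored on any of the 4 skipped frames is still credited to the action that was repeated.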

In [None]:
env0 = retro.make(game='Pong-Atari2600', players=1) # Pong environment for 1 Player
env = Env(env0) # Environment Wrapper

### Initializing the arguments for the model and agent

In [None]:
class args_parser():
    def __init__(self, device, model=None):
        self.atoms = 51
        self.V_min = -10
        self.V_max = 10
        self.batch_size = 32
        self.multi_step = 3
        self.discount = 0.99
        self.norm_clip = 10.0
        self.learning_rate = 0.0000625
        self.adam_eps = 1.5e-4
        self.architecture = "canonical"
        self.history_length = 4
        self.hidden_size = 512
        self.noisy_std = 0.1
        self.priority_weight = 0.4
        self.priority_exponent = 0.5
        self.evaluation_episodes = 10
        self.render = False
        self.players = 1
        self.device = device
        self.model = model

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
env.train() # No purpose
args = args_parser(device)

In [None]:
CAPACITY = int(1e6)
EVALUATION_SIZE = 500
PRIO_WEIGHTS = 0.4
STEPS = int(5e6)
REPLAY_FREQ = 4
REWARD_CLIP = 1
LEARNING_START = int(10e3)
EVALUATION_INTERVAL = 100000
CHECKPOINT_INTERVAL = 1000000
TARGET_UPDATE = int(8e3)

In [None]:
model = Agent0(args, env)

### Training Iteration

In [None]:
priority_weight_increase = (1 - PRIO_WEIGHTS) / (STEPS - LEARNING_START)
mem = ReplayMemory(args, CAPACITY)
val_mem = ReplayMemory(args, EVALUATION_SIZE)
metrics = {'steps': [], 'rewards': [], 'Qs': [], 'best_avg_reward': -float('inf')}
results_dir = os.getcwd()
model.train()
done = True

for T in trange(1, STEPS+1):
    if done:
        state = env.reset().to(device)

    if T % REPLAY_FREQ == 0:
        model.reset_noise()

    action = model.act(state)
    next_state, reward, done, info = env.step(action)
    if REWARD_CLIP > 0:
        reward = max(min(reward, REWARD_CLIP), -REWARD_CLIP)
    mem.append(state, action, reward, done)

    if T >= LEARNING_START:
        mem.priority_weight = min(mem.priority_weight + priority_weight_increase, 1)  # Anneal importance sampling weight β to 1

        if T % REPLAY_FREQ == 0:
            model.learn(mem)  # Train with n-step distributional double-Q learning

        if T % EVALUATION_INTERVAL == 0:
            model.eval()  # Set DQN (online network) to evaluation mode
            avg_reward, avg_Q = test(args, T, model, val_mem, metrics, results_dir)  # Test
            model.train()  # Set DQN (online network) back to training mode

        # Update target network
        if T % TARGET_UPDATE == 0:
            model.update_target_net()

        # Checkpoint the network
        if (CHECKPOINT_INTERVAL != 0) and (T % CHECKPOINT_INTERVAL == 0):
            model.save(results_dir, 'checkpoint.pth')

    state = next_state.to(device)

env.close()
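
The loop above raises the importance-sampling weight β by a fixed amount on every step once learning has started. As a rough sketch, the same schedule can be written in closed form. The function name is hypothetical, and it assumes the memory starts with a priority weight of 0.4, as set in the arguments.

```python
def annealed_beta(step, beta_start=0.4, total_steps=5_000_000, learning_start=10_000):
    """Importance-sampling weight after `step` environment steps (illustrative sketch)."""
    if step < learning_start:
        return beta_start  # nothing is annealed before learning starts
    increase = (1 - beta_start) / (total_steps - learning_start)
    # The loop applies one increase per step, starting at learning_start.
    updates = step - learning_start + 1
    return min(beta_start + updates * increase, 1.0)


for step in (1, 10_000, 2_500_000, 5_000_000):
    print(step, round(annealed_beta(step), 4))
```

The weight grows linearly and is capped at 1 by the end of training.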

### Result from Training Pong AI against Pong Bot

The figure below shows the Q value graph recorded during the training process, which represents the expected total reward from the agent's behaviour.

<img src="./assets/Q Value.png" width="1000"/>

During the first 1 million steps, the Q value of the agent decreases considerably, which we assume is because it is still learning the correct moves for each possible state. In later steps, the Q value gradually increases, indicating that the agent is slowly improving and optimizing its action for a given state. The lowest Q value, -1.887, was recorded at the 900th episode, and the highest, 0.219, at the 4.9 millionth episode.

The next figure shows the reward graph obtained from the evaluation test run of the agent's neural network, conducted every 100000 steps.

<img src="./assets/Reward.png" width="1000"/>

For every test run, the agent is evaluated by playing against the Pong Bot for 10 rounds. When the agent scores a point it gains a reward of +1, whereas when the Pong Bot scores a point the agent is penalised with a reward of -1. The dotted lines in the graph represent the maximum and minimum total reward obtained within the 10 rounds played. The darker blue line shows the mean reward over the 10 games, while the shaded light blue area shows the spread of the reward, one standard deviation around the mean. From the graph, we can see that the agent explores the game slowly at first and learns to play it from approximately 1 million steps to 2.3 million steps. From then on, the agent is only optimizing itself to maximize the reward it can gain. In the test run conducted at 4.8 million steps, the agent beat the Pong Bot with a maximum reward of 21 over the 10 games, with no variance between games, and we believe the agent's neural network has reached its optimal point.
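
As an illustration of how the quantities in the graph relate to the per-game rewards, the sketch below summarizes one evaluation run. The plotting code is not shown here, so the function name, the use of the population standard deviation, and the sample rewards are all assumptions.

```python
from statistics import mean, pstdev


def summarize_episode_rewards(rewards):
    """Mean, extremes, and a one-standard-deviation band for one evaluation run."""
    avg = mean(rewards)
    std = pstdev(rewards)  # population standard deviation (an assumption)
    return {
        "mean": avg,
        "min": min(rewards),
        "max": max(rewards),
        "band": (avg - std, avg + std),
    }


mixed_run = [21, 19, 20, 21, 18, 21, 20, 21, 19, 20]  # made-up rewards for 10 games
perfect_run = [21] * 10  # every game won 21 to 0

print(summarize_episode_rewards(mixed_run))
print(summarize_episode_rewards(perfect_run))
```

For the perfect run the band collapses onto the mean, which is what "no variance" looks like in the graph.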

## Evaluation of the Pong AI against itself

<img src="./assets/Test2.gif" >

We evaluated the trained Pong AI against itself and found that the Pong AI controlling the left paddle is more prone to losing, as it cannot deflect the ball properly. Investigating this issue, we found that the state differs depending on whether the Pong AI plays on the right side or the left side. When the ball veers to the right at the start of the game, it can reach the outside boundary and score without bouncing off the floor or the ceiling. However, when the ball veers to the left at the start of the game, it hits the floor or the ceiling before reaching the outside boundary. This difference in state is what causes the Pong AI to perform poorly on the left side: it was trained only on the right side, and we believe it may have overfitted to playing the right paddle.

## Training Pong AI against itself

In this section, we further fine-tune the Pong AI by training the neural network against itself. All parameters and the model are the same as those used to train the Pong AI against the Pong Bot. The only differences are that the environment is set up for 2 players and that two neural networks are trained during the process: one on the left paddle and another on the right paddle. Both agents are trained for 3 million steps.

In [None]:
from agent import Agent
from agent2 import Agent2
from test2 import test

In [None]:
env0 = retro.make(game='Pong-Atari2600', players=2)
env = Env(env0)

In [None]:
trained = 'model.pth'

In [None]:
env.train() # No purpose
args = args_parser(device, trained)

In [None]:
model = Agent(args, env)

In [None]:
model2 = Agent2(args, env)

In [None]:
CAPACITY = int(1e4)
EVALUATION_SIZE = 500
PRIO_WEIGHTS = 0.4
STEPS = int(3e6)
REPLAY_FREQ = 4
REWARD_CLIP = 1
LEARNING_START = int(10e3)
EVALUATION_INTERVAL = 100000
TARGET_UPDATE = int(8e3)

In [None]:
priority_weight_increase = (1 - PRIO_WEIGHTS) / (STEPS - LEARNING_START)
mem = ReplayMemory(args, CAPACITY)
mem2 = ReplayMemory(args, CAPACITY)
metrics = {'steps': [], 'reward1': [], 'reward2': [], 'best_avg_reward1': -float('inf'), 'best_avg_reward2': -float('inf')}
results_dir = os.getcwd()
model.train()
model2.train()
done = True

for T in trange(1, STEPS+1):
    if done:
        state = env.reset().to(device)

    if T % REPLAY_FREQ == 0:
        model.reset_noise()
        model2.reset_noise()

    state2 = torch.flip(state,[2])
    action = model.act(state)
    action2 = model2.act(state2)
    next_state, reward, done, info = env.step_2P(action, action2)

    reward1, reward2 = reward
    mem.append(state, action, reward1, done)
    mem2.append(state2, action2, reward2, done)

    if T >= LEARNING_START:
        mem.priority_weight = min(mem.priority_weight + priority_weight_increase, 1)  # Anneal importance sampling weight β to 1
        mem2.priority_weight = min(mem2.priority_weight + priority_weight_increase, 1)

        if T % REPLAY_FREQ == 0:
            model.learn(mem)  # Train with n-step distributional double-Q learning
            model2.learn(mem2)

        if T % EVALUATION_INTERVAL == 0:
            model.eval()  # Set DQN (online network) to evaluation mode
            model2.eval()
            test(args, T, model, model2, env0, metrics, results_dir)  # Test
            model.train()  # Set DQN (online network) back to training mode
            model2.train()

        # Update target network
        if T % TARGET_UPDATE == 0:
            model.update_target_net()
            model2.update_target_net()


    state = next_state.to(device)

env.close()
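
In the loop above, the second agent receives `torch.flip(state,[2])`, so that it sees the screen mirrored. The sketch below shows the same idea on nested lists, assuming the state is laid out as (frames, height, width). The function name and the tiny frame are made up for illustration.

```python
def mirror_state(state):
    """Flip every frame in a stack left-to-right (illustrative sketch).

    `state` is a nested list shaped (frames, height, width). Reversing each
    row mirrors the width axis, which corresponds to flipping dimension 2.
    """
    return [[row[::-1] for row in frame] for frame in state]


frame = [
    [0, 0, 0, 1],  # 1 marks a paddle on the right edge
    [0, 2, 0, 1],  # 2 marks the ball
]
state = [frame, frame]  # a stack of two identical frames

mirrored = mirror_state(state)
print(mirrored[0])
```

After mirroring, the paddle appears on the left edge, and flipping twice returns the original state.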

### Result from Training Pong AI against itself

<table><tr><td><img src='./assets/Reward1.png'></td><td><img src='./assets/Reward2.png'></td></tr></table>

The two graphs above show the reward score of the <strong>Right</strong> agent (Reward-1) and the <strong>Left</strong> agent (Reward-2) respectively. Before starting the training process, we hypothesized that for the Pong AI to fine-tune itself, the opposing AI must be able to win against it, so that the Pong AI can in turn improve by training against the improved opponent; the reward graph should therefore be a zig-zag line. At the optimal point, both agents would have an expected reward score of 0, indicating that both are fully fine-tuned. As the graphs show, the <strong>Right</strong> agent consistently wins more against the <strong>Left</strong> agent. However, there are two instances (at 1.3 million steps and 2.4 million steps) where the <strong>Left</strong> agent managed to win against the <strong>Right</strong> agent, indicating that it had improved, and the <strong>Right</strong> agent was able to improve further following these two instances.

# Space Invaders Atari

## Importing Libraries

In [None]:
import gym
import numpy as np
import retro
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import os
import imageio
from env_SI import Env
from collections import deque, namedtuple
from torch.autograd import Variable
from tqdm import trange
import random
from test import test

## Training a Deep Q Network for Space Invaders

The simple neural network constructed for the Deep Q Network is as follows:
- A convolutional layer takes in a stack of 4 states from the environment.
- Another convolutional layer performs further feature extraction.
- The features are flattened and fed through a linear layer into another linear layer.
- The last linear layer outputs one Q value per action, and the action with the highest Q value is the one taken by the model.

The agent will be trained for 1 million steps.
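
The layer sizes used in Space_Invaders_DQN can be checked with the usual output-size formula for a convolution without padding. The sketch below assumes an 84x84 input frame, which is consistent with the 20x20 and 9x9 sizes noted in the code comments; the helper name is hypothetical.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output width (or height) of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


side = 84  # assumed input height and width
after_conv1 = conv_out(side, kernel=8, stride=4)
after_conv2 = conv_out(after_conv1, kernel=4, stride=2)
flat_features = 32 * after_conv2 * after_conv2  # 32 output channels

print(after_conv1, after_conv2, flat_features)
```

The flattened size matches the `32*81` input features expected by the first linear layer.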

In [None]:
class Space_Invaders_DQN(nn.Module):
    def __init__(self):
        super(Space_Invaders_DQN, self).__init__()
        self.conv1 = nn.Conv2d(4, 16, 8, stride=4) # output will be 20x20x16
        self.conv2 = nn.Conv2d(16, 32, 4, stride=2) # output will be 9x9x32
        self.fc1 = nn.Linear(32*81, 256)
        self.fc2 = nn.Linear(256, 6)

    def forward(self,x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = x.view(-1, 32*81)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

### Replay Memory

Stores the agent's experience at each timestep, which is then used to train the model.

In [None]:
class RepMemory(object):
    def __init__(self, capacity):
        self.buffer = deque(maxlen = capacity)

    def store(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        buffer_size = len(self.buffer)
        index = np.random.choice(np.arange(buffer_size),
                                size = batch_size,
                                replace = False)
        
        return [self.buffer[i] for i in index]

    def length(self):
        return len(self.buffer)

## Deep Q Network Agent

In [None]:
class DQNAgent(object):
    def __init__(self, args):
        self.batch_size = args.batch_size
        self.gamma = args.gamma
        self.loss_fn = args.loss_fn

        self.online_net = Space_Invaders_DQN().to(device=args.device)
        if args.model:  # Load pretrained model if provided
            if os.path.isfile(args.model):
                state_dict = torch.load(args.model, map_location='cpu')  # Always load tensors onto CPU by default, will shift to GPU if necessary
                # Checkpoints written by save() already use the conv1/conv2/fc1/fc2 keys of Space_Invaders_DQN, so no key re-mapping is needed
                self.online_net.load_state_dict(state_dict)
                print("Loading pretrained model: " + args.model)
            else:  # Raise error if incorrect model path provided
                raise FileNotFoundError(args.model)
    
        self.optimizer = optim.Adam(self.online_net.parameters(), lr=args.lr)
    
    
    def act(self, state):
        with torch.no_grad():
            return self.online_net(state.unsqueeze(0)).argmax(1).item()
    
    def learn(self, mem):
        # check if enough experience collected so far
        # the agent continues with a random policy without updates till then
        if mem.length() < self.batch_size:
            return
    
        self.optimizer.zero_grad()
        # sample a random batch from the replay memory to learn from experience
        # for no experience replay the batch size is 1 and hence learning online
        transitions = mem.sample(self.batch_size)
        batch = Transition(*zip(*transitions))
        
        # isolate the values
        non_terminal_mask = np.array(list(map(lambda s: s is not None, batch.next_state)))
        
        with torch.no_grad():
            batch_next_state = Variable(torch.cat([s for s in batch.next_state if s is not None]))

        batch_state = Variable(torch.cat(batch.state)).to(device)
        batch_action = Variable(torch.stack(batch.action)).to(device)
        batch_reward = Variable(torch.stack(batch.reward)).to(device)
    
        # There is no separate target Q-network implemented and all updates are done
        # synchronously at intervals of 1 unlike in the original paper
        # current Q-values
        current_Q = self.online_net(batch_state).gather(1, batch_action)
        # expected Q-values (target)
        max_next_Q = self.online_net(batch_next_state).detach().max(1)[0]
        expected_Q = batch_reward
        
        expected_Q[non_terminal_mask] += (self.gamma * max_next_Q).data.unsqueeze(1)
        # with torch.no_grad():
        #     expected_Q = Variable(torch.from_numpy(expected_Q).cuda())
    
        # loss between current Q values and target Q values
        if self.loss_fn == 'l1':
            loss = F.smooth_l1_loss(current_Q, expected_Q)
        else:
            loss = F.mse_loss(current_Q, expected_Q)
    
        # backprop the loss
        loss.backward()
        self.optimizer.step()

    def train(self):
        self.online_net.train()

    def eval(self):
        self.online_net.eval()

    def save(self, path, name='model.pth'):
        torch.save(self.online_net.state_dict(), os.path.join(path, name))

The Environment Wrapper (Env) from env_SI modifies the environment to suit Space Invaders:
- It outputs a stack of 4 states instead of a single state for the neural network model to process.
- A random initial state is used as the starting state when the environment is reset, so that the agent does not overfit to the original starting state.
- For every state the model receives from the environment, the Space Invaders game advances by 4 frames. That state is represented by the last frame, and its reward is the sum of the rewards collected over the 4 frames.

In [None]:
env0 = gym.make('SpaceInvaders-v0')
env = Env(env0, device)

### Initializing the arguments for the model and agent

In [None]:
class args_parser():
    def __init__(self, device, model=None):
        self.batch_size = 32
        self.lr = 0.0001
        self.gamma = 0.99
        self.loss_fn = 'l1'
        self.evaluation_episodes = 5
        self.device = device
        self.model = model  # Path to a pretrained model, or None to train from scratch

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

results_dir = os.getcwd()
Transition = namedtuple('Transition',('state', 'action', 'next_state', 'reward'))

In [None]:
CAPACITY = int(1e4)
STEPS = int(1e6)
EVALUATION_INTERVAL = int(1e5)
REPLAY_FREQ = 4
EPSILON_START = 0.95
EPSILON_END = 0.05
EPSILON_DECAY = 600000
args = args_parser(device)
i = 1

In [None]:
model = DQNAgent(args)

### Training Iteration

In [None]:
mem = RepMemory(CAPACITY)
metrics = {'steps': [], 'rewards': [], 'best_avg_reward': -float('inf')}
done = True
model.train()

for T in trange(1, STEPS+1):
    if done:
        state = env.reset().to(device)

    if T < (EPSILON_DECAY + 1):
        eps_threshold = T * ((EPSILON_END - EPSILON_START) / (EPSILON_DECAY)) + EPSILON_START
    else:
        eps_threshold = EPSILON_END

    if random.random() > eps_threshold:
        action = model.act(state)
    else:
        action = random.randint(0, 5)

    next_state, reward, done, info = env.step(action)
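    # Store None as the next state on terminal transitions so that learn() can mask them out
    stored_next_state = None if done else next_state.unsqueeze(0)
    mem.store((state.unsqueeze(0), torch.tensor([action]), stored_next_state, torch.tensor([reward], dtype=torch.float32)))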

    if T % REPLAY_FREQ == 0:
        model.learn(mem)

    if T % EVALUATION_INTERVAL == 0:
        model.eval()
        gif = test(args, T, model, env0, metrics, results_dir)
        imageio.mimsave(os.path.join(results_dir, './GIF/DQN{}.gif'.format(i)), gif)
        i += 1
        model.train()
        
    state = next_state.to(device)

env.close()
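
The exploration schedule used in the loop above is a linear decay followed by a constant floor. A minimal standalone sketch of that schedule and of the ε-greedy choice is shown below. The function names are hypothetical, and the defaults mirror EPSILON_START, EPSILON_END and EPSILON_DECAY.

```python
import random


def epsilon_at(step, start=0.95, end=0.05, decay_steps=600_000):
    """Linearly decay epsilon from `start` to `end`, then hold it at `end`."""
    if step >= decay_steps:
        return end
    return start + step * (end - start) / decay_steps


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


for step in (0, 300_000, 600_000, 1_000_000):
    print(step, round(epsilon_at(step), 3))
```

With these settings the agent acts almost entirely at random at the start and still explores on about 5% of its steps for the last 400000 steps.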

## Trial-And-Error and Results from Training Space Invaders on Deep Q Network

<img src="./assets/FirstTest.gif" >

During the first training run of the Space Invaders AI, the AI maximized its reward by constantly firing lasers to eliminate the invaders. However, we felt that the AI was not yet learning the game properly, as the spaceship only shot at the aliens at random to try to gain rewards. Hence, we made a small change to the environment wrapper: the agent is penalized with a reward of -100, to force it to move the spaceship and dodge the lasers shot by the invaders.

### After small changes in the environment

<img src="./assets/DQN.gif" >

Now the Space Invaders AI learns to dodge the lasers shot by the invaders, instead of moving the spaceship around randomly and shooting its lasers.

<img src="./assets/SI_DQN.png" >

The agent is evaluated every 100000 steps and the rewards obtained are shown in the graph above. Even though we have tuned the environment so that the Space Invaders AI learns to dodge the lasers, it does not seem to learn exceptionally well. This may be due to maximization bias in the Deep Q Network, which tends to overestimate both the value and the action-value (Q) functions. Hence, we will try using a Double Deep Q Network to train the AI.
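
Maximization bias can be reproduced with a small simulation that is independent of the agents in this notebook. In the sketch below, every action has a true value of 0 and the estimates are pure zero-mean noise. The function name, the noise level, and the number of trials are arbitrary choices for illustration.

```python
import random


def estimate_bias(num_actions=6, noise_std=1.0, trials=10_000, seed=0):
    """Average value picked by a single estimator and by a double estimator."""
    rng = random.Random(seed)
    single_total = 0.0
    double_total = 0.0
    for _ in range(trials):
        # Two independent sets of noisy estimates of the same true values (all 0).
        est_a = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        est_b = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # Single estimator: select and evaluate with the same estimates.
        single_total += max(est_a)
        # Double estimator: select with one set, evaluate with the other.
        best = max(range(num_actions), key=lambda a: est_a[a])
        double_total += est_b[best]
    return single_total / trials, double_total / trials


single, double = estimate_bias()
print("single estimator:", round(single, 3))
print("double estimator:", round(double, 3))
```

The single estimator reports a clearly positive value even though every true value is 0, while the double estimator stays close to 0. Separating selection from evaluation in this way is the idea behind the Double Deep Q Network.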

## Training a Double Deep Q Network for Space Invaders

For the Double Deep Q Network, we use the same model that was used to train the Pong AI, from the Rainbow GitHub repository by Kaixhin. All parameters are also the same as the default parameters used to train the Pong AI.

In [None]:
from agent import Agent
from memory import ReplayMemory

In [None]:
class args_parser2():
    def __init__(self, device, model=None):
        self.atoms = 51
        self.V_min = -10
        self.V_max = 10
        self.batch_size = 32
        self.multi_step = 3
        self.discount = 0.99
        self.norm_clip = 10.0
        self.learning_rate = 0.0000625
        self.adam_eps = 1.5e-4
        self.architecture = "canonical"
        self.history_length = 4
        self.hidden_size = 512
        self.noisy_std = 0.1
        self.priority_weight = 0.4
        self.priority_exponent = 0.5
        self.evaluation_episodes = 5
        self.render = False
        self.players = 1
        self.device = device
        self.model = model

In [None]:
CAPACITY = int(1e4)
EVALUATION_SIZE = 500
PRIO_WEIGHTS = 0.4
STEPS = int(1e6)
REPLAY_FREQ = 4
REWARD_CLIP = 1
LEARNING_START = int(10e3)
EVALUATION_INTERVAL = 100000
TARGET_UPDATE = int(8e3)
EPSILON_START = 0.95
EPSILON_END = 0.05
EPSILON_DECAY = 600000
i = 1
args = args_parser2(device)

In [None]:
model2 = Agent(args, env)

In [None]:
priority_weight_increase = (1 - PRIO_WEIGHTS) / (STEPS - LEARNING_START)
mem2 = ReplayMemory(args, CAPACITY)
metrics2 = {'steps': [], 'rewards': [], 'Qs': [], 'best_avg_reward': -float('inf')}
results_dir = os.getcwd()
model2.train()
done = True

for T in trange(1, STEPS+1):
    if done:
        state = env.reset().to(device)

    if T % REPLAY_FREQ == 0:
        model2.reset_noise()

    if T < (EPSILON_DECAY + 1):
        eps_threshold = T * ((EPSILON_END - EPSILON_START) / (EPSILON_DECAY)) + EPSILON_START
    else:
        eps_threshold = EPSILON_END

    action = model2.act_e_greedy(state, eps_threshold)
    next_state, reward, done, info = env.step(action)
    if REWARD_CLIP > 0:
        reward = max(min(reward, REWARD_CLIP), -REWARD_CLIP)
    mem2.append(state, action, reward, done)

    if T >= LEARNING_START:
        mem2.priority_weight = min(mem2.priority_weight + priority_weight_increase, 1)  # Anneal importance sampling weight β to 1

        if T % REPLAY_FREQ == 0:
            model2.learn(mem2)  # Train with n-step distributional double-Q learning

        if T % EVALUATION_INTERVAL == 0:
            model2.eval()  # Set DQN (online network) to evaluation mode
            gif = test(args, T, model2, env0, metrics2, results_dir)
            imageio.mimsave(os.path.join(results_dir, './GIF/DDQN{}.gif'.format(i)), gif)
            i += 1
            model2.train()  # Set DQN (online network) back to training mode
            
        # Update target network
        if T % TARGET_UPDATE == 0:
            model2.update_target_net()

    state = next_state.to(device)

env.close()

## Results from Training Space Invaders on Double Deep Q Network

<img src="./assets/DDQN.gif" >

After training on a Double Deep Q Network, the Space Invaders AI has learned some tricks for playing the game. The AI agent is able to use the shields in the game to block the invaders' laser attacks, and it has also learned to create small openings in a shield to shoot the invaders through them. This shows that the Double Deep Q Network effectively lets the agent learn the hidden techniques it can make use of when playing the game.

<img src="./assets/SI_DDQN.png" >

The agent is also evaluated during training by playing 5 rounds of Space Invaders, and the graph above shows the expected reward of the agent. From the graph, we can see that after 1 million training steps the AI agent performs considerably better than the AI trained on a Deep Q Network.