### HD Task 13: Extend Module 12 Task


### Tic-Tac-Toe Problem
##### Extensions
1. Integrate the idea of deep representation learning in your code. For this you can use a deep convolutional neural network to represent the state of the board, and integrate it in your implementation of the Q-Learning algorithm.

2. Implement a policy gradient algorithm. You can use the simple REINFORCE algorithm, or do your own research and decide which algorithm you will implement.

3. Compare the performance of your policy gradient algorithm with the deep Q-Learning algorithm.

## UML Design

## Abstract Board Definition
#### Parent/ Super-Type Class

In [1]:
import datetime
import random

# Library for module abstract class
from abc import ABC, abstractmethod

# Define the abstract board
class AbstractBoard(ABC):
    def __init__(self, board_size): # Initialize the object's attributes
        self.board_size = board_size # Defines the size

    @abstractmethod
    def set_board(self, placement, state):
        pass

    @abstractmethod
    def get_board_state(self):
        pass

    @abstractmethod
    def get_board_size(self):
        return self.board_size

    @abstractmethod
    def insert_letter(self, letter, position):
        pass

    @abstractmethod
    def is_full(self):
        pass

    @abstractmethod
    def print_board(self):
        pass

    @abstractmethod
    def space_is_free(self, position):
        pass

    @abstractmethod
    def reset_board(self):
        pass

## Concrete Board Definition
#### Child/ Sub-Type Class
#### Define the method to make it concrete

In [2]:
class Board:
    def __init__(self, size):
        self.size = size
        self.board = [[' ' for _ in range(size)] for _ in range(size)]
        self.turn = 'O'

    def print_board(self):
        for row in self.board:
            print(' | '.join(row))

    def get_board_state(self):
        board_state = {}
        for i in range(self.size):
            for j in range(self.size):
                board_state[(i, j)] = self.board[i][j]
        return board_state

    def insert_letter(self, letter, move):
        i, j = move
        self.board[i][j] = letter

    def set_board(self, move, letter):
        i, j = move
        self.board[i][j] = letter
        self.turn = letter

    def chk_for_win(self, letter):
        n = self.size
        # Any complete row or column of `letter` wins
        for i in range(n):
            if all(self.board[i][j] == letter for j in range(n)):
                return True
            if all(self.board[j][i] == letter for j in range(n)):
                return True
        # Main diagonal and anti-diagonal
        if all(self.board[i][i] == letter for i in range(n)):
            return True
        if all(self.board[i][n - 1 - i] == letter for i in range(n)):
            return True
        return False

    def chk_for_draw(self):
        for i in range(self.size):
            for j in range(self.size):
                if self.board[i][j] == ' ':
                    return False
        return True


    def get_turn(self):
        return self.turn


## Abstract Game Definition
#### Parent/ Super-Type Class

In [3]:
class AbstractGame(ABC):
  def __init__(self, board_data):
    self.board_data = board_data # Defines the board

  # Abstract methods must sit at class level (not inside __init__) to be enforced
  @abstractmethod
  def chk_for_win(self, letter):
    pass

  @abstractmethod
  def chk_for_draw(self):
    pass


## Concrete Game Definition
#### Child/ Sub-Type Class

In [4]:
class Game(AbstractGame):
  def __init__(self, board_data):
    # Initialise the parent/ super type class, which stores the board
    super().__init__(board_data)

  # Check for Win
  def chk_for_win(self, letter):
    # get_board_state() returns a dict keyed by (row, col) tuples
    board_state = self.board_data.get_board_state()
    size = self.board_data.size
    for row in range(size): # Check rows
        if all(board_state[(row, col)] == letter for col in range(size)):
            return True

    for col in range(size): # Check columns
        if all(board_state[(row, col)] == letter for row in range(size)):
            return True

    if all(board_state[(i, i)] == letter for i in range(size)): # Check main diagonal
        return True

    if all(board_state[(i, size - 1 - i)] == letter for i in range(size)): # Check anti-diagonal
        return True
    return False

  # Check for Draw
  def chk_for_draw(self):
    board_state = self.board_data.get_board_state()
    for key, value in board_state.items(): # Calling tuple unpack to access keys/ values
        if value == ' ':
            return False
    return True


## Abstract Player Definition
#### Parent/ Super-Type Class
#### All details (functions) are in lower module (OCP)

In [5]:
class AbstractPlayer(ABC):
    def __init__(self, letter, algorithm):
        self.letter = letter # O for human/bot, X for bot
        self.algorithm = algorithm # subclass of abstract algorithm

    @abstractmethod
    def get_move(self, board):
        pass

## Concrete Player Definition (human)
#### Child/ Sub-Type Class

In [6]:
class HumanPlayer(AbstractPlayer):
    def __init__(self, letter, algorithm):
        super().__init__(letter, algorithm)

    def get_move(self, board):
        while True:
            try:
                position = int(input(f'Enter position for {self.letter}: '))
                if 1 <= position <= len(board) and board[position] == ' ':
                    return position
                else:
                    print('Invalid position, please enter a different position.')
            except ValueError:
                print('Invalid input. Please enter a valid integer.')

## Concrete Player Definition (bot)
#### Child/ Sub-Type Class

In [7]:
class BotPlayer(AbstractPlayer):
    def __init__(self, letter, algorithm):
        super().__init__(letter, algorithm)

    def get_move(self, board):
        # Delegate to the injected algorithm, which returns the move as an (i, j) tuple
        return self.algorithm.get_move(board, self.letter)

## Abstract Algorithm Definition
#### Parent/ Super-Type Class

In [8]:
class Algorithm(ABC):
  def __init__(self, board_data):
    self.board_data = board_data
    self.player = 'O'
    self.bot = 'X'

  # Abstract method declared at class level so subclasses must implement it
  @abstractmethod
  def get_move(self, board_data, letter):
    pass

# Deep Convolutional Neural Network (Board State) Q-Learning Algorithm

### Use a deep convolutional neural network to represent the state of the board.

##### a. Board Representation: CNN Model

#### Earlier SIT320 modules showed that the game of tic-tac-toe cannot readily be solved with ‘classical techniques’ [1, p. 8].

#### *For example, the Minimax algorithm assumes the opponent will play in a particular way, and this assumption can be incorrect.*
- To find the optimal solution for a sequential decision problem such as tic-tac-toe, dynamic programming (DP) can use the probabilities of the opponent’s behaviour (calculated for each move), or the agent can ‘…learn the model of the opponent’s behaviour’ [1, p. 9]. <br>
- However, many episodes (games) are required to estimate these probabilities, using a value function method to evaluate all states.


#### A Deep Convolutional Neural Network (CNN) is a type of artificial neural network (ANN) ‘…specialized for processing high-dimensional data arranged in spatial arrays, such as images’ [1, p. 227].
- Each layer in a CNN produces a number of feature maps (a feature map is a pattern of activity over an array of units), and each unit performs the same operation on the data within its receptive field.
- The units of a feature map differ only in the location of their receptive fields on the incoming array, so units in the same feature map share the same weights.


#### The CNN model takes the board (game) state as input, passes it through one or more hidden layers, and produces Q values for the possible moves in that state at the output layer.
- I used TensorFlow [4], an open-source machine learning (ML) Python library from Google, to create the CNN model.
- I added a padding layer before the first Conv2D layer because the 3x3 game state is too small for the layer parameters: without padding, stacked convolution (or MaxPooling2D [4]) layers would produce zero or negative output dimensions.
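
The effect of padding on the output shape can be checked without TensorFlow. The following is a minimal NumPy sketch, not the notebook's model: `conv2d_valid` is a hypothetical helper that slides a single shared 3x3 kernel over the board with no padding and stride 1, and the kernel weights are illustrative rather than learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same weights are reused at every location (weight sharing)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A board encoded as 'O' = 1, 'X' = -1, empty = 0
board = np.array([[1, 0, -1],
                  [0, 1, 0],
                  [-1, 0, 1]])
kernel = np.ones((3, 3))  # illustrative weights, not learned values

unpadded = conv2d_valid(board, kernel)            # 3x3 input -> 1x1 output
padded = conv2d_valid(np.pad(board, 1), kernel)   # 5x5 padded input -> 3x3 output
print(unpadded.shape, padded.shape)
```

Without padding, a single 3x3 convolution already shrinks the board to 1x1, so a second 3x3 layer has nothing left to operate on; with one cell of zero padding the shapes go 5x5, then 3x3, then 1x1 across two such layers.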

#### I generated an architecture summary of the CNN model with TensorFlow. The table lists the layers of the model, the output shape of each layer and its number of trainable parameters, which visualises the model structure. The model has two convolutional layers followed by a flatten layer and two dense layers.

#### To encode the data for the CNN, we can either use a single node fed with a unique hash value of the board, or use an array in which each element encodes the piece at a specific position as an integer.<br>
- I prepared the input data as ‘O’ = 1 (9 bits), ‘X’ = -1 (9 bits) and empty space = 0 (9 bits), representing a total of 27 bits. The output layer will consist of 9 nodes, each representing a position on the board; the values in the output nodes are the Q values for the corresponding moves (the estimated future reward for that move in the given state).
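
One way to read the 27-bit figure is as three 9-bit planes, one per cell value. The sketch below works under that assumption; `encode_board_planes` is a hypothetical helper that is not part of this notebook, and it expects the `(row, col)`-keyed dictionary returned by `Board.get_board_state()`.

```python
import numpy as np

def encode_board_planes(board_state, size=3):
    """Hypothetical helper: one binary plane each for 'O', 'X' and empty cells."""
    planes = np.zeros((3, size, size), dtype=np.int8)
    plane_of = {'O': 0, 'X': 1, ' ': 2}
    for (row, col), letter in board_state.items():
        planes[plane_of[letter], row, col] = 1
    return planes

# Example: 'O' in the centre, 'X' in the top-right corner
state = {(r, c): ' ' for r in range(3) for c in range(3)}
state[(1, 1)] = 'O'
state[(0, 2)] = 'X'

planes = encode_board_planes(state)
print(planes.shape, planes.size, int(planes.sum()))
```

Each cell sets exactly one bit across the three planes, so the 27 values always contain nine ones. The `prepare_input` function in the next code cell uses the more compact single-plane form (1, -1, 0) instead.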



In [62]:
!pip install --upgrade tensorflow
import tensorflow.compat.v1 as tf


In [1]:
# Python Libraries
import numpy as np
import random
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Flatten, Dense
from tensorflow.keras.models import Sequential

# Prepare data: Transform board state to CNN input shape
def prepare_input(board_state):
    mapping = {'O': 1, 'X': -1, ' ': 0}
    # Reshape data for CNN
    board = [mapping[letter] for letter in board_state.values()]
    # return np.array(board).reshape(5, 5, 1)
    return np.array(board).reshape(3, 3, 1)

# Represent state of board (CNN board)
def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(3, 3, 1)), # Input shape
        tf.keras.layers.ZeroPadding2D(padding=((1, 1), (1, 1))), # Padding layer
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu'), # Padded input is (5, 5, 1); output is (3, 3, 32)
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='linear')
    ])
    return model

# Test print
print("Model set up:")
model = build_model()  # Create the model
model.summary() # Print model architecture summary


Model set up:
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 zero_padding2d (ZeroPaddin  (None, 5, 5, 1)           0         
 g2D)                                                            
                                                                 
 conv2d (Conv2D)             (None, 3, 3, 32)          320       
                                                                 
 conv2d_1 (Conv2D)           (None, 1, 1, 64)          18496     
                                                                 
 flatten (Flatten)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 64)                4160      
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                          

In [4]:
# standard libraries
import json  # to store learned state
import tqdm
import torch
import numpy as np

# settings
torch.manual_seed(0)
np.random.seed(2)

### Integrate it in your implementation of Q-Learning algorithm.

### Floris Laporte’s article Reinforcement from the Ground Up (2020) [2] was used to study Q learning with a CNN.

[2] https://blog.flaport.net/reinforcement-learning-part-2.html


#### The aim of this modification is to train the CNN to find playable spaces and predict the best X and O actions (moves) as more layers are added to the network. By using regression, a CNN can take less space than the tabular format (Q-Learning) and mimic the Q function to predict what the next action (move) in the game might be.

#### I found this topic difficult to implement into the previous tic tac toe game methods, however I still wanted to see how this algorithm affected the game play.
- The QModel class consists of an embedding layer, three fully connected (linear) layers and ReLU activation functions.
- The forward method defines how the model processes input data, and the save and load methods store and restore the model parameters. The strategy used is an epsilon-greedy policy, where the agent chooses a random action with a certain probability (epsilon) in order to explore moves.
- The TicTacToe class, Fig. 3: the play method simulates a given number of games and records transitions (state, action, next state and reward). The play_turn method simulates a player’s turn and checks the outcome, and the visualize_state method prints the board to the terminal.
- The Agent class, Fig. 4, uses the Q learning approach, where the agent learns a Q value function to estimate the expected future rewards for each action. The agent takes random actions with a probability determined by the epsilon parameter, which could be tuned with more time. The best_action method returns the best action based on the current Q values, and get_action decides the next action (random or based on Q values). The learn method updates the Q values using the transitions recorded in each game.
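
The epsilon-greedy choice described in the list above can be sketched on its own, independent of the `Agent` class. This is a minimal illustration: `epsilon_greedy` and the example Q values are hypothetical and are not taken from the trained model.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore at random, otherwise pick the highest Q value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: any of the 9 cells
    return int(np.argmax(q_values))              # exploit: best known action

rng = np.random.default_rng(0)
# Made-up Q values for the nine board positions
q_values = np.array([0.1, 0.5, -0.2, 0.0, 0.9, 0.3, 0.0, 0.2, 0.4])

greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
explored = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(100)]
print(greedy_action, sorted(set(explored)))
```

With `epsilon=0.0` the function always returns the centre cell (index 4), which mirrors how the trained agent is evaluated later with `Agent(epsilon=0.0)`.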

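### Regression (learning to mimic another function) is typically implemented with a Mean Squared Error (MSE) loss [1], [4]. The output of the network is the input to the loss function, and the target is the reward plus the discounted maximum Q value of the next state, which serves as the updated estimate of the Q function. Note that the `learn` method below uses PyTorch's `smooth_l1_loss` rather than plain MSE.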
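
The target and loss can be written out for a tiny batch. The following NumPy sketch uses the standard single-agent form of the Q learning target; `q_targets`, `mse` and all numbers are illustrative assumptions, and the sign flipping that the notebook's agent applies for the second player is deliberately left out.

```python
import numpy as np

def q_targets(rewards, next_q_values, done, discount_factor=0.9):
    """Targets r + gamma * max_a' Q(s', a'); terminal transitions keep only r."""
    best_next = next_q_values.max(axis=1)
    return rewards + discount_factor * best_next * (1.0 - done)

def mse(predicted, target):
    """Mean squared error between predicted Q values and their targets."""
    return float(np.mean((predicted - target) ** 2))

# Two made-up transitions with three possible actions each
rewards = np.array([0.0, 1.0])
next_q = np.array([[0.2, 0.5, 0.1],
                   [0.3, 0.0, 0.4]])
done = np.array([0.0, 1.0])        # the second transition ended the game

targets = q_targets(rewards, next_q, done)
predicted = np.array([0.4, 0.8])   # Q(s, a) for the actions actually taken
loss = mse(predicted, targets)
print(targets, loss)
```

The first target is 0 + 0.9 x 0.5 = 0.45 and the second is just the terminal reward 1.0, so the loss is the mean of 0.05² and 0.2².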



In [5]:
import tqdm
import torch
import numpy as np
import torch.nn as nn

# Set seeds for reproducibility
torch.manual_seed(0)
np.random.seed(2)

class QModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = torch.nn.Embedding(3, 3)
        self.layer1 = torch.nn.Linear(30, 300)
        self.layer2 = torch.nn.Linear(300, 300)
        self.layer3 = torch.nn.Linear(300, 9)
        self.relu = torch.nn.ReLU()

    def forward(self, states2d, turns):
        if not torch.is_tensor(states2d):
            states2d = torch.from_numpy(states2d)
        if not torch.is_tensor(turns):
            turns = torch.from_numpy(turns)
        assert states2d.dim() == 3 # batch dimension required
        assert turns.dim() == 1 # only dim = batch dim
        x = torch.cat([states2d.flatten(1), turns[:,None]], 1)
        x = self.relu(self.embedding(x)).flatten(1)
        x = self.relu(self.layer1(x))
        x = self.relu(self.layer2(x))
        x = self.layer3(x)
        return x

    def _serialize_tensor(self, tensor):
        if tensor.dim() == 0:
            return float(tensor)
        return [self._serialize_tensor(t) for t in tensor]

    def _deserialize_tensor(self, tensor):
        return torch.tensor(tensor, dtype=torch.get_default_dtype())

    def save(self, filename):
        if not filename.endswith(".json"):
            filename += ".json"
        with open(filename, "w") as file:
            json.dump(
                {k: self._serialize_tensor(t) for k, t in self.state_dict().items()},
                file,
            )

    def load(self, filename):
        if not filename.endswith(".json"):
            filename += ".json"
        with open(filename, "r") as file:
            self.load_state_dict(
                {k: self._deserialize_tensor(t) for k, t in json.load(file).items()}
            )
        return self

class TicTacToe:
    def __init__(self, player1, player2):
        self.players = {1: player1, 2: player2} # Players against each other
        # Game Outcome (tie, player1 wins, player2 wins)
        self._reward = {0: 0, 1: 1, 2: -1}

    def play(self, num_games=1, visualize=False):
        transitions = []
        for _ in range(num_games):
            turn = 1
            state2d = np.zeros((3,3), dtype=np.int64)
            state = (state2d, turn) # full state of the game
            for i in range(9):
                current_player = self.players[turn]
                action = current_player.get_action(state)
                next_state, reward = self.play_turn(state, action)
                transitions.append(
                    (state, action, next_state, reward)
                )
                if visualize:
                    self.visualize_state(next_state, turn)

                (state2d, turn) = state = next_state

                if turn == 0:
                    break
        return transitions

    # Apply the current player's move and check for a win, loss or draw
    def play_turn(self, state, action):
        state2d, turn = state # Find states
        next_state2d = state2d.copy()
        next_turn = turn % 2 + 1
        ax, ay = action // 3, action % 3  # Action two indices

        # Check space is legal
        if state2d[ax, ay] != 0:  # Check invalid move
            next_state2d.fill(0)
            next_state = (next_state2d, 0)  # next_turn == 0 -> game over
            return next_state, self._reward[next_turn]  # next player wins
        next_state2d[ax, ay] = turn # apply action

        # check if the action resulted in a winner
        mask = next_state2d == turn
        if (
            (mask[0, 0] and mask[1, 1] and mask[2, 2])
            or (mask[0, 2] and mask[1, 1] and mask[2, 0])
            or (mask[0, 0] and mask[0, 1] and mask[0, 2])
            or (mask[1, 0] and mask[1, 1] and mask[1, 2])
            or (mask[2, 0] and mask[2, 1] and mask[2, 2])
            or (mask[0, 0] and mask[1, 0] and mask[2, 0])
            or (mask[0, 1] and mask[1, 1] and mask[2, 1])
            or (mask[0, 2] and mask[1, 2] and mask[2, 2])
        ):
            next_state = (next_state2d, 0)  # next_turn == 0 -> game over
            return next_state, self._reward[turn]  # current player wins

        # if the playing board is full, but no winner found = draw
        if (next_state2d != 0).all():  # final draw
            next_state = (next_state2d, 0)  # next_turn == 0 -> game over
            return next_state, self._reward[0]  # no winner

        # if no move winner = next player's turn.
        next_state = (next_state2d, next_turn)
        return next_state, self._reward[0]  # no winner yet

    # Show Game State
    @staticmethod
    def visualize_state(next_state, turn):
        next_state2d, next_turn = next_state
        print(f"player {turn}'s turn:")
        # An illegal move clears the board and ends the game (next_turn == 0)
        if (next_state2d == 0).all() and next_turn == 0:
            print("[invalid state]\n\n")
        else:
            for i in range(3):
                print("|".join(["O" if next_state2d[i][j]==1 else "X" if next_state2d[i][j]==2 else " " for j in range(3)]))
                if i < 2:
                    print("-"*5)
            print("\n")
            # check if the game has ended and if so, who won
            mask = next_state2d == turn
            win_combinations = [
                (mask[0, 0] and mask[1, 1] and mask[2, 2]),
                (mask[0, 2] and mask[1, 1] and mask[2, 0]),
                (mask[0, 0] and mask[0, 1] and mask[0, 2]),
                (mask[1, 0] and mask[1, 1] and mask[1, 2]),
                (mask[2, 0] and mask[2, 1] and mask[2, 2]),
                (mask[0, 0] and mask[1, 0] and mask[2, 0]),
                (mask[0, 1] and mask[1, 1] and mask[2, 1]),
                (mask[0, 2] and mask[1, 2] and mask[2, 2]),
            ]
            if any(win_combinations):
                print(f"player {turn} wins!\n")
            elif (next_state2d != 0).all():
              print("Tie!\n")

In [6]:
# Agent plays by repeating games to find optimal Q Value
class Agent:
    def __init__(
        self, qmodel=None, epsilon=0.2, learning_rate=0.01, discount_factor=0.9):

        self.qmodel = QModel() if qmodel is None else qmodel
        self.learning_rate = learning_rate # Speed Q values get updated
        # pytorch Optimizer Update weights of Q Model
        self._optimizer = torch.optim.Adam(self.qmodel.parameters(), lr=learning_rate)
        self.discount_factor = discount_factor # % Future rewards
        self.epsilon = epsilon # Chance of random action

    def random_action(self):
        # Random cell index 0-8 (occupied cells are not excluded)
        return int(np.random.randint(0, 9))

    def best_action(self, state):
        with torch.no_grad(): # Best Q values
            state2d, turn = state
            sign = 1.0 - 2.0 * (turn - 1)  # +1 for player 1, -1 for player 2
            turns = torch.tensor(turn, dtype=torch.int64)[None]  # Add batch dimension
            states2d = torch.tensor(state2d, dtype=torch.int64)[None]
            qvalues = self.qmodel(states2d, turns)[0]
        return int(torch.argmax(sign * qvalues))

    # Perform an action
    def get_action(self, state):
        if np.random.rand() < self.epsilon:
            # Action random with chance of epsilon = best action
            action = self.random_action()
        else:
            # Q values for current game state
            action = self.best_action(state)
        return action

    # Learn from current action
    def learn(self, transitions):
      states, actions, next_states, rewards = zip(*transitions)
      states2d, turns = zip(*states)
      next_states2d, next_turns = zip(*next_states)
      turns = torch.tensor(turns, dtype=torch.int64)
      next_turns = torch.tensor(next_turns, dtype=torch.int64)
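      # Stack the per-transition arrays first; building a tensor from a tuple of arrays is slow
      states2d = torch.tensor(np.array(states2d), dtype=torch.int64)
      next_states2d = torch.tensor(np.array(next_states2d), dtype=torch.int64)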
      actions = torch.tensor(actions, dtype=torch.int64)
      rewards = torch.tensor(rewards, dtype=torch.float32)
      with torch.no_grad():
          # Q values for current game state
          # Check Game is over or not?
          mask = (next_turns > 0).float()
          signs = (1 - 2 * (next_turns - 1)).float()
          next_qvalues = self.qmodel(next_states2d, next_turns)
          expected_qvalues_for_actions = rewards + mask * signs * (
              self.discount_factor * torch.max(signs[:, None] * next_qvalues, 1)[0]
          )

      # update Q values:
      qvalues_for_actions = torch.gather(
          self.qmodel(states2d, turns), dim=1, index=actions[:, None]
      ).view(-1)
      loss = torch.nn.functional.smooth_l1_loss(
          qvalues_for_actions, expected_qvalues_for_actions
      )
      self._optimizer.zero_grad()
      loss.backward()
      self._optimizer.step()
      return loss.item()


In [21]:
# initialize
np.random.seed(3)
torch.manual_seed(1)
total_number_of_games = 100000
number_of_games_per_batch = 100

player = Agent(epsilon=0.7, learning_rate=0.01)
game = TicTacToe(player, player)

min_loss = np.inf
range_ = tqdm.trange(total_number_of_games // number_of_games_per_batch)
for i in range_:
    transitions = game.play(num_games=number_of_games_per_batch)
    np.random.shuffle(transitions)
    loss = player.learn(transitions)

    if loss < min_loss and loss < 0.01:
        min_loss = loss

    range_.set_postfix(loss=loss, min_loss=min_loss)

player.qmodel.save("qmodel.json")

100%|██████████| 1000/1000 [01:57<00:00,  8.54it/s, loss=0.0134, min_loss=0.00929]


In [34]:
player = Agent(epsilon=0.0)  # epsilon=0 = no random guesses
game = TicTacToe(player, player)
player.qmodel.load("qmodel.json")

# play
game.play(num_games=1, visualize=True);
print("------------------")

player 1's turn:
 | | 
-----
 |O| 
-----
 | | 


player 2's turn:
 | |X
-----
 |O| 
-----
 | | 


player 1's turn:
 | |X
-----
 |O| 
-----
 |O| 


player 2's turn:
 |X|X
-----
 |O| 
-----
 |O| 


player 1's turn:
O|X|X
-----
 |O| 
-----
 |O| 


player 2's turn:
O|X|X
-----
 |O| 
-----
 |O|X


player 1's turn:
O|X|X
-----
 |O|O
-----
 |O|X


player 2's turn:
O|X|X
-----
X|O|O
-----
 |O|X


player 1's turn:
O|X|X
-----
X|O|O
-----
O|O|X


Tie!

------------------



# Evaluation

#### This implementation combined Q learning with a Deep CNN so that the agent could learn complex strategies from the experience of prior games. This gave the agent more flexibility by allowing it to generalise its moves from one state to another.
- I believe the state transitions were used efficiently, as training was quick to run, and the adjustable hyperparameters (epsilon, discount factor) meant I could test different settings.

#### The hyperparameters (CNN, Q-Learning) and the training process require evaluation, as does the performance of the AI player against different game strategies, to ensure effective learning. The benefit of a CNN is weight sharing, which reduces the number of trainable network parameters (creating a faster algorithm) and helps to avoid overfitting. The classification layer is organised and highly reliant on the extracted features, and the approach can be used in large-scale networks.

- However, I don’t think this would be efficient for larger game state sizes. I struggled to implement a 5x5 board, and with more time I would have tested this hypothesis.
- When I changed the number of epochs, training appeared to be very sensitive, and finding the best hyperparameters to test was time consuming.
- Due to time limits I wasn’t able to test the algorithm against other forms of RL. However, I found that it breaks the RL problem into smaller ‘supervised’ [3] tasks, although taking the maximum Q value in Q learning tends to overestimate the value of state-action pairs [1, p. 137].


# Research Policy Gradient Algorithm
#### Policy gradient algorithm.

### The main aim of an RL agent is to maximise the expected reward when following a policy, which is defined by parameters (the weights and biases of the units in the CNN). The network of nodes is trained by adjusting these weights and biases using feedback, a process known as Gradient Descent [3]. Nodes are arranged in layers: input nodes (first layer), middle layers (hidden layers) and output nodes (last layer). ‘Deep’ refers to multiple hidden layers within the CNN, and backpropagation is used to train the middle layers.
<br>

#### A policy is a distribution that the agent uses to select actions. We can use a deep neural network to approximate the agent’s policy and increase the probability of selecting the most optimal actions [10]. From the previous tasks in this unit, I have learnt that deep RL algorithms are either value based or policy based; the value-based group includes DQN, discussed in the previous question.
<br>

#### Most algorithms use a value function, which learns the value of each action in order to pick the best (optimal) action. A ‘parameterized policy’ [1], by contrast, selects actions without depending on a value function, although a value function can still be used to learn the policy parameter.
- In RL the aim is to maximise the agent’s performance (finding the most optimal actions) over a duration of time.
- The policy gives the probability that action a is taken at time t, given that the environment is in state s at time t with parameter θ; the gradient of a performance measure is taken with respect to this parameter.
- This gradient is proportional to the sum, over states weighted by the amount of time spent in each state, of the action values multiplied by the gradient of the policy; the parameter is then updated by gradient ascent.
<br>
<br>
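
The points above paraphrase the policy gradient theorem for the episodic case. Written out in the notation of [1] (the symbols are an assumption here, since the text does not define them: $\boldsymbol{\theta}$ is the policy parameter, $\pi(a \mid s, \boldsymbol{\theta})$ the policy, $\mu(s)$ the on-policy state distribution, $q_\pi(s, a)$ the action value and $J(\boldsymbol{\theta})$ the performance measure):

$$
\nabla J(\boldsymbol{\theta}) \propto \sum_{s} \mu(s) \sum_{a} q_\pi(s, a)\, \nabla \pi(a \mid s, \boldsymbol{\theta})
$$

Gradient ascent then moves the parameter in the direction of an estimate of this gradient, with step size $\alpha$:

$$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\, \widehat{\nabla J(\boldsymbol{\theta}_t)}
$$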

#### A CNN observes the environment as input and outputs actions, which are selected according to a SoftMax activation function [11], Fig. 9; without such an activation function the CNN would behave like a simple linear regression model. The agent generates a game (episode) and keeps track of the states, actions and rewards in its memory. It then revisits these states at the end of each episode (checking the states, actions and rewards) to calculate the discounted future returns at each time step. The returns are then used as weights and the agent’s actions as labels to perform backpropagation and update the weights of the Deep CNN. The agent repeats this for several iterations until the most optimal actions are found.
<br>
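
The end-of-episode steps described above (discounted returns, then a return-weighted update towards the action taken) can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the author's implementation: the function names are hypothetical, and the gradient is taken with respect to the softmax logits rather than the weights of a full network.

```python
import numpy as np

def discounted_returns(rewards, discount_factor=0.9):
    """Return G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount_factor * running
        returns[t] = running
    return returns

def softmax(logits):
    """Numerically stable softmax over action preferences."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def reinforce_gradient(logits, action, episode_return):
    """Gradient of episode_return * log softmax(logits)[action] w.r.t. the logits."""
    grad_log_prob = -softmax(logits)
    grad_log_prob[action] += 1.0   # one-hot(action) minus the action probabilities
    return episode_return * grad_log_prob

# A three-move episode that ends with a win (reward 1 on the last step)
returns = discounted_returns([0.0, 0.0, 1.0])

# Uniform policy over the 9 cells; the agent played the centre cell (index 4)
grad = reinforce_gradient(np.zeros(9), action=4, episode_return=returns[0])
print(returns, grad)
```

Adding a small multiple of `grad` to the logits raises the probability of the action that was followed by a positive return and lowers the others, which is the "returns as weights, actions as labels" idea in the paragraph above.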

#### This results in the Policy Gradient algorithm REINFORCE, where we look at the overall performance of the agent’s behaviour, in terms of cumulative rewards, to guide the policy improvement. The RL agent uses the complete return from the starting state to the goal state (Monte Carlo Policy Gradient algorithm) [12], Fig. 10, unlike the ‘bootstrap’ methods of TD or DP [1, p. 119].

- To optimise the policy, methods such as Maximum Likelihood Estimation (MLE) can be used to iteratively adjust the policy’s parameters to select actions that will lead to higher rewards.
- As such, policy gradient methods are a type of RL that optimises a parameterised policy and focuses on the agent’s overall behaviour, not just the immediate rewards.

<br>
### Policy gradient algorithms such as Monte Carlo, Fig. 11, have significant advantages: they can learn specific probabilities for taking actions, and they can learn appropriate levels of exploration and approach deterministic policies ‘asymptotically’ [1, p. 337]. Additionally, continuous action spaces are handled naturally by policy gradient algorithms, where action-value methods can struggle. <br>
<br>
### An algorithm’s performance can also be analysed using the policy gradient theorem, which gives a formula for how performance is affected by the policy parameter that does not involve derivatives of the state distribution. ‘Adding a state-value function as a baseline reduces REINFORCE’s variance without introducing bias’ [1, p. 337]. Bootstrapping methods (TD, DP) introduce bias, but they can reduce variance.
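
As a worked form of the baseline idea, the REINFORCE update with a learned state-value baseline $\hat{v}(S_t, \mathbf{w})$ can be written as follows. The notation follows [1] and is an assumption of this note; discounting is omitted ($\gamma = 1$) to keep the expression short, $G_t$ is the return from time $t$ and $\alpha$ is the step size.

$$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \big( G_t - \hat{v}(S_t, \mathbf{w}) \big)\, \nabla \ln \pi(A_t \mid S_t, \boldsymbol{\theta}_t)
$$

Setting the baseline to zero recovers plain REINFORCE. The baseline adds no bias because it does not depend on the action, so its contribution to the expected gradient vanishes:

$$
\sum_{a} \hat{v}(s, \mathbf{w})\, \nabla \pi(a \mid s, \boldsymbol{\theta}) = \hat{v}(s, \mathbf{w})\, \nabla \sum_{a} \pi(a \mid s, \boldsymbol{\theta}) = \hat{v}(s, \mathbf{w})\, \nabla 1 = 0
$$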


# Compare the Convergence Performance of Algorithms
#### Compare the performance of your policy gradient algorithm with deep Q-Learning algorithm.

### The convergence performance of the two algorithms is compared by running several iterations of the game and recording the number of wins, losses and draws (the status returned by play_game()) for each algorithm.

*Convergence refers to the limit of a process and can be a useful analytical tool when evaluating the expected performance of an optimization algorithm* [6].
- Examine the values produced by each algorithm in relation to its behaviour over time.
- Run multiple games and then compare wins/losses/draws.<br>
<br>
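
The comparison procedure above can be sketched as a small tally loop. This is a hedged sketch: `evaluate` is a hypothetical helper, it assumes `play_game` is a callable that returns one of `'win'`, `'loss'` or `'draw'`, and `dummy_game` is only a random stand-in for a real match between trained agents.

```python
import random
from collections import Counter

def evaluate(play_game, num_games=1000):
    """Tally outcomes of repeated games and return them as rates."""
    counts = Counter(play_game() for _ in range(num_games))
    return {outcome: counts[outcome] / num_games
            for outcome in ('win', 'loss', 'draw')}

# Stand-in for a real game between two agents (hypothetical)
rng = random.Random(0)

def dummy_game():
    return rng.choice(['win', 'loss', 'draw'])

rates = evaluate(dummy_game, num_games=300)
print(rates)
```

Running the same loop once per algorithm, at several points during training, gives win, loss and draw rates that can be plotted against the number of training games to compare convergence.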

### In RL, small changes in the trajectory can result in very different rewards, thus Monte Carlo Policy Gradient has ‘high variance but zero bias’ [14]. TD and DP-style (DQN) algorithms, which update one step at a time (one action with a small change), have low variance. This can affect ‘model convergence’, as Policy Gradients are sensitive to variance (and generating large numbers of samples reduces efficiency), especially in on-policy methods where the behaviour policy and target policy are the same. Off-policy methods can improve exploration and the target policy without generating large amounts of new samples. The more we know about the dynamics of a model’s environment, the less trial and error we need to find the most optimal policy, Fig. 12.

- Many RL methods make continuity assumptions, assuming that the state space or the control space is continuous, and Q learning with a Deep CNN (DQN) involves too many complex steps in a continuous control space [14].
- This is because the entire control space must be searched to find the maximum Q value for the next action, which is computationally very difficult.
- Policy gradient algorithms, by contrast, can support continuous control, since such an algorithm ‘optimises the policy directly’ [14] by implementing constraints using policy parameters within the objective function.
- However, the choice isn’t necessarily one or the other: we can have the best of both by adding value learning to a policy gradient, or adding a policy gradient to a value-based RL method [14].
