# AI in Games, _Reinforcement Learning_<br>Assignment 2, Question 5:<br>**Deep Reinforcement Learning**

## Introduction to the concept
Suppose each state can be represented as a vector of features, and suppose the action-value function (from which we obtain an estimated optimal policy) can be approximated by a non-linear function of those features. Under these assumptions, we use a convolutional neural network to (1) distil the features of a state into a smaller set of essential features, and (2) approximate the action-values from those essential features.

## Preparing the context
The following are the necessary preparations and imports needed to run and test the main code of this document in the intended context. Mounting directory & setting present working directory...

In [1]:
# Mounting the Google Drive folder (run if necessary):
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)
# Saving the present working directory's path:
# NOTE: Change `pwd` based on your own Google Drive organisation
pwd = "./drive/MyDrive/ColabNotebooks/AIG-Labs/AIG-Assignment2/"

Mounted at /content/drive/


Installing the module `import_ipynb`, which enables importing Jupyter notebooks as modules...

`!pip install import_ipynb`

Importing the code in notebook `Q1_environment.ipynb`...




In [3]:
import import_ipynb
N = import_ipynb.NotebookLoader(path=[pwd])
N.load_module("Q1_environment")
from Q1_environment import *

importing Jupyter notebook from ./drive/MyDrive/ColabNotebooks/AIG-Labs/AIG-Assignment2/Q1_environment.ipynb


Other necessary imports...

In [11]:
import numpy as np
import torch
from torch import nn
from collections import deque

## Wrapping the environment to enable feature mapping
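Each state image is composed of four channels and is represented by a `numpy.array` of shape $(4, h, w)$, where $h$ is the number of rows and $w$ is the number of columns of the lake grid. Channel 1 marks the agent's position, while channels 2, 3 and 4 mark the start, hole and goal tiles respectively.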
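
As a minimal sketch of this representation, the snippet below builds the state image for one state of a small grid. The helper name `state_to_image` and the 2x3 example grid are hypothetical, and the row-by-row numbering of states is an assumption about how the environment indexes tiles.

```python
import numpy as np

def state_to_image(grid, state):
    """Build a (4, h, w) state image for `state` (hypothetical helper)."""
    grid = np.array(grid)
    h, w = grid.shape
    # Channel 1: 1 at the agent's tile, assuming row-by-row state numbering
    agent = np.zeros((h, w))
    agent[state // w, state % w] = 1.0
    # Channels 2, 3 & 4: start, hole and goal tile markers
    markers = [(grid == c).astype(float) for c in ['&', '#', '$']]
    return np.stack([agent] + markers)

grid = [['&', '.', '#'],
        ['.', '.', '$']]
image = state_to_image(grid, state=4)
print(image.shape)   # (4, 2, 3)
print(image[0])      # single 1 at row 1, column 1
```

Because channels 2, 3 and 4 depend only on the grid, they are identical for every state; only channel 1 changes from one state to the next.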

### DEMO: State image representation to be used

In [5]:
# Dividing lines for neat presentation:
div1 = '\n================================================\n'
div2 = '------------------------------------'

# Printing the original grid:
myLake = np.array(lake['small'])
print(f'The original frozen lake grid:\n{myLake}\n{div1}')

# Printing channels 2, 3 & 4 for each state image:
lake_image = [(np.array(myLake) == c).astype(float) for c in ['&', '#', '$']]
print('Channels 2, 3 & 4 for each state image')
L = ['C2. Start tile marker', 'C3. Hole tile marker', 'C4. Goal tile marker']
for l, A in zip(L, lake_image): print(f'{div2}\n{l}:\n{A}')

The original frozen lake grid:
[['&' '.' '.' '.']
 ['.' '#' '.' '#']
 ['.' '.' '.' '#']
 ['#' '.' '.' '$']]


Channels 2, 3 & 4 for each state image
------------------------------------
C2. Start tile marker:
[[1. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
------------------------------------
C3. Hole tile marker:
[[0. 0. 0. 0.]
 [0. 1. 0. 1.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]]
------------------------------------
C4. Goal tile marker:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 1.]]


### Implementing wrapper class

In [267]:
class FrozenLakeImageWrapper:
    def __init__(self, env):
        self.env = env
        lake = self.env.lake
        # NOTE: The lake grid is converted into an array by the environment
        self.n_actions = self.env.n_actions

        # Obtaining a state image for each state:
        #------------------------------------
        # 1. Shape for each state image:
        self.state_shape = (4, lake.shape[0], lake.shape[1])
        #------------------------------------
        # 2. Obtaining a list of filter arrays:
        lake_image = [(lake == c).astype(float) for c in ['&', '#', '$']]
        #------------------------------------
        # 3. Obtaining the state image for each state:
        #........................
        # Handling for the absorbing state...
        # a. Channel 1 of the state:
        # NOTE: Absorbing state has no position on the grid, so all zeros
        A = np.zeros(lake.shape)

        # b. Attaching channels 2, 3 & 4, then storing all as an array:
        self.state_image = {env.absorbing_state: np.stack([A] + lake_image)}
        '''
        IMPLEMENTATION NOTE:
        `[A]` is a list containing array A, and `lake_image` is a list
        containing 3 arrays. Using `+` between `[A]` and `lake_image` will
        concatenate the two lists, resulting in a list of 4 arrays.

        `np.stack` joins the above array list into a single array of arrays.
        '''
        #........................
        # Handling for the other states actually present on the grid...
        for state in range(lake.size):
            # a. Channel 1 of the state:
            '''
            NOTE ON CHANNEL 1:
            The 1st channel is the array such that the element is 1 if the
            index matches the state, 0 otherwise. This corresponds to the
            position of the agent if the agent were to be in this state. Hence,
            note that the 1st channel shows not the current position of the
            agent, but its position if it were in this state.
            '''
            # a.1. Initialising it as an array of zeros:
            A = np.zeros(lake.shape)
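            # a.2. Assigning the current state's position as 1:
            # NOTE: States are numbered row by row, so the row index and the
            # column index are both derived from the number of columns
            row = state // lake.shape[1]
            col = state % lake.shape[1]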
            A[row, col] = 1.0

            # b. Attaching channels 2, 3 & 4, then storing all as an array:
            self.state_image[state] = np.stack([A] + lake_image)
            '''
            IMPLEMENTATION NOTE:
            Check the implementation note above this loop.
            '''

    #================================================

    # Mapping the given state to its state image:
    # NOTE: State images were obtained for each state in the class constructor

    def encode_state(self, state):
        return self.state_image[state]

    #================================================

    # Obtaining the policy via decoding neural network's output:
    # 1. Encode states as state images
    # 2. Pass state images as input to the neural network
    # 3. Obtain the action-value function as an output
    # 4. Use the action-value function to obtain the policy & state-values

    def decode_policy(self, dqn):
        # 1. Encode states as state images:
        N = self.env.n_states
        states = np.array([self.encode_state(s) for s in range(N)])

        # 2 & 3: Obtain the action-value function for encoded states:
        q = dqn(states).detach().numpy()
        # NOTE: `torch.no_grad` omitted to avoid import

        # 4. Use the action-value function to obtain the policy & state-values:
        policy = q.argmax(axis=1)
        value = q.max(axis=1)
        return policy, value

    #================================================

    # Resetting environment & encoding it as state image:

    def reset(self):
        return self.encode_state(self.env.reset())

    #================================================

    # Taking a step in environment & encoding next state as state image:

    def step(self, action):
        state, reward, done = self.env.step(action)
        return self.encode_state(state), reward, done

    #================================================

    # Visualising the agent's performance (by inputs or using a policy):

    def render(self, policy=None, value=None):
        self.env.render(policy, value)

## Implementing the neural network
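
The network in the cell that follows uses a convolution with stride 1 and no padding, so a kernel of size $k$ shrinks an $h \times w$ grid to $(h - k + 1) \times (w - k + 1)$. The sketch below works through that arithmetic with a hypothetical helper `conv_output_shape`; the sample values mirror the small lake and the parameters used in the testing section.

```python
def conv_output_shape(h, w, kernel_size):
    """Output height and width of a stride-1, unpadded 2D convolution."""
    return h - kernel_size + 1, w - kernel_size + 1

# Small lake: 4x4 grid, kernel_size=4, conv_out_channels=12
out_h, out_w = conv_output_shape(4, 4, kernel_size=4)
fc_in_features = out_h * out_w * 12
print(out_h, out_w, fc_in_features)   # 1 1 12
```

With these settings the kernel covers the whole grid, so each output channel reduces a state image to a single number before the fully connected layer.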

In [321]:
class DeepQNetwork(torch.nn.Module):
    def __init__(self, wenv, learning_rate, kernel_size,
                 conv_out_channels, fc_out_features, seed):
        torch.nn.Module.__init__(self)
        torch.manual_seed(seed)

        # Convolutional layer:
        self.conv_layer = torch.nn.Conv2d(in_channels=wenv.state_shape[0],
                                          out_channels=conv_out_channels,
                                          kernel_size=kernel_size, stride=1)

        # h ==> Number of rows, w ==> Number of columns of the conv output
        # NOTE: Stride 1 & no padding shrink each side by `kernel_size - 1`
        h = wenv.state_shape[1] - kernel_size + 1
        w = wenv.state_shape[2] - kernel_size + 1

        # Fully connected layer:
        self.fc_layer = torch.nn.Linear(in_features=h*w*conv_out_channels,
                                        out_features=fc_out_features)

        # Output layer:
        self.output_layer = torch.nn.Linear(in_features=fc_out_features,
                                            out_features=wenv.n_actions)

        # Optimiser for gradient descent:
        self.optimizer = torch.optim.Adam(self.parameters(), lr=learning_rate)

    #================================================

    # Feed-forward function:

    def forward(self, x):
        # Setting the activation function:
        activation = torch.nn.ReLU()

        #print(f'0.1: x: {x}')
        y = torch.tensor(x, dtype=torch.float)
        #print(f'0.2: y: {y}')
        '''
        EXPECTED SHAPE OF THE ABOVE INPUT TENSOR:
        `x.shape` = (B, 4, h, w), where
        >> B ==> Number of states
        >> 4 ==> Number of channels per state representation
        >> h ==> Number of rows in the playing grid
        >> w ==> Number of columns in the playing grid
        '''
        # Feeding forward the input across layers:
        y = self.conv_layer(y)
        #print(f'1: y: {y}')
        y = activation(y)
        #print(f'2: y: {y}')

        # Flattening `y` before passing it to the fully connected layer:
        y = torch.flatten(y, start_dim=1)
        #print(f'3: y: {y}')
        '''
        NOTE ON FLATTENING:
        We want to flatten each state's convolved representation. After the
        convolutional layer, each state is represented by `conv_out_channels`
        channels, each of shape `(h, w)`. Hence, each state needs to become a
        tensor of size `h*w*conv_out_channels`.

        Now, `y` holds B states, arranged such that the 1st dimension of `y`
        (i.e. axis 0) indexes the states; ReLU does not alter this shape. We
        want to flatten each state's representation while maintaining the
        array of state representations. Hence, we leave the 1st dimension of
        `y` untouched and start flattening from the 2nd dimension (i.e.
        axis 1, leading to the argument `start_dim=1`).
        '''

        y = self.fc_layer(y)
        #print(f'4.1: y: {y}')
        y = activation(y)
        #print(f'4.2: y: {y}')

        y = self.output_layer(y)
        #print(f'5: y: {y}')
        return y

    #================================================

    # Single step of training:

    def train_step(self, transitions, gamma, tdqn, episode):
        # Organising the transitions data into separate arrays:
        states = np.array([transition[0] for transition in transitions])
        actions = np.array([transition[1] for transition in transitions])
        rewards = np.array([transition[2] for transition in transitions])
        next_states = np.array([transition[3] for transition in transitions])
        dones = np.array([transition[4] for transition in transitions])
        #print(f"Transitions:\nstates: {states}\nactions: {actions}\nrewards: {rewards}\nnext_states: {next_states}\ndones: {dones}\n\n")

        # Obtaining current action-value estimates:
        q = self(states)
        # NOTE: The above is equivalent to doing `q = self.forward(states)`

        #print(f'ORIGINAL:\nepisode: {episode}\nr: {rewards}\nq: {q}')
        q = q.gather(1, torch.Tensor(actions).view(len(transitions), 1).long())
        #print(f'A:\nq: {q}')
        q = q.view(len(transitions))
        #print(f'B\nq: {q}\n\n')

        with torch.no_grad():
            next_q = tdqn(next_states).max(dim=1)[0] * (1 - dones)
            '''
            EXPLAINING THE ABOVE LINE:
            `tdqn(next_states)` is equivalent to `tdqn.forward(next_states)`,
            and simply applies the forward model of the non-updated model
            (stored in `tdqn`) to the next states, to get an estimate of
            action-values given the previous weights of the model.
            ------------------------------------
            The `.max` function, when applied to a tensor, produces two tensors:
            1. The array of max value(s) along the specified dimension
            2. The dimension-specific indices where the max value(s) were found

            We only want the first of the above two tensors. Hence, we apply
            the subscript `[0]` on `tdqn(next_states).max(dim=1)`, to do
            `tdqn(next_states).max(dim=1)[0]`
            '''

        #print(f'HELLO:{tdqn(next_states)}\n\n')

        # Computing the one-step TD targets from the stored rewards:
        target = torch.Tensor(rewards) + gamma*next_q

        # Loss calculation:
        # NOTE 1: The loss is the mean squared error between `q` & `target`
        # NOTE 2: `q - target` is temporal difference for given state-action
        #loss = torch.mean((q - target)**2)
        loss = torch.nn.functional.mse_loss(q, target.to(torch.float32))

        # Performing gradient descent, i.e. optimisation:
        self.optimizer.zero_grad() # Initialising gradients as zero
        loss.backward()            # Computing the current gradient
        self.optimizer.step()      # Performing the optimisation step
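
The update performed by `train_step` can be summarised as follows, writing $\theta$ for the weights of `dqn`, $\theta^-$ for the weights of the target network `tdqn`, and $d_i \in \{0, 1\}$ for the `done` flag of transition $i$ in a batch of size $B$. These symbols are notation introduced here for illustration and do not appear in the code.

$$
y_i = r_i + \gamma\,(1 - d_i)\,\max_{a'} Q_{\theta^-}(s'_i, a'),
\qquad
L(\theta) = \frac{1}{B}\sum_{i=1}^{B}\bigl(Q_\theta(s_i, a_i) - y_i\bigr)^2
$$

The factor $(1 - d_i)$ removes the bootstrapped term when the episode ended at $s'_i$, so the target reduces to the reward $r_i$ alone in that case.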

The following class `ReplayBuffer` implements a replay buffer that stores transitions. A transition is a tuple composed of a state, an action, a reward, the next state, and a flag that denotes whether the episode ended at the next state. The buffer is a Python `deque` that automatically discards the oldest transitions once it reaches capacity. The method `draw` returns a list of $n$ transitions, where $n$ is the batch size, drawn without replacement from the replay buffer.
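
To illustrate the two behaviours just described, here is a minimal standalone sketch that uses a `deque` and a NumPy `RandomState` directly. The buffer size, the seed, and the integers standing in for transitions are arbitrary choices for the example.

```python
from collections import deque
import numpy as np

buffer = deque(maxlen=3)
for transition in range(5):      # integers stand in for transition tuples
    buffer.append(transition)
print(list(buffer))              # [2, 3, 4]

# Sampling two distinct buffer indices, i.e. without replacement:
random_state = np.random.RandomState(0)
indices = random_state.choice(len(buffer), size=2, replace=False)
batch = [buffer[i] for i in indices]
print(len(batch), len(set(batch)))   # 2 2
```

Sampling without replacement raises a `ValueError` when more items are requested than the buffer holds, which is why the learning loop only draws once the buffer contains at least `batch_size` transitions.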

In [322]:
class ReplayBuffer:
    def __init__(self, buffer_size, random_state):
        # Replay buffer data structure:
        self.buffer = deque(maxlen=buffer_size)

        # Maintaining the given random state for enabling replicability:
        self.random_state = random_state

    def __len__(self):
        return len(self.buffer)

    def append(self, transition):
        self.buffer.append(transition)

    def draw(self, batch_size):
        # Length of the replay buffer:
        N = self.__len__()

        # Randomly sampling `batch_size` buffer indices without replacement:
        I = self.random_state.choice(N, size=batch_size, replace=False)

        # Returning the transitions corresponding to the above indices:
        return [self.buffer[i] for i in I]

## Learning process

In [323]:
def deep_q_network_learning(env, max_episodes, lr,
                            gamma, epsilon, batch_size,
                            target_update_frequency, buffer_size, kernel_size,
                            conv_out_channels, fc_out_features, seed):
    # INITIALISATION
    # Setting random state with given seed for enabling replicability:
    random_state = np.random.RandomState(seed)

    # Initialising replay buffer
    replay_buffer = ReplayBuffer(buffer_size, random_state)

    # Initialising the required deep neural networks:
    args = [env, lr, kernel_size, conv_out_channels, fc_out_features, seed]
    dqn = DeepQNetwork(*args)
    tdqn = DeepQNetwork(*args)

    # Array of linearly decreasing exploration factors:
    epsilon = np.linspace(epsilon, 0, max_episodes)

    #================================================

    # TRAINING LOOP
    for i in range(max_episodes):
        state = env.reset()

        done = False
        while not done:
            # Choosing next action with epsilon-greedy policy:
            if random_state.rand() < epsilon[i]:
                action = random_state.choice(env.n_actions)
            else:
                with torch.no_grad(): q = dqn(np.array([state]))[0].numpy()
                qmax = np.max(q)
                best = [a for a in range(env.n_actions) if np.allclose(qmax, q[a])]
                action = random_state.choice(best)

            # Moving the agent to the next state within the current episode:
            next_state, reward, done = env.step(action)
            # Updating the replay buffer:
            replay_buffer.append((state, action, reward, next_state, done))
            # Updating the state variable:
            state = next_state

            if len(replay_buffer) >= batch_size:
                transitions = replay_buffer.draw(batch_size)
                dqn.train_step(transitions, gamma, tdqn, i)

        if (i % target_update_frequency) == 0:
            tdqn.load_state_dict(dqn.state_dict())

    return dqn
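
The action selection used in the loop above can be isolated as in the sketch below. The function name `epsilon_greedy` and the sample action-values are hypothetical; the logic mirrors the loop: explore with probability epsilon, otherwise choose uniformly among the actions whose value ties for the maximum.

```python
import numpy as np

def epsilon_greedy(q, epsilon, random_state):
    """Pick an action from action-values `q` (hypothetical helper)."""
    if random_state.rand() < epsilon:
        return random_state.choice(len(q))    # explore: any action
    best = [a for a in range(len(q)) if np.allclose(q.max(), q[a])]
    return random_state.choice(best)          # exploit: ties broken randomly

random_state = np.random.RandomState(0)
q = np.array([0.2, 0.7, 0.7, 0.1])

# Linearly decaying exploration factors, one per episode:
epsilons = np.linspace(0.5, 0, 5)

# With epsilon = 0 the choice is always one of the tied best actions:
actions = [epsilon_greedy(q, 0.0, random_state) for _ in range(20)]
print(set(int(a) for a in actions))
```

Breaking ties at random matters early in training, when the network's action-values can be nearly equal and a plain `argmax` would always favour the lowest-indexed action.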

## Testing the above functions
_The function testing code must not run if this file is imported as a module, hence we do..._<br>`if __name__ == '__main__'`<br>_... to check if the current file is being executed as the main code._

In [336]:
if __name__ == '__main__':
    # Defining the parameters:
    env = FrozenLake(lake['small'], 0.1, 100)
    wenv = FrozenLakeImageWrapper(env)
    max_episodes = 1000
    lr = 0.01 # Learning rate
    gamma = 0.9
    epsilon = 0.5
    batch_size = 50
    target_update_frequency = 1
    buffer_size = 500
    kernel_size = 4
    conv_out_channels = 12
    fc_out_features = 12
    seed = 0

    # Running the function:
    DeepQ = deep_q_network_learning(wenv,
                                    max_episodes,
                                    lr,
                                    gamma, epsilon,
                                    batch_size,
                                    target_update_frequency,
                                    buffer_size,
                                    kernel_size,
                                    conv_out_channels,
                                    fc_out_features,
                                    seed)

    # Obtaining the policy & state values:
    DeepQ = wenv.decode_policy(DeepQ)
    # NOTE: The trailing comma makes this a one-element tuple, not a string
    labels = ("deep q network learning",)

    # Displaying results:
    displayResults([DeepQ], labels, env)



AGENT PERFORMANCE AFTER D

Lake:
[['&' '.' '.' '.']
 ['.' '#' '.' '#']
 ['.' '.' '.' '#']
 ['#' '.' '.' '$']]
Policy:
[['_' '^' '_' '<']
 ['_' '<' '_' '_']
 ['>' '>' '_' '_']
 ['_' '_' '>' '<']]
Value:
[[0.441 0.500 0.568 0.486]
 [0.516 0.030 0.654 0.047]
 [0.515 0.540 0.754 0.027]
 [0.013 0.664 0.886 1.020]]


**NOTE ON SETTING LEARNING RATE**:<br>If the learning rate is set too high, gradient descent tends to overshoot the optimum. In our case, the weights then become strongly negative overall, so that applying ReLU yields a zero matrix. Because a zero matrix carries no information about the state, every row of the final output (i.e. the action-values for each state) is equal, leading to a situation where:

- The forward model produces the same action-values for every state
- As a result of the above, the same maximum action-value is indicated for each state
- As a result of the above, the state values and the policy collapse to a single value and a single action across all states

Hence, set the learning rate low enough to prevent gradient descent from overshooting in this way.
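
The collapse described above can be reproduced in isolation. The sketch below uses plain NumPy matrices as stand-ins for the fully connected and output layers; all sizes and weight values are made up for illustration, with the hidden-layer weights forced negative to mimic the effect of an overshoot.

```python
import numpy as np

rng = np.random.RandomState(0)
features = rng.rand(5, 8)            # 5 hypothetical states, non-negative features

# Hidden layer with strongly negative weights and bias:
W_fc = -np.abs(rng.randn(8, 12))
b_fc = -np.ones(12)
hidden = np.maximum(features @ W_fc + b_fc, 0.0)   # ReLU

# Output layer with arbitrary weights and bias:
W_out = rng.randn(12, 4)
b_out = rng.randn(4)
q = hidden @ W_out + b_out

print(np.all(hidden == 0))           # True
print(np.allclose(q, q[0]))          # True
print(set(q.argmax(axis=1)))         # a single action for all states
```

Every pre-activation in the hidden layer is negative, so ReLU zeroes it and the output reduces to the output layer's bias, whatever the input state was.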