# AI in Games, _Reinforcement Learning_<br>Assignment 2, Question 3:<br>**Tabular Model-free Methods**

## Introduction to the concept
Model-free methods are algorithms that do not use (and thus do not require) a known environment model. "Tabular" refers to the use of look-up tables or arrays that store the current estimates of a value function (e.g. the state-value function); each iteration uses these stored estimates to compute updated ones. When the environment model is not known to begin with, i.e. when the state-transition probabilities and rewards are initially unknown for each state-action pair, policy evaluation and improvement unsurprisingly take longer to become accurate and effective, since the agent has to learn the consequences of each state-action pair by interacting with the environment repeatedly.
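
To make "tabular" concrete, here is a minimal sketch of a look-up table of action-value estimates and a single in-place update of one entry. The sizes, the step size `eta` and the `target` value are hypothetical numbers chosen only for illustration; they are not taken from the assignment.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_states, n_actions = 4, 2

# The "table": one action-value estimate per state-action pair.
q = np.zeros((n_states, n_actions))

# One illustrative update: move the stored estimate for (s, a) part of the
# way towards a sampled target, using a step size `eta`.
s, a = 2, 1
eta, target = 0.5, 1.0
q[s, a] += eta * (target - q[s, a])

print(q[s, a])  # 0.5
```

Only the entry for the visited state-action pair changes; every other estimate stays at its previous value until the agent visits that pair.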

## Preparing the context
The following are the necessary preparations and imports needed to run and test the main code of this document in the intended context. Mounting directory & setting present working directory...

In [None]:
if __name__ == '__main__':
    # Mounting the Google Drive folder (run if necessary):
    from google.colab import drive
    drive.mount('/content/drive/', force_remount=True)
    # Saving the present working directory's path:
    pwd = "./drive/MyDrive/ColabNotebooks/AIG-Labs/AIG-Assignment2/"

Mounted at /content/drive/


Installing the module `import_ipynb`, which enables importing Jupyter Notebooks as modules...

`!pip install import_ipynb`

Importing the code in notebook `Q1_environment.ipynb`...




In [None]:
if __name__ == '__main__':
    import import_ipynb
    N = import_ipynb.NotebookLoader(path=[pwd])
    N.load_module("Q1_environment")
    from Q1_environment import *

importing Jupyter notebook from ./drive/MyDrive/ColabNotebooks/AIG-Labs/AIG-Assignment2/Q1_environment.ipynb


**NOTE**: `Q1_environment` contains

- The environment class and `FrozenLake` subclass
- The lists containing the small and big lake environments
- The function for rendering the policy and state-values
- The function for implementing the epsilon-greedy policy

Other necessary imports...

In [None]:
import numpy as np

## $\epsilon$-greedy policy
The $\epsilon$-greedy (epsilon-greedy) policy is a method of choosing an action from a state such that, with probability $\epsilon$, an action is chosen uniformly at random, and with probability $1-\epsilon$, the action that maximises the action-value function (as far as it has been estimated) from the given state is chosen. $\epsilon$ (epsilon) is called the "exploration factor". $\epsilon$-greedy is a solution to the exploration-exploitation dilemma, wherein the degree of exploration (i.e. trying actions whose potential is still unknown) is decided by the exploration factor.

<br>**NOTE**: **Breaking ties between value-maximising actions**:
<br>When the above policy chooses to exploit rather than explore, there may be multiple actions that (within a tolerance level) maximise the action-value function from the given state. In such a case, exploration can be encouraged even during exploitation by making a random selection from the value-maximising actions. This approach is _no worse and potentially better_ than simply picking the first action that maximises the action-value function from the given state, since it furthers exploration without reducing exploitation.
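
The tie-breaking rule described above can be sketched in isolation. The snippet below is only an illustration on a hypothetical array of action-values (`q_s` is made up, and `np.isclose` stands in for the tolerance check); it contrasts always taking the first maximiser with sampling uniformly among all maximisers.

```python
import numpy as np

random_state = np.random.RandomState(0)

# Hypothetical action-values for one state: actions 1 and 3 tie for the maximum.
q_s = np.array([0.2, 0.7, 0.1, 0.7])

# `argmax` always returns the first maximising action...
first_best = int(np.argmax(q_s))

# ...whereas collecting every maximiser (within a tolerance) and sampling
# uniformly among them spreads the greedy choices over all tied actions.
best = [a for a in range(len(q_s)) if np.isclose(q_s[a], q_s.max())]
picks = [int(random_state.choice(best)) for _ in range(1000)]

print(first_best)          # 1
print(best)                # [1, 3]
print(sorted(set(picks)))  # [1, 3]
```

Every sampled action is still a maximiser, so the expected value of the greedy step is unchanged; only the choice among equally valued actions is randomised.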

In [None]:
# NOTE: The following is generalised for the tabular & non-tabular methods:
def e_greedy(q, e, n_actions, random_state, s=None):
    '''
    NOTE ON THE ARGUMENTS:
    - `q`: One of the following...
        - The matrix of action-values for each state
        - The array of action-values for a given state
    - `e`: The exploration factor epsilon
    - `n_actions`: The number of actions an agent can take from any state
    - `random_state`: The set `numpy.random.RandomState` object
    - `s`: The given state

    If `s=None`, `q` is the array of action-values for a given state. Else,
    `q` is the matrix of action-values for each state.
    '''

    # Storing the action-values for the given state `s` (if `s` is given):
    if s is not None: q = q[s]

    # The epsilon-greedy policy:
    if random_state.rand() < e:
        return random_state.randint(0,n_actions)
    else:
        # Obtaining all actions that maximise action-value from state `s`:
        best = [a for a in range(n_actions) if np.allclose(np.max(q), q[a])]
        # Breaking the tie using random selection:
        return random_state.choice(best)

## SARSA control
**NOTE**: SARSA $\implies$ "State-Action-Reward-State-Action"
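
As a reference for the code below, the update that `sarsa` applies at each step can be written as follows, where $s'$ and $a'$ stand for `next_s` and `next_a`, and $\eta_i$ for the learning rate of episode $i$ (the primed symbols and the subscript are notation introduced here, not taken from the assignment):

$$Q(s, a) \leftarrow Q(s, a) + \eta_i \left[ r + \gamma\, Q(s', a') - Q(s, a) \right]$$

Because $a'$ is drawn from the same $\epsilon$-greedy policy that the agent then follows, SARSA is an on-policy method.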

In [None]:
def sarsa(env, max_episodes, eta, gamma, epsilon, seed=None):
    '''
    NOTE ON THE ARGUMENTS:
    - `env`: Object of the chosen environment class (ex. FrozenLake)
    - `max_episodes`: Upper limit of episodes the agent can go through
    - `eta`:
        - Initial learning rate
        - The learning rate is meant to decrease linearly over episodes
    - `gamma`:
        - Discount factor
        - Not subject to change over time
    - `epsilon`:
        - Initial exploration factor
        - Exploration factor is w.r.t. epsilon-greedy policy
        - Denotes the chance of selecting a random action
        - The exploration factor is meant to decrease linearly over episodes
    - `seed`:
        - Optional seed for pseudorandom number generation
        - By default, it is `None` ==> random seed will be chosen
    '''
    # INITIALISATION
    # Setting random state with `seed` for enabling replicability:
    random_state = np.random.RandomState(seed)

    # Initialising key parameters:
    # 1. Array of linearly decreasing learning rates:
    eta = np.linspace(eta, 0, max_episodes)
    # 2. Array of linearly decreasing exploration factors:
    epsilon = np.linspace(epsilon, 0, max_episodes)
    # 3. Array of state-action values:
    q = np.zeros((env.n_states, env.n_actions))
    '''
    NOTE ON THE NEW `eta` & `epsilon`:
    The above `eta` and `epsilon` are arrays formed by taking the initial
    learning rate and exploration factor given and creating an array of
    linearly decreasing values. Hence:
    - eta[i] is the learning rate for the ith episode
    - epsilon[i] is the exploration factor for the ith episode
    '''

    #================================================

    # LEARNING LOOP
    for i in range(max_episodes):
        # NOTE: i ==> episode number
        # Beginning at the initial state before each episode:
        s = env.reset()
        # Selecting action `a` for `s` by epsilon-greedy policy based on `q`:
        a = e_greedy(q, epsilon[i], env.n_actions, random_state, s)
        # While the state is not terminal:
        '''
        HOW TO CHECK IF A STATE IS TERMINAL?
        A state is treated as terminal when the maximum number of steps
        has been reached, when the agent is in the absorbing state, or when
        every action from the current state leads to the absorbing state.
        In this implementation, this check is handled by the `done` flag
        returned by `env.step`; if `False`, continue, else consider the
        state as terminal.
        '''
        done = False
        while not done:
            # Next state & reward after taking `a` from `s`:
            next_s, r, done = env.step(a)
            # NOTE: `env.step` automatically updates the state of the agent

            # Selecting action `next_a` for `next_s`:
            # NOTE: Selection is done by epsilon greedy policy based on `q`
            next_a = e_greedy(q, epsilon[i], env.n_actions, random_state, next_s)

            # Updating the action-value for the current state-action pair:
            # USING: Temporal difference for (s,a) with epsilon-greedy policy
            q[s, a] = q[s, a] + eta[i]*(r + gamma*q[next_s, next_a] - q[s, a])

            # Moving to the next state & action pair:
            s, a = next_s, next_a

    #================================================

    # FINAL RESULTS
    # Obtaining the estimated optimal policy
    # NOTE: Policy = The column index (i.e. action) with max value per row
    policy = q.argmax(axis=1)
    # Obtaining the state values w.r.t the above policy:
    # NOTE: Value = Max value per row
    value = q.max(axis=1)
    # Returning the above obtained policy & state value array:
    return policy, value
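
The linear decay that `sarsa` applies to `eta` and `epsilon` can be inspected in isolation. This is a small sketch with hypothetical starting values and a deliberately tiny number of episodes, so that the whole schedule fits on one line.

```python
import numpy as np

max_episodes = 5  # hypothetical, tiny on purpose

# Same construction as in `sarsa`: start value down to 0, one entry per episode.
eta = np.linspace(0.5, 0, max_episodes)
epsilon = np.linspace(1.0, 0, max_episodes)

print(eta.tolist())      # [0.5, 0.375, 0.25, 0.125, 0.0]
print(epsilon.tolist())  # [1.0, 0.75, 0.5, 0.25, 0.0]
```

Both schedules reach exactly 0 in the last episode. With this construction, the final episode is therefore purely greedy and, since its learning rate is 0, leaves `q` unchanged.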

## Q-Learning
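
As a reference for the code below, the update that `q_learning` applies at each step can be written as follows, where $s'$ stands for `next_s`, $a'$ ranges over the actions available from $s'$, and $\eta_i$ is the learning rate of episode $i$ (this notation is introduced here, not taken from the assignment):

$$Q(s, a) \leftarrow Q(s, a) + \eta_i \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Unlike in `sarsa`, the bootstrap term uses the maximum action-value in the next state regardless of which action the $\epsilon$-greedy policy actually takes next, which makes Q-learning an off-policy method.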

In [None]:
def q_learning(env, max_episodes, eta, gamma, epsilon, seed=None):
    '''
    NOTE ON THE ARGUMENTS:
    Same as for the function `sarsa`.
    '''
    # INITIALISATION
    # Setting random state with `seed` for enabling replicability:
    random_state = np.random.RandomState(seed)

    # Initialising key parameters:
    # 1. Array of linearly decreasing learning rates:
    eta = np.linspace(eta, 0, max_episodes)
    # 2. Array of linearly decreasing exploration factors:
    epsilon = np.linspace(epsilon, 0, max_episodes)
    # 3. Array of state-action values:
    q = np.zeros((env.n_states, env.n_actions))
    '''
    NOTE ON THE NEW `eta` & `epsilon`:
    Check corresponding comment for the function `sarsa`.
    '''

    #================================================

    # LEARNING LOOP
    for i in range(max_episodes):
        # NOTE: i ==> episode number
        # Beginning at the initial state before each episode:
        s = env.reset()
        # Selecting action `a` for `s` by epsilon-greedy policy based on `q`:
        a = e_greedy(q, epsilon[i], env.n_actions, random_state, s)
        # While the state is not terminal:
        '''
        HOW TO CHECK IF A STATE IS TERMINAL?
        Check corresponding comment for the function `sarsa`.
        '''
        done = False
        while not done:
            # Next state & reward after taking `a` from `s`:
            next_s, r, done = env.step(a)
            # NOTE: `env.step` automatically updates the state of the agent

            # Updating the action-value for the current state-action pair:
            # USING: Temporal difference for (s,a) with greedy policy
            q[s, a] = q[s, a] + eta[i]*(r + gamma*np.max(q[next_s]) - q[s, a])

            # Moving to the next state & selecting the next action from it:
            # NOTE: Selection is done by epsilon-greedy policy based on `q`
            s = next_s
            a = e_greedy(q, epsilon[i], env.n_actions, random_state, s)

    #================================================

    # FINAL RESULTS
    # Obtaining the estimated optimal policy
    # NOTE: Policy = The column index (i.e. action) with max value per row
    policy = q.argmax(axis=1)
    # Obtaining the state values w.r.t the above policy:
    # NOTE: Value = Max value per row
    value = q.max(axis=1)
    # Returning the above obtained policy & state value array:
    return policy, value
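
The final step shared by `sarsa` and `q_learning`, reading a greedy policy and its state values off the action-value table, can be illustrated on a small made-up table. The numbers below are hypothetical and serve only to show what `argmax` and `max` along `axis=1` return.

```python
import numpy as np

# Hypothetical action-value table: 3 states (rows) x 2 actions (columns).
q = np.array([[0.1, 0.4],
              [0.9, 0.3],
              [0.0, 0.0]])

policy = q.argmax(axis=1)  # index of the highest-valued action per state
value = q.max(axis=1)      # the corresponding action-value per state

print(policy.tolist())  # [1, 0, 0]
print(value.tolist())   # [0.4, 0.9, 0.0]
```

Note that `argmax` resolves ties by returning the lowest index, so a state whose action-values are all equal (for example, all still 0) is assigned action 0 in the extracted policy.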

## Testing the above functions
_The function testing code must not run if this file is imported as a module, hence we do..._<br>`if __name__ == '__main__'`<br>_... to check if the current file is being executed as the main code._

In [None]:
if __name__ == '__main__':
    # Defining the parameters:
    env = FrozenLake(lake=LAKE['small'], slip=0.1, max_steps=None, seed=0)
    # NOTE: Putting `max_steps=None` makes it default to the grid size
    max_episodes = 2000
    eta = 1
    gamma = 0.9
    epsilon = 1

    # Running the functions:
    SARSA = sarsa(env, max_episodes, eta, gamma, epsilon, 0)
    QLearning = q_learning(env, max_episodes, eta, gamma, epsilon, 0)
    labels = ("sarsa", "q-learning")

    # Displaying results:
    displayResults((SARSA, QLearning), labels, env)



AGENT PERFORMANCE AFTER SARSA

Lake:
[['&' '.' '.' '.']
 ['.' '#' '.' '#']
 ['.' '.' '.' '#']
 ['#' '.' '.' '$']]
Policy:
[['_' '<' '_' '<']
 ['_' '^' '_' '^']
 ['>' '_' '_' '^']
 ['^' '>' '>' '_']]
Value:
[[0.386 0.268 0.320 0.243]
 [0.463 0.000 0.489 0.000]
 [0.555 0.642 0.767 0.000]
 [0.000 0.745 0.886 1.000]]


AGENT PERFORMANCE AFTER Q-LEARNING

Lake:
[['&' '.' '.' '.']
 ['.' '#' '.' '#']
 ['.' '.' '.' '#']
 ['#' '.' '.' '$']]
Policy:
[['_' '>' '_' '<']
 ['_' '^' '_' '^']
 ['>' '_' '_' '^']
 ['^' '>' '>' '>']]
Value:
[[0.502 0.489 0.635 0.568]
 [0.561 0.000 0.701 0.000]
 [0.647 0.720 0.798 0.000]
 [0.000 0.798 0.892 1.000]]
