# AI in Games, _Reinforcement Learning_<br>Assignment 2, Question 3:<br>**Tabular Model-free Methods**

## Introduction to the concept
Model-free methods are algorithms that do not use (and thus do not require) a known environment model. "Tabular" refers to the use of look-up tables or arrays that store the current estimates of a value function (e.g. the state-action value function) and refine these estimates at every update. Unsurprisingly, when the environment model is not known to begin with, i.e. when the state-transition probabilities and rewards are initially unknown for each state-action pair, policy evaluation and improvement take longer to become accurate and effective, since the agent has to learn the value of each state-action pair by interacting with the environment repeatedly.
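
As a minimal sketch of what "tabular" means in practice, the snippet below stores one estimate per state-action pair in a NumPy array and moves a single entry towards a target value. The table sizes, the helper name `td_update`, and the target are hypothetical and chosen only for illustration; they are not taken from the assignment's environment.

```python
import numpy as np

# Hypothetical sizes, for illustration only:
n_states, n_actions = 4, 2

# The "table": one value estimate per (state, action) pair, initially zero.
q = np.zeros((n_states, n_actions))

def td_update(q, s, a, target, eta):
    """Move q[s, a] a fraction `eta` of the way towards `target` (in place)."""
    q[s, a] += eta * (target - q[s, a])
    return q

# One update of the entry for state 0, action 1:
td_update(q, s=0, a=1, target=1.0, eta=0.5)
print(q[0, 1])
```

Only the visited entry changes; every other estimate in the table is left as it was, which is why each state-action pair has to be visited repeatedly before its estimate becomes reliable.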

## Preparing the context
The following are the necessary preparations and imports needed to run and test the main code of this document in the intended context. Mounting directory & setting present working directory...

In [None]:
# Mounting the Google Drive folder (run if necessary):
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)
# Saving the present working directory's path:
pwd = "./drive/MyDrive/ColabNotebooks/AIG-Labs/AIG-Assignment2/"

Mounted at /content/drive/


Installing the module `import_ipynb`, which enables importing Jupyter Notebooks as modules...

`!pip install import_ipynb`

Importing the code in notebook `Q1_environment.ipynb`...




In [None]:
import import_ipynb
import numpy as np
N = import_ipynb.NotebookLoader(path=[pwd])
N.load_module("Q1_environment")
from Q1_environment import *

importing Jupyter notebook from ./drive/MyDrive/ColabNotebooks/AIG-Labs/AIG-Assignment2/Q1_environment.ipynb


## SARSA control
**NOTE**: SARSA $\implies$ "State-Action-Reward-State-Action"
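
For reference, the update applied inside the learning loop of `sarsa` can be written as follows, using $\eta$ for `eta` and $\gamma$ for `gamma`. Here $s'$ is the next state and $a'$ is the action selected for it by the epsilon-greedy policy, which is where the five elements of the name (state, action, reward, state, action) come from:

$$Q(s, a) \leftarrow Q(s, a) + \eta \left[ r + \gamma\, Q(s', a') - Q(s, a) \right]$$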

In [None]:
def sarsa(env, max_episodes, eta, gamma, epsilon, seed=None):
    '''
    NOTE ON THE ARGUMENTS:
    - `env`: Object of the chosen environment class (ex. FrozenLake)
    - `max_episodes`: Upper limit of episodes the agent can go through
    - `eta`:
        - Initial learning rate
        - The learning rate is meant to decrease linearly over episodes
    - `gamma`:
        - Discount factor
        - Not subject to change over time
    - `epsilon`:
        - Initial exploration factor
        - Exploration factor is w.r.t. epsilon-greedy policy
        - Denotes the chance of selecting a random action
        - The exploration factor is meant to decrease linearly over episodes
    - `seed`:
        - Optional seed for pseudorandom number generation
        - By default, it is `None` ==> random seed will be chosen
    '''
    # INITIALISATION
    # Creating the pseudorandom number generator from the given seed:
    random_state = np.random.RandomState(seed)
    # Initialising array of learning rates:
    eta = np.linspace(eta, 0, max_episodes)
    # Initialising array of exploration factors:
    epsilon = np.linspace(epsilon, 0, max_episodes)
    # Initialising the state-action values:
    q = np.zeros((env.n_states, env.n_actions))
    '''
    NOTE ON THE NEW `eta` & `epsilon`:
    The above `eta` and `epsilon` are arrays formed by taking the initial
    learning rate and exploration factor given and creating an array of
    linearly decreasing values. Hence:
    - eta[i] is the learning rate for the ith episode
    - epsilon[i] is the exploration factor for the ith episode
    '''

    #================================================

    # EPSILON-GREEDY POLICY
    # Implementing the epsilon-greedy policy as a lambda function:
    # NOTE: Random draws come from `random_state` so that `seed` takes effect
    e_greedy = lambda s, e: (random_state.randint(env.n_actions)
                             if random_state.rand() < e
                             else np.argmax(q[s]))
    # NOTE: `e` is the given epsilon value

    #================================================

    # LEARNING LOOP
    for i in range(max_episodes):
        # NOTE: i ==> episode number
        # Beginning at the initial state before each episode:
        s = env.reset()
        # Selecting action `a` for `s` by epsilon-greedy policy based on `q`:
        a = e_greedy(s, epsilon[i])
        # While the state is not terminal:
        '''
        HOW TO CHECK IF A STATE IS TERMINAL?
        An episode ends when the maximum number of steps is reached, when
        the agent is in the absorbing state, or when the agent is in a state
        from which every action leads to the absorbing state. In this
        implementation, the check is delegated to the `done` flag returned by
        `env.step`: the loop continues while it is `False` and stops once it
        is `True`.
        '''
        done = False
        while not done:
            # Next state & reward after taking `a` from `s`:
            next_s, r, done = env.step(a)
            # NOTE: `env.step` automatically updates the state of the agent

            # Selecting action `next_a` for `next_s`:
            # NOTE: Selection is done by epsilon greedy policy based on `q`
            next_a = e_greedy(next_s, epsilon[i])

            # Updating the action value for the current state-action pair:
            # USING: Temporal difference for (s,a) with epsilon-greedy policy
            q[s, a] = q[s, a] + eta[i]*(r + gamma*q[next_s, next_a] - q[s, a])

            # Moving to the next state & action pair:
            s, a = next_s, next_a

    #================================================

    # FINAL RESULTS
    # Obtaining the estimated optimal policy
    # NOTE: Policy = The column index (i.e. action) with max value per row
    policy = q.argmax(axis=1)
    # Obtaining the state values w.r.t the above policy:
    # NOTE: Value = Max value per row
    value = q.max(axis=1)
    # Returning the above obtained policy & state value array:
    return policy, value
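
The two ingredients that drive exploration in `sarsa`, the linearly decaying schedule and the epsilon-greedy selection, can be sketched in isolation as below. The helper name `make_epsilon_greedy` and the toy table are hypothetical. The random tie-breaking among equally good actions is an assumption of this sketch and is not part of the function above, which uses `np.argmax` and therefore always takes the first of several tied actions.

```python
import numpy as np

def make_epsilon_greedy(q, random_state):
    """Return an epsilon-greedy action selector over the table `q` (hypothetical helper)."""
    n_actions = q.shape[1]

    def select(s, e):
        # Explore: with probability `e`, pick a uniformly random action.
        if random_state.rand() < e:
            return int(random_state.randint(n_actions))
        # Exploit: pick among the best actions, breaking ties at random
        # (an assumption of this sketch; np.argmax would take the first one).
        best = np.flatnonzero(q[s] == q[s].max())
        return int(random_state.choice(best))

    return select

# Linearly decaying schedule, built the same way as in `sarsa`:
epsilon = np.linspace(1.0, 0.0, 5)      # one value per episode

q = np.array([[0.1, 0.7, 0.2]])         # toy table: 1 state, 3 actions
select = make_epsilon_greedy(q, np.random.RandomState(0))
greedy_action = select(0, epsilon[-1])  # epsilon is 0 here, so purely greedy
```

Because every random draw goes through the seeded generator, two selectors created with the same seed produce the same sequence of actions, which makes runs reproducible.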

## Q-Learning
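
Q-learning differs from SARSA only in the target of the update: it bootstraps from the maximum value of the next state $s'$ rather than from the value of the action actually selected there. With $\eta$ for `eta` and $\gamma$ for `gamma`, the update used in the loop of `q_learning` can be written as:

$$Q(s, a) \leftarrow Q(s, a) + \eta \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$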

In [None]:
def q_learning(env, max_episodes, eta, gamma, epsilon, seed=None):
    '''
    NOTE ON THE ARGUMENTS:
    Same as for the function `sarsa`.
    '''
    # INITIALISATION
    # Creating the pseudorandom number generator from the given seed:
    random_state = np.random.RandomState(seed)
    # Initialising array of learning rates:
    eta = np.linspace(eta, 0, max_episodes)
    # Initialising array of exploration factors:
    epsilon = np.linspace(epsilon, 0, max_episodes)
    # Initialising the state-action values:
    q = np.zeros((env.n_states, env.n_actions))
    '''
    NOTE ON THE NEW `eta` & `epsilon`:
    Check corresponding comment for the function `sarsa`.
    '''

    #================================================

    # EPSILON-GREEDY POLICY
    # Implementing the epsilon-greedy policy as a lambda function:
    # NOTE: Random draws come from `random_state` so that `seed` takes effect
    e_greedy = lambda s, e: (random_state.randint(env.n_actions)
                             if random_state.rand() < e
                             else np.argmax(q[s]))
    # NOTE: `e` is the given epsilon value

    #================================================

    # LEARNING LOOP
    for i in range(max_episodes):
        # NOTE: i ==> episode number
        # Beginning at the initial state before each episode:
        s = env.reset()
        # Selecting action `a` for `s` by epsilon-greedy policy based on `q`:
        a = e_greedy(s, epsilon[i])
        # While the state is not terminal:
        '''
        HOW TO CHECK IF A STATE IS TERMINAL?
        Check corresponding comment for the function `sarsa`.
        '''
        done = False
        while not done:
            # Next state & reward after taking `a` from `s`:
            next_s, r, done = env.step(a)
            # NOTE: `env.step` automatically updates the state of the agent

            # Updating the action value for the current state-action pair:
            # USING: Temporal difference for (s,a) with greedy policy
            q[s, a] = q[s, a] + eta[i]*(r + gamma*np.max(q[next_s]) - q[s, a])

            # Moving to the next state and selecting its action:
            # NOTE: The action must be selected for `next_s`, not for the old `s`
            s = next_s
            a = e_greedy(s, epsilon[i])

    #================================================

    # FINAL RESULTS
    # Obtaining the estimated optimal policy
    # NOTE: Policy = The column index (i.e. action) with max value per row
    policy = q.argmax(axis=1)
    # Obtaining the state values w.r.t the above policy:
    # NOTE: Value = Max value per row
    value = q.max(axis=1)
    # Returning the above obtained policy & state value array:
    return policy, value
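
To make the difference between the two update targets concrete, the sketch below computes both for a single transition on a toy table. All values here are hypothetical and chosen for illustration; they are unrelated to the FrozenLake results.

```python
import numpy as np

q = np.array([[0.0, 0.0],
              [0.2, 0.8]])    # toy table: 2 states, 2 actions
r, gamma = 0.0, 0.9
next_s, next_a = 1, 0         # suppose exploration picked the worse action 0

# SARSA (on-policy): bootstraps from the action actually selected in next_s.
sarsa_target = r + gamma * q[next_s, next_a]

# Q-learning (off-policy): bootstraps from the best action in next_s.
q_learning_target = r + gamma * np.max(q[next_s])

print(sarsa_target, q_learning_target)
```

The two targets come out at roughly 0.18 and 0.72. They coincide whenever the selected action is already the greedy one, so the two methods behave more and more alike as `epsilon` decays towards zero.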

## Testing the above functions
_The test code must not run when this file is imported as a module, hence we use..._<br>`if __name__ == '__main__'`<br>_... to check whether the current file is being executed as the main program._

In [None]:
if __name__ == '__main__':
    # Defining the parameters:
    env = FrozenLake(lake['small'], 0.1, 100)
    max_episodes = 2000
    eta = 1
    gamma = 0.9
    epsilon = 1

    # Running the functions:
    SARSA = sarsa(env, max_episodes, eta, gamma, epsilon)
    QLearning = q_learning(env, max_episodes, eta, gamma, epsilon)
    labels = ("sarsa", "q-learning")

    # Displaying results:
    displayResults((SARSA, QLearning), labels, env)



AGENT PERFORMANCE AFTER SARSA

Lake:
[['&' '.' '.' '.']
 ['.' '#' '.' '#']
 ['.' '.' '.' '#']
 ['#' '.' '.' '$']]
Policy:
[['_' '>' '_' '<']
 ['_' '^' '_' '^']
 ['>' '_' '_' '^']
 ['^' '>' '>' '_']]
Value:
[[0.430 0.383 0.456 0.346]
 [0.492 0.000 0.597 0.000]
 [0.546 0.658 0.762 0.000]
 [0.000 0.776 0.887 1.000]]


AGENT PERFORMANCE AFTER Q-LEARNING

Lake:
[['&' '.' '.' '.']
 ['.' '#' '.' '#']
 ['.' '.' '.' '#']
 ['#' '.' '.' '$']]
Policy:
[['_' '>' '_' '<']
 ['_' '^' '_' '^']
 ['>' '_' '_' '^']
 ['^' '>' '>' '>']]
Value:
[[0.496 0.467 0.602 0.545]
 [0.554 0.000 0.577 0.000]
 [0.637 0.715 0.721 0.000]
 [0.000 0.801 0.869 1.000]]
