# Homework Assignment Week 4 - Q-Learning <a class="tocSkip">

This weeks homework assignment is to implement Q learning from scratch for the gridworld environment. Use this [repository](https://github.com/rlcode/reinforcement-learning/tree/master/1-grid-world) as a guide, but try not to peak at the Q learning code, recreate it, then check your code with it. Good luck!


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Exercise-1---Standard-Q-Learning" data-toc-modified-id="Exercise-1---Standard-Q-Learning-1">Exercise 1 - Standard Q-Learning</a></span></li><li><span><a href="#Exercise-2---Q-Learning-with-Adaptive-Learning-Rate" data-toc-modified-id="Exercise-2---Q-Learning-with-Adaptive-Learning-Rate-2">Exercise 2 - Q-Learning with Adaptive Learning Rate</a></span></li><li><span><a href="#Exercise-3---Q-Learning-with-Adaptive-Learning-Rate-&amp;-Adaptive-Epsilon" data-toc-modified-id="Exercise-3---Q-Learning-with-Adaptive-Learning-Rate-&amp;-Adaptive-Epsilon-3">Exercise 3 - Q-Learning with Adaptive Learning Rate &amp; Adaptive Epsilon</a></span></li><li><span><a href="#Exercise-3---Experimenting" data-toc-modified-id="Exercise-3---Experimenting-4">Exercise 3 - Experimenting</a></span></li><li><span><a href="#Report" data-toc-modified-id="Report-5">Report</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-5.0.1">Introduction</a></span></li><li><span><a href="#Findings" data-toc-modified-id="Findings-5.0.2">Findings</a></span></li><li><span><a href="#Potential-Explanations" data-toc-modified-id="Potential-Explanations-5.0.3">Potential Explanations</a></span></li></ul></li></ul></li><li><span><a href="#Project-Extension-Ideas" data-toc-modified-id="Project-Extension-Ideas-6">Project Extension Ideas</a></span></li></ul></div>

# Exercise 1 - Standard Q-Learning

In [1]:
# import libraries
import pixiedust
import numpy as np
import pprint
from grid_world import standard_grid
from utils import print_values, print_policy

Pixiedust database opened successfully


In [69]:
# Global variables
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')
GAMMA = 0.95 # Discount factor from Bellman Equation
# Track positions of interest in the grid
POSITIONS = {
    'START': (2, 0),
    'WIN': (0, 3),
    'LOSE': (1, 3)
}


# Helper Functions   
def display_q(Q):
    print('\nQ Table:')
    pprint.pprint(Q)

    
def report_q(Q):
    for k in sorted(Q.keys()):
        print(f"{str(k):>6}", end='->  ')
        for j in sorted(Q[k].keys()):
            print(f"{j}: {Q[k][j]:>8.3}", end=', ')
        print("")

        
def initialize_q(grid):
    '''initialize Q(s,a) and returns'''
    Q = {}
    states = grid.non_terminal_states()
    for s in states:
        Q[s] = {}
        for a in ALL_POSSIBLE_ACTIONS:
            Q[s][a] = 0.0
    return Q


def epsilon_action(Q, state, epsilon=0.1):
    '''epsilon greedy action selection'''
    status = f'State: {state}'
    #print(f'\nState: {state}')
    if np.random.random() < (1 - epsilon):
        status += ". Following policy..."
        #print(Q[state])
        #action = np.argmax(Q[state])
        action = max(Q[state], key=Q[state].get)
    else:
        status += ". Following random policy..."
        action = np.random.choice(ALL_POSSIBLE_ACTIONS)
    
    status += f" Taking action '{action}'..."
    #print(f"\nStatus: '{status}'")
    return action


def update_q(Q, prev_state, action, reward, curr_state):
    #report_q(Q)
    alpha = 0.1 # Learning Rate
    #Q[prev_state][action] += alpha * (reward + GAMMA * max(Q[curr_state]) - Q[prev_state][action])
    Q[prev_state][action] += alpha * (reward + GAMMA * Q[curr_state][max(Q[curr_state])] - Q[prev_state][action])
    

# Main Program
def main(num_episodes=1000, epsilon=0.2, episode_window=1000):
    
    def print_episode_status():        
        if curr_state in Q:
            max_q = max(Q[curr_state])
        else:
            max_q = "Terminal State"
        
        print(f"Turn: {turn:>5}| "
            f"Previous State: {str(prev_state):>5}| "
            f"Action: {action}| "
            f"Actual Action: {actual_action}| "
            f"\nReward: {reward:>5.3f}| "
            f"Curr State: {str(curr_state):>5}| "
            f"Max Q: {max_q:>15}")
        
        
    # this grid gives you a reward of -0.1 for every non-terminal state
    # we want to see if this will encourage finding a shorter path to the goal
    grid = standard_grid(obey_prob=0.5, step_cost=-.5)

    # print rewards
    print("rewards:")
    print_values(grid.rewards, grid)
    
    Q = initialize_q(grid)
    #print('Initial Q:')
    #report_q(Q)
    
    total_reward = 0
    
    for episode in range(num_episodes + 1):
        
        if episode % episode_window == 0 and episode != 0:
            avg_reward = total_reward/episode_window
            print(f"\nEpisode = {episode}| Avg Reward = {avg_reward}")
            #report_q(Q)
            total_reward = 0
        
        #if episode % 10000 == 0:
        
        # reset our position to starting position for new episode
        if episode != 0:
            grid.set_state(POSITIONS['START'])
        
        turn = 0
        curr_state = grid.current_state()
        #print(f"\n\nStarting episode {episode} with current state at {curr_state}")
        
        while not grid.game_over():
            prev_state = curr_state
            action = epsilon_action(Q, curr_state, epsilon=epsilon)
            actual_action, reward = grid.move(action)
            total_reward += reward
            curr_state = grid.current_state()

            #print_episode_status()
            
            # check if we reached end points
            if grid.is_terminal(curr_state):
                if curr_state == POSITIONS['WIN']:
                    win_or_lose = 'WON'
                else:
                    win_or_lose = 'LOST'
                #print(f"\nYOU {win_or_lose} episode {episode} with {turn + 1} turns!")
                break

            update_q(Q, prev_state, action, reward, curr_state)
            turn += 1

        #print('Game over!')
    #display_values(Q, returns)

In [70]:
#%pixie_debugger
main(num_episodes=100000, episode_window=500)

rewards:
---------------------------
-0.50|-0.50|-0.50| 1.00|
---------------------------
-0.50| 0.00|-0.50|-1.00|
---------------------------
-0.50|-0.50|-0.50|-0.50|

Episode = 500| Avg Reward = -7.568

Episode = 1000| Avg Reward = -6.923

Episode = 1500| Avg Reward = -7.633

Episode = 2000| Avg Reward = -7.415

Episode = 2500| Avg Reward = -7.418

Episode = 3000| Avg Reward = -7.181

Episode = 3500| Avg Reward = -7.121

Episode = 4000| Avg Reward = -7.472

Episode = 4500| Avg Reward = -7.497

Episode = 5000| Avg Reward = -7.994

Episode = 5500| Avg Reward = -7.051

Episode = 6000| Avg Reward = -7.566

Episode = 6500| Avg Reward = -7.304

Episode = 7000| Avg Reward = -7.481

Episode = 7500| Avg Reward = -7.454

Episode = 8000| Avg Reward = -7.156

Episode = 8500| Avg Reward = -7.612

Episode = 9000| Avg Reward = -7.383

Episode = 9500| Avg Reward = -7.189

Episode = 10000| Avg Reward = -7.435

Episode = 10500| Avg Reward = -7.055

Episode = 11000| Avg Reward = -7.507

Episode = 11500

---

# Exercise 2 - Q-Learning with Adaptive Learning Rate

In [97]:
# Global variables
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')
GAMMA = 0.95 # Discount factor from Bellman Equation
START_ALPHA = 0.1 # Learning rate, how much we update our Q table each step
ALPHA_TAPER = 0.01 # How much our adaptive learning rate is adjusted each update
# Track positions of interest in the grid
POSITIONS = {
    'START': (2, 0),
    'WIN': (0, 3),
    'LOSE': (1, 3)
}


# Helper Functions   
def display_q(Q):
    print('\nQ Table:')
    pprint.pprint(Q)

    
def report_table(table, format_str='8.3'):
    for k in sorted(table.keys()):
        print(f"{str(k):>6}", end='->  ')
        for j in sorted(table[k].keys()):
            value = table[k][j]
            print(f"{j}: {value:>{format_str}}", end=', ')
        print("")

        
def initialize_table(grid, initial_value=0):
    '''initialize Q(s,a) and returns'''
    table = {}
    states = grid.non_terminal_states()
    for s in states:
        table[s] = {}
        for a in ALL_POSSIBLE_ACTIONS:
            table[s][a] = initial_value
    return table


def epsilon_action(Q, state, epsilon=0.1):
    '''epsilon greedy action selection'''
    status = f'State: {state}'
    if np.random.random() < (1 - epsilon):
        status += ". Following policy..."
        action = max(Q[state], key=Q[state].get)
    else:
        status += ". Following random policy..."
        action = np.random.choice(ALL_POSSIBLE_ACTIONS)
    
    status += f" Taking action '{action}'..."
    #print(f"\nStatus: '{status}'")
    return action


def update_q(Q, prev_state, action, reward, curr_state, update_counts):
    #report_table(Q)
    # Learning Rate
    alpha = START_ALPHA / (1.0 + update_counts[prev_state][action] * ALPHA_TAPER)
    update_counts[prev_state][action] += 1
    
    # Update Q
    Q[prev_state][action] += alpha * (reward + GAMMA * Q[curr_state][max(Q[curr_state])] - Q[prev_state][action])
    return alpha
    
    

# Main Program
def main(num_episodes=1000, epsilon=0.2, episode_window=1000):
    
    def print_episode_status():        
        if curr_state in Q:
            max_q = max(Q[curr_state])
        else:
            max_q = "Terminal State"
        
        print(f"Turn: {turn:>5}| "
            f"Previous State: {str(prev_state):>5}| "
            f"Action: {action}| "
            f"Actual Action: {actual_action}| "
            f"\nReward: {reward:>5.3f}| "
            f"Curr State: {str(curr_state):>5}| "
            f"Max Q: {max_q:>15}")
        
        
    # this grid gives you a reward of -0.1 for every non-terminal state
    # we want to see if this will encourage finding a shorter path to the goal
    grid = standard_grid(obey_prob=0.5, step_cost=-.5)

    # print rewards
    print("Rewards:")
    print_values(grid.rewards, grid)
    print('\n')
    
    # Initialize Tables
    Q = initialize_table(grid, initial_value=0.0)
    update_counts = initialize_table(grid, initial_value=0)
    #print('Initial Q:')
    #report_table(Q)
    #print('Initial Update Counts:')
    #report_table(update_counts, format_str='5')
    
    total_reward = 0
    alpha = 0
    
    for episode in range(num_episodes + 1):
        
        if episode % episode_window == 0 and episode != 0:
            avg_reward = total_reward/episode_window
            print(f"\nEpisode = {episode}| Avg Reward = {avg_reward}| Alpha = {alpha}")
            #print('Q-Table:')
            #report_table(Q)
            #print('\nUpdate Counts Table:')
            #report_table(update_counts, format_str='5')
            total_reward = 0
        
        #if episode % 10000 == 0:
        
        # reset our position to starting position for new episode
        if episode != 0:
            grid.set_state(POSITIONS['START'])
        
        turn = 0
        curr_state = grid.current_state()
        #print(f"\n\nStarting episode {episode} with current state at {curr_state}")
        
        while not grid.game_over():
            prev_state = curr_state
            action = epsilon_action(Q, curr_state, epsilon=epsilon)
            actual_action, reward = grid.move(action)
            total_reward += reward
            curr_state = grid.current_state()

            #print_episode_status()
            
            # check if we reached end points
            if grid.is_terminal(curr_state):
                if curr_state == POSITIONS['WIN']:
                    win_or_lose = 'WON'
                else:
                    win_or_lose = 'LOST'
                #print(f"\nYOU {win_or_lose} episode {episode} with {turn + 1} turns!")
                break

            alpha = update_q(Q, prev_state, action, reward, curr_state, update_counts)
            turn += 1

        #print('Game over!')
    #display_values(Q, returns)

In [98]:
main(num_episodes=100000, episode_window=500)

Rewards:
---------------------------
-0.50|-0.50|-0.50| 1.00|
---------------------------
-0.50| 0.00|-0.50|-1.00|
---------------------------
-0.50|-0.50|-0.50|-0.50|



Episode = 500| Avg Reward = -8.084| Alpha = 0.05714285714285715

Episode = 1000| Avg Reward = -7.218| Alpha = 0.00931098696461825

Episode = 1500| Avg Reward = -7.34| Alpha = 0.006426735218508998

Episode = 2000| Avg Reward = -6.909| Alpha = 0.005027652086475616

Episode = 2500| Avg Reward = -7.421| Alpha = 0.004070004070004071

Episode = 3000| Avg Reward = -7.357| Alpha = 0.0039032006245121

Episode = 3500| Avg Reward = -7.312| Alpha = 0.00590318772136954

Episode = 4000| Avg Reward = -6.821| Alpha = 0.0026239832065074785

Episode = 4500| Avg Reward = -7.469| Alpha = 0.004699248120300752

Episode = 5000| Avg Reward = -7.162| Alpha = 0.006854009595613434

Episode = 5500| Avg Reward = -7.419| Alpha = 0.0019087612139721323

Episode = 6000| Avg Reward = -7.434| Alpha = 0.006119951040391677

Episode = 6500| Avg Reward = -


Episode = 59500| Avg Reward = -7.159| Alpha = 0.0003633456870866943

Episode = 60000| Avg Reward = -7.112| Alpha = 0.00017838667094794677

Episode = 60500| Avg Reward = -7.108| Alpha = 0.0001955645949857238

Episode = 61000| Avg Reward = -7.277| Alpha = 0.00017542013121425815

Episode = 61500| Avg Reward = -7.359| Alpha = 0.0005966231131794046

Episode = 62000| Avg Reward = -6.995| Alpha = 0.0001910621142933568

Episode = 62500| Avg Reward = -7.672| Alpha = 0.00018961299986727093

Episode = 63000| Avg Reward = -7.203| Alpha = 0.00034295905068934773

Episode = 63500| Avg Reward = -7.481| Alpha = 0.0006046314771146986

Episode = 64000| Avg Reward = -7.094| Alpha = 0.0010462439840970914

Episode = 64500| Avg Reward = -7.072| Alpha = 0.00018399941120188416

Episode = 65000| Avg Reward = -6.973| Alpha = 0.00016462260268334841

Episode = 65500| Avg Reward = -7.446| Alpha = 0.00033000033000033

Episode = 66000| Avg Reward = -7.278| Alpha = 0.00016216391528557067

Episode = 66500| Avg Reward 

---

# Exercise 3 - Q-Learning with Adaptive Learning Rate & Adaptive Epsilon

In [119]:
# Global variables
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')
GAMMA = 0.95 # Discount factor from Bellman Equation
START_ALPHA = 0.1 # Learning rate, how much we update our Q table each step
ALPHA_TAPER = 0.01 # How much our adaptive learning rate is adjusted each update
START_EPSILON = 1.0 # Probability of random action
EPSILON_TAPER = 0.01 # How much epsilon is adjusted each step
# Track positions of interest in the grid
POSITIONS = {
    'START': (2, 0),
    'WIN': (0, 3),
    'LOSE': (1, 3)
}


# Helper Functions   
def display_q(Q):
    print('\nQ Table:')
    pprint.pprint(Q)

    
def report_table(table, format_str='8.3'):
    for k in sorted(table.keys()):
        print(f"{str(k):>6}", end='->  ')
        preffered_action = max(table[k], key=table[k].get)
        for j in sorted(table[k].keys()):
            #bold = ''
            value = table[k][j]
            if j == preffered_action:
                print(f"{j}: {value:>{format_str}}", end=', ')
        print("")

        
def initialize_table(grid, initial_value=0):
    '''initialize Q(s,a) and returns'''
    table = {}
    states = grid.non_terminal_states()
    for s in states:
        table[s] = {}
        for a in ALL_POSSIBLE_ACTIONS:
            table[s][a] = initial_value
    return table


def epsilon_action(Q, state, epsilon=START_EPSILON):
    '''epsilon greedy action selection'''
    status = f'State: {state}'
    if np.random.random() < (1 - epsilon):
        status += ". Following policy..."
        action = max(Q[state], key=Q[state].get)
    else:
        status += ". Following random policy..."
        action = np.random.choice(ALL_POSSIBLE_ACTIONS)
    
    status += f" Taking action '{action}'..."
    #print(f"\nStatus: '{status}'")
    return action


def update_q(Q, prev_state, action, reward, curr_state, update_counts):
    #report_table(Q)
    # Learning Rate
    alpha = START_ALPHA / (1.0 + update_counts[prev_state][action] * ALPHA_TAPER)
    update_counts[prev_state][action] += 1
    
    # Update Q
    Q[prev_state][action] += alpha * (reward + GAMMA * Q[curr_state][max(Q[curr_state])] - Q[prev_state][action])
    return alpha
    
    

# Main Program
def main(num_episodes=1000, episode_window=1000):
    
    def print_episode_status():        
        if curr_state in Q:
            max_q = max(Q[curr_state])
        else:
            max_q = "Terminal State"
        
        print(f"Turn: {turn:>5}| "
            f"Previous State: {str(prev_state):>5}| "
            f"Action: {action}| "
            f"Actual Action: {actual_action}| "
            f"\nReward: {reward:>5.3f}| "
            f"Curr State: {str(curr_state):>5}| "
            f"Max Q: {max_q:>15}")
        
    # this grid gives you a reward of -0.1 for every non-terminal state
    # we want to see if this will encourage finding a shorter path to the goal
    grid = standard_grid(obey_prob=0.5, step_cost=-.5)

    # print rewards
    print("Rewards:")
    print_values(grid.rewards, grid)
    print('\n')
    
    # Initialize Tables
    Q = initialize_table(grid, initial_value=0.0)
    update_counts = initialize_table(grid, initial_value=0)
    #print('Initial Q:')
    #report_table(Q)
    #print('Initial Update Counts:')
    #report_table(update_counts, format_str='5')
    
    total_reward = 0
    alpha = 0
    
    for episode in range(num_episodes + 1):
        
        # Taper EPSILON each episode
        epsilon = START_EPSILON / (1.0 + episode * EPSILON_TAPER)
        
        if episode % episode_window == 0 and episode != 0:
            avg_reward = total_reward/episode_window
            print(f"\nEpisode = {episode:>8}| Avg Reward = {avg_reward:>8}|"
                  f" Alpha = {alpha:>8.5f} | Epsilon = {epsilon:>8.5f}")
            print('Q-Table:')
            report_table(Q)
            #print('\nUpdate Counts Table:')
            #report_table(update_counts, format_str='5')
            total_reward = 0
        
        # reset our position to starting position for new episode
        if episode != 0:
            grid.set_state(POSITIONS['START'])
        
        turn = 0
        curr_state = grid.current_state()
        #print(f"\n\nStarting episode {episode} with current state at {curr_state}")
        
        while not grid.game_over():
            prev_state = curr_state
            action = epsilon_action(Q, curr_state, epsilon=epsilon)
            actual_action, reward = grid.move(action)
            total_reward += reward
            curr_state = grid.current_state()

            #print_episode_status()
            
            # check if we reached end points
            if grid.is_terminal(curr_state):
                if curr_state == POSITIONS['WIN']:
                    win_or_lose = 'WON'
                else:
                    win_or_lose = 'LOST'
                #print(f"\nYOU {win_or_lose} episode {episode} with {turn + 1} turns!")
                break

            alpha = update_q(Q, prev_state, action, reward, curr_state, update_counts)
            turn += 1
    
    return Q, update_counts

In [120]:
Q, update_counts = main(num_episodes=100000, episode_window=500)

Rewards:
---------------------------
-0.50|-0.50|-0.50| 1.00|
---------------------------
-0.50| 0.00|-0.50|-1.00|
---------------------------
-0.50|-0.50|-0.50|-0.50|



Episode =      500| Avg Reward =   -8.469| Alpha =  0.05848 | Epsilon =  0.16667
Q-Table:
(0, 0)->  R:    -4.07, 
(0, 1)->  L:    -3.69, 
(0, 2)->  R:    -3.49, 
(1, 0)->  R:    -4.33, 
(1, 2)->  U:    -3.47, 
(2, 0)->  R:    -4.34, 
(2, 1)->  R:    -4.05, 
(2, 2)->  R:    -3.78, 
(2, 3)->  L:    -3.66, 

Episode =     1000| Avg Reward =   -7.414| Alpha =  0.01675 | Epsilon =  0.09091
Q-Table:
(0, 0)->  R:    -5.13, 
(0, 1)->  R:    -4.93, 
(0, 2)->  U:    -4.75, 
(1, 0)->  L:    -5.29, 
(1, 2)->  U:    -4.78, 
(2, 0)->  R:     -5.3, 
(2, 1)->  R:    -5.14, 
(2, 2)->  U:    -4.96, 
(2, 3)->  U:    -4.93, 

Episode =     1500| Avg Reward =   -6.944| Alpha =  0.02625 | Epsilon =  0.06250
Q-Table:
(0, 0)->  R:    -5.73, 
(0, 1)->  R:    -5.56, 
(0, 2)->  L:    -5.44, 
(1, 0)->  U:    -5.82, 
(1, 2)->  U:    -5.46, 
(2, 0


Episode =    14000| Avg Reward =   -7.494| Alpha =  0.00374 | Epsilon =  0.00709
Q-Table:
(0, 0)->  D:    -8.42, 
(0, 1)->  R:    -8.42, 
(0, 2)->  L:    -8.41, 
(1, 0)->  U:    -8.42, 
(1, 2)->  R:    -8.41, 
(2, 0)->  D:    -8.42, 
(2, 1)->  R:    -8.42, 
(2, 2)->  R:    -8.41, 
(2, 3)->  R:     -8.4, 

Episode =    14500| Avg Reward =   -7.413| Alpha =  0.00274 | Epsilon =  0.00685
Q-Table:
(0, 0)->  R:    -8.45, 
(0, 1)->  D:    -8.44, 
(0, 2)->  L:    -8.43, 
(1, 0)->  U:    -8.45, 
(1, 2)->  R:    -8.43, 
(2, 0)->  R:    -8.45, 
(2, 1)->  R:    -8.44, 
(2, 2)->  D:    -8.44, 
(2, 3)->  U:    -8.43, 

Episode =    15000| Avg Reward =   -7.251| Alpha =  0.00081 | Epsilon =  0.00662
Q-Table:
(0, 0)->  D:    -8.47, 
(0, 1)->  D:    -8.47, 
(0, 2)->  R:    -8.46, 
(1, 0)->  D:    -8.47, 
(1, 2)->  U:    -8.46, 
(2, 0)->  R:    -8.47, 
(2, 1)->  R:    -8.47, 
(2, 2)->  R:    -8.46, 
(2, 3)->  R:    -8.45, 

Episode =    15500| Avg Reward =   -7.353| Alpha =  0.00078 | Epsilon =  0.006


Episode =    28000| Avg Reward =   -7.411| Alpha =  0.00159 | Epsilon =  0.00356
Q-Table:
(0, 0)->  R:    -8.87, 
(0, 1)->  R:    -8.87, 
(0, 2)->  R:    -8.87, 
(1, 0)->  D:    -8.88, 
(1, 2)->  R:    -8.87, 
(2, 0)->  U:    -8.87, 
(2, 1)->  U:    -8.87, 
(2, 2)->  R:    -8.87, 
(2, 3)->  R:    -8.87, 

Episode =    28500| Avg Reward =   -7.473| Alpha =  0.00069 | Epsilon =  0.00350
Q-Table:
(0, 0)->  U:    -8.88, 
(0, 1)->  R:    -8.88, 
(0, 2)->  R:    -8.88, 
(1, 0)->  U:    -8.88, 
(1, 2)->  U:    -8.88, 
(2, 0)->  U:    -8.88, 
(2, 1)->  R:    -8.88, 
(2, 2)->  R:    -8.88, 
(2, 3)->  U:    -8.88, 

Episode =    29000| Avg Reward =   -7.279| Alpha =  0.00067 | Epsilon =  0.00344
Q-Table:
(0, 0)->  U:    -8.89, 
(0, 1)->  R:    -8.89, 
(0, 2)->  D:    -8.89, 
(1, 0)->  U:    -8.89, 
(1, 2)->  R:    -8.89, 
(2, 0)->  R:    -8.89, 
(2, 1)->  R:    -8.89, 
(2, 2)->  R:    -8.89, 
(2, 3)->  U:    -8.89, 

Episode =    29500| Avg Reward =   -7.797| Alpha =  0.00134 | Epsilon =  0.003


Episode =    42000| Avg Reward =   -7.306| Alpha =  0.00127 | Epsilon =  0.00238
Q-Table:
(0, 0)->  D:    -9.08, 
(0, 1)->  R:    -9.08, 
(0, 2)->  R:    -9.08, 
(1, 0)->  U:    -9.08, 
(1, 2)->  L:    -9.08, 
(2, 0)->  R:    -9.08, 
(2, 1)->  R:    -9.08, 
(2, 2)->  R:    -9.08, 
(2, 3)->  R:    -9.08, 

Episode =    42500| Avg Reward =   -7.248| Alpha =  0.00028 | Epsilon =  0.00235
Q-Table:
(0, 0)->  L:    -9.08, 
(0, 1)->  R:    -9.08, 
(0, 2)->  R:    -9.08, 
(1, 0)->  D:    -9.08, 
(1, 2)->  U:    -9.08, 
(2, 0)->  R:    -9.08, 
(2, 1)->  U:    -9.08, 
(2, 2)->  R:    -9.08, 
(2, 3)->  R:    -9.08, 

Episode =    43000| Avg Reward =    -7.41| Alpha =  0.00026 | Epsilon =  0.00232
Q-Table:
(0, 0)->  R:    -9.09, 
(0, 1)->  D:    -9.09, 
(0, 2)->  R:    -9.09, 
(1, 0)->  L:    -9.09, 
(1, 2)->  R:    -9.09, 
(2, 0)->  D:    -9.09, 
(2, 1)->  R:    -9.09, 
(2, 2)->  U:    -9.09, 
(2, 3)->  R:    -9.09, 

Episode =    43500| Avg Reward =   -7.385| Alpha =  0.00091 | Epsilon =  0.002


Episode =    56000| Avg Reward =   -7.726| Alpha =  0.00020 | Epsilon =  0.00178
Q-Table:
(0, 0)->  U:     -9.2, 
(0, 1)->  D:     -9.2, 
(0, 2)->  U:     -9.2, 
(1, 0)->  L:     -9.2, 
(1, 2)->  L:     -9.2, 
(2, 0)->  R:     -9.2, 
(2, 1)->  R:     -9.2, 
(2, 2)->  U:     -9.2, 
(2, 3)->  U:     -9.2, 

Episode =    56500| Avg Reward =   -7.454| Alpha =  0.00075 | Epsilon =  0.00177
Q-Table:
(0, 0)->  U:     -9.2, 
(0, 1)->  R:     -9.2, 
(0, 2)->  U:     -9.2, 
(1, 0)->  L:     -9.2, 
(1, 2)->  L:     -9.2, 
(2, 0)->  D:     -9.2, 
(2, 1)->  D:     -9.2, 
(2, 2)->  R:     -9.2, 
(2, 3)->  U:     -9.2, 

Episode =    57000| Avg Reward =   -7.539| Alpha =  0.00090 | Epsilon =  0.00175
Q-Table:
(0, 0)->  R:    -9.21, 
(0, 1)->  U:    -9.21, 
(0, 2)->  U:    -9.21, 
(1, 0)->  U:    -9.21, 
(1, 2)->  L:    -9.21, 
(2, 0)->  D:    -9.21, 
(2, 1)->  R:    -9.21, 
(2, 2)->  U:    -9.21, 
(2, 3)->  U:    -9.21, 

Episode =    57500| Avg Reward =   -7.471| Alpha =  0.00021 | Epsilon =  0.001


Episode =    70000| Avg Reward =   -7.314| Alpha =  0.00057 | Epsilon =  0.00143
Q-Table:
(0, 0)->  U:    -9.28, 
(0, 1)->  R:    -9.28, 
(0, 2)->  U:    -9.28, 
(1, 0)->  U:    -9.28, 
(1, 2)->  D:    -9.28, 
(2, 0)->  R:    -9.28, 
(2, 1)->  R:    -9.28, 
(2, 2)->  D:    -9.28, 
(2, 3)->  U:    -9.28, 

Episode =    70500| Avg Reward =   -7.606| Alpha =  0.00017 | Epsilon =  0.00142
Q-Table:
(0, 0)->  D:    -9.29, 
(0, 1)->  D:    -9.29, 
(0, 2)->  U:    -9.29, 
(1, 0)->  U:    -9.29, 
(1, 2)->  U:    -9.29, 
(2, 0)->  U:    -9.29, 
(2, 1)->  U:    -9.29, 
(2, 2)->  R:    -9.29, 
(2, 3)->  L:    -9.28, 

Episode =    71000| Avg Reward =   -7.747| Alpha =  0.00017 | Epsilon =  0.00141
Q-Table:
(0, 0)->  U:    -9.29, 
(0, 1)->  R:    -9.29, 
(0, 2)->  R:    -9.29, 
(1, 0)->  L:    -9.29, 
(1, 2)->  R:    -9.29, 
(2, 0)->  L:    -9.29, 
(2, 1)->  R:    -9.29, 
(2, 2)->  R:    -9.29, 
(2, 3)->  L:    -9.29, 

Episode =    71500| Avg Reward =   -7.124| Alpha =  0.00070 | Epsilon =  0.001


Episode =    84000| Avg Reward =   -7.294| Alpha =  0.00042 | Epsilon =  0.00119
Q-Table:
(0, 0)->  D:    -9.35, 
(0, 1)->  U:    -9.35, 
(0, 2)->  U:    -9.35, 
(1, 0)->  D:    -9.35, 
(1, 2)->  R:    -9.35, 
(2, 0)->  L:    -9.35, 
(2, 1)->  D:    -9.35, 
(2, 2)->  L:    -9.35, 
(2, 3)->  D:    -9.35, 

Episode =    84500| Avg Reward =   -7.504| Alpha =  0.00042 | Epsilon =  0.00118
Q-Table:
(0, 0)->  R:    -9.35, 
(0, 1)->  U:    -9.35, 
(0, 2)->  R:    -9.35, 
(1, 0)->  U:    -9.35, 
(1, 2)->  R:    -9.35, 
(2, 0)->  R:    -9.35, 
(2, 1)->  L:    -9.35, 
(2, 2)->  U:    -9.35, 
(2, 3)->  R:    -9.35, 

Episode =    85000| Avg Reward =   -7.554| Alpha =  0.00063 | Epsilon =  0.00118
Q-Table:
(0, 0)->  U:    -9.35, 
(0, 1)->  L:    -9.35, 
(0, 2)->  U:    -9.35, 
(1, 0)->  U:    -9.35, 
(1, 2)->  R:    -9.35, 
(2, 0)->  R:    -9.35, 
(2, 1)->  R:    -9.35, 
(2, 2)->  R:    -9.35, 
(2, 3)->  R:    -9.35, 

Episode =    85500| Avg Reward =   -7.191| Alpha =  0.00014 | Epsilon =  0.001


Episode =    98000| Avg Reward =   -7.266| Alpha =  0.00011 | Epsilon =  0.00102
Q-Table:
(0, 0)->  U:    -9.39, 
(0, 1)->  L:    -9.39, 
(0, 2)->  R:    -9.39, 
(1, 0)->  U:    -9.39, 
(1, 2)->  U:    -9.39, 
(2, 0)->  U:    -9.39, 
(2, 1)->  R:    -9.39, 
(2, 2)->  U:    -9.39, 
(2, 3)->  U:    -9.39, 

Episode =    98500| Avg Reward =   -7.555| Alpha =  0.00051 | Epsilon =  0.00101
Q-Table:
(0, 0)->  R:     -9.4, 
(0, 1)->  R:     -9.4, 
(0, 2)->  R:     -9.4, 
(1, 0)->  U:     -9.4, 
(1, 2)->  R:     -9.4, 
(2, 0)->  U:     -9.4, 
(2, 1)->  R:     -9.4, 
(2, 2)->  U:     -9.4, 
(2, 3)->  U:     -9.4, 

Episode =    99000| Avg Reward =   -7.398| Alpha =  0.00054 | Epsilon =  0.00101
Q-Table:
(0, 0)->  U:     -9.4, 
(0, 1)->  R:     -9.4, 
(0, 2)->  R:     -9.4, 
(1, 0)->  U:     -9.4, 
(1, 2)->  R:     -9.4, 
(2, 0)->  L:     -9.4, 
(2, 1)->  R:     -9.4, 
(2, 2)->  L:     -9.4, 
(2, 3)->  R:     -9.4, 

Episode =    99500| Avg Reward =   -7.198| Alpha =  0.00012 | Epsilon =  0.001

In [113]:
report_table(update_counts, format_str='8')

(0, 0)->  D:    22570, L:    20571, R:    64532, U:    54871, 
(0, 1)->  D:    22188, L:    13378, R:    82928, U:    23654, 
(0, 2)->  D:    16784, L:    15462, R:    24854, U:    18240, 
(1, 0)->  D:    25504, L:    38829, R:    42390, U:    91847, 
(1, 2)->  D:    13080, L:    20014, R:    15222, U:    27896, 
(2, 0)->  D:    46420, L:    29773, R:   127881, U:    54397, 
(2, 1)->  D:    28013, L:    15597, R:   161136, U:    31067, 
(2, 2)->  D:    21192, L:    18696, R:    85689, U:    55832, 
(2, 3)->  D:    23104, L:    18517, R:    25326, U:    20043, 


In [112]:
report_table(Q)

(0, 0)->  D:     -9.4, L:     -9.4, R:     -9.4, U:     -9.4, 
(0, 1)->  D:     -9.4, L:     -9.4, R:     -9.4, U:     -9.4, 
(0, 2)->  D:     -9.4, L:     -9.4, R:     -9.4, U:     -9.4, 
(1, 0)->  D:     -9.4, L:     -9.4, R:     -9.4, U:     -9.4, 
(1, 2)->  D:     -9.4, L:     -9.4, R:     -9.4, U:     -9.4, 
(2, 0)->  D:     -9.4, L:     -9.4, R:     -9.4, U:     -9.4, 
(2, 1)->  D:     -9.4, L:     -9.4, R:     -9.4, U:     -9.4, 
(2, 2)->  D:     -9.4, L:     -9.4, R:     -9.4, U:     -9.4, 
(2, 3)->  D:     -9.4, L:     -9.4, R:     -9.4, U:     -9.4, 


In [114]:
Q

{(0, 0): {'U': -9.399246654048408,
  'D': -9.399259835522214,
  'L': -9.39925962150384,
  'R': -9.399246667065315},
 (0, 1): {'U': -9.39922523120348,
  'D': -9.399225568925518,
  'L': -9.399227920202033,
  'R': -9.39921948607757},
 (0, 2): {'U': -9.399544760898204,
  'D': -9.399553176988693,
  'L': -9.399543613272478,
  'R': -9.399535990100707},
 (1, 0): {'U': -9.399518248341968,
  'D': -9.399523710770659,
  'L': -9.399520852480157,
  'R': -9.399524427386593},
 (1, 2): {'U': -9.399634308136514,
  'D': -9.39964332306894,
  'L': -9.39964342231197,
  'R': -9.399640490555283},
 (2, 0): {'U': -9.400334448781827,
  'D': -9.400334763151823,
  'L': -9.400334691704431,
  'R': -9.400330079140646},
 (2, 1): {'U': -9.400203157289507,
  'D': -9.400207092943477,
  'L': -9.400217546521564,
  'R': -9.400201105029392},
 (2, 2): {'U': -9.400001978708698,
  'D': -9.400009498741797,
  'L': -9.400012825297729,
  'R': -9.400004185021787},
 (2, 3): {'U': -9.3998998930095,
  'D': -9.39989554911265,
  'L': -9.

---

# Exercise 3 - Experimenting

In [3]:
# Global variables
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')
GAMMA = 0.95 # Discount factor from Bellman Equation
START_ALPHA = 0.1 # Learning rate, how much we update our Q table each step
ALPHA_TAPER = 0.01 # How much our adaptive learning rate is adjusted each update
START_EPSILON = 1.0 # Probability of random action
EPSILON_TAPER = 0.01 # How much epsilon is adjusted each step
# Track positions of interest in the grid
POSITIONS = {
    'START': (2, 0),
    'WIN': (0, 3),
    'LOSE': (1, 3)
}


# Helper Functions   
def display_q(Q):
    print('\nQ Table:')
    pprint.pprint(Q)

    
def report_table(table, format_str='8.3'):
    for k in sorted(table.keys()):
        print(f"{str(k):>6}", end='->  ')
        preffered_action = max(table[k], key=table[k].get)
        for j in sorted(table[k].keys()):
            #bold = ''
            value = table[k][j]
            if j == preffered_action:
                print(f"{j}: {value:>{format_str}}", end=', ')
        print("")

        
def initialize_table(grid, initial_value=0):
    '''initialize Q(s,a) and returns'''
    table = {}
    states = grid.non_terminal_states()
    for s in states:
        table[s] = {}
        for a in ALL_POSSIBLE_ACTIONS:
            table[s][a] = initial_value
    return table


def epsilon_action(Q, state, epsilon=START_EPSILON):
    '''epsilon greedy action selection'''
    status = f'State: {state}'
    if np.random.random() < (1 - epsilon):
        status += ". Following policy..."
        action = max(Q[state], key=Q[state].get)
    else:
        status += ". Following random policy..."
        action = np.random.choice(ALL_POSSIBLE_ACTIONS)
    
    status += f" Taking action '{action}'..."
    #print(f"\nStatus: '{status}'")
    return action


def update_q(Q, prev_state, action, reward, curr_state, update_counts):
    #report_table(Q)
    # Learning Rate
    alpha = START_ALPHA / (1.0 + update_counts[prev_state][action] * ALPHA_TAPER)
    update_counts[prev_state][action] += 1
    
    # Update Q
    Q[prev_state][action] += alpha * (reward + GAMMA * Q[curr_state][max(Q[curr_state])] - Q[prev_state][action])
    return alpha
    
    

# Main Program
def main(num_episodes=1000, episode_window=1000):
    
    def print_episode_status():        
        if curr_state in Q:
            max_q = max(Q[curr_state])
        else:
            max_q = "Terminal State"
        
        print(f"Turn: {turn:>5}| "
            f"Previous State: {str(prev_state):>5}| "
            f"Action: {action}| "
            f"Actual Action: {actual_action}| "
            f"\nReward: {reward:>5.3f}| "
            f"Curr State: {str(curr_state):>5}| "
            f"Max Q: {max_q:>15}")
        
    # this grid gives you a reward of -0.1 for every non-terminal state
    # we want to see if this will encourage finding a shorter path to the goal
    grid = standard_grid(obey_prob=1.0, step_cost=-.5)

    # print rewards
    print("Rewards:")
    print_values(grid.rewards, grid)
    print('\n')
    
    # Initialize Tables
    Q = initialize_table(grid, initial_value=0.0)
    update_counts = initialize_table(grid, initial_value=0)
    #print('Initial Q:')
    #report_table(Q)
    #print('Initial Update Counts:')
    #report_table(update_counts, format_str='5')
    
    total_reward = 0
    alpha = 0
    
    for episode in range(num_episodes + 1):
        
        # Taper EPSILON each episode
        epsilon = START_EPSILON / min(1.0 + episode * EPSILON_TAPER, 10)
        
        if episode % episode_window == 0 and episode != 0:
            avg_reward = total_reward/episode_window
            print(f"\nEpisode = {episode:>8}| Avg Reward = {avg_reward:>8}|"
                  f" Alpha = {alpha:>8.5f} | Epsilon = {epsilon:>8.5f}")
            print('Q-Table:')
            report_table(Q)
            #print('\nUpdate Counts Table:')
            #report_table(update_counts, format_str='5')
            total_reward = 0
        
        # reset our position to starting position for new episode
        if episode != 0:
            grid.set_state(POSITIONS['START'])
        
        turn = 0
        curr_state = grid.current_state()
        #print(f"\n\nStarting episode {episode} with current state at {curr_state}")
        
        while not grid.game_over():
            prev_state = curr_state
            action = epsilon_action(Q, curr_state, epsilon=epsilon)
            actual_action, reward = grid.move(action)
            total_reward += reward
            curr_state = grid.current_state()

            #print_episode_status()
            
            # check if we reached end points
            if grid.is_terminal(curr_state):
                if curr_state == POSITIONS['WIN']:
                    win_or_lose = 'WON'
                else:
                    win_or_lose = 'LOST'
                #print(f"\nYOU {win_or_lose} episode {episode} with {turn + 1} turns!")
                break

            alpha = update_q(Q, prev_state, action, reward, curr_state, update_counts)
            turn += 1
    
    return Q, update_counts

In [4]:
Q, update_counts = main(num_episodes=1000000, episode_window=5000)

Rewards:
---------------------------
-0.50|-0.50|-0.50| 1.00|
---------------------------
-0.50| 0.00|-0.50|-1.00|
---------------------------
-0.50|-0.50|-0.50|-0.50|



Episode =     5000| Avg Reward =  -2.2336| Alpha =  0.00342 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -3.26, 
(0, 1)->  R:    -2.64, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -3.51, 
(1, 2)->  R:      0.0, 
(2, 0)->  U:    -3.74, 
(2, 1)->  R:    -1.69, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =    10000| Avg Reward =  -2.1189| Alpha =  0.00221 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -4.24, 
(0, 1)->  R:     -3.7, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -4.39, 
(1, 2)->  R:      0.0, 
(2, 0)->  R:    -4.52, 
(2, 1)->  R:    -2.13, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =    15000| Avg Reward =  -2.0287| Alpha =  0.00144 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -4.81, 
(0, 1)->  R:     -4.4, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -4.96, 
(1, 2)->  R:      0.0, 
(2, 0


Episode =   140000| Avg Reward =   -1.986| Alpha =  0.00014 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -7.83, 
(0, 1)->  R:    -7.72, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -7.87, 
(1, 2)->  R:      0.0, 
(2, 0)->  U:    -7.89, 
(2, 1)->  R:    -7.53, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   145000| Avg Reward =   -2.001| Alpha =  0.00014 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -7.86, 
(0, 1)->  R:    -7.76, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:     -7.9, 
(1, 2)->  R:      0.0, 
(2, 0)->  R:    -7.92, 
(2, 1)->  R:    -7.58, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   150000| Avg Reward =  -1.9891| Alpha =  0.00013 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -7.89, 
(0, 1)->  R:    -7.79, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -7.94, 
(1, 2)->  R:      0.0, 
(2, 0)->  R:    -7.95, 
(2, 1)->  R:    -7.62, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   155000| Avg Reward =  -1.9806| Alpha =  0.00013 | Epsilon =  0.100


Episode =   275000| Avg Reward =  -1.9878| Alpha =  0.00007 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:     -8.4, 
(0, 1)->  R:    -8.37, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -8.43, 
(1, 2)->  R:      0.0, 
(2, 0)->  R:    -8.44, 
(2, 1)->  R:     -8.3, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   280000| Avg Reward =  -1.9866| Alpha =  0.00007 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -8.42, 
(0, 1)->  R:    -8.38, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -8.45, 
(1, 2)->  R:      0.0, 
(2, 0)->  R:    -8.45, 
(2, 1)->  R:    -8.32, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   285000| Avg Reward =  -1.9897| Alpha =  0.00007 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -8.43, 
(0, 1)->  R:    -8.39, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -8.46, 
(1, 2)->  R:      0.0, 
(2, 0)->  U:    -8.47, 
(2, 1)->  R:    -8.33, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   290000| Avg Reward =  -2.0005| Alpha =  0.00007 | Epsilon =  0.100


Episode =   410000| Avg Reward =  -1.9779| Alpha =  0.00005 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -8.69, 
(0, 1)->  R:    -8.65, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:     -8.7, 
(1, 2)->  R:      0.0, 
(2, 0)->  U:    -8.71, 
(2, 1)->  R:     -8.6, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   415000| Avg Reward =   -1.987| Alpha =  0.00005 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:     -8.7, 
(0, 1)->  R:    -8.66, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -8.71, 
(1, 2)->  R:      0.0, 
(2, 0)->  R:    -8.71, 
(2, 1)->  R:    -8.61, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   420000| Avg Reward =  -1.9795| Alpha =  0.00005 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -8.71, 
(0, 1)->  R:    -8.67, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -8.72, 
(1, 2)->  R:      0.0, 
(2, 0)->  R:    -8.72, 
(2, 1)->  R:    -8.62, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   425000| Avg Reward =  -1.9808| Alpha =  0.00005 | Epsilon =  0.100


Episode =   545000| Avg Reward =  -1.9753| Alpha =  0.00004 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -8.87, 
(0, 1)->  R:    -8.84, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -8.87, 
(1, 2)->  R:      0.0, 
(2, 0)->  U:    -8.87, 
(2, 1)->  R:    -8.79, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   550000| Avg Reward =  -1.9774| Alpha =  0.00136 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -8.87, 
(0, 1)->  R:    -8.84, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -8.87, 
(1, 2)->  R:      0.0, 
(2, 0)->  R:    -8.88, 
(2, 1)->  R:     -8.8, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   555000| Avg Reward =  -1.9744| Alpha =  0.00004 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -8.88, 
(0, 1)->  R:    -8.85, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -8.88, 
(1, 2)->  R:      0.0, 
(2, 0)->  R:    -8.88, 
(2, 1)->  R:     -8.8, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   560000| Avg Reward =  -1.9833| Alpha =  0.00004 | Epsilon =  0.100


Episode =   680000| Avg Reward =  -1.9918| Alpha =  0.00003 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -8.98, 
(0, 1)->  R:    -8.96, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -8.99, 
(1, 2)->  R:      0.0, 
(2, 0)->  U:    -8.99, 
(2, 1)->  R:    -8.93, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   685000| Avg Reward =   -1.989| Alpha =  0.00003 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -8.99, 
(0, 1)->  R:    -8.96, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -8.99, 
(1, 2)->  R:      0.0, 
(2, 0)->  R:    -8.99, 
(2, 1)->  R:    -8.93, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   690000| Avg Reward =  -1.9839| Alpha =  0.00003 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -8.99, 
(0, 1)->  R:    -8.96, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -8.99, 
(1, 2)->  R:      0.0, 
(2, 0)->  R:    -8.99, 
(2, 1)->  R:    -8.93, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   695000| Avg Reward =  -1.9918| Alpha =  0.00003 | Epsilon =  0.100


Episode =   815000| Avg Reward =   -1.993| Alpha =  0.00002 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -9.07, 
(0, 1)->  R:    -9.05, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -9.07, 
(1, 2)->  R:      0.0, 
(2, 0)->  U:    -9.07, 
(2, 1)->  R:    -9.02, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   820000| Avg Reward =  -2.0006| Alpha =  0.00002 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -9.07, 
(0, 1)->  R:    -9.05, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -9.08, 
(1, 2)->  R:      0.0, 
(2, 0)->  U:    -9.08, 
(2, 1)->  R:    -9.03, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   825000| Avg Reward =  -1.9741| Alpha =  0.00002 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -9.08, 
(0, 1)->  R:    -9.05, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -9.08, 
(1, 2)->  R:      0.0, 
(2, 0)->  R:    -9.08, 
(2, 1)->  R:    -9.03, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   830000| Avg Reward =  -1.9898| Alpha =  0.00002 | Epsilon =  0.100


Episode =   950000| Avg Reward =  -1.9948| Alpha =  0.00002 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -9.14, 
(0, 1)->  R:    -9.12, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -9.14, 
(1, 2)->  R:      0.0, 
(2, 0)->  U:    -9.14, 
(2, 1)->  R:     -9.1, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   955000| Avg Reward =  -1.9925| Alpha =  0.00002 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -9.14, 
(0, 1)->  R:    -9.12, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -9.14, 
(1, 2)->  R:      0.0, 
(2, 0)->  R:    -9.14, 
(2, 1)->  R:     -9.1, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   960000| Avg Reward =  -1.9859| Alpha =  0.00002 | Epsilon =  0.10000
Q-Table:
(0, 0)->  R:    -9.14, 
(0, 1)->  R:    -9.12, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -9.15, 
(1, 2)->  R:      0.0, 
(2, 0)->  U:    -9.15, 
(2, 1)->  R:    -9.11, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 

Episode =   965000| Avg Reward =  -1.9788| Alpha =  0.00002 | Epsilon =  0.100

In [6]:
report_table(Q)

(0, 0)->  R:    -9.16, 
(0, 1)->  R:    -9.14, 
(0, 2)->  R:      0.0, 
(1, 0)->  U:    -9.16, 
(1, 2)->  R:      0.0, 
(2, 0)->  U:    -9.16, 
(2, 1)->  R:    -9.12, 
(2, 2)->  R:     -0.5, 
(2, 3)->  U:      0.0, 


In [7]:
pprint.pprint(Q)

{(0, 0): {'D': -9.164222053816427,
          'L': -9.162154972492788,
          'R': -9.161797706160222,
          'U': -9.162235320294263},
 (0, 1): {'D': -9.161562331418832,
          'L': -9.163340299404945,
          'R': -9.139950629545272,
          'U': -9.160631173914922},
 (0, 2): {'D': -9.126505437561667,
          'L': -9.160365168260515,
          'R': 0.0,
          'U': -9.139062541764321},
 (1, 0): {'D': -9.163189206262233,
          'L': -9.163133143741844,
          'R': -9.163210712729088,
          'U': -9.163125324520456},
 (1, 2): {'D': -9.116929337018693,
          'L': -9.118179815290436,
          'R': 0.0,
          'U': -9.127367098535933},
 (2, 0): {'D': -9.16370693215056,
          'L': -9.163723207555348,
          'R': -9.163584309540653,
          'U': -9.16358380121687},
 (2, 1): {'D': -9.163706104814874,
          'L': -9.16269656419527,
          'R': -9.123906496479773,
          'U': -9.163388167334826},
 (2, 2): {'D': -9.123506215950433,
          '

---

# Report

### Introduction

For this assignment, we had to apply Q-Learning to the grid world that we were introduced to earlier in the course. I worked with [Andy][redpanda-github] for this assignment.

We did three variations of Q-Learning for this assignment:
    1. Vanilla Q-Learning
    2. Q-Learning with Adaptive Learning Rate
    3. Q-Learning with Adaptive Learning Rate and Adaptive Epsilon


### Findings

With the vanilla Q-Learning implementation, we observed that the average reward took a dive in the middle of training and never recovered. In terms of values, it went from ~-12 to ~-7 back down to ~-12 and remained there for the rest of training. We reran this a few times and observed the same behavior repeatedly.

From the lectures, we learnt about 2 methods to improving the performance of our Q-Learning algorithm. These were Adaptive Learning Rate and Adaptive Epsilon. Adaptive Learning Rate allows us to achieve both speed and accuracy.

The second variation involved adding Adaptive Learning Rate. We no longer observed the dive of average reward in the middle of training, which was a step in the right direction. The average reward increased to ~-7 and remained there for the rest of training.

The third variation involved adding Adaptive Epsilon along with Adaptive Learning Rate. Again, the dive of average reward in the middle of training was no longer there. The average reward increased to ~-7 and remained there for the rest of training. We re-ran this with various number of episodes (1,000 - 1,000,000). We observed that the average reward reached ~-7 at the 500 episode mark and never improved afterwards. Investigating this led us to find that our Q-table values all converged to ~-9.4.


### Potential Explanations

One potential explanation to our results is that our algorithm settled on a bad policy and continued to remain there thanks to the low learning rate and low epsilon values. The low learning rate ensured very minimal changes would occur to our Q-Table, while the low epsilon values ensured we would continue to follow the existing policy and never explore other avenues.

A potential solution to this could be:
- Cap $t$ in the Adaptive Epsilon equation to a lower number. This would ensure Epsilon wouldn't get too small and there is still an opportunity to go off policy and explore.
- Our move functionality had a 50% of following the intended action with the other 50% being it picking a random perpendicular action (mostly wrong). Increasing this to 90%/100% led to a much improved Q-Function because we were now properly updating the correct square within our Q-Table.

[redpanda-github]: https://github.com/redpanda-ai

---

# Project Extension Ideas
- Add code to run through and visualize the game
- Add checkpoints to rerun the game
- Plot average reward over episodes/time
- Pick the same seed to ensure consistency in the training inputs - plot all 3 approaches
- Plot turns per episode - how long it took our agent to win or lose!