# Random-sample one-step tabular Q-planning

An implementation of random-sample one-step tabular Q-planning on a gridworld.

More information about random-sample one-step tabular Q-planning can be found in Section 8.1 of "Reinforcement Learning: An Introduction", 2nd edition, by Sutton and Barto.
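
For reference, each planning step applies the one-step tabular Q-learning update to a transition $(S, A, R, S')$ that is sampled from the model rather than from the real environment. Here $\alpha$ is the step size (`ALPHA` in this notebook) and $\gamma$ is the discount factor (`GAMMA`):

$$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \right]$$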


The gridworld has shape (3, 4), with a winning state "w" at (0, 3), a losing state "l" at (1, 3), an invalid (wall) state "x" at (1, 1), and a start state "s" at (2, 0).

|  |  |  |  |
|---|---|---|---|
|  |  |  | w |
|  | x |  | l |
| s |  |  |  |

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

import grid_world

### Discount factor and step size

In [2]:

GAMMA = 0.9
ALPHA = 0.1

### Auxiliary function to display per-state values (rewards or state values) on the grid

In [3]:
def print_values(V,grid):
    for i in range(grid.width):
        print("--------------------------")
        for j in range(grid.height):
            v = V.get((i,j),0)
            if v >= 0:
                print(" %.2f|" % v, end="")
            else:
                print("%.2f|" % v, end="")
        print("")

### Auxiliary function to display a policy (for a stochastic policy given as a dict, the first action is shown)

In [4]:
def print_policy(P,grid):
    for i in range(grid.width):
        print("---------------------------")
        for j in range(grid.height):
            a = P.get((i,j),' ')
            if isinstance(a,dict):
                a = list(a)[0]
            print("  %s  |" % a, end="")
        print("")

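### From our grid world file, import the standard grid, retrieve all states and actions, and print the grid rewards
The standard grid gives a reward of 0 for every non-terminal move, so the discount factor alone encourages the agent to find a shortest path to the goal; a negative grid (a penalty per step) is not used here.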

In [5]:
grid = grid_world.Grid.standard_grid()
states = grid.all_states()
actions = list({action for action_tup in grid.actions.values() for action in action_tup})

In [6]:
def argmax_dict(dictionary):
    # returns the argmax key and the max value from a dictionary
    # will be used for policy improvement from Q
    max_key = None
    max_val = float("-inf")
    
    for k,v in dictionary.items():
        if v > max_val:
            max_val = v
            max_key = k
            
    return max_key,max_val
        
argmax_dict({"a":1,"b":2})

('b', 2)

In [7]:
actions

['D', 'R', 'U', 'L']

In [8]:
def epsilon_greedy_action(Q,state,epsilon=0.1):
    # choose an action using epsilon-greedy strategy
    probability = np.random.random()
    result = 0
    
    if probability < epsilon:
        #explore
        result = np.random.choice(actions)
    else: 
        #exploit
        result = argmax_dict(Q[state])[0]
        
    return result
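
To see how `epsilon` switches between exploiting and exploring, here is a minimal standalone sketch. It does not use the notebook's `Q` table or `actions` list: the function `epsilon_greedy`, the value dict `q`, and the seeded `rng` are hypothetical names introduced only for this illustration, and the standard-library `random` module stands in for NumPy.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action from q_values (dict: action -> estimated value)."""
    if rng.random() < epsilon:
        # explore: uniform choice over all actions
        return rng.choice(sorted(q_values))
    # exploit: action with the highest estimate (first one wins ties)
    return max(q_values, key=q_values.get)

rng = random.Random(0)  # seeded only to make the sketch repeatable
q = {"U": 0.2, "D": -0.1, "L": 0.0, "R": 0.7}  # made-up estimates

# epsilon = 0 never explores, so only the greedy action appears
greedy_picks = {epsilon_greedy(q, 0.0, rng) for _ in range(100)}
# epsilon >= 1 always explores, because rng.random() is always below 1
random_picks = {epsilon_greedy(q, 1.0, rng) for _ in range(1000)}

print(greedy_picks)
print(sorted(random_picks))
```

Since the random draw is always strictly below 1, any `epsilon` of 1 or more gives purely random behaviour.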

In [9]:
print("Rewards of grid")
print_values(grid.rewards,grid)

Rewards of grid
--------------------------
 0.00| 0.00| 0.00| 1.00|
--------------------------
 0.00| 0.00| 0.00|-1.00|
--------------------------
 0.00| 0.00| 0.00| 0.00|


### Initialize policy

In [10]:
policy = {(2,0):'U',
         (1,0):'U',
         (0,0):'R',
         (0,1):'R',
         (0,2):'R',
         (1,2):'R',
         (2,1):'R',
         (2,2):'R',
         (2,3):'U'}

print_policy(policy,grid)

---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  U  |  R  |  R  |  U  |


In [11]:
model = dict()
Q = dict()

for state in grid.all_states():
        model[state] = dict()
        Q[state] = dict()
        for action in actions:
            model[state][action] = {"next_state":state,"reward":0.0}
            Q[state][action] = 0.0

def update_model(state,action,new_state,reward):
    model[state][action]["next_state"] = new_state
    model[state][action]["reward"] = reward
    


In [12]:
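def random_sample_one_step_Q_planning_learn_model(grid,episodes):
    """Learn a deterministic model by acting randomly in the real environment.
        return: the global model, holding the last observed transition for each (state, action)
    """
    for i in range(episodes):
        
        s = (2,0)  # every episode starts from the start state
        grid.set_state(s)
        finished = False
        while not finished:
            # any epsilon >= 1 makes the action choice completely random
            a = epsilon_greedy_action(Q,s,epsilon=100.0)
            r = grid.move(a)    
            s1 = grid.current_state()
            
            # record what the environment actually did for (s, a)
            update_model(s,a,s1,r)
            finished = grid.game_over()
                
            s = s1
            
    
    return model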

model = random_sample_one_step_Q_planning_learn_model(grid,5000)

In [13]:

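def random_sample_one_step_Q_planning(grid,model,episodes,gamma=1,alpha=1):
    """Do planning with one-step Q-learning updates on transitions sampled from a given model.
        Updates the global Q table in place; `episodes` is the number of planning updates.
        return: the greedy policy and the Q table for that policy
    """
    policy = dict()
    all_states = list(grid.all_states())
    
    for i in range(episodes):
        # sample a state and an action uniformly at random (epsilon = 1.0 always explores)
        s_position = np.random.choice(len(all_states))
        s = all_states[s_position]
        a = epsilon_greedy_action(Q,s,1.0)
        # ask the model, not the environment, for the outcome
        model_transition = model[s][a]
        next_s = model_transition["next_state"]
        reward = model_transition["reward"]
        
        # one-step tabular Q-learning update on the simulated transition
        Q_max_action = argmax_dict(Q[next_s])[0]
        Q[s][a] = Q[s][a] + alpha*(reward + gamma*Q[next_s][Q_max_action] - Q[s][a])
        
    # the returned policy is greedy with respect to the planned Q values
    for state in all_states:
        best_action = argmax_dict(Q[state])[0]
        policy[state] = best_action
    return policy,Q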

policy,Q = random_sample_one_step_Q_planning(grid,model,5000,GAMMA,ALPHA)
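
The planning loop can also be tried without `grid_world`. The following is a minimal sketch under stated assumptions, not the notebook's implementation: `q_planning`, `toy_model`, and the three-state chain A -> B -> END are hypothetical, the model stores `(next_state, reward)` tuples instead of the dicts used above, and the standard-library `random` module replaces NumPy.

```python
import random

def q_planning(model, n_updates, gamma=0.9, alpha=0.1, seed=0):
    """Random-sample one-step tabular Q-planning on a deterministic model.

    model: dict state -> dict action -> (next_state, reward)
    """
    rng = random.Random(seed)
    states = sorted(model)
    Q = {s: {a: 0.0 for a in model[s]} for s in states}
    for _ in range(n_updates):
        # 1. pick a state and an action uniformly at random
        s = rng.choice(states)
        a = rng.choice(sorted(model[s]))
        # 2. query the model for the simulated outcome
        next_s, reward = model[s][a]
        # 3. apply the one-step Q-learning update
        target = reward + gamma * max(Q[next_s].values())
        Q[s][a] += alpha * (target - Q[s][a])
    return Q

# Hypothetical chain: "go" moves A -> B -> END, reward 1 on entering END.
toy_model = {
    "A": {"stay": ("A", 0.0), "go": ("B", 0.0)},
    "B": {"stay": ("B", 0.0), "go": ("END", 1.0)},
    "END": {"stay": ("END", 0.0), "go": ("END", 0.0)},
}

Q_toy = q_planning(toy_model, n_updates=5000)
greedy = {s: max(Q_toy[s], key=Q_toy[s].get) for s in toy_model}
print(greedy["A"], greedy["B"])
print(round(Q_toy["A"]["go"], 2), round(Q_toy["B"]["go"], 2))
```

With enough updates, the value of `go` in B should approach 1 and the value of `go` in A should approach 0.9, the same discounting pattern as in the gridworld.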

In [14]:
grid.all_states()

{(0, 0),
 (0, 1),
 (0, 2),
 (0, 3),
 (1, 0),
 (1, 2),
 (1, 3),
 (2, 0),
 (2, 1),
 (2, 2),
 (2, 3)}

In [15]:
policy

{(0, 1): 'R',
 (1, 2): 'U',
 (0, 0): 'R',
 (1, 3): 'D',
 (2, 1): 'R',
 (2, 0): 'R',
 (2, 3): 'L',
 (2, 2): 'U',
 (1, 0): 'U',
 (0, 2): 'R',
 (0, 3): 'D'}

In [16]:
list(grid.all_states())

[(0, 1),
 (1, 2),
 (0, 0),
 (1, 3),
 (2, 1),
 (2, 0),
 (2, 3),
 (2, 2),
 (1, 0),
 (0, 2),
 (0, 3)]

In [17]:
np.random.choice(len(grid.all_states()))

0

In [18]:
print("Policy")
print_policy(policy,grid)

Policy
---------------------------
  R  |  R  |  R  |  D  |
---------------------------
  U  |     |  U  |  D  |
---------------------------
  R  |  R  |  U  |  L  |


In [19]:
V = defaultdict(lambda:0)
print("state  |policy action| state value")
for state in policy.keys():
    V[state] = Q[state][policy[state]]
    print(state,"|      ",policy[state] , "    |", V[state] )

state  |policy action| state value
(0, 1) |       R     | 0.89993529085614
(1, 2) |       U     | 0.8998653486064331
(0, 0) |       R     | 0.8092656661172245
(1, 3) |       D     | 0.0
(2, 1) |       R     | 0.7271723917215385
(2, 0) |       R     | 0.6515728354588084
(2, 3) |       L     | 0.7268538426433188
(2, 2) |       U     | 0.8094989076305045
(1, 0) |       U     | 0.7253248821155659
(0, 2) |       R     | 0.9999843157595708
(0, 3) |       D     | 0.0


In [20]:
print_values(grid=grid,V=V)

--------------------------
 0.81| 0.90| 1.00| 0.00|
--------------------------
 0.73| 0.00| 0.90| 0.00|
--------------------------
 0.65| 0.73| 0.81| 0.73|
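
As a sanity check on the values printed above: assuming deterministic moves, a reward of 1 on entering the winning state, and 0 everywhere else, a state that needs `k` moves before the final rewarded move has an optimal value of `GAMMA ** k`. The short sketch below computes these reference values; `expected` is a name introduced here for illustration.

```python
GAMMA = 0.9

# k = number of moves taken before the final, rewarded move into the winning state
expected = {k: round(GAMMA ** k, 2) for k in range(5)}

for k, value in expected.items():
    print(k, value)
```

The printed grid values are close to these reference values. The start state (2, 0) shows 0.65 where the reference value for `k = 4` rounds to 0.66, which suggests the planning updates had not fully converged for the states farthest from the goal.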


## Conclusions
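* Given a model, planning can produce a good policy by using the model to predict the next state and reward for a given state-action pair.
* The policy is learned by applying Q-learning updates to transitions sampled from the model instead of from the real environment.
* The resulting greedy policy heads toward the winning state while avoiding the losing state, and its values are close to the discounted returns (1.00, 0.90, 0.81, 0.73, ...) expected with GAMMA = 0.9. It is similar to the policies found by model-free RL.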