# Dyna-Q (with a given model)

An implementation of Dyna-Q on a gridworld (with a given model).

More information about Dyna ("Dyna: Integrated Planning, Acting, and Learning") can be found in section 8.2 of "Reinforcement Learning: An Introduction", 2nd edition, by Sutton and Barto.
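
Both the direct learning step and the planning step in this notebook apply the same one-step Q-learning update, written here with the step size $\alpha$ and the discount factor $\gamma$ that the code calls `alpha` and `gamma`:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

In the direct step, the transition $(s, a, r, s')$ comes from the real environment; in the planning step it is read back from the model.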


The gridworld has shape (3,4), with a winning state "w" at (0,3), a losing state "l" at (1,3), an invalid (wall) state "x" at (1,1), and a start state "s" at (2,0).

|  |  |  |  |
|---|---|---|---|
|  |  |  | w |
|  | x |  | l |
| s |  |  |  |

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

import grid_world
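
The `grid_world` module itself is not shown in this notebook. For readers who want to run the cells, the following is a minimal stand-in sketch, not the original module. It assumes a wall at (1,1), a start at (2,0) and rewards only on the two terminal states, as the policy and reward printouts suggest, and it assumes that an illegal move leaves the agent in place. The naming `width` = rows and `height` = columns simply mirrors how the display helpers loop.

```python
class Grid:
    """Hypothetical stand-in for grid_world.Grid (the real module is not shown)."""

    MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

    def __init__(self, rows, cols, start, rewards, walls):
        self.width, self.height = rows, cols  # width = rows, as the helpers assume
        self.rewards = rewards                # terminal state -> reward
        self.walls = set(walls)
        self.i, self.j = start
        # Every non-terminal, non-wall cell allows the moves that stay on the board.
        self.actions = {}
        for i in range(rows):
            for j in range(cols):
                if (i, j) in self.walls or (i, j) in rewards:
                    continue
                self.actions[(i, j)] = tuple(
                    a for a, (di, dj) in self.MOVES.items()
                    if self._valid((i + di, j + dj))
                )

    def _valid(self, cell):
        i, j = cell
        return 0 <= i < self.width and 0 <= j < self.height and cell not in self.walls

    @classmethod
    def standard_grid(cls):
        return cls(3, 4, start=(2, 0),
                   rewards={(0, 3): 1.0, (1, 3): -1.0}, walls=[(1, 1)])

    def set_state(self, s):
        self.i, self.j = s

    def current_state(self):
        return (self.i, self.j)

    def all_states(self):
        return set(self.actions) | set(self.rewards)

    def game_over(self):
        # Terminal states have no entry in the actions table.
        return self.current_state() not in self.actions

    def move(self, action):
        # Assumption: an illegal move leaves the agent where it is.
        if action in self.actions.get(self.current_state(), ()):
            di, dj = self.MOVES[action]
            self.i, self.j = self.i + di, self.j + dj
        return self.rewards.get(self.current_state(), 0.0)
```

With this sketch, `Grid.standard_grid()` exposes the same attributes and methods the cells below rely on (`all_states`, `actions`, `rewards`, `set_state`, `current_state`, `move`, `game_over`).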

### Discount factor and step size

In [2]:

GAMMA = 0.9
ALPHA = 0.1
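
To get a feel for what a discount factor of 0.9 does, the sketch below computes the discounted return of a reward sequence. The helper `discounted_return` is a hypothetical name used only for this illustration; it is not part of the notebook.

```python
GAMMA = 0.9

def discounted_return(rewards, gamma=GAMMA):
    """Sum of rewards where the k-th reward (0-based) is weighted by gamma**k."""
    g = 0.0
    # Work backwards so each step is: reward now + gamma * everything after.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

three_step_path = discounted_return([0.0, 0.0, 1.0])  # roughly 0.81
one_step_path = discounted_return([1.0])
```

A reward of 1 collected on the third step is worth about 0.9² = 0.81 from the start, while the same reward collected immediately is worth 1. Discounting alone therefore already favours shorter paths to the winning state.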

### Auxiliary function to display state values on the grid

In [3]:
def print_values(V,grid):
    for i in range(grid.width):
        print("--------------------------")
        for j in range(grid.height):
            v = V.get((i,j),0)
            if v >= 0:
                print(" %.2f|" % v, end="")
            else:
                print("%.2f|" % v, end="")
        print("")

### Auxiliary function to display a policy (for a stochastic policy, only the first listed action is shown)

In [4]:
def print_policy(P,grid):
    for i in range(grid.width):
        print("---------------------------")
        for j in range(grid.height):
            a = P.get((i,j),' ')
            if isinstance(a,dict):
                a = list(a)[0]
            print("  %s  |" % a, end="")
        print("")

### From our grid world file, import the standard grid and retrieve all states and actions
A negative grid (with a penalty on every step) would encourage the agent to find the shortest path to the goal; the standard grid used here rewards only the terminal states, as the reward printout shows.

In [5]:
grid = grid_world.Grid.standard_grid()
states = grid.all_states()
actions = list(set([action   for action_tup in grid.actions.values() for action in action_tup]))

In [6]:
def argmax_dict(dictionary):
    # returns the argmax key and the max value from a dictionary
    # will be used for policy improvement from Q
    max_key = None
    max_val = float("-inf")
    
    for k,v in dictionary.items():
        if v > max_val:
            max_val = v
            max_key = k
            
    return max_key,max_val
        
argmax_dict({"a":1,"b":2})

('b', 2)

In [7]:
actions

['R', 'D', 'U', 'L']

In [8]:
def epsilon_greedy_action(Q,state,epsilon=0.1):
    # choose an action using epsilon-greedy strategy
    probability = np.random.random()
    result = 0
    
    if probability < epsilon:
        #explore
        result = np.random.choice(actions)
    else: 
        #exploit
        result = argmax_dict(Q[state])[0]
        
    return result
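
The following self-contained sketch shows the two extremes of the epsilon-greedy rule on a single row of Q-values. It is an illustration under stated assumptions, not the notebook's function: the name `epsilon_greedy`, the explicit `rng` argument and the sample Q-values are all made up for the example.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a key of q_values: random with probability epsilon, otherwise greedy."""
    if rng.random() < epsilon:
        # explore: uniform choice over the available actions
        return str(rng.choice(list(q_values)))
    # exploit: action with the highest estimated value
    return max(q_values, key=q_values.get)

rng = np.random.default_rng(0)
q_row = {"U": 0.2, "D": -0.1, "L": 0.0, "R": 0.7}

greedy_pick = epsilon_greedy(q_row, 0.0, rng)  # epsilon = 0: always greedy
random_picks = {epsilon_greedy(q_row, 1.0, rng) for _ in range(200)}  # epsilon = 1: always random
```

Because `rng.random()` draws from [0, 1), any `epsilon` of 1.0 or more makes every choice random, so 1.0 is already enough when fully random behaviour is wanted.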

In [9]:
print("Rewards of grid")
print_values(grid.rewards,grid)

Rewards of grid
--------------------------
 0.00| 0.00| 0.00| 1.00|
--------------------------
 0.00| 0.00| 0.00|-1.00|
--------------------------
 0.00| 0.00| 0.00| 0.00|


### Initialize policy

In [10]:
policy = {(2,0):'U',
         (1,0):'U',
         (0,0):'R',
         (0,1):'R',
         (0,2):'R',
         (1,2):'R',
         (2,1):'R',
         (2,2):'R',
         (2,3):'U'}

print_policy(policy,grid)

---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  U  |  R  |  R  |  U  |


In [11]:
model = dict()
Q = dict()

for state in grid.all_states():
        model[state] = dict()
        Q[state] = dict()
        for action in actions:
            model[state][action] = {"next_state":state,"reward":0.0}
            Q[state][action] = 0.0

def update_model(state,action,new_state,reward):
    model[state][action]["next_state"] = new_state
    model[state][action]["reward"] = reward
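
As a small illustration of how such a tabular model behaves, the sketch below builds a two-state version and queries it. The states, the actions and the helper name `record` are invented for the example; only the nested-dictionary layout follows the cell above.

```python
# Illustrative two-state model; the states and actions here are made up.
states = [(0, 0), (0, 1)]
actions = ["L", "R"]

# Unvisited pairs default to a zero-reward self-loop, mirroring the cell above.
model = {s: {a: {"next_state": s, "reward": 0.0} for a in actions} for s in states}

def record(model, state, action, next_state, reward):
    """Overwrite the stored outcome for (state, action) with the latest observation."""
    model[state][action] = {"next_state": next_state, "reward": reward}

record(model, (0, 0), "R", (0, 1), 1.0)

seen = model[(0, 0)]["R"]    # the transition that was just recorded
unseen = model[(0, 0)]["L"]  # still the default self-loop
```

Keeping only the most recent outcome per state-action pair is sufficient as long as the environment is deterministic, which this layout implicitly assumes.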
    


In [12]:
def random_sample_one_step_Q_planning_learn_model(grid,episodes):
    
    for i in range(episodes):
        
        s = (2,0)
        grid.set_state(s)
        finished = False
        while not finished:
            a = epsilon_greedy_action(Q,s,epsilon=1.0) # epsilon = 1.0 makes every action uniformly random
            r = grid.move(a)    
            s1 = grid.current_state()
            
            update_model(s,a,s1,r)
            finished = grid.game_over()
                
            s = s1
            
    
    return model

model = random_sample_one_step_Q_planning_learn_model(grid,5000)

In [13]:

def dyna_Q_given_model(grid,model,episodes,gamma =1,alpha=1,planning_steps=5):
    """Integrated architecture algorithm(planning and learning) with given model
       Returns a policy and Q value
    """
    policy = dict()
    
    for i in range(episodes):
        finished = False
        s = (2,0)
        grid.set_state(s)
        while not finished:
            #Normal learning
            a = epsilon_greedy_action(Q,s,0.1)
            r = grid.move(a)
            s1 = grid.current_state()
            Q_max_action = argmax_dict(Q[s1])[0]
            Q[s][a] = Q[s][a] + alpha*(r + gamma*Q[s1][Q_max_action] - Q[s][a])
            update_model(s,a,s1,r)
            
            # Planning ("thinking") loop: replay simulated transitions from the model
            all_states = list(grid.all_states())
            for _ in range(planning_steps):
                # sample a state and an action uniformly at random
                sample_state = all_states[np.random.choice(len(all_states))]
                sample_action = np.random.choice(actions)
                sample_transition = model[sample_state][sample_action]
                sample_reward = sample_transition["reward"]
                sample_next_state = sample_transition["next_state"]
                # one-step Q-learning update on the simulated transition
                Q_max_action_value = argmax_dict(Q[sample_next_state])[1]
                Q[sample_state][sample_action] += alpha*(sample_reward + gamma*Q_max_action_value - Q[sample_state][sample_action])
                
                
            finished = grid.game_over()
            s = s1
        
    for state in grid.all_states():
        best_action = argmax_dict(Q[state])[0]
        policy[state] = best_action
    return policy,Q

policy,Q = dyna_Q_given_model(grid,model,15,GAMMA,ALPHA,15)
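
Since the cell above depends on `grid_world`, here is a self-contained sketch of the same learn, record, plan loop on a toy one-dimensional corridor. Everything in it is an assumption made for illustration: the corridor environment, the function name `dyna_q_corridor` and its parameters. Unlike the cell above, it follows the textbook variant in which the model is learned online and planning samples only state-action pairs that have already been observed.

```python
import random

def dyna_q_corridor(n_states=6, episodes=20, planning_steps=10,
                    alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Dyna-Q on a corridor: start at 0, reward 1 on reaching the last state."""
    rng = random.Random(seed)
    actions = (-1, +1)                      # step left / step right
    goal = n_states - 1
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    model = {}                              # (s, a) -> (reward, next_state)
    real_steps = 0

    def greedy(s):
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])  # random tie-break

    def q_update(s, a, r, s1):
        target = r + gamma * max(Q[(s1, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            s1 = min(max(s + a, 0), goal)   # walls at both ends
            r = 1.0 if s1 == goal else 0.0
            q_update(s, a, r, s1)           # direct learning from real experience
            model[(s, a)] = (r, s1)         # model learning
            for _ in range(planning_steps): # planning on remembered transitions
                ps, pa = rng.choice(list(model))
                pr, ps1 = model[(ps, pa)]
                q_update(ps, pa, pr, ps1)
            s = s1
            real_steps += 1

    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in range(goal)}
    return policy, Q, real_steps

policy, Q, real_steps = dyna_q_corridor()
```

Calling the function with `planning_steps=0` reduces it to plain Q-learning, so comparing the returned `real_steps` for different settings gives a rough feel for how much planning helps on this toy problem.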

In [14]:
print("Policy")
print_policy(policy,grid)

Policy
---------------------------
  R  |  R  |  R  |  R  |
---------------------------
  U  |     |  U  |  R  |
---------------------------
  R  |  R  |  U  |  L  |


In [15]:
V = defaultdict(lambda:0)
print("state  |policy action| state value")
for state in policy.keys():
    V[state] = Q[state][policy[state]]
    print(state,"|      ",policy[state] , "    |", V[state] )
    

state  |policy action| state value
(0, 1) |       R     | 0.8592322607996199
(1, 2) |       U     | 0.8710057520738528
(0, 0) |       R     | 0.5800288561400662
(1, 3) |       R     | 0.0
(2, 1) |       R     | 0.47540881222982695
(2, 0) |       R     | 0.2506755609993886
(2, 3) |       L     | 0.50769952583805
(2, 2) |       U     | 0.7128598856706208
(1, 0) |       U     | 0.3946188432264865
(0, 2) |       R     | 0.9929303495098489
(0, 3) |       R     | 0.0


In [16]:
print_values(grid=grid,V=V)

--------------------------
 0.58| 0.86| 0.99| 0.00|
--------------------------
 0.39| 0.00| 0.87| 0.00|
--------------------------
 0.25| 0.48| 0.71| 0.51|


## Conclusions
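* The integrated architecture combines learning from real experience with planning from a model, and converges to a good policy in fewer real environment steps: here, 15 episodes with 15 planning updates per real step were enough to find a path from the start state to the winning state.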