## Gridworld A

We experiment with policy and value iteration algorithms on the simple gridworld problem.

Gridworld is a closed square grid (10x10 in the code below, where `world_size = 10`) in which an agent may move in any cardinal direction. The NW and SE corners are terminal states from which no further transitions are allowed. The goal of the agent is to learn a policy that maximizes the expected discounted reward at each step. The reward function returns -1 for all non-terminal states.

In [70]:
import numpy as np
import random

To facilitate transitions and later computation, each state is represented internally as a pair of $(i,j)$ coordinates in a NumPy array (`np.array`). Externally, a state is just an integer from 0 to `world_size**2 - 1` (0...99 for the 10x10 grid).
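As a small illustration of this representation, the sketch below converts between the integer label and the coordinate pair. The helper names `to_coords` and `to_index` are hypothetical and do not appear in the notebook, and plain tuples are used instead of `np.array` to keep the sketch dependency-free.

```python
def to_coords(state, size):
    """Map an integer state label to its (row, column) pair."""
    return divmod(state, size)

def to_index(coords, size):
    """Map a (row, column) pair back to the integer state label."""
    row, col = coords
    return row * size + col

size = 10  # assumed to match world_size in the notebook
row, col = to_coords(13, size)
print(row, col)                         # row and column of state 13
print(to_index((row + 1, col), size))   # label of the cell one step down
```

The two helpers are inverses of each other, which is the property the notebook's `transition` function relies on when it maps a moved coordinate pair back to an integer.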

In [71]:
world_size = 10
terminal_states = (0, world_size**2 - 1)
states = {i: np.array([i // world_size, i % world_size]) for i in range(world_size * world_size)}

In [72]:
actions = {'up':np.array([-1,0]),
           'down':np.array([1,0]),
           'right':np.array([0,1]),
           'left':np.array([0,-1])}

In [73]:
arrows = {'left':"🠈",
          'right':"🠊",
          'up':"🠉",
          'down':"🠋",
          'none':" "}

In [74]:
def transition(state, action):
    """ Return the new state resulting from applying action to state.
        If action is invalid or if state is terminal, return the input state """
    if state in terminal_states:
        return state
    temp = states[state] + actions[action]
    # if valid move
    if (np.all( np.array([0,0]) <= temp) and np.all(temp < np.array([world_size,world_size]))):
        return temp[0]*world_size + temp[1]
    else:
        # if invalid
        return state

In [75]:
# Quick unit test
transition(3,'up'), transition(3,'down'), transition(8,'left')

(3, 13, 7)

In [76]:
def reward(state, action):
    """ Return -1 for all actions from all states """
    return -1

The global `state_actions` maps states to valid actions from that state. There are no actions available from a terminal state.

In [77]:
state_actions = dict()

for s in states.keys():
    children = set()
    for a in actions.keys():
        if transition(s,a) != s:
            children.add(a)
    state_actions.update({ s: children } )

In [78]:
state_actions

{0: set(),
 1: {'down', 'left', 'right'},
 2: {'down', 'left', 'right'},
 3: {'down', 'left', 'right'},
 4: {'down', 'left', 'right'},
 5: {'down', 'left', 'right'},
 6: {'down', 'left', 'right'},
 7: {'down', 'left', 'right'},
 8: {'down', 'left', 'right'},
 9: {'down', 'left'},
 10: {'down', 'right', 'up'},
 11: {'down', 'left', 'right', 'up'},
 12: {'down', 'left', 'right', 'up'},
 13: {'down', 'left', 'right', 'up'},
 14: {'down', 'left', 'right', 'up'},
 15: {'down', 'left', 'right', 'up'},
 16: {'down', 'left', 'right', 'up'},
 17: {'down', 'left', 'right', 'up'},
 18: {'down', 'left', 'right', 'up'},
 19: {'down', 'left', 'up'},
 20: {'down', 'right', 'up'},
 21: {'down', 'left', 'right', 'up'},
 22: {'down', 'left', 'right', 'up'},
 23: {'down', 'left', 'right', 'up'},
 24: {'down', 'left', 'right', 'up'},
 25: {'down', 'left', 'right', 'up'},
 26: {'down', 'left', 'right', 'up'},
 27: {'down', 'left', 'right', 'up'},
 28: {'down', 'left', 'right', 'up'},
 29: {'down', 'left', 

The global policy `pi` ($\pi$) starts out as a uniform policy. Each action is taken with uniform probability from each state.

In [79]:
pi = dict()
for s in states:
    children = state_actions[s]
    for a in children:
        pi.update( { (s,a) : 1/len(children) } )

Initialize the transition probabilities. Following the RL textbook, $P(s',r | s,a)$ is the defining probability in all problems: given an agent in state $s$ taking action $a$, what is the probability of transitioning to state $s'$ and receiving reward $r$? For gridworld, the environment is entirely deterministic: action $a$ from state $s$ always transitions to the same $s'$ and returns reward $-1$. The dictionary `p` contains all valid transitions and maps each of them to the probability 1.
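Written out, the deterministic dynamics described above amount to the following, where $T(s,a)$ is shorthand introduced here for the state returned by `transition(s, a)`:

$$
P(s', r \mid s, a) =
\begin{cases}
1 & \text{if } s' = T(s,a) \text{ and } r = -1, \\
0 & \text{otherwise.}
\end{cases}
$$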

In [80]:
p = dict()

In [81]:
for s in states:
    children = state_actions[s]
    for a in children:
        next_s = transition(s, a)
        r = reward(s, a)
        p.update( { (s, a, next_s, r) : 1 } )

## 4.1 Iterative Policy Evaluation Algorithm

Following the algorithm on p.75 of the RL text, we iteratively evaluate the value of policy `pi` = $\pi_U$, the uniform policy.
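For reference, the in-place update applied to each state during a sweep can be written as the standard Bellman expectation backup. This is a restatement in the notation used above rather than a quotation from the text:

$$
V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a)\,\bigl[\, r + \gamma V(s') \,\bigr]
$$

Because the gridworld is deterministic, the inner sum has a single nonzero term per action. Terminal states have no available actions, so their value stays at 0.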

In [82]:
V = { s:0.0 for s in states }
gamma = 1
count = 1
while count < 50:
    delta = 0
    for s in states:
        v = V[s]
        new_v = 0
        for a in state_actions[s]:
            r = reward(s, a)
            new_state = transition(s, a)
            new_v  += pi[(s,a)]*p[(s,a,new_state,r)]*(r + gamma*V[new_state])
        V[s] = new_v
        delta = max(delta, abs(v - V[s]))
    if delta < 0.01:
        break
    count += 1

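Now compute the greedy policy with respect to `V`: from each state, move to the neighboring state (child) with the highest value. The result is stored in `optimal`, which maps each state to its best child state.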
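A variant of this step that records the chosen action name rather than the successor state might look like the sketch below. The names `ACTIONS`, `step`, and `greedy_actions` are hypothetical stand-ins for the notebook's `actions` and `transition`, rewritten so the example runs on its own, and the toy value table is made up for illustration.

```python
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'right': (0, 1), 'left': (0, -1)}

def step(state, action, size):
    """Deterministic move on a size x size grid; off-grid moves leave the state unchanged."""
    row, col = divmod(state, size)
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < size and 0 <= new_col < size:
        return new_row * size + new_col
    return state

def greedy_actions(V, size, terminals):
    """For each non-terminal state, pick the action whose successor has the highest value."""
    policy = {}
    for s in range(size * size):
        if s in terminals:
            continue
        # Every move costs the same reward, so comparing successor values is enough.
        policy[s] = max(ACTIONS, key=lambda a: V[step(s, a, size)])
    return policy

# Toy 2x2 grid with terminal corners 0 and 3 and illustrative values.
V = {0: 0.0, 1: -1.0, 2: -1.0, 3: 0.0}
print(greedy_actions(V, 2, terminals=(0, 3)))
```

Ties are broken by the iteration order of `ACTIONS`. Storing action names means the result can be looked up directly in a table keyed by action, such as `arrows`.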

In [83]:
optimal = dict()
for s in states:
    children = state_actions[s]
    children_states = [transition(s,a) for a in children]
    if (len(children_states) == 0):
        continue
    best_child = max([c for c in children_states], key = lambda x: V[x])
    optimal.update({s : best_child} )
optimal.update({ s: s for s in terminal_states })

In [84]:
for i in range(world_size):
    for j in range(world_size):
        print (f"{V[i*world_size + j]:8.2f}", end=' ')
    print()

    0.00   -33.23   -52.10   -63.96   -71.84   -77.22   -80.93   -83.46   -85.12   -86.12 
  -33.23   -46.21   -58.15   -67.26   -73.88   -78.59   -81.88   -84.14   -85.62   -86.52 
  -52.10   -58.15   -65.22   -71.47   -76.40   -80.06   -82.64   -84.38   -85.51   -86.21 
  -63.96   -67.26   -71.47   -75.50   -78.83   -81.30   -82.96   -83.97   -84.55   -84.93 
  -71.84   -73.88   -76.40   -78.83   -80.76   -82.02   -82.60   -82.63   -82.41   -82.28 
  -77.22   -78.59   -80.06   -81.30   -82.02   -82.05   -81.35   -80.08   -78.64   -77.68 
  -80.93   -81.88   -82.64   -82.96   -82.60   -81.35   -79.11   -76.02   -72.63   -70.13 
  -83.46   -84.14   -84.38   -83.97   -82.63   -80.08   -76.02   -70.39   -63.66   -57.89 
  -85.12   -85.62   -85.51   -84.55   -82.41   -78.64   -72.63   -63.66   -51.27   -37.38 
  -86.12   -86.52   -86.21   -84.93   -82.28   -77.68   -70.13   -57.89   -37.38     0.00 


In [85]:
for i in range(world_size):
    for j in range(world_size):
        print (f"{optimal[i*world_size + j]:4}", end=' ')
    print()

   0    0    1    2    3    4    5    6    7    8 
   0    1   11   12   13   14   15   16   17   18 
  10   11   12   22   23   24   25   26   27   39 
  20   21   22   32   33   34   35   47   48   49 
  30   31   32   33   43   44   56   57   58   59 
  40   41   42   43   44   56   66   67   68   69 
  50   51   52   53   65   66   67   77   78   79 
  60   61   62   74   75   76   77   78   88   89 
  70   71   72   84   85   86   87   88   89   99 
  80   81   93   94   95   96   97   98   99   99 


## 4.3 Policy Iteration

Policy iteration begins with a randomly chosen deterministic policy and uses iterative policy evaluation to approximate its value function. This value function then yields a new policy by selecting, in each state, the action with the highest value. Evaluation then begins again under the new policy, and the process repeats until the policy is stable. Note that in this example we include logic to determine whether the policy is oscillating between two equally good policies; otherwise an infinite loop can occur.
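The loop described above can be sketched in a self-contained form as follows. This is not the notebook's implementation: it assumes a discount of `gamma = 0.9` rather than 1, because with no discounting a deterministic policy that cycles without reaching a terminal state has an unbounded value and evaluation cannot converge. It also keeps the current action on ties, which is one simple way to prevent oscillation between equally good policies. All function names and the 4x4 grid size are illustrative choices.

```python
import random

ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'right': (0, 1), 'left': (0, -1)}

def step(state, action, size):
    """Deterministic move; off-grid moves leave the state unchanged."""
    row, col = divmod(state, size)
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < size and 0 <= new_col < size:
        return new_row * size + new_col
    return state

def evaluate(policy, V, size, gamma, theta=1e-10):
    """In-place iterative policy evaluation for a deterministic policy."""
    while True:
        delta = 0.0
        for s, a in policy.items():
            old = V[s]
            V[s] = -1 + gamma * V[step(s, a, size)]
            delta = max(delta, abs(old - V[s]))
        if delta < theta:
            return V

def improve(policy, V, size, gamma, tol=1e-6):
    """Greedy improvement; switch only on a strict gain so ties cannot oscillate."""
    stable = True
    for s, current in policy.items():
        def q(a):
            return -1 + gamma * V[step(s, a, size)]
        best = max(ACTIONS, key=q)
        if q(best) > q(current) + tol:
            policy[s] = best
            stable = False
    return stable

def policy_iteration(size=4, gamma=0.9, seed=0):
    """Alternate evaluation and improvement until the policy stops changing."""
    rng = random.Random(seed)
    terminals = (0, size * size - 1)
    V = {s: 0.0 for s in range(size * size)}  # terminal values stay at 0
    policy = {s: rng.choice(list(ACTIONS)) for s in V if s not in terminals}
    while True:
        evaluate(policy, V, size, gamma)
        if improve(policy, V, size, gamma):
            return policy, V

policy, V = policy_iteration()
for i in range(4):
    print(" ".join(f"{V[i * 4 + j]:6.2f}" for j in range(4)))
```

With discounting, a state that is $k$ steps from the nearest terminal corner ends up with value $-(1 - \gamma^k)/(1 - \gamma)$, so the printed grid should fall off with distance from the NW and SE corners.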

In [86]:
# Initialize V to random values (0 at terminal states) and `pi` to a random action.



In [92]:
V = dict( {s : random.random() for s in states} )
V.update ( {s : 0 for s in terminal_states } )
pi = dict( {s : random.choice(list(state_actions[s])) for s in states if len(state_actions[s]) > 0} )
pi.update( { s: None for s in terminal_states} )

count= 0
gamma = 1
target_delta = 10
done = False
restarts = 0

while not (done):
    while (restarts < 50):
        print("------------------- starting ----------------------" )
        V = dict( {s : random.random() for s in states} )
        V.update ( {s : 0 for s in terminal_states } )
        while True:
            delta = 0
            for s in pi.keys():
                v = V[s]
                next_s = transition(s,pi[s])
                r = reward(s, pi[s])
                #if (s == 1):
                #    print ("help:", s, v, next_s, r, gamma, V[next_s])
                V[s] = 1 * (r + gamma * V[next_s])
                delta = max(delta, abs(v - V[s]))
                #print(delta, s, V[s])
            print("Delta = ", delta)
            if delta <= target_delta:
                done = True
        restarts += 1
    # end restart while
    print("done!")
    done = True

    optimal = dict()
    last_pi = pi.copy()
    
    for s in states:
        if s not in terminal_states:
            children = state_actions[s]
            best_child = max([c for c in children], \
                             key = lambda x: (-1+gamma*V[transition(s,x)]))
            optimal.update({s : best_child} )
            if optimal[s] != pi[s]:
                pi[s] = optimal[s]
                done = False
                
optimal.update({ s: 'none' for s in terminal_states })

------------------- starting ----------------------
Delta =  7.224821766089971
Delta =  7.022568584015267
Delta =  7.022568584015267
Delta =  2.486381629362206
Delta =  2.000000000000001
Delta =  2.000000000000001
Delta =  2.000000000000001
Delta =  2.0
Delta =  2.0000000000000018
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0000000000000036
Delta =  2.0000000000000036
Delta =  2.0000000000000036
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.000000000000007
Delta =  2.000000000000007
Delta =  2.000000000000007
Delta =  2.000000000000007
Delta =  2.000000000000007
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta

KeyboardInterrupt: 

In [89]:
def initV():
    V = dict( {s : -1 + random.random() for s in states} )
    V.update ( {s : 0 for s in terminal_states } )
    return V

def initPi():
    pi = dict( {s : random.choice(list(state_actions[s])) for s in states if len(state_actions[s]) > 0} )
    pi.update( { s: None for s in terminal_states} )
    return pi

def iterate_V(V, pi, target_delta = 2, bail = 50):
    done = False
    count = 0
    while not done and count < bail:
        delta = 0
        count += 1
        for s in pi.keys():
            v = V[s]
            next_s = transition(s,pi[s])
            r = reward(s, pi[s])
            #if (s == 1):
            #    print ("help:", s, v, next_s, r, gamma, V[next_s])
            V[s] = 1 * (r + gamma * V[next_s])
            delta = max(delta, abs(v - V[s]))
            #print(delta, s, V[s])
        print("Delta = ", delta)
        if delta <= target_delta:
            done = True
    if not done:
        # Hit the sweep limit without reaching target_delta.
        return False
    return V

def random_restart(f, V,pi, trials = 10):
    while (trials > 0) and not (result := f(V,pi,target_delta=1,bail=20)):
        trials -= 1
        V = initV()
        pi = initPi()
    return result

In [90]:
V = initV()
pi = initPi()
random_restart(iterate_V,V,pi)

Delta =  6.070666469678904
Delta =  3.2808813163734114
Delta =  2.532148839472605
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  2.532148839472603
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  2.5321488394726046
Delta =  8.073396298344914
Delta =  7.238572097802712
Delta =  7.238572097802712
Delta =  2.0000000000000004
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0
Delta =  2.0000000000000036
Delta =  2.0000000000000036
Delta =  2.0000000000000036
Delta =  2.0000000000000036
Delta =  2.0000000000000036
Delta =  9.464863961567826
Delta =  2.8972490841023

False

In [91]:
for i in range(world_size):
    for j in range(world_size):
        print (f"{arrows[optimal[i*world_size + j]]:2}", end=' ')
    print()

KeyError: 0

In [163]:
for i in range(world_size):
    for j in range(world_size):
        print (f"{V[i*world_size + j]:6}", end=' ')
    print()

  -1.0   -1.0   -1.5   -2.0 -1.0625 -1.125 -1.625 -2.125 -1.2142857142857144 -1.7142857142857144 
 -1.25   -1.5   -2.0  -1.25 -1.5625  -1.25   -1.5   -2.0 -1.4285714285714286 -2.2142857142857144 
 -1.25   -1.5  -1.25   -1.5   -2.0 -1.375  -1.75   -2.5 -1.8571428571428572 -2.7142857142857144 
 -1.75   -2.0  -1.75 -1.171875 -1.34375 -1.875  -2.25   -3.0   -3.5   -4.0 
 -1.75   -2.5   -3.0 -1.34375 -1.6875 -2.375 -2.875   -2.5   -4.0   -4.5 
 -2.25  -2.75 -1.0625 -1.125  -1.25 -2.875 -1.125  -1.25   -4.5   -1.5 
 -2.75  -3.25 -1.5625 -1.625   -1.5   -1.5  -1.25   -1.5   -1.5   -2.0 
-1.0625 -1.125  -1.25  -1.75   -2.0   -2.0 -1.125   -2.0   -2.0   -2.5 
-1.125  -1.25   -1.5   -1.5   -2.5   -3.0  -1.25   -1.5   -1.5   -2.0 
 -1.25   -1.5   -2.0   -2.0   -3.0  -1.25   -1.5   -2.0   -1.0   -1.0 
