# DAT257x: Reinforcement Learning Explained

## Lab 4: Dynamic Programming

### Exercise 4.1 Policy Evaluation with 2 Arrays

Policy Evaluation calculates the value function for a policy, given the policy and the full definition of the associated Markov Decision Process.  The full definition of an MDP is the set of states, the set of available actions for each state, the set of rewards, the discount factor, and the state/reward transition function.

In [82]:
import test_dp               # required for testing and grading your code
import gridworld_mdp as gw   # defines the MDP for a 4x4 gridworld

The gridworld MDP defines the probability of state transitions for our 4x4 gridworld using a "get_transitions()" function.  

Let's try it out now, with state=2 and all defined actions.

In [83]:
# try out the gw.get_transitions(state, action) function

state = 2
actions = gw.get_available_actions(state)

for action in actions:
    transitions = gw.get_transitions(state=state, action=action)

    # examine each return transition (only 1 per call for this MDP)
    for (trans) in transitions:
        next_state, reward, probability = trans    # unpack tuple
        print("transition("+ str(state) + ", " + action + "):", "next_state=", next_state, ", reward=", reward, ", probability=", probability)

transition(2, up): next_state= 2 , reward= -1 , probability= 1
transition(2, down): next_state= 6 , reward= -1 , probability= 1
transition(2, left): next_state= 1 , reward= -1 , probability= 1
transition(2, right): next_state= 3 , reward= -1 , probability= 1


**Implement the algorithm for Iterative Policy Evaluation using the 2 array approach**.  In the 2 array approach, one array holds the value estimates for each state computed on the previous iteration, and one array holds the value estimates for the states computing in the current iteration.

A empty function **policy_eval_two_arrays** is provided below; implement the body of the function to correctly calculate the value of the policy using the 2 array approach.  The function defines 5 parameters - a definition of each parameter is given in the comment block for the function.  For sample parameter values, see the calling code in the cell following the function.

In [100]:
def policy_eval_two_arrays(state_count, gamma, theta, get_policy, get_transitions):
    """
    This function uses the two-array approach to evaluate the specified policy for the specified MDP:
    
    'state_count' is the total number of states in the MDP. States are represented as 0-relative numbers.
    
    'gamma' is the MDP discount factor for rewards.
    
    'theta' is the small number threshold to signal convergence of the value function (see Iterative Policy Evaluation algorithm).
    
    'get_policy' is the stochastic policy function - it takes a state parameter and returns list of tuples, 
        where each tuple is of the form: (action, probability).  It represents the policy being evaluated.
        
    'get_transitions' is the state/reward transiton function.  It accepts two parameters, state and action, and returns
        a list of tuples, where each tuple is of the form: (next_state, reward, probabiliity).  
        
    """
    
        
    V = state_count*[0]
    while True: 
        delta = 0 
        state = 0 
        while state < state_count:
            v = 0
            for action, action_probability in get_policy(state):
                for next_state, reward, probability in get_transitions(state, action):
                    v += action_probability * probability * (reward + gamma * V[next_state])
                    print("nxt_rew: " + str(V[next_state]) + ", nxt: " + str(next_state) + ", state:" + str(state) + ", V: " + str(v))
            delta = max(delta, abs(V[state]-v))
            V[state] = v
            #print(reward)
            state += 1
            print("    " + str(v))
        if delta < theta:
            break
    return V

First, test our function using the MDP defined by gw.* functions.

In [101]:
def get_equal_policy(state):
    # build a simple policy where all 4 actions have the same probability, ignoring the specified state
    policy = ( ("up", .25), ("right", .25), ("down", .25), ("left", .25))
    return policy

n_states = gw.get_state_count()

# test our function
values = policy_eval_two_arrays(state_count=n_states, gamma=.9, theta=.001, get_policy=get_equal_policy, \
    get_transitions=gw.get_transitions)

print("Values=", values)

nxt_rew: 0, nxt: 0, state:0, V: 0.0
nxt_rew: 0, nxt: 0, state:0, V: 0.0
nxt_rew: 0, nxt: 0, state:0, V: 0.0
nxt_rew: 0, nxt: 0, state:0, V: 0.0
    0.0
nxt_rew: 0, nxt: 1, state:1, V: -0.25
nxt_rew: 0, nxt: 2, state:1, V: -0.5
nxt_rew: 0, nxt: 5, state:1, V: -0.75
nxt_rew: 0.0, nxt: 0, state:1, V: -1.0
    -1.0
nxt_rew: 0, nxt: 2, state:2, V: -0.25
nxt_rew: 0, nxt: 3, state:2, V: -0.5
nxt_rew: 0, nxt: 6, state:2, V: -0.75
nxt_rew: -1.0, nxt: 1, state:2, V: -1.225
    -1.225
nxt_rew: 0, nxt: 3, state:3, V: -0.25
nxt_rew: 0, nxt: 3, state:3, V: -0.5
nxt_rew: 0, nxt: 7, state:3, V: -0.75
nxt_rew: -1.225, nxt: 2, state:3, V: -1.275625
    -1.275625
nxt_rew: 0.0, nxt: 0, state:4, V: -0.25
nxt_rew: 0, nxt: 5, state:4, V: -0.5
nxt_rew: 0, nxt: 8, state:4, V: -0.75
nxt_rew: 0, nxt: 4, state:4, V: -1.0
    -1.0
nxt_rew: -1.0, nxt: 1, state:5, V: -0.475
nxt_rew: 0, nxt: 6, state:5, V: -0.725
nxt_rew: 0, nxt: 9, state:5, V: -0.975
nxt_rew: -1.0, nxt: 4, state:5, V: -1.45
    -1.45
nxt_rew: -1.225

nxt_rew: -5.211869093809515, nxt: 1, state:1, V: -1.422670546107141
nxt_rew: -7.038617709996874, nxt: 2, state:1, V: -3.2563595308564377
nxt_rew: -6.534464990422557, nxt: 5, state:1, V: -4.976614153701513
nxt_rew: 0.0, nxt: 0, state:1, V: -5.226614153701513
    -5.226614153701513
nxt_rew: -7.038617709996874, nxt: 2, state:2, V: -1.8336889847492968
nxt_rew: -7.554405602662053, nxt: 3, state:2, V: -3.7834302453482587
nxt_rew: -7.107886705221244, nxt: 6, state:2, V: -5.632704754023038
nxt_rew: -5.226614153701513, nxt: 1, state:2, V: -7.058692938605879
    -7.058692938605879
nxt_rew: -7.554405602662053, nxt: 3, state:3, V: -1.949741260598962
nxt_rew: -7.554405602662053, nxt: 3, state:3, V: -3.899482521197924
nxt_rew: -7.058692438662586, nxt: 7, state:3, V: -5.737688319897005
nxt_rew: -7.058692938605879, nxt: 2, state:3, V: -7.575894231083328
    -7.575894231083328
nxt_rew: 0.0, nxt: 0, state:4, V: -0.25
nxt_rew: -6.534464990422557, nxt: 5, state:4, V: -1.9702546228450755
nxt_rew: -7.038617

    -6.604213913250977
nxt_rew: -7.125803667372325, nxt: 2, state:6, V: -1.853305825158773
nxt_rew: -7.12580366736322, nxt: 7, state:6, V: -3.7066116503154976
nxt_rew: -6.604213913242255, nxt: 10, state:6, V: -5.442559780795005
nxt_rew: -6.604213913250977, nxt: 5, state:6, V: -7.1785079112764745
    -7.1785079112764745
nxt_rew: -7.647729922717661, nxt: 3, state:7, V: -1.9707392326114739
nxt_rew: -7.12580366736322, nxt: 7, state:7, V: -3.824045057768198
nxt_rew: -5.276332914891939, nxt: 11, state:7, V: -5.261219963618885
nxt_rew: -7.1785079112764745, nxt: 6, state:7, V: -7.126384243656092
    -7.126384243656092
nxt_rew: -5.275906485600302, nxt: 4, state:8, V: -1.4370789592600681
nxt_rew: -7.17790221051298, nxt: 9, state:8, V: -3.3021069566254884
nxt_rew: -7.646929494227032, nxt: 12, state:8, V: -5.2726660928265705
nxt_rew: -7.125055886870019, nxt: 8, state:8, V: -7.125803667372325
    -7.125803667372325
nxt_rew: -6.604213913250977, nxt: 5, state:9, V: -1.73594813048147
nxt_rew: -6.60421

**Expected output from running above cell:**

`
Values= [0.0, -5.274709263277986, -7.123800104889248, -7.64536148969558, -5.274709263277987, -6.602238720082915, -7.17604178238719, -7.1238001048892485, -7.1238001048892485, -7.176041782387191, -6.602238720082915, -5.274709263277986, -7.645361489695581, -7.1238001048892485, -5.274709263277986]
`

In [86]:
import numpy as np
a = np.append(values, 0)
np.reshape(a, (4,4))

array([[ 0.        , -5.27590649, -7.12580367, -7.64772992],
       [-5.27590649, -6.60421391, -7.17850791, -7.12638424],
       [-7.12580367, -7.17850791, -6.60467837, -5.27666399],
       [-7.64772992, -7.12638424, -5.27666399,  0.        ]])

Now, test our function using the test_dp helper.  The helper also uses the gw MDP, but with a different gamma value.
If our function passes all tests, a passcode will be printed.

In [87]:
# test our function using the test_db helper
test_dp.policy_eval_two_arrays_test( policy_eval_two_arrays ) 


Testing: Policy Evaluation (two-arrays)
passed test: return value is list
passed test: length of list = 15
passed test: values of list elements
PASSED: Policy Evaluation (two-arrays) passcode = 9991-562


In [88]:
-

SyntaxError: invalid syntax (<ipython-input-88-60215a10e730>, line 1)