## Policy Evaluation by Direct Solution of the Bellman Equation

Consider an Agent that has to navigate a Grid World with 12 states (a 3×4 grid, numbered 0 to 11 row by row from the top left), as shown below. The Agent starts in the bottom-left corner and has to reach the top-right corner without stepping on the state with reward -1. If the Agent tries to move into the shaded state, it bounces back to the state it came from.

![title](gridworld.png)

At each step, the Agent can take one of four actions: Up (^), Left (<), Down (v), or Right (>).


In [8]:
import numpy as np

T = np.load("./T.npy")

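We assume that the Transition Model is given. It is stored as an array `T`, where `T[s, s', a]` is the probability of moving from state `s` to state `s'` under action `a`. Below, we examine the transition probabilities of some states.

From the terminal state (3), the Agent does not move anywhere.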

In [9]:
T.shape

(12, 12, 4)

In [10]:
T[3,:,1]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

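The transition probabilities out of the terminal state are set to zero for every action.

From a non-terminal state, the Agent moves in the intended direction with probability 0.8 and slips to each of the two perpendicular directions with probability 0.1. From state 6, a slip to the left would enter the shaded state 5, so the Agent stays in state 6 instead.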

In [11]:
T[6,:,0]

array([0. , 0. , 0.8, 0. , 0. , 0. , 0.1, 0.1, 0. , 0. , 0. , 0. ])

In [12]:
T[6,:,2]

array([0. , 0. , 0. , 0. , 0. , 0. , 0.1, 0.1, 0. , 0. , 0.8, 0. ])
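The rows printed above follow a simple rule (0.8 for the intended move, 0.1 for each perpendicular slip, bounce back from walls and grid edges), and that rule can be sketched in code. The function below is a hypothetical reconstruction, not the code that produced `T.npy`: the name `build_transition_model`, the all-zero rows for the wall state, and the row-by-row state numbering are assumptions. It reproduces the three rows shown in this section; whether it matches the rest of the file is not verified here.

```python
import numpy as np

def build_transition_model(n_rows=3, n_cols=4, wall=5, terminals=(3, 7)):
    """Assumed reconstruction of a 3x4 grid-world transition array T[s, s', a]."""
    # Action order used in this notebook: 0=Up, 1=Left, 2=Down, 3=Right
    moves = [(-1, 0), (0, -1), (1, 0), (0, 1)]
    n_states = n_rows * n_cols
    T = np.zeros((n_states, n_states, 4))

    def step(s, a):
        """State reached from s when the move a actually happens."""
        row, col = divmod(s, n_cols)
        new_row, new_col = row + moves[a][0], col + moves[a][1]
        inside = 0 <= new_row < n_rows and 0 <= new_col < n_cols
        # Bounce back when leaving the grid or hitting the wall
        if not inside or new_row * n_cols + new_col == wall:
            return s
        return new_row * n_cols + new_col

    for s in range(n_states):
        if s == wall or s in terminals:
            continue  # rows stay zero (assumption for the wall state)
        for a in range(4):
            # Intended move with 0.8, each perpendicular move with 0.1
            for actual, prob in ((a, 0.8), ((a + 1) % 4, 0.1), ((a + 3) % 4, 0.1)):
                T[s, step(s, actual), a] += prob
    return T

T_sketch = build_transition_model()
print(T_sketch.shape)
print(T_sketch[6, :, 0])
print(T_sketch[6, :, 2])
```

With the rule written out this way, each non-terminal row sums to 1, which is a quick sanity check worth running on any transition model before solving the Bellman equation.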

In [13]:
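def return_policy_evaluation_linalg(p, r, T, gamma):
    """Solve the linear system (I - gamma * T_a) u = r directly, state by state."""
    n_states = len(r)
    u = np.zeros(n_states)
    for s in range(n_states):
        if not np.isnan(p[s]):  # the wall state (NaN) keeps value 0
            # Terminal states are coded -1, which indexes the last action;
            # their rows of T are all zero, so the solve gives u[s] = r[s].
            action = int(p[s])
            # T[:, :, action] applies this action's dynamics to every state,
            # so u[s] is the value of s when that action is taken everywhere.
            A = np.identity(n_states) - gamma * T[:, :, action]
            u[s] = np.linalg.solve(A, r)[s]
    return u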
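For a fixed policy π, the Bellman equation is linear in the values: $v = r + \gamma T_\pi v$, so $v = (I - \gamma T_\pi)^{-1} r$, where row $s$ of $T_\pi$ is `T[s, :, π(s)]`. The sketch below assembles $T_\pi$ row by row and solves the system once. It is one possible variant rather than the function defined above; the name `evaluate_policy_direct` and the three-state toy MDP are assumptions made for illustration.

```python
import numpy as np

def evaluate_policy_direct(p, r, T, gamma):
    """Sketch: solve v = r + gamma * T_pi v with one linear solve."""
    n_states = len(r)
    T_pi = np.zeros((n_states, n_states))
    for s in range(n_states):
        # Wall (NaN) and terminal (-1) states keep an all-zero row,
        # so their value reduces to r[s].
        if not np.isnan(p[s]) and p[s] >= 0:
            T_pi[s, :] = T[s, :, int(p[s])]
    return np.linalg.solve(np.identity(n_states) - gamma * T_pi, r)

# Hypothetical toy MDP: 3 states in a chain, state 2 is terminal.
# Action 0 = stay in place, action 1 = advance to the next state.
T_toy = np.zeros((3, 3, 2))
T_toy[0, 0, 0] = 1.0
T_toy[1, 1, 0] = 1.0
T_toy[0, 1, 1] = 1.0
T_toy[1, 2, 1] = 1.0
r_toy = np.array([-1.0, -1.0, 10.0])
p_toy = np.array([1.0, 1.0, -1.0])  # advance, advance, terminal

u_toy = evaluate_policy_direct(p_toy, r_toy, T_toy, gamma=0.5)
print(u_toy)
```

By hand, the toy values are 10 for the terminal state, -1 + 0.5·10 = 4 for state 1, and -1 + 0.5·4 = 1 for state 0, which gives a small case to compare the solver against.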

In [14]:
gamma = 0.999
iteration = 0

# Generate the first policy randomly
# NaN=Nothing (wall), -1=Terminal, 0=Up, 1=Left, 2=Down, 3=Right
p = np.random.randint(0, 4, size=(12)).astype(np.float32)
p[5] = np.nan
p[3] = p[7] = -1  # terminal states

#Value function initialised to zero
v = np.array([0.0, 0.0, 0.0,  0.0,
               0.0, 0.0, 0.0,  0.0,
               0.0, 0.0, 0.0,  0.0])

#let us assign appropriate rewards
r = np.array([-0.04, -0.04, -0.04,  +1.0,
              -0.04,   0.0, -0.04,  -1.0,
              -0.04, -0.04, -0.04, -0.04])
epsilon = 0.0001
while True:
    iteration += 1
    # 1- Policy Evaluation by direct solution
    v1 = v.copy()
    v = return_policy_evaluation_linalg(p, r, T, gamma)

    # Stopping criterion: the direct solve depends only on p, so the
    # second pass reproduces v and delta drops to zero.
    delta = np.absolute(v - v1).max()
    if (delta < epsilon * (1 - gamma) / gamma) or iteration > 100:
        break


print("Iterations: " + str(iteration))

print("Gamma: " + str(gamma))
print("Epsilon: " + str(epsilon))
print("===================================================")
print("Estimated value function")
print(v[0:4])
print(v[4:8])
print(v[8:12])


Iterations: 2
Gamma: 0.999
Epsilon: 0.0001
Estimated value function
[-39.53391727 -35.80058996   0.7433094    1.        ]
[ -0.63691929   0.         -40.          -1.        ]
[-40.          -1.14877472  -1.37568607 -40.        ]
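The entries equal to -40 in the output above are consistent with a state that collects the step reward of -0.04 forever without reaching a terminal state: the discounted sum is the geometric series $-0.04 / (1 - \gamma) = -40$ for $\gamma = 0.999$. A quick numerical check of that arithmetic follows; the variable names are illustrative and the number of summed terms is an arbitrary choice.

```python
step_reward = -0.04
gamma = 0.999

# Closed form of sum_{t>=0} gamma^t * step_reward
closed_form = step_reward / (1 - gamma)

# Truncated series; gamma**20000 is already negligible
partial_sum = sum(step_reward * gamma**t for t in range(20000))

print(closed_form)
print(partial_sum)
```

The same reasoning explains why values close to -40 appear: with a discount factor this near 1, a state that rarely escapes the -0.04 region accumulates almost the full series.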
