## Policy Improvement on GRID Game Example

First, let's import the standard numerical library.
<img src="./Pictures/Grid.png" alt="Drawing" style="width: 600px;"/>

In [10]:
import numpy as np
from numpy.linalg import inv, det

Below we define the rewards of the cells in the grid (negative entries are costs).

Here $\epsilon$ is the probability of taking a random action and $\beta$ is the discount factor (decay rate) applied to future rewards.


In [11]:
Grid=np.matrix([0,0,0,-1,0,-2,0,0,0,0,0,+2])
r = np.transpose(Grid)
Epsilon = 0.8
beta = 0.9

We define the deterministic jumps that result from following each action (Left, Right, Up, Down), together with a uniformly random policy.


(This could probably be automated for bigger examples.)
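
One way such automation might look is sketched below. It assumes the states are numbered row by row on a grid with 3 rows and 4 columns, that "Up" adds 4 to the state index, that states 3, 5 and 11 are absorbing, and that state 5 cannot be entered. The function name `build_move_matrix` and its parameters are hypothetical. Under these assumptions the sketch agrees with the hand-written `P_Left`, `P_Right` and `P_Up`; it differs from `P_Down` in one row, because the hand-written matrix moves state 9 into state 5 whereas the sketch keeps state 9 in place.

```python
import numpy as np

def build_move_matrix(n_rows, n_cols, d_row, d_col, absorbing=(), blocked=()):
    """Deterministic transition matrix for one move on an n_rows x n_cols grid.

    States are numbered row by row, so state = row * n_cols + col.
    A move that would leave the grid or enter a blocked cell stays put;
    absorbing states always map to themselves.
    """
    n = n_rows * n_cols
    P = np.zeros((n, n), dtype=int)
    for s in range(n):
        row, col = divmod(s, n_cols)
        new_row, new_col = row + d_row, col + d_col
        inside = 0 <= new_row < n_rows and 0 <= new_col < n_cols
        target = new_row * n_cols + new_col
        if s in absorbing or not inside or target in blocked:
            target = s
        P[s, target] = 1
    return P

# Assumed layout (not taken from the picture): 3 rows, 4 columns.
absorbing, blocked = {3, 5, 11}, {5}
P_left_auto = build_move_matrix(3, 4, 0, -1, absorbing, blocked)
P_right_auto = build_move_matrix(3, 4, 0, +1, absorbing, blocked)
P_up_auto = build_move_matrix(3, 4, +1, 0, absorbing, blocked)
P_down_auto = build_move_matrix(3, 4, -1, 0, absorbing, blocked)
```

For a larger grid only the dimensions and the two sets of special states would need to change.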

In [12]:
P_Left = np.array(
                    [
                    [1,0,0,0,0,0,0,0,0,0,0,0],
                    [1,0,0,0,0,0,0,0,0,0,0,0],
                    [0,1,0,0,0,0,0,0,0,0,0,0],
                    [0,0,0,1,0,0,0,0,0,0,0,0],
                    [0,0,0,0,1,0,0,0,0,0,0,0],
                    [0,0,0,0,0,1,0,0,0,0,0,0],
                    [0,0,0,0,0,0,1,0,0,0,0,0],
                    [0,0,0,0,0,0,1,0,0,0,0,0],
                    [0,0,0,0,0,0,0,0,1,0,0,0],
                    [0,0,0,0,0,0,0,0,1,0,0,0],
                    [0,0,0,0,0,0,0,0,0,1,0,0],
                    [0,0,0,0,0,0,0,0,0,0,0,1]
                    ]
                   )

P_Right = np.array(
                    [
                    [0,1,0,0,0,0,0,0,0,0,0,0],
                    [0,0,1,0,0,0,0,0,0,0,0,0],
                    [0,0,0,1,0,0,0,0,0,0,0,0],
                    [0,0,0,1,0,0,0,0,0,0,0,0],
                    [0,0,0,0,1,0,0,0,0,0,0,0],
                    [0,0,0,0,0,1,0,0,0,0,0,0],
                    [0,0,0,0,0,0,0,1,0,0,0,0],
                    [0,0,0,0,0,0,0,1,0,0,0,0],
                    [0,0,0,0,0,0,0,0,0,1,0,0],
                    [0,0,0,0,0,0,0,0,0,0,1,0],
                    [0,0,0,0,0,0,0,0,0,0,0,1],
                    [0,0,0,0,0,0,0,0,0,0,0,1]
                    ]
                   )
                   
P_Up = np.array(
                    [
                    [0,0,0,0,1,0,0,0,0,0,0,0],
                    [0,1,0,0,0,0,0,0,0,0,0,0],
                    [0,0,0,0,0,0,1,0,0,0,0,0],
                    [0,0,0,1,0,0,0,0,0,0,0,0],
                    [0,0,0,0,0,0,0,0,1,0,0,0],
                    [0,0,0,0,0,1,0,0,0,0,0,0],
                    [0,0,0,0,0,0,0,0,0,0,1,0],
                    [0,0,0,0,0,0,0,0,0,0,0,1],
                    [0,0,0,0,0,0,0,0,1,0,0,0],
                    [0,0,0,0,0,0,0,0,0,1,0,0],
                    [0,0,0,0,0,0,0,0,0,0,1,0],
                    [0,0,0,0,0,0,0,0,0,0,0,1],
                    ]
                   )
                   
P_Down = np.array(
                    [
                    [1,0,0,0,0,0,0,0,0,0,0,0],
                    [0,1,0,0,0,0,0,0,0,0,0,0],
                    [0,0,1,0,0,0,0,0,0,0,0,0],
                    [0,0,0,1,0,0,0,0,0,0,0,0],
                    [1,0,0,0,0,0,0,0,0,0,0,0],
                    [0,0,0,0,0,1,0,0,0,0,0,0],
                    [0,0,1,0,0,0,0,0,0,0,0,0],
                    [0,0,0,1,0,0,0,0,0,0,0,0],
                    [0,0,0,0,1,0,0,0,0,0,0,0],
                    [0,0,0,0,0,1,0,0,0,0,0,0],
                    [0,0,0,0,0,0,1,0,0,0,0,0],
                    [0,0,0,0,0,0,0,0,0,0,0,1]
                    ]
                   )

P_Random = 0.25*P_Left + 0.25*P_Right + 0.25*P_Up + 0.25*P_Down

Here we define the transition probabilities for each action (given that a random action might be taken).

We collect these into a big array where:

0 = Left   
1 = Right   
2 = Up   
3 = Down   

In [13]:
Q_Left = ( ( 1 - Epsilon ) * P_Left + Epsilon * P_Random ) 
Q_Right = ( ( 1 - Epsilon ) * P_Right + Epsilon * P_Random ) 
Q_Up = ( ( 1 - Epsilon ) * P_Up + Epsilon * P_Random ) 
Q_Down = ( ( 1 - Epsilon ) * P_Down + Epsilon * P_Random ) 

P = [Q_Left, Q_Right, Q_Up, Q_Down] # 0 = Left, 1 = Right, 2 = Up, 3 = Down
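
As a quick sanity check, mixing a deterministic action with the random policy should still give a matrix whose rows sum to one. The sketch below illustrates this on a hypothetical two-state, two-action example; the names `P_a`, `P_b`, `mix_with_random` and the toy matrices are assumptions for illustration, not part of the grid above.

```python
import numpy as np

def mix_with_random(P_action, P_random, epsilon):
    # With probability epsilon a random action is taken instead of the chosen one.
    return (1 - epsilon) * P_action + epsilon * P_random

# Hypothetical toy example: action a always moves to state 1, action b to state 0.
P_a = np.array([[0.0, 1.0], [0.0, 1.0]])
P_b = np.array([[1.0, 0.0], [1.0, 0.0]])
P_rand = 0.5 * P_a + 0.5 * P_b

Q_a = mix_with_random(P_a, P_rand, epsilon=0.8)
row_sums = Q_a.sum(axis=1)  # each entry should be 1 up to rounding
```

The same check, `Q.sum(axis=1)`, could be applied to each matrix in `P`.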

Next we compute the expected discounted reward of a policy $\pi$.

Recall that $\pi$ defines a Markov chain $X^\pi_t$ and that we can evaluate 
   
\begin{equation}
R^\pi(x) = 
\mathbb E_x 
\Big[
\sum_{t=0}^\infty \beta^t r(X^\pi_t)
\Big]
\end{equation}  
by solving the equation
   
\begin{equation}
R^\pi(x) = \beta (Q^{\pi} R^{\pi} )(x) +r(x)
\end{equation}
   
Thus, interpreting $R^\pi$ and $r$ as vectors, this system of linear equations is solved by:
   
\begin{equation}
R^{\pi} = (I-\beta Q^{\pi})^{-1} r
\end{equation}
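
The closed form can be checked on a small example. The sketch below uses a hypothetical two-state chain (`Q_toy`, `r_toy` and `beta_toy` are made up for illustration) and solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, which is usually the numerically preferable choice.

```python
import numpy as np

# Hypothetical two-state chain: state 1 is absorbing and earns no reward.
Q_toy = np.array([[0.5, 0.5],
                  [0.0, 1.0]])
r_toy = np.array([1.0, 0.0])
beta_toy = 0.9

I = np.identity(2)
# Solve (I - beta Q) R = r directly.
R_toy = np.linalg.solve(I - beta_toy * Q_toy, r_toy)

# R_toy should satisfy the fixed-point equation R = beta Q R + r.
residual = R_toy - (beta_toy * Q_toy @ R_toy + r_toy)
```

Here state 0 solves $R(0) = 0.45\,R(0) + 1$, so $R(0) = 1/0.55$, and $R(1) = 0$.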

The function below does this calculation for the reward function:

In [5]:
def Construct_Rewards_For_Transitions(pi,P,beta,r):
    Q_pi=[]
    I=np.identity(len(pi))
    
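    # Row i of Q_pi is row i of the transition matrix of the action pi[i].
    for i in range(len(pi)):
        Q_pi.append(P[pi[i]][i])
    Q_pi=np.matrix(Q_pi)
    
    # R_pi = (I - beta * Q_pi)^{-1} r
    R_pi = inv( I - beta * Q_pi) @ r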
    
    return R_pi

Given the rewards calculated above, we now need to perform the policy 
improvement step. That is, for each state $x$ we solve the maximization

\begin{equation}
\pi_{\text{new}}(x)\in \textit{argmax}_{a \in \mathcal A}\;\; r(x) + \beta \mathbb E_{x,a} \Big[ R^\pi(X' ) \Big].
\end{equation}
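
The maximization can also be written as a single `argmax` over a table with one column per action. The sketch below is one possible vectorized form, shown on a hypothetical two-state example; the name `greedy_policy` and the toy matrices are assumptions. The term $r(x)$ is left out because it does not depend on the action and so does not affect the maximizer.

```python
import numpy as np

def greedy_policy(P, R_pi, beta):
    # Flatten R_pi so that plain arrays and column matrices are both accepted.
    R_vec = np.asarray(R_pi, dtype=float).reshape(-1)
    # Column a holds beta * E[R_pi(X')] for every state under action a.
    values = np.column_stack([beta * (np.asarray(P_a) @ R_vec) for P_a in P])
    # np.argmax returns the first maximizer when several actions tie.
    return [int(a) for a in np.argmax(values, axis=1)]

# Hypothetical toy example: action 0 stays put, action 1 swaps the two states.
P_stay = np.array([[1.0, 0.0], [0.0, 1.0]])
P_swap = np.array([[0.0, 1.0], [1.0, 0.0]])
R_example = np.array([0.0, 1.0])

toy_policy = greedy_policy([P_stay, P_swap], R_example, beta=0.9)
print(toy_policy)  # [1, 0]
```

State 0 prefers to swap into the rewarding state, while state 1 prefers to stay.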


In [6]:
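def Find_New_Policy(P,R_pi):
    # Note: beta is read from the enclosing (global) scope.
    # The r(x) term is omitted: it does not depend on the action,
    # so it does not change the argmax.
    new_pi=[]
    Reward_Left = beta * ( np.matrix(P[0]) @ R_pi )
    Reward_Right = beta * ( np.matrix(P[1]) @ R_pi )
    Reward_Up = beta * ( np.matrix(P[2]) @ R_pi )
    Reward_Down = beta * ( np.matrix(P[3]) @ R_pi )
    
    # Loop over the states via R_pi instead of relying on the global pi.
    for state in range(R_pi.shape[0]):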
        action=np.argmax(
                [np.matrix.item(Reward_Left[state]),
                 np.matrix.item(Reward_Right[state]),
                 np.matrix.item(Reward_Up[state]),
                 np.matrix.item(Reward_Down[state])])
        new_pi.append(action)
    
    return new_pi

We define some arbitrary policy $\pi$ to start with.

In [15]:
pi=[0,0,0,1,0,0,0,3,0,3,0,0]

Now we simply iterate the two steps above (policy evaluation, then policy improvement) until the policy stops changing, i.e. a fixed point is reached.

(In the run shown here this takes 4 iterations.)

In [16]:
for _ in range(100):
    R_pi = Construct_Rewards_For_Transitions(pi,P,beta,r)
    new_pi = Find_New_Policy(P,R_pi)
    print(new_pi)
    if new_pi == pi :
        break
    else:
        pi = new_pi
    

[0, 0, 2, 0, 3, 0, 2, 2, 3, 1, 1, 0]
[1, 1, 2, 0, 3, 0, 2, 2, 1, 1, 1, 0]
[1, 1, 2, 0, 3, 0, 2, 2, 3, 1, 1, 0]
[1, 1, 2, 0, 3, 0, 2, 2, 3, 1, 1, 0]


Finally, we can print the rewards associated with the last (optimal) policy.

In [17]:
print(np.transpose(R_pi))

[[  0.35613505   0.50788985   0.72484776 -10.           0.25047825
  -20.           6.39709719   7.9896067    0.17831926   0.13306751
   10.21393859  20.        ]]
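
The final policy is easier to read when drawn on the grid. The sketch below assumes 4 cells per row, that "Up" increases the state index (so the highest-numbered row is printed first), and that states 3, 5 and 11 are absorbing and therefore shown as `*`. The helper `render_policy` is hypothetical, and the orientation may not match the picture at the top.

```python
ARROWS = {0: "<", 1: ">", 2: "^", 3: "v"}  # 0 = Left, 1 = Right, 2 = Up, 3 = Down

def render_policy(policy, n_cols=4, absorbing=(3, 5, 11)):
    rows = []
    for start in range(0, len(policy), n_cols):
        cells = []
        for s in range(start, start + n_cols):
            # The action chosen in an absorbing state is irrelevant.
            cells.append("*" if s in absorbing else ARROWS[policy[s]])
        rows.append(" ".join(cells))
    # Highest-numbered row first, since "Up" adds n_cols to the state index.
    return "\n".join(reversed(rows))

final_pi = [1, 1, 2, 0, 3, 0, 2, 2, 3, 1, 1, 0]  # last policy printed above
print(render_policy(final_pi))
# v > > *
# v * ^ ^
# > > ^ *
```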
