# Policy improvement
An implementation of policy improvement on a grid world, building on the "iterative policy evaluation" notebook already in this repo. Much of the code is repeated so that each notebook stays self-contained and can be read without opening the other.

The grid world has shape (3,4), with a winning state "w" at (0,3), a losing state "l" at (1,3), an invalid (wall) state "x" at (1,1), and a start state "s" at (2,0). States are indexed as (row, column), starting from 0 at the top-left corner.

|  |  |  | w |
|---|---|---|---|
|  | x |  | l |
| s |  |  |  |

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

import grid_world

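### Define the convergence threshold: evaluation stops once the largest value change in a sweep falls below it

In [13]:
CONVERGENCE_THRESHOLD = 10e-4  # note: 10e-4 is 0.001 (the same as 1e-3), not 1e-4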

### Auxiliary function to display the values of a policy after finishing iterative policy evaluation

In [12]:
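def print_values(V, grid):
    # In this grid, `width` is the number of rows and `height` the number of columns
    for i in range(grid.width):
        print("--------------------------")
        for j in range(grid.height):
            v = V.get((i, j), 0)  # states without a value (wall, terminals) print as 0
            if v >= 0:
                print(" %.2f|" % v, end="")  # leading space keeps columns aligned with negatives
            else:
                print("%.2f|" % v, end="")
        print("")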

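### Auxiliary function to display a policy
For a stochastic policy (a dict of action probabilities per state), only the first listed action of each state is shown.

In [11]:
def print_policy(P, grid):
    for i in range(grid.width):
        print("---------------------------")
        for j in range(grid.height):
            a = P.get((i, j), ' ')  # states without an action print as blank
            if isinstance(a, dict):
                a = list(a)[0]  # show only the first action listed for this state
            print("  %s  |" % a, end="")
        print("")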

### From our grid_world module, load the standard grid and all of its states

In [10]:
grid = grid_world.Grid.standard_grid()
states = grid.all_states()
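
The `grid_world` module is not shown in this notebook. To follow along without it, a minimal stand-in can be sketched from the calls used here (`actions`, `set_state`, `move`, `current_state`, `all_states`, `width`, `height`). Everything below is an assumption inferred from that usage, including the legal actions per cell and the convention that illegal moves leave the agent in place; the real module may differ.

```python
class Grid:
    """Hypothetical stand-in for grid_world.Grid (interface inferred, not the real module)."""

    MOVES = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}

    def __init__(self, width, height, start):
        # As in the printing helpers, `width` counts rows and `height` counts columns.
        self.width = width
        self.height = height
        self.i, self.j = start
        self.rewards = {}
        self.actions = {}

    def set(self, rewards, actions):
        self.rewards = rewards  # {(row, col): reward received on entering the cell}
        self.actions = actions  # {(row, col): tuple of legal actions}

    def set_state(self, s):
        self.i, self.j = s

    def current_state(self):
        return (self.i, self.j)

    def move(self, action):
        # Assumption: an illegal action leaves the agent where it is.
        if action in self.actions.get((self.i, self.j), ()):
            di, dj = self.MOVES[action]
            self.i += di
            self.j += dj
        return self.rewards.get((self.i, self.j), 0)

    def all_states(self):
        return set(self.actions) | set(self.rewards)

    @classmethod
    def standard_grid(cls):
        g = cls(3, 4, (2, 0))
        rewards = {(0, 3): 1, (1, 3): -1}  # winning and losing terminal states
        actions = {  # assumed legal moves; the wall (1,1) and terminals have none
            (0, 0): ('D', 'R'), (0, 1): ('L', 'R'), (0, 2): ('L', 'D', 'R'),
            (1, 0): ('U', 'D'), (1, 2): ('U', 'D', 'R'),
            (2, 0): ('U', 'R'), (2, 1): ('L', 'R'), (2, 2): ('L', 'R', 'U'),
            (2, 3): ('L', 'U'),
        }
        g.set(rewards, actions)
        return g


demo_grid = Grid.standard_grid()
demo_grid.set_state((0, 2))
demo_reward = demo_grid.move('R')  # step into the winning state
```

The stand-in is named `demo_grid` so that it does not shadow the `grid` object used by the rest of the notebook.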

### Implement the iterative policy evaluation algorithm:
A function that takes the policy to evaluate, the discount factor gamma, and the convergence threshold, and returns the state values of that policy
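
The code below treats every move as having a single outcome, so each sweep applies the following Bellman expectation update. The symbols $r(s,a)$ (the reward returned by the move) and $s'$ (the state reached after taking $a$ in $s$) are notation introduced here for illustration:

$$V(s) \leftarrow \sum_{a} \pi(a \mid s)\,\bigl[\,r(s,a) + \gamma\,V(s')\,\bigr]$$

Sweeps repeat until the largest change $\max_s \lvert V_{\text{new}}(s) - V_{\text{old}}(s) \rvert$ falls below the convergence threshold.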

In [9]:
def iterative_policy_evaluation(policy, gamma=1, convergence_threshold=0.1):
    V = defaultdict(lambda: 0)  # every state value starts at 0

    while True:
        max_change = 0  # largest difference between old and new values in this sweep
        for s in grid.actions.keys():  # update every state that has actions
            old_value = V[s]  # keep the old value to measure the change

            new_value = 0  # accumulator for the Bellman expectation update
            for action in policy[s].keys():  # expected return over the policy's actions in s
                action_prob = policy[s][action]  # probability of taking `action` in state s

                grid.set_state(s)
                reward = grid.move(action)  # immediate reward of the action

                new_value += action_prob * (reward + gamma * V[grid.current_state()])

            V[s] = new_value  # store the updated value in place
            max_change = max(max_change, np.abs(old_value - new_value))

        if max_change < convergence_threshold:  # converged: stop sweeping
            break

    return V
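
To try the same loop without the `grid` global, here is a self-contained sketch on a toy two-state corridor. The `transitions` table, mapping each state and action to a `(next_state, reward)` pair, is a hypothetical replacement for `grid.set_state` plus `grid.move`; it is not part of the repo.

```python
def evaluate_policy(policy, transitions, gamma=1.0, threshold=1e-3):
    """Iterative policy evaluation for a policy given as {state: {action: prob}}.

    transitions[s][a] is a hypothetical (next_state, reward) pair describing
    a deterministic environment.
    """
    V = {s: 0.0 for s in transitions}  # terminal states are absent and count as 0
    while True:
        max_change = 0.0
        for s in transitions:
            new_value = 0.0
            for action, prob in policy[s].items():
                next_state, reward = transitions[s][action]
                new_value += prob * (reward + gamma * V.get(next_state, 0.0))
            max_change = max(max_change, abs(new_value - V[s]))
            V[s] = new_value  # in-place update: later states already see this value
        if max_change < threshold:
            return V


# Toy corridor: pit - A - B - goal. Entering the pit gives -1, the goal +1.
transitions = {
    'A': {'L': ('pit', -1.0), 'R': ('B', 0.0)},
    'B': {'L': ('A', 0.0), 'R': ('goal', 1.0)},
}

# A deterministic "always go left" policy, written in the same
# {state: {action: probability}} form as the notebook's policies.
always_left = {'A': {'L': 1.0}, 'B': {'L': 1.0}}
V_left = evaluate_policy(always_left, transitions, gamma=0.9)
print(V_left)
```

With this policy, `A` steps straight into the pit and `B` reaches it one discounted step later, so both values come out negative.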

### Implement policy improvement:
A function that takes the state values of a policy (found with iterative policy evaluation) and returns an improved policy by selecting, for every state, the action with the highest value
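
In symbols, with $r(s,a)$ the reward of taking action $a$ in state $s$ and $s'$ the state it leads to (notation introduced here for illustration), the improvement step computes an action value from the state values and then acts greedily on it:

$$Q(s,a) = r(s,a) + \gamma\,V(s'), \qquad \pi'(s) = \arg\max_{a} Q(s,a)$$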

In [31]:
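def policy_improvement(policy, V, gamma=1):
    new_policy = {}
    for s in grid.actions.keys():
        Qsa = {}  # action values Q(s, a) for the candidate actions in state s
        # NOTE: only the actions listed in policy[s] are compared. For a policy
        # that lists a single action per state, the result cannot differ from it;
        # iterating over grid.actions[s] instead would compare every legal action.
        for action in policy[s].keys():
            grid.set_state(s)
            reward = grid.move(action)

            Qsa[action] = reward + gamma * V[grid.current_state()]

        # argmax over the dict: the action with the highest Q(s, a)
        state_best_action = max(Qsa, key=Qsa.get)

        new_policy[s] = state_best_action

    return new_policy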
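
The `policy_improvement` function above draws its candidate actions from `policy[s]`. As a sketch of the alternative, the version below compares every legal action in each state. It runs on a toy corridor rather than on the grid, and the `transitions` table of `(next_state, reward)` pairs is a hypothetical replacement for `grid.set_state` plus `grid.move`.

```python
def greedy_improvement(transitions, V, gamma=1.0):
    """Return {state: best_action}, comparing every legal action in each state.

    transitions[s][a] is a hypothetical (next_state, reward) pair describing
    a deterministic environment.
    """
    new_policy = {}
    for s, outcomes in transitions.items():
        # Q(s, a) = r + gamma * V(s'); terminal states are absent from V and count as 0
        q = {a: r + gamma * V.get(s2, 0.0) for a, (s2, r) in outcomes.items()}
        new_policy[s] = max(q, key=q.get)  # argmax over the action values
    return new_policy


# Toy corridor: pit - A - B - goal. Entering the pit gives -1, the goal +1.
transitions = {
    'A': {'L': ('pit', -1.0), 'R': ('B', 0.0)},
    'B': {'L': ('A', 0.0), 'R': ('goal', 1.0)},
}

# State values of the deterministic policy "always go left" with gamma = 0.9:
# V(A) = -1 and V(B) = 0 + 0.9 * V(A) = -0.9.
V_left = {'A': -1.0, 'B': -0.9}

improved = greedy_improvement(transitions, V_left, gamma=0.9)
print(improved)
```

Here both states switch from `L` to `R`. If only the policy's own action `L` were considered, the policy would come back unchanged even though it is clearly not optimal.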

## Compare the value of different policies

### Evaluate a uniform stochastic policy (every action has the same probability)

In [7]:
uniform_policy = {}

for s in grid.actions.keys():
    actions = grid.actions[s]
    uniform_policy[s] = {}
    
    for action in actions: #every action in this state has the same probability
        uniform_policy[s][action] = 1.0/len(actions)
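
The same construction can be written as a dictionary comprehension. The sketch below uses a small hypothetical `demo_actions` table in place of `grid.actions` and checks that each state's probabilities sum to 1.

```python
import math

# Hypothetical legal-action table, standing in for grid.actions.
demo_actions = {
    (0, 0): ('D', 'R'),
    (0, 2): ('L', 'D', 'R'),
}

# Every legal action in a state gets probability 1 / (number of legal actions).
demo_uniform_policy = {
    s: {a: 1.0 / len(acts) for a in acts}
    for s, acts in demo_actions.items()
}

# Each state's probabilities should sum to 1, up to floating-point error.
for s, probs in demo_uniform_policy.items():
    assert math.isclose(sum(probs.values()), 1.0)
```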
        

In [21]:
random_policy_values = iterative_policy_evaluation(policy=uniform_policy,gamma=1,convergence_threshold=CONVERGENCE_THRESHOLD)
print("The state-values of this policy are:")
print_values(grid=grid,V=random_policy_values)

The state-values of this policy are:
--------------------------
-0.03| 0.09| 0.22| 0.00|
--------------------------
-0.16| 0.00|-0.44| 0.00|
--------------------------
-0.29|-0.41|-0.54|-0.77|


In [35]:
improved_policy = policy_improvement(uniform_policy,random_policy_values,gamma=1)
print("Policy after 1 step of policy improvement")
print_policy(improved_policy,grid)

Policy after 1 step of policy improvement
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  U  |     |
---------------------------
  U  |  L  |  L  |  L  |


### Evaluate deterministic policy

In [37]:
# In position (2,0) take action: UP , in position (1,0) take action UP and so on....
fixed_policy = {
    (2,0):{'U':1},
    (1,0):{'U':1},
    (0,0):{'R':1},
    (0,1):{'R':1},
    (0,2):{'R':1},
    (1,2):{'R':1},
    (2,1):{'R':1},
    (2,2):{'R':1},
    (2,3):{'U':1}
}

fixed_policy_values = iterative_policy_evaluation(policy=fixed_policy,gamma=0.9,convergence_threshold=CONVERGENCE_THRESHOLD)

In [38]:
print("The state-values of this policy are:")
print_values(grid=grid,V=fixed_policy_values)

The state-values of this policy are:
--------------------------
 0.81| 0.90| 1.00| 0.00|
--------------------------
 0.73| 0.00|-1.00| 0.00|
--------------------------
 0.66|-0.81|-0.90|-1.00|


In [39]:
print("The deterministic policy is:")
print_policy(grid=grid,P=fixed_policy)

The deterministic policy is:
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  U  |  R  |  R  |  U  |


In [40]:
improved_policy = policy_improvement(fixed_policy,fixed_policy_values,gamma=0.9)
print("Policy after 1 step of policy improvement")
print_policy(improved_policy,grid)

Policy after 1 step of policy improvement
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  U  |  R  |  R  |  U  |


## Conclusions
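* Starting from the uniform random policy, one step of policy improvement already produces a sensible policy: it matches the hand-written deterministic policy in 5 of the 9 states and, in the other 4, moves away from the losing state instead of toward it.
* For the deterministic policy, one step of `policy_improvement` leaves the policy unchanged, but this does not show that the policy is optimal. The function only compares the actions listed in `policy[s]`, and the deterministic policy lists a single action per state, so no alternative is ever considered. The negative values at (1,2), (2,1), (2,2) and (2,3) show that this policy walks into the losing state from those cells.
* Policy improvement reuses the state values from iterative policy evaluation to compute Q(s,a) for each candidate action, then picks the action with the highest Q(s,a) in every state.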