# Policy iteration
An implementation of policy iteration on a grid world, using the "iterative policy evaluation" and "policy improvement" routines already implemented in this repo. Much of the code is repeated here so that the notebook stays self-contained and can be read without opening a second one.

The grid world has shape (3, 4), with a winning state "w" at (0, 3), a losing state "l" at (1, 3), an invalid (wall) state "x" at (1, 1), and a start state "s" at (2, 0). Coordinates are (row, column), counted from the top-left cell.

|  |  |  |  |
|---|---|---|---|
|  |  |  | w |
|  | x |  | l |
| s |  |  |  |

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

import grid_world

### Define the convergence threshold: policy evaluation stops once the largest value change in a sweep falls below it

In [2]:
CONVERGENCE_THRESHOLD = 10e-5  # note: 10e-5 equals 1e-4

### Auxiliary function to display the values of a policy after finishing iterative policy evaluation

In [3]:
def print_values(V,grid):
    for i in range(grid.width):
        print("--------------------------")
        for j in range(grid.height):
            v = V.get((i,j),0)
            if v >= 0:
                print(" %.2f|" % v, end="")
            else:
                print("%.2f|" % v, end="")
        print("")

### Auxiliary function to display a policy, printing one action per state

In [4]:
def print_policy(P,grid):
    for i in range(grid.width):
        print("---------------------------")
        for j in range(grid.height):
            a = P.get((i,j),' ')
            if isinstance(a,dict):
                a = list(a)[0]
            print("  %s  |" % a, end="")
        print("")

### From our grid world module, create a negative grid, retrieve all actions and states, and print the grid rewards
The negative grid gives every non-terminal step a small negative reward (-0.1), which encourages the agent to find the shortest path to the goal

In [5]:
grid = grid_world.Grid.negative_grid()
states = grid.all_states()
actions = list({action for action_tup in grid.actions.values() for action in action_tup})

In [6]:
actions

['L', 'D', 'R', 'U']

In [7]:
print("Rewards of negative grid")
print_values(grid.rewards,grid)

Rewards of negative grid
--------------------------
-0.10|-0.10|-0.10| 1.00|
--------------------------
-0.10| 0.00|-0.10|-1.00|
--------------------------
-0.10|-0.10|-0.10|-0.10|


#### Auxiliary function to compare if two policies are equal

In [8]:
def equal_policies(policy1,policy2):
    # policy1 and policy2 are dicts
    
    for s in grid.actions.keys():
        for a in actions:
            
            if policy1[s][a] != policy2[s][a]:
                return False
    
    return True

### Create an initial random policy and improve it over time using policy iteration

In [9]:
policy = {}
policy_probs = {}

for s in grid.actions.keys():
    action = np.random.choice(actions)
    policy[s] = action
    policy_probs[s] = {action:0.0 for action in actions}
    policy_probs[s][action] = 1.0
    
print("Initial random policy is:")
print_policy(policy,grid)

Initial random policy is:
---------------------------
  R  |  D  |  U  |     |
---------------------------
  U  |     |  U  |     |
---------------------------
  U  |  L  |  L  |  R  |


## Policy iteration is a combination of policy evaluation and policy improvement

### Implement the iterative policy evaluation algorithm:
Function that receives as parameters the policy to evaluate, the discount factor gamma, and the convergence threshold
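
As a reading aid, the update applied to each state can be written as follows. This is a sketch that assumes deterministic transitions, as in this grid world, where $s'$ denotes the state reached by taking action $a$ in state $s$ and $r(s,a)$ the reward received for that move; the symbols are notation introduced here, not names from the code.

$$
V(s) \leftarrow \sum_{a} \pi(a \mid s)\,\bigl[\, r(s,a) + \gamma\, V(s') \,\bigr]
$$

The sweep is repeated over all states until the largest change $\max_s \lvert V_{\text{new}}(s) - V_{\text{old}}(s) \rvert$ drops below the convergence threshold. Values are updated in place, so states later in a sweep already see the new values of states updated earlier in the same sweep.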

In [10]:
def iterative_policy_evaluation(policy,gamma = 1,convergence_threshold = 0.1):
    V = defaultdict(lambda :0) # initializes the value of all states to 0
    
    while True:
        max_change = 0 # the maximum difference between old and new values in one sweep
        for s in grid.actions.keys(): # update the value of every state that has actions
            old_value  = V[s] # keep a copy of the old value of the state
            
            new_value = 0 # accumulator for the new value of the state after applying the Bellman equation
            for action in policy[s].keys(): # add the expected return of every action in state s (policy[s][a])
                action_prob = policy[s][action] # probability of taking action "a" in state "s"
                
                grid.set_state(s)
                reward = grid.move(action) # get the immediate reward of the action
                
                new_value += action_prob*(reward + gamma*V[grid.current_state()]) 
                
                
            V[s] = new_value # store the new value after the Bellman update
            
            max_change = max(max_change,np.abs(old_value  - new_value )) # track the max change to decide when to stop
            
        
        if max_change < convergence_threshold:
            break
            
    return V



### Implement policy improvement:
Function that receives the state-value function of a policy (found using iterative policy evaluation) and returns an improved policy by selecting, for every state, the action that maximizes its value
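
To make the greedy step concrete, here is a minimal sketch of the one-step lookahead for a single state. The numbers are hypothetical values of the neighbouring states, and the sketch assumes that a move off the grid leaves the agent in place; `next_state_value`, `q` and `greedy_policy` are illustrative names, not part of the repo.

```python
GAMMA = 0.9
STEP_REWARD = -0.1

# Hypothetical values of the states reached from one cell, keyed by the action taken.
# "D" is assumed to leave the agent where it is (a move off the grid).
next_state_value = {"U": 0.80, "D": 0.62, "L": 0.46, "R": 0.46}

# One-step lookahead: Q(s, a) = r + gamma * V(s')
q = {a: STEP_REWARD + GAMMA * v for a, v in next_state_value.items()}

# argmax over a dict: the key with the largest Q-value
best_action = max(q, key=q.get)

# Same representation as the notebook: probability 1 for the greedy action, 0 otherwise
greedy_policy = {a: 1.0 if a == best_action else 0.0 for a in q}
print(best_action, greedy_policy)
```

Here the greedy action is "U", because it leads to the neighbour with the highest value.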

In [11]:
def policy_improvement(policy,V,gamma =1 ):
    import operator #to do the argmax with dicts
    
    new_policy = {}
    for s in grid.actions.keys():
        Qsa ={}
        new_policy[s] = {}
        for action in policy[s].keys():
            grid.set_state(s)
            reward = grid.move(action)
            
            Qsa[action] = reward + gamma*V[grid.current_state()]
            new_policy[s][action] = 0.0 #make sure all actions are in dictionary but only one will be best action
            
        state_best_action = max(Qsa.items(), key=operator.itemgetter(1))[0] # equivalent to argmax(Qsa) but for dicts
        new_policy[s][state_best_action] = 1.0 # the greedy action gets probability 1
        
    return new_policy

### Implement policy iteration by combining iterative policy evaluation and policy improvement

In [12]:
def policy_iteration(initial_policy,gamma=1,convergence_threshold = 0.1):
    policy_stable = False
    policy = initial_policy
    V = defaultdict(lambda :0)
    
    while not policy_stable:
        V = iterative_policy_evaluation(policy,gamma,convergence_threshold)
        improved_policy = policy_improvement(policy,V,gamma)
        
        
        if equal_policies(policy,improved_policy):
            policy_stable = True
            
        policy = improved_policy
        
        
    return policy,V
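
The same evaluate/improve loop can be tried without the `grid_world` module. The following is a self-contained sketch on a hypothetical toy problem, a corridor of states 0-1-2-3 where state 3 is the terminal goal; the environment, the constants and the helper names (`step`, `evaluate`, `improve`) are assumptions made for illustration and are not part of this repo.

```python
GAMMA = 0.9
THETA = 1e-4            # convergence threshold for policy evaluation
STATES = [0, 1, 2]      # non-terminal states
TERMINAL = 3            # goal state, value fixed at 0
ACTIONS = ["L", "R"]

def step(s, a):
    """Deterministic transition: return (next_state, reward)."""
    s2 = max(s - 1, 0) if a == "L" else s + 1
    reward = 1.0 if s2 == TERMINAL else -0.1
    return s2, reward

def evaluate(policy):
    """Iterative policy evaluation for a deterministic policy {state: action}."""
    V = {s: 0.0 for s in STATES + [TERMINAL]}
    while True:
        max_change = 0.0
        for s in STATES:
            s2, r = step(s, policy[s])
            new_value = r + GAMMA * V[s2]
            max_change = max(max_change, abs(new_value - V[s]))
            V[s] = new_value
        if max_change < THETA:
            return V

def improve(V):
    """Greedy policy with respect to V, using a one-step lookahead."""
    policy = {}
    for s in STATES:
        q = {}
        for a in ACTIONS:
            s2, r = step(s, a)
            q[a] = r + GAMMA * V[s2]
        policy[s] = max(q, key=q.get)
    return policy

def policy_iteration(policy):
    """Alternate evaluation and improvement until the policy stops changing."""
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:
            return policy, V
        policy = new_policy

# Start from the worst policy: always walk away from the goal
policy, V = policy_iteration({s: "L" for s in STATES})
print(policy)
print({s: round(v, 2) for s, v in V.items()})
```

Starting from "always go left", the loop ends with every state choosing "R". The discount factor is kept below 1 on purpose: with `gamma = 1`, a policy that never reaches the goal has values that decrease without bound, so evaluation would not converge.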

## Run policy iteration on the random initial policy

In [13]:
policy,V = policy_iteration(policy_probs,0.9,CONVERGENCE_THRESHOLD)

In [14]:
print("Learned policy is:")
det_policy ={}
for state in policy.keys():
    for action in policy[state].keys():
        if policy[state][action] == 1.0:
            det_policy[state] = action

print_policy(det_policy,grid)

Learned policy is:
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  U  |     |
---------------------------
  R  |  R  |  U  |  L  |


In [15]:
print("The state-values of learned  policy are:")
print_values(grid=grid,V=V)

The state-values of learned  policy are:
--------------------------
 0.62| 0.80| 1.00| 0.00|
--------------------------
 0.46| 0.00| 0.80| 0.00|
--------------------------
 0.31| 0.46| 0.62| 0.46|
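
These values can be checked by hand by backing up from the goal along the path the learned policy follows from (2, 0). A minimal sketch, assuming the rewards printed earlier (1 for entering the goal, -0.1 for every other move) and the discount factor 0.9 used in this run; `path` and `values` are illustrative names.

```python
GAMMA = 0.9
STEP_REWARD = -0.1
GOAL_REWARD = 1.0

# States visited by the learned policy, listed from the goal's neighbour back to (2, 0)
path = [(0, 2), (1, 2), (2, 2), (2, 1), (2, 0)]

values = {}
next_value = 0.0          # the terminal goal state has value 0
reward = GOAL_REWARD      # the first backup is the move into the goal
for state in path:
    values[state] = reward + GAMMA * next_value
    next_value = values[state]
    reward = STEP_REWARD  # every earlier move lands on a non-terminal state

for state, v in values.items():
    print(state, round(v, 2))
```

Rounded to two decimals, the results are 1.0, 0.8, 0.62, 0.46 and 0.31, matching the corresponding cells of the table above.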


## Conclusions
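* Policy iteration is a combination of iterative policy evaluation and policy improvement
* It converges to a stable policy (one that policy improvement no longer changes), which here leads every state towards the goal
* As expected, among the non-terminal states the highest value (1.00) belongs to the state next to the goal, and the lowest value (0.31) corresponds to the farthest state, (2, 0)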