# ODP

By keeping a proper map of what has been explored, and using our knowledge of the world physics, we managed to get a more efficient algorithm with our "Q-explorer". Here we will try to take even more advantage of this information.

If we think about the problem at hand, it is pretty clear that the agent is not exploring in an efficient way. For instance, if we have already managed to traverse our world in $n$ steps, then why should the agent care to explore any state that is at distance $n+1$ (or greater) from the origin? Any path going through such a state will necessarily be longer. More generally, we want our agent to consider going only through states that can lead to shorter paths.
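To make the pruning argument concrete, here is a minimal sketch. It assumes the agent can only move up, down, left and right, so the Manhattan distance is a lower bound on the number of steps between two states, and it assumes the final state is already known. The names `manhattan` and `worth_exploring` are hypothetical and are not part of the code base.

```python
def manhattan(a, b):
    """Number of up/down/left/right moves between two cells on an empty grid."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def worth_exploring(state, origin, goal, best_length):
    """True if some path through `state` could still beat `best_length` steps.

    The bound is optimistic: it pretends there are no walls or traps.
    """
    return manhattan(origin, state) + manhattan(state, goal) < best_length


origin, goal = (0, 0), (0, 4)
best_length = 8  # assumed length of the best path found so far

print(worth_exploring((1, 2), origin, goal, best_length))  # True: 3 + 3 = 6 < 8
print(worth_exploring((3, 2), origin, goal, best_length))  # False: 5 + 5 = 10
```

A state that fails this test can be ignored no matter what it contains, since even in the best case it cannot be part of a shorter path.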

It may seem that this would lead to a very complicated exploration policy; however, we can use our dynamic programming tools to make this a really simple problem. Suppose the agent has already completed a run of the world, so it knows where the final state is, as well as some of the walls and traps. This agent can generate an "optimistic" map of the world, i.e. a map where each unknown state is considered to be empty (which is the best case scenario for us), and then use this map and dynamic programming to "plan" a route to the final state. This route is the best possible route we can have, with the only caveat that it may not work, since while traversing this route the agent can hit a wall or a trap whenever it enters an unvisited state. However, whenever this happens the agent can simply update its map and its policy (again by using dynamic programming). Here is the algorithm.

<div class="alert alert-block alert-info">   
    <ul>
        <li> keep track of each visited state, walls, traps and the final state whenever we take an action </li>
        <li> use a generic exploration policy to complete a first run of the world </li>
        <li> loop until agent follows a planned path without incidents </li>
        <ul style="padding-bottom: 0;">
            <li> use the information gathered to generate an "optimistic" map of the world </li>
            <li> use dynamic programming to determine a policy in our "optimistic" map </li>
            <li> follow the policy. If something "unexpected" happens (i.e. we encounter a state different from what we have in the "optimistic" map): </li>
            <ul style="padding-bottom: 1;">
                <li> update map </li>
                <li> use dynamic programming to update policy </li>
            </ul>
        </ul>
    </ul>
</div>
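The loop above can be sketched in a few self-contained lines. This is only an illustration under simplifying assumptions: a breadth-first search stands in for dynamic programming (every step costs the same, so both give a shortest path), the world has walls but no traps, and the final state is assumed to be known already from a first run. The names `plan` and `run_episode` are hypothetical and do not come from the code base.

```python
from collections import deque

MOVES = [(1, 0), (-1, 0), (0, -1), (0, 1)]  # up, down, left, right


def plan(shape, known_walls, start, goal):
    """Shortest path on the optimistic map: every cell not in known_walls is free."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        for dr, dc in MOVES:
            nxt = (cell[0] + dr, cell[1] + dc)
            inside = 0 <= nxt[0] < shape[0] and 0 <= nxt[1] < shape[1]
            if inside and nxt not in known_walls and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    raise ValueError("goal is unreachable on the current map")


def run_episode(shape, true_walls, known_walls, start, goal):
    """Follow the optimistic plan, replanning whenever a hidden wall blocks it.

    Returns the number of steps taken and the number of surprises.
    """
    state, steps, surprises = start, 0, 0
    path = plan(shape, known_walls, state, goal)
    while state != goal:
        nxt = path[1]
        steps += 1
        if nxt in true_walls:  # bumped into an unknown wall: we stay in place
            known_walls.add(nxt)  # update map
            surprises += 1
            path = plan(shape, known_walls, state, goal)  # update plan
        else:
            state = nxt
            path = path[1:]
    return steps, surprises


shape, start, goal = (4, 5), (0, 0), (0, 4)
true_walls = {(0, 1), (1, 1), (2, 3)}  # hidden from the agent
known_walls = set()

# loop until the agent follows a planned path without incidents
while True:
    steps, surprises = run_episode(shape, true_walls, known_walls, start, goal)
    print(f"steps={steps}, surprises={surprises}")
    if surprises == 0:
        break
```

Each surprise adds one wall to the map, so the loop must stop after finitely many episodes, and the last episode follows a path that is a shortest path in the real world as well.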

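This is a somewhat simple algorithm; however, it has some nice properties. For instance, it has a halt condition, at which point we are guaranteed to have an optimal path: a run without incidents follows a path that is optimal in the optimistic map, and no path in the real world can be shorter than that. It also doesn't make unnecessary explorations, i.e. explore things that couldn't possibly help it improve its performance.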

One problem that we haven't addressed so far is the size of the optimistic world, since the agent might not know the size of the actual world. One simple way is to establish an "upper bound" that we know the optimal path must lie inside. For instance, if the agent completed the exploration run by visiting $n$ unique states, then the optimal path can't go farther than $n$ from the starting position. This means we could consider our optimistic world to have size $(2n+1, 2n+1)$ and have the starting position in the middle. This is by no means an optimal estimate, but it would be enough to guarantee a solution. In our case, however, just to make things simple, we will pass the size of the optimistic world as a parameter to our agent.
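As a small illustration of this bound, the sketch below computes the shape of such an optimistic world and the cell where the starting position would be placed. The helper names `optimistic_world_bounds` and `to_grid` are hypothetical; the notebook itself does not use them and simply hard-codes the size.

```python
def optimistic_world_bounds(n_visited):
    """Shape of the optimistic world and the cell of the starting position,
    given that the first run visited `n_visited` unique states."""
    side = 2 * n_visited + 1
    return (side, side), (n_visited, n_visited)


def to_grid(offset, start_cell):
    """Translate a position relative to the start into grid coordinates."""
    return (start_cell[0] + offset[0], start_cell[1] + offset[1])


shape, start_cell = optimistic_world_bounds(3)
print(shape, start_cell)              # (7, 7) (3, 3)
print(to_grid((-3, 2), start_cell))   # (0, 5)
```

Any state within $n$ steps of the start, in any direction, gets non-negative coordinates inside this grid.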

Since this algorithm is pretty different from the others, this notebook will be a step-by-step implementation of it (similar in spirit to the one in the code base). We will analyze its performance in other notebooks.

In [1]:
import sys 
sys.path.append('../../..')

import numpy as np
import matplotlib.pyplot as plt

from grid_world.action import Action
from grid_world.grid_world import GridWorld
from grid_world.agents.q_explorer_agent import QExplorerAgent
from grid_world.visualization.format_objects import get_policy_rec_str, get_policy_eval_str, get_world_str
from grid_world.utils.returns import returns_from_reward
from grid_world.utils.policy import get_policy_rec, get_random_policy, sample_action

np.random.seed(21)

In [2]:
gworld = GridWorld(
    grid_shape=(4,5), 
    terminal_states_coordinates=((0,4),),
    walls_coordinates=((0,1), (1,1), (2,3)),
    traps_coordinates=((1,3),),
)
print(get_world_str(gworld))

3               

2          █    

1    █     ☠    

0 ⚐  █        ✘ 

  0  1  2  3  4 


In [3]:
from dynamic_programing.policy_improvement import dynamic_programing_gpi


# let's restrict the available actions to the four basic moves
actions = [Action.up, Action.down, Action.left, Action.right]

def r(effect):
    if effect == -1:
        return -100
    elif effect == 1:
        return 0
    else:
        return -1
    
rewards_dict = {(s, a): r(gworld.take_action(s, a)[1]) for s in gworld.states for a in actions}
rewards = lambda x, y: rewards_dict[(x, y)]

## Agent

### discovery run

In [5]:
from typing import Final, Collection

from grid_world.action import Action
from grid_world.grid_world import GridWorld
from grid_world.state import State
from grid_world.type_aliases import Policy, RewardFunction, Q
from grid_world.utils.evaluators import best_q_value
from grid_world.utils.policy import (
    get_random_policy,
    sample_action,
    get_explorer_policy,
)
from grid_world.utils.returns import returns_from_reward
from utils.operations import add_tuples
from grid_world.agents.world_map import WorldMap

In [6]:
class BasicAgent:
    def __init__(
        self,
        reward_function: RewardFunction,
        actions: Collection[Action] = None,
        policy: Policy = None,
        gamma: float = 1,
        alpha: float = 0.1,
        epsilon: float = 0.1,
    ):
        self.reward_function: Final = reward_function
        self.actions: Final = actions if actions is not None else tuple(Action)
        # use the policy passed in (not the Policy type alias), or fall back to a random one
        self.policy = policy if policy is not None else get_random_policy(self.actions)
        self.gamma = gamma
        self.alpha = alpha
        self.epsilon = epsilon
        self.world_map: set[State] = set()
            
    def update_world_map(self, state, action, new_state):
        if new_state == state:
            self.world_map.add(
                State(add_tuples(state.coordinates, action.direction), "wall")
            )
        else:
            self.world_map.add(new_state)

    
def run_random_episode(
    agent, world, max_steps = 1000000
):
    
    state = world.initial_state
    episode_terminated = False
    episode_states = [state]
    episode_actions = []
    episode_rewards = []

    for _ in range(max_steps):
        action = sample_action(agent.policy, state, agent.actions)
        new_state, effect = world.take_action(state, action)
        reward = agent.reward_function(effect)
        agent.update_world_map(state, action, new_state)

        episode_actions.append(action)
        episode_states.append(new_state)
        episode_rewards.append(reward)

        if new_state.kind == "terminal":
            episode_terminated = True
            break
            
        state = new_state

    return episode_terminated, episode_states, episode_actions, episode_rewards

In [7]:
agent = BasicAgent(r, actions)
episode_terminated, episode_states, episode_actions, episode_rewards = run_random_episode(agent, gworld)
len(episode_states)

26

### determine optimistic world and policy

In [8]:
def get_state_by_kind(kind, world_map, world_size):
    return tuple(
        a.coordinates for a in world_map if (
            a.kind == kind and all(0 <= x < world_size for x in a.coordinates)
        )
    )
            
get_state_by_kind("terminal", agent.world_map, 14)

((0, 4),)

In [9]:
def build_opt_world(world_size, agent):
    return GridWorld(
        grid_shape=(world_size, world_size), 
        terminal_states_coordinates=get_state_by_kind("terminal", agent.world_map, world_size),
        walls_coordinates=get_state_by_kind("wall", agent.world_map, world_size),
        traps_coordinates=get_state_by_kind("trap", agent.world_map, world_size),
    )
optimistic_world = build_opt_world(6, agent)
print(get_world_str(optimistic_world))

5                  

4    █             

3                  

2          █       

1                  

0 ⚐  █        ✘    

  0  1  2  3  4  5 


In [10]:
def get_world_model(world):
    return lambda s, a: lambda x: 1 if x == world.take_action(s, a)[0] else 0

def build_gpi_policy(world, r_map, actions):
    world_model = get_world_model(world)
    
    rewards_dict = {(s, a): r_map(world.take_action(s, a)[1]) 
                    for s in world.states
                    for a in actions
                    }
    rewards = lambda x, y: rewards_dict[(x, y)]

    policy, _ = dynamic_programing_gpi(
        world_model=world_model,
        reward_function=rewards,
        actions=actions,
        states=world.states,
    )
    return policy

policy = build_gpi_policy(optimistic_world, r, actions)

policy converged in 1 epochs


In [11]:
pi_r = get_policy_rec(policy, optimistic_world, actions)
print(get_policy_rec_str(pi_r, optimistic_world))

 ↓  →  ↓  ↓  ↓  ↓ 

 ↓  █  ↓  ↓  ↓  ↓ 

 ↓  ↓  ↓  →  ↓  ↓ 

 ↓  ↓  ↓  █  ↓  ↓ 

 →  →  ↓  ↓  ↓  ↓ 

 ↑  █  →  →  ✘  ← 




### improved run

In [12]:
def run_opt_episode(
    agent, world, max_steps = 1000000
):
    
    state = world.initial_state
    episode_terminated = False
    episode_states = [state]
    episode_actions = []
    episode_rewards = []
    
    optimistic_world = build_opt_world(6, agent)
    policy_rec = get_policy_rec(agent.policy, optimistic_world, agent.actions)

    for _ in range(max_steps):
        action = policy_rec[state]
        new_state, effect = world.take_action(state, action)
        reward = agent.reward_function(effect)
        agent.update_world_map(state, action, new_state)

        episode_actions.append(action)
        episode_states.append(new_state)
        episode_rewards.append(reward)

        if new_state.kind == "terminal":
            episode_terminated = True
            break
            
        # check if the policy is going well; if not, we update our optimistic map and then our policy
        if new_state == state or new_state.kind == "trap":
            optimistic_world = build_opt_world(6, agent)
            agent.policy = build_gpi_policy(optimistic_world, agent.reward_function, agent.actions)
            policy_rec = get_policy_rec(agent.policy, optimistic_world, agent.actions)

        state = new_state

    return episode_terminated, episode_states, episode_actions, episode_rewards

In [13]:
agent.policy = policy
episode_terminated, episode_states, episode_actions, episode_rewards = run_opt_episode(agent, gworld)
len(episode_states)

policy converged in 1 epochs


10

In [14]:
optimistic_world = build_opt_world(6, agent)
pi_r = get_policy_rec(agent.policy, optimistic_world, actions)
print(get_policy_rec_str(pi_r, optimistic_world))

 ↓  →  ↓  ↓  ↓  ↓ 

 ↓  █  ↓  ↓  ↓  ↓ 

 ↓  ↓  ↓  →  ↓  ↓ 

 →  →  ↓  █  ↓  ↓ 

 ↑  █  ↓  ↓  ↓  ↓ 

 ↑  █  →  →  ✘  ← 




In [15]:
print(get_policy_rec_str(pi_r, gworld))

 ↓  ↓  ↓  →  ↓ 

 →  →  ↓  █  ↓ 

 ↑  █  ↓  ☠  ↓ 

 ↑  █  →  →  ✘ 




### second improved run

In [16]:
episode_terminated, episode_states, episode_actions, episode_rewards = run_opt_episode(agent, gworld)
len(episode_states)

9

In [17]:
optimistic_world = build_opt_world(6, agent)
pi_r = get_policy_rec(agent.policy, optimistic_world, actions)
print(get_policy_rec_str(pi_r, optimistic_world))

 ↓  →  ↓  ↓  ↓  ↓ 

 ↓  █  ↓  ↓  ↓  ↓ 

 ↓  ↓  ↓  →  ↓  ↓ 

 →  →  ↓  █  ↓  ↓ 

 ↑  █  ↓  ↓  ↓  ↓ 

 ↑  █  →  →  ✘  ← 




Note that although the agent has not found many of the walls or the trap, it has already found an optimal path (in only 2 runs!). So from here on it will just follow this path, without doing unnecessary exploration.