# ODP

By keeping a proper map of what has been explored, and using our knowledge of the world physics, we managed to get a more efficient algorithm in our "Q-explorer". Here we will take this to another level, and try to solve the problem as efficiently as we can. Just as with Q-explorer, **we will assume a deterministic setting for this agent**.

If we think about the problem at hand, it is pretty clear that the agent is not exploring in an efficient way. For instance, if we have already managed to traverse our world in $n$ steps, then why should the agent care to explore any state that is at a distance of $n+1$ (or greater) from the origin? Any path going through such a state will necessarily be longer. More generally, we want our agent to consider only states that can lead to shorter paths.
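
As a rough illustration of this pruning idea, here is a minimal sketch. It assumes the agent moves one cell at a time in four directions, so the Manhattan distance is a lower bound on the number of steps needed to reach a state. The names `manhattan` and `worth_exploring` are hypothetical and are not part of the code base.

```python
from typing import Tuple

Coord = Tuple[int, int]


def manhattan(a: Coord, b: Coord) -> int:
    # Lower bound on the number of unit moves needed to go from a to b.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def worth_exploring(state: Coord, origin: Coord, best_len: int) -> bool:
    # A state farther than best_len from the origin cannot lie on a
    # path shorter than the one already found, so it can be ignored.
    return manhattan(origin, state) <= best_len


print(worth_exploring((3, 4), (0, 0), 8))
print(worth_exploring((5, 4), (0, 0), 8))
```

The first call looks at a state 7 moves away and keeps it; the second looks at a state 9 moves away and discards it.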

It may seem that this would lead to a very complicated exploration policy, but we can use our dynamic programming tools to make this a really simple problem. Suppose the agent has already completed a run of the world, so it knows where the final state is, as well as some of the walls and traps. This agent can generate an "optimistic" map of the world, i.e. a map where each unknown state is considered to be empty (which is the best case scenario for us), and then use this map and dynamic programming to "plan" a route to the final state. This route is the best possible route we can have, with the only caveat that it may not work, since while traversing this route the agent can hit a wall or a trap whenever it enters an unvisited state. However, whenever this happens the agent can simply update its map and its policy (again by using dynamic programming). Here is an outline of the algorithm.


> <ul>
>   <li> use a generic exploration policy to complete a first run of the world, and generate a map (which may be incomplete) </li>
>   <li> use dynamic programming to determine a greedy policy in our map; unknown states are set as empty (the "optimistic" assumption) </li>
>   <li> loop until agent passes an episode without incidents </li>
>   <ul style="padding-bottom: 0;">
>        <li> follow the policy. If something "unexpected" happens (i.e. we encounter a state different from what we have in the "optimistic" map): </li>
>        <ul style="padding-bottom: 1;">
>            <li> update map </li>
>            <li> use dynamic programming to update policy </li>
>        </ul>
>    </ul>
> </ul>


This is a somewhat simple algorithm, but it has some nice properties. For instance, it has a halt condition, at which point we are guaranteed to have an optimal path. It also doesn't make unnecessary explorations, i.e. explore things that couldn't possibly help it improve its performance.
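
The outline above can be sketched in a self-contained way on a toy grid. This is only an illustration under several assumptions, not the code base implementation: the goal position is known from the start (so the discovery run is skipped), a backward breadth-first search stands in for dynamic programming (all steps cost the same, so both give shortest paths), known traps are treated as blocked cells when planning, and traps send the agent back to the start. The names `plan`, `take_action` and `run_odp` are hypothetical.

```python
from collections import deque

# Moves as (dx, dy); coordinates are (column, row) with row 0 at the bottom.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def plan(shape, goal, walls, traps):
    """Shortest-path policy on the optimistic map (backward BFS from the goal)."""
    width, height = shape
    blocked = walls | traps
    policy, seen, queue = {}, {goal}, deque([goal])
    while queue:
        cx, cy = queue.popleft()
        for name, (dx, dy) in MOVES.items():
            prev = (cx - dx, cy - dy)  # taking `name` from prev lands on (cx, cy)
            inside = 0 <= prev[0] < width and 0 <= prev[1] < height
            if inside and prev not in seen and prev not in blocked:
                seen.add(prev)
                policy[prev] = name
                queue.append(prev)
    return policy


def take_action(world, state, move):
    """Toy deterministic world: walls and borders block, traps reset to start."""
    dx, dy = MOVES[move]
    target = (state[0] + dx, state[1] + dy)
    width, height = world["shape"]
    inside = 0 <= target[0] < width and 0 <= target[1] < height
    if not inside or target in world["walls"]:
        return state, "wall", target
    if target in world["traps"]:
        return world["start"], "trap", target
    return target, "ok", target


def run_odp(world, max_episodes=50):
    known_walls, known_traps = set(), set()
    lengths = []
    for _ in range(max_episodes):
        policy = plan(world["shape"], world["goal"], known_walls, known_traps)
        state, steps, incidents = world["start"], 0, 0
        while state != world["goal"]:
            state, effect, target = take_action(world, state, policy[state])
            steps += 1
            if effect != "ok":
                # Something unexpected happened: update the map, then replan.
                incidents += 1
                (known_walls if effect == "wall" else known_traps).add(target)
                policy = plan(world["shape"], world["goal"], known_walls, known_traps)
        lengths.append(steps)
        if incidents == 0:  # halt condition: an episode without incidents
            break
    return lengths


world = {
    "shape": (4, 5), "start": (0, 0), "goal": (0, 4),
    "walls": {(0, 1), (1, 1), (2, 3)}, "traps": {(1, 3)},
}
lengths = run_odp(world)
print(lengths)
```

The toy world mirrors the layout of `small_world_01` used below. The lengths here count steps rather than visited states, so the final entry of 8 steps corresponds to an episode with 9 states.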

One problem that we haven't addressed so far is the size of the optimistic world, since the agent might not know the size of the actual world. One simple way is to establish an "upper bound" inside which we know the optimal path must lie. For instance, if the agent completed the exploration run by visiting $n$ unique states, then the optimal path can't go farther than $n$ from the starting position. This means we could consider our optimistic world to have size $(2n+1, 2n+1)$ and have the starting position in the middle. This is by no means an optimal estimate, but it would be enough to guarantee a solution. In our case, however, just to keep things simple, we will pass the size of the optimistic world as a parameter to our agent.
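
For concreteness, the bound described above could be computed as follows. This is a small sketch; `optimistic_world_bounds` is a hypothetical helper that the notebook does not use, since the size is passed in as a parameter instead.

```python
def optimistic_world_bounds(n_unique_states: int):
    # The optimal path cannot stray more than n cells from the start in any
    # direction, so a (2n + 1) x (2n + 1) grid centred on the start suffices.
    side = 2 * n_unique_states + 1
    shape = (side, side)
    start = (n_unique_states, n_unique_states)
    return shape, start


shape, start = optimistic_world_bounds(3)
print(shape, start)
```

With 3 unique states visited, this gives a 7 by 7 optimistic world with the start at `(3, 3)`.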

Another major drawback of this agent is how computationally expensive it is, since it has to solve multiple DP problems to work.

Since this algorithm is pretty different from the others, this notebook will be a step-by-step implementation of it (similar in spirit to the implementation in the code base). We will look at its performance in other notebooks, but in truth there really isn't much to analyze: there are no hyperparameters or anything, it just kinda solves the problem (although there is definitely room for improvement, especially in the first episode).

Extending this agent to a non-deterministic world could be simple, if we gave it the ability to determine the wind at any given state.

In [1]:
import sys

sys.path.append("../..")

import numpy as np
import matplotlib.pyplot as plt

from exploring_agents.grid_world_agents import ODPAgent
from exploring_agents.training import run_episode, train_agent
from grid_world.action import Action
from grid_world.grid_world import GridWorld
from grid_world.visualization.format_objects import (
    get_policy_rec_str,
    get_policy_eval_str,
    get_world_str,
)
from utils.returns import returns_from_reward
from utils.policy import get_policy_rec, get_random_policy, sample_action
from notebooks.utils.worlds import small_world_01
from notebooks.utils.basics import basic_actions, basic_reward
from dynamic_programing.policy_improvement import dynamic_programing_gpi


np.random.seed(21)

In [2]:
actions = basic_actions
rewards = basic_reward
gworld = small_world_01
print(get_world_str(gworld))

4 ✘          

3    ☠  █    

2            

1 █  █       

0 ⚐          

  0  1  2  3 


## Agent

### discovery run

In [3]:
from typing import Final, Collection

from grid_world.action import Action
from grid_world.grid_world import GridWorld
from grid_world.state import State
from grid_world.type_aliases import Policy, RewardFunction, Q
from grid_world.utils.evaluators import best_q_value
from grid_world.utils.policy import (
    get_random_policy,
    sample_action,
    get_explorer_policy,
)
from grid_world.utils.returns import returns_from_reward
from utils.operations import add_tuples
from grid_world.agents.commons.world_map import WorldMap

In [4]:
class BasicAgent:
    def __init__(self, reward_function, actions=None):
        self.reward_function: Final = reward_function
        self.actions: Final = actions if actions is not None else tuple(Action)
        self.world_map: set[State] = set()
        self.policy = get_random_policy(self.actions)

    def update_world_map(self, state, action, new_state):
        if (new_state == state) and state.kind != "terminal":
            self.world_map.add(
                State(add_tuples(state.coordinates, action.direction), "wall")
            )
        else:
            self.world_map.add(new_state)


def run_random_episode(agent, world, max_steps=1000000):

    state = world.initial_state
    episode_terminated = False
    episode_states = [state]
    episode_actions = []
    episode_rewards = []

    for _ in range(max_steps):
        action = sample_action(agent.policy, state, agent.actions)
        new_state, effect = world.take_action(state, action)
        reward = agent.reward_function(effect)
        agent.update_world_map(state, action, new_state)

        episode_actions.append(action)
        episode_states.append(new_state)
        episode_rewards.append(reward)

        if new_state.kind == "terminal":
            episode_terminated = True
            break

        state = new_state

    return episode_terminated, episode_states, episode_actions, episode_rewards

In [5]:
agent = BasicAgent(rewards, actions)
(
    episode_terminated,
    episode_states,
    episode_actions,
    episode_rewards,
) = run_random_episode(agent, gworld)
len(episode_states)

107

### determine optimistic world and policy

In [6]:
def get_state_by_kind(kind, world_map, world_size):
    # coordinates of every known state of this kind that fits inside the optimistic world
    return tuple(
        a.coordinates
        for a in world_map
        if (a.kind == kind and all(0 <= x < world_size for x in a.coordinates))
    )


get_state_by_kind("terminal", agent.world_map, 14)

((0, 4),)

In [7]:
def build_opt_world(world_size, agent):
    return GridWorld(
        grid_shape=(world_size, world_size),
        terminal_states_coordinates=get_state_by_kind(
            "terminal", agent.world_map, world_size
        ),
        walls_coordinates=get_state_by_kind("wall", agent.world_map, world_size),
        traps_coordinates=get_state_by_kind("trap", agent.world_map, world_size),
    )


world_size = 6
optimistic_world = build_opt_world(world_size, agent)
print(get_world_str(optimistic_world))

5       █          

4 ✘                

3       █     █    

2             █    

1 █  █        █    

0 ⚐           █    

  0  1  2  3  4  5 


In [8]:
def get_world_model(world):
    return lambda s, a: lambda x: 1 if x == world.take_action(s, a)[0] else 0


def build_gpi_policy(world, r_map, actions):
    world_model = get_world_model(world)

    rewards_dict = {
        (s, a): r_map(world.take_action(s, a)[1]) for s in world.states for a in actions
    }
    rewards = lambda x, y: rewards_dict[(x, y)]

    policy, _ = dynamic_programing_gpi(
        world_model=world_model,
        reward_function=rewards,
        actions=actions,
        states=world.states,
    )
    return policy


policy = build_gpi_policy(optimistic_world, rewards, actions)

In [9]:
pi_r = get_policy_rec(policy, optimistic_world, actions)
print(get_policy_rec_str(pi_r, optimistic_world))

 ↓  ↓  █  ↓  ↓  ↓ 

 ✘  ←  ←  ←  ←  ← 

 ↑  ↑  █  ↑  █  ↑ 

 ↑  ↑  ←  ↑  █  ↑ 

 █  █  ↑  ↑  █  ↑ 

 →  →  ↑  ↑  █  ↑ 




Notice that since the agent isn't aware of the trap at (1,3), it assumes it can go through this square, since that would lead to one of the shortest paths. However, this should be corrected on the next run.

### improved run

In [10]:
def run_opt_episode(agent, world, max_steps=1000000):

    state = world.initial_state
    episode_terminated = False
    episode_states = [state]
    episode_actions = []
    episode_rewards = []

    optimistic_world = build_opt_world(world_size, agent)
    policy_rec = get_policy_rec(agent.policy, optimistic_world, agent.actions)

    for _ in range(max_steps):
        action = policy_rec[state]
        new_state, effect = world.take_action(state, action)
        reward = agent.reward_function(effect)
        agent.update_world_map(state, action, new_state)

        episode_actions.append(action)
        episode_states.append(new_state)
        episode_rewards.append(reward)

        if new_state.kind == "terminal":
            episode_terminated = True
            break

        # check if the policy is going well; if not, update the optimistic map and then the policy
        if new_state == state or new_state.kind == "trap":
            optimistic_world = build_opt_world(world_size, agent)
            agent.policy = build_gpi_policy(
                optimistic_world, agent.reward_function, agent.actions
            )
            policy_rec = get_policy_rec(agent.policy, optimistic_world, agent.actions)

        state = new_state

    return episode_terminated, episode_states, episode_actions, episode_rewards

In [11]:
agent.policy = policy
episode_terminated, episode_states, episode_actions, episode_rewards = run_opt_episode(
    agent, gworld
)
len(episode_states)

16

In [12]:
optimistic_world = build_opt_world(world_size, agent)
pi_r = get_policy_rec(agent.policy, optimistic_world, actions)
print(get_policy_rec_str(pi_r, optimistic_world))

 ↓  ↓  █  ↓  ↓  ↓ 

 ✘  ←  ←  ←  ←  ← 

 ↑  ☠  █  ↑  █  ↑ 

 ↑  ←  ←  ↑  █  ↑ 

 █  █  ↑  ↑  █  ↑ 

 →  →  ↑  ↑  █  ↑ 




In [13]:
print(get_policy_rec_str(pi_r, gworld))

 ✘  ←  ←  ← 

 ↑  ☠  █  ↑ 

 ↑  ←  ←  ↑ 

 █  █  ↑  ↑ 

 →  →  ↑  ↑ 




### second optimized run

In [14]:
episode_terminated, episode_states, episode_actions, episode_rewards = run_opt_episode(
    agent, gworld
)
len(episode_states)

9

In [15]:
optimistic_world = build_opt_world(world_size, agent)
pi_r = get_policy_rec(agent.policy, optimistic_world, actions)
print(get_policy_rec_str(pi_r, optimistic_world))

 ↓  ↓  █  ↓  ↓  ↓ 

 ✘  ←  ←  ←  ←  ← 

 ↑  ☠  █  ↑  █  ↑ 

 ↑  ←  ←  ↑  █  ↑ 

 █  █  ↑  ↑  █  ↑ 

 →  →  ↑  ↑  █  ↑ 




Note that although the agent has not found all of the walls, it has already found an optimal path (in only 2 runs!). So from here on it will just follow this path, without doing unnecessary explorations.

# Codebase Agent

Let's take a quick look at the proper agent we implemented in the codebase.

In [3]:
agent = ODPAgent(
    reward_function=basic_reward, actions=basic_actions, world_shape=(6, 6)
)


episode_lengths, episode_returns = train_agent(agent=agent, world=gworld, episodes=5)
pi_r = get_policy_rec(agent.policy, gworld, agent.actions)

print(episode_lengths)
print(get_policy_rec_str(pi_r, gworld))

[107, 16, 9, 9, 9]
 ✘  ←  ←  ← 

 ↑  ☠  █  ↑ 

 ↑  ←  ←  ↑ 

 █  █  ↑  ↑ 

 →  →  ↑  ↑ 




If we pass the terminal state's coordinates to the agent, it can do even better. It learns the path on the first run, which isn't very long in the first place!

In [4]:
agent = ODPAgent(
    reward_function=basic_reward,
    actions=basic_actions,
    world_shape=(4, 5),
    terminal_coordinates=(0, 4),
)


episode_lengths, episode_returns = train_agent(agent=agent, world=gworld, episodes=5)
pi_r = get_policy_rec(agent.policy, gworld, agent.actions)

print(episode_lengths)
print(get_policy_rec_str(pi_r, gworld))

[19, 9, 9, 9, 9]
 ✘  ←  ←  ← 

 ↑  ☠  █  ↑ 

 ↑  ←  ←  ↑ 

 █  █  ↑  ↑ 

 →  →  ↑  ↑ 


