# Wind World

In this notebook, we will use dynamic programming to find optimal policies in a stochastic world. We will use 'wind' as a device to introduce some randomness; after each action, there is a chance the agent will be pushed either up or to the right by the wind.

We will see that the same algorithm used in the deterministic setting works just fine, and also that different choices of rewards lead to different optimal policies.

In [1]:
import sys

sys.path.append("../..")

import pandas as pd
import numpy as np

from dynamic_programing.policy_improvement import dynamic_programing_gpi
from grid_world.grid_world import GridWorld
from grid_world.visualization.format_objects import (
    get_policy_rec_str,
    get_policy_eval_str,
    get_world_str,
)
from grid_world.utils.policy import get_policy_rec
from grid_world.action import Action

np.random.seed(12)

# Our World

In [2]:
def wind(x: tuple[int, int]) -> Action:
    n0 = np.random.uniform()
    if n0 < 0.05:
        return Action.right
    elif n0 < 0.1:
        return Action.up
    else:
        return Action.wait


gworld = GridWorld(
    grid_shape=(5, 6),
    terminal_states_coordinates=((0, 5),),
    walls_coordinates=((0, 1), (1, 1), (2, 3), (3, 3)),
    traps_coordinates=((1, 3),),
    wind=wind,
)
print(get_world_str(gworld))

5 ✘             

4               

3    ☠  █  █    

2               

1 █  █          

0 ⚐             

  0  1  2  3  4 


This is the world we will be considering. Our goal is to reach the terminal state as fast as possible while avoiding the trap. If this looks strange to you, please refer to the readme file for more details.
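To make the wind's behaviour concrete, here is a small self-contained sketch that samples a copy of the `wind` function many times and prints the observed frequencies. The string constants stand in for the `Action` enum and `wind_sketch` is a hypothetical name, both introduced only so the example runs without the `grid_world` package.

```python
import numpy as np

# Stand-ins for the notebook's Action enum (an assumption for this sketch).
RIGHT, UP, WAIT = "right", "up", "wait"


def wind_sketch(rng: np.random.Generator) -> str:
    """Same thresholds as `wind`: 5% right, 5% up, 90% no push."""
    n0 = rng.uniform()
    if n0 < 0.05:
        return RIGHT
    elif n0 < 0.1:
        return UP
    return WAIT


rng = np.random.default_rng(0)
n_samples = 100_000
draws = [wind_sketch(rng) for _ in range(n_samples)]

# Relative frequency of each wind outcome.
wind_freq = {a: draws.count(a) / n_samples for a in (RIGHT, UP, WAIT)}
print(wind_freq)
```

The frequencies should come out close to 0.05, 0.05 and 0.90, which is the noise level every action in this world is subject to.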

# World Modeling

Here we still need a model of the world to solve this problem with dynamic programming. Remember:

$$ M_w: S \times A \to \mathbb{P}(S) $$

where $M_w$ gives, for each state-action pair $(s,a)$, a probability distribution over the states $S$: the probability of moving to each state when taking action $a$ in state $s$. This means that $M_w(s,a): S \to [0, 1]$ is itself a function, and $M_w(s,a)(s_0)$ is the probability of reaching $s_0$ when taking action $a$ in state $s$.
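As an illustration of this "function that returns a function" shape, the sketch below builds $M_w$ by hand for a hypothetical grid with no walls, traps, or borders. It assumes the wind acts after the move has landed, as described in the introduction; `toy_model`, the coordinate tuples, and the string actions are illustrative names, not part of the `grid_world` package.

```python
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def toy_model(s, a):
    """Return M_w(s, a) as a function S -> [0, 1] on an unbounded toy grid."""
    x, y = s
    dx, dy = MOVES[a]
    lx, ly = x + dx, y + dy  # cell reached before the wind acts
    dist = {
        (lx, ly): 0.90,      # no wind
        (lx + 1, ly): 0.05,  # pushed right
        (lx, ly + 1): 0.05,  # pushed up
    }
    return lambda s0: dist.get(s0, 0.0)


p_up = toy_model((0, 0), "up")
print(p_up((0, 1)), p_up((1, 1)), p_up((0, 2)), p_up((3, 3)))
```

Calling `toy_model(s, a)` fixes the state-action pair, and the returned function is then queried with candidate next states; the probabilities over all next states sum to one.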

Since we are no longer in a deterministic setting, we will estimate these values by sampling. To do so we query the world freely, estimating for each pair $(s,a)$ the probability of moving to each state $s_0$. This is something an agent would not be able to do in a reinforcement learning setting.

In [3]:
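actions = [Action.up, Action.down, Action.left, Action.right]
mw_dict = {}

# Estimate M_w(s, a) empirically: sample each (state, action) pair many times
# and record the relative frequency of every resulting state.
iterations_per_case = 10000
increment = 1 / iterations_per_case
for s in gworld.states:
    for a in actions:
        psa = {s0: 0 for s0 in gworld.states}
        for _ in range(iterations_per_case):
            fs = gworld.take_action(s, a)[0]  # first element: the resulting state
            psa[fs] = psa[fs] + increment
        mw_dict[(s, a)] = psa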


def world_model(s, a):
    return lambda s0: mw_dict[(s, a)][s0]


world_model(gworld.get_state((0, 0)), Action.up)(gworld.get_state((1, 0)))

0.04670000000000031
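A quick sanity check on this estimate, as a rough sketch. It assumes that walls block movement, so moving up from (0, 0) into the wall at (0, 1) leaves the agent in place and only a push to the right (probability 0.05) can take it to (1, 0). Under that assumption the estimate is a binomial proportion, and its standard error tells us how far from 0.05 we should expect it to land.

```python
import math

p_true = 0.05   # assumed true probability: the wind pushes right
n = 10_000      # iterations_per_case used for the estimate
p_hat = 0.0467  # estimate printed above, rounded

# Standard error of a binomial proportion estimated from n samples.
std_err = math.sqrt(p_true * (1 - p_true) / n)
z_score = (p_hat - p_true) / std_err
print(f"standard error: {std_err:.5f}, z-score: {z_score:.2f}")
```

The estimate sits roughly one and a half standard errors below 0.05, which is ordinary sampling noise. The trailing digits in the printed value come from adding `increment` repeatedly in floating point, not from the sampling itself.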

# Rewards and Policy

Let's define some reward functions and create some optimal policies through dynamic programming.

I'll use the dynamic programming policy optimization algorithm implemented in the `dynamic_programing` module to avoid code repetition. It follows the exact same ideas as the deterministic notebook.
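The internals of `dynamic_programing_gpi` are not shown in this notebook. As a rough illustration of how dynamic programming handles a stochastic model, here is a minimal value iteration (one instance of generalized policy iteration; the module's actual implementation may differ) on a hypothetical four-cell corridor. It uses the same interface as above: `model(s, a)(s0)` is a probability and `rewards(s, a)` is a number. All names in the sketch are illustrative.

```python
def corridor_model(s, a):
    """Toy M_w(s, a): move one cell, then a 10% wind pushes one cell right."""
    landed = min(max(s + (1 if a == "right" else -1), 0), 3)
    pushed = min(landed + 1, 3)
    dist = {}
    dist[landed] = dist.get(landed, 0.0) + 0.9
    dist[pushed] = dist.get(pushed, 0.0) + 0.1
    return lambda s0: dist.get(s0, 0.0)


def step_cost(s, a):
    """Reward of -1 for every action, so shorter expected paths score higher."""
    return -1.0


def q_value(model, rewards, states, v, s, a):
    """Expected return of taking a in s, then following the values in v."""
    p = model(s, a)
    return rewards(s, a) + sum(p(s0) * v[s0] for s0 in states)


def value_iteration(model, rewards, actions, states, terminal,
                    tol=1e-9, max_sweeps=10_000):
    v = {s: 0.0 for s in states}
    live = [s for s in states if s not in terminal]
    for _ in range(max_sweeps):
        delta = 0.0
        for s in live:
            # Bellman optimality backup: expectation over next states.
            best = max(q_value(model, rewards, states, v, s, a) for a in actions)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    pi = {
        s: max(actions, key=lambda a: q_value(model, rewards, states, v, s, a))
        for s in live
    }
    return pi, v


toy_states = [0, 1, 2, 3]
toy_pi, toy_v = value_iteration(
    corridor_model, step_cost, ["left", "right"], toy_states, terminal={3}
)
print(toy_pi)
print({s: round(x, 2) for s, x in toy_v.items()})
```

The only change from the deterministic case is that the backup sums over all possible next states, weighted by their probabilities, instead of looking at a single successor.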

Let's start with the same reward as in the deterministic notebook:

In [4]:
def r(effect):
    if effect == -1:
        return -100
    elif effect == 1:
        return 0
    else:
        return -1


rewards_dict = {
    (s, a): r(gworld.take_action(s, a)[1]) for s in gworld.states for a in actions
}


def rewards(x, y):
    return rewards_dict[(x, y)]

In [5]:
pi, v_pi = dynamic_programing_gpi(world_model, rewards, actions, gworld.states)
pr = get_policy_rec(pi, gworld, actions)
print(get_policy_eval_str(v_pi, gworld))
print(get_policy_rec_str(pr, gworld))

     0.00      -1.05      -2.10      -3.16      -4.21  

    -1.05      -2.04      -3.05      -4.05      -5.06  

    -2.06    -112.54                            -6.01  

    -8.21      -8.96      -8.87      -7.91      -6.96  

                          -9.77      -8.86      -7.92  

   -12.54     -11.58     -10.68      -9.77      -8.87  


 ✘  ←  ←  ←  ← 

 ↑  ←  ←  ←  ← 

 ↑  ☠  █  █  ↑ 

 ↑  ←  →  →  ↑ 

 █  █  →  ↑  ↑ 

 →  →  →  ↑  ↑ 




Here we see that the policy tries to stay away from the trap even when that makes the expected path longer (state (2,2) points right, for instance). This is not surprising: moving left from (2,2) or up from (0,2) carries a 5% chance of being blown into the trap by the wind.

Now let's change the reward function and see how this affects the policy.

In particular, we were giving a very negative reward for hitting the trap. We can change this so that there is no extra punishment for hitting the trap, except that we are sent back to the beginning.

If we think of the problem as a video game, hitting the trap could mean something like losing a life. The first reward function punishes this, while the second does not care about it and is only interested in reaching the terminal state as fast as possible.

In [6]:
def new_r(effect):
    if effect == 1:
        return 0
    else:
        return -1


new_rewards_dict = {
    (s, a): new_r(gworld.take_action(s, a)[1]) for s in gworld.states for a in actions
}


def new_rewards(x, y):
    return new_rewards_dict[(x, y)]

In [7]:
new_pi, new_v_pi = dynamic_programing_gpi(
    world_model, new_rewards, actions, gworld.states
)
new_pr = get_policy_rec(new_pi, gworld, actions)
print(get_policy_eval_str(new_v_pi, gworld))
print(get_policy_rec_str(new_pr, gworld))

    0.00     -1.05     -2.10     -3.16     -4.21  

   -1.05     -2.04     -3.05     -4.05     -5.06  

   -2.06    -10.80                         -6.01  

   -3.42     -4.40     -5.79     -6.85     -6.96  

                       -6.85     -7.84     -7.92  

   -9.80     -8.85     -7.85     -8.79     -8.87  


 ✘  ←  ←  ←  ← 

 ↑  ←  ←  ←  ← 

 ↑  ☠  █  █  ↑ 

 ↑  ←  ←  ←  ↑ 

 █  █  ↑  ←  ↑ 

 →  →  ↑  ↑  ↑ 




We can see a change in the policy: it now prefers the shorter path. Even though following this policy has a reasonably high chance of hitting the trap due to the wind, it reaches the terminal state faster on average.
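A back-of-the-envelope comparison helps explain the switch. The sketch below takes the move up from (0, 2), which has a 5% chance of ending in the trap at (1, 3) instead of at (0, 3), and reads the relevant values off the two tables printed above. The "risk premium" is the expected loss from that 5% branch; it is a rough approximation that ignores the other wind outcome, and the variable names are illustrative.

```python
P_TRAP = 0.05  # chance that one risky step is blown into the trap

# (value of the cell the agent meant to reach, value of the trap cell),
# read off the two value tables above for the move up from (0, 2).
tables = {
    "trap penalty -100": (-2.06, -112.54),
    "no extra penalty": (-2.06, -10.80),
}

# Expected loss, in reward units (one unit = one step), of a single risky step.
risk_premium = {
    name: P_TRAP * (v_intended - v_trap)
    for name, (v_intended, v_trap) in tables.items()
}
for name, premium in risk_premium.items():
    print(f"{name}: one risky step costs about {premium:.2f} in expectation")
```

With the -100 penalty, a single risky step costs more than five steps' worth of reward in expectation, so a detour of a few extra steps pays off. Without the penalty the cost drops below half a step, so any detour of a full step or more is no longer worth taking.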

The takeaway here is that the details of the reward function influence the policy, so we should always consider what we want to achieve and what we want to avoid when defining it.