In [None]:
%cd ..
%load_ext autoreload
%autoreload 2

In [None]:
from ipywidgets import interact_manual

In [None]:
from policy_iteration_probabilistic import *

In this example, we find the optimal policy in a windy gridworld. We also study how the rewards affect what the optimal policy does at the probabilistic state, i.e. cell (1, 2).

The method to find the optimal policy is the same as in the [09_policy_iteration_deterministic.ipynb](./09_policy_iteration_deterministic.ipynb) notebook.

Note: 
 * the optimal policy we want to find is deterministic, only the environment is probabilistic
 * cell (0,0) is the top left corner of the grid
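
To make the first note concrete, here is a minimal sketch of the split between a deterministic policy and a probabilistic environment. The table `transition_probs` and the helper `sample_next_state` are hypothetical names, not part of `policy_iteration_probabilistic`. The destinations (0, 2) and (2, 2) are assumptions about the grid layout.

```python
import random

# Hypothetical transition table shaped like {(state, action): {next_state: prob}}
transition_probs = {
    ((1, 2), "U"): {(0, 2): 0.5, (1, 3): 0.5},  # windy: two possible outcomes
    ((1, 2), "D"): {(2, 2): 1.0},               # ordinary: a single outcome
}

# A deterministic policy maps each state to exactly one action
policy = {(1, 2): "U"}


def sample_next_state(state, action, rng=random):
    """Draw the next state from the environment's transition distribution."""
    outcomes = transition_probs[(state, action)]
    next_states = list(outcomes.keys())
    weights = list(outcomes.values())
    return rng.choices(next_states, weights=weights, k=1)[0]


state = (1, 2)
action = policy[state]  # the agent's choice is fixed, never sampled
print(action, sample_next_state(state, action))  # the landing cell can differ between runs
```

The policy lookup always returns the same action. Only the call to `sample_next_state` involves randomness.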

In [None]:
# To be clear, show the one (state, action) pair that has a probabilistic state transition
g = windy_grid_penalized(0)

for (current_state, action), v in g.probs.items():
    if len(v) > 1:
        for (future_state, prob) in v.items():
            r = g.rewards.get(future_state, 0)
            print(f"s: {current_state}")
            print(f"a: {action}")
            print(f"s': {future_state}")
            print(f"r: {r}")
            print(f"p(s', r | s, a): {prob}")
            print("")

So, when you are at state (1, 2), taking action 'U' is risky because there is a 50% chance that you end up at state (1, 3), the terminal state with a reward of -1.
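
The risk can be quantified with a one-step expected backup, the sum over next states of p(s' | s, a) * (r(s') + gamma * V(s')). The sketch below is illustrative only. `expected_backup` is a hypothetical helper, the other outcome of 'U' is assumed to be (0, 2), the +1 reward at (0, 3) is assumed, and the value of 1.0 for (0, 2) is a made-up number standing in for whatever policy evaluation would return.

```python
def expected_backup(transitions, rewards, values, gamma=0.9):
    """Return the sum over s' of p(s' | s, a) * (r(s') + gamma * V(s'))."""
    return sum(
        prob * (rewards.get(next_state, 0) + gamma * values.get(next_state, 0))
        for next_state, prob in transitions.items()
    )


# Assumed outcomes of action 'U' at (1, 2); the (0, 2) entry is a guess
up_transitions = {(0, 2): 0.5, (1, 3): 0.5}
# Rewards received on arrival, with a step penalty of 0 (the +1 is assumed)
arrival_rewards = {(0, 3): 1, (1, 3): -1}
# Hypothetical state values; states that are missing (e.g. terminal) count as 0
state_values = {(0, 2): 1.0}

q_up = expected_backup(up_transitions, arrival_rewards, state_values)
print(f"Q((1, 2), 'U') = {q_up:.2f}")
```

With these assumed numbers the backup comes out at roughly -0.05: the half of the time that ends in the -1 terminal state outweighs the discounted value of landing on (0, 2).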

In [None]:
def f(penalty=0):
    grid = windy_grid_penalized(penalty)

    print("the world is:")
    print_world(grid)
    print("")

    policy = create_random_policy(grid)

    solve_gridworld(grid, policy, gamma=0.9)
    
interact_manual(f, penalty=(-2, 0, 0.1))

Some interesting observations:
    
* when penalty is 0, we take no risk at (1, 2). We go down and around X to reach the terminal state at (0, 3).
* when penalty is -2, at (1, 2) we go straight to the terminal state with -1 reward, because any route to (0, 3) incurs a higher expected cost.
* when penalty is -0.2, at (1, 2) we take the risk and go up!
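
These observations can be sanity-checked by hand. The sketch below compares the expected discounted return of the three options at (1, 2). It rests on several assumptions that are not confirmed by the code above: the penalty is received on arriving at every non-terminal cell, (0, 3) pays +1, the safe route around X takes 8 moves, and the best move from (0, 2) is straight into (0, 3). All function names are hypothetical.

```python
GAMMA = 0.9


def return_up(penalty, gamma=GAMMA):
    """Go up: 50% land on (0, 2) and then step into +1, 50% get blown into -1."""
    return 0.5 * (penalty + gamma * 1.0) + 0.5 * (-1.0)


def return_right(penalty):
    """Step straight into the -1 terminal state (the penalty never applies)."""
    return -1.0


def return_around(penalty, n_moves=8, gamma=GAMMA):
    """Take the safe route: n_moves - 1 penalized arrivals, then +1 on the last move."""
    penalized = sum(penalty * gamma**t for t in range(n_moves - 1))
    return penalized + gamma ** (n_moves - 1) * 1.0


def best_choice(penalty):
    """Return the option with the highest expected return, plus all the returns."""
    options = {
        "up": return_up(penalty),
        "right": return_right(penalty),
        "around": return_around(penalty),
    }
    return max(options, key=options.get), options


for penalty in (0, -0.2, -2):
    choice, options = best_choice(penalty)
    summary = ", ".join(f"{name}: {value:.3f}" for name, value in options.items())
    print(f"penalty={penalty}: best={choice} ({summary})")
```

Under these assumptions the sketch picks "around" for a penalty of 0, "up" for -0.2 and "right" for -2, which matches the three observations: a small penalty makes the long detour cost more than the gamble, and a large penalty makes ending the episode immediately the cheapest option.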