The idea here is to play around a little with the classic "Frozen Lake" Gym environment.

Here is the description from the docstring in the OpenAI Gym source code:

https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py

"""
    Winter is here. You and your friends were tossing around a frisbee at the park
    when you made a wild throw that left the frisbee out in the middle of the lake.
    The water is mostly frozen, but there are a few holes where the ice has melted.
    If you step into one of those holes, you'll fall into the freezing water.
    At this time, there's an international frisbee shortage, so it's absolutely imperative that
    you navigate across the lake and retrieve the disc.
    However, the ice is slippery, so you won't always move in the direction you intend.
    The surface is described using a grid like the following
        
        SFFF
        FHFH
        FFFH
        HFFG
    
    S : starting point, safe
    F : frozen surface, safe
    H : hole, fall to your doom
    G : goal, where the frisbee is located
    The episode ends when you reach the goal or fall in a hole.
    You receive a reward of 1 if you reach the goal, and zero otherwise.
"""

In [1]:
# Importing the libraries
import numpy as np
import gym

The actions available are:

LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3

In [3]:
# Let's create our environment with Frozen Lake
env = gym.make("FrozenLake-v0", is_slippery=True)

As mentioned in the description above, the environment has an "is_slippery" setting. When it is turned on, the transitions become stochastic: you can take an action in a particular state, but there is some probability that you will end up in a different state from the one your action was aiming for.
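To make the transition probabilities concrete, here is a small sketch that rebuilds the slippery dynamics in plain Python, without gym. It assumes the rule used in the linked frozen_lake.py: the agent moves in the intended direction or in one of the two perpendicular directions, each with probability 1/3. The names MAP, MOVES, step_to and slippery_transitions are made up for this illustration, and the sketch only covers non-terminal states.

```python
# Pure-Python sketch of the slippery dynamics on the 4x4 map (no gym needed).
# Assumption: intended direction and both perpendicular ones have probability 1/3 each.
MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
# (row, col) offsets for each action id
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}

def step_to(state, direction):
    """Return the cell reached by one step, staying in place at the border."""
    row, col = divmod(state, 4)
    d_row, d_col = MOVES[direction]
    row = min(max(row + d_row, 0), 3)
    col = min(max(col + d_col, 0), 3)
    return row * 4 + col

def slippery_transitions(state, action):
    """List (probability, next_state, reward, done) tuples, shaped like env.P[state][action]."""
    outcomes = []
    for direction in [(action - 1) % 4, action, (action + 1) % 4]:
        nxt = step_to(state, direction)
        tile = MAP[nxt // 4][nxt % 4]
        # Reward 1 only when the goal is reached; holes and the goal end the episode
        outcomes.append((1 / 3, nxt, float(tile == "G"), tile in "GH"))
    return outcomes

# Trying to go RIGHT from state 14, the cell just left of the goal
for outcome in slippery_transitions(state=14, action=RIGHT):
    print(outcome)
```

Going RIGHT from state 14 reaches the goal (state 15) only one time in three; otherwise the agent stays in 14 or slips up to 10.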

In [4]:
# Let's define our initial state values, states, convergence threshold, discount factor and actions
V = np.zeros(env.nS)
S = np.arange(0,16)
threshold = 1e-3
gamma = 0.9
actions = {'LEFT':0,'DOWN':1,'RIGHT':2,'UP':3}

There are several ways to solve this problem. In this notebook I will use the value iteration algorithm.
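For reference, the value iteration update can be written as follows. This is the standard textbook form rather than anything specific to this notebook; the symbols P (transition probability), R (reward) and \theta (the threshold) are notation chosen here, while \gamma is the discount factor defined above.

```latex
V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V(s') \bigr]
\qquad
\text{stop when } \max_{s} \lvert V_{\text{new}}(s) - V_{\text{old}}(s) \rvert \le \theta
```

The update is applied to every state in turn, and the sweeps repeat until the largest change in a sweep is no greater than the threshold.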

In [7]:
''' Let's first define a function that uses the Bellman optimality equation 
    to calculate the optimal state values for all states'''

def optimal_state_value(env,S,V):
    # Sweep over all states until the largest update falls below the threshold
    while True:
        delta = 0
        for s in S:
            v_old = V[s]
            best_v = float('-inf')
            for a in actions.values():
                # Expected reward and expected next-state value under env.P[s][a]
                expected_v = 0
                expected_r = 0
                transitions = env.P[s][a]
                for (probs, state_prime, r, done) in transitions:
                    expected_r += probs * r
                    expected_v += probs * V[state_prime] 
                v_new = expected_r + (gamma * expected_v)
                if v_new > best_v:
                    best_v = v_new
            # V is updated in place, so later states in the sweep see the new value
            V[s] = best_v
            delta = max(delta, abs(v_old - best_v))
        if delta <= threshold:
            break
    return V

In [8]:
''' Now let's use a policy improvement function to calculate the optimal policy 
    given the optimal state values previously calculated'''

def optimal_policy(env,S,V):
    policy = np.zeros(env.nS)
    V = optimal_state_value(env,S,V)
    for s in S:
        best_a = None
        best_v = float('-inf')
        for k,a in actions.items():
            expected_v = 0
            expected_r = 0
            transitions = env.P[s][a]
            for (probs, state_prime, r, done) in transitions:
                expected_r += probs * r
                expected_v += probs * V[state_prime] 
            v_new = expected_r + (gamma * expected_v)
            if v_new > best_v:
                best_v = v_new
                best_a = a
        policy[s] = best_a
    return(policy)
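As a sanity check, the same two steps (value iteration, then greedy policy extraction) can be tried on a problem small enough to solve by hand. The sketch below uses a hypothetical three-state corridor, not Frozen Lake; toy_P, toy_V, toy_gamma, toy_threshold and q_value are names invented for this example, and toy_P only mimics the shape of env.P.

```python
# Hypothetical 3-state corridor 0 - 1 - 2, where state 2 is the terminal goal.
# toy_P[state][action] holds (probability, next_state, reward, done) tuples, like env.P.
GO_LEFT, GO_RIGHT = 0, 1
toy_P = {
    0: {GO_LEFT: [(1.0, 0, 0.0, False)], GO_RIGHT: [(1.0, 1, 0.0, False)]},
    1: {GO_LEFT: [(1.0, 0, 0.0, False)], GO_RIGHT: [(1.0, 2, 1.0, True)]},
    2: {GO_LEFT: [(1.0, 2, 0.0, True)], GO_RIGHT: [(1.0, 2, 0.0, True)]},
}
toy_gamma, toy_threshold = 0.9, 1e-3

def q_value(values, s, a):
    """Expected reward plus discounted next-state value for action a in state s."""
    return sum(p * (r + toy_gamma * values[s2]) for p, s2, r, _ in toy_P[s][a])

# Value iteration with in-place updates
toy_V = [0.0, 0.0, 0.0]
while True:
    delta = 0.0
    for s in toy_P:
        best = max(q_value(toy_V, s, a) for a in toy_P[s])
        delta = max(delta, abs(toy_V[s] - best))
        toy_V[s] = best
    if delta <= toy_threshold:
        break

# Greedy policy: pick the action with the highest value (ties go to the first action)
toy_policy = [max(toy_P[s], key=lambda a: q_value(toy_V, s, a)) for s in toy_P]
print(toy_V, toy_policy)
```

By hand, state 1 is worth 1 (one step from the goal) and state 0 is worth 0.9 (one discounted step further away), and both should move right. The terminal state gets action 0 only because all of its actions tie.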


Now we can run the algorithms and view the results

In [10]:
env.render()
print(optimal_policy(env,S,V).reshape(4,4), actions, sep="\n")


SFFF
FHFH
FFFH
HFFG
[[0. 3. 0. 3.]
 [0. 0. 0. 0.]
 [3. 1. 0. 0.]
 [0. 2. 1. 0.]]
{'LEFT': 0, 'DOWN': 1, 'RIGHT': 2, 'UP': 3}


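As one can see, the actions at each state look rather counter-intuitive, but that is because of the "is_slippery" effect: since the agent can slip sideways, it often prefers an action that can never land it in a hole over one that points toward the goal.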
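One way to see why is to list the tiles each action can reach from a given state. The sketch below does this for state 4 (second row, first column), where the policy above picks LEFT. It is plain Python and assumes the slip rule from the linked frozen_lake.py (intended direction or one of the two perpendicular ones, equally likely); ACTION_NAMES, OFFSETS and reachable_tiles are hypothetical names for this illustration.

```python
# Sketch: which tiles can each action reach from a state when the ice is slippery?
# Assumption: the agent ends up in the intended direction or a perpendicular one.
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
ACTION_NAMES = ["LEFT", "DOWN", "RIGHT", "UP"]    # index = action id
OFFSETS = [(0, -1), (1, 0), (0, 1), (-1, 0)]      # (row, col) offset per action id

def reachable_tiles(state, action):
    """Return the three tiles the agent may land on, one per possible slip direction."""
    row, col = divmod(state, 4)
    tiles = []
    for direction in [(action - 1) % 4, action, (action + 1) % 4]:
        d_row, d_col = OFFSETS[direction]
        # Moves off the grid leave the agent where it is
        r = min(max(row + d_row, 0), 3)
        c = min(max(col + d_col, 0), 3)
        tiles.append(LAKE[r][c])
    return tiles

for action, name in enumerate(ACTION_NAMES):
    tiles = reachable_tiles(4, action)
    print(f"{name:5s} -> {tiles}, hole risk: {tiles.count('H')}/3")
```

From state 4, LEFT is the only action with no chance of falling into the hole at state 5; every other action slips into it one time in three.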

Let's calculate the results again, this time with the "is_slippery" effect turned off.

In [11]:
env2 = gym.make("FrozenLake-v0", is_slippery=False)
env2.render()
print(optimal_policy(env2,S,V).reshape(4,4), actions, sep="\n")


SFFF
FHFH
FFFH
HFFG
[[1. 2. 1. 0.]
 [1. 0. 1. 0.]
 [2. 1. 1. 0.]
 [0. 2. 2. 0.]]
{'LEFT': 0, 'DOWN': 1, 'RIGHT': 2, 'UP': 3}


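As one can see, without the "is_slippery" effect the calculated policy is quite intuitive: starting from S, it follows a shortest safe path to G. The 0 entries at the holes and at the goal carry no meaning, since every action has the same value in those terminal states and the first action simply wins the tie.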