# Frozen Lake Example

There is a frozen lake laid out on a square grid. Your task is to get from the start in the top-left corner to the goal in the bottom-right corner in as few steps as possible. There are two complicating factors:

1. The ice is slippery, so when you make a move (up, down, left, right) there is a chance you don't go where you intended.
2. The ice has cracks in it, so there is a chance you fall down a crack (and die!).
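To make the slipping concrete, here is a minimal sketch of the outcome distribution for a single move. It assumes the rule that, to my understanding, gym's FrozenLake uses when `is_slippery=True`: the intended direction and the two perpendicular directions are each taken with probability 1/3. The helper name `slippery_outcomes` and the `MOVES` table are illustrative and not part of gym.

```python
# Action encoding matching this notebook's A_dict: 0=Left, 1=Down, 2=Right, 3=Up,
# written as (row, column) offsets on the grid.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def slippery_outcomes(row, col, action, size):
    """Return [(probability, (new_row, new_col)), ...] for one intended move.

    Assumption: the intended direction and the two perpendicular
    directions are each taken with probability 1/3.
    """
    outcomes = []
    for actual in [(action - 1) % 4, action, (action + 1) % 4]:
        d_row, d_col = MOVES[actual]
        # A move that would leave the grid keeps the agent at the edge.
        new_row = min(max(row + d_row, 0), size - 1)
        new_col = min(max(col + d_col, 0), size - 1)
        outcomes.append((1.0 / 3.0, (new_row, new_col)))
    return outcomes


# Intending to move right from cell (1, 1) on an 8x8 lake.
outcomes = slippery_outcomes(row=1, col=1, action=2, size=8)
for prob, cell in outcomes:
    print(round(prob, 3), cell)
```

Note that the agent never slides in the direction opposite to the one intended, which is why a policy can sometimes avoid a crack by deliberately steering away from it.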

With the movement probabilities and the crack locations known, this defines an MDP that we can solve with value iteration or policy iteration.
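As a rough illustration of what value iteration does on such an MDP, here is a minimal standalone sketch. It is not the implementation in `stochastic_control`; it only assumes the transition layout that gym's toy-text environments expose, where `P[s][a]` is a list of `(probability, next_state, reward, done)` tuples. The three-state chain `P_toy` is a hypothetical example chosen so the answer can be checked by hand.

```python
import numpy as np


def value_iteration(P, n_states, n_actions, disc=0.9, tol=1e-8, max_iter=10_000):
    """Repeatedly apply the Bellman optimality update until V stops changing."""
    V = np.zeros(n_states)
    for _ in range(max_iter):
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                for prob, s_next, reward, done in P[s][a]:
                    # No future value is collected after a terminal transition.
                    future = 0.0 if done else disc * V[s_next]
                    Q[s, a] += prob * (reward + future)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V, Q.argmax(axis=1)


# Hypothetical chain: state 0 -> state 1 -> state 2 (goal, terminal).
# Action 0 stays put, action 1 advances; reaching the goal pays reward 1.
P_toy = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

V_toy, policy_toy = value_iteration(P_toy, n_states=3, n_actions=2)
print(V_toy, policy_toy)
```

With a discount of 0.9 the state next to the goal is worth 1 and the state two steps away is worth 0.9, which is the same pattern of values decaying with distance from the goal that appears in the lake solutions.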

In [7]:
# Load some packages
import sys
sys.path.append('../../')

import numpy as np

# The main modules that we need
import gym
from stochastic_control.optimal_control.discrete_control \
import Value_Iteration, Policy_Iteration

from tqdm import tqdm
from importlib import reload  # `imp` is deprecated and was removed in Python 3.12

%load_ext autoreload
%autoreload 2



In [8]:
def print_lake(vi,size=8):
    '''Print the lake (rendered from the global `env`), the policy, and the values.'''
    
    # print the lake
    env.render()
    print('')
    
    # print the solution
    A_dict = { 0 : 'L', 1: 'D', 2:'R', 3:'U' }
    for s in range(vi.nS):
        print(A_dict[vi.act(s)], end='')
        if (s+1) % size ==0 :
            print('')
    
    # print values/success probabilities
    print('')
    for s in range(vi.nS):
        print(np.round(vi.V[s],2), end=' ')
        if (s+1) % size ==0 :
            print('')

In [20]:
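# Load environment
# map_name=None makes gym generate a random map, so re-running this cell
# gives a different lake (the two solutions below are on different maps).
env = gym.make('FrozenLake-v0', map_name=None, is_slippery=True)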

In [13]:
# value iteration solution
vi = Value_Iteration(env.nS,env.nA,env.P,disc=0.9)
V = vi.train(10000)
print_lake(vi)


SFHFFFHF
FFFFFFHF
FHHHHFFF
HHFFFHFF
HFFFHFFF
FFFFFHFF
HFHFFFFF
HFFFFFFG

DLLRRLLR
UUDUULLR
LLLLLRDD
LLDDLLRD
LDDLLDRD
DRUDLLRD
LLLRDDRD
LRDRRRRL

0.0 0.0 0.0 0.01 0.01 0.01 0.0 0.06 
0.0 0.0 0.0 0.0 0.01 0.02 0.0 0.08 
0.0 0.0 0.0 0.0 0.0 0.03 0.1 0.13 
0.0 0.0 0.02 0.02 0.01 0.0 0.17 0.2 
0.0 0.02 0.03 0.04 0.0 0.08 0.26 0.31 
0.01 0.03 0.05 0.1 0.11 0.0 0.38 0.46 
0.0 0.03 0.0 0.17 0.26 0.38 0.54 0.71 
0.0 0.06 0.11 0.2 0.31 0.46 0.71 0.0 


In [21]:
# policy iteration solution
poi = Policy_Iteration(env.nS,env.nA,env.P,disc=.9)
V = poi.train(time=15)
print_lake(poi)

policy is optimal

SFFFFFFF
FFFFHHHF
FFFFFFFF
HFFFHFFF
FFHHFFFF
FFFFFFFF
HHFFFHHF
FFFFFFFG

RRRRUUUR
DRRLLLLR
URRUDDDD
LRULLRDD
DLLLDRRD
UUDDDUUR
LLRRLLLR
DDRRDDDL

0.0 0.0 0.01 0.01 0.01 0.02 0.03 0.05 
0.0 0.0 0.01 0.01 0.0 0.0 0.0 0.06 
0.0 0.01 0.01 0.01 0.02 0.06 0.08 0.1 
0.0 0.01 0.01 0.0 0.0 0.09 0.12 0.15 
0.01 0.01 0.0 0.0 0.06 0.11 0.16 0.23 
0.02 0.02 0.05 0.07 0.09 0.12 0.2 0.37 
0.0 0.0 0.07 0.1 0.12 0.0 0.0 0.63 
0.04 0.05 0.09 0.13 0.21 0.36 0.63 0.0 
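For comparison with the `Policy_Iteration` call above, here is a minimal standalone sketch of policy iteration. It is not the library's implementation: it assumes the same `P[s][a] -> [(probability, next_state, reward, done), ...]` layout as gym's toy-text environments, evaluates each policy exactly by solving a linear system, and stops when the greedy policy no longer changes. The chain `P_toy` is a hypothetical example.

```python
import numpy as np


def policy_iteration(P, n_states, n_actions, disc=0.9, max_rounds=100):
    """Alternate exact policy evaluation and greedy policy improvement."""
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    for _ in range(max_rounds):
        # Policy evaluation: solve (I - disc * P_pi) V = r_pi.
        P_pi = np.zeros((n_states, n_states))
        r_pi = np.zeros(n_states)
        for s in range(n_states):
            for prob, s_next, reward, done in P[s][int(policy[s])]:
                r_pi[s] += prob * reward
                if not done:
                    P_pi[s, s_next] += prob
        V = np.linalg.solve(np.eye(n_states) - disc * P_pi, r_pi)

        # Policy improvement: act greedily with respect to V.
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                for prob, s_next, reward, done in P[s][a]:
                    future = 0.0 if done else disc * V[s_next]
                    Q[s, a] += prob * (reward + future)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break  # the policy is greedy for its own values, hence optimal
        policy = new_policy
    return V, policy


# Hypothetical chain: state 0 -> state 1 -> state 2 (goal, terminal).
# Action 0 stays put, action 1 advances; reaching the goal pays reward 1.
P_toy = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

V_pi, policy_pi = policy_iteration(P_toy, n_states=3, n_actions=2)
print(V_pi, policy_pi)
```

On this toy chain the loop stops after three rounds, which mirrors why a small budget such as `time=15` can be enough for policy iteration while value iteration is given thousands of sweeps.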


In [1]:
# env.P
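The commented-out `env.P` is the transition model handed to both solvers. As a sketch of its layout, assuming the convention of gym's toy-text environments, `P[state][action]` is a list of `(probability, next_state, reward, done)` tuples. The two-state dictionary and the helper below are hypothetical and only illustrate that structure along with a simple sanity check.

```python
# Hypothetical two-state model in the assumed layout:
# P[state][action] -> list of (probability, next_state, reward, done)
P_example = {
    0: {
        0: [(1.0, 0, 0.0, False)],
        1: [(2 / 3, 1, 1.0, True), (1 / 3, 0, 0.0, False)],
    },
    1: {
        0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)],
    },
}


def transition_probabilities_sum_to_one(P, tol=1e-9):
    """Check that every (state, action) pair has a proper distribution."""
    return all(
        abs(sum(prob for prob, *_ in outcomes) - 1.0) < tol
        for actions in P.values()
        for outcomes in actions.values()
    )


print(transition_probabilities_sum_to_one(P_example))
```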