---

<div class="alert alert-primary alert-info">

# Frozen Lake $4\times4$ and $8\times8$

## Reinforcement Learning

</div>

<div class="alert alert-block alert-success">

- ### Value-Iteration
    
</div>

---

<img src='frozenlake.jpg' width=1000 height=50/>

---

In [1]:
%config IPCompleter.greedy=True
%matplotlib inline

In [2]:
import numpy as np
import matplotlib.pyplot as plt

import gym

In [3]:
np.random.seed(1)

<div class="alert alert-primary alert-info">

## Non-slippery version

</div>

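In the non-slippery version the transition probabilities are deterministic: taking action $a$ in state $s$ leads to a single successor $s'$ with probability 1. The sum over successors therefore collapses to one term: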

$
\begin{align}
v_{\pi}(s) &:= \sup_{\forall{a}} \{ \sum_{\forall{s'}} p_{ss'}^a (R_{s'}^a + \gamma v_{\pi}(s')) \} \\
&=  \sup_{\forall{a}} \{ R_{s'}^a + \gamma v_{\pi}(s') \}
\end{align}
$

---

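Let $\gamma$ be the discount factor and $\epsilon$ the error tolerance. The search terminates once the largest change in any state value over a full sweep falls below $\epsilon(1-\gamma)\gamma^{-1}$. This is the standard stopping bound for value iteration: once it is met, the estimate lies within $\epsilon$ of the optimal values in the max norm.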
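
For a sense of scale, the threshold can be evaluated for the $\gamma$ and $\epsilon$ used later in this notebook. This is only a small sketch; the helper name `stopping_threshold` is an assumption made here and does not appear in the notebook's own code.

```python
def stopping_threshold(gamma, epsilon):
    """Return the per-sweep change below which the iteration stops."""
    return epsilon * (1 - gamma) / gamma


# Values used in the experiments later in this notebook.
gamma = 0.999
epsilon = 0.01

threshold = stopping_threshold(gamma, epsilon)
print(threshold)  # roughly 1.001e-05
```

With $\gamma$ this close to 1 the threshold is about a thousand times smaller than $\epsilon$ itself, which is one reason the slippery runs need a few hundred sweeps.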

---

In [4]:
def value_iteration(env, gamma, epsilon):

    state_values = np.zeros(env.nS)
    
    def compute_state_value(curr_state):
        # Rewards are non-negative, so 0.0 is a safe lower bound for the max.
        best_state_value = 0.0
        for action in range(env.nA):
            # Deterministic dynamics: each action has exactly one outcome.
            _, next_state, reward, _ = env.P[curr_state][action][0]
            best_state_value = max(best_state_value, reward + gamma * state_values[next_state])
        return best_state_value
    
    def print_state_values(n):
        print('\nState Values:')
        idx = 0
        for state, value in enumerate(state_values):
            idx += 1
            print(round(value, 4), end='\t')
            if idx % n == 0:
                print()
            if idx == env.nS - 1:
                print('Goal')
                break
    
    env.reset()
    print('Start:')
    env.render()

    iteration = 0
    
    while True:
        delta = 0
        prev_state_values = state_values.copy()
        iteration += 1
        for state in range(env.nS - 1):
            state_values[state] = compute_state_value(state)
            delta = max(delta, np.fabs(prev_state_values[state] - state_values[state]))
        if delta < epsilon * (1 - gamma) / gamma:
            print('\nNumber of Iterations: ', iteration)
            print('Delta: ', delta)
            if env.nS == 64:
                print_state_values(8)
            else:
                print_state_values(4)
            break

---

### $4\times4$

---

In [5]:
env = gym.make('FrozenLake-v0', is_slippery=False)
gamma = 0.999
epsilon = 0.01

if __name__=='__main__':
    value_iteration(env, gamma, epsilon)

Start:

SFFF
FHFH
FFFH
HFFG

Number of Iterations:  7
Delta:  0

State Values:
0.995	0.996	0.997	0.996	
0.996	0.0	0.998	0.0	
0.997	0.998	0.999	0.0	
0.0	0.999	1.0	Goal
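
The deterministic values above can be cross-checked by hand. Assuming the only non-zero reward is the 1 earned on the move into the goal, a cell whose shortest hole-free path to the goal has $d$ steps should have value $\gamma^{d-1}$. The sketch below recomputes the table that way with a breadth-first search; `GRID`, `steps_to_goal` and `closed_form_values` are hypothetical names introduced for this check, and the map is copied from the rendering above.

```python
from collections import deque

# Default 4x4 map, copied from the rendering above.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]


def steps_to_goal(grid):
    """Breadth-first search outward from the goal; holes are never entered."""
    n_rows, n_cols = len(grid), len(grid[0])
    goal = next((r, c) for r in range(n_rows) for c in range(n_cols)
                if grid[r][c] == "G")
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < n_rows and 0 <= nc < n_cols
                    and grid[nr][nc] != "H" and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist


def closed_form_values(grid, gamma):
    """gamma ** (d - 1) for reachable S/F cells, 0.0 for holes and the goal."""
    dist = steps_to_goal(grid)
    return [
        [round(gamma ** (dist[(r, c)] - 1), 4)
         if cell in "SF" and (r, c) in dist else 0.0
         for c, cell in enumerate(row)]
        for r, row in enumerate(grid)
    ]


for row in closed_form_values(GRID, gamma=0.999):
    print(row)
```

The start cell is six steps from the goal, so its value is $0.999^5 \approx 0.995$, in agreement with the first entry of the table.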


---

### $8\times8$

---

In [6]:
env = gym.make('FrozenLake8x8-v0', is_slippery=False)

if __name__=='__main__':
    value_iteration(env, gamma, epsilon)

Start:

SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG

Number of Iterations:  15
Delta:  0

State Values:
0.9871	0.9881	0.9891	0.99	0.991	0.992	0.993	0.994	
0.9881	0.9891	0.99	0.991	0.992	0.993	0.994	0.995	
0.9891	0.99	0.991	0.0	0.993	0.994	0.995	0.996	
0.99	0.991	0.992	0.993	0.994	0.0	0.996	0.997	
0.9891	0.99	0.991	0.0	0.995	0.996	0.997	0.998	
0.9881	0.0	0.0	0.995	0.996	0.997	0.0	0.999	
0.9891	0.0	0.993	0.994	0.0	0.998	0.0	1.0	
0.99	0.991	0.992	0.0	0.998	0.999	1.0	Goal


---

<div class="alert alert-primary alert-info">

## Slippery when Wet

<img src='Slippery_when_wet.jpg' width=250 height=5/>

</div>

---

$
\begin{align}
v_{\pi}(s) &:= \sup_{\forall{a}} \{ \sum_{\forall{s'}} p_{ss'}^a (R_{s'}^a + \gamma v_{\pi}(s')) \}
\end{align}
$
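
The inner sum in this equation is the expected one-step return of a single action. As a minimal sketch, it can be computed from a list of `(probability, next_state, reward, done)` tuples, the layout that the code in this notebook reads from `env.P[state][action]`. The function name `expected_backup` and all numbers below are made up for illustration and are not taken from the real environment.

```python
def expected_backup(transitions, state_values, gamma):
    """Expected one-step return of one action.

    Each transition is a (probability, next_state, reward, done) tuple.
    """
    return sum(prob * (reward + gamma * state_values[next_state])
               for prob, next_state, reward, done in transitions)


# Hypothetical three-state example: the action lands in one of three
# successor states with equal probability, as on slippery ice.
state_values = [0.5, 0.8, 0.0]
transitions = [
    (1 / 3, 0, 0.0, False),
    (1 / 3, 1, 0.0, False),
    (1 / 3, 2, 1.0, True),
]

backup = expected_backup(transitions, state_values, gamma=0.9)
print(backup)  # (0.45 + 0.72 + 1.0) / 3, about 0.7233
```

Value iteration evaluates this quantity for every action in a state and keeps the largest one.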

---

In [7]:
def value_iteration(env, gamma, epsilon):

    state_values = np.zeros(env.nS)
    
    def compute_state_value(curr_state):
        # Rewards are non-negative, so 0.0 is a safe lower bound for the max.
        best_state_value = 0.0
        for action in range(env.nA):
            # Expected one-step return of this action over all possible outcomes.
            total_expectation_state_value = 0.0
            for transition_probability, next_state, reward, _ in env.P[curr_state][action]:
                total_expectation_state_value += transition_probability * (reward + gamma * state_values[next_state])
            best_state_value = max(best_state_value, total_expectation_state_value)
        return best_state_value
    
    def print_state_values(n):
        print('\nState Values:')
        idx = 0
        for state, value in enumerate(state_values):
            idx += 1
            print(round(value, 4), end='\t')
            if idx % n == 0:
                print()
            if idx == env.nS - 1:
                print('Goal')
                break
    
    env.reset()
    print('Start:')
    env.render()

    iteration = 0
    
    while True:
        delta = 0
        prev_state_values = state_values.copy()
        iteration += 1
        for state in range(env.nS - 1):
            state_values[state] = compute_state_value(state)
            delta = max(delta, np.fabs(prev_state_values[state] - state_values[state]))
        if delta < epsilon * (1 - gamma) / gamma:
            print('\nNumber of Iterations: ', iteration)
            print('Delta: ', delta)
            if env.nS == 64:
                print_state_values(8)
            else:
                print_state_values(4)
            break

---

### $4\times4$

---

In [8]:
env = gym.make('FrozenLake-v0', is_slippery=True)
gamma = 0.999
epsilon = 0.01

if __name__=='__main__':
    value_iteration(env, gamma, epsilon)

Start:

SFFF
FHFH
FFFH
HFFG

Number of Iterations:  244
Delta:  9.766677739331264e-06

State Values:
0.7854	0.7783	0.7737	0.7713	
0.7877	0.0	0.5056	0.0	
0.7925	0.7996	0.7447	0.0	
0.0	0.8641	0.9311	Goal


---

### $8\times8$

---

In [9]:
env = gym.make('FrozenLake8x8-v0', is_slippery=True)
gamma = 0.999
epsilon = 0.01

if __name__=='__main__':
    value_iteration(env, gamma, epsilon)

Start:

SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG

Number of Iterations:  422
Delta:  9.892355212981485e-06

State Values:
0.8925	0.8952	0.8992	0.9038	0.9087	0.9137	0.9185	0.9223	
0.8919	0.8939	0.8973	0.9016	0.9064	0.9116	0.9175	0.9251	
0.8764	0.8621	0.8187	0.0	0.7776	0.8652	0.909	0.9306	
0.8636	0.8113	0.6991	0.4205	0.5636	0.0	0.8815	0.939	
0.8534	0.7107	0.4695	0.0	0.4945	0.5683	0.7992	0.9502	
0.8457	0.0	0.0	0.1527	0.353	0.413	0.0	0.9642	
0.8406	0.0	0.1641	0.1055	0.0	0.3188	0.0	0.9811	
0.8381	0.6118	0.3874	0.0	0.2718	0.5443	0.7715	Goal


---