## FROZEN LAKE

https://www.gymlibrary.dev/environments/toy_text/frozen_lake/

### Action Space
The action shape is (1,) with integer values in the range [0, 3], indicating which direction to move the player.
```
0: Move left
1: Move down
2: Move right
3: Move up
```
### Observation Space
The observation is a value representing the player’s current position as current_row * ncols + current_col (where both the row and col start at 0).

For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map.

The observation is returned as an int().
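
The index arithmetic above can be sketched in a few lines. This is a minimal illustration that assumes a row-major grid whose row stride is the number of columns (on square maps such as 4x4 the row and column counts coincide); the helper names `to_state` and `to_position` are hypothetical and not part of Gymnasium.

```python
def to_state(row, col, ncols):
    """Flatten a zero-based (row, col) grid position into a state index."""
    return row * ncols + col


def to_position(state, ncols):
    """Recover the zero-based (row, col) position from a state index."""
    return divmod(state, ncols)


goal_state = to_state(3, 3, ncols=4)            # bottom-right cell of the 4x4 map
goal_position = to_position(goal_state, ncols=4)
print(goal_state, goal_position)                # 15 (3, 3)
```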

### Starting State
The episode starts with the player in state [0] (location [0, 0]).

### Rewards
Reward schedule:
```
Reach goal: +1
Reach hole: 0
Reach frozen: 0
```
### Episode End
The episode ends if the following happens:
```
Termination:
  - The player moves into a hole.
  - The player reaches the goal at max(nrow) * max(ncol) - 1 (location [max(nrow)-1, max(ncol)-1]).

Truncation (when using the time_limit wrapper):
  - The episode length is capped at 100 steps for the 4x4 environment and 200 steps for the FrozenLake8x8-v1 environment.
```
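
To make the termination rules concrete, the sketch below scans a map description and lists the state indices that end an episode. The layout is the default 4x4 map printed later in this notebook; `terminal_states` is a hypothetical helper written for illustration, not a Gymnasium function.

```python
def terminal_states(desc):
    """Return (holes, goals) as lists of state indices for a list of map rows."""
    ncols = len(desc[0])
    holes, goals = [], []
    for row, line in enumerate(desc):
        for col, tile in enumerate(line):
            state = row * ncols + col   # same row-major index as the observation
            if tile == "H":
                holes.append(state)
            elif tile == "G":
                goals.append(state)
    return holes, goals


desc = ["SFFF", "FHFH", "FFFH", "HFFG"]
holes, goals = terminal_states(desc)
print("holes:", holes)   # holes: [5, 7, 11, 12]
print("goal:", goals)    # goal: [15]
```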

In [6]:
import numpy as np
import gymnasium as gym


env_id = 'FrozenLake-v1'
fl_env = gym.make(env_id, desc=None, map_name="4x4", is_slippery=False, render_mode="ansi")
print("Environment id: ", env_id)
print("Number of action space: ",fl_env.env.action_space)
print("Number of observation space: ",fl_env.env.observation_space)
'''
SHOWING ENV
'''
fl_env.reset()
print("How to move from (S) --> (G) and avoiding (H) with the below map")
print(fl_env.render())


Environment id:  FrozenLake-v1
Number of action space:  Discrete(4)
Number of observation space:  Discrete(16)
How to move from (S) --> (G) and avoiding (H) with the below map

SFFF
FHFH
FFFH
HFFG



In [3]:
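''' Structure of the transition table
fl_env.unwrapped.P[s][a] = [(p, s_prime, r, done), ...]
Each tuple gives the probability p of reaching s_prime from s by action a,
the reward r for that transition, and whether the transition ends the episode (done).
'''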
def argmax(env, V, pi, action_star, state, gamma):
    # One-step lookahead: expected return of each action from `state` under V.
    n_actions = env.action_space.n
    expectation_value = np.zeros(n_actions)
    for action in range(n_actions):
        # The transition table lives on the base environment, so reach it via `unwrapped`.
        for p, s_prime, r, _ in env.unwrapped.P[state][action]:
            expectation_value[action] += p * (r + gamma * V[s_prime])

    # Record the greedy action and make the policy deterministic in this state.
    chosen_action = np.argmax(expectation_value)
    action_star[state] = chosen_action
    pi[state][chosen_action] = 1
    return pi, action_star
'''
bellman_optimality_update
'''
def bellman_optimality_update(env, V, state, gamma):
    '''Back up V[state] in place with the value of the greedy action.
    '''
    n_actions = env.action_space.n
    expectation_value = np.zeros(n_actions)

    # Expected return of each action under the current value estimate.
    for action in range(n_actions):
        for p, s_prime, r, _ in env.unwrapped.P[state][action]:
            expectation_value[action] += p * (r + gamma * V[s_prime])

    # The greedy action's value is the maximum over actions; writing it into V
    # right away lets later states in the same sweep use the updated estimate.
    V[state] = np.max(expectation_value)
    return V[state]
'''
VALUE_ITERATION
'''
def value_iteration(env, gamma, theta):
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    V = np.zeros(n_states)
    num_steps = 0  # number of full sweeps over the state space
    while True:
        num_steps += 1
        delta = 0
        for state in range(n_states):
            v = V[state]
            bellman_optimality_update(env, V, state, gamma)
            delta = max(delta, abs(v - V[state]))
        # Stop once no state value changed by theta or more during a sweep.
        if delta < theta:
            break

    # Extract the greedy policy: pi is one-hot per state, action_star holds action indices.
    pi = np.zeros((n_states, n_actions))
    action_star = np.zeros(n_states, dtype=int)
    for state in range(n_states):
        pi, action_star = argmax(env, V, pi, action_star, state, gamma)
    return V, pi, action_star
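
The same backup can be tried without Gymnasium on a hand-built model. The sketch below uses a hypothetical three-state chain (the names `P` and `value_iteration_sketch` are illustrative) whose transition table follows the `P[state][action]` layout of `(probability, next_state, reward, done)` tuples. It repeatedly applies `V(s) = max over a of sum(p * (r + gamma * V(s_prime)))` until the largest change in a sweep falls below `theta`.

```python
# Hypothetical 3-state chain: 0 -> 1 -> 2, where state 2 is terminal.
# Action 0 stays put, action 1 moves right; entering state 2 pays reward 1.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def action_values(P, V, s, gamma):
    """Expected return of each action in state s under the estimate V."""
    return {
        a: sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][a])
        for a in P[s]
    }


def value_iteration_sketch(P, gamma, theta):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(action_values(P, V, s, gamma).values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update, as in the notebook's sweep
        if delta < theta:
            break
    # Greedy action per state with respect to the converged values.
    policy = {}
    for s in P:
        q = action_values(P, V, s, gamma)
        policy[s] = max(q, key=q.get)
    return V, policy


V_chain, policy_chain = value_iteration_sketch(P, gamma=0.9, theta=1e-4)
print(V_chain)
print(policy_chain)
```

In this toy chain the state next to the terminal state ends up with value 1 and the state before it with value 0.9, which is the one-step discounting pattern the FrozenLake values follow along a deterministic path.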

In [4]:
'''EXECUTION
'''                                 
gamma = 0.9
theta = 1e-4
V, policy,action_star = value_iteration(fl_env, gamma,  theta)
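
One way to inspect a result like `action_star` is to draw it as arrows on the grid. The sketch below is self-contained: `example_actions` is a hand-written, purely illustrative action list (not the output of the run above), and `render_policy` is a hypothetical helper, assuming the action encoding 0 = left, 1 = down, 2 = right, 3 = up described earlier.

```python
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}  # left, down, right, up


def render_policy(actions, desc):
    """Draw one arrow per cell; holes and the goal keep their map letter."""
    ncols = len(desc[0])
    lines = []
    for row, line in enumerate(desc):
        cells = []
        for col, tile in enumerate(line):
            if tile in "HG":
                cells.append(tile)
            else:
                cells.append(ARROWS[int(actions[row * ncols + col])])
        lines.append("".join(cells))
    return "\n".join(lines)


desc = ["SFFF", "FHFH", "FFFH", "HFFG"]
# Illustrative values only; in the notebook, pass action_star instead.
example_actions = [1, 2, 1, 0, 1, 0, 1, 0, 2, 1, 1, 0, 0, 2, 2, 0]
print(render_policy(example_actions, desc))
```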

In [7]:
'''Playing game
'''
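current_state, info = fl_env.reset()
print(fl_env.render())
for i in range(100):
    # step() returns separate flags for termination (hole or goal) and truncation (time limit).
    current_state, reward, terminated, truncated, info = fl_env.step(int(action_star[current_state]))
    print(fl_env.render())
    if terminated or truncated:
        break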



SFFF
FHFH
FFFH
HFFG

  (Down)
SFFF
FHFH
FFFH
HFFG

  (Down)
SFFF
FHFH
FFFH
HFFG

  (Right)
SFFF
FHFH
FFFH
HFFG

  (Down)
SFFF
FHFH
FFFH
HFFG

  (Right)
SFFF
FHFH
FFFH
HFFG

  (Right)
SFFF
FHFH
FFFH
HFFG
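
As a sanity check on the rollout above: the agent takes six steps and only the last one pays a reward of 1, so with `gamma = 0.9` the discounted return from the start state is 0.9 raised to the fifth power. The sketch below computes it; the reward list is transcribed from the six moves shown, and `discounted_return` is an illustrative helper rather than part of the notebook.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence, with t starting at 0."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


rewards = [0, 0, 0, 0, 0, 1]   # five moves onto frozen tiles, then the goal
g = discounted_return(rewards, gamma=0.9)
print(round(g, 5))             # 0.59049
```

With `is_slippery=False` the greedy path is deterministic, so this number should agree with `V[0]` from value iteration up to floating-point error.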

