In [148]:
import gym
import numpy as np
import random
from collections import namedtuple, defaultdict

from IPython.display import display, clear_output
from time import sleep

## Solving OpenAI's [FrozenLake](https://gym.openai.com/envs/FrozenLake-v0/)

**We control an agent on an ice field and need to figure out how best to get it past the holes to the goal.**

The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

The surface is described using a grid like the following:

```
SFFF       (S: starting point, safe)
FHFH       (F: frozen surface, safe)
FFFH       (H: hole, fall to your doom)
HFFG       (G: goal, where the frisbee is located)
```
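
To make the "uncertain movement" concrete, here is a minimal sketch of how I understand the slippery ice to behave. It assumes the default slippery setting, where the chosen direction and the two perpendicular directions are each taken with probability 1/3. The helper names `slippery_outcomes` and `sample_direction` are made up for illustration and are not part of gym.

```python
import random

# Action codes used by FrozenLake: 0=left, 1=down, 2=right, 3=up.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def slippery_outcomes(action):
    """Return the three directions the agent may actually move in.

    Assumption: on slippery ice the agent goes in the intended direction
    or one of the two perpendicular directions, each with probability 1/3.
    It never moves opposite to the intended direction.
    """
    return [(action - 1) % 4, action, (action + 1) % 4]


def sample_direction(action, rng=random):
    """Sample the direction actually taken (hypothetical helper)."""
    return rng.choice(slippery_outcomes(action))


# Trying to go left can also end up going up or down.
print(slippery_outcomes(LEFT))
```

So choosing "right" only moves the agent right about a third of the time, which is why even a sensible plan can end in a hole.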

We can take the following actions:

In [45]:
Moves = namedtuple("Moves", "left down right up".split())
moves = Moves(0,1,2,3)
moves

Moves(left=0, down=1, right=2, up=3)

Let's make the environment:

In [12]:
env = gym.make('FrozenLake-v0')
env.action_space, env.observation_space, env.reward_range

(Discrete(4), Discrete(16), (0, 1))

So our agent can take 4 actions (left, down, right, up) and can be in one of 16 states, one for each tile of the grid.
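
A quick sketch of how a state number maps back to a tile, assuming the states are numbered row by row from the top-left corner of the 4x4 grid shown earlier. `GRID` and `state_to_cell` are names I made up for this example.

```python
# The 4x4 map from the description above, one string per row.
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(GRID[0])


def state_to_cell(state):
    """Convert a state index into (row, column, tile letter).

    Assumption: states are numbered in row-major order, so
    state = row * N_COLS + column.
    """
    row, col = divmod(state, N_COLS)
    return row, col, GRID[row][col]


for s in (0, 12, 15):
    print(s, state_to_cell(s))
```

Under this numbering, state 0 is the start tile, state 15 is the goal, and state 12 is the hole in the bottom-left corner.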

Moving the agent randomly to see what's happening:

In [152]:
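observation = env.reset()  # reset() returns the starting state
print(f"We are starting at state {observation}")

done = False

# Keep taking random actions until the episode ends (hole or goal).
while not done:
    observation, reward, done, info = env.step(random.choice(moves))
    clear_output(wait=True)  # redraw the grid in place
    env.render()
    print(observation, reward, done, info)
    sleep(0.8)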

  (Down)
SFFF
FHFH
FFFH
HFFG
12 0.0 True {'prob': 0.3333333333333333}


This agent is dumb: it falls into a hole all the time! So, moving on to Q-tables.

For each observation (state) we have 4 possible actions, so I'm going to store the Q value for each state/action pair in a 2D array with one row per state and one column per action:

In [153]:
Q = np.zeros((env.observation_space.n, env.action_space.n))
Q[:4]

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

Given a state s:

`np.max(Q[s])` gives us the highest Q value of that state.

`np.argmax(Q[s])` gives us the index of that highest value, i.e. the action with the highest Q value.
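
One common way these two expressions get used is in tabular Q-learning: `np.argmax` picks the greedy action and `np.max` supplies the value of the next state in the update. The sketch below is only an illustration under my own assumptions. The function names, `Q_demo`, and the values of `alpha`, `gamma`, and `epsilon` are not taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)


def choose_action(Q, s, epsilon=0.1):
    """Epsilon-greedy choice: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # random action index
    return int(np.argmax(Q[s]))               # action with the highest Q value


def q_update(Q, s, a, reward, s_next, done, alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q[s, a] towards the bootstrapped target."""
    # A terminal transition has no future value to add.
    target = reward if done else reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])


# Separate table so the notebook's own Q is left untouched.
Q_demo = np.zeros((16, 4))

# Pretend the agent stepped right (action 2) from state 14 onto the goal.
q_update(Q_demo, s=14, a=2, reward=1.0, s_next=15, done=True)
print(Q_demo[14])
```

After this single update only the entry for state 14, action 2 is non-zero, so the greedy choice in state 14 becomes "right".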

In [114]:
for key, val in Q.Q.items():

defaultdict(int, {((5, 6), 1): 0, ((5, 3), 1): 90, ((5, 3), 2): 85})