# Implementing Tabular Q-Learning on a Simple MDP

We will use tabular Q-learning to learn useful behavior in "Frozen Lake," a simple MDP with a small, discrete set of states and a small, discrete set of actions.  You may occasionally want to refer to [the Gymnasium documentation](https://gymnasium.farama.org/) to understand how to interact with this environment (Frozen Lake can be found under "Environments"->"Toy Text").

Let's start by exploring the interface.  Below is some code that demonstrates moving through the space.  Note that later while training, you will want `render_mode` to be `None`, as rendering *dramatically* slows things down.  You should understand the code in these cells.  Play with changing the action and running it again so you can understand the action space.

In [None]:
import gymnasium as gym
import numpy as np
import numpy.random

In [None]:
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False, render_mode='ansi')
observation, info = env.reset()
print(f'current state is {observation}, board looks like:')
print(env.render())
a = 1
observation, reward, terminated, truncated, info = env.step(a)
print(f'Taking action {a}. Now in state {observation} and received reward {reward}.')
print(f'terminated is {terminated}, truncated is {truncated}')
print('Board looks like:')
print(env.render())
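
As a reference while you experiment, here is a small sketch of how the integers map onto the board. The action encoding follows the Gymnasium Frozen Lake documentation, and states are numbered row by row from the top-left corner. The names `ACTION_NAMES` and `state_to_row_col` are hypothetical helpers introduced here for illustration; they are not part of Gymnasium.

```python
# Action encoding as listed in the Gymnasium Frozen Lake documentation.
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}


def state_to_row_col(state, n_cols=4):
    """Convert a flat state index into (row, column) on the grid.

    Hypothetical helper: assumes states are numbered row by row,
    which is how Frozen Lake reports its observations.
    """
    return divmod(state, n_cols)


# A few states on the 4x4 map: the start, the cell below it, and the goal.
for state in (0, 4, 15):
    print(state, state_to_row_col(state))
```

For the 8x8 map, pass `n_cols=8`.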

Now, write code which hardcodes in a sequence of actions that moves the agent all the way to the goal without stepping in any holes.  Print out the render and reward after each step, so you can observe the reward is 1 when you reach the goal.

In [None]:
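# 1 = down, 2 = right: down, down, right, down, right, right avoids every hole.
actions = [1, 1, 2, 1, 2, 2]
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False, render_mode='ansi')
observation, info = env.reset()
print(env.render())  # show the starting board (the return value must be printed)
for a in actions:
    observation, reward, terminated, truncated, info = env.step(a)
    print(f'reward: {reward}')
    print(env.render())
env.close()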

Now that we understand the interface (and can look up what we don't), it's time to actually do some RL. For the first version of this problem, set `is_slippery=False` and `map_name="4x4"`, which gives the simplest version of the environment.  Use a discount factor of $\gamma = 0.9$.

You'll need to do the following steps:

- Create a numpy array of zeros to hold your Q-values, with shape `number of states x number of actions`.
- Implement $\epsilon$-greedy exploration.  After doing this, pause before implementing the next bullet. Prove to yourself that your agent is visiting a variety of states during the course of this exploration.
- To that exploration loop, add tabular Q-learning.
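
The last two bullets can be sketched as two small helpers. This is only one possible arrangement, not the required solution: the function names `epsilon_greedy` and `q_update`, the random tie-breaking, and the explicit `terminated` flag are all assumptions made for illustration. The update it applies is the usual tabular rule, $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$.

```python
import numpy as np

rng = np.random.default_rng(0)


def epsilon_greedy(Qs, state, epsilon):
    """Pick a random action with probability epsilon, else a greedy one."""
    n_actions = Qs.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    # Break ties at random so an all-zero row does not always yield action 0.
    best = np.flatnonzero(Qs[state] == Qs[state].max())
    return int(rng.choice(best))


def q_update(Qs, state, action, reward, next_state, terminated, alpha, gamma):
    """Apply one tabular Q-learning update in place."""
    # There is no future value to bootstrap from beyond a terminal state.
    future = 0.0 if terminated else gamma * Qs[next_state].max()
    Qs[state, action] += alpha * (reward + future - Qs[state, action])


# Toy check on a 16-state, 4-action table: stepping right from state 14
# into the goal (state 15) with reward 1.
Qs = np.zeros((16, 4))
q_update(Qs, state=14, action=2, reward=1.0, next_state=15,
         terminated=True, alpha=0.5, gamma=0.9)
print(Qs[14])
```

Breaking ties at random is a design choice: plain `np.argmax` returns the first maximal index, so with an all-zero table the greedy branch would always choose action 0.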

Eventually, your agent will consistently reach the goal on its own.  In an additional cell below your training code, write code that uses a greedy strategy (always uses the most-valuable action from your learned Q-values), and renders the environment after every step, to demonstrate that it quickly reaches the goal. Your loop should run until the environment is terminated or truncated.

In [None]:
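is_slippery = False
GAMMA = 0.9    # discount factor specified in the text above
ALPHA = 0.9    # learning rate (assumed value; not defined in the original cell)
EPSILON = 0.5  # exploration probability (assumed value; not defined in the original cell)

# Visit counts for the 16 states of the 4x4 map, used to check exploration.
state_counts = {s: 0 for s in range(16)}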
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=is_slippery)
Qs = np.zeros((env.observation_space.n,env.action_space.n))
observation, info = env.reset()
for i in range(100001):
    old_observation = observation
    if np.random.random() < EPSILON:
        action = env.action_space.sample()  # explore: uniformly random action
    else:
        action = np.argmax(Qs[observation,:])  # exploit: best known action
    observation, reward, terminated, truncated, info = env.step(action)
    # Q-learning update. Terminal states are never updated, so their Q-values
    # stay 0 and contribute nothing to the bootstrapped target.
    target = reward + GAMMA*np.max(Qs[observation,:])
    Qs[old_observation, action] = ALPHA * target + (1-ALPHA)*Qs[old_observation,action]
    
    state_counts[observation] += 1

    if terminated or truncated:
        observation, info = env.reset()

env.close()

In [None]:
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=is_slippery, render_mode='ansi')
observation, info = env.reset()
print(env.render())
terminated = truncated = False
while not (terminated or truncated):
    action = np.argmax(Qs[observation, :])
    observation, reward, terminated, truncated, info = env.step(action)
    print(env.render())

Do it again, but make the lake 8x8.  How much more training does it seem you need to do?
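
One way to judge how much training is needed is to record whether each training episode ended at the goal and then look at a moving average of those outcomes. The sketch below assumes you have collected a list of 0/1 outcomes yourself (for example, by appending `int(reward == 1)` whenever an episode terminates or is truncated); `rolling_success_rate` is a hypothetical helper, and the toy data is made up purely to show the shape of the output.

```python
import numpy as np


def rolling_success_rate(episode_successes, window=100):
    """Fraction of successful episodes in each sliding window of episodes."""
    outcomes = np.asarray(episode_successes, dtype=float)
    if len(outcomes) < window:
        return np.array([])
    # Averaging over a window is a convolution with a flat kernel.
    kernel = np.ones(window) / window
    return np.convolve(outcomes, kernel, mode="valid")


# Toy data: the agent fails for 6 episodes, then always succeeds.
toy_outcomes = [0] * 6 + [1] * 6
print(rolling_success_rate(toy_outcomes, window=3))
```

Comparing the episode at which this curve levels off for the 4x4 and 8x8 maps gives a rough answer to the question above. Keep in mind that during training the curve reflects the exploring policy, not the greedy one.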

Now, we're going to make it slippery.  For non-slippery lakes, it makes sense for your learning rate to be quite high.  For slippery lakes, we will want a smaller learning rate.  Before coding, write in a markdown cell below why that would be true.
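
If you want some intuition before writing your answer, the following sketch may help. It is not Frozen Lake itself: it simply feeds a stream of noisy 0/1 targets (an assumed stand-in for the random outcomes of a slippery step) through the same kind of running update, $q \leftarrow q + \alpha\,(\text{target} - q)$, and compares how much the estimate fluctuates for two learning rates. The probability 0.3 and both values of `alpha` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy targets: 1 with probability 0.3, otherwise 0 (so the true mean is 0.3).
targets = (rng.random(5000) < 0.3).astype(float)


def running_estimates(targets, alpha):
    """Apply q <- q + alpha * (target - q) and record q after each step."""
    q = 0.0
    history = []
    for target in targets:
        q += alpha * (target - q)
        history.append(q)
    return np.array(history)


for alpha in (0.9, 0.05):
    # Skip the first 1000 steps so the start-up transient is excluded.
    tail = running_estimates(targets, alpha)[1000:]
    print(f"alpha={alpha}: mean={tail.mean():.3f}, std={tail.std():.3f}")
```

Compare the two standard deviations, then think about what the same effect would do to Q-values when the next state is random.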

Below, put the code that trains and demos your agent on the slippery environments.  There is no policy that guarantees reaching the goal on a slippery lake.  To demonstrate success, rather than doing a single run, keep rendering off and run your trained agent 100 times to termination. In what percentage of those runs does your agent reach the goal?
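
The 100-run evaluation could be organized roughly as follows. This is a sketch under assumptions rather than the required solution: `success_rate` is a hypothetical helper, it expects any environment object with Gymnasium-style `reset()` and `step()` methods, and it counts an episode as a success when the final reward equals 1, which is how Frozen Lake signals reaching the goal.

```python
import numpy as np


def success_rate(env, Qs, n_episodes=100):
    """Run greedy episodes and return the fraction that end with reward 1."""
    successes = 0
    for _ in range(n_episodes):
        observation, info = env.reset()
        terminated = truncated = False
        reward = 0.0
        while not (terminated or truncated):
            # Greedy policy: always take the highest-valued action.
            action = int(np.argmax(Qs[observation]))
            observation, reward, terminated, truncated, info = env.step(action)
        # In Frozen Lake, only the step that reaches the goal gives reward 1.
        successes += int(reward == 1)
    return successes / n_episodes
```

With a trained table `Qs` and an environment created without a `render_mode`, `success_rate(env, Qs)` returns a value between 0 and 1; multiply by 100 for the percentage.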