## Tabular Q-learning on FrozenLake

The final version of the algorithm is as follows:
1. Start with an empty table for Q(s, a).
2. Obtain (s, a, r, s') from the environment.
3. Make a Bellman update: 𝑄(𝑠, 𝑎) ← (1 − 𝛼)𝑄(𝑠, 𝑎) + 𝛼(𝑟 + 𝛾 max_𝑎′ 𝑄(𝑠′, 𝑎′))
4. Check convergence conditions. If not met, repeat from step 2.
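
The four steps above can be sketched without gym on a toy problem. The snippet below is only an illustration under stated assumptions: the four-state chain environment, the `step` helper, and the fixed step budget that stands in for the convergence check in step 4 are all hypothetical and are not part of the FrozenLake example.

```python
import collections
import random

GAMMA = 0.9
ALPHA = 0.2

# Hypothetical chain of states 0..3 with actions 0 (left) and 1 (right).
# Entering state 3 gives reward 1.0 and ends the episode.
N_STATES = 4
ACTIONS = (0, 1)


def step(state, action):
    """Return (reward, next_state, is_done) for the toy chain."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    is_done = next_state == N_STATES - 1
    return (1.0 if is_done else 0.0), next_state, is_done


def q_learning(n_steps=5000, seed=0):
    rng = random.Random(seed)
    q = collections.defaultdict(float)      # step 1: empty table
    state = 0
    for _ in range(n_steps):                # step 4: fixed budget instead of a convergence test
        action = rng.choice(ACTIONS)        # step 2: obtain (s, a, r, s') by random exploration
        reward, next_state, is_done = step(state, action)
        # step 3: blend the old value with the Bellman target
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        target = reward + GAMMA * best_next
        q[(state, action)] = (1 - ALPHA) * q[(state, action)] + ALPHA * target
        state = 0 if is_done else next_state
    return q


q = q_learning()
for s in range(N_STATES - 1):
    print(s, [round(q[(s, a)], 3) for a in ACTIONS])
```

Because this toy environment is deterministic, the values for moving right should settle near 0.81, 0.9, and 1.0 for states 0, 1, and 2, which is the reward of 1.0 discounted by 𝛾 once per remaining step.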

As mentioned earlier, this method is called tabular Q-learning, as we keep a table of
states with their Q-values. Let's try it on our FrozenLake environment.

We start by importing packages and defining constants. The only new constant
is ALPHA, the learning rate 𝛼 used in the value update. The
initialization of our Agent class is simpler now, as we don't need to track the history
of rewards and transition counters, just our value table. This will make our memory
footprint smaller, which is not a big issue for FrozenLake, but can be critical for
larger environments.

In [2]:
import gym
import collections
from tensorboardX import SummaryWriter

ENV_NAME = "FrozenLake-v0"
GAMMA = 0.9
ALPHA = 0.2
TEST_EPISODES = 20

In [None]:
class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME)
        self.state = self.env.reset()
        # Q-table keyed by (state, action); missing entries read as 0.0
        self.values = collections.defaultdict(float)

    def sample_env(self):
        # Explore with a uniformly random action
        action = self.env.action_space.sample()
        old_state = self.state
        new_state, reward, is_done, _ = self.env.step(action)
        # Start a new episode once the current one ends
        self.state = self.env.reset() if is_done else new_state
        return old_state, action, reward, new_state

    def best_value_and_action(self, state):
        best_value, best_action = None, None
        for action in range(self.env.action_space.n):
            action_value = self.values[(state, action)]
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_value, best_action

    def value_update(self, s, a, r, next_s):
        best_v, _ = self.best_value_and_action(next_s)
        new_v = r + GAMMA * best_v
        old_v = self.values[(s, a)]
        self.values[(s, a)] = old_v * (1-ALPHA) + new_v * ALPHA

    def play_episode(self, env):
        total_reward = 0.0
        state = env.reset()
        while True:
            _, action = self.best_value_and_action(state)
            new_state, reward, is_done, _ = env.step(action)
            total_reward += reward
            if is_done:
                break
            state = new_state
        return total_reward

The sample_env method obtains the next transition from the environment.
We sample a random action from the action space, perform it, and return the tuple
of the old state, action taken, reward obtained, and the new state. If the episode
has ended, the environment is reset so that the next call starts a new episode.
The tuple will be used in the training loop later.

The best_value_and_action method receives the state of the environment and finds the best action to
take from this state by taking the action with the largest value that we have in the
table. If we don't have the value associated with the state and action pair, then we
take it as zero. This method will be used two times: first, in the test method that
plays one episode using our current values table (to evaluate our policy's quality),
and second, in the method that performs the value update to get the value of the
next state.
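
To make the lookup behavior concrete, here is a standalone sketch of the same greedy search, rewritten as a free function so that it runs without gym. The function signature and the sample table entries are assumptions made for illustration. Note that the strict `<` comparison means ties are resolved in favor of the lowest-numbered action, so a state with no stored values yields action 0.

```python
import collections


def best_value_and_action(values, state, n_actions):
    """Greedy lookup over a (state, action) -> value table."""
    best_value, best_action = None, None
    for action in range(n_actions):
        action_value = values[(state, action)]  # missing keys read as 0.0
        if best_value is None or best_value < action_value:
            best_value, best_action = action_value, action
    return best_value, best_action


values = collections.defaultdict(float)
values[(0, 2)] = 0.5  # illustrative entry: action 2 looks good in state 0

print(best_value_and_action(values, 0, 4))  # (0.5, 2)
print(best_value_and_action(values, 1, 4))  # (0.0, 0): unseen state, tie goes to action 0
```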

In the value_update method, we update our values table using one step from the environment. To do this,
we calculate the Bellman approximation for our state, s, and action, a, by summing
the immediate reward with the discounted value of the next state. Then we obtain
the previous value of the state and action pair, and blend these values together
using the learning rate. The result is the new approximation for the value of state s
and action a, which is stored in our table.
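
A small numeric sketch may help show what the blending does. The helper name `blended_update` and the input numbers are made up for illustration; only GAMMA and ALPHA match the constants defined earlier. With a fixed target of 0.9, each update moves the stored value 20% of the remaining distance toward that target.

```python
GAMMA = 0.9
ALPHA = 0.2


def blended_update(old_q, reward, best_next_q):
    """One tabular Q-learning update for a single (state, action) pair."""
    target = reward + GAMMA * best_next_q  # Bellman approximation
    return (1 - ALPHA) * old_q + ALPHA * target


q = 0.0
for i in range(1, 4):
    # Illustrative transition: no immediate reward, best next-state value 1.0
    q = blended_update(q, reward=0.0, best_next_q=1.0)
    print(i, round(q, 4))
# prints:
# 1 0.18
# 2 0.324
# 3 0.4392
```

A smaller 𝛼 makes each step more conservative, which matters in FrozenLake because the same state and action pair can lead to different next states on different visits.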

The last method in our Agent class, play_episode, plays one full episode using the provided test
environment. The action on every step is chosen greedily from our current table of
Q-values. This method is used to evaluate our current policy to check the progress
of learning. Note that this method doesn't alter our value table: it only uses it to
find the best action to take.

The rest of the example is the training loop, which is very similar to examples from
Chapter 5, Tabular Learning and the Bellman Equation: we create a test environment,
agent, and summary writer, and then, in the loop, we do one step in the environment
and perform a value update using the obtained data. Next, we test our current policy
by playing TEST_EPISODES test episodes. If their average reward exceeds 0.80, we stop training.

In [5]:
test_env = gym.make(ENV_NAME)
agent = Agent()
writer = SummaryWriter(comment="-q-learning")

iter_no = 0
best_reward = 0.0
while True:
    iter_no += 1
    s, a, r, next_s = agent.sample_env()
    agent.value_update(s, a, r, next_s)

    reward = 0.0
    for _ in range(TEST_EPISODES):
        reward += agent.play_episode(test_env)
    reward /= TEST_EPISODES
    writer.add_scalar("reward", reward, iter_no)
    if reward > best_reward:
        print("Best reward updated %.3f -> %.3f" % (
            best_reward, reward))
        best_reward = reward
    if reward > 0.80:
        print("Solved in %d iterations!" % iter_no)
        break
writer.close()

Best reward updated 0.000 -> 0.050
Best reward updated 0.050 -> 0.150
Best reward updated 0.150 -> 0.200
Best reward updated 0.200 -> 0.350
Best reward updated 0.350 -> 0.400
Best reward updated 0.400 -> 0.450
Best reward updated 0.450 -> 0.550
Best reward updated 0.550 -> 0.600
Best reward updated 0.600 -> 0.750
Best reward updated 0.750 -> 0.850
Solved in 3978 iterations!


You may have noticed that this version used more iterations to solve the problem
compared to the value iteration method from the previous chapter. The reason
for that is that we are no longer using the experience obtained during testing. In
example Chapter05/02_frozenlake_q_iteration.py, the periodic tests also updated
the Q-table statistics. Here, we don't touch the Q-values during testing, so more
iterations are needed before the environment is solved.