## Q-learning for FrozenLake

In [1]:
import gym
import collections
from tensorboardX import SummaryWriter

In [2]:
ENV_NAME = "FrozenLake-v0"
#ENV_NAME = "FrozenLake8x8-v0"      # uncomment for larger version
GAMMA = 0.9
TEST_EPISODES = 20

The differences from the value iteration example are minor.

- The most obvious change is to our value table. In the previous example, we kept the value of the state, so the key in the dictionary was just a state. Now we need to store values of the Q-function, which has two parameters: state and action, so the key in the value table is now a composite.

- The second difference is in our calc_action_value() function. We just don't need
it anymore, as our action values are stored in the value table.

- Finally, the most important change in the code is in the agent's value_iteration()
method. Before, it was just a wrapper around the calc_action_value() call,
which did the job of Bellman approximation. Now that this function is gone,
replaced by the value table, we need to do this approximation in the
value_iteration() method itself.
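
The change in the table layout from the first bullet can be sketched in a few lines. The state and action numbers below are made-up values for illustration, not results from a real run, and `state_values` and `q_values` are hypothetical names.

```python
import collections

# V-table from the previous example: one entry per state.
state_values = collections.defaultdict(float)
state_values[5] = 0.25            # the key is just the state

# Q-table used here: one entry per (state, action) pair.
q_values = collections.defaultdict(float)
q_values[(5, 2)] = 0.25           # the key is a (state, action) tuple

# Pairs that were never written read as 0.0, so unexplored
# actions start with a zero value.
unseen = q_values[(5, 3)]
```

Because both tables are `defaultdict(float)`, the agent never has to initialize entries explicitly; only the shape of the key changes.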

In [3]:
class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME)
        self.state = self.env.reset()
        self.rewards = collections.defaultdict(float)
        self.transits = collections.defaultdict(collections.Counter)
        self.values = collections.defaultdict(float)

    def play_n_random_steps(self, count):
        for _ in range(count):
            action = self.env.action_space.sample()
            new_state, reward, is_done, _ = self.env.step(action)
            self.rewards[(self.state, action, new_state)] = reward
            self.transits[(self.state, action)][new_state] += 1
            self.state = self.env.reset() if is_done else new_state

    def select_action(self, state):
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            action_value = self.values[(state, action)]
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_action

    def play_episode(self, env):
        total_reward = 0.0
        state = env.reset()
        while True:
            action = self.select_action(state)
            new_state, reward, is_done, _ = env.step(action)
            self.rewards[(state, action, new_state)] = reward
            self.transits[(state, action)][new_state] += 1
            total_reward += reward
            if is_done:
                break
            state = new_state
        return total_reward

    def value_iteration(self):
        # Recompute Q(s, a) for every state-action pair from the collected statistics.
        for state in range(self.env.observation_space.n):
            for action in range(self.env.action_space.n):
                action_value = 0.0
                # Counter of target states observed after taking action in state.
                target_counts = self.transits[(state, action)]
                total = sum(target_counts.values())
                for tgt_state, count in target_counts.items():
                    key = (state, action, tgt_state)
                    reward = self.rewards[key]
                    # Value of the target state is its largest Q-value.
                    best_action = self.select_action(tgt_state)
                    val = reward + GAMMA * \
                          self.values[(tgt_state, best_action)]
                    # Weight by the empirical transition probability.
                    action_value += (count / total) * val
                self.values[(state, action)] = action_value

**The code is very similar to calc_action_value() in the previous example and,
in fact, does almost the same thing. For a given state and action, it
calculates the value of the action from statistics about the target states that we
have reached with that action. To do so, we use the Bellman equation and
our counters, which let us approximate the probability of each target state.
However, the Bellman equation also needs the value of the target state, and now
we have to obtain it differently.**
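
Written out, the approximation described above looks roughly as follows. The notation is an assumption made here for illustration: $C(s, a, s')$ stands for the counter in `self.transits`, $r(s, a, s')$ for the entry in `self.rewards`, $\gamma$ for `GAMMA`, and $V(s')$ for the value of the target state.

$$
Q(s, a) \approx \sum_{s'} \frac{C(s, a, s')}{\sum_{s''} C(s, a, s'')} \left( r(s, a, s') + \gamma V(s') \right)
$$

The fraction is the empirical transition probability, which corresponds to `count / total` in value_iteration().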

**Before, we had it stored in the value table (as we approximated the value of the
states), so we just took it from this table. We can't do this anymore, so we have to
call the select_action method, which chooses the action with the largest
Q-value, and then we take this Q-value as the value of the target state. Of course,
we could implement another function to calculate this state value,
but select_action does almost everything we need, so we reuse it here.**
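
For comparison, the separate function mentioned above might look like this minimal sketch. `state_value` is a hypothetical helper that is not part of the notebook, the number of actions is hard-coded as an assumption, and the Q-values are made up.

```python
import collections

N_ACTIONS = 4  # assumption: FrozenLake's four moves, hard-coded for the sketch


def state_value(q_values, state, n_actions=N_ACTIONS):
    """Hypothetical helper: take V(s) as the largest Q(s, a)."""
    return max(q_values[(state, action)] for action in range(n_actions))


# Made-up Q-values for state 3; the other actions default to 0.0.
q_values = collections.defaultdict(float)
q_values[(3, 0)] = 0.1
q_values[(3, 2)] = 0.7

v3 = state_value(q_values, 3)
```

The notebook gets the same number in two steps instead: select_action() returns the best action, and the table lookup returns its Q-value.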

**As I said, we don't have the calc_action_value method anymore, so, to select
an action, we just iterate over the actions and look up their values in our values
table. It could look like a minor improvement, but if you think about the data that
we used in calc_action_value, it may become obvious why learning the
Q-function is much more popular in RL than learning the V-function.
Our calc_action_value function uses information about both the rewards and the
transition probabilities. That is not a huge problem for the value iteration method, which relies
on this information during training. However, in the next chapter, you will learn
about an extension of the value iteration method that doesn't require probability
approximation and instead works directly from environment samples. For such methods,
this dependency on probabilities adds an extra burden for the agent. In the case of
Q-learning, all the agent needs to make a decision is the Q-values.**

**I don't want to say that V-functions are completely useless, because they are an
essential part of the actor-critic method, which we will talk about in part three
of this book. However, in the area of value learning, Q-functions are the definite
favorite. With regard to convergence speed, both our versions are almost identical,
but the Q-learning version stores one entry per state-action pair, so with FrozenLake's
four actions it requires four times more memory for the value table.**
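
The difference in what the agent needs at decision time can be illustrated with a toy comparison. Everything below is a sketch under stated assumptions: the two-action problem, the statistics, and the function names `greedy_from_v` and `greedy_from_q` are invented for illustration and do not come from FrozenLake or the notebook.

```python
import collections

GAMMA = 0.9
N_ACTIONS = 2  # toy problem, not FrozenLake

# Made-up statistics for a single state 0 with two actions.
transits = collections.defaultdict(collections.Counter)
rewards = collections.defaultdict(float)
transits[(0, 0)][1] = 3      # action 0 led to state 1 three times
transits[(0, 0)][2] = 1      # ... and to state 2 once
transits[(0, 1)][2] = 4      # action 1 always led to state 2
rewards[(0, 1, 2)] = 1.0     # only this transition paid a reward

state_values = {1: 0.5, 2: 0.0}               # V-table
q_values = {(0, 0): 0.3375, (0, 1): 1.0}      # Q-table for the same data


def greedy_from_v(state):
    """With a V-table, scoring an action needs counts and rewards too."""
    def score(action):
        counts = transits[(state, action)]
        total = sum(counts.values())
        return sum((n / total) * (rewards[(state, action, tgt)]
                                  + GAMMA * state_values[tgt])
                   for tgt, n in counts.items())
    return max(range(N_ACTIONS), key=score)


def greedy_from_q(state):
    """With a Q-table, a lookup per action is enough."""
    return max(range(N_ACTIONS), key=lambda a: q_values[(state, a)])
```

Both functions pick the same action for state 0, but only `greedy_from_v` has to touch the transition and reward statistics.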

In [4]:
test_env = gym.make(ENV_NAME)
agent = Agent()
writer = SummaryWriter(comment="-q-iteration")

iter_no = 0
best_reward = 0.0
while True:
    iter_no += 1
    agent.play_n_random_steps(100)
    agent.value_iteration()

    reward = 0.0
    for _ in range(TEST_EPISODES):
        reward += agent.play_episode(test_env)
    reward /= TEST_EPISODES
    writer.add_scalar("reward", reward, iter_no)
    if reward > best_reward:
        print("Best reward updated %.3f -> %.3f" % (best_reward, reward))
        best_reward = reward
    if reward > 0.80:
        print("Solved in %d iterations!" % iter_no)
        break
writer.close()

Best reward updated 0.000 -> 0.100
Best reward updated 0.100 -> 0.150
Best reward updated 0.150 -> 0.300
Best reward updated 0.300 -> 0.350
Best reward updated 0.350 -> 0.400
Best reward updated 0.400 -> 0.450
Best reward updated 0.450 -> 0.800
Best reward updated 0.800 -> 0.900
Solved in 24 iterations!
