# Chapter 4: The Cross-Entropy Method

A first encounter with an actual RL algorithm. As in, a better one (we hope) than the random agent we built in Chapter 2... :)

In [1]:
import numpy as np
import torch
import torch.nn as nn
from collections import namedtuple

HIDDEN_SIZE = 128
BATCH_SIZE = 16
PERCENTILE = 70

In [2]:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)

Note that the net outputs raw scores (logits), even though we want to use the output as a set of probabilities. We _could_ add a softmax layer, but for better numerical stability we'll instead use `CrossEntropyLoss()` in a minute, which takes the raw scores as input and applies softmax and cross-entropy in a single step.
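
To get a feel for why fusing the two steps helps, here is a minimal pure-Python sketch. It is not PyTorch's actual implementation, and the helper names `naive_cross_entropy` and `stable_cross_entropy` are made up for illustration. The stable version uses the standard log-sum-exp trick of subtracting the largest logit before exponentiating.

```python
import math

def naive_cross_entropy(logits, target):
    """Softmax first, then -log(p[target]); overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))

def stable_cross_entropy(logits, target):
    """Same quantity computed as log(sum(exp(z))) - z[target]."""
    m = max(logits)  # after subtracting the max, every exponent is <= 0
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# For small logits both versions agree
small = [2.0, 1.0]
naive_small = naive_cross_entropy(small, 0)
stable_small = stable_cross_entropy(small, 0)
print(naive_small, stable_small)

# For large logits the naive version breaks down
large = [1000.0, 0.0]
try:
    naive_cross_entropy(large, 0)
    naive_failed = False
except OverflowError:
    naive_failed = True  # math.exp(1000.0) does not fit in a float
stable_large = stable_cross_entropy(large, 0)
print(naive_failed, stable_large)
```

The two functions compute the same loss; only the order of operations differs, which is what keeps the second one finite.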

In [3]:
Episode = namedtuple("Episode", field_names=["reward", "steps"])
EpisodeStep = namedtuple("EpisodeStep", field_names=["observation", "action"])


In [4]:
def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1) # To get a "probability" from the raw output

    while True:
        obs_v = torch.FloatTensor([obs]) # Convert current observation to PyTorch tensor and pass as "batch" i.e. in a list
        act_probs_v = sm(net(obs_v)) # Pass through NN and obtain probabilities
        act_probs = act_probs_v.data.numpy()[0] # Convert batch to numpy array and take first batch element to get 1d array of probs

        action = np.random.choice(len(act_probs), p=act_probs) # Sample an action using those probabilities
        next_obs, reward, is_done, _ = env.step(action)

        episode_reward += reward
        step = EpisodeStep(observation=obs, action=action) # Saving the observation *used to choose the action*
        episode_steps.append(step)

        if is_done:
            # Add episode to store
            e = Episode(reward=episode_reward, steps=episode_steps)
            batch.append(e)
            # Prepare for next episode
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()

            if len(batch) == batch_size:
                yield batch # Hand the batch to the caller; the loop resumes from here when the next batch is requested
                batch = []
        
        # Prepare for next iteration
        obs = next_obs
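
Because `iterate_batches` is a generator, it pauses at `yield` and picks up again from the same spot when the caller asks for the next batch. The toy sketch below shows only that control flow, with no environment or network. `iterate_toy_batches` and the integer "episodes" are stand-ins invented for illustration.

```python
def iterate_toy_batches(batch_size):
    batch = []
    episode_id = 0
    while True:
        batch.append(episode_id)  # stand-in for a finished episode
        episode_id += 1
        if len(batch) == batch_size:
            yield batch  # pause here until the caller wants another batch
            batch = []   # start a fresh list; the yielded one is left untouched

seen = []
for iter_no, batch in enumerate(iterate_toy_batches(batch_size=3)):
    seen.append(batch)  # in the real loop, a training step would happen here
    if iter_no == 1:
        break           # the consumer decides when to stop

print(seen)  # [[0, 1, 2], [3, 4, 5]]
```

In the real function this means each new batch is played with whatever weights the network has at that moment, including updates made by the caller between batches.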
            
        

In [5]:
def filter_batch(batch, percentile):
    rewards = list(map(lambda s: s.reward, batch))
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards)) # Used for monitoring only

    # Keep only observations and actions from episodes whose reward is at or above the percentile bound
    train_obs = []
    train_act = []
    for reward, steps in batch:
        if reward < reward_bound:
            continue
        train_obs.extend(map(lambda step: step.observation, steps))
        train_act.extend(map(lambda step: step.action, steps))

    train_obs_v = torch.FloatTensor(train_obs)
    train_act_v = torch.LongTensor(train_act)

    return train_obs_v, train_act_v, reward_bound, reward_mean # Final two just for tensorboard monitoring
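
As a quick sanity check of the filtering idea, here is a small sketch on a made-up batch of ten episodes. The rewards are invented and the `steps` are placeholder strings, so this only illustrates how the percentile bound picks the "elite" episodes. It does not build tensors as `filter_batch` does.

```python
import numpy as np
from collections import namedtuple

Episode = namedtuple("Episode", field_names=["reward", "steps"])

# Hypothetical batch: rewards are made up, steps are placeholders
rewards = [12.0, 30.0, 18.0, 45.0, 22.0, 60.0, 15.0, 27.0, 33.0, 20.0]
batch = [Episode(reward=r, steps=["step-%d" % i]) for i, r in enumerate(rewards)]

# Same call as in filter_batch: the bound is roughly 30.9 for these rewards
reward_bound = np.percentile(rewards, 70)

# Same rule as in filter_batch: drop episodes with reward < reward_bound
elite = [ep for ep in batch if not ep.reward < reward_bound]

print(reward_bound)
print([ep.reward for ep in elite])
```

With a 70th percentile, about 30% of the episodes survive, here the three with the highest total reward.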

Training setup:

In [6]:
import gym
from torch import optim
from torch.utils.tensorboard import SummaryWriter

env = gym.make("CartPole-v0")
#env = gym.wrappers.Monitor(env, directory="mon", force=True) # record monitoring videos

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = Net(obs_size, HIDDEN_SIZE, n_actions)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.01)
writer = SummaryWriter(comment="-cartpole")

Training loop:

In [7]:
for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = objective(action_scores_v, acts_v)
    loss_v.backward()
    optimizer.step()

    print("%d: loss=%.3f, reward_mean=%.1f, rw_bound=%.1f" % (
        iter_no, loss_v.item(), reward_m, reward_b
    ))

    writer.add_scalar("loss", loss_v.item(), iter_no)
    writer.add_scalar("reward_bound", reward_b, iter_no)
    writer.add_scalar("reward_mean", reward_m, iter_no)

    if reward_m > 199:
        print("Solved!")
        writer.close()
        break

0: loss=0.679, reward_mean=18.6, rw_bound=21.5
1: loss=0.681, reward_mean=24.6, rw_bound=27.5
2: loss=0.676, reward_mean=24.2, rw_bound=24.0
3: loss=0.671, reward_mean=29.9, rw_bound=32.5
4: loss=0.658, reward_mean=28.9, rw_bound=37.0
5: loss=0.666, reward_mean=28.4, rw_bound=33.5
6: loss=0.649, reward_mean=25.1, rw_bound=29.0
7: loss=0.652, reward_mean=32.2, rw_bound=36.5
8: loss=0.637, reward_mean=34.4, rw_bound=44.5
9: loss=0.639, reward_mean=37.9, rw_bound=44.0
10: loss=0.605, reward_mean=47.5, rw_bound=55.0
11: loss=0.616, reward_mean=34.6, rw_bound=38.0
12: loss=0.604, reward_mean=44.5, rw_bound=49.0
13: loss=0.582, reward_mean=38.9, rw_bound=48.0
14: loss=0.598, reward_mean=49.1, rw_bound=54.0
15: loss=0.603, reward_mean=45.9, rw_bound=48.5
16: loss=0.597, reward_mean=49.1, rw_bound=52.0
17: loss=0.561, reward_mean=51.9, rw_bound=57.0
18: loss=0.570, reward_mean=48.2, rw_bound=56.0
19: loss=0.564, reward_mean=58.1, rw_bound=59.5
20: loss=0.535, reward_mean=59.3, rw_bound=63.5
21

## FrozenLake environment

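FrozenLake observations are single integers (the index of the grid cell the agent is standing on), so they can't be fed straight into our network. If we "wrap" the environment so that each observation becomes a one-hot vector, we can apply exactly the same code as we just did with CartPole!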

In [8]:
class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        shape = (env.observation_space.n, )
        self.observation_space = gym.spaces.Box(0.0, 1.0, shape, dtype=np.float32)
    
    def observation(self, observation):
        res = np.copy(self.observation_space.low) # all 0.0
        res[observation] = 1.0 # replace nth entry with 1.0
        return res
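
To make the encoding concrete, here is a standalone sketch of what the wrapper's `observation` method produces, without needing gym. The helper `one_hot` is a made-up name, and the grid size of 16 assumes the default 4x4 FrozenLake map.

```python
import numpy as np

def one_hot(state, n_states):
    """Return a float32 vector of zeros with a single 1.0 at index `state`."""
    res = np.zeros(n_states, dtype=np.float32)
    res[state] = 1.0
    return res

# State 5 on a 4x4 grid is row 1, column 1 (state = row * 4 + col)
vec = one_hot(5, 16)
print(vec)
```

The network then sees a 16-element float vector per observation, the same kind of input it got from CartPole, only longer.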


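Very little is altered below: the environment is wrapped, the "solved" threshold on the mean reward (i.e. when did we "win") is changed to an appropriate score, and an iteration cap is added so the loop can't run forever. The TensorBoard run name changes too.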

In [12]:
env = DiscreteOneHotWrapper(gym.make("FrozenLake-v1")) # wrapper applied to environment
#env = gym.wrappers.Monitor(env, directory="mon", force=True) # record monitoring videos

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = Net(obs_size, HIDDEN_SIZE, n_actions)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.01)
writer = SummaryWriter(comment="-frozenlake-naive")

for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = objective(action_scores_v, acts_v)
    loss_v.backward()
    optimizer.step()

    print("%d: loss=%.3f, reward_mean=%.1f, rw_bound=%.1f" % (
        iter_no, loss_v.item(), reward_m, reward_b
    ))

    writer.add_scalar("loss", loss_v.item(), iter_no)
    writer.add_scalar("reward_bound", reward_b, iter_no)
    writer.add_scalar("reward_mean", reward_m, iter_no)

    if reward_m > 0.8: # "solved" threshold changed to suit FrozenLake
        print("Solved!")
        writer.close()
        break

    if iter_no >= 200: # added to prevent running forever
        print("Failed")
        writer.close()
        break

0: loss=1.381, reward_mean=0.0, rw_bound=0.0
1: loss=1.383, reward_mean=0.1, rw_bound=0.0
2: loss=1.380, reward_mean=0.0, rw_bound=0.0
3: loss=1.362, reward_mean=0.0, rw_bound=0.0
4: loss=1.372, reward_mean=0.1, rw_bound=0.0
5: loss=1.373, reward_mean=0.0, rw_bound=0.0
6: loss=1.374, reward_mean=0.0, rw_bound=0.0
7: loss=1.365, reward_mean=0.1, rw_bound=0.0
8: loss=1.346, reward_mean=0.0, rw_bound=0.0
9: loss=1.334, reward_mean=0.0, rw_bound=0.0
10: loss=1.365, reward_mean=0.1, rw_bound=0.0
11: loss=1.319, reward_mean=0.0, rw_bound=0.0
12: loss=1.338, reward_mean=0.0, rw_bound=0.0
13: loss=1.357, reward_mean=0.0, rw_bound=0.0
14: loss=1.279, reward_mean=0.0, rw_bound=0.0
15: loss=1.282, reward_mean=0.0, rw_bound=0.0
16: loss=1.311, reward_mean=0.0, rw_bound=0.0
17: loss=1.314, reward_mean=0.0, rw_bound=0.0
18: loss=1.354, reward_mean=0.0, rw_bound=0.0
19: loss=1.303, reward_mean=0.0, rw_bound=0.0
20: loss=1.262, reward_mean=0.1, rw_bound=0.0
21: loss=1.255, reward_mean=0.0, rw_bound=0.

This doesn't learn well though, because FrozenLake's reward structure is a poor fit for the method as written!

The environment only gives a reward of 1.0 if the agent reaches the goal, and 0.0 otherwise. Because success is super rare for our initially-not-very-clever agent, almost every episode in a batch scores 0.0, so the percentile bound is 0.0 as well (see `rw_bound` in the output above) and the filter can't separate good episodes from bad ones. We never end up with a clean set of "good" episodes to learn from and make improvements.
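
The effect on the percentile filter can be checked with a few lines of NumPy. This is a sketch on made-up reward lists, not on real episodes, and `bound_for` is a hypothetical helper. It uses the same batch size (16) and percentile (70) as the constants defined at the top.

```python
import numpy as np

def bound_for(n_success, batch_size=16, percentile=70):
    """Percentile bound for a batch with n_success rewards of 1.0, the rest 0.0."""
    rewards = [0.0] * (batch_size - n_success) + [1.0] * n_success
    return float(np.percentile(rewards, percentile))

# One lucky episode out of 16: the bound stays at 0.0
rewards = [0.0] * 15 + [1.0]
bound = bound_for(1)
kept = [r for r in rewards if not r < bound]  # same rule as in filter_batch
print(bound, len(kept))

# Smallest number of successes per batch that lifts the bound above zero
first_useful = next(k for k in range(17) if bound_for(k) > 0)
print(first_useful)
```

With a bound of 0.0 the test `reward < reward_bound` never fires, so all 16 episodes are kept and the network is trained mostly on failed runs.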

So to use the cross-entropy method, we need:

* Finite (and preferably short) training episodes
* Total reward from episodes variable enough to differentiate between good and bad episodes
* A problem where we can do without intermediate feedback, because the method only finds out whether the agent succeeded or failed at the end of the episode

And to improve things here, we could:

* Use more episodes per batch (because successful episodes are rarer than with CartPole)
* Apply a discount factor to the reward (to reward shorter episodes more highly)
* Keep successful episodes for longer i.e. keep them in the "filtered" batch for multiple iterations (due to rare success again)
* Decrease learning rate (to allow neural net to average more training samples)
* Increase training time by a LOT (heh)
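
Two of these ideas, discounting the reward and keeping successful episodes around, can be sketched in plain Python. This is only one possible way to do it, under assumptions: `GAMMA`, `discounted_reward`, `keep_elite` and `max_kept` are names and values picked for illustration, and the episodes are hand-made rather than played in the environment.

```python
from collections import namedtuple

Episode = namedtuple("Episode", field_names=["reward", "steps"])
GAMMA = 0.9  # assumed discount factor, chosen only for illustration

def discounted_reward(episode, gamma=GAMMA):
    """Scale the final reward by gamma ** episode length, so shorter wins score higher."""
    return episode.reward * (gamma ** len(episode.steps))

def keep_elite(previous_elite, new_batch, max_kept=4):
    """Carry successful episodes over to the next iteration, best-discounted first."""
    candidates = previous_elite + [ep for ep in new_batch if ep.reward > 0]
    candidates.sort(key=discounted_reward, reverse=True)
    return candidates[:max_kept]

# Hand-made episodes: steps are placeholder integers
short_win = Episode(reward=1.0, steps=list(range(6)))
long_win = Episode(reward=1.0, steps=list(range(20)))
failure = Episode(reward=0.0, steps=list(range(8)))

elite = keep_elite([], [long_win, failure, short_win])
elite = keep_elite(elite, [failure, failure])  # no new wins, the old ones survive

print([len(ep.steps) for ep in elite])
```

Discounting gives the two wins different scores even though both ended with a raw reward of 1.0, which restores some of the variability the percentile filter needs.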