# Cross-Entropy Method in RL
The cross-entropy method falls into the model-free and policy-based category
of methods.

The term "model-free" means that the method doesn't build a model of the
environment or reward.

Policy-based methods directly approximate the policy of the agent, that is, which action the agent should carry out at every step. The policy is usually represented by a probability distribution over the available actions.
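
To make "a probability distribution over the available actions" concrete, here is a minimal sketch that samples actions from a fixed policy output. The probabilities `[0.3, 0.7]` and the helper `sample_action` are made up for illustration and are not part of the notebook's code.

```python
import random

# Hypothetical policy output for one observation: P(action 0), P(action 1).
action_probs = [0.3, 0.7]

def sample_action(probs, rng):
    """Draw an action index with probability given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_action(action_probs, rng) for _ in range(10_000)]

# The empirical frequency of action 1 should be close to its probability.
freq_right = samples.count(1) / len(samples)
print(f"empirical P(action=1) = {freq_right:.2f}")
```

Sampling, rather than always taking the most probable action, is what gives the agent some exploration: episodes played with the same policy can still differ.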

Because our cross-entropy method is policy-based, our nonlinear function (a neural network) produces the policy, which basically says, for every observation, which action the agent should take.

During the agent's lifetime, its experience is presented as episodes. Every episode is a sequence of observations that the agent has received from the environment, the actions it has issued, and the rewards for these actions. Imagine that our agent has played several such episodes. For every episode, we can calculate the total reward that the agent has collected. This total reward shows how good the episode was for the agent.

Due to randomness in the environment and the way that the agent selects actions to take, some episodes will be better than others. The core of the cross-entropy method is to throw away bad
episodes and train on better ones.

### Algorithm
The steps of the method are as follows:

1. Play `N` episodes using our current model and environment.
2. Calculate the total reward for every episode and decide on a reward boundary. Usually, we use some percentile of all rewards, such as the 50th or 70th.
3. Throw away all episodes with a reward below the boundary.
4. Train on the remaining "elite" episodes using observations as the input and issued actions as the desired output.
5. Repeat from step 1 until we become satisfied with the result.
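
Steps 2 and 3 can be sketched on toy numbers as follows. The eight rewards are invented for illustration, and `percentile` is a hand-written stand-in that uses linear interpolation, which is the default behaviour of `np.percentile`.

```python
def percentile(values, q):
    """q-th percentile with linear interpolation between neighbouring values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Hypothetical total rewards of eight played episodes.
episode_rewards = [12.0, 30.0, 18.0, 45.0, 22.0, 60.0, 15.0, 27.0]

# Step 2: the reward boundary is the 70th percentile of the batch.
reward_bound = percentile(episode_rewards, 70)

# Step 3: keep only the "elite" episodes at or above the boundary.
elite = [r for r in episode_rewards if r >= reward_bound]
print(reward_bound, elite)
```

With a 70th-percentile boundary, roughly the top 30% of episodes survive; here that is three of the eight.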

In [1]:
import gym
from collections import namedtuple
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter

from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
from tensorflow.keras import optimizers
from tensorflow.keras.utils import to_categorical

tf.keras.backend.clear_session() # Reset TF notebook state.

### CartPole

In [2]:
HIDDEN_SIZE = 128
BATCH_SIZE = 16
PERCENTILE = 70

### Define Neural Net
The neural network takes in an observation and returns one score per action. In this case, we have a discrete number of potential actions; passing the scores through a softmax turns them into a "probability" for each action, with a higher probability indicating that the action is more desirable.
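
The network defined below outputs raw scores, and the softmax is applied separately when actions are sampled. `nn.CrossEntropyLoss` also expects raw scores, because it applies log-softmax internally. The following pure-Python sketch shows both operations; the score values are made up, and `softmax` and `cross_entropy` are illustrative helpers rather than the PyTorch implementations.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    """Negative log-probability that softmax(scores) assigns to the target action."""
    return -math.log(softmax(scores)[target])

# Hypothetical scores for two actions: the first action is preferred.
probs = softmax([2.0, 0.0])

# Equal scores give a uniform policy, so the loss is ln(2) for two actions.
uniform_loss = cross_entropy([0.0, 0.0], target=1)
print(probs, uniform_loss)
```

An untrained network tends to produce nearly equal scores, so for the two CartPole actions the loss starts near ln 2 ≈ 0.693, which is consistent with the first loss values printed in the PyTorch run.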

In [3]:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)

In [4]:
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

In [5]:
def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset() # Reset environment and obtain first observation.
    sm = nn.Softmax(dim=1)
    while True:
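        obs_v = torch.as_tensor(np.array([obs]), dtype=torch.float32) # Batch of one observation.
        act_probs_v = sm(net(obs_v)) # Get action probabilities from observation.
        act_probs = act_probs_v.detach().numpy()[0] # Detach from the graph before converting to NumPy.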
        action = np.random.choice(len(act_probs), p=act_probs) # Action is an integer.
        next_obs, reward, is_done, _ = env.step(action) # Perform an action and move to the next state.
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs
        
def filter_batch(batch, percentile):
    rewards = list(map(lambda s: s.reward, batch))
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))

    train_obs = []
    train_act = []
    for example in batch:
        if example.reward < reward_bound:
            continue
        train_obs.extend(map(lambda step: step.observation, example.steps))
        train_act.extend(map(lambda step: step.action, example.steps))

    train_obs_v = torch.as_tensor(np.array(train_obs), dtype=torch.float32) # Stack into one array first.
    train_act_v = torch.LongTensor(train_act)
    return train_obs_v, train_act_v, reward_bound, reward_mean

### PyTorch Solution

In [6]:
env = gym.make("CartPole-v0")
# env = gym.wrappers.Monitor(env, directory="mon", force=True)
obs_size = env.observation_space.shape[0] # Dimensionality of the observation vector.
n_actions = env.action_space.n # Number of possible agent actions.

net = Net(obs_size, HIDDEN_SIZE, n_actions)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.01)

for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
    
    # Train neural net.
    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = objective(action_scores_v, acts_v)
    loss_v.backward()
    optimizer.step()
    print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
        iter_no, loss_v.item(), reward_m, reward_b))
    if reward_m > 199:
        print("Solved!")
        break

0: loss=0.692, reward_mean=25.6, reward_bound=26.5
1: loss=0.698, reward_mean=19.8, reward_bound=22.0
2: loss=0.685, reward_mean=21.7, reward_bound=25.0
3: loss=0.675, reward_mean=21.5, reward_bound=25.0
4: loss=0.657, reward_mean=44.4, reward_bound=43.0
5: loss=0.647, reward_mean=36.4, reward_bound=26.0
6: loss=0.638, reward_mean=58.0, reward_bound=50.0
7: loss=0.625, reward_mean=60.1, reward_bound=65.0
8: loss=0.616, reward_mean=58.4, reward_bound=63.5
9: loss=0.602, reward_mean=71.5, reward_bound=80.0
10: loss=0.601, reward_mean=58.3, reward_bound=78.5
11: loss=0.581, reward_mean=72.8, reward_bound=105.0
12: loss=0.579, reward_mean=65.7, reward_bound=82.0
13: loss=0.567, reward_mean=80.4, reward_bound=102.5
14: loss=0.563, reward_mean=76.6, reward_bound=81.0
15: loss=0.570, reward_mean=95.1, reward_bound=108.5
16: loss=0.569, reward_mean=91.1, reward_bound=104.0
17: loss=0.531, reward_mean=77.9, reward_bound=85.5
18: loss=0.538, reward_mean=89.0, reward_bound=95.5
19: loss=0.529, re

### Keras Solution
This version converges slowly, possibly because TensorFlow has to build a new graph each time we fit a Keras model.
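
The Keras model below is compiled with `categorical_crossentropy`, which expects one-hot targets, so the integer actions are converted with `to_categorical` before fitting. When `num_classes` is not given, `to_categorical` infers the width from the largest label in the data. The pure-Python sketch below shows why that can matter; `one_hot` is an illustrative helper and the action list is made up.

```python
def one_hot(actions, num_classes):
    """Encode integer actions as one-hot rows of length num_classes."""
    rows = []
    for a in actions:
        row = [0.0] * num_classes
        row[a] = 1.0
        rows.append(row)
    return rows

# Hypothetical elite batch in which the agent only ever chose action 0.
actions = [0, 0, 0]

# Inferring the width from the data would give a single column ...
inferred_width = max(actions) + 1

# ... while fixing it to the action-space size keeps two columns.
encoded = one_hot(actions, num_classes=2)
print(inferred_width, encoded)
```

Passing the number of actions explicitly keeps the target shape equal to the model's output shape even when an elite batch happens to contain only one kind of action.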

In [63]:
def keras_mlp():
    
    inputs = keras.Input(shape=(obs_size,))
    dense = layers.Dense(256, activation='relu')(inputs)
    outputs = layers.Dense(n_actions, activation='softmax')(dense)
    
    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
    
    return model

In [64]:
def iterate_batches_keras(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset() # Reset environment and obtain first observation.
    while True:
        obs_v = np.array([obs])
        act_probs = net.predict(obs_v, verbose=0).reshape(-1) # Softmax output; silence per-call progress output.
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _ = env.step(action) # Perform an action and move to the next state.
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs
        
def filter_batch_keras(batch, percentile):
    rewards = list(map(lambda s: s.reward, batch))
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))

    train_obs = []
    train_act = []
    for example in batch:
        if example.reward < reward_bound:
            continue
        train_obs.extend(map(lambda step: step.observation, example.steps))
        train_act.extend(map(lambda step: step.action, example.steps))

    return train_obs, train_act, reward_bound, reward_mean

In [65]:
env = gym.make("CartPole-v0")
# env = gym.wrappers.Monitor(env, directory="mon", force=True)
obs_size = env.observation_space.shape[0] # Dimensionality of the observation vector.
n_actions = env.action_space.n # Number of possible agent actions.
net = keras_mlp()

In [None]:
for iter_no, batch in enumerate(iterate_batches_keras(env, net, BATCH_SIZE)):
    obs_v, acts_v, reward_b, reward_m = filter_batch_keras(batch, PERCENTILE)
    
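    # Fix the one-hot width so the targets match the model output even if only one action appears.
    obs_v, acts_v_binary = np.array(obs_v), to_categorical(acts_v, num_classes=n_actions)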
    net.fit(obs_v, acts_v_binary, batch_size = BATCH_SIZE, verbose = 0)
    
    print("%d: reward_mean=%.1f, reward_bound=%.1f" % (
        iter_no, reward_m, reward_b))
    if reward_m > 199:
        print("Solved!")
        break

0: reward_mean=19.1, reward_bound=19.5
1: reward_mean=22.4, reward_bound=22.5
2: reward_mean=30.4, reward_bound=40.0
3: reward_mean=39.6, reward_bound=48.5
4: reward_mean=36.3, reward_bound=33.5
5: reward_mean=31.1, reward_bound=40.0
6: reward_mean=35.4, reward_bound=41.0
7: reward_mean=35.9, reward_bound=52.0
8: reward_mean=43.6, reward_bound=47.5
9: reward_mean=49.9, reward_bound=56.5
10: reward_mean=53.0, reward_bound=66.0
11: reward_mean=50.3, reward_bound=65.5
12: reward_mean=65.3, reward_bound=80.5
13: reward_mean=93.8, reward_bound=111.0
14: reward_mean=74.8, reward_bound=77.5
15: reward_mean=76.2, reward_bound=87.0
16: reward_mean=111.7, reward_bound=123.5
17: reward_mean=128.3, reward_bound=156.5
18: reward_mean=97.5, reward_bound=127.0
19: reward_mean=136.5, reward_bound=152.5
20: reward_mean=90.6, reward_bound=125.0
21: reward_mean=131.9, reward_bound=165.5
22: reward_mean=159.6, reward_bound=200.0
23: reward_mean=151.1, reward_bound=200.0
