In [2]:
import gym
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.optim as optim


HIDDEN_SIZE = 128
BATCH_SIZE = 16 # number of episodes per iteration
PERCENTILE = 70

Here we create a class called `Net` that inherits from `nn.Module`. Using `nn.Sequential`, we specify the structure of the neural net: an input layer sized to the observation space, one hidden layer with the desired number of nodes followed by a ReLU, and an output layer that produces one raw score per possible action. We override the `forward` function so that calling the network passes the input through this stack.

In [3]:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)
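
To make the shapes concrete, here is a rough sketch of the same computation written out in plain NumPy. The weights are random placeholders (a trained `nn.Linear` would hold learned values, and stores its weight matrix transposed relative to this sketch), and the sizes 4 and 2 are the CartPole observation and action sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_size, hidden_size, n_actions = 4, 128, 2   # CartPole sizes used in this notebook

# Placeholder random weights, only for illustrating the shapes.
W1, b1 = rng.normal(size=(obs_size, hidden_size)), np.zeros(hidden_size)
W2, b2 = rng.normal(size=(hidden_size, n_actions)), np.zeros(n_actions)

def forward(x):
    hidden = np.maximum(x @ W1 + b1, 0.0)   # Linear layer followed by ReLU
    return hidden @ W2 + b2                 # Linear layer: one raw score per action

x = rng.normal(size=(1, obs_size))          # a batch holding one observation
scores = forward(x)
print(scores.shape)                         # (1, 2)
```

The output is a row of raw scores, not probabilities; turning them into probabilities is left to a separate softmax step.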

Here we define `EpisodeStep`, which records a single step within an episode: the observation the agent saw and the action it took. `Episode` stores the total (undiscounted) reward of an episode together with the list of its steps.

In [4]:
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
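
As a quick illustration of how these two containers nest, here is a made-up two-step episode. The observation values are invented and shortened to two numbers for brevity.

```python
from collections import namedtuple

EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])
Episode = namedtuple('Episode', field_names=['reward', 'steps'])

# Made-up episode: two steps, each earning a reward of 1.
steps = [EpisodeStep(observation=[0.0, 0.1], action=1),
         EpisodeStep(observation=[0.0, 0.2], action=0)]
episode = Episode(reward=2.0, steps=steps)

print(episode.reward)            # 2.0
print(len(episode.steps))        # 2
print(episode.steps[0].action)   # 1
```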

Here we define a generator function that plays episodes with the current network and yields them in batches of `batch_size` episodes, recording the total reward and the steps of each episode. We first initialize our variables and reset the AI Gym environment to start a fresh episode.

Each pass through the `while` loop is one environment step, not one batch. We wrap the current observation in a tensor `obs_v` and use Softmax to convert the raw action scores to probabilities. We then extract these probabilities through the `data` attribute and convert them to a NumPy array. We make a random choice of action weighted by these probabilities. We take the action by calling `env.step` and obtain our new observation, the reward for this action, and whether or not the episode ended.

We keep track of the total episode reward by adding the reward from this last action, and we store the observation that led to the action, together with the action itself, in `episode_steps`.

If the episode is over, we append its results (total reward and list of steps) to `batch`, reinitialize the variables, and start a new episode. Only once `batch` holds `batch_size` episodes is it yielded to the caller, after which a fresh batch is started.

The `while True` loop never ends on its own. It stops only because the training loop that consumes the generator later calls `break` once we consider the agent to have learned sufficiently well (mean episode reward above 199, i.e. the pole stays up for nearly 200 time steps). The number of batches required to get to this point is not known in advance.

In [5]:
def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1)
    while True:
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _ = env.step(action)
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs
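
The action-sampling step inside the loop can be tried in isolation. The sketch below uses NumPy only, with hypothetical raw scores in place of real network output, and checks that the sampled actions follow the softmax probabilities.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
scores = np.array([2.0, 0.0])           # hypothetical raw scores for two actions
probs = softmax(scores)                 # roughly [0.88, 0.12]

# Sample many actions and compare empirical frequencies with the probabilities.
actions = rng.choice(len(probs), size=10_000, p=probs)
freq = np.bincount(actions, minlength=2) / len(actions)
print(probs, freq)
```

Because actions are sampled rather than chosen greedily, the lower-scoring action is still tried occasionally, which is the only source of exploration in this method.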

Here is where we collect training data from the episodes that meet our entrance criterion: total reward at or above the 70th percentile of the current batch (the "elite" episodes). We compute the 70th percentile of the batch's episode rewards as well as the mean reward, then keep only the episodes at or above the threshold for training. The training data is split into observations (input) and actions (target output).

We then convert the observation training data to an n x 4 tensor and the action data to a 1-D tensor of length n holding the action indices (0 or 1), where n is the total number of steps across the elite episodes in the batch (the batch is rebuilt from scratch every iteration).

The reward boundary and mean are returned only as ways of measuring the performance of the agent.

In [6]:
def filter_batch(batch, percentile):
    rewards = list(map(lambda s: s.reward, batch))
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))

    train_obs = []
    train_act = []
    for example in batch:
        if example.reward < reward_bound:
            continue
        train_obs.extend(map(lambda step: step.observation, example.steps))
        train_act.extend(map(lambda step: step.action, example.steps))

    train_obs_v = torch.FloatTensor(train_obs)
    train_act_v = torch.LongTensor(train_act)
    return train_obs_v, train_act_v, reward_bound, reward_mean
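
A small worked example of the filtering, using NumPy arrays in place of tensors. The helper `make_episode` and the reward values are made up for illustration; every step is assumed to earn a reward of 1, as in CartPole.

```python
from collections import namedtuple
import numpy as np

EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])
Episode = namedtuple('Episode', field_names=['reward', 'steps'])

def make_episode(n_steps):
    # Hypothetical helper: zero observations, alternating actions, reward 1 per step.
    steps = [EpisodeStep(observation=[0.0] * 4, action=i % 2) for i in range(n_steps)]
    return Episode(reward=float(n_steps), steps=steps)

batch = [make_episode(n) for n in (10, 20, 30, 40, 50)]
rewards = [ep.reward for ep in batch]
bound = np.percentile(rewards, 70)               # about 38 for these rewards
elite = [ep for ep in batch if ep.reward >= bound]

# Flatten the steps of all elite episodes into training arrays.
obs = np.array([s.observation for ep in elite for s in ep.steps])
acts = np.array([s.action for ep in elite for s in ep.steps])
print([ep.reward for ep in elite])   # [40.0, 50.0]
print(obs.shape, acts.shape)         # (90, 4) (90,)
```

Two episodes survive the cut, and their 40 + 50 = 90 steps become 90 training rows.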


Finally, we run our functions! We start by creating the environment and reading off the size of the observation space and the action space. We then create the network with the input layer, one hidden layer, and the output layer. We use `CrossEntropyLoss`, which applies log-softmax internally (more numerically stable than applying softmax and then computing cross-entropy separately); this is why the network outputs raw scores and softmax is applied only when sampling actions. We use Adam for optimization.

In our `for` loop, since we are calling `enumerate` on the generator, batches from `iterate_batches` are produced one at a time, each paired with its iteration number. The loop breaks (and we stop pulling from the otherwise infinite generator) when the mean reward per episode exceeds 199.
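
The interplay between an endless generator and `break` can be seen in a stripped-down sketch. `count_batches` is a hypothetical stand-in for `iterate_batches` that yields a pretend mean reward instead of real episodes.

```python
def count_batches():
    # Hypothetical stand-in for iterate_batches: an endless generator.
    n = 0
    while True:
        yield n * 50      # pretend this is the mean reward of batch n
        n += 1

seen = []
for iter_no, reward_mean in enumerate(count_batches()):
    seen.append(iter_no)
    if reward_mean > 199:   # same stopping rule as the training loop
        break

print(seen)   # [0, 1, 2, 3, 4]
```

The generator body only runs when the `for` loop asks for the next item, so breaking out of the loop is enough to stop it.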

So for the given batch of episodes (recall each entry holds an episode's total reward and its list of steps), we first call `filter_batch` to get only the elite episodes. As in any PyTorch training loop, we set all the gradients to zero before the forward pass, then call the network on the observation tensor and compute the loss against the actions that were taken. The call to `.backward()` computes the gradients of the loss with respect to the network's parameters and `step()` updates the parameters. While the gradients are reset with each iteration, the parameters are carried through (this is how the network learns). We then write the loss and the reward boundary and mean to the writer to monitor performance.
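
The reason for resetting gradients is that PyTorch adds each new gradient to whatever is already stored. A minimal sketch with a single made-up parameter tensor `w` (not part of the notebook's network):

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(3 * w).sum().backward()
first = w.grad.clone()          # gradient of 3*w[0] + 3*w[1] is [3, 3]

(3 * w).sum().backward()        # no reset: the new gradient is added to the old one
accumulated = w.grad.clone()    # [6, 6]

w.grad.zero_()                  # clear the stored gradient by hand
(3 * w).sum().backward()
after_reset = w.grad.clone()    # [3, 3] again

print(first, accumulated, after_reset)
```

`optimizer.zero_grad()` performs this clearing for every parameter the optimizer manages (depending on the PyTorch version, it may set the gradients to `None` rather than to zero).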

In [7]:
if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    #env = gym.wrappers.Monitor(env, directory="mon", force=True)
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.01)
    writer = SummaryWriter(comment="-cartpole")

    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
        optimizer.zero_grad()
        action_scores_v = net(obs_v) 
        loss_v = objective(action_scores_v, acts_v) 
        loss_v.backward()
        optimizer.step()
        print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
            iter_no, loss_v.item(), reward_m, reward_b))
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_bound", reward_b, iter_no)
        writer.add_scalar("reward_mean", reward_m, iter_no)
        if reward_m > 199:
            print("Solved!")
            break
    writer.close()

0: loss=0.684, reward_mean=16.2, reward_bound=17.5
1: loss=0.702, reward_mean=20.6, reward_bound=20.0
2: loss=0.692, reward_mean=23.3, reward_bound=24.0
3: loss=0.669, reward_mean=30.8, reward_bound=34.5
4: loss=0.642, reward_mean=41.2, reward_bound=49.0
5: loss=0.645, reward_mean=32.2, reward_bound=39.5
6: loss=0.634, reward_mean=40.1, reward_bound=48.5
7: loss=0.641, reward_mean=40.1, reward_bound=48.0
8: loss=0.609, reward_mean=44.4, reward_bound=49.0
9: loss=0.612, reward_mean=46.3, reward_bound=54.0
10: loss=0.625, reward_mean=46.1, reward_bound=48.0
11: loss=0.612, reward_mean=58.9, reward_bound=61.5
12: loss=0.590, reward_mean=50.3, reward_bound=60.0
13: loss=0.606, reward_mean=59.0, reward_bound=58.0
14: loss=0.580, reward_mean=63.6, reward_bound=76.0
15: loss=0.594, reward_mean=60.1, reward_bound=65.0
16: loss=0.577, reward_mean=53.9, reward_bound=61.0
17: loss=0.560, reward_mean=54.1, reward_bound=60.5
18: loss=0.590, reward_mean=78.4, reward_bound=81.5
19: loss=0.576, reward
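
The first logged loss values (0.684, 0.702, 0.692) sit close to ln 2 ≈ 0.693, which is consistent with an untrained network that treats both actions as roughly equally likely. The sketch below reproduces that number with a NumPy version of cross-entropy on raw scores; it is an illustration of the formula, not PyTorch's own implementation.

```python
import numpy as np

def cross_entropy(scores, targets):
    # Stable log-softmax, then the mean negative log-probability of the targets.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

uniform_scores = np.zeros((4, 2))     # equal scores: both actions equally likely
targets = np.array([0, 1, 1, 0])      # made-up action labels
loss = cross_entropy(uniform_scores, targets)
print(round(float(loss), 3))          # 0.693, i.e. ln 2
```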

NOTE: It is important to understand this distinctive feature of how reinforcement learning uses neural networks. We first use the network to make decisions, then based on the outcomes of those decisions we update the network's parameters. We then use the updated network to make decisions for a new batch of episodes, then again update the parameters.
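
This act-then-update cycle can be sketched on a toy problem without a neural network. Everything below is an assumption made for illustration: a one-step task with two actions where action 1 always pays 1 and action 0 pays 0 (`true_reward`), a policy that is just two logits, a batch of 100 samples (larger than the notebook's 16, to keep the demo stable), and a plain gradient step with learning rate 0.5 instead of Adam.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.zeros(2)                    # policy parameters: one raw score per action
true_reward = np.array([0.0, 1.0])      # hypothetical task: action 1 is better

for _ in range(20):
    probs = softmax(logits)
    actions = rng.choice(2, size=100, p=probs)   # 1) act with the current policy
    rewards = true_reward[actions]
    bound = np.percentile(rewards, 70)
    elite = actions[rewards >= bound]            # 2) keep the elite samples
    # 3) gradient of the mean cross-entropy w.r.t. the logits:
    #    softmax(logits) minus the empirical frequency of the elite actions
    grad = probs - np.bincount(elite, minlength=2) / len(elite)
    logits -= 0.5 * grad                         # 4) update, then act again

final_probs = softmax(logits)
print(final_probs)   # probability mass has shifted toward action 1
```

Each pass uses the current policy to generate its own training data, so the data distribution changes as the policy improves.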

How does this relate to policy iteration, policy improvement, value iteration, etc?


~~

Running code below to see what input and output tensors look like:

In [14]:
env = gym.make("CartPole-v0")
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = Net(obs_size, HIDDEN_SIZE, n_actions)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.01)
writer = SummaryWriter(comment="-cartpole")

batch = []
episode_reward = 0.0
episode_steps = []
obs = env.reset()
sm = nn.Softmax(dim=1)

batch_size = BATCH_SIZE

while len(batch) < batch_size:
    obs_v = torch.FloatTensor([obs])
    act_probs_v = sm(net(obs_v))
    act_probs = act_probs_v.data.numpy()[0]
    action = np.random.choice(len(act_probs), p=act_probs)
    next_obs, reward, is_done, _ = env.step(action)
    episode_reward += reward
    episode_steps.append(EpisodeStep(observation=obs, action=action))
    if is_done:
        batch.append(Episode(reward=episode_reward, steps=episode_steps))
        episode_reward = 0.0
        episode_steps = []
        next_obs = env.reset()
    obs = next_obs
    
obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)

obs_v

tensor([[ 3.5141e-03,  1.7276e-02, -3.7226e-02,  3.3517e-02],
        [ 3.8597e-03,  2.1291e-01, -3.6556e-02, -2.7068e-01],
        [ 8.1179e-03,  4.0854e-01, -4.1970e-02, -5.7466e-01],
        [ 1.6289e-02,  2.1403e-01, -5.3463e-02, -2.9549e-01],
        [ 2.0569e-02,  4.0987e-01, -5.9373e-02, -6.0454e-01],
        [ 2.8766e-02,  2.1562e-01, -7.1463e-02, -3.3114e-01],
        [ 3.3079e-02,  2.1588e-02, -7.8086e-02, -6.1817e-02],
        [ 3.3511e-02, -1.7233e-01, -7.9323e-02,  2.0524e-01],
        [ 3.0064e-02,  2.3829e-02, -7.5218e-02, -1.1137e-01],
        [ 3.0541e-02, -1.7014e-01, -7.7445e-02,  1.5667e-01],
        [ 2.7138e-02,  2.6001e-02, -7.4312e-02, -1.5941e-01],
        [ 2.7658e-02, -1.6798e-01, -7.7500e-02,  1.0894e-01],
        [ 2.4298e-02, -3.6191e-01, -7.5321e-02,  3.7620e-01],
        [ 1.7060e-02, -5.5589e-01, -6.7797e-02,  6.4421e-01],
        [ 5.9422e-03, -7.5000e-01, -5.4913e-02,  9.1480e-01],
        [-9.0579e-03, -5.5418e-01, -3.6617e-02,  6.0538e-01],
        

In [17]:
acts_v

tensor([ 1,  1,  0,  1,  0,  0,  0,  1,  0,  1,  0,  0,  0,  0,
         1,  1,  1,  1,  0,  0,  1,  0,  0,  0,  1,  1,  0,  1,
         1,  0,  0,  1,  0,  1,  0,  1,  1,  1,  0,  0,  1,  1,
         0,  0,  0,  0,  0,  0,  0,  1,  0,  1,  0,  1,  0,  0,
         0,  1,  1,  1,  0,  0,  1,  1,  1,  0,  0,  1,  1,  0,
         1,  1,  1,  0,  1,  1,  0,  0,  1,  1,  0,  1,  1,  1,
         1,  0,  1,  0,  1,  1,  1,  1,  0,  0,  0,  1,  0,  1,
         1,  0,  0,  1,  0,  0,  1,  1,  1,  0,  1,  1,  0,  0,
         1,  1,  0,  1,  0,  1,  0,  0,  1,  0,  0,  0,  0,  0,
         0,  0,  1,  1,  0,  0,  0,  1,  1,  1,  0,  0,  0,  0,
         0,  0,  1,  0,  1,  0,  0,  0,  1,  1])