# Policy Gradients (REINFORCE) on CartPole

First, we define the hyperparameters. The EPISODES_TO_TRAIN value specifies how many
complete episodes we collect before each training step.

In [None]:
import gym
import ptan
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [1]:
GAMMA = 0.99
LEARNING_RATE = 0.01
EPISODES_TO_TRAIN = 4

The following network should also be familiar to you. Note that although our network
defines a policy, that is, the probabilities of the actions, we do not apply the softmax
nonlinearity to its output. The reason is that we will use the PyTorch log_softmax function
to calculate the logarithm of the softmax output in one step. This way of calculating it is
much more numerically stable; however, we need to remember that the output from the network
is not probabilities, but raw scores (usually called logits).

In [2]:
class PGN(nn.Module):
    def __init__(self, input_size, n_actions):
        super(PGN, self).__init__()

        self.net = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions)
        )

    def forward(self, x):
        return self.net(x)
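
To make the numerical stability point concrete, here is a minimal sketch in plain Python. It is not PyTorch's actual implementation; the helper names naive_log_softmax and stable_log_softmax are made up for this illustration. It compares taking the logarithm of a separately computed softmax with the log-sum-exp formulation that a fused log-softmax can use.

```python
import math

def naive_log_softmax(logits):
    # exponentiate first, then normalize, then take the log
    exps = [math.exp(x) for x in logits]   # overflows for large logits
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(logits):
    # subtract the maximum before exponentiating (log-sum-exp trick)
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

small = [1.0, 2.0]
large = [1000.0, 1001.0]

# both versions agree on moderate inputs
print(naive_log_softmax(small))
print(stable_log_softmax(small))

# the naive version fails on large logits, the stable one does not
try:
    naive_log_softmax(large)
    naive_failed = False
except OverflowError:
    naive_failed = True
print(naive_failed)
print(stable_log_softmax(large))
```

The two inputs differ only by a constant shift, so the stable version returns the same log probabilities for both, while the naive version cannot even evaluate exp(1000).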

The following function is a bit tricky. It accepts a list of rewards for the whole episode and
needs to calculate the discounted total reward for every step. To do this efficiently,
we calculate the total reward starting from the end of the local reward list. Indeed, the last
step of the episode has a total reward equal to its local reward. The step before the
last has a total reward of $r_{t-1} + \gamma r_t$ (where $t$ is the index of the last step).

In [3]:
def calc_qvals(rewards):
    res = []
    sum_r = 0.0
    for r in reversed(rewards):
        sum_r *= GAMMA
        sum_r += r
        res.append(sum_r)
    return list(reversed(res))

Because we iterate backward, our sum_r variable contains the discounted total reward for the
steps that follow the current one. So, to get the total reward for the current step, we
multiply sum_r by gamma and add the local reward.
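
As a quick check of this recursion, the sketch below compares it with the direct definition of the discounted return on a three-step episode. The gamma parameter and the calc_qvals_direct helper are additions for this illustration only and are not part of the notebook's code.

```python
def calc_qvals(rewards, gamma=0.99):
    # same backward recursion as above, with gamma passed in explicitly
    res = []
    sum_r = 0.0
    for r in reversed(rewards):
        sum_r = r + gamma * sum_r
        res.append(sum_r)
    return list(reversed(res))

def calc_qvals_direct(rewards, gamma=0.99):
    # direct definition: Q_t = sum over k of gamma^k * r_{t+k} (quadratic time)
    return [
        sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
        for t in range(len(rewards))
    ]

rewards = [1.0, 1.0, 1.0]
fast = calc_qvals(rewards)
slow = calc_qvals_direct(rewards)
print(fast)
print(slow)
```

With three rewards of 1 and a discount of 0.99, both versions give approximately 2.9701, 1.99, and 1.0, but the backward pass needs only one multiplication and one addition per step.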

The preparation steps before the training loop should also be familiar to you.

The only new element is the agent class from the PTAN library. Here, we are using
ptan.agent.PolicyAgent, which needs to make a decision about actions for every
observation. As our network now returns the policy as the probabilities of the
actions, in order to select the action to take, we need to obtain the probabilities from
the network and then perform random sampling from this probability distribution.

When we worked with DQN, the output of the network was Q-values, so if some
action had the value of 0.4 and another action had 0.5, the second action was
preferred 100% of the time. In the case of the probability distribution, if the first
action has a probability of 0.4 and the second 0.5, our agent should take the first
action with a 40% chance and the second with a 50% chance. Of course, our network
can decide to take the second action 100% of the time, and in this case, it returns the
probability 0 for the first action and the probability 1 for the second action.

This difference is important to understand, but the change in the implementation is
not large. Our PolicyAgent internally calls the NumPy random.choice function with
probabilities from the network. The apply_softmax argument instructs it to convert
the network output to probabilities by calling softmax first. 
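
The difference between greedy selection and sampling can be sketched in a few lines of NumPy. This is only an illustration of the idea, not PTAN's code: the softmax helper is written here by hand, the logits are chosen so that the probabilities come out as 0.4 and 0.6, and a seeded Generator is used instead of the global random.choice to keep the run repeatable.

```python
import numpy as np

def softmax(logits):
    # shift by the maximum for numerical stability
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

rng = np.random.default_rng(0)
logits = np.array([0.0, np.log(1.5)])  # softmax gives probabilities 0.4 and 0.6
probs = softmax(logits)

# greedy selection always picks the most probable action
greedy_action = int(np.argmax(probs))

# sampling picks each action in proportion to its probability
sampled = rng.choice(len(probs), size=10_000, p=probs)
freq = np.bincount(sampled, minlength=len(probs)) / len(sampled)

print(probs)
print(greedy_action)
print(freq)
```

The greedy choice is always the second action, whereas the sampled frequencies stay close to the probabilities themselves, so the first action is still tried regularly.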

The preprocessor argument is a way to get around the fact that the CartPole environment in Gym
returns the observation as float64 instead of the float32 that PyTorch layers expect by default.
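
The type mismatch itself is easy to reproduce with NumPy alone. The to_float32 helper below is hypothetical and shows only the cast; it is not the source of ptan.agent.float32_preprocessor, which also has to hand the result to PyTorch as a tensor.

```python
import numpy as np

def to_float32(states):
    # hypothetical helper: stack a list of observations and cast to float32
    return np.array(states, dtype=np.float32)

obs = np.array([0.01, -0.02, 0.03, 0.04])   # NumPy defaults to float64
batch = to_float32([obs, obs])
print(obs.dtype, batch.dtype, batch.shape)
```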

In [4]:
env = gym.make("CartPole-v0")
writer = SummaryWriter(comment="-cartpole-reinforce")

net = PGN(env.observation_space.shape[0], env.action_space.n)
print(net)

agent = ptan.agent.PolicyAgent(net, preprocessor=ptan.agent.float32_preprocessor,
                               apply_softmax=True)
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=GAMMA)

optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)

total_rewards = []
step_idx = 0
done_episodes = 0

batch_episodes = 0
batch_states, batch_actions, batch_qvals = [], [], []
cur_rewards = []

PGN(
  (net): Sequential(
    (0): Linear(in_features=4, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=2, bias=True)
  )
)


Before we can start the training loop, we need several variables. The first group of
these is used for reporting and contains the total rewards for the episodes and the
count of completed episodes. 

The second group is used to gather the training data.
The cur_rewards list contains local rewards for the episode being currently played.

When this episode reaches its end, we calculate the discounted total rewards from the
local rewards using the calc_qvals function and append them to the batch_qvals
list. The batch_states and batch_actions lists contain the states and actions that we
have seen since the last training step.

When enough episodes have passed since the last training step, we perform
optimization on the gathered examples. As a first step, we convert states, actions,
and Q-values into the appropriate PyTorch form.

Then we calculate the loss from the steps. To do this, we ask our network to convert
states into logits and calculate the log-softmax of them. Next, we select the log
probabilities of the actions taken and scale them with Q-values. Finally, we average
those scaled values and negate the result to obtain the loss to minimize. Once again,
this minus sign is very important, as our policy gradient objective needs to be
maximized to improve the policy. As the optimizer in PyTorch minimizes the loss
function, we need to negate the objective.
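
The arithmetic of this loss can be checked by hand on a tiny batch. The sketch below redoes the same steps in NumPy rather than PyTorch, so it has no gradients; the log_softmax helper and the input values are assumptions chosen to keep the numbers simple.

```python
import numpy as np

def log_softmax(logits):
    # row-wise log-softmax using the log-sum-exp trick
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

logits = np.zeros((2, 2))            # two steps, two actions, uniform policy
actions = np.array([0, 1])           # action taken at each step
qvals = np.array([2.0, 1.0])         # discounted return of each step

log_prob = log_softmax(logits)
# pick the log-probability of the action actually taken at each step
log_prob_actions = qvals * log_prob[np.arange(len(actions)), actions]
# negate the mean so that minimizing the loss maximizes the objective
loss = -log_prob_actions.mean()
print(loss)
```

With a uniform policy, every log probability equals -ln 2, so the loss is (2 + 1) / 2 * ln 2, roughly 1.04. Steps with a larger return contribute more to the loss, which is what pushes the policy toward the actions taken in those steps.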

Then we perform backpropagation to gather gradients in our
variables and ask the optimizer to perform an update step (Adam in our case). At the end
of the training loop, we reset the episode counter and clear our lists so that fresh data
can be gathered.

In [5]:
for step_idx, exp in enumerate(exp_source):
    batch_states.append(exp.state)
    batch_actions.append(int(exp.action))
    cur_rewards.append(exp.reward)

    if exp.last_state is None:
        batch_qvals.extend(calc_qvals(cur_rewards))
        cur_rewards.clear()
        batch_episodes += 1

    # handle new rewards
    new_rewards = exp_source.pop_total_rewards()
    if new_rewards:
        done_episodes += 1
        reward = new_rewards[0]
        total_rewards.append(reward)
        mean_rewards = float(np.mean(total_rewards[-100:]))
        print("%d: reward: %6.2f, mean_100: %6.2f, episodes: %d" % (
            step_idx, reward, mean_rewards, done_episodes))
        writer.add_scalar("reward", reward, step_idx)
        writer.add_scalar("reward_100", mean_rewards, step_idx)
        writer.add_scalar("episodes", done_episodes, step_idx)
        if mean_rewards > 195:
            print("Solved in %d steps and %d episodes!" % (step_idx, done_episodes))
            break

    if batch_episodes < EPISODES_TO_TRAIN:
        continue

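    optimizer.zero_grad()
    # stack the list of observations into one float32 array before building the tensor
    states_v = torch.FloatTensor(np.array(batch_states, dtype=np.float32))
    batch_actions_t = torch.LongTensor(batch_actions)
    batch_qvals_v = torch.FloatTensor(batch_qvals)

    # logits -> log-probabilities of all actions
    logits_v = net(states_v)
    log_prob_v = F.log_softmax(logits_v, dim=1)
    # log-probability of each action taken, scaled by its discounted total reward
    log_prob_actions_v = batch_qvals_v * log_prob_v[range(len(batch_states)), batch_actions_t]
    # negate, because the optimizer minimizes and we want to maximize
    loss_v = -log_prob_actions_v.mean()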

    loss_v.backward()
    optimizer.step()

    batch_episodes = 0
    batch_states.clear()
    batch_actions.clear()
    batch_qvals.clear()

writer.close()

20: reward:  20.00, mean_100:  20.00, episodes: 1
34: reward:  14.00, mean_100:  17.00, episodes: 2
46: reward:  12.00, mean_100:  15.33, episodes: 3
77: reward:  31.00, mean_100:  19.25, episodes: 4
109: reward:  32.00, mean_100:  21.80, episodes: 5
126: reward:  17.00, mean_100:  21.00, episodes: 6
152: reward:  26.00, mean_100:  21.71, episodes: 7
178: reward:  26.00, mean_100:  22.25, episodes: 8
227: reward:  49.00, mean_100:  25.22, episodes: 9
241: reward:  14.00, mean_100:  24.10, episodes: 10
274: reward:  33.00, mean_100:  24.91, episodes: 11
292: reward:  18.00, mean_100:  24.33, episodes: 12
327: reward:  35.00, mean_100:  25.15, episodes: 13
355: reward:  28.00, mean_100:  25.36, episodes: 14
368: reward:  13.00, mean_100:  24.53, episodes: 15
380: reward:  12.00, mean_100:  23.75, episodes: 16
423: reward:  43.00, mean_100:  24.88, episodes: 17
480: reward:  57.00, mean_100:  26.67, episodes: 18
501: reward:  21.00, mean_100:  26.37, episodes: 19
534: reward:  33.00, mean

11974: reward: 200.00, mean_100:  98.59, episodes: 157
12105: reward: 131.00, mean_100:  99.46, episodes: 158
12253: reward: 148.00, mean_100: 100.45, episodes: 159
12423: reward: 170.00, mean_100: 101.00, episodes: 160
12551: reward: 128.00, mean_100: 101.93, episodes: 161
12654: reward: 103.00, mean_100: 102.52, episodes: 162
12808: reward: 154.00, mean_100: 103.35, episodes: 163
13008: reward: 200.00, mean_100: 104.72, episodes: 164
13208: reward: 200.00, mean_100: 106.24, episodes: 165
13336: reward: 128.00, mean_100: 107.02, episodes: 166
13508: reward: 172.00, mean_100: 108.07, episodes: 167
13655: reward: 147.00, mean_100: 108.81, episodes: 168
13832: reward: 177.00, mean_100: 109.95, episodes: 169
14032: reward: 200.00, mean_100: 110.94, episodes: 170
14198: reward: 166.00, mean_100: 112.01, episodes: 171
14323: reward: 125.00, mean_100: 111.33, episodes: 172
14523: reward: 200.00, mean_100: 112.63, episodes: 173
14700: reward: 177.00, mean_100: 112.40, episodes: 174
14898: rew

40052: reward: 200.00, mean_100: 190.63, episodes: 310
40252: reward: 200.00, mean_100: 190.63, episodes: 311
40452: reward: 200.00, mean_100: 191.14, episodes: 312
40652: reward: 200.00, mean_100: 191.14, episodes: 313
40852: reward: 200.00, mean_100: 191.14, episodes: 314
41052: reward: 200.00, mean_100: 191.14, episodes: 315
41252: reward: 200.00, mean_100: 191.14, episodes: 316
41452: reward: 200.00, mean_100: 191.14, episodes: 317
41652: reward: 200.00, mean_100: 191.14, episodes: 318
41852: reward: 200.00, mean_100: 191.14, episodes: 319
42028: reward: 176.00, mean_100: 190.90, episodes: 320
42204: reward: 176.00, mean_100: 190.66, episodes: 321
42404: reward: 200.00, mean_100: 190.78, episodes: 322
42604: reward: 200.00, mean_100: 190.78, episodes: 323
42804: reward: 200.00, mean_100: 190.78, episodes: 324
43004: reward: 200.00, mean_100: 190.78, episodes: 325
43204: reward: 200.00, mean_100: 190.78, episodes: 326
43404: reward: 200.00, mean_100: 190.78, episodes: 327
43604: rew