## The cross-entropy method on FrozenLake

The next environment that we will try to solve using the cross-entropy method is
FrozenLake. Its world belongs to the so-called grid world category, where your agent
lives in a grid of size 4×4 and can move in four directions: up, down, left, and right.
The agent always starts at the top-left position, and its goal is to reach the bottom-right
cell of the grid. There are holes in fixed cells of the grid, and if the agent falls into one
of them, the episode ends and the reward is zero. If the agent reaches the destination
cell, it obtains a reward of 1.0 and the episode ends.
To make life more complicated, the world is slippery (it's a frozen lake after all),
so the agent's actions do not always turn out as expected: the intended move happens only
about a third of the time, and otherwise the agent slips to one side. If you want the agent
to move left, for example, there is a 33% probability that it will, indeed, move left, a 33%
chance that it will end up in the cell above, and a 33% chance that it will end up in the
cell below. This makes progress difficult.
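
To make the slippery dynamics concrete, here is a minimal sketch of the transition rule described above. It is not Gym's implementation: the action numbering (0 = left, 1 = down, 2 = right, 3 = up), the helper names `move` and `slippery_outcomes`, and the rule that moving into a wall leaves the agent in place are assumptions made for illustration, and holes and the goal cell are ignored.

```python
from collections import Counter
from fractions import Fraction

GRID_SIZE = 4
# Assumed action numbering, used only for this illustration.
LEFT, DOWN, RIGHT, UP = range(4)
# (row delta, column delta) for each action.
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def move(state, action):
    """Deterministic move on the grid; walls keep the agent in place (assumption)."""
    row, col = divmod(state, GRID_SIZE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), GRID_SIZE - 1)
    col = min(max(col + d_col, 0), GRID_SIZE - 1)
    return row * GRID_SIZE + col


def slippery_outcomes(state, action):
    """Map each possible next state to its probability on the slippery lake.

    The intended direction and the two perpendicular directions
    each happen with probability 1/3.
    """
    outcomes = Counter()
    for actual in ((action - 1) % 4, action, (action + 1) % 4):
        outcomes[move(state, actual)] += Fraction(1, 3)
    return dict(outcomes)
```

Under these assumptions, `slippery_outcomes(5, LEFT)` returns the cells to the left of, above, and below state 5 with probability 1/3 each, while in the top-left corner (`slippery_outcomes(0, LEFT)`) two of the three outcomes leave the agent where it is.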

In [1]:
import gym, gym.spaces
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.optim as optim

Let's look at how this environment is represented in Gym:

In [2]:
e = gym.make("FrozenLake-v0")

In [3]:
e.observation_space

Discrete(16)

In [4]:
e.action_space

Discrete(4)

In [5]:
e.render()

SFFF
FHFH
FFFH
HFFG


In [6]:
e.observation_space.n

16

Our observation space is discrete, which means that an observation is just a number from
zero to 15 inclusive. This number is our current position in the grid. The action
space is also discrete, with values from zero to three. Our NN from the CartPole
example expects a vector of numbers. To get this, we can apply the traditional one-hot
encoding of discrete inputs, which means that the input to our network will
have 16 float numbers, all zero except for a 1.0 at the index that we encode.
To minimize changes in our code, we can use the ObservationWrapper class from
Gym and implement our DiscreteOneHotWrapper class:

In [7]:
HIDDEN_SIZE = 128
BATCH_SIZE = 16
PERCENTILE = 70

class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space,
                          gym.spaces.Discrete)
        shape = (env.observation_space.n, )
        self.observation_space = gym.spaces.Box(
            0.0, 1.0, shape, dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res
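
As a quick illustration of what the wrapper produces, the following sketch builds the same kind of vector with NumPy alone, without creating an environment. The helper name `one_hot` and the constant `N_STATES` are hypothetical and are not part of the chapter's code.

```python
import numpy as np

N_STATES = 16  # size of the 4x4 grid's observation space


def one_hot(state, n_states=N_STATES):
    """Return a float32 vector of zeros with a single 1.0 at index `state`."""
    vec = np.zeros(n_states, dtype=np.float32)
    vec[state] = 1.0
    return vec


# State 5 is the second cell of the second row when counting row by row from zero.
encoded = one_hot(5)
```

The resulting vector has 16 entries, which matches the `Box` observation space that the wrapper declares, so the network's input layer can be sized from `observation_space.shape[0]`.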

With that wrapper applied to the environment, both the observation space and the action
space are fully compatible with our CartPole solution (source code in
Chapter04/02_frozenlake_naive.py). However, when we launch it, we can see that the
score does not improve over time.

In [8]:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)


Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])


def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1)
    while True:
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _ = env.step(action)
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs


def filter_batch(batch, percentile):
    # Reward threshold: episodes below this percentile of the batch are dropped.
    rewards = [episode.reward for episode in batch]
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))

    train_obs = []
    train_act = []
    for example in batch:
        # Keep only the episodes whose total reward reaches the bound.
        if example.reward < reward_bound:
            continue
        train_obs.extend(step.observation for step in example.steps)
        train_act.extend(step.action for step in example.steps)

    # Observations become network inputs; actions become classification targets.
    train_obs_v = torch.FloatTensor(train_obs)
    train_act_v = torch.LongTensor(train_act)
    return train_obs_v, train_act_v, reward_bound, reward_mean


env = DiscreteOneHotWrapper(gym.make("FrozenLake-v0"))
# env = gym.wrappers.Monitor(env, directory="mon", force=True)
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = Net(obs_size, HIDDEN_SIZE, n_actions)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.01)
writer = SummaryWriter(comment="-frozenlake-naive")

for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = objective(action_scores_v, acts_v)
    loss_v.backward()
    optimizer.step()
    print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
        iter_no, loss_v.item(), reward_m, reward_b))
    writer.add_scalar("loss", loss_v.item(), iter_no)
    writer.add_scalar("reward_bound", reward_b, iter_no)
    writer.add_scalar("reward_mean", reward_m, iter_no)
    if reward_m > 0.8:
        print("Solved!")
        break
writer.close()

0: loss=1.385, reward_mean=0.1, reward_bound=0.0
1: loss=1.384, reward_mean=0.1, reward_bound=0.0
2: loss=1.382, reward_mean=0.1, reward_bound=0.0
3: loss=1.357, reward_mean=0.0, reward_bound=0.0
4: loss=1.369, reward_mean=0.0, reward_bound=0.0
5: loss=1.304, reward_mean=0.0, reward_bound=0.0
6: loss=1.366, reward_mean=0.0, reward_bound=0.0
7: loss=1.316, reward_mean=0.0, reward_bound=0.0
8: loss=1.334, reward_mean=0.1, reward_bound=0.0
9: loss=1.254, reward_mean=0.1, reward_bound=0.0
10: loss=1.206, reward_mean=0.0, reward_bound=0.0
11: loss=1.280, reward_mean=0.0, reward_bound=0.0
12: loss=1.217, reward_mean=0.0, reward_bound=0.0
13: loss=1.254, reward_mean=0.0, reward_bound=0.0
14: loss=1.198, reward_mean=0.0, reward_bound=0.0
15: loss=1.138, reward_mean=0.0, reward_bound=0.0
16: loss=1.232, reward_mean=0.0, reward_bound=0.0
17: loss=1.248, reward_mean=0.0, reward_bound=0.0
18: loss=1.141, reward_mean=0.0, reward_bound=0.0
19: loss=1.168, reward_mean=0.0, reward_bound=0.0
...

167: loss=1.380, reward_mean=0.0, reward_bound=0.0
168: loss=1.380, reward_mean=0.0, reward_bound=0.0
169: loss=1.331, reward_mean=0.0, reward_bound=0.0
170: loss=1.338, reward_mean=0.0, reward_bound=0.0
171: loss=1.372, reward_mean=0.0, reward_bound=0.0
172: loss=1.334, reward_mean=0.0, reward_bound=0.0
173: loss=1.342, reward_mean=0.0, reward_bound=0.0
174: loss=1.375, reward_mean=0.0, reward_bound=0.0
175: loss=1.322, reward_mean=0.1, reward_bound=0.0
176: loss=1.366, reward_mean=0.1, reward_bound=0.0
177: loss=1.352, reward_mean=0.0, reward_bound=0.0
178: loss=1.348, reward_mean=0.0, reward_bound=0.0
179: loss=1.356, reward_mean=0.0, reward_bound=0.0
180: loss=1.354, reward_mean=0.0, reward_bound=0.0
181: loss=1.358, reward_mean=0.1, reward_bound=0.0
182: loss=1.349, reward_mean=0.0, reward_bound=0.0
183: loss=1.376, reward_mean=0.0, reward_bound=0.0
184: loss=1.320, reward_mean=0.0, reward_bound=0.0
185: loss=1.335, reward_mean=0.0, reward_bound=0.0
...

332: loss=0.198, reward_mean=0.0, reward_bound=0.0
333: loss=0.192, reward_mean=0.1, reward_bound=0.0
334: loss=0.171, reward_mean=0.1, reward_bound=0.0
335: loss=0.132, reward_mean=0.0, reward_bound=0.0
336: loss=0.129, reward_mean=0.1, reward_bound=0.0
337: loss=0.260, reward_mean=0.0, reward_bound=0.0
338: loss=0.031, reward_mean=0.1, reward_bound=0.0
339: loss=0.199, reward_mean=0.0, reward_bound=0.0
340: loss=0.022, reward_mean=0.2, reward_bound=0.0
341: loss=0.195, reward_mean=0.1, reward_bound=0.0
342: loss=0.277, reward_mean=0.1, reward_bound=0.0
343: loss=0.181, reward_mean=0.1, reward_bound=0.0
344: loss=0.262, reward_mean=0.1, reward_bound=0.0
345: loss=0.327, reward_mean=0.0, reward_bound=0.0
346: loss=0.122, reward_mean=0.0, reward_bound=0.0
347: loss=0.358, reward_mean=0.1, reward_bound=0.0
348: loss=0.146, reward_mean=0.0, reward_bound=0.0
349: loss=0.087, reward_mean=0.0, reward_bound=0.0
350: loss=0.229, reward_mean=0.0, reward_bound=0.0
...

493: loss=0.683, reward_mean=0.1, reward_bound=0.0
494: loss=0.885, reward_mean=0.0, reward_bound=0.0
495: loss=0.898, reward_mean=0.1, reward_bound=0.0
496: loss=0.896, reward_mean=0.0, reward_bound=0.0
497: loss=0.855, reward_mean=0.0, reward_bound=0.0
498: loss=0.950, reward_mean=0.1, reward_bound=0.0
499: loss=0.910, reward_mean=0.0, reward_bound=0.0
500: loss=0.731, reward_mean=0.0, reward_bound=0.0
501: loss=0.985, reward_mean=0.0, reward_bound=0.0
502: loss=0.821, reward_mean=0.1, reward_bound=0.0
503: loss=0.967, reward_mean=0.0, reward_bound=0.0
504: loss=0.835, reward_mean=0.1, reward_bound=0.0
505: loss=0.786, reward_mean=0.1, reward_bound=0.0
506: loss=0.701, reward_mean=0.1, reward_bound=0.0
507: loss=0.897, reward_mean=0.1, reward_bound=0.0
508: loss=0.847, reward_mean=0.0, reward_bound=0.0
509: loss=0.904, reward_mean=0.0, reward_bound=0.0
510: loss=0.936, reward_mean=0.0, reward_bound=0.0
511: loss=0.949, reward_mean=0.1, reward_bound=0.0
...

661: loss=0.152, reward_mean=0.0, reward_bound=0.0
662: loss=0.155, reward_mean=0.0, reward_bound=0.0
663: loss=0.087, reward_mean=0.1, reward_bound=0.0
664: loss=0.235, reward_mean=0.1, reward_bound=0.0
665: loss=0.232, reward_mean=0.1, reward_bound=0.0
666: loss=0.201, reward_mean=0.1, reward_bound=0.0
667: loss=0.153, reward_mean=0.0, reward_bound=0.0
668: loss=0.181, reward_mean=0.1, reward_bound=0.0
669: loss=0.261, reward_mean=0.1, reward_bound=0.0
670: loss=0.180, reward_mean=0.0, reward_bound=0.0
671: loss=0.161, reward_mean=0.0, reward_bound=0.0
672: loss=0.101, reward_mean=0.0, reward_bound=0.0
673: loss=0.204, reward_mean=0.0, reward_bound=0.0
674: loss=0.216, reward_mean=0.1, reward_bound=0.0
675: loss=0.199, reward_mean=0.0, reward_bound=0.0
676: loss=0.087, reward_mean=0.0, reward_bound=0.0
677: loss=0.129, reward_mean=0.0, reward_bound=0.0
678: loss=0.298, reward_mean=0.1, reward_bound=0.0
679: loss=0.192, reward_mean=0.0, reward_bound=0.0
...

822: loss=0.369, reward_mean=0.1, reward_bound=0.0
823: loss=0.396, reward_mean=0.2, reward_bound=0.0
824: loss=0.382, reward_mean=0.0, reward_bound=0.0
825: loss=0.333, reward_mean=0.0, reward_bound=0.0
826: loss=0.153, reward_mean=0.1, reward_bound=0.0
827: loss=0.265, reward_mean=0.0, reward_bound=0.0
828: loss=0.280, reward_mean=0.1, reward_bound=0.0
829: loss=0.438, reward_mean=0.1, reward_bound=0.0
830: loss=0.390, reward_mean=0.0, reward_bound=0.0
831: loss=0.335, reward_mean=0.1, reward_bound=0.0
832: loss=0.237, reward_mean=0.0, reward_bound=0.0
833: loss=0.397, reward_mean=0.0, reward_bound=0.0
834: loss=0.315, reward_mean=0.0, reward_bound=0.0
835: loss=0.374, reward_mean=0.1, reward_bound=0.0
836: loss=0.329, reward_mean=0.1, reward_bound=0.0
837: loss=0.278, reward_mean=0.1, reward_bound=0.0
838: loss=0.253, reward_mean=0.0, reward_bound=0.0
839: loss=0.252, reward_mean=0.1, reward_bound=0.0
840: loss=0.224, reward_mean=0.1, reward_bound=0.0
...

983: loss=0.007, reward_mean=0.1, reward_bound=0.0
984: loss=0.068, reward_mean=0.1, reward_bound=0.0
985: loss=0.005, reward_mean=0.1, reward_bound=0.0
986: loss=0.004, reward_mean=0.1, reward_bound=0.0
987: loss=0.063, reward_mean=0.0, reward_bound=0.0
988: loss=0.168, reward_mean=0.0, reward_bound=0.0
989: loss=0.005, reward_mean=0.0, reward_bound=0.0
990: loss=0.065, reward_mean=0.1, reward_bound=0.0
991: loss=0.004, reward_mean=0.2, reward_bound=0.0
992: loss=0.073, reward_mean=0.0, reward_bound=0.0
993: loss=0.005, reward_mean=0.0, reward_bound=0.0
994: loss=0.006, reward_mean=0.0, reward_bound=0.0
995: loss=0.005, reward_mean=0.2, reward_bound=0.0
996: loss=0.006, reward_mean=0.1, reward_bound=0.0
997: loss=0.006, reward_mean=0.0, reward_bound=0.0
998: loss=0.004, reward_mean=0.1, reward_bound=0.0
999: loss=0.005, reward_mean=0.0, reward_bound=0.0
1000: loss=0.004, reward_mean=0.1, reward_bound=0.0
1001: loss=0.004, reward_mean=0.0, reward_bound=0.0
...

1150: loss=0.000, reward_mean=0.1, reward_bound=0.0
1151: loss=0.099, reward_mean=0.1, reward_bound=0.0
1152: loss=0.000, reward_mean=0.0, reward_bound=0.0
1153: loss=0.001, reward_mean=0.1, reward_bound=0.0
1154: loss=0.001, reward_mean=0.0, reward_bound=0.0
1155: loss=0.000, reward_mean=0.1, reward_bound=0.0
1156: loss=0.001, reward_mean=0.0, reward_bound=0.0
1157: loss=0.000, reward_mean=0.1, reward_bound=0.0
1158: loss=0.000, reward_mean=0.2, reward_bound=0.0
1159: loss=0.000, reward_mean=0.2, reward_bound=0.0
1160: loss=0.001, reward_mean=0.1, reward_bound=0.0
1161: loss=0.001, reward_mean=0.0, reward_bound=0.0
1162: loss=0.000, reward_mean=0.1, reward_bound=0.0
1163: loss=0.001, reward_mean=0.0, reward_bound=0.0
1164: loss=0.001, reward_mean=0.1, reward_bound=0.0
1165: loss=0.000, reward_mean=0.1, reward_bound=0.0
1166: loss=0.000, reward_mean=0.1, reward_bound=0.0
1167: loss=0.000, reward_mean=0.1, reward_bound=0.0
1168: loss=0.000, reward_mean=0.1, reward_bound=0.0
...

1312: loss=0.000, reward_mean=0.0, reward_bound=0.0
1313: loss=0.000, reward_mean=0.1, reward_bound=0.0
1314: loss=0.000, reward_mean=0.0, reward_bound=0.0
1315: loss=0.000, reward_mean=0.0, reward_bound=0.0
1316: loss=0.000, reward_mean=0.1, reward_bound=0.0
1317: loss=0.000, reward_mean=0.0, reward_bound=0.0
1318: loss=0.000, reward_mean=0.0, reward_bound=0.0
1319: loss=0.000, reward_mean=0.1, reward_bound=0.0
1320: loss=0.000, reward_mean=0.1, reward_bound=0.0
1321: loss=0.000, reward_mean=0.1, reward_bound=0.0
1322: loss=0.000, reward_mean=0.1, reward_bound=0.0
1323: loss=0.000, reward_mean=0.0, reward_bound=0.0
1324: loss=0.000, reward_mean=0.0, reward_bound=0.0
1325: loss=0.000, reward_mean=0.1, reward_bound=0.0
1326: loss=0.000, reward_mean=0.0, reward_bound=0.0
1327: loss=0.000, reward_mean=0.1, reward_bound=0.0
1328: loss=0.000, reward_mean=0.1, reward_bound=0.0
1329: loss=0.000, reward_mean=0.2, reward_bound=0.0
1330: loss=0.000, reward_mean=0.0, reward_bound=0.0
...

1470: loss=0.000, reward_mean=0.1, reward_bound=0.0
1471: loss=0.000, reward_mean=0.1, reward_bound=0.0
1472: loss=0.000, reward_mean=0.1, reward_bound=0.0
1473: loss=0.000, reward_mean=0.0, reward_bound=0.0
1474: loss=0.000, reward_mean=0.1, reward_bound=0.0
1475: loss=0.000, reward_mean=0.0, reward_bound=0.0
1476: loss=0.000, reward_mean=0.0, reward_bound=0.0
1477: loss=0.000, reward_mean=0.1, reward_bound=0.0
1478: loss=0.000, reward_mean=0.0, reward_bound=0.0
1479: loss=0.000, reward_mean=0.0, reward_bound=0.0
1480: loss=0.000, reward_mean=0.0, reward_bound=0.0
1481: loss=0.000, reward_mean=0.2, reward_bound=0.0
1482: loss=0.000, reward_mean=0.0, reward_bound=0.0
1483: loss=0.000, reward_mean=0.0, reward_bound=0.0
1484: loss=0.000, reward_mean=0.0, reward_bound=0.0
1485: loss=0.000, reward_mean=0.0, reward_bound=0.0
1486: loss=0.000, reward_mean=0.1, reward_bound=0.0
1487: loss=0.000, reward_mean=0.1, reward_bound=0.0
1488: loss=0.000, reward_mean=0.0, reward_bound=0.0
...

1638: loss=0.000, reward_mean=0.0, reward_bound=0.0
1639: loss=0.000, reward_mean=0.2, reward_bound=0.0
1640: loss=0.000, reward_mean=0.1, reward_bound=0.0
1641: loss=0.000, reward_mean=0.0, reward_bound=0.0
1642: loss=0.000, reward_mean=0.0, reward_bound=0.0
1643: loss=0.000, reward_mean=0.1, reward_bound=0.0
1644: loss=0.000, reward_mean=0.1, reward_bound=0.0
1645: loss=0.000, reward_mean=0.1, reward_bound=0.0
1646: loss=0.000, reward_mean=0.0, reward_bound=0.0
1647: loss=0.000, reward_mean=0.0, reward_bound=0.0
1648: loss=0.000, reward_mean=0.1, reward_bound=0.0
1649: loss=0.000, reward_mean=0.0, reward_bound=0.0
1650: loss=0.000, reward_mean=0.1, reward_bound=0.0
1651: loss=0.000, reward_mean=0.1, reward_bound=0.0
1652: loss=0.000, reward_mean=0.0, reward_bound=0.0
1653: loss=0.000, reward_mean=0.2, reward_bound=0.0
1654: loss=0.000, reward_mean=0.1, reward_bound=0.0
1655: loss=0.000, reward_mean=0.0, reward_bound=0.0
1656: loss=0.000, reward_mean=0.0, reward_bound=0.0
...

1805: loss=0.000, reward_mean=0.0, reward_bound=0.0
1806: loss=0.000, reward_mean=0.0, reward_bound=0.0
1807: loss=0.000, reward_mean=0.1, reward_bound=0.0
1808: loss=0.000, reward_mean=0.1, reward_bound=0.0
1809: loss=0.000, reward_mean=0.1, reward_bound=0.0
1810: loss=0.000, reward_mean=0.0, reward_bound=0.0
1811: loss=0.000, reward_mean=0.0, reward_bound=0.0
1812: loss=0.000, reward_mean=0.1, reward_bound=0.0
1813: loss=0.000, reward_mean=0.0, reward_bound=0.0
1814: loss=0.000, reward_mean=0.1, reward_bound=0.0
1815: loss=0.000, reward_mean=0.0, reward_bound=0.0
1816: loss=0.000, reward_mean=0.1, reward_bound=0.0
1817: loss=0.000, reward_mean=0.0, reward_bound=0.0
1818: loss=0.000, reward_mean=0.1, reward_bound=0.0
1819: loss=0.000, reward_mean=0.0, reward_bound=0.0
1820: loss=0.000, reward_mean=0.1, reward_bound=0.0
1821: loss=0.000, reward_mean=0.1, reward_bound=0.0
1822: loss=0.000, reward_mean=0.1, reward_bound=0.0
1823: loss=0.000, reward_mean=0.1, reward_bound=0.0
...

1971: loss=0.000, reward_mean=0.0, reward_bound=0.0
1972: loss=0.000, reward_mean=0.1, reward_bound=0.0
1973: loss=0.000, reward_mean=0.2, reward_bound=0.0
1974: loss=0.000, reward_mean=0.0, reward_bound=0.0
1975: loss=0.000, reward_mean=0.0, reward_bound=0.0
1976: loss=0.000, reward_mean=0.1, reward_bound=0.0
1977: loss=0.000, reward_mean=0.0, reward_bound=0.0
1978: loss=0.000, reward_mean=0.1, reward_bound=0.0
1979: loss=0.000, reward_mean=0.1, reward_bound=0.0
1980: loss=0.000, reward_mean=0.2, reward_bound=0.0
1981: loss=0.000, reward_mean=0.1, reward_bound=0.0
1982: loss=0.000, reward_mean=0.0, reward_bound=0.0
1983: loss=0.000, reward_mean=0.0, reward_bound=0.0
1984: loss=0.000, reward_mean=0.1, reward_bound=0.0
1985: loss=0.000, reward_mean=0.1, reward_bound=0.0
1986: loss=0.000, reward_mean=0.1, reward_bound=0.0
1987: loss=0.000, reward_mean=0.1, reward_bound=0.0
1988: loss=0.000, reward_mean=0.1, reward_bound=0.0
1989: loss=0.000, reward_mean=0.0, reward_bound=0.0
...

2138: loss=0.000, reward_mean=0.1, reward_bound=0.0
2139: loss=0.000, reward_mean=0.0, reward_bound=0.0
2140: loss=0.000, reward_mean=0.1, reward_bound=0.0
2141: loss=0.000, reward_mean=0.0, reward_bound=0.0
2142: loss=0.000, reward_mean=0.1, reward_bound=0.0
2143: loss=0.000, reward_mean=0.1, reward_bound=0.0
2144: loss=0.000, reward_mean=0.0, reward_bound=0.0
2145: loss=0.000, reward_mean=0.0, reward_bound=0.0
2146: loss=0.000, reward_mean=0.1, reward_bound=0.0
2147: loss=0.000, reward_mean=0.0, reward_bound=0.0
2148: loss=0.000, reward_mean=0.1, reward_bound=0.0
2149: loss=0.000, reward_mean=0.1, reward_bound=0.0
2150: loss=0.000, reward_mean=0.0, reward_bound=0.0
2151: loss=0.000, reward_mean=0.1, reward_bound=0.0
2152: loss=0.000, reward_mean=0.1, reward_bound=0.0
2153: loss=0.000, reward_mean=0.0, reward_bound=0.0
2154: loss=0.000, reward_mean=0.1, reward_bound=0.0
2155: loss=0.000, reward_mean=0.0, reward_bound=0.0
2156: loss=0.000, reward_mean=0.0, reward_bound=0.0
...

2303: loss=0.000, reward_mean=0.2, reward_bound=0.0
2304: loss=0.000, reward_mean=0.1, reward_bound=0.0
2305: loss=0.000, reward_mean=0.0, reward_bound=0.0
2306: loss=0.000, reward_mean=0.0, reward_bound=0.0
2307: loss=0.000, reward_mean=0.0, reward_bound=0.0
2308: loss=0.000, reward_mean=0.1, reward_bound=0.0
2309: loss=0.000, reward_mean=0.0, reward_bound=0.0
2310: loss=0.000, reward_mean=0.1, reward_bound=0.0
2311: loss=0.000, reward_mean=0.1, reward_bound=0.0
2312: loss=0.000, reward_mean=0.0, reward_bound=0.0
2313: loss=0.000, reward_mean=0.0, reward_bound=0.0
2314: loss=0.000, reward_mean=0.0, reward_bound=0.0
2315: loss=0.000, reward_mean=0.3, reward_bound=0.5
2316: loss=0.000, reward_mean=0.1, reward_bound=0.0
2317: loss=0.000, reward_mean=0.1, reward_bound=0.0
2318: loss=0.000, reward_mean=0.1, reward_bound=0.0
2319: loss=0.000, reward_mean=0.1, reward_bound=0.0
2320: loss=0.000, reward_mean=0.1, reward_bound=0.0
2321: loss=0.000, reward_mean=0.0, reward_bound=0.0
...

2471: loss=0.000, reward_mean=0.1, reward_bound=0.0
2472: loss=0.000, reward_mean=0.0, reward_bound=0.0
2473: loss=0.000, reward_mean=0.1, reward_bound=0.0
2474: loss=0.000, reward_mean=0.1, reward_bound=0.0
2475: loss=0.000, reward_mean=0.1, reward_bound=0.0
2476: loss=0.000, reward_mean=0.0, reward_bound=0.0
2477: loss=0.000, reward_mean=0.0, reward_bound=0.0
2478: loss=0.000, reward_mean=0.1, reward_bound=0.0
2479: loss=0.000, reward_mean=0.0, reward_bound=0.0
2480: loss=0.000, reward_mean=0.0, reward_bound=0.0
2481: loss=0.000, reward_mean=0.1, reward_bound=0.0
2482: loss=0.000, reward_mean=0.0, reward_bound=0.0
2483: loss=0.000, reward_mean=0.1, reward_bound=0.0
2484: loss=0.000, reward_mean=0.1, reward_bound=0.0
2485: loss=0.000, reward_mean=0.1, reward_bound=0.0
2486: loss=0.000, reward_mean=0.0, reward_bound=0.0
2487: loss=0.000, reward_mean=0.1, reward_bound=0.0
2488: loss=0.000, reward_mean=0.0, reward_bound=0.0
2489: loss=0.000, reward_mean=0.0, reward_bound=0.0
...

2635: loss=0.000, reward_mean=0.0, reward_bound=0.0
2636: loss=0.000, reward_mean=0.0, reward_bound=0.0
2637: loss=0.000, reward_mean=0.1, reward_bound=0.0
2638: loss=0.000, reward_mean=0.0, reward_bound=0.0
2639: loss=0.000, reward_mean=0.1, reward_bound=0.0
2640: loss=0.000, reward_mean=0.1, reward_bound=0.0
2641: loss=0.000, reward_mean=0.1, reward_bound=0.0
2642: loss=0.000, reward_mean=0.0, reward_bound=0.0
2643: loss=0.000, reward_mean=0.0, reward_bound=0.0
2644: loss=0.000, reward_mean=0.0, reward_bound=0.0
2645: loss=0.000, reward_mean=0.1, reward_bound=0.0
2646: loss=0.000, reward_mean=0.0, reward_bound=0.0
2647: loss=0.000, reward_mean=0.2, reward_bound=0.0
2648: loss=0.000, reward_mean=0.0, reward_bound=0.0
2649: loss=0.000, reward_mean=0.1, reward_bound=0.0
2650: loss=0.000, reward_mean=0.1, reward_bound=0.0
2651: loss=0.000, reward_mean=0.0, reward_bound=0.0
2652: loss=0.000, reward_mean=0.0, reward_bound=0.0
2653: loss=0.000, reward_mean=0.1, reward_bound=0.0
...

2799: loss=0.000, reward_mean=0.0, reward_bound=0.0
2800: loss=0.000, reward_mean=0.0, reward_bound=0.0
2801: loss=0.000, reward_mean=0.1, reward_bound=0.0
2802: loss=0.000, reward_mean=0.0, reward_bound=0.0
2803: loss=0.000, reward_mean=0.2, reward_bound=0.0
2804: loss=0.000, reward_mean=0.0, reward_bound=0.0
2805: loss=0.000, reward_mean=0.1, reward_bound=0.0
2806: loss=0.000, reward_mean=0.1, reward_bound=0.0
2807: loss=0.000, reward_mean=0.0, reward_bound=0.0
2808: loss=0.000, reward_mean=0.0, reward_bound=0.0
2809: loss=0.000, reward_mean=0.1, reward_bound=0.0
2810: loss=0.000, reward_mean=0.0, reward_bound=0.0
2811: loss=0.000, reward_mean=0.1, reward_bound=0.0
2812: loss=0.000, reward_mean=0.1, reward_bound=0.0
2813: loss=0.000, reward_mean=0.0, reward_bound=0.0
2814: loss=0.000, reward_mean=0.1, reward_bound=0.0
2815: loss=0.000, reward_mean=0.0, reward_bound=0.0
2816: loss=0.000, reward_mean=0.0, reward_bound=0.0
2817: loss=0.000, reward_mean=0.1, reward_bound=0.0
...

2964: loss=0.000, reward_mean=0.0, reward_bound=0.0
2965: loss=0.000, reward_mean=0.0, reward_bound=0.0
2966: loss=0.000, reward_mean=0.0, reward_bound=0.0
2967: loss=0.000, reward_mean=0.0, reward_bound=0.0
2968: loss=0.000, reward_mean=0.0, reward_bound=0.0
2969: loss=0.000, reward_mean=0.0, reward_bound=0.0
2970: loss=0.000, reward_mean=0.1, reward_bound=0.0
2971: loss=0.000, reward_mean=0.1, reward_bound=0.0
2972: loss=0.000, reward_mean=0.0, reward_bound=0.0
2973: loss=0.000, reward_mean=0.1, reward_bound=0.0
2974: loss=0.000, reward_mean=0.1, reward_bound=0.0
2975: loss=0.000, reward_mean=0.1, reward_bound=0.0
2976: loss=0.000, reward_mean=0.1, reward_bound=0.0
2977: loss=0.000, reward_mean=0.1, reward_bound=0.0
2978: loss=0.000, reward_mean=0.0, reward_bound=0.0
2979: loss=0.000, reward_mean=0.0, reward_bound=0.0
2980: loss=0.000, reward_mean=0.1, reward_bound=0.0
2981: loss=0.000, reward_mean=0.0, reward_bound=0.0
2982: loss=0.000, reward_mean=0.0, reward_bound=0.0
...

3126: loss=0.000, reward_mean=0.0, reward_bound=0.0
3127: loss=0.000, reward_mean=0.0, reward_bound=0.0
3128: loss=0.000, reward_mean=0.0, reward_bound=0.0
3129: loss=0.000, reward_mean=0.1, reward_bound=0.0
3130: loss=0.000, reward_mean=0.1, reward_bound=0.0
3131: loss=0.000, reward_mean=0.0, reward_bound=0.0
3132: loss=0.000, reward_mean=0.1, reward_bound=0.0
3133: loss=0.000, reward_mean=0.1, reward_bound=0.0
3134: loss=0.000, reward_mean=0.0, reward_bound=0.0
3135: loss=0.000, reward_mean=0.1, reward_bound=0.0
3136: loss=0.000, reward_mean=0.1, reward_bound=0.0
3137: loss=0.000, reward_mean=0.1, reward_bound=0.0
3138: loss=0.000, reward_mean=0.0, reward_bound=0.0
3139: loss=0.000, reward_mean=0.1, reward_bound=0.0
3140: loss=0.000, reward_mean=0.0, reward_bound=0.0
3141: loss=0.000, reward_mean=0.0, reward_bound=0.0
3142: loss=0.000, reward_mean=0.0, reward_bound=0.0
3143: loss=0.000, reward_mean=0.1, reward_bound=0.0
3144: loss=0.000, reward_mean=0.1, reward_bound=0.0
...

3290: loss=0.000, reward_mean=0.1, reward_bound=0.0
3291: loss=0.000, reward_mean=0.0, reward_bound=0.0
3292: loss=0.000, reward_mean=0.1, reward_bound=0.0
3293: loss=0.000, reward_mean=0.1, reward_bound=0.0
3294: loss=0.000, reward_mean=0.1, reward_bound=0.0
3295: loss=0.000, reward_mean=0.0, reward_bound=0.0
3296: loss=0.000, reward_mean=0.0, reward_bound=0.0
3297: loss=0.000, reward_mean=0.1, reward_bound=0.0
3298: loss=0.000, reward_mean=0.0, reward_bound=0.0
3299: loss=0.000, reward_mean=0.0, reward_bound=0.0
3300: loss=0.000, reward_mean=0.0, reward_bound=0.0
3301: loss=0.000, reward_mean=0.1, reward_bound=0.0
3302: loss=0.000, reward_mean=0.0, reward_bound=0.0
3303: loss=0.000, reward_mean=0.1, reward_bound=0.0
3304: loss=0.000, reward_mean=0.1, reward_bound=0.0
3305: loss=0.000, reward_mean=0.0, reward_bound=0.0
3306: loss=0.000, reward_mean=0.0, reward_bound=0.0
3307: loss=0.000, reward_mean=0.0, reward_bound=0.0
3308: loss=0.000, reward_mean=0.1, reward_bound=0.0
...

3458: loss=0.000, reward_mean=0.1, reward_bound=0.0
3459: loss=0.000, reward_mean=0.0, reward_bound=0.0
3460: loss=0.000, reward_mean=0.1, reward_bound=0.0
3461: loss=0.000, reward_mean=0.0, reward_bound=0.0
3462: loss=0.000, reward_mean=0.0, reward_bound=0.0
3463: loss=0.000, reward_mean=0.1, reward_bound=0.0
3464: loss=0.000, reward_mean=0.0, reward_bound=0.0
3465: loss=0.000, reward_mean=0.1, reward_bound=0.0
3466: loss=0.000, reward_mean=0.0, reward_bound=0.0
3467: loss=0.000, reward_mean=0.0, reward_bound=0.0
3468: loss=0.000, reward_mean=0.1, reward_bound=0.0
3469: loss=0.000, reward_mean=0.1, reward_bound=0.0
3470: loss=0.000, reward_mean=0.1, reward_bound=0.0
3471: loss=0.000, reward_mean=0.1, reward_bound=0.0
3472: loss=0.000, reward_mean=0.1, reward_bound=0.0
3473: loss=0.000, reward_mean=0.0, reward_bound=0.0
3474: loss=0.000, reward_mean=0.1, reward_bound=0.0
3475: loss=0.000, reward_mean=0.0, reward_bound=0.0
3476: loss=0.000, reward_mean=0.0, reward_bound=0.0
...

3624: loss=0.000, reward_mean=0.1, reward_bound=0.0
3625: loss=0.000, reward_mean=0.1, reward_bound=0.0
3626: loss=0.000, reward_mean=0.1, reward_bound=0.0
3627: loss=0.000, reward_mean=0.0, reward_bound=0.0
3628: loss=0.000, reward_mean=0.1, reward_bound=0.0
3629: loss=0.000, reward_mean=0.1, reward_bound=0.0
3630: loss=0.000, reward_mean=0.0, reward_bound=0.0
3631: loss=0.000, reward_mean=0.1, reward_bound=0.0
3632: loss=0.000, reward_mean=0.0, reward_bound=0.0
3633: loss=0.000, reward_mean=0.1, reward_bound=0.0
3634: loss=0.000, reward_mean=0.1, reward_bound=0.0
3635: loss=0.000, reward_mean=0.0, reward_bound=0.0
3636: loss=0.000, reward_mean=0.1, reward_bound=0.0
3637: loss=0.000, reward_mean=0.0, reward_bound=0.0
3638: loss=0.000, reward_mean=0.0, reward_bound=0.0
3639: loss=0.000, reward_mean=0.0, reward_bound=0.0
3640: loss=0.000, reward_mean=0.0, reward_bound=0.0
3641: loss=0.000, reward_mean=0.1, reward_bound=0.0
3642: loss=0.000, reward_mean=0.1, reward_bound=0.0
...

3791: loss=0.000, reward_mean=0.0, reward_bound=0.0
3792: loss=0.000, reward_mean=0.0, reward_bound=0.0
3793: loss=0.000, reward_mean=0.2, reward_bound=0.0
3794: loss=0.000, reward_mean=0.1, reward_bound=0.0
3795: loss=0.000, reward_mean=0.1, reward_bound=0.0
3796: loss=0.000, reward_mean=0.2, reward_bound=0.0
3797: loss=0.000, reward_mean=0.1, reward_bound=0.0
3798: loss=0.000, reward_mean=0.1, reward_bound=0.0
3799: loss=0.000, reward_mean=0.0, reward_bound=0.0
3800: loss=0.000, reward_mean=0.1, reward_bound=0.0
3801: loss=0.000, reward_mean=0.0, reward_bound=0.0
3802: loss=0.000, reward_mean=0.1, reward_bound=0.0
3803: loss=0.000, reward_mean=0.1, reward_bound=0.0
3804: loss=0.000, reward_mean=0.0, reward_bound=0.0
3805: loss=0.000, reward_mean=0.2, reward_bound=0.0
3806: loss=0.000, reward_mean=0.1, reward_bound=0.0
3807: loss=0.000, reward_mean=0.0, reward_bound=0.0
3808: loss=0.000, reward_mean=0.1, reward_bound=0.0
3809: loss=0.000, reward_mean=0.0, reward_bound=0.0
...

3956: loss=0.000, reward_mean=0.0, reward_bound=0.0
3957: loss=0.000, reward_mean=0.1, reward_bound=0.0
3958: loss=0.000, reward_mean=0.1, reward_bound=0.0
3959: loss=0.000, reward_mean=0.1, reward_bound=0.0
3960: loss=0.000, reward_mean=0.1, reward_bound=0.0
3961: loss=0.000, reward_mean=0.0, reward_bound=0.0
3962: loss=0.000, reward_mean=0.1, reward_bound=0.0
3963: loss=0.000, reward_mean=0.1, reward_bound=0.0
3964: loss=0.000, reward_mean=0.1, reward_bound=0.0
3965: loss=0.000, reward_mean=0.1, reward_bound=0.0
3966: loss=0.000, reward_mean=0.1, reward_bound=0.0
3967: loss=0.000, reward_mean=0.0, reward_bound=0.0
3968: loss=0.000, reward_mean=0.1, reward_bound=0.0
3969: loss=0.000, reward_mean=0.1, reward_bound=0.0
3970: loss=0.000, reward_mean=0.0, reward_bound=0.0
3971: loss=0.000, reward_mean=0.0, reward_bound=0.0
3972: loss=0.000, reward_mean=0.1, reward_bound=0.0
3973: loss=0.000, reward_mean=0.0, reward_bound=0.0
3974: loss=0.000, reward_mean=0.1, reward_bound=0.0
...

4116: loss=0.000, reward_mean=0.1, reward_bound=0.0
4117: loss=0.000, reward_mean=0.1, reward_bound=0.0
4118: loss=0.000, reward_mean=0.1, reward_bound=0.0
4119: loss=0.000, reward_mean=0.1, reward_bound=0.0
4120: loss=0.000, reward_mean=0.0, reward_bound=0.0
4121: loss=0.000, reward_mean=0.0, reward_bound=0.0
4122: loss=0.000, reward_mean=0.0, reward_bound=0.0
4123: loss=0.000, reward_mean=0.1, reward_bound=0.0
4124: loss=0.000, reward_mean=0.1, reward_bound=0.0
4125: loss=0.000, reward_mean=0.0, reward_bound=0.0
4126: loss=0.000, reward_mean=0.1, reward_bound=0.0
4127: loss=0.000, reward_mean=0.1, reward_bound=0.0
4128: loss=0.000, reward_mean=0.1, reward_bound=0.0
4129: loss=0.000, reward_mean=0.1, reward_bound=0.0
4130: loss=0.000, reward_mean=0.1, reward_bound=0.0
4131: loss=0.000, reward_mean=0.1, reward_bound=0.0
4132: loss=0.000, reward_mean=0.0, reward_bound=0.0
4133: loss=0.000, reward_mean=0.0, reward_bound=0.0
4134: loss=0.000, reward_mean=0.1, reward_bound=0.0
4135: loss=0

4283: loss=0.000, reward_mean=0.0, reward_bound=0.0
4284: loss=0.000, reward_mean=0.1, reward_bound=0.0
4285: loss=0.000, reward_mean=0.1, reward_bound=0.0
4286: loss=0.000, reward_mean=0.1, reward_bound=0.0
4287: loss=0.000, reward_mean=0.1, reward_bound=0.0
4288: loss=0.000, reward_mean=0.0, reward_bound=0.0
4289: loss=0.000, reward_mean=0.1, reward_bound=0.0
4290: loss=0.000, reward_mean=0.0, reward_bound=0.0
4291: loss=0.000, reward_mean=0.1, reward_bound=0.0
4292: loss=0.000, reward_mean=0.1, reward_bound=0.0
4293: loss=0.000, reward_mean=0.1, reward_bound=0.0
4294: loss=0.000, reward_mean=0.1, reward_bound=0.0
4295: loss=0.000, reward_mean=0.0, reward_bound=0.0
4296: loss=0.000, reward_mean=0.0, reward_bound=0.0
4297: loss=0.000, reward_mean=0.0, reward_bound=0.0
4298: loss=0.000, reward_mean=0.0, reward_bound=0.0
4299: loss=0.000, reward_mean=0.0, reward_bound=0.0
4300: loss=0.000, reward_mean=0.0, reward_bound=0.0
4301: loss=0.000, reward_mean=0.1, reward_bound=0.0
4302: loss=0

4451: loss=0.000, reward_mean=0.1, reward_bound=0.0
4452: loss=0.000, reward_mean=0.0, reward_bound=0.0
4453: loss=0.000, reward_mean=0.0, reward_bound=0.0
4454: loss=0.000, reward_mean=0.1, reward_bound=0.0
4455: loss=0.000, reward_mean=0.1, reward_bound=0.0
4456: loss=0.000, reward_mean=0.2, reward_bound=0.0
4457: loss=0.000, reward_mean=0.0, reward_bound=0.0
4458: loss=0.000, reward_mean=0.1, reward_bound=0.0
4459: loss=0.000, reward_mean=0.0, reward_bound=0.0
4460: loss=0.000, reward_mean=0.0, reward_bound=0.0
4461: loss=0.000, reward_mean=0.0, reward_bound=0.0
4462: loss=0.000, reward_mean=0.1, reward_bound=0.0
4463: loss=0.000, reward_mean=0.0, reward_bound=0.0
4464: loss=0.000, reward_mean=0.1, reward_bound=0.0
4465: loss=0.000, reward_mean=0.1, reward_bound=0.0
4466: loss=0.000, reward_mean=0.0, reward_bound=0.0
4467: loss=0.000, reward_mean=0.1, reward_bound=0.0
4468: loss=0.000, reward_mean=0.0, reward_bound=0.0
4469: loss=0.000, reward_mean=0.0, reward_bound=0.0
4470: loss=0

4618: loss=0.000, reward_mean=0.0, reward_bound=0.0
4619: loss=0.000, reward_mean=0.0, reward_bound=0.0
4620: loss=0.000, reward_mean=0.0, reward_bound=0.0
4621: loss=0.000, reward_mean=0.1, reward_bound=0.0
4622: loss=0.000, reward_mean=0.1, reward_bound=0.0
4623: loss=0.000, reward_mean=0.0, reward_bound=0.0
4624: loss=0.000, reward_mean=0.1, reward_bound=0.0
4625: loss=0.000, reward_mean=0.0, reward_bound=0.0
4626: loss=0.000, reward_mean=0.1, reward_bound=0.0
4627: loss=0.000, reward_mean=0.1, reward_bound=0.0
4628: loss=0.000, reward_mean=0.1, reward_bound=0.0
4629: loss=0.000, reward_mean=0.0, reward_bound=0.0
4630: loss=0.000, reward_mean=0.1, reward_bound=0.0
4631: loss=0.000, reward_mean=0.2, reward_bound=0.0
4632: loss=0.000, reward_mean=0.1, reward_bound=0.0
4633: loss=0.000, reward_mean=0.1, reward_bound=0.0
4634: loss=0.000, reward_mean=0.1, reward_bound=0.0
4635: loss=0.000, reward_mean=0.1, reward_bound=0.0
4636: loss=0.000, reward_mean=0.0, reward_bound=0.0
4637: loss=0

4785: loss=0.000, reward_mean=0.0, reward_bound=0.0
4786: loss=0.000, reward_mean=0.1, reward_bound=0.0
4787: loss=0.000, reward_mean=0.1, reward_bound=0.0
4788: loss=0.000, reward_mean=0.0, reward_bound=0.0
4789: loss=0.000, reward_mean=0.0, reward_bound=0.0
4790: loss=0.000, reward_mean=0.0, reward_bound=0.0
4791: loss=0.000, reward_mean=0.1, reward_bound=0.0
4792: loss=0.000, reward_mean=0.0, reward_bound=0.0
4793: loss=0.000, reward_mean=0.0, reward_bound=0.0
4794: loss=0.000, reward_mean=0.1, reward_bound=0.0
4795: loss=0.000, reward_mean=0.1, reward_bound=0.0
4796: loss=0.000, reward_mean=0.2, reward_bound=0.0
4797: loss=0.000, reward_mean=0.1, reward_bound=0.0
4798: loss=0.000, reward_mean=0.1, reward_bound=0.0
4799: loss=0.000, reward_mean=0.0, reward_bound=0.0
4800: loss=0.000, reward_mean=0.1, reward_bound=0.0
4801: loss=0.000, reward_mean=0.1, reward_bound=0.0
4802: loss=0.000, reward_mean=0.0, reward_bound=0.0
4803: loss=0.000, reward_mean=0.0, reward_bound=0.0
4804: loss=0

4950: loss=0.000, reward_mean=0.1, reward_bound=0.0
4951: loss=0.000, reward_mean=0.1, reward_bound=0.0
4952: loss=0.000, reward_mean=0.1, reward_bound=0.0
4953: loss=0.000, reward_mean=0.0, reward_bound=0.0
4954: loss=0.000, reward_mean=0.1, reward_bound=0.0
4955: loss=0.000, reward_mean=0.1, reward_bound=0.0
4956: loss=0.000, reward_mean=0.0, reward_bound=0.0
4957: loss=0.000, reward_mean=0.1, reward_bound=0.0
4958: loss=0.000, reward_mean=0.0, reward_bound=0.0
4959: loss=0.000, reward_mean=0.0, reward_bound=0.0
4960: loss=0.000, reward_mean=0.0, reward_bound=0.0
4961: loss=0.000, reward_mean=0.0, reward_bound=0.0
4962: loss=0.000, reward_mean=0.1, reward_bound=0.0
4963: loss=0.000, reward_mean=0.1, reward_bound=0.0
4964: loss=0.000, reward_mean=0.1, reward_bound=0.0
4965: loss=0.000, reward_mean=0.0, reward_bound=0.0
4966: loss=0.000, reward_mean=0.0, reward_bound=0.0
4967: loss=0.000, reward_mean=0.1, reward_bound=0.0
4968: loss=0.000, reward_mean=0.1, reward_bound=0.0
4969: loss=0

5117: loss=0.000, reward_mean=0.1, reward_bound=0.0
5118: loss=0.000, reward_mean=0.0, reward_bound=0.0
5119: loss=0.000, reward_mean=0.1, reward_bound=0.0
5120: loss=0.000, reward_mean=0.0, reward_bound=0.0
5121: loss=0.000, reward_mean=0.0, reward_bound=0.0
5122: loss=0.000, reward_mean=0.0, reward_bound=0.0
5123: loss=0.000, reward_mean=0.1, reward_bound=0.0
5124: loss=0.000, reward_mean=0.1, reward_bound=0.0
5125: loss=0.000, reward_mean=0.0, reward_bound=0.0
5126: loss=0.000, reward_mean=0.0, reward_bound=0.0
5127: loss=0.000, reward_mean=0.0, reward_bound=0.0
5128: loss=0.000, reward_mean=0.1, reward_bound=0.0
5129: loss=0.000, reward_mean=0.0, reward_bound=0.0
5130: loss=0.000, reward_mean=0.1, reward_bound=0.0
5131: loss=0.000, reward_mean=0.0, reward_bound=0.0
5132: loss=0.000, reward_mean=0.1, reward_bound=0.0
5133: loss=0.000, reward_mean=0.1, reward_bound=0.0
5134: loss=0.000, reward_mean=0.0, reward_bound=0.0
5135: loss=0.000, reward_mean=0.0, reward_bound=0.0
5136: loss=0

5275: loss=0.000, reward_mean=0.0, reward_bound=0.0
5276: loss=0.000, reward_mean=0.1, reward_bound=0.0
5277: loss=0.000, reward_mean=0.2, reward_bound=0.0
5278: loss=0.000, reward_mean=0.1, reward_bound=0.0
5279: loss=0.000, reward_mean=0.0, reward_bound=0.0
5280: loss=0.000, reward_mean=0.1, reward_bound=0.0
5281: loss=0.000, reward_mean=0.1, reward_bound=0.0
5282: loss=0.000, reward_mean=0.1, reward_bound=0.0
5283: loss=0.000, reward_mean=0.1, reward_bound=0.0
5284: loss=0.000, reward_mean=0.1, reward_bound=0.0
5285: loss=0.000, reward_mean=0.1, reward_bound=0.0
5286: loss=0.000, reward_mean=0.0, reward_bound=0.0
5287: loss=0.000, reward_mean=0.1, reward_bound=0.0
5288: loss=0.000, reward_mean=0.1, reward_bound=0.0
5289: loss=0.000, reward_mean=0.0, reward_bound=0.0
5290: loss=0.000, reward_mean=0.0, reward_bound=0.0
5291: loss=0.000, reward_mean=0.1, reward_bound=0.0
5292: loss=0.000, reward_mean=0.1, reward_bound=0.0
5293: loss=0.000, reward_mean=0.2, reward_bound=0.0
5294: loss=0

5439: loss=0.000, reward_mean=0.1, reward_bound=0.0
5440: loss=0.000, reward_mean=0.1, reward_bound=0.0
5441: loss=0.000, reward_mean=0.0, reward_bound=0.0
5442: loss=0.000, reward_mean=0.1, reward_bound=0.0
5443: loss=0.000, reward_mean=0.0, reward_bound=0.0
5444: loss=0.000, reward_mean=0.0, reward_bound=0.0
5445: loss=0.000, reward_mean=0.1, reward_bound=0.0
5446: loss=0.000, reward_mean=0.0, reward_bound=0.0
5447: loss=0.000, reward_mean=0.1, reward_bound=0.0
5448: loss=0.000, reward_mean=0.1, reward_bound=0.0
5449: loss=0.000, reward_mean=0.1, reward_bound=0.0
5450: loss=0.000, reward_mean=0.1, reward_bound=0.0
5451: loss=0.000, reward_mean=0.0, reward_bound=0.0
5452: loss=0.000, reward_mean=0.1, reward_bound=0.0
5453: loss=0.000, reward_mean=0.1, reward_bound=0.0
5454: loss=0.000, reward_mean=0.2, reward_bound=0.0
5455: loss=0.000, reward_mean=0.1, reward_bound=0.0
5456: loss=0.000, reward_mean=0.1, reward_bound=0.0
5457: loss=0.000, reward_mean=0.0, reward_bound=0.0
5458: loss=0

5603: loss=0.000, reward_mean=0.0, reward_bound=0.0
5604: loss=0.000, reward_mean=0.1, reward_bound=0.0
5605: loss=0.000, reward_mean=0.0, reward_bound=0.0
5606: loss=0.000, reward_mean=0.1, reward_bound=0.0
5607: loss=0.000, reward_mean=0.0, reward_bound=0.0
5608: loss=0.000, reward_mean=0.1, reward_bound=0.0
5609: loss=0.000, reward_mean=0.0, reward_bound=0.0
5610: loss=0.000, reward_mean=0.0, reward_bound=0.0
5611: loss=0.000, reward_mean=0.1, reward_bound=0.0
5612: loss=0.000, reward_mean=0.0, reward_bound=0.0
5613: loss=0.000, reward_mean=0.1, reward_bound=0.0
5614: loss=0.000, reward_mean=0.0, reward_bound=0.0
5615: loss=0.000, reward_mean=0.0, reward_bound=0.0
5616: loss=0.000, reward_mean=0.1, reward_bound=0.0
5617: loss=0.000, reward_mean=0.1, reward_bound=0.0
5618: loss=0.000, reward_mean=0.1, reward_bound=0.0
5619: loss=0.000, reward_mean=0.0, reward_bound=0.0
5620: loss=0.000, reward_mean=0.1, reward_bound=0.0
5621: loss=0.000, reward_mean=0.1, reward_bound=0.0
5622: loss=0

5770: loss=0.000, reward_mean=0.0, reward_bound=0.0
5771: loss=0.000, reward_mean=0.0, reward_bound=0.0
5772: loss=0.000, reward_mean=0.1, reward_bound=0.0
5773: loss=0.000, reward_mean=0.1, reward_bound=0.0
5774: loss=0.000, reward_mean=0.0, reward_bound=0.0
5775: loss=0.000, reward_mean=0.1, reward_bound=0.0
5776: loss=0.000, reward_mean=0.0, reward_bound=0.0
5777: loss=0.000, reward_mean=0.0, reward_bound=0.0
5778: loss=0.000, reward_mean=0.3, reward_bound=0.5
5779: loss=0.000, reward_mean=0.0, reward_bound=0.0
5780: loss=0.000, reward_mean=0.1, reward_bound=0.0
5781: loss=0.000, reward_mean=0.1, reward_bound=0.0
5782: loss=0.000, reward_mean=0.1, reward_bound=0.0
5783: loss=0.000, reward_mean=0.0, reward_bound=0.0
5784: loss=0.000, reward_mean=0.0, reward_bound=0.0
5785: loss=0.000, reward_mean=0.0, reward_bound=0.0
5786: loss=0.000, reward_mean=0.0, reward_bound=0.0
5787: loss=0.000, reward_mean=0.0, reward_bound=0.0
5788: loss=0.000, reward_mean=0.1, reward_bound=0.0
5789: loss=0

5938: loss=0.115, reward_mean=0.1, reward_bound=0.0
5939: loss=0.000, reward_mean=0.1, reward_bound=0.0
5940: loss=0.000, reward_mean=0.1, reward_bound=0.0
5941: loss=0.000, reward_mean=0.0, reward_bound=0.0
5942: loss=0.000, reward_mean=0.1, reward_bound=0.0
5943: loss=0.000, reward_mean=0.0, reward_bound=0.0
5944: loss=0.000, reward_mean=0.1, reward_bound=0.0
5945: loss=0.000, reward_mean=0.0, reward_bound=0.0
5946: loss=0.000, reward_mean=0.0, reward_bound=0.0
5947: loss=0.000, reward_mean=0.0, reward_bound=0.0
5948: loss=0.000, reward_mean=0.1, reward_bound=0.0
5949: loss=0.000, reward_mean=0.1, reward_bound=0.0
5950: loss=0.000, reward_mean=0.0, reward_bound=0.0
5951: loss=0.000, reward_mean=0.1, reward_bound=0.0
5952: loss=0.000, reward_mean=0.0, reward_bound=0.0
5953: loss=0.000, reward_mean=0.1, reward_bound=0.0
5954: loss=0.000, reward_mean=0.0, reward_bound=0.0
5955: loss=0.000, reward_mean=0.1, reward_bound=0.0
5956: loss=0.000, reward_mean=0.0, reward_bound=0.0
5957: loss=0

6105: loss=0.156, reward_mean=0.1, reward_bound=0.0
6106: loss=0.124, reward_mean=0.0, reward_bound=0.0
6107: loss=0.195, reward_mean=0.0, reward_bound=0.0
6108: loss=0.091, reward_mean=0.0, reward_bound=0.0
6109: loss=0.133, reward_mean=0.1, reward_bound=0.0
6110: loss=0.097, reward_mean=0.0, reward_bound=0.0
6111: loss=0.082, reward_mean=0.1, reward_bound=0.0
6112: loss=0.016, reward_mean=0.1, reward_bound=0.0
6113: loss=0.063, reward_mean=0.0, reward_bound=0.0
6114: loss=0.052, reward_mean=0.0, reward_bound=0.0
6115: loss=0.010, reward_mean=0.1, reward_bound=0.0
6116: loss=0.047, reward_mean=0.0, reward_bound=0.0
6117: loss=0.078, reward_mean=0.0, reward_bound=0.0
6118: loss=0.005, reward_mean=0.1, reward_bound=0.0
6119: loss=0.004, reward_mean=0.1, reward_bound=0.0
6120: loss=0.049, reward_mean=0.0, reward_bound=0.0
6121: loss=0.003, reward_mean=0.1, reward_bound=0.0
6122: loss=0.003, reward_mean=0.1, reward_bound=0.0
6123: loss=0.003, reward_mean=0.1, reward_bound=0.0
6124: loss=0

6271: loss=0.001, reward_mean=0.1, reward_bound=0.0
6272: loss=0.001, reward_mean=0.1, reward_bound=0.0
6273: loss=0.001, reward_mean=0.0, reward_bound=0.0
6274: loss=0.001, reward_mean=0.2, reward_bound=0.0
6275: loss=0.001, reward_mean=0.1, reward_bound=0.0
6276: loss=0.001, reward_mean=0.1, reward_bound=0.0
6277: loss=0.001, reward_mean=0.1, reward_bound=0.0
6278: loss=0.001, reward_mean=0.1, reward_bound=0.0
6279: loss=0.001, reward_mean=0.1, reward_bound=0.0
6280: loss=0.001, reward_mean=0.1, reward_bound=0.0
6281: loss=0.001, reward_mean=0.0, reward_bound=0.0
6282: loss=0.001, reward_mean=0.0, reward_bound=0.0
6283: loss=0.001, reward_mean=0.0, reward_bound=0.0
6284: loss=0.084, reward_mean=0.1, reward_bound=0.0
6285: loss=0.001, reward_mean=0.1, reward_bound=0.0
6286: loss=0.001, reward_mean=0.1, reward_bound=0.0
6287: loss=0.001, reward_mean=0.1, reward_bound=0.0
6288: loss=0.001, reward_mean=0.1, reward_bound=0.0
6289: loss=0.001, reward_mean=0.1, reward_bound=0.0
6290: loss=0

6439: loss=0.126, reward_mean=0.0, reward_bound=0.0
6440: loss=0.055, reward_mean=0.1, reward_bound=0.0
6441: loss=0.078, reward_mean=0.2, reward_bound=0.0
6442: loss=0.057, reward_mean=0.1, reward_bound=0.0
6443: loss=0.048, reward_mean=0.0, reward_bound=0.0
6444: loss=0.046, reward_mean=0.0, reward_bound=0.0
6445: loss=0.059, reward_mean=0.0, reward_bound=0.0
6446: loss=0.017, reward_mean=0.0, reward_bound=0.0
6447: loss=0.118, reward_mean=0.0, reward_bound=0.0
6448: loss=0.046, reward_mean=0.1, reward_bound=0.0
6449: loss=0.014, reward_mean=0.1, reward_bound=0.0
6450: loss=0.093, reward_mean=0.0, reward_bound=0.0
6451: loss=0.048, reward_mean=0.1, reward_bound=0.0
6452: loss=0.011, reward_mean=0.0, reward_bound=0.0
6453: loss=0.107, reward_mean=0.0, reward_bound=0.0
6454: loss=0.078, reward_mean=0.1, reward_bound=0.0
6455: loss=0.088, reward_mean=0.1, reward_bound=0.0
6456: loss=0.092, reward_mean=0.2, reward_bound=0.0
6457: loss=0.055, reward_mean=0.1, reward_bound=0.0
6458: loss=0

6599: loss=0.140, reward_mean=0.1, reward_bound=0.0
6600: loss=0.005, reward_mean=0.1, reward_bound=0.0
6601: loss=0.004, reward_mean=0.1, reward_bound=0.0
6602: loss=0.099, reward_mean=0.1, reward_bound=0.0
6603: loss=0.005, reward_mean=0.1, reward_bound=0.0
6604: loss=0.005, reward_mean=0.0, reward_bound=0.0
6605: loss=0.006, reward_mean=0.0, reward_bound=0.0
6606: loss=0.047, reward_mean=0.1, reward_bound=0.0
6607: loss=0.058, reward_mean=0.1, reward_bound=0.0
6608: loss=0.004, reward_mean=0.1, reward_bound=0.0
6609: loss=0.123, reward_mean=0.0, reward_bound=0.0
6610: loss=0.127, reward_mean=0.0, reward_bound=0.0
6611: loss=0.004, reward_mean=0.1, reward_bound=0.0
6612: loss=0.057, reward_mean=0.1, reward_bound=0.0
6613: loss=0.055, reward_mean=0.1, reward_bound=0.0
6614: loss=0.101, reward_mean=0.1, reward_bound=0.0
6615: loss=0.009, reward_mean=0.1, reward_bound=0.0
6616: loss=0.053, reward_mean=0.1, reward_bound=0.0
6617: loss=0.010, reward_mean=0.1, reward_bound=0.0
6618: loss=0

6765: loss=0.001, reward_mean=0.2, reward_bound=0.0
6766: loss=0.001, reward_mean=0.0, reward_bound=0.0
6767: loss=0.001, reward_mean=0.1, reward_bound=0.0
6768: loss=0.001, reward_mean=0.0, reward_bound=0.0
6769: loss=0.000, reward_mean=0.1, reward_bound=0.0
6770: loss=0.001, reward_mean=0.0, reward_bound=0.0
6771: loss=0.000, reward_mean=0.1, reward_bound=0.0
6772: loss=0.001, reward_mean=0.0, reward_bound=0.0
6773: loss=0.001, reward_mean=0.1, reward_bound=0.0
6774: loss=0.001, reward_mean=0.2, reward_bound=0.0
6775: loss=0.001, reward_mean=0.0, reward_bound=0.0
6776: loss=0.001, reward_mean=0.1, reward_bound=0.0
6777: loss=0.001, reward_mean=0.0, reward_bound=0.0
6778: loss=0.001, reward_mean=0.1, reward_bound=0.0
6779: loss=0.000, reward_mean=0.1, reward_bound=0.0
6780: loss=0.001, reward_mean=0.1, reward_bound=0.0
6781: loss=0.001, reward_mean=0.0, reward_bound=0.0
6782: loss=0.000, reward_mean=0.0, reward_bound=0.0
6783: loss=0.000, reward_mean=0.0, reward_bound=0.0
6784: loss=0

6932: loss=0.001, reward_mean=0.0, reward_bound=0.0
6933: loss=0.001, reward_mean=0.0, reward_bound=0.0
6934: loss=0.001, reward_mean=0.0, reward_bound=0.0
6935: loss=0.001, reward_mean=0.0, reward_bound=0.0
6936: loss=0.001, reward_mean=0.1, reward_bound=0.0
6937: loss=0.001, reward_mean=0.1, reward_bound=0.0
6938: loss=0.001, reward_mean=0.1, reward_bound=0.0
6939: loss=0.001, reward_mean=0.1, reward_bound=0.0
6940: loss=0.001, reward_mean=0.1, reward_bound=0.0
6941: loss=0.001, reward_mean=0.1, reward_bound=0.0
6942: loss=0.001, reward_mean=0.1, reward_bound=0.0
6943: loss=0.001, reward_mean=0.0, reward_bound=0.0
6944: loss=0.001, reward_mean=0.1, reward_bound=0.0
6945: loss=0.001, reward_mean=0.0, reward_bound=0.0
6946: loss=0.001, reward_mean=0.1, reward_bound=0.0
6947: loss=0.001, reward_mean=0.1, reward_bound=0.0
6948: loss=0.001, reward_mean=0.0, reward_bound=0.0
6949: loss=0.001, reward_mean=0.1, reward_bound=0.0
6950: loss=0.001, reward_mean=0.1, reward_bound=0.0
6951: loss=0

7098: loss=0.000, reward_mean=0.0, reward_bound=0.0
7099: loss=0.000, reward_mean=0.1, reward_bound=0.0
7100: loss=0.000, reward_mean=0.1, reward_bound=0.0
7101: loss=0.000, reward_mean=0.1, reward_bound=0.0
7102: loss=0.000, reward_mean=0.1, reward_bound=0.0
7103: loss=0.000, reward_mean=0.0, reward_bound=0.0
7104: loss=0.000, reward_mean=0.0, reward_bound=0.0
7105: loss=0.000, reward_mean=0.0, reward_bound=0.0
7106: loss=0.000, reward_mean=0.1, reward_bound=0.0
7107: loss=0.000, reward_mean=0.1, reward_bound=0.0
7108: loss=0.000, reward_mean=0.0, reward_bound=0.0
7109: loss=0.000, reward_mean=0.0, reward_bound=0.0
7110: loss=0.000, reward_mean=0.1, reward_bound=0.0
7111: loss=0.000, reward_mean=0.1, reward_bound=0.0
7112: loss=0.000, reward_mean=0.1, reward_bound=0.0
7113: loss=0.000, reward_mean=0.1, reward_bound=0.0
7114: loss=0.000, reward_mean=0.1, reward_bound=0.0
7115: loss=0.000, reward_mean=0.1, reward_bound=0.0
7116: loss=0.000, reward_mean=0.2, reward_bound=0.0
7117: loss=0

7266: loss=0.000, reward_mean=0.1, reward_bound=0.0
7267: loss=0.000, reward_mean=0.0, reward_bound=0.0
7268: loss=0.000, reward_mean=0.1, reward_bound=0.0
7269: loss=0.000, reward_mean=0.1, reward_bound=0.0
7270: loss=0.000, reward_mean=0.0, reward_bound=0.0
7271: loss=0.000, reward_mean=0.1, reward_bound=0.0
7272: loss=0.000, reward_mean=0.0, reward_bound=0.0
7273: loss=0.000, reward_mean=0.1, reward_bound=0.0
7274: loss=0.000, reward_mean=0.0, reward_bound=0.0
7275: loss=0.000, reward_mean=0.0, reward_bound=0.0
7276: loss=0.000, reward_mean=0.0, reward_bound=0.0
7277: loss=0.000, reward_mean=0.1, reward_bound=0.0
7278: loss=0.000, reward_mean=0.1, reward_bound=0.0
7279: loss=0.000, reward_mean=0.1, reward_bound=0.0
7280: loss=0.000, reward_mean=0.0, reward_bound=0.0
7281: loss=0.000, reward_mean=0.2, reward_bound=0.0
7282: loss=0.000, reward_mean=0.1, reward_bound=0.0
7283: loss=0.000, reward_mean=0.0, reward_bound=0.0
7284: loss=0.000, reward_mean=0.1, reward_bound=0.0
7285: loss=0

7431: loss=0.000, reward_mean=0.1, reward_bound=0.0
7432: loss=0.000, reward_mean=0.1, reward_bound=0.0
7433: loss=0.000, reward_mean=0.0, reward_bound=0.0
7434: loss=0.000, reward_mean=0.0, reward_bound=0.0
7435: loss=0.000, reward_mean=0.1, reward_bound=0.0
7436: loss=0.000, reward_mean=0.0, reward_bound=0.0
7437: loss=0.000, reward_mean=0.0, reward_bound=0.0
7438: loss=0.000, reward_mean=0.2, reward_bound=0.0
7439: loss=0.000, reward_mean=0.0, reward_bound=0.0
7440: loss=0.000, reward_mean=0.0, reward_bound=0.0
7441: loss=0.000, reward_mean=0.1, reward_bound=0.0
7442: loss=0.000, reward_mean=0.0, reward_bound=0.0
7443: loss=0.000, reward_mean=0.1, reward_bound=0.0
7444: loss=0.000, reward_mean=0.1, reward_bound=0.0
7445: loss=0.000, reward_mean=0.1, reward_bound=0.0
7446: loss=0.000, reward_mean=0.1, reward_bound=0.0
7447: loss=0.000, reward_mean=0.0, reward_bound=0.0
7448: loss=0.000, reward_mean=0.1, reward_bound=0.0
7449: loss=0.000, reward_mean=0.1, reward_bound=0.0
7450: loss=0

7598: loss=0.000, reward_mean=0.0, reward_bound=0.0
7599: loss=0.000, reward_mean=0.1, reward_bound=0.0
7600: loss=0.000, reward_mean=0.0, reward_bound=0.0
7601: loss=0.000, reward_mean=0.1, reward_bound=0.0
7602: loss=0.000, reward_mean=0.1, reward_bound=0.0
7603: loss=0.000, reward_mean=0.0, reward_bound=0.0
7604: loss=0.000, reward_mean=0.0, reward_bound=0.0
7605: loss=0.000, reward_mean=0.0, reward_bound=0.0
7606: loss=0.000, reward_mean=0.0, reward_bound=0.0
7607: loss=0.000, reward_mean=0.1, reward_bound=0.0
7608: loss=0.000, reward_mean=0.0, reward_bound=0.0
7609: loss=0.000, reward_mean=0.1, reward_bound=0.0
7610: loss=0.000, reward_mean=0.0, reward_bound=0.0
7611: loss=0.000, reward_mean=0.0, reward_bound=0.0
7612: loss=0.000, reward_mean=0.0, reward_bound=0.0
7613: loss=0.000, reward_mean=0.2, reward_bound=0.0
7614: loss=0.000, reward_mean=0.1, reward_bound=0.0
7615: loss=0.000, reward_mean=0.0, reward_bound=0.0
7616: loss=0.000, reward_mean=0.0, reward_bound=0.0
7617: loss=0

7757: loss=0.000, reward_mean=0.1, reward_bound=0.0
7758: loss=0.000, reward_mean=0.0, reward_bound=0.0
7759: loss=0.000, reward_mean=0.1, reward_bound=0.0
7760: loss=0.000, reward_mean=0.1, reward_bound=0.0
7761: loss=0.000, reward_mean=0.0, reward_bound=0.0
7762: loss=0.000, reward_mean=0.0, reward_bound=0.0
7763: loss=0.000, reward_mean=0.0, reward_bound=0.0
7764: loss=0.000, reward_mean=0.0, reward_bound=0.0
7765: loss=0.000, reward_mean=0.1, reward_bound=0.0
7766: loss=0.000, reward_mean=0.0, reward_bound=0.0
7767: loss=0.000, reward_mean=0.1, reward_bound=0.0
7768: loss=0.000, reward_mean=0.1, reward_bound=0.0
7769: loss=0.000, reward_mean=0.1, reward_bound=0.0
7770: loss=0.000, reward_mean=0.1, reward_bound=0.0
7771: loss=0.000, reward_mean=0.1, reward_bound=0.0
7772: loss=0.000, reward_mean=0.0, reward_bound=0.0
7773: loss=0.000, reward_mean=0.1, reward_bound=0.0
7774: loss=0.000, reward_mean=0.0, reward_bound=0.0
7775: loss=0.000, reward_mean=0.0, reward_bound=0.0
7776: loss=0

7925: loss=0.000, reward_mean=0.1, reward_bound=0.0
7926: loss=0.000, reward_mean=0.1, reward_bound=0.0
7927: loss=0.000, reward_mean=0.0, reward_bound=0.0
7928: loss=0.000, reward_mean=0.1, reward_bound=0.0
7929: loss=0.000, reward_mean=0.1, reward_bound=0.0
7930: loss=0.000, reward_mean=0.0, reward_bound=0.0
7931: loss=0.000, reward_mean=0.1, reward_bound=0.0
7932: loss=0.000, reward_mean=0.1, reward_bound=0.0
7933: loss=0.000, reward_mean=0.0, reward_bound=0.0
7934: loss=0.000, reward_mean=0.1, reward_bound=0.0
7935: loss=0.000, reward_mean=0.2, reward_bound=0.0
7936: loss=0.000, reward_mean=0.1, reward_bound=0.0
7937: loss=0.000, reward_mean=0.1, reward_bound=0.0
7938: loss=0.000, reward_mean=0.0, reward_bound=0.0
7939: loss=0.000, reward_mean=0.1, reward_bound=0.0
7940: loss=0.000, reward_mean=0.1, reward_bound=0.0
7941: loss=0.000, reward_mean=0.1, reward_bound=0.0
7942: loss=0.000, reward_mean=0.1, reward_bound=0.0
7943: loss=0.000, reward_mean=0.1, reward_bound=0.0
7944: loss=0

8094: loss=0.000, reward_mean=0.0, reward_bound=0.0
8095: loss=0.000, reward_mean=0.1, reward_bound=0.0
8096: loss=0.000, reward_mean=0.0, reward_bound=0.0
8097: loss=0.000, reward_mean=0.1, reward_bound=0.0
8098: loss=0.000, reward_mean=0.1, reward_bound=0.0
8099: loss=0.000, reward_mean=0.1, reward_bound=0.0
8100: loss=0.000, reward_mean=0.1, reward_bound=0.0
8101: loss=0.000, reward_mean=0.1, reward_bound=0.0
8102: loss=0.000, reward_mean=0.1, reward_bound=0.0
8103: loss=0.000, reward_mean=0.0, reward_bound=0.0
8104: loss=0.000, reward_mean=0.0, reward_bound=0.0
8105: loss=0.000, reward_mean=0.0, reward_bound=0.0
8106: loss=0.000, reward_mean=0.2, reward_bound=0.0
8107: loss=0.000, reward_mean=0.1, reward_bound=0.0
8108: loss=0.000, reward_mean=0.1, reward_bound=0.0
8109: loss=0.000, reward_mean=0.1, reward_bound=0.0
8110: loss=0.000, reward_mean=0.0, reward_bound=0.0
8111: loss=0.000, reward_mean=0.0, reward_bound=0.0
8112: loss=0.000, reward_mean=0.0, reward_bound=0.0
8113: loss=0

8262: loss=0.000, reward_mean=0.2, reward_bound=0.0
8263: loss=0.000, reward_mean=0.1, reward_bound=0.0
8264: loss=0.000, reward_mean=0.1, reward_bound=0.0
8265: loss=0.000, reward_mean=0.0, reward_bound=0.0
8266: loss=0.000, reward_mean=0.1, reward_bound=0.0
8267: loss=0.000, reward_mean=0.1, reward_bound=0.0
8268: loss=0.000, reward_mean=0.0, reward_bound=0.0
8269: loss=0.000, reward_mean=0.2, reward_bound=0.0
8270: loss=0.000, reward_mean=0.1, reward_bound=0.0
8271: loss=0.000, reward_mean=0.1, reward_bound=0.0
8272: loss=0.000, reward_mean=0.0, reward_bound=0.0
8273: loss=0.000, reward_mean=0.1, reward_bound=0.0
8274: loss=0.000, reward_mean=0.1, reward_bound=0.0
8275: loss=0.000, reward_mean=0.0, reward_bound=0.0
8276: loss=0.000, reward_mean=0.0, reward_bound=0.0
8277: loss=0.000, reward_mean=0.1, reward_bound=0.0
8278: loss=0.000, reward_mean=0.1, reward_bound=0.0
8279: loss=0.000, reward_mean=0.1, reward_bound=0.0
8280: loss=0.000, reward_mean=0.2, reward_bound=0.0
8281: loss=0

8431: loss=0.000, reward_mean=0.0, reward_bound=0.0
8432: loss=0.000, reward_mean=0.0, reward_bound=0.0
8433: loss=0.000, reward_mean=0.1, reward_bound=0.0
8434: loss=0.000, reward_mean=0.0, reward_bound=0.0
8435: loss=0.000, reward_mean=0.1, reward_bound=0.0
8436: loss=0.000, reward_mean=0.1, reward_bound=0.0
8437: loss=0.000, reward_mean=0.0, reward_bound=0.0
8438: loss=0.000, reward_mean=0.1, reward_bound=0.0
8439: loss=0.000, reward_mean=0.1, reward_bound=0.0
8440: loss=0.000, reward_mean=0.0, reward_bound=0.0
8441: loss=0.000, reward_mean=0.0, reward_bound=0.0
8442: loss=0.000, reward_mean=0.1, reward_bound=0.0
8443: loss=0.000, reward_mean=0.0, reward_bound=0.0
8444: loss=0.000, reward_mean=0.1, reward_bound=0.0
8445: loss=0.000, reward_mean=0.0, reward_bound=0.0
8446: loss=0.000, reward_mean=0.0, reward_bound=0.0
8447: loss=0.000, reward_mean=0.0, reward_bound=0.0
8448: loss=0.000, reward_mean=0.1, reward_bound=0.0
8449: loss=0.000, reward_mean=0.0, reward_bound=0.0
8450: loss=0

8599: loss=0.000, reward_mean=0.1, reward_bound=0.0
8600: loss=0.000, reward_mean=0.0, reward_bound=0.0
8601: loss=0.000, reward_mean=0.1, reward_bound=0.0
8602: loss=0.000, reward_mean=0.0, reward_bound=0.0
8603: loss=0.000, reward_mean=0.1, reward_bound=0.0
8604: loss=0.000, reward_mean=0.0, reward_bound=0.0
8605: loss=0.000, reward_mean=0.1, reward_bound=0.0
8606: loss=0.000, reward_mean=0.1, reward_bound=0.0
8607: loss=0.000, reward_mean=0.0, reward_bound=0.0
8608: loss=0.000, reward_mean=0.1, reward_bound=0.0
8609: loss=0.000, reward_mean=0.1, reward_bound=0.0
8610: loss=0.000, reward_mean=0.1, reward_bound=0.0
8611: loss=0.000, reward_mean=0.0, reward_bound=0.0
8612: loss=0.000, reward_mean=0.1, reward_bound=0.0
8613: loss=0.000, reward_mean=0.1, reward_bound=0.0
8614: loss=0.000, reward_mean=0.1, reward_bound=0.0
8615: loss=0.000, reward_mean=0.1, reward_bound=0.0
8616: loss=0.000, reward_mean=0.1, reward_bound=0.0
8617: loss=0.000, reward_mean=0.1, reward_bound=0.0
8618: loss=0

8767: loss=0.000, reward_mean=0.1, reward_bound=0.0
8768: loss=0.000, reward_mean=0.0, reward_bound=0.0
8769: loss=0.000, reward_mean=0.0, reward_bound=0.0
8770: loss=0.000, reward_mean=0.1, reward_bound=0.0
8771: loss=0.000, reward_mean=0.0, reward_bound=0.0
8772: loss=0.000, reward_mean=0.0, reward_bound=0.0
8773: loss=0.000, reward_mean=0.0, reward_bound=0.0
8774: loss=0.000, reward_mean=0.1, reward_bound=0.0
8775: loss=0.000, reward_mean=0.0, reward_bound=0.0
8776: loss=0.000, reward_mean=0.0, reward_bound=0.0
8777: loss=0.000, reward_mean=0.1, reward_bound=0.0
8778: loss=0.000, reward_mean=0.1, reward_bound=0.0
8779: loss=0.000, reward_mean=0.1, reward_bound=0.0
8780: loss=0.000, reward_mean=0.0, reward_bound=0.0
8781: loss=0.000, reward_mean=0.0, reward_bound=0.0
8782: loss=0.000, reward_mean=0.2, reward_bound=0.0
8783: loss=0.000, reward_mean=0.1, reward_bound=0.0
8784: loss=0.000, reward_mean=0.0, reward_bound=0.0
8785: loss=0.113, reward_mean=0.0, reward_bound=0.0
8786: loss=0

8926: loss=0.000, reward_mean=0.1, reward_bound=0.0
8927: loss=0.000, reward_mean=0.1, reward_bound=0.0
8928: loss=0.000, reward_mean=0.0, reward_bound=0.0
8929: loss=0.000, reward_mean=0.0, reward_bound=0.0
8930: loss=0.000, reward_mean=0.1, reward_bound=0.0
8931: loss=0.000, reward_mean=0.1, reward_bound=0.0
8932: loss=0.000, reward_mean=0.1, reward_bound=0.0
8933: loss=0.000, reward_mean=0.0, reward_bound=0.0
8934: loss=0.000, reward_mean=0.0, reward_bound=0.0
8935: loss=0.000, reward_mean=0.0, reward_bound=0.0
8936: loss=0.000, reward_mean=0.1, reward_bound=0.0
8937: loss=0.000, reward_mean=0.0, reward_bound=0.0
8938: loss=0.000, reward_mean=0.1, reward_bound=0.0
8939: loss=0.000, reward_mean=0.0, reward_bound=0.0
8940: loss=0.000, reward_mean=0.1, reward_bound=0.0
8941: loss=0.000, reward_mean=0.1, reward_bound=0.0
8942: loss=0.000, reward_mean=0.1, reward_bound=0.0
8943: loss=0.000, reward_mean=0.2, reward_bound=0.0
8944: loss=0.000, reward_mean=0.1, reward_bound=0.0
8945: loss=0

9094: loss=0.000, reward_mean=0.0, reward_bound=0.0
9095: loss=0.000, reward_mean=0.1, reward_bound=0.0
9096: loss=0.000, reward_mean=0.1, reward_bound=0.0
9097: loss=0.000, reward_mean=0.0, reward_bound=0.0
9098: loss=0.000, reward_mean=0.1, reward_bound=0.0
9099: loss=0.000, reward_mean=0.1, reward_bound=0.0
9100: loss=0.000, reward_mean=0.0, reward_bound=0.0
9101: loss=0.000, reward_mean=0.1, reward_bound=0.0
9102: loss=0.000, reward_mean=0.0, reward_bound=0.0
9103: loss=0.000, reward_mean=0.1, reward_bound=0.0
9104: loss=0.000, reward_mean=0.0, reward_bound=0.0
9105: loss=0.000, reward_mean=0.1, reward_bound=0.0
9106: loss=0.000, reward_mean=0.1, reward_bound=0.0
9107: loss=0.000, reward_mean=0.1, reward_bound=0.0
9108: loss=0.000, reward_mean=0.1, reward_bound=0.0
9109: loss=0.000, reward_mean=0.0, reward_bound=0.0
9110: loss=0.000, reward_mean=0.0, reward_bound=0.0
9111: loss=0.000, reward_mean=0.1, reward_bound=0.0
9112: loss=0.000, reward_mean=0.1, reward_bound=0.0
9113: loss=0

9260: loss=0.000, reward_mean=0.2, reward_bound=0.0
9261: loss=0.000, reward_mean=0.1, reward_bound=0.0
9262: loss=0.000, reward_mean=0.1, reward_bound=0.0
9263: loss=0.000, reward_mean=0.1, reward_bound=0.0
9264: loss=0.000, reward_mean=0.0, reward_bound=0.0
9265: loss=0.000, reward_mean=0.0, reward_bound=0.0
9266: loss=0.000, reward_mean=0.1, reward_bound=0.0
9267: loss=0.000, reward_mean=0.0, reward_bound=0.0
9268: loss=0.000, reward_mean=0.0, reward_bound=0.0
9269: loss=0.000, reward_mean=0.0, reward_bound=0.0
9270: loss=0.000, reward_mean=0.1, reward_bound=0.0
9271: loss=0.000, reward_mean=0.0, reward_bound=0.0
9272: loss=0.000, reward_mean=0.1, reward_bound=0.0
9273: loss=0.000, reward_mean=0.2, reward_bound=0.0
9274: loss=0.000, reward_mean=0.1, reward_bound=0.0
9275: loss=0.000, reward_mean=0.1, reward_bound=0.0
9276: loss=0.000, reward_mean=0.1, reward_bound=0.0
9277: loss=0.000, reward_mean=0.0, reward_bound=0.0
9278: loss=0.000, reward_mean=0.0, reward_bound=0.0
9279: loss=0

9420: loss=0.000, reward_mean=0.1, reward_bound=0.0
9421: loss=0.000, reward_mean=0.0, reward_bound=0.0
9422: loss=0.000, reward_mean=0.1, reward_bound=0.0
9423: loss=0.000, reward_mean=0.0, reward_bound=0.0
9424: loss=0.000, reward_mean=0.0, reward_bound=0.0
9425: loss=0.000, reward_mean=0.0, reward_bound=0.0
9426: loss=0.000, reward_mean=0.1, reward_bound=0.0
9427: loss=0.000, reward_mean=0.1, reward_bound=0.0
9428: loss=0.000, reward_mean=0.0, reward_bound=0.0
9429: loss=0.000, reward_mean=0.0, reward_bound=0.0
9430: loss=0.000, reward_mean=0.1, reward_bound=0.0
9431: loss=0.000, reward_mean=0.0, reward_bound=0.0
9432: loss=0.000, reward_mean=0.1, reward_bound=0.0
9433: loss=0.000, reward_mean=0.1, reward_bound=0.0
9434: loss=0.000, reward_mean=0.1, reward_bound=0.0
9435: loss=0.000, reward_mean=0.1, reward_bound=0.0
9436: loss=0.000, reward_mean=0.1, reward_bound=0.0
9437: loss=0.000, reward_mean=0.1, reward_bound=0.0
9438: loss=0.000, reward_mean=0.1, reward_bound=0.0
9439: loss=0

9586: loss=0.000, reward_mean=0.0, reward_bound=0.0
9587: loss=0.000, reward_mean=0.0, reward_bound=0.0
9588: loss=0.000, reward_mean=0.1, reward_bound=0.0
9589: loss=0.000, reward_mean=0.2, reward_bound=0.0
9590: loss=0.000, reward_mean=0.0, reward_bound=0.0
9591: loss=0.000, reward_mean=0.1, reward_bound=0.0
9592: loss=0.000, reward_mean=0.0, reward_bound=0.0
9593: loss=0.000, reward_mean=0.0, reward_bound=0.0
9594: loss=0.000, reward_mean=0.0, reward_bound=0.0
9595: loss=0.000, reward_mean=0.2, reward_bound=0.0
9596: loss=0.000, reward_mean=0.1, reward_bound=0.0
9597: loss=0.000, reward_mean=0.0, reward_bound=0.0
9598: loss=0.000, reward_mean=0.0, reward_bound=0.0
9599: loss=0.000, reward_mean=0.0, reward_bound=0.0
9600: loss=0.000, reward_mean=0.1, reward_bound=0.0
9601: loss=0.000, reward_mean=0.1, reward_bound=0.0
9602: loss=0.000, reward_mean=0.0, reward_bound=0.0
9603: loss=0.000, reward_mean=0.0, reward_bound=0.0
9604: loss=0.000, reward_mean=0.1, reward_bound=0.0
9605: loss=0

9750: loss=0.000, reward_mean=0.1, reward_bound=0.0
9751: loss=0.000, reward_mean=0.0, reward_bound=0.0
9752: loss=0.000, reward_mean=0.0, reward_bound=0.0
9753: loss=0.000, reward_mean=0.2, reward_bound=0.0
9754: loss=0.000, reward_mean=0.0, reward_bound=0.0
9755: loss=0.000, reward_mean=0.0, reward_bound=0.0
9756: loss=0.000, reward_mean=0.0, reward_bound=0.0
9757: loss=0.000, reward_mean=0.1, reward_bound=0.0
9758: loss=0.000, reward_mean=0.1, reward_bound=0.0
9759: loss=0.000, reward_mean=0.1, reward_bound=0.0
9760: loss=0.000, reward_mean=0.1, reward_bound=0.0
9761: loss=0.000, reward_mean=0.1, reward_bound=0.0
9762: loss=0.000, reward_mean=0.1, reward_bound=0.0
9763: loss=0.000, reward_mean=0.1, reward_bound=0.0
9764: loss=0.000, reward_mean=0.1, reward_bound=0.0
9765: loss=0.000, reward_mean=0.0, reward_bound=0.0
9766: loss=0.000, reward_mean=0.0, reward_bound=0.0
9767: loss=0.000, reward_mean=0.0, reward_bound=0.0
9768: loss=0.000, reward_mean=0.1, reward_bound=0.0
9769: loss=0

9916: loss=0.000, reward_mean=0.1, reward_bound=0.0
9917: loss=0.000, reward_mean=0.1, reward_bound=0.0
9918: loss=0.000, reward_mean=0.1, reward_bound=0.0
9919: loss=0.000, reward_mean=0.1, reward_bound=0.0
9920: loss=0.000, reward_mean=0.1, reward_bound=0.0
9921: loss=0.000, reward_mean=0.0, reward_bound=0.0
9922: loss=0.000, reward_mean=0.1, reward_bound=0.0
9923: loss=0.000, reward_mean=0.1, reward_bound=0.0
9924: loss=0.000, reward_mean=0.1, reward_bound=0.0
9925: loss=0.000, reward_mean=0.0, reward_bound=0.0
9926: loss=0.000, reward_mean=0.1, reward_bound=0.0
9927: loss=0.000, reward_mean=0.1, reward_bound=0.0
9928: loss=0.000, reward_mean=0.0, reward_bound=0.0
9929: loss=0.000, reward_mean=0.0, reward_bound=0.0
9930: loss=0.000, reward_mean=0.0, reward_bound=0.0
9931: loss=0.000, reward_mean=0.0, reward_bound=0.0
9932: loss=0.000, reward_mean=0.0, reward_bound=0.0
9933: loss=0.000, reward_mean=0.0, reward_bound=0.0
9934: loss=0.000, reward_mean=0.1, reward_bound=0.0
9935: loss=0

10082: loss=0.000, reward_mean=0.1, reward_bound=0.0
10083: loss=0.000, reward_mean=0.2, reward_bound=0.0
10084: loss=0.000, reward_mean=0.0, reward_bound=0.0
10085: loss=0.000, reward_mean=0.1, reward_bound=0.0
10086: loss=0.000, reward_mean=0.1, reward_bound=0.0
10087: loss=0.000, reward_mean=0.1, reward_bound=0.0
10088: loss=0.000, reward_mean=0.1, reward_bound=0.0
10089: loss=0.000, reward_mean=0.1, reward_bound=0.0
10090: loss=0.000, reward_mean=0.0, reward_bound=0.0
10091: loss=0.000, reward_mean=0.2, reward_bound=0.0
10092: loss=0.000, reward_mean=0.0, reward_bound=0.0
10093: loss=0.000, reward_mean=0.1, reward_bound=0.0
10094: loss=0.000, reward_mean=0.1, reward_bound=0.0
10095: loss=0.000, reward_mean=0.0, reward_bound=0.0
10096: loss=0.000, reward_mean=0.0, reward_bound=0.0
10097: loss=0.000, reward_mean=0.1, reward_bound=0.0
10098: loss=0.000, reward_mean=0.1, reward_bound=0.0
10099: loss=0.000, reward_mean=0.1, reward_bound=0.0
10100: loss=0.000, reward_mean=0.0, reward_bou

KeyboardInterrupt: 

To understand what's going on, we need to look deeper at the reward structure of both
environments. In CartPole, every step of the environment gives us the reward 1.0, until
the moment that the pole falls. So, the longer our agent balanced the pole, the more
reward it obtained. Due to randomness in our agent's behavior, different episodes
were of different lengths, which gave us a pretty normal distribution of the episodes'
rewards. After choosing a reward boundary, we rejected less successful episodes
and learned how to repeat better ones (by training on successful episodes' data).

In the FrozenLake environment, episodes and their rewards look different. **We get
the reward of 1.0 only when we reach the goal, and this reward says nothing about
how good each episode was. Was it quick and efficient or did we make four rounds
on the lake before we randomly stepped into the final cell? We don't know; it's just 1.0
reward and that's it. The distribution of rewards for our episodes is also problematic.
There are only two kinds of episodes possible, with zero reward (failed) and one
reward (successful), and failed episodes will obviously dominate in the beginning of
the training. So, our percentile selection of "elite" episodes is totally wrong and gives
us bad examples to train on. This is the reason for our training failure.**
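
A small numerical sketch makes the failure concrete. It assumes a batch of 100 episodes with three successes, the 70th percentile as the boundary, and a selection rule that keeps every episode whose reward is at least the boundary. These numbers and the rule are illustrative assumptions, not values taken from the run above.

```python
import numpy as np

# Hypothetical batch: 100 episodes, only 3 of which reached the goal.
rewards = np.array([1.0] * 3 + [0.0] * 97)

# Assumed percentile for the "elite" cut-off (illustrative value).
PERCENTILE = 70

reward_bound = np.percentile(rewards, PERCENTILE)

# Assumed selection rule: keep episodes with reward >= boundary.
elite_mask = rewards >= reward_bound

print("reward_bound =", reward_bound)
print("episodes kept =", int(elite_mask.sum()))
print("failed episodes kept =", int((elite_mask & (rewards == 0.0)).sum()))
```

With so few successes, the boundary collapses to zero and the filter passes every episode, so under these assumptions the network would be trained mostly on failed episodes.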

This example shows us the limitations of the cross-entropy method:
- For training, our episodes have to be finite and, preferably, short
- The total reward for the episodes should have enough variability to separate
good episodes from bad ones
- There is no intermediate indication about whether the agent has succeeded
or failed

Later in the book, we will look at other methods that address these
limitations. For now, if you are curious about how FrozenLake can be solved using
the cross-entropy method, here is a list of tweaks of the code that you need to make
(the full example is in Chapter04/03_frozenlake_tweaked.py):
- Larger batches of played episodes: In CartPole, it was enough to have
16 episodes on every iteration, but FrozenLake requires at least 100 just
to get some successful episodes.
- Discount factor applied to the reward: To make the total reward for an
episode depend on its length, and add variety in episodes, we can use
a discounted total reward with the discount factor 0.9 or 0.95. In this case,
the reward for shorter episodes will be higher than the reward for longer
ones. This increases variability in reward distribution, which helps to avoid
situations like the one shown in Figure 4.10.
- Keeping "elite" episodes for a longer time: In the CartPole training,
we sampled episodes from the environment, trained on the best ones,
and threw them away. In FrozenLake, a successful episode is a much rarer
animal, so we need to keep them for several iterations to train on them.
- Decreasing learning rate: This will give our NN time to average more
training samples.
- Much longer training time: Due to the sparsity of successful episodes, and
the random outcome of our actions, it's much harder for our NN to get an
idea of the best behavior to perform in any particular situation. To reach
50% successful episodes, about 5k training iterations are required.
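
To see how discounting adds variability, the sketch below scores a few episodes of different lengths as reward × γ^length with γ = 0.9. The episode lengths are made up for illustration, and the helper name `discounted_reward` is ours.

```python
GAMMA = 0.9  # discount factor suggested in the text

def discounted_reward(reward, n_steps, gamma=GAMMA):
    """Scale the episode's final reward by gamma ** episode length."""
    return reward * gamma ** n_steps

# Hypothetical episodes: (final reward, number of steps taken).
episodes = [(1.0, 6), (1.0, 12), (1.0, 40), (0.0, 9)]

scores = [discounted_reward(r, n) for r, n in episodes]
for (r, n), score in zip(episodes, scores):
    print(f"reward={r:.1f} steps={n:2d} discounted={score:.3f}")
```

All three successful episodes had the same raw reward of 1.0, but after discounting the shortest one scores highest, which gives the percentile filter something to separate.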

To incorporate all these into our code, we need to change the filter_batch function
to calculate discounted reward and return "elite" episodes for us to keep:

In [9]:
GAMMA = 0.9
BATCH_SIZE = 100

def filter_batch(batch, percentile):
    # Discount the final reward by episode length, so that shorter
    # successful episodes score higher than longer ones
    filter_fun = lambda s: s.reward * (GAMMA ** len(s.steps))
    disc_rewards = list(map(filter_fun, batch))
    reward_bound = np.percentile(disc_rewards, percentile)

    train_obs = []
    train_act = []
    elite_batch = []
    for example, discounted_reward in zip(batch, disc_rewards):
        # Strict comparison: episodes with zero reward are never kept
        if discounted_reward > reward_bound:
            train_obs.extend(map(lambda step: step.observation,
                                 example.steps))
            train_act.extend(map(lambda step: step.action,
                                 example.steps))
            # Remember the episode so it can be reused on later iterations
            elite_batch.append(example)

    return elite_batch, train_obs, train_act, reward_bound
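
As a quick sanity check, the function can be exercised on hand-made episodes. The sketch assumes that `Episode` and `EpisodeStep` are namedtuples with the fields `reward`/`steps` and `observation`/`action`, which is how they are accessed above. The toy observations and actions are arbitrary, the helper `make_episode` is hypothetical, and the function is repeated in condensed form only so that the block runs on its own.

```python
from collections import namedtuple
import numpy as np

# Assumed definitions, matching how the fields are accessed in filter_batch.
Episode = namedtuple("Episode", ["reward", "steps"])
EpisodeStep = namedtuple("EpisodeStep", ["observation", "action"])

GAMMA = 0.9

def filter_batch(batch, percentile):
    # Condensed restatement of the function above (same logic).
    disc_rewards = [ep.reward * GAMMA ** len(ep.steps) for ep in batch]
    reward_bound = np.percentile(disc_rewards, percentile)
    elite = [ep for ep, r in zip(batch, disc_rewards) if r > reward_bound]
    obs = [s.observation for ep in elite for s in ep.steps]
    acts = [s.action for ep in elite for s in ep.steps]
    return elite, obs, acts, reward_bound

def make_episode(reward, n_steps):
    # Toy episode: observations and actions are placeholders.
    steps = [EpisodeStep(observation=i, action=0) for i in range(n_steps)]
    return Episode(reward=reward, steps=steps)

# Two successes of different length and eight failures.
batch = [make_episode(1.0, 6), make_episode(1.0, 20)] + \
        [make_episode(0.0, 5) for _ in range(8)]

elite, obs, acts, bound = filter_batch(batch, percentile=70)
print(len(elite), len(obs), bound)
```

Because the comparison is strict, failed episodes never enter the elite set, even when the boundary itself is zero.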

Then, in the training loop, we will store previous "elite" episodes to pass them to the
preceding function on the next training iteration.

In [10]:
import random
random.seed(12345)

env = DiscreteOneHotWrapper(gym.make("FrozenLake-v0"))
# env = gym.wrappers.Monitor(env, directory="mon", force=True)
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = Net(obs_size, HIDDEN_SIZE, n_actions)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.001)  # learning rate lowered tenfold
writer = SummaryWriter(comment="-frozenlake-tweaked")

full_batch = []
for iter_no, batch in enumerate(iterate_batches(
        env, net, BATCH_SIZE)):
    reward_mean = float(np.mean(list(map(
        lambda s: s.reward, batch))))
    full_batch, obs, acts, reward_bound = \
        filter_batch(full_batch + batch, PERCENTILE)
    if not full_batch:
        continue
    obs_v = torch.FloatTensor(obs)
    acts_v = torch.LongTensor(acts)
    full_batch = full_batch[-500:]

    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = objective(action_scores_v, acts_v)
    loss_v.backward()
    optimizer.step()
    print("%d: loss=%.3f, rw_mean=%.3f, "
          "rw_bound=%.3f, batch=%d" % (
        iter_no, loss_v.item(), reward_mean,
        reward_bound, len(full_batch)))
    writer.add_scalar("loss", loss_v.item(), iter_no)
    writer.add_scalar("reward_mean", reward_mean, iter_no)
    writer.add_scalar("reward_bound", reward_bound, iter_no)
    if reward_mean > 0.8:
        print("Solved!")
        break
writer.close()

1: loss=1.365, rw_mean=0.010, rw_bound=0.000, batch=1
2: loss=1.356, rw_mean=0.030, rw_bound=0.000, batch=4
3: loss=1.364, rw_mean=0.020, rw_bound=0.000, batch=6
4: loss=1.362, rw_mean=0.030, rw_bound=0.000, batch=9
5: loss=1.362, rw_mean=0.010, rw_bound=0.000, batch=10
6: loss=1.360, rw_mean=0.000, rw_bound=0.000, batch=10
7: loss=1.358, rw_mean=0.000, rw_bound=0.000, batch=10
8: loss=1.358, rw_mean=0.030, rw_bound=0.000, batch=13
9: loss=1.355, rw_mean=0.030, rw_bound=0.000, batch=16
10: loss=1.351, rw_mean=0.020, rw_bound=0.000, batch=18
11: loss=1.353, rw_mean=0.030, rw_bound=0.000, batch=21
12: loss=1.349, rw_mean=0.040, rw_bound=0.000, batch=25
13: loss=1.349, rw_mean=0.010, rw_bound=0.000, batch=26
14: loss=1.349, rw_mean=0.020, rw_bound=0.000, batch=28
15: loss=1.345, rw_mean=0.020, rw_bound=0.000, batch=30
16: loss=1.343, rw_mean=0.010, rw_bound=0.000, batch=31
17: loss=1.343, rw_mean=0.010, rw_bound=0.000, batch=32
18: loss=1.343, rw_mean=0.030, rw_bound=0.000, batch=35
19: l

147: loss=0.956, rw_mean=0.030, rw_bound=0.387, batch=42
148: loss=0.955, rw_mean=0.050, rw_bound=0.387, batch=42
149: loss=0.974, rw_mean=0.070, rw_bound=0.338, batch=43
150: loss=0.955, rw_mean=0.040, rw_bound=0.401, batch=43
151: loss=0.954, rw_mean=0.020, rw_bound=0.424, batch=43
152: loss=0.953, rw_mean=0.040, rw_bound=0.450, batch=43
153: loss=0.955, rw_mean=0.040, rw_bound=0.478, batch=44
154: loss=0.951, rw_mean=0.040, rw_bound=0.478, batch=45
155: loss=0.955, rw_mean=0.060, rw_bound=0.478, batch=40
156: loss=0.943, rw_mean=0.030, rw_bound=0.211, batch=42
157: loss=0.958, rw_mean=0.070, rw_bound=0.451, batch=43
158: loss=0.957, rw_mean=0.000, rw_bound=0.191, batch=43
159: loss=0.957, rw_mean=0.040, rw_bound=0.424, batch=43
160: loss=0.956, rw_mean=0.080, rw_bound=0.424, batch=43
161: loss=0.947, rw_mean=0.050, rw_bound=0.478, batch=44
162: loss=0.955, rw_mean=0.070, rw_bound=0.478, batch=45
163: loss=0.957, rw_mean=0.050, rw_bound=0.478, batch=42
164: loss=0.956, rw_mean=0.040,

293: loss=0.768, rw_mean=0.050, rw_bound=0.000, batch=40
294: loss=0.762, rw_mean=0.030, rw_bound=0.206, batch=43
295: loss=0.764, rw_mean=0.070, rw_bound=0.254, batch=42
296: loss=0.746, rw_mean=0.050, rw_bound=0.349, batch=42
297: loss=0.756, rw_mean=0.060, rw_bound=0.387, batch=39
298: loss=0.748, rw_mean=0.070, rw_bound=0.387, batch=41
299: loss=0.774, rw_mean=0.070, rw_bound=0.430, batch=39
300: loss=0.762, rw_mean=0.040, rw_bound=0.206, batch=41
301: loss=0.764, rw_mean=0.040, rw_bound=0.314, batch=42
302: loss=0.760, rw_mean=0.050, rw_bound=0.365, batch=43
303: loss=0.759, rw_mean=0.090, rw_bound=0.430, batch=41
304: loss=0.754, rw_mean=0.080, rw_bound=0.478, batch=40
305: loss=0.750, rw_mean=0.070, rw_bound=0.445, batch=42
306: loss=0.755, rw_mean=0.050, rw_bound=0.451, batch=43
307: loss=0.743, rw_mean=0.060, rw_bound=0.478, batch=45
308: loss=0.759, rw_mean=0.060, rw_bound=0.478, batch=41
309: loss=0.766, rw_mean=0.050, rw_bound=0.229, batch=42
310: loss=0.760, rw_mean=0.050,

439: loss=0.603, rw_mean=0.050, rw_bound=0.387, batch=41
440: loss=0.576, rw_mean=0.090, rw_bound=0.430, batch=26
441: loss=0.568, rw_mean=0.080, rw_bound=0.000, batch=34
442: loss=0.555, rw_mean=0.090, rw_bound=0.185, batch=39
443: loss=0.553, rw_mean=0.050, rw_bound=0.244, batch=42
444: loss=0.591, rw_mean=0.040, rw_bound=0.254, batch=42
445: loss=0.597, rw_mean=0.060, rw_bound=0.282, batch=40
446: loss=0.595, rw_mean=0.080, rw_bound=0.314, batch=41
447: loss=0.600, rw_mean=0.030, rw_bound=0.254, batch=42
448: loss=0.562, rw_mean=0.100, rw_bound=0.387, batch=38
449: loss=0.546, rw_mean=0.050, rw_bound=0.118, batch=42
450: loss=0.548, rw_mean=0.100, rw_bound=0.274, batch=43
451: loss=0.566, rw_mean=0.030, rw_bound=0.328, batch=43
452: loss=0.561, rw_mean=0.040, rw_bound=0.387, batch=42
453: loss=0.565, rw_mean=0.050, rw_bound=0.418, batch=43
454: loss=0.567, rw_mean=0.050, rw_bound=0.430, batch=39
455: loss=0.577, rw_mean=0.060, rw_bound=0.244, batch=42
456: loss=0.578, rw_mean=0.060,

584: loss=0.479, rw_mean=0.030, rw_bound=0.314, batch=41
585: loss=0.496, rw_mean=0.020, rw_bound=0.150, batch=42
586: loss=0.481, rw_mean=0.060, rw_bound=0.338, batch=43
587: loss=0.481, rw_mean=0.050, rw_bound=0.364, batch=43
588: loss=0.500, rw_mean=0.020, rw_bound=0.387, batch=37
589: loss=0.496, rw_mean=0.080, rw_bound=0.254, batch=40
590: loss=0.488, rw_mean=0.040, rw_bound=0.324, batch=42
591: loss=0.515, rw_mean=0.040, rw_bound=0.349, batch=42
592: loss=0.511, rw_mean=0.080, rw_bound=0.418, batch=43
593: loss=0.550, rw_mean=0.010, rw_bound=0.430, batch=37
594: loss=0.564, rw_mean=0.070, rw_bound=0.296, batch=41
595: loss=0.546, rw_mean=0.080, rw_bound=0.387, batch=40
596: loss=0.536, rw_mean=0.050, rw_bound=0.387, batch=41
597: loss=0.555, rw_mean=0.070, rw_bound=0.430, batch=41
598: loss=0.548, rw_mean=0.060, rw_bound=0.478, batch=32
599: loss=0.548, rw_mean=0.060, rw_bound=0.000, batch=38
600: loss=0.524, rw_mean=0.050, rw_bound=0.134, batch=42
601: loss=0.529, rw_mean=0.050,

730: loss=0.519, rw_mean=0.050, rw_bound=0.500, batch=43
731: loss=0.518, rw_mean=0.070, rw_bound=0.500, batch=43
733: loss=0.552, rw_mean=0.020, rw_bound=0.000, batch=2
734: loss=0.461, rw_mean=0.030, rw_bound=0.000, batch=5
735: loss=0.483, rw_mean=0.030, rw_bound=0.000, batch=8
736: loss=0.530, rw_mean=0.060, rw_bound=0.000, batch=14
737: loss=0.530, rw_mean=0.040, rw_bound=0.000, batch=18
738: loss=0.512, rw_mean=0.040, rw_bound=0.000, batch=22
739: loss=0.528, rw_mean=0.020, rw_bound=0.000, batch=24
740: loss=0.531, rw_mean=0.040, rw_bound=0.000, batch=28
741: loss=0.529, rw_mean=0.080, rw_bound=0.000, batch=36
742: loss=0.514, rw_mean=0.050, rw_bound=0.015, batch=41
743: loss=0.498, rw_mean=0.050, rw_bound=0.109, batch=41
744: loss=0.502, rw_mean=0.050, rw_bound=0.185, batch=41
745: loss=0.502, rw_mean=0.020, rw_bound=0.065, batch=42
746: loss=0.477, rw_mean=0.060, rw_bound=0.229, batch=40
747: loss=0.478, rw_mean=0.040, rw_bound=0.206, batch=42
748: loss=0.471, rw_mean=0.060, rw

877: loss=0.421, rw_mean=0.090, rw_bound=0.424, batch=43
878: loss=0.420, rw_mean=0.040, rw_bound=0.424, batch=43
879: loss=0.420, rw_mean=0.050, rw_bound=0.450, batch=43
880: loss=0.428, rw_mean=0.080, rw_bound=0.478, batch=44
881: loss=0.427, rw_mean=0.100, rw_bound=0.478, batch=44
882: loss=0.427, rw_mean=0.060, rw_bound=0.478, batch=44
883: loss=0.438, rw_mean=0.060, rw_bound=0.478, batch=45
884: loss=0.414, rw_mean=0.080, rw_bound=0.478, batch=17
885: loss=0.400, rw_mean=0.080, rw_bound=0.000, batch=25
886: loss=0.391, rw_mean=0.050, rw_bound=0.000, batch=30
887: loss=0.388, rw_mean=0.050, rw_bound=0.000, batch=35
888: loss=0.413, rw_mean=0.070, rw_bound=0.094, batch=41
889: loss=0.428, rw_mean=0.070, rw_bound=0.185, batch=42
890: loss=0.436, rw_mean=0.090, rw_bound=0.266, batch=43
891: loss=0.430, rw_mean=0.070, rw_bound=0.328, batch=43
892: loss=0.443, rw_mean=0.130, rw_bound=0.349, batch=41
893: loss=0.399, rw_mean=0.060, rw_bound=0.387, batch=41
894: loss=0.396, rw_mean=0.040,

1023: loss=0.299, rw_mean=0.060, rw_bound=0.478, batch=23
1024: loss=0.314, rw_mean=0.070, rw_bound=0.000, batch=30
1025: loss=0.341, rw_mean=0.080, rw_bound=0.000, batch=38
1026: loss=0.337, rw_mean=0.100, rw_bound=0.185, batch=41
1027: loss=0.315, rw_mean=0.090, rw_bound=0.254, batch=41
1028: loss=0.320, rw_mean=0.100, rw_bound=0.314, batch=41
1029: loss=0.310, rw_mean=0.050, rw_bound=0.349, batch=42
1030: loss=0.324, rw_mean=0.080, rw_bound=0.387, batch=39
1031: loss=0.318, rw_mean=0.130, rw_bound=0.413, batch=42
1032: loss=0.318, rw_mean=0.070, rw_bound=0.430, batch=32
1033: loss=0.358, rw_mean=0.090, rw_bound=0.115, batch=40
1034: loss=0.357, rw_mean=0.090, rw_bound=0.191, batch=42
1035: loss=0.338, rw_mean=0.070, rw_bound=0.240, batch=43
1036: loss=0.308, rw_mean=0.090, rw_bound=0.282, batch=42
1037: loss=0.277, rw_mean=0.170, rw_bound=0.387, batch=39
1038: loss=0.292, rw_mean=0.060, rw_bound=0.244, batch=42
1039: loss=0.292, rw_mean=0.060, rw_bound=0.296, batch=43
1040: loss=0.2

1167: loss=0.305, rw_mean=0.050, rw_bound=0.387, batch=42
1168: loss=0.301, rw_mean=0.090, rw_bound=0.418, batch=43
1169: loss=0.303, rw_mean=0.050, rw_bound=0.430, batch=41
1170: loss=0.306, rw_mean=0.050, rw_bound=0.282, batch=42
1171: loss=0.302, rw_mean=0.080, rw_bound=0.406, batch=43
1172: loss=0.299, rw_mean=0.060, rw_bound=0.430, batch=42
1173: loss=0.290, rw_mean=0.070, rw_bound=0.347, batch=43
1174: loss=0.296, rw_mean=0.110, rw_bound=0.478, batch=44
1175: loss=0.290, rw_mean=0.090, rw_bound=0.478, batch=45
1176: loss=0.289, rw_mean=0.070, rw_bound=0.478, batch=40
1177: loss=0.282, rw_mean=0.080, rw_bound=0.400, batch=42
1178: loss=0.314, rw_mean=0.070, rw_bound=0.418, batch=43
1179: loss=0.280, rw_mean=0.050, rw_bound=0.430, batch=42
1180: loss=0.275, rw_mean=0.100, rw_bound=0.515, batch=43
1181: loss=0.275, rw_mean=0.130, rw_bound=0.500, batch=43
1183: loss=0.325, rw_mean=0.060, rw_bound=0.000, batch=6
1184: loss=0.371, rw_mean=0.020, rw_bound=0.000, batch=8
1185: loss=0.326

1312: loss=0.239, rw_mean=0.100, rw_bound=0.254, batch=40
1313: loss=0.230, rw_mean=0.070, rw_bound=0.263, batch=42
1314: loss=0.204, rw_mean=0.090, rw_bound=0.282, batch=42
1315: loss=0.183, rw_mean=0.100, rw_bound=0.314, batch=39
1316: loss=0.172, rw_mean=0.100, rw_bound=0.349, batch=36
1317: loss=0.191, rw_mean=0.090, rw_bound=0.142, batch=41
1318: loss=0.172, rw_mean=0.110, rw_bound=0.314, batch=42
1319: loss=0.181, rw_mean=0.030, rw_bound=0.349, batch=42
1320: loss=0.179, rw_mean=0.100, rw_bound=0.376, batch=43
1321: loss=0.175, rw_mean=0.090, rw_bound=0.387, batch=42
1322: loss=0.147, rw_mean=0.040, rw_bound=0.430, batch=35
1323: loss=0.177, rw_mean=0.100, rw_bound=0.272, batch=41
1324: loss=0.188, rw_mean=0.060, rw_bound=0.349, batch=42
1325: loss=0.162, rw_mean=0.080, rw_bound=0.387, batch=42
1326: loss=0.153, rw_mean=0.090, rw_bound=0.430, batch=41
1327: loss=0.153, rw_mean=0.110, rw_bound=0.387, batch=41
1328: loss=0.149, rw_mean=0.080, rw_bound=0.430, batch=42
1329: loss=0.1

1455: loss=0.175, rw_mean=0.130, rw_bound=0.349, batch=41
1456: loss=0.132, rw_mean=0.080, rw_bound=0.387, batch=34
1457: loss=0.139, rw_mean=0.100, rw_bound=0.208, batch=40
1458: loss=0.127, rw_mean=0.060, rw_bound=0.282, batch=41
1459: loss=0.137, rw_mean=0.110, rw_bound=0.387, batch=41
1460: loss=0.123, rw_mean=0.100, rw_bound=0.430, batch=36
1461: loss=0.134, rw_mean=0.060, rw_bound=0.188, batch=41
1462: loss=0.136, rw_mean=0.040, rw_bound=0.314, batch=40
1463: loss=0.120, rw_mean=0.120, rw_bound=0.349, batch=40
1464: loss=0.130, rw_mean=0.090, rw_bound=0.400, batch=42
1465: loss=0.135, rw_mean=0.010, rw_bound=0.271, batch=43
1466: loss=0.128, rw_mean=0.080, rw_bound=0.387, batch=42
1467: loss=0.126, rw_mean=0.070, rw_bound=0.406, batch=43
1468: loss=0.125, rw_mean=0.040, rw_bound=0.360, batch=43
1469: loss=0.127, rw_mean=0.090, rw_bound=0.430, batch=40
1470: loss=0.127, rw_mean=0.090, rw_bound=0.415, batch=42
1471: loss=0.128, rw_mean=0.090, rw_bound=0.451, batch=43
1472: loss=0.1

KeyboardInterrupt: 

The final point to note here is the effect of slipperiness in the FrozenLake
environment. Each action is executed as intended with a probability of only about
0.33; otherwise, it is replaced by one of the two actions rotated by 90°. The "up"
action, for instance, succeeds with probability 0.33, is replaced by the "left" action
with probability 0.33, and by the "right" action with probability 0.33.
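
The slip rule can be written down directly. The sketch below assumes Gym's action numbering for FrozenLake (0 = left, 1 = down, 2 = right, 3 = up), under which the two perpendicular actions are the neighbours of the chosen action modulo 4. The helper names `slip_outcomes` and `sample_executed_action` are ours and are not part of Gym.

```python
import random
from collections import Counter

# Assumed action numbering (believed to match Gym's FrozenLake).
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}

def slip_outcomes(action):
    """Return the three equally likely actions actually executed."""
    return [(action - 1) % 4, action, (action + 1) % 4]

def sample_executed_action(action, rng):
    """Draw the executed action for one step on the slippery lake."""
    return rng.choice(slip_outcomes(action))

rng = random.Random(0)
n_samples = 30000
counts = Counter(sample_executed_action(3, rng) for _ in range(n_samples))
for a, n in sorted(counts.items()):
    print(ACTION_NAMES[a], round(n / n_samples, 2))
```

Requesting "up" never produces "down" in this model; the agent moves either where it intended or sideways, each about a third of the time.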

The nonslippery version is in Chapter04/04_frozenlake_nonslippery.py, and the
only difference is in the environment creation (we need to peek into the core of Gym
to create the instance of the environment with tweaked arguments):

In [12]:
import random
random.seed(12345)

env = gym.envs.toy_text.frozen_lake.FrozenLakeEnv(
    is_slippery=False)
env.spec = gym.spec("FrozenLake-v0")
env = gym.wrappers.TimeLimit(env, max_episode_steps=100)
env = DiscreteOneHotWrapper(env)

# env = gym.wrappers.Monitor(env, directory="mon", force=True)
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = Net(obs_size, HIDDEN_SIZE, n_actions)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.001)  # learning rate lowered tenfold
writer = SummaryWriter(comment="-frozenlake-nonslippery")

full_batch = []
for iter_no, batch in enumerate(iterate_batches(
        env, net, BATCH_SIZE)):
    reward_mean = float(np.mean(list(map(
        lambda s: s.reward, batch))))
    full_batch, obs, acts, reward_bound = \
        filter_batch(full_batch + batch, PERCENTILE)
    if not full_batch:
        continue
    obs_v = torch.FloatTensor(obs)
    acts_v = torch.LongTensor(acts)
    full_batch = full_batch[-500:]

    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = objective(action_scores_v, acts_v)
    loss_v.backward()
    optimizer.step()
    print("%d: loss=%.3f, rw_mean=%.3f, "
          "rw_bound=%.3f, batch=%d" % (
        iter_no, loss_v.item(), reward_mean,
        reward_bound, len(full_batch)))
    writer.add_scalar("loss", loss_v.item(), iter_no)
    writer.add_scalar("reward_mean", reward_mean, iter_no)
    writer.add_scalar("reward_bound", reward_bound, iter_no)
    if reward_mean > 0.8:
        print("Solved!")
        break
writer.close()

0: loss=1.347, rw_mean=0.030, rw_bound=0.000, batch=3
1: loss=1.341, rw_mean=0.030, rw_bound=0.000, batch=6
2: loss=1.330, rw_mean=0.050, rw_bound=0.000, batch=11
3: loss=1.321, rw_mean=0.010, rw_bound=0.000, batch=12
4: loss=1.315, rw_mean=0.010, rw_bound=0.000, batch=13
5: loss=1.308, rw_mean=0.030, rw_bound=0.000, batch=16
6: loss=1.304, rw_mean=0.030, rw_bound=0.000, batch=19
7: loss=1.301, rw_mean=0.040, rw_bound=0.000, batch=23
8: loss=1.300, rw_mean=0.030, rw_bound=0.000, batch=26
9: loss=1.298, rw_mean=0.030, rw_bound=0.000, batch=29
10: loss=1.290, rw_mean=0.040, rw_bound=0.000, batch=33
11: loss=1.282, rw_mean=0.050, rw_bound=0.000, batch=38
12: loss=1.276, rw_mean=0.020, rw_bound=0.000, batch=40
13: loss=1.261, rw_mean=0.060, rw_bound=0.155, batch=42
14: loss=1.257, rw_mean=0.010, rw_bound=0.117, batch=43
15: loss=1.249, rw_mean=0.030, rw_bound=0.182, batch=43
16: loss=1.227, rw_mean=0.070, rw_bound=0.229, batch=41
17: loss=1.200, rw_mean=0.090, rw_bound=0.314, batch=36
18: 

KeyboardInterrupt: 

The effect is dramatic! The nonslippery version of the environment can be solved in
120-140 batch iterations, which is roughly 100 times faster than the slippery environment.