# Chapter 4 - The Cross Entropy Method


## Taxonomy of RL methods 

The cross-entropy method falls into the model-free and policy-based category of methods. These notions are new, so let's spend some time exploring them. All methods in RL can be classified along several axes:

* Model-free or model-based 
* Value-based or policy-based 
* On-policy or off-policy

There are other ways to taxonomize RL methods, but for now we're interested in the preceding three.
Let's define them, since the specifics of your problem can influence which method you choose.

The term **model-free** means that the method **doesn't build a model** of the environment or reward; it just directly connects observations to actions (or values that are related to actions). In other words, the agent takes current observations and does some computations on them, and the result is the action that it should take. In contrast, **model-based** methods try to **predict what the next observation and/or reward will be**. Based on this prediction, the agent tries to choose the best possible action to take, very often making such predictions multiple times to **look more and more steps into the future**.

Both classes of methods have strengths and weaknesses, but usually pure **model-based methods are used in deterministic environments**, such as board games with strict rules. On the other hand, model-free methods are usually easier to train, as it's hard to build good models of complex environments with rich observations.

All of the methods described in this book are from the **model-free** category, as those methods have been the most active area of research for the past few years. Only recently have researchers started to mix the benefits from both worlds (for example, refer to DeepMind's papers on imagination in agents). This approach will be described in Chapter 17, Beyond Model-Free – Imagination.

From another angle, **policy-based methods** directly approximate the policy of the agent, that is, what **actions the agent should carry out at every step**. The policy is usually represented by a **probability distribution over the available actions**.

In contrast, the method could be **value-based**. In this case, instead of the probability of actions, the agent calculates the **value of every possible action** and chooses the action with the best value. Both of those families of methods are equally popular and we'll discuss value-based methods in the next part of the book. Policy methods will be the topic of part three.
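The difference between the two families shows up in how an action is picked. The following is a minimal sketch using only the standard library; the function names, probabilities, and value estimates are hypothetical and stand in for what a trained network would output.

```python
import random

def act_policy_based(action_probs, rng):
    """Policy-based: sample an action index from a probability distribution."""
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]

def act_value_based(action_values):
    """Value-based: pick the action whose estimated value is the highest."""
    return max(range(len(action_values)), key=lambda a: action_values[a])

rng = random.Random(0)
probs = [0.1, 0.7, 0.2]    # hypothetical policy output for one observation
values = [1.5, 0.3, 2.0]   # hypothetical value estimates for the same actions

# Sampling from the policy mostly returns action 1, but not always
sampled = [act_policy_based(probs, rng) for _ in range(1000)]
# The value-based choice is deterministic: the action with the largest value
greedy = act_value_based(values)
```

Note that the policy-based agent explores on its own through sampling, while a value-based agent needs a separate exploration mechanism on top of the greedy choice.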

The third important classification of methods is on-policy versus off-policy. We'll discuss this distinction more in parts two and three of the book, but for now, it will be enough to explain **off-policy** as the ability of the method to **learn on old historical data** (obtained by a previous version of the agent, recorded by human demonstration, or just seen by the same agent several episodes ago). An on-policy method, by contrast, needs data gathered with the policy that is currently being trained.

So, our **cross-entropy method is model-free, policy-based, and on-policy**, which means the following:

* It **doesn't build any model** of the environment; it just says to the agent what to do at every step
* It **approximates the policy of the agent**
* It requires **fresh data** obtained from the environment

## Practical cross-entropy 

The cross-entropy method description is split into two unequal parts: practical and theoretical.
The practical part is intuitive in nature, while the theoretical explanation of why cross-entropy works and what's happening is more sophisticated.
You may remember that the central, trickiest thing in RL is the agent, which is trying to accumulate as much total reward as possible by communicating with the environment. In practice, we follow a common ML approach and replace all of the complications of the agent with some kind of nonlinear trainable function, which maps the agent's input (observations from the environment) to some output. The details of the output that this function produces may depend on a particular method or a family of methods, as described in the previous section (such as value-based versus policy-based methods). As our cross-entropy method is policy-based, our **nonlinear function (neural network) produces policy**, which basically says for every observation which action the agent should take.
![](img/fig4-1.png)


In practice, the **policy** is usually represented as a **probability distribution over actions**, which makes it very similar to a classification problem, with the number of classes being equal to the number of actions we can carry out. This abstraction makes our agent very simple: it needs to pass an observation from the environment to the network, get a probability distribution over actions, and **perform random sampling using that distribution to get an action to carry out**. This random sampling adds randomness to our agent, which is a good thing, as at the beginning of the training, when our weights are random, the agent behaves randomly. After the agent gets an action to issue, it fires the action to the environment and obtains the next observation and the reward for the last action. Then the loop continues.

During the agent's lifetime, its experience is presented as **episodes**. Every episode is a **sequence of observations** that the agent has received from the environment, actions it has issued, and rewards for these actions. Imagine that our agent has played several such episodes. For every episode, we can calculate the total reward that the agent has claimed. It can be **discounted or not discounted**, but for simplicity, let's assume a discount factor of gamma = 1, which means just a sum of all local rewards for every episode. This total reward shows how good this episode was for the agent. Let's illustrate this with a diagram, which contains four episodes (note that different episodes have different values for $o_i$, $a_i$, and $r_i$):

![](img/fig4-2.png)

Every cell represents the agent's step in the episode. Due to randomness in the environment and the way that the agent selects actions to take, some episodes will be better than others. **The core of the cross-entropy method is to throw away bad episodes and train on better ones**. So, the steps of the method are as follows:

1. Play **N number of episodes** using our current model and environment. 
2. Calculate the total **reward for every episode** and decide on a reward boundary. Usually, we use some percentile of all rewards, such as 50th or 70th.
3. Throw away all episodes with a reward below the boundary.
4. **Train on the remaining "elite" episodes** using observations as the input and issued actions as the desired output.
5. Repeat from step 1 until we become satisfied with the result.
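Steps 2 to 4 above can be sketched on toy data without any RL library. This is only an illustration under stated assumptions: the `percentile` helper is a hand-written stand-in that uses linear interpolation (the same default behavior as NumPy's percentile function), and the episodes, observation labels, and rewards are made up.

```python
def percentile(values, q):
    """Linear-interpolation percentile for q in 0..100 (hypothetical helper)."""
    ordered = sorted(values)
    rank = (q / 100) * (len(ordered) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def select_elite(episodes, q):
    """Keep episodes whose total reward is at or above the q-th percentile."""
    bound = percentile([reward for reward, _ in episodes], q)
    return [ep for ep in episodes if ep[0] >= bound], bound

# Made-up batch: (total_reward, [(observation, action), ...]) pairs
episodes = [
    (10.0, [("o1", 0), ("o2", 1)]),
    (25.0, [("o1", 1), ("o3", 1)]),
    (40.0, [("o1", 1), ("o4", 0)]),
    (15.0, [("o1", 0), ("o5", 0)]),
]

elite, bound = select_elite(episodes, 70)
# Flatten elite episodes into (observation, action) training pairs:
# observations are the inputs, actions are the desired outputs
training_pairs = [pair for _, steps in elite for pair in steps]
```

With these four episodes only the one with reward 40 survives the 70th-percentile cut, so the network would be trained to reproduce its two actions.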

So, that's all about the cross-entropy method description. With the preceding procedure, **our neural network learns how to repeat actions that lead to a larger reward**, constantly moving the boundary higher and higher. Despite the simplicity of this method, it works well in simple environments, it's easy to implement, and it's quite robust to changes in hyperparameters, which makes it an ideal baseline method to try. Let's now apply it to our CartPole environment.

## Cross-entropy on CartPole
The whole code for this example is in ``Chapter04/01_cartpole.py``, but the following are the most important parts. Our model's core is a one-hidden-layer neural network, with ReLU and 128 hidden neurons (which is absolutely arbitrary). Other hyperparameters are also set almost randomly and aren't tuned, as the method is robust and converges very quickly.

### The network
* Input - a single observation from the environment as an input vector
* Output - a number for every action we can perform

The output from the network is a probability distribution over actions, so a straightforward way to proceed would be to include softmax nonlinearity after the last layer. However, in the network below we don't apply softmax, in order to increase the numerical stability of the training process. Rather than calculating softmax (which uses exponentiation) and then calculating cross-entropy loss (which uses the logarithm of probabilities), we'll use the PyTorch class ``nn.CrossEntropyLoss``, which **combines both softmax and cross-entropy** in a single, more numerically stable expression. CrossEntropyLoss requires raw, unnormalized values from the network (also called logits), and the downside of this is that we need to remember to apply softmax every time we need to get probabilities from our network's output.
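To see why fusing the two steps helps, here is a pure-Python sketch of the idea. It is not PyTorch's implementation; the function names and the example logits are assumptions chosen for illustration. The naive version exponentiates the raw logits, while the fused version uses the log-sum-exp trick with the largest logit subtracted first.

```python
import math

def naive_cross_entropy(logits, target):
    """Softmax first, then -log(p[target]); overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))

def stable_cross_entropy(logits, target):
    """Same quantity via log-sum-exp with the maximum logit subtracted."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

small = [2.0, 1.0, 0.1]
large = [1000.0, 999.0, 998.1]   # same differences, shifted by a big constant

loss_naive = naive_cross_entropy(small, 0)
loss_stable = stable_cross_entropy(small, 0)
# Softmax only depends on differences between logits, so the shifted
# logits give (almost exactly) the same loss in the stable version
loss_large = stable_cross_entropy(large, 0)
```

Both functions agree on moderate logits, but calling `naive_cross_entropy(large, 0)` raises `OverflowError` because `math.exp(1000.0)` does not fit in a float.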

In [1]:
#!/usr/bin/env python3
import gym  # environment simulation (CartPole)
from collections import namedtuple  # lightweight record types for episodes and steps
import numpy as np
from tensorboardX import SummaryWriter  # logging to TensorBoard

import torch
import torch.nn as nn
import torch.optim as optim



#Creating a Net class that inherits nn.Module class
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)

HIDDEN_SIZE = 128  # hidden layer size
BATCH_SIZE = 16  # episodes per batch
PERCENTILE = 70  # reward boundary: 70th percentile, so roughly the top 30% of episodes are kept

# Helper classes, implemented as named tuples.
# Episode is a single episode stored as its total undiscounted reward and a collection of EpisodeStep.
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
# EpisodeStep represents one single step that our agent made in the episode: the observation and the action taken.
# We'll use episode steps from elite episodes as training data.
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])






### The batch loop
One very important fact to understand in this function's logic is that the training of our network and the generation of our episodes are performed at the same time. They are not completely in parallel, but every time our loop accumulates enough episodes (16), it passes control to the caller of this function, which is supposed to train the network using gradient descent. So, when execution resumes after yield, the network will have different, slightly better (we hope) behavior.
We don't need to worry about synchronization, as our training and data gathering activities are performed in the same thread of execution, but you need to understand those constant jumps from network training to its utilization.
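The jump between data gathering and training can be seen in a stripped-down generator. This is a toy sketch: the `policy` dictionary and its `version` counter are hypothetical stand-ins for the network and a training step, and "playing an episode" is reduced to recording which version was used.

```python
def iterate_toy_batches(policy, batch_size):
    """Yield batches of 'episodes' produced with the policy's current state."""
    batch = []
    while True:
        batch.append(policy["version"])   # stand-in for playing one episode
        if len(batch) == batch_size:
            yield batch                   # the caller runs here, then we resume
            batch = []

policy = {"version": 0}                   # hypothetical stand-in for the network
seen = []
for iter_no, batch in enumerate(iterate_toy_batches(policy, batch_size=3)):
    seen.append(batch)
    policy["version"] += 1                # stand-in for one training step
    if iter_no == 2:
        break
```

Each batch in `seen` was produced entirely by one version of the policy, and the next batch by the updated one, which is the interleaving described above.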

In [2]:
## Function to generate batches of episodes
# Our function is a generator, so every time the yield operator is executed,
# control is transferred to the outer iteration loop and then continues after the yield line.
def iterate_batches(env, net, batch_size):
    batch = []  # episodes collected for the current batch
    episode_reward = 0.0  # running total reward of the current episode
    episode_steps = []  # steps of the current episode

    obs = env.reset()  # reset the environment; for CartPole the observation is a vector of 4 numbers
    sm = nn.Softmax(dim=1)  # converts the network's raw scores (logits) into probabilities

    # Infinite loop
    while True:
        obs_v = torch.FloatTensor([obs])  # wrap the observation in a batch of one: a 1x4 tensor
        act_probs_v = sm(net(obs_v))  # run the network on the observation and apply softmax
        act_probs = act_probs_v.data.numpy()[0]  # unpack the tensor into a NumPy array of action probabilities
        action = np.random.choice(len(act_probs), p=act_probs)  # sample an action index according to those probabilities
        next_obs, reward, is_done, _ = env.step(action)  # take the step in the environment with the selected action
        episode_reward += reward  # add the returned reward to the episode total
        episode_steps.append(EpisodeStep(observation=obs, action=action))  # store the observation we acted on (not next_obs) and the action taken

        # The episode ends when the pole has fallen down despite our efforts
        if is_done:  # the environment reported the end of the episode
            batch.append(Episode(reward=episode_reward, steps=episode_steps))  # append the entire episode
            episode_reward = 0.0  # reset the episode reward
            episode_steps = []  # start a new list of episode steps
            next_obs = env.reset()  # reset the environment
            if len(batch) == batch_size:
                yield batch  # hand the current batch to the caller
                batch = []  # start a new batch
        obs = next_obs  # the next observation becomes the current one




### Filter batch function
This function is at the **core of the cross-entropy method**: from the given batch of episodes and percentile value, it calculates a boundary reward, which is used to filter elite episodes to train on. To obtain the boundary reward, we're using NumPy's percentile function, which from the list of values and the desired percentile, calculates the percentile's value. Then we will calculate mean reward, which is used only for monitoring.

In [3]:
# Filters the batch, keeping only the elite episodes at or above the given percentile
def filter_batch(batch, percentile):
    rewards = list(map(lambda s: s.reward, batch))  # total reward of each episode in the batch
    #print('rewards:', rewards)
    reward_bound = np.percentile(rewards, percentile)  # reward boundary: the given percentile of the batch's rewards
    reward_mean = float(np.mean(rewards))  # mean reward, used only for monitoring

    train_obs = []
    train_act = []

    # Filter the episodes. For every episode in the batch,
    # we check whether its total reward reaches our boundary and, if it does,
    # we add its observations and actions to the lists that we will train on.
    # With percentile = 70, this rejects roughly 70% of the episodes.
    for example in batch:
        if example.reward < reward_bound:  # episode below the boundary: skip it
            #print('rejected reward=',example.reward)
            continue
        #else:
            #print('accepted reward=',example.reward)

        # Accumulate steps only from the qualified episodes
        train_obs.extend(map(lambda step: step.observation, example.steps))  # observations of the qualified episode
        train_act.extend(map(lambda step: step.action, example.steps))  # actions of the qualified episode

    train_obs_v = torch.FloatTensor(train_obs)  # tensor of the observations
    train_act_v = torch.LongTensor(train_act)  # tensor of the actions
    return train_obs_v, train_act_v, reward_bound, reward_mean




### Main code

In the beginning, we will create all the required objects: the environment, our neural network, the objective function, the optimizer, and the summary writer for TensorBoard. The commented line creates a monitor to write videos of your agent's performance.

In the training loop, we will iterate our batches (which are a list of Episode objects), then we perform filtering of the elite episodes using the filter_batch function. The result is variables of observations and taken actions, the reward boundary used for filtering and the mean reward. After that, we zero gradients of our network and pass observations to the network, obtaining its action scores. These scores are passed to the objective function, which calculates cross-entropy between the network output and the actions that the agent took. The idea of this is to reinforce our network to carry out those "elite" actions which have led to good rewards. Then, we will calculate gradients on the loss and ask the optimizer to adjust our network.

In [9]:
HIDDEN_SIZE = 128  # hidden layer size
BATCH_SIZE = 16  # episodes per batch
PERCENTILE = 70  # reward boundary: 70th percentile, so roughly the top 30% of episodes are kept

if __name__ == "__main__":
    env = gym.make("CartPole-v0")  # create the CartPole environment from the gym package
    # env = gym.wrappers.Monitor(env, directory="mon", force=True)  # uncomment to record videos of the agent
    obs_size = env.observation_space.shape[0]  # observation size: 4 in the case of CartPole
    n_actions = env.action_space.n  # 2 actions (left, right) in the case of CartPole

    net = Net(obs_size, HIDDEN_SIZE, n_actions)  # network with input size = obs_size and output size = n_actions
            # Net(
            #   (net): Sequential(
            #     (0): Linear(in_features=4, out_features=128, bias=True)
            #     (1): ReLU()
            #     (2): Linear(in_features=128, out_features=2, bias=True)
            #   )
            # )
    objective = nn.CrossEntropyLoss()  # objective function: cross-entropy computed from raw logits
    optimizer = optim.Adam(params=net.parameters(), lr=0.01)  # Adam optimizer
    writer = SummaryWriter(comment="-cartpole")  # TensorBoard log writer

    # Main training loop
    # ==================
    # It iterates over the generator returned by iterate_batches();
    # each batch contains BATCH_SIZE episodes.
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):  # enumerate adds the iteration number to each batch (a list of Episode named tuples)
        obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)  # keep only the elite (roughly top 30%) episodes
        # obs_v - n x 4 tensor of observations; n is the total number of steps in the elite episodes
        # acts_v - tensor of n actions (0 or 1)
        # reward_b - reward threshold: the 70th percentile of the batch's rewards
        # reward_m - mean reward of the batch
        
        optimizer.zero_grad()  # reset the gradients accumulated in the network parameters
        action_scores_v = net(obs_v)  # run the network on all elite observations to get action scores (logits)
                                      # action_scores_v is a tensor of shape n x 2 (2 actions)
        loss_v = objective(action_scores_v, acts_v)  # cross-entropy between the network output and the actions actually taken
        loss_v.backward()  # backpropagate once; a second backward() call on the same graph raises a RuntimeError
        optimizer.step()  # update the network weights
        print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
            iter_no, loss_v.item(), reward_m, reward_b))  # print the results
        writer.add_scalar("loss", loss_v.item(), iter_no)  # log the loss as a scalar to TensorBoard
        writer.add_scalar("reward_bound", reward_b, iter_no)  # log the reward boundary as a scalar to TensorBoard
        writer.add_scalar("reward_mean", reward_m, iter_no)  # log the mean reward as a scalar to TensorBoard
        if reward_m > 50:  # stop early once the mean reward exceeds 50; raise this to 199 to train until the mean reward reaches 199
            print("Solved!")
            break
    writer.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
0: loss=0.680, reward_mean=22.6, reward_bound=21.0
1: loss=0.662, reward_mean=26.9, reward_bound=30.5
2: loss=0.660, reward_mean=34.2, reward_bound=36.5
3: loss=0.639, reward_mean=34.6, reward_bound=35.5
4: loss=0.651, reward_mean=43.2, reward_bound=47.0
5: loss=0.627, reward_mean=46.8, reward_bound=54.0
6: loss=0.632, reward_mean=45.2, reward_bound=53.5
7: loss=0.627, reward_mean=47.1, reward_bound=56.0
8: loss=0.612, reward_mean=55.3, reward_bound=77.5
Solved!



To view the TensorBoard charts:
 ``$ cd C:\Users\ilanr_000\OneDrive\Python\DeepRL\DeepRL``
 ``$ tensorboard --logdir .\runs --host localhost``
 You should see something like:
![](img/Tensorboard-fig-1.png)

## Cross-entropy on FrozenLake 
The next environment we'll try to solve using the cross-entropy method is FrozenLake. Its world is from the so-called "grid world" category, where your agent lives in a grid of size 4 × 4 and can move in four directions: up, down, left, and right. The agent always starts at the top-left position, and its goal is to reach the bottom-right cell of the grid. There are holes in fixed cells of the grid, and if you get into those holes, the episode ends and your reward is zero. If the agent reaches the destination cell, then it obtains the reward 1.0 and the episode ends.
To make life more complicated, the world is slippery (it's a frozen lake after all), so the agent's actions do not always turn out as expected: the agent can slip sideways relative to the direction it chose. If you want the agent to move left, for example, there is a 33% probability that it will indeed move left, a 33% chance that it will end up in the cell above, and a 33% chance that it will end up in the cell below. As we'll see at the end of the section, this makes progress difficult.

![](img/Fig5-1.png)

### Gym Environment for FrozenLake

Our observation space is discrete, which means that it's just a number from zero to 15 inclusive. This number is our current position in the grid. The action space is also discrete, with values from zero to three. Our network from the CartPole example expects a vector of numbers. To get this, we can apply the traditional "one-hot encoding" of discrete inputs, which means that the input to our network will be 16 float numbers that are zero everywhere except at the index we encode, which is set to 1.0. To minimize changes in our code, we can use the ObservationWrapper class from Gym and implement our DiscreteOneHotWrapper class.
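The one-hot encoding itself is tiny. Below is a plain-Python sketch of it; the `one_hot` function is a hypothetical helper for illustration and is separate from the Gym wrapper class mentioned above.

```python
def one_hot(index, size):
    """Return a list of `size` floats, all 0.0 except a 1.0 at `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

start = one_hot(0, 16)    # agent in the top-left cell
goal = one_hot(15, 16)    # agent in the bottom-right cell
```

Every position on the 4 × 4 grid maps to a distinct 16-element vector, so the network receives an input of fixed length regardless of where the agent is.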
With that wrapper applied to the environment, both the observation space and action space are 100% compatible with our CartPole solution (source code Chapter04/02_frozenlake_naive.py). So we can use the same code as above.
However, by launching it, we can see that this doesn't improve the score over time.
To understand what's going on, we need to look deeper at the reward structure of both environments. In CartPole, every step of the environment gives us the reward 1.0, until the moment that the pole falls. So, the longer our agent balanced the pole, the more reward it obtained. Due to randomness in our agent's behavior, different episodes were of different lengths, which gave us a pretty normal distribution of the episodes' rewards. After choosing a reward boundary, we rejected less successful episodes and learned how to repeat better ones (by training on successful episodes' data).
This is shown in the following diagram:
![](img/Fig7-1.png)
In the FrozenLake environment, episodes and their rewards look different. We get the reward of 1.0 only when we reach the goal, and this reward says nothing about how good each episode was. Was it quick and efficient, or did we make four rounds on the lake before we randomly stepped into the final cell? We don't know; it's just 1.0 reward and that's it. The distribution of rewards for our episodes is also problematic. There are only two kinds of episodes possible, with zero reward (failed) and one reward (successful), and failed episodes will obviously dominate in the beginning of the training. So, our percentile selection of "elite" episodes is totally wrong and gives us bad examples to train on. This is the reason for our training failure.
![](img/Fig8-1.png)
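The failure of percentile selection on 0/1 rewards can be reproduced with a few lines. This is a sketch under assumptions: the `percentile` helper is a hand-written stand-in with linear interpolation (NumPy's default behavior), and the batch of 98 failures and 2 successes is made up.

```python
def percentile(values, q):
    """Linear-interpolation percentile for q in 0..100 (hypothetical helper)."""
    ordered = sorted(values)
    rank = (q / 100) * (len(ordered) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

# Made-up batch: 2 successful episodes out of 100
rewards = [0.0] * 98 + [1.0] * 2

bound = percentile(rewards, 70)
# Same test as the CartPole filter: episodes with reward < bound are rejected
kept = [r for r in rewards if r >= bound]
```

Because the 70th percentile of this batch is 0.0, no episode falls below the boundary, and all 98 failed episodes end up in the "elite" set alongside the two successes.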
This example shows us the limitations of the cross-entropy method: 
* For training, our **episodes** have to be **finite** and, preferably, **short**
* The total reward for the **episodes** should have enough **variability** to **separate good episodes from bad ones**
* There is no **intermediate indication** about whether the agent has succeeded or failed

Later in the book, we'll become familiar with other methods, which address these limitations. For now, if you're curious about how FrozenLake can be solved using cross-entropy, here is a list of tweaks of the code that you need to make (the full example is in Chapter04/03_frozenlake_tweaked.py):

* **Larger batches of played episodes:** In CartPole, it was enough to have 16 episodes on every iteration, but FrozenLake requires at least 100 just to get some successful episodes.
* **Discount factor applied to reward:** To make the total reward for the episode depend on episode length, and add variety in episodes, we can use a discounted total reward with the discount factor 0.9 or 0.95. In this case, the reward for shorter episodes will be higher than the reward for longer ones.
* **Keeping "elite" episodes for a longer time:** In the CartPole training, we sampled episodes from the environment, trained on the best ones, and threw them away. In FrozenLake, a successful episode is a much rarer animal, so we need to keep them for several iterations to train on them.
* **Decrease learning rate:** This will give our network time to average more training samples.
* **Much longer training time:** Due to the sparsity of successful episodes, and the random outcome of our actions, it's much harder for our network to get an idea of the best behavior to perform in any particular situation. To reach 50% successful episodes, about 5k training iterations are required.
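The effect of the discount factor on otherwise identical 0/1 rewards can be sketched as follows. The function name and the episode lengths are hypothetical; the expression itself has the same form as the one used in the tweaked filter_batch, where the episode's reward is multiplied by the discount factor raised to the number of steps.

```python
GAMMA = 0.9  # discount factor, one of the two values suggested above

def discounted_episode_reward(final_reward, n_steps, gamma=GAMMA):
    """Scale the episode's reward by gamma ** n_steps as a length penalty."""
    return final_reward * gamma ** n_steps

short_success = discounted_episode_reward(1.0, 6)    # reached the goal in 6 steps
long_success = discounted_episode_reward(1.0, 40)    # wandered for 40 steps first
failure = discounted_episode_reward(0.0, 10)         # fell into a hole
```

Successful episodes now get different scores depending on their length, which gives the percentile filter something to rank, while failed episodes stay at zero.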

To incorporate all these into our code, we need to change the filter_batch function to calculate discounted reward and return "elite" episodes for us to keep:


In [1]:
#!/usr/bin/env python3
import random
import gym, gym.spaces
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.optim as optim



e = gym.make("FrozenLake-v0")  # create a new FrozenLake environment
e.observation_space  # Discrete(16) - one of the 16 possible positions of the agent on the 4x4 grid
e.action_space  # Discrete(4) - 4 possible directions (0 to 3) in which the agent can move

e.reset()  # reset the environment - returns 0, the initial position in the top-left corner of the grid
e.render()  # renders the grid (S - start, F - free cell, H - hole, G - goal)
# SFFF  
# FHFH
# FFFH
# HFFG
print(e.observation_space.n)

# A class inheriting from Gym's ObservationWrapper.
# It converts the discrete observation into a vector of 16 floats: zero everywhere except at the agent's current location, which is 1.0.
class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res


#Creating a Net class that inherits nn.Module class
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)

    
#==================================================================================================================
## Function to generate batches of episodes
# Our function is a generator, so every time the yield operator is executed,
# control is transferred to the outer iteration loop and then continues after the yield line.
def iterate_batches(env, net, batch_size):
    batch = []  # episodes collected for the current batch
    episode_reward = 0.0  # running total reward of the current episode
    episode_steps = []  # steps of the current episode

    obs = env.reset()  # reset the environment; with the one-hot wrapper the observation is a vector of 16 floats
    sm = nn.Softmax(dim=1)  # converts the network's raw scores (logits) into probabilities

    # Infinite loop
    while True:
        obs_v = torch.FloatTensor([obs])  # wrap the observation in a batch of one: a 1x16 tensor encoding the agent's location
        act_probs_v = sm(net(obs_v))  # run the network on the observation and apply softmax
        act_probs = act_probs_v.data.numpy()[0]  # unpack the tensor into a NumPy array of action probabilities
        #print('act_probs=', act_probs)
        action = np.random.choice(len(act_probs), p=act_probs)  # sample an action index according to those probabilities
        #print('action=',action)
        next_obs, reward, is_done, _ = env.step(action)  # take the step in the environment with the selected action
        episode_reward += reward  # add the returned reward to the episode total
        episode_steps.append(EpisodeStep(observation=obs, action=action))  # store the observation we acted on (not next_obs) and the action taken

        # The episode ends when the agent falls into a hole or reaches the goal
        if is_done:  # the environment reported the end of the episode
            batch.append(Episode(reward=episode_reward, steps=episode_steps))  # append the entire episode
            episode_reward = 0.0  # reset the episode reward
            episode_steps = []  # start a new list of episode steps
            next_obs = env.reset()  # reset the environment
            if len(batch) == batch_size:  # the batch holds the requested number of episodes
                yield batch  # hand the current batch to the caller
                batch = []  # start a new batch
        obs = next_obs  # the next observation becomes the current one

#==================================================================================================================

# Filters the batch, keeping only the elite episodes above the given percentile
def filter_batch(batch, percentile):
    disc_rewards = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), batch))  # discount each episode's reward by GAMMA ** (episode length)
    #print('Disc rewards:', disc_rewards)
    reward_bound = np.percentile(disc_rewards, percentile)  # reward boundary: the given percentile of the discounted rewards
    #print('reward bound=',  reward_bound)
    #reward_mean = float(np.mean(disc_rewards))  # mean discounted reward (unused)

    train_obs = []
    train_act = []
    elite_batch = []

    # Filter the episodes. For every episode in the batch,
    # we check whether its discounted reward is strictly higher than our boundary and, if it is,
    # we add its observations and actions to the lists that we will train on
    # and keep the episode itself in elite_batch so it can be reused in later iterations.
    for example, discounted_reward in zip(batch, disc_rewards):
        #print('discounted_reward=', discounted_reward,' reward bound=',  reward_bound)
        if discounted_reward > reward_bound:
            train_obs.extend(map(lambda step: step.observation, example.steps))
            train_act.extend(map(lambda step: step.action, example.steps))
            elite_batch.append(example)

    return elite_batch, train_obs, train_act, reward_bound

HIDDEN_SIZE = 128  # hidden layer size
BATCH_SIZE = 100  # number of episodes in a single batch
PERCENTILE = 5  # reward boundary: 5th percentile of the discounted rewards
GAMMA = 0.95  # reward discount factor
MAX_MEAN_REWARD = 0.8

# Helper classes, implemented as named tuples.
# Episode is a single episode stored as its total undiscounted reward and a collection of EpisodeStep.
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
# EpisodeStep represents one single step that our agent made in the episode: the observation and the action taken.
# We'll use episode steps from elite episodes as training data.
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])


if __name__ == "__main__":
    random.seed(12345)
    env = DiscreteOneHotWrapper(gym.make("FrozenLake-v0"))  # FrozenLake wrapped so that observations are one-hot vectors
    # env = gym.wrappers.Monitor(env, directory="mon", force=True)  # uncomment to record videos of the agent
    obs_size = env.observation_space.shape[0]  # observation size: 16 in the case of FrozenLake
    n_actions = env.action_space.n  # 4 actions (left, right, up, down) in the case of FrozenLake

    net = Net(obs_size, HIDDEN_SIZE, n_actions)  # network with input size = obs_size and output size = n_actions
        # Net(
        #   (net): Sequential(
        #     (0): Linear(in_features=16, out_features=128, bias=True)
        #     (1): ReLU()
        #     (2): Linear(in_features=128, out_features=4, bias=True)
        #   )
        # )    

    objective = nn.CrossEntropyLoss() #Use cross-entropy as the objective function
    optimizer = optim.Adam(params=net.parameters(), lr=0.001) #Use Adam as the optimizer
    writer = SummaryWriter(comment="-frozenlake-tweaked") #Initialize the log writer for TensorBoard
    
    full_batch = []

    #This is the main loop for the nn training
    #==========================================
    #It loops over the generator that is returned from the iterate_batches() function
    #Each batch includes BATCH_SIZE episodes
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)): #enumerate() yields the iteration number and a batch: a list of episodes as named tuples
        reward_mean = float(np.mean(list(map(lambda s: s.reward, batch)))) #Extract the rewards and calculate the mean reward over all episodes in the batch
        full_batch, train_obs, train_acts, reward_bound = filter_batch(full_batch + batch, PERCENTILE) #Filter the new episodes together with the elite episodes kept from earlier iterations
        #reward_bound - reward threshold: the PERCENTILE-th percentile of the discounted rewards in the combined batch
        #print('filtered batch length=', len(train_acts),'length full batch=',len(full_batch))
        if not full_batch:
            continue
        
        train_obs_v = torch.FloatTensor(train_obs) #n x 16 tensor of observations - n is the number of accumulated steps in the elite episodes
        train_acts_v = torch.LongTensor(train_acts)  #tensor of the n actions taken (0, 1, 2 or 3)
        full_batch = full_batch[-500:] #keep at most the 500 most recent elite episodes
        
        ## NN Retraining
        ##==============
        optimizer.zero_grad() #Reset the gradients before the new backward pass
        #print('obs_v length=', len(train_obs_v),)
        train_action_scores_v = net(train_obs_v) #Apply the nn to all the accumulated elite observations to get the action scores
                                     #train_action_scores_v is a tensor of shape n x 4 (4 actions)
        loss_v = objective(train_action_scores_v, train_acts_v) #Cross-entropy loss between the current net output and the actions actually taken
        loss_v.backward() #Run backpropagation
        optimizer.step() #Update the network weights
        
        print("%d: loss=%.3f, reward_mean=%.3f, reward_bound=%.3f, batch=%d" % (
            iter_no, loss_v.item(), reward_mean, reward_bound, len(full_batch))) #Print results 
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_mean", reward_mean, iter_no)
        writer.add_scalar("reward_bound", reward_bound, iter_no)
#        if True: #reward_mean > 0.8: #stop the loop if the mean reward has reached 0.8
        if reward_mean > MAX_MEAN_REWARD: #stop the loop if the mean reward has reached 0.8
            print("Solved!")
            break
    writer.close()


SFFF
FHFH
FFFH
HFFG
16
0: loss=1.387, reward_mean=0.040, reward_bound=0.000, batch=4
1: loss=1.378, reward_mean=0.020, reward_bound=0.000, batch=6
2: loss=1.370, reward_mean=0.010, reward_bound=0.000, batch=7
3: loss=1.363, reward_mean=0.000, reward_bound=0.000, batch=7
4: loss=1.364, reward_mean=0.040, reward_bound=0.000, batch=11
5: loss=1.365, reward_mean=0.030, reward_bound=0.000, batch=14
6: loss=1.364, reward_mean=0.010, reward_bound=0.000, batch=15
7: loss=1.361, reward_mean=0.000, reward_bound=0.000, batch=15
8: loss=1.361, reward_mean=0.020, reward_bound=0.000, batch=17
9: loss=1.359, reward_mean=0.000, reward_bound=0.000, batch=17
10: loss=1.356, reward_mean=0.010, reward_bound=0.000, batch=18
11: loss=1.350, reward_mean=0.010, reward_bound=0.000, batch=19
12: loss=1.350, reward_mean=0.020, reward_bound=0.000, batch=21
13: loss=1.349, reward_mean=0.000, reward_bound=0.000, batch=21
14: loss=1.350, reward_mean=0.010, reward_bound=0.000, batch=22
15: loss=1.348, rewar

127: loss=1.248, reward_mean=0.040, reward_bound=0.000, batch=323
128: loss=1.248, reward_mean=0.030, reward_bound=0.000, batch=326
129: loss=1.246, reward_mean=0.070, reward_bound=0.000, batch=333
130: loss=1.245, reward_mean=0.030, reward_bound=0.000, batch=336
131: loss=1.244, reward_mean=0.030, reward_bound=0.000, batch=339
132: loss=1.244, reward_mean=0.040, reward_bound=0.000, batch=343
133: loss=1.243, reward_mean=0.040, reward_bound=0.000, batch=347
134: loss=1.241, reward_mean=0.040, reward_bound=0.000, batch=351
135: loss=1.240, reward_mean=0.050, reward_bound=0.000, batch=356
136: loss=1.241, reward_mean=0.030, reward_bound=0.000, batch=359
137: loss=1.239, reward_mean=0.040, reward_bound=0.000, batch=363
138: loss=1.239, reward_mean=0.040, reward_bound=0.000, batch=367
139: loss=1.239, reward_mean=0.020, reward_bound=0.000, batch=369
140: loss=1.238, reward_mean=0.050, reward_bound=0.000, batch=374
141: loss=1.238, reward_mean=0.010, reward_bound=0.000, batch=375
142: loss=

252: loss=1.117, reward_mean=0.120, reward_bound=0.000, batch=500
253: loss=1.115, reward_mean=0.050, reward_bound=0.000, batch=500
254: loss=1.115, reward_mean=0.070, reward_bound=0.000, batch=500
255: loss=1.115, reward_mean=0.040, reward_bound=0.000, batch=500
256: loss=1.116, reward_mean=0.050, reward_bound=0.000, batch=500
257: loss=1.115, reward_mean=0.020, reward_bound=0.000, batch=500
258: loss=1.114, reward_mean=0.030, reward_bound=0.000, batch=500
259: loss=1.116, reward_mean=0.030, reward_bound=0.000, batch=500
260: loss=1.117, reward_mean=0.050, reward_bound=0.000, batch=500
261: loss=1.115, reward_mean=0.030, reward_bound=0.000, batch=500
262: loss=1.114, reward_mean=0.080, reward_bound=0.000, batch=500
263: loss=1.112, reward_mean=0.070, reward_bound=0.000, batch=500
264: loss=1.109, reward_mean=0.080, reward_bound=0.000, batch=500
265: loss=1.109, reward_mean=0.050, reward_bound=0.000, batch=500
266: loss=1.109, reward_mean=0.050, reward_bound=0.000, batch=500
267: loss=

377: loss=0.958, reward_mean=0.080, reward_bound=0.000, batch=500
378: loss=0.959, reward_mean=0.110, reward_bound=0.000, batch=500
379: loss=0.957, reward_mean=0.100, reward_bound=0.000, batch=500
380: loss=0.956, reward_mean=0.050, reward_bound=0.000, batch=500
381: loss=0.956, reward_mean=0.100, reward_bound=0.000, batch=500
382: loss=0.957, reward_mean=0.100, reward_bound=0.000, batch=500
383: loss=0.953, reward_mean=0.140, reward_bound=0.000, batch=500
384: loss=0.950, reward_mean=0.090, reward_bound=0.000, batch=500
385: loss=0.949, reward_mean=0.110, reward_bound=0.000, batch=500
386: loss=0.948, reward_mean=0.050, reward_bound=0.000, batch=500
387: loss=0.949, reward_mean=0.130, reward_bound=0.000, batch=500
388: loss=0.946, reward_mean=0.180, reward_bound=0.000, batch=500
389: loss=0.941, reward_mean=0.100, reward_bound=0.000, batch=500
390: loss=0.939, reward_mean=0.150, reward_bound=0.000, batch=500
391: loss=0.938, reward_mean=0.110, reward_bound=0.000, batch=500
392: loss=

502: loss=0.640, reward_mean=0.330, reward_bound=0.000, batch=500
503: loss=0.629, reward_mean=0.280, reward_bound=0.000, batch=500
504: loss=0.623, reward_mean=0.300, reward_bound=0.000, batch=500
505: loss=0.621, reward_mean=0.270, reward_bound=0.000, batch=500
506: loss=0.618, reward_mean=0.300, reward_bound=0.000, batch=500
507: loss=0.617, reward_mean=0.350, reward_bound=0.000, batch=500
508: loss=0.613, reward_mean=0.360, reward_bound=0.000, batch=500
509: loss=0.608, reward_mean=0.310, reward_bound=0.000, batch=500
510: loss=0.605, reward_mean=0.300, reward_bound=0.000, batch=500
511: loss=0.600, reward_mean=0.400, reward_bound=0.000, batch=500
512: loss=0.599, reward_mean=0.380, reward_bound=0.000, batch=500
513: loss=0.597, reward_mean=0.380, reward_bound=0.000, batch=500
514: loss=0.594, reward_mean=0.310, reward_bound=0.000, batch=500
515: loss=0.595, reward_mean=0.290, reward_bound=0.000, batch=500
516: loss=0.595, reward_mean=0.470, reward_bound=0.000, batch=500
517: loss=

627: loss=0.388, reward_mean=0.510, reward_bound=0.000, batch=500
628: loss=0.390, reward_mean=0.500, reward_bound=0.000, batch=500
629: loss=0.388, reward_mean=0.570, reward_bound=0.000, batch=500
630: loss=0.386, reward_mean=0.480, reward_bound=0.000, batch=500
631: loss=0.385, reward_mean=0.550, reward_bound=0.000, batch=500
632: loss=0.388, reward_mean=0.450, reward_bound=0.000, batch=500
633: loss=0.387, reward_mean=0.510, reward_bound=0.000, batch=500
634: loss=0.384, reward_mean=0.490, reward_bound=0.000, batch=500
635: loss=0.383, reward_mean=0.470, reward_bound=0.000, batch=500
636: loss=0.380, reward_mean=0.510, reward_bound=0.000, batch=500
637: loss=0.378, reward_mean=0.500, reward_bound=0.000, batch=500
638: loss=0.369, reward_mean=0.600, reward_bound=0.000, batch=500
639: loss=0.369, reward_mean=0.470, reward_bound=0.000, batch=500
640: loss=0.369, reward_mean=0.600, reward_bound=0.000, batch=500
641: loss=0.369, reward_mean=0.570, reward_bound=0.000, batch=500
642: loss=

752: loss=0.230, reward_mean=0.700, reward_bound=0.006, batch=500
753: loss=0.232, reward_mean=0.550, reward_bound=0.000, batch=500
754: loss=0.233, reward_mean=0.640, reward_bound=0.000, batch=500
755: loss=0.230, reward_mean=0.700, reward_bound=0.006, batch=500
756: loss=0.229, reward_mean=0.620, reward_bound=0.000, batch=500
757: loss=0.224, reward_mean=0.620, reward_bound=0.000, batch=500
758: loss=0.220, reward_mean=0.550, reward_bound=0.000, batch=500
759: loss=0.220, reward_mean=0.630, reward_bound=0.000, batch=500
760: loss=0.224, reward_mean=0.610, reward_bound=0.000, batch=500
761: loss=0.221, reward_mean=0.630, reward_bound=0.000, batch=500
762: loss=0.221, reward_mean=0.620, reward_bound=0.000, batch=500
763: loss=0.215, reward_mean=0.580, reward_bound=0.000, batch=500
764: loss=0.215, reward_mean=0.620, reward_bound=0.000, batch=500
765: loss=0.216, reward_mean=0.630, reward_bound=0.000, batch=500
766: loss=0.215, reward_mean=0.700, reward_bound=0.006, batch=500
767: loss=

877: loss=0.113, reward_mean=0.680, reward_bound=0.000, batch=500
878: loss=0.109, reward_mean=0.620, reward_bound=0.000, batch=500
879: loss=0.111, reward_mean=0.690, reward_bound=0.000, batch=500
880: loss=0.113, reward_mean=0.740, reward_bound=0.007, batch=500
881: loss=0.111, reward_mean=0.690, reward_bound=0.000, batch=500
882: loss=0.112, reward_mean=0.670, reward_bound=0.000, batch=500
883: loss=0.110, reward_mean=0.660, reward_bound=0.000, batch=500
884: loss=0.108, reward_mean=0.710, reward_bound=0.008, batch=500
885: loss=0.110, reward_mean=0.620, reward_bound=0.000, batch=500
886: loss=0.110, reward_mean=0.640, reward_bound=0.000, batch=500
887: loss=0.108, reward_mean=0.650, reward_bound=0.000, batch=500
888: loss=0.104, reward_mean=0.650, reward_bound=0.000, batch=500
889: loss=0.103, reward_mean=0.660, reward_bound=0.000, batch=500
890: loss=0.102, reward_mean=0.730, reward_bound=0.008, batch=500
891: loss=0.100, reward_mean=0.590, reward_bound=0.000, batch=500
892: loss=
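
Two features of this log can be checked by hand. The first loss value, 1.387, is close to ln 4 ≈ 1.386, the cross-entropy of a policy that gives equal probability to each of the 4 actions, which is roughly what an untrained network produces. The `reward_bound` column stays at 0.000 for most of the run because the 5th percentile of a pool dominated by failed episodes is 0, so the strict `>` test in `filter_batch` keeps only the episodes that reached the goal. The sketch below reproduces both effects with NumPy. The helper name `reward_bound_for` and the discounted reward value 0.5 are made up for illustration; the pool sizes are read off the `reward_mean` and `batch=` columns.

```python
import math

import numpy as np

PERCENTILE = 5  # same value as in the script above


def reward_bound_for(n_success, n_fail, reward=0.5):
    """Percentile bound for a pool of episodes (hypothetical helper).

    Every successful episode gets the same made-up discounted reward;
    every failed episode gets 0.
    """
    pool = [reward] * n_success + [0.0] * n_fail
    return float(np.percentile(pool, PERCENTILE))


# Cross-entropy of a uniform distribution over 4 actions: -ln(1/4) = ln 4.
uniform_loss = math.log(4)
print("uniform-policy loss: %.3f" % uniform_loss)

# Iteration 0: 100 fresh episodes, 4 of which succeeded (reward_mean=0.040).
first_bound = reward_bound_for(n_success=4, n_fail=96)
print("first bound:", first_bound)

# Later iterations: 500 retained elite episodes plus 100 fresh ones.
bound_31_fail = reward_bound_for(n_success=500 + 69, n_fail=31)
bound_30_fail = reward_bound_for(n_success=500 + 70, n_fail=30)
print("bound with 31 failures:", bound_31_fail)
print("bound with 30 failures:", bound_30_fail)
```

In this pool of 600 episodes the bound is 0 with 31 failures and becomes positive with 30, which is consistent with the visible log lines: `reward_bound` is non-zero only where `reward_mean` is 0.700 or higher.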

To view TensorBoard:

$ ``cd .\OneDrive\Python\DeepRL\DeepRL\``

$ ``tensorboard --logdir .\runs --host localhost``

![](img/Tensorboard-fig-2.png)

### Applying the trained model
Now that the neural network is trained, we can play the game with it and measure the success rate:


In [64]:
nGames = 1000 #number of games/episodes to play

sm = nn.Softmax(dim=1) #softmax turns the net's raw action scores into probabilities
success = 0 #success counter
for game in range(nGames):
    obs = env.reset() #reset the environment for a new game - returns a one-hot vector of 16 numbers
    while True:
        obs_v = torch.FloatTensor([obs]) #1 x 16 float tensor holding the one-hot encoded location
        #print('obs=',obs)
        act_probs_v = sm(net(obs_v)) #run the nn on the observation and apply softmax to its output
        act_probs = act_probs_v.data.numpy()[0] #convert the output tensor to a numpy array of action probabilities
        #print('act_probs=', act_probs)
        action = np.random.choice(len(act_probs), p=act_probs) #sample the action (an integer from 0 to n_actions - 1) from those probabilities
        #print('action=',action)
        obs, reward, game_over, _ = env.step(action) #take the selected action in the environment
        if game_over:
            #print('Game ',game, 'over - reward=', reward)
            if reward == 1: #successful game
                success += 1
            break
print('Successful games ratio =', success/nGames)


Successful games ratio = 0.697
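
The action-selection step in the loop above (scores, then softmax, then a random draw) can be isolated into a small standalone sketch. It uses NumPy only, so it runs without the trained network. The numbers in `scores` are made up and stand in for the output of `net(obs_v)`, and `softmax` is a hypothetical helper written here in place of `nn.Softmax`.

```python
import numpy as np


def softmax(scores):
    """Convert raw action scores into probabilities (hypothetical helper)."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# Made-up scores for the 4 FrozenLake actions; a real run would use net(obs_v).
scores = np.array([0.2, 1.5, -0.3, 0.1])
act_probs = softmax(scores)

# Sample an action index in proportion to its probability,
# as the evaluation loop does with np.random.choice.
rng = np.random.default_rng(12345)
action = int(rng.choice(len(act_probs), p=act_probs))

# Greedy alternative: always take the most probable action.
greedy_action = int(np.argmax(act_probs))

print("probabilities:", act_probs)
print("sampled action:", action, "greedy action:", greedy_action)
```

Sampling keeps some randomness in the agent's behavior, while taking the argmax would make the policy deterministic. The success ratio reported above was measured with sampling.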
