## We are implementing REINFORCE in the DOOM environment

Remember that this notebook is a companion guide to the Medium post that probably brought you here, so the full explanation you may be looking for is there.

![](images/doom_env.png)

#### Imports block

Here we import the modules used in our training. NumPy generates the list of possible actions and Matplotlib does some plotting at the end. VizDoom, Torch and TorchVision are our main modules.

In [None]:
from vizdoom import *

import numpy as np
import matplotlib.pyplot as plt
from collections import deque

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T

from torch.autograd import Variable

#### Defining our device usage and random seed

You can run this on CPU with an Intel MKL-powered PyTorch build (as I did, since it is low-cost with reasonable performance) or on a GPU if you have one and want to try it. We also set the random seed and, for reproducibility, restrict cuDNN to deterministic algorithms when CUDA is used.

In [None]:
use_cuda = torch.cuda.is_available() 
device = torch.device('cuda' if use_cuda else 'cpu')
#use_cuda = False
#device = torch.device('cpu')
FloatTensor = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor
LongTensor = torch.cuda.LongTensor if use_cuda else torch.LongTensor
DoubleTensor = torch.cuda.DoubleTensor if use_cuda else torch.DoubleTensor

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.manual_seed(42)

#### Here we set up the game assets and load its configuration files
It will pop up a game window, but nothing moves yet because we are not performing any actions on it. We also generate the list of possible actions here.
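
As a rough illustration of what the action list looks like, here is a pure-Python sketch that builds the same one-hot rows as `np.identity(3, dtype=int).tolist()`. The names `num_buttons` and `one_hot_actions` are made up for this example; each row presses exactly one button, in whatever order the scenario config declares them.

```python
# Hypothetical stand-in for doom_actions, built without NumPy.
num_buttons = 3  # the notebook uses 3 actions

# Row i has a 1 in column i and 0 everywhere else.
one_hot_actions = [
    [1 if col == row else 0 for col in range(num_buttons)]
    for row in range(num_buttons)
]

print(one_hot_actions)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Indexing this list with the action index chosen by the policy gives the button vector that is passed to the game.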

In [None]:
#Doom game assets: game env creation and possible actions list
game = DoomGame()
game.load_config("health_gathering.cfg")
game.set_doom_scenario_path("health_gathering.wad")
game.init()
doom_actions = np.identity(3, dtype=int).tolist()

#### We also must create a Frame Stacker class

Here is our FrameStacker class. We use it to address the non-Markov property of our frame-based env, since a single frame carries no motion information. It stacks the last 4 frames, preprocessing each one along the way, and returns them in the input shape of our neural network.
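
The stacking mechanics can be sketched without any tensors. This is only a toy, assuming string labels in place of real frames; `TinyFrameStacker` and `demo_stacker` are hypothetical names, not part of the notebook.

```python
from collections import deque


class TinyFrameStacker:
    """Toy stand-in for FrameStacker that keeps the last `size` frames."""

    def __init__(self, size=4, blank=0):
        self.size = size
        self.blank = blank
        self.memory = deque(maxlen=size)
        self.reset()

    def reset(self):
        # Refill with blank frames so a stack always has `size` entries.
        self.memory.clear()
        self.memory.extend([self.blank] * self.size)

    def stack(self, frame):
        # Appending to a full deque silently drops the oldest frame.
        self.memory.append(frame)
        return list(self.memory)


demo_stacker = TinyFrameStacker()
first = demo_stacker.stack("f1")
for name in ["f2", "f3", "f4", "f5"]:
    latest = demo_stacker.stack(name)

print(first)   # [0, 0, 0, 'f1']
print(latest)  # ['f2', 'f3', 'f4', 'f5']
```

The first stack of an episode is mostly blank frames, which is why the stacker is reset at the start of every episode.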


In [None]:

class FrameStacker:
    def __init__(self):
        """
        We can set the memory size here.
        Our memory is a deque and, on each stack, it concatenates the frames in memory along the axis 0
        We also have a transformer from torch that handles the resizing.
        """
        self.memory_size = 4
        self.memory = deque(maxlen=self.memory_size)
        self.reset()
        self.transformer = T.Compose([T.ToPILImage(),
                                      T.Resize((84,84)),
                                      T.ToTensor()])
    def reset(self):
        """
        By feeding the deque with zero-tensors we restart the memory.
        """
        for _ in range(self.memory_size):
            self.memory.append(torch.zeros(1, 84, 84).to(device))
            
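    def preprocess_frame(self, frame):
        """
        Here we crop the frame and pass it through the transformer.
        """
        #drop the first 80 rows of the frame before resizing
        frame = frame[80:,:]
        frame = self.transformer(frame)
        #note: ToTensor already scales uint8 pixels to [0, 1], so this division shrinks them further
        return frame.to(device)/255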

    def stack(self, frame):
        """
        our stack method preprocesses the state and returns it stacked.
        """
        frame = self.preprocess_frame(frame)
        self.memory.append(frame)
        return torch.cat(tuple(self.memory), dim=0)

#### Here we define the architecture of our Policy Network.

Notice that the fully connected layers use Tanh activations, which we chose in order to bring more variance to the learning process, while the convolutional layers use ReLU with BatchNorm to speed up training. You can inspect the architecture on TensorBoard if you feel like it.
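
One consequence of the final Tanh is worth a quick check. This is a reading of the code, not something stated in the post: because the last layer's outputs are squashed into the range -1 to 1 before the softmax, no action probability can get arbitrarily close to 0 or 1. The sketch below is plain Python, and the `softmax` helper is written here only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Most extreme logits a final Tanh can approach with 3 actions.
max_prob = softmax([1.0, -1.0, -1.0])[0]
min_prob = softmax([-1.0, 1.0, 1.0])[0]

print(round(max_prob, 3))  # 0.787
print(round(min_prob, 3))  # 0.063
```

Under this assumption, every action keeps at least roughly a 6% chance of being sampled, which keeps the policy exploring.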

![](images/network_architecture.png)
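
If you change the convolutional layers, the input size of the first linear layer has to change too. A small sketch of the arithmetic follows, assuming no padding and the kernel sizes and strides used in the cell below; `conv_out` is a helper written for this example.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output width/height of a convolution over a square input."""
    return (size + 2 * padding - kernel) // stride + 1


size = 84  # the frames are resized to 84x84
sizes = []
for kernel, stride in [(8, 4), (4, 2), (4, 2)]:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

# The last conv layer has 128 output channels.
flat_features = 128 * size * size

print(sizes)          # [20, 9, 3]
print(flat_features)  # 1152
```

The result matches the 1152 input features of the first `nn.Linear` layer.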

In [None]:
class PolicyNetwork(nn.Module):
    def __init__(self, lr):
        """
        We use Tanh activations in the linear layers in order to introduce variance in the learning.
        Adam was the best-performing optimizer.
        I encourage you to try other architectures, optimizers and hyperparameters.
        """
        super(PolicyNetwork, self).__init__()
        
        self.num_actions = 3
        
        self.conv_net = nn.Sequential(nn.Conv2d(in_channels=4, out_channels=32, kernel_size=8, stride=4),
                                      nn.BatchNorm2d(32),
                                      nn.ReLU(True),
                                      nn.Conv2d(32, 64, kernel_size=4, stride=2),
                                      nn.BatchNorm2d(64),
                                      nn.ReLU(True),
                                      nn.Conv2d(64, 128, kernel_size=4, stride=2),
                                      nn.BatchNorm2d(128),
                                      nn.ReLU(True))
        
        self.linear = nn.Sequential(nn.Linear(1152, 512),
                                    nn.Tanh(),
                                    nn.Linear(512, 512),
                                    nn.Tanh(),
                                    nn.Linear(512, self.num_actions),
                                    nn.Tanh())
        
        self.optimizer = optim.Adam(self.parameters(), lr=lr)

        
    def forward(self, state_stack):
        """
        simple feedforward method
        """
        x = self.conv_net(state_stack)
        x = x.view(x.size(0), -1)
        x = F.softmax(self.linear(x), dim=1)
        return x

    def get_action(self, state):
        """
        samples an action from the policy and returns it with its log-probability
        """
        state = state.float().unsqueeze(0)
        probs = self.forward(state)
        
        #we sample the action instead of taking the most likely one, in order to introduce variance in the learning
        distribution = torch.distributions.categorical.Categorical(probs=probs.detach())
        sampled_action = distribution.sample()
        
        #log-probability of the sampled action, kept attached to the graph for the policy update
        log_prob = torch.log(probs.squeeze(0)[sampled_action])
        
        #it returns the useful values for acting and optimizing
        return sampled_action, log_prob

#### Here is our Update Policy function

We compute the discounted rewards, normalize them, and use them as weights on the log-probabilities of the actions our network took. Stochastic action sampling lets us learn from low-probability actions too, which is good for training, as it exposes the model to more situations.
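
To make the arithmetic concrete, here is a small pure-Python sketch of the same steps on a three-step episode. The rewards and action probabilities are made-up numbers, the helper names are hypothetical, and the `eps` term is an assumption added to guard against a zero standard deviation.

```python
import math
from statistics import mean, stdev


def discounted_returns(rewards, gamma):
    """Return G_t for every step, accumulating from the last step backwards."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def normalize(values, eps=1e-8):
    """Shift to zero mean and scale by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.95)
scaled = normalize(returns)

# Made-up probabilities of the actions that were sampled at each step.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.25)]
loss = sum(-lp * g for lp, g in zip(log_probs, scaled))

print([round(g, 4) for g in returns])  # [2.8525, 1.95, 1.0]
```

Early steps get the largest returns because they are credited with everything that follows them, discounted by `gamma` at each step.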

In [None]:
def update_policy(policy_network, rewards, log_probs):
    """
    we compute the discounted return for each step and multiply it by the log-probability of the action taken.
    we then sum these terms into a single scalar loss; its negative sign turns minimizing the loss
    into maximizing the probability of the best-performing actions.
    """
    #accumulate the returns backwards, so each step reuses the return of the next one
    discounted_rewards = []
    Gt = 0
    for r in reversed(rewards):
        Gt = r + GAMMA * Gt
        discounted_rewards.append(Gt)
    discounted_rewards.reverse()

    discounted_rewards = torch.tensor(discounted_rewards, device=device)
    #the small epsilon avoids a division by zero when all the returns are equal
    discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-8)
    
    policy_gradient = []
    for log_prob, Gt in zip(log_probs, discounted_rewards):
        policy_gradient.append(-log_prob * Gt)
    
    policy_network.optimizer.zero_grad()
    policy_gradient_ = torch.stack(policy_gradient).sum()
    #the graph is rebuilt every episode, so there is no need to retain it
    policy_gradient_.backward()
    policy_network.optimizer.step()

#### Finally we set up TensorBoard and some global variables

After running this cell, you can inspect the model architecture (and later its learning curves) on TensorBoard with the command:

```
tensorboard --logdir runs
```

If you are running this for the second time, you may want to delete your previous TensorBoard records:
```
rm -r runs
```

In [None]:
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter(flush_secs = 40)

#here we set the global variables
GAMMA = .95
EPISODES = 5000
learning_rate = 0.02

#our net and frame-stacker
stacker = FrameStacker()
policy_net = PolicyNetwork(lr=learning_rate).to(device)

#some lists to write the values, if you want to do some in-notebook plotting.
num_steps = []
avg_numsteps = []
all_rewards = []

#we leverage this cell to write our graph to TensorBoard.
writer.add_graph(policy_net, stacker.stack(torch.zeros(84, 84)).unsqueeze(0).to(device))

#### Now we run our training steps

While it runs, you can watch the actions being taken in the game window and follow the learning metrics in the TensorBoard window.
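
One of the logged metrics is a trailing average of the episode length over the last 10 episodes. A minimal sketch of that smoothing in plain Python follows; the episode lengths are invented and `trailing_mean` is a hypothetical helper, not a function from the notebook.

```python
def trailing_mean(values, window=10):
    """Mean of the last `window` values (or of all of them, if fewer)."""
    recent = values[-window:]
    return sum(recent) / len(recent)


episode_lengths = []
smoothed = []
for steps in [10, 20, 30]:  # made-up episode lengths
    episode_lengths.append(steps)
    smoothed.append(trailing_mean(episode_lengths))

print(smoothed)  # [10.0, 15.0, 20.0]
```

During the first episodes the window is not full yet, so the average is taken over however many episodes exist.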

In [None]:
for episode in range(EPISODES):
    
    #reset the stacker and stack the first state
    stacker.reset()
    game.new_episode()
    state = game.get_state().screen_buffer
    
    #we stack our states and create our data lists
    state = stacker.stack(state)
    log_probs = []
    rewards = []
    
    #in order to start the training:
    done = False
    steps = 0
    
    while True:
        
        #here we gather the action from the state using the network
        #and select it from the action list
        action_idx, log_prob = policy_net.get_action(state)
        action = doom_actions[action_idx]
        
        #we act and check if the episode ended
        reward = game.make_action(action)
        done = game.is_episode_finished()
        
        #we must append our reward and log-probability to their corresponding lists
        rewards.append(reward)
        log_probs.append(log_prob)
        
        steps += 1
        
        if done:
            break
            
        new_state = game.get_state().screen_buffer
        state = stacker.stack(new_state)
        
    writer.add_scalar("steps", steps, episode)
    update_policy(policy_net, rewards, log_probs)
    num_steps.append(steps)
    writer.add_scalar("avg_steps", np.mean(num_steps[-10:]), episode)
    
    avg_numsteps.append(np.mean(num_steps[-10:]))
    all_rewards.append(np.sum(rewards))
    print("Episode: {}, total_reward: {}, length: {}".format(episode+1, np.sum(rewards), steps))

In [None]:
#we do some in-notebook plotting if you want it.
plt.plot(num_steps)
plt.plot(avg_numsteps)
plt.xlabel('Episode')
plt.show()