# Deep Q-Networks

In this chapter, we'll try to apply the same theory to problems of much greater complexity: arcade games from the Atari 2600 platform, which are the de facto benchmark of the RL research community. To deal with this new and more challenging goal, we'll talk about problems with the Value iteration method and introduce its variation, called Q-learning. In particular, we'll look at the application of Q-learning to so-called "grid world" environments, which is called tabular Q-learning, and then we'll discuss Q-learning in conjunction with neural networks. This combination is called a deep Q-network (DQN). At the end of the chapter, we'll reimplement a DQN algorithm from the famous paper, *Playing Atari with Deep Reinforcement Learning* by V. Mnih and others, published in 2013, which started a new era in RL development.

## Real-life value iteration 
The improvements we got in the FrozenLake environment by switching from Cross-Entropy to the Value iteration method are quite encouraging, so it's tempting to apply the value iteration method to more challenging problems. However, let's first look at the assumptions and limitations that our Value iteration method has.
We will start with a quick recap of the method. The Value iteration method on every step does a loop on all states, and for every state, it performs an update of its value with a Bellman approximation. The variation of the same method for Q-values (values for actions) is almost the same, but we approximate and store values for every state and action. So, what's wrong with this process?
The first obvious problem is the count of environment states and our ability to iterate over them. In Value iteration, we assume that we know all states in our environment in advance, can iterate over them, and can store a value approximation for each state. That's definitely true for the simple "grid world" environment of FrozenLake, but what about other tasks? First, let's try to understand how scalable the Value iteration approach is, or, in other words, how many states we can easily iterate over in every loop. Even a moderate-sized computer can keep several billion float values in memory (8.5 billion in 32 GB of RAM), so the memory required for value tables doesn't look like a huge constraint. Iteration over billions of states and actions will be more CPU intensive, but that is not an insurmountable problem either.
Nowadays, we have multicore systems that are mostly idle. The real problem is the number of samples required to get good approximations for state transition dynamics. Imagine that you have some environment with, say, a billion states (this corresponds approximately to a FrozenLake of size 31600 × 31600). To calculate even a rough approximation for every state of this environment, we'll need hundreds of billions of transitions evenly distributed over our states, which is not practical.
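
As a quick sanity check of the figures quoted above, the following sketch recomputes them. It assumes 4-byte (32-bit) floats and reads "32 GB" as 32 × 2^30 bytes; with 10^9-byte gigabytes the count comes out at 8 billion instead, so the result is only roughly the 8.5 billion mentioned in the text.

```python
# Back-of-the-envelope check of the numbers used in the text.
BYTES_PER_FLOAT32 = 4
ram_bytes = 32 * 2**30  # assumption: "32 GB" counted as 32 GiB

floats_in_ram = ram_bytes // BYTES_PER_FLOAT32
print(f"float32 values that fit in RAM: {floats_in_ram:,}")

# A square FrozenLake-style grid with about a billion cells.
side = 31_600
grid_states = side * side
print(f"states in a {side} x {side} grid: {grid_states:,}")
```
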
To give you an example of an environment with a much larger number of potential states, let's consider the Atari 2600 game console again. This was very popular in the 1980s and many arcade-style games were available for it. The Atari console is archaic by today's gaming standards, but its games give an excellent set of RL problems that humans can master fairly quickly but that are still challenging for computers. Not surprisingly, this platform (using an emulator, of course) is a very popular benchmark among RL researchers.
Let's calculate the state space for the Atari platform. The resolution of the screen is 210 × 160 pixels, and every pixel has one of 128 colors. So, every frame of the screen has 210 × 160 = 33600 pixels, and the total number of different possible screens is $128^{33600}$, which is slightly more than $10^{70802}$. If we decide to just enumerate all possible states of Atari once, it will take billions of billions of years even for the fastest supercomputer. Also, well over 99.9% of this job will be a waste of time, as most of the combinations will never be shown during even long gameplay, so we'll never have samples of those states. However, the value iteration method wants to iterate over them just in case.
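
The size of this state space is easier to check in logarithms than directly. The short sketch below (variable names are illustrative) converts $128^{33600}$ into a power of ten.

```python
import math

# Screen geometry and palette size taken from the text.
pixels = 210 * 160
colors = 128

# log10(colors ** pixels) = pixels * log10(colors)
exponent = pixels * math.log10(colors)
print(f"pixels per frame: {pixels}")
print(f"number of possible screens is about 10^{exponent:.2f}")
```
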
Another problem with the value iteration approach is that it limits us to discrete action spaces. Indeed, both Q(s, a) and V(s) approximations assume that our actions are a mutually exclusive discrete set, which is not true for continuous control problems where actions can represent continuous variables, such as the angle of a steering wheel, the force on an actuator, or the temperature of a heater. This issue is much more challenging than the first, and we'll talk about it in the last part of the book, in chapters dedicated to continuous action space problems. For now, let's assume that we have a discrete set of actions and that this set is not very large (on the order of tens). How should we handle the state space size issue?

## Tabular Q-learning
First of all, do we really need to iterate over every state in the state space? We have an environment that can be used as a source of real-life samples of states. If some state in the state space is not shown to us by the environment, why should we care about its value? We can use states obtained from the environment to update values of states, which can save us lots of work.
This modification of the Value iteration method is known as Q-learning, as mentioned earlier, and for cases with explicit state-to-value mappings, has the following steps:

1. Start with an empty table, mapping states to values of actions.
2. By interacting with the environment, obtain the **tuple** $(s, a, r, s')$ (state, action, reward, and the new state). In this step, we need to decide which action to take, and there is no single proper way to make this decision. We discussed this problem as **exploration versus exploitation** and will return to it several times.
3. Update the $Q(s, a)$ value using the Bellman approximation: $Q(s,a) \leftarrow r + \gamma \max_{a' \in A} Q(s',a')$
4. Repeat from step 2.

As in Value iteration, the end condition could be some threshold of the update or we can perform test episodes to estimate the expected reward from the policy. Another thing to note here is how to update the Q-values. As we take samples from the environment, it's generally a bad idea to just assign new values on top of existing values, as training can become unstable. What is usually done in practice is to update the $Q(s, a)$ with approximations using a **"blending"** technique, which is just averaging between old and new values of $Q$ using learning rate $\alpha$ with a value from 0 to 1:

$Q(s,a) \leftarrow (1 - \alpha) Q(s,a) + \alpha (r + \gamma \max_{a' \in A} Q(s',a'))$

This allows values of $Q$ to converge smoothly, even if our environment is noisy.
The final version of the algorithm is here:

1. Start with an empty table for $Q(s,a)$
2. Obtain the **tuple** $(s, a, r, s')$ (state, action, reward, and the new state) from the environment.
3. Make a Bellman update:
$Q(s,a) \leftarrow (1 - \alpha) Q(s,a) + \alpha (r + \gamma \max_{a' \in A} Q(s',a'))$
4. Check convergence conditions. If not met, repeat from step 2.
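
To make the blended update concrete, here is a small worked example. It is only a sketch: the function name `blended_update` and the sample numbers are made up for illustration, while $\gamma = 0.9$ and $\alpha = 0.2$ match the constants used in the script that follows.

```python
GAMMA = 0.9  # discount factor
ALPHA = 0.2  # learning rate

def blended_update(q_old, reward, max_next_q, alpha=ALPHA, gamma=GAMMA):
    """Blend the old Q-value with the one-step Bellman target."""
    target = reward + gamma * max_next_q
    return (1 - alpha) * q_old + alpha * target

# Old estimate 0.5, reward 1.0, best next-state value 0.2:
# target = 1.0 + 0.9 * 0.2 = 1.18, new value = 0.8 * 0.5 + 0.2 * 1.18
q_new = blended_update(q_old=0.5, reward=1.0, max_next_q=0.2)
print(round(q_new, 3))

# With alpha = 1.0 the old value is overwritten by the target.
q_overwrite = blended_update(0.5, 1.0, 0.2, alpha=1.0)
```

With a small $\alpha$, a single noisy sample only nudges the estimate, which is why the values settle smoothly.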

As mentioned earlier, this method is called **tabular Q-learning**, as we keep a **table of states with their Q-values**. Let's try it on our FrozenLake environment. The whole example code is in Chapter06/01_frozenlake_q_learning.py.
You may have noticed that this version used more iterations to solve the problem compared to the value iteration method from the previous chapter. The reason for that is that we're **no longer using the experience obtained during testing**. (In Chapter05/02_frozenlake_q_iteration.py, periodic tests cause an update of Q-table statistics. **Here we don't touch Q-values during the test**, which causes more iterations before the environment gets solved.) Overall, the total number of samples required from the environment is almost the same. The reward chart in TensorBoard also shows good training dynamics, which are very similar to the value iteration method.

In [153]:
#!/usr/bin/env python3
import gym
import collections

from tensorboardX import SummaryWriter

ENV_NAME = "FrozenLake-v0"
#ENV_NAME = "FrozenLake8x8-v0"
GAMMA = 0.9 #Discount factor
ALPHA = 0.2 #Learning rate
TEST_EPISODES = 20


class Agent:
    #Initialize the agent
    def __init__(self):
        self.env = gym.make(ENV_NAME) #Create a new environment
        self.state = self.env.reset() #Reset the environment
        self.values = collections.defaultdict(float) #Initialize an empty values table

    #Method to execute a single sampling step in the environment
    def sample_env(self):
        action = self.env.action_space.sample() #Sample a random action from the action space
        old_state = self.state #Save the previous state
        new_state, reward, is_done, _ = self.env.step(action) #Perform the sampled action in the environment
        self.state = self.env.reset() if is_done else new_state #Store the new state, or reset if the episode is done
        return (old_state, action, reward, new_state) #Return a tuple (s, a, r, s')

    #Method to select the best action in a given state
    def best_value_and_action(self, state):
        best_value, best_action = None, None
        for action in range(self.env.action_space.n): #Iterate over all possible actions of the environment
            action_value = self.values[(state, action)] #Get the current action value (Q-value) from the values table
            if best_value is None or best_value < action_value: #Keep track of the best value and the best action
                best_value = action_value
                best_action = action
        return best_value, best_action #Return the best value and the best action

    #Method to update the value table
    def value_update(self, s, a, r, next_s):
        best_v, _ = self.best_value_and_action(next_s)
        new_val = r + GAMMA * best_v #Calculate the new value with the Bellman equation
        old_val = self.values[(s, a)]
        self.values[(s, a)] = old_val * (1-ALPHA) + new_val * ALPHA #Blend old and new values, weighted by the learning rate

    #Method to play a single episode
    #We don't update the tables while running the episode
    def play_episode(self, env):
        state = env.reset() #Reset the test environment (not the one used to update the value table)
        total_reward = 0.0
        steps = 0
        while True:
            _, action = self.best_value_and_action(state) #Get the best action for the current state
            new_state, reward, is_done, _ = env.step(action) #Make a step in the environment with the best action
            total_reward += reward #Increment the total reward with the new reward
            if is_done:
                break
            state = new_state #The new state becomes the current state
            steps += 1
        return total_reward, steps


if __name__ == "__main__":
    test_env = gym.make(ENV_NAME)
    agent = Agent() #Create a new agent object
    writer = SummaryWriter(comment="-q-learning")

    iter_no = 0
    best_reward = 0.0
    while True:
        iter_no += 1
        s, a, r, next_s = agent.sample_env() #Sample the environment
        agent.value_update(s, a, r, next_s) #Update the value table

        total_reward = 0.0
        total_steps = 0
        for _ in range(TEST_EPISODES): #Test on the environment using the updated table
            reward, steps = agent.play_episode(test_env)
            total_reward += reward
            total_steps += steps
        total_reward /= TEST_EPISODES
        writer.add_scalar("reward", total_reward, iter_no) #Log the mean test reward, not the reward of the last episode
        if total_reward > best_reward:
            print("Best reward updated %.3f -> %.3f" % (best_reward, total_reward), 'env. steps per iteration = ', total_steps)
            best_reward = total_reward
        if total_reward > 0.80:
            print("Solved in %d iterations!" % iter_no)
            break
    writer.close()


Best reward updated 0.000 -> 0.050 env. steps per iteration =  337
Best reward updated 0.050 -> 0.100 env. steps per iteration =  259
Best reward updated 0.100 -> 0.150 env. steps per iteration =  280
Best reward updated 0.150 -> 0.200 env. steps per iteration =  242
Best reward updated 0.200 -> 0.250 env. steps per iteration =  266
Best reward updated 0.250 -> 0.300 env. steps per iteration =  622
Best reward updated 0.300 -> 0.350 env. steps per iteration =  419
Best reward updated 0.350 -> 0.450 env. steps per iteration =  504
Best reward updated 0.450 -> 0.500 env. steps per iteration =  763
Best reward updated 0.500 -> 0.550 env. steps per iteration =  398
Best reward updated 0.550 -> 0.650 env. steps per iteration =  732
Best reward updated 0.650 -> 0.700 env. steps per iteration =  775
Best reward updated 0.700 -> 0.750 env. steps per iteration =  605
Best reward updated 0.750 -> 0.800 env. steps per iteration =  975
Best reward updated 0.800 -> 0.850 env. steps per iteration = 

In [157]:
TEST_EPISODES = 1000
env = gym.make(ENV_NAME) #Create a new environment
total_reward = 0
total_steps = 0
for i in range(TEST_EPISODES):
    rewards, steps = agent.play_episode(env)
    #print('Episode: ',i,' Reward=', reward)
    total_reward += rewards
    total_steps += steps
print('Average reward=', total_reward/TEST_EPISODES, ' Av. steps per episode = ', total_steps/TEST_EPISODES)
    


Average reward= 0.585  Av. steps per episode =  45.015


## Deep Q-learning

For environments with continuous states, we can use **a nonlinear representation that maps both state and action onto a value**. In machine learning this is called a **"regression problem"**. The concrete way to represent and train such a representation can vary, but, as you may have already guessed from this section's title, using a deep neural network (DNN) is one of the most popular options, especially when dealing with observations represented as screen images. With this in mind, let's modify the tabular Q-learning algorithm above so that it can accommodate a very large state and action space and use a DNN:

1. Initialize $Q(s, a)$ with some initial approximation 
2. By interacting with the environment, obtain the tuple $(s, a, r, s′)$ 
3. Calculate loss: $L = (Q(s,a) - r)^2$ if the episode has ended, or $L = (Q(s,a) - (r + \gamma \max_{a' \in A} Q(s',a')))^2$ otherwise
4. Update $Q(s, a)$ using the stochastic gradient descent (SGD) algorithm, by minimizing the loss with respect to the model parameters
5. Repeat from step 2 until converged
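
The two cases of the loss in step 3 can be sketched in plain Python for a single transition. This is an illustration only: `td_loss` is a hypothetical helper, and the Q-values are passed in as plain numbers instead of coming from a network.

```python
def td_loss(q_sa, reward, next_q_values, done, gamma=0.99):
    """Squared error between Q(s, a) and its one-step target.

    q_sa          -- current estimate Q(s, a)
    next_q_values -- estimates Q(s', a') for every action a'
    done          -- True if the episode ended on this transition
    """
    if done:
        target = reward  # no future reward after the final step
    else:
        target = reward + gamma * max(next_q_values)
    return (q_sa - target) ** 2

# Terminal transition: target is just the reward.
loss_terminal = td_loss(0.5, 1.0, [0.1, 0.3], done=True)
# Non-terminal transition: target bootstraps from the best next action.
loss_bootstrap = td_loss(0.5, 0.0, [0.1, 0.3], done=False, gamma=0.9)
print(loss_terminal, loss_bootstrap)
```
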

### Interaction with the environment

First of all, we need to interact with the environment somehow to receive data to train on. In simple environments, such as FrozenLake, we can act randomly, but is this the best strategy to use? Imagine the game of Pong. What's the probability of winning a single point by randomly moving the paddle? It's not zero but it's extremely small, which just means that we'll need to wait for a very long time for such a rare situation. As an alternative, we can use our Q function approximation as a source of behavior (as we did before in the value iteration method, when we remembered our experience during testing).

If our representation of Q is good, then the experience that we get from the environment will show the agent relevant data to train on. However, we're in trouble when our approximation is not perfect (at the beginning of the training, for example). In such a case, our agent can be stuck with bad actions for some states without ever trying to behave differently. This **exploration versus exploitation dilemma** was mentioned briefly in Chapter 1, *What is Reinforcement Learning?* On the one hand, our agent needs to explore the environment to build a complete picture of transitions and action outcomes. On the other hand, we should use interaction with the environment efficiently: we shouldn't waste time by randomly trying actions whose outcomes we have already learned. As you can see, random behavior is better at the beginning of the training when our Q approximation is bad, as it gives us more uniformly distributed information about the environment states. As our training progresses, random behavior becomes inefficient and we want to fall back to our Q approximation to decide how to act.

A method which performs such a mix of two extreme behaviors is known as an **epsilon-greedy method**, which just means switching between random and $Q$ policy using the probability hyperparameter $\epsilon$. By varying $\epsilon$ we can select the ratio of random actions. The usual practice is to start with $\epsilon = 1.0$ (100% random actions) and slowly decrease it to some small value such as 5% or 2% of random actions. Using an epsilon-greedy method helps both to explore the environment in the beginning and to stick to good policy at the end of the training. There are other solutions to the **"exploration versus exploitation"** problem, and we'll discuss some of them in part three of the book. This problem is **one of the fundamental open questions in RL and an active area of research, which is not even close to being resolved completely**.
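
A minimal sketch of epsilon-greedy selection with a linearly decaying $\epsilon$ is shown below. The function names, the decay length of 100,000 steps, and the final value of 0.02 are illustrative assumptions, not the exact schedule used later in this chapter.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

def linear_epsilon(step, start=1.0, final=0.02, decay_steps=100_000):
    """Decay epsilon linearly from start to final over decay_steps steps."""
    return max(final, start - step * (start - final) / decay_steps)

q_values = [0.1, 0.9, 0.3]
greedy_action = epsilon_greedy(q_values, epsilon=0.0)  # always exploits
random_action = epsilon_greedy(q_values, epsilon=1.0)  # always explores
for step in (0, 50_000, 100_000, 200_000):
    print(step, round(linear_epsilon(step), 3))
```
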

### SGD optimization
The core of our Q-learning procedure is borrowed from supervised learning. Indeed, we are trying to approximate a complex, nonlinear function $Q(s, a)$ with a neural network. To do this, we calculate targets for this function using the Bellman equation and then pretend that we have a supervised learning problem at hand. That's okay, but one of the fundamental requirements for SGD optimization is that the training data is **independent and identically distributed (i.i.d.)**.

In our case, data that we're going to use for the SGD update doesn't fulfill these criteria:
1. Our samples are not independent. Even if we accumulate a large batch of data samples, they all will be very close to each other, as they belong to the same episode.
2. Distribution of our training data won't be identical to samples provided by the optimal policy that we want to learn. Data that we have is a result of some other policy (our current policy, random, or both in the case of $\epsilon$-greedy), but we don't want to learn how to play randomly: we want an optimal policy with the best reward.

To deal with this nuisance, we usually need to use a **large buffer of our past experience and sample training data from it, instead of using our latest experience**. This technique is called a **replay buffer**. The simplest implementation is a buffer of **fixed size**, with new data added to the end of the buffer so that it pushes the oldest experience out of it.

A **replay buffer** allows us to train on more-or-less independent data, but the data will still be fresh enough to train on samples generated by our recent policy.
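
The fixed-size behaviour can be illustrated with a few lines of standard-library Python. This is a sketch under stated assumptions: the names `Transition`, `ReplayBuffer`, and `push` are hypothetical and chosen only for this example.

```python
import collections
import random

Transition = collections.namedtuple(
    "Transition", ["state", "action", "reward", "done", "next_state"])

class ReplayBuffer:
    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item when a new one arrives.
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive steps.
        return random.sample(list(self.buffer), batch_size)

replay = ReplayBuffer(capacity=3)
for t in range(5):
    replay.push(Transition(state=t, action=0, reward=0.0, done=False, next_state=t + 1))

stored_states = [tr.state for tr in replay.buffer]
batch = replay.sample(2)
print(stored_states, [tr.state for tr in batch])
```

After five pushes into a buffer of capacity three, only the three most recent transitions remain.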

### Correlation between steps
Another practical issue with the default training procedure is also related to the lack of i.i.d. data, but in a slightly different manner. The Bellman equation provides us with the value of $Q(s, a)$ via $Q(s', a')$ (this is known as **bootstrapping**). However, the states $s$ and $s'$ have only one step between them. This makes them very similar, and it's really hard for neural networks to distinguish between them. When we perform an update of our network's parameters to make $Q(s, a)$ closer to the desired result, we can indirectly alter the value produced for $Q(s', a')$ and other states nearby. This can make our training really unstable, like chasing our own tail: when we update $Q$ for state $s$, then on subsequent states we discover that $Q(s', a')$ becomes worse, but attempts to update it can spoil our $Q(s, a)$ approximation, and so on.

To make training more stable, there is a **trick, called a target network**, in which we keep a copy of our network and use it for the $Q(s', a')$ value in the Bellman equation. This network is synchronized with our main network only periodically, for example, once every $N$ steps (where $N$ is usually quite a large hyperparameter, such as 1k or 10k training iterations).
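
The periodic synchronization can be sketched without any deep learning library. In the toy example below, a dictionary with one number stands in for the network weights and adding 1.0 stands in for a gradient step; both are assumptions made purely for illustration. With PyTorch modules, the copy is commonly done with `tgt_net.load_state_dict(net.state_dict())`.

```python
import copy

SYNC_EVERY = 3  # N; tiny here so the effect is visible, 1k-10k in practice

online = {"w": 0.0}              # stand-in for the trained network's weights
target = copy.deepcopy(online)   # frozen copy used for the Bellman targets

history = []
for step in range(1, 8):
    online["w"] += 1.0           # stand-in for one optimizer step
    if step % SYNC_EVERY == 0:
        target = copy.deepcopy(online)  # periodic synchronization
    history.append((step, online["w"], target["w"]))

for row in history:
    print(row)
```

Between synchronizations the target copy stays fixed, so the targets do not move with every update of the online weights.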

### The Markov property
Our RL methods use MDP formalism as their basis, which **assumes that the environment obeys the Markov property**: observation from the environment is all that we need to act optimally (in other words, our observations allow us to distinguish states from one another). As we've seen in the preceding Pong screenshot, one single image from the Atari game is not enough to capture all important information (using only one image, we have no idea about the speed and direction of objects, like the ball and our opponent's paddle). This obviously violates the Markov property and moves our single-frame Pong environment into the area of **partially observable MDPs (POMDPs)**. A POMDP is basically an MDP without the Markov property, and POMDPs are very important in practice. For example, most card games where you don't see your opponents' cards are POMDPs, because the current observation (your cards and the cards on the table) could correspond to different cards in your opponents' hands.
We'll not discuss POMDPs in detail in this book, so, for now, we'll use a small technique to push our environment back into the MDP domain. **The solution is maintaining several observations from the past and using them as a state**. In the case of Atari games, we usually stack $k$ subsequent frames together and use them as the observation at every state. This **allows our agent to deduce the dynamics of the current state**, for instance, to get the speed of the ball and its direction. The usual "classical" **value of $k$ for Atari is four**. Of course, it's just a hack, as there can be longer dependencies in the environment, but for most of the games it works well.
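
Stacking the last $k$ frames can be sketched with a bounded deque. The class `FrameStack` below is a hypothetical helper written for this illustration, and strings stand in for image frames; repeating the first frame on reset is one common convention, assumed here.

```python
import collections

class FrameStack:
    """Keep the k most recent frames and expose them as one observation."""

    def __init__(self, k=4):
        self.k = k
        self.frames = collections.deque(maxlen=k)

    def reset(self, first_frame):
        # At the start of an episode there is no history yet,
        # so the first frame is repeated k times.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return tuple(self.frames)

    def step(self, frame):
        # Appending to a full deque discards the oldest frame.
        self.frames.append(frame)
        return tuple(self.frames)

stack = FrameStack(k=4)
obs_start = stack.reset("f0")
obs_next = stack.step("f1")
for name in ("f2", "f3", "f4"):
    obs_last = stack.step(name)
print(obs_start, obs_next, obs_last)
```
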

## The final form of DQN (Deep Q-Learning) training
There are many more tips and tricks that researchers have discovered to make **DQN** training more stable and efficient, and we'll cover the best of them in the next chapter. However, **$\epsilon$-greedy**, **replay buffer**, and **target network** form the basis that allowed DeepMind to successfully train a DQN on a set of 49 Atari games and demonstrate the efficiency of this approach applied to complicated environments.

The original paper (without target network) was published at the end of 2013 ([Playing Atari with Deep Reinforcement Learning, 1312.5602v1, Mnih and others](https://arxiv.org/pdf/1312.5602.pdf)), and they used seven games for testing. Later, at the beginning of 2015, a revised version of the article, with 49 different games, was published in Nature ([Human-Level Control Through Deep Reinforcement Learning, doi:10.1038/nature14236, Mnih and others](https://www.nature.com/articles/nature14236)).

The algorithm for DQN from the preceding papers has the following steps:
1. Initialize parameters for $Q(s, a)$ and $\hat Q(s, a)$ with random weights, **$\epsilon\leftarrow 1.0$**, and an **empty replay buffer**
2. With probability $\epsilon$, select a random action $a$, otherwise $a = \arg \max_a Q(s,a)$
3. Execute action $a$ in an **emulator** and observe reward $r$ and the next state $s'$
4. Store transition $(s, a, r, s')$ in the **replay buffer**
5. Sample a random **minibatch** of transitions from the **replay buffer**
6. For every transition in the minibatch, calculate target $y = r$ if the episode has ended at this step or $y = r + \gamma \max_{a' \in A} \hat Q(s',a')$ otherwise
7. Calculate loss: $L = (Q(s,a) - y)^2$
8. Update $Q(s, a)$ using the **SGD algorithm** by minimizing the loss with respect to the model parameters
9. Every $N$ steps copy weights from $Q$ to $\hat Q$
10. Repeat from step 2 until converged
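
The control flow of these ten steps can be sketched on a toy problem. The example below is not the DQN from the papers: a lookup table stands in for the neural network (so the "gradient step" is a direct update of one table entry), the five-state chain environment is hypothetical, and all hyperparameters are illustrative assumptions. It is only meant to show how the pieces fit together.

```python
import collections
import copy
import random

N_STATES, N_ACTIONS = 5, 2
GAMMA, LR = 0.9, 0.1
BATCH_SIZE, SYNC_EVERY, REPLAY_SIZE = 8, 20, 500

def env_step(state, action):
    """Hypothetical chain: action 1 moves right, action 0 moves left.
    Reaching the last state gives reward 1.0 and ends the episode."""
    if action == 1:
        next_state = min(state + 1, N_STATES - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(steps=3000, seed=0):
    rng = random.Random(seed)
    # Step 1: Q and its target copy (tables standing in for networks),
    # epsilon = 1.0, and an empty replay buffer.
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    q_hat = copy.deepcopy(q)
    buffer = collections.deque(maxlen=REPLAY_SIZE)
    epsilon, state = 1.0, 0
    for step in range(1, steps + 1):
        # Step 2: epsilon-greedy action selection.
        if rng.random() < epsilon:
            action = rng.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: q[state][a])
        # Steps 3-4: act and store the transition.
        next_state, reward, done = env_step(state, action)
        buffer.append((state, action, reward, done, next_state))
        state = 0 if done else next_state
        epsilon = max(0.05, epsilon - 1.0 / 1000)
        if len(buffer) >= BATCH_SIZE:
            # Step 5: sample a random minibatch.
            for s, a, r, d, s2 in rng.sample(list(buffer), BATCH_SIZE):
                # Step 6: target from the frozen copy.
                y = r if d else r + GAMMA * max(q_hat[s2])
                # Steps 7-8: gradient step on (Q(s, a) - y)^2 for one entry.
                q[s][a] -= LR * 2.0 * (q[s][a] - y)
        # Step 9: periodic synchronization of the target copy.
        if step % SYNC_EVERY == 0:
            q_hat = copy.deepcopy(q)
    return q  # Step 10 is the loop itself.

q_table = train()
for s, row in enumerate(q_table):
    print(s, [round(v, 3) for v in row])
```

In this toy setting, the learned table should prefer moving right in every non-terminal state, since that is the shortest route to the only reward.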


## DQN on FrozenLake
### DQN Model

In [2]:
import torch
import torch.nn as nn

import numpy as np



#Class defining the deep Q-network
#For the FrozenLake game we use a simple fully connected network with a single hidden layer
class DQN(nn.Module):
    #Method to initialize the network
    def __init__(self, obs_size, hidden_size, output_size):
        super(DQN, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size)
        )

    #Method to perform a single forward pass
    def forward(self, x):
        return self.net(x)



### DQN  Training on FrozenLake

In [183]:
#!/usr/bin/env python3

import gym

import argparse
import time
import numpy as np
import collections

import torch
import torch.nn as nn
import torch.optim as optim

from tensorboardX import SummaryWriter


#DEFAULT_ENV_NAME = "PongNoFrameskip-v4"
DEFAULT_ENV_NAME = "FrozenLake-v0"
#DEFAULT_ENV_NAME = "FrozenLake8x8-v0"

#MEAN_REWARD_BOUND = 19.5

GAMMA = 0.99 #0.99 #reward discount
BATCH_SIZE = 32 #32
REPLAY_SIZE = 10000 #10000
LEARNING_RATE = 1e-4 #1e-4
SYNC_TARGET_FRAMES = 1000 #1000
REPLAY_START_SIZE = 10000 #10000
HIDDEN_SIZE = 128 #Hidden layer size

#Epsilon control parameters - to control the explore/exploit decision
EPSILON_DECAY_LAST_FRAME = 10**5 #10**5
EPSILON_START = .7
EPSILON_FINAL = 0.01


#Create a named tuple type for a single transition
Experience = collections.namedtuple('Experience', field_names=['state', 'action', 'reward', 'done', 'new_state'])

#A class for replay buffer
class ExperienceBuffer:
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size): #Method to draw a random mini-batch of a given batch_size
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)
        states, actions, rewards, dones, next_states = zip(*[self.buffer[idx] for idx in indices])
        return np.array(states), np.array(actions), np.array(rewards, dtype=np.float32), \
               np.array(dones, dtype=np.uint8), np.array(next_states)

#Inheriting from Gym's ObservationWrapper class
#It converts a discrete observation into a one-hot vector of floats (16 numbers for the 4x4 FrozenLake): zero everywhere, except 1.0 at the current location of the agent
class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res


#The Agent class
class Agent:
    def __init__(self, env, exp_buffer): #Method to initialize the agent
        self.env = env
        self.exp_buffer = exp_buffer #Init the experience buffer
        self._reset()

    def _reset(self): #Method to reset the agent
        self.state = self.env.reset()
        self.total_reward = 0.0

    def best_action(self, net, state): #Method to pick the greedy action for a given state
        state_v = torch.tensor(np.array([state])) # .to(device) #Add a batch dimension and convert to a tensor
        q_vals_v = net(state_v) #Score all actions with the network
        _, act_v = torch.max(q_vals_v, dim=1) #Index of the largest Q-value
        return int(act_v.item())

    def play_step(self, net, epsilon=0.0): #A method to play a single step
        done_reward = None

        #Make an explore/exploit decision
        if np.random.random() < epsilon: #Explore: draw from a uniform distribution and check if the result is below epsilon
            action = self.env.action_space.sample() #Sample a random action from the environment
        else: #Exploit: pick the action with the highest predicted Q-value
            action = self.best_action(net, self.state)

        #Make a step in the environment
        new_state, reward, is_done, _ = self.env.step(action) #Perform a step in the agent's environment with the selected action
        self.total_reward += reward #Increment the total reward with the new reward received

        exp = Experience(self.state, action, reward, is_done, new_state) #Create a new experience entry
        self.exp_buffer.append(exp) #Append the experience entry
        self.state = new_state
        if is_done:
            done_reward = self.total_reward
            self._reset()
        return done_reward

    def play_episode(self, env, net): #Play one greedy test episode without touching the replay buffer
        state = env.reset() #Reset the test environment (not the one used to fill the replay buffer)
        total_reward = 0.0
        steps = 0
        while True:
            action = self.best_action(net, state) #Get the best action for the current state
            new_state, reward, is_done, _ = env.step(action) #Make a step in the environment with the best action
            total_reward += reward #Increment the total reward with the new reward
            if is_done:
                break
            state = new_state #The new state becomes the current state
            steps += 1
        return total_reward, steps


#Calculate the loss on a given mini-batch for net and tgt_net
def calc_loss(batch, net, tgt_net):
    states, actions, rewards, dones, next_states = batch
    
    #Create tensors for states, next_states, actions, rewards and done flags
    states_v = torch.tensor(states) #.to(device)
    #print('states_v.shape=',states_v.shape,' states_v=', states_v)
    
    next_states_v = torch.tensor(next_states)#.to(device)
    #print('next_states_v.shape=',next_states_v.shape,' next_states_v=', next_states_v)
    actions_v = torch.tensor(actions) #.to(device)
    rewards_v = torch.tensor(rewards) #.to(device)
    done_mask = torch.ByteTensor(dones) #.to(device)

    #state_action_values = sm(net(states_v)).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1) #Apply the net to the states, then extract the Q-values of the actions that were taken
    #print('net(states_v)', net(states_v), 'actions_v=', actions_v, 'actions_v.unsqueeze(-1)=', actions_v.unsqueeze(-1),
    #      'net(states_v).gather(1, actions_v.unsqueeze(-1))=', net(states_v).gather(1, actions_v.unsqueeze(-1)),
    #      'net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)=', net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1))
    
    #next_state_values = sm(tgt_net(next_states_v)).max(1)[0]
    next_state_values = tgt_net(next_states_v).max(1)[0] #Apply the target network to the next-state observations and take the maximum Q-value along the action dimension (1). max() returns both the maximum values and their indices (max and argmax); we only need the values, so we take the first entry of the result.
    next_state_values[done_mask] = 0.0 #Zero the next-state value for transitions that ended an episode
    next_state_values = next_state_values.detach() #Detach the value from its computation graph to prevent gradients from flowing into the target network
    expected_state_action_values = next_state_values * GAMMA + rewards_v #Calculate the expected state-action values (the Bellman targets)
    #print('next_state_values:', next_state_values, 'expected_state_action_values:', expected_state_action_values)
    
    
    loss = nn.MSELoss()(state_action_values, expected_state_action_values) #calculate the loss and return it
    #print('loss:', loss)
    #print('====================================================================')

    return loss


if __name__ == "__main__":
# Setup for GPU
#     parser = argparse.ArgumentParser()
#     parser.add_argument("--cuda", default=False, action="store_true", help="Enable cuda")
#     parser.add_argument("--env", default=DEFAULT_ENV_NAME,
#                         help="Name of the environment, default=" + DEFAULT_ENV_NAME)
#     parser.add_argument("--reward", type=float, default=MEAN_REWARD_BOUND,
#                         help="Mean reward boundary for stop of training, default=%.2f" % MEAN_REWARD_BOUND)
#     args = parser.parse_args()
#     device = torch.device("cuda" if args.cuda else "cpu")

#    env = wrappers.make_env(args.env)

    env = DiscreteOneHotWrapper(gym.make(DEFAULT_ENV_NAME)) #creating a new environment 
    

    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n  # 4 actions (up, down, left, right) in the case of FrozenLake

    net = DQN(obs_size, HIDDEN_SIZE, n_actions) #.to(device)
    tgt_net = DQN(obs_size, HIDDEN_SIZE, n_actions)#.to(device)
#    net = DQN(obs_size, HIDDEN_SIZE, OUTPUT_SIZE) #.to(device)
#    tgt_net = DQN(obs_size, HIDDEN_SIZE, OUTPUT_SIZE)#.to(device)
    writer = SummaryWriter(comment="-") #(comment="-" + args.env)
    print(net)

    buffer = ExperienceBuffer(REPLAY_SIZE)
    agent = Agent(env, buffer) #Creating a new instance of an agent
    epsilon = EPSILON_START # Initial exploration rate: a higher value means more exploration and less exploitation

    optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
    total_rewards = []
    iter_idx = 0
    ts_frame = 0
    ts = time.time()
    best_mean_reward = None
    
    sm = nn.Softmax(dim=1) #creating a softmax function 
    prev_idx = 0
    prev_games = 0

    while True:
        iter_idx += 1
        epsilon = max(EPSILON_FINAL, EPSILON_START - iter_idx / EPSILON_DECAY_LAST_FRAME) # Linearly decay epsilon (the exploration rate) with the iteration count, never going below EPSILON_FINAL

        reward = agent.play_step(net, epsilon) #, device=device)
        if reward is not None:
            total_rewards.append(reward)
            speed = (iter_idx - ts_frame) / (time.time() - ts)
            ts_frame = iter_idx
            ts = time.time()
            mean_reward = np.mean(total_rewards[-100:])
#             print("%d: done %d games, mean reward %.3f, eps %.2f, speed %.2f f/s" % (
#                 iter_idx, len(total_rewards), mean_reward, epsilon, speed))
            writer.add_scalar("epsilon", epsilon, iter_idx)
            writer.add_scalar("speed", speed, iter_idx)
            writer.add_scalar("reward_100", mean_reward, iter_idx)
            writer.add_scalar("reward", reward, iter_idx)
            if best_mean_reward is None or best_mean_reward < mean_reward:
                torch.save(net.state_dict(), "-best.dat") #Save (serialize the model parameters)
                if best_mean_reward is not None:
                    print("total_iter=%d, delta_iteration=%d, games=%d, mean reward=%.3f, eps=%.2f" 
                          % (iter_idx, iter_idx - prev_idx, len(total_rewards) - prev_games, mean_reward, epsilon))
                    agent.explore_cnt = 0
                    agent.exploite_cnt = 0
                    prev_idx = iter_idx
                    prev_games = len(total_rewards)
                    tgt_net.load_state_dict(net.state_dict())

                best_mean_reward = mean_reward
            if mean_reward > 0.8: # args.reward:
                print("Solved in %d frames!" % iter_idx)
                break

        if len(buffer) < REPLAY_START_SIZE:
            continue

        if iter_idx % SYNC_TARGET_FRAMES == 0:
            tgt_net.load_state_dict(net.state_dict()) # Copy the trained network's parameters into the target network


        optimizer.zero_grad()
        batch = buffer.sample(BATCH_SIZE) # Sample a batch of transitions from the experience buffer
        loss_t = calc_loss(batch, net, tgt_net) #, device=device)
        #print('loss=',loss_t.abs())
        loss_t.backward()
        optimizer.step()
        
    writer.close()

DQN(
  (net): Sequential(
    (0): Linear(in_features=16, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=4, bias=True)
  )
)
total_iter=1101, delta_iteration=1101, games=139, mean reward=0.010, eps=0.69
total_iter=4487, delta_iteration=3386, games=436, mean reward=0.020, eps=0.66
total_iter=4856, delta_iteration=369, games=53, mean reward=0.030, eps=0.65
total_iter=12640, delta_iteration=7784, games=924, mean reward=0.040, eps=0.57
total_iter=12806, delta_iteration=166, games=15, mean reward=0.050, eps=0.57
total_iter=14321, delta_iteration=1515, games=119, mean reward=0.060, eps=0.56
total_iter=14598, delta_iteration=277, games=19, mean reward=0.070, eps=0.55
total_iter=14863, delta_iteration=265, games=25, mean reward=0.080, eps=0.55
total_iter=18601, delta_iteration=3738, games=331, mean reward=0.090, eps=0.51
total_iter=18635, delta_iteration=34, games=3, mean reward=0.100, eps=0.51
total_iter=31280, delta_iteration=12645, games=934, mean
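
To make the target computed in `calc_loss` concrete, the arithmetic can be traced by hand on a few made-up transitions. The sketch below uses plain Python lists in place of tensors; the discount factor, the sample numbers, and the helper names `bellman_targets` and `mse` are assumptions for illustration only and are not part of the original listing.

```python
GAMMA = 0.9  # hypothetical discount factor, for illustration only


def bellman_targets(rewards, dones, next_q_values, gamma=GAMMA):
    """Return reward + gamma * max Q(next_state, a) per transition, with the
    bootstrap term zeroed for transitions that ended an episode."""
    targets = []
    for reward, done, next_q in zip(rewards, dones, next_q_values):
        best_next = 0.0 if done else max(next_q)
        targets.append(reward + gamma * best_next)
    return targets


def mse(predicted, target):
    """Mean squared error, which is what nn.MSELoss computes by default."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Three made-up transitions with four actions each, as in FrozenLake.
rewards = [0.0, 0.0, 1.0]
dones = [False, False, True]
next_q_values = [
    [0.1, 0.5, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.9, 0.4, 0.1],  # ignored: this transition ended the episode
]
predicted = [0.4, 0.1, 0.8]  # Q(s, a) for the actions actually taken

targets = bellman_targets(rewards, dones, next_q_values)
print(targets)
print(mse(predicted, targets))
```

The third transition shows why the done mask matters: without it, the target would bootstrap from Q-values of a state that is never actually visited after the episode ends.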

This chart shows how epsilon and the mean reward change over the training iterations.
![](img/fig6-1.png)
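
The epsilon curve in the chart comes from the linear decay applied at the top of the training loop. A minimal sketch of that schedule follows. The three constant values are assumptions picked for illustration (the real ones are defined earlier in the listing), and `epsilon_at` is a hypothetical helper name that does not appear in the original code.

```python
# Hypothetical values, for illustration only.
EPSILON_START = 0.7
EPSILON_FINAL = 0.02
EPSILON_DECAY_LAST_FRAME = 10**5  # ** is exponentiation; ^ is bitwise XOR in Python


def epsilon_at(iter_idx):
    """Linear decay from EPSILON_START, clipped from below at EPSILON_FINAL."""
    return max(EPSILON_FINAL, EPSILON_START - iter_idx / EPSILON_DECAY_LAST_FRAME)


# Print epsilon at a few iteration counts, early and late in training.
for step in (0, 1101, 18601, 68000, 200000):
    print(step, round(epsilon_at(step), 2))
```

With a schedule like this, the agent explores heavily at first and gradually shifts toward exploiting the Q-values it has learned; once the decay reaches the floor, epsilon stays constant for the rest of training.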

net.state_dict()

OrderedDict([('net.0.weight',
              tensor([[-0.0760, -0.2173,  0.2194,  ..., -0.2344, -0.0173,  0.0881],
                      [-0.2444,  0.0782,  0.1192,  ...,  0.2748, -0.1500,  0.2343],
                      [ 0.1910, -0.1036,  0.1478,  ...,  0.1170, -0.1721,  0.0808],
                      ...,
                      [ 0.1321, -0.0216,  0.0821,  ..., -0.1359,  0.1232, -0.0817],
                      [-0.0653,  0.2222,  0.1947,  ..., -0.2278,  0.1962,  0.1978],
                      [-0.1780,  0.0648, -0.2420,  ..., -0.1714, -0.0635, -0.1386]])),
             ('net.0.bias',
              tensor([-0.2203,  0.1481,  0.0939,  0.1047,  0.0848,  0.0566,  0.1941, -0.1399,
                       0.0489, -0.2296,  0.1513,  0.1908,  0.0456, -0.0679, -0.1231,  0.1601,
                      -0.0535, -0.1717, -0.2272,  0.0862, -0.2288,  0.1366,  0.1227,  0.2410,
                       0.1662, -0.0878,  0.2100, -0.2190, -0.1651, -0.1694,  0.0623, -0.2102,
                      -0.0369,  
