# Deep Q-Learning 

For this assignment we will implement the Deep Q-Learning algorithm with Experience Replay, as described in the breakthrough paper __"Playing Atari with Deep Reinforcement Learning"__. We will train an agent to play the famous game of __Breakout__.
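
As a rough illustration of the two ingredients named above, the following sketch shows a fixed-capacity replay buffer with uniform sampling and the one-step Q-learning target. It is a minimal sketch, not the code in __agent.py__: `ReplayBuffer`, `Transition`, `td_target`, and the default `gamma=0.99` are hypothetical names and values chosen for illustration.

```python
import random
from collections import deque, namedtuple

# Hypothetical transition record; the notebook's agent.memory stores other fields.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        # Sampling uniformly breaks the correlation between consecutive frames.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target: r if the state is terminal, else r + gamma * max Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

In a full agent, the network would be regressed towards `td_target` for each sampled transition; here `next_q_values` is just a plain list of numbers standing in for the network's output.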

In [1]:
import sys
import gym
import torch
import pylab
import random
import numpy as np
from collections import deque
from datetime import datetime
from copy import deepcopy
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.autograd import Variable
from utils import *
from agent import *
from model import *
from config import *
%matplotlib inline
%load_ext autoreload
%autoreload 2

## Understanding the environment

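In the following cell we initialise the game so you can see what the environment looks like. Note that the cell as written loads `SpaceInvaders-v0` rather than __Breakout__, which is consistent with `action_size = 6` in the cell after it. For further documentation of the environment, refer to https://gym.openai.com/envs.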

In [2]:
env = gym.make('SpaceInvaders-v0')
#env.render()

In [3]:
number_lives = find_max_lifes(env)
state_size = env.observation_space.shape
action_size = 6
rewards, episodes = [], []

## Creating a DQN Agent

Here we create a DQN agent. The agent is defined in __agent.py__, and the corresponding neural network is defined in __model.py__.

__Evaluation Reward__ : The average reward received in the past 100 episodes/games.

__Frame__ : Number of frames processed in total.

__Memory Size__ : The current size of the replay memory.
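
The evaluation reward defined above is a rolling mean over a bounded window of episode scores. A minimal sketch of that bookkeeping follows; `rolling_mean` is a hypothetical helper, and the window of 100 is taken from the definition above (the notebook itself reads the length from `evaluation_reward_length` in the config).

```python
from collections import deque


def rolling_mean(scores, window=100):
    """Return the mean of the most recent `window` scores after each episode."""
    recent = deque(maxlen=window)  # oldest score is dropped once the window is full
    means = []
    for score in scores:
        recent.append(score)
        means.append(sum(recent) / len(recent))
    return means
```

Until the window fills up, the mean is taken over however many episodes have finished so far. Feeding in the first three scores from the training log (155, 15, 80) gives 155.0, 85.0 and 83.33..., which matches the logged evaluation rewards.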

In [4]:
agent = Agent(action_size)
evaluation_reward = deque(maxlen=evaluation_reward_length)
frame = 0
memory_size = 0


### Main Training Loop
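
The loop in the next cell keeps a five-slot `history` buffer: the agent acts on the first four slots, the incoming frame is written into the fifth, and the buffer is then shifted by one. The sketch below reproduces that slicing on a plain Python list so the effect is easy to follow. `shift_history` is a hypothetical helper, and the integers stand in for 84x84 frames.

```python
def shift_history(history, new_frame):
    """Write new_frame into the spare last slot, then slide the window by one."""
    history[-1] = new_frame        # mirrors history[4, :, :] = frame_next_state
    history[:-1] = history[1:]     # mirrors history[:4, :, :] = history[1:, :, :]
    return history


history = [0, 0, 0, 0, 0]          # five slots, all empty at episode start
shift_history(history, 1)
shift_history(history, 2)
agent_input = history[:4]          # the four most recent frames, oldest first
```

After the two calls, `agent_input` holds two empty frames followed by frames 1 and 2. The last slot briefly duplicates the newest frame until the next step overwrites it.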

In [None]:
for e in range(EPISODES):
    done = False
    score = 0

    # Five slots: the first four are the agent's input, the fifth receives the incoming frame
    history = np.zeros([5, 84, 84], dtype=np.uint8)
    step = 0
    state = env.reset()
    life = number_lives

    get_init_state(history, state)

    while not done:
        step += 1
        frame += 1
        if render_breakout:
            env.render()

        # Select and perform an action
        action, value = agent.get_action(np.float32(history[:4, :, :]) / 255.)

        
        next_state, reward, done, info = env.step(action)

        frame_next_state = get_frame(next_state)
        history[4, :, :] = frame_next_state
        terminal_state = check_live(life, info['ale.lives'])

        life = info['ale.lives']
        #r = np.clip(reward, -1, 1)
        r = reward
        
        # Store the transition in memory 
        agent.memory.push(deepcopy(frame_next_state), action, r, terminal_state, value, 0, 0)
        # Start training after random sample generation
        if frame % train_frame == 0:
            agent.train_policy_net(frame)
            # Update the target network
            agent.update_target_net()
        score += r
        history[:4, :, :] = history[1:, :, :]

        if frame % 50000 == 0:
            print('now time : ', datetime.now())
            rewards.append(np.mean(evaluation_reward))
            episodes.append(e)
            pylab.plot(episodes, rewards, 'b')
            pylab.savefig("./save_graph/breakout_dqn.png")

        if done:
            evaluation_reward.append(score)
            # at the end of every episode, print the training statistics
            print("episode:", e, "  score:", score, "  memory length:",
                  len(agent.memory), "  epsilon:", agent.epsilon, "   steps:", step,
                  "    evaluation reward:", np.mean(evaluation_reward))

            # if the mean evaluation reward exceeds 40 and the window holds more than 700 scores,
            # save the model and stop training
            if np.mean(evaluation_reward) > 40 and len(evaluation_reward) > 700:
                torch.save(agent.policy_net, "./save_model/breakout_dqn")
                sys.exit()



episode: 0   score: 155.0   memory length: 925   epsilon: 1.0    steps: 925     evaluation reward: 155.0
episode: 1   score: 15.0   memory length: 1427   epsilon: 1.0    steps: 502     evaluation reward: 85.0
episode: 2   score: 80.0   memory length: 2052   epsilon: 1.0    steps: 625     evaluation reward: 83.33333333333333
episode: 3   score: 55.0   memory length: 2416   epsilon: 1.0    steps: 364     evaluation reward: 76.25
episode: 4   score: 160.0   memory length: 3150   epsilon: 1.0    steps: 734     evaluation reward: 93.0
episode: 5   score: 50.0   memory length: 3635   epsilon: 1.0    steps: 485     evaluation reward: 85.83333333333333
Training network




Iteration 1: Policy loss: -0.000008. Value loss: 2.160420. Entropy: 1.790013.
Iteration 2: Policy loss: -0.000068. Value loss: 2.151835. Entropy: 1.789993.
Iteration 3: Policy loss: -0.000094. Value loss: 2.162931. Entropy: 1.790013.
episode: 6   score: 120.0   memory length: 4096   epsilon: 1.0    steps: 616     evaluation reward: 90.71428571428571
episode: 7   score: 60.0   memory length: 4096   epsilon: 1.0    steps: 612     evaluation reward: 86.875
episode: 8   score: 175.0   memory length: 4096   epsilon: 1.0    steps: 1080     evaluation reward: 96.66666666666667
episode: 9   score: 105.0   memory length: 4096   epsilon: 1.0    steps: 636     evaluation reward: 97.5
episode: 10   score: 55.0   memory length: 4096   epsilon: 1.0    steps: 454     evaluation reward: 93.63636363636364
episode: 11   score: 105.0   memory length: 4096   epsilon: 1.0    steps: 638     evaluation reward: 94.58333333333333
Training network
Iteration 4: Policy loss: 0.000001. Value loss: 2.296194. Entrop

episode: 58   score: 180.0   memory length: 4096   epsilon: 1.0    steps: 635     evaluation reward: 141.4406779661017
episode: 59   score: 355.0   memory length: 4096   epsilon: 1.0    steps: 1066     evaluation reward: 145.0
Training network
Iteration 28: Policy loss: 0.000003. Value loss: 4.206216. Entropy: 1.790148.
Iteration 29: Policy loss: -0.000041. Value loss: 4.158674. Entropy: 1.790138.
Iteration 30: Policy loss: -0.000082. Value loss: 4.215108. Entropy: 1.790112.
episode: 60   score: 210.0   memory length: 4096   epsilon: 1.0    steps: 837     evaluation reward: 146.0655737704918
episode: 61   score: 110.0   memory length: 4096   epsilon: 1.0    steps: 768     evaluation reward: 145.48387096774192
episode: 62   score: 45.0   memory length: 4096   epsilon: 1.0    steps: 414     evaluation reward: 143.88888888888889
episode: 63   score: 155.0   memory length: 4096   epsilon: 1.0    steps: 522     evaluation reward: 144.0625
episode: 64   score: 80.0   memory length: 4096   ep

episode: 111   score: 35.0   memory length: 4096   epsilon: 1.0    steps: 333     evaluation reward: 137.75
episode: 112   score: 80.0   memory length: 4096   epsilon: 1.0    steps: 605     evaluation reward: 137.45
episode: 113   score: 110.0   memory length: 4096   epsilon: 1.0    steps: 602     evaluation reward: 137.7
Training network
Iteration 52: Policy loss: -0.000013. Value loss: 2.008911. Entropy: 1.790038.
Iteration 53: Policy loss: -0.000075. Value loss: 2.010473. Entropy: 1.790040.
Iteration 54: Policy loss: -0.000090. Value loss: 2.008089. Entropy: 1.790053.
episode: 114   score: 80.0   memory length: 4096   epsilon: 1.0    steps: 684     evaluation reward: 137.45
episode: 115   score: 180.0   memory length: 4096   epsilon: 1.0    steps: 833     evaluation reward: 136.5
episode: 116   score: 115.0   memory length: 4096   epsilon: 1.0    steps: 749     evaluation reward: 136.25
episode: 117   score: 165.0   memory length: 4096   epsilon: 1.0    steps: 842     evaluation rew

episode: 166   score: 80.0   memory length: 4096   epsilon: 1.0    steps: 596     evaluation reward: 135.8
Training network
Iteration 79: Policy loss: 0.000003. Value loss: 3.188288. Entropy: 1.790142.
Iteration 80: Policy loss: -0.000013. Value loss: 3.156586. Entropy: 1.790143.
Iteration 81: Policy loss: -0.000066. Value loss: 3.185999. Entropy: 1.790141.
episode: 167   score: 180.0   memory length: 4096   epsilon: 1.0    steps: 704     evaluation reward: 136.25
episode: 168   score: 160.0   memory length: 4096   epsilon: 1.0    steps: 830     evaluation reward: 136.6
episode: 169   score: 170.0   memory length: 4096   epsilon: 1.0    steps: 896     evaluation reward: 137.55
episode: 170   score: 215.0   memory length: 4096   epsilon: 1.0    steps: 876     evaluation reward: 138.5
episode: 171   score: 180.0   memory length: 4096   epsilon: 1.0    steps: 691     evaluation reward: 138.95
Training network
Iteration 82: Policy loss: 0.000002. Value loss: 3.318778. Entropy: 1.790243.
It

Iteration 107: Policy loss: -0.000021. Value loss: 3.037101. Entropy: 1.790059.
Iteration 108: Policy loss: -0.000047. Value loss: 3.029384. Entropy: 1.790076.
episode: 220   score: 120.0   memory length: 4096   epsilon: 1.0    steps: 806     evaluation reward: 145.05
episode: 221   score: 50.0   memory length: 4096   epsilon: 1.0    steps: 446     evaluation reward: 144.3
episode: 222   score: 180.0   memory length: 4096   epsilon: 1.0    steps: 789     evaluation reward: 144.55
episode: 223   score: 165.0   memory length: 4096   epsilon: 1.0    steps: 808     evaluation reward: 142.8
now time :  2018-12-19 11:41:12.387484
episode: 224   score: 235.0   memory length: 4096   epsilon: 1.0    steps: 807     evaluation reward: 144.2
episode: 225   score: 125.0   memory length: 4096   epsilon: 1.0    steps: 649     evaluation reward: 143.65
Training network
Iteration 109: Policy loss: 0.000002. Value loss: 3.287346. Entropy: 1.790174.
Iteration 110: Policy loss: -0.000041. Value loss: 3.29

episode: 273   score: 155.0   memory length: 4096   epsilon: 1.0    steps: 560     evaluation reward: 157.8
episode: 274   score: 410.0   memory length: 4096   epsilon: 1.0    steps: 897     evaluation reward: 160.3
episode: 275   score: 120.0   memory length: 4096   epsilon: 1.0    steps: 622     evaluation reward: 160.15
Training network
Iteration 136: Policy loss: 0.000002. Value loss: 3.960633. Entropy: 1.790072.
Iteration 137: Policy loss: -0.000029. Value loss: 3.989061. Entropy: 1.790067.
Iteration 138: Policy loss: -0.000081. Value loss: 3.990504. Entropy: 1.790076.
episode: 276   score: 65.0   memory length: 4096   epsilon: 1.0    steps: 704     evaluation reward: 159.6
episode: 277   score: 180.0   memory length: 4096   epsilon: 1.0    steps: 697     evaluation reward: 161.05
episode: 278   score: 225.0   memory length: 4096   epsilon: 1.0    steps: 1150     evaluation reward: 163.25
episode: 279   score: 55.0   memory length: 4096   epsilon: 1.0    steps: 517     evaluation 

In [None]:
torch.save(agent.policy_net, "./save_model/breakout_dqn")