# Deep Q-Learning 

For this assignment we will implement the Deep Q-Learning algorithm with Experience Replay, as described in the breakthrough paper __"Playing Atari with Deep Reinforcement Learning"__. We will train an agent to play the famous game of __Breakout__.
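
To make the experience-replay idea concrete, here is a minimal sketch. It is not the `agent.memory` implementation used in this notebook (that lives in `agent.py`). It stores transitions in a bounded buffer, samples a random minibatch, and computes the one-step Q-learning target `r + gamma * max_a' Q(s', a')`. The names `ReplayBuffer` and `td_target` and the default `gamma=0.99` are assumptions chosen for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-capacity store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement decorrelates consecutive frames.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r if terminal, else r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

In a full agent, the sampled minibatch would be turned into tensors and the network trained to move `Q(s, a)` towards `td_target(...)`, with `next_q_values` typically coming from a separate target network.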

In [1]:
import sys
import gym
import torch
import pylab
import random
import numpy as np
from collections import deque
from datetime import datetime
from copy import deepcopy
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.autograd import Variable
from utils import *
from agent import *
from model import *
from config import *
%matplotlib inline
%load_ext autoreload
%autoreload 2

## Understanding the environment

In the following cell, we initialise the game environment so you can see what it looks like. Note that although this assignment is framed around __Breakout__, the cell as written creates `SpaceInvaders-v0`. For further documentation of the environment, refer to https://gym.openai.com/envs.

In [None]:
env = gym.make('SpaceInvaders-v0')
#env.render()

In [None]:
number_lives = find_max_lifes(env)
state_size = env.observation_space.shape
action_size = 6
rewards, episodes = [], []

## Creating a DQN Agent

Here we create a DQN Agent. The agent is defined in __agent.py__, and the corresponding neural network is defined in __model.py__.

__Evaluation Reward__ : The average reward received in the past 100 episodes/games.

__Frame__ : Number of frames processed in total.

__Memory Size__ : The current size of the replay memory.
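
The evaluation reward is a rolling mean over a bounded window of episode scores. A minimal sketch, assuming the window of 100 episodes stated above (the notebook reads the actual length from `evaluation_reward_length`); `rolling_mean` is a hypothetical helper name, not a function from `utils.py`:

```python
from collections import deque


def rolling_mean(scores, window=100):
    """Return the mean of the most recent `window` scores after each episode."""
    recent = deque(maxlen=window)  # the oldest score is dropped automatically
    means = []
    for score in scores:
        recent.append(score)
        means.append(sum(recent) / len(recent))
    return means


# Scores of the first four episodes, taken from the training output of this notebook.
print(rolling_mean([185.0, 110.0, 165.0, 105.0]))
```

For those four scores the running means are 185.0, 147.5, about 153.33, and 141.25, which agree with the `evaluation reward` values printed by the training loop for episodes 0 to 3.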

In [None]:
agent = Agent(action_size)
evaluation_reward = deque(maxlen=evaluation_reward_length)
frame = 0
memory_size = 0


### Main Training Loop
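
The loop in this section keeps a five-slot `history` buffer: slots 0 to 3 form the stacked state passed to the agent, and slot 4 is a spare slot that receives each new frame before the window is shifted by one. The sketch below reproduces only that bookkeeping, with plain Python lists and string labels standing in for the `(5, 84, 84)` array of frames; `step_history` is a hypothetical name used for illustration.

```python
def step_history(history, new_frame):
    """Write the new frame into the spare slot, then slide the window by one.

    `history` has 5 slots: slots 0-3 are the current stacked state,
    slot 4 is scratch space for the incoming frame.
    """
    history[4] = new_frame      # mirrors history[4, :, :] = frame_next_state
    history[:4] = history[1:]   # mirrors history[:4, :, :] = history[1:, :, :]
    return history[:4]          # the 4-frame state for the next step


history = ["f0", "f1", "f2", "f3", None]
state = step_history(history, "f4")
print(state)
```

After one step the oldest frame `f0` has been dropped and the state is `f1, f2, f3, f4`, so the agent always sees the four most recent frames.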

In [None]:
for e in range(EPISODES):
    done = False
    score = 0

    history = np.zeros([5, 84, 84], dtype=np.uint8)
    step = 0
    d = False
    state = env.reset()
    life = number_lives

    get_init_state(history, state)

    while not done:
        step += 1
        frame += 1
        if render_breakout:
            env.render()

        # Select and perform an action
        action, value = agent.get_action(np.float32(history[:4, :, :]) / 255.)

        
        next_state, reward, done, info = env.step(action)

        frame_next_state = get_frame(next_state)
        history[4, :, :] = frame_next_state
        terminal_state = check_live(life, info['ale.lives'])

        life = info['ale.lives']
        #r = np.clip(reward, -1, 1)
        r = reward
        
        # Store the transition in memory 
        agent.memory.push(deepcopy(frame_next_state), action, r, terminal_state, value, 0, 0)
        # Train the policy network every `train_frame` frames
        if(frame % train_frame == 0):
            agent.train_policy_net(frame)
            # Update the target network
            agent.update_target_net()
        score += r
        history[:4, :, :] = history[1:, :, :]

        if frame % 50000 == 0:
            print('now time : ', datetime.now())
            rewards.append(np.mean(evaluation_reward))
            episodes.append(e)
            pylab.plot(episodes, rewards, 'b')
            pylab.savefig("./save_graph/breakout_dqn.png")

        if done:
            evaluation_reward.append(score)
            # at the end of every episode, print the episode statistics
            print("episode:", e, "  score:", score, "  memory length:",
                  len(agent.memory), "  epsilon:", agent.epsilon, "   steps:", step,
                  "    evaluation reward:", np.mean(evaluation_reward))

            # stop training once the mean evaluation reward exceeds 40
            # and more than 700 scores have been recorded
            if np.mean(evaluation_reward) > 40 and len(evaluation_reward) > 700:
                torch.save(agent.policy_net, "./save_model/breakout_dqn")
                sys.exit()

episode: 0   score: 185.0   memory length: 821   epsilon: 1.0    steps: 821     evaluation reward: 185.0
Training network
Iteration 1: Policy loss: -13.282112. Value loss: 12.888132. Entropy: 1.790153.
Iteration 2: Policy loss: -13.259413. Value loss: 12.864014. Entropy: 1.790195.
Iteration 3: Policy loss: -13.290983. Value loss: 12.893475. Entropy: 1.790215.
episode: 1   score: 110.0   memory length: 1024   epsilon: 1.0    steps: 676     evaluation reward: 147.5
Training network
Iteration 4: Policy loss: -11.884912. Value loss: 11.576634. Entropy: 1.790292.
Iteration 5: Policy loss: -12.058539. Value loss: 11.745104. Entropy: 1.790237.
Iteration 6: Policy loss: -11.863825. Value loss: 11.551113. Entropy: 1.790263.
episode: 2   score: 165.0   memory length: 1024   epsilon: 1.0    steps: 676     evaluation reward: 153.33333333333334
episode: 3   score: 105.0   memory length: 1024   epsilon: 1.0    steps: 533     evaluation reward: 141.25
Training network
Iteration 7: Policy loss: -7.510808. Value loss: 7.236870. Entropy: 1.790373.
Iteration 8: Policy loss: -7.452139. Value loss: 7.178435. Entropy: 1.790372

episode: 29   score: 105.0   memory length: 1024   epsilon: 1.0    steps: 476     evaluation reward: 158.33333333333334
Training network
Iteration 61: Policy loss: -11.958920. Value loss: 11.627741. Entropy: 1.790206.
Iteration 62: Policy loss: -11.702659. Value loss: 11.371776. Entropy: 1.790171.
Iteration 63: Policy loss: -11.918467. Value loss: 11.584954. Entropy: 1.790172.
episode: 30   score: 50.0   memory length: 1024   epsilon: 1.0    steps: 419     evaluation reward: 154.83870967741936
Training network
Iteration 64: Policy loss: -12.631729. Value loss: 12.279160. Entropy: 1.790323.
Iteration 65: Policy loss: -12.527327. Value loss: 12.177687. Entropy: 1.790295.
Iteration 66: Policy loss: -12.613468. Value loss: 12.262482. Entropy: 1.790321.
episode: 31   score: 205.0   memory length: 1024   epsilon: 1.0    steps: 976     evaluation reward: 156.40625
Training network
Iteration 67: Policy loss: -11.393254. Value loss: 11.068705. Entropy: 1.790371.
Iteration 68: Policy loss: -11.2

Iteration 118: Policy loss: -13.004067. Value loss: 12.631794. Entropy: 1.790501.
Iteration 119: Policy loss: -12.992560. Value loss: 12.619584. Entropy: 1.790449.
Iteration 120: Policy loss: -12.858914. Value loss: 12.486899. Entropy: 1.790491.
episode: 57   score: 210.0   memory length: 1024   epsilon: 1.0    steps: 981     evaluation reward: 146.1206896551724
Training network
Iteration 121: Policy loss: -10.303321. Value loss: 10.008017. Entropy: 1.790239.
Iteration 122: Policy loss: -10.300719. Value loss: 10.004076. Entropy: 1.790251.
Iteration 123: Policy loss: -10.126807. Value loss: 9.833957. Entropy: 1.790221.
episode: 58   score: 330.0   memory length: 1024   epsilon: 1.0    steps: 1171     evaluation reward: 149.23728813559322
episode: 59   score: 55.0   memory length: 1024   epsilon: 1.0    steps: 436     evaluation reward: 147.66666666666666
Training network
Iteration 124: Policy loss: -10.517057. Value loss: 10.269119. Entropy: 1.790402.
Iteration 125: Policy loss: -10.58

Training network
Iteration 175: Policy loss: -9.840235. Value loss: 9.622861. Entropy: 1.790485.
Iteration 176: Policy loss: -9.797318. Value loss: 9.575883. Entropy: 1.790477.
Iteration 177: Policy loss: -9.848945. Value loss: 9.627436. Entropy: 1.790483.
episode: 85   score: 105.0   memory length: 1024   epsilon: 1.0    steps: 531     evaluation reward: 148.6627906976744
episode: 86   score: 180.0   memory length: 1024   epsilon: 1.0    steps: 637     evaluation reward: 149.02298850574712
Training network
Iteration 178: Policy loss: -12.567269. Value loss: 12.266815. Entropy: 1.790225.
Iteration 179: Policy loss: -12.529522. Value loss: 12.225691. Entropy: 1.790250.
Iteration 180: Policy loss: -12.347881. Value loss: 12.047434. Entropy: 1.790233.
episode: 87   score: 485.0   memory length: 1024   epsilon: 1.0    steps: 947     evaluation reward: 152.8409090909091
Training network
Iteration 181: Policy loss: -30.344065. Value loss: 29.938408. Entropy: 1.790461.
Iteration 182: Policy l

Iteration 234: Policy loss: -17.092455. Value loss: 16.809120. Entropy: 1.790196.
episode: 112   score: 150.0   memory length: 1024   epsilon: 1.0    steps: 731     evaluation reward: 151.85
Training network
Iteration 235: Policy loss: -8.787663. Value loss: 8.465008. Entropy: 1.790401.
Iteration 236: Policy loss: -8.803160. Value loss: 8.479132. Entropy: 1.790430.
Iteration 237: Policy loss: -8.834949. Value loss: 8.508969. Entropy: 1.790434.
episode: 113   score: 75.0   memory length: 1024   epsilon: 1.0    steps: 654     evaluation reward: 151.05
episode: 114   score: 50.0   memory length: 1024   epsilon: 1.0    steps: 365     evaluation reward: 150.35
Training network
Iteration 238: Policy loss: -6.784424. Value loss: 6.629796. Entropy: 1.790388.
Iteration 239: Policy loss: -6.885087. Value loss: 6.727115. Entropy: 1.790395.
Iteration 240: Policy loss: -6.971486. Value loss: 6.811074. Entropy: 1.790373.
episode: 115   score: 140.0   memory length: 1024   epsilon: 1.0    steps: 1021

Iteration 292: Policy loss: -11.864009. Value loss: 11.634341. Entropy: 1.790278.
Iteration 293: Policy loss: -11.960702. Value loss: 11.725328. Entropy: 1.790285.
Iteration 294: Policy loss: -11.827771. Value loss: 11.597469. Entropy: 1.790292.
episode: 141   score: 100.0   memory length: 1024   epsilon: 1.0    steps: 625     evaluation reward: 152.65
episode: 142   score: 80.0   memory length: 1024   epsilon: 1.0    steps: 553     evaluation reward: 152.9
Training network
Iteration 295: Policy loss: -9.150613. Value loss: 8.996709. Entropy: 1.790313.
Iteration 296: Policy loss: -9.135756. Value loss: 8.981361. Entropy: 1.790288.
Iteration 297: Policy loss: -9.171033. Value loss: 9.014949. Entropy: 1.790341.
episode: 143   score: 295.0   memory length: 1024   epsilon: 1.0    steps: 977     evaluation reward: 154.85
Training network
Iteration 298: Policy loss: -17.562700. Value loss: 17.257673. Entropy: 1.790212.
Iteration 299: Policy loss: -17.664442. Value loss: 17.353514. Entropy: 1

episode: 169   score: 80.0   memory length: 1024   epsilon: 1.0    steps: 594     evaluation reward: 152.95
Training network
Iteration 352: Policy loss: -5.483321. Value loss: 5.197840. Entropy: 1.790164.
Iteration 353: Policy loss: -5.456127. Value loss: 5.170141. Entropy: 1.790117.
Iteration 354: Policy loss: -5.576268. Value loss: 5.287753. Entropy: 1.790116.
episode: 170   score: 80.0   memory length: 1024   epsilon: 1.0    steps: 517     evaluation reward: 152.7
Training network
Iteration 355: Policy loss: -8.815100. Value loss: 8.541373. Entropy: 1.790378.
Iteration 356: Policy loss: -8.786934. Value loss: 8.513477. Entropy: 1.790404.
Iteration 357: Policy loss: -8.698499. Value loss: 8.423737. Entropy: 1.790358.
episode: 171   score: 105.0   memory length: 1024   epsilon: 1.0    steps: 664     evaluation reward: 150.6
Training network
Iteration 358: Policy loss: -11.218341. Value loss: 11.016890. Entropy: 1.790445.
Iteration 359: Policy loss: -11.337604. Value loss: 11.128952. E

Training network
Iteration 409: Policy loss: -10.127960. Value loss: 9.884206. Entropy: 1.790458.
Iteration 410: Policy loss: -10.143192. Value loss: 9.900547. Entropy: 1.790484.
Iteration 411: Policy loss: -10.250078. Value loss: 10.001839. Entropy: 1.790487.
episode: 200   score: 210.0   memory length: 1024   epsilon: 1.0    steps: 897     evaluation reward: 141.55
episode: 201   score: 155.0   memory length: 1024   epsilon: 1.0    steps: 511     evaluation reward: 141.0
Training network
Iteration 412: Policy loss: -13.018665. Value loss: 12.868150. Entropy: 1.790340.
Iteration 413: Policy loss: -12.793745. Value loss: 12.647044. Entropy: 1.790326.
Iteration 414: Policy loss: -12.914585. Value loss: 12.766488. Entropy: 1.790329.
episode: 202   score: 335.0   memory length: 1024   epsilon: 1.0    steps: 833     evaluation reward: 142.8
Training network
Iteration 415: Policy loss: -21.584349. Value loss: 21.303173. Entropy: 1.790346.
Iteration 416: Policy loss: -21.189678. Value loss: 

Training network
Iteration 469: Policy loss: -9.090879. Value loss: 8.889984. Entropy: 1.789951.
Iteration 470: Policy loss: -9.261212. Value loss: 9.052521. Entropy: 1.789976.
Iteration 471: Policy loss: -9.067354. Value loss: 8.864246. Entropy: 1.789935.
episode: 228   score: 55.0   memory length: 1024   epsilon: 1.0    steps: 609     evaluation reward: 139.75
episode: 229   score: 90.0   memory length: 1024   epsilon: 1.0    steps: 551     evaluation reward: 139.95
Training network
Iteration 472: Policy loss: -4.740092. Value loss: 4.624137. Entropy: 1.790464.
Iteration 473: Policy loss: -4.740327. Value loss: 4.621662. Entropy: 1.790450.
Iteration 474: Policy loss: -4.717375. Value loss: 4.601945. Entropy: 1.790468.
episode: 230   score: 410.0   memory length: 1024   epsilon: 1.0    steps: 1039     evaluation reward: 142.8
Training network
Iteration 475: Policy loss: -22.973169. Value loss: 22.628332. Entropy: 1.790506.
Iteration 476: Policy loss: -23.199913. Value loss: 22.850729.

Iteration 527: Policy loss: -13.330326. Value loss: 13.051749. Entropy: 1.790506.
Iteration 528: Policy loss: -13.072192. Value loss: 12.793443. Entropy: 1.790530.
episode: 258   score: 255.0   memory length: 1024   epsilon: 1.0    steps: 911     evaluation reward: 134.25
episode: 259   score: 210.0   memory length: 1024   epsilon: 1.0    steps: 635     evaluation reward: 134.9
Training network
Iteration 529: Policy loss: -17.130033. Value loss: 16.871250. Entropy: 1.790529.
Iteration 530: Policy loss: -17.176376. Value loss: 16.909576. Entropy: 1.790534.
Iteration 531: Policy loss: -17.156563. Value loss: 16.892328. Entropy: 1.790556.
episode: 260   score: 210.0   memory length: 1024   epsilon: 1.0    steps: 723     evaluation reward: 134.9
Training network
Iteration 532: Policy loss: -13.760493. Value loss: 13.504955. Entropy: 1.790334.
Iteration 533: Policy loss: -13.771202. Value loss: 13.513225. Entropy: 1.790390.
Iteration 534: Policy loss: -13.698047. Value loss: 13.440574. Entr

Iteration 586: Policy loss: -9.609241. Value loss: 9.577874. Entropy: 1.790381.
Iteration 587: Policy loss: -9.735984. Value loss: 9.696740. Entropy: 1.790365.
Iteration 588: Policy loss: -9.669606. Value loss: 9.635387. Entropy: 1.790370.
episode: 287   score: 75.0   memory length: 1024   epsilon: 1.0    steps: 388     evaluation reward: 144.75
episode: 288   score: 150.0   memory length: 1024   epsilon: 1.0    steps: 656     evaluation reward: 145.15
Training network
Iteration 589: Policy loss: -11.020981. Value loss: 10.808487. Entropy: 1.790333.
Iteration 590: Policy loss: -11.079111. Value loss: 10.859413. Entropy: 1.790348.
Iteration 591: Policy loss: -10.988745. Value loss: 10.771398. Entropy: 1.790333.
Training network
Iteration 592: Policy loss: -8.558768. Value loss: 8.289751. Entropy: 1.790473.
Iteration 593: Policy loss: -8.835973. Value loss: 8.565042. Entropy: 1.790470.
Iteration 594: Policy loss: -8.714750. Value loss: 8.441933. Entropy: 1.790443.
episode: 289   score: 1

Training network
Iteration 646: Policy loss: -11.717524. Value loss: 11.544561. Entropy: 1.790489.
Iteration 647: Policy loss: -11.650399. Value loss: 11.479871. Entropy: 1.790488.
Iteration 648: Policy loss: -11.582023. Value loss: 11.412333. Entropy: 1.790480.
episode: 315   score: 460.0   memory length: 1024   epsilon: 1.0    steps: 981     evaluation reward: 153.3
Training network
Iteration 649: Policy loss: -27.060955. Value loss: 26.764496. Entropy: 1.790462.
Iteration 650: Policy loss: -27.087791. Value loss: 26.786234. Entropy: 1.790483.
Iteration 651: Policy loss: -27.333565. Value loss: 27.022682. Entropy: 1.790447.
episode: 316   score: 225.0   memory length: 1024   epsilon: 1.0    steps: 994     evaluation reward: 151.75
Training network
Iteration 652: Policy loss: -12.501622. Value loss: 12.451762. Entropy: 1.790420.
Iteration 653: Policy loss: -12.266083. Value loss: 12.220886. Entropy: 1.790401.
Iteration 654: Policy loss: -12.204730. Value loss: 12.159493. Entropy: 1.79

Iteration 708: Policy loss: -5.813091. Value loss: 5.807712. Entropy: 1.790463.
episode: 341   score: 155.0   memory length: 1024   epsilon: 1.0    steps: 644     evaluation reward: 169.9
Training network
Iteration 709: Policy loss: -15.002002. Value loss: 14.971511. Entropy: 1.790446.
Iteration 710: Policy loss: -14.969588. Value loss: 14.937591. Entropy: 1.790469.
Iteration 711: Policy loss: -15.002453. Value loss: 14.969148. Entropy: 1.790431.
episode: 342   score: 300.0   memory length: 1024   epsilon: 1.0    steps: 1018     evaluation reward: 172.35
episode: 343   score: 155.0   memory length: 1024   epsilon: 1.0    steps: 620     evaluation reward: 173.45
Training network
Iteration 712: Policy loss: -12.123165. Value loss: 12.097475. Entropy: 1.790542.
Iteration 713: Policy loss: -12.181132. Value loss: 12.150473. Entropy: 1.790546.
Iteration 714: Policy loss: -12.178974. Value loss: 12.148952. Entropy: 1.790526.
episode: 344   score: 185.0   memory length: 1024   epsilon: 1.0   

Training network
Iteration 766: Policy loss: -7.764306. Value loss: 7.675940. Entropy: 1.790544.
Iteration 767: Policy loss: -7.608905. Value loss: 7.522952. Entropy: 1.790554.
Iteration 768: Policy loss: -7.878951. Value loss: 7.780045. Entropy: 1.790575.
episode: 370   score: 55.0   memory length: 1024   epsilon: 1.0    steps: 618     evaluation reward: 183.2
Training network
Iteration 769: Policy loss: -8.598514. Value loss: 8.722417. Entropy: 1.790594.
Iteration 770: Policy loss: -8.624310. Value loss: 8.745901. Entropy: 1.790595.
Iteration 771: Policy loss: -8.545507. Value loss: 8.668560. Entropy: 1.790600.
episode: 371   score: 210.0   memory length: 1024   epsilon: 1.0    steps: 699     evaluation reward: 184.85
episode: 372   score: 225.0   memory length: 1024   epsilon: 1.0    steps: 754     evaluation reward: 185.95
Training network
Iteration 772: Policy loss: -13.701069. Value loss: 13.657098. Entropy: 1.790636.
Iteration 773: Policy loss: -13.723630. Value loss: 13.682191.

Iteration 825: Policy loss: -3.982701. Value loss: 4.061330. Entropy: 1.790659.
episode: 399   score: 125.0   memory length: 1024   epsilon: 1.0    steps: 636     evaluation reward: 178.0
episode: 400   score: 165.0   memory length: 1024   epsilon: 1.0    steps: 690     evaluation reward: 175.05
Training network
Iteration 826: Policy loss: -10.842195. Value loss: 10.879139. Entropy: 1.790572.
Iteration 827: Policy loss: -10.866974. Value loss: 10.905962. Entropy: 1.790586.
Iteration 828: Policy loss: -10.898937. Value loss: 10.936378. Entropy: 1.790580.
episode: 401   score: 185.0   memory length: 1024   epsilon: 1.0    steps: 796     evaluation reward: 175.7
episode: 402   score: 80.0   memory length: 1024   epsilon: 1.0    steps: 391     evaluation reward: 174.95
Training network
Iteration 829: Policy loss: -11.282341. Value loss: 11.060919. Entropy: 1.790599.
Iteration 830: Policy loss: -11.144481. Value loss: 10.927216. Entropy: 1.790634.
Iteration 831: Policy loss: -11.262127. Val

Iteration 882: Policy loss: -11.741304. Value loss: 11.500789. Entropy: 1.790209.
episode: 429   score: 190.0   memory length: 1024   epsilon: 1.0    steps: 1059     evaluation reward: 158.95
Training network
Iteration 883: Policy loss: -12.723302. Value loss: 12.712211. Entropy: 1.790322.
Iteration 884: Policy loss: -12.779024. Value loss: 12.763927. Entropy: 1.790314.
Iteration 885: Policy loss: -12.754311. Value loss: 12.744040. Entropy: 1.790329.
episode: 430   score: 210.0   memory length: 1024   epsilon: 1.0    steps: 986     evaluation reward: 159.25
Training network
Iteration 886: Policy loss: -20.325565. Value loss: 20.224117. Entropy: 1.790319.
Iteration 887: Policy loss: -20.211386. Value loss: 20.106924. Entropy: 1.790315.
Iteration 888: Policy loss: -20.193645. Value loss: 20.088160. Entropy: 1.790302.
episode: 431   score: 430.0   memory length: 1024   epsilon: 1.0    steps: 1512     evaluation reward: 159.15
Training network
Iteration 889: Policy loss: -9.553556. Value l

Iteration 940: Policy loss: -5.770843. Value loss: 5.672334. Entropy: 1.790347.
Iteration 941: Policy loss: -5.780067. Value loss: 5.678028. Entropy: 1.790316.
Iteration 942: Policy loss: -5.755987. Value loss: 5.658277. Entropy: 1.790288.
episode: 458   score: 110.0   memory length: 1024   epsilon: 1.0    steps: 687     evaluation reward: 144.55
episode: 459   score: 400.0   memory length: 1024   epsilon: 1.0    steps: 731     evaluation reward: 147.2
Training network
Iteration 943: Policy loss: -26.439016. Value loss: 26.221245. Entropy: 1.790645.
Iteration 944: Policy loss: -26.582270. Value loss: 26.363089. Entropy: 1.790618.
Iteration 945: Policy loss: -26.552385. Value loss: 26.316864. Entropy: 1.790609.
episode: 460   score: 105.0   memory length: 1024   epsilon: 1.0    steps: 585     evaluation reward: 147.3
episode: 461   score: 55.0   memory length: 1024   epsilon: 1.0    steps: 401     evaluation reward: 146.85
Training network
Iteration 946: Policy loss: -7.114702. Value lo

Iteration 998: Policy loss: -2.450995. Value loss: 2.669004. Entropy: 1.790593.
Iteration 999: Policy loss: -2.498639. Value loss: 2.706491. Entropy: 1.790598.
episode: 488   score: 20.0   memory length: 1024   epsilon: 1.0    steps: 393     evaluation reward: 136.1
episode: 489   score: 135.0   memory length: 1024   epsilon: 1.0    steps: 659     evaluation reward: 136.1
Training network
Iteration 1000: Policy loss: -10.066287. Value loss: 9.985717. Entropy: 1.790461.
Iteration 1001: Policy loss: -10.088377. Value loss: 10.002498. Entropy: 1.790467.
Iteration 1002: Policy loss: -10.015796. Value loss: 9.932034. Entropy: 1.790447.
episode: 490   score: 115.0   memory length: 1024   epsilon: 1.0    steps: 774     evaluation reward: 135.9
Training network
Iteration 1003: Policy loss: -7.247592. Value loss: 7.124660. Entropy: 1.790644.
Iteration 1004: Policy loss: -7.198859. Value loss: 7.072421. Entropy: 1.790643.
Iteration 1005: Policy loss: -7.242066. Value loss: 7.110168. Entropy: 1.7

Training network
Iteration 1054: Policy loss: -6.298542. Value loss: 6.403737. Entropy: 1.790455.
Iteration 1055: Policy loss: -6.224356. Value loss: 6.331198. Entropy: 1.790466.
Iteration 1056: Policy loss: -6.310217. Value loss: 6.412447. Entropy: 1.790443.
episode: 519   score: 60.0   memory length: 1024   epsilon: 1.0    steps: 617     evaluation reward: 140.05
episode: 520   score: 125.0   memory length: 1024   epsilon: 1.0    steps: 812     evaluation reward: 140.75
Training network
Iteration 1057: Policy loss: -8.166762. Value loss: 8.026156. Entropy: 1.790380.
Iteration 1058: Policy loss: -8.208202. Value loss: 8.060854. Entropy: 1.790404.
Iteration 1059: Policy loss: -8.176497. Value loss: 8.036833. Entropy: 1.790329.
episode: 521   score: 125.0   memory length: 1024   epsilon: 1.0    steps: 768     evaluation reward: 140.5
Training network
Iteration 1060: Policy loss: -7.227183. Value loss: 7.099004. Entropy: 1.790560.
Iteration 1061: Policy loss: -7.167808. Value loss: 7.038

Iteration 1113: Policy loss: -13.726963. Value loss: 13.574418. Entropy: 1.790482.
episode: 547   score: 110.0   memory length: 1024   epsilon: 1.0    steps: 751     evaluation reward: 131.4
Training network
Iteration 1114: Policy loss: -7.874037. Value loss: 7.901690. Entropy: 1.790627.
Iteration 1115: Policy loss: -7.816551. Value loss: 7.840342. Entropy: 1.790635.
Iteration 1116: Policy loss: -7.930241. Value loss: 7.944368. Entropy: 1.790639.
episode: 548   score: 80.0   memory length: 1024   epsilon: 1.0    steps: 430     evaluation reward: 131.15
episode: 549   score: 190.0   memory length: 1024   epsilon: 1.0    steps: 855     evaluation reward: 132.05
Training network
Iteration 1117: Policy loss: -10.307676. Value loss: 10.230753. Entropy: 1.790551.
Iteration 1118: Policy loss: -10.297706. Value loss: 10.221038. Entropy: 1.790509.
Iteration 1119: Policy loss: -10.136799. Value loss: 10.058465. Entropy: 1.790535.
episode: 550   score: 255.0   memory length: 1024   epsilon: 1.0  

episode: 576   score: 80.0   memory length: 1024   epsilon: 1.0    steps: 543     evaluation reward: 139.1
now time :  2018-12-19 13:29:31.058383
Training network
Iteration 1171: Policy loss: -4.866477. Value loss: 5.069601. Entropy: 1.790415.
Iteration 1172: Policy loss: -4.739642. Value loss: 4.953796. Entropy: 1.790439.
Iteration 1173: Policy loss: -4.931689. Value loss: 5.133544. Entropy: 1.790458.
episode: 577   score: 45.0   memory length: 1024   epsilon: 1.0    steps: 616     evaluation reward: 137.6
episode: 578   score: 185.0   memory length: 1024   epsilon: 1.0    steps: 764     evaluation reward: 138.55
Training network
Iteration 1174: Policy loss: -11.210770. Value loss: 11.064082. Entropy: 1.790772.
Iteration 1175: Policy loss: -11.229569. Value loss: 11.084403. Entropy: 1.790781.
Iteration 1176: Policy loss: -11.248122. Value loss: 11.101554. Entropy: 1.790757.
episode: 579   score: 90.0   memory length: 1024   epsilon: 1.0    steps: 696     evaluation reward: 138.8
Train

Iteration 1228: Policy loss: -9.701288. Value loss: 9.802184. Entropy: 1.790663.
Iteration 1229: Policy loss: -9.683365. Value loss: 9.780836. Entropy: 1.790641.
Iteration 1230: Policy loss: -9.642731. Value loss: 9.739333. Entropy: 1.790642.
Training network
Iteration 1231: Policy loss: -16.801811. Value loss: 16.514418. Entropy: 1.790289.
Iteration 1232: Policy loss: -16.876871. Value loss: 16.581997. Entropy: 1.790318.
Iteration 1233: Policy loss: -16.872894. Value loss: 16.581369. Entropy: 1.790296.
episode: 606   score: 320.0   memory length: 1024   epsilon: 1.0    steps: 1133     evaluation reward: 148.15
episode: 607   score: 145.0   memory length: 1024   epsilon: 1.0    steps: 857     evaluation reward: 148.6
Training network
Iteration 1234: Policy loss: -8.407542. Value loss: 8.217421. Entropy: 1.790520.
Iteration 1235: Policy loss: -8.385584. Value loss: 8.194049. Entropy: 1.790526.
Iteration 1236: Policy loss: -8.274394. Value loss: 8.085949. Entropy: 1.790511.
episode: 608 

episode: 634   score: 60.0   memory length: 1024   epsilon: 1.0    steps: 484     evaluation reward: 142.2
Training network
Iteration 1288: Policy loss: -7.855975. Value loss: 7.837366. Entropy: 1.790529.
Iteration 1289: Policy loss: -7.757934. Value loss: 7.740611. Entropy: 1.790514.
Iteration 1290: Policy loss: -7.957574. Value loss: 7.929995. Entropy: 1.790510.
episode: 635   score: 105.0   memory length: 1024   epsilon: 1.0    steps: 619     evaluation reward: 143.05
episode: 636   score: 105.0   memory length: 1024   epsilon: 1.0    steps: 512     evaluation reward: 142.9
episode: 637   score: 70.0   memory length: 1024   epsilon: 1.0    steps: 415     evaluation reward: 141.5
Training network
Iteration 1291: Policy loss: -6.905950. Value loss: 7.227736. Entropy: 1.790635.
Iteration 1292: Policy loss: -6.994492. Value loss: 7.296245. Entropy: 1.790651.
Iteration 1293: Policy loss: -6.980555. Value loss: 7.290988. Entropy: 1.790655.
episode: 638   score: 75.0   memory length: 1024 

episode: 664   score: 165.0   memory length: 1024   epsilon: 1.0    steps: 1094     evaluation reward: 138.05
Training network
Iteration 1345: Policy loss: -5.520286. Value loss: 5.967849. Entropy: 1.790428.
Iteration 1346: Policy loss: -5.535381. Value loss: 5.985423. Entropy: 1.790440.
Iteration 1347: Policy loss: -5.522268. Value loss: 5.971350. Entropy: 1.790423.
episode: 665   score: 335.0   memory length: 1024   epsilon: 1.0    steps: 806     evaluation reward: 140.5
Training network
Iteration 1348: Policy loss: -18.719652. Value loss: 18.490595. Entropy: 1.790364.
Iteration 1349: Policy loss: -18.810081. Value loss: 18.569818. Entropy: 1.790370.
Iteration 1350: Policy loss: -18.539978. Value loss: 18.301695. Entropy: 1.790400.
episode: 666   score: 135.0   memory length: 1024   epsilon: 1.0    steps: 671     evaluation reward: 140.65
episode: 667   score: 75.0   memory length: 1024   epsilon: 1.0    steps: 533     evaluation reward: 140.2
Training network
Iteration 1351: Policy 

Iteration 1404: Policy loss: -8.955646. Value loss: 8.973459. Entropy: 1.790337.
episode: 692   score: 105.0   memory length: 1024   epsilon: 1.0    steps: 622     evaluation reward: 147.05
episode: 693   score: 80.0   memory length: 1024   epsilon: 1.0    steps: 739     evaluation reward: 143.0
Training network
Iteration 1405: Policy loss: -6.645695. Value loss: 7.048760. Entropy: 1.790682.
Iteration 1406: Policy loss: -6.548018. Value loss: 6.954194. Entropy: 1.790653.
Iteration 1407: Policy loss: -6.453750. Value loss: 6.864163. Entropy: 1.790635.
episode: 694   score: 105.0   memory length: 1024   epsilon: 1.0    steps: 593     evaluation reward: 142.4
episode: 695   score: 90.0   memory length: 1024   epsilon: 1.0    steps: 556     evaluation reward: 141.75
Training network
Iteration 1408: Policy loss: -8.863056. Value loss: 8.936185. Entropy: 1.790668.
Iteration 1409: Policy loss: -8.640089. Value loss: 8.723355. Entropy: 1.790679.
Iteration 1410: Policy loss: -8.679076. Value lo

episode: 721   score: 110.0   memory length: 1024   epsilon: 1.0    steps: 630     evaluation reward: 147.2
episode: 722   score: 45.0   memory length: 1024   epsilon: 1.0    steps: 382     evaluation reward: 146.85
Training network
Iteration 1462: Policy loss: -7.250010. Value loss: 7.356271. Entropy: 1.790648.
Iteration 1463: Policy loss: -7.244108. Value loss: 7.350345. Entropy: 1.790645.
Iteration 1464: Policy loss: -7.183852. Value loss: 7.295191. Entropy: 1.790651.
episode: 723   score: 65.0   memory length: 1024   epsilon: 1.0    steps: 416     evaluation reward: 143.65
now time :  2018-12-19 13:35:29.101346
episode: 724   score: 155.0   memory length: 1024   epsilon: 1.0    steps: 890     evaluation reward: 144.5
Training network
Iteration 1465: Policy loss: -8.102412. Value loss: 8.287089. Entropy: 1.790618.
Iteration 1466: Policy loss: -8.208789. Value loss: 8.384427. Entropy: 1.790605.
Iteration 1467: Policy loss: -8.122915. Value loss: 8.301895. Entropy: 1.790610.
episode: 

Iteration 1517: Policy loss: -10.066642. Value loss: 10.015295. Entropy: 1.790622.
Iteration 1518: Policy loss: -9.987080. Value loss: 9.946028. Entropy: 1.790677.
episode: 752   score: 130.0   memory length: 1024   epsilon: 1.0    steps: 406     evaluation reward: 146.0
episode: 753   score: 100.0   memory length: 1024   epsilon: 1.0    steps: 850     evaluation reward: 146.2
Training network
Iteration 1519: Policy loss: -5.543866. Value loss: 5.984920. Entropy: 1.790499.
Iteration 1520: Policy loss: -5.501387. Value loss: 5.939048. Entropy: 1.790498.
Iteration 1521: Policy loss: -5.397815. Value loss: 5.848903. Entropy: 1.790493.
episode: 754   score: 130.0   memory length: 1024   epsilon: 1.0    steps: 632     evaluation reward: 147.4
Training network
Iteration 1522: Policy loss: -8.617120. Value loss: 9.008932. Entropy: 1.790521.
Iteration 1523: Policy loss: -8.607680. Value loss: 8.993595. Entropy: 1.790482.
Iteration 1524: Policy loss: -8.681946. Value loss: 9.065740. Entropy: 1.

episode: 780   score: 20.0   memory length: 1024   epsilon: 1.0    steps: 409     evaluation reward: 142.75
Training network
Iteration 1576: Policy loss: -13.459755. Value loss: 13.743070. Entropy: 1.790712.
Iteration 1577: Policy loss: -13.242714. Value loss: 13.536271. Entropy: 1.790722.
Iteration 1578: Policy loss: -13.224059. Value loss: 13.513487. Entropy: 1.790715.
episode: 781   score: 225.0   memory length: 1024   epsilon: 1.0    steps: 814     evaluation reward: 143.45
episode: 782   score: 110.0   memory length: 1024   epsilon: 1.0    steps: 750     evaluation reward: 142.75
Training network
Iteration 1579: Policy loss: -6.319079. Value loss: 6.797122. Entropy: 1.790463.
Iteration 1580: Policy loss: -6.416790. Value loss: 6.889909. Entropy: 1.790490.
Iteration 1581: Policy loss: -6.238068. Value loss: 6.717836. Entropy: 1.790449.
episode: 783   score: 140.0   memory length: 1024   epsilon: 1.0    steps: 641     evaluation reward: 142.35
Training network
Iteration 1582: Policy

In [None]:
torch.save(agent.policy_net, "./save_model/breakout_dqn")