# Simple Reinforcement Learning in Tensorflow Part 2: Policy Gradient Method
This tutorial contains a simple example of how to build a policy-gradient based agent that can solve the CartPole problem. For more information, see this [Medium post](https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-2-ded33892c724#.mtwpvfi8b).

For more Reinforcement Learning algorithms, including DQN and Model-based learning in Tensorflow, see my Github repo, [DeepRL-Agents](https://github.com/awjuliani/DeepRL-Agents). 

Parts of this tutorial are based on code by [Andrej Karpathy](https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5) and [korymath](https://gym.openai.com/evaluations/eval_a0aVJrGSyW892vBM04HQA).

In [1]:
from __future__ import division

import numpy as np
try:
    import cPickle as pickle
except ImportError:  # Python 3 has no cPickle module
    import pickle
import tensorflow as tf
import tensorflow.contrib.slim as slim
%matplotlib inline
import matplotlib.pyplot as plt
import math

try:
    xrange = xrange
except NameError:  # Python 3 has no xrange builtin
    xrange = range



### Loading the CartPole Environment
If you don't already have OpenAI Gym installed, use `pip install gym` to install it.

What happens if we try running the environment with random actions? How well do we do? (Hint: not so well.)

The goal of the task is to achieve a reward of 200 per episode. For every step the agent keeps the pole in the air, the agent receives a +1 reward. By randomly choosing actions, our reward for each episode is only a couple dozen. Let's make that better with RL!

### Setting up our Neural Network agent
This time we will be using a Policy neural network that takes observations, passes them through a single hidden layer, and then produces a probability of choosing a left/right movement. To learn more about this network, see [Andrej Karpathy's blog on Policy Gradient networks](http://karpathy.github.io/2016/05/31/rl/).
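The code cell further down turns the network's raw scores into action probabilities with `tf.nn.softmax` and trains them with `tf.nn.softmax_cross_entropy_with_logits`. As a rough, framework-free sketch of what those two steps compute (the helper names `softmax` and `cross_entropy` and the example scores are illustrative assumptions, not part of the notebook):

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot_label, probs):
    """Negative log-probability assigned to the labelled action."""
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs))

scores = [2.0, 0.0, 0.0]  # hypothetical logits for three actions
probs = softmax(scores)

# The loss is small when the label agrees with the most probable action
# and larger when it points at an action the network considers unlikely.
loss_match = cross_entropy([1, 0, 0], probs)
loss_mismatch = cross_entropy([0, 1, 0], probs)
print(probs, loss_match, loss_mismatch)
```

Minimizing this loss pushes probability mass toward whichever action the label marks.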

In [2]:
# hyperparameters
H = 57600 # size of the flattened conv output fed to the final layer (60 * 60 * 16)
batch_size = 5 # every how many episodes to do a param update?
learning_rate = 1e-2 # feel free to play with this to train faster or more stably.
gamma = 0.99 # discount factor for reward

D = 4 # input dimensionality
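The value `H = 57600` appears to correspond to the flattened output of the convolutional stack defined in the next cell: four 2x2 convolutions with stride 1 and `'VALID'` padding applied to a 64x64 input, each producing 16 channels. A small sketch of that arithmetic (the helper `valid_conv_size` is a hypothetical name introduced only for this illustration):

```python
def valid_conv_size(size, kernel, stride):
    """Output length of one spatial dimension for a 'VALID' convolution."""
    return (size - kernel) // stride + 1

height = width = 64   # spatial size of the input placeholder
channels = 16         # num_outputs of each conv layer

# Each 2x2, stride-1 'VALID' convolution shrinks the feature map by one pixel.
for _ in range(4):
    height = valid_conv_size(height, kernel=2, stride=1)
    width = valid_conv_size(width, kernel=2, stride=1)

flat_size = height * width * channels
print(height, width, flat_size)  # 60 60 57600
```

If the input resolution or the number of conv layers changes, `H` has to be recomputed in the same way so that the weight matrix `W` still matches the flattened tensor.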

In [3]:
tf.reset_default_graph()

#This defines the network as it goes from taking an observation of the environment to 
#giving a probability of choosing each of the three possible actions.
state = tf.placeholder(shape=[None,64,64,4], dtype=tf.float32)
conv1 = slim.conv2d( \
            inputs=state,num_outputs=16,kernel_size=[2,2],stride=[1,1],padding='VALID', 
                    biases_initializer=None,activation_fn=tf.nn.relu)
conv2 = slim.conv2d( \
            inputs=conv1,num_outputs=16,kernel_size=[2,2],stride=[1,1],padding='VALID', 
                    biases_initializer=None,activation_fn=tf.nn.relu)
conv3 = slim.conv2d( \
            inputs=conv2,num_outputs=16,kernel_size=[2,2],stride=[1,1],padding='VALID', 
                    biases_initializer=None,activation_fn=tf.nn.relu)
conv4 = slim.conv2d( \
            inputs=conv3,num_outputs=16,kernel_size=[2,2],stride=[1,1],padding='VALID', 
                    biases_initializer=None,activation_fn=tf.nn.relu)

convFlat = slim.flatten(conv4)
print("convFlat: " + str(convFlat))

#observations = tf.placeholder(tf.float32, [None,D] , name="input_x")
W = tf.get_variable("W", shape=[H, 3],
           initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(convFlat, W)
probability = tf.nn.softmax(score)

#W_Left = tf.get_variable("W_Left", shape=[H, 2],
#           initializer=tf.contrib.layers.xavier_initializer())
#score_Left = tf.matmul(convFlat, W_Left)
#probability_Left = tf.nn.softmax(score_Left)

#W_Jump = tf.get_variable("W_Jump", shape=[H, 2],
#           initializer=tf.contrib.layers.xavier_initializer())
#score_Jump = tf.matmul(convFlat, W_Jump)
#probability_Jump = tf.nn.softmax(score_Jump)

#W_Right = tf.get_variable("W_Right", shape=[H, 2],
#           initializer=tf.contrib.layers.xavier_initializer())
#score_Right = tf.matmul(convFlat, W_Right)
#probability_Right = tf.nn.softmax(W_Right)

#From here we define the parts of the network needed for learning a good policy.
#tvars = tf.trainable_variables()
#input_y = tf.placeholder(tf.float32, [None,1], name="input_y")
#advantages = tf.placeholder(tf.float32, name="reward_signal")

real_action = tf.placeholder(shape=[None, 3], dtype=tf.int32)
#left_real_action = tf.placeholder(shape=[None, 2], dtype=tf.int32)
#jump_real_action = tf.placeholder(shape=[None, 2], dtype=tf.int32)
#right_real_action = tf.placeholder(shape=[None, 2], dtype=tf.int32)

# The loss function. This sends the weights in the direction of making actions 
# that gave good advantage (reward over time) more likely, and actions that didn't less likely.
#loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))
#loss = -tf.reduce_mean(loglik * advantages) 
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=real_action, 
                                                                     logits=score))
#loss_left = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=left_real_action, 
#                                                                           logits=score_Left))
#loss_jump = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=jump_real_action, 
#                                                                        logits=score_Jump))
#loss_right = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=right_real_action, 
#                                                                          logits=score_Right))
#loss = loss_back + loss_forward + loss_jump + loss_attack
tf.summary.scalar('loss', loss)
train_step = tf.train.RMSPropOptimizer(0.001).minimize(loss)

# Merge all the summaries so they can be written out with a tf.summary.FileWriter
merged = tf.summary.merge_all()

'''
newGrads = tf.gradients(loss, tvars)

# Once we have collected a series of gradients from multiple episodes, we apply them.
# We don't just apply gradients after every episode in order to account for noise in the reward signal.
adam = tf.train.AdamOptimizer(learning_rate=learning_rate) # Our optimizer
W1Grad = tf.placeholder(tf.float32,name="batch_grad1") # Placeholders to send the final gradients through when we update.
W2Grad = tf.placeholder(tf.float32,name="batch_grad2")
batchGrad = [W1Grad,W2Grad]
updateGrads = adam.apply_gradients(zip(batchGrad,tvars))
'''

convFlat: Tensor("Flatten/flatten/Reshape:0", shape=(?, 57600), dtype=float32)

### Advantage function
This function allows us to weigh the rewards our agent receives. In the context of the Cart-Pole task, we want actions that kept the pole in the air for a long time to have a large reward, and actions that contributed to the pole falling to have a decreased or negative reward. We do this by weighing the rewards backwards from the end of the episode: actions near the end are seen as negative, since they likely contributed to the pole falling and the episode ending, while early actions are seen as more positive, since they weren't responsible for the pole falling.

In [4]:
#!pip install 'tensorboard==1.0.0a6'

In [5]:
def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(xrange(0, r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r
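To make the effect of discounting concrete, here is a small worked example in plain Python. It mirrors the loop in `discount_rewards` above but uses lists so it runs on its own. The standardization step (subtracting the mean and dividing by the standard deviation) is a common companion in policy-gradient code and is included here as an assumption; it is not performed anywhere in this notebook. The names `discounted_returns` and `standardize` are hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return the discounted sum of future rewards for each time step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = running * gamma + rewards[t]
        returns[t] = running
    return returns

def standardize(values):
    """Shift to zero mean and scale to unit variance (assumed extra step)."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / (std + 1e-8) for v in values]

rewards = [1.0, 1.0, 1.0]                         # +1 for every step survived
returns = discounted_returns(rewards, gamma=0.5)  # small gamma for readability
advantages = standardize(returns)
print(returns)     # [1.75, 1.5, 1.0]
print(advantages)  # early steps positive, final step negative
```

After standardization the earliest step has a positive value and the last step a negative one, which is the weighting the text above describes.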

### Running the Agent and Environment

Here we run the neural network agent and have it act in the environment. Note that the code below loads MineRL's `MineRLNavigateDense-v0` environment rather than CartPole.
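The run loop below builds the network input by scaling the `pov` image to roughly [-0.5, 0.5] and appending the compass angle, divided by 180, as a constant fourth channel. The following sketch isolates that preprocessing step; `build_state` is a hypothetical helper name, and the all-zero frame is placeholder data rather than a real observation.

```python
import numpy as np

def build_state(pov, compass_angle):
    """Stack a normalized RGB frame with a constant compass-angle channel."""
    pov = pov.astype(np.float32) / 255.0 - 0.5
    # One extra channel holding the scaled compass angle at every pixel.
    compass_channel = np.full(pov.shape[:-1] + (1,),
                              compass_angle / 180.0, dtype=np.float32)
    return np.concatenate([pov, compass_channel], axis=-1)

fake_pov = np.zeros((64, 64, 3), dtype=np.uint8)  # placeholder frame
example_state = build_state(fake_pov, compass_angle=90.0)
print(example_state.shape)  # (64, 64, 4)
```

The resulting shape matches the `[None, 64, 64, 4]` placeholder once a batch dimension is added, for example by wrapping the array in a list as the run loop does.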

In [6]:
'''
import minerl
import gym
env = gym.make('MineRLNavigateDense-v0')

obs  = env.reset()
done = False
net_reward = 0

while not done:
    action = env.action_space.noop()

    action['camera'] = [0, -10]
    action['back'] = 0
    action['forward'] = 1
    action['jump'] = 1
    action['attack'] = 1

    obs, reward, done, info = env.step(
        action)

    net_reward += reward
    print("Total reward: ", net_reward)
'''


In [7]:
#import minerl
#data = minerl.data.make('MineRLNavigateDense-v0', '/home/kimbring2/MineRL/data/')
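The commented-out training loop later in this notebook iterates over this demonstration data and converts each recorded camera movement into a one-hot label for the three-way softmax. A minimal sketch of that mapping, assuming the second camera component is the horizontal turn as the indexing in that loop suggests (`camera_to_one_hot` is a hypothetical helper name):

```python
def camera_to_one_hot(yaw_delta):
    """Map the horizontal camera movement to a one-hot action label."""
    if yaw_delta < 0:
        return [1, 0, 0]   # class 0: negative camera movement
    elif yaw_delta > 0:
        return [0, 1, 0]   # class 1: positive camera movement
    return [0, 0, 1]       # class 2: no camera movement

# Hypothetical camera readings, not taken from the dataset.
labels = [camera_to_one_hot(d) for d in (-10.0, 0.0, 3.5)]
print(labels)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

At run time the agent reverses this mapping: class 0 and class 1 turn the camera by -10 and +10 respectively, and class 2 keeps the camera still and jumps.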

In [8]:

import minerl
import gym

env = gym.make('MineRLNavigateDense-v0')
obs = env.reset()


In [9]:
xs,hs,dlogps,drs,ys,tfps = [],[],[],[],[],[]
running_reward = None
reward_sum = 0
episode_number = 1
total_episodes = 10000
init = tf.global_variables_initializer()
with tf.Session() as sess:

# Launch the graph
    rendering = False
    sess.run(init)
    saver = tf.train.Saver(max_to_keep=5)
    train_writer = tf.summary.FileWriter('/home/kimbring2/MineRL/train_summary', sess.graph)
    
    print('Loading Model...')
    path = '/home/kimbring2/MineRL/model'
    ckpt = tf.train.get_checkpoint_state(path)
    saver.restore(sess, ckpt.model_checkpoint_path)
    
    env.init()
    obs = env.reset()
    net_reward = 0
    for i in range(0, 500000):
        pov = obs['pov'].astype(np.float32) / 255.0 - 0.5
        compass = obs['compassAngle']

        compass_channel = np.ones(shape=list(pov.shape[:-1]) + [1], dtype=np.float32) * compass
        compass_channel /= 180.0
        
        state_concat = np.concatenate([pov, compass_channel], axis=-1)
        #print("state_concat.shape: " + str(state_concat.shape))
        
        #action = env.action_space.noop()
        #observations = tf.placeholder(tf.float32, [None,D] , name="input_x")
        action_probability = sess.run(probability, feed_dict={state:[state_concat]})
        #l = sess.run(probability_Left, feed_dict={state:[obs['pov']]})
        #j = sess.run(probability_Jump, feed_dict={state:[obs['pov']]})
        #r =  sess.run(probability_Right, feed_dict={state:[obs['pov']]})
        
        #print("c: " + str(c))
        #print("np.argmax(b): " + str(np.argmax(b)))
        #print("np.argmax(f): " + str(np.argmax(f)))
        #print("np.argmax(j): " + str(np.argmax(j)))
        #print("np.argmax(a): " + str(np.argmax(a)))
        
        action = env.action_space.noop()
        if (np.argmax(action_probability) == 0):
            action['camera'] = [0, -10]
            action['jump'] = 0
        elif (np.argmax(action_probability) == 1):
            action['camera'] = [0, 10]
            action['jump'] = 0
        else:
            action['camera'] = [0, 0]
            action['jump'] = 1
        
        action['forward'] = 1
        action['back'] = 0
        action['left'] = 0
        #action['jump'] = np.argmax(j)
        action['right'] = 0
        action['sprint'] = 1

        obs, reward, done, info = env.step(action)
        
        # Count the final step's reward before leaving the loop.
        net_reward += reward
        print("Total reward: ", net_reward)

        if done:
            break
    '''
    
    episode_count = 0
    for current_state, action, reward, next_state, done in data.sarsd_iter(num_epochs=500, max_sequence_len=200):
        #print("current_state['compassAngle']: " + str(current_state['compassAngle']))
        #print("action: " + str(action))
        #print("reward: " + str(reward))
        length = (current_state['pov'].shape)[0]

        #print("state_concat.shape: " + str(state_concat.shape))

        action_list = []
        states_list = []
        #camera_action_list = []
        #left_action_list = []
        #jump_action_list = []
        #right_action_list = []
        for i in range(0, length):
            #states = current_state['pov'][i]
            pov = current_state['pov'][i].astype(np.float32) / 255.0 - 0.5
            compass = current_state['compassAngle'][i]

            compass_channel = np.ones(shape=list(pov.shape[:-1]) + [1], dtype=np.float32) * compass
            compass_channel /= 180.0
        
            state_concat = np.concatenate([pov, compass_channel], axis=-1)
            #print("state_concat.shape: " + str(state_concat.shape))
            
            #print("action['camera'][i]: " + str(action['camera'][i]))
            #print("action['camera'][i][1]: " + str(action['camera'][i][1]))
            #print("")
            
            if (action['camera'][i][1] < 0):
                action_ = [1, 0, 0]
            elif (action['camera'][i][1] > 0):
                action_ = [0, 1, 0]
            else:
                action_ = [0, 0, 1]
                
            #camera_action = np.eye(2)[action['camera'][i]]
            #left_action = np.eye(2)[action['left'][i]]
            #jump_action = np.eye(2)[action['jump'][i]]
            #right_action = np.eye(2)[action['right'][i]]
            
            states_list.append(state_concat)
            action_list.append(action_)
            #left_action_list.append(left_action)
            #jump_action_list.append(jump_action)
            #right_action_list.append(right_action)
        
        episode_count = episode_count + 1
        
        #while True:
        feed_dict = {state:np.stack(states_list, 0),
                     real_action:np.stack(action_list, 0)
                    }
    
        if episode_count % 100 == 0:
            #print("loss_print: " + str(loss_print))
            #tf.summary.scalar('loss', loss)
            summary, _ = sess.run([merged, train_step], feed_dict=feed_dict)
            train_writer.add_summary(summary, episode_count)

        sess.run(train_step, feed_dict=feed_dict)
        
        if episode_count % 100 == 0:
            model_path = '/home/kimbring2/MineRL/model'
            saver.save(sess, model_path + '/model-' + str(episode_count) + '.cptk')
            print("Saved Model")
    '''
print(episode_number, 'Episodes completed.')

Loading Model...
INFO:tensorflow:Restoring parameters from /home/kimbring2/MineRL/model/model-60600.cptk
Total reward:  0.13396453857421875
Total reward:  0.20639801025390625
Total reward:  0.295379638671875
Total reward:  0.40094757080078125
Total reward:  0.5258026123046875
Total reward:  0.671539306640625
Total reward:  0.8387222290039062
Total reward:  1.0260086059570312
Total reward:  1.2327880859375
Total reward:  1.4574813842773438
Total reward:  1.6982421875
Total reward:  1.9146957397460938
Total reward:  2.3700637817382812
Total reward:  2.6284027099609375
Total reward:  2.8933868408203125
Total reward:  3.1661453247070312
Total reward:  3.44744873046875
Total reward:  3.7377548217773438
Total reward:  4.0355224609375
Total reward:  4.340732574462891
Total reward:  4.624519348144531
Total reward:  5.105266571044922
Total reward:  5.3719635009765625
Total reward:  5.64349365234375
...

Total reward:  60.43043231964111
Total reward:  60.52682113647461
Total reward:  60.62146520614624
Total reward:  60.714346408843994
Total reward:  60.80543088912964
Total reward:  60.90446376800537
Total reward:  61.0266695022583
Total reward:  61.169591426849365
Total reward:  61.32028675079346
Total reward:  61.45692300796509
Total reward:  61.5836763381958
Total reward:  61.703227519989014
Total reward:  61.81728649139404
Total reward:  61.92693614959717
Total reward:  62.03285789489746
Total reward:  62.135475158691406
Total reward:  62.23504400253296
Total reward:  62.331714153289795
Total reward:  62.425562381744385
Total reward:  62.51662063598633
Total reward:  62.60488557815552
Total reward:  62.70141935348511
Total reward:  62.82324266433716
Total reward:  62.9679160118103
Total reward:  63.12072563171387
Total reward:  63.25766372680664
Total reward:  63.383572578430176
Total reward:  63.50149869918823
Total reward:  63.613319396972656
Total reward:  63.7201566696167
...

Total reward:  47.28474235534668
Total reward:  47.1906795501709
Total reward:  47.0982608795166
Total reward:  47.00704383850098
Total reward:  46.91667175292969
Total reward:  46.82685470581055
Total reward:  46.73736381530762
Total reward:  46.64425468444824
Total reward:  46.539791107177734
Total reward:  46.42466163635254
Total reward:  46.30306625366211
Total reward:  46.187660217285156
Total reward:  46.07768249511719
Total reward:  45.9722843170166
Total reward:  45.87063407897949
Total reward:  45.77198028564453
Total reward:  45.6756706237793
Total reward:  45.58115196228027
Total reward:  45.487966537475586
Total reward:  45.3957405090332
Total reward:  45.30417060852051
Total reward:  45.213016510009766
Total reward:  45.12208366394043
Total reward:  45.02727699279785
Total reward:  44.920658111572266
Total reward:  44.80302429199219
Total reward:  44.67885780334473
Total reward:  44.56112480163574
Total reward:  44.448970794677734
Total reward:  44.34148979187012
...

Total reward:  34.53849792480469
Total reward:  34.623512268066406
Total reward:  34.70937728881836
Total reward:  34.79584503173828
Total reward:  34.88276672363281
Total reward:  34.97002029418945
Total reward:  35.057464599609375
Total reward:  35.14499282836914
Total reward:  35.23250961303711
Total reward:  35.319942474365234
Total reward:  35.4072265625
Total reward:  35.48896408081055
Total reward:  35.56720733642578
Total reward:  35.65470886230469
Total reward:  35.75718688964844
Total reward:  35.87441635131836
Total reward:  36.005653381347656
Total reward:  36.14967346191406
Total reward:  36.30477714538574
Total reward:  36.466867446899414
Total reward:  36.86198806762695
Total reward:  37.08930778503418
Total reward:  37.316471099853516
Total reward:  37.54555702209473
Total reward:  37.77796173095703
Total reward:  38.01442337036133
Total reward:  38.25503730773926
Total reward:  38.499284744262695
Total reward:  38.74603843688965
Total reward:  38.99358367919922
...

Total reward:  59.12689018249512
Total reward:  59.22970008850098
Total reward:  59.327423095703125
Total reward:  59.42101001739502
Total reward:  59.51109027862549
Total reward:  59.5980863571167
Total reward:  59.682281494140625
Total reward:  59.76386833190918
Total reward:  59.84298133850098
Total reward:  59.91970920562744
Total reward:  59.99411201477051
Total reward:  60.07463073730469
Total reward:  60.17414665222168
Total reward:  60.29038858413696
Total reward:  60.41213035583496
Total reward:  60.521077156066895
Total reward:  60.62089776992798
Total reward:  60.71393156051636
Total reward:  60.801663398742676
Total reward:  60.885032176971436
Total reward:  60.96462869644165
Total reward:  61.0408239364624
Total reward:  61.11384868621826
Total reward:  61.1838436126709
Total reward:  61.250892639160156
Total reward:  61.3150429725647
Total reward:  61.38451814651489
Total reward:  61.47157287597656
Total reward:  61.57360124588013
Total reward:  61.67920637130737
...

Total reward:  44.284799575805664
Total reward:  44.15370559692383
Total reward:  44.029123306274414
Total reward:  43.90997123718262
Total reward:  43.79525184631348
Total reward:  43.68408393859863
Total reward:  43.57572364807129
Total reward:  43.46954345703125
Total reward:  43.365028381347656
Total reward:  43.26176071166992
Total reward:  43.15940475463867
Total reward:  43.05768966674805
Total reward:  42.95640182495117
Total reward:  42.85072898864746
Total reward:  42.731544494628906
Total reward:  42.59987258911133
Total reward:  42.461097717285156
Total reward:  42.3299503326416
Total reward:  42.205299377441406
Total reward:  42.086042404174805
Total reward:  41.9711799621582
Total reward:  41.96192169189453
Total reward:  41.95535850524902
Total reward:  41.930519104003906
Total reward:  41.89099311828613
Total reward:  41.8399715423584
Total reward:  41.77921676635742
Total reward:  41.711605072021484
Total reward:  41.63783836364746
Total reward:  41.55910301208496
...

Total reward:  66.85573029518127
Total reward:  66.91208302974701
Total reward:  66.95339524745941
Total reward:  66.97803962230682
Total reward:  66.98472905158997
Total reward:  66.9729266166687
Total reward:  66.94303786754608
Total reward:  66.9213091135025
Total reward:  66.93486106395721
Total reward:  66.97076427936554
Total reward:  66.98770010471344
Total reward:  66.95610022544861
Total reward:  66.90720748901367
Total reward:  66.84712791442871
Total reward:  66.78070914745331
Total reward:  66.71316826343536
Total reward:  66.6478306055069
Total reward:  66.58726871013641
Total reward:  66.5308997631073
Total reward:  66.47696256637573
Total reward:  66.42345058917999
Total reward:  66.36853468418121
Total reward:  66.31073534488678
Total reward:  66.24896824359894
Total reward:  66.18252348899841
Total reward:  66.11101484298706
Total reward:  66.03431487083435
Total reward:  65.95249009132385
Total reward:  65.86574292182922
Total reward:  65.77436256408691
...

Total reward:  58.25824737548828
Total reward:  58.17400646209717
Total reward:  58.0895414352417
Total reward:  58.004638671875
Total reward:  57.919132232666016
Total reward:  57.83289909362793
Total reward:  57.745849609375
Total reward:  57.65791988372803
Total reward:  57.56709957122803
Total reward:  57.468055725097656
Total reward:  57.36103057861328
Total reward:  57.244521141052246
Total reward:  57.11976432800293
Total reward:  56.99958896636963
Total reward:  56.884867668151855
Total reward:  56.77547359466553
Total reward:  56.67080211639404
Total reward:  56.57007598876953
Total reward:  56.472496032714844
Total reward:  56.37732410430908
Total reward:  56.28391456604004
Total reward:  56.19172286987305
Total reward:  56.10029983520508
Total reward:  56.009284019470215
Total reward:  55.91838836669922
Total reward:  55.82738780975342
Total reward:  55.736106872558594
Total reward:  55.644412994384766
Total reward:  55.55220699310303
Total reward:  55.456336975097656
...

Total reward:  65.30435585975647
Total reward:  65.23220658302307
Total reward:  65.1649968624115
Total reward:  65.10544204711914
Total reward:  65.05491590499878
Total reward:  65.00528502464294
Total reward:  64.95568323135376
Total reward:  64.90520143508911
Total reward:  64.85302257537842
Total reward:  64.79847717285156
Total reward:  64.74721121788025
Total reward:  64.70306348800659
Total reward:  64.66106390953064
Total reward:  64.61047768592834
Total reward:  64.54998302459717
Total reward:  64.4851815700531
Total reward:  64.422842502594
Total reward:  64.36484694480896
Total reward:  64.31275677680969
Total reward:  64.26784229278564
Total reward:  64.23015856742859
Total reward:  64.19884014129639
Total reward:  64.17247700691223
Total reward:  64.14952683448792
Total reward:  64.12856554985046
Total reward:  64.10821056365967
Total reward:  64.0900456905365
Total reward:  64.07435464859009
Total reward:  64.05964851379395
Total reward:  64.04474925994873
...

Total reward:  62.51427698135376
Total reward:  62.55615186691284
Total reward:  62.59425401687622
Total reward:  62.628607749938965
Total reward:  62.659202575683594
Total reward:  62.68601131439209
Total reward:  62.7161283493042
Total reward:  62.75887060165405
Total reward:  62.81121826171875
Total reward:  62.862502098083496
Total reward:  62.90126132965088
Total reward:  62.93170928955078
Total reward:  62.956252574920654
Total reward:  62.976194858551025
Total reward:  62.99218988418579
Total reward:  63.004517555236816
Total reward:  63.01325845718384
Total reward:  63.018391609191895
Total reward:  63.01985740661621
Total reward:  63.01758623123169
Total reward:  63.01151990890503
Total reward:  63.00161552429199
Total reward:  62.99331331253052
Total reward:  62.99246025085449
Total reward:  62.99626922607422
Total reward:  62.99609994888306
Total reward:  62.986074924468994
Total reward:  62.97003984451294
Total reward:  62.95003414154053
Total reward:  62.92704629898071
...

Total reward:  56.07253456115723
Total reward:  56.21451950073242
Total reward:  56.34635353088379
Total reward:  56.47042179107666
Total reward:  56.588401794433594
Total reward:  56.701491355895996
Total reward:  56.81055545806885
Total reward:  56.916229248046875
Total reward:  57.01898670196533
Total reward:  57.11918544769287
Total reward:  57.21709728240967
Total reward:  57.3129301071167
Total reward:  57.415199279785156
Total reward:  57.53769874572754
Total reward:  57.67829132080078
Total reward:  57.826104164123535
Total reward:  57.961360931396484
Total reward:  58.08749294281006
Total reward:  58.2067928314209
Total reward:  58.32080554962158
Total reward:  58.430583000183105
Total reward:  58.536850929260254
Total reward:  58.64011478424072
Total reward:  58.74073600769043
Total reward:  58.838972091674805
Total reward:  58.93500995635986
Total reward:  59.02898693084717
Total reward:  59.12100315093994
Total reward:  59.220030784606934
Total reward:  59.34030818939209
...

Total reward:  48.961843490600586
Total reward:  48.84164810180664
Total reward:  48.70937919616699
Total reward:  48.56996726989746
Total reward:  48.43782997131348
Total reward:  48.31203651428223
Total reward:  48.19158363342285
Total reward:  48.07550621032715
Total reward:  47.96293640136719
Total reward:  47.85312843322754
Total reward:  47.74545478820801
Total reward:  47.63939666748047
Total reward:  47.534528732299805
Total reward:  47.43051338195801
Total reward:  47.327077865600586
Total reward:  47.21944046020508
Total reward:  47.098432540893555
Total reward:  46.964975357055664
Total reward:  46.82427406311035
Total reward:  46.69121170043945
Total reward:  46.56475639343262
Total reward:  46.4438362121582
Total reward:  46.327444076538086
Total reward:  46.214683532714844
Total reward:  46.10479164123535
Total reward:  45.99712371826172
Total reward:  45.89114761352539
Total reward:  45.786434173583984
Total reward:  45.68263626098633
Total reward:  45.579477310180664
...

Total reward:  21.62303924560547
Total reward:  21.625938415527344
Total reward:  21.63202667236328
Total reward:  21.640583038330078
Total reward:  21.65103530883789
Total reward:  21.662914276123047
Total reward:  21.677654266357422
Total reward:  21.69705581665039
Total reward:  21.720436096191406
Total reward:  21.745315551757812
Total reward:  21.768173217773438
Total reward:  21.789634704589844
Total reward:  21.810073852539062
Total reward:  21.829727172851562
Total reward:  21.84872055053711
Total reward:  21.870609283447266
Total reward:  21.894622802734375
Total reward:  21.92029571533203
Total reward:  21.947227478027344
Total reward:  21.975093841552734
Total reward:  22.003623962402344
Total reward:  22.032608032226562
Total reward:  22.061866760253906
Total reward:  22.091259002685547
Total reward:  22.12320327758789
Total reward:  22.161468505859375
Total reward:  22.205265045166016
Total reward:  22.251262664794922
Total reward:  22.293560028076172
Total reward:  22.333

Total reward:  42.91950225830078
Total reward:  43.02108955383301
Total reward:  43.13972473144531
Total reward:  43.27366065979004
Total reward:  43.41440200805664
Total reward:  43.544715881347656
Total reward:  43.66709899902344
Total reward:  43.78334426879883
Total reward:  43.89476203918457
Total reward:  44.00232124328613
Total reward:  44.106746673583984
Total reward:  44.20858955383301
Total reward:  44.30827713012695
Total reward:  44.4061393737793
Total reward:  44.502431869506836
Total reward:  44.59736251831055
Total reward:  44.691091537475586
Total reward:  44.790483474731445
Total reward:  44.90704345703125
Total reward:  45.03904914855957
Total reward:  45.17795372009277
Total reward:  45.30659484863281
Total reward:  45.42744064331055
Total reward:  45.54225540161133
Total reward:  45.652320861816406
Total reward:  45.75858116149902
Total reward:  45.861745834350586
Total reward:  45.96234703063965
Total reward:  46.060794830322266
Total reward:  46.15740966796875
...

Total reward:  60.4584002494812
Total reward:  60.397565841674805
Total reward:  60.335153579711914
Total reward:  60.271328926086426
Total reward:  60.20402526855469
Total reward:  60.13233184814453
Total reward:  60.054821491241455
Total reward:  59.976853370666504
Total reward:  59.899813652038574
Total reward:  59.82412528991699
Total reward:  59.74971294403076
Total reward:  59.67627429962158
Total reward:  59.60342311859131
Total reward:  59.53077793121338
Total reward:  59.45798873901367
Total reward:  59.384761810302734
Total reward:  59.31085777282715
Total reward:  59.23608684539795
Total reward:  59.160305976867676
Total reward:  59.08255863189697
Total reward:  58.99905204772949
Total reward:  58.90936756134033
Total reward:  58.813639640808105
Total reward:  58.71950817108154
Total reward:  58.62778377532959
Total reward:  58.53852367401123
Total reward:  58.45142078399658
Total reward:  58.366021156311035
Total reward:  58.28183937072754
Total reward:  58.198424339294434


Total reward:  36.14423179626465
Total reward:  36.076297760009766
Total reward:  36.00349044799805
Total reward:  35.925567626953125
Total reward:  35.83903121948242
Total reward:  35.60374450683594
Total reward:  35.53874588012695
Total reward:  35.468894958496094
Total reward:  35.39663314819336
Total reward:  35.32026290893555
Total reward:  35.241024017333984
Total reward:  35.159454345703125
Total reward:  35.07793045043945
Total reward:  35.02448654174805
Total reward:  34.932132720947266
Total reward:  34.81075668334961
Total reward:  34.76540756225586
Total reward:  34.74090576171875
Total reward:  34.719844818115234
Total reward:  34.703529357910156
Total reward:  34.6926383972168
Total reward:  34.68722915649414
Total reward:  34.68681716918945
Total reward:  34.672706604003906
Total reward:  34.642662048339844
Total reward:  34.57072448730469
Total reward:  34.51475524902344
Total reward:  34.45614242553711
Total reward:  34.40029525756836
Total reward:  34.35021209716797
...

Total reward:  56.83265399932861
Total reward:  56.86912441253662
Total reward:  56.903032302856445
Total reward:  56.934791564941406
Total reward:  56.96461868286133
Total reward:  56.99261665344238
Total reward:  57.01882553100586
Total reward:  57.043253898620605
Total reward:  57.06589221954346
Total reward:  57.08672904968262
Total reward:  57.10574913024902
Total reward:  57.12667942047119
Total reward:  57.154526710510254
Total reward:  57.18771743774414
Total reward:  57.220746994018555
Total reward:  57.247328758239746
Total reward:  57.26958465576172
Total reward:  57.288740158081055
Total reward:  57.30547523498535
Total reward:  57.32015132904053
Total reward:  57.332942962646484
Total reward:  57.343923568725586
Total reward:  57.353111267089844
Total reward:  57.36049938201904
Total reward:  57.36606979370117
Total reward:  57.369802474975586
Total reward:  57.37168025970459
Total reward:  57.37461853027344
Total reward:  57.38203716278076
Total reward:  57.39254283905029

Total reward:  47.01122856140137
Total reward:  46.92067337036133
Total reward:  46.832576751708984
Total reward:  46.74637985229492
Total reward:  46.66160202026367
Total reward:  46.577842712402344
Total reward:  46.494773864746094
Total reward:  46.41213035583496
Total reward:  46.32969856262207
Total reward:  46.24388122558594
Total reward:  46.147573471069336
Total reward:  46.041358947753906
Total reward:  45.929025650024414
Total reward:  45.822265625
Total reward:  45.72043037414551
Total reward:  45.622758865356445
Total reward:  45.52850151062012
Total reward:  45.43696403503418
Total reward:  45.34754180908203
Total reward:  45.25972366333008
Total reward:  45.17308235168457
Total reward:  45.08726692199707
Total reward:  45.001996994018555
Total reward:  44.91704559326172
Total reward:  44.832231521606445
Total reward:  44.74375915527344
Total reward:  44.64424514770508
Total reward:  44.53438758850098
Total reward:  44.41830253601074
Total reward:  44.3081169128418

Total reward:  20.087627410888672
Total reward:  19.988040924072266
Total reward:  19.88922882080078
Total reward:  19.790992736816406
Total reward:  19.68807601928711
Total reward:  19.570907592773438
Total reward:  19.44066619873047
Total reward:  19.303417205810547
Total reward:  19.17430877685547
Total reward:  19.051910400390625
Total reward:  18.934986114501953
Total reward:  18.82248306274414
Total reward:  18.713512420654297
Total reward:  18.607349395751953
Total reward:  18.503395080566406
Total reward:  18.401172637939453
Total reward:  18.30028533935547
Total reward:  18.200424194335938
Total reward:  18.101337432861328
Total reward:  18.002822875976562
Total reward:  17.89962387084961
Total reward:  17.782146453857422
Total reward:  17.65158462524414
Total reward:  17.514022827148438
Total reward:  17.38460922241211
Total reward:  17.261917114257812
Total reward:  17.144695281982422
Total reward:  17.031890869140625
Total reward:  16.922618865966797
Total reward:  16.81615

Total reward:  7.578517913818359
Total reward:  7.544658660888672
Total reward:  7.511135101318359
Total reward:  7.477813720703125
Total reward:  7.444599151611328
Total reward:  7.411403656005859
Total reward:  7.378154754638672
Total reward:  7.343402862548828
Total reward:  7.3042449951171875
Total reward:  7.26092529296875
Total reward:  7.214996337890625
Total reward:  7.17120361328125
Total reward:  7.129268646240234
Total reward:  7.088890075683594
Total reward:  7.049770355224609
Total reward:  7.011634826660156
Total reward:  6.9742584228515625
Total reward:  6.937431335449219
Total reward:  6.900993347167969
Total reward:  6.864803314208984
Total reward:  6.8287506103515625
Total reward:  6.792747497558594
Total reward:  6.755130767822266
Total reward:  6.712650299072266
Total reward:  6.665576934814453
Total reward:  6.615684509277344
Total reward:  6.568241119384766
Total reward:  6.522911071777344
Total reward:  6.479343414306641
Total reward:  6.437202453613281

Total reward:  -6.4927215576171875
Total reward:  -6.573768615722656
Total reward:  -6.6513214111328125
Total reward:  -6.7260589599609375
Total reward:  -6.798561096191406
Total reward:  -6.869293212890625
Total reward:  -6.938636779785156
Total reward:  -7.006927490234375
Total reward:  -7.074394226074219
Total reward:  -7.141265869140625
Total reward:  -7.2076873779296875
Total reward:  -7.2772369384765625
Total reward:  -7.3564453125
Total reward:  -7.4445648193359375
Total reward:  -7.537513732910156
Total reward:  -7.625030517578125
Total reward:  -7.70806884765625
Total reward:  -7.7874603271484375
Total reward:  -7.863914489746094
Total reward:  -7.9380340576171875
Total reward:  -8.010299682617188
Total reward:  -8.08111572265625
Total reward:  -8.15081787109375
Total reward:  -8.219657897949219
Total reward:  -8.287864685058594
Total reward:  -8.3555908203125
Total reward:  -8.422988891601562
Total reward:  -8.493667602539062
Total reward:  -8.574256896972656

Total reward:  11.714614868164062
Total reward:  11.81436538696289
Total reward:  11.913902282714844
Total reward:  12.013233184814453
Total reward:  12.11236572265625
Total reward:  12.2113037109375
Total reward:  12.310047149658203
Total reward:  12.408599853515625
Total reward:  12.506961822509766
Total reward:  12.60513687133789
Total reward:  12.703125
Total reward:  12.800926208496094
Total reward:  12.898536682128906
Total reward:  12.995964050292969
Total reward:  13.093204498291016
Total reward:  13.190258026123047
Total reward:  13.287120819091797
Total reward:  13.383796691894531
Total reward:  13.486675262451172
Total reward:  13.606624603271484
Total reward:  13.742095947265625
Total reward:  13.885169982910156
Total reward:  14.018814086914062
Total reward:  14.145061492919922
Total reward:  14.265472412109375
Total reward:  14.381237030029297
Total reward:  14.493278503417969
Total reward:  14.602313995361328
Total reward:  14.708915710449219
Total reward:  14.8135299682

Total reward:  40.50237846374512
Total reward:  40.606590270996094
Total reward:  40.72846603393555
Total reward:  40.866294860839844
Total reward:  41.011451721191406
Total reward:  41.14618110656738
Total reward:  41.27292442321777
Total reward:  41.39345359802246
Total reward:  41.509084701538086
Total reward:  41.62079429626465
Total reward:  41.72933006286621
Total reward:  41.83525848388672
Total reward:  41.93902397155762
Total reward:  42.04097366333008
Total reward:  42.14137649536133
Total reward:  42.240447998046875
Total reward:  42.338361740112305
Total reward:  42.442230224609375
Total reward:  42.56398963928223
Total reward:  42.70190620422363
Total reward:  42.84720230102539
Total reward:  42.9819278717041
Total reward:  43.10857582092285
Total reward:  43.22894859313965
Total reward:  43.34437561035156
Total reward:  43.45584487915039
Total reward:  43.56410598754883
Total reward:  43.669729232788086
Total reward:  43.773155212402344
Total reward:  43.87472915649414

Total reward:  63.022048473358154
Total reward:  62.95161056518555
Total reward:  62.880088806152344
Total reward:  62.80718994140625
Total reward:  62.73267889022827
Total reward:  62.65638256072998
Total reward:  62.578182220458984
Total reward:  62.49879789352417
Total reward:  62.41623544692993
Total reward:  62.329118728637695
Total reward:  62.234981060028076
Total reward:  62.14000654220581
Total reward:  62.04623889923096
Total reward:  61.954336643218994
Total reward:  61.8676872253418
Total reward:  61.7841010093689
Total reward:  61.70299577713013
Total reward:  61.62367582321167
Total reward:  61.54553031921387
Total reward:  61.47200918197632
Total reward:  61.40506076812744
Total reward:  61.34622812271118
Total reward:  61.29674959182739
Total reward:  61.25357532501221
Total reward:  61.215312480926514
Total reward:  61.18063306808472
Total reward:  61.14840126037598
Total reward:  61.1175742149353
Total reward:  61.08731555938721
Total reward:  61.05694341659546

As you can see, the network not only does much better than random actions but also reaches the goal of 200 points per episode, thus solving the task!
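
One way to check a claim like this programmatically is to average the per-episode rewards over a sliding window and compare the result against the 200-point goal. The sketch below is only an illustration and is not part of the original notebook: the helper names `running_mean_rewards` and `first_solved_episode`, the window size, and the example reward list are all assumptions made for this example.

```python
from collections import deque


def running_mean_rewards(episode_rewards, window=100):
    """Return the mean of the most recent `window` rewards after each episode."""
    recent = deque(maxlen=window)  # old rewards fall out automatically
    means = []
    for reward in episode_rewards:
        recent.append(reward)
        means.append(sum(recent) / len(recent))
    return means


def first_solved_episode(episode_rewards, target=200.0, window=100):
    """Return the 0-based index of the first episode whose full-window
    mean reward reaches `target`, or None if that never happens."""
    means = running_mean_rewards(episode_rewards, window)
    for i, mean in enumerate(means):
        # Require a full window so a single lucky early episode does not count.
        if i + 1 >= window and mean >= target:
            return i
    return None


# Hypothetical reward history, for illustration only (not taken from the run above).
example_rewards = [20.0, 35.0, 120.0, 200.0, 200.0, 200.0]
print(first_solved_episode(example_rewards, target=200.0, window=3))  # prints 5
```

Averaging over a window rather than looking at a single episode guards against declaring success after one unusually good rollout. The window length and target are parameters so they can be matched to whatever criterion you choose for the task.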