Reinforcement Learning with OpenAI Gym
---
This notebook will create and test different reinforcement learning agents and environments.

In [2]:
import tensorflow as tf
import gym

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

Load the Environment
---
Call `gym.make("environment name")` to load a new environment.

Check out the list of available environments at <https://gym.openai.com/envs/>

Edit this cell to load different environments!

In [3]:
# TODO: Load an environment
env = gym.make("CartPole-v1")

In [4]:
# TODO: Print observation and action spaces
print(env.observation_space)
print(env.action_space)

Box(4,)
Discrete(2)


Run an Agent
---

Reset the environment before each run with `env.reset()`.

Step forward through the environment with `env.step()` to get new observations and rewards over time.

`env.step()` takes the action to perform on this step and returns the following:
- Observation: the environment's state after this step
- Reward: the reward earned on this step
- Done: a boolean indicating whether the episode has finished
- Info: debug information that some environments provide

In [4]:
# TODO Make a random agent
games_to_play = 10

for i in range(games_to_play):
    # Reset the environment
    obs = env.reset()
    episode_rewards = 0
    done = False
    
    while not done:
        # Render the environment so we can watch
        env.render()
        
        # Choose a random action
        action = env.action_space.sample()
        
        # Take a step in the environment with the chosen action
        obs, reward, done, info = env.step(action)
        episode_rewards += reward

    # Print episode total rewards when done
    print(episode_rewards)
    
# Close the environment
env.close()

14.0
9.0
26.0
16.0
12.0
33.0
36.0
15.0
27.0
13.0


Policy Gradients
---
The policy gradients algorithm records gameplay over a training period, then uses the recorded actions and rewards to train a neural network, making actions that led to a reward more likely and actions that did not less likely.
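To make the idea concrete, here is a minimal NumPy sketch of the quantity such a network is typically trained to minimize: the negative log-probability of each chosen action, weighted by how well that action turned out. The helper name `policy_gradient_loss`, the term "advantages" for the per-step weights, and the sample numbers are illustrative assumptions, not part of this notebook's agent.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def policy_gradient_loss(logits, actions, advantages):
    """Mean of advantage * -log p(chosen action) over a batch (hypothetical helper)."""
    probs = softmax(logits)
    # Pick out the probability the policy assigned to the action actually taken.
    chosen = probs[np.arange(len(actions)), actions]
    return np.mean(advantages * -np.log(chosen))

# Two steps, two actions; made-up numbers for illustration only.
logits = np.array([[0.0, 0.0], [2.0, 0.0]])
actions = np.array([1, 0])
advantages = np.array([1.0, -1.0])
loss = policy_gradient_loss(logits, actions, advantages)
print(loss)
```

Minimizing this loss raises the probability of actions with a positive weight and lowers it for actions with a negative weight.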

In [9]:
# TODO Build the policy gradient neural network
class Agent:
    def __init__(self, num_actions, state_size):
        
        initializer = tf.contrib.layers.xavier_initializer()
        
        self.input_layer = tf.placeholder(dtype=tf.float32, shape=[None, state_size])
        
        # Neural net starts here
        
        hidden_layer = tf.layers.dense(self.input_layer, 8, activation=tf.nn.relu, kernel_initializer=initializer)
        
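        # Output of neural net: keep the raw logits separate from the softmax
        # probabilities, because the cross-entropy op applies softmax itself
        self.logits = tf.layers.dense(hidden_layer, num_actions)
        self.outputs = tf.nn.softmax(self.logits)
        self.choice = tf.argmax(self.outputs, 1)
        
        # Training Procedure
        self.rewards = tf.placeholder(shape=[None, 1], dtype=tf.float32)
        self.actions = tf.placeholder(shape=[None, 1], dtype=tf.int32)
        
        # Squeeze the [batch, 1] placeholders to [batch] so rewards and
        # cross-entropy line up element-wise
        rewards = tf.squeeze(self.rewards, axis=1)
        one_hot_actions = tf.one_hot(tf.squeeze(self.actions, axis=1), num_actions)
    
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=one_hot_actions, logits=self.logits)
        
        self.loss = tf.reduce_mean(rewards * cross_entropy)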
        
        self.tvars = tf.trainable_variables()
        
        self.gradients = tf.gradients(self.loss, self.tvars)
        
        # Compute gradients needed to make each action more likely
        #self.gradients = tf.gradients(loss, tf.trainable_variables())
        #self.adjusted_gradients = self.gradients * self.rewards
        
        # Create a placeholder list for gradients
        self.gradients_to_apply = []
        for index, variable in enumerate(tf.trainable_variables()):
            gradient_placeholder = tf.placeholder(tf.float32)
            self.gradients_to_apply.append(gradient_placeholder)
        
        # Create the operation to update gradients with the gradients placeholder.
        optimizer = tf.train.AdamOptimizer(learning_rate=1e-2)
        self.training_op = optimizer.minimize(self.loss)
        self.update_gradients = optimizer.apply_gradients(zip(self.gradients_to_apply, tf.trainable_variables()))
        

Discounting Rewards
---
In order to determine how "successful" a given action is, you need to track rewards over time in the environment. You could save each frame's reward at its full value, but in reality, an action is more likely to be responsible for rewards that arrive soon after it. "Discounting" rewards attempts to address this problem by scaling the value of a reward down the farther in the future it occurs relative to the action.

In [10]:
discount_rate = 0.99

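def discount_normalize_rewards(rewards):
    # Cast to float so object-dtype slices of the episode history work with NumPy math
    rewards = np.asarray(rewards, dtype=np.float64)
    discounted_rewards = np.zeros_like(rewards)
    total_rewards = 0
    
    # Walk backwards so each step accumulates the discounted sum of later rewards
    for i in reversed(range(len(rewards))):
        total_rewards = total_rewards * discount_rate + rewards[i]
        discounted_rewards[i] = total_rewards
    
    # Normalize to zero mean and unit standard deviation
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= np.std(discounted_rewards)
    
    return discounted_rewards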
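
As a quick sanity check of the backward recurrence, the sketch below applies the same pass without the normalization step. The helper name `discount_rewards` and the discount rate of 0.5 are assumptions chosen only to keep the arithmetic easy to follow by hand.

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    """Backward pass: G[i] = rewards[i] + discount_rate * G[i + 1]."""
    discounted = np.zeros(len(rewards), dtype=np.float64)
    running_total = 0.0
    for i in reversed(range(len(rewards))):
        running_total = running_total * discount_rate + rewards[i]
        discounted[i] = running_total
    return discounted

# Three steps, each earning a reward of 1.
example = discount_rewards([1.0, 1.0, 1.0], discount_rate=0.5)
print(example.tolist())  # [1.75, 1.5, 1.0]
```

The earliest step receives the largest value because it is credited with every reward that follows it, each scaled down by its distance in time.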

Training Procedure
---
The agent will play games and record the results. After every game, the gradients are computed and added to a buffer; every few games (`episode_batch_size`), the accumulated gradients are applied and the buffer is reset to zero.
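
The accumulate-then-apply pattern can be sketched without TensorFlow. In the snippet below, the parameter shapes, the constant stand-in gradients, and the plain gradient-descent update are all assumptions made for illustration; the notebook itself applies the buffered gradients with an Adam optimizer.

```python
import numpy as np

# Hypothetical parameters: one weight matrix and one bias vector.
params = [np.ones((4, 2)), np.zeros(2)]
learning_rate = 0.1
episode_batch_size = 5

# One zeroed buffer per parameter, matching its shape.
gradient_buffer = [np.zeros_like(p) for p in params]

for episode in range(1, 11):
    # Stand-in for the per-episode gradients a real agent would compute.
    episode_gradients = [np.full_like(p, 0.2) for p in params]

    # Accumulate this episode's gradients into the buffer (in place).
    for buffered, gradient in zip(gradient_buffer, episode_gradients):
        buffered += gradient

    # Every few episodes, apply the summed gradients and clear the buffer.
    if episode % episode_batch_size == 0:
        for param, buffered in zip(params, gradient_buffer):
            param -= learning_rate * buffered
            buffered[...] = 0.0
```

Batching several episodes into one update smooths out the noise of any single game, at the cost of updating the parameters less often.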

In [11]:
tf.reset_default_graph()

agent = Agent(2,4)

training_episodes = 20000
max_steps_per_episode = 10000
episode_batch_size = 5

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    
    batch_history = []
    
    total_episode_rewards = []
    
    # Create a buffer of 0'd gradients
    gradient_buffer = sess.run(agent.tvars)
    for index, gradient in enumerate(gradient_buffer):
        gradient_buffer[index] = gradient * 0

    for episode in range(training_episodes):

        state = env.reset()
        
        episode_history = []
        episode_rewards = 0
        
        for step in range(max_steps_per_episode):
            
            if episode % 100 == 0:
                env.render()
            
            # Get weights for each action
            action_probabilities = sess.run(agent.outputs, feed_dict={agent.input_layer: [state]})
            if episode % 100 == 0 and step % 10 == 0:
                print(action_probabilities[0])
            action_choice = np.random.choice(range(2), p=action_probabilities[0])
            
            state_next, reward, done, _ = env.step(action_choice)
            episode_history.append([state, action_choice, reward, state_next])
            state = state_next
            
            episode_rewards += reward
            
            if done:
                total_episode_rewards.append(episode_rewards)
                episode_history = np.array(episode_history)
                episode_history[:,2] = discount_normalize_rewards(episode_history[:,2])
                
                ep_gradients = sess.run(agent.gradients, feed_dict={agent.input_layer: np.vstack(episode_history[:, 0]),
                                                                    agent.actions: np.vstack(episode_history[:, 1]),
                                                                    agent.rewards: np.vstack(episode_history[:, 2])})
                # add the gradients to the grad buffer:
                for index, gradient in enumerate(ep_gradients):
                    gradient_buffer[index] += gradient
                    
                #Record episode events to the batch
                #batch_history.extend(episode_history)
                break
            
        if episode % episode_batch_size == 0:
            
            if episode % 100 == 0:
                print("Average reward / 100 eps: " + str(np.mean(total_episode_rewards[-100:])))
            # Run a training step with the batch we've collected
            #batch_history = np.array(batch_history)
            #batch_states = batch_history[:,0]
            #batch_actions = batch_history[:,1]
            #batch_rewards = batch_history[:,2]
            
            feed_dict_gradients = dict(zip(agent.gradients_to_apply, gradient_buffer))
            
            sess.run(agent.update_gradients, feed_dict=feed_dict_gradients)
            
            for index, gradient in enumerate(gradient_buffer):
                gradient_buffer[index] = gradient * 0
            
            batch_history = []
            
        

[ 0.49376717  0.5062328 ]
[ 0.40159589  0.59840411]


[ 0.42864743  0.5713526 ]
[ 0.47861791  0.52138209]


[ 0.42404425  0.57595575]
[ 0.49186847  0.5081315 ]


Average reward / 100 eps: 57.0


[ 0.48161393  0.51838613]
[ 0.49135968  0.50864023]


[ 0.49678993  0.50321013]
[ 0.39692003  0.60308003]


[ 0.49085739  0.50914264]
[ 0.4845351   0.51546484]


[ 0.47462243  0.52537757]
[ 0.48269576  0.51730424]
Average reward / 100 eps: 26.38


[ 0.49669814  0.50330186]
[ 0.47595286  0.52404714]
Average reward / 100 eps: 26.68


[ 0.49218714  0.50781286]
[ 0.28290451  0.71709549]
Average reward / 100 eps: 26.83


[ 0.49039263  0.50960743]
[ 0.48342702  0.51657301]


[ 0.46714064  0.53285939]
[ 0.42769063  0.57230932]


[ 0.48123941  0.51876062]
[ 0.38837364  0.61162639]


Average reward / 100 eps: 26.31


[ 0.47971219  0.52028781]
[ 0.48814091  0.51185912]


[ 0.47480655  0.52519351]
Average reward / 100 eps: 28.39


[ 0.48422793  0.5157721 ]
[ 0.48095536  0.5190447 ]


[ 0.47287667  0.52712333]
Average reward / 100 eps: 25.38


[ 0.48388195  0.51611799]
[ 0.48777601  0.51222402]


[ 0.47454944  0.52545059]
Average reward / 100 eps: 23.07


[ 0.48303679  0.51696318]
[ 0.47057432  0.52942568]


[ 0.40358469  0.59641528]
[ 0.43298465  0.56701535]


[ 0.4862529   0.51374716]
[ 0.49060404  0.50939602]


Average reward / 100 eps: 22.9


[ 0.47307903  0.52692097]
[ 0.45207277  0.54792732]


[ 0.46348834  0.53651166]
Average reward / 100 eps: 23.59


[ 0.48712891  0.51287109]
[ 0.45683345  0.54316658]
Average reward / 100 eps: 22.96


[ 0.48848668  0.51151329]
[ 0.47607961  0.52392036]


[ 0.39364359  0.60635644]
Average reward / 100 eps: 22.34


[ 0.49867523  0.50132477]
[ 0.48247951  0.51752043]


Average reward / 100 eps: 22.24


[ 0.48599571  0.51400429]
Average reward / 100 eps: 22.27


[ 0.4915497   0.50845033]
[ 0.50413376  0.49586624]


[ 0.48329309  0.51670688]
[ 0.46611023  0.53388983]
Average reward / 100 eps: 21.97


[ 0.49729681  0.50270319]
Average reward / 100 eps: 23.58


[ 0.48806301  0.51193696]
[ 0.49220493  0.50779516]


[ 0.48006725  0.51993275]
[ 0.41571328  0.58428675]


Average reward / 100 eps: 25.14


[ 0.48546916  0.5145309 ]
[ 0.47082841  0.52917159]


[ 0.47334036  0.52665967]
Average reward / 100 eps: 23.19


[ 0.48473904  0.51526099]
[ 0.45840296  0.54159713]


Average reward / 100 eps: 23.49


[ 0.48428464  0.51571536]
[ 0.514503    0.48549706]


[ 0.5000059   0.49999416]
Average reward / 100 eps: 23.26


[ 0.48141244  0.51858765]
[ 0.4651683  0.5348317]


Average reward / 100 eps: 22.08


[ 0.49440181  0.50559813]
[ 0.55707085  0.44292912]


Average reward / 100 eps: 20.79


[ 0.51009041  0.48990965]
[ 0.52820629  0.47179374]


[ 0.49397412  0.50602585]
[ 0.52511609  0.47488388]


[ 0.59173948  0.40826052]
Average reward / 100 eps: 19.68


[ 0.50521851  0.49478143]
[ 0.55783474  0.44216532]


Average reward / 100 eps: 18.95


[ 0.50367761  0.49632239]
[ 0.49056819  0.50943178]


[ 0.52202463  0.47797534]
Average reward / 100 eps: 20.19


[ 0.49826568  0.50173438]
[ 0.48453847  0.5154615 ]


[ 0.53834438  0.46165556]
[ 0.51981306  0.48018694]


[ 0.45604038  0.54395962]
Average reward / 100 eps: 19.06


[ 0.51079959  0.48920041]
[ 0.48618537  0.51381457]


Average reward / 100 eps: 17.35


[ 0.53281921  0.46718079]
Average reward / 100 eps: 17.85


[ 0.55051935  0.44948065]
[ 0.49857154  0.50142843]


Average reward / 100 eps: 17.17


[ 0.56068444  0.43931556]
[ 0.72581023  0.27418986]


[ 0.91167986  0.08832011]
Average reward / 100 eps: 15.74


[ 0.60441458  0.39558542]
[ 0.76521832  0.23478171]


Average reward / 100 eps: 14.3


[ 0.63757962  0.36242035]
[ 0.62910497  0.370895  ]


[ 0.92422724  0.07577282]
Average reward / 100 eps: 14.37


[ 0.63266748  0.36733252]
Average reward / 100 eps: 13.98


[ 0.64921564  0.35078436]
[ 0.56591034  0.43408963]


[ 0.99464059  0.00535939]
Average reward / 100 eps: 12.96


[ 0.68399847  0.3160015 ]
Average reward / 100 eps: 11.73


[ 0.70409727  0.2959027 ]
Average reward / 100 eps: 12.66


[ 0.69734466  0.30265531]
[ 0.99841654  0.00158342]
Average reward / 100 eps: 11.86


[ 0.7499761  0.2500239]
Average reward / 100 eps: 11.26


[ 0.77017087  0.2298291 ]
Average reward / 100 eps: 10.79


[ 0.80633026  0.19366971]
[ 0.9980343  0.0019657]


Average reward / 100 eps: 10.79


[ 0.7960816   0.20391843]
[ 0.98890746  0.01109258]


Average reward / 100 eps: 10.98


[ 0.8422209   0.15777902]
[ 0.9955017   0.00449825]


Average reward / 100 eps: 9.97


[ 0.83379865  0.1662014 ]
Average reward / 100 eps: 10.79


[ 0.85919887  0.1408011 ]
[  9.99923944e-01   7.61058182e-05]
Average reward / 100 eps: 10.18


[ 0.87603766  0.12396238]
Average reward / 100 eps: 10.03


[ 0.86152053  0.13847946]
Average reward / 100 eps: 10.37


[ 0.87846875  0.12153124]
Average reward / 100 eps: 9.97


[ 0.89751655  0.10248343]
Average reward / 100 eps: 9.75


[ 0.88337511  0.11662486]
Average reward / 100 eps: 9.63


[ 0.90532869  0.09467133]
Average reward / 100 eps: 9.8


[ 0.91390443  0.08609564]
Average reward / 100 eps: 9.78


[ 0.91249961  0.08750039]
Average reward / 100 eps: 9.78


[ 0.92368442  0.07631559]
Average reward / 100 eps: 9.73


[ 0.9388783   0.06112168]
Average reward / 100 eps: 9.72


[ 0.931705    0.06829496]
Average reward / 100 eps: 9.85


[ 0.94405955  0.05594045]
Average reward / 100 eps: 9.53


[ 0.94142902  0.05857096]
Average reward / 100 eps: 9.66


[ 0.95410866  0.0458914 ]
Average reward / 100 eps: 9.67


[ 0.95078039  0.0492196 ]
Average reward / 100 eps: 9.57


[ 0.95953959  0.04046037]
Average reward / 100 eps: 9.34


[ 0.95054877  0.04945129]
Average reward / 100 eps: 9.62


[ 0.95133364  0.04866631]
Average reward / 100 eps: 9.72


[ 0.96152264  0.03847732]
Average reward / 100 eps: 9.41


[ 0.95860666  0.04139328]
Average reward / 100 eps: 9.52


[ 0.95470303  0.04529693]
Average reward / 100 eps: 9.61


[ 0.9623844   0.03761559]
Average reward / 100 eps: 9.41


[ 0.95803171  0.04196827]
Average reward / 100 eps: 9.41


[ 0.95972621  0.04027372]
Average reward / 100 eps: 9.49


[ 0.96200109  0.0379989 ]
Average reward / 100 eps: 9.46


[ 0.9657222   0.03427781]
Average reward / 100 eps: 9.41


[ 0.97512507  0.02487496]
Average reward / 100 eps: 9.53


[ 0.97546482  0.02453522]
Average reward / 100 eps: 9.5


[ 0.97690135  0.02309862]
Average reward / 100 eps: 9.5


[ 0.97193313  0.02806693]
Average reward / 100 eps: 9.44


[ 0.97047698  0.02952297]
Average reward / 100 eps: 9.38


[ 0.97221464  0.02778543]
Average reward / 100 eps: 9.53


[ 0.97247249  0.02752755]
Average reward / 100 eps: 9.42


[ 0.97459126  0.02540875]
Average reward / 100 eps: 9.45


[ 0.97238296  0.02761702]
Average reward / 100 eps: 9.53


[ 0.96957153  0.03042841]
Average reward / 100 eps: 9.41


[ 0.97336018  0.02663975]
[  1.00000000e+00   2.50047982e-08]
Average reward / 100 eps: 9.46


[ 0.97677886  0.02322118]
Average reward / 100 eps: 9.4


[ 0.97822011  0.02177992]
Average reward / 100 eps: 9.4


[ 0.97893196  0.02106798]
Average reward / 100 eps: 9.47


[ 0.97918773  0.02081227]
Average reward / 100 eps: 9.31


[ 0.98115104  0.018849  ]
Average reward / 100 eps: 9.47


[ 0.98082584  0.01917412]
Average reward / 100 eps: 9.42


[ 0.98238498  0.01761502]
Average reward / 100 eps: 9.23


[ 0.98205066  0.01794934]
Average reward / 100 eps: 9.48


[ 0.98418367  0.01581634]
Average reward / 100 eps: 9.37


[ 0.98411316  0.01588688]
Average reward / 100 eps: 9.38


[ 0.97941691  0.02058308]
Average reward / 100 eps: 9.39


[ 0.98050469  0.01949525]
Average reward / 100 eps: 9.35


[ 0.9855665   0.01443349]
Average reward / 100 eps: 9.43


[ 0.98672163  0.01327839]
Average reward / 100 eps: 9.44


[ 0.98189878  0.0181012 ]
Average reward / 100 eps: 9.42


[ 0.98332477  0.01667516]
Average reward / 100 eps: 9.38


[ 0.9860298   0.01397026]
Average reward / 100 eps: 9.43


[ 0.98328745  0.01671259]
Average reward / 100 eps: 9.48


[ 0.98837715  0.01162282]
Average reward / 100 eps: 9.5


[ 0.98653644  0.01346356]
Average reward / 100 eps: 9.42


[ 0.99047559  0.00952444]
Average reward / 100 eps: 9.38


[ 0.98815966  0.01184037]
Average reward / 100 eps: 9.29


[ 0.98521233  0.01478766]
Average reward / 100 eps: 9.26


[ 0.98720086  0.01279917]
Average reward / 100 eps: 9.44


[ 0.98940396  0.01059607]
Average reward / 100 eps: 9.42


[ 0.98592895  0.01407109]
Average reward / 100 eps: 9.43


[ 0.98868549  0.01131449]
Average reward / 100 eps: 9.35


[ 0.98955017  0.0104498 ]
Average reward / 100 eps: 9.43


[ 0.98547947  0.01452051]
Average reward / 100 eps: 9.41


[ 0.988518    0.01148202]
Average reward / 100 eps: 9.42


[ 0.98801684  0.01198311]
Average reward / 100 eps: 9.4


[ 0.98709548  0.01290449]
Average reward / 100 eps: 9.32


[ 0.98866755  0.0113324 ]
Average reward / 100 eps: 9.32


[ 0.99074101  0.00925897]
Average reward / 100 eps: 9.4


[ 0.98678708  0.01321291]
Average reward / 100 eps: 9.27


[ 0.98671407  0.01328589]
Average reward / 100 eps: 9.23


[ 0.98896796  0.01103206]
Average reward / 100 eps: 9.53


[ 0.99239963  0.0076004 ]
Average reward / 100 eps: 9.43


[ 0.98829275  0.01170726]
Average reward / 100 eps: 9.41


[ 0.99029213  0.00970785]
Average reward / 100 eps: 9.4


[ 0.99076957  0.00923049]
Average reward / 100 eps: 9.47


[ 0.9913584   0.00864156]
Average reward / 100 eps: 9.34


[ 0.98924482  0.01075525]
Average reward / 100 eps: 9.34


[ 0.98957217  0.01042787]
Average reward / 100 eps: 9.29


[ 0.99036729  0.00963277]
Average reward / 100 eps: 9.45


[ 0.99232703  0.00767295]
Average reward / 100 eps: 9.3


[ 0.99065256  0.00934748]
Average reward / 100 eps: 9.42


[ 0.99242002  0.00757995]
Average reward / 100 eps: 9.32


[ 0.99091816  0.00908178]
[  1.00000000e+00   1.32687761e-09]
Average reward / 100 eps: 9.44


[ 0.99253041  0.00746962]
Average reward / 100 eps: 9.31


[ 0.99067229  0.00932764]
Average reward / 100 eps: 9.4


[ 0.99328351  0.00671646]
Average reward / 100 eps: 9.36


[ 0.99346381  0.00653612]
Average reward / 100 eps: 9.37


[ 0.99181867  0.00818131]
Average reward / 100 eps: 9.37


[ 0.99395162  0.00604832]
Average reward / 100 eps: 9.35


[ 0.99282503  0.00717499]
Average reward / 100 eps: 9.42


[ 0.99403518  0.00596484]
Average reward / 100 eps: 9.4


[ 0.9928      0.00719996]
Average reward / 100 eps: 9.2


[ 0.99033922  0.00966074]
Average reward / 100 eps: 9.26


[ 0.99374074  0.0062592 ]
Average reward / 100 eps: 9.43


[ 0.99221933  0.00778067]
Average reward / 100 eps: 9.41


[ 0.99198443  0.00801565]
Average reward / 100 eps: 9.42


[ 0.99483889  0.00516113]
Average reward / 100 eps: 9.32


[ 0.9927845   0.00721547]
Average reward / 100 eps: 9.4


[ 0.992652    0.00734807]
Average reward / 100 eps: 9.36


[ 0.99381626  0.00618379]
Average reward / 100 eps: 9.29


[ 0.99434537  0.00565455]
Average reward / 100 eps: 9.2


[ 0.99461937  0.00538064]
Average reward / 100 eps: 9.42


[ 0.99273539  0.00726457]
Average reward / 100 eps: 9.29


[ 0.99484694  0.00515309]
Average reward / 100 eps: 9.42


[ 0.99227905  0.00772093]
Average reward / 100 eps: 9.21


[ 0.99503088  0.00496913]
Average reward / 100 eps: 9.35


[ 0.99385035  0.00614965]
Average reward / 100 eps: 9.42


[ 0.99511755  0.00488243]
Average reward / 100 eps: 9.35


[ 0.99368143  0.00631859]
Average reward / 100 eps: 9.25


[ 0.99551398  0.00448602]
Average reward / 100 eps: 9.4


[ 0.99361742  0.00638261]
Average reward / 100 eps: 9.31


[ 0.99225664  0.00774333]
Average reward / 100 eps: 9.35


[ 0.9951362   0.00486385]
Average reward / 100 eps: 9.48


[ 0.99448377  0.00551623]
Average reward / 100 eps: 9.34


[ 0.99348998  0.00651002]
Average reward / 100 eps: 9.32


[ 0.99529487  0.00470517]
Average reward / 100 eps: 9.28


[ 0.9947108   0.00528915]
Average reward / 100 eps: 9.37


[ 0.99471855  0.00528141]
Average reward / 100 eps: 9.3


[ 0.99385554  0.00614442]
Average reward / 100 eps: 9.38


[ 0.99426419  0.0057358 ]
Average reward / 100 eps: 9.38


[ 0.99327761  0.00672235]
Average reward / 100 eps: 9.6


[ 0.99509227  0.00490768]
Average reward / 100 eps: 9.54


[ 0.99488139  0.00511856]
Average reward / 100 eps: 9.25


[ 0.99382341  0.00617663]
Average reward / 100 eps: 9.38


[ 0.99575365  0.00424638]
Average reward / 100 eps: 9.29


[ 0.99391335  0.00608665]
Average reward / 100 eps: 9.36


[ 0.99537438  0.00462559]
Average reward / 100 eps: 9.43


[ 0.99592233  0.00407761]
Average reward / 100 eps: 9.41


[ 0.99593532  0.0040647 ]
Average reward / 100 eps: 9.36


[ 0.99517411  0.00482586]
Average reward / 100 eps: 9.42


[ 0.99447292  0.00552703]
Average reward / 100 eps: 9.33


[ 0.99573046  0.00426959]
Average reward / 100 eps: 9.37


[ 0.99496788  0.00503214]
Average reward / 100 eps: 9.38


[ 0.99532372  0.00467625]
Average reward / 100 eps: 9.38


[ 0.99563438  0.00436556]
Average reward / 100 eps: 9.18


[ 0.99361658  0.00638338]
Average reward / 100 eps: 9.44


[ 0.99475515  0.0052449 ]
Average reward / 100 eps: 9.45


[ 0.99680316  0.00319683]
Average reward / 100 eps: 9.56


[ 0.99472284  0.00527712]
Average reward / 100 eps: 9.39


[ 0.99574846  0.00425151]
Average reward / 100 eps: 9.2


[ 0.9967379   0.00326215]
Average reward / 100 eps: 9.35


[ 0.99475765  0.00524236]
Average reward / 100 eps: 9.41


[ 0.99605453  0.00394542]
Average reward / 100 eps: 9.29


[ 0.99516064  0.00483942]
Average reward / 100 eps: 9.49


[ 0.99519759  0.00480239]
Average reward / 100 eps: 9.41


[ 0.99671161  0.00328836]
Average reward / 100 eps: 9.38


[ 0.99677867  0.00322138]
Average reward / 100 eps: 9.37


[ 0.99655199  0.00344795]
Average reward / 100 eps: 9.4


[ 0.99645209  0.00354794]
Average reward / 100 eps: 9.22


[ 0.99503243  0.00496752]
Average reward / 100 eps: 9.3


[ 0.99571061  0.00428933]
Average reward / 100 eps: 9.43


[ 0.99612021  0.00387985]
Average reward / 100 eps: 9.31


[ 0.99587387  0.00412612]
[  1.00000000e+00   1.64897082e-10]
Average reward / 100 eps: 9.41


In [38]:
env.close()