# Deep Q-Learning

In this notebook, you will implement a deep Q-Learning reinforcement algorithm. The implementation borrows ideas from both the original DeepMind Nature paper and the more recent asynchronous version:<br/>
[1] "Human-Level Control through Deep Reinforcement Learning" by Mnih et al. 2015<br/>
[2] "Asynchronous Methods for Deep Reinforcement Learning" by Mnih et al. 2016.<br/>

In particular:
* We use separate target and Q-functions estimators with periodic updates to the target estimator. 
* We use several concurrent "threads" rather than experience replay to generate less biased gradient updates. 
* Threads are actually synchronized so we start each one at a random number of moves.
* We use an epsilon-greedy policy that blends random moves with policy moves.
* We taper the random action parameter (epsilon) and the learning rate to zero during training.

This gives a simple and reasonably fast general-purpose RL algorithm. We use it here for the Cartpole environment from OpenAI Gym, but it can easily be adapted to others. For this notebook, you will implement 4 steps:

1. The backward step for the Q-estimator
2. The $\epsilon$-greedy policy
3. "asynchronous" initialization 
4. The Q-learning algorithm

To get started, we import some prerequisites.

In [1]:
%matplotlib inline

import gym
import numpy as np
import sys
import matplotlib.pyplot as plt
import time
import pickle

The block below lists some parameters you can tune. They should be self-explanatory. They are currently set to train CartPole-V0 to a "solved" score (> 195) most of the time. 

In [2]:
nsteps = 10001                       # Number of steps to run (game actions per environment)
npar = 16                            # Number of parallel environments
target_window = 200                  # Interval to update target estimator from q-estimator
discount_factor = 0.99               # Reward discount factor
printsteps = 1000                    # Number of steps between printouts
render = False                       # Whether to render an environment while training

epsilon_start = 1.0                  # Parameters for epsilon-greedy policy: initial epsilon
epsilon_end = 0.0                    # Final epsilon
neps = int(0.8*nsteps)               # Number of steps to decay epsilon

learning_rate = 2e-3                 # Initial learning rate
lr_end = 0                           # Final learning rate
nlr = neps                           # Steps to decay learning rate
decay_rate = 0.99                    # Decay factor for RMSProp 

nhidden = 200                        # Number of hidden layers for estimators

init_moves = 2000                    # Upper bound on random number of moves to take initially
nwindow = 2                          # Sensing window = last n images in a state

Below are environment-specific parameters. The function "preprocess" should process an observation returned by the environment into a vector for training. For CartPole we simply append a 1 to implement bias in the first layer. 

For visual environments you would typically crop, downsample to 80x80, set color to a single bit (foreground/background), and flatten to a vector. That transformation is already implemented in the Policy Gradient code.

*nfeats* is the dimension of the vector output by *preprocess*.

In [3]:
game_type="CartPole-v0"                 # Model type and action definitions
VALID_ACTIONS = [0, 1]
nactions = len(VALID_ACTIONS)
nfeats = 5                              # There are four state features plus the constant we add

def preprocess(I):                      # preprocess each observation
    """Just append a 1 to the end"""
    return np.append(I.astype(float),1) # Add a constant feature for bias

Here is the Q-estimator class. We use two instances of this class, one for the target estimator, and one for the Q-estimator. The Q function is normally represented as a scalar $Q(x,a)$ where $x$ is the state and $a$ is an action. For ease of implementation, we actually estimate a vector-valued function $Q(x,.)$ which returns the estimated reward for every action. The model here has just a single hidden layer:

<pre>
Input Layer (nfeats) => FC Layer => RELU => FC Layer => Output (naction values)
</pre>

## 1. Implement Q-estimator gradient
Your first task is to implement the
<pre>Estimator.gradient(s, a, y)</pre>
method for this class. **gradient** should compute the gradients wrt weight arrays W1 and W2 into
<pre>self.grad['W1']
self.grad['W2']</pre>
respectively. Both <code>a</code> and <code>y</code> are vectors. Be sure to update only the output layer weights corresponding to the given action vector. 

In [4]:
from __future__ import division

In [130]:
class Estimator():

    def __init__(self, ninputs, nhidden, nactions):
        """ Create model matrices, and gradient and squared gradient buffers"""
        model = {}
        model['W1'] = np.random.randn(nhidden, ninputs) / np.sqrt(ninputs)   # "Xavier" initialization
        model['W2'] = np.random.randn(nactions, nhidden) / np.sqrt(nhidden)
        self.model = model
        self.grad = { k : np.zeros_like(v) for k,v in model.iteritems() }
        self.gradsq = { k : np.zeros_like(v) for k,v in model.iteritems() }   
        

    def forward(self, s):
        """ Run the model forward given a state as input.
    returns action predictions and the hidden state"""
        h = np.dot(self.model['W1'], s) # nhidden x npars
        h[h<0] = 0 # ReLU nonlinearity # nhidden x npars
        rew = np.dot(self.model['W2'], h) # nactions, npars
        print(type(rew))
        return rew, h
    
    
    def predict(self, s):
        """ Predict the action rewards from a given input state"""
        rew, h = self.forward(s)
        return rew, h
    
              
    def gradient(self, s, a, y):
        """ Given a state s, action a and target y, compute the model gradients"""
    ##################################################################################
    ##                                                                              ##
    ## TODO: Compute gradients and return a scalar loss on a minibatch of size npar ##
    ##    s is the input state matrix (ninputs x npar).                             ##
    ##    a is an action vector (npar,).                                            ##
    ##    y is a vector of target values (npar,) corresponding to those actions.    ##
    ##    return: the loss per sample (npar,).                                      ##                          
    ##                                                                              ##
    ## Notes:                                                                       ##
    ##    * If the action is ai in [0,...,nactions-1], backprop only through the    ##
    ##      ai'th output.                                                           ##
    ##    * loss should be L2, and we recommend you normalize it to a per-input     ##
    ##      value, i.e. return L2(target,predition)/sqrt(npar).                     ##
    ##    * save the gradients in self.grad['W1'] and self.grad['W2'].              ##
    ##    * update self.grad['W1'] and self.grad['W2'] by adding the gradients, so  ##
    ##      that multiple gradient steps can be used beteween updates.              ##
    ##                                                                              ##
    ##################################################################################
        ninputs, npar = s.shape
        loss_obs = [] # Collecting L_i
        loss = 0.0
        # Calculating Loss
        # Prediction: nactions x npar, h: nhidden x npar
        prediction, h = self.predict(s) 
        nactions, _ = prediction.shape
        predicted_reward = [] 
        #w2_grads = []
        w1_grads = []
        # ----- Lambda Function to determine the positivity of deviation ---- #
        sign_func = lambda a: (a>0) - (a<0)
        self.grad['W1'] = np.zeros_like(self.model['W1'])
        self.grad['W2'] = np.zeros_like(self.model['W2'])
        
        dh = np.zeros_like(h)
        for i in range(npar):
            sign = sign_func(prediction[a[i],i]-y[i])
            loss_i = prediction[a[i], i] - y[i]
            loss_i = sign*loss_i
            self.grad['W2'][a[i], :] += sign*h[i, :].T
            dh[:, i] = sign*(self.model['W2'][a[i],:]).T
            loss += loss_i
        
        self.grad['W1'] = np.dot(dh.T, s)
            
            
            
            
            
            
            
            # For each environment, we are conducting backpropagation
            #dscores = np.zeros(nactions)
            #dscores[a[i]] = 1
            #square_diffs[i] = (prediction[a[i], i] - y[i])**2
            #one_hot = np.zeros(nactions)
            #one_hot[a[i]] = 2*(prediction[a[i], i] - y[i]) # nactions x 1
            #tmp = np.zeros_like(self.model['W2'])
            #tmp[a[i],:] = h[:, i]*one_hot[a[i]]
            #w2_grads.append(tmp) 
            
            #dhidden = np.dot(dscores, self.model['W2']).T # nhidden x 1
            #dhidden[h[:, i] <= 0] = 0 # RELU backprop
            
            # On W1
            #tmp_2 = np.zeros_like(self.model['W1'])
            #tmp_2 = np.dot(dhidden, )
            # 
            
            
        #for i in range(npar):
        #    predicted_reward.append(prediction[a[i],i])
        #    loss_obs.append(np.sqrt((prediction[a[i], i] - y[i])**2))
            
        #predicted_reward = np.array(predicted_reward) ### 
        #loss_obs = np.array(loss_obs)
        #tmp_loss = np.linalg.norm(loss_obs)
        #loss = tmp_loss/(np.sqrt(npar))
        ###### dscroes as the one hot encoding 
        #dscores = np.zeros_like(prediction)
        
        # Updating Gradients
        #for i in range(npar):
        #    dscores[a[i], i] = 1
        #    action = a[i]
        #    self.grad['W2'][action,:] += h[:,i].T
        
        #self.grad['W2'] = self.grad['W2']/(2*tmp_loss*np.sqrt(npar))
        
        
        
        #for i in range(npar):
        #    action = a[i]
        #    self.grad['W1'] += (1/np.sqrt())
        
        
        
        #self.model['W1'] -= learning_rate*self.grad['W1']
        #self.model['W2'] -= learning_rate*self.grad['W2']
        
        return loss
    
    
    def rmsprop(self, learning_rate, decay_rate): 
        """ Perform model updates from the gradients using RMSprop"""
        for k in self.model:
            g = self.grad[k]
            self.gradsq[k] = decay_rate * self.gradsq[k] + (1 - decay_rate) * g*g
            self.model[k] -= learning_rate * g / (np.sqrt(self.gradsq[k]) + 1e-5)
            self.grad[k].fill(0.0)

## 2. Implement $\epsilon$-Greedy Policy

An $\epsilon$-Greedy policy should:
* with probability $\epsilon$ take a uniformly-random action.
* otherwise choose the best action according to the estimator from the given state.

The function below should implement this policy. It should return a matrix A of size (nactions, npar) such that A[i,j] is the probability of taking action i on input j. The probabilities of non-optimal actions should be $\epsilon/{\rm nactions}$ and the probability of the best action should be $1-\epsilon+\epsilon/{\rm nactions}$.

Since the function processes batches of states, the input <code>state</code> is a <code>ninputs x npar</code> matrix, and the returned value should be a <code>nactions x npar</code> matrix. 

In [6]:
def policy(estimator, state, epsilon):
    """ Take an estimator and state and predict the best action.
    For each input state, return a vector of action probabilities according to an epsilon-greedy policy"""
    ##################################################################################
    ##                                                                              ##
    ## TODO: Implement an epsilon-greedy policy                                     ##
    ##       estimator: is the estimator to use (instance of Estimator)             ##
    ##       state is an (ninputs x npar) state matrix                              ##
    ##       epsilon is the scalar policy parameter                                 ##
    ## return: an (nactions x npar) matrix A where A[i,j] is the probability of     ##
    ##       taking action i on input j.                                            ##
    ##                                                                              ##
    ## Use the definition of epsilon-greedy from the cell above.                    ##
    ##                                                                              ##
    ##################################################################################
    act_val, h = estimator.predict(state) # act_val: naction x npar
    #print(act_val)
    #print(act_val.shape)
    nactions, npar = act_val.shape
    opt_action = np.argmax(act_val, axis=0)
    #print(opt_action)
    A = np.zeros((nactions, npar)) 
    A += epsilon/nactions # the exploration part
    for i in range(npar):
        #print(i)
        A[opt_action[i], i] = 1 - epsilon + (epsilon/nactions)
    return A

In [97]:
state = np.random.random((nfeats*nwindow, npar))
q_estimator1 = Estimator(nfeats*nwindow, nhidden, nactions)
policy(q_estimator1, state, 0.1)

<type 'numpy.ndarray'>


array([[ 0.05,  0.05,  0.05,  0.05,  0.05,  0.05,  0.05,  0.05,  0.05,
         0.95,  0.05,  0.05,  0.05,  0.05,  0.05,  0.05],
       [ 0.95,  0.95,  0.95,  0.95,  0.95,  0.95,  0.95,  0.95,  0.95,
         0.05,  0.95,  0.95,  0.95,  0.95,  0.95,  0.95]])

In [11]:
state.shape

(10, 16)

This routine copies the state of one estimator into another. Its used to update the target estimator from the Q-estimator.

In [10]:
def update_estimator(to_estimator, from_estimator, window, istep):
    """ every <window> steps, Copy model state from from_estimator into to_estimator"""
    if (istep % window == 0):
        for k in from_estimator.model:
            np.copyto(to_estimator.model[k], from_estimator.model[k])

## 3. Implement "Asynchronous Threads"

Don't try that in Python!! Actually all we do here is create an array of environments and advance each one a random number of steps, using random actions at each step. Later on we will make *synchronous* updates to all the environments, but the environments (and their gradient updates) should remain uncorrelated. This serves the same goal as asynchronous updates in paper [2], or experience replay in paper [1].

In [82]:
import random

In [83]:
block_reward = np.zeros((16), dtype=float);
total_epochs = np.zeros((16), dtype=float);

In [84]:
block_reward[0]

0.0

In [85]:
print block_reward.shape, total_epochs.shape

(16,) (16,)


In [86]:
# Create estimators
q_estimator = Estimator(nfeats*nwindow, nhidden, nactions)
target_estimator = Estimator(nfeats*nwindow, nhidden, nactions)

# The epsilon and learning rate decay schedules
epsilons = np.linspace(epsilon_start, epsilon_end, neps)
learning_rates = np.linspace(learning_rate, lr_end, nlr)



In [87]:
nfeats * nwindow

10

In [88]:
import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        print(env.action_space)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done: 
            print("Episode finished after {} timesteps".format(t+1))
            break

[2016-11-21 16:23:01,257] Making new env: CartPole-v0


[ 0.02722678  0.03404252 -0.04800595 -0.03686328]
Discrete(2)
[ 0.02790763 -0.16035932 -0.04874321  0.2402952 ]
Discrete(2)
[ 0.02470044 -0.35475231 -0.04393731  0.51721353]
Discrete(2)
[ 0.01760539 -0.15904014 -0.03359304  0.21101517]
Discrete(2)
[ 0.01442459  0.03654562 -0.02937273 -0.09207244]
Discrete(2)
[ 0.0151555   0.232076   -0.03121418 -0.39387583]
Discrete(2)
[ 0.01979702  0.42762665 -0.0390917  -0.69623441]
Discrete(2)
[ 0.02834956  0.62326832 -0.05301639 -1.00096274]
Discrete(2)
[ 0.04081492  0.42889342 -0.07303564 -0.72538956]
Discrete(2)
[ 0.04939279  0.62494524 -0.08754343 -1.0401367 ]
Discrete(2)
[ 0.0618917   0.43108853 -0.10834617 -0.77617011]
Discrete(2)
[ 0.07051347  0.23761027 -0.12386957 -0.51944596]
Discrete(2)
[ 0.07526567  0.04442993 -0.13425849 -0.26822027]
Discrete(2)
[ 0.07615427  0.24118694 -0.13962289 -0.60005334]
Discrete(2)
[ 0.08097801  0.43795762 -0.15162396 -0.93325295]
Discrete(2)
[ 0.08973716  0.24517049 -0.17028902 -0.69179682]
Discrete(2)
[ 0.0946

In [77]:
envs[i].reset().shape

(4,)

In [78]:
observation_prev = np.zeros((4))
observation_prev.shape

(4,)

In [89]:
# Initialize the games
print("Initializing games..."); sys.stdout.flush()
envs = np.empty(npar, dtype=object)
state = np.zeros([nfeats * nwindow, npar], dtype=float)
rewards = np.zeros([npar], dtype=float)
dones = np.empty(npar, dtype=int)
actions = np.zeros([npar], dtype=int)


for i in range(npar):
    print('agent', i)
    envs[i] = gym.make(game_type)
    ##################################################################################
    ##                                                                              ##
    ## TODO: Advance each environment by a random number of steps, where the number ##
    ##       of steps is sampled uniformly from [nwindow, init_moves].              ##
    ##       Use random steps to advance.                                           ## 
    ##                                                                              ##
    ## Update the total reward and total epochs variables as you go.                ##
    ## If an environment returns done=True, reset it and increment the epoch count. ##
    ##                                                                              ##
    ##################################################################################
    length_game = random.randint(nwindow, init_moves)
    #print(state[:,i].shape)
    #print(envs[i].reset().shape)
    observation_prev = np.zeros((4)) # initiailzing frame t-2
    observation = envs[i].reset()
    for t in range(length_game):
        envs[i].render()
        actions[i] = envs[i].action_space.sample()
        #print(actions[i])
        observation_prev = observation
        observation, rewards[i], dones[i], _ = envs[i].step(actions[i])
        #print(rewards[i])
        block_reward[i] += rewards[i]
        total_epochs[i] += 1
        if dones[i]:
            print("Episode finished after {} timesteps".format(t+1))
            block_reward[i] = 0.0 # resettig total reward
            envs[i].reset()
    rewards[i] = block_reward[i]
    state[:, i] = np.hstack([observation, float(1), observation_prev, float(1)])
    
    
    
    
    
       

Initializing games...


[2016-11-21 16:23:17,038] Making new env: CartPole-v0


('agent', 0)
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 13 timesteps
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 35 timesteps
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 56 timesteps
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 71 timesteps
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
Episode finished after 80 timesteps
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1

[2016-11-21 16:23:32,713] Making new env: CartPole-v0


1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 930 timesteps
0
1.0
1
1.0
1
1.0
1
1.0
('agent', 1)
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
Episode finished after 14 timesteps
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 29 timesteps
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
Episode finished after 72 timesteps
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 91 timesteps
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0


[2016-11-21 16:23:34,463] Making new env: CartPole-v0


1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
('agent', 2)
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 23 timesteps
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
Episode finished after 38 timesteps
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
Episode finished after 53 timesteps
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 69 timesteps
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 85 timesteps
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
Episode finishe

[2016-11-21 16:23:35,705] Making new env: CartPole-v0


1
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 479 timesteps
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
Episode finished after 492 timesteps
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
Episode finished after 508 timesteps
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
Episode finished after 518 timesteps
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 533 timesteps
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
Episode finished after 548 timesteps
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
Episode finished after

[2016-11-21 16:23:37,122] Making new env: CartPole-v0


0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
Episode finished after 507 timesteps
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
Episode finished after 533 timesteps
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 544 timesteps
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 556 timesteps
1
1.0
1
1.0
0
1.0
('agent', 4)
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 12 timesteps
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
Episode finished after 23 timesteps
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
Episode finished after 52 timesteps
1
1.0

[2016-11-21 16:23:38,385] Making new env: CartPole-v0


1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
Episode finished after 966 timesteps
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 988 timesteps
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
Episode finished after 1000 timesteps
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 1012 timesteps
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
Episode finished after 1029 timesteps
0
1.0
('agent', 5)
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
Episode finished after 10 timesteps
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
Episode finished after 40 timesteps
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1

[2016-11-21 16:23:41,184] Making new env: CartPole-v0


0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 1827 timesteps
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 1841 timesteps
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
Episode finished after 1861 timesteps
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 1871 timesteps
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 1904 timesteps
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
Episode finished after 1924 timesteps
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 1945 

[2016-11-21 16:23:43,052] Making new env: CartPole-v0


1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 1339 timesteps
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 1366 timesteps
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
Episode finished after 1382 timesteps
0
1.0
0
1.0
0
1.0
1
1.0
('agent', 7)
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 12 timesteps
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
Episode finished after 24 timesteps
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
Episode finished after 39 timesteps
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
Episode finished after 49 timesteps
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1

[2016-11-21 16:23:44,618] Making new env: CartPole-v0


1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 704 timesteps
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 722 timesteps
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
Episode finished after 736 timesteps
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
Episode finished after 745 timesteps
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
Episode finished after 779 timesteps
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0

[2016-11-21 16:23:46,505] Making new env: CartPole-v0


1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
Episode finished after 789 timesteps
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
Episode finished after 821 timesteps
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 843 timesteps
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
Episode finished after 873 timesteps
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 899 timesteps
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 

[2016-11-21 16:23:47,737] Making new env: CartPole-v0


1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 1299 timesteps
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 1335 timesteps
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 1357 timesteps
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
Episode finished after 1377 timesteps
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 1391 timesteps
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 1403 timestep

[2016-11-21 16:23:48,236] Making new env: CartPole-v0


0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
('agent', 11)
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
Episode finished after 19 timesteps
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
Episode finished after 67 timesteps
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 90 timesteps
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 108 timesteps
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
Episode finis

[2016-11-21 16:23:49,557] Making new env: CartPole-v0


0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 703 timesteps
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 716 timesteps
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 740 timesteps
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
Episode finished after 751 timesteps
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
Episode finished after 760 timesteps
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
Episode finished after 774 timesteps
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 788 timesteps
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1

[2016-11-21 16:23:51,064] Making new env: CartPole-v0


1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
Episode finished after 1751 timesteps
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
Episode finished after 1797 timesteps
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
Episode finished after 1823 timesteps
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
Episode finished after 1852 timesteps
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
Episode finished after 1865 timesteps
0
1.0
0
1.0
1
1.0


[2016-11-21 16:23:52,457] Making new env: CartPole-v0


1
1.0
1
1.0
0
1.0
1
1.0
Episode finished after 1274 timesteps
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
Episode finished after 1289 timesteps
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
Episode finished after 1316 timesteps
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 1329 timesteps
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
Episode finished after 1344 timesteps
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
Episode finished after 1360 timesteps
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
Episode finished after 1377 timesteps
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
Episode finished after 138

[2016-11-21 16:24:17,451] Making new env: CartPole-v0


0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
('agent', 15)
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
Episode finished after 12 timesteps
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
Episode finished after 28 timesteps
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
Episode finished after 60 timesteps
1
1.0
1
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
Episode finished after 76 timesteps
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
1
1.0
1
1.0
Episode finished after 103 timesteps
0
1.0
0
1.0
1
1.0
1
1.0
1
1.0
0
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
1
1.0
0
1.0
0
1.0
0
1.0
1
1.0
0
1.0
1
1.0
0
1.0
1
1.0
1
1.0
Episode finished after 12

In [100]:
Block_Reward = np.sum(block_reward)
Block_Reward

302.0

In [95]:
state[:, 5]

array([-0.04192921, -0.37404987,  0.0694541 ,  0.70579841,  1.        ,
       -0.03836699, -0.1781108 ,  0.06156695,  0.39435741,  1.        ])

## 4. Implement Deep Q-Learning
In this cell you actually implement the algorithm. We've given you comments to define all the steps. You should also add book-keeping steps to keep track of the loss, reward and number of epochs (where env.step() returns done = true). 

In [98]:
env_reward = np

10001

In [131]:
t0 = time.time()
block_loss = 0.0
last_epochs=0
#epsilon = 0.2

y_actual = np.zeros((npar), dtype=float)
a_actual = np.zeros((npar), dtype=float)
print(a_actual)

for istep in np.arange(nsteps): 
    if (render): envs[0].render()
  
    #########################################################################
    ## TODO: Implement Q-Learning                                          ##
    ##                                                                     ##
    ## At high level, your code should:                                    ##
    ## * Update epsilon and learning rate.                                 ##
    ## * Update target estimator from Q-estimator if needed.               ##
    ## * Get the next action probabilities for the minibatch by running    ##
    ##   the policy on the current state with the Q-estimator.             ##
    ## * Then for each environment:                                        ##
    ##     ** Pick an action according to the action probabilities.        ##
    ##     ** Step in the gym with that action.                            ##
    ##     ** Process the observation and concat it to the last nwindow-1  ##
    ##        processed observations to form a new state.                  ##
    ## Then for all environments (vectorized):                             ##
    ## * Predict Q-scores for the new state using the target estimator.    ##
    ## * Compute new expected rewards using those Q-scores.                ##
    ## * Using those expected rewards as a target, compute gradients and   ##
    ##   update the Q-estimator.                                           ##
    ## * Step to the new state.                                            ##
    ##                                                                     ##
    #########################################################################
    indx = int(istep/(nsteps/(1.0*neps)))
    epsilon = epsilons[indx]
    lr = learning_rates[indx]
    update_estimator(q_estimator, target_estimator, nwindow, istep)
    rew, _ = q_estimator.predict(state)
    A = policy(q_estimator, state, epsilon)
    for j in range(npar):
        # ------- Choosing the action using the Greedy Policy ---------- #
        action_prob = A[:, j]
        sampling_prob = np.random.uniform()
        action = int(sampling_prob > action_prob[0])
        #print(action)
        #print(a_actual)
        a_actual[j] = action
        state[5:9, j] = state[0:4, j]
        state[0:4, j], rewards[j], dones[j], _ = envs[j].step(action)
        if dones[j]:
            rewards[j] = 0
        y_actual[j] = rewards[j]
    q_estimator.gradient(state, a_actual, y_actual) 
    q_estimator.rmsprop()    
        
        
    Block_Reward += np.sum(rewards)
    t = time.time() - t0
    if (istep % printsteps == 0):     
        print("step {:0d}, time {:.1f}, loss {:.8f}, epochs {:0d}, reward/epoch {:.5f}".format(
                istep, t, block_loss/printsteps, total_epochs, Block_Reward/np.maximum(1,total_epochs-last_epochs)))
        last_epochs = total_epochs
        Block_Reward = 0.0
        block_loss = 0.0

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>


NameError: global name 'shape' is not defined

In [115]:
i = 5
state[:, 5]

array([ 0.23530939,  0.55745888,  0.92382342,  0.39634419,  0.28210271,
        0.92657936,  0.41145885,  0.75122915,  0.86642856,  0.56349944])

In [114]:
state[5:9, 5]

array([ 0.21600844,  0.25735466,  0.86689874,  0.91900546])

Let's save the model now. 

In [None]:
pickle.dump(q_estimator.model, open("cartpole_q_estimator.p", "wb"))


You can reload the model later if needed:

In [None]:
test_estimator = Estimator(nfeats*nwindow, nhidden, nactions)
test_estimator.model = pickle.load(open("cartpole_q_estimator.p", "rb"))

And animate the model's performance. 

In [None]:
state0 = state[:,0]
for i in np.arange(200):
    envs[0].render()
    preds = test_estimator.predict(state0)
    iaction = np.argmax(preds)
    obs, _, done0, _ = envs[0].step(VALID_ACTIONS[iaction])
    state0 = np.concatenate((state0[nfeats:], preprocess(obs)))
    if (done0): envs[0].reset()
    

So there we have it. Simple 1-step Q-Learning can solve easy problems very fast. Note that environments that produce images will be much slower to train on than environments (like CartPole) which return an observation of the state of the system. But this model can still train on those image-based games - like Atari games. It will take hours-days however. It you try training on visual environments, we recommend you run the most expensive step - rmsprop - less often (e.g. every 10 iterations). This gives about a 3x speedup. 

## Optional
Do **one** of the following tasks:
* Adapt the DQN algorithm to another environment - it can use direct state observations.  Call <code>env.get_action_meanings()</code> to find out what actions are allowed. Summarize training performance: your final average reward/epoch, the number of steps required to train, and any modifications to the model or its parameters that you made.
* Try smarter schedules for epsilon and learning rate. Rewards for CartPole increase very sharply (several orders of magnitude) with better policies, especially as epsilon --> 0. Gradients will also change drastically, so the initial learning rate is probably not good later on. Try schedules for decreasing epsilon that allow the model to better adapt. Try other learning rate schedules, or setting learning rate based on average reward. 
* Try a fancier model. e.g. add another hidden layer, or try sigmoid non-linearities.