##### [sample solution, trained for a few hours (not converged)]

# This tutorial will walk you through your first deep reinforcement learning model


* CartPole-v0 game as an example
* Training a simple Lasagne neural network with a Q-learning objective


## About OpenAI Gym

* It's a recently published platform that lets you train agents in a wide variety of environments through a near-identical interface.
* This is twice as awesome since now we don't need to write a new wrapper for every game
* Go check it out!
  * Blog post - https://openai.com/blog/openai-gym-beta/
  * Github - https://github.com/openai/gym
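
The "near-identical interface" comes down to two calls: `reset()` starts an episode and `step(action)` advances it by one tick. Below is a minimal sketch of an interaction loop. `ToyEnv` is a hypothetical stand-in written for this example so that it runs without gym installed; it only imitates the `(observation, reward, done, info)` return convention of the gym version used in this notebook.

```python
import random

class ToyEnv:
    """Hypothetical stand-in for a gym environment (not part of gym)."""
    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        # start a new episode and return the first observation
        self.t = 0
        return [0.0]

    def step(self, action):
        # advance one tick; returns (observation, reward, done, info)
        self.t += 1
        done = self.t >= self.max_steps
        return [float(self.t)], 1.0, done, {}

def run_episode(env, policy):
    """Play one episode with policy(observation) -> action, return total reward."""
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        obs, reward, done, info = env.step(policy(obs))
        total_reward += reward
    return total_reward

random.seed(0)
total = run_episode(ToyEnv(max_steps=5), lambda obs: random.choice([0, 1]))
print(total)
```

Because every environment exposes the same two calls, `run_episode` does not need to know which game it is playing.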


## New to Lasagne and AgentNet?
* We only require surface-level knowledge of Theano and Lasagne, so you can learn them as you go.
* Alternatively, you can find Lasagne tutorials here:
 * Official mnist example: http://lasagne.readthedocs.io/en/latest/user/tutorial.html
 * From scratch: https://github.com/ddtm/dl-course/tree/master/Seminar4
 * From theano: https://github.com/craffel/Lasagne-tutorial/blob/master/examples/tutorial.ipynb
* This is pretty much the basic tutorial for AgentNet, so it's okay not to know it.


# Experiment setup
* Here we basically just load the game and check that it works

In [1]:
from __future__ import print_function 
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
%env THEANO_FLAGS="floatX=float32"

env: THEANO_FLAGS="floatX=float32"


In [9]:
#global params.
GAME = "CartPole-v0"

#number of parallel agents and batch sequence length (frames)
N_AGENTS = 1
SEQ_LENGTH = 20

In [15]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import gym
env = gym.make(GAME)
env.reset() #start an episode before taking the first step
obs = env.step(0)[0]
action_names = np.array(["left","right"]) #CartPole-v0: action 0 pushes the cart left, 1 pushes it right
state_size = len(obs)
print(obs)

[2016-11-25 16:15:03,312] Making new env: CartPole-v0


[-0.02822146 -0.20307278 -0.02872299  0.30772524]


In [16]:
env.action_space.n

2

# Basic agent setup
Here we define a simple agent that maps game observations to Q-values using a shallow neural network.
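
In plain numpy, the network defined in the next cells is one tanh hidden layer followed by a linear layer with one output per action. This is a rough sketch with random weights: the shapes 4 → 100 → 2 follow the CartPole observation size and `env.action_space.n`, while the weight names and initialization are illustrative assumptions, not what Lasagne does internally.

```python
import numpy as np

rng = np.random.RandomState(0)
state_size, n_hidden, n_actions = 4, 100, 2

# illustrative random weights (Lasagne uses its own initializers)
W0 = rng.normal(scale=0.1, size=(state_size, n_hidden))
b0 = np.zeros(n_hidden)
W1 = rng.normal(scale=0.1, size=(n_hidden, n_actions))
b1 = np.zeros(n_actions)

def q_values(observations):
    """observations: [batch, state_size] -> Q-values: [batch, n_actions]"""
    hidden = np.tanh(observations.dot(W0) + b0)   # hidden layer with tanh
    return hidden.dot(W1) + b1                    # linear output, one value per action

batch = rng.normal(size=(3, state_size))
q = q_values(batch)
print(q.shape)
```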


In [17]:
import lasagne
from lasagne.layers import InputLayer,DenseLayer,batch_norm,dropout
#observation vector at current tick goes here, shape = (batch_i, state_size)
observation_layer = InputLayer((None,state_size))

dense0 = DenseLayer(observation_layer,100,
                    name='dense',
                    nonlinearity = lasagne.nonlinearities.tanh)


In [18]:
#a layer that predicts Qvalues
qvalues_layer = DenseLayer(dense0,
                   num_units = env.action_space.n,
                   nonlinearity=lasagne.nonlinearities.linear,
                   name="q-evaluator layer")

#To pick actions, we use an epsilon-greedy resolver (epsilon is a property)
from agentnet.resolver import EpsilonGreedyResolver
action_layer = EpsilonGreedyResolver(qvalues_layer,name="e-greedy action picker")

action_layer.epsilon.set_value(np.float32(0.1))
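
For intuition, epsilon-greedy action selection can be sketched in numpy as follows. This illustrates the general rule (with probability epsilon take a uniformly random action, otherwise take the action with the highest Q-value); it is not a copy of AgentNet's `EpsilonGreedyResolver`, whose details may differ.

```python
import numpy as np

def epsilon_greedy(qvalues, epsilon, rng):
    """qvalues: [batch, n_actions] -> integer actions: [batch]"""
    batch_size, n_actions = qvalues.shape
    greedy = qvalues.argmax(axis=1)                             # best action per sample
    random_actions = rng.randint(0, n_actions, size=batch_size)
    explore = rng.uniform(size=batch_size) < epsilon            # which samples explore
    return np.where(explore, random_actions, greedy)

rng = np.random.RandomState(0)
q = np.array([[0.1, 0.9],
              [0.7, 0.2]])
greedy_actions = epsilon_greedy(q, 0.0, rng)    # epsilon=0: always the argmax
random_actions = epsilon_greedy(q, 1.0, rng)    # epsilon=1: always random
print(greedy_actions, random_actions)
```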


In [19]:
from agentnet.target_network import TargetNetwork

targetnet = TargetNetwork(qvalues_layer)
old_qvalues_layer = targetnet.output_layers
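
A target network is a slowly changing copy of the Q-network that provides the reference Q-values. The training loop later calls `targetnet.load_weights(0.01)`. Assuming that argument is the mixing coefficient of a moving-average update (an assumption about AgentNet, so check its documentation), the idea looks like this in numpy:

```python
import numpy as np

def soft_update(target_weights, online_weights, alpha):
    """Move each target weight a fraction alpha towards the online weight."""
    return [(1.0 - alpha) * t + alpha * w
            for t, w in zip(target_weights, online_weights)]

online = [np.ones((2, 2)), np.ones(2)]     # weights of the trained network
target = [np.zeros((2, 2)), np.zeros(2)]   # weights of the target network

target = soft_update(target, online, alpha=0.01)
print(target[0])
```

With a small `alpha` the targets change slowly, which tends to stabilize Q-learning; `alpha=1` would copy the online weights outright.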

##### Finally, agent
We declare that this network is an MDP agent with the given inputs, states and outputs

In [20]:
from agentnet.agent import Agent
#all together
agent = Agent(observation_layers=observation_layer,
              policy_estimators=(qvalues_layer,old_qvalues_layer),
              action_layers=action_layer)


In [21]:
#Since it's a single lasagne network, one can get its weights, output, etc
weights = lasagne.layers.get_all_params(action_layer,trainable=True)
weights

[dense.W, dense.b, q-evaluator layer.W, q-evaluator layer.b]

# Create and manage a pool of game sessions to play with

* To make training more stable, we keep an entire batch of game sessions, each running independently of the others
* Why several parallel agents help training: http://arxiv.org/pdf/1602.01783v1.pdf
* Alternative approach: store more sessions: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
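
The "store more sessions" idea can be sketched as a bounded buffer of fixed-length sessions that is sampled with replacement. `SessionBuffer` is a hypothetical class written for this illustration; it is not AgentNet's `EnvPool`, and the real pool may evict and sample differently.

```python
import numpy as np

class SessionBuffer:
    """Hypothetical bounded store of fixed-length sessions (not AgentNet's EnvPool)."""
    def __init__(self, max_size):
        self.max_size = max_size
        self.sessions = []

    def append(self, session):
        self.sessions.append(session)
        # keep only the most recent max_size sessions
        self.sessions = self.sessions[-self.max_size:]

    def sample(self, batch_size, rng):
        # sample with replacement, so batch_size may exceed the number stored
        idx = rng.randint(0, len(self.sessions), size=batch_size)
        return np.stack([self.sessions[i] for i in idx])

rng = np.random.RandomState(0)
buf = SessionBuffer(max_size=3)
for k in range(5):
    # each session: 20 ticks of a 4-dimensional observation, filled with k
    buf.append(np.full((20, 4), k, dtype='float32'))

batch = buf.sample(10, rng)
print(batch.shape)
```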

In [22]:
from agentnet.experiments.openai_gym.pool import EnvPool

pool = EnvPool(agent,GAME, N_AGENTS,max_size=1000)


[2016-11-25 16:15:11,557] Making new env: CartPole-v0


In [23]:
%%time
#interact for 7 ticks
_,action_log,reward_log,_,_,_  = pool.interact(7)


print(action_names[action_log][:2])
print(reward_log[:2])

[['left' 'right' 'right' 'left' 'right' 'left' 'left']]
[[ 1.  1.  1.  1.  1.  1.  0.]]
CPU times: user 7 ms, sys: 0 ns, total: 7 ms
Wall time: 5.69 ms


In [24]:
#load first sessions (this function calls interact and remembers sessions)
pool.update(SEQ_LENGTH)

# Q-learning
* An agent has a method that produces symbolic environment interaction sessions
* Such sessions are sequences of observations, agent memory, actions, Q-values, etc.
  * one has to pre-define maximum session length.

* SessionPool also stores rewards (Q-learning objective)

In [25]:
#get agent's Qvalues obtained via experience replay
replay = pool.experience_replay.sample_session_batch(100,replace=True)

_,_,_,_,(qvalues_seq,old_qvalues_seq) = agent.get_sessions(
    replay,
    session_length=SEQ_LENGTH,
    optimize_experience_replay=True,
)



In [26]:
#get reference Qvalues according to Qlearning algorithm
from agentnet.learning import qlearning

#crop rewards to [-1,+1] to avoid explosion.
#import theano.tensor as T
#rewards = T.maximum(-1,T.minimum(rewards,1))

#loss for Qlearning = (Q(s,a) - (r+gamma*Q(s',a_max)))^2

elwise_mse_loss = qlearning.get_elementwise_objective(qvalues_seq,
                                                      replay.actions,
                                                      replay.rewards,
                                                      replay.is_alive,
                                                      Qvalues_target=old_qvalues_seq,
                                                      gamma_or_gammas=0.99,)

#compute mean over "alive" fragments
loss = elwise_mse_loss.sum() / replay.is_alive.sum()
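
To make the formula in the comment above concrete, here is a numpy sketch of the elementwise Q-learning loss over a batch of sessions. It is an illustration under stated assumptions, not AgentNet's implementation: the last tick of each session has no successor, so its reference value is taken to be the reward alone, and ticks where the session has ended are masked out with `is_alive`.

```python
import numpy as np

def qlearning_elementwise_loss(qvalues, target_qvalues, actions, rewards,
                               is_alive, gamma=0.99):
    """
    qvalues, target_qvalues: [batch, time, n_actions]
    actions: int [batch, time]; rewards, is_alive: float [batch, time]
    returns squared TD error per tick: [batch, time]
    """
    batch_idx, time_idx = np.indices(actions.shape)
    q_taken = qvalues[batch_idx, time_idx, actions]        # Q(s,a)
    # max over a' of Q_target(s',a'); zero at the last tick and for dead next states
    next_best = np.zeros_like(q_taken)
    next_best[:, :-1] = target_qvalues[:, 1:].max(axis=-1) * is_alive[:, 1:]
    reference = rewards + gamma * next_best                # r + gamma*Q(s',a_max)
    return is_alive * (q_taken - reference) ** 2

qvalues = np.array([[[1., 2.], [3., 4.]]])   # one session, two ticks, two actions
actions = np.array([[0, 1]])
rewards = np.array([[1., 1.]])
is_alive = np.array([[1., 1.]])

elwise = qlearning_elementwise_loss(qvalues, qvalues, actions, rewards, is_alive)
loss = elwise.sum() / is_alive.sum()          # mean over alive ticks
print(elwise, loss)
```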

In [27]:
# Compute weight updates
updates = lasagne.updates.adadelta(loss,weights)

In [28]:
#compile train function
import theano
train_step = theano.function([],loss,updates=updates)

# Demo run

In [29]:
untrained_reward = pool.evaluate(save_path="./records",record_video=False)

[2016-11-25 16:16:17,731] Making new env: CartPole-v0
[2016-11-25 16:16:17,738] Clearing 2 monitor files from previous run (because force=True was provided)
[2016-11-25 16:16:17,856] Finished writing results. You can upload them to the scoreboard via gym.upload('/home/apanin/jheuristic/vime/records')


Episode finished after 138 timesteps with reward=138.0


In [30]:
from IPython.display import HTML

#video_path="./records/openaigym.video.0.7346.video000000.mp4"

#HTML("""
#<video width="640" height="480" controls>
#  <source src="{}" type="video/mp4">
#</video>
#""".format(video_path))


# VIME

In [34]:
from bnn import bbpwrap, NormalApproximation,sample_output
from lasagne.layers import EmbeddingLayer
import theano.tensor as T
@bbpwrap(NormalApproximation())
class BayesDenseLayer(DenseLayer):pass
@bbpwrap(NormalApproximation())
class BayesEmbLayer(EmbeddingLayer):pass

from curiosity import compile_vime_reward

class BNN:
    curiosity=0.1
    target_rho = 1
    
    l_state = InputLayer((None,state_size),name='state var')
    l_action = InputLayer((None,),input_var=T.ivector())

    l_action_emb = BayesEmbLayer(l_action,env.action_space.n, 3)    
    
    l_concat = lasagne.layers.concat([l_action_emb,l_state])
    
    l_dense = BayesDenseLayer(l_concat,num_units=30,
                              nonlinearity=lasagne.nonlinearities.tanh)
    
    l_out = BayesDenseLayer(l_dense,num_units=state_size,
                            nonlinearity=None)
        
    params = lasagne.layers.get_all_params(l_out,trainable=True)
    ###training###
    pred_states = lasagne.layers.get_output(l_out)
    next_states = T.matrix("next states")
    mse = lasagne.objectives.squared_error(pred_states,next_states).mean()
    
    #replace logposterior with simple regularization on rho cuz we're lazy
    reg = sum([lasagne.objectives.squared_error(rho,target_rho).mean() 
              for rho in lasagne.layers.get_all_params(l_out,rho=True)])
    
    loss = mse+ 0.01*reg
    
    updates = lasagne.updates.adam(loss,params)
    
    train_step = theano.function([l_state.input_var,l_action.input_var,next_states],
                                 loss,updates=updates)
    
    ###sample random sessions from pool###
    observations, = replay.observations
    actions, = replay.actions
    observations_flat = observations[:,:-1].reshape((-1,)+tuple(observations.shape[2:]))
    actions_flat = actions[:,:-1].reshape((-1,))
    next_observations_flat = observations[:,1:].reshape((-1,)+tuple(observations.shape[2:]))
    sample_from_pool = theano.function([],[observations_flat,actions_flat,next_observations_flat])

    
    ###curiosity reward### aka KL(qnew,qold)
    get_vime_reward_elwise = compile_vime_reward(l_out,l_state,l_action,params,n_samples=10)
    
    @staticmethod
    def add_vime_reward(observations,actions,rewards,is_alive,h0):
        assert isinstance(observations,np.ndarray)
        observations_flat = observations[:,:-1].reshape((-1,)+observations.shape[2:]).astype('float32')
        actions_flat = actions[:,:-1].reshape((-1,)).astype('int32')
        next_observations_flat = observations[:,1:].reshape((-1,)+observations.shape[2:]).astype('float32')

        vime_rewards = BNN.get_vime_reward_elwise(observations_flat,actions_flat,next_observations_flat)
        vime_rewards = np.concatenate([vime_rewards.reshape(rewards[:,:-1].shape),
                                       np.zeros_like(rewards[:,-1:]),], axis=1)
        surrogate_rewards = rewards + BNN.curiosity*vime_rewards
        return (observations,actions,surrogate_rewards,is_alive,h0)
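
`compile_vime_reward` comes from the local `curiosity` module, whose internals are not shown here. The idea behind the curiosity bonus is the KL divergence between the BNN weight posterior after and before seeing a transition. For factorized Gaussian posteriors this has a closed form, sketched below in numpy. The function names are made up for this illustration, and the real module may approximate the quantity differently.

```python
import numpy as np

def kl_diag_gaussians(mu_new, sigma_new, mu_old, sigma_old):
    """KL( N(mu_new, sigma_new^2) || N(mu_old, sigma_old^2) ), summed over weights."""
    return np.sum(np.log(sigma_old / sigma_new)
                  + (sigma_new ** 2 + (mu_new - mu_old) ** 2) / (2.0 * sigma_old ** 2)
                  - 0.5)

def add_curiosity(rewards, info_gain, curiosity=0.1):
    # surrogate reward = environment reward + curiosity * information gain
    return rewards + curiosity * info_gain

mu_old, sigma_old = np.array([0.0]), np.array([1.0])
mu_new, sigma_new = np.array([1.0]), np.array([1.0])

gain = kl_diag_gaussians(mu_new, sigma_new, mu_old, sigma_old)
surrogate = add_curiosity(np.array([1.0, 1.0]), np.array([2.0, 0.0]))
print(gain, surrogate)
```

Transitions that shift the posterior a lot get a larger bonus, which pushes the agent towards states its dynamics model cannot yet predict.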
    
        


# Training loop

In [35]:
#starting epoch
epoch_counter = 1

#full game rewards
rewards = {epoch_counter:untrained_reward}

In [None]:
n_games = 20
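current_epsilon = action_layer.epsilon.get_value() #remember the exploration rate so it can be restored after evaluation
action_layer.epsilon.set_value(0)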
rewards[epoch_counter] = pool.evaluate( record_video=False,n_games=n_games,verbose=False)
print("Current score(mean over %i) = %.3f"%(n_games,np.mean(rewards[epoch_counter])))
action_layer.epsilon.set_value(np.float32(current_epsilon))


In [None]:

#the loop may take eons to finish.
#consider interrupting early.
for i in range(10000):    
    
    
    #train
    pool.update(SEQ_LENGTH,append=True,preprocess=BNN.add_vime_reward)
    
    for _ in range(10): #several Q-learning steps per pool update; avoids shadowing the outer i
        loss = train_step()
        
    BNN.train_step(*BNN.sample_from_pool())
        
    targetnet.load_weights(0.01)
    
    
    ##update resolver's epsilon (chance of random action instead of optimal one)
    current_epsilon = 0.05 + 0.45*np.exp(-epoch_counter/500.)
    action_layer.epsilon.set_value(np.float32(current_epsilon))
    
    if epoch_counter%10==0:
        #average reward per game tick in current experience replay pool
        pool_mean_reward = pool.experience_replay.rewards.get_value().mean()
        print("iter=%i\tepsilon=%.3f\treward/step=%.5f"%(epoch_counter,
                                                         current_epsilon,
                                                         pool_mean_reward))
        

    ##record current learning progress and show learning curves
    if epoch_counter%100 ==0:
        n_games = 20
        action_layer.epsilon.set_value(0)
        rewards[epoch_counter] = pool.evaluate( record_video=False,n_games=n_games,verbose=False)
        print("Current score(mean over %i) = %.3f"%(n_games,np.mean(rewards[epoch_counter])))
        action_layer.epsilon.set_value(np.float32(current_epsilon))
    
    
    epoch_counter  +=1

    
# Time to drink some coffee!

iter=10	epsilon=0.491	reward/step=1.08267
iter=20	epsilon=0.482	reward/step=1.10625
iter=30	epsilon=0.474	reward/step=1.09475
iter=40	epsilon=0.465	reward/step=1.09015
iter=50	epsilon=0.457	reward/step=1.08826
iter=60	epsilon=0.449	reward/step=1.08557
iter=70	epsilon=0.441	reward/step=1.08216
iter=80	epsilon=0.433	reward/step=1.08321
iter=90	epsilon=0.426	reward/step=1.08629


[2016-11-25 16:17:37,957] Making new env: CartPole-v0
[2016-11-25 16:17:37,965] Clearing 2 monitor files from previous run (because force=True was provided)
[2016-11-25 16:17:38,115] Finished writing results. You can upload them to the scoreboard via gym.upload('/home/apanin/jheuristic/vime/records')


iter=100	epsilon=0.418	reward/step=1.08570
Current score(mean over 20) = 9.200
iter=110	epsilon=0.411	reward/step=1.08643
iter=120	epsilon=0.404	reward/step=1.08495
iter=130	epsilon=0.397	reward/step=1.08405
iter=140	epsilon=0.390	reward/step=1.08354
iter=150	epsilon=0.383	reward/step=1.08380
iter=160	epsilon=0.377	reward/step=1.08260
iter=170	epsilon=0.370	reward/step=1.08004
iter=180	epsilon=0.364	reward/step=1.07972
iter=190	epsilon=0.358	reward/step=1.07963


[2016-11-25 16:17:57,862] Making new env: CartPole-v0
[2016-11-25 16:17:57,868] Clearing 2 monitor files from previous run (because force=True was provided)


iter=200	epsilon=0.352	reward/step=1.07931


[2016-11-25 16:18:00,251] Finished writing results. You can upload them to the scoreboard via gym.upload('/home/apanin/jheuristic/vime/records')


Current score(mean over 20) = 190.750
iter=210	epsilon=0.346	reward/step=1.07800
iter=220	epsilon=0.340	reward/step=1.07755
iter=230	epsilon=0.334	reward/step=1.07600
iter=240	epsilon=0.328	reward/step=1.07545
iter=250	epsilon=0.323	reward/step=1.07557
iter=260	epsilon=0.318	reward/step=1.07455
iter=270	epsilon=0.312	reward/step=1.07368
iter=280	epsilon=0.307	reward/step=1.07446
iter=290	epsilon=0.302	reward/step=1.07355


[2016-11-25 16:18:20,714] Making new env: CartPole-v0
[2016-11-25 16:18:20,721] Clearing 2 monitor files from previous run (because force=True was provided)


iter=300	epsilon=0.297	reward/step=1.07315


[2016-11-25 16:18:22,256] Finished writing results. You can upload them to the scoreboard via gym.upload('/home/apanin/jheuristic/vime/records')


Current score(mean over 20) = 124.250
iter=310	epsilon=0.292	reward/step=1.07200
iter=320	epsilon=0.287	reward/step=1.07167
iter=330	epsilon=0.283	reward/step=1.07090
iter=340	epsilon=0.278	reward/step=1.06943
iter=350	epsilon=0.273	reward/step=1.06866
iter=360	epsilon=0.269	reward/step=1.06793
iter=370	epsilon=0.265	reward/step=1.06758
iter=380	epsilon=0.260	reward/step=1.06661
iter=390	epsilon=0.256	reward/step=1.06599


[2016-11-25 16:18:42,593] Making new env: CartPole-v0
[2016-11-25 16:18:42,602] Clearing 2 monitor files from previous run (because force=True was provided)


iter=400	epsilon=0.252	reward/step=1.06519


[2016-11-25 16:18:45,006] Finished writing results. You can upload them to the scoreboard via gym.upload('/home/apanin/jheuristic/vime/records')


Current score(mean over 20) = 190.200
iter=410	epsilon=0.248	reward/step=1.06457
iter=420	epsilon=0.244	reward/step=1.06445
iter=430	epsilon=0.240	reward/step=1.06415
iter=440	epsilon=0.237	reward/step=1.06407
iter=450	epsilon=0.233	reward/step=1.06417
iter=460	epsilon=0.229	reward/step=1.06366
iter=470	epsilon=0.226	reward/step=1.06293
iter=480	epsilon=0.222	reward/step=1.06334
iter=490	epsilon=0.219	reward/step=1.06261


[2016-11-25 16:19:05,495] Making new env: CartPole-v0
[2016-11-25 16:19:05,501] Clearing 2 monitor files from previous run (because force=True was provided)


iter=500	epsilon=0.216	reward/step=1.06227


[2016-11-25 16:19:07,308] Finished writing results. You can upload them to the scoreboard via gym.upload('/home/apanin/jheuristic/vime/records')


Current score(mean over 20) = 141.250
iter=510	epsilon=0.212	reward/step=1.06196
iter=520	epsilon=0.209	reward/step=1.06100
iter=530	epsilon=0.206	reward/step=1.06029
iter=540	epsilon=0.203	reward/step=1.06029
iter=550	epsilon=0.200	reward/step=1.06034
iter=560	epsilon=0.197	reward/step=1.05978
iter=570	epsilon=0.194	reward/step=1.05963
iter=580	epsilon=0.191	reward/step=1.05935
iter=590	epsilon=0.188	reward/step=1.05879


[2016-11-25 16:19:27,679] Making new env: CartPole-v0
[2016-11-25 16:19:27,685] Clearing 2 monitor files from previous run (because force=True was provided)


iter=600	epsilon=0.186	reward/step=1.05825


[2016-11-25 16:19:30,055] Finished writing results. You can upload them to the scoreboard via gym.upload('/home/apanin/jheuristic/vime/records')


Current score(mean over 20) = 188.950
iter=610	epsilon=0.183	reward/step=1.05788
iter=620	epsilon=0.180	reward/step=1.05770
iter=630	epsilon=0.178	reward/step=1.05714
iter=640	epsilon=0.175	reward/step=1.05649
iter=650	epsilon=0.173	reward/step=1.05586
iter=660	epsilon=0.170	reward/step=1.05543
iter=670	epsilon=0.168	reward/step=1.05505
iter=680	epsilon=0.165	reward/step=1.05462
iter=690	epsilon=0.163	reward/step=1.05419


[2016-11-25 16:19:50,541] Making new env: CartPole-v0
[2016-11-25 16:19:50,548] Clearing 2 monitor files from previous run (because force=True was provided)


iter=700	epsilon=0.161	reward/step=1.05381


[2016-11-25 16:19:52,807] Finished writing results. You can upload them to the scoreboard via gym.upload('/home/apanin/jheuristic/vime/records')


Current score(mean over 20) = 177.600
iter=710	epsilon=0.159	reward/step=1.05345
iter=720	epsilon=0.157	reward/step=1.05311
iter=730	epsilon=0.155	reward/step=1.05277
iter=740	epsilon=0.152	reward/step=1.05233
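
The epsilon column in the log follows the schedule from the training loop, `0.05 + 0.45*exp(-epoch/500)`: exploration starts near 0.5 and decays towards a floor of 0.05. A small standalone sketch that reproduces a few logged values (the function and parameter names are illustrative):

```python
import numpy as np

def epsilon_at(epoch, eps_min=0.05, eps_range=0.45, decay=500.):
    """Exponentially decaying exploration rate, as used in the training loop."""
    return eps_min + eps_range * np.exp(-epoch / decay)

for epoch in (10, 100, 500):
    print("iter=%i\tepsilon=%.3f" % (epoch, epsilon_at(epoch)))
```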
