# Approximate Q-learning

In this notebook you will teach a __TensorFlow__ neural network to do Q-learning.

__Frameworks__ - we'll accept this homework in any deep learning framework. For example, it translates to theano/lasagne almost line for line. However, we recommend that you stick to __TensorFlow__ unless you're certain about your skills in the framework of your choice.

In [1]:
#XVFB will be launched if you run on a server
import os
if not os.environ.get("DISPLAY"):  # DISPLAY is unset or empty, so there is no screen to render to
    !bash ../xvfb start
    %env DISPLAY=:1

In [2]:
import gym
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
env = gym.make("CartPole-v0")
env.reset()
n_actions = env.action_space.n
state_dim = env.observation_space.shape

plt.imshow(env.render("rgb_array"))

[2017-02-27 21:57:12,164] Making new env: CartPole-v0


# Approximate (deep) Q-learning: building the network

In this section we will build and train a naive Q-learning agent with TensorFlow.

The first step is to initialize the input variables.

In [4]:
import tensorflow as tf
assert tf.__version__ == "1.0.0", "try pip install --upgrade tensorflow(-gpu)"
import tensorflow.contrib.layers as tflayers  # Let's make TF simple again

In [5]:
#create input variables. We'll support multiple states at once

current_states = tf.placeholder(shape=(None,)+state_dim, dtype=tf.float32)
actions = tf.placeholder(shape=(None,), dtype=tf.int32)  # action_ids[batch]
rewards = tf.placeholder(shape=(None,), dtype=tf.float32)  # rewards[batch]
next_states = tf.placeholder(shape=(None,)+state_dim, dtype=tf.float32)  # next states[batch,units]
# is_end is a bool vector[batch] where True means that the session just ended
is_end = tf.placeholder(shape=(None,), dtype=tf.bool)

In [6]:
def network(l_states, scope=None, reuse=False):
    assert l_states.get_shape().as_list() == list((None,)+state_dim)
    with tf.variable_scope(scope or "network") as scope:
        if reuse:
            scope.reuse_variables()
        
        # <Your architecture. Please start with a single-layer network>

        return l_qvalues
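
To make the "single-layer network" suggestion concrete, here is a minimal NumPy sketch of what such a network computes: one linear map from a batch of states to one Q-value per action. The names `linear_qvalues`, `weights` and `bias` are hypothetical and only illustrate the shapes; in the notebook itself this would presumably be a TensorFlow layer (for example `tflayers.fully_connected` with `activation_fn=None`), and that choice is left to the exercise.

```python
import numpy as np

def linear_qvalues(states, weights, bias):
    """Single dense layer without a nonlinearity: one Q-value per action."""
    states = np.asarray(states, dtype=np.float64)
    return states.dot(weights) + bias

# CartPole observations have 4 components and there are 2 actions.
state_size, num_actions = 4, 2
rng = np.random.RandomState(0)
weights = rng.normal(scale=0.1, size=(state_size, num_actions))
bias = np.zeros(num_actions)

states = rng.normal(size=(3, state_size))   # a batch of 3 states
qvalues = linear_qvalues(states, weights, bias)
print(qvalues.shape)
```

The output has shape `[batch, n_actions]`, which is what the later cells expect from `network(current_states)`.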

#### Predicting Q-values for `current_states`

In [7]:
#get q-values for ALL actions in current_states
predicted_qvalues = network(current_states)

In [8]:
#select q-values for chosen actions
predicted_qvalues_for_actions = <...>
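
One common way to pick the Q-value of the action that was actually taken is to multiply the `[batch, n_actions]` matrix by a one-hot encoding of the action ids and sum over the action axis. The NumPy sketch below shows the idea; `select_qvalues` is a hypothetical helper, not part of the notebook. In TensorFlow the same pattern can be written with `tf.one_hot` and `tf.reduce_sum`, though other approaches work too.

```python
import numpy as np

def select_qvalues(qvalues, action_ids):
    """Return Q(s, a) for the action taken in each row of the batch."""
    qvalues = np.asarray(qvalues, dtype=np.float64)
    action_ids = np.asarray(action_ids, dtype=np.int64)
    n_actions = qvalues.shape[1]
    one_hot = np.eye(n_actions)[action_ids]      # shape [batch, n_actions]
    return (qvalues * one_hot).sum(axis=1)       # shape [batch]

batch_qvalues = np.array([[1.0, 2.0],
                          [3.0, 4.0],
                          [5.0, 6.0]])
batch_actions = np.array([1, 0, 1])
selected = select_qvalues(batch_qvalues, batch_actions)
print(selected)
```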

#### Loss function and `update`
Here we write a function similar to `agent.update`.

In [9]:
predicted_next_qvalues = network(<...>, reuse=True)
gamma = 0.99
target_qvalues_for_actions = <target Q-values using rewards and predicted_next_qvalues>
# on terminal transitions there is no next state to bootstrap from, so the target is just the reward
target_qvalues_for_actions = tf.where(
    is_end, 
    rewards,
    target_qvalues_for_actions)
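
For reference, the usual one-step Q-learning target is `r + gamma * max_a' Q(s', a')`. The NumPy sketch below computes it for a batch, assuming the textbook convention that a terminal transition keeps only its immediate reward. `td_targets` is a hypothetical helper used for illustration only.

```python
import numpy as np

def td_targets(rewards, next_qvalues, is_end, gamma=0.99):
    """One-step Q-learning targets for a batch of transitions."""
    rewards = np.asarray(rewards, dtype=np.float64)
    next_qvalues = np.asarray(next_qvalues, dtype=np.float64)
    is_end = np.asarray(is_end, dtype=bool)
    best_next = next_qvalues.max(axis=1)         # max over actions in s'
    bootstrapped = rewards + gamma * best_next
    # No next state to bootstrap from once the session has ended.
    return np.where(is_end, rewards, bootstrapped)

example_targets = td_targets(
    rewards=[1.0, 1.0],
    next_qvalues=[[0.5, 2.0], [3.0, 1.0]],
    is_end=[False, True],
    gamma=0.5)
print(example_targets)
```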

In [10]:
#mean squared error loss function
loss = <mean squared between target_qvalues_for_actions and predicted_qvalues_for_actions>
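
As a reminder of what the loss measures, here is a small NumPy sketch of mean squared error between predicted and target Q-values; `mse_loss` is a hypothetical name. Many implementations also treat the target as a constant during differentiation (in TensorFlow, `tf.stop_gradient` is one way to do that), but whether to do so here is a design choice, not something this notebook prescribes.

```python
import numpy as np

def mse_loss(predicted, target):
    """Mean of squared differences over the batch."""
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean((predicted - target) ** 2))

loss_value = mse_loss([1.0, 2.0, 3.0], [1.0, 4.0, 1.0])
print(loss_value)
```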

In [11]:
#network updates. Note the small learning rate (for stability)
#Training function that resembles agent.update(state,action,reward,next_state) 
#with 1 more argument meaning is_end
train_step = tf.train.AdamOptimizer(1e-4).minimize(
    loss, var_list=tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="network"))

### Playing the game

In [12]:
# Tensorflow feature - session
sess = tf.InteractiveSession()

In [13]:
# Tensorflow feature 2 - variables initializer
sess.run(tf.global_variables_initializer())

In [None]:
# You can check all your variables with:
# [v.name for v in tf.trainable_variables()]
# they should all start with "network"

In [14]:
inial_epsilon = epsilon = 0.5
final_epsilon = 0.01
n_epochs = 1000

def generate_session(t_max=1000):
    """play env with approximate q-learning agent and train it at the same time"""
    
    total_reward = 0
    s = env.reset()
    total_loss = 0
    
    for t in range(t_max):
        
        #get action q-values from the network
        q_values = sess.run(
            predicted_qvalues, 
            feed_dict={current_states:np.array([s])})[0]
        
        a = <sample action with epsilon-greedy strategy>
        
        new_s,r,done,info = env.step(a)
        
        #train agent one step. Note that we use one-element arrays instead of scalars 
        #because that's what the training function accepts.
        curr_loss, _ = sess.run(
            ..., 
            feed_dict={
                ...})

        total_reward += r
        total_loss += curr_loss
        
        s = new_s
        if done: break
            
    return total_reward, total_loss/float(t), t
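
The epsilon-greedy strategy mentioned inside `generate_session` can be sketched as follows: with probability `epsilon` take a uniformly random action, otherwise take the action with the highest Q-value. `epsilon_greedy` and its `rng` argument are hypothetical names for this illustration.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random):
    """Sample an action id from an epsilon-greedy policy over q_values."""
    q_values = np.asarray(q_values)
    if rng.rand() < epsilon:
        # explore: uniformly random action
        return int(rng.randint(len(q_values)))
    # exploit: action with the highest predicted Q-value
    return int(np.argmax(q_values))

greedy_action = epsilon_greedy([0.1, 0.9], epsilon=0.0)
random_action = epsilon_greedy([0.1, 0.9], epsilon=1.0)
print(greedy_action, random_action)
```

With `epsilon=0.0` the choice is always greedy, which is what the video section relies on when it sets `epsilon=0`.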

In [15]:
from tqdm import trange
tr = trange(
    n_epochs,
    desc="mean reward = {:.3f}\tepsilon = {:.3f}\tloss = {:.3f}\tsteps = {:.3f}".format(0.0, 0.0, 0.0, 0.0),
    leave=True)


for i in tr:
    
    sessions = [generate_session() for _ in range(100)] #generate new sessions
    session_rewards, session_loss, session_steps = map(np.array, zip(*sessions))
    
    epsilon -= (inial_epsilon - final_epsilon) / float(n_epochs)
    
    tr.set_description("mean reward = {:.3f}\tepsilon = {:.3f}\tloss = {:.3f}\tsteps = {:.3f}".format(
        np.mean(session_rewards), epsilon, np.mean(session_loss), np.mean(session_steps)))

    if np.mean(session_rewards) > 300:
        print ("You Win!")
        break
        
    assert epsilon!=0, "Please explore environment"

mean reward = 93.500	epsilon = 0.010	loss = 0.914	steps = 92.500: 100%|██████████| 1000/1000 [1:15:00<00:00,  7.59s/it] 
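
The loop above lowers epsilon by a fixed amount every epoch. The same linear schedule can be written as a function of the epoch index, which makes it easy to check the start and end values. `linear_epsilon` is a hypothetical helper; its defaults mirror the constants defined earlier in the notebook.

```python
def linear_epsilon(epoch, initial=0.5, final=0.01, n_epochs=1000):
    """Linearly interpolate epsilon from `initial` to `final` over n_epochs."""
    fraction = min(max(epoch / float(n_epochs), 0.0), 1.0)   # clamp to [0, 1]
    return initial - (initial - final) * fraction

for epoch in (0, 500, 1000):
    print(epoch, round(linear_epsilon(epoch), 3))
```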


### Video

In [16]:
epsilon=0 #Don't forget to reset epsilon back to initial value if you want to go on training

In [None]:
#record sessions
import gym.wrappers
env = gym.wrappers.Monitor(env,directory="videos",force=True)
sessions = [generate_session() for _ in range(100)]
env.close()
#unwrap 
env = env.env.env
#upload to gym
#gym.upload("./videos/",api_key="<your_api_key>") #you'll need me later

#Warning! If you keep seeing an error that reads something like "DoubleWrapError",
#run env=gym.make("CartPole-v0");env.reset();

In [None]:
#show video
from IPython.display import HTML
import os

video_names = list(filter(lambda s:s.endswith(".mp4"),os.listdir("./videos/")))

HTML("""
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format("./videos/"+video_names[-1])) #this may or may not be _last_ video. Try other indices

### Homework

Two paths lie ahead of you, and which one to take is your rightful choice.

* __[recommended]__ Go deeper. Return to seminar1 and get 99% accuracy on MNIST
* __[alternative]__ Try approximate expected-value SARSA and other algorithms and compare them with Q-learning
  * +3 points for EV-SARSA and comparison to Q-learning
  * +2 per additional algorithm
* __[alternative hard]__ Pick ```<your favourite env>``` and solve it using a neural network.
  * LunarLander, MountainCar or Breakout (from week1 bonus)
  * LunarLander should get at least +100
  * MountainCar should get at least -200
  * You will need to somehow stabilize learning
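
For the expected-value SARSA option, the only change from Q-learning is the target: instead of the maximum over next-state Q-values, it uses their expectation under the current policy. The NumPy sketch below assumes that policy is epsilon-greedy and that terminal transitions keep only the reward. `expected_sarsa_targets` is a hypothetical helper, one possible starting point and not a reference solution.

```python
import numpy as np

def expected_sarsa_targets(rewards, next_qvalues, is_end, epsilon, gamma=0.99):
    """Targets r + gamma * E_{a' ~ eps-greedy}[Q(s', a')] for a batch."""
    rewards = np.asarray(rewards, dtype=np.float64)
    next_qvalues = np.asarray(next_qvalues, dtype=np.float64)
    is_end = np.asarray(is_end, dtype=bool)
    batch_size, n_actions = next_qvalues.shape

    # epsilon-greedy probabilities: epsilon spread uniformly,
    # the remaining mass on the greedy action
    probs = np.full((batch_size, n_actions), epsilon / n_actions)
    greedy = next_qvalues.argmax(axis=1)
    probs[np.arange(batch_size), greedy] += 1.0 - epsilon

    expected_next = (probs * next_qvalues).sum(axis=1)
    return np.where(is_end, rewards, rewards + gamma * expected_next)

ev_target = expected_sarsa_targets([1.0], [[0.0, 2.0]], [False], epsilon=0.5, gamma=0.5)
greedy_target = expected_sarsa_targets([1.0], [[0.0, 2.0]], [False], epsilon=0.0, gamma=0.5)
print(ev_target, greedy_target)
```

With `epsilon=0` the expectation collapses to the maximum, so the target coincides with the Q-learning one; that gives a quick sanity check when comparing the two algorithms.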
   
