### Playing Atari with advantage actor-critic

This time we're going to learn something harder than CartPole :)

Gym Atari games provide only raw image pixels as observations, so the agent network must be powerful enough to extract meaningful features from them. We shall use a convolutional neural network for this task.

Most of the code in this notebook is written for you; however, you are _strongly encouraged to experiment with it_ to find a better agent configuration and/or learning algorithm.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

#setup theano/lasagne. Set to GPU if you have one
%env THEANO_FLAGS=device=cpu,floatX=float32
import theano

#If you are running on a server, launch xvfb to record game videos
#Please make sure you have xvfb installed (apt-get install xvfb, see gym readme on xvfb)
import os
if not os.environ.get("DISPLAY"):  #no display available, so start a virtual one
    !bash xvfb start
    %env DISPLAY=:1



# Processing game image

Raw Atari images are large, 210x160x3 by default. However, we don't need that level of detail in order to learn from them.

We can thus save a lot of time by preprocessing the game image, which includes:
* Resizing to a smaller shape
* Converting to grayscale
* Cropping irrelevant image parts
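
The three steps above can be sketched in plain NumPy. This is only an illustration, not how `PreprocessImage` works internally (its code is not shown here): the function name `preprocess_frame`, the nearest-neighbour resize, and the example crop are all assumptions.

```python
import numpy as np

def preprocess_frame(frame, height=64, width=64, crop=lambda img: img):
    """Crop, grayscale and downsample one RGB frame of shape (H, W, 3)."""
    frame = crop(frame)
    # average the colour channels to get a single grayscale channel in [0, 1]
    gray = frame.astype(np.float32).mean(axis=-1) / 255.0
    # nearest-neighbour resize: pick evenly spaced rows and columns
    rows = np.linspace(0, gray.shape[0] - 1, height).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, width).astype(int)
    small = gray[rows][:, cols]
    # add a leading channel axis -> (1, height, width)
    return small[None, :, :]

# a random stand-in for a raw 210x160x3 Atari frame
demo_raw = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
demo_obs = preprocess_frame(demo_raw, crop=lambda img: img[30:-10, :])
print(demo_obs.shape)  # (1, 64, 64)
```

The result has 64*64 = 4096 values per frame instead of 210*160*3 = 100800, which is where the time saving comes from.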

In [None]:
import gym
from agentnet.experiments.openai_gym.wrappers import PreprocessImage
#game maker; for other games, see https://gym.openai.com/envs
def make_env():
    env = gym.make("KungFuMaster-v0")
    return PreprocessImage(env,height=64,width=64,
                           grayscale=True,
                           crop=lambda img:img[:,:]) #<Set croppings here, run cell to see test image


#spawn game instance
env = make_env()
observation_shape = env.observation_space.shape
n_actions = env.action_space.n

obs = env.reset()

plt.imshow(obs[0],interpolation='none',cmap='gray')

# Basic agent setup
Here we define a simple agent that maps game images into a policy using a simple convolutional neural network.

In [None]:
import theano, lasagne
import theano.tensor as T
from lasagne.layers import *
from agentnet.memory import WindowAugmentation

In [None]:
#observation goes here
observation_layer = InputLayer((None,)+observation_shape,)

#4-tick window over images
prev_wnd = InputLayer((None,4)+observation_shape,name='window from last tick')
new_wnd = WindowAugmentation(observation_layer,prev_wnd,name='updated window')
        
#merge the window and channel axes: (batch, 4*channels, h, w). With grayscale that is 4 channels; without grayscale, 4 should become 12.
wnd_reshape = reshape(new_wnd, (-1,4*observation_shape[0])+observation_shape[1:])
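
The idea behind the 4-tick window can be sketched without any framework: keep the last few frames, push the newest one in, and drop the oldest. The snippet below is a toy illustration with integers standing in for frames; it does not claim to reproduce the frame ordering used inside `WindowAugmentation`.

```python
from collections import deque

WINDOW = 4  # number of frames kept, as in the 4-tick window above

def update_window(window, new_frame):
    """Return the window after pushing new_frame and dropping the oldest one."""
    window.append(new_frame)  # deque(maxlen=...) discards the oldest item itself
    return window

# start from a window of "empty" frames, then feed frames labelled 1..6
demo_window = deque([0] * WINDOW, maxlen=WINDOW)
for frame_id in range(1, 7):
    update_window(demo_window, frame_id)

print(list(demo_window))  # [3, 4, 5, 6]
```

Stacking the frames in the window along the channel axis is what lets a feed-forward network see motion, which a single still frame cannot show.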


#### Network body

Here you will need to build a convolutional network that consists of 4 layers:
* 3 convolutional layers with 32 filters, 5x5 window size, 2x2 stride
 * Choose any nonlinearity except softmax
 * You may want to increase the number of filters for the last layer
* Dense layer on top of all convolutions
 * anywhere between 100 and 512 neurons

You may find a template for such a network below
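
To see how quickly stride shrinks the image, here is a small arithmetic sketch. It assumes 64x64 inputs, no padding, and 32 filters in the last layer; if you choose other settings, the numbers change accordingly.

```python
def conv_output_size(input_size, filter_size, stride, pad=0):
    """Spatial size after one convolution (no dilation)."""
    return (input_size + 2 * pad - filter_size) // stride + 1

size = 64            # preprocessed frames are 64x64
sizes = []
for _ in range(3):   # three conv layers, 5x5 filters, stride 2, no padding
    size = conv_output_size(size, filter_size=5, stride=2)
    sizes.append(size)

print(sizes)                 # [30, 13, 5]
flat_features = 32 * size * size
print(flat_features)         # 800
```

Under these assumptions the dense layer receives 800 inputs, so most of the computation stays in the convolutions.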

In [None]:
from lasagne.nonlinearities import rectify,elu,tanh,softmax

#network body
conv0 = Conv2DLayer(wnd_reshape,<...>)
conv1 = <another convolutional layer, growing from conv0>
conv2 = <yet another layer...>

##Tip: you want a _fast_ architecture, so consider using stride. 
#For example, 5x5 filters with stride 2. Use <layer>.output_shape to get the size of each layer

        
dense = DenseLayer(<what is its input?>,
                   nonlinearity=tanh,
                   name='dense "neck" layer')

### Network head

You will now need to build output layers.
Since we're building an advantage actor-critic algorithm, our network will require two outputs:
* policy, $\pi(a|s)$, defining action probabilities
* state value, $V(s)$, defining the expected return from the given state

Both of these layers will grow from the final dense layer of the network body.
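
As a framework-free illustration of what the actor head computes, the sketch below turns logits into probabilities and samples an action from them. The logits are made-up numbers and the helper names `softmax` and `sample_action` are chosen for this example only.

```python
import math
import random

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Pick an action index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

demo_logits = [2.0, 1.0, 0.1]  # made-up logits for three actions
demo_probs = softmax(demo_logits)
demo_action = sample_action(demo_probs, random.Random(0))
print(demo_probs, demo_action)
```

Sampling, rather than always taking the most probable action, is what keeps the agent exploring during training.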

In [None]:
#actor head
logits_layer = DenseLayer(dense,n_actions,nonlinearity=None) 
#^^^ separately define pre-softmax policy logits to regularize them later

from lasagne.layers import NonlinearityLayer

policy_layer = <use NonlinearityLayer to compute probabilities pi(a|s) from logits. Mind the nonlinearity>

#critic head
V_layer = <use dense layer to predict V(s)>

#sample actions proportionally to policy_layer
from agentnet.resolver import ProbabilisticResolver
action_layer = ProbabilisticResolver(policy_layer)



##### Finally, agent
We declare that this network is an MDP agent with such and such inputs, states and outputs

In [None]:
from agentnet.agent import Agent
#all together
agent = Agent(observation_layers=observation_layer,
              policy_estimators=(logits_layer,V_layer),
              agent_states={new_wnd:prev_wnd},
              action_layers=action_layer)


In [None]:
#Since it's a single lasagne network, one can get its weights, output, etc
weights = lasagne.layers.get_all_params([V_layer,policy_layer],trainable=True)
weights

# Create and manage a pool of atari sessions to play with

* To make training more stable, we shall run an entire batch of game sessions, each independent of the others
* Why several parallel agents help training: http://arxiv.org/pdf/1602.01783v1.pdf
* Alternative approach: store more sessions: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
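
A toy sketch of the idea, assuming a made-up `CoinFlipEnv` in place of a real game: several independent sessions are stepped in lockstep, and their rewards are collected into one batch indexed by session and tick. None of these names come from `EnvPool`.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: reward 1 if the action matches a hidden coin."""
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def step(self, action):
        return 1.0 if action == self.rng.randint(0, 1) else 0.0

def interact(envs, policy, n_steps):
    """Step every environment n_steps times; return rewards[env][tick]."""
    rewards = [[] for _ in envs]
    for _ in range(n_steps):
        for i, env in enumerate(envs):
            rewards[i].append(env.step(policy()))
    return rewards

demo_envs = [CoinFlipEnv(seed) for seed in range(10)]  # 10 independent sessions
demo_batch = interact(demo_envs, policy=lambda: 0, n_steps=7)
print(len(demo_batch), len(demo_batch[0]))  # 10 7
```

Because each session has its own random state, the samples in one batch are far less correlated than 70 consecutive ticks of a single game would be.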

In [None]:
from agentnet.experiments.openai_gym.pool import EnvPool

#number of parallel agents 
N_AGENTS = 10

pool = EnvPool(agent,make_env, N_AGENTS) #may need to adjust


In [None]:
%%time
#interact for 10 ticks
_,action_log,reward_log,_,_,_  = pool.interact(10)

print('actions:')
print(action_log[0])
print("rewards")
print(reward_log[0])

In [None]:
# batch sequence length (frames) 
SEQ_LENGTH = 10

#load first sessions (this function calls interact and remembers sessions)
pool.update(SEQ_LENGTH)

# Advantage actor-critic

* An agent has a method that produces symbolic environment interaction sessions
* Such sessions are sequences of observations, agent memory, actions, policy and value estimates, etc.
  * one has to pre-define the maximum session length.

* SessionPool also stores rewards, alive indicators, etc.
* Code mostly copied from [here](https://github.com/yandexdataschool/tinyverse/blob/0b359aa6a5a9f666d2fa9eab97669c7930b7acb3/atari.py)
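
The helper `get_a2c_loss_symbolic` is not shown in this notebook, so what follows is only a plain-Python sketch of the usual advantage actor-critic loss terms, not that helper's code. The discount `gamma`, the function names and the toy numbers are assumptions made for illustration.

```python
import math

def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """n-step returns R_t = r_t + gamma * R_{t+1}, seeded with V(s_T)."""
    returns, running = [], bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

def a2c_loss_terms(rewards, values, action_probs, bootstrap_value, gamma=0.99):
    """Return (policy_loss, value_loss) averaged over one session."""
    returns = discounted_returns(rewards, bootstrap_value, gamma)
    # advantage: how much better the outcome was than the critic expected
    advantages = [R - v for R, v in zip(returns, values)]
    # actor: raise the log-probability of actions with positive advantage
    # (in a symbolic framework the advantage is treated as a constant here)
    policy_loss = -sum(math.log(p) * a
                       for p, a in zip(action_probs, advantages)) / len(rewards)
    # critic: regress V(s_t) onto the n-step return
    value_loss = sum(a ** 2 for a in advantages) / len(rewards)
    return policy_loss, value_loss

demo_returns = discounted_returns([0.0, 0.0, 1.0], bootstrap_value=0.0, gamma=0.5)
print(demo_returns)  # [0.25, 0.5, 1.0]

# a critic that already predicts the returns exactly leaves nothing to learn
demo_losses = a2c_loss_terms([0.0, 0.0, 1.0], values=demo_returns,
                             action_probs=[0.5, 0.5, 0.5],
                             bootstrap_value=0.0, gamma=0.5)
print(demo_losses)
```

In practice an entropy bonus on the policy is often added to these two terms to discourage premature convergence to a deterministic policy.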

In [None]:
from a2c_helper import get_a2c_loss_symbolic
loss = get_a2c_loss_symbolic(agent,pool,reward_koeff=0.1)

In [None]:
# Compute weight updates, clip by norm
grads = T.grad(loss,weights)
grads = lasagne.updates.total_norm_constraint(grads,10)

updates = lasagne.updates.adam(grads, weights,1e-4)

#compile train function
train_step = theano.function([],loss,updates=updates)
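
Clipping by total norm rescales all gradients together so that their joint L2 norm does not exceed a threshold, which keeps their relative directions intact. The sketch below shows the idea on plain lists; it is an illustration of the technique, and details such as epsilon handling in `lasagne.updates.total_norm_constraint` may differ.

```python
import math

def clip_by_total_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0
    return [[g * scale for g in grad] for grad in grads]

demo_grads = [[3.0, 4.0], [12.0]]  # joint norm = sqrt(9 + 16 + 144) = 13
demo_clipped = clip_by_total_norm(demo_grads, max_norm=6.5)
print(demo_clipped)  # [[1.5, 2.0], [6.0]]
```

Gradients whose joint norm is already below the threshold pass through unchanged.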

# Demo run

In [None]:
untrained_reward = np.mean(pool.evaluate(save_path="./records",
                                         record_video=True))

In [None]:
#show video
from IPython.display import HTML
import os

video_names = list(filter(lambda s:s.endswith(".mp4"),os.listdir("./records/")))

HTML("""
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format("./records/"+video_names[-1])) #this may or may not be _last_ video. Try other indices

# Training loop

In [None]:
#starting epoch
epoch_counter = 1

#full game rewards
rewards = {}
loss,reward_per_tick,reward =0,0,0

In [None]:
from tqdm import trange
from IPython.display import clear_output

#the algorithm almost converges by 15k iterations, 50k is for full convergence
for i in trange(150000):    
    
    #play
    pool.update(SEQ_LENGTH)

    #train
    loss = 0.95*loss + 0.05*train_step()
    
    
    if epoch_counter%10==0:
        #average reward per game tick in current experience replay pool
        reward_per_tick = 0.95*reward_per_tick + 0.05*pool.experience_replay.rewards.get_value().mean()
        print("iter=%i\tloss=%.3f\treward/tick=%.3f"%(epoch_counter,
                                                      loss,
                                                      reward_per_tick))
        
    ##record current learning progress and show learning curves
    if epoch_counter%100 ==0:
        reward = 0.95*reward + 0.05*np.mean(pool.evaluate(record_video=False))
        rewards[epoch_counter] = reward
        
        clear_output(True)
        plt.plot(*zip(*sorted(rewards.items(),key=lambda k:k[0])))
        plt.show()
        

    
    epoch_counter  +=1
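
The loop above smooths `loss`, `reward_per_tick` and `reward` with an exponential moving average of the form `0.95*old + 0.05*new`. Since all three start at 0, early readings are biased towards zero, as this small sketch with a constant signal shows (the function name `ema` is chosen for this example only).

```python
def ema(values, decay=0.95, start=0.0):
    """Exponential moving average: new = decay * old + (1 - decay) * value."""
    smoothed, current = [], start
    for v in values:
        current = decay * current + (1 - decay) * v
        smoothed.append(current)
    return smoothed

# a constant signal of 1.0: the average starts near 0 and creeps towards 1
demo_curve = ema([1.0] * 100)
print(round(demo_curve[0], 3), round(demo_curve[-1], 3))  # 0.05 0.994
```

Keep this in mind when reading the learning curve: `reward` is updated only once every 100 iterations, so its first points understate the agent's actual score.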

    
# Time to drink some coffee!

In [None]:
import pandas as pd
plt.plot(*zip(*sorted(rewards.items(),key=lambda k:k[0])))

# Evaluating results
 * Here we plot learning curves and watch sample sessions

In [None]:
from agentnet.utils.persistence import save
save(action_layer,"kung_fu.pcl")
#load(action_layer,"kung_fu.pcl")

In [None]:
rw = pool.evaluate(n_games=20,save_path="./records",record_video=True)
print("mean session score=%.5f"%np.mean(rw))

In [None]:
#show video
from IPython.display import HTML
import os

video_names = list(filter(lambda s:s.endswith(".mp4"),os.listdir("./records/")))

HTML("""
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format("./records/"+video_names[-1])) #this may or may not be _last_ video. Try other indices

## How to enhance
* Add recurrent memory (LSTM/GRU really helps for this env), here's a [tutorial](http://bit.ly/2oZ34Ap)
* More parallel agents
* Different constructs for recurrent memory
* Try something like [this](https://arxiv.org/abs/1611.01224)
* Maybe tune the regularization parameters