# Simple Imitation Learning in MineRL
This tutorial contains a simple example of how to build an imitation-learning-based agent that can solve the MineRLNavigateDense-v0 environment. For more information about that environment, see the [MineRL Docs](http://minerl.io/docs/environments/index.html#minerlnavigatedense-v0).

For more imitation learning algorithms, such as DAgger in TensorFlow, see this GitHub repo: [Dagger](https://github.com/zsdonghao/Imitation-Learning-Dagger-Torcs).

Parts of this tutorial are based on code by [Arthur Juliani](https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-2-ded33892c724).

In [1]:
from __future__ import division

import numpy as np
import tensorflow as tf
import tensorflow.contrib.slim as slim
%matplotlib inline
import matplotlib.pyplot as plt
import math

# Python 2/3 compatibility: fall back to range when xrange is undefined
try:
    xrange = xrange
except NameError:
    xrange = range



### Loading the CartPole Environment
If you don't already have OpenAI Gym installed, use `pip install gym` to grab it.

What happens if we try running the environment with random actions? How well do we do? (Hint: not so well.)

The goal of the task is to achieve a reward of 200 per episode. For every step the agent keeps the pole in the air, the agent receives a +1 reward. By randomly choosing actions, our reward for each episode is only a couple dozen. Let's make that better with RL!

### Setting up our Neural Network agent
Here we use a policy neural network that takes 64x64 observations with four channels (the RGB `pov` image plus one compass-angle channel), passes them through four convolutional layers and one linear layer, and then produces a softmax probability over three actions: turning the camera one way, turning it the other way, or jumping. To learn more about policy networks, see [Andrej Karpathy's blog on Policy Gradient networks](http://karpathy.github.io/2016/05/31/rl/).
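
The constant `H = 57600` in the next cell is the length of the flattened output of the last convolution. Here is a minimal sketch of that arithmetic, assuming the layer settings used in the cell below (2x2 kernels, stride 1, `VALID` padding, 16 filters). The helper name `valid_conv_output_size` is made up for illustration and is not part of TensorFlow.

```python
def valid_conv_output_size(size, kernel, stride):
    """Spatial output size of a convolution with 'VALID' (no) padding."""
    return (size - kernel) // stride + 1

side = 64                      # input frames are 64x64
for _ in range(4):             # four stacked conv layers
    side = valid_conv_output_size(side, kernel=2, stride=1)

num_filters = 16               # channels produced by the last conv layer
flat_size = side * side * num_filters
print(side, flat_size)         # 60 57600
```

Each 2x2 `VALID` convolution with stride 1 shrinks the image by one pixel per side, so changing the kernel size, the stride, or the number of layers means `H` has to be recomputed.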

In [2]:
H = 57600  # length of the flattened conv4 output: 60 * 60 * 16

tf.reset_default_graph()

state = tf.placeholder(shape=[None,64,64,4], dtype=tf.float32)
conv1 = slim.conv2d(
    inputs=state, num_outputs=16, kernel_size=[2, 2], stride=[1, 1],
    padding='VALID', biases_initializer=None, activation_fn=tf.nn.relu)
conv2 = slim.conv2d(
    inputs=conv1, num_outputs=16, kernel_size=[2, 2], stride=[1, 1],
    padding='VALID', biases_initializer=None, activation_fn=tf.nn.relu)
conv3 = slim.conv2d(
    inputs=conv2, num_outputs=16, kernel_size=[2, 2], stride=[1, 1],
    padding='VALID', biases_initializer=None, activation_fn=tf.nn.relu)
conv4 = slim.conv2d(
    inputs=conv3, num_outputs=16, kernel_size=[2, 2], stride=[1, 1],
    padding='VALID', biases_initializer=None, activation_fn=tf.nn.relu)

convFlat = slim.flatten(conv4)
print("convFlat: " + str(convFlat))

#observations = tf.placeholder(tf.float32, [None,D] , name="input_x")
W = tf.get_variable("W", shape=[H, 3],
           initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(convFlat, W)
probability = tf.nn.softmax(score)

real_action = tf.placeholder(shape=[None, 3], dtype=tf.int32)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=real_action, 
                                                                     logits=score))
tf.summary.scalar('loss', loss)
train_step = tf.train.RMSPropOptimizer(0.001).minimize(loss)

merged = tf.summary.merge_all()

convFlat: Tensor("Flatten/flatten/Reshape:0", shape=(?, 57600), dtype=float32)
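
The loss in the cell above compares the network's three logits against a one-hot label for the demonstrated action. As a rough illustration of what a softmax cross-entropy computes for a single example, here is a plain-Python sketch. The helper names and the sample logits are made up for illustration; the notebook itself relies on `tf.nn.softmax_cross_entropy_with_logits`.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot_label, logits):
    """Negative log-probability assigned to the labelled class."""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs))

logits = [2.0, 0.5, -1.0]      # hypothetical scores for the 3 actions
label = [1, 0, 0]              # the demonstrator chose action 0
print(softmax(logits))
print(cross_entropy(label, logits))
```

The loss is small when the network puts most of its probability on the demonstrated action and grows as that probability shrinks, which is what pushes the policy toward imitating the recorded behaviour.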


### Advantage function
This function allows us to weight the rewards our agent receives. In the context of the Cart-Pole task, we want actions that kept the pole in the air a long time to have a large reward, and actions that contributed to the pole falling to have a decreased or negative reward. We do this by weighting the rewards backwards from the end of the episode: actions near the end are seen as negative, since they likely contributed to the pole falling and the episode ending. Likewise, early actions are seen as more positive, since they weren't responsible for the pole falling.
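
One common way to implement this kind of weighting is a discounted return, where each step is credited with its own reward plus a decayed sum of the rewards that follow. The sketch below is an assumption about what such a function could look like; the names `discount_rewards` and `center` and the `gamma` value are illustrative, and the cells in this notebook do not define or call them.

```python
def discount_rewards(rewards, gamma=0.99):
    """Return the discounted sum of future rewards for every time step."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        discounted[t] = running
    return discounted

def center(values):
    """Subtract the mean so below-average steps become negative."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

returns = discount_rewards([1.0, 1.0, 1.0, 1.0], gamma=0.5)
print(returns)          # [1.875, 1.75, 1.5, 1.0]
print(center(returns))  # early steps positive, late steps negative
```

With a constant +1 reward per step, early steps accumulate more future reward than late ones, so after centering the last actions of the episode end up with negative weights.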

### Running the Agent and Environment

Here we run the neural network agent and have it act in the MineRLNavigateDense-v0 environment.

In [3]:
'''
import minerl
import gym
env = gym.make('MineRLNavigateDense-v0')

obs  = env.reset()
done = False
net_reward = 0

while not done:
    action = env.action_space.noop()

    action['camera'] = [0, -10]
    action['back'] = 0
    action['forward'] = 1
    action['jump'] = 1
    action['attack'] = 1

    obs, reward, done, info = env.step(
        action)

    net_reward += reward
    print("Total reward: ", net_reward)
'''


In [4]:
#import minerl
#data = minerl.data.make('MineRLNavigateDense-v0', '/home/kimbring2/MineRL/data/')

In [5]:
import minerl
import gym

env = gym.make('MineRLNavigateDense-v0')
obs = env.reset()

In [None]:
init = tf.global_variables_initializer()
with tf.Session() as sess:
    # Launch the graph
    rendering = False
    sess.run(init)
    saver = tf.train.Saver(max_to_keep=5)
    train_writer = tf.summary.FileWriter('/home/kimbring2/MineRL/train_summary', sess.graph)
    
    print('Loading Model...')
    path = '/home/kimbring2/MineRL/model'
    ckpt = tf.train.get_checkpoint_state(path)
    saver.restore(sess, ckpt.model_checkpoint_path)
    
    env.init()
    obs = env.reset()
    net_reward = 0
    for i in range(0, 500000):
        pov = obs['pov'].astype(np.float32) / 255.0 - 0.5
        compass = obs['compassAngle']

        compass_channel = np.ones(shape=list(pov.shape[:-1]) + [1], dtype=np.float32) * compass
        compass_channel /= 180.0
        
        state_concat = np.concatenate([pov, compass_channel], axis=-1)
        action_probability = sess.run(probability, feed_dict={state:[state_concat]})

        action = env.action_space.noop()
        if (np.argmax(action_probability) == 0):
            action['camera'] = [0, -10]
            action['jump'] = 0
        elif (np.argmax(action_probability) == 1):
            action['camera'] = [0, 10]
            action['jump'] = 0
        else:
            action['camera'] = [0, 0]
            action['jump'] = 1
        
        action['forward'] = 1
        action['back'] = 0
        action['left'] = 0
        #action['jump'] = np.argmax(j)
        action['right'] = 0
        action['sprint'] = 1

        obs, reward, done, info = env.step(action)

        # Add the reward before checking `done` so the final step is counted
        net_reward += reward
        print("Total reward: ", net_reward)

        if done:
            break

Loading Model...
INFO:tensorflow:Restoring parameters from /home/kimbring2/MineRL/model/model-60600.cptk
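
The loop above turns the network's three output probabilities into a MineRL action dictionary. Here is a minimal stand-alone sketch of that mapping, using a plain `dict` in place of `env.action_space.noop()` so that it runs without MineRL installed. The function name `probs_to_action` is hypothetical and does not appear in the notebook.

```python
def probs_to_action(action_probability):
    """Map 3 class probabilities to the movement keys used in the loop above."""
    # Index of the most likely class (first index wins on ties).
    best = max(range(len(action_probability)),
               key=lambda i: action_probability[i])

    # The agent always sprints forward; the class only picks turn or jump.
    action = {'forward': 1, 'back': 0, 'left': 0, 'right': 0, 'sprint': 1}
    if best == 0:
        action['camera'] = [0, -10]   # turn the camera by -10 degrees
        action['jump'] = 0
    elif best == 1:
        action['camera'] = [0, 10]    # turn the camera by +10 degrees
        action['jump'] = 0
    else:
        action['camera'] = [0, 0]     # keep the heading and jump
        action['jump'] = 1
    return action

print(probs_to_action([0.1, 0.2, 0.7]))
```

Keeping the mapping in one small function makes it easier to check that the class indices used at inference time match the labels used when the demonstrations were encoded for training.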


As you can see, the network not only does much better than random actions, but achieves the goal of 200 points per episode, thus solving the task!