# M2177.003100 Deep Learning <br>Assignment #5 Part 1: Implementing and Training a Deep Q-Network

Copyright (C) Data Science Laboratory, Seoul National University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them. Written by Hyemi Jang, November 2018

In this notebook, you will implement one of famous reinforcement learning algorithm, Deep Q-Network (DQN) of DeepMind. <br>
The goal here is to understand a basic form of DQN [1, 2] and learn how to use OpenAI Gym toolkit [3].<br>
You need to follow the instructions to implement the given classes.

1. [Play](#play) ( 50 points )

**Note**: certain details are missing or ambiguous on purpose, in order to test your knowledge on the related materials. However, if you really feel that something essential is missing and cannot proceed to the next step, then contact the teaching staff with clear description of your problem.

### Submitting your work:
<font color=red>**DO NOT clear the final outputs**</font> so that TAs can grade both your code and results.  
Once you have done **two parts of the assignment**, run the *CollectSubmission.sh* script with your **Team number** as input argument. <br>
This will produce a zipped file called *[Your team number].tar.gz*. Please submit this file on ETL. &nbsp;&nbsp; (Usage: ./*CollectSubmission.sh* &nbsp; Team_#)

### Some helpful references for assignment #4 :
- [1] Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013). [[pdf]](https://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&cad=rja&uact=8&ved=0ahUKEwiI3aqPjavVAhXBkJQKHZsIDpgQFgg7MAI&url=https%3A%2F%2Fwww.cs.toronto.edu%2F~vmnih%2Fdocs%2Fdqn.pdf&usg=AFQjCNEd1AJoM72DeDpI_GBoPuv7NnVoFA)
- [2] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533. [[pdf]](https://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf)
- [3] OpenAI GYM website [[link]](https://gym.openai.com/envs) and [[git]](https://github.com/openai/gym)

## 0. OpenAI Gym

OpenAI Gym is a toolkit to support diverse environments for developing reinforcement learning algorithms. You can use the toolkit with Python as well as TensorFlow. Installation guide of OpenAI Gym is offered by [this link](https://github.com/openai/gym#installation) or just type the command "pip install gym" (as well as "pip install gym[atari]" for Part2). 

After you set up OpenAI Gym, you can use APIs of the toolkit by inserting <font color=red>import gym</font> into your code. In this assignment, you must build one of famous reinforcement learning algorithms whose agent can run on OpenAI Gym environments. Please check how to use APIs such as funcions interacting with environments in the followings.

In [1]:
#import matplotlib.pyplot as plt
import tensorflow as tf
import cv2 
import gym
import numpy as np
import os

In [2]:
# Make an environment instance of CartPole-v0.
env = gym.make('CartPole-v0')

# Before interacting with the environment and starting a new episode, you must reset the environment's state.
state = env.reset()

# Uncomment to show the screenshot of the environment (rendering game screens)
# env.render() 

# You can check action space and state (observation) space.
num_actions = env.action_space.n
state_shape = env.observation_space.shape
print(num_actions)
print(state_shape)

# "step" function performs agent's actions given current state of the environment and returns several values.
# Input: action (numerical data)
#        - env.action_space.sample(): select a random action among possible actions.
# Output: next_state (numerical data, next state of the environment after performing given action)
#         reward (numerical data, reward of given action given current state)
#         terminal (boolean data, True means the agent is done in the environment)
next_state, reward, terminal, info = env.step(env.action_space.sample())

2
(4,)


## 1. Implement a DQN agent
## 1) Overview of implementation in the notebook

The assignment is based on a method named by Deep Q-Network (DQN) [1,2]. You could find the details of DQN in the papers. The followings show briefly architecture of DQN and its training computation flow.

- (Pink flow) Play an episode and save transition records of the episode into a replay memory.
- (Green flow) Train DQN so that a loss function in the figure is minimized. The loss function is computed using main Q-network and Target Q-network. Target Q-network needs to be periodically updated by copying the main Q-network.
- (Purple flow) Gradient can be autonomously computed by tensorflow engine, if you build a proper optimizer.

![](image/architecture.png)

There are major 4 components, each of which needs to be implemented in this notebook. The Agent class must have an instance(s) of each class (Environment, DQN, ReplayMemory).
- Environment
- DQN 
- ReplayMemory
- Agent

![](image/components.png)



## 2) Design classes

In the code cells, there are only names of functions which are used in TA's implementation and their brief explanations. <font color='green'>...</font> means that the functions need more arguments and <font color='green'>pass</font> means that you need to write more codes. The functions may be helpful when you do not know how to start the assignment. Of course, you could change the functions such as deleting/adding functions or extending/reducing roles of the classes, <font color='red'> just keeping the existence of the classes</font>.

### Environment class

In [3]:
class Environment(object):
    def __init__(self, args):
        #Initializing environments with arguments from the args.
        self.args = args
        self.env = env
        self.num_actions = num_actions
        self.state_shape = list(state_shape)
        
    def random_action(self):
        # Return a random action.
        return self.env.action_space.sample()
    
    def render_worker(self):
        # If display in your option is true, do rendering. Otherwise, do not.
        #Defining display argument as True/False from self.args.display
        if self.args.display:
            self.env.render()
            sleep(self.args.display_interval)
    
    def new_episode(self):
        # Start a new episode and return the first state of the new episode.
        return self.env.reset()
    
    def act(self, action):
        # Perform an action which is given by input argument and return the results of acting.
        #As it shows above code, we followed the code and exclude the self.info, because self.info has nothing important.
        self.state, self.reward, self.terminal, self.info = self.env.step(action)
        self.render_worker()
        return self.state, self.reward, self.terminal

### ReplayMemory class

In [4]:
class ReplayMemory(object):
    def __init__(self, args, state_shape):
        self.args = args
        self.state_shape = state_shape
        #Initialize count and current variable to count the index of action, reward, and termianls.
        self.count = 0
        self.current = 0
        
        #Initializing array shape of actions(uint8), rewards(float32), terminals(float32), and states (float32)
        #to check which index of each has what values.
        self.actions = np.empty(self.args.memory_size, dtype=np.uint8)
        self.rewards = np.empty(self.args.memory_size, dtype=np.float32)
        self.terminals = np.empty(self.args.memory_size, dtype=np.bool)
        self.next_states = np.empty([self.args.memory_size] + self.state_shape, dtype=np.float32)
        self.prestates = np.empty([self.args.batch_size] + self.state_shape, dtype=np.float32)
        self.poststates = np.empty([self.args.batch_size] + self.state_shape, dtype=np.float32)
        
    def add(self, action, reward, terminal, next_state):
        # Add current_state, action, reward, terminal, (next_state which can be added by your choice). 
        self.actions[self.current] = action
        self.rewards[self.current] = reward
        self.terminals[self.current] = terminal
        
        self.next_states[self.current] = next_state
        self.count = max(self.count, self.current+1)
        self.current = (self.current+1) % self.args.memory_size
    
    def mini_batch(self):
        # Return a mini_batch whose data are selected according to your sampling method. (such as uniform-random sampling in DQN papers)
        batch_idx = []
        while len(batch_idx) < self.args.batch_size:
            while True:
                idx = np.random.randint(low=1, high=self.count)
                if idx == self.current:
                    continue
                if self.terminals[idx-1]:
                    continue
                break
            self.prestates[len(batch_idx)] = self.next_states[idx-1]
            self.poststates[len(batch_idx)] = self.next_states[idx]
            batch_idx.append(idx)
            
        actions = self.actions[batch_idx]
        rewards = self.rewards[batch_idx]
        terminals = self.rewards[batch_idx]
        #Returns previous states, actions, rewards, terminals, and post state.
        return self.prestates, actions, rewards, terminals, self.poststates

### DQN class

In [5]:
import tensorflow.contrib.layers as layers

class DQN(object):
    def __init__(self, args, sess, memory, environment):
        self.args = args
        self.sess = sess
        self.memory = memory
        self.env = environment
        
        self.num_actions = num_actions
        self.input_shape = self.memory.state_shape
        
        self.states = tf.placeholder(tf.float32, [None] + self.input_shape)
        self.actions = tf.placeholder(tf.uint8, [None])
        self.rewards = tf.placeholder(tf.float32, [None])
        self.terminals = tf.placeholder(tf.float32, [None])
        self.max_q = tf.placeholder(tf.float32, [None])
        
        self.prediction_Q = self.build_network('pred')
        self.target_Q = self.build_network('target')
        self.loss, self.optimizer = self.build_optimizer()
    
    def build_network(self, name):
        # Make your a deep neural network
        #Tried with convolutional network but didn't give us good rewards.
        '''
        def conv2d(x, output_dim, kernel_size, stride, initializer, activation, padding='VALID', name='conv2d'):
            with tf.variable_scope(name):
                stride = [1, 1, stride[0], stride[1]]
                kernel = [kernel_size[0], kernel_size[1], x.get_shape()[1], output_dim]

                w = tf.get_variable('w', kernel, tf.float32, initializer=initializer)
                b = tf.get_variable('b', [output_dim], 
                        initializer=tf.constant_initializer(0.0))
                conv = tf.nn.conv2d(x, w, stride, padding, data_format='NCHW')
                out = activation(tf.nn.bias_add(conv, b, 'NCHW'))

                return out
        '''    
        #This is a function for layers with weights and bias matmul operation  
        '''
        def linear(x, output_size, stddev=0.02, bias_start=0.0, activation=None, name='linear'):
            shape = x.get_shape().as_list()
            
            with tf.variable_scope(name):
                w = tf.get_variable('w', [shape[1], output_size], tf.float32, tf.random_normal_initializer(stddev=stddev))
                b = tf.get_variable('b', [output_size], initializer=tf.constant_initializer(bias_start))
                out = tf.nn.bias_add(tf.matmul(x, w), b)
                if activation != None:
                    out = activation(out)

                return out
         '''
        with tf.variable_scope(name):
            #Tried with difference hidden layer, but took long time for training and the results weren't good enough.
            
            #fc1 = tf.layers.dense(inputs=self.states, units=100, activation=tf.nn.relu, kernel_initializer=self.kernel_initializer)
            #fc2 = tf.layers.dense(inputs=fc1, units=50, activation=tf.nn.relu, kernel_initializer=self.kernel_initializer)
            #fc3 = tf.layers.dense(inputs=fc2, units=10, activation=tf.nn.relu, kernel_initializer=self.kernel_initializer)
            #Q = tf.layers.dense(inputs=fc3, units=self.num_actions, activation=None, kernel_initializer=self.kernel_initializer)
            
            #Tried conv layers/linears with matmul method as well, but not good enough results performed.
            '''
            self.l1 = conv2d(self.states, 32, [8, 8], [4, 4], initializer, activation_fn, name='l1')
            self.l2 = conv2d(self.l1, 64, [4, 4], [2, 2], initializer, activation_fn, name='l2')
            self.l3 = con2d(self.l2, 64, [3, 3], [1, 1], initializer, activation_fn, name='l3')
                
            shape = self.l3.get_shape().as_list()
            self.l3_flat = tf.reshape(self.l3, [-1, reduce(lambda x, y: x*y, shape[1:])])
            self.l4 = linear(self.l3_flat, 512, activation_fn, name='l4')
            self.Q = linear(self.l4, self.num_actions, activation_fn, name='Q')
            '''
            
            #Set kernel, bias initializer, and activation function.
            self.kernel_initializer = tf.truncated_normal_initializer(mean=0.0, stddev=0.02)
            self.bias_initializer = tf.constant_initializer(0.05)
            self.activation_fn = tf.nn.relu
            
            #Created simple Fully_connected layers with small hidden units. This performs the best after configuring params.
            #Last layer which is Q being set with softmax function.
            fc1 = layers.fully_connected(self.states, 6, biases_initializer=None, activation_fn=self.activation_fn)
            fc2 = layers.fully_connected(fc1, 4, biases_initializer=None, activation_fn=self.activation_fn)
            Q = layers.fully_connected(fc2, self.num_actions, biases_initializer=None, activation_fn=tf.nn.softmax)
            
            return Q
        
    def build_optimizer(self):
        # Make your optimizer 
        # Calculating the target Q value (= r + gamma * maxQ(next_state)) which is written in the slide of chpater 15.
        target_q = self.rewards + tf.multiply(1-self.terminals, tf.multiply(self.args.discount_factor, self.max_q))
        
        # Calculating the predicted Q value
        action_one_hot = tf.one_hot(indices=self.actions, depth=self.num_actions, on_value=1.0, off_value=0.0)
        pred_q = tf.reduce_sum(tf.multiply(self.prediction_Q, action_one_hot), reduction_indices=1)
        
        # Calculating the loss and make an optimizer as always we did in previous assignments.
        loss = tf.reduce_mean(tf.square(pred_q - target_q))
        optimizer = tf.train.AdamOptimizer(learning_rate=self.args.learning_rate).minimize(loss)
        
        return loss, optimizer
    
    def train_network(self):
        # Train the prediction_Q network using a mini-batch sampled from the replay memory
        # Get minibatch values from ReplayMemory(mini_batch function)
        minib_prestates, minib_actions, minib_rewards, minib_terminals, minib_poststates = self.memory.mini_batch()
        
        # Calculating the target Q value (batch)
        minib_q_poststates = self.sess.run(self.target_Q, feed_dict={self.states: minib_poststates})
        minib_max_q = np.max(minib_q_poststates, axis=1)
        
        #Running optimizer by feeding dict.
        return self.sess.run([self.loss, self.optimizer], feed_dict={self.states: minib_prestates, self.actions: minib_actions, self.rewards: minib_rewards, self.terminals: minib_terminals, self.max_q: minib_max_q})
        
    def update_target_network(self):
        #We separated these codes as another function(update_target_network)
        copy_op = []
        pred_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='pred')
        target_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='target')
        for pred_var, target_var in zip(pred_vars, target_vars):
            copy_op.append(target_var.assign(pred_var.value()))
        self.sess.run(copy_op)
    
    def predict_Q(self, state):
        #Predicting Q.
        return self.sess.run(self.prediction_Q, feed_dict={self.states: [state]})
    

### Agent class

In [6]:
import os # to save and load
import time

class Agent(object):
    def __init__(self, args, sess):
        #As this is the Agent class, we need to initialize environment, replay memory, and DQN class.
        self.args = args
        self.sess = sess
        self.env = Environment(args)
        self.memory = ReplayMemory(args, self.env.state_shape)
        self.dqn = DQN(self.args, self.sess, self.memory, self.env)
        
        self.saver = tf.train.Saver()
        self.sess.run(tf.global_variables_initializer())
        
        # Synchronize the target network with the main network
        self.dqn.update_target_network()
        
        # Seed of the random number generator
        np.random.seed(int(time.time()))
    
    def select_action(self):
        # Select an action according ε-greedy. You need to use a random-number generating function and add a library if necessary.
        #If train is true, then use self.eps as max of below code. OR use eps_test value from arguments.
        if self.args.train:
            self.eps = np.max([self.args.eps_min, self.args.eps_init - (self.args.eps_init - self.args.eps_min)*(float(self.step)/float(self.args.max_exploration_step))])
        else:
            self.eps = self.args.eps_test
        #Picking random actions.
        if np.random.rand() < self.eps:
            action = self.env.random_action()
        else:
            q = self.dqn.predict_Q(self.state)[0]
            action = np.argmax(q)
        
        return action
    
    def train(self):
        # Train your agent which has the neural nets.
        # Several hyper-parameters are determined by your choice (Options class in the below cell)
        # Keep epsilon-greedy action selection in your mind 
        episodes_count = 0
        best_reward = 0
        best_avg_reward = 0
        episode_reward = 0
        episode_rewards = []
        
        print('----Creating Random Memory----')
        self.state = self.env.new_episode()
        for self.step in range(1, self.args.max_step + 1):
            if self.step == 1:
                print('----Trying to upate the network----')
            
            action = self.select_action() # Select an action by epsilon-greedy
            next_state, reward, terminal = self.env.act(action) # Perform the action and receive information of the environment
            self.memory.add(action, reward, terminal, next_state) # Save the information
            self.state = next_state # Update the input state
            
            #Adding reward to episode_reward.
            episode_reward += reward
            if terminal:
                episodes_count += 1
                episode_rewards.append(episode_reward)
                if episode_reward > best_reward:
                    best_reward = episode_reward
                episode_reward = 0
                self.state = self.env.new_episode()
            
            # Periodically update main network and target network
            if self.step >= self.args.training_start_step:
                #Updating target_network when self.step is divided by self.copy_interval that we initialized 
                if self.step % self.args.copy_interval == 0:
                    self.dqn.update_target_network()
                #Training DQN when the remainder is 0.
                if self.step % self.args.train_interval == 0:
                    loss, _ = self.dqn.train_network()
                if self.step % self.args.show_interval == 0:
                    #Calculating min, avg, max rewards from episode.
                    max_r = np.max(episode_rewards)
                    min_r = np.min(episode_rewards)
                    avg_r = np.mean(episode_rewards)
                    #Make max reward to best reward
                    if max_r > best_reward:
                        best_reward = max_r
                    if avg_r > best_avg_reward:
                        best_avg_reward = avg_r
                        #Save the checkpoint using save() function
                        self.save()
                    print('[%7d/%7d step] avg_r: %.4f, max_r: %3d, min_r: %3d, Best reward: %3d' %(self.step, self.args.max_step, avg_r, max_r, min_r, best_reward))
                    episode_rewards = []
    
    def play(self, num_episode=5, load=True):
        # Test your agent 
        # When performing test, you can show the environment's screen by rendering,
        #For each episode, we keep adding rewards until terminal is True. 
        best_reward = 0
        for episode in range(num_episode):
            self.state = self.env.new_episode()
            current_reward = 0
            
            terminal = False
            while not terminal:
                action = self.select_action()
                next_state, reward, terminal = self.env.act(action)
                current_reward += reward
                self.state = next_state
                
                if terminal:
                    break
            
            if current_reward > best_reward:
                best_reward = current_reward
        
        return best_reward
    
    def save(self):
        checkpoint_dir = 'cartpole'
        if not os.path.exists(checkpoint_dir):
            os.mkdir(checkpoint_dir)
        self.saver.save(self.sess, os.path.join(checkpoint_dir, 'trained_agent'))
        
    def load(self):
        print('Loading checkpoint...!')
        checkpoint_dir = 'cartpole'
        checkpoint_state = tf.train.get_checkpoint_state(checkpoint_dir)
        self.saver.restore(self.sess, os.path.join(checkpoint_dir, 'trained_agent'))
        print('Success to load checkpoint')
    

## 2. Train your agent 

Now, you train an agent to play CartPole-v0. Options class is the collection of hyper-parameters that you can choice. Usage of Options class is not mandatory.<br>
The maximum value of total reward which can be aquired from one episode is 200. 
<font color='red'>**You should show learning status such as the number of observed states and mean/max/min of rewards frequently (for instance, every 100 states).**</font>

In [7]:
import easydict

"""
You can add more arguments.
for example, visualize, memory_size, batch_size, discount_factor, eps_max, eps_min, learning_rate, train_interval, copy_interval and so on
"""
#Changed argparse with easydict. We don't know why but it gives us error while we complies it.
#So we find easydict method for alternative way
args = easydict.EasyDict({
    "env-name" : "CartPole-v0",
    "train" : True,
    "display" : False,
    
    "max_step" : 100000,
    "max_exploration_step" : 10000,
    "memory_size" : 10000,
    "batch_size" : 32,
    "num_skipping_states" : 4,
    "state_length" : 4,
    
    "discount_factor" : 0.99,
    "eps_init" : 1.0,
    "eps_min" : 0.1,
    "eps_test" : 0.05,
    "learning_rate" : 1e-5,
    
    "training_start_step" : 100,
    "train_interval" : 1,
    "copy_interval" : 100,
    "show_interval" : 2000,
    "display_interval" : 0.05,
    
    "gpu_num" : 0
})

# Basic DQN uses just one GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu_num)
config = tf.ConfigProto()
config.log_device_placement = False
config.gpu_options.allow_growth = True

In [8]:
# fot train
with tf.Session(config=config) as sess:
    myAgent = Agent(args, sess) # It depends on your class implementation
    myAgent.train()
    # myAgent.save()
    
    # test
    myAgent.load()
    rewards = []
    for i in range(20):
        r = myAgent.play() # play() returns the reward cumulated in one episode
        rewards.append(r)
    mean = np.mean(rewards)
    print(rewards)
    print(mean)

----Creating Random Memory----
----Trying to upate the network----
[   2000/ 100000 step] avg_r: 21.2553, max_r:  66, min_r:  10, Best reward:  66
[   4000/ 100000 step] avg_r: 18.7009, max_r:  87, min_r:   8, Best reward:  87
[   6000/ 100000 step] avg_r: 14.6765, max_r:  44, min_r:   8, Best reward:  87
[   8000/ 100000 step] avg_r: 12.4161, max_r:  22, min_r:   8, Best reward:  87
[  10000/ 100000 step] avg_r: 10.6543, max_r:  20, min_r:   8, Best reward:  87
[  12000/ 100000 step] avg_r: 9.8276, max_r:  16, min_r:   8, Best reward:  87
[  14000/ 100000 step] avg_r: 9.9851, max_r:  15, min_r:   8, Best reward:  87
[  16000/ 100000 step] avg_r: 9.7745, max_r:  17, min_r:   8, Best reward:  87
[  18000/ 100000 step] avg_r: 9.8235, max_r:  15, min_r:   8, Best reward:  87
[  20000/ 100000 step] avg_r: 9.8713, max_r:  14, min_r:   8, Best reward:  87
[  22000/ 100000 step] avg_r: 9.9406, max_r:  16, min_r:   8, Best reward:  87
[  24000/ 100000 step] avg_r: 9.9303, max_r:  15, min_r:   

## <a name="play"></a> 3. Test the trained agent ( 15 points )

Now, we test your agent and calculate an average reward of 20 episodes.
- 0 <= average reward < 50 : you can get 0 points
- 50 <= average reward < 100 : you can get 10 points
- 100 <= average reward < 190 : you can get 35 points
- 190 <= average reward <= 200 : you can get 50 points

In [8]:
# for test
with tf.Session(config=config) as sess:
    #args = parser.parse_args() # You set the option of test phase
    args["train"] = False
    myAgent = Agent(args, sess) # It depends on your class implementation
    myAgent.load()
    rewards = []
    for i in range(20):
        r = myAgent.play() # play() returns the reward cumulated in one episode
        rewards.append(r)
    mean = np.mean(rewards)
    print(rewards)
    print(mean)

Loading checkpoint...!
INFO:tensorflow:Restoring parameters from cartpole/trained_agent
Success to load checkpoint
[200.0, 68.0, 200.0, 200.0, 200.0, 200.0, 200.0, 149.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 157.0, 200.0, 200.0]
188.7
