# M2177.003100 Deep Learning <br>Assignment #5 Part 1: Implementing and Training a Deep Q-Network

Copyright (C) Data Science Laboratory, Seoul National University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them. Written by Hyemi Jang, November 2018

In this notebook, you will implement one of the famous reinforcement learning algorithms, DeepMind's Deep Q-Network (DQN). <br>
The goal is to understand the basic form of DQN [1, 2] and to learn how to use the OpenAI Gym toolkit [3].<br>
Follow the instructions to implement the given classes.

1. [Play](#play) ( 50 points )

**Note**: certain details are missing or ambiguous on purpose, in order to test your knowledge of the related material. However, if you really feel that something essential is missing and you cannot proceed to the next step, contact the teaching staff with a clear description of your problem.

### Submitting your work:
<font color=red>**DO NOT clear the final outputs**</font> so that TAs can grade both your code and results.  
Once you have finished **both parts of the assignment**, run the *CollectSubmission.sh* script with your **Team number** as the input argument. <br>
This will produce a zipped file called *[Your team number].tar.gz*. Please submit this file on ETL. &nbsp;&nbsp; (Usage: ./*CollectSubmission.sh* &nbsp; Team_#)

### Some helpful references for assignment #5 :
- [1] Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013). [[pdf]](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
- [2] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533. [[pdf]](https://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf)
- [3] OpenAI Gym website [[link]](https://gym.openai.com/envs) and [[git]](https://github.com/openai/gym)

## 0. OpenAI Gym

OpenAI Gym is a toolkit that provides diverse environments for developing reinforcement learning algorithms. You can use the toolkit from Python, alongside TensorFlow. The installation guide for OpenAI Gym is available at [this link](https://github.com/openai/gym#installation), or you can simply run the command "pip install gym" (as well as "pip install gym[atari]" for Part 2).

After you set up OpenAI Gym, you can use the toolkit's APIs by adding <font color=red>import gym</font> to your code. In this assignment, you must build one of the famous reinforcement learning algorithms, whose agent can run on OpenAI Gym environments. The following cells show how to use the APIs, such as the functions that interact with an environment.

In [1]:
#import matplotlib.pyplot as plt
import tensorflow as tf
import cv2 
import gym
import numpy as np
import os

In [2]:
# Make an environment instance of CartPole-v0.
env = gym.make('CartPole-v0')

# Before interacting with the environment and starting a new episode, you must reset the environment's state.
state = env.reset()

# Uncomment to show the screenshot of the environment (rendering game screens)
# env.render() 

# You can check action space and state (observation) space.
num_actions = env.action_space.n
state_shape = env.observation_space.shape
print(num_actions)
print(state_shape)

# "step" function performs agent's actions given current state of the environment and returns several values.
# Input: action (numerical data)
#        - env.action_space.sample(): select a random action among possible actions.
# Output: next_state (numerical data, next state of the environment after performing given action)
#         reward (numerical data, reward of given action given current state)
#         terminal (boolean data, True means the agent is done in the environment)
next_state, reward, terminal, info = env.step(env.action_space.sample())

2
(4,)
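
The reset/step cycle used in the cell above extends naturally to a full episode loop. The following is a minimal sketch of that loop. To keep it runnable without Gym installed, it uses `ToyEnv`, a hypothetical stand-in class that only mimics the `reset()` and `step()` return values shown above; `run_random_episode` is likewise an illustrative helper, not part of Gym.

```python
import random


class ToyEnv:
    """Hypothetical stand-in that mimics the reset/step interface used above."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.t = 0
        return [0.0]

    def step(self, action):
        # Advance one time step; the episode ends after max_steps steps.
        self.t += 1
        next_state = [float(self.t)]
        reward = 1.0
        terminal = self.t >= self.max_steps
        return next_state, reward, terminal, {}


def run_random_episode(env, num_actions, rng):
    """Play one episode with uniformly random actions; return the total reward."""
    state = env.reset()
    total_reward, terminal = 0.0, False
    while not terminal:
        action = rng.randrange(num_actions)
        state, reward, terminal, _ = env.step(action)
        total_reward += reward
    return total_reward


episode_return = run_random_episode(ToyEnv(max_steps=5), num_actions=2,
                                    rng=random.Random(0))
print(episode_return)  # 5.0: one unit of reward for each of the 5 steps
```

With a real environment, the same loop should work after replacing `ToyEnv(...)` with `gym.make('CartPole-v0')` and sampling actions with `env.action_space.sample()`, assuming the four-value `step` return shown in the cell above.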


## 1. Implement a DQN agent
## 1) Overview of implementation in the notebook

The assignment is based on a method named Deep Q-Network (DQN) [1,2]. You can find the details of DQN in the papers. The following briefly shows the architecture of DQN and the computation flow of its training.

- (Pink flow) Play an episode and save transition records of the episode into a replay memory.
- (Green flow) Train DQN so that the loss function shown in the figure is minimized. The loss function is computed using the main Q-network and the target Q-network. The target Q-network needs to be updated periodically by copying the main Q-network.
- (Purple flow) Gradients are computed automatically by the TensorFlow engine, provided that you build a proper optimizer.

![](image/architecture.png)
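
As a rough illustration of the green flow, the sketch below computes the target and squared error for a single transition, assuming the loss in the figure is the usual squared temporal-difference error described in [2]. The function names and the numbers are made up for this example.

```python
def td_target(reward, next_q_values, terminal, discount_factor):
    """Return r if the episode ended, else r + gamma * max_a' Q_target(s', a')."""
    if terminal:
        return reward
    return reward + discount_factor * max(next_q_values)


def squared_td_error(q_value, target):
    """Squared difference between the main network's Q(s, a) and the target."""
    return (target - q_value) ** 2


# Made-up values: next_q_values would come from the target Q-network,
# q_value from the main Q-network.
target = td_target(reward=1.0, next_q_values=[2.0, 3.0], terminal=False,
                   discount_factor=0.5)
loss = squared_td_error(q_value=2.0, target=target)
print(target, loss)  # 2.5 0.25
```

Because the target is computed from a separate, periodically copied network, it stays fixed between copies, which is the reason the two networks are kept apart.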

There are four major components, each of which needs to be implemented in this notebook. The Agent class must hold at least one instance of each of the other classes (Environment, DQN, ReplayMemory).
- Environment
- DQN 
- ReplayMemory
- Agent

![](image/components.png)



## 2) Design classes

In the code cells, there are only the names of the functions used in the TA's implementation, with brief explanations. <font color='green'>...</font> means that the function needs more arguments, and <font color='green'>pass</font> means that you need to write more code. The functions may be helpful when you do not know how to start the assignment. Of course, you may change them, for example by deleting or adding functions or by extending or reducing the roles of the classes, <font color='red'>as long as the classes themselves still exist</font>.

### Environment class

In [3]:
import random
class Environment(object):
    def __init__(self, env, state_size, action_size):
        self.env = env
        self.state_size = state_size
        self.action_size = action_size
        pass
    
    def random_action(self):
        # Return a random action.
        return random.randrange(self.action_size)
    
    def render_worker(self, render=False):
        # Render the environment only when the render option is True.
        if render:
            self.env.render()
    
    def new_episode(self):
        # Start a new episode and return the first state of the new episode.
        state = self.env.reset()
        state = np.reshape(state, [1, self.state_size])
        return state
    
    def act(self, action):
        # Perform the action given as the input argument and return the results of acting.
        next_state, reward, terminal, _ = self.env.step(action)
        return next_state, reward, terminal

### ReplayMemory class

In [4]:
from collections import deque
import random

class ReplayMemory(object):
    def __init__(self, state_size, batch_size):
        self.memory = deque(maxlen=2000)
        self.batch_size = batch_size
        self.state_size = state_size  # use the constructor argument, not the global env
        pass
    
    def add(self, state, action, reward, next_state, terminal):
        # Add current_state, action, reward, terminal, (next_state which can be added by your choice). 
        self.memory.append((state, action, reward, next_state, terminal))
        pass
    
    def mini_batch(self):
        # Return a mini_batch whose data are selected according to your sampling method. (such as uniform-random sampling in DQN papers)
        mini_batch = random.sample(self.memory, self.batch_size)
        
        states = np.zeros((self.batch_size, self.state_size))
        next_states = np.zeros((self.batch_size, self.state_size))
        actions, rewards, terminals = [], [], []
        
        for i in range(self.batch_size):
            states[i] = mini_batch[i][0]
            next_states[i] = mini_batch[i][3]
            actions.append(mini_batch[i][1])
            rewards.append(mini_batch[i][2])
            terminals.append(mini_batch[i][4])
            
        return states, actions, rewards, next_states, terminals            
        pass
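
The two behaviours this class relies on, eviction of the oldest transitions and uniform sampling, can be seen in isolation in the short sketch below. The capacity of 3 and the placeholder transitions are chosen only to make the effect visible.

```python
from collections import deque
import random

memory = deque(maxlen=3)  # tiny capacity so that eviction is easy to see
for t in range(5):
    memory.append(("state%d" % t, t))  # placeholder transitions

oldest_kept = memory[0]
print(oldest_kept)  # ('state2', 2): transitions 0 and 1 were evicted

rng = random.Random(0)
batch = rng.sample(list(memory), 2)  # uniform sampling without replacement
print(len(batch))  # 2
```

Note that `random.sample` raises `ValueError` when asked for more items than the memory holds, so training should start only after enough transitions have been collected.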
        

### DQN class

In [5]:
#from keras.layers import Dense
#from keras.optimizers import Adam
#from keras.models import Sequential
class DQN(object):
    def __init__(self, state_size, action_size, learning_rate, replay, batch_size, discount_factor):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.replay = replay
        self.batch_size = batch_size
        self.discount_factor = discount_factor
        self.prediction_Q = self.build_network('pred')
        self.target_Q = self.build_network('target')
        pass
    
    def build_network(self, name):
        # Build your deep neural network
        with tf.variable_scope(name , reuse=tf.AUTO_REUSE):
            model = tf.keras.Sequential()
            model.add(tf.layers.Dense(25, input_dim=self.state_size, activation='relu', 
                            kernel_initializer='he_uniform'))
            model.add(tf.layers.Dense(25, activation='relu', kernel_initializer='he_uniform'))
            model.add(tf.layers.Dense(25, activation='relu', kernel_initializer='he_uniform'))
            # Linear output layer: Q-values are regression targets, so they should not be clipped by ReLU.
            model.add(tf.layers.Dense(self.action_size, activation=None, 
                            kernel_initializer='he_uniform'))
            model.summary()
            model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(lr=self.learning_rate))
            return model
            pass
                
        #copy_op = []
        #pred_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='pred')
        #target_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='target')
        #for pred_var, target_var in zip(pred_vars, target_vars):
        #    copy_op.append(target_var.assign(pred_var.value()))
    
    #def build_optimizer(self):
        # Make your optimizer 
    #    pass
    
    def train_network(self, discount_factor):
        # Train the prediction_Q network using a mini-batch sampled from the replay memory
        states, actions, rewards, next_states, terminals = self.replay.mini_batch()
        
        pred_Q = self.prediction_Q.predict(states)
        tar_Q = self.target_Q.predict(next_states)
        
        for i in range(self.batch_size):
            if terminals[i]:
                pred_Q[i][actions[i]] = rewards[i]
            else:
                pred_Q[i][actions[i]] = rewards[i] + discount_factor*(np.amax(tar_Q[i]))
                
        self.prediction_Q.fit(states, pred_Q, batch_size=self.batch_size, epochs=1, verbose=0)
        pass
    
    def update_target_network(self):
        #self.sess.run(copy_op)
        self.target_Q.set_weights(self.prediction_Q.get_weights())
    
    #def predict_Q(self, ...):
    #    pass
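
The way `train_network` turns a mini-batch into regression targets for `fit` can be sketched with plain lists. Only the entry of the action actually taken is overwritten; every other entry keeps the network's own prediction, so it contributes zero error to the MSE loss. `build_fit_targets` is a hypothetical helper written for this illustration, and the numbers are made up.

```python
def build_fit_targets(pred_q, next_q, actions, rewards, terminals, discount_factor):
    """Copy pred_q and overwrite only the entry of the action taken in each row."""
    targets = [row[:] for row in pred_q]
    for i, action in enumerate(actions):
        if terminals[i]:
            targets[i][action] = rewards[i]
        else:
            targets[i][action] = rewards[i] + discount_factor * max(next_q[i])
    return targets


pred_q = [[1.0, 2.0], [0.5, 0.25]]  # main network output for the sampled states
next_q = [[4.0, 2.0], [9.0, 9.0]]   # target network output for the next states
targets = build_fit_targets(pred_q, next_q, actions=[0, 1], rewards=[1.0, 1.0],
                            terminals=[False, True], discount_factor=0.5)
print(targets)  # [[3.0, 2.0], [0.5, 1.0]]
```

The second row is terminal, so its target is the reward alone and the large values in `next_q` are ignored.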

### Agent class

In [6]:
import os # to save and load
import random
class Agent(object):
    def __init__(self, args, mode):
        self.env = gym.make(args.env_name)
        self.state_size = self.env.observation_space.shape[0]
        self.action_size = self.env.action_space.n
        #self.saver = tf.train.Saver()
        if mode=='train':
            self.epsilon = 1.0
        elif mode=='test':
            self.epsilon = 0.0
        else:
            raise Exception("mode type not supported: {}".format(mode))
        self.epsilon_decay_steps = args.epsilon_decay_steps
        self.learning_rate = args.learning_rate
        self.batch_size = args.batch_size
        self.discount_factor = args.discount_factor
        self.episodes = args.episodes
        self.ENV = Environment(self.env, self.state_size, self.action_size)
        self.replay = ReplayMemory(self.state_size, self.batch_size)
        self.dqn = DQN(self.state_size, self.action_size, self.learning_rate, 
                       self.replay, self.batch_size, self.discount_factor)
        pass
    
    def select_action(self, state):
        # Select an action according to the ε-greedy policy. You need to use a random-number generating function and add a library if necessary.
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        else:
            q_value = self.dqn.prediction_Q.predict(state)
            return np.argmax(q_value[0])
        pass
    
    def train(self):
        # Train your agent which has the neural nets.
        # Several hyper-parameters are determined by your choice (Options class in the below cell)
        # Keep epsilon-greedy action selection in mind
                
        scores, episodes = [], []
        
        for e in range(self.episodes):
            terminal = False
            score = 0
            state = self.ENV.new_episode()
            
            int_e = 0
            while not terminal:
                action = self.select_action(state)
                next_state, reward, terminal = self.ENV.act(action)
                next_state = np.reshape(next_state, [1, self.state_size])
                self.replay.add(state, action, reward, next_state, terminal)
                
                if len(self.replay.memory)>=1000:
                    if self.epsilon > 0.1:
                        self.epsilon -= 0.9/1e4
                    self.dqn.train_network(self.discount_factor)
                    
                score += reward
                state = next_state
                int_e += 1
                
                if terminal:
                    self.dqn.update_target_network()
                    scores.append(score)
                    episodes.append(e)
                    print('episode:', e, ' score:', score, ' epsilon', self.epsilon, 
                          ' last 10 mean score', np.mean(scores[-min(10, len(scores)):]))
                    
                    if np.mean(scores[-min(10, len(scores)):]) > 195:
                        print('Already well trained')
                        return
            
        pass
    
    def play(self, test=False):
        # Test your agent.
        # When testing, you can show the environment's screen by rendering.
        state = self.ENV.new_episode()
        self.ENV.render_worker(test)
        
        terminal = False
        score = 0
        while not terminal:
            action = self.select_action(state)
            next_state, reward, terminal = self.ENV.act(action)
            next_state = np.reshape(next_state, [1, self.state_size])
            score += reward
            state = next_state

            if terminal:
                return score
        pass
    
    def save(self):
        #checkpoint_dir = 'cartpole'
        #if not os.path.exists(checkpoint_dir):
        #    os.mkdir(checkpoint_dir)
        #self.saver.save(self.sess, os.path.join(checkpoint_dir, 'trained_agent'))
        os.makedirs("./save_model", exist_ok=True)  # create the directory if it is missing
        self.dqn.prediction_Q.save_weights("./save_model/dqn.h5")
        
    def load(self):
        #checkpoint_dir = 'cartpole'
        #self.saver.restore(self.sess, os.path.join(checkpoint_dir, 'trained_agent'))
        self.dqn.prediction_Q.load_weights("./save_model/dqn.h5")
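
The ε-greedy rule used by `select_action` can be checked on its own, without a network. The sketch below is a standalone version that takes a list of Q-values directly; `epsilon_greedy` and the Q-values are illustrative only.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)

# epsilon = 0.0 corresponds to 'test' mode: the agent always exploits.
greedy_action = epsilon_greedy([0.1, 0.9], epsilon=0.0, rng=rng)
print(greedy_action)  # 1

# epsilon = 1.0 corresponds to the start of training: every action is random.
random_actions = [epsilon_greedy([0.1, 0.9], epsilon=1.0, rng=rng)
                  for _ in range(100)]
```

This mirrors how the Agent sets `epsilon` to 1.0 in 'train' mode and 0.0 in 'test' mode.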

## 2. Train your agent 

Now, you train an agent to play CartPole-v0. The Options class is the collection of hyper-parameters that you can choose. Using the Options class is not mandatory.<br>
The maximum total reward that can be acquired in one episode is 200. 
<font color='red'>**You should show the learning status, such as the number of observed states and the mean/max/min of rewards, frequently (for instance, every 100 states).**</font>
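
One possible way to produce such a status report is sketched below. The helpers `summarize` and `should_log` are hypothetical names, and the reporting interval of 100 states simply follows the example given above.

```python
def summarize(rewards):
    """Return (mean, max, min) of a non-empty list of episode rewards."""
    return sum(rewards) / len(rewards), max(rewards), min(rewards)


def should_log(num_states, every=100):
    """True on every `every`-th observed state."""
    return num_states > 0 and num_states % every == 0


recent_rewards = [10.0, 20.0, 30.0]  # made-up episode rewards
mean_r, max_r, min_r = summarize(recent_rewards)

num_states = 200
if should_log(num_states):
    print('states:', num_states, ' mean:', mean_r, ' max:', max_r, ' min:', min_r)
```

Inside a training loop, `num_states` would be a counter incremented after every environment step.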

In [7]:
import argparse
import sys
parser = argparse.ArgumentParser(description="CartPole")
parser.add_argument('--env_name', default='CartPole-v0', type=str,
                    help="Environment")
#parser.add_argument('--render', default=False, type=bool)
parser.add_argument('--epsilon_decay_steps', default=1000, type=int,
                    help="how many steps for epsilon to be 0.1")
parser.add_argument('--learning_rate', default=0.001, type=float)
parser.add_argument('--batch_size', default=64, type=int)
parser.add_argument('--discount_factor', default=0.99, type=float)
parser.add_argument('--episodes', default=300, type=int)
sys.argv = ['-f']
args = parser.parse_args()
print(args)
config = tf.ConfigProto()
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
config.log_device_placement = False
config.gpu_options.allow_growth = True

#with tf.Session(config=config) as sess:
#myAgent = Agent(args, 'train') # It depends on your class implementation
#myAgent.train()
#myAgent.save()

Namespace(batch_size=64, discount_factor=0.99, env_name='CartPole-v0', episodes=300, epsilon_decay_steps=1000, learning_rate=0.001)
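
The `epsilon_decay_steps` option is described as the number of steps for epsilon to reach 0.1, although the `train` method above decreases epsilon by a fixed 0.9/1e4 per training step instead. A linear schedule driven by that option could look like the following sketch; `linear_epsilon` is a hypothetical helper, and the start and end values of 1.0 and 0.1 are taken from the Agent code.

```python
def linear_epsilon(step, decay_steps, start=1.0, end=0.1):
    """Linearly anneal epsilon from start to end over decay_steps, then hold at end."""
    if step >= decay_steps:
        return end
    fraction = step / decay_steps
    return start + fraction * (end - start)


for step in (0, 500, 1000, 5000):
    print(step, linear_epsilon(step, decay_steps=1000))
```

Epsilon starts at 1.0, falls roughly halfway to 0.1 by step 500, and stays at 0.1 from step 1000 onwards.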


In [8]:
#with tf.Session(config=config) as sess:
myAgent = Agent(args, 'train') # It depends on your class implementation
myAgent.train()
myAgent.save()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 25)                125       
_________________________________________________________________
dense_2 (Dense)              (None, 25)                650       
_________________________________________________________________
dense_3 (Dense)              (None, 25)                650       
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 52        
Total params: 1,477
Trainable params: 1,477
Non-trainable params: 0
_________________________________________________________________
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_5 (Dense)              (None, 25)                125       
_________________________________________________________________
dense_6 

episode: 92  score: 39.0  epsilon 0.8910999999999583  last 10 mean score 33.6
episode: 93  score: 37.0  epsilon 0.887769999999957  last 10 mean score 32.9
episode: 94  score: 11.0  epsilon 0.8867799999999566  last 10 mean score 30.9
episode: 95  score: 32.0  epsilon 0.8838999999999555  last 10 mean score 30.6
episode: 96  score: 27.0  epsilon 0.8814699999999546  last 10 mean score 28.9
episode: 97  score: 62.0  epsilon 0.8758899999999524  last 10 mean score 33.5
episode: 98  score: 19.0  epsilon 0.8741799999999518  last 10 mean score 29.9
episode: 99  score: 54.0  epsilon 0.8693199999999499  last 10 mean score 31.2
episode: 100  score: 23.0  epsilon 0.8672499999999491  last 10 mean score 32.0
episode: 101  score: 51.0  epsilon 0.8626599999999474  last 10 mean score 35.5
episode: 102  score: 25.0  epsilon 0.8604099999999465  last 10 mean score 34.1
episode: 103  score: 41.0  epsilon 0.8567199999999451  last 10 mean score 34.5
episode: 104  score: 39.0  epsilon 0.8532099999999437  last 1

episode: 196  score: 200.0  epsilon 0.2850399999998586  last 10 mean score 150.3
episode: 197  score: 200.0  epsilon 0.2670399999998628  last 10 mean score 154.5
episode: 198  score: 200.0  epsilon 0.24903999999986667  last 10 mean score 154.5
episode: 199  score: 200.0  epsilon 0.23103999999986533  last 10 mean score 159.8
episode: 200  score: 200.0  epsilon 0.21303999999986398  last 10 mean score 178.6
episode: 201  score: 200.0  epsilon 0.19503999999986263  last 10 mean score 197.1
Already well trained


## <a name="play"></a> 3. Test the trained agent ( 50 points )

Now, we test your agent and calculate the average reward over 20 episodes.
- 0 <= average reward < 50 : you can get 0 points
- 50 <= average reward < 100 : you can get 10 points
- 100 <= average reward < 190 : you can get 35 points
- 190 <= average reward <= 200 : you can get 50 points
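
The grading bands above can be written as a small lookup, which is handy for checking your own result before submitting. `points_for` is an illustrative helper, not part of the grading script.

```python
def points_for(average_reward):
    """Map the average reward over the test episodes to points, per the bands above."""
    if average_reward >= 190:
        return 50
    if average_reward >= 100:
        return 35
    if average_reward >= 50:
        return 10
    return 0


for avg in (42.0, 75.5, 189.9, 200.0):
    print(avg, points_for(avg))
```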

In [9]:
#config = tf.ConfigProto()
# If you use a GPU, uncomment
#os.environ["CUDA_VISIBLE_DEVICES"] = '0'
#config.log_device_placement = False
# config.gpu_options.allow_growth = True
#with tf.Session(config=config) as sess:
#args = parser.parse_args() # You set the option of test phase
myAgent = Agent(args, 'test') # It depends on your class implementation
myAgent.load()
rewards = []
for i in range(20):
    r = myAgent.play() # play() returns the reward cumulated in one episode
    rewards.append(r)
mean = np.mean(rewards)
print(rewards)
print(mean)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_9 (Dense)              (None, 25)                125       
_________________________________________________________________
dense_10 (Dense)             (None, 25)                650       
_________________________________________________________________
dense_11 (Dense)             (None, 25)                650       
_________________________________________________________________
dense_12 (Dense)             (None, 2)                 52        
Total params: 1,477
Trainable params: 1,477
Non-trainable params: 0
_________________________________________________________________
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_13 (Dense)             (None, 25)                125       
_________________________________________________________________
dense_14