# M2177.003100 Deep Learning <br>Assignment #5 Part 1: Implementing and Training a Deep Q-Network

Copyright (C) Data Science Laboratory, Seoul National University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them. Written by Hyungyu Lee, November 2019

In this notebook, you will implement one of famous reinforcement learning algorithm, Deep Q-Network (DQN) of DeepMind. <br>
The goal here is to understand a basic form of DQN [1, 2] and learn how to use OpenAI Gym toolkit [3].<br>
You need to follow the instructions to implement the given classes.

1. [Play](#play) ( 50 points )

**Note**: certain details are missing or ambiguous on purpose, in order to test your knowledge on the related materials. However, if you really feel that something essential is missing and cannot proceed to the next step, then contact the teaching staff with clear description of your problem.

### Submitting your work:
<font color=red>**DO NOT clear the final outputs**</font> so that TAs can grade both your code and results.  
Once you have done **two parts of the assignment**, run the *CollectSubmission.sh* script with your **Team number** as input argument. <br>
This will produce a zipped file called *[Your team number].tar.gz*. Please submit this file on ETL. &nbsp;&nbsp; (Usage: ./*CollectSubmission.sh* &nbsp; Team_#)

### Some helpful references for assignment #4 :
- [1] Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013). [[pdf]](https://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&cad=rja&uact=8&ved=0ahUKEwiI3aqPjavVAhXBkJQKHZsIDpgQFgg7MAI&url=https%3A%2F%2Fwww.cs.toronto.edu%2F~vmnih%2Fdocs%2Fdqn.pdf&usg=AFQjCNEd1AJoM72DeDpI_GBoPuv7NnVoFA)
- [2] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533. [[pdf]](https://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf)
- [3] OpenAI GYM website [[link]](https://gym.openai.com/envs) and [[git]](https://github.com/openai/gym)

## 0. OpenAI Gym

OpenAI Gym is a toolkit to support diverse environments for developing reinforcement learning algorithms. You can use the toolkit with Python as well as TensorFlow. Installation guide of OpenAI Gym is offered by [this link](https://github.com/openai/gym#installation) or just type the command "pip install gym" (as well as "pip install gym[atari]" for Part2). 

After you set up OpenAI Gym, you can use APIs of the toolkit by inserting <font color=red>import gym</font> into your code. In this assignment, you must build one of famous reinforcement learning algorithms whose agent can run on OpenAI Gym environments. Please check how to use APIs such as funcions interacting with environments in the followings.

In [13]:
import tensorflow as tf
import cv2 
import gym
import numpy as np
import os
import argparse
import sys
import matplotlib
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

#os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
#os.environ["CUDA_VISIBLE_DEVICES"]="0"

In [14]:
# Make an environment instance of CartPole-v0.
env = gym.make('CartPole-v0')

# Before interacting with the environment and starting a new episode, you must reset the environment's state.
state = env.reset()

#rendering game screens, do not need for assignment evaluation
# env.render() 

# You can check action space and state (observation) space.
num_actions = env.action_space.n
state_shape = env.observation_space.shape
print(num_actions)
print(state_shape)

# "step" function performs agent's actions given current state of the environment and returns several values.
# Input: action (numerical data)
#        - env.action_space.sample(): select a random action among possible actions.
# Output: next_state (numerical data, next state of the environment after performing given action)
#         reward (numerical data, reward of given action given current state)
#         terminal (boolean data, True means the agent is done in the environment)
next_state, reward, terminal, info = env.step(env.action_space.sample())

[2019-12-03 14:50:26,694] Making new env: CartPole-v0


2
(4,)


## 1. Implement a DQN agent
## 1) Overview of implementation in the notebook

The assignment is based on a method named by Deep Q-Network (DQN) [1,2]. You could find the details of DQN in the papers. The followings show briefly architecture of DQN and its training computation flow.

- (Pink flow) Play an episode and save transition records of the episode into a replay memory.
- (Green flow) Train DQN so that a loss function in the figure is minimized. The loss function is computed using main Q-network and Target Q-network. Target Q-network needs to be periodically updated by copying the main Q-network.
- (Purple flow) Gradient can be autonomously computed by tensorflow engine, if you build a proper optimizer.

![](image/architecture.png)

There are major 4 components, each of which needs to be implemented in this notebook. The Agent class must have an instance(s) of each class (Environment, DQN, ReplayMemory).
- Environment
- DQN 
- ReplayMemory
- Agent

![](image/components.png)



## 2) Design classes

In the code cells, there are only names of functions which are used in TA's implementation and their brief explanations. <font color='green'>...</font> means that the functions need more arguments and <font color='green'>pass</font> means that you need to write more codes. The functions may be helpful when you do not know how to start the assignment. Of course, you could change the functions such as deleting/adding functions or extending/reducing roles of the classes, <font color='red'> just keeping the existence of the classes</font>.

### Environment class

In [15]:
#https://github.com/pikinder/DQN
#https://github.com/TangLaoDA/DQN_FOR_CartPole-v0/blob/master/game_CartPole_train.py

tf.reset_default_graph()
EXPERIENCE_REPLAY_BATCH = 64
EXPERIENCE_BUFFER_SIZE = 100000
START_EPSILON = 0.99
END_EPSILON = 0.1
EPSILON_STEP_LIMIT = 100
LEARNING_RATE = 0.0025
DISCOUNT_FACTORE = 0.99
ALPHA = 1
class Environment(object):
    def __init__(self,env):
        self.env = env
        
    def step(self,action):
        next_state, reward, done, info = self.env.step(action)
        return next_state, reward, done, info

    def reset(self):
        return self.env.reset()

    def render(self):
        self.env.render()

### ReplayMemory class

In [16]:
from collections import deque
import random

class ReplayMemory(object):
    def __init__(self):
        self.experience = []
        self.visited = {}

    def remember(self, state, next_state, action, reward, is_done):
        state = np.array(state, dtype=np.float64)
        next_state = np.array(next_state, dtype=np.float64)
        experience = (state, next_state, action, reward, is_done)
        if len(self.experience) > EXPERIENCE_BUFFER_SIZE:
            self.experience = self.experience[1:]

        self.experience.append(experience)

    def recall(self):
        experience_size = len(self.experience)
        _EXPERIENCE_REPLAY_BATCH = EXPERIENCE_REPLAY_BATCH
        if experience_size < EXPERIENCE_REPLAY_BATCH:
            _EXPERIENCE_REPLAY_BATCH = experience_size

        indexes = np.random.randint(
            experience_size, size=_EXPERIENCE_REPLAY_BATCH)
        experiences = []
        for index in indexes:
            experiences.append(self.experience[index])

        return experiences
    '''
    def __init__(self):
        self.replay_buffer = deque()
    
    def add(self, state, action, reward, next_action, done):            
        one_hot_action = np.zeros(self.action_dim)
        one_hot_action[action] = 1
        self.replay_buffer.append((state,one_hot_action,reward,next_state,done))
        if len(self.replay_buffer) > REPLAY_SIZE:
            self.replay_buffer.popleft()
        if len(self.replay_buffer) > BATCH_SIZE:
            self.train_Q_network()
        # Add current_state, action, reward, terminal to replay_memory (next_state which can be added by your choice). 
        pass
    
    def mini_batch(self, batch_size):
        return random.sample(self.replay_buffer,BATCH_SIZE)
    def size(self):
        return self.buffer_size
    def count(self):
    # if buffer is full, return buffer size
    # otherwise, return experience counter
        return self.num_experiences

    def erase(self):
        self.buffer = deque()
        self.num_experiences = 0
        '''
    

### DQN class

In [17]:


class DQN(object):
    def __init__(self, input_size, output_size):
        self.count = 0

        x = tf.placeholder(tf.float64, [None, input_size], name='x')
        y = tf.placeholder(tf.float64, [None, output_size], name='y')

        w1 = tf.Variable(tf.random_normal(
            [input_size, 64], dtype=tf.float64), name='w1')
        b1 = tf.Variable(tf.random_normal([64], dtype=tf.float64), name='b1')
        tf.summary.histogram('w1', w1)

        w2 = tf.Variable(tf.random_normal(
            [64, output_size], dtype=tf.float64), name='w2')
        b2 = tf.Variable(tf.random_normal(
            [output_size], dtype=tf.float64), name='b2')
        tf.summary.histogram('w2', w2)

        h1 = tf.add(tf.matmul(x, w1), b1, name='h1')
        relu_h1 = tf.nn.tanh(h1, name='relu_h1')
        tf.summary.histogram('relu_h1', relu_h1)

        self.model = tf.add(tf.matmul(relu_h1, w2), b2, name='model')
        tf.summary.histogram('model', self.model)

        self.error = tf.reduce_mean(tf.square(self.model - y), name='error')
        tf.summary.scalar('error', self.error)

        self.optimzer = tf.train.RMSPropOptimizer(
            LEARNING_RATE, name='Optimizer').minimize(self.error)
        self.step = 0
        self.sess = tf.Session()
        self.merged = tf.summary.merge_all()
        self.summary_writter = tf.summary.FileWriter(
            "/tmp/cart_pole", self.sess.graph)

        self.init = tf.global_variables_initializer()
        self.sess.run(self.init)
    def get_action(self, state):
        state = np.array(state, dtype=np.float64)
        output = self.sess.run([self.model], feed_dict={
            'x:0': state
        })
        return output[0][0]

    def train(self, states, actions):
        states = np.array(states, dtype=np.float64)
        actions = np.array(actions, dtype=np.float64)
        summary, _, error = self.sess.run([self.merged, self.optimzer, self.error], feed_dict={
            'x:0': states,
            'y:0': actions
        })
        self.summary_writter.add_summary(summary, self.count)
        self.count += 1
        # print error
        # sess.close()
        return error

    def get_action_multiple(self, states):
        states = np.array(states, dtype=np.float64)
        output = self.sess.run([self.model], feed_dict={
            'x:0': states
        })
        # sess.close()
        return output[0]

    def close(self):
        self.sess.close()
        
        '''
        # init some parameters
        self.time_step = 0
        self.epsilon = INITIAL_EPSILON
        self.state_shape = env.observation_space.shape[0]
        self.num_actions = env.action_space.n 
        self.prediction_Q = self.build_network('pred')
        self.target_Q = self.build_network('target')
        '''
        '''
    def build_network(self,name):
        with tf.name_scope(name):
        # Make your a deep neural network
            W1 = self.weight_variable([self.state_dim,20])
            b1 = self.bias_variable([20])
            W2 = self.weight_variable([20,self.action_dim])
            b2 = self.bias_variable([self.action_dim])
            # input layer
            self.state_input = tf.placeholder("float",[None,self.state_dim])
            # hidden layers
            h_layer = tf.nn.relu(tf.matmul(self.state_input,W1) + b1)
            # Q Value layer
            self.Q_value = tf.matmul(h_layer,W2) + b2
            
            # update target network with Q network
            copy_op = []
            pred_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='pred')
            target_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='target')
            for pred_var, target_var in zip(pred_vars, target_vars):
                copy_op.append(target_var.assign(pred_var.value()))
                
    def build_optimizer(self):
        # Make your optimizer 
        self.action_input = tf.placeholder("float",[None,self.action_dim]) # one hot presentation
        self.y_input = tf.placeholder("float",[None])
        Q_action = tf.reduce_sum(tf.multiply(self.Q_value,self.action_input),reduction_indices = 1)
        self.cost = tf.reduce_mean(tf.square(self.y_input - Q_action))
        tf.summary.scalar("loss",self.cost)
        global merged_summary_op
        merged_summary_op = tf.summary.merge_all()
        self.optimizer = tf.train.AdamOptimizer(0.0001).minimize(self.cost)
        
    def train_network(self):
        # Train the prediction_Q network using a mini-batch sampled from the replay memory
        pass
    
    def update_target_network(self, ...):
        self.sess.run(copy_op)
    
    def predict_Q(self, ...):
        pass
        '''

### Agent class

In [21]:
class Agent(object):
    def __init__(self, epsilons,total_episodes):
        #self.saver = tf.train.Saver()
        env = gym.make("CartPole-v0")
        self.env = Environment(env)
        self.total_episodes = total_episodes
        self.epsilons = epsilons
        self.epsilons_index = 0
        self.epsilon = START_EPSILON
        self.total_actions = 0
        self.total_greedy_actions = 0
        self.model = DQN(4, 2)
        self.memory = ReplayMemory()
        self.step = 0
        self.avg = []
        
       
    @staticmethod
    def is_greddy(epsilon):
        return np.random.choice([0, 1], 1, p=[epsilon, 1 - epsilon])[0]

    def update_epsilon(self):
        index = int(self.total_actions / EPSILON_STEP_LIMIT)
        if index > len(self.epsilons - 1):
            index = len(self.epsilons) - 1

        self.epsilons_index = index

    def take_action(self, state):
        """
        actions are whether you want to go right or left
        """
        self.total_actions += 1
        q_values = self.model.get_action(state.reshape(1, 4))
        is_greedy = Agent.is_greddy(self.epsilon)
        msg = ''
        if is_greedy:
            action = np.argmax(q_values)
        else:
            action = np.random.choice([0, 1], 1)[0]
            msg = 'explorer'

        self.epsilon = END_EPSILON + \
            (START_EPSILON - END_EPSILON) * \
            math.exp(-0.001 * self.total_actions)
        return action

    def observe_results(self, state, next_state, action, reward, is_done):
        """
        after taking action environment return result of it, store (state, action, reward, is_done) in memory 
        for experience replay
        """
        self.memory.remember(state, next_state, action, reward, is_done)
        self.update()

    def close(self):
        return self.model.close()

    def update(self):
        experiences = self.memory.recall()
        current_states = None
        next_states = None
        for experience in experiences:
            current_state, next_state, action, reward, is_done = experience
            current_state = np.array(current_state).reshape(1, 4)
            next_state = np.array(next_state).reshape(1, 4)
            if current_states is None:
                current_states = current_state
                next_states = next_state
            else:
                current_states = np.vstack((current_states, current_state))
                next_states = np.vstack((next_states, next_state))

        current_state_q_values = self.model.get_action_multiple(current_states)
        next_state_q_values = self.model.get_action_multiple(next_states)

        x = None
        y = None
        for i in range(len(experiences)):
            current_state, next_state, action, reward, is_done = experiences[i]
            current_state_q_value = np.array(
                current_state_q_values[i], dtype=np.float64)
            next_state_q_value = np.array(
                next_state_q_values[i], dtype=np.float64)
            if is_done:
                reward = -10
                next_state_q_value = [0.0, 0.0]

            current_state_q_value[action] = ALPHA * \
                (reward + DISCOUNT_FACTORE * np.amax(next_state_q_value))

            current_state = np.array(current_state).reshape(1, 4)
            current_state_q_value = np.array(
                current_state_q_value).reshape(1, 2)

            if x is None:
                x = current_state
                y = current_state_q_value
            else:
                x = np.vstack((x, current_state))
                y = np.vstack((y, current_state_q_value))

        self.model.train(x, y)
 #============================================
    def add_rewards(self, total_rewards):
        self.avg.append(total_rewards)
        l = len(self.avg)
        if l < 100:
            return False

        _avg = float(sum(self.avg[l - 100: l])) / max(len(self.avg[l - 100: l]), 1)
        print ('avg rewards: %s' % str(_avg))
        if _avg > 195:
            return True

        return False

    def run(self):
        episodes = 0
        #self.gym.wrappers.Monitor('results/cartpole', force=True)
        while episodes < self.total_episodes:
            print ('running episode: %s' % str(episodes + 1))
            state = self.env.reset()
            is_done = False
            total_reward = 0
            while not is_done:
                self.env.render()
                action = self.take_action(state)
                next_state, reward, is_done, info = self.env.step(action)
                self.step += 1
                total_reward += reward
                self.observe_results(
                    state, next_state, action, reward, is_done)
                state = next_state

            print ('rewards: %s, step: %s' % (str(total_reward), str(self.step)))
            if self.add_rewards(total_reward):
                print ('done with episods %s and steps: %s' % (str(episodes), str(self.step)))
                #self.env.monitor.close()
                self.agent.close()
                # self._plot()
                return

            episodes += 1

    def _plot(self):
        plt.plot(self.avg)
        plt.ylabel('Rewards')
        plt.xlabel('Episodes')
        plt.savefig('rewards.png')
        plt.show()
        
epsilons = np.linspace(START_EPSILON, END_EPSILON)

env = Agent(epsilons,50)
env.run()
'''
    def select_action(self, ...):
        # Select an action according ε-greedy. You need to use a random-number generating function and add a library if necessary.
        pass
    
    def train(self, ...):
        # Train your agent 
        # Several hyper-parameters are determined by your choice
        # Keep epsilon-greedy action selection in your mind 
        pass
    
    def play(self, ...):
        # Test your agent 
        # When performing test, you can show the environment's screen by rendering if you want
        pass
    
    def save(self):
        checkpoint_dir = 'cartpole'
        if not os.path.exists(checkpoint_dir):
            os.mkdir(checkpoint_dir)
        self.saver.save(self.sess, os.path.join(checkpoint_dir, 'trained_agent'))
        
    def load(self):
        checkpoint_dir = 'cartpole'
        self.saver.restore(self.sess, os.path.join(checkpoint_dir, 'trained_agent'))
        '''

[2019-12-03 14:51:22,182] Making new env: CartPole-v0


running episode: 1
rewards: 14.0, step: 14
running episode: 2
rewards: 19.0, step: 33
running episode: 3
rewards: 23.0, step: 56
running episode: 4
rewards: 17.0, step: 73
running episode: 5
rewards: 12.0, step: 85
running episode: 6
rewards: 11.0, step: 96
running episode: 7
rewards: 17.0, step: 113
running episode: 8
rewards: 54.0, step: 167
running episode: 9
rewards: 18.0, step: 185
running episode: 10
rewards: 12.0, step: 197
running episode: 11
rewards: 36.0, step: 233
running episode: 12
rewards: 31.0, step: 264
running episode: 13
rewards: 13.0, step: 277
running episode: 14
rewards: 21.0, step: 298
running episode: 15
rewards: 16.0, step: 314
running episode: 16
rewards: 19.0, step: 333
running episode: 17
rewards: 17.0, step: 350
running episode: 18
rewards: 35.0, step: 385
running episode: 19
rewards: 32.0, step: 417
running episode: 20
rewards: 33.0, step: 450
running episode: 21
rewards: 10.0, step: 460
running episode: 22
rewards: 43.0, step: 503
running episode: 23
rewar

ArgumentError: argument 2: <class 'TypeError'>: wrong type

## 2. Train your agent 

Now, you train an agent to play CartPole-v0. Options class is the collection of hyper-parameters that you can choice. Usage of Options class is not mandatory.<br>
The maximum value of total reward which can be aquired from one episode is 200. 
<font color='red'>**You should show learning status such as the number of observed states and mean/max/min of rewards frequently (for instance, every 100 states).**</font>

In [22]:
import argparse
import easydict
parser = argparse.ArgumentParser(description="CartPole")
parser.add_argument('--env-name', default='CartPole-v0', type=str,
                    help="Environment")
parser.add_argument('--epsilons', default=0.99, type=float,
                    help='1epsilons')

config = tf.ConfigProto()
#config.gpu_options.allow_growth = True
"""
You can add more arguments.
for example, visualize, memory_size, batch_size, discount_factor, eps_max, eps_min, learning_rate, train_interval, copy_interval and so on
"""
with tf.Session(config=config) as sess:
    #args = parser.parse_args()
    args = easydict.EasyDict({
        "epsilons": 0.99})
    myAgent = Agent(args) # It depends on your class implementation
    myAgent.train()
    myAgent.save()

AttributeError: 'Agent' object has no attribute 'train'

## <a name="play"></a> 3. Test the trained agent ( 50 points )

Now, we test your agent and calculate an average reward of 20 episodes.
- 0 <= average reward < 50 : you can get 0 points
- 50 <= average reward < 100 : you can get 10 points
- 100 <= average reward < 190 : you can get 35 points
- 190 <= average reward <= 200 : you can get 50 points

In [None]:
config = tf.ConfigProto()
# If you use a GPU, uncomment
# os.environ["CUDA_VISIBLE_DEVICES"] = '0'
# config.log_device_placement = False
# config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    args = parser.parse_args() # You set the option of test phase
    myAgent = Agent(args, test) # It depends on your class implementation
    myAgent.load()
    rewards = []
    for i in range(20):
        r = myAgent.play() # play() returns the reward cumulated in one episode
        rewards.append(r)
    mean = np.mean(rewards)
    print(rewards)
    print(mean)