## DDDQN (Double Dueling Deep Q-Learning with Prioritized Experience Replay): Space Invaders, by Thomas
https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Dueling%20Double%20DQN%20with%20PER%20and%20fixed-q%20targets/Dueling%20Deep%20Q%20Learning%20with%20Doom%20(%2B%20double%20DQNs%20and%20Prioritized%20Experience%20Replay).ipynb

In [1]:
# ! pip install gym-retro
# ! pip install -U scikit-image
import retro
# !python3 -m retro.import ./ROMS

In [2]:
import numpy as np
from collections import deque
import random
from utils import print_var # User-defined lib. Look at utils.py

## Build Environment

In [4]:
env = retro.make(game='SpaceInvaders-Atari2600')

print("The size of our frame is: ", env.observation_space)
print("The action size is: ", env.action_space.n)
# Here we create a one-hot encoded version of our actions
# possible_actions = [[1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0], ...]
possible_actions = np.array(np.identity(env.action_space.n, dtype=int).tolist())
possible_actions


The size of our frame is:  Box(210, 160, 3)
The action size is:  8


array([[1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 1]])

## Hyperparameters

In [5]:
### MODEL HYPERPARAMETERS

### PREPROCESSING HYPERPARAMETERS
stack_size = 4 # Number of frames stacked

state_size = [110, 84, 4] # Our input is a stack of 4 frames: 110 x 84 x 4 (height, width, channels)
action_size = env.action_space.n # 8 possible actions
learning_rate = 0.00025

### TRAINING HYPERPARAMETERS
total_episodes = 1 # Total episodes for training
# total_episodes = 15 # Total episodes for training
# max_steps =  50000 # Max possible steps in an episode
max_steps = 10 # Max possible steps in an episode

# batch_size = 64
batch_size = 64

### MEMORY HYPERPARAMETERS
## If you have GPU, change to 1 million
# pretrain_length = batch_size # Number of experience stored in the Memory when initialized for the first time
pretrain_length = 1000

memory_size = 1000000 # Number of experiences the Memory can keep

# Fixed Q targets hyperparameters
max_tau = 10000 #Tau is the C step where we update our target network

# Exploration parameters for epsilon greedy strategy
explore_start = 1.0 # exploration prob. at start
explore_stop = 0.01 # minimum exploration prob
decay_rate = 0.00001 # exponential decay rate for exploration prob

# Q learning hyperparameters
gamma = 0.9 # Discounting rate

### MODIFY THIS TO FALSE IF YOU JUST WANT TO SEE THE TRAINED AGENT
# training = False
training = True

## TURN THIS TO TRUE IF YOU WANT TO RENDER THE ENVIRONMENT
episode_render = False

## Image Preprocessing

In [6]:
# Initialize the deque with zero images, one array per stacked frame
# (np.int was removed in NumPy 1.24, so the builtin int is used as the dtype)
stacked_frames = deque([np.zeros((110,84), dtype=int) for i in range(stack_size)], maxlen=stack_size)
from utils import preprocess_frame
from utils import stack_frames 
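
`preprocess_frame` and `stack_frames` come from `utils.py`, which is not shown here. The block below is only a rough sketch of what such helpers might do, under stated assumptions: grayscale conversion by averaging the colour channels, a nearest-neighbour resize to 110 x 84 done with plain NumPy indexing (the real helper may well use scikit-image instead), and a deque holding the last 4 frames. The `_sketch` names are hypothetical.

```python
from collections import deque

import numpy as np


def preprocess_frame_sketch(frame, out_shape=(110, 84)):
    """Grayscale, scale to [0, 1], and resize by nearest-neighbour sampling."""
    gray = frame.mean(axis=2) / 255.0
    # Pick evenly spaced source rows/columns (assumption: no interpolation)
    rows = np.linspace(0, gray.shape[0] - 1, out_shape[0]).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, out_shape[1]).astype(int)
    return gray[np.ix_(rows, cols)]


def stack_frames_sketch(stacked_frames, frame, is_new_episode, stack_size=4):
    """Return the stacked state (110, 84, stack_size) and the updated deque."""
    processed = preprocess_frame_sketch(frame)
    if is_new_episode:
        # A new episode has no history yet, so repeat the first frame
        stacked_frames = deque([processed] * stack_size, maxlen=stack_size)
    else:
        # The deque drops the oldest frame automatically
        stacked_frames.append(processed)
    stacked_state = np.stack(list(stacked_frames), axis=2)
    return stacked_state, stacked_frames
```

Stacking several consecutive frames is what lets the network infer motion, which a single still frame cannot convey.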



## Prioritized Experience Replay

Prioritized replay is built on a SumTree: a binary tree in which each parent node is the sum of its child nodes.<br>
To summarize: <br>
<li>Step 1: We construct a SumTree, a binary sum tree whose leaves contain the priorities, plus a data array whose indices map to the leaves<br>
    def init: initialize the SumTree with all nodes = 0 and the data array with all entries = 0<br>
    def add: add a priority score to a SumTree leaf and the experience (S, A, R, S', Done) to data <br>
    def update: update a leaf's priority score and propagate the change up through the tree <br>
    def get_leaf: retrieve the priority score, index, and experience associated with a leaf <br>
    def total_priority: return the root node value, which is the total priority score of our replay buffer <br>
<li>Step 2: We create a Memory object that contains our SumTree and data<br>
    def init: generate the SumTree and data by instantiating the SumTree object<br>
    def store: store a new experience in the tree. Each new experience gets priority = max_priority<br>
     (this priority is corrected during training, when we calculate the TD error and hence the priority score) <br>
    def sample: <br>
     First, to sample a minibatch of size k, the range [0, priority_total] is divided into k ranges. <br>
     Then a value is uniformly sampled from each range. <br>
     We search the SumTree and retrieve the experiences whose priority scores correspond to the sampled values. <br>
     Then we calculate the importance-sampling (IS) weights for each minibatch element. <br>
    def batch_update: update the priorities in the tree

Here we no longer use a deque for the replay memory; the SumTree-backed Memory takes its place.
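
The `Memory` and `SumTree` classes live in `utils.py`, which is not shown here. The block below is a minimal sketch of the data structure described above, not the author's implementation: the class name `SumTreeSketch`, the function `sample_sketch`, the `n_stored` argument, and `beta=0.4` are all assumptions made for illustration. The IS weights are normalized by the largest weight in the batch, which is one common variant.

```python
import random


class SumTreeSketch:
    """Binary sum tree in a flat list: leaves hold priorities, parents hold sums."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # internal nodes first, then leaves
        self.data = [None] * capacity           # experiences, one per leaf
        self.write = 0                          # next data slot to (over)write

    def add(self, priority, experience):
        tree_index = self.write + self.capacity - 1
        self.data[self.write] = experience
        self.update(tree_index, priority)
        self.write = (self.write + 1) % self.capacity  # wrap around when full

    def update(self, tree_index, priority):
        change = priority - self.tree[tree_index]
        self.tree[tree_index] = priority
        while tree_index != 0:                  # propagate the change to the root
            tree_index = (tree_index - 1) // 2
            self.tree[tree_index] += change

    def get_leaf(self, value):
        parent = 0
        while 2 * parent + 1 < len(self.tree):  # descend until we reach a leaf
            left, right = 2 * parent + 1, 2 * parent + 2
            if value <= self.tree[left]:
                parent = left
            else:
                value -= self.tree[left]
                parent = right
        return parent, self.tree[parent], self.data[parent - self.capacity + 1]

    @property
    def total_priority(self):
        return self.tree[0]


def sample_sketch(tree, k, n_stored, beta=0.4):
    """Stratified sampling: one draw from each of k equal priority segments."""
    segment = tree.total_priority / k
    leaves, experiences, probs = [], [], []
    for i in range(k):
        value = random.uniform(segment * i, segment * (i + 1))
        leaf, priority, experience = tree.get_leaf(value)
        leaves.append(leaf)
        experiences.append(experience)
        probs.append(priority / tree.total_priority)
    # IS weight (N * P(i))^-beta, scaled so the largest weight in the batch is 1
    weights = [(n_stored * p) ** (-beta) for p in probs]
    max_weight = max(weights)
    return leaves, experiences, [w / max_weight for w in weights]
```

Both `update` and `get_leaf` walk a single root-to-leaf path, so each costs O(log n) rather than the O(n) a sorted list would need.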

## Initialize Memory
Here we'll deal with the empty memory problem: we pre-populate our memory by taking random actions and storing the experience

In [7]:
from utils import Memory


# Instantiate memory
# SumTree = utils.SumTree(memory_size)
# memory = utils.Memory(memory_size)
memory = Memory(memory_size)

# Render the environment
# game.new_episodes()

for i in range(pretrain_length):
    # If it's the first step
    if i == 0:
        # First we need a state
        state = env.reset()
        state, stacked_frames = stack_frames(stacked_frames, state, True)
        
    # Random action
    # action = random.choice(possible_actions)
    # Get the next_state, the rewards, done by taking a random action
    choice = random.randint(1, len(possible_actions)) -1
    # print_var("choice", choice)
    action = possible_actions[choice]
    
    # Get the rewards
    next_state, reward, done, _ = env.step(action)
    
    # Stack the frames
    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)
    
    # If the episode is finished (We're dead 3x)
    if done:
        # We finished the episode
        next_state = np.zeros(state.shape)
        
        # Add experience to memory
        experience = state, action, reward, next_state, done
        memory.store(experience)
        
        # Start a new episode
        state = env.reset()
        # Stack the frames
        state, stacked_frames = stack_frames(stacked_frames, state, True)
    else:
        # Add experience to memory
        experience = state, action, reward, next_state, done
        memory.store(experience)
        
        # Our state is now the next_state
        state = next_state
        

    

## Define Double Dueling Deep Q-learning Neural Network model

In [8]:
import DDDQNNet 

# Instantiate the DQNetwork
DQNetwork = DDDQNNet.DDDQNNetwork(state_size, action_size, learning_rate, name="DQNetwork")
tf = DQNetwork.build()

# Instantiate the target network
TargetNetwork = DDDQNNet.DDDQNNetwork(state_size, action_size, learning_rate, name="TargetNetwork")
tf2 = TargetNetwork.build()
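
`DDDQNNet` is a user-defined module that is not shown here. In a dueling architecture the network splits into a state-value stream V(s) and an advantage stream A(s, a), which are recombined as Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a')). The NumPy sketch below shows only that aggregation step; the function name `dueling_q_values` is hypothetical and the real network presumably does this inside its graph.

```python
import numpy as np


def dueling_q_values(value, advantage):
    """Combine the two dueling streams into Q-values.

    value:     array of shape (batch, 1), the state value V(s)
    advantage: array of shape (batch, n_actions), the advantages A(s, a)
    """
    # Subtracting the mean advantage keeps V and A identifiable
    return value + (advantage - advantage.mean(axis=1, keepdims=True))


value = np.array([[1.0]])
advantage = np.array([[1.0, 2.0, 3.0]])
print(dueling_q_values(value, advantage))
```

With this aggregation the mean of the Q-values over actions equals V(s), so the value stream carries the state's overall worth and the advantage stream only ranks the actions.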




## Step 8: Train our Agent 🏃‍♂️

Our algorithm:
<br>
* Initialize the weights for DQN
* Initialize target value weights $w^- \leftarrow w$
* Init the environment
* Initialize the decay step (used to reduce epsilon) 
<br><br>
* **For** episode to max_episode **do** 
    * Make new episode
    * Set step to 0
    * Observe the first state $s_0$
    <br><br>
    * **While** step < max_steps **do**:
        * Increase decay_step
        * With $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
        * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
        * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
        
        * Sample random mini-batch from $D$: $<s, a, r, s'>$
        * Set target $\hat{Q} = r$ if the episode ends at $s'$, otherwise set $\hat{Q} = r + \gamma Q(s', \mathrm{argmax}_{a'} Q(s', a', w), w^-)$
        * Make a gradient descent step with loss $(\hat{Q} - Q(s, a))^2$
        * Every C steps, reset: $w^- \leftarrow w$
    * **endwhile**
    <br><br>
* **endfor**
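
The target rule in the list above can be written in vectorized form. This is a sketch assuming NumPy arrays for the minibatch; the function name `double_dqn_targets` and its argument names are hypothetical. The online network (weights $w$) picks the action $a'$ and the target network (weights $w^-$) evaluates it.

```python
import numpy as np


def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.9):
    """Double DQN targets for a minibatch.

    rewards:       (batch,) rewards r
    dones:         (batch,) booleans, True if s' is terminal
    q_online_next: (batch, n_actions) Q(s', ., w) from the online network
    q_target_next: (batch, n_actions) Q(s', ., w^-) from the target network
    """
    # Action selection uses the online network
    best_actions = np.argmax(q_online_next, axis=1)
    # Action evaluation uses the target network
    next_values = q_target_next[np.arange(len(rewards)), best_actions]
    # Terminal transitions keep only the reward
    return rewards + gamma * next_values * (1.0 - dones.astype(np.float64))


targets = double_dqn_targets(
    rewards=np.array([1.0, 2.0]),
    dones=np.array([False, True]),
    q_online_next=np.array([[0.1, 0.9], [0.5, 0.2]]),
    q_target_next=np.array([[10.0, 20.0], [30.0, 40.0]]),
)
print(targets)  # first entry: 1 + 0.9 * 20 = 19.0, second entry (terminal): 2.0
```

Separating selection from evaluation is what reduces the overestimation bias of plain DQN, where the same network both picks and scores the maximizing action.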

    

In [9]:
"""
This function implements the step:
With probability ϵ select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a)
"""
def predict_action(explore_start, explore_stop, decay_rate, decay_step, state, actions):
    ## EPSILON GREEDY STRATEGY
    # Choose action a from state s using epsilon greedy.
    ## First we randomize a number
    exp_exp_tradeoff = np.random.rand()

    # Here we'll use an improved version of our epsilon greedy strategy used in Q-learning notebook
    explore_probability = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)
    
    if (explore_probability > exp_exp_tradeoff):
        # Make a random action (exploration)
        choice = random.randint(1,len(possible_actions))-1
        action = possible_actions[choice]
        
    else:
        # Get action from Q-network (exploitation)
        # Estimate the Qs values state
        Qs = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: state.reshape((1, *state.shape))})
        
        # Take the biggest Q value (= the best action)
        choice = np.argmax(Qs)
        action = possible_actions[choice]
                
                
    return action, explore_probability
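
To get a feel for how quickly exploration fades, the schedule used in `predict_action` can be evaluated on its own. This is a small sketch using the hyperparameter values from this notebook as defaults; the helper name `explore_probability_at` is hypothetical.

```python
import math


def explore_probability_at(decay_step, explore_start=1.0, explore_stop=0.01,
                           decay_rate=0.00001):
    """Exponentially decayed epsilon, same formula as in predict_action."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * decay_step)


for step in (0, 100_000, 1_000_000):
    print(step, round(explore_probability_at(step), 4))
```

At step 0 the probability is exactly `explore_start`, and it approaches `explore_stop` from above as `decay_step` grows, so the agent never stops exploring entirely.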

In [10]:
# This function helps us to copy one set of variables to another
# In our case we use it when we want to copy the parameters of DQN to Target_network

def update_target_graph():
    # Get the parameters of our DQNNetwork
    from_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "DQNetwork")
    
    # Get the parameters of our Target_network
    to_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "TargetNetwork")
    op_holder = []
    
    # Update our target_network parameters with DQNNetwork parameters
    for from_var, to_var in zip(from_vars, to_vars):
        op_holder.append(to_var.assign(from_var))
    return op_holder

In [11]:
# %%time

# Saver will help us to save our model
saver = tf.train.Saver()

rewards_list= list()



if training == True:
    with tf.Session() as sess:
        # Initialize the variables
        sess.run(tf.global_variables_initializer())

        # Initialize the decay step (used to reduce epsilon)
        decay_step = 0
        
        # Set tau = 0
        tau = 0
        
        # Update the parameters of our TargetNetwork with DQN_weights
        update_target = update_target_graph()
        sess.run(update_target)
        
        for episode in range(total_episodes):
            # Set step to 0
            step = 0
            
            # Initialize the rewards of the episode
            episode_rewards = []
            
            # Make a new episode and observe the first state
            state = env.reset()
            
            # Remember that stack frame function also call our preprocess function.
            state, stacked_frames = stack_frames(stacked_frames, state, True)
            
            while step < max_steps:
                step += 1
                
                # Increase the C step (tau), otherwise the target network is never refreshed
                tau += 1
                
                # Increase decay_step
                decay_step += 1
                
                # Predict the action to take and take it
                action, explore_probability = predict_action(explore_start, explore_stop, decay_rate, decay_step, state, possible_actions)
                
                #Perform the action and get the next_state, reward, and done information
                next_state, reward, done, _ = env.step(action)
                
                if episode_render:
                    env.render()
                
                # Add the reward to total reward
                episode_rewards.append(reward)
                
                # If the game is finished
                if done:
                    # The episode ends so no next state
                    next_state = np.zeros((110,84), dtype=int)
                    
                    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)

                    # Set step = max_steps to end the episode
                    step = max_steps

                    # Get the total reward of the episode
                    total_reward = np.sum(episode_rewards)

                    print('Episode: {}'.format(episode),
                                  'Total reward: {}'.format(total_reward),
                                  'Explore P: {:.4f}'.format(explore_probability),
                                'Training Loss {:.4f}'.format(loss))

                    rewards_list.append((episode, total_reward))
                    
                    # Add experience to memory
                    experience = state, action, reward, next_state, done
                    memory.store(experience)


                else:
                    # Stack the frame of the next_state
                    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)
                
                    # Add experience to memory
                    experience = state, action, reward, next_state, done
                    memory.store(experience)


                    # st+1 is now our current state
                    state = next_state
                    

                ### LEARNING PART            
                # Obtain random mini-batch from memory
                tree_idx, batch, ISWeights_mb = memory.sample(batch_size)
                # batch = memory.sample(batch_size)
                

                states_mb = np.array([each[0][0] for each in batch], ndmin=3)
                # print_var("states_mb", states_mb.shape)                
                actions_mb = np.array([each[0][1] for each in batch])
                # print_var("actions_mb", actions_mb.shape)    
                # print_var("actions_mb", actions_mb)                    
                
                
                rewards_mb = np.array([each[0][2] for each in batch]) 
                # print_var("rewards_mb", rewards_mb.shape)   
                print_var("rewards_mb", rewards_mb)                                                                
                next_states_mb = np.array([each[0][3] for each in batch], ndmin=3)
                # print_var("next_states_mb", next_states_mb.shape)                                                                
                dones_mb = np.array([each[0][4] for each in batch])
                # print_var("dones_mb", dones_mb.shape)  
                print("\n")

                target_Qs_batch = []
                
                ### Double DQN Logic
                # Use DQNNetwork to select the action to take at next_state (a') 
                # (action with the highest Q-value)
                # Use TargetNetwork to calculate the Q_val of Q(s',a')
                
                # Get Q values for next_state                
                q_next_state = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: next_states_mb})

                
                # Calculate Qtarget for all actions at that next state
                q_target_next_state = sess.run(TargetNetwork.output, feed_dict = {TargetNetwork.inputs_: next_states_mb})
                
                # Set Q_target = r if the episode ends at s+1, otherwise set Q_target = r + gamma * Qtarget(s',a')
                for i in range(0, len(batch)):
                    terminal = dones_mb[i]
                    
                    # We got a'
                    action = np.argmax(q_next_state[i])
                    
                    # If we are in a terminal state, the target equals the reward only
                    if terminal:
                        target_Qs_batch.append(rewards_mb[i])
                    else:
                        # Take the Qtarget for action a'
                        target = rewards_mb[i] + gamma * q_target_next_state[i][action]
                        target_Qs_batch.append(target)
                targets_mb = np.array(target_Qs_batch)
                _, loss, absolute_errors = sess.run([DQNetwork.optimizer, DQNetwork.loss, DQNetwork.absolute_errors],
                                                    feed_dict = {DQNetwork.inputs_: states_mb,
                                                                 DQNetwork.target_Q: targets_mb,
                                                                 DQNetwork.actions_: actions_mb,
                                                                 DQNetwork.ISWeights_: ISWeights_mb
                                                                }
                                                   )
                
                # Update priorities
                memory.batch_update(tree_idx, absolute_errors)
                
                if tau > max_tau:
                    # Update the parameters of our TargetNetwork with DQN_weights
                    update_target = update_target_graph()
                    sess.run(update_target)
                    tau = 0
                    print("Model updated")
                    
            # Save model every 5 episodes
            if episode % 5 == 0:
                save_path = saver.save(sess, "./train_models/model.ckpt")
                print("Model Saved")

  max_weight = (p_min * n) ** (-self.PER_b)


rewards_mb :  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 20.  0.  0.  0.  0.  0.  0.  0.  0.  0.]


rewards_mb :  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0. 25.  0.  0.  0.  0.  0.  0.  0.]


rewards_mb :  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


rewards_mb :  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


rewards_mb :  [0. 0. 0