<a href="https://colab.research.google.com/github/imiled/DL_Tools_For_Finance/blob/master/MAIN_DRL_class_2_DQN_solved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Install dependencies

In [None]:
!apt-get update > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg cmake > /dev/null 2>&1
!pip install --upgrade setuptools > /dev/null 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
!pip install gym-super-mario-bros > /dev/null 2>&1
!pip install git+https://github.com/JKCooper2/gym-bandits#egg=gym-bandits > /dev/null 2>&1
!pip install tensorflow-gpu==2.0.0 > /dev/null 2>&1

In [None]:
%matplotlib inline
import gym_bandits
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only
from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

import tensorflow as tf
from collections import deque
import progressbar
import numpy as np
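# Import the submodules explicitly; a bare "import skimage" may not load them
import skimage.color
import skimage.transform
import skimage.util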
import random
import matplotlib
import matplotlib.pyplot as plt
import math
import glob
import io
import base64
from IPython.display import HTML

from IPython import display as ipythondisplay
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

TensorFlow 2.x selected.


xdpyinfo was not found, X start can not be checked! Please install xdpyinfo!


<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1001'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1001'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

### OpenAI Gym

In [None]:
class environment(object):
    def __init__(self, env_name):
        self.name = env_name
        self.env = self.wrap_env(gym.make(self.name))
        
    @staticmethod
    def show_video():
        mp4list = glob.glob('video/*.mp4')
        if len(mp4list) > 0:
            mp4 = mp4list[0]
            video = io.open(mp4, 'rb').read()
            encoded = base64.b64encode(video)
            ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                        loop controls style="height: 400px;">
                        <source src="data:video/mp4;base64,{0}" type="video/mp4" />
                     </video>'''.format(encoded.decode('ascii'))))
        else: 
            print("Could not find video")

    @staticmethod
    def wrap_env(env):
        """
        Utility function that enables video recording of a gym environment
        so it can be displayed later. To enable video, call "env = wrap_env(env)".
        """
        env = Monitor(env, './video', force=True)
        return env

In [None]:
class utils_class(object):
    @staticmethod
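    def preprocess(state):
        # rgb2gray already returns floats in [0, 1], so no extra scaling is needed
        output = skimage.color.rgb2gray(state)
        # Drop 34 rows from the top and 16 from the bottom; keep all columns
        output = skimage.util.crop(output, ((34, 16), (0, 0)))
        output = skimage.transform.resize(output, (84, 84))
        return output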

    def stack_states(self, state, next_state):
        next_state = self.preprocess(next_state)
        next_state = np.append(state[:,:,1:], np.expand_dims(next_state, 2), axis=2)
        return next_state

    def initial_state(self, state):
        state = self.preprocess(state)
        state = np.stack([state] * 4, axis=2)
        state = np.expand_dims(state, axis=0)
        return state


In [None]:
utils = utils_class()

# Deep Q Learning

In this notebook we will learn about the popular Deep Q-Network (DQN) algorithm.
In essence, DQN uses a neural network to approximate the Q function. We will apply a DQN agent to an Atari game environment and learn the basic tricks that make DQN work in practice.

Learning goals:
- Deep Q Network
- Experience Replay
- Target Network
- Double DQN
- Dueling architecture

## Deep Q Network

The Deep Q-Network is a seminal piece of work that makes the training of Q-learning more stable and data-efficient when the Q value is approximated with a nonlinear function $Q(s, a, \theta) \approx Q^*(s,a)$. The Q network can be a multi-layer dense neural network, a convolutional network (CNN), or a recurrent network, depending on the problem.
In this notebook we focus on a CNN, since the problem at hand comes from the Atari games universe.

In particular, we use an architecture consisting of three
hidden convolutional layers, followed by one fully connected hidden layer, followed by the output layer.
The three successive convolutional layers apply 32 filters of size 8×8 (stride 4), 64 filters of size 4×4 (stride 2), and 64 filters of size 3×3 (stride 1). The activation function of the units of each feature map is a
rectifier nonlinearity (ReLU).
As input we pass a stack of the four most recent preprocessed game screens, and the output layer returns the Q values of all possible actions in that state.
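
As a quick sanity check on this architecture, the sketch below computes the spatial size of each convolutional output for an 84×84 input. It assumes `padding='same'`, as used in the model code of this notebook, where the output size is `ceil(input / stride)`. The helper name `same_conv_output_size` is hypothetical and only serves this illustration.

```python
import math

def same_conv_output_size(input_size, stride):
    """Spatial output size of a convolution with 'same' padding."""
    return math.ceil(input_size / stride)

size = 84  # height and width of a preprocessed frame
sizes = []
# (number of filters, kernel size, stride) for the three convolutional layers
for filters, kernel, stride in [(32, 8, 4), (64, 4, 2), (64, 3, 1)]:
    size = same_conv_output_size(size, stride)
    sizes.append((filters, size, size))

# Number of units that reach the fully connected layer after flattening
flattened_units = sizes[-1][0] * sizes[-1][1] * sizes[-1][2]
print(sizes, flattened_units)
```

With `'same'` padding the kernel size does not affect the output size; only the stride does.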


We update the weights and minimize the loss through gradient descent. The loss is given by the expression:

$$ l = \left( r + \gamma \max_{a'} Q(s', a', \theta) - Q(s, a, \theta) \right)^2 = (y_i - Q(s, a, \theta))^2$$
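
To make the loss concrete, here is a minimal numeric sketch for a single transition. All numbers (discount factor, reward, Q values) are made up for illustration and are not taken from any trained network.

```python
import numpy as np

gamma = 0.9                          # illustrative discount factor
reward = 1.0                         # r
q_next = np.array([0.5, 2.0, 1.0])   # Q(s', a', theta) for each action a'
q_current = 2.5                      # Q(s, a, theta) for the action taken

y = reward + gamma * np.max(q_next)  # TD target y_i
loss = (y - q_current) ** 2          # squared TD error
print(y, loss)
```

Here the target is approximately 2.8 and the squared error approximately 0.09.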

However, training a nonlinear deep neural network tends to be unstable when it is applied naively. Two main ingredients are necessary to stabilize training: experience replay and a separately updated target network.

In [None]:
class DQModel(tf.keras.Model):
    def __init__(self, action_size):
        super(DQModel, self).__init__()
        self.action_size = action_size
        self.conv_ft_32_kn_8_str_4_relu = tf.keras.layers.Convolution2D(32, (8, 8), strides=4, padding='same', activation='relu')
        self.conv_ft_64_kn_4_str_2_relu = tf.keras.layers.Convolution2D(64, (4, 4), strides=2, padding='same', activation='relu')
        self.conv_ft_64_kn_3_str_1_relu = tf.keras.layers.Convolution2D(64, (3, 3), strides=1, padding='same', activation='relu')
        self.flatten = tf.keras.layers.Flatten()
        self.dense_512_relu = tf.keras.layers.Dense(512, activation='relu')
        self.dense_actions_linear = tf.keras.layers.Dense(self.action_size)

    def call(self, inputs):
        x = self.conv_ft_32_kn_8_str_4_relu(inputs)
        x = self.conv_ft_64_kn_4_str_2_relu(x)
        x = self.conv_ft_64_kn_3_str_1_relu(x)
        x = self.flatten(x)
        x = self.dense_512_relu(x)
        x = self.dense_actions_linear(x)
        return x

### Experience Replay

This method stores the agent’s experience $(S, A, R, S_{next})$ at each time
step in a replay memory, which is sampled in minibatches to perform the weight updates. Experience replay decorrelates the data and leads to better data efficiency (in the end, the agent learns from a wide range of experiences!).
Neural networks also tend to overfit when trained on correlated experience, so selecting a random batch of experiences from the replay buffer reduces overfitting.

At the beginning, the replay buffer is filled with random experience.
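
The sketch below illustrates the two properties of the replay memory with a tiny buffer: old transitions are evicted once the capacity is reached, and minibatches are drawn uniformly at random. The capacity, the dummy transitions, and the batch size are arbitrary values chosen for illustration.

```python
import random
from collections import deque

random.seed(0)
buffer = deque(maxlen=5)  # tiny capacity for illustration

# Store 8 dummy transitions (state, action, reward, next_state, done);
# the 3 oldest are evicted automatically once the buffer is full.
for t in range(8):
    buffer.append((t, 0, 1.0, t + 1, False))

# Uniform sampling without replacement breaks the temporal correlation
minibatch = random.sample(list(buffer), 3)
stored_states = [transition[0] for transition in buffer]
print(stored_states, minibatch)
```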



In [None]:
class experience_replay(object):
    def __init__(self, maxlen=2000):
        self.buffer = deque(maxlen=maxlen)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))


### Target Network

A second way to improve DQN stability is by using a separate network to estimate the TD target. 

Previously, we used the same $Q$ function to calculate both the target value and the predicted value, which can give rise to strong divergences.
To avoid this problem, we use a separate network, called the target network, just to calculate the target value.

This target network has the same architecture as the function approximator but with frozen parameters $\theta'$. Every $T$ steps (a hyperparameter) the parameters of the Q network are copied to the target network.

Under this assumption the loss function becomes:

$$ l = \left( r + \gamma \max_{a'} Q(s', a', \theta') - Q(s, a, \theta) \right)^2$$

Notice the $\theta'$ instead of $\theta$ in the $\max$ term of the formula.
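
The periodic copy can be illustrated with a toy example in which the "networks" are plain NumPy vectors and each training step simply adds 1 to the online parameters. The interval `copy_every` and the fake update are assumptions made purely for illustration.

```python
import numpy as np

copy_every = 3                    # T: hypothetical copy interval
theta = np.zeros(2)               # online parameters
theta_target = theta.copy()       # frozen copy used for the TD target

history = []
for step in range(1, 7):
    theta = theta + 1.0           # stand-in for a gradient update
    if step % copy_every == 0:
        theta_target = theta.copy()   # hard update every T steps
    history.append((step, float(theta[0]), float(theta_target[0])))
print(history)
```

Between copies the target parameters stay fixed while the online parameters keep moving, which is what keeps the TD target stable.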


In [None]:
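class target_network(object):
    def __init__(self, model):
        # Build a separate instance of the same architecture so the target
        # weights stay frozen between copies (assumes the model constructor
        # takes only action_size, as DQModel and DuelingDQModel do)
        self.target_model = type(model)(model.action_size)

    def copy_model_parameters(self, model):
        self.target_model.set_weights(model.get_weights())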

### The algorithm

In light of the above considerations, the steps involved in the DQN algorithm are as follows:
1. Preprocess the game screen (state $s$) and feed it to our DQN, which returns the Q values of all possible actions in that state.
2. Select an action using the epsilon-greedy policy: with probability $\epsilon$ select a random action; otherwise, with probability $1-\epsilon$, select the action with the maximum Q value.
3. Perform the action in state $s$ and move to a new state $s'$, receiving a reward $r$.
4. Store the transition in the replay buffer as $\langle s, a, r, s' \rangle$.
5. Sample a random batch of transitions from the replay buffer and calculate the loss.
6. Perform gradient descent with respect to the actual network parameters in order to minimize this loss.
7. After every $k$ steps, copy the actual network weights to the target network weights.
8. Repeat these steps for $M$ episodes.
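
The epsilon-greedy selection in step 2 can be sketched in isolation as follows. The function name `epsilon_greedy`, the Q values, and the number of trials are hypothetical choices for this illustration.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])   # made-up Q values for 3 actions

# With epsilon = 0 the policy is purely greedy
greedy_action = epsilon_greedy(q_values, 0.0, rng)

# With epsilon = 0.1 the greedy action is still chosen most of the time
actions = [epsilon_greedy(q_values, 0.1, rng) for _ in range(1000)]
greedy_fraction = np.mean(np.array(actions) == 1)
print(greedy_action, greedy_fraction)
```

Since a random draw can also land on the greedy action, the expected greedy fraction is $1 - \epsilon + \epsilon / 3$ for three actions, slightly above $1 - \epsilon$.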

__Ex.__ Implement DQN and test your implementation in the MsPacman-v0 environment.

In [None]:
class DQL(target_network, experience_replay):
    def __init__(self, agent, env, maxlen=2000, gamma=0.6, epsilon=0.1, learning_rate=0.01):
        target_network.__init__(self, agent)
        experience_replay.__init__(self, maxlen)

        self.env = env
        self.n_actions = self.env.action_space.n

        # Initialize parameters
        self.gamma = gamma
        self.epsilon = epsilon

        # Optimizer
        self.learning_rate = learning_rate
        self.opt = tf.keras.optimizers.Adam(learning_rate=learning_rate)

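        # base and target networks
        self.agent = agent
        # Call both networks once on a dummy stack of frames so their weights
        # exist before the first copy (assumes 84x84x4 inputs, as in utils)
        dummy_state = np.zeros((1, 84, 84, 4), dtype=np.float32)
        self.agent(dummy_state)
        self.target_model(dummy_state)
        self.copy_model_parameters(self.agent)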


    def initialize_episode(self):
        state = self.env.reset()
        state = utils.preprocess(state)
        state = np.stack([state] * 4, axis=2)
        epoch, episodic_reward, episodic_loss, done = 0, 0, [], False
        return done, state, epoch, episodic_reward, episodic_loss

    def policy(self, state):
        if np.random.rand() <= self.epsilon:
            return self.env.action_space.sample() 
        q_values = self.agent(state)
        return np.argmax(q_values[0])
    
    def optimize(self, experiences):
        # Unpack the sampled transitions into batched arrays
        states = np.array([x[0] for x in experiences], dtype=np.float32)
        actions = np.array([x[1] for x in experiences], dtype=np.int32)
        rewards = np.array([x[2] for x in experiences], dtype=np.float32)
        next_states = np.array([x[3] for x in experiences], dtype=np.float32)
        done = np.array([x[4] for x in experiences], dtype=np.float32)

        # TD target from the target network; no gradient flows through it
        q_next = tf.reduce_max(self.target_model(next_states), axis=-1)
        y = tf.stop_gradient(rewards + (1. - done) * self.gamma * q_next)

        with tf.GradientTape() as tape:
            # Q value predicted by the main network for the action actually taken
            q_values = self.agent(states)
            action_one_hot = tf.one_hot(actions, self.n_actions, 1.0, 0.0)
            pred = tf.reduce_sum(q_values * action_one_hot, axis=-1)
            loss = tf.keras.losses.MSE(y, pred)

        grads = tape.gradient(loss, self.agent.trainable_weights)
        self.opt.apply_gradients(zip(grads, self.agent.trainable_weights))
        return loss
        
    
    def run(self, num_episodes=800, batch_size=50, copy_steps=100):
        step = 0
        for i in range(num_episodes):
            done, state, epoch, episodic_reward, episodic_loss = self.initialize_episode()
            while not done:
                action = self.policy(np.expand_dims(state, axis=0))
                next_state, reward, done, _ = self.env.step(action)

                # Store this transition as an experience in the replay buffer
                # (stack_states preprocesses the raw frame, so it must not be preprocessed twice)
                next_state = utils.stack_states(state, next_state)
                self.store(state, action, reward, next_state, done)
                
                # After certain steps, we train our Q network with samples from the experience replay buffer
                if step > batch_size:
                    minibatch = random.sample(self.buffer, batch_size)
                    loss = self.optimize(minibatch)
                    episodic_loss.append(loss)

                # after some interval we copy our main Q network weights to target Q network
                if step % copy_steps == 0 and step > batch_size:
                    self.copy_model_parameters(self.agent)

                state = next_state
                step += 1
                episodic_reward += reward
            print(i, episodic_reward)
      
         

In [None]:
def runparralel(myDQN, num_episodes=800, batch_size=50, copy_steps=100):
    # Despite its name, this helper runs the episodes sequentially: it
    # recursively halves num_episodes until single episodes remain.
    if not hasattr(myDQN, 'step'):
        myDQN.step = 0  # global step counter shared by all episodes

    if num_episodes < 1:
        return
    if num_episodes == 1:
        done, state, epoch, episodic_reward, episodic_loss = myDQN.initialize_episode()
        while not done:
            action = myDQN.policy(np.expand_dims(state, axis=0))
            next_state, reward, done, _ = myDQN.env.step(action)

            # Store this transition as an experience in the replay buffer
            # (stack_states already preprocesses the raw frame)
            next_state = utils.stack_states(state, next_state)
            myDQN.buffer.append([state, action, reward, next_state, done])

            # After certain steps, we train our Q network with samples from the experience replay buffer
            if myDQN.step > batch_size:
                minibatch = random.sample(myDQN.buffer, batch_size)
                loss = myDQN.optimize(minibatch)
                episodic_loss.append(loss)

            # after some interval we copy our main Q network weights to target Q network
            if myDQN.step % copy_steps == 0 and myDQN.step > batch_size:
                myDQN.copy_model_parameters(myDQN.agent)

            state = next_state
            myDQN.step += 1
            episodic_reward += reward
        print(myDQN.step, episodic_reward)
    else:
        # Split so that odd counts do not lose an episode
        b = num_episodes // 2
        runparralel(myDQN, b, batch_size, copy_steps)
        runparralel(myDQN, num_episodes - b, batch_size, copy_steps)

In [None]:
env = environment("MsPacman-v0").env
dqnAgent = DQModel(env.action_space.n)
dqlearning = DQL(dqnAgent, env)


In [None]:
dqlearning.step = 0  # global step counter used by runparralel
runparralel(myDQN=dqlearning, num_episodes=800)


In [None]:
def factorial_recursive(n):
    # Base case: 1! = 1
    if n == 1:
        return 1

    # Recursive case: n! = n * (n-1)!
    else:
        return n * factorial_recursive(n-1)

In [None]:
factorial_recursive(5)

In [None]:
import _thread
import time

# Define a function for the thread
def print_time( threadName, delay):
   count = 0
   while count < 5:
      time.sleep(delay)
      count += 1
      print("{}: {} ".format(threadName, time.ctime(time.time())))

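val = 0  # shared accumulator updated by the worker threads

def sum_rec_inter(i, n):
    global val
    # Base case: a single-element range adds its value to the running total
    if i == n:
        val = val + n
    else:
        # Recursive case: split [i, n] in two halves, each summed in its own thread.
        # The threads are never joined, so val may be read before they finish.
        try:
            print("*", i, n, "*\n")
            _thread.start_new_thread(sum_rec_inter, (i, i + (n - i) // 2))
            _thread.start_new_thread(sum_rec_inter, (1 + i + (n - i) // 2, n))
        except RuntimeError:
            print("Error: unable to start thread")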


In [None]:
val=0
sum_rec_inter(0,10)
print("the result is {}".format(val))

* 0 10 *

the result is 0
* **0  9 6 5 10 10 *
 *

*


* 3 5 *

* 6 8 **
 
3*  46  *
7
 *

* 0 2 *

* 0 1 *



In [None]:
val

55

In [None]:
val

15

In [None]:
x = "somevalue"

def func_A ():
   global x
   # Do things to x
   return x

def func_B():
   x=func_A()
   # Do things
   return x

#func_A()
func_B()

In [None]:
import _thread
import time

# Define a function for the thread
def print_time( threadName, delay):
   count = 0
   while count < 5:
      time.sleep(delay)
      count += 1
      print("{}: {} ".format(threadName, time.ctime(time.time())))

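# Create two threads as follows
try:
   _thread.start_new_thread( print_time, ("Thread-1", 2, ) )
   _thread.start_new_thread( print_time, ("Thread-2", 4, ) )
except RuntimeError:
   print("Error: unable to start thread")

# Keep the main thread alive until both workers finish (Thread-2 needs 5 * 4 s)
# instead of busy-waiting forever, which would block the notebook
time.sleep(21)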


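In [None]:
# Mount Google Drive first so the weights can be written to it
from google.colab import drive
drive.mount('/content/drive')

In [None]:
dqnAgent.save_weights('/content/drive/My Drive/weightmodelpacman')

In [None]:
# Subclassed Keras models cannot be saved as a full HDF5 model, so save only the weights
dqnAgent.save_weights('my_model.h5')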

### Double Q Network

DQN tends to overestimate $Q$ values because the same max operation is used both to select and to evaluate actions.

To get around this problem, we can use the Q network for selection and the target network for evaluation when making updates.

In practice, we modify our target function:

$$y_i^{DQN} = r + \gamma \max_{a'} Q(s', a', \theta') $$

as follows:

$$y_i^{DoubleDQN} = r + \gamma Q(s', \arg\max_{a'} Q(s', a', \theta), \theta') $$

Notice that we have two Q functions with different weights, $\theta$ and $\theta'$: the Q network ($\theta$) is used to select the best action, while the target network ($\theta'$) is used to evaluate it.
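
The difference between the two targets can be seen on a small batch of made-up Q values. The arrays below are illustrative only, and the `(1 - done)` mask for terminal transitions follows the convention used in the code of this notebook rather than the formulas above.

```python
import numpy as np

gamma = 0.9
rewards = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])       # second transition is terminal

# Made-up next-state Q values for 2 transitions and 3 actions
q_online_next = np.array([[1.0, 3.0, 2.0], [0.5, 0.2, 0.1]])   # Q(s', ., theta)
q_target_next = np.array([[4.0, 1.5, 2.5], [0.3, 0.9, 0.4]])   # Q(s', ., theta')

# DQN: the target network both selects and evaluates the action
y_dqn = rewards + (1.0 - done) * gamma * q_target_next.max(axis=1)

# Double DQN: the Q network selects, the target network evaluates
best_actions = q_online_next.argmax(axis=1)
q_eval = q_target_next[np.arange(len(best_actions)), best_actions]
y_double = rewards + (1.0 - done) * gamma * q_eval
print(y_dqn, y_double)
```

In this example the Double DQN target of the first transition is lower than the DQN target, because the action preferred by the Q network is not the one the target network rates highest.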

In [None]:
class DoubleDQL(DQL):
    def __init__(self, agent, env, maxlen=2000, gamma=0.6, epsilon=0.1, learning_rate=0.01):
        DQL.__init__(self, agent, env, maxlen, gamma, epsilon, learning_rate)

    def optimize(self, experiences):
        # Unpack the sampled transitions into batched arrays
        states = np.array([x[0] for x in experiences], dtype=np.float32)
        actions = np.array([x[1] for x in experiences], dtype=np.int32)
        rewards = np.array([x[2] for x in experiences], dtype=np.float32)
        next_states = np.array([x[3] for x in experiences], dtype=np.float32)
        done = np.array([x[4] for x in experiences], dtype=np.float32)

        # The Q network selects the next action, the target network evaluates it
        actions_next = tf.argmax(self.agent(next_states), axis=-1)
        t = self.target_model(next_states)
        q_next = tf.reduce_sum(t * tf.one_hot(actions_next, self.n_actions), axis=-1)
        y = tf.stop_gradient(rewards + (1. - done) * self.gamma * q_next)

        with tf.GradientTape() as tape:
            # Q value predicted by the main network for the action actually taken
            q_values = self.agent(states)
            action_one_hot = tf.one_hot(actions, self.n_actions, 1.0, 0.0)
            pred = tf.reduce_sum(q_values * action_one_hot, axis=-1)
            loss = tf.keras.losses.MSE(y, pred)

        grads = tape.gradient(loss, self.agent.trainable_weights)
        self.opt.apply_gradients(zip(grads, self.agent.trainable_weights))
        return loss


In [None]:
env = environment("MsPacman-v0").env
dqnAgent = DQModel(env.action_space.n)
dqlearning = DoubleDQL(dqnAgent, env)
dqlearning.run()

### Dueling architecture

The dueling Q-network architecture is characterized by splitting the fully connected layer at the end of the DQN into two branches, one predicting the state value, $V$, and the other predicting the advantage, $A$. Recall that the advantage function specifies how good it is for an agent to perform an action $a$ compared to the other actions.

![](https://lilianweng.github.io/lil-log/assets/images/dueling-q-network.png)

The Q-value is then reconstructed as 

$$Q(s,a)=V(s)+A(s,a)$$

To make sure the estimated advantage values sum to zero, $\sum_a A(s,a) \pi(a|s)=0$, we subtract the mean advantage from the prediction:

$$Q(s,a)=V(s)+\left(A(s,a)-\frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')\right) $$
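
The aggregation can be checked numerically with a minimal NumPy sketch. The state value and advantages below are made-up numbers for a single state with three actions.

```python
import numpy as np

state_value = np.array([[2.0]])             # V(s), shape (batch, 1)
advantage = np.array([[1.0, -1.0, 3.0]])    # A(s, a), shape (batch, n_actions)

# Subtract the mean advantage so the centered advantages average to zero
centered = advantage - advantage.mean(axis=1, keepdims=True)
q_values = state_value + centered
print(q_values, q_values.mean(axis=1))
```

After centering, the mean of the Q values over actions equals $V(s)$, while the ranking of the actions is unchanged.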

In [None]:
class DuelingDQModel(tf.keras.Model):
    def __init__(self, action_size):
        super(DuelingDQModel, self).__init__()
        self.action_size = action_size
        self.conv_32_8_8_str4_relu = tf.keras.layers.Convolution2D(32, (8, 8), strides=4, padding='same', activation='relu')
        self.conv_64_4_4_str2_relu = tf.keras.layers.Convolution2D(64, (4, 4), strides=2, padding='same', activation='relu')
        self.conv_64_3_3_str1_relu = tf.keras.layers.Convolution2D(64, (3, 3), strides=1, padding='same', activation='relu')
        self.flatten = tf.keras.layers.Flatten()
        self.dense_512_relu_state_value = tf.keras.layers.Dense(512, activation='relu')
        self.dense_512_relu_advantage = tf.keras.layers.Dense(512, activation='relu')
        self.dense_1_linear_state_value = tf.keras.layers.Dense(1)
        self.dense_actions_linear_advantage = tf.keras.layers.Dense(self.action_size)
        self.lambda_mean_advantage = tf.keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1, keepdims=True))
        self.merge_subs_advantage = tf.keras.layers.Subtract()
        self.dense_actions_q = tf.keras.layers.Add()

    def call(self, inputs):
        x = self.conv_32_8_8_str4_relu(inputs)
        x = self.conv_64_4_4_str2_relu(x)
        x = self.conv_64_3_3_str1_relu(x)
        x = self.flatten(x)
        state_value = self.dense_512_relu_state_value(x)
        state_value = self.dense_1_linear_state_value(state_value)
        advantage = self.dense_512_relu_advantage(x)
        advantage = self.dense_actions_linear_advantage(advantage)
        mean_advantage = self.lambda_mean_advantage(advantage)
        adv = self.merge_subs_advantage([advantage, mean_advantage])
        q = self.dense_actions_q([state_value, adv])
        return q

In [None]:
env = environment("MsPacman-v0").env
dqnAgent = DuelingDQModel(env.action_space.n)
dqlearning = DQL(dqnAgent, env)
dqlearning.run()