## DDPG algorithm

- randomly initialize the critic network `Q(s, a)` and the actor network `Mu(s)` with weights `W_Q` and `W_Mu`  
- initialize target networks `Q'(s, a)` and `Mu'(s)` with weights `W_Q' = W_Q` and `W_Mu' = W_Mu`  
- initialize replay buffer `R`

- for iteration 1 to M:  
  - initialize a random process for **action exploration**
  - receive initial observation state `s_1`
  
  - for t = 1 to T:
      - select action `a_t = Mu(s_t) + N_t` according to the current policy and exploration noise
      - execute action `a_t`, then observe reward `r_t` and the new state `s_{t+1}`
      - store transition `(s_t, a_t, r_t, s_{t+1})` in `R`
      - sample a random minibatch of `N` transitions `(s_i, a_i, r_i, s_{i+1})` from `R`
      - set `y_i = r_i + gamma * Q'(s_{i+1}, Mu'(s_{i+1}))`
      - update the critic by minimizing the loss: `L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2`
      - update the actor policy using the sampled policy gradient: `del_{W_Mu} J ≈ (1/N) * sum_i [del_a Q(s, a)|_{s=s_i, a=Mu(s_i)} * del_{W_Mu} Mu(s)|_{s=s_i}]`
      ![](policy_gradient.gif)  
      - update the target networks: `W_Q' = tau * W_Q + (1 - tau) * W_Q'` and `W_Mu' = tau * W_Mu + (1 - tau) * W_Mu'`
      
Algorithm from Lillicrap et al., "Continuous control with deep reinforcement learning".
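
The target and critic-loss steps in the inner loop reduce to simple array arithmetic once the network outputs are available. The following is a minimal NumPy sketch of just those two steps, not the notebook's implementation. The function names `td_targets` and `critic_loss` and the sample numbers are made up for illustration, and `next_q` stands in for the target-network output `Q'(s_{i+1}, Mu'(s_{i+1}))`.

```python
import numpy as np

def td_targets(rewards, next_q, gamma):
    """y_i = r_i + gamma * Q'(s_{i+1}, Mu'(s_{i+1}))."""
    return rewards + gamma * next_q

def critic_loss(targets, q_values):
    """L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2."""
    return np.mean((targets - q_values) ** 2)

# Hypothetical minibatch of N = 3 transitions
rewards = np.array([1.0, 0.0, -1.0])    # r_i
next_q = np.array([2.0, 4.0, 0.5])      # stand-in for Q'(s_{i+1}, Mu'(s_{i+1}))
q_values = np.array([2.5, 3.0, 0.0])    # stand-in for Q(s_i, a_i)

y = td_targets(rewards, next_q, gamma=0.9)
loss = critic_loss(y, q_values)
print(y, loss)
```

In a full implementation `next_q` and `q_values` would come from forward passes of the target critic and the main critic, and the loss would be minimized with respect to `W_Q` only.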

## Actor Network

In [None]:
import keras.backend as K
from keras.models import Model
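from keras.layers import Dense, Input

class Actor:
    
    def __init__(self, tf_session, state_shape, action_shape, hidden_layer_units, 
                 learning_rate, batch_size, tau):
        
        self._tf_session = tf_session
        # state_shape is an assumed addition: the network needs an input
        # size, which the original constructor did not take
        self._state_shape = state_shape      # number of state dimensions (int)
        self._action_shape = action_shape    # number of action dimensions (int)
        self._hidden_layer_units = hidden_layer_units    # list of hidden layer widths
        self._learning_rate = learning_rate
        self._batch_size = batch_size
        self._tau = tau    # soft-update rate for the target network
        
        K.set_session(self._tf_session)
        
    def _generate_network(self):
        # Build Mu(s): maps a state vector to an action vector
        input_layer = Input(shape=(self._state_shape,))
        dense = input_layer
        for units in self._hidden_layer_units:
            dense = Dense(units=units)(dense)
        # sigmoid bounds each action component to (0, 1)
        output_layer = Dense(units=self._action_shape, activation='sigmoid')(dense)
        return Model(inputs=input_layer, outputs=output_layer)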
    
    def train_main_network(self):
        pass
    
    def train_target_network(self):
        pass
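
The `train_target_network` stub is presumably where the soft update `W_Mu' = tau * W_Mu + (1 - tau) * W_Mu'` from the algorithm would go. Below is a minimal sketch of that update on plain lists of NumPy arrays, which is the form Keras models return from `get_weights()` and accept in `set_weights()`. The helper name `soft_update` and the sample weights are illustrative assumptions, not part of the class above.

```python
import numpy as np

def soft_update(main_weights, target_weights, tau):
    """Return tau * W + (1 - tau) * W' for each pair of weight arrays."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(main_weights, target_weights)]

# Hypothetical weights: one 2x2 kernel and one bias vector per network
main = [np.ones((2, 2)), np.zeros(2)]
target = [np.zeros((2, 2)), np.ones(2)]

updated = soft_update(main, target, tau=0.1)
print(updated[0])
print(updated[1])
```

With a small `tau` the target weights move only slightly toward the main weights at each step, which is the slow tracking the algorithm's last step describes.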
    
    