This directory contains an implementation of the DDPG (Deep Deterministic Policy Gradient) algorithm applied to several OpenAI Gym environments (Pendulum-v0, MountainCarContinuous-v0), as well as to the Unity Reacher environment.

Continuous Control with Reinforcement Learning


This directory contains an implementation of the DDPG (Deep Deterministic Policy Gradient) algorithm applied to the Unity Reacher environment.

In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of the agent is to maintain its position at the target location for as many time steps as possible. A detailed description may be found here.

The state space is a 33-dimensional real-valued vector consisting of the position, rotation, velocity, and angular velocity of the arm.

The action space is a 4-dimensional real-valued vector corresponding to the torques applied to the two joints. Every entry in the action vector should be a number between -1 and 1.
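
Because every action entry must lie in [-1, 1], any action that leaves this range (for example after exploration noise is added) has to be clipped before it is sent to the environment. The sketch below illustrates this with the standard library only. The helper names `clip_action` and `perturb_action` are hypothetical, and Gaussian noise is an assumption; the repository may use a different noise process.

```python
import random

ACTION_DIM = 4  # four torque values, as described in the text


def clip_action(action, low=-1.0, high=1.0):
    """Clamp every entry of the action vector to [low, high]."""
    return [max(low, min(high, a)) for a in action]


def perturb_action(action, noise_scale, rng):
    """Add zero-mean Gaussian noise to each entry, then clip.

    Hypothetical helper: the noise distribution is an assumption.
    """
    noisy = [a + rng.gauss(0.0, noise_scale) for a in action]
    return clip_action(noisy)


rng = random.Random(0)
raw_action = [0.9, -0.95, 0.1, 0.0]
action = perturb_action(raw_action, noise_scale=0.3, rng=rng)
```

Clipping after the noise is added keeps the perturbed action valid without changing the policy output itself.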

Solution criterion: the environment is considered solved when the agent reaches an average score of +30 over 100 consecutive episodes (with each episode's score averaged over all agents in the case of a multi-agent environment).
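
The solution criterion can be checked with a sliding window over the per-episode scores. This is a minimal sketch, not the repository's code: the function names are hypothetical, and each episode score is assumed to be already averaged over all agents.

```python
from collections import deque


def episode_score(agent_scores):
    """Average the scores of all agents for one episode."""
    return sum(agent_scores) / len(agent_scores)


def first_solved_episode(episode_scores, target=30.0, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Toy data: 50 episodes scoring 0 followed by 150 episodes scoring 40.
scores = [0] * 50 + [40] * 150
solved_at = first_solved_episode(scores)
```

Note that the criterion cannot be met before episode 100, because a full window of scores is required.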


For a detailed Python environment setup (PyTorch, the ML-Agents toolkit, and a few more Python packages), please follow these steps: link

Prebuilt Unity environments:

Linux: 20 agents, 1 agent

Windows x32: 20 agents, 1 agent

Windows x64: 20 agents, 1 agent

Mac: 20 agents, 1 agent

Theoretical background

The DDPG algorithm was first presented in the paper by Lillicrap et al. The pseudocode for this algorithm can be summarised as follows:

The idea behind the algorithm:

Given the state of an Agent in the Environment, the Policy network returns an action from the continuous action space, slightly perturbed by noise for exploration.

The QNetwork then evaluates this action given the state: the network accepts the concatenated state-action vector and returns a single value.
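
For reference, the standard DDPG update rules can be written compactly as below. The notation is an assumption of this sketch rather than something taken from this repository: \mu and Q are the local policy and QNetwork, primes mark their target copies, d_i is the episode-termination flag, N is the minibatch size, \gamma is the discount factor and \tau is the soft-update rate.

```latex
\begin{aligned}
y_i &= r_i + \gamma\,(1 - d_i)\,Q'\big(s_{i+1}, \mu'(s_{i+1})\big) \\
L(\theta^{Q}) &= \frac{1}{N}\sum_{i}\big(y_i - Q(s_i, a_i)\big)^2 \\
L(\theta^{\mu}) &= -\frac{1}{N}\sum_{i} Q\big(s_i, \mu(s_i)\big) \\
\theta' &\leftarrow \tau\,\theta + (1 - \tau)\,\theta'
\end{aligned}
```

The first line is the critic target, the second the critic loss, the third the actor loss, and the last the soft update applied to both target networks.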

The corresponding implementation of the algorithm described above is shown below, with explanatory comments:

    def update_Qnet_and_policy(self, experiences):
        states, actions, rewards, next_states, dones = experiences
        # ^ unpack a batch of experiences sampled at random from the memory
        next_actions, next_actions_perturbed = self.actor_target(next_states)
        # ^ get actions for the next states according to the target policy
        Q_targets_next = self.critic_target(next_states, next_actions)
        # ^ evaluate the q-function for the next states and next actions
        Q_targets = rewards + (self.__gamma*Q_targets_next*(1 - dones))
        # ^ get the target q-function value for the current states and actions
        Q_expected = self.critic_local(states, actions)
        # ^ get the critic network's estimate of the q-function value
        # for the current states and actions
        loss_func = nn.MSELoss()
        loss_critic = loss_func(Q_expected, Q_targets.detach())
        # ^ define the loss function for the critic

        # (the optimizer step that updates the critic network (q-network) goes here)

        predicted_actions, predicted_actions_perturbed = self.actor_local(states)
        # ^ get new predicted actions, not the ones stored in the buffer

        loss_actor = -self.critic_local(states, predicted_actions).mean()
        # ^ define the loss function for the actor

        # (the optimizer step that updates the actor network (policy) goes here)
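
To make the target computation concrete, here is a small numeric sketch in plain Python. It mirrors the `Q_targets` expression from the method above and adds the soft update of the target parameters controlled by `tau`. The function names and the use of flat lists instead of tensors are assumptions made for illustration.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Compute r + gamma * Q_next * (1 - done) for every transition."""
    return [
        r + gamma * q_next * (1.0 - d)
        for r, q_next, d in zip(rewards, next_q_values, dones)
    ]


def soft_update(local_params, target_params, tau=1e-2):
    """Move each target parameter a fraction tau towards the local one."""
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_params, target_params)
    ]


# Two transitions: the second one ends the episode (done = 1),
# so its target reduces to the immediate reward.
targets = td_targets(
    rewards=[0.1, 0.0], next_q_values=[2.0, 5.0], dones=[0.0, 1.0]
)

# With tau = 0.01 the target network keeps 99% of its old parameters.
new_target = soft_update(local_params=[1.0, -1.0], target_params=[0.0, 0.0])
```

The `(1 - done)` factor removes the bootstrapped term for terminal transitions, and a small `tau` makes the target networks change slowly, which stabilises the critic targets.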

Code organization

The implementation is stored in the folder 'src', which includes:

  • the main file, used to run the training of the reinforcement learning agent. It contains all the hyperparameters and the function 'interact_and_train', which creates instances of an Environment and an Agent and runs their interaction.
  • - contains the implementation of an Agent.
  • - implementation of internal buffer to sample the experiences.
  • - an ANN to evaluate Q-function.
  • - an ANN to choose an action given the state.
  • - generates the plot of scores acquired during the training.
  • - initializes an Agent with the specified state dictionary and architecture and runs a visualisation of the Agent's performance.
  • - wrapper to run Unity Environments using the same code as for OpenAI Gym Environments.
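
As an illustration of the internal buffer mentioned in the list above, a replay buffer can be sketched with the standard library alone. This is a hypothetical minimal version, not the repository's implementation: the class and field names are assumptions, and the real buffer presumably returns tensors rather than tuples.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed=None):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Return a random batch as five tuples, one per field."""
        batch = self.rng.sample(list(self.memory), self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(buffer_size=5, batch_size=3, seed=0)
for step in range(10):
    buffer.add(state=step, action=0.0, reward=0.1, next_state=step + 1, done=False)
states, actions, rewards, next_states, dones = buffer.sample()
```

Sampling uniformly at random breaks the correlation between consecutive transitions, which is why the update method receives a shuffled batch rather than the latest steps.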


To solve the Reacher environment the following parameters have been used:

import random
from collections import OrderedDict

import numpy as np

params = dict()
params['action_dim'] = len(env.action_space.low)
params['state_dim'] = len(env.observation_space.low)
params['num_episodes'] = 200        # number of episodes for the agent to interact with the environment
params['buffer_size'] = int(1e6)    # replay buffer size
params['batch_size'] = 128          # minibatch size
params['gamma'] = 0.99              # discount factor
params['tau'] = 1e-2                # for soft update of target parameters
params['eps'] = 0.8                 # exploration factor (modifies noise)
params['min_eps'] = 0.001           # min level of noise
min_e = params['min_eps']
e = params['eps']
N = params['num_episodes']
params['eps_decay'] = np.exp(np.log(min_e/e)/(0.8*N)) # per-episode decay of the noise level; eps reaches min_eps after 80% of the episodes
params['lr'] = 1e-3                 # learning rate
params['update_every'] = 2          # how often to update the network (every update_every timesteps)
params['seed'] = random.randint(0,1000)
params['max_t'] = 1000              # maximum number of timesteps per episode
params['noise_type'] = 'action'     # noise type; can be 'action' or 'parameter'
params['save_to'] = ('../results/' + env_name) # where to save the results to
params['threshold'] = 38            # the score above which the network parameters are saved

# parameters for the Policy (actor) network
params['arch_params_actor'] = OrderedDict(
    {'state_and_action_dims': (params['state_dim'], params['action_dim']),
     'layers': {
         'Linear_1': 128, 'ReLU_1': None,
         'Linear_2': 64, 'ReLU_2': None,
         'Linear_3': params['action_dim'],
         'Tanh_1': None
     }
    })

# parameters for the QNetwork (critic) network
params['arch_params_critic'] = OrderedDict(
    {'state_and_action_dims': (params['state_dim'], params['action_dim']),
     'layers': {
         'Linear_1': 128, 'ReLU_1': None,
         'Linear_2': 64, 'ReLU_2': None,
         'Linear_3': params['action_dim']
     }
    })
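
The `eps_decay` formula is chosen so that the exploration factor falls from `eps` to `min_eps` after 80% of the episodes. The sketch below reproduces that arithmetic with the values listed above. The `eps_schedule` helper and the clamping at `min_eps` are assumptions about how the decay is applied, not code from the repository.

```python
import math

eps, min_eps, num_episodes = 0.8, 0.001, 200

# Same formula as params['eps_decay'], using math instead of numpy.
eps_decay = math.exp(math.log(min_eps / eps) / (0.8 * num_episodes))


def eps_schedule(eps, min_eps, eps_decay, num_episodes):
    """Return the exploration factor used in each episode.

    Assumption: eps is multiplied by eps_decay after every episode
    and never drops below min_eps.
    """
    values = []
    current = eps
    for _ in range(num_episodes):
        values.append(current)
        current = max(min_eps, current * eps_decay)
    return values


schedule = eps_schedule(eps, min_eps, eps_decay, num_episodes)
# schedule[0] is 0.8; after 160 decays (80% of 200 episodes)
# the value is approximately min_eps and stays there.
```

With these settings `eps_decay` is roughly 0.959, so the noise level shrinks by about 4% per episode.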

Performance of a trained agent

To demonstrate the results, I trained the agent on the multi-agent version of Reacher.

The scores plot for training Reacher:

GIF demonstration of a trained Agent:

Ideas to try

  • The parameter noise needs some fine-tuning. It would also be interesting to make the noise adaptive to the sensitivity of the parameters: the greater the gradient with respect to a specific weight, the lower the level of noise imposed on it, and vice versa.
  • One may try making the policy network return the parameters (mean and variance) of the probability distribution from which the action is sampled. The mean of the distribution could then be used for the update in the QNetwork.
  • Reward shaping to decrease the jitter of the arm (penalizing the Agent for unnecessary movements so that it acts more smoothly).
  • Of course, there is always room for further hyperparameter tuning.
  • Implementing PPO (Proximal Policy Optimization, see the original paper) or TRPO (Trust Region Policy Optimization, see the original paper) and comparing it to DDPG. It has been suggested that PPO-family algorithms work better for continuous control.
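
The reward-shaping idea from the list above could look roughly like the following sketch. Everything here is hypothetical: the function name, the choice of penalizing the squared change between consecutive actions, and the coefficient value are illustrative assumptions that have not been tried in this repository.

```python
def shaped_reward(reward, action, prev_action, jitter_coef=0.01):
    """Subtract a penalty proportional to the squared change in action.

    Large jumps between consecutive torque vectors are treated as jitter.
    """
    jitter = sum((a - p) ** 2 for a, p in zip(action, prev_action))
    return reward - jitter_coef * jitter


# Holding the same action costs nothing; a sudden change is penalized.
steady = shaped_reward(0.1, [0.2, 0.0, 0.0, 0.0], [0.2, 0.0, 0.0, 0.0])
jerky = shaped_reward(0.1, [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])
```

The coefficient would need tuning: if it is too large relative to the +0.1 step reward, the agent may learn to stay still instead of reaching the target.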


