## Agent Testing with OpenAI Gym's Continuous Lunar Lander Environment ##

To verify that the agents built for this project worked correctly, we first tested them on OpenAI Gym's continuous lunar lander environment. Continuous lunar lander is a difficult task, but it is solvable: the environment is considered solved when an agent achieves an average score of at least 200 over 100 episodes.
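To make the solved criterion concrete, here is a minimal sketch of how it could be checked against a list of per-episode scores. The function name `first_solved_episode` and its arguments are hypothetical and are not part of `simplerl` or Gym.

```python
from collections import deque


def first_solved_episode(scores, window=100, threshold=200.0):
    """Return the 1-based episode at which the rolling mean of the last
    `window` scores first reaches `threshold`, or None if it never does.

    Hypothetical helper for illustration only.
    """
    recent = deque(maxlen=window)
    total = 0.0
    for episode, score in enumerate(scores, start=1):
        if len(recent) == window:
            # The oldest score is about to be evicted by append().
            total -= recent[0]
        recent.append(score)
        total += score
        if len(recent) == window and total / window >= threshold:
            return episode
    return None
```

An agent that never averages 200 over a full window returns `None`, which is the situation the first DDPG run below ends in.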

We won't go into too much detail on the agents here, but we'll do a brief walkthrough of the results.

In [1]:
import numpy as np
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import time
import sys

from simplerl import (DDPGAgent, 
                      DDPGNet,
                      TD3Agent,
                      TD3Net,
                      OrnsteinUhlenbeckProcess, 
                      GaussianProcess,
                      ExponentialScheduler, 
                      ConstantScheduler, 
                      LinearScheduler,
                      train,
                      GymMonitorHook)

The first agent built was a DDPG agent. You can read about the agent design in detail here: https://arxiv.org/pdf/1509.02971. For this task, we'll use a simple two-layer feedforward network with 256 units in the first layer and 128 units in the second for both the policy and critic networks, with no parameters shared between the two. We'll also use the exploration noise suggested in the paper, an Ornstein-Uhlenbeck process.
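For readers unfamiliar with Ornstein-Uhlenbeck noise, the sketch below shows a common discretization of the process. It is not the `OrnsteinUhlenbeckProcess` class from `simplerl`; the class name `OUNoise`, its arguments, and the defaults `theta = 0.15` and `sigma = 0.2` (the values reported in the DDPG paper) are assumptions made for illustration.

```python
import math
import random


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process (illustrative sketch).

    Each component follows
        x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1),
    so successive samples are correlated and drift back toward mu.
    """

    def __init__(self, size, sigma=0.2, theta=0.15, mu=0.0, dt=1.0, seed=None):
        self.size = size
        self.sigma = sigma
        self.theta = theta
        self.mu = mu
        self.dt = dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Restart the process at its long-run mean (e.g. at episode start).
        self.state = [self.mu] * self.size

    def sample(self):
        self.state = [
            x
            + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)
```

In use, the sampled vector would be added to the policy's action and the result clipped to the valid action range before stepping the environment.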

The full list of selected hyperparameters is below. This first version of DDPG is the basic implementation, using regular (uniform) experience replay, 1-step returns, and a single actor.

In [11]:
env_fn = lambda: gym.make('LunarLanderContinuous-v2')
model_fn = lambda: DDPGNet(8, 2, (256, 128), action_scale = 1.0)
noise_fn = lambda: OrnsteinUhlenbeckProcess((2, ), ConstantScheduler(0.2))

In [11]:
agent = DDPGAgent(env_fn = env_fn, 
                  model_fn = model_fn,
                  n_actors = 1,
                  action_range = 1.0,
                  gamma = 0.99,
                  exploration_noise = noise_fn,
                  batch_size = 64,
                  n_steps = 1,
                  replay_memory = 1000000,
                  use_per = False,
                  replay_start = 1000,
                  parameter_tau = 1e-3,
                  buffer_tau = 1e-3,
                  optimizer = optim.Adam,
                  policy_learning_rate = 1e-4,
                  critic_learning_rate = 1e-3,
                  weight_decay = 1e-5,
                  clip_gradients = 5.0,
                  update_freq = 4)

monitor = GymMonitorHook(verbose = 50, vector_env = True)
agent.train()
t = time.time()
train(agent, agent.env, train_steps = 400000, hooks = [monitor], vector_env = True)
print('\nWall time: {:.2f}'.format((time.time() - t) / 60))

Episode 50 | Time Steps: 86 | Average Score: -246.59
Episode 100 | Time Steps: 107 | Average Score: -266.01
Episode 150 | Time Steps: 190 | Average Score: -335.07
Episode 200 | Time Steps: 222 | Average Score: -338.35
Episode 250 | Time Steps: 280 | Average Score: -224.36
Episode 300 | Time Steps: 1000 | Average Score: -124.28
Episode 350 | Time Steps: 1000 | Average Score: -30.89
Episode 400 | Time Steps: 1000 | Average Score: -27.26
Episode 450 | Time Steps: 627 | Average Score: -32.18
Episode 500 | Time Steps: 479 | Average Score: 70.16
Episode 550 | Time Steps: 322 | Average Score: 101.76
Episode 600 | Time Steps: 571 | Average Score: 91.79
Episode 650 | Time Steps: 1000 | Average Score: 71.11
Episode 700 | Time Steps: 1000 | Average Score: 9.55
Episode 750 | Time Steps: 1000 | Average Score: 32.12
Episode 800 | Time Steps: 1000 | Average Score: 72.19
Episode 821 | Time Steps: 654 | Average Score: 90.28

Wall time: 43.61


The agent learns, but it does not solve the environment within the given 400,000 steps: the average score ends at about 90. With more training, however, vanilla DDPG is able to solve this environment.

Next, we'll use a more sophisticated version of DDPG. This version uses:
1. Prioritized experience replay to better select experiences to learn from, which you can read about in detail here: https://arxiv.org/pdf/1511.05952.
2. 5-step returns, which reduce the bias of the value estimates at the cost of somewhat higher variance.
3. 16 parallel actors (implemented using Python's multiprocessing library) to increase exploration.
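
The first two items can be sketched in a few lines of plain Python. This is an illustration of the formulas from the prioritized experience replay paper and of the standard n-step return, not the `simplerl` implementation; all three function names are hypothetical.

```python
def per_probabilities(priorities, alpha=0.6):
    """Sampling probabilities P(i) = p_i^alpha / sum_k p_k^alpha.

    alpha = 0 gives uniform sampling; larger alpha favors high-priority
    (typically high TD error) transitions more strongly.
    """
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probabilities, beta=0.4):
    """Importance-sampling weights w_i = (N * P(i))^(-beta), scaled by the
    largest weight so that every weight is at most 1.

    These weights correct for the bias introduced by non-uniform sampling.
    """
    n = len(probabilities)
    weights = [(n * p) ** (-beta) for p in probabilities]
    max_weight = max(weights)
    return [w / max_weight for w in weights]


def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """R = r_0 + gamma * r_1 + ... + gamma^(n-1) * r_(n-1) + gamma^n * V(s_n).

    `bootstrap_value` is the critic's estimate at the state reached after
    the n rewards; pass 0.0 if the episode terminated within the window.
    """
    ret = bootstrap_value
    for reward in reversed(rewards):
        ret = reward + gamma * ret
    return ret
```

With `n_steps = 5`, five real rewards enter each target before the critic's estimate is used, which is why the target relies less on a possibly inaccurate critic (lower bias) but sums more random rewards (higher variance).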

In [8]:
agent = DDPGAgent(env_fn = env_fn, 
                  model_fn = model_fn,
                  n_actors = 16,
                  action_range = 1.0,
                  gamma = 0.99,
                  exploration_noise = noise_fn,
                  batch_size = 64,
                  n_steps = 5,
                  replay_memory = 1000000,
                  use_per = True,
                  alpha = 0.6,
                  beta = LinearScheduler(0.4, 1.0, 100000),
                  replay_start = 1000,
                  parameter_tau = 1e-3,
                  buffer_tau = 1e-3,
                  optimizer = optim.Adam,
                  policy_learning_rate = 1e-4,
                  critic_learning_rate = 1e-3,
                  weight_decay = 1e-5,
                  clip_gradients = 5.0,
                  update_freq = 4)

monitor = GymMonitorHook(verbose = 50, vector_env = True)
agent.train()
t = time.time()
train(agent, agent.env, train_steps = 200000, hooks = [monitor], vector_env = True)
print('\nWall time: {:.2f}'.format((time.time() - t) / 60))

Episode 50 | Time Steps: 139 | Average Score: -171.67
Episode 100 | Time Steps: 275 | Average Score: -186.51
Episode 150 | Time Steps: 580 | Average Score: -148.35
Episode 200 | Time Steps: 959 | Average Score: -8.21
Episode 250 | Time Steps: 687 | Average Score: 107.41
Episode 300 | Time Steps: 323 | Average Score: 153.65
Episode 350 | Time Steps: 263 | Average Score: 170.48
Episode 400 | Time Steps: 360 | Average Score: 190.53
Episode 450 | Time Steps: 284 | Average Score: 216.16
Episode 500 | Time Steps: 529 | Average Score: 222.36
Episode 503 | Time Steps: 597 | Average Score: 221.50

Wall time: 61.01


This agent solves the environment quickly, with the average score passing 200 between episodes 400 and 450, and learning is much more stable with these extra features.

Lastly, we'll use a TD3 agent. TD3 is a variant of DDPG with 3 main differences:
1. The policy network is updated less frequently than the critic network so that the critic can provide better value estimates for the policy gradient. Here, the critic is updated twice before the policy network is updated.
2. Two critic networks are used, and the smaller of their two estimates forms the learning target, to reduce overestimation bias in the value estimates. This is similar to Double Q-learning in traditional reinforcement learning and to Double DQN.
3. Clipped Gaussian noise is added to the target actions passed to the critic networks as a form of regularization during training.

You can read more about TD3 here: https://arxiv.org/pdf/1802.09477.
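
The three differences can be sketched as follows, using plain Python callables as stand-ins for the target networks. This is a sketch of the target computation described in the TD3 paper, not the `simplerl` code: the function names are hypothetical, and `updates_at` assumes that `update_freq` and `policy_update_freq` both count environment steps, which is our reading of the "critic is updated twice" statement above.

```python
import random


def clip(x, low, high):
    return max(low, min(high, x))


def td3_target(reward, done, next_state, target_policy, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, action_range=1.0,
               rng=random):
    """One-step TD3 target for a single transition (illustrative sketch)."""
    # Target policy smoothing: perturb the target action with clipped
    # Gaussian noise, then keep it inside the valid action range.
    action = []
    for a in target_policy(next_state):
        eps = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
        action.append(clip(a + eps, -action_range, action_range))
    # Clipped double Q-learning: use the smaller of the two critic estimates.
    q = min(target_q1(next_state, action), target_q2(next_state, action))
    # No bootstrapping from terminal states.
    return reward + gamma * (1.0 - float(done)) * q


def updates_at(step, critic_every=4, policy_every=8):
    """Delayed policy updates: return (update_critic, update_policy).

    Assumes both frequencies are measured in environment steps.
    """
    return step % critic_every == 0, step % policy_every == 0
```

Under that assumption, steps 1 through 16 trigger four critic updates and two policy updates, so the critic is updated twice for every policy update.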

In [13]:
model_fn = lambda: TD3Net(8, 2, (256, 128), action_scale = 1.0)

agent = TD3Agent(env_fn = env_fn, 
                 model_fn = model_fn,
                 n_actors = 16,
                 action_range = 1.0,
                 gamma = 0.99,
                 exploration_noise = noise_fn,
                 regularization_noise = GaussianProcess((64, 2), ConstantScheduler(0.2)),
                 noise_clip = (-0.5, 0.5),
                 batch_size = 64,
                 n_steps = 5,
                 replay_memory = 100000,
                 use_per = True,
                 alpha = 0.6,
                 beta = LinearScheduler(0.4, 1.0, 100000),
                 replay_start = 1000,
                 parameter_tau = 1e-3,
                 buffer_tau = 1e-3,
                 optimizer = optim.Adam,
                 policy_learning_rate = 1e-4,
                 critic_learning_rate = 1e-3,
                 weight_decay = 1e-5,
                 clip_gradients = 5.0,
                 update_freq = 4, 
                 policy_update_freq = 8)

monitor = GymMonitorHook(verbose = 50, vector_env = True)
agent.train()
t = time.time()
train(agent, agent.env, train_steps = 200000, hooks = [monitor], vector_env = True)
print('\nWall time: {:.2f}'.format((time.time() - t) / 60))

Episode 50 | Time Steps: 164 | Average Score: -282.23
Episode 100 | Time Steps: 707 | Average Score: -271.86
Episode 150 | Time Steps: 1000 | Average Score: -206.34
Episode 200 | Time Steps: 504 | Average Score: -97.24
Episode 250 | Time Steps: 662 | Average Score: 52.56
Episode 300 | Time Steps: 690 | Average Score: 182.74
Episode 350 | Time Steps: 181 | Average Score: 213.09
Episode 400 | Time Steps: 337 | Average Score: 212.59
Episode 450 | Time Steps: 153 | Average Score: 219.05
Episode 500 | Time Steps: 213 | Average Score: 224.59
Episode 550 | Time Steps: 202 | Average Score: 230.64
Episode 574 | Time Steps: 142 | Average Score: 240.74

Wall time: 81.87


The TD3 agent performs the best, solving the environment in about 350 episodes, about 100 fewer than DDPG. Although TD3 is more complicated than DDPG, its improvements go a long way toward stabilizing learning, so we used TD3 as the final agent for the project.