Solving OpenAI's "Lunar Lander" with Reinforcement Learning

Lunar Lander

The Lunar Lander environment is a rocket trajectory optimization problem. The goal is to touch down as close to the landing pad as possible. The rocket starts at the top center of the screen with a random initial force applied to its center of mass.

There are four discrete actions: do nothing, fire left engine, fire main engine, and fire right engine.

Each observation is an 8-dimensional vector containing: the lander's position in x and y, its linear velocity in x and y, its angle, its angular velocity, and two boolean flags indicating whether each leg is in contact with the ground.

Positive rewards are received for landing (100-140 points, depending on the position), with an additional +100 if the lander comes to rest. Firing the engines gives small negative rewards (-0.3 per frame for the main engine, -0.03 for a side engine), and crashing gives -100. The problem is considered solved at a score of 200 points.
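
For illustration, a minimal random-agent rollout against this environment could look like the sketch below (it uses the Gym 0.26 API listed under Dependencies and is not part of the repository code):

    import gym

    # Minimal random-agent rollout illustrating the LunarLander-v2 interface
    # (Gym >= 0.26 API: reset() returns (obs, info), step() returns 5 values).
    env = gym.make("LunarLander-v2")
    obs, info = env.reset(seed=0)           # 8-dimensional observation vector
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()  # one of the 4 discrete actions
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    print(f"Episode return: {total_reward:.2f}")
    env.close()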

The following RL algorithms were implemented:

  • Neural Fitted Q Iteration (NFQ)
  • Deep Q-Network (DQN)
  • REINFORCE with baseline / Vanilla Policy Gradient (VPG)
  • Advantage Actor Critic (AC)

For a fair comparison, all algorithms use a 2-layer MLP (128 and 64 hidden units) and a discount factor of 0.999; only the learning rate is set individually per algorithm. A sketch of this network is shown below.
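
A minimal sketch of such a network in PyTorch could look as follows (activations and output handling are assumptions, not taken from the repository code):

    import torch.nn as nn

    # Sketch of the shared 2-layer MLP with 128 and 64 hidden units.
    # The output is interpreted as Q-values or action logits, depending on the algorithm.
    def make_mlp(obs_dim: int = 8, n_outputs: int = 4) -> nn.Sequential:
        return nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, n_outputs),
        )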

How to

Install dependencies with pip install -r requirements.txt.

Run main.py train <agent> <episodes> to train an agent.

Run main.py evaluate <agent> <episodes> <render> to evaluate a pre-trained agent.

  • <agent> (string): NFQ, DQN, VPG or AC
  • <episodes> (int): number of episodes
  • <render> (bool): display episodes on screen
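
For example, to train a DQN agent for 1000 episodes and then evaluate it for 100 episodes with rendering (argument formats assumed from the descriptions above):

    python main.py train DQN 1000
    python main.py evaluate DQN 100 True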

Neural Fitted Q Iteration

[Figures: training curve and agent behavior after 2000 episodes]

Reference: M. Riedmiller (2005) Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method
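
The core idea of NFQ is batch-mode fitted Q-iteration: collect transitions, compute bootstrapped targets over the whole dataset, and fit the Q-network to them by supervised regression. A minimal sketch of one such iteration, assuming tensors over the full transition set (names and the training loop are assumptions, not the repository code):

    import torch
    import torch.nn as nn

    def nfq_iteration(q_net, optimizer, transitions, gamma=0.999, epochs=100):
        # Tensors covering the entire collected dataset (batch mode).
        states, actions, rewards, next_states, dones = transitions
        with torch.no_grad():
            # Targets: r + gamma * max_a' Q(s', a'), no bootstrap at terminal states.
            targets = rewards + gamma * (1.0 - dones.float()) * q_net(next_states).max(dim=1).values
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
            loss = loss_fn(q_values, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()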

Deep Q-Network

[Figures: training curve and agent behavior after 1000 episodes]

Reference: V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with Deep Reinforcement Learning
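
The central DQN ingredient is online learning from an experience replay buffer: sample a random mini-batch of past transitions and take one gradient step on the TD error. A hedged sketch (buffer layout, batch size and loss are assumptions, not the repository code):

    import random
    import torch
    import torch.nn as nn

    def dqn_update(q_net, optimizer, replay_buffer, batch_size=64, gamma=0.999):
        # replay_buffer: list of (state, action, reward, next_state, done) tensor tuples
        batch = random.sample(replay_buffer, batch_size)
        states, actions, rewards, next_states, dones = map(torch.stack, zip(*batch))
        with torch.no_grad():
            targets = rewards + gamma * (1.0 - dones.float()) * q_net(next_states).max(dim=1).values
        q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_values, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()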

REINFORCE with baseline / Vanilla Policy Gradient

[Figures: training curve and agent behavior after 5000 episodes]

Reference: R. Sutton and A. Barto (2018) Reinforcement Learning: An Introduction, p. 328

Reference: OpenAI, Spinning Up in Deep RL: Vanilla Policy Gradient
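
REINFORCE with baseline updates the policy once per episode from Monte Carlo returns, using a learned state-value function as the baseline. A hedged sketch of such an update (variable names and optimizers are assumptions, not the repository code):

    import torch

    def vpg_update(policy_opt, value_opt, log_probs, values, rewards, gamma=0.999):
        # Discounted returns G_t, computed backwards over the episode.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns)
        values = torch.stack(values).squeeze(-1)
        log_probs = torch.stack(log_probs)

        advantages = returns - values.detach()          # baseline-corrected returns
        policy_loss = -(log_probs * advantages).mean()
        value_loss = torch.nn.functional.mse_loss(values, returns)

        policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
        value_opt.zero_grad(); value_loss.backward(); value_opt.step()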

Advantage Actor Critic

[Figures: training curve and agent behavior after 1000 episodes]

Reference: RL Course by David Silver - Lecture 7: Policy Gradient Methods
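
In the actor-critic setting the update can be made per step, with the TD error serving as an estimate of the advantage. A hedged sketch for a single transition, assuming separate actor and critic networks (names are not taken from the repository code):

    import torch

    def ac_update(policy_opt, value_opt, log_prob, value, next_value, reward, done, gamma=0.999):
        # TD target and TD error (advantage estimate).
        target = reward + gamma * (1.0 - done) * next_value.detach()
        advantage = (target - value).detach()

        actor_loss = -log_prob * advantage
        critic_loss = torch.nn.functional.mse_loss(value, target)

        policy_opt.zero_grad(); actor_loss.backward(); policy_opt.step()
        value_opt.zero_grad(); critic_loss.backward(); value_opt.step()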

Comparison

The score is the average return of each trained agent over 100 evaluation episodes.

Algorithm                     Score
Neural Fitted Q Iteration    -24.90
Deep Q-Network               271.47
Vanilla Policy Gradient      172.49
Advantage Actor Critic       205.77
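
Such a score can be computed with a simple evaluation loop like the sketch below (agent.act is a hypothetical interface, not necessarily the repository's):

    import gym
    import numpy as np

    def evaluate(agent, n_episodes=100):
        env = gym.make("LunarLander-v2")
        returns = []
        for _ in range(n_episodes):
            obs, _ = env.reset()
            done, total = False, 0.0
            while not done:
                action = agent.act(obs)             # greedy/deterministic action
                obs, reward, terminated, truncated, _ = env.step(action)
                total += reward
                done = terminated or truncated
            returns.append(total)
        env.close()
        return float(np.mean(returns))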

Dependencies

  • Python v3.10.9
  • Gym v0.26.2
  • Matplotlib v3.6.2
  • Numpy v1.24.1
  • Pandas v1.5.2
  • PyTorch v1.13.1
  • Tqdm v4.64.1
  • Typer v0.7.0