Solving OpenAI's "Lunar Lander" with Reinforcement Learning

Lunar Lander

The Lunar Lander environment is a rocket trajectory optimization problem. The goal is to touch down as close to the landing pad as possible. The rocket starts at the top center of the screen with a random initial force applied to its center of mass.

There are four discrete actions: do nothing, fire left engine, fire main engine, and fire right engine.

Each observation is an 8-dimensional vector containing: the lander's position in x and y, its linear velocity in x and y, its angle, its angular velocity, and two boolean flags indicating whether each leg is in contact with the ground.

Positive rewards are received for landing (100-140 points, depending on the position), with an additional +100 if the lander comes to rest. Firing the engines gives small negative rewards (-0.3 per frame for the main engine, -0.03 for a side engine), and crashing gives -100. The problem is considered solved at a score of 200 points.
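
For illustration, a minimal random-agent rollout against this environment could look like the sketch below (it uses the Gym 0.26 API listed under Dependencies and is not part of the repository code):

    import gym

    # Minimal random-agent rollout illustrating the LunarLander-v2 interface
    # (Gym >= 0.26 API: reset() returns (obs, info), step() returns 5 values).
    env = gym.make("LunarLander-v2")
    obs, info = env.reset(seed=0)           # 8-dimensional observation vector
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()  # one of the 4 discrete actions
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    print(f"Episode return: {total_reward:.2f}")
    env.close()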

The following RL algorithms were implemented:

  • Neural Fitted Q Iteration (NFQ)
  • Deep Q-Network (DQN)
  • REINFORCE with baseline / Vanilla Policy Gradient (VPG)
  • Advantage Actor Critic (AC)

For a fair comparison, all algorithms use a 2-layer MLP (128 and 64 hidden units) and a discount factor of 0.999; only the learning rate is set individually per algorithm. A sketch of this network is shown below.
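
A minimal sketch of such a network in PyTorch could look as follows (activations and output handling are assumptions, not taken from the repository code):

    import torch.nn as nn

    # Sketch of the shared 2-layer MLP with 128 and 64 hidden units.
    # The output is interpreted as Q-values or action logits, depending on the algorithm.
    def make_mlp(obs_dim: int = 8, n_outputs: int = 4) -> nn.Sequential:
        return nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, n_outputs),
        )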

How to

Install dependencies with pip install -r requirements.txt.

Run main.py train <agent> <episodes> to train an agent.

Run main.py evaluate <agent> <episodes> <render> to evaluate a pre-trained agent.

  • <agent> (string): NFQ, DQN, VPG or AC
  • <episodes> (int): number of episodes
  • <render> (bool): display episodes on screen
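
For example, to train a DQN agent for 1000 episodes and then evaluate it for 100 episodes with rendering (argument formats assumed from the descriptions above):

    python main.py train DQN 1000
    python main.py evaluate DQN 100 True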

Neural Fitted Q Iteration

[Figures: training curve and agent behavior after 2000 episodes]

Reference: M. Riedmiller (2005) Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method
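
The core idea of NFQ is batch-mode fitted Q-iteration: collect transitions, compute bootstrapped targets over the whole dataset, and fit the Q-network to them by supervised regression. A minimal sketch of one such iteration, assuming tensors over the full transition set (names and the training loop are assumptions, not the repository code):

    import torch
    import torch.nn as nn

    def nfq_iteration(q_net, optimizer, transitions, gamma=0.999, epochs=100):
        # Tensors covering the entire collected dataset (batch mode).
        states, actions, rewards, next_states, dones = transitions
        with torch.no_grad():
            # Targets: r + gamma * max_a' Q(s', a'), no bootstrap at terminal states.
            targets = rewards + gamma * (1.0 - dones.float()) * q_net(next_states).max(dim=1).values
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
            loss = loss_fn(q_values, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()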

Deep Q-Network

[Figures: training curve and agent behavior after 1000 episodes]

Reference: V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with Deep Reinforcement Learning
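
The central DQN ingredient is online learning from an experience replay buffer: sample a random mini-batch of past transitions and take one gradient step on the TD error. A hedged sketch (buffer layout, batch size and loss are assumptions, not the repository code):

    import random
    import torch
    import torch.nn as nn

    def dqn_update(q_net, optimizer, replay_buffer, batch_size=64, gamma=0.999):
        # replay_buffer: list of (state, action, reward, next_state, done) tensor tuples
        batch = random.sample(replay_buffer, batch_size)
        states, actions, rewards, next_states, dones = map(torch.stack, zip(*batch))
        with torch.no_grad():
            targets = rewards + gamma * (1.0 - dones.float()) * q_net(next_states).max(dim=1).values
        q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_values, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()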

REINFORCE with baseline / Vanilla Policy Gradient

[Figures: training curve and agent behavior after 5000 episodes]

Reference: R. Sutton and A. Barto (2018) Reinforcement Learning: An Introduction, p. 328

Reference: OpenAI, Spinning Up in Deep RL: Vanilla Policy Gradient
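
REINFORCE with baseline updates the policy once per episode from Monte Carlo returns, using a learned state-value function as the baseline. A hedged sketch of such an update (variable names and optimizers are assumptions, not the repository code):

    import torch

    def vpg_update(policy_opt, value_opt, log_probs, values, rewards, gamma=0.999):
        # Discounted returns G_t, computed backwards over the episode.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns)
        values = torch.stack(values).squeeze(-1)
        log_probs = torch.stack(log_probs)

        advantages = returns - values.detach()          # baseline-corrected returns
        policy_loss = -(log_probs * advantages).mean()
        value_loss = torch.nn.functional.mse_loss(values, returns)

        policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
        value_opt.zero_grad(); value_loss.backward(); value_opt.step()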

Advantage Actor Critic

[Figures: training curve and agent behavior after 1000 episodes]

Reference: RL Course by David Silver - Lecture 7: Policy Gradient Methods
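
In the actor-critic setting the update can be made per step, with the TD error serving as an estimate of the advantage. A hedged sketch for a single transition, assuming separate actor and critic networks (names are not taken from the repository code):

    import torch

    def ac_update(policy_opt, value_opt, log_prob, value, next_value, reward, done, gamma=0.999):
        # TD target and TD error (advantage estimate).
        target = reward + gamma * (1.0 - done) * next_value.detach()
        advantage = (target - value).detach()

        actor_loss = -log_prob * advantage
        critic_loss = torch.nn.functional.mse_loss(value, target)

        policy_opt.zero_grad(); actor_loss.backward(); policy_opt.step()
        value_opt.zero_grad(); critic_loss.backward(); value_opt.step()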

Comparison

The score is the average return of each trained agent over 100 evaluation episodes.

Algorithm                     Score
Neural Fitted Q Iteration    -24.90
Deep Q-Network               271.47
Vanilla Policy Gradient      172.49
Advantage Actor Critic       205.77
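
Such a score can be computed with a simple evaluation loop like the sketch below (agent.act is a hypothetical interface, not necessarily the repository's):

    import gym
    import numpy as np

    def evaluate(agent, n_episodes=100):
        env = gym.make("LunarLander-v2")
        returns = []
        for _ in range(n_episodes):
            obs, _ = env.reset()
            done, total = False, 0.0
            while not done:
                action = agent.act(obs)             # greedy/deterministic action
                obs, reward, terminated, truncated, _ = env.step(action)
                total += reward
                done = terminated or truncated
            returns.append(total)
        env.close()
        return float(np.mean(returns))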

Dependencies

  • Python v3.10.9
  • Gym v0.26.2
  • Matplotlib v3.6.2
  • Numpy v1.24.1
  • Pandas v1.5.2
  • PyTorch v1.13.1
  • Tqdm v4.64.1
  • Typer v0.7.0