RocketLander ✨🚀✨

Introduction

RocketLander is a comprehensive framework equipped with optimization algorithms, such as reinforcement learning, evolution strategies, genetic optimization, and simulated annealing, to enable a simplified version of a Falcon 9 rocket booster to land autonomously. RocketLander is designed to be simple to use and can be easily extended.

The framework uses PyBox2D, a 2D physics library for rigid-body simulations, and PyGame for rendering and visualization.

I tried to make the simulation relatively realistic, even though that may conflict with Box2D's recommendations on object sizes. The booster has a height of about $46$ meters, a mass of about $25$ metric tons, and is made up of three parts: a long, low-density hull section containing mostly empty fuel tanks; a short, high-density engine section; and static, medium-density landing legs.

In this framework, a booster is considered an agent that is equipped with a neural network (the agent's "brain") to learn how to propulsively land itself. The network is trained using reinforcement learning, evolution strategies, genetic optimization, or simulated annealing.

The neural network controls the actions of the booster. At each time step, the network receives the current state of the booster, which includes its position $r_x$ and $r_y$, velocity $v_x$ and $v_y$, angle $\theta$, and angular velocity $\omega$, as input. Based on this information, the network predicts an action, such as the levels of thrust and engine deflection.
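For illustration, a state-to-action mapping of this kind can be sketched as a small feed-forward network. The layer sizes, names, and the interpretation of the two outputs as thrust level and engine deflection are hypothetical choices for this sketch, not necessarily the architecture used in the framework:

import torch
import torch.nn as nn

# Hypothetical policy network: 6 state inputs -> 2 continuous actions.
# The state is (r_x, r_y, v_x, v_y, theta, omega); the two outputs are
# interpreted here as thrust level and engine deflection, both in [-1, 1].
class PolicyNetwork(nn.Module):
    def __init__(self, num_states: int = 6, num_actions: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_states, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, num_actions),
            nn.Tanh(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: predict an action for a single state vector.
state = torch.tensor([0.0, 100.0, 0.0, -20.0, 0.05, 0.0])
action = PolicyNetwork()(state)  # tensor of shape (2,)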

Installation

To run RocketLander, install the latest master directly from GitHub. For a basic install, run:

git clone https://github.com/kaifishr/RocketLander
cd RocketLander 
pip3 install -r requirements.txt

Getting Started

To start a training session using a specified learning method, run one of the examples in the project folder. For example:

cd RocketLander
python -m projects.deep_q_learning.main
python -m projects.policy_gradient.main
python -m projects.evolution_strategies.main
python -m projects.genetic_optimization.main
python -m projects.simulated_annealing.main

Rendering and visualization can be turned off by clicking on the PyGame window or by pressing the SPACEBAR, which greatly accelerates the optimization process. Pressing P pauses the optimization.

Track important metrics such as the reward with TensorBoard:

cd RocketLander
tensorboard --logdir runs/

Methods

Notation

In this project, the terms booster, agent, individual, and candidate are used interchangeably, as are the terms epoch and episode. In the context of genetic optimization, fitness and reward refer to the same quantity.

Reward Function

Independent of the optimization method, the same reward function is used to measure the agent's success during each episode. The reward function receives the booster's current position and velocity as input and outputs a scalar value. A simple reward function for landing a rocket booster can be designed as follows.

To encourage the booster to land as close as possible to the center of the landing pad, we can assign a high reward for proximity. For example, we can assign a reward of $1$ for a landing at the center of the landing pad, and reduce the reward to $0$ as the distance between the booster and the landing pad increases.

For the distance from booster to landing pad $d = \sqrt{(r_{x, \text{pad}} - r_{x, \text{booster}})^2 + (r_{y, \text{pad}} - r_{y, \text{booster}})^2}$, this can be formulated as follows:

$$R_{\text{proximity}} = \frac{1}{1 + \alpha \cdot d}$$

with the $x$- and $y$-coordinates of the landing pad and the booster. To avoid a rapid unscheduled disassembly of the booster, we define a term that takes the booster's instantaneous speed $v = \sqrt{v_x^2 + v_y^2}$ into account,

$$R_{\text{velocity}} = \frac{1}{1 + \beta \cdot v}$$

with the $x$- and $y$-components of the booster's velocity. This reward is coupled with the booster's distance to the landing pad to encourage a soft landing. The hyperparameters $\alpha$ and $\beta$ allow us to emphasize the reward coming from either proximity or velocity.

Now we can use a soft XNOR coupling of both reward terms to create a reward function that encourages a fast approach combined with a soft landing at the center of the landing pad:

$$R = R_{\text{proximity}} \cdot R_{\text{velocity}} + (1 - R_{\text{proximity}}) \cdot (1 - R_{\text{velocity}})$$

As a simple alternative, the following reward function that takes only the distance to the landing pad into account also works reasonably well:

$$R = \gamma R_{\text{proximity}}$$

where $\gamma$ scales the contribution of the proximity reward.

In addition to the reward functions above, the agent also receives a large positive reward for a successful landing and a negative reward if the vehicle exceeds a certain stress level, experiences an impact, or leaves the domain.

To encourage the booster to land quickly, the agent receives a small negative reward at every simulation step. Limiting the number of simulation steps additionally models an implicit fuel restriction, since the agent still tries to maximize its total reward within that budget.
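Putting the pieces together, the reward for a single simulation step can be sketched as a small function. The variable names and the magnitudes of the terminal bonus, penalty, and per-step cost below are illustrative assumptions:

import math

def reward(x, y, vx, vy, pad_x, pad_y, alpha=0.01, beta=0.1,
           landed=False, crashed=False, step_cost=0.01):
    """Reward for one simulation step (illustrative sketch)."""
    # Proximity term: 1 at the center of the landing pad, decaying with distance.
    d = math.sqrt((pad_x - x) ** 2 + (pad_y - y) ** 2)
    r_proximity = 1.0 / (1.0 + alpha * d)

    # Velocity term: 1 at rest, decaying with speed.
    v = math.sqrt(vx ** 2 + vy ** 2)
    r_velocity = 1.0 / (1.0 + beta * v)

    # Soft XNOR coupling: fast far away from the pad, slow close to it.
    r = r_proximity * r_velocity + (1.0 - r_proximity) * (1.0 - r_velocity)

    # Small negative reward at every step to encourage a fast landing.
    r -= step_cost

    # Terminal bonus and penalty (illustrative magnitudes).
    if landed:
        r += 100.0
    elif crashed:
        r -= 100.0
    return r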

Asynchronous Reinforcement Learning

Reinforcement learning is without a doubt one of the most interesting subfields of machine learning, in which an agent (here the booster) learns to make decisions in an environment by interacting with it and receiving rewards for its actions. This project comes with deep Q-learning and a policy gradient reinforcement learning algorithm.

In reinforcement learning, the goal of the agent is to learn a policy, represented by the booster's neural network, that maximizes the reward over time. In the case of Q-learning, the neural network maps the booster's state to a Q-value for every available action, which the agent uses to determine the best action to take in a given situation. Learning typically proceeds by trial and error, with the agent learning from its mistakes and adjusting its policy over time to improve its performance.

The implemented reinforcement learning algorithms use a batch of episodes recorded by multiple boosters in parallel to learn a policy that maximizes the reward.

To train an agent with deep Q-learning, we proceed as follows. We start by initializing our policy (the neural network) randomly. To ensure the agent explores its environment, we inject noise into its action space: we either choose a random action from our discrete action space, or we take the action with the highest Q-value predicted by the policy network for the given state.
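A minimal sketch of this epsilon-greedy exploration, assuming a hypothetical q_network that outputs one Q-value per discrete action:

import random

import torch

def select_action(q_network, state: torch.Tensor, epsilon: float, num_actions: int) -> int:
    # Explore: with probability epsilon, pick a random action.
    if random.random() < epsilon:
        return random.randrange(num_actions)
    # Exploit: otherwise pick the action with the highest predicted Q-value.
    with torch.no_grad():
        return int(torch.argmax(q_network(state)).item())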

As an aside, this is different from the evolution strategies presented below, where noise is injected directly into the agent's parameter space.

As the agent interacts with the environment, we record its state, action, and the reward received at each time step. We can then use backpropagation to encourage state-action pairs that result in positive or high rewards and discourage those with negative or low rewards.
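For deep Q-learning, such an update over a batch of recorded transitions can be sketched as follows. This is a simplified, single-network version without a target network or replay-buffer details, with hypothetical variable names:

import torch
import torch.nn.functional as F

def dqn_update(q_network, optimizer, batch, gamma: float = 0.99) -> float:
    # batch: states (B, S) float, actions (B,) long, rewards (B,) float,
    #        next_states (B, S) float, dones (B,) float in {0, 1}.
    states, actions, rewards, next_states, dones = batch

    # Q-values of the actions that were actually taken.
    q_values = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped targets: reward plus discounted value of the next state.
    with torch.no_grad():
        next_q = q_network(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    # Backpropagation pulls the predicted Q-values towards the targets,
    # encouraging state-action pairs that led to high rewards.
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()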

Action Space

Q-learning is a type of reinforcement learning algorithm that is typically used in situations where the action space is discrete. This means that the possible actions that can be taken in a given situation are finite and distinct, rather than continuous or infinitely variable.

When applying Q-learning to the problem of landing a booster, we need to discretize the action space and represent it as a lookup table from which the agent can choose an action. For example, if the main engine operates at two thrust levels and can be deflected to three different angles $\{-15°, 0°, 15°\}$, the discrete action space of the booster consists of six possible actions, as shown below:

#   Thrust level   Angle
1   50%            -15°
2   50%              0°
3   50%             15°
4   100%           -15°
5   100%             0°
6   100%            15°
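Such a lookup table can be represented, for example, as a list of (thrust level, angle) pairs. The following sketch uses the hypothetical values from the table above:

from itertools import product

# Discrete action space: every combination of thrust level and engine angle.
thrust_levels = [0.5, 1.0]        # 50% and 100% of maximum thrust
angles_deg = [-15.0, 0.0, 15.0]   # engine deflection in degrees

action_table = list(product(thrust_levels, angles_deg))
# The agent then selects an action by index, e.g. action_table[3] -> (1.0, -15.0).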

However, when dealing with engines that can operate at many levels of thrust and deflection angles (such as a main engine plus cold gas thrusters), discretizing the action space can make learning inefficient. This is due to the curse of dimensionality: the number of discrete actions grows exponentially with the number of control dimensions, making it difficult to find good solutions. For example, a main engine with five thrust levels and five deflection angles combined with two cold gas thrusters with five thrust levels each already yields $5^4 = 625$ distinct actions.

Evolution Strategies

Evolution strategies is a class of black-box stochastic optimization techniques that have achieved impressive results on reinforcement learning benchmarks. Despite their name, evolution strategies optimization has very little in common with genetic optimization. At its core, the evolution strategies optimization algorithm resembles simple hill-climbing in a high-dimensional space. It samples a population of candidate solutions and allows agents with high rewards to have a greater influence on the distribution of future generations.

Despite its simplicity, the evolution strategies algorithm is quite powerful and overcomes many of reinforcement learning's inconveniences. Optimization with evolution strategies is highly parallelizable, makes no assumptions about the underlying model being trained, allows interactions between agents by default, and is not restricted to a discrete action space.

For the evolution strategies algorithm to work, we only have to look at the parameterized reward function $R(\mathbf{s}; \mathbf{\theta})$, which takes a state vector and outputs a scalar reward. During the optimization process, we estimate gradients that allow us to steer the parameters $\mathbf{\theta}$ in a direction that maximizes the reward $R$. Thus, we are optimizing $R$ with respect to $\mathbf{\theta}$.

At each optimization step, the evolution strategies algorithm generates a population of different parameter configurations $\mathbf{\theta}_i$ (the agents' neural network weights) from the base parameters $\mathbf{\theta}$ by adding Gaussian noise ( $\mathbf{\theta}_i = \mathbf{\theta} + \mathbf{\epsilon}_i$ with $\mathbf{\epsilon}_i \sim \mathcal{N}(0, \sigma^2)$ and $i \in [1, N]$ ). After each agent has spent one episode in the environment, a weighted sum over the agents' parameter vectors is computed, with the weights determined by the rewards the agents obtained. This weighted sum becomes the new base parameter vector.

Mathematically, evolution strategies uses finite differences along a few random directions at each optimization step to estimate the gradients of the reward function $R$ with respect to the parameter vector $\mathbf{\theta}$. The estimated gradient is then used to update the policy network's parameters in a direction that increases the reward. This process is repeated until the desired level of performance is reached.
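One generation of this procedure can be sketched as follows. This is a simplified NumPy version of the estimator described by Salimans et al.; evaluate_reward is a hypothetical function that runs one episode with the given parameter vector and returns its total reward:

import numpy as np

def es_step(theta, evaluate_reward, population_size=50, sigma=0.1, lr=0.01):
    # Sample one Gaussian noise vector per candidate: epsilon_i ~ N(0, I).
    noise = np.random.randn(population_size, theta.size)

    # Let every perturbed candidate spend one episode in the environment.
    rewards = np.array([evaluate_reward(theta + sigma * eps) for eps in noise])

    # Normalize rewards so better-than-average candidates pull theta towards
    # their perturbation and worse ones push it away.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Gradient estimate: reward-weighted sum of the noise directions.
    gradient = noise.T @ advantages / (population_size * sigma)
    return theta + lr * gradient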

Genetic Optimization

Inspired by evolution, genetic optimization uses a population of individuals that are slightly different from each other. These differences result from mutation, a fundamental property of evolution, and lead to different behaviors in each agent. The difference in behavior makes some agents more successful than others. The fitness or success of an agent is measured by the fitness or reward function.

The algorithm begins with a population of candidate solutions, from which the agent with the highest fitness level advances to the next round. After selection, the fittest individual propagates its genetic traits to the next generation, with random mutations. This process is repeated until the desired fitness level is reached.

Here, we use the mutation operation during the optimization process to learn to land the rocket booster. Mutation acts on all members of a population and is defined by the probability with which a parameter is mutated and by the rate, or strength, with which the mutation perturbs the parameter.
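A sketch of such a mutation operator acting on a flat parameter vector, where mutation_prob is the probability that an individual parameter is mutated and mutation_rate is the strength of the perturbation (both names are hypothetical):

import numpy as np

def mutate(params: np.ndarray, mutation_prob: float = 0.05,
           mutation_rate: float = 0.1) -> np.ndarray:
    # Decide independently for every parameter whether it mutates.
    mask = np.random.rand(params.size) < mutation_prob
    # Perturb the selected parameters with Gaussian noise of the given strength.
    noise = mutation_rate * np.random.randn(params.size)
    return params + mask * noise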

Asynchronous Simulated Annealing

In 1983, Kirkpatrick et al. combined insights from heating and cooling materials to change their physical properties with the Monte Carlo-based Metropolis-Hastings algorithm to find approximate solutions to the traveling salesman problem. This combination led to the technique of simulated annealing. It has been shown that, with a sufficiently high initial temperature and a sufficiently long cooling time, the system reaches its minimum-energy state.

In a nutshell, simulated annealing selects at each iteration a randomly created candidate solution that is close to the current one under some distance metric. The system moves to the proposed solution either if it comes with a higher reward or, otherwise, with a temperature-dependent probability. As the temperature decreases, the probability of accepting worse solutions shrinks, and the optimization focuses more and more on improving solutions, eventually behaving like a greedy local search.
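The acceptance rule can be sketched as a Metropolis-style criterion, where the reward plays the role of a negative energy and the temperature decreases over the course of the optimization (names and cooling schedule are illustrative assumptions):

import math
import random

def accept(reward_new: float, reward_old: float, temperature: float) -> bool:
    # Always accept candidate solutions that improve the reward.
    if reward_new >= reward_old:
        return True
    # Accept worse solutions with a temperature-dependent probability.
    return random.random() < math.exp((reward_new - reward_old) / temperature)

# Inside the optimization loop, the temperature is typically lowered over time,
# e.g. with a geometric cooling schedule: temperature *= cooling_factor.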

In a previous project called NeuralAnnealing, I implemented simulated annealing with JAX to optimize neural networks for a classification task. I had good results with this approach, so I thought I would try using it here.

TODOs

  • Add explicit fuel constraint.
  • Add dynamic landing legs.

References

Salimans et al., 2017, Evolution Strategies as a Scalable Alternative to Reinforcement Learning

Yaghmaie et al., 2021, A Crash Course on Reinforcement Learning

Box2D website

Box2D C++ documentation

PyBox2D on Github

Backends for PyBox2D

PyBox2D tutorial

Minimal PyBox2D examples

Citation

If you find this project useful, please use BibTeX to cite it as:

@article{fischer2022rocketlander,
  title   = "RocketLander",
  author  = "Fischer, Kai",
  journal = "GitHub repository",
  year    = "2022",
  month   = "December",
  url     = "https://github.com/kaifishr/RocketLander"
}

License

MIT
