# Outer Space with Reinforcement Learning
##### Nikolay Zhechev, SoftUni
##### GitHub link: https://github.com/nikolayzhechev/Outer-Space-W-Reinforcement-Learning

<div style="max-width: 1200px;">
    
### Abstract
Space exploration methods improve every day. Humanity is at the forefront of a journey to explore space and reach new stages of development. Traversing the vast expanse of space is undertaken mostly by machines and craft that are operated mainly by humans. It is very plausible that, for the foreseeable future, most exploration will be performed by machines that are classified as robots.  
Robots have been sent out multiple times, most notably to Mars, where they perform substantial exploration of the surface on their own, accompanied by human supervision and instructions. Similarly, in space environments, craft perform manoeuvres and make decisions based on telemetry and computer programming.  
Spacecraft require mission control and many people on the ground for communication and frequent monitoring. Artificial intelligence (AI) can step in and take control of spacecraft, making them more effective and autonomous. In general, this evolution of robots and machinery seems natural. Given the vast expanse of space and the range of possibilities in space exploration, an AI robot that learns and adapts to its environment is well suited to the task.  
One process that produces good results is reinforcement learning (RL). Here the robot is placed in an environment where it can perform actions. While traversing it, the robot is rewarded or punished based on the action taken. This approach is put to the test using multiple techniques. This project presents a few starting points and explores existing methods and technologies for RL in space exploration.

</div>

<div style="max-width: 1200px;">

### Introduction

##### Briefly on RL
The main characters of RL are the agent and the environment. Agents are put in an unfamiliar environment and are rewarded or punished based on their actions. The main goal of the agent is to maximize the cumulative reward while traversing the environment. Briefly, RL involves a **state** $s$, which is a complete description of the world, and an **observation** $o$, which is a partial or full description of the state. The set of valid actions an agent can take is the **action space**, which comes in two kinds: **discrete**, where a finite number of moves is available to the agent, and **continuous**, where actions are real-valued vectors, as when the agent controls a robot in a physical world. The type of action space affects which algorithms can be used. Importantly, the agent decides what action to take through a **policy**, which can be deterministic ( $a_t = \mu(s_t)$ ) or stochastic ( $a_t \sim \pi(\cdot \mid s_t)$ ). Acting according to the policy produces a trajectory, the sequence of states and actions in the world. The goal is to maximize the reward collected over a trajectory, as defined by the **reward function** ($R$), which depends on the current state of the world, the action taken and the next state ($r_t = R(s_t, a_t, s_{t+1})$).  
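
The pieces above can be tied together in a small sketch. It is a toy illustration only: the 1-D world, the function names and the reward are assumptions made for this example and are not taken from the project code. It shows a deterministic policy $\mu(s)$, a stochastic policy $\pi(\cdot \mid s)$ and the reward summed over a short trajectory.

```python
import math
import random


def deterministic_policy(state):
    """mu(s): always step toward the origin of a 1-D line (hypothetical world)."""
    return -1 if state > 0 else 1


def stochastic_policy(state, rng):
    """pi(. | s): sample a step from a softmax that favours moving toward 0."""
    preferences = {-1: state, 1: -state}
    weights = [math.exp(preferences[a]) for a in (-1, 1)]
    return rng.choices((-1, 1), weights=weights, k=1)[0]


def reward_fn(state, action, next_state):
    """R(s_t, a_t, s_{t+1}): negative distance to the origin after the move."""
    return -abs(next_state)


def rollout(policy, start_state, steps):
    """Follow a policy for a fixed number of steps and sum the rewards."""
    state, total = start_state, 0.0
    for _ in range(steps):
        action = policy(state)
        next_state = state + action
        total += reward_fn(state, action, next_state)
        state = next_state
    return total


rng = random.Random(0)
det_return = rollout(deterministic_policy, start_state=3, steps=5)
sto_return = rollout(lambda s: stochastic_policy(s, rng), start_state=3, steps=5)
```

The deterministic policy always produces the same trajectory from the same start, while the stochastic one can differ between runs unless the random generator is seeded.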

##### Words on environments
Environment definition depends on the task at hand. While there are multiple available environments for prototyping and experimentation, setting up a custom environment for this task was a primary goal. A custom environment, however, requires substantial testing and debugging as well as some domain knowledge. A good approach is to start with a simple base environment and run tests on top of it, gradually increasing the environment complexity and adding further features while iterating over the results.

##### Approach
The approach here is to test on multiple environments; most are simple, but a few include more complex aspects. The RL agents performed much better on the simpler environments, which were iteratively improved and updated. Multiple algorithms were tested, and initial measurements and results are shown. Additionally, there are tests with popular frameworks and tools in the domain of RL. Some are harder to get started with, and some expose meaningful details more readily.  
A description of the reward scheme and policy is provided, and the creation of the environment is explained further. The code implementation is available.  
The methods and results are shared in the next section, which covers general spacecraft mission challenges and ideas, example trajectories and implementations, as well as results and comparisons.

</div>

<div style="max-width: 1200px;">

### Methods and Results
There are a few questions that are a good start for this task: How will an algorithm perform on our environment? Can the algorithm's hyperparameters be tuned so that it learns and increases its rewards? Or should the environment be tuned, so that the algorithm is more effective? What do we measure? How many time steps will each trial or run contain? If a terminal state is not reached after n steps, will we artificially terminate the episode? If the task is episodic, how many episodes should we run?

Defining main parameters and structure:  
- compute environment state into observation;
- `reset()` function to initiate a new episode for the environment;
- `step()` to compute the new state of the environment;
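
A minimal sketch of this structure is shown below. It is a plain Python class that only mirrors the Gymnasium convention (`reset()` returns an observation and an info dict, `step()` returns observation, reward, terminated, truncated and info). The class name `PointSpacecraftEnv`, the 2-D point-mass dynamics and all numeric values are assumptions for illustration and are not the project's environment.

```python
import math


class PointSpacecraftEnv:
    """Toy 2-D point-mass environment (hypothetical, for illustration only)."""

    # Discrete actions: unit thrust along +x, -x, +y, -y, or coast.
    ACTIONS = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (0.0, 0.0)]

    def __init__(self, target=(5.0, 0.0), max_fuel=20.0, max_steps=50, dt=1.0):
        self.target = target
        self.max_fuel = max_fuel
        self.max_steps = max_steps
        self.dt = dt
        self.reset()

    def _distance(self):
        return math.dist(self.pos, self.target)

    def _observation(self):
        # Compute the observation from the state:
        # position, velocity, remaining fuel, distance to target.
        return (*self.pos, *self.vel, self.fuel, self._distance())

    def reset(self):
        """Start a new episode at the origin with a full tank."""
        self.pos = [0.0, 0.0]
        self.vel = [0.0, 0.0]
        self.fuel = self.max_fuel
        self.steps = 0
        return self._observation(), {}

    def step(self, action):
        """Advance the state by one time step for the chosen action index."""
        ax, ay = self.ACTIONS[action]
        if self.fuel <= 0.0:
            ax, ay = 0.0, 0.0  # no thrust without fuel
        self.fuel -= abs(ax) + abs(ay)
        self.vel[0] += ax * self.dt
        self.vel[1] += ay * self.dt
        self.pos[0] += self.vel[0] * self.dt
        self.pos[1] += self.vel[1] * self.dt
        self.steps += 1

        reached = self._distance() < 0.5
        terminated = reached or self.fuel <= 0.0  # goal reached or out of fuel
        truncated = self.steps >= self.max_steps  # artificial time limit
        reward = -self._distance() + (100.0 if reached else 0.0)
        return self._observation(), reward, terminated, truncated, {}


env = PointSpacecraftEnv()
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(0)  # thrust along +x
```

Separating `terminated` (the task itself ended) from `truncated` (the step budget ran out) keeps the answer to "will we artificially terminate the episode?" explicit in the code.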


When will a reward be given? 
- spacecraft's target destination;
- efficiency or precision;

When will termination occur?
- goal reached;
- loss of fuel;
- collision with another object;
- a path error that makes the final destination impossible to reach;

Agent Goals:
- Landing: Soft-landing on a planetary surface;
- Interplanetary Transfer: Navigating from one celestial body to another;
- Orbit Change: Transitioning between different orbits efficiently;

**Observation space:**
- perceiving the environment;
- include enough details and information for the agent;
- consider: frame of reference, coordinate system (Cartesian, Keplerian), sensor errors (noise);

State variables:

| Variable      | Description   |
| ------------- | ------------- |
| $ x, y, z $ | Position in space (Cartesian) or orbital elements |
| $ v_x, v_y, v_z $ | Velocity vector |
| $ \theta, \phi, \psi $ | Orientation (Euler angles) |
| $ w_x, w_y, w_z $ | Angular velocity |
| $ f $ | Remaining fuel |
| $ d_{target} $ | Distance to target |

**Action space:**
- thruster commands;
- consider: thruster limitation, fuel efficiency (more thrust = more fuel consumption), control delays;

Action types:

| Action Type      | Example   |
| ------------- | ------------- |
| Continuous thrust | [Thrust in X, Y, Z] |
| Discrete thrusters | ON/OFF for RCS jets |
| Rotational control | [Torque in Roll, Pitch, Yaw] |


**Physics:**  
Physics considerations and implementation plus formulas and general physics laws.  

- Newtonian Dynamics:  
For simpler free space and no gravity motion.  
$ F = ma $ to update velocity and position (Newton's Second Law of Motion states that when a force acts on an object, it will cause the object to accelerate. The larger the object's mass, the greater the force will need to be to cause it to accelerate. This Law may be written as force = mass x acceleration).
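
As a sketch of how $F = ma$ can drive a simulation step, the function below updates velocity from the acceleration and then position from the new velocity (semi-implicit Euler integration). The function name `newton_step` and the numbers are assumptions for this example; the project's own integrator may differ.

```python
def newton_step(position, velocity, force, mass, dt):
    """One semi-implicit Euler step: a = F / m, then v += a * dt, then x += v * dt."""
    acceleration = [f / mass for f in force]
    new_velocity = [v + a * dt for v, a in zip(velocity, acceleration)]
    new_position = [p + v * dt for p, v in zip(position, new_velocity)]
    return new_position, new_velocity


# Constant 10 N thrust along x on a 5 kg craft, three 1-second steps.
pos, vel = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
for _ in range(3):
    pos, vel = newton_step(pos, vel, force=[10.0, 0.0, 0.0], mass=5.0, dt=1.0)
```

With no force applied the velocity stays constant, which gives the momentum behaviour expected in free space.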


- Orbital Mechanics:  
Consider orbital mechanics and forces from celestial bodies.  
Every object in the universe attracts every other body in the universe with a force directed along the line of centers of the two objects
that is proportional to the product of their masses and inversely proportional to the square of the distance between them,
$$ F = \frac{G m_1 m_2}{d^2} $$
In this relation, $G$ is the Newton gravitational constant, $m_1$ and $m_2$ are the
masses of the primary and secondary bodies, and $d$ is the distance between them.  
Kepler’s Equation: it relates the Mean Anomaly ($M$) to the Eccentric Anomaly ($E$) for an orbit with eccentricity $e$:
$$ M = E - e \sin E $$
where:
    - $M$ = Mean Anomaly ($M = n(t - t_0)$), where $n = \frac{2\pi}{T}$ is the mean motion, $T$ is the orbital period, and $t_0$ is a reference epoch.
    - $E$ = Eccentric Anomaly.
    - $e$ = Orbital eccentricity.  
    Once $E$ is found, the True Anomaly $\nu$ (the angle between the position vector and the periapsis) can be determined using:
$$ \tan \frac{\nu}{2} = \sqrt{\frac{1+e}{1-e}} \tan \frac{E}{2}$$
The position in the orbital plane is given by:
$$ r = \frac{a(1 - e^2)}{1 + e \cos \nu}$$
where:
    - $r$ = radial distance from the focus (central body).  
    - $a$ = semi-major axis.  
  The Cartesian coordinates in the orbital plane (before transformation to an inertial frame) are:
$$ x = r \cos \nu$$ $$ y = r \sin \nu $$
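
Kepler's equation has no closed-form solution for $E$, so it is usually solved numerically. The sketch below uses Newton-Raphson iteration and then applies the true-anomaly and radius formulas above. It assumes an elliptical orbit ($0 \le e < 1$); the function names, the starting guess and the example values (a semi-major axis of 7000, read as km) are illustrative choices, not part of the project code.

```python
import math


def solve_kepler(mean_anomaly, eccentricity, tol=1e-12, max_iter=50):
    """Solve M = E - e*sin(E) for E by Newton-Raphson (assumes 0 <= e < 1)."""
    E = mean_anomaly if eccentricity < 0.8 else math.pi  # common starting guess
    for _ in range(max_iter):
        residual = E - eccentricity * math.sin(E) - mean_anomaly
        delta = residual / (1.0 - eccentricity * math.cos(E))
        E -= delta
        if abs(delta) < tol:
            break
    return E


def orbital_plane_position(a, e, M):
    """Return (x, y, r, nu, E) in the orbital plane for mean anomaly M."""
    E = solve_kepler(M, e)
    # atan2 form of tan(nu/2) = sqrt((1+e)/(1-e)) * tan(E/2); stays finite at E = pi.
    nu = 2.0 * math.atan2(
        math.sqrt(1.0 + e) * math.sin(E / 2.0),
        math.sqrt(1.0 - e) * math.cos(E / 2.0),
    )
    r = a * (1.0 - e**2) / (1.0 + e * math.cos(nu))
    return r * math.cos(nu), r * math.sin(nu), r, nu, E


x, y, r, nu, E = orbital_plane_position(a=7000.0, e=0.1, M=1.0)
```

A quick sanity check is that the radius computed from the true anomaly agrees with the equivalent expression $r = a(1 - e \cos E)$.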


- Attitude Control:  
Rotational dynamics for docking, planetary entry, or pointing. Use of *quaternions* for rotation.
Euler's theorem: The orientation of a body is uniquely specified by a vector giving the direction of a body axis and a scalar specifying a rotation angle about the axis. 
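
The axis and angle from Euler's theorem map directly onto a unit quaternion. The sketch below builds a quaternion from an axis and an angle and uses it to rotate a vector. It assumes the scalar-first convention $(w, x, y, z)$ and the Hamilton product; the function names are hypothetical and chosen for this example.

```python
import math


def quat_from_axis_angle(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about `axis`."""
    norm = math.sqrt(sum(c * c for c in axis))
    s = math.sin(angle / 2.0) / norm
    return (math.cos(angle / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)


def quat_multiply(p, q):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


def rotate_vector(q, v):
    """Rotate 3-vector v by unit quaternion q using q * v * conj(q)."""
    w, x, y, z = q
    conjugate = (w, -x, -y, -z)
    _, rx, ry, rz = quat_multiply(quat_multiply(q, (0.0, *v)), conjugate)
    return (rx, ry, rz)


# A 90 degree rotation about the z axis applied to the x unit vector.
q = quat_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2.0)
rotated = rotate_vector(q, (1.0, 0.0, 0.0))
```

Quaternions avoid the gimbal lock that Euler angles can run into, which is one reason they are a common choice for attitude representation.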

- Fuel Consumption Model:  
Thrust burns should reduce remaining fuel. Thruster inefficiencies should be considered. Usage of Tsiolkovsky’s rocket equation:
$$ \varDelta v = v_e \ln \left(\frac{m_0}{m_f}\right) $$
where:
    - $\varDelta v$ is the change in velocity (delta-v);
    - $ v_e $ is the exhaust velocity;
    - $ m_0 $ is the initial mass of the rocket;
    - $ m_f $ is the final mass of the rocket after the propellant is expended;
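
As a small worked example of the rocket equation, the sketch below computes the delta-v of a burn and, by inverting the same equation, the propellant needed for a requested delta-v. The function names and the numbers (exhaust velocity 3000 m/s, initial mass 1000 kg) are assumptions for illustration.

```python
import math


def delta_v(exhaust_velocity, initial_mass, final_mass):
    """Tsiolkovsky rocket equation: dv = v_e * ln(m_0 / m_f)."""
    return exhaust_velocity * math.log(initial_mass / final_mass)


def propellant_for_burn(exhaust_velocity, initial_mass, dv):
    """Propellant mass needed for a burn of size dv (rocket equation inverted)."""
    final_mass = initial_mass * math.exp(-dv / exhaust_velocity)
    return initial_mass - final_mass


# Burning 200 kg of a 1000 kg craft at v_e = 3000 m/s.
dv = delta_v(3000.0, 1000.0, 800.0)
fuel_needed = propellant_for_burn(3000.0, 1000.0, dv)
```

In an environment, `propellant_for_burn` is one way to decrement the remaining fuel $f$ in the state after each thrust action.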

#### Reward Function
The reward function should be shaped well, so that it rewards progress along the way and not only the end goal. The agent's primary goals are to reach the target, minimize fuel consumption, maintain stable motion, and avoid oscillations, collisions and unsafe conditions. The reward function should balance all these goals.

| Reward Type      | Description  |
| ------------- | ------------- |
| Goal proximity | Reward for getting closer to the target |
| Fuel efficiency | Penalize excessive fuel use |
| Soft landing | Bonus for smooth landings (low velocity) |
| Docking precision | Bonus for aligning position & velocity |
| Stability | Penalize excessive rotations |

*Phase 1*: Reward based on moving towards the target.  
*Phase 2*: Reward for slowing down near the target.  
*Phase 3*: Reward only for successful docking.  

A combination of dense and sparse rewards can be applied:
- Dense rewards can speed up learning and will provide continuous feedback.
- Sparse rewards are given only on task completion, which can be more challenging for this scenario.
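
The difference between the two can be sketched as follows. The function names, the tolerance and the bonus value are assumptions for this illustration: the sparse reward is zero everywhere except at the target, the dense reward gives feedback at every position, and the combined version adds a completion bonus on top of the dense signal.

```python
import math


def sparse_reward(position, target, tolerance=0.5):
    """1 only when the target is reached, otherwise no signal at all."""
    return 1.0 if math.dist(position, target) <= tolerance else 0.0


def dense_reward(position, target):
    """Continuous feedback: negative Euclidean distance to the target."""
    return -math.dist(position, target)


def combined_reward(position, target, bonus=100.0):
    """Dense shaping plus a sparse completion bonus."""
    return dense_reward(position, target) + bonus * sparse_reward(position, target)


far = (0.0, 0.0, 0.0)
target = (3.0, 4.0, 0.0)
rewards_far = (sparse_reward(far, target), dense_reward(far, target))
reward_at_target = combined_reward(target, target)
```

Far from the target the sparse reward gives the agent nothing to learn from, while the dense term still distinguishes better positions from worse ones.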

**Reward function design:**
1. Proximity: a proximity reward is provided to ensure that the spacecraft moves towards the goal. A **composite reward function** is used to combine multiple rewards.  
*Composite rewards combine multiple reward signals into a single function. This is useful in complex tasks where multiple objectives need to be balanced.*  
The distance based-reward will be calculated with the negative Euclidean distance:
$$ r_{distance} = - \lVert x - x_{target} \rVert $$
where $x$ is the current position and $x_{target}$ is the target position.
> ```python
> reward_distance = -np.linalg.norm(self.state[:3] - self.target_position)
> ```

2. Efficiency: minimizes excessive thrust burns by penalizing thruster usage.  
*Fuel penalty* where $a$ is the applied thrust vector:
$$ r_{fuel} = - \lVert a \rVert $$
> ```python
> reward_fuel = -np.linalg.norm(action) * 0.01
> ```
For fuel *conservation*:  
$$ r_{fuel} = \frac{f_{remaining}}{f_{max}} $$
> ```python
> reward_fuel = self.state[-1] / self.max_fuel
> ```
3. Velocity: for precise docking the agent needs to match the target velocity and not overshoot.
$$ r_{velocity} = - \lVert v - v_{target} \rVert $$
> ```python
> reward_velocity = -np.linalg.norm(self.state[3:6] - self.target_velocity)
> ```
4. Orientation: in order to dock, a spacecraft needs to align correctly, so misalignment is penalized:
$$ r_{orientation} = -angle(q, q_{target}) $$
where $q$ is the quaternion representing orientation.
5. Landing: soft landing and encouraging stability. Velocity should be kept under control during landing and is penalized if the speed is too high:
$$ r_{landing} = -v^2_z $$
6. Terminal rewards: in order to achieve correct behaviour, large terminal rewards are applied for successful missions, and there is a large penalty for failures (for example crashes). This avoids agent guessing and maximizes learning efficiency.

A weighted sum of all the components of the reward function:
$$ R = w_1 r_{distance} + w_2 r_{fuel} + w_3 r_{velocity} + w_4 r_{orientation} + w_5 r_{landing} + r_{goal} + r_{crash} $$

```python
# Fragment from inside the environment's step(); assumes numpy is imported as np
# and that orientation_penalty returns a non-negative misalignment angle.
reward = (
    -np.linalg.norm(self.state[:3] - self.target_position) * 1.0  # Distance penalty
    - np.linalg.norm(action) * 0.01  # Fuel efficiency penalty
    - np.linalg.norm(self.state[3:6] - self.target_velocity) * 0.5  # Velocity matching
    - orientation_penalty(self.state[6:10], self.target_orientation) * 0.3  # Misalignment penalty (r_orientation = -angle)
    + (100 if docking_success else 0)  # Goal reward
    + (-100 if collision_detected else 0)  # Collision penalty
)
```  
<br>
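
The helper `orientation_penalty` is called in the snippet above but is not defined in this write-up. One possible definition, given here as an assumption, is the rotation angle between two unit quaternions. It returns a non-negative angle in radians, so it has to enter the reward with a negative sign to match $r_{orientation} = -angle(q, q_{target})$.

```python
import math


def orientation_penalty(q, q_target):
    """Rotation angle in radians, in [0, pi], between two unit quaternions.

    The absolute value of the dot product treats q and -q as the same
    orientation; min() guards acos against rounding slightly above 1.
    """
    dot = abs(sum(a * b for a, b in zip(q, q_target)))
    return 2.0 * math.acos(min(1.0, dot))


identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4.0), 0.0, 0.0, math.sin(math.pi / 4.0))
misalignment = orientation_penalty(identity, quarter_turn_z)
```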

Avoid and consider:
- avoid unintended ways for the agent to maximize rewards;
- sparse rewards can lead to less exploration of useful actions;
- avoid over-penalizing fuel consumption since the agent might not move;
- scale reward terms to similar magnitudes so that no single term dominates learning;


#### Training and Evaluation
Algorithms: Proximal Policy Optimization (PPO), Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), Deep Q Networks (DQN), Advantage Actor Critic (A2C)  
Hyperparameter tuning: learning rate, batch size, exploration strategies  

Evaluation Metrics:
1. Final Orbit Accuracy – Does the spacecraft reach the target orbit within a small margin?
2. Fuel Efficiency – Does the agent minimize fuel usage?
3. Number of Manoeuvres – Does the agent execute a minimal number of burns?
4. Success Rate – Percentage of successful orbital insertions across test episodes.
5. Time Taken – How many simulation steps are needed to reach the target?
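
A sketch of how these metrics could be aggregated over test episodes is shown below. The record fields (`success`, `fuel_used`, `burns`, `steps`, `final_error`) and the four example records are made up for illustration; they are not results from this project.

```python
from statistics import mean


def summarize_episodes(episodes):
    """Aggregate per-episode records into the evaluation metrics listed above."""
    successes = [ep for ep in episodes if ep["success"]]
    return {
        "success_rate": len(successes) / len(episodes),
        "mean_final_error": mean(ep["final_error"] for ep in episodes),
        "mean_fuel_used": mean(ep["fuel_used"] for ep in episodes),
        "mean_burns": mean(ep["burns"] for ep in episodes),
        # Time taken only makes sense for episodes that reached the target.
        "mean_steps_to_target": mean(ep["steps"] for ep in successes) if successes else None,
    }


# Illustrative records only, not measured data.
example_episodes = [
    {"success": True, "fuel_used": 10.0, "burns": 3, "steps": 100, "final_error": 0.5},
    {"success": True, "fuel_used": 20.0, "burns": 5, "steps": 200, "final_error": 1.5},
    {"success": False, "fuel_used": 30.0, "burns": 8, "steps": 300, "final_error": 10.0},
    {"success": True, "fuel_used": 40.0, "burns": 4, "steps": 300, "final_error": 1.0},
]
summary = summarize_episodes(example_episodes)
```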

</div>

<div style="max-width: 1200px; margin: 5px;">

#### Results and Implementation Details
The first experiment used a more traditional environment: the lunar lander. It is available in Gymnasium and can be approached with a custom DQN agent. Tuning the hyperparameters is important to reach good results. After testing a few implementations that did not work well in terms of library execution and engine interaction, setting up a basic lunar lander with the help of Gymnasium proved to be one of the first successful attempts (figure 1). The agent is initialized with two Q-networks: a local network for action selection and a target network to compute target Q-values for learning.
Optimizer: Adam. Policy: epsilon-greedy, which selects the action with the highest predicted Q-value with probability $1 - \epsilon$ and a random action otherwise. Q-target: $ Q_{target} = r + \gamma \cdot \max_{a'} Q_{target}(s', a')$; the local Q-network is updated using the loss function $Loss = MSE(Q_{expected}, Q_{target})$. 
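
The update rules above can be sketched without a neural network by treating Q-values as plain lists. The function names are hypothetical. Two details are assumptions not stated in the text: the target is cut off at terminal states (the `done` flag), and the target network is moved toward the local one with a soft update, which is one common use of a small `TAU` value.

```python
import random


def epsilon_greedy(q_values, eps, rng):
    """Random action with probability eps, otherwise the highest-valued action."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def td_target(reward, next_q_values_target, gamma, done):
    """Q_target = r + gamma * max_a' Q_target(s', a'), with no bootstrap when done."""
    bootstrap = 0.0 if done else gamma * max(next_q_values_target)
    return reward + bootstrap


def mse_loss(expected, targets):
    """Mean squared error between predicted and target Q-values."""
    return sum((e - t) ** 2 for e, t in zip(expected, targets)) / len(expected)


def soft_update(local_params, target_params, tau):
    """Move target parameters a small step (tau) toward the local parameters."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


rng = random.Random(0)
greedy_action = epsilon_greedy([0.1, 0.7, 0.3], eps=0.0, rng=rng)
target = td_target(reward=1.0, next_q_values_target=[0.5, 2.0], gamma=0.99, done=False)
loss = mse_loss([2.0], [target])
```

Keeping a separate, slowly changing target network means the regression target does not shift with every gradient step, which tends to stabilize training.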
</div>

<div style="max-width: 1200px; margin: 5px;">
    <p>Running ~1300 episodes provided a good reward and the agent seemed to land successfully in the boundaries (figure 1). This can be improved and used as a starting point.</p>
    <div style="display: flex; justify-content: space-between;">
            <img src="Figures/Results/lunar_lander_plot.png" alt="lunar" style="max-width: 65%; max-height: 375px;">

<div style="width:10%"></div>
<div>
<p> The hyperparameters below were provided to the custom DQN agent:
    
```python
BUFFER_SIZE = int(1e5)
BATCH_SIZE = 64
GAMMA = 0.99 
TAU = 1e-3
LR = 5e-4
UPDATE_EVERY = 4
```
</p>
<p>

    
Figure 1: The plot shows the DQN agent's reward over time during training. The high variance early on suggests heavy exploration while the agent is still learning which actions work, and it is followed by a steady improvement. Towards the end the scores get higher, indicating that the policy is converging. Additional stabilization may be achieved by adjusting hyperparameters such as the learning rate and the epsilon-greedy exploration (which may be the source of the high variance).

</p>
</div>
</div>
<p style="margin-top: 5px; text-align: left; margin: 5px;"></p>
</div>
<br/>
<div style="max-width: 1200px; margin: 5px;">
DQN was then used to solve a similar problem, navigating to a specific target while avoiding obstacles, this time in 3D space in order to resemble a space environment more closely. Starting from simpler interactions and adding complexity step by step proved to be a good approach. The policy mapping observations to actions here was MlpPolicy, a Multi-Layer Perceptron (MLP), that is, a fully connected neural network. It takes in an observation (state from the environment), passes it through fully connected (dense) layers with activation functions (ReLU by default for DQN) and outputs either discrete actions (DQN, and PPO or A2C on discrete action spaces) or continuous actions (SAC, TD3). For DQN, PPO and A2C, MlpPolicy by default consists of two hidden layers with 64 neurons each.
Below is the set of hyperparameters used during experimentation with MlpPolicy:  
</div>
<br/>
<div style="display: flex; justify-content: space-between; margin: 5px; max-width: 900px;">

<div style="width: 33%;">
        
**Initial testing:**
   ```python
policy="MlpPolicy",
learning_rate=1e-3,
buffer_size=10000,
learning_starts=500,
batch_size=64,
gamma=0.99,
target_update_interval=100,
train_freq=4,
verbose=1,
   ```
</div>

<div style="width: 33%;">
    
**From stable baseline docs:**
```python
policy='MlpPolicy'
n_timesteps= !!float 1e5
learning_rate= !!float 6.3e-4
batch_size= 128
buffer_size= 50000
learning_starts= 0
gamma= 0.99
target_update_interval= 250
train_freq= 4
gradient_steps= -1
exploration_fraction= 0.12
exploration_final_eps= 0.1
policy_kwargs: net_arch=[256, 256]
```
</div>

<div style="width: 33%;">
        
**Improved results:**
```python
policy="MlpPolicy",
learning_rate=5e-4,
buffer_size=10000,
learning_starts=500,
batch_size=128,
gamma=0.95,
target_update_interval=100,
train_freq=10,
exploration_fraction=0.3,
verbose=1,
```
</div>

</div>

<br/>

<div style="max-width: 1200px; margin: 5px;">
In early training the results were good and the agent was very effective at reaching the goal and reaping the rewards (figure 2 (a)), unlike some other tests where agents required many more steps to earn reward (figure 3). Initial training used 500 000 timesteps, while in the test run an episode terminates after around 25 steps.  
Observing the agent's path and manoeuvres (figure 2 (b)) is also useful, since the agent should take near-optimal moves and not explore aimlessly. This can be tuned through the reward scheme but depends on the environment complexity.
</div>
<br/>
<div style="display: flex; align-items: center; justify-content: space-between; max-width: 1200px; margin: 5px;">
    <div>
        <img src="Figures/Results/navigate_dqn_01.png" alt="nav" style="width: 95%; max-height: 450px;">
        <p style="margin: 5px; text-align: left;">(a): Rewards per step indicating incremental learning.</p>
    </div>
    <div>
        <img src="Figures/Results/navigate_dqn_02.png" alt="nav" style="width: 95%; max-height: 450px;">
        <p style="margin: 5px; text-align: left;">(b): Movements of the agent in the environment.</p>
    </div>
</div>
<div style="max-width: 1200px; margin: 5px;">
    
Figure 2: Represents reward values per step within an episode and provides insight into how the agent interacts with the environment.

</div>
<div style="max-width: 1200px; margin: 5px;">
    
Looking at the hyperparameters which showed improvement after updates:

- **Learning rate** is the step size used by the optimizer to update the neural network weights. A smaller value (e.g., 1e-4) results in slower learning but more stability, whereas a larger value (e.g., 1e-2) can speed up learning but might cause instability.
- The **replay buffer** stores past experiences (state, action, reward, next state), and the agent samples from it for training. A larger buffer retains more diverse experiences but requires more memory.
- **Batch size** is the number of samples drawn from the replay buffer at each update. A larger batch size (e.g., 128) provides more stable gradient estimates but requires more computation per update. A smaller batch size (e.g., 32) makes each update cheaper but noisier.
- **Gamma** is the discount factor, which determines the importance of future rewards. A value close to 1 (e.g., 0.99) makes the agent prioritize long-term rewards. A lower value (e.g., 0.8) makes the agent focus more on short-term rewards.
- **Train frequency** is how often the model is trained. A value of 4 means the model updates every 4 environment steps. A lower value (e.g., 1) updates more frequently, while a higher value (e.g., 10) updates less often.
- **Gradient steps** determines how many gradient updates are performed each time training is triggered. For each gradient update the agent samples a batch of experiences from the replay buffer and updates the neural network using backpropagation. The default value is typically 1, meaning one gradient update per training step. If gradient_steps > 1, several updates are performed in a row, each on a freshly sampled batch.
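
How buffer size, batch size, train frequency and gradient steps interact can be sketched with a small replay buffer. The `ReplayBuffer` class and the loop below are a simplified illustration with tiny, made-up values, not the library implementation. In this sketch every gradient step draws its own batch from the buffer.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer; the oldest transitions are dropped once it is full."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a random batch of stored transitions without replacement."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


BUFFER_SIZE, BATCH_SIZE = 5, 2
LEARNING_STARTS, TRAIN_FREQ, GRADIENT_STEPS = 4, 4, 1

buffer = ReplayBuffer(capacity=BUFFER_SIZE, seed=0)
updates = 0
for step in range(1, 13):
    # Dummy transition; a real loop would store the environment's output here.
    buffer.add(state=step, action=0, reward=0.0, next_state=step + 1, done=False)
    if step >= LEARNING_STARTS and step % TRAIN_FREQ == 0:
        for _ in range(GRADIENT_STEPS):
            batch = buffer.sample(BATCH_SIZE)  # fresh batch per gradient step
            updates += 1  # a real agent would run one optimizer step here
```

After 12 steps the buffer only holds the 5 most recent transitions, which is the memory versus diversity trade-off described for the buffer size.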

</div>

<br/>
<div style="display: flex; align-items: center; justify-content: space-between; max-width: 1200px; margin: 5px;">
    <div>
        <img src="Figures/Results/b_01.png" alt="nav" style="width: 95%; max-height: 450px;">
    </div>
    <div>
        <img src="Figures/Results/b_02.png" alt="nav" style="width: 95%; max-height: 450px;">
    </div>
</div>
<div style="max-width: 1200px; margin: 5px;">
Figure 3: Example of inconsistent results after training which were difficult to debug and make better over time. Agents here are stagnating without improvement. A more fundamental change was required to improve.
</div>
<br/>
<div style="max-width: 1200px; margin: 5px;">
Further experimentation was done with an improved environment and some possible considerations in mind:  
</div>
<div style="display: flex; justify-content: space-between; max-width: 1200px; margin: 5px;">
    <div style="width: 48%;">
        
Movement:
- Introduce continuous movement using velocity and acceleration.
- Apply Newton’s Second Law: $ F = ma $ where force from thrusters changes velocity.
- Add momentum so movement persists unless countered.
- Introduce gravity if near a celestial body.
  
Obstacles:
- Irregular-shaped asteroids (bounding boxes or convex hulls).
- Moving obstacles (asteroids or debris traveling with velocity).
- Gravity wells (black holes or planetary gravity altering paths).
  
Anomalies:
- Fuel leaks: Randomized decrease in fuel.
- Sensor noise: Small noise added to state observations.
- Thruster failures: Certain actions may not work sometimes.
</div>
<hr/>
    <div style="width: 48%;">
        
Sensors:
- Limited field of view: The agent only knows nearby objects.
- Lidar-style sensing: Returns distances to nearby objects.
- Radar-based goal detection: The goal’s direction is known, but not distance.
  
Reward:
- Reward efficient movement: Penalize unnecessary thrusts.
- Encourage proximity to the goal: Small positive reward for getting closer.
- Add time pressure: Slight penalty per step to encourage efficiency.
    </div>
</div>
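
Some of the anomalies listed above can be sketched as small helper functions applied inside the environment's step. The function names, the probabilities and the noise level are hypothetical values chosen for illustration only.

```python
import random


def noisy_observation(observation, rng, sigma=0.01):
    """Sensor noise: add small Gaussian noise to every observed value."""
    return [x + rng.gauss(0.0, sigma) for x in observation]


def apply_thruster_failure(action, rng, failure_prob=0.05):
    """Thruster failure: with some probability the commanded thrust is lost."""
    if rng.random() < failure_prob:
        return [0.0 for _ in action]
    return list(action)


def leak_fuel(fuel, rng, leak_prob=0.01, max_leak=0.5):
    """Fuel leak: occasionally remove a random amount of fuel, never below zero."""
    if rng.random() < leak_prob:
        fuel -= rng.uniform(0.0, max_leak)
    return max(fuel, 0.0)


rng = random.Random(42)
observed = noisy_observation([1.0, 2.0, 3.0], rng)
applied = apply_thruster_failure([0.5, 0.0, -0.5], rng)
remaining = leak_fuel(10.0, rng)
```

Passing the random generator in explicitly keeps episodes reproducible when a seed is fixed, which helps when debugging inconsistent training runs.
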
<br/>
<div style="max-width: 1200px; margin: 5px;">
After an update to the environment and reward, the results were good once again and the agent reached the goal in a seemingly optimal way. The reward now changed more gradually and the spikes were eliminated, which suggested a more linear reward curve and a refined reward scheme (figure 4).
</div>
<br/>
<div style="max-width: 1200px; margin: 5px;">
    
![nav](Figures/Results/navigate_dqn_03.png)
Figure 4: Plots showing agent trajectory, rewards per episode and distance to the actual goal, plus fuel usage.

</div>
<div style="max-width: 1200px; margin: 5px;">
Using the same space exploration environment which proved to be a good starting point and had potential for further configuration, a comparison was done between the algorithms DQN, PPO and A2C. This test was important to understand how the agents differed and what hyperparameter changes and optimizations did to the results. The cumulative reward reflects the total score achieved by the agent and gives a good indication of its overall success in the environment.
After some tuning the agents started to collect rewards and performed more stably overall. A2C had the most fluctuations and did not provide good results in initial training. DQN and PPO, on the other hand, are much more stable and show rising rewards.
</div>
<br/>
<div style="display: flex; align-items: center; justify-content: space-between; max-width: 1200px; margin: 5px;">
    <div style="margin-right: 5px;">
        <div>
            <img src="Figures/Results/comp_03.png" alt="comp" style="width: 100%; max-height: 520px;">
            <p style="margin: 5px; text-align: left;">(a) Initial performance for the three algorithms.</p>
        </div>
    </div>                       
    <div style="margin-right: 5px;">
        <div>
            <img src="Figures/Results/comp_04.png" alt="comp" style="width: 100%; max-height: 520px;">
            <p style="margin: 5px; text-align: left;">(b) Improved performance after hyperparameter tuning.</p>
        </div>
    </div>                       
    <div>
        <div>
            <img src="Figures/Results/comp_cm_04.png" alt="comp" style="width: 100%; max-height: 520px;">
            <p style="margin: 5px; text-align: left;">(c) Cumulative rewards over time after tuning.</p>
        </div>
    </div>
</div>
<br/>
<div style="max-width: 1200px; margin: 5px;">
Figure 5: DQN (Green) performs the best, achieving the highest and most stable rewards over time. In the first test (a) results are somewhat uncertain and do not reach good rewards. PPO (Blue) is stable and competitive but slightly lower than DQN in terms of rewards (b). Both DQN and PPO show steady improvement, stabilizing at higher rewards, while A2C remains volatile with generally negative performance. For the cumulative results (c) DQN accumulates rewards at a better rate than PPO, meaning it learns a more efficient policy over time.
</div>
<br/>
<div style="max-width: 1200px; margin: 5px;">
    
To improve these results further, the policy network can also be customized through `policy_kwargs`. For example, a `net_arch` of [256, 256] sets more neurons per layer, which suits more complex and high-dimensional spaces. A three-layer architecture can also be considered: [128, 128, 64]. Experimenting with different batch sizes is beneficial and affects performance, for example 16, 32 or a higher value such as 256. Buffer size and learning rate should also be considered when aiming for improvements. The article *Empirical Design in Reinforcement Learning* (Department of Computing Science and Alberta Machine Intelligence Institute) has very interesting information on designing reinforcement learning projects and on how to approach and improve them. Two further articles have good insights on RL topics: *Optimal Tasking of Ground-Based Sensors for Space Situational Awareness Using Deep Reinforcement Learning* (Multidisciplinary Digital Publishing Institute) and *On Deep Reinforcement Learning for Spacecraft Guidance* (Carleton University, Ottawa, Ontario).
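
To get a feeling for what a larger architecture costs, the sketch below counts the weights and biases of a fully connected network for different hidden-layer layouts. The observation size (6) and the number of discrete actions (7) are assumed values for illustration, and `mlp_parameter_count` is a hypothetical helper. The `policy_kwargs` dictionary shows the form in which a layout is typically passed to a Stable-Baselines3 model constructor.

```python
def mlp_parameter_count(input_dim, net_arch, output_dim):
    """Number of weights and biases in a fully connected network."""
    sizes = [input_dim, *net_arch, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


OBS_DIM, N_ACTIONS = 6, 7  # assumed sizes, not taken from the project

param_counts = {
    "default [64, 64]": mlp_parameter_count(OBS_DIM, [64, 64], N_ACTIONS),
    "wide [256, 256]": mlp_parameter_count(OBS_DIM, [256, 256], N_ACTIONS),
    "three-layer [128, 128, 64]": mlp_parameter_count(OBS_DIM, [128, 128, 64], N_ACTIONS),
}

# Form of the argument as it would be passed, e.g. DQN("MlpPolicy", env, policy_kwargs=policy_kwargs)
policy_kwargs = dict(net_arch=[256, 256])
```

Under these assumed sizes the wide layout has more than ten times as many parameters as the default one, so it usually needs more data and compute to train.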

Expanding on visualisation for this topic can be very helpful. There are engines that can render every step and observation to show how the agent is operating.
</div>
<div style="display: flex; align-items: center; justify-content: space-between; max-width: 1200px; margin: 5px;">
        <div style="width: 45%;">
            <img src="Figures/Figure_2_GTO.png" alt="comp" style="max-width: 100%;">
            <p style="margin: 5px; text-align: left;">(a) The green circle represents a celestial body. The red line corresponds to a spacecraft's orbit, while the dashed line represents a reference orbit for comparison.
            </p>
        </div>  
        <div style="width: 2%;">
        </div>  
        <div style="width: 30%;">
            <img src="Figures/Figure_1_GTO.png" alt="comp" style="max-width: 100%;">
            <p style="margin: 5px; text-align: left;">(b)  Trajectory of a spacecraft in inertial space.</p>
            <img src="Figures/Figure_3_GTO.png" alt="comp" style="max-width: 100%;">
            <p style="margin: 5px; text-align: left;">(c) Spacecraft's trajectory along the three inertial axes.</p>
        </div>
</div>
<div style="display: flex; align-items: center; justify-content: space-between; max-width: 1200px; margin: 5px;">
    <p>Figure 6: A render from the Basilisk RL framework using the Basilisk engine. Helpful for testing and building more complex environments.</p>
</div>
</div>
<br/>
<div style="max-width: 1200px; margin: 5px;">
    
**Code implementation** is provided in the `Models` section. It contains multiple Jupyter notebooks with tests and results for different configurations. A folder called `Lunar_Lander` is also available with the lunar lander code. There are Python files with custom environments and reward shaping. Other helper folders contain saved models after training and TensorBoard logs for performance monitoring.

</div>
</div>

<div style="max-width: 1200px; margin: 5px;">
    
### Conclusion
For this topic, a few methods were tested on how a policy can be set up to converge effectively in an environment resembling space. The approaches focused on simplicity, exploration and understanding the basics while attempting to measure everything. Complex environments require much more testing, visualisation and evaluation of the results under the hood. To expand on and learn from this project, a very good start is to begin with predefined environments that are already fully set up and configured, improve them and check the results. From there, a simple custom environment can be set up, together with a simple policy and a stable reward scheme, and then improved and updated iteratively. It is very useful to take the time to visualise all of the results and, if possible, what the agent is doing in terms of movement, since this reveals patterns that can be addressed. There is a plethora of 2D and 3D engines for crafting a good simulation, but expanding on the policy and setting up a good agent is also very important. Training can be compute-intensive in some scenarios, so access to stable compute power helps.

</div>

<div style="max-width: 1200px; margin: 5px;">
    
### Sources
Empirical Design in Reinforcement Learning: https://arxiv.org/pdf/2304.01315  
The 37 Implementation Details of Proximal Policy Optimization: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/  
Stable-Baselines3: Reliable Reinforcement Learning Implementations: https://araffin.github.io/post/sb3/  
D4PG implementation of a spacecraft operations and tasks: https://github.com/Kirkados/AIAA_GNC2020_D4PG?tab=readme-ov-file  
Optimal Tasking of Ground-Based Sensors for Space Situational Awareness Using Deep Reinforcement Learning: https://dspace.mit.edu/bitstream/handle/1721.1/145993/sensors-22-07847-v2.pdf?sequence=1  
On Deep Reinforcement Learning for Spacecraft Guidance: https://www.aiaa.org/docs/default-source/default-document-library/publications/ondeepreinforcementlearningforspacecraftguidance.pdf  
RL Lunar Lander: https://github.com/Awni00/RL-Lunar-Lander/tree/main  
DQN: https://github.com/sokistar24/Deep_Reinforcement_learning  
PPO: https://openai.com/index/openai-baselines-ppo/  
AgileRL Lunar Lander: https://docs.agilerl.com/en/latest/tutorials/gymnasium/agilerl_td3_tutorial.html  


**Tools and frameworks**  
Stable-baselines: https://stable-baselines.readthedocs.io/en/master/index.html  
Gymnasium: https://gymnasium.farama.org/  
Spinning Up (OpenAI): https://spinningup.openai.com/en/latest/index.html  
Poliastro: https://docs.poliastro.space/en/stable/index.html  
Astropy: https://www.astropy.org/  
Ray: https://docs.ray.io/en/latest/index.html  
AgileRL: https://docs.agilerl.com/en/latest/index.html  
Basilisk: https://avslab.github.io/basilisk/  
BSK-RL: https://avslab.github.io/bsk_rl/

**Other:**  
https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#the-optimal-q-function-and-the-optimal-action  
https://spinningup.openai.com/en/latest/algorithms/ddpg.html#background  
http://galileoandeinstein.physics.virginia.edu/7010/CM_15_Keplerian_Orbits.pdf  
https://spsweb.fltops.jpl.nasa.gov/portaldataops/mpg/MPG_Docs/MPG%20Book/Release/Chapter7-OrbitalMechanics.pdf  
https://spinningup.openai.com/en/latest/algorithms/ppo.html  
https://ncr.mae.ufl.edu/aa/files/Spring2022/allen2022.pdf  
https://graphics.stanford.edu/courses/cs348a-17-winter/Papers/quaternion.pdf  
https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html  
</div>