# Reinforcement Learning: Lunar Lander Model

## Introduction
Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by interacting with its environment. Instead of being told explicitly what to do, the agent discovers the best actions through trial and error, guided by rewards or penalties. A classic example of RL is teaching a robot to navigate a maze, learn from its mistakes, and eventually master the task.

## Lunar Lander
The Lunar Lander demo below illustrates the power of RL by teaching a spacecraft to land successfully on the moon within a designated landing area.
The spacecraft (the RL Agent) has four discrete actions available:
- Power the main engine 
- Power the left engine
- Power the right engine
- Power no engine

In addition, fuel is infinite, so the agent never runs out while it learns.
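
The four actions listed above can be written down as a small enumeration. This is only an illustrative sketch: the class name `LanderAction` and the helper `describe` are made up for this example, and the integer indices follow the ordering given in Gymnasium's Lunar Lander documentation (which differs from the order of the bullet list above).

```python
from enum import IntEnum


class LanderAction(IntEnum):
    """Hypothetical names for the discrete actions, indexed as in Gymnasium's docs."""
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


def describe(action: int) -> str:
    """Turn an action index into a readable label."""
    return LanderAction(action).name.replace("_", " ").lower()


# List every action index with its label
for action in LanderAction:
    print(int(action), describe(action))
```

An agent's policy outputs one of these four integers at every time step.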

## Model

### Load the required libraries
The lunar lander RL demo uses `stable-baselines3`, a library that provides a set of reliable implementations of reinforcement learning algorithms in PyTorch. `stable-baselines3` works together with `gymnasium` to create and visualise training environments.

In [1]:
# Load the required libraries
import gymnasium as gym
import numpy as np

### Import the necessary requirements
The algorithm used for the lunar lander model is the Deep Q-Network.

The Deep Q-Network (DQN) algorithm is a reinforcement learning method that combines Q-learning with deep neural networks. It enables an agent to learn optimal actions in high-dimensional, complex environments by approximating the Q-value function with a neural network. DQN uses experience replay to store and sample past experiences, breaking the correlation between consecutive updates, and employs a target network to stabilize learning. 
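
The two ingredients described above, experience replay and a target network, can be sketched in plain NumPy. This is a simplified illustration rather than the `stable-baselines3` implementation: `ReplayBuffer` and `td_targets` are hypothetical names, the discount factor `gamma` is an assumed value, and the target-network Q-values are passed in as a ready-made array instead of being computed by a neural network.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size store of past transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng):
        # Uniform random sampling breaks the correlation between consecutive steps
        batch = rng.sample(list(self.storage), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            np.array(states),
            np.array(actions),
            np.array(rewards, dtype=np.float64),
            np.array(next_states),
            np.array(dones, dtype=np.float64),
        )


def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """Regression targets: r + gamma * max_a' Q_target(s', a'), or just r at episode end."""
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)


# Fill a small buffer with dummy 8-dimensional states and draw a batch
buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add(np.full(8, float(i)), i % 4, float(i), np.full(8, float(i + 1)), False)
states, actions, rewards, next_states, dones = buffer.sample(2, random.Random(0))

# Made-up target-network outputs for two transitions with four actions each
fake_next_q = np.array([[0.0, 10.0, 1.0, 2.0], [5.0, 3.0, 0.0, 1.0]])
targets = td_targets(np.array([1.0, 2.0]), fake_next_q, np.array([0.0, 1.0]), gamma=0.9)
# First target is 1 + 0.9 * 10; the second transition is terminal, so it is just the reward
print(targets)
```

The policy network is then trained to bring its Q-value for the action actually taken closer to these targets.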

In [2]:
# Import Deep Q-Network
from stable_baselines3 import DQN

# Import evaluate_policy to enable evaluation of the agent's performance
from stable_baselines3.common.evaluation import evaluate_policy

### Define the model

- Algorithm: Deep Q-Network
- Decision policy: Multi-Layer Perceptron (MLP) policy (suitable for environments with vector-based observations, e.g., numerical state arrays)
- Final value of $\epsilon$ is set to 0.1 (the agent takes a random action with probability $\epsilon$ and exploits its learned policy by taking the action with the highest Q-value with probability 1 - $\epsilon$)
- Target Q-Network update: the target network is overwritten with the weights of the policy Q-Network every 250 environment steps
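
To make the last two settings concrete, the sketch below shows a linearly decaying $\epsilon$, $\epsilon$-greedy action selection, and a periodic target-network sync check. The function names are hypothetical, and the initial $\epsilon$ of 1.0 and the decay over the first 10% of training are assumptions made for illustration; only the final value of 0.1 and the interval of 250 come from the configuration above.

```python
import numpy as np


def linear_epsilon(step, total_steps, initial_eps=1.0, final_eps=0.1, fraction=0.1):
    """Decay epsilon linearly over the first `fraction` of training, then hold it."""
    decay_steps = fraction * total_steps
    progress = min(step / decay_steps, 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the highest-valued action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def should_sync_target(step, interval=250):
    """True on the steps where the target network would copy the policy weights."""
    return step > 0 and step % interval == 0


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, -0.2, 0.3])  # made-up Q-values for the four actions

for step in (0, 250, 500, 1000):
    eps = linear_epsilon(step, total_steps=10_000)
    action = epsilon_greedy(q_values, eps, rng)
    print(step, round(eps, 3), action, should_sync_target(step))
```

Early in training almost every action is random; once $\epsilon$ reaches 0.1, nine out of ten actions on average follow the learned Q-values.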

In [3]:
model = DQN(
    "MlpPolicy",
    "LunarLander-v3",
    exploration_final_eps=0.1,
    target_update_interval=250,
    verbose=1,  # prints model details on execution
)

Using cpu device
Creating environment from the given name 'LunarLander-v3'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


### Evaluate the untrained model
To appreciate the power of RL, we first evaluate the reward of the initial, untrained model, which is expected to be poor.

In [4]:
# Create lunar lander environment for evaluation
eval_env = gym.make("LunarLander-v3")

# Evaluate the agent's performance before training
mean_reward, std_reward = evaluate_policy(
    model,
    eval_env,
    n_eval_episodes=10,
    deterministic=True,
)

print(f"The mean reward achieved by the agent is = {mean_reward:.2f}")

The mean reward achieved by the agent is = -553.61
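
The mean reward reported above is an average over the total rewards of the individual evaluation episodes, and `std_reward` is their standard deviation. As a rough sketch of that summary step, the helper below (a hypothetical function, fed with made-up episode returns rather than values from this run) computes both statistics with NumPy.

```python
import numpy as np


def summarize_returns(episode_returns):
    """Return the mean and (population) standard deviation of per-episode returns."""
    returns = np.asarray(episode_returns, dtype=float)
    return float(returns.mean()), float(returns.std())


# Two made-up episode returns, for illustration only
mean_r, std_r = summarize_returns([-600.0, -500.0])
print(f"mean = {mean_r:.2f}, std = {std_r:.2f}")  # mean = -550.00, std = 50.00
```

Printing the standard deviation alongside the mean gives a sense of how much the agent's performance varies from episode to episode.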




### Train and save the agent
We train the agent for a total of 10,000,000 time steps and save it so that the demo can load it later.

In [13]:
# Training
model.learn(total_timesteps=int(1e7))

# Save the agent
model.save("dqn_lunar")
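
A saved agent can later be restored and run for an episode. The function below is a minimal sketch of how that might look, not the demo's actual code: `run_saved_agent` and its `max_steps` cap are hypothetical, and it assumes the file written by `model.save("dqn_lunar")` is in the working directory and that `LunarLander-v3` (with its Box2D dependency) is installed.

```python
def run_saved_agent(path="dqn_lunar", env_id="LunarLander-v3", max_steps=1000):
    """Load a saved DQN agent, play one episode, and return the total reward."""
    # Imported here so the function can be defined without the RL libraries installed
    import gymnasium as gym
    from stable_baselines3 import DQN

    env = gym.make(env_id)
    model = DQN.load(path)

    obs, _info = env.reset(seed=0)
    total_reward = 0.0
    for _ in range(max_steps):
        # Always pick the action with the highest Q-value (no exploration)
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _info = env.step(int(action))
        total_reward += float(reward)
        if terminated or truncated:
            break

    env.close()
    return total_reward
```

Calling `run_saved_agent()` after training would return the reward of a single episode played by the restored agent.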

### Evaluate the trained model
Once the agent has been trained, we re-evaluate its performance, which is expected to improve significantly.

In [None]:
# Evaluate the agent's performance after training
mean_reward, std_reward = evaluate_policy(
    model,
    eval_env,
    n_eval_episodes=10,
    deterministic=True,
)

print(f"The mean reward achieved by the agent is = {mean_reward:.2f}")

del model # deletes the variable model to avoid interference with loading in demo

The mean reward achieved by the agent is = 176.65
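
To put the two printed results side by side, the short sketch below computes the improvement in mean reward. The threshold of 200 points is taken from Gymnasium's Lunar Lander documentation, which describes an episode scoring at least 200 as a solution; the variable names are chosen for this example only.

```python
SOLVED_THRESHOLD = 200.0  # per-episode score described as a solution in Gymnasium's docs

reward_before = -553.61  # mean reward of the untrained agent, as printed earlier
reward_after = 176.65    # mean reward of the trained agent, as printed above

improvement = reward_after - reward_before
print(f"Improvement in mean reward: {improvement:.2f}")
print(f"Mean reward meets the threshold: {reward_after >= SOLVED_THRESHOLD}")
```

On these numbers the agent improves by roughly 730 points on average, while its mean reward remains somewhat below the 200-point mark.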
