# Lunar Landing using Reinforcement Learning 
## Proximal Policy Optimization (PPO)

#### Step 1: Install necessary libraries

In [None]:
# Install pygame for rendering
!pip install pygame
# Install swig, which is needed to build the Box2D bindings
!pip install swig
# Install box2d environments for gymnasium (quoted so the shell does not expand the brackets)
!pip install "gymnasium[box2d]"
# Install stable_baselines3 for reinforcement learning algorithms
!pip install stable_baselines3 
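
After the installs finish, it can help to confirm which versions ended up in the kernel's environment. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is made up for illustration, and the distribution names are assumed to match what pip installed above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as published on PyPI (assumed to match the installs above).
for name, ver in installed_versions(["pygame", "gymnasium", "stable-baselines3"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If a package is reported as not installed, restarting the kernel after the `pip install` cells is usually the first thing to try.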

#### Step 2: Import required libraries

In [1]:
# Gymnasium for creating and interacting with environments.
import gymnasium as gym
# Proximal Policy Optimization algorithm for reinforcement learning.
from stable_baselines3 import PPO
# Base class for wrappers that modify or extend an environment's behavior.
from gymnasium import Wrapper

#### Step 3: Initialize the Lunar Lander environment
LunarLander-v3 is a classic rocket trajectory optimization problem from Gymnasium's Box2D environments, where the goal is to land a spacecraft on a landing pad.

1. Always define the environment in the same cell where it is rendered. Because of the way pygame works, a closed environment cannot be reused and must be created again.

2. The environment below is the base environment. It shows what happens when there is no trained model.

In [2]:
env = gym.make(
    "LunarLander-v3", 
    continuous=False,  # Discrete action space (fixed thrust levels).
    gravity=-10.0,  # Gravitational constant (-10.0 is the default).
    enable_wind=False,  # No wind disturbances.
    wind_power=15.0,  # Strength of wind; only used when enable_wind=True.
    turbulence_power=1.5,  # Intensity of turbulence; only used when enable_wind=True.
    render_mode='human'  # Visual rendering for human observation.
)
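
Each call to `reset()` or `step()` on this environment returns an 8-element observation vector: position, velocity, angle, angular velocity, and two leg-contact flags. The sketch below gives those elements names so that later code such as `obs[0]` is easier to read. `LanderObservation` and `decode_observation` are hypothetical helpers, not part of Gymnasium, and the sample values are invented.

```python
from typing import NamedTuple, Sequence


class LanderObservation(NamedTuple):
    """Named view of the 8-element LunarLander observation vector."""
    x: float                 # horizontal position (0 is the pad center)
    y: float                 # vertical position
    vx: float                # horizontal velocity
    vy: float                # vertical velocity
    angle: float             # lander tilt
    angular_velocity: float  # rate of rotation
    left_leg_contact: bool   # True if the left leg touches the ground
    right_leg_contact: bool  # True if the right leg touches the ground


def decode_observation(obs: Sequence[float]) -> LanderObservation:
    """Convert a raw observation (list or NumPy array) into named fields."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    values = [float(v) for v in obs]
    return LanderObservation(*values[:6], bool(values[6]), bool(values[7]))


# Hypothetical observation values, chosen only to illustrate the layout.
sample = decode_observation([0.05, 1.4, -0.1, -0.6, 0.02, 0.0, 0.0, 0.0])
print(sample.x, sample.left_leg_contact)
```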

#### Step 4: Run the environment with random actions (simulation)
This is a test loop to observe how the environment behaves with random actions.

In [3]:
observation, info = env.reset()  # Reset the environment to the initial state.
for _ in range(1000):
    action = env.action_space.sample()  # Sample a random action from the action space.
    observation, reward, terminated, truncated, info = env.step(action)  # Take the action and observe results.

    if terminated or truncated:  # Check if the episode has ended.
        observation, info = env.reset()  # Reset the environment for the next episode.

env.close()  # Close the environment to free resources.
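
The loop above discards the rewards, so it shows the behavior but gives no number to compare a trained agent against. As a rough sketch, the same loop can be wrapped in a function that records the total reward of every finished episode. The name `run_random_steps` is an assumption for illustration; it only relies on the `reset`, `step`, and `action_space.sample` calls already used above.

```python
def run_random_steps(env, n_steps=1000):
    """Step `env` with random actions and return the total reward of each finished episode."""
    episode_returns = []
    current_return = 0.0
    observation, info = env.reset()
    for _ in range(n_steps):
        action = env.action_space.sample()  # Random action, as in the loop above.
        observation, reward, terminated, truncated, info = env.step(action)
        current_return += float(reward)
        if terminated or truncated:
            # Episode finished: store its return and start a new one.
            episode_returns.append(current_return)
            current_return = 0.0
            observation, info = env.reset()
    return episode_returns
```

Calling `returns = run_random_steps(env)` on a freshly created environment and averaging `returns` (when it is non-empty) gives a random-policy baseline. An episode still in progress when the step budget runs out is not counted.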

#### Step 5: Define a custom wrapper to enhance the reward function
Wrappers allow us to modify the environment behavior, such as changing the reward system.

We create this wrapper so that, during training, the model learns that landing between the flagpoles earns a larger reward and that landing close to the pad is preferable to landing farther away from it.

In [5]:
class PrecisionLandingWrapper(Wrapper):
    def __init__(self, env):
        super().__init__(env)
        
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        
        # `terminated` is True when the episode ends by landing, crashing, or
        # leaving the viewport, so the adjustment below applies in all three cases.
        if terminated:
            x_pos = obs[0]  # Horizontal position
            # Landing pad is roughly between x = -0.1 and x = 0.1
            # Give bonus reward for landing closer to center
            if abs(x_pos) < 0.05:  # Very close to center
                reward += 100  # Big bonus
            elif abs(x_pos) < 0.1:  # Within landing pad
                reward += 50   # Medium bonus
            elif abs(x_pos) < 0.2:  # Close to landing pad
                reward += 5   # Small bonus
            else:  # Far from landing pad
                reward -= 50   # Penalty for landing far away
                
        # # Alt logic: Adjust reward based on landing precision if the episode ends.
        # if terminated:
        #     x_pos = obs[0]  # Horizontal position of the lander.
        #     # Penalize based on distance from pad's center (x=0).
        #     reward -= abs(x_pos) * 10  # Reduce reward for imprecise landings.
                
        return obs, reward, terminated, truncated, info # updated results
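
To see how the two reward schemes in the wrapper differ, the terminal adjustment can be pulled out into plain functions and evaluated at a few positions. This is only a sketch: `tiered_bonus` and `linear_penalty` are illustrative names that mirror the active logic and the commented-out alternative, and the sample positions are arbitrary.

```python
def tiered_bonus(x_pos):
    """Terminal reward adjustment used by the wrapper's active logic."""
    distance = abs(x_pos)
    if distance < 0.05:   # Very close to center
        return 100
    if distance < 0.1:    # Within the range the wrapper treats as the pad
        return 50
    if distance < 0.2:    # Close to the pad
        return 5
    return -50            # Far from the pad


def linear_penalty(x_pos):
    """Terminal reward adjustment from the commented-out alternative."""
    return -abs(x_pos) * 10


for x in (0.0, 0.07, 0.15, 0.5):
    print(f"x={x:+.2f}  tiered={tiered_bonus(x):+4d}  linear={linear_penalty(x):+.1f}")
```

The tiered scheme moves the final reward by up to 150 points between a centered and a distant touchdown, while the linear alternative changes it by at most 10 points across the whole observation range, so it is a much weaker signal.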

#### Step 6: Wrap the environment with the custom wrapper

In [6]:
wrapped_env = PrecisionLandingWrapper(
    gym.make("LunarLander-v3", continuous=False, gravity=-10.0,
             enable_wind=False, wind_power=15.0, turbulence_power=1.5)
)


#### Step 7: Train the PPO agent
Initialize the PPO model with the wrapped environment.

In [7]:
model = PPO(
    "MlpPolicy",  # Multi-layer perceptron policy (neural network).
    wrapped_env,  # Environment to train on.
    verbose=1,  # Verbosity level (1 for progress updates).
)

# Train for 200,000 environment steps; adjust total_timesteps as needed.
model.learn(total_timesteps=200000, log_interval=50)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
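
For readers curious about what PPO optimizes during `learn`, its core idea is to clip the probability ratio between the new and old policy so that a single update cannot move the policy too far. The sketch below computes that clipped objective for one sample in plain Python. It illustrates the formula only and is not Stable-Baselines3's implementation; the function name and the sample values are made up, and `clip_range=0.2` is chosen to match the library's default.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO clipped objective for a single sample.

    ratio      -- new_policy_prob / old_policy_prob for the action taken
    advantage  -- estimate of how much better the action was than average
    clip_range -- allowed deviation of the ratio from 1
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum keeps the more pessimistic of the two estimates.
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical (ratio, advantage) pairs, chosen to show when clipping applies.
for ratio, advantage in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
    print(ratio, advantage, clipped_surrogate(ratio, advantage))
```

With a positive advantage the objective stops growing once the ratio exceeds `1 + clip_range`; with a negative advantage it stops improving once the ratio drops below `1 - clip_range`.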


#### Step 8: Save and load the trained model

In [8]:
model.save("ppo_lunar_lander")  # Save the model to a file.
loaded_model = PPO.load("ppo_lunar_lander")  # Load the model back.

#### Step 9: Evaluate the trained model
Test the model by running it in the environment.

In [None]:
# Reinitialize the environment with rendering enabled
env = gym.make("LunarLander-v3", render_mode='human')
wrapped_env = PrecisionLandingWrapper(env)

# Reset and evaluate the trained model
obs, info = wrapped_env.reset()
for _ in range(10000):  # Limit the number of steps to visualize
    action, _states = loaded_model.predict(obs, deterministic=True)  # Predict the most likely action
    obs, reward, terminated, truncated, info = wrapped_env.step(action)  # Take the action and observe

    if terminated or truncated:
        obs, info = wrapped_env.reset()  # Reset on termination

wrapped_env.close()  # Ensure resources are released
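
Watching the rendered lander is useful, but a numeric score makes it easier to compare runs. One possible sketch is a helper that plays a fixed number of full episodes and reports the mean and standard deviation of the episode returns. `evaluate_episodes` and its `choose_action` parameter are hypothetical names; the helper only assumes the `reset` and `step` interface used above.

```python
from statistics import mean, pstdev


def evaluate_episodes(env, choose_action, n_episodes=10):
    """Run full episodes and return (mean, standard deviation) of the episode returns.

    choose_action -- callable mapping an observation to an action
    """
    returns = []
    for _ in range(n_episodes):
        obs, info = env.reset()
        done = False
        total = 0.0
        while not done:
            obs, reward, terminated, truncated, info = env.step(choose_action(obs))
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return mean(returns), pstdev(returns)
```

With the objects from this notebook, `choose_action` could be `lambda obs: loaded_model.predict(obs, deterministic=True)[0]`, and `env` a newly created environment without `render_mode='human'` so the episodes run at full speed. Evaluating on the wrapped environment includes the precision bonus in the score, while the unwrapped one reports the base reward. Stable-Baselines3 also provides an `evaluate_policy` helper in `stable_baselines3.common.evaluation` for the same purpose.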
