# Create a Simple Rule-Based LunarLander Agent

In this example, we use Gymnasium, a library of environments for training agents via reinforcement learning (RL). We do not use RL here; we only use the environment together with a custom rule-based agent.

## Install Gymnasium

The documentation for Gymnasium is available at https://gymnasium.farama.org/ 

Steps:
1. Create a new folder, open it in VS Code, and install the needed Python extensions.
2. Create a new virtual environment (Ctrl+Shift+P, then "Python: Create Environment...").
3. On WSL2, I needed to install swig and the Python development headers via the terminal:
    * `sudo apt install swig`
    * `sudo apt-get install python3-dev` 
4. Install gymnasium with the needed extras

In [57]:
%pip install gymnasium[box2d,classic_control]

Note: you may need to restart the kernel to use updated packages.


## LunarLander Environment 

The documentation of the environment is available at: https://gymnasium.farama.org/environments/box2d/lunar_lander/

* Performance Measure: A reward of -100 points for crashing or +100 points for landing safely. The environment also
  returns intermediate rewards at every step, but we do not use them here.

* Environment: This environment is a classic rocket trajectory optimization problem. According to Pontryagin’s
  maximum principle, it is optimal to fire the engine at full throttle or turn it off. This is the reason why this environment has discrete actions: engine on or off. 

* Actuators: There are four discrete actions available:

    - 0: do nothing
    - 1: fire left orientation engine
    - 2: fire main engine
    - 3: fire right orientation engine

* Sensors: Each observation is an 8-dimensional vector: the coordinates of the lander in x & y, its linear velocities in x & y, its angle, its angular velocity, and two booleans that represent whether each leg is in contact with the ground or not.
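
The layout of the observation vector described above can be made explicit in code. The following is a minimal sketch, not part of Gymnasium: `LanderState` and `to_state` are hypothetical helper names, and the field order is assumed to follow the order listed in the Sensors bullet.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Hypothetical named view of the 8-dimensional observation."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def to_state(observation: Sequence[float]) -> LanderState:
    """Convert a raw observation vector into named fields."""
    if len(observation) != 8:
        raise ValueError(f"expected 8 values, got {len(observation)}")
    # The first six entries are continuous, the last two are 0/1 contact flags
    *continuous, left, right = (float(v) for v in observation)
    return LanderState(*continuous, bool(left), bool(right))


# Sample values copied from the first observation of the random-agent run
sample = [0.0023921, 1.4112034, 0.2422773, 0.01258546,
          -0.00276507, -0.05487945, 0.0, 0.0]
state = to_state(sample)
print(state.vy, state.left_leg_contact)
```

Named fields such as `state.vy` can make agent rules easier to read than raw indices such as `observation[3]`.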

Define a function that runs a single episode with a given agent, based on the example in the documentation.

In [58]:
import gymnasium as gym

def run_episode(agent):
    # Initialise the environment
    env = gym.make("LunarLander-v3", render_mode="human")

    # Reset the environment to generate the first observation (use seed=42 in reset to get reproducible results)
    observation, info = env.reset()

    # run one episode (max. 1000 steps)
    for _ in range(1000):
        # call the agent to select an action
        action = agent(observation)

        print(f"Obs: {observation} -> Action: {action}")

        # step (transition) through the environment with the action
        observation, reward, terminated, truncated, info = env.step(action)

        env.render()

        # stop when the episode ends (landed or crashed) or is cut off by the time limit
        if terminated or truncated:
            print(f"Final Reward: {reward}")
            break
    
    env.close()
    return reward

## Example: A Random Agent

We randomly return one of the four actions, each with equal probability.


In [59]:
import numpy as np


# define a random agent
def random_agent(observation): 
    return np.random.choice([0, 1, 2, 3], p=[0.25, 0.25, 0.25, 0.25])

run_episode(random_agent)

Obs: [ 0.0023921   1.4112034   0.2422773   0.01258546 -0.00276507 -0.05487945
  0.          0.        ] -> Action: 3
Obs: [ 0.00486937  1.4108973   0.25262433 -0.0136226  -0.00761823 -0.09707143
  0.          0.        ] -> Action: 0
Obs: [ 0.00734673  1.4099916   0.25263974 -0.04028175 -0.01246848 -0.09701412
  0.          0.        ] -> Action: 1
Obs: [ 0.00975447  1.408481    0.24389851 -0.06717046 -0.01556388 -0.06191345
  0.          0.        ] -> Action: 0
Obs: [ 0.01216221  1.4063703   0.24390694 -0.09383942 -0.01865955 -0.06191928
  0.          0.        ] -> Action: 3
Obs: [ 0.01465445  1.4036508   0.25450855 -0.12094494 -0.02388103 -0.10443912
  0.          0.        ] -> Action: 3
Obs: [ 0.01722784  1.4003263   0.26467228 -0.14789109 -0.03113722 -0.1451373
  0.          0.        ] -> Action: 3
Obs: [ 0.01986694  1.3964083   0.272914   -0.17434584 -0.04003643 -0.17800036
  0.          0.        ] -> Action: 3
Obs: [ 0.02256794  1.3918903   0.2806471  -0.20111434 -0.05047909

-100

## A Simple Rule-Based Agent

To make the code easier to read, we use enumerations.

In [60]:
from enum import Enum

class Act(Enum):
    LEFT = 1
    RIGHT = 3
    MAIN = 2
    NO_OP = 0

class Obs(Enum):
    X = 0
    Y = 1
    VX = 2
    VY = 3
    ANGLE = 4
    ANGULAR_VELOCITY = 5
    LEFT_LEG_CONTACT = 6
    RIGHT_LEG_CONTACT = 7
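
As an alternative to calling `.value` everywhere, the same idea can be sketched with `IntEnum`, whose members behave like plain integers and can therefore be used directly as indices. The names `ActInt` and `ObsInt` are hypothetical and chosen only to avoid clashing with the classes above; the observation values are made up for illustration.

```python
from enum import IntEnum


class ActInt(IntEnum):
    NO_OP = 0
    LEFT = 1
    MAIN = 2
    RIGHT = 3


class ObsInt(IntEnum):
    X = 0
    Y = 1
    VX = 2
    VY = 3
    ANGLE = 4
    ANGULAR_VELOCITY = 5
    LEFT_LEG_CONTACT = 6
    RIGHT_LEG_CONTACT = 7


# Made-up observation, only to show the indexing
observation = [0.0, 1.4, 0.5, -0.34, 0.0, 0.0, 0.0, 0.0]

# IntEnum members work as list (and NumPy) indices without .value
vy = observation[ObsInt.VY]
action = ActInt.MAIN if vy < -0.1 else ActInt.NO_OP

# Convert to a plain int before handing the action to the environment
print(vy, int(action))
```

Whether this reads better than a plain `Enum` with `.value` is a matter of taste; the plain `Enum` used above is stricter because its members cannot be confused with arbitrary integers.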


In [61]:
def rocket_agent(observation):
    """A simple agent that fires the main engine whenever the lander falls too fast."""

    if observation[Obs.VY.value] < -0.1:  # vertical velocity below -0.1: falling too fast
        return Act.MAIN.value
        return Act.MAIN.value

    return Act.NO_OP.value 

run_episode(rocket_agent)

Obs: [ 0.00575562  1.4031982   0.5829693  -0.34320864 -0.00666255 -0.13205107
  0.          0.        ] -> Action: 2
Obs: [ 0.01138258  1.3956645   0.56993926 -0.33488384 -0.01382381 -0.14323847
  0.          0.        ] -> Action: 2
Obs: [ 0.01694717  1.3889867   0.5640696  -0.2968776  -0.02134491 -0.1504359
  0.          0.        ] -> Action: 2
Obs: [ 0.02247581  1.3832597   0.56074774 -0.25466934 -0.02914378 -0.15599196
  0.          0.        ] -> Action: 2
Obs: [ 0.02802248  1.3779247   0.56256396 -0.23728271 -0.0369475  -0.15608895
  0.          0.        ] -> Action: 2
Obs: [ 0.03378372  1.3735191   0.583163   -0.19600196 -0.04390562 -0.13917544
  0.          0.        ] -> Action: 2
Obs: [ 0.03962927  1.3697926   0.59135044 -0.16583827 -0.05064116 -0.13472316
  0.          0.        ] -> Action: 2
Obs: [ 0.04565992  1.3670231   0.60923666 -0.12330996 -0.05674446 -0.12207732
  0.          0.        ] -> Action: 2
Obs: [ 0.0515914   1.3644234   0.5999227  -0.11581238 -0.06345761

-100

## Testing

Run the agent on 100 problems and report the average reward.

In [62]:
import numpy as np

def run_episode_test(agent):
    # Initialise the environment
    env = gym.make("LunarLander-v3", render_mode=None)

    # Reset the environment to generate the first observation
    observation, info = env.reset()

    # run one episode (max. 1000 steps)
    for _ in range(1000):
        # call the agent to select an action
        action = agent(observation)

        # step (transition) through the environment with the action
        observation, reward, terminated, truncated, info = env.step(action)
      
        if terminated or truncated:
            break
    
    env.close()
    return reward

def run_episodes(agent, n=100):
    rewards = []
    for _ in range(n):
        rewards.append(run_episode_test(agent))
    return rewards

rewards = run_episodes(rocket_agent)
print(rewards)

print(f"Average reward: {np.average(rewards)}")

[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
Average reward: -100.0
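
The average alone hides how many episodes actually ended in a landing. A minimal sketch of a slightly richer summary follows; `summarize` is a hypothetical helper, and it assumes, as described above, that a final reward of +100 marks a safe landing and -100 a crash.

```python
from statistics import mean


def summarize(rewards):
    """Summarize final episode rewards (assumes +100 = landed, -100 = crashed)."""
    n = len(rewards)
    if n == 0:
        raise ValueError("no rewards to summarize")
    landed = sum(1 for r in rewards if r == 100)
    crashed = sum(1 for r in rewards if r == -100)
    return {
        "episodes": n,
        "average": mean(rewards),
        "landed": landed,
        "crashed": crashed,
        # episodes that ended without a landing or crash, e.g. by the step limit
        "other": n - landed - crashed,
        "success_rate": landed / n,
    }


# Same result as reported above: 100 crashes
print(summarize([-100] * 100))
```

With the list returned by `run_episodes`, this would be called as `summarize(rewards)`.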


## A better rule-based agent

Build a better rule-based agent that uses its right and left thrusters to land the craft. Test your agent function using 100 problems.

In [63]:
# Code goes here
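
One possible starting point is sketched below. It loosely follows the PD-style heuristic controller found in Gymnasium's LunarLander source file: steer toward a target angle that points back to the pad, and fire the main engine when the lander is below a target height or falling too fast. The function name `pd_agent` is hypothetical, all gains and thresholds are assumptions that have not been tuned or evaluated here, and no claim is made about the reward it achieves.

```python
def pd_agent(observation):
    """Hypothetical PD-style rule-based agent (untuned sketch)."""
    x, y, vx, vy, angle, angular_velocity, left_contact, right_contact = observation

    # Target angle: should point towards the pad centre, limited to +/-0.4
    angle_target = max(-0.4, min(0.4, 0.5 * x + 1.0 * vx))
    # Target height: proportional to the horizontal offset from the pad
    hover_target = 0.55 * abs(x)

    # PD terms: proportional to the error, damped by the corresponding velocity
    angle_todo = 0.5 * (angle_target - angle) - 1.0 * angular_velocity
    hover_todo = 0.5 * (hover_target - y) - 0.5 * vy

    # Once a leg touches the ground, stop steering and only damp the fall
    if left_contact or right_contact:
        angle_todo = 0.0
        hover_todo = -0.5 * vy

    if hover_todo > abs(angle_todo) and hover_todo > 0.05:
        return 2  # fire main engine
    if angle_todo < -0.05:
        return 3  # fire right orientation engine
    if angle_todo > 0.05:
        return 1  # fire left orientation engine
    return 0      # do nothing


# Made-up observation: centred, upright, falling fast
print(pd_agent([0.0, 1.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.0]))
```

The agent can be evaluated in the same way as before, with `run_episodes(pd_agent)`, and the constants adjusted based on the results.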