# Lunar Lander Study
### This notebook studies the LunarLander-v2 environment from the Gymnasium library
### The algorithm library is StableBaselines3

*This notebook was created in Jupyter Notebook and is based on the HuggingFace Unit 1 tutorial*

HuggingFace Tutorial: [unit1](https://github.com/huggingface/deep-rl-class/tree/main/unit1)

Environment: [LunarLander-v2](https://gymnasium.farama.org/environments/box2d/lunar_lander/)

RL-Library: [StableBaselines3](https://stable-baselines3.readthedocs.io/en/master/)

In [1]:
# import required libraries
import gym

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import (DummyVecEnv, VecMonitor)
# video_save_utility is a local Python file containing useful scripts such as an mp4 video generator
import video_save_utility

### LunarLander-v2:
It is important to understand both the observation space and the action space.
The observation space contains all relevant data about the lander, and the action space contains all of the actions our agent can take.

In this case, the observation space of the environment includes the following:
- Horizontal x coordinate of the lander
- Vertical y coordinate of the lander
- Linear x velocity
- Linear y velocity
- The lander's angle
- The lander's angular velocity
- Left leg ground contact boolean
- Right leg ground contact boolean

The action space includes the following:
- Do nothing
- Fire left engine
- Fire main engine
- Fire right engine
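
As a small sketch of how these two lists map onto the raw values, the snippet below pairs each observation entry with a readable name and each action index with its meaning. It assumes the list order above matches the index order, as in the Gymnasium documentation; the field names and the example observation values are illustrative and not part of the original notebook.

```python
# Field names follow the order of the observation list above (names are illustrative).
OBSERVATION_FIELDS = (
    "x", "y", "vx", "vy", "angle", "angular_velocity",
    "left_leg_contact", "right_leg_contact",
)

# Action indices follow the order of the action list above.
ACTION_NAMES = {
    0: "do nothing",
    1: "fire left engine",
    2: "fire main engine",
    3: "fire right engine",
}


def label_observation(observation):
    """Pair each of the 8 observation values with a readable field name."""
    if len(observation) != len(OBSERVATION_FIELDS):
        raise ValueError("expected an observation with 8 values")
    return dict(zip(OBSERVATION_FIELDS, observation))


# Hypothetical observation values, chosen only to demonstrate the labeling.
example = label_observation([0.0, 1.4, 0.0, -0.5, 0.0, 0.0, 0.0, 0.0])
print(example["y"], ACTION_NAMES[2])
```

Labeling values this way can make printed observations and sampled actions easier to read while debugging.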

In [2]:
# create a gym environment
env = gym.make("LunarLander-v2")
# reset the gym environment
observation = env.reset()

In [3]:
print("Observation Space:")
# prints out the shape of our observation space
print("Shape: {}".format(observation.shape))
# prints out a random sample from our observation space
print("Sample: {}".format(env.observation_space.sample()))

Observation Space:
Shape: (8,)
Sample: [ 1.8101461  -1.3941718  -0.10744615 -0.60029733  1.0281636  -0.02530295
  1.1444058   1.0060687 ]


In [4]:
print("Action Space:")
# prints out the number of discrete actions in our action space
print("Shape: {}".format(env.action_space.n))
# prints out a random sample from our action space
print("Sample: {}".format(env.action_space.sample()))

Action Space:
Shape: 4
Sample: 0


---
For our training we want to use a vectorized environment so that we can collect more diverse experiences during training. This method runs multiple copies of the same environment and can provide a linear speedup in steps taken per second by sampling the multiple sub-environments at the same time. ([Gymnasium Vectors](https://gymnasium.farama.org/api/vector/))


In [5]:
v_env = make_vec_env('LunarLander-v2', n_envs = 16)

Now an observation from the environment is a batch of vectors of size 16x8 instead of a single observation vector.
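
The batching idea can be sketched with a toy example in plain Python. `ToyEnv` and `ToyVecEnv` are hypothetical stand-ins written for illustration; they are not the Stable-Baselines3 or Gymnasium classes, and they only model `reset`.

```python
import random


class ToyEnv:
    """Hypothetical single environment that returns an 8-value observation."""

    def __init__(self, seed):
        self.rng = random.Random(seed)

    def reset(self):
        return [self.rng.uniform(-1.0, 1.0) for _ in range(8)]


class ToyVecEnv:
    """Holds n independent copies and returns their observations as one batch."""

    def __init__(self, n_envs):
        # Each copy gets its own seed, so the copies produce different experiences.
        self.envs = [ToyEnv(seed=i) for i in range(n_envs)]

    def reset(self):
        return [env.reset() for env in self.envs]


batch = ToyVecEnv(n_envs=16).reset()
# Rows are environments, columns are observation values.
print(len(batch), len(batch[0]))
```

Each row of the batch comes from a different copy of the environment, which is where the extra diversity in the training data comes from.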

---

### PPO: Proximal Policy Optimization

PPO combines ideas from A2C (multiple workers) and TRPO (it uses a trust region to improve the actor) ([sb3](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#example%5D)). The main idea is that after an update, the new policy should not be too far from the old policy. According to the authors, the algorithm alternates between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent ([arxiv.org](https://arxiv.org/abs/1707.06347)).
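
To make "not too far from the old policy" concrete, here is a minimal per-sample sketch of the clipped surrogate objective from the PPO paper. The function name and the example numbers are illustrative assumptions; Stable-Baselines3 computes the same quantity on batches of tensors and minimizes its negative, so this is not the library's code.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Clipped surrogate objective for one sample.

    ratio: new_policy_prob / old_policy_prob for the action that was taken.
    advantage: estimated advantage of that action.
    """
    # Keep the probability ratio inside [1 - clip_range, 1 + clip_range].
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum removes any incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy already moved far toward it: the gain is capped at 1.2 * advantage.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# Bad action, policy already moved far away from it: the objective is capped at 0.8 * advantage.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```

Once the ratio leaves the clip range in the direction that would increase the objective, the objective stops changing with the ratio, so the gradient no longer pushes the policy further in that direction.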

##### Hyperparameters:

Reinforcement learning is highly dependent on hyperparameters, and PPO has several that we can tune. Our inputs are a vector rather than a frame of the game, so we should use an MlpPolicy. This gym example has already been tuned by RLZoo, and we can use their parameters as a starting point.

In [12]:
mlp = 'MlpPolicy'
# learning rate
lr = 0.0003
# number of steps (state-action pairs) to collect from each environment per update
n_steps = 2048 #changed from 1024
# minibatch size, small ~8 medium ~64 large ~512
batch_size = 128 #changed from 64
# number of epochs when optimizing surrogate loss
n_epochs = 8 #changed from 4
# Discount factor
gamma = 0.999
# Factor for tradeoff of bias vs variance for Generalized Advantage Estimator
gae_lambda = 0.98
# Entropy coefficient
ent_coef = 0.01
# number of timesteps to train the agent
n_timesteps = 1_000_000
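
The hyperparameters above, together with the 16 parallel environments, determine how much data each PPO iteration collects and how many iterations the training run needs. The arithmetic below is a sketch of that relationship; the variable names `rollout_size`, `minibatches_per_epoch`, `iterations`, and `actual_timesteps` are introduced here for illustration.

```python
import math

n_envs = 16            # number of parallel environments passed to make_vec_env
n_steps = 2048         # steps collected from each environment per update
batch_size = 128       # minibatch size
n_epochs = 8           # passes over the collected data per update
n_timesteps = 1_000_000

# Transitions collected before each policy update.
rollout_size = n_envs * n_steps
# Minibatches processed in one pass over the collected data.
minibatches_per_epoch = rollout_size // batch_size
# Training proceeds in whole rollouts, so the requested total is rounded up.
iterations = math.ceil(n_timesteps / rollout_size)
actual_timesteps = iterations * rollout_size

print(rollout_size, minibatches_per_epoch, iterations, actual_timesteps)
```

Because training proceeds in whole rollouts, the final `total_timesteps` reported in the training log is slightly above the requested 1 million.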


In [13]:
# define a model using our above hyperparameters
model = PPO(policy = mlp,
            env = v_env,
            learning_rate = lr,
            n_steps = n_steps,
            batch_size = batch_size,
            n_epochs = n_epochs,
            gamma = gamma,
            gae_lambda = gae_lambda,
            ent_coef = ent_coef,
            verbose = 1
           )

Using cuda device


Next we train the agent. This process can be time-consuming. Let's train for 1 million steps, similar to the recommended hyperparameters from RLZoo.

In [14]:
model.learn(total_timesteps=n_timesteps)

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 94.6     |
|    ep_rew_mean     | -193     |
| time/              |          |
|    fps             | 3213     |
|    iterations      | 1        |
|    time_elapsed    | 10       |
|    total_timesteps | 32768    |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 90.9         |
|    ep_rew_mean          | -130         |
| time/                   |              |
|    fps                  | 1948         |
|    iterations           | 2            |
|    time_elapsed         | 33           |
|    total_timesteps      | 65536        |
| train/                  |              |
|    approx_kl            | 0.0071681016 |
|    clip_fraction        | 0.0665       |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.38        |
|    explained_variance   | -0.00268     |
|    learning_r

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 435         |
|    ep_rew_mean          | -9.57       |
| time/                   |             |
|    fps                  | 1141        |
|    iterations           | 11          |
|    time_elapsed         | 315         |
|    total_timesteps      | 360448      |
| train/                  |             |
|    approx_kl            | 0.008194237 |
|    clip_fraction        | 0.0617      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.22       |
|    explained_variance   | -5.96e-07   |
|    learning_rate        | 0.0003      |
|    loss                 | 298         |
|    n_updates            | 80          |
|    policy_gradient_loss | -0.00278    |
|    value_loss           | 550         |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 510   

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 942          |
|    ep_rew_mean          | 108          |
| time/                   |              |
|    fps                  | 831          |
|    iterations           | 21           |
|    time_elapsed         | 827          |
|    total_timesteps      | 688128       |
| train/                  |              |
|    approx_kl            | 0.0077053616 |
|    clip_fraction        | 0.0433       |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.11        |
|    explained_variance   | 0.953        |
|    learning_rate        | 0.0003       |
|    loss                 | 6.7          |
|    n_updates            | 160          |
|    policy_gradient_loss | -0.00165     |
|    value_loss           | 49.1         |
------------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 924         |
|    ep_rew_mean          | 157         |
| time/                   |             |
|    fps                  | 768         |
|    iterations           | 31          |
|    time_elapsed         | 1321        |
|    total_timesteps      | 1015808     |
| train/                  |             |
|    approx_kl            | 0.003738003 |
|    clip_fraction        | 0.04        |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.802      |
|    explained_variance   | 0.981       |
|    learning_rate        | 0.0003      |
|    loss                 | 40.9        |
|    n_updates            | 240         |
|    policy_gradient_loss | -0.00137    |
|    value_loss           | 33.6        |
-----------------------------------------


<stable_baselines3.ppo.ppo.PPO at 0x139a0857a30>

### Training function outputs:
- Rollout:
    - ep_len_mean: mean episode length
    - ep_rew_mean: mean episodic training reward averaged over 100 episodes
- Time:
    - fps: number of frames per second including the time taken by gradient updates
    - iterations: number of iterations (data collection + policy update for A2C/PPO)
    - time_elapsed: time in seconds since beginning of training
    - total_timesteps: total number of timesteps since beginning of training
- Train:
    - approx_kl: approximate mean KL divergence between old and new policy (for PPO). An estimation of how much change happened in the update
    - clip_fraction: mean fraction of surrogate loss that was clipped (above clip range threshold)
    - clip_range: current value of clipping factor for surrogate loss
    - entropy_loss: mean value of entropy loss (negative of the average of policy entropy)
    - explained_variance: fraction of the return variance explained by the value function (ev=0 => might as well have predicted 0, ev=1 => perfect prediction, ev<0 => worse than predicting 0)
    - learning_rate: current learning rate
    - loss: current total loss
    - n_updates: number of gradient updates so far
    - policy_gradient_loss: current value of policy gradient loss (value does not have much meaning)
    - value_loss: Current value for value function loss for on-policy algorithms, usually the error between value function and Monte-Carlo estimate
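
As a sketch of how to read `explained_variance`, the snippet below computes it as 1 - Var(returns - predictions) / Var(returns) on a few made-up numbers. The helper name and the sample values are assumptions for illustration only; the logged value is computed by Stable-Baselines3 on the actual rollout data.

```python
from statistics import pvariance


def explained_variance(predicted, returns):
    """Fraction of the variance in `returns` captured by `predicted`."""
    var_returns = pvariance(returns)
    if var_returns == 0:
        return float("nan")  # undefined when the returns do not vary
    residuals = [r - p for r, p in zip(returns, predicted)]
    return 1.0 - pvariance(residuals) / var_returns


returns = [1.0, 2.0, 3.0, 4.0]  # made-up returns for illustration

perfect = explained_variance(returns, returns)                 # predictions match exactly
constant = explained_variance([0.0, 0.0, 0.0, 0.0], returns)   # always predicting 0
reversed_guess = explained_variance(returns[::-1], returns)    # systematically wrong

print(perfect, constant, reversed_guess)
```

In the training log above, the value starts near 0 and climbs toward 1 as the value function learns to predict returns.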

In [15]:
#save model
model.save('ppo-LunarLanderv2')

After creating and saving the model, it is important to evaluate it and determine how well the trained agent performs.

In [6]:
model = PPO.load('ppo-LunarLanderv2')
# Create an evaluation environment ***Note! The evaluation environment needs to be vectorized if the agent is vectorized
eval_env = env
# mean_reward and std_reward are the average reward per episode and the standard deviation of that reward
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes = 10, deterministic = True)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")



mean_reward=259.78 +/- 13.672612420427125


This is a good result! LunarLander is considered solved at a score of 200, so a mean reward above that indicates that the lander is landing successfully. Below we will save an mp4 file so that we can visualize the results.

---

In [7]:
#reuse the evaluation environment as the replay environment
replay_env = eval_env
#length of the video in timesteps
video_length = 2000
#set model to be deterministic
is_d = True

video_save_utility.generate_replay(model, replay_env, video_length, is_d)


Saving video to C:\Users\socce\AppData\Local\Temp\tmphhv38mja\-step-0-to-step-2000.mp4
