<a href="https://colab.research.google.com/github/mohamedyosef101/101_learning_area/blob/area/Reinforcement%20Learning/01_land_on_the_moon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> *This project is part of the Hugging Face Deep Reinforcement Learning course. Check the [source code](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit1/unit1.ipynb#scrollTo=BE5JWP5rQIKf) if you want to learn more.*

# Install **the libraries**

In [1]:
# System packages needed to build Box2D
!apt install swig cmake

# Python packages
!pip install stable-baselines3
!pip install swig
!pip install "gymnasium[box2d]"
!pip install huggingface_sb3

# Virtual display and video tools
!sudo apt-get update
!apt install python3-opengl
!apt install ffmpeg
!apt install xvfb
!pip install pyvirtualdisplay
!pip install IPython --upgrade

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swig is already the newest version (4.0.2-1ubuntu1).
cmake is already the newest version (3.22.1-1ubuntu1.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.
Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ub

In [2]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x7b427a9cba00>

# Import the **packages**

In [4]:
import gymnasium as gym
from huggingface_sb3 import load_from_hub, package_to_hub
from huggingface_hub import notebook_login

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# Work with **Gymnasium**

With Gymnasium:
1. We create our environment using `gymnasium.make()`
2. We reset the environment to its initial state with `observation, info = env.reset()`
<br>

At each step:

3. We get an action from our model (in this example, we take a random action)
4. Using `env.step(action)`, we perform this action in the environment and get:
- `observation`: The new state ($s_{t+1}$).
- `reward`: The reward we get after executing the action.
- `terminated`: Indicates whether the episode has terminated (the agent reached a terminal state).
- `truncated`: Part of the newer step API, it indicates that the episode was cut short, for instance by a time limit or because the agent went out of the environment's bounds.
- `info`: A dictionary that provides additional information (depends on the environment).

<br>

If the episode is terminated or truncated:
- We reset the environment to its initial state with `observation, info = env.reset()`

---


💡 A good habit when you start to use an environment is to check its documentation

👉 https://gymnasium.farama.org/environments/box2d/lunar_lander/

---
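The reset/step cycle described above can be tried without Box2D. The sketch below is only an illustration: `CountdownEnv` and `run_random_steps` are hypothetical names, not part of Gymnasium, and the toy environment merely imitates the return values of `reset()` and `step()`.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment that imitates Gymnasium's reset/step signatures."""

    def __init__(self, start=5, max_steps=8):
        self.start = start
        self.max_steps = max_steps
        self.state = start
        self.steps = 0

    def reset(self, seed=None):
        # Like Gymnasium, reset() returns an (observation, info) pair.
        self.state = self.start
        self.steps = 0
        return self.state, {}

    def step(self, action):
        # Action 1 moves the counter toward 0; action 0 does nothing.
        self.state -= action
        self.steps += 1
        terminated = self.state == 0  # the task itself ended
        truncated = (not terminated) and self.steps >= self.max_steps  # time limit hit
        reward = 1.0 if terminated else -0.1
        return self.state, reward, terminated, truncated, {}


def run_random_steps(env, n_steps, seed=0):
    """Take n_steps random actions, resetting whenever an episode ends."""
    rng = random.Random(seed)
    observation, info = env.reset()
    finished_episodes = 0
    for _ in range(n_steps):
        action = rng.choice([0, 1])
        observation, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            finished_episodes += 1
            observation, info = env.reset()
    return finished_episodes


print(run_random_steps(CountdownEnv(), n_steps=50))
```

Checking both flags matters: an episode that only hits the time limit never sets `terminated`, so a loop that tests `terminated` alone would keep stepping a finished episode.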


In [5]:
# 1. Creating the environment
env = gym.make("LunarLander-v2")

# 2. Reset the environment
observation, info = env.reset()

for _ in range(20):
  # Take a random action
  action = env.action_space.sample()
  print(f"Action taken: {action}")

  # Perform the action and collect the results
  observation, reward, terminated, truncated, info = env.step(action)

  # The episode is ...
  # terminated when the lander has landed or crashed
  # truncated when the time limit is reached

  if terminated or truncated:
    # Reset the environment
    print("Environment is reset")
    observation, info = env.reset()
env.close()

Action taken: 0
Action taken: 2
Action taken: 0
Action taken: 0
Action taken: 2
Action taken: 1
Action taken: 0
Action taken: 0
Action taken: 2
Action taken: 3
Action taken: 2
Action taken: 0
Action taken: 2
Action taken: 1
Action taken: 1
Action taken: 1
Action taken: 0
Action taken: 1
Action taken: 3
Action taken: 1


There are four discrete actions available:

* **0:** do nothing
* **1:** fire left orientation engine
* **2:** fire main engine
* **3:** fire right orientation engine
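
Raw action ids are hard to read in logs. As a small sketch, the four ids listed above can be given names with an `IntEnum`; the class name `LanderAction` and its member names are our own choice here, not something Gymnasium provides.

```python
from enum import IntEnum


class LanderAction(IntEnum):
    """Hypothetical names for the four discrete action ids listed above."""
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


# First five actions from the random rollout printed earlier
sampled = [0, 2, 0, 0, 2]
print([LanderAction(a).name for a in sampled])
```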

# Create the **LunarLander** environment

In [6]:
# See what the environment looks like...
env = gym.make("LunarLander-v2")
env.reset()
print(f"""_________OBSERVATION SPACE____________
      \nObservation Space Shape {env.observation_space.shape}
      \nSample observations \n{env.observation_space.sample()}
""")

_________OBSERVATION SPACE____________
      
Observation Space Shape (8,)
      
Sample observations 
[-0.6354245  -0.65760815 -2.4995499   1.3842746  -0.9205056  -1.5467273
  0.30492884  0.98875135]



**The observation is a vector of size 8**, where each value describes a different aspect of the lander:
* Horizontal coordinate of the lander (x), with the landing pad at the origin
* Vertical coordinate of the lander (y), with the landing pad at the origin
* Horizontal speed (x)
* Vertical speed (y)
* Angle
* Angular speed
* Whether the left leg contact point has touched the ground (boolean)
* Whether the right leg contact point has touched the ground (boolean)
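
To keep the eight positions straight, the vector can be unpacked into named fields. This is a minimal sketch: `LanderObservation` and `label_observation` are hypothetical helpers, and the example values are made up. Note that `observation_space.sample()` draws random values within the space bounds, so the leg flags in the sample printed above are not clean 0/1 values the way real observations are.

```python
from typing import NamedTuple, Sequence


class LanderObservation(NamedTuple):
    """Hypothetical named view of the 8-value observation vector."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def label_observation(values: Sequence[float]) -> LanderObservation:
    """Attach field names to a raw observation, in the order listed above."""
    if len(values) != 8:
        raise ValueError(f"expected 8 values, got {len(values)}")
    *continuous, left, right = values
    return LanderObservation(*(float(v) for v in continuous), bool(left), bool(right))


# Made-up observation: slightly right of the pad, descending, right leg touching
obs = label_observation([0.01, 1.40, -0.30, -0.55, 0.02, 0.10, 0.0, 1.0])
print(obs.vy, obs.right_leg_contact)
```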


In [8]:
# action_space.n is the number of discrete actions; sample() draws a random one
print(f"""_________ACTION SPACE____________
      \nAction Space Shape {env.action_space.n}
      \nAction Space Sample {env.action_space.sample()}
""")

_________ACTION SPACE____________
      
Action Space Shape 4
      
Action Space Sample 2



# Vectorized Environment

In [None]:
env = make_vec_env('LunarLander-v2', n_envs=16)

# Create the model

In [None]:
# Create a single environment (this replaces the vectorized env above; SB3 wraps it in a DummyVecEnv)
env = gym.make("LunarLander-v2")

# Instantiate the agent
model = PPO('MlpPolicy', env, verbose=1)

# Train the agent
model.learn(total_timesteps=int(2e5))

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 91.9     |
|    ep_rew_mean     | -181     |
| time/              |          |
|    fps             | 480      |
|    iterations      | 1        |
|    time_elapsed    | 4        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 91.8        |
|    ep_rew_mean          | -165        |
| time/                   |             |
|    fps                  | 423         |
|    iterations           | 2           |
|    time_elapsed         | 9           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008079256 |
|    clip_fraction        | 0.0324      |
|    clip_range           | 0.2         |
|    entropy_loss  

<stable_baselines3.ppo.ppo.PPO at 0x7fee3c20e500>

In [None]:
# Set some hyperparameters to speed up training

model = PPO(
    policy="MlpPolicy",
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=0,
)
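
To get a feel for what these numbers imply, here is a rough back-of-the-envelope sketch. It assumes a single environment (as created with `gym.make` above) and that each rollout of `n_steps * n_envs` transitions is split into minibatches of `batch_size` and reused for `n_epochs` passes. The helper name `ppo_update_budget` is hypothetical, and `1 / (1 - gamma)` is only the usual rule of thumb for the effective horizon.

```python
import math


def ppo_update_budget(total_timesteps, n_steps, n_envs, batch_size, n_epochs, gamma):
    """Rough training arithmetic for PPO-style rollouts (illustrative only)."""
    rollout_size = n_steps * n_envs                       # transitions collected per rollout
    minibatches_per_epoch = rollout_size // batch_size    # gradient steps in one pass
    updates_per_rollout = minibatches_per_epoch * n_epochs
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    return {
        "rollout_size": rollout_size,
        "minibatches_per_epoch": minibatches_per_epoch,
        "updates_per_rollout": updates_per_rollout,
        "n_rollouts": n_rollouts,
        "effective_horizon": round(1 / (1 - gamma)),      # rule-of-thumb discount horizon
    }


budget = ppo_update_budget(
    total_timesteps=1_000_000, n_steps=1024, n_envs=1,
    batch_size=64, n_epochs=4, gamma=0.999,
)
print(budget)
```

With the values above, this works out to 16 minibatches per epoch, 64 gradient updates per rollout, and 977 rollouts to cover 1M timesteps.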

In [None]:
# train the model for 1M timesteps
model.learn(total_timesteps=1000000)

<stable_baselines3.ppo.ppo.PPO at 0x7fee39540b80>

In [None]:
# Evaluate the model
eval_env = Monitor(gym.make("LunarLander-v2"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

mean_reward=265.02 +/- 30.133980042855278
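
The two reported numbers summarize the per-episode returns of the evaluation run. As a sketch, assuming they are the mean and the population standard deviation of those returns (which is our understanding of how `evaluate_policy` aggregates them), the computation looks like this. The returns below are made-up values, not results from this notebook, and `summarize_returns` is a hypothetical helper.

```python
import statistics


def summarize_returns(episode_returns):
    """Return (mean, population std) of a list of per-episode returns."""
    mean_reward = statistics.fmean(episode_returns)
    std_reward = statistics.pstdev(episode_returns)
    return mean_reward, std_reward


# Hypothetical returns from five evaluation episodes
returns = [240.0, 310.0, 255.0, 280.0, 230.0]
mean_r, std_r = summarize_returns(returns)
print(f"mean_reward={mean_r:.2f} +/- {std_r:.2f}")
```

A large standard deviation relative to the mean suggests the agent lands well in some episodes and poorly in others, so increasing `n_eval_episodes` gives a more reliable estimate.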
