<a href="https://colab.research.google.com/github/prisar/ai_notebooks/blob/main/nb_089.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
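Train a PPO or A2C agent from Stable Baselines3 to solve the continuous-control variant of the LunarLander environment (`LunarLander-v2` as originally specified; the notebook switches to `LunarLander-v3` because v2 is deprecated).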

## Install dependencies

### Subtask:
Install the necessary libraries, including `stable-baselines3`, `huggingface-sb3`, `shimmy`, and `gymnasium`.


**Reasoning**:
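The project needs four libraries, so I will install them in a single code cell. The `[extra]` variant of `stable-baselines3` also pulls in optional dependencies such as `tqdm` and `rich`, which the training progress bar used later relies on.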



In [1]:
!pip install "stable-baselines3[extra]" huggingface-sb3 shimmy gymnasium

Collecting huggingface-sb3
  Downloading huggingface_sb3-3.0-py3-none-any.whl.metadata (6.3 kB)
Collecting shimmy
  Downloading Shimmy-2.0.0-py3-none-any.whl.metadata (3.5 kB)
Collecting stable-baselines3[extra]
  Downloading stable_baselines3-2.7.0-py3-none-any.whl.metadata (4.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch<3.0,>=2.3->stable-baselines3[extra])
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch<3.0,>=2.3->stable-baselines3[extra])
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch<3.0,>=2.3->stable-baselines3[extra])
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch<3.0,>=2.3->stable-baselines3[extra])
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-

## Import libraries

### Subtask:
Import the necessary modules from the installed libraries.


**Reasoning**:
I will import the necessary modules as instructed in a single code block. `gymnasium` will be imported as `gym`, `PPO` will be imported from `stable_baselines3`, and `evaluate_policy` will be imported from `stable_baselines3.common.evaluation`.



In [2]:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

## Create the environment

### Subtask:
Create the LunarLander-v2 environment.


**Reasoning**:
I will now create the LunarLander-v2 environment with continuous actions using `gym.make()`, store it in the `env` variable, and reset it to get the initial observation, as per the instructions.
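
For context, `continuous=True` replaces the four discrete actions with a two-component action in [-1, 1]: the first component is the main-engine throttle and the second the lateral-booster throttle. The sketch below is a plain-Python reading of the rules described in the Gymnasium documentation, not the environment's own code; the helper name `describe_action` is hypothetical, and clipping out-of-range values is an assumption.

```python
def describe_action(main: float, lateral: float) -> dict:
    """Translate a continuous LunarLander action into engine throttles.

    Rules as described in the Gymnasium docs: the main engine is off for
    main < 0 and scales from 50% to 100% on [0, 1]; a lateral booster fires
    only when |lateral| > 0.5, scaling from 50% to 100% as |lateral| -> 1.
    """
    # Assumption: values outside [-1, 1] are clipped to the valid range.
    main = max(-1.0, min(1.0, main))
    lateral = max(-1.0, min(1.0, lateral))

    # Main engine: off below zero, otherwise at least half power.
    main_throttle = 0.0 if main < 0 else 0.5 + 0.5 * main

    # Lateral boosters: dead zone between -0.5 and 0.5.
    if abs(lateral) <= 0.5:
        booster, lateral_throttle = None, 0.0
    else:
        booster = "left" if lateral < 0 else "right"
        lateral_throttle = abs(lateral)  # 0.5 -> 50%, 1.0 -> 100%

    return {
        "main_throttle": main_throttle,
        "booster": booster,
        "lateral_throttle": lateral_throttle,
    }


print(describe_action(1.0, 0.0))
print(describe_action(-0.3, -1.0))
```

The dead zones mean that many small actions do nothing at all, which is one reason an untrained policy tends to score poorly at first.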



In [3]:
env = gym.make("LunarLander-v2", continuous=True)
obs, info = env.reset()

  logger.deprecation(


DeprecatedEnv: Environment version v2 for `LunarLander` is deprecated. Please use `LunarLander-v3` instead.

**Reasoning**:
The previous attempt to create the environment failed because version `v2` of `LunarLander` is deprecated. I will now correct this by using the recommended version `v3` and re-run the code to create and reset the environment.
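
Rather than hard-coding a version suffix, one option is to pick the newest registered version of an environment by name. The sketch below works on a plain list of ids; the helper `latest_version` and the example ids are hypothetical. With Gymnasium installed, the real ids should be available from `gym.registry.keys()`.

```python
import re
from typing import Iterable, Optional


def latest_version(env_name: str, registered_ids: Iterable[str]) -> Optional[str]:
    """Return the highest-versioned id of the form '<env_name>-v<N>', or None."""
    pattern = re.compile(rf"^{re.escape(env_name)}-v(\d+)$")
    versions = []
    for env_id in registered_ids:
        match = pattern.match(env_id)
        if match:
            versions.append(int(match.group(1)))
    return f"{env_name}-v{max(versions)}" if versions else None


# Illustrative ids only, not read from a real registry.
example_ids = ["CartPole-v1", "LunarLander-v3", "BipedalWalker-v3"]
print(latest_version("LunarLander", example_ids))
```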



In [4]:
env = gym.make("LunarLander-v3", continuous=True)
obs, info = env.reset()

DependencyNotInstalled: Box2D is not installed, you can install it by run `pip install swig` followed by `pip install "gymnasium[box2d]"`

**Reasoning**:
The previous attempt to create the environment failed because the `Box2D` dependency is not installed. I will now install `swig` and `gymnasium[box2d]` as suggested by the error message.



In [5]:
!pip install swig
!pip install "gymnasium[box2d]"

Collecting swig
  Downloading swig-4.3.1-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (3.5 kB)
Downloading swig-4.3.1-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.9 MB)
Installing collected packages: swig
Successfully installed swig-4.3.1
Collecting box2d-py==2.3.5 (from gymnasium[box2d])
  Downloading box2d-py-2.3.5.tar.gz (374 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: box2d-py
  Building wheel for box2d-py (se

**Reasoning**:
Now that the `Box2D` dependency is installed, I can re-run the code to create and reset the `LunarLander-v3` environment.
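
A quick way to confirm that the dependencies are importable before creating the environment is to look them up without importing them. This is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and `Box2D` is assumed to be the import name provided by the `box2d-py` package.

```python
import importlib
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    # Refresh import caches so packages installed in this session are seen.
    importlib.invalidate_caches()
    return [name for name in module_names if importlib.util.find_spec(name) is None]


required = ["gymnasium", "stable_baselines3", "Box2D"]
missing = missing_modules(required)
print("Missing modules:", missing if missing else "none")
```

If `Box2D` still shows up as missing after the install cell, restarting the runtime and re-running the imports is usually worth trying.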



In [6]:
env = gym.make("LunarLander-v3", continuous=True)
obs, info = env.reset()

  from pkg_resources import resource_stream, resource_exists
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)


## Instantiate the agent

### Subtask:
Instantiate a PPO agent from Stable Baselines3.


**Reasoning**:
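I will now instantiate the PPO agent with the `MlpPolicy` network, the `env` created above, and `verbose=1` so that training statistics are printed. All other hyperparameters keep their default values.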



In [7]:
model = PPO(
    policy="MlpPolicy",
    env=env,
    verbose=1
)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




## Train the agent

### Subtask:
Train the PPO agent on the LunarLander-v3 environment (the replacement for the deprecated v2) for a total of 50,000 timesteps.


**Reasoning**:
The next step is to train the PPO agent for 50,000 timesteps, as specified in the instructions. Because `LunarLander-v2` is deprecated, training runs on the `LunarLander-v3` environment created above. I will call the `learn` method on `model`, setting `total_timesteps` to 50,000 and `progress_bar=True` to display a progress bar during training.
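
A note on what 50,000 timesteps means in practice: PPO collects whole rollouts, so training runs until the first multiple of the rollout size that reaches `total_timesteps`. Assuming the default rollout length of 2,048 steps and a single environment, the arithmetic can be sketched as:

```python
import math

total_timesteps = 50_000
n_steps = 2048  # assumed default rollout length per environment
n_envs = 1      # assumed single environment

rollout_size = n_steps * n_envs
# Training continues until the step counter reaches total_timesteps,
# and the counter only advances in whole rollouts.
iterations = math.ceil(total_timesteps / rollout_size)
actual_timesteps = iterations * rollout_size

print(f"iterations: {iterations}, timesteps actually collected: {actual_timesteps}")
```

Under these assumptions the run performs 25 iterations and collects 51,200 timesteps, slightly more than requested.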



In [8]:
model.learn(total_timesteps=50000, progress_bar=True)

Output()

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 120      |
|    ep_rew_mean     | -298     |
| time/              |          |
|    fps             | 568      |
|    iterations      | 1        |
|    time_elapsed    | 3        |
|    total_timesteps | 2048     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 118          |
|    ep_rew_mean          | -281         |
| time/                   |              |
|    fps                  | 479          |
|    iterations           | 2            |
|    time_elapsed         | 8            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0045628757 |
|    clip_fraction        | 0.0334       |
|    clip_range           | 0.2          |
|    entropy_loss         | -2.85        |
|    explained_variance   | 0.00668      |
|    learning_r

<stable_baselines3.ppo.ppo.PPO at 0x7a5d3e394d10>

## Evaluate the agent

### Subtask:
Evaluate the trained agent's performance.


**Reasoning**:
To evaluate the trained agent's performance, I will use the `evaluate_policy` function from `stable_baselines3.common.evaluation`. This function runs the agent in the environment for a specified number of episodes, choosing actions deterministically by default, and returns the mean and standard deviation of the total reward per episode. I will set the number of evaluation episodes to 10 as requested, store the results in `mean_reward` and `std_reward`, and print both values.



In [9]:
# Run 10 evaluation episodes and summarize the per-episode total rewards.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)

print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")



Mean reward: 63.45 +/- 163.80
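
The two numbers reported are the mean and the standard deviation of the per-episode returns. With only 10 episodes, the mean itself is an uncertain estimate. A rough standard-error calculation makes that concrete; it is an approximation, and the target of 20 below is a hypothetical choice for illustration.

```python
import math

mean_reward = 63.45     # from the evaluation output
std_reward = 163.80     # from the evaluation output
n_eval_episodes = 10

# Standard error of the mean: uncertainty of the average, not of one episode.
standard_error = std_reward / math.sqrt(n_eval_episodes)
print(f"Standard error of the mean: {standard_error:.2f}")

# Episodes needed to bring the standard error down to a hypothetical target.
target_standard_error = 20.0
episodes_needed = math.ceil((std_reward / target_standard_error) ** 2)
print(f"Episodes needed for a standard error of {target_standard_error:.0f}: {episodes_needed}")
```

A standard error of roughly 52 means the reported mean could shift noticeably between evaluation runs, so a larger `n_eval_episodes` gives a more trustworthy comparison.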


## Summary:

### Data Analysis Key Findings
*   A PPO agent was successfully trained on the continuous-action `LunarLander-v3` environment for 50,000 timesteps (`LunarLander-v2`, named in the task, is deprecated).
*   During training, the agent showed significant improvement, with the mean episode reward increasing from -298 to -70.2.
*   Post-training evaluation over 10 episodes yielded a mean reward of 63.45 with a standard deviation of 163.80.

### Insights or Next Steps
*   The trained agent has learned a partially successful policy but has not solved the environment: the evaluation mean of 63.45 is well below the average score of 200 that is typically treated as solving LunarLander. To improve performance, consider increasing `total_timesteps` for training.
*   The evaluation standard deviation (163.80) is more than twice the mean, which suggests that performance varies widely from episode to episode. Further hyperparameter tuning or a longer training run could help stabilize it.
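
One way the next steps above might look in code is to train for longer and evaluate on a separate, `Monitor`-wrapped environment over more episodes. This is a sketch under stated assumptions: the function name, the 500,000-step budget, and the 50 evaluation episodes are illustrative choices, not tuned values.

```python
def train_and_evaluate(total_timesteps=500_000, n_eval_episodes=50, seed=0):
    """Train PPO on continuous LunarLander-v3, then evaluate on a fresh env."""
    # Imports are kept inside the function so it can be defined even before
    # the dependencies (including Box2D) are installed.
    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy
    from stable_baselines3.common.monitor import Monitor

    train_env = gym.make("LunarLander-v3", continuous=True)
    # A separate Monitor-wrapped env so evaluation reports unmodified episode returns.
    eval_env = Monitor(gym.make("LunarLander-v3", continuous=True))

    model = PPO("MlpPolicy", train_env, seed=seed, verbose=0)
    model.learn(total_timesteps=total_timesteps)

    mean_reward, std_reward = evaluate_policy(
        model, eval_env, n_eval_episodes=n_eval_episodes, deterministic=True
    )

    train_env.close()
    eval_env.close()
    return mean_reward, std_reward
```

Calling `train_and_evaluate()` and comparing the result against the 200-point threshold would show whether the longer budget is enough; if not, the hyperparameters are the next thing to revisit.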
