# Gym Environments and Implementing Reinforcement Learning Agents with Stable Baselines

In [None]:
import gymnasium as gym
from stable_baselines3 import PPO, A2C, SAC
from stable_baselines3.common.evaluation import evaluate_policy

In [None]:
env = gym.make('BipedalWalker-v3')

We chose the BipedalWalker-v3 environment because it has a continuous action space and is challenging to solve.

We chose the PPO, A2C and SAC algorithms to train our agents because ...

Let's first test the environment with an untrained agent from each algorithm:

In [None]:
def test_performance(model):
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, warn=False)

    print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")

In [None]:
untrained_model = PPO("MlpPolicy", env, verbose=0)

test_performance(untrained_model)

In [None]:
untrained_model = A2C("MlpPolicy", env, verbose=0)

test_performance(untrained_model)

In [None]:
untrained_model = SAC("MlpPolicy", env, verbose=0)

test_performance(untrained_model)

As we can see, the untrained agents perform poorly.
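For reference, the two numbers printed by `test_performance` are the mean and the standard deviation of the total reward per evaluation episode. The sketch below reproduces that summary with the standard library only, assuming a population standard deviation. The helper name `summarize_returns` and the sample returns are hypothetical.

```python
from statistics import fmean, pstdev


def summarize_returns(episode_returns):
    """Return (mean, std) of per-episode returns, using the population std."""
    return fmean(episode_returns), pstdev(episode_returns)


# Hypothetical returns from three evaluation episodes (not measured values)
mean_reward, std_reward = summarize_returns([-100.0, -90.0, -110.0])
print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```

A large standard deviation relative to the mean suggests that the agent's behaviour varies a lot between episodes, which is why both numbers are reported.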

Now let's train a model with each algorithm.

## Training the models without changes to the environment or the models

We wrote a script that creates a model and starts training it. If a saved model already exists, the script loads the latest checkpoint and continues training it.

The training script is the following:

```python
import os
from sys import argv

from stable_baselines3 import A2C, PPO, SAC
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

env_id = "BipedalWalker-v3"

TIMESTEPS = 100000
models_dir = "models"
logdir = "logs"


def latest_model(algorithm):
    models = [int(model.split(".")[0]) for model in os.listdir(f"{models_dir}/{algorithm}")]
    models.sort()
    return f"{models_dir}/{algorithm}/{models[-1]}.zip"


def train_model(algo, algo_name, policy, n_envs):
    env = make_vec_env(env_id, n_envs=n_envs, vec_env_cls=SubprocVecEnv, env_kwargs=dict(hardcore=False),
                       vec_env_kwargs=dict(start_method='fork'))

    if os.path.exists(f"{models_dir}/{algo_name}"):
        if os.listdir(f"{models_dir}/{algo_name}"):

            model_path = latest_model(algo_name)
            model = algo.load(model_path, env=env)
            # Checkpoints are named by cumulative timesteps, so divide by TIMESTEPS
            iters = int(model_path.split("/")[2].split(".")[0]) // TIMESTEPS
        else:
            model = algo(policy, env, verbose=1, tensorboard_log=logdir)
            iters = 0
    else:
        os.makedirs(f"{models_dir}/{algo_name}")
        model = algo(policy, env, verbose=1, tensorboard_log=logdir)
        iters = 0

    while True:
        iters += 1
        model.learn(total_timesteps=TIMESTEPS, progress_bar=True, reset_num_timesteps=False, tb_log_name=algo_name) 
        model.save(f"{models_dir}/{algo_name}/{TIMESTEPS * iters}")


def main():
    try:
        if len(argv) != 2:
            raise ValueError("Expected exactly one argument. Please specify which model to train.")

        if not os.path.exists(logdir):
            os.makedirs(logdir)

        model_type = argv[1]

        if model_type == "A2C":
            train_model(A2C, model_type, "MlpPolicy", n_envs=3)
        elif model_type == "PPO":
            train_model(PPO, model_type, "MlpPolicy", n_envs=5)
        elif model_type == "SAC":
            train_model(SAC, model_type, "MlpPolicy", n_envs=20)
        else:
            raise ValueError("Invalid argument. Please specify a valid algorithm.")

    except ValueError as e:
        print(f"Error: {e}")


if __name__ == '__main__':
    main()


```
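The resume logic in the script relies on each checkpoint being named after its cumulative timestep count (for example `200000.zip`). The sketch below isolates that logic as pure functions so it can be checked without training anything. The names `latest_checkpoint` and `completed_iterations`, and the filtering of non-checkpoint files, are assumptions added for illustration and are not part of the original script.

```python
def latest_checkpoint(filenames):
    """Return (filename, timesteps) for the checkpoint with the most timesteps."""
    checkpoints = []
    for name in filenames:
        stem, dot, ext = name.rpartition(".")
        # Skip anything that is not "<integer>.zip", e.g. stray system files
        if dot and ext == "zip" and stem.isdigit():
            checkpoints.append((int(stem), name))
    if not checkpoints:
        return None
    steps, name = max(checkpoints)
    return name, steps


def completed_iterations(timesteps, timesteps_per_iteration=100_000):
    """Number of full training iterations already saved."""
    return timesteps // timesteps_per_iteration


name, steps = latest_checkpoint(["200000.zip", "1000000.zip", "notes.txt"])
print(name, steps, completed_iterations(steps))
```

Comparing the timestep counts as integers matters here: sorted as strings, `200000.zip` would come after `1000000.zip`.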

To train each model, we then used:


```bash
python training.py PPO
python training.py A2C
python training.py SAC
```

The script saves the checkpoints in the `models` folder. For the demonstration, we chose the most-trained checkpoint of each model and placed it in the `final_models` folder.

To see the training progress, we can use tensorboard:



```bash
tensorboard --logdir=logs
```

Here is the mean reward graph for the training of the models in easy difficulty mode:

![graph](imgs/original_models_original_env_easy.png)

We can see that the mean reward of the PPO and SAC algorithms rises quickly in the first 2M timesteps, while that of the A2C algorithm stays flat.

The PPO and SAC algorithms reach the top reward (around 300) within the first 4M to 6M timesteps, while the A2C algorithm does not reach it even after 24M timesteps.

Next, we trained the models in hardcore difficulty mode. To do that, we set `hardcore=True` in the environment creation of the training script.

Here is the mean reward graph for the training of the models in hardcore difficulty mode:

![graph](imgs/original_models_original_env.png)

# CONCLUSIONS ABOUT THE TRAINING CHANGE LATER

#### Let's see how they perform:

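From now on, we only use hardcore mode. Note that the `env` created at the top of the notebook uses the default (easy) mode, so it should be recreated with `gym.make('BipedalWalker-v3', hardcore=True)` before running the evaluations below.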

#### PPO algorithm

In [None]:
ppo_model = PPO.load("final_models/original_PPO_original_env.zip", env=env)

test_performance(ppo_model)

#### A2C algorithm

In [None]:
a2c_model = A2C.load("final_models/original_A2C_original_env.zip", env=env)

test_performance(a2c_model)

#### SAC algorithm

In [None]:
sac_model = SAC.load("final_models/original_SAC_original_env.zip", env=env)

test_performance(sac_model)

## Training the models with RewardWrapper(env) and without hyperparameter changes

To see if we can improve the results, we will try to create a RewardWrapper to better reward the agent.



```python
from gymnasium import RewardWrapper as RW


class RewardWrapper(RW):
    def __init__(self, env):
        super().__init__(env)

    def step(self, action):
        # Perform the environment step (Gymnasium returns a 5-tuple)
        obs, reward, terminated, truncated, info = self.env.step(action)

        # Add a shaping bonus based on obs[2].
        # Caution: in Gymnasium's BipedalWalker-v3, obs[0] is the hull angle and
        # obs[2] is the normalized horizontal velocity, so this term rewards
        # horizontal speed rather than balance.
        balance_reward = abs(obs[2])
        reward += balance_reward

        return obs, reward, terminated, truncated, info
```

This reward wrapper is meant to reward the agent for keeping its balance.
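For comparison, a balance term is often written as a penalty on the hull tilt. The sketch below is not the wrapper used for the results in this notebook; it only illustrates the shaping arithmetic as a plain function, assuming the observation layout documented for Gymnasium's BipedalWalker-v3, where index 0 is the hull angle. The name `shaped_reward` and the `tilt_weight` parameter are hypothetical.

```python
def shaped_reward(reward, obs, tilt_weight=1.0):
    """Return the environment reward minus a penalty proportional to hull tilt."""
    hull_angle = obs[0]  # assumed layout: obs[0] is the hull angle (0 means upright)
    return reward - tilt_weight * abs(hull_angle)


# An upright walker keeps its reward; a tilted walker loses part of it.
print(shaped_reward(1.0, [0.0, 0.0, 0.0]))
print(shaped_reward(1.0, [0.5, 0.0, 0.0]))
```

Inside a wrapper's `step` method, the same function could be applied to the reward before returning it. The weight would need tuning so that the penalty does not dominate the environment's own reward.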

To train the models with the RewardWrapper, we need to change the environment creation in the training script:


```python
env = make_vec_env(env_id, n_envs=n_envs, wrapper_class=RewardWrapper, vec_env_cls=SubprocVecEnv,
                   vec_env_kwargs=dict(start_method='fork'), env_kwargs=dict(hardcore=True))
```


Now let's train the models again and check the results.

![graph](imgs/original_models_wrapped_env.png)

# CONCLUSIONS ABOUT THE TRAINING CHANGE LATER

#### Let's see how they perform:

### PPO algorithm

In [None]:
ppo_model = PPO.load("final_models/original_PPO_wrapped_env.zip", env=env)

test_performance(ppo_model)

### A2C algorithm

In [None]:
a2c_model = A2C.load("final_models/original_A2C_wrapped_env.zip", env=env)

test_performance(a2c_model)

### SAC algorithm

In [None]:
sac_model = SAC.load("final_models/original_SAC_wrapped_env.zip", env=env)

test_performance(sac_model)


## Tuning Hyperparameters

We tuned the hyperparameters of the models to see if we could improve the results.


```python
# Hyperparameters and the network architecture are passed to the constructor.
model = algo(
    policy,
    env,
    verbose=1,
    tensorboard_log=logdir,
    # Tuned options (uncomment the ones the chosen algorithm supports):
    # learning_rate=0.0001,
    # batch_size=256,    # PPO and SAC
    # ent_coef=0.01,
    # vf_coef=0.5,       # PPO and A2C
    # gae_lambda=0.95,   # PPO and A2C
    # policy_kwargs=dict(net_arch=[256, (...n_layers...), 256]),
)

# learn() only controls the training run; it does not accept the options above.
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name=algo_name)
```
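Not every option listed in the snippet above applies to every algorithm: in Stable Baselines3, A2C has no `batch_size` argument, and SAC has no `vf_coef` or `gae_lambda`. The sketch below shows one possible way to keep a single set of candidate values and pass each algorithm only what its constructor accepts. The names `CANDIDATES`, `SUPPORTED` and `tuned_kwargs` are hypothetical, and `net_arch=[256, 256]` is an assumed architecture, since the number of layers is elided in the original.

```python
# Values taken from the snippet above; net_arch=[256, 256] is an assumption.
CANDIDATES = dict(
    learning_rate=0.0001,
    batch_size=256,
    ent_coef=0.01,
    vf_coef=0.5,
    gae_lambda=0.95,
    policy_kwargs=dict(net_arch=[256, 256]),
)

# Constructor arguments each algorithm accepts, among the candidates above
SUPPORTED = {
    "PPO": {"learning_rate", "batch_size", "ent_coef", "vf_coef", "gae_lambda", "policy_kwargs"},
    "A2C": {"learning_rate", "ent_coef", "vf_coef", "gae_lambda", "policy_kwargs"},
    "SAC": {"learning_rate", "batch_size", "ent_coef", "policy_kwargs"},
}


def tuned_kwargs(algo_name):
    """Keep only the candidate hyperparameters that the algorithm accepts."""
    return {k: v for k, v in CANDIDATES.items() if k in SUPPORTED[algo_name]}


print(sorted(tuned_kwargs("SAC")))
```

In the training script, the result could then be unpacked into the constructor, for example `algo(policy, env, verbose=1, tensorboard_log=logdir, **tuned_kwargs(algo_name))`.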

### PPO algorithm

In [None]:
ppo_model = PPO.load("final_models/tunned_PPO_wrapped_env.zip", env=env)

test_performance(ppo_model)

### A2C algorithm

In [None]:
a2c_model = A2C.load("final_models/tunned_A2C_wrapped_env.zip", env=env)

test_performance(a2c_model)

### SAC algorithm

In [None]:
sac_model = SAC.load("final_models/tunned_SAC_wrapped_env.zip", env=env)

test_performance(sac_model)

Now let's see the training progress:

![graph](imgs/tunned_models_wrapped_env.png)

# CONCLUSIONS ABOUT THE TRAINING CHANGE LATER

## Visualising the best model

Now let's visualize the performance of the best model.

In [None]:
def visualize_model():
    test_env = gym.make('BipedalWalker-v3', hardcore=True, render_mode="human")

    model = PPO.load("final_models/tunned_PPO_wrapped_env.zip", env=test_env)
    obs, info = test_env.reset()

    # With render_mode="human", each step is rendered automatically
    while True:
        action, _states = model.predict(obs)
        obs, reward, terminated, truncated, info = test_env.step(action)
        if terminated or truncated:
            break

    # Release the rendering window once the episode ends
    test_env.close()

# visualize_model()