# Gym Environments and Implementing Reinforcement Learning Agents with Stable Baselines

<br>
<br>

## The environment

In [3]:
import gymnasium as gym

env_id = "BipedalWalker-v3"
env = gym.make(env_id, hardcore=True)

We chose the `BipedalWalker-v3` environment because it has a continuous action space and, in hardcore mode, is challenging to solve.

<br>
<br>

We use TensorBoard to visualize the training progress; the graphs shown below are screenshots taken from it.


<br>


```bash
    tensorboard --logdir=logs
```

<br>

<br>


---


<br>



## The algorithms


For the assignment, we had to choose a few of the following algorithms to use with our environment:

<br>

- ARS (Augmented Random Search)
- A2C (Advantage Actor Critic)
- DDPG (Deep Deterministic Policy Gradient)
- PPO (Proximal Policy Optimization)
- RecurrentPPO (Recurrent Proximal Policy Optimization)
- SAC (Soft Actor Critic)
- TD3 (Twin Delayed DDPG)
- TQC (Truncated Quantile Critics)
- TRPO (Trust Region Policy Optimization)

<br>
<br>







To find the top 3 algorithms, we trained all of them and kept the best performers. We used the following script for training:


In [8]:
import os
from stable_baselines3.common.vec_env import SubprocVecEnv, DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env

models_dir = "models"
logdir = "logs"
TIMESTEPS = 10**6


def train_model(algo, algo_name, policy, n_envs=os.cpu_count()):
    
    def latest_model(algorithm):
        models = [int(m.split(".")[0]) for m in os.listdir(f"{models_dir}/{algorithm}")]
        models.sort()
        return f"{models_dir}/{algorithm}/{models[-1]}.zip"


    if n_envs == 1:
        train_env = DummyVecEnv([lambda: gym.make(env_id, hardcore=True)])
    else:
        train_env = make_vec_env(env_id, n_envs=n_envs, vec_env_cls=SubprocVecEnv, env_kwargs=dict(hardcore=True),
                           vec_env_kwargs=dict(start_method='fork'))

    if os.path.exists(f"{models_dir}/{algo_name}"):
        if os.listdir(f"{models_dir}/{algo_name}"):

            model_path = latest_model(algo_name)
            model = algo.load(model_path, env=train_env)
            # Checkpoints are saved as TIMESTEPS * iters, so divide by TIMESTEPS to recover iters
            iters = int(model_path.split("/")[2].split(".")[0]) // TIMESTEPS
        else:

            model = algo(policy, train_env, verbose=1, tensorboard_log=logdir)
            iters = 0
    else:
        os.makedirs(f"{models_dir}/{algo_name}")
        model = algo(policy, train_env, verbose=1,tensorboard_log=logdir)
        iters = 0

    while True:
        iters += 1
        model.learn(total_timesteps=TIMESTEPS, progress_bar=True, reset_num_timesteps=False,tb_log_name=algo_name)
        model.save(f"{models_dir}/{algo_name}/{TIMESTEPS * iters}")
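
The resume logic in `train_model` relies on checkpoint files being named after the cumulative timestep count. As a minimal sketch of that convention in isolation (the helper name `latest_checkpoint` and the temporary directory are illustrative assumptions, not part of the training script):

```python
import os
import tempfile

TIMESTEPS = 10**6  # timesteps per learn() call, as in the script above


def latest_checkpoint(directory):
    """Return (path, completed_iterations) for the newest checkpoint, or (None, 0)."""
    steps = []
    for name in os.listdir(directory):
        stem, ext = os.path.splitext(name)
        # Only consider files named "<timesteps>.zip"
        if ext == ".zip" and stem.isdigit():
            steps.append(int(stem))
    if not steps:
        return None, 0
    newest = max(steps)  # numeric comparison, so 10000000 ranks above 9000000
    return os.path.join(directory, f"{newest}.zip"), newest // TIMESTEPS


# Usage with empty throwaway files standing in for saved models
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2, 10):
        open(os.path.join(tmp, f"{n * TIMESTEPS}.zip"), "w").close()
    path, iters = latest_checkpoint(tmp)
    print(os.path.basename(path), iters)  # 10000000.zip 10
```

Comparing the names as integers matters here: sorted as strings, `10000000` would come before `2000000`.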


## Training all the possible models

To train the models, we called the function above.

Then we got this graph of the training progress:

![graph](imgs/all_models_original_env.png)


In [9]:
from stable_baselines3.common.evaluation import evaluate_policy

def test_performance(model):
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, warn=False)

    return mean_reward
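
`evaluate_policy` runs the model for `n_eval_episodes` episodes and returns the mean and standard deviation of the episode returns. The sketch below illustrates that idea with a hypothetical stand-in environment (`CountdownEnv`) and a plain `policy` function, so it runs without Box2D or SB3. It is only an approximation of the concept, not SB3's implementation.

```python
import statistics


class CountdownEnv:
    """Hypothetical toy environment using the Gymnasium reset/step signatures."""

    def __init__(self, length=5):
        self.length = length
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return [0.0], {}  # observation, info

    def step(self, action):
        self.t += 1
        reward = float(action)  # the reward simply equals the chosen action
        terminated = self.t >= self.length
        return [float(self.t)], reward, terminated, False, {}


def evaluate(policy, env, n_episodes=10):
    """Return (mean, std) of the undiscounted episode returns."""
    returns = []
    for _ in range(n_episodes):
        obs, _info = env.reset()
        total, done = 0.0, False
        while not done:
            obs, reward, terminated, truncated, _info = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return statistics.mean(returns), statistics.pstdev(returns)


mean_return, std_return = evaluate(lambda obs: 1.0, CountdownEnv(), n_episodes=3)
print(mean_return, std_return)  # 5.0 0.0
```

Note that `test_performance` above discards the standard deviation; with a stochastic environment such as hardcore mode, keeping it would show how much the returns vary between episodes.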

Let's evaluate an untrained agent from each algorithm on the environment:

In [11]:
from stable_baselines3 import A2C,DDPG, PPO, SAC, TD3
from sb3_contrib import ARS, RecurrentPPO, TQC, TRPO
import matplotlib.pyplot as plt

performances = [("A2C", test_performance(A2C("MlpPolicy", env, verbose=0))),
                ("ARS", test_performance(ARS("MlpPolicy", env, verbose=0))),
                ("DDPG", test_performance(DDPG("MlpPolicy", env, verbose=0))),
                ("PPO", test_performance(PPO("MlpPolicy", env, verbose=0))),
                ("RecurrentPPO", test_performance(RecurrentPPO("MlpLstmPolicy", env, verbose=0))),
                ("SAC", test_performance(SAC("MlpPolicy", env, verbose=0))),
                ("TD3", test_performance(TD3("MlpPolicy", env, verbose=0))),
                ("TQC", test_performance(TQC("MlpPolicy", env, verbose=0))),
                ("TRPO", test_performance(TRPO("MlpPolicy", env, verbose=0)))]

names, values = zip(*performances)
cmap = plt.get_cmap('Greens_r')
colors = [cmap(i / len(names)) for i in range(len(names))]
fig, ax = plt.subplots(figsize=(12, 6))
bars = ax.bar(names, values, color=colors, alpha=0.7)
ax.set_title('Performance')
plt.tight_layout()
plt.show()

## Training the top 5

Now let's continue the training with the top five algorithms:

- top1
- top2
- top3
- top4
- top5



<br>
<br>

---

<br>


## Training the models without changes to the environment or the models


As a quick sanity check, we first train with `hardcore=False`.


Here is the mean reward graph for the training of the models in the easy mode:

<br>

![graph](imgs/original_models_original_env_easy.png)

<br>

We can see that the PPO and SAC algorithms rise quickly in the first 2M iterations, while the A2C algorithm stays roughly flat.

PPO and SAC reach the top reward (around 300) within the first 4M to 6M iterations, while A2C does not reach it even after 24M iterations.

Next, we trained the models in hardcore mode. To do that, we set `hardcore=True` when creating the environment in the training script.

Here is the mean reward graph for the training of the models in hardcore difficulty mode:

<br>

![graph](imgs/original_models_original_env.png)

<br>

After turning on hardcore mode, the performance of the models drops. The PPO and SAC algorithms still rise quickly over the 10M iterations, while the A2C algorithm stays more or less flat.

#### Let's see how they perform:

From now on, we are going to only use the hardcore mode.

#### PPO algorithm

In [None]:
ppo_model = PPO.load("final_models/original_PPO_original_env.zip", env=env)

test_performance(ppo_model)

#### A2C algorithm

In [None]:
a2c_model = A2C.load("final_models/original_A2C_original_env.zip", env=env)

test_performance(a2c_model)

#### SAC algorithm

In [None]:
sac_model = SAC.load("final_models/original_SAC_original_env.zip", env=env)

test_performance(sac_model)

## Training the models with RewardWrapper(env) and without hyperparameter changes

We checked, and the problem is that the walker gets stuck in the same place with its legs open.


![graph](imgs/stuck_walker.png)

To see if we can improve the results, we will try to reward the agent when it stays balanced upright.



```python



from gymnasium import RewardWrapper as RW


class RewardWrapper(RW):
    def __init__(self, env):
        super().__init__(env)  # the base class already stores env as self.env

    def step(self, action):
        # Perform the environment step (Gymnasium returns a 5-tuple)
        obs, reward, terminated, truncated, info = self.env.step(action)

        # Shaping term added on top of the environment reward.
        # Note: in BipedalWalker-v3, obs[0] is the hull angle and obs[2] is the
        # horizontal velocity, so this term grows with speed rather than balance.
        balance_reward = abs(obs[2])
        reward += balance_reward

        return obs, reward, terminated, truncated, info




```
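
For comparison, a shaping term that directly rewards an upright hull could be sketched as below. It assumes the hull angle is the first observation entry (`obs[0]`), as described in the Gymnasium BipedalWalker documentation, and uses a hypothetical `StubEnv` so that it runs without Box2D. The class name `UprightBonus` and the `scale` parameter are illustrative, and this is not the wrapper used for the results in this notebook.

```python
class StubEnv:
    """Hypothetical stand-in that returns a fixed observation and reward."""

    def __init__(self, hull_angle):
        self.hull_angle = hull_angle

    def step(self, action):
        obs = [self.hull_angle, 0.0, 0.0, 0.0]
        return obs, 1.0, False, False, {}


class UprightBonus:
    """Adds a bonus that is largest when the hull angle is zero (upright)."""

    def __init__(self, env, scale=0.1):
        self.env = env
        self.scale = scale

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Assumption: obs[0] is the hull angle in radians, 0 meaning upright
        bonus = self.scale * max(0.0, 1.0 - abs(obs[0]))
        return obs, reward + bonus, terminated, truncated, info


upright = UprightBonus(StubEnv(hull_angle=0.0)).step(None)[1]
tilted = UprightBonus(StubEnv(hull_angle=0.5)).step(None)[1]
print(upright, tilted)
```

With a real environment, the same `step` body would go into a `gymnasium.Wrapper` subclass so that `make_vec_env` can apply it through `wrapper_class`.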

To train the models with the RewardWrapper, we need to change the environment creation in the training script:

```python
env = make_vec_env(env_id, n_envs=NUM_ENVS, wrapper_class=RewardWrapper, vec_env_cls=SubprocVecEnv, vec_env_kwargs=dict(start_method='fork'), env_kwargs=dict(hardcore=True))
```


Now let's train the models again and check the results.

![graph](imgs/original_models_wrapped_env.png)

After applying our RewardWrapper, we can see that over the 50M iterations PPO learned very well and reached positive rewards.


The SAC algorithm stayed the same as before the wrapper.

Finally, the A2C algorithm learned a little. It still does not reach positive rewards, but it is better than before.

#### Let's see how they perform:

### PPO algorithm

In [None]:
ppo_model = PPO.load("final_models/original_PPO_wrapped_env.zip", env=env)

test_performance(ppo_model)

### A2C algorithm

In [None]:
a2c_model = A2C.load("final_models/original_A2C_wrapped_env.zip", env=env)

test_performance(a2c_model)

### SAC algorithm

In [None]:
sac_model = SAC.load("final_models/original_SAC_wrapped_env.zip", env=env)

test_performance(sac_model)

## Tuning Hyperparameters

We tuned the hyperparameters of the models to see if we could improve the results.


```



    # Hyperparameters we varied. All of them are constructor arguments, not learn() arguments:
    #   learning_rate=0.0001
    #   batch_size=256, ent_coef=0.01, vf_coef=0.5, gae_lambda=0.95
    #   policy_kwargs=dict(net_arch=[256, (...n_layers...), 256])  # neural network architecture change
    model = algo(policy, env, verbose=1, tensorboard_log=logdir)

    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name=algo_name)

```
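
Not every algorithm accepts every hyperparameter listed above (SAC, for instance, has no `gae_lambda` or `vf_coef` argument), so passing one shared set of keyword arguments to each constructor can fail. One way to handle this, sketched below with hypothetical stand-in classes (`FakePPO`, `FakeSAC`) instead of the real SB3 classes, is to keep only the arguments that a constructor's signature declares. The helper name `supported_kwargs` and the `net_arch=[256, 256]` value are assumptions for illustration.

```python
import inspect


def supported_kwargs(algo_cls, candidates):
    """Keep only the entries of `candidates` that algo_cls.__init__ accepts by name."""
    params = inspect.signature(algo_cls.__init__).parameters
    return {k: v for k, v in candidates.items() if k in params}


# Hypothetical stand-ins with different constructor signatures
class FakePPO:
    def __init__(self, policy, env, learning_rate=3e-4, batch_size=64,
                 ent_coef=0.0, vf_coef=0.5, gae_lambda=0.95, policy_kwargs=None):
        pass


class FakeSAC:
    def __init__(self, policy, env, learning_rate=3e-4, batch_size=256,
                 ent_coef="auto", policy_kwargs=None):
        pass


tuned = dict(learning_rate=1e-4, batch_size=256, ent_coef=0.01, vf_coef=0.5,
             gae_lambda=0.95, policy_kwargs=dict(net_arch=[256, 256]))

print(sorted(supported_kwargs(FakePPO, tuned)))
print(sorted(supported_kwargs(FakeSAC, tuned)))
```

In the training script this would look like `algo(policy, env, verbose=1, tensorboard_log=logdir, **supported_kwargs(algo, tuned))`, assuming the real classes expose their hyperparameters as named constructor parameters.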

### PPO algorithm

In [None]:
ppo_model = PPO.load("final_models/tunned_PPO_wrapped_env.zip", env=env)

test_performance(ppo_model)

### A2C algorithm

In [None]:
a2c_model = A2C.load("final_models/tunned_A2C_wrapped_env.zip", env=env)

test_performance(a2c_model)

### SAC algorithm

In [None]:
sac_model = SAC.load("final_models/tunned_SAC_wrapped_env.zip", env=env)

test_performance(sac_model)

Now let's see the training progress:

![graph](imgs/tunned_models_wrapped_env.png)

# CONCLUSIONS ABOUT THE TRAINING CHANGE LATER

## Visualising the best model

Now let's visualize the performance of the best model:

In [None]:
def visualize_model():
    test_env =  gym.make('BipedalWalker-v3', hardcore=True, render_mode="human")
    
    model = PPO.load("final_models/original_PPO_wrapped_env.zip", env=test_env)
    obs, info = test_env.reset()
    
    while True:
        action, _states = model.predict(obs)
        obs, rewards, terminated, truncated, info = test_env.step(action)
        test_env.render()
        if terminated or truncated:
            break

    test_env.close()  # release the render window once the episode ends
        
visualize_model()