# Gym Environments and Implementing Reinforcement Learning Agents with Stable Baselines

In [None]:
import gymnasium as gym
from stable_baselines3 import PPO, A2C
from stable_baselines3.ppo.policies import MlpPolicy
from stable_baselines3.common.evaluation import evaluate_policy

In [None]:
env = gym.make('CarRacing-v2')

We chose the CarRacing-v2 environment because it has a continuous action space and is challenging to solve.

To try the environment ourselves, we wrote a script that lets us play the game with the keyboard:

```bash
    python human_test.py
```
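
`human_test.py` is not reproduced in this notebook. As a rough sketch of its core idea, assuming CarRacing's continuous action layout of `[steering, gas, brake]`, a keyboard state could be mapped to an action as follows. The `keys_to_action` helper and the key names are hypothetical, not taken from our script:

```python
def keys_to_action(pressed):
    """Map a set of pressed key names to a CarRacing action.

    Hypothetical helper: the layout [steering, gas, brake] follows the
    CarRacing continuous action space (steering in [-1, 1], gas and brake
    in [0, 1]); the key names are illustrative only.
    """
    steering = 0.0
    if "left" in pressed:
        steering -= 1.0
    if "right" in pressed:
        steering += 1.0
    gas = 1.0 if "up" in pressed else 0.0
    brake = 1.0 if "down" in pressed else 0.0
    return [steering, gas, brake]


# Steering left while accelerating
print(keys_to_action({"left", "up"}))
```

A real script would read the pressed keys from the window on every frame and pass the resulting action to `env.step`.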

We chose the PPO and A2C algorithms to train our agents because they are among the most widely used algorithms in the Stable Baselines3 library.

Let's test the environment with an untrained agent for each algorithm:

In [None]:
def test_performance(model):
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, warn=False)

    print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")

In [None]:
untrained_model = PPO(MlpPolicy, env, verbose=0)

test_performance(untrained_model)

In [None]:
untrained_model = A2C(MlpPolicy, env, verbose=0)

test_performance(untrained_model)

As we can see, the untrained agents perform poorly.

Now let's create and train a model with each algorithm.

## Training the models without changes to the environment or the models

We created a script that creates a model and starts training it. If a model has already been created, the script trains it further.

To train the models, we used the following code:

```python
import os
from sys import argv
from stable_baselines3 import PPO, A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.a2c import MlpPolicy as A2C_MlpPolicy
from stable_baselines3.ppo import MlpPolicy as PPO_MlpPolicy
from NewReward import RewardWrapper

env_id = "CarRacing-v2"
TIMESTEPS = 100000
models_dir = "models"
logdir = "logs"
NUM_ENVS = os.cpu_count()

def train_model(algo, algo_name, policy, env):
    if os.path.exists(f"{models_dir}/{algo_name}"):
        if os.listdir(f"{models_dir}/{algo_name}"):
            model_path = latest_model(algo_name)
            model = algo.load(model_path, env=env)
            # Checkpoints are named TIMESTEPS * iters, so divide by TIMESTEPS to resume the count
            iters = int(os.path.basename(model_path).split(".")[0]) // TIMESTEPS
        else:
            model = algo(policy, env, verbose=1, tensorboard_log=logdir)
            iters = 0
    else:
        os.makedirs(f"{models_dir}/{algo_name}")
        model = algo(policy, env, verbose=1, tensorboard_log=logdir)
        iters = 0

    # Train until interrupted (e.g. Ctrl+C), saving a checkpoint every TIMESTEPS steps
    while True:
        iters += 1
        model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name=algo_name)
        model.save(f"{models_dir}/{algo_name}/{TIMESTEPS * iters}")


def main():
    try:
        if len(argv) != 2:
            raise ValueError("No arguments given. Please specify which model to train.")

        if not os.path.exists(logdir):
            os.makedirs(logdir)

        model_type = argv[1]

        env = make_vec_env(env_id, n_envs=NUM_ENVS, vec_env_cls=SubprocVecEnv,
                           vec_env_kwargs=dict(start_method='fork'))

        if model_type == "A2C":
            train_model(A2C, model_type, A2C_MlpPolicy, env)
        elif model_type == "PPO":
            train_model(PPO, model_type, PPO_MlpPolicy, env)
        else:
            raise ValueError("Invalid argument. Please specify 'A2C' or 'PPO'.")

    except ValueError as e:
        print(f"Error: {e}")


if __name__ == '__main__':
    main()

```
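
The script calls a `latest_model` helper that is not shown in the listing. A minimal sketch of what it might look like, assuming checkpoints are stored as `<models_dir>/<algo_name>/<timesteps>.zip` (the naming produced by `model.save` in the training loop); the `models_dir` parameter is added here only so the sketch can be tried on a temporary folder:

```python
import os
import tempfile


def latest_model(algo_name, models_dir="models"):
    """Return the path of the checkpoint with the highest timestep count.

    Assumption: checkpoint files are named "<timesteps>.zip".
    """
    folder = os.path.join(models_dir, algo_name)
    checkpoints = [f for f in os.listdir(folder) if f.split(".")[0].isdigit()]
    # Compare numerically: as strings, "200000.zip" would sort after "1000000.zip"
    newest = max(checkpoints, key=lambda f: int(f.split(".")[0]))
    return f"{models_dir}/{algo_name}/{newest}"


# Small demonstration on a temporary directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "PPO"))
    for steps in (100000, 200000, 1000000):
        open(os.path.join(tmp, "PPO", f"{steps}.zip"), "w").close()
    demo_result = os.path.basename(latest_model("PPO", models_dir=tmp))

print(demo_result)
```

Picking the checkpoint by its numeric name rather than by file modification time keeps the resume logic consistent with how `iters` is derived from the file name.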

To train each model, we then ran:

```bash
python training.py PPO
python training.py A2C
```

The models were then saved in a folder `models`. For the demonstration we chose the most advanced models and placed them inside the `final_models` folder.

To see the training progress, we can use tensorboard:



```bash
    tensorboard --logdir=logs
```

![graph](imgs/original_models_original_env.png)

With the original models and the original environment, and within the time we trained them, PPO achieves better results than A2C.

However, the A2C model trained faster than the PPO model.

Let's see how they perform:

### PPO algorithm

In [None]:
ppo_model = PPO.load("final_models/original_PPO_original_env.zip", env=env)

test_performance(ppo_model)

### A2C algorithm

In [None]:
a2c_model = A2C.load("final_models/original_A2C_original_env.zip", env=env)

test_performance(a2c_model)

## Training the models with RewardWrapper(env) and without changes to the models

To see if we can improve the results, we will try to create a RewardWrapper to better reward the agent.



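```python
from gymnasium import RewardWrapper as RW


class RewardWrapper(RW):
    """Ends the episode early once the running total reward turns negative."""

    INITIAL_TOTAL = 25  # head start so the episode does not end immediately

    def __init__(self, env):
        super().__init__(env)
        self.total_reward = self.INITIAL_TOTAL

    def reset(self, **kwargs):
        # Restore the head start on every episode, not only the first one
        self.total_reward = self.INITIAL_TOTAL
        return self.env.reset(**kwargs)

    def step(self, action):
        # Perform the environment step
        obs, reward, terminated, truncated, info = self.env.step(action)

        self.total_reward += reward

        # No progress for too long: the penalties have used up the head start
        if self.total_reward < 0:
            terminated = True

        return obs, reward, terminated, truncated, info
```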

We noticed that the episode only ended when the car left the map.

During training, we also noticed that the car was getting stuck driving in circles.

The environment applies a penalty of -0.1 on every step, so when the car makes no progress its total reward keeps falling.

So we decided to end the episode when the total reward becomes negative, meaning that the car has not made any progress for a long time.

To give the model time to start gaining reward, we start the total reward at 25.

This way, the model has time to start making progress without the episode ending too soon.
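
To get a feel for how much time this leaves the agent, the sketch below counts how many steps pass before the total drops below zero, under the worst-case assumption that the car never collects any positive reward. The function name and constants are illustrative; exact fractions are used to avoid floating-point rounding in the count:

```python
from fractions import Fraction

INITIAL_TOTAL = Fraction(25)      # starting value of the total reward
STEP_PENALTY = Fraction(-1, 10)   # per-step penalty of -0.1


def steps_until_cutoff(initial=INITIAL_TOTAL, penalty=STEP_PENALTY):
    """Count steps until the running total becomes negative,
    assuming no positive reward is ever collected."""
    total, steps = initial, 0
    while total >= 0:
        total += penalty
        steps += 1
    return steps


print(steps_until_cutoff())
```

In other words, an agent that makes no progress at all is cut off after roughly 250 steps, while any reward it collects along the way extends the episode.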


To train the models with the RewardWrapper, we need to change the environment creation in the training script:


```python
env = make_vec_env(env_id, n_envs=NUM_ENVS, wrapper_class=RewardWrapper, vec_env_cls=SubprocVecEnv, vec_env_kwargs=dict(start_method='fork'))
```

### PPO algorithm

In [None]:
ppo_model = PPO.load("final_models/original_PPO_wrapped_env.zip", env=env)

test_performance(ppo_model)

### A2C algorithm

In [None]:
a2c_model = A2C.load("final_models/original_A2C_wrapped_env.zip", env=env)

test_performance(a2c_model)

Now let's see the training progress:

![graph](imgs/original_models_wrapped_env.png)

## Tuning Hyperparameters

We tuned the hyperparameters of the models to see if we could improve the results.


```python
# Baseline constructor used so far
model = algo(policy, env, verbose=1, tensorboard_log=logdir)

# Variations we tried, passed as extra constructor arguments:
#   hyperparameter:               learning_rate=0.0001
#   network architecture change:  policy_kwargs=dict(net_arch=[256, (...n_layers...), 256])
#   other hyperparameters:        batch_size=256, ent_coef=0.01, vf_coef=0.5, gae_lambda=0.95
# In Stable Baselines3 these belong to the algorithm constructor, not to learn();
# batch_size is accepted by PPO only.

model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name=algo_name)
```
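
As an illustration of how these settings could be wired into the training script, the sketch below collects them into keyword arguments per algorithm. The `tuned_kwargs` helper is hypothetical, and the two hidden layers of 256 units are an assumed example, since the number of layers we varied is not fixed above:

```python
# Hypothetical helper: gathers the tuned settings into constructor keyword arguments
COMMON_KWARGS = dict(
    learning_rate=0.0001,
    ent_coef=0.01,
    vf_coef=0.5,
    gae_lambda=0.95,
    # Assumed example architecture: two hidden layers of 256 units
    policy_kwargs=dict(net_arch=[256, 256]),
)


def tuned_kwargs(algo_name):
    """Return constructor keyword arguments for "PPO" or "A2C"."""
    kwargs = dict(COMMON_KWARGS)
    if algo_name == "PPO":
        kwargs["batch_size"] = 256  # PPO only: A2C has no batch_size argument
    return kwargs


# Possible use inside train_model:
#   model = algo(policy, env, verbose=1, tensorboard_log=logdir, **tuned_kwargs(algo_name))
print(sorted(tuned_kwargs("A2C")))
```

Keeping the settings in one place makes it easier to train both algorithms with the same values and to see which ones apply to each.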

### PPO algorithm

In [None]:
ppo_model = PPO.load("final_models/tunned_PPO_wrapped_env.zip", env=env)

test_performance(ppo_model)

### A2C algorithm

In [None]:
a2c_model = A2C.load("final_models/tunned_A2C_wrapped_env.zip", env=env)

test_performance(a2c_model)

Now let's see the training progress:

![graph](imgs/tunned_models_wrapped_env.png)

## Visualising the best model

Now let's visualise the performance of the best model:

In [None]:
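test_env = gym.make('CarRacing-v2', render_mode='human')
test_model = PPO.load("final_models/tunned_PPO_wrapped_env.zip", env=test_env)
obs, info = test_env.reset()  # reset the rendered environment, not the global env

done = False
while not done:
    action, _states = test_model.predict(obs)
    obs, reward, terminated, truncated, info = test_env.step(action)
    done = terminated or truncated  # stop on either end condition
    # With render_mode='human' the window is updated on every step
    print(reward)

test_env.close()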