# Gym Environments and Implementing Reinforcement Learning Agents with Stable Baselines

In [None]:
import gymnasium as gym
import matplotlib.pyplot as plt
from training import latest_model
from stable_baselines3 import PPO, A2C
from stable_baselines3.ppo.policies import MlpPolicy
from stable_baselines3.common.evaluation import evaluate_policy

Trying the environment

In [None]:
env = gym.make('CarRacing-v2', render_mode="rgb_array")

In [None]:
untrained_model = PPO(MlpPolicy, env, verbose=0)

mean_reward, std_reward = evaluate_policy(untrained_model, env, n_eval_episodes=100, warn=False)

print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")

We wrote a script, `training.py`, that creates a model and starts training it. If a saved model already exists, the script continues training it instead:

```bash
python training.py PPO
```

```bash
python training.py A2C
```

The models are saved in the `models` folder. To test each algorithm, we load its most recently saved model.
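The `latest_model` helper imported from `training.py` is not shown in this notebook. As a rough illustration of how such a lookup could work, here is a minimal sketch. The function name `latest_model_sketch`, the `models/<ALGO>/*.zip` layout, and the use of file modification time are all assumptions, not the actual implementation.

```python
from pathlib import Path
from typing import Optional


def latest_model_sketch(algo: str, models_dir: str = "models") -> Optional[str]:
    """Return the most recently modified checkpoint for `algo`, or None if none exist.

    Hypothetical helper: assumes checkpoints live in <models_dir>/<algo>/ as .zip
    files, which is the format Stable Baselines3 uses when saving a model.
    """
    checkpoints = sorted(
        (Path(models_dir) / algo).glob("*.zip"),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    return str(checkpoints[-1]) if checkpoints else None
```

Returning `None` when nothing has been saved yet lets a caller decide whether to create a fresh model or load an existing one.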

To monitor training progress we can use TensorBoard:

```bash
tensorboard --logdir=logs
```

PPO algorithm

In [None]:
ppo_model = PPO.load(latest_model("PPO"), env=env)

In [None]:
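episodes = 100
for ep in range(episodes):
    # Reset at the start of every episode, not just once before the loop
    obs, info = env.reset()
    done = False
    while not done:
        action, _states = ppo_model.predict(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        # An episode ends on termination or on truncation (e.g. a time limit)
        done = terminated or truncated
        env.render()
        print(reward)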

A2C algorithm

In [None]:
a2c_model = A2C.load(latest_model("A2C"), env=env)

In [None]:
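episodes = 100
for ep in range(episodes):
    # Reset at the start of every episode, not just once before the loop
    obs, info = env.reset()
    done = False
    while not done:
        action, _states = a2c_model.predict(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        # An episode ends on termination or on truncation (e.g. a time limit)
        done = terminated or truncated
        env.render()
        print(reward)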

With the original models and the original environment, we got these results:

![graph](imgs/PPOoriginalvsA2Coriginal.png)

As we can see, PPO outperforms A2C, at least with the original models, the original environment, and the amount of training time we gave them.

We can also see that the A2C model trained faster than the PPO model.

To see whether we can improve the results, we will try creating a `RewardWrapper` that rewards the agent more effectively.

## Tuning Hyperparameters