# Customizing OpenAI Gym Environments and Implementing Reinforcement Learning Agents with Stable Baselines

### Theme: Car Racing

- Constança
- Daniela Osório, 202208679
- Inês Amorim, 202108108

---

## Imports

In [None]:
%pip install -r requirements.txt

In [1]:
import math
from typing import Optional, Union
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from gymnasium.envs.box2d.car_dynamics import Car
from gymnasium.error import DependencyNotInstalled, InvalidAction
from gymnasium.utils import EzPickle
from gymnasium.wrappers import RecordVideo
import pygame
from pygame import gfxdraw
import time
import matplotlib.pyplot as plt
#from pyvirtualdisplay import Display
from stable_baselines3 import A2C, PPO
from stable_baselines3.common.vec_env import SubprocVecEnv, VecVideoRecorder
from stable_baselines3.common.callbacks import CheckpointCallback, EvalCallback, CallbackList
from stable_baselines3.common.logger import configure
import os

---

## 1. Introduction

The CarRacing-v3 environment from Gymnasium (previously Gym) is part of the Box2D environments, and it offers an interesting challenge for training reinforcement learning agents. It's a top-down racing simulation where the track is randomly generated at the start of each episode. The environment offers both continuous and discrete action spaces, making it adaptable to different types of reinforcement learning algorithms.

- **Action Space:**

   - **Continuous:** Three actions: steering, gas, and braking. Steering ranges from -1 (full left) to +1 (full right).
   -  **Discrete:** Five possible actions: do nothing, steer left, steer right, gas, and brake.

- **Observation Space:**

    - The environment provides a 96x96 RGB image of the car and the track, which serves as the state input for the agent.

- **Rewards:**

    - The agent receives a -0.1 penalty for every frame, encouraging efficiency.
    - It earns +1000/N for every track tile visited, where N is the total number of tiles in the track. For example, an agent that visits every tile in 732 frames scores $1000 - 0.1 \times 732 = 926.8$ points.

- **Episode Termination:**

    - The episode ends when all track tiles are visited. It also ends if the car leaves the playfield (that is, drives far off the track), which incurs a -100 penalty.
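
The reward scheme can be made concrete with a small calculation. The sketch below uses a hypothetical helper, `episode_return` (not part of Gymnasium), and assumes Gymnasium's documented rules: -0.1 per frame, +1000/N per visited tile, and -100 for leaving the playfield. The 300-tile track is an illustrative number, since real tracks are randomly generated.

```python
def episode_return(frames, tiles_visited, total_tiles, left_playfield=False):
    """Approximate CarRacing episode return (illustrative helper, not a Gymnasium API)."""
    frame_penalty = -0.1 * frames                        # -0.1 for every frame
    tile_reward = 1000.0 * tiles_visited / total_tiles   # +1000/N per visited tile
    exit_penalty = -100.0 if left_playfield else 0.0     # penalty for leaving the playfield
    return frame_penalty + tile_reward + exit_penalty


# A full lap in 732 frames on a hypothetical 300-tile track
full_lap = episode_return(frames=732, tiles_visited=300, total_tiles=300)
# Only half of the tiles visited in 1000 frames
half_lap = episode_return(frames=1000, tiles_visited=150, total_tiles=300)
print(round(full_lap, 1), round(half_lap, 1))
```

Under these assumptions, a slow agent that covers only part of the track still ends up with a much lower return than one that finishes the lap quickly.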

In [2]:
# continuous=False selects the discrete action space
env = gym.make("CarRacing-v3", continuous=False, render_mode='rgb_array')
obs, info = env.reset()

In [9]:
#check render modes
print(env.metadata["render_modes"])

['human', 'rgb_array', 'state_pixels']


- Checking if everything is okay and working

In [10]:
# Reset the environment to confirm that it initializes correctly
obs, info = env.reset()

# Close the environment
env.close()

print("Environment initialized successfully!")

Environment initialized successfully!


In [11]:
print("Action space:", env.action_space)

Action space: Discrete(5)


In [12]:
print("Action Space:", env.action_space)
print("Observation Space:", env.observation_space)
print("Environment Metadata:", env.metadata)


Action Space: Discrete(5)
Observation Space: Box(0, 255, (96, 96, 3), uint8)
Environment Metadata: {'render_modes': ['human', 'rgb_array', 'state_pixels'], 'render_fps': 50}


In [3]:
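obs, info = env.reset()  # Gymnasium's reset() returns (observation, info)
for _ in range(10):
    action = env.action_space.sample()  # Random action
    # Gymnasium's step() returns five values, not four as in the old Gym API
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:  # Start a new episode if this one ended
        obs, info = env.reset()

env.close()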

---
## 2. Training with PPO

In [11]:
env = gym.make("CarRacing-v3", continuous=False, render_mode='rgb_array') 
obs, info = env.reset()

In [12]:
MODELS_DIR = '../../labiacd/models/#isia'
TIMESTEPS = 30000000 #30M

In [13]:
#create directories
logs_dir = 'PPO_baseline_logs'
logs_path = os.path.join(MODELS_DIR, logs_dir)
os.makedirs(logs_path, exist_ok=True)

video_dir = os.path.join(logs_path, "videos")
tensorboard_dir = os.path.join(logs_path, "tensorboard")
model_dir = os.path.join(logs_path, "models")
os.makedirs(video_dir, exist_ok=True)
os.makedirs(tensorboard_dir, exist_ok=True)
os.makedirs(model_dir, exist_ok=True)

In [14]:
# Define the environment creation function
def make_env(seed):
    def _init():
        env = gym.make("CarRacing-v3", render_mode='rgb_array', continuous=False)
        # Record a video every 1000 episodes, only in the first environment
        if seed == 0:
            env = RecordVideo(env, video_folder=video_dir, episode_trigger=lambda x: x % 1000 == 0)
        env.reset(seed=seed)  # Give each environment its own seed
        return env
    return _init

In [None]:
# Create a vectorized environment with multiple parallel instances
num_envs = 4
vec_env = SubprocVecEnv([make_env(i) for i in range(num_envs)])

In [7]:
# Configure logger for TensorBoard
new_logger = configure(tensorboard_dir, ["stdout", "tensorboard"])

Logging to ../../labiacd/models/#isia/PPO_baseline_logs/tensorboard


In [None]:
# Callbacks
# Save the model every 10,000 steps
checkpoint_callback = CheckpointCallback(save_freq=10000, save_path=model_dir, name_prefix="ppo_car_racing")
# Evaluation callback
eval_env = gym.make("CarRacing-v3", render_mode="rgb_array", continuous=False)
eval_callback = EvalCallback(eval_env, best_model_save_path=model_dir, log_path=model_dir, eval_freq=10000)

# Combine callbacks
callback = CallbackList([checkpoint_callback, eval_callback])
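
In Stable Baselines3, `save_freq` and `eval_freq` count calls to `env.step()`, and each call on a vectorized environment advances `num_envs` timesteps at once. The sketch below shows one way to convert a target expressed in total timesteps into a per-environment frequency. `per_env_freq` is a hypothetical helper, and `num_envs = 4` mirrors the value used for the vectorized environment in this notebook.

```python
def per_env_freq(total_step_freq, num_envs):
    """Convert a frequency in total timesteps into a number of env.step() calls."""
    # Each vectorized step advances num_envs timesteps; never go below 1
    return max(total_step_freq // num_envs, 1)


num_envs = 4
save_freq = per_env_freq(10_000, num_envs)
print(save_freq)
```

With a single, non-vectorized environment the conversion leaves the frequency unchanged, so the values passed to the callbacks above stay valid in that case.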

In [9]:
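# Create the PPO model
# Note: this passes the single `env`; pass `vec_env` instead to train on the parallel environments
model = PPO("CnnPolicy", env, verbose=1, tensorboard_log=tensorboard_dir)
model.set_logger(new_logger)  # Attach TensorBoard logger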

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.


[W1214 08:50:35.081932623 NNPACK.cpp:61] Could not initialize NNPACK! Reason: Unsupported hardware.


In [10]:
# Train the model (checkpoints and evaluation are handled by the callbacks)
model.learn(total_timesteps=TIMESTEPS, callback=callback)



---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | -53.1    |
| time/              |          |
|    fps             | 36       |
|    iterations      | 1        |
|    time_elapsed    | 55       |
|    total_timesteps | 2048     |
---------------------------------


KeyboardInterrupt: 
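
The log above reports roughly 36 frames per second, which allows a rough estimate of how long the full 30M-timestep run would take. This back-of-the-envelope sketch assumes the throughput stays constant, which it may not in practice. `estimate_hours` is a hypothetical helper.

```python
def estimate_hours(total_timesteps, fps):
    """Estimated wall-clock hours, assuming a constant number of steps per second."""
    return total_timesteps / fps / 3600


hours = estimate_hours(30_000_000, 36)
print(f"{hours:.0f} hours (~{hours / 24:.1f} days)")
```

At this rate the run would need more than a week on the CPU, which may explain why training was interrupted here.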

In [None]:
# Cleanup
env.close()
eval_env.close()

---

In [None]:
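# Train the model and save logs and videos
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="PPO_baseline")
# Save into the models directory, named after the number of timesteps trained so far
model.save(os.path.join(model_dir, str(model.num_timesteps)))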

In [None]:
# Store the model
model.save("trained_model_ppo")

In [None]:
# Load the model
model = PPO.load("trained_model_ppo")

In [None]:
# Create the environment (discrete actions, to match the trained model)
env = gym.make("CarRacing-v3", continuous=False, render_mode="human")
obs, info = env.reset()

Train the model without watching the training

In [None]:
model.learn(total_timesteps=100000)

Watch the trained agent, with nice visuals

In [None]:
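# Visualization part
episodes = 5

for ep in range(episodes):
    obs, info = env.reset()
    done = False
    while not done:
        action, _states = model.predict(obs)
        # Gymnasium's step() returns separate terminated and truncated flags
        obs, reward, terminated, truncated, info = env.step(int(action))
        done = terminated or truncated
        env.render()  # Lets you watch the animation (requires a render_mode in gym.make)
        print(reward)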