# Customizing OpenAI Gym Environments and Implementing Reinforcement Learning Agents with Stable Baselines

### Theme: Car Racing

- Constança
- Daniela Osório, 202208679
- Inês Amorim, 202108108

---

## Imports

In [None]:
%pip install -r requirements.txt

In [None]:
%pip freeze

In [None]:
%pip install gymnasium[box2d]

In [None]:
%pip install box2d-py

In [4]:
%pip install ufal.pybox2d

Collecting ufal.pybox2d
  Downloading ufal.pybox2d-2.3.10.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (416 bytes)
Downloading ufal.pybox2d-2.3.10.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.7 MB)
Installing collected packages: ufal.pybox2d
Successfully installed ufal.pybox2d-2.3.10.3
Note: you may need to restart the kernel to use updated packages.


In [3]:
import math
from typing import Optional, Union

import numpy as np

import gymnasium as gym
from gymnasium import spaces
from gymnasium.envs.box2d.car_dynamics import Car
from gymnasium.error import DependencyNotInstalled, InvalidAction
from gymnasium.utils import EzPickle
import pygame
from pygame import gfxdraw
import time
import matplotlib.pyplot as plt

---

The CarRacing-v3 environment from Gymnasium (previously Gym) is part of the Box2D environments, and it offers an interesting challenge for training reinforcement learning agents. It's a top-down racing simulation where the track is randomly generated at the start of each episode. The environment offers both continuous and discrete action spaces, making it adaptable to different types of reinforcement learning algorithms.

- **Action Space:**

   - **Continuous:** Three actions: steering, gas, and braking. Steering ranges from -1 (full left) to +1 (full right); gas and braking each range from 0 to 1.
   - **Discrete:** Five possible actions: do nothing, steer left, steer right, gas, and brake.

- **Observation Space:**

    - The environment provides a 96x96 RGB image of the car and the track, which serves as the state input for the agent.

- **Rewards:**

    - The agent receives a -0.1 penalty for every frame, encouraging efficiency.
    - It earns $+1000/N$ for every track tile visited, where $N$ is the total number of tiles in the track. An agent that visits every tile therefore scores $\text{Reward} = 1000 - 0.1 \times \text{frames}$, where "frames" is the number of frames taken to complete the lap. For example, finishing in 732 frames gives $1000 - 0.1 \times 732 = 926.8$ points.

- **Episode Termination:**

    - The episode ends when all track tiles have been visited, or when the car leaves the playfield (that is, drives far off the track), which incurs a -100 penalty.
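
As a worked example of the reward arithmetic, the sketch below computes an episode's return from the number of frames used and tiles visited. It assumes the rule documented for CarRacing (+1000/N per visited tile, -0.1 per frame) and ignores the -100 out-of-playfield penalty. The function name `episode_return`, its parameters, and the tile count of 300 are illustrative choices, not part of Gymnasium.

```python
def episode_return(frames: int, tiles_visited: int, total_tiles: int) -> float:
    """Return implied by the CarRacing reward rule (hypothetical helper)."""
    if total_tiles <= 0:
        raise ValueError("total_tiles must be positive")
    if not 0 <= tiles_visited <= total_tiles:
        raise ValueError("tiles_visited must lie between 0 and total_tiles")
    tile_reward = 1000.0 * tiles_visited / total_tiles  # +1000/N per visited tile
    frame_penalty = 0.1 * frames                        # -0.1 for every frame
    return tile_reward - frame_penalty


# The tile count is an arbitrary illustration; real tracks are randomly generated.
full_lap = episode_return(frames=732, tiles_visited=300, total_tiles=300)
half_lap = episode_return(frames=1000, tiles_visited=150, total_tiles=300)
print(f"{full_lap:.1f}")  # 926.8
print(f"{half_lap:.1f}")  # 400.0
```

The second call shows why the per-frame penalty matters: an agent that covers only half the track in 1000 frames keeps just 400 of the 500 points its tiles were worth.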

In [8]:
env = gym.make("CarRacing-v3", render_mode="rgb_array", continuous=True)
# Set continuous=False to use the discrete action space

In [4]:
# env_human = gym.make("CarRacing-v3", render_mode="human")
# The "human" render mode is not supported in this notebook

- Checking that the environment is set up and working

In [5]:
# Reset the environment to confirm that it initializes
obs, info = env.reset()

# The environment is left open because later cells reuse `env`

print("Environment initialized successfully!")

Environment initialized successfully!


In [6]:
print("Action space:", env.action_space)

Action space: Box([-1.  0.  0.], 1.0, (3,), float32)
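
The printed `Box` describes a 3-component action with lower bounds `[-1, 0, 0]` and an upper bound of 1 on every component. As a minimal sketch of what those bounds mean in practice, the helper below clamps a raw action vector into that range. `LOW`, `HIGH`, and `clip_action` are illustrative names; Gymnasium does not provide or require this helper.

```python
# Bounds copied from the printed space: Box([-1. 0. 0.], 1.0, (3,), float32)
LOW = (-1.0, 0.0, 0.0)   # steering, gas, brake
HIGH = (1.0, 1.0, 1.0)


def clip_action(raw):
    """Clamp each component of a raw action into [LOW, HIGH] (hypothetical helper)."""
    if len(raw) != len(LOW):
        raise ValueError(f"expected {len(LOW)} components, got {len(raw)}")
    return [min(max(value, lo), hi) for value, lo, hi in zip(raw, LOW, HIGH)]


clipped = clip_action([1.7, -0.3, 0.5])
print(clipped)  # [1.0, 0.0, 0.5]
```

Steering is the only component that accepts negative values; a negative gas or brake value is clamped to 0.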


In [None]:
# Reset the environment
obs, info = env.reset()

# Play for up to 100 steps with random actions
for _ in range(100):
    action = env.action_space.sample()  # Random action
    obs, reward, terminated, truncated, info = env.step(action)

    # Display the frame
    plt.imshow(obs)  # The observation is the 96x96 RGB frame
    plt.axis('off')  # Hide axes
    plt.show()       # Display the frame

    # Stop when the episode terminates or is truncated
    if terminated or truncated:
        break

env.close()
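
The loop above draws one figure per step. One possible variant keeps only every k-th frame and accumulates the return along the way. The sketch below shows that loop structure against a stand-in environment so it runs without Box2D: `StubEnv`, `sample_action`, `rollout`, and `frame_every` are hypothetical names, and `StubEnv` only mimics the five-value `step` return of the Gymnasium API.

```python
import random


class StubEnv:
    """Stand-in that mimics Gymnasium's reset/step return shapes (not a real environment)."""

    def __init__(self, episode_length=25):
        self.episode_length = episode_length
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return f"frame-{self.t}", {}

    def sample_action(self):
        # Random steering in [-1, 1], gas and brake in [0, 1]
        return [random.uniform(-1.0, 1.0), random.random(), random.random()]

    def step(self, action):
        self.t += 1
        obs = f"frame-{self.t}"
        reward = -0.1                                # per-frame penalty only
        terminated = False
        truncated = self.t >= self.episode_length    # time limit reached
        return obs, reward, terminated, truncated, {}


def rollout(env, max_steps=100, frame_every=10):
    """Run one random episode, keeping every `frame_every`-th observation."""
    obs, info = env.reset()
    total_reward = 0.0
    kept_frames = []
    for step in range(max_steps):
        action = env.sample_action()
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if step % frame_every == 0:
            kept_frames.append(obs)
        if terminated or truncated:
            break
    return total_reward, kept_frames


total_reward, kept_frames = rollout(StubEnv(episode_length=25))
print(round(total_reward, 1), kept_frames)
```

With the real environment, `env.sample_action()` would be replaced by `env.action_space.sample()`, and the kept frames could be passed to `plt.imshow` one at a time.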

---
## Training with PPO

In [2]:
%pip install stable-baselines3

Collecting stable-baselines3
  Downloading stable_baselines3-2.4.0-py3-none-any.whl.metadata (4.5 kB)
Collecting numpy<2.0,>=1.20 (from stable-baselines3)
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting torch>=1.13 (from stable-baselines3)
  Downloading torch-2.5.1-cp312-cp312-manylinux1_x86_64.whl.metadata (28 kB)
Collecting pandas (from stable-baselines3)
  Using cached pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting filelock (from torch>=1.13->stable-baselines3)
  Using cached filelock-3.16.1-py3-none-any.whl.metadata (2.9 kB)
Collecting networkx (from torch>=1.13->stable-baselines3)
  Downloading networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch>=1.13->stable-baselines3)
  Downloading jinja2-3.1.4-py3-none-any.whl.metadata (2.6 kB)
Collecting fsspec (from torch>=1.13->stable-baselines3)
  Downloading fsspec-2024.10.0-py3-none-any.whl.me

In [1]:
from stable_baselines3 import PPO

In [None]:
# Load a previously saved model
model = PPO.load("trainsed_model_ppo")
# To load from a specific location, pass the full path instead
# model = PPO.load("/path/trainsed_model_ppo")


In [None]:
# Save the trained model
model.save("trainsed_model_ppo")

In [None]:
# Create the environment (continuous action space)
env = gym.make("CarRacing-v3", render_mode="rgb_array", continuous=True)
obs, info = env.reset()

In [None]:
# Initialize/create the model; CnnPolicy suits the 96x96 RGB image observations
model = PPO("CnnPolicy", env, verbose=1)  # other options would be TD3 or SAC


Train the model without visualizing the training.

In [None]:
model.learn(total_timesteps=100000)

Watch the trained agent drive, with rendered frames.

In [None]:
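# Run a few episodes with the trained model to watch it drive
episodes = 5

for ep in range(episodes):
    obs, info = env.reset()
    done = False
    while not done:
        action, _states = model.predict(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        env.render()  # shows the animation when the env uses render_mode="human"
        print(reward)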
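
Printing the reward at every step produces a long stream of numbers. A per-episode summary may be easier to read. The sketch below assumes the per-step rewards have already been summed into one return per episode; `summarize_returns` is a hypothetical helper, and the input values are placeholders chosen for easy arithmetic, not results from this notebook.

```python
from statistics import mean, pstdev


def summarize_returns(episode_returns):
    """Summarize a list of per-episode returns (hypothetical helper)."""
    if not episode_returns:
        raise ValueError("need at least one episode return")
    return {
        "episodes": len(episode_returns),
        "mean": mean(episode_returns),
        "std": pstdev(episode_returns),  # population standard deviation
        "min": min(episode_returns),
        "max": max(episode_returns),
    }


# Placeholder values for illustration only; not measured results.
example_returns = [100.0, 200.0, 300.0]
summary = summarize_returns(example_returns)
print(summary)
```

In the evaluation loop, each episode's return would be obtained by adding `reward` to a running total inside the `while` loop and appending that total to a list once the episode ends.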