In [1]:

from utils.training import train_model
from utils.evaluation import evaluate_model, aggregate_and_normalize_rewards
from gymnasium import register
import gymnasium



## 0. Notebook description

This notebook is used to compare the performance of different reinforcement learning algorithms on the highway environment with custom rewards. The algorithms being compared are DQN, PPO, A2C, and TRPO. Each algorithm will be trained using the same environment configuration to guarantee a fair comparison. The custom rewards are designed to promote safer and more realistic driving behaviours.


## 1. Train the model

##### Set up the environment and training configuration for the custom reward environment.

In [2]:
# Weights for the custom reward terms (merged into the environment config)
config_updates = {
    "safe_distance_reward": 0.1,
    "left_vehicle_overtaken_reward": -0.5,
    "collision_reward": -4,
    "smooth_driving_reward": 0.3,
    "right_lane_reward": 0.5,
}

# Register the custom environment
register(
    id='CustomRewardEnv',
    entry_point='HighwayEnvCustomReward:HighwayEnvFastCustomReward',
)

### 1.1. Train the model using DQN

Deep Q-Network (DQN) is a value-based reinforcement learning algorithm. It approximates the Q-value function, which estimates the expected future rewards for each action in a given state. The agent selects actions based on these Q-values to maximize its cumulative reward. DQN is commonly used in environments with discrete action spaces.
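
As a minimal sketch of the idea (not the implementation behind `train_model`, whose internals are not shown in this notebook), the snippet below selects an action epsilon-greedily from a list of Q-values and computes the one-step temporal-difference target that a DQN is trained toward. The function names, the Q-values, and the `epsilon` and `gamma` settings are illustrative assumptions.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise take argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Made-up Q-values, one per discrete action (hypothetical numbers).
q = [0.1, 0.4, 0.2, 0.9, 0.3]

greedy_action = select_action(q, epsilon=0.0)  # epsilon=0 means always greedy
target = td_target(reward=0.5, next_q_values=q, done=False, gamma=0.9)
```

With `epsilon=0.0` the agent never explores, so the index of the largest Q-value is always chosen; during training a larger `epsilon` would typically be used and reduced over time.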


In [11]:
# Create the environment with the custom parameters.
# Set log_performance_metrics_enabled to True or False as required
# (metrics logging is disabled during training).
log_filename = "2_dqn_custom_reward_log.csv"
log_performance_metrics_enabled = False

In [3]:
env = gymnasium.make('CustomRewardEnv', 
                     render_mode='rgb_array', 
                     log_performance_metrics_enabled=log_performance_metrics_enabled, 
                     log_filename=log_filename
                    )

train_model(
    env=env,
    config_updates=config_updates,
    session_name='2_Group15_RLProject',
    algorithm='DQN'
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -4,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'left_vehicle_overtaken_reward': -0.5,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.5,
 'safe_distance_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 5,
 'smooth_driving_reward': 0.3,
 'vehicles_count': 20,
 'vehicles_density': 1}
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to ./log

In [4]:
# Create the environment with the custom parameter
log_performance_metrics_enabled=True

env = gymnasium.make('CustomRewardEnv', 
                     render_mode='rgb_array', 
                     log_performance_metrics_enabled=log_performance_metrics_enabled, 
                     log_filename=log_filename
                    )
evaluate_model(
    env=env,
    config_updates={**config_updates, "simulation_frequency": 15},
    model_path='models/2_Group15_RLProject_DQN',
    algorithm='DQN',
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -4,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'left_vehicle_overtaken_reward': -0.5,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.5,
 'safe_distance_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 15,
 'smooth_driving_reward': 0.3,
 'vehicles_count': 20,
 'vehicles_density': 1}
Logging metrics for step 15 and seconds elapsed 1.0
Logging metrics for step 30 and seconds elapsed 2.0
Loggi

In [12]:
metrics = aggregate_and_normalize_rewards(log_filename)

if metrics:
    print("Performance metric (as percent of all steps):")
    for metric_name, avg_metric in metrics.items():
        print(f"{metric_name}: {avg_metric*100:.4f}%")

Performance metric (as percent of all steps):
collision_count: 4.6954%
right_lane_count: 62.1397%
on_road_count: 100.0000%
safe_distance_count: 96.1070%
left_vehicle_overtaken_count: 5.2600%
abrupt_accelerations_count: 5.4383%


### 1.2. Train the model using PPO

Proximal Policy Optimization (PPO) is a policy-gradient reinforcement learning algorithm. It optimises the policy directly using a surrogate objective function that prevents large updates to the policy, thus allowing stable and reliable learning. PPO is commonly used with continuous action spaces but also supports discrete ones, such as the `DiscreteMetaAction` space used here.
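
The effect of the surrogate objective can be illustrated for a single sample. The sketch below assumes the widely used clipped variant of PPO; the function name and the `clip_range` value of 0.2 are illustrative assumptions, not settings taken from this notebook's training code.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO clipped objective.

    ratio     : pi_new(a|s) / pi_old(a|s)
    advantage : estimated advantage of the sampled action
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum removes any incentive to move the ratio
    # outside [1 - clip_range, 1 + clip_range].
    return min(ratio * advantage, clipped_ratio * advantage)


inside = clipped_surrogate(ratio=1.1, advantage=2.0)      # ratio within the clip range
capped = clipped_surrogate(ratio=1.5, advantage=2.0)      # gain capped at 1.2 * advantage
penalized = clipped_surrogate(ratio=0.5, advantage=-1.0)  # pessimistic bound is kept
```

When the ratio leaves the clip range in the direction that would increase the objective, the value stops growing, which is what limits the size of each policy update.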


In [14]:
# Create the environment with the custom parameter
log_filename="2_ppo_custom_reward_log.csv"
log_performance_metrics_enabled=False

env = gymnasium.make('CustomRewardEnv', 
                     render_mode='rgb_array', 
                     log_performance_metrics_enabled=log_performance_metrics_enabled, 
                     log_filename=log_filename
                    )

train_model(
    env=env,
    config_updates=config_updates,
    session_name='2_Group15_RLProject',
    algorithm='PPO'
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -4,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'left_vehicle_overtaken_reward': -0.5,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.5,
 'safe_distance_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 5,
 'smooth_driving_reward': 0.3,
 'vehicles_count': 20,
 'vehicles_density': 1}
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to ./log

In [17]:
# Create the environment with the custom parameter

log_performance_metrics_enabled=True

env = gymnasium.make('CustomRewardEnv', 
                     render_mode='rgb_array', 
                     log_performance_metrics_enabled=log_performance_metrics_enabled, 
                     log_filename=log_filename
                    )

evaluate_model(
    env=env,
    config_updates={**config_updates, "simulation_frequency": 15},
    model_path='models/2_Group15_RLProject_PPO',
    algorithm='PPO',
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -4,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'left_vehicle_overtaken_reward': -0.5,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.5,
 'safe_distance_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 15,
 'smooth_driving_reward': 0.3,
 'vehicles_count': 20,
 'vehicles_density': 1}
Logging metrics for step 15 and seconds elapsed 1.0
Logging metrics for step 30 and seconds elapsed 2.0
Loggi

In [9]:
metrics = aggregate_and_normalize_rewards(log_filename)

if metrics:
    print("Performance metric (as percent of all steps):")
    for metric_name, avg_metric in metrics.items():
        print(f"{metric_name}: {avg_metric*100:.4f}%")

Performance metric (as percent of all steps):
collision_count: 0.0673%
right_lane_count: 69.7357%
on_road_count: 100.0000%
safe_distance_count: 99.9327%
left_vehicle_overtaken_count: 0.3535%
abrupt_accelerations_count: 3.4338%


### 1.3. Train the model using A2C

Advantage Actor-Critic (A2C) is a synchronous, deterministic variant of the Asynchronous Advantage Actor-Critic (A3C) algorithm. It uses both a policy network (actor) and a value network (critic) to learn the optimal policy and value function simultaneously. The advantage function helps to reduce the variance of the policy updates, leading to more stable learning.
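
To make the role of the advantage concrete, the sketch below computes discounted returns for a short rollout and subtracts the critic's value estimates. It is a simplified illustration under stated assumptions (hypothetical function names, made-up rewards and values, no generalized advantage estimation), not the code used by `train_model`.

```python
def discounted_returns(rewards, gamma=0.99, bootstrap_value=0.0):
    """Return G_t = r_t + gamma * G_{t+1}, bootstrapping from the final value."""
    returns = []
    running = bootstrap_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, gamma=0.99, bootstrap_value=0.0):
    """Advantage A_t = G_t - V(s_t): how much better the outcome was than expected."""
    returns = discounted_returns(rewards, gamma, bootstrap_value)
    return [g - v for g, v in zip(returns, values)]


rollout_rewards = [1.0, 0.0, 1.0]  # hypothetical per-step rewards
critic_values = [1.5, 0.5, 1.0]    # hypothetical critic estimates V(s_t)
adv = advantages(rollout_rewards, critic_values, gamma=0.5)
```

Subtracting the critic's estimate centres the learning signal around zero, so the actor is pushed toward actions that did better than expected and away from those that did worse.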


In [20]:
log_filename="2_a2c_custom_reward_log.csv"
log_performance_metrics_enabled=False

env = gymnasium.make('CustomRewardEnv', 
                     render_mode='rgb_array', 
                     log_performance_metrics_enabled=log_performance_metrics_enabled, 
                     log_filename=log_filename
                    )


train_model(
    env=env,
    config_updates=config_updates,
    session_name='2_Group15_RLProject',
    algorithm='A2C'
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -4,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'left_vehicle_overtaken_reward': -0.5,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.5,
 'safe_distance_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 5,
 'smooth_driving_reward': 0.3,
 'vehicles_count': 20,
 'vehicles_density': 1}
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to ./log

In [7]:
log_performance_metrics_enabled=True
env = gymnasium.make('CustomRewardEnv', 
                     render_mode='rgb_array', 
                     log_performance_metrics_enabled=log_performance_metrics_enabled, 
                     log_filename=log_filename
                    )

evaluate_model(
    env=env,
    config_updates={**config_updates, "simulation_frequency": 15},
    model_path='models/2_Group15_RLProject_A2C',
    algorithm='A2C',
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -4,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'left_vehicle_overtaken_reward': -0.5,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.5,
 'safe_distance_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 15,
 'smooth_driving_reward': 0.3,
 'vehicles_count': 20,
 'vehicles_density': 1}
Logging metrics for step 15 and seconds elapsed 1.0
Logging metrics for step 30 and seconds elapsed 2.0
Loggi

In [8]:
metrics = aggregate_and_normalize_rewards(log_filename)

if metrics:
    print("Performance metric (as percent of all steps):")
    for metric_name, avg_metric in metrics.items():
        print(f"{metric_name}: {avg_metric*100:.4f}%")

Performance metric (as percent of all steps):
collision_count: 8.3123%
right_lane_count: 100.0000%
on_road_count: 100.0000%
safe_distance_count: 95.8018%
left_vehicle_overtaken_count: 11.8388%
abrupt_accelerations_count: 6.8430%


### 1.4. Train the model using TRPO

Trust Region Policy Optimization (TRPO) is a reinforcement learning algorithm that improves policies safely by limiting how much they change at each step. It uses a constraint based on KL divergence to keep updates stable, helping the agent learn effectively without making drastic changes.
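
The KL constraint can be sketched for a discrete action distribution. The snippet below only checks whether a candidate policy stays inside the trust region; a full TRPO update additionally has to find the update direction and step size, which is omitted here. The function names, the probabilities, and the `max_kl` threshold of 0.01 are illustrative assumptions.

```python
import math


def kl_divergence(old_probs, new_probs):
    """KL(old || new) for two categorical distributions over the same actions.

    Assumes new_probs is strictly positive wherever old_probs is positive.
    """
    return sum(
        p * math.log(p / q)
        for p, q in zip(old_probs, new_probs)
        if p > 0.0
    )


def within_trust_region(old_probs, new_probs, max_kl=0.01):
    """Accept the candidate policy only if it stays close to the old one."""
    return kl_divergence(old_probs, new_probs) <= max_kl


old_policy = [0.2, 0.2, 0.2, 0.2, 0.2]       # uniform over five actions
small_step = [0.22, 0.2, 0.2, 0.2, 0.18]     # mild shift in preferences
large_step = [0.6, 0.1, 0.1, 0.1, 0.1]       # drastic shift toward one action

small_ok = within_trust_region(old_policy, small_step)
large_ok = within_trust_region(old_policy, large_step)
```

The mild shift passes the check while the drastic one is rejected, which is the sense in which the constraint prevents drastic policy changes.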

In [6]:
log_filename="2_trpo_custom_reward_log.csv"
log_performance_metrics_enabled=False

env = gymnasium.make('CustomRewardEnv', 
                     render_mode='rgb_array', 
                     log_performance_metrics_enabled=log_performance_metrics_enabled, 
                     log_filename=log_filename
                    )
train_model(
    env=env,
    config_updates=config_updates,
    session_name='2_Group15_RLProject',
    algorithm='TRPO'
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -4,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'left_vehicle_overtaken_reward': -0.5,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.5,
 'safe_distance_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 5,
 'smooth_driving_reward': 0.3,
 'vehicles_count': 20,
 'vehicles_density': 1}
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to ./log

In [4]:
log_performance_metrics_enabled=True
env = gymnasium.make('CustomRewardEnv', 
                     render_mode='rgb_array', 
                     log_performance_metrics_enabled=log_performance_metrics_enabled, 
                     log_filename=log_filename
                    )

evaluate_model(
    env=env,
    config_updates={**config_updates, "simulation_frequency": 15},
    model_path='models/2_Group15_RLProject_TRPO',
    algorithm='TRPO',
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -4,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'left_vehicle_overtaken_reward': -0.5,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.5,
 'safe_distance_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 15,
 'smooth_driving_reward': 0.3,
 'vehicles_count': 20,
 'vehicles_density': 1}
Logging metrics for step 15 and seconds elapsed 1.0
Logging metrics for step 30 and seconds elapsed 2.0
Loggi

In [5]:
metrics = aggregate_and_normalize_rewards(log_filename)

if metrics:
    print("Performance metric (as percent of all steps):")
    for metric_name, avg_metric in metrics.items():
        print(f"{metric_name}: {avg_metric*100:.4f}%")

Performance metric (as percent of all steps):
collision_count: 0.1186%
right_lane_count: 64.4915%
on_road_count: 100.0000%
safe_distance_count: 99.9661%
left_vehicle_overtaken_count: 0.2712%
abrupt_accelerations_count: 3.4915%


## 2. Results

We compare the algorithms, each trained on our custom reward function, and discuss likely explanations for their differences in Section 3. The comparison uses the following metrics:
- The percent of steps where a collision was recorded
- The percent of steps where the car was not in the leftmost lane
- The percent of steps the car was on the road
- The percent of steps the car was a safe distance from the car directly ahead of it
- The percent of steps where the car overtook a vehicle directly to the left of it
- The percent of steps where the car's acceleration exceeded the acceptable limit (2 m/s²)
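
How a "percent of steps" metric can be derived from per-step flags is sketched below for the acceleration metric. The implementation of `aggregate_and_normalize_rewards` is not shown in this notebook, so everything here is a hypothetical illustration: the function names, the sample speeds, the one-second step, and the choice to treat hard braking as abrupt too (via the absolute value) are all assumptions.

```python
ACCELERATION_LIMIT = 2.0  # m/s^2, the limit quoted in the list above


def is_abrupt(speed_before, speed_after, dt):
    """Flag a step whose speed change per second exceeds the limit.

    Using abs() also counts hard braking; that is an assumption.
    """
    return abs(speed_after - speed_before) / dt > ACCELERATION_LIMIT


def percent_of_steps(flags):
    """Share of steps (in percent) for which the flag is True."""
    flags = list(flags)
    if not flags:
        return 0.0
    return 100.0 * sum(flags) / len(flags)


speeds = [20.0, 21.0, 24.0, 24.5, 22.0]  # hypothetical ego speeds in m/s
dt = 1.0                                 # hypothetical time between samples in s

abrupt_flags = [is_abrupt(a, b, dt) for a, b in zip(speeds, speeds[1:])]
abrupt_percent = percent_of_steps(abrupt_flags)
```

The same `percent_of_steps` helper would apply to any of the other metrics once each step has been labelled True or False.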

| Metric                 | DQN (% of total steps) | PPO (% of total steps) | A2C (% of total steps) | TRPO (% of total steps) |
|------------------------|------------------------|------------------------|------------------------|-------------------------|
| Collisions             | 4.70                   | 0.07                   | 8.31                   | 0.12                    |
| Right lane             | 62.14                  | 69.74                  | 100.00                 | 64.49                   |
| On road                | 100.00                 | 100.00                 | 100.00                 | 100.00                  |
| Safe distance          | 96.11                  | 99.93                  | 95.80                  | 99.97                   |
| Left vehicle overtaken | 5.26                   | 0.35                   | 11.84                  | 0.27                    |
| Abrupt accelerations   | 5.44                   | 3.43                   | 6.84                   | 3.49                    |
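
The table can also be read programmatically. The sketch below copies the values from the table and picks the best algorithm(s) per metric. Which direction counts as "better" is an assumption made for illustration, in particular for right-lane usage, where a value of 100% is not necessarily desirable.

```python
# Values copied from the results table (percent of total steps).
results = {
    "Collisions": {"DQN": 4.70, "PPO": 0.07, "A2C": 8.31, "TRPO": 0.12},
    "Right lane": {"DQN": 62.14, "PPO": 69.74, "A2C": 100.00, "TRPO": 64.49},
    "On road": {"DQN": 100.00, "PPO": 100.00, "A2C": 100.00, "TRPO": 100.00},
    "Safe distance": {"DQN": 96.11, "PPO": 99.93, "A2C": 95.80, "TRPO": 99.97},
    "Left vehicle overtaken": {"DQN": 5.26, "PPO": 0.35, "A2C": 11.84, "TRPO": 0.27},
    "Abrupt accelerations": {"DQN": 5.44, "PPO": 3.43, "A2C": 6.84, "TRPO": 3.49},
}

# Assumption: these metrics are better when higher; all others when lower.
HIGHER_IS_BETTER = {"Right lane", "On road", "Safe distance"}


def best_algorithms(metric):
    """Return the algorithm(s) with the best score on a metric (ties included)."""
    scores = results[metric]
    pick = max if metric in HIGHER_IS_BETTER else min
    best = pick(scores.values())
    return sorted(name for name, value in scores.items() if value == best)


summary = {metric: best_algorithms(metric) for metric in results}

for metric, winners in summary.items():
    print(f"{metric}: {', '.join(winners)}")
```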


## 3. Discussion
## Safety performance 


### Collisions
PPO and TRPO avoid collisions very well (0.07% and 0.12% of steps, respectively), likely because they balance exploration and exploitation effectively and so learn to prioritize safe behavior. A2C's more aggressive behavior and less stable updates lead to riskier actions and the most collisions (8.31% of steps).

### Right lane usage
PPO and TRPO emphasize stable policy updates, encouraging balanced behavior and leading to strong but not absolute right-lane adherence: both models let the ego vehicle use the leftmost lane in at least 30% of steps.

A2C's 100% avoidance of the leftmost lane suggests that its optimization overly prioritized the "right lane reward." This might indicate that the policy fixated on staying in the right lane as a dominant strategy, potentially neglecting other objectives like overtaking or lane changes for safety or efficiency.
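
The lane-usage figures in this subsection follow from the logged `right_lane_count` values by simple subtraction. The sketch below assumes that every step is either counted as "not in the leftmost lane" or spent in the leftmost lane, so the two shares sum to 100%; the function name is hypothetical.

```python
# right_lane_count values from the evaluation logs (percent of all steps).
right_lane_percent = {
    "DQN": 62.1397,
    "PPO": 69.7357,
    "A2C": 100.0,
    "TRPO": 64.4915,
}


def leftmost_lane_percent(right_percent):
    """Share of steps spent in the leftmost lane, rounded to two decimals."""
    return round(100.0 - right_percent, 2)


leftmost = {
    name: leftmost_lane_percent(value)
    for name, value in right_lane_percent.items()
}

for name, value in leftmost.items():
    print(f"{name}: {value:.2f}% of steps in the leftmost lane")
```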


### On road 
All agents stayed on the road in 100% of recorded steps. The `DiscreteMetaAction` action type does not appear to give vehicles the option of running off the road, unlike the `ContinuousAction` type.
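
For reference, the action type is selected through the environment configuration. The sketch below contrasts the discrete setting shown in the config dumps above with a continuous one. The `ContinuousAction` type name is assumed from highway-env, the helper `with_action_type` is hypothetical, and enabling `offroad_terminal` is one possible choice for such an experiment, not something this notebook does.

```python
def with_action_type(config_updates, action_type, offroad_terminal=False):
    """Return a copy of the config with a different action type (hypothetical helper)."""
    return {
        **config_updates,
        "action": {"type": action_type},
        "offroad_terminal": offroad_terminal,
    }


base_updates = {"collision_reward": -4, "right_lane_reward": 0.5}

# Setting used in this notebook, as shown in the printed configuration.
discrete_config = with_action_type(base_updates, "DiscreteMetaAction")

# Assumed alternative: continuous control, ending the episode when off the road.
continuous_config = with_action_type(
    base_updates, "ContinuousAction", offroad_terminal=True
)
```

With a continuous action type the on-road metric would be expected to become informative, since the agent would control steering directly.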

### Safe distance from other vehicles 

TRPO and PPO are strongest here (99.97% and 99.93% of steps at a safe distance, respectively), likely due to their focus on stable policy optimization. A2C's tendency to oscillate between strategies might explain its closer, riskier following behavior (95.80%).

### Not overtaking vehicles to the left 
A2C overtook vehicles on its left in 11.84% of steps, a behavior that is penalized in this scenario. This high rate may be due to its relatively fewer updates compared to PPO and TRPO.

## Driver comfort performance comparison

### Comfortable acceleration
PPO and TRPO perform best (abrupt accelerations in 3.43% and 3.49% of steps, respectively), likely because their optimization constraints discourage extreme changes in behavior. A2C's less stable updates might explain its jerkier driving (6.84%).