## 0. Notebook description

This notebook compares the performance of several reinforcement learning algorithms on the highway environment with custom rewards: DQN, PPO, A2C, and HER. Every algorithm is trained with the same environment configuration so that the comparison is fair. The custom rewards are designed to promote safer and more realistic driving behaviours.


## 1. Train the model

##### Set up the environment and training configuration for the custom reward environment.

In [1]:
from utils.training import train_model
from utils.evaluation import evaluate_model
from gymnasium import register
import gymnasium

config_updates = {
    "safe_distance_reward": 0.1,
    "left_vehicle_overtaken_reward": -0.5,
    "collision_reward": -4,
}

# Register the custom environment
register(
    id='CustomRewardEnv',
    entry_point='HighwayEnvCustomReward:HighwayEnvFastCustomReward',
)

# Toggle reward logging in the custom environment
log_rewards_enabled = True  # Set this to False if you don't want to log rewards

# Create the environment with the custom parameter
env = gymnasium.make('CustomRewardEnv', render_mode='rgb_array', log_rewards_enabled=log_rewards_enabled)

### 1.1. Train the model using DQN

Deep Q-Network (DQN) is a value-based reinforcement learning algorithm. It uses a neural network to approximate the Q-value function, which estimates the expected cumulative future reward of each action in a given state. The agent selects the action with the highest Q-value, with occasional random actions for exploration, to maximize its cumulative reward. DQN requires a discrete action space, which matches the `DiscreteMetaAction` action type used in this environment.
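
To make the idea concrete, here is a minimal sketch of two ingredients of DQN: epsilon-greedy action selection from Q-values, and the one-step temporal-difference target that the Q-network is trained toward. It is a toy illustration in plain Python, not the code inside `train_model`; the function names, the discount factor `gamma`, and the exploration rate `epsilon` are illustrative assumptions.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target: r + gamma * max_a' Q(s', a'), with no bootstrap at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical Q-values for five discrete actions (made-up numbers).
q = [0.1, 0.7, 0.3, 0.2, 0.0]
greedy_action = select_action(q, epsilon=0.0)  # always picks the largest Q-value
target = td_target(reward=0.5, next_q_values=q, done=False, gamma=0.9)
```

With `epsilon=0.0` the agent never explores; in practice the exploration rate is usually decayed over the course of training.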


In [2]:
train_model(
    env=env,
    config_updates=config_updates,
    session_name='2_Group15_RLProject',
    algorithm='DQN'
)

evaluate_model(
    env=env,
    config_updates=config_updates,
    model_path='models/custom_reward_function',
    algorithm='DQN',
    total_episodes=200
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -4,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'left_vehicle_overtaken_reward': -0.5,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.1,
 'safe_distance_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 5,
 'vehicles_count': 20,
 'vehicles_density': 1}
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to ./logs/tensorboard/2_Group15_RLProje

AttributeError: 'HighwayEnvFastCustomReward' object has no attribute 'log_performance_metrics_enabled'

### 1.2. Train the model using PPO

Proximal Policy Optimization (PPO) is a policy-gradient reinforcement learning algorithm. It optimises the policy directly using a clipped surrogate objective that prevents large policy updates, which makes learning more stable and reliable. PPO supports both discrete and continuous action spaces.
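
The surrogate objective can be sketched for a single sample as follows. This is a plain-Python illustration of the clipping idea, not the implementation used by `train_model`; the function name and the `clip_range` default of 0.2 are assumptions chosen for the example.

```python
import math


def clipped_surrogate(new_log_prob, old_log_prob, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(new_log_prob - old_log_prob)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + 0.2) * A.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=2.0)

# Ratio 0.5 with a negative advantage: the objective takes the lower value, (1 - 0.2) * A.
pessimistic = clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0)
```

Because the objective is flat once the ratio leaves the clip range in the favourable direction, the gradient gives no incentive to move the policy further from the one that collected the data.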


In [3]:
train_model(
    env=env,
    config_updates=config_updates,
    session_name='2_Group15_RLProject',
    algorithm='PPO'
)

evaluate_model(
    env=env,
    config_updates=config_updates,
    model_path='models/custom_reward_function',
    algorithm='PPO',
    total_episodes=200
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -4,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'left_vehicle_overtaken_reward': -0.5,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.1,
 'safe_distance_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 5,
 'vehicles_count': 20,
 'vehicles_density': 1}
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to ./logs/tensorboard/2_Group15_RLProje

KeyboardInterrupt: 

### 1.3. Train the model using A2C

Advantage Actor-Critic (A2C) is a synchronous, deterministic variant of the Asynchronous Advantage Actor-Critic (A3C) algorithm. It uses both a policy network (actor) and a value network (critic) to learn the policy and the value function simultaneously. The advantage, which measures how much better an action turned out than the critic's estimate for that state, reduces the variance of the policy updates and leads to more stable learning.
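
A minimal sketch of how advantages can be computed from a short rollout is shown below. It uses discounted returns bootstrapped from the critic and subtracts the critic's value estimates; the function names and the numbers are illustrative assumptions, not taken from `train_model`.

```python
def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted return at each step, bootstrapped from the critic's value after the last step."""
    returns = []
    running = bootstrap_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantage = discounted return minus the critic's value estimate for that step."""
    returns = discounted_returns(rewards, bootstrap_value, gamma)
    return [g - v for g, v in zip(returns, values)]


# Made-up three-step rollout that ends the episode (bootstrap value 0).
rewards = [1.0, 0.0, 1.0]
values = [1.5, 1.0, 0.5]  # hypothetical critic estimates
adv = advantages(rewards, values, bootstrap_value=0.0, gamma=0.5)
```

A positive advantage increases the probability of the action taken; a negative one decreases it.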


In [None]:
train_model(
    env=env,
    config_updates=config_updates,
    session_name='2_Group15_RLProject',
    algorithm='A2C'
)

evaluate_model(
    env=env,
    config_updates=config_updates,
    model_path='models/custom_reward_function',
    algorithm='A2C',
    total_episodes=200
)

### 1.4. Train the model using HER

Hindsight Experience Replay (HER) is a technique for improving the sample efficiency of reinforcement learning algorithms, particularly in sparse-reward environments. It lets the agent learn from unsuccessful episodes by pretending that the state it actually reached was the goal it was trying to achieve, which turns failed attempts into useful training data. HER is not a standalone algorithm: it is a replay strategy that is combined with an off-policy learner and requires a goal-conditioned environment.
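
The relabelling step can be sketched as follows, using the simplest strategy of substituting the state reached at the end of the episode as the goal. The dictionary keys, the reward function, and the integer "goals" are hypothetical and chosen only to keep the example small; this is not how `train_model` handles HER.

```python
def relabel_with_achieved_goal(episode, reward_fn):
    """Replace every desired goal with the goal actually reached at the end of the episode."""
    new_goal = episode[-1]["achieved_goal"]
    return [
        {**step, "desired_goal": new_goal, "reward": reward_fn(step["achieved_goal"], new_goal)}
        for step in episode
    ]


def sparse_reward(achieved, desired):
    """Hypothetical sparse reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if achieved == desired else -1.0


# A failed episode: the agent wanted to reach 5 but ended at 3.
episode = [
    {"achieved_goal": 1, "desired_goal": 5, "reward": -1.0},
    {"achieved_goal": 2, "desired_goal": 5, "reward": -1.0},
    {"achieved_goal": 3, "desired_goal": 5, "reward": -1.0},
]
relabelled = relabel_with_achieved_goal(episode, sparse_reward)
```

After relabelling, the final transition carries a non-negative reward, so the replay buffer contains at least one "successful" example even though the original episode failed.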

In [None]:
train_model(
    env=env,
    config_updates=config_updates,
    session_name='2_Group15_RLProject',
    algorithm='HER'
)

evaluate_model(
    env=env,
    config_updates=config_updates,
    model_path='models/custom_reward_function',
    algorithm='HER',
    total_episodes=200
)

### SAC (not compatible)

Soft Actor-Critic (SAC) is an off-policy actor-critic algorithm that aims to maximize both the expected reward and the entropy of the policy. The entropy term encourages exploration by rewarding policies that stay stochastic, leading to more robust behaviour. SAC is known for its sample efficiency and stability, but it requires a continuous action space. This environment is configured with the discrete `DiscreteMetaAction` action type, so SAC cannot be used here without changing the action configuration.
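
As a small illustration of the entropy term, the sketch below adds an entropy bonus to a reward for a one-dimensional Gaussian policy. The temperature `alpha`, the function names, and the numbers are assumptions for the example only; nothing here is run elsewhere in this notebook.

```python
import math


def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian: 0.5 * log(2 * pi * e * std^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def entropy_augmented_reward(reward, std, alpha=0.2):
    """Reward plus alpha times the policy entropy, as in a maximum-entropy objective."""
    return reward + alpha * gaussian_entropy(std)


# A wider (more exploratory) policy earns a larger bonus than a near-deterministic one.
wide = entropy_augmented_reward(reward=1.0, std=1.0)
narrow = entropy_augmented_reward(reward=1.0, std=0.1)
```

Setting `alpha` to zero recovers the ordinary reward, which is why the temperature controls the trade-off between exploration and exploitation.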
