## Notebook description

This notebook demonstrates how to customize the reward function of the `highway-fast-v0` environment. The custom reward logic is implemented in the `utils/highwayEnvCustomReward.py` file and replaces or enhances elements of the default reward function to better **align with realistic driving scenarios and enhance safety aspects**.

## Summary of the default reward function in `highway-fast-v0`

The default reward function in the `highway-fast-v0` environment is designed to promote safe and efficient driving behaviors. It calculates a scalar reward for each action based on four key factors:

1. **Collision Penalty:**

   - The `collision_reward` penalizes the agent if the vehicle crashes. This is a binary penalty, either 0 (no crash) or a negative value (collision occurred).

2. **Right Lane Reward:**

   - The `right_lane_reward` rewards the agent for staying in the rightmost lane. The reward increases as the vehicle moves closer to the rightmost lane, encouraging proper lane usage.

3. **High-Speed Reward:**

   - The `high_speed_reward` incentivizes driving at higher speeds. The reward is proportional to the vehicle's forward speed, normalized within a defined speed range.

4. **On-Road Reward:**
   - The `on_road_reward` encourages the vehicle to stay on the road by multiplying the total reward by 1 if the vehicle is on the road or 0 if it is off-road. This means that if the vehicle is off-road, the total reward is also zero.
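
The way these four factors combine can be sketched as a standalone function. This is a simplified approximation rather than the library's own code: the weights and speed range are copied from the default configuration printed later in this notebook, while the function name `default_reward_sketch`, its arguments, and the convention that lane index 0 is the leftmost lane are assumptions made for illustration.

```python
import numpy as np

# Weights and speed range taken from the default configuration shown later in this notebook.
CONFIG = {
    "collision_reward": -1.0,
    "right_lane_reward": 0.1,
    "high_speed_reward": 0.4,
    "reward_speed_range": (20.0, 30.0),
    "lanes_count": 3,
}


def default_reward_sketch(crashed, lane_index, forward_speed, on_road, config=CONFIG):
    """Simplified approximation of the default reward (hypothetical helper, not library code)."""
    v_min, v_max = config["reward_speed_range"]
    # Map the forward speed linearly onto [0, 1] within the reward speed range.
    scaled_speed = np.clip((forward_speed - v_min) / (v_max - v_min), 0.0, 1.0)
    # Assumption: lane index 0 is the leftmost lane, so the fraction is 1 in the rightmost lane.
    lane_fraction = lane_index / max(config["lanes_count"] - 1, 1)

    raw = (
        config["collision_reward"] * float(crashed)
        + config["right_lane_reward"] * lane_fraction
        + config["high_speed_reward"] * scaled_speed
    )
    # Rescale from [worst case, best case] to [0, 1], as with normalize_reward=True.
    worst = config["collision_reward"]
    best = config["right_lane_reward"] + config["high_speed_reward"]
    normalized = (raw - worst) / (best - worst)
    # The on-road factor multiplies the total, so an off-road vehicle always receives 0.
    return float(normalized) * float(on_road)
```

Under these assumptions, driving at 30 m/s in the rightmost lane without crashing yields the maximum reward of 1, while any off-road step yields 0 regardless of the other terms.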

## Shortcomings of the default reward function in `highway-fast-v0`

1. **Overemphasising the rightmost lane:**

   - The `right_lane_reward` encourages staying in the rightmost lane, which may not be in line with realistic driving goals. For example, overtaking may require moving into other lanes, which is not directly encouraged.

2. **Speed reward out of context:**

   - The `high_speed_reward` rewards speed linearly, but does not take into account traffic density. Speeding in traffic jams should be penalised.

3. **Binary Collision Penalty:**

   - The `collision_reward` is binary, providing the same penalty regardless of the severity or cause of the crash. This ignores scenarios like near-misses, which could be penalized slightly to encourage caution.

4. **Lack of a safe distance mechanism:**

   - The current reward function does not provide an incentive to maintain a safe distance from the vehicle in front. However, promoting a safe following distance is a fundamental part of safe driving. It helps avoid collisions and creates smoother traffic flow.


5. **Overtaking cars directly to the left:**
   - The current reward function does not penalize overtaking a vehicle that is directly to the ego vehicle's left (that is, passing it on the right), which is illegal in most EU countries. An agent trained in the default highway environment may therefore violate Belgium's highway code.

6. **Not prioritizing rider's comfort:**
   - The current reward function does not penalize actions that would hurt the rider's user experience, for example, increasing or decreasing speed too quickly, which would jolt the rider. Acceleration of +/- 2 meters per second per second is the threshold after which drivers begin to experience discomfort.



## Changes made to the reward function

### Add safe distance reward

**Implementation**

1. Identify the Front Vehicle:

   - Use the existing `road.neighbour_vehicles(vehicle)` method to locate the vehicle ahead of the agent.

2. Compute the Distance:

   - Calculate the distance between the agent's vehicle and the identified front vehicle. If no front vehicle exists, no distance is computed and the reward defaults to 0.

3. Define the Reward:

   - Reward the agent if the distance exceeds a safe threshold (5 meters in the code below). Penalize the agent in proportion to how far the distance falls below this threshold.

**Advantages**

- Encourages the agent to maintain a safe distance, reducing the likelihood of rear-end collisions.
- Promotes safer and more realistic driving behaviors.


---
**Code**

````python
# safe distance reward
front_vehicle, _ = self.road.neighbour_vehicles(self.vehicle) # identify the front vehicle
safe_distance = 5
if front_vehicle:
    distance = max(front_vehicle.position[0] - self.vehicle.position[0], 0)

    if distance > safe_distance:
        safe_distance_reward = 1  # Full reward if distance is safe
    else:
        safe_distance_reward = -1 * (safe_distance - distance) / safe_distance
else:
    safe_distance_reward = 0
````
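
To make the shape of this reward easier to inspect, the same rule can be written as a standalone function and evaluated for a few gaps. This is a minimal sketch: the function name `safe_distance_reward` and the use of `None` to represent "no front vehicle" are assumptions for illustration, not part of the environment class.

```python
def safe_distance_reward(gap, safe_distance=5.0):
    """Reward for a longitudinal gap (in meters) to the front vehicle.

    gap=None stands for "no front vehicle" (an illustrative convention).
    """
    if gap is None:
        return 0.0
    gap = max(gap, 0.0)  # A negative gap is treated as zero, as in the snippet above
    if gap > safe_distance:
        return 1.0  # Full reward when the gap is safe
    # Linear penalty: 0 at the threshold, -1 when the gap is zero
    return -1.0 * (safe_distance - gap) / safe_distance


for gap in (None, 10.0, 2.5, 0.0):
    print(gap, safe_distance_reward(gap))
```

Note that the reward is discontinuous at the threshold: it is 0 at exactly `safe_distance` and jumps to 1 just above it.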

---

### Adjust high speed reward for traffic density

**Implementation**

1. Define Traffic Radius:
    - Set a traffic_radius (e.g., 10 meters) within which the agent's vehicle assesses the surrounding traffic density.

2. Count Nearby Vehicles:
    - Iterate through all vehicles on the road and calculate the Euclidean distance between the agent's vehicle and others.
    - Increment the count of nearby vehicles if they are within the defined traffic radius.

3. Compute Traffic Density Factor:
    - Use the number of nearby vehicles to calculate a traffic density factor, where higher traffic density reduces the reward.
    - Define a maximum density (max_density, e.g., 10 vehicles) for scaling. The factor decreases linearly as the density approaches this maximum.

4. Scale the High-Speed Reward:
    - Adjust the agent's speed-based reward (scaled_speed) using the traffic density factor.
    - Clip the reward within a range of 0 to 1 for consistency.

**Advantages**
- Encourages the agent to drive at higher speeds when traffic density is low, promoting efficient driving.
- Discourages risky high-speed driving in dense traffic, reducing the likelihood of collisions or unsafe maneuvers.

---
**Code**

````python
# high speed reward
traffic_radius = 10

# Count the number of vehicles within the traffic radius
nearby_vehicles = 0
for other_vehicle in self.road.vehicles:
    if other_vehicle is not self.vehicle:
        distance = np.linalg.norm(
            np.array(other_vehicle.position) - np.array(self.vehicle.position)
        )
        if distance < traffic_radius:
            nearby_vehicles += 1

# Traffic density factor: more vehicles -> higher penalty
max_density = 10
traffic_density_factor = max(0, 1 - nearby_vehicles / max_density)

# Adjust high-speed reward based on traffic density
high_speed_reward = np.clip(scaled_speed * traffic_density_factor, 0, 1)
````
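
The density scaling can also be tried outside the environment with plain coordinates. The following is a sketch under stated assumptions: the function name `traffic_adjusted_speed_reward` and its arguments are hypothetical, positions are `(x, y)` pairs in meters, and `scaled_speed` is assumed to be already normalized to the range 0 to 1.

```python
import numpy as np


def traffic_adjusted_speed_reward(ego_position, other_positions, scaled_speed,
                                  traffic_radius=10.0, max_density=10):
    """Scale a normalized speed reward by local traffic density (illustrative helper)."""
    others = np.asarray(other_positions, dtype=float).reshape(-1, 2)
    # Euclidean distance from the ego vehicle to every other vehicle
    distances = np.linalg.norm(others - np.asarray(ego_position, dtype=float), axis=1)
    nearby_vehicles = int(np.sum(distances < traffic_radius))
    # Factor falls linearly from 1 (empty road) to 0 (max_density or more vehicles nearby)
    traffic_density_factor = max(0.0, 1.0 - nearby_vehicles / max_density)
    return float(np.clip(scaled_speed * traffic_density_factor, 0.0, 1.0))


# Two of the three vehicles are within 10 m, so the factor is 1 - 2/10 = 0.8
reward = traffic_adjusted_speed_reward(
    ego_position=(0.0, 0.0),
    other_positions=[(3.0, 0.0), (0.0, 4.0), (50.0, 0.0)],
    scaled_speed=0.8,
)
```

With these inputs the speed reward of 0.8 is scaled by 0.8, giving approximately 0.64.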
---

### Add rule to discourage overtaking vehicles to the left of the car 

**Implementation**
1. Retrieve the vehicle that is to the left of the ego vehicle and horizontally closest to it.
2. Record the position of the left vehicle and the position of the ego vehicle.
3. In the following step, compare the position of the left vehicle and the ego vehicle to their recorded positions in the previous step.
4. If the left vehicle was ahead of the ego vehicle in the previous step, but behind the ego vehicle in the current step, the ego vehicle is considered to have overtaken the left vehicle, and is penalized.


**Advantages**
- Penalizing the overtaking of vehicles to the ego vehicle's left (that is, passing on the right) aligns with traffic laws in regions where overtaking is legally allowed only on the left (e.g., in countries with right-hand traffic, such as Belgium).



---
**Code**

```python
# Excerpt from the method of the custom environment that detects overtaking
ego_vehicle = self.vehicle

ego_current_position = ego_vehicle.position[0]
left_vehicle = self.get_closest_left_vehicle(ego_vehicle)
car_overtook_left_vehicle = False

if left_vehicle:
    left_current_position = left_vehicle.position[0]

ego_previous_position_exists = getattr(self, "ego_vehicle_previous_position", None) is not None
left_previous_position_exists = getattr(self, "left_vehicle_previous_position", None) is not None

if left_vehicle and ego_previous_position_exists and left_previous_position_exists:
    # Logical negation: the left vehicle was level with or ahead of the ego vehicle
    left_vehicle_was_ahead_of_ego = not (self.ego_vehicle_previous_position > self.left_vehicle_previous_position)
    left_vehicle_now_behind_ego = ego_current_position > left_current_position
    car_overtook_left_vehicle = left_vehicle_was_ahead_of_ego and left_vehicle_now_behind_ego

# Store the current positions for the comparison in the next step
self.ego_vehicle_previous_position = ego_current_position
# Reset previous position if no left car
if left_vehicle is None:
    self.left_vehicle_previous_position = None
else:
    self.left_vehicle_previous_position = left_vehicle.position[0]

if car_overtook_left_vehicle:
    return 1  # A left car was overtaken: full penalty once multiplied by the negative weight
return 0  # No left car was overtaken: no penalty
```
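
The crossing test at the core of this rule can be isolated from the environment and exercised on a short sequence of positions. This is a minimal sketch, assuming longitudinal positions are plain floats and that `None` means there is no vehicle to the left; the class name `LeftOvertakeTracker` is hypothetical and does not exist in the project code.

```python
class LeftOvertakeTracker:
    """Detects when the ego vehicle passes a vehicle that was on its left (illustrative)."""

    def __init__(self):
        self.ego_previous = None
        self.left_previous = None

    def update(self, ego_x, left_x):
        """Return True if the left vehicle was ahead last step and is behind now."""
        overtook = False
        if left_x is not None and self.ego_previous is not None and self.left_previous is not None:
            was_ahead = self.left_previous >= self.ego_previous
            now_behind = ego_x > left_x
            overtook = was_ahead and now_behind
        # Remember the current positions for the next step
        self.ego_previous = ego_x
        self.left_previous = left_x
        return overtook


tracker = LeftOvertakeTracker()
# (ego position, left vehicle position) per step
steps = [(0.0, 5.0), (4.0, 6.0), (8.0, 7.0), (10.0, None), (12.0, 11.0)]
flags = [tracker.update(ego_x, left_x) for ego_x, left_x in steps]
```

Only the third step is flagged here: the left vehicle was ahead at step two and behind at step three. Like the snippet above, this sketch compares positions only and does not check that the left vehicle is the same vehicle in both steps.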


### Add rule to discourage uncomfortable acceleration / deceleration


**Implementation**
1. Calculate the current forward/horizontal speed of the vehicle.
2. If the environment has a recorded speed from the previous step, get the absolute value of the difference between the vehicle's current forward speed and the vehicle's last recorded forward speed to get the vehicle's acceleration. 
3. If the vehicle's acceleration exceeds the maximum acceleration limit (2 meters per second per second), do not reward the vehicle (multiply the weight defined in the custom configuration by 0).
4. If the vehicle's acceleration is within the acceleration limit, give the vehicle the full reward by multiplying the corresponding weight by 1. 


**Advantages**
- Rewards the agent for comfortable acceleration, but does not directly penalize abrupt acceleration (for example, if the car needs to apply an emergency brake).

---
**Code**

```python
def get_smooth_driving_reward(self):
    """Return 1 if the speed change since the previous step is within the comfort limit, else 0."""
    max_allowed_acceleration = 2.0  # Threshold for acceptable acceleration (m/s^2)

    smooth_driving_reward = 1
    current_forward_speed = self.vehicle.speed * np.cos(self.vehicle.heading)

    if getattr(self, "previous_speed", None) is not None:
        speed_change = abs(current_forward_speed - self.previous_speed)
        if speed_change > max_allowed_acceleration:
            smooth_driving_reward = 0

    # Store the forward speed (not the raw speed) so the next comparison is like-for-like
    self.previous_speed = current_forward_speed
    return smooth_driving_reward
```
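
The method above treats the speed change between two consecutive steps as the acceleration, which holds when one step lasts one second (the configuration shown later uses `policy_frequency: 1`). The sketch below makes that step duration explicit. The function name `smooth_driving_rewards` and the `dt` parameter are assumptions introduced for illustration.

```python
def smooth_driving_rewards(forward_speeds, dt=1.0, max_allowed_acceleration=2.0):
    """Return a 1/0 comfort reward for each step of a forward-speed trace (illustrative).

    dt is the assumed duration of one step in seconds.
    """
    rewards = []
    previous_speed = None
    for speed in forward_speeds:
        reward = 1  # The first step has no previous speed, so it is always rewarded
        if previous_speed is not None:
            acceleration = abs(speed - previous_speed) / dt
            if acceleration > max_allowed_acceleration:
                reward = 0
        rewards.append(reward)
        previous_speed = speed
    return rewards


# Speed changes of 1.5, 2.5 and 1.0 m/s over one-second steps
trace_rewards = smooth_driving_rewards([20.0, 21.5, 24.0, 23.0])
```

In this trace only the jump from 21.5 to 24.0 m/s exceeds the 2 m/s² limit, so only that step receives a reward of 0.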

## Evaluate the custom reward function

We evaluate the performance of our custom reward function by comparing it to the default reward function. During model evaluation, a 1 or 0 is logged to a CSV file for each step, capturing whether or not the ego vehicle, at that step:

- Collided with another car
- Was not in the leftmost lane
- Was on the road
- Was a safe distance from the car directly ahead of it
- Overtook a vehicle directly to the left of it
- Accelerated past the acceptable limit (2 meters per second per second)

At the end of the evaluation, we sum up these logged counts from both the custom and default reward functions and compare the results to assess the effectiveness of the custom function in achieving the desired outcomes (safety and real-world driving rules).
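
The logging scheme described above can be sketched with the standard `csv` module. The column names are taken from the metric output printed later in this notebook; the helper names `log_step` and `fraction_of_steps`, and the use of an in-memory buffer instead of a file, are assumptions for illustration and may differ from the project's actual logging code.

```python
import csv
import io

# Column names as they appear in the metric output later in this notebook
METRIC_COLUMNS = [
    "collision_count",
    "right_lane_count",
    "on_road_count",
    "safe_distance_count",
    "left_vehicle_overtaken_count",
    "abrupt_accelerations_count",
]


def log_step(writer, **flags):
    """Write one row of 0/1 flags for a single step; missing flags default to 0."""
    writer.writerow({name: int(bool(flags.get(name, False))) for name in METRIC_COLUMNS})


def fraction_of_steps(csv_text):
    """Return, per metric, the fraction of logged steps where the flag was 1."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {name: sum(int(row[name]) for row in rows) / len(rows) for name in METRIC_COLUMNS}


buffer = io.StringIO()  # Stands in for the CSV log file
writer = csv.DictWriter(buffer, fieldnames=METRIC_COLUMNS)
writer.writeheader()
log_step(writer, right_lane_count=True, on_road_count=True, safe_distance_count=True)
log_step(writer, on_road_count=True, abrupt_accelerations_count=True)
summary = fraction_of_steps(buffer.getvalue())
```

For these two example steps, `summary["on_road_count"]` is 1.0 and `summary["abrupt_accelerations_count"]` is 0.5.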

In [1]:
import highway_env
from gymnasium import register
import gymnasium
import csv
from collections import defaultdict
from utils.training import train_model
from utils.evaluation import evaluate_model

In [2]:
def aggregate_and_normalize_rewards(file_path):
    """
    Read and aggregate rewards from the CSV file, then normalize by the number of steps.

    Parameters:
    - file_path (str): Path to the CSV file.

    Returns:
    - dict: Dictionary of normalized rewards (average reward per step).
    """
    rewards_summary = defaultdict(float)
    total_steps = 0  # Count the number of rows (steps)

    try:
        with open(file_path, mode='r') as file:
            reader = csv.DictReader(file)
            for row in reader:
                total_steps += 1  # Increment step count for each row
                for key, value in row.items():
                    rewards_summary[key] += float(value)
    except FileNotFoundError:
        print(f"Error: {file_path} not found.")
        return None

    # Normalize rewards by total steps
    if total_steps > 0:
        normalized_rewards = {key: value / total_steps for key, value in rewards_summary.items()}
    else:
        print("Error: No steps logged in the file.")
        return None

    return normalized_rewards


### Default reward function
First, we evaluate the performance using the default reward function.

In [3]:
register(
    id='DefaultRewardEnv',
    entry_point='HighwayEnvDefaultReward:HighwayEnvDefaultReward',
)

# Set log_performance_metrics_enabled to True or False as required
log_filename="1_default_reward_log.csv"
log_performance_metrics_enabled=False


# Create the environment with the custom parameter
env = gymnasium.make('DefaultRewardEnv', 
                     render_mode='rgb_array', 
                     log_performance_metrics_enabled=log_performance_metrics_enabled,
                     log_filename=log_filename
                     )

In [4]:
# train the model with the default reward function

train_model(
    env=env,
    session_name="1_Group15_RLProject_default",
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -1,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 5,
 'vehicles_count': 20,
 'vehicles_density': 1}
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to ./logs/tensorboard/1_Group15_RLProject_default_DQN_Training/DQN_1
----------------------------------
| rol

In [5]:

log_performance_metrics_enabled=True


config_updates = {
    "simulation_frequency": 15  # Use a higher frame rate
}

env = gymnasium.make('DefaultRewardEnv', 
                     render_mode='rgb_array', 
                     log_performance_metrics_enabled=log_performance_metrics_enabled,
                     log_filename=log_filename
                     )

# evaluate the model with the default reward function
evaluate_model(
    env=env,
    config_updates=config_updates,
    model_path="models/1_Group15_RLProject_default",
    algorithm='DQN',
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -1,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 15,
 'vehicles_count': 20,
 'vehicles_density': 1}
Logging metrics for step 15 and seconds elapsed 1.0
Logging metrics for step 30 and seconds elapsed 2.0
Logging metrics for step 45 and seconds elapsed 3.0
Logging metrics for step 60 and seconds elapsed 4.0
Lo

Below, we also load our performance metrics collected during evaluation of the default environment, defined as the fraction of total steps where: 
- A collision occurred
- The vehicle did not drive in the leftmost lane
- The vehicle stayed on the road
- The vehicle was a safe distance from the vehicle directly ahead of it
- The vehicle overtook a car to its left
- The vehicle accelerated/decelerated by more than 2 m/s² 

In [6]:
metrics = aggregate_and_normalize_rewards(log_filename)

if metrics:
    print("Performance metric (as percent of all steps):")
    for metric_name, avg_metric in metrics.items():
        print(f"{metric_name}: {avg_metric*100:.4f}%")

Performance metric (as percent of all steps):
collision_count: 12.4301%
right_lane_count: 71.1001%
on_road_count: 100.0000%
safe_distance_count: 91.3611%
left_vehicle_overtaken_count: 13.1137%
abrupt_accelerations_count: 23.8036%


### Custom reward function

Next, we evaluate the performance using our custom reward function.

In [3]:
# Register the custom environment
register(
    id='CustomRewardEnv',
    entry_point='HighwayEnvCustomReward:HighwayEnvFastCustomReward',
)

# Set log_performance_metrics_enabled to True or False as required
log_filename="1_custom_reward_log.csv"
log_performance_metrics_enabled=False

# Create the environment with the custom parameter
env = gymnasium.make('CustomRewardEnv', 
                     render_mode='rgb_array', 
                     log_performance_metrics_enabled=log_performance_metrics_enabled,
                     log_filename=log_filename
                     )

# Configuration updates for the environment
config_updates = {
    "safe_distance_reward": 0.1,
    "left_vehicle_overtaken_reward": -0.5,
    "collision_reward": -4,
    "smooth_driving_reward" : 0.3,
    "right_lane_reward" : 0.5
}

In [4]:
# Start training with the updated environment
train_model(
    env=env,
    config_updates=config_updates,
    session_name="1_Group15_RLProject_custom",
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -4,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'left_vehicle_overtaken_reward': -0.5,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.5,
 'safe_distance_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 5,
 'smooth_driving_reward': 0.3,
 'vehicles_count': 20,
 'vehicles_density': 1}
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to ./log

In [6]:
# Enable metric logging for the evaluation run
log_performance_metrics_enabled=True

env = gymnasium.make('CustomRewardEnv', 
                     render_mode='rgb_array', 
                     log_performance_metrics_enabled=log_performance_metrics_enabled,
                     log_filename=log_filename
                     )


# Configuration updates for the environment
config_updates = {
    "safe_distance_reward": 0.1,
    "left_vehicle_overtaken_reward": -0.5,
    "collision_reward": -4,
    "smooth_driving_reward": 0.3,
    "right_lane_reward" : 0.5,
    "simulation_frequency": 15 # smoother frame rate for animation
}

# evaluate the model with the custom reward function
evaluate_model(
    env=env,
    config_updates=config_updates,
    model_path="models/1_Group15_RLProject_custom",
    algorithm='DQN',
)

{'action': {'type': 'DiscreteMetaAction'},
 'centering_position': [0.3, 0.5],
 'collision_reward': -4,
 'controlled_vehicles': 1,
 'duration': 30,
 'ego_spacing': 1.5,
 'high_speed_reward': 0.4,
 'initial_lane_id': None,
 'lane_change_reward': 0,
 'lanes_count': 3,
 'left_vehicle_overtaken_reward': -0.5,
 'manual_control': False,
 'normalize_reward': True,
 'observation': {'type': 'Kinematics'},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [20, 30],
 'right_lane_reward': 0.5,
 'safe_distance_reward': 0.1,
 'scaling': 5.5,
 'screen_height': 150,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 15,
 'smooth_driving_reward': 0.3,
 'vehicles_count': 20,
 'vehicles_density': 1}
Logging metrics for step 15 and seconds elapsed 1.0
Logging metrics for step 30 and seconds elapsed 2.0
Loggi

Below, we also load our performance metrics collected during evaluation of the environment with our custom reward function, defined as the fraction of total steps where: 
- A collision occurred
- The vehicle did not drive in the leftmost lane
- The vehicle stayed on the road
- The vehicle was a safe distance from the vehicle directly ahead of it
- The vehicle overtook a car to its left
- The vehicle accelerated/decelerated by more than 2 m/s² (defined as our comfort limit for the driver)

In [7]:
# Aggregate the metrics logged to the CSV file during evaluation

metrics = aggregate_and_normalize_rewards(log_filename)

if metrics:
    print("Performance metric (as percent of all steps):")
    for metric_name, avg_metric in metrics.items():
        print(f"{metric_name}: {avg_metric*100:.4f}%")

Performance metric (as percent of all steps):
collision_count: 6.0549%
right_lane_count: 77.0419%
on_road_count: 100.0000%
safe_distance_count: 95.9003%
left_vehicle_overtaken_count: 9.2400%
abrupt_accelerations_count: 5.4557%


We compare the performance of the default reward function to the custom reward function on our performance metrics: 
- The percent of steps where a collision was recorded
- The percent of steps where the car was not in the leftmost lane
- The percent of steps the car was on the road
- The percent of steps the car was a safe distance from the car directly ahead of it
- The percent of steps where the car overtook a vehicle directly to the left of it
- The percent of steps where the car accelerated past the acceptable limit (2 meters per second per second)

| Metric                        | Default Reward Function (% of total steps) | Custom Reward Function (% of total steps) |
|-------------------------------|-----------------------------|-----------------------------|
| Collisions               | 12.43                    | 6.05                      |
| Right lane              | 71.10                    | 77.04                     |
| On Road                 | 100.00                   | 100.00                    |
| Safe distance           | 91.36                    | 95.90                     |
| Left vehicle overtaken  | 13.11                    | 9.24                      |
| Abrupt accelerations    | 23.80                    | 5.46                      |
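
To put the differences in the table on a common scale, the relative change from the default to the custom agent can be computed for each metric. The sketch below uses only the values from the table; the helper name `relative_change` is introduced here for illustration.

```python
# (default, custom) values in percent of total steps, copied from the table above
METRICS = {
    "Collisions": (12.43, 6.05),
    "Right lane": (71.10, 77.04),
    "On Road": (100.00, 100.00),
    "Safe distance": (91.36, 95.90),
    "Left vehicle overtaken": (13.11, 9.24),
    "Abrupt accelerations": (23.80, 5.46),
}


def relative_change(default, custom):
    """Relative change from default to custom, in percent of the default value."""
    return (custom - default) / default * 100.0


changes = {name: relative_change(d, c) for name, (d, c) in METRICS.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

By this measure, the share of steps with a collision drops by roughly half, and the share with an abrupt acceleration drops by roughly three quarters.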



## Safety performance comparison

### Collisions
The agent trained on our custom reward function has fewer collisions than the agent trained on the default reward function. During evaluation, a collision was recorded in roughly 6% of steps for the custom agent, compared to roughly 12% of steps for the default agent. This improvement is likely due to the heavier negative weight we apply to collisions (-4), compared to the weight of -1 used by the default reward function.

### Right lane usage
The agent trained on our custom reward function is better at staying out of the leftmost lane than the agent trained with the default reward function. The agent trained on the custom reward function stayed out of the leftmost lane in roughly 77% of all recorded steps, while the default agent stayed out of the leftmost lane in roughly 71% of all recorded steps. It is likely that this decrease in left lane usage is due to the higher reward weight we place on right lane usage (0.5 in the custom environment, compared to 0.1 in the default environment).


### On road 
Both the agent trained on the custom reward function and the agent trained on the default reward function stayed on the road in 100% of recorded steps.

### Safe distance from other vehicles 

The agent trained on our custom reward function is better at staying a safe distance away from cars in front of it. This agent stayed at a safe distance from other cars in roughly 96% of recorded steps, while the agent trained on the default reward function stayed a safe distance from other cars in roughly 91% of recorded steps. 

### Not overtaking vehicles to the left 
The agent trained on our custom reward function less often overtakes vehicles to its left, a maneuver which is illegal in Belgium. The default reward function includes no such penalty, and the agent trained on it overtakes cars to its left in roughly 13% of steps, compared to roughly 9% of steps for the agent trained on our custom reward function.


## Driver comfort performance comparison

### Comfortable acceleration
The agent trained on our custom reward function also performs better on our driver comfort metric, the number of abrupt accelerations beyond the comfortable limit (2 meters per second per second). In roughly 24% of recorded steps, the agent trained on the default reward function accelerated at a rate higher than the comfortable limit, while the agent trained on the custom reward function only accelerated past the comfortable limit in 5% of all recorded steps.