### **Proposed Experiment: Impact of Lane Preference Reward on Agent Behavior**

#### **Motivation & Hypothesis**  
The original reward structure incentivizes staying in the **rightmost lane**: `right_lane_reward` is +0.5 there and is linearly mapped to zero for the other lanes. We hypothesize:  
- **Stronger lane preference rewards** will make the agent prioritize staying in the right lane, even at the cost of reduced speed (due to traffic).  
- **Weaker/penalized lane preference** will encourage lane changes for overtaking but increase collision risk.  

This explores the **trade-off between lane discipline and traffic navigation efficiency**.
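
To make the "linearly mapped to zero" wording concrete, here is a minimal sketch of how the lane term of the reward could be computed. It assumes lanes are indexed from 0 to `lanes_count - 1` with the rightmost lane at the highest index. The helper name `lane_term` is hypothetical, and the environment's full reward also includes collision and speed terms that are not modeled here.

```python
def lane_term(right_lane_reward: float, lane_index: int, lanes_count: int = 4) -> float:
    """Lane component of the reward (illustrative sketch, not the library code).

    Assumption: the rightmost lane has index lanes_count - 1, so the term
    grows linearly from 0 (leftmost lane) to right_lane_reward (rightmost lane).
    """
    return right_lane_reward * lane_index / max(lanes_count - 1, 1)


if __name__ == "__main__":
    # Compare the baseline value with a penalizing value across all four lanes
    for value in (0.5, -0.5):
        print(value, [round(lane_term(value, lane), 3) for lane in range(4)])
```

With a negative `right_lane_reward`, the same linear mapping turns into a penalty that is largest in the rightmost lane, which is what the second hypothesis relies on.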

---

#### **Experimental Design**  
**Variables:**  
- **Independent Variable:** `right_lane_reward` (varied from **-0.5 to +1.0** in increments of **0.25**).  
  - Negative values penalize right-lane occupancy, positive values reward it.  
- **Dependent Variables:**  
  - Lane distribution (percentage of time spent in each lane).  
  - Collision rate and average speed.  
  - Reward stability (mean/std of episodic rewards).  
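
As a small illustration of the variables above, the sketch below generates the sweep values for `right_lane_reward` and computes the mean and standard deviation used as the reward-stability metric. The helper names `lane_reward_sweep` and `reward_stability` are hypothetical, and the episodic returns in the usage example are placeholder numbers, not results.

```python
from statistics import mean, stdev


def lane_reward_sweep(start: float = -0.5, stop: float = 1.0, step: float = 0.25) -> list:
    """Return evenly spaced right_lane_reward values from start to stop inclusive."""
    n_values = round((stop - start) / step) + 1
    return [round(start + i * step, 2) for i in range(n_values)]


def reward_stability(episode_returns: list) -> tuple:
    """Return (mean, sample standard deviation) of the episodic returns."""
    return mean(episode_returns), stdev(episode_returns)


if __name__ == "__main__":
    print(lane_reward_sweep())
    # Placeholder returns, used only to show the call
    print(reward_stability([1.0, 2.0, 3.0]))
```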

**Methodology:**  
1. **Training:**  
   - Train 7 DQN agents with `right_lane_reward` values: **[-0.5, -0.25, 0, +0.25, +0.5, +0.75, +1.0]**.  
   - Use the same hyperparameters as Task 1 (e.g., `gamma=0.8`, `learning_rate=5e-4`).  
2. **Evaluation:**  
   - Test each agent over **50 episodes** to measure:  
     - Lane occupancy (record the ego vehicle's lane index at every step).  
     - Collision rate and average speed.  
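
One way to keep everything except the lane reward identical across agents is to derive each agent's configuration from a single base dictionary. The sketch below assumes a base config shaped like the Task 1 `config_dict`; the helper name `make_sweep_configs` and the shortened base dictionary are illustrative only.

```python
def make_sweep_configs(base_config: dict, lane_rewards: list) -> dict:
    """Map each right_lane_reward value to a copy of base_config with that override.

    The copy is shallow: nested dictionaries (e.g. "observation") are shared,
    which is fine as long as only top-level keys are overridden.
    """
    return {
        reward: {**base_config, "right_lane_reward": reward}
        for reward in lane_rewards
    }


if __name__ == "__main__":
    # Shortened stand-in for the Task 1 configuration
    base = {"lanes_count": 4, "collision_reward": -1, "right_lane_reward": 0.5}
    configs = make_sweep_configs(base, [-0.5, 0, 0.5, 1.0])
    for reward, cfg in configs.items():
        print(reward, cfg)
```

Each resulting dictionary can then be passed to the environment's `configure` call before training or evaluating the corresponding agent.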

---

#### **Key Metrics & Analysis**  
1. **Lane Distribution Plot:**  
   - Show the percentage of time spent in each lane (rightmost, middle-right, middle-left, leftmost) for different rewards.  
   - **Expected trend:** Higher rewards → more time in the right lane.  

2. **Collision Rate vs. Lane Reward:**  
   - Plot collision rate as a function of `right_lane_reward`.  
   - **Prediction:** Collisions will peak for **negative rewards** (agents forced to make risky lane changes).  

3. **Speed vs. Lane Reward:**  
   - Analyze if lane preference affects speed (e.g., agents in the right lane may drive slower due to traffic).  
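
The metrics above can be computed from simple per-step and per-episode logs. The following sketch assumes that evaluation records the ego vehicle's lane index at each step and one crashed/not-crashed flag per episode. The function names and sample inputs are hypothetical, and "collision rate" is taken here to mean the fraction of episodes that end in a crash.

```python
from collections import Counter


def lane_distribution(lane_indices: list, lanes_count: int = 4) -> list:
    """Percentage of recorded steps spent in each lane, ordered by lane index."""
    total = len(lane_indices)
    if total == 0:
        return [0.0] * lanes_count
    counts = Counter(lane_indices)
    return [100.0 * counts.get(lane, 0) / total for lane in range(lanes_count)]


def collision_rate(episode_crashed: list) -> float:
    """Fraction of episodes that ended in a crash."""
    if not episode_crashed:
        return 0.0
    return sum(episode_crashed) / len(episode_crashed)


if __name__ == "__main__":
    # Made-up logs, used only to show the calls
    print(lane_distribution([3, 3, 2, 0]))
    print(collision_rate([True, False, False, True]))
```

Normalizing by the number of recorded steps, rather than by a fixed episode length, keeps the lane percentages summing to 100 even when episodes end early after a crash.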

---

#### **Why This Works**  
- **Novelty:** Focuses on **spatial decision-making** instead of speed.  
- **Ease of Implementation:** Only requires modifying the `right_lane_reward` parameter in the environment config (similar to the original notebook).  
- **Insight:** Reveals how reward shaping influences navigation strategies (e.g., "defensive driving" vs. "aggressive overtaking").  


In [5]:
import pickle

import gymnasium as gym
import highway_env  # noqa: F401

# Discrete version
ENVIRONEMNT = "highway-fast-v0"

config_dict = {
    "observation": {
        "type": "OccupancyGrid",
        "vehicles_count": 10,
        "features": ["presence", "x", "y", "vx", "vy", "cos_h", "sin_h"],
        "features_range": {
            "x": [-100, 100],
            "y": [-100, 100],
            "vx": [-20, 20],
            "vy": [-20, 20],
        },
        "grid_size": [[-20, 20], [-20, 20]],
        "grid_step": [5, 5],
        "absolute": False,
    },
    "action": {
        "type": "DiscreteMetaAction",
    },
    "lanes_count": 4,
    "vehicles_count": 15,
    "duration": 60,  # [s]
    "initial_spacing": 0,
    "collision_reward": -1,  # The reward received when colliding with a vehicle.
    "right_lane_reward": 0.5,  # The reward received when driving on the right-most lanes, linearly mapped to
    # zero for other lanes.
    "high_speed_reward": 0.1,  # The reward received when driving at full speed, linearly mapped to zero for
    # lower speeds according to config["reward_speed_range"].
    "lane_change_reward": 0,
    "reward_speed_range": [
        20,
        30,
    ],  # [m/s] The reward for high speed is mapped linearly from this range to [0, HighwayEnv.HIGH_SPEED_REWARD].
    "simulation_frequency": 5,  # [Hz]
    "policy_frequency": 1,  # [Hz]
    "other_vehicles_type": "highway_env.vehicle.behavior.IDMVehicle",
    "screen_width": 600,  # [px]
    "screen_height": 150,  # [px]
    "centering_position": [0.3, 0.5],
    "scaling": 5.5,
    "show_trajectories": True,
    "render_agent": True,
    "offscreen_rendering": False,
    "disable_collision_checks": True,
}

if __name__ == "__main__":
    # Save the config_dict to a pickle file
    with open("task1_config.pkl", "wb") as f:
        pickle.dump(config_dict, f)

    env = gym.make(ENVIRONEMNT, render_mode="rgb_array")
    env.unwrapped.configure(config_dict)


In [None]:
# %% [markdown]
# # Task 4: Lane Preference Reward Experiment (Condensed)

# %%
# Import libraries
import gymnasium as gym
import highway_env
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from stable_baselines3 import DQN

# %%
# Reduced experiment configuration
LANE_REWARDS = [-0.5, 0, 0.5, 1.0]  # 4 key values
N_EPISODES_EVAL = 30  # Reduced evaluation episodes
TRAIN_STEPS = 15_000  # Shorter training

# %%
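# Training loop: build a fresh environment per reward value so the override takes effect
for reward in LANE_REWARDS:
    print(f"Training with right_lane_reward={reward}")

    # Apply the Task 1 configuration, overriding only the lane reward
    env = gym.make("highway-fast-v0")
    env.unwrapped.configure({**config_dict, "right_lane_reward": reward})
    env.reset()  # rebuild the observation/action spaces from the new config

    model = DQN(
        'MlpPolicy',
        env,
        policy_kwargs=dict(net_arch=[128, 128]),  # Smaller network
        learning_rate=1e-4,
        buffer_size=1_000_000,
        batch_size=32,
        gamma=0.95,
        target_update_interval=500,
        verbose=1,
        tensorboard_log="lane_reward_experiment/"
    )

    model.learn(total_timesteps=TRAIN_STEPS)
    model.save(f"lane_reward_experiment/model_{reward}")
    env.close()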

# %%
# Evaluation
results = []
for reward in LANE_REWARDS:
    print(f"Evaluating right_lane_reward={reward}")

    # Create the environment once per reward value and apply the Task 1
    # configuration (including observation parameters), overriding only the lane reward
    env = gym.make("highway-fast-v0")
    env.unwrapped.configure({**config_dict, "right_lane_reward": reward})

    model = DQN.load(f"lane_reward_experiment/model_{reward}")
    crash_count = 0
    lane_time = np.zeros(config_dict["lanes_count"])
    speeds = []

    for _ in range(N_EPISODES_EVAL):
        obs, _ = env.reset()  # reset applies the configuration set above
        done = False

        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, _, terminated, truncated, info = env.step(action)
            done = terminated or truncated  # stop on a crash or at the time limit

            # Track lane position: lane_index is a (from, to, lane id) tuple
            lane = env.unwrapped.vehicle.lane_index[2]
            if lane < len(lane_time):
                lane_time[lane] += 1

            # Track speed from the info dict; obs is an occupancy grid,
            # so indexing it does not yield the ego speed
            speeds.append(info["speed"])

            # A crash ends the episode, so this fires at most once per episode
            if info.get('crashed', False):
                crash_count += 1

    env.close()

    # Calculate metrics over the steps actually taken (episodes can end early)
    total_steps = lane_time.sum()
    results.append({
        'Reward': reward,
        'Lane 0 (%)': (lane_time[0]/total_steps)*100,
        'Lane 1 (%)': (lane_time[1]/total_steps)*100,
        'Lane 2 (%)': (lane_time[2]/total_steps)*100,
        'Lane 3 (%)': (lane_time[3]/total_steps)*100,
        'Crash Rate': crash_count / N_EPISODES_EVAL,  # fraction of episodes ending in a crash
        'Avg Speed': np.mean(speeds)
    })

# %%
# Create DataFrame
df = pd.DataFrame(results)

# %%
# Unique visualizations
plt.figure(figsize=(14, 8))

# Stacked bar chart for lane distribution
ax_lanes = plt.subplot(2, 2, 1)
df[['Lane 0 (%)', 'Lane 1 (%)', 'Lane 2 (%)', 'Lane 3 (%)']].plot.bar(
    stacked=True,
    colormap='viridis',
    width=0.8,
    ax=ax_lanes,  # draw into this subplot instead of opening a new figure
)
plt.title('Lane Occupancy Distribution')
plt.xlabel('Reward Value')
plt.xticks(ticks=range(4), labels=LANE_REWARDS)
plt.ylabel('Time Spent (%)')

# Crash rate per reward value
# (the overlaid histogram is dropped: its count axis conflicts with the Reward axis)
plt.subplot(2, 2, 2)
sns.scatterplot(data=df, x='Reward', y='Crash Rate', s=100, hue='Reward')
plt.title('Collision Risk Analysis')

# Speed vs Crash rate matrix
plt.subplot(2, 2, 3)
sns.heatmap(
    df.pivot_table(index='Reward', values=['Avg Speed', 'Crash Rate']),
    annot=True,
    cmap='coolwarm'
)
plt.title('Speed-Safety Matrix')

# Reward impact radar chart (metrics averaged over all reward values)
categories = ['Lane 0 (%)', 'Crash Rate', 'Avg Speed']
angles = np.linspace(0, 2*np.pi, len(categories), endpoint=False).tolist()
values = df[categories].mean().values.flatten().tolist()
# Repeat the first point so the polygon closes and both lists have equal length
angles += angles[:1]
values += values[:1]

ax = plt.subplot(2, 2, 4, polar=True)
ax.fill(angles, values, color='blue', alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
plt.title('Overall Behavior Profile')

plt.tight_layout()
plt.show()

Training with right_lane_reward=-0.5
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




Logging to lane_reward_experiment/DQN_1
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 17       |
|    ep_rew_mean      | 12.3     |
|    exploration_rate | 0.957    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 24       |
|    time_elapsed     | 2        |
|    total_timesteps  | 68       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 12.9     |
|    ep_rew_mean      | 9.15     |
|    exploration_rate | 0.935    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 24       |
|    time_elapsed     | 4        |
|    total_timesteps  | 103      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 14.7     |
|    ep_rew_mean      | 11.3     |
|    exploration_rate | 0.889    |
| time/        



Logging to lane_reward_experiment/DQN_2
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 8.75     |
|    ep_rew_mean      | 7.39     |
|    exploration_rate | 0.978    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 21       |
|    time_elapsed     | 1        |
|    total_timesteps  | 35       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 10.9     |
|    ep_rew_mean      | 8.33     |
|    exploration_rate | 0.945    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 25       |
|    time_elapsed     | 3        |
|    total_timesteps  | 87       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 11.2     |
|    ep_rew_mean      | 8.71     |
|    exploration_rate | 0.915    |
| time/        



----------------------------------
| rollout/            |          |
|    ep_len_mean      | 15.2     |
|    ep_rew_mean      | 12.1     |
|    exploration_rate | 0.961    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 17       |
|    time_elapsed     | 3        |
|    total_timesteps  | 61       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 12.6     |
|    ep_rew_mean      | 9.68     |
|    exploration_rate | 0.936    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 18       |
|    time_elapsed     | 5        |
|    total_timesteps  | 101      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 17.1     |
|    ep_rew_mean      | 13.3     |
|    exploration_rate | 0.87     |
| time/               |          |
|    episodes       