<font size="8">Reinforcement Learning: Project - Highway-Environment</font>

Group: `el_grupo_87`

Group members:
1. `André Moreira Lopes : 20230570`
2. `Luís Queiroz : 20230584`
3. `André Filipe Silva : 20230972`
4. `Pedro Cerejeira : 20230442`
5. `João Gonçalves : 20230560`

# Reinforcement Learning Final Project 

Welcome to your Reinforcement Learning project! Work in groups of up to 5 students on a project focused on developing an RL agent capable of solving an environment for decision-making in Autonomous Driving. The project deadline has been set to the 2nd of June.

Autonomous Driving has long been considered a field in which RL algorithms excel, and this project aims to leverage the power of RL to create an intelligent agent that can solve the Farama Foundation's "highway-env" project, namely the Highway environment (refer to https://highway-env.farama.org/environments/highway/).

## Project Requirements:

* The environment's observation format can vary according to your preference: Kinematics, Grayscale Image, Occupancy Grid or Time to Collision (refer to https://highway-env.farama.org/observations/). Your solutions should use 2 of these types.
* The agent's actions can also vary: continuous actions, discrete actions or discrete meta-actions (refer to https://highway-env.farama.org/actions/). Your solutions should use 2 of these types.
* As for the algorithms, any algorithm is valid (seen in class or not), with a minimum requirement of 3 different algorithms used.
* Apart from the environment observation types and agent action types, you must use the environment configuration provided in the annexed notebook!

Note: Your delivery should comprise 4 solutions to the highway environment (corresponding to the combinations of the two environment observation types and the two agent action types). You only need one algorithm for each combination, as long as at least 3 different algorithms are used overall.
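The four required solutions are the Cartesian product of two observation types and two action types. A minimal sketch of that enumeration follows; the specific types listed are the ones used later in this notebook, and `build_combinations` is a hypothetical helper, not part of highway-env.

```python
from itertools import product

# The two observation types and two action types chosen in this notebook.
OBSERVATION_TYPES = ["Kinematics", "GrayscaleObservation"]
ACTION_TYPES = ["DiscreteMetaAction", "ContinuousAction"]


def build_combinations(observations, actions):
    """Return one config fragment per (observation, action) pair."""
    return [
        {"observation": {"type": obs}, "action": {"type": act}}
        for obs, act in product(observations, actions)
    ]


combinations = build_combinations(OBSERVATION_TYPES, ACTION_TYPES)
for combo in combinations:
    print(combo["observation"]["type"], "+", combo["action"]["type"])
```

Each fragment has the same shape as the `obs_action_type` argument accepted by `config_generator`.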


## Project Objectives:

* Train an RL agent to solve the Highway environment: The primary objective of this project is to develop an RL agent that maximizes the reward given by the highway environment (refer to https://highway-env.farama.org/rewards/), which encourages high speed while penalizing collisions.
* Optimize decision-making using RL algorithms: Explore different RL algorithms to train the agent. Compare and analyse their effectiveness in learning and decision-making capabilities in the context of the environment.
* Explore and expand on the reward system: Although you should evaluate your agent with the reward function provided by the environment, you could/should expand it to better train your agent.
* Enhance interpretability and analysis: Develop methods to analyse the agent's decision-making process and provide insights into its strategic thinking. Investigate techniques to visualize the agent's evaluation of driving situations and understand its reasoning behind specific actions.



### Extra Objectives:

* Investigate transfer learning and generalization: Explore techniques for transfer learning to leverage knowledge acquired in related domains or from pre-training on large driving datasets. Investigate the agent's ability to generalize its knowledge.
* Explore multi agent approaches: The environment allows you to use more than one agent per episode. Explore multi agent alternatives to improve your learning times and overall benchmarks.


## Imports Required

You might need to restart the kernel after installation

In [17]:
import time
import numpy as np
import pandas as pd
import gymnasium as gym
from copy import deepcopy
import tensorflow as tf
import torch

from typing import List, Dict, Optional, Any

from stable_baselines3 import DQN, PPO, SAC
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv, DummyVecEnv

In [18]:
print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))
print(tf.config.list_physical_devices('GPU'))

print('GPU Available:', torch.cuda.is_available())

Num GPUs Available:  1
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
GPU Available: True


## Environment Configuration

Apart from the environment observation types and agent action types, you must use some of the environment's configurations provided below!

In [19]:
def config_generator(
        obs_action_type: Optional[Dict[str, Dict[str, str]]] = None, 
        reward_params: Optional[Dict[str, Any]] = None, 
        policy_frequency: Optional[Dict[str, int]] = None
) -> Dict[str, Any]:
    """
    Generate a configuration dictionary for a highway environment with customizable observation, action, and reward parameters.

    Args:
        obs_action_type (dict, optional): Dictionary specifying observation and action types.
            Default is None, which uses:
            {
                "observation": {
                    "type": "Kinematics"
                },
                "action": {
                    "type": "DiscreteMetaAction"
                }
            }
        reward_params (dict, optional): Dictionary specifying reward-related parameters.
            Default is None, which uses:
            {
                'collision_reward': -1,
                'reward_speed_range': [20, 30],
                'simulation_frequency': 15,
            }
        policy_frequency (dict, optional): Dictionary specifying the policy frequency parameter.
            Default is None, which uses:
            {
                'policy_frequency': 1
            }

    Returns:
        dict: A dictionary containing the configured parameters for the highway environment.
        
    Configuration Parameters:
    - lanes_count (int): Number of lanes in the highway (fixed to 10).
    - vehicles_count (int): Number of other vehicles in the environment (fixed to 50).
    - duration (int): Duration of the environment in seconds (fixed to 120 seconds).
    - other_vehicles_type (str): Policy type for other vehicles in the environment.
    - initial_spacing (int): Initial spacing between vehicles.
    - screen_width (int): Width of the visualization screen in pixels.
    - screen_height (int): Height of the visualization screen in pixels.
    - centering_position (list): Centering position of the visualization.
    - scaling (int): Scaling factor for visualization.
    - show_trajectories (bool): Flag to show trajectories in visualization.
    - render_agent (bool): Flag to render the agent in visualization.
    - offscreen_rendering (bool): Flag to enable offscreen rendering.
    - offroad_terminal (bool): Flag to enable terminal state when the ego vehicle goes off the road.

    See Also:
        - Observation space types: https://highway-env.readthedocs.io/en/latest/user_guide/observation_spaces.html
        - Action space types: https://highway-env.readthedocs.io/en/latest/user_guide/action_spaces.html
    """
    # Default observation and action types
    if obs_action_type is None:
        obs_action_type = {
            'observation': {
                'type': 'Kinematics'
            },
            'action': {
                'type': 'DiscreteMetaAction'
            }
        }

    # Default reward-related parameters
    if reward_params is None:
        reward_params = {
            'collision_reward': -1,
            'reward_speed_range': [20, 30],
            'simulation_frequency': 15,
        }
    
    # Default policy frequency parameter
    if policy_frequency is None:
        policy_frequency = {'policy_frequency': 1}

    # Base configuration parameters
    configuration = {
        # Parameters below cannot be changed
        'lanes_count': 10,  # The environment must always have 10 lanes
        'vehicles_count': 50,  # The environment must always have 50 other vehicles
        'duration': 120,  # [s] The episode duration must never be less than 120 seconds
        'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',  # This is the policy of the other vehicles
        'initial_spacing': 2,  # Initial spacing between vehicles needs to be at most 2

        # Refer to https://highway-env.farama.org/observations/ to change observation space type
        'observation': {
            'type': 'Kinematics'
        },

        # Refer to https://highway-env.farama.org/actions/ to change action space type
        'action': {
            'type': 'DiscreteMetaAction',
        },

        # Parameters below can be changed (as it refers mostly to the reward system)
        'collision_reward': -1,  # The reward received when colliding with a vehicle. (Can be changed)
        'reward_speed_range': [20, 30],  # [m/s] The reward for high speed is mapped linearly from this range to [0, HighwayEnv.HIGH_SPEED_REWARD]. (Can be changed)
        'simulation_frequency': 15,  # [Hz] (Can be changed)
        'policy_frequency': 1,  # [Hz] (Can be changed)
        
        # Parameters defined below are purely for visualization purposes! You can alter them as you please
        'screen_width': 800,  # [px]
        'screen_height': 600,  # [px]
        'centering_position': [0.5, 0.5],
        'scaling': 5,
        'show_trajectories': False,
        'render_agent': True,
        'offscreen_rendering': False,
        
        # Auxiliary Parameters
        'slower_than_others_penalty': False,
        'lane_centering_cost': 1
    }

    # Update configuration with observation and action parameters
    configuration.update(obs_action_type)
    
    # Update configuration with reward parameters
    configuration.update(reward_params)
    
    # Update configuration with policy frequency
    configuration.update(policy_frequency)

    return configuration
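The comment on `reward_speed_range` says the high-speed reward is mapped linearly from that range. A minimal sketch of such a mapping is given below. It assumes the result is clipped to [0, 1] before being weighted, which is how I understand highway-env to behave; `scaled_speed` is a hypothetical name used only for illustration.

```python
def scaled_speed(speed, reward_speed_range=(20.0, 30.0)):
    """Map a speed in m/s linearly onto [0, 1] over reward_speed_range.

    Speeds below the range give 0 and speeds above it give 1
    (the clipping is an assumption of this sketch).
    """
    low, high = reward_speed_range
    fraction = (speed - low) / (high - low)
    return min(1.0, max(0.0, fraction))


for speed in (10.0, 25.0, 40.0):
    print(speed, scaled_speed(speed))
```

Widening the range, for example to `[10, 40]`, makes the speed term grow more slowly with speed but start rewarding the agent earlier.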

In [20]:
def create_env(id: str = 'highway-v0', render_mode: Optional[str] = None, config: Optional[Dict[str, Any]] = None) -> gym.Env:
    """
    Create a gym environment with the given ID, render mode, and configuration.

    Args:
        id (str): The ID of the gym environment to create. Default is 'highway-v0'.
        render_mode (str, optional): The render mode for the environment. Default is None.
        config (dict, optional): Configuration dictionary for the environment. Default is None, 
                                 which uses the default configuration from config_generator().

    Returns:
        gym.Env: The created gym environment.
    """
    if config is None:
        config = config_generator()
    return gym.make(id=id, render_mode=render_mode, config=config)

In [21]:
def record_step_data(step_rewards: List[float], actions: List, policy_frequency: int) -> pd.DataFrame:
    """
    Record step/action data into a DataFrame.
    
    Args:
        step_rewards (List[float]): List of rewards for each step/action.
        actions (List): List of actions taken.
        policy_frequency (int): Policy frequency for recording actions.
    
    Returns:
        pd.DataFrame: DataFrame containing the rewards and actions for each step.
    """
    step_data = {
        'reward': [],
        'action': []
    }

    # Process each set of steps based on the policy frequency
    for i in range(0, len(step_rewards), policy_frequency):
        # Average the rewards for the current set of steps
        avg_reward = np.mean(step_rewards[i:i + policy_frequency])
        # List of actions taken in the current set of steps
        action_list = actions[i:i + policy_frequency]
        
        step_data['reward'].append(avg_reward)
        step_data['action'].append(action_list)
    
    return pd.DataFrame(step_data)
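The grouping performed by `record_step_data` can be illustrated without pandas. The sketch below uses a hypothetical helper, `chunked_means`, and made-up rewards to show that consecutive blocks of `policy_frequency` steps are averaged and that the last block may be shorter.

```python
from statistics import mean


def chunked_means(values, chunk_size):
    """Average consecutive chunks of chunk_size items; the last chunk may be shorter."""
    return [
        mean(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]


# Made-up step rewards, grouped as if policy_frequency were 2.
rewards = [1.0, 0.5, 0.0, 1.0, 0.5]
print(chunked_means(rewards, 2))
```

With five rewards and a chunk size of 2, the final chunk holds a single step, so its mean is that step's reward.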

In [22]:
def evaluate_model(model, environment, model_name, max_episodes=10) -> pd.DataFrame:
    """
    Evaluate the given RL model in the specified environment and output a .xlsx file with the results.

    Args:
        model (BaseAlgorithm): The RL model to be evaluated.
        environment (gym.Env): The environment in which to evaluate the model.
        model_name (str): The name of the model passed.
        max_episodes (int): Number of episodes to evaluate the model on. 
        
    Returns:
        pd.DataFrame: DataFrame containing the evaluation results per episode.
    """
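    # Read the config from the unwrapped environment, since gym.make() returns wrappers around it
    env_config = environment.unwrapped.config
    policy_frequency = env_config['policy_frequency']
    output_filename = f'results/{env_config["action"]["type"]}_{env_config["observation"]["type"]}_{model_name}.xlsx'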

    obs = environment.reset()
    episodes_data = []
    step_data_frames = []
    
    for episode in range(max_episodes):
        start_time = time.time()
        episode_returns = 0.0
        step_rewards = []
        actions = []

        while True:
            if isinstance(obs, tuple):
                obs = obs[0]

            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, truncated, _ = environment.step(action)
            
            step_rewards.append(reward)
            actions.append(action)
            episode_returns += reward

            if done or truncated:
                end_time = time.time()
                episode_duration = end_time - start_time
                normalized_return = episode_returns / policy_frequency
                
                episodes_data.append({
                    'average_standardized_reward': normalized_return,
                    'episode_duration': episode_duration,
                    'returns_list': (normalized_return, truncated)
                })
                
                # Record step data
                step_data_df = record_step_data(step_rewards, actions, policy_frequency)
                step_data_frames.append((f'episode_{episode + 1}_steps', step_data_df))
                
                obs = environment.reset()
                break

    # Convert episodes data to DataFrame
    episodes_df = pd.DataFrame(episodes_data)

    # Write to Excel with multiple sheets
    with pd.ExcelWriter(output_filename) as writer:
        episodes_df.to_excel(writer, sheet_name='overall_evaluation', index=False)
        for sheet_name, df in step_data_frames:
            df.to_excel(writer, sheet_name=sheet_name, index=False)

    return episodes_df
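`evaluate_model` divides the episode return by `policy_frequency`. One way to read this is sketched below, under the assumption that highway-env advances episode time by `1 / policy_frequency` seconds per call to `step()`, so an episode of fixed duration contains `duration * policy_frequency` steps. The helper names are hypothetical.

```python
def steps_per_episode(duration_s, policy_frequency):
    """Number of agent decisions in a full-length episode (assumed relation)."""
    return duration_s * policy_frequency


def normalized_return(episode_return, policy_frequency):
    """Return per simulated second, as computed in evaluate_model."""
    return episode_return / policy_frequency


# An idealised agent that earns a reward of 1.0 on every step of a 120 s episode.
for pf in (1, 8):
    steps = steps_per_episode(120, pf)
    print(pf, steps, normalized_return(1.0 * steps, pf))
```

Under this assumption the normalized return is bounded by the episode duration when per-step rewards are at most 1, which makes runs with different policy frequencies comparable.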

##### Registering Custom Environment

In [23]:
gym.envs.register(id='CustomHighwayEnv-v0', entry_point='highway_env_custom:CustomHighwayEnv')

  logger.warn(f"Overriding environment {new_spec.id} already in registry.")


##### Proving it is just an extension of the base environment

In [24]:
highway = create_env(id='highway-v0', render_mode=None , config=None)
custom_highway = create_env(id='CustomHighwayEnv-v0', render_mode=None, config=None)



In [25]:
print(f'Highway Observation Space: {highway.observation_space.shape} \n Highway Action Space:{highway.action_space.shape} \n Type: {type(highway)}')
print('')
print(f'Custom Highway Observation Space: {custom_highway.observation_space.shape} \n Custom Highway Action Space:{custom_highway.action_space.shape} \n Type:{type(custom_highway)}')

Highway Observation Space: (5, 5) 
 Highway Action Space:() 
 Type: <class 'gymnasium.wrappers.order_enforcing.OrderEnforcing'>

Custom Highway Observation Space: (5, 5) 
 Custom Highway Action Space:() 
 Type:<class 'gymnasium.wrappers.order_enforcing.OrderEnforcing'>


## Solution 0 - Example Solution


Environment Observation Type: **Kinematics** \
Agent Action Type: **DiscreteMetaAction** \
Algorithm Used: **Random**

Example of the environment's usage using a random policy.

In [26]:
# env = gym.make('highway-v0', render_mode='human', config=config_generator())
# 
# obs, info = env.reset(seed=42)
# done = truncated = False
# 
# Return = 0
# n_steps = 1
# Episode = 0
# while not (done or truncated):
#   # Dispatch the observations to the model to get the tuple of actions
#   action = env.action_space.sample()
#   # Execute the actions
#   next_obs, reward, done, truncated, info = env.step(action)
#   Return += reward
# 
#   print('Episode: {}, Step: {}, Return: {}'.format(Episode, n_steps, round(Return,2)))
#   n_steps+=1
# env.close()

## Solution 1 - DiscreteMeta_Kinematics
Environment Observation Type: **Kinematics** \
Agent Action Type: **DiscreteMetaAction** \
Algorithm Used: **DQN**

#### Configs

In [27]:
discrete_meta_kinematics_1 = config_generator()
discrete_meta_kinematics_2 = config_generator(reward_params={'collision_reward': -4})
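Because `config_generator` applies overrides with `dict.update`, only top-level keys are replaced. The sketch below uses a small stand-in dictionary (the `vehicles_count` entry is hypothetical and added only to make the effect visible) to show that overriding one reward key leaves the others intact, while overriding a nested key such as `observation` replaces the whole nested dictionary.

```python
# Stand-in for the base configuration; not the full dictionary from config_generator.
base = {
    "collision_reward": -1,
    "reward_speed_range": [20, 30],
    "observation": {"type": "Kinematics", "vehicles_count": 5},
}


def merged(base_config, overrides):
    """Return a copy of base_config with top-level keys replaced by overrides."""
    result = dict(base_config)
    result.update(overrides)
    return result


reward_only = merged(base, {"collision_reward": -4})
new_observation = merged(base, {"observation": {"type": "GrayscaleObservation"}})

print(reward_only["reward_speed_range"])  # untouched by the override
print(new_observation["observation"])     # nested dict replaced as a whole
```

In practice this means a custom `observation` or `action` entry has to list every nested option it needs, since nothing is inherited from the default nested dictionary.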

#### Model 1

##### (training)

In [28]:
# env_dmk = create_env(id='highway-v0', render_mode=None, config=discrete_meta_kinematics_1)

In [29]:
# model = DQN('MlpPolicy',
#             env_dmk,
#             learning_rate=5e-4,
#             buffer_size=500000,
#             learning_starts=200,
#             batch_size=32,
#             gamma=0.8,
#             train_freq=1,
#             gradient_steps=1,
#             target_update_interval=50,
#             exploration_fraction=0.7,
#             verbose=1,
#             tensorboard_log='highway_dqn/')
# 
# model.learn(total_timesteps=int(20000))
# 
# model.save('models/DiscreteMeta_Kinematics/dqn_1')
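With `exploration_fraction=0.7` and 20,000 training timesteps, the exploration rate decays over the first 14,000 steps. The sketch below approximates that linear schedule. It assumes Stable-Baselines3's default start and end values for epsilon (1.0 and 0.05), which the training cell does not override; `linear_epsilon` is a hypothetical helper rather than the library's own function.

```python
def linear_epsilon(step, total_timesteps=20_000, exploration_fraction=0.7,
                   initial_eps=1.0, final_eps=0.05):
    """Epsilon after `step` timesteps under a linear decay schedule."""
    decay_steps = exploration_fraction * total_timesteps
    progress = min(1.0, step / decay_steps)
    return initial_eps + progress * (final_eps - initial_eps)


for step in (0, 7_000, 14_000, 20_000):
    print(step, round(linear_epsilon(step), 3))
```

Under these assumptions the agent acts greedily about 95% of the time only during the last 6,000 training steps.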

##### (evaluation)

In [30]:
dqn_1_dmk = DQN.load('models/DiscreteMeta_Kinematics/dqn_1')

Exception: Can't get attribute '_function_setstate' on <module 'cloudpickle.cloudpickle' from 'C:\\Users\\pedro\\anaconda3\\envs\\py310\\lib\\site-packages\\cloudpickle\\cloudpickle.py'>


In [31]:
env_dmk_1 = create_env(id='highway-v0', render_mode='human', config=config_generator())



In [32]:
results_dmk_1 = evaluate_model(dqn_1_dmk, env_dmk_1, 'DQN_1')

In [33]:
results_dmk_1

| episode | average_standardized_reward | episode_duration | returns_list |
|---|---|---|---|
| 0 | 28.640696 | 12.721871 | (28.640696452919983, False) |
| 1 | 23.530443 | 9.676084 | (23.53044325085273, False) |
| 2 | 3.993524 | 1.837489 | (3.9935242319006594, False) |
| 3 | 93.313199 | 39.018213 | (93.31319859856248, True) |
| 4 | 2.780063 | 1.314662 | (2.780063215891057, False) |
| 5 | 37.136821 | 13.277817 | (37.13682120528132, False) |
| 6 | 93.820565 | 38.929678 | (93.82056474095737, True) |
| 7 | 34.897683 | 13.410062 | (34.89768278627791, False) |
| 8 | 18.914938 | 7.633412 | (18.91493828384177, False) |
| 9 | 17.489109 | 7.945916 | (17.489109097002785, False) |
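Ten episodes with such a wide spread are easier to compare across models once summarised. The sketch below condenses the DQN_1 results listed above, using the rounded returns from the table; `summarize` is a hypothetical helper, and an episode is counted as completed when its `truncated` flag is `True`, meaning the time limit was reached.

```python
from statistics import mean, median

# (normalized return, truncated) per episode, copied from the DQN_1 results above.
dqn_1_episodes = [
    (28.640696, False), (23.530443, False), (3.993524, False),
    (93.313199, True), (2.780063, False), (37.136821, False),
    (93.820565, True), (34.897683, False), (18.914938, False),
    (17.489109, False),
]


def summarize(results):
    """Mean and median return, plus the number of episodes that reached the time limit."""
    returns = [episode_return for episode_return, _ in results]
    return {
        "mean_return": mean(returns),
        "median_return": median(returns),
        "completed": sum(1 for _, truncated in results if truncated),
    }


summary = summarize(dqn_1_episodes)
print(summary)
```

The median is far below the mean here because two long episodes dominate the average, so reporting both gives a fairer picture of a typical episode.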


#### Model 2

##### (training)

In [34]:
# env_dmk = create_env(id='highway-v0', render_mode=None, config=discrete_meta_kinematics_2)

In [35]:
# model = DQN('MlpPolicy',
#             env_dmk,
#             learning_rate=5e-4,
#             buffer_size=500000,
#             learning_starts=200,
#             batch_size=32,
#             gamma=0.8,
#             train_freq=1,
#             gradient_steps=1,
#             target_update_interval=50,
#             exploration_fraction=0.7,
#             verbose=1,
#             tensorboard_log='highway_dqn/')
# 
# model.learn(total_timesteps=int(20000))
# 
# model.save('models/DiscreteMeta_Kinematics/dqn_2')

##### (evaluation)

In [36]:
dqn_2_dmk = DQN.load('models/DiscreteMeta_Kinematics/dqn_2')

Exception: code expected at most 16 arguments, got 18


In [37]:
env_dmk_1 = create_env(id='highway-v0', render_mode='human', config=config_generator())



In [38]:
results_dmk_2 = evaluate_model(dqn_2_dmk, env_dmk_1, 'DQN_2')

In [39]:
results_dmk_2

| episode | average_standardized_reward | episode_duration | returns_list |
|---|---|---|---|
| 0 | 85.002782 | 38.857615 | (85.00278162863779, True) |
| 1 | 86.42731 | 39.010568 | (86.42730958013301, True) |
| 2 | 6.313688 | 2.910963 | (6.313687647808391, False) |
| 3 | 20.187708 | 8.968179 | (20.187708382951605, False) |
| 4 | 54.121291 | 23.078587 | (54.12129074028867, False) |
| 5 | 1.60416 | 0.973331 | (1.6041598054128516, False) |
| 6 | 82.867849 | 38.165244 | (82.86784876869774, True) |
| 7 | 91.847244 | 38.1163 | (91.84724401045355, True) |
| 8 | 48.788388 | 21.970414 | (48.788387544542644, False) |
| 9 | 85.907504 | 39.841136 | (85.90750365522582, True) |


## Solution 2 - Continuous_Kinematics
Environment Observation Type: **Kinematics** \
Agent Action Type: **Continuous Actions** \
Algorithm Used: **PPO**

#### Config

In [40]:
continuous_kinematics = config_generator(
    obs_action_type={
        'observation': {
            'type': 'Kinematics',
            'vehicles_count': 5,
            'features': [
                'presence',
                'x',
                'y',
                'vx',
                'vy',
                'cos_h',
                'sin_h'
            ],
            'absolute': False
        },
        'action': {
            'type': 'ContinuousAction',
        },     
    },
    reward_params={
        'collision_reward': -5, 
        'reward_speed_range': [10, 40],
        'simulation_frequency': 15,
        'speed_reward': 5, 
        'lane_change_penalty': -0.2,
        'offroad_terminal': True,
        'lane_centering_cost': 2,
        'lane_centering_reward': 0.2,
        'action_reward': -0.3,
        'slower_than_others_penalty': True # if True then reward is negative for speeds lower than 25m/s
    },
    policy_frequency={'policy_frequency': 8},
)

continuous_kinematics_2 = deepcopy(continuous_kinematics)

continuous_kinematics_2['collision_reward'] = -10
continuous_kinematics_2['reward_speed_range'] = [15, 35]
continuous_kinematics_2['action_reward'] = -0.01
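The second configuration is created with `deepcopy` rather than a plain copy. The distinction matters for nested entries such as `observation`, as the following sketch with a small stand-in dictionary shows (the values are illustrative only).

```python
from copy import copy, deepcopy

# Stand-in config with one nested entry; values are illustrative.
base = {
    "collision_reward": -5,
    "observation": {"type": "Kinematics", "vehicles_count": 5},
}

shallow = copy(base)   # nested dicts are shared with base
deep = deepcopy(base)  # nested dicts are duplicated

shallow["observation"]["vehicles_count"] = 10  # also changes base
deep["observation"]["vehicles_count"] = 15     # base is unaffected

print(base["observation"]["vehicles_count"])
print(deep["observation"]["vehicles_count"])
```

Top-level assignments, like the three above, would be safe with either kind of copy; `deepcopy` protects against later edits to nested settings leaking into the first configuration.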

#### Model 1

##### (training)

In [41]:
# vec_env_ck = make_vec_env(
#     env_id='CustomHighwayEnv-v0',
#     n_envs=5, # Number of parallel environments run on different CPU cores
#     seed=0,
#     vec_env_cls=SubprocVecEnv,
#     env_kwargs={'config':continuous_kinematics, 'render_mode':None}
# )

In [42]:
# # Policy network architecture
# policy_kwargs = dict(
#     net_arch=[256, 256]  # network architecture
# )
# 
# # Create the PPO model
# model = PPO(
#     'MlpPolicy',
#     vec_env_ck,
#     n_steps=2048,  # Increased to gather more experience per update
#     batch_size=256,  # Increased batch size
#     learning_rate=3e-4,  # Adjusted learning rate for stability
#     ent_coef=0.01,  # Higher entropy coefficient to encourage exploration
#     clip_range=0.2,  # Clipping range for PPO
#     vf_coef=0.5,  # Value function coefficient
#     max_grad_norm=0.5,  # Gradient clipping
#     policy_kwargs=policy_kwargs,
#     verbose=2
# )
# 
# # Train the model
# model.learn(total_timesteps=100_000)  # Number of timesteps for training
# model.save('models/Continuous_Kinematics/ppo_1')
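The PPO settings above determine how much data each update sees. The sketch below works through the arithmetic for `n_steps=2048`, 5 parallel environments, `batch_size=256` and 100,000 timesteps. It assumes Stable-Baselines3's default of 10 optimisation epochs per rollout, which the training cell does not override, and that training stops after the first rollout that reaches the timestep budget; `ppo_rollout_stats` is a hypothetical helper.

```python
import math


def ppo_rollout_stats(n_steps, n_envs, batch_size, total_timesteps, n_epochs=10):
    """Derive rollout and update counts from PPO hyperparameters."""
    rollout_size = n_steps * n_envs
    minibatches = rollout_size // batch_size
    iterations = math.ceil(total_timesteps / rollout_size)
    return {
        "rollout_size": rollout_size,
        "minibatches_per_epoch": minibatches,
        "iterations": iterations,
        "timesteps_collected": iterations * rollout_size,
        "gradient_steps": iterations * n_epochs * minibatches,
    }


stats = ppo_rollout_stats(n_steps=2048, n_envs=5, batch_size=256, total_timesteps=100_000)
print(stats)
```

Under these assumptions the policy is updated from only ten rollouts, which may help explain why a much longer training run behaves differently.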

##### (evaluation)

In [43]:
ppo_1_ck = PPO.load('models/Continuous_Kinematics/ppo_1')

Exception: code expected at most 16 arguments, got 18


In [44]:
env_ck_1 = create_env(id='CustomHighwayEnv-v0', render_mode='human', config=continuous_kinematics)



In [45]:
results_ck_1 = evaluate_model(ppo_1_ck, env_ck_1, 'PPO_1')

In [46]:
results_ck_1

| episode | average_standardized_reward | episode_duration | returns_list |
|---|---|---|---|
| 0 | 0.111854 | 0.433159 | (0.11185377399936493, False) |
| 1 | 2.43482 | 1.388849 | (2.4348201719394678, False) |
| 2 | 10.328912 | 3.400883 | (10.328912310549395, False) |
| 3 | 4.121161 | 1.809552 | (4.121160747500981, False) |
| 4 | 4.570697 | 1.953107 | (4.570696677965851, False) |
| 5 | 1.762089 | 1.196512 | (1.7620888337065468, False) |
| 6 | 4.893903 | 1.951734 | (4.893903273889418, False) |
| 7 | 2.275612 | 1.209923 | (2.275611652012633, False) |
| 8 | 0.059783 | 0.362678 | (0.05978250963685562, False) |
| 9 | 7.972792 | 2.5984 | (7.972792316714172, False) |


#### Model 2

##### (training)

In [None]:
# vec_env_ck = make_vec_env(
#     env_id='CustomHighwayEnv-v0',
#     n_envs=5, # Number of parallel environments run on different CPU cores
#     seed=0,
#     vec_env_cls=SubprocVecEnv,
#     env_kwargs={'config':continuous_kinematics_2, 'render_mode':None}
# )

In [None]:
# # Policy network architecture
# policy_kwargs = dict(
#     net_arch=[256, 256]  # network architecture
# )
# 
# # Create the PPO model
# model = PPO(
#     'MlpPolicy',
#     vec_env_ck,
#     n_steps=2048,  # Increased to gather more experience per update
#     batch_size=256,  # Increased batch size
#     learning_rate=3e-4,  # Adjusted learning rate for stability
#     ent_coef=0.01,  # Higher entropy coefficient to encourage exploration
#     clip_range=0.2,  # Clipping range for PPO
#     vf_coef=0.5,  # Value function coefficient
#     max_grad_norm=0.5,  # Gradient clipping
#     policy_kwargs=policy_kwargs,
#     verbose=2
# )
# 
# # Train the model
# model.learn(total_timesteps=1_300_000)  # Number of timesteps for training
# model.save('models/Continuous_Kinematics/ppo_2')

##### (evaluation)

In [11]:
ppo_2_ck = PPO.load('models/Continuous_Kinematics/ppo_2')

Exception: code expected at most 16 arguments, got 18


In [12]:
env_ck_1 = create_env(id='highway-v0', render_mode='human', config=continuous_kinematics_2)



In [15]:
results_ck_2 = evaluate_model(ppo_2_ck, env_ck_1, 'PPO_2')

In [16]:
results_ck_2

| episode | average_standardized_reward | episode_duration | returns_list |
|---|---|---|---|
| 0 | 55.107678 | 17.087796 | (55.10767805471098, False) |
| 1 | 21.729253 | 6.493152 | (21.72925278878278, False) |
| 2 | 119.274493 | 35.654491 | (119.2744932674568, True) |
| 3 | 27.504507 | 8.316161 | (27.504506771033224, False) |
| 4 | 11.418005 | 3.287526 | (11.41800485024134, False) |
| 5 | 31.78442 | 9.569394 | (31.784419970317092, False) |
| 6 | 15.463158 | 4.575972 | (15.463157795141505, False) |
| 7 | 119.074944 | 35.527438 | (119.07494448395634, True) |
| 8 | 9.549812 | 2.721412 | (9.549812300893958, False) |
| 9 | 22.261771 | 6.573251 | (22.261770658069622, False) |


## Solution 3 - DiscreteMeta_Grayscale
Environment Observation Type: **Grayscale** \
Agent Action Type: **DiscreteMetaAction** \
Algorithm Used: **DQN**

#### Configs

In [50]:
discrete_meta_grayscale_1 = config_generator(obs_action_type={
        'observation': {
            'type': 'GrayscaleObservation',
            'observation_shape': (128, 64),
            'stack_size': 4,
            'weights': [0.2989, 0.5870, 0.1140],
            'scaling': 1.75,
        },
        'action': {
            'type': 'DiscreteMetaAction',
        },
    })
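The `weights` entry in the grayscale configuration gives the per-channel factors used to turn an RGB frame into a single intensity channel. A minimal sketch of that weighted sum for one pixel is shown below. The stacked observation shape `(stack_size, width, height)` is my understanding of highway-env's layout and should be treated as an assumption; both helpers are hypothetical.

```python
# Channel weights from the configuration above (R, G, B).
WEIGHTS = (0.2989, 0.5870, 0.1140)


def to_gray(rgb, weights=WEIGHTS):
    """Weighted sum of the three colour channels of one pixel."""
    return sum(channel * weight for channel, weight in zip(rgb, weights))


def stacked_shape(stack_size, observation_shape):
    """Shape of the frame stack fed to the CNN (assumed layout)."""
    return (stack_size, *observation_shape)


print(to_gray((255, 0, 0)))           # pure red
print(to_gray((255, 255, 255)))       # white
print(stacked_shape(4, (128, 64)))
```

Stacking several frames gives the network access to motion, since a single grayscale image carries no velocity information.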

In [51]:
discrete_meta_grayscale_2 = deepcopy(discrete_meta_grayscale_1)
discrete_meta_grayscale_2['collision_reward'] = -5
discrete_meta_grayscale_2['reward_speed_range'] = [25, 35]

#### Model 1

##### (training)

In [52]:
# env_dmg = create_env(id='highway-v0',render_mode=None, config=discrete_meta_grayscale_2)

In [53]:
# model = DQN('CnnPolicy',
#             env_dmg,
#             learning_rate=5e-4,
#             buffer_size=15000,
#             learning_starts=200,
#             batch_size=32,
#             gamma=0.8,
#             train_freq=1,
#             gradient_steps=1,
#             target_update_interval=50,
#             exploration_fraction=0.7,
#             verbose=1,
#             tensorboard_log='highway_dqn/')
# 
# model.learn(total_timesteps=int(2e4))
# 
# model.save('models/DiscreteMeta_Grayscale/dqn_1')

##### (evaluation)

In [54]:
dqn_1_dmg = DQN.load('models/DiscreteMeta_Grayscale/dqn_1')

Exception: code expected at most 16 arguments, got 18


In [55]:
env_dmg_1 = create_env(id='highway-v0',render_mode='human', config=discrete_meta_grayscale_1)

In [56]:
results_dmg_1 = evaluate_model(dqn_1_dmg, env_dmg_1, 'DQN_1')

In [57]:
results_dmg_1

| episode | average_standardized_reward | episode_duration | returns_list |
|---|---|---|---|
| 0 | 7.555966 | 3.676856 | (7.555965514333092, False) |
| 1 | 23.741156 | 8.519618 | (23.741156460285943, False) |
| 2 | 2.76582 | 1.336143 | (2.7658195967702057, False) |
| 3 | 1.805349 | 0.986621 | (1.8053493042319608, False) |
| 4 | 103.998501 | 39.546077 | (103.99850105462943, True) |
| 5 | 0.879994 | 0.684051 | (0.8799940278395417, False) |
| 6 | 7.68353 | 3.063447 | (7.683530426172107, False) |
| 7 | 1.962696 | 0.974907 | (1.9626959530521901, False) |
| 8 | 13.284627 | 5.15558 | (13.284627096947291, False) |
| 9 | 1.807965 | 0.977983 | (1.8079650703103967, False) |


#### Model 2

##### (training)

In [58]:
# env_dmg = create_env(id='highway-v0',render_mode=None, config=discrete_meta_grayscale_1)

In [59]:
# model_2 = DQN(
#     'CnnPolicy',
#     env_dmg,
#     learning_rate=3e-4,             # Slightly reduced learning rate for stability
#     buffer_size=50000,              # Increased buffer size for more experience
#     learning_starts=1000,           # More initial exploration before learning starts
#     batch_size=32,                  # Keeping batch size the same
#     gamma=0.9,                      # Increased gamma to consider long-term rewards
#     train_freq=4,                   # Training less frequently
#     gradient_steps=4,               # More gradient steps to balance less frequent training
#     target_update_interval=100,     # Update target network less frequently
#     exploration_fraction=0.5,       # Reduced exploration fraction for earlier exploitation
#     verbose=1,
# )
# 
# model_2.learn(total_timesteps=int(2e5))
# 
# model_2.save('models/DiscreteMeta_Grayscale/dqn_2')
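The comment on `gamma` says a higher value weighs long-term rewards more. One common rule of thumb, sketched below, is the effective horizon `1 / (1 - gamma)`, together with the weight a reward receives `k` steps ahead. This is a general approximation, not something specific to this notebook or to Stable-Baselines3.

```python
def effective_horizon(gamma):
    """Approximate number of future steps that matter for a given discount factor."""
    return 1.0 / (1.0 - gamma)


def discount_weight(gamma, k):
    """Weight applied to a reward received k steps in the future."""
    return gamma ** k


for gamma in (0.8, 0.9):
    print(gamma, round(effective_horizon(gamma), 2), round(discount_weight(gamma, 10), 3))
```

Moving from 0.8 to 0.9 roughly doubles the horizon, from about 5 to about 10 decisions, so a collision ten steps ahead carries about three times more weight.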

##### (evaluation)

In [60]:
dqn_2_dmg = DQN.load('models/DiscreteMeta_Grayscale/dqn_2')

Exception: code expected at most 16 arguments, got 18


In [61]:
env_dmg_1 = create_env(id='highway-v0',render_mode='human', config=discrete_meta_grayscale_1)

In [62]:
results_dmg_4 = evaluate_model(dqn_2_dmg, env_dmg_1, 'DQN_2')

In [63]:
results_dmg_4

| episode | average_standardized_reward | episode_duration | returns_list |
|---|---|---|---|
| 0 | 14.738975 | 5.848241 | (14.738975149431306, False) |
| 1 | 7.277482 | 3.080122 | (7.277482409018, False) |
| 2 | 32.864879 | 12.902146 | (32.86487936740899, False) |
| 3 | 50.224053 | 17.038968 | (50.22405265471093, False) |
| 4 | 11.196617 | 4.329761 | (11.196617130545592, False) |
| 5 | 22.977891 | 8.762615 | (22.97789075172317, False) |
| 6 | 41.159169 | 15.265447 | (41.15916941319433, False) |
| 7 | 88.902291 | 38.787627 | (88.90229110834827, True) |
| 8 | 9.13459 | 3.793373 | (9.1345898172851, False) |
| 9 | 1.686163 | 0.958965 | (1.6861629424681035, False) |


## Solution 4 - Continuous_Grayscale
Environment Observation Type: **Grayscale** \
Agent Action Type: **Continuous Actions** \
Algorithm Used: **SAC + PPO**

#### Config

In [64]:
continuous_grayscale = config_generator(
    obs_action_type={
        'observation': {
           'type': 'GrayscaleObservation',
           'observation_shape': (128, 64),
           'stack_size': 8,
           'weights': [0.2989, 0.5870, 0.1140],
           'scaling': 1,
        },
        'action': {
            'type': 'ContinuousAction',
            'longitudinal': True,
            'lateral': True,
            'acceleration_range': [-1, 1],
            'steering_range': [-0.15, 0.15],
            'speed_range': [0, 40]
        }
    }, 
    reward_params={
        'collision_reward': -1, #
        'reward_speed_range': [0, 40], #
        # 'high_speed_reward': 2, # this is renamed for the sake of interpretability
        'speed_reward': 2, # high_speed_reward renamed
        'stationary_penalty': -1, # amount of reward fed into the agent if stationary for too many steps
        'lane_change_penalty': -4, #
        'offroad_penalty': -1, #
        'lane_centering_cost': 4, #
        'lane_centering_reward': 1, #
        'slower_than_others_penalty': False, # if True then reward is negative for speeds lower than 25m/s
        'simulation_frequency': 15,
        'offroad_terminal': True,
        'normalize_reward': True,
        'dynamical': False,
    },
    policy_frequency={'policy_frequency': 4}
)
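The `acceleration_range` and `steering_range` entries bound what the policy's outputs can command. The sketch below assumes, as I understand highway-env's `ContinuousAction` to work, that each action component is clipped to [-1, 1] and then mapped linearly onto its configured range. `to_controls` is a hypothetical helper, not the library's function.

```python
def lmap(value, source, target):
    """Linearly map value from the source interval to the target interval."""
    (x0, x1), (y0, y1) = source, target
    return y0 + (value - x0) * (y1 - y0) / (x1 - x0)


def to_controls(action, acceleration_range=(-1.0, 1.0), steering_range=(-0.15, 0.15)):
    """Convert a raw policy output (two numbers) into vehicle controls."""
    clipped = [min(1.0, max(-1.0, component)) for component in action]
    return {
        "acceleration": lmap(clipped[0], (-1.0, 1.0), acceleration_range),
        "steering": lmap(clipped[1], (-1.0, 1.0), steering_range),
    }


print(to_controls([0.5, -1.0]))
print(to_controls([2.0, 1.0]))  # out-of-range output is clipped first
```

A narrow steering range like the one configured here limits how sharply the agent can swerve, which plausibly trades some manoeuvrability for fewer off-road terminations.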

In [65]:
continuous_grayscale_pf8 = deepcopy(continuous_grayscale)
continuous_grayscale_pf8['policy_frequency'] = 8

#### Model SAC

###### (training)

In [66]:
# vec_env_cg = make_vec_env(env_id='CustomHighwayEnv-v0', n_envs=5, vec_env_cls=DummyVecEnv, env_kwargs={'config':continuous_grayscale})

In [67]:
# model = SAC(
#     'CnnPolicy',
#     vec_env_cg,
#     verbose=1,
#     buffer_size=15000,
#     batch_size=64,
#     learning_rate=0.0001,
#     gamma=0.92,
# )
# 
# model.learn(total_timesteps=1000000, log_interval=4)
# 
# model.save('models/Continuous_Grayscale/sac_2_continuous_grayscale')

###### (evaluation)

In [68]:
sac_cg = SAC.load('models/Continuous_Grayscale/sac.zip')

In [69]:
# training with lower policy_frequency and evaluating with higher seems to lead to better results in a continuous environment
env_cg_1 = create_env(id='highway-v0',render_mode='human', config=continuous_grayscale_pf8)
# env_cg_2 = create_env(id='CustomHighwayEnv-v0', render_mode='human', config=continuous_grayscale_pf8)

In [70]:
# (~>110, True) = Completed | (~<100, True) = Stayed behind | (x, False) = crashed or went offroad
results_cg_1_df = evaluate_model(sac_cg, env_cg_1, 'SAC')

In [71]:
results_cg_1_df

| episode | average_standardized_reward | episode_duration | returns_list |
|---|---|---|---|
| 0 | 91.086962 | 29.622779 | (91.08696207775449, True) |
| 1 | 20.27603 | 5.314952 | (20.27602987961856, False) |
| 2 | 39.537728 | 10.634967 | (39.53772834457063, False) |
| 3 | 53.359172 | 14.501132 | (53.35917241684028, False) |
| 4 | 19.387417 | 5.388121 | (19.387416637691064, False) |
| 5 | 116.842809 | 29.58382 | (116.84280911384374, True) |
| 6 | 51.647301 | 13.64262 | (51.64730135207125, False) |
| 7 | 65.053748 | 17.475284 | (65.0537479430315, False) |
| 8 | 30.823429 | 8.365913 | (30.82342941598586, False) |
| 9 | 27.225796 | 7.693783 | (27.225796392766455, False) |
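The comment before the evaluation call describes how to read each `(return, truncated)` pair. A minimal sketch of that reading is given below. The single cut-off of 110 is a hypothetical simplification of the approximate thresholds in the comment (above roughly 110 for completed, below roughly 100 for stayed behind), so values between 100 and 110 are ambiguous.

```python
def classify(normalized_return, truncated, completed_threshold=110.0):
    """Label an episode from its normalized return and truncation flag."""
    if not truncated:
        return "crashed or went offroad"
    if normalized_return > completed_threshold:
        return "completed"
    return "stayed behind"


# Three episodes taken from the SAC results above (rounded returns).
sac_examples = [(91.086962, True), (20.27603, False), (116.842809, True)]
labels = [classify(episode_return, truncated) for episode_return, truncated in sac_examples]
print(labels)
```

Applied to all ten episodes, this labelling gives a quick count of how often the agent survives the full episode versus how often it actually keeps up with traffic.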


In [72]:
# (~>220, True) = Completed | (~140, True) = Stayed behind | (x, False) = crashed or went offroad
# results_cg_2_df = evaluate_model(sac_cg, env_cg_2, 'SAC')

In [73]:
# results_cg_2_df